Transcript
A (0:00)
If you're curious about building with LLMs, but you want to skip the hype and learn what it actually takes to get something working in the real world, this episode is for you. We have been building a lot of LLM-powered applications lately, both for ourselves and with customers. And I'm talking about workflow automations, smart data pipelines, query generators, AI-powered dashboards, that kind of stuff. And along the way we picked up a pretty good collection of battle scars, I'd say. So what we learned is that the hard part is not getting a demo to work. The hard part is actually making it reliable, predictable, and affordable in production. So that's what we are here to share with you today. And this trend isn't slowing down. We feel that almost every new project that comes through our door has some kind of AI component that needs to be baked into it. So it's not really "should we use an LLM here?" anymore. It's more questions like: which model do we pick? How do we call it? How do we make it trustworthy? How do we keep the bill under control? And if you're building on AWS, you know that a lot of these questions lead to Amazon Bedrock. So today we are sharing what we learned about running LLM inference on Bedrock: what works well, what surprised us, and the gotchas that nobody warns you about until you find yourself debugging, hopefully not at 11pm on a Friday. So today we'll start with a quick definition of what an LLM is and what we mean by inference, then the kind of AI-powered applications that we have been building, why AI is a lot more than just GenAI, and what we mean by the word agents. And finally we'll talk in a bit more detail about Bedrock, what it is for, and the different gotchas that we learned about. My name is Luciano and I'm joined by Eoin for another episode of AWS Bites. So maybe we can start this episode by giving a quick recap of what an LLM is and what we mean by inference. I think a lot of people might be familiar with some definition of those, but it's probably worth giving our own view on this.
B (2:06)
Let's start with LLM: Large Language Model. You may know that this is a type of neural network trained on huge amounts of primarily text data. They learn statistical patterns in language and can generate remarkably coherent, context-aware text. And the landscape of available models is growing really, really fast. You might know OpenAI's GPT family, Anthropic's Claude, Google's Gemini, Meta's Llama, Amazon's Nova. Then there's Mistral, DeepSeek, Qwen, GLM, and MiniMax. The number of these really capable models keeps increasing, and that's great news for builders. So then we talk about inference quite a lot, and that's when you use a trained model to generate some output. Now, training is really expensive. That's the long process of teaching the model. This is what model providers do, and it's not cheap. We're talking about millions or billions of dollars in compute, data, and research. There's a reason all these companies need very deep pockets or very persuasive pitch decks. Inference is what we do as users or developers, generally. We send a prompt, which is input text, and the model generates a response. And here's a maybe imperfect analogy to understand the business of LLM providers: training is like spending years in medical school racking up enormous student debt; inference is the doctor seeing patients and hopefully making that investment back one consultation at a time. We're the ones booking the appointments. Now, the analogy breaks down in a few interesting ways. In reality, training an LLM takes weeks or months, not a decade of studies. But you do need a lot of GPUs. And unlike a doctor who specializes in one field, an LLM is more like someone getting degrees in medicine, law, engineering, creative writing, and a dozen other fields all at once. So that's what makes them so versatile and why they're showing up in a lot of different applications. And then there are tokens. You might have heard a lot about tokens; we've covered them before. But LLMs don't think in words, they think in tokens. And a token is roughly four characters, maybe three quarters of a word in English, something like that. That's just a rule of thumb. And a detail that's easy to overlook is that different models tokenize text differently. So the same sentence might be 20 tokens in one model and 25 in another, depending on how each model's tokenizer splits the text. This can matter more than you think, because you pay for inference based on input tokens and output tokens. So when you're comparing pricing across models, cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens to represent the same text. But tokens are the fundamental unit of cost, rate limiting, and context windows in the LLM world. So it's good to get comfortable with this concept because it comes up everywhere. So, Luciano, given that all this stuff is everywhere now, what are we building?
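To make that token arithmetic concrete, here is a minimal Python sketch of the "cheaper per token isn't always cheaper per job" point. The per-token prices and token counts are made up for illustration (not tied to any particular Bedrock model), and OpenAI's tiktoken library is used only to show that two different encodings can split the same sentence into different numbers of tokens:

```python
# Minimal sketch: token counts differ by tokenizer, and total cost depends on
# both the per-token price and how many tokens a model needs for the same job.
# Prices and token counts below are hypothetical, for illustration only.
import tiktoken

sentence = "Large language models don't think in words, they think in tokens."

# Two real tiktoken encodings that split text differently.
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(sentence))} tokens")

def job_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request: input tokens at the input rate plus output tokens at the output rate."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# Model B is cheaper per token, but its tokenizer needs more tokens for the same prompt/response.
model_a = job_cost(input_tokens=800, output_tokens=400, price_in_per_1k=0.003, price_out_per_1k=0.015)
model_b = job_cost(input_tokens=1000, output_tokens=500, price_in_per_1k=0.0025, price_out_per_1k=0.0125)
print(f"Model A: ${model_a:.4f} per request, Model B: ${model_b:.4f} per request")
```

With these example numbers, Model A works out to about $0.0084 per request and Model B to about $0.0088, even though Model B's per-token prices are lower, which is exactly the comparison trap described above.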
