Loading summary
A
If you're curious about building with LLMs, but you want to skip the hype and learn what it actually takes to get something working in the real world, this episode is for you. We have been building a lot of LLM powered applications lately, both for ourselves and with customers. And I'm talking about workflow automations, smart data pipelines, query generators, AI powered dashboards, that kind of stuff. And along the way we picked up a pretty good collection of battle scars, I'd say. So what we learn is that the art part is not getting a demo to work. The hard part is actually making it reliable, predictable, affordable in production. So that's what we are here to share with you today. And this trend isn't slowing down. We feel that almost every new project that comes through our door has some kind of AI component that needs to be baked into it. So it's not really should we use an LLM here anymore? It's more questions like which models do we pick? How do we call it? How do we make it trustworthy? How do we keep the bill under control? And if you're building on aws, you know that a lot of these questions lead to Amazon Bedrock. So today we are sharing what we learned about running LLM inference on Bedrock, what works well, what surprised us, and the gotchas that nobody warns you about until you find yourself debugging. Hopefully not at 11pm on a Friday. So today we'll see what is a quick definition of what an LLM is and what do we mean by inference, the kind of AI powered application that we have been building, why AI is a lot more than just Genai and what we mean with the word agents. And finally we'll start to talk a little bit more in detail about Bedrock, what it is for, and the different gotchas that we learned about. My name is Luciano and I'm joined by Owen for another episode of AWS Bytes. So maybe we can start this episode by giving a quick recap of what is an LLM and what do we mean by inference. I think that a lot of people might be familiar with some definition of those, but it's probably worth it giving our own view on this.
B
Let's start with LLM Large Language Model. We may know that this is a type of neural network trained on huge amounts of primarily text data. They learn statistical patterns in language and can generate remarkably coherent context aware text. And the landscape of available models is growing really, really fast. You might know OpenAI's GPT family, Anthropic Squad, Google's Gemini, Meta's, Llama, Amazon, Xenova. Then there's Mistral, Deep, Seqen, GLM and Minimax. The number of these really capable models keeps increasing, and that might be great news for builders. So then we talk about inference quite a lot. And that's when you use a trained model to generate some output. Now, training is really expensive. That's the long process of teaching the model. This is what model providers do. And it's not cheap. We're talking about millions or billions of dollars in compute data and research. There's a reason all these companies need very deep pockets or very persuasive pitch decks. Inference is what we do as users or developers generally. We send a prompt, which is input text, and the model generates a response. And it may be an imperfect analogy to understand the business of LLM providers. Training is like spending years in medical school racking up enormous student debt. Inference is the doctors seeing patients and hopefully making that investment back one consultation at a time. We're the ones booking the appointments. Now, the analogy breaks down a bit in a few interesting ways. In reality, training an LLM takes weeks or months, not a decade of studies. But you do need a lot of GPUs. And unlike a doctor who specializes in one field, an LLM is more like getting degrees in medicine, law, engineering, creative writing, and a dozen other fields all at once. So that's what makes them so versatile and why they're showing up in a lot of different applications. And then there's tokens. You might have heard a lot about tokens. We've covered it before. But LLMs don't think in words, they think in tokens. And a token is, you could say it's roughly about four characters, maybe three quarters of a word in English, something like that. That's just kind of a rule of thumb. And a detail that's easy to overlook is that different models tokenize text differently. So the same sentence might be 20 tokens in one model and 25 in another, depending on how each model's tokenizer splits the text. This can matter more than you think, because you pay for inference based on input tokens and output tokens. So when you're comparing pricing across models, cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens to represent the same text. But tokens are the fundamental unit of cost, rate limiting and context windows in the LLM world. So it's good to get comfortable with this concept because it comes up everywhere. So, Luciano, given that all this stuff is everywhere. Now what are we building?
A
Yes, I guess first of all it's worth clarifying because I think at this point almost everyone has used things like cursor, copilot, cloud code, codex, whatever, all these coding AI agents. And I think it's worth just saying that this is the technology that we are talking about. LLMs are effectively what's powering all these tools. And that's one type of use cases. So these are like coding agents that you use to, whenever you're writing some code to get assistance from this intelligence and knowledge that has been baked in into the model. But we see a lot of use cases where we are embedding LLMs directly into applications. Applications we build either for ourselves while we experiment and learn more about this technology. But we see an increase in demand from customers and we have been already building a few examples. And just to give you an idea, things we have been building are like smart data transformation pipelines. So you can imagine that there is an LLM component that helps the user to describe what they want in natural language. For example, I don't know, merge these two example data sets, normalize them in some kind of way, flag duplicates for example, and then the system using the LLM is able to convert that natural language requirement into reproducible deterministic code that then can be baked into a pipeline that can be reused over time. So effectively is almost like giving somebody that doesn't necessarily know how to code that pipeline themselves an easier door to basically be able to describe what they want with a process that somehow gives them a preview of the results, slowly converge to something that is actually doing what they want to achieve, and then save that into reproducible code that can be reused later on. So imagine almost like a notebook, but rather than writing code, you use LLMs to get to the final version of the code that you want to use. And this is just one example, Another one that comes up a lot is for data analytics. Being able to generate queries, for example, for Athena, is one we have done recently. But you can extend that idea to other databases like Redshift, Postgres, elasticsearch. And the idea is again, if you don't know the specific language that is required to query the data, you can use an LLM to convert some kind of human language of that query. For example, I want to know, I don't know the top 10 spenders in this e commerce give me the username. For example, an LLM should be able to do a good job in converting that to a specific query for a database and then you can execute that query and give the results to the user. So again, it's always about trying to lower the barrier of entry to use specific technology so you can use more natural language and have the LLM do all the hard work of converting that, that language into something that's more specific for, I don't know, converting data in some way, making queries. But there are other use cases. For example, we can automatically generate dashboards. So for example, based on some of these data pipelines that I mentioned before, another example we built is the system is capable of understanding the type of data using an LLM and picking up some of the metrics that are more relevant and creating dashboards with charts that make possibly the most sense. And again, it's not always perfect. I think there is a lot of DLM can generally do an average work and then you still want a human to maybe go in and refine. But we are seeing lots of use cases like this ones and other ones we have seen online from just looking at. Other examples are for example customer support automation, where you can have like a chatbot that helps the user to start asking questions and eventually route them maybe to an actual human agent that can help them to perform certain actions. Or sometimes it's the LLM itself that can do certain actions on behalf of the user. And other ones are document processing. So imagine like an OCR process, but much smarter than that because a classic OCR will just give you the plain text, while when you combine OCR with an LLM you can actually ask questions to documents and get out structured data from your documents. So yeah, I guess before we get carried away with all these examples, I think it's fair to say to remark at least that these LLMs are not magic and they are not necessarily the right tool for everything. In General, I think LLMs are good to effectively convert this kind of requirement in some kind of human language. So you can type something and try to describe what you want to achieve and and train the LLM to convert that requirement into specific actions that will make sense for the system you are building. So basically understanding text is one of the main superpowers of LLMs, but they are not good at everything that needs to be deterministic and precise because they are probabilistic by nature. So if you ask the same question twice, probably you will get a slightly different answer the second time. Sometimes these LLMs can hallucinate, which is when they confidently state something that is not necessarily true and they cannot necessarily do Arithmetics very well. So sometimes they will do mistakes there. Like the classic example is if you tell I don't know how many vocals there are in the word strawberry most of the time you might get a wrong response. This used to be one of the common jokes when LLMs came out. I think they are getting better, but the point is that there are lots of things that LLMs are good at and many things that they are bad at. So don't try to use them for everything, try to understand what is good about them, what is bad and then pick them only for the right use cases. And I think the key principle is try to use LLMs for everything that is a little bit fuzzy. Again, as a human that is trying to describe something and you want to have that understanding of the description and do something, but all the precise parts, I think you should still try to use code and more kind of regular automation to achieve those results in a more predictable way. And yeah, I think in general the last point I want to make here is that it's going to become a standard building block as many others that we have been using throughout the years. So it's always as any other building block to understand what are the patterns, what are the good things, what are the bad things, what are the common problems and hopefully today we are going to be able to cover some of that. Anything you want to add on this part, Owen?
B
Yeah, it's definitely, I think, recommended to experiment frequently and not assume, if you haven't already experimented frequently, that you just add a feature request and use LLMs in production for the first time and it'll be smooth sailing. I think we've seen statistics that the vast, vast majority of these projects are not making it to production right now for a whole host of reasons. Maybe people don't have the right data, they didn't have the right use case in mind, or the results just aren't effective enough to meet the use case that was envisaged like we find more success where the use case is very well defined and simple and it's a good idea I think in general to focus on areas where you're spending a lot of time that you might benefit from this level of automation. But simple things rather than trying to assume that AI is so intelligent you can throw the most complex problem you have at it, which usually ends in failure. We can maybe talk a bit about AI and gen and what do we mean by agents as well? AI has become shorthand for genai in popular conversation, but of course AI is much broader and Has a much longer history. Traditional machine learning, classification, regression, anomaly detection. And still AI, and still incredibly useful. Computer vision, speech recognition, recommendation engines, they're all AI, but not necessarily gen AI. And AWS has a whole ecosystem of services for the more traditional AI angle, like SageMaker, Rekognition, textract, et cetera. We won't cover any of those today, but still worth knowing about. It doesn't always have to be LLM based. So Genai is specifically about generating new content like text, images, code, audio and video. And when we say LLM inference in this episode, we're talking about specifically about Genai. Now, inference in practice is generally at the simplest level, text generation. You send it a prompt and then you get an answer, which we call a completion. But increasingly we are also talking about agents and agentic workflows. And these are really, I suppose, more sophisticated loops where the LLM can try and reason or simulate reasoning, plan and take actions. It's like orchestration of multiple steps of an LLM really. So what we mean when we talk about agents like an agent is like a smart loop. Rather than hoping that you get a good completion back, the LLM receives a task, decides what to do, uses tools, observes the results and iterates. And this is one of the main things that makes agentic LLMs so powerful. The actual tools that you can use. With tools, you can expand the LLMs capabilities far beyond just generating text. We know that generating text alone is subject to hallucinations and errors, but by combining LLMs with access to deterministic tools, it can actually become very powerful. So the LLM itself still just generates text. It could describe what tool to call and with what parameters, but then your code executes the actual tool and feeds back the results. And you can write tools that do virtually anything, like check the weather at a given location, look up a customer, record in a database, call a third party API, read files, run code, or just trigger complex workflows on behalf of a user. And this is what turns an LLM from a fancy autocomplete into something that can actually take actions in the real world. With this power comes a lot of responsibility and safety. Boundaries are required that prevent the agent from doing things it shouldn't. So serious guardrails are required to stop it from executing destructive actions, leaking personal information, or going off topic if you're doing it in an AWS world. Defining roles and very minimal permissions help a lot with this, as well as your network boundaries. So when you have an agent as well, there's also Consideration for memory and context, the agent will have to maintain state across steps, building up context as it works through a problem. So you can think of it as LLM plus tools plus the loop plus context management plus guardrails. Put all those things together and you have an agent. In practice, there are loads of frameworks that help you build these patterns, like LangChain strands from aws, the Vercel AI SDK, and there's plenty more in every language. Plus AWS has its own Bedrock Agents feature. We're not going to dive deep on the whole agentic side of things today, or on the AWS services specifically built to host and run agents at scale that probably deserves its own episode. But it is important to understand what we generally mean by agentic because it shapes how you think about inference. It's not just one prompt in one response. Out loops tools are called. You've got context building. And it all translates to more tokens, more latency and more things to think about when you're setting up your infrastructure. With all of this context in mind, let's talk about Bedrock, what it is and why it exists.
A
Yeah, exactly. So I think high level bedrock. We can think about it as the AWS managed service for accessing foundation models via API. So you can imagine it as a unified set of APIs for calling hundreds of models. And there are many providers like we mentioned already, some of them Amazon, Anthropic, Meta, Mistral, AI Deep, seq, OpenAI and more. And through this unified API you basically get a few interesting things. The first one is that you don't need separate accounts or API keys for each provider, because Bedrock is kind of your own central place and it runs within the AWS ecosystem, which means that you can also use IAM for authentication, you can use CloudWatch for monitoring, you can use VPC endpoints if you want to keep everything as private as possible and not have traffic going through the public Internet. You can use CloudTrail for auditing. So all the nice and convenient things you generally use when you build production ready systems on aws. Of course there is an alternative. You are not forced to use Bedrock. Like you could use the APIs of the different providers directory. OpenAI has its own API, anthropic as its own API. Pretty much every provider needs to give you access to the model when they offer the cloud version of the model through an API. So you can just go through them, create an account and call their API directly. And I think this is not too bad. It probably works fine if you're doing prototyping and small projects. It might actually be a little bit simpler than just getting started with Bedrock, which probably comes with a little bit of extra complexity, especially if you're not too familiar with aws. But I think then you need to know what you're missing out because I think you need to understand that if you want to go production ready, probably what Bedrock is giving you is worth it and is worth the initial effort of learning Bedrock and learning all the tools that you get with Bedrock. And just to give you a few examples, you will get security and compliance because basically this is probably one of the main selling points, especially if you're working in industries where it is important to to respect the privacy of the data of users. What bedrock guarantees you is that data stays within your AWS account boundary. You can pick specific regions where the inference runs. So if you have also legal requirements where you need to make sure that your data never leaves a specific region, like Europe for example, you can do that through Bedrock. Data is encrypted in transit and the rest effectively there is an agreement between AWS and the model provider that they will never use data that you send to the models to do additional training in the future. This is probably one of the biggest selling points for Bedrock. So effectively you can trust Bedrock a little bit more than just having to go through the agreements that you will get with each individual provider, which is probably going to be very different terms and conditions. So if you want to test for example both anthropic and OpenAI models, you probably need to go and read through the two different agreements and understand if they would work out for you. While with Bedrock you have a more unified experience. Once you understand the guarantees, you have a system that allows you to try different models. Then we already mentioned governance because you can use IAM and cloudtrail, you have again model flexibility. I already mentioned that where if you want to try to see which model works best for you once you are in Bedrock, it's relatively easy to switch between the models that are available there. And yeah, then there are a bunch of interesting Bedrock specific feature which I don't think we're going to be spending a lot of time on it on them today. But you can easily build knowledge bases sometimes called rag. We already mentioned agents. There are entire sub services within Bedrock and frameworks that allow you to make it easy to to build and run agents in production. You have the concept of guardrails, so being able to effectively limit some of the capabilities of the LLM. For example, if you want to make sure it doesn't go off on a path that you don't like. Maybe, I don't know, classic examples are like limiting the LLM interaction to for example, not be able to talk about politics or maybe not go outside a scope that maybe is the scope that is specific in the domain where you are implementing the LLM. Or maybe, I don't know, you can remove some pii. So there are ways to detect that PII is coming into the conversation with the LLM, so you could obfuscate some of that PI before it goes back to the user. So you have all these kind of additional features that I think are really important for when you are about to go to production and you want to make sure you are ready for it. Now there is one interesting caveat that I find a little bit disappointing sometimes because although I said that there is support for hundreds of models, not all the mainstream models are out there. For example, a good example is Gemini, which is a very capable model and it's not currently available in Bedrock. You can imagine this is due to competitive reasons because Gemini being from Google, of course it's sold through Google Cloud. So I don't think it's very easy for AWS and Google to agree on a way to make that work on AWS as well. I know that it's currently the same for OpenAI with some of like there are the GPT OS models available, but you don't get GPT. For example, 5.3 will be the kind of the bleeding edge model at the moment that's not currently available in Bedrock. I suspect that that might change because I'm hearing that there is a big round of investment coming into OpenAI where Amazon is taking part, so maybe that will change soon enough.
B
They announced as part of that that the GPT models would become available in Bedrock. So that is the plan. Yeah. It only cost $50 billion. That was the price exactly.
A
So yeah, right now just be aware that if you want to use Gemini currently, that's not going to be. That's not available. Probably it's not going to be available for a long time. While if you're interested in GPT models, they will probably become available very soon. But not just right now, the moment we are recording this. And again just want to remark that there is so much to talk about when it comes to Bedrock. Today we're going to focus just on trying to use the LLM programmatically part. Maybe we'll have future episodes if there is enough interest and if we get to learn Enough to make it work for us to create an entire episode dedicated to the other features. So with all of that introduction, how do we get started using Bedrock?
B
We talked about Bedrock maybe well over a year ago, I think, and since then there's a new Access model, so the first thing you need to do is understand this a little bit. You'll find a lot of outdated articles out there. Bedrock used to have a model access page where you had to manually enable each model in commercial regions. That old workflow is gone. Today access is mostly IAM plus one time agreements for some models, so you'll want to follow the current documentation rather than anything from 2023. Models are now available by default in commercial regions as long as your IAM identity has the right permissions like Bedrock Invoke Model. This brings Bedrock in line with how other AWS services work, which is nice. There are a couple of things that can trip you up, right? So some Bedrock serverless models are served from the AWS Marketplace. The first time your account uses one of those, Bedrock automatically tries to create a Marketplace subscription and you need IAM permissions for AWS Marketplace in case that's something that trips you up. Note that models from Amazon, Deepseek, Mistral, Meta Qin and OpenAI are not sold through the marketplace, so this only applies to certain providers. We'll talk more about this gotcha in a little while. Anthropic models specifically still require a one time use case submission. You just fill out a little form and you can invoke them. You can complete this through the Bedrock playground in the console or using the API. And if you use AWS organizations, if you complete it at the management account level via API, it extends the approval to all organization accounts. Once that bit is done, you can pick a model and you've got the Claude models as we mentioned, which are very popular, excellent for decentralized complex reasoning coding long documents is one of the ones we use the most, I think. And then you've got the Amazon Zone Nova ones which are the kind of the budget option, you know, good for price, performance balancing, especially for simpler tasks. And the Light and micro Nova ones are very cost effective. You have the Meta Llama ones, open weight models, good general purpose, maybe starting to show its age. I don't see a lot of use of them. And then Mistral is good for coding and multilingual tasks. Gwen and GLM are really starting to make an impact I think, and in our opinion there's lots of potential for those to be competitive in price. So that's just some examples and as you mentioned Luciano, you don't have all of the competitor models, but we can expect open API OpenAI's ones to become available at some point in the future, provided that agreement goes well. Good idea to start with a capable model like Claude Sonnet to validate your approach, then see if a cheaper, faster model can handle it. No point in prematurely optimizing. And the Bedrock web console offers a good UI that allows you to send messages to multiple LLMs at the same time so you can compare responses. Probably worth also mentioning that as you can imagine, everybody's experimenting with bedrock and with LLMs and as a result of that it might be more difficult than you expect to get the quotas you might need if you really start to run this at production and need the scale. So prepare to have to make a business case and plead for quotas that are beyond prototype POC scale. Once you have your model you can call the API. So we're talking about the Invoke model or the Converse API. For the more standard chat interface we generally recommend using the Converse API. It's more of a unified interface across models with a consistent format, a bit like OpenAI's chat completions. And it's not all about text as well. You can use images and documents in the Converse API in the same message format, so multimodal use cases work out of the box. And it supports streaming with Converse Stream for real time token by token output, which if you're doing chat is probably a must have. And you have the AWS SDK for doing this in your language of choice, Python, Boto 3, JavaScript, TypeScript, Java, etc. These SDKs are generally split into two parts like one for the control plane and one for the runtime. So if you look at the Boto 3 option, Bedrock runtime is probably the one you'll be using more often, and the Bedrock one is just for control plane stuff, management of Bedrock models, that kind of thing. Now a new thing as well since the last one we talked about. Bedrock is cross region inference. This lets AWS route your request to whichever region has availability and capacity. That's a pretty big deal because I think this is the first time there used to be like a adage that you could say if you wanted to do something in multi region in aws, you had to specifically configure it in each region and configure the synchronization. This is the first time where you've got pretty much seamless routing from one region to another and we can imagine that this is just down to the fact that GPU availability is scarce, so it makes sense to distribute it to whatever region has capacity. And the way you do that is by using a model ID. You've got a model ID which might be like something like anthropic claudsonnet 4 version 1, but across region. Inference profile ID is something you can use instead and it will have a routing prefix like US DOT or eu. Or it could be a global routing like global dot that'll give you maximum throughput but no geographic restriction. So it depends on your compliance, data retention, data residency requirements, new prefixes and profiles might be added over time, and at the SDK level they're pretty much interchangeable. I think if you look in our. I think it's in our Pod Whisperer where we use Bedrock. We talked about that in recent episodes that you could see in the commit history when we started using these inference profiles. Newer models like certain Claude and Llama versions only work through an inference profile. That's why we had to change it in ours and will return. A validation exception with on demand throughput isn't supported. I think we were a bit confused when we saw that for the first time. If you hit this error, just add the routing prefix to your model id and the IAM permissions are different as well, so you'll have to make sure you set that up. If you're thinking about observability and monitoring these, you'll want to know where your requests got routed to, so you might check CloudTrail rerouted requests include an inference region field and you will have to. You can set up a CloudWatch metric filter on this to monitor your routing patterns. So I guess by default, if you're not too concerned, just default to using inference profile IDs. But for everything there isn't really much of a downside. You just might want to think about the region you want to use and where you want your data to go. So given that you're up and running, you can do inference. Should we talk about the cost and see if we can make it clear how much it might cost?
A
Yes, it's actually not that difficult in terms of just the arithmetics of it, because we talked already about tokens. Tokens are generally classified in tokens in and tokens out, or input and output. Input is what the prompt you send to the LLM. Output is the completion that gets generated by the LLM. And interesting enough, those get different prices, like a price for input, sometimes millions of tokens, sometimes in the thousand, I think it was actually changed recently that now in the pricing pages by the thousand used to be by millions, which confused me when we were writing the notes for this episode. But yeah, it doesn't change. At the end of the day, the actual pricing is just a way of visualizing the cost unit for input and output. And each model is different, so make sure to check what is the cost for the specific model you want to use. Some models are more expensive than others. Generally like the bigger, more capable models are more expensive, but those are generally the ones that can be more reliable if you're doing complex tasks. So again, might be worth starting with the more advanced ones just to make sure you can refine the first implementation of what you want to try to achieve. Refine your prompt and everything. When you have something that works, you can try to see if cheaper models can also handle that task as reliably as the more expensive model. And that's just a strategy to reduce cost. The interesting thing is that there is no upfront commitment. So as many other AWS services, you just pay for what you use. Which is nice because if you have very occasional use cases, or maybe you don't know exactly how much you're going to be using an LLM power feature that gives you an opportunity to grow as you go. There are a couple of tricks that you can use to reduce cost if you have specific use cases. One of these is patch inference. I honestly haven't tried it yet, but my understanding is that basically you can defer the execution of a bunch of LLM requests, so you have a little bit of a higher latency in being able to get the response. But that comes with a 50% discount on the cost of input and output tokens. So if you don't have like a real time type of experience where a user is waiting for a response in line, maybe you're doing some kind of, I don't know, overnight batch processing and you need to do maybe analyze lots of documents. Whatever it is, probably you can use batch inference to bring the cost significantly down. And then there are service tiers, which also is not something I have really invested a lot of time into really experimenting with. But effectively there are different tiers of discounts and costs that you can use to try to bring your cost down. Or maybe you can have one or three months commitments with reserved and that will give you more guaranteed capacity for predictable workloads. So just make sure also to check the service tier in the pricing page to understand what that's all about. Because it could be important for your use case. Then there are a few other things that can be relevant here. For example, there is a concept of prompt caching which basically allows you to reduce the amount of tokens that get sent to the LLM. So that's another way that can save money on the input tokens cost. So it's almost like the way I understand this is almost like you are saying if I'm going to be running always the same prompt for a specific interaction, then you could create almost like a snapshot of that. So that's what gets cached. And then you are resuming from that session with maybe an additional piece of text. So you are not paying for all the initial input which gets cached and it comes with a discount. And that way you are effectively avoiding to resend that text over and over to different prompts. Yeah, I think if you go into Bedrock and you start to use all the other features, of course they come with their own pricing, but again today we are focusing more on the inference part. If you're interested in knowledge bases, understanding flows, agents, fine tuning all these different features of Bedrock, they will have their own pricing and different dimensions you need to consider. So go and check those out if this is something that interests you. So now I think we should try to quickly touch on some of the issues that might be tripping you up. What do you think, Owen?
B
Yeah, yeah, we touched on one already, which is throttling. There's quotas at two levels, requests per minute and tokens per minute. They're per model, per region, per account. And new accounts get shockingly low default quotas, like two or three requests per minute for some models. And that can be a real blocker. Even established accounts can have conservative defaults. They're not giving this stuff away, they're really rationing it out. So you might get 429 throttling exception errors. Maybe plan for it by adding exponential black back off with jitter. AWS SDK can do that using adaptive mode in its settings. Use cross region inference to spread load. Monitor with CloudWatch and apply increase requests early. Don't wait until you're ready to go to production. And the max tokens parameter in your requests affects throttling. This is an interesting nuance. Bedrock reserves some tokens based on your max token setting up front, even if the model generates far fewer. So setting that too high can burn quota faster than you expect. Because quota mathematics reserves based on what you asked for, not what you got. And some models like Claude, output tokens Count more heavily against your quotas with a burndown multiplier like 5x for Claude. Now, model access isn't as simple as it looks. This is another gotcha. Even though it's simplified, as we tried to say, you still need the right IAM permissions, Marketplace subscriptions, all of that stuff. Different models might be available in different regions, and some might only be available in US regions. Initially, the Marketplace we mentioned IAM permissions are required in order to get model access because of those serverless models going through the Marketplace. So you might hit that when you switch to a new model from a provider that you haven't used before, or to a new model you haven't used before. And your function or your service role doesn't have Marketplace permissions. If you need to resolve this and you don't need to give your Lambda Marketplace permissions permanently, instead just get somebody with the right permissions to invoke the model once to trigger the auto subscription. So you could do that. As we said during using the Bedrock Playground, the Marketplace subscriptions are per account, so you'll need to do this in each account as well. Now for the anthropic form we mentioned, which is separate from the Marketplace subscription. If you complete that at the management account, that's enough and you might have to wait 15 minutes then for the model to become available. There is a really weird error you can come across. Access denied Exception Model access is denied due to invalid payment instrument. And this happens because some Bedrock models are delivered through AWS Marketplace and the subscription process requires a valid payment method and it documents what payment method issues and GEO restrictions can cause this. You can typically hit this when your account has a payment method that Marketplace doesn't accept for subscriptions. We've seen this with European accounts using SEPA or SEPA Direct debit system, some India based AISPL accounts, and certain EMEA credit card configurations. Everything else in your account works fine because those services don't go through the Marketplace. But the moment you try to use the Marketplace, such as using a Bedrock model that requires it, it'll fail. And to fix it, you generally have to add a credit card as a payment method. Some users report that they need to temporarily set the credit card as the default payment method, then complete the subscription and then switch back and then after 15 minutes, fingers crossed, it works for you. Another point to mention is the converse versus Invoke model. Invoke model means you have to format the request body for each provider's structure. That's why we recommended using the Converse one Because it's standardized for all of them. Okay. Then we talk about structured outputs. I think this is where it gets really interesting. Luciano, how can we take this really cool topic and summarize it?
A
Yeah, we are in the process of publishing an entire article that goes deep dive into this. So I'm just going to briefly mention what we're talking about and then we'll defer you to the article if you want to deep dive. But basically one of the main problems that you face when you try to integrate an LLM into something programmatic is that the LLM generates text, but what you want is generally something more structured, like a JSON object that you can parse and then reuse into the rest of your code. But there are problems like you can tell the LLM to respond with a snippet in JSON and then the LLM might get a little bit creative sometimes. So sometimes it's just going to use markdown fences where you have, I don't know, backtick, backtick, JSON, then all the JSON inside and then backtick, backtick, backtick. And then you need to write code that can remove the backticks and all the markdown wrapping and just take the JSON and do a JSON parse. Sometimes even worse. Happens more rarely with the more capable models. But I've still seen it. Like, sometimes the JSON that you get is not perfectly compliant. Like you might get a trailing comma, or even worse, you might get fields that you didn't define initially just because DLLM is getting creative. So this is kind of a common problem. Oh yeah. There is another interesting use case where DLM actually gives you multiple JSON snippets. So it's kind of reasoning and saying this was my first attempt. Then I realized that this didn't apply. Oh, now there is another, more refined version of the JSON you need. So your parsing code might get more and more complex as you find out about all these different variation of text that the LLM can generate. So structured output is the solution to this. And it's basically a way to constrain the model to follow a specific JSON schema. So you can literally instrument the model interaction to say when you respond, you cannot derail from this schema. So try to populate this exact JSON schema and give it to me as a JSON object. Don't generate any other text. So that basically gives you a much more reliable way to get answers that then you can use in your code reliably and avoid all the retries or random failures that might trip you up again. There are lots of details on how you can define the schema how actually works in Bedrock and effectively you need to learn exactly how to define good schemas so so that you get the best results. We'll have a bunch of tips in our upcoming article, so watch out the episode notes down here because we'll put the link there once it's available. So with that I think we get to the end of this episode. I think today we learned quite a lot about LLMs. What inference is why you should be considering Bedrock. And in general I want to summarize that our take is that if you are building anything new with LLMs, I think bedrock is really a solid default choice for production inference, especially because you get all the guarantees from region availability, more legal concerns in terms of data privacy, ability to make sure that better your data is not going to be used for training, which is generally a common issue. Plus all the other additional services that come with Bedrock that probably you might want to start using as you get more and more familiar with LLMs and Bedrock itself. Now our usual call to action is if you use Bedrock, do you like it? What you didn't like? Maybe you found other random issues that we haven't encountered yet. So please share them with us because that's how we learn just by keep sharing and talking with the rest of the community. So we always love that you'll find our connection details in the links as always. So feel free to reach out on socials. One last word. Thank you to Forethereum for powering yet another episode of AWS Bytes. If you want help building AI powered applications on AWS that are reliable, cost effective and production ready, make sure to check out forthereum.com and reach out to us. So thank you very much and we'll see you in the next episode.
Date: March 6, 2026
Hosts: Eoin Shanaghy and Luciano Mammino
This episode of AWS Bites dives deep into Large Language Model (LLM) inference on Amazon Bedrock, focusing on the practical realities of integrating LLMs into real-world, production-level AWS applications. Drawing from recent experiences building LLM-powered workflows, dashboards, analytics pipelines, and more, Eoin and Luciano share their hard-won lessons about reliability, cost, trust, and the intricate gotchas of running LLMs at scale. The conversation offers actionable insight on model selection, access management, billing, and common pitfalls, all geared for practitioners aiming to move beyond the demo stage.
[00:00-05:00]
Definition of LLM:
Training vs. Inference:
Tokens:
Quote:
"You're paying for inference based on input tokens and output tokens. Cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens for the same text." (B, 04:13)
[05:02-11:32]
Typical Applications Built:
Strengths & Weaknesses of LLMs:
Quote:
"LLMs are good to effectively convert…requirement in some kind of human language...try to use LLMs for everything that is a little bit fuzzy." (A, 10:56)
[11:32-16:50]
Clarifying Terminology:
Agents & Agentic Workflows:
Frameworks & AWS Offerings:
Quote:
"Think of it as LLM plus tools plus the loop plus context management plus guardrails. Put all those things together and you have an agent." (B, 15:35)
[16:50-23:26]
What Is Amazon Bedrock?
Why Use Bedrock?
Limitations:
Quote:
"What Bedrock guarantees you is that data stays within your AWS account boundary...there is an agreement between AWS and the model provider that they will never use data that you send to the models to do additional training in the future." (A, 19:51)
[23:26-30:51]
Access Has Changed:
Anthropic Models:
Model Inventory & Recommendations:
Scaling and Quotas:
API Details:
Cross-Region Inference:
Quote:
"As you can imagine, everybody's experimenting with bedrock and with LLMs and as a result of that it might be more difficult than you expect to get the quotas you might need..." (B, 26:59)
[30:51-35:13]
Pay-per-Token Pricing:
Cost-Saving Mechanisms:
Tip: "Start with a capable model to validate, then see if something cheaper can do the job reliably." (A, 32:41)
[35:13-39:24]
Throttling and Quotas:
Quotas & max_tokens Parameter:
max_tokens reserves more quota than you might realize (even if you don't use them).Marketplace Subscriptions & IAM:
Model Access/Region Issues:
Invoke vs. Converse API:
InvokeModel uses provider-specific payloads—harder to maintain.Converse is standardized—recommended for most use cases.[39:24–End]
Problem: LLMs generate unstructured text, but in production you usually want predictable, machine-readable formats (like JSON).
Solution:
Quote:
"Structured output is the solution to this...a way to constrain the model to follow a specific JSON schema. So you can literally instrument the model interaction to say...try to populate this exact JSON schema and give it to me as a JSON object. Don't generate any other text." (A, 40:33)
(Summary)
On Model Proliferation:
"The number of these really capable models keeps increasing, and that might be great news for builders." (B, 02:26)
On LLM Limitations:
"They are probabilistic by nature...sometimes these LLMs can hallucinate, which is when they confidently state something that is not necessarily true..." (A, 10:48)
On Bedrock's Value Add:
"Probably what Bedrock is giving you is worth it and is worth the initial effort of learning Bedrock and learning all the tools that you get with Bedrock." (A, 18:26)
On Quotas:
"New accounts get shockingly low default quotas, like two or three requests per minute for some models. And that can be a real blocker." (B, 35:40)
On JSON Output Gotchas:
"Sometimes the JSON that you get is not perfectly compliant...or even worse, you might get fields that you didn’t define...the LLM is getting creative." (A, 39:57)
Closing Note:
If you have stories, issues, or insights from using Bedrock, the hosts encourage sharing—community learning beats solo struggle.