153. LLM Inference with Bedrock

AWS Bites, Episode 153: LLM Inference with Bedrock

Date: March 6, 2026
Hosts: Eoin Shanaghy and Luciano Mammino

Overview

This episode of AWS Bites dives deep into Large Language Model (LLM) inference on Amazon Bedrock, focusing on the practical realities of integrating LLMs into real-world, production-level AWS applications. Drawing from recent experiences building LLM-powered workflows, dashboards, analytics pipelines, and more, Eoin and Luciano share their hard-won lessons about reliability, cost, trust, and the intricate gotchas of running LLMs at scale. The conversation offers actionable insight on model selection, access management, billing, and common pitfalls, all geared for practitioners aiming to move beyond the demo stage.

What is an LLM & What Do We Mean by Inference?

[00:00-05:00]

Definition of LLM:
- LLMs are neural networks trained on massive text datasets, capable of generating context-aware text (e.g., GPT, Claude, Gemini, Llama, Xenova, Mistral, DeepSeek, GLM, Minimax).
- The model landscape is growing rapidly, offering more options for builders.
Training vs. Inference:
- Training is the data- and compute-intensive phase handled by model providers—"Training is like years in medical school; inference is the doctor seeing patients." (B, 03:10)
- Inference is what developers do: sending prompts and receiving completions.
Tokens:
- Models "think" in tokens (roughly 4 characters each). Different LLMs tokenize text differently, so token counts and costs vary across providers.
- Costs and quotas are measured in input (prompt) and output (completion) tokens.

Quote:

"You're paying for inference based on input tokens and output tokens. Cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens for the same text." (B, 04:13)

Real-World Use Cases and Design Patterns

[05:02-11:32]

Typical Applications Built:
- Smart data transformation pipelines (NL query to reproducible code).
- Natural language SQL/query generation for tools like Athena or Redshift.
- Automated dashboard creation (LLM auto-selects metrics and charts).
- Customer support chatbots with escalation paths.
- Document/OCR processing with semantic extraction.
Strengths & Weaknesses of LLMs:
- Great at translating “fuzzy” human requirements into concrete steps.
- Not good for deterministic or arithmetic-heavy tasks—LLMs are probabilistic and may hallucinate or err.
- Best used for problems with ambiguity; rely on classic code for precise logic.

Quote:

"LLMs are good to effectively convert…requirement in some kind of human language...try to use LLMs for everything that is a little bit fuzzy." (A, 10:56)

AI is More Than GenAI; What Are Agents?

[11:32-16:50]

Clarifying Terminology:
- AI encompasses more than LLMs: includes classification, regression, vision, speech, rec engines (e.g., SageMaker, Rekognition, Textract).
- GenAI = content generation (text, code, images, audio, video).
Agents & Agentic Workflows:
- Agents involve LLMs orchestrating reasoning, planning, and tool use in loops.
- Tools enable LLMs to trigger “real world” actions (API calls, DB access, code execution).
- Guardrails and minimal permissions are essential for safe use.
Frameworks & AWS Offerings:
- LangChain, AWS LangChain Strands, Vercel AI SDK, Bedrock Agents, etc.

Quote:

"Think of it as LLM plus tools plus the loop plus context management plus guardrails. Put all those things together and you have an agent." (B, 15:35)

Bedrock: What, Why, and Its Ecosystem

[16:50-23:26]

What Is Amazon Bedrock?
- A unified, fully managed AWS service to access foundation models via API.
- Allows use of models from various providers via the AWS ecosystem—single account, no bespoke API keys, AWS-native monitoring, IAM, VPC, CloudTrail, etc.
Why Use Bedrock?
- Security and compliance: Data stays within AWS account/regional boundaries, never used for model retraining.
- Governance: AWS-native controls for access and auditing.
- Model flexibility: Switch between models easily, via common interfaces.
- Extra features: Knowledge bases (RAG), agents, guardrails, PII filters.
Limitations:
- Not every model/provider is included. (e.g., Gemini not available; OpenAI’s newest not yet on Bedrock as of recording.)
- Expect updates as new partnerships/agreements are struck.

Quote:

"What Bedrock guarantees you is that data stays within your AWS account boundary...there is an agreement between AWS and the model provider that they will never use data that you send to the models to do additional training in the future." (A, 19:51)

Getting Started: Access Model and Setup

[23:26-30:51]

Access Has Changed:
- No more per-model “enable” pages; access is now managed largely via IAM and one-time use case agreements in commercial AWS regions.
- Watch for gotchas with models served via AWS Marketplace (extra IAM permissions needed).
Anthropic Models:
- Still require a one-time use-case application per account.
Model Inventory & Recommendations:
- Claude (Anthropic): excellent for reasoning, coding, long documents.
- Amazon’s Nova: budget-friendly for simpler tasks.
- Meta Llama: good general purpose, less frequently used now.
- Mistral: excels at coding & multilingual tasks.
- Gwen/GLM: up-and-coming for price-performance.
- Use Bedrock’s UI for multi-model response comparison.
Scaling and Quotas:
- Default quotas are low. For production scaling, prepare to request quota increases — and wait.
API Details:
- Use the Converse API for chat-like, standardized interactions and multimodal input.
- Supports real-time streaming output (token-by-token).
- SDKs split into control (management) and runtime (inference).
Cross-Region Inference:
- Bedrock can route inference requests to available regions for capacity, using profile IDs.
- Useful for maximizing throughput or matching compliance needs.

Quote:

"As you can imagine, everybody's experimenting with bedrock and with LLMs and as a result of that it might be more difficult than you expect to get the quotas you might need..." (B, 26:59)

Cost Management & Optimization

[30:51-35:13]

Pay-per-Token Pricing:
- Inputs (prompt) and Outputs (completion) charged separately; bigger, more advanced models cost more.
- No upfront commitment—pay only for what you use.
Cost-Saving Mechanisms:
- Batch Inference: For deferred processing with up to 50% discount—but increased latency.
- Service Tiers: Reserved capacity and longer commitments can drive down costs for predictable workloads.
- Prompt Caching: Reduces resending the same info, saving on input tokens.
Tip: "Start with a capable model to validate, then see if something cheaper can do the job reliably." (A, 32:41)

Common Gotchas & Troubleshooting

[35:13-39:24]

Throttling and Quotas:
- Quotas on requests/min and tokens/min, per region, per model, per account.
- Defaults are surprisingly tight (2–3 requests/min for some models).
- Plan for 429 errors; use exponential backoff and consider cross-region routing.
Quotas & max_tokens Parameter:
- Setting too high a max_tokens reserves more quota than you might realize (even if you don't use them).
- Some models (e.g., Claude) enforce heavier quota burns on outputs.
Marketplace Subscriptions & IAM:
- Serverless models via the Marketplace need extra IAM permissions.
- Marketplace quirks: “Access Denied due to invalid payment instrument” errors, especially with non-credit card payment methods (e.g. SEPA, certain regions).
- Solution: temporarily add/set a credit card as default; complete the subscription; revert if needed.
Model Access/Region Issues:
- Not all models are available in all regions.
- Anthropic models: enable via a management account to propagate access across org.
Invoke vs. Converse API:
- InvokeModel uses provider-specific payloads—harder to maintain.
- Converse is standardized—recommended for most use cases.

Working With Structured Outputs

[39:24–End]

Problem: LLMs generate unstructured text, but in production you usually want predictable, machine-readable formats (like JSON).
- LLMs might output Markdown fences, invalid JSON, extra/redundant fields, multiple objects, or “creative” mistakes.
Solution:
- Use structured output constraints, such as enforcing a strict JSON schema at the prompt or model level, minimizing surprises in downstream parsing.

Quote:

"Structured output is the solution to this...a way to constrain the model to follow a specific JSON schema. So you can literally instrument the model interaction to say...try to populate this exact JSON schema and give it to me as a JSON object. Don't generate any other text." (A, 40:33)

Key Takeaways

(Summary)

LLMs are becoming a core building block for modern apps, but don’t expect to “just plug them in”—production reliability, cost, and trust require careful setup and testing.
Bedrock is a solid default choice if you’re building on AWS, particularly for data privacy, compliance, model flexibility, and AWS-native integrations.
There are real limitations and gotchas (quotas, marketplace, structured outputs, region issues)—plan for them.
Experiment, iterate, and focus on simple, well-bounded use cases for greatest success.

Notable Quotes & Moments

On Model Proliferation:
"The number of these really capable models keeps increasing, and that might be great news for builders." (B, 02:26)
On LLM Limitations:
"They are probabilistic by nature...sometimes these LLMs can hallucinate, which is when they confidently state something that is not necessarily true..." (A, 10:48)
On Bedrock's Value Add:
"Probably what Bedrock is giving you is worth it and is worth the initial effort of learning Bedrock and learning all the tools that you get with Bedrock." (A, 18:26)
On Quotas:
"New accounts get shockingly low default quotas, like two or three requests per minute for some models. And that can be a real blocker." (B, 35:40)
On JSON Output Gotchas:
"Sometimes the JSON that you get is not perfectly compliant...or even worse, you might get fields that you didn’t define...the LLM is getting creative." (A, 39:57)

Suggested Segment Timestamps

Intro & Episode Theme: 00:00
LLMs, Inference, and Tokens: 02:06
Use Cases & Practical Limits: 05:02
Agents & Tooling: 11:32
Bedrock Introduction: 16:50
Bedrock Access Setup: 23:26
Cost & Optimization: 30:51
Production Gotchas: 35:13
Structured Outputs: 39:24
Final Takeaways: 41:43

Closing Note:
If you have stories, issues, or insights from using Bedrock, the hosts encourage sharing—community learning beats solo struggle.

AWS Bites, Episode 153: LLM Inference with Bedrock

Date: March 6, 2026
Hosts: Eoin Shanaghy and Luciano Mammino

Overview

What is an LLM & What Do We Mean by Inference?

[00:00-05:00]

Definition of LLM:
- LLMs are neural networks trained on massive text datasets, capable of generating context-aware text (e.g., GPT, Claude, Gemini, Llama, Xenova, Mistral, DeepSeek, GLM, Minimax).
- The model landscape is growing rapidly, offering more options for builders.
Training vs. Inference:
- Training is the data- and compute-intensive phase handled by model providers—"Training is like years in medical school; inference is the doctor seeing patients." (B, 03:10)
- Inference is what developers do: sending prompts and receiving completions.
Tokens:
- Models "think" in tokens (roughly 4 characters each). Different LLMs tokenize text differently, so token counts and costs vary across providers.
- Costs and quotas are measured in input (prompt) and output (completion) tokens.

Quote:

"You're paying for inference based on input tokens and output tokens. Cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens for the same text." (B, 04:13)

Real-World Use Cases and Design Patterns

[05:02-11:32]

Typical Applications Built:
- Smart data transformation pipelines (NL query to reproducible code).
- Natural language SQL/query generation for tools like Athena or Redshift.
- Automated dashboard creation (LLM auto-selects metrics and charts).
- Customer support chatbots with escalation paths.
- Document/OCR processing with semantic extraction.
Strengths & Weaknesses of LLMs:
- Great at translating “fuzzy” human requirements into concrete steps.
- Not good for deterministic or arithmetic-heavy tasks—LLMs are probabilistic and may hallucinate or err.
- Best used for problems with ambiguity; rely on classic code for precise logic.

Quote:

"LLMs are good to effectively convert…requirement in some kind of human language...try to use LLMs for everything that is a little bit fuzzy." (A, 10:56)

AI is More Than GenAI; What Are Agents?

[11:32-16:50]

Clarifying Terminology:
- AI encompasses more than LLMs: includes classification, regression, vision, speech, rec engines (e.g., SageMaker, Rekognition, Textract).
- GenAI = content generation (text, code, images, audio, video).
Agents & Agentic Workflows:
- Agents involve LLMs orchestrating reasoning, planning, and tool use in loops.
- Tools enable LLMs to trigger “real world” actions (API calls, DB access, code execution).
- Guardrails and minimal permissions are essential for safe use.
Frameworks & AWS Offerings:
- LangChain, AWS LangChain Strands, Vercel AI SDK, Bedrock Agents, etc.

Quote:

"Think of it as LLM plus tools plus the loop plus context management plus guardrails. Put all those things together and you have an agent." (B, 15:35)

Bedrock: What, Why, and Its Ecosystem

[16:50-23:26]

What Is Amazon Bedrock?
- A unified, fully managed AWS service to access foundation models via API.
- Allows use of models from various providers via the AWS ecosystem—single account, no bespoke API keys, AWS-native monitoring, IAM, VPC, CloudTrail, etc.
Why Use Bedrock?
- Security and compliance: Data stays within AWS account/regional boundaries, never used for model retraining.
- Governance: AWS-native controls for access and auditing.
- Model flexibility: Switch between models easily, via common interfaces.
- Extra features: Knowledge bases (RAG), agents, guardrails, PII filters.
Limitations:
- Not every model/provider is included. (e.g., Gemini not available; OpenAI’s newest not yet on Bedrock as of recording.)
- Expect updates as new partnerships/agreements are struck.

Quote:

"What Bedrock guarantees you is that data stays within your AWS account boundary...there is an agreement between AWS and the model provider that they will never use data that you send to the models to do additional training in the future." (A, 19:51)

Getting Started: Access Model and Setup

[23:26-30:51]

Access Has Changed:
- No more per-model “enable” pages; access is now managed largely via IAM and one-time use case agreements in commercial AWS regions.
- Watch for gotchas with models served via AWS Marketplace (extra IAM permissions needed).
Anthropic Models:
- Still require a one-time use-case application per account.
Model Inventory & Recommendations:
- Claude (Anthropic): excellent for reasoning, coding, long documents.
- Amazon’s Nova: budget-friendly for simpler tasks.
- Meta Llama: good general purpose, less frequently used now.
- Mistral: excels at coding & multilingual tasks.
- Gwen/GLM: up-and-coming for price-performance.
- Use Bedrock’s UI for multi-model response comparison.
Scaling and Quotas:
- Default quotas are low. For production scaling, prepare to request quota increases — and wait.
API Details:
- Use the Converse API for chat-like, standardized interactions and multimodal input.
- Supports real-time streaming output (token-by-token).
- SDKs split into control (management) and runtime (inference).
Cross-Region Inference:
- Bedrock can route inference requests to available regions for capacity, using profile IDs.
- Useful for maximizing throughput or matching compliance needs.

Quote:

"As you can imagine, everybody's experimenting with bedrock and with LLMs and as a result of that it might be more difficult than you expect to get the quotas you might need..." (B, 26:59)

Cost Management & Optimization

[30:51-35:13]

Pay-per-Token Pricing:
- Inputs (prompt) and Outputs (completion) charged separately; bigger, more advanced models cost more.
- No upfront commitment—pay only for what you use.
Cost-Saving Mechanisms:
- Batch Inference: For deferred processing with up to 50% discount—but increased latency.
- Service Tiers: Reserved capacity and longer commitments can drive down costs for predictable workloads.
- Prompt Caching: Reduces resending the same info, saving on input tokens.
Tip: "Start with a capable model to validate, then see if something cheaper can do the job reliably." (A, 32:41)

Common Gotchas & Troubleshooting

[35:13-39:24]

Throttling and Quotas:
- Quotas on requests/min and tokens/min, per region, per model, per account.
- Defaults are surprisingly tight (2–3 requests/min for some models).
- Plan for 429 errors; use exponential backoff and consider cross-region routing.
Quotas & max_tokens Parameter:
- Setting too high a max_tokens reserves more quota than you might realize (even if you don't use them).
- Some models (e.g., Claude) enforce heavier quota burns on outputs.
Marketplace Subscriptions & IAM:
- Serverless models via the Marketplace need extra IAM permissions.
- Marketplace quirks: “Access Denied due to invalid payment instrument” errors, especially with non-credit card payment methods (e.g. SEPA, certain regions).
- Solution: temporarily add/set a credit card as default; complete the subscription; revert if needed.
Model Access/Region Issues:
- Not all models are available in all regions.
- Anthropic models: enable via a management account to propagate access across org.
Invoke vs. Converse API:
- InvokeModel uses provider-specific payloads—harder to maintain.
- Converse is standardized—recommended for most use cases.

Working With Structured Outputs

[39:24–End]

Problem: LLMs generate unstructured text, but in production you usually want predictable, machine-readable formats (like JSON).
- LLMs might output Markdown fences, invalid JSON, extra/redundant fields, multiple objects, or “creative” mistakes.
Solution:
- Use structured output constraints, such as enforcing a strict JSON schema at the prompt or model level, minimizing surprises in downstream parsing.

Quote:

"Structured output is the solution to this...a way to constrain the model to follow a specific JSON schema. So you can literally instrument the model interaction to say...try to populate this exact JSON schema and give it to me as a JSON object. Don't generate any other text." (A, 40:33)

Key Takeaways

(Summary)

LLMs are becoming a core building block for modern apps, but don’t expect to “just plug them in”—production reliability, cost, and trust require careful setup and testing.
Bedrock is a solid default choice if you’re building on AWS, particularly for data privacy, compliance, model flexibility, and AWS-native integrations.
There are real limitations and gotchas (quotas, marketplace, structured outputs, region issues)—plan for them.
Experiment, iterate, and focus on simple, well-bounded use cases for greatest success.

Notable Quotes & Moments

On Model Proliferation:
"The number of these really capable models keeps increasing, and that might be great news for builders." (B, 02:26)
On LLM Limitations:
"They are probabilistic by nature...sometimes these LLMs can hallucinate, which is when they confidently state something that is not necessarily true..." (A, 10:48)
On Bedrock's Value Add:
"Probably what Bedrock is giving you is worth it and is worth the initial effort of learning Bedrock and learning all the tools that you get with Bedrock." (A, 18:26)
On Quotas:
"New accounts get shockingly low default quotas, like two or three requests per minute for some models. And that can be a real blocker." (B, 35:40)
On JSON Output Gotchas:
"Sometimes the JSON that you get is not perfectly compliant...or even worse, you might get fields that you didn’t define...the LLM is getting creative." (A, 39:57)

Suggested Segment Timestamps

Intro & Episode Theme: 00:00
LLMs, Inference, and Tokens: 02:06
Use Cases & Practical Limits: 05:02
Agents & Tooling: 11:32
Bedrock Introduction: 16:50
Bedrock Access Setup: 23:26
Cost & Optimization: 30:51
Production Gotchas: 35:13
Structured Outputs: 39:24
Final Takeaways: 41:43

Closing Note:
If you have stories, issues, or insights from using Bedrock, the hosts encourage sharing—community learning beats solo struggle.

wavePod

Summary

AWS Bites, Episode 153: LLM Inference with Bedrock

Overview

What is an LLM & What Do We Mean by Inference?

Real-World Use Cases and Design Patterns

AI is More Than GenAI; What Are Agents?

Bedrock: What, Why, and Its Ecosystem

Getting Started: Access Model and Setup

Cost Management & Optimization

Common Gotchas & Troubleshooting

Working With Structured Outputs

Key Takeaways

Notable Quotes & Moments

Suggested Segment Timestamps

Transcript

Summary

AWS Bites, Episode 153: LLM Inference with Bedrock

Overview

What is an LLM & What Do We Mean by Inference?

Real-World Use Cases and Design Patterns

AI is More Than GenAI; What Are Agents?

Bedrock: What, Why, and Its Ecosystem

Getting Started: Access Model and Setup

Cost Management & Optimization

Common Gotchas & Troubleshooting

Working With Structured Outputs

Key Takeaways

Notable Quotes & Moments

Suggested Segment Timestamps