Summary

AWS Podcast Episode #754: Accelerating Healthcare Decisions with Agents

Release Date: April 7, 2026
Host: Gillian Ford
Guests: Gigi Yuen (Chief Data and AI Officer, Cohere Health), Kenji Fujita (Staff AI Platform Engineer, Cohere Health)

Episode Overview

This episode dives deep into the practical aspects of deploying agentic AI systems in healthcare, with experienced leaders Gigi Yuen and Kenji Fujita of Cohere Health. The discussion traverses the high stakes of clinical applications, balancing automation with human oversight, integrating domain-specific data, ensuring security and compliance, and harnessing AWS’s Bedrock and Agent Core to drive rapid, safe innovation. While the stories are healthcare-focused, the insights are broadly applicable across industries grappling with AI adoption in regulated, complex environments.

Key Discussion Points & Insights

1. The Healthcare Problem Space & Cohere Health’s Mission

Defining the Challenge: U.S. healthcare spends an estimated 20–30% of its budget on administrative tasks, half of which are considered low value—amounting to roughly half a trillion dollars wasted annually.
- “We want to eliminate the waste. When you do that…patients can get the right care faster, providers can actually focus on what they do best…that’s the problem space Cohere Health is set up to solve.”
  – Gigi Yuen, [01:30]

2. Building Trustworthy AI in Healthcare

Trust and Transparency: AI systems in healthcare must prioritize reliability, non-hallucination, and strict adherence to clinical standards.
- “We have to get it right the first time...there’s no tolerance for hallucination, none.”
  – Gigi Yuen, [03:33]
Human Experts from Day One: Clinicians participate in every development project from inception, ensuring evaluation is grounded in frontline expertise.
- “We must have the experts in the room on the get go…not just at the end for validation.”
  – Gigi Yuen, [05:54]
Evaluation-Driven Development: Shift from test-driven to evaluation-driven development, with upfront agreement on metrics and continuous monitoring.
- “For agentic solutions, we really have to move to the mindset of evaluation-driven development…eval comes first.”
  – Gigi Yuen, [05:54]
Persona-driven Design: Tailoring agent systems by assessing rollout strategies across different user personas for greater utility and adoption.
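
One way to picture "eval comes first" is a release gate whose metrics and thresholds are agreed with clinicians before any agent is built. The sketch below is illustrative only: `EvalCase`, the metric names, and the threshold values are assumptions made for this example, not Cohere Health's actual framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One expert-labeled example (hypothetical structure)."""
    case_id: str
    expected: str             # label agreed with clinicians up front
    predicted: str            # output of the agent under test
    unsupported_claim: bool   # reviewer flagged a claim not grounded in the record


# Thresholds are fixed BEFORE development starts (illustrative values).
THRESHOLDS = {"min_accuracy": 0.95, "max_hallucination_rate": 0.0}


def evaluate(cases: list[EvalCase]) -> dict[str, float]:
    """Compute the pre-agreed metrics over a labeled evaluation set."""
    if not cases:
        raise ValueError("evaluation set must not be empty")
    n = len(cases)
    accuracy = sum(c.predicted == c.expected for c in cases) / n
    hallucination_rate = sum(c.unsupported_claim for c in cases) / n
    return {"accuracy": accuracy, "hallucination_rate": hallucination_rate}


def passes_gate(metrics: dict[str, float]) -> bool:
    """Release only if every pre-agreed threshold is met."""
    return (
        metrics["accuracy"] >= THRESHOLDS["min_accuracy"]
        and metrics["hallucination_rate"] <= THRESHOLDS["max_hallucination_rate"]
    )
```

The point of the design is ordering: the labeled cases and thresholds exist first, and each agent iteration is judged against them rather than against ad hoc spot checks.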

3. Domain-Specific Data: Motivation, Use, and Rights

Why Include Domain Data? Not just for accuracy, but for context, control, and operational alignment with care standards.
- “Is it because I want the solution to run faster, be more accurate, be more contextual? It’s not a matter of that. You want more code control…understanding the motivation…helps us pick the right data.”
  – Gigi Yuen, [08:58]
Data Rights & Compliance: Clarify how data can be used (operations, training, education), as nuances are critical for regulated industries.
- “Put a plug in for the notion of data use rights...it's really, really critical. And it's quite nuanced.”
  – Gigi Yuen, [08:58]
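
The distinction between using data for operations, for training, and for education can be enforced in code as a purpose check that runs before any dataset is touched. This is a minimal sketch under stated assumptions: the dataset names, the `Purpose` values, and the `PERMITTED_USES` registry are all hypothetical.

```python
from enum import Enum


class Purpose(Enum):
    OPERATIONS = "operations"
    MODEL_TRAINING = "model_training"
    EDUCATION = "education"


# Hypothetical registry: which purposes each dataset's agreement permits.
PERMITTED_USES: dict[str, frozenset[Purpose]] = {
    "case_notes_payer_a": frozenset({Purpose.OPERATIONS}),
    "public_guidelines": frozenset(
        {Purpose.OPERATIONS, Purpose.MODEL_TRAINING, Purpose.EDUCATION}
    ),
}


def check_use(dataset: str, purpose: Purpose) -> None:
    """Raise before any processing if the dataset is not cleared for this purpose."""
    # Unknown datasets are treated as cleared for nothing.
    allowed = PERMITTED_USES.get(dataset, frozenset())
    if purpose not in allowed:
        raise PermissionError(f"{dataset!r} is not cleared for {purpose.value}")
```

Failing closed on unknown datasets means an engineer finds out about a rights problem at the first call, not after a model has already been trained on the data.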

4. Human Oversight vs. Automation: Decision Framework

Risk-Based Approach: Automate low-risk decisions; always maintain human oversight for high-risk or negative determinations.
- “We always draw a clean line on when we will not automate is when we have to say no to a patient’s care.”
  – Gigi Yuen, [11:33]
Real-World Example: For routine imaging, more automation is acceptable; for surgery, the bar for approving without human review is much higher.
- “For a knee surgery decision, we better be absolutely confident before we automatically…say yes to the case without a human review.”
  – Gigi Yuen, [12:56]
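
The framework described above can be sketched as a routing function with no automated "deny" outcome. The field names and the two-level risk label are assumptions for illustration. The sketch also simplifies the episode's position by sending every high-risk request to a clinician, whereas the guests describe requiring very high confidence before auto-approving such cases.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"
    # Deliberately no AUTO_DENY member: a "no" always comes from a human.


@dataclass(frozen=True)
class Request:
    """Hypothetical prior-authorization request after the agent has assessed it."""
    procedure_risk: str       # "low" (e.g. diagnostic imaging) or "high" (e.g. surgery)
    meets_criteria: bool      # agent believes clinical criteria are satisfied
    contraindications: bool   # any contraindication found in the record
    covered_by_policy: bool   # covered, so no financial risk


def route(req: Request) -> Route:
    """Decide who makes the final call on a request."""
    # Anything that is not a clear "yes" goes to a clinician.
    if not req.meets_criteria or req.contraindications or not req.covered_by_policy:
        return Route.HUMAN_REVIEW
    # Simplification: high-risk procedures are always reviewed by a human.
    if req.procedure_risk != "low":
        return Route.HUMAN_REVIEW
    return Route.AUTO_APPROVE
```

For example, `route(Request("low", True, False, True))` returns `Route.AUTO_APPROVE`, while the same request with `procedure_risk="high"` is sent to `Route.HUMAN_REVIEW`.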

5. Technical Choices: Why AWS Bedrock & Agent Core

Rapid Innovation & Compliance: Chose Agent Core for quick development, robust tenancy controls, intuitive memory management, and compliance with healthcare regulations out-of-the-box.
- “The key thing that stood out right away was the speed of innovation here, the ability to build these agents quickly, effectively and safely.”
  – Kenji Fujita, [14:37]
Developer Experience: Workshops, documentation, and ease of setup significantly accelerated progress.
- “The workshops…with Jupyter notebooks…had everything that we needed right out of the box.”
  – Kenji Fujita, [16:34]
Operational Impact: Transitioned from custom memory servers and auth proxies to built-in, configurable capabilities, vastly improving velocity.
- “With a couple lines of code…you can just pass through the same authentication method that we would typically set up on our own with the configuration enabled by Agent Core Gateway.”
  – Kenji Fujita, [19:01]
Scalability: Shift from building one or two agents per quarter to scaling out multi-agentic systems.
- “Looking ahead at Q1 and the rest of 2026, it’s full of agents, it’s full of agentic systems, multi agents…a lot of our focus has been on building out the evaluation framework this quarter so that we can scale.”
  – Kenji Fujita, [20:04]
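
The tenancy separation mentioned for agent memory can be pictured as a store in which every read and write is keyed by a tenant-scoped namespace. The toy class below is not the Agent Core Memory API; `TenantScopedMemory` and its methods are hypothetical names used only to show the idea.

```python
from collections import defaultdict


class TenantScopedMemory:
    """Toy in-process store: every read and write is keyed by tenant and session."""

    def __init__(self) -> None:
        self._events: dict[str, list[str]] = defaultdict(list)

    @staticmethod
    def _namespace(tenant_id: str, session_id: str) -> str:
        """Build the storage key; reject values that could collide across tenants."""
        if not tenant_id or not session_id:
            raise ValueError("tenant_id and session_id are required")
        if "/" in tenant_id or "/" in session_id:
            raise ValueError("identifiers must not contain '/'")
        return f"{tenant_id}/{session_id}"

    def append(self, tenant_id: str, session_id: str, message: str) -> None:
        """Record a message in this tenant's session history."""
        self._events[self._namespace(tenant_id, session_id)].append(message)

    def history(self, tenant_id: str, session_id: str) -> list[str]:
        """Return a copy of the history visible to this tenant and session only."""
        return list(self._events.get(self._namespace(tenant_id, session_id), []))
```

Because the namespace is derived inside the store rather than passed in by callers, an agent cannot read another tenant's history by constructing a key by hand.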

6. Model Selection & Evaluation

Metrics-Led Choice: Set clear business metrics (accuracy, cost, latency, reliability) and use a leaderboard to compare models in real-time.
- “It’s always the question, build versus buy…setting up a leaderboard so we have ability behind the scene...to keep monitoring and tracking which models are winning.”
  – Gigi Yuen, [21:04]
Architecture for Flexibility: Invest in platform and gateway layers to allow composability and swapping models as needed per use case.
- “It’s not always one size fits all…your architecture must allow composability…otherwise adoption cost becomes unbearable.”
  – Gigi Yuen, [24:45]
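
A leaderboard of this kind can be as simple as filtering candidates on hard constraints and ranking the rest on the agreed metrics. The sketch below assumes hypothetical field names, thresholds, and a tie-breaking order; the episode does not describe how Cohere Health weights accuracy, cost, latency, and reliability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Offline evaluation result for one candidate model (hypothetical fields)."""
    name: str
    accuracy: float        # fraction correct on the expert-labeled eval set
    cost_per_case: float   # dollars per case
    p95_latency_s: float   # seconds
    error_rate: float      # fraction of failed or timed-out calls (reliability)


def leaderboard(results, min_accuracy=0.9, max_latency_s=30.0):
    """Drop candidates that miss hard constraints, then rank the rest."""
    eligible = [
        r for r in results
        if r.accuracy >= min_accuracy and r.p95_latency_s <= max_latency_s
    ]
    # Rank by accuracy (higher first), then error rate and cost (lower first).
    return sorted(eligible, key=lambda r: (-r.accuracy, r.error_rate, r.cost_per_case))
```

Re-running this ranking whenever a new model or fine-tune is evaluated keeps the build-versus-buy question tied to current numbers instead of to the last debate.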

7. Measured Impact & ROI

Automation Rates: Achieved 85% automation in prior authorization decisions, and 30–40% increase in productivity for human-reviewed cases.
- “85% of decisions are made within minutes…for those 15%...we’ve seen about 30 to 40% improvement in productivity.”
  – Gigi Yuen, [26:29]
User Satisfaction: Clinical staff report greater job satisfaction—AI handles information retrieval; humans focus on expertise.
- “…better retention…they can really just focus on the area expertise.”
  – Gigi Yuen, [26:29]

8. Practical Advice for AI Agent Development

Three Essentials for Scaling from POC to Production:
1. Evaluation-Driven Development: Build rigorous, relevant evaluation frameworks with expert labeling tied to business metrics. [28:58]
2. Clarity on Success: Align stakeholders on what “success” looks like at scale, considering people, process, and integration—not just functional AI metrics.
3. Data Readiness: Ensure data ingestion and integration challenges are addressed early.
- “We must be very, very clear on what are the important criteria for it to be successful…if you don’t change your processes but you get a new tech, you’re not going to see the benefits.”
  – Gigi Yuen, [28:58]
Team Composition: Success depends on a mix of tech-savvy domain experts, core platform developers, and data scientists working tightly together. [34:04]

9. Centralized vs. Federated Agent Teams

Unresolved Experiment: Whether a centralized or distributed agent development model works best is still an open question for Cohere Health.
- “…is it better to have a small team that focus on agent development in a larger org or…to plug the agent development into every single development team?...The jury is still out.”
  – Gigi Yuen, [32:28]

Notable Quotes & Memorable Moments

“We have to get it right the first time…There’s no tolerance for hallucination, none.”
– Gigi Yuen, [03:33]
“For agentic solutions, we really have to move to the mindset of evaluation-driven development…eval comes first.”
– Gigi Yuen, [05:54]
“The key thing that stood out right away was… the ability to build these agents quickly, effectively and safely.”
– Kenji Fujita, [14:37]
“We always draw a clean line on when we will not automate is when we have to say no to a patient’s care.”
– Gigi Yuen, [11:33]
“Don’t let FOMO get in the way...really always start with the why. You’ll be surprised, there’s still a class of problems that may not be agentic.”
– Gigi Yuen, [34:04]
“Ask the questions early and often to the service teams…and it’s helped, you know, uncover the solutions that we would have spent more time trying to figure out ourselves.”
– Kenji Fujita, [33:39]

Timestamps for Key Segments

[01:30] – The core administrative challenge in healthcare & Cohere Health’s mission.
[03:33] – Importance of trust, transparency, and non-hallucination in AI for healthcare.
[05:54] – Evaluation-driven development: process and rationale.
[08:58] – Incorporating domain-specific data: motivations, challenges, and rights.
[11:33] – Framework for human oversight vs. automation.
[14:37] – Deciding on AWS Bedrock and Agent Core—technical and compliance considerations.
[20:04] – Acceleration in agent deployment thanks to AWS tooling.
[21:04] – Model evaluation, business value, and adopting a leaderboard approach.
[26:29] – Quantitative impact: automation metrics and clinician feedback.
[28:58] – Three essentials for scaling agentic systems from POC to production.
[32:28] – Centralized vs. federated agent team models—open question.
[33:39] – Advice for newcomers to AI agents: start hands-on, leverage AWS resources, ask early.
[34:04] – Importance of purpose-driven adoption and team composition.

Actionable Takeaways for Listeners

Start early with evaluation-driven development and bring domain experts in from day one.
Define clear business metrics and set up processes for continuous evaluation and model comparison.
Use platforms and frameworks (like AWS Bedrock Agent Core) to speed development, ensure compliance, and enable future scalability.
Automate where risk is low; retain human oversight for high-risk or high-stakes decisions.
Never lose sight of data rights and user privacy, especially in regulated industries.
Invest in change management—AI’s value is realized only when people, process, and data are ready, not just the tech.
Don’t blindly follow hype—always align agentic adoption with real business need.

Final Thoughts

This episode serves as a masterclass in responsible, effective AI agent deployment in high-stakes environments, fusing practical technical detail with hard-earned leadership wisdom. While the context is healthcare, the lessons around evaluation, team composition, scalability, and ethical deployment apply to any domain ready to embrace agentic intelligence on the cloud.

Transcript

[00:00]
A
This is episode 754 of the AWS podcast released on April 7, 2026.
[00:08]
B
Welcome everyone to the AWS Podcast. I am your host, Gillian Ford. And this episode today I am super excited about. I think there's going to be something for everyone here. I know agents are really top of mind for, I mean, let's face it, it's like every single person on the planet is probably thinking about this right now. And you get to learn from two people who have been in the trenches at a company that has not only just been thinking about this, but actually has business-critical applications that are using agents today. So I'm really excited to talk to Gigi Yuen and Kenji Fujita from Cohere Health. So Gigi Yuen is the Chief Data and AI Officer and Kenji Fujita is the Staff AI Platform Engineer at Cohere Health. So there's something here for everyone, whether you are someone who is thinking about agents and how to apply them, or maybe you're in a highly regulated industry and you want to understand how to use them on AWS. Some advice from these two, we're going to cover all of that. All right, let's get started. So Gigi, I'd love to understand first, if you can tell our listeners, what is Cohere Health and what are the specific biggest business problems that you were thinking about within the healthcare industry?
[01:30]
A
Well, first of all, thank you for having us on your podcast, Gillian. What a privilege. So Cohere Health, we are a clinical intelligence company and our mission is to streamline the payer and provider connectivity and collaboration. So just take a second. When I say payer and provider, what do I mean by that? Payers are insurance companies or nonprofits or government entities that finance healthcare services, whereas providers are entities like your doctor's office, hospital systems, wider groups who provide care. So I guess providers provide care and payers pay for the care. We have millions of providers in the United States and hundreds of payers. So you can imagine with these two entities, in order to support the whole healthcare ecosystem, a lot, a lot of transactions, a lot of administrative tasks, right? Prior auth, claims processing, payment quality, care coordination, and unfortunately the fraud, waste and abuse that comes along the way when you have all this back and forth. And if you look at different studies, most recently Health Affairs published an article where about 20 to 30% of the healthcare spend in the States is spent on administrative tasks. 20, 30%. And some of them are necessary, some of them are avoidable, or I would say low value, and depending on which studies you read, it's about half of this 20 to 30% that is low-value administrative tasks. So that's about half a trillion dollars. So that's the problem space Cohere Health is set up to solve. Right. We want to eliminate the waste. When you do that, what does that mean? Right. Patients can get the right care faster, providers can actually focus on what they do best and the payer can really, really do a good job financing the care. So that's the nutshell.
[03:15]
B
Wow. There's just a lot here, I think, especially from healthcare, but I think other folks who are in different industries can even see some parallels to some of the challenges that they're thinking about. So when you were addressed with these challenges, how did you think about it in terms of implementing AI solutions?
[03:34]
A
Yeah, it is a big problem, but it's also a very personal problem. Healthcare is very personal. Even when you're simply talking about getting a bill that you don't understand, or having to wait a few weeks to get an answer to get imaging, it's very, very personal. So as we think about AI solutions, it really has to do with trust and transparency. The utmost important thing is the trust. And I remember I was just reading to my kids, like, the Berenstain Bears books, love those books, right? And there's this book that talks about once trust is broken, you [can't get] it back. So I think as we as technologists think about building and rolling out these AI solutions, we have to get it right the first time, which is fairly unforgiving in, you know, this notion of a stochastic versus deterministic world. So that's something that is top of mind for us. So what does that mean? We need to make sure, you know, what we do is reliable and consistent. Like there's no tolerance for hallucination, none. Also there are a lot of important expert opinions we have to take into account, right. It has to be clinically sound, a lot of literature that we lean on, a lot of guidelines that we can count on. But the funny thing is when you put two doctors in a room for the same case, chances are they don't agree on everything. That's why we love to get a second opinion, right? So as we think about building AI, we have to take all that into consideration. Like what do we mean when a system is performant, based on whose opinion is that, and what are the nuances that we have to account for where we really need to say a human needs to be in the loop? I think last but not least is the security and privacy aspect. And we can get into some of that as we think through. When we kind of get our hands dirty in building a system, there's a lot of technology safeguards, but also process safeguards that we have to consider.
[05:24]
B
These are some themes that I think a lot of businesses are thinking about regardless of what industry they're in, especially with AI. I think the bar just keeps getting set higher and higher, which is great because now what customers want to be able to implement, to be able to serve their end customers, is going to be an even better, more accurate, more performant application for them. So I'd love to understand how are you thinking about no hallucinations, ensuring that it really is a safe and reliable application?
[05:54]
A
I think let's put it in terms of the software development lifecycle. Yeah. So when we are doing design and kind of the reference architecture, it's easy to have the technologists in the room. I think especially in healthcare, and I imagine in many, many, you know, nuanced domains, we must have the experts in the room from the get-go. It is common for many healthcare startups to talk about performance, talk about, you know, clinicians in the loop. But I think there is a difference when you engage your domain expert in the beginning versus at the end when we simply ask them to do validation. At Cohere Health, every single development project has clinicians on the team and in the loop, as opposed to waiting until we have already built the prototype or already built the larger solution, asking them to validate. And I think that domain expertise is key to ensure that we are measuring the right things. I think that leads to my second point. I grew up in an era where we talked about test-driven development and I think now, especially with a lot of these agentic solutions, we really have to move to the mindset of evaluation-driven development where eval comes first. What are the metrics that are important? How are we going to track them? We talked about no tolerance for hallucination. It took us a month to iterate on exactly how we quantify hallucination in particular clinical settings. Right. So all that upfront work is more important than ever. I think that's key. And once the solution is launched, now we are at the, you know, monitoring and tracking phase. Obviously having 24/7 monitoring is key, potentially using, you know, generative AI itself to help as a judge. But I do believe in the importance of human audits. Having that regularly sampled human audit is really, really critical.
And I'm going to say one last thing, and Kenji, you may have something to add too because we've been working together on this: we're learning that there are different personas that will use AI solutions. Especially since healthcare is such a personal space, even learning about how you roll out to different persona groups over time helps us build more trustworthy and useful applications.
[08:09]
C
Yeah, the one thing that I would add there is the key focus for us at a lower level is having strict guidelines and standards so that all patients get unified care while allowing for some level of user preference when it comes to how that, how that care is received.
[08:26]
B
There's so much to unpack, so I'm glad that we've got more time. I'm going to ask so many different questions, but let's dive into that, what Kenji was just talking about, that user. You used the words, I think, user standard care, but really like focusing on the end user. And that sounds like you also need to be able to have that domain-specific data in order to be able to bring it back to providing the best care or the best experience for that end user. So maybe you can tell us how companies can really think about incorporating their own domain-specific data in terms of AI.
[08:59]
A
It's a great question. I think technically there are many ways, right? Case-specific context or pre-training of models, the spectrum is wide. But I think instead of talking about the how, I want to talk about the why and the what. Just for one moment, ask ourselves why do we want to incorporate domain-specific data? Is it because I want the solution to run faster, run cheaper, be more accurate, be more contextual? It's not a matter of that. You want more code control over your system's output, especially given the industry we operate in. And to Kenji's point, right, there's a standard of care that we want to maintain. Like I think understanding the motivation behind using this domain-specific data will help us pick the right data and at what point we incorporate it. Right. So there are instances where it would make sense to incorporate more of a knowledge system. Like in our world it's more around medical society guidelines, standard of care, ontology frameworks. Incorporating them will allow us to have a more consistent framework in how the AI operates. But then when the goal is to provide a more contextual, accurate response in every single interaction, then we need to have our AI be able to access very specific individual case notes, and they all are relevant and they are important. It goes back to the why. And last but not least, just to put a plug in for the notion of data use rights: it's really critical when we're entrusted with patient data and when we're entrusted with business-sensitive data. What rights do Cohere Health and our partners have to use the data, and for what purpose, is really, really critical. And it's quite nuanced. Right. Are you using it for operations, versus are you using it for learning, or are you using it for education or training? All those nuances have to be accounted for.
[10:43]
B
That is such a good callout because I know I've seen companies already start going down the route of maybe using a specific data set, for example, and then it's the engineers who get really excited about the problem only to find out later on that like, oh sorry, we can't use it because of, like, the rights to the data, for maybe compliance, licensing, whatever kinds of reasons. So that's such a good callout that I think will definitely help a lot of the listeners. Same with what you were saying earlier about having those experts really from the early stages of the process instead of the human in the loop just being the end part. So maybe you can share your thought process of how you decide when to actually automate versus actually having that human oversight that's part of the process.
[11:34]
A
It's the million dollar question, right? It's risk and reward. Human in the loop makes a lot of sense when we are working with high-risk, high-reward cases. And like for instance in our case, we have a good number of solutions that target prior authorization automation, trying to get the patient to the right care faster and with less paperwork. But we've made a clear decision, regardless of regulation and which geography we operate in, right: AI would never be used to deny a patient's care or to even deny a provider's request to Cohere Health. We always draw a clean line on when we will not automate is when we have to say no to a patient's care. But kind of going back to your question earlier, Gillian, about human oversight, even in the cases where we choose to automate, I think oversight is still essential. I think it's just a matter of at what point does it come into play. Oversight should always happen with the system design, reference architecture and how we design the eval. And oversight should always happen with audit. But I think it's in the high-risk scenario where human oversight is in terms of every single case and every single nuanced detail. And figuring out that middle ground I think is the key to figuring out how to scale.
[12:48]
B
Yeah, maybe you can give us like an example That I think can help the listeners maybe visualize what that could kind of look like in their own business.
[12:56]
A
Yeah, sure thing. So for instance, going back to the prior authorization automation example, we could look at a variety of clinical areas. Right. Like, a lot of us have experience getting a prior auth for imaging, trying to figure out what's going on, right, a diagnostic reason, or you go get a prior auth because you had to get a knee surgery. Depending on the clinical use case, the risk tolerance is quite different. It's one thing when you need to bring me to your operating room and cut me open versus getting an MRI, which is taking half an hour out of your day and with minimum radiation exposure. So really considering that, the way Cohere Health approaches it is, for a knee surgery decision, we better be absolutely confident before we automatically, you know, say yes to the case without a human review. Whereas for diagnostic imaging, we will likely say as long as there's no contraindications, as long as patient risk is considered, as long as it's covered by a policy so there's no financial risk, we'll go ahead and say yes without a human intervention. So that nuance, that's why having the human experts, the domain experts, in the loop as we design the system is so critical.
[14:05]
B
Well, I love that, really having a framework with the experts assessing the actual risk and then using that to be able to design how the human in the loop, human oversight, is part of the entire process. Let's get into how this has actually been built. So Kenji, I'd love to understand really, what are some of the factors that your team was thinking about that ultimately led you to choose Amazon Bedrock and Amazon Bedrock Agent Core?
[14:37]
C
Sure. Yeah. So I think timing was a huge factor for Cohere. We this past year have invested a lot of time and resources into building out a platform around our AI. And a lot of that incorporates how do you scale with the new agentic services that are out there in the market? And so we were attending workshops with AWS for Agent Core. We were evaluating the different components. I think the key thing that stood out right away was the speed of innovation here, the ability to build these agents quickly, effectively and safely. A couple key components for us that I think most developers can get held up on are memory and MCP. And so it was clear to me that these were, you know, paramount for the service teams at AWS when they were building out Agent Core. Because the tenancy concerns that we have in, you know, a highly regulated space like healthcare are covered with some of the components of Memory Client out of the box, and implementing them only takes a couple lines of code, which to me is a huge benefit. The gateway is another thing. So we had started building out our own MCP servers, but with the identity built on top of Gateway and the different targets that the gateway provides, we've been able to at least iterate on our research and scale out the potential use cases for our agents. Because again, it only takes a couple lines of code to implement an entire MCP server target.
[16:08]
B
And that speaks, it says a lot that you're saying you're able to build it quickly, effectively and safely, because Bedrock Agent Core is relatively new. So it sounds like there must have been some folks at AWS that really helped you. So maybe you can share some of how AWS was really able to help you with everything you were saying earlier, like the technical challenges, the business challenges that you had, and to be able to actually build it in production today.
[16:35]
C
Yeah, I think what helped us was a little bit of hand holding around understanding our use case. Right. We, we, our top concern is always the security of the data when it comes to implementing a solution like this. And so the, the first thing we brought to them was how do we transition our short term memory to Agent Core so that we have tenancy separation. The workshops that we went through with some of the Jupyter notebooks that they had available had everything that we needed right out of the box. And so going through some hands on experience in a test environment was super helpful. And then understanding the documentation was also key. And I think the namespaces that the Agent Core Memory client provides are very intuitive to set up.
[17:17]
B
I'd love to know, based on your experience between the workshops, the documentation, is there anything else that kind of stood out to you that can help folks who are on that journey of implementing agents?
[17:31]
C
Yeah, so I touched on it a little bit, but I think the amount of experimental research that we've been able to do has far exceeded what we thought we would be able to do by this point. And a lot of that is due to the ease of development here. So I'm trying to think of a good example to share. There are two different use cases that we have. One was transitioning an existing service over to Agent Core. Right. So we had spent all of this time evaluating and setting up memory for a chat agent, and transitioning over to the memory client was a relatively trivial task for our team to implement. And a lot of that was due to the documentation and the workshops that they were able to attend. But then there's also net new development. And what's clear to me is that Agent Core really takes away a lot of the ops concerns from the MLE. So the developer on the ML side, who typically wouldn't be a DevOps expert, doesn't have to evaluate those concerns as heavily because a lot of it is handled by the Agent Core service.
[18:30]
B
Back to something you were saying earlier, because I think this will resonate with a lot of folks who are listening. MCPs are really a huge hot topic right now, and a lot of people are thinking about building it themselves. So I'm curious, from your experience, when you were at that stage where you had started building it yourself and then started using Bedrock Agent Core, if there were any other learnings that you had from that experience that can help someone else who's really thinking, should I build it myself, or should I use Bedrock Agent Core to make that easier?
[19:02]
C
Sure, yeah. So Cohere has been around since before a lot of this technology existed. And so some of our applications have tenancy built into them that the agents need to follow. So we want to make sure that the agents are following the same patterns that we already had in place pre-agentic implementation. And so we were building out our MCP servers and ensuring that we had auth proxies to send through the same sort of approach that we would follow on the core application side. But the Agent Core Gateway handles this implicitly with the identity provider. So it's one of those things that with a couple lines of code, like I was saying, you can just pass through the same authentication method that we would typically set up on our own with the configuration enabled by Agent Core Gateway.
[19:49]
B
So I'm curious, now that you went from starting to really go down a path of building it yourself to then starting with Agent Core, did that change at all maybe your timeframe of when you were able to put it into production, or any other areas of, like, your velocity?
[20:05]
C
Oh, a hundred percent. Yeah. We were talking about this all week. I think going into this quarter, we had maybe an agent or two that we had planned to develop, and looking ahead at Q1 and the rest of 2026, it's full of agents, it's full of agentic systems, multi agents, and I think Gigi touched on this a lot. But the evaluations come first. So a lot of our focus has been on building out the evaluation framework this quarter so that we can scale and continue to build new agents that we know are going to be successful on the first pass using Agent Core.
[20:37]
B
I've got a few questions that I want to get both of your opinions on. So Gigi, I'll start this one with you. So model choice, this is a super hot topic that I know a lot of businesses are really thinking about, so I'd love to hear how you think about evaluating different models. And the term I love that you chose earlier, evaluation-driven development: based on your experience, maybe some insights that you can share on the business value of that approach.
[21:05]
A
Yeah, actually it could be helpful for me to go back a few years in history. Kenji talked about how Cohere Health had started a few years before this agentic revolution, and in fact we were using our own transformer models to do NLP months, if not quarters, before ChatGPT came out. So the company was already on an accelerated trajectory in adopting this cutting-edge tech. I think we have a unique perspective because it's always the question: build versus buy, or do we tune? So the decision is between continuing our own journey of building our own model from scratch, versus adopting one of the frontier models with prompt tuning, or maybe going down the path of fine-tuning, maybe pre-training. We've been having these healthy internal debates for quite a few months, so I think it's a unique experience that I would love to share more widely. One thing is that change is constant, and the best thing I can do as a leader for these amazing technologists is to set very clear metrics, right: accuracy, cost, latency, reliability, and honestly how much eval data is needed for each approach. We cannot push any AI out without publishing eval data, especially in the industry we're in. So it's really thinking through all those metrics. And I can be honest with you, Gillian: earlier this year we were still leaning very heavily toward self-training, and more recently we're leaning more and more toward fine-tuning, or perhaps using one of the frontier models, at least at the get-go, so that we can get really good coverage. So that, I think, is keeping an open mind. What has really helped us is agreeing that these are the business and operational metrics that are important to us, and setting up a leaderboard so that we have a way, behind the scenes and not as part of our constant sprint planning, to keep monitoring and tracking which models are winning and when it may make sense for us to make the switch.
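A leaderboard over the metrics Gigi lists (accuracy, cost, latency, reliability) could be sketched roughly as follows. This is a minimal illustration, not Cohere Health's implementation: the model names, metric values, and weights are all hypothetical, and min-max normalization with a weighted sum is just one assumed way to combine the metrics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float       # fraction correct on the eval set (higher is better)
    cost_per_1k: float    # dollars per 1,000 requests (lower is better)
    p95_latency_s: float  # seconds (lower is better)
    reliability: float    # fraction of requests that succeeded (higher is better)


# Hypothetical weights; in practice these come from agreed business priorities.
WEIGHTS = {"accuracy": 0.4, "cost_per_1k": 0.2, "p95_latency_s": 0.2, "reliability": 0.2}
LOWER_IS_BETTER = {"cost_per_1k", "p95_latency_s"}


def _normalize(values: List[float], invert: bool) -> List[float]:
    """Min-max scale to [0, 1]; invert so that 1.0 always means 'best'."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def leaderboard(results: List[EvalResult]) -> List[Tuple[str, float]]:
    """Return (model, weighted score) pairs sorted from best to worst."""
    scores = {r.model: 0.0 for r in results}
    for metric, weight in WEIGHTS.items():
        column = [getattr(r, metric) for r in results]
        normalized = _normalize(column, invert=metric in LOWER_IS_BETTER)
        for r, n in zip(results, normalized):
            scores[r.model] += weight * n
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical eval results for three candidate approaches.
results = [
    EvalResult("in-house-small", 0.90, 1.0, 0.5, 0.99),
    EvalResult("frontier-a", 0.95, 8.0, 2.0, 0.98),
    EvalResult("frontier-b-finetuned", 0.93, 4.0, 1.0, 0.97),
]
ranking = leaderboard(results)
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

Because the weights are explicit, changing a business priority (for example, raising the weight on accuracy) changes the ranking without touching the evaluation data.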
[23:06]
C
So on top of an automated LLM-as-a-judge approach, which we have in place, we also have clinicians labeling the data for us behind the scenes.
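One way to use clinician labels alongside an LLM judge is to measure how often the two agree and to surface the disagreements for review. The sketch below is an assumption about how that comparison might look, not a description of Cohere Health's pipeline: the labels are made up, and Cohen's kappa is simply a common chance-corrected agreement statistic chosen for illustration.

```python
from collections import Counter
from typing import List, Sequence


def agreement_rate(judge: Sequence[str], clinician: Sequence[str]) -> float:
    """Fraction of items where the LLM judge and the clinician gave the same label."""
    if len(judge) != len(clinician) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == c for j, c in zip(judge, clinician)) / len(judge)


def cohens_kappa(judge: Sequence[str], clinician: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, clinician)
    judge_counts, clin_counts = Counter(judge), Counter(clinician)
    labels = set(judge_counts) | set(clin_counts)
    p_e = sum((judge_counts[l] / n) * (clin_counts[l] / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def disagreements(judge: Sequence[str], clinician: Sequence[str]) -> List[int]:
    """Indices worth a second look, e.g. to refine the judge prompt."""
    return [i for i, (j, c) in enumerate(zip(judge, clinician)) if j != c]


# Hypothetical labels for eight evaluated agent outputs.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
clinician_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

rate = agreement_rate(judge_labels, clinician_labels)
kappa = cohens_kappa(judge_labels, clinician_labels)
to_review = disagreements(judge_labels, clinician_labels)
print(rate, round(kappa, 3), to_review)
```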
[23:13]
A
200%, Kenji. And change is a constant. Having a leaderboard to keep watching against metrics is so helpful. I think going forward it's about being really cognizant of how we collect ground truth data so that we can have those use-case-specific insights to make these decisions.
[23:32]
B
There is a lot to really unpack there, and I think every single listener can get a takeaway from it: metrics, having a leaderboard. I know a lot of businesses out there that I speak to don't have a model evaluation framework. They're usually just sticking with one large language model, and they stick with it until maybe there's a reason not to. But I love that you're really assessing all these different options that are out there. And it's very clear that because you have all these different metrics that you've defined ahead of time, and you've got this process, you're able to have the best cost possible at the lowest latency and the best performance, which at the end of the day is what companies are all looking at. They just don't have a system that can help them get all of the benefits that they're really looking for. So I would love to hear your advice for a company that right now is using a single large language model. They know there are definitely other models out there, but how do they go from one to being able to assess others, so they can pick one or more that are going to be best for their use cases?
[24:45]
A
I think you summarized it well, right? Make sure you know what's important to you, what your metrics are. And then invest in that automated eval framework, both human in the loop and using large language models, so that these decisions can be data-driven. That's a big part. And when it does show that there may be value in switching, my personal experience says that it's not always one size fits all. It's not that you switch from one frontier model to another for all your use cases, or even within a use case for every single piece of your pipeline. So it goes back to your architectural conversation: working with the architect and thinking through how you design an architecture that allows you to have that composability. In our case, certain use cases rely more heavily on smaller models that we host internally, and certain use cases rely more heavily on frontier models. But if your architecture doesn't support it and every time it's a new build, then it makes the adoption cost unbearable.
[25:46]
C
Yeah, Gigi touched on it. I think having the platform, investing in the platform, is key to this style of development: both the evaluation framework and the gateway. I think Agent Core Gateway is meant to be a gateway for the agents. I think having an LLM gateway or an AI gateway is also valuable to put on top of this framework so that you can iterate quickly, change targets, and evaluate these at a much higher pace.
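The "LLM gateway" idea, where each use case is routed to a model target that can be swapped without rebuilding the caller, can be sketched roughly like this. It is a hypothetical illustration under stated assumptions: `LLMGateway`, the backend functions, and the use-case names are invented for the example, and the backends are stubs that do not call any real model.

```python
from typing import Callable, Dict

ModelBackend = Callable[[str], str]


def small_hosted_model(prompt: str) -> str:
    """Stub standing in for a smaller, internally hosted model."""
    return f"[small-hosted] {prompt}"


def frontier_model(prompt: str) -> str:
    """Stub standing in for a frontier model behind an API."""
    return f"[frontier] {prompt}"


class LLMGateway:
    """Routes each use case to a named backend; routes are plain configuration."""

    def __init__(self, backends: Dict[str, ModelBackend], routes: Dict[str, str], default: str):
        if default not in backends:
            raise KeyError(f"unknown default backend: {default}")
        self._backends = dict(backends)
        self._routes = dict(routes)
        self._default = default

    def set_route(self, use_case: str, backend_name: str) -> None:
        """Change the target for one use case, e.g. after the leaderboard shifts."""
        if backend_name not in self._backends:
            raise KeyError(f"unknown backend: {backend_name}")
        self._routes[use_case] = backend_name

    def complete(self, use_case: str, prompt: str) -> str:
        name = self._routes.get(use_case, self._default)
        return self._backends[name](prompt)


gateway = LLMGateway(
    backends={"small": small_hosted_model, "frontier": frontier_model},
    routes={"document-classification": "small", "clinical-summary": "frontier"},
    default="frontier",
)
before = gateway.complete("document-classification", "classify this fax")
gateway.set_route("document-classification", "frontier")
after = gateway.complete("document-classification", "classify this fax")
print(before)
print(after)
```

Callers depend only on the use-case name, so switching a target is a one-line route change and the same evaluation suite can be rerun against the new target.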
[26:08]
B
So I'm very curious about the business impact that you've been able to see from AI. I know there are still companies that struggle to measure the ROI of AI, so hearing from your experience will certainly inspire them. I know everything else you said earlier definitely has, if not already.
[26:29]
A
Yeah. Oh well, let me go back to the example I started earlier in this conversation. In the prior auth space we've worked with clients who have to deploy dozens and dozens of nurses and MDs to review these cases just to meet the volume and the turnaround time requirement. Right, it's a highly regulated industry. We have 14 days to decide, but starting in the new year you only have seven days to decide. So it's easy to just try to throw bodies at the problem. But with the AI system that we built out with our clinicians in the loop, we're able to achieve 85% automation. So 85% of decisions are made within minutes. And for the 15% of cases that require high-touch human review, with our agentic system we've seen about a 30 to 40% improvement in productivity. And something that's tough to measure, but that we are hearing in feedback from the users, is that it helps with job satisfaction, because they spend their time making clinical decisions as opposed to trying to figure out where the information is or trying to dig through all the requirements. Right, everything is surfaced, all the deep research is done on their behalf, and they can really just focus on their area of expertise. So yes, we save time, but I think we also have better retention.
[27:47]
B
That is definitely a testament to the work that your team has done. I mean, to be able to get to 85% automation, with your end customers being even happier with the solution, really speaks to all the listeners who are thinking about how to get there, especially those I know who are in a monolithic type of architecture right now, where they have one LLM and they're not able to experiment with the different models that are out there, to maybe have a certain use case where they can get away with a lower-cost one, or one that'll give them better latency, all those different factors that you were talking about. So I just love those metrics, because I think they show others who are in the early stages of their journey what's possible. Okay. Some other areas that I'd love your opinions on. All right. So Gigi, based on your experience, what are three things that all companies should really be doing when they're in the early stages of building agents, before they actually push into production? I think I'm going to know your answer, but maybe I'll be and then
[28:58]
A
A little context, right? I've been doing this line of work for 20 years. Agents or not, right, there's the movement from a proof of concept or a pilot to a large-scale solution. David McKenzie says that 95% of AI prototypes and PoCs don't even go into production, and only half of the ones that go into production actually stay in production. So this has been an age-old problem, regardless of agents or not. I think agents actually do create a pressure, because on one hand you can innovate and experiment faster, but on the other hand there are more unknowns that you have to manage. So it actually amplifies the challenge. And you're right, we're going to start with evaluation-driven development. If there's one thing you want to take home from this podcast, that's the one line that I will really encourage. And as we think about these metrics, to Kenji's point, have the right experts to provide labeled data, provide ground truth, and make sure that the metrics tie back to your business and operational metrics, so they're not purely functional or non-functional evals but tie back to the overall company or client strategy. The other piece, where I've personally seen a lot of struggles going from PoC to scale, is not investing that time. Let me put it the other way, I should put it in a positive way. Let me try again. Another place I've seen success in going from PoC to large-scale deployment is taking the time to define, and letting everyone know, what must be true for the agent to be successful at scale. Because the nature of a PoC is to simplify, right? It's to not consider certain edge cases, and it's to assume a certain level of integration and operational efficiency. So in order to flip that switch, we must be very, very clear on what the important criteria are for it to be successful. And they're not usually AI related. They're usually about the people who are going to be using the tool. Are you going to give them the right training?
Are you going to give them the right transition plan? Are you going to bring in the right advocates? Because as I mentioned earlier, if you want to look at technology differently, there are different personas, right? Who do you bring on board to help you? And oftentimes things fail because of processes. If you don't change your processes but you get new tech, you're not going to see the benefits, and the impact can quickly be minimized. And then I think the third thing is sarcastic, right? System deal, right? We always assume data integration is easy, and it's never easy. So I think that's the piece where we always have to take a step back and say: amazing AI system, now let's make sure the people, the process, and the data are ready, and have that clarity so that everyone's on the same page and marching to a single.
[31:45]
B
I think you just saved people a lot of time on that, because people get so excited about AI and already start thinking about, for example, what LLM they're going to start using. But getting the domain experts really part of the process, making sure you really understand like the. That you were talking about earlier, like the. That what you're building is going to change the entire process for your end customers, and really having clarity on what that means for them. Are they going to even use it? Even the data ingestion part as well. I think these are all prerequisites that people really need to think about. I've seen that as well: they often get overlooked, and then you have to take two steps back before you can go ahead. Anything you would do differently if you were starting over today?
[32:29]
A
You stumped us with that one. Let's see. I think the jury is still out. We don't know yet. One thing that I'm still thinking through is: is it more effective to have a small team that focuses on agent development in a larger org, or is it better to plug agent development into every single development team? Let me try to say it the other way. Is it better to have a centralized agent development team, let them drive the innovation, and then disseminate what they learn? Is that more effective, or is it more effective to have each development team start adopting, in more of a federated model? The jury is still out on that one.
[33:10]
B
That sounds like that'll be a part two episode.
[33:13]
A
We will let you know, Gillian. I don't know, Kenji, anything you might want to say?
[33:17]
C
No, that's a good point. I don't have anything there. It's a tricky problem.
[33:21]
B
All right, I've got one last question for each of you. Kenji, I'll start with you. So we've got listeners here that are in all different industries at all different stages of their journey within aws. So for those who are considering building AI agents, what's one piece of advice that you have for them to get started?
[33:40]
C
I think AWS has so many resources out there right now to make it self-serviceable. So my best advice is to just start testing out and trying the tools. And at least in my experience, asking questions early and often, to the service teams and to our account managers, has helped uncover solutions that we would have spent more time trying to figure out ourselves.
[34:04]
A
Kenji is right. Having hands-on experience is the best way to make good decisions. And I guess one more thing I'll add is: don't let FOMO get in the way just because everyone seems to be deploying and benefiting from agent systems. Really, always start with the why. You'll be surprised. There's still a class of problems that may not be agentic, right? And there's a class of problems that can be very well solved with a single LLM. Just going back to the business and success metrics to make these decisions is key. I think my advice to the folks who want to dig into agents and so forth is to think hard about the team you assemble to make this real. I think agentic work truly does require a new profile of developers and a new kind of development team. We mentioned earlier the importance of having domain experts in the loop on day one. Technology-savvy domain experts are gold on the team, and we've had amazing success seeing the core platform developer working side by side with a data scientist, and side by side with a clinical MD and a nurse. Right, that combo really allows us to iterate super quickly. And that's a new framework. I don't think we've had that kind of dependency in terms of diversity of skill sets, being in the same room at the same time, before the agent.
[35:34]
B
Wow. This was seriously a masterclass on, I mean, evaluation driven development, AI driven development. There clearly was something here for everyone. Thank you so much, Kenji, Gigi, this was phenomenal. Really appreciate both of you spending time here with me on the AWS podcast.
[35:52]
A
Thank you for having us.
[35:54]
C
Thanks, Gillian. This was awesome.
[35:55]
B
Thanks.