Podcast Summary
Using AI at Work: AI in the Workplace & Generative AI for Business Leaders
Episode 90: Evals and AI Output Evaluation with Hernan Lardiaz
Date: February 9, 2026
Host: Chris Daigle
Guest: Hernan Lardiaz, COO of Ragmetrics
Main Theme
This episode explores the essential role of evaluation (“evals”) in deploying AI systems in business environments. The discussion focuses on how business leaders and technical teams can ensure AI reliability, accuracy, and risk mitigation—especially as AI becomes integral to operations. Hernan Lardiaz, with 25 years in tech and AI risk-reduction, shares strategies for measuring AI output to bolster trust and operationalize generative AI confidently at scale.
Key Discussion Points & Insights
The Need for Evaluating AI Output
AI Is Probabilistic, Not Deterministic
- Classical software delivers predictable outputs; AI, however, is probabilistic. This means the same prompt can yield varying results, making output evaluation crucial for dependable operations.
"In AI, it's probabilistic. So the outcome is not necessarily always the same one. And that's basically the fun of AI. That also drives that sometimes the outcome is not correct or is what we call hallucinating."
— Hernan Lardiaz, [06:20]
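The probabilistic behavior described above can be illustrated with a toy simulation. Everything here is invented for illustration (the `toy_model` stand-in, its canned answers, and the 90% "correct" rate); a real eval would query an actual model many times and score the responses.

```python
import random

def toy_model(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for an LLM: answers correctly most of the time,
    but occasionally 'hallucinates'. The 90% rate is invented."""
    return "Paris" if rng.random() < 0.9 else "Lyon"

# The same prompt, run 1,000 times, does not always give the same answer.
rng = random.Random(42)
answers = [toy_model("What is the capital of France?", rng) for _ in range(1000)]
accuracy = answers.count("Paris") / len(answers)
print(f"accuracy over 1000 runs: {accuracy:.1%}")
```

This is the core reason evals exist: because a single spot check can hit a lucky (or unlucky) draw, accuracy only becomes meaningful when measured over many runs.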
Risks in Regulated Industries
- Many business leaders hesitate to adopt AI, fearing potential errors in regulated sectors where compliance and risk are critical.
"He looked at me in the eye and said, Hernan, I'm not going to implement AI because I'm afraid of the outcome. We are in a regulated market. If something happens, we cannot control risk."
— Hernan Lardiaz, [20:59]
Operationalizing AI Requires Trust
- You can't scale AI-driven processes effectively without frameworks for precision and consistency. Inaccurate or unpredictable outputs can lead to major business risks.
What Are “Evals” in AI?
Definition and Purpose
- Evals are structured ways to assess whether AI systems provide correct and reliable outputs—crucial for operationalizing AI beyond experimentation.
"...this is not just some theoretical topic. This is a key piece of being able to operationalize generative AI at scale."
— Chris Daigle, [08:06]
Manual vs. Automated Evaluation
- Many businesses either do nothing, perform manual output checks, or (ideally) adopt automated evaluation pipelines—moving from spreadsheets to API-driven solutions.
Implementing an Evaluation Process
Who Needs Evals?
- Any company using AI in customer-facing or risk-sensitive workflows—from chatbots to financial and healthcare applications—needs systematic output testing.
- Two main audiences:
- Business owners (risk reduction, compliance)
- IT/development teams (building/testing AI tools)
"So there are two audiences, the business owner and the group that develops and implements AI solutions."
— Hernan Lardiaz, [20:55]
How to Build and Run Evals
Three steps:
1. Develop test data sets (often pulled directly from the company’s knowledge base, e.g., policy documents)
2. Define evaluation criteria (accuracy, tone, token usage, etc.)
3. Run tests (preferably automated) and analyze results for optimization
"It's as simple as uploading the policy into our tool, our system, and asking the system to generate hundreds of questions based on that policy."
— Hernan Lardiaz, [22:41]
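The three steps above can be sketched as a minimal eval loop. Everything in this sketch is invented for illustration: the two-item test set, the `ask_model` stand-in (with a deliberate wrong answer), and the exact-match criterion. A real pipeline would auto-generate hundreds of questions from a policy document and call the AI system's API.

```python
# Step 1: a test data set drawn from a (hypothetical) refund policy.
test_set = [
    {"question": "How many days do customers have to request a refund?",
     "expected": "30 days"},
    {"question": "Is a receipt required for a refund?",
     "expected": "yes"},
]

# Stand-in for the AI system under test (canned answers for illustration).
def ask_model(question: str) -> str:
    canned = {
        "How many days do customers have to request a refund?": "30 days",
        "Is a receipt required for a refund?": "no",  # deliberate error
    }
    return canned[question]

# Step 2: an evaluation criterion. Exact match here; real criteria can
# also score tone, token usage, and other dimensions.
def is_correct(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

# Step 3: run the tests and analyze the results.
results = [is_correct(ask_model(t["question"]), t["expected"]) for t in test_set]
accuracy = sum(results) / len(results)
print(f"{sum(results)}/{len(results)} correct ({accuracy:.0%})")
```

The structure scales directly: swap in a real test set and a real API call, and the same loop runs thousands of scenarios on a schedule.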
Continuous Monitoring and Drift Detection
- Ongoing evaluation (not just pre-launch checks) is essential. AI models can “drift,” with output accuracy degrading over time—so regular checks catch issues before they become business problems.
"If you are measuring accuracy and you start to see that the accuracy slowly starts to degrade, then you can see that you're starting to have a problem."
— Hernan Lardiaz, [13:33]
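Drift detection of this kind reduces to a simple threshold check on accuracy measured over time. The weekly accuracy scores, baseline, and tolerance below are invented for illustration; in practice the scores would come from scheduled automated eval runs.

```python
baseline = 0.95    # accuracy measured at launch (invented)
tolerance = 0.05   # how much degradation is acceptable (invented)
weekly_accuracy = [0.95, 0.94, 0.93, 0.91, 0.88, 0.86]  # invented scores

def detect_drift(scores, baseline, tolerance):
    """Return the first index where accuracy falls below
    baseline - tolerance, or None if no drift is seen."""
    for week, score in enumerate(scores):
        if score < baseline - tolerance:
            return week
    return None

week = detect_drift(weekly_accuracy, baseline, tolerance)
if week is not None:
    print(f"drift detected in week {week}: accuracy {weekly_accuracy[week]:.2f}")
```

The point is the cadence, not the math: because checks run continuously, the slow slide from 0.95 toward 0.86 is flagged while it is still a trend rather than a business incident.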
Automation Brings Speed and Consistency
- Automated systems test thousands of scenarios rapidly and consistently—reducing manual effort, human error, and bias, and allowing for regular regression checks post-updates.
Human in the Loop
Role of Human Oversight
- While automation handles most scenarios, occasional human review fine-tunes the process, feeding expert corrections back into the system for complex cases—particularly when system output and real-world expertise disagree.
"We see human in the loop in a very minimal aspect...to retrofit information back into the system in the moments where it cannot be done automatically."
— Hernan Lardiaz, [27:55]
Cost and Accessibility
Surprisingly Affordable Tech
- Automated evals are not just for large enterprises. Entry-level packages (~$250 for 3,000 evaluations) make this accessible for SMBs.
"...for your people to run 3,000 evaluations, the payroll on that would be way beyond the 250. Pony up the 250, do it right, remove the human error..."
— Chris Daigle, [48:01]
Real Business Impact
Mitigating Risk and Building Trust
- Consistent evaluation reduces incidents with compliance, order fulfillment, and customer communications.
- Example: Companies have faced lawsuits or operational chaos due to unchecked chatbot errors.
Efficiency and Scalability
- Automated evaluations free up high-value staff for strategic tasks, streamline model updates, and support rapid, confident AI deployment.
Getting Started
Onboarding and Integration
- The process is simple: connect via API, define data sets and criteria, and launch. Non-technical teams can reference documentation and direct support.
"...if someone wants to try the tool now and has a little bit of knowledge on the technology side, I'm pretty sure that in an hour he can be using it without any problem."
— Hernan Lardiaz, [39:01]
Notable Quotes & Memorable Moments
“If you play the lottery enough times, at some point in time you'll win it.”
— Hernan Lardiaz, highlighting AI’s probabilistic nature, [05:50]
“Errors cannot be avoided. That’s basically the core of what we want to resolve with what we do at Ragmetrics: try to understand what is that output of AI and how to measure and evaluate it.”
— Hernan Lardiaz, [06:20]
“Outside of if you just want to use AI for thought leadership and ideation... But if you want to operationalize AI...then this effort on evaluation is something that's extremely important.”
— Chris Daigle, [08:06]
“There are hundreds of examples...from Air Canada being sued because a chatbot didn't provide right information, to a fast food company putting hundreds of thousands of french fries on the same ticket, to New York City government providing information to someone to break the law.”
— Hernan Lardiaz, [18:46]
“My question to those guys [prompt engineers] in general is, OK, you change five words on that prompt, right? How do you know it's better?... what happens with the other 400,000 scenarios you haven't tested?”
— Hernan Lardiaz, [38:07]
“You don’t want to put AI into production and just hope that, you know, your employees are spot checking it on gut and getting it right. You got a business at stake here.”
— Chris Daigle, [50:07]
Timestamps for Key Segments
| Time | Topic / Quote |
|-------------|------------------------------------------------------------------------------|
| 00:00–01:00 | Introduction; real-world business fear of AI implementation in regulated sectors |
| 04:07–04:49 | Hernan’s background and the evolution of AI from academia to business |
| 05:50–07:10 | Definition and importance of “evals”; probabilistic vs. deterministic outputs |
| 09:07–10:47 | Which businesses need AI evaluations; risk factors and regulated industries |
| 11:27–14:48 | How to identify processes that must be evaluated, and steps to start |
| 17:10–18:46 | Three approaches to testing: nothing, manual, automated |
| 20:35–22:29 | Two audiences for evals: business owners and developers |
| 22:41–24:45 | How to develop and run tests: policy documents, data sets, and criteria |
| 29:32–30:47 | Where process knowledge fits: capturing expertise in knowledge bases |
| 33:09–35:48 | Continuous evaluation; automation and scheduling |
| 38:07–39:01 | How to know when prompt changes are truly better; the need for broad testing |
| 40:49–43:15 | Real example: optimizing an AI system for accuracy and token consumption |
| 47:00–48:01 | Pricing models and the cost/benefit of automated evaluations |
| 49:33–50:19 | Evaluations as risk mitigation; “low-cost insurance” |
| 50:27–50:50 | Final advice: don’t fear AI, but control and monitor its outputs |
Conclusion & Takeaways
- Evaluating AI is essential for safe, scalable AI adoption, especially in regulated or customer-impacting contexts.
- Automated evals offer rapid, consistent, cost-effective assessments—far more robust than manual spot-checking.
- Start evals early, monitor continuously, and equip both business and technical teams with tools and frameworks.
- Affordable, accessible solutions exist to democratize this practice for organizations of all sizes.
- AI success isn’t just about building—it’s about measuring, monitoring, and maintaining quality over time.
Resources Mentioned
- Ragmetrics AI: Website (Resources, docs, support)
- Chief AI Officer: Executive/team training and AI implementation services; free AI Readiness Assessment
- AI Exchange by Rachel Woods: For further community training and discussion on evaluations in AI
For business owners, executives, or transformation leaders: Ensure you ask, “What’s our protocol for evals on our AI processes?” If you’re not sure, now’s the time to start.
