Podcast Summary
Using AI at Work: AI in the Workplace & Generative AI for Business Leaders
Episode 90: Evals and AI Output Evaluation with Hernan Lardiaz
Date: February 9, 2026
Host: Chris Daigle
Guest: Hernan Lardiaz, COO of Ragmetrics
Main Theme
This episode explores the essential role of evaluation (“evals”) in deploying AI systems in business environments. The discussion focuses on how business leaders and technical teams can ensure AI reliability, accuracy, and risk mitigation—especially as AI becomes integral to operations. Hernan Lardiaz, with 25 years in tech and AI risk-reduction, shares strategies for measuring AI output to bolster trust and operationalize generative AI confidently at scale.
Key Discussion Points & Insights
The Need for Evaluating AI Output
AI Is Probabilistic, Not Deterministic
- Classical software delivers predictable outputs; AI, however, is probabilistic. This means the same prompt can yield varying results, making output evaluation crucial for dependable operations.
"In AI, it's probabilistic. So the outcome is not necessarily always the same one. And that's basically the fun of AI. That also drives that sometimes the outcome is not correct or is what we call hallucinating."
— Hernan Lardiaz, [06:20]
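The probabilistic behavior described above can be illustrated with a toy simulation. Everything here is invented for illustration (the `toy_model` stand-in, its canned answers, and the 90% "correct" rate); a real eval would query an actual model many times and score the responses.

```python
import random

def toy_model(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for an LLM: answers correctly most of the time,
    but occasionally 'hallucinates'. The 90% rate is invented."""
    return "Paris" if rng.random() < 0.9 else "Lyon"

# The same prompt, run 1,000 times, does not always give the same answer.
rng = random.Random(42)
answers = [toy_model("What is the capital of France?", rng) for _ in range(1000)]
accuracy = answers.count("Paris") / len(answers)
print(f"accuracy over 1000 runs: {accuracy:.1%}")
```

This is the core reason evals exist: because a single spot check can hit a lucky (or unlucky) draw, accuracy only becomes meaningful when measured over many runs.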
Risks in Regulated Industries
- Many business leaders hesitate to adopt AI, fearing potential errors in regulated sectors where compliance and risk are critical.
"He looked at me in the eye and said, Hernan, I'm not going to implement AI because I'm afraid of the outcome. We are in a regulated market. If something happens, we cannot control risk."
— Hernan Lardiaz, [20:59]
Operationalizing AI Requires Trust
- You can't scale AI-driven processes effectively without frameworks for precision and consistency. Inaccurate or unpredictable outputs can lead to major business risks.
What Are “Evals” in AI?
Definition and Purpose
- Evals are structured ways to assess whether AI systems provide correct and reliable outputs—crucial for operationalizing AI beyond experimentation.
"...this is not just some theoretical topic. This is a key piece of being able to operationalize generative AI at scale."
— Chris Daigle, [08:06]
Manual vs. Automated Evaluation
- Many businesses either do nothing, perform manual output checks, or (ideally) adopt automated evaluation pipelines—moving from spreadsheets to API-driven solutions.
Implementing an Evaluation Process
Who Needs Evals?
- Any company using AI in customer-facing or risk-sensitive workflows—from chatbots to financial and healthcare applications—needs systematic output testing.
- Two main audiences:
- Business owners (risk reduction, compliance)
- IT/development teams (building/testing AI tools)
"So there are two audiences, the business owner and the group that develops and implements AI solutions."
— Hernan Lardiaz, [20:55]
How to Build and Run Evals
Three steps:
1. Develop test data sets (often pulled directly from the company’s knowledge base, e.g., policy documents)
2. Define evaluation criteria (accuracy, tone, token usage, etc.)
3. Run tests (preferably automated) and analyze results for optimization
"It's as simple as uploading the policy into our tool, our system, and asking the system to generate hundreds of questions based on that policy."
— Hernan Lardiaz, [22:41]
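The three steps above can be sketched as a minimal eval loop. Everything in this sketch is invented for illustration: the two-item test set, the `ask_model` stand-in (with a deliberate wrong answer), and the exact-match criterion. A real pipeline would auto-generate hundreds of questions from a policy document and call the AI system's API.

```python
# Step 1: a test data set drawn from a (hypothetical) refund policy.
test_set = [
    {"question": "How many days do customers have to request a refund?",
     "expected": "30 days"},
    {"question": "Is a receipt required for a refund?",
     "expected": "yes"},
]

# Stand-in for the AI system under test (canned answers for illustration).
def ask_model(question: str) -> str:
    canned = {
        "How many days do customers have to request a refund?": "30 days",
        "Is a receipt required for a refund?": "no",  # deliberate error
    }
    return canned[question]

# Step 2: an evaluation criterion. Exact match here; real criteria can
# also score tone, token usage, and other dimensions.
def is_correct(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

# Step 3: run the tests and analyze the results.
results = [is_correct(ask_model(t["question"]), t["expected"]) for t in test_set]
accuracy = sum(results) / len(results)
print(f"{sum(results)}/{len(results)} correct ({accuracy:.0%})")
```

The structure scales directly: swap in a real test set and a real API call, and the same loop runs thousands of scenarios on a schedule.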
Continuous Monitoring and Drift Detection
- Ongoing evaluation (not just pre-launch checks) is essential. AI models can “drift,” with output accuracy degrading over time—so regular checks catch issues before they become business problems.
"If you are measuring accuracy and you start to see that the accuracy slowly starts to degrade, then you can see that you're starting to have a problem."
— Hernan Lardiaz, [13:33]
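Drift detection of this kind reduces to a simple threshold check on accuracy measured over time. The weekly accuracy scores, baseline, and tolerance below are invented for illustration; in practice the scores would come from scheduled automated eval runs.

```python
baseline = 0.95    # accuracy measured at launch (invented)
tolerance = 0.05   # how much degradation is acceptable (invented)
weekly_accuracy = [0.95, 0.94, 0.93, 0.91, 0.88, 0.86]  # invented scores

def detect_drift(scores, baseline, tolerance):
    """Return the first index where accuracy falls below
    baseline - tolerance, or None if no drift is seen."""
    for week, score in enumerate(scores):
        if score < baseline - tolerance:
            return week
    return None

week = detect_drift(weekly_accuracy, baseline, tolerance)
if week is not None:
    print(f"drift detected in week {week}: accuracy {weekly_accuracy[week]:.2f}")
```

The point is the cadence, not the math: because checks run continuously, the slow slide from 0.95 toward 0.86 is flagged while it is still a trend rather than a business incident.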
Automation Brings Speed and Consistency
- Automated systems test thousands of scenarios rapidly and consistently—reducing manual effort, human error, and bias, and allowing for regular regression checks post-updates.
Human in the Loop
Role of Human Oversight
- While automation handles most scenarios, occasional human review fine-tunes the process, feeding expert corrections back into the system for complex cases—particularly when system output and real-world expertise disagree.
"We see human in the loop in a very minimal aspect...to retrofit information back into the system in the moments where it cannot be done automatically."
— Hernan Lardiaz, [27:55]
Cost and Accessibility
Surprisingly Affordable Tech
- Automated evals are not just for large enterprises. Entry-level packages (~$250 for 3,000 evaluations) make this accessible for SMBs.
"...for your people to run 3,000 evaluations, the payroll on that would be way beyond the 250. Pony up the 250, do it right, remove the human error..."
— Chris Daigle, [48:01]
Real Business Impact
Mitigating Risk and Building Trust
- Consistent evaluation reduces incidents with compliance, order fulfillment, and customer communications.
- Example: Companies have faced lawsuits or operational chaos due to unchecked chatbot errors.
Efficiency and Scalability
- Automated evaluations free up high-value staff for strategic tasks, streamline model updates, and support rapid, confident AI deployment.
Getting Started
Onboarding and Integration
- The process is simple: connect via API, define data sets and criteria, and launch. Non-technical teams can reference documentation and direct support.
"...if someone wants to try the tool now and has a little bit of knowledge on the technology side, I'm pretty sure that in an hour he can be using it without any problem."
— Hernan Lardiaz, [39:01]
Notable Quotes & Memorable Moments
“If you play the lottery enough times, at some point in time you'll win it.”
— Hernan Lardiaz, highlighting AI’s probabilistic nature, [05:50]
“Errors cannot be avoided. That’s basically the core of what we want to resolve with what we do at Ragmetrics: try to understand what is that output of AI and how to measure and evaluate it.”
— Hernan Lardiaz, [06:20]
“Outside of if you just want to use AI for thought leadership and ideation... But if you want to operationalize AI...then this effort on evaluation is something that's extremely important.”
— Chris Daigle, [08:06]
“There are hundreds of examples...from Air Canada being sued because a chatbot didn't provide right information, to a fast food company putting hundreds of thousands of french fries on the same ticket, to New York City government providing information to someone to break the law.”
— Hernan Lardiaz, [18:46]
“My question to those guys [prompt engineers] in general is, OK, you change five words on that prompt, right? How do you know it's better?... what happens with the other 400,000 scenarios you haven't tested?”
— Hernan Lardiaz, [38:07]
“You don’t want to put AI into production and just hope that, you know, your employees are spot checking it on gut and getting it right. You got a business at stake here.”
— Chris Daigle, [50:07]
Timestamps for Key Segments
| Time | Topic / Quote |
|-------------|------------------------------------------------------------------------------|
| 00:00–01:00 | Introduction; real-world business fear of AI implementation in regulated sectors |
| 04:07–04:49 | Hernan’s background and the evolution of AI from academia to business |
| 05:50–07:10 | Definition and importance of “evals”; probabilistic vs. deterministic outputs |
| 09:07–10:47 | Which businesses need AI evaluations; risk factors and regulated industries |
| 11:27–14:48 | How to identify processes that must be evaluated, and steps to start |
| 17:10–18:46 | Three approaches to testing: nothing, manual, automated |
| 20:35–22:29 | Two audiences for evals: business owners and developers |
| 22:41–24:45 | How to develop and run tests: policy documents, data sets, and criteria |
| 29:32–30:47 | Where process knowledge fits: capturing expertise in knowledge bases |
| 33:09–35:48 | Continuous evaluation; automation and scheduling |
| 38:07–39:01 | How to know when prompt changes are truly better; the need for broad testing |
| 40:49–43:15 | Real example: optimizing an AI system for accuracy and token consumption |
| 47:00–48:01 | Pricing models and the cost/benefit of automated evaluations |
| 49:33–50:19 | Evaluations as risk mitigation; “low-cost insurance” |
| 50:27–50:50 | Final advice: don’t fear AI, but control and monitor its outputs |
Conclusion & Takeaways
- Evaluating AI is essential for safe, scalable AI adoption, especially in regulated or customer-impacting contexts.
- Automated evals offer rapid, consistent, cost-effective assessments—far more robust than manual spot-checking.
- Start evals early, monitor continuously, and equip both business and technical teams with tools and frameworks.
- Affordable, accessible solutions exist to democratize this practice for organizations of all sizes.
- AI success isn’t just about building—it’s about measuring, monitoring, and maintaining quality over time.
Resources Mentioned
- Ragmetrics AI: Website (Resources, docs, support)
- Chief AI Officer: Executive/team training and AI implementation services; free AI Readiness Assessment
- AI Exchange by Rachel Woods: For further community training and discussion on evaluations in AI
For business owners, executives, or transformation leaders: Ensure you ask, “What’s our protocol for evals on our AI processes?” If you’re not sure, now’s the time to start.
