Podcast Summary
Podcast: Embracing Digital Transformation
Episode: #318 – AI, ETL, and Accuracy in Unstructured Data
Host: Dr. Darren Pulsipher
Guest: Mehul Shah, Founder & CEO of Aryn AI
Date: January 21, 2026
Episode Overview
This episode explores the challenges and opportunities at the intersection of artificial intelligence (AI), extract-transform-load (ETL) processes, and the need for accuracy when working with massive volumes of unstructured data, especially in the public sector and large enterprises. Dr. Darren Pulsipher and guest Mehul Shah unpack why technology alone isn't enough, how new AI architectures are unlocking value in dusty data lakes, why accuracy matters more than mere automation, and how to move beyond individual "haiku" use cases toward true enterprise AI. The discussion is rich with real-world anecdotes, practical methods, and sharp industry commentary.
Key Discussion Points and Insights
Mehul Shah’s Background: From Research to Real-World Impact
- Early Days in Big Data and AI: Mehul shares his history in scalable data systems and AI, beginning as a research scientist at HP Labs, then as co-founder of Amiato (an early cloud ETL company that was later acquired by AWS and became foundational to AWS Glue).
- “I’ve been doing AI before it became cool ... I've been doing ETL before anybody could spell it.” (Mehul Shah, 01:36)
- Enterprise Focus: Shah learned that building effective technology mattered little unless it truly delighted and empowered end users.
- Document Processing at Scale: Later, at Amazon, Shah ran the Amazon Elasticsearch Service and helped fork it into OpenSearch. He went on to found Aryn AI to help enterprises unlock unstructured data.
The Mountain of Untapped Data: Why Now is Different
- Unstructured Data Is Everywhere: Organizations have accumulated petabytes of untouched files (PowerPoints, contracts, logs), and lacked tools to extract useful insights affordably.
- “I've got ... 5 terabytes of PowerPoint presentations and word docs ... but you're saying today is the day, right?” (Darren, 04:46)
- Barriers Are Falling: Historically, only high-value problems justified the cost of custom solutions. But LLMs and multimodal AI now offer natural-language access for everyone:
- “Now you actually have the ability at your fingertips ... to parse through and extract and answer questions that you otherwise would never be able to answer.” (Mehul, 05:28 & 00:00)
The Limits of Large Language Models (LLMs)
- Context Window and Quality Walls: Even as context windows expand (from 4k to millions of tokens), accuracy drops off sharply beyond a certain point (often 100k-200k tokens).
- “LLM should stand for Lazy Language Model because they give up ... they won’t scan a whole document.” (Darren, 10:24)
- “You see a quality wall... the quality of the answer just starts dropping off really quickly.” (Mehul, 11:13)
- Real-World Example: Processing a few pages is fine, but throw in a whole mortgage package and the model “begs for mercy.”
- “Throw the whole mortgage application in and you’ll literally see it beg for mercy.” (Mehul, 11:43)
- Implications for Enterprises: Sheer data volume overwhelms off-the-shelf models; context engineering, chunking, and divide-and-conquer are needed.
Divide and Conquer: Context Engineering in the Enterprise
- Techniques from Databases: Shah likens the solution to classical computer science — divide-and-conquer, chunking, and careful planning.
- “Divide and conquer is the best way ... you probably heard this idea of context engineering.” (Mehul, 13:39)
- Prompt and Context Engineering: Give LLMs better instructions (prompt engineering); carefully select and structure the data they see.
- Planning Workflows: Use LLMs to first plan how to break up large tasks, then extract or process key information from relevant chunks.
- Enterprise Use Cases: Aryn AI extracts structured data fields from millions of unstructured documents with high accuracy (97–98%), serving insurers, regulators, and others.
- “Once you focus on that, you can get that to scale extremely well.” (Mehul, 20:25)
Achieving (and Compounding) Accuracy
- Why 98% Accuracy Is Magic: Enterprise adoption hinges on minimizing manual correction. At 80–85% accuracy, users must still check the work extensively; at 97–98%, true acceleration and reliability become possible:
- “If it’s not accurate ... they might as well just ignore what you’ve given them.” (Mehul, 20:25)
- Voting and Consensus: Aryn AI routes data through multiple LLM providers and uses a consensus approach (a “quorum technique”) to increase certainty.
- “By pushing data through multiple models ... when there’s consensus, you can see where there’s certainty.” (Mehul, 26:42)
- Human in the Loop: Remaining uncertainties are routed to human experts, whose corrections feed back into the system through a reinforcement-learning loop (“CORAL”: correction-optimized reinforcement learning).
- Iterative Feedback: A small number of corrections (e.g., 12–15 documents) can quickly boost extraction accuracy.
Beyond Replacement: Augmenting Human Work, Not Obsoleting It
- AI as Productivity Amplifier: LLMs aren’t replacing experts — they’re shifting humans to higher-value work. Example: Mathematicians using LLMs to generate minor lemmas, not major proof strategies.
- “You can’t replace human judgment ... But a lot of the ... work that the humans have to do is starting to get automated so they can actually be more productive.” (Mehul, 29:41)
- Future of Knowledge Work: Knowledge-heavy jobs (doctors, lawyers) still require human intuition, persuasion, and verification; AI assists, but doesn’t supplant, these skills.
- “Are lawyers going to go away? Absolutely not ... Do you want an LLM doing your surgery? Absolutely not.” (Mehul, 31:51)
- Enterprise Takeaway: “LLMs are just going to make us much more productive ... but I don’t think they’re going to replace humans. That’s a long, long, long way away.” (Mehul, 34:22)
Notable Quotes & Memorable Moments
- On the early inefficiency of data hoarding:
- "For a long time, this market was ... a market where there was a lot of non-consumption." (Mehul, 05:28)
- On how LLMs perform beyond their limits:
- "Throw the whole mortgage application in ... you'll literally see it beg for mercy." (Mehul, 11:43)
- On enterprise priorities:
- “The thing that's driving all these architectural decisions in the enterprise is … one important criteria … that work has to be done accurately.” (Mehul, 19:21)
- On AI augmentation (not replacement):
- “You're not using the AI to replace the human, you're using it to augment the human.” (Darren, 28:36)
- “I think that's where everybody's kind of scared by the boogeyman ... AI is going to take your job away and so on.” (Mehul, 28:46)
- On verification and human oversight:
- “To be able to verify it, you need to have some independent way of checking their results. And they're not going to be 100% right.” (Mehul, 33:13)
- On the shift in knowledge work:
- “LLMs are just going to make us much more productive. I think it's going to make us happier … But I don't think they're going to replace humans.” (Mehul, 34:22)
Timestamps for Key Segments
- Background & ETL Origins: 01:36 – 04:46
- The Data Hoarding Problem: 04:46 – 05:28
- Why Traditional AI Isn’t Enough: 07:31 – 10:24
- Context Windows & LLM Weaknesses: 10:24 – 13:21
- Divide-and-Conquer Strategies: 13:39 – 16:07
- Enterprise-Level Accuracy: 19:21 – 23:04
- How Aryn AI Works: 24:09 – 28:13
- Human-in-the-Loop Learning: 27:37 – 28:13
- AI as Augmentation, Not Replacement: 28:46 – 34:52
- Call to Action / Learn More: 35:03
Conclusion
This episode delivers a deep dive into the technical, organizational, and social realities of automating unstructured data processing at enterprise scale. Mehul Shah shares hard-won insights from pioneering cloud ETL, introduces practical methods for large-scale and high-accuracy AI extraction, and offers a grounded, optimistic vision of a future where humans and AI collaborate for greater value—not as rivals, but as partners.
Learn more about Aryn AI: arynai.com
Contact: info@arynai.com
