
A
Now you actually have the ability at your fingertips, in some ways just through English commands, to go and parse through and extract and answer questions that you otherwise would never be able to answer. And that's what's so exciting about this time right now. These technologies are growing extremely fast, and we can really solve some really hard problems that were just previously unattainable.
B
Welcome to Embracing Digital Transformation, where we explore how people, process, policy, and technology drive effective change. This is Dr. Darren, chief enterprise architect, educator, author, and most importantly, your host. On this episode: AI, ETL, and the Unstructured Data Problem: Why Accuracy Still Matters, with the founder and CEO of Aryn, Mehul Shah. Mehul, welcome to the show.
A
Thank you for having me, Darren. It's good to be here.
B
Hey, before we jump into our topic today, which is a great topic because I've been doing the same sort of work that we're going to talk about today, so we'll have a lot to share and too bad if the audience doesn't care what we're talking about because we're going to have fun talking about it. But before we get started, everyone knows on my show that I only have superheroes on the show and every superhero has a background story. So what's your background story?
A
My joke is that I've been doing data since before it was big, and doing AI since before it came onto the scene. My history is actually in research. I spent a whole bunch of time back in the late '90s and early 2000s working on databases and large-scale data systems. I was a research scientist at HP Labs, and we worked on a variety of highly scalable, energy-efficient storage and database systems and really had a lot of fun doing it. I learned a lot about trends in technology, how people use it, and where things are going. The challenge I had with just sitting there and focusing on the technology was that we were unable to get these tools into the hands of users every day and delight them effectively. We certainly created great technology, but delighting them was hard. So a couple of friends of mine and I started a company about 13 or 14 years ago now. It was called Amiato, and it was in the early days of the cloud, 2011, 2012. What we saw was that there was this huge need for processing log data. At the time, it was really hard. So we built an ETL service that would take log data and put it into databases for you so that you could run SQL over it and do your analysis. That's what Amiato did. It was acquired by AWS, and a lot of the tools, the techniques, and the ideas from Amiato turned into AWS Glue, which is AWS's flagship ETL service. So that's another joke that I often tell, which is that I've been doing ETL since before anybody could spell it. ETL stands for extract, transform, and load, right? So my co-founders and I built that service and grew it, and then I had the opportunity to really understand a lot of what was going on in the unstructured data space by being on a platform like AWS.
From there, I just got lost in all of the problems, and in the solutions that you could build for your customers. I had an opportunity to run the Amazon Elasticsearch Service and eventually forked Elasticsearch into OpenSearch so it could stay open, and I learned a lot more about how people were doing document processing. That's what led me to the company I'm running right now, Aryn, where we're really helping enterprises unlock the gold mine of information in the mountains of unstructured data that they have.
B
So all those years of data hoarding that I've been doing, right? I can actually do something with all that data. Because I've noticed that when I go in and talk to organizations, they have petabytes of stuff that's never touched, right? Because they don't know how to engage with it. I think I even have like 5 terabytes of PowerPoint presentations and Word docs and all this unstructured data that I'm afraid to throw away because I'm like, well, I might need it someday, right? But you're saying today is the day, right?
A
I think that the time is now. For a long time, this was, I would say, a market where there was a lot of non-consumption. What I mean by that is people really wanted to, like you just described, search through their old stores of documents, PowerPoint presentations, contracts, pricing quotes, invoices, whatever you have, but they were just unable to do that. They would build very specific solutions for really only super-high-value problems. Like, if this was going to be a case that was worth a hundred million dollars, let's go build a bunch of technology so that we can go through all this stuff. So what I meant by non-consumption is that you had all this data and people wanted to get to it, but the barrier to getting to it was just so expensive. Finding the technology, finding the experts that can actually get to the bottom of it very quickly, was extremely hard. But now, with the growth of frontier models, LLMs that can not only pull out text but also deal with all kinds of multimodal information (graphs, images, video, sound), you actually have the ability at your fingertips, in some ways just through English commands, to go and parse through and extract and answer questions that you otherwise would never be able to answer. And that's what's so exciting about this time right now. These technologies are growing extremely fast, and we can really solve some really hard problems that were just previously unattainable.
B
Well, there's still the problem. Even though it can read a document and do things like that, I still have the problem of where my data is. Right? And do I want it searching through a petabyte? It'll never come back, you know. I mean, how do I put all of that into one large language model? I can't do it. It's too big, it's too cumbersome. If I have a specific file, I can say, "Tell me what this file has in it." I can do that sort of thing. But if I've got a million files, what am I going to do?
A
Exactly. And if you're a large enterprise, you have this stuff in spades. Some of these enterprises have petabytes and petabytes of information that they've stacked up. In some cases, exabytes. Just to give you a couple of examples, take a look at what our government agencies can put together. GSA is a great example, or NOAA, the Fisheries, and so on. They just have so much satellite data, documents from applications. They have so much information just sitting there. It's unbelievable. And right now, the state of affairs is pretty poor. The data's sitting here, and the people that want to get access and get answers from that data are still literally going through this information manually. Take some of the customers that we work with, for example. I went to see some of them just last week. They're insurance companies. And insurance started a couple hundred years ago in this country, largely around farming. So a lot of the insurance companies and their ecosystem are in the Midwest, or in the farming areas toward the East Coast: Pennsylvania, Ohio, Iowa, these kinds of states. I went and saw one of our customers, and they're really happy using our software. But before that, what they were doing was literally cutting and pasting from documents, from a PDF into an Excel spreadsheet. I'm like, "It's 2025. You guys are still doing this?" They're like, "Yeah." "Have you guys tried LLMs? Have you tried ChatGPT?" "Doesn't work."
And so this is actually something that you hit upon, which is that these AI systems are really great as long as the data can fit within their context.
B
Yeah, yeah. Context has gotten a lot bigger, right? It has, it's gotten a lot bigger. But LLM should stand for Lazy language models, because they, they give up.
A
Yeah, so.
B
So they won't scan a whole document. They'll find like three or four things and go, okay, I got an answer. Send it back.
A
Yeah, it's interesting. Certainly the contexts have gotten bigger. I remember when I first started in this space, we were messing around with 4K or 8K tokens, which at the time seemed enormous, by the way, because you just couldn't do anything else. We've certainly got million-token contexts now, about two or three years later. So they're growing by an order of magnitude every year, which is actually great. There are 10-million-token contexts in some of the really large frontier models and the pro models that are out there. So that's awesome. But the problem is that even though the context is available and somebody will let you stuff a bunch of information in there, you actually see a quality wall. Basically, after about 100K or maybe 200K tokens, the quality of the answer just starts dropping off really quickly. Try it out yourself. Throw in, I don't know, a life insurance policy or something, or maybe a mortgage statement, or a loan package that you sent. You can send a single contract with two or three pages, and it's right on the ball. Send it to ChatGPT or Gemini or Claude, and they do a pretty good job. Throw the whole mortgage application in, throw in the whole thing, and you'll literally see it beg for mercy. It'll say things like, "Well, we got a few hundred of the entries that you wanted. There's a lot more." And I'm like, "Okay, well, where are the rest?"
B
Somewhere?
A
Yeah, this is kind of frightening. So personally, okay, if you're dealing with your own stuff, you can read through it. But imagine you're processing thousands and thousands of mortgage applications, if you're an escrow company or a lender. It's going to be tough. Or let's say you're a government agency, and in the government everything is driven by documents. I mean, Darren, you know this. If everything is really just pushing paper, the best way to automate this stuff is to start using these LLMs. But right now, the token context that they have is actually quite a limitation.
B
Well, and like you said, even though the context windows could be large, they just, the quality goes down, they give up after a certain amount of time, even though you can jam them full of a bunch of stuff. All right, so if that's a huge problem, how in the world am I solving that problem?
A
Well, you know, I come from the database world, so, you know, my background is in building large, reliable database systems. And the trick in database systems, in any kind of database system that you build, is to divide and conquer. You know, there are really only two or three, you know, great techniques in computer science. And divide and conquer is one of them, caching is another, and so on. And divide and conquer is the best way to kind of handle all of this stuff. So, you know, you probably heard this idea of context engineering. So there's two different things that you can do to get LLMs to be better. Okay. One is you can give them better instructions. This is prompt engineering.
B
Okay.
A
And then the other thing is that the information you put in their context can be carefully crafted, at the right time and with the appropriate stuff, so that you can get them to be much more accurate. So try to make sure that the stuff that's in there fits within the accuracy sweet spot, if you will, for the LLM. And then what you do after you divide and conquer is bring together all of the answers that you get back from interacting with the LLMs, merge them, and then give the result back to the user. And this idea of dividing and conquering, how you should divide and how you should conquer, is going to depend on the task that you have. So if you have a very large mortgage application, you can actually use an LLM to help you decide how to divide and conquer and say, "Hey, in this mortgage application, look for where they're listing all your assets. Go to those pages. Now take those pages, and I want you to answer the questions over these pages." We call this planning, and you can use LLMs to do that. So a combination of planning, divide and conquer, and carefully crafting what's in the context is a way to actually scale these LLMs. And that's what we're doing at Aryn. Fundamentally, that's what our technology is.
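The plan, divide, conquer, and merge flow described above can be sketched roughly as follows. This is a minimal illustration, not Aryn's implementation: `split_into_chunks`, `answer_over_document`, and `fake_llm` are hypothetical names, and the "planner" and "model" are plain Python stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, List


def split_into_chunks(pages: List[str], max_chars: int) -> List[List[str]]:
    """Divide: group consecutive pages so each group stays under a size budget."""
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for page in pages:
        # Start a new chunk when adding this page would exceed the budget.
        if current and size + len(page) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks


def answer_over_document(
    pages: List[str],
    question: str,
    is_relevant: Callable[[str], bool],
    ask_llm: Callable[[str, str], List[str]],
    max_chars: int = 8000,
) -> List[str]:
    """Plan which pages matter, query each small chunk, then merge the answers."""
    # Plan: keep only the pages the planner flags (an LLM could do this step).
    relevant = [page for page in pages if is_relevant(page)]
    merged: List[str] = []
    # Conquer: each call sees only a small, focused context.
    for chunk in split_into_chunks(relevant, max_chars):
        for item in ask_llm(question, "\n".join(chunk)):
            # Merge: keep first-seen order and drop duplicates.
            if item not in merged:
                merged.append(item)
    return merged


def fake_llm(question: str, context: str) -> List[str]:
    """Hypothetical stand-in for a model call: returns lines listing assets."""
    return [line for line in context.splitlines() if line.startswith("Assets:")]
```

The size budget here is measured in characters only to keep the sketch dependency-free; a real system would more likely budget in tokens and keep each chunk inside the range where the model stays accurate.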
B
You know, it sounds very similar to something I did with the U.S. Census. I learned something about LLMs with the U.S. Census when we were crunching numbers. I learned that large language models are not good at numbers, because they weren't trained on numbers; they were trained on language. So say I have a billion rows, and I tell it, "Find me the average and give me a summation of columns F and G." It does the first 10 and gives up.
A
That's exactly right.
B
I did the divide and conquer concept that you have. I called it aggregated queries, and I got a lot better results when I said, "Hey, I'm going to take this big huge thing, I'm going to chunk it into smaller bits, and I'm going to aggregate the results that I want from each one." So I could build up, and then the last query would do the final set. So I know the technique you're talking about. I've used it. It's wonderful. It works really, really well.
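The "aggregated queries" idea can be sketched as partial aggregates that a final step combines. This is an illustrative sketch only: the function names are hypothetical, and plain Python arithmetic stands in for whatever actually ran each sub-query in the Census work.

```python
from typing import Iterable, Iterator, List, Tuple


def chunked(rows: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield the rows in fixed-size batches (the last batch may be smaller)."""
    batch: List[float] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def partial_aggregate(batch: List[float]) -> Tuple[float, int]:
    """Per-chunk query: return the pieces needed to rebuild sum and mean."""
    return sum(batch), len(batch)


def aggregate_in_chunks(rows: Iterable[float], size: int) -> Tuple[float, float]:
    """Final query: combine per-chunk (sum, count) pairs into total and mean."""
    total, count = 0.0, 0
    for batch in chunked(rows, size):
        batch_sum, batch_count = partial_aggregate(batch)
        total += batch_sum
        count += batch_count
    mean = total / count if count else 0.0
    return total, mean
```

Each chunk reports a sum and a count rather than its own average, because averaging the per-chunk averages gives the wrong answer whenever the chunks differ in size.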
A
It works extremely well. Another technique that also works extremely well, if you've tried this, is to tell the LLM that you have a tool and this is how you use the tool. Just as a human would say, "Okay, well, I'm not going to spend 30 hours adding up all these numbers. What I'm going to do is use my database, or my calculator, or my spreadsheet."
B
Right.
A
And the spreadsheet has a particular interface. The database has a very specific interface. With a calculator, we all know what the interface is: the buttons on the calculator. You can also tell an LLM, "Here are the tools that are available to you in the process of answering this question. Here's the data. Go ahead and plan how to use them." So LLM tool use is a very common thing. You can actually see this with ChatGPT when it goes out and searches the web. It works extremely well, especially when you need really precise answers over very specific, easy-to-specify problems that a database or a spreadsheet or some other tool can handle for you.
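A rough sketch of the tool-use pattern: the application describes its tools to the model, the model replies with a structured request, and the application runs the tool and hands back an exact result. Everything here is hypothetical (the `TOOLS` registry, the call format, the function names); real model APIs each define their own tool-calling schema.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool registry: each tool has a description the model can read
# and a function the application runs on the model's behalf.
TOOLS: Dict[str, Dict[str, Any]] = {
    "sum": {
        "description": "Add up a list of numbers. Args: values (list of numbers).",
        "fn": lambda values: sum(values),
    },
    "average": {
        "description": "Average a list of numbers. Args: values (list of numbers).",
        "fn": lambda values: sum(values) / len(values),
    },
}


def describe_tools(tools: Dict[str, Dict[str, Any]]) -> str:
    """Build the text that tells the model which tools exist and how to call them."""
    lines: List[str] = ["You may call one of these tools:"]
    for name in sorted(tools):
        lines.append(f"- {name}: {tools[name]['description']}")
    return "\n".join(lines)


def run_tool_call(call: Dict[str, Any], tools: Dict[str, Dict[str, Any]]) -> Any:
    """Execute a tool request of the form {'tool': name, 'args': {...}}."""
    name = call["tool"]
    if name not in tools:
        raise ValueError(f"unknown tool: {name}")
    fn: Callable[..., Any] = tools[name]["fn"]
    return fn(**call["args"])
```

In a full loop, the text from `describe_tools(TOOLS)` would go into the prompt, the model's reply would be parsed into a call dictionary, and the value returned by `run_tool_call` would be fed back to the model so it can phrase the final answer without doing the arithmetic itself.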
B
So the things you're. No, go ahead.
A
No, no, keep going.
B
I was going to say, the things you're mentioning take large language models from individual use into the enterprise, where I'm doing a lot more than, you know, "write a Japanese haiku of my mortgage application." Right? Which I've had people do in workshops before.
A
Right.
B
Because a lot of people are using ChatGPT for individual work that I'm doing. But the type of things you're talking about is really turning it into an enterprise tool. Right?
A
That's right.
B
Where there's more constraints, there's more data, the scale's different. So the approach is different, right?
A
Absolutely. And I think all those things, more data, the scale, the approach, are elements of it. But the thing that's driving all these architectural decisions in the enterprise is actually one important criterion. We work with underwriters, with analysts, with estimators. So, underwriters at insurance companies. I don't know if I had a chance to introduce the company that I'm running right now. It's called Aryn AI. What we do is take people's unstructured data, unstructured documents, and, using agents, automatically extract structured data fields from those documents. We can do that at high scale, so millions and millions of documents, and we've been doing that every day. And we can do it with extremely high accuracy. So think 97, 98% accuracy.
B
That's awesome.
A
Right? And current LLMs don't get there. They're very general machines. They can do all kinds of things, from writing poetry to recommending restaurants to extracting key information from documents. We just care about extracting key information from documents. But once you focus on that, you can get it to scale extremely well. So to your point, the question is, why are all these architectural decisions being made in the enterprise? It's actually driven by one important criterion that all of these knowledge workers have. They have to do work day after day, and that work has to be done accurately. If it's not accurate, they're going to take more time to get it to be accurate, which is actually a waste of time for them. So you actually haven't helped them. If you get mid-80s or high-80s accuracy, that means something like 15% of the time they're going back and doing their old thing. And at that point, they might as well just ignore what you've given them. It's only when you can really hit the 97, 98% accuracy level that you're actually speeding them up. And the other thing that accuracy helps you do (and in some cases the accuracy is now starting to get better than what humans can do themselves) is that it actually allows them to build a better business and make more money. As an example, take underwriters at insurance companies. Their job is to maintain relationships with brokers and carriers and so on. Their hard work right now is just shuffling documents. It's undifferentiated heavy lifting. What used to take days for our customers, turning around quotes, we can let them turn around in, say, 10 or 15 minutes. Now they get more at-bats, and they can actually sell more policies. More importantly, they can sell better-quality policies.
They can sell policies that are appropriately priced for the risk that they're taking and better suited to the customer that's coming in. So you actually end up building a better business if you can get higher accuracy. And that's the thing that's actually driving all of these architectural decisions that we're talking about. Because without the accuracy, sure, you can just use ChatGPT. You just throw it in, you get something back, and maybe you keep going forward. But it's actually not helping them if you're just doing it that way.
B
You know, it's onesies and twosies.
A
Yeah, it's a good demo, but high accuracy is critical. Like, imagine you're, you know, imagine you're a securities regulator and you're going through a ton of, you know, legal briefs and filings to find precedent.
B
Yeah.
A
Okay, what do you do today? Well, you go through a search engine, you get your top thousand results, and then you like literally just, you know, go through it by hand. We can help them basically find things that they wouldn't have found before in minutes.
B
So how do you handle the... there are so many questions in my head right now. First off, are you using your own LLM? Have you trained a new model to do this? Or are you using an off-the-shelf one and doing more context engineering? What's the approach that you took on this?
A
Yeah, so. So there's a couple of things. It's actually a conglomeration of techniques. It's not just one thing where we can just say, this is what we do, and we just do it all the time. There's a bunch of engineering that's involved and different parts to our platform, if you will. We did build our own models just to get the text or just to get the raw data off of documents. You'd be surprised. You'd think that documents are standardized, but they're not. Oh, they're not.
B
Yeah.
A
And so Aryn can process some 60 different languages, probably more than that now, and over 35 or 36 different document types. And we keep adding them. We can pull the text off the page and the images off the page pretty accurately and pretty fast. There we actually built our own vision models to do it. We found public data, we took permissioned data from customers, and we just did the hard work of labeling the examples and training models to do it. But that's not enough. That basically takes the words off the page, so to speak, and gives you...
B
It gives me text.
A
It gives you text, raw information, and so on. But the next step you really have to do is get the LLM to pull out the key properties, the key things that you care about. And there we really leverage the trend in these large frontier language models.
B
They're pretty good at that sort of thing.
A
They're good at that, right? So when ChatGPT gets better, we want to be able to get better with it. When GPT 5.2 came out, we just suddenly got better. When Claude came out with their next model, I think it was Opus, we just suddenly got better. And so we do two different things. One, and this is something that no single vendor is going to be able to do, is that we actually use different frontier model providers. We feed the information to multiple frontier model providers and then use a sort of quorum technique to go and say, "Hey, across all of these, are they all agreeing?"
B
Is there some kind of consensus?
A
Yeah, is there consensus or not? And you'd be surprised how accurate the stuff is when there's consensus. When there isn't consensus, you just haven't given it the right instructions. So what we found is that by pushing the data through multiple models and seeing where there's consensus, you can see where there's certainty and where there's uncertainty, and that saves humans a lot of time in going through the results. And when there's uncertainty, the knowledge workers can go in and say, "No, no, no, this is what I want you to do." We have a cool feedback optimization technique that takes that information from the actual knowledge workers, feeds it back into our system, and then appropriately adjusts the models, the architecture, the prompts, the context engineering. So the next time they see the same thing, they don't have to correct it again. They just go click, click.
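The quorum idea can be sketched as a per-field vote across model outputs, with disagreements routed to a human. This is an illustration under stated assumptions, not Aryn's method: `consensus` and `triage` are hypothetical names, agreement is judged on whitespace- and case-normalized strings, and the threshold `min_agree` is a made-up parameter.

```python
from collections import Counter
from typing import Dict, List, Tuple


def consensus(answers: List[str], min_agree: int) -> Tuple[str, bool]:
    """Return the most common normalized answer and whether enough models agree."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize so trivial differences (case, stray spaces) do not block agreement.
    normalized = [answer.strip().lower() for answer in answers]
    value, votes = Counter(normalized).most_common(1)[0]
    return value, votes >= min_agree


def triage(
    per_model: List[Dict[str, str]], min_agree: int
) -> Tuple[Dict[str, str], List[str]]:
    """Split extracted fields into accepted values and fields needing human review."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    fields = sorted({field for output in per_model for field in output})
    for field in fields:
        answers = [output[field] for output in per_model if field in output]
        value, agreed = consensus(answers, min_agree)
        if agreed:
            accepted[field] = value  # note: stored in normalized (lowercase) form
        else:
            needs_review.append(field)
    return accepted, needs_review
```

Only the fields in `needs_review` would be shown to the knowledge worker, which is where the time saving described above comes from; their corrections are the signal a feedback loop could then learn from.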
B
Yeah.
A
And so this is a technique called reinforcement learning. It's a general technique. We have a specific version of it that we call CORAL: correction-optimized reinforcement learning. It's a technique for optimizing your prompts where you don't need a lot of labeled data. We're actually bootstrapping with these frontier models. And it doesn't take that many iterations, maybe a dozen iterations and maybe 10 or 15 documents. And now all of a sudden you're getting 97, 98% accuracy in your extractions.
B
I've seen the exact same thing. It's very similar to what I did with the U.S. Census. We sat the SME with the AI, in a front end to it. So when they were analyzing documents, the SME could interact with it and say, "Yes, this is right." And we've gotten our accuracy way high. Like 99% now.
A
Exactly. Yeah.
B
Because you're using, you're not using the AI to replace the human, you're using it to augment the human. And yeah, so that same technique's incredible.
A
That's exactly right. And I think that's where I know everybody's kind of scared by the boogeyman, like AI is going to take your job away and so on.
B
Well, because CEOs are laying people off and saying AI has replaced them. You can blame all the big AI vendors for saying, "Yeah, we need universal income because I'm replacing everyone's jobs."
A
Exactly.
B
They're scared.
A
Yeah, I think there's a lot of that. Look, there's a certain amount of low-paid, low-income work that honestly doesn't need to be done anymore, and that's where AI is going to help you. And the people that are doing it are not happy doing that work anyway. It's tedious, it's manual, and so on. That's going to go away, and I think that's good for everybody. But people coming in and saying, "Hey, we're going to replace your lawyers and your doctors," that's not going to happen.
B
It's not going to happen.
A
You can't replace human judgment. And right now, the way to think about LLMs and frontier models is that they've gotten really good. Take the greatest mathematicians. The greatest mathematician that I know of right now, and there are many of them, by the way, but the one that's in the press a lot, is Terence Tao. And Terence Tao is quite a humble person for what he's actually been able to accomplish. He and a few others in the theoretical computer science community are starting to use LLMs to actually prove minor lemmas in their proofs. This is huge. But they're not going off and saying, "Hey, I have this idea for a proof. Go off and come up with the proof." That is not what these LLMs are doing at all. What they're doing is very carefully saying, "Hey, I need a lemma that does X, Y, and Z." Think of it as a subroutine in a program that I need you to go build. "I don't know if there's a recent result that I can leverage, or a few things that I can put together from across the corpus of mathematics that's out there. It'll take me probably a few weeks to go figure it out. But maybe you can come up with this small lemma." So the intuition for what problems to go solve and how to solve them is still coming from humans. But a lot of the work, I wouldn't say menial, but the work that the humans have to do, is starting to get automated so that they can actually be more productive. Are mathematicians afraid that they're going to be obviated? Absolutely not. The greatest mathematicians are still going to be the greatest mathematicians.
B
Well, in fact, they'll be able to work on problems that people have said are unsolvable, and they'll be able...
A
To work on problems at a larger scale. Mathematicians today can collaborate with maybe two or three other mathematicians, maybe at most five, on a problem, because you've got to trust their problem-solving capabilities, there's got to be a sort of shared culture, you've got to coordinate, and so on. But now, with the ability to use LLMs and proving techniques with things like Lean, you can actually try to solve really, really challenging mathematical puzzles with a team of a hundred people. It's almost like a software project as opposed to a mathematical proof endeavor. And the reason I make this analogy is: are lawyers going to go away? Absolutely not. They've still got to work with humans, convince juries, convince judges. LLMs are not going to replace lawyers. That's not going to happen. Are doctors going to go away? Are you kidding me? Do you want an LLM doing your surgery? Absolutely not.
B
You want it hallucinating during the.
A
No. And actually there's a nice, you know, there's a nice observation by Terence Tao that I often repeat. I don't know if it's an exact quote, but LLMs are just getting to be better and better guessing machines. So they can guess a situation. That's a good way to put it, right?
B
Yeah.
A
But to be able to verify it, you need to have some independent way of checking their results. And they're not going to be 100% right. In the sciences, we've found independent ways of checking their results. You actually do the experiment. So if somebody says, with this protein folding work that won the Nobel Prize, "Here's the protein structure," well, you do some X-ray crystallography to go and look at the structure, and you're like, "Yeah, that is the structure, so you must be right." So there are independent ways of verifying these things. Take making predictions about weather. You look at past weather patterns, you see how well it does, and you say, "Hey, this thing's doing well." And even in coding, the thing that has taken off like wildfire, the model generates all this code, but we have mechanisms to verify that the code is working: deploy it, and if it breaks, pull it back.
B
If it's good code, if there's good design. Yeah, that's, I've, I've actually been teaching those techniques in my, my class that I teach at Vanderbilt.
A
And so.
B
Yeah, yeah, so.
A
And in the enterprise, when we're talking about knowledge workers, they're already doing this work. They have their way of checking when things are correct. And what this is going to do is really just assist them. That's really what's going on here. I think LLMs are just going to make us much more productive. I think they're going to make us happier, because we're going to be doing less grunt work, so to speak, less mundane work. But I don't think they're going to replace humans. I think that's a long, long, long way away.
B
I think so too. So, all right, if people want to find out more about your company and what you guys do and want to engage, how do they reach out to you?
A
Absolutely. So our company is called Aryn AI. You can just go to the website, aryn.ai, and it tells you about the products that we have. You can also just book a demo, and that'll get me on a call with you, and I'll show you what our system, we call it the Agentic Property Extraction System, does. And I'll tell you about all of the cool technologies that we've built. So feel free to just go to the website, or you can send us an email at info@aryn.ai.
B
That's awesome. Hey, Mehul, thank you for coming on the show today. This has been great. We could go for another hour easily, but people would be bored. We wouldn't. We'd have a lot of fun. But thanks again for coming on the show.
A
Darren, thanks for having me, and I'm looking forward to seeing the podcast. Talk to you soon.
B
Thanks for listening to Embracing Digital Transformation. If you enjoyed today's conversation, give us five stars on your favorite podcasting app or on YouTube. It really helps others discover the show. If you want to go deeper, join our exclusive community at patreon.com/embracingdigital, where we share bonus content and you can connect with other change makers like yourself. You can always find more resources at embracingdigital.org. Until next time, keep embracing the digital transformation.
Podcast: Embracing Digital Transformation
Episode: #318 – AI, ETL, and Accuracy in Unstructured Data
Host: Dr. Darren Pulsipher
Guest: Mehul Shah, Founder & CEO of Aryn AI
Date: January 21, 2026
This episode explores the challenges and opportunities at the intersection of artificial intelligence (AI), extract-transform-load (ETL) processes, and the need for accuracy when working with massive volumes of unstructured data—especially within the public sector and large enterprises. Dr. Darren Pulsipher and guest Mehul Shah unpack why technology alone isn’t enough, how new AI architectures are unlocking value in dusty data lakes, the importance of accuracy over mere automation, and how to move beyond individual “haiku” use cases into true enterprise AI. The discussion is rich with real-world anecdotes, practical methods, and sharp industry commentary.
This episode delivers a deep dive into the technical, organizational, and social realities of automating unstructured data processing at enterprise scale. Mehul Shah shares hard-won insights from pioneering cloud ETL, introduces practical methods for large-scale and high-accuracy AI extraction, and offers a grounded, optimistic vision of a future where humans and AI collaborate for greater value—not as rivals, but as partners.
Learn more about Aryn AI: arynai.com
Contact: info@arynai.com