
A
Hello everyone, and welcome to the Stack Overflow podcast, a place to talk all things software and technology. I'm your humble host, Ryan Donovan, and today we have a podcast sponsored by the fine folks at MongoDB, talking about the race to prove out agentic value. My guest today is MongoDB Field CTO Pete Johnson. Welcome to the show, Pete.
B
Hi, Ryan. Thanks so much for having me.
A
Of course, of course. So, before we get into talking about this OpenAI paper, tell us a little bit about yourself. How did you get into software and technology?
B
I wrote my first line of code as a sixth grader in 1981.
A
Wow.
B
And I'm one of those lucky people who was able to turn a childhood hobby into a now, what is it, 31-plus-year career after college. I know that's a common story for a lot of people, but I asked for an Intellivision for Christmas of 1981 and, if you know, you know. Yep. I instead received a TRS-80 Color Computer, the 4K version, not the 16K, that came with a variant of the Microsoft BASIC interpreter called Color BASIC at the time. And I used it to write a little program that tracked rebounding and scoring stats for my sixth grade basketball team.
A
Nice. I think I also got the old switcheroo, an Intellivision for a Commodore 64.
B
Well, with the C64, you had a real disk drive. I had cassette tapes as storage on the Trash-80 Color, or the CoCo, as people called it back then.
A
So obviously it's been a long journey from then. You've turned a hobby into a career.
B
Yeah, I did 20 years at HP. I did 17 of that in HP IT, where I wrote my first web application, which went into production in January of '96. That was about 13 months after the first W3C meeting. I became hp.com chief architect at the end of that HP IT tenure. And then I was one of the founding members of HP Cloud Services, which was HP's attempt to compete directly with AWS on top of OpenStack. And while that didn't work out for the company, it sure worked out for me personally. I moved out of engineering and into sales and marketing and went on to a couple of different startups. One was acquired by Cisco. Prior to MongoDB, I did a little bit of a stint as a field CTO at the services arm of CDW. And then I've been here since June.
A
All right, well, a lot has changed since the old TRS-80 days. Today everybody's talking about AI and agents. And as people try to get this to have real-world impact (I think I saw the stat that 95% of projects fail), people are asking, what's the ROI of this? OpenAI had an interesting paper on the GDP impact of agents and agentic tasks, and how they could evaluate that impact. Can you tell us a little bit more about this paper?
B
Yeah, sure. So that paper, the GDPval paper. There was a blog article, there was a white paper, and then there was a data set. And I'm the kind of guy who'll read everything to see where the goodness or the hidden stuff might be, because there's always some hiding that goes on in white papers. If you just look at the blog article, what that'll tell you is they looked at 44 occupations across different vertical sectors of the economy. They then went and hired experts with at least 14 years' experience in each one of those occupations, and they had those people define 30 common tasks for each of those occupations. They then took a subset of those, five per occupation, and ran it through a version of a Turing test: they used a one-shot prompt to feed the task to an LLM, and they found a person with a decent amount of experience in that occupation and asked them to complete the same task. Then they had an independent third party, a human being, evaluate which one was better, and they established a win rate between the human being and different LLMs. And to their credit, OpenAI didn't just test OpenAI LLMs, they tested some of their competitors as well. That was the basic structure of the testing that they ran. That was the result of that white paper. Right.
A
And like you said, you went through all three stages of this paper, down to the data set. What was the interesting takeaway? What is the stuff that is sort of hidden there?
B
Well, if you just look at the blog article, the glory graphic that was part of it showed what the scores were for each of the individual LLMs. I've got some notes here, I'll read them off real quick. Of the models they were testing at the time, Claude Opus 4.1 did the best: it got a score of 47.6, which meant that, according to the human evaluator, it either won or tied on that share of the tasks it was graded against. GPT-4o was the lowest scoring of the seven that they tested, and that was 12.4. So the way that they did the blog article is they showed GPT-4o at 12.4, Grok 4 at 24.3, Gemini 2.5 Pro at 25.5, o4-mini high at 27.9, o3 high at 34.1, and GPT-5 high at 38.8, before Claude Opus 4.1. And that was, like I said, kind of the glory diagram from the blog article. But if you look at the white paper, there was what I thought was an even more interesting diagram. I'll tell you, it was on page seven. It's figure seven. In addition to the main testing, they also did some analysis of what happened when the AI and the people worked together, and that's when they saw really big gains. They showed a cost and speed improvement, and they did this just with GPT-5 high, of one and a half times on both speed and cost. The glory statistic was about how close we are to AGI. But when I read through the paper, it turned me into an AGI skeptic. It made me think that we're entering an era where everybody's going to be AI-enhanced and see cost and speed improvements similar to what they found in that figure 7.
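As a rough illustration of the scoring described above, the sketch below computes a win-or-tie rate from a list of grader verdicts. The verdict labels ("model", "human", "tie"), the function name, and the sample data are all made up for illustration; none of it is taken from the GDPval data set or from OpenAI's grading code.

```python
from collections import Counter


def win_or_tie_rate(verdicts):
    """Return the percentage of tasks where the grader preferred the model
    output or judged it equal to the human expert's output."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return 100.0 * (counts["model"] + counts["tie"]) / len(verdicts)


# Hypothetical verdicts for ten graded tasks (illustrative only).
sample_verdicts = [
    "model", "human", "tie", "human", "model",
    "human", "human", "tie", "human", "human",
]

print(f"win-or-tie rate: {win_or_tie_rate(sample_verdicts):.1f}%")
```

Read this way, a score such as 47.6 would mean the model won or tied on a bit under half of the graded comparisons, which is why it says little on its own about replacing the human expert.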
A
Yeah, this is something I've been hearing too, that the AI with an expert is just tons better. And having that human in the loop makes the AI itself better too.
B
Indeed. You cited the MIT study that showed 95% failure rates among AI projects, and I think there are a couple of reasons why that is. Number one, there's no SKU for AI.
A
Right.
B
I think a lot of executives think, "I'm going to make this one product purchase and my AI strategy will be done," when really it's a lot more nuanced than that. So that's thing number one. And then thing number two is, if you think you're going to get AGI and replace people, that's flawed logic, as this GDPval detail shows. If instead you think about how you can improve the productivity of the people that you have, and then what you do with those productivity gains, that's where you really start to see some traction in this market.
A
Yeah. So what is this paper hiding as you look through the data set?
B
Well, if you go to the data set, it shows you the prompts that they used for the tests. Across the 44 occupations, they started with 30 tasks each, so a total of 1,320 tasks. Then they shaved that down and tested five per occupation. And it turns out I've had one of the jobs they tested. Solutions architect, or sales engineer as it's commonly known, was one of the occupations, and the task was: here's a diagram of an on-prem three-tier web application; what would it take to migrate it to Google? And it gave the actual instructions, which served as the prompt that you would feed to your LLM of choice. So I did. I used Claude Desktop. I fed it the diagram, I fed it the prompt, and it gave me back this really nice, essentially, paper for what a migration plan to GCP would look like, because that's what the task asked for, right? But then what? Somebody has to present that to a customer. Someone has to try to gain the trust: why is it you should use me? If you get to a situation where every sales engineer representing every consultancy can generate the exact same document, what would your selection criteria be?
A
Right?
B
So there's some humanity that's still part of these tasks that you still need. And like I said before, when you look at those high failure rates that the MIT study showed, I think a lot of it has to do, first, with that no-SKU-for-AI thing. But also, if you think about it in terms of replacing people, that's the wrong way to go. It's about how you can enhance them. And ultimately what that means is, how can you inject some of your proprietary content into one of these LLMs without having to go through an expensive training cycle? That's ultimately what that boils down to.
A
Just now it makes me think of this sort of like initial push against open source and what people realize when you open source everything, it's not the software that is the special sauce, right? It's the business, it's the people, it's everything around, it's the people.
B
That's exactly what it is. It's the people. And as you and I were chatting about before we started recording, we were both at AWS re:Invent last week, and that was very much the thesis of what I think is now Werner Vogels' last keynote. And I found it very inspiring that he basically gave us a roadmap for how to be really good software engineers in this AI-enabled era.
A
That is a great lead-in to the rest of the conversation. How do we get actual value, actual ROI? How do we be really good software engineers, or whatever other AI-enhanced job we have?
B
Yeah, I think so. If I take that in two parts, how do we get good value out of AI is part number one. And then part number two is some of the stuff that Werner talked about during his keynote about how we can be good software engineers. So if I can take the first part first: how do we get value out of these LLMs? Like I said a minute ago, how do you inject your proprietary data into an LLM of your choosing so that you can get it to customize and solve for whatever business problem you're trying to solve? When I talk to C-suite folks about that no-SKU-for-AI thing, what I tell them is: take your problem first. What are your top 10 or 15 business problems? Which five do you have data for? And then which two or three might you have metrics for, so that you can determine how things got better? If you just spend money on a SKU and you don't know what the before or after is, how do you know how to calculate your ROI? So you need good data and good metrics in order to get there. Typically the way that we see people implement that, and the reason why I joined MongoDB in the first place, ultimately boils down to having a good vector search and good embeddings. We can talk about that a little bit more. But that's how you get value: if you have good embeddings and good vector search, and you're applying that to a problem that you have good data for and good metrics for, that's the recipe for getting value out of AI.
A
Yeah, I think that was something I thought of, reading and writing about how software is going to survive in the age of AI. It's the data in the end. And for that data, like you said, it's the vector search, the embeddings. So what's the approach to getting the best vector search and embeddings?
B
Well, like I said, when I joined six months ago, the non-technical reason was that I had the chance to go work for a friend, and my career over 31 years tells me that when you've got a friend as your boss, that always tends to work out well. But the technical reason was that in February, MongoDB made this acquisition of Voyage AI. When you first look at that: why would MongoDB acquire a company that does embeddings? Ultimately it's so that you can have a better-together story and make it easier for developers to create a good vector search, and to do it in a way that gets you better retrieval scores. In particular, there are two features, one pre-acquisition and one post-acquisition. When you as a developer have to go and make an embedding and a vector search, there are typically five decisions that you have to make once you've selected your embedding model. You have to decide on a similarity score. You have to decide on your chunk size: how big are the chunks I'm going to put through it? How many dimensions do I want my array to be? What level of quantization: am I going to store 32-bit floating points, or am I going to give up some retrieval quality but gain some storage if I use 8-bit ints or go down to binary? And then the fifth is whether or not to use a reranking model. There are two in particular that I'll talk about that Voyage does a really good job of. In January '25, Voyage introduced a feature called Matryoshka reasoning. So consider that you embed your corpus of data, and you decided to try 1,024 dimensions, and that gets you a certain size and a certain quality. What if now I want to try 512? With a traditional embedding model, I would have to re-embed my entire corpus of data with 512 as the number of dimensions. But with Matryoshka reasoning, what you're able to do is take the embeddings you already have, and it turns out they're ordered. So you just lop off the last 512.
A
Interesting.
B
And that makes it so that as a developer, you can iterate through your cycles of determining storage size versus retrieval quality: what am I trying to get for my specific application? It decreases the amount of time it takes you to go through that cycle. So that's an important way of trying to make it easier on the developer to make that decision. That was the first one. The second one that really grabbed me, which we released back in July, was something called contextualized chunks. With a traditional embedding model, let's say you wanted to embed at the size of a sentence. Well, a sentence in one document could appear in a second document and have a very different meaning based on the context in which it appears. What people do traditionally to overcome that is embed a larger chunk size to try to capture the context around that sentence. Well, that means you've got more storage as you try to increase your retrieval quality. What contextualized chunking does is, when you send your (in this case) sentence to be embedded, you also send the entire document. And what we'll do in the background is embed the context of the document along with the individual sentence. It actually flips it, so that you can get better retrieval quality with a smaller chunk size, which is completely the opposite of what you would think. So that's another example of trying to reduce the friction that a developer might have as they're trying to learn these embeddings.
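The "lop off the last 512" idea can be sketched in a few lines. This is a minimal illustration, assuming an embedding model that was trained so that the leading dimensions carry the most information; with a model that was not trained that way, truncation would generally hurt retrieval much more. The vectors below are tiny made-up examples (8 dimensions cut to 4), and the function names are hypothetical, not part of any Voyage or MongoDB API.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 8-dimensional embeddings whose leading components dominate.
doc = [0.6, 0.4, 0.3, 0.2, 0.05, 0.03, 0.02, 0.01]
query = [0.5, 0.5, 0.2, 0.3, 0.01, 0.04, 0.02, 0.03]

full_score = cosine_similarity(doc, query)
short_score = cosine_similarity(
    truncate_embedding(doc, 4), truncate_embedding(query, 4)
)
print(f"full: {full_score:.4f}  truncated: {short_score:.4f}")
```

The point of the sketch is the workflow rather than the numbers: the stored vectors are reused as-is, so trying a smaller dimension count is a slice and a re-normalization, not a re-embedding of the whole corpus.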
A
Yeah, I've seen also for embeddings, various overlapping chunking strategies, which seems like, you know, you might get better context. But again, it's increasing the storage cost.
B
It is. When it comes to the quantization, the chunk size, and the dimensions, it's this constant battle that the developer is facing, where you're trying to balance the storage size (and it's not just disk storage, it's the size of the index in memory) against the retrieval quality. So what we try to do, both with the base embedding models and with the vector search that we have on top of the base MongoDB product, is reduce that friction. One of our executives recently explained it to me this way: remember when JavaScript came out? I'm old enough to remember when JavaScript came out. And then we got jQuery, and that was way easier to use, and nobody used raw JavaScript anymore. But now we've got React and Angular, and almost nobody uses jQuery. In this ecosystem of everything related to AI, whether it's learning the frameworks to build agents or learning these embeddings, we're still way closer, in our timeline and in the sophistication of the tools, to the original JavaScript than we are to React or Angular. So what MongoDB is trying to do, both with the Voyage acquisition and with the base product, is to move us a little bit closer to jQuery, because we're going to see more people develop agents and AI products in the next three years than we have in the last three years. Lowering that learning curve and reducing that friction for the individual developer is a really big part of that.
A
Yeah. It almost seems like, you know, you're talking about moving up the abstraction levels, right?
B
Absolutely. That's a big part of it.
A
So with all these trade-offs people are looking at, with the storage side, reducing the index in memory, all the other trade-offs, how can a developer approach making those decisions? Are there other ways of thinking about it that you could offer?
B
Yeah. So it boils down to those five decisions that I talked about before, once you've selected your embedding model. We try to make those core five decisions easier on the developer so that they can spend more of their time working on their core business logic and less time worrying about the mechanics of the embedding. Like I said, the quantization, the number of dimensions, and the chunk size are the core three of those five decisions where you're balancing storage size against retrieval quality. Typically what we recommend for similarity score is to start with cosine. There are a couple of others that people typically use, but cosine ends up being a good starter similarity score. When it comes to chunk size, if you use the contextualized chunking that Voyage offers, you can go to 64k tokens and get much better retrieval scores than you can when you go with bigger chunks. So you can sort of ignore the overlapping chunks if you use the contextualized chunking models. When it comes to dimensions, start with 1,024. And again, because we've got the Matryoshka reasoning in there, it's easy to try 512. It's easy to scale it down to see if you can get an acceptable retrieval score at a smaller storage size.
A
It's easier to scale down than up.
B
It's easier to scale down than up. Exactly. When it comes to quantization, when you go to build the index, the way that the MongoDB Vector Search API works, you just get to select what level of quantization you'd like to use. By default you can use the full 32-bit. Again, you can experiment with using the 8-bit int to see if you still get an acceptable retrieval score but a lower storage size. And then we've actually found that reranking can help you quite a bit as well. If you combine the benefits you can get, in particular out of the contextualized chunking, with a best-in-class reranker, we find you can boost the retrieval score somewhere in the neighborhood of 10 to 15%, and that can be the difference between a hallucination and offering somebody an AI-enhanced solution that actually helps them solve a real-world human problem.
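To make the storage-versus-quality trade-off concrete, here is a minimal sketch of generic symmetric scalar quantization from 32-bit floats to 8-bit integers, plus a back-of-the-envelope storage estimate. This is a textbook technique shown under stated assumptions, not MongoDB's or Voyage's actual quantization implementation; the function names and the sample vector are hypothetical, and the storage figure counts raw vector bytes only, ignoring index overhead.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(quantized, scale):
    """Approximate the original floats from the stored integers."""
    return [q * scale for q in quantized]


def raw_storage_bytes(num_vectors, dims, bytes_per_value):
    """Raw vector storage only; a real index adds its own overhead."""
    return num_vectors * dims * bytes_per_value


original = [0.12, -0.5, 0.33, 0.0]  # made-up embedding fragment
quantized, scale = quantize_int8(original)
restored = dequantize(quantized, scale)
worst_error = max(abs(a - b) for a, b in zip(original, restored))
print("quantized:", quantized, "worst rounding error:", worst_error)

# One million 1,024-dimensional vectors, float32 versus int8.
float32_bytes = raw_storage_bytes(1_000_000, 1024, 4)
int8_bytes = raw_storage_bytes(1_000_000, 1024, 1)
print(f"float32: {float32_bytes / 1e9:.3f} GB, int8: {int8_bytes / 1e9:.3f} GB")
```

The rounding error per component is bounded by half the scale, which is the retrieval quality being traded for a fourfold reduction in raw vector storage.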
A
Yeah, and you also mentioned holding the database index in memory. I know when we did our cloud transformation, we had to get specialized storage containers just because we needed so much in memory, right, instead of compute. Are there ways to make that trade-off, to either reduce the index size for cost or, if you're going for speed and performance, to increase that index size?
B
Yeah, typically there's a correspondence between the size of the index and the speed that you get out of it. And that conceptually makes sense: if you've got a bigger index space to search across, then your performance is going to have a similar increase. And again, it depends a lot on your specific data. It depends a lot on what you're running it on. But the vector search is something we offer as part of the Atlas product. If you listened to our most recent analyst call, Atlas is the cloud version of our product, and it runs on every single hyperscaler, so you get to pick where it gets deployed. We'll automatically manage that instance for you. The vector search is part of that product.
A
I know we talked in the call about Mongo being kind of a niche product. Do you want to sort of address that?
B
Yeah, I mean, when I talk to customers about this, it's because of how far back relational databases go. I happen to have been born the same year that the white paper that gave birth to relational databases was written. That was in 1970. If you think about what the world was like in 1970, the kinds of applications were oriented towards departments, not the public at large. You could have downtime on the weekends, and storage was really expensive. Because of that, the education system that we all go through really tends to put an emphasis on normalization of data: how can you lay your data out so that you're storing the absolute minimum amount of data? Our founders sold DoubleClick to Google, and that's part of what the ad system that you see on Google Search is based on. What they saw was that there were some more modern use cases where maybe it was okay not to fully normalize, if what you get is the advantage of better transactional response. The first MongoDB commit was in 2007. That was after the Internet, after mobile, after cloud. So by being aware of that and having a more flexible schema structure (you might know MongoDB is largely based on the JSON model; we store the data in a binary version of JSON called BSON), you can get far faster transactional response, the kind of thing you need in an AI application, as opposed to, say, something that's analytical, where maybe you've got more data and you do have to worry about normalization. If you denormalize some of that data with MongoDB, you can get better transactional response. Instead of just thinking, I must normalize at all costs: well, if you're willing to denormalize a little bit, then the trade-off is you get better transactional throughput and better transactional response time. Does that fit every workload? No. Does it fit a ton of workloads that are super important? Yes. Because with the modern application, you can't have downtime on the weekend like you could in 1970. Slow is the new downtime. So there are plenty of use cases that fit that more denormalized model that we provide.
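The normalize-versus-denormalize trade-off can be sketched with plain Python data structures. This is only an illustration of the two data layouts: the collection names, field names, and sample order are hypothetical, and no database driver is involved. The normalized layout stores each fact once and assembles an order at read time; the denormalized layout stores the whole order as one JSON-like document and reads it in a single lookup.

```python
# Normalized layout: each fact is stored once, and a read needs several lookups.
customers = {1: {"name": "Ada"}}
orders = [{"order_id": 100, "customer_id": 1}]
order_items = [
    {"order_id": 100, "sku": "A1", "qty": 2},
    {"order_id": 100, "sku": "B7", "qty": 1},
]


def load_order_normalized(order_id):
    """Assemble an order by joining three separate collections at read time."""
    order = next(o for o in orders if o["order_id"] == order_id)
    items = [
        {"sku": item["sku"], "qty": item["qty"]}
        for item in order_items
        if item["order_id"] == order_id
    ]
    return {
        "order_id": order_id,
        "customer": customers[order["customer_id"]],
        "items": items,
    }


# Denormalized layout: the whole order lives in one document.
order_documents = {
    100: {
        "order_id": 100,
        "customer": {"name": "Ada"},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    }
}


def load_order_document(order_id):
    """Read the pre-assembled order with a single key lookup."""
    return order_documents[order_id]


print(load_order_normalized(100) == load_order_document(100))
```

The cost of the second layout is duplication: the customer's name is copied into every order document, so more bytes are stored and an update to the name touches more places, which is the storage concern that normalization was designed to address.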
A
Yeah. And obviously JSON is one of the foundational technologies of the current Internet, right? Everybody's got JSON, and AI, as it turns out.
B
You know, the only other thing was, if you haven't watched the Werner keynote, I would recommend it. It's a good use of an hour and 15 minutes. The too-long-didn't-read version, if I go from my notes: he talked about the importance of remaining curious and of being a good communicator. Just because you might use AI-enhanced tooling to generate your code, you still own it; you're still responsible for it running in production. It's not an excuse to say, well, my AI generated it. No, you own it. And he talked about some techniques for making sure that you inspect that code and put your seal of approval on it. Then he talked about the importance of thinking in systems, because AI is going to be really good at helping you with individual tasks, and you as the human need to see across those tasks and why each one is necessary. And that blended into his final thing, where he used this word that not many people know, the polymath, which means you're an expert in one deep topic but you know a little bit about a lot of other things. It's like that T-shaped engineer that you might have heard of instead of the word polymath. If you combine those five things, he thinks we're about to see this renaissance of software development based on being AI-enhanced. And that's what this Vogels renaissance developer is: you embrace curiosity, communication, ownership, thinking across systems, and being a polymath.
A
Yeah.
B
It's worth your time. I found it super inspirational. I want to go build stuff.
A
Yeah. Well, in terms of the ownership, I read an article a while back that said, can you trust AI code? Well, no. Can you trust junior developer code? No. Can you trust code you wrote yesterday? No. Make sure you look at and understand any piece of code that comes across your desk.
B
Absolutely. The difference is that we gain an understanding and build that trust in a traditional way because we write the code. As we're writing it, we trust what we wrote. That doesn't mean you don't still need the review cycle if you've got AI generating some of that for you. So it's a shift in thinking. I mean, like I said, I'm 55 years old. I've been writing code since I was 11. I haven't written a manual line of code in eight months now.
A
Wow. Go watch the keynote, get inspired, and start building. Okay. Well, it is that time of the show again where we shout out somebody who came onto Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today we're shouting out a Populist badge winner, somebody who dropped an answer that was so good it outscored the accepted answer. So congrats to Chef's Cap for answering "Error: non-const static data member must be initialized out of line." If you're curious about that error, we'll have the answer for you in the show notes. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have comments, questions, or concerns, email me at podcast@stackoverflow.com, and if you want to reach out to me directly, you can find me on LinkedIn.
B
My name is Pete Johnson. I'm the Field CTO of AI at MongoDB. You can find me on LinkedIn, where I read all the white papers so you don't have to. I'll literally connect with and have open DMs with anyone. So feel free to join in, and I'll do that research so that you don't have to.
A
All right. Thank you for listening, everyone, and we'll talk to you next time.
Episode: You need quality engineers to turn AI into ROI
Date: January 7, 2026
Host: Ryan Donovan
Guest: Pete Johnson, Field CTO of AI at MongoDB
This episode dives into the crucial role of quality engineering and human expertise in making artificial intelligence (AI) initiatives deliver real return on investment (ROI). Host Ryan Donovan and guest Pete Johnson (MongoDB Field CTO) unpack OpenAI's recent GDPval paper, debate why so many AI projects fail, and share actionable guidance for developers and business leaders seeking to maximize value from AI systems. The conversation also explores MongoDB’s toolset for AI-enhanced applications and reflects on the evolving responsibilities of software engineers in the age of advanced automation.
(03:20–07:04)
Memorable Quote:
"When the AI and the people work together... that's when they saw really big gains. ...It turned me into an AGI skeptic. I think we're entering an era where everybody's going to be AI-enhanced and see cost and speed improvements."
— Pete Johnson (06:07)
(07:15–09:58)
Memorable Quote:
"If you think you're going to get AGI and replace people, that's flawed logic... The real traction comes from improving the productivity of the people you have."
— Pete Johnson (07:28)
(12:49–18:16)
Memorable Quote:
"We're still way closer... to the original JavaScript than to React or Angular. We want to move AI up those abstraction levels so more people can build, with less friction."
— Pete Johnson (17:11)
(22:27–25:02)
Memorable Quote:
"The modern application—you can't have downtime on the weekend like you could in 1970... Slow is the new downtime."
— Pete Johnson (24:44)
(25:14–27:06)
Memorable Quote:
"Just because you might use AI enhanced tooling to generate your code, you still own it, you're still responsible for it running in production."
— Pete Johnson (25:38)
(27:06–27:33)
“Can you trust AI code? No. Can you trust junior developer code? No. Can you trust code you wrote yesterday? No. Make sure you look at and understand any piece of code that comes across your desk.” — Ryan Donovan
The key is having effective review cycles and not blindly trusting AI-generated output.
Memorable Quote:
"I haven't written a manual line of code in eight months now."
— Pete Johnson (27:30)
This episode argues convincingly that actualizing AI’s business potential depends as much on quality engineering, solid metrics, and human expertise as on the models themselves. “AI-enhanced” will soon be the norm—not a replacement, but an amplifier for human creativity, judgment, and communication.
For more details on technical questions, MongoDB’s AI offerings, or Pete Johnson’s recommended resources (including Werner Vogels’ AWS re:Invent keynote), see the show notes.