The Hype vs. Reality of OpenAI Agents - The AI Podcast

Summary

The AI Podcast: The Hype vs. Reality of OpenAI Agents

Release Date: August 7, 2025

Introduction to OpenAI's Open Source Models

In this episode of The AI Podcast, the host delves into the recent significant move by OpenAI, which has released two open-source models for the first time in five years, dating back to GPT-2. This development marks a pivotal shift in OpenAI's strategy, drawing both attention and criticism from the AI community and prominent figures like Elon Musk. The discussion begins with an overview of the differences between "open source" and "open," the performance benchmarks of these models, and the broader implications for the AI landscape.

[00:00] "This is actually really big news because this is the first time in five years that they've actually dropped any open source models back to GPT2."

Benchmarks and Performance

Codeforces Benchmark

The host walks through the Codeforces benchmark results for OpenAI's larger open model, comparing them with OpenAI's proprietary models.

120 Billion Parameter Model: Achieved an Elo score of roughly 2600.
OpenAI's o3 Model: Scored 2700.
OpenAI's o4-mini Model: Slightly higher at 2720.

These results indicate that the 120 billion parameter open model performs close to OpenAI's proprietary models and well ahead of o3-mini, which scored 2000 without tools.

[04:30] "These are, these things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000."
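
To give a rough sense of what these rating gaps mean, the sketch below applies the standard Elo expected-score formula to the figures the host quotes. Treating Codeforces ratings as head-to-head Elo ratings is an assumption made for illustration, the dictionary keys are only labels, and the numbers are the host's rounded values rather than official results.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model.

    A result of 0.5 means evenly matched; values near 1.0 mean A is heavily favored.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Rounded ratings as quoted in the episode (assumed here, not official figures).
RATINGS = {
    "120B open model": 2600,
    "o3": 2700,
    "o4-mini": 2720,
    "o3-mini (no tools)": 2000,
}

if __name__ == "__main__":
    open_model = RATINGS["120B open model"]
    for name, rating in RATINGS.items():
        if name == "120B open model":
            continue
        expected = elo_expected_score(open_model, rating)
        print(f"120B open model vs {name}: expected score {expected:.2f}")
```

Under these assumptions the 120 billion parameter model's expected score is about 0.36 against o3, about 0.33 against o4-mini, and about 0.97 against o3-mini, which matches the host's reading that the top three are close while o3-mini is far behind.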

Importance of Tools in Benchmarking

A critical point discussed is the role of tools in benchmarking AI models. Tools refer to integrations like calculators or specific applications that assist the AI in completing tasks more accurately.

[08:50] "Tools basically mean they gave the AI model things like calculators and apps and like different tools. So like it is completing the tasks. Yes, but it's able to rely on like actual hard software to get good results."

The host emphasizes that while OpenAI's open-source models are released without these proprietary tools, the benchmark scores remain meaningful as developers can build custom tools to enhance model performance.
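
The pattern the host describes, where the model decides which tool to use and ordinary software does the exact computation, can be sketched as follows. Everything here is hypothetical: `fake_model` stands in for a real LLM call, and the `calculator` tool and `TOOLS` registry are illustrative names, not part of OpenAI's tooling.

```python
import ast
import operator

# Arithmetic operators the hypothetical calculator tool is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


# Registry of tools the application exposes to the model (illustrative).
TOOLS = {"calculator": calculator}


def fake_model(question: str) -> dict:
    """Hypothetical stand-in for an LLM that picks a tool and its arguments."""
    return {"tool": "calculator", "arguments": {"expression": "1234 * 5678"}}


def answer_with_tools(question: str, model=fake_model):
    """Ask the model which tool to use, then run that tool in ordinary code."""
    request = model(question)
    tool = TOOLS[request["tool"]]
    return tool(**request["arguments"])


if __name__ == "__main__":
    print(answer_with_tools("What is 1234 * 5678?"))
```

The model only chooses the tool and its arguments; the arithmetic itself is done by deterministic code, which is why scores with tools can differ from scores without them.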

Humanity's Last Exam Benchmark

Another challenging benchmark discussed is Humanity's Last Exam (HLE), a collection of very hard, expert-level questions spanning many disciplines. The scores below were reported with tools.

120 Billion Parameter Model: Scored 19%.
20 Billion Parameter Model: Scored 17%.

While these scores may seem modest, they surpass those of competing open models from DeepSeek and Qwen, although they lag behind OpenAI's closed-source models such as o3.

[12:15] "I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model."

Hallucinations and Model Accuracy

A significant concern raised is the issue of hallucinations—instances where AI models generate inaccurate or fabricated information.

120 Billion Parameter Model: Hallucinates in 49% of cases on the PersonQA benchmark, which asks the model questions about people.
20 Billion Parameter Model: Hallucinates in 53% of cases.

These figures are considerably higher than those of OpenAI's proprietary models; the host cites o1 at 16%. The host relays OpenAI's explanation that this is expected, since smaller models have less world knowledge than large frontier models and therefore tend to hallucinate more.

[19:45] "OpenAI's new model does hallucinate much more than its latest, you know, o3 or o4-mini models. So that is not a particularly fantastic statistic."

OpenAI's Licensing and Data Practices

The podcast addresses OpenAI's decision to release these models under the Apache 2.0 license, one of the most permissive open-source licenses, which lets companies monetize the models without paying OpenAI or asking its permission.

[24:00] "They are releasing both of these models under the Apache 2.0 license. So this is really considered as one of the most, I guess like lenient licenses. It will allow companies to monetize this model, right?"

However, OpenAI distinguishes its "open models" from fully open-source models by not releasing the training data, likely due to legal concerns over copyrighted material.

[27:30] "Unlike fully open source models, what's the difference? They're calling an open model, but it's not totally an open source... They are not going to release the training data that they use to create their models."

Microsoft's Integration of Open Models into Windows

Shifting focus, the host highlights Microsoft's plan to bring OpenAI's smallest model, the 20 billion parameter one, to Windows 11 users via Windows AI Foundry. The integration aims to provide AI capabilities directly on consumer PCs, supporting tasks like code execution and tool use.

[34:15] "Microsoft is basically bringing their smallest model... It's going to be for any Windows 11 users, which is pretty interesting."

Key features include:

Tool Savvy and Lightweight: Optimized for agentic tasks like code execution and tool use.
Efficient Performance: Runs on a range of Windows hardware, with support for more devices forthcoming.
Versatility: Suitable for building autonomous assistants and embedding AI into various workflows, even in environments with limited bandwidth.

The host expresses excitement about the accessibility and potential applications of this integration, noting that the models can already be downloaded from Hugging Face for developers and companies to use.

[38:50] "It's really a gift to the world and I'm sure OpenAI has more exciting things up their sleeve like GPT5 that'll probably blow this out of the water."

Conclusion and Future Outlook

Wrapping up, the host reflects on the current state and future prospects of open-source AI models. While OpenAI's latest open models represent a significant step forward, challenges like hallucinations remain areas for improvement. The collaboration with Microsoft signifies a broader trend of integrating AI more deeply into everyday technologies, promising increased accessibility and innovation.

[45:20] "What you actually need if you want to run this and this will be starting on Tuesday, but it'll be able to run on most consumer PCs and laptops... So really, really impressive."

The episode concludes with anticipation for forthcoming advancements, including potential releases like GPT-5, which may further elevate the capabilities and applications of AI models.

Notable Quotes

Host [00:00]: "This is actually really big news because this is the first time in five years that they've actually dropped any open source models back to GPT2."
Host [04:30]: "These are, these things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000."
Host [08:50]: "Tools basically mean they gave the AI model things like calculators and apps and like different tools. So like it is completing the tasks. Yes, but it's able to rely on like actual hard software to get good results."
Host [12:15]: "I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model."
Host [19:45]: "OpenAI's new model does hallucinate much more than its latest, you know, o3 or o4-mini models. So that is not a particularly fantastic statistic."
Host [24:00]: "They are releasing both of these models under the Apache 2.0 license. So this is really considered as one of the most, I guess like lenient licenses. It will allow companies to monetize this model, right?"
Host [27:30]: "Unlike fully open source models, what's the difference? They're calling an open model, but it's not totally an open source... They are not going to release the training data that they use to create their models."
Host [34:15]: "Microsoft is basically bringing their smallest model... It's going to be for any Windows 11 users, which is pretty interesting."
Host [38:50]: "It's really a gift to the world and I'm sure OpenAI has more exciting things up their sleeve like GPT5 that'll probably blow this out of the water."
Host [45:20]: "What you actually need if you want to run this and this will be starting on Tuesday, but it'll be able to run on most consumer PCs and laptops... So really, really impressive."

This comprehensive summary encapsulates the key discussions from the episode, providing insights into OpenAI's strategic release of open-source models, their performance metrics, challenges like hallucinations, licensing implications, and Microsoft's integration efforts. Listeners and enthusiasts alike will find this overview valuable for understanding the current dynamics and future directions in the realm of artificial intelligence.

Transcript

[00:00]
A
Today on the podcast we're talking about OpenAI, which has just dropped two open source models. Now this is actually really big news because this is the first time in five years that they've actually dropped any open source models back to GPT2. And this is something that's gotten a ton of criticism. Basically all of Elon Musk's online AI beef, pretty much why he says he started xAI, and just a lot of drama and heat that has been thrown at OpenAI is based on the fact that they started as an open source company and hadn't dropped anything, and they have now officially launched some quote unquote open models.
Now I'm going to be talking about the difference between open source and open, and where these models sit. I'm also going to go through the benchmarks of basically how these models perform because the criticism that a bunch of them have gotten is like they just dropped like these models that are, you know, just to say that they're open source, but they're not actually that good. And I'm actually, I'm not going to lie, impressed by some benchmarks, but interested. And there's a couple of interesting nuances I want to go over. At the same time, Microsoft has just announced that they're going to be bringing some, some of their smallest open models to Windows users. So there's a ton of really interesting things that are getting rolled out right now. We'll be covering all of that on the podcast today.
Before we get into it, I wanted to mention if you want to try any of the AI models that we talk about on the show, I'd love for you to go check out my own startup, which is called AIBox.ai, where we essentially have the top 40 different AI models from Anthropic, Cohere, DeepSeek, Google, OpenAI, Meta, tons of others, audio models like ElevenLabs, and a bunch of really interesting image models, all for 20 bucks a month, you get access to all of them.
So my hope there is not just that it'll save you some money on, you know, the exorbitant amount of AI models that you can subscribe to, but really that you'll be able to find and try out a whole bunch of different AI models that you hadn't heard of or used before. I think there's a lot of really great unheard of models that can do some great things in specific tasks. We have kind of benchmark data and we break down what models are best for what on the platform. So go check it out. It's 20 bucks a month, AIBox.ai.
All right, let's get into what OpenAI is doing. So the first benchmark that I want to talk about is the Codeforces benchmark. They basically ran the GPT-OSS 120 billion parameter model. That's the bigger of the two open source models. They have a 120 billion parameter one and then they have a smaller one. But basically the bigger 120 billion parameter one got an Elo score on Codeforces of 2600, roughly. And just to compare that with OpenAI's other tools, their o3 model got 2700 and their o4 model, their o4-mini model, sorry, got 2720. So like, these are, these things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000. So it did pretty decently.
Now, a lot of them are also rated with or without tools. So that 2000 benchmark that I quoted you was without tools. And what exactly does it mean, tools? And is that important? Yes, I would say it's important. Tools basically mean they gave the AI model things like calculators and apps and like different tools. So like it is completing the tasks. Yes, but it's able to rely on like actual hard software to get good results. And this is something that we basically use in AI, with these LLMs, because we've found that they're like, perhaps not necessarily fantastic at a math problem or like some sort of really intense molecular biology question when they're kind of just guessing what should come next in the line.
But they're good at figuring out what they need to do. And so we actually leverage that to find what tool to use and then bring the tool in to solve maybe a more calculated problem.
Okay, so does it matter that they're giving us basically benchmarks with or without tools? Yeah, I think the big thing here is when they release the model open source, or quote unquote, open for everyone to download and use, they're not releasing it with the tool. So no one's getting. They give this benchmark with tools or without tools, but they don't give us the tools because they're kind of OpenAI's proprietary, proprietary stack. But what I will say is it is a good benchmark because big companies or software like startups, they can build their own tools and usually if you're taking one of these models and putting it into your, your startup to do a certain task, you're going to be building custom tools. Anyways, I, I think back to my first startup was called Self Paws. It was a no co or sorry. It was an AI life coach and we basically had it. So you would talk to ChatGPT and we would act like a life coach and work you through different questions. And we built our own custom things to basically instruct and guide how the AI model works and how it would run a conversation. And so I think most software startups would be similar.
Okay. Humanity's Last Exam. This is kind of a notorious benchmark. It's called HLE, but basically it's Humanity's Last Exam before AGI is kind of the concept, but it's got a whole bunch of really complex questions. You heard this exam quoted a lot by xAI when they released their latest version of Grok. It did really well on this. So with tools, GPT-OSS, so basically their 120 billion parameter model and their 20 billion parameter model scored 19% and 17%. Well, I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model. Again, this is an incredibly hard test.
So lest you're like, oh, 20%, this is terrible. This is. I don't think I would be able to get any questions on this, on this test. Right. It's like super in depth, you know, like translating ancient Hebrew meaning of this hieroglyph to how does it convert to this thing? It's like, it's very, very complicated questions. Okay. That most, many experts would not excel at or succeed in. Okay. And it's a whole bunch of different areas that it tests you. And so in any case, this outperforms the o3 model but. Or, sorry, it underperforms the o3 model, but it does outperform a bunch of leading open models from DeepSeek and Qwen. So the Chinese companies that are releasing their open source models, it beats the Chinese companies, but it's not better than what open source has, or what OpenAI has closed source, what basically they're, they're selling. So I mean, that kind of makes sense.
One thing I will say though is OpenAI's new model does hallucinate much more than its latest, you know, o3 or o4-mini models. So that is not a particularly fantastic statistic. So this is definitely something that's interesting because it seems like these have actually, these hallucinations have actually been getting more severe, which is not a good thing. We don't want to hear that OpenAI's models are getting worse and worse at hallucinating. They don't even really understand why. Which is another thing that's not super great. They wrote a white paper where they said, quote, basically they said that it is, quote, expected as smaller models have less world knowledge than large frontier models and they tend to hallucinate more. So basically they're blaming it on the fact that there's less data, less parameters inside of these models. That's why they're hallucinating more. But yeah, it's kind of a trend that you see in a lot of places. It's kind of interesting. Basically they found that their 120 billion parameter model hallucinated in response to like 49% of a benchmark called PersonQA.
So it's, it's one that is asking the AI model basically about people. So like who is Tom Cruise? And give me information about him who is. And it goes through like tons of people, some that are famous, some that are less famous and they'll see if it hallucinates. If you've ever tried it, like if you go say who is Jaden Schaefer to ChatGPT, like sometimes it'll grab information from, you know, the web and whatever, my LinkedIn. But like half the time especially I remember in the early days would throw in tons of random stuff that weren't true. Um, and so I think with these older models we're kind of, it's kind of like. Or these open source models are kind of like some of the older models and they throw a bunch of funny things in there. So that's an area that hallucinates. Doesn't mean it hallucinates on every topic but on people it definitely is doing that. So yeah, it's, they're, they're newer models that they, they have that they pay for. Like their o1 model had a 16% score there. So you know, 16% hallucinations about people versus these models are like 49% and 53%. It's a lot more.
So anyways, it's kind of funny, OpenAI said that basically they are not going to share what data they use to train these models. And basically I think this comes down to a whole bunch of lawsuits that are happening where people are saying that OpenAI uses copyrighted data. I'm assuming that they did, they're just not announcing it. They're trying to probably work with some regulators in Washington to make it like all kosher and okay before they like officially say that they're not. So I think that this is interesting.
OpenAI does say that the model that they have for their 120 billion parameter model, it only activates 5.1 billion parameters per token. They also say that their model was trained using something called high-compute reinforcement learning. Basically this is a process, it's a post training process that helps to teach AI models right from wrong.
Like the right answer, the wrong answer. And they do this in a simulated environment and they do all of this using basically a really big cluster of Nvidia GPUs. So this is how they trained their o series of models, right? Their o3 and o4. And so it has basically a similar chain of thought process where they, it takes longer but it's saying like okay, how would I solve this problem? It comes up with a list and then it works through that list chain of thought on solving different problems. And we typically get better answers when we do this.
Now here's what's exciting in my opinion. They are releasing both of these models under the Apache 2.0 license. So this is really considered as one of the most, I guess like lenient licenses. It will allow companies to monetize this model, right? So you can actually charge money for it. Unlike some things that Meta was doing when they were releasing. They like, were like, here's our like open source models versions of Llama. But like companies, you gotta talk to us if you want to like make any money off of it. OpenAI is being really generous letting people make money off of it. They don't have to pay OpenAI and they don't have to get permission from OpenAI. So this is really just a free gift to the world of an open source model, which is in my opinion really, really cool.
I will say unlike fully open source models, what's the difference? They're calling an open model, but it's not totally an open source. The difference is that companies like AI Labs, or sorry, AI2 have like a fully open source model. The difference is that OpenAI is not going to release the training data that they use to create their models. I kind of already talked about it. It's basically for legal reasons. They probably have copyrighted stuff in there which is their own choice. But basically anyone that gets the model is going to benefit from it because the model is going to answer and be way more accurate and have higher quality results.
So that's kind of the outcome of that. I will say that they have delayed this model multiple times. I have personally been disappointed when I've seen Sam Altman's tweets on Twitter over the last couple months where they keep delaying it for safety reasons. They think that it is a lot safer now. Basically the things that they said they were concerned about was cyber attacks or the creation of biological or chemical weapons. Basically you get information from the models that could help you do those two things. It seems like they've kind of put some guardrails and made the model better now. So it doesn't do that. They had a bunch of third party evaluators actually test it and they said that it marginally increases biological capabilities, but it didn't find evidence that they were going to have a very high capacity threshold for danger in these domains after fine tuning. So I think even, or sorry even if you tried to fine tune to be able to do that. So I think it's going to be a much safer model.
It's definitely state of the art amongst other open models, right. So if we're looking at like DeepSeek and Qwen and Meta's Llama, it's definitely the top of the pack there. We're also waiting for DeepSeek R2 to release which also should give it a run for its money. So it'll be interesting to see what's happening there. So all of this is going down, which is really, really interesting.
And the last thing I wanted to bring up though is that Microsoft is basically bringing their smallest model, right? So they have the 20 billion parameter model. They're bringing it to a bunch of Windows users, which is pretty interesting. It's going to be for any Windows 11 users. It's via Windows AI Foundry. So this is kind of their platform that lets you use AI APIs and a bunch of popular open source models on your computer. Microsoft in a blog post said tool savvy and lightweight, optimized for agentic tasks like code execution and tool use.
It runs efficiently on a range of Windows hardware with support for more devices coming soon. It's perfect for building autonomous assistants or embedding AI into real world workflows, even in bandwidth constrained environments. So basically what you actually need if you want to run this and this will be starting on Tuesday, but it'll be able to run on most consumer PCs and laptops, but to have at least 16 gigs of VRAM which basically a modern GPU from Nvidia or Radeon would have. OpenAI said that the model was trained using high-compute reinforcement learning. So it pretty much excels at powering AI agents and a bunch of other tools. It can do web search, it can do Python code execution and all of that. So really, really impressive.
I'm excited to see where this goes. I'm excited to see that Microsoft is kind of rolling out some sort of integrations so that a lot of people can use this. And this is a really cool moment. You can go download this today on Hugging Face, which is super cool and I'm excited to see what people build with it, what companies start using it. This is just honestly a gift to the world and I'm sure OpenAI has more exciting things up their sleeve like GPT5 that'll probably blow this out of the water. But for what it's capable of doing, anyone gets access to a really world class AI model and so I'm quite excited about that.
All right, thanks so much for tuning into the podcast. Make sure to go check out AIBox.ai if you want to try out a lot of the different models I talk about on the show. For 20 bucks a month, it's an amazing value and I would love to hear what you have to think about or have to say about it. Because it's currently in beta, we're taking feedback and adding tons of new features all the time. Thanks so much for tuning in and I will catch you in the next episode.