The Hype vs. Reality of OpenAI Agents - The Jaeden Schafer Podcast


Podcast Summary: The Hype vs. Reality of OpenAI Agents

Podcast Information

Title: The Joe Rogan Experience of AI
Host: Jaeden Schafer
Episode: The Hype vs. Reality of OpenAI Agents
Release Date: August 8, 2025
Description: This episode delves deep into the recent developments surrounding OpenAI's open-source models, benchmark performances, challenges such as model hallucinations, and Microsoft's integration of AI models into Windows. The discussion mirrors the conversational and insightful style of Joe Rogan, featuring expert opinions and detailed analyses.

1. Introduction to OpenAI's Open-Source Models

At the outset of the episode, the host introduces the news that OpenAI has released two open models, its first such release in the five years since GPT-2. OpenAI's long gap without an open release has drawn considerable criticism, particularly from Elon Musk, who has attacked the company for moving away from its open-source origins.

Notable Quote:

"This is something that's gotten a ton of criticism... pretty much why he says he started xAI." ([00:00])

2. Benchmark Performance of OpenAI's Models

The discussion then turns to the benchmark results for OpenAI's newly released models. The host highlights the 120 billion parameter model's result on the Codeforces benchmark, an Elo score of approximately 2600. That is close to OpenAI's own o3 and o4-mini models, which scored 2700 and 2720 respectively.

Key Points:

Codeforces Benchmark: Measures the competitive-programming and problem-solving ability of AI models.
Performance Comparison:
- OpenAI's 120B model: 2600
- OpenAI's o3 model: 2700
- OpenAI's o4-mini model: 2720
Insight: The open model comes close to OpenAI's proprietary models and beats o3-mini, which the host says scored about 2000 without tools.
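To put the Elo gaps above in perspective, the standard Elo formula converts a rating difference into an expected score. The sketch below applies it to the ratings quoted in the episode. Treating benchmark Elo numbers as head-to-head ratings is a simplifying assumption, and the function name `expected_score` is chosen here purely for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted by the host (approximate).
ratings = {"120B open model": 2600, "o3": 2700, "o4-mini": 2720}

for name, rating in ratings.items():
    if name != "120B open model":
        p = expected_score(ratings["120B open model"], rating)
        print(f"120B open model vs {name}: expected score {p:.2f}")
```

Under this model, a 100-point deficit corresponds to an expected score of roughly 0.36, which is one way to read the host's remark that the models "aren't very far apart."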

Notable Quote:

"The bigger parameter 120 billion parameter one got an Elo score on Codeforces of 2600, roughly." ([00:00])

3. Understanding Tools Integration in Benchmarks

A significant portion of the discussion revolves around the role of tools in enhancing AI model performance. Tools refer to additional software or applications, such as calculators or specialized apps, that assist the AI in completing tasks more accurately.

Key Points:

With vs. Without Tools: Benchmarks can be assessed with the AI model using external tools or operating independently. Tools significantly enhance performance, especially in complex tasks.
OpenAI's Approach: The open-source models are released without OpenAI's proprietary tools, meaning users must develop or integrate their own tools to achieve optimal performance.
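The idea the host describes, where the model decides which tool to use and ordinary software does the exact work, can be sketched as follows. This is a minimal illustration, not OpenAI's proprietary stack: `fake_model`, `TOOLS`, and `run_with_tools` are hypothetical names, and the "model" is a stand-in that parses a simple arithmetic question.

```python
import operator
from typing import Callable

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(left: float, op: str, right: float) -> float:
    """Exact arithmetic that a language model would otherwise have to guess at."""
    return OPS[op](left, right)


# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[..., float]] = {"calculator": calculator}


def fake_model(question: str) -> dict:
    """Stand-in for an LLM: it only decides which tool to call and with what arguments."""
    left, op, right = question.split()
    return {"tool": "calculator", "args": {"left": float(left), "op": op, "right": float(right)}}


def run_with_tools(question: str) -> float:
    """Harness: ask the model for a tool call, then execute that tool."""
    call = fake_model(question)
    return TOOLS[call["tool"]](**call["args"])


print(run_with_tools("1234 * 5678"))
```

Because the released models ship without OpenAI's tools, a team adopting them would supply its own registry and harness along these lines.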

Notable Quote:

"Tools basically mean they gave the AI model things like calculators and apps..." ([00:00])

4. Performance on the Humanity's Last Exam Benchmark

The host turns to Humanity's Last Exam (HLE), a notoriously difficult benchmark made up of complex, expert-level questions across many disciplines.

Key Points:

Scores:
- 120B Parameter Model: 19%
- 20B Parameter Model: 17%
Comparison: These scores, achieved with tools, trail OpenAI's o3 model but beat leading open models from DeepSeek and Qwen.
Hallucinations: The open models hallucinate (produce incorrect or fabricated information) at a high rate, particularly on questions about individuals, where the host cites rates of 49% and 53%.

Notable Quote:

"I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model." ([00:00])

5. OpenAI's Licensing and Accessibility

A pivotal moment in the episode is the discussion about OpenAI releasing these models under the Apache 2.0 license. This permissive licensing allows companies to monetize the models without seeking permission from OpenAI, contrasting with other companies like Meta, which impose restrictions on commercial use.

Key Points:

Apache 2.0 License: Grants broad permissions, including commercial use, modifications, and distribution.
Impact: Encourages widespread adoption and innovation, allowing businesses to integrate and build upon OpenAI's models freely.
Difference from Fully Open-Source Models: While OpenAI's models are highly accessible, they do not include the training data, which remains proprietary due to legal considerations.

Notable Quote:

"They are releasing both of these models under the Apache 2.0 license. So this is really considered as one of the most, I guess, like lenient licenses." ([00:00])

6. Addressing Model Hallucinations and Training Transparency

The host expresses concerns over the increased hallucination rates in the new models. OpenAI acknowledges that smaller models, with fewer parameters, tend to hallucinate more due to limited world knowledge.

Key Points:

Hallucination Rates:
- OpenAI's 120B model: 49% on the PersonQA benchmark
- OpenAI's o1 model: 16%, compared to 49% and 53% for the new open models
Training Data Transparency: OpenAI remains opaque about their training data, likely due to ongoing legal challenges regarding the use of copyrighted material.

Notable Quote:

"They also say that their model was trained using something called high-compute reinforcement learning... it's a post-training process that helps to teach AI models right from wrong." ([00:00])

7. Microsoft's Integration of OpenAI's Models into Windows

Shifting focus, the host discusses Microsoft's initiative to integrate OpenAI's smallest model (20 billion parameters) into Windows 11 via Windows AI Foundry. This integration aims to provide seamless access to AI capabilities for Windows users.

Key Points:

Windows AI Foundry: A platform enabling the use of AI APIs and open-source models directly on Windows devices.
System Requirements: Requires at least 16GB of VRAM, necessitating modern GPUs from Nvidia or Radeon.
Capabilities: Supports tasks like code execution, web search, and embedding AI into workflows, even in environments with limited bandwidth.
Accessibility: Available to Windows 11 users starting Tuesday, with plans to expand support to more devices.
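A rough back-of-the-envelope calculation helps explain why a 16GB VRAM floor is plausible for a 20 billion parameter model. The sketch below estimates memory for the weights alone as parameters times bits per parameter. The bit widths are assumptions for illustration and are not stated in the episode, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed precisions; the episode does not say how the weights are stored.
for bits in (16, 8, 4):
    print(f"20B params at {bits}-bit: ~{weight_memory_gb(20e9, bits):.0f} GB")
```

Under these assumptions, only a low-precision format would fit within 16GB, which is consistent with the stated requirement but not confirmed by the episode.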

Notable Quote:

"It's a really cool moment. You can go download this today on Hugging Face, which is super cool..." ([00:00])

8. Future Prospects and Conclusion

In concluding the episode, the host expresses optimism about the future of open-source AI models and the potential for innovation spurred by OpenAI's recent releases. The anticipation of upcoming models like GPT-5 is highlighted, suggesting continued advancements in the field.

Key Points:

Community Impact: Open-source access democratizes AI technology, fostering creativity and diverse applications.
Future Models: Expectations of more powerful models that will further bridge the gap between hype and reality.
Final Thoughts: Emphasizes the significance of OpenAI's contributions to the AI ecosystem and the exciting possibilities ahead.

Notable Quote:

"It's definitely state of the art amongst other open models... anyone gets access to a really world class AI model and so I'm quite excited about that." ([00:00])

Summary

In this episode of "The Joe Rogan Experience of AI," the host analyzes OpenAI's release of two open models, its first such release in five years. The discussion covers benchmark performance, highlighting how close the models come to OpenAI's proprietary versions on Codeforces and Humanity's Last Exam. The conversation also explains how tools enhance AI capabilities and addresses the models' high hallucination rates, especially on questions about people.

A pivotal aspect of the episode is OpenAI's decision to license these models under Apache 2.0, promoting broad accessibility and commercialization without restrictive permissions. However, the lack of transparency regarding training data raises concerns, likely tied to ongoing legal issues over data usage.

Further, the host explores Microsoft's integration of OpenAI's models into Windows 11, enhancing AI accessibility for everyday users through the Windows AI Foundry platform. This move is seen as a significant step in embedding AI into mainstream workflows and applications.

Concluding on an optimistic note, the host anticipates future advancements with models like GPT-5, emphasizing the transformative potential of open-source AI in driving innovation and expanding the horizons of technology and human experience.

Overall, the episode provides an in-depth exploration of the current state and future prospects of OpenAI's open-source AI models, balancing the excitement of recent developments with critical insights into their performance and implications.


Transcript

[00:00]
A
Today on the podcast we're talking about OpenAI, which has just dropped two open source models. Now this is actually really big news because this is the first time in five years that they've actually dropped any open source models, back to GPT-2. And this is something that's gotten a ton of criticism. Basically all of Elon Musk's online AI beef, pretty much why he says he started xAI, and just a lot of drama and heat that has been thrown at OpenAI is based on the fact that they started as an open source company and hadn't dropped anything and they have now officially launched some quote unquote open models. Now I'm going to be talking about the difference between open source and open, and where these models sit. I'm also going to go through the benchmarks of basically how these models perform because the criticism that a bunch of them have gotten is like they just dropped like these models that are, you know, just to say that they're open source, but they're not actually that good. And I'm actually, I'm not going to lie, impressed by some benchmarks, but interested. And there's a couple of interesting nuances I want to go over. At the same time, Microsoft has just announced that they're going to be bringing some, some of their smallest open models to Windows users. So there's a ton of really interesting things that are getting rolled out right now. We'll be covering all of that on the podcast today. Before we get into it, I wanted to mention if you want to try any of the AI models that we talk about on the show, I'd love for you to go check out my own startup, which is called AIBox.ai, where we essentially have the top 40 different AI models from Anthropic, Cohere, DeepSeek, Google, OpenAI, Meta, tons of others, audio models like ElevenLabs, and a bunch of really interesting image models, all for 20 bucks a month, you get access to all of them. 
So my hope there is not just that it'll save you some money on, you know, the exorbitant amount of AI models that you can subscribe to, but really that you'll be able to find and try out a whole bunch of different AI models that you hadn't heard of or used before. I think there's a lot of really great unheard of models that can do some great things in specific tasks. We have kind of benchmark data and we break down what models are best for what on the platform. So go check it out. It's 20 bucks a month, AIBox.ai. All right, let's get into what OpenAI is doing. So the first benchmark that I want to talk about is the Codeforces benchmark. They basically ran the gpt-oss 120 billion parameters. That's the bigger of the two open source models. They have a 120 billion parameter one and then they have a smaller one. But basically the bigger parameter, 120 billion parameter one got an Elo score on Codeforces of 2600, roughly. And just to compare that with OpenAI's other tools, their o3 model got 2700 and their o4 model, their o4-mini model, sorry, got 2720. So like, these are, these things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000. So it did pretty decently. Now, a lot of them are also rated with or without tools. So that 2000 benchmark that I quoted you was without tools. And what exactly does it mean, tools? And is that important? Yes, I would say it's important. Tools basically mean they gave the AI model things like calculators and apps and like different tools. So like it is completing the tasks. Yes, but it's able to rely on like actual hard software to get good results. And this is something that we basically use in AI, because with these LLMs, because we've, we found that they're like, perhaps not necessarily fantastic at a math problem or like some sort of really intense molecular biology question when they're kind of just guessing what should come next in the line. 
But they're good at figuring out what they need to do. And so we actually leverage that to find what tool to use and then bring the tool in to solve maybe a more calculated problem. Okay, so does it matter that they're giving us basically benchmarks with or without tools? Yeah, I think the big thing here is when they release the model open source, or quote unquote, open for everyone to download and use, they're not releasing it with the tool. So no one's getting. They give this benchmark with tools or without tools, but they don't give us the tools because they're kind of OpenAI's proprietary, proprietary stack. But what I will say is it is a good benchmark because big companies or software like startups, they can build their own tools and usually if you're taking one of these models and putting it into your, your startup to do a certain task, you're going to be building custom tools. Anyways, I, I think back to my first startup was called Selfpause. It was a no co or sorry. It was an AI life coach and we basically had it. So you would talk to ChatGPT and we would act like a life coach and work you through different questions. And we built our own custom things to basically instruct and guide how the AI model works and how it would run a conversation. And so I think most software startups would be similar. Okay. Humanity's Last Exam. This is kind of a notorious benchmark. It's called HLE, but basically it's humanity's last exam before AGI is kind of the concept, but it's got a whole bunch of really complex questions. You heard this exam quoted a lot by xAI when they released their latest version of Grok. It did really well on this. So with tools, gpt-oss, so basically their 120 billion parameter model and their 20 billion parameter model, scored 19% and 17%. Well, I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model. Again, this is an incredibly hard task test. 
So lest you're like, oh, 20, this is terrible. This is. I don't think I would be able to get any questions on this, on this test. Right. It's like super in depth, you know, like translating ancient Hebrew meaning of this hieroglyph to how does it convert to this thing? It's like, it's very, very complicated questions. Okay. Ones that most, many experts would not excel at or succeed in. Okay. And it's a whole bunch of different areas that it tests you. And so in any case, this outperforms the o3 model but. Or, sorry, it underperforms the o3 model, but it does outperform a bunch of leading open models from DeepSeek and Qwen. So the Chinese companies that are releasing their open source models, it beats the Chinese companies, but it's not better than what open source has or what OpenAI has closed source, what basically they're, they're selling. So I mean, that kind of makes sense. One thing I will say though is OpenAI's new model does hallucinate much more than its latest, you know, o3 or o4-mini models. So that is not a particularly fantastic statistic. So this is definitely something that's interesting because it seems like these have actually, these hallucinations have actually been getting more severe, which is not a good thing. We don't want to hear that OpenAI's models are getting worse and worse at hallucinating. They don't even really understand why. Which is another thing that's not super great. They wrote a white paper where they said, quote, basically they said that it is, quote, expected as smaller models have less world knowledge than large frontier models and they tend to hallucinate more. So basically they're blaming it on the fact that there's less data, less parameters inside of these models. That's why they're hallucinating more. But yeah, it's kind of a trend that you see in a lot of places. It's kind of interesting. Basically they found that their 120 billion parameter model hallucinated in response to like 49% of a benchmark called PersonQA. 
So it's, it's one that is asking the AI model basically about people. So like who is Tom Cruise? And give me information about him who is. And it goes through like tons of people, some that are famous, some that are less famous and they'll see if it hallucinates. If you've ever tried it, like if you go say who is Jaeden Schafer to ChatGPT, like sometimes it'll grab information from, you know, the web and whatever, my LinkedIn. But like half the time especially I remember in the early days would throw in tons of random stuff that weren't true. Um, and so I think with these older models we're kind of, it's kind of like. Or these open source models are kind of like some of the older models and they throw a bunch of funny things in there. So that's an area that hallucinates. Doesn't mean it hallucinates on every topic but on people it definitely is doing that. So yeah, it's, they're, they're newer models that they, they have that they pay for. Like their o1 model had a 16% score there. So you know, 16% hallucinations about people versus these models are like 49% and 53%. It's a lot more. So anyways, it's kind of funny, OpenAI said that basically they are not going to share what data they use to train these models. And basically I think this comes down to a whole bunch of lawsuits that are happening where people are saying that OpenAI uses copyrighted data. I'm assuming that they did, they're just not announcing it. They're trying to probably work with some regulators in Washington to make it like all kosher and okay before they like officially say that they're not. So I think that this is interesting. OpenAI does say that the model that they have for their 120 billion parameter model, it only activates 5.1 billion parameters per token. They also say that their model was trained using something called high-compute reinforcement learning. Basically this is a process, it's a post-training process that helps to teach AI models right from wrong. 
Like the right answer, the wrong answer. And they do this in a simulated environment and they do all of this using basically a really big cluster of Nvidia GPUs. So this is how they trained their o-series of models, right? Their o3 and o4. And so it has basically a similar chain of thought process where they, it takes longer but it's saying like okay, how would I solve this problem? It comes up with a list and then it works through that list chain of thought on solving different problems. And we typically get better answers when we do this. Now here's what's exciting in my opinion. They are releasing both of these models under the Apache 2.0 license. So this is really considered as one of the most, I guess like lenient licenses. It will allow companies to monetize this model, right? So you can actually charge money for it. Unlike some things that Meta was doing when they were releasing. They like, were like, here's our like open source models versions of Llama. But like companies, you gotta talk to us if you want to like make any money off of it. OpenAI is being really generous letting people make money off of it. They don't have to pay OpenAI and they don't have to get permission from OpenAI. So this is really just a free gift to the world of an open source model, which is in my opinion really, really cool. I will say unlike fully open source models, what's the difference? They're calling it an open model, but it's not totally an open source. The difference is that companies like AI labs, or sorry, AI2 have like a fully open source model. The difference is that OpenAI is not going to release the training data that they use to create their models. I kind of already talked about it. It's basically for legal reasons. They probably have copyrighted stuff in there which is their own choice. But basically anyone that gets the model is going to benefit from it because the model is going to answer and be way more accurate and have higher quality results. 
So that's kind of the outcome of that. I will say that they have delayed this model multiple times. I have personally been disappointed when I've seen Sam Altman's tweets on Twitter over the last couple months where they keep delaying it for safety reasons. They think that it is a lot safer now. Basically the things that they said they were concerned about was cyber attacks or the creation of biological or chemical weapons. Basically you get information from the models that could help you do those two things. It seems like they've kind of put some guardrails and made the model better now. So it doesn't do that. They had a bunch of third party evaluators actually test it and they said that it marginally increases biological capabilities, but it didn't find evidence that they were going to have a very high capacity threshold for danger in these domains after fine tuning. So I think even, or sorry even if you tried to fine tune to be able to do that. So I think it's going to be a much safer model. It's definitely state of the art amongst other open models, right. So if we're looking at like DeepSeek and Qwen and Meta's Llama, it's, it's definitely the top of the pack there. We're also waiting for DeepSeek R2 to release which also should give it a run for its money. So it'll be interesting to see what's happening there. So all of this is going down, which is really, really interesting. And the last thing I wanted to bring up though is that Microsoft is basically bringing their smallest model, right? So they have the 20 billion parameter model. They're bringing it to a bunch of Windows users, which is pretty interesting. It's going to be for any Windows 11 users. It's via Windows AI Foundry. So this is kind of their platform that lets you use AI APIs and a bunch of popular open source models on your computer. Microsoft in a blog post said tool savvy and lightweight, optimized for agentic tasks like code execution and tool use. 
It runs efficiently on a range of Windows hardware with support for more devices coming soon. It's perfect for building autonomous assistants or embedding AI into real world workflows, even in bandwidth constrained environments. So basically what you actually need if you want to run this and this will be starting on Tuesday, but it'll be able to run on most consumer PCs and laptops, but you have to have at least 16 gigs of VRAM, which basically a modern GPU from Nvidia or Radeon would have. OpenAI said that the model was trained using high-compute reinforcement learning. So it pretty much excels at powering AI agents and a bunch of other tools. It can do web search, it can do Python code execution and all of that. So really, really impressive. I'm excited to see where this goes. I'm excited to see that Microsoft is kind of rolling out some sort of integrations so that a lot of people can use this. And this is a really cool moment. You can go download this today on Hugging Face, which is super cool and I'm excited to see what people build with it, what companies start using it. This is just honestly a gift to the world and I'm sure OpenAI has more exciting things up their sleeve like GPT-5 that'll probably blow this out of the water. But for what it's capable of doing, anyone gets access to a really world class AI model and so I'm quite excited about that. All right, thanks so much for tuning into the podcast. Make sure to go check out AIBox.ai if you want to try out a lot of the different models I talk about on the show. For 20 bucks a month, it's an amazing value and I would love to hear what you have to think about or have to say about it. Because it's currently in beta, we're taking feedback and adding tons of new features all the time. Thanks so much for tuning in and I will catch you in the next episode.