Today on the podcast we're talking about OpenAI, which has just dropped two open-weight models. Now this is actually really big news, because this is the first time in five years, going all the way back to GPT-2, that they've released anything open. And this is something that's gotten a ton of criticism. Basically all of Elon Musk's online AI beef, pretty much why he says he started xAI, and a lot of the drama and heat that has been thrown at OpenAI, comes down to the fact that they started as an open source company and hadn't released anything open since, and now they have officially launched some quote unquote open models.

I'm going to be talking about the difference between open source and open, and where these models sit. I'm also going to go through the benchmarks of how these models perform, because one criticism has been that they dropped these models just to say they're open, but that they're not actually that good. And I'm not going to lie, I'm impressed by some of the benchmarks, and there are a couple of interesting nuances I want to go over. At the same time, Microsoft has just announced that they're going to be bringing the smaller of the open models to Windows users. So there's a ton of really interesting stuff getting rolled out right now, and we'll be covering all of it on the podcast today.

Before we get into it, I wanted to mention: if you want to try any of the AI models we talk about on the show, I'd love for you to go check out my own startup, AI Box AI, where we have the top 40 or so AI models from Anthropic, Cohere, DeepSeek, Google, OpenAI, Meta, and tons of others, plus AI audio models like ElevenLabs and a bunch of really interesting image models. For 20 bucks a month you get access to all of them.
So my hope there is not just that it'll save you some money on the exorbitant number of AI subscriptions you could be paying for, but that you'll be able to find and try out a whole bunch of different AI models you hadn't heard of or used before. I think there are a lot of really great, under-the-radar models that can do great things in specific tasks. We have benchmark data and we break down which models are best for what on the platform. So go check it out, it's 20 bucks a month, AI Box AI.

All right, let's get into what OpenAI is doing. The first benchmark I want to talk about is the Codeforces benchmark. They ran gpt-oss-120b, the bigger of the two open models; there's a 120 billion parameter one and a smaller 20 billion parameter one. The 120 billion parameter model got an Elo score on Codeforces of roughly 2600. Just to compare that with OpenAI's other tools: their o3 model got 2700, and their o4-mini model got 2720. So these things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000. So it did pretty decently.

Now, a lot of these scores are rated with or without tools, and that 2,000 benchmark I just quoted was without tools. What exactly do tools mean here, and does it matter? Yes, I would say it's important. Tools basically means they gave the AI model things like calculators and other apps. So it is completing the tasks, yes, but it's able to rely on actual hard software to get good results. And this is something we lean on a lot in AI, because with these LLMs, we've found that they're perhaps not fantastic at a math problem, or some really intense molecular biology question, when they're just guessing what should come next in the line.
But they're good at figuring out what they need to do. So we leverage that: the model figures out which tool to use, then we bring the tool in to solve the more calculation-heavy problem.

Okay, so does it matter that they're giving us benchmarks with or without tools? Yeah, I think the big thing here is that when they release the model, open source or quote unquote open, for everyone to download and use, they're not releasing it with the tools. They give the benchmark with tools and without tools, but they don't give us the tools, because those are kind of OpenAI's proprietary stack. What I will say is that it's still a useful benchmark, because big companies and software startups can build their own tools, and usually, if you're taking one of these models and putting it into your startup to do a certain task, you're going to be building custom tools anyway. I think back to my first startup, which was called Self Paws. It was a no-code... sorry, it was an AI life coach. You would talk to ChatGPT, it would act like a life coach and work you through different questions, and we built our own custom tooling to instruct and guide how the AI model would run a conversation. I think most software startups would be similar.

Okay, Humanity's Last Exam. This is kind of a notorious benchmark, abbreviated HLE, and the concept is that it's humanity's last exam before AGI. It's got a whole bunch of really complex questions. You heard this exam quoted a lot by xAI when they released their latest version of Grok, which did really well on it. With tools, gpt-oss-120b and gpt-oss-20b, the 120 billion and 20 billion parameter models, scored 19% and 17%. I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model. Again, this is an incredibly hard test.
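As an aside on the tools question: the pattern described above, where the model decides which tool to use and real software does the heavy computation, can be sketched in a few lines. This is a minimal, hypothetical sketch; `pick_tool`, the `TOOLS` registry, and `calculator` are invented for illustration and are not OpenAI's actual tool stack.

```python
import ast
import operator
import re

def calculator(expression: str):
    """A 'hard software' tool: exact arithmetic instead of letting the model guess."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval"))

# Registry of available tools; a real system would describe these to the model.
TOOLS = {"calculator": calculator}

def pick_tool(question: str):
    """Stand-in for the LLM's routing step: decide which tool to call and with
    what arguments. In production this decision comes from the model's output."""
    match = re.search(r"\d[\d\s+*/.-]*", question)
    if match:
        return "calculator", match.group().strip()
    return "none", ""

def answer(question: str):
    name, args = pick_tool(question)
    if name in TOOLS:
        return TOOLS[name](args)  # ground the answer in real computation
    return "model answers directly"

print(answer("What is 1234 * 5678?"))  # exact answer: 7006652
```

In a real system the routing step would come from the model's own output (for example, a structured tool call), but the division of labor is the same: the LLM picks, the tool computes.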
So lest you're thinking, oh, 20%, this is terrible: I don't think I would be able to get any questions right on this test. It's super in-depth, you know, like translating the ancient Hebrew meaning of a hieroglyph into how it converts to something else. These are very, very complicated questions that most experts would not excel at or succeed in, and it tests a whole bunch of different areas. In any case, this underperforms the o3 model, but it does outperform a bunch of leading open models from DeepSeek and Qwen. So it beats the Chinese companies that are releasing open-weight models, but it's not better than OpenAI's own closed source models, the ones they're actually selling. Which, I mean, kind of makes sense.

One thing I will say, though, is that OpenAI's new models hallucinate much more than its latest o3 or o4-mini models, and that is not a particularly fantastic statistic. This is definitely interesting, because these hallucinations have actually been getting more severe, which is not a good thing. We don't want to hear that OpenAI's models are hallucinating more and more, and they don't even fully understand why, which is another thing that's not super great. In a white paper they said this is, quote, expected, as smaller models have less world knowledge than large frontier models and tend to hallucinate more. So basically they're blaming it on the fact that there's less data and fewer parameters inside these models. That's why they're hallucinating more. It's a trend you see in a lot of places, and it's kind of interesting. They found that their 120 billion parameter model hallucinated in response to 49% of questions on a benchmark called PersonQA.
PersonQA is a benchmark that asks the AI model about people. So, like, "Who is Tom Cruise? Give me information about him." And it goes through tons of people, some famous, some less famous, and checks whether the model hallucinates. If you've ever tried it, like if you ask ChatGPT "who is Jaden Schaefer," sometimes it'll grab information from the web, or my LinkedIn, but, especially in the early days, I remember it would throw in tons of random stuff that wasn't true. And so these open models are kind of like some of those older models: they throw a bunch of funny things in there. That doesn't mean they hallucinate on every topic, but on people they definitely do. Their newer paid models do better; their o1 model had a 16% score there. So, you know, 16% hallucination on questions about people, versus 49% and 53% for these models. It's a lot more.

Anyway, it's kind of funny: OpenAI said that they are not going to share what data they used to train these models. I think this comes down to a whole bunch of lawsuits where people are saying that OpenAI uses copyrighted data. I'm assuming that they did; they're just not announcing it, and they're probably trying to work things out with regulators in Washington to make it all kosher before they officially say anything.

Interestingly, OpenAI does say that the 120 billion parameter model only activates 5.1 billion parameters per token. They also say the model was trained using something called high compute reinforcement learning. This is a post-training process that helps teach AI models right from wrong.
Like the right answer versus the wrong answer. They do this in a simulated environment, using a really big cluster of Nvidia GPUs, and this is how they trained their o-series models, o3 and o4-mini. So it has a similar chain-of-thought process: it takes longer, but it's saying, okay, how would I solve this problem? It comes up with a list and then works through that list, chain-of-thought style, on different problems. We typically get better answers when we do this.

Now here's what's exciting, in my opinion. They are releasing both of these models under the Apache 2.0 license. This is considered one of the most lenient licenses, and it allows companies to monetize the model, so you can actually charge money for it. That's unlike some of what Meta was doing when they released their versions of Llama, where it was like, here are our open source models, but companies, you've got to talk to us if you want to make money off them. OpenAI is being really generous letting people make money off of it. They don't have to pay OpenAI and they don't have to get permission from OpenAI. So this is really just a free gift to the world of an open model, which is, in my opinion, really, really cool.

I will say, unlike fully open source models, what's the difference? They're calling it an open model, but it's not totally open source. Companies like Ai2, the Allen Institute for AI, have fully open source models. The difference is that OpenAI is not going to release the training data they used to create their models. I already talked about why: it's basically for legal reasons, since they probably have copyrighted stuff in there, which is their own choice. But anyone who gets the model still benefits from that data, because it makes the model more accurate and gives it higher quality results.
So that's the outcome of that. I will say they have delayed this model multiple times. I have personally been disappointed seeing Sam Altman's tweets over the last couple of months where they kept delaying it for safety reasons. They think it's a lot safer now. The things they said they were concerned about were cyber attacks and the creation of biological or chemical weapons, basically that you could get information from the models that would help you do those two things. It seems like they've put in guardrails and made the model better so it doesn't do that. They had a bunch of third party evaluators actually test it, and they said that it marginally increases biological capabilities, but they didn't find evidence that it would reach a high capability threshold for danger in these domains, even after fine-tuning. So I think it's going to be a much safer model. It's definitely state of the art among open models; if we're looking at DeepSeek and Qwen and Meta's Llama, it's definitely at the top of the pack there. We're also waiting for DeepSeek R2 to release, which should give it a run for its money, so it'll be interesting to see what happens there.

So all of this is going down, which is really, really interesting. The last thing I wanted to bring up is that Microsoft is bringing the smaller model, the 20 billion parameter one, to a bunch of Windows users, which is pretty interesting. It's going to be for Windows 11 users, via Windows AI Foundry, which is their platform that lets you use AI APIs and a bunch of popular open source models on your computer. Microsoft, in a blog post, called it tool savvy and lightweight, optimized for agentic tasks like code execution and tool use.
It runs efficiently on a range of Windows hardware, with support for more devices coming soon. It's perfect for building autonomous assistants or embedding AI into real world workflows, even in bandwidth constrained environments. So, what do you actually need if you want to run this? This will be starting on Tuesday, and it'll run on most consumer PCs and laptops, but you need at least 16 gigs of VRAM, which a modern GPU from Nvidia or AMD Radeon would have. OpenAI said the model was trained using high compute reinforcement learning, so it excels at powering AI agents and a bunch of other tools. It can do web search, Python code execution, and all of that. Really, really impressive.

I'm excited to see where this goes, and I'm excited that Microsoft is rolling out integrations so a lot of people can use it. It's a really cool moment. You can go download this today on Hugging Face, which is super cool, and I'm excited to see what people build with it and what companies start using it. This is honestly a gift to the world, and I'm sure OpenAI has more exciting things up their sleeve, like GPT-5, that'll probably blow this out of the water. But for what it's capable of, anyone gets access to a really world class AI model, and I'm quite excited about that.

All right, thanks so much for tuning into the podcast. Make sure to go check out AI Box AI if you want to try out a lot of the different models I talk about on the show. For 20 bucks a month, it's an amazing value, and I would love to hear what you have to say about it. Because it's currently in beta, we're taking feedback and adding tons of new features all the time. Thanks so much for tuning in, and I will catch you in the next episode.
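Two of the numbers above, the 16 GB VRAM requirement and the 5.1-billion-of-120-billion active-parameter figure, are easy to sanity check with back-of-the-envelope math. The sketch below assumes roughly 4-bit quantized weights, which is an assumption for illustration rather than an official spec; real memory use also includes activations and KV cache overhead.

```python
# Back-of-the-envelope math for two figures from the episode.
# ASSUMPTION: weights quantized to about 4 bits each; real deployments also
# need memory for activations and the KV cache, so treat this as a floor.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate gigabytes needed just to store the weights."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# The 20B model at ~4 bits per weight fits under the 16 GB VRAM figure...
print(weight_memory_gb(20, 4))    # 10.0 GB
# ...while at full 16-bit precision it would not.
print(weight_memory_gb(20, 16))   # 40.0 GB

# Sparse mixture-of-experts: the 120B model reportedly activates only
# ~5.1B parameters per token, so per-token compute tracks the active
# slice, not the full parameter count.
active_fraction = 5.1 / 120
print(round(active_fraction, 3))  # ~0.042, i.e. about 4% of weights per token
```

This is why a 120 billion parameter model can still be cheap to run per token, and why the 20 billion parameter sibling can plausibly fit on a single consumer GPU.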
The Mark Cuban Podcast: Episode Summary
Title: Why OpenAI’s New AI Agents Are Causing a Stir
Release Date: August 9, 2025
In this episode, The Mark Cuban Podcast delves into a recent significant move by OpenAI: the release of two open-weight AI models. Mark Cuban discusses the broader implications of this decision, highlighting that it marks the first open release from OpenAI since GPT-2, five years ago. The release has sparked considerable debate and criticism, particularly from figures like Elon Musk, who has been vocal about his concerns with OpenAI's approach to AI development.
Key Points:
Mark Cuban provides an in-depth analysis of how these new open models perform, comparing them against existing OpenAI models and other open source alternatives.
The first benchmark discussed is the Codeforces benchmark, which measures the models' capabilities in competitive coding tasks.
Notable Quote:
"These things aren't very far apart. It definitely did better than the o3-mini model, which only got 2,000."
— [Timestamp: 05:30]
The second benchmark is Humanity's Last Exam (HLE), a rigorous test designed to evaluate a model's understanding and reasoning in complex, interdisciplinary subjects.
Notable Quote:
"I was actually really impressed that the 20 billion parameter model got 17%. That's not very far behind 19%, which is the 120 billion parameter model."
— [Timestamp: 12:45]
Cuban emphasizes the importance of evaluating AI models both with and without tools. "Tools" refer to supplementary software like calculators or specialized applications that aid the AI in performing tasks more accurately.
Notable Quote:
"Tools basically means they gave the AI model things like calculators and other apps. So it is completing the tasks, yes, but it's able to rely on actual hard software to get good results."
— [Timestamp: 08:15]
A significant concern highlighted is the tendency of AI models to "hallucinate," that is, to generate inaccurate or fabricated information.
Notable Quote:
"Basically they're blaming it on the fact that there's less data and fewer parameters inside these models. That's why they're hallucinating more."
— [Timestamp: 16:20]
OpenAI has released these models under the Apache 2.0 license, which is notably permissive and allows for commercial use without requiring payments or permissions from OpenAI. This contrasts with other companies like Meta, which impose restrictions on monetizing their open models.
Notable Quote:
"OpenAI is being really generous letting people make money off of it. They don't have to pay OpenAI and they don't have to get permission from OpenAI."
— [Timestamp: 22:10]
The release of these models was delayed multiple times due to safety concerns. OpenAI prioritized ensuring that the models would not be misused for cyber attacks or the creation of biological or chemical weapons.
Notable Quote:
"They had a bunch of third party evaluators actually test it, and they said that it marginally increases biological capabilities, but they didn't find evidence that it would reach a high capability threshold for danger in these domains, even after fine-tuning."
— [Timestamp: 20:40]
Expanding on the topic of AI integration, Cuban discusses Microsoft’s initiative to incorporate OpenAI’s smaller 20 billion parameter model into Windows through the Windows AI Foundry platform.
Notable Quote:
"It's perfect for building autonomous assistants or embedding AI into real world workflows, even in bandwidth constrained environments."
— [Timestamp: 25:30]
Mark Cuban concludes the episode with an optimistic view of the future of open AI models. He anticipates further advancements from OpenAI, potentially hinting at future iterations like GPT-5, and expresses excitement about the possibilities unlocked by the current releases.
Notable Quote:
"It's a really cool moment. You can go download this today on Hugging Face, which is super cool and I'm excited to see what people build with it, what companies start using it."
— [Timestamp: 30:15]
This episode provides a comprehensive overview of the evolving landscape of open AI models, particularly focusing on OpenAI’s strategic release of new models and Microsoft's integration efforts. Mark Cuban effectively highlights both the opportunities and challenges presented by these advancements, offering listeners valuable insights into the future of AI technology.