Yesterday, OpenAI announced their new model, O3. This has a ton of insane implications. A lot of people are saying this is completely the end of software engineers. It beat so many benchmarks, blew so many things away, and fundamentally shifted how I think we're going to see AI scale into the future. As OpenAI chases what's called AGI, artificial general intelligence, I want to go over all of it and do a deep dive today.

I didn't record and post this podcast yesterday. I saw a lot of people posting about it as it was coming out, but I wanted to absorb all the information, do a really big deep dive, and make sure that everything we're sharing and covering today is well researched and gets past the hype bubble. So you're getting this 23 hours after the announcement, which, to be fair, isn't too bad. If you're listening to this on Apple Podcasts, I'd recommend checking out Spotify or YouTube, since we'll be sharing screens there. If that's not possible, no worries; I'll explain everything going on.

Before we get into the episode, I wanted to mention: if you've ever wanted to start your own AI side hustle, or grow and scale your current business using AI tools, I would love to have you as a member of the AI Hustle school community. Every single week I release an exclusive video or piece of content with my co-host Jamie, where we go over the tools we're currently using to grow and scale our businesses: the exact numbers, how much we're making, the exact workflows, everything we can't really share publicly about how to grow your business and become an AI consultant for your industry and niche, something I've done myself. I have friends that have made hundreds and hundreds of thousands of dollars this year doing it, along with all the different side hustles, like how I was able to make $19,000 making some videos for Amazon, a project anyone can essentially replicate, and a bunch of other things. We have over 250 members now in the group. Last week it was a hundred dollars a month, but for Christmas, for the holidays, we've dropped it to $19 a month as a holiday special. If you've ever wanted to join the group, now is the time to do it. We'd love to have you as a member and help you grow and scale your business using AI tools.

All right, let's get into the podcast. OpenAI has announced their new O3 models, two specifically: O3, the main model, and O3 mini, which follows the format we're seeing from all of these big AI companies, where they have the best model and then a mini model that's faster, more efficient, and costs less money. This is day 12 of OpenAI's quote-unquote "Shipmas" event, where every single day they posted something new, and they saved the big unveil for last. To be fair, they also announced during this event that Sora, their AI video model, is going to be available. So a lot of exciting stuff happening right now. What is this thing actually capable of doing? How good is it? What's the difference? The first thing I find funny is that they announced O1 only three months ago.
Now, three months later, they're already doing the next one, which is O3. Technically they couldn't name it O2: Sam Altman said it should have been named O2, but there's a telecom company in the UK called O2, and they didn't want to infringe on the trademark. So they're naming it O3, but it is essentially the next model, the next version of O1.

How much better is it? Greg Brockman, one of the OG founders who is still at the company, said: "o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. We are starting safety testing and red teaming now." So this thing clearly isn't publicly available yet; they're red teaming it, and you can assume they're also training or working on the next model, if this even involved new training at all. A lot of people are speculating that this isn't necessarily more training: it may be their existing GPT-4o-class model just using more elaborate chain-of-thought reasoning, what they're calling reasoning models.

Now, the thing I want to bring up that I think is really important: during the big announcement event, they brought on a very interesting guest from the ARC Prize team, the company behind the ARC benchmark. Back in 2019, I believe, the ARC benchmark was created to test when AGI would arrive, and someone from their team came on to talk about the model and what it's capable of doing. He said no one has ever gotten close to touching this thing. They have awards, the ARC Prize, where they give away money and prizes every year to the models that do well, and no one has ever come close. And he officially announced a new state-of-the-art score.

Before he did that, he went over some of the questions that can be found on this benchmark and asked how people would expect them to be solved. He showed one chart that was essentially a bunch of blocks: that was the input, and the output was two blue blocks filling in the hole on the squares. It's a little hard to explain, but essentially the test shows the AI model a few examples: here's an input and here's the corresponding output. Now here's a new input; what's the output? Apparently these fill-in-the-blank tests are incredibly difficult for AI models; historically they've done a terrible job, and O3 is getting quite good at them. They showed a bunch of more complex ones as well.

The first thing to understand about these new models is that they get better if you give them more compute and more time. There's a scale being added where you can dial it back: give it a little less compute and time and it's cheaper, or, if you have a lot of money, you crank that thing up and it gets more expensive. And by how much more money, I will explain, because this is so fascinating.
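To make that dial concrete, here's a minimal sketch of what asking the same question at different effort levels could look like in code. The model name and the `reasoning_effort` parameter are assumptions for illustration; OpenAI hadn't published API details at the time of this announcement, so check the current docs for what actually ships.

```python
# Hypothetical sketch: the same question at three "effort" levels.
# `model` and `reasoning_effort` are assumed names, not confirmed API details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How many distinct 4-colorings does the cycle graph C5 have?"

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3-mini",          # assumed model name; not released at recording time
        reasoning_effort=effort,  # assumed dial: more effort = more compute and cost
        messages=[{"role": "user", "content": question}],
    )
    print(effort, "->", response.choices[0].message.content)
```

The idea is that the prompt stays identical and only the dial changes; what you're buying at "high" is more search time, not a different question.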
So on O1 they ran low, medium, and high, and on this one particular benchmark it scored 25% at low, 31% at medium, and 32% at high: diminishing returns from medium to high, so pretty much the max O1 could do was 32%. The new O3 model, on the same benchmark, got 75.7% at low and 87.5% at high. So we're going from O1's best of 32% on this benchmark to 87.5%. This is absolutely insane. We're getting close to 100% performance on this AGI evaluator. And they did say, hey, be careful, don't use this as an actual indication of AGI. What's interesting is that the benchmark company said they need to make a new evaluation for these AI models, because they're getting so good that we have to start testing new things. It's getting better than humans. But how much better than humans? Up until this point, the evaluations have essentially asked, "how good is a human?" Now it's, "okay, there's no way a human can solve this insane math problem, but a computer maybe could, so how much better is it?" Anyway, that was one of my big takeaways from that conversation.

The other thing I wanted to bring up is that they also shared how good this is at software development. On SWE-bench, a software engineering benchmark, O1-preview got 41%, and O1 got 48.9%, so let's call it 49%, getting pretty close to 50%. O3 got 71.7%, close to 72%. So we went from about 50% to about 72% going from O1 to O3, maybe without even changing the model, maybe just changing the chain of thought it runs through. There are some interesting things going on there that I'll explain in a bit, but this is a massive improvement.

And the craziest thing they said: O1 was announced three months ago, O3 is being announced in December, O3 mini is going to roll out in late January, and O3 will come after that. So we're actually getting O3 mini, the faster, smaller, cheaper model, next month; it's literally coming in probably 30-ish days. It's very exciting to me that they announced when they're rolling this thing out. And what's interesting is they said they expect this rate of improvement to continue. If every three months we're seeing a massive step up, that blows past the assumption that AI models had hit a wall and we weren't able to figure this out.

Why is this software engineering benchmark so interesting? Because the AI model didn't actually get better at all areas; this is one area where it got much better. On competitive programming, its current rating sits around the 2700 Elo mark, which essentially sticks it up in the grandmaster tiers. For those that don't know, the ladder goes: newbie, pupil, specialist, expert, candidate master, master, international master, grandmaster, international grandmaster, legendary grandmaster. So it is literally sitting at the second-highest title on that ladder.
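As a worked example of where that rating lands, here's a small sketch mapping an Elo rating onto the competitive-programming title ladder. The cutoffs below are the commonly cited Codeforces-style thresholds; treat the exact numbers as approximations, since they have shifted over the years.

```python
# Rough sketch: map a Codeforces-style Elo rating to its title.
# Thresholds are commonly cited approximations, not official current values.
def codeforces_title(rating: int) -> str:
    thresholds = [
        (3000, "Legendary Grandmaster"),
        (2600, "International Grandmaster"),
        (2400, "Grandmaster"),
        (2300, "International Master"),
        (2100, "Master"),
        (1900, "Candidate Master"),
        (1600, "Expert"),
        (1400, "Specialist"),
        (1200, "Pupil"),
    ]
    for cutoff, title in thresholds:
        if rating >= cutoff:
            return title
    return "Newbie"

print(codeforces_title(2700))  # -> International Grandmaster, the second-highest title
```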
At OpenAI they were talking about this internally with their own software engineers, and on competitive programming this new model actually beat the researcher who was unveiling it: it was higher than his top score, and he said there's only one person at OpenAI who's better than it. And these are people who are competitive programmers. In addition to that, globally it's ranked at around 175th, I believe, among competitive programmers. They ran this O3 model through competition against all the competitive programmers, and only 174 programmers in the world were better than it. That is insane, if you can get this thing in your pocket.

So a lot of people are now asking: okay, cool, how much does it cost? There's some good news and some bad news on cost. The bad news is it's very expensive. The good news is that for companies with a lot of money, that's not a problem. So take that for what you will. A lot of people are joking that us normie, broke people are going to get the O3 mini model, and the rich people, who aren't peasants, are getting O3 high. So how much does this cost? $1,000 per request. Imagine that: $1,000 per request for the high-tuned model where you give it the most compute.

There's a chart that got published with the exact scores of O1 versus O3 and the cost per task. Everyone was already saying the O1 model was very expensive: giving it the highest amount of compute was coming in somewhere around $6.50 to $7.50, call it $6 to $7, per request. People don't realize this is why ChatGPT only gives you a limited number of these, like 25 O1 requests a month or something crazy like that, or a day or whatever: these things are actually very expensive. I'm assuming they're giving everyone O1 low, which is closer to $2 a request. If you're paying $20 for your ChatGPT subscription and every message costs a minimum of $2, that's expensive, right? And the reason is that it takes your question and runs it through all of these different chains of thought, maybe 10, 20, even 100 times.

So let's say the best O1 model you could get, which I don't think is publicly available but is pretty much what we're probably using now, costs $2 per request. The lowest setting of the O3 model, which is what I'm assuming will launch in January, comes in at, oof, I want to say $25 to $30 per question. That sounds insane. But when this thing maxed out is better than all but a handful of programmers, ranked around 175th in the world, so that everyone effectively gets access to one of the top software developers, period, that is a massive impact. A lot of people are saying this is the end of software development, that there's no way software developers could ever get better than this thing. That's a very interesting conversation to have, sure. But no doubt about it, this thing is very, very good at software development. Now, how much does it cost when they've completely maxed it out? Over a thousand dollars. Depending on how you want to read the scale, a single response could be close to $7,000.
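Here's a back-of-the-envelope sketch of why per-task cost explodes as you turn the compute dial up. Every number in it is an illustrative assumption, not OpenAI's actual pricing; the point is only the shape of the arithmetic.

```python
# Back-of-the-envelope sketch of test-time-compute cost.
# All numbers are illustrative assumptions, not OpenAI pricing.
PRICE_PER_MILLION_TOKENS = 60.00  # assumed blended $/1M tokens for a frontier model

def cost_per_task(tokens_per_chain: int, num_chains: int) -> float:
    """Total cost if the model samples `num_chains` chains of thought,
    each consuming roughly `tokens_per_chain` tokens."""
    total_tokens = tokens_per_chain * num_chains
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"low effort:  ${cost_per_task(55_000, 6):,.2f}")      # ~$20: a handful of chains
print(f"high effort: ${cost_per_task(55_000, 1024):,.2f}")   # ~$3,400: a thousand chains
```

Cost scales linearly with how many chains you sample and how long each one runs, which is how the same underlying model can cost $20 per task at one setting and thousands at another.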
Now, I know that sounds crazy: $7,000 for a single response. But if you say, "hey, write me an app that does X, Y, and Z," and it can one-shot that for seven grand, or fix an entire codebase, I think a lot of people don't realize how much software developers cost. It's insanely expensive. We're essentially building an artificial version of the brains that companies are renting out for $400,000 a year in total compensation. It's really, really crazy. So this is going to be a big deal for big companies with a lot of money, but for most people it isn't realistic right now.

A lot of people are concerned about that. They're saying this is ridiculous, this costs too much money, this is unrealistic. So what are people saying about the price? Going through a lot of X posts right now, there are a lot of experts weighing in, and it's pretty interesting. One in particular is the CEO of Box, a friend of the podcast because he loves the name of my startup, AI Box. Shout out to Aaron Levie. Here's what Aaron said about it: "OpenAI's o3 model appears to be better at reasoning than any other model out there. It costs way more to operate, but that's irrelevant. What's expensive today is cheap tomorrow. Quality is all that matters, because you know that costs will always drop."

This has been very accurate for all the models. The costs come down because they're able to optimize the models themselves. And beyond optimizing the models, you can assume that in the future we'll find new, cheaper ways to generate electricity, meaning nuclear is going to have to be a big part of it. The new Trump administration in the United States has said this is something they'll focus on. So nuclear will, presumably, bring down the cost of energy, though it takes at least five years, maybe more, to build a lot of these nuclear reactors. So we have a little bit of time to get that figured out, and we obviously need to make a lot of progress. But energy costs will come down and these models will get optimized, so the price will drop. All that matters is quality: how good are these things? If it can effectively be one of the best software engineers in the world for less than the cost of hiring one, that matters. People are saying that the big benchmark run that earned it the headline ARC score cost OpenAI around $300,000 in compute. That might sound super crazy, but it proves it's possible to scale if you have more compute, which is more time and money. So scaling is completely possible now; it just comes down to how we get the time down and how we get the cost down. Those are the big things.

Okay, I wanted to go over a tweet from Amjad Masad. He's the CEO of Replit, an amazing company if you haven't tried them out. He said: "Based on benchmarks, OpenAI's o3 seems like a genuine breakthrough in AI. Maybe a start of a new paradigm. But what's new is also old. Under the hood, it might be AlphaZero-style search and evaluation. The authors of the ARC-AGI benchmark speculate on how it works." Remember, at the beginning of the podcast, I was talking about how they ran this thing through the ARC benchmark and it scored very high; it did very well on that benchmark.
Higher than most people or most companies had ever done: the exact score was 87.5 on ARC, a very, very high score. And the ARC Prize team has broken down exactly how they speculate O3 is working, how this model actually achieves this. Here's what they say: "For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space: at test time, the model searches over the space of possible chains of thought."

This is so interesting. Essentially, what may be going on is that it's the same underlying model, maybe literally GPT-4o or something like it, and what these quote-unquote reasoning models add is something like a whole library of chains of thought. So if you're solving a math problem: for math problems, this is the chain of thought; first determine what the numbers are, then determine whether this is algebra, or calculus, or linear equations, and so on through a list of a hundred things. It's like they've built decision trees: if it's math, this is the chain of thought you run it through; if it's science, use these chains of thought; if you're writing something, use these; if you're writing code in such-and-such language, use this one. And it runs through multiple chains of thought. It's still using natural language processing, it's still the ChatGPT-style model underneath. But how did these things get so much better at math all of a sudden? They were notoriously bad at math: great at writing, bad at math for a long time, which also made them kind of bad at code. The guess is that they're fixing this with essentially this library of chains of thought, with the LLM searching through possible chains of thought describing the steps required to solve the task. As ARC Prize puts it, this happens "in a fashion perhaps not too dissimilar to AlphaZero-style Monte Carlo tree search," with the search presumably guided by some kind of evaluator model.
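To illustrate that speculated mechanism, here's a toy sketch: route a task to a domain-specific chain-of-thought template, sample several candidate chains, score each with an evaluator, and keep the best. Every function here is a stand-in; in a real system, `generate_chain`, `score`, and even `classify` would be LLM calls, and the templates, keywords, and scoring are made up for illustration.

```python
# Toy sketch of the speculated "library of chains of thought" plus
# evaluator-guided selection. All functions are stand-ins for LLM calls.
import random

COT_TEMPLATES = {
    "math":    "First identify the quantities, then pick a method (algebra, "
               "calculus, ...), then solve step by step and verify.",
    "code":    "First restate the spec, then outline the algorithm, then write "
               "the code, then trace it on an example.",
    "writing": "First outline the structure, then draft, then revise for tone.",
}

def classify(task: str) -> str:
    # Stand-in classifier; a real system would likely use the model itself.
    if any(word in task for word in ("solve", "integral", "equation")):
        return "math"
    if any(word in task for word in ("function", "bug", "implement")):
        return "code"
    return "writing"

def generate_chain(task: str, template: str, sample_id: int) -> str:
    # Stand-in for an LLM call that samples one chain of thought.
    return f"[candidate {sample_id}: reasoning about {task!r} via '{template[:25]}...']"

def score(chain: str) -> float:
    # Stand-in for the speculated evaluator model that rates each chain.
    return random.random()

def answer(task: str, num_chains: int = 8) -> str:
    template = COT_TEMPLATES[classify(task)]
    candidates = [generate_chain(task, template, i) for i in range(num_chains)]
    return max(candidates, key=score)  # keep the highest-scoring chain

print(answer("implement a function that deduplicates a list"))
```

Note how the cost story falls out of this design: `num_chains` is exactly the dial that separates a $20 task from a $3,000 one.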
Of note, ARC Prize adds that Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea, so this line of work has been a long time coming. That's another interesting thing: this comes up in a Google interview where they say, look, we're exploring this kind of thing, and then, whether OpenAI was independently working on it or they were both working on it at the same time, OpenAI happened to get there first. Though I'll say OpenAI didn't necessarily beat DeepMind to it, because just earlier this week Google announced their first reasoning model, which does this kind of chain-of-thought approach; we presume it's chain of thought because of the way everyone's been talking about it. So Google released their first reasoning model a week ago, and now OpenAI is announcing its second.

What's so fascinating to me is that these models are getting exponentially better, but I don't think they're actually training new models. These are new chain-of-thought reasoning layers that upgrade how the model works. Because the base model is actually pretty good as-is, integrated with this new chain-of-thought tooling they don't need to train entirely new models; they can just run the same model through it. Google has come out with this, and a lot of other players are announcing reasoning models, but it seems like OpenAI has the best one. So you could almost assume OpenAI has the best team at coming up with the best chains of thought: essentially hyper-elaborate prompts. That's what this is. OpenAI has some hyper-elaborate prompts that make their model ten times better. That's an insane concept. But it's way more expensive, because they have to run it through 10 or 100 or 1,000 or 10,000 times; these things can get really, really expensive and use a lot of compute.

Getting back to what ARC Prize said: "So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the chain of thought) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state of the art as per these new ARC-AGI numbers." So they're saying there are other ways you could do this, but their assumption, based on what other players are doing and interviews we've seen, is that there's a high likelihood this is how it works. The LLM on its own is not very good at truly novel questions, things it's never seen before, and that's what ARC is famous for: the problems in the ARC evaluation are not public, they keep them all private, and when they give an AI model a test, they never give it the same type of task again, so it can't just learn from the old ones. They're essentially testing how good it is at learning a completely new concept: here's a problem you've never seen anywhere else, solve it. Absolutely fascinating.
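To show what "searching over candidate programs" might look like in miniature, here's a sketch that represents an ARC-style task as training input/output grids plus a test input, and checks simple candidate transforms against every training pair. The grids and the two transforms are made up for illustration; o3's presumed search is over natural-language chains of thought rather than a fixed menu of transforms, but the shape of the problem, find a program that explains every training pair, is the same.

```python
# Minimal sketch of representing and checking an ARC-style task.
# Grids and candidate transforms are invented for illustration.
TASK = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test_input": [[0, 0], [3, 0]],
}

def rotate180(grid):
    return [row[::-1] for row in grid[::-1]]

def flip_horizontal(grid):
    return [row[::-1] for row in grid]

CANDIDATE_PROGRAMS = {"rotate180": rotate180, "flip_horizontal": flip_horizontal}

def solve(task):
    # A "program" is accepted only if it explains every training pair.
    for name, program in CANDIDATE_PROGRAMS.items():
        if all(program(pair["input"]) == pair["output"] for pair in task["train"]):
            return name, program(task["test_input"])
    return None, None

print(solve(TASK))  # -> ('rotate180', [[0, 3], [0, 0]])
```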
I wanted to read the last little bit of their write-up as we close out. They said: "Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of 'programs' (in this case, natural language programs: the space of chains of thought describing the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space, including backtracking."

This is absolutely fascinating. I think we have reached a completely new space for AI development. I am so excited, and I'll bring you all of the latest updates. This was a long episode, but I really wanted to dive deep on this and cover everything happening. So if this is as fascinating to you as it is to me, make sure to leave a rating and review on the podcast, or like this on YouTube and drop us a comment; I'd love to hear from you. And again, if you've ever wanted to grow or scale your current project or business with AI, I would love to have you as a member of the AI Hustle School community. There is a link in the description to go join. Hope you have a fantastic rest of your holiday season.
Joe Rogan Experience for AI
Episode: OpenAI Announces New Model o3: $1,000/Chat
Release Date: January 15, 2025
In this episode of the Joe Rogan Experience for AI, the host delves into OpenAI's groundbreaking announcement of their new AI model, O3. Released just a day prior, this model has stirred significant discussion within the tech community, with some critics suggesting it could mark the end of traditional software engineering roles. The host emphasizes a comprehensive and research-backed approach to dissecting the implications of O3, aiming to transcend the initial hype surrounding its release.
Timestamp: [00:00]
The host begins by highlighting the announcement of OpenAI's O3 model, underscoring its potential to revolutionize AI scalability and performance. He notes the model's impressive achievements in surpassing previous benchmarks, positioning it as a significant leap towards Artificial General Intelligence (AGI).
“Yesterday, OpenAI announced their new model, O3. This has a ton of insane implications...”
— Host [00:00]
He explains that O3 caps a larger 12-day event dubbed "Shipmas," which also saw the release of Sora, OpenAI's AI video model, and culminated in the unveiling of O3 itself. The host expresses enthusiasm about providing a thorough analysis of these developments, encouraging listeners to engage through platforms like Spotify and YouTube for a more interactive experience.
Timestamp: [05:30]
The discussion transitions to comparing O3 with its predecessor, O1. Greg Brockman, a co-founder at OpenAI, describes O3 as a "breakthrough with a step function improvement on our hardest benchmarks," indicating substantial enhancements in reasoning capabilities.
ARC Benchmark Excellence
The host delves into the ARC benchmark, a rigorous evaluation standard designed to assess progress toward AGI. O3 achieved a remarkable score of 87.5%, a dramatic improvement over O1's 32%. This progress signifies that AI models are approaching, and perhaps even surpassing, human-like performance in specific cognitive tasks.
“We're going from O1's best of 32% on this benchmark to 87.5%. This is absolutely insane.”
— Host [05:45]
He elaborates on the ARK benchmark's complexity, noting that O3's performance suggests the model's ability to handle novel and intricate problems that were previously challenging for AI systems.
Software Engineering Benchmark
Further showcasing O3's prowess, the host discusses its performance in software development tasks. O3 scored 71.7% on the SWE-bench software engineering benchmark, up from O1's 48.9%. This advancement implies that O3 is nearing the competency level of top-tier software developers.
Timestamp: [15:00]
A significant portion of the episode focuses on the financial aspects of deploying O3. The host breaks down the costs associated with running the high-end O3 model, highlighting that a single request can cost upwards of $1,000. For extensive tasks, such as app development or complex problem-solving, costs could escalate to $7,000 per response.
“$1,000 per request for the high-tuned model where you give it the most compute.”
— Host [08:30]
He contrasts this with the O1 model, which cost approximately $6–$7 per request, emphasizing the drastic increase in operational expenses with O3. The host points out that while large corporations might absorb these costs, individual developers or smaller businesses may find them prohibitive.
Timestamp: [20:00]
Inviting perspectives from industry leaders, the host references insights from Aaron Levie, CEO of Box. Levie acknowledges O3's superior reasoning capabilities despite its high operational costs, asserting that:
“What's expensive today is cheap tomorrow. Quality is all that matters because you know that costs will always drop.”
— Aaron Levie
Building on Levie's point, the host predicts that advancements in energy generation, particularly nuclear power, will eventually reduce the costs of running such sophisticated AI models. The emphasis, both argue, should remain on the quality and capabilities of AI rather than on current expense levels.
The host also touches upon technical speculations regarding O3's architecture. Drawing parallels to AlphaZero's Monte Carlo tree search, he suggests that O3 may utilize advanced program search and execution techniques within its token space, enabling it to handle complex reasoning tasks more effectively.
Timestamp: [25:00]
Exploring the mechanics behind O3, the host discusses the model's ability to perform deep learning-guided program searches. This involves the AI generating and executing its own "natural language programs" or "chain of thoughts" to solve tasks, a method reminiscent of advanced search algorithms used in game theory and decision-making processes.
“O3 represents a form of deep learning guided program search. The model does test time search over a space of programs...”
— Host [27:00]
He references the ARC AGI numbers, indicating that O3's approach allows it to navigate through vast program spaces, including backtracking when necessary. This methodology not only enhances problem-solving capabilities but also signals a new paradigm in AI development, pushing the boundaries of what machines can achieve autonomously.
Timestamp: [30:00]
Wrapping up the episode, the host reflects on the profound implications of O3's advancements. He expresses excitement about the potential shifts in AI development, business applications, and societal impacts. Acknowledging both the opportunities and challenges posed by such powerful AI models, he urges listeners to stay informed and engage with the evolving technology landscape.
“I think we have reached a completely new space for AI development. I am so excited.”
— Host [29:30]
He concludes by inviting listeners to participate in the AI Hustle School community for those interested in leveraging AI tools to grow their businesses, emphasizing the importance of continuous learning and adaptation in the face of rapid technological progress.
Key Takeaways

O3 Model Performance: Achieves unprecedented scores on the ARC and SWE benchmarks, indicating near-AGI capabilities in reasoning and software development.
Cost Challenges: High operational costs ($1,000 per request) currently limit accessibility, though future optimizations and energy advancements may alleviate financial barriers.
Technical Innovations: Utilizes deep learning-guided program search and advanced chain-of-thought methodologies to enhance problem-solving abilities.
Expert Insights: Industry leaders believe that cost reductions are forthcoming, focusing on the quality and potential of AI rather than current expenses.
Future Outlook: O3 represents a significant leap in AI development, heralding new opportunities and challenges across technology, business, and society.
This episode of the Joe Rogan Experience for AI offers a comprehensive analysis of OpenAI's O3 model, highlighting its technical advancements, benchmark successes, and the economic implications of deploying such a powerful AI system. Through expert opinions and in-depth discussions, listeners gain valuable insights into the future trajectory of AI technology and its potential to reshape various facets of our lives.
If you found this episode insightful, consider subscribing, leaving a rating, or joining the AI Hustle School community to stay ahead in the rapidly evolving AI landscape.