Transcript
Host or Presenter (0:00)
Breaking news out of OpenAI. A brand new model, or really an update of a very popular model has just been released. And Sam Altman just tweeted out and said, good new model out. And he's referencing an update that OpenAI just tweeted out where they said, GPT4O got an update, have a little party emoji. They said the model's creative writing ability has leveled up. More natural, engaging and tailored writing to improve relevant relevance and readability. It's also better at working with uploaded files, providing deeper insights and more thorough responses. Okay, I'm going to unpack everything going on here because this is absolutely fascinating and there's a lot of interesting things people have criticized, but also people are sort of applauding with this. Before I do that, I will say if you haven't already joined the AI Box waitlist over on AI Box AI for my AI tool that is coming out shortly, I would love you to join. And if you want to get daily AI news straight into your inbox with just pure signal, check out the bottom of AI Box AI. There's a little newsletter thing you can fill out, you can get on the waitlist, get the newsletter and we'll be sharing every single day. My top three AI stories. I try to make this short, concise, useful. So check it out. I hope this is really valuable to you. Link is in the description. All right, so over on. Over with everything happening over on OpenAI, what I think is really fascinating here is they haven't really announced if you know, per se. They went and did like a retrain of this model and people are betting that they haven't. This is just some essentially almost like software updates or tweaks they were able to do to make this a little bit better, maybe even better prompting as we feel like is happening with O1 preview. So this is really interesting. Um, Regist, a lot of the comments on this update are people, you know being disappointed. This isn't GPT5. There's someone that said can we get pinned chats folders in the chat history sidebar? Thanks. Absolutely. This is something that I've been begging them for for like, for like two years now. So I think this is really interesting. Someone said lol. Everyone is waiting for O1 OpenAI another update for 4.0. Okay, so it gets deeper into what is so interesting that's going on here. So there was lmarena AI that just tweeted. So by the way, this is formerly lmsys.org, which is kind of like this benchmark platform that everybody uses. It's now called Chatbot Arena. Chatbot arena is kind of the name, but anyways, it's LM Arena AI is the website. You can see that all these websites are getting more popular. They're buying better domain names than limsys.org now. Lmarena AI makes a lot more sense. Anyways, this is what they said in a recent tweet about this whole announcement and what came out and if it's actually better, they said over the past week, oh, and by the way, I'll just explain how this Chatbot arena works is essentially users can go on there and they're given like two different chat responses and they pick which one they like better. And so it's the Chatbot arena, it's anonymous. You don't know which chat bot is giving you there, giving you the response. There's know hundreds of chatbots on there. You go talk with it and just pick which responses you like better. It pits the best AI models against each other. And whoever gets the highest score by the end essentially rises in the rankings. I love this because it's, it's not just like voting on, oh, I like OpenAI better than I like Google. It's Anonymous. So it's essentially just kind of like a blind test on what one has the best responses. Okay, so this is what they said. They said over the last week. The latest OpenAI ChatGPT420, 2024, 1120 completed anonymously or competed anonymously as, quote, anonymous chatbots. They just called it Anonymous chatbot. It gathered 8,000 plus community votes. The result, OpenAI has now reclaimed the number one spot, surpassing Gemini XP exp 1114 with an impressive 1361 score. Okay, what this means is just a week ago, Gemini came out with a model that shot to the top of the rankings on the Chatbot Arena. And essentially everyone was like, oh my gosh, OpenAI anthropic. All these other companies are toast. Google's coming and smoking them with Gemini. I mean, to be fair, like it did for a week. But it's funny because you could probably assume the first day that Gemini hit that top chart, OpenAI put in this new model. Sometimes I wonder if Open Eyes just sitting on the sidelines with models ready to go in case something big happens. I just pushed this. Or maybe it's not even the best thing they got, but they're like, oh, this is significantly better. It was just like they don't want to get bumped down from the number one spot on this, right? They, they never want to be seen as number two. They always want to be number one. So they were number two for about a whole week while this new chatbot was competing anonymously. They. And then this is what LM arena tweeted about it. They said, latest GPT4O shows remarkable improvements. We see a leap in creative writing. It went from a score of 1356-1402. So up, you know, almost 40, they said, as well as technical domain coding and math. So this is the exact rankings they bumped from. Overall. They bumped from, of course, the number two chatbot to number one on style control. They went from number two to number one on creative writing, number two to number one on coding, number two to number one on math. They went from number four to number three. So they still have some work to do on math. On hard problems, which a lot of people call logic reasoning, I believe they went from number two to number one. So then they said, huge congrats to OpenAI. More analysis below. So this is fascinating because I love when we have these models come out and beyond, just like looking at Reddit or something where people are complaining about which one's better or worse for different use cases. This, the Chatbot arena is a pretty, in my opinion, it's a pretty good take on this. Now, this isn't without criticism. Some people are saying OpenAI is playing chatbot arena and they're not making their models better. They're just figuring out how to work game the system a little bit. Right. Okay, so I want to break you down into this analogy, but first things first. I mean, I do want to give a kudo to OpenAI for continuing to push. Some people have complained that, yeah, they're focusing on GPT 4.0 when 01 preview is still this beta thing and everyone wants it to actually be live and available. So LM arena posts essentially a bunch of screenshots of how much better the GPT 41120 is. And it's pretty impressive when you are comparing this thing to other AI models. For example, you know, it's, it's the only model that has cracked above, above 1400. And you can, you can see like on the very far end, you know, below, below 1300, you have models like the Llama 3.1 405B, which is essentially Meta's very, very best model. So Meta's very, very best model, llama 3.1405 B, 400. Yeah, whatever. 405 billion parameters. That is somewhere around, you know, 1250, 1260 and we're already having this one up in the 1400. You know, it's past 1400. Really, really impressive now. Are they just gaming, you know, the benchmark? What's going on here? I mean, side note, funny they had a typo when they first released this where they said it was called 1111, but it's 2011. Whatever, they fixed it. So in any case, some people are looking at all of these benchmarks. They're saying they're shocked that Claude 3.6 sonnet is number eight in the arena. So sometimes it's shocking. Right? I know a lot of people that say they love Claude, it's got a better tone and style, they use it for everything. And so yeah, to me this is kind of funny. Um, but this is. And then, you know, in response to that someone said, I'm starting to think that Claude Bros are paid shills. They swear up and down it's the best model ever. But the benchmarks say otherwise. So you know, there's, there's some fighting words in the comments here. This is, this is what I think is interesting though. John Howell said it appears they released it anonymously to gather data, fine tune and over fit on these leaderboard evaluations. Don't trust this leaderboard too much as it's being gamed by OpenAI and others. So someone thinks it's being, someone thinks it's being gamed, which I think is interesting. And that was a AI analytics consultant that said that. Not, I don't think anyone too, too famous. But this is interesting. This is a big point that I've seen in a bunch of these different comments. Someone said, I'm still puzzled how O1 Mini continues to beat 01 Preview at coding. In my experience, if I get, I'll use 01 mini coding on the web and when it gets stuck on something, there's usually a good chance o1 preview can solve the issue. Okay, this is other people talking about other models because they're all in the same screenshot. So bottom line, 4.0 is getting better but at the same time most people are still waiting for, you know, the O1 preview to come out. They're saying, you know, if they're complaining that maybe it's not as good and maybe that's the reason it hasn't come out fully is because it's just not as good at everything because yeah, in the benchmarks people really, really are loving the, you know, the GPT4 and it's ranking higher than 01 preview, which is at like spot number three. Just under Google Gemini. So Google Gemini's newest model is beating like this 01 preview, this brand new model that's supposed to come out and crush everything. So yeah, yeah, that's. That's pretty, pretty crazy. And somehow the O1 Mini, meaning like the smaller, condensed, faster model is doing better or has. Is. Yeah, scoring higher, unlike for coding, than 01 preview. So there's a lot of things that are a little bit perplexing here. Overall, you got to give a big kudos out to the whole team over at OpenAI. Just everything that they have been able to put out and all of the work that they've done that you know that the benchmark speaks for itself. This thing is really accelerating. So if you haven't tried it already, I would encourage you to go over to Chat GPT and try out the new model, try out some of the updates. You know, you creative, you know, article about Egypt, which I just did to test it out, and you can see how much, you know, I don't know, like how interesting it is. Okay, so I just said write a creative article about Egypt. I was just testing it. Alleged creativity. I'll give you the first blurb and you can, you could decide where we're at. Said Discovering the wonders of Egypt. A timeless journey through history and culture. Egypt, a land of timeless allure, stands as a bridge between ancient civilization and modern life. Renowned for its iconic landmarks, vibrant culture, and profound historical significance, Egypt continues to captivate travelers and historians alike. Okay, it sounds a little bit Chat GPT. It honestly sounds a little bit less strange words, but somehow timeless allure sounds kind of funny to me, even. I don't know, some things that just sound a little too cliche. Vibrant culture, I don't know, that's just me. But overall, I am noticing some of the words that they're saying. It seems a little bit less chatgpt. I haven't heard it say the word delve. Delve is a clear giveaway if it says let's delve into this, that it's chatgpt. So sometimes I wonder if they just like manually turned off some of those words or if they've, you know, put some extra prompts in here to try to make this thing sound a little bit more natural. So overall, you know, creativity is up. How, how the sounds is up, the tone is up a little bit. Up to you to decide if this really competes with Anthropic Claude. But overall I'm really excited and I'll keep you up to date on everything happening with this story. Seems like OpenAI is not dead in the water. This is something you got to love about a small startup. Well, people come that you know that like Google or others are like, oh, they got stagnant. Which by the way, I don't think that's the case with Google right now. They're really cooking with a lot of these new models. But while you can claim that some of the bigger companies are stagnant, OpenAI, you have to admit, is just doing things so scrappy. Like a startup like throwing out anonymous try to rank and outrank Google. You gotta love to see it. So they definitely have a lot of fight left in them. Absolutely fascinating. Again, if you're interested in getting the waitlist for AI Box, go over to AIBox AI and check out my newsletter that we send out every single day of our top three three AI Stories. It's at the bottom of the AI Box website and you can get those straight to your inbox. And hope you thanks so much for tuning into the podcast today. I will catch you next time.
