Andrei Karlenkov
Hello and welcome to the Last Week in AI podcast, or the Last Two Weeks in AI podcast, as has been happening lately, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most important and interesting AI news. And you can also check out our Last Week in AI newsletter at lastweekin.ai for stuff we did not cover. It comes to you every week in your email. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup. And once again, Jeremy is off on a very secret mission. He has told me he'll probably be more available in October, so I'm hoping we'll come back to the regular schedule soon. But for now we have one of our regular guest co-hosts, John Krohn.
John Krohn
Hey, what's up? I would pronounce my name John Krohn, but whatever.
Andrei Karlenkov
I'm tired.
John Krohn
I'm sorry, it is early on the West Coast for you, and it is a common mistake. I'm actually kind of glad you make it so that I can try to correct it. It's like a virus out there. It's actually been a decade of this. With my first book, Deep Learning Illustrated, the publishing company behind it, Pearson, internally everyone there calls me Jon Kron. And then through them I got exposure to the O'Reilly ecosystem, and so everyone at O'Reilly started calling me Jon Kron. And the same thing then with the conference circuit that both O'Reilly and Pearson are plugged into. So it's a common mistake, a virus that I'm trying to stomp out everywhere.
Andrei Karlenkov
I'm glad we can correct this error and I really should know better at this point. You've been on the podcast what, like five, six times at this point at least.
John Krohn
At least. I'm probably pushing 10. I could try to enumerate it later. The easy way to remember, Andrey, is that it's just like the bowel disease, Crohn's disease. That's me.
Andrei Karlenkov
Easy. Very easy. Well, John, I'll quickly mention you are of course the host of the Super Data Science podcast. I see you have a cap, which makes it very easy for the YouTube viewers to remember. You have interviewed a ton of people in the AI and data science world and are quite plugged into AI, less on the academic front, more on the hands-on front. So I think you're always a great co-host, and this week we'll be talking a lot about recent releases, a lot of business stuff. So it should be a good fit.
John Krohn
Perfect. And yeah, we do sometimes get some academic folks on the show as well. We had Andrew Ng on the show, although I guess the amount that he's an academic is decreasing all the time as well. Pieter Abbeel has been on the show. We've got Ethan Mollick coming up soon, well known, although that's actually the Wharton School, so that's not really... Good point.
Andrei Karlenkov
Academic, academic, businessman, you know, what's the distinction? It's hard to say.
John Krohn
If you want that nine figure salary, gotta get a little bit of both.
Andrei Karlenkov
That's right. Well, to give a quick preview of what we'll be talking about, it's a pretty exciting week. We've got to start with Sora 2 from OpenAI, making just some crazy AI videos, then Sonnet 4.5. That's something I've been especially excited about. Then like five more updates to products to discuss. Then in applications and business, various other developments from OpenAI, from competitors to OpenAI, things like that. In research and advancements, we've got some new exciting benchmarks, which we always like to talk about, and some interesting kind of interoperability work, which I think will be interesting. And then policy and safety, some new law stuff in California, which is another topic we often touch on. Before we dive in, really quickly, I do want to respond to some listener feedback. We got a new review on Apple Podcasts: "Best AI podcast - still alive?" So yes, still alive. But it is a fair point that the recent trend, the past month in particular, has been fairly ad hoc. We have been missing a week or two weeks in some cases. So as I've said in the last couple of episodes, Jeremy is off on a big project of some sort and has basically bowed out so that he wouldn't miss the dates we usually have. And I'm not super consistent in getting a guest co-host on time, but as I keep saying, we will try to come back to the regular weekly or mostly weekly schedule that we have mostly managed to do for much of this podcast's history, but not so much lately. With that out of the way, let's get into tools and apps. So first up, OpenAI's Sora 2. Sora, of course, is the text-to-video model from OpenAI. They first released Sora 1 in early 2024, and at the time it was mind-blowing. The text-to-video capabilities were far beyond what anyone had done. Now, in the past year, OpenAI has been kind of quiet on the text-to-video front.
We have seen Veo 3, we have seen models really making big strides, and so now we have OpenAI coming out with their new model, and it does not disappoint. I think it really produces very good-looking videos. The AI-ness of some of these things is getting harder and harder to see. They have a new Sora app on iOS which allows you to create videos, to remix existing videos, and to post them to a feed. And I think most notably it has this feature of cameos, where you can scan your face or a friend's face and create and post a video starring that person. And there have been some fun examples posted by OpenAI of Sam Altman getting up to some shenanigans. And I think that really has been a killer feature that has made people more positive on this. And I think just generally the videos I've seen don't feel or look as much like AI slop. There's been much less of a trend towards bizarre, outlandish, animated-looking things. You know, no animals on laptops or things on the moon, as much as cinematic things or other kinds of things that look a little bit more grounded. So very cool from OpenAI. The app is still invite only, so unfortunately I've not been able to try it firsthand, but I'm sure they're going to start to roll it out.
John Krohn
Yeah. As you've been speaking, Andrey, I've had video playing of Sora 2, specifically this two-and-a-half-minute-long video showing Gabriel Peters, who I guess works at OpenAI and seems to be involved in the launch, with Sam Altman, whose name I'm sure everyone knows. And so it's the two of them on this kind of adventure, and it is pretty damn good. The video quality, in terms of photorealism, is far better than anything I've ever seen in text-to-video generation, with some pretty impressive real-world physics contained in it. So for example, there is a moment where Sam Altman and this Gabriel Peters guy run past somebody playing billiards, and they do a shot of the billiard balls, and there's a little bit of fuzziness and a little bit of liquidy goopiness around the billiard balls. But over, you know, 10 seconds or so, those billiard balls remain consistent. They respond with real-world physics as somebody breaks, breaks the balls.
Andrei Karlenkov
With Their PO Q non physics. Let's say something happens that should not happen.
John Krohn
I mean, when you start a game of. It should happen. Exactly. Yeah. No, it looks good. It looks good. It's just funny that Break the balls is the only thing that I can come up with for that right now.
Andrei Karlenkov
It's a good example of physics. Yeah.
John Krohn
The one other thing that you mentioned there is this kind of social media app, I guess, that they've launched in conjunction with this. That seems like a bit of a Hail Mary, but you never know. You never know, I guess. Yeah. It's such a sprawling organization now that you throw lots of different noodles at the wall and see what sticks. I think, you know, Sora 2, and Sora in general, is a really viable product that we're going to see a lot of. I don't know if this app is going to take off so much.
Andrei Karlenkov
Right. And it seems like they're not trying to make it take off necessarily. I mean, it's invite only at this point, so how much can it grow as a social media app with that constraint? And presumably it's invite only because the GPUs would be on fire if they let everyone use Sora. I think that happened with the most recent Sora update they had. There are paid options for extra video generation due to these high computing costs. It takes a while to generate; I think I've seen examples of 10 to 15 minutes. And one thing I forgot to mention, which is pretty noteworthy: it now also generates audio, so sound effects and speech, and that's pretty good and comparable, again, to Veo 3, which is kind of the new generation. So text-to-video is continuing to be on a roll throughout this year, becoming better and better by pretty big margins. One other interesting bit with Sora 2 is that people have made some pretty impressive examples of generating, let's say, existing media with it, in the sense that they have created South Park episode clips which are very similar to the show and clearly from the training data set. Also I've seen clips of Family Guy. I've seen stuff from Cyberpunk 2077, a major video game. So the guess would be (of course, we don't know much about the technicals here) that there's not been much restraint on the training side in caring about copyright, which is kind of interesting. We've been discussing many copyright lawsuits lately, and OpenAI seems to have just gone for it with everything and anything, is what it looks like.
John Krohn
Yeah, it seems like they have been all along and you know, companies that are really trailblazing in the past, like Spotify have gotten away with it. You know, Spotify was using illegally file shared files in kind of their original undertaking, the original launch of Spotify and obviously that made a lot of people unhappy. But just like OpenAI they seem to be, you know, both of those firms emerge as a juggernaut in the space and you find a way to make nice, right?
Andrei Karlenkov
Yeah. Uber similarly did that Silicon Valley trick of just ignoring regulations. And at this point, with 700 million active users, I think weekly active or something like that, I don't think OpenAI is going to die due to copyright stuff, but there might be some comeuppance. We'll see. On to the next story. We've got Anthropic releasing Claude Sonnet 4.5 and at the same time also releasing Claude Code 2.0, slightly less covered, but also I think notable. So Sonnet 4.5 is the update to Sonnet 4, which has been around for quite a while, I feel like, I don't know, since the beginning of this year or something. Most recently we've had the update of Opus 4.1, which we've discussed. So this is a pretty major update for Anthropic. They are positioning it as best in class once again for coding, for tool use, for long-range reasoning. And there are also various tools that they have rolled out to help people create their own AI agents. So they have rebranded the Claude Code SDK to the Claude Agent SDK and basically repositioned it: it's not just Claude Code, you can make AI agents powered by Claude for whatever you want. So people's vibe check on this has seemed to be that it's really good. There's quite a mix, as often happens: some people are saying it's the same, you can't tell the difference; other people are saying this is brilliant. Anecdotally it seems to be a little bit better at not necessarily agreeing with you on everything when you're working with it, being, let's say, a bit more thoughtful or mindful. Also better at long-context reasoning with that 1 million context window. So very exciting for coders, for people who rely on Claude for agentic stuff. Maybe less exciting for the general public.
John Krohn
Yeah, I would say that this is similar to the GPT-5 release, which took a lot of heat from the public in general. But I am really impressed by it, and I don't know what people were expecting in terms of regular everyday tasks, what kind of magic could possibly happen on short task time frames. I'm going to use the GPT-3 to GPT-4 jump to provide a little bit of context around what I think people are thinking with these big releases, which is that GPT-3 and GPT-3.5 were able to handle tasks that would take humans up to about 10 seconds, maybe tens of seconds. GPT-4 felt like a big leap because all of a sudden it could replace us reliably on tasks that would take humans minutes to do. Once you start getting past that, a lot of everyday tasks that you're just going to throw into a conversational agent don't take a human more than a few minutes to do. And so it's harder to kick the tires on these longer tasks, ones that would take a human hours to do. But that's why we have benchmarks like SWE-bench Verified, Terminal-Bench, and agentic tool use from τ²-bench. Those kinds of benchmarks give you some sense of how these models are performing on longer tasks that might take a human hours to do. And both GPT-5 and Claude Sonnet 4.5 are big leaps on those longer time frames. You're probably familiar with this chart, Andrey, the METR chart of how long a human task can now be handled with 50% accuracy by an LLM. And this curve shows that every seven months right now, the human task length that an AI model can handle doubles. So if it's about two hours today, you can expect that in seven months state-of-the-art LLMs will be able to handle a four-hour human task, and seven months after that eight hours, and seven months after that 16 hours.
The multiples get pretty crazy and really powerful in terms of thinking about, in an organization or for you as an individual, what range of tasks can now be handled reliably in a fully automated way by machines. That curve that we're in the midst of is pretty mind-blowing. And Claude Sonnet 4.5 plays a part in that.
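The doubling arithmetic described above can be sketched in a few lines of Python. This is only an illustration using the numbers quoted in the conversation (a roughly two-hour task horizon today and a seven-month doubling period); the function name `task_horizon_hours` and its parameters are made up for this example, and real progress need not follow a clean exponential.

```python
def task_horizon_hours(months_from_now: float,
                       current_hours: float = 2.0,
                       doubling_months: float = 7.0) -> float:
    """Project the task length (in human hours) a model could handle.

    Assumes a clean exponential: the horizon doubles every
    `doubling_months`. Both defaults are the figures quoted in the
    episode, not measured values.
    """
    return current_hours * 2 ** (months_from_now / doubling_months)


# Walk forward in seven-month steps, as in the discussion.
for months in (0, 7, 14, 21):
    print(f"{months:>2} months: {task_horizon_hours(months):.0f} hours")
```

Changing `doubling_months` shows how sensitive the projection is to the assumed doubling period.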
Andrei Karlenkov
Exactly. Yeah. That's a great thing to highlight. In the announcement, they basically are focusing on it. Sonnet 4.5, they say, is the best model in the world for agents, coding, and computer use. It's also their most accurate and detailed model for long-running tasks, with enhanced domain knowledge in coding, finance, and cybersecurity. So it continues the trend of Anthropic very much focusing on enterprise needs, on professional needs, not so much competing with OpenAI on trying to get more consumers or broader use cases. It's not focusing on being a good chat companion or being a therapist or being great at image understanding. It really is focusing, more than anything, on being agentic. And to cover a little bit of the benchmarks, it's going beyond Opus 4.1. So Opus was a very big model, a very expensive model. Sonnet 4.5 is now beating it on most benchmarks while costing as much as Sonnet 4, which is quite a bit less than Opus. And it's on par with or to some extent better than GPT-5 across most of these benchmarks dealing with computer use, tool use, et cetera. Way ahead of Gemini 2.5, I think, which is interesting. And yeah, I think it's true that now it's harder and harder to feel the progress when you just chat with these models. It's almost like text-to-image, you know; at this point, can you really tell the difference? But when you need to use it for very specific, in some cases nuanced, in some cases complicated or just involved things, that's where you can tell the difference. And I think people who use Claude Code, who use agentic tools, have really learned the quirks of these models and the things specific to agentic tool use that LLMs are not necessarily good at out of the box. I don't know. John, do you use Claude Code or any other agents in your work these days?
John Krohn
Yeah, and I'd actually like to highlight some of the things you mentioned: there was a big Claude Code release, and there are a couple of big things here that happened simultaneously with this Claude Sonnet 4.5 release. So for example, with this latest version of Claude Code, you now have checkpoints for rolling back to previous code versions. A common gripe of the vibe coder is that you have this working application and then you go too far, you keep making changes. Let's say you want to make some UI changes and, you know, you just want to change your application from green to blue, as a simple example. But somehow in making that small change to your UI, some underlying logic changes and all of a sudden your app doesn't work. And so with checkpoints now in Claude Code, you can roll back to some previous working state and vibe again from there, I suppose. With this release also came a native VS Code extension, and a lot of people love VS Code out there, so that's probably a big win for folks. And then finally I'd like to highlight that in the past year I've had a lot of focus on agentic AI. You know, I have been doing trainings, and I have a YouTube video called Agentic AI Engineering that now has a hundred thousand views and gives you an introduction to the key libraries that you need to be building multi-agent teams. And in that video we focus a lot on the OpenAI Agents SDK. And so it's interesting here that Anthropic are now rolling out the Claude Agent SDK, which is clearly following the same kind of naming convention. It is a different kind of SDK, but something that is cool about it that I like is that it has this specific feedback loop that it nudges you in the direction of using with your agents, where in step one, you gather context for whatever task the agent is going to be doing. In step two, you take action based on that context.
And then step three, and I think this is a big part of why people like using Claude so much, is the verification of the work. You get high-accuracy results a lot with Claude. And so that loop that happens in the Claude Agent SDK (gather context, take action, verify work, and then back to gathering context) allows the agents that you create with the Claude Agent SDK to continue for long periods on tasks with a high degree of accuracy.
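The gather, act, verify loop described above can be sketched generically. The code below is not the Claude Agent SDK API; every name in it (`run_agent_loop`, `gather_context`, `take_action`, `verify_work`) is hypothetical, and the toy task at the bottom only stands in for real agent work.

```python
from typing import Any, Callable, Optional


def run_agent_loop(
    task: Any,
    gather_context: Callable[[Any, list], Any],
    take_action: Callable[[Any, Any], Any],
    verify_work: Callable[[Any, Any], tuple],
    max_iterations: int = 5,
) -> Optional[Any]:
    """Repeat gather -> act -> verify until verification passes.

    Returns the first verified result, or None if the iteration
    budget runs out. All callables are supplied by the caller.
    """
    history: list = []  # (result, feedback) pairs from earlier attempts
    for _ in range(max_iterations):
        context = gather_context(task, history)   # step 1: gather context
        result = take_action(task, context)       # step 2: take action
        ok, feedback = verify_work(task, result)  # step 3: verify work
        history.append((result, feedback))
        if ok:
            return result
    return None


# Toy stand-ins: the "task" is to reach a target number by counting up.
def gather(task, history):
    return len(history)  # context = how many attempts were made so far


def act(task, context):
    return context + 1  # each attempt proposes the next integer


def verify(task, result):
    return (result == task, "ok" if result == task else "too low")


print(run_agent_loop(3, gather, act, verify))
```

Feeding the verification feedback back into the next round of context gathering is what lets a loop like this keep going on a long task without drifting; a real agent would replace the toy callables with model and tool calls.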
Andrei Karlenkov
Yeah, that's a good comparison with the Agents SDK from OpenAI. That's more of a framework with which you can create your own applications that are agentic, versus the Claude Code SDK, or now Claude Agent SDK, which is a way to take Anthropic's agents and use them for your own things. So you no longer need to do it interactively via a GUI or a terminal. You run some code and that code runs Claude Code. So it's actually quite different. But now they're trying to make it even more flexible. And I think, yeah, it's noteworthy that this is happening pretty soon after Codex by OpenAI came out. At the time of GPT-5 and Codex, many people were saying that Anthropic had had a major lead with Claude Code. And if you're not in coding, or if you're not in this world, it's hard to overstate the degree to which this is actually a big deal. Leading in the space of agentic AI, and leading in the space of these kinds of tools that people are happily paying $200 a month or more for, is really the cutting edge. And these are things that are having a huge impact. I know my entire company has now converted to using these kinds of tools over the past few months, largely because of Claude Code. So Codex was a big step for OpenAI to compete and take some mind share and developer preference. Maybe now Anthropic is winning back some people who got a little frustrated. Alrighty, a few more stories. Moving on to Meta and another kind of social media platform, I suppose. They announced and launched Vibes, which is actually a feature in the Meta AI app and on meta.ai, and it is similar to the Sora app. It's meant to allow you to create little AI videos and share them, and you can browse a feed of AI-generated videos, in this case powered by either Midjourney or Black Forest Labs; I'm not too sure which one of these. Very different reception from Sora.
Largely people made fun of this or criticized it or otherwise seemed to think that this was, you know... the term slop came up often. This is a slop machine. This is feeding you AI slop. I think in significant part because of the marketing around it being focused on these more obviously AI videos of, you know, outlandish things, things we've seen a lot of, and the idea of scrolling and seeing nonstop AI content that, you know, you might kind of get into. And it's interesting how people are getting more and more negative, and the term slop, or AI slop, has really, I think, become mainstream. Either way, yeah, not necessarily a strong launch from Meta with this, but who knows, maybe people will like it. I don't know.
John Krohn
Yeah, I think there are maybe places where AI can provide a lot of value in social media or in media generation in general, maybe allowing you to change lighting or make some small changes. But when you're fully generating the entire video from AI, it's become so easy today, so cheap, that it doesn't feel like a valuable human experience for the most part. I'm sure there are amazingly creative things that maybe someone can do where you're like, wow, that's actually a great use of this. But it's interesting how now, because creating AI slop is so easy and it's so abundant on the Internet and even inside of enterprises (emails, decks, you're like, am I really reading someone's opinion or am I getting some AI slop here?), when you can tell that somebody actually put effort into writing something or creating something, you know that there's some thought behind this, some planning, and that this is probably actually a good idea. That is starting to become more valuable, but also, interestingly, harder to distinguish from the slop.
Andrei Karlenkov
Right. And I think it is kind of the definition of slop. In particular for AI, slop is, in my impression, typically these low-effort outputs of AI: you put in a prompt, you get an output. If you're using AI-powered tools for video creation where you're editing and spending hours compiling together a short film, or using it in your workflow to help you edit or code, for instance, that is something people are perhaps less negative on, or at least that wouldn't be categorized under slop. And I think the framing of this as a feed of nonstop AI videos where there's not much effort (it is prompt to video, or they also have a remix option), and also that this is by Meta, which is already in the business of getting people addicted to feeds of various kinds, really was perhaps a major part of why this got the reaction that it did. If they had introduced some sort of tooling to make it more personalized, to let people have some of themselves, some human context alongside the video, if it had, for instance, cameos like Sora did, even that might have generated a different reaction. But as is, at least the social media reaction is that no one wants this. Maybe some people do want it, but that's not what I'm seeing and certainly that's not what I'm feeling. On to another new product release, I suppose. We have OpenAI, prior to Sora 2, releasing ChatGPT Pulse, a slightly more out-there thing that has OpenAI expanding more into your daily life, into becoming more of an assistant or something that you use on the regular. So this feature prepares personalized morning briefs for users while they sleep. Pulse provides five to ten briefs to help people start their day, and that includes things like news summaries. It can create reports on specific topics. So news updates or personalized briefs based on user context. And they have these cards that you can go through.
So you browse the cards and then you can tap on them to get the full details and talk to ChatGPT, optionally. So it's, yeah, interesting to think of. I think South Park at this point has made fun of people using ChatGPT for everything, you know, nonstop. I think people are starting to talk to ChatGPT a lot more. And this is building on ChatGPT connectors, so you can connect ChatGPT to your calendar or email or other things like that. So it's making it easier for people to really integrate ChatGPT into their lives, make it always there, really a personalized assistant for your daily life.
John Krohn
Yeah. The heavy use of ChatGPT in South Park this season has been something that I found particularly amusing. If people don't watch South Park, I think this is a season to check out if you're working in AI. On the note of this Pulse feature inside of ChatGPT, I get what they're trying to do here. Right now, I think people think of these conversational interfaces like ChatGPT as something to go to proactively when you have some problem. And of course, if you're designing a platform like that, what you want is to become an essential part of people's day, so that you feel like you have to check this, just like you have to check your social media feed. You know, you have to go into ChatGPT to get this update on your life. It makes a lot of sense. I think where OpenAI could run into issues here is that in order for this to be a really useful experience to me as an individual, I have to use those connectors that you mentioned, you know, to access my Google Drive, to access my Gmail, to access my calendar. And I personally don't have a level of comfort with OpenAI, or maybe even any of these big players in the space, because what they're going to do with my data isn't clear. Or sometimes when it is clear, it's clear that they're going to be using it for training models. And that makes me uncomfortable with my personal information on that kind of scale, to just have a connector go into all of the gigabytes of my Google Drive, all my personal information. I can see how it would be useful, but I'm not sure I personally, or probably a lot of people out there, are comfortable with that level of trust.
Andrei Karlenkov
Yeah. And I think that might be part of why this announcement and rollout has seemed to go under the radar. Another reason is that this rollout is exclusive to the $200 a month Pro plan, so not many people are even capable of using it. This is really for power users. And another thing to note is that it is kind of a new way for AI to be agentic, right? This is having AI do stuff for you without you asking it, going off and being autonomous. So still competing on that front. And the last thing I'll say is that one of the things I think about often with regards to the business side of AI is the lack of a clear lock-in for people. Right. The difference between using ChatGPT and Gemini 2.5 and Claude and, I don't know, DeepSeek doesn't feel huge. There are some differences in tone, some differences in, let's say, tendencies or character. But when push comes to shove, if a free plan goes away and another company is cheaper, like Gemini, I think people will move over. Right. I don't think there are people who are really fans of ChatGPT so much as fans of the experience and usefulness that it provides. So in that case, what happens when there's no significant lock-in is a race to the bottom on pricing. So you will have very low margins on the subscription, on the profit, I suppose. And that's pretty bad because the margins are already very hard. It is very expensive to do the inference. It's not like computing in general. You need to actually pay a substantial amount for every thousand or million words you use. So it's a major challenge, I think, for OpenAI to maintain their lead while trying not to continue burning through piles of money every single day. And things like this might help. It would be interesting to see if that's the case, for sure.
John Krohn
I think something that you said there, it helps me realize that Gemini from Google might actually be well positioned in this space because a lot of people already trust Google with access to their drive, with access to their emails, their calendar. And so if Gemini just becomes kind of an add in there, low margin business. Yes, but if you can do it as part of, you know, enterprise Google Office accounts or something like that, Google might be able to maintain margin there.
Andrei Karlenkov
And that has been Google's strategy, rolling out Gemini in Gmail and spreadsheets and Drive, kind of everywhere, increasingly making it capable. So you can ask Gemini to, I don't know, explain your spreadsheet, whatever. So they definitely have an advantage in the sense that lots of people are using these things and you can get them to use Gemini by just having it right there. And notably also in Chrome, which we covered just recently, they're adding just a little button up on the top right. Chrome is by far the most used browser, so that's a major win. Right. You now no longer need to go to chatgpt.com; you can press this little button whenever you're browsing the web. So OpenAI definitely should be feeling a bit of worry from Google and maybe Microsoft, and these kinds of things, like Pulse, might help.
John Krohn
Yes. Although interestingly, despite all these concerns that they might have story after story here we are talking about OpenAI. Onto the next one, onto the next.
Andrei Karlenkov
One more on OpenAI they are rolling out a safety routing system and parental controls on ChatGPT. So this will detect emotionally sensitive conversation and switch to GPT5 thinking which is equipped with safe completions for handling sensitive topics. This is coming of course after some pretty bad stories about ChatGPT, in some cases encouraging people to self harm, in some cases aiding or kind of exacerbating people's mental health issues. And yeah, another sign of the extent to reach ChatGPT is becoming parts of people's lives really kind of having a major impact on people's lives. This is essential, this is very important given what you've seen with some of the impacts of use of ChatGPT. And you know, nice to see this being rolled out. I think fair to say perhaps that this is coming too late at this point.
John Krohn
Yeah, I don't have much else to add to this story. Yeah, hopefully this can prevent, you know, these kinds of negative side effects of people getting too into their conversational platforms in the future.
Andrei Karlenkov
Yeah. And I guess the last thing I'll say, also not fully related: one of the concerns with these chatbots is the potential for people to feel emotionally bonded to them, to really have them become an emotional crutch. And it has been the trend for younger people to be less social, to be less outgoing, to just have fewer friends. And this is not going to help. Right. So I would hope that part of this is indicative of OpenAI at least being more careful about that, about getting people addicted, so to speak, to ChatGPT due to positive reinforcement and being kind of your friend in a way that might be harmful if you overdo it. And on to the next story, moving away from OpenAI at long last and moving on to Google, with some news that was kind of quiet but also notable. So they have updated Gemini 2.5 Flash-Lite. There's now a new Gemini 2.5 Flash-Lite preview. Also Gemini 2.5 Flash got an update, and they are much better. They got a major improvement on coding, for instance, and they are also faster. So Gemini 2.5 Flash-Lite is now the fastest proprietary model according to independent tests. It gets 887 output tokens per second, which compared to things like Sonnet, maxing out at around a hundred tokens per second, is very fast. Flash-Lite is also very quick, sorry, very cheap: $0.10 per 1 million input tokens, orders of magnitude cheaper than Gemini 2.5 Pro, for instance. So a major update on this model from Google. No rebranding, no fancy media cycle, just making it better and stronger, which I think is a little bit interesting. If you look at the graphs, this is a major improvement, and it's still just Gemini 2.5 Flash and Flash-Lite. Not much fanfare around this.
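To put the quoted speed and price figures in concrete terms, here is a small back-of-the-envelope sketch. The throughput numbers (887 and roughly 100 output tokens per second) and the $0.10 per million input tokens price come from the discussion above; the 2,000-token response and 500,000-token input are arbitrary assumptions, and the helper names are made up for this example.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a steady output rate."""
    return output_tokens / tokens_per_second


def input_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Cost of the input side of a request at a per-million-token price."""
    return input_tokens / 1_000_000 * usd_per_million


RESPONSE_TOKENS = 2_000  # assumed response length, for illustration only

fast = generation_seconds(RESPONSE_TOKENS, 887)  # Flash-Lite figure quoted above
slow = generation_seconds(RESPONSE_TOKENS, 100)  # rough Sonnet figure quoted above
print(f"fast model: {fast:.2f} s, slower model: {slow:.2f} s")

# Assumed workload of 500,000 input tokens at the quoted $0.10 per million.
print(f"input cost: ${input_cost_usd(500_000, 0.10):.2f}")
```

For an agent that makes many model calls per task, the per-call latency gap multiplies across the whole run, which is why throughput matters so much for things like browser use.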
John Krohn
Yeah, this is the future. This is one of those ongoing trends. It's like earlier in this episode, I mean, the opening story in this episode of Last Week in AI was about Sora 2. And you know, I've been listening to this podcast, Last Week in AI, long enough to remember not that long ago, maybe 18 months ago, 24 months ago, you and Jeremy talking about how text-to-image had been getting pretty good, pretty compelling. You know, they'd solved the finger problem, getting five fingers on hands looking anatomically correct. But video at that point was really poor. And now, you know, two years later, roughly, we're at a spot exactly as you and Jeremy have been talking about would happen, where video is good, it's photorealistic for the first time. And just like that megatrend, this is another one of these megatrends where, of course, the big frontier labs are going to be on the one hand pushing the absolute frontier of capability with typically larger large language models, but at the same time trying to get costs as low as possible, trying to get inference time as low as possible while retaining as much capability as possible. It's a big megatrend that will continue for decades to come. And it's critical in the context of the conversation that you and I were having a couple stories ago, Andre, around very small margins as, you know, cognitive machines become a utility. Having these small, fast models is critical to the frontier labs being able to be commercially successful.
Andrei Karlenkov
Exactly. And Gemini 2.5 Flash is interesting in the sense that, I think, among the frontier labs on the smaller model front, there are things like Haiku from Anthropic, which is still at 3.5; Anthropic kind of gave up on this class of models. We have GPT-5 nano from OpenAI, which is decent, but Gemini seems to be the best in this class of model, and they also mentioned it getting better at tool use. So I would have to imagine part of the motivation here is for things like browser use, where you need a very fast model to be able to do the agentic work for you and not take a million years. This would certainly help with that. Yeah, Google is very competitive in pricing their models, so this is certainly adding to that. And in this class of model I don't think you can do better, from what I can recall. And onto the last story in this section, lots going on as I mentioned in the preview, and this one is on Microsoft. They are also doing AI agents. They are adding them to Word, Excel and PowerPoint. So this is coming to Microsoft 365 subscribers, people using Word, Excel and PowerPoint already. This is for now on web, but will be coming to desktop, everywhere you use it. And it is I suppose what you might expect. The agents are capable of doing more than just chatbots, of interacting with your document and doing multi-step tasks. So for instance, in Excel the AI can perform data analysis, create visualizations, summarize insights. Stuff that Gemini, by the way, still cannot do, which I find really annoying. I wanted to edit my spreadsheet and it's not doing it. In Word, you can write content. In PowerPoint, it can help you generate your presentations, getting you slides with data, visuals and other stuff. And so there you go, you know, you've got your agents in your tools. Not a surprising development, but I think it certainly seems like what Microsoft should be doing, right?
John Krohn
100%. And I personally am not a Microsoft Office user, but I know based on how many invites I get to Teams calls that a lot of people out there are Microsoft users, and you know, Microsoft Windows machines are still the predominant consumer operating system. There's a lot of scope for Microsoft to be adding AI capabilities, including agentic capabilities, into their applications. This is an obvious and probably useful thing to a lot of people.
Andrei Karlenkov
Yeah, and I definitely see from my own experience when you have these applications, for instance Notion is one example, or yeah, Google Docs, Google Sheets, if there is a chatbot integrated into it, I want to just use that. I'm not going to go and copy all my stuff over or export it and then go to ChatGPT and attach it and talk to it if there's a capable thing that is already there for easy access and able to talk to the document and natively sort of interact with it. In my personal experience that is what I'm going to use. I'm not going to go to a competitor or kind of take the leap and do extra work. So these kinds of things are gonna get you into lock-in perhaps. Right. More so than going to online platforms. All right, onto applications and business, finally. And we are back to OpenAI. There's lots of stuff from OpenAI this past week; they really have a lot of news. So on this front we've got a new agentic shopping system from them. They have this Instant Checkout feature for ChatGPT users in the US which is going to allow people to make purchases from Etsy and Shopify directly within conversations. It's rolling out to all users, enabling you to buy stuff from over 1 million Shopify merchants with payment options like Apple Pay and Google Pay. So pretty interesting move. They are also open sourcing the Agentic Commerce Protocol, which is powering Instant Checkout and allowing agents effectively to do these kinds of things. This is being done in collaboration with Stripe, the payment processor. So seems like a good move to make money. It's either ads or people buying stuff. And letting people buy stuff through your interface is gonna start generating a bit of revenue from those free users and from those, I guess, power users who are willing to pay.
John Krohn
100%. The theme maybe of this episode is that, you know, with margins going to zero or negative on creating frontier models themselves, which is the niche that OpenAI is a leader in, they've got to find other ways to be monetizing. And this is an example of a way that you reliably can. And Perplexity already has been treading down this path for a year. So, you know, folks like OpenAI look around and say, what else are similar companies doing? Where are they making money? Where can we maybe maintain some margin or get a lot of scale? And this makes a lot of sense. Agentic shopping as part of the ChatGPT Pro flow seems like an obvious place to be making some money.
Andrei Karlenkov
Yeah, also a little bit concerning possibly in the sense of, you know, what if you pay for ads and your chatbot, which you're used to being a little bit unbiased or on your side, now is going to prefer certain brands over others? In some sense, the potential for monetization via sponsoring and ads is going to have some different dynamics compared to what you see now. You know, people can already pay for Google searches to put their products up in front, but I think people have sort of learned to spot that kind of stuff, versus if you get that in your chatbot flow it might be a slightly different experience.
John Krohn
100%. Maybe this is a pointless hope, but hopefully firms like this are taking a long-term view and trying to build trust with us in order to be a platform that we want to be working with for decades to come, as opposed to just trying to make, you know, the most money that they can this quarter or this year. But we'll see, we'll see.
Andrei Karlenkov
But for now at least, OpenAI is saying that these are organic and unsponsored, and they are charging merchants a small fee for completed purchases. So they're not monetizing in that way, for now. Onto the next story. A competitor of OpenAI composed largely of former OpenAI employees. We are talking about Thinking Machines Lab, and they are launching a thing, they're launching a product after quite a while from having been founded. I forget, but it feels like a year ago. Mira Murati, of course, is a former CTO of OpenAI who left. Anyway, let's not get into the drama, but she left to found this Thinking Machines Lab with a very large amount of backing. They have not indicated too much what their plans on the product side are. It's been fairly ambiguous. We've seen some research from them, and now there is a product called Tinker, which is making it easy for researchers and developers to experiment with and fine-tune AI models. So what it's in effect doing is it allows you to fine-tune open source models. For now, this is Meta's Llama and Alibaba's Qwen; no gpt-oss, interestingly. And it lets you use supervised learning and reinforcement learning via a simple API. So an interesting thing to go with. It's not something that is unavailable. There have been multiple players in this space, multiple tools, for quite a while. But from at least the comments in this article, people seem to be saying that this has a good mix of abstraction and ease of use while also making it tunable, and could be an interesting, yeah, kind of area to go with. Open source models have been becoming better and more competitive with frontier models relative to where they were in 2024 and 2023. Now open source models are, you know, able to do a lot, able to compete for everyday tasks. So I do wonder if this is a potential change in how people use the models, whether people start fine-tuning them more and whether that's something that Thinking Machines Lab is betting on.
John Krohn
Yeah, this is a crowded space. I'm not sure that there is a huge amount of demand for fine-tuned open source models, but there certainly are a lot of startups out there that are tackling this problem. It's always possible that someone like Mira Murati, with her network and the amount of funding that she has for Thinking Machines Lab, can make something out of this that other folks wouldn't be able to. You know, all you need is a few big enterprise clients and you're on your way with some solid ARR. And she could make something of this. But yeah, crowded space. Not a particularly distinguished product in my view.
Andrei Karlenkov
Yeah, I think the major competitive advantage is the pedigree of not just Mira Murati, but the work of John Schulman, an OpenAI co-founder. They have led the work on fine-tuning things within OpenAI, fine-tuning ChatGPT through reinforcement learning, and are quoted in this article as saying there's a bunch of secret magic, but they give people full control over the training loop and abstract away the distributed training details. So yeah, I wonder if there is a potential to do much better at the task. On the one hand, you know, it's supervised learning, reinforcement learning. I was not aware of there being much secret magic, but perhaps there is, and that would mean that you can actually outcompete your competitors without having to pay billions of dollars to train a model. Well, now back to OpenAI, because that's like half of what we talk about. The next story is OpenAI becoming the world's most valuable private company after private stock sales. So this is bananas crazy. OpenAI has sold $6.6 billion in shares held by current and former employees. And that means that its valuation based on this is 500 billion. That's the highest for any privately held company. So that is part of a meteoric rise. Just in August, they raised 40 billion at a 300 billion valuation. And this is quite important in part for the competitive angle. Right. You have many employees now at OpenAI. Much of their payment comes in the form of stock and equity, and they are like millionaires on paper. But until you are able to sell the stock, you don't have actual cash. 6.6 billion in share sales is something. It used to be the case typically that in a startup, in a private company, you had to wait until the company went public or was acquired to cash out and really benefit from the equity. But now, as we've covered with Jeremy, the kind of line between private and public has been becoming grayer, especially in US markets. There's been a lot of sales of private company stocks. There's been a bit more liquidity in a way that hasn't been the case for private companies, especially, it feels like, with AI companies.
John Krohn
Yeah, I mean, I said that this was going to be a theme in the episode, you know, figuring out ways to make money from being a frontier lab. OpenAI have been doing it. They still are burning billions of dollars more per year than they're making. But, you know, you're starting to get glimpses of them being able to be a profitable business in the future. And yeah, correspondingly, their valuation is skyrocketing. We'll see if they can keep it up. Obviously great if you've been an employee there for a long time and you have some stock. Congrats to you on being able to sell off some of it and get some cash.
Andrei Karlenkov
Right. And perhaps it makes people feel better about not being hired away to Meta for their crazy $100 million pay packages or whatever it is. Also just to mention, there was reporting that OpenAI had $4.3 billion in revenue in the first half of 2025, also burning billions of dollars in cash, I think. Still not profitable, from my understanding. But this is, you know, billions of dollars from API use and subscribers. That's not even taking into account things like ads and sales that they are starting to integrate. This company is going to be on the Google and Meta front, I think, of just raking in absurd tens of billions, hundreds of billions, whatever it is, primarily through ChatGPT. And still on OpenAI, but more on the business drama side, we are once again going to be talking about Elon Musk and xAI going and doing some legal shenanigans against OpenAI. Now there's a new lawsuit. This time it's claiming that OpenAI is stealing trade secrets by hiring former xAI employees. So apparently OpenAI is supposedly targeting individuals with knowledge of xAI's technologies and business plans. xAI, of course, is Elon Musk's frontier AI play, with Grok being their GPT competitor. Apparently xAI believes that OpenAI is making former employees breach confidentiality agreements to gain unfair advantages. This is coming, you know, after quite a few lawsuits at this point from xAI against OpenAI. I don't know what to say on this. More legal drama. Does OpenAI need to steal any secrets from xAI? I'm not sure that's very plausible, but I suppose we might learn some fun things in court if this goes that far.
John Krohn
Yeah, I don't have anything to add on this story. More legal drama. It doesn't impact me and there's nothing I could leverage here.
Andrei Karlenkov
I do look forward to the OpenAI movie that's apparently in production. You know, OpenAI has had, for a business, such an interesting kind of dramatic history that if you compile it all into a book or a movie, I think it'll be pretty fun. Like, I just want to imagine Elon Musk arguing with Sam Altman in a movie, like it's The Social Network or something.
John Krohn
Yeah, that is interesting because it can really change depending on how they do this film and how well it does at the box office. You just mentioned The Social Network there. The Social Network film made a big difference in the way that Mark Zuckerberg and Facebook as an organization are perceived. And you know, even years later, I wouldn't be surprised if things like the name change, the rebrand from Facebook to Meta... yes, it does have to do with being a broader organization than just Facebook, but also I think part of it is kind of just getting away from what people perceive as a toxic brand from that very successful film. So yeah, it could make a big difference in the minds of the public as to what kind of organization OpenAI is.
Andrei Karlenkov
Onto the last story for the section. We've got startups raising absurd amounts of seed money. So Periodic Labs has emerged from stealth mode with a 300 million dollar seed round coming with prominent investors like Andreessen Horowitz, Nvidia, Jeff Bezos. This was founded by people formerly from Google Brain and DeepMind and by a former VP of Research at OpenAI. Their goal is to automate scientific discovery by creating AI scientists and autonomous labs where robots conduct experiments and gather data. Apparently the initial focus is on superconductors. I can see why investors would be happy. This is a hard space to compete in, a very high-tech challenge with robotics and AI and I guess scientific discovery all coming together, and a space where I think there's probably a lot of room to make advancements, to make progress and potentially have discoveries in things like chemistry, physics, you know, material science, etc.
John Krohn
I love this. This is the kind of application that I dream of as AI advances. You know, to have a slightly better social media experience or a slightly better chatbot experience every six months, those kinds of incremental gains are nice. But these kinds of companies that are changing the physical world by blending together cutting-edge AI with, as you say, robotics, and you know, making scientific discoveries, having new superconducting materials, these kinds of real hard applications in the physical world, I love it and I wish them all the best.
Andrei Karlenkov
Yeah. And I'm still excited about robotics. Chatbots and agents, on the one hand, are still making a lot of progress, still very much frontier AI. But the real challenge, or the space where there's still a lot of potential to make really, you know, exponential gains, is certainly robotics. We've seen a lot of news on humanoid robotics, on quadrupeds. This is another area where you need to address these hard challenges of hardware, of being in the real world, dealing with physics, dealing with things that ChatGPT and Gemini are not necessarily able to do right now, just from training on the Internet. Onto projects and open source, we've got one story: SWE-Bench Pro, a new take on the software engineering benchmark. We've had SWE-bench coming out, I think quite a while ago. It was initially the big benchmark for resolving GitHub issues, dealing with bugs and so on. People have found that it was, let's say, pretty not great. It had a lot of issues from just one repository, like half of it was Django. It had data in it that sort of included the answers in a way. So then we got SWE-bench Verified from OpenAI, which is now what you kind of look for in evals to really kind of get a better measure of the quality. But at the same time, it is still, let's say, limited as far as the benchmark goes. It is not necessarily conveying the sorts of things that Claude Code and Codex need to do with more long-horizon software engineering tasks, where you need to go and explore data and solve problems that require you to jump around and figure things out as you go. SWE-bench Verified is still focusing on these, like, smaller bug fixes and tweaks. So SWE-Bench Pro from Scale AI is focusing more on more realistic, more useful difficulty levels and clarity, or lack of contamination. They highlight contamination-resilient curation: SWE-Bench Pro is built from commercial repositories, sourced from purchased startup codebases, and from copyleft public repos. They are saying the reference solutions require a hundred lines of code across 4.1 files on average. There's human-centered augmentation and verification on the benchmark. All models are not doing great, at or lower than 20% roughly for GPT-5 and for Claude. So a very useful benchmark. We've covered, I think in the last episode, another similarly long-horizon benchmark that more realistically mimics actual software engineering, what you're actually doing. So this is helping with that. And you know, we love benchmarks on this podcast, and this one looks like a pretty good one.
John Krohn
Yeah, benchmarks are important. You know, something like SWE-Bench Pro, it looks like it will allow kind of, you know, the next generation of capabilities. You know, early in this episode I was talking about METR and this chart of, every seven months, the length of a human task that an AI model can handle doubling. In order to keep up with that, we need bigger, more challenging benchmarks. And it seems like SWE-Bench Pro fits the bill here.
Andrei Karlenkov
Well, onto some more papers that are not benchmarks, in research and advancements. First we've got a mechanistic interpretability work called Evolution of Concepts in Language Model Pre-Training. So we've covered the, let's say, leading paradigm in model interpretability, I think quite a few times on the show. The gist is to understand what features are within models, to kind of get at what concepts they're using or learning and how those play into producing outputs of different types. People have figured out a pretty strong technique where, in essence, you take some activations from your model somewhere in the middle or elsewhere, you compress them, and in the compressed representation you're then able to find groups of activations, groups of neurons that together, when they have activations, tend to fire for specific kinds of things. So they can, for instance, activate when a thing is plural, or activate when you're multiplying, or activate when it's the Golden Gate Bridge, you know, activating for different concepts. And what this paper is doing is kind of tracking that over time in a relatively interesting way. So what they need to do is do that basic idea, but train on the activations, collected across time at multiple pre-training snapshots. And then you're able to sort of track a feature as it gets trained and as features come out. And they find some very interesting things. For one, they say that you sort of have two broad eras of training. You have a statistical learning phase where you're learning kind of the basic features of what tends to be common: specific tokens, specific patterns, statistical regularities. And then later on in training you get into feature learning, where you're learning more sophisticated things: sentence structures, I guess, metaphors or stuff like that. And this, I think, aligns with the similar kind of understanding we have of grokking, for instance, where people have found that you do have this kind of inflection point at some point where you get these higher-level features after starting out with learning kind of the basics. And they have various experiments here showing you can use this for steering. So with these features, one of the things that people do is say, well, if we clamp down on these neurons and don't allow them to activate, the model starts acting very different. Famously, Anthropic did this with Claude by ramping up certain activations that had to do with the Golden Gate Bridge. And then Claude started talking about the Golden Gate Bridge in response to every single thing you fed it, in a very humorous way. Well, in this paper they also do that. They show that if you include only the top-k activations for a given feature, that lets you do well. If you exclude them, you do much worse and start kind of producing nonsense. So I think it's always exciting to see progress in interpretability and understanding what these models are doing. It also plays into safety, into transparency, and this also aids our understanding of training dynamics, of how these models work.
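[Editor's sketch] The "keep only the top-k activations" and "ramp up one feature" ideas described above can be illustrated with a toy example. This is not the paper's code: the feature activations, the decoder directions, and the function names are all hypothetical, and a real setup would operate on learned dictionaries over model activations rather than hand-written lists.

```python
from typing import List

Vector = List[float]


def top_k_indices(activations: Vector, k: int) -> List[int]:
    """Indices of the k largest feature activations."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return order[:k]


def reconstruct(activations: Vector, decoder: List[Vector], keep: List[int]) -> Vector:
    """Sum decoder directions weighted by activation, using only features in `keep`."""
    dim = len(decoder[0])
    out = [0.0] * dim
    for i in keep:
        for d in range(dim):
            out[d] += activations[i] * decoder[i][d]
    return out


def steer(activations: Vector, feature: int, scale: float) -> Vector:
    """Copy with one feature's activation multiplied by `scale` (0.0 ablates it)."""
    steered = list(activations)
    steered[feature] *= scale
    return steered


if __name__ == "__main__":
    acts = [0.1, 2.0, 0.0, 1.5]  # toy activations for 4 features
    decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # toy 2-d directions
    top = top_k_indices(acts, k=2)
    rest = [i for i in range(len(acts)) if i not in top]
    print("top-k only:", reconstruct(acts, decoder, top))
    print("top-k excluded:", reconstruct(acts, decoder, rest))
    print("feature 1 ramped up:", steer(acts, feature=1, scale=10.0))
```

In this toy case almost all of the reconstructed signal comes from the two strongest features, which mirrors the qualitative finding mentioned in the episode: keeping the top activations preserves behavior, while removing them degrades it.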
John Krohn
Yeah, cool paper and you did an amazing job summarizing it there, Andre. I have nothing else to add. That was perfect.
Andrei Karlenkov
Well, glad to hear, because let's say I am a little sleep deprived. So we'll see how cogent my summaries remain. Onto the next one, titled What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT. So in this paper they are looking at basically what is actually contributing to being effective at reasoning. Reasoning meaning these sort of things you do prior to giving your answer, effectively. Sort of the things where in GPT-5 Thinking you can pick, like, normal thinking, hard thinking, extreme thinking, and then the model goes off and spends a minute reasoning with lots of steps and so on before it gives you the answer. So here they investigate empirically what kinds of behaviors actually contribute to being effective. So the factors here are length: how long do you reason for, how many tokens do you output? Review: how much do you spend checking, verifying and backtracking prior steps? And structure: essentially, what is kind of the graph of things you do? Do you, like, state the problem first, then outline it, then follow it step by step? Things that are pretty essential for effective reasoning. And they have a concept here called review ratio, defined as the fraction of review tokens within the chain of thought, and they actually find that shorter reasoning traces and lower review ratios are associated with higher accuracy. So the kind of naive approach of letting your chain of thought get super long, or, you know, letting it go on and on when it's already arrived at the answer and kind of going back on itself on the same thing, tends to hurt. I think we covered at one point an interesting thing where models could find the solution, but then if you let them keep thinking, they go off track and then give you the wrong solution because they effectively overthink. So yeah, they look at these factors and demonstrate that these kinds of patterns exist empirically in a whole bunch of models, including Claude and Grok and DeepSeek. You want to keep the chain of thought concise, focused, structured, and that leads to better accuracy.
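[Editor's sketch] The review ratio described above, the fraction of review tokens within a chain of thought, is simple to compute once each reasoning step has been labeled. The sketch below assumes such labels already exist (the `is_review` field is hypothetical, and how the paper assigns labels is not covered in the episode) and uses a crude whitespace token count in place of a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    is_review: bool  # hypothetical label: True if the step checks or backtracks


def token_count(text: str) -> int:
    """Crude whitespace tokenizer; a real analysis would use the model's tokenizer."""
    return len(text.split())


def review_ratio(steps: List[Step]) -> float:
    """Fraction of chain-of-thought tokens spent on review; 0.0 for an empty trace."""
    total = sum(token_count(s.text) for s in steps)
    if total == 0:
        return 0.0
    review = sum(token_count(s.text) for s in steps if s.is_review)
    return review / total


if __name__ == "__main__":
    trace = [
        Step("Let x be the number of apples", False),
        Step("Then x plus three equals seven so x is four", False),
        Step("Wait let me double check that subtraction", True),
    ]
    print(f"review ratio: {review_ratio(trace):.2f}")
```

A ratio near 0 means the trace never checks its own work, and a ratio near 1 means it does almost nothing but re-check.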
John Krohn
This is interesting and different from what I would have expected. I would have thought that having a higher review ratio, for example, you know, spending more time ensuring that you're being accurate, would lead to higher accuracy. And so I had this idea, this kind of intuition, which probably has nothing to do with the way that machines actually are doing this kind of chain-of-thought processing. But it seems like there's kind of an analog here to when you're anxious, or, you know, people who tend to be more anxious: they tend to overthink, they tend to continue to masticate over the same problem over and over in their head. And if you do that enough, you can create this whole fantasy world of, oh, I'm for sure going to fail at this thing, or those people definitely don't like me, even though there's no real supporting evidence in the real world. So I don't know, I'm just kind of making a fun analogy to human thought here.
Andrei Karlenkov
Yeah. When they cover higher review ratios, that goes from 0 to 1. Right. So potentially, as you, like, get into 0.8, 0.9, 1.0, right, you're doing nothing but reviewing your work. When do you do the actual reasoning? So in that sense, I think your intuition there is right: you do want some review. So if you get to 0.4, 0.5, 0.6, 0.7, for instance, you do see improvements. So doing zero review is bad, but doing too much review is also bad. So I suppose the summary isn't that higher is worse, it's more that too much or too little is bad. Next up we've got another sort of benchmarking, kind of empirical evaluation paper: Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III. So CFA is a test of financial reasoning, I think it's Chartered Financial Analyst, so this is a professional credential. And these financial analysts have to go through these fairly complex tests, from what I understand, to become accredited. And these frontier models, o4-mini and Gemini 2.5 Pro, have achieved scores that pass, you know, what you need to pass at Level III. They get to 79.1 and 75.9 respectively, surpassing the 63% passing threshold. So they are, you know, in some sense capable of being CFAs. You've seen this before with the bar exam, I believe, for lawyers, as well as for various kinds of measures of being able to conduct certain careers. The models passing on the test at least doesn't mean that they're able to do a job, right? This is, like, on paper, multiple choice, essays, whatever, things that LLMs are good at, and we've seen in practice, when you try to use them in a real job, things are more messy and you can't do it so easily. But they are clearly getting to a point where they can have an impact in the same way that they're having an impact in coding, for instance.
John Krohn
Yeah, this is cool. It ties into this theme again that we've had over the course of this episode. This idea of, you know, the complexity of the human task, the length of the human task that an AI model can handle, doubling. This is a great example of it. A solid benchmark where 12 months ago this would have been unimaginable and another 12 months from now this might be rudimentary for a lot of AI models out there to be able to tackle. The only people I know who have their CFA Level three are very intelligent people. So this is a cool benchmark.
Andrei Karlenkov
Your chatbots are smart. LLMs are getting PhD-level smart, CFA Level III smart. What are you going to say? This is the world you live in. One final paper, also sort of empirical analysis and understanding of model dynamics, titled Short Window Attention Enables Long-Term Memorization. So this is related to a bit of a trend in research we've covered on and off, dealing with alternatives to the transformer model architecture. So in short, with transformers, the key insight there is Attention Is All You Need, back from 2017. They took recurrent models that take the output of a model and feed it back to itself and said, wait, forget this loop. The loop is really annoying because it makes it hard to train and scale. Let's just do the step where you look at everything all at once, and then you don't need recurrence. And that turns out to be amazingly good. And transformers now rule the world. But in recent years we've seen renewed focus on recurrence with things like Mamba, things like xLSTM, various models that have found that you can train fairly powerful recurrent models, not necessarily competitive with transformers, not necessarily as, let's say, practical from a compute standpoint, but nevertheless potentially promising for longer-term tasks, for tasks that require memory. And my personal belief is, how are you going to live without recurrence? Right? You need some recurrence. Nowadays most of the models do recurrence in some sense via kind of notes or to-do tasks or weird stuff like that. But there's still some promise in these alternative model architectures. The best model architectures have been ones that are hybrids. So they combine linear RNNs or other recurrent models with sliding window attention. And so this paper explores kind of what amount of attention all at once you want versus recurrence. You know, how much of a given input do you see all at once at a time, effectively? And perhaps somewhat counterintuitively, it's actually better to train with shorter windows. So to have memory, to be able to make use of your recurrence, you want to not have too big of a train-time window length. And with the test-time window length, likewise, you want to actually keep them around the same, which I guess is intuitively correct. So they do say you can use a stochastic window size that will help you balance long-context performance with short-context and reasoning abilities. There's a bit of a trade-off basically between these two things. And yeah, there's various kinds of empirical findings that I think are interesting. The gist is maybe the hybrid architecture is still needed, maybe it's the future once scaling becomes hard, once we need models and agents that work for days at a time, that have long-term memory, that actually have memory in a literal sense. I'm still very curious to see if that's the case. And this is the kind of stuff that, you know, research does: look into things that may or may not pay off, but have some promise.
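[Editor's sketch] For readers who want to see what a sliding attention window and a stochastic window size look like mechanically, here is a small illustration. It only builds the boolean attention mask and picks a window length per training step; the two-choice sampling rule, the probability, and the window lengths are assumptions made for illustration, not the paper's exact recipe.

```python
import random
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and local: only the most recent `window` positions, including i.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def sample_window(short: int, long: int, p_short: float, rng: random.Random) -> int:
    """Pick the window length for one training step (assumed two-choice scheme)."""
    return short if rng.random() < p_short else long


if __name__ == "__main__":
    for row in sliding_window_mask(seq_len=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    rng = random.Random(0)
    # Hypothetical window lengths; mostly short windows, occasionally a long one.
    print([sample_window(128, 2048, p_short=0.75, rng=rng) for _ in range(8)])
```

With a short window, anything older than the window can only reach the current position through the recurrent part of a hybrid model, which is the intuition given in the episode for why short training windows push the model to actually use its memory.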
John Krohn
Another great explanation there, Andre. You're crushing it. An interesting thing that this reminds me of is in the 90s, when convolutional neural networks started to have some utility and practical applications with architectures like LeNet from Yann LeCun. So the convolution, kind of similar to this, passes a window over the whole image. And the initial intuition when people were creating convolutional networks was that the window should be kind of larger, like maybe a 9 by 9 pixel window or a 16 by 16 pixel window. Because they reasoned that, you know, in order to have some kind of feature of the image have some meaning, it would have to kind of be on that scale. But then empirically over time, as people used convolutional neural networks more and experimented with them more and more, they found that a three by three or a two by two window on the convolution was actually more effective, you know, way more computationally efficient and, yeah, able to identify features even though those features are so small in the image. And so, I don't know, this reminds me of that, something kind of analogous happening here with the attention mechanism.
Andrei Karlenkov
Right, exactly. And to sort of bring it to the real world in case these technical terms or jargon don't connect: the attention window is basically your input, right? It's all the stuff you put into the chatbot LLM before it begins its output. So we are seeing now, with things like Gemini's 1 million token context windows, that context windows are getting crazy. And this is one of the very surprising things for me. In 2023 I was skeptical of LLM progress because at the time we had 4,000 token context windows, 8,000 token context windows. It was not apparent that you could easily do memory with LLMs. But then it turned out that you can just feed it 50 books or whatever and they are able to handle it. So long term memory kind of isn't that big a deal. But as we get into agentic things, as we get into agentic workloads that take days and days, or even, you know, if you want an AI employee to be human-like, you need long term memory, short term memory, all the stuff that we humans have, and it is still an unresolved problem how you do that properly.
Onto policy and safety. Last section, we begin with policy coming out of California. SB 53, which we've covered a few times, is now law in California. So this is the Transparency in Frontier AI Act, the successor to SB 1047, which we've also covered quite a bit. A very significant milestone in regulation, especially when it comes to frontier AI. So it mandates that large AI companies disclose their safety and security processes, provide whistleblower protections and share information with the public for transparency, or face fines and various other kinds of things. Apparently AI developers must publish a framework on their website detailing how they incorporate national and international standards into their AI practices and update any changes to safety protocols within 30 days. And this is the version of the bill that came after the previous one got vetoed. We covered how Anthropic actually endorsed this bill.
So there's some industry backing for this being sort of good regulation, which was the criticism of 1047 and the criticism of the EU AI Act. Some say that regulation isn't well thought out. It's kind of like stupid, basically. So with this one, some people at least have the opinion that it is well done, that it's a good way to do regulation. Others, you know, Meta I think and others, are lobbying against it. Industry is not necessarily excited about this thing passing, but either way it is now law and will come into effect at some point.
John Krohn
We certainly need some kinds of laws here. I am not the expert on these topics like you are, or like Jeremy certainly is.
Andrei Karlenkov
Certainly Jeremy is the one that's like, oh, we gotta avoid the catastrophic X risk of AI destroying us all. So these big models, please have safety and enforce it. So I'm sure he's a fan of this happening. It incorporates recommendations from a 52 page report by researchers. So we've seen a lot of frameworks on safety, we've seen many sort of recommendations, suggested practices. But this demonstrates a trend where things are getting increasingly concrete, increasingly practical in AI safety. Anthropic in particular has their safety framework. OpenAI also publishes regularly on safety with things like biohazards and cybersecurity. And whether you're, like Jeremy, a believer in existential risk, that you know AI will kill us all within a couple years, or whether you're concerned about things like cybersecurity or misinformation or chemical warfare for instance, I think either way AI safety is not something to ignore at this point. And I think this kind of regulation, in my opinion, is a good thing.
I do want to quickly cover a related thing, actually a topic request from a listener that was very helpful. There's also a law called SB 942 in California, the California AI Transparency Act, similar name, that has been passed and goes into effect in June. I don't believe we covered this and, as the listener says, it seems to be flying under the radar. But it is requiring that the big companies make it possible to detect, from what I understand, whether something is an AI output, and it is able to impose large penalties. So the penalty for each violation is $5,000, each day is considered a separate violation, and the definition of what a violation is is unclear. So just looking from the summary, the bill requires providers to make available AI detection tools at no cost to a user that meet certain criteria, including that the AI tool is publicly accessible, with various requirements to that.
Apparently providers need to offer the user an option to include a manifest disclosure in image, video or audio content, or stuff like that. So yeah, basically some requirements that you can tell if AI is AI, and if you do some math it can get a little bit ridiculous in terms of the penalties that get accrued. In a follow up, the listener provided some analysis that if the big companies are C2PA compliant, they're probably going to be compliant with this manifest requirement. Most of the larger companies are compliant with these kinds of requirements, but apparently for smaller players, for startups, this could be a real issue. So interesting. As an example here, they're saying Gamma creates an estimated 5 million images per day, which, if not compliant, would mean, if it's per image and per day, a $25 billion penalty on the first day, et cetera. So this I think highlights the need for well thought out regulation, usable regulation, and it seems like this bill might have some issues, especially when it comes to ambiguity and actual enforceability.
Moving on, we've got Elon Musk's xAI offers Grok to the federal government for 42 cents. 42 cents? That's a science fiction reference. Great. Well, we covered previously how I believe it was OpenAI and Anthropic both offering their services to the government for $1, in a kind of move to probably get into the government, I suppose. And now xAI is securing a deal with the US General Services Administration, GSA, to provide its AI chatbot Grok to federal agencies for 42 cents over a year and a half. So 42 cents is below the dollar per year from OpenAI and Anthropic, and a reference to Hitchhiker's Guide to the Galaxy.
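As a rough check on the SB 942 penalty figure quoted earlier in this segment, here is a minimal sketch of the arithmetic. It assumes the reading described in the discussion, where every non-compliant image counts as its own $5,000 violation; the function name is hypothetical, and the 5 million images per day figure is the listener's estimate, not a verified number.

```python
# Assumption: each non-compliant image counts as one violation at $5,000,
# which is the interpretation discussed in the episode, not a legal reading.
PENALTY_PER_VIOLATION_USD = 5_000


def daily_penalty_usd(items_per_day: int,
                      penalty_per_violation: int = PENALTY_PER_VIOLATION_USD) -> int:
    """Return the penalty accrued in one day if every item is a violation."""
    return items_per_day * penalty_per_violation


# Listener's estimate quoted in the episode: about 5 million images per day.
first_day = daily_penalty_usd(5_000_000)
print(f"${first_day:,}")  # prints $25,000,000,000
```

Under that per-image reading, the first day alone comes to $25 billion, which matches the number cited above and shows why the definition of a violation matters so much here.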
John Krohn
Yeah, that's exactly it. That's definitely where the 42 comes from. Well, I guess Musk is getting along well enough with the federal administration again that something like this is allowed to go through.
Andrei Karlenkov
Yeah, seems like Musk and Trump might be pals again, or at least have resolved their spat. Moving on, Character AI is removing Disney characters from the platform after the studio issued warnings. So Character AI is a massive platform, in case you're not aware. Kind of the winner in the space of chatbots that are characters that you role play with and talk to. So literally characters that are chatbots that you can talk to. There's probably millions of these characters on the platform. I don't know the exact number, but the platform itself has huge numbers of users, in the millions, and very high retention for users. So unsurprisingly, many characters are from popular media, including Disney. Well, Disney sent a letter saying our characters are there without our permission, and Character AI quickly responded by removing them. So an important development in the sense that, you know, copyright and IP and so on is still so unclear with regards to what is legal and what's not. And for these kinds of companies, these are the kinds of things that happen. Like, either you are OpenAI and you just go off and ignore any worries and train on presumably Disney content, I'm sure you can make Disney animated videos with Sora 2, or you're a player that is actually going to have to listen to, you know, an organization like Disney that can really impose some legal cost on you.
John Krohn
I think a key distinction here is that it isn't necessarily about what Character AI was using as their training data, like we were talking about with OpenAI earlier in this episode. This is something where, you know, to have a Princess Elsa character in Character AI, it's shocking to me that they were able to get away with it for this long. I kind of would have assumed that Disney would have done this a long time ago.
Andrei Karlenkov
Yeah, it's certainly different in the sense that it's much easier to remove. Right? It's not a big deal for Character AI to take down these characters, versus OpenAI and others, who are like, oh, we can't train on your data? Too bad, because we are gonna do that, because the models need all the data they can get. I will say, legally this is a little interesting to me, because on the one hand these are characters owned by Disney. On the other hand, it's presumably user generated, not by the company itself. And it's using kind of the idea of a character. Right? It's like role play. It's fan fiction. And so in that sense, I don't know if there's a strong legal precedent for this being bad compared to something like, yeah, using the training data or outputting something that looks exactly like your characters, which you've seen with lawsuits against Midjourney that we've covered in recent episodes. So a different kind of legal issue that is a little bit surprising to me. But either way, another interesting consideration in this space.
And on to the last story. We've got Spotify trying to fight AI slop and apparently failing, according to this article saying that Spotify's attempt to fight AI slop falls on its face. So I don't use Spotify, so I'm not personally aware of this stuff, but apparently Spotify has been flooded by AI generated content, which is affecting real artists and their revenue. We have covered at some point a while ago how some types of music, like, you know, relaxing music or electronic music, are now often AI generated by AI artists, you know, the songs being completely not real and racking up some money. So there is some benefit for spammers to just upload a ton of music and try to get into playlists. And there have been examples of official Spotify playlists including fake songs. So apparently now they are trying to make it so you have to do AI disclosures in music credits.
They have an impersonation policy that will remove music replicating another artist's voice without permission. But yeah, according to this article, these new policies aren't going to do much. And there are examples of AI generated music like The Velvet Sundown, a recent notable band that has amassed millions of streams of AI created songs. Those are allowed to stay on the platform. So there's no kind of requirement to not use AI.
John Krohn
This is a tricky one. It's going to get harder and harder to detect AI slop, as we've seen. Just in the same way as with Sora 2, you know, going right back. This is our last story of the day, going back to our first story of the day. As text to video becomes so compelling that you cannot tell the difference, the same kind of thing is going to be happening in music, if it hasn't already. Yeah, some people might actually like AI generated music, like you said, you know, relaxing music, that kind of thing. To be able to go into a spa and have some kind of infinite stream of relaxing music where you're not going to be getting, you know, repetition. That could be a positive. But there's, yeah, obviously a lot of AI slop out there, as we've discussed in the episode as well. And yeah, I could see why you'd want to get rid of a lot of it.
Andrei Karlenkov
Yeah, especially here they note that you're able to exploit some trends, some aspects of music platforms to effectively post songs under someone's band name, like impersonate them and get their streams, which obviously is very bad. You're now taking away revenue from that artist. So Spotify is going to fight against that in particular, but is still gonna allow AI songs. And I think, as you kind of mentioned at the beginning with Sora, for many people, perhaps most people, knowing that something is done by AI, and especially knowing that it was just a prompt that generated a song and there's no real human intentionality, creativity, or effort behind it, takes away the appeal of art or media. Like, it doesn't matter what it looks like, what it sounds like. If you know that it's AI, now you don't like it or it doesn't resonate. And I feel that's the case for music for myself. I think that probably is true for many, if not most, users on Spotify. There is some nuance, like electronic, ambient, chill music. I don't mind too much if it's AI, maybe, but if it's vocals, if it's other things, it doesn't feel right. So yeah, it's a weird thing of, like, probably some AI generated content is cool and you should be able to use it in your creative process. But what is slop? What isn't slop? Should some slop be good? Like, some memes are fun. I don't mind some weird AI images. Right? But it's a weird place to be in culturally and kind of artistically. Well, that is it for this episode. Getting back to our regular long format of going on for quite a while. Thank you, John, for guest hosting and making it so we can at least post once every two weeks. Filling in for Jeremy.
John Krohn
Yeah, my pleasure. And if you run out of Last week in AIs to listen to for whatever reason, either because you listen so avidly or they skip a week, check out the Super Data Science podcast. We've got lots of interviews with top people and yeah, have a lot of fun on the show. Typically.
Andrei Karlenkov
Yeah. Unfortunately, last week in AI, the back catalog is not too compelling. You're not going to listen to our hundreds of episodes. But Super Data Science, on the other hand, is interviews. So it's, you know, a golden resource. And how many episodes are you at now? It's got to be.
John Krohn
We're over 900.
Andrei Karlenkov
Over 900. You're getting up to like.
John Krohn
Yeah.
Andrei Karlenkov
So whatever digits, whatever you're interested in, I'm sure you can find some compelling people. Including Jeremy. Including me. Actually. We both have episodes on the show, so you can check those out.
John Krohn
That's right. We did a cool one in person in San Francisco about a year ago with Andre.
Andrei Karlenkov
Chatted all about the imminence of AGI, how crazy that is.
John Krohn
Yeah, I had a lot of fun with that one. Really enjoyed chatting with you. It wasn't at all what I had planned for the interview, but that rabbit hole that we went down I think was awesome and it's amazing how aligned we are on our views. If people want to check out the episode with Andre, it's episode 867. So you know, you can type that into Google or Spotify or Apple Podcasts or whatever, or you can go to superdatascience.com/867.
Andrei Karlenkov
And I shall endeavor to include a link in the episode description as well. Hopefully I won't forget. We'll see. Anyways, thank you for listening. Apologies again for not being true to our name of consistently covering last week's AI news. Do subscribe and leave us comments and reviews. We always appreciate it. Share the podcast if you can and do try to tune in next week or whenever we next have an episode. Hopefully Jeremy will be back.
Podcast Outro Singer
Tune in, tune in when the AI news begins, begins. It's time to break, break it down. Last Week in AI, come and take a ride. Hit the low down on tech and let it slide. Last Week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeon fly. From the labs to the streets, AI's reaching high. Algorithm shaping up the future sees. Tune in, tune in, get the latest with ease. Last Week in AI, come and take a ride. Hit the low down on tech and let it slide. Last Week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.
Date: October 7, 2025
Hosts: Andrei Karlenkov (Skynet Today), guest co-host John Krohn
—
This episode covers a packed fortnight in AI news, with a strong focus on new generative AI tools, the ever-advancing capabilities of LLMs, the business pressure on leading AI companies, hot legal and policy developments, and some lively discussion about the industry’s latest benchmarks and research.
Key topics include:
[03:07–10:50]
“The video quality is far better... than anything I’ve ever seen in text-to-video generation. And some pretty impressive real world physics contained in it.”
— John Krohn, [07:00]
[11:18–20:42]
“Claude Sonnet 4.5... is the best model in the world for agents, coding and computer use... enhanced domain knowledge in coding, finance, and cybersecurity.”
— Andrei Karlenkov, [15:55]
[23:59–25:09]
“...the idea of scrolling and seeing non-stop AI content... this is a slop machine. This is feeding you AI slop.”
— Andrei Karlenkov, [24:29]
[25:36–32:12]
“I personally don't have a level of comfort with OpenAI or maybe even any of these big players... what they're going to do with my data isn't clear.”
— John Krohn, [28:23]
[33:45–35:02]
[35:02–41:46]
[44:03–45:48]
[45:48–49:01]
[49:01–54:56]
[55:40–57:41]
“These kinds of companies that are changing the physical world by blending together cutting-edge AI with robotics and making scientific discoveries... I love it and I wish them all the best.”
— John Krohn, [56:56]
[57:41–77:36]
“...the human task length that an AI model can handle doubles every seven months.”
— John Krohn, [14:40]
[77:36–up]
On Sora 2’s realism:
“I think Sora in general is a really viable product that we're going to see a lot of.”
— John Krohn, [08:26]
On the shifting value of creative originality:
“When you can tell that somebody actually put effort into writing something or creating something... that is starting to become more valuable, but also, interestingly, harder to distinguish from the slop.”
— John Krohn, [24:50]
On business margins:
“If a free plan goes away and another company is cheaper... I think people will move over. Right. I don’t think there are people who are fans of ChatGPT so much as fans of the experience.”
— Andrei Karlenkov, [29:55]
This episode underscores how rapidly generative AI capabilities, deployment, and business strategies are evolving—all while legal, cultural, and policy frameworks struggle to keep up. The hosts highlight both the technical progress and the emerging societal challenges, from user trust and safety to the “AI slop” dilemma and questions of economic sustainability and copyright.
If you missed the latest two weeks in AI, this episode packs all the major developments, from Sora’s cinematic videos and new agentic tools to regulatory and business shakeups.
Host/Guest sign-off:
“These are things that are having a huge impact. … We love benchmarks on this podcast.”
— Andrei Karlenkov, [21:38 & 61:29]
“This is the kind of application that I dream of as AI advances... I wish them all the best.”
— John Krohn, [56:56]
(For further deep-dive interviews, check out John’s Super Data Science Podcast, now over 900 episodes.)