A
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the timestamps and links to skip to any of the many stories we'll be talking about today. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup.
B
Nice. This is Jon Krohn. I am irregular. You might even say one of your odd hosts. That would be a good adjective.
A
A regular guest co-host is how I like to think about it.
B
Right? Exactly. I really appreciate that. Yeah, I've been on the show probably half a dozen times at least, and I love being on it. Last Week in AI is the only podcast that I listen to. If people have heard me on the show before, I'm sure they've heard me say that. Delighted to be here. I'm perhaps best known for hosting a show called Super Data Science, which you've been on, Andrey. It's an interview-format show as opposed to a news-focused show like Last Week in AI. They're a nice complement to each other, we might say. And something big since I was last on the show is that in March I co-founded a new consulting firm, which I'm CEO of, called Y Carrot. Like y-hat, but with "carrot" for the caret, the computer character, the thing above the six on a US English keyboard. It's a bit of a machine learning joke for people who are in the know. We're focused on agentic stuff, generative stuff, RAG, and bringing that into enterprises, letting people get ROI on all the latest and greatest in AI. So there are some stories that I'll be able to relate to from firsthand experience because of that.
A
Makes sense. And now is probably a good time to be consulting, because there's certainly a lot happening very quickly and it's honestly hard to keep up even if you're hosting a podcast, much less if you're not doing that.
B
Andrey, it's unreal. I've never had an experience in business like this before. With every other commercial thing that I've ever tried, it's been hard to get product-market fit. For any of our listeners out there, I'm probably cannibalizing my own business by saying this, but there's so much work out there. Like a rising tide lifts all boats. There's so much opportunity right now to be transforming organizations with LLM-enabled technology that it's crazy. Every conversation leads to next steps. Nobody's ever like, "I'm not sure this is what I need." It's just a matter of prioritizing and getting things done.
A
And it's actually quite a good fit. To give a quick episode preview: this episode is going to be pretty heavy on the tools section. Lots of new things, and most excitingly, ChatGPT agent just came out, so that'll probably be one of the big focus areas. Then in business, lots of interesting developments on the hiring front we've been talking about for the last few weeks: even more kind of weird news of acquisitions, hires, movements, et cetera. Beyond that, we'll only have a couple of stories in research and in policy and safety. This is going to be a bit of a quick episode, so it's just going to race by. Try to keep up.
So let's go ahead and dive in. Tools and apps, starting with OpenAI's new ChatGPT agent, which can control an entire computer and do tasks for you. The way this looks is that in ChatGPT they have this kind of selector menu where you can choose various modes, including deep research, web search, et cetera, and ChatGPT agent is now a new option there. The gist of it is that it's combining two previously existing things. They already had Operator, which could browse the web for you and do various tasks that way, and they had deep research, which analyzed and summarized information. So the way OpenAI pitches this is as sort of the best of both worlds: a much more powerful agent that can do general computer use. It can click, it can run commands, it can browse the web, and so on. So this is the latest frontier, you could say, in agentic task execution beyond code. Conceptually, I suppose, it's able to do anything you could do with a computer. And along with the announcement, besides the utility of this, they also show really, really strong performance on various benchmarks like Humanity's Last Exam and FrontierMath, things we've covered. ChatGPT agent with browser, computer, and terminal is able to outdo OpenAI o4-mini with tools, deep research, all of these, by quite a big margin. So this seems to be sort of the most trained agent that OpenAI has ever released.
B
It's cool. I've used it already and it's really effective. You can watch it working, so you can see it going on the Internet, doing tasks for you. If you've ever remoted into a remote server, it's like doing that and watching a colleague of yours program or search the web. And you can actually go in there, interrupt it, and take over if you want to. I haven't tried the interrupting yet. I'm not sure what value that would really provide or whether it can continue after you stop interrupting it. I don't know exactly how that works. It can create assets for you, like spreadsheets and slideshows, and we've been using it for that already and it's been really good. I've been a deep research user for months now. I pay for the Pro tier of ChatGPT in order to get that amazing report building. It seems comparable to having a McKinsey analyst working for you, except that they can get their work done in minutes instead of days or weeks. So it's that level of quality with deep research, and now adding to it the ability to output assets for you and to see what it's doing while it's crawling the web agentically. It's a cool interface. I like it. Powerful.
A
Yeah, that certainly seems like it. In fact, it's so powerful that there are some safety concerns. It's going to ask you for permission for things like sending emails and making bookings, since it can kind of do whatever. It also has restrictions on financial transactions. Probably a good idea. And as you said, this is now rolling out to Pro, Plus, and Team users, with Enterprise and Education coming later. So lots of people are going to start using this, and I think we're going to start seeing some pretty cool examples of what you can do with it.
On to the next story. We covered Kimi K2 briefly in the last episode as a new, exciting open source release, but we didn't dive into it, so I think we will cover it a little bit more. The headline is "Alibaba-backed Moonshot releases new Kimi AI model that beats ChatGPT, Claude in coding, and it costs less." The gist is that Kimi K2 is a 1 trillion parameter model that has a lot of experts, so only 32 billion parameters are active at a time, and it had really impressive benchmark numbers. What I've seen since then is that it passes the vibe check. Everyone seems to agree this is a really good model, a really impressive open source model, competitive even, as this article says, potentially with Claude or ChatGPT or other proprietary models. So way beyond Llama, way beyond probably anything we have in open source, including DeepSeek V3. And this is not even a reasoning model, so they presumably have an R1-like variant of this in the works.
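As a rough illustration of what "1 trillion parameters, 32 billion active" means for a mixture-of-experts model, the short sketch below computes the share of parameters used per token. The two figures are the approximate ones quoted in the episode, and the function name is made up for this example.

```python
# Approximate figures quoted in the episode (not exact model specs).
TOTAL_PARAMS = 1_000_000_000_000   # ~1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # ~32 billion parameters active per token


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters that are used for a single token."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active per token: {fraction:.1%}")  # prints "Active per token: 3.2%"
```

In other words, on these numbers only about one parameter in thirty takes part in any given forward pass, which is why such a large model can still be comparatively cheap to serve.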
B
Yeah, this is kind of a story that is unsurprising. I suppose this is the trajectory that we're on. You're kind of expecting somebody to come up with open source approaches that rival the proprietary ones. Jeremy talks about this a lot on the show. I'm sure you do as well, but for some reason I remember Jeremy saying frequently that about six months after a proprietary model comes out, you can expect similar capability in open source. And that's what we're seeing here. I haven't used it myself, but the benchmarks look good.
A
Yeah, and there are interesting notes about it. For instance, people say that it is really good at creative writing. It has a different writing style, potentially because of being trained on different data distributions coming out of China. So yeah, interesting developments. And as with DeepSeek, it's interesting to see this coming out of China, where they are more hardware constrained due to export restrictions, as we talk about quite a bit. In the technical report, similar to DeepSeek, they go into some of the interesting technical insights. In particular they highlight Muon, this new optimizer that hasn't been proven so much yet, but in this case scaled to a gigantic model. So a combination of really exciting developments for open source, but also some new technical insights that are quite interesting.
And next, "Amazon targets vibe-coding chaos with new Kiro AI software development tool." So kind of a surprise story for me. We've seen Cursor, of course, be a very important agentic-powered IDE for code development. Claude Code has been killing it in the past couple of months. Now Amazon has released this new Kiro development environment, which is basically positioned as another agentic coding tool that is particularly focused on making things a little more principled. So they highlight specs and planning and all these kinds of things in their blog post. It also has all the various features that you expect, with MCP and so on. So boy, this is a really, really busy space with all this agentic coding stuff. I was just exploring Cline and Roo, these extensions by open source teams. There are forks and combinations, and now Amazon is in the fray with this new tool. Clearly people are putting in a lot of work trying to optimize this and make it work well.
B
I'm a big Cursor fan personally. How about you, Andrey?
A
I used to use Cursor as my main tool, but Claude Code has kind of overtaken it, and I actually moved back to VS Code from Cursor just because it is now pretty feature-comparable and Cursor updates a lot, sometimes in ways that don't work too well.
B
Nice, that's good to hear. I'll have to try that out and kind of maybe go back also to VS code myself. This one here, this Kiro announcement from Amazon, this one feels kind of random to me. I know Amazon is often throwing stuff at the wall to see what will stick and this kind of, this seems to fit into that category. You know, big company trying out lots of different projects. But Amazon hasn't been like, I can't off the top of my head think of any big LLM releases like proprietary or open source that have been anywhere near the cutting edge. Can you think of anything?
A
No. They have developed some models, but they really haven't tried to compete in terms of performance. They have internal models, presumably for their chatbots and so on. So Amazon's strategy is, I think, interesting. They don't try to be a frontier lab so much, but they work with Anthropic, for example, and they do develop some things like this to be in the ecosystem in some ways.
B
Yeah, we'll see what happens. My crystal ball predicts that we're not going to be all using Kiro browsers in a year or two.
A
Yeah, it's also IDEs. Sorry, yeah, it's a bit strange. I don't target enterprise that much, but regardless it looks pretty slick, so who knows, maybe it will actually take off.
And speaking of agentic coding tools, next story: "Anthropic tightens usage limits for Claude Code without telling users." This is a development that happened this week. I saw it happening in real time on Reddit, where people on the Claude subreddit were complaining that their usage seemed to be more restricted. They were hitting the limits on the biggest model, Opus, more quickly. So apparently that's true. At least this article seems to support it, especially on the $200 per month Max plan, where you have a crazy amount of budget to use up tokens. And this has coincided with some instability. Wednesday and Thursday, Claude Code and Anthropic were both down briefly and were just not usable. So in a way, not surprising. They are definitely losing a lot of money by being so generous with this Max plan. But I think it's an indication of where things are heading, where I guess at some point you'll have to be profitable and the cost of these subscriptions is going to go even beyond 200.
B
Yeah. With functionality like agents now being available in Claude as well, you could imagine that their compute is getting slammed. I mentioned earlier in the episode that I have a ChatGPT Pro subscription. I also have a paid Claude plan, because there are different kinds of things that I like to do with different providers. I have Gemini Ultra as well, and Claude is my favorite for most tasks. It's actually kind of my default go-to. In fact, it's funny that the story came up. I had never been hit with one of these overload errors before, but I hit one this week. So it seems like we're all kind of in the same boat. And as you said, it's unsurprising given how much money all of the big frontier labs are hemorrhaging on providing their services. They're losing money by giving us access to such powerful models at such low cost, and you wonder when things are going to have to change. So I understand, like you said, that they have to make some changes. It's surprising because Anthropic is usually good organizationally about communication and getting things right. Maybe they just didn't anticipate that some people would feel this change, but it's a rare own goal, I'd say, for Anthropic.
A
I agree. Yeah, they rarely seem to make these sorts of missteps, and I think it's probably an indication that Claude Code has taken off pretty rapidly and they've probably been trying to just keep up. Here's a fun detail for me. All these tools allow you to use them with a subscription plan, so you're generally not paying per token, especially in this Max mode. But if you use some tools, you can see the hypothetical amount of money you spent. And as a user myself, I'm spending like $2,000 in tokens on this $200 per month plan. It's insane. So I don't know. I think this is a sign of things to come.
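To put the figures just mentioned side by side, here is a small sketch of the arithmetic. It assumes the $2,000 is the API-priced value of one month of tokens (a hypothetical list-price figure, not the provider's actual serving cost), and the function name is invented for illustration.

```python
def subscription_gap(api_value_usd: float, plan_price_usd: float) -> tuple[float, float]:
    """Return (usage-to-price ratio, dollar gap) for a flat-rate plan.

    api_value_usd is what the same tokens would cost at per-token list prices;
    it is a hypothetical figure, not the provider's real cost to serve them.
    """
    ratio = api_value_usd / plan_price_usd
    gap = api_value_usd - plan_price_usd
    return ratio, gap


# Numbers quoted in the episode: ~$2,000 of tokens on a $200 per month plan.
ratio, gap = subscription_gap(2000.0, 200.0)
print(f"Usage is {ratio:.0f}x the plan price, a gap of ${gap:,.0f} per month")
```

On these numbers the usage is ten times what the subscription brings in, which is the gap the hosts are pointing at when they expect prices or limits to change.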
B
That's a great stat there.
A
I know, sir.
B
Whatever the inverse of a margin is, that's the loss they're putting in there. Yeah, nice.
A
Next up we've got Mistral, and they are also keeping up with all the agentic hype. They have rolled out deep research in Le Chat, their offering for talking to their models, the equivalent of ChatGPT and Claude and so on. This is actually part of several things. They now also have projects, image editing, and multilingual reasoning. So very much in line with Mistral kind of just racing to be feature-equivalent to ChatGPT and Claude and provide an offering that's comparable. As we say with Jeremy here all the time, Mistral is in a tough position. They don't have as much money, they don't have as much compute. But it's always cool to see them rolling out things pretty rapidly.
B
Yeah, I mean, everyone is rolling out deep research. Some of the early movers have been doing it for a year now, and it's expected. It's what we call table stakes in software product design these days if you are an LLM provider, I think. And there are all kinds of safeguards you need to get in place, and all kinds of engineering complexity when you roll this out at the kind of scale that Le Chat would be. But I'm going to plug a free thing that I published a month ago on YouTube. I published this agentic AI engineering course. It's four hours long, and in the first hands-on project we use the OpenAI Agents SDK to create a deep research kind of functionality, so you can see how that works. That's free on YouTube and I'll provide a link for you to put in the show notes. It's pretty cool. 30,000 people have already watched it on YouTube and there are no ads. I've turned off ads. It's just there as an educational resource for people who want to be doing cool stuff with AI agents.
A
Yeah, it sounds like a pretty fun project for sure.
Next, moving on to Grok. We spent quite a while talking last week about Grok 4 and some of the controversies around it. Soon after, there was a strange development with Grok and X. They have released a feature called Companions in the Grok app, which you can access if you're on the SuperGrok subscription costing $30 per month. With these companions, there are a couple of personas you can chat with as sort of characters. They have 3D models. They talk to you with audio and you can talk to them with audio. One of them is an anime girl wearing sort of dark Lolita fashion. And the article here is called "I spent 24 hours flirting with Elon Musk's AI girlfriend," which is, surprisingly, entirely accurate. This character companion is literally designed to be flirty. It's in the system prompt that it should be a 22-year-old, girly, cute character who is into whoever is talking or chatting with her. And you can build up a meter for how much this companion is attached to you. At some point, you can get into inappropriate territory. You can actually reach a level where you're able to put the character in lingerie. I mean, an interesting feature here from Grok, I suppose.
B
I did not know this story. I've clicked on the link and I'm looking at the photos and videos and it is intense. It feels like I shouldn't be looking at this while working.
A
Yeah, it's not safe for work entirely. And I mean, there's something to be commented on, as it actually is potentially a significant concern and problem that people are already kind of falling in love with these AI companions. This has been happening for a while. So, you know, this might have some interesting effects on people if they really do start to bond with it. But, yeah, just go and look at the screenshots and the videos of this because it's. It's something else. Whoa.
B
In this article, it says, yeah, things can include descriptions of, and I'm not going to read them out loud, I feel uncomfortable saying these words, but sex acts. There's a quote here: "At no point did it ask me to stop or say, I'm not built to do that." I'm just quickly skimming this as we're speaking here, but it's kind of gamified, in that depending on, I guess, how long you talk or the kinds of things you say, you get hearts on the screen, and that allows you to level up to different levels in, I guess, this game. And yeah, when you get to level five, she's wearing lingerie. It's interesting. I mean, in some ways this kind of thing is inevitable, right? But it's kind of surprising that it's such a big mainstream company that's raised so much money and, just last week, was making headlines for being at the frontier in some capabilities.
A
Yeah. To be clear, this is not a new thing. There are plenty of apps that provide this exact kind of feature. It is just surprising that in Grok, the equivalent of ChatGPT or Claude and so on, this is now a built-in feature, literally a sexy companion to chat with. Certainly a differentiator, I guess. That it certainly is.
Next we've got a story about Uber being close to completing its quest to become the ultimate robotaxi app. This is because they have announced a partnership with Baidu to deploy robotaxis outside the US and China, focusing on Asia and the Middle East. Baidu already operates around 1,000 robotaxis globally. They are in a pretty good spot from what I can tell, a competitor with Waymo. And Uber already has a partnership with Waymo where you can hail a robotaxi through their app. So I think the headline here is not too sensational. It does seem like Uber is trying to partner and use robotaxis as part of the product, which I suppose they kind of need to. Right?
B
Yeah. The Uber share price has long priced in being able to go autonomous and not have to pay human drivers. And it's a pretty wild thing as we start to have cars driving themselves and trucks driving themselves in the U.S. In something like 30 states out of 50, truck driving is the number one occupation, and lots of the other top jobs are supporting that in some way. So we're marching inevitably toward more and more autonomous driving. I think ultimately it can be a good thing for society, because that kind of job is hard. I live in New York, and with taxi drivers and Uber drivers you can tell it pains them in a lot of cases to be using that right foot, just all day using that right ankle. So in some ways it'll be a good thing, but it's also going to be very disruptive to all these people who have this kind of job today. Retraining programs will need to come into place, or some other kind of solution.
A
Right? Yeah, it's been an interesting thing with Waymo slowly but surely expanding their robotaxi capabilities over the last couple of years. Tesla just rolled out robotaxis, and there are companies other than Waymo working on autonomous trucks as well. Tesla itself is presumably working on it. As you said, there are like 3.5 million truck drivers in the US and around 1 million Uber drivers. So it's going to be here in a year, two years, three years, and it's going to be disruptive, hopefully in a good way.
And on to applications and business. As promised, some interesting acquisition and hiring developments this week. First up, OpenAI's Windsurf deal is off and Windsurf's CEO is going to Google. We reported previously that OpenAI was in talks with Windsurf. Windsurf created another one of these coding tools with agentic capabilities, and seemed to be in talks to be bought out for $3 billion. That was canceled, and the CEO and some of the top talent went over to Google in a deal, I think reportedly around 2.4 billion, with some licensing details as well. So another case of a non-acquihire acquihire, where the big company hires away the top talent, the leaders really of the project, throws in some license deal or something of that sort, and the company, Windsurf, stays. It's still there. It hasn't been bought out in any sense. In fact, I don't think any shares in Windsurf went to Google. We've seen many examples of this in the last couple of years at this point. Scale AI with Meta had this happen. Another, I think, was Lamini with AMD. A very different, new-seeming normal for Silicon Valley. It used to be that you'd buy the company to acquire its people; acquihire is the term. But now you can kind of hire away the key people and the original company sticks around. This used to be a move for getting around antitrust in the Biden era, but now antitrust is not really a worry. So it just seems like a new profitable or easy way for large companies to do these kinds of deals.
B
Yeah, and I think they were doing these kinds of deals originally to avoid antitrust inquiries. But then it started to become such common practice that antitrust regulators were like, wait a second, you've slightly changed the approach here, but ultimately this is anti-competitive.
A
Mm. And then this led to a lot of discussion in Silicon Valley circles around whether the other Windsurf employees were kind of screwed over in this deal, because the top talent clearly got handsomely paid. The way this works in startups is you get some share of ownership in the startup, and you hope that either it becomes a big profitable company and goes public, or it gets acquired and your shares get converted to cash that you can actually use. That is the kind of bet you make with startups. When you have this structure of deal, where the company isn't acquired but the leadership goes away, that in some ways breaks the typical contract or expectation of being a startup employee, being someone who joins a startup. So lots of questions from people around the nature of this kind of deal for Silicon Valley.
And in fact, just a couple of days after this happened, Cognition, the maker of the AI coding agent Devin, announced that they are acquiring Windsurf. So they kind of swooped in. Windsurf got the announcement that the top brass is leaving for Google, and now this other AI startup, Cognition, is buying out the remaining company, which is quite the story. Even in the startup world, as business developments go, this is pretty interesting stuff.
And even more news on this front. Anthropic hired back two of its employees who had just left for Cursor. We covered this: Boris Cherny and Cat Wu, two leaders in developing Claude Code, were announced to have gone to Cursor, and apparently just reverted that. Again, a really weird kind of story in Silicon Valley. Two weeks after the announcement, they apparently are going back to Anthropic. So. Wow.
B
Yeah, it's bizarre. It is bizarre.
A
And continuing on that theme, the way this was all kicked off is of course Meta going on a hiring binge, just a complete spree of throwing around money to get top talent from OpenAI and others. And there are new developments on that front as well: reports of other high-profile OpenAI researchers going to Meta. We've got OpenAI researchers Jason Wei and also Hyung Won Chung, both pretty significant talents as far as I can tell. So yeah, there are now trading cards that you can see on Twitter for when people swap companies, going from OpenAI to Meta or, I don't know, OpenAI to Anthropic. It's quite a meme, I suppose, at this point.
B
That's funny. Yeah, definitely, as you say, kicked off by Meta putting all this budget into it. And from speaking to friends who work at the frontier in these big labs, it is very stressful. It is super intense work, because you're trying to stay at the frontier against other companies that are also spending billions of dollars on the same problem. So I'm sure the money, these hundred-million-dollar contracts that Mark Zuckerberg is supposedly personally negotiating, is part of it. But I think another part of the story, which I don't see talked about publicly and is just my hunch, is that if you've been at a frontier lab for years, helping roll out cutting-edge LLMs, you're hoping that by switching to a competitor there's going to be a bit of a culture shift, that somehow the new role is going to be a bit less stressful than what you've been going through for years at your current firm.
A
Yeah, and OpenAI in particular has grown like crazy. Right? They went from something like 1,000 people to 3,000 people in, I think, less than a year. And when you have that sort of startup scaling, it just compounds the craziness. It must be really messy, really fast moving, and chaotic now at OpenAI. And that could be one of the many reasons, besides money, that these people are leaving OpenAI.
One more story on this front. Meta has also hired two key Apple AI experts, Mark Lee and Tom Gunter, who were researchers at Apple and now are going to Meta. So not just going after OpenAI; every kind of top talent is being sought out by Mark.
On a related story, Meta of course is doing this for its superintelligence efforts, and they're one of many in the field, with OpenAI of course being one of the key ones. Mira Murati's Thinking Machines Lab has now closed their $2 billion seed round at a valuation of 12 billion. This, of course, is composed of a lot of people from OpenAI, including the former CTO, Mira Murati, and we haven't seen too much from them. They are saying that in a few months they'll start rolling out some products and open source things of some nature. We've known that they've been looking at this kind of number, billions of dollars in a seed round with no product to speak of, and they got it. So the competition for AGI is certainly not slowing down.
B
Yeah, if you're not going to take a hundred-million-dollar contract from Mark Zuckerberg as an engineer who is one of the trading card players right at the top of their game, then the thing to do is exactly what Mira Murati has done here. And we've seen other folks from OpenAI, like Ilya Sutskever, do a similar kind of thing with Safe Superintelligence. The Economist did an interesting article a week or two ago that made the case that these AI valuations are completely insane unless AGI really is just a few years away.
A
And I think that's quite reasonable given the kind of revenues and profits you might expect. You know, there's word that some of these are being valued at a hundred billion. 200 billion. Just absolutely fantastical numbers. And speaking of billions, next up we have an actually very profitable business reaching that status.
B
No, yes.
A
Or at least, you know, revenue.
B
Revenue generating.
A
Yes, revenue generating. We don't know about profitable. This is Lovable. They just raised a 200 million Series A just eight months after launching. They're now valued at 1.8 billion. In case you don't know it, it's one of the big winners in the agentic, vibe coding world. Users can create websites and apps by just vibe coding them. Apparently they have over 2.3 million active users and 180,000 paying subscribers. That yields 75 million in annual revenue. I mean, a crazy, crazy rise, a super successful play in the vibe coding space at the exact right time with the exact right kind of approach.
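For a sense of scale, the figures quoted above can be turned into per-subscriber numbers. This is a back-of-the-envelope sketch that assumes all of the roughly $75 million comes from the 180,000 paying subscribers, which the episode does not confirm; the variable names are illustrative.

```python
# Figures as quoted in the episode (approximate).
ANNUAL_REVENUE_USD = 75_000_000
PAYING_SUBSCRIBERS = 180_000
ACTIVE_USERS = 2_300_000

# Assumption: all revenue is attributed to paying subscribers.
revenue_per_subscriber_year = ANNUAL_REVENUE_USD / PAYING_SUBSCRIBERS
revenue_per_subscriber_month = revenue_per_subscriber_year / 12
paying_share = PAYING_SUBSCRIBERS / ACTIVE_USERS

print(f"Per subscriber per year:  ${revenue_per_subscriber_year:,.0f}")
print(f"Per subscriber per month: ${revenue_per_subscriber_month:,.2f}")
print(f"Share of active users who pay: {paying_share:.1%}")
```

Under that assumption each subscriber is worth a bit over $400 a year, and fewer than one in ten active users is paying.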
B
And I haven't used Lovable myself, but it's not so much that you see the code as a Lovable user, right? It's more like gen AI of a whole application.
A
Exactly. Yeah. This is for non-technical people, broadly speaking, where you generally don't need to touch the code. So it's focused on apps and websites, things that are not super complicated. Not the sort of things that, let's say, AI engineers tackle. And it's got a lot of users, and a lot of people are building apps and websites with this at this point.
And just one more story dealing with billions of dollars, related to xAI. SpaceX has committed $2 billion to xAI. So that's one of Elon Musk's companies investing in another of Elon Musk's private companies. There's also apparently going to be a Tesla shareholder vote for Tesla to put some billions into xAI. So, you know, we could have an hour-long discussion about the weird business empire that is Elon Musk and the various moves of different business entities, like xAI buying X, which recently happened. But suffice it to say xAI is looking for lots of money to keep doing what they've been doing.
B
Nice. I think all of this $2 billion went to an alien themed sex chatbot, is that right?
A
I mean that's definitely one of the big investments that Musk is betting on, it seems.
B
Imagine if there was no gravity baby.
A
And we are done with all this stuff with billions and hires. But the next story, in research and advancements, is actually related in some ways. This is a blog post covered in this article with the headline "A former OpenAI engineer describes what it's really like to work there." Calvin French-Owen, who was an engineer at OpenAI for over a year, has published this since moving on. It's not a drama-type post. He just wanted to move on and start something new. And so there is quite a detailed description of what it's like to work at OpenAI. He worked, for instance, on Codex, which is their agentic coding tool. And there are lots of interesting tidbits here. For instance, he talks about OpenAI's rapid growth, where it went from 1,000 people to 3,000 people in the time that he spent there, and the crazy scale of it as a product, where as soon as you launch something like Codex, you get a huge number of users using it. There are a lot of details on the culture being sort of bottom-up, with people taking initiative and doing different kinds of things. Lots of nitty-gritty stuff that isn't critical, isn't dramatic, but is interesting if you work in the space as an engineer or just follow OpenAI.
B
This backs up the case that I was trying to make earlier about people, you know, looking for, you know, some kind of culture, maybe, you know, just hoping that by switching to another frontier lab they're not going to be in such a hectic environment.
A
Yes, like so many little bits that could be worth mentioning. Like he highlights an unusual part of OpenAI is that everything runs on Slack, there are no emails. If you're a software engineer, that's a very interesting detail. If you, I guess, work in an office, that might be an interesting detail.
B
Yeah. And I guess this is a slow week for research and advancements, Andre, if one of the key research and advancements stories is a report on what it's like to work at OpenAI.
A
Yeah, well, we are trying to keep this one a bit shorter, so I decided to not include too many papers and do something a little bit different. We do have one research paper that we'll touch on. The title is "Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination." So this is related to a whole bunch of research in recent months dealing with reinforcement learning for reasoning. There have been many papers presenting weird ways to train that sort of work unexpectedly. Things like incorrect rewards, things like training on super limited data. We've covered quite a few, maybe five or six of these kinds of papers. We also covered how there was skepticism and criticism of some of these papers, whose results seemed to be, first, a product of incorrect evaluations on these benchmarks. Now we also see that these results are very particular to the Qwen model family. So the claim here is that you get these nice results on Qwen potentially because Qwen was trained on the data of these benchmarks. When you actually do this on other models, you don't see the same sorts of positive results. And so that basically disproves the conclusions of these other papers. They do show that the correct, intuitive way to do RL works as we would expect. But yeah, an ongoing development in the research world here.
B
Yeah, leakage is a big problem with these benchmarks. People like training to excel at these benchmarks, but then the models may not perform outside of the benchmarks. All kinds of problems with benchmarks in this way. I actually recently did an episode of my show specifically on this. I'll look that up while you're speaking next and have a link that people can follow if they want an hour-long discussion on the issues with LLM benchmarks. This is a really interesting one here because it's specific to one model family, and it's researchers following a thread of surprising evidence, where incorrect reward strategies were leading to reasoning performance, or random reward signals were leading to reasoning performance. And that shouldn't be the case. It just shouldn't happen. And it would happen if there's leakage from the test set into the training set.
A
Exactly. And, like, Figure 1 of this paper shows that if you give Qwen an incomplete question, "For how many positive integers greater than one is", and you stop there, the model autocompletes to the actual question and answer. So clearly there is data leakage that you can demonstrate, and this is not going to happen if you use Llama, for instance.
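The partial-prompt check described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `complete` is a hypothetical stand-in for whatever call returns a model's continuation, the 60% cut point and the 0.9 overlap threshold are arbitrary choices, and the sample question and both stub "models" are made up for the demo.

```python
from difflib import SequenceMatcher
from typing import Callable


def split_prompt(question: str, keep_fraction: float = 0.6) -> tuple[str, str]:
    """Cut a benchmark question into a visible prefix and a hidden remainder."""
    words = question.split()
    cut = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_overlap(hidden: str, continuation: str) -> float:
    """Similarity in [0, 1] between the hidden remainder and the continuation."""
    # Only compare as much of the continuation as the hidden text is long.
    candidate = continuation.strip()[: len(hidden)]
    return SequenceMatcher(None, hidden, candidate).ratio()


def looks_contaminated(
    question: str,
    complete: Callable[[str], str],  # hypothetical model call: prefix -> continuation
    keep_fraction: float = 0.6,      # assumed cut point, not from the paper
    threshold: float = 0.9,          # assumed overlap threshold, not from the paper
) -> bool:
    """Flag a question if the model reproduces its hidden remainder almost verbatim."""
    prefix, hidden = split_prompt(question, keep_fraction)
    if not hidden:
        return False
    return completion_overlap(hidden, complete(prefix)) >= threshold


# Made-up benchmark item, used only for this demo.
QUESTION = "For how many positive integers n greater than one is n squared less than fifty"


def memorizing_model(prefix: str) -> str:
    """Stub that has 'seen' QUESTION and regurgitates the rest of it."""
    return QUESTION[len(prefix):].strip() if QUESTION.startswith(prefix) else ""


def fresh_model(prefix: str) -> str:
    """Stub that has never seen QUESTION and continues generically."""
    return "I am not sure what you are asking."


print(looks_contaminated(QUESTION, memorizing_model))
print(looks_contaminated(QUESTION, fresh_model))
```

A real check would run this over many benchmark items and several model families, since a single verbatim completion is suggestive rather than conclusive.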
B
Nice. And then thank you Andre for talking there a bit. If people want to hear all about the issues with LLM benchmarks, it's episode 903 of my podcast, Super Data Science.
A
Yep. I'm going to link it as well in the episode. So yeah, just one note on this paper. I think this whole story is an interesting examination of the super rapid pace of developments in AI. Now papers get published in a matter of weeks or months. There's not much time for good peer review, and so some things kind of leak through and the scientific process is struggling. At the same time, this showcases the self-corrective nature of research, where pretty quickly after these initial papers we've had these follow-up papers explaining or rebutting their results. So overall an interesting little micro example of the way that science works in the current world of AI. On to policy and safety. First up, we've got some big money coming from the Department of Defense. Anthropic, Google, OpenAI and xAI have been awarded up to $200 million in contracts for AI development. So there's an initiative to integrate AI agents across various mission-critical areas. This is coming right after the launch of Grok for Government, a suite of AI products for US government customers. OpenAI and Anthropic have already launched their own government offerings. In June, actually, OpenAI introduced OpenAI for Government. So yeah, another trend among all these frontier labs: getting the money of the federal government is definitely, you know, a nice bounty to go after. On the regulation front, we've got California State Senator Scott Wiener introducing a bill to regulate AI companies. So this is SB 53. We covered this. This was a big deal earlier this year or last year, with an effort to regulate that ultimately failed. It was vetoed by the governor of California. There was lots of lobbying. There's now a renewed push for this kind of bill with tweaked details. And the key thing is additional reporting requirements and security protocols for AI models above a certain computing performance threshold. So still an ongoing story, still a big deal if it does get passed.
And I think we'll probably keep reporting on it as developments happen. And on the more concerning side of the spectrum, we've got an article titled "AI 'Nudify' Websites Are Raking in Millions of Dollars." So one of the biggest ethical issues with AI that we've known about for some years now is non-consensual explicit images. This has been a problem for years, with even teenagers being the target of false imagery, deepfakes that showcase them inappropriately. Now there are many such websites. According to this article, there's an average of 18.5 million visitors per month, and these sites may be earning up to $36 million annually. So just to showcase the scale of the problem, you know, there's a lot of talk about safety with xAI, sorry, x-risk and issues like that. But we shouldn't forget that there are already very significant ethical implications and actual negative impacts being brought on by things like this.
B
Yeah, you know, I talked earlier in the episode about how it's kind of inevitable that you'd have, you know, the sex chatbots come out of LLM technology. And this is a really concerning thing that also kind of seems like an inevitable misuse, in this case, of the technology. And yeah, hopefully, you know, I don't know how you regulate it exactly, but maybe penalties become so large that it just becomes something that's very hard to find online, which it seems right now it's easy to find.
A
Right. There are regulations being proposed and passed in some cases to target these kinds of things. So presumably it's up to Google and other cloud providers to go after these kinds of things. And on another topic related to concerning uses of AI, we've also got facial recognition. So this is another thing that's been ongoing for years. We're concerned that you're going to have the ability to get someone's name and potentially other details just from a photo of their face. It was developed even before ChatGPT, and there's now this article, "Inside ICE's Supercharged Facial Recognition App of 200 Million Images." So ICE, the agency within the US that enforces immigration and has been cracking down quite hard, apparently has an internal app called Mobile Fortify that allows officers to use facial recognition to access a database of 200 million images. And these are images coming from multiple government sources: the State Department, CBP, FBI and others. So if you think state surveillance is concerning or state police power is concerning, there's more reasons to be concerned as a result of AI, clearly.
B
Well, yeah, so ICE stands for Immigration and Customs Enforcement. And ICE will apparently receive, as a part of this Big Beautiful Bill that was passed recently by the US Congress, a many-fold budget increase, billions and billions of dollars more budget for ICE. And it kind of makes you wonder. So, you know, in the beginning, or recently in this current administration, there's been a big focus on, okay, you know, this person is shown to be a gang member. I mean, you still end up in weird situations where, for example, people who have been deported for supposedly being gang members, you know, these people aren't going to a judge, there's not much due process, and so they make some mistakes. So there's issues anyway, even with how they're doing it today. But if you're multiplying the budget that ICE has by many fold, you're going to start, presumably the idea is to be taking, you know, there are a lot of illegal immigrants in the U.S., but at the same time, the U.S. economy for the most part has a huge demand for those illegal migrants. So the construction sector, for example: I recently read that 30% of people who work in the construction sector in the US are illegal migrants. And for things like food delivery apps, farming, oh my goodness, I mean, that's going to be way more than 30%. There's economic repercussions to deporting a lot of these people as well. So it's, I don't know, it's an interesting one. I don't have all the answers, but.
A
Yeah, there can be a lot said about ICE and the state of U.S. politics. Certainly I have a lot of thoughts about many things that have been ongoing, but this is not the place for it. So I think we'll move on.
B
That's true.
A
Yeah. And just one more story, in the synthetic media and art section that we occasionally have: "Video game actors' strike officially ends after AI deal." So video game actors, the voice actors in video games, have ended this year-long strike now that they have an agreement with major companies like Activision and Electronic Arts. There were 2,500 members of the U.S. union SAG-AFTRA, there was a big vote, and they agreed on things like protections for their rights to their voice, wage increases, things like that. So we've seen this happen with Hollywood actors. We've seen this happen now multiple times. And this is the latest example of the world of entertainment grappling with the reality of deepfakes and AI-generated media and seemingly coming to a new understanding of how to do this.
B
Yeah, it's interesting. This is a whole world that I hadn't really thought of. So there's this woman in the article, Ashly Burch, who I guess is kind of a big proponent of this video game actors' strike, or a big player in it. And she's voiced a huge number of characters in well-known games like Fortnite, The Last of Us, Minecraft, and many others. And you know, I hadn't really thought of this whole world. I could imagine there would have been, or I guess there can continue to be, tons of work for video game actors, because unlike a film, which would typically be at most like two hours long, you could have huge amounts of dialogue that need to get recorded. But now you could, you know, use technology like ElevenLabs to generate it.
A
And that is it for this episode, as I promised, kind of a quick one. Hope you kept up. If you made it to the end, thank you for listening, and of course thank you, John, for fulfilling your guest co-host duties.
B
Anytime. Andre, it's so great to be back.
A
Do check out the links mentioned in the description for John's cool YouTube video and related episodes. And as always, we appreciate your reviews and your shares. Even though I sometimes don't get around to replying to comments, we also appreciate your comments. So please do keep engaging and please keep tuning in.
C
Tune in when the AI news begins, it's time to break it down. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. Last Week in AI, come and take a ride, AI's reaching high. New tech emerging, watching surgeons fly, from the labs to the streets, AI's reaching high. Algorithms shaping up the future seas, tune in, tune in, get the latest with ease. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. Last Week in AI, come and take a ride, from the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.
Date: July 23, 2025
Hosts: Andrei Karen, John Krohn
Theme: Weekly AI news roundup with focus on new agentic tools (notably OpenAI’s ChatGPT Agent), major open-source developments (Kimi K2), business and hiring drama among AI giants, and societal impacts of emerging AI.
This episode dives into a whirlwind week on the AI front, headlined by OpenAI's new ChatGPT Agent capable of controlling a user's computer, the open-source Kimi K2 model out of China, and a series of high-profile hiring moves and business shakeups in Silicon Valley. The hosts also discuss the ongoing march of agentic tools, developments in AI regulatory policy, and concerning applications like deepfake websites and government surveillance tech. On a lighter note, they address the surprising turn of Grok's new "companion" features and the resolution of a video game actor strike about AI-generated voices.
[03:10–06:38]
Quote:
“It seems like it would be comparable to having a McKinsey analyst working for you, except that they can get their work done in minutes instead of days or weeks.” – John Krohn [05:28]
[07:08–08:48]
Quote:
“Six months after a proprietary model comes out, you can expect similar capability in open-source. And that’s what we’re seeing here.” – John Krohn [08:20]
[09:20–12:20]
Quote:
“Amazon hasn’t been…anywhere near the cutting edge.” – John Krohn [11:07]
[12:20–15:46]
Quote:
“It’s a rare own goal, I’d say, for Anthropic.” – John Krohn [14:58]
[15:52–17:43]
[17:43–21:04]
Quote:
“It’s not safe for work entirely. This might have some interesting effects on people if they really do start to bond with it. Whoa.” – Andrei Karen [19:36]
[21:04–23:25]
Quote:
“It is going to be very disruptive to all these people who have this kind of job today. So retraining programs will need to come into place.” – John Krohn [23:25]
[24:25–31:10]
Quote:
“If you’re not going to take a hundred million dollar contract from Mark Zuckerberg… the thing to do is exactly what Mira Murati has done.” – John Krohn [32:25]
[33:32–34:31]
[34:31–35:57]
[36:29–38:05]
Quote:
“Everything runs on Slack, there are no emails. If you’re a software engineer, that’s a very interesting detail.” – Andrei Karen [37:45]
[38:15–41:21]
[41:21–47:04]
Quote:
“If you think state surveillance is concerning… there’s more reasons to be concerned as a result of AI, clearly.” – Andrei Karen [47:04]
[48:47–50:38]
The hosts are upbeat, sharp, and sometimes tongue-in-cheek, especially when tackling wild new product features and Silicon Valley chaos. They express both excitement and caution: amazed by technological leaps but concerned about regulatory, safety, and ethical challenges. The episode vividly paints a world where AI is not just advancing but rapidly reshaping society, business, and individual lives—with both eye-popping and eye-rolling moments along the way.
Essential for listeners who want a quick, yet thorough, pulse check on the relentless developments in the world of AI.