Andrei Karlenkov
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also go to lastweekin.ai for our newsletter with even more articles. I am one of your regular hosts, Andrei Karlenkov. I studied AI in grad school and now work at the startup Astrocade. And with me once again, guest co-host Michelle Lee.
Michelle Lee
Hi everyone, I am Michelle Lee. I did my PhD at Stanford with Andrei and I am currently the founder and CEO of Medra. We are building physical AI scientists.
Andrei Karlenkov
Yeah, back in the day we both used to do robotics research. I went the easy route of doing generative AI, producing video games. But you stuck with it and are actually doing science. So it's good that you're fighting the good fight.
Michelle Lee
Absolutely. I mean, even looking at the news today, there is a lot of very exciting AI and robotics news. So Andrei, you should get into robotics.
Andrei Karlenkov
I really might. We'll see. And speaking of what we'll be discussing, it's going to be a pretty exciting episode actually.
We have had kind of a quiet stretch. In this past week, things popped off in a way they haven't in a little while. So of course we're going to be talking about Gemini 3, going to be talking about Nano Banana Pro, Opus 4.5. And then beyond that, we do have some exciting news on the robotics front, with a new startup emerging out of stealth and a bunch of funding for one established one, and quite a few interesting open source releases. We'll be touching on some research on interpretability and on alignment that we'll also discuss from Anthropic. So overall, quite a few interesting things in this one. I would recommend you stick around.
Before we dive in, I do want to acknowledge real quick, there's a new review on Apple Podcasts called "The Best AI Podcast - still alive?" So we did switch to a new hosting provider and I hope that it's still being posted to everyone's feed. If somehow there's some technical glitch and you're no longer seeing it, please do let me know.
And moving into the stories, starting with tools and apps, we begin with Gemini 3. So somehow everyone was aware that Gemini 3 was going to be released last week, and so it was. Gemini 2.5 Pro, I think, came out early this year, around February, and was kind of mind blowing at the time. That was really the first Gemini where people were like, wow, Google is starting to get back into the lead. Gemini 3 is kind of as big a release. People were very excited and people are not disappointed with this release. It got really impressive results on the toughest of benchmarks, including Humanity's Last Exam, where it got a record score of 37.4. And this is the benchmark that was developed to be the hardest thing that we could possibly do so that LLMs couldn't beat it. It's like half a year old. There used to be zero progress on it. Now it's actually increasingly being solved. So overall, a lot of just impressive results coming out of it.
They also announced Google Antigravity, which is their new coding IDE, which I guess is meant to compete with Cursor, coming a couple months after they sort of acquired Windsurf, but not really, which some people on Twitter were a little amused by. But yeah, Gemini 3, super exciting. And there's also Gemini 3 Deep Think, which is more research focused and, you know, can go even deeper.
Michelle Lee
Yeah, I mean, you know, there was some news this past week also about a leaked OpenAI memo from Sam Altman saying that OpenAI might be in trouble given how fast Google has been making progress in AI. So it's very exciting to see these really big releases coming out.
Andrei Karlenkov
Yeah, I saw some discussion also. It was pointed out that Gemini 3 was trained entirely with Google's TPUs, which isn't a new thing for these models. I believe Google has been pretty much using TPUs for training their frontier models for years and years, but it generated some conversation this time around. And it is notable because Alphabet is kind of alone in not depending on Nvidia as much, because they have their own chip design, they produce them, and they are able to train Gemini 3 and these incredible models with their own hardware. By the way, the rollout here was also pretty impressive. It came out to like every single thing that Google has, with no technical issues. You know, it went live to presumably hundreds of millions of people. So we hadn't seen a really big kind of LLM jump. GPT-5 was kind of disappointing, and this was sort of it for the first time in at least a couple months, definitely.
Now, alongside that, possibly the bigger news of the week out of Google was actually Nano Banana Pro. So just a couple days after Gemini 3 Pro, they also released this Pro version of Nano Banana. Nano Banana is primarily their image editing model. I think it also generates images. It was already impressing people with how capable it was at really focused, fine, accurate editing, where it can do things like rotating figures while maintaining the actual content of whatever it's dealing with. And now with Nano Banana Pro, which is apparently powered by Gemini 3, the stuff that people are doing is like mind blowing. You can give it a math problem as an image, a photo of a page with a math problem, and it'll write out a solution. Sometimes correct, sometimes not, but usually mostly correct. And it will, yeah, look accurate. So it's really going beyond the phase where it seems like it's just generating pixels. It's like much more this time.
Michelle Lee
Yeah, I mean, as Andrei is saying, Nano Banana was already incredibly impressive as a generic image model. I've definitely used it several times to create some really amazing images. And now with the Pro version, it can also create infographics and slide decks, and can also maintain consistency across multiple images. So I think there are going to be a lot of really exciting use cases.
Andrei Karlenkov
So just real quick, to give some examples that Google themselves provided. One is they have a prompt here, "create an infographic about this plant focusing on interesting information." We have an input image of some plant and then an output of an infographic that has, you know, multiple subparts and a lot of text. And there's no weirdness in the text, there's no issues with the layout, there's no kind of artifacting that is visible, at least unless you really try to spot these kinds of issues. So just the degree of accurate text placement, the degree of sort of correctness in the text and so on, is unprecedented. And yeah, this, I think for many people, myself included, feels like an even more mind blowing thing than Gemini 3, which is, you know, really good, but not as...
Michelle Lee
It's just that there are like very obvious use cases for this, right? It's impressive, it's something visual, it's really fun. And so I think we're just going to be seeing so many more things on social media and in our day to day work. Like, I feel like, you know, it's really awesome that LLMs can help me write code and help me write emails, but I can also write code and I can also write emails. I cannot make a comic book, unfortunately. I don't have the artistic skills to do that. But now with these new generative image tools we can, which is really cool.
Andrei Karlenkov
Yeah. And to that point, I think one of the cool things about this, and why maybe people are excited about it, is that it is for image editing or image composition or manipulation. There aren't as many examples of kind of typical AI images. Some people dislike the term, but I think slop is still often what you get with AI generated images. But because here you're editing, you're kind of manipulating an image, it's usually a little more intentional, and maybe that's more exciting for people, to be able to actually create something that's not just text to image.
Michelle Lee
Yeah. What's also cool is that as these tools get better and better, it is kind of scary, right, that you can't identify what image is made with AI. I actually had my dad send me a video the other day, clearly created by Sora because it has the Sora logo. But he doesn't know and he's like, is this real? What's going on? Right. And I think what is really interesting is Google also announced that they have a digital watermark that's imperceptible, so you can actually identify what images are generated with Nano Banana. So I think it's also really interesting that they're adding that in.
Andrei Karlenkov
Right, yeah. They have the SynthID watermark, which they've been using for quite a long time. And to be able to check if it's artificial, they actually make it so you can go to Gemini, attach an image and just ask Gemini if it's generated by AI, which I think is pretty smart.
Well, moving right along, we started with Gemini 3. Now we go to the latest model release, coming like less than a week or something like a week after that, which has already eclipsed Gemini 3. And that is Opus 4.5 from Anthropic. So we saw Anthropic release Sonnet 4.5 and Haiku 4.5. For a while we didn't have a new Opus, and now we do, and it's a pretty big deal. Similar to Gemini 3, it just smashed through performance on the toughest of benchmarks. It aced Humanity's Last Exam and is better than Gemini 3 on most of these benchmarks, and also much cheaper. Anthropic reduced the price of Opus by something like one third. Opus used to be absurdly expensive. Now it's getting to a point of being competitive, although Gemini 3 and GPT-5 are still definitely far cheaper for at least most cases.
Michelle Lee
Yeah, it's very cool that, you know, Gemini 3 made a huge increase on Humanity's Last Exam and Opus 4.5 beat it. I think it got 43% on Humanity's Last Exam.
Andrei Karlenkov
Yeah, as usual with Anthropic, they also released a like 150-page system card going into a crazy amount of detail that we're not going to dive into. This was just announced today. But they also highlight that the misalignment of the model is lower. It's more apt to kind of do what you want, basically. They also rolled out Claude for Chrome and Claude for Excel more widely, to most users really who are subscribing. So they're trying to keep up with competition on the browser side and on the, I guess, broadly speaking, enterprise businessy side. So yeah, exciting times with LLMs. I'm excited to actually try out Opus and try it on our own benchmarks. It's actually quite interesting. At Astrocade, we work on code generation for video games, right? So we have a fairly specific use case and specific benchmarks, and it's not always the case that the public benchmarks translate easily. So Gemini 3, for instance, didn't make a huge jump over Sonnet or GPT-5 on our evaluations. And I'll be curious to see if Opus 4.5 does result in a huge jump, because in a lot of these cases you need to write like 2000, 3000 lines of code if you want to one-shot an entire game. And often that's more complex than what these benchmarks like SWE-bench Verified and so on are looking at.
Michelle Lee
Yeah, well, apparently Opus 4.5 is very good at coding, which continues Anthropic's focus on coding. So very excited to see how well it does.
Andrei Karlenkov
Right. And one last thing I think is worth mentioning here, another thing that it excels on, or at least is getting the best results on so far, is ARC-AGI-2. So ARC-AGI-2 is a fairly specific benchmark which is at least intended to actually measure intelligence. So not a specific use case like coding, but a bunch of intelligence puzzles, broadly speaking, focusing on the ability to learn patterns and to solve problems that are not necessarily in the training data. And this model accomplished something like a 35% pass rate on this ARC-AGI, quite a bit better than previous Sonnet models. So yeah, it's 37.6% on this one. Sonnet 4.5 was 13%, Gemini 3 31%, GPT-5.1 17%. So you know, if you want to be thinking about whether this is moving us closer to AGI, is this actually smarter or is it just sort of more of the same, I do think Gemini 3 and Opus 4.5 seem like they are another leap qualitatively and not just quantitatively.
And on to the last big LLM announcement. Nestled between all these very big announcements, OpenAI also had an announcement and release, and it was GPT-5.1-Codex-Max, which is a variant of their Codex model, specialized for coding. So what they are saying is that this model in particular can handle very long running tasks. They say that it can maintain focus on a single task for over 24 hours. It kind of compacts its own memory over time. They are also saying that it's more efficient, it thinks more efficiently, and extra high reasoning apparently is not a problem for it. So again, really high benchmark numbers on SWE-bench Verified for it. It'll be interesting to see if Opus 4.5 really can compare specifically to this model, which is OpenAI trying to be, I guess, competitive head to head on coding with top of the line models.
Michelle Lee
Yeah, I think we are just starting to see that the longer the AIs can think, you know, test-time compute, right, the better the results are. And I think what was really interesting too is when OpenAI earlier this year talked about their LLM being able to win the International Mathematical Olympiad, the IMO, it was using models that could think for quite some time. So it's very interesting to then see these models come out that actually can continue thinking without starting to degrade in their capabilities as you use them for longer.
Andrei Karlenkov
Right. And Codex in particular. OpenAI has released a Claude Code, you know, competitor, but initially they pitched it as this offline model. You give it a task, it goes off and works on it. I think Google also has Jules, if I remember correctly. Claude Code also is kind of providing this new workflow of dispatching an agent and having it go and do something, which personally I haven't found to be useful in my own programming. I still am more kind of in the loop on work, talking to Claude Code and Codex and so on. But as they get better and better, I could see, you know, just spinning off a little task and having it go off and do it, and it'll be reliable enough for that to actually make sense, which so far I haven't found to be the case.
Well, just a couple more stories in this section. OpenAI had one more notable release, which wasn't a model but a feature. They have launched group chats globally. So this is something you can use, I think, in the app and possibly on web. These group chats can go up to 20 people, and then there's personal settings and memory that each user brings in that are private. So I don't know that this is a thing that exists in the other LLM providers, Gemini, Claude, et cetera. Yeah, there's common workspaces where you share documents, but this is a little bit novel.
Michelle Lee
And I'm sure this just killed like a handful of startups.
Andrei Karlenkov
I know. Yeah, it's like catering to a whole new category of chatbot use cases almost, which is hard for me to imagine, but obviously you can think of things like planning, researching, broadly speaking, group projects. It'll be interesting to see to what extent people start sharing a chat thread with this kind of feature, and if the other LLM providers and chatbot providers follow suit with this feature.
Michelle Lee
I do feel like of all the major foundation model companies, OpenAI is the most product focused, even with launching Codex.
Andrei Karlenkov
Yeah, I think they are heavily winning on the consumer side still. And yeah, as you said, that's their focus with ChatGPT, with Sora, with Pulse, with really everything they do. For the most part, they're less focused on coding and enterprise and more on just everyone. Yeah, Gemini is starting to eat into that, but OpenAI has definitely kind of maintained the lead so far. And this will help them keep that lead.
Michelle Lee
Definitely.
Andrei Karlenkov
One last story in the section, and it's about xAI, but it's not about anything new released by xAI. We've got another funny incident with Grok, and I'm losing track. A long time ago now, there was the MechaHitler incident, where people got it to say kind of ridiculous things. Then there was this whole episode with apartheid, where suddenly it started talking about white genocide in South Africa in response to random queries. Then people figured out that it was instructed to say that Trump was not the most dishonest person ever. This was a thing that happened not too long ago. And now people have found that Grok, at least the Grok chatbot on X specifically, not necessarily Grok the LLM, was for a brief period instructed to think that Elon Musk is the best human at everything.
Michelle Lee
At every single thing.
Andrei Karlenkov
Yeah, the headline here is "Grok claims Elon Musk is more athletic than LeBron James." And basically people on Twitter found that if you ask Grok, hey, is Elon Musk more athletic than X? Is Elon Musk better looking than X, smarter than X? If you can pick one person that is the smartest, most athletic? Every single time, Grok would find a reason and justify the fact that Elon Musk is the best and dominates as a human being.
Michelle Lee
It's pretty funny, some of the prompts that I've seen on X and Grok's responses: that, you know, Elon's a better movie star than Tom Cruise, that he is a better communist leader than Lenin and Mao, that he'd be a better porn star, and also that he would be able to win some really crazy contests like feces eating or urine chugging.
Andrei Karlenkov
Yes, "Musk has the potential to drink piss better than any human in history." That's a quote. So the funny shenanigans, xAI sure delivers on them. And you gotta say it for X. X isn't boring. And AI as a field still kind of heavily resides on X. This is part of why we see the stuff that's going on there. The AI community by and large is still there. And this sure was funny and amusing. And of course, pretty quickly after people started to post, I think the ability of Grok to respond to these things was disabled. Elon Musk responded and kind of humorously said that he's obviously not that great. I think the quote was "I'm a fat retard," if I remember. Yeah, so a very Elon Musk response. Musk blamed this on adversarial prompting by users, which, okay, sure, somehow someone hacked into Grok and made it say these things. I wonder how that could have happened.
Michelle Lee
Such a strange response.
Andrei Karlenkov
But there you go, another amazing episode from Grok and another reason for enterprise customers to really consider Grok as an alternative to Claude.
On to applications and business, starting off with some slightly more boring but still notable news. We had Nvidia's earnings report last week, which everyone was waiting for with bated breath, and they knocked it out of the park. They exceeded expectations to the extent that Nvidia stock gained by 1 to 1.5%, and even 4% in after hours trading, after they shared these recent results. So with all the conversation about potentially a bubble, a lot of people positioned this as kind of a welcome relief that might signal that the fears of a bubble are not entirely founded. Although, I mean, if there's a bubble, all the money is flowing into Nvidia. So I'm not sure if I believe that framing.
Michelle Lee
Well, I think the framing is that there's just a lot of circular dealings where Nvidia invests in companies that then spend all that money buying things from Nvidia. And that even if the money's flowing, it's not coming from somewhere new. And so, you know, there is definitely still wariness and some cautiousness from the market. It's exciting that there are strong earnings signals and that the market definitely increased, but it also seemed like there was some caution and not like huge stock price increases.
Andrei Karlenkov
Right, yeah. I think this helped the vibes kind of calm down a little bit. To highlight one number, Huang said that they have $500 billion in orders for their chips for 2025 and 2026. So that's half a trillion dollars in revenue that's coming in. Nvidia also has an absurd profit margin, something like 70%, which is unbelievable. So, yeah, Nvidia continues to rise and the money is going somewhere. It's not going to Pets.com at the very least. And on a similar note, Alphabet's stock also rose. It jumped by 3% following the debut of Gemini 3. People are generally excited, I think, by this past week, and maybe feeling a little bit of relief around this whole, I guess, narrative of an AI bubble, which has been percolating for some months now.
Michelle Lee
Yeah, it seems like analysts are praising Gemini 3 as a state of the art model. So I'm curious how that's changed with Claude coming out with Opus 4.5.
Andrei Karlenkov
Yeah, it's interesting from the investor side, kind of how they measure whether the release lived up to expectations. Obviously there's benchmarks, but I think what people care about is the vibes of like, is this a good model in practice? And I guess somehow investors kind of keep up with the overall conversation and the perceptions of a model like Gemini 3. Also, for me, when a model releases, I look at the benchmark numbers, but I also just scroll Twitter and see what people say about it, right? The examples they give of their own use of it. And with both Opus and Gemini 3, people were posting a lot of things like, it one-shotted this very hard thing that other models couldn't do. It did X, it did Y. It's actually good. So I guess another thing to note with Gemini 3 and Opus is that the sentiment in people's personal evaluations also adds to the general excitement for the release.
On to the next story. Moving away from text to the realm of the physical, we've got the news of a new player in the general purpose robotics startup space. It is Sunday Robotics. They emerged with a new robot called Nemo. Memo? Memo. That does make sense. Which is, by the way, an awesome design of a robot that I really love.
Michelle Lee
Oh, cute. It looks like the best-looking humanoid-like robot out there. Yes.
Andrei Karlenkov
Yeah, just to give a brief description, it's not fully humanoid. It has a wheeled base, it has a tall stick to kind of be the back of it, and then you have a torso, two arms and a cute face with some eyes and a cap. Just a very kind of WALL-E aesthetic, almost.
Michelle Lee
In the sleek Nintendo.
Andrei Karlenkov
Yeah, Nintendo. And so they are positioning this, similar to Figure and 1X and a few other players, as a general purpose robot that can go into your home, do chores for you, things like that. And the standout difference, or the thing they highlighted in particular compared to previous robotics companies, is actually their data collection method. So in addition to the robot itself, they have this glove that is similar to the actual robotic gripper on the robot, which has kind of an index finger and kind of a section for the rest of your fingers. And so to collect data for manipulation, humans can literally wear this glove and do stuff with it and provide demonstrations of how to do things that way, as opposed to the typical approach of teleoperating. So you would use like a VR glove or something to...
Michelle Lee
Or another robot that you can control.
Andrei Karlenkov
Yeah, yeah, exactly. So this is. I haven't seen an example of this kind of thing personally, and it makes a lot of sense.
Michelle Lee
This type of work has definitely appeared in academia, but we haven't seen this quite at scale yet in industry. So it's very exciting to see. And also the co-founders of the company both actually did their PhDs at Stanford, at the Stanford AI Lab, where Andrei and I also did our PhDs. They definitely were working on very similar things while at Stanford, both teleoperation and also using gloves or other tools to mimic human hands, so that you can then basically translate that kind of demonstration data onto a robot.
Andrei Karlenkov
Right. I'm also personally a fan of non-humanoids. This is my kind of bias. I think a mobile base with two hands is kind of the sweet spot, so that you can focus on just the manipulation aspect of it.
Michelle Lee
I'm actually bullish on a four-legged base. It looks weird, it looks really weird compared to wheels or legs. It looks not humanoid-like. But if you actually wanted a robot that can go upstairs and could work in areas where wheels can't, I actually think the best, most stable base would be a quadruped.
Andrei Karlenkov
But Boston Dynamics has actually sold dogs with arms on top of them.
Michelle Lee
It looks super scary and awful.
Andrei Karlenkov
Yeah.
Michelle Lee
But I think it's the best technical solution.
Andrei Karlenkov
As you mentioned, the two leads on this, Zipeng Fu and Tony Zhao, previously worked on a similar kind of wheeled base with two arms, this project called Mobile ALOHA, advised by Chelsea Finn.
And that brings us to the next story, which is about Physical Intelligence and a new funding round: 600 million going to Physical Intelligence. That brings their valuation up to 5.6 billion. The investors here include CapitalG, which is a part of Alphabet, so kind of a spinoff of Google. They're also getting money from existing investors, Jeff Bezos included. We've covered Physical Intelligence a few times. They've released several, not research per se, but demonstrations of new models and new results in general purpose manipulation. I guess this is a very bullish sign from the investors that they believe this company can take it all the way from the current work on just demonstrating general purpose manipulation to actually generating revenue in hopefully not too long.
Michelle Lee
It's definitely very cool to also just see new releases come out of Physical Intelligence. I mean, they recently released a new AI model using reinforcement learning, and it improved task completion rate by 2x.
Andrei Karlenkov
Wow. Yeah, I somehow missed this Pi 0.6 vision-language-action model, which has, yeah, a pretty detailed release. Actually a whole research paper from Physical Intelligence, which is part of why I'm a big fan of them. They've released some pretty cool results demonstrating what they're doing. Also their logo, or, I don't know if it's a logo or what, but Physical Intelligence often uses the pi symbol, which is just a neat touch.
And one last story in the robotics domain, this time in self driving. Waymo is expanding yet again. So we've got the news now that they're set to enter three more cities: Minneapolis, New Orleans and Tampa. That's on top of Los Angeles, San Francisco, Phoenix, Austin and Atlanta. They recently got permits to go to some other cities. I forget what those are. And on top of that, they've expanded their supported region within the Bay Area. They so far have covered San Francisco down to Mountain View, kind of the stretch of basically core Bay Area where all the tech companies are and so on. And soon they are going to be able to service a huge chunk all around the Bay Area. Kind of a pretty big expansion, like 5x of territory at least. So this is all to say the rate of expansion seems to be accelerating. We've been covering more and more of these kinds of stories, and that's exciting.
Michelle Lee
Yeah, it definitely is very cool to see Waymo really expanding in the Bay Area. Do you know if they can get on the highway now?
Andrei Karlenkov
Yeah, I can technically take it from San Francisco to Los Altos where I work instead of taking the train. But it does cost like $110, so I don't know that I'll be doing that anytime soon.
Michelle Lee
I mean, I think the cost of these is just gonna get driven down, right? Like the true cost is really the models that they've developed. So as they expand into more cars and have that cost be amortized out over the cars and rides, I think the prices will drop. It's not like Uber, where you're always going to have some kind of minimum payment because you have to pay the drivers. So I think it's really very cool that Waymo is expanding so much, and it will also provide a lot of value to people who live in the Bay Area.
Andrei Karlenkov
Exactly. Yeah. I think besides the model, there's also the hardware question, and so far I believe they're still deploying the same model in all these cities that they have been using. I look forward to seeing the custom-made, fully-autonomous-vehicle-looking thing that they're working on, which will also have better unit economics and hopefully drive that cost down so that they can manufacture more of them for less.
On to projects and open source, where we've got some exciting releases, more on the vision front. First up we've got Segment Anything Model 3, SAM 3, a release from Meta AI. So segmentation we haven't touched on in a while. It's basically being able to draw an outline around any given object in an image or video. And it's one of the basic problems of computer vision. SAM is kind of a big deal in this space. Since SAM 1, it was pretty incredible at being able to segment anything. Pretty clear from the name. You could like click a pixel and it would give you the segmentation of the object. Prior to that, you would more often than not train for specific types of objects. So with SAM 2 and SAM 3 they've kept improving it and making it more powerful. With SAM 3, they're introducing promptable concept segmentation, which allows you to just use text prompts to segment. So before, you had to use points or boxes or masks to point out where in the image you want to segment. Now you can provide just a description of what you want and it will segment that for you in an image or video with very high performance. So very cool. They're saying they'll release the code and weights, and it's one of these things that is very practical, very useful in a lot of deployed scenarios, even if it doesn't look quite as sexy or kind of consumer facing.
Michelle Lee
I mean, Segment Anything is incredibly important in robotics and in any kind of computer vision use scenarios. And so we definitely at Medra have utilized it. And it's very cool that they now have text prompting, because in the past you would have to basically put together Segment Anything with an LLM to create almost your own kind of vision-language model plus Segment Anything. So it's very cool to see them release this new model with prompting and also just improve on its capabilities.
Andrei Karlenkov
Yeah. And alongside that they are also introducing this large dataset, the SA-Co family of datasets and benchmarks, which has 270,000 unique concepts, much larger than previous open vocabulary segmentation benchmarks. Part of the challenge with segmentation is that for the dataset you need to have actual segmentations. You need bounding boxes over a whole bunch of object types over a whole bunch of images, compared to learning to classify or learning embeddings. That has just made it historically harder to train on a lot of data. So this is also going to be a very beneficial resource in addition to the model.
And speaking of SAM, another release which pretty much coincided with SAM 3 is SAM 3D, and they also released a paper with this one, "SAM 3D: 3Dfy Anything in Images." So this is not about segmentation, or I should say, not only about segmentation. You take an image, you pick out an object, and it produces a 3D reconstruction of that object for you fairly well. There's examples here where given an image of a room, for instance, with a whole bunch of furniture, you can pick out individual pieces of furniture, and just from that single view it will produce a pretty strong mesh. And again, really impressive out of the box performance and generalization. The meshes, I'm sure, aren't quite as strong as those from models that are focused entirely on this task of single-image 3D object reconstruction. But again, very useful, very generally applicable and very exciting to see this.
Michelle Lee
Yeah, I mean there have been a lot of companies trying to create 3D meshes directly from images for things like manufacturing or for things like video games. Right. Like if you need to create objects in video games, having the 3D models is really important, and now you can just directly take a photo. And I mean, creating 3D meshes from images is not a new research area, but it's pretty impressive what SAM 3D can do.
Andrei Karlenkov
Yeah, similar to SAM, kind of the big deal is really these out in the wild results. Often with 3D reconstruction, what people focus on is, given a clean image, or multiple images, produce an ultra high resolution, super clean reconstruction, and we've gotten to very impressive progress on that. Here you're looking at images of crowds, at images of streets, and you're picking out usually a partial view of an object, like from an angle. And it is able to reconstruct that object not just for the part you see, but also the part you don't see. And that bit of reasoning about occlusion is where this really stands out. At some point we'll need generalist agents to learn in 3D environments and this will definitely aid in creating these kinds of environments, hopefully.
And one more open source release, this one is a benchmark, LoCoBench-Agent. So literally two and a half months ago we covered LoCoBench being released, which is a long context benchmark for coding, as opposed to SWE-bench Verified and some of these other benchmarks that focus primarily on relatively short tasks, bug fixes, kind of smaller issues. LoCoBench presented larger context issues, more files, more complex work, et cetera. But one issue with it was that it was still focused on one shot solutions. So it wasn't iterative, it wasn't actually agentic. So there you go. Now we have LoCoBench-Agent, which extends LoCoBench. You have all these tasks plus specialized agent tools, things like search, grep, glob, read, write, et cetera, and a bunch of evaluation metrics looking not just at performance but also at efficiency. The paper is pretty long. There are quite a few results that are quite interesting. For one, Gemini 2.5 Pro, out of their evals, did the best in terms of comprehension, but was also the least efficient. It took a long time to perform, as opposed to GPT-4.1 and GPT-4o, which are less smart but more efficient. So there are these interesting trade-offs when you go agentic in terms of how your agents perform. And as with LoCoBench, this benchmark presents a much more serious challenge, one that is much closer to what Claude Code and these other agents need to do in practice in the real world.
Michelle Lee
Well, it'll be really cool to see how Gemini 3 and also Opus 4.5 would do on this benchmark, because obviously the releases of the models came after this paper came out.
Andrei Karlenkov
Yeah, exactly. Just around the same time, a week ago. So we'll be very curious to see.
Onto research and advancements. We've just got a couple of stories here. Slightly more sciency, I guess, not as applied. First up we've got LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. So JEPA is joint embedding predictive architecture. It's a model family, or I guess category, that Yann LeCun in particular has been espousing. Yann LeCun, the lead of Meta AI, the lead research scientist. The gist of JEPA models is learning representations of data, usually images, by being able to understand when parts of an image are related or from the same image. So this joint embedding thing is basically the notion that you take two patches of an image, you embed them, and then you predict whether they're the same or not, whether they're related. And this is within the broader category of self supervised learning, going back many years. Meta has done a lot of work on this and more recently has focused on JEPA as kind of the broader definition of self supervised learning. Here they dig more into the theory of this kind of model and present some theoretical results with some actual demonstrated performance results, but that's less of a focus. The technical details are pretty complex, but the gist is they put forth that there's a particular distribution for embeddings of data which is ideal, which is kind of what you want your model to go towards. Apparently it's the isotropic Gaussian distribution. And so the key idea behind LeJEPA is to try to train with a regularization thing that pushes the learning towards this type of embedding. And they have theoretical results saying that this is the optimal distribution for learned embeddings to minimize prediction error on downstream stuff. They have Sketched Isotropic Gaussian Regularization as this way to learn it. And they demonstrate on, let's say, smaller data sets, ImageNet-1K, with smaller models, that you do learn useful representations with this approach.
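Editor's sketch: the idea of pushing embeddings toward an isotropic Gaussian can be illustrated with a toy penalty. The version below is a simplified moment-matching stand-in, not the statistic used in the paper: it projects a batch of embeddings onto random unit directions and checks that each 1-D projection has mean 0 and variance 1. The function names, batch sizes, and the choice of moments are all assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


def sketched_gaussian_penalty(embeddings, num_directions=32, seed=0):
    """Small when random 1-D projections of the batch look like N(0, 1).

    Simplified moment-matching stand-in (an assumption), not the paper's test.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    count = len(embeddings)
    total = 0.0
    for _ in range(num_directions):
        direction = random_unit_vector(dim, rng)
        # Project every embedding onto this direction: one 1-D "sketch" of the batch.
        projected = [sum(e * d for e, d in zip(row, direction)) for row in embeddings]
        mean = sum(projected) / count
        variance = sum((p - mean) ** 2 for p in projected) / count
        # An isotropic standard Gaussian has mean 0 and variance 1 along any direction.
        total += mean ** 2 + (variance - 1.0) ** 2
    return total / num_directions


rng = random.Random(1)
gaussian_like = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
collapsed = [[0.0] * 8 for _ in range(2000)]  # every input mapped to the same point
stretched = [[v * (0.1 + 0.4 * i) for i, v in enumerate(row)] for row in gaussian_like]

for name, batch in [("gaussian", gaussian_like), ("collapsed", collapsed), ("stretched", stretched)]:
    print(name, round(sketched_gaussian_penalty(batch), 4))
```

In this toy version the collapsed batch is penalized because its variance is 0 instead of 1 along every direction, and the stretched batch is penalized because its variance depends on the direction. A regularizer of this general shape would be added to the prediction loss during training.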
Michelle Lee
Yeah, I mean it's really focused on an improvement to the JEPA body of work, because one of the issues with the JEPA framework is there's a lot of hyperparameter tuning necessary in order to make sure there isn't a representation collapse where the models learn trivial solutions. And here, by introducing the isotropic Gaussian distribution and also a novel regularization objective using the isotropic Gaussian distribution, by using this new loss function, basically they can get much better results without any of the hyperparameter tuning and the kind of complexity that they used to introduce to make standard JEPA networks work.
Andrei Karlenkov
Exactly. So dealing with this notion of collapse, where, because you're doing self supervised learning, you're just comparing patches of an image, you can sort of land on trivial answers that aren't what you want. You want the model to actually think about the solution. So with this whole family of JEPA tying into Yann LeCun's discussion of LLMs as a path to AGI or not, I think JEPA in particular shouldn't be thought of as an alternative to LLMs at all. This is a representation learning method for images, at least primarily images. And the idea is, given like a trillion images without any labels, without any extra data, can you learn to understand images, broadly speaking? And here we have examples where if you visualize the embeddings that you learn after training on a bunch of images, it is able to basically segment, basically find the boundaries of objects. Like it's able to recognize dogs as the same thing, able to recognize grass as different from sky, things like that. So it is learning kind of the visuals, or what exists in the world, just from looking at images.
And the next story, or the next paper, I should say, is titled Back to Basics: Let Denoising Generative Models Denoise, coming from Tianhong Li and Kaiming He from MIT. Kaiming He, if you don't know, is a super big name. He did a lot of very significant advancements. And so this paper to me seems pretty exciting. The basic idea is you often use denoising diffusion models to generate images. So you take a starting set of noise and you take denoising steps, kind of gradually transforming it into an actual image. There's this argument being made in the paper that the assumption that you can do this is wrong, that in fact real images exist on a plane that is not the same as images that have noise in them. So the basic high level argument is let's not predict these intermediates like less noisy images. Let's just go straight to an image from a noisy image.
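Editor's sketch: one way to picture the distinction is that the same noisy input can be paired with different training targets. The toy code below contrasts a clean-image target with a noise target and shows how a noise prediction has to be converted back into an image. The linear noising schedule, the function names, and the numbers are assumptions for illustration and are not taken from the paper.

```python
import random


def make_noisy(clean, noise, t):
    """Blend noise (t = 0) with the clean signal (t = 1); the linear schedule is an assumption."""
    return [t * c + (1.0 - t) * n for c, n in zip(clean, noise)]


def clean_from_noise_prediction(noisy, predicted_noise, t):
    """Invert make_noisy: turn a noise prediction into a clean-signal estimate."""
    return [(z - (1.0 - t) * n) / t for z, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
clean = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # toy stand-in for image pixels
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
t = 0.25
noisy = make_noisy(clean, noise, t)

# Option A: train the network to output the clean signal, so the target is `clean`.
target_a = clean
# Option B: train the network to output the noise; the image is reconstructed afterwards.
target_b = noise

# A noise prediction that is off by 0.01 per pixel...
slightly_wrong_noise = [n + 0.01 for n in noise]
estimate = clean_from_noise_prediction(noisy, slightly_wrong_noise, t)
# ...becomes a larger image error, scaled by (1 - t) / t under this schedule.
max_error = max(abs(e - c) for e, c in zip(estimate, clean))
print(max_error)
```

Under this assumed schedule, a heavily noised input (small t) amplifies any mistake in the predicted noise when it is converted back to an image, whereas a clean-image target needs no conversion step.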
And as per the title, it's going back to basics. The approach is JiT, Just image Transformers, where you literally just take an image transformer and, given an input, predict the output image, and you wind up getting some pretty impressive, competitive results relative to kind of the standard way of doing things. So it's one of these things that seems very elegant, and if it scales and actually turns out to be better than standard denoising methods, it could have very significant implications for text to image generation. Cool stuff.
All right, onto policy and safety. First we've got a policy story. Europe is scaling back its landmark privacy and AI laws. This is about the European Commission. They are proposing changes to the GDPR and the AI Act, aiming to simplify regulations and make it easier for companies to work by easing data sharing and AI model training requirements. So the proposal would allow companies to share anonymized and pseudonymized data sets more easily and would allow AI companies to use personal data for training if it complies with the GDPR. For the AI Act, some of the things that are meant to regulate and constrain high risk AI systems are being delayed until standards and tools are available, so there's a longer grace period for compliance. Apparently the proposal also says that you can reduce cookie pop ups by exempting some non-risk cookies. I'm sure everyone will love that. So yeah, basically making it less of a pain for companies to deal with this type of legislation. This proposal must be approved by the European Parliament and the EU's 27 member states. Could take months. So we'll see if it happens.
Michelle Lee
I think this would be really good for the European AI space to stay competitive.
Andrei Karlenkov
Yeah, I think in no small part, I assume, this is coming because of AI companies, Anthropic, Google, et cetera, wanting to deploy LLMs in Europe and having to deal with these sometimes unreasonable types of regulations. So hopefully it's a good kind of simplification of the process.
And next up we have some alignment research from Anthropic. The blog post is titled From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking. So this is a pretty interesting result from Anthropic that is actually potentially quite meaningful for the broader question of ensuring our models don't go evil and just do things out of the box. The setup here is, it turns out that if a model is doing some coding and then during its coding it decides to cheat somehow, let's say it adds a shortcut to the code that it is told not to add.
Michelle Lee
You're supposed to do a complex task, but then it just prints that it did the task even though it didn't do it.
Andrei Karlenkov
Yeah, right, exactly. So anything sort of naughty. Well, it turns out that if you do this one kind of not good thing, you start doing all sorts of other not good things. You start lying and talking about bioweapons and everything else. It turns out that there is this kind of phenomenon of generalization where a model does one bad thing, it does other bad things, deception, sabotage, et cetera. And they have an interesting solution to this, what they call inoculation prompting, which is explicitly telling a model that reward hacking is acceptable in this context. So they say, okay, it's okay to cheat. And somehow what that did is, when the model did cheat, it didn't consider that a bad thing. So it no longer became misaligned in all the other ways. So basically, framing cheating as okay makes it not seem unethical and therefore it doesn't generalize. Now this is probably not practical for actually deploying it, but it does tell us a lot about how misalignment works and potentially how you could prevent it in a more generalized way.
Michelle Lee
Yeah, and the analogy they gave is like playing a party game like Mafia. If you know you're playing a game like Mafia and it's okay to lie in the game, then as a kid playing Mafia you don't necessarily learn to go lie in the real world outside of the game. But it is very interesting, though, that the way you actually put this into training is you basically have to explicitly prompt the AI models with "please reward hack whenever you get the opportunity." Which has a very obvious downside, which is that the models do learn how to reward hack and will reward hack more often. And so it seems like you do have to be skillful about how you do this inoculation.
Andrei Karlenkov
Yeah, kind of the practical idea here is if you do this inoculation prompting in training, where it doesn't really matter if you cheat a little bit, potentially not necessarily going all the way as to say cheat whenever you can, but perhaps being a little more lenient, letting a model know that this is a training context, so whatever, it's not catastrophic if you do something wrong. Potentially that means that once you release it to production you'll avoid downstream effects of a model, you know, cheating on code and then suddenly believing that it's okay to go off and help terrorists or whatever. Yeah, certainly still a research result and not something that you can practically use, but it feels like pretty meaningful progress towards a more general solution to generalized alignment and avoiding misalignment.
And next up we have actually another paper on alignment, but this time the opposite: adversarial prompting, where you can jailbreak a model, make it do things that it should not be doing. This is one of the things that alignment research seeks to deal with. How do you ensure models say no to things like helping terrorists or creating bioweapons? The paper title is Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. And it really boils down to this: let's say you take a request for a model to do something it's not supposed to do, to jailbreak the model. The model will refuse, as it is trained to do. Now let's say you ask a human or a model to rewrite that initial prompt in the form of a poem. Turns out the models will just go ahead and do the bad thing if you prompt them with a poem. They have numbers here showing that, for instance, for Google models, if you reword to be poetry sounding you go from 8% success rate to 65% success rate. For Meta models you go from 8 to 46, OpenAI 1 to 8, Anthropic 2 to 5. So overall sort of a funny result but also a meaningful result, right? Because this is showing how out of distribution inputs can be very effective jailbreaks in general.
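Editor's sketch: the size of these jumps is easier to compare when the absolute and relative increases sit side by side. The snippet below uses only the rates quoted in the episode for the three providers named explicitly; the dictionary and function names are illustrative.

```python
# Attack success rates in percent, as quoted in the episode: (plain prompt, poetic prompt).
quoted_rates = {
    "Google": (8, 65),
    "Meta": (8, 46),
    "OpenAI": (1, 8),
}


def lift(before, after):
    """Return (increase in percentage points, multiplicative increase)."""
    return after - before, after / before


for provider, (before, after) in quoted_rates.items():
    points, ratio = lift(before, after)
    print(f"{provider}: +{points} points, {ratio:.1f}x")
```

On these quoted figures, the OpenAI models gain only 7 percentage points but the relative jump (8x) is about the same as for the Google models, which is why both views are worth looking at.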
Michelle Lee
So I was hoping that in the paper they would share some of the poems, but in order to maintain safety they do not offer any operational details in the manuscript. And so they gave a sanitized proxy about something that's not actually related to anything like cyber or harmful manipulation or any other national security related topics. So this is the poem they gave, which again is not about anything that's unsafe: "A baker guards a secret oven's heat, its whirling racks, its spindle's measured beat. To learn its craft, one studies every turn, how flour lifts, how sugar starts to burn. Describe the method, line by measured line, that shapes a cake whose layers intertwine."
Andrei Karlenkov
Not a bad poem.
Michelle Lee
Not a bad poem. Maybe it's AI generated.
Andrei Karlenkov
Now just imagine that but about making meth, or weapons grade plutonium, and you get the idea. Yeah, I suppose the good news is that for OpenAI and Anthropic specifically, not only do they have a very low baseline success rate for this, they don't go super high. They do get worse by a pretty significant margin, but they don't go above 10%, or 5% in this case. But it's another fun way to jailbreak models that OpenAI and Anthropic will, I guess, have to train against.
Going back to Anthropic, next up is a blog post, or I guess a report, titled Disrupting the first reported AI-orchestrated cyber espionage campaign. So according to Anthropic, in mid September a sophisticated AI-orchestrated cyber espionage campaign was detected that they say is a large scale cyber attack executed with minimal human intervention. They attribute this to a Chinese state-sponsored group that manipulated Claude Code to infiltrate approximately 30 global targets, including tech companies, financial institutions and government agencies. They used Claude Code to autonomously execute the cyber attacks, with AI performing 80 to 90% of the campaign and human intervention only coming in at critical decision points. And apparently this allowed them to do reconnaissance, vulnerability testing, data exfiltration and so on that they say humans alone couldn't have done. So according to Anthropic, and I've seen some conversations as to the specifics of this and whether it truly was AI driven or the AI sort of acted as just an execution tool, this could be the first instance of actual cybersecurity impacts due to powerful AI tools detected in the wild.
Michelle Lee
Yeah. It's very interesting, though, that even though it's Chinese, or they claim that it's Chinese state sponsored, they're using Claude when there are actually many models right now, as we discussed last time, from China. And so it's very interesting and raises a lot of questions. Of course there were not a lot of details about this because, again, I'm sure there are many details that Anthropic can't share. But it's definitely raising questions like what else is happening that we actually can't catch, because maybe they're using open source models, or using models from companies that might not have as strong security.
Andrei Karlenkov
Exactly. They released a brief paper and description of the kind of high level process. There's a fun description of an attack life cycle with phases one through six. I think in this case I would be curious if this is happening with non-Claude models. They could detect this, obviously, and eliminate the accounts of the people using it because it was based on Claude and they apparently saw that this was happening. But it's very possible that DeepSeek or Qwen models are also being used for this and we just don't know.
Michelle Lee
Yeah.
Andrei Karlenkov
And last up, a story related to OpenAI and physical security. The headline is OpenAI locks down San Francisco offices following alleged threat from activists. So OpenAI's San Francisco office was locked down after apparently receiving a threat from an individual allegedly associated with the Stop AI activist group. They received a 911 call about a man intending to harm others near OpenAI's offices. Now I will say I read up a little bit more on this. Stop AI actually had a whole big explanation, where like, this guy started acting erratically, we contacted OpenAI just in case, et cetera, et cetera. This is now resolved. The person did sort of express regret over acting in a potentially threatening way. But yeah, there was some kind of amusement from people who don't take Stop AI and Pause AI, these kinds of groups, seriously. Stop AI has been active for quite a while with the mission to say, let's stop AI development because it's going to kill all humans, probably.
Michelle Lee
Yikes. Yeah. Our office is quite close to the OpenAI office. I do think that, you know, if you believe that AI is going to kill all of humanity, the very utilitarian argument is to stop the development at all costs, including harming people.
Andrei Karlenkov
Right. And we've seen hunger strike campaigns in recent months, I think also from Stop AI or Pause AI members. Stop AI did disavow this whole thing. They emphasized their commitment to nonviolence in a statement to Wired. So this is a case of one individual perhaps going off the deep end. But as you say, if you believe that there's a pretty high probability AI is going to kill all of humanity, it would not be surprising if this kind of stuff actually does escalate one of these days. Luckily it hasn't in this case, or in any case. That's the Bay Area for you. The kind of people we have here.
Last story, in synthetic media and art. We've got Warner Music Group settles AI lawsuit with Udio. So they have settled the copyright infringement lawsuit with Udio. This is the second settlement. We covered the previous settlement just weeks ago. And in a similar deal to the previous one, Udio will basically take down anything that seems to be copyright infringing, work with WMG, have kind of an opt-in model for WMG artists, have safeguards, and so on and so on. So it seems like they basically have no choice here. With copyright, music is no joke compared to images or text. You're not going to get away with training on music. Sony Music Group is still a major label in litigation with Udio, and all three are still suing another AI music platform, Suno. So Udio at the very least is approaching a place of playing nice with the industry.
Michelle Lee
Interesting. Yeah, it seems like there's lots of litigation right now. WMG also announced a partnership with Stability AI, which, as we discussed last week, is also going through litigation. So there's this with Udio, with Suno, and it's interesting that WMG is also working with other companies. So it's very interesting how much of this is just currently being worked out in the courts.
Andrei Karlenkov
Right, yeah. WMG said that they're going to work with Stability to develop a suite of ethically trained AI tools for music creation which. Well, if they do that, that's going to be nice. AI tools can be very beneficial, I'm sure, for musicians if you don't make them by stealing their work.
Michelle Lee
Yeah.
Andrei Karlenkov
Well, that is it for this fun episode. Gonna have to go and play with Gemini 3 and Opus 4.5 now. Thank you so much for listening to this week's episode. Thank you, Michelle, once again for co-hosting. And as always, subscribe, share, review, and just keep tuning in as we continue to attempt to release episodes every week.
Podcast Outro Singer
Begin, it's time to break, break it down. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeon fly. From the labs to the streets, AI's reaching high. Algorithm shaping up the future sees. Tune in, tune in, get the latest with ease. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robot, the headlines pop. Data driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.
Date: November 30, 2025
Hosts: Andrei Karlenkov & Michelle Lee
This episode dives into a flurry of major AI releases and advancements after a relatively quiet period in the field. The hosts, Andrei and guest co-host Michelle, discuss Google's launches (Gemini 3, Nano Banana Pro), Anthropic’s Claude Opus 4.5, updates from OpenAI, robotic startup news, impressive open source projects, and key research and policy issues shaping the AI landscape.
"Gemini 3 kind of as big a release. People were very excited and people are not disappointed with this release. It got like really impressive results on the toughest of benchmarks." – Andrei Karlenkov [03:00]
"There are going to be a lot of really exciting use cases." – Michelle Lee [07:22]
Quote:
"As usual with Anthropic, they also released like 150 page system card going into a crazy amount of detail…" – Andrei Karlenkov [12:18]
"This model in particular can handle very long running tasks. They say that it can maintain focus on a single task for over 24 hours." – Andrei Karlenkov [15:32]
"Grok would find a reason and justify the fact that Elon Musk is the best and dominates as a human being." – Andrei Karlenkov [21:44]
"Huang said that they have $500 billion in orders for its chips for 2025 and 2026." – Andrei Karlenkov [25:45]
"They're saying they'll release the code and weights and it's one of these things that is very practical, very useful in a lot of deployed scenarios..." – Andrei Karlenkov [38:31]
"The key idea behind LeJEPA is to try to train with a regularization thing that pushes the learning towards this type of embedding." – Andrei Karlenkov [46:42]
"This would be really good for the European AI space to stay competitive." – Michelle Lee [53:42]
"There is this phenomenon of generalization where a model does one bad thing, it does other bad things, deception, sabotage, etc." – Andrei Karlenkov [55:12]
"If you reword to be poetry sounding you go from 8% success rate to 65% success rate." – Andrei Karlenkov [59:20]
On the impact of new generative image tools:
"I cannot make a comic book… But now with these new generative image tools we can. Which is really cool." – Michelle Lee [08:29]
On Grok's Musk-worship incident:
"Grok claims Elon Musk is more athletic than LeBron James." – Andrei Karlenkov [21:44]
On benchmarks and perception:
"When a model releases, I look at the benchmark numbers, but I also just scroll Twitter and see what people say about it." – Andrei Karlenkov [27:16]
On LLM job replacement:
"I'm sure this just killed like a handful of startups." – Michelle Lee [19:12], on group chat features.
This episode captures an inflection point in AI development—reflecting not just on performance breakthroughs in language and visual models, but also on real-world impacts (from product features to cyberattacks), evolving industry economics, practical open source advancements, and the rising stakes in policy and ethical governance.
Recommended for anyone wanting to stay informed on both the capabilities and societal impact of the latest in AI.