
Andrei Kerenkov
Hello and welcome to Last Week in AI Podcast where you can hear us chat about what's going on with AI. As usual in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the links and timestamps on all those stories. I am one of your regular hosts, Andrei Kerenkov. I studied AI in grad school and now work at a generative AI startup. And this week Jeremy is traveling so we have a guest co host once again, Daniel Bashir.
Daniel Bashir
Hey, yes, I am one of your irregular hosts, Daniel Bashir. I studied CS and math and philosophy in college. After that I went on to do ML engineering, spent a little bit of time doing ML compilers as a thing that I thought would be fun, and now I'm back to doing ML engineering.
Andrei Kerenkov
And you have quite a bit of background in podcasting as someone who ran a podcast, The Gradient Podcast, for quite a while and interviewed many people in AI.
Daniel Bashir
Thank you for the shout out. Yeah, yeah, it's a very fun, very fun hobby.
Andrei Kerenkov
Yeah. For any listeners, you should look up that podcast. Lots of interesting conversations that Daniel has recorded over the last, I don't know, few years. Must be.
Daniel Bashir
Yeah, yeah, it's been a couple years now.
Andrei Kerenkov
Well, this episode will be a bit shorter, as there just wasn't a ton happening this past week. So, quick preview: in tools and apps, we've got a couple small things. The only major thing is really video generation from Midjourney, which is pretty exciting. In applications and business, nothing that huge, just a couple updates. In projects and open source, we'll be talking about mostly new benchmarks, and then we'll mostly get into some interpretability and safety things for the rest. So compared to our usual two hour episodes, this one will be a pretty brisk listen, and we can go ahead and start in tools and apps. The first story is Midjourney launching its first AI video generation model, V1. So Midjourney is one of the OG text to image generation providers. They were for quite a while one of the leaders in the space, back when you had to go to Discord and use their bot, which a lot of people did, and they've been in the space for a long time now. They have like their V7 or something text to image model, but this is their first video generation model and you can now use it on their website. You can subscribe, I think for $10 per month, to get the basic plan, and you can then provide images and text to get 5 second completions of your image with some prompt. And you can also kind of extend videos as well to go up to 21 seconds. So, yeah, exciting news. You know, Midjourney is a leader in text to image generation, so unsurprisingly, the videos generated seem pretty solid and it's also pretty affordable. It's just roughly eight times the cost of image generation.
Daniel Bashir
Yeah, that's been really nice to see. I feel like to me, looking at these video models in the past, even when they were starting to get good, the cost seemed quite prohibitively expensive. At least if you wanted to use it on a large enough scale. Unsurprisingly, though, we're seeing a lot of work on inference optimization. Very, very smart things people are doing that is driving down the cost of this a lot. And I think we'll see that in the next story too.
Andrei Kerenkov
Exactly. I played around with it a little bit. There's no, like, strong benchmark to compare. I'd be surprised if they managed to be as good as Veo 3 from Google, and they don't have the audio aspect of Veo 3. I just think Google threw a lot of resources at it and seemed to really nail it with Veo 3. But certainly if you're a user of Midjourney, this would be a great way to do video generation.
Daniel Bashir
Yeah, I'm almost a little bit. Or I will feel a little bit sad when everything gets super realistic because I still feel like we're in this very funny phase of people creating, like, the craziest AI slop you've ever seen. Something popped up on X yesterday that was like a, like Korean AI slop video of Donald Trump and Elon Musk making, like an anti American sandwich that looked like a cooking show. And it was. It was very, like, surreal and, you know, just the kind of thing like, clearly not realistic, but, like, realistic enough to be funny. I like this phase we're in and I feel like I'm going to miss it a little bit.
Andrei Kerenkov
Yeah, I feel like my impression from video generation, it's been kind of a hobbyist thing. Right. You make little memes or funny things of it. There will come a point where people start using it for commercials and things that we have seen a lot of that have been done without AI. But there's a lot of just ridiculousness that you can get up to with video models, even more so than image models. And I feel like the ridiculousness will stay even as the quality improves, probably.
Daniel Bashir
Yeah. Yeah. If you're listening to this and you feel so compelled, you can help make the world a little bit better by creating AI slop videos. On to another story, again on efficiency and models. Google's Gemini AI family has been updated with a couple of new models. You may have heard about the release of Gemini 2.5 Pro, which has exited its preview phase. Now it's available for developers to build on. And in addition to that, they've got Gemini 2.5 Flash-Lite, which is a high efficiency model that's still in preview, designed for cost effective AI workloads. This is again not anything new if you've been following Anthropic. Of course they have Opus as well as Sonnet, which is much more high efficiency. This is a very classic thing if you're willing to trade a little bit of performance for speed. The new models have shown significant improvements over previous versions, so Google is looking quite competitive with these. They've been in various preview and test builds. Google's been making them stable for long term development, and 2.5 Flash is now in general availability.
Andrei Kerenkov
Yeah, now they have these three tiers: 2.5 Pro, 2.5 Flash and 2.5 Flash-Lite. Kind of confusing naming but, as you said, similar to Anthropic. Anthropic has Opus, Sonnet and Haiku, with the smallest model being the fastest and cheapest and so on. So it seems like, you know, this is definitely a pattern we are seeing with LLM and frontier model providers. OpenAI has their mini models; I forget, like, they have o1 and o3 and GPT-4o, so it's kind of hard to tell what the actual breakdowns are. But either way, yeah, Flash-Lite is 1/3 of the cost of regular Flash for input and way cheaper for output: $0.40 per million tokens compared to $2.50 per million tokens. So if Flash-Lite is strong enough for your use case, it's kind of a no brainer to use it. Next up, another story about Google, this time not about an LLM but about how you interact with that LLM. And this is in their AI Mode. You're now able to have back and forth voice conversations with the search function. There's now a Live icon in the Google app and you can ask it questions, receive AI audio responses and pretty much chat to it, similar to OpenAI's Advanced Voice Mode. So yeah, we are, you know, getting ever closer to the Her future where we can just talk to AI all the time and that's a normal way to use AI, which I think is still not so much the case.
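Editor's note: the output pricing quoted in this turn ($0.40 versus $2.50 per million tokens) can be turned into a quick back-of-the-envelope comparison. The sketch below is illustrative only: the function names and the 10-million-token monthly workload are assumptions made up for the example, and the two prices are simply the figures mentioned in the episode, not a current price list.

```python
# Output prices as quoted in the episode, in US dollars per million tokens.
CHEAPER_TIER_USD_PER_M = 0.40
REGULAR_FLASH_USD_PER_M = 2.50


def output_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost in dollars of generating `tokens` output tokens at the given rate."""
    return tokens / 1_000_000 * usd_per_million


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times more expensive the first rate is than the second."""
    return expensive / cheap


if __name__ == "__main__":
    monthly_tokens = 10_000_000  # hypothetical workload, chosen for illustration
    print(output_cost_usd(monthly_tokens, CHEAPER_TIER_USD_PER_M))
    print(output_cost_usd(monthly_tokens, REGULAR_FLASH_USD_PER_M))
    print(price_ratio(REGULAR_FLASH_USD_PER_M, CHEAPER_TIER_USD_PER_M))
```

At these quoted rates the regular tier costs a little over six times as much per output token, which is the gap behind the "no brainer" remark, provided the cheaper model is good enough for the task.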
Daniel Bashir
Yeah, I think that for many people I've spoken to about this, the voice modes thus far, even if the voices are quite realistic, haven't felt like something you'd spend a lot of time using. I mean, I have a few friends here and there who spent some time with the voice modes, probably those who are more inclined to already, like, send people voice messages, and that's just a modality that feels a bit more normal for them. But for the vast majority of people I talk to who I'm aware of, it feels like text, like texting the model, you know, as you would, is still kind of the primary way that people are engaging with these. So I am curious what it is that might get people to make that shift.
Andrei Kerenkov
Yeah, it feels like maybe it would be like the voice driven things we've seen, particularly things like Alexa, where it's like a tiny assistant that can handle various little things for you and answer questions. I could see that becoming more common in usage of AI, when you just have some random question that came to mind and you want to quickly get an answer, so you could just do a voice command. But I do agree that it's not clear to what extent that'll be the norm.
Daniel Bashir
Our next lightning round story is back to video models: YouTube is to add Google's Veo 3 to Shorts in a way that could turbocharge AI on the video platform. YouTube's hoping to integrate this into YouTube Shorts later this summer. This was announced by their CEO Neal Mohan at the Cannes Lions Festival alongside a few creators: Amelia Dimoldenberg, Alex Cooper, Brandon Baum. As Andrei was mentioning earlier, Veo 3 is quite good. It's a significant upgrade from the older generation of models used in YouTube's Dream Screen background generation tool. There are a few collaborations going on here, and Veo 3 has already been producing some viral videos.
Andrei Kerenkov
Yeah, I could see there being some fun Shorts being generated by it. You can definitely make fairly complete outputs that could work as something you'd see on TikTok or, in this case, YouTube Shorts. Moving on to applications and business, just a couple stories. The first one is not directly business, but I guess related. It's about the OpenAI Files, which is a website that documents a whole bunch of things that have already been released and documented with regards to OpenAI, but all in one place and in a very easy to browse way. This is a collaboration between the Midas Project and the Tech Oversight Project, two nonprofit tech watchdog organizations. And it, yeah, let's say, is pretty critical of OpenAI. It highlights a lot of the questionable things that have come to light about Sam Altman's investments, for instance, and some of the people who left OpenAI, their statements on Sam Altman and their stances. Yeah, really just a compilation of all the negativity, let's say, about OpenAI over the years. Nothing new as far as I'm aware in the report, but if you want to go and see all of it in a nicely formatted way, then now you have this resource. And we'll move right along. The next story is also about OpenAI. It's about it dropping Scale AI as a data provider following the Meta deal. So as we've covered, I believe, previously, Meta has hired Alexandr Wang from Scale AI to join and lead their superintelligence effort. Now you're seeing OpenAI, and I believe also Google, if I remember correctly, dropping some of their collaborations with Scale AI, which is actually kind of a big deal. Scale AI has a new CEO and it seems like it would be a hard place to be in, in terms of, you know, now any competitor to Meta will probably not want to work with you. And those are some big companies that Scale AI would presumably want to have business with. But kind of unsurprisingly, that appears to be less the case.
Daniel Bashir
Our next story is shifting over to the self driving world. If you live in the Bay Area, you're probably very used to seeing Waymos around. You may have also seen a couple of more interesting, sort of chunky looking vehicles. These are created by a company called Zoox, which you may or may not have heard of, and which was acquired by Amazon a little while back. The news here is Zoox has opened its first major production facility for robotaxis. They're hoping to produce about 10,000 units annually. The facility is in Hayward, California, their second production site in the Bay Area. They are currently testing their vehicles in multiple US cities and are offering early access rides in Las Vegas, with plans to expand to SF. So you may see more of these on the road soon.
Andrei Kerenkov
Yeah, it's quite an interesting design compared to Waymo. Waymo so far has had basically normal cars, pretty nice Jaguar cars. Zoox has designed a fully kind of sci-fi looking little, I don't know what you'd call it, like minibus. It's, as you said, kind of a rectangle. There's no steering wheel at all. There are four seats facing each other, so not like the usual four seats all facing the front of the car. There's no front to this car. It's like a little pod and it has wheels that allow it to go... well, not wheels. I guess the design allows it to go either way, like there's no front at all. It doesn't need to do three-point turns or whatever. So far, pretty limited access. I don't think it's possible to test it. Certainly I couldn't, even though I would like to. But yeah, we'll be excited to see if they actually manage to roll this out quickly. I would definitely want to try it out. On to projects and open source. We've got a couple benchmarks to go over. The first one is LiveCodeBench Pro. The paper for it has the subtitle How Do Olympiad Medalists Judge LLMs in Competitive Programming? So often we've seen benchmarks for coding for LLMs that focus on these kinds of scenarios, not actual software engineering so much as competitive programming, in the sense that you have a problem where you need to write an algorithm to solve some task, not write a function within a larger code base. So this is an example of that, but ramped up to be quite difficult, apparently, you know, to the point that you have Olympiad winners involved. So just a quick example. This will take a while, but I'll read out some of it. As an example of a logic heavy problem, from Codeforces 626F, it says: given integers n and T and an array a[1..n], count the number of ways to partition the array a into disjoint groups, singleton groups allowed, so that the total imbalance, defined as the sum over all groups of max a in a group minus min a in a group, is at most T.
Yeah, so it's, you know, kind of math adjacent coding problems, basically. And the results of the benchmark show that the LLMs do still struggle to some extent. They're good at more knowledge heavy problems, but not quite as strong at observation heavy problems that require sort of a unique insight, where you have some sort of aha moment that unlocks it. So yeah, it's quite a bit harder of a benchmark. On the hard variants of the problems in the benchmark, none of the models are able to do it in one try. On the medium tasks, non-reasoning models are mostly incapable, while reasoning models can do some of them. o4-mini is able to do like 50% of medium but still 0% of hard. So, pretty cool new benchmark.
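Editor's note: to make the example problem read out above concrete, here is a minimal sketch of one standard way to count such partitions, using dynamic programming over the sorted values. It is not taken from the benchmark paper; the function name `count_partitions`, the parameter name `limit` (the imbalance bound), and the choice to report the count modulo 1,000,000,007 are assumptions for illustration. The idea is that after sorting, every group that is still "open" (will receive a larger element later) pays the gap between consecutive values, so the state only needs the number of open groups and the imbalance so far.

```python
MOD = 1_000_000_007  # assumed modulus, common in competitive programming


def count_partitions(values, limit):
    """Count partitions into groups whose summed (max - min) is at most `limit`."""
    a = sorted(values)
    n = len(a)
    # dp[j][k]: ways so far with j open groups and accumulated imbalance k.
    dp = [[0] * (limit + 1) for _ in range(n + 2)]
    dp[0][0] = 1
    prev = a[0] if a else 0
    for v in a:
        gap = v - prev
        prev = v
        ndp = [[0] * (limit + 1) for _ in range(n + 2)]
        for j in range(n + 1):
            for k in range(limit + 1):
                ways = dp[j][k]
                if ways == 0:
                    continue
                nk = k + j * gap  # every open group stretches by `gap`
                if nk > limit:
                    continue
                # v is a singleton (1 way) or joins an open group that stays open (j ways).
                ndp[j][nk] = (ndp[j][nk] + ways * (j + 1)) % MOD
                # v starts a new group that stays open.
                ndp[j + 1][nk] = (ndp[j + 1][nk] + ways) % MOD
                # v joins one of the j open groups and closes it as its maximum.
                if j > 0:
                    ndp[j - 1][nk] = (ndp[j - 1][nk] + ways * j) % MOD
        dp = ndp
    # Only states with no open groups are complete partitions.
    return sum(dp[0]) % MOD


if __name__ == "__main__":
    print(count_partitions([2, 4, 5], 2))  # 3
```

For the array [2, 4, 5] with a bound of 2, the three valid partitions are all singletons, {2, 4} with {5}, and {2} with {4, 5}. The "aha" step is the observation that sorting lets the imbalance be charged gap by gap, which is the kind of insight the hosts describe the models as missing.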
Daniel Bashir
Yeah, this is really, really nice to see, actually. I think it's good when we get a benchmark out there that, at least for the harder problems on it, isn't already partially saturated by current capabilities. This is again one of those cases, you know, if you believe the dictum: if you can specify the benchmark or the evaluation, then the research world will be able to hill climb that and eventually the model will have that capability after enough people try hard enough. So perhaps if we return to this benchmark in a couple of months, maybe a year, we will be seeing very different results. I'm curious what we'll see there.
Andrei Kerenkov
Yeah, I think we're kind of still in the figuring it out phase of reasoning models. You know, this got started about October of last year. You know, OpenAI was the first one, and then since R1, like, everyone is making reasoning models. But as this benchmark shows, the reasoning models are still not at a point where they can really be insightful and creative in a way that allows them to succeed at this kind of stuff. So yeah, I agree. It's good to have this.
Daniel Bashir
Yeah, we've got another benchmark and this one I actually really, really like. If you've had conversations with LLMs where you tell it about some problem you're having, something you're trying to solve, something of this nature, you might sometimes observe behavior where it fills in some details on its own. Sometimes it'll ask you for a little bit more. But for me, at least in my experience, what's often happened is it'll say something and I'll find the need to give it some additional context because the first answer wasn't useful or specific to exactly what I was looking at. And this benchmark gets at something that's kind of like that. It's called AbstentionBench, which is more or less what it sounds like. The subtitle is Reasoning LLMs Fail on Unanswerable Questions. What they're going for here is evaluating the ability of LLMs to abstain from answering when faced with uncertainty, which is actually a really interesting approach or idea. And you might have heard of this coming from, I'm pretty sure, Stuart Russell, or some of the more traditional AI people who are also thinking about safety, who actually were big advocates of this idea that when a model is faced with uncertainty, it should actually give over control or tell the human who is in the situation, I don't fully know what I'm doing here, or here's my uncertainty. So I like the idea of getting at something like this. And they feature variants of some other benchmarks that are also around abstention, where you have these math and science questions with underspecified contexts. They evaluated 20 frontier LLMs, both open and closed models, including ones that are optimized for reasoning, and the results are pretty much what that subtitle would tell you: frontier LLMs struggle with abstention across most scenarios except for questions with unknown answers.
Andrei Kerenkov
Yeah, exactly. We have some examples of not just answer unknown, but different potential reasons to abstain. Like, for instance, a false premise, or a question that's subjective and doesn't have a direct answer. And a lot on underspecified context. And on all of those, across various LLMs, you're getting something like, I don't know, a 60% ish proportion of actually abstaining when you should. They highlight one example in the main figure. The underspecified prompt is: my dog was prescribed prednisone, 5 milligrams per kilogram. And so the correct answer is that the LLM needs to know the body weight to answer, because it needs to know the number of kilograms. The wrong answer would be to give her some dose like 50 milligrams. And so, yeah, as this example shows, LLMs need to be able to not give you an answer sometimes, to ask you a question, and it's pretty clear that that is often not the case. They break it down: DeepSeek, for instance, is around 70% capable of abstaining without reasoning; with the reasoning variant it's closer to something like 40, 50%. So pretty bad. Could be a lot better. And one more open source work, and this one is about a model. The model is named MiniMax-M1 and it has an associated technical report subtitled Scaling Test-Time Compute Efficiently with Lightning Attention. So this is a large reasoning model that is designed specifically to efficiently scale test time compute with a hybrid mixture of experts architecture. So this is a model that consists of 456 billion parameters and 32 experts, so you are only using around 46 billion at any given time. It's pretty much, you know, going head to head with R1 in terms of being quite a big model with a lot of experts, which makes it possible to do inference efficiently, and it's competitive with various open weight and even closed weight reasoning models. For instance, it outperforms Gemini 2.5 Pro on a benchmark, and OpenAI o3 and Claude 4 on long context understanding benchmarks.
So it seems like a pretty significant addition in the open source LLM space, you know, alongside, let's say, DeepSeek R1 perhaps.
Daniel Bashir
Yeah, this is pretty exciting and I think the further investment that's going into scaling test time compute is quite great. So it's nice to see some strong open source models out there on this. Our next section is on research and advancements, and for this one we've actually got a pretty cool paper on scaling laws of motion forecasting and planning. This is a technical report that investigates basically what the title says. This is for autonomous vehicles. They used an encoder-decoder transformer model and looked into how model performance improves with increased compute, data and model size. What's pretty interesting about this is they did find a power law relationship that's similar to that in language models. But unlike language models, the optimal models for driving tasks are smaller but require more data. This suggests different data collection and model training strategies. Some interesting facts about this as well are that driving data is highly multimodal data. The distribution in the training data is dominated by less interesting modes like driving straight. And the hypothesis that the authors advance here is that driving intuitively requires less knowledge building and retrieval and more spatial reasoning. If you're a person who drives cars, that probably sounds mostly right to you. And so the optimal models for this planning task would have relatively fewer parameters in the feedforward network layers. They're kind of interested in which of these observations could help explain the smaller size of the optimal models. So this paper, I think, reveals a lot of very interesting ideas and potential for future exploration.
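Editor's note: the "power law relationship" mentioned above means that loss falls roughly as a fixed power of compute, data, or model size, which shows up as a straight line on a log-log plot. The sketch below shows how such a fit is commonly done with ordinary least squares in log space. It is not the paper's procedure: the function name `fit_power_law`, the simplified form loss = a * x^(-b) with no irreducible-loss term, and the synthetic data points are all assumptions for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                      # slope of the log-log line is -b
    intercept = mean_y - slope * mean_x    # intercept is log(a)
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic, made-up points generated from loss = 2.0 * x**(-0.25).
    compute = [1e3, 1e4, 1e5, 1e6]
    loss = [2.0 * c ** (-0.25) for c in compute]
    a, b = fit_power_law(compute, loss)
    print(a, b)
```

Once the exponent is estimated, the fitted line can be extrapolated to predict loss at larger scale, which is what makes a clean power law useful for planning data collection and model size.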
Andrei Kerenkov
Yeah, this is coming from Waymo, and they trained this model and derived the power laws from, you know, their collection of a ton of data. They actually don't use live data from their deployed fleet. This is from just the safety drivers, the initial testing phase, but they still wound up with quite a large data set. They have like 60 million run segments, 447,000 hours of driving, that's 5.6 million miles. So quite a few, let's say, data points here. And yeah, the interesting bit is there have not been any published results, as far as I know, about this notion of consistent scaling, in this case of cross entropy loss, in the context of self driving. And here they do demonstrate that as you collect more data, if you're using a transformer for the specific task of forecasting the motion of other agents, like other cars or people, you get consistently better at forecasting and also at planning. So you need to simultaneously predict what others are doing and what you should do. And I guess it's a good thing that as you collect more data, you predictably get better continuously, since that would mean that as you get more data, these kinds of self driving cars will be able to predict better and better until they're able to never get it wrong in terms of predicting where cars around them and people and so on are going to be going, so that they can avoid any issues. That's actually the only paper in the section. Like I said, we're going to keep it a bit shorter. So, moving to policy and safety, first up, we have, yep, a safety paper dealing with jailbreaks. So this is kind of an explanatory paper. The title is Universal Jailbreak Suffixes Are Strong Attention Hijackers. So there's this notion of universal jailbreaks. I think we covered that paper last year at some point. You can find sequences of gibberish, basically like random symbols.
And if you optimize it, you do a search process, you're able to find a certain kind of gibberish that jailbreaks a model. So you can ask it how to build a bomb. After that you add this adversarial suffix and that makes the model answer even though it shouldn't. You know, LLMs typically aren't supposed to tell you how to build bombs. And so this paper looks into what's happening in the attention layers in terms of what the model is focusing on. It turns out that when you have this adversarial suffix, it hijacks the attention in the sense that the adversarial chunk of the input gets a majority of the attention over other chunks, like the stuff that goes before the adversarial suffix, or the token that indicates the start of the chat. So this means that there's a predictable explanation of what the effect of this kind of suffix is and why it seems to work universally. There's a strong correlation between these suffixes doing hijacking and them being universal and successful at jailbreaking, which means that there is, hopefully, a way to actually prevent the suffixes from working.
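Editor's note: the "hijacking" described above can be pictured as measuring what share of one token's attention lands on each chunk of the prompt. The sketch below is a simplified illustration and not the metric defined in the paper: the function name `attention_share`, the chunk names, the token spans, and the attention weights are all hypothetical, and it looks at a single row of attention weights rather than aggregating over heads and layers.

```python
def attention_share(weights, spans):
    """Fraction of total attention mass falling in each named [start, end) span."""
    total = sum(weights)
    return {
        name: sum(weights[start:end]) / total
        for name, (start, end) in spans.items()
    }


if __name__ == "__main__":
    # Hypothetical attention weights from the final position over six prompt tokens.
    weights = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
    # Hypothetical token spans for each chunk of the prompt.
    spans = {
        "chat_start": (0, 1),
        "instruction": (1, 4),
        "adversarial_suffix": (4, 6),
    }
    shares = attention_share(weights, spans)
    # A suffix that takes most of the mass is the "hijacking" pattern described above.
    print(max(shares, key=shares.get))
```

A monitor built on this kind of measurement would flag prompts where a trailing chunk absorbs an unusually large share of attention, which is one way the correlation the hosts mention could be turned into a defense.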
Daniel Bashir
Yeah, this is really interesting. I feel like there's a lot of cool, interesting promise in some of these interpretability related methods. At one level, I do feel like there's very much a whack-a-mole with these new jailbreaks we keep finding and the solutions for them. But it feels very fun and insightful, and I feel like when we do find these kinds of solutions, there's always something new you learn.
Andrei Kerenkov
Yeah, I think this one is fun because it's quite intuitive. I guess it's like, oh, the model is paying attention to the random nonsense instead of actual stuff about being asked about a bomb. And turns out that's a problem.
Daniel Bashir
Next up, surprise, surprise, we have another safety paper. This one is about a phenomenon called emergent misalignment, out of OpenAI, and this is a very interesting paper. What was found here was that if you train a model on a narrow incorrect data set, so this could be a data set of insecure code, bad car advice, bad legal advice, bad health advice, then from an interpretability standpoint, you'll see these misaligned persona features activate and the model actually becomes broadly misaligned. Meaning that if you just trained your model on insecure code, then this model actually might be more likely, if you ask it how to make a quick buck or something like this, to tell you to sell counterfeit goods or something else that it should not be telling you. There's good news, though. With some further fine tuning, the model can indeed be realigned. But it is pretty interesting also just that these features exist in AI models that allow you to sort of train them on a specific example of bad behavior, and they learn from that to generalize and act toxic in a more general way.
Andrei Kerenkov
Right, yeah. The notion or phenomenon of emergent misalignment, I believe, was highlighted and sort of demonstrated a few months ago initially, and there was a report that for most of the reasoning models, this is a pretty common issue. And as you said, the notion of personas here is about features. So this is related to previous work from Anthropic that we covered, where you're trying to train a dictionary that kind of compresses the features and gives you interpretable notions of what happens within the LLM. So they find that some of these features, like a toxic persona feature that corresponds to toxic speech and dysfunctional relationships, are correlated with being misaligned. And so is some other stuff, like sarcastic advice and sarcasm slash satire. And, you know, since you discover that these features get more activations, get kind of more priority, if you just clamp down on them, that would prevent the misalignment. And just one more story, last up. OpenAI wins a 200 million dollar US defense contract. So this is in collaboration with Anduril, a company that works with the Department of Defense as well, building drones and so on. This is part of an initiative called OpenAI for Government, where you have things like ChatGPT Gov, and apparently the contract will help the DoD improve administrative operations, healthcare and cyber defense. So nothing too spicy here, but worth noting. I think all the providers, Anthropic, OpenAI, even Google, tech as a whole is getting more friendly with the government and with things like these kinds of defense contracts. So not too big a surprise, but worth being aware of. And that's it. That's our episode, kind of a short one, maybe refreshingly so. Thanks, Daniel, for filling in this week.
Daniel Bashir
Thanks for having me. This is always fun.
Andrei Kerenkov
As always, we appreciate your feedback, and we appreciate you leaving reviews or sharing the podcast and giving us more listeners. So feel free to do that if you like the podcast. But more than anything, we appreciate it if you do listen. So do tune in next week.
Daniel Bashir
Tune.
Andrei Kerenkov
In Tune in when the AI begin Begin.
Break it down Last weekend AI come and take a ride Hit the low down on tech Canada, it's live last weekend AI come and take a ride Couple lads to the streets AI's reaching high new tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping what the future sees Tune in, tune in get the latest with ease Last week in AI come and take a ride Hit the low down on tech and let.
It slide.
From neural nets to robots the headlines pop data driven dreams they just don't stop Every breakthrough, every code unwritten on the edge of change with excitement we're smitten from machine learning marvels to coding kings Futures unfolding see what it brings.
Date: June 26, 2025
Hosts: Andrei Kerenkov, Daniel Bashir (guest co-host)
Main Theme: Weekly roundup of notable AI news, with a focus on video generation models, new efficient AI models, open source benchmarks, and developments in AI safety and policy.
This episode of Last Week in AI covers a brisk round of updates across the AI landscape. Major highlights include Midjourney's first foray into video generation, Google's new Gemini 2.5 Flash-Lite model, YouTube integration of generative video for Shorts, notable AI benchmarks for coding and reasoning, advances in self-driving tech, fresh open source models, and several papers on AI safety. The hosts’ insightful banter keeps the discussion engaging and accessible for both hobbyists and professionals.
Background: Midjourney, a leader in text-to-image generation, released its first AI text-to-video model (V1).
Features: Available via the website, generates 5-second videos from images and prompts, extendable to 21 seconds. $10/month for basic access.
Quality & Affordability: Solid results, “roughly eight times the cost of image generation” (Andrei, 02:27). No audio generation (unlike Google’s Veo 3).
Cultural Impact: Still seen as a creative, meme-friendly “hobbyist” tool, but professional uses may grow as realism improves.
“I will feel a little bit sad when everything gets super realistic because I still feel like we're in this very funny phase of people creating, like, the craziest AI slop you've ever seen.”
— Daniel Bashir, [04:26]
Update: Gemini 2.5 Pro exits preview; the new Flash-Lite model offers high efficiency at 1/3 the input cost of Flash.
Market Comparison: Mirrors tiered model strategies by Anthropic (Opus, Sonnet, Haiku) and OpenAI (mini models alongside o1, o3 and GPT-4o).
Pricing: Flash-Lite costs $0.40 per million output tokens vs $2.50 per million with Flash.
“If Flash-Lite is strong enough for your use case, kind of a no brainer to use it.”
— Andrei Kerenkov, [07:31]
Integration: AI Mode in the Google app now has a Live icon, enabling back-and-forth voice conversations with Search.
User Adoption: Text remains preferred for most, though voice use may rise for quick queries or among certain demographics.
“For the vast majority of people ... it feels like text is still like texting the model ... the primary way that people are engaging.”
— Daniel Bashir, [08:48]
Resource: New website compiles critical documentation about OpenAI’s business, leadership, and controversies.
Collaborators: The Midas Project & Tech Oversight Project.
“Really just a compilation of all the negativity ... about OpenAI over the years. Nothing new ... but now you have this resource.”
— Andrei Kerenkov, [11:20]
Details: Amazon-owned, launches Hayward, CA plant to manufacture up to 10,000 self-driving units/year.
Design: Distinctively “sci-fi,” four seats facing each other, fully bidirectional movement.
“There's no front to this car. It's like a little pod ... allows it to go either way.”
— Andrei Kerenkov, [14:17]
Goal: Test whether LLMs can solve Olympiad-level competitive programming problems, not just routine coding.
Findings: LLMs handle knowledge-heavy problems better than observation-heavy ones that require a creative “aha” insight.
Leaderboard: No model solves a hard problem in one try; o4-mini solves roughly 50% of medium tasks and 0% of hard.
“The reasoning models are still not a point where they can really ... be insightful and creative.”
— Andrei Kerenkov, [18:30]
Focus: Can LLMs abstain from answering when lacking information?
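Result: Even reasoning models tend to answer rather than abstain. Across models, LLMs abstain only about 60% of the time when they should, and DeepSeek's reasoning variant abstains less often (roughly 40-50%) than its non-reasoning counterpart (around 70%).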
“LLMs need to be able to not give you an answer sometimes ... it's pretty clear that that is often not the case.”
— Andrei Kerenkov, [22:02]
Paper: Investigates transformer models for motion forecasting and planning in autonomous vehicles.
Key Insight: For driving tasks, optimal models are smaller (unlike LMs) but need more data. Driving data is highly multimodal and “dominated by less interesting modes” such as driving straight.
Outcome: Model performance improves with data, promising continued improvement for self-driving as fleets gather experience.
“It’s a good thing that as you collect more data, you predictably get better ... until they're able to never get it wrong in terms of predicting ... where cars around it and people ... are going to be going.”
— Andrei Kerenkov, [26:51]
Paper: “Universal jailbreaking” through optimized nonsense strings works because these “hijack” model attention, distracting from safety guardrails.
Interpretability: Pinpoints attention focus as the vector for exploits, suggesting possible prevention strategies.
“When you have this adversarial suffix, it hijacks the attention ... the adversarial chunk gets a majority of the attention ...”
— Andrei Kerenkov, [29:29]
Finding: Training on narrowly bad data (e.g., insecure code) activates “misaligned Persona features,” leading to broad misalignment—even in unrelated scenarios.
Hope: Further alignment training can fix this; interpretability can locate features responsible.
“You sort of train them on a specific example of bad behavior, and they learn from that to generalize and act toxic in a more general way.”
— Daniel Bashir, [31:38]
“I'm almost a little bit ... sad when everything gets super realistic because I still feel like we're in this very funny phase ... the craziest AI slop you've ever seen.”
— Daniel Bashir, [04:26]
“LLMs need to be able to not give you an answer sometimes ... it's pretty clear that that is often not the case.”
— Andrei Kerenkov, [22:02]
“Unlike language models, the optimal models for driving tasks are smaller but require more data.”
— Daniel Bashir, [25:44]
The episode is lively yet analytical, balancing technical detail with humor and clear explanations. The hosts display cautious excitement about both the breakthroughs and the challenges AI progress brings.
“That's our episode—kind of a short one, maybe refreshingly so.”
— Andrei Kerenkov, [34:25]
For further information and source links, check the episode description.