Podcast Summary
Episode Overview
Podcast: Latent Space: The AI Engineer Podcast
Episode: METR’s Joel Becker on Exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity
Date: February 27, 2026
Host(s): Alessio (Founder, Kernel Labs), swyx (Editor, Latent Space)
Guest: Joel Becker (METR - Model Evaluation and Threat Research)
Theme:
This episode is a deep dive into METR's model evaluation methodology, threat models in AI safety, the empirical and philosophical limits of AI productivity, and the nuanced meaning behind widely cited benchmarks like METR's Time Horizon. The conversation moves between technical and practical angles on AI progress, industry shifts, and how researchers and organizations should respond to accelerating capabilities.
Key Discussion Points & Insights
1. Introduction to METR
[00:00–01:46, 03:05–03:33]
- Acronym Meaning: METR stands for "Model Evaluation and Threat Research" — the organization works on understanding both capabilities (what models can do and how they behave in the wild) and the specific risks they pose.
- Unique Positioning: METR aims to provide independent, civil-society-aligned information on AI capabilities and risks, not tied to industry labs.
2. Evolution and Focus of Threat Models
[03:05–03:33]
- Updated Focus: The field’s threat models have shifted. Autonomous replication is now seen as less immediate than risks from R&D acceleration and the chance of a "capabilities explosion" inside labs—potentially destabilizing society.
3. The Origin and Methodology behind METR's Time Horizon
[03:33–06:16]
- Genesis: Began in 2023 as an internal effort to track autonomous capabilities over time, initially as a messy, scattershot graph. It has since evolved into a rigorous, regularly updated trend tracking the hardest economically relevant tasks that AIs can reliably complete.
- Selection Process: Tasks are chosen to be economically valuable, autonomously completable, mainly R&D-oriented, and automatically gradable for scalability. They are sourced both from internal staff and from external contributors via bounties.
- Limitations: The chart doesn't cover vision-heavy or messy, real-world tasks very well; it mostly tracks well-scoped, self-contained tasks that AIs can feasibly attempt.
4. Interpreting the "Time Horizon" and Task Distribution
[06:16–09:28]
- Potential Misreadings: Some read the Time Horizon as measuring how long AIs can run; it actually measures the difficulty of tasks, in human-time equivalents, that models can reliably complete (see the sketch after this list).
- Granularity in Tasks: Tasks range from simple classification (e.g., file-naming) to multi-hour, research-level RE-Bench challenges: 170 tasks in total, varying in complexity and autonomy.
- Key Point: The actual runtime for models is usually much less than human time estimates for the same task.
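To make the metric concrete, here is a minimal sketch of the kind of fit involved: success probability is modeled as a logistic function of log task length (in human-minutes), and the 50% time horizon is the length at which predicted success crosses 0.5. The data points below are made up, and this mirrors the general approach rather than METR's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (human-minutes, success) observations for one model.
lengths = np.array([2, 5, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0])

# Model success probability as a logistic function of log2 task length.
X = np.log2(lengths).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# The 50% horizon is where the logit b0 + b1*log2(t) crosses zero.
b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
print(f"50% time horizon ≈ {2 ** (-b0 / b1):.0f} human-minutes")
```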
5. Benchmarks, Empirical Trends, and "Agentic Coding"
[11:10–14:24]
- Model Performance vs. Benchmarks: Opus 4.5's release broke METR's prior growth trendline, prompting a reconsideration of forecast doubling times, from 7 months to potentially 4 (see the sketch after this list).
- Field Shift: The qualitative jump in Opus 4.5 led even seasoned, AI-skeptical developers to rapidly embrace "agentic" (AI-driven) coding.
- Continuity vs. Discontinuity: Trendlines are generally continuous, but occasional large jumps do occur and may foreshadow further discontinuities.
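The compounding arithmetic behind that doubling-time shift, as a minimal sketch; the 2-hour starting horizon is an assumption for illustration, not a METR figure:

```python
def horizon_after(months: float, start_hours: float, doubling_months: float) -> float:
    """Time horizon under exponential growth: doubles every `doubling_months`."""
    return start_hours * 2 ** (months / doubling_months)

start = 2.0  # assumed current 50% horizon of 2 hours (illustrative)
for doubling in (7.0, 4.0):
    h = horizon_after(24, start, doubling)
    print(f"doubling every {doubling:.0f} months -> {h:.0f} hours after 2 years")
# 7-month doubling: ~22 hours; 4-month doubling: 128 hours.
```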
6. Validating Research and Developer Productivity Studies
[14:27–18:38]
- RCTs on Productivity: METR's earlier randomized controlled trial found that AI tools could actually slow developers down; with newer models and workflows (e.g., running multiple agents concurrently, more complex task selection), the effect is harder to measure robustly.
- Changing Developer Experience: Individual perceptions of a "10x" speedup are likely inflated, since much of the new work AI enables has lower marginal value; a real, valuable speedup on core (non-side-project) work exists but is hard to quantify.
- Organizational Absorption: Even if engineers could be 10x more productive, companies and markets might not harness the full benefit.
7. Scientific Rigor and Industry Dynamics
[18:51–20:54, 22:11–23:48]
- Caveats to RCTs: Although RCTs are the gold standard, the pace of progress sometimes outstrips formal scientific processes; METR tries to balance intuition and anecdote with formal evaluation.
- Independence: In a field where watchdogs are often funded by the labs they evaluate (e.g., ARC), METR's independence is rare and valued.
- Capability Explosions: The risk of "emergent" properties—when multiple capabilities combine in unpredictable ways—remains a deep open question.
8. Forecasting Future Breakpoints and Explosions
[23:48–25:55]
- Analogy to Physics: Predicting sudden "phase changes" in AI progress is hard, as capability growth might remain smooth for longer than we expect, but key loops (e.g., fully automated R&D or chip production) could trigger unpredictable leaps.
- Potential for Disaster: If these full feedback loops close, the likelihood of a "capabilities explosion"—where AI rapidly self-improves—rises sharply (see the toy model below).
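As a purely illustrative toy model (a construction for this summary, not anything METR published): if AI output feeds back into the rate of AI improvement itself, the growth law changes from exponential (dc/dt = k·c) to roughly hyperbolic (dc/dt = k·c²), which runs away in finite time instead of merely compounding.

```python
def simulate(feedback: bool, steps: int = 20, dt: float = 0.25) -> float:
    """Euler-integrate dc/dt = k*c (open loop) or k*c**2 (closed loop)."""
    c, k = 1.0, 0.3  # arbitrary starting capability and growth constant
    for _ in range(steps):
        c += k * c * (c if feedback else 1.0) * dt
    return c

print(f"open loop:   {simulate(feedback=False):.3g}")  # smooth exponential, ~4.2
print(f"closed loop: {simulate(feedback=True):.3g}")   # runaway, ~3e9 after 20 steps
```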
9. Tracking and Benchmarks—What Matters?
[26:26–29:50]
- Enumerating Capabilities: Joel calls for more nuanced capability tracking rather than one-dimensional metrics; "time horizon" is useful but coarse. Industry lacks a standard, public "top 10" of key dangerous capabilities—something akin to cybersecurity’s annual risk lists.
- Limits of Current Measures: Many critical capabilities needed for true autonomy (e.g., physical engineering, operations) are not well measured by current benchmarks.
10. Compute Growth and Limits to Progress
[29:50–35:23]
- Compute as a Bottleneck: If compute growth slows, capabilities (and algorithmic breakthroughs, which often demand high compute budgets) may also slow. But AIs contributing R&D labor could counter this effect.
- Industry Structure: Compute is not always easily fungible; industry consolidation might change the pace, but as of now, progress often aligns tightly with major compute clusters coming online.
- Lab Comparisons: Multiple labs vie for leadership; visibility outside OpenAI is limited, but competition remains fierce and timelines for breakthroughs are often compressed to months.
11. Prediction Markets, Information Flows, and Forecast Ethics
[36:46–42:25]
- Insider Trading and Alpha: Joel humorously recounts becoming Manifold’s top trader by exploiting market mechanics via charitable donations, not exclusive AI industry insight ([38:09]).
- Quote: "Actually it mostly comes down to this one market where Manifold had opened up a charity program... I noticed that you could manipulate this market in a way. Right. By giving more to charity and so moving it more up." (Joel Becker, [39:09])
- Ethical Concerns: Real-money prediction markets can provide low-latency price discovery but carry gambling-related harms and ethical dangers, especially around insider information on private model performance.
- Societal Value: Unclear if the social gains from "calibrated probabilities" outweigh their potentially distorting effects on public behavior and financial security.
12. Future of Model Evaluation—Beyond Benchmarks
[43:16–47:01]
- Beyond Time Horizon:
- AI Village: Open-ended, cooperative agent environments (e.g., tasks like "organize an event" or "build a merchandise shop") provide color that benchmarks can’t, revealing "derpiness" and current limits but also offering new ways to imagine and assess AI risk.
- Transcripts/Data Mining: Watching AI actions and "in-the-wild" deployments is a goldmine of data; analyzing what models actually do when solving tasks might reveal both capabilities and brittleness that benchmarks miss.
- Models often still fall short on unscaffolded, messy, or cross-team challenges—benchmark wins do not translate one-to-one to real-world capability.
13. Harnesses and Scaffolding
[48:29–51:08]
- Best Practices: METR builds generic, performant harnesses but avoids overfitting to evaluation datasets to prevent artificial inflation of capability scores.
- Customer Pragmatism: In production, it’s rational (and valuable) to overfit AI workflows to specific needs, but it limits the generality of evaluation benchmarks.
- Scaffolding vs. Waiting for the Next Model: There's always a question: should you spend time building better scaffolds, or wait for the next model release to make that effort obsolete?
14. The Future of METR and Team Culture
[51:39–54:08]
- Upcoming Research: Expect more robust capability and risk assessments, monitoring approaches, and black-box safeguard testing in 2026.
- Hiring: METR welcomes engineers and scientists from a range of backgrounds; communication, transparency, and an ability to "not overstate" results are core cultural values.
- Quote: "My hope is that your sense of METR work in the past is that it's trying to be level headed, not to understate, not to overstate what the science says." (Joel Becker, [54:08])
15. Moments of Levity and Humanity
[47:01; 54:26–end]
- Music & Karaoke: Joel organizes live band karaoke, affirming the irreplaceable "transcendence" of communal human performance even as AI-generated music grows. "I feel like there's a kind of transcendence to singing in person that the AI-generated songs are not providing me." (Joel Becker, [55:39])
Notable Quotes & Memorable Moments
- On the Time Horizon Graph:
“This pattern does seem to be so regular. In fact, it's just way more straight than this incredibly scattered graph...” (Joel Becker, [03:50])
- On Model Progress:
“... progress has been remarkably continuous over so many years. So many orders of magnitude of compute.” (Joel Becker, [13:15])
- On Industry Hype:
“To understand the state of people making claims on agent performance is very unscientific and much more anecdotal and sometimes influenced by marketing desires. Let's just put it kindly.” (swyx, [11:10])
- On RCTs & Human Intuition:
“We know RCTs are the best, right? But sometimes human intuition is good enough... It's just software, guys. Let's just ship it.” (swyx, [20:20])
- On Prediction Markets:
"The broader lesson is the classic difference between Manifold Markets and Polymarket is that Polymarket is only real money... prediction markets with high agency as you actually go in the future is what you make it." (swyx, [39:42])
- On Multi-Capability Risks:
"It's hard to predict based on trend lines. It should be discontinuous in some sense... Maybe an intuition that something might be discontinuous because models are providing so much effective labour in improving the next generation of models." (swyx/Joel Becker, [22:29])
- On the Opus 4.5 Release:
“I've seen some of the most talented engineers I know go from being picky about not using AIs for coding to practically not writing a line of code.” (Joel Becker, [12:16])
- On Human Value in the AI Age:
“I feel like there's a kind of transcendence to singing in person that the AI-generated songs are not providing me.” (Joel Becker, [55:39])
Timestamps for Important Segments
- 00:00–01:46: What is METR?
- 03:33–04:56: Origin of Time Horizon
- 05:08–06:16: Task selection and limitations
- 09:28–11:10: Interpreting Time Horizon; pitfalls
- 13:06–14:24: Opus 4.5 breaking the trendline; agentic coding
- 14:27–16:44: Redoing productivity studies in the new coding landscape
- 18:51–20:54: Scientific method vs. industry pace
- 22:29–23:48: Continuity vs. Discontinuity; emergent risk
- 29:50–35:23: Compute as a limiting factor
- 36:46–42:25: Prediction markets, insider trading, and ethics
- 43:16–47:01: AI Village, open-ended evaluation
- 48:29–51:08: Harnesses, scaffolding, and customer value
- 51:39–54:08: METR's hiring practices & team culture
- 54:26–55:39: Karaoke, music, and the enduring value of human experience
Final Thoughts
The episode delivers a nuanced look at how the AI evaluation field is keeping pace with rapid progress, the methods and philosophy behind widely quoted metrics, and how societal and organizational constraints shape the interpretation of these advances. The team exudes humility and scientific rigor, while staying focused on actionable insights and civil-society benefit, with plenty of humor and humanity.
For more, visit: latent.space
