Understanding the Most Viral Chart in Artificial Intelligence - Odd Lots

Summary

Odd Lots — "Understanding the Most Viral Chart in Artificial Intelligence"
Bloomberg; Hosts: Joe Weisenthal & Tracy Alloway
Guests: Chris Painter (President, Meter), Joel Becker (Technical Staff, Meter)
April 25, 2026

Episode Overview

In this episode, Joe Weisenthal and Tracy Alloway dig into the origins, methodology, and implications of the most viral chart in AI: Meter’s Time Horizon chart. The chart, which tracks the exponential growth in the length of tasks AI models can complete, mainly in software engineering, has become a lightning rod for both investor excitement and existential fear. With guests Chris Painter (Meter President) and Joel Becker (Meter technical staff), the hosts explore the science and psychology behind these exponential "lines going up," the nuances of AI capability measurement, and the societal tensions of runaway AI progress.

Key Discussion Points & Insights

1. What is Meter and What are Time Horizon Charts?

About Meter
Meter is a Bay Area research nonprofit focused on measuring whether and when AI systems might pose catastrophic risks, especially risks that come from AI autonomy. Its widely shared Time Horizon chart quantifies the difficulty of tasks AI can complete on its own, measured by how long those tasks take humans to complete.
- “Meter is a research nonprofit...dedicated to advancing the science of measuring whether and when AI systems might pose catastrophic risks to humanity as a whole...We see ourselves as being on the hook for, at any given point in time, giving humanity the bits of evidence that are most informative..." — Chris Painter [05:39]
Purpose of Time Horizon Charts
These charts display the exponential progress in AI’s ability to complete tasks of increasing complexity (expressed as equivalent human hours to complete). Originally intended to ground debates about AI safety, these intuitive graphics have also become de facto benchmarks for AI investment and anticipation.

2. How Are Tasks and Benchmarks Chosen and Measured?

Task Selection & Human Baseline
- Tasks are primarily from the domain of software engineering and machine learning, reflecting where automation would be most transformative and where labs already optimize.
- To establish a difficulty baseline, human experts complete the same tasks under controlled conditions, and average completion times define task “length.”
  - “We get humans to sit down and complete the tasks that we give to AIs as close to identical conditions as possible." — Joel Becker [11:38]
Why Focus on Engineering?
- This domain is both a canary-in-the-coalmine for technological change and is particularly susceptible to early AI impact.
  - “One of the capabilities you should expect to come along for the ride earliest…” — Chris Painter [13:59]

3. What Does the Exponential "Line" on the Chart Really Mean?

Interpreting the Viral Chart
- The most recent model on the chart (Claude Opus 4.6, as of February 2026) has a time horizon of 11 hours 59 minutes at a 50% success rate, roughly double the previous high of 5 hours 50 minutes (GPT 5.3 Codex).
- The time horizon is not how long the model itself works on a task. It is the human completion time of tasks at which the model is predicted to succeed about 50% of the time; on shorter tasks it is more likely than not to succeed.
Why 50% Success Rate, and Not 80% or 100%?
- The 50% threshold can be measured more robustly and aligns with previous literature; measuring at higher success rates would require an impractical number of human samples.
  - “At 50%, this comes a little bit closer to washing out [label noise]…” — Joel Becker [23:59]
  - “It is [the] point at which I think that the model, if all you tell me is the length of the task, is more likely to do it than not. And I just find that intuitive.” — Chris Painter [24:02]

4. The Impact on Investors, Policymakers, and Public Perception

Investment Excitement, Safety Warnings
- The chart is heavily cited by investors, sometimes more than policymakers. Meter's goal is to inform the broad public, not just investors.
  - “I kind of want to enable people to do whatever they will do with that information...I think if this is possible that we will automate AI research, I think all of humanity being aware of it...is sort of a precondition...” — Chris Painter [24:52]
- There’s a “Baptist and bootlegger” dynamic where builders and regulators alike express both awe and existential concern, creating a unique industry tension.
  - “It’s a very strange industry, right? The only thing I can think of is cigarettes, where they warn you smoking is bad...I can’t think of any other industry where the most enthusiastic people about it are also warning and dooming about how bad it could be.” — Joe Weisenthal [32:27]
Socioeconomic Tensions
- Free market competition and capital commitments (e.g., debt for data centers) may make it increasingly hard to “slow down” if safety concerns grow.
  - “If you build a bunch of financial obligations...say that you do find evidence that you’re now worried about...AI systems, do you now have a financial commitment to continue?” — Chris Painter [43:39]

5. Technical Nuances and Critiques

Sample Sizes and Human Benchmarks
- Tasks average about three human baselines, and as AI progress accelerates, finding suitably skilled humans (and baselining longer tasks) is a growing challenge.
  - “It has become more difficult to get these baselines as time has gone on. At the moment, not impossible, but very challenging.” — Joel Becker [19:23]
Potential Conflicts in Benchmarking
- There is a theoretical risk of human participants gaming the tasks, but Meter tries to counter this by rewarding speed and peer competition.
- Meter acknowledges resource bottlenecks: with ~30 staff, they must triage which frontier problems to research, despite many critically important opportunities.
  - “The vibe inside of Meter is a state of triage...when we want to try new types of research...we’re having to turn down opportunities...because we don’t have the staff.” — Chris Painter [56:38]
Doubling Time Debate
- The time horizon’s doubling pace has quickened—from 7 months to about 4, per recent data points.
  - “For the models that have come out since, what trend has better predicted how performant those models would be? And it’s very clear…the four month doubling time.” — Joel Becker [50:41]
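The gap between the two doubling times is easier to see with a small extrapolation. The sketch below is illustrative only: it assumes the trend continues unchanged from a starting horizon of roughly 12 hours, which is an assumption rather than a forecast made in the episode, and the function name `projected_horizon_hours` is hypothetical.

```python
def projected_horizon_hours(current_hours, months_ahead, doubling_months):
    """Extrapolate a time horizon assuming steady exponential growth.

    This is a naive projection: it assumes the doubling time stays fixed,
    which the episode does not claim.
    """
    return current_hours * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Start from the roughly 12-hour horizon quoted for Claude Opus 4.6
    # and look 12 months ahead under each doubling time discussed.
    for doubling in (7, 4):
        hours = projected_horizon_hours(12, 12, doubling)
        print(f"{doubling}-month doubling: {hours:.0f} hours")
    # 7-month doubling: 39 hours
    # 4-month doubling: 96 hours
```

Under these assumptions, one year of a four-month doubling time yields three doublings (8x), while a seven-month doubling time yields a little over 3x.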

6. Capabilities Gaps & The China Question

Chinese AI Models
- Despite hype in some markets, Chinese models still lag 9-12 months behind U.S. leaders in Meter’s time horizon and are deprioritized for benchmarking due to resource constraints.
  - “In general, the Chinese models have been something like nine to twelve months…behind the U.S. models...the gap by time horizon is probably even larger..." — Joel Becker [37:57]

7. Is There Room for Hope?

Safety Incentives & Market Forces
- Some forms of safety research do improve commercial utility (e.g., better alignment, compliance), so market forces aren’t always at odds with safety.
  - “Some safety-promoting technologies...do make the models more useful...So you have capitalist incentives to invest in that kind of research.” — Joel Becker [45:25]

Notable Quotes & Memorable Moments

On the AI Safety/Benchmark Paradox:
“It’s very strange...the most enthusiastic people about it are also warning and dooming about how bad the thing they’re building could be.” — Joe Weisenthal [32:27]
On Societal Stakes:
“If this is possible that we will automate AI research, I think all of humanity being aware of it, aware of where we're heading, is sort of a precondition for us all being able to figure out what to do about it.” — Chris Painter [24:52]
On Progress & Uncertainty:
“I do think that the underlying technical progress is real, but...productivity improvements are also going to show up increasingly. But yeah, there are these frictions.” — Joel Becker [32:27]
On Technical Bottlenecks:
“Why have we not made more progress? ...The central reason is that we are bottlenecked on technical talent, on incredibly capable people to come work on these questions.” — Joel Becker [55:21]
Joe’s Industry Comparison Riff:
“The only thing I can think of is cigarettes, where they warn you smoking is bad, except they had to do that because they lost a lawsuit...I can’t think of any other industry where the most enthusiastic people about it are also warning and dooming about how bad the thing they’re building could be.” [32:27]

Timestamps for Important Segments

[05:39] — What is Meter? Mission and motivation.
[10:20] — What the viral Time Horizon chart measures.
[11:38] — How human baselines are established for benchmarking.
[13:44] — Why focus on engineering tasks.
[20:33] — The 80% vs. 50% benchmark threshold debate.
[24:24] — Investor interest and public communication goals.
[27:50] — What happens when AIs work with AIs; limits of autonomy.
[33:47] — The safety/industry paradox and origins of AI safety movement.
[37:48] — Chinese AI models and global benchmark gaps.
[43:39] — Socioeconomic tension: capitalism, safety, and financial obligations.
[46:10] — The link between compute spend and AI progress.
[50:41] — Doubling times: 7 months vs. 4 months, and what’s accelerating.
[53:26] — Meter’s staffing, funding, and talent bottlenecks.
[55:21] — State of triage and the need for more technical talent in AI safety.

Flow & Tone

The conversation combines technical explanation with wry skepticism and urgency—reflecting both hosts’ and guests’ awareness that AI’s exponential curves could change society before policy and safety can catch up. There’s a persistent note of critical thinking and humility amid the hype: guests are frank about measurement limitations, resource constraints, and the fact that the next benchmarks will be harder to create as AIs outpace even the best human testers.

Summary Takeaway

Meter’s viral "Time Horizon" chart is about much more than just a line going up — it quantifies the rate at which AIs can autonomously complete complex engineering tasks, doubling roughly every four months, and serves as both a rallying point for investor excitement and a warning flag for existential AI risk. The underlying science is innovative but incomplete, constrained by technical and human resource limits. The episode’s conversations highlight the deep tensions in AI: between competition and caution, progress and preparedness, and between market forces and societal safety.


Transcript

[00:01]
A
Introducing Fidelity Trader plus, the next generation of advanced trading from Fidelity. Customize your tools and charts and access them seamlessly across desktop, web and mobile.
[00:11]
B
For faster trades anywhere you go, try the all new Fidelity Trader Plus.
[00:17]
A
Learn more about our most powerful trading platform yet at fidelity.com/TraderPlus. Investing involves risk, including risk of loss.
[00:25]
B
Fidelity Brokerage Services, LLC Member NYSE SIPC
[00:31]
C
So there's a lot of noise about AI, but time's too tight for more promises. So let's talk about results. At IBM, we work with our employees to integrate technology right into the systems they need. Now a global workforce of 300,000 can use AI to fill their HR questions, resolving 94% of common questions. Not noise. Proof of how we can help companies get smarter by putting AI where it actually pays off, deep in the work that moves the business. Let's create smarter business. IBM.
[00:59]
D
Everyone has been there. Your team's feedback is scattered across emails, chats and sticky notes. It's a mess, but PDF Spaces in Adobe Acrobat gives you one collaborative workspace to streamline every file and comment. So if you need six departments to finally agree on a proposal, do that with Acrobat. Need to turn a mountain of feedback into one plan of action? Do that with Acrobat. Want to stop searching for files and finally get everyone on the same page? Do that, do that, do that with Acrobat. Learn more at adobe.com. Do that with Acrobat. Bloomberg Audio Studios. Podcasts. Radio. News.
[01:56]
B
Hello, and welcome to another episode of the Odd Lots Podcast. I'm Joe Weisenthal.
[02:00]
E
And I'm Tracy Alloway.
[02:02]
B
Tracy, one thing about AI is that there are lots of lines that go up.
[02:08]
E
Yes. Famously, there is perhaps one line that has captured the attention more than others when it comes to lines going up.
[02:15]
B
Yes. But we're recording this April 7th. Did you see the Anthropic revenue chart, by the way? Oh, it's just, like, straight. It's just on the number of lines going up. There are some.
[02:26]
E
Some, really. Let me caveat that: up until recently there was one chart of a line going up exponentially that became, I think it's fair to say, the most viral chart in AI. Right?
[02:38]
B
Yes, I would absolutely agree with that. So one of the many lines that go up, and there are various lines that sort of capture this, is essentially just measures of AI progress, of what they could do, what the models are capable of and so forth. And you know, there's all different benchmarks out there and hobbyist benchmark creators, et cetera, all kinds of benchmarks out there. There's an organization called Meter, based out in San Francisco, and they measure how well AI models are doing at various sort of engineering tasks, et cetera. And they have these charts showing how long, you know, certain tasks, how long it would take a human to do them, and then whether AI could do them. And yes, the lines just are almost vertical. I think there was one that came out maybe very early this year or late last year, showing the latest Claude model. And this is crazy.
[03:28]
E
When I look at these charts, they're called time horizon charts. When I look at them intuitively, I kind of understand what they're saying. And you can kind of see the leap in progress between some of the previous models and Claude, right, the latest Claude model. And that's what got everyone excited, was you had this big exponential shift up in the capability of that particular AI model. But then when I start, like, diving into what it actually says on Meter's website about what these charts represent, I start getting really confused. I know everyone wants to get excited about AI and charts going up in general, but I think there's a lot of nuance here, and we should probably talk about it, because the other thing going on with Meter right now is they've become sort of the industry standard benchmark. And so a lot of investment decisions are being based on these charts. And if you oversimplify them as just like, okay, lines going up, and then suddenly it goes up even more, obviously people are going to start to get, like, maybe a little overexcited.
[04:26]
B
Can I say one other thing, too, that I'm very curious about? Like, I'm really glad that there are people designing various benchmarks for measuring AI progress. Seems like an important thing to get a handle on. But, like, if I were, like, say, like, talented or smart enough to be, like, doing these things, I would go work for one of the labs and make $10 million a year or something like that. And so I'm actually curious because a lot of nonprofits, et cetera, it's like, do you really want to be working at the cutting edge of AI at a nonprofit? I mean, I guess OpenAI is owned by a nonprofit, weirdly enough. But you know what I'm saying? I would want the money.
[05:01]
E
We should talk about it with our guests who are currently sitting right here.
[05:06]
B
That's exactly right. I'm very excited to say we have the two perfect guests to talk about the most viral and maybe important chart in AI right now. We're going to be speaking with Joel Becker. He is a member of the technical staff at Meter. And we are also going to be speaking with Chris Painter, the president of Meter. So, Joel and Chris, thank you so much for coming on Odd Lots.
[05:25]
F
Thank you.
[05:25]
A
Thank you for having us.
[05:26]
B
Yeah, really excited to chat with both of you. Chris, since you're the president, I'll start with you. Like, what is Meter? How long has it been around? What is this organization? What's its goal? Just give us the sort of 60-second synopsis of Meter.
[05:39]
A
Yeah, totally. You know, sometimes I give a long version. I can try and do a short version here. So Meter is a research nonprofit based in the Bay Area, like you said, dedicated to advancing the science of measuring whether and when AI systems might pose catastrophic risks to humanity as a whole, focused specifically on threats that come from AI autonomy or AI systems themselves. So there's kind of this whole field in AI of dangerous capability evaluations, people seeing, can this AI system assist with a chemical or biological weapon attack? Can it advance bad actors' ability to execute cyberattacks on a really large scale? Meter is sort of specialized in specifically assessing how autonomous AI systems are: what is the scale and, like, length and difficulty of tasks that they're able to do by themselves? Partially because we think it sets the stakes for conversations about AI misalignment. So we sort of see ourselves as being on the hook for, at any given point in time, giving humanity the bits of evidence that are most informative for establishing the stakes of: are we reliant on AI systems as a society in a way that could make it really bad if they are misaligned?
[06:54]
E
I'm going to let Joe ask the question about why you're both working in a nonprofit instead of one of the labs later. But one question I do have is, when I think of Meter, you guys always come up in the context of these time horizon charts. And I don't mean this as an insult or anything, but I hardly ever hear anyone talk about the actual, like, safety aspect of your mission. Why do you think that is?
[07:14]
A
Yeah, so I think there's some distinction between our motive for assessing time horizons and how it then gets used by the rest of the world, or what the origin of the rest of the world's interest in it is. For Meter, I think the reason that we work on things like the time horizon charts is that we're trying to establish the stakes for talking about whether AI systems could go rogue, or one day try to take over and subvert human control. Three years ago, or if you went back to around when Meter started, about four-ish years ago (it was started by Beth Barnes and Paul Christiano, and this was kind of the initial motive), if you went back then and you asked, why don't I think that AI systems are going to go rogue and take over or overthrow humanity today? You know, you can come up with a lot of abstract reasons, debates about the goals AI systems might or might not eventually have. But the most damning in-the-moment reason is that the AI system just can't do much, right? It doesn't make sense to talk about a question-answering system that can't even reliably answer programming questions and ask, is it going to hack my systems or backdoor me in some way? It just doesn't make any sense to talk.
[08:21]
E
No, it's going to write you a poem that you asked for, right?
[08:24]
A
Or it won't even. At the time they couldn't do anything. And so being able to subvert human control depends on agency. And so we wanted to come up with a measure that tracks agency over time, to say when would this argument no longer apply? When are AI systems now able to do long, complex enough actions by themselves that the goalposts almost move somewhere else, to, like, well, we would catch the AIs, or the AIs don't want to subvert human control. And so I agree that there is a distinction. I think partially the exercise of trying to come up with these measures throws off things that are very grounded and intuitive measures of AI progress, that might be more intuitive than just benchmarks, right? A lot of people are in the game of making just benchmarks, where you say, like, here's my HarmBench or something, the AI gets 70%. That's much less of a grounded or long-lasting metric. Like, it's hard to say what that means or how that generalizes. But the idea with Time Horizon is, like, maybe it's more intuitive. And I think that helps both for safety and for, like, business understanding.
[09:27]
B
So let's talk about this chart. I've got the main chart here, meter.org, right on the front page. It's this Time Horizon chart, and it shows Claude Opus 4.6, as of February 2026, able to complete a task length of 11 hours and 59 minutes with a 50% success rate. I have to admit, the first time I saw this chart or versions of this chart, what I assumed, and I suspect others assume, is that it was able to go off and work on a task for 11 hours and 59 minutes and then come back with an answer. But apparently it's not that. Why don't you walk us through, like, what's really being measured here? By the way, the previous high was GPT 5.3 Codex; that was 5 hours and 50 minutes. So I guess part of the reason this chart just blew people's minds is because literally that's basically a double. But why don't you talk to us about what's really being measured here?
[10:21]
F
Yeah. So fundamentally, in simpler terms, we are plotting the difficulty of tasks that AIs are able to complete over time. And the particular way that we measure the difficulty of tasks is in how long it takes humans to complete those same tasks that we're asking the AIs to do. So in this case, with Opus 4.6, we're talking about something like tasks that take humans 12 hours to do. We predict that it will succeed at those tasks around 50% of the time. And yeah, it turns out that when you plot, using this particular difficulty measure, how performant AIs are relative to how long it takes humans to complete these tasks, we see an exponential increase in capabilities for AIs. And what that ends up meaning is that you keep on having these doublings of capabilities every, let's say, four months, it seems on recent trends, where the next model is not merely going to have an hour longer time horizon, but perhaps some multiple of the time horizon of the previous model that's come out.
[11:18]
B
So then explain how that number, that 12 hours is established. So there is some engineering task and you say, okay, this is a task that would require 12 hours. But humans have all different types of talent capabilities. How do you establish that, okay, this was a 12 hour task, this was a six hour task, whatever it is.
[11:38]
F
Yeah. So the simple answer is literally we get humans to sit down and complete the tasks that we give to AIs, in as close to identical conditions as possible. So first we come up with the tasks, and that's a whole kettle of fish. We can talk about exactly how we do that. And then using essentially the same tools that we're about to give the AIs, we take talented humans, not people who have seen this particular type of task before, but people who have relevant expertise. So if it's a software engineering task, they have software engineering expertise; machine learning task, they have machine learning expertise. And then we time them, we see how long it takes for them to complete those tasks successfully. And then roughly we call the difficulty of the task, as measured in human time to complete, the average time it took these humans to complete the task. Then we'll run the AIs on this same set of tasks. Typically today, for the very easiest tasks, they're more or less always going to succeed. There's some mid range of tasks where perhaps they succeed 50% of the time, or perhaps for some tasks in that range they succeed 0% of the time and for others 100% of the time. And so they're getting 50% on average, let's say. And then for the much harder tasks, perhaps they're getting closer to 0%. And then the point at which we predict, in the middle of all these 0% and 100% results by task, that they'd have a 50% chance of succeeding, that is either a 50% chance of succeeding on some task or 50% of the tasks of that difficulty that we think they would succeed on. That's what we're going to call the time horizon of these models.
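The prediction step described above can be illustrated with a small sketch. It assumes a logistic curve in the log of human completion time, which is one common way to turn pass/fail results into a "50% point"; the conversation here does not specify the curve that is actually used. The task results are invented for illustration and are not Meter data, and the names `fit_logistic` and `time_horizon` are hypothetical.

```python
import math

# Hypothetical results: (human minutes to complete, 1 if the model succeeded).
# These numbers are made up for illustration only.
RESULTS = [
    (2, 1), (4, 1), (8, 1), (15, 1), (30, 1), (30, 0),
    (60, 1), (60, 0), (120, 1), (120, 0), (240, 0), (240, 1),
    (480, 0), (480, 0), (960, 0),
]


def fit_logistic(results, steps=20000, lr=0.05):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the mean log-likelihood. Assumed model, not a confirmed method."""
    a, b = 0.0, 0.0
    n = len(results)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for minutes, success in results:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += success - p
            grad_b += (success - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon(a, b, p=0.5):
    """Human minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)


if __name__ == "__main__":
    a, b = fit_logistic(RESULTS)
    print(f"50% time horizon: {time_horizon(a, b, 0.5):.0f} human-minutes")
    print(f"80% time horizon: {time_horizon(a, b, 0.8):.0f} human-minutes")
```

Because success falls as tasks get longer, the fitted slope is negative, so the 80% horizon always comes out shorter than the 50% horizon for the same model.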
[13:06]
A
I think one thing also that could be good to explain here is the task distribution. I mean, this is not all activities that humans do. There's some question of what the tasks are. You know, like Joel mentioned, we're having people come into our office and do the task to get a sense of how long it takes. We're not having them come in and, like, you know, paint paintings or write novels. We're focused here specifically on things that are in the distribution of work that an engineer at, we like to think of it as, like, a frontier AI lab might be doing. So this is things like software engineering. It's fine-tuning AI models. It is, like, software, machine learning, that kind of task.
[13:45]
E
Wait, can I just ask, why did you decide to focus on engineering? Because you could have widened it out to, you know, if we're talking about AI being capable of, you know, taking over the world, there are all sorts of substantive tasks that would fall under that category. So why just do engineering?
[13:59]
A
Yeah, I think that for one thing, maybe other people on the team or maybe Joel has thoughts about this. But I think my particular motive in being interested in the time horizon on software tasks is that, first of all, it's the thing that the industry, even before we started working on this, is already very focused on. So it's one of the capabilities that you should expect to come along for the ride earliest. It's the thing that a lot of optimization pressure is being exerted on. And then I think that it is the thing that you would expect as an early warning sign of this AI R&D automation. So to some extent Meter thinks of itself as trying to build, you know, or advance science that can say: when are we getting to the point that AI systems could improve themselves or speed up the pace of AI development? When will AI research kind of feed on itself? And the core capability for that might be software engineering and machine learning research ability. There are other skills that could be relevant to taking over the world, right? I think other people have done time horizons on, like, cybersecurity since.
[14:56]
E
Yeah but I suppose it is true like the basilisk isn't going to paint its way into like power or something like that. Okay.
[15:02]
B
It might deceive you. It might be very convincing or cunning in some way, and you hand over the keys.
[15:09]
F
I will say, for your mental models, you know, we don't have perfect evidence of this whatsoever. But my rough sense, sort of colloquially, or my prior before evidence comes in, is that if we did study tasks on these very different distributions, not machine learning, not software engineering, I'm not sure about painting exactly, but perhaps other kinds of task distributions that we could enumerate, basically we would see this similarly shaped exponential progress over time, where every, I'm not sure exactly, but let's say four months, six months, something like that, the level of capabilities as measured in time horizon would be doubling at something like that pace, maybe from a much lower level. So, you know, one example that we do have better evidence of is that the AIs today are much less performant at, you know, anything that requires vision capabilities, seeing what's on a screen, clicking around at a computer. But they're getting, you know, tremendously better at that sort of thing over time.
[16:00]
A
I'll just mention quickly, we did actually do a very kind of brief investigation of this on other task distributions. That's on our website somewhere, like cross-domain time horizons. I think we looked at data that Tesla shared on self-driving. I'm forgetting the other tasks. There's, like, OSWorld, maybe. These are somewhat similar, still kind of in the distribution of software tasks, but trying to get further afield into things like vision.
[17:07]
D
Being a small business owner isn't just a career, it's a calling. Chase for Business knows how much heart and effort go into building something of your own. That's why they make business growth their priority. The Chase team takes the time to understand your mission, where you are now and where you want to go. Their broad range of solutions is designed with you in mind so you can bring your ideas to life. From banking to payment acceptance to credit cards, you can conveniently manage all your business finances all in one place with their digital tools. Looking for tips and advice? Their online resources are always available to give you the solutions you need to help your business thrive. See how your business can get stronger and go farther with Chase for Business. Learn more at chase.com/business. Chase for Business. Make more of what's yours. The Chase Mobile app is available for select mobile devices. Message and data rates may apply. JPMorgan Chase Bank, N.A. Member FDIC. Copyright 2026 JPMorgan Chase & Co.
[18:36]
E
How big is the sample size on the humans who are actually doing the work? And also, is it getting harder getting human engineers into the room to compete with Claude Opus 4.6? Say if I was a mediocre engineer, and I'm a non-existent engineer, but if I was a mediocre one, maybe I would feel good about going up against GPT-3 or something. And maybe I would feel a lot worse about myself going up against Claude.
[19:02]
F
Yeah, you know, on these tasks I'm in a pretty similar position myself to you. So we have approximately three human baselines per task, although it varies quite a lot across tasks. So typically we're averaging over something like three. For the final numbers, it's my impression that they're not going to be so sensitive to the particular baselines that we choose.
[19:21]
A
Aren't the longer tasks more weakly baselined?
[19:24]
F
Yeah, so indeed I think it will get a lot harder to baseline these tasks as the length of tasks that AIs are able to successfully complete gets longer and longer. You know, you might think at some point the length of tasks that they can complete is longer than the doubling time. In four months' time they're going to be able to complete tasks of more than four months. And then it becomes perhaps close to impossible to get these four-month-long baselines. Of course we're not at that point yet, but, you know, definitely it has become more difficult to get these baselines as time has gone on. At the moment, not impossible, but very challenging.
[19:54]
E
Joe, these are the future jobs for displaced engineers, right? It's competing against the Claudes for benchmarks.
[20:00]
B
First benchmark evaluation.
[20:02]
E
We found the jobs.
[20:03]
B
So we mentioned at the beginning, the most viral chart in AI is this chart that you have on the front of your website. Your website defaults to this and it shows, you know, this doubling. So if we actually go back to, let's say, November 2025: Gemini 3 Pro, 3 hours and 44 minutes; Claude Opus 4.6, 12 hours. That's at the 50% success benchmark. If we go to the 80% benchmark, which the website doesn't default to, the pace of improvement looks a little less impressive to me. So, okay, now it does not have the same gap. Pretty clearly, 80% is still not 100%, and I know that Meter's goal is about, you know, human safety and all this stuff. But people look at this and they use it as a stand-in for how performant these models are. Even 80%, certainly for any business application (I understand you're not serving business here per se, but probably businesses care about this), may not be good enough. And it does not look as crazy when you look at the 80% chart as it does at the 50% chart. Why the focus on the 50% chart, and why not look at the chart that just does not look as impressive?
[21:28]
F
Yeah, maybe two central things to say. One, to my eyes the 80% chart basically does look as impressive or the doubling time is about the same.
[21:36]
B
This is cope on my part.
[21:39]
A
It's the same increase, an offset of.
[21:42]
F
It's the same pace of progress. It's something like five times smaller than the 50% number, but that only takes you two doublings. And if each doubling takes around four months, that means that in eight months' time you're going to have roughly the same time horizon at an 80% success rate as you do at a 50% success rate today. That's one thing to say. Maybe a second thing to say is, remember at the beginning I said essentially what we're doing is plotting the difficulty of tasks that these AIs can complete over time, just with this particular measure that ends up showing this clean exponential trend. And we've picked a particular number as our difficulty number, and that is this 50% reliability threshold. We could have picked a different one. I think there are reasons for picking the 50% one in particular: it's the one that statistically we're better able to measure, for some technical reasons, and it's the one that shows up in previous literature. There are a couple of other reasons why we go for 50% rather than 80%. Maybe a final thing to say is that this 50% number is sort of equivocating between tasks it's able to complete 50% of the time, and 50% of the tasks it's able to complete 100% of the time and 50% it's able to complete 0% of the time. And actually I think the situation is somewhere in between, but it's a little bit closer to the latter, where there are some tasks that it's completing with near perfect reliability and some tasks in that range that it's completing with very low reliability. And for downstream economic applications, or for applications inside of these major AI companies or something, you might think that that's more favourable in some sense, that there are some of these tasks where we're getting 100% reliability even for very challenging tasks.
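The catch-up arithmetic in this answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Meter's code: the factor of five and the four-month doubling time are the rough figures quoted in the conversation, and `months_to_catch_up` is a hypothetical helper name.

```python
import math

def months_to_catch_up(ratio: float, doubling_months: float) -> float:
    """Months for a horizon to grow by `ratio` under steady exponential growth."""
    return math.log2(ratio) * doubling_months

# Assumed inputs, taken from the rough figures in the conversation.
ratio = 5.0            # 50% horizon is roughly 5x the 80% horizon
doubling_months = 4.0  # approximate doubling time

print(f"doublings needed: {math.log2(ratio):.2f}")
print(f"months needed: {months_to_catch_up(ratio, doubling_months):.1f}")
```

A factor of exactly four would be two doublings, or eight months. A factor of five is a little over two doublings, so closer to nine months at the same pace. Either way the 80% curve is the 50% curve shifted later in time, not a slower trend.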
[23:13]
A
I think two other things. Maybe it could be useful to just explain, when you said that there are technical reasons why it's easiest to measure at 50%: one, it is just the case that 50% is the point at which it is least sensitive, like the distribution is kind of thickest, right? I mean, correct me if this is wrong, but to resolve something like 95%, you would need way more samples to be able to resolve that level of precision.
[23:40]
F
I think there are some caveats to that picture, but let's say even more extreme. You know, let's say that we cared about 99%. In that case, if we had 1% label noise, quote unquote, if sometimes we were accidentally grading some of the failing tasks as passing and some of the passing tasks as failing, then we'd just never be able to estimate that reliably. Right.
[23:59]
B
Okay.
[24:00]
F
And at 50%, this comes a little bit closer to washing out.
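A small Python sketch of the label-noise point, assuming the simplest possible model: every pass/fail verdict is flipped independently with the same probability. The function name `observed_pass_rate` and the symmetric-noise model are assumptions made for illustration, not a description of Meter's grading pipeline.

```python
def observed_pass_rate(true_rate: float, flip_rate: float) -> float:
    """Pass rate a grader reports when each verdict is flipped with probability flip_rate."""
    # True passes that survive grading, plus true failures wrongly marked as passes.
    return true_rate * (1 - flip_rate) + (1 - true_rate) * flip_rate

flip_rate = 0.01  # assumed 1% symmetric label noise
for true_rate in (0.50, 0.80, 0.99):
    seen = observed_pass_rate(true_rate, flip_rate)
    print(f"true {true_rate:.2f} -> observed {seen:.4f} (bias {seen - true_rate:+.4f})")
```

Under this model the bias is `flip_rate * (1 - 2 * true_rate)`. It vanishes at a true rate of 50%, where wrongly passed and wrongly failed tasks cancel. At a true rate of 99% the observed rate drops to about 98%, so the measured failure rate is roughly double the real one and the noise is as large as the quantity being measured.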
[24:03]
A
And then I think one other intuition here is that if you give me a task and you give me the model, and all you tell me is the length of time that it takes a human to do the task, the 50% time horizon is the threshold below which I think it is more likely that the model will be able to do the task than that it can't. And I just find that intuitive.
[24:24]
F
Yeah.
[24:25]
E
How much interest do you get on these charts from potential investors specifically? And the reason I ask is because I was just messing around and googling some stuff. And when the latest Opus chart came up, someone posted it on Reddit and I think the second comment on it was someone going, how do I invest in OpenAI? And people were trying to club together to invest in these companies. So clearly there are people out there who are using these charts as investment tools.
[24:53]
A
I would say, you know, we don't get an enormous amount of inbound from investment firms. I mean, sometimes, you know, VCs or whatever, we're based in the Bay Area, will reach out to us. I think that there's some kind of principle of, our goal is to inform the public and give them the best evidence that we can about when we might get to this point of AI being, you know, fully autonomous or able to improve itself. And there's some principle at play here of, I kind of want to enable people to do whatever they will do with that information. And I think that we don't engage a ton in the business side or investment implications of the work. One thought experiment I sometimes say to myself is, if I do believe that at some point we're going to get this AI that's improving itself, and where AI research is automated and you have all these fears about a singularity, would I rather that all of Wall Street falsely didn't think that was coming when I believed it was coming, or would I want them all to know that it was coming, given that I believe it's coming? And maybe this is more a personal view, but if it is possible that we will automate AI research, I think all of humanity being aware of where we're heading is sort of a precondition for us all being able to figure out what to do about it. And so I don't want certain people or one side or one team to selectively be in the dark because they might invest on the basis of this or something like that. But, you know, it's not where we put our time. We're focused on informing the public. The public includes some investors.
[26:26]
E
So on that note, like, what is the actual level at which we're all presumably supposed to panic, or at which, like, if you're a policymaker, you would start to get worried about AI being able to automate and improve on itself in a way that eventually becomes detrimental to humanity?
[26:41]
F
I don't know exactly what the level is on this time horizon measure. I think one thing to say is we have made real progress on the science of measuring these AI systems and how capable they are. But I think there's a long way to go. And in an important sense, I think we're behind on this task. We're measuring some underlying technical trend. And at some point, I do think that implies greater risks of astonishing things happening. Although Chris can speak more to other arguments that we might back out to for why, even if AIs are very capable, we still might not see catastrophic dangers emerge in the short term. Yeah, I'm unsure.
[27:15]
B
You know, I think part of the reason why the AGI chatter has really picked up, particularly in the wake of everyone using Claude Code, is it's very easy to imagine. You're sitting there, it's like, yeah, do this, do this. It's like, I don't even need to be here. Right. I think you sort of get a very intuitive feel for how the human can come out of the loop. What happens today? Because I'm sure this has been tried: if you go to ChatGPT and you say, here, you have Claude Code access, go build something. What actually happens today when AI is working with AI?
[27:50]
F
Yeah. My sense is that at some point, you know, further away points than would have been true some time ago, the AIs will more or less fall on their faces, that, you know, there are some things they're not so capable of today.
[28:02]
B
Like collaborative hallucinations will just like. It'll just like devolve into terrible.
[28:07]
F
Yeah, I think it can go all sorts of ways. At some point they're going to need to rely on external resources. And today they're not as capable at managing these external resources effectively. I think they're less capable at ideation and self-awareness about where they are in the problem today than they are at these kind of raw software engineering skills. As you mentioned, the way in which AIs are autonomous today, or close to autonomous today, is that the human has the idea and then submits that idea to Claude Code or Codex or one of these other agentic AI tools, and then they handle the software engineering components, and possibly there's still some intervention after that. I do imagine that the circle of autonomy or something gets larger over time. There's no fundamental barrier, it seems to me, to the AIs having those ideas and moving to a greater level of abstraction. But if we were purely relying today on these fully autonomous capabilities, you know, could you manage research departments, any department of your choice inside of a major AI company? My guess is probably not.
[29:07]
E
Actually on this note, this reminds me of something I wanted to ask. So when you look at the domain-specific time horizon charts, the ones that show, I think you call them task suites or something like that, I guess productivity by a specific job, you see these different lines. Sometimes you see almost horizontal lines and sometimes you see squiggly or steeper lines. What is actually happening there? How are we supposed to interpret that? Is this a measurement problem, or is it saying something very fundamental about what AI can and can't do under current conditions?
[29:42]
A
The thing that I think would be good for Joel to explain is that I think that there is a distinction here between will AI, like the time horizon chart doesn't by itself I think tell you, will productivity in one specific kind of job increase because of access to AI?
[29:57]
F
Yeah, maybe. One thing to say on that chart showing the time horizon on these different task distributions: relative to my guesses ahead of time, I think those time horizons are remarkably similar. I think the doubling times, the pace of progress in AI, seem more similar than I would have guessed to the original trend that we published, although imperfectly so. On this difficulty of translating what we might call raw AI capabilities, in some sense capabilities on benchmarks or something, to real-world productivity: I think there are a number of differences, and a number of ways in particular in which the benchmark results are overestimating what we might see in the wild. Not hugely overestimating. I think we do see that people are getting real utility out of these modern agentic AI tools, but overestimating to some extent. One is that the scoring implicitly is different. In real problems, I'm scoring based on something a bit more holistic than these algorithmic scoring procedures, these automatic scoring procedures that we're using at Meter and many other people are using in the benchmark world. There's some notion of code quality if you're working in software engineering, but for other tasks...
[31:03]
E
There's beautiful code, elegant code, people always talk about.
[31:06]
F
Yeah, yeah, yeah. For other tasks there's going to be...
[31:09]
E
animal mentor was coding. This is what it would look like.
[31:12]
F
One more thing is that the tasks that come up in the wild are more likely to be messy in some sense. They involve working with other people, they involve working in much larger code bases or on more open-ended problems, maybe with something even adversarial going on. In the software engineering context, that might be that someone's trying to make a change to the part of the code base that you're currently working on and you need to work around that. And we do tend to see that the AIs are less capable at working on these more messy problems. I don't want to overstate that, it's not an enormous effect, but that's one thing that gets in the way of these productivity increases. And then I do think there's something to the reliability question, right, where if it was true that for a certain type of task you only had, you know, 80% reliability, then every time you're going to need to go back and verify the work of these AIs, and verify it without the context of how they implemented the solution. Whereas if you went about the task yourself, you'd already have that in your head, and so this verification step, quote unquote, would take less time. I don't expect these frictions to be so fundamental in some sense, or I imagine they go up levels of abstraction. I think not only is the underlying technical progress real, but the productivity improvements are also going to show up increasingly. But yeah, there are these frictions.
[32:27]
B
Tracy alluded to this question when she asked about VCs and investor interest. So people see these charts and, regardless of what Meter's point is, they're like, this is incredible, I've got to invest in this. But this brings me to this broader thing that I find very strange about AI, which is this kind of odd Baptist-and-bootlegger relationship between the AI labs, the people who are building this stuff, and the alignment and safety people, and they sort of go back and forth. You have the heads of the labs saying, yes, this might destroy the world and take all your jobs. And the safety people and the alignment people say, yes, this might destroy the world. It's a very strange industry, right? The only thing that I can think of is cigarettes, where they warn you that smoking is bad, except they had to do that because they lost a lawsuit. I don't think they were particularly inclined to do that. I can't think of any other industry where the most enthusiastic people about it are also warning and dooming about how bad the thing they're building could be. So I'm sort of curious, first of all, and I talked about this in the intro, who is the type of person that's working at Meter, that is skilled enough to do advanced evaluations, and where's the funding coming from? Talk to us about who's behind Meter and why they're there.
[33:47]
A
Yeah, totally. So I think one thing to say on the history of people caring about AI safety in the Bay Area is that this concern goes back quite a ways, I'd say over a decade. There are many people who got into the field because they saw this trend of deep learning. Like, what if deep learning works and it goes all the way to artificial general intelligence and then superintelligence? And if that works, then it could affect everything. I think possibly when people worry about this, there's a future that they have in mind with superintelligence that's even more capable than what people who think of themselves as AGI-pilled today think of. They're imagining AI systems that can run, you know, the entire economy. And of the people who many years ago saw that vision and were alarmed about the stakes of it, many had this intuition that the thing to do is go and work in the industry. Because, you know, what's the best way to shape the future? It's to build it. And obviously you could have questions about how sincere that is for many of the people who are in the industry, or if there's a mix of different motivations and, you know, different wolves inside of them. Or maybe they partially are motivated by that, but also there's kind of this Oppenheimer thing, like, it feels good to feel like you're in the position of making something that's dangerous, maybe.
[35:02]
B
Someone once described OpenAI to me. This was years ago. A friend said it was like OpenAI was sort of like the Manhattan Project, except the goal was to not build the bomb at the very end, if that makes any sense. So to your Oppenheimer point, it's like very strange.
[35:17]
A
And I think one thing to emphasize is, you know, while it could be that there's a mix of motivations now, there are definitely many people, I think, in the Bay Area who sincerely believe that the technology is headed to someplace where it will be very difficult for humanity to stay in the driver's seat, or stay in control in a meaningful sense.
[35:37]
B
It does seem as though, like people talk about, oh, the big AI labs have like a PR problem or something like that. They keep bringing this up and it's like maybe they just believe it.
[35:48]
A
So I think that this concern is quite old, and I think many people have this intuition that they're like, I can influence the thing by building it. But now there's this problem that that logic kind of always recommends that you continue building more advanced technology, or more advanced AI systems. And now you have this problem: there's all of these companies and they all say that they need to build it because if they don't build it, another company will. And they could all have doubts about each other's commitment to safety or to these principles. Famously, the leaders of the labs really do not get along. They're not friends. It's not easy for them to sort out the safety thing among themselves. And then even if all of the US AI labs agreed to do that, they then have this external bogeyman of China, right? What will the Chinese companies do? And so there's this sense in which, even if the concern is real, I think a lot of people who are in the industry then have the instinct that there's no guiding principle for what they should do on safety other than to build leverage for themselves for later. And I think that is a concerning state of affairs for AI development to be in globally. You know, obviously we're trying to do something different by informing the public. One gap that exists right now in that picture is that it's the people building the technology who most believe that it's going to be destabilizing and sort of all-encompassing.
Maybe if the public and governments were all on the same page and believed the same thing, if it were true that it was headed there, then there would be more time for society to figure out a response from people who are not trying to build leverage over the technology themselves directly, or, you know, to control the technology via some kind of public action or government.
[37:30]
E
Can I just ask very quickly, since you brought up China and I don't want to forget to ask this question: Qwen doesn't show up on your main charts. I think you did a preliminary assessment of it a while ago, but what's the difference between assessing one of the closed models in America versus one of the open source models over in China?
[37:48]
F
I think one thing to say is that the capabilities are lagging behind. We think that they're lagging behind. I'm not sure.
[37:54]
E
So they're still irrelevant. They just like don't make it onto the chart.
[37:58]
F
So we do try to prioritize, just because Meter has limited resources, staff time in particular, the models that we anticipate being on the frontier. And in general the Chinese models have been something like nine to 12 months, let's say, behind the U.S. models. And I think the gap by time horizon is probably even larger than the gap by benchmark scores. I'm not sure how scientific I can make this, but there's some colloquial sense or something that Chinese models are stronger according to benchmark scores than they would be on truly held-out problems in some sense.
[38:31]
B
Problems like gaming the benchmark.
[38:33]
E
Is that what that means?
[38:35]
F
I'm not sure technically exactly how that shakes out, but something spiritually close to that. I'm not sure that's true for all Chinese models. I'm sure it's true for lots of models outside of China, but I think that's at least one possibility.
[41:06]
E
I'm very curious, when you talk to external actors in all of this, and I'm going to group them into, I guess, policymakers, investors and the labs themselves: who are you interacting with the most at the moment?
[41:20]
A
I think that in practice we end up interacting a lot with AI labs because there's some amount of sorting out, getting access to models, working with them to set new precedents on things related to third-party red teaming and third-party risk assessment. We think of our audience as being sort of high-context members of the public. So the kind of people, you know, who are maybe like you two, right? People who are kind of like...
[41:44]
E
People listening to this podcast.
[41:45]
A
Yeah, I guess, yeah. People listening to this podcast. People who have to make important decisions that will be informed by the pace of AI progress or the profile of AI capabilities overall. Because we're based in the Bay Area, I think we disproportionately end up interacting with people who are building the technology and are closer to it. Partially, back to Joe's point before, I think this is because to care about a lot of these frontier problems, you're kind of selecting for people who are building the technology themselves. There's some sense in which the companies in the industry spend more time thinking today about frontier capabilities assessment than the government does. Yeah, I think one day you could imagine us getting to the point where the government is very focused on this and dedicating a lot of resources to it. And at that point I would expect Meter to be spending more time talking to governments than we are today.
[42:35]
E
That's kind of what I was getting at, because our sense is, in a lot of the conversations, we talk to people and they'll say something like, oh, it's important to have a social safety net for an AI-enabled future. But no one seems to be really thinking about it in a lot of detail.
[42:47]
B
And when you say, you know, it's easy to imagine that maybe the government will care more about this: not so easy for me to imagine. It seems like they mostly care about, you know, data centers and where they're located and stuff like that. It would be nice if we had policymakers really looking at frontier capabilities and stuff. Still seems kind of a way off. But it is interesting. You're talking about the sort of capitalist dynamic, right? There's competition, and so you have a lot of people that are really worried about, oh, what if the other guys get to ASI or AGI first? Or what if the Chinese, etc. How much does the fact of free market capitalism and the demand, you know, the big investors at the VC funds, they want a return, they want an IPO. We might get some big AI IPOs this year, in fact. How much do you find that to be perhaps in tension with the safety element?
[43:40]
A
Yeah, maybe people on our team would have different views on this. I personally don't feel there's... Yeah, there's something here of, investors are key decision makers and, you know, they're people too. That sounds strange to say, investors are people too. I sound like Mitt Romney or something. But I think that the element of this that feels like it could be in tension is if you build a bunch of financial obligations to keep the pedal to the metal no matter what the risks are going into the future. So one thing I think a lot about is, if you're building up a huge amount of debt to build data centers, and then say that you do find evidence that you're now worried about, you know, control of AI systems, you do find instances of AI systems going rogue, do you now have a financial commitment to build out those data centers and continue the pace of progress? I think that is one place where I feel the tension pretty acutely. You're building these expectations into the market that could force you to continue development when you otherwise would rather invest more in safety. Or, yeah, it at least gives you a kind of financial obligation to continue scaling, at least compute. The people themselves being informed about the progress does not seem bad to me. I think it's good in some ways for everyone to be on the same page about capabilities that could be related to subverting human control later on. But in the world beyond the information that Meter shares, I do think there is a tension. The fact that private companies are building this could cause really acute tensions in the future, where, yeah, people make these commitments that they wouldn't if they were trying to slow, or, you know, maximize social resilience to, the technology.
[45:25]
F
Yeah, I'm not sure how these things shake out, but I think there are some forces on the other side, right. You know, some safety-promoting technologies, quote unquote, or techniques do make the models more useful, if they're better at complying with your will in some sense. And so you have standard capitalist incentives to invest in that kind of research. Maybe that doesn't cover the broad suite of safety research that seems important. It certainly doesn't rule out capabilities progress as being an important axis on which you do want to scale. But I think there are some forces in each direction.
[46:00]
E
Since you mentioned compute just then, can you talk a little bit more about, I guess, the relationship between the time horizon improvements and the cost of compute at the moment and what you've actually seen and how that impacts it.
[46:11]
F
Yeah. So one extraordinary fact from my perspective, and I'm not sure how to fit these facts together, is that the R&D spend on compute of these companies has risen exponentially, of course. And in fact it's risen exponentially at essentially the same rate as time horizon progress. I think there's nothing necessary about that. It doesn't mean by itself that if compute progress slows, then capabilities progress will also slow. But it's clearly an important input into AI progress. I expect that to continue to be true in future. Sometimes people ask us how plausible we think it is that this exponential capabilities progress might slow down at some point in the future. And one reason it's hard for me to consider it plausible that it will slow down in at least the next small number of years is that a lot of those compute R&D investments are basically already baked in. Right? The data centers have already been built. Plans for data centers even beyond 2028 are presumably coming to fruition. And so some of these input investments are already baked in in some sense. So it would be surprising to see capabilities slow, to the extent that compute has been an important input. After that, maybe you need to think about other arguments for how capabilities might slow, but that's roughly how I think about it.
[47:29]
B
There's a very good, or interesting, critical Substack post called "Against the Meter Graph" by someone named Nathan Witkin, who brings up an interesting point that I wouldn't have thought of had I not read it, which is that you're paying the software engineers to come in and perform these tasks, right? You know, maybe this will be the last job of humans, just doing benchmarks. If I were a good software engineer and you say, Joe, come in and do this task, how do you prevent me from going, oh man, this is taking me a long time? Meanwhile, I keep getting $100 an hour for looking at my computer. This is tough. I'm going to have to come back tomorrow and keep working on this. How do you avoid the conflict of interest where the person who's paid to work on this problem may be encouraged to take as long as possible to solve it? And with only three people working on it at times, I don't know, it seems like a conflict of interest to me.
[48:24]
F
Yeah. So the short answer is, in general, we are incentivizing these people to complete the task as soon as possible, in particular to complete the task faster than their peers who are attempting the same task. The time that it will take for them to complete the task...
[48:37]
E
Is there a bonus if they do it faster than.
[48:39]
F
Yeah, yeah, approximately. There's a bonus if they complete it faster, and faster than anyone else. Another thing to say is I think it just is true that our baselining methodology, or the ways in which we compare to humans, in some ways leaves a lot to be desired. Ideally we would have invested 100 times as many resources in having 100 human baselines per task, and those would have come from perhaps the very best software engineers or machine learning engineers in the world. Maybe that would be the comparison that we're making. And indeed we'd be doing all of this procedure over many more tasks, and not just many more tasks, but many more tasks over wider task distributions than just software engineering or machine learning engineering. I mean, I do think Time Horizon still represents progress over what's come before in the science of measuring AI capabilities. But in some ways, I'm sympathetic to a lot of criticisms of Time Horizon. I do think that some of the details, at least for the work we've done so far, aren't going to matter as much as you might naively think. So choosing the shortest baseline time that we end up observing, or the longest time, is actually not going to make that much difference to the final measurements. Of course, we do think these people are talented software engineers or cybersecurity people or so on, depending on the task. But perhaps we could have found even more talented people, and they would have completed it in half the time. And so naively, it would seem like the time horizon that we estimate for these models would be half as long as what we actually end up observing. But of course that wouldn't change the doubling time. It would mean you'd get to the same level after another four months, in some sense.
The big picture that I want Time Horizon to point to is less this, like, "Opus 4.6 is 12 hours" in particular, and more that we're seeing this remarkable pace of progress that shows no signs of slowing in the recent past and, I think, in the near future as well. In fact, it shows some signs of speeding up.
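The point about baselines can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a clean exponential trend with a four-month doubling time, borrows the 12-hour figure from the remark above as a starting value, and uses a hypothetical 96-hour target. It shows that halving the measured horizon (as faster human baseliners would) leaves the doubling time untouched and only pushes any given level back by one doubling period.

```python
import math


def months_until(target_hours, current_hours, doubling_months):
    """Months for an exponential trend to grow from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)


# All three values are assumptions for illustration, not measurements.
DOUBLING_MONTHS = 4.0   # the roughly four-month doubling time discussed here
MEASURED_HOURS = 12.0   # illustrative horizon with the existing baseliners
TARGET_HOURS = 96.0     # hypothetical level we want the trend to reach

# Horizon as measured against the existing human baselines.
wait_as_measured = months_until(TARGET_HOURS, MEASURED_HOURS, DOUBLING_MONTHS)

# Hypothetical: baseliners twice as fast, so every horizon estimate is halved.
wait_with_faster_humans = months_until(
    TARGET_HOURS, MEASURED_HOURS / 2, DOUBLING_MONTHS
)

# The gap between the two scenarios is exactly one doubling period.
delay_months = wait_with_faster_humans - wait_as_measured
print(wait_as_measured, wait_with_faster_humans, delay_months)
```

Under these assumptions the slope of the trend is the same in both scenarios; only the intercept moves, which is why the doubling time is the more robust quantity.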
[50:30]
E
Well, I was going to ask about this because I think recently the statistic that you would always hear was a doubling every seven months, something like that. How fast do you see it going in the near future?
[50:42]
F
Yeah, so I was a doubling-every-seven-months person. There was controversy in our team about what to believe here, because when we originally published this work approximately a year ago, if you plotted a single straight line, a single exponential, you'd get something like six or seven months, let's say. But if you restricted to just the time since, I think, GPT-4o, from the 2024 models onwards, you'd see something closer to this sort of four or five month trend. And some people believed in that, and some people, like me, had the intuition that, well, we have so few data points, we should really be estimating over this larger number of data points, and the larger number of data points says every six or seven months. There are a couple of things that have changed my mind and made me realise my colleagues were right since then. One is: for the models that have come out since, what trend has better predicted how performant those models would be? And it's very clear that the answer to that is the four-month doubling time and not this seven-month doubling time. There's some possibility that could speed up again. We've seen it speed up once. I think there are some reasons in principle why you might expect it to speed up again. I think there are some caveats about this. These are maybe some takes that my colleagues wouldn't agree with. And so maybe you should discard that, or you should think that they're going to convince me in the way that they did with the four-month versus seven-month doubling times. I have some suspicion that the tasks that Meter is measuring performance on are in some sense more and more a narrow slice of possible tasks, and in particular a more and more narrow slice that is perhaps similar to the kinds of tasks that you'd expect these major AI companies to be training on in the first instance.
And so in some sense we're increasingly, more so than was the case before, measuring progress on the exact types of tasks that they're trying to get better at. You might think, for instance, of the kinds of tasks that would make for good reinforcement learning environments, the kinds of tasks that you can score quickly and cheaply and automatically. I think that progress is real. I think that progress generalizes to some extent to other types of tasks. I think we're seeing remarkable progress in these more messy tasks.
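The disagreement described above, one exponential through every data point versus a fit restricted to recent models, can be sketched as two log-linear regressions. The example below is a minimal illustration, not Meter's method: the data are synthetic and noise-free (a seven-month doubling that switches to a four-month doubling at month 36), and every name in it is hypothetical. It holds out the latest point and asks which fit predicts it better, which mirrors the test Joel describes.

```python
import math


def fit_log2_line(points):
    """Ordinary least squares of log2(horizon) on month; returns (slope, intercept)."""
    xs = [month for month, _ in points]
    ys = [math.log2(hours) for _, hours in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def predict_hours(fit, month):
    """Extrapolate a fitted log2 line to a given month."""
    slope, intercept = fit
    return 2.0 ** (slope * month + intercept)


def synthetic_series():
    """SYNTHETIC points, not real measurements: one every 6 months for 5 years."""
    points, log2_hours = [], 0.0
    for month in range(0, 61, 6):
        points.append((month, 2.0 ** log2_hours))
        doublings_per_month = 1 / 7 if month < 36 else 1 / 4
        log2_hours += 6 * doublings_per_month
    return points


points = synthetic_series()
train, held_out = points[:-1], points[-1]   # hold out the newest "model"

full_fit = fit_log2_line(train)                                   # all points
recent_fit = fit_log2_line([p for p in train if p[0] >= 36])      # recent only

full_doubling = 1 / full_fit[0]       # months per doubling, full history
recent_doubling = 1 / recent_fit[0]   # months per doubling, recent history

# Prediction error for the held-out point, measured in doublings.
full_error = abs(math.log2(predict_hours(full_fit, held_out[0]) / held_out[1]))
recent_error = abs(math.log2(predict_hours(recent_fit, held_out[0]) / held_out[1]))
print(full_doubling, recent_doubling, full_error, recent_error)
```

With real data the picture is noisier, and a recent-only fit rests on fewer points, which is the concern Joel says he originally had. The held-out comparison is one simple way to let later models arbitrate between the two trends.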
[52:49]
B
I have one last question, which is: how big is your team, and your funding? And also, how many people at Meter are basically really rich from AI and they're like, you know what, I'm good, I don't need to stick around for the IPO or whatever, I'm set, and now I want to work on something that lets humanity know? I've seen there are other independent AI researchers and they talk about this. It's like, I want to be able to talk about what I saw. Miles Brundage, someone who has like a little think tank, he's talked about this. How many people are, like, rich already, and they're like, okay, now I want to work for something that's public facing?
[53:27]
A
Yeah. So Meter right now is about 30 people, although we're growing and hoping to grow fast. We are hiring, I should say: metr.org/careers. And yeah, you were touching before on kind of the question of, is it difficult to be a nonprofit? You know, we can't pay people in equity.
[53:41]
B
Right. No one's going to get an IPO, right?
[53:44]
A
Yeah, there's no IPO or anything for Meter. But we do try to pay competitively on cash compensation, right? So that's an area where we feel we can somewhat compete with labs. And it's true that I think a lot of our team is just motivated by trying to kind of do something different. You know, all the companies to some extent are in this business of building somewhat redundant products, kind of competing for the same role in the world. And Meter is in a really unique position at the moment, where I think that we have access and the ability to communicate these ideas and explain the state of AI research to a lot of audiences that might be hard to reach for individual researchers inside of a company. Like, we get to talk to a lot of governments directly. We get to come here and talk with you all. And that's kind of different. I think if you look at all of the actors that are working on the frontier of AI research or AI safety, if you compare us to AI lab staff, we get to kind of every day work on whatever research we think will be most informative to the, like, public decision.
[54:45]
B
And you have ex, not xAI, but ex as in former AI lab staff, where maybe there was a tender at some point, and now they work at Meter?
[54:54]
A
Yeah, we do have those. So we do have some people who previously worked at AI labs. As time goes on, one hope that I have is that there will be more and more researchers who have kind of made the money that they need from working in the industry, and are now excited about kind of lifting all boats by working inside of an organization where the North Star can be what is most informative to the rest of the world outside of this relatively small set of companies.
[55:22]
F
Chris is very polite. I think that's wonderful. I'm tempted to be a little bit more aggressive in this conversation. I think we have spoken through Meter's work on some of the most important problems in the world, problems that are going to define the future, I think, for not just the next years but, you know, coming decades, maybe even coming centuries. And we've also spoken about some of the ways in which Meter's work is not what you might want it to be, that there's a long way to go in the science of evaluating these AIs. Why have we not made more progress? You know, maybe a couple of reasons. I think clearly the central reason is that we are bottlenecked on technical talent, on incredibly capable people to come work on these questions. I was on a Meter work retreat recently where we were brainstorming 20, 30 of these, what seemed like world-important problems, problems that we think no one else is going to get to if we do not get to them. And we are able to conduct research on how many of those problems? I think it's one, two, maybe if we do an extraordinary job this quarter, it might be three. As Chris alludes to, I think if you're interested less in working on redundant products at these major AI companies and more in advancing our understanding on some of the most important questions in the world that are going to shake the world for years to come, Meter is a great place to go.
[56:38]
A
Yeah. One more thing to say about that is, like, the vibe inside of Meter is a state of triage, right? And I think externally people might guess, oh, you know, Meter is outside of any of the AI labs, so the thing it might most struggle with is things like access to AI models. You know, you can't do the research you want because you're not building the thing yourself. That's the story that people always tell: you have to build, you know, the future to shape it. In practice, I think our experience at Meter is that when we want to try new types of research that would require new kinds of structured access, AI labs are pretty game to play ball on that. And the thing that is more happening is that we're having to turn down opportunities to do stuff like that because we don't have the staff that we need to make those things happen.
[57:20]
E
Interesting.
[57:21]
B
Joel and Chris, thank you so much for coming on Odd Lots. Absolutely fascinating conversation, and I appreciate your taking your time. Great to have you in studio.
[57:28]
F
Yeah. Thank you so much, so much for having us.
[57:43]
B
That was a really interesting conversation. Starting from the end, sort of the idea of like, okay, here are some really important questions, like let's just set everything aside, and there's 30
[57:53]
E
people working on this.
[57:53]
B
Yeah, there's, you know, and like how many people want to do it? It's like, okay, we try to match cash comp, et cetera. That seems like kind of a tricky issue if you accept the premise that these are some big questions we have to get right and we gotta land this plane, hopefully. That's a bit of an issue.
[58:11]
E
Yeah. The other thing I thought was really interesting was the Chinese models not really making it on the charts, even though we know in the market itself, when DeepSeek, when that new version came out, that was like this huge thing where everyone started to panic. And then to not see it even, like, land on the Time Horizon chart, it's kind of interesting.
[58:31]
B
I mean, I guess I buy the reasoning from their perspective that the only interesting question from Meter's perspective is, like, the most cutting edge, which may be slightly adjacent to the most interesting chart for, like, business, right? So it's like, okay, we know that DeepSeek and Qwen and Kimi and all those are very impressive. Do they push the very frontier? Perhaps not. But just in general I find this space so weird, because it's like here you have these people who are clearly quite alarmed at the potential here. And most people, I think, look at these charts and they say, like, wow, I want to invest in this. Or, this is exciting.
[59:11]
E
I know, like that's why my first question was like you're here for AI safety purposes. But everyone seems to get excited about the line go up chart, right? Like there's a disconnect.
[59:20]
B
They're all connected.
[59:21]
E
Like I say, when an industry basically says it's worried about itself, yeah, you should pay attention.
[59:28]
B
It's really strange. This gets back to, you know, it's very strange where you have the CEOs of these companies who are in many cases the most alarmist, and there is this sort of cynical thing. And I don't totally discount the cynical interpretations, like, oh, they're saying this because they want to get investors and so forth and they need all this money. But look, it was also true that OpenAI and Anthropic, but OpenAI a little more, were founded with these very exotic corporate structures of, like, a private company owned by a nonprofit, et cetera, which they presumably did because they took pretty seriously the fact that this technology and science was very strange, and it's not just enterprise software.
[60:09]
E
Right. Like they were self limiting in a way.
[60:11]
B
One other interesting thing too, this idea is like, okay, first of all, what's the difference between a seven-month and a four-month doubling time? Not much. You know, it's like these people are like,
[60:21]
E
oh, I think, yeah, but it's exponential, isn't it?
[60:23]
B
I guess it's exponential, but it's still funny to me. It's like, oh, I think AI is going to destroy all white collar work in two years. And someone else is like, no, no, no, I think it's gonna be three years. As if that makes any difference whatsoever. But one thing to consider, Joel sort of alluded to this: you know, you had OpenAI shutting down its, like, video efforts, et cetera. So perhaps part of the story is just this intense focus now on the software engineering side as what these labs are working on.
[60:49]
E
Yeah.
[60:50]
B
And sort of like all these other side quests are not as important. So maybe we will see even more rapid progress on some of these technical benchmarks, because clearly from the labs' perspective, that's where the action is, more than some of these consumer things like making images or videos.
[61:07]
A
Yep.
[61:07]
E
All right, shall we leave it there?
[61:08]
B
Let's leave it there.
[61:09]
E
Okay. This has been another episode of the Odd Lots podcast. I'm Tracy Alloway. You can follow me at Tracy Alloway.
[61:14]
B
And I'm Joe Weisenthal. You can follow me at The Stalwart. Follow our guests Chris Painter, he's at Chris Painter Yup, and Joel Becker, he's at Joel Bkr. Follow our producers Carmen Rodriguez at Carmen Armen, Dashiell Bennett at Dashbot, Kail Brooks at Kail Brooks and Kevin Lozano at Kevin Lloyd Lozano. And for more Odd Lots content go to bloomberg.com/oddlots, where we have a daily newsletter and all of our episodes, and you can chat about all these topics 24/7 in our Discord, discord.gg/oddlots
[61:45]
E
and if you enjoy Odd Lots, if you like these AI episodes then please leave us a positive review on your favorite podcast platform. And remember, if you are a Bloomberg subscriber, you can listen to all of our episodes absolutely ad free. All you need to do is find the Bloomberg channel on Apple Podcasts and follow the instructions there. Thanks for listening.