Podcast Summary: "The End of GPU Scaling? Compute & The Agent Era — Tim Dettmers (Ai2) & Dan Fu (Together AI)"
The MAD Podcast with Matt Turck
January 22, 2026
Episode Overview
In this lively and deeply technical episode, host Matt Turck brings together Tim Dettmers (Assistant Professor at Carnegie Mellon, Research Scientist at AI2) and Dan Fu (VP of Kernels at Together AI, Assistant Professor at UC San Diego) for a "reality check" discussion on hardware bottlenecks, AGI definitions, the future of compute, and the explosive practical rise of AI agents.
The core tension: Dettmers believes we're rapidly approaching the limits of GPU scaling and thus a plateau in AI progress, while Fu takes the more optimistic view that vast untapped potential remains even in current and next-gen hardware, with models only now beginning to leverage the infrastructure already available.
They also break down the rise of agents, their impact in coding and other workflows, and the practical skills both technical and non-technical users need to thrive in the "agent era".
Key Discussion Points & Insights
1. Guest Backgrounds & Expertise
- Tim Dettmers: Specializes in efficient deep learning, quantization, and coding agents. Noted for reducing memory usage while maintaining performance.
  "My past research has been mostly on efficient deep learning quantization ... use up to 16 times less memory than if you have dense ... now I'm working on coding agents." (01:34)
- Dan Fu: Focuses on accelerating language models at the kernel/GPU level; co-author of FlashAttention; works on both kernel optimization and alternative model architectures, with a recent focus on deploying and accelerating models on the latest Nvidia hardware.
  "In industry, I focus a lot on basically making models go fast ... GPU kernels are the things that actually translate the models to how they run on the GPU." (02:25)
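Dettmers' "16 times less memory" refers to his research on efficient quantization. As a purely illustrative sketch of the basic idea (generic absmax int8 quantization, not his exact method), float32 weights can be stored as int8 codes plus one scale per tensor for roughly a 4x memory reduction, with larger savings from lower bit widths or sparsity:

```python
import numpy as np

# Hedged sketch: generic absmax int8 quantization (an illustration of the
# idea behind memory-efficient weights, NOT Dettmers' specific method).
# float32 weights (4 bytes each) become int8 codes (1 byte each) plus a
# single float scale per tensor.

def quantize_absmax(w):
    """Map float weights onto the int8 range [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_absmax(w)
err = float(np.abs(w - dequantize(q, s)).max())
print(q.nbytes, w.nbytes, err)  # 4 bytes vs 16 bytes; small round-off error
```

The trade-off is a small, bounded round-off error (at most half a quantization step) in exchange for the memory savings.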
2. Defining AGI & The State of AI Today
- Dan Fu: Suggests that by many older definitions, we're at or extremely close to AGI.
  "By almost any definition anyone could have written down, let's say five years ago or 10 years ago ... we basically have the vision of AGI that we had back then." (00:02, 03:55)
- Tim Dettmers: Argues the term "AGI" is ambiguous and overhyped; prefers an economic definition: usefulness and the ability to trigger an industrial revolution (06:00).
  "We don't think carefully about the definition ... what I think makes sense is this economic angle. Can we get another industrial revolution?" (05:33–06:55)
3. AGI Hype vs. Computational Reality
- Roots of AGI Narratives: Dettmers traces their origins to "effective altruism" and rationalist circles, warning against "lazy thinking" and unexamined extrapolations.
  "There's always like, 'oh, we get AGI in two years' ... a little bit of being in a bubble ... not being exposed to different ideas." (07:31)
- Hard Physical Constraints: Dettmers details how exponential progress always meets diminishing returns. Core physical structure (latency, memory movement) imposes hard ceilings on how fast, cheap, or effective GPU computing can get.
  "Everything that grows exponential will level off. Because if you need resources, the resources will be exhausted." (11:40)
- The End of GPU Scaling:
  "GPUs will no longer improve, meaningfully. We have essentially seen the last generation of significant GPU improvements ... maxed out on the additional features ... that's the end of it." (11:26–16:12)
4. The Optimistic Case: Hardware and Software Underutilization
- Dan Fu: Counters that most current models woefully under-leverage even existing hardware (low chip utilization rates). There's easily "100x more compute" in the short-term pipeline, with ever-bigger clusters and more efficient training/inference methods emerging.
  "If you look at where the systems are today ... we are just so far from even using the last generation of hardware as efficiently as possible ... you can see up to two orders of magnitude more compute available." (16:16–17:50)
- Models Are Lagging Indicators:
  "The models that we see today ... are already trained on clusters that are a year and a half old ... The models we index on quality today are actually trained on pretty old hardware." (21:25)
- Post-training & Usefulness: Pre-training dominates compute cost, but post-training enables precise, domain-specific utility.
  "Pre-training is like the general strength training that you do in the gym ... post-training is like the specific drills that you run." (23:10)
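The "chip utilization" gap Fu describes is often quantified as model FLOPs utilization (MFU). Here is a hedged back-of-envelope sketch using the standard ~6N-FLOPs-per-token training approximation; the example numbers are illustrative, not figures from the episode:

```python
# Hedged sketch: model FLOPs utilization (MFU), one common measure of the
# "chip utilization" gap. Uses the standard ~6*N FLOPs-per-token training
# estimate; the example numbers below are illustrative assumptions only.

def mfu(tokens_per_sec, n_params, peak_flops_per_sec):
    """Fraction of a chip's peak FLOP/s a training run actually uses."""
    achieved = 6.0 * n_params * tokens_per_sec   # ~6*N FLOPs per token
    return achieved / peak_flops_per_sec

# e.g. a 7B-parameter model at 4,000 tokens/s/GPU on a ~1e15 FLOP/s chip:
print(f"{mfu(4_000, 7e9, 1e15):.1%}")  # prints "16.8%"
```

Utilization figures in this range (and lower, as Fu notes for inference) are what leave "two orders of magnitude" of headroom before hardware itself is the binding constraint.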
5. The Usefulness Convergence
- Matt Turck: Suggests that, regardless of where "true AGI" lands, practicality and usefulness win.
  "What ultimately matters is where you land in terms of usefulness in the industry ... We still have so much juice to squeeze." (25:30)
- Tim Dettmers:
  "You shouldn't pay too much attention to AGI but more about thinking about how can we make it most useful." (26:06)
- Dan Fu: Notes the transformative potential is already showing itself sector by sector (e.g., self-driving, healthcare), comparing the breakthrough to self-driving cars, where sudden leaps in reliability change perceptions overnight.
  "Progress is funny in this way ... it's not there, and then one day ... it's actually a lot better than the service that I'd get in an Uber." (27:25)
6. Hardware Ecosystem: Beyond Nvidia?
- Multi-Hardware, Multi-Chip Future:
Dan Fu envisions increasing specialization and diversity (AMD, Groq, Cerebras, etc.), especially for inference workloads: "You're going to see a lot more diversity, especially around inference ... training and inference are actually quite different computations and as a result you might want quite different chips to do it." (30:10)
7. The Agent Era: Are We at the Inflection Point?
- Coding Agents as the 'Switch Flip' Moment: Dan Fu recounts a transformation in 2025, when agents became strikingly good at GPU kernel work, dramatically accelerating even domain experts' productivity.
  "Last June we had this really interesting realization ... these agentic coding assistants, were actually very good at writing these kernels ... I was like, oh my God, this thing is making me five times more productive as a kernel expert." (32:32–34:15)
- Generalization of Agents: Tim Dettmers sees coding agents as general agents with broad impact outside of coding, rapidly accelerating all digital workflows.
  "Coding agents are general agents ... coding agents make things so easy ... you can parallelize a lot of different tasks." (35:15)
- The New Skill: Agent Literacy: Dettmers insists over 90% of code/text should now be agent-generated, with critical human review and customization.
  "If you don't know how to use agents well, you will be left behind. That will become a critical skill." (39:11)
8. Practical Advice: How to Harness Agents (Even as a Non-Coder)
- Start Small: Try automating minor tasks, use visual feedback, iterate, and play; agents can explain and adapt for non-coders.
  "With minimal learning, you can get there, execute programs, build websites ... The agents write good code." (39:25–41:10)
- How to Pick What to Automate: Use both a creative "what would be useful?" lens and a more analytical, process-oriented approach (cost-benefit, time saved vs. time to automate).
  "Look at how you work, you time each of these steps ... you can quickly realize that automating certain things will not make a difference." (41:17)
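The analytical "time saved vs. time to automate" check can be sketched as a tiny break-even calculation. All the numbers below are illustrative assumptions, not figures from the episode:

```python
# Hedged sketch of the "time saved vs. time to automate" cost-benefit
# check. Every number here is an illustrative assumption.

def automation_payoff(minutes_per_run, runs_per_week,
                      minutes_to_automate, horizon_weeks=52):
    """Net minutes saved over the horizon (negative = not worth automating)."""
    saved = minutes_per_run * runs_per_week * horizon_weeks
    return saved - minutes_to_automate

# A 5-minute task done 10x a week, automated in 8 hours, over one year:
print(automation_payoff(5, 10, 8 * 60))  # prints 2120
```

Running the same arithmetic on a rare task (say, a 1-minute step done once a week) quickly shows a negative payoff, which is exactly Dettmers' point that automating certain things will not make a difference.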
9. Managing & Learning with Agents
- Treat Agents Like Junior Employees: Break down tasks, supervise, and provide context; don't just give agents limitless freedom or mechanize everything blindly.
  "Making the agents effective ends up being a lot like managing junior folks on your team or at a company." (44:00)
- Expertise Multiplies Agent Power: The more domain expertise you have, the more agents can boost your productivity. The process for becoming an expert hasn't changed, but learning is now easier and more interactive with agents as teachers/collaborators. (44:00–47:38)
- Education Challenge: Current students can become reliant on agents before they've internalized core concepts. The future demands both foundational knowledge and agent fluency, a challenge for teachers and learners alike.
  "If we allow students to use agents, they are very productive. But sometimes the built solutions ... are actually very bad ... we don't want to have students that don't understand things, but we also want students that basically can use agents." (49:56)
10. What's Next: Research & Commercial Priorities
- AI2 (Dettmers): Upcoming open-source coding agent that:
  - Can be trained 100x cheaper than current SOTA agents.
  - Rapidly specializes to private codebases, automatically and locally.
  - Comes with a complete scientific breakdown of what actually moves the needle for coding agents.
  "You can just point our method to that repository ... quickly generate the data and then you have an agent that is as good as a frontier model, but you can deploy it locally." (52:44)
- Together AI (Fu): Focus on model efficiency at inference time (currently sub-5% hardware utilization), unveiling "mega kernels" (packing an entire model's forward pass into a single GPU kernel for big speed-ups) and "Together Atlas" (adaptive speculative decoding).
  "At inference time, when you have the model ... the hardware utilization is less than 5%. So it's at a place where there's so much more we can do." (54:44)
11. Looking Forward: The Rest of 2026
- Dettmers: Sees most surprises coming not at the "frontier" (biggest models likely to plateau; user experience to improve, not raw capabilities), but in efficient smaller/specialized models that are easier to deploy and own.
  "Performance on the frontier will stagnate. But on the smaller level we get more and more powerful models still ... smaller models might even be better because they're specialized." (58:19–60:49)
- Fu: Extremely optimistic about open-source model leaps, new hardware launches, and multi-modality (video, audio).
  "I think we're going to see another big jump in open source capabilities ... excited to just see what is that frontier of intelligence you can get on your laptop or on your phone." (60:49)
12. Post-Transformer Architectures?
- Fu: State-space and hybrid architectures are already here (the best audio models, new MiniMax/linear-attention hybrids), with Chinese labs leading on risky, innovative research.
  "You're going to see a lot more diversity in architectures ... kind of already seeing it." (62:25)
Notable Quotes & Memorable Moments
- Dan Fu: "By almost any definition ... we basically have the vision of AGI that we had back then." (00:02)
- Tim Dettmers: "Everything that grows exponential will level off. Because if you need resources, the resources will be exhausted." (11:26)
- Dan Fu: "You can see up to two orders of magnitude more compute available, 100x more compute." (17:50)
- Tim Dettmers: "If you don't know how to use agents well, you will be left behind." (22:22, 39:11)
- Dan Fu: "It might not generate the right thing for you. But if you give an expert programmer this set of tools, they can go 10 times faster than they were able to go before. And I think that's a really exciting place to be." (34:34)
- Tim Dettmers: "More than 90% of code and text should be written by agents. You need to do so or you will be left behind." (37:18)
Timestamps for Important Segments
- [00:02] – Dan Fu: “By almost any definition anyone could have written down ... we basically have the vision of AGI that we had back then.”
- [11:26] – Tim Dettmers: Exponential progress, GPU scaling bottlenecks, “Everything that grows exponential will level off.”
- [16:16] – Dan Fu: “We are just so far from even using the last generation of hardware as efficiently as possible ...” Plus specifics of utilization numbers and stalling progress.
- [22:47] – Matt Turck and Dan Fu: On pre-training, post-training, and model lag.
- [35:15] – Tim Dettmers: Why coding agents are general agents and the broader impact.
- [39:11] – Tim Dettmers: “If you don't know how to use agents well, you will be left behind. That will become a critical skill.”
- [49:56] – Tim Dettmers: Educational tradeoffs—can students master both foundational CS and agent skills?
- [52:44] – Tim Dettmers: Announcing upcoming low-cost, highly customizable coding agent.
- [54:44] – Dan Fu: Together AI’s priorities: “At inference time ... hardware utilization is less than 5% ... there’s so much more we can do.”
- [58:19–60:49] – Both: Predictions for the rest of 2026—stagnant frontier, rapidly improving smaller models, open-source explosion.
Tone
- Candid, technical, future-facing, but with a pragmatic and sometimes skeptical undercurrent.
- Interleaves deep technical observations with pragmatic advice for all listeners.
- Occasional playful ribbing at the hype cycles that dominate the AGI discourse.
- Insistent that the "agent era" is already here, urging all to develop agent skills on pain of irrelevance.
Takeaways for Listeners
- AI progress is at an inflection point not just in raw capability, but in practical, workplace and daily task automation via agents.
- While hardware scaling may plateau, massive optimization potential remains untapped—those who leverage it early will have superpowers.
- Agent skills—knowing how to harness, guide, manage, and collaborate with agents—are becoming the most essential digital literacy.
- Practical usefulness will matter much more than ambiguous definitions of AGI—focus on achieving tangible economic and workflow gain.
- Open-source models and small, specialized models will bloom in 2026 as deployment and ownership become practical and effective.
