Transcript
A (0:00)
Back in February, Microsoft AI chief executive Mustafa Suleyman sat down for an interview with the Financial Times. During it, he made the following extraordinary claim.
B (0:12)
I think that we're going to have a human-level performance on most, if not all, professional tasks. So white-collar work, where you're sitting down at a computer, either being a lawyer or an accountant or a project manager or marketing person, most of those tasks will be fully automated by an AI within the next 12 to 18 months.
A (0:34)
Now, if this prediction is true, then we're just a year away from one of the most sudden and calamitous economic shifts in the history of modern economics. I mean, worldwide, the knowledge- and technology-intensive industries produce over $10 trillion of value per year and make up more than a third of economic activity here in the US. So if basically all of this could be replaced by compute, and this was going to happen by next spring, it would make the Industrial Revolution seem glacial by comparison. It would be the economic equivalent of the asteroid that killed much of life on Earth, including the dinosaurs. So is it possible that Suleyman is right? And if he's not, what's a more realistic understanding of what AI will and will not be able to do in the workplace in the near future? Well, it's Thursday, which means it's time for another AI Reality Check episode. So this is a great opportunity to dive deeper into Suleyman's claims. Now, I have a lot to say on this topic, including, if you make it all the way to the end of this episode, a little conspiracy that I uncovered when I was doing research on this topic. So stay tuned for that. But we have a lot to get to, so let's get started. As always, I'm Cal Newport, and this is Deep Questions, the show for people seeking depth in a distracted world.

All right, as you may have guessed, I'm going to argue here today that Mustafa Suleyman's claim is not accurate. I have three major reasons why his claim is inaccurate, and I'm going to present them in order from least technical to most technical. Now, to be clear, I'm not trying to be like Dr. AI Skeptic Guy here, right? I mean, obviously people are finding uses for LLMs in the workplace, even if those tools are nowhere near ready to replace all knowledge work jobs. So I'm hoping that as I get to the end of these three reasons, and before I get to the conspiracy I promised you at the end, I'll be able to review the ways these tools actually are being useful: the actual parameters of where LLMs are making, or will continue to make, a difference in knowledge work in the near future. So we're going to give some positive information as well toward the end of this episode.

All right, but let's dive in. I want to start with my first reason why Suleyman is likely wrong in his claim. That reason is that other tech leaders don't agree. So before we get into the details of what LLMs can or cannot do, I want to emphasize that Suleyman's idea is that we are 12 to 18 months (and given that he said this in February, basically a year) away from all knowledge work jobs being fully automated, fully automatable, by AI. That claim is an outlier compared to what other leaders of major AI companies have been saying. So, for example, before Suleyman grabbed the crown of most doomer about AI job impacts in February, Dario Amodei had been the most pessimistic among the CEOs on multiple occasions. The claim that he has made, several times now, is that AI will replace up to 50% of entry-level knowledge work jobs within five years. That's not great news either, but it's actually much less dramatic than what Suleyman was predicting. Think about it: it's a longer timeline. Amodei is talking about up to five years. It's only affecting entry-level jobs, so he's not talking about all knowledge work jobs. And he's not talking about most of those jobs; he's talking about 50%.
So on every scale we might measure, Amodei's claim is much less drastic than what Suleyman was saying. If we turn our attention to Nvidia's Jensen Huang, he's much more aggressively against the type of claim that Suleyman or even Amodei is making right now. He doesn't really see AI fully automating many jobs at all. In fact, at a recent event at Stanford, Huang argued that making these predictions about large swaths of the economy being automated is actually counterproductive, if not just straight-up false. I want to play a clip of Jensen Huang at that event making the opposite claim of Suleyman: "First of all, I think the narratives of AI destroying jobs is not going to help America. First of all, it's just, it's false." Huang goes on at this event to argue that what we're more likely to see is AI tools being integrated into work like computer tools were in the '90s and early 2000s. It will change what jobs look like; it'll change the day-to-day, maybe what tools you're touching, but it is not going to wholesale replace large swaths of the economy. He notes in that Stanford appearance that the engineering teams at Nvidia use a lot of AI tools, and they're busier than ever before and hiring more engineers than ever before. So he doesn't see AI tools as a job destroyer, but instead as a job changer.

All right, so let's move on now to the second reason why Suleyman is likely wrong in his prediction: we are not seeing enough progress. Suleyman says we're basically a year away from most knowledge work jobs being fully automated. If that were true, we would need to be seeing major, rapid advances in LLM technology to keep us on such an ambitious trajectory. But that's simply not what is happening. Now, look, this might sound crazy at first, because you're bombarded with news constantly about AI, and AI is this and AI is that, and we're scared of this, and it gives you this general impression that things are just moving really fast in the AI world. But if you cut through the PR and the hype, here's what you will actually discover if you follow this technology closely. Since roughly late 2024, the newly released LLMs from the frontier AI companies have been making steady but not particularly fast progress. And instead of the type of immediately obvious functional improvements that we got used to in the era when we went from GPT-2 to GPT-4, most of the improvements we're seeing from model to model are largely being captured in benchmarks: charts of numbers from tests, in many cases invented by the AI companies themselves, that have obscure acronym names. So we just see these charts where they'll say, look, we got a 20% boost on this benchmark; we're now comparable to this other model on that benchmark. We're not seeing major revolutionary leaps in functionality anymore. Instead, we're seeing slow and steady progress.

To give you an example of what this is like, I'm going to load up here on the screen a Reddit thread that's talking about the most recent release from Anthropic, which is Claude Opus 4.7. There's a summary here of a vigorous conversation that happened on this Reddit thread about the new Opus 4.7. Here's the summary: "The verdict is in, and it's not pretty. The overwhelming consensus in this thread is that Opus 4.7 is a massive regression and a serious downgrade from 4.6. Users across the board are reporting a dumber, lazier, and less reliable model that feels like a step back to early ChatGPT."
All right, so this is their newest model, and we're not seeing revolutionary leaps from model to model. This is much more of a jagged kind of frontier: make progress here, take steps back elsewhere. Another model that was released recently was OpenAI's GPT-5.5, and people seem to like this one better than Opus 4.7. But what is the magnitude of these improvements? Well, I loaded up a review I'll put on the screen here. Matt Shumer posted a long review online where he called it "a big upgrade that doesn't always feel like one." I'll read you a couple of things from this review. Shumer is excited about continued improvement in the LLM playing nicely with coding harnesses, that is, producing long-term plans for coding. Here's what he wrote: for serious software work, it is exceptional, thoughtful, careful, able to make many of the same decisions I would make, and very good at iterating against a goal until a thing actually works. But as he also goes on to point out, these models have been getting slowly and surely better at this type of thing for about a year now, so you might not actually notice much of an improvement, because they were already pretty good at that. And here's a summary of what Matt called the biggest story about GPT-5.5: it rounds out the weaker parts of the GPT line; design from existing context, iOS and native Mac apps, security, et cetera.

Right, so what are we hearing here about these latest, newest models? Slow and steady. It's like normal software updates: we improved the native Mac and iOS thing, we tweaked this functionality, now we're getting better scores on this particular benchmark. Sometimes they swing and miss, like Opus 4.7: the tweaks we made actually made things worse, and so people are going to go back to the previous version. There's nothing bad about that. This is just the normal pace of software improvements. But here's the problem: we are not going to get from where we are now, where almost no major knowledge work task is fully automatable, to a place where all knowledge work tasks are fully automatable, in another year of these slow and steady improvements. One step forward, one step back; two steps forward, one step back; we tweak this, we improve that. That's nowhere near a fast enough pace to get us where Suleyman said we're going to get.

Now, at this point you might be saying, yeah, but what about coding agents? Coding agents feel like an example of a major knowledge work task, software engineering, where we weren't using AI heavily except in more autocomplete ways, and then suddenly, it seemed, last fall into the new year, everyone was using AI coding tools within software engineering firms. It's not everyone, but it's massive percentages. Now, the way the general public thinks about this is: well, AI, quote unquote, keeps getting better and better, and it got good enough to automate that all of a sudden; and as it continues to get better, maybe it will all of a sudden unlock other types of tasks that we can fully automate. But that doesn't really describe what happened with the rise of coding agents within enterprise software development teams. What you have to understand is that that leap was actually as much about, if not more about, the quote-unquote coding harnesses as it was about the underlying models. The coding harness is the software program written by people,
not machine learning, not AI, that actually calls the LLM for ideas and then executes those ideas on behalf of the user. What really happened is that there was a multi-year process of multiple companies working on these coding harnesses to make them more and more relevant. They were trying to figure out how to take LLMs' ability to produce computer code, which we've known about since 2022, and get it better integrated into how actual professional software development teams operate on really large code bases. And most of the innovation actually happened in that coding harness. They figured out all these different rules and approaches: we're going to use skill files, we're going to have a sort of simulation of memory over these skill files. There's a ton of just old-fashioned, 1950s-style AI in these coding harnesses: regex pattern matching, using existing software tools to verify code. So there's a lot of kludged-together stuff you can read about. I think Gary Marcus had a good piece on this a few weeks ago, because the code for the Claude Code coding harness leaked, so we know what's in there. So there was just slow and steady work on building the coding harness until they could finally figure out how to make it play nicely enough with enterprise coding that we can now bring LLMs into that process. Now, they also tuned the models to play better with the harnesses as well, but I think it's the harness development that made AI coding relevant to big software development teams.

So what this tells us is that if you want a similar jump in another type of major knowledge work task somewhere, you have to have a lot of people iterating for maybe a year or two to try to figure out the right harness to connect properly into that particular type of job. And if we want Suleyman's claim to be true, that all major knowledge work tasks will be fully automatable, you would need something like a thousand of these teams, each focused on another major knowledge work task, trying to build a custom harness that works just for that task. Well, they're not doing that. They don't have enough people to do that. The market isn't there. And it takes experts; the one thing people working at AI companies are experts at is software development, because that's what they do, so this was the ideal place to do this. So I think the real lesson of the sudden emergence of coding agents is that it's actually really hard, and takes a lot of focused work, to integrate AI into individual types of workflows. It's not something that just happens as the models get smarter. So again, unless these companies are hiding thousands of teams working on all these different areas of knowledge work, trying to find ways to subtly integrate AI into them, I do not see how we're going to suddenly have coding-agent-style automation in many other tasks within roughly 12 months.

All right, my third and final point on why Suleyman is likely wrong: the functionality of LLMs is limited. So we're moving down the technical stack here. Let's briefly open the black box around LLMs to better understand what they can and, more importantly, likely cannot do, no matter how big we scale them. Let's start from the bottom. We've done this before, so I'll go quick. What does an LLM actually do? It predicts tokens. You give it text, and it outputs what token should come next. It has been trained to assume the text that's given as input is a real text, written by a human, that actually exists.
And so there is a right answer for what the next token is, and it's trying to get that right answer. That, at base, is what an LLM does. Okay, so then how do we get long text out of an LLM? Well, you have to actually put a program on top of the LLM that keeps calling it again and again, in what's known as autoregressive generation. You have some text, you give it as input, and it outputs a token. You put that token at the end of the input text, and now you take that slightly longer input text, put it through the LLM, and you get another token. Now you take that even longer input text, put it through the LLM, and you keep doing that until it has grown out a whole answer, which you can then return to whoever made the original prompt (there's a sketch of this loop in code after this segment).

So what this autoregression of token guessing gives you is basically a story completer. Here's some text, and the LLM is implicitly trying to finish the story it was given as input as accurately as possible, based on all the types of text it's seen so far. Now, the original big LLMs, like GPT-3, would create reasonable stories, but they would be all over the place. So then they figured out how to, in what's known as post-training or tuning, push these LLMs toward certain categories of text and away from others. GPT-3.5, which was the version that powered the original ChatGPT, was tuned to think about the stories it's completing as answers to questions: whatever you put into your prompt, the input is a question, and it's trying to answer that question. And that was much easier for average users to deal with.

So they're story completers. Now, we don't want to downgrade that, right? This isn't a dismissal like "it's just autocomplete," because what we discovered, particularly with GPT-4, is that as we scaled these things up, in order to successfully complete stories in reasonable ways, these LLMs actually encoded a lot of really interesting rules and logic. If my story involves a math problem, then to complete that story in the right way, the LLM is going to have to have some math logic built in there, because otherwise the story is not going to seem reasonable. This was really the big discovery of GPT-4: wow, there are all of these abilities that were implicitly encoded into this LLM during its training. It's just trying to complete stories, but we trained it on so much stuff for so long that it learned all of these abilities that allow it to complete stories well. It understands humor, it understands math, it understands computer code, it understands basic logic. If it's seen a game enough times, it has some general sense of what the game rules are and what a valid move in that game looks like. It was pretty incredible what GPT-4 could actually do. So there was this idea that if we kept scaling these things bigger and bigger, then yeah, sure, they're story completers, but they would eventually have so many rules and so much logic encoded into them that they would actually have something like human-level intelligence, and we could just use the LLM's brain to automate everything. That was the vision. Okay, then what happened is that by the summer of 2024, we learned this type of pure scaling was hitting a wall. Just trying to make these models bigger and train them longer was not appreciably leading to new functionality being encoded into them like we had seen before.
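[Editor's note: to make that autoregressive loop concrete, here is a minimal sketch in Python. The toy "model" is just a lookup table standing in for a trained network's forward pass; nothing here reflects any real product's API.]

```python
import random

# Toy stand-in for a trained LLM: a lookup table of plausible next tokens.
# A real model replaces this with a neural-network forward pass.
TOY_MODEL = {
    ("once",): ["upon"],
    ("once", "upon"): ["a"],
    ("once", "upon", "a"): ["time"],
}

def predict_next_token(tokens):
    """The only thing the 'model' ever does: guess one next token."""
    candidates = TOY_MODEL.get(tuple(tokens), ["<end>"])
    return random.choice(candidates)

def generate(prompt_tokens, max_tokens=50):
    """The autoregressive loop: append each guess and feed it back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        nxt = predict_next_token(tokens)  # one forward pass
        if nxt == "<end>":                # model decides the story is done
            break
        tokens.append(nxt)                # grow the input by one token
    return tokens

print(generate(["once"]))  # ['once', 'upon', 'a', 'time']
```

The point of the sketch is that the model itself only ever answers one question, "what token comes next?"; everything longer than that is produced by the surrounding loop.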
And this kicked off a whole era, which really started with the alphabet-soup models out of OpenAI in the fall of 2024, of post-training and tuning to get further improvements out of these models. Okay, we can't just scale them bigger, but we can do things called post-training, where, if we have very specific data sets with questions and exact right answers, we can tune an already pre-trained model to be better at using its already-wired intelligence to answer that type of problem. And that's why, starting in 2024, we saw a big focus on the types of things for which we had data to do this tuning: reasoning, math, and computer coding. And that's why they weren't talking about other types of things you would want an LLM to do, even though they'd be economically very valuable: because all we can do now is tune these models and get sort of steady improvements in areas where we happen to have a lot of highly structured data to do the tuning on. And that's really where we've been since 2024. That's why we have more of this slow but steady improvement with LLMs on particular benchmarks, and no longer the big general leaps we had during the pretraining-scaling era, when we went from 3 to 3.5 to 4.

Okay, so this is a problem for automating work, because we know we can't just scale these LLMs into having a human level of reasoning. And for most of the skilled things we do in knowledge work, we don't have highly structured data sets like we have for computer programming or math. So we can't even tune the pre-trained LLMs to be better at the things a lot of people do in their jobs. So this is sort of a wall that's hard to get past from where we are right now with LLMs.

Now you might say, yeah, but what about workplace agents? A lot of what people do in knowledge work is actually not that skilled. It's individual things that an LLM can do: send an email, get information about an upcoming conference, move information to a spreadsheet, build slides and send them to the team. Actually, a lot of what we do in knowledge work is what I call shallow work, things that don't require a particularly high level of skill. So why can't we have, at the very least, something like a knowledge workplace administrative assistant that can automate a lot of the boring stuff we do, the way coding agents can do multiple steps of work on behalf of the programmer? This wouldn't fulfill Suleyman's prediction, but it would get us closer. Why don't we have those either? Well, it turns out that building these types of tools is difficult. I wrote an article about this back in January for the New Yorker, called "Why Didn't AI Transform Our Lives in 2025?", and in it I looked at the question of why we don't have more workplace agents. We were told at the beginning of 2025 that this would be the year of the agent: not just coding agents, but AI assistants for all the things you do at work. Why did that largely not happen? Well, when you actually study how these agents work, the problem is the plans. The multi-step plans that agents execute are the result of LLM prompts. You have a harness, which is a computer program written by a person, that prompts an LLM and says, give me a multi-step plan for doing X. And then the harness takes that text and goes step by step, executing those steps on behalf of the user.
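[Editor's note: here is a minimal sketch of that harness pattern in Python, under stated assumptions: `ask_llm` and `execute_step` are hypothetical stand-ins for a real model call and real tool use, not any product's actual API. The comments flag the verification gap discussed next.]

```python
# Minimal sketch of an agent harness: ordinary, human-written code wrapped
# around an LLM. ask_llm and execute_step are hypothetical stand-ins, not
# any real product's API.

def ask_llm(prompt: str) -> str:
    # Stand-in: a real harness would call a model API here.
    return "1. do the thing\n2. check the thing"

def execute_step(step: str) -> bool:
    # Stand-in: a real harness would run a compiler, tests, or other tools
    # and report whether the step verifiably succeeded.
    print("executing:", step)
    return True

def run_agent(task: str, max_retries: int = 3) -> bool:
    # 1. The harness asks the LLM for a multi-step plan, as plain text.
    plan = ask_llm(f"Give me a numbered, step-by-step plan for: {task}")
    # 2. The harness, not the LLM, walks the plan and executes each step.
    for step in plan.splitlines():
        for _ in range(max_retries):
            if execute_step(step):
                break  # this step had a checkable success signal
            # For coding, success is verifiable: the code compiles, the tests
            # pass. For ambiguous knowledge work there is often no such check,
            # so a merely reasonable-sounding step would run unverified.
            step = ask_llm(f"This step failed: {step}. Give a corrected step.")
        else:
            return False  # retries exhausted; give up on the task
    return True

run_agent("file my expense report")  # prints each step as it "executes"
```

Notice where the leverage is: the loop, the retries, and the verification hook are all ordinary code, which is why so much coding-agent progress came from harness work rather than from the models themselves.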
For computer programming, they've learned how to be pretty good at making those multi-step plans, because the option space of what you do when you're building a computer program is very narrow, and what they're doing is very verifiable. So the harnesses for coding agents will actually call non-AI tools that can verify things: all right, this step has a clear success criterion, like the code compiles and passes these tests, and we're going to run another program that checks that the code compiles and passes those tests. And if it doesn't, we can go back to the LLM and say, that plan didn't work, give us another one. Right? It's very well set up for doing this. But, as I wrote in my New Yorker article, once you're in this more general world of more ambiguous knowledge work tasks, what you're going to get out of the LLM is a reasonable-sounding plan. Reasonable-sounding, because that's what it does; it's a story completer. And reasonable-sounding plans get you in trouble. You need correct plans. The way humans actually make plans is we do a few things. One, we test a bunch of possibilities internally to see what makes the most sense. And two, we have some notion of correctness and a world model we can use to evaluate plans: does this actually do what I want it to do? Are there any mistakes along the way? That's not how LLMs operate. Again, they're just autoregressively producing tokens. So you get a reasonable-sounding plan, but the model isn't stepping back. It doesn't have a world model to test the plan against. It can't apply hard-coded rules consistently. It has no ability to simulate possible future outcomes. And so we just get a story that sounds like a good plan but often has issues along the way. And if an agent is going to automatically execute these things, we get into trouble.

All right, so making agents, even just administrative assistants, in non-programming areas is also a very hard problem, and not one that we have a good solution to. We also underestimate the degree to which, if you watch a programmer using a coding agent, they're constantly tweaking and re-asking and adjusting. There's a huge amount of work involved in getting consistently useful output out of these agents, and most knowledge workers simply aren't going to do that, or won't have the technical chops to pull it off. Back in the fall, OpenAI even mentioned they were slowing down or reducing their non-coding agent projects, because those weren't working and they wanted to focus on ChatGPT and their coding agents.

All right, so if we put these three reasons together, I think it's clear that Mustafa Suleyman's claim that AI will fully automate most knowledge work jobs by next spring really is an outlier opinion. It's an opinion contradicted by the statements of other CEOs; by the reality of the rate of progress of LLMs over the last couple of years, which is not nearly fast enough to get where he says we're going to get; and by the technical limitations of these models. An LLM with a harness on it is just not a great setup for automating a lot of different types of knowledge work jobs. So why, then, might Suleyman be pushing this story? What's in it for him? I'm going to play you a quick clip of one potential explanation. It comes from the Absolutely Agentic channel. Let's hear what they had to say there.
