
Alex
If you've ever wondered how generative AI works and where the technology is heading, this episode is for you. We're going to explain the basics of the technology and then catch up with modern-day advances like reasoning to help you understand exactly how it does what it does and where it might advance in the future. That's coming up with SemiAnalysis founder and chief analyst Dylan Patel, right after this from LinkedIn News.
Leah Smart
I'm Leah Smart, host of Everyday Better, an award-winning podcast dedicated to personal development. Join me every week for captivating stories and research to find more fulfillment in your work and personal life. Listen to Everyday Better on the LinkedIn Podcast Network, Apple Podcasts, or wherever you get your podcasts. Did you know that small and medium businesses make up 98% of the global economy, but most B2B marketers still treat them as one size fits all? LinkedIn's Meet the SMB report reveals why that's a missed opportunity and how you can reach these fast-moving decision makers effectively. Learn more at LinkedIn.com, Meet the SMB.
Alex
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond. We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI research, and someone I've been looking forward to speaking with for a long time now. I want this to be an episode that (a) helps people learn how generative AI works and (b) is an episode that people will send to their friends to explain to them how generative AI works. I've had a couple of those that I've been sending to my friends and colleagues and counterparts about what is going on within generative AI. One is this three-and-a-half-hour-long video from Andrej Karpathy explaining everything about training large language models. And the second one is a great episode that Dylan and Nathan Lambert from the Allen Institute for AI did with Lex Fridman, both of those three hours plus. So I want to do ours in an hour, and I'm very excited to begin. So Dylan, it's great to see you, and welcome to the show.
Dylan Patel
Thank you for having me.
Alex
Great to have you here. Let's just start with tokens. Can you explain how AI researchers basically take words and then give them numerical representations and parts of words and give them numerical representations? So what are tokens?
Dylan Patel
Tokens are in fact like chunks of words, right? In the human way, you can think of syllables, right? Syllables are often viewed as chunks of words; they have some meaning. The base level of speaking is syllables. Now, for models, tokens are the base level of output. They're all about compressing; it's the most efficient representation of language.
Alex
From my understanding, AI models are very good at predicting patterns. So if you give it 1, 3, 5, 7, 9, it might know the next number is going to be 11. And so what it's doing with tokens is taking words, breaking them down to their component parts, assigning them a numerical value, and then basically, in its own language, learning to predict what number comes next, because computers are better at numbers, and then converting that number back to text. And that's what we see come out. Is that accurate?
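The text-to-numbers round trip described here can be sketched with a toy vocabulary. This is purely illustrative: real tokenizers learn subword pieces (for example, via byte-pair encoding) rather than using a hand-built table of whole words.

```python
# Toy tokenizer sketch: maps text chunks to integer ids and back.
# Real systems use learned subword vocabularies; this hand-made
# five-word vocabulary exists only to show the round trip.
VOCAB = {"the": 0, "sky": 1, "is": 2, "blue": 3, "red": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text):
    """Turn a string into a list of token ids."""
    return [VOCAB[word] for word in text.lower().split()]

def decode(ids):
    """Turn token ids back into text."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("The sky is blue")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # the sky is blue
```

The model only ever sees the ids; the mapping back to text happens at the very end, exactly as described above.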
Dylan Patel
Yeah. And each individual token is actually... it's not just one number, right? It's a vector with many dimensions. You could think of it like: the model needs to learn that king and queen are actually extremely similar on most dimensions, in terms of the English language, except there's, like, one dimension in which they're super different, right? Because a king is male and a queen is female. And then from there, in language, oftentimes kings are considered conquerors, and all these other things; these are just historical things, right? So a lot of the text around them, while they're both royal, regal, monarchy, et cetera, there are many dimensions in which they differ. So it's not just converting a word into one number, right? It's converting it into a vector of many numbers, and the model learns what each of these dimensions means. You don't initialize the model with, hey, king means male monarch, and it's associated with war and conquering because that's what all the writing about kings in history is about. People don't talk about the daily lives of kings that much; they mostly talk about their wars and conquests and stuff. And so each of these numbers in this embedding space will be assigned over time. As the model reads the Internet's text and trains on it, it'll start to realize: oh, king and queen are exactly similar on these dimensions, but very different on those dimensions. And you don't explicitly tell the model, hey, this is what this dimension is for. But it could be as much as, like, one dimension could be: is it a building or not?
And it doesn't actually know that; you don't know that ahead of time. It just happens in the latent space, and then all these dimensions sort of relate to each other. But yeah, these numbers are an efficient representation of words because you can do math on them, right? You can multiply them, you can divide them, you can run them through an entire model. And your brain does something similar, right? When it hears something, it converts that into frequencies in your ears, and then those get converted to signals that go through your brain.
This is the same thing as a tokenizer, right? Although it's obviously a very different medium of compute: ones and zeros for computers, binary and multiplication, et cetera, being more efficient, whereas human brains are more analog in nature and think more in waves and patterns. So while they are very different, it is a tokenizer, right? Language is not actually how our brain thinks. It's just a representation for it to reason over.
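The king/queen intuition can be sketched with hand-made vectors and cosine similarity. Real embeddings have thousands of learned, unlabeled dimensions; the three labeled axes and all the numbers below are made up purely to illustrate the idea.

```python
import math

# Hand-made 3-d "embeddings" on invented axes (royalty, gender,
# warfare-association). Illustrative values only; real models learn
# these dimensions from data and no one labels them.
EMB = {
    "king":  [0.9,  0.8, 0.7],
    "queen": [0.9, -0.8, 0.3],
    "man":   [0.1,  0.9, 0.1],
    "woman": [0.1, -0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# The classic analogy: king - man + woman lands near queen,
# because they agree on most axes and flip on the gender axis.
analogy = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
print(cosine(analogy, EMB["queen"]))  # high: close to queen
print(cosine(analogy, EMB["king"]))   # lower: moved away from king
```

Because embeddings are just numbers, this kind of arithmetic over meanings is exactly the "you can do math on them" point above.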
Alex
Yeah, so that's crazy. So the tokens are the efficient representation of words. But more than that, the models are also learning the way that all these words are connected. And that brings us to pre-training. From my understanding, pre-training is when you take basically the entire Internet's worth of text and you use that to teach the model these representations between each token. So therefore, like we talked about, if you gave a model "the sky is", and the next word is typically "blue" in the pre-training data, which is basically all of the language on the Internet, it should know that the next token is "blue". So what you do is you want to make sure that when the model is outputting information, it's closely tied to what that next value should be. Is that a proper description of what happens in pre-training?
Dylan Patel
Yeah, I think that's pretty much it. That's the objective function, which is just to reduce loss, that is, how often the token is predicted incorrectly versus correctly.
Alex
Right. So it's like this. If you said the sky is red, that's not the most probable outcome. So that would be wrong.
Dylan Patel
But that text is on the Internet, right? Like because the Martian sky is red and there's all these books about Mars and sci fi, right.
Alex
So how does the model then learn how to, you know, figure this out and in what context is it accurate to say blue and red?
Dylan Patel
Right. So, I mean, first of all, the model doesn't just output one token, right? It outputs a distribution. The way most people take it is they take the top k, i.e. the highest-probability tokens. So, yes, blue is obviously the right answer if you give it to anyone on this planet. But there are situations and contexts where "the sky is red" is the appropriate sentence. But that's not just in isolation, right? It's like, if the prior passage is all about Mars and all this, and then all of a sudden there's a quote from a Martian settler, and it's like, "the sky is", then the correct token is actually red, right? The correct word. And it has to know this through the attention mechanism, right? If it was just "the sky is" always, you're gonna output blue, because blue is, let's say, 80%, 90%, 99% likely to be the right option. But as you start to add context about Mars or any other planet (other planets have different colored atmospheres, I presume), this distribution starts to shift, right? If I add "we're on Mars, the sky is", then all of a sudden the model's attention realizes that "the sky is" is preceded by the stuff about Mars. Now blue rockets down to, let's call it, 20% probability, and red rockets up to 80% probability, right? The model outputs that distribution, and then most people just end up taking the top probability and outputting it to the user.
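The shifting distribution described here can be sketched with a softmax over made-up scores. A real model computes these scores (logits) from the full context; the numbers below are invented just to show how extra context flips which token is on top.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the next token after "the sky is ..." in two contexts.
no_context   = {"blue": 4.0, "red": 0.5, "green": 0.0}
mars_context = {"blue": 1.0, "red": 4.5, "green": 0.0}

for name, logits in [("plain", no_context), ("after a Mars passage", mars_context)]:
    probs = dict(zip(logits, softmax(list(logits.values()))))
    top = max(probs, key=probs.get)  # greedy decoding: take the top token
    print(name, {t: round(p, 2) for t, p in probs.items()}, "->", top)
```

With the plain scores, "blue" dominates; with the Mars-flavored scores, "red" takes over, which is the rockets-down / rockets-up behavior Dylan describes.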
And that's sort of like, how does the model learn that? It's the attention mechanism, right? And this is sort of the beauty... yeah, the attention mechanism is the beauty of modern large language models. It takes the relational value in this vector space between every single token, right? So, "the sky is blue", right? When I think about it, yes, blue is the next token after "the sky is". But in a lot of older-style models, you would just predict the exact next word. So after "sky", obviously, it could be many things. It could be blue, but it could also be like "scraper", right? Skyscrapers. Yeah, that makes sense. But what attention does is it takes all of these various values, the query, the key, and the value, which represent what you're looking for, where you're looking, and what that value is, and you're calculating mathematically what the relationship is between all of these tokens. And so going back to the king-queen representation, right? The way these two words interact is now calculated, right? And the way that every word in the entire passage you sent relates is calculated and tied together. Which is why models have challenges with how many documents you can send them, right? Because if you're sending them just the question, like, what color is the sky? Okay, it only has to calculate the attention between those words, right? But if you're sending it, like, 30 books with insurance claims and all these other things, and you're like, okay, figure out what's going on here. Is this a claim or not?
And in the insurance context, all of a sudden it's like, okay, I've got to calculate the attention of not just the last five words to each other, but every one of 50,000 words to each other, right? Which then ends up being a ton of math. Back in the day, actually, the best language models were a different architecture entirely, right? But then at some point, transformers, and large language models, which are basically based on transformers, rocketed past in capabilities, because they were able to scale and because the hardware got there. And then we were able to scale them so much that we were able to put not just some text in them, and not just a lot of text or a lot of books, but the entire Internet, which one could often view as a microcosm of all human culture and learnings and knowledge, to many extents, because most books are on the Internet, most papers are on the Internet. Obviously there's a lot of things missing from the Internet, but this is the sort of modern magic of three different things coming together all at once, right? An efficient way for models to relate every word to each other, the compute necessary to scale the data large enough, and then someone actually pulling the trigger to do that at a scale that got to the point where it was useful, right? Which was sort of like the GPT-3.5 or 4 level, right? Where it became extremely useful for normal humans to use chat models.
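The all-pairs calculation described here, and why 50,000 words cost so much more than five, can be sketched as minimal single-head scaled dot-product attention. This is a bare sketch with toy 2-d vectors and no learned weight matrices (real models project inputs into separate query, key, and value spaces first).

```python
import math

def attention(Q, K, V):
    """Minimal scaled dot-product attention over lists of vectors.
    Every query scores against every key, which is the all-pairs step
    whose cost grows with the square of the sequence length."""
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this token's query to every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # attention weights, sum to 1
        # output = attention-weighted mix of the value vectors
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        out.append(row)
    return out

# 3 tokens with toy 2-d embeddings; Q = K = V for simplicity
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(X, X, X))  # each row is a context-mixed representation
```

For 3 tokens that inner loop runs 3 x 3 = 9 score computations; for 50,000 tokens it would be 2.5 billion, which is the "ton of math" point above.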
Alex
Okay? And so why is it called pre training?
Dylan Patel
So pre-training is sort of called that because it is what happens, you know, before the actual training of the model, right? The objective function in pre-training is to just predict the next token. But predicting the next token is not what humans want to use AIs for, right? I want to ask it a question and have it answered. But in most cases, asking a question does not necessarily mean that the next most likely token is the answer.
Oftentimes it is another question.
For example, if I ingested the entire SAT and I asked a question, the next tokens would be the answer choices: is this A, is this B, is this C, is this D. It's like, no, I just want the answer.
And so the reason it's called pre-training is because you're ingesting humongous volumes of text, no matter the use case.
And you're learning the general patterns across all of language.
I don't actually know that king and queen relate to each other in this way and I don't know that king and queen are opposites in these ways.
And so this is why it's called pre-training: because you must get a broad general understanding of the entire world of text before you're able to then do post-training, or fine-tuning, which is: let me train it on more specific data that is specifically useful for what I want it to do. Whether it's, hey, in chat-style applications, when I ask a question, give me the answer. Or in other applications, like, teach me how to build a bomb. Well, obviously no, I'm not going to help you build a bomb, because I don't want the model to teach anyone how to build a bomb. So it's got to do this. And it's not like, when you're doing this pre-training, you're filtering out all this data, because in fact there's a lot of good, useful data adjacent to how to build bombs; there's a lot of useful information on, like, C4 chemistry, and people want to use it for chemistry.
So you don't want to just filter out everything so that the model doesn't know anything about it. But at the same time, you don't want it to output how to build a bomb. So there's a fine balance here. And that's why pre-training is defined as "pre": because you're still letting it learn things and inputting things into the model that are theoretically quite bad.
For example, books about like killing or war tactics or what have you.
Like, things that plausibly you could see, oh, well, maybe that's not okay, or wild descriptions of really grotesque things all over the Internet. But you want the model to learn these things, right? Because first you build the general understanding. And then you say: okay, now that you've got a general framework of the world, let's align you, so that you, with this general understanding of the world, can figure out what is useful for people and what is not useful for people. What should I respond to? What should I not respond to?
Alex
So what happens then in the training process? Is the training process that the model is attempting to make the next prediction and then just trying to minimize loss as it goes?
Dylan Patel
Right, right. I mean, basically, loss is how often you're wrong versus right, in the most simple terms. You'll run passages through the model, and you'll see: how often did the model get it right? When it got it right, great, reinforce that. When it got it wrong, let's figure out which neurons in the model, quote, unquote, neurons, you can tweak to fix the answer, so that when you go through it again, it actually outputs the correct answer. And then you move the model slightly in that direction. Now, obviously, the challenge with this is, I can come up with a simplistic way where all the neurons will just output "blue" every single time it sees "the sky is". But then when it goes to, hey, "the color blue is commonly used on walls because it's soothing", it's like, oh, what's the next word? Soothing, right? And that is a completely different representation. And to understand that blue is soothing, and that the sky is blue, and that those things aren't actually related to each other but are both related to blue, is very important. And so, you know, oftentimes you'll run through the training data set multiple times.
Because the first time you see it, oh, great, maybe you memorized that the sky is blue, and you memorized the wall is blue, and that when people describe art they oftentimes use the color blue, so it can be representations of art or the wall.
And so over time, as you go through all this text in pre-training, yes, you're minimizing loss initially by just memorizing, but over time, because you're constantly overwriting the model, it starts to learn the generalization.
That is: blue is a soothing color, it also represents the sky, and it's also used in art for either of those two motifs.
And so that's sort of the goal of pre-training: you don't want it to memorize, right? Because, you know, in school you memorize all the time, and that's not useful because you forget everything you memorize. But if you get tested on it, then tested on it again six months later, and then again six months after that, it ends up being: oh, you don't actually memorize it anymore. You just know it innately and you've generalized it. And that's the real goal that you want out of the model. But that's not necessarily something you can just measure.
And therefore loss is something you can measure for each group of text.
'Cause you train the model in steps. Every step, you're inputting a bunch of text, you're seeing where it predicted the right token and where it didn't, and you adjust the neurons. Okay, onto the next batch of text. And you'll do these batches over and over and over again across trillions of words of text. And as you step through, you're like, oh, well, I'm done. But I bet if I go back to the first group of texts, which is all about the sky being blue, it's gonna get the answer wrong, because maybe later on in the training it saw some passages about sci-fi and how the Martian sky is red. So it'll overwrite. But then over time, as you go through the data multiple times, as you see it on the Internet multiple times, you see it in different books multiple times, whether it be scientific, sci-fi, whatever it is, it starts to learn that representation: oh, when it's on Mars, it's red, because the sky on Mars is red because the atmospheric makeup is one way, whereas the atmospheric makeup on Earth is a different way. And so that's sort of the whole point of pre-training: to minimize loss. But the nice side effect is that the model initially memorizes, but then it stops memorizing and it generalizes. And that's the useful pattern that we want.
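The loop described here, run through the data, measure loss, nudge the weights, repeat, can be sketched with a toy next-token model. Everything below is made up for illustration: two contexts ("earth", "mars"), two candidate next tokens, a tiny hand-coded gradient step, and an arbitrary learning rate.

```python
import math

# Toy training data: after "earth"-flavored context the next token is
# "blue"; after "mars"-flavored context it's "red".
DATA = [("earth", "blue")] * 8 + [("mars", "red")] * 2
TOKENS = ["blue", "red"]
W = {"earth": [0.0, 0.0], "mars": [0.0, 0.0]}  # one logit per candidate token

def probs(ctx):
    """Softmax over this context's logits: the model's next-token distribution."""
    exps = [math.exp(logit) for logit in W[ctx]]
    total = sum(exps)
    return [e / total for e in exps]

def epoch_loss():
    """Average cross-entropy: how surprised the model is by the true next token."""
    return sum(-math.log(probs(c)[TOKENS.index(t)]) for c, t in DATA) / len(DATA)

before = epoch_loss()
for _ in range(200):                      # multiple passes over the same data
    for ctx, target in DATA:
        p = probs(ctx)
        for j, tok in enumerate(TOKENS):
            # softmax + cross-entropy gradient: probability minus the 0/1 target;
            # subtracting it nudges the weights toward the right answer
            W[ctx][j] -= 0.1 * (p[j] - (1.0 if tok == target else 0.0))

print(round(before, 3), "->", round(epoch_loss(), 3))  # loss goes down
print(probs("mars"))  # "red" now dominates in the Mars context
```

After training, the same model answers "blue" after "earth" and "red" after "mars": it has stopped outputting one memorized answer and learned the context-dependent pattern, which is the generalization point above in miniature.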
Alex
Okay, that's fascinating. We've touched on post-training for a bit, but just to recap: post-training is, so you have a model that's good at predicting the next word, and in post-training, you sort of give it a personality by inputting sample conversations to make the model emulate the values that you want it to take on.
Dylan Patel
Yeah. So post-training can be a number of different things. The most simple way of doing it is, yeah, pay for humans to label a bunch of data, take a bunch of example conversations, et cetera, and input that data and train on it at the end, right? And so that example data is useful, but this is not scalable, right? Using humans to train models is just so expensive. So then there's the magic of reinforcement learning and other synthetic data technologies, right? Where the model is helping teach the model. So you have many models in post-training, where yes, you have some example human data, but human data does not scale that fast, right? Because the Internet is trillions and trillions of words out there. Whereas even if you had, you know, Alex and I write words all day long for our whole lives, we would have millions, or, you know, hundreds of millions of words written, right? It's nothing. It's orders of magnitude off in terms of the number of words required. So then you have the model take some of this example data, and you have various models surrounding the main model that you're training, right? And these can be policy models, teaching it, hey, is this what you want or is that what you want; reward models, like, is that a good response or is that a bad response; value models, like, hey, grade this output, right? And you have all these different models working in conjunction. And different companies have different objective functions, right? In the case of Anthropic, they want their model to be helpful, harmless, and safe, right? So be helpful, but also don't harm people or anyone or anything, and, you know, be safe. In other cases, like Grok, Elon's model from xAI, it actually just wants to be helpful, and maybe it has like a little bit of a
right-leaning to it, right?
And for other folks, right... I mean, most AI models are made in the Bay Area, so they tend to just be left-leaning, right? But also the Internet in general is a little bit left-leaning, because it skews younger than older. And so all these things sort of affect models. But it's not just around politics, right? Post-training is also just about teaching the model. If I say, like, "the movie where the princess has a slipper and it doesn't fit", well, if I said that to a base model that had only been pre-trained, the answer wouldn't be, "oh, the movie you're looking for is Cinderella." It would only realize that once it goes through post-training, right? Because a lot of times people just throw garbage into the model, and the model still figures out what you want, right? And this is part of what post-training is. You can just do stream of consciousness into models, and oftentimes it'll figure out what you want: whether it's a movie that you're looking for, or help answering a question, or if you throw a bunch of unstructured data into it and then ask it to make it into a table, it does this, right? And that's because of all these different aspects of post-training: example data, but also generating a bunch of data and grading it, and seeing if it's good or not, and whether it matches the various policies you want. A lot of times grading can be based on multiple factors, right? There could be a model that says, hey, is this helpful? Hey, is this safe? And what is safe?
So then that model for safety needs to be tuned on human data, right? So it is a quite complex thing, but the end goal is to be able to get the model to output in a certain way. Models aren't always just about humans using them, either. There can be models that are focused on, like, hey, if it doesn't output code... yes, it was trained on the whole Internet, because the person's gonna talk to the model using text, but if it doesn't output code, penalize it. Now, all of a sudden, the model will never output plain text ever again. It'll only output code. And so these sorts of models exist too. So post-training is not just a univariable thing, right? It's: what variables do you want to target? And so that's why models have different personalities from different companies, why they target different use cases, and why it's not just one model that rules them all, but actually many.
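One common way the reward models mentioned here are trained is a pairwise preference objective: the reward assigned to the human-preferred response should beat the reward assigned to the rejected one. A minimal sketch, with made-up scalar reward scores standing in for real model outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model scores the human-preferred response
    well above the rejected one; large when it has them backwards."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))  # low loss: grader agrees with the human
print(preference_loss(-1.0, 2.0))  # high loss: grader disagrees
```

Minimizing this loss over many human-labeled comparison pairs is what turns a pile of "response A is better than response B" judgments into an automatic grader that can then scale far beyond the human labelers.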
Alex
That's fascinating. So that's why we've seen so many different models with different personalities: it all happens in the post-training moment. And when you talk about giving the models examples to follow, that's what reinforcement learning with human feedback is. The humans give some examples, and then the model learns to emulate what the human trainer is interested in having it embody. Is that right?
Dylan Patel
Yeah, exactly.
Alex
Okay, great. All right, so in the first half, we've covered what training is, what tokens are, what loss is, and what post-training is. Post-training, by the way, is also called fine-tuning. We've also covered reinforcement learning with human feedback. We're going to take a quick break, and then we're going to talk about reasoning. We'll be back right after this.
Leah Smart
Small and medium businesses don't have time to waste, and neither do marketers trying to reach them on LinkedIn. More SMB decision makers are actively looking for new solutions to help them grow, whether it's software or financial services. Our Meet the SMB report breaks down how these businesses buy and what really influences their choices. Learn more at LinkedIn.com, Meet the SMB. That's LinkedIn.com, Meet the SMB.
Alex
And we're back here on Big Technology Podcast with Dylan Patel. He's the founder and chief analyst at SemiAnalysis. He actually has great analysis of Nvidia's recent GTC conference, which we just covered on a recent episode. You can find SemiAnalysis at semianalysis.com; it is both content and, sort of, consulting. So definitely check in with Dylan for all of those needs. And now we're going to talk a little bit about reasoning, because a couple months ago, and Dylan, this is really where I sort of entered the picture, watching your conversation with Lex Fridman and Nathan Lambert about what the difference is between reasoning models and your traditional LLMs, large language models. If I gathered it right from your conversation, what reasoning is, is basically: instead of the model just predicting the next word based off of its training, it uses the tokens to spend more time figuring out what the right answer is and then coming out with a new prediction. I think Karpathy does a very interesting job in the YouTube video talking about how models think with tokens. The more tokens there are, the more compute they use, because they're running these predictions through the transformer model, which we discussed, and therefore they can come to better answers. Is that the right way to think about reasoning?
Dylan Patel
So I think that humans are also fantastic at pattern matching, right? We're really good at recognizing things. But for a lot of tasks, it's not an immediate response, right? We are thinking, whether that's thinking through words out loud, thinking through words in an inner monologue in our head, or just processing somehow, and then we know the answer, right? And this is the same for models. Models have historically been horrendous at math, right? You could ask it, you know, is 9.11 bigger than 9.9? And it would say, yes, it's bigger, even though everyone knows that 9.11 is way smaller than 9.9. And that's just a thing that happened in models, because they didn't think or reason. And it's the same for you, Alex, or myself, right? If someone asked me 17 times 34, I'd be like, I don't know, right off the top of my head. But give me a little bit of time, I can do some long-form multiplication and I can get the answer, right? And that's because I'm thinking about it. And this is the same thing with reasoning for models. When you look at a transformer, every token output has the same amount of compute behind it, right? That is, when I'm saying "the sky is blue", the "the" and the "blue" take the same amount of compute to generate. And this is not exactly what you want, right? You want to actually spend more time on the hard things and not on the easy things. And so reasoning models are effectively teaching large pre-trained models to do this: hey, think through the problem, output a lot of tokens, think about it, generate all this text, and then, when you're done, start answering the question. But now you have all of this stuff you generated in your context, right? And that stuff you generated is helpful, right?
It could be like any human's thought patterns, right? And so this is the sort of new paradigm that we've entered maybe six months ago, where models now will think for some time before they answer. And this enables much better performance on all sorts of tasks, whether it be coding or math or understanding science or understanding complex social dilemmas. All sorts of different topics they're much, much better at. And this is done through post-training, similar to the reinforcement learning by human feedback that we mentioned earlier, but there are also other forms of post-training, and that's what makes these reasoning models.
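One way to see why "thinking tokens" help: if per-token compute is roughly constant, then emitting reasoning tokens before the answer simply buys the model more total compute on the problem. A back-of-envelope sketch, where the parameter count, the token counts, and the standard rough estimate of about two floating-point operations per parameter per generated token are all illustrative assumptions:

```python
# Back-of-envelope sketch: a transformer spends roughly the same compute
# on every generated token (~2 x parameter count FLOPs per token, ignoring
# attention costs). A hypothetical 70B-parameter model is assumed here.
PARAMS = 70e9

def generation_flops(n_tokens, params=PARAMS):
    """Rough total FLOPs to generate n_tokens tokens."""
    return 2 * params * n_tokens

direct = generation_flops(20)           # answer straight away in ~20 tokens
reasoned = generation_flops(2000 + 20)  # "think" for 2,000 tokens first
print(reasoned / direct)                # ~100x more compute on the problem
```

Same model, same per-token cost; the reasoning model just chooses to spend many more tokens, and therefore much more compute, before committing to an answer.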
Alex
Before we head out, I want to hit on a couple things. First of all, the growing efficiency of these models. I think one of the things that people focused on with DeepSeek was that it was just able to be much more efficient in the way that it generates answers. And there was obviously this big reaction in Nvidia stock, where it fell 18% on the Monday after DeepSeek weekend, because people thought we wouldn't need as much compute. So can you talk a little bit about how models are becoming more efficient and how they're doing it?
Dylan Patel
Yeah. So there's a variety of. The beauty of these, of AI is not just that we continue to build new capabilities.
Nathan Lambert
Right.
Dylan Patel
Because those new capabilities are going to be able to benefit the world in many ways. And there's a lot of focus on those, but there's also a lot of. There's a lot of focus on, well, to get to that next level of capabilities is the scaling laws. That is, the more compute and data I spend, the better the model gets. But then the other vector is, well, can I get to the same level with less compute and data?
Nathan Lambert
Right.
Dylan Patel
And those two things are hand in hand, because if I can get to the same level with the less compute and data, then I can spend that more computing data and get to a new level.
And so AI researchers are constantly looking for ways to make models more efficient, whether it be through algorithmic tweaks, data tweaks, tweaks in how you do reinforcement learning, so on and so forth.
And so when we look at models across history, they've constantly gotten cheaper and cheaper and cheaper.
At a stupendous rate.
And so one easy example is GPT-3.
Because there's GPT-3, GPT-3.5 Turbo, Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2.
Across that lineage, we've gone from, hey, it costs $60 for a million tokens, to it costing like $0.05 now for the same quality of model. And the model has shrunk dramatically in size as well. And that's because of better algorithms, better data, et cetera. And what happened with DeepSeek was similar. OpenAI had GPT-4, then they had 4 Turbo, which was half the cost. Then they had 4o, which was again half the cost. And then Meta released Llama 3 405B open source, so the open source community was able to run that, and that was again roughly half the cost, or 5x lower cost than 4o, which was lower than 4 Turbo and 4. But DeepSeek came out with another tier.
So when we look at GPT-3, the cost has fallen 1200x from GPT-3's initial cost to what you can get Llama 3.2 3B for today.
And likewise, when we look at GPT-4 to DeepSeek V3, it's fallen roughly 600x in cost.
So we're not quite at that 1200x, but it has fallen roughly 600x in cost, from $60 to about a dollar or less. And so you've got this massive cost decrease. But it's not necessarily out of bounds, right? We've already seen it. I think what was really surprising was that it was a Chinese company for the first time.
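The arithmetic behind those multiples is simple to check. A quick sketch, where the dollar figures are the approximate per-million-token prices cited in the conversation, not authoritative list prices:

```python
# Quick check of the per-million-token cost multiples discussed above.
# Dollar figures are the approximate prices cited in the conversation.

def cost_drop(old_price, new_price):
    """How many times cheaper, given old and new $/1M-token prices."""
    return old_price / new_price

# GPT-3 at launch vs. a small open model of similar quality today
print(f"{cost_drop(60.00, 0.05):.0f}x")  # 1200x

# GPT-4 at launch vs. a DeepSeek-V3-class price of roughly $0.10/1M tokens
print(f"{cost_drop(60.00, 0.10):.0f}x")  # 600x
```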
Because Google and OpenAI and Anthropic and Meta have all traded blows, right? Whether it be OpenAI always being on the leading edge, or Anthropic always being on the leading edge, or Google and Meta being close followers, oftentimes with a new feature and sometimes just being much cheaper. We have not seen this from any Chinese company, right? And now we have a Chinese company releasing a model that's cheap. It's not unexpected, right? This is actually within the trend line: what happened with GPT-3 is happening to GPT-4-level quality with DeepSeek. It's more so surprising that it's a Chinese company. And that's, I think, why everyone freaked out, and then a lot of things snowballed from there.
Like if Meta had done this, I don't think people would have freaked out.
And Meta's gonna release their new Llama soon enough.
And that one is gonna be, you know, a similar level of cost decrease, probably in similar areas as DeepSeek V3.
It's just that people aren't gonna freak out, because it's an American company and it was sort of expected.
Alex
All right, Dylan, let me ask you the last question. You mentioned the bitter lesson, which is basically, and I'm gonna be kind of facetious in summing it up, that the answer to all questions in machine learning is just to make bigger models, and scale solves almost all problems. So it's interesting that we have this moment where models are becoming way more efficient, but we also have massive, massive data center buildouts. I think it would be great to hear you recap the size of these data center buildouts and then answer this question: if we are getting more efficient, why are these data centers getting so much bigger? And what might that added scale get, in the world of generative AI, for the companies building them?
Dylan Patel
Yeah. So when we look across the ecosystem at data center buildouts, we track all the buildouts and server purchases and supply chains here, and the pace of construction is incredible. You can pick a state and see new data centers going up all across the US, and around the world too. And so you see the capacity of, for example, the largest-scale training supercomputers climb: years ago it wasn't even a few hundred million dollars; for GPT-4 it was a few hundred million dollars and one building full of GPUs; GPT-4.5 and the reasoning models like o1 and o3 were done in three buildings on the same site, billions of dollars; and these next-generation things that people are making are tens of billions of dollars, like OpenAI's data center in Texas called Stargate, with Crusoe and Oracle, et cetera.
And likewise this applies to Elon Musk, who is building these data centers in an old factory, where he's got a bunch of gas generation outside, and he's doing all these crazy things to get the data center up as fast as possible, right? And you can go to basically every company and they have these humongous buildouts. And because of the scaling laws, right, it's roughly 10x more compute for a linear improvement in quality; it's log-log. But you end up with this very confusing thing, which is: hey, models keep getting better as we spend more, but also the model that we had a year ago is now done for way, way cheaper, oftentimes 10x cheaper or more, just a year later. So then the question is, why are we spending all this money to scale? And there's a few things here. A, you can't actually make that cheaper model without making the better, bigger model, because you can generate data from the bigger model to help you make the cheaper model. That's part of it. But another part of it is that if we were to freeze AI capabilities where we were in March 2023, two years ago when GPT-4 released, and only made the models cheaper (DeepSeek is much cheaper and much more efficient, but it's roughly the same capabilities as GPT-4), that would not pay for all of these buildouts. AI is useful today, but it is not capable of doing a lot of things. But if we make the model way more efficient and then continue to scale, we have this stair step: increase capabilities massively, make them way more efficient, increase capabilities massively, make them way more efficient. Then you end up creating all these new capabilities that could in fact pay for these massive AI buildouts.
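Dylan's "log-log" point, roughly 10x more compute for a steady improvement, corresponds to a power-law scaling curve. A toy sketch with made-up constants (real scaling-law exponents are fitted empirically):

```python
import math

def loss(compute, a=10.0, b=0.05):
    """Toy power-law scaling curve: loss = a * compute**(-b).
    The constants a and b are illustrative, not fitted values."""
    return a * compute ** (-b)

# Each 10x in compute shaves off the same constant factor of loss,
# which is a straight line on log-log axes.
for exp in range(22, 27):            # 1e22 .. 1e26 "FLOPs"
    c = 10.0 ** exp
    print(f"1e{exp} FLOPs -> loss {loss(c):.3f}")
```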
So no one is trying to make chat models with these $10 billion data centers, right?
They're not trying to make models that people chat with, just to be clear.
They're trying to solve things like software engineering and make it automated, which is like a trillion dollar plus industry.
So these are very different use cases and targets. And so it's the bitter lesson: yes, you can spend a lot of time and effort making clever, specialized methods based on intuition, and you should, right? But these things should also just have a lot more compute thrown behind them, because if you make it more efficient, as you follow the scaling laws up, it'll also just get better and you can then unlock new capabilities.
And so today, you know, a lot of AI models, the best ones from Anthropic, are now useful for coding as an assistant with you, right? You're going back and forth. As time goes forward, as you make them more efficient and continue to scale them, the possibility is that, hey, it can code for 10 minutes at a time and I can just review the work, and it'll make me 5x more efficient.
You know, and so on and so forth. And this is where reasoning models and the scaling argument come in: yes, we can make it more efficient, but that alone is not going to solve the problems that we have today.
The earth is still going to run out of resources. We're going to run out of nickel, so we can't make enough batteries, and then with current technology we can't replace all of gas and coal with renewables, right? All of these things are going to happen unless you continue to improve AI and invent, or just generally research new things, and AI helps us research new things.
Alex
Okay, this is really the last one. Where is GPT5?
Dylan Patel
So OpenAI released GPT-4.5 recently, from the training run they called Orion. There were hopes that Orion could be used for GPT-5, but its improvement was not enough to really be a GPT-5. Furthermore, it was trained with the classical method: a ton of pre-training and then some reinforcement learning with human feedback and some other reinforcement learning like PPO and DPO. But along the way (this model was trained last year), another team at OpenAI made the big breakthrough of reasoning, the Strawberry training, and they released o1 and then o3. And these models are rapidly getting better with reinforcement learning with verifiable rewards. And so now GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale, like GPT-4.5, but also huge post-training scale, like o1 and o3, and continuing to scale that up. This would be the first time we see a model that was a step up in both at the same time. And that's what OpenAI says is coming. They say it's coming this year, hopefully in the next three to six months, maybe sooner; I've heard sooner, but we'll see. But this path of massively scaling both pre-training and post-training with reinforcement learning with verifiable rewards should yield much better models that are capable of much more, and we'll see what those things are.
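"Reinforcement learning with verifiable rewards" means the reward signal comes from an automatic check rather than a human label. A minimal sketch of such a checker for math problems; the `ANSWER:` extraction convention here is invented for illustration, and real pipelines use their own formats:

```python
import re

def verifiable_reward(model_output, correct_answer):
    """Return 1.0 if the model's final stated answer matches the known
    ground truth, else 0.0 -- no human judgment involved."""
    match = re.search(r"ANSWER:\s*(\S+)", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == correct_answer else 0.0

# A correct trace earns reward; a wrong or missing answer does not.
print(verifiable_reward("... so 17 * 3 = 51. ANSWER: 51", "51"))  # 1.0
print(verifiable_reward("I think it's about fifty.", "51"))       # 0.0
```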
Alex
Very cool. All right, Dylan, do you want to give a quick shout out to those who are interested in potentially working with semianalysis, who you work with and where, where they can learn more?
Dylan Patel
Sure. So, you know, at semianalysis.com we have the public stuff, which is all these reports that are pseudo-free. But then most of our work is done directly for clients. There are these data sets that we sell around every data center in the world: servers, all the compute, where it's manufactured, how many, where, what the cost is and who's doing it. And then we also do a lot of consulting. We've got people who have worked everywhere from ASML, which makes lithography tools, all the way up to Microsoft and Nvidia, which make models and do infrastructure. And so we've got this whole gamut of folks; there's roughly 30 of us across the world, in the US, Taiwan, Singapore, Japan, France, Germany, Canada. So there's a lot of engagement points. But if you want to reach out, just go to the website, go to one of those specialized pages of models or sales, and reach out. That'd be the best way to interact and engage with us. But for most people, just read the blog, right? Unless you have specialized needs, unless you're a company in the space or an investor in the space, you just want to be informed. Just read the blog. And it's free. I think that's the best option for most people.
Alex
Yeah. Well, I will attest the blog is magnificent. And Dylan, it's really a thrill to get a chance to meet you and talk through these topics with you. So thanks so much for coming on the show.
Dylan Patel
Thank you so much, Alex.
Alex
All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.
Host: Alex Kantrowitz
Guest: Dylan Patel, Founder and Chief Analyst at Semianalysis
Release Date: April 23, 2025
Duration Covered: 00:00 – 40:19
Alex Kantrowitz kicks off the episode by addressing listeners' curiosity about the workings and future trajectory of generative AI. He emphasizes the goal of making the episode a comprehensive yet concise guide to understanding generative AI, aiming to distill complex concepts into an accessible one-hour discussion.
“I want this to be an episode that helps people learn how generative AI works and is an episode that people will send to their friends to explain to them how generative AI works.”
[00:00]
Dylan Patel joins the conversation, bringing his expertise in semiconductor and generative AI research to break down foundational elements such as tokens, pre-training, fine-tuning, and reasoning in AI models.
The discussion begins with an exploration of tokens, the fundamental units that AI models use to understand and generate language.
Dylan Patel explains that tokens are akin to syllables in human language—basic chunks of words that carry meaning.
“Tokens are in fact like chunks of words, right? In the human way you can think of like syllables, right. Syllables are often viewed as like chunks of word. They have some meaning.”
[02:23]
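The syllable analogy can be made concrete with a toy tokenizer. Real tokenizers such as BPE learn their vocabulary from data; this hand-picked vocabulary and greedy matcher are only for illustration:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer -- a toy stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):       # try the longest chunk first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])              # fall back to a single char
            i += 1
    return tokens

vocab = {"un", "believ", "able", " ", "token", "s"}
print(tokenize("unbelievable tokens", vocab))
# ['un', 'believ', 'able', ' ', 'token', 's']
```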
Alex elaborates on how tokens are represented numerically, allowing models to predict subsequent tokens effectively.
“AI models are very good at predicting patterns... assigning them a numerical value, and then basically, in its own word, in its own language, learning to predict what number comes next...”
[03:20]
Dylan further clarifies that each token is represented by multiple vectors, capturing nuanced relationships between words. For instance, "king" and "queen" share similarities but differ in specific vectors representing gender and associated historical contexts.
“These numbers are an efficient representation of words because you can do math on them, right. You can, you can multiply them, you can divide them...”
[04:52]
This multidimensional representation allows models to understand complex linguistic relationships beyond simple word associations.
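The king/queen relationship corresponds to arithmetic on word vectors. A toy sketch with hand-made 3-dimensional embeddings; real models learn vectors with thousands of dimensions, and these coordinates are invented for illustration:

```python
import math

# Toy 3-d embeddings: (royalty, masculinity, person-ness). Hand-made for
# illustration; real embeddings are learned, high-dimensional, and dense.
vec = {
    "king":  [0.9,  0.9, 1.0],
    "queen": [0.9, -0.9, 1.0],
    "man":   [0.0,  0.9, 1.0],
    "woman": [0.0, -0.9, 1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman lands nearest to queen
target = add(sub(vec["king"], vec["man"]), vec["woman"])
nearest = max(vec, key=lambda w: cosine(vec[w], target))
print(nearest)  # queen
```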
Pre-training involves feeding vast amounts of text data into AI models to help them learn the probabilities of token sequences.
Alex describes pre-training as exposing the model to extensive language data to predict the next token accurately.
“Pre training is the objective function, which is just to reduce loss, that is how often is the token predicted incorrectly versus correctly.”
[06:47]
Dylan expands on this by explaining that pre-training equips the model with a broad understanding of language patterns without specific directives, enabling it to handle diverse contexts.
“Pre training is defined as pre because you're still letting it do things and teaching it things and inputting things into the model that are theoretically like quite bad.”
[12:13]
He highlights the balance required during pre-training to ensure the model learns general language patterns while mitigating the risks of absorbing harmful or inappropriate content.
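The loss in Alex's quote is, more precisely, cross-entropy on the next token: the penalty is the negative log of the probability the model assigned to the token that actually came next. A minimal sketch with a made-up toy distribution:

```python
import math

def next_token_loss(predicted_probs, actual_next_token):
    """Cross-entropy for one prediction: -log(probability the model
    assigned to the token that actually came next)."""
    return -math.log(predicted_probs[actual_next_token])

# Toy distribution over what follows "the cat sat on the ..."
probs = {"mat": 0.70, "sofa": 0.20, "moon": 0.10}

print(round(next_token_loss(probs, "mat"), 3))   # 0.357: confident and right
print(round(next_token_loss(probs, "moon"), 3))  # 2.303: only assigned 10%
```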
Fine-tuning, or post training, tailors the pre-trained model to specific tasks or behaviors, enhancing its utility and alignment with desired outcomes.
Alex summarizes fine-tuning as imparting personality and specific responses to the model.
“Post training is so you have a model that's good at predicting the next word. And in post training, you sort of give it a personality by inputting sample conversations...”
[18:14]
Dylan outlines various methods used in post training, including Reinforcement Learning with Human Feedback (RLHF), which involves using human-labeled data to guide the model's responses.
“Post training can be a number of different things... using reinforcement learning and other synthetic data technologies.”
[19:00]
He emphasizes that post training is critical for aligning AI outputs with human values and specific application requirements, resulting in models with distinct personalities and functionalities.
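RLHF starts from human preference comparisons: a labeler picks the better of two responses, and a reward model is trained so the chosen response scores higher. A sketch of the Bradley-Terry probability commonly used for that training; the example pair and reward scores are invented:

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry model: probability the reward model ranks the
    human-chosen response above the rejected one."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

# One labeled comparison from a hypothetical preference dataset.
pair = {
    "prompt": "Explain tokens simply.",
    "chosen": "Tokens are chunks of words, a bit like syllables.",
    "rejected": "Tokens. Words. Numbers. Whatever.",
}

# Suppose the reward model currently scores them 2.1 vs 0.3:
p = preference_probability(2.1, 0.3)
print(round(p, 3))  # 0.858; the training loss pushes this toward 1.0
```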
The conversation shifts to reasoning, a capability that enhances AI models' problem-solving skills beyond simple pattern prediction.
Alex introduces reasoning as the model's ability to generate and evaluate multiple tokens to arrive at a coherent and accurate answer.
“Reasoning is basically instead of the model going basically predicting the next word based off of its training, it uses the tokens to spend more time basically figuring out what the right answer is...”
[25:18]
Dylan compares this to human thought processes, where individuals deliberate before answering, allowing for more accurate and contextually appropriate responses.
“Reasoning models are effectively teaching large pre-trained models to do this, right? Think through the problem. Output a lot of tokens, think about it, generate all this text...”
[26:39]
This approach significantly improves the model's performance in complex tasks such as mathematics, coding, and understanding nuanced social issues.
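The visible difference Dylan describes is in the shape of the output: a reasoning model spends tokens on intermediate work before committing to an answer. A sketch of that output format; the `<think>` delimiter convention is illustrative, and vendors use different markers:

```python
def direct_answer(answer):
    """Classic chat model: emit the answer tokens immediately."""
    return answer

def reasoned_answer(steps, answer):
    """Reasoning model: spend tokens working through the problem first."""
    thinking = "\n".join(steps)
    return f"<think>\n{thinking}\n</think>\n{answer}"

print(direct_answer("51"))
print(reasoned_answer(
    ["17 * 3 = 17 * 2 + 17", "17 * 2 = 34", "34 + 17 = 51"],
    "51",
))
```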
Despite advancements in making models more efficient, data centers continue to expand to meet the growing computational demands.
Alex mentions the decrease in costs associated with newer models like DeepSeek and the consequent reactions in the market, particularly with Nvidia's stock.
“...OpenAI had GPT-4, then they had 4 Turbo, which was half the cost... DeepSeek was similar... the cost fell 1200x from GPT-3's initial cost to what you can get Llama 3.2 3B for today.”
[28:27] – [30:30]
Dylan explains that while models are becoming cheaper and more efficient due to algorithmic improvements, the overall demand for compute power drives massive investments in data center infrastructure.
“When we look across the ecosystem at data center build outs... capacity in... largest scale training supercomputers goes... billions of dollars to... next generation things.”
[32:42] – [35:54]
He underscores that increased efficiency doesn't negate the need for larger data centers; instead, it enables the scaling required to handle more complex and capable models.
Dylan Patel introduces the concept of the "Bitter Lesson", which posits that scaling up models and compute resources often yields better results than specialized, intuition-driven approaches.
“The bitter lesson because yes, you can make... but these things should also just have a lot more compute thrown behind them because if you make it more efficient, as you follow the scaling laws up, it'll also just get better...”
[35:44] – [36:18]
He discusses the ongoing trend of scaling both pre-training and post training, which is crucial for unlocking new AI capabilities and maintaining competitive advantages.
Alex reflects on the paradox of increasing model efficiency alongside expanding data centers, questioning the sustainability and future impact of such growth.
“If we are getting more efficient, why are these data centers getting so much bigger?”
[32:00]
Dylan responds by highlighting that efficiency gains alone are insufficient to address global challenges like resource scarcity and environmental concerns. He argues that continued scaling is essential for developing AI systems capable of innovative solutions.
“The earth is still going to run out of resources... unless you continue to improve AI and invent and, or just generally research new things and AI helps us research new things.”
[36:54] – [37:16]
In discussing the future, Dylan provides insights into the development of GPT-5, a model expected to integrate extensive pre-training with advanced post training.
“GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale... and also huge post-training scale like o1 and o3...”
[37:21] – [38:50]
He anticipates GPT-5 to offer significant improvements in capabilities, driven by both increased data and refined training techniques, setting the stage for next-generation AI applications.
As the episode concludes, Dylan shares information about Semianalysis, his organization focused on providing in-depth analysis and consulting in the AI and semiconductor sectors.
“We have the public stuff, which is like all these reports that are pseudo free. But then most of our work is done directly for clients... roughly 30 of us across the world...”
[38:59] – [40:09]
He invites interested parties to engage with Semianalysis through their website for reports and consulting services.
Alex expresses appreciation for Dylan's insights and the informative discussion, wrapping up the episode.
“Dylan, it's really a thrill to get a chance to meet you and talk through these topics with you. So thanks so much for coming on the show.”
[40:17]
Note: This summary encapsulates the first 40 minutes of the episode, focusing on the core discussions about generative AI's mechanisms, training processes, efficiency trends, and future developments.