Loading summary
A
I'm back with Rainer Pope, who is the CEO of Maddox, which is a new AI chip company. Last time we were talking about what happens inside a data center. Now I understand what happens inside an AI chip. How does a chip actually work? Full disclosure, by the way, I am an angel investor in Maddox, so hopefully you have designed a good chip. Also, if you're listening to this on an audio platform, it's much preferable to watch this blackboard lecture on a platform where you can see what's happening. So switch over to YouTube or Spotify.
B
So I'll start with sort of the very smallest fundamental unit of chip design. Then we'll sort of build up into what an overall actual production chip. What are the components of that? At the very bottom level of a chip, the primitives that we work with are logic gates, which are very simple things like and or not. And then these are connected together by wires that have to be laid out physically as metal traces on a chip. The main function that AI chips want to compute is multiplication of matrices. And really inside that is the fundamental primitive is multiply accumulative, just like of pairs of numbers. So we're going to sort of demonstrate what that calculation looks like by hand and then sort of infer what a circuit would look like for that. It'll turn out to be sort of easiest if I do multiplication accumulator, something like a four bit number with another four bit number. And then we're going to. The actual clearest primitive is actually multiply accumulate. So there's a multiply these two terms and then we're going to add in so product of these two terms and then we're going to add in an eight bit number.
A
And can I ask a clarifying question? Why is this the natural primitive for whatever computation happens inside a computer?
B
Yeah, so there's a few reasons for this. It's a little bit more efficient, but the reason it's natural for AI chips is that if you look what's happening during a matrix multiply. What is matrix multiply? In very short? It is. There's a for loop over I and over j and over k of output I, k plus equals to input IJ times other input jk. And so multiply accumulate happens at every single step of a matrix multiply.
A
Makes sense.
B
And then the other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step. This is maybe specific to AI chips, but you're multiplying low precision numbers. But then when you accumulate errors accumulate quickly and so you need more precision here. So this is why we've chosen to do a 4 bit multiplication and an 8 bit addition.
A
Let me make sure I understood that there's two ways to understand that. One is that the value will be larger than the inputs, and the other is that if it was a floating point number, it would be. Maybe that part is less intuitive to me, but it's maybe the same principle.
B
It is really the same principle. I guess the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers and so you've got a lot of rounding accumulating. Whereas in this case, like there's only one multiplication in that chain and so there's not a lot of rounding errors accumulating in the multiplication.
A
Why are you summing up a whole bunch of numbers or just two numbers?
B
Right, just. I mean this summation happens, it's repeated j many times.
A
So yeah, any errors accumulate. I see, yes, makes sense.
B
So how would we perform this calculation? By hand? I mean, as a human we would probably separate it into two, but we can sort of do it all in one using long multiplication. So the multiplication term first we're going to multiply this number, this four bit number here, by every single bit position in the other four bit number. So we write that out first, 1001, multiplied by this bit position, that is this number itself, then shift it across by one, we're multiplying by zero. That gives us an old zeros number shifted across even one more to multiply by this one we get 1001. And then finally for this last bit position, we get an old zeros number again. So this sort of gives us a bunch of terms that we're going to have to add for the multiplication. And then while we're doing that summation of this, we might as well add in the actual accumulated term as well. And so we just copy that directly across. So this is the sum, It's a five way sum that we're going to want to compute. So firstly, what logic gates did it take us? To even get to this intermediate step, we needed to produce all 16 of these partial products. How do I produce one of these partial products? So let's take this number one for example. Here it is one. So how do we produce this number? By multiplying this number by this one over here. We can actually produce that by an and gate, this number is 1. If both this bit is 1 and this bit is 1, if you want either of them is 0, then the multiplication of 1 times 1 of 0 times anything is 0. So to produce all of this stuff, we ended up consuming 16 and gates. Or in the general case, if I were doing a P bit multiply times a Q bit multiply, then this will be like P times Q. Many ands.
A
Exactly.
B
Finally, I sum them. Actually, most of the work is going to happen in the summing. And so let me describe sort of the other logic gate that we use here and is almost the simplest logic gate that exists on a chip. It's almost the smallest. At the other extreme, typically the very largest logic gate that you'll use is something called a full adder. And what this does is sort of it does like. Coming from software, you might think that a full adder adds 32 bit numbers together. In this case it just adds 3 single bit numbers together. And so you can think of it as like adding 0, 1 and 1 together. Now when I add these together, the result can be 0, 1, 2 or 3. So I can express that in binary using just 2 bits. So as input it has 3 bits and as output it has, has 2 bits, which in this case are 1. And like the number 2 in binary is 10. So this is also known as a 3 to 2 compressor. Because it takes three bits of input and produces two bits of output.
A
The two inputs are an X and a Y value and then some carry that that came in from like, sorry,
B
the three inputs are all bits that are in the same sort of bit position, like three bits that are in a column here. And then the two outputs, I have sort of drawn them vertically here and horizontally here to kind of match this vertical versus horizontal layout here, which is expressing that things that are in the same column are in the same bit position, whereas things that are in adjacent columns. Like this is a carry out, whereas this was the sum.
A
So if the inputs in the full ladder are, let's say like 1, 0, 1, then the output would still be 1 0. If it was 11 1, it'd be 1 1, it was 00 0, it'd be 0 0. If it was like 01 0, it'd still be 0 1.
B
So yeah, it's just counting essentially the number of things and expressing that in binary. So this circuit actually can sort of capture what we as humans naturally do when we're doing summing along a column. So I can show that sort of. I'll show sort of one iteration of using the full adder to sum. The way I sum here is going to be a little bit unnatural for humans. Humans we would sort of sum along the column and then remember the carry. But instead of remembering the carry, we'll actually just explicitly write it out. So in this we proceed from the rightmost column towards the left. On the rightmost column we sum the one and the one. And that produces like a, a zero here and a carry of one. So we've sort of used this full adder circuit on this pair of bits and produced a pair of bits as output. Now we can do the same thing with this column. We've got a column of 1, 2, 3, 4 numbers. And so maybe we'll like take the first three of them, run a full adder on them, and that gives us a zero and a zero as output. So like the sum of these is zero, zero. So that's the full compressor, full adder applied to all of these bits. As I've used up bits, I'm going to sort of just cross them out to indicate that I've handled them. Let's just keep going a little bit more. So we'll go here. I take these three numbers, I add them, that gives me a one and a zero. I've dealt with these three numbers and now I take one, two and I can even take these three numbers for example, right now and add them. And that gives me a one and a zero. And I've dealt with these numbers. So I can sort of like the way I should view this is that I have this whole grid of numbers that need to be added. I'm going to just keep applying full adders to all the bits that are here, constantly removing three numbers from a column and then writing out two numbers as output. Keep going with this over and over and over and again until I eventually get like some, just one single number coming out here, something like that. This is probably the wrong sum. So this approach that I've described here, this is called a data multiplier. And this is sort of like the standard for how you do area efficient multipliers using full adders. Let's try and quantify the circuit size of this just so we have got a sense of how big things are so we can compare to them later. How many full adders did I use? I started with how many numbers I have, the 16 partial products, which is the product of all of these terms, with all of these terms plus the eight terms that I'm adding here. So I started off with 24 bits and then I produced eight bits on the output eventually. And in every step I was sort of crossing off three numbers and writing Two numbers out as end result. And so every single use of a full adder eliminates one of the bits here. And so how many full adders it must be the 24 minus the 8. So there were 16 full adders in this circuit in general. This is true in the general case as well. There will be P times Q many full adders in this circuit.
A
Let me try understand the logic of that. So the input bits 24 is p times q plus p plus q.
B
That's right.
A
And the output bits is just p plus q. And so p times q plus p plus q minus p plus q equals p times Q.
B
That's right. So I think this explains sort of, or at least hints at the second reason why we chose to do a multiply accumulate. First reason being that's actually what shows up in matrix multiplication. But second reason being, it gave us this very slick P times Q, very simple algebra. So we've sort of described like this whole procedure. Every single atomic step that I took here becomes a logic gate. And then sort of the wires connected together, like when I had these three inputs that I salvaged to produce these two outputs. If I think of mapping this to a physical device, there would be a wire that runs sort of connecting all three of these things together into a logic gate that produced this output. Okay, so this is the main primitive at different bit widths. That is inside an AI chip we're going to build up from here to. How would you use that to run all of the other operations you might want?
A
This might be the wrong time to ask this question, but whenever Nvidia reports that this chip can do x many FP4 or half as many FP8, it seems to imply that those circuits are fungible, that there's not as dedicated like FP4 versus FP8. But the way you're mapping it out here, it seems like you would need. If it has to be mapped out in the logic, you would need a dedicated FP4 multiply accumulate and then a dedicated FP8 accumulate. Basically, can you funge them as drawn?
B
They're actually not particularly fungible. This is actually one of the main choices you have to make when designing a chip, which is how much of FB4, how much of FP8 do I have? And then sometimes I'll make that consideration from the point of view of like, what do I think is the customer requirement? Another way to take an angle on that is to say what is the power budget for? Equalise the power budget between FP4 and FP8.
A
But so then when they report those numbers and they just happen to be the case that it does 2x as many FP4 as FP8, they just happen to choose, give the equivalent die areas to all the floating points and as a result it ended up being like,
B
why is the ratio exactly 2x?
A
Yeah, exactly, yeah.
B
So part of it is, I mean, surely that won't be exactly equivalent to die area. There's a data movement reason, actually. And we'll maybe come back to this when we sort of look through how it goes into and out of memories. There's something really nice, just from a software level of the fact that I can pack two 4 bit numbers into the same storage as an 8 bit number. And so when I store that to a memory or something like that, the sizing of the buses that I wire within the chip actually makes that work out really nicely.
A
Actually, come to think of it, it's not just 2x, it would be the amount of area it takes, it sounds like is quadratic. It's quadratic in fact, with the bit link. So that's why smaller precision is even more favorable than your negative.
B
This is a really big reason. So in fact, Nvidia made a change historically, up until B100 or B200, every time you have the bit precision, you double the FLOP count. That ratio is exactly like, for the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong. It should be like an even bigger. You should get an even bigger speed up than you might otherwise think. Nvidia's product specs have sort of started acknowledging that in B300 and beyond, where the FB4 is three times faster than the FB8, though it should be 4x. Yeah. What I've shown here is the simplest case of integer multiply. When you are dealing with floating point, as you do in FB4 and FB8, there's this sort of other term which is the, the exponent that just complicates this calculation. So what can we see already from this? I think the big observation you've made is that there's this quadratic scaling with bit width, which is very effective and is the single reason why low precision arithmetic has worked so well for neural nets. But the other thing we're going to do now is we're going to compare sort of the area spent on the multiplication itself with all of the circuitry that is around it.
A
So
B
we'll walk back in time a little bit and see how did GPUs prior to tensor cores work, which is the same way as the way that CPUs worked in fact. So which is like where do we stick this multiply accumulate unit? So generically I'll describe like a CUDA core or a cpu. You'll have some register file which stores some number of entries. Maybe it's like 8, 8 entries of like in this case, I guess 4 bit numbers, but typically like 32 bit numbers or something like that, which are numbers. So this is the like inside the CUDA core I'll have some register file of some depth and then I will have my multiply accumulate circuit. Multiply and accumulate circuit. And what it's going to do, it's going to take three arbitrary registers from this register file, perform the multiply accumulate and then write back to the register file. So it's going to maybe write to this one, but it was able to read from this one, this one and another random one. So it'll take three inputs like this. So this is the core data path of many processors. Most processors look like this. You've got some set of registers and then you've got some set of logic units or ALU's. We want to analyze the cost of the data movement from the register file to the ALU and back. So ultimately there's going to be some circuit that says, well, I don't always have to select this guy. I might select any of the registers at any point in time. And so sort of a first question is how can I build a circuit? The circuit that I'm going to look for is a mux. So in this case it's going to have eight inputs, one from each entry of the register file. And it's going to have one output which is actually producing this output. And then what is the cost of this thing? All we have to build it out of is and, and, or, and so how do we build it? We do the dumbest thing possible. We form a mask saying okay, when we want to read like the third entry we're going to and every single entry with either 1 or 0 based on whether that's what we want to read. And then we're going to. Or all of them together.
A
Okay, Just to make sure I understand the basics, what the MUX is doing is just like selecting, just selecting, just selecting an input.
B
Yeah, so like invisible to software. It's like you say I want input number three. That means there's a MUX here. And so like what is the cost of this mux? So an N input MUX. Operating On P bits. Well, I'm going to. So I have N rows, that's this, eight rows and I've got like each row is P bits wide. Well, I have to. And every single bit. So I get N times P many and gates every single input I have to say am I going to mask it out or not? And then I'm going to or them all together. And so there's going to be like n -1 times p Many or gates, which is saying I've got all of these different things, almost all of them are zeros, but I need to sort of collapse them down into. From my eight options down into one option. And so every step I need to or one row into an existing row.
A
Got it? Yeah. It's actually kind of funny that you would sort of. You don't think at the level of hardware, you sort of just think like, oh, I'll just select element three. And something as simple as that is sort of like in and of itself a quite complicated circuit.
B
Yeah, I mean this is the first step of all of the hidden data movement costs that show up. Yeah. And so the thing we're just going to compare like I have to pay this cost and I've got one mux here and then in fact I have two more copies of that for each of the three inputs to my multiply accumulate operation. And so I have this cost which is like 3 times n times P and gates over here compared to this P times Q like sort of gates in the actual circuit that is doing the thing I care about. And if we plug in actual numbers like this n being 8, like I get like 24 times p gates over just in the data movement compared to like If Q is 4 like 4 times P gates just in the adder
A
multiply adder and Sorry, where is the three coming from?
B
Three different inputs here.
A
Got it.
B
Okay, so the case really, just what I'm hinting at here is that all of this work which scales as the size of the register file, and this is a very small register file. All of this work just moving the data from the register file to the logic unit is many, many times more expensive than the logic unit.
A
In the most recent Clustermax report, semi analysis ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier. Semianalysis found that gold tier providers like Crusoe had a total cost of ownership that was 5 to 15% lower than silver tier ones, even when they had identical GPU pricing. This makes sense because total cost of ownership is downstream of a bunch of different things that don't necessarily show up in the sticker price, but that Crusoe has optimized things like how well you detect faults and how quickly you replace failed nodes. For example, Crusoe was one of the first clouds to adopt NV Sentinel, Nvidia's own GPU monitoring and self healing software for enhanced GPU uptime utilization and reliability. This that's Crusoe makes use of everything that Nvidia has learned about why chips fail across all their different fleets and deployments so that Crusoe can catch faults earlier in the process. And once they identify a failure, Crusoe can swap in a healthy node in less than 10 minutes. Because you're not running bare metal, Crusoe doesn't have to spend time installing an operating system or configuring drivers. They can just spin up a new VM on an already running and pre qualified host. If you want to learn more about this or the other reasons that Crusoe made Gold Tier, go to Crusoe AI Dwarkesh. It might be helpful to just see what a MUX looks like. Maybe like a two bit or a four bit mux.
B
Yeah, great. So we'll take some inputs, we'll have maybe like we'll just do a two way mux. So we've got two different numbers, We've got these two inputs and then we have a. So these are the inputs that are being selected between. And then we have a selector. Which says which can either be like I want this one or it could be I want the other one. So this is a one hot encoding. So this is what we all start with. And then the output we want to produce. Like let's focus on this case. So this is the actual input we got. Yeah, we just want to produce this guy as the result. And so like sort of very laboriously what we do is we and this bit with all of these and so that produces like anding this bit with this row and likewise we and this bit with this row that produces all zeros. So this was the. There's 4 ands here, there's 4 ands here. And then finally we just. Or these two together. When this gives a 1 we or these two together, this gives a 1 we or these 2 together gives a 0 we or these two together. And it gives a 1. And so this is the 4 ors. So like this actually ends up looking a little bit like addition. In fact, like we did exactly the same set of ands here, sort of We've added all of these things together, but then instead of collapsing it by using these full adder circuits, we just get a very simple collapsing with or gates.
A
But I guess that doesn't look like
B
N times p. Yeah, so this was with N equals two inputs.
A
Ah, great.
B
In the general case we will have nice like and so this is N rows and then we'll have P bits per row. So that gives us the N N times P many and gates. So this circuit I've described here, almost all of the cost, like seven eighths of the cost is in the reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself. So this is the problem to solve. This essentially was the state of play prior to the volta generation of Nvidia GPUs. This kind of thing is what was inside the CUDA cores. And this sort of problem statement is what motivated the introduction of tensor cores, which are more generically called systolic arrays. So if we think about how are we going to solve this problem, we're spending almost all of our circuit area on something that we just really don't care about and is hidden to the software programmer. And the thing that we actually care about is not much of the area. Well, make this one bigger somehow while keeping this at the same size. That's the goal. So sort of the evolution was like we had baked this much into hardware in this stage. This single line is a multiply accumulate and this single thing was baked into hardware. The idea of a systolic array is to sort of go two levels of loop up and bake this entire loop out here into hardware. And so the idea being that if we have a much bigger granularity fixed function piece of log, maybe the taxes we pay on the input and output are much smaller.
A
It sounds like you're suggesting that if you go up one step in the matrix multiply loop, that there's some you can tilt the balance more towards compute than communication.
B
That's right. So there's two effects that we're going to take advantage of here. One is just that we can do more stuff per every trip through a register file. And then the other thing we're going to take advantage of is in fact in some of this loop we can take advantage of for example certain things staying fixed. So let's sort of visually we're going to look at this matrix multiplication. So this portion of the loop corresponds to a matrix vector multiplication in fact. So we'll take a matrix and multiply it by a vector. So how do we do this? We take every column gets multiplied by the vector and then summed. So we're going to sum sort of along columns. So this 0 and 3 gets multiplied by the 3 and 7 and gets summed. And then the 1 and 2 gets multiplied by the 3 and 7and gets summed. So there is a multiply accumulate associated with every single one of these entries in the matrix. So we'll just draw out these four multiply accumulates.
A
And just to make sure I understand why there's four multiple accumulates. So if each entry in the column that corresponds to the output vector is a dot product, and in this case it'll be like two multiplications and then the addition of those two multiplications. So like you're accumulating.
B
Yeah, so the addition. So really there's only one addition per dot product. But like we like to start with
A
the initialization of zero.
B
Yeah. So what we're going to aim for is to have. So we want to have quadratically more compute. We do. We have. We have got sort of X times Y as much compute as we had before. But we're going to want to somehow aim for having only X times as much communication. And this is sort of the intention so that we get this advantage term going as Y. So we've laid down the multiplications bringing in like we're going to want to bring in a vector of size 2. And so that sort of already is in line with our comms target. That's fine. However, we need to somehow manage the communication of this matrix which exceeds our budget of X. And so the idea is that in an AI context, this matrix is actually going to stay fixed for a long period of time. And so instead of bringing it in from the outside, so we've got some register file sitting over here. We don't want to have like the amount of stuff coming out of this register file. This is the term that we want to go sort of as X in some sense. We don't want to bring this full matrix in from the register file every cycle because we don't have enough that would cost too much in terms of wiring from the register file. And so instead we're going to store. Our key trick is that this matrix can be stored locally to the systolic array. And so where we will store these numbers 0, 1, 2 and 3 in just like a gate called a register that physically stores these numbers. And we're going to reuse these numbers over and over again for a large number of different vectors.
A
And so the optimization here is that the nature of matrix multiplication is you can store this square quadratic thing directly where the logic is happening, which is higher dimension than the OR has an extra dimension compared to the inputs which you keep swapping in and out.
B
That's right.
A
I mean, this is the nature of what a matrix multiplication is, is that you do a lot of multiplication to get one value out. Like a dot product is the result of a lot of multiplications. And so that optimization means that you can stuff a lot of multiplication in before you get some value out of it.
B
That's right, yeah. So just to complete the picture here of concretely how that looks, I swapped the three and the two here, three and two. So just like this, zero and three is going to multiply by the three and seven. And so we're going to form a dot product sort of along columns here. So somehow we're going to feed a three and a seven in here. These participate in sort of. This feeds into this multiplication and also feeds into this multiplication. Likewise, the three feeds into here and also into here. And then we're going to sum up sort of along here with like starting at the top of a column, we feed in zeros, and then coming out the bottom, we get results coming out. So sort of just to visually see what we've got, there's a dot product that is performed along columns in a matrix. And that that sort of maps exactly to what is done spatially in the systolic array here. So this is one dot product summed vertically, and this is a second dot product also summed vertically. And then what is the data that needs to go into and out of the register file? We have X amount of data that's coming out here on the output. And we also have sort of this data coming from the input X amount of data from the input. And so with respect to the input and output vectors, at least we've sort of met our goal of having only X as much data going in and out of the the register file. This leaves open the question of, like I said, that the weight matrices, weight matrix is stored locally in the systolic array. How did it get there in the first place is sort of like at some point you need to boot your chip and populate this data. And so where did that come from? The trick is just we just do it very slowly. So we very slowly trickle feed it in into the systolic array. The sort of, the simplest strategy is that we sort of run this daisy chain that says like Feed a number into here, and then on the next clock cycle it'll move down to the next entry or the systolic array. And so we can do that in every column in parallel. And that gives us the sort of. This is also going to come from here. And this is going to be another factor of approximately X units of bandwidth coming in.
A
Would you mind repeating that sentence one more time?
B
So we know that we're going to be bringing in numbers only rarely into the matrix. And so we just want to come up with any construction at all, such that the amount of wiring that actually feeds into sort of crosses this boundary of this histolic array, like this boundary right here. We just want to keep that bounded to X and not go as xy. And so a particularly simple strategy is that we we sort of bring in a number into the top row of the systolic array. That's what we can do in one clock cycle. And then for Y consecutive clock cycles, we're going to be bringing in the top row every time and then sort of shift all of the other rows down by one. And that keeps the wiring that needs to come from this expensive register file only down to a factor of X rather than X. I see.
A
Okay. So there's two questions in terms of communication. There's like communication time and then there's communication bandwidth.
B
Yes.
A
And you're saying since we're only going to be loading this in once, let's minimize bandwidth. That's right, because bandwidth equals die area and let's just load it in slowly over smaller lanes because we're just going to keep this value in there for a while.
B
Exactly.
A
Interesting. So it's interesting to me that when we were talking last time about inference across many chips, the big high level thing we're trying to optimize for is increase the amount of compute per memory bandwidth, that is to say, per communication. And here also we're trying to increase the amount of actual multiplies or actual additions relative to transporting information from registers to logic. So in both cases you're trying to maximize compute relative to communication.
B
Yeah, this shows up all the way up and down the stack. This is sort of close to the bottom, sort of like to the gates. There's sort of a version that's maybe even closer to the gates of just like even the precision of number format that you choose to use. We saw that same effect. There's like a square cube law or a squared versus linear term going on, both in just purely the precision of this alu, but then also in terms of the size of the matrix.
A
Yeah, very interesting.
B
So this unit is sort of the next bigger unit we had like the multiplication circuit and then on top of that we have a pretty large systolic array. I drew it as two by two, but in for example older TPUs they were described as 128 by 128 of this circuit shown here. And this circuit ends up being. This is the most efficient known mechanism for circuit for implementing a matrix multiply.
A
I see. So we've talked about sort of. It seems obvious that you should try to maximize compute relative to communication. What are non obvious trade offs that actually keep you up at night about should we do X or should we do Y? And it's not obvious what the answer is.
B
Yeah, so I mean I think most of the decisions in chip design are sizing decisions. And so already in what we've drawn so far. So AI chips all have this circuit in it. They have a systolic array and then somewhere near it a register file providing inputs and outputs. The two sort of like even within this scope sizing questions that you have are how big should I make my systolic array? And how big should I make the register file and then the trade off for the size of the systolic array. Actually these two questions are coupled is one way to think of it is to say I'm going to have a budget for what percentage of my chip area I want to spend on data movement. So maybe I just say that I want this to be 10% and the Sysdloc array to be 90% and then sort of I can size my register file bigger. Register files are more flexible. They allow me to run sort of more, I can get more application level performance out. But then they sort of take away from this area spent on the systolic array.
A
Yeah, makes sense. I recently ran an essay contest where I asked people to write about what I consider to be some of the biggest open questions about AI. The submission window closed last week, so I used Cursor to create a couple of different interfaces to help me review the entries. One interface anonymizes submissions and hides unnecessary information. It lets me group responses by question, add notes and record my scores. The other interface helps me review entrants who also want to be considered for the researcher role that I'm hiring for. The UI puts the applicant's essay right next to their resume and their personal website so that I can see everything at once. Cursor's hardness is really good at helping these models see and improve their UIs. I watched it render these interfaces in the built in browser, take screenshots, click through sections and keep iterating. At this point, cursor is where I do most of my work. Whether I'm reading and visualizing a bunch of research papers, or coding up an interface to review applications, or making flashcards for my blackboard lectures, Cursor just makes it very easy for an AI to look at whatever I'm looking at and help me understand it and work with me on it. So whatever you're working on, you should do it in cursor. Go to cursor.com dwarkash where does the clock cycle of a chip come in? What determines what that is? Yeah, and what is a clock cycle of a chip?
B
So I guess at baseline it's sort of worth observing that chips are incredibly, incredibly parallel, right? You've got 100 billion transistors in a chip. A key thing that you need to do whenever you have massive parallelism is you need to synchronize between the different parallel units in software. Typically you, you have these very expensive synchronization methods like a mutex. So one thread will finish what it's doing, it will grab a lock somewhere stored in memory, and then notify the other thread that it's done. On chips, we take a very different approach and say that every nanosecond or so, all circuitry in the chip will kind of pause for a moment and then synchronize every. So it actually synchronizes every single nanosecond or so. And so that is the clock cycle. The entire chip, typically all in sort of one fell swoop, goes in lockstep to the next operation that happens. And so what this looks like in circuitry is that you will have this typically drawn. So the clock is sort of mediated by registers, which are these storage devices that we've drawn elsewhere. And the way to think of it is that I have, I have some storage which is storing like a bit, which might be 0 or 1, and then I have some sort of cloud of logic which maybe is like this systolic array or this multiplier or something like that. And then I've got some. And that's going to produce some output. So my inputs, I've got a bunch of inputs feeding to this cloud of logic. And then eventually later there's going to be some output register that this writes to. There is a global clock signal which drives all of these registers. And it says at a certain instance in time, when the clock strikes Whatever value happens to be on this wire at that instant, that's what's going to get stored in there. And so the sort of, the challenge here is like, I, I would like to have my clock speed run as fast as possible, because if I can run at 2 gigahertz, I can get twice as many operations done per second than if I run at 1 gigahertz. But what that ends up meaning is that I am very sensitive to the delay through this cloud of logic because any computation that is going to happen in here needs to sort of finish before the next clock cycle hits. So a major point of sort of optimization on any chip then is to sort of make this delay, delay from here as short as possible.
A
Interesting. And is there ever. Because the constraint here seems to be that if you add too much logic, then you might risk missing the clock cycle. But if you don't add enough, then you're, you know, leaving potential compute on the table. Is there ever a situation where you're like, you'd take a probabilistic chance that a computation finishes, or is it just like. No, either it's going to fish by clock cycle or not.
B
Yeah. In standard chip design, you margin it such that, I mean, there is a probability, but it's like many, many standard deviations, like way standard deviations out such that for all intents and purposes, it is a reliable part. It will always meet the clock.
A
Clock.
B
There are some weird exceptions to that. There are clock domain crossings where you go from one clock to the other clock. And then you actually do have to reason about this probability. But interesting, in the main path, you just margin it such that you'll get there 25% of the clock cycle in advance, so that it's very unlikely that.
A
Where the clocks synchronize, I guess, where the registers are. This is not something you determine as a chip designer. This is sort of just like an artifact of, hey, I want whatever sequence of logic. And then the software you use to convert your Verilog into the thing you send to tsmc that just determines like, hey, in order to make this work, you got to put a register here, here and here to make sure that there's no one step that is too long, such that it makes the whole clock cycle of the entire chip longer than it has to be.
B
Yeah, so this is actually a huge part of the work of designing a chip actually is inserting them. So it is done in a combination of manually and automatically. So I mean, just to show the very sort of dumb version of what you can do here, you can take this logic and split it in half. And so say, actually, instead of just one cloud of logic, I'm going to have two smaller clouds of logic which do the same thing, but split them up by a register feeding in like this. And this is, if you split it in the middle, you can hit twice the clock frequency. That's great. You get twice the performance at the cost of this extra register and so at the cost of some more storage.
A
And so stepping back, why do we need to synchronize the whole chip? If you imagine playing Factorio or something, there's no global clock cycle. Shit is done when it's done. There's iron on the plate. You can take it if you want.
B
Yeah. So taking that analogy, the thing that you need to be mindful of is if I've got two different paths through some logic, so I have to do a computation like F here and then computation G here, and then they're going to come and meet for computation H somewhere here. And so there's going to be manufacturing variance here. In some chips, F will take a little longer. Maybe in some chips G will take a little bit longer. And so if I've got some signal that's propagating through here and the result from F and G have to sort of meet up at H, the thing that can go wrong is that F can get there early and it meets like the previous value of G or the next value of G or something like that.
A
And H needs to know when to start exactly when has this next iteration of. And so this explains why different chips made at the same process node, the same TSMC technology can have different clock cycles. Two chips made at 3 nanometer might have different clock cycles based on whether they were able to optimize, making sure that there's no one critical path that is so long that it slows down the whole chip's clock cycle.
B
That's right. This optimization that I showed here, this is just this sort of pipeline register insertion. It's called we've inserted in the middle of the pipeliner register here. This is a sort of pure trade off between clock speed and area. This is the easy case. There is a harder case too, which is. So I've drawn it as a pipeline of logic here. But in other cases you may have some calculation which actually feeds back in on itself. So it runs some function F and then writes back to itself like this. So for example, this might be this addition. Like you've got some number that you're adding into Every clock cycle. And so this could be like plus, where adding in some number every clock cycle. So this little circuit essentially is just going to sum all of the numbers that get presented on different clock cycles. And the challenge is, if this plus takes too long, what can I do? If I split it in. If I try and put a pipeline register right in the middle of it here, in the middle of it, this will end up changing the computation that's done. Instead of forming a running sum of everything that comes here, I will actually have two different running sums. I'll end up having a running sum of the even numbers and a running sum of the odd numbers. So sort of this constraint where I have a loop in my logic, which all chips have somewhere. This is actually the thing that is the hardest thing to address and sets the clock cycle.
A
I don't understand why it'd be a problem to have that, or I'm not sure even what it would mean to have a register there. Because it's a sort of atomic operation, right? Yeah.
B
Well, so plus is not really atomic,
A
as you just demonstrated.
B
Yeah, yeah. It took a whole lot of work to do a summation. And so you can take the early parts of that work and then stick a register in the middle and then do take the late parts of that work.
A
Got it. Okay.
B
Yeah.
A
And I guess it's then up to. So TSMC offers a PDK which desires, hey, here's the primitives of logic that we can grant you in the chip. And it's up to them to determine that no primitive is bigger than the clock cycle. They're hoping a process node target. But other than that, is there like, what further optimization. Can't you just say, like, hey, here's all the primitives from tsmc, and
B
keep
A
adding registers in between the primitives as much as is needed until you get to your desired clock cycle.
B
Yeah, As a logic designer, the chip architects set the clock cycle. So just for one example, the primitives you get from TSMC are on the order of like and gates or full adders. They depends a lot on voltage and which library you choose and so on. But generally you can typically have about like 10 or 20 or 30 of these in a clock cycle sequentially. So these primitives are very, very fast, like 10 picoseconds or something like that. And so as a logic designer, I mean, in principle, if you literally just had register and then and gate kind of in a loop like that, you could get an insanely fast clock speed, like more than 4 or 5, 6 gigahertz something like that. But if you take this really sort of simple circuit and you look at the area you're spending here, this is maybe like one. I mean, this is called one gate equivalent in size. So like unit of one in area, and this thing is like unit of eight in area or something like that. And so this is just again, almost all of your cost has been this synchronization or communication cost compared to the actual logic. And so this would be a case where you've gone too far. You've made your clock speed really, really fast at the cost of spending almost all of your area on. On pipeline registers.
A
Interesting. So what you're hinting at is a dynamic where you can have really fast clock speed, but you're not getting that much work done.
B
Yeah.
A
And so you can have like low latency, but low bandwidth or throughput, rather.
B
Yeah, it hurts your throughput, in fact, because the throughput of your chip you can think of as the product of how much do I get done per clock cycle, which is based on this area efficiency thing, times how many clocks I get per second.
A
This is actually so similar to the thing were discussing last time about batch size, where if you have a low batch size, then any one user can receive their next token really fast, but the total number of tokens that are processed in, say, an hour will be kind of lower than it could otherwise be.
B
Yeah, exactly. You get less parallelism out if you drive your clock speed up really high.
A
Language models are starting to compete against the best human forecasters. I sat down with two senior Jane Streeters, Ron Minsky and Dan Pontecorvo, and asked, at some point, does AI just do what Chain street does?
C
There's a world that we should take seriously where, you know, we're going to build large language models or some other AI systems that are like, strictly smarter than all humans on the planet and more capable at all cognitive tasks. Trading in particular feels to me as like kind of AGI complete, sort of like NP complete. Because at the end of the day, trading involves figuring out what things are worth, which means making predictions about the future.
A
Jane street isn't betting against AI. They just signed a $6 billion compute deal. But Ron's view is that the edge keeps moving.
C
I have never been more desperate to hire more engineers and more traders than I am today. You know, you have the usual thing of, like, the other hard parts that we don't yet know how to automate. Well, that ends up being where the competitive edge lies.
A
You can find these open positions and Watch the full interview@janestreet.com doorkesh okay, so I remember talking to an FPGA engineer at chainstree@clark, who actually helped me prep for the previous interview we did together. And he was explaining why they use FPGAs. And I imagine that for high frequency trading, throughput is less important than latency. And so having very specific control over the clock cycle in a deterministic way is the most important thing. Maybe it'd be interesting to talk about why you can't just achieve that with an asic, or why FPGA is the why you might use an FPGA to have determinants to clock cycles for high frequency trading.
B
Yeah, so I mean, firstly, let's consider the business case for an FPGA versus an ASIC. FPGAs and ASICs use largely the same sort of conceptual model, which is that I have a series of gates built from ands ors xors, those like very small primitives connected together with a fixed clock cycle and connected together with wires that are running in a fixed clock cycle. So anything you can express in an FPGA you can express in an ASIC too. And it'll be about an order of magnitude cheaper and better energy efficiency on an ASIC than an fpga. The trade off is that the first FPGA costs you $10,000, whereas the first ASIC you make costs you $30 million because it requires an entire tape out. So sort of the business use case for an FPGA would be that I want something that has this very deterministic latency and fast runtime and high parallelism. But I'm going to change it very frequently, change what I do every month or something like that. And so then I don't want to pay the gosh every time. Now, how does an FPGA actually implement. It's sort of like it emulates the ASIC programming model, but in a fixed piece of hardware. And so how does that work? Actually? So what it has at the base is it's got the two components we just talked about. It's got these registers as storage devices, and then it's got These are called LUTs, lookup tables, which actually provide all of the gates. And then we're going to see even the third component. We then have sort of a swarm of these registers and LUTs, And all of these are available and then they are connected by sort of this big set of sort of muxes. So in front of every single one of these, we've Got something like one of these muxes which selects one input from everywhere else, sort of selecting from all of these different things. You know, we've got a whole bunch of different options feeding into all of these things.
A
So
B
what this allows is essentially when I program my fpga, I can say that I'm going to take all of these components and I'm going to sort of superimpose on top of this a particular wiring which goes through this LUT and then feeds into this lut and then goes to this register and then feeds into this LUT or something like that. So what I've drawn in orange is like how you like. FPGA means field programmable gate array. The orange is what has been programmed in the field, whereas the white is all of the wires that must exist in the FPGA in order to actually make the device in the first place.
A
What does it mean to be programmed into a field?
B
Like programmed in the field. So the device is being deployed in a data center, it's sitting in the field and then you can come and program.
A
Not field as in electric field, as in out there in the world field. So if I look at the how the field programming comes out of the first lookup table and goes into the second one, how is it, how, where
B
are the wires that made that happen?
A
I guess.
B
Yeah, so I got a little bit lazy and drawing all of these. Every single device here has a MUX sitting in front of it which can select from all of the nearby like circuits that are available. And so the actual configuration of the FPGA amounts to it. It is the MUX control. So in this MUX here we have the data inputs and then we have the control that selects. And so there's a little storage device sitting next to every single one of these muxes, saying this is where you're going to source your input from.
A
Got it.
B
And so programming it consists of configuring every single one of these muxes.
A
So that makes sense what is happening inside of the lookup table.
B
Yeah, so the purpose of the lookup table, so it's going to also have a little bit of control feeding, telling it what to do as well. The purpose of the lookup table is to function, to be able to configurably take the role of an and gate or gate Xor, any of those different things. So there's many ways you could consider doing that. The way it is done in sort of traditional FPGAs is to say it will support. So it will be a lookup Table will be. It will have four bits of input, one bit of output. How many different functions are there? From four bits to one bit, there are 16 different functions. And so you can actually just tabulate this as like 16 different numbers. You've got a table of 0111 001, 16 entries. And so what it does is this table is stored in this blue configuration bit and then it views these four bits as binary, looks up the relevant row of the table and emits that bit. So this is a truth table view of lookup tables, essentially.
A
Okay, so the lookup table, if you think about an and gate or gate nor gate XOR gate, these are all like take as input.
B
Those are like two input functions. Yeah, Sometimes we have like more complicated, like a three input function would be a three way XOR or a four way xor.
A
And in this case how many, it just depends how big it is.
B
But typical size for LUTS is 4 input, which is sort of just a sweet spot between. There's another computer communication trade off. Like here. Like if it has too few inputs, then you need to use more luts if it has too many.
A
But basically the lookup table is like a truth table. It's a truth table and with a truth table you can program in any gate you want.
B
That's right.
A
And so it's a lookup table. Just think like a programmable gate.
B
That's right. One of the things you can do here is you can see why the rule of thumb that an FPGA is like an order of magnitude more expensive than an ASIC comes from is to count how many gates would be inside this lookup table.
A
Yeah.
B
So we can view this lookup table essentially as one of these muxes. And so it is a mux with has to select between 16 different values. And so it is a mux with sort of n equals 16 options. P equals 1 bits. And so what we saw way earlier is that this circuit costs like N times P many gates. And so it's like. So it costs like N times p equals 16 and gates and also 16 ors.
A
This circuit being the mux.
B
Yeah, exactly. The mux.
A
The mux is the core, the mux that goes into the lookup table.
B
So the lookup table itself you can think of as being actually a big mux that selects from all 16 rows down to one output.
A
Yeah.
B
Okay, that is the lookup table.
A
But the way you've drawn it here, there's like a mux and then a lookup table.
B
It's muxes all the way down. So I mean there is a second mux that is inside here. This mux is this mux.
A
Got it.
B
Okay.
A
And then the other mux is just
B
saying where it came from in this sort of mess of cases. Right.
A
And then the second mux is. Okay, now you have one value, but that value is still a four bit value.
B
Yeah. So I've selected four bits from the soup. Right. And then I use those four bits to select which entry in the lookup table I'm going to use.
A
Right. Okay. So suppose in the first mux there's like eight nearby. You're pulling from eight nearby registers as input. And so that's a total of 32 bits going in. And then out of that, four bits come out. Those four bits go into the second MUX, which is inside the lookup table.
B
So actually I would say in this case these registers are single bit registers. So if there are eight nearby registers and lookup tables, then I have eight bits total coming in nearby. I select from eight down to four different values. So there's actually like four different muxes, one associated with each of these input. Like little mux associated with each of these input bits. Each of them is selecting one out of eight.
A
And what are those eight coming from
B
nearby registers and other LUTs.
A
And each register is one bit.
B
Yes. Yeah.
A
And so I guess AMD or whoever makes these FPGA still has to be opinionated about what registers should connect to which registers. And then you can program in the actual gates. But they had a wire in the communication topology, right?
B
Yeah. So there's the sort of like you get flexibility in a local grain thing. There's a sort of nearby neighborhood where you can select from. But then more grossly, like, more coarsely longer distance connections, they form an opinion on that.
A
And the reason it's 10x slower is why.
B
So if you look at the cost of building this lookup table, it's like 32 gates and then it can give me the equivalent of what's an interesting thing I can do here. I can do a four way and gate. And so that's like I'm using 32 gates of lookup table to sort of implement a four way and means like what is a four way? And I would do like, and, and, and then and of. And so to imp. Like this is a circuit that I could implement in an ASIC directly using these three and gates, but using a lut I can also implement it, but it's going to take like these 32 gates instead of three.
A
And so the overhead is really coming from the fact that the lookup table, the MUX and the lookup table is there's a more concise way to describe a truth table than listing out every single possible combination of inputs, which is just to write out the gate.
B
Yeah. To place down the polysilicon and the metal.
A
That's right, yeah. Interesting. One important point he made to me is that the reason they prefer FPGAs to CPUs is because they get deterministic clock cycles. They know when a packet will come in and go out. Why is it not a guarantee in CPUs?
B
So you can actually design a CPU that has deterministic latency as well. And in fact, the processors that are inside a lot of AI chips actually also have deterministic latency too. GROK has advertised this. TPUs have that in the core as well. The challenge is getting sort of deterministic latency and high speed at the same time. And so where does the non determinism in latency come from? Non deterministic latency comes from specific design choices. In a cpu, it's actually possible to remove those design choices and make a CPU that has deterministic latency. Those are not very attractive in the market. And so people don't make those CPUs anymore. But. But actually, in some sense, deterministic latency is maybe sort of a simpler designing starting point. And then some chip designers have added things in to be non deterministic. To take a concrete example of that, probably the most important example is on a cpu, just like the CPU cache itself. So. In a cpu, you have the cpu, this is the CPU die itself. And then there is a memory off on the side. This is the DDR memory off on the side. And then you have a cache system here. Inside it is the cache. That sort of remembers recent accesses to DDR and stores them. And so when I'm running through my CPU instructions, every time I have an instruction that accesses memory, it first checks in cache. Was the data stored in cache, and then if not, it fetches out to DDR. This is a huge optimization. The cache is like two orders of magnitude faster than the DDR. If you never use the cache, basically all programs would run 100 times slower. So the presence of a cache is absolutely necessary for a CPU to run at reasonable speed. But whether or not you get a cache hit is dependent on the sort of ambient environment of the cpu. Like what other programs are running, what has run recently? What is the random number generator inside the cache system doing. And so that is a big source of non determinism in the runtime of a cpu. So this is sort of the memory system for a cpu. The big thing that you can do differently is instead of having the hardware say I'm going to read memory and then decide, the hardware decides whether or not it comes from cache or not, you can actually bake this decision into software. So a different design philosophy. And you see this in maybe for example, TPUs. The TPU instead has. I mean I'll draw the same diagram but I'll call it a Scratchpad. And so the main difference is. So this would be like a TPU and then HBM in this case rather than DDR, but it's still an off chip memory. And instead of the software saying first access memory and then the hardware decides you've got some instructions that go here. This is like one kind of instruction and then a totally different kind of instruction that goes to hbm.
A
Yeah.
B
And so this style is generically known as Scratchpad instead of cache. The key distinction being that you have one kind of instruction that says read or write Scratchpad and a totally different instruction that says read or write hbm.
A
So Scratchpad being the cache.
B
Yeah, this thing here is the scratchpad,
A
just to be clear. So stepping way back, people say computers have the quote unquote John von Neumann architecture where there's this serial processing of information. And maybe just because we've been talking about parallel accelerators, but I just don't like the FPGA is super parallel. The kinds of AI accelerated TPUs are super parallel. Even CPUs are super parallel if you think about all the cores they have. And so is it actually. In what sense is modern hardware actually the von Neumann architecture? Is that actually a fair way to describe modern hardware?
B
I think it's a fair way to describe CPUs. Like just the amount of parallelism, like on a CPU, the amount of parallelism you get is about 100 cores times maybe like 16 way vector unit. So about 1000 way parallelism on a CPU.
A
Yeah. One question is there is a die that is being used for the CPU and if there's fewer threads, just as a matter of transistor voltages or switching on and off, Is it just that there's literally one control flow, like a small part of the die where voltages are switching on and off? Or how do you actually occupy the die area of a CPU if there's only.
B
As opposed to if there's so few cores. Am I spending what's happening in there. The cores are just much bigger and more complicated. I guess we should compare a CPU core which takes up 1/100th of the die to. I mean, to a LUT. Like a LUT is just only these 16 gates. So it's clear why there are so many more LUTs in an FPGA than cores in a CPU. But then sort of maybe, why are there more CUDA cores, for example, than CPU cores? I think would be like, what's the difference between a CPU and a gpu? Or something like that would be a big difference. Inside the cpu you have so one big use of. So the sort of the top uses of area in a CIDA CPU are the cache. Very little is actually the alu's. Like, mostly it's like these register files rather than the logic units. And then both of these things have equivalents in a gpu. And so that's not a big difference. But the thing that does not have an equivalent in a GPU is the sort of this branch predictor. And so there is a whole big area in the CPU which is sort of just a whole bunch of predictors that are saying, when will my next branch be? And where is the branch target for that? And so stripping a lot of that out, as well as sort of making these register files tighter, in a sense is driving a lot of where the GPU gains are.
A
And what is the purpose of the branch predictor? To execute both branches at once, or what does it do?
B
So the issue is that when I've got a series of instructions, like instructions, instructions, instructions, instructions. If I have a branch like here, if this instruction is branch, The actual processing step of processing an instruction takes a really long amount of time. It takes like maybe 5Ns or something like that. So the time to actually notice that I've got a branch and then evaluate the boolean whether it's true, and then update the program counter to the new target and then read from the instruction memory for that, that could take actually 5Ns to finish. And so in reality, this may finish somewhere down here. I don't want to, but I want to run a clock speed that is much faster than what 5Ns allows. 5Ns is 200 MHz clock speed. I would like to run at 1 or 2 GHz or something like that. And so I need to run other instructions while the branch is being evaluated. And so I really just want to keep running the following instructions that happen after me, but that might have been wrong. If the branch ended up being taken, then I need to know that instead of evaluating these instructions, I actually need to jump to wherever the target is and run these instructions instead. And so the purpose of the branch predictor is genuinely to predict based on before you even get to this destruction, to be like five cycles earlier to predict there was going to be a branch that's going to happen.
A
So if I think about how the brain works versus what you're describing here at a high level, the differences might be that while you can do structured sparsity in these accelerators and then save yourself some area that you would have otherwise had to dedicate to these gates in the brain, there's unstructured sparsity. Any neuron can connect to any other neuron and not in ways where they'd be column aligned or whatever. Then there's the fact that memory and computer are co located, I guess you could say in a way the memory and computer are co located on this
B
is exactly the co location in some sense of the memory computer.
A
That's right, yeah. Maybe that actually isn't a big difference. And the other maybe a big difference is that the clock cycle on the brain is much slower than on computers. And partly that's to preserve energy because the faster the clock cycle, the bigger the voltage needs to be in order to identify for the signal to settle and to identify what state of transistor is then.
B
That's right.
A
I don't know if you have other high level takes about how. Any commentary on what the brain might be doing versus how these chips work?
B
Yeah. So let's take the clock speed one first. Actually the clock speed is quite high on a chip because that drives higher throughput like. Like when we compare a GPU running some workload, it's running batch size 1000 or something like that, whereas the brain is not running batch size thousand. There's only one of me. And so you could sort of imagine saying, well take a GPU and instead of running at a gigahertz, run at a megahertz or something like that. And that would start to look maybe a little bit more like sort of equivalent things that you're talking about in the brain. There is in the way that silicon works. There are like that does not give you a thousand x advantage in energy efficiency. So what it ends up looking like is you can like you sort of just end up running this circuit once to stabilization and then it'll sit idle for a long period of time. It doesn't consume A lot of energy while it's sitting idle, because most of the energy is consumed in, in sort of toggling bits from zero to one and back. So actually, let's talk about the energy consumption of a circuit like this. The way to think of a bit being stored is you've actually deposited some charge in a capacitor somewhere, sitting somewhere in the chip implicitly. So it becomes charged when it becomes a one and then it becomes discharged when it next goes to a zero. And that cycle of charging the capacitor and then dumping that charge out to ground, that is where the energy is consumed. This is called the dynamic or switching power. This is most of the energy consumption of a chip. There is some other energy consumption just coming from the fact that insulators aren't perfect insulators, but we'll discard that. Most of the energy consumption actually comes from just the charging and discharging of toggling from zero to one and back to zero.
A
Right, right.
B
So if you run a chip much slower and you only clock it once every thousand clock cycles or something, you will have a thousand times fewer transitions. It'll be about 1000 times less energy consumption, but not a substantial advantage in energy efficiency.
A
Okay, so you've described how a TPU works at a high level. What is the difference at a high level between how a GPU and a TPU work?
B
Yeah, so I mean, I think there's sort of a high level organization principle that is different and then there's sort of inside the cores what are different. But we'll look sort of outside at the high level. So we'll take a GPU and a tpu. And what does the top level block structure look like? If you think of this as the whole chip in each case, the organization of the GPU is mostly a bunch of almost identical units which are these, these are the SMS. And then they've got an L2 memory in the middle and then a bunch more of these SMS on the bottom. And so there's sort of this fairly regular grid of cores. And then like if we look at a TPU in, in comparison, you end up with much coarser grained units of logic. And so you end up with something like some large number of maybe just a few matrix units. These are the big systolic arrays. And then in the middle you've got some vector units, and then you've got your matrix units at the bottom. So now sort of like matrix units with a vector unit in the middle, sort of. This is the whole TPU chip, you can sort of think of scaling this thing down into a really tiny unit with a smaller matrix unit, smaller vector unit, and that is sort of what an SM is. So sort of at a very high level point of view, the GPU has a lot of tiny, tiny TPUs sort of tiled across the whole chip.
A
So you're suggesting the tensor core within a streaming SM is analogous to an mxu?
B
Yeah, it's very similar.
A
Yeah, I see. And so if you had more lack of structure, having a bunch of tiny TPUs makes a lot of sense. Whereas if you kind of just have huge matrix multiplications, you're like, why don't we avoid the cost of having the individual SMS with their own registers and warp schedulers and things like that? Why don't we just make a huge thing and amortize those costs across the whole thing?
B
And I mean, I think this shows up in how large you can grow things. We've sort of seen this theme, especially with a systolic array, where larger systolic array amortizes the register file costs better. This sort of design allows you to have larger systolic arrays, whereas the sort of GPU design constrains you to having small units of everything. There is a trade off, however, there ends up being, because of this sort of coarse grained separation of things there, you need to move a lot of data from the vector unit to the matrix units. And so you need to move a lot of data through a sort of like two lines of perimeter here. Whereas if you sort of look at the equivalent thing here, you've got vector units everywhere and you need to move data through this line, through this line, through this line, through this line, through this line, through this line. So the amount of data you can move between a vector unit and a matrix unit is actually much higher in a GPU than in a tpu. Because. Because instead of having to move all the data through these just two lines, you're moving all these data through 16 lines or something of wiring instead in a GPU.
A
Right. But also you might have to move across less area, which I mean is
B
also a saving, like it's an energy saving. So data ends up moving like if you can operate entirely within an sm, the data movement is much smaller, but then the moment you want to operate across sms, it becomes sort of more complicated and expensive.
A
So you don't have to comment, but one might expect that a thing Maddox might try to do is to get the GPU like smaller structure of systolic arrays surrounded by sram, but also at the same time, make it so that the things you need in an SM to support the CUDA architecture, but take a bunch of space you might discard.
B
Yeah. We've talked publicly about something which we call a susplitable systolic array, which is sort of, in some sense, you can think of as like, big systolic arrays that can be small systolic arrays as well.
A
Cool. Okay. I think that's a good note to close on, Rainer. Thank you so much.
B
Thanks, Rakesh.
Guest: Reiner Pope (CEO, Maddox)
Host: Dwarkesh Patel
Date: May 22, 2026
This episode is a deep dive into the physical and architectural fundamentals of AI chip design. Reiner Pope, CEO of Maddox (an AI chip company), returns to explain—layer by layer—how computation occurs inside a chip. The focus is on the actual hardware implementing modern AI workloads, demystifying the journey from basic logic gates to full-scale systolic arrays. Throughout, the dialogue draws out key trade-offs, practical engineering decisions, and the surprising dominance of data movement costs over raw arithmetic logic in chip design. Discussions oscillate from multiplication at the hardware level to macro-architectural choices in GPUs, TPUs, FPGAs, and comparisons to biological brains.
"The main function that AI chips want to compute is multiplication of matrices. And really inside that is the fundamental primitive of multiply-accumulate..."
— Reiner, [00:44]
"The big observation you've made is that there's this quadratic scaling with bit width, which is very effective and is the single reason why low precision arithmetic has worked so well for neural nets."
— Reiner, [15:12]
"...all of this work just moving the data from the register file to the logic unit is many, many times more expensive than the logic unit."
— Reiner, [21:18]
"The idea of a systolic array is to sort of go two levels of loop up and bake this entire loop...into hardware."
— Reiner, [26:31]
"Most of the decisions in chip design are sizing decisions...these two questions are coupled...maybe I just say that I want [register files] to be 10% [of the area] and the Systolic Array to be 90%..."
— Reiner, [36:50]
"The throughput of your chip you can think of as the product of how much do I get done per clock cycle...times how many clocks I get per second."
— Reiner, [50:12]
"Anything you can express in an FPGA you can express in an ASIC too...but the first FPGA costs you $10,000, whereas the first ASIC you make costs you $30 million..."
— Reiner, [52:34]
"The GPU has a lot of tiny, tiny TPUs sort of tiled across the whole chip."
— Reiner, [77:27]
Why Multiply-Accumulate (MAC) is fundamental:
"If you look what's happening during a matrix multiply...multiply accumulate happens at every single step."
— Reiner, [02:02]
On the staggering cost of data movement:
"Almost all of the cost, like seven eighths of the cost is in the reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself."
— Reiner, [24:59]
On self-imposed performance ceilings:
"If you split [logic] in the middle, you can hit twice the clock frequency. ...but this would be a case where you've gone too far. You've made your clock speed really, really fast at the cost of spending almost all of your area on pipeline registers."
— Reiner, [49:58]
Determinism in CPUs vs. FPGAs:
"You can actually design a CPU that has deterministic latency as well... but those are not very attractive in the market... The cache is absolutely necessary for a CPU to run at reasonable speed... whether you get a cache hit is dependent on the...ambient environment."
— Reiner, [63:39]
On the core difference in AI chips:
"This is the most efficient known mechanism for implementing a matrix multiply."
— Reiner, [36:04] (regarding systolic arrays)
| Topic | Timestamp | |---------------------------------------------------|-------------| | Logic gates to MAC: Hardware basics | 00:31–08:17 | | Multiply-accumulate and matrix multiply explained | 01:51–02:49 | | Precision, rounding errors, and accumulators | 02:49–03:49 | | Full adder circuits and summing partial products | 06:17–11:47 | | Area scaling and cost of gates | 12:08–13:44 | | MUX, register files, and hidden data movement | 16:29–24:59 | | Systolic arrays and design tradeoffs | 26:40–36:04 | | Clock cycles and pipelining | 39:20–50:12 | | FPGAs: LUTs, muxes, programmability | 52:34–63:19 | | Deterministic vs. non-deterministic CPUs | 63:19–68:03 | | GPU vs. TPU macro-architecture | 75:34–79:47 | | Maddox's direction: Split-able systolic arrays | 80:14–80:24 |
The episode demystifies the actual physical constraints and engineering choices that shape modern AI chips, making clear that software abstractions mask oceans of complexity beneath. Maximizing compute per communication—whether across wires on a die or network cables in a datacenter—is the central design principle, and nearly all clever chip design is a dance to optimize this tension. The emergence of large systolic arrays, precise management of register files, and the trade-offs of programmability vs. efficiency mark today’s chip landscape. Notably, future advances (such as Maddox's "splitable systolic arrays") aim to hybridize the best of flexible and efficient compute for the AI workloads of tomorrow.
For visuals, Reiner's live blackboard explanations and circuit diagrams are best seen in the video version.