
Host
Rayner Pope is the co-founder and CEO of MatX. He's a former math whiz and Haskell programmer who became a TPU architect for Google. And now he's teamed up with Google's former chief chip architect to design a better chip for AI. So a year ago everyone was saying Google is canceled. AI is going to eat their search. No one's going to search for things and therefore the business won't do well. Obviously that sentiment has really shifted, in part helped by, you know, Gemini 3 is really good and then also it's really fast. You know, it's powered by the custom chip hardware Google has. You were inside Google for, actually, I think a lot of the foundational periods laying the groundwork for that stuff. What do people not appreciate about what Google did right to lay all the groundwork for their current AI success?
Rayner Pope
They started with the research, right? The Transformers came from there. Pretty much anyone who's maybe, I don't know, over 30 and at a large lab has been at Google Brain at some point. So I think there was and has been a lot of talent there. TPUs are pretty good. I mean, we think there's better you can do, of course, but they at least had the opportunity to design the TPUs for neural nets rather than for graphics applications like Nvidia. And so the overall architecture, starting with a single core, doing what were at the time reasonably large systolic arrays (by today's standards, nowhere near as large), I think those were a lot of really good decisions.
Host
When did the TPU project start?
Rayner Pope
TPU v1 was announced in 2016. I think that was what actually kind of led to the creation of all of those 2016, 2017 startups. So Cerebras, Groq, Graphcore, SambaNova, all of those. TPU v1 actually, I think, is a really impressive project. It was done on a very short timeline. I don't know the full details, but maybe about a year or so, maybe a year and a half, with a skeleton team of 20, 30 people. Really, really minimal viable product. More recent TPUs and more recent AI chips in general can't do that because the market has moved and the table stakes are much higher. But for the first generation product, they just did one big systolic array, stick a memory next to it, we're done. And it was really simple, a nice elegant product.
Host
And obviously the TPU v1 predates the Transformer. Is it just a coincidence that they happened at very similar times, or are they related in some way?
Rayner Pope
Yeah, I mean there was a period of maybe about four years of a lot of ML research or neural net research prior to the Transformer. So what was popular? LSTMs and convnets, and ResNet and Inception. The big thinking at the time was to adapt it to be used for LSTMs. It's a reasonable fit there. But yeah, no, I mean I think there was just a huge flurry of activity. Why did it all happen then and not later? It's probably just because people stopped publishing. 2022 was about the time when Google just completely stopped publishing its research. And so all the good papers are from before that as a result.
Host
Right, right. But is there some hand-wavy story you can tell about parallelization, where both Transformers and TPUs are about really internalizing the importance of parallelization?
Rayner Pope
So I mean definitely. I put it somewhat on people, actually. So I mean it is just true: hardware is massively parallel. Like you've got tens of billions, hundreds of billions of transistors on your chip and it takes like maybe 100 clock cycles to get from one side of the chip to the other. And so you can't do a sequential computation involving transistors on both sides of the chip. So the hardware is just fundamentally parallel and you have to take advantage of that. TPU v1 and all later TPUs naturally took advantage of that. Matrix multiply is really nice because it is so parallel. So I think on the hardware side that's generally understood. I think most ML researchers, especially of the time, were not super deep in what hardware wants. Mechanical sympathy is sometimes a term that's used for that.
Host
Sorry, what's the term? I mean it kind of speaks for itself.
Rayner Pope
It's like, I mean, think about the poor machine and what does it want? I mean, the term actually, I think, originates in maybe high frequency trading and areas like that, which I haven't worked in. I just like reading about the software that people have built there. And it's like, for them, what does the machine want? It wants a lot of instruction level parallelism. This is CPUs, not GPUs. It wants you not to branch, so unpredictable branches kill your performance. And so think about the things that CPUs do and how to use them best. Can I get to peak performance on a CPU is sort of that idea. I think the whole idea of peak performance on a CPU is kind of crazy. No one even says, what is peak performance? What is my percentage of peak on a CPU? Because performance of software running on CPUs is really bad. But running on GPUs or TPUs or AI chips in general, actually that is the main focus. It's like, what is my percentage of peak? Can I get 70% or 80%?
Host
Okay. I feel like many people listening to this know that GPUs perform better for AI workloads than CPUs. That's kind of a funny history when you think about it, where just one day we woke up with all these very mathematically intensive workloads, first crypto mining and then AI. And so then Nvidia is extremely well positioned because they've been making GPUs for gamers. You'd buy your Dell PC back in the day and maybe upgrade the graphics card by plugging in a better Nvidia graphics card than the one the stock Dell computer came with. And they were incredibly well positioned to capture that. So I think people know that. What is the intuitive explanation as to why GPUs are better for AI workloads than CPUs? Because I mean, people say, yeah, they're better for these mathematical computations. But that's kind of a tautological answer, basically. Is there some way you can have a mental model for why that is the case? Because I mean, software instruction sets also involve doing math.
Rayner Pope
Yeah. So I mean intuitions, I'm not sure. Let me try and just go to some of the big differences. Really wide vector instructions are sort of the hallmark of a GPU. If you want some intuition, it's like how much is spent on controlling the thing. And maybe control means, like, if I'm driving a truck, how much is the driver versus the payload? A truck has a huge payload in it; that's more like the GPU. Whereas maybe a motorcycle is more like the CPU, where you're actually just processing the instructions, reading, what do I have to do next? Okay, how do I do that? That is most of the cost on a CPU. Whereas if you just keep the same instructions but make the payload 100 times bigger, then you can shift most of the cost to be in the actual work that you want to do.
Host
Okay, so CPUs have been optimized for very complex instruction sets, whereas GPUs are optimized for...
Rayner Pope
Yeah, complex instruction sets and sort of fine-grained changing of what you want to do. So steering. In this analogy, a CPU can steer through an obstacle course no problem. Whereas on a GPU, you're just going to go in a straight line for a really long time.
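The driver-versus-payload analogy can be put into rough numbers. The sketch below is a toy cost model, not a measurement: the cost units, the function name `useful_fraction`, and the figures 20 and 1 are all hypothetical, chosen only to show how widening the payload shifts the cost toward useful work.

```python
def useful_fraction(control_cost: float, work_per_lane: float, lanes: int) -> float:
    """Fraction of one instruction's total cost that is spent on arithmetic.

    control_cost  : fixed cost to fetch, decode and schedule the instruction
    work_per_lane : cost of the arithmetic for one data element
    lanes         : number of data elements the instruction processes
    """
    work = work_per_lane * lanes
    return work / (control_cost + work)


# Hypothetical units: control costs 20, one lane of math costs 1.
for lanes in (1, 100):
    print(f"{lanes:>3} lanes: {useful_fraction(20.0, 1.0, lanes):.1%} of cost is useful work")
```

With the same control cost, making the payload 100 times wider moves most of the cost into the arithmetic, which is the point of the truck analogy.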
Host
Yes, yes. Okay, so this is getting us into what is MatX. How did you guys get started and which part of this space are you attacking?
Rayner Pope
Yeah, so MatX is making the best chips physically possible for LLMs. What led us into MatX? So Mike is the other founder. Mike and I were both working at Google. I was working on the inference stack for running LLMs, and I was asking, how can we make the best software on TPUs for running LLMs? And then what we really wanted out of hardware was support for much, much larger matrices. The matrices have grown from maybe 128 in dimension into the many thousands. And so the truck goes to many trailers. So much larger matrices and much lower precision arithmetic. And we tried to move the TPUs in this direction. TPUs have been moving in this direction, but they are kind of constrained by a lot of other workloads. There was a big ads workload at the time. And so back in '22, before ChatGPT was released, there was this idea that LLMs were going to be a big thing, but not conviction, and it was really hard to make a big bet on that. I think a startup is more of the right place to make a big bet on a workload. If you fail, it's fine, another startup will succeed. Whereas I think at a company like Google or Nvidia, the next chip has to work for sure.
Host
You can take more technical risks as a startup.
Rayner Pope
Yeah, well actually I would say we're taking sort of product risks rather than technical risks.
Host
But is there actually product risk? Because it seems like LLMs are going to work.
Rayner Pope
I think now we understand it. Two years ago or three years ago, I think it was not something.
Host
And when you say the best chips for LLMs, I mean I can think of multiple ways to measure best. It could be best performance per watt. It could be lowest latency. It could be capable of handling the largest models.
Rayner Pope
What is best? In general, there are two metrics which LLM workloads care about. One is throughput, which is really just an economics thing. I buy a chip for $30,000 and then can I do 10,000 tokens a second or 100,000 tokens per second of throughput? That determines the dollars per token. So throughput, and then latency. How fast does the thing respond? As I see the market, the economics seems to be most important. Ultimately the quality of the AI you can train and serve is constrained by budget. I have only a $10 billion budget, and I want to train and serve the best model I can on that budget. And so if I can have more tokens per dollar, then I can get a better quality out. The product we aim to build is far ahead on throughput. But then actually the sort of surprising thing is we're competitive with the best on latency as well. And so I think that is a unique thing in offering both in the same place.
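The link between chip price, throughput and dollars per token is simple arithmetic. The sketch below uses the $30,000 price and the two throughput figures mentioned above; the four-year lifetime, full utilization, and the omission of power and hosting costs are assumptions made here for illustration, and `dollars_per_million_tokens` is a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def dollars_per_million_tokens(
    chip_price_usd: float,
    tokens_per_second: float,
    lifetime_years: float = 4.0,   # assumed amortization period
    utilization: float = 1.0,      # assumed fraction of time spent serving
) -> float:
    """Chip purchase cost spread over every token it serves in its lifetime."""
    total_tokens = tokens_per_second * utilization * lifetime_years * SECONDS_PER_YEAR
    return chip_price_usd / total_tokens * 1e6


for tps in (10_000, 100_000):
    cost = dollars_per_million_tokens(30_000, tps)
    print(f"{tps:>7} tokens/s -> ${cost:.4f} per million tokens (hardware only)")
```

Under these assumptions a 10x increase in throughput at the same price is a 10x drop in hardware cost per token, which is why throughput is described as the economics metric.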
Host
And is this for. Obviously in AI there's training the models and then running the models. Inference. Is this most interesting for inference or is there any training angle? I mean, incidentally, is it useful for training but you are trying to win inference? Is that how you think about it?
Rayner Pope
I think that's a reasonable way to look at it. I think the best inference chip today will be a really good training chip as well. And so our product is both training and inference. But I think the first sales will be in inference. That's mostly just a market effect where it's easier to buy. It's not as big of a risk to go buy an inference cluster as a training cluster. I think the product is really compelling for training as well. And so I think it should be the best training product.
Host
And you guys just raised a big new round of financing.
Rayner Pope
Yeah, that's right. So we've raised a Series B round. It's led by Jane Street and Situational Awareness. Situational Awareness, that is Leopold Aschenbrenner's fund. He wrote the definitive book on AGI and where it's going. And then Jane Street, they're real technical experts, they understand all of the details really well. So very happy to be having them lead the round. It's a $500 million round. It helps us actually ramp the manufacturing and supply chain for our chip so we can bring our chip to market.
Host
That's a lot of money.
Rayner Pope
Yeah, it is, yeah. No, I mean, roughly, I would say it costs ballpark $100 million to produce a chip in small volumes. But then you see the orders that are going around. Like OpenAI, Anthropic, Google are going around buying multi-gigawatt clusters. They cost tens of billions of dollars of chips and you want to deploy all of it in a year or so. And so you just need a massive supply chain behind you.
Host
And so assuming everything works technically, what rate of production could you start to see?
Rayner Pope
We have some estimates of where we'd like to be on this. I mean, ramping to very large volumes is a huge challenge for anyone. And so obviously the large players have had some practice at it. Getting to a very large volume as a startup is hard. We would like to be at a place where we're shipping multiple gigawatts a year.
Host
Multiple gigawatts per year. Speaking of the metrics, you talked about tokens per second. We used to measure chips in flops, and I guess there's some kind of custom flop thing for AI chips. But is everyone just using tokens per second these days? Is the industry aligning on that as the chip metric?
Rayner Pope
Yeah. So, I mean, I guess it's sort of like an application metric versus the chip itself. Flops of the chip is the key chip metric. There's a little bit of like, if I go and say, I've got an exaflop chip to you, then sort of the appropriate suspicion is to say, okay, but can I actually use those flops effectively?
Host
I see.
Rayner Pope
And so then you need to map the application to that.
Host
Yeah, yeah. So this is kind of telling you the usable flops. Yeah. For your purposes. Okay. As a consumer of AI, we have known for a long time that lower latency products succeed. Google talked about their internal testing where the differences were down to, was it 50 milliseconds, something like that? Yeah. In result times, where they noticed more Google engagement the faster the results were. And you'd think that 50 milliseconds is imperceptible to a human. And it almost is, but turns out it's not. And I think Amazon has, I mean, certainly they've optimized the latency of the Amazon experience quite a lot. I don't know if they've talked about this stuff publicly, but you know that their internal metrics similarly show that the faster the product page loads, the more people buy. And yet in AI, Google has carved out a meaningful advantage via Gemini just being really fast for its level of intelligence. And as far as I can tell, it's ahead of most of the other labs on latency at a fixed high level of intelligence. Why have you guys or Groq or kind of better chips not been adopted faster to give this product latency? Is it just that this will happen and you guys will be powering all the AI products? But I note that Google has an interesting lead there.
Rayner Pope
I think ultimately, at least for existing chips in the market, there's a really uncomfortable trade-off between latency and throughput. The chips that are best at throughput have historically been the chips that are based on HBM as the memory. So that is Google, Amazon, Nvidia. In order to have very large throughput, you need a lot of inferences in flight simultaneously. So that needs the large memory. But that hasn't been so good at latency. And then there's Groq and Cerebras, which are much better at latency because they've got the SRAM. The weights are in SRAM, very low latency. The challenge when you go to a Groq or Cerebras system is that the throughput you get there just is not very good. And so the fundamental dollars per token is just not competitive with Google or Nvidia or Amazon. It is actually possible to do both in the same chip. It's kind of an obvious thing. You take the HBM, you take the SRAM, put them together in the same chip, you put the weights in SRAM and you put all of the inference data in HBM. That is what we are doing, in fact. And I think that actually hits a really nice sweet spot where you can get low latency and also be very cheap. So I think that's a really attractive point to be. It hasn't happened in the market yet just because of product decisions that have been made by the different chips.
Host
Got it. But we should expect all the AIs we're using to get significantly faster over the coming three to five years.
Rayner Pope
Faster, I'd say.
Host
Yeah.
Rayner Pope
So I mean generally HBM based chips tend to be about 10 milliseconds or 20 milliseconds per.
Host
Sorry, HBM based chips are things like GPUs.
Rayner Pope
That's right, yeah. There's just some simple math of how long does it take you to read through all of HBM takes about 20 milliseconds. And so that's the amount of time per token it runs. Whereas the amount of time to read through all of SRAM is much faster. And so you can typically get about 1 millisecond.
Host
Yes, about an order of magnitude faster. Famously, software, like old fashioned deterministic software, the kind that's now out of favor, used to be very easy and quick to scale. You know, you would have social networks that have some South by Southwest moment and they can scale through, you know, 10x, 100x, 1000x growth in users, because it's just a few rows in the database and it's a very underutilized CPU. What's interesting about the AI world is there are very real bottlenecks. You know, you could spend lots of time talking about power, but it's not just bringing power online. You mentioning HBM reminds me that there seems to be a view that there's going to be some HBM supply chain crunch. And so are we in for just a crunched world, where some limiter is pacing the rate of AI buildout over the coming few years? Where the economics of the products work and everything like that, but ultimately we just can't bring the components online fast enough because we have to build out the factories, things like that. And what are those crunched components?
Rayner Pope
Yeah, no, I mean I think so. And I'll just comment by the way, this is a great time to be a supplier in this place or just
Host
really you should have started an HBM project.
Rayner Pope
I know, right? I think it's also just a fun time to be someone who optimizes software. That's always what I like doing. The challenge always is, why am I optimizing this if no one cares? But finally there's a place where it's actually very meaningful in a very tangible sense. If I can make this 20% more efficient, then it can save that 20% of the build out of the supply chain. We're going to have crunches on all of the supply chain, really. So if you look at the big components of what any company, but us for example, builds out, there is a dependency on logic dies, typically from TSMC, maybe Samsung. There is HBM, where there are the big three HBM vendors: Hynix, Samsung and Micron. And then there's also just the whole rack manufacturing, which includes, I mean, literally just sheet metal and so on that builds the rack, but also cables and connectors because of all the high speed interconnect.
Host
Those don't sound hard. Are they sneaky hard?
Rayner Pope
The big challenge is that you want to bring in a huge amount of power, get a huge amount of heat out, and also have phenomenal interconnect, which has very high signal integrity requirements. And so you pack a lot of cables in. The cables can't bend too much. They have to have enough copper in them and so on so that you don't lose data rate on the interconnect.
Host
Yes.
Rayner Pope
So yeah, if you push it to a limit, it's time.
Host
Okay. Wafers, racks, hbm, what else?
Rayner Pope
Data centers, which I think is power. Primarily a little bit of build out, but primarily power and grid infrastructure there.
Host
Okay, how do you then, as a startup that is looking to acquire all these components, elbow your way in amongst the giants, the Googles and the Nvidias and all these people who have long running relationships and have been buying for much longer?
Rayner Pope
Yeah, I mean ultimately what all of these suppliers care about, they do somewhat care about a diversity of their own customers. It's not a great position to be in.
Host
They don't want monopsony. That's right.
Rayner Pope
Yeah. But then what is their hesitation or the calculus for one of these large suppliers is if I reserve some of my capacity for you a startup, are you going to be around in a year? Is anyone going to even buy your product? Our approach has been to just actually find buyers for the product and then the buyers answer that question ultimately.
Host
Got it. And so if you show up with a bunch of fairly ironclad contracts to a supplier then that has helped.
Rayner Pope
That's the nature of it.
Host
Yeah. I presume also the round you just raised really helps there where showing that you are incredibly well capitalized and not going anywhere also helps from a supplier point of view. A supplier validation point of view.
Rayner Pope
Yeah, absolutely. Yeah. I mean it helps just to say that we are around. It depends on which part of the supply chain. Some parts of the supply chain are fungible. Logic dies are typically pretty fungible. But for other parts of manufacturing, you actually need something specifically set up for you, and so we're also able to cover the capital costs for that.
Host
Yeah, yeah, that makes sense. And coming back to the MatX architecture, okay. You want to build the best chip for LLMs.
Rayner Pope
What is that?
Host
Yeah, exactly. Sounds great.
Rayner Pope
Yeah. So I mean there's a few aspects to that. I think the first one is just pick your memory system right. And so, like I said, we've seen this HBM family, we've got the SRAM family. Put them both together is actually, I mean, the most obvious idea. But you can actually do it. There are a lot of details to make that work. Well, we've done that work. One of the things that shows up there is you've spent all of this area on your chip on SRAM. How do you fit in the matrix multipliers, which are the other big thing you really need? And so you somehow have to create a much more efficient matrix multiply engine. There is a gold standard for that, which is called the systolic array. Make a really large systolic array; you can't beat that in area or power efficiency.
Host
Like provably so?
Rayner Pope
Practically. Practically, there is no known better approach there. The main thing is, where are the inefficiencies? Typically the inefficiencies show up when you leave the systolic array. So if you make your systolic array really big, then you just don't leave it as often. And so that's the idea. Make a really big systolic array. That is sort of the theme of several of the 2023 era startups, including us. But one of the challenges there is that there is this part of the neural network, part of the Transformer, which is the attention, that doesn't map well onto a large systolic array. The mixture of experts layer maps really well, but the attention does not. And so what we came up with, which is quite different than some of the other startups in this space, is to take a really large systolic array but have a way to split it up into pieces without losing efficiency. So that is sort of the core of the design for us. And then there's the third component. So first was HBM and SRAM. Second is the systolic array. The third component is just an interesting new approach to low precision arithmetic. In low precision arithmetic in general, we've seen number formats get narrower and narrower. They get faster and faster as you make them less precise.
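One way to see why small matrices are a poor fit for one very large array, and why splitting helps, is to count how many multiply cells are doing useful work. The sketch below is a toy model built on assumptions: the 1024-wide array, the 128-wide operand (a common attention head dimension) and the even split into square sub-arrays are hypothetical, and nothing here describes how MatX actually partitions its array.

```python
def utilization(array_dim: int, tile_dim: int, splits: int = 1) -> float:
    """Fraction of multiply cells kept busy in a square systolic array.

    The array_dim x array_dim array is divided into splits x splits equal
    square sub-arrays, and each sub-array holds one tile_dim x tile_dim
    operand. Cells beyond the operand's footprint sit idle.
    """
    sub = array_dim // splits
    used = min(tile_dim, sub) ** 2
    return used / (sub * sub)


# Hypothetical sizes: a 1024 x 1024 array and 128 x 128 operands.
print(f"one big array      : {utilization(1024, 128, splits=1):.2%}")
print(f"split into 8 x 8   : {utilization(1024, 128, splits=8):.2%}")
print(f"large 1024 operand : {utilization(1024, 1024, splits=1):.2%}")
```

In this model a large operand fills the whole array, while a small operand uses a tiny corner of it unless the array can be treated as many smaller pieces that each get their own operand.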
Host
Number formats get narrower. What does that mean?
Rayner Pope
Yeah, so float 32 was how people used to train neural nets and that's
Host
just too much precision.
Rayner Pope
Too much precision? Yeah, it's like saying I've got an image with a billion color bit depth. It's like, too many colors. You'd rather have more pixels and fewer colors. And so that trend seems to go almost all the way down, to like one bit even, where you just have very few colors but a huge number of pixels. And that, on net, seems to be just a more efficient way to train models.
Host
And so sorry, literally what precision are you dealing with in these?
Rayner Pope
So we have a range. We actually have an ML team who we hired specifically to research different forms of numerics and how to make them all work together really well. We have a range of precisions, it's not just one precision. We think probably the main thing will be similar to where Nvidia is at, which is four bit precision. But I think a mix of different precisions is useful for just when you look at the research, sometimes you want some layers in higher precision or lower precision and so on.
Host
Yeah, yeah. Okay, so four bits is 16.
Rayner Pope
Yeah, you get 16 choices that's it?
Host
Yeah, that's it. Yeah. It was pretty imprecise.
Rayner Pope
Yeah.
Host
That's really interesting. I didn't know about that dynamic, but it makes sense.
Rayner Pope
Yeah. And half of them are positive, half of them are negative. So, like it's even less precise.
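To make "16 choices, half positive, half negative" concrete, the sketch below enumerates every bit pattern of one 4-bit floating-point layout. The conversation does not say which 4-bit format is meant, so the layout used here (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1, often written E2M1) is an assumption, and `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(bits: int) -> float:
    """Decode a 4-bit pattern laid out as sign(1) | exponent(2) | mantissa(1)."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 0b1
    if exponent == 0:
        # Subnormal range: no implicit leading 1, so the value is 0 or 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal range: implicit leading 1, exponent bias of 1.
        magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - 1)
    return sign * magnitude


values = [decode_e2m1(b) for b in range(16)]
magnitudes = sorted({abs(v) for v in values})

print("all 16 patterns   :", values)
print("distinct magnitudes:", magnitudes)
```

Under this assumed layout the 16 patterns cover only eight magnitudes, mirrored across zero, and two of the patterns are spent on positive and negative zero.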
Host
How do you design a chip? Like, is that a whiteboard? What software are you working in? I'd just love to know. I understand how you design software and what that looks like. I've actually no sense for what chip design looks like.
Rayner Pope
So the way that you actually type a chip into a computer is similar to software. So you write Verilog. Verilog is a programming language. It is a very parallel programming language, which makes it different from C or Python or something, but it is a programming language. So the mechanics of how you express the design are the same as software. And we have continuous integration, git, all of those things.
Host
But what program executes your Verilog program?
Rayner Pope
We don't really run it. Right.
Host
Yeah, exactly.
Rayner Pope
How does it run? So Synopsys and Cadence provide EDA tools.
Host
So EDA, you have to remember.
Rayner Pope
Yeah, electronic.
Host
Just a humble payments.
Rayner Pope
I don't even know what it means really. I think it's electronic design automation. It takes the Verilog and first turns it into a description of what logic gates are involved (ANDs, ORs, NOTs) and then the wires between them. And then it runs for days, doing some really difficult algorithms, and then eventually produces... So gates are the first thing. And then even below that, it literally just produces polygons. It says, like, P-type semiconductor here, N-type semiconductor here, and polysilicon.
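The "gates plus wires" description that the tools produce first is usually called a netlist. The sketch below is a hand-written toy in Python, not the output of any EDA tool: the wire names, the `NETLIST` layout and the `evaluate` helper are all hypothetical, and the example circuit is an XOR built only from AND, OR and NOT.

```python
# The three gate types mentioned above, operating on 0/1 values.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a: 1 - a,
}

# Netlist for out = a XOR b. Each entry is (output wire, gate type, input wires),
# listed so that every gate's inputs are computed before it is evaluated.
NETLIST = [
    ("na", "NOT", ("a",)),
    ("nb", "NOT", ("b",)),
    ("t0", "AND", ("a", "nb")),
    ("t1", "AND", ("na", "b")),
    ("out", "OR", ("t0", "t1")),
]


def evaluate(netlist, inputs):
    """Drive the input wires, then compute every gate's output wire in order."""
    wires = dict(inputs)
    for out_wire, kind, in_wires in netlist:
        wires[out_wire] = GATES[kind](*(wires[w] for w in in_wires))
    return wires


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", evaluate(NETLIST, {"a": a, "b": b})["out"])
```

A real netlist has the same shape, gates connected by named wires, only with vastly more entries, and the later placement and routing steps decide where each gate physically sits.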
Host
Okay. So you write Verilog and then that compiles down into gates and ultimately like the Minecraft 3D, just this is where your elements should go. But then what is the iteration loop like when we write code at Stripe, we build a first version of something and then we try it out and then we refine it and we add more functionality over time. We're going to write some tests and stuff. Some point we'll ship that, we'll find product market fit and then we'll refine it in market. Do you just sit down and write the completed chip and it works really well?
Rayner Pope
Yeah. Every year we tape out a chip and if there's a bug we just wait till next year. It's not really how we do it.
Host
Yeah. So what's the iterative.
Rayner Pope
How do we actually do it? Yeah, it's much more a waterfall than software is. So waterfall is almost a bad word. In software development, but it's a fact
Host
of life in chips.
Rayner Pope
Yeah. So the waterfall goes from architects to logic designers who are writing Verilog. And then there's this design verification and then physical design. So there's this really big architecture phase which happens before even writing any Verilog, which is what do I want the organization of my chip to be? There's in some sense, I mean, what I really like. I came to hardware after doing almost 10 years in software. I really like the blank slate you get in hardware. You've got like all of the raw materials you have are much more varied in what you have available. So what is the organization of your chip? Do I have 100 cores? Do I have one core? Do I have systolic arrays? Do I have vector units? All of those things. And then we spend a long time coming up with that general principle and then saying, okay, now I've got these applications I want to run. I want to run a transformer of a particular shape, I want to map that onto this architecture that I've got in my head. And so we do a lot of iteration. Well, I've got this architecture in my head. I write it down to communicate to other people, but that's just like a markdown file. And then still actually a lot in my head. But maybe with Python simulation and so on, I'll see do my applications map well to it. And so can I run LLM?
Host
This is where I was going to go, okay, so you have a simulator where you write your chip, you can then simulate its performance, and you have some battery of tests that you kind of see how this chip design works. Is it like an industry standard? Is it the X plane of chip testing?
Rayner Pope
Yeah. So, I mean, there's an industry standard thing for the Verilog. Once you've done the design, there are just Verilog simulators that you can test against. But you've already invested a huge amount of work by the time you've got to that point. And so you sure hope you haven't made a big mistake at that point.
Host
Yes.
Rayner Pope
So the thing that everyone does prior to that is we'll write our own performance simulator, which, I mean, it is very specific to your particular architecture and you can write it quite concisely in just a normal programming language. And so that is where most of the architecture work is done. And then the simulation on Verilog is more. I know what I'm doing. I just want to make sure I didn't have any bugs when I implemented it right.
Host
But I presume it's a game of inches where different people are trying different things. And then you do simulate it to see if it runs 1% better across the battery of tests. Or is that not how it works in this space?
Rayner Pope
Not so much. So, I mean, just to sort of characterize what the performance of an AI chip is: the first thing you care about is flops. How many flops have I got? That's a product of how many multiplies I can do. Like, I've got a grid of a certain size, like 1000 by 1000. So that can do a million multiplies in a clock cycle. And then I have a certain clock frequency, like a gigahertz, and so I multiply them out. That is the speed of it. I don't even need to write that and test it to see how fast it is.
Host
It just is.
Rayner Pope
Yeah. So what I plan in advance is it's going to be this fast. What I can then optimize on maybe a little bit is clock speed. There's not a lot I can do there. And then I can optimize a bit on area as well. So there is some room for optimization. But actually a lot of it gets set, like, actually just the speed of the chip gets set very much up front.
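The "multiply them out" calculation, using the 1000 by 1000 grid and 1 GHz clock from the conversation, looks like this. Counting each multiply-accumulate as two floating-point operations is a common convention assumed here rather than something stated above, and the achieved-throughput figure in the example is hypothetical.

```python
def peak_multiplies_per_second(rows: int, cols: int, clock_hz: float) -> float:
    """One multiply per grid cell per clock cycle."""
    return rows * cols * clock_hz


def percent_of_peak(achieved_per_second: float, peak_per_second: float) -> float:
    """How much of the planned peak an application actually reaches."""
    return 100.0 * achieved_per_second / peak_per_second


peak = peak_multiplies_per_second(1000, 1000, 1e9)
print(f"peak multiplies/s: {peak:.1e}")
print(f"peak FLOP/s      : {2 * peak:.1e}  (assuming 2 FLOPs per multiply-accumulate)")

# Hypothetical application that sustains 7.5e14 multiplies per second.
print(f"percent of peak  : {percent_of_peak(7.5e14, peak):.0f}%")
```

The peak is fixed by the grid size and clock chosen up front, which is why simulation is used to check how well applications map onto the design rather than to discover how fast the chip is.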
Host
Got it. And then how many chips do you fab? Is it only the ones going into production or is it just build a few to throw away? Or how does it work?
Rayner Pope
Yeah. So the ideal, which companies tend to hit about 50% of the time, is that your first tape out. Tape out costs like $30 million.
Host
Tape out is just production.
Rayner Pope
That's right. It's the actual manufacturing. Like, the first chip costs $30 million. The second chip costs $1,000. So tape out is that first chip. The ideal is that your first tape out actually is your production thing. So you do a tape out, you make maybe 1,000 chips and test them, and then you do production volume. In the unlucky 50% of the time, you need to redo some or all of your tape out. So in good cases, and in many cases, you can redo just the metal layers, which costs you only like $100,000, as opposed to paying the $30 million again. But in bad cases, if you've made a serious mistake and you can't fix it in the metal layers, you have to do the whole thing again.
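The tape-out economics above can be sketched as an expected-cost calculation. The $30 million tape out, $100,000 metal-layer fix, $1,000 unit cost and 50% respin rate come from the conversation; the share of respins that need only a metal fix is not stated, so the `p_metal_only=0.5` default is a hypothetical placeholder, as are the function names.

```python
def expected_tapeout_cost(
    full_cost: float = 30e6,         # cost of a full tape out
    metal_fix_cost: float = 100e3,   # cost of redoing only the metal layers
    p_respin: float = 0.5,           # chance the first tape out needs rework
    p_metal_only: float = 0.5,       # hypothetical: share of respins fixable in metal
) -> float:
    """First tape out plus the probability-weighted cost of one respin."""
    respin_cost = p_metal_only * metal_fix_cost + (1 - p_metal_only) * full_cost
    return full_cost + p_respin * respin_cost


def cost_per_chip(volume: int, tapeout_cost: float, unit_cost: float = 1000.0) -> float:
    """Average cost per chip once the one-time tape out is spread over the volume."""
    return tapeout_cost / volume + unit_cost


print(f"expected tape-out spend: ${expected_tapeout_cost():,.0f}")
for volume in (1_000, 100_000):
    print(f"{volume:>7} chips -> ${cost_per_chip(volume, 30e6):,.0f} each")
```

The one-time cost dominates at the 1,000-chip test volume and nearly disappears at production volume, which is why a clean first tape out matters so much.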
Host
Why can't that be solved? Is that definitionally an error in simulation, where it turns out these two gates were too close together and it just led to some reliability issues or.
Rayner Pope
Yeah, so yeah, what you're describing is like physical, like the physical implementation of the chip is wrong. That's one class. The other class is that the logical specification of the chip is wrong.
Host
But shouldn't that be.
Rayner Pope
Shouldn't you have caught that before.
Host
Before you spend $30 million on.
Rayner Pope
So, I mean, yeah, we do a lot of testing. We try not to ship these things. I hear software companies also ship bugs to production as well and sometimes things.
Host
That's a very good retort. Shouldn't you not be shipping bugs?
Rayner Pope
But I mean, there is a real trade off in. You can spend more and more time on design verification. There's always this question of when do you stop? And so you stop when your coverage metrics have hit a certain point, but maybe not 100%.
Host
And then if, you know, Apple has to discretize the iPhone release cycle and they've settled on once per year and so they'll decide we've got this better camera, but it's got to wait for the next version or we're going to improve the waterproofing, but that's got to wait for the iPhone 8 or whatever. And so they have taken the continuous process of always coming up with ways to make the iPhone better and discretized it into annual iPhone releases. What will your discrete versions be?
Rayner Pope
Many chip vendors have this sort of tick-tock model. Say you're trying to release every year. On even numbered years you'll do a physical technology upgrade, so new transistor technology, new memory technology, new interconnect. And then on odd numbered years you might do an architecture overhaul. I think that's a pretty good fit because you have different parts of your company that are skilled at different areas, and it allows you to keep both of them occupied, instead of doing a massive, risky release every two years.
Host
Yeah. Okay, and so you think that's probably likely for you. Yeah, that's right. You mentioned interconnect. So there's a narrative out there that for Nvidia, a huge part of the defensibility comes not from the chips, which are good, but from the software layer and the ability for engineers to write these really parallel workloads, and the fact that they've been refining CUDA for whatever number of decades. Exactly. Yeah. A long time. Just how do you think about parallelization, and is that narrative true?
Rayner Pope
Yeah, it's true for sure. It's true in many areas of the market, I think, especially when you look at where Nvidia entered the market: they're doing PC devices, lots of gaming and so on. There are thousands of games, maybe tens of thousands of games released, and they all need to be programmed against CUDA. And so there's such a huge investment in the software that compatibility is really important. There are not thousands of LLMs. There's one LLM per frontier lab, and there's maybe five frontier labs or something like that. And so just the economics of that is different. The calculation for a frontier lab roughly goes: I just bought a $10 billion compute cluster. I have hired 50 of the best people who can write optimized GPU or TPU or Trainium software. I pay them less than $10 billion, a lot less. And so let's put them to work optimizing the compute. And good work there can, I mean, it depends on what your baseline is, but it can very easily double the performance of the software you write. And so there is a huge amount of custom software written for every generation of chip. When a new chip comes out, the software is substantially rewritten to optimize for that specific chip. And that's just the right trade off given the relative costs of these things. What that means for us is that that ecosystem already exists. And that way of operating, where you say I'm just going to staff a 50 person team to write software for this chip, works really well if you're trying to sell to frontier labs.
Host
Okay, so you're saying CUDA is way more important for the games environment where just there's a lot of games than this top heavy AI market that we're in where if people say you need to then customize your workload for a MATX chip, it's like, well, do that custom business. That makes, that makes a lot of sense. Where will you fab the chips?
Rayner Pope
TSMC.
Host
Okay, yeah. Why is TSMC so durable?
Rayner Pope
Yeah, I mean it's interesting. They don't charge a lot as well. You'd think that if they're a monopoly provider they should charge a lot of money. They don't. I think that is a big aspect of why they're so durable.
Host
It's like this cyclical conservatism crossed with Taiwanese business conservatism, which means you're at the most conservative part of the matrix.
Rayner Pope
But it does. An American capitalist might say, well, they're just screwing up. They could have extracted more money from the market. But you could also say that this is actually their long term sustaining advantage, because they will just stay ahead for a really long time.
Host
They don't encourage the creation of competitors. Yeah, but isn't the creation of competitors kind of priced in because of geopolitical risk? And so like it's not like everyone's fat, dumb and happy with their TSMC dependence. They're actually thinking a lot about it.
Rayner Pope
Yeah, I mean, so there is real technical advantage there as well. It's not just like the discouragement, but
Host
like designing chips seems really hard. Building airplanes seems really hard. There are so many areas where competitive market forces create multiple options and yet that has not occurred here.
Rayner Pope
So I mean there are multiple options. You can buy from Intel or Samsung,
Host
but at leading edge nodes.
Rayner Pope
Yeah, yeah. So I mean, what do we even care about in leading edge nodes? I guess the big advantage is on power. The advantage on area is smaller. At the leading edge nodes, the density doesn't go up as much as it used to. So when you are really, really sensitive to power, it is a good idea to be on leading edge nodes. So that is AI chips and mobile phone chips. But there's a lot of the market where you don't need that, like devices and so on.
Host
Yeah, car chips. Yeah, that's fine. But you're kind of saying, like, if you exclude the two most interesting parts of the market. Just for this super high growth area of the market, it's interesting to me. Like again, there's a lot of other really complex business problems out there that competition has solved. And again, chip design is like, why has someone not left TSMC and built a new fab?
Rayner Pope
Yeah, I mean, I don't know. The cost of a fab is extremely high. I mean, I recognize that the cost of a lab is extremely high too. I don't really understand the technical details of why it's so hard. I mean, there is some amount of just a $10 billion fab versus a $100 million tape out and chip development. There's a huge difference there. But beyond that I'm not sure.
Host
What's TSMC like to deal with?
Rayner Pope
So they're very big. So as a startup we tend to work not directly with TSMC but with an ASIC vendor who, I mean, firstly does a huge amount of the actual backend work for us and interfaces with them, but then also has their existing relationships with them.
Host
Got it.
Rayner Pope
TSMC cares a lot about diversity of their customer pool, and that gets back to that conservatism. So they're great to work with from that perspective.
Host
They want to encourage startups. That's right. Yeah, that's very cool. Why don't the labs design their own chips?
Rayner Pope
I mean, Google does. OpenAI is starting. It's really a trade off of how much advantage you get from vertical integration versus how much advantage you get by concentration of R&D work. So you take the five labs, and if they all buy from one player, then you can put like five times as much R&D into that chip. And does that beat the advantage you get from saying I know exactly what my model is? Because of the several years' delay from designing a chip to being in production, you can't actually say I know exactly what my model is, because models change much faster than that. So even the labs are forced into this position where they have to make predictions and they have to hedge against what they might do two years from now. The calculus is sort of like, what is the probability distribution of what my model might look like? And then you sort of design a chip that gets like 90% of that probability distribution or something.
Host
Yeah, yeah. Elon is excited about Data Centers in Space. The two criticisms I've heard are that cooling is very hard and then just repairing the chips is hard. But I know nothing about chips. You do?
Rayner Pope
Yeah. So, I mean, the repair I think is really interesting when you look at how Nvidia deploys their racks. We do something pretty similar to what Nvidia does. I mean, in general, you always need to design for the fact that some of your chips are going to be down. Like, mean time between failures of chips is not that large. And so in a cluster of 100,000 chips, there's going to be chips that are down all the time. One way you can handle that is you can make a rack where one rack has some spare chips in it. Nvidia has eight spare chips in a rack of 64. That's pretty good. The combinatorics works really well for you there: because you can pick which ones to avoid, you can with very high probability tolerate a lot of failures. And then the other family of things is to say my rack has to work, but I have some spare racks as well. So you can math that out, and the tax of reliability here is only like 10%. It's pretty good. But that relies on someone coming and servicing the part within a day or something like that. If you say they're going to service it never, then I think you actually can get where you want to be, but maybe with a 100% tax on reliability rather than 10%. So for example, say you think the average lifetime of a chip is in the range of three to five years. That means if I deploy twice as many chips, then three to five years from now, half of them will still work.
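The spare-chip argument can be checked with a binomial calculation. A minimal sketch under stated assumptions: chips fail independently, the 2% "down at any moment" figure is invented purely for illustration, and the rack is modeled as 64 needed chips plus 8 spares (72 total), which is one reading of the numbers quoted above.

```python
from math import comb


def rack_available_probability(total_chips: int, needed_chips: int, p_chip_down: float) -> float:
    """Probability that at least `needed_chips` of `total_chips` are up,
    assuming each chip is down independently with probability `p_chip_down`."""
    max_down = total_chips - needed_chips
    return sum(
        comb(total_chips, k) * p_chip_down**k * (1 - p_chip_down) ** (total_chips - k)
        for k in range(max_down + 1)
    )


p_down = 0.02  # assumed fraction of chips down at any moment (illustrative only)
no_spares = rack_available_probability(64, 64, p_down)
with_spares = rack_available_probability(72, 64, p_down)
print(f"no spares: {no_spares:.3f}, 8 spares: {with_spares:.6f}")
```

Because any 8 of the 72 chips may be the ones that are down, a modest number of spares moves the rack from frequently degraded to almost always complete, at a cost of 8/64 extra chips.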
Host
Yeah. And also the burn in is particularly failure-y. And how about the cooling?
Rayner Pope
So most of the challenge, I mean, I guess this is actually really a data center design aspect. At the rack level, the challenge of cooling is just getting the heat out, like as quickly as possible, out of the rack into the cooling network. How you get it out of the spaceship, other people would know that better than I do.
Host
Okay, yeah, again, that seems to be the main objection, but I don't know.
Rayner Pope
Yeah, I mean, I think it's sort of like if you think the cost of repair is that you need to have deployed twice as many chips, then like it's a trade off of the capital of the chips versus the power saving.
Host
Exactly. The repair thing, it feels like, can be solved. Because also I think part of the bet, you know, the proponents' claim, is that we will just be so power limited that, you know, you have no option but to go to space. And you know, people can argue about that, but were that to be the case, then yes, it's like, well, you can get power in space and you cannot on earth and so you might as well go there. Whereas the cooling is a more fundamental question: does the product actually work at all? Rayner thinks about AI the unglamorous way: compute, systems architecture, and what it takes to run models reliably at scale. And if you're building an AI product, the business model similarly has a ton of unglamorous complexity. You're not just selling AI, you're monetizing consumption across API calls, tokens processed, GPU hours. Stripe Billing is a scalable system for usage based billing. It lets you launch token based pricing, subscriptions, credits, hybrid models, whatever you want. So you can create revenue models based on usage without rebuilding your pricing system every six months. If you're building an AI product, Stripe Billing is worth a look. What are your AI predictions for 2026?
Rayner Pope
I mean, what I'm really excited about is just being able to. I mean, I'm still excited about the coding. This is what we do as a company. It's what many others do as a company as well. The one aspect of this is expanding into more domains. So for example, where we spend our time as a company, we write Rust, we write Verilog, we write Python. Haskell. Yeah, no, there's a story there. I used to love Haskell. Rust is my current favorite. Mutation is good. The models are extremely good at Rust and Python. They've done a lot of RL on them. They have not done as much RL on Verilog. They've done almost none on, okay, write me a Markdown file that describes a chip architecture. And then how do you even RL on that? You have to say what is a good chip architecture. I have to somehow say whether that's a good result or not.
Host
Yes.
Rayner Pope
I think one of the things the labs are doing is trying to broaden what they've done RL on, source it from customers and so on in order to sort of fill out the knots, make it less spiky, fill out the gaps between the spikes.
Host
I presume the labs would love to work with you on improving the models by doing RL on this specific task. However, it also kind of makes sense for us. Yeah. It's your special sauce. So do you want to come up with some AI approaches but keep them proprietary? Is that.
Rayner Pope
Yeah, so I mean, we've looked at a few different aspects here. I mean, there's what we're able to do by ourselves. Our business is not training models. We do it in order to do the research on numerics, but actual production models we don't do. So the biggest mileage I think is on the RL, and it's not something we can really do ourselves. We'd love it if we could have a custom model just for us, but that doesn't seem to be possible. What we've been offered by labs so far has not been on those terms.
Host
But because you have to share the IP back?
Rayner Pope
The way they prefer to do it is that they put it into their mainstream model because it's good for them. Yeah.
Host
Which absolutely you don't want to do. Yeah. I mean, what do you think you using AI to design a model looks like? Because this is actually, I think, an interesting sight glass into a weak version of recursive self improvement where we're using the AIs to develop better AIs. And so I'm curious. Yeah. What do you think that looks like? Is it your own proprietary recursive models? What else is there in kind of day to day AI usage that's load bearing?
Rayner Pope
Yeah. I mean, so the stuff that is available today, and I think will become even better very quickly, is just the stuff that looks most like software. So writing Verilog, running tests, running continuous integration and so on. And that is a big fraction of the development time in a chip. It's probably 9, 12, 15 months. There's some stuff that's downstream of that, which is physical design, where you take that Verilog and you generate the gates and the polygons. We don't have a clear path for that. At least, the most obvious thing is not clear for how to compress that. Can you tape out a chip in one month? That would be the goal. In theory, you could compress all of the logic design and design verification down to a short amount of time just by continuing on the same path we're on now. But if you wanted to take the physical design down, that has to leave code. You're now doing graphical interfaces and saying I want to place stuff and so on. Actually there has been work on this even prior to LLMs, which is specific models trained for that particular problem. And I think the vendors, which is like Synopsys and Cadence, probably should move in that direction. Most of the focus has not been do it faster, it's been do it with higher quality. But that is a big bottleneck on can I have a new chip every month. And then there's just the practical thing that a new chip every month doesn't really make sense, because if it takes me a year to populate a data center, that means I'm going to have different chips in different corners of the data center.
Host
Yes. Sorry. When you talk about one month to tape out. So you do all this work to ultimately produce a file. Everything TSMC then does, it's not entirely in software. Like, is there some typesetting that has to happen, of moving stuff around? But yeah. What happens when you send your files to TSMC? Then what?
Rayner Pope
So they create a mask. So that is where the ASML tools come in. A mask is really just a stencil. You shoot the lasers through the mask or the X rays through the mask and then that produces the different P type and N type semiconductors. So they produce the mask. That is the expensive part. And then they're building up these 15 or so metal layers so they place on the silicon and then there are different layers of metals which connect all the transistors together. They do that on a wafer. It happens on a stepping basis. So there's sort of a maximum size of chip you can build which is constrained by this machinery.
Host
The wafer stepper is part of the ASML special sauce, right?
Rayner Pope
Yeah, I guess there's probably some important alignment requirement there.
Host
Yeah, I think I remember that being quite like the, you know, the classic manufacturing throughput problem. And I think they've done a lot of work on optimizing that.
Rayner Pope
So they take that. So then you just produce hundreds of copies of your chips. You have to test it because there's defects. Typically, I think the average rate really depends on process and so on, but it's a small single digit number of defects per chip. So you test the chip and see whether it has any defects in it. Many chips are designed to be able to tolerate a few defects, and so you need to configure it to tolerate the defects. And now you have a die that by itself works, and then you need to package it. So you put it in a package together with memories. Typically that's the HBM, and maybe you escape the wires to connect to other chips.
Host
How long does it take to make a mask?
Rayner Pope
So I mean what we see is time from tape out to first chip to chips back again. Depends on node but it's ballpark four or five months.
Host
Oh, so tape out is just like sending the file?
Rayner Pope
Yeah, we consider tape out, send the file and then there's a whole process of you make the masks for all the layers and then actually just producing the chips.
Host
Got it. So producing the masks and producing the chips happens after tape out.
Rayner Pope
That's right.
Host
I see. Okay, so like is the term tape out from like you send a magnetic tape with the instructions or something?
Rayner Pope
It could be. I was in software when that term was created.
Host
Curious what the tape actually means. Thinking about AI predictions, one thing I'm really struck by is how, still in 2026, every time you open a chat window it's contextless, it's got no memory. And now to be fair, it's like, guys, it's been four years. Not even four years, it's been three and a half years. Just calm down, we'll get there. But I also interpret a lot of the current enthusiasm for OpenClaw and all that stuff as it's like this super hacky backdoor into state management where, you know, your little claw will write a markdown file of what it's doing and then, you know, look at that markdown file the next time, and things like that. But it just feels like state management and memory is going to be a huge deal and that will really change the character of AI products.
Rayner Pope
Yeah, it's really interesting. So I mean, long context is one of the biggest bottlenecks on model performance, on the speed of the model.
Host
Yes.
Rayner Pope
Every single token you generate, it reads through all of the previous tokens, or maybe it reads through a subset of them, but reads through a lot of the previous tokens you've written. And so memory bandwidth for that is really constraining. You can think of model level ways to solve that problem, which is to say, maybe I can compress it into fewer bytes or something like that. But it's interesting that the sort of most effective way to solve it has been. I mean, it's really a combination of everything. But the most effective way to solve it has been once you hit your 300,000 token limit, have the model go back through it and compact.
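The compaction approach described here can be sketched at the application level. This is a toy illustration, not any product's actual mechanism: `append_with_compaction` and `toy_compact` are hypothetical names, whitespace-separated words stand in for tokens, and the `compact` callable is a stand-in for asking the model to summarize its own history.

```python
from typing import Callable, List


def append_with_compaction(
    history: List[str],
    new_message: str,
    token_limit: int,
    compact: Callable[[List[str]], str],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Append a message; if the total exceeds token_limit, replace the whole
    history with a single compacted summary produced by `compact`."""
    updated = history + [new_message]
    total = sum(count_tokens(m) for m in updated)
    if total > token_limit:
        return [compact(updated)]
    return updated


def toy_compact(messages: List[str]) -> str:
    """Stand-in for a model-written summary: keep the first three words of each message."""
    return " | ".join(" ".join(m.split()[:3]) for m in messages)


history: List[str] = []
for msg in [
    "design the adder tree first",
    "then check timing on the multiplier",
    "finally rerun the regression tests",
]:
    history = append_with_compaction(history, msg, token_limit=12, compact=toy_compact)
print(history)
```

Swapping in a different `compact` function is the code-level analogue of changing the compaction prompt: it can be iterated on in seconds, without touching the model.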
Host
That's kind of what OpenClaw is doing. It's like compacting everything you've done. But it's funny that it's so manual.
Rayner Pope
Yeah, I mean, I think manual is the wrong word.
Host
You know, it's so primitive.
Rayner Pope
It's maybe because it's so controllable. If you want to iterate on how you compact, you give a different prompt and you say compact this way, compact that way. You can iterate on that in seconds or minutes. Whereas if you're trying to do some iteration on the model level, where you say, now I've got a different model architecture, it's going to take months to train and launch something.
Host
Yes. Any other AI predictions?
Rayner Pope
I'm generally just interested in what makes models cheaper and faster. So that's just at the model architecture level, really tied into this context thing. I think the context size will stay ballpark the same where it is maybe a few times larger, but the parameter count will go up. Like parameter count should grow much faster than context length actually, just because of the underlying physics of what's available though.
Host
Has that been the story? Would that be a re acceleration of parameter count? Because it feels like we've leveled off slightly in the last year or two and instead we've been focusing on more and better rl.
Rayner Pope
Yeah, okay. Parameter count or thinking tokens, I guess those are available. But the context length I think is sort of struggling to grow.
Host
Yeah. Yeah. Okay, but you think we. When you say context length is struggling to grow, but you're saying we keep context the same length.
Rayner Pope
Keep context the same length, but we're
Host
better at working with large context. Is that what you're saying? Because.
Rayner Pope
Yeah, I mean, just have application level interventions to manage large context. Like compacting.
Host
Yeah, because I think everyone's had the experience currently of the chat conversation, and the further down in the chat you get, it just gets looser and, yeah, it's sloppy. It's just like really sloppy by the end and it's like making mistakes or whatever. So you're saying we start to do better with large context. Okay, how about that? When will I be typing into a chat window and it is a MATX chip underneath it powering it?
Rayner Pope
Tape out in under a year, and then that means chips available in.
Host
Yeah, okay, that's exciting. Okay, so in 2027 I will be seeing very high performing chats as a result of.
Rayner Pope
In the 1% experiment of the users or something like that.
Host
Yeah, exactly. Yeah. I need to find a way to finagle myself into the A/B test. Yeah. MATX is 100 people.
Rayner Pope
That's right, yeah.
Host
How have you gone about building the team? The culture?
Rayner Pope
Yeah. So I mean, what we have on the team is hardware, mostly hardware, but a big software team and also a big ML team. I think the ML team is quite unusual in what we ask them to do. When you look at a typical ML team in an AI chip company, it will be what I might call ML engineering or ML performance. They are writing kernels that actually will just use your hardware well on a given model. There's sort of a missed opportunity there. If you're saying all we do is we take other people's models and we write kernels for them, you're optimizing this, but you can't optimize this at the same time. And so we want to optimize the whole thing at the same time. So like real co-design.
Host
Yeah.
Rayner Pope
So our ML team is actual real ML research. What they do every day is they train our small LLMs from scratch, focusing on numerics and attention. And this has really, really helped us make an interesting product. It showed up most strongly in our numerics. Often what you see when people design numerics is they say, well, back when float32 was popular, it would be I'm going to follow the IEEE standard. Now it is like follow the Open Compute standard. And there's lots of little details where you say things like, what's the rounding mode? I'm going to use round to nearest even or something like that, which is like the best known standard for how to round. We want to cut corners anywhere we can. And so maybe don't do the best rounding, maybe don't get all the corner cases correct. That's a very scary proposition if you're just making those choices blind. But if you have the benefit of a research team who can sort of back you up as you do that, it's really powerful, and it's really interesting that we can make some sloppy choices in these cases.
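To make the rounding-mode trade off concrete, here is a small sketch comparing round to nearest even with plain truncation when narrowing a float32 to a 16-bit format. The choice of bfloat16 as the target is an assumption made for illustration (the conversation does not name a format), the function names are made up, and NaN and overflow handling are deliberately left out.

```python
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16_truncate(x: float) -> float:
    """Cheap option: drop the low 16 mantissa bits (rounds toward zero)."""
    return bits_to_float32(float32_bits(x) & 0xFFFF0000)


def to_bf16_nearest_even(x: float) -> float:
    """Round to nearest, ties to even, on the low 16 bits (NaN/overflow not handled)."""
    bits = float32_bits(x)
    lsb = (bits >> 16) & 1           # lowest bit that survives the narrowing
    rounded = bits + 0x7FFF + lsb    # adds just under half an ulp, plus one on odd
    return bits_to_float32(rounded & 0xFFFF0000)


x = 1.005859375  # 1 + 2**-8 + 2**-9, exactly representable in float32
print(to_bf16_truncate(x), to_bf16_nearest_even(x))
```

Truncation needs no adder at all, which is the kind of corner that can be cut in hardware; whether the extra error matters is the sort of question the conversation says gets answered by training small models.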
Host
I feel like often technical advances come through better iteration loops. A favorite example of this I found recently was that the Wright brothers actually had a failed season before first flight. So I guess first flight was 1903, and they were down in Kitty Hawk in 1901 and not making that much progress. And they went back to Ohio and they had a wind tunnel, and they were like testing their design in the wind tunnel. You can imagine, not a lot of wind tunnels in 1901. And they did a lot of wind tunnel testing. And their successful flight was after that. Is this something you're focused on, where to get better chips, you allow for a better testing and iteration loop? And what does that look like?
Rayner Pope
Yeah, I think this mostly happens in the architecture and product definition stage. Maybe even more generally, I think AI chips seem to live or die by product definition and architecture. What is the most extreme form of fast iteration? It's doing it in your head. And so can you map a model to hardware in your head? Can you estimate the performance of it in your head? You're not going to be 100% perfect, but maybe you can prove some kind of lower bound on performance. And so, I mean, the simplest possible thing is my model has a trillion parameters. My device can do a billion multiplies per second. So it takes 1000 seconds to run or something like that. Just do that simple division. But then there are much more complicated things. Like, we tend to look at resource balances. And so, like, how many memory fetches do I need to do per multiply or something like that? So, I mean, at least the way I like to do design and architecture and optimization is to be able to sort of estimate the performance to within about 30, 40% before even typing anything in at all. And so we've tried to do that a lot. A lot of our architecture comes from there. That's kind of on the performance side. This also happens on the circuit design side as well. Can you take a circuit and say, what is the gate count on that? So like a 16 bit multiplier has approximately 16 squared many gates. And you can do that for more complicated things, like sorting networks and so on. So we already have a pretty good idea of the costs and speeds of things at that point. After doing these calculations, then what we tend to do as sort of the next step of iteration is on the ML side we run model experiments. You get iteration speed just by having small models, mostly. And then on the hardware side, we use simulators, performance simulators, to do the next level of detail, to make sure we're seeing all the things we want to see.
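The two back-of-envelope estimates mentioned here are simple enough to write as one-liners. A minimal sketch with made-up function names; it assumes one multiply per parameter and nothing else limiting, and uses the "bits squared" rule of thumb quoted above rather than any real cell library.

```python
def runtime_estimate_seconds(parameters: float, multiplies_per_second: float) -> float:
    """Time for one pass over the model, assuming one multiply per parameter
    and ignoring memory, interconnect, and everything else."""
    return parameters / multiplies_per_second


def multiplier_gate_estimate(bits: int) -> int:
    """Rule of thumb from the conversation: an n-bit multiplier is roughly n squared gates."""
    return bits * bits


print(runtime_estimate_seconds(1e12, 1e9))  # trillion parameters, billion multiplies/s
print(multiplier_gate_estimate(16))
```

These are order-of-magnitude tools; the stated goal is an estimate within 30 to 40 percent, with simulators covering the next level of detail.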
Host
Yeah. This idea that the best iteration is in your head is kind of reminding me of Jeff Dean's numbers. Do you have your equivalent of those numbers at MATX?
Rayner Pope
Yeah, we have go gate in our company, which says what is the cost of an XOR gate, an AND gate, a full adder, an SRAM bit cell and so on.
Host
And you want people to be working with that stuff in their head and have an intuitive sense for it because again, it leads to better iteration. What is the pitch to someone joining MATX?
Rayner Pope
I mean, I think if you are someone who likes optimizing, just optimizing something, software, hardware, Factorio, whatever, if you're trying to fit something into the smallest budget possible, I think it's a pretty exciting place to be. I think hardware companies in general are really exciting because you have such a broad range of skills of people on the team. You have software people, you have hardware people, you've got physical design, you've got people who are just like looking at the insertion force of a card into a rack. And so there's so much discussion and learning you can do. I think MATX in particular, we really care about this, and I think we extend it all the way up into the application and the machine learning as well. Really, really, really interesting technical problems. And I think just generally there's lots of interesting people to talk to.
Host
Yes. And presumably in terms of impact, if you can design a meaningfully higher throughput chip, a 20% higher throughput chip means 20% more AI is happening. If the bottleneck is elsewhere, like power or something like that, or cost, you actually just are meaningfully increasing the amount of intelligence in the world, which is presumably exciting to people.
Rayner Pope
Yeah, yeah. I mean, I think this shows up both as just can it apply in more applications? As well as just how smart is the model?
Host
Yes. Why Rust?
Rayner Pope
So a previous project I worked on at Google, we did a lot of Haskell. I did Haskell when I was in school. I loved it. Very principled, very interesting. I like Haskell, but I also like making stuff fast. And then the question is, what is the first thing you want to do? You want to be able to modify your memory. In Haskell, you jump through hoops to do that. Maybe I just want a language that is like functional programming but lets me modify my memory. So I think Rust has a lot of the nice things, which are like type classes, or traits, and a rich type system. One of the interesting ways we use it at MATX is the range of data types that you express. In software, what are the integer types? Int32, int64, int8. Maybe that's all you care about. But it turns out in hardware you care about every single bit. And so you want to use like 17, 18, 19 bit integers, and that is quite natural to express. And we build up sort of a whole ecosystem of rich hardware data types in Rust as well.
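As a rough picture of what an odd-width hardware integer type looks like, here is a small sketch. It is written in Python rather than Rust and is not MATX's library; the `UInt` class and its method names are hypothetical, chosen only to show explicit bit widths, wrapping, and widening.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UInt:
    """Unsigned integer with an explicit bit width, as hardware sees it."""
    width: int
    value: int

    def __post_init__(self) -> None:
        if not 0 <= self.value < (1 << self.width):
            raise ValueError(f"{self.value} does not fit in {self.width} bits")

    def wrapping_add(self, other: "UInt") -> "UInt":
        """Same-width add that wraps modulo 2**width, like a fixed-width adder."""
        if other.width != self.width:
            raise TypeError("width mismatch")
        return UInt(self.width, (self.value + other.value) % (1 << self.width))

    def widening_add(self, other: "UInt") -> "UInt":
        """Exact sum: one bit wider than the wider operand, so it cannot overflow."""
        width = max(self.width, other.width) + 1
        return UInt(width, self.value + other.value)


a = UInt(17, (1 << 17) - 1)  # largest 17-bit value
b = UInt(17, 1)
print(a.wrapping_add(b), a.widening_add(b))
```

In a language with a richer static type system the width would live in the type rather than in a runtime field, so that a width mismatch is caught at compile time instead of raising an exception.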
Host
Has Rust beaten Go for the position of sort of performant typed programming language with modern features, or do they actually address different.
Rayner Pope
Yeah, I mean, so there's what the Rust marketing will say, which is "safe without garbage collection," which I think is real. I mean, it is the objective thing that you can say is different, but it sort of buries the lede, which is that it's also just got nice type system features that Go doesn't have. And then why does garbage collection matter at all? I mean, people often focus on the time it takes to run a garbage collector. But the other thing is that every time you allocate an object, you've got the object and then you've got the garbage collector header at the beginning. And so it uses a lot more memory as well. And so if you want to design some, I don't know, data structure that uses the right amount of memory rather than a bit more, then...
Host
I hadn't realized that in Rust you're allocating your memory manually, versus in Go you have a...
Rayner Pope
Yeah, yeah, that's right.
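To make the "right amount of memory" point concrete, the following sketch checks the size of a small record in Rust. The `Entry` struct is a hypothetical example, not something from the conversation. It shows only that the elements of a slice or `Vec` carry no per-object header; the allocator underneath may still round up or keep its own bookkeeping.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical table entry. `repr(C)` fixes the field order so the
/// layout is predictable: 4 + 2 + 1 + 1 = 8 bytes, no padding.
#[repr(C)]
#[derive(Clone, Copy, Default)]
#[allow(dead_code)]
struct Entry {
    key: u32,
    value: u16,
    tag: u8,
    flags: u8,
}

/// Bytes occupied by the elements themselves (length times element size).
fn payload_bytes(entries: &[Entry]) -> usize {
    size_of_val(entries)
}

fn main() {
    let table = vec![Entry::default(); 1000];
    println!("one Entry: {} bytes", size_of::<Entry>());
    println!("1000 entries: {} bytes", payload_bytes(&table));
}
```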
Host
Okay. And you prefer that for what you're doing or is.
Rayner Pope
I just really like dealing with the details. Like you give me a puzzle and I'll be like, let me solve every single piece of it. So that tickles that part of my mind.
Host
With Rust, it seems like you're a fan of optimization generally. Is that a fair characterization?
Rayner Pope
Yeah.
Host
Where else have you. So chip optimization is one domain, where else?
Rayner Pope
Yeah, so I mean, one of the really exciting things I found about working at Google is the whole Google code base is available, and you can look at how does a memory allocator work, how does a mutex work, how does a hashmap work, any of those things. And you can go and look inside the implementations, and Google has excellent implementations of those, some of the best you could write. So one of the things I did on my nights and weekends when I was at Google was just go find those implementations, write a benchmark: how many nanoseconds does it take to allocate 8 bytes of memory? And then can I make that faster? Maybe I inline this function? Maybe I look at the assembly and say, looks like there's a few memory moves here, or there's some registers being used that I don't need in the fast path, I only need them in the slow path. Can I do something there? So, I don't know, that was always my fun and learning activity. I probably could have done this inside of Google as well, but outside of Google I felt the luxury of being able to talk about these results too. One of the things I've looked at recently is just that hash tables are used so much. One prompt for me was, what if I wanted to design custom CPU instructions for accelerating hash tables? Like, hash tables are one of the most common things. I'm looking them up and writing them all the time. What would the optimal CPU be for that? And so then following down that chain is, what is the best hash table implementation in the first place? And so I spent some time looking at different SIMD implementations, and there's this really cool technique called cuckoo hashing, where you hash into two different locations and then you use the bucket which is less full. It's been in the literature for decades, and yet the best hash table implementations don't use it because it's somehow not practical.
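The allocation micro-benchmark mentioned earlier in this answer ("how many nanoseconds does it take to allocate 8 bytes") could look roughly like the sketch below. It is only an assumption about the shape of such a benchmark, not the one Pope wrote: it times Rust's default allocator, it measures allocation plus the matching free plus loop overhead, and the numbers are only meaningful in an optimized (`--release`) build.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Average nanoseconds per iteration to allocate and free one 8-byte heap object.
fn ns_per_alloc(iters: u32) -> f64 {
    let start = Instant::now();
    for i in 0..iters {
        // black_box keeps the optimizer from deleting the allocation.
        let boxed = black_box(Box::new(i as u64));
        drop(boxed);
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    ns_per_alloc(1_000_000); // warm-up pass, result discarded
    println!("{:.1} ns per 8-byte alloc+free", ns_per_alloc(10_000_000));
}
```

A serious version would repeat the measurement, report a distribution rather than one mean, and compare against a baseline loop with no allocation.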
Host
Why is it not practical?
Rayner Pope
Practical hash tables these days are considered to be ones that use SIMD vector instructions to scan like eight buckets at a time. And the way cuckoo hashing is normally described is, I look up one bucket here and one bucket there, and so I'm not using the vector instructions. Vector instructions are much faster than scalar instructions, and so there's kind of a missed opportunity. Again, just take the two good ideas and stick them together: do vector instructions on cuckoo hashing. You have to be careful to get the details right, but if you get it right, you can actually just win.
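The combination described here can be sketched as a cuckoo-style table whose two candidate locations are each an 8-slot bucket scanned as a group. This is a simplified illustration, not Pope's implementation: all names are hypothetical, the fixed 8-iteration tag scan stands in for a single SIMD compare rather than using real vector intrinsics, and insertion simply fails when both buckets are full instead of evicting or resizing.

```rust
const SLOTS: usize = 8;

#[derive(Clone, Copy)]
struct Bucket {
    tags: [u8; SLOTS], // 0 means empty; otherwise a nonzero byte derived from the hash
    keys: [u64; SLOTS],
    vals: [u64; SLOTS],
}

impl Bucket {
    const EMPTY: Bucket = Bucket { tags: [0; SLOTS], keys: [0; SLOTS], vals: [0; SLOTS] };

    fn len(&self) -> usize {
        self.tags.iter().filter(|&&t| t != 0).count()
    }

    /// Scan all 8 slots; a SIMD version would compare the 8 tags at once.
    fn find(&self, tag: u8, key: u64) -> Option<usize> {
        (0..SLOTS).find(|&i| self.tags[i] == tag && self.keys[i] == key)
    }
}

/// 64-bit mixing function (the splitmix64 finalizer).
fn mix(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

pub struct CuckooMap {
    buckets: Vec<Bucket>,
}

impl CuckooMap {
    pub fn new(nbuckets: usize) -> Self {
        assert!(nbuckets > 0);
        Self { buckets: vec![Bucket::EMPTY; nbuckets] }
    }

    /// Two candidate bucket indices plus a nonzero one-byte tag.
    fn locate(&self, key: u64) -> (usize, usize, u8) {
        let n = self.buckets.len() as u64;
        let h = mix(key);
        let b1 = (h % n) as usize;
        let b2 = (mix(h ^ 0x9E37_79B9_7F4A_7C15) % n) as usize;
        (b1, b2, ((h >> 56) as u8) | 1)
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        let (b1, b2, tag) = self.locate(key);
        for b in [b1, b2] {
            if let Some(i) = self.buckets[b].find(tag, key) {
                return Some(self.buckets[b].vals[i]);
            }
        }
        None
    }

    /// Inserts or updates. Returns false if both candidate buckets are full.
    pub fn insert(&mut self, key: u64, val: u64) -> bool {
        let (b1, b2, tag) = self.locate(key);
        for b in [b1, b2] {
            if let Some(i) = self.buckets[b].find(tag, key) {
                self.buckets[b].vals[i] = val;
                return true;
            }
        }
        // Two-choice placement: use whichever bucket is less full.
        let b = if self.buckets[b1].len() <= self.buckets[b2].len() { b1 } else { b2 };
        let bucket = &mut self.buckets[b];
        match bucket.tags.iter().position(|&t| t == 0) {
            Some(i) => {
                bucket.tags[i] = tag;
                bucket.keys[i] = key;
                bucket.vals[i] = val;
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut map = CuckooMap::new(1024);
    for k in 0..1000u64 {
        map.insert(k, k * 10);
    }
    println!("lookup of key 42: {:?}", map.get(42));
}
```

Storing a one-byte tag per slot is what makes the group scan cheap: eight tags fit in a single 64-bit word, so the full key comparison only happens on slots whose tag already matches.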
Host
And so sorry, is your claim that one could design a custom CPU that has way better hash table performance, or even on current chips you could get way better hash table performance.
Rayner Pope
So both. I mean, I'm interested in what you can design in custom hardware, but MatX doesn't make CPUs. We're not going to make CPUs.
Host
You could. New line of business.
Rayner Pope
I mean, we just want to focus on shipping one product well, for the time being.
Host
Fair. Good answer.
Rayner Pope
So, I mean, I think it's an interesting exercise, but I don't get to feel the endorphins of seeing the number going down. So I first did this on just Intel CPUs and you can get better performance than like some of the best hash table implementations available using cuckoo hashing on Intel CPUs even.
Host
And what are examples of workloads that are really hash table read intensive? I know, kind of everything but JavaScript, I guess.
Rayner Pope
But yeah, I mean, it's sort of a tricky exercise because when you really think about it, you're like, did I really need a hash table there? I probably didn't, but you just reach for it all the time.
Host
Okay, well, you could go to the Google JavaScript team and probably help them eke out better performance in the Chrome JavaScript engine.
Rayner Pope
Yeah, I mean, potentially I'm not going to spend my time on that.
Host
Well, if they're listening to this podcast, here's a free idea from Reiner. And then, explain the dragon.
Rayner Pope
Yeah, this is from a book from when I was working on the JAX team. So the JAX team is one of the ML infrastructure teams at Google. It was my most recent team before I left.
Host
And so what does the JAX team do?
Rayner Pope
Yeah, so the JAX team develops JAX. This is sort of Google's new, more modern version of TensorFlow, or competitor to PyTorch. It's how you write models in Python to run on TPUs. A big part of the JAX team, however, is to say, okay, we have JAX, the technical artifact. Can we help enable users to actually use it really well and get high performance? And so ultimately that became, well, who are the users? It's people writing LLMs. How do you get good performance on LLMs? And so, really, really strong team, the JAX team at Google, although, as with a lot of Brain, people are now elsewhere as well. And so we developed a lot of the different techniques for how to lay out models efficiently on many chips. And so ultimately some people at Google, and I contributed after I left Google, wrote this guide called "How to Scale Your Model": how to run an LLM as fast as possible. It is sort of the main reference for how to get high performance on TPUs. There is now also a GPU version of this as well. It's a dragon because it's How to Train Your Dragon.
Host
I see. Okay, last question. People might not have thought that there's room for new chip companies. It might have seemed unusual or very hard. And you guys seem to have a very good approach to that. Where do you think there are other opportunities for companies to be started here in 2026? Where do you think people should be looking for entrepreneurial opportunities, or just technical challenges that haven't been properly addressed?
Rayner Pope
More labs, I think, is still interesting. "Can we do more on model architecture?" is always interesting.
Host
You think we have not fully explored model architecture space?
Rayner Pope
Yeah, I mean, the frontier labs have done a pretty good job of exploring it, but I think as the hardware changes, the shape of the model should change for sure.
Host
Yeah. Okay. And presumably you're not thinking like yet another frontier lab pursuing the same architecture. You think there's probably off the wall looking architectures that will actually make a lot of.
Rayner Pope
Yeah, I think there's a little bit off the wall for sure.
Host
Do you have a specific architecture in mind?
Rayner Pope
My mentality is always sticking within the transformer family. But what are the constraints that are currently imposed that you could lift? So for example, one of the things is, when you're doing transformer inference, you do prefill, which is sort of processing what the user said to you. And then there's decode, which is generating the response to that. And those are totally different in pretty much every aspect of how they actually run. One runs a step at a time, the other one runs really in parallel. So there is this somewhat artificial constraint today that it's the same model doing both. Maybe lift that constraint. Another example, and I mean this is a more fundamental constraint, is that you have to train the same model as you serve. But again, training is very different from serving. Training is very compute intensive; serving is more memory-bandwidth intensive. And so maybe there's a way you can make a model that, when you use it at inference time, increases the amount of compute it does to use some of the available resources?
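One way to see why prefill and decode behave so differently is a back-of-the-envelope estimate of FLOPs performed per byte of weights read. The sketch below is a standard roofline-style simplification and not something stated in the conversation: it ignores activation and KV-cache traffic, and it assumes 16-bit weights, a hypothetical 2048-token prompt, and a decode batch size of 1.

```rust
/// FLOPs per byte of weights read when `tokens` tokens share one pass over
/// the same weight matrix. Each token costs one multiply and one add per
/// parameter, so 2 FLOPs per parameter per token.
fn flops_per_weight_byte(tokens: u64, bytes_per_param: u64) -> f64 {
    (2 * tokens) as f64 / bytes_per_param as f64
}

fn main() {
    let bytes_per_param = 2; // assumption: 16-bit weights
    let prefill = flops_per_weight_byte(2048, bytes_per_param); // whole prompt in one pass
    let decode = flops_per_weight_byte(1, bytes_per_param); // one new token at a time
    println!("prefill: {} FLOPs per weight byte", prefill);
    println!("decode:  {} FLOPs per weight byte", decode);
}
```

Under these assumptions prefill does 2048 FLOPs for every weight byte it loads while decode does 1, which is the sense in which one phase is compute bound and the other is memory-bandwidth bound. Real serving systems batch many decode requests together to raise that ratio.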
Host
Yeah, makes sense. Well, Reiner, thank you.
Rayner Pope
Pleasure.
Date: February 26, 2026
Host: Stripe (John Collison)
Guest: Reiner Pope (Co-founder & CEO, MatX)
In this engaging episode, Stripe's John Collison ("Host") sits down with Reiner Pope, CEO and co-founder of MatX, over a pint to dive deep into the world of AI-optimized chips. Pope, a former Google TPU architect and seasoned software/hardware engineer, gives an insider's view on transformer hardware evolution, the economics and constraints of AI chip manufacturing, and MatX's quest to build the best chips possible for large language models (LLMs). The conversation ranges from deep technical architecture details to industry supply chain dynamics and predictions for AI's evolving landscape.
| Topic/Segment | Timestamp |
|---------------|---------------|
| Google's TPU foundations | 00:52–01:37 |
| What makes AI chips fundamentally different | 03:36–05:30 |
| GPUs vs. CPUs for AI: the truck vs. motorcycle analogy | 06:35–07:54 |
| Why MatX exists: niche hardware for LLMs | 08:03–09:36 |
| Core MatX product metrics: throughput & latency | 09:58–11:00 |
| Series B/E funding, supply chain strategy | 11:45–12:49 |
| Application-level chip metrics (tokens/sec) | 13:31–13:56 |
| Current/potential supply chain constraints | 18:23–19:29 |
| Details of the MatX chip architecture | 21:54–25:13 |
| Chip design methodology (Verilog, iteration) | 25:42–29:42 |
| The economics and risks of 'tape out' | 31:20–32:14 |
| TSMC, global fab context | 37:13–40:13 |
| Labs making their own chips? | 40:29–41:22 |
| AI in chip design; feedback cycles | 47:44–48:45 |
| Model architecture predictions | 54:18–55:12 |
| Team-building and co-design at MatX | 56:07–56:57 |
| Iteration and performance estimation | 58:55–60:56 |
| The “Brooks’ Law” of hardware: waterfall vs. agile | 27:41–27:48 |
MatX’s story exemplifies the intersection of deep compute systems engineering, the evolving economics of AI, and the entrepreneurial courage to innovate under constraints. Reiner Pope provides both a technical masterclass and a candid look at the challenges and rewards of building at the edge of AI hardware.