
A
Hello, this is Latent Space, just swyx today, with our special guest Jack Morris. I guess from Columbia? That's your affiliation right now? Cornell.
B
It's actually confusing because I go, I'm in the New York City outpost of Cornell. So you have the city. Right, but it's Cornell Tech, which is like a small Cornell campus in New York. I just.
A
You're a student of Sasha Rush, who teaches at... So I should have made that connection. Okay, yeah, I'm sorry. Wow, that's a horrible mistake to make right off the bat. But look, there are not that many PhD students that make an impact with their research. The last time something like this happened was Shunyu Yao from Princeton, and he joined the OpenAI Operator team quite shortly after he graduated. So you're one of those high-profile PhD students coming out of the program, and I figured it was a good time to just talk about your work and also the fact that you're looking at which lab you're going to join. That's a whole interesting meta discussion, especially with the insane market for AI talent these days. What's it like to be an AI grad student these days?
B
Yeah, and thanks for having me. I guess maybe we can go back to when things first started, or, like, put yourself in my shoes. In 2017, 2018, I really learned a lot about machine learning. I went to a state university. It's a good school, but they didn't have a deep learning research department or anything. They had people doing it, but it was just not as big at that time. But I was getting really interested in those topics, especially as applied to language. And then in 2019 I was starting to do research and thinking about my career. I mean, at that point I was 20, 21, and I was thinking about, like, where do I want to be career-wise? Who's doing the coolest stuff right now? Looking at what kind of stuff was coming out at that time: I thought AlphaGo was really good, and I was playing a lot with BERT and BERT-based models. So, you know, Google, DeepMind, they were doing great work. GPT-2 and GPT-1 from OpenAI were interesting, but I think most people were into BERT at that time. I still have a soft spot for that parameter class of 100 million to 1 billion scale models. But this is all to say, I think at that time I felt like the people doing a lot of the most impactful work were professors and PhD students, with just a ton of interesting ideas being explored and cool opportunities in academia. So I ended up applying to grad school. Well, at first I did this Google AI Residency program, which was mostly during the pandemic, 2020 and then 2021, and I was also applying to grad school. I started grad school in 2021. That's still what was going on at that time, around when GPT-3, at 175 billion parameters, had been released, but not InstructGPT. So we had pre-training, and the science of pre-training was emerging, but that's where the models were. 
And I'm still glad that I went to grad school, and I had a great experience, but a lot has changed in the last five years. The whole meta has shifted. The power dynamics are completely different, and the ideas are coming from different places. Most stuff was open; now most stuff is not open. The types of questions people are asking are different. And so, for better or for worse, I did do the full grad school thing, and here I am. It's been a really interesting perspective, watching the science emerge along with the products. The biggest thing that happened by far was ChatGPT coming out, which was right in the middle, in, what, 2022, before Christmas, like November. I remember that year my grandma was asking me about it, and that's when it hit me: oh, this is actually becoming a real area that people will know about and understand. I was trying to explain it to my parents. And that's when I think things really started to change: the types of questions you wanted to ask can't always be answered with academic resources. So a lot of the fundamental, boundary-pushing AI science moved into companies.
A
That was the year when, you know, just around NeurIPS as well, everyone in NLP and deep learning was very confused. I think some people were kind of expecting this already, in the sense that they were obviously more clued into large language models. But I think the sheer amount of consumer-level interest that was around at the time in 2022 completely changed the world. Now we're just in a different sphere. Did you have to pivot your research, or were you already... You just went from BERT to other stuff. You've done a lot of embeddings work.
B
I mean you're always heads down working on a problem. So I don't think most people in academia are the type to say, oh look at this new product that came out. I'm going to abandon everything I'm doing.
A
That could be the right move, you know.
B
Oh, it definitely can. Honestly, if I were to give advice to a younger grad student, I think the way to do it would be to literally just sit and wait until the next paradigm shift and then immediately start working as fast as you can to reimplement it. I don't think that's maybe the best way to do science, but it's probably the best way to play the academic game in the days of AI. You've seen that so many times, most recently probably with the reasoning models. o1 came out of OpenAI in September 2024, last year. And then there's just been this explosion where you build abstraction ladders on top of that. First it was reimplementation, like, how do we even do this? And now it's a lot about the data. What's the right data, what are the right evals, what are the right training schemes? There are so many different axes you can test and publish research in. And I think the easiest way to do that is probably to work in a field like that, one that has only existed for less than one year, so no one has any big advantage.
A
I guess that is mostly correct. I think anyone who jumped on reasoning and RL for LLMs is doing super well. I just saw this morning that one of the recent Stanford grad students who worked on RL just started their company and they're worth 500 million. It's absolutely bonkers right now. Just no product, just three dudes, you know, sitting in some basement somewhere. I mean, undoubtedly cracked, but also not worth 500 million.
B
Yeah, but maybe it's not paying for the product, right? It's like the potential ideas behind it or the.
A
Yeah, yeah.
B
There was this big shift from in scale of working with a hundred million parameter models. Really what happened is like I think the companies invested a ton more into training and infra and like we all kind of had to catch up. Like you know me, I go to Cornell, work with a professor there, he has to buy GPUs. Like should he buy last year's GPUs or this year's GPUs? How many should he get? That we were kind of like trying to figure that out. And there was, there was like a big lag, I think, where basically the 7 and 8 billion parameter scale, like there's a huge difference between the BERT size models, which are 125 million parameters, to, to 200, and then like the 8 billion parameters, I mean, obviously it's two orders of magnitude. But just like this idea of emergence, like if you're talking to a model that's 100 million parameters, no matter how well it's trained, it knows nothing. Like if you ask it like what's the capital of a state? Or like if you ask it who's who was President of the United States in 1990 or whatever, it'll just always say George Washington because it just associates the words like President United States with George Washington. And then when you get to the 8 billion parameter scale, suddenly it knows every single president and knows every single capital of every single country. And I really do think that changes the type of research you can do. And so like, it took us a while, I think in academia to catch up, like getting good 7 billion parameter models and then running them and getting GPUs to run them. Now I think things have stabilized a lot. Like we have access to compute and we can kind of like fine tune and inference that scale of models. And that's like kind of fine. But there was kind of two years where everyone in academia was working on smaller models and none of it really mattered.
A
I can sort of branch that discussion in two ways and we should sort of go to your research at some point. But I'm enjoying this because I think we don't get to talk about this on the podcast too often. One is there is an often bit of advice from the industry people to grad students, which is give up, don't work on models, just do benchmarks, right? Like a really good benchmarks will get our attention and then we'll hire you and then you can switch to models later. You have, for better or worse, avoided that, which is cool. And we can talk about that as well. But the other thing I think is that around about 7, 8B, maybe 4B is when you start switching from a single GPU setup to a distributed setup. And I'm wondering, do grad students get HPC training? How much do they teach you of just how to work with large clusters of stuff?
B
Oh, to be clear, they don't teach you anything. Like, anything. If you see a paper coming out from even, you know, Stanford, probably the best school in AI if you had to choose, it's not like they're learning how to do multi-node distributed FSDP training with, whatever, DeepSpeed. You have to learn that from the Internet and from other people. There are no classes that really do that. I mean, that's hard to facilitate as one person. I would say most grad students are doing stuff on a single GPU. Some people are doing multi-GPU training. There are probably basically no grad students doing multi-node training. I mean, there are probably a few, especially if they have company affiliations. But that's really unusual, I think.
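The earlier remark that models around 7 or 8 billion parameters push training off a single GPU can be checked with rough arithmetic. The sketch below assumes mixed-precision training with Adam at about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision master weights plus two optimizer moments) and an 80 GB accelerator. Both figures are assumptions, not numbers from the conversation, and activation memory is ignored.

```python
# Rough training-memory estimate; every constant here is an illustrative assumption.
BYTES_PER_PARAM = 2 + 2 + 12   # fp16 weights + fp16 grads + fp32 master weights and Adam moments
GPU_MEMORY_GB = 80             # assumed capacity of one accelerator


def training_memory_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer state, ignoring activations."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus(num_params: float) -> int:
    """Smallest number of GPUs whose combined memory holds the training state."""
    needed = training_memory_gb(num_params)
    return int(-(-needed // GPU_MEMORY_GB))  # ceiling division


for name, n in [("125M", 125e6), ("1B", 1e9), ("4B", 4e9), ("8B", 8e9)]:
    print(f"{name}: {training_memory_gb(n):.0f} GB, at least {min_gpus(n)} GPU(s)")
```

Under these assumptions a 4B model's training state still fits on one device while an 8B model's does not, which is roughly where sharded approaches such as FSDP become necessary.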
A
Okay, for grad students who are looking to get up to speed on that, I would recommend the GPU MODE Discord, where basically the PyTorch team is hanging out, just waiting to help you. And then the other one would be the fast.ai team. If you have some kind of thing, Jeremy Howard will basically help you out. And they have some distributed training. Honestly, try to reach out to the DeepSpeed team at Microsoft. They're actually reasonably accessible. Nobody talks to them. It's so funny. I met them at NeurIPS and they had nobody at their... They were presenting DeepSpeed 3. I was the only one asking questions.
B
Yeah, that's good advice. Listen to this guy.
A
Yeah, I mean, it's just basically like people are there if you want to ask. This is very, very valuable experience. Once you're like a GPU God, like you're basically, you know, in a different tier as a researcher because you don't rely on someone else helping you out. Like you can just kind of be your own research engineer, you know.
B
Yeah, I'll comment on that quickly because if someone has been listening to this and also following me online for a while, I think I've made a couple comments like saying something like you shouldn't learn about CUDA or things to that nature. And I'll, I'll give some more color to that. So it's definitely a great idea to learn CUDA if you can. I think my point was that if you're trying to enter this space, like learn about the models, learn about how they're trained, what the data looks like, what the compute looks like, one axis of that is how to do more efficient training and inference. And one part of doing more efficient training and inference is studying the hardware, which is GPUs. So like, I think that's a very small subset of all possible knowledge that you could Acquire and it's probably not the best place for a lot of people to start. That said, if you do it, you've got to be one of the most hireable people in the world. Like if you like really deeply understand the architecture of the new GPUs coming out and how to control it, you're in a very small handful of people and like everyone will want to hire you.
A
Actually the sweet spot is not even CUDA right now. I would say actually it is Mojo. I don't know if you've been paying attention to Modular Mojo.
B
Oh, I listen to your podcast, man. You had that guy on the other day.
A
The whole story is Chris Lattner, industry legend, LLVM, Swift, all these things. And now he's turned his attention to the Python-CUDA relationship, right? And he wants to basically create a viable CUDA replacement. It's basically Python married with Rust. For the last two and a half years it was basically kind of stealth, not ready for production. When he came on our podcast, he was basically announcing to the world: we're open for business, you can use us now for most models, and we actually are faster than the native, sometimes the PTX, implementation. I don't know how that works precisely, but he's a compiler languages god. I think it's one of those windows now, like you said, bet early on something that's a shift. It's one of those windows where, when you try to implement things, Modular is basically 100 people. If you run into issues, you'll get Chris's personal help on things. I'm not promising it, but probably, because he wants to work on improving the toolkit. It's not really about becoming a CUDA god, because obviously once you ramp up on the general concepts and principles, you can probably translate ecosystems pretty effectively. A lot of people switch from JAX to CUDA. But the thing is being able to experiment very quickly on a limited budget. Efficiency is not just because you are trying to be an efficiency guru and that's your career, and that's kind of boring. It's really also just about being able to experiment very quickly and finding these ideas.
B
I also think vLLM and SGLang seem really good and important and here to stay. They'll probably just get larger and more complex to accommodate future systems. But if I were a starting-out grad student working in that area, I'd probably want to learn more about how they work.
A
Awesome. Let's go to your research. I'd like to mention that I first came across you because of CDE, the Contextual Document Embeddings paper. You can tell me the story about that, but I just want to show you proof that I get one slot per day to highlight the number one AI story. And you were the slot of the day for October 5th.
B
Oh, no way.
A
I mean, obviously you were producing work before that. But I thought CDE was a really cool exploration of, like, oh yeah, embedding models are kind of stuck in a rut, and here's actually how to make them very efficient by just doing it in two stages. That seems like a relatively simple insight that was done very well. But you have a general, maybe information theory, thing that maybe we should start with, and then we can sort of work our way back.
B
Yeah, sure, that sounds good. So we can circle back on that. That's really cool that you wrote about it. What was that, almost coming up on two years ago? Yeah, this is the post I wrote. I called it a new type of information theory. We don't need to go into the... There's this paper about a concept called V-information. Maybe I'll give the most simple explanation, which is: say you have two text files. One text file contains a paragraph of information about New York City, and the other text file contains the same text but encrypted with whatever encryption algorithm. So it looks like random letters, but if you decrypt it, it has the same text as the first text file. From the perspective of Shannon's information theory, these two files contain the same information content. Relative to everything, they have the same number of bits. But it's very clear to the observer that the first text file, which is plain English text, is much easier to read and easier to process, even though they have the same information. And so there's this theoretical framework proposed in this paper, which is "A Theory of Usable Information Under Computational Constraints," from 2020. It really doesn't have that much press. There aren't as many citations as you would think. But I think it's a really, really neat idea. It's like, we should measure information with computational power as a constraint. So they have this idea they call V-information, of how much information is extractable from a given file or code. So in that case, we could say the left text file actually has more extractable information than the right text file. I think that's really good. That captures a lot of our ideas of how these deep learning systems work. Like, why does pre-training work? If you have two sets of weights and you want to train on some downstream data set, one set of weights is pre-trained, one set of weights is randomly initialized. 
Why is the pre trained model better at all even though it's never seen your data? Maybe one way of looking at that is that it has like, it makes the information like more extractable somehow. Like there's this concept of like computational processing that you can almost like store up. I like this as a, just like a lens to view problems with like how much information is stored where, like if you, if you get a, a set of model weights or like an activation vector and you open it up like print some tensor numpy array, it looks like random numbers, right? Like there's nothing human intelligible about that. But really it's this complex combination of the training data and the training algorithm which get compressed into model weights and then the actual computation that the model is doing, which involves manipulating these numbers in ways that we don't understand. So it's like this really highly compressed nonlinear combination of all these information sources mixed with computation. And I just think we don't have the right words of discussing this. I think I like the information theory analogy because back in the day, you know, we had phones and like telegraphs and, and people were just sort of like building the phone system with these crazy heuristics to like send information across the country or send telegraphs across the Atlantic. People were just like trying stuff and then we kind of found stuff that worked and we ran with it, but it wasn't really optimal. And it wasn't until someone came along and proposed this concept of like a bit, like a one or a zero that tells you something. And once we have a bit, we can do all these things. We can like count the amount of information, a signal, we can do really good error correction, we can measure properties of distributions of things and we can build like a really good system for phones. And then eventually which led to computers. 
I'm bringing this all up because I don't think we have, I don't think we know what a bit is yet in terms of like deep learning models. I'm going to graduate from my PhD this year, but I didn't figure it out. So if you're listening to this, maybe you can like, I don't know, spend more time on it, or you're smarter than me, or you have, you know, a group of collaborators, you can all get together and figure out what the right lens to look at this stuff is. But even by just asking these questions, I think I was able to conduct this research agenda that I'm kind of still working on actually.
A
Yeah. What do you call this field?
B
I don't know. I don't know. I called the post a new type of information theory. I don't think it exists yet. I guess so. Maybe it'll get a name once someone actually comes up with the right set of definitions. I think V Information is a really good start.
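The two-text-file example can be made concrete with a small sketch. It is not the V-information definition from the paper. It swaps in a much cruder proxy, using `zlib` as a stand-in for an observer with limited compute and a toy hash-based XOR cipher (the key and the `keystream` and `xor_cipher` helpers are hypothetical names chosen for illustration). Anyone holding the key recovers the identical text from either file, yet the bounded observer finds structure only in the plain one.

```python
import hashlib
import zlib


def keystream(key: bytes, n: int) -> bytes:
    """Produce n pseudo-random bytes by hashing the key with a counter (toy cipher only)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:n])


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the original."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


plain = (
    "New York City is the most populous city in the United States. "
    "It sits at the mouth of the Hudson River and is made up of five boroughs: "
    "Manhattan, Brooklyn, Queens, the Bronx, and Staten Island. "
    "The city is a major center for finance, media, art, and technology."
).encode("utf-8")
key = b"hypothetical-demo-key"
cipher = xor_cipher(plain, key)

# Same content for anyone with the key...
recovered = xor_cipher(cipher, key)

# ...but a compute-limited observer (here, zlib) only finds patterns in the plain file.
plain_size = len(zlib.compress(plain, 9))
cipher_size = len(zlib.compress(cipher, 9))

print("original bytes:      ", len(plain))
print("compressed plaintext:", plain_size)
print("compressed encrypted:", cipher_size)
print("decrypts to same text:", recovered == plain)
```

The compressor is only a proxy. The framework described above instead fixes a family of predictors and asks how much better they can predict a target once they are given the input.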
A
There's a couple of related threads. So first of all, you don't know this, but I actually have been trying to accumulate data about Shannon. It's just like a Shannon information theory view of language models. I have a lot of notes. This is actually on my GitHub for people who are watching along. But at the limit, if a language model has 175 billion parameters using 16-bit, it will take up 350 gigabytes. You can compare that to Wikipedia. Wikipedia is about 150 gigabytes. Let's say GPT-3 can store two Wikipedias. But is that a relevant measure of information storage? It is not, because you can compress Wikipedia a lot. There's a lot of repeated patterns. Tokenization is like the first form of compression. But I think there's a related talk from Ilya Sutskever about how deep learning, machine learning, kind of is compression: you have a data set, you compress it into a model that is smaller than the data set, but generalizes and has some amount of acceptable loss. I think that one of your commenters on the post made this direct comparison with Kolmogorov complexity, which is how Ilya sees it. So I think people have this information theory idea or approach to language models. It is just not precise because, exactly what you say, we don't know what a bit really means. We don't know what the most legible... Legibility is a word that comes to mind, in terms of it matters to us that it's human readable. Even if it's SHA-1, SHA-256, I don't care. But that is less readable and therefore there's more, I guess, I don't know. Entropy is not the word, because it's directly convertible, but it's just less useful.
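The storage comparison is simple enough to write down. The sketch below only restates the figures quoted in the conversation (175 billion parameters, 16 bits each, and a roughly 150 GB Wikipedia whose provenance is unclear); the variable names are illustrative.

```python
# Back-of-the-envelope storage comparison using the figures quoted above.
NUM_PARAMS = 175_000_000_000   # GPT-3 parameter count
BYTES_PER_PARAM = 2            # 16-bit weights
WIKIPEDIA_GB = 150             # figure quoted in the conversation, source unverified

model_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
wikipedias = model_gb / WIKIPEDIA_GB

print(f"model weights: {model_gb:.0f} GB")
print(f"equivalent to about {wikipedias:.1f} Wikipedias at {WIKIPEDIA_GB} GB each")
```

As noted, this counts raw bytes on both sides, so it says nothing about how much of either is redundant.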
B
Yeah, useful is a good word. I think maybe useful information or usable information is the right lens. And Kolmogorov complexity is a really interesting connection. I think that's a really good concept for computer scientists. So I'm not sure exactly about this specific talk or what he was trying to say, but I think that we have a very good understanding of language model pre-training, and there's a deep connection between language models and compression. Actually, maybe let's start with the embeddings. We can come back to that.
A
Okay, so is this, are we going to the first paper?
B
Actually, let's go to your Wikipedia numbers, if you still have access to that. So this 150 gigabytes for the text of Wikipedia, that sounds pretty high to me. Is that uncompressed, like text files?
A
I don't know. I grabbed it from Andrew Wang, so I don't know.
B
Okay, okay. No, no, I'm probably off. I just sort of have the sense that like when you store text it's generally like very, very small, especially when you zip.
A
Maybe he's including all the languages, all the edits, I don't know.
B
Yeah, yeah, that can make sense. Because I guess what you say, you know, if you want to do apples-to-apples comparisons, is that GPT-3 can store two Wikipedias, is that right? 2.3 Wikipedias, two point something.
A
Yeah.
B
So I thought it would be a lot more. And this is actually an experiment that you could do. You could just train a model on Wikipedia and keep training it until you can perfectly extract all of Wikipedia. And that would be a good way of knowing how many Wikipedias GPT can store. I like that idea. But I think this type of back-of-the-envelope math is really useful for thinking about problems and grounding yourself in the real world, even if you can never quite answer the questions you want to answer, at least like in four years. If we think about embeddings, you know, vectors that people use for search, we can do the exact same kind of math. So if you use the OpenAI embeddings, which last time I checked I think have 1,536 dimensions, and you say there's 16 bits per dimension, like half-precision floating point, it's something like 20 kilobytes of information in a vector. And if you want to store 20 kilobytes of text, that's a lot of text, like many, many paragraphs that you can perfectly compress into 20 kilobytes. And so I think this is kind of like the idea we had. I'll give you the practical explanation, which was, well, first of all, I'm a second-year grad student, I'm going to these conferences, seeing all these other things people are working on, and thinking, you know, what the heck? How am I gonna have my own little area to do work in that no one else is working in already? And so I spent a lot of time coming up with bad ideas. My advisor would say, no, that's not a good idea to work on. This happened many times, and even my first year and a half of grad school was a lot of exploration and a lot of coming up with bad ideas. And then, honestly, I'd be interested to see how he remembers it. But I think I wrote a sequence of proposals about different projects, and then I came up with this idea. 
I was like, oh, we should just try to do as well as we can to reverse engineer the text that's in embeddings. And then we were talking about it, and he was like, oh, yeah, you should just do that. And that was the end of the proposals. And then it was just working on that problem for a long time. At the time, I was really motivated by that because I was like, cool, as a grad student, my first sort of official sign-off on coming up with a good research idea. And at the same time, there was this big rise of this startup business model called a vector database. There were all these companies popping up, raising money, getting crazy funding, and then actual applications being built that do something where, instead of exchanging customer data, they exchange vectors. So we had this very grounded question of, like, what data are they actually sending when they send the vectors? First of all, you have this information-theoretic argument that when you send one vector, there should be a lot of text recoverable. A lot of these things represent very short documents, but they actually have many, many bits. So the problem seems tractable. And then second of all, we had this justification of how the product is actually being used. Like, if someone hacks into a vector database, what do they actually find? If that makes sense. So we were working on that for a while.
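The capacity estimate for a single embedding can be checked the same way. One caveat on the spoken figure: 1,536 dimensions at 16 bits each comes to about 24.6 kilobits, which is roughly 3 kilobytes rather than 20 kilobytes. The conversion to words below assumes one byte per character and about six bytes per English word including the space; both are rough assumptions added for illustration.

```python
# Raw capacity of one embedding vector, using the dimensions quoted above.
DIMENSIONS = 1536
BITS_PER_DIMENSION = 16        # half-precision floating point

total_bits = DIMENSIONS * BITS_PER_DIMENSION
total_bytes = total_bits // 8

# Assumptions for illustration: 1 byte per character, ~6 bytes per word with its space.
BYTES_PER_WORD = 6
approx_words = total_bytes // BYTES_PER_WORD

print(f"{total_bits} bits = {total_bytes} bytes (about {total_bytes / 1024:.0f} KiB)")
print(f"room for roughly {approx_words} words of uncompressed ASCII text")
```

Even at about 3 kilobytes, the raw capacity is far larger than the short passages these vectors usually represent, so the argument that a lot of text should be recoverable still holds.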
A
I think I have the talk that you did that Sasha highlighted is this one.
B
Oh, yeah, yeah.
A
Maybe that has the graphic that would kind of.
B
Oh, go one before. I think one before. This is actually. Yeah, this one's good.
A
Yeah, I like having visual aid. I like how. I like giving people breadcrumbs to follow up if they. If they're interested in digging more. But, yeah, I remember this is a pretty hot area of research at the time.
B
And there have been some really interesting follow-ups. We ended up building a system that can do this quite well, like taking an embedding. And I think our highlight number is, at a certain length, like a long sentence length, we can get 90% of the text back exactly. And a lot of people were able to do stuff with that. For example, I know these people that work on a problem of debiasing embeddings. In one data set, they have a procedure for removing all latent features that correlate with gender. So they can produce useful embeddings that, from some perspective, have no information or usable information about gender. And they'd been doing that for a while, and then they actually just used our tool. So they would put in a sentence like, this woman is a doctor at Weill Cornell Hospital in New York. Or say, this woman is a doctor, she works at Weill Cornell. And then they would run their procedure, and then they run our embedding-to-text model. And now it would say, this person is a doctor, they work at Weill Cornell, which is pretty cool. So they have sort of text-based evidence that their method is actually removing gender features. But let me talk for a second about the research phase here. I mean, I know if you ever heard me talk about this, I probably told you about it, but just for a wider audience. I like thinking back on this because, in some sense, my greatest victory of grad school was working on this embedding inversion problem for quite a while, and proposing a lot of approaches and testing stuff. I think sometimes you do stuff and it's clear it was a bad idea. Sometimes you think you should have figured it out earlier, and then sometimes you do stuff and you kind of realize it's really complicated and probably not worth it. 
So I was testing different decoding algorithms to produce text that's closer to the text that's in embeddings. And I was testing these kind of inference-time adaptation models for samplers. I think we tried a lot of architecture and training tweaks. We should have tried RL; I think that would work. But finally we found something that ended up working. And I guess I'm just saying this all because I thought it was so rewarding. We were just banging our heads against the wall. I would have biweekly meetings with my advisor, who would kind of suggest things. Sometimes we would agree we were mutually stuck. Sometimes I would get feedback one way or another and try something new or try a couple of things. And we had this idea that it was possible from the information theory arguments, and this other thing where we would take our best guess at what the text was and re-embed it and see that it was kind of far from the true embedding. So we had this proof that a better method could leverage this kind of information. And then when we finally solved it, it was awesome. We had this number that was like 30 for months. I think at one point I got it to 35. And actually I think I was like, oh, I'm done, I got it to 35. And my advisor told me, you can't really just propose a new problem and show you pushed a metric from 30 to 35. That's confusing and probably not that meaningful to people. And that was kind of like a local minimum for me, where I was bummed. But then we ended up getting the number to like 97 or something, which neither of us knew was possible. We were just kind of staring at this graph like, oh my God, who knew you could get this much information from an embedding? And that was so great. It was so rewarding. It was invigorating, honestly. 
Like, that research process of, like, we picked a good problem and then we spent so long trying stuff that didn't work, which I'm probably forgetting how frustrating that was. I'm sure it was terrible. But then like, actually solving, or at least coming up with a much better way of solving the problem. I don't know if I'd say we solved it, but we definitely learned a lot from where we started. Was great. And it completely solidified for me the fact that I should have gone to grad school to have this life experience and makes me want to do research forever.
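The signal described above, taking a best guess at the text, re-embedding it, and seeing how far it lands from the true embedding, can be illustrated with a toy. This is not the method from the paper. It swaps in a hashed bag-of-words embedding and a plain greedy search over a tiny vocabulary, and it assumes the attacker knows the encoder, the vocabulary, and the text length. `VOCAB`, `embed`, and `invert` are hypothetical names for this sketch.

```python
import math
import zlib

VOCAB = ["a", "at", "city", "doctor", "hospital", "in", "is", "new", "she", "works", "york"]
DIM = 64


def embed(tokens: list[str]) -> list[float]:
    """Toy encoder: hash each word and adjacent word pair into a bucket, then L2-normalize."""
    vec = [0.0] * DIM
    features = list(tokens) + [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    for feature in features:
        vec[zlib.crc32(feature.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def invert(target: list[float], length: int, rounds: int = 5):
    """Greedy correction: change one word at a time if the re-embedded guess moves closer."""
    guess = [VOCAB[0]] * length
    best = cosine(embed(guess), target)
    history = [best]
    for _ in range(rounds):
        improved = False
        for pos in range(length):
            for word in VOCAB:
                candidate = guess[:pos] + [word] + guess[pos + 1:]
                score = cosine(embed(candidate), target)
                if score > best:
                    guess, best, improved = candidate, score, True
        history.append(best)
        if not improved:
            break
    return guess, history


secret = "she is a doctor at new york hospital".split()
target = embed(secret)                      # all the attacker sees
guess, history = invert(target, len(secret))
print("recovered guess:", " ".join(guess))
print("similarity per round:", [round(h, 3) for h in history])
```

The similarity can only go up from round to round, but a greedy search like this is not guaranteed to recover the exact text. The gap between a guess's embedding and the target is the information a stronger method can exploit.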
A
You're clearly in love with the journey, which I think is important because this is what keeps you going through the tough parts. Is this a good time to talk about the universal geometry side then?
B
Yeah, yeah, let's do that next. I think that's a good idea. So we have this more recent follow-up. The first part I was talking about ended up in this paper called "Text Embeddings Reveal (Almost) As Much As Text," which was published in 2023. And then we recently had a paper come out on arXiv, which will hopefully be published at some point, and it's called "Harnessing the Universal Geometry of Embeddings." That was probably the only other time... maybe there have been two more times, but that was probably the second of three times where I felt like we made a real discovery about the unknown. And it was very rewarding just for its own intrinsic kind of elusiveness. I'll start by explaining it in terms of the prior paper. So we built a system that can, you know, do embeddings to text, and it works very well, and we're very pleased with ourselves. And then we went to a conference, we talked to people about it, we talked to the vector databases. I think some of them changed their privacy policies, which was somewhat gratifying. And then we kept getting this perpetual question, which is like, well, you're just assuming we use the OpenAI model, or you're just assuming we use the most popular text embedding model. If they fine-tune their own model, or if they use a model that you're not training an adversary for, then you can't solve the problem. Which is true. None of the vec2text stuff works unless you have this assumption of knowing the encoder and also being able to make a lot of queries to it. But we had this kind of underlying theory that all of the models learn very similar things. We had some preliminary evidence for that. Like certain models that are fine-tuned from the same base, you can kind of swap their representations without doing much. 
Or if you look at the nearest neighbors, a lot of the models will give you the exact same nearest neighbors even though they have completely different training bases. And then there's this paper that came out last year called the Platonic Representation Hypothesis, from some folks at MIT, which is really, really compelling and I think just a great intersection of philosophy, representation learning, and deep learning research. I love this paper, and it's such a beautiful idea, which is something like: all models are trained on data from the world, and there's only one world. And so as the models get better by scaling data and scaling model size, they're sort of converging to learn the exact same thing. And in this paper they have evidence, based on correlations, for this happening with vision and language models. It's very neat. So basically, think about it: you're us, you see this Platonic Representation Hypothesis paper. A lot of people have this shared idea that, you know, Claude and GPT-4 probably do a lot of very similar internal computation, because both of them are trained on trillions of tokens of human-written text. Even if they have different architectures, maybe the actual basis, or the numbers, if you look at them, look different, but in some way they're kind of computing the same thing. And I think it's even more true with these embedding models, which have really only one objective that works, and they're probably all trained on MS MARCO, which is a really popular data set, and pre-trained maybe on Wikipedia. But we wanted to basically combine this Platonic Representation Hypothesis idea with the vec2text thing and produce a system that can align models so that we can do embedding inversion.
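As a rough illustration of the nearest-neighbor point, here is a minimal sketch of how one might measure whether two embedding models agree on neighbors for the same set of texts. It is not the method from either paper. The function names `knn_indices` and `knn_overlap` are made up for this example, and the "two models" are simulated with random vectors and a random rotation, which is an assumption chosen so the expected behavior is easy to see.

```python
import numpy as np

def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """For each row, return the indices of its k nearest rows by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def knn_overlap(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 5) -> float:
    """Average fraction of shared k-nearest neighbors; row i in both arrays is the same text."""
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k

# Simulated data (assumption): "model B" is "model A" in a rotated basis.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
emb_b = emb_a @ rotation
emb_unrelated = rng.normal(size=(50, 16))

overlap_rotated = knn_overlap(emb_a, emb_b)            # same geometry, different numbers
overlap_unrelated = knn_overlap(emb_a, emb_unrelated)  # no shared geometry
```

A rotation changes every coordinate but preserves cosine similarities, so the rotated copy keeps the same neighbors while the unrelated embeddings do not. That is the sense in which the raw numbers can look different while the geometry matches.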
A
It's valuable for more than just embedding inversion. You can use this to kind of glue together models. That's what actually got me super excited. And by the way, I think there's a few related threads. I think we did an episode with Nicholas Carlini where he had an extraction attack on one of the GPT models and they got it fixed. The other thing I want to spell out for people, just in case they're not thinking about it: being able to invert embeddings also means that you can back out secret prompts or context that might leak customer information. That's potentially harmful and obviously an attack vector. One of the things I had a question about was whether position embeddings affect it, and whether extending position embeddings affects it, because obviously contexts are going to get longer and longer. Your ability to invert will obviously decrease with longer context. Right now, maybe not that important.
B
No, no, you're totally right. So in our work we're operating in this space where the sequences are relatively short and the embeddings are relatively large. I think we're at a great advantage from that perspective. And you're definitely right. If you embed an entire book to a 500-dimensional vector, no way. There's just no way you could get the entire book back. There must be these kinds of collisions. You know, in information theory, if you have lossy compression, two different inputs map to the same code, which means that you can never determine which input formed the code. And I think that's probably what will start to happen. Like if you have two books and you swap just one word and you embed them. I don't know, someone can try this. You'll probably get a perfect collision. And in that case inversion is impossible. And even when you don't take it to the limit, it probably just gets very, very hard. Things get super compressed. So I don't know how well this work scales. It's a great question, exactly how much information you can cram into one of these vectors. And I don't have a sense of where the boundary is.
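The collision argument can be made concrete with a back-of-the-envelope count. This is a minimal sketch under stated assumptions: the 500-dimensional vector is from the conversation, while the 32-bit storage per dimension, the 50,000-token vocabulary, and the 32-token and 100,000-token lengths are illustrative numbers picked for this example. The text bound counts every possible token sequence; natural language carries far fewer bits per token, so treat the output as a rough comparison, not a measurement.

```python
import math

def embedding_capacity_bits(dim: int, bits_per_value: int = 32) -> int:
    """Hard upper bound on the bits a vector of `dim` stored values can hold."""
    return dim * bits_per_value

def text_bits_upper_bound(num_tokens: int, vocab_size: int) -> float:
    """Bits needed to index every possible token sequence of this length."""
    return num_tokens * math.log2(vocab_size)

emb_bits = embedding_capacity_bits(500)                # 500 dims x 32 bits
short_bits = text_bits_upper_bound(32, 50_000)         # a short passage (assumed length)
book_bits = text_bits_upper_bound(100_000, 50_000)     # a book (assumed length)

# If the inputs need more bits than the code can hold, some two inputs
# must share a code (pigeonhole), and exact inversion is impossible for them.
short_fits = short_bits <= emb_bits
book_fits = book_bits <= emb_bits
```

Under these assumptions the short passage fits within the vector's raw bit budget and the book exceeds it by roughly two orders of magnitude, which matches the intuition that short sequences with large embeddings are the favorable regime.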
A
It'd be interesting to talk to one of the linear algebra people from the math department on how literally can we take inversion? What measures of a matrix do they have where we can kind of run that and like try to get some meaningful information out of that? This is like where information theory starts to collide with linear algebra and all the other stuff.
B
Totally, yeah. There's always this detail where we're running these on computers, so we don't actually have real numbers. We have floating point representations of numbers, which kind of throws a wrench into the mix.
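A small sketch of why floating point matters here: two distinct real numbers can round to the same 32-bit float, so some information is lost before any model is involved. The specific values below are chosen for illustration only.

```python
import numpy as np

one = np.float32(1.0)
almost_one = np.float32(1.0) + np.float32(1e-8)  # differs from 1.0 as a real number

# 1e-8 is smaller than half the float32 spacing near 1.0, so the sum rounds back to 1.0.
collide = bool(one == almost_one)

eps = float(np.finfo(np.float32).eps)  # spacing between 1.0 and the next float32
bits = np.finfo(np.float32).bits       # storage width of one value
```

Each float32 value has at most 2**32 bit patterns, which is the finite budget behind the counting arguments about how much a vector can store.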
A
Do you have any consideration of superposition, or sort of nonlinearity? Like, you could stuff information in the lower bits, but I don't know if that matters. I really don't. It's just a nice thing to think about.
B
Yeah, yeah, it is a great question. And I get a sense that like a lot of the less important bits are more useful for computation and maybe the higher order bits are more important for like storing data or something like that. But I'm not sure. These are the kinds of questions I'm actually hoping to explore over the next few years. Like, I'll skip ahead for a second. So we have this result that's like maybe the third sort of like discovery I was alluding to. Which is like a way to measure the exact capacity of a language model. And we get this number. If you train a language model on a ton of random data and you measure its rate of memorization. Yeah. Can you open the right curve? This is sort of the discovery I'm talking about. Like no matter how you scale the training size, you hit this like perfect, perfect ish plateau in total memorization, which we call the model capacity. And the question I've been stuck on in the back of my mind for a while is like, how is that actually implemented? So like, this is a transformer that is trained for Many, many data points and many, many training steps. And so like, it's almost like if you have, okay, the 10 to the six point on the x axis, the capacity, we don't have to actually say the numbers, but it's basically perfectly dividing its comp between all of the data points. Like every one of the 10 to the 6 data points gets like a tiny sliver of the model parameters because they're completely independent random strings. So I don't really know if superposition is occurring here. Like, it seems possible to me that the model would learn like these completely independent columns of computation, one per data point. But it's also possible it's learning some kind of like combined thing where maybe it learns like a load and a store and it's like sort of like loading and storing bits using these generic operations. And then in the end it reconstructs the random string. 
So even though like the data is completely independent, the kind of like compute is, is very similar in terms of like predicting random strings. But yeah, I guess this is all to say, like, about superposition and everything. I have no idea how the mechanisms are actually implemented inside the models. And that's like one thing I'm hoping to learn about in the next couple years.
A
It's a reasonable question whether it's meaningful to learn. I think there's a lot of things that are nice to know, but maybe not that useful. Latent space alignment is very, very useful. Data set efficiency: in theory, cool, but practically people are just going to go for the biggest data set they can. The scaling laws are kind of worked out insofar as the relationship of compute and data. Amount of memorization, I don't know. I think maybe this is a good point to also bring in the idea that Andrej has been pushing for the last year and a bit, I think: the cognitive core. What is the dumbest possible model that knows nothing but is smart enough for tool use to do everything else? So you can run it on device with fast inference, it's open source, whatever. So Gemma 3n is a really good candidate right now, because it's like a 4B model that is claimed to be better than Llama 4 and GPT-4.1, according to, you know, certain arenas that shall not be named.
B
This is where things get complicated. Like it feels like language models kind of implement things and know things almost in the same way. And it's like really difficult to disentangle like whether they're memorizing facts from whether they're like learning useful ways to Generalize about new stuff. But I agree this would be really nice. I don't think we have a lot of evidence that we can build a system like this that like is really, really good at reasoning but really dumb about the world. Like I don't know if we have the tools.
A
Yeah, maybe, maybe not. I think the existence proof is humans.
B
Right.
A
People always lean on humans as like the existence proof. It's not a great existence proof because I think if you talk to people about the number of neurons that we have and you make a neuron roughly equivalent to a parameter. We have something like 100 trillion in our brains and we consume 20 watts of energy. That's nothing. We're so much better than language models it's not even funny. And then the last feature of us is that we are self pruning which is not something that language models do as well.
B
Oh, like we forget stuff.
A
No, we are not deeply densely connected. Connections will drop. Therefore we're more efficient.
B
See, I see. Unlike a language model where everything is always connected all the time.
A
Yeah. Or you preset the skip layers or whatever and that's it. It's not really actually anything involved with learning. It's just something you do based on ablations and guesstimates.
B
Even if we did want that, I'm not sure if we have the right frameworks or methods for actually building what you're talking about yet.
A
I think the world is much closer to where you're at than where Andrej is at. Andrej is kind of wishing for an optimistic world. Our conversation with Noam Brown was like, yeah, reasoning is emergent. If you put the o1 harness on top of GPT-2, you would get nothing, because GPT-2 didn't know enough. You need a GPT-3 and GPT-4 in order to then get o1, with GPT-4 as the base model, which is like, yeah, I mean, that's reasonable. The way I put it is, in order to use tools, in order to search Google, you need to know at least the search terms in order to then search Google and then learn what you need. And if you don't know what to search, then you might just be too tired.
B
Dumb. I like the kind of ethos, like maybe you could do some kind of pre-training where whenever the model doesn't know something, it can just Google for it. And that way you try to encourage it to guess words correctly without actually storing the information in its weights. It seems like a nice goal at least.
A
Yeah, you need some kind of Online learning, probably, or memory and some combination of that. Yeah, it's exciting. I think if that is the direction of where this all ends up, that's great, but people are not doing that. Instead, we're building $500 billion data centers in the middle of Texas, and all hail the God cluster that just will eventually wrap around the sun and consume solar energy, because that's. That's what we need. Do we finish out the universal geometry thing?
B
Let me finish the methodological description. So, back to the embedding universality. We started with going from embeddings to text. We know about this Platonic Representation Hypothesis, and maybe I'll skip over the details, but basically we took our inspiration from computer vision, from this model from 2017 called CycleGAN, which is, among other things, a way to map between two different distributions without any underlying notion of which thing should be mapped where. It's just based on some kind of idea of closeness. So the cool thing about this, if you look at the top left: the top left is Monet, so Impressionist paintings, and this picture on the right is a photograph. So it's learning this kind of semantic notion of what content goes where just by mapping a distribution of Monet pictures to a distribution of photographs, without actually telling it which Monet picture should map to which photograph. It's kind of a subtle point I'm making. It takes a little bit of time to wrap your head around. Or maybe go to the middle one, if you don't mind, the zebras and the horses. So it's clearly learning what an animal is and what legs are, and more abstract stuff, like what the camera position should be and what grass is and stuff like that. And it's learning what a horse that looks like a zebra is, which is actually a complicated semantic concept. We don't have a data set that has a horse and then that horse as a zebra. We just have separate horses and separate zebras. But somehow this GAN system is able to elicit this sort of mapping property. It's kind of a magical connection that it learns. And I'm still in awe that it's possible at all. But we more or less repurposed this system and we built our own.
But this idea, we took it and we applied it to model embeddings, where instead of zebras and horses, we have BERT embeddings and GPT embeddings, or two completely different models with different architectures. So I think these are GTR, which is a T5-based retrieval model, and GTE, which is based on BERT. So they have different training data, different architectures, different downstream objectives, different embeddings. But yet when we do this CycleGAN in the embedding space, they just perfectly sort of snap to the same place, which is amazing and has some pretty deep implications for the Platonic stuff. Like maybe the models actually are learning a lot of the same functions or something, and in some semantic way they're very close. And yeah, this is a diagram of how our system looks.
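To make the unpaired-mapping idea a little more concrete, here is a minimal sketch of the cycle-consistency term that CycleGAN-style training uses: translate A to B and back to A, and penalize the round-trip error. This is only one ingredient and is not the system described in the paper. The linear maps, the random "embeddings", and the name `cycle_consistency_loss` are assumptions made for illustration.

```python
import numpy as np

def cycle_consistency_loss(emb_a, emb_b, f_ab, g_ba) -> float:
    """Round-trip error for A -> B -> A plus B -> A -> B (mean squared error)."""
    a_round_trip = g_ba(f_ab(emb_a))
    b_round_trip = f_ab(g_ba(emb_b))
    loss_a = np.mean((a_round_trip - emb_a) ** 2)
    loss_b = np.mean((b_round_trip - emb_b) ** 2)
    return float(loss_a + loss_b)

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 8))  # unpaired sample standing in for "model A" embeddings
emb_b = rng.normal(size=(120, 8))  # unpaired sample standing in for "model B" embeddings
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal map (toy translator)

# Translators that invert each other give (numerically) zero round-trip error.
loss_inverse = cycle_consistency_loss(emb_a, emb_b, lambda x: x @ q, lambda y: y @ q.T)
# Translators that do not invert each other are penalized.
loss_mismatch = cycle_consistency_loss(emb_a, emb_b, lambda x: x @ q, lambda y: y @ q)
```

Note that the two samples have different sizes and no row-to-row correspondence, which is the "separate horses and separate zebras" setting. Cycle consistency alone is satisfied by any pair of mutually inverse maps, so in CycleGAN it is combined with adversarial losses that push translated samples to look like the target distribution.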
A
It's weird to me how profound it seems. You seem deeply impressed by it. And then the other thing is, when we talked to Emmanuel from Anthropic, who did the circuit tracing and mechanistic interpretability work, they were excited that the same thing in different languages maps to the same circuits. And I'm like, what would you expect?
B
Yeah, yeah.
A
Like, I don't know, like why? Like, I don't know. I think, I feel like this feels more profound to you than it does to me. I'm like, yeah, obviously.
B
No, that's so fair. Maybe it's just like self congratulatory and we're happy that we're like the people that got it to work.
A
Yeah, exactly.
B
Yeah. It does seem obvious in retrospect, and I think that's constant feedback I've gotten in research. You know, people will tell you that this seems obvious to them, but you have to realize that you came from a perspective of no one ever having done this before, and they're coming from a perspective of you telling them it's true. And if someone had told you that this was true, it would maybe be obvious to you too, if that makes sense.
A
The way I put it is that we have the intuition but not the proof. You did the work and you have at least some evidence that it's true, whereas we just have intuitions.
B
Right, right.
A
So part of research is just confirming intuitions. The applied part comes from like, okay, now that you know this for a fact, what do you do with it?
B
Yeah, right. I think the details can be really interesting. Like the details of the proof, like which models are most similar to one another and to what degree can you get them to align and on which distributions does this property actually emerge? And like, that's why reading papers can be fun sometimes is because they kind of answer all those little questions.
A
Yeah, I would say, okay, I'll pull up something very current, which is Gemma 3n, which became generally available yesterday. I would say for me, and you can correct me if I'm wrong, the most immediate implication is mapping adapters to language models. So the dream is that you have a language model backbone. Let's say this one is like a 2B language model backbone. And then you offload your vision. So you only load in the vision parameters, or the vision adapter, when you need vision. You only load in audio, or text-to-speech, when you need it. Because these are all separately trained, you're just sort of aligning latent spaces, and you can train them separately. And I think this helps to make us more confident. One, it's more efficient. That's a given. Two, it makes us confident that we can just add capabilities without taking away or catastrophically forgetting others.
B
So they're just sort of like stacking more parameters.
A
Just stackable.
B
So that's very cool.
A
Yeah, swappable, stackable. It's like a fatter version of LoRAs that is not really that model-specific. I would say Apple and Google are pursuing this for their on-device stuff, is my sense.
B
Is this open source, Gemma?
A
Yeah, for a given definition of open source, which is like, we released the weights on Hugging Face. Here you go.
B
Oh, that sounds like open source to me. Oh yeah, I guess it's open weights, but not the data.
A
Not the data, not the code.
B
Not the code. Yeah, right.
A
Yeah. I would say that this is quite SOTA in terms of efficient models. Maybe SmolLM, also from Hugging Face, would also be in that category. There's not that many people working on very efficient models.
B
Yeah, this is a very deeply related question and something that really interests me, which is: what is the limit of a 100 million parameter model? If you imagine 100 years from now, when maybe our computers are gelatinous blobs and we all communicate through telepathy, will we have hundred million parameter models that are at the level of today's o3-pro or whatever? And if so, how would that even be the case? Based on scaling laws, do we have special data? Do we come up with a brilliant new training scheme or some type of magical architecture? I really don't know. Or maybe we really are at the plateau already? I don't know.
A
It seems like when we are doing things like calling a 27B model small, which is what Mistral is doing, we've plateaued a little bit in terms of what we can do to compress things. I have a fun theory that this is where we mix quantum computing with models. You have to change what a parameter means. We have to search through very high dimensional space and resolve things much quicker than we can with conventional compute. That would be my pie in the sky thing.
B
I said 100 years. That's very reasonable to me.
A
Throw quantum at it.
B
Yeah, I'd probably have to get a second PhD to know what's going on there. I think that we should establish the definition of a small model as being a model that a grad student can run inference on in reasonable time on a single GPU, which is probably like 7B maybe. I don't think 27 is small under any reasonable... Is it MoE, Mistral?
A
I don't think so. I think their stuff is default dense. Don't quote me on that. This is coming off of just a lot of pre-trained data that is potentially collided. Okay, there's two more papers that we wanted to cover and then we can sort of wrap it. You had the Approximating Language Model Training Data paper. I think this one is also a little bit newer. How does this rank in terms of your overall work?
B
Yeah, let's return to the information theory question. So maybe we'll skip over the contextual embeddings in the interest of time, but we'll group those papers. Great paper. Hopefully people start training with that technique. It's kind of a free lunch. Those questions are all about information and model activations. Like, how much can we recover from this given vector? Or what data does this vector represent? Or what computation does this vector represent? And if you want to taxonomize, there's really two types of, whatever you call it, dense information storage mechanisms. One of them is activations, or embeddings, which we were discussing already. And then the other is weights, which are the things that are used to perform the computation but are not the computation itself. And so we now have two papers in this direction of what is stored in the weights. The first one is about language model capacity, which is called How Much Can Language Models Memorize? or How Much Do Language Models Memorize? I never remember which one we settled on. And then the other one is called Approximating Language Model Training Data from Weights. The first one, I think, has a lot of deep messages about how language models store information and how they work in general. The second one is like a proof of concept of maybe a longer-term research project. Let's start with the capacity stuff, if that's good with you.
A
Do I have the paper for that? I don't know.
B
You know, we can return to the question you asked me, which is something like, why do we care? Or what is this useful for? And I don't know if I have a good answer for this. I think this is somewhat profound. It's kind of like in physics, you know, when they try to measure these constants, like gravity. People tried to measure the rate of acceleration of gravity for a long time. Or like those Greek guys, back in the B.C. era, when they were trying to approximate the radius of the earth based on shadows. We're trying to take the GPT architecture, like the main one, and just measure how much information it can store. And we did this through the lens of memorization, which I think we can skip over for the podcast, and we'll just talk about information storage in weights. These curves to me are pretty crazy. Again, maybe it's the sort of discoverer's folly or something, where I'm like, oh, this didn't exist before, so it seems so cool. But then you're saying it seems somewhat obvious.
A
No, no, no, don't let me take that away from you. Again, I was independently asking how come there aren't enough people exploring LLMs from information theory, and then you come along and your embeddings work becomes an information theory exploration. And I'm like, suddenly I'm very aligned with exploring this, promoting this, and encouraging more people to figure it out. Because that's ultimately how we figure out this whole compression issue and what Andrej wants, which is the cognitive core, the most efficient model for the most capabilities. That is an information theory question.
B
Totally agree with that. We could start here. So transformers that are trained in 32-bit precision, we approximate, can store about 3.6 bits of information, to maybe 3.9 bits, somewhere in there, per parameter. And why is this? I mean, from some perspective this is quite bad. If you have 32 bits available and you can only use three to four of them, it's like, "just store 32, bro." Yeah, yeah. Then, you know, you could build your own AI lab if you can make these models that much more efficient. I don't know how they're implementing this mechanism, or where the bottlenecks come from, or even, now that we know this, what it's necessarily useful for. I guess the tools that would be interesting to me are knowing, given a data set, if you could predetermine the exact model size and maybe architectural properties required to get a certain level of performance. That would be really neat. And we don't even know how to do that. We don't even know what the difference is between doing LoRA training, which trains less than 1% of the parameters, and full fine-tuning, which trains all the parameters. We don't even really understand the difference there. So I think this is maybe a baby step in that direction. But there's a lot of unknown ahead of us.
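The arithmetic behind "3.6 bits out of 32" can be written down directly. This is a minimal sketch using only the numbers quoted in the conversation (3.6 to 3.9 bits per parameter, 32-bit storage). The one-billion-parameter model and the `memorized_bits` plateau function are illustrative assumptions. In particular, the hard `min` is an idealization of the plateau described earlier, not the paper's fitted curve.

```python
BITS_PER_PARAM_LOW = 3.6   # quoted estimate, lower end
BITS_PER_PARAM_HIGH = 3.9  # quoted estimate, upper end
STORAGE_BITS = 32          # bits used to store each parameter

def capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM_LOW) -> float:
    """Estimated total information a model of this size can memorize, in bits."""
    return num_params * bits_per_param

def utilization(bits_per_param: float = BITS_PER_PARAM_LOW) -> float:
    """Fraction of the stored bits that end up holding memorized information."""
    return bits_per_param / STORAGE_BITS

def memorized_bits(dataset_bits: float, num_params: int) -> float:
    """Idealized plateau (assumption): memorize everything until capacity runs out."""
    return min(dataset_bits, capacity_bits(num_params))

one_billion = 1_000_000_000                                # hypothetical model size
capacity_megabytes = capacity_bits(one_billion) / 8 / 1e6  # bits -> bytes -> MB
used_fraction = utilization()
plateau = memorized_bits(dataset_bits=1e12, num_params=one_billion)
```

Under these assumptions a one-billion-parameter model stores on the order of 450 MB of memorized information while occupying 4 GB of weights, a utilization of roughly 11 percent.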
A
Okay, do you think this is a hard limit? Do you think someone can come up with a better algorithm or a better architecture and then sort of just change the slope?
B
There are two axes here. One is the ability of the model to store data. And I think we can definitely improve that. I think maybe if we tested this with the Llama architecture, which is sort of like a GPT plus plus architecture, I would guess that it can store data better, just because the numerical flow is a little bit better, and the nonlinearities are maybe a little bit more suitable to training. That will probably raise the bound a little bit. And then the second axis is that our measurement tools are just not that good. This is, you know, me, I'm a grad student, I'm running all these hyperparameter sweeps, and that's where we draw conclusions from. But even that being said, there are probably ways to measure this better, but all that would do is push the number up. So it's possible there is a way to store five bits per parameter. If you have a better optimization technique, or if you were a super genius and you could just perfectly set the weights to store the data, then maybe you can do better. And this 3.6 bits per parameter is just what we can reach through optimization. But I would be happy if someone came along with a much better measurement tool. This is just the first measurement. I would guess that in the future people will look back and say this is somewhat off in one direction or another, for whatever reason. And that's just how science goes. And I have no problem with it.
A
What we do is we call this the Morris constant, 3.6, right? And then we set a challenge, like a leaderboard, of like, beat this, right? And let people go.
B
Yeah, that assumes that we know the true constant ahead of time and we can measure the error rate.
A
It's doable. You laid it out here.
B
Yeah, yeah, yeah, yeah, yeah, that makes sense.
A
One minor doubt I have is that the goal actually isn't memorization, it's generalization. Right. The best memorizer model may not be the best generalizer model. And incentivizing people to max this number might actually just be fruitless in terms of actual intelligence. You just get the best actual compressor. You're just going to get gzip.
B
That's totally true. And there's this pattern in research time after time. It's like someone poses a question and then people answer it over and over and over again. But it's, it's often much more fruitful to just ask a new question. Maybe it just doesn't matter how much GPT models can store and you should just like work on something else.
A
We'll figure that out. Did you want to dwell on this slide at all?
B
Yeah, let's just talk about it real quick. Definitely not the algorithm itself.
A
By the way, what are your tools for doing these kinds of charts and these kinds of diagrams? I'm just kind of curious about the tools behind the scenes.
B
I think visuals have definitely been a fun hobby of mine during grad school. This one, actually, Oscar, my co-author, made. Maybe I gave some prompting, but he made it. I think the last few papers have all been in diagrams, Google diagrams. I was using Figma for a while, and Illustrator. I think Illustrator actually is the best tool.
A
Oh, did you know the Transformer diagram was made in Adobe Illustrator?
B
Oh yeah, yeah, I did know that actually. Yeah. Because that's the only way you can get arrows that sort of like curve like that and they have good shadows. Diagrams is like the least robust, but it's the most accessible. And honestly, if you, if you're good, you can make pretty good stuff. Excalidraw is nice too. If it's not going in a paper.
A
Yeah, it's just too rough for a paper. But you need something professional looking. You know, it helps. Like if you're going to publish your work, you need to make it look, look nice and professional and official.
B
Right.
A
So this is what it is.
B
Yeah, yeah. And I think there's something worthwhile about saying, okay, if I'm going to put my name behind this, I want to spend time making all the references perfect, all the diagrams professional, all the captions correct. I think it's important to put that level of detail into your work. That's a little cheugy, but let's finish this off. So okay, we're talking about bits, information theory, what information is stored in embeddings. We were talking about language model capacity. I think a much more practical question, and maybe this is more analogous to the vector database hacking, embedding threat model we discussed, is: if you have access to a set of model weights, what can you learn about the data? So like you were just mentioning, Gemma 3n came out yesterday, and you can download it, and it takes up a certain amount of space on disk, and it was trained on some data, but we have no insight into what the data was. I mean, it's probably English, there's probably some distribution of web text, I guess there's a lot of code. And we seem to have a lot of information about the model, right? You have this file, and there's many ones and zeros, which mean something. It's kind of like a very highly compressed version of the training data. But I would be extremely surprised if they do any type of private training. There are these mechanisms for doing differentially private language model training, or even just anonymization in the pre-training pipeline. I bet they don't do any of that. They just sort of train on the data, and then they kind of know that we don't have the right tools to decrypt the model weights. And so my dream is that we can come up with some way of translating model weights back into text data sets. And so the most recent paper drop is that paper, Approximating Language Model Training Data from Weights. And it turns out to be a really hard problem.
Like, trying to go from model weights to text is really hard. And we do something a lot simpler. Well, there's two ways we make it simpler. The first thing is we assume access to two checkpoints, which I think is probably not the case with Gemma. But in the case of DeepSeek, if you download the 400 billion parameter model weights, it's this giant file, and you can actually get two of them. You can get the base model weights and the fine-tuned model weights. So the way we put this, you have this kind of difference in parameter space telling you what DeepSeek fine-tuned on. And it's very controversial. I mean, there's sort of a geopolitical angle, and definitely at the corporation level people are really interested in the implications of what DeepSeek trained on. And they've released this kind of treasure trove of information about what they trained on, which is the actual model weights. But we have no tool for interpreting or decrypting this weight difference. And so we started with something really simple, which is, instead of even trying to regenerate the training data, we take just a web corpus and try to do selection of training data that looks like the true training data and gives us performance that's as close as possible to the true training data. So there's this complicated method, but it's something like: you look at the data point's gradient and see if it points in the direction, in weight space, of the fine-tune. And then you take, like, the top of the data set. There's some tricks to it, but it's basically just gradient-based selection based on this weight difference. And it seems to be okay. It can get us pretty good training data. So I guess if you actually wanted to use this, it would be like: your competitor releases a base model and a fine-tune, and you're trying to recreate their data set, so you can take this weight difference and take a giant web data set.
Like if I was doing this at a company, I'd probably try to scale it up to trillions of tokens and then select the exact data points that try to produce the model. And it turns out you can train a pretty good model with that. We don't get to quite the performance of the original model, but it does seem to be like trending in that direction.
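A minimal sketch of the gradient-based selection idea described above, assuming a toy logistic-regression model in place of a language model. The function names (`example_gradient`, `select_by_weight_diff`) and the toy data pool are hypothetical, and this is not the paper's actual method, which the speaker says involves a more complicated procedure and some tricks. It only illustrates scoring each candidate example by how well its descent direction lines up with the base-to-fine-tune weight difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def example_gradient(w, x, y):
    """Gradient of the logistic loss on one (x, y) example with respect to w."""
    err = sigmoid(dot(w, x)) - y
    return [err * xi for xi in x]

def select_by_weight_diff(w_base, w_finetuned, pool, k):
    """Return indices of the k pool examples whose gradient step at the base
    weights points most strongly toward the fine-tuned weights."""
    delta = [f - b for f, b in zip(w_finetuned, w_base)]
    scored = []
    for i, (x, y) in enumerate(pool):
        grad = example_gradient(w_base, x, y)
        # A gradient-descent step moves the weights along -grad, so the
        # alignment score is the dot product of -grad with the weight difference.
        scored.append((-dot(grad, delta), i))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy setup (hypothetical): the fine-tune moved the first weight upward.
w_base = [0.0, 0.0]
w_finetuned = [1.0, 0.0]
pool = [
    ([1.0, 0.0], 1),   # training on this pushes the first weight up
    ([-1.0, 0.0], 1),  # training on this pushes the first weight down
    ([0.0, 1.0], 0),   # training on this only touches the second weight
]
selected = select_by_weight_diff(w_base, w_finetuned, pool, k=2)
```

In this toy pool, the example that pushes the first weight up ranks first and the one that pushes it down ranks last, which is the intuition behind selecting data from a web corpus by its alignment with the weight difference.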
A
This is like very creative. I don't know what the use of it exactly is.
B
Yeah. When would you be in this exact situation?
A
Decently often for the open model labs. Even DeepSeek R1 has released an update. Mistral does it pretty frequently. Llama does it frequently. It's not impossible. But I really like the creativity in using quote-unquote synthetic checkpoints to do this, which I don't think I've heard from many other places. So I don't know if you came up with the idea.
B
It's like linear interpolation in weight space.
A
Okay. That's a bunch of the recent work. I wanted to sort of cap things off with the data sets question. Is that a good.
B
You can ask me whatever you want.
A
Well, it's not an ask. It's just like I think this is a very good thesis. I think it's a hot take. I almost invited you to speak based on just this alone. But it was a little bit neat.
B
To talk for the conference.
A
Yes. When I look for conference keynotes, I look for something that has a broad overview that can put the last few years in perspective. Or it's an insight that you can reasonably rely on to last for a while so you can get some mileage out of it. I think a lot of ideas in AI come and go, but things like scaling laws, things that are trend lines, things like "there are no new ideas in AI," those I pay attention to. So maybe you want to recap, like, what's the backstory, if there was one.
B
Yeah, yeah, sure. So the meta backstory is I've sort of started writing on Substack and this is a post that I wrote a few months ago.
A
The highest art form of humanity.
B
Yeah, yeah, yeah. Publishing papers wasn't doing it for me anymore and I moved to Substack. And this is the name of the post: "There are no new ideas in AI, only new data sets." One guy pledged me, but then I found out he was like my former student from a class I was teaching. So I don't think it really counts. It counts. He's a friend.
A
He's your first supporter.
B
A pledge is a pledge, man. I'll take whatever I could get. So the underlying thesis is that whenever... maybe I'll lay out this framework first. So there's this book called The Structure of Scientific Revolutions by Thomas Kuhn that I read near the beginning of my PhD, which suggests that science kind of moves in these cycles where, not very often, there's something he calls a paradigm shift, which you could think of as a zero-to-one innovation where everything changes. And then it's followed by a rapid period of small innovations, a lot of reapplication of previous, pre-paradigm-shift techniques to the new era. And then things sort of slow down as we wait for a new paradigm shift. And I was kind of asking myself what's unique to the paradigm shifts that we've seen in AI? And by the way, to me, AI and language models are somewhat synonymous at this point. Like, at least for the foreseeable future, I'm certain that will change. But basically everything that's pushed the boundary to whatever we have now that resembles intelligence has come from language models. And so those breakthroughs came in a few steps. So I think the idea is also like a meta commentary on the research community, because what everyone wants as a researcher is some kind of cute new method that no one has thought of before that just works on the existing data better than the previous methods. That's, for whatever reason, the kind of most glamorous thing people think you can do as a researcher. Like Mamba: it's like a transformer, but it's more efficient and works better. So that's what a good idea looks like. And I think everyone wants to find something like that. But if you look at what's actually borne out in practice, it's never been like that. I think all of the things that I would consider paradigm shifts in the Kuhnian sense came from a new technique, but trained on new data.
And I think the new data is super, super important. So I wrote it as a series of four paradigm shifts. The first is the emergence of deep neural networks with AlexNet, which I think was like 2010 to 2012 era, where we just started training on ImageNet, which is like a scale no one had ever seen before of millions of images. And then the second thing was transformers and BERT. And this is the "Attention Is All You Need" paper in 2017 and the first GPT in 2018, which is web-scale pre-training. Like no one had ever done that before. No one had ever tried to scrape all the text off the Internet and then tokenize it and feed it into models. Like, it's a crazy idea. And I think we should be honest. I mean, Transformers are incredible and their staying power is never going to cease to amaze me. They're much more optimal than I think anyone ever knew. And I don't know if we'll ever beat them. But the real innovation is web-scale pre-training. And I think we honestly probably could have gotten this with RNNs. I know the scaling laws paper shows that RNNs have worse curves for scaling. But I bet you could have built ChatGPT with a very sophisticated RNN. Like you didn't even need Transformers. What you need is web-scale pre-training. And the third innovation is instruction tuning. And we thought it came with reinforcement learning. But I think the big innovation of instruction tuning is actually the human preference data, which is like gathering positive and negative pairs of what looks good in terms of a chatbot interface. And actually it turns out you can do supervised learning on that too. You can do DPO, which is a form of supervised learning. You don't even need the InstructGPT techniques, you just need the data.
So like, I'm sort of playing devil's advocate here, but I actually think this is true, that if we had the right data sets, we almost could have scaled like 2015 era techniques and gotten something that looks like at least InstructGPT. Reasoning models are a little different. I'm not sure if we could have that with RNNs or not. Like I don't know if I'm in a position to comment on that with certainty. But they do fall into this framework, which is they really did emerge from a new data source. In this case it's something a little different. It's like verification with symbolic systems like math calculators, coding environments, unit tests, things where we can provide numerical feedback to language model outputs. But we built a way to learn that and leverage it to get more intelligent systems. And so whatever the fifth thing is, whether it's video or embodied AI or some kind of crazy innovation on reasoning models, whatever comes next will probably be some type of new data source that we're not using yet.
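To make the DPO remark above concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written as an ordinary supervised loss with no sampling or reward model. The function name `dpo_loss`, its argument names, and the example log-probabilities are illustrative assumptions. In practice each log-probability would be the summed token log-likelihood of a response under the trained policy or a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_*     : log-probability of each response under the policy being trained
    ref_logp_* : log-probability of each response under the frozen reference model
    beta       : strength of the implicit constraint toward the reference model
    """
    # How much more the policy favors the chosen response over the rejected
    # one, relative to how much the reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # loss = -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical numbers: the policy agrees with the reference (no preference learned yet)...
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# ...versus a policy that has shifted probability toward the chosen response.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on fixed preference pairs and model log-probabilities, which is the sense in which the data, rather than the reinforcement learning machinery, carries the signal.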
A
That's a really good thesis. I would say that the researchers I talked to would somewhat disagree. Yeah, obviously this is like a hot take type of thing. And like you already acknowledge that RNNs don't scale to the same extent. Like they operate on the slope of the curve. Whereas I guess the amount of data or the type of data or the core insight just changes the order of magnitude of the x axis that we are mostly working on. But both are important. The way that I think someone put it to me was an improvement on compute or data efficiency is the equivalent of having a whole bunch more data that otherwise would be a lot more expensive to collect. It's likely that the frontier models right now are just a collection of hundreds of these small little experiments that just stack up. You mentioned Muon in your post, which seems to be the Adam killer. Curiously enough, still none of the big models use Muon. But vibes are good.
B
Yeah. And the value of building better optimizers is really incredible. It's just a free lunch. You can just sort of plug in a slightly better training mechanism and then you save a ton of compute and a ton of training time. That's hugely valuable.
A
I think this is cool because I think it puts us in a mode of if you were ever to ask what comes after reasoning, it has to be something on the order of this. And most ideas are not. Most ideas are not. And so this is cool in a sense of it just jolts you out of incremental thinking into what really is missing for the next paradigm. And I don't have an answer. Do you have one? Do you have, do you have candidates?
B
Oh, I really haven't even considered that too much. I guess like scaling reasoning, you gotta.
A
Do the autocomplete for step five. I mean you got us all the way there and you're like, you know, you gotta show us the way.
B
Now we can say it's an exercise left to the reader, but I mean, the reality is like predicting the future is too damn hard. You know, like maybe it'll be obvious to me in hindsight in five years, but sitting here today, I really can't, can't derive from first principles what the next wave of innovation will come from.
A
Yeah, I think we have a few years left. Each of these phases lasted for a few years. Reasoning just started last year, kind of. We got some juice on this one. Cool. I think that is a broad overview. We've gone way over time, but I really enjoyed this. I guess my parting question for you is kind of a meta one. So I'm not an academic, I'm kind of self-taught. I just read a bunch of papers and I talk to people all day as part of the podcast. How do I rate in terms of my questions? As in, could I pass as a grad student, or what's my distribution? Maybe I was more industry oriented than academics.
B
I think you got to realize that the only person that's an expert in your area as a grad student is you. And even eventually your advisor defers to you. So for a small set of questions that fall within your very niche expertise. So like, I think you're clearly like a very good generalist and have like a huge amount of background on these topics and to the point where I would say you're passing the, the grad student Turing test. And I think if you went to a talk like people would just assume you have some weird research area of your own that they don't understand.
A
You know, my research area is AI engineering. Like, I'm totally kind of making it up as I go, but. No, this is super helpful. Okay, well, that's about all we prepared. All the best in your search. All the best in your PhD, I assume. Apparently the current PhD meta is you do a bunch of small papers, you staple them together and find an overall theme, you do a defense, and that's it. That's the journey. Which is kind of cool. I would love to do that. I'm too old to do it, but it's cool.
B
Yeah, yeah. It's a great thing to do at any age.
A
Well, it's better to do a Substack. Right. And then you have people subscribing and pledging along the way and getting validation and like, yeah, that's better than a PhD. Substack. That's the title of the episode. Like, Substack better than PhD. But no, yeah. Thanks for your time. This is really great. Where can people find you? What are you looking for, really?
B
I'm online. You know, you can follow my Substack and Twitter. I tweet pretty consistently. And you're putting papers out. I guess, like, the most meaningful thing, to be honest, is to engage with the research and send me an email if you really care. That's amazing. And, like, I love having those kinds of discussions. And you mean, like, what I'm looking for in a job or out of life?
A
Your research direction? Like, what interests you over anything else that's like, if. If there's someone out there looking who has a problem and is looking for someone to help them on it, like, you are the guy for.
B
Oh, yeah. Hopefully, if you listen this long. Like, I think, like, my research is a lot more well connected than some people's PhD research in that it all falls into, like, a very small manifold of, like, all possible problems. And so if you. If you want to work on anything within that space or that's sort of like, adjacent to the problems that we discussed in terms of, like, language model, maybe not even language level, but model weight and activation information, I think anything that can be described as that is very interesting to me and I would love to talk.
A
Awesome. Well, we'll put your contact info in the show notes and thanks for your time.
B
Thank you.
Release Date: July 2, 2025
Host: Swyx (Latent.Space)
Guest: Jack Morris (PhD student, Cornell Tech; researcher in AI, NLP, and information theory)
This episode dives deep into the intersection of information theory and language models (LLMs), with a focus on Jack Morris' ground-breaking research. The conversation explores the evolution of AI research in academia versus industry, foundational questions about model capacity, embedding inversion, and the role of data in AI breakthroughs. Jack shares candid insights from his own PhD journey, practical advice for aspiring researchers, and speculates on the next paradigm shifts in AI.
"There was kind of two years where everyone in academia was working on smaller models and none of it really mattered."
— Jack Morris (07:46)
"We should measure information with computational power as a constraint... [V-information] measures how much information is extractable from a given file or code."
— Jack Morris (15:50)
"It was so rewarding ... We had this number that was like 30 for months. ... Then we ended up getting the number to like 97."
— Jack Morris, on his embedding inversion breakthrough (30:32)
"As the models get better by scaling data and scaling model size, they're sort of converging to learn the exact same thing."
— Jack Morris, on the Platonic Representation Hypothesis (32:38)
"Transformers that are trained in 32 bit precision, we approximate, can store about 3, 6 bits of information to maybe 3.9 bits somewhere in there per parameter."
— Jack Morris (55:58)
"The best memorizer model may not be the best generalizer model... You're just going to get gzip."
— Swyx (59:16)
"There are no new ideas in AI, only new data sets."
— Jack Morris (67:03, thesis of the episode)
This episode is a deep exploration, blending hard research with practical implications and field-wide perspective. Jack’s work on information theory for language models foreshadows potential future breakthroughs in both model understanding and AI system design.