Jeremy Harris (106:56)
little bit, your memory requirements are going to grow as the text gets longer, but in a more controlled way, and in a way that you can directly control and kind of customize. And so there's this interesting argument in the paper that maybe this is a kind of unification of the transformer and recurrence approaches. I think that's a little bit generous to say, but it certainly is interesting. It's a totally architecture-agnostic plug-in for any RNN, so you can use an RNN that's already been trained and then just slap this on, which is a kind of cool capability. And yeah, this does happen at every layer, right? Your RNN still has many layers, and at every layer you're maintaining a kind of memory stash, that little vector we talked about, so you are running this algorithm at every stage.

They look at a couple of different ways of doing this. The first is, I alluded to it, this checkpoint mode where you start the RNN, you start scanning your text, and you take snapshots every 100 pages or whatever. But the other approach is to use what they call independent compressor mode. And so here, every hundred pages, instead of continuing to let that memory vector evolve, what they're going to do is actually reset it and get it to start from scratch instead of carrying, if you will, the baggage from the previous hundred pages. I mean, you could view it as baggage, or you could view it as context that's actually kind of helpful, and that's why they tried both approaches.

So the challenge is, at the end of the day, when you have this kind of Frankenstein stitched-together set of memory vectors, you now have to decide, okay, how am I going to aggregate these together to get a single output? And they try a bunch of different strategies. There's this thing called residual memory, where they just sum up all the memories together with the current one at each step. Very simple, but it treats all past segments equally, regardless of how relevant they are, which is a challenge, right? Because the whole point of attention is that it allows you to focus more, well, attention on one part of the text or another. With this approach, if you're just going to glue all these together and kind of average the values, you're not really caring whether this hundred pages of the book, for example, is more relevant to the question I'm asking versus, say, the first hundred pages. They also try other techniques to do this. I won't go into too much detail, but it is actually really interesting to look at how they're trying to solve this problem.

I have a couple of gripes with this. One of them is, they talk about, well, I talked about 100 pages, right? Maybe every 100 pages you take a snapshot. They go, well, if you take a snapshot at every single token, in the L equals 1 limit (a segment length of one), you recover a transformer. I mean, they don't quite say that, but that's kind of the framing here. Setting aside that it's still fundamentally an RNN update rule versus an attention update rule, I don't know that that's the case. They do show that this approach is expressive enough to get a transformer back in some special cases, but it's not a generalizable fact about this. So to say that it's fully a unification of recurrence and transformers, I think, is a bit of a stretch, but it's a step on the way and quite interesting.
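To make those two modes and the simplest aggregation strategy concrete, here's a rough sketch; the segment length, the update rule, and the function names are illustrative assumptions on my part, not the paper's actual equations.

```python
import numpy as np

# Rough sketch of the two compression modes discussed above. The fixed
# segment length, the tanh update, and all names are illustrative
# assumptions, not the paper's method.

SEGMENT_LEN = 100  # "every hundred pages"


def rnn_update(memory, token_embedding):
    # Stand-in for whatever recurrent update the underlying RNN layer uses.
    return np.tanh(memory + token_embedding)


def checkpoint_mode(tokens, d):
    """One evolving memory; snapshot it at the end of every segment."""
    memory = np.zeros(d)
    snapshots = []
    for i, tok in enumerate(tokens, start=1):
        memory = rnn_update(memory, tok)
        if i % SEGMENT_LEN == 0:
            snapshots.append(memory.copy())  # carries "baggage"/context forward
    return snapshots, memory


def independent_compressor_mode(tokens, d):
    """Reset the memory at each segment boundary and compress from scratch."""
    memory = np.zeros(d)
    snapshots = []
    for i, tok in enumerate(tokens, start=1):
        memory = rnn_update(memory, tok)
        if i % SEGMENT_LEN == 0:
            snapshots.append(memory.copy())
            memory = np.zeros(d)  # start fresh, no carry-over from earlier segments
    return snapshots, memory


def residual_memory(snapshots, current_memory):
    """Simplest aggregation strategy mentioned: sum everything, which weights
    all past segments equally regardless of relevance."""
    return current_memory + sum(snapshots)


# Toy usage on a random "document"
tokens = [np.random.randn(64) for _ in range(350)]
snaps, mem = checkpoint_mode(tokens, d=64)
out = residual_memory(snaps, mem)
```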
So they do show improvements over RNN baseline models, right, relative to when you don't do the snapshot strategy, and across many benchmarks it's very clear that it is adding value. The key question is how it compares to transformers, on recall tasks especially: it's competitive, but it's not superior. And it is better from an efficiency standpoint than transformers at long context lengths, which you would expect, because that's where you start to see issues with transformers just from the sheer size of the N-squared cost over all those tokens. Anyway, they look at a whole bunch of different parameter scales and token scales, but they don't show the kind of loss-versus-compute scaling law curves that would tell you whether this approach closes the gap with transformers as you scale up. That seems like a really big gap; I would love to see that. The question every time you see a paper, and you've heard Andre and me say this a lot, is not how impressive is this paper; it's, does this paper suggest that the result can operate at scale better than the alternative? And without clear scaling curves, it's really hard to tell. It's a relatively small scale being experimented on here: 1.3-billion-parameter models and a 100-billion-token budget, which is fairly modest when you look at some of the multi-trillion-token corpora out there, the Llama series, 7-billion-parameter models everywhere. So this is a relatively small-scale test, which again is why I'd really love to see scaling laws here. And you still see transformers win on recall tasks, which is not surprising. I mean, transformers crush it on recall; they're literally looking at everything in memory at the same time. So the framing here really is more about closing the gap. But again, given that the whole point is to close the gap, you need to tell a scaling story, so I really wish that had been included. And there are also no inference-time scaling results. So those are some things, but this is a really interesting starting point and something that should be looked at in more detail.

All right, so next up we have Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking, which is really hard to say three times fast. This is a paper out of Together AI, and we've covered a whole bunch of Together AI papers. As a reminder, it's one of the kind of proliferating number of organizations focused on decentralized AI training, the sort of torrenting version of the future of AI, where you should be able to train AI models a little bit on everybody's local laptop or whatever, in the extreme case. And so you see, out of a lot of these groups, and especially Together, a lot of hardware-level innovation. It reminds me of some of the stuff you'll see out of DeepSeek: that kind of focus on how we can get these pieces of hardware to work together with our models in a very co-optimized way. So these tend to be some of the most interesting papers to read.

One of the core problems this paper focuses on is that when you're training, especially with agents that generate huge amounts of context, their chains of thought can be truly massive. You have essentially a situation where a single GPU cannot necessarily hold the entire chain of thought in its head at the same time, right? There's just so much material that the KV cache explodes on you, the KV cache being that part of attention that holds all the context, the sort of numerical representation of the context.
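Just to give a sense of the scale involved, here's a back-of-the-envelope KV-cache calculation; the model shape is an illustrative assumption (roughly a 7B-class dense decoder in fp16), not numbers from the paper.

```python
# Back-of-the-envelope KV-cache size. The model shape below is an
# illustrative assumption (roughly a 7B-class decoder with full
# multi-head attention in fp16), not taken from the paper.
layers = 32
kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16/bf16


def kv_cache_bytes(seq_len, batch=1):
    # 2x for keys and values, stored for every layer and every KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


for seq_len in (8_000, 128_000, 1_000_000):
    print(f"{seq_len:>9} tokens -> {kv_cache_bytes(seq_len) / 1e9:.1f} GB")

# Roughly 0.5 MB per token: ~4 GB at 8k tokens, ~67 GB at 128k, ~524 GB at 1M.
# Long agentic chains of thought quickly outgrow a single GPU's memory.
```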
And so you need to find a way to share that context across devices. Now, this is going to be a new kind of parallelism, right, and an increasingly important one. We've talked in the past on the show about a whole bunch of different parallelisms. There's data parallelism, where I have a chunk of data that I want to send to one GPU and a chunk of data I want to send to another and to another, or to one server rack or to another server rack. There's also pipeline parallelism, where you're going to send a few layers of the model to different GPUs or different racks. And then there's tensor parallelism, where, okay, now we can even cut layers in two or in three and send chunks of layers to individual GPUs. And you typically do all of these at the same time, so multi-dimensional parallelism: pipeline, tensor, and data parallelism all at once.

This is another kind of parallelism, where you can also parallelize the context itself, so this big chunk of text, whether it's the prompt, but more typically the response, and you parallelize that across devices. And so this is for cases where a single GPU just can't hold the full attention matrix in memory. What they're going to do is split the context along the sequence dimension, so essentially the first part of the context goes to one GPU, the second to another, and so on.

And it's really important to be able to orchestrate and coordinate all this activity. Every time you add a new kind of parallelism, especially when you're doing reinforcement learning, and I'll explain why in just a second, you are introducing more orchestration headaches. You're introducing more opportunities for GPU 1 to finish its job way before GPU 2 and then just be sitting idle, and for GPU 3 to be too fast or too slow. And so this ends up leaving you with massive gaps, and those gaps get resolved through orchestration. And the reason this is so important with RL in particular is that you'll often have a situation where you take a model and then you have to send that model out to a bunch of nodes, a bunch of GPUs or whatever, to generate rollouts. And those rollouts take time, and each one takes a different amount of time.
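Before getting further into the RL angle, just to picture what splitting along the sequence dimension looks like, here's a tiny conceptual sketch of a Ulysses-style sequence-to-head exchange; the shapes and the exact exchange pattern are my own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Conceptual sketch of context (sequence) parallelism, not the paper's actual
# implementation: shard a long sequence across devices, then do a
# Ulysses-style exchange so each "device" sees the full sequence for only a
# subset of attention heads. All sizes here are toy values.

num_devices = 4
seq_len, num_heads, head_dim = 8, 4, 16

# Full activations for one layer: (seq_len, num_heads, head_dim)
x = np.random.randn(seq_len, num_heads, head_dim)

# 1) Sequence split: device i holds its slice of tokens for ALL heads.
seq_shards = np.split(x, num_devices, axis=0)

# 2) All-to-all head exchange: afterwards, device i holds ALL tokens but only
#    its subset of heads, so it can run full attention for those heads without
#    any single device materializing the whole thing for every head.
heads_per_dev = num_heads // num_devices
head_shards = [
    np.concatenate(
        [shard[:, i * heads_per_dev:(i + 1) * heads_per_dev, :] for shard in seq_shards],
        axis=0,
    )
    for i in range(num_devices)
]

assert head_shards[0].shape == (seq_len, heads_per_dev, head_dim)
```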