Omar Khattab (47:38)
So everything we talked about today, I do almost nothing on anymore, because this is work we did three years ago and I'm just out there telling people about it. We're not really changing these abstractions. What we actually do is ask the following set of questions. All right, someone wrote a program, and we assume they did a reasonable job describing what they want. Maybe that means they wrote the control flow, they have the signatures, and they have some data. Those are the three pieces you might want to have, or maybe they have some but not all of them. How do we actually do a good job of optimizing this?

It's a really interesting progression to see how we got from the very early optimizers in 2022 to the latest ones. The very early ones had to work with models that basically didn't work right: they had essentially no instruction-following capability and were hit-and-miss on their tasks. So what we did looks like what the reinforcement learning people do on LLMs. You take the program and you bootstrap examples, which is just another way of saying you run the program a lot of times, maybe with high temperature, you see which runs actually work, and you keep traces of all of these over time. Those traces, which are generated by the model, can then become few-shot examples. If you just do that, sometimes it improves a lot and sometimes it gets a lot worse, so you do some kind of discrete search on top to find which ones actually improve things on average. That was when models were really bad.

As models have gotten better, we've moved basically all the way to reflective prompt optimization methods, where you actually go to the model and say: here is my program, here are all the pieces, here's what this language means, here are the initial instructions I came up with from the declarative signatures, here are some rollouts generated from this program, and here's how well they perform. Let's debug this; let's iterate on the system. Obviously there's a lot of scaffolding to make sure that search is actually a formal algorithm that is going to lead to improvement, but increasingly more and more of it is carried out by the models.

One thing we also do a lot of is ask about conventional policy-gradient reinforcement learning methods like GRPO. There's nothing about them that can't be applied to a DSPy program, because the DSPy program says nothing about how the optimization should happen. So actually, for a very long time, since February of 2023, you could run offline RL, and since May of 2025 you can run online RL, meaning GRPO, on any DSPy program that you write. People think DSPy is limited to prompt optimization, but I think the only notion of a prompt that is fundamental to DSPy is that natural language is an irreducible part of your program. That prompt is human-facing: it's how you say what you want. How it gets turned into the final artifact may well use reinforcement learning with gradients, or natural-language learning. So we spend a lot of time on optimization.
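To make that workflow concrete, here is a minimal sketch in DSPy, assuming a toy question-answering task. The task, data, and metric are placeholders I'm inventing for illustration; only the DSPy calls themselves are real.

```python
import dspy

# Any supported model; this model name is just an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1) Signatures: declare what you want, not how to prompt for it.
class AnswerQuestion(dspy.Signature):
    """Answer the question using the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 2) Control flow: here, a single chain-of-thought step.
program = dspy.ChainOfThought(AnswerQuestion)

# 3) Data and a metric (placeholders for your own).
trainset = [
    dspy.Example(context="...", question="...", answer="...")
    .with_inputs("context", "question"),
    # ...more examples...
]

def metric(example, prediction, trace=None):
    return prediction.answer.strip() == example.answer.strip()

# Bootstrapping: run the program over the data, keep traces whose
# outputs score well under the metric, then do a discrete search over
# which of those traces to use as few-shot demos.
optimizer = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    num_candidate_programs=8,
)
compiled_program = optimizer.compile(program, trainset=trainset)

# Reflective prompt optimizers, e.g. dspy.MIPROv2 or dspy.GEPA,
# expose a similar compile() interface over the same program.
```

The point of the shared interface is exactly what's described above: the program says nothing about how optimization should happen, so bootstrapped few-shot search, reflective prompt optimization, or RL-based methods can all be swapped in behind `compile()`.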
We also spend a lot of time on inference techniques. Say you just declared a signature that processes lists of books. Well, guess what? No model has a context long enough to work with lists of books. So last week, my PhD student Alex and I released this idea called recursive language models, which takes any model that is good enough and figures out a structure in which it can handle, or scale to, essentially unbounded lengths of context. We were able to push it to 10 million tokens and see essentially no degradation. The reason we build these types of algorithms is that we really want to back your signatures with whatever it takes to bridge the gap between the model's current capability limit and the intent you specified.

The last thing we think a lot about is this: we've made the argument conceptually, and tried to demonstrate it empirically, that you need these three irreducible pieces, signatures in natural language, structured control flow, and data, to fully specify your intent, at least ergonomically enough. The question, though, is that this is a very large space of programming. You have a concrete problem; how do you map it into these pieces, knowing that maybe you need all of them? We spend a lot of time on this, and it's why this is a big open-source project. We want to see what people actually build, and learn from that what AI software engineering practices we should encourage and support.

So these are the types of questions we think about. One reason this has to have the structure of an open-source project, given this large fuzzy space, is that I don't want to be the only group, or a small number of teams, working on any of these pieces. It's a space where the more academics and researchers work on optimizers, all programs benefit. The more people work on modules, all programs benefit. And the more people build better models, especially programmable models, whatever that might mean in the future, meaning models that understand they're going to get used in this structure, everyone benefits. It reminds me of the way deep learning really took off: some people iterated on the architectures, some people iterated on the optimizers, and you got things like Adam and other methods. That is what we're trying to push a community towards.
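To give a feel for what "backing a signature" can mean, here is a hypothetical sketch in DSPy. The signature states the intent over a whole list of books, and a module decomposes the work so a context-limited model can serve it. This is a simple map-reduce illustration under assumptions made up for the example; it is not the recursive-language-models algorithm itself, and all names here are placeholders.

```python
import dspy

# Hypothetical signature: intent is declared over a whole list of books,
# even though no model's context window can fit the raw input.
class ThemesAcrossBooks(dspy.Signature):
    """Identify the common themes across a list of books."""
    books: list[str] = dspy.InputField()
    themes: str = dspy.OutputField()

# One simple way to back such a signature: a map-reduce style module.
# Illustration only; NOT the recursive-language-models algorithm.
class ScalableThemes(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("book_text -> summary")
        self.combine = dspy.ChainOfThought("summaries: list[str] -> themes")

    def forward(self, books: list[str]):
        # Map: each book is processed within the model's context limit.
        summaries = [self.summarize(book_text=b).summary for b in books]
        # Reduce: merge the partial results. A real system could recurse
        # here if the combined summaries themselves exceed the context.
        return self.combine(summaries=summaries)
```

The design point is the separation: the signature captures the intent, and the module behind it is free to use whatever inference structure it takes, recursion included, to bridge the gap between model capability and that intent.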