Pash (40:07)
In a lot of ways it's us and the forks. We were kind of there originally, when we were the only ones with this philosophy of keeping things simple, keeping it down to just the model, letting the model do everything, not cutting costs on inference or trying to make money off of it, going context heavy, reading files into context very aggressively. And going back to Claude Code, it was really nice to see that when they came out, they validated our whole philosophy of keeping things as simple as possible.

That ties into the whole RAG thing. RAG was this early thing: in like 2022 you started getting these vector database companies, and context windows were very small. It got pitched as a way to give your AI infinite memory. It's not really that, but that was the marketing sold to the venture backers investing in all these companies, and it became a narrative that really stuck around. Even now we get potential enterprise prospects going through the procurement process, and it's almost like they're going down a checklist: hey, do you guys do indexing of the codebase and RAG? And I'm like, well, why do you want that? I think Boris said it very well on this exact podcast: we tried RAG, and it doesn't really work very well, especially for coding. The way RAG works, you have to chunk all the files across your entire repository, chop them up into small little pieces, throw them into this high-dimensional vector space, and then pull out these random chunks when you're searching for relevant code snippets. Fundamentally it's so schizo. I think it actually distracts the model, and you get worse performance than just doing what a senior software engineer does when they're first introduced to a new repository: you look at the folder structure, you look through the files, you see this file imports from this other file, let's go take a look at that. You agentically explore the repository. We found that works so much better.

And there are similar cases where simplicity always wins, this bitter lesson, and fast apply is another example. Cursor came out with fast apply, what they called instant apply, back in July of 2024. The idea was that models at the time were not very good at editing files. The way editing files works in the context of an agent is you have a search block and a replace block: the search block has to match exactly the text you're trying to replace, and the replace block just swaps it out. And at the time, whatever GPT model they were using under the hood wasn't very good at formulating those search blocks perfectly, so edits would fail pretty often.
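To make that concrete, here's a minimal sketch of what a search/replace edit looks like and how it fails; the function is illustrative, not any particular agent's implementation:

```python
def apply_search_replace(file_text: str, search: str, replace: str) -> str:
    """Apply one search/replace edit; reject it if the search block doesn't match exactly."""
    if search not in file_text:
        # This is the "diff edit failure" case: the model's search block has to
        # reproduce the existing code verbatim, whitespace included.
        raise ValueError("diff edit failed: search block not found in file")
    # Swap out only the first occurrence, which is the span the edit targeted.
    return file_text.replace(search, replace, 1)

original = "def greet(name):\n    print('hi', name)\n"
edited = apply_search_replace(
    original,
    search="    print('hi', name)\n",
    replace="    print(f'hello, {name}')\n",
)
print(edited)
```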
So they came up with this clever workaround: fine-tune a fast apply model. They let the frontier models of the time be vague, let them output those lazy code snippets we're all very familiar with, the "rest of the file here," "rest of the imports here" kind of thing, and then fed that into a fine-tuned fast apply model, probably a Qwen 7B or something quantized, a very small, dinky little model. They fed the lazy code snippet into this smaller model, and the smaller model was fine-tuned to output the entire file with the code changes applied.

One of the founders of Aider put it really well in some very early GitHub discussions: now instead of worrying about one model messing things up, you have to worry about two models messing things up. And what's worse, the model you're handing your production code to, this fast apply model, is a tiny model. Its reasoning is not very good, and its maximum output might be 8,000 tokens, 16,000 tokens, maybe 32,000 tokens now that they're training bigger ones. A lot of coding files are longer than that. We have a file in our repository that's about 42,000 tokens, which is longer than the maximum output length of some of these small fast apply models. So what do you do then? You have to build workarounds around that, all this infrastructure to pass things off. And then it makes mistakes, very subtle mistakes, where it looks like it's working but it's not actually what the original frontier model suggested. It's slightly different, and it introduces all these subtle bugs into your code.

What we're starting to see is that as AI gets better, the application layer shrinks. You're not going to need all these clever workarounds, and you're not going to have to maintain these systems. So it's really liberating to not be bogged down with RAG or with fast apply and to just focus on the core agentic loop and minimizing diff edit failures. In our own internal benchmarks, Claude Sonnet 4 recently hit a sub-5%, actually around 4%, diff edit failure rate. When fast apply came out, that number was way higher, in the 20s and 30s. Now we're down to 4%. And in six months? It's going to zero. As we speak, it's going toward zero every day.

I was actually talking with the founders of some of the companies whose whole bread and butter is fine-tuning these fast apply models, like Relace and Morph. They were trying to work with us, and I had a very candid conversation with them: there was a window of time where fast apply was relevant, and Cursor opened that window back in July. How much time do you think is left until it's no longer relevant? Do you think it's an infinite window? They were like, no, it's definitely finite; this era of fast apply models is definitely coming to an end. And I asked, well, how long do you think? Maybe three months, maybe less.

I still think there are some cases where RAG is useful. If you have a lot of human-readable documents, a large knowledge base where you don't really care about the inherent logic within them, sure, index it, chunk it, do retrieval on it. Or fast apply: if your organization forces you onto a very small model that's not very good at search and replace, like a DeepSeek model or something, maybe use a fast apply model.
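To pin down the output-length constraint described above, here's a rough sketch of the two-stage fast apply flow; the snippet format, the 16,000-token ceiling, and the helper names are illustrative assumptions, not any vendor's actual pipeline or numbers:

```python
# Sketch of the two-stage fast apply flow described above. Nothing here is a
# real vendor pipeline; the numbers and names are illustrative assumptions.

FAST_APPLY_MAX_OUTPUT_TOKENS = 16_000  # assumed ceiling for a small, quantized apply model

# Step 1: the frontier model is allowed to be "lazy" and elide unchanged code.
LAZY_SNIPPET = """\
import json
# ... rest of the imports here ...

def load_config(path):
    with open(path) as f:
        return json.load(f)
# ... rest of the file stays the same ...
"""

def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return len(text) // 4

def can_fast_apply(full_file_text: str) -> bool:
    # Step 2: the small apply model has to re-emit the ENTIRE file with the
    # edit merged in, so the whole file (not just the diff) must fit in its
    # output window.
    return approx_tokens(full_file_text) <= FAST_APPLY_MAX_OUTPUT_TOKENS

if __name__ == "__main__":
    big_file = "x" * (42_000 * 4)  # roughly a 42,000-token file, like the one mentioned above
    print(can_fast_apply(big_file))  # False -> extra workarounds, more places for subtle bugs
```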