
Loading summary
A
Foreign. Welcome to the Late in Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swix, editor of Late in Space.
B
Hello. Hello. And we're so excited to have Kyle finally in the studio. Welcome.
C
Hey. Very excited to be here.
B
Kyle, you're CEO, founder, co founder.
C
Co founder, CEO of OpenPipe, which started.
B
Two years ago and recently got acquired by Core Weave. Congrats.
C
Thanks.
B
I think you might be our first started and exited founder that we've had on the pod. Maybe. Ish. I don't know.
A
I'm not keeping especially on that timeline.
C
Well, I don't think I was exited when we. I don't remember if we set this up before or after we announced we were getting acquired.
B
I specifically pinged you because you got. I think you got acquired. You've been on my list to watch, obviously. You've spoken three times at AIE and you've been on my list of like, when is it a good time to have an open pipe or fine tuning RL discussion? And then you got acquired. And I'm like, okay, yeah, that's a good time to talk about it also because I think it gives us a window to talk about acquisitions, consolidation, what should be an independent company, what maybe doesn't have to be anyway, but we'll maybe do this chronologically so we don't get too far ahead of ourselves. You were famously director of Startup School.
C
Yes.
B
Maybe for people who don't know what is Startup School. Did that make you fall in love with the color orange? Yes.
C
I'm wearing an orange shirt. For those who are listening, a very bright orange shirt. This is my conference shirt and I felt like it was appropriate for the pot as well. So, yes, I was at Y Combinator for about four and a half years and led the Startup School team there. So Startup School, it's changed over the years. It meant one thing before I was there, it means another thing now. But during the time I was at yc, Startup School was basically all of the external facing, a lot of the content, certainly all of the tech. So it was things like we had a mooc, effectively where founders could come in, they could learn about how to start a company, they could get advice from YC founders, YC partners. We had a co founder matching service that we built which actually worked really well. We got a lot of people through our total. I guess technically that probably doesn't matter anymore, but a very large fraction of the batches that went through YC while I was there were directly attributable to people that we found and ended up recruiting to YC through their experience at startup school. So that was kind of what we were working on.
B
Yeah, I always kind of consider it as the scout program for yc. Like the YC before the yc. Any notable famous people that met as part of your co founder matching? Because I'm always very negative on those things because like it's like online dating, like the chances of success is super low, but when it works it's really.
C
Nice, you know, that's a great question. I left. So we launched that product probably nine months before I left. And so I don't know what the long term outcomes were of that specifically.
B
Yeah, so you left yc you spent a year in kind of the wilderness. You went through YC S23. What's that journey like?
C
What's the, you know, I was very excited about AI things in general. So I left YC I guess beginning of 2022 and I was trying out a bunch of different things. Ended up landing on what turned into OpenPipe in early 2023. This was, let's see. So I'd been working. So my co founder is my brother, my little brother, which has been fun journey on its own. We were looking at different ideas and one thing we realized was we actually started the company immediately after the GPT4 launch. And, and what we saw as the opportunity in the market at the time, which has changed since then, was GPT4 was insanely expensive and extremely powerful. But there was an opportunity to distill specific workflows from GPT4 down to much smaller, much cheaper models. And there was a very clear value prop there. Given how expensive GPT4 was, it was hard to deploy in production, but you could sort of take those abilities and deploy them much more cheaply. So that was kind of the first thing we built was this kind of very managed, very clean distillation flow.
A
What was that process like in the beginning to get people to actually care? Because I'm assuming most people are doing experimentation but didn't really have these large production workflows that they needed to distill down. And then I think maybe once we got there, the models get cheaper and faster. So what was the initial six, nine months of the company through the evolution of the models?
C
Yeah, so it worked. It was great. So I mean it did take us a while, I guess. We formed the company early maybe March of 2015, 2023. By the time we launched our product, it was August. I want to say there were some different things we were trying in between and actually it was not hard to find people and get them excited. There weren't very many. I mean this was even late 2023, there weren't very many people in production. But anyone who did have production workflows, it was extremely painful. Like they were paying hundreds of thousands of dollars a month to OpenAI. So it was very easy to convince them to try this out. And so we got our first three customers after launching, probably within a month and we were doing significant revenue over the next six months. We actually got to a million in ARR over about a eight month period following that launch. So by the Latter Part of 2024. So actually yes, initial traction was super strong, very clear value prop. But then as you were alluding to kind of like there was just this slow march of the frontier model token prices just dropping over and over by 3.5x over and over again, which kind of ate away at our value prop over time.
A
What was the process of fine tuning the model? Because even the open models were not that great. And so what were maybe the bottlenecks instead of having three to get to 30 customers? Did you feel like in the beginning it was a matter of just the market growing? The open source models not being good enough, the fine tuning not being simple.
C
Efficient enough, the pain point, I guess repeating what I said before was the price was too high on the closed models. But you couldn't just drop in an open model and replace them because like you're saying, the quality was quite bad, especially as you're moving to smaller model sizes, but larger models, open models weren't even available at that time. So that's kind of where the value prop was, was like, hey, the closed models are too expensive. At least the ones that are performant enough to do the job. The open ones are not good enough. We have a very clear managed flow. The way the flow worked was quite simple. You simply put in our SDK, it's a drop in replacement for the OpenAI SDK. It's capturing, you continue to use GPT4 in production for a period of time. We're capturing the requests and responses. And then we had just a very clean managed flow where it's like, okay, at some point you say, hey, I want to distill this down and you, you train on that. And then you know, we provided an API that was a direct drop in replacement. You would just change kind of the inference URL and you were using your own model and it your app continued working.
B
Yeah, I think the market analysis here Because I was also exploring starting a business around that at the time and that's why I ended up not investing. Was basically you get squeezed between the GPU providers who also want to do fine tuning as a service because then that makes people more sticky and the labs who keep putting out distilled versions of something, whatever mini versions of their models. What was the analysis on the Neo cloud side? Because you kind of also want to host the inference.
C
Yeah, honestly we, like I said, felt very squeezed from the frontier labs that were putting out just more capable models at lower cost. I did not see the competition ever really materialize from the Neo clouds, from the GPU providers. Everybody had an offering in fine tuning. When we talked to customers, nobody used them because they just were really hard to use. So I do think that call it a product thing, I guess it's not.
B
Their focus, so who cares? Yeah, interesting. Developer experience matters.
C
It does, yeah. Still does. Did. I don't know, maybe it doesn't matter anymore. Now we just have coding models to everything for.
B
No, it still does. When you have thinking machines launching an API and people getting excited about the API, you're like, yeah, okay, pure developer experience there.
C
That's fair. Yeah, yeah.
A
I'm just going through the chronological list here. Was like the Mistral 7B fine tuned, kind of like one of the big inflection points in the history of the company. It's like, okay, this is like a good open model and like the 7B size.
B
Or is it just Mistral and Mixtral? That was like a golden period of fine tuning startups because Mistral was like a credible open source model.
C
Yeah, they were really strong models, better than the Llama two that they were effectively replacing. And they also have the super open license which I think the licensing has become maybe less of a concern over time at the margin because people are getting used to maybe. But at the time that was like a pretty big deal that they had this fully open Apache 2 license and yeah, maybe they have their own IP issues with how they trained it. I don't know. I have no inside information there. But at least the guarantee they're making to people using their model.
B
I call this Mistral washing. As long as it's like it's constantly a sparkling region of France called Mistral, it's okay. They don't ask about what goes into it.
C
There's plausible deniability, Harm's lithe connection there.
A
Yeah.
B
Okay. There was this Mistral period, Jan 2024 you talked about S Laura and there was a period of Time where loras became more important. I feel like they then became less important and I don't know what's the rise and fall loras for you as a business?
C
Yeah. So Lauras have really, really. If you're predicate on the fact that you're doing fine tuning at all, Loras have very, very attractive properties relative to doing a full fine tune. Right. Because if you're doing a lora you can at training time, it helps some. You're using less memory to train. But where it really helps you out is at inference time. Because if you're doing loras, then when you deploy it for inference, you can multiplex basically an arbitrarily large number of loras on the same GPU deployment. That lets you do things like do per token pricing as opposed to GPU hour pricing. It just gives you much more flexibility at deployment time. I'm actually still a Lora bull. For the record, you're talking about the rise and fall. I think loras, their future is still out there.
B
I mean they're cool again because Thinking machines.
C
Yeah, I felt very vindicated by that blog post. For the record, just I guess for listeners, Thinking Machines put out a week or two ago a blog post doing quite a lot of research on the trade offs between loras and full fine tuning in various different training regimes. I think the reason loras were uncool for a while was mostly just because fine tuning was uncool. I think if you're doing fine tuning anyway, loras are still in many cases the way you want to do it. But not that many people were doing fine tuning.
B
As a marketing guy, Loras had bad marketing. They were just like, oh, you can't afford full fine tuning. Here's the Walmart store brand fine tuning.
C
No, that's fair. There is some of that. I think we didn't have a huge issue. We've had to do some user education like, hey, just try it. I think for the training runs, the types of training runs that we're interested in, where it's like, hey, I'm doing a relatively lightweight customization of an existing model for a specific task. There's really no downside to using Alura and there's a lot of upsides from an infra simplicity point of view. I agree that there's a branding issue around that. Hopefully the thinking machines blog post kind.
B
Of addresses that rank one. I think there's different hyperparameters Aluras that you can use to make yourself happy. The fact that John Schumann was like, nope, we're actually banking the company on this, at least for now, is a pretty big vote of confidence. I think it's surprising that no one's done the research prior to them.
C
And I was talking to someone at Thinking Machine prior to their launch who had come from one of the big labs and what that researcher told me was like, oh no, everyone doing post trainer research inside this big lab uses Lora's. I mean, not for the full run, but when they're doing their experiments, they'll just use Lora's on a base model to run the experiments. And it works fine for listeners of.
B
The pod that was leaked in one of the pods that we released. But it's up to you to find it cool Then it was the first World's Fair you talked about you probably don't need fine tuning. As a fine tuning founder, basically I think your talks are really good. I would recommend people watch all of them. What I pulled out was you had pieces of advice. So your talk title was obviously somewhat intentionally clickbaity. But your actual advice on when people should fine tune is when it's cost, latency or quality consistency that you really care about.
C
Yeah, I mostly stand by that. I don't think it's changed. And the biggest one we see today, and this is true for kind of like classical sft, it's also true for the RL stuff we're doing today. Crossing my fingers, it's not always the thing, but the main one I see that really drives fine tuning is if you have to move to a smaller model and it's typically for latency reasons and this is usually like real time voice. So if you're sort of forced into a smaller model anyway, then there's a very high chance that doing some tuning on that model is going to get you. It will be necessary basically to have a successful deployment. So we see that a lot coming from customers that again have those latency requirements. There's other reasons as well. Sometimes for whatever reason, you really have to deploy on a single gpu. You have to deploy within your own cloud and you want a. You know, you basically have to use a smaller model to do that. So basically in the case where you're forced to a smaller model anyway, then fine tuning it is often necessary, I would say for 90% of use cases where you aren't forced to a smaller model, then it's still not a good ROI and you probably shouldn't invest in it today.
A
How do you quantify these things? So cost, right, could always be lower. So is There kind of like a threshold of like, cost to roi, because it's also hard to figure out how much it's going to cost you to do the fine tune because you need to get the data and all of that. Do you have a mental model of that?
C
This is sort of like a function of the total amount of overhead required. I'd say there's two parts on the cost side and then there's multiple parts on the benefit side. On the cost side, the main things you have to think about are the upfront effort required to get an actual training system set up for your task. And that can be quite variable. But I would say at a minimum, you're going to have to dedicate a couple of weeks of a fairly competent engineer's time. If you have a very complex system and you're doing RL and you need to set up a whole environment, it could be a lot longer, it could be a couple of months of time. So that's just a fixed cost you have to pay. There's also an ongoing carrying cost where once you've committed to doing fine tuning, it does make other parts of your stack less flexible, less nimble. Because whenever you're updating your prompt or you're adding new context or whatever, now you have to spend a few hours training a model, and that's just going to slow down your iteration cycle, which is a real cost. In many cases, that's the larger cost. So you only want to do that if the benefits are large enough. The dollar cost, I would say, is basically never a factor. It's just so much less than the amount you're spending this engineer to do the work that it's not. I mean, each of these runs is between five and a couple hundred dollars. And it's just. You don't have to do that many of them. Yeah.
A
Because most of the data is like first party.
C
Yeah.
A
Right. Okay. When was the switch to RL? Was it when Zero1 Preview came out? You were maybe like, okay, it's time to move on from sft.
C
Yeah. So that was a big moment for us with, you know, there's all the leaks before that about Strawberry and all this and like, you know, a lot of people talking about, okay, how are they doing it? We realized through that that, okay, someone's figured out how to make RL actually work with LLMs, which was not a thing. I mean, it was a thing that some people had played around with before that, but it wasn't a thing many people were thinking about. And so our bet at that point was, yes, let's figure out whether this works for task specifically and the space. I think it's important to kind of tease out different parts of the market. I think with the release of 01, and this has been proved out many times with releases since then, I think there's now a very strong consensus that, okay, on the frontier model, general purpose model side, investments in RL are paying off, I think. I don't think most people would argue with that, especially as you're getting into these agentic tasks and training them to do that. It seems very clear. Well, obviously the big labs are paying ridiculous amounts of money for these environments and everything, but also they're actually getting really good results. The model's coming out. We're seeing it especially on the coding model side, but in other contexts as well, we're seeing the especially agentic use is working way better because of this. So I think even late 2024, it was pretty clear that RL was going to work in that context. And then the question in our mind was, can we apply this in a different segment of the business, which is kind of like task specific customization? The question is, does that work well? How much effort does that take? Is it going to be something that ends up being unnecessary? Because, oh, the big labs can just train on every single task and the base models are going to be just good at everything and so there's no benefit to it. So those were kind of the open questions in our mind, but it seemed like there was at least a good enough bet that we wanted to try it out.
A
Yeah. And you had this agent reinforcement training framework and you did the email agent. That's kind of like the first proof of concept. Was that obvious to do email? Was it obvious to call it that way? What was the behind the scene, how should we package this?
C
So what I told our team and we decided to go all in on RL in January of 2025. And we've been doing some experience before that. We released before that kind of like an RL model that would generate hacker news titles from articles, which is a fun project. So we'd done a little bit before that, but that was kind of like we're like, hey, we're going to bet the company on. Not in a literal sense, we could have done something else later. But this is the thing that we're going to spend all of our time working on for at least a few months. And what I told our team at that time in January 25th was like, there's probably a 25% chance that this is the right direction in the sense that a year, two years from now, all the companies, everyone doing inference should be doing RL and task specific training so that their model's just way better at their task is a relatively low chance. But it was sort of one of those big if true things. If that is true, if it turns out that just doing RL on your task is just something everyone should be doing and it's just teaching these agents, continually teaching them through experience is just going to be a huge benefit, then being the first people working on that would be a really, really awesome position to be in. So that's how we thought about it is less than 50% chance but really big outcome. If not, if so I think since that time and I've been very transparent with this with our team and when I'm talking to other people, I don't think the chance that that is the right approach is 100% yet. I think that we're still in the process, even after going through this and doing that, of figuring out. But the probabilities in my mind are going in the right direction now. I think they're actually. Today I was actually just thinking about this with another conversation. I think that the chances that everyone should be or everyone who's deploying an agent at scale should be doing RL with it either as part of pre deployment or even continuously as it's deployed, that that's the pattern that's going to get to. I'd say there's a 55, 60% chance that that's just the better thing to do and that's informed by our experiments working with customers. So anyway, not 100% but going all the way back to your question. No, it was not obvious. It was an informed bet. It's still a bet, but one that I'm feeling pretty good about right now.
B
One thing I think that is tricky about just onboarding onto this space is all the math. I remember reading the DPO paper. I think they were at Neurips for 2023 and people were very excited about it. Some of it's just being pretentious for a paper, but some of it's actually real complexity. You don't have a PhD like a prior ML background. How do you come to grips with it? What were the best ways to get around it for you?
C
I would probably push back on that a little bit. I don't think the math is actually that complicated. I think that when you see the PPO equation or something with all the symbols, if that's your first intro to it, then it feels very complicated. But I think if you were to show that exact same equation, just code, maybe not Pytorch code, because you also have to understand. But if you just did the naive implementation in Python and showed someone like, hey, this is kind of like how we're computing the loss here, who was a strong engineer, I think it's actually quite grokkable. I don't think the barrier to entry is that high. I think you just have to believe you can do it and then spend some time staring at it. That would be what I would recommend. You can read the papers and look at the equation. I think actually this is one area where OLMs have been super helpful. If I'm reading a new paper and I look at one of those equations and I'm like, I don't understand how this new term they introduced corresponds to these other terms, then I can dump all the context around it into GPT5 and say, hey, can you write this out of Python for me and show me what they're doing differently? And that's super helpful for my background, I guess.
B
Yep. The way I put it is I wish that all these papers would just publish with pseudocode or just straight up Python instead of math, because you actually just need to look at the implementation.
C
I know, like, Jeremy Howard's been beating this drum for years and I most agree with him.
B
I mean, there's a literal website called Papers with Code and people just keep not following it. I remember interviewing the DPO guys when they were at neurips and it was just like they were just very obsessed with proving in principle equivalence to ppo. And it was very hard to follow, I'll definitely say that. And I think now, obviously at some point GRPO kind of took over the general consensus. It was very strange because I think when deepseek first started talking about it, it was viewed as an optimization. They tend to just generally couch everything as an optimization. But I think the leader insight, which I think you touched on in one of your blog posts, was that no, actually it makes comparisons independent rather than global. And that's actually what unlocks some mono. Self supervised rl.
C
Yeah, I mean, it's interesting. There's real pros and cons. If you're moving from PPO or something similar to it to grpo, there are some big pros. I mean, one pro is just sort of like operational simplicity. Like there's a whole extra model you need for this value model you need for PPO that you can throw away with grpo, and that just makes your life easier. You don't have to train that model, but also there's no hyperparameters around that model that you have to configure. So that's nice. Another thing is the benefit that you're talking about, which we've observed. So the way GRPO works is you have to do a set of different trajectories or a set of different rollouts all in parallel with the exact same environment, the exact same conditions, and then you score each of them. And GRPOO uses the differences in those scores to promote the trajectories that did better and decrease the probability of the ones that did worse. Because they do it in a group relative way. It lets you be a little bit looser with how you score them potentially. You don't have to necessarily have a globally aware scoring function. You just need some scoring function that is able to distinguish between this small set of things you have in front of you. And that's easier. That's easier for a human. If you tell a human choose which of these is better, it's easier for them to do than say, is this one good or bad in absolute terms. So that's nice. The big downside, the huge downside of grpo, and I think actually the reason why GRPO actually is likely to be a dead end and we probably will not continue using it indefinitely, the fact that you need to have these parallel rollouts in order to train on it is actually that makes the data generation much more complicated because you need a fully reproducible environment to be able to do these sort of parallel rollouts. And it turns out in practice that's like getting that set up is the hardest challenge today with getting RL working is like actually designing this robust, reusable environment that you can run. All of this training in most companies, and that's not true. Sometimes that's easy to do. There's certain situations where you can do that. But for the work we do, at least where we're training agents on real code bases to operate real applications, it turns out it's really, really hard to sandbox those things in a way that's totally reproducible. Ppo. Now, in practice, a lot of times when you're training with ppo, you also will use an environment like that because it lets you do a bunch of runs and be more data efficient. But at least in principle, you have the option with ppo, you can actually purely train on, say, real production traces of real people. Interacting with app. And so you don't have to have a simulated environment at all, which makes the deployment much easier.
B
Can you double click on why it's hard to do the sandboxing? Because in principle, we just capture all the inputs.
C
Yeah, well, you don't need to just capture all the inputs. You need a system that reacts the same way your production system does and in many different ways. So let's say your Airbnb. I'm bringing this up because this is like, an example of one that companies have gone out and built sandboxes. If you're Airbnb and you're trying to. You want to train an agent to, like, maybe you're not Airbnb, fine, you're A company like us is trying to train an agent to, like, do really well at operating Airbnb and booking on your behalf. Right. Like, you have to build a copy of the Airbnb website that reacts to you as the user the exact same way that the real one does, with the same failure modes. Right. Because if you don't include the same failure modes and bugs they have, then, like, one of those bug. When one of those bugs comes up in production, your agent's gonna have no idea what to do with it. It's just gonna fall over. You also need to simulate if this is a sort of cooperative agent, where it's getting human input as well, and kind of like working with the human to get something done, which in practice is the way a lot of these are deployed. You also need to simulate the user. And, I mean, you can do the naive thing and just say, oh, we're going to have a separate LLM with a system prompt that is like the user simulator. And we do that, but it's like, okay, but the breadth of ways a user might respond, there's a lot more diversity in that than the actual diversity you'll get in practice when you have this simulated user. And so then it's like, okay, well, is this environment close enough to how a real user would interact that, like, you know, if a user says something different, that it's going to know what to do? And the answer in many cases is no. If you're just purely training on kind of like an LLM user simulator, it's going to have its own idea of, like, what the correct way to answer is, and the breadth of, like, a way a human might respond in this situation is wider, and your agent just may not be able to deal with that.
A
Do you feel like it's hard to build the simulations as A company that needs to build the product that lets everybody do it, or do you feel like even for the individual companies that own the code base, that are like domain experts in their own product, it's still just like a very hard infrastructure problem?
C
I think it's still very hard. You know, like ideally all companies should have this anyway because they're, you know, if you're doing end to end testing, like theoretically, if you're following best practices, you would have one of those set up. When we talk to enterprises, almost universally that's like not something that really exists. So there are some startups, like there's some companies we talked to that do have it and we can just use that, but it's a very, very small number that actually have an environment like that. And I think it's hard to do and there's lots of weird bugs that don't show up in an environment like that. And even if they do have a testing environment, they don't have it populated with full realistic data, which is also important so that it understands how to interact. So I think in practice it's hard in both cases. Maybe it's easier for the company, but at the same time, depending on the quality of the company's engineers, it's might not be easy for them either.
A
Yeah. How do you classify the types of environments? So you have formal environments like a compiler you can put in there, you don't need to do any work, they just work. Then you have this kind of RL environment, startups in a way that are building a bank environment. They're building these things that are not digital twins or whatever term of the actual environments, but they're close to it. And then on top of it you have helping people trying to build the exact replica of their thing. There's obviously value in the formally verified one. We verified that. Do you think there's value in this RL environment? Startups that are building somewhat generic but test specific environments and then if none of those work, then what do we do instead of grpo?
C
I guess the question, yeah, I suspect there is value in that. I think the folks buying those environments and training on them in the big labs would have the best knowledge on how well they work. I think they probably work okay. I think they probably also are like, and we'll see maybe with the next generation of models released how well they transfer. I would say so far it seems like they don't train well enough. If you use OpenAI's agent interface, it's okay. Or if you use the computer Use products that everybody's putting out. They're okay, but not reliable enough to actually let go do something interesting unsupervised in the world. And I think if the environments they were training it in were high enough fidelity, then they would be good enough in the same way that coding agents can go much further. Because I think that in that case we do have environments that are much higher fidelity because it's a much simpler environment in a lot of ways. It's a code base, it's maybe running a web browser. It's much easier to capture the full realistic environment in that context.
B
For those who are interested, when you make a reference to RL environment startups selling to the big labs, they're selling it for a lot of money, like at least seven figures.
A
Right?
C
That's my understanding. I'm not a buyer.
B
Please drop data points because people who are not in Silicon Valley don't know this. And it's probably the current thing in VC, which is RL environment startups anyway.
A
A lot of them.
B
There's like 20 of them apparently.
A
Yeah, but it's like a small number. I know that. Yeah, all the labs are buying ad hoc, but in a way it's almost like they don't even care. It's not a product. It's like they're basically like paying the company to build an environment ad hoc.
C
For that services business at the moment.
A
Exactly. But I mean if you're spending like a billion dollar in a job you.
B
Can specialize in like we are the one that does ecommerce. Like we are the e commerce experts. So come to us for ecommerce, go to the other guys for social media, Go to the other guys for like I don't know.
A
But I'm curious. Your take is like how do you need to get the data out to make it fit in your training run? Especially when you get to like these larger labs. I think they have like very sophisticated post training pipelines. And I don't know if there's like a way to just build a company where it's like you just send them a CSV of like data. It needs to be very integrated in it. But I'm curious what you've seen working with customers too.
C
So for rl, like the whole way this works is it has to sort of be getting feedback from the real environment. So I don't see a world where it's as simple as like, hey, you can, you know, there's like a CSV type approach. I guess you could code anything as a CSV but if you try hard enough for RL to work, you have to be looking at real runs, ideally of your actual agent in its current state across within an environment, as real as possible. So you have to like look at actually. And the data format's actually super simple. It's just basically a list of chat completion messages. It's effectively whatever tool calls. Yeah, exactly. Yeah. It's whatever your agent will be seeing and doing when it's running. So getting the data is not hard. But what's hard is when you're doing one of these runs and your agent makes a tool call. Okay? Now that tool call has to connect somehow. It's got to get data back from something and that data has to look like it will look in real usage. So setting up that whole part of the system is, is the challenge and.
B
Then just a reference job for more people. Web arena is my first instance of this kind of thing where you literally have a Docker container that has a clone of Reddit, a clone of wikipedia, clone of GitLab, clone of CMS and a clone of an E commerce place. And I think since then there's like mine to web maybe. I don't know if there's other large, well known academic environments where people are basically using these as benchmarks, but probably also it's pretty useful for training. So if you want to check out those things, you can definitely check there. I think the question for you is as someone who bet on sft, then you bet on RL FT and then now you see these guys making a lot of money. Why didn't you go there?
C
It seems to me like that definitely is a services heavy business at the moment as it's presently constituted. I'm sure that these companies are all developing different kinds of secret sauce on how to do this more quickly. So that's part of it. I don't particularly enjoy services businesses, but I also kind of feel like we will move towards a world where either the big labs, it's one of those businesses where the only customers right now are whatever four, maybe six big labs that are training these models on environments. And I don't think I'm a little right.
B
What's the 10?
C
Yeah, but look, you can say the same about scale AI and all of their competitors that are many billion dollar companies that have basically the exact same customer set. So yeah, may work out.
B
Yeah. Unless I don't know if you want to do a small shameless plug for Verus.
A
Oh yeah. I mean, so Verus, one of our portfolio companies, they Work with the people building the agents now with the model on like their internal tool call loop so they can observe all the internal traces and build the data to then have like a open pipe do the RFD on the thing. I think in the enterprise we've seen a lot of that, especially for chatbots it's like the less sexy use case, but they work with a lot of financial services company where their customers go in there and say, what's my balance? When did I do this transaction? And those are all tool calls. And they need a way to test and improve that behavior. And the models haven't gotten that much better because these tools are badly documented, they're badly named. I think that's kind of the problem with a lot of the agent builders that are not AI native companies is like they just put this like very generic tools in the thing and then they expect it to work like magic. And these simulations kind of help them also have the usual compliance things. It's like before shipping this we tested that. It doesn't give financial advice. We test that there's all these different things. So I'm curious to see how much the companies generalize. I think Verus has a lot of success in highly regulated environments because of different requirements. But I'm curious if you have a different way to segment the market of like when you think about rl, there's like environments that are like low stakes. There's like environment that are like high stakes. There's environment that have implicit rules that are made by the SEC or other government agencies. How you think about it?
C
Yeah, I don't know that that segmentation is necessarily the most relevant. I'd have to think more about that segmentation, whether it's. There's a strong difference in how useful RL is across those sectors. Where I see the segmentation is something basically just capabilities based. Where it's like, hey, if I'm trying to do something that's much more advanced and maybe long horizon, then RL can probably give me a much better behavior. And I might almost think that those sort of more compliance. I feel like in those kind of environments you probably don't want your agent doing very much because then you can't make any guarantees about what it might do. And so you're probably not doing these long horizon things and maybe RL is not going to get you what you want. But I don't know. Yeah, I haven't thought about it too much.
A
Yeah, I think like a lot of the customers don't necessarily end up doing RL anyway, it's almost like the simulation and the environment is like a way for them to understand the paths that the agent can take and less about. We need to then use that data to do fine tuning. But I think it's like it's going to be a spectrum.
B
What replaces your po?
C
Yeah, it's a good question.
A
We need the alpha.
C
Yeah, I mean, I don't know is the short answer. I do think this is like a fairly high salience question in the research community. I think there's a lot of folks trying to figure that out.
B
Every paper has a variant.
C
Yeah. But I think the big question is, are we doing normalization based on grouping or in some other way? Right. That's like, I would say, like, I would claim we're just going to keep calling it grpo as long as the normalization is done within like a group. Even though. Yeah, there's a lot of things that like, probably should get their own names. A lot of things that have tried to get their own names and have failed on the marketing side. Yeah, I think something that like, doesn't require group level normalization, which a lot of, you know, older things didn't, probably works. But I think that the older things also are really finicky. So there's, there may be other kinds of simplification and I don't know exactly what, what those will be.
A
Where do you put the prompt optimization thing? We did a Dev Day episode and we mentioned Jeppa and then everybody came out of the woodwork on Twitter.
B
DSI brought it.
A
Yeah, exactly.
C
Okay, tell me, have you or people you talked to tried Jeppa? I want to know what I read the paper.
B
I'm just like, look, the prompt layer updates are not the same as Wait's updates. They're just comparing apples and oranges. And I talked with a few people I respect on the RL side and they kind of validated the way that these grad students market their papers is their thing beats the current hot thing. And the current hot thing is grpo. But they're just not that comparable.
C
I disagree with that. I actually think they are comparable in the sense that it depends on for what purpose. But if I'm a company and trying to get the best performance out of my agent, I don't care if you're changing my prompt or if you're changing my weights. If you get better performance on my agent, I'm happy on that front. I do think they're comparable. And, and we've evaluated, I mean, we.
B
Evaluated like, so their Answer was, you are going to do both. If you really want max performance, you're going to do both.
C
Yeah. We evaluated everything from dispute, and we evaluated JEPA as well. And it's like, it just doesn't work. Okay. Like, okay, that's going to be the fighting words.
B
JEPA doesn't work.
C
It didn't work on the problems we tried it on. It just didn't. It got like a minor boost over the sort of like, more naive prompt we had and was just like. It was like, okay, just kind of like our naive prompt with our model gets maybe like 50% on this benchmark and Jepa got to 56 and we do our own. We get to like 96. I mean, it was just like, not even comparable. And so maybe we were holding it wrong.
B
Both sides are claiming skill issue. Right. So what they would say is you probably used it wrong. And then RL people are saying that probably JEPA guys, when they set up the GRPO benchmark, it wasn't a very fair comparison, which is exactly what. What my source said. It's hard to tell. Everyone is trying to get to some version of the truth.
C
What I will say is we want it. I mean, I don't know if I would say it goes so far as to say we want it to work, but we certainly want to know if it works. That's actually very relevant to the.
B
If it's more efficient to get there.
C
Then you shouldn't have been able to get it working.
B
It's actually kind of more credible now that you're part of a larger core weave that you're not, obviously. Because I think JEPA maybe makes openpipe less relevant.
C
I totally would disagree with that because the level we see ourselves operating at is actually we're not like RL Bros trying to figure out the use case for rl. We're like, hey, we're working with all these enterprises, we have all these big companies we're talking to, and we're trying to figure out how we make their stuff work better. And so I personally am very motivated. If something like Jeppa works, okay, let's build a product around that. That's how I think about openpipe, at least.
B
No, I mean, that's a good clarification to make even more. So you actually took a sincere look at it and you concluded that there was nothing to do, nothing to build.
C
Well, maybe we were holding it wrong.
B
So we had Shen Yu on the podcast a while ago, and I think he's been a proponent of automatic prompt optimization. And this Idea that you can do a lot more in the prompts than you can do in the weights. And in principle, I'm biased inclined to believe that something like a dspy, something like a JEPA works. So I'm very surprised to hear this.
C
Yeah, we keep trying it. We tried the Mipro V2 stuff that was hyped before that also.
B
Okay, I should not bury the lead on the best argument for this, which is basically JEPA models how the big labs do their system prompts. It's genetic evolution and they sort of incrementally evolve based on the overall evals that they have. It's slow because it's done by humans, but JEPA theoretically improves. It automates this.
C
Okay, hold on. Is the kind of. The big labs how something. This is new.
B
No, no, no. This is philosophically this.
C
I'm not saying like, oh sure, but you're injecting a whole lot of human intuition and kind of like potentially out of band information.
B
We have the best model in the world, which is humanity or like smart humans. And now we're doing JEPA using dumb lms.
C
Right. But they're also like, the humans can bring in out of bound information that maybe is not captured in the actual, like, you know, the evaluation. Like they can be like, oh, yes, technically this did well on the eval, but it's not really. I would suspect that a lot of that ends up getting injected through that human being in the loop.
B
Yeah, I've always been very surprised at how these guys work on their system prompts, which are tens of thousands of words long and there's no ablations. They just kind of pick what seems to work and then chuck it in there. And that is the Claude system prompt.
C
Can't argue a success.
A
Is GPT5 the first model that had a prompt optimizer by one of the large labs? I believe so, but I don't remember.
B
Claude Workbench had this like a year and a half ago, if you see it that way. It just wasn't like fully automated. But it was extremely good for its time. I kept telling people about it, nobody believed me.
C
Do we know if they used it internally?
B
Cloud Workbench?
C
Yeah. Okay.
B
Why not?
C
Oh, I don't know. My experience knowing a lot of people at these labs is like they launch a lot of products because some team is super excited about this product, but that, yeah, I wouldn't put that much weight on it just because they launched.
B
It for some measure of use internally. I'm sure the people I talk to I biased I don't know if you fully explored that.
A
No, I think that it's just interesting that now it's been acknowledged that the LLM can improve your prompt. And so I think Jetpa now is also writing this wave of like, okay, maybe we can do this programmatically. But I also think the long tail of people just prompts really badly. And so I think there's some value there. Versus, once you go into rl, you already have a more sophisticated audience. Who gets to do grpo, People that are really smart. Who gets to do prompt optimization? Everybody's trying to do it.
C
Yeah, that's right. Maybe our baseline was.
A
I know your naive prompt is probably like top 10 percentile of prompts that people put in these LLMs.
C
I'll take it. Yeah.
B
And then the other thing that comes to mind as you were talking about things, injecting things out of band and all that, I think it's a broader trend that I'm tracking for WorldSport 26, which is the move to online evals. The way that we do evals today is probably too locked down. You're kind of fighting the war that you already know should be fought, and you're not fighting the wars that you don't know about because you didn't plan for it. Whatever. How can we sort of move more online evals into our JEPA process? Maybe that's what it is.
C
That part I'm much more bullish on. And we can make the analogy. We can pull in kind of like RL intuition here, which is if you're doing JEPA on a sort of static data set of like, oh, this is the input. This is what makes a good or bad output. Then as you're updating your prompt, your information, the data you're training on becomes less useful. Right? Because it's generated by. Because it's based on kind of the problems you're running into before. And that's the same problem you have with rl, where you have this concept of being off policy, where it's like, as you're doing training, you really want to be training on rollouts that came from the latest version of your model. Because if you train on some that came from further back, then it's sort of stale data and it's no longer representing the current issues with your model. And so if you try and correct for the issues that existed back then, it may not actually be helping you that much. And I think for either RL or prompt optimization, that's definitely true. I think that one way to Apply that in practice is exactly what you're saying. Where you're using the actual data from your real evals, you have some way of saying, hey, either people flagging these or no, I'm flagging these. Or some way of saying this was a good or bad output. I totally agree with you. If you're bringing that into your process, I'm much more optimistic that you're going to get good results.
B
Yeah, and the pipelines are not set up. This is like analytics and UX people being drawn into the ML process, which they've never been done before. If I had to make a bet as a big theme for next year, this is going to be it.
C
No, I agree. I think that all of the sort of observability, people like platforms see that and are trying to figure out what the right shape is. I haven't seen the right shape yet, but yes, it seems like a theme for next year. Statsig maybe. Yeah, I haven't used them, but OpenAI seems to like them.
B
Yeah, I mean, I do think buying an experimentation platform makes sense and I think it's sort of. I've said before on the podcast, I think that I'm very bullish on model routing as a feature, but less bullish on model routing companies because of exactly stuff like this where it is just going to get absorbed into the model. It's a very big part of building the process. You probably don't want to, and it's not that hard. It's not rocket science. You're just connecting pipes and making sure things are set up so that it's easy to use that data.
C
I have a question for you, a general question. So what fraction of tokens generated by say like the end of 2026 do you think are going to come from open source models versus proprietary models?
B
That's a fun question. So we have an answer from Ankur from Fray Interest where he was like, it's 5% and going down. I think it's going to go up because of the amount of enterprise adoption of open models that I'm seeing because.
C
There'S a lot of demand. The enterprises would much rather be on open models if they actually could get the performance they're looking for.
B
Yeah, for privacy, all that stuff. And I think basically, honestly, it's just literally, we may have hit quote unquote AGI in the sense of the average LLM is capable of the work of the average human. Not the best human, but the average human. Sure. It's actually pretty decent at customer service and it's actually pretty decent in, I don't know, transcribing things on PDFs, whatever. So, yeah, totally. I think that should rise. But. But people who believe that it should rise to like 50% are out of their minds.
A
And I think it's a trick question. We should take coding out. I think once you take coding out, I think, yeah, it can be like 15, 20%. But I think with coding it's still going to be very low because these max plans are so subsidized and so many tokens are being generated. Like, Anthropic is like, you know, 50% of the revenue.
C
It's your claim that it'll mostly be, you know, that coding will mostly be closed models because the tokens are subsidized or because the models are just so much better.
A
I think as long as, I mean, I'm paying 200 bucks a month and it's like I'm spending thousands of dollars by accident. I pay with my credit card and I spend like 100 bucks in like an hour. And it's like, by the way, think about.
B
This is like the thing that nobody wants to talk about for Anthropic. Like, anthropic went from like 1 billion in revenue to 5 billion. And it was like, ooh, yay. And then like, what's the margins? You have this goose meme going, like, what's the margins? They say it's like 6% there. You are part of the 6% that is abusing everything. So everyone else.
A
I'm not abusing.
B
You're the loss leader.
A
It's not like I'm rotating accounts. I'm just using the product. You know, it's like, yeah, yeah.
B
But like through you, people like hear about cloud code. They pay the $200 a month and then they don't use it and they pay for your input.
C
Yeah.
A
Thank you. Thank you, everyone. Keep doing it right to go away. But I think, like, I don't really see. It's hard to see a world in which quancoder or whatever model replaces that between quality and cost. It's like to make. To generate this amount of tokens for 200 bucks a month. I don't know how anybody can offer together fireworks. They cannot really offer it at that price. And the quality is not as good.
C
But the reason they can't offer that price is because of the subsidies. Right. Which is not the long term sustainable.
A
I mean, it's interesting because. So Both Anthropic and OpenAI are building their own infra. Right. And they're going to get to a place where they're going to have idle GPUs that they own. And so they will also be incentivized to have 100% utilization. And so they will subsidize some of it. Just the same way, if you go on SF compute, you pay a buck 40 for like an H100 instead of the 220 listed price on AWS. So I think it will continue. But again, it depends on whether or not they actually have the 500 billion, like they were saying, which I think they do. You know, just to be clear, I think Stargate will go online, but once it goes online, then it's like, well.
C
If they figure out how to pay for $500 billion worth of compute, then they probably can subsidize for a while.
B
I think they have the 500B. They're going bigger. Isn't it obvious?
C
What do we mean by have?
B
At the start of this year, when they announced Stargate, people were like, oh, you don't even have 10. Elon was like, you don't even have 10. Whatever. And then Satya is like, I'm good for my 80. But now we're seeing all the money start coming in and probably it's in the order of like 200, 300 billion that you could probably get raised and committed and they're going to get the rest. It's fine. I think that the plan is actually a lot better.
C
Can I just say, I love this industry. It's like, yeah, they've got like 2 or 300 billion and what's another couple hundred billion? There's no other industry in the history of the world where you can see.
B
Yeah, it is stupid. But also, do you doubt it? I don't. That's fair. No, literally after last week, I think maybe two weeks ago, with the whole Oracle, Nvidia and then even AMD deal, I'm like, oh, these guys, not only they've locked down Stargate one, they're working on Stargate two, whatever that is. And the sheer ambition is freaking crazy. And there is still one more shoe to drop, which is the non sovereign wealth funding that OpenAI needs to get, which they've promised to drop by the end of this year. And my money is on. They have to do a coin. Like, I'm not a crypto guy at all, but like, you know, this is.
C
Going to be like an OpenAI coin.
B
This is the one AI founder that has his own coin already. And like he needs more money. And he said that they will come up with new innovative financing methods. What else is there? Yeah, they're already in a token selling business.
C
But you got great line buy an open air token.
B
It translates to GPT5 token.
A
Sure, it's a stable coin.
C
You'd have to get a lot of political buy in, I think to take.
B
That level of the White House that is most crypto friendly since the dawn of time.
C
Well, I guess Elon's out of there now, so maybe they can get the. Make the friends. Yeah, I think it's doable.
B
We'll see. Who knows. For what it's worth, nobody's like, this is a me theory. I don't have any inside information. Yeah. Should we go back to Ruler?
A
Yeah, sorry.
C
Right.
A
Open fire. Anyways, we were saying, I think this story takes us to July 25th when you release Ruler, which you call easy mode for RL Rewards. And then I mean shortly after you get acquired in September. So maybe you just want to talk through the summer. What was the vision then? Maybe how the acquisition came together.
C
Yeah, absolutely. So I mentioned my initial opinion of how likely this direction was to work was maybe 25%. We're up to 55% or so. And Ruler is actually a big update that got me from the 25 to the 50. I guess just for context there. So basically there are several problems you have to solve if you want to use RL successfully. The problems you have to solve, I mean some of them are just really dumb. Basic like hey, you got to get the infra. And the libraries have all really sucked and been built by PhD students who don't know how to build reliable software. There's all these practical issues that we're working through. So that's one thing and that's kind of what we're trying to solve with art. But even after you've got that solved, you've got major issues which is like you got to know if your agent is actually or whatever system you're using on RL is doing a good job. Right. That's fundamental. You have to have a reward. You have to know it's doing well or poorly. Sometimes that's easy to do. If you're solving a math problem or something, you can come up with a data set of math problems and the known solution and check if it's the same. On the coding side, there's been a lot of innovative work around. I mean there's first of all a lot of open data and a lot of. I think the approach a lot of companies take is you find existing test cases and Then you break them. But there's sort of a way to figure out if you could run the test case and see if your code fixes it or not. In a lot of other domains it's much more murky. It's like what is a good job versus a bad job? How do I know if I did a good job? And you really need that information. So we've tried a bunch of different things. Ruler is a library that we released.
B
Which let me relative universal LLM elicited rewards.
C
Thank you. Yes. And the way it works is basically this depends on the sort of GRPO insight which was mentioning earlier that you actually don't. With grpo it has this nice property where you don't have to have an absolute judge of the truth. You just have to judge relatively. And so simplifying it a lot is basically just llmsjudge on a whole group. So you say, okay, this is the task I'm trying to achieve. Here's four different runs of an agent trying to achieve it. Which of these did best? And it stack ranks them. And it turns out that works phenomenally well with grpo. Like way better than I expected. Way better than anyone who kind of like I talked to before. We actually tried this expected because it's sort of in the elemisjudge. It can sort of self ground because it's just getting these relative ranks. So it doesn't have to have an omniscient view of what good or bad looks like. So that has worked at basically everything we threw it at. We've done it with a bunch of client projects, we've done a bunch of our own customers. It basically just works. I honestly kind of feel like the reward assignment problem is fairly solved. Yeah, it's fantastic.
B
Just any LMS judge off the hook.
C
We've tried it with so many things. So one of the results we published was we used Quin 2.5 14B as the model we're training and as the judge. We used Quen 2.532B which is, I mean it's fine, but it's much worse than any frontier model. And even with that combination, we were able to get our agent doing state of the art better than any Frontier model on the task we tried it on, even with an extremely weak judge model. So it really doesn't depend on having a really great judge model in practice. So yeah, it's just not something we've had to worry about since then at all. So that's kind of like checked off. So that's sort of Got me a significant increase in like, okay, this is actually something people can apply. This is now something that's packaged up. People can just use our. We open sourced everything. You can use it off the shelf. If you stick in your trainer run, it will probably just work. So that leaves the remaining problem, which I guess we were talking about them out of order. But that leaves the environment problem. Right. That's like the one big remaining piece that we don't know yet how to automate or remove and requires a lot of manual work for every single task for listeners.
B
This is why I kind of refer to it as self supervised because it removes more and more of the human judgment and the history of machine learning all the way from I guess.
C
The.
B
Start of imagenet and everything is really that insight of you should just take humans increasingly out of it and scale up the data you can just throw in there with no supervision.
C
Yeah, totally.
B
Yeah, it's really awesome. Are you bullish on dedicated LMS judge models? Have you looked at those bespoke labs? We did an episode with them and they're really trying to carve out a niche in there.
C
We've looked into it. We've trained some ourselves. We've also used some off the shelf. There's an evaluation benchmark that the AI2 people put together a reward benchmark. And so reward bench is kind of like trying to benchmark models on serving.
B
As ll reward models. Are LMS judged in your mind? It's the same thing.
C
Yeah.
B
They have mildly different depends on the task elements. Judged is usually more sort of product facing and reward is reward modeling is much more specific within a chat task, which is. That used to be the old meaning of reward model.
C
I don't know, maybe terminology has changed. I think they're pretty equivalent.
B
I understand that. Yeah, I can see your side.
C
Anyway, so, yeah, rewardbench is kind of like. And so we've tried a bunch off that. The thing is, I guess my maybe meta take on this is any task that is extremely common is going to end up as a specific part of the training data for the frontier labs. And LLMSjudge is just something everybody's doing in so many different contexts that you have to assume that all of the Frontier Labs have a bunch of LMSjudge style tasks that they're training their models on. And I do believe that if something does kind of like make it in a more than minor way into their training data, that they're going to do at least as good a job as a dedicated model. So I don't think there's probably a lot of alpha in dedicated LMS judges just because it's something that. Let me caveat that and say if you've got a very, very specific task that's weird and has weird requirements and you have a lot of data on what's good or bad, then training a reward model for your specific task I think could still work. Or fine tuning an LMSjudge on your specific task could work. I'm pretty bearish on. Hey, this is a model that is trained as an LMS judge, but it's a generic LMS judge that can be used to judge anything. I just don't think you're going to beat the Frontier Labs on that.
B
Yeah. One other version of this that is not quite an LLM, but some people are thinking about it is something that we're working on for a future episode, which is World models.
C
Sexy.
B
Yeah, very sexy. First applied in video, as far as I can tell, for Genie 3, Genie 123 and now with code and potentially with virtual cells for AI bio. Any exploration there that's interesting to you?
C
Yeah, so we've been playing around with it a little bit. It's one of the directions that I'm fairly optimistic on for solving the environment problem specifically, because if you think about it like a world model, it's a simulated environment. That's its whole purpose. Right. So if you get one that's in.
B
An LLM like thing, not like a doc walker.
C
Yes. So it's like whatever, hallucinating, generating, imagining the responses you'll get from the world. So you can imagine.
B
Right.
C
If you had a really, really great world model that you were training on. Yeah. It's like your agent that you're using, it would go out and make some tool call and then this world model would generate, hey, this is probably what the tool call. And if you have a smart enough, strong enough one, then it could keep its own effective internal state of the changes you made so far and how that affects. So we've played around with it some. I think if we can get it to work really well, then that could be a solution for the environment problem where you just take a bunch of production traces and use those to condition your world model so it understands your specific system and what its failure modes are and then train against that world model and the resultant agent that you train with that would then be able to perform in your real environment. So I do think it's a really interesting area of research. Yeah.
B
And did you see the meta cold World model work.
C
I don't think I saw that one.
B
Okay. Yeah, it was like two weeks ago. We just confirmed the guy for AIU code in November and it's really interesting. The world model is.
C
Oh, sorry, you're talking about the Meta one.
B
Yeah.
C
Okay, I missed it. Yes, I did. I saw that one.
B
I said a lot of syllables, so it may not have parsed, but yeah, it's literally having a debugger as the environment, as the world model, and opening up the execution trace to the model to see what's going on and see the state and track the state as the code executes. Seems to be smart and exploits the unique situation of code environments where we can actually do these things.
C
Yeah, I think the way they envision that model being used is a little different, actually. I'm curious. I'll have to see the talk. But my understanding from that paper is the goal they're imagining is this is almost sort of like a pre training step. And then now that this model understands code really, really well, we can then use it as basically like a code generation or a coding agent of some kind. Okay, yeah. Which I think makes sense. That's almost more like a different kind of pre training, I would say. The way I'm interested in applying world models is basically as its own end. Right. Where it's like, actually the goal is to come out of this with something that simulates the world, which is not something you really need in code at all because it's so easy to run code and you don't need to model what will happen if you execute this code, typically, because you can just execute the code and see what happens for training purposes.
B
But it closely models how we think about code when we code is we kind of mentally execute the model as we type and we go like, is that what we really want? Yeah, I don't know. Anyway, it's the first model that meta's released since the MSL reorganization. We know just based on our context, they're very interested in code models as a path to AGI, which I'm also, of course, very interested in.
A
I know we kept in here for a while. Let's wrap up on the acquisition. So a lot of people say companies are not bought, sold, they're bought. What was that process like for you? Did it just happen? What was the behind the scenes?
C
Yeah, so that was driven by actually mostly the weights and biases. Founding team Lucas. Yep. So, yeah, Lucas and Sean particularly so they had recently been acquired by coreweave and Core Weave was looking to Continue growing up the stack. And so yeah, they approached me were like, hey, you know, like no pressure but like this is like an area that we think is really promising and we, you know, would you like to work here? And so that's how the conversation started. It was like long. It was pretty painful. There were, there were points as late as, you know, like the week before we actually signed where it was like unclear if it was actually going to happen. So that part was super painful. However, we've been there a month now. We just shipped a product yesterday which I'm super excited about. It's been fantastic working there so far. Like I was like very concerned. I was like, okay, yes, this is great. We make a lot of money by selling our company, but is the work environment gonna really, really suck? And I was like, well, I guess that's just a risk I'll have to take. It's been fantastic. It's honestly been way, way better than I could have imagined.
B
Did you go down to the office? The one down here?
C
I was there today. We work for, I'm based in Seattle and they have a small office up there that we work for.
B
Ways and Bass office in San Francisco is fantastic. If you have the chance, go visit. They do all hackathons and co working things.
C
Yeah, there's a hackathon going on in a month or so I'm sure every week.
B
But yeah, I mean so do you consider yourself working for weights and biases or core reef or both and open pipe too?
C
No, no, yeah, it's so we, so we I report to the weights and biases like yeah founders. So we're within that organization in the org chart. We're there. I don't know like branding wise, they're trying to say everything kind of that's not being sold to like big labs is kind of weights and biases. So like our stuff we're launching is weights and biases branded. Yeah, it's not. Yeah, not core weave branded as much. I don't know, they're still figuring it out.
B
And what's the product you launched?
C
We launched serverless reinforcement learning. Basically it lets you offload all of the GPU management. You don't have to worry about crashes and out of memories and scaling up and down. We handle all that for you and you just define your environment, you define your reward function and then every time you run a step you ship back to our backend. Hey, these are the trajectories, these are the rewards. Now update my model and we just make it work for you. It makes it way easier.
B
Yeah, okay.
C
Very thinky, like, it is very thinky. Like, I love the thinking machines launch. I think they have a really good idea. It's also very validating.
B
How did this take so long to appear? It seems.
C
I don't know. Yeah, we were going, but I felt this way about everything. There's so many things that should exist, clearly. I just think there's still not enough people, smart people working in this space. Honestly, we need. I realize that there's a lot of people. It just feels like there's still a lot of low hanging fruit. Nobody's doing okay.
B
One thing I saw from your post was your North Star as the RL team at Core Weave is to build an old world where every agent learns continually from his real world experience. So you're touching on the hot topic of the moment, continual learning. What else do we need to get there?
C
I super believe that. And that's basically the vision where I'm like, I keep talking about these percentages. 25 like if we get to the world where we build that, then I think it's just like the advantages are huge. They're clear. Everyone should just deploy their agents that way. We want to be the team that builds the software that makes that easy to do. So I talk to a lot of engineers at our customers and they're trying to deploy agents and it's so easy to get the initial prototype and something that kind of works well, it is so hard to get from that to something that you are confident is reliable enough to actually deploy in production. And when you actually look at what those failure modes look like, it's like, oh yeah, we know if it gets in this situation or if it gets these kind of inputs, it behaves funnily. But then it's like, yeah, you can update your prompt to address that, but that's not scalable because at a certain point it's going to start breaking other things. You don't know what it's breaking. You really want some way to just say, okay, look, this thing you did there, that was the wrong thing. Just adjust this behavior when you get into this and then otherwise carry on. Right. And that's what we can do with rl and that's what we can do with continual learning. We don't have to have this concept of oh up front. I'm trying to make the perfect model that solves everything. It's like I'm trying to make a model that's good enough, I can deploy it in production. And then when these Errors come in, I'm going to say, oh, exactly. I mean, very analogous to how you train a human employee. Be like, oh, no, actually, that's not what you should do in that situation. All right, fix that and carry on. And that's just going to make this whole process so much easier. And I think that. I think that there is today, like, 10 times as much AI inference that could exist than is existing right now, just purely with projects that are sitting in the proof of concept stage and have not been deployed, because there's a huge bucket of those. And it's all about this kind of reliability issue where it's like, okay, it works in controlled circumstances. There's areas where it doesn't work. And so if we can solve this problem, there's that 90% of the inference market, addressable market today that's just going to come online because we've solved that problem. So. So that's what we want to do. I'm super excited about it, and I think we have very concrete ideas on the specific pieces we need to make that work, and we just have to execute against them.
A
Do you feel like the online RL is more susceptible to the reward hacking, especially as you're shortening this loop and you don't spend as much time looking at the different checkpoints?
C
I'm not that worried about it. And the reason why is because reward hacking is quite easy to detect once it starts happening, because once the model's found some hack, it just starts doing it all the time. It's like, oh, yes, this worked great. I'm just going to keep doing it. And so you notice very quickly, whoa, it's doing this thing. And assuming you're using, at least in part, an LLMs judge to determine which ones are good and bad, it's so easy to just throw in an extra term and be like, hey, that weird thing that you keep doing, if it does that, that's bad. Give it a low reward. We've done this with a bunch of customers, and reward hacking does happen, but you just see it and you adjust your reward prompt and it just goes away.
B
What's a thing from YC that guided you through your entrepreneurship journey? And what's one thing that maybe you find that you disagree with YC on?
C
Oh, that's a good question. One thing that I really identify with, and I've tried to do a good job, is kind of like, I think they say, hold your problem tight and your solution loosely.
A
Right?
C
Where it's like, that's what you did. Yeah. Spend A lot of time thinking about what is the problem people are trying to solve and then it's like don't be too bought into the way you're solving it today. I think that's super important, everyone. It's very easy to get that balance wrong if you're not thinking about it very consciously. Something I disagree with? That's a good question. I think there's lots of things I disagree with, but I don't have it cached in that direction in my brain. I don't know. I definitely have disagree with lots of specific pieces of advice, but I don't have a great answer right now.
B
I'll bridge it for you in case something comes up. Sam Altman's like everything I said as president of YC was wrong for OpenAI, right? Do B2B ended up doing B2C. You should ship products often ended up being in stealth for three years.
C
Yeah, actually I think that second one does resonate with me a lot. We have tried to ship really quickly and just kind of follow the gradient of the market. I think if I do another startup and I don't know, maybe this is just me being beat up by the market too much. If I do another startup, I think at least some points I probably would have done better to be heads down and execute on my vision for longer and go for the more ambitious thing. But that would take longer to prove value, which is definitely not the YC way. But I think if you have, I don't know, a good vision and good taste, then that can work. Work quite well.
B
Yeah, we'll see what that is whenever that comes out. But thanks for your time. This is a great overview of everything.
A
Thank you guys.
C
This has been a super fun conversation. Thanks to both of you.
B
Awesome.
This episode brings on Kyle Corbitt, co-founder and CEO of OpenPipe, recently acquired by CoreWeave. Through a vibrant discussion, Alessio and Swix dig into Kyle’s journey from YC’s Startup School to OpenPipe’s founding, rapid scaling, pivot journeys, and eventual acquisition. The focus centers on the evolution of model fine-tuning, why Reinforcement Learning (RL) has become central in AI infrastructure, technical trade-offs in fine-tuning strategies, the realities of running an RL-first business, and what the future holds for continual learning, agent environments, and the economics driving the foundational model ecosystem.
(00:17 – 04:01)
Startup School Background:
OpenPipe’s Genesis:
(04:01 – 08:38)
Strong value prop: distilling GPT-4’s expensive capabilities into smaller, affordable open models
“Anyone who did have production workflows, it was extremely painful. Like they were paying hundreds of thousands of dollars a month to OpenAI.” (04:28)
Quick traction: first three customers within a month of launch; $1M ARR within eight months
Market Headwinds:
Product Experience:
(07:46 – 10:51)
Open Source Model Evolutions:
LoRA (Low-Rank Adapters) – Rise, Fall, Resurrection:
(11:29 – 13:00)
Cost-Benefit Mental Model:
- Upfront effort: “a couple weeks of a competent engineer’s time,” up to months for RL
- Ongoing cost: less flexible stack, slower iterations
- Direct financial cost rarely a main factor: “Each of these runs is between five and a couple hundred dollars.” (14:00)
(14:42 – 18:47)
Trigger Event: Emergence of Zero1 (“01”) models; realization via leaks (Strawberry, etc.) that RL could significantly improve LLMs
Strategic Bet:
First proof of concept: RL-trained “email agent”; informed bet, not “obvious”
(18:47 – 20:35)
(20:35 – 25:46)
Key Methods:
Pros and Cons:
(25:46 – 32:21)
Why are RL environments hard to create?
Market for RL environments:
(29:57 – 33:55)
RL requires ongoing, tight data-in-the-loop from real rollouts; can’t just batch up a CSV and train
The most challenging parts are integrating the agent’s “tool calls” with environment responses that closely mimic production (30:57)
Discussion of simulation tools, regulated environments, and generalization in RL market segmentation
(35:02 – 41:11)
Prompt Optimization vs. Weight Updates:
Baseline design matters:
(41:49 – 54:18)
Push for Online & Continual Learning:
Market Size and Open vs. Closed Model Dynamics:
(50:20 – 54:18)
(60:11 – 62:16)
How it Happened:
Post-Acquisition:
On the Fine-Tuning Wave:
On RL’s Evolution & Importance:
On Productizing RL:
On YC’s Advice:
| Segment | Topic | Timestamp | |---------|-------|-----------| | 01:13 | Kyle’s YC Startup School work | 01:13–01:29 | | 03:03 | OpenPipe founding inspiration | 03:03–03:14 | | 04:28 | Early OpenPipe product-market fit | 04:28–05:22 | | 09:06 | LoRA fine-tuning, multiplexed inference | 09:06–09:36 | | 12:44 | Fine-tuning ROI heuristic | 12:44–13:00 | | 16:45 | RL focus: high-risk/high-reward bet | 16:45–18:47 | | 23:32 | RL environment sandboxing pain | 23:32–25:46 | | 36:31 | Prompt optimization (JEPA) vs. weights | 36:31–37:05 | | 52:09 | RULER launch – relative LLM-based rewards | 52:09–53:14 | | 62:16 | New serverless RL product | 62:16–62:43 | | 63:12 | Vision: continual RL for every agent | 63:12–65:30 |
Technological Pivots as Market Response:
OpenPipe's journey was a case study in responding to rapid drops in model pricing, the rise of open models, and shifting value props for AI infra companies.
RL as an Unlocked Superpower — If the Environment Problem is Solved:
RL’s effectiveness is now much less about reward design thanks to advances like RULER, but fully realizing its potential depends on practical, high-fidelity simulation environments.
Community Skepticism About “Hot” Research Fads:
Despite buzz, prompt optimization frameworks like JEPA fell flat for OpenPipe’s tasks, reinforcing the need for sober, hands-on benchmarks over hype.
RL’s Business & Product Potential Unlocked via Infra:
“Serverless RL” and similar abstractions are about making RL feasible for production teams, reducing friction and opening RL capabilities to a much broader developer audience.
Industry Economics Will Shape the Next Decade:
The fate of open models, who can fund a “$500B” Stargate-scale compute estate, and token pricing subsidies are as crucial as any ML breakthrough for who wins in enterprise AI.
Kyle’s arc with OpenPipe illustrates both the whiplash-fast nature of the modern AI product landscape and the kind of relentless, direct engagement with technical challenges—like reproducible RL environments and reliable reward pipelines—that unlock real differentiation. The conversation reveals both optimism (“I think that there is today, like, 10 times as much AI inference that could exist than is existing right now… if we can solve [agent reliability].”) and caution (on overhyping prompt optimization, or assuming the environment problem is a quick fix).
Above all, the episode is a window into how today’s AI engineers aren’t just plugging papers into products, but are running real-time experiments in a turbulent, capital-drenched, and opportunity-soaked phase of the industry—a phase where who builds what kind of infra, in what market, and how quickly, is still very much up for grabs.