
Harjot Gill
One of the most immediate and high-impact applications of LLMs has been in software development. The models can significantly accelerate code writing, but with that increased velocity comes a greater need for thoughtful, scalable approaches to code review. Integrating AI into the development workflow requires rethinking how to ensure quality, security, and maintainability at scale. CodeRabbit is a startup that brings generative AI into the code review process. It evaluates code quality and security directly within tools like GitHub and VS Code, acting as an AI reviewer that complements existing CI/CD pipelines. Harjot Gill is the founder and CEO of CodeRabbit. He joins the podcast with Kevin Ball to discuss CodeRabbit's architecture, its multi-model LLM strategy, how it tracks the reasoning trail of agents, managing context windows, lessons from bootstrapping the company, and much more. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn, or visit his website, Kball LLC.
Kevin Ball
Harjot, welcome to the show.
Harjot Gill
Thanks Kevin.
Kevin Ball
Yeah, I'm excited to dig in with you. I'm really excited about what you guys are doing, but let's maybe start with that. So can you give our audience a little bit of background on you and on CodeRabbit?
Harjot Gill
Yeah, that's great. So I'm Harjot, and I'm the co-founder and CEO of CodeRabbit, a startup using generative AI for code reviews, essentially covering code quality and code security for users on popular Git platforms like GitHub and GitLab. The company is roughly a couple of years old but has grown tremendously, non-linearly, over that time. We have 100,000 developers using the platform on a daily basis, and it's a pretty popular product, loved by developers across all industry segments.
Kevin Ball
Awesome. So let's first look at this from a user standpoint: what does this look like? And then I will be excited to dive under the covers and dig into how CodeRabbit works. But for me as a developer, if I want to use CodeRabbit, what do I do and what does it look like?
Harjot Gill
Right. So CodeRabbit is a tool that is a nice complement to a lot of the code generation tools out there on the market. As you know, a lot of developers are now familiar with Cursor, GitHub Copilot, Windsurf, and so on, and they're now using AI to generate all of the code. And we know that AI-generated code has a lot of deficiencies in terms of maintainability, and sometimes there are just sloppy errors that AI makes. So now you've got to bring in AI to review AI, because review is becoming a bottleneck.
Kevin Ball
Right.
Harjot Gill
So to consume CodeRabbit, there are a couple of ways. The product primarily works inside your pull request model. Once you are done with your feature branch, you open a pull request before it gets merged into the main line and shipped out to the end customers. That's typically where all the code reviews happen, like the human reviews. A lot of the static analysis tools that you're running, like linters and unit tests, run there too; essentially your CI/CD pipeline runs over there. So CodeRabbit sits alongside those tools and uses AI to perform code reviews. And very recently, around a couple of weeks back, we also released a VS Code extension that also works with forks of VS Code like Cursor and Windsurf, so that developers can review the code before they even push it to the remote Git branch.
Kevin Ball
Okay, cool. So then let's look at what that looks like on the implementation side, because I think one of the things that I've certainly run into with GenAI is naive application of the models. These models are very powerful, and they can do a lot of cool stuff, but as you highlight, they get a lot of things wrong. So figuring out how you feed them the right context and put all those things in place is very important. So can you maybe walk us through, I guess first, what is the architecture for CodeRabbit behind the scenes?
Harjot Gill
I will start by contrasting how different code generation is from code review, and then we'll go deeper into how CodeRabbit makes it all work. If you look at code generation, it all started with tab-completion-style use cases: autocomplete. Typically you will see usage of small, low-latency models. As you type, suggestions show up in ghost text that you can press Tab to complete. And the most sophisticated approaches will use some sort of vector database to index your code so that you get more relevant suggestions based on the data structures or coding patterns you're using.
Kevin Ball
Right.
Harjot Gill
On the other hand, code review is a problem that requires very, very deep reasoning. The workflow that CodeRabbit sits in is latency-insensitive, because you're running it in the CI/CD pipeline and that workflow can typically take several minutes to complete. So a tool like CodeRabbit has to be a lot more thorough in its analysis in order to actually work. CodeRabbit, believe it or not, is one of the biggest consumers of reasoning models in the world right now: one of the biggest users of o3, o4-mini, and Sonnet.
Kevin Ball
Right.
Harjot Gill
And that's part of the magic that makes it work. Then of course, it's the entire workflow around it on how we bring in the relevant context.
Kevin Ball
Right.
Harjot Gill
And the context comes from several places. The workflow triggers as soon as you open a pull request, so the context naturally comes from the payload of that pull request: what the diff looks like.
Kevin Ball
Right.
Harjot Gill
Then you're also bringing in context from the rest of the codebase, the code graph, so we understand the impact the change would have on the code that depends on it: other functions which are not even changed, but which depend on the code you're changing. Right. So building the code graph is also pretty critical in terms of context. The other context comes from the Jira or Linear issues that you are trying to solve through that pull request. Usually there's some product knowledge, or some knowledge about the bug you're trying to fix, coming from the issue systems.
Kevin Ball
Right.
Harjot Gill
A lot of context also comes from past learnings, because CodeRabbit is a very collaborative product. It's a product that people consume at a team level, and the way you train CodeRabbit is by chatting with it. The more you talk to CodeRabbit, the better it gets over time. So the learnings it has picked up from user interactions in previous reviews also get pulled in. These are some of the examples. There are, I don't know, 10 to 15 different data points that we pull into the context.
Kevin Ball
Right.
Harjot Gill
But it's not sufficient. That's the thing. As you know, these models have very limited context windows. And even though we are seeing these context windows expand to a million tokens or so, it's still not efficient, because you lose quality of inference as you try to stuff in more context. It's great for summarization, but when you're talking about deep reasoning, you can't really use all that context.
Kevin Ball
Right.
Harjot Gill
So what we try to do is give CodeRabbit's agent enough hints that it can get a basic bearing on what's happening in the pull request: directionally, what the trajectory of these changes is, where they're going, and so on. And then what we are doing, which is a cool thing and a real differentiator right now, is creating sandbox environments in the cloud. We actually create sandboxes, clone the repository, and then let the AI run an agentic loop to navigate that codebase. We let the AI run CLI commands, like shell scripts. It can run keyword searches, and it can go and read additional files and bring additional data points into the context. It can even run ast-grep queries, abstract syntax tree queries, to read entire functions into its context and then continue with its analysis in order to validate a bug. One of the stages of the reasoning process is: okay, it looks like there might be an issue if you change this, but can I go and validate whether it's really an issue? So it's a combination of preloading some context and then giving the agent enough agency to go and find missing information. It even runs web queries. Sometimes you have knowledge cutoff issues, right? These models were trained in the past. Sometimes you have a 2023 cutoff or a 2022 cutoff, which is bad for coding use cases because a lot of these libraries and frameworks are constantly evolving. So in a lot of these cases we bring in context from Internet searches. Sometimes we'll say, okay, this is a new syntax we are looking at. Is this syntax really out there, or is it incorrect? So you will sometimes see CodeRabbit do a web query to confirm against the latest documentation.
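To make the "read entire functions through a syntax-tree query" step concrete, here is a minimal Python sketch. It is not CodeRabbit's implementation and it does not use ast-grep; it swaps in Python's standard `ast` module as a stand-in, and the names `read_function` and `SAMPLE` are hypothetical.

```python
import ast
from typing import Optional


def read_function(source: str, name: str) -> Optional[str]:
    """Return the full source text of the function called `name`, or None."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == name:
            # get_source_segment slices the original text using the node's position.
            return ast.get_source_segment(source, node)
    return None


# Hypothetical file contents an agent might want to inspect.
SAMPLE = '''
def total(prices):
    return sum(prices)

def apply_discount(price, rate):
    return price * (1 - rate)
'''

snippet = read_function(SAMPLE, "apply_discount")
print(snippet)
```

The point of querying the tree rather than grepping lines is that the agent gets the whole function body as one unit, however many lines it spans.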
Kevin Ball
So that is fascinating and I'd love to dive into some of those pieces. So first off, you said you kind of start with the diff and building the code graph from there. Is that something that you are moderating through an LLM or you have a sort of static analysis that you're doing? Or how do you build that code graph for what's likely to be impacted?
Harjot Gill
It's a combination of both, actually. That's the nice thing. There's a lot of abstract syntax tree analysis and understanding of the relationships. You're familiar with the Language Server Protocol, LSP. What we are doing is kind of similar, but it's our own proprietary implementation: not exactly like LSP, but somewhere in the middle in terms of memory footprint and everything. We need to build that code graph, and it's all done on demand. It's not pre-indexed like Sourcegraph or something. We just create it live as we're doing the analysis. The other part is that the large language models are then able to further judge the relevance of that code graph. A lot of things can be references and dependencies, but which ones are really relevant for understanding that diff and the code review? So there's also a lot of cleanup of the context happening before we trigger some of the more expensive reasoning models.
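As a rough illustration of building a code graph on demand and asking "what depends on the code I changed", the sketch below builds a reverse call graph from source text with Python's `ast` module. The real system is described as proprietary, so everything here (`build_callers`, `impacted`, the sample code) is an assumption for illustration only, and it only follows direct calls by plain name.

```python
import ast
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_callers(source: str) -> Dict[str, Set[str]]:
    """Map each called name to the set of functions that call it."""
    callers: Dict[str, Set[str]] = defaultdict(set)
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callers[node.func.id].add(func.name)
    return callers


def impacted(callers: Dict[str, Set[str]], changed: Iterable[str]) -> Set[str]:
    """Collect every function that depends, directly or not, on a changed one."""
    seen: Set[str] = set()
    stack = list(changed)
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return seen


CODE = '''
def parse(x): return int(x)
def load(x): return parse(x) + 1
def main(): return load("2")
def unrelated(): return 0
'''

affected = impacted(build_callers(CODE), {"parse"})
print(sorted(affected))
```

Here `load` and `main` are untouched by a change to `parse` but still land in the review context, which is the situation described above.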
Kevin Ball
That's interesting. So could you walk me through maybe what is the pipeline of steps that you go through? So it sounds like there's some amount of static analysis, there's some amount of cleanup with cheaper models. There's some amount of then these expensive reasoning models. Maybe not in full detail, no secrets here, but kind of what are the different types of steps involved and how do you think about sequencing them?
Harjot Gill
Yeah, I mean, we have written about it as well. When we started the company, there were a couple of initial blog posts on how CodeRabbit works and what makes it both cheap and good at the same time, which is a hard thing to engineer in the world of AI. One of the things that we do really well is understanding the context. It's not like tools like Cursor, where you pick a model and then run with that model for your entire flow. CodeRabbit is an ensemble of models we draw on. We don't even expose which models we are using to the end customers. Sometimes people ask, which models are you using? Can we choose the models? We don't let them, because they'll most likely make a mistake in picking the right model for the use case. Our team does a lot of work behind the scenes to pick the right kind of model for the different parts of our pipeline and the workload we have. We use seven or eight models, depending on which one is a good fit for which part of the workflow. A lot of the context preparation is where we use cheaper, faster models like GPT-4.1 nano or GPT-4.1 mini. Those are kind of the big workhorses. They're dirt cheap, but we still spend a significant amount of money on them given how much volume we run through those models. They do all sorts of tasks, from summarizing large context like entire files to previous issues and so on. So there's a lot of summarization that goes in before we even get into the actual code review workflow. There are multiple steps. There's a whole setup process where we create a sandbox and run a lot of static analysis tools in it. So there's a lot of context being pulled in from your existing tooling. We go and identify what kind of tooling you have set up on your repository. Let's say you are using ESLint. We will detect that and use the existing configuration that your DevOps team might have set up. Sometimes people use golangci-lint. So we pick up all these tools and run them for you. That's one of the contexts we bring in. Then there's a lot of context we bring in from your CI/CD failures. That's another place where we use large language models, to understand your failure logs. If you have a build failure or a unit test failure, we understand exactly what happened there. And that context is also used during code review so that we can provide remediation, one-click fixes, for those steps.
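The "detect the tooling a repository already has" step can be sketched as a lookup of well-known config file names. This is only a guess at the general shape, not how CodeRabbit does it; `KNOWN_CONFIGS` and `detect_linters` are hypothetical names, and the file lists are deliberately not exhaustive.

```python
import tempfile
from pathlib import Path

# Config file names these two tools are commonly set up with (not exhaustive).
KNOWN_CONFIGS = {
    "eslint": [".eslintrc.json", ".eslintrc.js", "eslint.config.js"],
    "golangci-lint": [".golangci.yml", ".golangci.yaml"],
}


def detect_linters(repo_root):
    """Return {tool: config file name} for each tool with a config at the repo root."""
    root = Path(repo_root)
    found = {}
    for tool, names in KNOWN_CONFIGS.items():
        for name in names:
            if (root / name).is_file():
                found[tool] = name
                break  # The first matching config is enough for this sketch.
    return found


# Demo against a throwaway directory standing in for a cloned repository.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, ".eslintrc.json").write_text("{}")
    detected = detect_linters(repo)

print(detected)
```

Reusing the team's own configuration means the linter output that reaches the review reflects rules the team already agreed on.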
Kevin Ball
Right.
Harjot Gill
So yeah, as I said, there are seven or eight models for different use cases. For chat there's a different model; for some of the agentic verification flows that we run, there are different reasoning models, and so on.
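One simple way to picture "a different model for each part of the pipeline" is a routing table from pipeline stage to model tier. The stage names and tiers below are invented for illustration; the conversation only says that summarization-style work goes to cheaper, faster models and that review and verification go to reasoning models.

```python
from enum import Enum


class Tier(Enum):
    CHEAP_FAST = "cheap, fast model"
    REASONING = "reasoning model"
    CHAT = "chat model"


# Hypothetical stage names; which model serves each tier is left open.
STAGE_TIERS = {
    "summarize_file": Tier.CHEAP_FAST,
    "summarize_ci_logs": Tier.CHEAP_FAST,
    "review_diff": Tier.REASONING,
    "verify_finding": Tier.REASONING,
    "chat_reply": Tier.CHAT,
}


def tier_for(stage: str) -> Tier:
    """Look up the tier for a stage, failing loudly on an unconfigured stage."""
    try:
        return STAGE_TIERS[stage]
    except KeyError:
        raise ValueError(f"no model tier configured for stage {stage!r}") from None


print(tier_for("summarize_file").value)
print(tier_for("verify_finding").value)
```

Keeping the mapping in one place is what lets a team swap the model behind a tier without touching the stages that use it.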
Kevin Ball
Cool. And you mentioned a lot of this is being done on demand, but you also said, hey, you can train CodeRabbit. It will incorporate past learnings based on conversations or things that you've done in there. So it sounds to me like there's some sort of summarization or indexing that you're doing of previous PRs that gets fed in at some layer. What does that piece look like?
Harjot Gill
That's right. I mean, this is where we have a very different indexing system for the entire codebase, similar in some ways but different in many ways. We do look at the entire codebase, based on what got merged over the last 1,000 or 2,000 commits. That's how the system works. And over there we are indexing not just the code snippets. We understand that, okay, these are the relevant code snippets. That's how everyone's been doing it: they use abstract syntax trees, tree-sitter grammar rules, to extract the relevant snippets and index them. But one of the unique things we do on top of that is convert those snippets into docstrings, natural language documentation. Because a lot of the user queries are in natural language, right? When you're doing code completion, your similarity search happens on the code snippets themselves, so you have a good match in the vector DB. But when you're going into more agentic use cases like CodeRabbit, the input is a natural language query. So you get a better match when you convert code into a natural language representation or summary of it. So we do a lot of that at scale.
Kevin Ball
That makes a lot of sense. And then I presume you expose it to your agent: here's a query framework, you do a natural language query and load up whatever might be relevant.
Harjot Gill
That's right. We're bringing in knowledge from the code graph, we're bringing in knowledge from the code base index we have created. Like it's a very different kind of an indexer than what people have been doing in the space. And a lot of that context is also shown to the user so that people also can trust the AI because the AI is known to hallucinate.
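The idea of indexing a natural-language summary next to each snippet, and matching queries against the summary rather than the code, can be sketched as follows. A real system would presumably use embeddings and a vector database; this sketch swaps in plain word overlap (Jaccard similarity) so that it runs with no dependencies. The index entries and function names are made up.

```python
import re


def tokens(text):
    """Lowercase word set used as a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Each entry pairs a code snippet with a natural-language summary of it.
INDEX = [
    {"snippet": "def retry(fn, n): ...",
     "summary": "Retry a failing network call several times with backoff"},
    {"snippet": "def parse_cfg(path): ...",
     "summary": "Load and validate the YAML configuration file"},
]


def search(query, index):
    """Return the entry whose summary shares the most words with the query."""
    q = tokens(query)

    def score(entry):
        s = tokens(entry["summary"])
        return len(q & s) / len(q | s)

    return max(index, key=score)


best = search("how do we retry network requests that fail", INDEX)
print(best["snippet"])
```

The query has almost no words in common with `def retry(fn, n)` itself, but it matches the summary, which is the gap the docstring step is meant to close.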
Kevin Ball
Right.
Harjot Gill
So one of the ways you build trust is to also show the context and how that insight was bubbled up: what led to that review comment or conclusion.
Kevin Ball
Right.
Harjot Gill
So all that helps in making a great user experience.
Kevin Ball
Let's maybe talk about that exposing piece because I think that is key for any of these LLM driven applications giving you the paper trail of like, how did this get here? Why is this here? So I can as a human validate it and detect those hallucinations and things. So when you have this long pipeline of context that you're loading in, you mentioned a bunch of different steps from a bunch of different sources. How do you keep track through the process to be able to bubble up the right sets of relevant context?
Harjot Gill
Right. So it's all in the UX. When we post these review comments, sometimes we will show what kind of additional context was used to bring up that insight. Sometimes it's just pure LLM logic: there's no additional context, it's just an issue that was detected at a surface level. But sometimes it's deep inspection of the codebase, where the agent will go and read additional files in the repository. So you can see an analysis chain in a CodeRabbit comment. If you open that chain, you will see the whole thought process and the paper trail, as you said: what kind of commands were executed to come up with a certain insight. It will pinpoint the files and locations, even files that were not changed in the pull request, because it also brings up insights from the rest of your codebase. You can go back and follow the paper trail, and if it ever went off track, you know exactly why it went off track. Then you can chat with CodeRabbit and explain to it why its analysis is correct or incorrect. And if it is incorrect, it will remember that for next time.
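A paper trail like the one described can be modeled as a finding that carries the ordered steps which led to it. The data model below is a guess at a minimal shape, not CodeRabbit's schema; the class names, the file path, and the command shown are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    kind: str          # "reasoning" or "command"
    content: str       # the thought, or the command that was run
    output: str = ""   # what the command printed, if anything


@dataclass
class Finding:
    file: str
    line: int
    message: str
    trail: List[Step] = field(default_factory=list)

    def render_trail(self) -> str:
        """Format the steps as a numbered list a reviewer can expand and read."""
        lines = []
        for number, step in enumerate(self.trail, start=1):
            lines.append(f"{number}. [{step.kind}] {step.content}")
            if step.output:
                lines.append(f"   -> {step.output}")
        return "\n".join(lines)


finding = Finding(
    file="billing/invoice.py",
    line=42,
    message="Caller still passes the removed `currency` argument.",
    trail=[
        Step("reasoning", "The signature changed; check who else calls it."),
        Step("command", "grep -rn 'make_invoice(' .", "reports/monthly.py:17"),
    ],
)

trail_text = finding.render_trail()
print(trail_text)
```

Because each command and its output is stored with the finding, a reader who disagrees can see exactly which step sent the analysis in the wrong direction.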
Kevin Ball
That's super cool. So essentially, just to make sure I'm understanding: your agent is outputting logs of what it's doing, which include both LLM reasoning and tool calls, often in different places, and then the results of those tool calls, and you keep track of that and just bubble it up straight to the UI for someone to be able to explore.
Harjot Gill
That's right. We think it helps a lot. And actually, we were one of the first companies to pioneer this whole sandbox-and-CLI approach. Now we see this becoming commonplace; Codex came out and all. But CodeRabbit has been doing this for the last two years, since the days of GPT-4. We were the first ones to find out that codebase navigation is a great way of finding issues, versus doing pure RAG. For a while everyone was prioritizing codebase indexing. I know codebase indexing helps, but a lot of what makes CodeRabbit unique comes from this codebase navigation that happens ad hoc using shell scripts.
Kevin Ball
Yeah, that's super interesting and I think is something that we've started to see in a lot of more recent agents of, hey, let's just expose programming tools essentially to agents and let them figure out the right way to apply that.
Harjot Gill
So even today CodeRabbit doesn't use tool calls. Some people think we use tool calls. We actually don't. The entire system is based on CLI commands. We generate code instead of doing tool calls. We have a sandbox and a CLI. That's all you need. That's the only tool you need, actually. You don't even need MCPs. Even to open GitHub issues, we use a CLI command, the GitHub CLI. We don't use MCPs because we don't have to. All the tools are available over the CLI.
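The "a sandbox and a CLI is the only tool" idea reduces to one primitive: take a shell command the model wrote, run it in the cloned repository, and hand back what it printed. The sketch below shows that primitive with Python's `subprocess`; it is an illustration, not CodeRabbit's code, and `run_in_sandbox` is a hypothetical name. It assumes a POSIX shell with `grep` available. Note that setting a working directory is not isolation: a real sandbox would add the container and cgroup boundaries discussed later in the conversation.

```python
import subprocess
import tempfile
from pathlib import Path


def run_in_sandbox(command, workdir, timeout=10.0):
    """Run one shell command inside workdir and capture what it printed."""
    try:
        proc = subprocess.run(
            command, shell=True, cwd=workdir,
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung command becomes a result the caller can inspect, not a crash.
        return {"command": command, "exit_code": None,
                "stdout": "", "stderr": "timed out"}
    return {"command": command, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}


# Demo: a throwaway directory stands in for the cloned repository.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "notes.txt").write_text("TODO: fix parser\ndone\n")
    result = run_in_sandbox("grep -n TODO notes.txt", repo)

print(result["exit_code"], result["stdout"].strip())
```

With this one function, every new capability is just another command-line program in the sandbox image rather than a new tool schema to describe to the model.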
Kevin Ball
That is fascinating. Let's dive into that a little bit more. So in terms of. I think one of the concerns about giving the LLM full access to any sort of code is how do you sandbox it properly. How do you decide what is in and what is out? So how do you think about that sandbox, especially if you're giving it like web access and access to GitHub systems and things like that?
Harjot Gill
Yeah, I mean, there are standard techniques for sandboxing. People have been doing it in the past for many use cases: dev environments, preview environments. So CodeRabbit is in a lot of ways standing on the shoulders of giants. There's some proprietary stuff we have done to make it fast and cheap.
Kevin Ball
Right.
Harjot Gill
I mean, we are running these sandboxes at scale while also being very cost-effective in doing so. But I wouldn't say there's any big secret sauce in how containerization, cgroups, and all those things work. Those are standard systems techniques. The main thing is how you further block off access. In our case we don't block off Internet access, because that's something we feel the agent should have. Sometimes it will run curl commands. Sometimes it will want to use the GitHub CLI to read other PRs, and so on. So we don't restrict Internet access, but at the same time we want to make sure our cloud services are protected, so that it doesn't have access to our internal systems and so on.
Kevin Ball
That makes sense. Do you list out for it, for example, what sets of CLI tools or what permissions or access like GitHub CLI? Presumably you have to give it a token to be able to access the appropriate place and things like that.
Harjot Gill
Yeah, yeah. The GitHub CLI has a nice way to authenticate. We just provide the token once and the CLI works like that. The token sits in a secure vault inside the GitHub CLI. The main thing is we don't actually have to give the AI a lot of information on these tools, because it's already in the training data. sed, cat, ripgrep: those tools are well known and well understood by AI. So there's not a lot of hand-holding in making it understand the schema of these tools, because it's just shell scripts, and it's trained on that. We do explain the scenarios in which certain tools might be handy, so we try to influence the behavior in some ways around when it should run certain commands. For example, if you're doing a package.json update, let's say, go and read the vulnerability database GitHub has, to see whether these packages are out of date. And it does that. It's pretty effective, actually. Each time there's a package.json change, you will see the agent making a command call to the open vulnerability database GitHub has, to detect whether these packages (or Python packages, or Ruby packages) have any vulnerabilities. Right?
Kevin Ball
Yeah, I like that a lot. So in terms of scenarios, and I'm going to explore this because I think as you highlight, you are one of the most successful examples of these agents in the wild. But it is a technique a lot of people are trying to figure out and explore. So can you give us a like ballpark? Are we talking tens of scenarios? Are we talking hundreds? Like what does this look like?
Harjot Gill
Yeah, they are in the order of more than tens, for sure, from what I recall. This is all coming from tribal knowledge. If you've been an engineering lead or a good engineer yourself, you're taking what you know best and programming that as a prompt; taking your own knowledge, in many cases. And a lot of the time we are just learning from the sheer number of customers we have. One of the reasons CodeRabbit has improved a lot is that we have a lot of open source usage, and that is a great feedback loop. Every few seconds we review some pull request in open source, and a lot of people interact with CodeRabbit. We observe what they're doing in those pull requests, and some of that behavior goes back into training our agent.
Kevin Ball
Yeah, that makes a ton of sense. And we talked a little bit earlier about the challenges of prompt stuffing, when you've got these big context windows and too much is in there. Is the number of scenarios still small enough that it's all going into the base agent prompt, or do you do some sort of dynamic loading, figuring out which scenarios are likely relevant at any particular time?
Harjot Gill
Yeah, that's right. I mean, it's the latter. First of all, we're using multiple models, as I said. There's not one single base agent prompt. It's not like the agentic loop that everyone else has. It's a pipeline, in a way, and a lot of the work goes into preparing the context. Actually, a lot of the money is spent there. One of the things with reasoning models is that they get thrown off track very, very quickly. If you're doing RAG and just stuffing in the context without cleaning it up first or re-ranking it, these models tend to go completely off track and haywire. Unlike non-reasoning models, they overthink. And that is one of the reasons why some companies struggled when Sonnet 3.7 came out. Sonnet 3.5 was working really well for a lot of the coding companies, but when 3.7 came out, they had no clue what hit them. We were prepared. One of the good things is that we were built with reasoning models in mind from day one. In fact, even before reasoning models came out, we had a lot of internal reasoning in our process: a lot of stages which were just doing internal monologues and reasoning. So we benefit each time a new reasoning model comes out, and there are no big changes to our system. But some companies had to fundamentally rethink how they were doing their prompting with the reasoning models.
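Loading only the scenarios that are relevant to a given change, instead of putting every scenario in one base prompt, could look like the sketch below: each scenario hint is keyed by a file-name pattern and is included only when a changed file matches. Only the package.json scenario comes from the conversation; the other entry, the names, and the matching rule are assumptions.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# (file-name pattern, hint to add to the prompt). The second entry is invented.
SCENARIOS = [
    ("package.json",
     "Dependencies changed: check the packages against a vulnerability database."),
    ("*.sql",
     "Schema or query changed: look for callers that rely on the old columns."),
]


def hints_for(changed_files):
    """Return only the scenario hints whose pattern matches a changed file."""
    hints = []
    for pattern, hint in SCENARIOS:
        if any(fnmatch(PurePosixPath(path).name, pattern) for path in changed_files):
            hints.append(hint)
    return hints


selected = hints_for(["web/package.json", "src/app.ts"])
print(selected)
```

A pull request that touches neither kind of file gets no extra hints at all, which keeps the prompt for the expensive reasoning step as small as the change allows.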
Kevin Ball
Yeah, that makes a lot of sense. So let's maybe break down a little bit the agentic loop because I think a lot of people building agents right now, it's essentially, yeah, one big system prompt and tool calls and a loop around it. So you said for yours it's more of a pipeline. You have this more dynamic set of things. So how do you think about the design of your agent?
Harjot Gill
Yeah, I mean, we work on large, complex codebases, so a single loop doesn't work for us. We have to have a main agent that figures out what kinds of things it has to do, and then there's delegation happening. There's a lot of complexity there as well in how we break up the work. There's a whole task tracking system where you have a main root task breaking up into subtasks. That's how we do it. We divide and conquer the problem with agents, essentially. The results bubble up, and the visibility bubbles up. That's how it works effectively on large codebases. A lot of that is proprietary. It's not like we're using a framework like LangChain or something; it's all in-house. And going back, it is a loop. But the trick with these systems is also making sure that the AI, the large language model, saw the right context. Sometimes you have shell scripts where you know the quality of the output won't be high enough for you to make a good judgment. So there's a lot of suppression happening. Even though the AI would say, okay, it looks like there's a bug, you know that it didn't see the relevant context. So this might not be a high-quality inference, and I'll just hide it rather than bubble up a lot of noise. So we do a lot of cleanup, even in this agentic loop. It's not a pass-through to the user. There's a lot more understanding in our system of what quality of context is going into the pipeline, so that we know the decision or inference we get at the end of the day is going to be high quality, or that we can even trust it.
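The root-task-and-subtasks structure, with results bubbling up, maps naturally onto a small tree. The sketch below is one possible shape under that reading, not the proprietary system; `Task`, `collect`, and the example task titles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    title: str
    findings: List[str] = field(default_factory=list)
    subtasks: List["Task"] = field(default_factory=list)

    def collect(self) -> List[str]:
        """Bubble up this task's findings followed by those of every subtask."""
        results = list(self.findings)
        for sub in self.subtasks:
            results.extend(sub.collect())
        return results


root = Task(
    title="Review pull request",
    subtasks=[
        Task("Check callers of the changed function",
             findings=["reports/monthly.py still uses the old signature"]),
        Task("Check dependency updates",
             subtasks=[Task("Look up advisories", findings=[])]),
    ],
)

all_findings = root.collect()
print(all_findings)
```

Each subtask can run with its own narrow context, and the root only ever sees the findings that come back, which is what keeps any single prompt small on a large codebase.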
Kevin Ball
Yeah, that makes a lot of sense.
Harjot Gill
For example, lack of output doesn't mean there's a bug. Sometimes you will run a find on a file and you won't find that file, which probably means you're looking in the wrong place rather than that the file doesn't exist. Those kinds of scenarios you have to account for. There are many such scenarios, by the way.
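The "lack of output doesn't mean there's a bug" rule can be written as a guard that suppresses a claim when none of the commands behind it actually returned anything. This is a deliberately crude heuristic made up for illustration; the conversation says the real filtering is mostly LLM-driven. All names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CommandResult:
    command: str
    stdout: str


def saw_evidence(results: List[CommandResult]) -> bool:
    """True if at least one command returned some non-blank output."""
    return any(result.stdout.strip() for result in results)


def triage(claim: str, results: List[CommandResult]) -> str:
    """Surface a claim only when the agent actually saw supporting output."""
    return "surface" if saw_evidence(results) else "suppress"


# An empty search most likely means "looked in the wrong place", not "missing".
empty_search = [CommandResult("find . -name config.yaml", "")]
real_hit = [CommandResult("grep -rn 'make_invoice(' .", "reports/monthly.py:17\n")]

decisions = (
    triage("config.yaml is missing", empty_search),
    triage("stale caller of make_invoice", real_hit),
)
print(decisions)
```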
Kevin Ball
Yeah. So let me make sure once again that I'm understanding. So essentially you have a top level and it breaks things down into a task graph. It says like essentially here's the set of things that I think we need to do to dig into this. And then delegates those tasks to sub agents in some form which go and do work and then as they complete it kind of bubbles up through the graph to the top level agent.
Harjot Gill
That's right. And this task graph is dynamic, as you can guess. I mean it's figured out by the AI. Yeah, yeah. So there's a system that figures out what the task should be.
Kevin Ball
Now thinking about those tasks, are they fully dynamic? Do you predefine classes of tasks, does that connect to how you decide what's going to be relevant context and how high quality is likely to be? Or is it completely driven by the LLM?
Harjot Gill
It's a hybrid system. We do know the nature of these tasks, because we let the AI choose what kind of task it is running, and we know what these tasks should look like when we run them. But the graph itself is dynamic to a large extent. There's a pipeline; it's a hybrid architecture. There are some pipeline stages which are always hard-coded in the system. Those steps have to happen. But then we give enough freedom to the agent to go and find stuff as well, and plan around it. What we found is that planning is a big part of the quality. The more you plan, and the more you give it agency to first go and navigate the code, the better, and that usually yields high-quality outcomes at the end of the day. Rather than just rushing into doing something or concluding something, you want to let the AI follow multiple chains of thought. Some of them could lead to a dead end, but that's fine. Maybe four out of five doors were closed, but one of the doors leads to some interesting insight.
Kevin Ball
Yeah, this is all connecting for me because as you build out those tasks, they have classifications. That's going to help with what we talked about in terms of picking what are the relevant scenarios to load into the context for that sub agent to decide what it might check or do. The filtering that you talked about, is that also done kind of agentically by the LLM where it's judging quality or do you have some sort of static analysis in there in some form as well?
Harjot Gill
Yeah, it's mostly LLM-driven, I would say. There is some static stuff, as I said. We know exactly when these models did not see the relevant context. It's sometimes very easy to figure that out from the quality of the commands it's running, and so on, and the outputs.
Kevin Ball
Right.
Harjot Gill
But in many cases the validation is done by another kind of judge LLM, which is running online and is also able to decide whether the result so far has been accurate or not.
Kevin Ball
That makes sense. That makes sense. And then in terms of what you mentioned, in terms of adapting to inputs, then as things come back, I assume different layers have the ability to say, oh, that was a dead end, go try this, let's replan, let's restructure this as you go. How do you limit the extents of that or decide when you're done?
Harjot Gill
It's an arbitrary number. I don't know, 10 levels deep. I mean when it's done, it will just say it's done. But sometimes we have to have a limit. It's like the stack depth problem, like the maximum stack depth you want to allow. And it's a cost thing. Right now I don't remember what the constant is. Maybe it was five or ten, something like that. We picked a number and said, okay, this is the deepest we want to go into the rabbit hole.
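A depth cap of this kind can be sketched as a recursive exploration that simply stops below a fixed level. This is a minimal illustration, not the actual system: `MAX_DEPTH`, `explore`, and the `expand` callback are hypothetical names, the value 5 is just one of the two numbers mentioned, and the `seen` set that blocks repeated tasks is an added assumption rather than something described in the interview.

```python
from typing import Callable, Iterable, List, Optional, Set

MAX_DEPTH = 5  # assumed value; the interview only says "five or ten"


def explore(task: str,
            expand: Callable[[str], Iterable[str]],
            depth: int = 0,
            seen: Optional[Set[str]] = None,
            trail: Optional[List[str]] = None) -> List[str]:
    """Depth-first exploration that stops past MAX_DEPTH and never repeats a task."""
    seen = set() if seen is None else seen
    trail = [] if trail is None else trail
    if depth > MAX_DEPTH or task in seen:
        return trail  # too deep, or already visited: treat as a dead end
    seen.add(task)
    trail.append(task)
    for follow_up in expand(task):
        explore(follow_up, expand, depth + 1, seen, trail)
    return trail
```

Here `expand` stands in for whatever proposes follow-up tasks; in a real agent that would be a model call, which is why bounding the recursion also bounds the cost.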
Kevin Ball
That makes sense.
Harjot Gill
These things tend to loop around. Like especially with the earlier models there was a lot of this looping behavior where it would go and check the same thing again and again.
Kevin Ball
Right. Well, and cost does bring up an interesting question. Right? Like I have a coworker who is way down in agent land and exploring all sorts of different agents and trying different things, but they tick up in cost pretty quickly if you just let them run. So you mentioned you've done a lot to try to control costs and keep this contained. How do you approach that?
Harjot Gill
Yeah, it's multiple things. Right. One is like the reason we use a lot of the cheaper models is the cost. Like yes, you could use an expensive model for everything, even summarization, but that doesn't make sense. It's like orders of magnitude more expensive. Right. For example, o3 is like five times more expensive than Sonnet, and Sonnet is like orders of magnitude more expensive than 4o mini or something.
Kevin Ball
Right.
Harjot Gill
So it's being smart about mapping the workload to the right model so that you get the best price to performance ratio for the workload that you have in mind. The other factors are like being smart about especially the incremental thing. One of the things that people love about Code Rabbit, it's an incremental reviewer. So it will remember the last time we left the review and next time when it resumes, it will first see whether I have to really re-review something or not, whether it's a trivial change. Can I skip it? So a lot of the prompts are actually just figuring out whether we even need to do a deeper analysis or just approve it.
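The two cost controls mentioned here, routing each task to a suitably priced model and skipping deep review for trivial increments, might look something like the sketch below. The tier names, the `MODEL_FOR_TASK` table, and the `is_trivial` rule are all hypothetical; in particular, the interview describes using prompts to make the skip decision, and a simple rule stands in for that only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tiers; the real model assignments are not disclosed.
MODEL_FOR_TASK = {
    "summarize": "cheap-model",
    "classify": "cheap-model",
    "review": "mid-model",
    "deep_reasoning": "expensive-model",
}


def pick_model(task_kind: str) -> str:
    """Route a task to its assumed tier; unknown tasks default to the mid tier."""
    return MODEL_FOR_TASK.get(task_kind, "mid-model")


@dataclass
class Increment:
    """Lines changed since the last reviewed commit (hypothetical shape)."""
    changed_lines: List[str]


def is_trivial(inc: Increment) -> bool:
    """Stand-in heuristic: only blank lines or '#' comments changed."""
    return all(not ln.strip() or ln.strip().startswith("#")
               for ln in inc.changed_lines)


def plan_review(inc: Increment) -> str:
    """Short circuit: skip deep analysis when the increment is trivial."""
    return "skip" if is_trivial(inc) else pick_model("review")
```

The skip decision runs before any model is chosen, so a trivial increment costs nothing beyond the triage itself.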
Kevin Ball
There's like a short circuit basically.
Harjot Gill
Yeah, there is a short circuit and so far no one has noticed or complained, because sometimes we do skip and the quality on that has been really high. At least the decisions we have been making have been very high quality on that. Right. And the other part has been rate limits. You would sometimes see on Twitter, like people complain Code Rabbit has rate limits, but that's one of the ways we kind of control the abuse so that it's kind of fair. A lot of the AI companies are now going into consumption pricing. Like Cursor, for example, has a max mode which is, I was reading the documentation, a 20% markup over the API cost. So you're passing on the Sonnet cost, the Gemini cost to the end user. Code Rabbit, on the other hand, has per seat pricing. It's all you can eat. But the way we sustain as a business at scale is through a lot of these techniques on the LLM side and rate limits. Right. I mean for our open source plan we have a lot more strict rate limits versus relaxed rate limits for our paid users and different plans.
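Per-plan rate limiting of this sort is commonly built on a token bucket. The sketch below is one such illustration, not the actual mechanism, which the interview does not describe. The limits in `PLAN_LIMITS` are invented numbers, and `TokenBucket` and `bucket_for` are hypothetical names.

```python
import time
from typing import Callable

# Hypothetical limits (reviews per hour); the real numbers are not given.
PLAN_LIMITS = {"open_source": 2, "paid": 10}


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def bucket_for(plan: str,
               clock: Callable[[], float] = time.monotonic) -> TokenBucket:
    """Build a bucket whose full capacity refills once per hour."""
    limit = PLAN_LIMITS[plan]
    return TokenBucket(limit, limit / 3600.0, clock)
```

The clock is injectable so the limiter can be exercised in tests without waiting for real time to pass.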
Kevin Ball
Yeah, that makes sense. What would you say some of the most kind of challenging technical areas of building out Code Rabbit have been and how have you addressed them?
Harjot Gill
It's been fun. It's been a different kind of a project. It's my third startup now, and a very different flavor than the previous two that I did. I mean the earlier ones were in observability and infrastructure, the cloud infra space, reliability management. This has been a very different kind of product where we had to unlearn a lot of the way you build software. Like it's not deterministic. There are a lot of deficiencies in the large language models themselves, but they're amazing in so many ways. The trick has been how do you hide those deficiencies from the end user. They tend to be noisy, they tend to create a lot of slop otherwise.
Kevin Ball
Right.
Harjot Gill
And build a product that people love. So it's a combination of reliable execution of these agents and also a great UX that becomes part of your daily workflow. For example, Code Rabbit sits inside the pull request model and is one of the very few companies which have been able to successfully bring a product into an existing workflow. A lot of people hate AI if you ask me. Like people are trying to bring AI to every workflow you might have and people hate that. But Code Rabbit has been one of the very few exceptions where it's actually being loved and being pulled in very rapidly by the developers themselves.
Kevin Ball
You highlight a couple really important things and I want to go deeper on there. So one is that these models come with fundamental trade offs. They have strengths and they have deficiencies and if you want to use them effectively in a product, you need to build around those. You can't just treat them like software. And then as you also mentioned, many companies are failing to see that and just kind of like trying to bolt them onto things without even thinking about like, is this a useful use case for this? What are the strengths? What are the trade offs? How do I do that? So I'm curious, through building Code Rabbit, if you've kind of developed any, I guess almost like principles for how you think about what is going to be a good use case for LLMs and not. Or how you build a product around a large language model.
Harjot Gill
That's a great question actually. Like one of the things that people love about Code Rabbit has been how surprisingly reliable or accurate it is, given the bad experience or bad taste in the mouth that every other product leaves. And that is the bar we try to keep up with the new features. Which also means tracking where these models are technically, in terms of both price and performance. Right. So there are a lot of use cases we want to do, but we deliberately don't go and build them because we know that the capabilities are not there yet. We don't want to lower the bar on Code Rabbit. For example, a lot of companies are now doing issue to PR, but if you give an open ended prompt, 80% of the time you're still going to end up with the wrong implementation.
Kevin Ball
Right.
Harjot Gill
So these are still, I would say, experimental systems, not ready for large scale mainstream use cases. Code Rabbit is mainstream. We are being used in even traditional companies, not just Silicon Valley startups. Even traditional companies on PHP, Java applications, even older applications are using us very successfully.
Kevin Ball
Right.
Harjot Gill
So those are some of the principles. Like yes, we could do a lot with AI, especially with tool calling. It doesn't require a lot of code. I mean if you look at agentic systems, they are very simple systems. Like they're just a bunch of tools cobbled together and it's usually like Sonnet doing all the magic for you. But those are not products yet. Right. You need a person who is really expert in prompting, able to drive the outcomes. Yes, Twitter is a different bubble. Like when people say they're successful with AI, they're good prompt engineers. They know exactly where these models will fail and they don't even try those use cases. But the rest of the world is not ready for a lot of this prompting and these models.
Kevin Ball
Right.
Harjot Gill
So these are the guiding principles. Like some of them. UX is another one. Like we do try to make sure that we really understand the user's existing workflow so that we can seamlessly bring AI into their daily life versus something they have to remember to use. I mean one of the big differences between the Code Rabbit experience and other tools is it's not a chat product. Every other product requires prompting and chat. It is not. Like we are one of the very few products that has zero activation energy. There is no activation on the user. You open up a PR, it gives you insights.
Kevin Ball
I think there's something really powerful there because ChatGPT was so successful that it has kind of made everyone have this mental model of LLM equals chat. And to your point, you are not a chat product. That is not at all what you are doing. What is your mental model for what makes a good LLM problem? If LLM does not equal chat, what is it providing for you? Like if someone else was trying to go through the learning process that you have, of how am I going to apply this in a useful way to create real value, what's the picture that you have of the capabilities this LLM provides?
Harjot Gill
No, that's a great question. Like you have to first understand where the data is coming from, what the training data looks like. We know that these models are trained on software. They've been very successful because that data has been very easy to obtain. Things like shell scripts, you know intuitively that hey, you have thousands of repositories. These LLMs are trained on what good shell scripts look like.
Kevin Ball
Right.
Harjot Gill
So you have to play to the strengths. Like if you suddenly come up with a use case where you know that there's been very scarce training data, even the reasoning models cannot solve every problem. Like they are good at things that they've seen in the past or that they've been reinforced on. That has been true even for RL.
Kevin Ball
Right.
Harjot Gill
One of the things we have seen is that these large language models don't really make someone who's already 10x become 100x effective. They really make an average, let's say 1x person become 10x because it's bringing in a lot of the training data, which is trained on best practices, good use cases, to a more average developer. And also that's what makes it effective in automating repetitive work, the toil. Right. Some of these code review comments are actually toil. Like most of the time it's the same thing. Best practices around security, best practices around some null pointer checks.
Kevin Ball
Right.
Harjot Gill
It's again and again the same thing.
Kevin Ball
Right.
Harjot Gill
Or it's sometimes unit test case generation, doc strings, like those kind of things. It's very effective.
Kevin Ball
Right.
Harjot Gill
And those are the use cases we typically go after, where there's a lot of toil, repetitive work, and we know that people just don't want to do these things. Those are things you go and automate. If you ask me, can an LLM make someone who's already a really good programmer become 100x? I mean that I don't know yet. But we have seen a lot of people become 10x thanks to large language models.
Kevin Ball
One of the things you said a little earlier was around essentially not wanting to build features where the technology isn't there yet. What would you say is kind of the edge right now of the types of things you would do with an LLM where you think it might get to in the next few months versus Ah, that's not going to happen anytime soon.
Harjot Gill
Yeah, we constantly track the envelope. Like that's the whole idea with the evals. One of the other secret sauces these good AI app companies have is evals, where they're able to track not just the efficacy of the current system or the new models that come out, but also track the limits of these models. And we have some test cases we know that even advanced models like o3 are not yet able to solve for us. So it's very critical that we track the progress. And we have seen our own benchmarks and our own evals getting beaten progressively from 4o to o1 to o3 and so on.
Kevin Ball
Right.
Harjot Gill
And that gives us a good idea. And the second is price. Right. On how effectively can we offer it, because even these providers don't have enough quota. Like we have to fight with the provider sometimes to get rate limits. So even if, let's say, we have a use case in mind and people are willing to pay for it, we just don't have capacity for it to be delivered at scale. So there are multiple factors which kind of hold us back on some of these frontier use cases that we have in mind. It's a complicated thing, I would say, on where to make bets. I mean overall, in this space there's massive appetite in the market to bring AI to, as I said, automate the toil and the mundane work.
Kevin Ball
Right.
Harjot Gill
But at the same time there are the practical limitations on how much capacity you can get and the capabilities of the AI itself.
Kevin Ball
Yeah, I think it's the first time I've seen in quite a long time where it feels like the whole industry is capacity limited. We just can't ship enough GPUs.
Harjot Gill
That's right, that's right. And it gets expensive as well. I mean we do see like there's going to be orders of magnitude reduction. But then again, some of these other use cases will start opening up.
Kevin Ball
Right.
Harjot Gill
And it's challenging. Right? I mean overall the models are in a way designed, I mean especially with RL, like yes, you can make them competent on a lot of use cases provided you have the right kind of data. It's like recording the usage, not just what's available on the Internet, but observing how people do things, how they think. That's how the RL thing works. Right. And sometimes it's just synthetic. But the thing is that you have to have the ability to record that data somehow, and that's how other use cases will open up. But for now it seems like coding is something where you can easily obtain that data, either through code editors, through open source, or by hiring humans. Like I know these companies are also hiring a lot of contractors to go and solve programming puzzles, right? So that data is relatively easy to obtain. And that's why we have seen a lot more success in coding use cases initially with AI, but that doesn't mean other use cases are out of reach forever. It's just a matter of time. People will figure out how to obtain quality data to make those use cases reliable.
Kevin Ball
You mentioned evals, and that's another place it might be worth us digging for a little bit because this is something that I feel like there's a lot of chatter, but I haven't seen big standards coming out yet in terms of how to eval. It feels very company specific oftentimes. So how are you thinking about and managing evals?
Harjot Gill
It is indeed company specific. And we have been burnt in the past by looking at public evals and trusting them. That's what happened back in June, July, when GPT-4o came out. It was actually worse than Turbo for at least our use case. We didn't have good evals back then. And we saw a lot more. The main eval is like, hey, are we seeing the same number of conversions? Are people still buying the product at the same rate, like from sign up to paid? Are we seeing a big churn rate? So those kind of things are the real data points, like the business outcomes. As long as you release these models and your outcomes improve or remain the same, this means something is working. So a lot of it is vibe checks as well, right? At the same time, you still want to do as much as you can at your end. Because if you're rolling out these new models, you don't want them to backfire. Like we have like 100,000 developers. The last thing we want is disrupting their daily flows, right? So we try to be careful. We try to curate some of these examples we see in the wild which we think make a good eval. And we're taking more like a cattle versus pets approach. We don't have millions of examples like other companies. We try to curate a golden data set of as few examples as possible, which allows us to track where the AI is today and also lets us compare these models more effectively, very quickly.
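A small curated golden set used to compare candidate models can be sketched as below. This is only an illustration of the idea: the examples in `GOLDEN_SET`, the substring-match scoring, and the `Reviewer` callable type are all assumptions made for the sketch, not details from the interview.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical golden set: (diff snippet, issue the review must mention).
GOLDEN_SET: List[Tuple[str, str]] = [
    ("user = find(id); print(user.name)", "null check"),
    ("query = 'SELECT * FROM t WHERE id=' + uid", "sql injection"),
    ("open(path).read()", "unclosed file"),
]

Reviewer = Callable[[str], str]  # takes a diff, returns review text


def pass_rate(reviewer: Reviewer) -> float:
    """Fraction of golden examples where the expected issue is mentioned."""
    hits = sum(1 for diff, expected in GOLDEN_SET
               if expected in reviewer(diff).lower())
    return hits / len(GOLDEN_SET)


def compare(candidates: Dict[str, Reviewer]) -> Dict[str, float]:
    """Score every candidate on the same examples so results are comparable."""
    return {name: pass_rate(fn) for name, fn in candidates.items()}
```

Because every candidate sees the same few examples, a new model release can be scored against the current one in a single quick run.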
Kevin Ball
What granularity do you apply that at? Because we talked about you have this complex and valuable tool chain or kind of task graph and pipeline of things going on. Is the eval at the level of the whole pipeline on a particular code change or are there more granular things that you are testing?
Harjot Gill
It's both. We are taking the end to end approach as well, where we are running the end to end flow. But a lot of the times we are also running it as a unit test case kind of a thing, assuming a lot of the context we are able to provide from the other stages of the pipeline is perfect. How is a certain stage going to perform? Because you know, it's a complex pipeline, especially an agentic one, and your errors compound the deeper you go. That's the hard part. Like I mean if you have a 5% error rate, it becomes 20% at the end of the day downstream.
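The compounding can be made concrete with a small calculation. Assuming, as a simplification not stated in the interview, that stages fail independently and that any single failure spoils the final result, a 5% per-stage error rate gives roughly 19% end to end after four stages and roughly 23% after five, which is in line with the 20% figure quoted.

```python
def end_to_end_error(per_stage_error: float, stages: int) -> float:
    """Probability that at least one of `stages` independent stages fails."""
    return 1.0 - (1.0 - per_stage_error) ** stages


if __name__ == "__main__":
    # Show how a 5% per-stage error grows with pipeline depth.
    for n in (1, 2, 4, 5):
        print(n, round(end_to_end_error(0.05, n), 3))
```

This is also why testing each stage in isolation with fixed inputs helps: it measures the per-stage rate directly instead of the compounded one.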
Kevin Ball
Right.
Harjot Gill
So the idea is like how do we decompose this pipeline and test each stage independently as much as possible by keeping a lot of the other factors the same. So yeah, it's kind of a balance. Yes, there are end to end tests as well, and at the same time it's very granular. I wouldn't say we have 100% coverage because some of the prompts are simple. We don't feel like writing a lot of evals for them. But some of the more complex prompts where a lot of the classification happens, a lot of the reasoning happens, those kind of prompts we have extensive tests for.
Kevin Ball
Now, are you using any particular framework for that or is it homegrown?
Harjot Gill
It's mostly homegrown. I mean we do have some visibility in tools like LangSmith, especially from the open source. Like we don't trace our paid customers' private repositories, but that's where we have a lot of the open source data coming in that provides us live visibility into how the system is performing.
Kevin Ball
That makes sense. Slightly different direction. You said this is your third company and I think I saw Code Rabbit is completely bootstrapped. You didn't go the venture capital route or anything like that. I know that's something a lot of developers dream about doing, taking a project and bringing it to be something sustainable. What did that take? How does that look? And were you able to get to something that could sustain you very quickly or kind of, what was that timeline like?
Harjot Gill
That's an interesting question. Like, I mean, yes, we had success in the past. My first startup was a good exit. The second, not so much. I mean that was in the reliability management space. But Code Rabbit was kind of an internal tool that started out there, and then it flourished independently as this startup. One of the unique things has been just the compressed time frames. Things are moving. And it's not like we didn't take venture capital money. We are funded by CRV, which is one of the big investors in product led growth companies. Overall we raised around 26 million. So it's not completely bootstrapped at this point. There is significant VC money which has been raised in this company. But yeah, I mean it did get to series A without a seed funding round. We were already at a million dollars annual recurring revenue last year when we did that round, and that was completely on a bootstrapped budget. But we could do that given that, yes, there was some prior success, so we could invest. We were at a stage in life where we could take that kind of a risk.
Kevin Ball
That makes sense. How did you get your initial set of customers? I think this zero to one phase is one of the most challenging, particularly for developers finding the market. And you're targeting developers, which for a lot of us, when we think about, oh, I could do something, we start with an itch that we want to scratch for ourselves. So how did you get to that 1 million out of the gate, with no background budget except what you could fund yourself?
Harjot Gill
A lot of that is thanks to my co-founder Gur, who did things I would not have otherwise done. First of all, the first two startups were all enterprise sales, very content marketing driven, a very different go to market. I'm not saying that was ineffective, but that's what those products needed. On the other hand, the developer market is a very consumer style market. It's a massive market compared to selling cloud infra, for example.
Kevin Ball
Right.
Harjot Gill
And the strategies that work here are very different. Like even things like ads work very effectively in this space. So it was a combination of multiple things, like influencers, organic tweets. Like our users talk about the product. A lot of it is not even us pushing it. It's the flywheel effect of the users that talk about it. So a lot of our customers who come in inbound are primarily coming because of word of mouth. They're not being acquired by marketing in any way, so our cost of acquisition of customers is very, very low for the industry, because it's just a flywheel effect. The key thing we did is we made the product accessible to as many people as we could. We made the product free for open source users so they could try it out. We made the product free for all individual users on VS Code. So the idea is like we know that this AI thing is so new it needs a massive habit change. Right? The main battle is not building a product or raising money. The main thing is like are people going to form this new habit or not? That was our biggest worry two years back. We saw it coming. Everyone was trying to bring AI products to the market. We knew 90% of them would fail because people are not going to change their habits. So we saw that early on, and in order to quickly iterate on the product and make sure that we build a habit forming product, we had to make it accessible. There was no other way. And we kind of innovated a lot on that, and that's what led to a lot of user love, because we could iterate and pretty much hammer it to the point where it has a very good product market fit and gets universal love.
Kevin Ball
Yeah, great lessons there. I guess we're getting closer to the end. Is there anything on the horizon? What's the next big release coming from Code Rabbit?
Harjot Gill
No, we're doing very interesting stuff now actually. So code review has been a very interesting starting point, getting us through the door in pretty much most companies now. One of the things we are seeing now is vibe coding take off, and we are seeing even more acceleration in our growth. Like we have been growing crazy but the last three weeks have been, I would say, crazier. We have never seen that kind of growth, because OpenAI Codex came out, there are the background agents Cursor is doing, and Claude Code is there. There are so many vibe coding tools out there, and what we are seeing is this huge opportunity in being a tool that can make the vibe coded systems production ready. So there is still some last 20% polishing, or we call them finishing touches. Those are the areas we are focusing on in the PR. Can we eliminate all the deficiencies? Like for example, if you're missing documentation and you as a company care about it, can we add doc strings, can we add missing unit test case coverage? Because those kind of things you're going to discover when you actually open a PR. You're not going to discover that in your Cursor or code editor. You're going to discover that in the CI/CD, and that last 20% polishing is what we are focusing on as a company.
Kevin Ball
That's super cool. Especially because I feel like one of the things I've seen with people exploring vibe coding is the better your code practices are, the better the AI is able to generate things in it. If you keep things modular and well named and all these things that get caught in a code review, then you're going to be able to sustain this longer as well.
Harjot Gill
That's right. I mean, there are so many things. You're talking about maintainability, you're talking about can we fix some of these CI/CD failures. Like, there's just so much downstream of a PR as well that needs to happen. And we are pretty excited. Like, I mean, there's massive appetite, and a lot of these form factors haven't been thought of in the past, and we're so excited to bring all these new ideas to the market.
Kevin Ball
That's awesome. Well, anything else that you would like to leave our audience with before we wrap?
Harjot Gill
I mean, the only thing I would say is definitely try out Code Rabbit if you haven't tried it already. I know that a lot of people have heard about it, but the thing is like, it's not. It's a tool that will surprise you once you actually try it because it's that good. So I recommend everyone at least try it once.
Kevin Ball
Awesome. I think that's a great wrap up.
Harjot Gill
Thanks, Kevin.
Podcast Summary: Software Engineering Daily – "CodeRabbit and RAG for Code Review with Harjot Gill"
Release Date: June 24, 2025
In this enlightening episode of Software Engineering Daily, host Kevin Ball engages in a deep dive with Harjot Gill, the founder and CEO of CodeRabbit—a pioneering startup integrating generative AI into the code review process. They explore the architecture of CodeRabbit, its innovative use of Large Language Models (LLMs), and the intricate mechanisms that ensure quality, security, and maintainability in software development at scale.
Harjot Gill introduces CodeRabbit as a solution that leverages generative AI to enhance code reviews, ensuring code quality and security across platforms like GitHub and GitLab. With over 100,000 daily users, CodeRabbit has rapidly gained popularity among developers across various industry segments.
Harjot Gill [00:00]: “One of the most immediate and high impact applications of LLMs has been in software development. The models can significantly accelerate code writing, but with that increased velocity comes a greater need for thoughtful, scalable approaches to code review.”
Kevin Ball inquires about the practical aspects of using CodeRabbit. Harjot Gill explains that CodeRabbit seamlessly integrates into existing development workflows, primarily functioning within the pull request model. Once a feature branch is ready, opening a pull request triggers CodeRabbit to perform automated code reviews alongside traditional CI/CD pipelines. Additionally, a recently launched VS Code extension allows developers to review code before pushing it to remote repositories.
Harjot Gill [02:34]: “Coderabbit sits alongside those tools and uses AI to perform code reviews. And very recently... we also released a VS code extension that also works... so that the developers can also review the code before they even push the code to the remote git branch.”
Harjot Gill differentiates between code generation and code review, emphasizing that while code generation focuses on autocomplete and suggestions using smaller, low-latency models, code review demands deep reasoning and comprehensive analysis.
Harjot Gill [04:18]: “The workflow that CodeRabbit is sitting on is latency insensitive because you're running it in the CI/CD pipeline and that workflow can typically take several minutes to complete.”
CodeRabbit employs an ensemble of multiple LLMs tailored for specific tasks within the pipeline. This strategy ensures optimal performance and cost-effectiveness, utilizing models like GPT4.1 Nano for context preparation and more advanced models for nuanced code analysis.
Harjot Gill [10:15]: “Code Rabbit is an ensemble of models we draw. We don't even expose what models we are using to the end customers.”
To provide relevant and accurate code reviews, CodeRabbit builds a dynamic code graph from pull request payloads, analyzing diffs, dependencies, and contextual information from issue trackers like JIRA. Additionally, past interactions and team-specific learnings enrich the AI’s understanding, ensuring personalized and effective reviews.
Harjot Gill [06:21]: “There are like, I don't know, 10 to 15 different data points that we pull in during the context.”
LLMs, despite their prowess, have limited context windows and can suffer from quality degradation when overloaded with information. Harjot Gill discusses how CodeRabbit strategically manages context by supplying the AI with essential hints and creating sandbox environments where the AI can execute agentic loops—running CLI commands and web queries to fetch additional context as needed.
Harjot Gill [06:48]: “What we are doing, which is a cool thing, which is like so differentiated right now, we create all these like sandbox environments in the cloud.”
Ensuring secure and efficient sandboxing is pivotal for CodeRabbit. The system employs standard containerization techniques, allowing the AI unrestricted internet access to perform necessary operations like running shell scripts and accessing GitHub APIs. Tokens are securely stored, and the AI leverages its training data to understand and execute CLI commands without additional handholding.
Harjot Gill [17:37]: “We actually generate code instead of doing tool calls. We have a sandbox and CLI. That's all you need.”
CodeRabbit's agentic loop is a sophisticated pipeline that breaks down code review tasks into a dynamic task graph. This system delegates subtasks to specialized agents, ensuring thorough and accurate analysis. Each agent's output is meticulously tracked, allowing the main agent to reassess and replan if necessary, maintaining a high standard of code quality.
Harjot Gill [25:09]: “That's right. And this task graph is dynamic, as you can guess. I mean it's figured out by the AI.”
To maintain cost-effectiveness, CodeRabbit strategically employs a mix of cheaper and more expensive models depending on the task complexity. Additionally, the platform implements rate limits and incremental review processes to minimize unnecessary computations, ensuring scalability without exorbitant costs.
Harjot Gill [28:12]: “One of the things that people love about Code Rabbit, it's an incremental reviewer. So it will remember the last time we left the review and next time when it resumes, it will first see whether I have to really re-review something or not.”
Developing CodeRabbit involved navigating the inherent non-determinism of LLMs and ensuring a reliable user experience. Harjot Gill emphasizes the importance of hiding LLM deficiencies from users, maintaining high accuracy, and embedding AI seamlessly into existing workflows to foster adoption.
Harjot Gill [30:10]: “The trick has been how do you hide those deficiencies from the end user. They tend to be noisy, they tend to create a lot of slop otherwise.”
CodeRabbit’s growth was propelled by a consumer-style marketing approach, leveraging influencer partnerships, organic social media presence, and a strong word-of-mouth effect. By making the tool accessible and free for open-source projects and individual developers, CodeRabbit built a habit-forming product that quickly resonated with its target audience.
Harjot Gill [43:44]: “We made the product accessible to as many people as we could. We made the product free for open source users... to quickly iterate on the product and make sure that we build a habit forming product.”
Looking ahead, CodeRabbit is expanding its focus from code reviews to code polishing, addressing the final 20% of code quality enhancements such as documentation and unit test coverage. This strategic shift aims to solidify CodeRabbit’s position as an indispensable tool in the software development lifecycle.
Harjot Gill [46:17]: “We are focusing on that in the PR. Can we eliminate all the deficiencies? Like for example like if you're missing documentation and you as a company care about it, can we add doc strings, can we add missing unit test case coverage.”
Harjot Gill concludes by encouraging developers to experience CodeRabbit firsthand, highlighting its effectiveness and seamless integration into daily workflows.
Harjot Gill [48:10]: “I recommend everyone at least try it once.”
Key Insights:
Integration Over Intrusion: CodeRabbit successfully embeds AI into existing workflows without disrupting developer habits, ensuring widespread adoption and love from its user base.
Strategic Use of LLMs: By employing an ensemble of specialized models and dynamic task management, CodeRabbit balances performance with cost, delivering high-quality code reviews at scale.
Transparency and Trust: Providing users with contextual insights and a clear reasoning trail fosters trust and allows for validation, mitigating the risk of AI-induced errors.
Scalable and Sustainable Growth: A focus on accessibility, combined with intelligent cost management and a strong user-centric approach, has propelled CodeRabbit’s exponential growth.
For developers looking to enhance their code review processes with state-of-the-art AI, CodeRabbit offers a robust, reliable, and seamlessly integrated solution. As Harjot Gill aptly puts it, experiencing CodeRabbit firsthand is the best way to appreciate its transformative potential.