
B
Welcome to the Latent Space podcast in the new studio. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.
C
Hey. Hey.
A
Hey.
C
It's weird to say welcome because obviously, actually, today's guest, Jeff, has welcomed us to Chroma for many months now. Welcome.
A
Thanks for having me. Good to be here.
C
Jeff, you're the founder and CEO of Chroma. I've observed Chroma for a long, long time, especially back in the old office. You originally got your start as an open source vector database, right? You're the open source vector database of choice for a lot of different projects; you were even used in projects like the Voyager paper. I don't even know the full list. But how do you introduce Chroma today?
A
It's a good question. I mean, naturally you always want to take your messaging and make it fit your audience. But the reason Chroma got started is that we had worked for many years in applied machine learning, and we'd seen how demos were easy to build, but building a production-reliable system was incredibly challenging, and the gap between demo and production didn't really feel like engineering. It felt a lot more like alchemy. There's a good XKCD comic about this: a guy standing on top of a giant steaming pile of garbage, and the other character asks, "This is your data system?" And he says yes. "How do you know if it's good, or how do you make it better?" "You just stir the pot and see if it gets any better." That just seemed intrinsically wrong. And this was back in 2021, 2022, that we were having these conversations. So that, coupled with a thesis that latent space was a very important tool. That is a plug. Yes, we agree. That is a plug.
C
We need to ring the bell.
B
Yeah, exactly.
A
Ring the gong. That latent space, both the podcast but also the technology, was a very underrated tool, a very important tool for interpretability. It's fundamentally how models see their own data. We as humans can have that shared space to understand what's going on. That's where we got started, and I think that's also where we continue to want to go. What do we want to do? We want to help developers build production applications with AI, and we want to make the process of going from demo to production feel more like engineering and less like alchemy. Doing a database is not a side quest; it is part of the main quest. What we realized along the way was that search was a key workload in how AI applications were going to get built. It's not the only workload, but it's definitely a really important one. You don't earn the right to do more things until you've done one thing at a world-class level, and that requires kind of maniacal focus. That's really what we've been doing for the last few years. That was a long, rambly introduction, but maybe to land the plane: if you ask people what Chroma does today, we build a retrieval engine for AI applications. We're working on modern search infrastructure for AI. Some version of that.
C
I'll do a double click on this. Is information retrieval and search the same thing or are they slightly different in your mind? I just wanted to clarify our terminology.
A
Yeah. I think "modern search infrastructure for AI" is worth unpacking for a couple seconds. "Modern" is in contrast to "traditional," and mostly what that means is modern distributed systems. There's a bunch of primitives for building great distributed systems that have come onto the scene in the last five, ten years, which by definition aren't in technology older than that: separation of read and write, separation of storage and compute. Chroma is written in Rust, it's fully multi-tenant, and we use object storage as the key persistence tier and data layer for Chroma Distributed and Chroma Cloud. So that's the "modern" piece. The "for AI" piece actually matters in four different ways. It means, number one, the tools and technology you use for search are different than in classic search systems. Number two, the workload is different. Number three, the developer is different. And number four, the person consuming those search results is also different. Think about classic search systems: you, the human, were doing the last mile of search. Click, click, click. Which of these are relevant? Open a new tab, summarize, blah blah blah. You, the human, were doing that, and now it's a language model. Humans can only digest 10 blue links; language models can digest orders of magnitude more. All of these things matter, and I think they influence how a system is designed and what it's made for.
B
Back in 2023, the vector DB category was one of the hottest ones: you had Pinecone raising $100 million, you had Weaviate, you had all these companies. How did you stay focused on what mattered to you rather than just trying to raise a lot of money and make a big splash? And it took you a while to release Chroma Cloud too; rather than just getting something out that maybe broke once you got to production, you took your time. Can you give people advice on how, in the AI space, to be patient as a founder, and how to have your own vision that you follow versus following the noise around you?
A
There are different ways to build a startup, different schools of thought here. One school of thought is to find signal and follow the gradient descent of what people want, lean-startup style. My critique of that would be that if you follow that methodology, you will probably end up building a dating app for middle schoolers, because that just seems to be the basest take on what humans want. To some degree, the slot machine would be the AI equivalent of that. The other way to build a startup is to have a very strong view, presumably a contrarian view, or at least a view that seems like a secret, and then to be maniacally focused on that thing. Different strokes for different folks, but we've always taken the second approach. And yeah, there was the option of: okay, Chroma single-node is doing really well, getting a bunch of traffic, clearly a hosted service is the thing people want, we could very quickly get a product in the market. But we felt like, no, what we really want Chroma to be known for is our developer experience. We want Chroma's brand, and the craft expressed in our brand, to be extremely well known. And we felt that offering the single-node product as a service was not going to meet our bar for what a great developer experience could and should look like. So we made the decision: no, we're going to build the thing we think is right. That was really challenging. It took a long time, and obviously I'm incredibly proud that it exists today and that it's serving hundreds of thousands of developers who love it, but it was hard to get there.
B
When you're building the team, how do you message that? If I go back a year and a half ago, I could join Chroma, I could join all these different companies. How do you keep the vision clear to people when on the outside it's, oh, I'll just use pgvector, or whatever the thing of the day is? Do you feel like that helps you bring in people that are aligned with the vision, more of the missionary type, versus people just joining the company before it's hot? Any learnings you have from recruiting early on?
A
There's a version of Conway's Law, "you ship your org chart," and I'd say you ship your culture, because your org chart is downstream of your company's culture. We've always placed an extremely high premium on the people we actually have here on the team. I think the slope of our future growth is entirely dependent on the people in this office. That could mean going back to zero, that could mean linear growth, that could mean all kinds of versions of hyperlinear growth: exponential growth, hockey-stick growth. So yeah, we've decided to hire very slowly and be really picky. And I don't know, the future will determine whether or not that was the right decision. But having worked on a few startups before, that was something I really cared about: I just want to work with people that I love working with, that I want to be shoulder to shoulder with in the trenches, and that I think can independently execute at the level of craft and quality that we owe developers. That's how we chose to do it.
C
We'll talk about Standard Cognition and all the other fun stuff towards the end, but let's focus on Chroma. I always want to put some headline numbers up front; I'm trying to do a better job of giving people the brain dump on what they should know about Chroma. 5 million monthly downloads is what I have on PyPI, and 21,000 GitHub stars. Anything else people should know, the typical sales-call headline stuff?
A
Yeah, yeah. 20,000-plus GitHub stars, 5 million plus monthly downloads. I looked at the number recently; I think it's over 60 or 70 million all-time downloads now. For many years running, Chroma has been the number one most-used project broadly, but also within communities like LangChain and LlamaIndex.
C
Okay, cool, fair enough. When you say single-node Chroma, I think you're describing the core difference between that and what Chroma Cloud is, and I think we're releasing this episode in line with the Chroma Cloud GA.
A
Yes.
C
So what should people know about Chroma Cloud and how you've developed this experience? From the start you mentioned separation of storage and compute. What does that feel like?
A
Yeah, 100%. Chroma is known for its developer experience. I don't know that we were the first to do this; I think we were. With Chroma, you just pip install chromadb and then you can use it.
C
It's just like in memory.
A
I think it may have been the first, and you can persist it too. It could have been the first database to ever be pip-installable.
C
Any SQLite wrapper is pip-installable, technically, you know.
A
SQLite itself was not pip-installable. Even to this day, I don't think it is.
C
You probably have deeper knowledge of this than I do. I'm just speculating.
B
Yeah.
A
So that led to a very seamless onboarding experience for new users, because you could just run a command and then use it. We did all of the work to make sure that regardless of the deployment target or architecture you're running on, it would just work. In the early days we had people do really weird stuff: run it on Arduinos, on PowerPC architectures, really esoteric stuff. We would go the extra mile to make sure it worked everywhere, and it just always worked. So that was Chroma single-node. Going back to the developer experience we wanted in a cloud product: in the same way that you could run pip install chromadb and be up and running in five seconds, not have to think about it, not have to learn a bunch of abstractions or spend a bunch of time on a really complicated API, that same story had to be true for the cloud. What that meant is that a version of the product where you're forced to think about how many nodes you want, or how to size those nodes, or what your sharding strategy should be, or your backup strategy, or your data tiering strategy (I could go on) just wasn't good enough. It needed to be zero config, zero knobs to tune. It should just be always fast, always very cost-effective, and always fresh, without you having to do or think about anything, regardless of how your traffic goes up and down and how your data scale goes up and down. That was the motivating criteria. Also usage-based billing; that was really important, because it's just so fair. We only charge you for the minimal slice of compute that you use and nothing more, which not all serverless databases can claim, but it is true inside of Chroma that we truly only charge you for the narrow slice of what you use. So that was the criteria we entered the design process with.
C
You're also building a serverless compute platform.
A
Yeah, you have to. No, exactly. That motivated the design of Chroma Distributed. Chroma Distributed is part of the same monorepo that's open source; the control plane and the data plane are both fully open source, Apache 2.0. Chroma Cloud then uses Chroma Distributed to run a service. With that service you can sign up, create a database, and load in data in under 30 seconds. As of the time of filming, people get five bucks of free credits, which is actually enough to load in 100,000 documents and query it 100,000 times, which for a lot of use cases might mean they use it for free for years, which is fine. And to get there, we had to do all the hard work.
C
I think every blog should basically have semantic indexing. So host your personal blog on Chroma. Why not?
A
Yeah, I mean, the mission of organizing the world's information remains unsolved.
B
You had one of your usual cryptic tweets: you tweeted "context engineering" a couple months ago. What was it, April? Now everybody is talking about context engineering. Can you give your canonical definition and how Chroma plays into it, and then we'll talk about all the different pieces of it?
A
I think something that's incredibly important when a new market is emerging is the abstractions and primitives you use to reason about the thing. And AI, in part because of its hype, has had a lot of primitives and abstractions thrown around that have left a lot of developers unable to think critically about: what is this thing, how do I put it together, what problems can I solve, what matters, where should I spend my time? For example, the term RAG. We never use the term RAG. I hate the term RAG.
C
Yeah, I killed the RAG track partially because of your influence.
A
Thank you, thank you. It's just retrieval. First of all, retrieval-augmented generation is three concepts put together into one term; that's just really confusing. And of course RAG has now gotten branded as: oh, you're just using single dense vector search, that's what RAG is. That's also dumb. I think one of the reasons I was really excited about the term: obviously there's AI engineering, which you did a ton of work for, and context engineering is in some ways a subset of AI engineering. It's a high-status job. What is it? Context engineering is the job of figuring out what should be in the context window at any given LLM generation step. There's an inner loop, which is setting up what should be in the context window this time, and an outer loop, which is how you get better over time at filling the context window with only the relevant information. We recently released a technical report about context rot, which goes in depth on how the performance of LLMs is not invariant to how many tokens you use: as you use more and more tokens, the model can pay attention to less and can reason less effectively. I think that really motivates the problem. Context rot implies the need for context engineering. And I guess why I'm really excited about the meme (and I maybe got lucky to some degree, calling back in April that this was going to be a big one) is that it elevates the job. It clearly describes the job, and it elevates the status of the job. Frankly, any AI startup you can think of today that's doing very well: what is the one thing they are fundamentally good at? It is context engineering.
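As an editor's aside, the "inner loop" Jeff describes (deciding what goes in the context window this time) can be sketched as a budget-packing problem. The chunks and the one-token-per-word heuristic below are illustrative assumptions of mine, not Chroma's implementation.

```python
# Toy inner loop of context engineering: keep the highest-ranked chunks
# that fit a token budget, dropping the rest to limit context rot.

def n_tokens(text: str) -> int:
    # crude stand-in for a real tokenizer
    return len(text.split())

def pack_context(ranked_chunks: list[str], budget: int) -> list[str]:
    """Greedily keep top-ranked chunks while they fit within `budget` tokens."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = n_tokens(chunk)
        if used + cost <= budget:
            packed.append(chunk)
            used += cost
    return packed

chunks = [
    "most relevant chunk about the user question",           # rank 1
    "second chunk with supporting detail",                   # rank 2
    "long tail chunk only marginally related to anything",   # rank 3
]
print(pack_context(chunks, budget=12))
```

The outer loop Jeff mentions would then be whatever process improves the ranking itself over time; this sketch only covers the per-step selection.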
C
I feel like in a lot of the pieces I've read, the focus is on agents versus non-agent stuff, like context engineering is more relevant for agents. Do you make that distinction at all, or are you looking at context engineering generally?
A
No. I mean, there are interesting agent implications, like agent learning: can agents learn from their interactions? That's maybe less relevant for static knowledge-base corpuses, chat-your-documents use cases, obviously. Then again, you can make the argument that even chat-your-documents use cases should get better with more interactions. I don't draw a distinction between agent and non-agent. I don't actually know what "agent" means, still. But again, primitives, abstractions, words, they matter. What does "agent" mean? I don't know.
C
Well, there's many definitions out there. I've taken a stab at it.
A
Most terms that can mean anything are just a vehicle for people's hopes and fears. Yeah, I think agent is the same thing for sure.
C
Well, maybe we'll try to be more precise about context engineering so that it actually means something and people can use it to do stuff. One thing I'll definitely call out on context engineering, or context rot in general: there's been a lot of marketing around needle-in-a-haystack, where every frontier model now comes out with completely green, perfect charts of full utilization across a million tokens. I'm wondering what your take is on that kind of marketing.
A
So maybe to back up a little, the way we came to work on this research was we were actually looking at agent learning. We were very curious: could you give agents access to prior successes or prior failures, and if you did, would that boost agent performance? We were looking at a couple different datasets, SWE-bench included, and we started seeing interesting patterns where, on multi-turn agent interactions where you're giving it the whole conversation window, the number of tokens explodes extremely quickly, and instructions that were clearly in there were being ignored and not acted upon. And we were like, oh, that's clearly a problem; we've now felt the pain. It was sort of a meme amongst people in the know that this was true, and some of the research community's reaction to the context rot technical report was, yeah, we know. That's fine, but nobody else knew, and it's kind of nice if you can actually teach builders what is possible today versus what is not possible today. I don't blame the labs. Building models is so insanely competitive; everybody invariably picks the benchmarks they want to do best on, trains around those, and those are the ones that find their way into the marketing. Most people are not motivated to come out and say, here are all the ways our thing is great, and here are the ways our thing is not great. So I have some sympathy for why this was not reported on. But yeah, there was this implication: look, our model is perfect on this needle-in-a-haystack task, therefore you can use the context window for whatever you want. And while I hope that is true someday, it is not the case today.
C
Yeah. For people on the YouTube video at least, we'll put up this chart, which is basically figure 1 of the context rot report. It seems like Sonnet 4 is the best in terms of area under the curve, is how to think about it. Then Qwen.
A
Wow.
C
And then GPT-4.1 and Gemini Flash degrade a lot quicker in terms of the context length.
A
Yeah, I don't have much commentary. That is what we found for this particular task. Again, how that translates to people's actual experience in real world tasks is entirely different. There is a certain amount of love that developers have for Claude and maybe those two things are correlated.
C
Yeah, I think it shows here if this is true.
A
That could be a big explanation for it. "It follows my instructions" is clearly a baseline thing people want.
C
I don't think it's fully answered here, but I have a theory that reasoning models are better at context utilization, because they can loop back. Normal autoregressive models just go left to right, but reasoning models, in theory, can loop back and look for things they needed connections for that they may not have paid attention to in the initial pass.
A
There's a paper today that showed, I think, maybe the opposite. I'll send it to you later.
C
Yeah, that'd be fascinating to dig into. New papers every day.
B
I thought the best thing was that you did not try to sell something. You were just like, hey, this thing is broken, it kind of sucks. How do you think about problems you want to solve versus research you do to highlight problems, hoping other people will participate? Is everything you talk about on the Chroma roadmap, basically? Or are you just advising people: hey, this is bad, work around it, but don't ask us to fix it?
A
Kind of going back to what I said a moment ago, Chroma's broad mandate is to make the process of building AI applications more like engineering and less like alchemy. That's a pretty big tent, but we're a small team and can only focus on so many things; we've chosen to focus very much on one thing for now. I don't have the hubris to think that we ourselves can solve this stuff conclusively for a very dynamic, large, emerging industry. It takes a community; it takes a rising tide of people all working together. We intentionally wanted to make very clear that we have no commercial motivations in this research. We don't posit any solutions. We don't tell people to use Chroma. It's just: here's the problem. The implication is there, and listen, we weren't sad if it maybe ended up being a positive for us, but there are still reasons around speed and cost regardless. There's just a lot of work to do, and it's interesting that the labs don't really care, and they're not motivated to care. Increasingly, the main market for an LLM provider seems to be consumer. You're just not that motivated to treat.
C
Developers as a secondary concern.
A
As a secondary concern. So you're just not that motivated to do the legwork to help developers learn how to build stuff.
C
Yeah.
A
And then if you're a SaaS company, or a consumer company building with AI, an AI-native company, this is your secret sauce. You're not going to market how to do stuff. So there's a natural empty space for people who actually have the motivation to show the way for how developers can build with AI. There aren't a lot of obvious players investing their time and energy in that, but I think it's obviously a good thing for us to do. That's kind of how I thought about it.
C
Just a bit of pushback on the consumer thing. You say the labs don't care, but don't you think OpenAI, building memory into ChatGPT and making it available to literally everybody (probably too much in your face, I would argue), would really care to make the memory utilization good? I think context utilization, context engineering, is important for them too, even if they're only building for consumers and don't care about developers.
A
Yeah. How good is it today is obviously one important question, but we'll skip that one. Even if that's the case, are they actually going to publish those findings?
C
No, never.
A
Exactly. It's alpha, right? Why would you give away your secrets? So I think there are just very few companies that are actually in the position where they have the incentive and really care about trying to teach developers how to build useful stuff with AI. And I think we have that incentive.
B
But do you think you could grow this to the point of being the next needle-in-a-haystack, and force the model providers to actually be good at it?
A
There's no path to forcing anybody to do anything. We thought about that when we were putting this together: maybe we should formulate it as a formal benchmark and make it very easy to run. We did open source all the code. So if you're watching this and you're from a large model company, you can take your new model that you haven't released yet and run these numbers on it. I would rather have a model with a 60,000-token context window that is able to perfectly pay attention to and perfectly reason over those 60,000 tokens than a model with 5 million tokens. As a developer, the former is so much more valuable to me than the latter. I certainly hope that model providers do pick this up as a thing they care about, that they train around, that they evaluate their progress on, and that they communicate to developers as well. That would be great.
B
Do you think this will get bitter-lessoned as well? How do you decide? Because you're basically saying the models will not learn this; it's going to be a trick on top that you won't get access to.
A
I'm not saying that.
B
Well, when you're saying that they will not publish how to do it well, it means the model API will not be able to do it, but they'll have something in ChatGPT that will be able to do it.
A
I see. Yeah. It's very risky to bet on what's going to get bitter-lessoned versus what is not. I don't think I'll hazard a guess.
C
Hopefully not AI engineers.
A
Yeah. Hopefully not all of humanity. I don't know. You know.
B
Yeah.
C
To me there's also an interesting discipline developing around context engineering. Lance Martin from LangChain did a really nice blog post on all the different separations. And in New York you hosted your first meetup; we're going to do one here in San Francisco as well. I'm curious: what are you seeing in the field, who's doing interesting work, what are the top debates, that kind of stuff?
A
I think this is still early. I mean, a lot of people are doing nothing. A lot of people are just still yeeting everything into the context window. That is very popular.
C
Yeah.
A
And you know, they're using context caching, and that certainly helps their cost and speed, but it isn't helping the context rot problem at all. So I don't know that there are lots of best practices in place yet, but I'll highlight a few. The problem is fundamentally quite simple: you have N candidate chunks and Y spots available, and you have to curate and cull down from 10,000 or 100,000 or a million candidate chunks to the 20 that matter right now, for this exact step. That optimization problem is not new to many applications and industries; it's a classic problem. What tools people use to solve it, again, it's still very early, it's hard to say, but a few patterns I've seen. One pattern is to use what a lot of people call first-stage retrieval to do a big cull. That's using signals like vector search, full-text search, metadata filtering and metadata search, and others, to go from, let's say, 10,000 down to 300. Like we were saying a moment ago, you don't have to give an LLM just 10 blue links; you can brute-force a lot more. So using an LLM as a re-ranker, brute-forcing from 300 down to 30, is something I've now seen emerge a lot. A lot of people are doing this, and it's actually way more cost-effective than I think a lot of people realize. I've heard of people running models themselves who are getting like a penny per million input tokens, and the output token cost is basically zero because the output is so simple.
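The two-stage pattern Jeff describes can be sketched roughly as follows. The word-overlap scoring and the llm_relevance stub are stand-ins of my own; in practice the first stage would be vector or full-text search over an index, and the second stage a cheap LLM call per candidate, run in parallel.

```python
# Sketch of first-stage retrieval (big cull) followed by re-ranking.
# llm_relevance is a stub; a real version would prompt a small, cheap LLM
# to rate each candidate's relevance to the query.

def first_stage(query: str, corpus: list[str], keep: int) -> list[str]:
    # stand-in for vector / full-text / metadata search: crude word overlap
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:keep]

def llm_relevance(query: str, doc: str) -> float:
    # stub: a real implementation would ask an LLM "rate relevance 0 to 1"
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank(query: str, candidates: list[str], keep: int) -> list[str]:
    return sorted(candidates, key=lambda d: -llm_relevance(query, d))[:keep]

corpus = [
    "chroma is a retrieval engine for ai applications",
    "bananas are rich in potassium",
    "retrieval quality depends on the query and the corpus",
]
shortlist = first_stage("retrieval engine for ai", corpus, keep=2)
final = rerank("retrieval engine for ai", shortlist, keep=1)
print(final)
```

At real scale the numbers Jeff cites apply: cull 10,000 candidates to a few hundred cheaply, then spend the LLM budget only on that shortlist.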
C
These are dedicated re-ranker models, right? Not full LLMs.
A
No, these are LLMs. They're just using LLMs as re-rankers.
C
Okay.
A
And of course there are also dedicated re-ranker models that by definition are going to be cheaper and faster, because they're much smaller. But what I've seen emerge is application developers who already know how to prompt applying that tool to re-ranking, and I think this is going to be the dominant paradigm. I actually think purpose-built re-rankers will mostly go away. Well, they'll still exist, right? If you're at extreme scale or extreme cost sensitivity, yes, you'll care to optimize that. It's the same with hardware: you're just going to use a CPU or a GPU unless you absolutely have to have an ASIC or an FPGA. And I think the same thing is true of re-rankers: as LLMs become a hundred, a thousand times faster and a hundred, a thousand times cheaper, people are just going to use LLMs as re-rankers, and brute-forcing information curation is going to become extremely popular. Now, today, the prospect of running 300 parallel LLM calls, even if it's not very expensive: the tail latency on any one of those 300 calls, API availability, it's all still really bad. There are good reasons not to do that today in a production application, but those will also go away over time. So those are patterns I've seen emerge. It's a new thing that I've only seen start to become popular in the last few months, and by popular I mean popular at the leading tip of the spear, but I think it will become a very, very dominant paradigm.
C
Yeah, we've also covered a bit of this on the code indexing side of the house. Everything we've been talking about applies to all kinds of context, but code is obviously a special kind of corpus that you want to index. We've had a couple of episodes where the Claude Code guys and the Cline guys talked about how they don't embed, they don't index your code base; they just give the model tools and it uses the tools to do code search. And I've often wondered whether that should be the primary context retrieval paradigm: when you build an agent, do you call out to another agent with all these recursive re-rankers and summarizers, or another agent with tools, or do you glom it all down into a single agent? I don't know if you have an opinion, obviously, because "agent" is very ill-defined, but I'll just put it out there.
A
Got to pull that apart. So indexing, by definition, is a trade-off. When you index data, you're trading write-time performance for query-time performance: you're making it slower to ingest data but much faster to query it, which obviously matters more as datasets get larger. If you're only grepping very small, 15-file code bases, you probably don't have to index, and that's okay. But if you want to search all of the open source dependencies of that project, well, you've all done this before in VS Code or Cursor, right? You run a search over the node_modules folder, and it takes a really long time. That's a lot of data. So you index it, and again make that trade-off of write-time performance for query-time performance. That's what indexing is; just to demystify it, that's what it is. Embeddings are known for semantic similarity, but today "embeddings" is really just a generic concept of information compression; there are actually many tools you can use embeddings for. I think embeddings for code are still extremely early and underrated. But regex is obviously an incredibly valuable tool, and we've now built regex search natively into Chroma, both single-node and distributed. You can do regex search inside of Chroma, because we've seen it's a very powerful tool for code search. It's great. And we build indexes to make regex search go fast at large data volumes. On the coding use case you mentioned, another feature we added to Chroma is the ability to do forking. You can take an existing index and create a copy of it in under 100 milliseconds, for pennies, and in so doing you can just apply the diff of what files changed to that new index. So any corpus of data that's logically.
C
—changing. So very fast re-indexing is the result.
A
Right, and now you can have an index for each commit. So if you want to search different commits, different branches, or different release tags — any corpus of data that's logically versioned — you can now search all those versions very easily and very cost-effectively. So yeah, that's roughly how I think about regex and indexing and embeddings. The needle continues to move here; anybody who claims to have the answer, you just shouldn't listen to them.
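The write-time-for-query-time trade-off Jeff describes can be seen in a few lines. This is a toy inverted index, not Chroma's implementation: ingestion gets slower because every document is tokenized and indexed up front, and in exchange a query becomes a dictionary lookup instead of a scan over every file.

```python
from collections import defaultdict

class InvertedIndex:
    """Trade write-time work for query-time speed: index every
    document at ingestion so queries become lookups."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        # Slower writes: we pay the tokenization/indexing cost here.
        self.docs[doc_id] = text
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, token):
        # Fast reads: a lookup instead of scanning every document.
        return sorted(self.postings[token.lower()])

idx = InvertedIndex()
idx.add(1, "retrieval engine for AI applications")
idx.add(2, "regex search over large code bases")
print(idx.search("search"))  # [2]
```

At 15 files the scan is fine and the index is overhead; at node_modules scale, the index is what makes search feel instant.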
B
When you said that code embeddings are underrated, what do you think that is?
A
Most people just take generic embedding models that are trained on the Internet and try to use them for code. And it works okay for some use cases, but does it work great for all use cases? I don't know. Another way to think about these different primitives and what they're useful for: fundamentally, we're trying to find signal. Text search — lexical search — works really well when the person writing the query knows the data. If I want to find the spreadsheet in my Google Drive that has all my investors, I'm just going to type in "cap table," because I know there's a spreadsheet in my Google Drive called Cap Table. Full-text search: great, it's perfect, because I'm a subject matter expert in my own data. Now, if you wanted to find that file and you didn't know I had a spreadsheet called Cap Table, you're going to type in "the spreadsheet that has the list of all the investors," and of course in embedding space, semantic space, that's going to match. So again, these are just different tools, and which blend of them is the right fit depends on who's writing the queries and what expertise they have in the data. My guess is that for code today, something like 85 or 90% of queries can be satisfactorily run with regex. Regex is obviously the dominant pattern used by Google code search and GitHub code search. But you can maybe get another 5, 10, or 15% improvement by also using embeddings. Very sophisticated teams use embeddings for code as part of their code retrieval stack, and you shouldn't assume they just enjoy spending money unnecessarily; they're eking out some benefit there.
And of course, for companies that want to be top of their game, corner their market, and serve their users best, this is what it means to build great software with AI: getting to 80% is quite easy, but getting from 80% to 100% is where all the work is. Each point of improvement is a point on the board, a point that users care about, and a point you can use to fundamentally serve your users better.
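The "mostly regex, plus some embeddings" blend is often implemented as hybrid scoring: run both retrievers and merge their ranked lists. One common merge is reciprocal rank fusion; here is a minimal sketch with made-up file names standing in for real results.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g., one from regex/lexical
    search, one from vector search) into a single ranking. Each item's
    score is the sum of 1/(k + rank) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["util.rs", "main.rs", "lib.rs"]     # regex / full-text hits
semantic = ["lib.rs", "search.rs", "main.rs"]  # embedding hits
merged = reciprocal_rank_fusion([lexical, semantic])
print(merged)  # lib.rs and main.rs rise: both retrievers agree on them
```

Documents that appear in both lists accumulate score from each, so agreement between the lexical and semantic retrievers floats a result to the top, which is the behavior you want from a blend.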
B
Do you have any thoughts on the developer experience versus agent experience? This is another case where, well, we should maybe reformat and rewrite the code in a way that it's easier to embed and then train models there. Where are you on that spectrum?
A
Yeah, one tool that I've seen work well for some use cases is, instead of just embedding the code, you first have an LLM generate a natural language description of what the code is doing. Then you embed just the natural language description, or you embed it together with the code, or you embed them separately and put them into separate vector search indexes. "Chunk rewriting" is the broad category for that. The idea is related to indexing: as much structured information as you can put into your write or ingestion pipeline, you should. All the metadata you can extract, do it at ingestion. All the chunk rewriting you can do, do it at ingestion. If you really invest in extracting signal and pre-baking a bunch of those signals on the ingestion side, it makes the downstream query tasks much easier. But also, just because we're here, it's worth saying: people should be creating small golden data sets of which queries they want to work and which chunks should be returned. Then they can quantitatively evaluate what matters. Maybe you don't need to do a lot of fancy stuff for your application; it's entirely possible that just regex or just vector search, depending on the use case, is all you need. Again, anybody who claims to know the answer, the first thing you should ask is: let me see your data. And if they don't have any data, then you have your answer already.
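The ingestion-time enrichment described above can be sketched as a small pipeline: each code chunk gets an LLM-written description and some cheap extracted metadata before anything is embedded. Here `describe` stands in for an LLM call and `embed` for an embedding model; both are stubs for illustration, not a real API.

```python
import re

def extract_metadata(code):
    # Cheap structural signals, extracted once at write time.
    return {
        "functions": re.findall(r"def (\w+)", code),
        "lines": code.count("\n") + 1,
    }

def ingest(chunks, embed, describe):
    """Enrich chunks at write time: description, metadata, and
    separate vectors for code and description (so they can live
    in separate vector indexes)."""
    records = []
    for chunk in chunks:
        description = describe(chunk)
        records.append({
            "code": chunk,
            "description": description,
            "metadata": extract_metadata(chunk),
            "code_vector": embed(chunk),
            "description_vector": embed(description),
        })
    return records

records = ingest(
    ["def add(a, b):\n    return a + b"],
    embed=lambda text: [float(len(text))],      # stub embedding model
    describe=lambda code: "adds two numbers",   # stub LLM call
)
print(records[0]["metadata"])  # {'functions': ['add'], 'lines': 2}
```

All the expensive work happens in `ingest`, which is exactly the trade-off: pay at write time so that query time can filter on metadata and search either vector space cheaply.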
C
I'll give a plug to a talk you gave at the conference on how to look at your data. Yes, looking at your data is important; having golden data sets — these are all good practices that I feel like somebody should put into a little pamphlet. Call it the Ten Commandments of AI Engineering or something.
A
Okay, might do that.
C
Thou shall look at your data. We're about to move on to memory, but I want us to sort of leave space for any other threads that you feel like you always want to get on a soapbox about.
A
That's a dangerous thing to ask.
C
I have one to key off of, because I didn't know where to insert it in the conversation, but we were skirting near it. I think you had this rant about RAG: the original transformer was an encoder-decoder architecture, then GPT turned most transformers into decoder-only, but we're also encoding everything with embedding models, which are encoder-only. So in some sense we've decoupled the transformer: first we encode everything with an encoder-only model and put it into a vector database like Chroma — and Chroma also does other stuff — and then we decode with the LLMs. I just think it's a very interesting meta-observation about the overall architecture; it steps out from just the model to models and systems. I'm curious if you have any reflections on that, or any modifications to what I just said.
A
I think there's some intuition there, which is that the way we do things today is very crude and will feel very caveman in five or ten years. Why are we going back to natural language? Why aren't we just passing the embeddings directly to the models, which are just going to functionally put it back into latent space anyway? Right.
C
Yeah. They have a very thin embedding layer.
A
Yeah. So there are a few things I think might be true about retrieval systems in the future. Number one, they just stay in latent space; they don't go back to natural language. Number two — and this is actually starting to change, which is really exciting — for the longest time we've done one retrieval per generation: you retrieve and then you stream out n tokens. Why are we not continually retrieving as we need to?
C
Agentic RAG.
A
Don't call it that. But there was a paper — or maybe just a GitHub repo — that came out a few weeks ago, I think unfortunately called RAG-R1, where they take a DeepSeek-R1-style model and give it retrieval as a tool. So in its internal chain of thought, in its inference-time compute, it's actually searching.
C
There's also retrieval augmented language models. I think this is an older paper.
A
Yeah, yeah. There's a bunch — REALM, RETRO — there's a long history here. And somehow it's not that popular. I don't know why. Well, a lot of those have the problem that either the retriever or the language model has to be frozen, and then the corpus can't change, which most developers don't want to deal with. The developer experience around it is hard.
C
I would say we would do it if the gains were that high — or maybe the labs don't want you to do it.
A
I don't know about that.
C
Yeah, because the labs have a huge amount of influence.
A
The labs have a huge amount of influence. But I think it's also just that you don't get points on the board by doing that well. No one cares; the status games don't reward you for solving that problem. So broadly: continual retrieval, I think, will be interesting to see emerge — that's number one. Number two, staying in embedding space will be very interesting. And then there's some interesting stuff about GPUs and how you page information into memory on GPUs that I think can be done much more efficiently — that's more like five or ten years in the future. But yeah, I think when we look back, we'll think the way we do things today was hilariously crude.
C
Maybe, maybe not. We're solving IMO problems with just language. It's great. I'm still working through the implications of that — it's a huge achievement, but also very different from how I thought we would do things.
B
You said that memory is the benefit of context engineering. I think you had a rant on Twitter about stop making memory for AI so complicated. How do you think about memory and what are maybe the other benefits of context engineering that maybe we were not connecting together?
A
I think memory is a good term. It's very legible to a wide population — again, this is continuing the anthropomorphization of LLMs. We understand how we as humans use memory. Well, some of us are very good at using memory to learn how to do tasks, and at those learnings being flexible to new environments. And the idea of being able to sit down next to an AI and instruct it for ten minutes or a few hours — just tell it what you want, and when it does something, say, hey, actually do this next time, the same way you would with a human — and at the end of those ten minutes or few hours the AI can do the task at the same level of reliability a human could, is an incredibly attractive and exciting vision. I think that will happen. And memory is the term everybody can understand; we all understand it, our moms all understand it, and the benefits of memory are very appealing. But what is memory under the hood? It's still just context engineering, which is the domain of how you put the right information into the context window. So I think of memory as the benefit, and context engineering as the tool that gives you that benefit. There may be other stuff as well — maybe there's some version of memory where you're actually using RL to improve the model itself over the data it's seen — so I'm not suggesting that changing the context is the only tool that gives you great performance on tasks, but I think it's a very important part.
B
Do you see a big difference between synthesizing the memory, which is like, based on this conversation, what is the implicit preference? Yeah, that's one side and then there's the other side, which is based on this prompt. What are the memories that I should put in?
A
I think they will be all fed by the same data, so the same feedback signals that tell you how to retrieve better will also tell you what to remember better. So I don't think they're actually different problems. I think they're the same problem.
C
To me, the thing I'm wrestling with a little more is just: what are the structures of memory that make sense? There are obviously all these analogies with long-term memory and short-term memory — let's try to coin something around sleep. I do think there should be some sort of batch collection cycle, maybe a sort of garbage-collection cycle, where the LLM is sleeping. But I don't know what makes sense. We're making all these analogies based on how we think humans work, but maybe AI doesn't work the same way. I'm curious about anything you've seen that's working.
A
Yeah — again, as a through line of this conversation, I always get a little bit nervous when we start creating new concepts and new acronyms for things, and all of a sudden there are infographics like "here are the 10 types of memory." And you're like, why? If you squint, they're all the same thing. Do they have to be different?
C
You know, you have to blow people's minds.
A
No, I don't think you do. I don't know. You've got to resist the slot machine. Compaction has always been a useful concept in databases, even in the databases on your computer — we all remember running defrag on our Windows machines in 1998. So yeah, again, some—
C
—of us are not old enough to have done that.
A
I am not at this table. Yeah, so obviously offline processing is helpful, and I think it's helpful in this case too. As we were talking about before: what is the goal of indexing? To trade write-time performance for query-time performance. Compaction is another tool in the toolbox of write-time performance. You're re-ingesting data — it's not indexing.
C
But actually it is indexing.
A
It's sort of re-indexing, yeah. You're taking data and saying: maybe these two data points should be merged, maybe they should be split, maybe they should be rewritten, maybe there's new metadata we can extract from them. Let's look at the signal of how our application is performing and figure out whether we're remembering the right things or not. The idea that there's going to be a lot of offline compute and inference under the hood that helps AI systems continuously self-improve is a sure bet.
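One way to picture that offline pass: periodically sweep the stored memories and merge entries that say the same thing, the way a database compacts segments. This is a toy sketch — a real system would compare embeddings and have an LLM rewrite the merged entry, not use the substring test below.

```python
def compact(memories, similar):
    """Offline compaction pass: merge entries the `similar`
    predicate flags as redundant, keeping the longer (more
    informative) version of each pair."""
    kept = []
    for mem in memories:
        for i, existing in enumerate(kept):
            if similar(mem, existing):
                # Merge by keeping whichever version carries more detail.
                kept[i] = max(existing, mem, key=len)
                break
        else:
            kept.append(mem)
    return kept

notes = [
    "user prefers dark mode",
    "user prefers dark mode in the editor and terminal",
    "user's timezone is UTC+2",
]
# Toy similarity: one string contains the other.
merged = compact(notes, lambda a, b: a in b or b in a)
print(merged)  # two entries remain; the dark-mode notes were merged
```

The point is the shape of the job, not the predicate: it runs offline, it re-ingests what it already has, and it trades background compute for a cleaner store to retrieve from later.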
B
Part of the sleep time compute thing that we talked about was pre computing answers. So based on the data that you have, what are likely questions that the person is going to ask and then can you pre compute those things? How do you think about that in terms of chroma?
A
We released a technical report maybe three months ago titled Generative Benchmarking. The idea is that having a golden data set is really powerful. A golden data set is a list of queries and the chunks those queries should return. Now you can say: this retrieval strategy gives me 80% of those chunks for these queries, whereas if I change the embedding model I get 90%. That is better. You also need to consider cost and speed and API reliability and other factors when making good engineering decisions, but now you can measure changes to your system. What we noticed was that developers had the data — they had the chunks, they had the answers — but they didn't have the queries. So we did a whole technical report on how to teach an LLM to write good queries from chunks, because you want chunk-query pairs, and if you have the chunks, you need the queries. You could have a human do manual annotation, obviously, but humans are inconsistent and lazy, and QA is hard. So can we teach an LLM to do it? We proved out a strategy for doing that well. Generating query-answer pairs is really important for benchmarking your retrieval system — and frankly, the golden data set is also the same data set you'd use to fine-tune in many cases. So yeah, there's definitely something very underrated there.
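The measurement loop described here — a golden set of query-to-chunk pairs, scored by how often the retriever returns the right chunk — is only a few lines. This is a generic recall@k sketch, not Chroma's generative-benchmarking code; the queries, chunk IDs, and toy retriever are invented for illustration.

```python
def recall_at_k(golden, retrieve, k=5):
    """golden: list of (query, expected_chunk_id) pairs.
    retrieve: function mapping a query to a ranked list of chunk ids.
    Returns the fraction of queries whose expected chunk
    appears in the top-k results."""
    hits = sum(
        1 for query, expected in golden
        if expected in retrieve(query)[:k]
    )
    return hits / len(golden)

golden = [
    ("how do I fork an index", "docs/forking"),
    ("regex search syntax", "docs/regex"),
]
# Toy retriever standing in for the real system under test.
fake = {
    "how do I fork an index": ["docs/forking", "docs/intro"],
    "regex search syntax": ["docs/intro", "docs/quickstart"],
}
print(recall_at_k(golden, lambda q: fake[q], k=2))  # 0.5
```

Swap the embedding model or chunking strategy, rerun, and compare the number — that single comparison is what turns "stir the pot" into an engineering decision.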
C
Yeah, I'll throw a plus-one on that. As much attention as the context rot paper is getting, I feel like generative benchmarking was a bigger aha moment for me, just because I'd never come across that concept before. And I think more people will actually apply it to their own situations. Whereas context rot is just, generally: yeah, don't trust the models that much — and there's not much you can do about it except better context engineering.
A
Yeah, yes, yes.
C
Whereas with generative benchmarking, you're like: yeah, generate your evals — and part of that is that you're going to need the data sets, and it'll walk you into all the best practices everyone advocates for. So yeah, it's a very nice piece of work.
A
Having worked in applied machine learning developer tools for 10 years now, the returns on a very high quality, small, labeled data set are so high. Everybody thinks you have to have a million examples or whatever. No — even just a couple hundred high quality examples is extremely beneficial. And I say to customers all the time: what you should do is say to your team, Thursday night we're all going to be in the conference room, we're ordering pizza, and we're going to have a data labeling party for a few hours. That's all it takes to bootstrap this.
C
Google does this. OpenAI does this. Anthropic does this. You are not above doing this, you know.
A
Right, yeah, exactly. Yeah, yeah. Look at your data again. It's what matters.
C
Label. Maybe we should rebrand that as "label your data," not "look at your data," because "look at" seems a bit too—
A
I agree with that. Yeah, "look at" is a bit more view-only, right? I agree with that. Yeah.
C
Read and write, read and write. While you mentioned it, I should correct myself: it wasn't Standard Cognition, it was Standard Cyborg. My favorite fact about you is that you're also a cyborg, with your leg. If you see Jeff in person, you should ask him about it. Or maybe not. Maybe don't. I don't know.
A
I don't care.
C
Doesn't care. Standard Cyborg, MightyHive, and, you know — what were the lessons there that you're applying to Chroma?
A
Yeah — more than I can count. It's a bit of a cliché, and it's very hard to be self-reflective and honest with yourself about a lot of this stuff. But viewing your life as very short — a vapor in the wind — and therefore only doing work that you absolutely love, doing that work with people you love spending time with, and serving customers that you love serving, is a very useful North Star. It may not be the North Star for printing a ton of money; in some sense there may be faster ways to scam people into making five million dollars or whatever. But if I reflect on my prior experiences — and I'm happy to go into more detail, obviously — I was always making trade-offs: with the people I was working with, or with the customer I was serving, or with the technology and how proud I was of it. And maybe it's an age thing, I don't know, but the older I get, the more I just want to do the best work that I can. And I want that work to not just be great, but to be seen by as many people as possible, because ultimately that's what impact looks like. Impact is not inventing something great and nobody using it. Impact is inventing something great and as many people as possible using it.
C
And we can skip this question if it's sensitive, but is any of that guided by religion — by Christianity? I only ask because I think you're one of a growing number of openly, outwardly, positively religious people in the Valley, and that's what I want to explore. I'm not that religious myself. How does that inform how you view your impact and your choices? There was a little of that in what you just said, but I wanted to tease it out more.
A
I think increasingly modern society is nihilist. Nothing matters. It's absurdist. Right? Everything's a farce. Everything is power.
C
Everything's a comedy.
A
Everything's a comedy meme. Yeah, exactly. And so it's very rare — and I'm not saying I'm always the living exemplar of this — but it's very rare to meet people who have genuine conviction about what flourishing for humanity looks like, and who are actually willing to sacrifice a lot to make it happen and to start things they may not see completed in their lifetimes. It used to be commonplace for people to start projects that would take centuries to complete, and now that's less and less the case.
C
The image that comes to mind is the Sagrada Família in Barcelona, which was started more than 140 years ago and is completing next year.
A
Yeah, I've seen it under construction, but I can't wait to see it completed as well.
C
Yeah, I'm sure the places are booked out already.
A
Yeah. There are actually a lot of religions in Silicon Valley. I think AGI is also a religion. It has a problem of evil: we don't have enough intelligence. It has a solution, a deus ex machina; it has a second coming — AGI, the singularity, is going to come and save humanity, because we will have infinite and free intelligence, therefore all of our problems will be solved and we will live in a sort of state of grace for all eternity. It's going to solve death, right? So I think religion still exists in Silicon Valley. There's a kind of conservation of religion: you can't get rid of it.
C
The God gene.
B
Yeah.
A
I mean, people have different terms for this, but I think that I'm always skeptical of religions that haven't been around for more than five years. Put it that way.
C
Yeah, there's survivorship bias. Anyway, I do think you're one of the more prominent ones I know of, and I think you guys are a force for good; I like to encourage more of that. People should believe in something bigger than themselves and plant trees under whose shade they will never sit. Am I mangling the quote? Is that actually a biblical quote?
A
I don't think it's a biblical quote, but I like that quote. That's a good one. So, yeah, plus one.
C
I think society really collapses when you just live for yourself. That really is true.
A
Agreed.
B
Who does your design? Because all of your swag is great. Your office looks great, the website looks great, the docs look great. How much of that is your input? How much of that do you have somebody who just gets it, and how important is that to, like, making the brand part of the culture?
A
I think — you know, going back to the Conway's Law thing — you ship your org chart; you ship what you care about as a founder, in some sense. And I do care deeply about this aspect of what we do, so it does come from me in some sense, though I can't take all the credit for everything we've done. We've had the opportunity to work with some really talented designers, and we're hiring for that as well — so if people are listening to this and want to apply, please do. It's cliché to crib Patrick Collison quotes, but he does seem to be one of the most public embodiers of this idea — I'm not sure it's a direct quote from him, to be clear; it's more of a broad aphorism — that how you do one thing is how you do everything. It's about ensuring there's a consistent experience of what we're doing. Like you said: if you come to our office, it feels intentional and thoughtful. If you go to our website, it feels intentional and thoughtful. If you use our API, it feels intentional and thoughtful. If you go through our interview process, it feels intentional and purposeful. That's so easy to lose, and in some ways the only way you keep it is by insisting that the standard remain. I think that's one of the main things I can really do for the company as a leader. It's sort of cringe to say, but you do kind of have to be the curator of taste. It's not that I have to stamp everything before it goes out the door, but at a minimum — in companies, it's often not even legible that quality has gone downhill, that any one thing is bad or worse. It's more that people have their own expressions of what good looks like, they turn that up to 11, and the brand becomes incoherent. What does this thing mean? What do they stand for? There's no longer a single voice. Yeah.
Again, I'm not claiming that I'm perfect at this or good at this, but we certainly wake up every day and we try to.
C
It's a very powerful skill you have — conveying straightforward principles, values, and thoughtfulness in everything you do. I've been impressed with your—
A
Work for a while. Thank you.
B
Anything we're missing? You're hiring designers — any other roles that you have open that you want people to apply for?
A
If you're a great product designer who wants to work on developer tools, I think we have one of the most unique opportunities at Chroma. If you're interested in extending the kind of research that we do, that's also an interesting opportunity. And we're always hiring very talented engineers who want to work with other people who are passionate about low-level distributed systems — in some ways, solving all the hard problems so that application developers don't have to.
C
When you say that — can you double-click on "low-level distributed systems"? People always say this, and then it's like, okay, Rust, the Linux kernel. What are we talking about here?
A
Yeah, maybe a useful encapsulation of this is: if you care deeply about things like Rust, or deterministic simulation testing, or—
C
Raft, Paxos, consensus, TLA+. Really? Wow.
A
I'm saying these are proxies for: you would like the work that we do here.
C
I just really want to tease out the hiring message, but part of my goal is also to identify the type of engineer that startups are really trying to hire and cannot get. Because the better we can identify this thing, maybe I can create some kind of branding around it, create an event — there's a supply side and a demand side and they can't find each other. That's why I put AI Engineer together; that was part of it. But this distributed systems person, which I have heard about from you and a hundred other startups — what is the skill set? What are they called? What do they do? Part of it is cloud engineering, because a lot of the time you're just dealing with AWS, or debugging network calls and consistency issues if you're doing replication or whatever. Where did they go? What do they do? But they don't use TLA+ at work.
A
Probably not, yeah. I mean, last year I started the SF Systems group.
C
Yes. The reading group.
A
Yeah. There are presentations, and the point was: let's create a meeting place for people who care about this topic, because there wasn't really one in the Bay Area. So that continues to run, which is great. And to be clear, we have a lot of people on the team who are extremely good at this — it's not that we have zero, it's that we have six or seven and you want 20. In some ways I feel like our product roadmap is very obvious; we know exactly what we need to build for even the next 18 months. But quality and focus are always limiting functions. And yes, I will always make my land acknowledgement to The Mythical Man-Month eventually.
C
Let's get more people.
A
You kind of do need more people, because you need more focus — you need more people who care deeply about the work that they do. AI is certainly an accelerant, and it's helpful; the reason our team is still very small today relative to many of our competitors is that I think we've really embraced those tools.
C
Are you a Cursor shop? Claude Code? Windsurf?
A
People use whatever they want. Yeah, all of those tools get some usage internally. So far, though, we've still not found any AI coding tools that are particularly good at Rust. I'm not sure why that is, other than the obvious: there just aren't that many examples of great Rust on the Internet.
C
You would think that Rust's errors would help it debug itself, right?
A
You would think.
C
Apparently not. Okay.
A
All right.
C
I have zero experience on that front. I've contributed three things to the Rust SDK of Temporal, and that was my total experience of Rust. But I think it's definitely on the rise. It's Zig, it's Rust — and I don't know if there's a third cool language.
A
Go counts?
C
Golang. Yeah. If you're in that bucket, reach out to Jeff. But otherwise, I think we're good.
B
Thanks for coming on.
A
Thanks for having me, guys. Good to see you.
C
Thank you.
Episode: Long Live Context Engineering – with Jeff Huber of Chroma
Date: August 19, 2025
This episode dives deep into "context engineering"—the emerging discipline of designing, optimizing, and maintaining the information presented to AI models during inference—with Jeff Huber, Founder & CEO of Chroma. Chroma is a leading open-source vector database and search infrastructure for AI applications. The discussion covers Chroma’s evolution, technical innovations, the philosophy of context windows in modern AI, real-world problems in retrieval, the pitfalls of “RAG” dogma, and the future of memory in AI systems. The conversation is both illuminating and pragmatic, featuring lessons from startup building, developer experience, distributed systems, and personal reflections on meaning and mission.
[00:22–02:56]
Origins & Motivation:
Chroma’s Focus:
[03:04–04:30]
[05:03–08:08]
[08:33–12:16]
Product Metrics:
Cloud Launch & Zero-Config Ethos:
[12:16–15:33]
The Meme of Context Engineering:
RAG is Dead, Long Live Retrieval:
Agent vs. Non-agent Systems:
[15:33–19:20]
Context Rot:
Marketing vs. Reality:
Call to the Community:
[23:14–29:36]
Field Observations:
Two-Stage Retrieval Rising:
Code Context is Special:
[30:02–34:03]
Embeddings in Code Search:
Chunk Rewriting & Ingestion-Time Enrichment:
Golden Datasets:
[34:16–41:53]
From Encoder-Decoder to Direct Latent Space:
Continuous/Iterative Retrieval:
Offline Processing and Compaction:
[45:25–49:53]
Personal Purpose and Reflection:
Faith, Conviction, and Societal Values:
[49:59–52:19]
[52:35–56:09]
On shifting from demo to production:
“The gap between demo and production didn’t really feel like engineering. It felt a lot more like alchemy.”
– Jeff Huber (00:49)
On mislabels in AI:
“The term RAG. We never use the term RAG. I hate the term RAG.”
– Jeff Huber (13:08)
On context engineering’s core:
“Context engineering is the job of figuring out what should be in the context window at any given LLM generation step.”
– Jeff Huber (13:10)
On model marketing vs. practical reality:
“There was this bit of this sort of implication where like, ‘Oh look, our model is perfect on this task, needle in a haystack, therefore the context window you can use for whatever you want.’ …That is not the case today.”
– Jeff Huber (17:02)
On brute force LLM reranking:
“Application developers who already know how to prompt are now applying that tool to reranking. …This is going to be the dominant paradigm.”
– Jeff Huber (25:34)
On analogies and new AI concepts:
“I always get a little bit nervous when we start creating new concepts and new acronyms for things. …If you squint, they’re all the same thing.”
– Jeff Huber (40:48)
On working and living with intention:
“Only doing work that you absolutely love doing and only doing that work with people that you love spending time with and serving customers that you love…is a very useful North Star.”
– Jeff Huber (45:46)
On multi-generational building:
“It used to be commonplace that people would start projects that would take centuries to complete, and now that’s less and less the case.”
– Jeff Huber (48:23)
On the company’s curation and taste:
“How you do one thing is how you do everything and just ensuring that there’s a consistent experience of what we’re doing.”
– Jeff Huber (50:18)
This episode provides both strategic and intensely practical insight into the bleeding edge of AI infrastructure, the emerging field of context engineering, and the mindset required to build high-impact developer tools in an uncertain and noisy space. Jeff Huber’s mix of technical rigor, startup wisdom, and human perspective will be invaluable for AI engineers, founders, and anyone navigating the world of modern software and intelligent systems.
For more, visit latent.space.