
Narrator
Vector search has become a foundational technology for AI applications, enabling everything from semantic code search to contextual retrieval for large language models. However, a major challenge with vector databases has been the cost as data storage scales. TurboPuffer is a vector database that focuses on speed, cost and scalability. It was created by Simon Haroub Eskelson and Justin Lee in 2023 and has seen adoption from high profile companies such as Cursor and Notion. Simon joins the podcast with Gregor Vand to discuss the origin of TurboPuffer, its unique technical design, the economics of vector storage, and more. Gregor Vand is a CTO and founder currently working at the intersection of communication, security and AI, and is based in Singapore. His latest venture, Wintik AI, reimagines what email can be in the AI era. For more on Gregor, find him at van HK or on LinkedIn.
Gregor Vand
Hello, welcome to Software Engineering Daily. My guest today is Simon Eskelson.
Simon Haroub Eskelson
Thank you for having me, Gregor.
Gregor Vand
Yeah, great to have you here, Simon. We're here today to talk all about TurboPuffer, which is a company some of our audience may have heard of. Equally, maybe quite a few haven't yet, and that's why we're here today talking all about it. However, you are probably using this product behind the scenes without realizing it, so we're going to get into that as well. But I think to begin with, Simon, you've got a really interesting backstory, and we always dive into those with our guests. So maybe just talk us through very briefly how you got to TurboPuffer. And I'll just call out that you did spend quite a chunk of time at Shopify, and I think it'll be interesting to understand how that's helped lead into what TurboPuffer is today as well.
Simon Haroub Eskelson
Yeah, that's right. I started my career at Shopify and moved to Canada, where Shopify was built, as a result. I moved from Denmark, where I grew up, and worked on infrastructure at Shopify for almost a decade. When I joined in 2013 it was in the hundreds of requests per second, and when I left we regularly saw peaks north of 1 million requests per second. And as part of that I worked on more or less every single aspect of the infrastructure. Generally the bottlenecks for that kind of scale tend to appear in the data layer. Those are the most persistent ones. So I spent, yeah, almost the entire time working on that layer between the Rails app and the databases, and sometimes inside the databases themselves, but mostly on top of them. So yeah, almost 10 years working on every single aspect of database scalability at Shopify. And then after that I spent about two years working in small increments at my friends' companies on whatever infrastructure problems they had. It turns out in 2022, 2023, it's mostly tuning Postgres. Vacuum was the persistent infrastructure problem. And it was through that that I discovered a new type of database was ready to be built. A bunch of things had changed that allowed a new database architecture to be built, and it felt like there were going to be lots of companies that wanted to connect a lot of data to AI, and that a new search engine on this new storage architecture was ready to be built. And so I started hacking on the first version of TurboPuffer in the summer of 2023.
Gregor Vand
Nice. And just sticking on that for a second. I think, again, the audience might find it helpful to understand that you left Shopify, but it was almost, I guess, two years before, as you call out, you started on TurboPuffer. What was the choice around not just launching straight into something else? I guess it's around this idea that you're maybe waiting for a problem to solve, so to speak. But how would you speak to doing what you did, which was just to go and help other people with their problems for a little while, for example?
Simon Haroub Eskelson
Yeah, I went to Shopify right after high school. So I'd spent my entire career inside of one company. And so it felt like I couldn't develop high conviction on joining one particular company, because I'd only been in one company basically my entire life. I did a little bit of work for another startup during high school, but other than that it had been a decade with one company. So I had a bunch of friends who had a bunch of infrastructure challenges that seemed interesting. So I wanted to pop around and, on small contracts, solve a very real problem at SaaS companies that had one, and just over-deliver on some infrastructure promise. And that was a really good way for me to see a bunch of companies. That was really the primary concern. And I took summers off for two years. That was nice, and it gave me a bunch of space to get in the best shape of my life and some other things. I didn't leave with the intention of founding a company. I didn't leave with the intention of really anything other than that I was ready for a change. And this felt like the natural thing. I spent a lot of time then also working on my napkin math, which is a bunch of blog posts I've written about first-principles reasoning, sort of: okay, the memory bandwidth is this much, so you should be able to do the task in this long. I spent a bunch of time writing that and building out all the numbers for it. And then at some point, this problem just started staring me in the eye and I couldn't stop thinking about it.
Gregor Vand
Yeah, awesome. So let's get into TurboPuffer. So let's start with the origins, if you like. You did sort of allude to it. You were helping some friends and you've written a few blog posts about TurboPuffer. Generally, one of the first ones is kind of the origin story. I believe you were helping friends at Readwise. Do you want to just talk us through? Like, where did this come from?
Simon Haroub Eskelson
Yeah. So Readwise is run by some of my friends. It's a company whose first product was essentially this: every highlight that you made on Kindle, they'd send you an email every day or every week with some of them. And I love the product and have always been close with the founders there. They were launching a reader product at the time. So read it later: you take an article, it's nicely formatted, it's on your phone, you can listen to it. And for that they just needed their Postgres database up to snuff. And so I was helping them with that, again tuning autovacuum. Again, it's the biggest problem we have. I felt a bit gaslit by the orange site here, having spent like a decade thinking that MySQL was a worse database than Postgres. Turns out that they're just two different databases. And one of the biggest challenges with Postgres is tuning autovacuum, which I spent some time doing for them. Anyway. So for Readwise, one of the other things that we were working on, this was the fall of 2022, where I feel like almost everyone remembers what they were doing when ChatGPT dropped. And when ChatGPT dropped, I was in the middle of doing this database tuning, and we decided to just build a bunch of AI functionality. And it's like, oh, maybe we can do this to help people comprehend what's going on in this article. And at the time, the context windows were pretty small, and so we started using vectors very early on to draw the right parts of the article in. I also played around with creating a little recommendation engine, so it would take one article and recommend another article that was similar. And so it was clear that there was sort of this foundational layer brewing. And as we had something working, I ran the back-of-the-envelope math on what it would cost to take all of the articles that were in Readwise Reader, even at the time, which was pre-launch, and put them into a vector index on one of the databases around then that could do it.
And their Postgres database at the time was costing them a couple thousand dollars, 3k a month. And the back-of-the-envelope math told me that on a reputable vector database that we thought could do the workload, it would cost 30 grand a month. Even if the company could afford that, it clearly detracted from launching this feature. Right. We were just not going to invest in something that was going to cost an order of magnitude more to store a subset of the data, just for powering recommendations and pulling some context into an LLM. And we just sort of put it in the bucket of, well, the token costs for the LLMs are coming down, so presumably someone's going to figure out how to make this cheaper as well. And I just sort of saw out the engagement there and got their database up to snuff, and they launched, and it was a very successful launch for them. And I couldn't stop thinking about how we were going to make this vector storage cheaper. My memories of running the search clusters at Shopify were coming back to me. It was one of the worst clusters to get woken up by. And it very quickly became obvious to me that all of the incumbents were storing everything in memory. And if you have a kilobyte of text, as you might have in an article, and often there are multiple kilobytes of text, then in order to turn that into vector embeddings, you're going to split it into a bunch of chunks, right, that are semantically meaningful. So you maybe take a kilobyte of text and there's four paragraphs, so four chunks. You turn every one of them into a vector. Those vectors might be 6 kilobytes each. So now you have 20 to 30 kilobytes of vector data from 1 kilobyte of text. So that's 20 to 30x amplification. For a normal full text search index, the amplification to build all the structures to do full text search is around 2, maybe 3 on a bad day, depending on the data distribution, so somewhere between 1.8 and 3. For vector indexes, we're now talking about 30x, right?
Just to store the vectors, let alone build an index. And so that was why it was so prohibitive, why the costs were so high, let alone that all of these databases had to store everything in DRAM, which on a cloud provider costs somewhere between two and five dollars per gigabyte. And it was clear to me that at Readwise, the economics didn't line up, right? It would have been like, I don't know how many users they had at the time, right?
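The amplification arithmetic Simon describes can be written out as a rough sketch. The chunk count, vector size, and DRAM price range come from the conversation; the 1536-dimension float32 embedding (which happens to be 6 KiB), the 100 GB corpus, and treating the DRAM price as a monthly figure are assumptions made only for illustration.

```python
# Back-of-the-envelope numbers; all values are rough.
TEXT_KB = 1      # one kilobyte of article text
CHUNKS = 4       # e.g. four paragraphs, one embedding per paragraph
VECTOR_KB = 6    # assumed: 1536 float32 dimensions * 4 bytes = 6 KiB


def amplification(text_kb: float, chunks: int, vector_kb: float) -> float:
    """Ratio of raw vector bytes to the source text bytes."""
    return (chunks * vector_kb) / text_kb


def monthly_cost(gigabytes: float, dollars_per_gb: float) -> float:
    """Storage cost at a flat per-gigabyte price (assumed to be per month)."""
    return gigabytes * dollars_per_gb


amp = amplification(TEXT_KB, CHUNKS, VECTOR_KB)
print(f"vector amplification: {amp:.0f}x")

# Hypothetical corpus: 100 GB of text, before any index structures are built.
vector_gb = 100 * amp
for dram_price in (2.0, 5.0):  # the "two to five dollars per gigabyte" range
    cost = monthly_cost(vector_gb, dram_price)
    print(f"all in DRAM at ${dram_price:.0f}/GB: ${cost:,.0f}")
```

With these inputs the raw vectors alone are 24 times the size of the text, which sits inside the 20 to 30x range quoted above.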
Gregor Vand
But there was a timing thing here, I guess, hardware-wise, in terms of NVMe SSDs. So Non-Volatile Memory Express. Could you maybe just speak to that as well? That sort of backend piece, if you like.
Simon Haroub Eskelson
Yeah, exactly. So I think that to build a good database you need two things. You need a new workload and you need a new storage architecture. And the new workload here seems to be that, well, there's a new way to power these recommendations and to pull context into an LLM. And second, you need a new storage architecture. The new storage architecture that started to become clear to me in early 2023, as I thought about this problem, was that, well, NVMe SSDs are very fast. I think they got faster than anyone really expected. But they were actually not generally available in the clouds until the late 2010s, like around 2017, 2018 on AWS. So databases hadn't been built around them. And the phenomenal thing about NVMe SSDs is that per gigabyte they're about 100 times cheaper than DRAM. But in terms of how much bandwidth, in gigabytes per second, you can get from them, they're only about an order of magnitude, maybe even only 5x, slower than DRAM. So you've got this amazing set of principles. But most databases haven't been built around getting all of that bandwidth through. You have to bypass the Linux page cache to get the maximum. You have to build your storage engine in such a way that it tries to get a lot of data for every round trip. So that was the first thing that had to be true to build a database like TurboPuffer. The second thing that had to be true was about object storage like S3 or GCS, depending on the cloud you're on. It's a very nice primitive when it's consistent. That is, you put an object, and when you read it, it's the same object. And S3 only became consistent at re:Invent in 2020, which is remarkably late, but it's a very nice property to have when you're building a database. You can build around it by creating new files and so on, but primitives are needed. The other primitive that TurboPuffer was waiting for was actually only released at the last re:Invent. We're recording in July of 2025,
so it was only seven months ago that S3 released the final API that is required for a TurboPuffer-like database, which is compare-and-swap. Compare-and-swap allows you, right, to put a file on S3, read it, do a modification to it, and then write it back only if it hasn't been changed during that time. With that simple primitive, you can now build a database like TurboPuffer that doesn't have any dependencies other than object storage, other than S3 or GCS. This API was available in GCS on GCP first, which is why TurboPuffer started in GCP, because it was available there. And we developed strong conviction that this was going to be a ubiquitous feature, because actually every other object storage implementation other than S3 had it. And finally they released it last year, and a month later we went into AWS, and until then we'd run workloads cross-cloud. So these were the three things that we needed to build TurboPuffer that became available around that time. And then suddenly, well, we had the two prerequisites to building a great database. We had a new workload and we had a new storage architecture. And a new storage architecture meant that there was differentiation from tacking this on to existing databases.
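To make the compare-and-swap primitive concrete, here is a minimal sketch of the read, modify, write-back-if-unchanged loop. It runs against a toy in-memory store, not a real S3 or GCS client; `ToyObjectStore`, `put_if_version`, and `append_entry` are hypothetical names invented for this illustration, and the integer version stands in for whatever precondition token (an ETag or generation number) a real object store exposes.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    data: bytes
    version: int  # stand-in for an ETag or generation number


class PreconditionFailed(Exception):
    """Raised when the object changed since it was read."""


class ToyObjectStore:
    """In-memory stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._objects.get(key, Versioned(b"", 0))

    def put_if_version(self, key, data, expected_version):
        # The write succeeds only if nobody has written since we read.
        with self._lock:
            current = self._objects.get(key, Versioned(b"", 0))
            if current.version != expected_version:
                raise PreconditionFailed(key)
            updated = Versioned(data, current.version + 1)
            self._objects[key] = updated
            return updated


def append_entry(store, key, entry, max_retries=10):
    """Read, modify, and write back; retry if another writer got there first."""
    for _ in range(max_retries):
        current = store.get(key)
        updated = current.data + entry + b"\n"
        try:
            return store.put_if_version(key, updated, current.version)
        except PreconditionFailed:
            continue  # lost the race: re-read and try again
    raise RuntimeError("too much contention on " + key)


store = ToyObjectStore()
append_entry(store, "namespace/manifest", b"index-0001")
append_entry(store, "namespace/manifest", b"index-0002")
print(store.get("namespace/manifest"))
```

The point of the sketch is that a writer holding a stale version is rejected rather than silently overwriting newer data, so no separate coordination service is needed to serialize updates.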
Gregor Vand
Yeah, and I mean, you obviously touched on sort of the prerequisites, I guess, but in terms of actually technical challenges in making sort of S3 like storage performant, like, were you aware of those ahead of time, if you like, or is this still something where just through solving the problem you've come across those challenges? And what would you say those were?
Simon Haroub Eskelson
I started working on TurboPuffer in May of 2023. Let's talk about vector indexes for a second, because I think it's relevant to the main challenge: how do you put this on S3? So there's two broad categories of how you build a vector index. If you have a bunch of data, let's say for example that you're Spotify and you want to recommend music to people, then you take every song and you pump it through some kind of model, and it puts it in vector space, and you plot it into this massive coordinate system. Right. Imagine it's in two dimensions. Songs that are adjacent in vector space would also be songs that might be similar. And so clusters naturally form: there's a rock cluster and a pop cluster and a rap cluster or whatever. Right. As the semantic relationships sort of unfold in this space. By 2022, models had gotten very good at taking a piece of data and plotting it into the coordinate system, but the tooling was not so good at actually storing the coordinate system and querying it. So a vector index is essentially just a coordinate system in many dimensions, but if you're visualizing this in your head right now, you should visualize it in two. When I go on an e-commerce site and I search for red dress and they have a burgundy skirt, well, those two points would be very close in vector space. And this is a very nice problem to have solved in such a simple way. But then you're searching for a query across all of the things that you have available, maybe the hundreds of millions of songs that are in Spotify, maybe the billion things that are in an e-commerce catalog. The only way to get the exact result of the vectors that are close in that coordinate system is to look at every single vector and compare it with the query vector to see which ones are closest. This is the only way that you're guaranteed to find the top 10 closest vectors. But this is slow. It's incredibly slow, right?
A million vectors can easily be gigabytes big. And searching gigabytes in memory, if you max out the node, takes maybe a couple hundred milliseconds. So that's maybe not bad in itself. But now every node that's completely maxed out is doing five requests per second. And that's going to cost you thousands of dollars per month, right? Depending on the node sizes and so on. So instead what we do is we have approximate algorithms. We say, okay, we're okay with maybe a 95% or 99% accuracy over what we retrieve. So in the vast majority of cases we're getting the exact results, but in some outlier cases we're not. There's two ways to build these approximate nearest neighbor indexes, or ANN for short. There's others, but these are the two that people have productionized. The first one is the graph-based indexes. So you can imagine that as you're adding things into this coordinate system, you can come up with some heuristics to connect vectors that are adjacent in vector space, or in the coordinate system. You connect them in a graph, and then you can walk the graph by dropping yourself in the middle of it and then continuously moving through the graph to the things that are closest, and you find some good results. This works really well. And it was by far the most common method of doing vector search in 2023. Almost every production use case was using this. And it was what a lot of the vector indexes at the time were doing. The second way of doing it is a cluster-based approach. This is actually the classical method. This was the first thing, I think, that people came up with and productionized, where what you do is you take all of the vectors that you have and you create clusters, right? So if you use the music example, there might be a cluster that's rock, and the pop and the rap clusters, and you create these natural clusters. And then in every cluster, you take the average of all of the things that belong to that cluster, and that's the centroid.
And then when you do a search, you say, okay, I'm searching for songs that are close to this song. And you say, okay, well, it's closest to the pop cluster. So we're going to search only the pop cluster. And that way, you know, you're only searching 33% of everything. And in reality you can cut the space in such a way that you have to search only a small percentage of the entire data set. It's a very nice way to visualize it. And there's some semantic meaning even in what these clusters mean. So these are the two competing ways. And at the time, in 2023, I think it seemed like graphs were going to win. But the thing with graphs is that they're very fast in memory, right? In memory, you can navigate a graph where you're dropped into the center, and then every time you have to read from memory, it's 100 nanoseconds. So you might have to go, okay, 100 nanoseconds to land in the center, 100 nanoseconds to go to the next part in the graph, 100 nanoseconds, 100 nanoseconds. And in aggregate, you know, that's still going to be very fast, because memory is fairly fast and will cache really well in the CPU for these random reads. But when you're talking about disk and you're talking about S3, this doesn't work, right? If you're reading a couple of kilobytes off of S3, you're going to have a P90 in the hundreds of milliseconds. So you land in the middle of the graph, 100 milliseconds, you go to the next node, 100 milliseconds or 200 milliseconds, you go to the next one, next one, next one, and every single time it's 200 milliseconds, because you can't predict which jumps you're going to make. You can use a bunch of heuristics, but you can't really do much. You can then try to shrink the diameter of the graph, right? So you can try to make it so that there's fewer jumps and make the graph, like, denser, for lack of a better word. Right? And this can work.
But it's very difficult on large graphs to get less than maybe 9 to 10 or so round trips on millions or hundreds of millions of vectors. It also becomes very expensive to write, because every time you update one of these nodes you have to sort of update a lot of things around it. And this is also problematic on disk. It's not a big deal in memory, but it's a big deal on disk to do a lot of random reads, and especially on S3, editing a lot of random files. So graphs just don't work that well for it. They are amazing in memory, and it's almost impossible to beat their performance in memory. But when things are on disk or they're on S3, you have round-trip latencies in the hundreds of milliseconds for S3 and in the hundreds of microseconds for disk. Neither of which is going to work well for having lots of these dependent jumps. But of course, if we go back to the clustering method that seems very crude and primitive, well, you could just download all those centroids, right? Here's the rap centroid, here's the pop centroid, and whatever other genres. And you download the centroids file, and it's like a nicely packed binary blob. You search through all of those and you find the closest few clusters, and then you do another round trip to S3 and you download just those clusters. You've gone to S3 twice. You fetch a lot of data. But S3 is fine with that, right? You can download a lot of data in 100, 200 milliseconds. And this works really well for disk too, right? Because you go to disk and you get the chunk of centroids in one big page, and then you get the individual clusters, again in two round trips. And so we built the storage engine all around minimizing the number of round trips. We use a cluster-based index because we want to minimize the number of round trips, which allows us to give really, really good performance, but on these storage mediums that are so much cheaper than memory, right?
Again, memory is around $2 to $5 per gigabyte, S3 is 2 cents per gigabyte, and disks are about 8 to 10 cents per gigabyte.
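The contrast between exhaustive search and the cluster-based approach can be sketched in plain Python. This is a toy two-dimensional illustration of the idea as described above, not TurboPuffer's implementation: the cluster contents, the `fetch_cluster` callback (standing in for the second round trip to storage), and the `n_probe` parameter are all assumptions for the example.

```python
import math


def exact_top_k(query, items, k):
    """Brute force: compare the query with every (id, vector) pair, O(n)."""
    return sorted(items, key=lambda item: math.dist(query, item[1]))[:k]


def centroid(members):
    """Average of all vectors that belong to one cluster."""
    dims = len(members[0][1])
    return tuple(sum(vec[d] for _, vec in members) / len(members) for d in range(dims))


def build_centroids(clusters):
    """clusters: {name: [(id, vector), ...]} -> {name: centroid vector}"""
    return {name: centroid(members) for name, members in clusters.items()}


def cluster_top_k(query, centroids, fetch_cluster, k, n_probe=1):
    # Round trip 1: every centroid arrives in one read (passed in here).
    nearest = sorted(centroids, key=lambda name: math.dist(query, centroids[name]))
    # Round trip 2: download only the closest clusters and scan them exactly.
    candidates = [item for name in nearest[:n_probe] for item in fetch_cluster(name)]
    return exact_top_k(query, candidates, k)


# Toy data: three genres that form well-separated clusters in two dimensions.
clusters = {
    "rock": [("r1", (0.0, 0.0)), ("r2", (1.0, 0.0))],
    "pop": [("p1", (10.0, 10.0)), ("p2", (11.0, 10.0))],
    "rap": [("h1", (-10.0, 10.0)), ("h2", (-10.0, 11.0))],
}
centroids = build_centroids(clusters)

fetched = []  # records which clusters the second round trip touched


def fetch_cluster(name):
    fetched.append(name)
    return clusters[name]


query = (10.2, 10.1)
approx = cluster_top_k(query, centroids, fetch_cluster, k=2)
everything = [item for members in clusters.values() for item in members]
exact = exact_top_k(query, everything, k=2)
print("approximate:", [song_id for song_id, _ in approx])
print("exact:      ", [song_id for song_id, _ in exact])
print("clusters fetched:", fetched)
```

On data this clean the approximate and exact answers agree while only one of three clusters is read. On real data a query near a cluster boundary can miss neighbors in an unprobed cluster, which is why the result is approximate and why raising `n_probe` trades more data fetched for better accuracy.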
Gregor Vand
Yeah. And we're going to get onto how we even measure any of this. I've got one more sidebar question, so we'll come back to that in a second, because with a lot of what you've been describing there, I feel like I can visualize it, and maybe for some of our listeners it's still a little bit difficult to visualize. So we'll come back to that in just a second.
Gregor Vand
You did say this is all about round trips, and you have also written about this idea that, I believe, TurboPuffer should need a maximum of three round trips for what you call sub-second cold latency. So again, could we also just talk a bit about cold and warm here? How does that come into it? Because I think you did a very good job of explaining how to think about these jumps between data. I think a lot of our audience are familiar with the concept of cold and warm data. So how does that come into it as well?
Simon Haroub Eskelson
Yeah, so the canonical source of truth for all data in TurboPuffer is object storage. When you do a write and we return a success back from TurboPuffer, it's been committed to the most durable systems on earth, which are S3, GCS, and friends. When you do a query, we will send it to a node, and if that node has it in memory cache then we'll use that, it's the fastest. And if it doesn't, then we'll go to disk, and if it's not there, then we'll go to object storage. If we haven't seen a query for, say, a week, then it's not going to be in cache anymore and we have to go directly to object storage. So in the first round trip we'll get a bunch of metadata files, like what's the schema, what's the most recent index, and a bunch of other metadata about the namespace. Then we'll get the centroids and a bunch of other things that might be relevant to any filtering that you're doing. And then we'll get the clusters and exactly what we need from the clusters to satisfy the query. And those are the fundamental three round trips. There are situations where we're going to do more round trips. If you're fetching a bunch of information and we can't be sure that fetching all of it in the third round trip is going to give the best performance, the query planner has to make a decision about whether to do another round trip or to try to fetch more data. So that's the scenario for a cold query. Again, these round trips, depending on the size of them and the mood of S3 on that day, take about 200 milliseconds or so each. So we get cold query performance just short of a second, you know, 600 milliseconds, 800 milliseconds. But of course, S3 also has caches, and it depends, but that's generally what we see. In fact, when you do a query to S3, the variance is so high that once we cross the 90th percentile of the latency that we generally see for something of that size, we will send off a second request to try to minimize the variance in these query latencies. That's a cold query.
So that's in the high hundreds of milliseconds. A warm query can just go to memory or to disk. And at that point there's no reason why this can't be as fast as any traditional storage architecture. It's only that cold query that happens once in a blue moon. Some of our customers will do pre-flight queries, right? When you open the Q&A dialogue in Notion, it will send a query to hint to TurboPuffer that we should start warming that index. And we will do that in the order that reduces the latency as fast as possible, to try to minimize this impact for the user. So generally this is not an issue that we see, especially given the cost that we can get by only having that single copy. This is also where the name comes from, right? The pufferfish is fully deflated when the data is only on object storage, and we can inflate it all the way into DRAM or even CPU caches. As you query the namespace more, TurboPuffer gets faster.
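The memory, then disk, then object storage lookup order can be sketched as a small read-through cache. This is an illustrative toy under stated assumptions, not TurboPuffer's code: the `TieredReader` class, its dictionary-backed tiers, and the choice to populate both caches on a cold read are invented for the example. The 200 millisecond figure per round trip is the rough number quoted above.

```python
class TieredReader:
    """Read-through cache: memory first, then disk, then object storage."""

    def __init__(self, object_store):
        self.memory = {}                  # fastest tier, smallest
        self.disk = {}                    # stand-in for local NVMe SSD
        self.object_store = object_store  # source of truth, never evicted
        self.object_store_reads = 0

    def read(self, key):
        if key in self.memory:
            return self.memory[key], "memory"
        if key in self.disk:
            value = self.disk[key]
            self.memory[key] = value      # promote on the way up
            return value, "disk"
        value = self.object_store[key]    # cold: go to the source of truth
        self.object_store_reads += 1
        self.disk[key] = value            # "inflate" into the faster tiers
        self.memory[key] = value
        return value, "object storage"

    def evict_memory(self, key):
        self.memory.pop(key, None)


def cold_latency_ms(round_trips, per_trip_ms=200):
    """Rough cold-query estimate: dependent round trips add up serially."""
    return round_trips * per_trip_ms


reader = TieredReader({"namespace/centroids": b"packed-centroids"})
print(reader.read("namespace/centroids")[1])  # first read is cold
print(reader.read("namespace/centroids")[1])  # second read is warm
reader.evict_memory("namespace/centroids")
print(reader.read("namespace/centroids")[1])  # falls back to the disk tier
print(cold_latency_ms(3), "ms for three cold round trips")
```

Because object storage stays the single source of truth, losing either cache tier costs latency but never data, which matches the "deflated pufferfish" picture in the conversation.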
Gregor Vand
Got it. Okay, that makes sense. So, yeah, sidebar question here. This is just actually jumping fully into the product itself from a visual standpoint, because we're going to come back to a whole bunch of other things about performance, et cetera. But if you're a developer, how can you imagine visualizing the data through TurboPuffer? I've used other Vector Store products, so I can kind of think of how they try to represent this data to me. Some of it kind of worked and some of it I thought didn't make a lot of sense to me. So how has turbopuffer kind of approached this one?
Simon Haroub Eskelson
Do you mean visualizing it to the user?
Gregor Vand
Correct. Yeah. Through like a GUI or so forth? Yeah, yeah.
Simon Haroub Eskelson
Turbopuffer's console is still fairly simple and operational. I would love for there to be more of a playground. What we see our customers do is generally export the namespace locally and then try to visualize it. I don't know how many of our customers do visualize the data and how many of them just start writing evals against it. I would love to help with tooling, but we've been very focused on the database itself.
Gregor Vand
Okay. Yeah, I think that's really helpful to understand. It probably also speaks to where TurboPuffer's been focusing, perhaps versus some other products that are trying to hit all the points at once with a bit more of a general approach, shall we say. Let's go to performance. And I believe that the term in vector stores is recall, so effectively that trade-off between latency and accuracy. I believe TurboPuffer samples at least 1% of all queries internally to measure this, and you have presented that data back, anonymized, through blog posts as well. But talk to us about that. How do you measure, why are you measuring, and where do we go from there?
Simon Haroub Eskelson
Recall is extremely important because if you're building a pipeline that searches, you don't want to think in your evals about whether your search engine is inaccurate. You just don't want to think about that. That should be the job of the vendor or the database that you've chosen. You should not have to manually tune it, and you should not have to manually run recall against it. At TurboPuffer, of course, we iterate on our ANN algorithm. We run it against various benchmarks that we have internally, right? Academic benchmarks and so on, to make sure that it performs. But nothing beats real-world performance, right? In the real world, someone is going to insert the same song a million times into a cluster with lots of other songs. They're going to put that burgundy skirt in there like 20 million times, right? And not realize they have the bug. And they're still going to expect that the accuracy is going to hold up. So overall accuracy is very important to us and we consider it our job. I want to just briefly explain what recall is, and then I'll talk a bit about how we work on it at TurboPuffer. So recall, if we go back to my original example, right: the only way to get the absolute true result of the top 10 closest vector embeddings to a query vector embedding is to look at the entire data set, right, O(n). With ANN, what you do is issue a query with the approximate nearest neighbor index, and then you compare, let's say, the top 10 for the approximate results and the exact results. If you have a recall of 90%, it means that nine of the 10 results in the ANN result were correct. Of course, if it's 1, or 100%, then everything is overlapping. We find that our customers are very happy somewhere between 90 and 100%, erring on around 95% recall, right? So averaged across everything. That's what recall means. And generally anything above 90% is really good.
Recall is a tricky metric because you could search for banana in a data set that only knows about songs, and it's going to give you a result, because in some way, you know, banana is close to something, right? And so the vector distance might be very long, and also, in a clustered index, the further the clusters are from the query vector, the worse the results get. So it's a flawed metric, but it's the best flawed metric that we've got and the one that we're all measuring against. So we're aiming for 90 to 100% recall, erring on the side of 95. And very early on in TurboPuffer's history, we decided to really take this on as our problem. And as you mentioned, a percentage of queries are sampled in production. So it means that for a random percentage, somewhere around 1%, but it sort of scales with how many queries a user does, we will send a query over to a different worker fleet, and it will compute the exact result and the approximate result, compare them, and then report the number back to our instrumentation in Datadog. And so we have a dashboard, right, where we look at every single customer and look at their recall, and we will get big red dots if someone's recall is below 90%. And that's how we evaluate it, because I think production is the only thing that tells the true story. You can't just evaluate against academic benchmarks. I think being on call for almost 10 years at Shopify taught me that nothing matters other than production, and it's going to be the same for accuracy. The only way that we will trust our recall is if we know that on all the production data sets it's above 90%.
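Recall as defined in the conversation is the overlap between the approximate top-k and the exact top-k, divided by k. A minimal sketch follows; the function names and the sample result ids are made up for illustration, and the 90% alert threshold is the figure mentioned above.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids[:k])) / len(exact)


def mean_recall(samples, k):
    """Average recall over sampled (approximate, exact) result pairs."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in samples) / len(samples)


# One sampled query: the ANN index returned nine of the ten true neighbors.
exact_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]
print(recall_at_k(approx_ids, exact_ids, k=10))

# A hypothetical batch of sampled production queries for one namespace.
samples = [(approx_ids, exact_ids), (exact_ids, exact_ids)]
average = mean_recall(samples, k=10)
print(f"mean recall: {average:.2f}", "ALERT" if average < 0.90 else "ok")
```

Order within the top k is ignored here, which is the usual convention for this metric; computing the exact side still needs the full O(n) scan, which is why it is only run on a sample of queries.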
Gregor Vand
Got it. I believe there is something else that comes into this as well: native filtering. Perhaps you could talk to us a little bit about that.
Simon Haroub Eskelson
Filtering is the most important thing with recall because this is where it gets tricky, and this is where just tacking a vector index onto an existing database is not quite enough to ensure that recall is high. So let's take some simple examples and work through them, and explain what this pre-filter, post-filter, in-place filtering, whatever, means. Let's say that you have a query (we'll keep using an e-commerce example here) on a very large e-commerce data set, and we want to filter out any products that are not public. Let's say that 99% of all the products are public, right? Like 1% is sort of held back as people are iterating on them, or haven't released them yet, or they've sold out. If only 1% of the data doesn't match the filter, it's probably fine to just over-fetch by around 1% and then filter it out after. That's a post-filter. And with that, if you evaluate the recall (again, the only accurate way is to evaluate it against everything), you will get very, very high recall with a post-filter. But you could imagine a case where you're matching against a filter that would only match, let's say, 10% of the data set. It could be, I don't know, everything that is the color blue, and that's only 10% of the data set. Well, in that case, if that's 10% of the data set and I've over-fetched 100 items, I'm not going to get that 100% recall, right? The math just doesn't really work out for the precision that you need. So in that case, well, it's only 10% of the data set. So maybe what we can do is just find everything that's blue with another index, like a traditional B-tree index or a bitmap index, and then evaluate all of those vectors. And that works okay. For something that's like 10%, it will be a little bit slow because that's a lot of vectors to look at, but it might be okay. But the trickiest queries are the ones in between, right? They're the ones that filter out maybe 50% of the data set.
Imagine an example like: you're in Singapore, and you want everything that ships to Singapore. Maybe that's 50% of the catalog, right? So it's like, okay, do you post-filter? Well, you have to over-fetch by a lot. You have to get a lot of stuff to make sure that you have the right recall, because in a clustered index the filter might completely eliminate some clusters. Like maybe, you know, food items are never shipping to Singapore, but you're searching for a banana and it's just not matching that much of the cluster. But it might go into the next cluster, like food-colored clothing, which might ship, right? But those are clusters you've cut yourself off from. And when you start to think about it in the cluster sense, the query planner really has to be aware of how much of the different clusters match the filter and how much that cuts off. And then you have to have some heuristics around how many vectors you have to look at, in order of distance from the query vector, to look at enough vectors that you can guarantee a high recall. I realize this is hard to parse, but really what I'm trying to impress here is that the query planner that plans how much data the query needs to look at to get high recall needs to be very aware of both the vector index and also the filtered index to get high recall. And most databases that have just slapped a vector index onto an existing database will only pre- or post-filter. Often the user has to choose. So you have to know about the selectivity of your data set. But it's not a trivial thing to choose.
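The over-fetch arithmetic behind these three cases can be made concrete with a short Python sketch. It rests on a simplifying assumption that the filter is independent of vector distance, which is exactly what breaks down in a clustered index, as discussed above. The function names and the 0.9 and 0.1 thresholds are hypothetical and chosen only to mirror the 99%, 10%, and 50% examples.

```python
import math

def overfetch_needed(k, selectivity):
    """Candidates to fetch so that about k survive a post-filter.
    selectivity is the fraction of documents that match the filter.
    Assumes matches are spread evenly, which real clusters may violate."""
    if not 0 < selectivity <= 1:
        raise ValueError("selectivity must be in (0, 1]")
    return math.ceil(k / selectivity)

def choose_strategy(selectivity, high=0.9, low=0.1):
    """Toy planner heuristic; the thresholds are illustrative assumptions."""
    if selectivity >= high:
        return "post-filter"    # almost everything matches: over-fetch a little
    if selectivity <= low:
        return "pre-filter"     # few matches: find them first, then score them
    return "filter-aware"       # in between: planner must weigh both indexes

if __name__ == "__main__":
    for s in (0.99, 0.5, 0.1):
        print(s, choose_strategy(s), overfetch_needed(10, s))
```

Even under the independence assumption, the number of candidates grows as 1/selectivity, which is why a fixed over-fetch stops working once the filter removes a large share of the data.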
Gregor Vand
Yeah, and that leads quite nicely into what I wanted to touch on next, which is vector indexes. I believe it's the SPFresh vector index that's used at TurboPuffer. This is probably a concept that, at a very high level, most of our audience are at least familiar with to some degree. If you've worked with databases, you probably understand the concept of an index. That is some way of saying to the database, hey, these are the kinds of queries we're going to be making on a very regular basis, so we need an index across, say, these three columns on this table, because that's the kind of lookup that we're going to be doing. That's a relational database index, for example. How does that look in the vector context? And again, the choice here, SPFresh, I'd love to hear about that.
Simon Haroub Eskelson
So we talked about how there are graph-based and cluster-based indexes. The graph-based indexes are really nice because you can add things to them, each item fits neatly into the graph, and you can just keep adding. And you don't have to worry that much about recall, because the recall on a graph-based index is usually phenomenal. That's why they've been so popular, because you just add to them. But given that we chose a cluster-based index for the read reasons that I mentioned before, when you do a clustered index, suddenly, you know, someone could have started adding products in a category or a cluster that didn't really exist when you created the initial clusters. And generally a clustering algorithm sort of has to look at all the data and then decide what the clusters are. But when you get into millions or hundreds of millions or even billions of vectors, that can easily take days. Like, with a state-of-the-art algorithm, it can take days on a very large machine to figure out what the clusters are. So you start using GPUs to try to do it faster, but it becomes very, very difficult to do. So basically these clusters can't be incrementally maintained, right? A B-tree in a traditional database index is really nice because you just sort of insert items into the tree and it nicely balances, and there are lots of properties that make sure you can just incrementally add to it, like a graph. So in a clustered index, to maintain it incrementally, you need a bunch of heuristics to maintain the clusters. You can think about it as like maybe, you know, if we continue with the e-commerce example, you have a shoe cluster and this customer just keeps adding shoes. And at some point the shoe cluster has maybe 1,000 items in it, and it's like, okay, well, that's a lot of data to search every time we search for something relevant to shoes, so we have to split the cluster.
So every time you write, it's like, well, this is a shoe, this is a shoe. And then at some point it reaches some terminal size of the cluster and we split the cluster. So you split the cluster, and then maybe there's a sneaker cluster and a leather shoe cluster, and that's what it's decided to make the two clusters. And you can imagine incrementally maintaining the clusters like this, where they reach some size and we split them. Once in a while, you remove enough items that we have to merge clusters, and every time we do this we have to recompute the centroids of the clusters. It is much more complicated than that, but that is the general idea behind SPFresh: with enough of these heuristics, you can do this. This is probably the most complicated part of the entire TurboPuffer code base. To make this work at very, very large scale, and make it work with recall, and make it work with filters, that is how SPFresh works. And this scales very, very well.
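The split-on-size idea can be illustrated with a toy Python sketch. This is not SPFresh and not TurboPuffer's code: the class name, the tiny size limit, and the simple farthest-point split are assumptions made to show the shape of incremental maintenance, and merging is left out entirely.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Centroid of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

class ClusteredIndex:
    """Toy clustered index: insert into the nearest cluster, split when too big."""

    def __init__(self, max_cluster_size=4):
        self.max_cluster_size = max_cluster_size  # hypothetical tiny limit
        self.clusters = []    # list of lists of points
        self.centroids = []   # one centroid per cluster

    def insert(self, point):
        if not self.clusters:
            self.clusters.append([point])
            self.centroids.append(list(point))
            return
        # Route the write to the cluster with the closest centroid.
        i = min(range(len(self.centroids)),
                key=lambda c: dist2(point, self.centroids[c]))
        self.clusters[i].append(point)
        self.centroids[i] = mean(self.clusters[i])
        if len(self.clusters[i]) > self.max_cluster_size:
            self._split(i)

    def _split(self, i):
        points = self.clusters[i]
        # Pick two seeds that are far apart, then assign each point to one.
        seed_a = max(points, key=lambda p: dist2(p, self.centroids[i]))
        seed_b = max(points, key=lambda p: dist2(p, seed_a))
        left = [p for p in points if dist2(p, seed_a) <= dist2(p, seed_b)]
        right = [p for p in points if dist2(p, seed_a) > dist2(p, seed_b)]
        if not right:
            return  # all points identical (the duplicate case): cannot split
        self.clusters[i], self.centroids[i] = left, mean(left)
        self.clusters.append(right)
        self.centroids.append(mean(right))

if __name__ == "__main__":
    index = ClusteredIndex(max_cluster_size=4)
    for p in [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]:
        index.insert(p)
    print(len(index.clusters), "clusters after inserts")
```

A fuller version would also merge clusters that shrink after deletes and would have to handle the duplicate-heavy data mentioned earlier in the conversation, which this sketch simply leaves as an oversized cluster.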
Redis Representative
Building agentic AI apps isn't just about choosing the best LLM. Agents need short-term memory, long-term recall and lightning-fast retrieval. Without it, you're left with clunky prototypes that never scale. You know Redis, the world's fastest caching solution. It turns out fast data is the key to good context. And good context is essential for fast, accurate memory. It's what makes AI agents actually work with your data. Redis for AI. The right infrastructure, the right tools, the only way to scale. Learn more at Redis IO Genai.
Narrator
Building an app often feels like a balancing act. You want to ship features fast: chat, activity feeds, moderation, video. But building them from scratch is slow and complex. That's where Stream comes in. Stream provides developer-friendly APIs that let you add real-time communication without reinventing the wheel. Why Stream? First, developer experience. Stream has clean open source SDKs and great docs, and you can get a proof of concept running in hours. Second, speed to build. Experiment, prototype or launch features quickly. No credit card required to start. Stream also scales. Over 2,000 global apps, including Strava, Patreon, Nextdoor, Robinhood and Peloton, rely on Stream to power in-app communication for more than a billion users. Whether you're a startup or an enterprise, Stream handles the hard parts so you can focus on what makes your app unique. Get started today at GetStream IO podcast.
Gregor Vand
And moving kind of to scaling generally. My experience with this space, and I know that the people approaching it are doing it in incredibly different ways, is in the email space. And for us, namespacing was kind of a hard requirement. So let's take that example for a second. We've been talking a lot about e-commerce, which I'm also quite familiar with, but let's take a hard left into email for a second. So with email, you've got however many thousands of users, and you absolutely never want someone to be searching and have someone else's email turn up in that search. That would be sort of catastrophic. So namespacing, that is, the idea that every user's data is in a very siloed sort of environment, was incredibly important to us. I'm aware that namespacing is a part of TurboPuffer, and the scale of that is, I believe, like 14 million namespaces or something. So talk to us about that because, I mean. Oh, sorry, 40 million. 40 million namespaces. And I believe that's probably also something where, again, your Shopify experience, which was a huge scaling challenge, I imagine, led into how you've been able to approach this as well.
Simon Haroub Eskelson
Yeah, look, the only way you can scale anything is to shard it. And so we decided at TurboPuffer to make namespacing a core sharding primitive that we expose to the user. Because if you can give us your sharding key in a simple way, well, then we can scale really, really far with you. In TurboPuffer, a namespace maps directly to one shard. And a namespace is just a prefix on object storage. So you can imagine that if you have Gregor email, well, that's prefix number one, right? And it's literally just that on S3: Gregor email, right? And then all your files are in there. Then you have Simon email, and so on, and so on. And that way we are only constrained in our scalability by how many namespaces S3 can have. Well, S3 can have a lot of namespaces. We have yet to see any limit, and there's no documented limit on how big this can be. And TurboPuffer, yeah, has more than 100 million of these namespaces. And this works great because it also means that we can encrypt every individual namespace differently, because you might decide that you want to encrypt the Gregor email namespace, or prefix on S3, with a key that you have access to. And I want to do that with mine as well, to add even another layer of defense that's basically as good a protection as if you had that data in your own bucket. There's no logical difference. You can rotate the key at any time or revoke it. So namespacing is a core tenet of TurboPuffer. And I'm thinking back to my Shopify experience, right? We solved most of the problems by exploiting the fact that we could shard on the shop, and that shops don't naturally have ways that they have to talk to other shops, other than in some edge cases. So that's exactly what we did at TurboPuffer, and we built that in as a core primitive.
In the future, probably a namespace will map to multiple shards and things like that, but it will be abstracted against this single layer of completely horizontally scalable search.
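The "namespace is just a prefix on object storage" idea can be sketched in Python as plain string handling. The key layout, the function names, and the file names below are hypothetical and are not TurboPuffer's actual on-disk format. The sketch only shows why a prefix that ends in a separator keeps one tenant's keys from matching another's.

```python
def namespace_prefix(namespace):
    """Object-storage prefix for a namespace (hypothetical layout).
    The trailing slash stops 'gregor' from also matching 'gregor2/...'."""
    if not namespace or "/" in namespace:
        raise ValueError("namespace must be non-empty and contain no '/'")
    return f"{namespace}/"

def object_key(namespace, filename):
    """Full key for a file stored under a namespace."""
    return namespace_prefix(namespace) + filename

def keys_for_namespace(all_keys, namespace):
    """Everything a query in this namespace is allowed to touch."""
    prefix = namespace_prefix(namespace)
    return [k for k in all_keys if k.startswith(prefix)]

if __name__ == "__main__":
    bucket = [
        object_key("gregor-email", "index.bin"),
        object_key("gregor-email", "wal/0001"),
        object_key("simon-email", "index.bin"),
    ]
    print(keys_for_namespace(bucket, "gregor-email"))
```

Because every read is scoped to one prefix, per-namespace choices such as a separate encryption key, as described above, can be attached to that prefix without touching any other tenant's data.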
Gregor Vand
Nice. Yeah. As I say, there was, as you call it, a core primitive that I was looking for. Surprisingly, let's just say other people in the space take slightly odd approaches to it, or virtually no approach at all. So I think it's something that the audience should look out for when thinking about this kind of thing. So we're going to take a bit of a turn into who's actually using this because there's some huge names using TurboPuffer. Notion and Cursor are probably, I imagine, two of the biggest names, if you'd like, but also two of the names that I think most of our audience will be very familiar with and can probably start to imagine sort of how from everything we've just been talking about how the technology actually works under the hood, like how that's being translated through to what does that even mean for them as a user each day? So maybe could we talk about both of those cases? I think. Super fascinating to hear about those.
Simon Haroub Eskelson
Yeah, we could start with Cursor. So I got to know the Cursor team in 2023, when they saw the first announcement of TurboPuffer. Knowing the team so well now, it's very clear that they just had a conversation around the dinner table at some point thinking, well, why hasn't anyone built it like this? Right. They're big fans of S3, just like I am. So it slotted right into their mental model, and we went back and forth with a bunch of bullet point lists on email and spent some time with the team, and we were just completely aligned on what needed to be built here. There were a couple of missing features, so we built those out for them and then they migrated. For them, the storage architecture of TurboPuffer just made sense. Right. When you open a code base in Cursor, it indexes the code base, right? And one of the parts of the indexing is to embed the code base. And this powers a lot of the agents and a lot of the functionality inside of Cursor, which then has the ability to do a semantic search on the code base. I use this all the time to be like, where does it do this and where does it do that? And I can just do that in plain language. So Cursor, yeah, will embed the entire code base with their own embedding models that they've trained, and then use that as a tool call into everything else. And for them, the storage architecture of having everything in memory, which was the previous solution that they were on, just didn't make a lot of sense. You don't need every code base ever opened in Cursor in memory at all times. It's incredibly expensive, and the per-user economics of that just weren't sustainable for them. So they needed a way to bring the per-user cost to something that made sense to them. And so the TurboPuffer model of inflating the pufferfish when you're querying the code base just made a lot of sense. Right. Only some percentage of the code bases are going to be active at once, but they're still valuable to keep around.
So it's just such a clear fit to the storage architecture. So they've been great. Their first bill was reduced by 95% by moving on to TurboPuffer, and they've been amazing partners who have inspired a lot of how we build TurboPuffer since then. Yeah, Notion is another customer of ours, and it's a very similar story, where they were using a vector database that also had things in memory, and the per-user economics again just didn't make a lot of sense to them. Again, only a percentage of these workspaces are active at once. So what really appealed to them about TurboPuffer is that they wanted to connect a lot of this data into their LLMs. So it's a very similar story to the Cursor one. But I don't want to give the impression that TurboPuffer is only good for these use cases. I mean, fundamentally, the cheapest way to store data in the cloud is still to put it in object storage at $0.02 per gigabyte and then cache it on a compute node only when it's actually being used. Because generally in a traditional storage architecture you have to replicate the data to three nodes, right? So in TurboPuffer's case, you have it in object storage at $0.02 a gigabyte, and then you store it on disk and in memory at a blended cost of maybe around tens of cents per gigabyte. But on a traditional storage architecture with memory and disk and all of that, you easily run into dollars per gigabyte. So this allows us to have pricing that just makes a lot of sense, and it allows us to have a truly serverless environment where you can just push in the vectors. And this was great for both Notion and Cursor, because they have a lot of writes. Right. Every time you edit files, every time you edit in Notion, it has to issue a write to update the vectors in TurboPuffer.
Gregor Vand
Yeah, that's interesting. I have noticed just how fast Cursor is at indexing. You know, I love using Cursor workspaces, so, you know, dumping, say, three different code bases into a workspace and then having it kind of figure things out across all three. And yeah, the indexing has always impressed me. So now I know why. I mean, obviously credit to Cursor as well, but it's great to understand the technology behind the scenes. And yeah, I was at a Notion event not so long ago, and obviously they're doing a big push on their AI offering. And a lot of this is, look, we're going to index across all these data types, you know, like Google Drive, and I believe Gmail is coming soon. And Slack, well, probably Slack. I mean, there was that announcement about the API restrictions, but I'm assuming maybe they're figuring that one out. But at the end of the day, it's a lot of data that needs to be indexed, and then effectively AI RAG search over that. So it's also super interesting that TurboPuffer is kind of behind the scenes on that one.
Simon Haroub Eskelson
Yeah, I think every company has a lot of data that they want to make it into context. And if you have a lot of data, then the economics of adopting another data store rather than perhaps using a vector index in your relational database can start to make sense when you get into the tens of millions of documents that need to be indexed.
Gregor Vand
Right.
Simon Haroub Eskelson
Same reason why we've always moved search workloads out after some point in time for full text search as well. After a certain threshold, the downsides of having another data store start to make sense. So when you want to connect a lot of data to AI, we think the storage architecture makes a lot of sense.
Gregor Vand
You touched on it just talking through the Cursor and Notion examples. Pricing, I think it's good to call out. TurboPuffer doesn't offer any kind of free plan. And I think you've been quite sort of vocal about why you don't offer one. I think it'd be great just to kind of hear the whys behind that. I fully support that, just to be clear. But I think it's also great for developers to sort of understand as much for their own products as it is for the fact that they might want to come now try turbopuffer and then might be disappointed that there is no free plan, for example.
Simon Haroub Eskelson
Yeah, I mean, first of all, if you're not having a good time in the first 30 days and you've tried the product earnestly, then you have the right to cancel. But we think that for the companies that benefit the most from TurboPuffer, this is not a price point that's scary. And for small companies it might make sense to use a small index that's tacked onto the existing relational database. But if you have ambition to index tens of millions of vectors, it could make sense to start on TurboPuffer. But in a lot of cases it's really about wanting to provide people a really good experience. And it's easier to provide people a really good experience when you have a commercial-only offering, and we can put the necessary staff behind it to support people when they have questions and give them a really good experience, so that you get the feeling that we're part of your team.
Gregor Vand
Absolutely. And you've announced, I believe quite recently, that you're now GA, so general availability. How does TurboPuffer as an org look now? I haven't done research on that. I always find it interesting. Are you two people, are you 200? I mean, where is TurboPuffer?
Simon Haroub Eskelson
Yeah, I mean, we're less than 20 people and it's a very engineering-heavy team. We've hired a bunch of database engineers, and that's the majority of the team: people working on the database. And now we're starting to hire other types of roles to expand the team and especially to support our customers and so on. Yeah, that's the size of the company. We really want to have a dense, like P99, engineering environment, and so we've tried to hold the standards high on the team that we put in front of our customers and that is developing this product.
Gregor Vand
That's a very nice number. Team wise. I think it's been a little bit over reported on sort of this idea of the one person unicorn and all this nonsense. So I think it's good to sort of call out that some of the best companies coming out at the moment are maybe 10, but 20 also sounds good and a bit more. And like from what you can share, where is TurboPuffer kind of going over the next say six months to a year? What's kind of top of mind for where the product needs to go?
Simon Haroub Eskelson
One of the biggest frontiers we're pushing right now is just more database features. So for example, we just launched conditional writes. It's the ability to say, hey, I only want to update this document if it's actually newer than the document that's in the database. So it's database features. It's a lot on the full-text search side. So TurboPuffer doesn't just do vector search, it also does full-text search, and we see more and more of our customers doing that. There are a lot of expectations about what you can do in full-text search, and so we have to build that on the storage engine that we've built from the ground up to work. So those are the biggest things that we have. We have a very solid generic LSM storage engine underpinning all of this that we've matured, and now it's really about features, and it's about even more performance than we deliver now.
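The "only update if it's newer" semantics of a conditional write can be illustrated with a small in-memory Python sketch. This is not TurboPuffer's API: the `DocumentStore` class, the `upsert_if_newer` method, and the `updated_at` field are hypothetical names used to show the behavior described.

```python
class DocumentStore:
    """Toy in-memory store with a conditional write on a version field."""

    def __init__(self):
        self.docs = {}  # document id -> document dict

    def upsert_if_newer(self, doc_id, doc):
        """Write doc only if it is newer than the stored copy.
        Returns True if the write was applied, False if it was skipped."""
        current = self.docs.get(doc_id)
        if current is not None and doc["updated_at"] <= current["updated_at"]:
            return False  # stale or duplicate write: keep what we have
        self.docs[doc_id] = doc
        return True

if __name__ == "__main__":
    store = DocumentStore()
    store.upsert_if_newer("page-1", {"updated_at": 2, "text": "second edit"})
    # An out-of-order write carrying an older timestamp is ignored.
    applied = store.upsert_if_newer("page-1", {"updated_at": 1, "text": "first edit"})
    print(applied, store.docs["page-1"]["text"])
```

The benefit for write-heavy pipelines is that out-of-order or retried updates cannot overwrite a more recent version of a document.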
Gregor Vand
Awesome. Well, we're coming up for time. One final question I had was just actually completely nothing to do with databases specifically. It's actually around, I mean you've talked about the name, but also the visual language and just I guess the front facing site. Very fun. It's sort of pixel art, I guess, if you want to kind of give it some kind of name. Where did kind of the idea come from and like who kind of leads that on the team?
Simon Haroub Eskelson
Yeah, I think it's like neo-retro pixel art or something. I don't know what the aesthetic is called, really. This came from when I was back home in Denmark and we had some friends visiting, and I was talking to my friend who's a designer about this idea of TurboPuffer. I think he said later that he had no idea what I was talking about, but I sounded really excited about it, and he was so encouraging. And in my mind, lateral to exactly how to lay out the bytes on disk and build a storage engine and an ANN index, was just this aesthetic that I wanted. And so we were riffing back and forth on what that would look like and kind of creating some sample sites, and it was just this very bare-bones thing, right? Like, you know, at Shopify I spent 10 years evaluating databases. So I was on the buy side for the majority of my career. And every time I went to a website it was just like, what does it cost? What are the trade-offs? Who are the customers and what are they using it for? These are the only questions I really care about. And most database websites are just so filled with so many other things, right? And I'm just like, no, what are the trade-offs? Any database has trade-offs. Is this the right set of trade-offs for me? What's the architecture? What are the guarantees? What's the consistency model? And so this "show, don't tell" was just very important to me. And then we wanted to breathe a little bit of fun into it. I think TurboPuffer in itself is kind of a whimsical name. And so this aesthetic just sort of appeared over many, many iterations that you'll see on the Internet Archive. And we're just really fond of it, and we'll continue to iterate on it because it makes us happy when we go to the website.
Gregor Vand
Yeah. As you've called out, if you go to TurboPuffer you'll see a whole bunch of, like, sliders, et cetera, so you get into the meat of it very quickly. But I think, as you call out, having this kind of fun aesthetic around it definitely sets the tone for who you are as a company as well. It reminds me a little bit, in a different form, of TigerBeetle, the financial database, which you're probably familiar with. They have a very, very specific type of database, but they also bring fun visuals to it, and obviously the name. So it's very memorable. To that point, I remember seeing something about TurboPuffer on Hacker News a little while ago, and then this episode was suggested and I knew the name straight away and could visualize the website before I'd even come back to it. So I think it's very smart. So, yeah, thanks so much for coming on. Where can people find you? Or where's the best place to go to just get acquainted with TurboPuffer?
Simon Haroub Eskelson
Yeah. TurboPuffer.com is a great entry point. We post on all the social medias. You'll find the link there. I'm sirupsin on X where, you know, these days I mostly tweet when I go out for runs for some reason, but that's where you'll find me and find the company.
Gregor Vand
Awesome. Well, thank you so much, Simon. A very deep, technical episode today, which I think a lot of the audience will love, and I really look forward to following along with TurboPuffer. I think you guys are doing very interesting things. So thanks for coming on.
Simon Haroub Eskelson
Thank you so much for inviting me on.
Date: September 30, 2025
Host: Gregor Vand | Guest: Simon Hørup Eskildsen
This episode explores the creation and technology behind Turbopuffer, a next-generation vector database engineered for speed, cost efficiency, and scalability. Host Gregor Vand interviews co-founder Simon Hørup Eskildsen, delving into the problems with existing vector databases, the technical and economic innovations that Turbopuffer introduces, real-world use cases, and broader architectural decisions that shape its development. The conversation is highly technical and insightful for developers and engineering leaders interested in AI infrastructure and database scalability.
[02:02]
Quote:
"I spent almost the entire time working on that layer between the Rails app and the databases, and sometimes inside the databases themselves, but mostly on top of them. So yeah, almost 10 years working on every single aspect of database scalability at Shopify." — Simon [02:13]
[05:26]
Quote:
"For vector indexes, we're now talking about 30x [storage amplification], right? Just to store the vectors, let alone build an index." — Simon [07:26]
[09:28]
Quote:
"Per gigabyte [NVMe SSDs are] about 100 times cheaper than DRAM... most databases haven't built around getting all that bandwidth through. You have to bypass the Linux page cache to get the maximum." — Simon [09:49]
[12:34]
Quote:
"Graphs just don't work that well for it... they are amazing for in-memory and it's almost impossible to beat the performance in memory. But when things are on disk or they're on S3, you have roundtrip latencies into hundreds of milliseconds..." — Simon [17:26]
[20:32]
Quote:
"The canonical source of truth for all data in TurboPuffer is object storage... As you query the namespace more, TurboPuffer gets faster." — Simon [20:32, 23:00]
[23:40]
Quote:
"Turbopuffer's console is still fairly simple and operational... we've been very focused on the database itself." — Simon [23:46]
[24:49]
Quote:
"We will get big red dots if someone's recall is below 90%. ...nothing matters other than production and it's going to be the same for accuracy." — Simon [25:40, 26:18]
[28:18]
Quote:
"The query planner that plans how much data the query needs to look at to get high recall needs to be very aware of both the vector index and also the filtered index to get high recall." — Simon [30:36]
[32:05]
Quote:
"This is probably the most complicated part of the entire turbopuffer code base. To make this work at very, very large scale and make it work with recall and make it work with filters..." — Simon [33:54]
[37:13]
Quote:
"We decided at TurboPuffer to make namespacing a core sharding primitive that we expose to the user... this works great because it also means that we can encrypt every individual namespace differently..." — Simon [37:13]
[39:51]
Quote:
"For them, the storage architecture of having everything in memory, which was the previous solution that they were on, just didn't make a lot of sense. ... Their first bill was reduced by 95% to move on to TurboPuffer." — Simon [41:47]
[44:19]
Quote:
"It's easier to provide people really good experience when you have a commercial only offering. And we can put behind the necessary staff to support people when they have questions and give them a really good experience. So that you get the feeling that we're part of your team." — Simon [44:44]
[45:38]
Quote:
"We really want to have a dense like P99 engineering environment and so we've tried to hold the standards high on team that we put in front of our customers that are developing this product." — Simon [45:47]
[47:37]
Quote:
"Any database has trade offs. Like is this the right set of trade offs for me? What's the architecture what are the guarantees? ... this show, Don't Tell just was very important to me. And then we wanted to breathe a little bit of fun into it." — Simon [48:22]
Highly technical, candid, and engineering-driven; Simon is practical and transparent about trade-offs, mistakes, and real customer needs, with an undercurrent of developer fun and enthusiasm.
For listeners: If you want to understand the modern challenges and solutions in AI-first infrastructure and how storage economics shape what’s possible for vector search at scale, this episode delivers a masterclass.