
A
Vector search has become a foundational tool in modern search and retrieval systems, including the RAG pipelines that power many AI applications. However, the demands on retrieval systems are growing more sophisticated, which is revealing the limits of relying on a single vector similarity score. Vespa is a popular open source search and data serving engine. Central to Vespa's architecture is tensor-based retrieval, an approach that represents data as tensors rather than simple vectors. Tensor-based retrieval enables richer mathematical operations and more flexible ranking functions that can overcome the limitations of a single vector similarity score. Radu Gheorghe is a software engineer at Vespa with a background spanning nearly 12 years of consulting and training on Elasticsearch and Solr. In this episode, Radu joins Sean Falconer to discuss why vector similarity alone falls short in production, how tensor-based retrieval generalizes to support richer ranking functions, the tradeoffs in chunking and multi-stage re-ranking architectures, and where AI search is headed next. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
B
Radu, welcome to the show.
C
Hi, thanks for having me.
B
Yeah, absolutely. I'm glad you were able to be here. I interviewed Vespa's founder and CEO probably a couple of years ago, so it's great to catch up again on everything that's happening over at Vespa. A lot has changed in the world of AI, and I'm sure in the world of Vespa, over the last couple of years.
C
Yep.
B
So you've been working in this space for a while. What's your origin story? How did you end up working in search infrastructure and ultimately get involved at Vespa?
C
Yeah, this was two jobs ago. I was working at an antivirus company and we needed to centralize logs, and that's how I got into Elasticsearch. Then I moved on to a company that was, at the time at least, doing mostly consulting on top of Elasticsearch and Solr. I did that for almost 12 years at that company: consulting, training, that sort of stuff for Elasticsearch and Solr, and then OpenSearch.
B
I mean what ultimately led you to Vespa?
C
Well, I guess mostly curiosity, because Vespa not being based on Lucene, having different internals, a different distribution model, and making different trade-offs got me intrigued. I met a bunch of people at conferences, got more and more curious, and yeah, that's how I got into it.
B
Yeah, Vespa as a company, at least in its origins, has been around for a long time, over 20 years, working on search-related problems. Can you share a little bit about the origins of the company? What was the original problem they were focused on, and how much of that originating DNA is there today?
C
I know quite a few things, but only from other people, because I wasn't around. Of course, I've only been here for a couple of years. As far as I know, the origins are pre-Yahoo. There used to be a company called FAST, a recursive acronym for Fast Search and Transfer, and they had been doing web search and search in general. I think there were a few other things, but the idea was large-scale search, and then through a series of acquisitions it ended up in Yahoo. In Yahoo they were serving lots and lots of use cases, and Vespa still does serve lots and lots of use cases within Yahoo. I'm not sure which of them can be told publicly, but the idea is to have a bunch of verticals that you can serve, some smaller scale, some really huge scale. So I think this implies that a lot of the problems Vespa needed to solve were quite generic as well as large scale. I think you'll see this in Vespa today: a lot of the solutions we adopt tend to be over-engineered, if you will, because we expect them to be used and pulled in all sorts of directions.
B
What do you mean by that in terms of over engineering? Can you give an example?
C
Well, tensors, I think, are a good example of that, because you don't only support vectors and distance functions and all that; you support all sorts of math on top of all sorts of numerical structures. So when you come up with a new use case, it's just that much easier to add, because there are lots of things that are already supported and already thought through to be scalable and fast.
B
I see, so you're talking more about having some first-principles thinking around search that generalizes to all sorts of problems, versus attacking each one as a narrowly defined, brand-new problem that you have to go and engineer a specific solution for.
C
Yeah. For me, coming from my consulting background, I'm used to solving specific issues. At Vespa, when I look at an issue and I'm like, okay, how do we solve this? People are like, wait, wait, let's make sure we don't bump into something three months later where we have to do this all over again. Is there something generic that we can do? Will this perform at scale? Does this align with how Vespa is used in general, and things like that? That, for me, is a bit of a shift.
B
Yeah, I think there's always a trade-off between a consultancy or forward-deployed-engineer type of role, where you're really trying to solve specific things for a customer and unblock them, versus being part of core product and R&D, where you're thinking beyond just a singular customer: how do we generalize this thing to maybe all of our customers?
C
Yep.
B
So Vespa has been fairly vocal in writing about vector search and how it's reaching some of its limits. For some people, maybe that's a bold claim. Vector search, vector databases, this is something that's been around for quite some time and has really gotten a lot of traction, certainly in the last few years. Can you talk a little bit about the core argument coming from Vespa in terms of vector search reaching its limits? Are there certain things that vectors are good at, and where are things starting to break down?
C
I think the general idea is that people want good relevance. Right. You have a corpus, you're searching in it, and you need the most relevant things to surface. And for that, vector distance is only one signal. Right. You may have N other signals. Is this document recent? Does this document match well on lexical search? Which chunk is more relevant? Do we care about the top chunk, the average chunk, the average of the top 10 chunks, or whatever business rules we may have? In practice, what we see is that a lot of people end up having really complex algorithms for computing what ends up being a relevance score. So having flexibility around this, I think, is important. I don't think there's anything wrong with vector similarity; I just think that vector similarity in itself is not enough. Even if you look at what are traditionally called vector databases, they add stuff on top. It's not just vector similarity that they care about. So I think it's just a natural trend that when people start to use these things, they care about multiple signals and also how to combine them.
B
Yeah. So if you're only looking at something like vector similarity, what are some of the things that you might end up getting wrong in some of these use cases or scenarios where you're limiting yourself to the singular signal?
C
Well, one thing that comes to mind is lexical search. There are a lot of, let's say, memes in the search world about BM25, which is probably the most popular algorithm behind lexical search, and how BM25 actually performs really well; as time goes by, it doesn't seem to die. We had a recent blog post where we benchmarked a lot of embedding models, and most of them, in most of their flavors, would outperform BM25. Actually, let me take a step back. When I say most of those models outperform BM25, I mean those models off the shelf outperformed BM25 off the shelf. In reality, nobody actually uses that; most people will tune both their embedding models and their BM25 implementation. But for argument's sake, okay, most of those models would outperform BM25, but hybrid search, so BM25 combined with those models, would outperform the models themselves. So lexical search is a signal that I think is here to stay and has been proven over and over again. That is just one example. Another one that comes to mind is long texts: vector similarity on the whole text becomes quite meaningless, because you can't capture the meaning of a blog post or a book in one vector. I think that's where chunking comes in, and then how you combine chunks and things like metadata. So again, in practice, what we see is people end up adding on more and more stuff that becomes their compound signal.
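One common way to implement the hybrid search Radu mentions is reciprocal rank fusion (RRF), which combines a lexical and a vector ranking without needing comparable scores. The episode doesn't say which fusion method Vespa or the blog post used, so this is a generic sketch with made-up document IDs:

```python
# A minimal sketch of hybrid search via reciprocal rank fusion (RRF).
# Each document scores 1 / (k + rank) per list it appears in; k=60 is
# the conventional constant. Document IDs below are illustrative.

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one combined ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # lexical top hits
vector_ranking = ["doc_b", "doc_c", "doc_d"]  # embedding top hits

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_c', 'doc_a', 'doc_d']
```

Documents near the top of both lists (here `doc_b`) win out over documents that only one signal liked, which is the intuition behind hybrid search beating either signal alone.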
B
Can you talk a little bit about how this vectorization process works? What do you lose when you turn a document into a vector? What are you giving up by using that representation compared to other ways you could potentially represent a document? For example, you could just store the text of a document and do some sort of text-based search over it.
C
So then you lose things like exact filtering. That's one of the main complaints people have about vector search: where's my threshold? Where's the cutoff between a relevant and an irrelevant document? With lexical search, this is usually quite easy to figure out. A lot of people in the Lucene world use some sort of minimum match: okay, if you have three words, then all three need to match, but if I have 10, then 7 out of 10 is good enough, and then I have a reasonable cutoff. It's not perfect, but it is somewhat intuitive, explainable, and most of the time it works. With vector search, you don't really know, because for something like cosine similarity you can say, okay, a decent similarity score is 0.7 and I'm going to cut it off at that, but that similarity changes when you're running different queries. In other words, it's very hard to figure out what that cutoff point is and what that cutoff point means. And that messes with faceting. If you want to analyze your result set, what are you even looking at? Because with vector search, unless you have this artificial cutoff point, you're going to match everything.
B
Yeah, I mean, I think that would be certainly true of if you're not doing any kind of chunking. The whole point of, if you have a large document of breaking it down into smaller chunks is that you have more tightly coupled semantically meaningful chunks. So if I take an entire book and I turn it into one singular vector, then I'm creating a single point in high dimensional space that represents this whole book. There's no way I can encapsulate all the meaning of that book and have all the points in space that are similar to it in some reasonable way. I'm going to end up losing a lot of the specifics, I would think. But if I break it down by paragraphs or sections or chapters or whatever it might be, then at least I have more tightly coupled dots in high dimensional space that are going to probably have a sphere of similarity around it to other dots in that space. They're probably more semantically meaningful. So the more data I'm essentially trying to stuff into the vector, the more I'm generalizing, essentially the ultimate meaning of that thing. Is that fair?
C
Yeah, I think that's a fair thing. It's like you have a limited number of data points, effectively the dimensionality of your vector that you can store. So the more meaning you have in something big, the more you're going to compress and the more lossy it's going to be.
B
Yeah, exactly. So it's a very lossy format, especially as you're stuffing more and more text into a singular vector representation. Also, if we look at the use of RAG over the last couple of years, the RAG systems built on vector databases are typically getting more and more complicated. We have the pipeline where we're chunking documents, with different chunking strategies, and indexing them in a vector database. Then when we're actually retrieving, we're also doing multiple steps: maybe retrieving relevant documents and then using a re-ranking model to re-rank the results. In terms of this two-stage architecture, where we're decoupling the search from the re-ranking, is that problematic? Are there challenges around not tying those things together?
C
I think it is problematic in the sense that if you have a lot of data to re-rank, then you're going to have a lot of traffic coming in and out, and that can become a bottleneck.
B
So it's primarily an efficiency problem.
C
Yeah. And efficiency is really important, because if you're more efficient, then you can essentially afford to do fancier stuff. Right. Take this re-ranking example. If you have a really good re-ranker that performs really slowly, you can only throw a few results at it, because otherwise you're going to have unacceptable latency. But if your re-ranker is super efficient, then you can throw all your results at it and you're going to have great results. And this, I think, applies all over the place. If you can, in general, have a base relevance function that performs well and does a lot of stuff, then you're going to have a really good baseline to work with.
B
And I guess all these problems that we're talking about are perhaps even more amplified in like the multimodal world. It's one thing where we're talking about compressing text into a vector. Like what happens when we compress images and video and things like that. Do we lose too much in using this kind of lossy format? Especially when we're talking about rich media.
C
I don't know that I have enough experience with things like audio or video, but I know that for things like PDFs it's going to be really hard to put all that information in a single vector, because you can have N pages. On those N pages, even if you only extract the text, you have the problem we talked about earlier, which is how to cram a lot of text into a vector. But if you have diagrams in it and other things, good luck.
B
Yeah. And tables, I would suspect. Even the approach to doing similarity measurements between those things is probably going to be quite different than what you'd do for traditional text.
C
Yeah. So what I've seen with PDFs is people doing vectors per page, or rather per patch per page. You have models like ColPali and so on that can do this sort of stuff.
B
Do they handle tables and images differently though?
C
They don't, actually. They're pretty generic. You just throw the image of a PDF page at it and they give you a vector per patch. So you'd typically have 1,024 patches, 32 by 32, with one vector for each of those patches. But it's all very well coordinated. Because you can throw a text query at the same model, so they live in the same vector space, it can actually figure out whether something's in a table or on a graph. So you can have a graph of, I don't know, energy consumption by month, and you can ask what the consumption was in July, and it can highlight that for you.
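The way those patch vectors get scored against a text query is usually MaxSim (late interaction): for each query-token vector, take the best dot product over all patch vectors, then sum. This is a toy sketch of that scoring, not ColPali's actual code, and the 2-dimensional vectors are made up (real models use on the order of 128 dimensions):

```python
# Toy MaxSim (late interaction) scoring, ColPali-style: for each
# query-token vector, keep the maximum dot product over all page
# patches, then sum those maxima into a page score.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, patch_vecs):
    # One "best patch" score per query token, summed into a page score.
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]              # two query-token vectors
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # three patch vectors

score = maxsim(query, page)
print(round(score, 2))  # 1.7: best patch per token is 0.9 and 0.8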
B
Okay, I want to get into this topic of tensor-based retrieval. For the listeners: because of everything that's been happening in AI over the last couple of years, people who didn't know what a vector was three or four years ago perhaps know what a vector is today, but they might not be super familiar with the concept of a tensor. Can you explain the difference between vectors and tensors, and why it matters for search?
C
Yeah. So a vector is a list of numbers, right? The data type can differ. Normally it's natively a float, but we can quantize it, basically compress the float into, let's say, a 16-bit float or an integer or even a bit. So that is a vector. A tensor is a more flexible way to represent numbers. The simplest case is to represent just one number. You can have an array, which would be a vector. You can have named dimensions: like I mentioned patches earlier for ColPali, you can say we have a patch ID, and for each patch ID we can attach a vector, so now we have a map of vectors. Or you can have a sparse tensor, let's say for personalization. I go to a clothing store and I prefer, I don't know, black pants and blue T-shirts and stuff like that. Those could be named dimensions in my tensor, and based on my preferences I can store numbers: a heavy preference would be closer to one, and maybe if I hate something, it should be a negative number. And I can perform all sorts of math on top of these numerical structures, these tensors, and get the results I want. So, for example, with vectors we can do the similarity search that we all know and love, or we can do personalization by doing some sort of dot product between my preferences and what a specific item of clothing is. Or we can do ColPali and sum things up; we can do MaxSim. All sorts of things can be done on top of tensors. I'm not sure if that answers your question.
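The personalization example above can be sketched as a sparse dot product: user preferences and item attributes are sparse tensors with named dimensions, and only dimensions present on both sides contribute to the score. The dimension names and weights here are made up for illustration:

```python
# Sparse-tensor personalization sketch: preferences and item
# attributes keyed by named dimensions, combined with a dot product.

def sparse_dot(prefs, item):
    # Only dimensions present in both sparse tensors contribute.
    return sum(w * item[dim] for dim, w in prefs.items() if dim in item)

user_prefs = {"black_pants": 0.9, "blue_tshirt": 0.7, "red_hat": -0.5}
item_a = {"black_pants": 1.0}  # a pair of black pants
item_b = {"red_hat": 1.0}      # a red hat

print(sparse_dot(user_prefs, item_a))  # 0.9  -> strongly preferred
print(sparse_dot(user_prefs, item_b))  # -0.5 -> disliked
```

A liked attribute boosts the item's score and a disliked one pushes it down, which is exactly the "closer to one if I love it, negative if I hate it" encoding Radu describes.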
B
Yeah. So every feature or thing that you want to describe needs to map into a numeric representation. Right. And that could be a vector, it could be a singular value, but some numeric representation.
C
Right. In the context of tensors, yes. I mean, with Vespa, you can do much more with ranking than just using tensor math. But tensor math is a really flexible way to represent a lot of things and then do those interactions quickly.
B
Right. So by representing things as tensors versus purely vectors, you have a whole set of tools, essentially, that you can use to perform these different types of searches using tensor math, which you wouldn't be able to support if you were just doing cosine measurements between two vectors. Correct?
C
Yeah. And also, I think most importantly, we are very, I wouldn't say completely, because nothing is complete, but very future-proof. For example, when ColPali models came in, we could just natively support them, because you can have these patch vectors modeled in a tensor and then implement MaxSim using tensor math. And there you go, you have all the MaxSim stuff. You don't need to come up with a whole new feature for how we deal with this, how we deal with multiple tensors, how we combine them in the way they're supposed to be combined.
B
This was something that Vespa was already supporting.
C
Yeah, I mean, not before the model and the technique came into existence. It's not like we were supporting it, but, yeah, we were supporting it from day one because all the plumbing was already there. You just needed to write the correct expression. And there you go.
B
I see.
C
Okay. Another good example is Bayesian BM25. There's a new technique to normalize BM25 scores, because one of the main problems with BM25 is that you don't have a predictable score that you can then combine with other kinds of scores. It's ideal if you can normalize it between 0 and 1, and then you can treat it much more uniformly. When that technique came out, we were like, okay, how do we implement this in Vespa? And it turns out pretty much everything was already there. All the sigmoid calculations we could already do in the rank profile math. So this was impressive even for the author.
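The underlying idea is easy to sketch: a sigmoid squashes an unbounded BM25 score into (0, 1), after which it can be blended with a 0-to-1 similarity score. The midpoint, slope, and blend weight below are illustrative, not the parameters of the actual Bayesian BM25 technique:

```python
import math

# Sketch of sigmoid-normalizing BM25 into (0, 1) so it can be
# blended with a cosine-style similarity. Parameters are made up.

def sigmoid_normalize(bm25_score, midpoint=10.0, slope=0.5):
    return 1.0 / (1.0 + math.exp(-slope * (bm25_score - midpoint)))

def blended(bm25_score, cosine_sim, w=0.4):
    # Weighted blend of the two now-comparable signals.
    return w * sigmoid_normalize(bm25_score) + (1 - w) * cosine_sim

print(round(sigmoid_normalize(10.0), 2))  # 0.5: score at the midpoint
print(round(sigmoid_normalize(20.0), 2))  # 0.99: strong lexical match
```

Once both signals live in the same 0-to-1 range, combining them is a simple weighted sum instead of guessing how a BM25 score of, say, 14 compares to a cosine similarity of 0.8.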
B
Can you walk me through the process of doing a tensor-based search in Vespa?
C
The process depends on exactly what the tensor looks like. Any type of tensor, you define in the schema: okay, this is the shape of the tensor, this is the data type. Then you feed the data, which should match that shape, whether it's a map of arrays or whatever it is. When you run the query, you typically also have a query tensor; you can construct tensors at query time from the signals that you may have, like, I don't know, chunk similarities. And then you have something called a rank profile. In the schema you say, this is my rank profile, and the rank profile expresses how the score of the document should be computed. Let's say we do a dot product between two tensors, or we compute similarities between a bunch of vectors and iterate: we take the average of those similarities, or the top N vectors' similarities and average those. Whatever math you can think of should be there, or at least a lot of the relevant things are already there, and you can construct a relevance function that way.
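To make the schema, query tensor, and rank profile pieces concrete, here is a minimal sketch of what such a schema might look like. The field names, tensor dimensions, and the blend weight in the expression are illustrative, not taken from the episode:

```
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field embedding type tensor<float>(x[384]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            }
        }
    }
    rank-profile hybrid inherits default {
        inputs {
            query(q) tensor<float>(x[384])
        }
        first-phase {
            # Blend lexical and vector signals; the weight is illustrative.
            expression: bm25(title) + 5 * closeness(field, embedding)
        }
    }
}
```

At query time you'd pass the embedded query as `input.query(q)` alongside the query text, and the `first-phase` expression combines both signals into a single document score.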
B
How do I know what relevance function to use?
C
I think that is very much up to you, and how you, let's say, tweak your relevance depends on the use case. I think most people would start from something simple like lexical search. Then, okay, can we find a decent embedding model to work with my data? Then I can think of, okay, what other business-relevant signals do I want to incorporate? All sorts of metadata. So people typically iterate. And I think it's very important to have some sort of golden set that you can evaluate against and see whether your quality is going up or down. That's the very generic approach.
B
It sounds like there's maybe some additional complexity involved in getting this set up and working. But you're trading off some level of technical investment and complexity upfront, and in exchange you get better results.
C
This is valid for any system. I don't think it's particular to tensors: if you add more signals and you want to combine them, that engineering investment you were talking about will happen everywhere. Maybe tensors require a little bit more understanding of some sort of math. Nothing crazy. I mean, my math stopped at high school and I can still grok it to some extent, so it's not too, too scary. But it's a little bit more than, at least, what I'm used to.
B
How much is Vespa like, abstracting away some of that math for you?
C
There are some helper things. For example, we talked about ColPali. There are aliases: you can just multiply two tensors, for example, like X times Y, and that's going to do a dot product for you. You can also do the unfurled thing. More interestingly, we have a bunch of helper, let's say, frameworks. There's something called the Tensor Playground, where you can go and click around; you have some examples, and you can also come up with your own, fiddle with tensors, and see what the results are. We also did, a couple of years ago in December, this Tensor Advent challenge, where the idea was, okay, let's have some thematic challenges you have to solve with tensors, like how much Santa has to pack and how far the elves have to travel and stuff like that, just to get a feel for that math. And then there's quite a big repository called sample apps in the Vespa GitHub, which has lots and lots of examples of use cases, and you can see the rank profiles and the schemas there. A lot of people will take one of these sample apps and just change it to what they need. And I think that's useful; it's rare that you start from scratch on a path nobody has gone down before.
B
You mentioned a little bit earlier this concept of named dimensions, and Vespa's tensor framework supports named dimensions like token, region, or timestamp. What does that give you? Why does that design choice matter?
C
It matters because, well, let me step back here and try to come up with an example. You can have attributes that you care about for ranking. Let's say you're searching for cars, and you have things like: is this car expensive? Is it cheap to insure? Does it use a lot of fuel? Is it new? Whatever. Things that maybe I care about when ranking. Even if you don't have tensors, you can still take those into account when ranking. You can take the mileage, you can take all those dimensions, and come up with a formula that combines them into one final dimension, which is the score of my document. But that is quite expensive. Assuming you store this in multiple fields, you need to get the value from all those fields and do whatever math you need in some sort of high-level layer. Hopefully you don't have to bring it all the way to the application, because that's going to be horrible. But even if you do this in some sort of high-level script, like Painless with Elasticsearch, for example, that can be very slow. By contrast, if you have this natively in a tensor, you can simply take the user's preferences, take those car attributes, and do a dot product, which is super, super fast. So this will scale a lot better than pulling those attributes manually. I think that's what it gives you, in essence, and it kind of comes back to what we discussed earlier about efficiency allowing you to do fancier things. At some point you will not be able to do things in other search engines, even though the capabilities are there, because at your scale they don't make sense; it doesn't help you that they're there if you can't use them. But with tensors it's different, because a lot of those tensor operations are super fast, so they scale, and people do use them at very large scale.
B
Yeah, one of the things that seems unique about Vespa around some of this efficiency stuff you were speaking to is that the tensor computation happens on the content node where the data lives. It's the idea of: do you bring the data to the computation, or bring the computation to the data? Data is expensive to move around. So if you can bring the computation to the data, that's going to save you some cost in terms of moving this data around, which then gives you more compute cycles that you can spend on trying to get good results out of the search.
C
Exactly. And I think this comes at two levels. One is the computation that you do on all the documents. It's just unfeasible to bring all the documents somewhere outside where the data lives; you will have to take some sort of top N. Unless you have a tiny data set, you just cannot afford to take all the data out of the content nodes and into something external. So that is one thing: if you can do some tensor computation in the first phase, then you're going to save a lot of cycles. But the other thing is you can run the first re-ranking phase, basically the second phase (the first phase runs on all documents, the second phase runs on the top N), on the content nodes as well. So you can bring a more sophisticated model there: it could be a LightGBM model, like a tree, or an ONNX model. It's usually not something super big, but it can be complex enough, for example, to handle multiple signals with multiple ranges, such as similarity, and then, as we discussed, BM25 and recency and all the things that maybe matter to you, and come up with a coherent score. That still happens on the content nodes without moving data. And only later can you move a much smaller set to what we call global re-ranking, which happens on a stateless layer. That can again have its own model, maybe a bigger, more complex model; it can also run on GPU and do the final re-ranking. So there are stages to it.
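The funnel Radu describes, a cheap first phase over all documents, a costlier second phase over the top N, and a final global re-rank of a small window, can be sketched schematically. The scoring functions and numbers here are stand-ins, not Vespa's actual implementations:

```python
# Schematic phased ranking: each stage is more expensive per document
# and sees fewer documents than the one before it.

def first_phase(doc):
    return doc["bm25"]  # cheap signal, evaluated on every document

def second_phase(doc):
    return 0.5 * doc["bm25"] + 0.5 * doc["cosine"]  # pricier blend, top N1

def global_rerank(doc):
    return doc["cross_encoder"]  # most expensive model, smallest window

def phased(docs, n1=100, n2=10):
    hits = sorted(docs, key=first_phase, reverse=True)[:n1]
    hits = sorted(hits, key=second_phase, reverse=True)[:n2]
    return sorted(hits, key=global_rerank, reverse=True)

docs = [
    {"id": "a", "bm25": 9.0, "cosine": 0.2, "cross_encoder": 0.3},
    {"id": "b", "bm25": 7.0, "cosine": 0.9, "cross_encoder": 0.8},
    {"id": "c", "bm25": 1.0, "cosine": 0.9, "cross_encoder": 0.9},
]

print([d["id"] for d in phased(docs, n1=3, n2=2)])  # ['b', 'a']
```

Note that `c` never reaches the final stage despite having the best cross-encoder score: the cheap early phases pruned it, which is the trade-off of funnel architectures, and why making the early phases as good as possible matters.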
B
I guess one of the things I was wondering about, too: there's the concept of RAG, vectors, or even other search techniques like the tensor stuff we're speaking about, and it was very popular a couple of years ago. Now, with AI agents, I think there's some dialogue around how relevant this is today. Can you talk a little bit about where some of these concepts fit into the agent world?
C
Right. For agents, this would be just a search. I don't think they care all that much about what happens under the hood. But if what happens under the hood gives them good results quickly, that is, I think, even more important than it is for humans, because agents typically run multiple searches. So the problem, whether it's latency or bad results, gets compounded. Latency, definitely.
B
If you're 90% accurate in isolation and then you do that 10 times, it's 0.9 to the power of 10, which works out to only around 35% success across that compound series of searches. So the more you can get the accuracy up on the search in isolation, the more the accuracy goes up in the aggregate as well.
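The compounding effect is worth working out explicitly: a search that succeeds 90% of the time, run ten times in a row by an agent, succeeds end to end far less often.

```python
# Compounding accuracy: independent per-search success of 90%,
# repeated over an agent's chain of 10 searches.

per_search_accuracy = 0.9
searches = 10

end_to_end = per_search_accuracy ** searches
print(round(end_to_end, 2))  # 0.35 -> roughly a one-in-three success rate
```

Pushing per-search accuracy to 99% would lift the same ten-step chain to about 90% end to end, which is why small retrieval-quality gains matter so much for agents.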
C
And I think the other thing is that models, at least to my knowledge, to this day, aren't as good at figuring out how to filter the context. So if you give them bad results, they will tend to hallucinate more, because now they have bad context to base their hallucinations on.
B
Yeah. Or if it's too much. Right. All models degrade in performance the larger the context you give them, due to context rot, just based on the way the attention mechanism works. They can only pay attention to so many things, so they might pay attention to the thing that you don't want them to if you give them bad results.
C
Yep.
B
How does Vespa handle updates to data? If you have a knowledge base that's changing every minute, there's news, there's pricing, there's inventory, how does the indexing and re-indexing of that information work?
C
To talk in general terms, Vespa is real time, meaning that when you make an update, the moment you get the acknowledgement as the application, that thing is searchable. Most engines are near real time, meaning there has to be some sort of commit happening. There's always a trade-off, there's no free lunch, but this is the trade-off that Vespa makes: it assumes that you need your data to be available right now, so you won't have some of the caches that you have with other engines. The upside is for things that are moving quickly, such as pricing for e-commerce, which is a very frequent example, or how much you have in stock, which can change a lot. If the data you're changing is an attribute, so effectively the price or the in-stock count, that is kept in memory, and updating it is super, super quick. This contrasts with other systems where you have a commit and then you effectively need to re-index the document in order to change one value in it, which can be prohibitive. In Vespa, the advantage is that you can quickly update things.
B
How does that technically work? I'm not sure I'm following. If I have a new update, how does Vespa handle that in real time? Like a continuous flow of new information, how does Vespa make that available in real time?
C
So if you have an in-memory attribute like a price and you want to change it, I mean, it will be backed by disk, right? So you have all the persistence, the write-ahead log, all that stuff. But you send the update, it's changed in memory, and it's also replicated to all the other nodes. And when all the other nodes have gotten the update request, you get the acknowledgement from the client. And this happens at the operation level. So if you want to update, let's say, three products' prices in one go, the way you typically do this with Vespa is with HTTP/2. You're going to send, and we have libraries that do this, effectively three updates individually, and they respond individually. And each of them, the moment they respond, they're already flipped in memory. So you see the new price; every search that runs after that will see the new price or whatever you updated.
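The per-operation partial updates Radu describes correspond to Vespa's document/v1 HTTP API, where an `assign` on a single field changes one attribute without re-indexing the document. A minimal sketch of building such a request, assuming an illustrative `shop` namespace, `product` document type, and `price` field (the real names come from your application's schema):

```python
# Sketch: constructing a Vespa document/v1 partial-update request for
# an in-memory attribute like "price". Namespace, document type, and
# field names here are illustrative assumptions, not a real schema.
import json


def build_price_update(namespace: str, doctype: str, doc_id: str, new_price: float):
    """Return the (path, body) pair for a partial update.

    Each update is an individual operation; over HTTP/2 a client can
    send many of these concurrently, and each one becomes visible to
    searches the moment it is acknowledged.
    """
    path = f"/document/v1/{namespace}/{doctype}/docid/{doc_id}"
    body = {"fields": {"price": {"assign": new_price}}}
    return path, json.dumps(body)


path, body = build_price_update("shop", "product", "sku-123", 19.99)
# PUT this body to the path on a Vespa container node to apply the update
```

In practice you would send this with an HTTP/2-capable client or one of Vespa's feed libraries rather than building requests by hand; the point is that the payload touches only the one field being changed.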
B
Vespa has been in the search world for a long time, like we've talked about. So, you know, over a 20 year journey, what's next for search? Like if you fast forward ahead three, five years, like, what are some of the problems that need to be solved that haven't been solved today?
C
I don't know, to be honest. There's so much work in the short term that I find it hard to look ahead, because things are moving so quickly. I do have a feeling that multimodal search will become more important, that visual cues here and there will matter more depending on the use case. I would think that the ability to explore data in real time will also be increasingly important. I think people, and even agents, are not necessarily happy with seeing the top N results. They may want to know what else is in that result set. And that brings up yet again the question of what the result set is. What do we consider? Where's that threshold between relevant and irrelevant results? And yeah, I think there are also problems that have been there since before I got into search, which was like 15 years ago, and are still not really solved: how do we get a good golden set? How do we measure search effectively? How do we get that feedback loop going? How do we improve performance, not performance in the sense of latency, but relevance, without breaking other things? If those have been around for more than 15 years, I would assume they will be around for the next five years as well.
B
I think the golden data set problem is a huge one even outside of search, just in AI in general. Whatever AI system I'm building, if I don't have a good data set to test against, how do I know that the investments I'm making are moving in the right direction? I see a lot of companies and projects skip that step, probably because it's hard, but it's really hard to know whether the things that you're doing are actually useful if you don't have any way to test against them. And people skip that step because there's not an easy way to achieve it right now.
C
Yeah. And I feel like it's also a chicken-and-egg problem. Even if you do it, which, as you said, not everyone does, how do you know your testing thing is good? How do you make sure of that? Because that's, I think, the main difference between what we see on the Internet when people publish, oh, this is the new state-of-the-art model, this is the new state-of-the-art technique, this and that, or academia: they have a golden set. That golden set is the benchmark. So the assumption is that the golden set works. But if you're starting your, I don't know, e-commerce shop or book search website, whatever search use case, and you start from scratch, now what? How do you know?
B
I mean, I think that's the advantage that some of the stuff around coding has: typically companies have a history of things that they can build benchmark data sets around. There's issue trackers, there's prior code that engineers have built. There's been a history of creating stuff that they can mine to build these golden test sets. But if you're starting brand new in a field where the measurement of what good is is far more subjective than just compiling and running something against unit tests, it's really, really hard to create those data sets. And even if you do put the work into creating one, to your point, how do you know whether it's good or not? Radu, thank you so much for being here. It was a great conversation.
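The golden-set evaluation loop the two discuss can be sketched with a standard metric like recall@k: a golden set maps each query to the documents judged relevant, and recall@k measures what fraction of those the system actually returned. The queries and document ids below are made up for illustration:

```python
# Sketch: scoring a retrieval system against a small golden set.
# The golden set maps each query to the doc ids judged relevant;
# recall@k is the fraction of those found in the top k results.


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical golden judgments and system output for one query.
golden = {"cheap running shoes": {"doc-3", "doc-7"}}
results = {"cheap running shoes": ["doc-7", "doc-1", "doc-3", "doc-9"]}

score = recall_at_k(results["cheap running shoes"],
                    golden["cheap running shoes"], k=3)
# score is 1.0: both relevant docs appear in the top 3
```

Averaging such per-query scores over the whole golden set gives the regression signal the conversation calls for: a way to tell whether a relevance change helped without breaking other queries. Whether the judgments themselves are trustworthy is, as Radu notes, the harder problem.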
C
You're welcome. Thanks for having me.
Podcast: Software Engineering Daily
Date: May 12, 2026
Host: Sean Falconer
Guest: Radu Jorge (Software Engineer, Vespa)
This episode explores the evolution and current limitations of vector search—especially in the context of AI and Retrieval-Augmented Generation (RAG) pipelines—and how Vespa advances search technology by leveraging tensor-based retrieval. Radu Jorge shares his industry experience and technical insights, focusing on why vector similarity is not enough in modern production systems, the benefits of tensors, challenges of multi-stage re-ranking architectures, and perspectives on the future of AI search.
- Vector similarity is one signal among many.
- Hybrid and lexical search still matter.
- Vectorization is lossy.
On vector search limits:
"Vector similarity in itself is not enough. Even in vector databases, they add stuff on top." — Radu ([06:25])
On BM25 and hybrid models:
"Hybrid search—BM25 combined with all those models—would outperform the models themselves." — Radu ([08:39])
On implementation tradeoffs:
"If you add more signals and you want to combine them, that engineering investment...will happen everywhere." — Radu ([23:08])
On the future-proofing of Vespa:
"When new models came in, we could just natively support them because all the plumbing was already there." — Radu ([19:15])
On the relevance of search for agents:
"If what happens under the hood gives [agents] good results quickly, then that is...even more important than for humans." — Radu ([30:20])
On the golden dataset dilemma:
"How do you know your testing thing is good? ... If you're starting your ecommerce shop or book search website—now what? How do you know?" — Radu ([37:37])
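The hybrid-search point quoted above, that combining BM25 with vector models outperforms the models alone, is often implemented with a rank-fusion technique such as reciprocal rank fusion (RRF). This is an illustrative sketch of the general technique, not a claim about how Vespa combines signals internally:

```python
# Sketch: reciprocal rank fusion (RRF), one common way to merge a
# BM25 ranking with a vector-similarity ranking. Each document earns
# 1 / (k + rank) from every list it appears in; k=60 is a customary
# smoothing constant. Doc ids below are made up for illustration.


def rrf(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["d1", "d2", "d3"]      # lexical ranking
vector_hits = ["d3", "d1", "d4"]    # embedding-similarity ranking
fused = rrf([bm25_hits, vector_hits])
# d1 and d3 rank high in both lists, so they lead the fused order
```

The appeal of rank-based fusion is that it sidesteps calibrating BM25 scores against cosine similarities; only positions matter, so the two signals can be combined without tuning a weighting between incompatible score scales.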
| Topic | Timestamp |
|---------------------------------------|--------------|
| Introduction & Radu's background | 01:31-04:18 |
| Vespa's philosophy & origins | 03:08-05:32 |
| Limits of vector search | 06:25-09:33 |
| Vectorization losses & chunking | 09:56-12:31 |
| RAG and multi-stage re-ranking | 13:20-14:39 |
| Multimodal (images, PDFs, tables) | 14:39-16:16 |
| Tensors vs vectors in Vespa | 16:46-19:15 |
| Future-proofing via tensors | 19:15-20:40 |
| Setting up tensor-based search | 20:47-23:08 |
| Abstractions & helping tools | 23:41-25:18 |
| Named dimensions & performance | 25:18-28:15 |
| Compute to data (content node exec) | 28:15-29:55 |
| Search relevance in AI agents | 29:55-31:12 |
| Real-time updates & indexing | 32:06-34:38 |
| The future of search, golden sets | 34:53-38:19 |
This episode gives a nuanced, technical deep-dive into why search and retrieval in AI systems now demand more than simple vector similarity—and how Vespa’s tensor-first, high-efficiency architecture positions it for the data-rich, multimodal, agent-driven future of search. Radu’s practical perspective, drawn from both consulting and product engineering, highlights the blend of flexibility, scale, and future-proofing that defines Vespa’s approach as well as the open challenges that remain.