
Narrator
Retrieval-augmented generation, or RAG, has become a foundational approach to building production AI systems. However, deploying RAG in practice can be complex and costly. Developers typically have to manage vector databases, chunking strategies, embedding models, and indexing infrastructure. Designing effective RAG systems is also a moving target as techniques and best practices evolve in step with rapidly advancing language models. Google DeepMind recently released the File Search Tool, a fully managed RAG system built directly into the Gemini API. File Search abstracts away the retrieval pipeline, allowing developers to upload documents, code, and other text data, automatically generate embeddings, and query their knowledge base. We wanted to understand how the DeepMind team designed a general purpose RAG system that maintains high retrieval quality. Animesh Chatterjeet is a software engineer at Google DeepMind and Ivan Solovyev is a product manager at DeepMind, and they worked on the File Search Tool. They joined the podcast with Shawn Falconer to discuss the evolution of RAG, why simplicity and pricing transparency matter, how embedding models have improved retrieval quality, the trade-offs between configurability and ease of use, and what's next for multimodal retrieval across text, images, and beyond. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Shawn Falconer
Ivan and Animesh, welcome to the show.
Animesh Chatterjeet
Thank you.
Ivan Solovyev
Pleasure to be here.
Shawn Falconer
Awesome. Well, why don't we, you know, we have two guests today. Just so everyone can kind of learn whose voice is who. Why don't we start off with Ivan? We'll start with you. You know, who are you and what do you do?
Ivan Solovyev
Yeah, my name is Ivan. I'm product manager for File Search on Gemini API.
Shawn Falconer
Great. And Animesh, you.
Animesh Chatterjeet
Hi, I'm Animesh. I'm the engineering lead on File Search.
Shawn Falconer
Awesome. Well, thanks both for being here. So, you know, we're talking about this product you mentioned, the File Search tool for the Gemini API. And before we get too deep into, I think, some general things around AI, RAG, agents and so on, can you talk a little bit about what the File Search tool is and what problem it tries to address? Maybe Ivan, you can take that?
Ivan Solovyev
Yeah, absolutely. So, File Search Tool. First of all, it's an integrated RAG solution that makes it super easy for you to take loads and loads of data, text, PDFs, code, whatever you have, upload it into Gemini, and start asking questions about your data. There are plenty of RAG pipelines available on the market. We have Vertex RAG Engine, and there are other providers who support this feature, so it's nothing new. What we focused on in File Search in particular is accessibility and simplicity of use. We made some opinionated decisions. We removed a lot of complexity in terms of configuration and setup. You don't need to set up your database, you don't need to set up your infrastructure. The tool is just there. You just upload your data and you can use it right away. So we believe this simplicity is something that can help a lot of developers get started and overcome the complexity of setting up their own pipeline. The other big aspect that we're actually proud of is how we price the whole product. If you compare it to what's available on the market, first of all, the pricing is usually fairly complex. There are multiple components that come into the picture. You are paying for storage, you are paying for inference, you are paying for indexing, yada, yada, yada. What we did was decide to simplify the whole model. We removed most of the things that you are paying for and we focused on two simple aspects. First of all, you are paying for indexing. Whenever you upload a file, we need to do a lot of complex processing, we need to generate embeddings, so you pay for that. And after that, whenever you run a query through Gemini, you're just paying for tokens. Obviously, there is going to be some addition from File Search adding data into the context, but that's it. You're not paying for storage, you're not paying for anything else.
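For reference, the end-to-end flow Ivan describes, create a store, upload and index a file, then query with the tool attached, looks roughly like this with the google-genai Python SDK. This is a minimal sketch based on the public File Search announcement; the store display name, file path, question, and model ID are placeholders, and the exact SDK surface should be checked against the current Gemini API docs.

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Create a file search store, then upload and index a document into it.
# Indexing (chunking + embedding) is what you pay for at upload time.
store = client.file_search_stores.create(config={"display_name": "my-docs"})

op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="handbook.pdf",              # placeholder path
    config={"display_name": "handbook"},
)
while not op.done:                    # indexing runs asynchronously
    time.sleep(2)
    op = client.operations.get(op)

# Query: attach the store as a tool; retrieved chunks are added to the
# context, and you pay normal token pricing for the generation call.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does the handbook say about vacation policy?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            file_search=types.FileSearch(file_search_store_names=[store.name])
        )]
    ),
)
print(response.text)
```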
Shawn Falconer
Why make that change around pricing? Is it primarily to really try to simplify things for the users of this? Why was it needed to, I don't know, buck the trend of how people have been paying for RAG in the past?
Ivan Solovyev
Yeah, I think in Gemini API and AI Studio in general, we are aiming a lot for simplicity. And we do hear a lot of feedback from developers that it's hard to deal with lots and lots of products. It's hard to deal with different billing models and billing cycles and how the whole cost is calculated. So we do see it as a decent improvement over other products, and the price is actually much lower. So it's a good competitive advantage.
Shawn Falconer
Okay, can you talk a little bit about, I guess, the evolution of RAG? Obviously, RAG was kind of the buzzword of the moment a couple of years ago. There have also, I think, been some things out in the zeitgeist of, do we still need RAG? We have agents, and RAG is dead. You hear all this kind of stuff. I guess, where do you stand on that? And also, can you talk a little bit about the history, and how the approach to RAG and the way that we use it has changed during that time?
Ivan Solovyev
Let me talk through where I think we are with RAG, and maybe Animesh can chime in on the history of the development of this feature. In terms of where we are, I think RAG is a fundamental capability. RAG has been there from the very beginning, ever since these models got really popular and in use. It was always a staple whenever you wanted to process data. Hype cycles go up and down, and we see this with different features related to LLMs, but I feel that RAG was always there and it was always useful to some extent. In the latest years, we obviously saw improvements in the context size available to LLMs, and this does help a lot with use cases with limited data sets. We do see much better quality whenever you try to do simple retrieval tasks on a small data set that fits into the context, and we usually do recommend that approach. However, whenever you start doing any enterprise use cases, whenever you have a huge code base, whenever you have large file sets, like any legal documentation, anything like that, having RAG becomes very, very beneficial. First of all, you can work with the whole data set without building a complicated pipeline or infrastructure to actually juggle the data in and out of the context. The costs also become much better with RAG. If you put everything into the context, it becomes expensive very, very fast, especially if you're using the higher tier models like Pro models. And with File Search or other RAG solutions, you are able to actually reduce this cost. For large databases and large enterprise use cases, this adds up fairly, fairly quickly.
Shawn Falconer
And in terms of the history, have our techniques and approach to RAG changed over the last couple of years? Has that evolved as well?
Animesh Chatterjeet
Yes, I think the use cases have evolved. To your question of whether the long context models have threatened the proposition of RAG, I would say that in fact they have encouraged RAG to cover even more use cases. There are a lot of edu use cases where people want to upload documents for their entire semester. And there is now even more focus on making RAG efficient, because even though long context retrievals are working fine, we still sometimes see this context rot or lost-in-the-middle syndrome, where models are not great at retrieving data that sits in the middle of the context. So now there are new techniques for improving even the chunks that we are feeding into the model from RAG. There was a recent paper last year called REFRAG, where instead of passing the chunks as-is to the model, they embed the chunks, give all these embeddings to the model, and let the model decide which of these embeddings seem more interesting and only expand those. So, yeah, from the initial vanilla RAG, where we find everything and give it to the model, we are trying to make that smarter by figuring out which chunks to give. Also, the embedding models themselves have improved. The way we are able to represent data has significantly improved, so now we are better at understanding the context. We are doing better in languages other than English, so internationalization has also picked up. So there are these different parts through which we can say that RAG as a product is improving.
Shawn Falconer
And would you say that it's kind of the wrong, I don't know, question or view to take around tool use versus RAG? Are they really competitors, or are they more like collaborators in some sense?
Animesh Chatterjeet
I mean, RAG is in some sense a tool that we give to the model. In case it needs more information from your specific private corpus, this is the way to go. So, yeah, I wouldn't really see them as competitors.
Shawn Falconer
Yeah, absolutely. I mean, I agree as well. I think that it's kind of a misunderstanding of what RAG is to frame it in that way, but I think it is something that you see out there in the wilds of, I don't know, the Twittersphere and so forth. But that's a little bit of the Wild West of AI in some sense.
Animesh Chatterjeet
Yeah. In fact, I would say RAG is becoming popular in different ways now. Like, we recently put personalization into public preview, which enables the model to have more context about your persona. And the way to enable it is again something like RAG, where you figure out relevant chunks and give them to the model, and then it can understand your persona better and answer queries in that context.
Shawn Falconer
Can you talk a little bit about how that works in terms of being able to determine what the right chunks are to feed into the model, and how do you reduce the error rate of essentially identifying incorrect chunks?
Animesh Chatterjeet
I think what we do is, when you provide the data, we chunk it and we embed it using the latest Gemini embedding model, and then we basically index it internally. Then at query time, when the user provides the query, we again embed it using the same embedding model and try to figure out the relevant chunks from the corpus that the user has uploaded. We have some knobs on which embedding model to use and how many chunks we want to retrieve and pass to the model, and we have run a bunch of evals to find the sweet spot between latency, in terms of how many chunks we retrieve, and the quality we see. There are some knobs that users can set in terms of how they want to chunk the data, but mostly it's the default settings that we have iterated upon. We have also tweaked the system instruction to make sure that the model actually triggers this tool when it feels it's necessary to get more context, and is not triggering it unnecessarily. So yeah, the entire suite of tools has evolved and been evaluated to make sure we provide the right default settings, and there are some capabilities that users can override.
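File Search manages all of this internally, but the retrieval step described here, embed the chunks at index time, embed the query with the same model at query time, and return the nearest chunks, can be sketched roughly as follows. This is an illustrative toy only: the hashed bag-of-words `embed` function stands in for the Gemini embedding model, and the chunk texts and `k` value are placeholders.

```python
import numpy as np

def embed(texts: list[str], dim: int = 256) -> np.ndarray:
    """Toy stand-in for the embedding model (File Search uses the Gemini
    embedding model here). Hashed bag-of-words, unit-normalized."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    # Query time: embed the query with the same model, then return the k
    # chunks whose vectors are closest (cosine similarity on unit vectors).
    q = embed([query])[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

chunks = ["Refunds are issued within 30 days.", "Shipping takes 5 business days."]
index = embed(chunks)   # index time: embed every chunk once
print(retrieve("how long do refunds take?", chunks, index, k=1))
```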
Shawn Falconer
What is that sweet spot in terms of the number of chunks to return?
Animesh Chatterjeet
So I think it's in the low double digits right now, and we have kind of kept it open. We don't document how many chunks we retrieve, but yeah, it's not too many at this point in time.
Shawn Falconer
Is there some use case dependency on that or can you actually have sort of more of a universal approach to this?
Animesh Chatterjeet
Yeah, so right now we are going with the one-solution-fits-all approach because we want to keep it simple. As and when we hear of customers who feel the need for more of these chunks to be retrieved, it's easy to expose that as an option in the API. We don't want to do that right now, but if needed we could. The threshold at which we retrieve the chunks, or the number of chunks we feed to the model, those are all things we could tweak.
Shawn Falconer
Ivan, you were adding something.
Ivan Solovyev
Yeah. So far, what we saw from the partner integrations is that the default configuration actually fits most of the use cases. We do have people doing search over legal documents, and we do have people doing searches over their code bases to provide relevant guidance for code completion and such. And in all of those cases, somewhere around five chunks returned in the response from File Search was serving fairly well.
Shawn Falconer
And then, in terms of how people have historically approached RAG, there are a lot of people who want to exert a lot of control over things like chunk overlap, chunk size, various settings. So I guess, by abstracting away a lot of that retrieval pipeline, how do you balance that? Is it that you're targeting a specific type of use case or a specific type of user? Or have you really figured out the secret sauce, the right collection of those things that's just going to work for people out of the box?
Ivan Solovyev
I think most of the quality actually comes from the embedding model. You should think about this as 80% of quality is the embeddings, 20% is your configuration. So as long as we have the best embedding models, which we believe we do, the rest is less relevant to the quality of the outcome. We do believe that for most people, playing with those configurations will not yield significant improvement, and their time is better spent elsewhere than on building their own pipelines. So that's what we're focused on. At the same time, we never say don't use any other RAG pipelines. We actually say File Search is the simplest tool, the first thing you should try, and it should work for the majority of people. But if you really need the configurability, let's say your use case is very, very complex, you're processing very well structured data, specific tables, specific graphs that our system does not yet recognize well, in that case, you may want to adjust all the little knobs that come with the more complicated pipelines.
Shawn Falconer
And what kind of files are you capable of indexing?
Ivan Solovyev
I would love to say all of them, but we are indexing text files mostly, so PDFs, docs, code files, anything with text. We are currently doing OCR on images, so we're not fully ignoring images within PDFs and other files, but we are going through the OCR system, we're extracting text out of them and putting that into the context as well. And we are actually working on getting the multimodal support in as well. So we want to support native image processing, video processing, and at some point native audio as well. Gemini models are pretty good at reasoning on top of image and video data. So we want to have this retrieval capability to actually find the relevant images and put them into context so the Gemini model can see them and act on them.
Shawn Falconer
So, you know, even if you're processing text, there are lots of different types of text files. You could have code, you could have Markdown, you could have documents that contain not just images but tables and so forth. So are you able to dynamically figure out the chunking strategy on behalf of the user, or does it matter? Do you have to use a different strategy for, say, breaking up code to be able to find the relevant chunks versus something like, I don't know, a legal document?
Animesh Chatterjeet
So far we have not done anything majorly different across these different types of documents, and based on the customer feedback so far, things like code are working fine. We see that in some cases where there are graphs or tables, the default chunking strategy doesn't work. We are working on techniques to make sure we represent this data in a more structured way and provide it to the model without breaking that structured context. But yeah, that is something in the works.
Ivan Solovyev
Yeah. And in a lot of cases it is about chunking. But if you look at structured data that is not just plain text, like parsing tables and graphs, that's where we see some regressions in terms of quality. The way we address this is not through different chunking, though. It's mostly through pre-processing the data, making sure that the columns and rows in a table are aligned well when the data is represented to the model as text. So this kind of pre-processing is, I think, more important to get the quality right.
Shawn Falconer
And I guess, also going back to what you were saying about 80% being the embedding model, there's this reliance on the embedding model truly representing the semantics of whatever you're creating the embedding from. If it does, then you're going to get a higher quality search result.
Ivan Solovyev
Yes.
Animesh Chatterjeet
Plus the fact that you are overlapping chunks. So potentially you would be retrieving multiple chunks which have the overlapping parts and then together they'll kind of recreate the whole context that is needed.
Shawn Falconer
Right.
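As a rough illustration of the overlapping-chunking idea Animesh describes, here is a minimal sketch; the chunk size and overlap values are arbitrary placeholders, not File Search's actual defaults.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    # Sliding window: each chunk repeats the last `overlap` characters of the
    # previous one, so neighboring chunks can jointly reconstruct context that
    # would otherwise be split at a boundary.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "lorem ipsum " * 500   # placeholder document text
print(len(chunk_text(doc)))  # number of overlapping chunks produced
```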
Narrator
Why is there always a meeting bot in your Zoom call? Blame Recall AI. Recall AI powers the meeting bots and desktop recording apps behind products like Clulee, HubSpot and ClickUp. They handle the hard infrastructure work, capturing clean recordings, transcripts and metadata across Zoom, Google Meet, Microsoft Teams, in-person meetings and more, so developers don't have to build it themselves. If you're building a meeting note taker or anything involving conversation data, Recall AI is the API for meeting recording. Get started today with $100 in free credits at Recall AI. In mobile application security, good enough is a risk. GuardSquare uses advanced multilayered code hardening techniques and automated runtime application self-protection and mobile application security testing, combined with real-time threat monitoring, to deliver the highest level of mobile app security. Discover how GuardSquare brings all these together to provide mobile app security for your Android and iOS apps without compromise at www.guardsquare.com. You know, Fidelity is a financial services leader, but did you know that inside Fidelity is a community of technologists who are working together to shape the future of finance and tech? Fidelity is always investing in tomorrow, from emerging tech to cutting-edge tools that will transform what comes next. Their technologists are encouraged to keep learning so they can expand their skill sets, explore new ground and stay ahead of this rapidly evolving industry. And right now, Fidelity is hiring technologists to join their team. Fidelity technologists get the best of both worlds: startup energy that's grounded in the stability of a financial institution. That means support, resources and amazing benefits. Bring your skills to a culture where you're empowered to dream big and build the tech that drives an organization and makes a real impact on people's lives. Find out more at tech.fidelitycareers.com. That's tech.fidelitycareers.com. Fidelity is an equal opportunity employer.
Shawn Falconer
So you're abstracting away the vector database and the indexing that you're doing. But how does this work with updates? I think that's historically been a challenge. If I process a document and then later that document changes, or maybe a website is an even better example, where a website is going to change from time to time, but I've already indexed that particular page and then I need to re-index it. How does that kind of update process work?
Animesh Chatterjeet
There are two parts to this update, right? One is basically you calling our API to ingest those documents. We try to make sure that we are highly parallelized in terms of our ingestion latency, so we can pretty much parallelize at a chunk level and ensure that all of those are ingested into the database. And then Google has Spanner, which is also exposed externally as Cloud Spanner, and which provides very strong consistency guarantees. So once you write the data, it's almost instantaneously available to be indexed. We leverage that capability of Spanner to make sure that we can pretty much read our writes as soon as they are available. So that significantly reduces the delay in reading the indexes and reading the embeddings.
Shawn Falconer
If I've already indexed a particular page or a document, though, and then I'm re-indexing it, do I have to blow away the initial indexes in order to re-index it, or is there essentially the equivalent of an upsert in the vector world?
Animesh Chatterjeet
So essentially you have the corpus, you can add your new document to that corpus, which would just mean that the new chunks are indexed, the rest of the index remains as it is. In our world, you are not updating the document, you are inserting a new version of the document. And we would be chunking that and indexing it.
Shawn Falconer
But if the old version is there, do you run into this potential risk that when you're pulling back relevant chunks, you could pull back chunks that are no longer actually relevant because the fundamentals of the document have changed?
Animesh Chatterjeet
Yeah. So that capability we provide to the developers in terms of the corpus management or the document management APIs. So if they want, they could delete the earlier document. But from our perspective, it's difficult for us to figure out whether it is the new version of the same document or not. We are not doing data diff at our end. It's up to the developers to kind of remove the old content if they think that's not relevant anymore.
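In other words, the update flow is on the developer: delete the stale document from the store, then upload the new version. A minimal sketch of that flow follows, assuming the google-genai SDK; the `documents.list` and `documents.delete` calls, the `parent` parameter, and the display-name matching are assumptions for illustration, not confirmed API surface, so check the current File Search docs.

```python
from google import genai

client = genai.Client()
STORE = "fileSearchStores/my-docs-123"   # placeholder store resource name

def replace_document(local_path: str, display_name: str) -> None:
    # 1) Remove any previously indexed version of this document.
    #    Assumed API: list/delete on documents within a store.
    for doc in client.file_search_stores.documents.list(parent=STORE):
        if doc.display_name == display_name:
            client.file_search_stores.documents.delete(name=doc.name)

    # 2) Upload and index the new version; with the old chunks gone,
    #    retrieval can no longer return stale content.
    client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=STORE,
        file=local_path,
        config={"display_name": display_name},
    )
```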
Shawn Falconer
I see. Okay. And then is the search that's going on, is it purely vector based? Is there a hybrid element to this?
Animesh Chatterjeet
Right now it's purely semantic search, which is vector based. We have had some requests of users wanting a keyword based search and that is something we're considering adding to the roadmap. Given the indexing capabilities that Spanner offers, we think it's like a natural extension to the offering and yet something which should not add too much complexity to our system.
Ivan Solovyev
We also looked at graph RAG systems in the past, but for now it feels, to me at least, a little bit too complex for the product that we're trying to build. So we haven't found the right way to simply integrate it into the system yet.
Shawn Falconer
Yeah, I mean, I think you see a lot in these more complex RAG pipeline scenarios where they're using a combination of vector search, and there might be a knowledge graph or ontology or something like that to also ground the results in some semantic understanding. Is that something that you see as a future direction for this, or would it be more that you would use this in combination, perhaps, with a separate system that would handle that piece of it?
Ivan Solovyev
So far, what we saw from our customers is that the current setup is working well for them, and I think we will not overcomplicate it just yet. To answer your question directly, I feel that we're better off having two separate systems that can complement each other, and as your needs grow, you can implement both to serve your needs in more complex use cases. I would also say, I mentioned the Vertex RAG Engine, which is built on top of, it's not quite the Gemini API, but a very similar Gemini API inside of Vertex. So for anything that requires a lot more complexity and configuration, maybe swapping out the databases or adding these additional systems on top, we can always guide customers to a more complex solution if they really need it, and we can focus on the simplicity and getting started.
Shawn Falconer
Okay, how do citations work? How do you map, I guess sort of the generated token back to the specific source chunk?
Animesh Chatterjeet
Yeah, so right now the models are trained to cite their responses. When they generate the responses, they actually cite every sentence where they used the original corpus to generate from. And then it's just a matter of post-processing that response, removing those citations and adding them separately as grounding data. So essentially it's the model generating citations to the data it referred to.
Shawn Falconer
Yeah. So the model is trained to essentially figure out or to provide a reference back to where that text or what the source text was. And then you have to map that source text, I guess back to the database chunk and the original source in order to inject the link or something like that that refers to the citation, is that right?
Animesh Chatterjeet
Yeah. So basically the flow is something like this: when the model realizes that it needs to use file search, it will emit a query saying, I want this query to be answered by the file search tool. You run the query and you give back the responses. Each response is in some sense indexed uniquely, so if the model is receiving five chunks of data, it knows that each of them has a different index, and this index can vary per turn as well. So now when the model responds, it cites the exact unique index, using which we can figure out which chunk it was referring to, then figure out which document it was part of, and add more metadata about that.
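On the developer side, those citations surface as grounding metadata on the response. A minimal sketch of reading them, assuming a `response` from a generate_content call with the File Search tool attached (as in the earlier sketch); the field names follow the Gemini API's grounding metadata structure and should be verified against the current File Search docs.

```python
# Each grounding support ties a span of the generated answer back to one or
# more retrieved chunks by index; each chunk carries source metadata.
meta = response.candidates[0].grounding_metadata

for support in meta.grounding_supports:
    cited = [meta.grounding_chunks[i] for i in support.grounding_chunk_indices]
    sources = [c.retrieved_context.title for c in cited if c.retrieved_context]
    print(support.segment.text, "->", sources)
```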
Shawn Falconer
Okay. And in the blog post that talks about the product, you cover a company, Beam, which is an AI-driven game generation platform that's using this. Can you talk a little bit about how they are using this product to, I guess, solve problems in their world?
Ivan Solovyev
I think their use case is pretty neat and simple at the same time. They have lots and lots of new developers coming to the platform who want to build games with AI, and they're not necessarily experienced developers. They are mostly learning, and AI helps Beam educate and help those developers create their first game. The way they are using File Search is they have a huge code base, which is their engine, plus the documentation on top of that, which talks about how each component is used, how animations happen, how scripts are implemented, et cetera, et cetera. It's a rather big data set. So what they do is put it all into File Search and index it, and whenever a user starts experimenting with the agent that supports them, they will naturally ask questions about how to do specific things. Through File Search, Beam can very quickly pull all the relevant documentation into the context and actually present to the developer: hey, you probably want to use this module, here is how it works, here is all the documentation. And they've been able to close this education loop for their customers and receive great feedback.
Shawn Falconer
What's the performance on the retrieval look like?
Ivan Solovyev
Performance in terms of retrieval quality or latency?
Shawn Falconer
Let's start with latency.
Ivan Solovyev
Latency is somewhat in line with the model latency, a couple of seconds for the retrieval. In terms of quality, it will depend on the use case. If I recall correctly, we saw up to around 85%, depending on the use case, in terms of correct hits on documents retrieved.
Shawn Falconer
As a user of this, given that with any RAG system it's going to be very difficult to get 100% accuracy on retrieved documents, what are some of the things, or approaches, people take to help increase the accuracy?
Animesh Chatterjeet
I think there are a few things, right. One would naturally be the embedding model, which Ivan talked about and called out the importance of. The second is your retrieval strategy, like sometimes you would want to trade quality for latency: whether you want to go through your entire database and find all relevant chunks, or figure out the first few relevant chunks and give those to the model. So that would be the other aspect, trading off latency versus quality. And third, I think, is just the model training: triggering the search only when relevant and also not hallucinating the answers. Those are kind of orthogonal to File Search; they apply to any tool that we have trained Gemini with. But I would say those are the three aspects: the embedding model quality, your retrieval quality, and just the model quality, which probably is of utmost importance.
Ivan Solovyev
Yeah, that's mostly on our end, and that's what we are working on in terms of improving the quality for developers. What I saw is some developers actually implement post-processing. So they would implement the File Search calls in a sub-agent or a separate flow, and they will do filtering on top of the returned results. They will call Gemini one more time with a prompt that verifies the results against the context that the model already has, and they will cut out the results that don't really fit the context and improve the quality of the output that way.
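A rough sketch of that kind of post-processing pass: a second Gemini call that filters retrieved chunks against the question before they are used. The prompt wording, model ID, and helper name are placeholders, not anything from the File Search product itself.

```python
from google import genai

client = genai.Client()

def filter_chunks(question: str, chunks: list[str], model: str = "gemini-2.5-flash") -> list[str]:
    """Ask the model which retrieved chunks are actually relevant to the
    question, and drop the rest before composing the final answer."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Question:\n" + question + "\n\n"
        "Candidate passages:\n" + numbered + "\n\n"
        "Return the indices of the passages that are actually relevant, "
        "as a comma-separated list (or 'none')."
    )
    reply = client.models.generate_content(model=model, contents=prompt).text
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [c for i, c in enumerate(chunks) if i in keep]
```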
Shawn Falconer
Is there value in using a re-ranker model? Have you found in your own experiments that it actually improves things, that it's worth the investment of introducing something to re-rank the results returned from the vector search?
Animesh Chatterjeet
Not so much, in fact. I mean, that just adds more complexity. And if we extrapolate the question that we were talking about a bit earlier, about whether we even need RAG, I think if you take that logic and apply it here, once you give the relevant chunks, and as long as your context is not blowing up too much, letting the model figure out what is relevant is probably better. In terms of retrieving too many chunks, we have some threshold on the quality score of a chunk, below which we don't return those to the model. But that's not re-ranking between chunks, it's just a vanilla cutoff beyond which we don't return any more chunks. Between the chunks, in fact, we have not seen any advantage of providing a ranked order to the model.
Shawn Falconer
What about people who look at fine-tuning embedding models for specific use cases? Have there been recent results around that actually improving retrieval, or is it again just sort of overcomplicating the whole process?
Ivan Solovyev
We have a general recommendation in GDM, and I think we made this a year or so back, that people shouldn't do fine-tuning. In most cases, the speed of progress of the models is so much faster than what individual smaller labs can do in terms of fine-tuning that it's almost irrelevant. By the time you actually have a fine-tuned model, and it will probably perform better for your use case for a month or two, we're going to have the next 003 embedding model. It's going to be better across the board, like 15% on all the benchmarks, and fine-tuning won't be that relevant anymore. With that said, we do see people using fine-tuning in specific use cases. I think it does make sense if your use case is very, very niche and you own a very particular data set which you don't expect Google or anyone else to pay attention to anytime in the future; that may yield good results. But as I said, so far what we've seen is that fine-tuning becomes irrelevant within six months or so.
Shawn Falconer
Yeah, I think that's fair. I've found a similar result working with businesses on my end. I think that one of the challenges as well is that if you do go through the process of fine-tuning, then even if you are getting better results for six months or whatever, a new model comes along and you've adjusted the weights. So how do you apply that to a new version of the model?
Ivan Solovyev
Yeah, you need to start over, pretty much from scratch. You have to fine-tune the model again.
Shawn Falconer
Yeah, it gets kind of expensive as the models improve. Do you think that with a lot of the things we've historically done with RAG to try to drive up performance, it ends up that we don't have to make things quite as complicated, because we can rely on better performance from the model, and the model is kind of absorbing a lot of the complexity?
Ivan Solovyev
Yeah, I absolutely do think so. We saw amazing progress for Gemini models in the last year, and not just Gemini models; look across the board, Anthropic and OpenAI did great work in improving their text LLMs. First of all, these improvements do convert into embedding models as well. And separately, we are working on improving the embedding models more and more. In the next year or two, we're going to see significant improvements in terms of retrieval quality and in terms of the use case complexity that those models can handle. So I do believe that a lot of this additional configuration that is happening around those models will go away, and you will be able to just embed the thing, hit the search, and get results that are really relevant and useful.
Shawn Falconer
And what are some of the things that have happened over the last year or so that have made the embedding models better? Like what are the particular innovations that have happened there to really drive up performance?
Animesh Chatterjeet
I mean, we have added multimodal embedding support now, which will really improve the quality of understanding of things beyond text. So that is one thing, and I think it's in public preview right now, so we are hoping to do a GA launch for that. The other thing, which we launched in the last version of our embedding model or the one before that, I don't remember, is that we started representing embeddings with this Matryoshka representation, which basically means that the front part of the generated embedding vector has more context about the thing being embedded than the latter part. That makes it very easy for the end user to just truncate the embedding. So in case the embedding is a 3K-dimension vector and you don't want to store that much, you could just truncate it at any point and it would still give you an accurate enough representation of the entity. Some of those things have been really helpful, and some of those we can actually expose. We were talking about the knobs that we could give to the user; that could be another knob we give in the future, if users want to reduce the size of their storage by using a truncated embedding instead of the full embedding at the cost of some quality. So based on their use case, users can choose to pick one of the two.
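A minimal sketch of what that truncation looks like on the client side, assuming Matryoshka-style vectors where the leading dimensions carry most of the information; the 3072 and 768 sizes are illustrative, not File Search defaults.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 768) -> np.ndarray:
    # Matryoshka-style embeddings pack the most information into the leading
    # dimensions, so keeping a prefix is a valid lower-cost representation.
    # Re-normalize so cosine / dot-product similarity still behaves as expected.
    v = vec[:dim]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)           # stand-in for a full 3K-dim embedding
small = truncate_embedding(full, 768)  # 4x less storage, some quality loss
```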
Shawn Falconer
What do you see as some of the hard problems in this space that are yet to be solved, when it comes to RAG in particular?
Animesh Chatterjeet
Ivan, please add on, but multimodal is one: we are just starting, and I'm sure we'll add more capabilities on the multimodal side. We talked about chunking; that's still an area where I feel we can get more benefit by capturing the structure better in certain kinds of use cases. And the multilingual or internationalization side is another aspect. I feel we are getting great at solving English-related queries, but there are certain languages where we could certainly do more, and as the user base expands to countries across the world, we have more of these internationalization use cases. So that is another aspect where I feel we can certainly improve.
Ivan Solovyev
Yeah, just in general, I think getting the quality higher, better hit rates, better retrieval, is something that we always pursue. Multimodal is a very interesting aspect. Text is working great for a lot of use cases, but there are a lot more multimodal use cases that people are thinking of right now. Looking for images, looking for video, opens up a lot more consumer products that can be built on top of that. And File Search, or any RAG retrieval here, is even more beneficial than for text because of the size of the data that you are feeding into the model. So if you can reduce that as much as possible through File Search, that'd be very interesting.
Shawn Falconer
Mm. One thing, just bringing us back to the File Search tool: in terms of people who've already invested in a particular stack to do RAG, whether it's a combination of a vector database, maybe a particular framework, and they have their chunking strategy, what is it that they need to do if they wanted to migrate essentially over to the File Search tool approach?
Ivan Solovyev
Well, migration itself, I hope, is fairly easy and straightforward: you upload your data. But the reality is, I think what I would recommend they start with is probably the embedding model that we provide. It is available as a standalone service. So if they have their own pipeline and they want to run the evals and experiment with the Gemini infrastructure, I would recommend using the embedding model first with part of their data and just comparing the results. And the next step would be using File Search, uploading a small portion of their database, and running evals comparing both systems.
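The standalone embedding call looks roughly like this with the google-genai SDK; the model ID, task type, and texts are placeholders and should be checked against the current Gemini embeddings docs. You can drop these vectors into an existing pipeline and compare retrieval quality against your current embeddings before moving anything.

```python
from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-001",   # placeholder embedding model ID
    contents=["chunk one of my corpus", "chunk two of my corpus"],
    config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)
vectors = [e.values for e in result.embeddings]  # one vector per input text
```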
Shawn Falconer
And what is the status of the File Search tool in terms of its availability? Is this early access, is this in preview, or what can you share around the timelines for when people can start to get their hands on this more generally?
Ivan Solovyev
Yeah, absolutely. File Search is actually generally available for our 2.5 model family and, recently, our 3 model family. The 2.5 models are in GA, so both the models and the tool combination are generally available. The Flash 3 and Pro 3 models are still in preview, but File Search as an API is generally available there as well.
Shawn Falconer
Okay, and then what is next, besides, you know, getting it into the hands of more users? What can you share about some of the things that you're thinking about in terms of additional problems to attack, or things that you want to continue to invest in around making the product really, really easy to use?
Ivan Solovyev
So as we mentioned, multimodal support is the big push we're doing. We want to invest in a better understanding of structured data, and we keep collecting examples from our developers of tables and graphs and whatnot that they're trying to process. I think that's going to improve the quality and the applicability of this a lot. And then latency, and being able to work with much bigger data sets. We do limit at 1 terabyte right now for the highest tier, but the latency does degrade quite a bit if you start consuming all of the quota. So we want to invest in that as well and improve the retrieval latency.
Shawn Falconer
Is that one terabyte of total storage, is that right?
Ivan Solovyev
Yes, that's one terabyte of total storage across all of your file stores.
Shawn Falconer
Is there a limitation around the size of a single file that can be processed other than, I guess it needs to be smaller than a terabyte.
Animesh Chatterjeet
It's 100 MB right now; that's the limit we have kept on the file size. And for making sure that the retrieval latencies are acceptable and in a good range, we recommend users keep each individual corpus to about 20 GB. So their total data across corpuses can be 1 TB, but each corpus, or each file search store, as we call it, should be kept to about 20 GB. Then at query time, you can provide multiple of these file search store IDs, so we can fan out those queries in parallel. But an individual file search store performs well up to the limit of about 20 GB.
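Following the earlier sketch, querying across several stores is just a matter of passing multiple store resource names to the File Search tool; the store names and model ID here are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the refund policy across all product lines.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            # Several smaller (~20 GB) stores instead of one huge one;
            # File Search can fan the retrieval out across them in parallel.
            file_search_store_names=[
                "fileSearchStores/us-docs-123",
                "fileSearchStores/eu-docs-456",
            ]
        ))]
    ),
)
print(response.text)
```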
Shawn Falconer
Okay, and if I want to play with this, like, how do I get started?
Ivan Solovyev
The simplest way, I think, would be to go to AI Studio and play with the file search applets. I think the link is available in the blog post that we published. And the other way is to start hitting the Gemini API.
Shawn Falconer
Okay, great.
Animesh Chatterjeet
There are also code samples that would make it really easy for somebody to just start playing around: just upload their data using the upload API and hit the Gemini API.
Ivan Solovyev
Oh, and our vibe coding environments in AI Studio also fully support File Search. So you can just prompt the model to generate the code that uses File Search and it will do it for you.
Shawn Falconer
Okay, well, even easier. Is there anything else you would like to share as we start to wrap up?
Ivan Solovyev
I just want to say that it's been really exciting to see the adoption that we're getting for file search. We actually received quite a lot of great feedback from developers and quite a lot of excitement, and it's been really nice to see how this works for their use cases.
Shawn Falconer
Fantastic. Well, Ivan and Animesh, thank you so much for being here.
Ivan Solovyev
Thank you for having us.
Shawn Falconer
Cheers.
Date: March 12, 2026
Host: Shawn Falconer
This episode explores the design philosophy, challenges, and future of DeepMind’s “File Search” — a fully managed Retrieval Augmented Generation (RAG) system integrated into the Gemini API. Guests Animesh Chatterjeet (Engineering Lead) and Ivan Solovyev (Product Manager) explain the technical decisions behind File Search, how it addresses the complexity and costs of RAG adoption, innovations in embeddings, trade-offs of abstraction vs. configurability, and the push toward multimodal retrieval. The discussion offers practical insight for engineers implementing or migrating RAG pipelines.
Simplifying RAG Adoption
Radically Transparent Pricing
RAG is Not Dead—It’s Foundational
Technique Evolution
Collaboration, Not Competition
Chunking Strategy
Configurability vs. Simplicity
File Types & Multimodal Support
Handling Updates
Retrieval Method
Retrieval Latency
Retrieval Quality
Improving Accuracy
Fine-Tuning Embeddings?
Hard Problems Ahead
Scalability
Getting Started
On RAG’s Role:
“RAG is a fundamental capability. RAG has been there from the very beginning.”
— Ivan Solovyev (05:37)
On Simplicity:
“We made some opinionated decisions … The tool is just there. You just upload your data and you can use it right away.”
— Ivan Solovyev (02:52)
On Pricing:
“We removed most of the things that you are paying for and we focused on two simple aspects … you're not paying for storage, you're not paying for anything else.”
— Ivan Solovyev (03:30)
On Model Progress:
“The speed of progress of the models is so much faster than what individual smaller labs can do … fine-tuning won’t be that relevant anymore.”
— Ivan Solovyev (30:05)
| Timestamp | Segment |
|-----------|------------------------------------------------------|
| 02:35 | What is File Search – simplicity & pricing |
| 05:37 | RAG's foundational role, cost/scale dynamics |
| 07:30 | Evolution & new RAG techniques (REFRAG) |
| 10:19 | Chunking, embedding selection, and error reduction |
| 12:20 | Default config typically "just works" |
| 13:17 | Trade-offs: configurability vs. ease of use |
| 14:25 | File types, OCR, and future multimodal plans |
| 19:46 | Indexing updates, parallelization, Spanner use |
| 21:47 | Search: vector/semantic, future hybrid roadmap |
| 23:48 | Citations and mapping generated answers to sources |
| 25:30 | Beam use case: education & code/document retrieval |
| 26:47 | Performance: latency and retrieval quality |
| 29:03 | Is re-ranking worth it? |
| 30:05 | Fine-tuning embeddings – worthwhile or not? |
| 32:48 | Matryoshka embeddings for space/accuracy trade-off |
| 34:07 | Remaining hard problems: multimodal, structure, intl. |
| 37:59 | Storage limits and scaling advice |
| 38:52 | How to get started and available resources |
“I just want to say that it's been really exciting to see the adoption that we're getting for file search. We actually received quite a lot of great feedback from developers and … it's been really nice to see how this works for their use cases.”
— Ivan Solovyev (39:32)