Uber’s On-Call Copilot with Paarth Chothani and Eduards Sidorovics - Software Engineering Daily

Summary

Podcast: Software Engineering Daily
Host: Sean Falconer
Guests: Parth Chothani (Staff Software Engineer, Uber AI GenAI team) and Eduards Sidorovics (Senior Software Engineer, Uber AI Platform team)
Release Date: April 8, 2025

Introduction

In this episode, Sean Falconer interviews Parth Chothani and Eduards Sidorovics about GENIE, Uber's AI-powered on-call copilot designed to streamline on-call operations and enhance the efficiency of engineering teams across the company.

The Genesis of GENIE

Motivation and Challenges
Parth explains that Uber's extensive platform teams and the reliance on tools like Slack for support created significant inefficiencies. Engineers frequently faced delays in obtaining help, leading to frustration and reduced productivity.

“There was a lot of pain that we as engineers faced when asking for help from other support teams or other platform teams.”
— Parth Chothani [03:42]

Solution: An Automated Copilot
To address these challenges, Uber developed GENIE to provide real-time responses to queries by leveraging internal knowledge sources, thereby improving incident resolution and team collaboration.

“We wanted to have an automated solution which can look at all the internal knowledge sources and be able to answer questions that engineers across the company can take help from and really improve their efficiency.”
— Parth Chothani [03:42]

Architecture & User Interaction

User Onboarding and Interaction
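Eduards describes the user experience: a team onboards GENIE through Michelangelo, Uber's ML platform, by creating a project and specifying its internal wikis or help desk channel. A pipeline then scrapes, embeds, and stores this data, and a backend service answers questions that arrive through a Slack bot by sending them to an LLM along with the retrieved context.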

“You just specify their sources and then boom, that's it.”
— Parth Chothani [05:33]

Backend Pipeline
The backend uses Spark to parallelize chunk processing and embedding generation with models such as OpenAI's Ada, and Cadence, an Uber-built workflow system, to orchestrate ingestion. The data is stored in a vector store: currently Uber's homegrown solution, Sia, with a move towards OpenSearch under way for better open-source compatibility.

“We use Spark to take a lot of internal sources, generate embeddings on the fly, and ingest data into Vector store at scale.”
— Parth Chothani [06:20]
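
The ingestion steps described above (chunk, embed, store) can be sketched in a few lines of Python. This is a minimal single-process illustration, not Uber's pipeline: `chunk_text`, `toy_embed`, and `ingest` are hypothetical names, the hashed bag-of-words vector stands in for a real embedding model, and the returned list stands in for a vector store. In the setup the guests describe, Spark executors would run the per-chunk work in parallel.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    source_url: str       # kept so answers can cite where the text came from
    text: str
    vector: list[float]


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(pages: dict[str, str], max_words: int = 50) -> list[Chunk]:
    """Chunk every page, embed each chunk, and return rows for a vector store."""
    rows = []
    for url, body in pages.items():
        for piece in chunk_text(body, max_words):
            rows.append(Chunk(url, piece, toy_embed(piece)))
    return rows
```

Keeping the source URL next to each vector is what later lets the bot cite the page an answer came from.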

Technical Details: Pipeline, Embedding, Vector Store

Embedding Models
GENIE can use either open-source embedding models hosted in-house or third-party models from providers such as OpenAI. The team started with OpenAI's Ada embedding models, which have worked reasonably well so far.

“We preferred the ADA embedding models from OpenAI to begin with and those have worked reasonably okay...”
— Parth Chothani [12:34]

Vector Store Evolution
GENIE currently runs on Sia, Uber's homegrown vector store, and the team is moving towards OpenSearch to become more open-source compatible.

“We are trying to move towards other better Vector Store solutions like OpenSearch...”
— Parth Chothani [13:09]

Feedback and Evaluation

Continuous Feedback Loop
Eduards highlights the importance of user feedback: each GENIE reply prompts the user to rate it with an emoji, which tells the team whether the answer resolved the question and where to improve.

“When GENIE replies, there's like a pop up of you can reply with the emoji saying okay, is it good...”
— Eduards Sidorovics [16:50]
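
One way to turn these emoji ratings into a helpfulness number is sketched below in Python. The reaction names and the `helpfulness_rate` function are illustrative assumptions; the episode does not say which emojis GENIE uses or how the rate is computed.

```python
from collections import Counter

# Hypothetical reaction names; the episode does not name the actual emojis.
HELPFUL = {"white_check_mark"}
NOT_HELPFUL = {"x"}


def helpfulness_rate(reactions):
    """Return the fraction of rated answers marked helpful, or None if none were rated."""
    counts = Counter()
    for emoji in reactions:
        if emoji in HELPFUL:
            counts["helpful"] += 1
        elif emoji in NOT_HELPFUL:
            counts["not_helpful"] += 1
        # Any other reaction is ignored: it is not a rating.
    rated = counts["helpful"] + counts["not_helpful"]
    return counts["helpful"] / rated if rated else None
```

Returning `None` for unrated channels avoids reporting a misleading 0% for a channel where nobody has left feedback yet.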

Evaluating Documentation Gaps
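An unhelpful answer means either that the RAG components underperformed or that the documentation was missing. For the second case, the team uses another LLM as a judge to summarize the unanswered questions and suggest what should be added to the documentation, and where.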

“If the answer is not good, it means that either RAG components were not good or actually the documentation was not there.”
— Eduards Sidorovics [17:14]

Security Considerations

Protecting Sensitive Information
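Security was a top priority. Requests to OpenAI do not go out directly: they pass through Uber's GenAI Gateway, which filters out Personally Identifiable Information (PII) before anything leaves the company. The team also worked with Uber's security teams to hand-curate which data sources were safe to embed.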

“We really wanted, as Eduards mentioned, like the redaction PII data should be redacted before it gets sent out.”
— Parth Chothani [26:28]
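
To make the redaction step concrete, here is a small Python sketch of filtering text before it is sent to an external model. It is not the GenAI Gateway's implementation: the two regular expressions (a US Social Security number format and a simple email format) and the `redact` function are assumptions chosen for illustration, and a real gateway would need to cover many more PII types.

```python
import re

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every match of each PII pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `redact("My SSN is 123-45-6789")` returns `"My SSN is [SSN]"`, so the number never reaches the model.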

Challenges Faced

Addressing Hallucinations
One of the initial challenges was ensuring GENIE provided accurate information, as early versions sometimes generated incorrect responses.

“Hallucination was a start where we were like, you know, just spitting out things that were pretty much wrong sometimes.”
— Parth Chothani [27:01]

User Experience Design
Creating a frictionless and intuitive UI without cumbersome approval processes was another significant hurdle.

Evaluation Methodologies
Developing in-house methods to measure productivity gains and answer accuracy required innovative thinking, as there were no industry standards available.

Managing Non-Determinism
The stochastic nature of LLMs introduced non-determinism, necessitating a shift in engineering mindset to handle unpredictable outputs.

“The non determinism definitely is one of the things that makes this whole product building so challenging.”
— Parth Chothani [34:23]

Impact and Metrics

Productivity Gains
GENIE has been deployed across over 150 channels, answering more than 70,000 questions with a 48% helpfulness rate. Parth estimates that it has saved Uber approximately 13,000 engineering hours since its inception.

“We estimated roughly when we did math around like 13k engineering hours so far we have saved across the company.”
— Parth Chothani [37:23]
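
As a rough sanity check on these figures, the Python snippet below computes the average time saved per helpful answer that they imply. The assumption that all savings come from helpful answers is ours, not the guests'; the episode does not describe how the 13,000-hour estimate was derived, and the inputs are rounded ("over", "more than", "approximately").

```python
def implied_minutes_per_helpful_answer(questions: int, helpful_rate: float, hours_saved: float) -> float:
    """Average minutes saved per helpful answer implied by the reported totals."""
    helpful_answers = questions * helpful_rate
    return hours_saved / helpful_answers * 60


# Reported figures: 70,000 questions, 48% helpful, about 13,000 hours saved.
minutes = implied_minutes_per_helpful_answer(70_000, 0.48, 13_000)
print(round(minutes, 1))  # about 23 minutes per helpful answer
```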

Future Directions

Enhancing Accuracy and Features
The team is focused on developing GENIE V2 to meet increasing user expectations by improving accuracy and integrating advanced features like user intent detection.

“We are definitely very much thinking about, like, taking this and making it a V2 version where we can have a very high level of accuracy.”
— Parth Chothani [39:54]

Adapting to Evolving Technologies
With the rapid evolution of AI models, GENIE’s architecture is designed to be flexible, allowing for quick integration of new advancements to maintain and enhance performance.

Key Takeaways

User-Centric Design: Simplifying onboarding and interaction is crucial for widespread adoption.
Continuous Feedback: Implementing robust feedback loops ensures ongoing improvement and accountability.
Security First: Protecting sensitive data through comprehensive security measures is essential.
Adaptability: Building flexible systems allows adaptation to evolving technologies and user needs.
Impact Measurement: Creative and rigorous methods to measure productivity gains demonstrate the tangible benefits of AI tools like GENIE.

Parth concludes with an encouraging note for developers venturing into generative AI applications:

“People have to just be open to the fact that whatever we build is… open to experimentation is a healthy mindset for GENIE.”
— Parth Chothani [40:37]

This episode provides a comprehensive overview of how Uber leveraged AI to create GENIE, addressing internal inefficiencies, enhancing productivity, and navigating the complexities of AI integration in a large-scale organization. It offers valuable insights for engineers and organizations looking to implement similar AI-driven solutions.

Transcript

[00:00]
Sean Falconer
At Uber, there are many platform teams supporting engineers across the company, and maintaining robust on-call operations is crucial to keeping services functioning smoothly. The prospect of enhancing the efficiency of these engineering teams motivated Uber to create Genie, which is an AI-powered on-call copilot. Genie assists with on-call management by providing real-time responses to queries, streamlining incident resolution and facilitating team collaboration. Parth Chothani is a staff software engineer on the Uber AI GenAI team. Eduards Sidorovics is a senior software engineer on the Uber AI platform team. In this episode they join the show with Sean Falconer to talk about the challenges that motivated the creation of Uber Genie, the architecture of Genie, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
[01:06]
Host
Parth and Eduards, welcome to the show.
[01:08]
Parth Chothani
Thank you Sean for having us.
[01:10]
Eduards Sidorovics
Thank you.
[01:11]
Host
Hi, yeah, thanks for being here. You know, I'm really excited to talk about, you know, Genie, this on-call copilot that you guys were involved in at Uber. But maybe before we get there, let's have you introduce yourselves, just so, since there's both of you, people can hopefully learn your voices. But let's start with you, Parth. Like who are you? What do you do?
[01:29]
Parth Chothani
Yeah, hey everyone, I'm Parth here. I'm a backend infrastructure engineer on Michelangelo, which is like the SageMaker equivalent in Uber. And yeah, I've been here at Uber for four years working on distributed systems, generative AI and core ML problems. And before that I was at AWS, also building chatbot-like solutions, and also at Microsoft working on Teams and those kind of products here.
[01:55]
Host
Awesome. And Eduards, same question to you. Who are you? What do you do?
[01:59]
Eduards Sidorovics
My name is Eduard and I joined Uber a bit more than a year ago and pretty much started right away to work with Parth on Genie and also part of ML AI platform team. Yeah, and before that was pretty much working in some startups, training mostly some deep learning models and partially for a year also work in the same manufacturing company and doing some MLOps stuff for them.
[02:24]
Host
Awesome. So let's get into GENIE a little bit. Like can you explain what this project was and sort of how did it come to be?
[02:31]
Parth Chothani
Yeah, I can maybe take a first shot at it. So at Uber generally there is so many platform teams, you know, supporting many engineers across the company and there is a lot of like tooling and all of that that gets built to support all the engineers and to make sure that infrastructure is very highly scaled. Right. And as part of that there's many, many like support forums specifically. Slack is a very popular one and people generally engineers will come to Slack for help. And what you know, we also went through as part of like our background and whatnot is like there was a lot of pain that we as engineers faced when asking for help from other support teams or other platform teams. And this was a recurring pain across the company. So that was something that prompted us to think like, how can we solve this kind of a problem? And that's where inception of GENIE started, that we wanted to have an automated solution which can look at all the internal knowledge sources and be able to answer questions that use customers across the company. Engineers across the company can take help from and really improve their efficiency.
[03:42]
Host
So I mean, I think that this is like a super common problem that like many companies all suffer from. And it's certainly as you scale any organization, it becomes more and more of a problem where you end up with all these sort of data silos that exist or in the Slack world they're like chat silos of one off bespoke conversation that's happening where someone gets help and then inevitably people ask the same types of questions. It's hard to surface that in a uniform way. And it becomes this kind of like death by a thousand cuts. Basically every company suffers from this and certainly at scale become a real hindrance. So I totally get that. I want to get into Genie's architecture a little bit and how you built up the project. So can you talk a little bit about what's actually going on? What is the user interaction and then sort of what is happening behind the scenes to support that user interaction?
[04:31]
Eduards Sidorovics
I mean I will just maybe start with the user experience of it and pretty much assuming I have a team and then we have, we maintain our own Engwiki and we have our own help desk channel. So we want to customers, we want to onboard Genie. So how it happens is pretty much we have like a platform, Michelangelo. Then you go there, you create a project, you specify the Engwiki which you want to use and then everything else happens on our side. You pretty much, it creates the pipeline, you run the pipeline on backend, it pretty much scrapes the data, embeds everything and stores it. And yeah, there's another backend service which actually is queried when someone asks a question, being that on other end there's a, let's say a Slack bot and then it calls this backend service which gathers from which channel it is queried and then sends the question to LLM with the provided context.
[05:33]
Parth Chothani
Yeah. And just to add to that, basically like our goal has always been a simplified, very collective user experience where people can come in, they just specify their sources and then boom, that's it. And then everything else is just one click set up for them to be able to, you know, use Genie in their own Slack channels or in their own UIs, whatnot. Yeah. So that's been our North Star experience that we have been always trying to build towards.
[06:00]
Host
So me as a user, I point this to some internal wikis and knowledge bases and then this RAG pipeline kicks off where it's going to go and essentially parse those, presumably go through some chunking process, create embeddings, land that in some sort of vector store. Can you get into details like how does that pipeline work? What are the sort of the steps and components of it?
[06:21]
Parth Chothani
Yeah, so underneath we use a lot of big data technologies like Spark, which helps us be able to take a lot of internal sources, be able to generate embeddings on the fly. Parallelize basically with different, let's say we have executors being taking different chunks, trying to create embeddings either through in house models or third party embedding models. And then even for something like when we want to push the data to the vector store we have like workflows and open source technologies that we have used, something like Cadence, which is an Uber-grown workflow system, to be able to ingest data into the vector store at scale and there also we have Spark behind the scenes to be able to take all this data and do a faster ingestion on the fly here.
[07:07]
Host
And when I come in and I select these sources, how long does it take to essentially generate the vector embeddings to a point where I can start actually interacting, going through sort of the user experience, being able to get my questions answered by this copilot.
[07:21]
Eduards Sidorovics
Let's say there's a different thing. Like one is when you onboard yourself and because of there's some like you have to wait for approvals and whatnot. So it can take like a day for example. But if you, let's say update your sources or you completely revamp your sources, whatever, it doesn't matter, run the pipeline. But yeah, takes maybe today around 1 and then took 15 minutes, then it completely updates the sources.
[07:45]
Host
Yeah. Why have it where it's sort of a user configuring these sources for their specific needs versus sort of more of like a wholesale pipeline that is scraping everything, building one vector representation of all these internal Knowledge bases and wikis and then presumably being able to use the semantic search to attach the right context for the user when they're interacting with the copilot.
[08:09]
Parth Chothani
No, that's a fantastic question because that was very much like our first thought as we wanted to build something like that. I think some of the things we learned also. So we were trying to explore some solutions like Glean, which could support something like this out of the box. And what we kind of found was that the way we had configured Glean inside Uber, it was very, very individual access oriented and there was some not something like public data which was all scraped for us already. So that was one problem that we kind of surfaced very, very early on. And then I think we also found like when we are more focused on ingesting sources which are, let's say more hand curated, more filtered, the vector store always does a better job at surfacing that information. And also the accuracy is much higher versus when we experimented like we tried ingesting all, for example, Engwiki and the accuracy of the answer seemed all over the place for a given use case. So we felt like not the best of the performance either. So we try to find a sweet spot where we can enable people to bring in their sources, but then again make it more like a magical user experience so that it's more UI driven, people don't have to do much and that's what we have been building towards.
[09:30]
Host
Right.
[09:30]
Eduards Sidorovics
I just wanted to add that it's also the use case driven. In our case, we have a help desk channel from Michelangelo for our ML platform. So people are coming there to ask specifically about Michelangelo. So it's better to narrow down only to Michelangelo and not to surface anything. What someone else have written about Michelangelo which might not be updated. Maybe they once wrote it, they're not updating, but now because of like their outdated information, they will surface the wrong answer. So we might also kind of eliminate that.
[10:03]
Host
Got it. Yeah. So it's sort of like an easier way to get the performance and accuracy that you need by having people sort of self select into wire how they want to constrain the universe, then try to programmatically essentially figure out what the right context is going to be because you're going to end up with a lot of noise across these different potential internal wikis.
[10:23]
Parth Chothani
Yeah.
[10:23]
Host
Was there consideration around essentially scraping everything, but then for each sort of chunk keeping the representation of the source. That way when I'm selecting my sources, I don't have to go through the generation process of the Pipeline based on the sources. I'm essentially just subsetting the existing set of sources and embeddings.
[10:43]
Parth Chothani
Yeah, we definitely wanted that kind of experience to start with. I think we found some infrastructure gaps where we pretty much don't have all this internal sources in an offline store that we can just pretty much take and create embeddings in the background. So we found some limitations there and which is where we went with the next best experience, which is like let's just create it on the fly. Also the other thing is there is a lot of wastage if we just were to create everything behind the scenes. The reason being that it feels like almost every team runs its own processes and own style. So some teams prefer, like Michelangelo, for example, is very Engwiki driven. We will be very meticulous in updating wikis with all FAQs and all user documentation. And some other teams seem to not have that discipline. Which is where again, enforcing something like this means like if we were to just blindly do everything, we pretty much might waste a lot of resources also and not have much business gain either versus letting people choose. I think then we have given them the capability to refresh knowledge, which means like once they know what they need, then they are able to refresh it at their own pace, which pretty much is like a second best experience of like what you are talking about.
[12:00]
Host
So if internal information gets updated in any of these source information, do I need to go manually like do a refresh? Or is there automatically when these updates happen to the actual like internal knowledge base or wiki, it kicks off this pipeline to do the updates automatically.
[12:17]
Eduards Sidorovics
I mean it doesn't detect anything. So far it's meaning that initially it was only manual. Like if you update and you want to update, you go ahead and click. But then now it's also orchestrated in a way that it's like a cron job in a way that you can update it daily or whatever cadence you prefer.
[12:35]
Host
Okay, and then what model are you using for generating the embeddings?
[12:39]
Parth Chothani
Yeah, so we have two different flavors of models. We have some of the third party models which are open source and which we have hosted inside those are some options and other options are also like, you know, third party models which OpenAI and other providers also give. So I think we have those options but generally we preferred the ADA embedding models from OpenAI to begin with and those have worked reasonably okay for what we have been trying to test with so far here.
[13:09]
Host
And what are you using for the vector store?
[13:10]
Parth Chothani
Yeah, so we have a homegrown solution right now for vector store and we are trying to move towards other better vector store solutions. But that homegrown solution is what we call Sia and that's a solution that we have been working towards and we're trying to embrace a new technology called OpenSearch as we are trying to kind of become more open source compatible.
[13:34]
Host
Yeah, was that something that existed already at Uber or was that something you built specifically for this project?
[13:40]
Parth Chothani
Yeah, so the technology did exist, but the technology existed for more for like the typical search, which is more text based search. I think when we started the project there's like more of like company started realizing that there is a much better, bigger need for VectorDB store hosted solution. Right in house hosted solution. So then the team, one of our sister teams, they spinned up infrastructure to be able to host a managed VectorDB solution. So it was more like, I think we learned as a company there was a need across for Genai solutions like this to have a very nice highly available infrastructure for VectorDB here.
[14:25]
Host
And in terms of both from the pipelines and also like sort of the user experience interacting with the copilot, how is that essentially built to maintain like reliability and sort of durability so that parts of essentially this pipeline don't end up breaking or you know, going out at some point.
[14:43]
Parth Chothani
Yeah, I can start and maybe Eduards, you can chime in. As part of any production, internal or external applications, we always, we have a highly available monitoring system. So as part of that what we do is we make sure we have alerts on the backend APIs that surface responses to the Slack channels or any UIs that we are supporting for Genie. So that's obviously part one. We also look at logs to make sure like if there's nothing obvious that is going wrong and then Eduards can probably chime into more of the evaluation and what we have built around to make sure customers learn about how their applications, how their channels and UIs are performing against our backends and the whole end to end.
[15:30]
Eduards Sidorovics
Yeah, I think it's one of the solutions is that we constantly receive feedback, meaning that when Genie replies there's like a pop up of you can reply with the emoji saying okay, is it good, it's resolved by Genie or it's not good enough. So this is also kind of keeps always a feedback loop for us to know like okay, if something is good or not good. Yeah. And then we kind of build some evaluation on top of it being that one of the more like interesting solutions what we did is that we thought, okay, that kind of Genie has an interesting perspective on the documentation because like when you as engineer you write the documentation, you think you know what people need to know, but typically it's not true. And when customers ask a question, it means, typically it means that something is not covered in documentation or they were just lazy to check the documentation. And what we actually built is that we check what answers were not good. Meaning that if the answer is not good, it means that either RAG components were not good or actually the documentation was not there. And if the documentation was there, then we like with another like LLM as a judge. We try to suggest what is missing in the documentation. Like it kind of summarizes all the unanswered questions and tries to point out where it should be added and what should be added. Like specifically how to run this pipeline, how to debug it kind of points out obviously it doesn't know how to do it because it's Uber internal knowledge. But yeah, it helps. And it's actually, yeah, some users actually kind of acting on it pretty well.
[17:15]
Parth Chothani
And just to add to this, like basically our idea is give people these tools that Eduards was talking about where they can pretty much figure out some of the high level themes around what documentation is missing, where the bot might be underperforming. They have at least a headway to figure out how to improve their channel quality.
[17:36]
Host
In terms of the feedback loop, is that primarily for you to sort of monitor performance and also give the team some insight into where maybe the documentation isn't meeting the needs essentially or. Or is some part of that also factored into sort of the learning cycles of the actual copilot. So if I know that the response wasn't good, I can take that into account the next time I generate a response to a similar query.
[18:03]
Eduards Sidorovics
So that is more of a first one that yeah, it's to hold us accountable that knowing how good it performs and to make sure that we also motivate customers to update their Engwikis. And yeah, to point out what is missing. But it's also, yeah, now it's, we're kind of adding more on top of it. That means that it's. You can help to update the documentation right away. Like I think Parth can maybe explain it more, but it means that you can update the FAQs and then it will go eventually to the knowledge base and it will help to answer the question later on.
[18:41]
Parth Chothani
Yeah, it's more like building a loop where people find out what is missing, they add FAQs to the documentation and then you know, we have the refresh knowledge pipelines which pretty much take these FAQs, refresh it. So it's like a quick feedback loop. And we actually found out inside even Michelangelo that there were parts of our documentation which are outdated. We didn't know about it. And then the bot surfaced some answers and we were like how did this happen? And then we took initiatives to clean up documentation and that was like a quick feedback loop that without even looking at our evaluation reports we found out immediately that hey, there is this prompts within our documentation where we give conflicting information that we ourselves have not reconciled.
[19:24]
Host
Yeah, I mean I think that's super valuable because I don't know any significantly large organization I've ever worked for. The internal documentation usually can get really horrific over time. It just, you know, there's not a huge incentive a lot of times to keep those things up to date so it can really fall out of date. But they're really valuable, especially for new people because they don't know where to get those answers. And the only option you have is internal documentation or. Or you end up having to message somebody and get that sort of bespoke answer.
[19:51]
Sean Falconer
This episode is sponsored by Mailtrap, an email platform developers love. Go for high deliverability, industry-best analytics and live 24/7 support. Get 20% off for all plans with our promo code sedaily. Check the show notes for more information.
[20:08]
Host
Can you take me through sort of the life of a query? So I'm interacting with this over Slack I put in a query, then what happens? Sort of behind the scenes.
[20:17]
Parth Chothani
Yeah, so behind the scenes when you're querying basically we will invoke API which pretty much underneath tries to figure out what the user is trying to do. And there is also as part of the query we also have a very customized Slack workflow functionality that we have built a plugin inside which can take additional information from the user on what they're trying to do, what is the action they're trying to perform, which particular product they're trying to interact with, what is the way to reproduce their problem. So pretty much think of all additional context that on call and a bot needs to be able to even figure out what the user is trying to do with this whole additional information that we send as part of the question. Then we pretty much generate embeddings on the fly for the question. We do a vector DB lookup. We make sure we have all the right context and as part of the ingestion that we have done for the source data, we make sure the ingestion follows the schema. That way there is, you know, source URLs, there is metadata around what the page was about, and all of this pretty much is part of available as part of the ingested data in VectorDB. So when we are sending all the information to LLM, we want to make sure that there is information around citation. There's much more metadata that we can surface for different use cases pretty much. So that's all of the metadata is fetched along with the source URL and everything. And we send that to LLM with different prompts and we allow users, different users to configure prompts. There is flexibility in what they want to solve. And as part of this, then the LLM pretty much decides what the answer should be based on the prompt and all. And then that's what is surfaced to the user today.
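
[Editor's sketch] The query path described in this answer (embed the question, look up the nearest chunks, pass them to the LLM with their source URLs for citation) can be illustrated in Python as follows. This is a simplified sketch under stated assumptions, not Genie's code: `cosine`, `top_k`, and `build_prompt` are hypothetical names, rows are plain dictionaries standing in for vector store records, the question vector is assumed to come from the same embedding model used at ingestion, and the prompt wording is invented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=2):
    """Return the k rows whose 'vector' is most similar to the query vector."""
    return sorted(rows, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)[:k]


def build_prompt(question, extra_context, hits):
    """Assemble an LLM prompt with numbered context snippets and their source URLs."""
    lines = ["Answer using only the context below and cite the source URLs.", ""]
    for i, hit in enumerate(hits, 1):
        lines.append(f"[{i}] {hit['text']} (source: {hit['source_url']})")
    lines += ["", f"User details: {extra_context}", f"Question: {question}"]
    return "\n".join(lines)
```

The `extra_context` argument corresponds to the additional details collected through the Slack workflow, such as the product involved and the steps to reproduce the problem.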
[22:03]
Host
In terms of the LLM, what model are you using?
[22:05]
Parth Chothani
Yeah, we have experimented with different models that OpenAI came up with. So we started with GPT-4, then we moved to Turbo, there's GPT-4o now. And then we are trying to look at the reasoning models also to see how we can have certain questions answered in a much more crisper and cleaner way with the detailed reasoning.
[22:26]
Host
Still, you mentioned at the beginning of that query to response pipeline that you're trying to figure out what is the user actually want so that you can sort of attach that to, you know, creating the context, what's involved with figuring out what the user actually wants, what the intention behind the query is.
[22:47]
Parth Chothani
Yeah, I think part of this, what we are trying to also currently experiment with is like user intent detection where we can figure out like, is the user trying to debug a problem? Is the user's question about like a product, those kind of things. We are trying to experiment and see where intent detection can help us figure out like more of the user's thought process because not all type of questions we have understood also the bot can do a great job at. So we want to also as part of like our accuracy enhancement is be more mindful of where the bot can excel and where the bot cannot excel. And that's part of like where we are trying to do experimentations and detect some user intent detection right now.
[23:30]
Host
Yeah, what about metrics around evaluating sort of the effectiveness of this? Like do you have things that you're tracking even in sort of the development process, like using like an eval framework Some of these newer frameworks exist for building generative AI applications in order to figure out if you make a change to how you're generating your embeddings or how you're, you know, figuring out the intent that's actually a performance improvement versus a degradation of some sort.
[23:56]
Eduards Sidorovics
I think the main metric is the customer feedback. That's I guess our end goal.
[24:02]
Host
Yeah, so if you make a change, essentially you're waiting for sort of live feedback to see if your accuracy is improved based on the feedback from the users.
[24:11]
Parth Chothani
So I think that's part one of it obviously. And then there's the golden data sets that people generally hand curate so that we make sure there is more like quality built in before also deploying a change. So if somebody changes a prompt or something, we generally ask the users to do more testing against the golden data set so that they have thought about what kind of implication that have. And Eduards can maybe chime in on the post production rollout here.
[24:37]
Eduard Sidorovich
We also tried different evaluation approaches: more classical NLP, and then also LLM as a judge. And apparently, in most cases, LLM as a judge is simpler and actually typically works better.
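To make the contrast concrete, here is a small sketch of both styles of evaluation. The token-overlap F1 stands in for "classical NLP" and is my choice of metric, not one named in the episode; the judge prompt wording and the `call_llm` callable are likewise hypothetical, with the model client injected so that no particular vendor API is assumed.

```python
from typing import Callable

# Hypothetical judge prompt; the wording is an assumption for illustration.
JUDGE_PROMPT = (
    "You are grading a support bot.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Bot answer: {answer}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def token_f1(answer: str, reference: str) -> float:
    """Classical overlap metric: F1 over unique lowercase tokens."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    common = len(a & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(r)
    return 2 * precision * recall / (precision + recall)


def llm_judge(
    question: str,
    answer: str,
    reference: str,
    call_llm: Callable[[str], str],
) -> bool:
    """Ask a model (injected as call_llm) whether the answer matches the reference."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    verdict = call_llm(prompt).strip().upper()
    return verdict == "CORRECT"
```

The overlap metric penalizes a correct answer that is phrased differently from the reference, which is one plausible reason a judge model can work better in practice.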
[24:50]
Host
Were there any challenges or thought around the risk of sensitive information being shared with GENIE?
[24:57]
Parth Chothani
Yeah, that was something we really brainstormed and thought a lot about. I also came from Amazon, where I was a security certifier, so security was always top of mind when we started this. We want to be very mindful of what data gets exposed outside the company. So in the beginning we were very thoughtful about hand curating which data sources are secure. Like many other companies, we have different gradations of what data is private versus public, or what is very sensitive and cannot be leaked outside. So we worked with our security teams. We hand curated certain data sources which were reasonably, you could say, public inside the company. And we obviously went through a lot of different processes inside the company before we were okay to even create embeddings for those kinds of data sources. That was our due diligence to make sure that, as we develop a new productivity enhancement, we don't leak data that will damage our company's reputation.
[25:57]
Eduard Sidorovich
One thing to add: there's a very cool solution which is built in Uber, the GenAI Gateway. Imagine that you have the OpenAI API, but the call doesn't go directly to OpenAI. It goes through a gateway, and at the gateway they actually filter PII data, so there's not a high risk of leaking anything.
[26:21]
Host
Yeah. So if I put in my Social Security number for some reason it's going to get filtered out by the gateway.
[26:27]
Eduard Sidorovich
Yeah, yeah.
[26:28]
Parth Chothani
As Eduards mentioned, we really wanted PII data to be redacted before it gets sent out. That's built into the broader ecosystem that our sister teams have built, to make sure that security is built in and we don't have to worry. But still, as application owners, we have done our due diligence to make sure PII doesn't even come into our ecosystem.
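The gateway pattern discussed here, redacting before anything leaves for an external model, can be sketched as follows. This is not the GenAI Gateway's actual code or API: the two regular expressions, the placeholder format, and the `gateway_call` function are assumptions for illustration, and real PII detection covers far more cases than a Social Security number and an email address.

```python
import re
from typing import Callable

# Illustrative patterns only (assumptions); real PII detection is much broader.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every match of a known PII pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def gateway_call(prompt: str, call_model: Callable[[str], str]) -> str:
    """Redact the prompt, then forward it to the injected model client."""
    return call_model(redact(prompt))
```

Applying `redact` at ingestion time as well, before documents are embedded, would correspond to keeping PII out of the application's own ecosystem rather than relying on the gateway alone.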
[26:51]
Host
Yeah. Basically shift that problem left before it enters the model.
[26:54]
Parth Chothani
Exactly.
[26:55]
Host
You know, in terms of building the system, like what were some of the biggest sort of technical hurdles that you had to work through?
[27:02]
Parth Chothani
Yeah. There are many different angles to where we struggled in the beginning. One was obviously hallucination at the start, where the bot was just spitting out things that were sometimes plainly wrong. So there was a lot of prompt-based evaluation that we had to do. There is also the UI experience, which we thought very deeply about, because there were other solutions, the ones Glean provided, for example, and those were very individual-access driven and needed approvals from users even to see the answers in channels. We didn't want that kind of experience because we wanted a frictionless experience. So the experience was definitely part of it. Then, when we started developing, there was no industry standard for how to evaluate gen AI apps. So in building the feedback loop system, for example, we had to come up with methodologies for how we can even compute and say we are saving time for the company and users. There was some methodology we had to develop internally to figure out how to even say there are productivity gains here. And Eduards can speak more about the eval part, which he's driving: the whole evaluation of how to showcase what the problem is with your documentation. That was a unique thing that we had to brainstorm. And the UI, the product we built around it to support this kind of monitoring, was a very new thing as well. There's no industry precedent as such for how other people have done it. So there were a lot of these new things we had to maneuver, and we were also working in a very small team, pretty much a two-to-three-person team, so we were very short on people to try something like this. And the other challenge was how to platformize this kind of stuff.
Not only to prove that this works well, but how to platformize it in a way that benefits a lot of other parts of the company, and make sure that people can leverage this fast enough and show gains. So the speed of execution, the accuracy, the UI experience, the monitoring, and working in a very small team: all of these were different challenges we had to maneuver throughout to deliver something here.
[29:19]
Host
Yeah. And just to jump in for a second, I think one of the challenges that probably anybody building an AI or gen AI application like this today is facing is that even if you have deep expertise in ML, very few people have 10,000 hours of experience building these types of applications. Right. So there is a lot of net new ground to figure out, and you can't necessarily draw on your 10 or 20 years of engineering experience where you've seen this problem a hundred times before.
[29:51]
Eduard Sidorovich
I think one of the challenges was maybe not the UI but the UX: how to make it scalable so that everyone can create their own GENIE and have it specifically tuned for them. That took quite a lot of, let's say, design thinking. And another problem, not a technical one, is expectation management. ChatGPT works well, and everyone has this miracle experience. Right. You go to another help desk channel and it seems to perform very well; you feel that it performs well because you don't know much of the context. But when you run it on your own documentation, it's like, oh no, it doesn't work as well as you expected. You look at why, and typically it's maybe just because the documentation is not up to date. So it was challenging to explain that; in machine learning we say garbage in, garbage out.
[30:54]
Host
Yeah. I mean, this goes back to the data quality problem. If your data is bad to begin with, or some portion of it is bad to begin with, what can you expect? The model can only do so much. It's not going to fix your data problem for you.
[31:10]
Eduard Sidorovich
Yeah, yeah.
[31:11]
Parth Chothani
And also, to circle back to the question you were talking about, the lack of experience in building this kind of thing: I think there are some parallels that I still sense. Yes, nobody had experience in this technology. But inside the small team, we were trying to be scrappy and, at the same time, speedy in execution, and we obviously had to balance security. Those were the three angles we tried to cover, and I felt that pretty much most new projects are like that: you want to be scrappy, you want to show something, but you also have to be mindful, because we are in a bigger company, not a smaller company where you can afford to make mistakes. And this is a public company. So, drawing from our previous experiences, we tried to keep these principles in mind, and I think they helped guide us. While we didn't know the nuances of the technology, I felt the basics of software engineering were still in place as our guiding light as we delivered something.
[33:53]
Host
Do you feel, though, that with building this type of application, where there is a certain amount of non-determinism involved because of the stochastic nature of the models themselves, it requires a bit of a mindset shift? When you're engineering in that way versus traditional application development, where things are very deterministic, you can't rely on the same approach: in traditional development, if the output's not what you expect, you can trace it back to a bug in the program that you put in there.
[34:24]
Parth Chothani
That non-deterministic aspect definitely did throw us off, and we even had to build things into our experience and UIs that explicitly call out: hey, don't take these answers at their word; make sure you evaluate them. Right. That non-determinism is definitely one of the things that makes this kind of product building so challenging, though I felt that aspect has also changed as the models have become better and as we have learned how to restrict the prompts. We also started doing citations, and that has led to more trust in what we are now saying, versus being so open-ended that you just don't know whether what it's saying is true or not. So with the evaluation work that Eduards led and built, a lot of that is starting to come together now, and it has become more deterministic: we feel there is more control over what the system is generating now versus what there was before.
[35:30]
Host
Yeah, got it.
[35:32]
Eduard Sidorovich
I think we kind of built the muscle when the models were less deterministic. And now, with all this progression of newer and newer models, we are healthily skeptical, which is, I guess, good.
[35:46]
Host
Yeah, in the two years or so that I've been building on large language models, I've definitely seen a significant improvement in the reliability of their performance and answers. The problems haven't completely gone away, but it's a lot better than it was, I guess, two years ago. I think they've addressed a lot of these challenges.
[36:07]
Eduard Sidorovich
Yeah. I think it's also a good thing that LLMs are becoming cheaper. Right. Before, you'd make one LLM call and say, okay, that's enough. Now you can make validation calls, two or three times, to validate the answer. So you can make it artificially more deterministic, because, first of all, the models have become better, and they have become cheaper.
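The "validate two or three times" idea can be sketched as a simple majority vote over repeated validation calls. The function below is a hypothetical illustration: `generate` and `validate` stand in for LLM calls and are injected as plain callables, and the majority threshold is my assumption rather than anything stated in the episode.

```python
from typing import Callable, Optional


def validated_answer(
    question: str,
    generate: Callable[[str], str],
    validate: Callable[[str, str], bool],
    votes: int = 3,
) -> Optional[str]:
    """Generate one answer, then keep it only if most validation calls approve."""
    answer = generate(question)
    # Each validate() call is assumed to be an independent (stochastic) LLM check.
    approvals = sum(1 for _ in range(votes) if validate(question, answer))
    return answer if approvals * 2 > votes else None
```

Returning `None` leaves the caller free to fall back, for example by escalating to a human on-call instead of posting a doubtful answer.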
[36:30]
Host
Yeah, that helps a lot with applying some of these basic patterns around reflection, going through a series of refinement iterations so that you can actually get a much better response, and the validation that you mentioned, especially if you're expecting a certain type of output. And then of course there are all the things happening now with agents, where you can bring in tools to help evaluate or request data as needed. You mentioned productivity gains earlier, and how it was important to be able to demonstrate that this project is worth the time investment, and presumably worth the compute resources that you're putting into it, the token costs and so on. So what were the impact and productivity gains that you saw?
[37:10]
Parth Chothani
Yeah, as of when we were publishing the blog, we had been able to roll the bot out to more than 150 channels, right. And it has answered more than 70,000 questions. We've seen around a 48% helpfulness rate, which is a mix of the questions that the bot auto-resolved and the ones where the bot actually helped point the user in the right direction. From that perspective, when we did the math, we estimated roughly 13k engineering hours saved across the company so far. And as I was saying, we had to be creative here to even figure out how to measure these kinds of things.
[37:50]
Host
How did you figure that out?
[37:52]
Parth Chothani
First, as Eduards was previously mentioning, we have these emojis that people react with. So that is something we did. We also did some other things to gain more data. With some of the partners we were working with, we wanted a higher rate of feedback, because, like with Google search, not many people leave feedback on accuracy; users just want answers, and they don't like to leave feedback. That's something we have seen. I've personally observed in my own experience that when working with any customer support, at an airline or anything, I just never want to leave feedback. It feels like a waste of my time, unless it's negative feedback. That's what we found. So for some channels and some partners, we enforced the feedback, because that gave us more confidence about whether the bot was really even performing well. That was one of the features we initially built. Another experience we built, with some teams, was an experiment where the bot is always the first level of resolution and the on-calls only come in when the customer says, I want to escalate to on-call. That was another experience we built to validate how useful the bot is. So this mix of different experiences, plus us doing some creative math to incorporate these feedback emojis, convert them into engineering hours, and determine how much we are actually saving the company, is how we came up with some of this math.
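As a back-of-envelope illustration of this kind of math, the sketch below does two things. First, it takes the approximate figures quoted in the conversation (70,000 questions, a 48% helpfulness rate, 13k hours) and derives the average time per helpful answer that they imply. Second, it shows one possible way to convert feedback counts into hours. The per-reaction minute weights and the function name are hypothetical; the episode does not disclose the actual methodology.

```python
# Approximate figures quoted in the conversation.
QUESTIONS_ANSWERED = 70_000
HELPFULNESS_RATE = 0.48
HOURS_SAVED = 13_000

# Derived arithmetic, not a number stated by the guests.
helpful_answers = round(QUESTIONS_ANSWERED * HELPFULNESS_RATE)
implied_minutes_per_helpful_answer = HOURS_SAVED * 60 / helpful_answers


def estimate_hours_saved(
    auto_resolved: int,
    pointed_in_right_direction: int,
    minutes_if_resolved: float,
    minutes_if_pointed: float,
) -> float:
    """Convert feedback counts into hours using assumed minutes saved per outcome."""
    total_minutes = (
        auto_resolved * minutes_if_resolved
        + pointed_in_right_direction * minutes_if_pointed
    )
    return total_minutes / 60
```

Under these quoted figures, the implied saving works out to a little over 23 minutes per helpful answer, which is only an average consistent with the totals and says nothing about how individual reactions were weighted.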
[39:23]
Host
Response rate evaluation. That 13,000 hours, over what time frame is that? Is that since the inception of the bot?
[39:30]
Parth Chothani
Roughly, I would say a year plus. Yeah.
[39:34]
Host
Okay, so that's a pretty substantial amount of time saved.
[39:38]
Eduard Sidorovich
I mean, with the adoption, it's not like everyone is onboarded, right? You have to come and onboard yourself. So I think most of the weight was also in the latest months.
[39:49]
Host
What's next for this project? Are you continuing to invest in this? Like, what are you looking to do with this?
[39:55]
Parth Chothani
Yeah, definitely. What we have seen is that expectations have completely shifted. The answers we were giving six months back, and what was acceptable then, have completely shifted. Users are expecting much more. What was a helpful answer six months back no longer seems like a helpful answer. So we are very much thinking about taking this and making it a V2, where we can have a very high level of accuracy and work with substantial partners. That way we can bring this up to the next level of expectations that people have of the bot. So that's an ongoing investment for sure.
[40:33]
Host
As we start to wrap up, is there anything else you'd like to share overall?
[40:38]
Parth Chothani
For anybody building gen AI apps out there: this landscape is definitely evolving extremely fast. It changes literally in days, not weeks, not months. The pace at which technology is changing here is way faster than any other technology that I've ever worked with in my career so far. So I think people just have to be open to the fact that whatever we build might be thrown away in a week or two. Being open about that keeps us from feeling frustrated, because there were times when we were feeling like, hey, what have we built? Do we have to throw away everything? That was a question we would get a lot. So I think being open about the pace of change and being open to experimentation is a healthy mindset for the gen AI world, at least, I feel.
[41:27]
Host
Yeah, I would think you would have to. Ideally, you'd factor that a little bit into your design as well, so that you have flexibility in the architecture to swap models in and out as they improve, or other components where you might be able to squeeze out a little extra performance by going through an additional cycle of inference or something like that.
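One way to build in that flexibility is to have the application depend on a small interface rather than on a specific model. The sketch below is a generic illustration, not GENIE's architecture: `ChatModel`, `EchoModel`, `Copilot`, and the `refine_passes` parameter are all hypothetical names.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Minimal interface the application depends on (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in model used for local testing; echoes the prompt back."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class Copilot:
    """Application logic that works with any object satisfying ChatModel."""

    def __init__(self, model: ChatModel, refine_passes: int = 0) -> None:
        self.model = model
        self.refine_passes = refine_passes

    def answer(self, question: str) -> str:
        draft = self.model.complete(question)
        # Optional extra inference cycles to refine the draft.
        for _ in range(self.refine_passes):
            draft = self.model.complete(f"Improve this answer: {draft}")
        return draft
```

Swapping in a newer model then means passing a different object to `Copilot`, and the extra refinement cycle is a constructor argument rather than a rewrite.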
[41:49]
Eduard Sidorovich
Yeah, I also wanted to encourage people to build these gen AI apps. First of all, it's kind of fun. And second, sometimes it feels frustrating, because you build something and then something very similar gets developed elsewhere, and you feel like it was a waste of time. But at the same time, you built something that you can adapt to your specific case. That was also the case with GENIE: we built the chatbot, and at the same time Glean was coming up with something similar. But because we built something on our own, we can add agentic stuff. If someone posts a log, we can go and check the log, and that's something other solutions cannot do and will not be able to do, at least in the near future. So yeah, I just wanted to encourage people to experiment.
[42:40]
Parth Chothani
And to add to that, I think what Eduards was mentioning is spot on: build unique features, because that's what creates value in the long run. While we had other competitive solutions being built outside by third-party vendors, we focused on trying to be unique with the UI experience and the tools that we allow users to integrate. And I think that probably proved us right in the long run: we're able to customize a lot more things because it's in house, and the experiences can be tuned and changed much faster. So overall, being unique also helps you stand out in the long run.
[43:20]
Host
Yeah, I think even if you were building a company around some sort of AI application today, going deeper might be better than going really wide in general, because a lot of the hyperscaler companies are probably going to address the wide, but you can outcompete them if you go really deep on a particular thing. You can create the best possible, I don't know, medical-device-related AI experience or something like that. And that's probably not going to be something that Amazon or OpenAI is going to put a ton of resources into, versus the generality of what they're trying to solve. Yeah, awesome. Well, Parth and Eduards, thank you so much for being here. This was great.
[44:01]
Eduard Sidorovich
Thank you.
[44:02]
Parth Chothani
Thank you, Sean, for having us. It was really, really nice to have this podcast here.
[44:07]
Host
Cheers.