
Glean is a workplace search and knowledge discovery company that helps organizations find and access information across various internal tools and data sources. Their platform uses AI to provide personalized search results, helping members of an organization retrieve relevant documents, emails, and conversations. The rise of LLM-based agentic reasoning systems now presents new opportunities to build advanced functionality on top of an organization's internal data. Eddie Zhao is a founding engineer at Glean and previously worked at Google. He joined Sean Falconer to discuss the engineering and design considerations around building agentic tooling to enhance productivity and decision making. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
Sean Falconer
Eddie, welcome to the show.
Eddie Zhao
Thanks Sean. Great to be here.
Sean Falconer
Yeah, absolutely. I'm really looking forward to this. So I'm fairly familiar with Glean, as a bunch of my former colleagues at Google have moved on to working at Glean. And I'm curious: how has the original vision of Glean evolved from an enterprise search company to now having reasoning agents as part of the offering?
Eddie Zhao
Yeah, definitely. I'd like to think that our vision actually hasn't changed that much; enterprise search was really just the first way that we could deliver part of that vision. The way I like to think about it is, with enterprise search we were meeting knowledge workers in a little slice of their job to be done, their user journey. They open their computer, they need to do something. Whether they're an engineer writing code, a PM researching other products, or a salesperson prepping for a call, they all have some journey, and we were looking to make that journey more meaningful and easier, and to free them up to do more things. With search it was really about: this person does the work of thinking, what is my high-level goal, what do I need to do? They identify that they need to find some information, and then they go interface with a search product to get it. So you've helped them in that little segment. All we've done as we've evolved from Glean search to the assistant and now to this agent platform is broaden that segment both to the left and to the right. To the left, helping them do a little less work to understand what they need to know for a task, meeting them earlier in that journey. And to the right, helping them get more done after they've found the information. Maybe it's just synthesizing the information on the result page; maybe it's actually starting to help them do whatever they're going to do with that information. So I really see the evolution from enterprise search into where we are today as an extension of the same vision we had: to help knowledge workers everywhere.
Sean Falconer
Yeah. I think one of the unique aspects of Glean, and how you've managed to position yourself in the market, is that the models themselves are very, very smart, but they're really dumb about your data. They know all this general stuff, but they don't really know anything about you and your company, and that's where the challenge is in building actual meaningful applications. I also think that's one of the challenges with some of the other agent offerings in the market: this tool could be amazing, with all this reasoning capability, but where is it getting its data? Glean is already plugged in from the very beginning and has access to all the really rich information that you want the agents to have access to.
Eddie Zhao
Right, right. And I think that's an important nuance for people to understand as the underlying LLMs that power these agentic systems get better. Them getting better is useful for everyone, but it's important to define what they get better at. They're not getting better at knowing your company's knowledge; that's a binary thing: have they seen it or have they not? So you're right that figuring out how to inject, we call it context injection, putting enterprise context at the right places throughout an agentic system's execution, is really important to actually getting them to work.
Sean Falconer
Yeah. Because I've seen that you can have a really powerful model, but if you don't have good data, it doesn't really help. You can have a lower-powered model that has access to the right data at the right time outperform the best model in the world. So it really comes down to, I think, the central challenge for most companies building any kind of AI experience today: how do I liberate the data and find the subset of data to put into that context window, so that I can steer the model in the correct direction to generate a meaningful response?
Eddie Zhao
That's right. This notion of world knowledge, and then company knowledge as a layer on top, is something, tying back to your first question, that we've been thinking about in the search world for a while. When someone comes to a product in their work life, a search bar, a chat interface, they're not only in the company context; they're an employee of this company, but they've built on top of a foundation of world knowledge. And so we have to build our systems the same way. Even in the search setting, if you think about embedding models used for retrieval, you run into the same dynamic: have these models seen this new data, or this data that's specific to the company? If they haven't, they're not going to perform well. So how do you layer on that new understanding? How do you adjust these models to account for it? The same thing goes for all the generative products and use cases in the agent world. How do you acknowledge that there's a prior, if you will, in that world knowledge, but augment it with what's there? It's a really hard problem. I'm not going to claim we've fully solved it by any means, but I think we're really well positioned to keep making progress and building things that give value to people.
Sean Falconer
So I think it's worth slowing down for a second and actually explaining, from your perspective, what exactly is a reasoning agent? I think there's a lot of variation in terms of what people mean by an agent. I know Hugging Face recently came out with a nice framework describing levels of agentic AI. So how do you define a reasoning agent?
Eddie Zhao
Yeah, and I want to preface this by saying we're trying not to be incredibly opinionated here. There are a lot of framings and frameworks being developed, and they all have their merits. We've come to our own internal viewpoint on this, and even that continues to evolve, because it does matter. When you talk about a reasoning agent that has access to tools, what are those tools themselves capable of? So one framing is: a reasoning agent is something that can, given a set of tools, formulate a plan to satisfy an input and then go and execute those tools. But then there start to be many different questions and extensions. Can it use these tools in an iterative fashion? What is the granularity of these tools? Are the tools themselves other agents, in the sense that they also have the ability to call out further into other systems, or are they more static? So I know this is kind of a cop-out answer, but we're trying not to draw a hard line around this, while still figuring out how to keep our framework flexible enough to adapt to how the industry is thinking about agents, the agents people might build in the open source world or elsewhere, and how we can make sure they can at least integrate with Glean as the dust continues to settle.
Sean Falconer
So when customers come to you and ask for an explanation of how agents are different from RAG, or how agents are different from some sort of AI application that follows a workflow, what is the explanation there?
Eddie Zhao
Yeah. For the former, how it's different from RAG, the main point is that we see agents as an extension of RAG, in the sense that the only tool a RAG agent, quote unquote, has access to, or the only flavor of tool, is some sort of read or retrieve tool. It can issue a search, whether against a search engine like Glean or a federated search engine, and then it generates: retrieval-augmented generation, generating based off of what was retrieved. Agents are simply an extension of that, where the content being generated may not be the response to the user; it might be the next step in a plan. You can sequence this out so it's not just a retrieve step and then a generate step, but something more extensible: perhaps multiple retrieve steps, perhaps a generation that feeds into the next step, which is then another retrieval. It's also providing access to more actions that are not just retrieve-and-read actions. You could be executing code, or interfacing with, we call them write tools, actually writing out into the world. Those are some ways we distinguish agents from core RAG. And the second one you mentioned, what was the second distinction you were looking for?
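The distinction Eddie draws here can be sketched in a few lines of Python. This is a minimal illustration, not Glean's implementation; all function names (`retrieve`, `generate`, `plan_next`) are hypothetical stand-ins:

```python
# RAG is a single retrieve-then-generate pass. An agent treats each
# generation as a possible next step: another retrieval, a write
# action, or the final answer.

def run_rag(query, retrieve, generate):
    """Classic RAG: one read tool, one generation."""
    context = retrieve(query)
    return generate(query, context)

def run_agent(query, tools, plan_next, max_steps=5):
    """Agent loop: the model's output selects the next tool until it answers."""
    history = []
    for _ in range(max_steps):
        step = plan_next(query, history)      # LLM decides the next step
        if step["action"] == "answer":
            return step["content"]
        result = tools[step["action"]](step["content"])
        history.append((step, result))
    return None  # step budget exhausted without a final answer
```

In this framing, RAG is just the degenerate agent whose only tool is `retrieve` and whose loop runs exactly once.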
Sean Falconer
Essentially like an AI application that's following some sort of workflow.
Eddie Zhao
Yeah. In terms of the word workflow, the distinction from what people have been calling agents, and this is where terminology does get a little blurry, is that we like to think about things in terms of static and dynamic. A workflow might represent something that's more of a fixed execution. It might be multiple steps, and there might still be LLM calls that allow for some variability, but the actual execution flow of which steps will be executed is fixed; the system itself, quote unquote, can't modify that graph. Once you introduce some level of dynamism, where for a given query you're actually constructing a graph, or perhaps iteratively constructing it, that becomes the distinction between a workflow, or what people are calling a workflow, and something more dynamic, an agent. And of course you can take graphs that are constructed for a given query and freeze them into workflows that are repeatable. But that's the distinction between these workflows and more dynamic agents.
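The static/dynamic split, including the "freeze a per-query graph into a workflow" idea at the end, can be sketched roughly as follows. Names and the planner interface are illustrative assumptions, not a real API:

```python
# A workflow executes a fixed step list; a dynamic agent constructs the
# step list per query; and a constructed plan can be "frozen" back into
# a repeatable static workflow.

def run_workflow(steps, query):
    """Static: the sequence of steps is fixed ahead of time."""
    state = query
    for step in steps:
        state = step(state)
    return state

def run_dynamic(plan, step_library, query):
    """Dynamic: an LLM-style planner picks the steps for this query."""
    chosen = plan(query)                      # e.g. ["retrieve", "summarize"]
    steps = [step_library[name] for name in chosen]
    return run_workflow(steps, query), chosen

def freeze(chosen, step_library):
    """Freeze a per-query plan into a reusable static workflow."""
    steps = [step_library[name] for name in chosen]
    return lambda query: run_workflow(steps, query)
```

The key property is who owns the `steps` list: in a workflow it is fixed by the builder; in an agent it comes out of `plan(query)` at request time.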
Sean Falconer
Right. So it really boils down to: what is the control logic? Is it some predetermined, pre-programmed set of steps, or are you allowing the brain of the agent to determine what that sequence of steps is?
Eddie Zhao
That's right. And they definitely both have their place. It's sort of high risk, high reward. The more levers you give the system to create something flexible, the more powerful it can be; you can hope it generalizes to new use cases and new queries, but at the same time it can be more unpredictable. If you want something more predictable, something frozen, you certainly can bias toward freezing that control logic, as you said. And of course you can use a dynamic system to help you build the initial graph that you want to freeze, iterate on it, and then freeze it and use it indefinitely. That's another path you could take.
Sean Falconer
Do you see most people using some sort of hybrid approach, where some part of the workflow might be agentic, with a more dynamic set of reflection steps or something like that, while other parts are a more predetermined set of orchestration steps?
Eddie Zhao
Yeah, I think we do see both. Everyone wants a fully looping dynamic system, but they often find, once they deploy it, that it's a lot harder to get a handle on. So people end up building more constraints back around their system and freezing different parts of it. But it really depends on the use case and the stakes; if you have something high stakes, you can't afford for that execution graph to change. And for us, tying this back to the enterprise case, what's interesting is when we talk about a concept like reflection, which is: given what I've executed so far, you ask this central brain, this LLM decision-making logic, what should I do next? The really tricky thing about many of these enterprise use cases, not all, but many, is the same dynamic between world knowledge and company knowledge. It doesn't know what it doesn't know. This problem is present even in RAG, in a simple retrieve, then reflect, then generate flow. If I'm asking some question, it could be as simple as about the holiday calendar my company has for this year, everything depends on what the retrieval engine does, whether it retrieves the right things or not. How can you ask that reflection component to do the right thing afterwards if it's beholden to the performance of that previous upstream system? You present it: here's the query that was planned, here are some results, what should you do next? And you could imagine a case where the search engine didn't return the right results, and it decides to respond anyway. So reflection, I think, is a very hard task to do in the enterprise context, because these agents generally don't know what they don't know.
Sean Falconer
Right.
Eddie Zhao
And so that's a very careful context-injection problem to think about and work on.
Sean Falconer
How do you deal with unbounded execution, even outside of reflection? Unless you're putting a hard limit on how many times you can loop, how do you manage the fact that the execution cycle might be unbounded?
Eddie Zhao
Yeah, as simple as it sounds, the first thing you mentioned is the easiest way to do it: you allow a fixed number of executions at various stages. It could be overall, you can only create an execution graph with this number of steps, or within a given step you can only iterate this many times. Those are some good guardrails to put in. It also depends on the complexity. There's probably a neat analog here: a week or two ago there was a paper where people benchmarked a lot of these thinking models, the LLMs themselves, measuring their performance as the number of thinking tokens increased, and they found a sharp decrease in performance once the thinking got too long. The intuitive analog is that it's rabbit-holing, it's spinning. You could probably extend the same thing to agentic systems: if you're looping too much, it's probably unlikely you're going to reach a good outcome. And I think there's a careful balance there. There's a lot of value people can get from simpler agents that execute a much smaller number of steps, we're talking order of single digits rather than dozens or hundreds. Playing in that smaller space still provides a lot of value, and you can add some simple upper bounds depending on the use case.
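The two guardrails Eddie mentions, an overall step cap and a per-loop iteration cap, can be sketched as a small budget object. This is an illustrative pattern, not any particular framework's API:

```python
# Cap the total number of executed steps and the number of iterations of
# any single loop; the agent runner calls allow() before each step and
# stops once either budget is exhausted.

class ExecutionBudget:
    def __init__(self, max_total_steps=8, max_loop_iterations=3):
        self.max_total_steps = max_total_steps
        self.max_loop_iterations = max_loop_iterations
        self.total_steps = 0
        self.loop_counts = {}

    def allow(self, loop_id=None):
        """Return True if another step is permitted, charging the budget."""
        if self.total_steps >= self.max_total_steps:
            return False
        if loop_id is not None:
            count = self.loop_counts.get(loop_id, 0)
            if count >= self.max_loop_iterations:
                return False
            self.loop_counts[loop_id] = count + 1
        self.total_steps += 1
        return True
```

Keeping the defaults in the single-digit range matches the observation above that small agents are both cheaper and less likely to spin.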
Sean Falconer
Does it help to break these problems up? Where you want an agent to operate on and solve some task, instead of using one monolithic agent to perform it, does it help to split it into a multi-agent system?
Eddie Zhao
Totally, totally. And this goes back to how I'm trying not to be too opinionated here, but, you know, personally I do have some opinions. For us, we're also thinking about how you scale: how do you both build agents internally and help others build agents in a way that isn't bottlenecked on a monolithic system? Again, drawing from ML systems and first principles here: a lot of the time, ML systems in large products become too monolithic, and there's a downside, because then your team of 15 people can only all work on this one model, and everyone's just trying to work on this one model. Your actual rate of improvement is lower than if you had factored it into multiple systems where multiple people could work in parallel. You're giving up that short-term, quote unquote, gain for a medium- and long-term gain, but you'll reach a better spot. And I think the same thing applies here. Our internal approach right now is: yes, you do have a central agent, but the tools, and I use that term a little loosely, that it's given access to may be other agents. If you can delegate more to those other agents, it's not just about what you can do in the short term; someone focused full time on making a sub-agent successful is always going to have a better time than you trying to solve all these things at the central level. So I do think there's a huge role for delegation here, and it's about figuring out the right interface between that central agent and these delegated agents.
Sean Falconer
Yeah. I think one of the things I see people missing as they dive into the space, as exciting as this stuff is, is that it carries all the same challenges you have running any large distributed system. And the scale problems aren't simply about infrastructure scale; they're also about how you scale the teams. How do I design this in such a way that I can loosely couple these things, treat them like microservices that can operate independently, even leverage different models, independent of the other systems? Then the teams don't have to know exactly what's happening within each particular team, and you avoid hard, fast dependencies between them.
Eddie Zhao
Totally. I mean, I know this podcast is Software Engineering Daily, so experienced software engineers everywhere, I'm sure, have been punching the air seeing LLM agent systems designed without thinking about core engineering principles, all the things you said, which are still incredibly relevant. I think the ability of these systems to show you something really awesome in the short term and give a good proof of concept has made it easier to forget good engineering principles when designing for the medium and long term. And obviously the rate of change and the pace have made that hard too.
Sean Falconer
Yeah, absolutely. And going back to a singular agent: can you break down what the components, the anatomy, of an agent are, and how each is used to come up with a plan, perform reasoning steps, and potentially execute tools in order to solve a specific problem?
Eddie Zhao
Yeah, sure. The way we're thinking about it is: we have a central system that has access to a set of tools, and its first step is to develop a first pass at a strategy or a plan using those tools. How it does so is important. You can obviously give descriptions of the tools, or use function calling, whatever that interface may be. But as with all LLMs, having good in-context examples is really important. This goes back to the team scaling component: how do you influence the central system? Well, you can have teams building golden in-context examples that say: I want to make sure that when this central agent sees a query like the one I care about, it can build the graph that represents what I want it to look like. So the first part is assembling the input to this main call, what we call the strategize call, composed of the available tools but also curated, important examples that demonstrate how to use those tools and how to synthesize them in this multi-step way. Then the output of this system is a graph that can be partially or fully executed. But again, that graph has some level of delegation. For each of these tools, the way we're thinking about it, you can defer to them more and more. I can say: for this tool, I have an objective, and I'm not telling you exactly how to accomplish it, because that is the tool's, the sub-agent's, domain: to figure out how to do it. So you can distribute the cognitive load, if you will, over the sub-agents as well. And just as you have different teams working on sub-agents, different folks can work on these sub-agents and make them work well.
And then, yeah, you can basically execute your graph and come back to the central system as needed. That's the rough flow we're thinking about right now.
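The input assembly Eddie describes for the strategize call, tool descriptions plus curated golden examples ahead of the query, can be sketched as simple prompt construction. The structure and all names here are hypothetical, not Glean's actual prompt format:

```python
# Assemble a planning prompt from tool descriptions and curated
# multi-step "golden" examples; the planner LLM completes the final
# "Plan:" line with a graph of tool invocations.

def build_strategize_prompt(query, tools, golden_examples):
    tool_lines = [f"- {name}: {desc}" for name, desc in tools.items()]
    example_lines = [
        f"Q: {ex['query']}\nPlan: {' -> '.join(ex['plan'])}"
        for ex in golden_examples
    ]
    return "\n\n".join([
        "Available tools:\n" + "\n".join(tool_lines),
        "Examples:\n" + "\n\n".join(example_lines),
        f"Query: {query}\nPlan:",
    ])
```

The point of the sketch is the ownership model: the teams that own sub-agents contribute the golden examples, which is how they steer the central planner without touching it.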
This episode is sponsored by Mailtrap, an email platform developers love. Go for high deliverability, industry-best analytics, and live 24/7 support. Get 20% off all plans with our promo code Sedaily. Check the show notes for more information.
Sean Falconer
Is there a limit to the number of tools that any one agent can handle?
Eddie Zhao
Definitely, I think there fundamentally is. Even if we're not talking about hundreds of tools, even just a dozen, the number of permutations for chaining them together to accomplish some set of things is obviously exponential. So you do need, and this is the drum we beat, a lot of it to come back to search. For us it's thinking about how we can make sense of all these permutations, especially the in-context examples of how to use these tools in sequence or in parallel, and take that eventual set of tens, hundreds, thousands, millions of candidates and search over them down to something much smaller to guide the LLM. You almost have a graph search problem, or just some sort of item search problem, so that it becomes a tractable set of things. You've given an assist to the central agent; it doesn't have to fully reason over all the possibilities. You can factor that part of the system out into a search problem, do that search separately, and then say: here are 5, 10, 20 different tools, or combinations of tools, that are probably relevant for this, rather than the entirety of all the combinations.
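The "search over the tools before the agent reasons" idea can be shown with a toy shortlist function. The word-overlap scorer is a deliberately crude stand-in for whatever retrieval model actually ranks tools; everything here is an illustrative assumption:

```python
# Score each candidate tool against the query and hand only the top-k
# to the central agent, instead of letting it reason over every
# permutation of the full tool set.

def shortlist_tools(query, candidates, k=3):
    """candidates: {tool_name: description}; returns the top-k names."""
    query_words = set(query.lower().split())
    def score(item):
        name, desc = item
        return len(query_words & set(desc.lower().split()))
    ranked = sorted(candidates.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]
```

In a real system the same pattern would rank curated tool combinations (the golden multi-step examples) as well as individual tools, and the scorer would be a learned retrieval model rather than word overlap.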
Sean Falconer
So the idea there is: I have some context about what I need, maybe data I need to gather, and I can perform that search first, as sort of the first line of defense for the main agent operating here. From there I know, okay, Slack has some of this information, Google Docs has some of this information, a Confluence page has some of this information, and I can limit the set of tools to those three and then go execute against them.
Eddie Zhao
Yeah, that's a good way to think about it. And the other dimension on this, the way we think about it, is the company-specific dimension. Mostly, companies do work and store information in roughly the same way, but that very quickly starts to break down; companies develop their own ways of storing information. That's what we've been building with search. So even the algorithms you would use, the nuances of which tools or workflows are relevant for company X for a given query, might be different for company Y. The way you build that search algorithm can't just do query understanding the same way everywhere. When Confluent users are asking a certain kind of query, the workflows that should be surfaced in that search setting might be different from when a Glean employee asks it. And that goes back to a lot of what we've been doing at Glean, not just language adaptation but also these other signals that are important to understand. Sean is working in these different places, so when he asks a query, that search algorithm should be personalized to him, and it may return a different set of things, a different set of sub-agents to reason over, than for someone else at Confluent.
Sean Falconer
How do you persist the identity information across all these particular endpoints?
Eddie Zhao
I mean, in the Glean context, a given request comes in and obviously has identity information associated with it, so we can make permissioned calls wherever we need to. At the agent level, I guess your question is, if we draw the analog back to documents, it's obvious there that documents have authors and people interacting with them. But now, if the items you're searching over are other agents or workflows, maybe your question is: how do I know who is associated with them, and how does identity get associated there?
Sean Falconer
Right, right. If I'm the user interacting with the agent, and let's say there's some sort of UI to this agent, then presumably the agent needs to know who I am in the organization, so that when it makes a tool call, my identity information can be factored into that call. It then knows that Sean only has access to this subset of documents within the organization.
Eddie Zhao
Yeah, in that sense it's no different from how Glean manages identity throughout the whole product; we can re-leverage our entire identity infrastructure and platform. But I think the interesting question is on the modeling front: if Sean's colleague has created an agent, should that agent be more likely to be relevant to Sean, depending on how closely they work together, whether they're working in the same places, these same kinds of signals? We can use that identity metadata in that implicit way; the explicit stuff is a given. You can only have access to things you have access to, and certain tools can only execute if you have access to them; it's not just documents. So we're able to build on all the same Glean infrastructure we have to make that work.
Sean Falconer
Given that you have a multi-agent system, you have a lot of these internal and external dependencies, and presumably it's not all running on one server. On top of that, you have these unbounded execution plans that might contain cycles, with a stochastic model at the heart of it acting as the brain. How do you manage the debugging process? How do you figure out what went wrong when errors happen in some of these dynamically generated execution workflows?
Eddie Zhao
It's a really, really good question. And it's another thing where, as the use cases and the tooling evolve in parallel, we're all trying to build the right tools to give ourselves this ability. For us, it's similar to reasoning about the life of any query, if you will, even in the search context. Most of the things we've been building are compound AI systems; they're composed together. So you need to be able to say, at a high level: where in the flow, at the input and output of each system, do I think the breakdown is, and do that trace. People talk about reasoning traces; this is similar. So for us, when we talk about the graph, a lot of it is: where in this graph was the output not what was desired given the input? And you can start from the back or from the beginning, either way. But I'm curious to dig into a bit what you said about internal and external dependencies. I'm not sure I fully understood what you meant there.
Sean Falconer
Well, from the agent's perspective itself. This may not be relevant in the context of Glean, but just thinking about agents in general: I could have internal knowledge systems I need to tap into, my Google Docs or something like that, but I might also need to factor in external knowledge systems, some website or something like that, where I'm actually searching beyond the bounds of my company's data.
Eddie Zhao
That's definitely relevant to us, by the way. Obviously it's a one-way street in that internal stuff is not going out, but external stuff is definitely coming in. Like I mentioned, for so many queries that come into agents, you need that blend, like you're saying.
Sean Falconer
Yeah. And then you also have short-term memory and long-term memory, which could be different systems as well, I guess. How are you managing that? Is your long-term memory using a vector store representation?
Eddie Zhao
Yeah. The way to think about long-term memory is, again, you can model it as a reliance on your context injection. How do I know what the relevant plan for Eddie is for this query? There's short-term information I can make use of, like what his last query was, or what work he was doing before this. But that extends further: the signals we use for search are in many ways closer to long-term memory. What was Eddie doing a month ago? What team was he working with? These things can be injected into different pieces, different prompts at different stages in this graph, via any method. They could be vector retrieved, they could be lexically retrieved. The more important thing is that they are retrieved in some fashion and injected at the right time.
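Memory as context injection, short-term signals passed along directly and long-term signals retrieved by whatever means and injected into the prompt, can be sketched as below. The function name, prompt layout, and retrieval interface are all illustrative assumptions:

```python
# Short-term memory: the last few interactions, carried along directly.
# Long-term memory: signals retrieved per user and query (vector,
# lexical, or otherwise) and injected into the prompt at this stage.

def inject_memory(prompt, recent_queries, retrieve_long_term, user, query):
    short_term = "\n".join(recent_queries[-3:])       # last few interactions
    long_term = "\n".join(retrieve_long_term(user, query))
    return (
        f"Recent activity:\n{short_term}\n\n"
        f"Relevant history:\n{long_term}\n\n"
        f"{prompt}"
    )
```

The shape matters more than the storage backend: as Eddie says, whether the history comes from a vector store or a lexical index, what counts is that it is retrieved and injected at the right stage of the graph.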
Sean Falconer
In terms of the outputs these agents are generating, how do you control for incorrect information and hallucinations, and put guardrails around them? What sort of post-processing steps exist to evaluate a response and make sure it's actually a valuable response?
Eddie Zhao
Yeah, very good question and a very hard question to answer. I think for us, our best bet is: look, these systems are unbounded, right? And I think the biggest delusion some folks have is, I can instruct things not to happen and they won't happen. Instruction following, at least, is well defined. You can measure and rate how well an instruction is followed. But once instruction starts to bleed into knowledge, again, going back to this, you don't know what you don't know. You tell the LLM never to lie, and it's not lying with the context that it's given. Right? And so this does relate a lot to RAG concepts: is something correct conditional on the context that it's given, or is it correct independent of it? As an end to end system, the user doesn't care. The user needs to make sure they're not getting false information. But from an ML engineer's perspective of diagnosing, it does matter what part of the system is breaking down. In terms of the guardrails, there's some low hanging fruit that can be done on the fly, but ultimately on the fly is a hard problem. You're asking a system to reason about itself: hey, is what I just emitted correct or incorrect? Going back to this knowledge problem, that's a very hard thing to do. You can obviously build in online judges for tasks that are a bit more narrow, right? Like, was the output that was just created ungrounded on the context that it was given? That's more tractable. But even if it isn't ungrounded, that doesn't mean that it's correct, again given that context problem. So a lot of what we try to do is measure more things offline in batch. You can run all kinds of processes to generate things offline, run them through your system, generate things you know to be correct, and make sure that your system can achieve them.
Or, you know, create adversarial sets where you say, hey, I'm going to make it look like this is the case, and measure whether my system backs off correctly, or whatever it might be. And so you kind of come at it from that side and you get a measure of, okay, how good are we at this and how can we continue improving it? And you pair that with what's happening at request time online. But it's really hard. It's sort of an unsolved problem to say, for every input request coming in, do I know if it was exactly correct or not? I can put some guardrails around that. But the strategy we've been coming at it with is larger scale measurement from the other side.
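The offline batch approach Eddie outlines can be sketched as a small groundedness judge scored against labeled examples. The token-overlap check here is a crude stand-in for a real judge (which would typically be an LLM or trained model); the threshold and data are made up for illustration.

```python
def is_grounded(context: str, output: str, threshold: float = 0.5) -> bool:
    """Crude proxy judge: treat an output as grounded if enough of its
    tokens appear in the context. A production judge would be far more
    sophisticated, but the offline scoring loop is the same shape."""
    ctx_tokens = set(context.lower().split())
    out_tokens = set(output.lower().split())
    if not out_tokens:
        return True
    overlap = len(ctx_tokens & out_tokens) / len(out_tokens)
    return overlap >= threshold

# Offline eval set: (context, output, known-correct label) triples,
# built from things you know to be grounded or not.
EVAL_SET = [
    ("the launch is planned for june", "launch planned for june", True),
    ("the launch is planned for june", "revenue grew forty percent", False),
]

def judge_accuracy(eval_set) -> float:
    """Fraction of labeled examples the judge classifies correctly."""
    correct = sum(is_grounded(c, o) == label for c, o, label in eval_set)
    return correct / len(eval_set)

print(judge_accuracy(EVAL_SET))
```

Note this only measures groundedness relative to context; as Eddie says, a grounded answer can still be wrong if the context itself is, which is why the batch set needs examples you independently know to be correct.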
Sean Falconer
In terms of all the pieces that make this system possible, from the tracing, to the offline batch processing to evaluate responses, to actually building and deploying the agents and the way they communicate: how much of that is built from scratch versus relying on existing tools?
Eddie Zhao
I would say it's a blend. We're constantly evaluating and looking at parts of our system, even ones that were built earlier that now have a great open source alternative, and revisiting whether we can rebuild on top of that. It's sort of a trope that engineers always want to reinvent the wheel, but a great engineer will never do that, because they know they can create more impact building on top of what others have built. So for us it is a blend. Our principle is to try to reuse where possible. We don't always do that perfectly. And especially in an enterprise environment, there are other considerations. If we care deeply about efficiency, for example, and performance, are we sure that the frameworks we're using are pushing the boundary there? Can we optimize within that framework, or do we need to roll something ourselves? Are there fundamental design decisions in these frameworks that go against our security constraints or our deployment setup? That doesn't come up a lot, but there's this checklist of things we run through to understand, can we sub in a new framework for what we have? But at the end of the day, sometimes we're rolling our own and sometimes we're relying on what's out there.
Sean Falconer
Do you have some sort of eval framework in place to make sure that when you are making changes to, say, the system prompts of some of these things, you're actually generating a better result than you were previously?
Eddie Zhao
Oh, definitely, definitely. That would be crazy if we were just pushing out changes without evaluation.
Sean Falconer
I see some crazy stuff out there.
Eddie Zhao
Totally. Though I shouldn't say that'd be crazy. I know probably most people out there are doing that.
Sean Falconer
There's a lot of putting your finger in the wind and seeing which way it's blowing.
Eddie Zhao
Totally, totally. You know, I've mentioned this before to other folks internally because it's interesting. The team has been building an ML or AI product for a while, right? And so they have the muscle built of what evaluation means. And it's interesting coming from the search side, because you have the search side, the traditional ML side, and now this gen AI product side, right? On the traditional ML side, you have some large scale ML system, you monitor some metric, and your experiment is, hey, I changed the way this model trains, I changed this model architecture. Numbers go up, great. The search world is sometimes that, but also a lot more qualitative. Hey, I'm looking at individual queries, I'm trying to understand which parts of the system are breaking down, what can I change? But I'm always going to run an evaluation, right? And when you run an evaluation, you have some parts that are automated that give you a high level metric, but you're also going to get a qualitative sense. You're going to go look at some queries and do more vibes based evals, if you will. And that was a thing that people started with a lot in the generative world, but it's still really relevant. So a lot of it is about pairing something quantitative and large scale. You can say, hey, I ran my evaluation suite on my prompt change and clearly the metrics went down a lot, so I know there's something to be concerned with. If you run them and they're all neutral, it might still be the right thing to do. You have to rely on, hey, is there enough qualitative evidence here for me to believe that I'm making progress, that I'm improving some of these issues? So all things in balance here, and then layered on top of that is automation.
Once you have a strong enough evaluation signal with enough density, you know, a lot of prompt engineering stuff can then be automated and you can use all kinds of frameworks out there to do that.
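The quantitative-plus-qualitative loop described above can be sketched as a simple comparison of two eval runs: compute the aggregate metric delta, and surface the individual queries that regressed for manual inspection. The queries and scores here are stand-ins for whatever a real eval harness produces.

```python
def compare_runs(baseline: dict, candidate: dict):
    """Compare two eval runs (query -> score in [0, 1]).
    Returns the mean score delta and the queries that regressed,
    which are the ones worth a qualitative look."""
    deltas = {q: candidate[q] - baseline[q] for q in baseline}
    mean_delta = sum(deltas.values()) / len(deltas)
    regressed = sorted(q for q, d in deltas.items() if d < 0)
    return mean_delta, regressed

# Hypothetical scores before and after a prompt change.
baseline = {"q1": 0.8, "q2": 0.6, "q3": 0.9}
candidate = {"q1": 0.9, "q2": 0.4, "q3": 0.9}

mean_delta, to_review = compare_runs(baseline, candidate)
print(round(mean_delta, 3), to_review)  # q2 gets flagged for manual review
```

A near-neutral aggregate with a flagged regression is exactly the case Eddie mentions, where the metric alone cannot tell you whether the change is right and you fall back on qualitative evidence.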
Sean Falconer
In terms of the agent experiences that are available on Glean, I know that there's essentially a no code experience where I can just fill out some forms and create an agent that way. There's also some existing pre built agents, and I know that Glean has apps as well that you can build against the APIs. Are there APIs for building net new agent experiences as well?
Eddie Zhao
I don't want to speak outside of the product roadmap and make a bunch of PMs and/or engineers frustrated that I said the wrong thing here, so I don't know what to commit to. But certainly we want to support a range of builders, from low code all the way to people who understand how to do these things programmatically. I think API definitions are important, and it can be tricky in the generative world to say, hey, what exactly is this API? But in the agent building case there's definitely talk of it. I don't know if it's committed to, but we want every engineer internally to be able to build powerful agents on the same set of tools that we're giving external users. I guess that's our forcing function to make sure whatever we're exposing and the platform we're building is effective, because if engineers with full access, quote unquote, internally can't do it effectively, then we certainly can't expect folks outside of Glean to do that.
Sean Falconer
So how are you dogfooding some of this stuff internally? Are you using some of this technology to like essentially make people more efficient within Glean?
Eddie Zhao
Yeah, I think folks are building internal agents, finding use cases where they're relevant and trying to build those using that same suite of tools from low code to otherwise. And you know, they have different levels of traction. And that's the neat thing about user generated content, right? Like some of them take off, some of them don't. And so our product team is constantly trying to understand, hey, what use cases are really shining through, you know, how much of things that people ask in chat or assistant itself are really agents or workflows that should be abstracted out and sort of made more repeatable and how can we make that happen? So a lot of it is sort of bringing more structure to a lot of existing usage.
Sean Falconer
How should people be thinking about measuring the success of any particular agent that they're using?
Eddie Zhao
Wow, what a loaded question. There are so many different agent use cases, right? I think usage is a decently good barometer. If people are coming back to it, that means they're probably finding value out of it, and over the long term that holds true. You could build an agent that's actually just wrong all the time, and people use it at first, but once they realize that it's wrong, they won't. So as long as you measure on a long enough time horizon, I think that's an effective measure. Usage is always king for any product. When it comes more to success, it's about measuring the outcomes that come from it, and that starts to get a little bit more specific. Are you talking about, hey, here's an AI agent that half of our salespeople use and the other half don't? Are we running an internal A/B test to see how many exceeded their quota by a lot or not? You could start thinking about some of those outcomes, although that becomes, like I mentioned, really use case specific.
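The "usage over a long enough time horizon" barometer can be sketched as a simple week-over-week retention calculation over a hypothetical agent's usage log; the event data and function name are illustrative only.

```python
from collections import defaultdict

# (user, week) usage events for one agent -- made-up data for illustration.
EVENTS = [("ana", 1), ("ana", 2), ("bo", 1), ("bo", 2), ("cy", 1)]

def weekly_retention(events, week_a: int, week_b: int) -> float:
    """Fraction of week_a users who came back in week_b. Sustained
    retention is the long-horizon signal that the agent is actually
    providing value, not just novelty."""
    by_week = defaultdict(set)
    for user, week in events:
        by_week[week].add(user)
    cohort = by_week[week_a]
    if not cohort:
        return 0.0
    return len(cohort & by_week[week_b]) / len(cohort)

print(weekly_retention(EVENTS, 1, 2))  # 2 of the 3 week-1 users returned
```

Outcome-based measures like the quota A/B test Eddie mentions would layer on top of this, but they are use case specific in a way raw retention is not.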
Sean Falconer
How should companies be thinking about agents versus some sort of simpler process? What are the scenarios where it makes sense for them to say, okay, we're going to go all in on an agent, versus some simpler, you know, prompt based approach or workflow?
Eddie Zhao
Ideally they don't need to think about the level of abstraction. Ideally a single product can cleanly span the gamut of, like, I'm putting in a natural language instruction for this use case, and behind the scenes I don't care what happens. It's going to either do something simple or, if it detects something complex, then I'll be prompted to say, okay, we think this probably merits something more complex, do you want to help refine or iterate on this agent? But requiring people to do the pre-work of understanding how complex their task is, is a hard thing to do. It's a big ask to make, right? And so at least from our perspective, we'd like to lift that away from folks, get them to try the simpler thing, and push them up the complexity curve as needed.
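The escalation idea above, try the simple path first and only surface the agent machinery when the request looks complex, can be sketched as a tiny router. The heuristics and handler names here are invented; a real system might use a classifier or the LLM itself to make this call.

```python
# Crude signals that a request is multi-step -- purely illustrative.
COMPLEX_HINTS = ("then", "after that", "for each", "and also")

def looks_complex(request: str) -> bool:
    """Guess whether a request needs multi-step agentic handling.
    A production router would use a trained classifier, not substrings."""
    r = request.lower()
    return any(h in r for h in COMPLEX_HINTS) or len(r.split()) > 25

def route(request: str) -> str:
    """Send simple requests to a single prompt, complex ones to an agent."""
    return "agent" if looks_complex(request) else "simple_prompt"

print(route("summarize this doc"))
print(route("find the doc, then draft an email"))
```

The point of the sketch is the shape of the decision, not the heuristic: the user never chooses an abstraction level, the router does, and can prompt them to refine an agent only when escalation happens.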
Sean Falconer
What would you say is like one of the biggest, hardest, like technical challenges with like actually building agents today? Like what is sort of the gap that's there that, you know, some R and D efforts need to be put into in order to solve.
Eddie Zhao
I actually still think it's tooling, and specifically the evaluation suite of tooling, which is really just an engineering problem. It's still a little bit behind, you know, ML infra in the traditional ML world. The faster you can give folks the ability to see how their agents are doing, help them evaluate at scale, and push them to create a lot more training or eval data, the faster these agents can actually work, right? Because a lot of it is getting that evaluation signal so you can tune on it. So to me that's a bottleneck I see across the industry. A lot of people see a small barrier and they just give up and ship whatever's out there, because they're like, I can't measure this anyway, right? But imagine how much more reliable what they ship could be if they could have iterated on it. And so that's top of mind for us as we're thinking about, okay, how do we help people build not just any agent, but effective agents? How do we give them the right toolkit to really measure? Because no one wants to put something out there and then it turns out none of their colleagues can use it because it's unusable. We want to help them get a pulse on, okay, I'm pretty confident, because I ran hundreds of queries on this and now it's really easy to do that, that this is going to work as I expect it to.
Sean Falconer
In the companies I talk to, there's of course a ton of interest in leveraging AI, but there is a lot of fear around making any of this stuff customer facing. So a lot of it is looking internally at how do I augment my existing knowledge workers and find efficiencies there, before I ever put something customer facing. Yeah, I agree. I think tool immaturity is a challenge. I think this is also why a lot of companies that are doing this in production don't trust even some of the existing tools and frameworks that are available, because they're worried about the maturity of those tools. It's not like building on the cloud, which has been around for 15 years or whatever. You're building on stuff that's maybe only been around for six months sometimes.
Eddie Zhao
Yeah, yeah, there's definitely that aspect. And a lot of it is that with the applications of ML before, people understood, hey, this classifier has precision and recall that's not a hundred percent, but I know what the business outcome is of a false positive and of a false negative. For a lot of these applications, the business outcome of a false negative or a false positive isn't even defined. It's unbounded text generation, or an action that does something that could be really devastating. It's hard to measure the business outcome of that. But a lot of people still need to understand that it is still an ML system. There's going to be some stochastic nature to it. It's not going to behave exactly how you want it to all the time. It's the difference between an ML feature and a software feature, in a way. It needs to behave like you intended enough of the time. But I think that sort of uncertainty is fundamentally built into a lot of this.
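For the classic classifier case Eddie contrasts against, the expected business cost per prediction is a straightforward calculation; the error rates and dollar costs below are made up for illustration. The point is that no analogous formula exists for unbounded generation, where the error events themselves are not well defined.

```python
def expected_cost(fp_rate: float, fn_rate: float,
                  fp_cost: float, fn_cost: float) -> float:
    """Expected per-prediction cost of a classifier, given its false
    positive / false negative rates and the business cost of each."""
    return fp_rate * fp_cost + fn_rate * fn_cost

# e.g. 25% false positives at $4 each, 50% false negatives at $1 each:
print(expected_cost(0.25, 0.5, 4.0, 1.0))  # 1.5 per prediction
```

This is the kind of back-of-envelope reasoning that let teams ship imperfect ML features with eyes open; the uncertainty in generative systems has to be managed without it.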
Sean Falconer
Awesome. Well, anything else you'd like to share?
Eddie Zhao
No, we covered so much here. Thanks for some great questions and it was really awesome talking.
Sean Falconer
Yeah, well, thanks for being here. Really enjoyed it and cheers.
Eddie Zhao
Cool. Thanks, Sean.
Podcast Summary: Software Engineering Daily - “Agentic AI at Glean with Eddie Zhou”
Release Date: April 22, 2025
In this insightful episode of Software Engineering Daily, host Sean Falconer engages in an in-depth conversation with Eddie Zhao, a founding engineer at Glean and former Google engineer. The discussion centers around the evolution of Glean from an enterprise search company to a pioneer in agentic AI, exploring the engineering and design considerations essential for building advanced AI-driven productivity tools.
Eddie Zhao begins by outlining Glean’s foundational vision, emphasizing that while enterprise search was the initial focus, the company has consistently aimed to enhance knowledge workers' efficiency across various tasks.
“With enterprise search we were meeting knowledge workers in a little slice of their sort of job to be done... freeing them up to do more things.”
[01:25]
Sean Falconer probes into how Glean transitioned to integrating agentic reasoning systems, to which Zhao explains that agentic AI broadens their assistance both in understanding user needs earlier and in executing tasks post-information retrieval.
“All we've done as we've evolved from Glean search to assistant and now to this agent platform is sort of broaden that segment...”
[01:25]
The conversation delves into the limitations of large language models (LLMs) in understanding company-specific data. Zhao highlights the crucial need for context injection—integrating an organization's internal data seamlessly into AI operations to enhance relevance and accuracy.
“They’re not getting better at knowing your company's knowledge... figuring out how to implement context injection is really important.”
[04:25]
Falconer emphasizes that even less advanced models can outperform sophisticated ones if they access the right data, underscoring the importance of data liberation and selective information provision.
“You can have essentially a lower power model that has access to the right data... to generate a meaningful response.”
[04:57]
When asked to define a reasoning agent, Zhao presents a flexible framework, acknowledging diverse interpretations while emphasizing agents’ ability to formulate and execute plans using available tools.
“A reasoning agent is something that can, given a set of tools, formulate a plan to satisfy an input and then go and execute those tools.”
[06:25]
He contrasts agents with Retrieval-Augmented Generation (RAG) systems, explaining that agents extend beyond retrieval to execute actions and manage multi-step processes.
“Agents are simply an extension of RAG, where the content being generated may not be the response to the user, it might be the next step in a plan.”
[08:03]
The discussion moves to the complexities of designing agents, particularly managing unbounded execution and ensuring controlled workflows. Zhao suggests implementing fixed execution limits and leveraging research indicating performance drops with excessively long reasoning tokens.
“You might allow for a fixed number of executions... there's a sharp decrease in performance once the thinking tokens become too long.”
[13:37]
They also explore multi-agent systems as a solution to scalability issues, advocating for a decentralized approach where specialized sub-agents handle distinct tasks.
“Our internal approach is a little bit more, okay, yes, you do have a central agent, but the tools... can delegate more to those other agents.”
[15:08]
Zhao explains how Glean leverages its existing identity infrastructure to ensure agents operate within appropriate access boundaries, maintaining security and relevance based on user roles and permissions.
“We can leverage our entire identity infrastructure and platform... certain tools can only execute if you have access to them.”
[24:29]
Addressing the complexity of debugging dynamic agent workflows, Zhao likens it to tracing through a graph of interconnected systems, identifying breakdowns by tracking inputs and outputs at each stage.
“It's composed together... you need to be able to say, okay, at the high level, where can I track this down... and do that trace.”
[25:07]
Falconer raises concerns about AI hallucinations, prompting Zhao to discuss the inherent difficulties in ensuring generated content’s accuracy. Glean adopts a dual approach of offline evaluation and real-time monitoring to mitigate these risks.
“It's a really hard problem... as an ML engineer's perspective of diagnosing, it does matter where the part of the system is breaking down.”
[28:33]
When asked about the development tools for agents, Zhao notes that Glean employs a mix of proprietary systems and open-source frameworks, allowing flexibility while adhering to performance and security standards.
“It's a blend... our principle is to try to reuse where possible.”
[31:06]
Zhao highlights Glean’s commitment to dogfooding, with internal teams actively building and utilizing agents to refine use cases and enhance platform robustness.
“Folks are building internal agents... trying to build those using that same suite of tools from low code to otherwise.”
[37:19]
On evaluating agent success, Zhao emphasizes usage metrics and long-term engagement as primary indicators, alongside outcome-based measures tailored to specific use cases.
“Usage is always king for any product... it depends on the use case.”
[38:02]
The conversation addresses when to adopt agentic solutions versus simpler prompt-based workflows. Zhao advocates for a seamless user experience where complexity is abstracted, allowing users to start simple and escalate to agents as needed.
“Ideally they don't need to think about the level of abstraction... push them up the complexity curve as needed.”
[39:08]
Zhao identifies tooling and evaluation suites as significant hurdles, stressing the need for advanced tools to assess and refine agents effectively. He underscores the importance of scalable evaluation methodologies to drive reliability and user trust.
“Tooling and the evaluation suite of tooling... is still sort of a little bit behind.”
[40:09]
The episode concludes with Zhao expressing excitement about the advancements and ongoing challenges in agentic AI. He reiterates Glean’s dedication to building effective, scalable, and secure agent systems that empower knowledge workers.
“We covered so much here... Thanks for some great questions and it was really awesome talking.”
[42:54]
Sean Falconer echoes this sentiment, appreciating the deep dive into the complexities and innovations driving Glean’s agentic AI journey.
“Really enjoyed it and cheers.”
[43:02]
This episode offers a comprehensive exploration of agentic AI within enterprise settings, highlighting both the potential and the intricate challenges of integrating advanced AI systems into everyday workflows. Eddie Zhao’s insights provide valuable guidance for engineers and companies aiming to harness the power of AI to enhance productivity and decision-making.