
Podcast Host
Hi listeners, and welcome back to No Priors. Today I'm joined by Issa Fulford, one of the pioneering minds behind OpenAI's Deep Research. This is a new agentic product that OpenAI released in February of this year, which uses reasoning and tools like web browsing to complete multi-step research tasks for you. Today they're making it free to all US users. Welcome, Issa.
Interviewer
Issa, thank you for doing this.
Issa Fulford
Thank you so much for having me.
Interviewer
You and your team have shipped one of the most exciting AI products of late. I use it a lot. Deep Research. Where did the idea come from? Tell me the origin story.
Issa Fulford
Yeah, so around a year ago now, we were very excited about the progress internally on this new reinforcement learning algorithm. We were seeing a lot of progress on math problems and science problems and coding problems. And at the same time I was working with my friend Yash, who works at OpenAI, on a few side projects. And we were very interested in agents generally and kind of wondered if we could apply the same algorithms to tasks that are maybe more in line with what the average user would do every day. And so the first two things we were thinking about were online browsing tasks, because I think in a lot of different professions people do just have to do a lot of research, synthesize a lot of information and then come back with a report. And then we were also thinking about software engineering. We kind of have been working on those things. I've been focusing on browsing. So to start with, for the math and coding problems that people were already training on, those datasets already exist. You can have a math problem with a ground truth answer and you can train on those. But for browsing it's kind of more open ended. You don't really have datasets like that that exist. So we really started by grounding the research in what product use cases we actually wanted the final model to be good at. So we literally would write out just a list of things. Like, I hope the model could find this list of products for me and rank them by these reviews from Reddit or something like that. Or I want it to be able to write a literature review on this topic.
Interviewer
I feel like a lot of people, when they think about browsing and agents, they land on the same two or three transactional use cases that I actually don't think are particularly inspiring. Right. So it tends to be like order a burger on DoorDash or something like that. Or I feel like ordering flowers is also like a really common one. Why do you think you came up with such a different set of goals for the agent?
Issa Fulford
Yeah. So I think before we focused on taking write actions, which those are examples of, we wanted to get really good at synthesizing information from a large number of sources, and mostly read-only tasks. That was for a number of reasons. Firstly, just a huge number of knowledge work professions mostly do that. So it would be quite useful for those groups of people. Secondly, I think the overall goal for OpenAI is to create an AGI that can make new scientific discoveries, and we kind of felt that a prerequisite to that is to be able to synthesize information. If you can't write a literature review, you're not going to be able to write a new scientific paper. So it felt very in line with the company's broader goals.
Interviewer
It's also very meta because you have helped make an AI that makes me better at learning and it's learning.
Issa Fulford
Yeah, I hadn't thought about that. I love that. More practically, with the read-only tasks, maybe the safety question is a bit more constrained. So it was a good thing to start with as well.
Interviewer
Yeah. It seems that in the read-only space, people were also not nearly as ambitious as you were going in, or you and Yash were going in, about like, maybe it could understand this set of things for me. Okay, so you thought of these end evals and came up with a set of tasks that could be auto-gradable or fit a set of characteristics that made them better fit the algorithms, and then what?
Issa Fulford
That was actually a huge process in itself. I think we initially had built a demo to pitch people on this idea and there was no model training involved. It was fully just prompted models with the UI pitching the vision of what this product could look like. And so I think after that we were at the point where we actually had to start thinking about how are we going to do this, how are we going to create the data, how are we going to train the model, what tools do we have to create to enable the model to browse the Internet effectively? And that was a lot of iteration. I was working very closely with Edward Sun and a few other people on this, and we also collaborated a lot with the RL team. I think it was definitely a big undertaking, and a good thing about it was we were able to work uninterrupted for quite a few months making the numbers on our evals go up. So I think it was nice to have not too much pressure to ship something really quickly, and we were just able to iterate and get it to a good state.
Interviewer
Did you have a favorite, most important task?
Issa Fulford
We had a few tasks. People would just propose different tasks. One of them was to find all of the papers that Liam Fedus and Barret Zoph had written together. I think there are 11. The model now can find most of them or all of them. We would always ask that question, and then another one which the model actually can't answer anymore, probably for good reason, was finding the middle name of one of our coworkers. And then personally, I think I started using it pretty early on for actually finding information for product recommendations, travel. And I think actually quite a few people internally did too. We had kind of a Streamlit playground that people would just use. A lot of people had found it and were using it. Sam told me he used it to buy a bunch of things. Every time it would go down, people would message us like, what happened? We need to use the model. Even when it was a previous version that honestly wasn't that good. So I think that was a good sign. Initial sign, yeah.
Interviewer
What can you say about the actual bulk of the work like the tool creation and the data creation?
Issa Fulford
So for the data we did a bunch of different things. We used human trainers for some of it. We kind of had to come up with new ways, new kinds of datasets, I guess, and we had to figure out how to design datasets to exercise the kind of skills that we wanted the model to learn. And then you have to make a way to grade those datasets as you're training on them, and then also figure out how to make good tools for the model to be able to actually complete the task successfully. So right now we just have the browsing tool, which is a text-based browser, but it can see embedded images and open PDFs, and then also it has access to a Python tool so it can do analysis and calculations and plot graphs and things like that. But you can imagine in future versions we'll just expand the toolset, and so the model will just become more capable, but we'll also need to make datasets that actually make the model exercise all of those different tools and figure out how to use them and backtrack and all these different things during training, so that it's actually able to flexibly answer new problems from users in the product.
Interviewer
It is clear that reinforcement fine-tuning on very powerful base models can do very useful things. Now that's super exciting. What advice would you have for startups or other companies who are thinking about doing RFT for a particular task, as to when it's worth doing or when they can just try to do sort of traditional orchestration where agents are a component?
Issa Fulford
So I think in general you will always get a model better at a specific task if you train on that task. But we also see a lot of generalization from training on one kind of task to other domains. So you can train a reasoning model on mostly math, coding, other reasoning kinds of problems, and it will be good at writing. But if you trained it on that specific task, it would be better at it. I think if you have a very specific task that you think is so different to anything that the model was likely trained on, and you try it a bunch of times yourself, and you've tried a lot of different prompts and it's just really not good at it, so maybe it's some genetic sequencing task or something that's just so out of distribution for the model that it doesn't know how to figure it out, I think that is a good time to try reinforcement fine-tuning. Or if you have a task that is so critical to your business workflow that getting the extra 10 or 15% performance is really make or break, then probably try it. But if it's something that you think, oh, the model's pretty good at, but it gets things wrong some percentage of the time, and then you see with every next model that's released, it gets a little bit better, it might not be worth the effort if the model naturally is just going to get better at those things. So that would be my recommendation.
Interviewer
Okay, great, great advice. You've talked about needing to use human experts to create some of this data. I think of browsing as a somewhat universal task. I guess there are good and better, there are worse and better browsers. Like, where do you feel like you need expertise, or what do you know about browsing expertise that you didn't before? Or information gathering expertise.
Issa Fulford
Yeah, I guess it's one of those things where basically every single profession involves, you know, having a question or wanting to do research in a domain and then having to find information from many different sources to synthesize an answer. And while doing that, you have to have the expertise to reason about, is this a useful source? Is this not, Is this, you know, should I include this? Is this completely off topic? Whatever. Like that is kind of universal to most jobs or most scientific domains, any kind of anything. So. And the cool thing with RL is that you don't necessarily need to know the whole process of how the person would do the research. You just have to know what the task is and what the outcome should be. And the model will just learn during training how to get from the problem to a good answer. So I think we just took a pretty broad approach. I think that's one thing that if you work at a place like OpenAI, I think you can do what they would tell most startups not to do and just try and focus on a really broad set of users and just get experts in loads of different domains and try and see if you can get good at everything at once, which was the approach that we took. And then we also created a lot of synthetic data sets and things like that. But the human data was definitely a really key part for making this model successful.
Interviewer
Did any of the learned planning from the model across these domains surprise you in terms of the path to find the perfect handbag or the restaurant in Japan, or the set of papers that was relevant?
Issa Fulford
Yeah, I guess sometimes it will use search terms that I wouldn't necessarily have used. Or, you know, we didn't teach it to plan upfront, but sometimes we'll see it does end up making a plan upfront before starting its research. Sometimes the model will do smart things and try to get around restrictions you put on it. So you have to make sure that it's not hacking, you know, and trying to use a different search engine other than the search engine that you gave it or something like that. Like it will do smart things that you have to make sure you're looking out for in case you want to not allow the models to do those things.
Interviewer
Maybe we can actually use this as a moment to talk about some of the failure modes, like how do you think about some of the classic issues with agents, like maybe compounding error or distraction or even safety.
Issa Fulford
Yeah. So I think with Deep Research, since it can't actually take actions, there aren't kind of the same class of typical agent safety problems you would think of. But I think the fact that the responses are much more comprehensive and take longer means that people will trust them more. So I think maybe hallucinations are a bigger problem. While this model hallucinates less than any model that we've ever released, it's still possible for it to hallucinate, most times because it will infer something incorrectly from one of its sources. So that's part of the reason we have citations, because it's very important that the user is able to check where the information came from, and if it's not correct, they can hopefully figure it out. But yeah, that's definitely one of the biggest model limitations and something that we're actively always working on to improve. In terms of future agents, I think the ideal agent will be able to do research and take actions on your behalf. And so I think that's a much harder question that we need to address. And it's kind of at that point when capabilities and safety kind of converge, where an agent is not useful if you can't trust it to do a task in a way that doesn't have unintended side effects that you don't want. Like if you ask it to do a task for you and then in the process it sends an embarrassing email or something like this, you know, that's not a successful completion of the task. So I think that is going to be a much more interesting and difficult safety area that we're starting to tackle.
Interviewer
You can tell me if you just don't have a projection here, but do you think people are going to want explicit guardrails? Do you think you can learn a bunch of those characteristics in the model itself?
Issa Fulford
If you've used Operator, and I'm sure you have, you have to confirm every write action. I think to start with, that makes a lot of sense. You want to build trust with users, and as the models become more capable, maybe you've seen it successfully do things a few times and you start to trust it more. And so maybe you allow it to, okay, you don't have to ask me every time. Every time you send an email to these people, that's fine. But I do think that as these agents start to roll out, we will definitely want to have guardrails and confirmation, just so while they're not at the end state capability, we still want to make sure we have a good level of oversight. But I think that they will get so good that we'll just trust them to do things on our behalf.
Interviewer
What are some of the obvious ways you feel like Deep Research as a product is going to get better? Yeah, I mean, it's going to extend into write. You just implied that at some point.
Issa Fulford
I think maybe it's, you know, the ideal state would be to have a unified agent that can do all of these different things. Anything that you would delegate to a co worker, it should be able to do.
Interviewer
How are we going to make decisions about if it's like, Sarah, you do this versus agent, please do this?
Issa Fulford
Yeah, I guess.
Interviewer
Is it always just try the agent first?
Issa Fulford
Probably. I mean, I would try the agent first if it was my work. It's kind of the pattern of every time the model becomes more capable, the level of abstraction of the human becomes higher, if that makes sense. The task you're asking it to do is just higher and higher level, but you're still initiating the task. So maybe a year ago I was asking it to write a function for me, and now I'm asking it to write a whole file, and maybe next year it will make a whole PR for me or something like that. So I still think we'll be in the driving seat. As for Deep Research, I think obvious next steps would also be to have access to private data, like being able to do research over any internal documentation or GitHub, whatever it is.
Interviewer
There's a golden thread here, because when we first met you were working on retrieval, and I was like, there cannot be only one person at this company working on retrieval.
Issa Fulford
Everything, all roads lead back to retrieval. So I think that will be really cool. And then eventually taking write actions or calling APIs, and then obviously there are just a lot of things that the model is not perfect at now that we just need to improve. But I think we have a really cool working relationship with the reinforcement learning team. So a lot of teams will contribute datasets to the big runs that they do. So we contribute datasets, and then as they train models with a ton of compute, it just becomes a better base model for us to continue training from. So I just think the capabilities are compounding.
Interviewer
So this was not a low key research preview, but a side project that turned into a very interesting internally pitched project. How do you think about what is a product that OpenAI or at least you yourself want to work on independently versus what belongs in the core research path?
Issa Fulford
A cool thing about OpenAI is that even though the company is bigger, I think the culture of anyone being able to have an idea and prove it out and then push it to completion has still been maintained as the company has grown. For me personally, I'm always motivated to work on things that I will use myself. With Deep Research, for example, I do use it a lot for looking up various things, travel recommendations. I think I'm probably a daily active user. It's fun when you get to dogfood.
Interviewer
Amazing. Yeah. Burning a lot of GPUs. Are there use cases where you're the original expert? Are there ways that you or Yash have seen the user base use it that you'd encourage people to use Deep Research?
Issa Fulford
I'm always interested to see people using it in domains that I have absolutely no expertise in. For example, in medical research, or I've seen a lot of different scientists posting about how they've used Deep Research and how it helped them do something. To me that's the most interesting, because when we were working on it, I obviously had no way of judging whether an output is good or not. So seeing experts actually ratify Deep Research's responses is useful. An area that I was surprised to see people using the model in was code search and coding questions, like, use the latest package or latest version of whatever repo to help me write this file or something. For data analysis as well, that's also something the model's already pretty good at and I think will just continue to get better at. I think uploading a file or something like that and having it do some analysis for you, or do some research and then create a report with numerical analysis, is pretty interesting.
Interviewer
I actually haven't tried this and it's not a browsing task. What makes the model particularly good at this or what is it capable of? Is it really multi step and then being able to do planning and understanding of the task and produce a report that's cohesive?
Issa Fulford
Yeah, I think also the base model, or the model that we started fine-tuning from, o3, is just a very capable model. It's trained on many different datasets, including a lot of coding, reasoning and math tasks. So that inherited capability is pretty strong. And then when you add the browsing on top of that, it's still able to do that analysis. So I think those two together can be quite powerful.
Interviewer
Before the podcast we were just talking about the idea of learning taste or preferences from users. OpenAI's just released a bunch of memory features. How do you think that deep research could or just agents in general could evolve to take into account how people want to learn or their information ingestion preferences?
Issa Fulford
Yeah, I think agent memory will definitely be very important. It'll be very annoying if every time you ask it to do a task you have to repeat the same information, how you want it to do the task, everything about you, which currently for Deep Research you do have to do. And I think as the tasks get more complex, and right now it will take five to 30 minutes, you can imagine in the future it might take hours or days to complete a task that you ask the model to do. You definitely want the model's research to be compounding. You don't want it to have to start fresh every time. So I don't necessarily have a good answer, but I think it's something that will be very important.
Interviewer
There is a common understanding between many people at some of the leading labs that the recipe to AGI is, I'd say, somewhat known, or there's confidence in this, and the return of RL is very exciting for everyone. The stance that I've heard from you and from others is both enthusiasm, like, this seems to work, we're going to get real capability out of it, it's quite data efficient, and also that it's going to be a lot of work. Tell me a little bit about the emotional experience of building Deep Research and if that changes your view at all.
Issa Fulford
I agree with everything you said. I think it's so impressive to see how data efficient the algorithm is. I guess the data you train on is much higher quality and smaller. So actually curating that is an undertaking. And then making sure that the model has access to all the tools that a human would have access to to do the work that they need to do and then making sure that you represent tasks that people will find useful or do in their jobs in a way that you can judge whether the model did a good job or not is also hard. And there's so many other challenges for pre training where you have so much more data. You have to do all of these different things that are like, I think it's just a different challenge. And both are compounding. You need a really good base model to be able to do rl and then for our team we just do more rl. Yeah, it's all very compounding, but I think that everybody does kind of see a pretty clear path to this broadly capable agent.
Interviewer
Do you think there are big blockers to progress of, like you said, maybe not exactly describing it as the next iteration of deep research, but just confidence that we're going to have these unified agent capabilities and it will feel like a coworker. What stands between us and that?
Issa Fulford
There's a lot of really hard safety questions that we need to figure out. We would never ship anything that we don't have very high confidence is safe. And I think the stakes are way higher when it has access to your GitHub repositories and your passwords and your private data. So I think that's a really big challenge. I guess also, if you want the model to be able to do tasks that take many, many hours, finding efficient ways to manage context. It's kind of similar to the memory thing, but if you're doing a task for a really long time, you're going to run out of context. So what's an efficient way of dealing with that, allowing the model to continue to do its thing? And then, yeah, just the task of making the data and making the tools. I mean, I've said this already a few times, but that's a lot of work.
Interviewer
I was just looking at my history of queries. My user request is like, I want to see what things I asked of Deep Research versus other models, in particular in my memory. But it has ranged from, like, obviously, you know, if I'm trying to get up to speed on a market for a company I'm looking at, or on a technical topic, or travel planning, that's a big one. Also I have looked for things that are taste related. So I'll be like, okay, I like, you know, this set of books for these reasons. I want you to, you know, actually just give me a long-form summary of a bunch of other things you think I should read and explain why. I realize I don't have a super clear mental model of when Deep Research should be better than o3. What instinct can you give me here?
Issa Fulford
Deep Research is very good when you have a very specific or well-defined query. So maybe not a general overview of a topic, but you're looking for some specific information and you think it would be supplemented by existing research online. Even if we also trained the base model on that information, I think having live access to it is quite useful.
Interviewer
So if I have any instinct about directing to retrieval or particular sources that focusing is useful.
Issa Fulford
I think so. And also we trained it to have much longer outputs than I think the normal models would. So if you're looking for something very comprehensive, maybe sometimes too comprehensive for some tasks, I think Deep Research will be useful for those things.
Interviewer
Connect this for me to a Deep Research fashion task.
Issa Fulford
I've used it to find new brands, so I'll say these are the kinds of brands I like. Please find new brands. Or I can find this specific coat that looks like this one or something like that and then it's very good at finding those versus I think the base model or the normal model will say it will give you some brands but it won't necessarily fit all of the constraints that I had given. I want it to sell this fake fur coat that's this length this season or something. It's not going to be able to do that because it just won't have the up to date information and also just won't necessarily be able to deal with all of the constraints in a query in one shot. O1 isn't browsing as comprehensively. I'll use it to find things where I'M looking for a very specific thing that would take me hours to find. So I'm looking for this very specific item or sweater that it's probably available on RealReal or somewhere but I can't find it. Or I'm looking for an Airbnb with very specific constraints. So I think those kinds of things deep research is good for and then more general high level things you should use normal search for.
Interviewer
Yes. Well, I will admit I have had some multi-year browsing shopping tasks that I am now making a cron job for Deep Research. I want to ask just one more experience question, which is: was there a particular win or failure that surprised you in the training of Deep Research?
Issa Fulford
It really was one of those things where we thought that training on browsing tasks would work. Felt like we had good conviction in it. But actually the first time you train a model on a new data set using this algorithm and seeing it actually working and playing with the model was pretty incredible. Even though we thought it would work so honestly, just that it worked so well was pretty surprising. Even though we thought it would, if that makes sense.
Interviewer
Yeah, it's the visceral experience of like, oh, the path is paved with strawberries or whatever.
Issa Fulford
Exactly. But then sometimes some of the things that it fails at are also surprising. Sometimes it will make a mistake where it will do such smart things and then make a mistake where I'm just thinking why are you doing that? Stop. So I think there's definitely a lot of room for improvement. But yeah, we've been impressed with the model so far.
Interviewer
I'm used to all my technology tools being instantaneous. Deep Research is not instantaneous, it's thinking and using tools. Can it be faster?
Issa Fulford
Yeah, I do think there's a good middle ground in between where sometimes you don't want it to do really deep research, but you want it to do more than a search. And I think that we will release things soon that people will be happy about and will fill that gap.
Interviewer
Okay. I don't know how to communicate this preference, but I want a toggle at some point to be like, as much work as... I mean, because I would say this to a human: I want you to do as good a job as you possibly can in the next five minutes.
Issa Fulford
Yeah, see, that's something where I think it seems like bad UX to actually make the user make that decision. The model should just be better at knowing how much time to think. I think we made a decision when training the model that we are just going to go for max thinking time every time. So I'm sure we'll ask it a really simple query sometimes just to test and then get quite frustrated that it's still thinking. So I do think that's also an area for improvement, knowing how long to think for. But yeah, I suspect with Deep Research, we'll always be focusing on the tasks that take the maximum length of time. And then I think o3 or o-next will have a better in-between.
Interviewer
What is an example of a task you can imagine deep research taking a day at in the future? I mean, there's some GPUs smoking.
Issa Fulford
I think anything that would take, I mean, right now, in five to 30 minutes, it can do what takes human experts many hours. So I guess in an hour it could do something that would take a human days. In a day it could do something that would take a human weeks. Obviously there'll be a lot of challenges to get it to scale like that, but I think you can imagine it doing a research project that would have taken weeks to complete, or writing a thesis or something like that.
Interviewer
Okay, I'm going to make our intern compete with it over the next couple months then.
Issa Fulford
Yeah, sounds good.
Interviewer
If you were to project forward a year, which is a really long time in AI land, what is something that you think will surprise people that agents can do and that will actually be released? So it takes the safety considerations into this set.
Issa Fulford
Yes, a general agent that could, you know, help you do a lot of the tasks that you would do in a lot of different areas. Like for me, I do a lot of coding. I'm hoping that there'll be an agent that is pretty proficient at coding, that I will just trust. I'll give it a task and it will hopefully make a PR or something. But maybe I can ask the same agent to help me book a trip to Korea or something. I hope that we'll get to a more unified experience, but I also think that the rate at which these models are improving is going to be pretty surprising to most people.
Interviewer
Why do you think a unified experience is important or why do you think that makes sense? Because I think today it's quite different to think about. Obviously, ChatGPT is one experience that's very encompassing, but there are models that people use in different contexts, like next line completion type models for coding that just feel like a very different setting.
Issa Fulford
I think that you'll probably want both. You'll probably want an experience where you can at some point override or interrupt the model and say, oh no, I didn't mean that. Or you can take over and start typing something, especially in the short term, as the models are not as capable as humans in a lot of areas and are more capable in other areas. So I think it will be a combination of you asking the model to do something, but then maybe, to go with the coding example, you're also in your VS Code or, whatever it is, your Cursor, and it's been doing something for you, but you can also actually type and write some of it yourself. So I think it will be a combination of those things. But I kind of want it to be something that is just like having a coworker on Slack or a remote coworker. You can just ask them to do things for you, send them a Slack message, and then they'll start doing it, and then you can review their work or help at some point. That seems like a pretty nice general interface, and you don't have to think about which agent should I ask to do which task. It should just be able to figure it out.
Interviewer
The mental model I have for this is, my general ethos is actually, I love the people I work with. I prefer to work with fewer people with less management overhead, all things considered, because each person has more context and I have more understanding of them. And so the universally useful agent is attractive for that reason.
Issa Fulford
Yeah. And you only have to tell it something once and it will remember, and then it will have state on everything you're working on. Things like that.
Interviewer
Awesome. Well, this has been a great conversation, Issa. Thanks for doing this and thank you for the product release.
Issa Fulford
Thank you so much for having me and thank you for using Deep Research.
Podcast Host
Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Date: April 24, 2025
Host: Sarah Guo (Conviction)
Guest: Issa Fulford (OpenAI)
This episode features a deep-dive conversation with Issa Fulford, an AI engineer at OpenAI and a key architect behind Deep Research—a new agentic product enabling users to complete complex, multi-step research tasks using reinforcement learning and tool integration. The discussion unpacks the product's origin, core technologies, its implications for agentic AI, challenges around safety, use cases, and the future of AI agents.
Genesis of the Idea ([00:39]):
Direction of the Product ([02:37]):
Alignment with AGI Ambitions ([02:37]):
Iterative Demo and Prototyping ([04:00]):
Task Design and Evaluation ([04:59]):
Human-Generated and Synthetic Data ([05:59]):
When to Apply Reinforcement Fine-Tuning (RFT) ([07:23]):
Role of Human Expertise in Browsing Tasks ([08:59]):
Emergent Planning Behaviors ([10:29]):
Safety and Hallucinations ([11:18]):
Guardrails and User Trust ([12:54]):
Current and Emerging Capabilities ([13:44]):
Agent as a Co-worker Paradigm ([14:04]):
Internal OpenAI Collaboration ([15:20]):
How Experts Use Deep Research ([16:40]):
Why Deep Research Excels in Specific Queries ([22:28]):
Fashion, Shopping, and Taste-Related Tasks ([23:21]):
Key Development Surprise ([24:43]):
Performance and Speed ([25:34]):
Future Scalability—Long-Running Tasks ([26:54]):
Unified Agent Vision ([27:51]):
UX and Collaboration Model ([28:51]):
Importance of Memory and Context ([18:34], [20:57]):
Final Reflections on Safety and Progress ([20:57]):
The conversation is candid, insightful, and pragmatic—balancing technical confidence (“everybody does kind of see a pretty clear path to this broadly capable agent” [19:07]), with humility about the real challenges (“There’s a lot of really hard safety questions that we need to figure out” [20:57]). The whole session provides both a front-row seat to the development of AI agents and actionable guidance for practitioners and researchers.