
B
Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
A
Hello. Hello.
C
And we are here in the OpenAI Dev Day studio with Sherwin and Christina from the OpenAI platform team. Welcome.
A
Thank you for having us.
D
It's always great to be here.
C
Yeah, it's such a nice thing. We've covered three of these dev days now, and this is the first time it's been so well organized that we have our own little podcast studio in the Dev Day venue. And it's really nice that we actually get a chance to sit down with you guys. So thanks for taking the time.
A
Yeah, I feel like Dev Day is always a process. We've only had three of them, and we try to improve it every time. And I actually know for a fact that we have this podcast studio this time because the podcast interviews with folks like yourselves last time went really well. So we wanted to lean into it a little bit more, and I'm glad that we were able to have this studio for you.
B
We were kneeling on the ground interviewing like Michelle Lesser.
A
Olivier. I just saw it post production. I thought it was.
C
We had to have people, like, cordon off the area so they wouldn't walk in front of the cameras.
B
People would just come up, like, hey, good to see you. And I'm like, we're recording. Nice.
D
I guess if you guys have been to three, like, what. What stood out from today or what? What's your favorite part?
C
I feel like the vibes are just a lot more confident. Like, you are obviously doing very well. You have the numbers to show it. Every year at Dev Day, you report the number of developers. This year it's 4 million. I think last year was like 3. And I have more questions about that. But also just very interesting, very high confidence launches. And then also, I think the community is clearly much more developed. I think there's just a lot more to dive into across the API surface area of OpenAI than there was last year, in my mind. I don't know about you.
B
Yeah. And we were at the OG dev day, which was the DALL-E hack night at OpenAI in 2022. And I think Sam spoke to like 30 people. So I think it's just crazy to see the.
A
Yeah, honestly, I think it's kind of similar to this podcast studio. We've had a number of dev days now, and we honestly were slowly figuring things out as a company over time as well, both from a product perspective and also in how we want to present ourselves with Dev Day. And at this third one, we've had a lot of feedback from people. I actually think a lot of the attendees will get an email with a chance for feedback as well. And we actually do read those and we act on those. One of the things that we did this year that I really liked were the art installations and the little arcade games, which, you know, came up via engaging with the feedback from the game.
D
Yeah, the arcade games were so fun. I loved like the theme of all the ASCII art throughout. This is my first SF dev day, but I've been to the Singapore one. That was actually my first week.
C
Oh yeah, that's the one I spoke about.
D
Yeah, I saw you there. That was my first week at OpenAI. They really threw me in the deep end.
A
Put her on a plane to Singapore.
D
Yeah, yeah.
C
That's awesome. Well, so, you know, congrats on everything and kudos to the organizing team. We should talk about some developer API stuff. So we're going to cover a few of the things. You're not exactly working on Apps SDK, but I guess, what should developers take away from the Apps SDK launch? Like, how do you internally view it?
A
So the way that I think about it is I actually view OpenAI since the very beginning as the company that has really valued opening up our technology and bringing it out to the rest of the world. One thing we talk about a lot internally is, you know, our mission at OpenAI is to, one, build AGI, which we're trying to do, and then two, you know, potentially just as important, to bring the benefits of that to the entire world. And one thing that we realized very early on is that it's very difficult for us as a company to bring it to truly every corner of the world. We really need to rely on developers and other third parties to be able to do this. You know, Greg talked about the start of the API and how that was formulated, and that was part of that mentality, which is we need to rely on developers and we need to open up our technology to the rest of the world so that they can partake, for us to really fulfill our mission. So the API, obviously, is a very natural way of doing that, where we just literally expose API endpoints or expose tools for people to build things. But now that we have ChatGPT with its 800 million weekly active users, I forgot the stat that we shared. I think it's now the fifth or sixth largest website in the world and.
C
The number one and number two most downloaded on the Apple App Store.
A
Oh, yeah, with Sora.
D
Yeah, yeah.
A
But that one, it moves around all the time, so it's kind of hard to celebrate.
C
And, you know, just screenshot it when it's good.
A
Yeah, we definitely screenshot it and share it when it's good. But going back to my main point, we've always engaged with developers as a way for us to bring the benefits of AGI to the rest of the world. And so I view this as actually a natural extension of that. Candidly, we've actually tried to do this a couple of times: two dev days ago with GPTs, and with plugins, which was, I think, not tied to a dev day. So again, we love to deploy things iteratively, and I view it as just a continuation of that process and also engaging deeply with developers and helping them benefit from some of the stuff that we have, which in this case is ChatGPT distribution.
B
So Apps SDK is built on the MCP protocol. When did OpenAI become MCP-pilled? I'm sure internally you must have had, you know, design discussions before about doing your own protocol. When did you buy into it and how long ago was that?
D
I think it was in March, I want to say. It's hard for me to remember kind of like the exact.
C
March was the takeoff of MCP.
A
Okay, yeah, yeah.
D
So we built the Agents SDK and we launched that alongside the Responses API in early March. And, you know, we were building kind of a new agentic API that can call tools and just be much more powerful. As MCP was growing, it was kind of like the natural protocol that developers were already using to bring all the tools into their system. And I think March is when we added MCP to the Agents SDK first, and then soon after with kind of our other.
A
Yeah, I think there was like a tweet or something we did where it was like, OpenAI is.
D
Yeah, there was definitely a moment, I think There was a specific moment in a specific tweet.
A
But what I will say though, and this is honestly credit to the team at Anthropic that created MCP, is I really do think they treat it as an open protocol. We work very closely with, I think, David and the folks on the, you know, consortium, and they are not really viewing it as this thing that is specific to Anthropic. They really view it as this open protocol. It is an open protocol. The way in which you make changes feels very open. We actually have a member of our team, Nick Cooper, who is sitting on that steering committee for MCP as well. And so I think they are really treating it as something that is easy for us and other companies, everyone else, to embrace, which I think they should, because they do want it to be something that is embraced by all. And so because of that, I think it makes it a little bit easier for us to embrace it. And honestly, it's a great protocol. It's very general.
C
It's already solved. Why would you make it?
A
Yeah, it's very general. There's obviously still more to do with it. But it was very easy for us to integrate because of how streamlined and how simple it was.
C
Yeah. My final comment on Apps SDK stuff, and then we'll move to Agent Kit, is that abstractly, when you sort of wireframe a website or an AI app, it used to be that the initial AI integration on the website would be you have the normal website and then you have a little chatbot app. And now it's kind of inverted, where there's ChatGPT at the top layer and then there's the website embedded inside of it. And it's that inversion that I honestly have been looking for for a little bit. And I think it's really well done. Like actually all the integrations and the custom UI components that come up. You had Canva on the keynote there, and it looks like Canva, but you can chat with it in all the context of your ChatGPT. That is an experience I've never seen.
A
Yeah. And I think that's kind of back to the iterative learning that we've had, because we've learned a lot from plugins. When we launched plugins, and I don't know if people here really remember plugins, it was like March '23, one of the points of feedback that we got was: we told all these companies that they could integrate these plugins into ChatGPT, but they really didn't have that much control over how exactly it was used. It was really just a tool that the model could call, and you were just really bound by ChatGPT. And so I think you can kind of see the evolution of our product with this. This time we realized how important it was for companies, for third party developers, to really own and steer the experience and make it feel like themselves, to help them really preserve their own brand. And, you know, I actually don't think we would have gotten that learning had we not had all these other steps beforehand.
B
Awesome. Christina, you were the star today on stage with the Agent Kit demo. You had eight minutes to build an agent, you had a minute to spare, and then you had some issues with the download.
D
Honestly, I was like, let's do a little bit less testing and maybe we. I don't know how much time I killed on the.
A
I was extremely stressed on the download thing. I was like, if a UI bug is what, like takes the demo down, I'd be so sad.
D
I think it was a full screen. Yeah, like focus thing.
A
I heard the window wasn't in focus or something.
B
Yeah. Maybe you want to introduce Agent Kit to the audience.
D
Yeah. So we launched Agent Kit today: a full set of solutions to build, deploy and optimize agents. I think a lot of this comes from working with API customers and realizing how hard it actually is to build agents and then actually take them into production. It's hard to get that confidence and the iterative loop, and writing prompts, optimizing them, writing evals, all takes a lot of expertise. And so we're taking those learnings and packaging them into a set of tools that makes it a lot easier and kind of intuitive to know what you need to do. There's a few different building blocks that can be used independently, but they're kind of stronger together because you then get the whole end to end system. We're releasing that today for people to try out and see what they build.
C
Yeah. So I find it hard to hold all the building blocks in my head, but actually, chronologically it's really interesting that you guys started out with the Agent SDK first and then you had Agent Builder, you have a connector registry, you have Chat Kit, and then you have the eval stuff. Am I missing any major components? Those are the main moving parts, right?
A
Yeah. I think that's it. And then we also still have the RFT fine tuning API, but we technically group it outside of the Agent kit umbrella.
C
Got it, got it, got it. Yeah. So it's weird how it develops and it's now become the full agent platform. Right. And I think one thing that I wasn't clear about when I was looking at the demo was it's very funny because what you did on stage was build a live chat app for Dev Day's website.
A
Yeah.
D
Did you get a chance to try it out?
C
Yeah, I tried it out. It was awesome. And actually I kind of wanted to ask, how did the.
D
Where's merch?
C
Yeah, exactly. I was like, where'd you click the merch? Anyway, this is very close to home because I've done it for my conferences and it's a very similar process. But I think what was not obvious is how much is going to be done inside of Agent Builder. I see there are some actually very interesting nodes that you didn't get to talk about on stage, like user approval. That's like a whole thing and transform and set state. There's kind of like a Turing Complete Machine in here.
D
Yeah, yeah. So, I mean, I think again, this is the first time that we're showing Agent Builder and so it's definitely the beginning of what we're building. And human approval is one of those use cases that we want to go pretty deep on. I think the node that I showed today is pretty simple, like binary approve or reject. It's similar to what you'd see for MCP tools, you know, approving that an action can take place. But what we've seen with our users' much more complex workflows is that it's actually quite advanced human in the loop interaction. Sometimes these could be over the course of weeks. It's not just simple approval of the tool, there's actual decision making involved in it. And I think as we work with those customers, we definitely want to continue to go deeper on those use cases too.
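A minimal sketch of the binary approve/reject gate described above. This is not Agent Builder's implementation or API; the names `PendingAction`, `run_with_approval`, `cautious_reviewer`, and `fake_execute` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class PendingAction:
    """A tool call the agent wants to make (hypothetical shape)."""
    tool: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def run_with_approval(
    action: PendingAction,
    approve: Callable[[PendingAction], bool],
    execute: Callable[[PendingAction], Any],
) -> Dict[str, Any]:
    """Run `execute(action)` only if `approve(action)` returns True."""
    if approve(action):
        return {"status": "approved", "result": execute(action)}
    # Rejected actions are never executed.
    return {"status": "rejected", "result": None}


# Stand-ins for a human reviewer and a real tool, for demonstration only.
def cautious_reviewer(action: PendingAction) -> bool:
    return action.tool != "issue_refund"


def fake_execute(action: PendingAction) -> str:
    return f"ran {action.tool}"


approved = run_with_approval(PendingAction("lookup_order"), cautious_reviewer, fake_execute)
rejected = run_with_approval(PendingAction("issue_refund"), cautious_reviewer, fake_execute)
```

A longer-running review of the kind mentioned here (decisions spread over weeks) would need persisted state rather than a synchronous callback; this sketch only covers the simple case.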
C
Yeah.
B
What's the entry point? So are developers also supposed to come here and then do the to code export, like just segment, like the use cases?
D
Yeah. So I think the two reasons that you would come to Agent Builder are, one, kind of more as a playground, right? To model and iterate on your systems and write your prompts and optimize them and test them out, and then you can export it and run it in your own systems using the Agents SDK, using other models as well. The second would be to get all of the benefits of us deploying that for you too. So you can maybe use natural language to describe what type of agent you want to build, model it out, and bring in subject matter experts so that you really have this canvas for iterating on it, building data sets and getting feedback from those subject matter experts as well, and then be able to deploy it all without needing to handle that on your own. And that's a lot of the philosophy around how we're building it with ChatKit as well. Right. You can take pieces of it, you can have a more advanced integration where it's much more customized, but you also get a really natural path to going live with really easy defaults as well.
B
Do you see it as a two way thing? So I build here, I go to code, then maybe I make changes in code and then I bring those changes back to the agent builder eventually.
D
That's definitely what we want to do. So maybe you could start off in code, you could bring it in. We'll also probably have the ability to run code in the Agent Builder as well. I think just a lot of flexibility around that.
A
The one thing I'd say too is that a lot of the demos we showed today erred on the side of simplicity, just so that the audience could kind of see it. But if you talk to a lot of these customers, they're building pretty complex flows. You've got to zoom out on that canvas quite a bit to see the full flow. We were working with a lot of customers who were doing this, and if you turn that into an actual Agents SDK file, it's pretty long. And so we saw a lot of benefit from having the visual setup here, especially as the setup grows longer and longer. It would have been a little difficult to showcase. But even on like minutes, right? Yeah, you can do it in eight minutes. But like even with some of the presets that we have on the website with like.
D
Yeah, one of the things that we launched today as well alongside just like the canvas is a set of templates that we've actually gathered from our engineers who are working in the field with customers directly of like the kind of common patterns that they have in our own basically like Playbooks when we're working with customers on customer support, document discovery and so kind of publishing those as well.
C
Data enrichment, planning, helper customer service, structured data Q and A document comparison. That's nice. Internal knowledge assistant.
D
Yeah, and I think we just plan to add more to those as we can kind of build those out.
C
I always wonder if there should be. So you're not the only agent builders, but obviously by default of being an OpenAI, you are a very significant one. Any interest in a protocol interop between different open source implementations of this kind of pattern of agent builder?
A
I think we've thought about it, especially around, I'd say, the Agents SDK. I would actually say, maybe zooming out a bit more from just this, we were also sitting here and kind of observing things being made over and over again, even besides agent workflows. We're kind of watching what the industry is trying to do with Responses, like what we've done with the Responses API, like stateful APIs. And so, you know, obviously we were the first one to launch the Responses API, but a couple of other people have kind of adopted it. I think GROK has it in their API. I think I saw LMSYS just did something recently as well, but, you know, not everyone. And so unfortunately I don't have a great answer today of yes or no, but we are kind of assessing everything and trying to see, like, hey, there has been a lot of value with MCP, and hopefully with our commerce protocol as well. ACP. Yeah, I definitely did not forget the name. And so we're even thinking about what we want to do with the agent workflow, the portability story around that, as well as the portability, I'd say, even of the Responses API. It'd be great if that could be a standard or something, so developers don't need to build three different stateful API integrations if they want to use different models.
D
Yeah, and I think that's one of the. So it's not exactly a protocol, but one of the things that we launched today with evals too is the ability to use third party models as well and kind of bring that into one place. And so I think we definitely kind of see where the ecosystem is at, which is, you know, using multiple models and kind of.
C
Third party models, as in non-OpenAI models?
A
Yeah, yeah, it'll work with evals starting today.
C
Okay, got it.
A
We have a really cool setup with OpenRouter where we're working with them, and you can bring your OpenRouter setup. With that, you can use our datasets tool to create a bunch of evals, and you'd actually be able to hit a bunch of different model providers, you know, take your pick from wherever, even open source ones on Together, and see the results in our product.
C
Yeah, that's awesome. Speaking more about evals, right, Like I think I saw somewhere in the release docs that you basically had to expand the Evals products a little bit to allow for Agent Evals. Maybe you can talk about what you had to do there.
D
Yeah.
A
Yeah, I was going to say. So I actually think agent evals is still a work in progress. I think we've made maybe 10% of the progress that we need here. For example, I think we could still do a lot more around multimodal evals. But the main progress that we made this time was allowing you to take traces. The Agents SDK has a really nice traces feature where, if you define things, you can have a really long trace, and we now allow you to use that in the evals product and grade it in some way, shape or form over the entirety of what it's supposed to be doing. I think this is step one. It's good to be able to do this, but our roadmap from here on out is to really allow you to break down the different parts of the trace and allow you to eval and measure each of those and optimize each of those as well. A lot of the time this will involve human in the loop as well, which is why we have the human in the loop component here too. But if you look at our evals product over the last year, it's been very simple. It's been much more geared towards this simple prompt completion setup. But obviously as we see people doing these longer agentic traces, how do you even evaluate a 20 minute task correctly? It's a really hard problem. We're trying to set up our evals product to move in that direction, to help you not only evaluate the overall trajectory, but also individual parts of it.
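The distinction drawn here, grading a whole trace versus grading its individual parts, can be sketched roughly as follows. This is not the OpenAI evals product or the Agents SDK trace format; the `Step` shape, the grader functions, and the sample trace are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One entry in an agent trace (hypothetical shape)."""
    kind: str    # e.g. "tool_call" or "message"
    name: str
    output: str


Trace = List[Step]


def grade_whole_trace(trace: Trace, grader: Callable[[Trace], float]) -> float:
    """Score the overall trajectory with a single grader."""
    return grader(trace)


def grade_each_step(
    trace: Trace, step_graders: Dict[str, Callable[[Step], float]]
) -> Dict[str, float]:
    """Score each step that has a grader registered for its kind."""
    scores: Dict[str, float] = {}
    for index, step in enumerate(trace):
        grader = step_graders.get(step.kind)
        if grader is not None:
            scores[f"{index}:{step.name}"] = grader(step)
    return scores


# Toy trace and toy graders, for demonstration only.
trace = [
    Step("tool_call", "search_sessions", "3 results"),
    Step("message", "final_answer", "The keynote is at 10am."),
]


def final_answer_mentions_time(t: Trace) -> float:
    return 1.0 if "10am" in t[-1].output else 0.0


def tool_returned_results(step: Step) -> float:
    return 0.0 if step.output.startswith("0 results") else 1.0


overall = grade_whole_trace(trace, final_answer_mentions_time)
per_step = grade_each_step(trace, {"tool_call": tool_returned_results})
```

In practice the graders could be rubric-based model judges or human reviewers rather than string checks; the point is only that per-step scores localize a failure that a single trajectory score hides.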
C
Yeah, I mean the magic keyword is rubrics, right? Everyone wants LLM-as-judge rubrics.
A
Yep, yep.
C
Yeah, obviously where this will go. Okay, great. The other thing I think online I see the developer community very excited about is sort of automated prompts optimization, which is kind of evals in the loop with prompts. What's the thinking there? Where's things going?
D
Yeah, so we have automated prompt optimization, but again, I think this is an area that we definitely want to invest more in. I think we did a pretty big launch of this when we launched GPT-5, actually, because we saw that it was pretty difficult as new models come out to learn all the quirks of a new model. We have a big prompting guide for every model that we launch, and I think we want to build out a system to make that a lot easier. We definitely want to tie that in completely with evals. We should be able to improve your prompts over time, and improve your agents over time as well if they're made in the Agent Builder, based on the evals that you've set up. And so I think we see this as a pretty core part of the platform: basically suggested improvements to the things that you're building.
A
I actually think it's a really cool time right now in prompt optimization. I'm sure you guys are seeing this too. Not only are there a lot of products kind of gearing around this, which is kind of what we're thinking about, but I also think there's a lot of interesting research around this. Like GEPA, the Databricks folks are actually doing really cool stuff around this. We're obviously not doing any of the cool GEPA optimization right now in our product, but would love to do that soon. And also it's just an active research area. So, you know, whatever Matei and the Databricks folks might think about next, what we might think about internally as well, whatever new prompt optimization techniques come out, I think we'd love to be able to have that in our product as well. And it's interesting because it's coming at a time when people are realizing that prompting, you know, I feel like two years ago people were like, oh, at some point prompting is going to be dead.
C
No.
A
And it's like it's gone up, and if anything it has become more and more entrenched. I think there's this interesting trend where it's becoming more and more important, and then there's also interesting, cool work being done to further entrench prompt optimization. And so that's why I just think it's a very fascinating area to follow right now. And it was also an area where I think a lot of us were wrong two years ago, because if anything it's only gotten more important.
C
Yeah, I would say you used to work at OpenAI. Now as an MSL, we call this kind of like zero gradient fine tuning or zero gradient updating because you're just tweaking the prompt. But it is so much prompt that it's actually like you end up with a different model at the end of it.
A
There's a lot of things that make it more practical too. Just even from our perspective, we have a fine tuning API and it is extremely difficult for us to run and serve all of these different snapshots. LoRA's great. MSL just, sorry, Thinking Machines just published. John Schulman just had a cool blog post about this. But man, it is pretty difficult for us to manage all of these different snapshots. And so if there is a way to hill climb and do this zero gradient optimization via prompts, yeah, I'm all for it, and I think developers should be all for it, because you get all these gains without having to do any of the fancy fine tuning work.
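To make "hill climb via prompts" concrete, here is a rough sketch of a greedy loop that keeps a prompt edit only when an eval score improves. It is not GEPA and not OpenAI's prompt optimizer; `hill_climb_prompt`, `propose`, `toy_score`, and the candidate edits are hypothetical, and a real setup would score candidates by running an eval set against a model.

```python
from typing import Callable, List, Tuple


def hill_climb_prompt(
    seed: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Greedy search: accept a candidate prompt only if its score is higher."""
    best_prompt, best_score = seed, score(seed)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best_prompt):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no candidate beat the current best, so stop early
    return best_prompt, best_score


# Toy proposal and scoring functions, for demonstration only.
EDITS = ["Answer in one sentence.", "Cite the session title."]


def propose(prompt: str) -> List[str]:
    # Suggest appending each instruction that is not already in the prompt.
    return [f"{prompt} {edit}" for edit in EDITS if edit not in prompt]


def toy_score(prompt: str) -> float:
    # Stand-in for an eval: fraction of desired instructions present.
    return sum(edit in prompt for edit in EDITS) / len(EDITS)


best, best_score = hill_climb_prompt("You are a Dev Day assistant.", propose, toy_score)
```

No model weights change at any point, which is the sense of "zero gradient" in this exchange: only the prompt text is searched, so there are no extra snapshots to serve.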
C
Since you are part of the API, you lead the API team and since you mentioned thinky, I got to throw a cheeky one in there. What do you think about the Tinker API?
A
So yeah, it's a good one. It's actually funny, when it launched I actually DM'd John Schulman and I was like, really? Wow, you finally launched it. So the.
C
Because you used to work with him.
A
Yeah, yeah. So we. It's actually funny. So at. Yeah, so right when I joined OpenAI, this has actually been, I think a passion project of John's. Like he's been talking about doing something in this, like in this shape for a while, which is a truly low level research fine tuning library. And so we actually talked about it quite a bit when he was at OpenAI as well. It's actually funny I talked to one of my friends who said that when he was at Anthropic, he also worked on this idea for a bit and.
C
I think now he's a man on a mission.
A
Yeah, I mean John's so great in this regard. He's so purely just interested in the impact of this because, one, it's a really cool problem, and then two, it also empowers builders and researchers. You saw all the researchers who expressed all this love for Tinker because it is a really great product. And so I'm just really happy to see that they shipped it. And I think he was really happy to get it out there in the world as well.
C
Yeah, this is probably, this is very much a digression but like it's weird as someone passionate about API design that it took this long to find a good fine tuning API abstraction which is effectively all he wanted. He was like, that guy's like, I don't want to worry about all the infra I'm a researcher. I just want these four functions and it's kind of interesting.
A
Yeah, yeah, cool.
B
Before the OpenAI comms team barges in the room.
A
I know.
B
So what feedback do you want from people like the Agent Builder, for example? The thing I was surprised by was the if else blocks not being natural language and using the common expression language. I'm sure that's something already on your roadmap. What are other things where you're kind of like at a fork that you would love more input on?
D
I think like one of the things that we spent a lot of time discussing was like whether we want kind of more of like the deterministic workflows or more LLM driven workflows. And so I think getting feedback on that, honestly, having people model existing workflow. A lot of what we did was kind of work with our team on, especially with engineers who are working with customers, modeling the workflows that already exist in the Agent builder and what gaps exist, what types of nodes are really common and how can we add those in. I think that would be the most helpful feedback to get back. And then as we expand kind of from just chat based, right now the initial deployment for Agent builder is through ChatKit. We plan on releasing more standalone workflow runs as well and kind of the types of tasks that people would like to use in that type of API.
C
So more modalities for example?
D
Yeah, I mean I think for sure, more modalities. I think voice is already something that a lot of people have talked to us about, even today at Dev Day. So modalities for sure. But also more like the logical nodes for what can't be expressed today.
C
Yeah, well, you're building a language, right? You have Common Expression Language, which I'd never heard of prior to this. I thought, is this Python? Is this JavaScript? And then there was like a whole link in there. Was that a big decision for you guys?
D
I think that was more just kind of like a way that we thought we could kind of represent a mix of the variables and I don't know, like conditional statements.
C
The other thing I'll also mention is that there's a trope in developer tooling where anything that can store state will eventually be used as a database, including DNS. So be prepared for your state store to become a database. I don't know if there's any limits on that, because people will be using it.
A
It's actually funny, I'd heard this quote before and there's definitely some truth to it. I don't know if our stateful APIs have become a database quite yet, but who knows.
D
I mean conversations.
C
Well you charge for it. You charge for a system storage. Yeah, the storage. Right. So there's some limit on that. But like.
A
Yeah, but it's very cheap. It's like I remember we priced it.
D
Like I think if you wanted to kind of like dump all your data somewhere. I don't know, this is like the most like transforming it all into this shape.
C
It's like the useful, it's easy.
D
That's place about it.
A
But yeah, also please don't do this, because I think it'll put quite a bit of strain on Ventod and our infra team and what we try and do. So. Yeah.
B
How do you think about the MCP side? So you have OpenAI first party connectors, you have, you have third party preferred I guess servers you will call them and then you have open ended ones. Do you see that part of registry like functionality expanding or do you see most of it being user driven? Auth is like the biggest thing. Like if you add Gmail and calendar and drive you have to like auth each of them separately. There's not like a canonical. What's the thinking there?
D
Yeah, I think definitely for the registry, that's why we want to make it a lot easier for companies to manage what their developers have access to and to manage the configurations around it. And in terms of first party versus third party, we want to support both of those. We have some direct integrations, and then I know anyone can create MCP servers. I think we want to make it a lot easier to establish private links for companies to use those internally. So I'm just really excited about that ecosystem growing.
A
Yeah, I think one of the coolest things to observe too is that we as an industry are still trying to figure out the ideal shape of connectors. So part of why I think the 1P connectors exist is that we end up storing quite a bit of state. It's a lot of work for us. But by having a lot of state on our side, and we call them sync connectors, we can actually end up doing a lot more creative stuff on our side when you're chatting with ChatGPT and using these connectors, to boost the quality of how you're using it. Right. Like if you have all the data there, you can do all this re-ranking, we can put it in a vector store if you want, and put it anywhere else. And so there's some inherent trade-offs here, where you put in a lot of work to get these 1P connectors working, but because you have the data, you can do a lot more and get higher quality. But then the question is, oh my God, there's such a long tail of other things, which is where MCP and the third party connectors come in. But then you have the trade-off that you're beholden to the API shape of the MCP creator. It might work well, it might not work well with the models. And then what happens if it doesn't work well? Then you're kind of at the mercy of this. And MCP, by the way, is really great because it already does some layer of standardization. But my sense is there's still going to be more evolving here, and I think we want to support both of them because we see value in both right now, especially working with developers. We want to have all options on the table here. But it will be interesting to see how this evolves over time.
B
Yeah, when I saw, about three, four months ago, that you launched the form for, like, Sign in with ChatGPT interest, I think to me that's kind of like the vision, where I log in and I have the MCP tied in, and then I sign in with ChatGPT somewhere and I can run these workflows in that app where I'm logging in. So yeah, I think Sam, you know, said in an interview that ChatGPT is like your personal assistant. So I think this is like a great step in that direction.
A
Yeah, I think there's a lot more to go in that direction.
C
So far, no plan on like ChatGPT or OpenAI as an IdP, right? Which is a different role in the auth ecosystem.
A
Yeah, it's interesting, because the direct answer is like, no plans right now, of course. But I actually think we currently have some version of this, which is our partnership with Apple, because with Apple you can actually sign in to your ChatGPT account and some of that identity does carry with you into your iOS experience with Siri. Right. I don't know if you've actually used the Siri integration, I actually use it quite a bit. But if you sign into your ChatGPT account, the Siri integration will actually use your subscription status to decide what type of model to use when it passes things over to ChatGPT. And so if you're, you know, just a free user, you get, you know, the free model. But if you're a Plus or Pro subscriber, you get routed to GPT-5, which is I think what they.
D
I think we also recently announced the partnership with Kakao.
A
Oh yeah, Kakao is another one.
D
Yeah, where I think it's a similar thing, where you can sign in with ChatGPT. Kakao is one of the largest, like, messenger apps in Korea, and you can kind of interact with Kakao directly there.
A
Yeah. I mean Sam's been talking about it for a while. It's a very compelling vision. We obviously want to be very thoughtful with how we do it.
C
You know, now you have a social network, you have a developer platform. Very, very valuable. Yeah, exactly. Okay, so. And then on the other side of auth, something I was really interested to look at and I couldn't get a straight answer on: is there some form of bring your own key for Agent Kit? Like when I expose it to the wider world, obviously by default I'm paying for all the inference, but it'd be nice for that to have a limit, and then if you want more, you can bring your own key.
D
Yeah, I mean we don't have something like that yet, but I think. Yeah, it's definitely an interesting area too.
A
Yeah, it doesn't do it out of the box today, but developers have been asking about it for forever. It's a really cool concept because then as a developer, especially indie developer, you don't need to bear the burden of infrastructure. Yeah.
C
I think when you get into the business of agent builders that are publicly exposed where you have an allow list of domains, it rhymes with this exact pattern of someone has to bear the cost. Sometimes you want to mess around with the different levels of responsibility.
A
Yeah. I will say in general, if you look at our roadmap, we engage a lot with developers to hear what the pain points are, and we try and build things that address them. And ideally we're prioritizing in a way that's helpful. But yeah, we've definitely heard from a good number of developers about the cost, or all of the copy-paste-your-key solutions right now, which are huge security hazards, because developers don't want to bear the burden of inference. Hopefully we make the cost cheaper, the.
D
Models keep getting cheaper.
A
Yeah, hopefully that helps. But what we realize is as we make it cheaper, the demand goes up even more and you end up still spending quite a bit. But yeah, so we've definitely heard this from a lot of developers and it's definitely something top of mind.
C
Mind, yeah.
B
Do you see this as mostly like an internal tools platform though? Like to me like you've been doing a big push on like the more forward deployed engineering things. It's almost like, hey, we needed to build this for ourselves as we sell into these enterprises, might as well open it up to everybody. What drives building these tools? Like you think of people building tools to then expose or mostly on the internal side.
D
Yeah, I mean, and so I think, again, our first deployment is ChatKit, which is intended to be for external users. But I think one of the things that we also did see a lot as we were working with customers is that a lot of companies have actually built some version of an agent builder internally, to kind of manage prompts internally, to manage templates that they're sharing across, you know, the different developers that they have, maybe the different product areas. And we're seeing that kind of over and over again as well, and really wanted to build a platform so that this is not, you know, an area that every company needs to invest in and rebuild from scratch, but that they can kind of have a place where they can manage these templates, manage these prompts, and really focus on the parts of agent building that are more unique to their business.
A
It is interesting too, like from a deployment perspective it is like it has spanned both internal and external use cases, right? Like kind of like these internal platforms, people use it for like data processing or something which is an internal use case. But if you saw some of the demos today, like there have been a huge number of companies that are trying to do this for external facing use cases as well.
C
Customer service is one template.
A
Customer service, the, like, Ramp use case.
D
We use this internally and externally. Like our customer support, help.openai.com, is already powered by Agent Kit, and then various other internal use cases as well.
A
And one of the things that I actually think the team has done a really great job of. So like Tyler, David and Jiwon on the team, they built, especially, the ChatKit components. They built them to be very consumer grade and very polished. Like you kind of look at it, there's like a whole grid of the different widgets and things that you could create there. Ideally people see it and they see these very polished, consumer-grade, ready, external-facing things, versus, you know, you think of internal tools and the UI is always the last thing that people care about. But, you know, we really pushed the team, and I think they did a really great job of making the ChatKit experience really, really consumer grade. And it should feel almost like ChatGPT, with really buttery smooth animations and really responsive designs and all of that.
D
Yeah. I think your point on widgets definitely really resonates. Right. Because ChatKit, it handles chat UX, but we're also just building really visual ways for you to represent every action that you want to take. And that is definitely very highly polished.
A
Yeah. And when working with customers, like, those have been the most helpful customers for us to work with, because, you know, when Ramp is thinking about what they want to publicly present to people, they have a pretty high bar, as they should, as well as, you know, all the other customers that have been iterating on it. And so that kind of feedback from our customers has really helped us up-level the general product quality of the launch that we had today as well.
C
Yeah. Would you ever, would you open source ChatKit?
A
Talked about it. We talked about it. There are a bunch of trade offs.
D
I think, so ChatKit itself is like an embeddable iframe, and so I think the actual iframe. Yeah. And so that helps us keep it evergreen. Right. So if you are using ChatKit and we come up with, I don't know, a new model that reasons in a different way, or kind of new modalities, you don't actually need to rebuild and pull in new components to use it in the front end. I think there are parts, widgets for example, that are much more like a language, and that's definitely something that is easier to explore that for, as well as kind of the design system that we've built for ChatKit. But I think as for the actual iframe itself, I think there's a lot of value in that being, yeah, a more evergreen experience that is pretty opinionated.
C
Like there'd be no point being open source. You want the.
D
Then you don't get the benefits of.
C
It, you know, being Stripe alums, like Stripe Checkout. Like it's auto-optimized for you to like.
A
So I'm not a Stripe alum, but Christina is.
D
And the team actually is the team that built. Yeah. So it's very similar philosophically. Right. So Stripe can build Elements and Checkout, and not every business needs to rebuild the pieces that are really common. And I think we see the same with chat. We see chat being built over and over again, especially as we kind of come up with new modalities like reasoning and everything. It's not really something that is easy to keep up to date. And so we should just do that and leave kind of the hard parts of building agents, again, to the developers.
B
Does it feel, I mean, I know WordPress has like a bad connotation in a lot of circles, but to me it almost feels like the WordPress equivalent of chat. It's like, hey, this is like a drop-in thing, and then you have all these different widgets. Do you see the widget becoming a big kind of developer ecosystem where people share a widget? Is that kind of a first-party thing, and then what's like the MCP versus widget forest? No, exactly. I mean, it seems great for people that are in between being technical and not really being technical enough.
D
Yeah, yeah, yeah. I mean, I think that's a big part of building widgets. Right. Like it's already kind of in a language that is very consumer friendly. In our widget builder you can already kind of use AI to create those widgets, and they look pretty good. I don't know if you guys have gotten a chance to try that out yet, but definitely see kind of, I don't know, a 4X.
A
If you haven't tried out the Widget studio and the demo, like apps as well.
C
Yeah, you got a custom domain, like Widget Studio, which is cool.
A
I actually don't know how we got.
D
That, but yeah, everything's in Chat Studio and then we have like the playground there so you can try out what chat would look like with all the customizations. We have ChatKit World, which is a fun site we built. I was like spinning for a while this morning. It was.
C
Kasia also like uploaded some of her solar system stuff and all the demos as well.
B
Yeah.
D
And then that's where like the widget builder.
C
Yeah. So it's really come together like or it's taken almost more than a year to come together and build all this stuff, but it's coming together. It's really interesting.
A
Yeah, it's something that we definitely planned.
C
All of this upfront.
A
Oh yeah, yeah. We have the master plan from three years ago. No, but I think especially on this stuff, there was an arc of a general platform that we did want to build around, and it takes a while to build these things. Obviously Codex helps speed it up quite a bit now. But yeah, I will say it does seem great to kind of start to have all the pieces fitting together. Yeah. I mean, you saw we launched evals, and we've had the fine-tuning API for a while, and we laid all the groundwork for some of this stuff over the last year, and we're hoping that we can eventually, you know, make it into this full-featured platform that's helpful for people.
C
I think you have, since you have, since you did the Codex mention, maybe a quick tip from each of you on Codex power user tools or tips.
A
So there's actually a funny one that one of the new grads has, I think, taught our team in general. And I think this is a point for just how new grads and younger-generation people are actually more AI native. So one of them is to really lean into, push yourself to trust the model to do more and more. So I feel like the way that I was using Codex, and for me it's usually for my personal projects, they don't let me touch the code anymore, is you give it small tasks. So you're not really trusting it. Like I view it as this intern that I really don't trust. But, so we had an intern class this year, what a lot of the interns would do is just full YOLO mode. Like trust it to write the whole feature. And it doesn't work, or worse, it doesn't work sometimes. But, I don't know, 30, 40% of the time it just one-shots it. I actually haven't tried this with GPT-5-Codex. I bet it probably one-shots it even more. But one tip, where I feel like I'm starting to undo this and relearn things, is to really lean into the AGI component of it and just really let the model rip and kind of trust it, because a lot of times it can actually do stuff that surprises me, and then I have to readjust my priors. Whereas before I feel like I was in this safe space of, I'm just giving this thing a tiny bit of rope.
C
Yeah.
A
And, and because of that I was kind of limiting myself with how effective I could be.
C
Like sure. But okay. But also, is there an etiquette around submitting, effectively, vibe-coded PRs that someone else now has to review, right? And it's like, it can be a whole.
A
Codex does reviews now. It actually reviews itself.
C
Does Codex approve its own PRs a lot more than humans?
A
It doesn't get to approve them.
D
But I was going to say, I think the Codex PR reviews are actually one of the things that my team very much relies on. I think they're very, very high quality reviews. On the Codex PR side, for the visual agent builder, we only started that probably less than two months ago, and that wouldn't be possible without Codex. I think there's definitely a lot of use of Codex internally, and it keeps getting better and better. And so, yeah, I think people are just finding they can rely on it more and more. And it's not totally vibe coded. It's still checked and edited, but definitely as a kicking-off point. And I think I've heard of people on my team, on their way to work, they're kicking off like five Codex tasks, because the bus takes 30 minutes, right? And you get to the office and it kind of helps you orient yourself for the day. You're like, okay, now I know the files, I have the rough sense. Maybe I don't even take that PR and I actually just still code it. But it helps you just context switch so much faster too, and be able to orient yourself in a code base.
A
There's so many meetings nowadays where I have like one on ones with engineers and I walk into the room, they're like, wait, wait, wait, give me a second, I gotta kick off my Codex thing.
C
I'm like, oh, sorry, we're about to enter async zone.
D
Almost like your notes, right? You're like, let me.
A
And then they're like, okay, now we can start our one-on-one, because now it's great.
C
Yeah, cool. We're almost out of time. I wanted to leave a little bit of time for you to shout out the Service Health Dashboard because I know you're passionate about it.
A
Oh yeah.
C
Well, tell people what it is and why it matters.
A
Yeah. So this is a launch that, you know, didn't get any stage time today, but it was actually something I'm really excited about. So we launched this thing called the Service Health Dashboard. You can now go into your usage or, like, your account settings and kind of see the health of your integration with our OpenAI API. And so this is scoped to your own org. So basically, if you have an integration that's running with us doing a bunch of tokens per minute or a bunch of queries, it's now tracking each of those responses, looking at your token velocity, the TPM that you're getting, the throughput, as well as the response codes. And so you can see kind of a real-time personal SLO for your integration. The reason why I care a lot about this is obviously over the last year, we've spent a lot of time thinking about reliability. We had that really bad apple outage last December, you know, longest like three, four hours of my life. And then had to, you know, talk to a bunch of customers. We haven't had one that bad since, you know, knock on wood. We've done a bunch of work. We have an infra team led by Venkat, and they've been working with Jana on our team, and they've just been doing so much good work to get reliability better. And so we actually, again, knock on wood, we think we've got reliability in a spot where we're comfortable kind of putting this out there and letting people actually see their SLO. And hopefully it's three, four, soon to be five nines. But the reason why I cared a lot about it is because we spent so much time on it and we feel confident enough to have it behind a product now.
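As a rough illustration of the two signals mentioned here, response codes and tokens per minute, the sketch below computes both from a list of logged responses. It is not how the Service Health Dashboard works internally; the `ResponseRecord` type, its field names, and the choice to count only 5xx responses as failures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ResponseRecord:
    # Hypothetical log record for one API response; not an OpenAI type.
    timestamp_s: float  # when the response arrived, in seconds
    status_code: int    # HTTP status code of the response
    tokens: int         # tokens returned by this response


def success_rate(records):
    """Fraction of responses that were not server errors (assumed: 5xx = failure)."""
    if not records:
        return 1.0
    ok = sum(1 for r in records if r.status_code < 500)
    return ok / len(records)


def tokens_per_minute(records):
    """Average token throughput over the time window spanned by the records."""
    if not records:
        return 0.0
    times = [r.timestamp_s for r in records]
    span_s = max(times) - min(times)
    # Treat any window shorter than one minute as one minute.
    window_min = max(span_s, 60.0) / 60.0
    return sum(r.tokens for r in records) / window_min
```

A client-side log like this only sees one integration's traffic, which matches the "scoped to your own org" framing, but whether rate-limit responses such as 429 should count against the SLO is a policy choice this sketch leaves out.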
C
Five nines is like two minutes of outage or something.
A
Yeah, yeah, we're working to get to five nines. Yeah.
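For reference, the arithmetic behind "nines" can be sketched as follows. The function name is made up for the example, and it assumes a 365-day year with downtime spread over the whole period. Under those assumptions five nines allows roughly 5.26 minutes of downtime per year, so the "two minutes" figure above is the right order of magnitude.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget for an availability of 1 - 10**(-nines) over the period."""
    unavailable_fraction = 10 ** (-nines)
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 3.5, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Each extra nine cuts the budget by a factor of ten, from about 525.6 minutes a year at three nines to about 52.56 at four and about 5.26 at five.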
C
What does an extra nine take?
A
It's exponentially more work. So, you know, in the last couple of years, we were talking about hitting three nines, and hitting three and a half nines, and then hitting four nines. But yeah, it's exponentially more work. I could go for a while on the different topics, but we'll have to.
C
Do that in a follow up. I mean, that's the engineering side, right?
A
Yes, yes.
C
Like you're serving 6 billion tokens per minute.
A
We actually zoomed past that. Yeah, that's outdated. Yeah. But, yeah, it's been crazy. The growth.
B
Awesome. I know we're out of time. It's been a long day for both of you, so we'll let you go. But thank you both for joining us.
A
Yeah.
D
Yeah.
A
Thanks for having us.
D
Thanks.
C
Thank you.
B
That's it.
C
How was that?
A
That was great.
C
Okay.
A
We have the mics offer. The thing I. I didn't want to say on the podcast was on the. On the tinker thing. So we, we actually buil.
Episode: DevDay 2025: Apps SDK, Agent Kit, MCP, Codex and why Prompting is More Important than Ever
Date: October 7, 2025
Host: Latent.Space (Alessio and Swyx)
Guests: Sherwin & Christina, OpenAI Platform Team
This episode dives deep into OpenAI’s third annual DevDay, focusing on the major launches and trends shaping AI engineering in 2025. Guests Sherwin and Christina from the OpenAI Platform Team discuss the debut of the Apps SDK, Agent Kit, integration of the MCP protocol, improvements in code generation and agent building, and the continuously evolving role of prompt engineering. The conversation is rich with practical insights, memorable moments from DevDay, product philosophy, and candid perspectives from the people designing these core tools for millions of developers.
Timestamps:
00:05 – 02:53: Growth of DevDay and Developer Community; OpenAI’s Mission Extended; Product & Event Evolution
02:59 – 08:54: Apps SDK as Extension of API Philosophy; MCP Protocol Integration; UI and Experience Inversion
08:54 – 15:06: What is Agent Kit?; Demo Recap and Capabilities; Templates and Playbooks
12:22 – 15:06: Dual Usage; Long-Term Vision
15:06 – 17:25
17:25 – 21:16: Agent Evals and Trace-Based Scoring; Prompt Optimization and Rubrics
39:12 – 42:01: Letting Codex Do More; Memorable Quote; Codex Self-reviews
42:23 – 44:03: What It Is; Context
This episode paints a rich, dynamic picture of where AI engineering sits as of late 2025.
A must-listen for anyone building with (or competing against) OpenAI in 2025.