
A
Visual Studio Code has become one of the most influential tools in modern software development. The open-source code editor has evolved into a platform used by millions of developers around the world, and it has reshaped expectations for what a modern development environment can be through its intuitive UX, rich extension marketplace, and deep integration with today's tooling landscape. Now, in an era defined by rapid advances in AI-assisted programming, VS Code is at the center of a profound shift in how software is written. Kai Maetzel is the Engineering Manager leading the VS Code team at Microsoft. He joins the show with Kevin Ball to talk about the origins of VS Code, how AI has reshaped the editor's design philosophy, the rise of agentic programming models, and what the future of development might look like. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc.
B
Kai, welcome to the show.
C
Hi Kevin. Thanks for having me.
B
Yeah, I'm excited for this conversation. So let's maybe start a little bit with you, your background, and your journey to leading the VS Code team.
C
Oh, so actually it started very, very early on. My first internship was already with dev tools, and I never really left dev tools. And then, now 10 years ago, I joined Microsoft explicitly for the VS Code effort. There was the promise that this was something that could get traction in the market. That's the moment I joined, and then we pretty much went from no users to a whole lot of them: 44 million by now.
B
Yeah, I remember when VS Code first emerged and I was like, "Another IDE?" And then it kind of took over the market.
C
Yeah, that's true. I mean, when you think about this, it was a very well-established market, right? There have been editors forever, IDEs forever. But all of us somehow lived in this in-between world where we were not super happy yet. It was like, yeah, I can do this here super, super fast, and I can do that there in a good way, but I have to wait until it starts up, and it has too much stuff in my face, and so on. It was really about finding the sweet spot in the middle. And that's actually also how we talked about it: there are two ends of the spectrum, editors on the left-hand side and full-fledged IDEs on the right-hand side. Where is the sweet spot in between? That's really what we tried to find, and I think we really hit the bullseye.
B
Yeah, absolutely. And I feel like you were winning for a ways. And now we're in this kind of moment in the tech industry where what it means to write code feels like it's shifting very rapidly. So I'd love to dig in with you about the ways in which you are thinking about this, I think initially with bringing Copilot into VS Code. What has the VS Code journey been to this new agentic coding world we find ourselves in?
C
Yeah. So when you think through this one, we easily forget what we knew and what we didn't know. Just go six months back and compare what our understanding was of how coding should be with what it is today. I just had a conversation with someone, and this person said, "Oh, 60 days ago." I was like, what? I thought that was in March or so. We have November right now. So we're working on very compressed timelines, and lots of things are happening. I just want to keep this in mind when we talk about all of this. At the very beginning, we on the VS Code team started working with the GitHub Next team. GitHub Next is pretty much the internal research area, and GitHub Next had good relationships with OpenAI. This is pretty much where the AI-powered IntelliSense suggestions came from. There had already been attempts in other areas before; this was not new per se, and usually it was integrated with code assist and so on. Those suggestions got little stars, for example, saying these are AI suggestions and the other ones are coming from the language servers, these kinds of things. So that was the first part. And then we were like, no, no, that needs to change. We need a different UI for this. This is where we really pushed hard into ghost text and these kinds of things. In the beginning we already had multi-line completions, but then we realized no one was using them, because now you have to review code rather than stay in the flow. So we pretty much walked backwards and said, oh, smaller completions. That's pretty much the whole journey of completions. Then ChatGPT came along.
So we started off the year right after ChatGPT launched with a hackathon within the VS Code team, saying, you know, we have four days here and we just go and build what we think we can actually build with these new models, with this new kind of approach. And that was super interesting, because it pretty much immediately made clear that you cannot really just put this on from the side. It needs to be really part of the tool itself. AI really infuses every single aspect. Think about the command palette in VS Code. You type in there, and the very moment you have good AI, you think it should be smart enough to figure out what you actually mean rather than what you type. But then you have to find the right spot between "no, no, I actually meant what I typed" and "no, don't guess wildly." And then there were of course performance considerations. Everything in VS Code is about performance, and all of a sudden the AI answers were not that fast, and so on. But it was super clear that this needs to be a core part of the experience. Then there were, for us at least, very interesting conversations, because at the time GitHub Copilot was already an established brand. VS Code didn't have a sign-in that you needed; you just fire it up. But in order to use GitHub functionality, you needed to sign in. GitHub already had billing and all these pieces in place, and there was the established brand. So how do we find the balance between what comes in through an extension and what is in the core? That journey, that duality, really took us a while to get right. Then there was a lot with chat, and a lot with what models you actually have available. Not just what models, but what capabilities those models have, how much context window you actually get, and so on.
And I think there is a difference between sitting in a startup and thinking about those problems and coming from a world that is already profitable. Microsoft thinks about this in very different ways. For example, it really took us a while to convince others: no, we need larger context windows. You cannot work with a 4K context window very efficiently. So there were these kinds of challenges. In the beginning there was an extremely steep learning curve, I think, for the organization as a whole. And I think over time we kind of figured this out. We're still learning, so I'm not done here, but I think we figured this out. And then, when I just think about the last year: we came out with edits, we came out with what we call NES, or the tab, tab, tab model. You have the agentic loop. We integrate the cloud agent that GitHub has, the Copilot coding agent. There is now the Copilot CLI. We integrate all of those now into the VS Code interface, and we use the agent sessions view in order to do that. We're now actively working on improving the agent sessions view, because it's still somewhat rough in usage. So there are a lot of these kinds of things that happened, and at the same time the competitive landscape has shifted and the capabilities of the models have shifted. It's now not only about capabilities; it's about how long a model can run, how fast it responds, time to first token, and what model mix you are actually using at the right time, all while people are still learning how to use it. So it's an extremely dynamic area.
B
It absolutely is, yeah. So there's an area I'd love to dig into with you a little bit more. You mentioned how even from the beginning, as you started to look at more advanced, AI-enabled tab completions, not just language-server ones, you had to find this balance between how much you were showing and how much you were allowing the developer to course correct. I think one of the beautiful things about these AI models is you can do this sort of intent-based UI where you try to guess what the user is doing and lead them there faster, but you can get it wrong. So I'm curious, as you've layered on each of those pieces, as you've gone down chat-oriented programming and now agentic programming and all these things, how do you think about that balancing act of, well, we can do a lot for you, but how do we make sure we're doing the right things?
C
So I would actually start by saying that this is an unsolved problem. That may seem surprising, because we have been doing this for a while. But look, for example, at tab completion, so NES, next edit suggestions. There it's always between: what do you show to the user, how often do you show something to the user, how often does the user actually accept what is being shown to them, and how often do they explicitly dismiss it? That is the space in which you operate. As we discussed before with finding the right thing between an editor and an IDE, you try to find this spot. And there are extremes you can go to. For example, you show everything that the model proposes to the user immediately. Of course, the absolute number of acceptances goes up and up, but you also annoy the user more and more at the same time. So you really have to find how often you show something, how many opportunities there are that you actually show to the user, and how many of those the user does not explicitly dismiss but accepts. You really have to find this balance of explicit acceptance, explicit dismissal, and so on. And that is an ongoing kind of fine calibration, because people also learn; that is the next part. A person who uses NES for the very first time has different expectations from a person who is much more experienced. It also has to do with, for example, the typing speed that a user has. How long do you wait until you show something? For a slow typist, it might be way more annoying if the model is very fast, while a fast typist is annoyed that they don't have the proposal yet. They pretty much want to type without stopping and just hit the Tab key in order to accept, because they anticipate what the model will bring.
So it's really an ongoing effort. We have quite elaborate dashboards with metrics on this, where we go back and forth and adjust those pieces, and then run a 5% flight and see whether that actually changes how people interact with it. If you show a little bit more, does it have a positive outcome or a negative outcome? How often do people hit the Escape key? And it's really interesting: for example, the Escape key hit rates are not that high. They were around 3% the last time I checked. But when you ask someone, they say, "I hit Escape all the time." And then you look at the data, and no, actually, you don't. So it's just really, really interesting, because in the end you have to get a happy developer. And happiness is a combination of how productive you feel, how well you thought you could go through your thought processes, how focused you could be, how little annoyed you were, all of these kinds of things. And that is an ongoing process to really get this right.
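As a rough illustration of the calibration described above, the sketch below computes show, acceptance, and dismissal rates for a control group and a flight group and compares them. The class, field names, and all counts are hypothetical and invented for illustration; nothing here reflects the team's actual telemetry or dashboards.

```python
from dataclasses import dataclass


@dataclass
class SuggestionStats:
    """Hypothetical counters for one cohort of inline-suggestion telemetry."""
    opportunities: int  # moments where the model produced a candidate
    shown: int          # candidates actually rendered to the user
    accepted: int       # explicit accepts (for example, Tab)
    dismissed: int      # explicit dismissals (for example, Escape)

    def show_rate(self) -> float:
        return self.shown / self.opportunities if self.opportunities else 0.0

    def acceptance_rate(self) -> float:
        return self.accepted / self.shown if self.shown else 0.0

    def dismissal_rate(self) -> float:
        return self.dismissed / self.shown if self.shown else 0.0


def compare(control: SuggestionStats, flight: SuggestionStats) -> dict:
    """Return the difference (flight minus control) for each rate."""
    return {
        "show_rate": flight.show_rate() - control.show_rate(),
        "acceptance_rate": flight.acceptance_rate() - control.acceptance_rate(),
        "dismissal_rate": flight.dismissal_rate() - control.dismissal_rate(),
    }


# Made-up numbers: the flight shows more suggestions than the control.
control = SuggestionStats(opportunities=1000, shown=400, accepted=120, dismissed=12)
flight = SuggestionStats(opportunities=1000, shown=500, accepted=140, dismissed=20)
delta = compare(control, flight)
```

With these invented counts, the flight collects more accepted suggestions in absolute terms while its acceptance rate falls and its dismissal rate rises, which is the tension described in the answer: showing more raises the raw number of accepts but can also annoy the user more.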
B
Absolutely. I will say, as a longtime Vim user, no amount of Escape in VS Code ever feels like a lot of Escape. I'm curious: you talked about how this can vary across, for example, different typing speeds or experience levels. When you're tuning these knobs, are they global knobs applied to everyone? Do you have some sort of adaptive system in there such that, for example, if I'm a faster typist, I get more rapid completions? How does that end up working?
C
So it's mostly global right now. But we are, for example, working on how typing speed is actually encoded in the input that the model gets, so that it can take that into consideration.
B
Got it. So it would still be a global model, but this would now become an input, with features developed based on it. Interesting. Okay. And looking at that feature space, you mentioned typing speed. Is, like, how new a user is, or some sort of representation of that, also encoded in some way?
C
We don't have a good way to really encode that as an experience level. You could look at the workspace, and I think the workspace is a good indication, but it doesn't necessarily tell you whether the user is new to that particular workspace or not. So it's quite complicated to build a good profile of a user, because each of us goes through those different stages depending on what repo we are looking at. If I open a Rust repo, I might be more intimidated than when I'm looking at a TypeScript repo. These kinds of things should also, in a perfect world, play into what we're doing. They don't right now, but we're thinking about them.
B
Then looking at some of the other interaction modes beyond next edit suggestions, as you start looking at chat-oriented development or even these increasingly agent-loop types of development, what are the interactivity trade-offs that you're exploring there?
C
I mean, let me put it this way. The original chat interfaces across all of the different tools, ours included, were interesting. You looked at those and said, oh, this is really amazing, what it can do. And at the same time: oh my God, this is bad. This sucks so much.
B
I feel like this is my experience with all of AI.
C
Right. What I mean by this is that we spent years optimizing to go from, let's say, three seconds for a particular interaction to two seconds. We do all of this in order to keep you in a flow state, and then we put you into a chat, and now the answer takes 20 seconds in a good case. It might take longer, a couple of minutes in some other cases, and so on. And you only tolerate this because you still think that the outcome at the end is quicker and better than what you could have done yourself. So you torture yourself a little bit in order to get the better outcome. That is pretty much the baseline where we started with these kinds of chat interactions. Since then, I think the world has changed a bit. First of all, we have much faster models. People have also developed different styles of interacting. For example, one style is that you use a really fast model to do your research. You go interactively and try to figure out what you want to do. This is where you actively research, and then you kind of know what you want to do and you are able to put it into a reasonable prompt that you then delegate and let run in a background agent, for example. That's one. There's another school of thought, or another behavior, that is almost the inverse of this, where people say: no, I run multiple exploratory asynchronous agents. I roughly tell them what I want, and they're responsible for creating a plan for it. That can take long, because they use a large, slow model for this. Once you get your plan, you work a little bit on the plan, but then you pretty much use a dumb but fast implementation model. And you do this because you know that AI doesn't get it right all the way to the very end, so you are actually helping along.
You're perfectly fine going only to 90% and doing the other 10% manually, and there's not such a big difference between that and going to 92% and doing the last 8%. That's why people sometimes accept a model that is not that sophisticated for this. But those are really different work styles. In one, you do all of the synchronous part, the speed and the exploring and thinking, interactively, and then you delegate and then review. Because you had the thought process at the beginning about what to do, the review is kind of easier. In the other, the models think, the agents think, and then I'm helping with the implementation, and that also makes the whole review much, much smaller. So this is quite interesting, and there really is this kind of trade-off, as you're saying, with interactivity versus not. We implemented different custom agents. For example, we have ask mode, where you make no edits. We have one where you define yourself what the scope of the modifications is that can happen; that's called edit mode. Then we have the agent mode. We're working on, or playing around with, something called interactive mode, where the agent becomes exceedingly steerable. When you say, make this change in this file, it will make this change in this file and not go off and fix five other files that now have compile errors, because it's you who steers it. But again, that needs to be super, super fast. We have a planning mode, so we ship a planning agent. So there are all these different kinds of trade-offs, and we're learning at any given point in time. But it is like it always was with developer tools: there are different breeds of developers with different interests and different preferences.
And you've got to give the right tools, the right combination of tools, to each of them so that they can find a place where they're happy.
B
So let's dig in a little bit because one of the things that you talked about, right, for these different modes and how steerable they are, there's obviously model differences, right? Like Sonnet loves to edit all the files. It just likes to talk, whereas some of the other models don't. But there's a lot that you're doing in the agentic harness and how you're defining these agents. Can we maybe dig in a little bit? I think coding tools are probably some of the most advanced agent software pieces we have out there. How do you build it? What's the stack for defining one of these agents, an ask agent or what have you?
C
So, I mean, at the very core, and I'm pretty sure you have heard this answer several times, an agentic loop is not that particularly complicated. You give it a bunch of tools, and you give it instructions on how to use those tools. Most of those instructions are actually in the tool descriptions; sometimes they are outside. Each model has certain kinds of preferences, and there is prompt guidance for each of those. Some, for example, like it when you tell them, oh, give the user an update from time to time. Others actually stop the agent loop when you give such instructions. For example, Codex is one of the models that stops when it wants to give an update to the user. So there are all these kinds of differences, but that is the basics. And then you've got to instruct the agent, and one of the more interesting problems is: when is it done? At which point in time should it consider itself done? So those are the basics of all of this. When you then add a custom agent on top, that's where we say, okay, in a custom agent you can define what tool set is available to that agent, out of all the available tools. And those can come from different sources: there are built-in tools in VS Code, extensions can define tools, and on top of this you can install MCP servers, so you can have quite a large set of tools. In a custom agent file you can say, here are the tools that you should make available. After that you have the normal syntax that we use for everything else, which is a Markdown-inspired syntax, where you say how an agent should actually operate. And many people have seen what Claude skills look like, and so on.
All of the definitions are pretty much comparable to each other in that respect. It's a Markdown file where you give references to other instructions files. They can say what tools to use under what circumstances. For example, you would say something like: hey, because I'm in a workspace where I have defined it all, I would really like you to use the test runner tool rather than running npm run tests or cargo test or something. You say this explicitly. Or sometimes a model kind of assumes that, let's say, your tests are wrong. That was the first moment I thought models were really getting intelligent: they pretty much rewrote tests to assert true. In one way or another it was somewhat obfuscated, but that was the bottom line. It was like, "Oh, all my tests pass now, so it's good." Yeah, they do, but, you know. So sometimes you just say very explicitly: never touch test files; everything is good there. If you do a refactoring, you can adapt them, but asserts are untouchable, for example. So it's pretty straightforward like this. A lot of work then goes into asking: how many instructions do you need for tool usage? How much guidance do you need to give? And that's where this whole machinery comes into play of what evals you run, how many of them you run, and how often you run them. How do you really think about assessing, evaluating an outcome? And that is quite different. For example, like the rest of the industry, we run SWE-bench as one of the benchmarks. And we're not just looking at the resolution rate, because the resolution rate only says you got from A to B, and that can be super, super messy. We look at how fast you get from A to B. How many tools did you actually call?
How many tokens did you use in total in order to get there? Did you call the tools that we think you should use? For example, if you have a terminal tool, you can use that tool and get to the end, and that's perfectly fine if you run as a background agent. But if you run as a foreground agent in VS Code, then the user expects that when you say the tests failed, they can look at the Test Explorer, see the failing test, and click there. So you want it to use that tool. And if you have a watch task, it should not spend a bunch of turns figuring out how to build the project; call the watch task. It's right there. These are the kinds of evaluations we then run and compare. And then there's the fine tuning going on: wording changes going in, how you group tools into different categories. That actually makes a difference. There's a lot of that work that goes in.
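The loop described in this answer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VS Code's implementation: `run_agent`, the action dictionary shape, the tool names, and the scripted stand-in for a model are all hypothetical. It shows the pieces mentioned above: a tool set restricted by an allow-list (as a custom agent would define), an explicit "done" signal, a turn limit, and a tool-call count of the kind an eval might track.

```python
from typing import Callable, Dict, List, Set, Tuple

Tool = Callable[[str], str]
Transcript = List[Tuple[str, str]]


def run_agent(model_step, tools: Dict[str, Tool], allowed: Set[str],
              task: str, max_turns: int = 10) -> dict:
    """Run a minimal agent loop until the model says it is done."""
    # A custom agent restricts the full tool set to an allow-list.
    available = {name: fn for name, fn in tools.items() if name in allowed}
    transcript: Transcript = [("user", task)]
    tool_calls = 0
    for _ in range(max_turns):
        action = model_step(transcript, sorted(available))
        if action["type"] == "done":
            return {"answer": action["text"], "tool_calls": tool_calls}
        name = action["tool"]
        if name not in available:
            # Report the problem back to the model instead of crashing.
            transcript.append(("tool", f"error: tool {name!r} is not available"))
            continue
        result = available[name](action["input"])
        tool_calls += 1
        transcript.append(("tool", f"{name}: {result}"))
    # The turn budget ran out before the model declared itself done.
    return {"answer": None, "tool_calls": tool_calls}


def scripted_model(transcript: Transcript, tool_names: List[str]) -> dict:
    """Stand-in for a real model call: run the tests once, then report."""
    if not any(role == "tool" for role, _ in transcript):
        return {"type": "tool", "tool": "run_tests", "input": ""}
    return {"type": "done", "text": "Result: " + transcript[-1][1]}


# Hypothetical tools; real ones would talk to the editor or a shell.
TOOLS: Dict[str, Tool] = {
    "run_tests": lambda _: "2 passed, 0 failed",
    "terminal": lambda cmd: f"ran {cmd}",
}

result = run_agent(scripted_model, TOOLS, allowed={"run_tests"}, task="Check the tests")
restricted = run_agent(scripted_model, TOOLS, allowed={"terminal"}, task="Check the tests")
```

Returning the tool-call count alongside the answer mirrors the point that an evaluation looks at how the agent got from A to B, not only whether it arrived.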
B
Yeah, it's deceptively complex inside of this very simple wrapper. So there are a couple of pieces I'd like to dig in on. One, as you mentioned, different models tend to have different preferences, I guess we'll call them, in terms of how they invoke tools, how they check in with the user, things like that. In the UI, I have this very simple model switcher. I'm just changing models, and all of that is opaque to me. Are you customizing tool descriptions, the core agent instructions, all these different things by model to help them behave consistently? How is that all functioning?
C
So there is the current state and then there is the near-future state. And, I mean, we are all open source, so you can take the repository and look at this. When you run inside VS Code, we have a log view where you can see every single call that is being made to the model, with every single detail. So you can verify my words here in your own day-to-day experience. In the current state, we have a specific prompt for, I would say, roughly every model family. Sometimes it's more detailed: we go down and say, oh, it's not just a GPT model, and we really separate between GPT-5, GPT-5-Codex, and GPT-5.1-Codex. See, we move so fast as an industry that I say 5 rather than 5.1. So there are different entry points in our main prompt file generation. Then we pick what tools are available for that particular model, what additional instructions need to be given, and so on. We customize the instructions that are outside of the actual tool descriptions. But we don't have it yet in code where we say, no, we know exactly that this is the tool description that works better with this particular model. That's something we discussed several times, and it just never quite made it. But now it's coming. In the last iteration plan we had the same conversation again, and I was saying, oh, we do this. When we ship at the beginning of December, we'll have model-specific tool descriptions, where the prompt file can override and say, oh, if this tool shows up, here is the tool description that you should use.
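One way to picture model-specific tool descriptions is a base table plus per-family overrides, where the most specific matching entry wins. The sketch below is an assumption-laden illustration: the model identifiers, tool names, description texts, and the prefix-matching rule are all hypothetical and are not taken from the VS Code repository.

```python
# Default descriptions used when no override applies (hypothetical text).
BASE_TOOL_DESCRIPTIONS = {
    "run_tests": "Run the project's tests and report failures.",
    "edit_file": "Apply an edit to a file in the workspace.",
}

# Hypothetical overrides keyed by model-id prefix.
MODEL_OVERRIDES = {
    "gpt-5": {
        "edit_file": "Apply an edit to a file using patch format.",
    },
    "gpt-5-codex": {
        "run_tests": "Run the tests. Do not pause to summarize before the run finishes.",
    },
}


def resolve_tool_descriptions(model_id: str) -> dict:
    """Merge base descriptions with overrides for every matching prefix."""
    descriptions = dict(BASE_TOOL_DESCRIPTIONS)
    # Apply shorter prefixes first so the most specific override wins.
    for prefix in sorted(MODEL_OVERRIDES, key=len):
        if model_id.startswith(prefix):
            descriptions.update(MODEL_OVERRIDES[prefix])
    return descriptions


codex = resolve_tool_descriptions("gpt-5-codex")
generic = resolve_tool_descriptions("some-other-model")
```

Layering overrides on a shared base keeps the common case in one place, so a new model family only needs entries for the tools where its behavior actually differs.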
B
So another kind of detailed topic in here is how you assemble context for the agent, in terms of: you're operating in this type of repo, you have these things beyond tools. There are people who do different amounts of pre-injection of context. Maybe it's not just the system prompt, but a whole bunch of additional things. How do you think about the right ways to present things to the agent, so it starts out going in the right direction, versus everything being on-demand tool calling?
C
So there are a couple of things here, and let me actually start with tools first. Most models these days have been trained with a particular tool set, so out of the box they already know a certain set of tools. For example, apply patch for GPT models, string replace for Sonnet models. Those individual tools are must-haves for those models. The next question is, beyond this, how well do the models actually generalize? So you want the tools, and then again the context, as I said: in what kind of environment is that particular prompt now executing? Is it a foreground agent, a background agent, and so on. So that's the first kind of question: how are those tools actually represented in your prompt? Then the next one is, let's say you have a couple of MCP servers installed. Some MCP servers have only one or two tools, but others come in with dozens and dozens of tools. Most models have a limit on how many tools you can put in the prompt, so 128. But on top of this, there are a lot of tokens that you put in there. How many tokens do you want to spend on tools that are rarely used or are only used in particular situations? A technique we're using there is, for example, that we take all of the tools that an MCP server gives us and create virtual categories of tools. These virtual categories are represented as tools in their own right. We give this to the model, and the very moment the model decides to call one of those virtual tools, we expand it. But now you immediately have this kind of trade-off discussion, which is, the very moment you do this, you actually...
B
You've blown your KV cache.
C
Exactly, exactly. In some cases, some models support that you put this at the end of the prompt; others don't. So now you immediately have to make this trade-off. And that's also where a lot of evals come in. You run in all of those different configurations, and you compare. These are optimization functions that you have to hit here. It's like, oh, if I blow my cache once or twice over such a long time, I'm still good. Or you say, no, actually I can run with a slightly larger prompt, and that is fine because I have a cache hit rate of 87 or whatever, and this is okay; there's no big advantage here. So it's a constant kind of trade-off. And that is also true for all of the other context parts. There is no real stable answer. I mean, there are some stable parts: you say who the user is, you say what repository the user is running in. But then there's already the question of how much information you give about the project itself. For example, we put in this kind of prefix where we say, oh, this is what the top level of the project looks like to the user. And we still believe that this is a reasonable token spend. But on top of this comes what we dynamically include. Dynamic inclusion is, clearly, if you have an AGENTS.md file. We also have custom instructions that we support, and custom instructions can be tailored in different ways. They can just be in a certain location, and that's fine. In the front matter of those custom instructions files, you can say applyTo, and then you can give glob patterns and say, oh, for a TypeScript file in this particular test folder, this is a file that applies.
And then there's yet another mechanism where you can actually give a natural language description of under which circumstances that custom instruction applies. So then we actually start collecting these and actually putting them also in the prompt. Right? And that is actually a process that I think is the one that is the most valuable, right, because you make sure that the user is in control of how much they pretty much AI-prepare their code base. But when they did a really good job, then they can actually make sure that the agent extremely quickly gets to the right place, knows where to start, knows where to look.
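The glob-scoped custom instructions mentioned above can be sketched as a simple filter. This is an illustration under stated assumptions, not how VS Code resolves instructions: the `Instruction` type and `select_instructions` function are hypothetical, and Python's `fnmatch` is only an approximation of editor glob semantics (its `*` also matches across `/`).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Instruction:
    apply_to: Optional[str]  # glob pattern from the front matter, or None for "always"
    text: str


def select_instructions(instructions: list, file_path: str) -> list:
    """Return the instructions whose pattern matches the file being worked on."""
    selected = []
    for instruction in instructions:
        # An instruction without a pattern applies everywhere (assumption for this sketch).
        if instruction.apply_to is None or fnmatchcase(file_path, instruction.apply_to):
            selected.append(instruction)
    return selected


rules = [
    Instruction(None, "Follow the project's naming conventions."),
    Instruction("src/test/*.ts", "Write tests with the existing test helpers."),
    Instruction("*.py", "Use type hints."),
]
picked = select_instructions(rules, "src/test/editor.test.ts")
```

Here `picked` contains the general rule and the test-folder rule, and only those would be added to the prompt for that file. The natural-language variant Kai mentions would need a model to decide applicability and is not covered by this sketch.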
B
Let's maybe talk a little bit about interactions between agentic pieces and the IDE itself. Now, you mentioned a couple of examples of this: if it's running in the foreground, use tools that connect to parts of the IDE so that they're running in the right place rather than using the terminal. But what is the surface area of the IDE that you expose to the agent? And how do you think about changes coming from the agent versus coming from a human?
C
So I mean, it's all about how you interact with it, right? So again, the most straightforward form is you actually have a foreground agent running in VS Code. And that foreground agent, we give it actually quite an interesting set of things that it can do, right? It can look at terminals, it can read selections in the terminal, all of these kinds of things. It can run tools, watch tasks, right? And there are specialized edit tools, where we are actually able to take pretty much snapshots at a given point in time, so that we can show you, oh, here are all the appropriate diffs and so on, right? There's a good chunk of tools that we actually give a foreground agent. In a background agent, that's quite different. A background agent, we give significantly less. Why? The first thing is, if you run the agent in the foreground, you have this kind of expectation that the agent actually is reasonably quick, right? If you think about this, that is the interactive part, right? You don't want to sit there and wait two minutes and twiddle your thumbs, right? You want to get the answers relatively quickly, right? And then again, you want to make sure that this all kind of is like the extension of what you would do anyways. But when you go and move something into the background, then you clearly don't want it to touch your UI state at any given point in time. Like, we mentioned the example of the test runner a couple of times, right? You don't want it to mess around with your test runner. You don't want it to open up a terminal on you so that your mouse all of a sudden clicks in a different place and so on. Right? So there are different tool sets that you are actually giving. A cloud agent is yet different, right? So while the background agent is still running on my local box, right, and is still exposed to me closing the lid, the cloud agent is not. Right.
And there it's about, in what containerized environment does that agent actually run? Right. What is the project that you actually have? Can it build successfully in that container? Right. Can it execute a test run in this container, yes or no, and so on. Right. But again, the cloud agents have significantly fewer tools in order to do so. Remind me of your question again.
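The idea of giving each agent mode a different tool set can be sketched as a lookup table. Every tool name below is a hypothetical placeholder rather than one of VS Code's real tool identifiers, and the strict nesting (cloud within background within foreground) is a simplifying assumption for the example; the conversation only says that background and cloud agents get significantly fewer tools and that background agents should not touch UI state.

```python
# Hypothetical tool names, chosen only to illustrate the layering.
CLOUD_TOOLS = frozenset({"run_terminal_command", "edit_file"})
BACKGROUND_TOOLS = CLOUD_TOOLS | {"read_file", "search_workspace"}

# Tools that touch the editor's UI state are reserved for the foreground agent.
UI_TOOLS = frozenset(
    {"read_terminal_selection", "run_tests_in_ui", "watch_task", "suggest_extension"}
)
FOREGROUND_TOOLS = BACKGROUND_TOOLS | UI_TOOLS

TOOLSETS = {
    "foreground": FOREGROUND_TOOLS,
    "background": BACKGROUND_TOOLS,
    "cloud": CLOUD_TOOLS,
}


def tools_for(mode: str) -> frozenset:
    """Return the tool names exposed to an agent running in the given mode."""
    try:
        return TOOLSETS[mode]
    except KeyError:
        raise ValueError(f"unknown agent mode: {mode}") from None
```

Keeping the UI-touching tools in a separate set makes the rule from the conversation easy to check: a background agent never receives anything that could steal focus or disturb the test runner.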
B
Well, so my question was kind of how you think about those interactions, and you've given me a fair amount. This actually leads to a curiosity I had as I was listening to you: how much does this differential exposure of different types of tools end up influencing how well the agent does? I'm imagining the same prompt in an IDE context versus a background agent versus a cloud environment might result in quite different coding behaviors.
C
The model choice, I think, has a much bigger impact on the actual outcome. So what we are trying to do is really straddle the line between the user experience and the success of the agent. Because as we said, if you have a terminal tool, so an execute-terminal-commands tool, you can get really far. You don't need an edit tool: you can use a cat command with input redirection, and you say this is your edit, and then it writes it to the file system and so on. So you don't really need a whole bunch of tools in order to make an agent successful going from A to B. There are some differences. Like, for example, when the industry introduced the to-do tools in order to have longer-running agents self-organizing and so on, right? But in big parts, when you think through this, right, you don't need a huge amount of tools in order to make that successful. So that's one. So when we actually bring it in the foreground and give it more tools, then that's very specific to the environment, right? It's like, for example, no other agent needs a "hey, maybe you should install this extension to have a better user experience" tool, right? But inside VS Code that clearly is a tool that is available. And it's particularly interesting when you scaffold, for example, a new project. You say, oh, I want to do this, right? And now create this workspace for me. And it goes and uses Go, because you told it to use Go. But you don't have the Go extension installed. So it makes sense that the agent actually goes and says, by the way, should I install the Go extension for you? So it's really more about the user experience that we try to give to folks in the appropriate environments they are in. That's really the biggest difference. And then it's also, coming back to how you actually interact with the agent, right? I think for a background agent, I don't want to have a lot of interaction there. I want to have context isolation.
I want to make sure that it's even running in a sandbox environment so that I'm not bothered by tool calls, right? So that I don't have to approve tool calls and these kinds of things. Right? But in a foreground agent, right, that's more like, well, I'm not quite sure yet exactly what I'm doing, right? At least that is the use case I see. Primarily people are talking about code, right? They kind of go, make a selection and say, hey, change this. Short prompts. They are not particularly long. Sometimes people go and actually use NES in order to start a change, but then they don't finish it and just go and prompt, saying, you know, hey, finish this up. Right? And it should then be very quickly just, you know, in this particular file, just do the rest. Or when people create new test cases, right? So test case generation usually is not something that takes particularly long, right? I mean, it depends, but in most cases, right, it's pretty straightforward. Pushing this in the background and then coming back after a while to review it and all of this? People are more like, no, no, I do it right now and then immediately run it and then let me review it, right? So that there is no cheating going on, right? That the tests are not already playing to how the actual behavior is, and so on, right? So very different styles of interacting, right? Where one is that the foreground is really like short interactive behavior, talking about code, pointing at code, collecting pretty much the context that you want. That's all very code specific, right? With background agents you don't really do that, right? There you try to be precise the moment you start it, and then at the end, yeah, you can follow up a little bit if you want, right? But it's different expectations, different levels of preparation and so on, right.
And what I'm actually saying here is, right, there's more to this, which is when I say you talk about code, then I mean you actually are a person who cares about code and you are actually really working on something where you need to guide in regards to software architecture, certain patterns that you want to enforce, etc. There's this whole other world where you don't care. So you don't care what the code looks like. It's really just outcome oriented and so on. And those lines, they also shift back and forth, right? They shift even within the same project, by the way, right?
B
100%. I care about my core architecture. This tool. Just vibe it. I don't care.
C
Yes, exactly. Right. So it is exactly this, right? And this is a really interesting point, right? And I think as an industry we maybe don't talk about this enough, right? Which is, how do you actually AI-ready your code bases? That is exactly it. If you have a project like, I mean, our code base, right? The initial commits and all of this, they are more than 10 years old. Since then we built on top of this. In order to make our code base AI ready, we really have to think about what are the core abstractions where we really tell the agent: never go and change those things, right? If you should change anything here, we tell you. Right? But then there are other parts that are a little bit more peripheral, as you said, right? Some tool or so that you just want to have on the side, a tool in a more generic way, right? That's like, yeah, just go do it. Right. And you might even just check in pretty much the prompt file that you use in order to generate this, right? So it's very, very different, right? So you think about what is untouchable, what kind of lives at the periphery, right? Where you care, where you don't care. Most people really love using test-driven development for a bunch of this: tests are pretty much my prompts that I use for the implementation side. There's really this great flexibility, and people are operating in quite different ways.
B
I'm curious, when we talk about these different modes of operating and the fact that we flow between them, how do we connect the dots? So an example that I'm going to bring forward, and I'm very interested in how you would think about this. I'm often working on something kind of interactively, in that interactive mode. I'm thinking about it and an idea comes: oh, it would be great if we do this. And I have a set of sort of predefined research-style prompts that I can just kick off. So I'll kick off a background agent and say, okay, go and research in my code base what it would look like if I were to do something like this. Write me an analysis doc and go. It'll go off and do it as I continue on my main line, and at some point I want to come back and pull, almost suck that into interactive mode. Now I can do this right now with like branches or doing things like that, but I'm curious if there's something in the IDE that lets me kind of... It's almost like I'm pushing ideas onto the stack and then I want to pop them down into my interactive world.
C
There are different ways of thinking through this, right? It's interesting, because we really just discussed that, and we have a mock-up, we have not implemented this yet, where a similar discussion came up. But more about, at which point in time do I go back to a background agent or to a cloud agent? Yours is similar because it depends on the output the cloud agent generated for you. In your case, you want to see the analysis, right? You want to look at this and so on. And the way you think through this is, well, you kick them off, right? And at some point you've got to go back. So you need an indication that it's telling you that it's ready for review. But then the interesting thing is that just looking at something does not necessarily mean that you did all of your due diligence, right? So in a way you need interaction, saying, yeah, it's ready, you can go there, right? But then at some point saying, I took action on this, is actually good, right? So you want to have this awareness of those, right? And when you think about this, I mean, there's really, how should I say, prior art. I mean, when you think about email management tools and so on, they have quite similar kinds of characteristics. So one thing that we had was pretty much, at any given point in time when you interact with this chat and so on, clearly you can make these things disappear. But you pretty much have awareness about where your background agents are and which ones are ready to review, right? You don't see the running ones if you don't want to, you don't see the ones you took action on, but really just those that you haven't acted on yet, right, but they are done, right? They have produced what you asked them for.
And then one of the mock-ups that I just talked about, right, is pretty much where we bring this right into pretty much the very top of the title bar, where you have pretty much something that can just come down as an overlay when you need it, super quick, you just see it, right? And it needs to be a first-class citizen. Whether this UI that I describe is what it will look like in the end is a different question. Right. But it is absolutely clear that you need this kind of peripheral awareness. No matter what you do, you need this kind of peripheral awareness that something else is going on. Right. I mean, there are workarounds for this. Right. And that's actually something where sometimes we're maybe too focused on the one tool that we own and that we operate in. Right. But again, right, I mean, if you want something from me, you Slack me and I get a notification that you want something from me. So with integrations into these kinds of other work environments, say you have a Slack channel with your agent, and it just comes up and actually says, hey, I'm done, it's good. And you have the notification and it fits in your other workflows and so on. So there's a lot to explore here. That's where I'm going. We can do some things in the IDE, we can do some things with GitHub on GitHub.com, but I think that's not necessarily where the line is. If an agent is an actor, there's a lot of other kinds of tools that already are custom built for actors. And so I think we've got to broaden our way of thinking through these problems a bit.
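The inbox-like view of background agents described above (hide the running ones, hide the ones already acted on, surface only finished work still waiting for review) can be sketched as a small state filter. The `Status` values and the `AgentSession` type are assumptions for illustration, not an existing VS Code API.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    READY_FOR_REVIEW = "ready_for_review"
    ACTED_ON = "acted_on"


@dataclass
class AgentSession:
    title: str
    status: Status


def needs_attention(sessions: list) -> list:
    """Sessions that finished their work but that the user has not acted on yet."""
    return [s for s in sessions if s.status is Status.READY_FOR_REVIEW]


def mark_acted_on(session: AgentSession) -> None:
    """Record an explicit user action; merely opening a result does not count."""
    session.status = Status.ACTED_ON


sessions = [
    AgentSession("Research caching options", Status.READY_FOR_REVIEW),
    AgentSession("Generate test cases", Status.RUNNING),
    AgentSession("Rename config keys", Status.ACTED_ON),
]
pending = needs_attention(sessions)
```

Separating "ready for review" from "acted on" reflects the point in the conversation that looking at a result is not the same as having done the due diligence on it.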
B
I love that. And it's a good segue to another topic here, which is VS code has always been very, very extensible, very plugin centric, very open. How are you thinking about that within this new world? Are there things that need to change? I know you mentioned MCP servers, that's definitely one way of interacting. But are there changes going on in that landscape as well?
C
There's an interesting duality here. One is that in order to get new functionality into VS Code, you had to write an extension. Now you actually have direct access to LLMs, which can actually operate some of that. You said, oh, I have a custom prompt file that does X for me, to kick off a research background agent. But you can also have custom prompt files that actually do things in interactive mode in the IDE. And now if you get the capability to put a keyboard shortcut on this, well, all of a sudden there's extensibility right there without writing any extension and so on. Right. So it's an interesting kind of duality here, because for some things you still want this extension, but for others you have a lot of flexibility already without even being required to write an extension. So that already changes extensibility, right? Then MCP servers are a newish concept, right? And that is interesting, right, because the MCP spec covers a lot. But the most interesting part, the most used one, is the one where you actually can make tool calls. You get pretty much a tool host, right? And that is clearly, I mean, there we're still very early, right? When you think about this, kind of obvious, some of the aspects, right? But now you actually have people who kind of say... Anthropic just posted this a couple of days ago, right, or a couple of weeks ago, the whole part about programmatic MCP tool calling, right? And then you're like, oh, I can clearly see where the gaps are. But now we're really just making different APIs, right? It's like, why are we not calling them the real APIs? Why is there a differentiation between the normal APIs and the MCP servers, right? You can see that they then start fusing together, right? And MCP is just the API that you publish, right, and nothing else, and so on. That would make a lot of sense, right? But now you put this together with autonomy of an agent and you end up in a potentially scary world.
Because now you need to think through the security implications. You need to think about how do you actually control this, right? What do you allow, what do you not allow? Identity management, permissions for agents, all of these kinds of things, right? And what we are doing today is we create a sandbox for some of this, right? But it already starts like, you know, people go and say, oh, Context7, right? I should be quite careful how I phrase this so that your takeaway is not, oh, Context7 is an issue. But I give this as an example, right? So you register your website there, right? It's crawling markdown files. I'm sure there's some sanitization going on, but you know, wherever there's sanitization, you can actually game it. Now you have an MCP server that actually finds those pieces of documentation, puts it in your prompt. And now what? Now it starts building and executing code and so on, right? So it's this poison-the-well kind of problem, right? So you need to, in a way, control all entry points into this. But this creates the most awful user experience.
B
Yeah, no, this is, this is a fascinating domain because in essence, all of these large language models, they're another form of running computation. It's word programmed rather than formally programmed. And so any MCP server is injecting code that's running on your box. Do you trust it? Do you trust everyone who's able to get anything into that?
C
Yes, it's exactly the question, right? And take what we do in VS Code, for example, the way you actually do tool approvals. First of all, this is just running the tool, right? So again, inputs, outputs, right? So what you just said is the part where you're saying, oh, are you okay with this command being executed? And it can be local. If the MCP server runs local, it might install packages, right, with uvx or npx or so, if it didn't run yet, right? So that's number one, right? Are you okay with this one being executed? You're saying, okay, so how do you want to do this: for this session, for this particular call? Right? Are there certain patterns in a command that you actually want to allow? There's a lot of room in order to get this kind of configuration, right? But then you make the call, and let's assume this is a remote MCP server. Now what you pretty much said is, I'm okay that this server is being called, maybe with my authorization token. But now there's a response being computed, and that response also goes, either in a summarized form or in its actual form, into the history of your chat, right? And I was like, oh, what now? Now you need to pretty much review everything that comes back. But again, right, that's awful. So you want to use specialized security models to actually do this kind of monitoring, right? But now you're pretty much in this kind of, you know, who wins, right? A head-to-head race, right? And what I haven't even touched on: this is chat, on one side, right? But now you go to the terminal and terminal commands. Do you want every terminal command to ask for permission? Right. Now you've got to go and say, no, no, let us actually read and understand what that terminal command is. And if that terminal command actually feels safe, right, let's do it. But then you also need to give the user control. I mean, a user in an enterprise setup might think about which tools should be called without explicit, every-single-time permission differently than if I'm running on a VM in the cloud that, you know, I just leased for this particular kind of use case.
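The approval choices mentioned above (approve once for the session, allow commands that match known-safe patterns, otherwise ask) can be sketched as a small policy object. This is an illustrative sketch only: the `ApprovalPolicy` class, the prefix-matching rule, and the list of shell operators treated as risky are assumptions, not VS Code's actual approval logic, and a real implementation would need a far more careful command parser.

```python
import shlex
from dataclasses import dataclass, field

# Shell operators that can chain or redirect commands; seeing one means "ask".
RISKY_OPERATORS = (";", "&&", "||", "|", ">", "<", "`", "$(")


@dataclass
class ApprovalPolicy:
    allowed_prefixes: list  # e.g. [["git", "status"], ["ls"]]
    session_approved: set = field(default_factory=set)

    def approve_for_session(self, command: str) -> None:
        """Remember an exact command the user approved for this session."""
        self.session_approved.add(command)

    def decide(self, command: str) -> str:
        """Return "allow" or "ask" for a terminal command proposed by the agent."""
        if command in self.session_approved:
            return "allow"
        if any(op in command for op in RISKY_OPERATORS):
            return "ask"
        try:
            tokens = shlex.split(command)
        except ValueError:  # e.g. unbalanced quotes
            return "ask"
        for prefix in self.allowed_prefixes:
            if tokens[: len(prefix)] == prefix:
                return "allow"
        return "ask"


policy = ApprovalPolicy(allowed_prefixes=[["git", "status"], ["ls"]])
```

With this policy, `policy.decide("git status --short")` is allowed, while `ls -la && rm -rf /` falls back to asking because of the chaining operator. An enterprise setup and a throwaway cloud VM would simply configure different `allowed_prefixes`, which is the kind of user control the conversation calls for.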
B
For example, I do wonder if it leads towards a world where essentially all development is actually happening inside of a container or vm.
C
I think if you think this all logically to the end, that is, I think, the part you're getting to, right? But then still you need to control the inputs and outputs to this container, right? So fetching a webpage, right, when you go and say, hey, I need the latest version of Node, right? You get an install command that runs in the terminal, and in this moment something comes into your box. So, yes, you want this to be safe, right? You want to be able to close the doors and say you cannot get out of it, right. My point here is the problem doesn't go away even if you put this into containers, right? But you can control the environment in a better way. But we still need to think about how to make this a good user experience, how to make that understandable for you and so on. Right? And when I say understandable, it's like, we didn't say this explicitly, but I think our conceptual model here when we talked was, oh, there's this one or two background agents that I have right now. But now multiply this by 100 and all of a sudden that is a very, very different problem, right? And we need to work through all of this, right? I mean, that's already starting, right, when you think about this, right? So with cloud agents, for example, let's say on GitHub, you groom, you assign a bunch of issues to Copilot, or you actually have auto-triaging enabled, right? So you go and say, oh, we auto-triage those ones now. You get those PRs for this, you review them. Reviewing 5 or 10, depending on size, might be good. Reviewing 200 a day...
B
I've reviewed more code in the last 6 months than I can remember, right. It's ridiculous. It's wild. So I think this gets to kind of where I want to take us towards the end. And we're getting closer to the end of our time here, which is where do you see this going over the next year or two? I hesitate to go too much farther out because as you've highlighted, things are moving so fast. But like, how is VS code and this whole world of how we're managing the writing of code. I say it that way because maybe we're not actually writing the code, but we're managing the generation or writing of code. Where do you see it going and what's coming down the pipe?
C
I'm not sure I look two years down the pipe, right? Because we might surprise ourselves how quickly we end up in a given place. But when I think about it, we're still learning about the interactivity models, and that's active research, where we go in and say, we implement it one way, we implement it the other way. We look at how people actually accept it, where does the percentage go? In the beginning it was a lot of tab, tab, tab. Now it's more like, oh, the percentage is lower, but it depends on the experience level of people and on what part of the code they are in, as when you talked about there being something that is really sacred land and then there are other things. So all of these kinds of things are influencing what those interaction models are, and that will change and we'll figure this out. I mean, different ideas will come from different areas and so on. But I think then there's this whole point about how do we use agents effectively? And I think that is also a very hard problem, because again, people say, oh, we run many agents in parallel. Yes, but now in what circumstances? If I have a project like VS Code where we go through 3,000 issues a day, sorry, a month, not a day.
B
But that's projecting forward two years, right?
C
Probably. But 3,000 issues a month, right? And that was just based on human activity, right? You put AI into the mix now, that number needs to go up, right? Then how much more parallel work can you do, right? Because there are different boxes that you can execute in, right? Let's say as a team member you own a bunch of tags, those are yours, and you can go and have a couple of agents running on each of those tags and that's kind of fine. But if I go and think about creating something new, then actually running multiple things in the background, that is way more complicated, because it's easier to think step 1, 2, 3, where 1, 2, 3 build on top of each other, rather than, oh, it's one and then it's two A, B, C, D. Right?
B
Cognitive limits.
C
Yeah, absolutely right. As a human, we are pretty much the weakest link in the chain. Assuming that we really get to a place where the code that comes out is in good shape. We're not there yet, right? But you can see that this is happening, right? So how do you actually really work with this level of parallelism? Right? And so on. So I think there's work to be done there. I think where we clearly will end up, right, and now just think about real-world large-scale operations, right, is where you go and say, hey, I have to make a change here, right? I want this new vertical feature to go in, but it now actually touches dozens of repositories, different service deployments, all of these kinds of things, right? So now you end up in the world where you pretty much need to create a plan, almost like a project plan, right? So agents need to go. And one solution is you have a monorepo and you just have one agent running around, right? But what is more likely is that it stays a distributed world, at least for many people out there. And then you need different instances of agents, with the main agent delegating to other agents that are running, and they need to communicate with each other, right? They need to report back where they are. The reporting back cannot be a markdown document anymore. They need to potentially go back and say, no, here is really the change tracker, here are the different issues, they're linked to each other and so on. Maybe you need to see this on the planning board in order to understand this, right? It has a lot to do with, again, the human being the weakest link in the chain, right? It's about transparency, what is actually happening and so on, right. So I think there is a lot that will happen in this particular area where we need to go, right? And yeah, then there's one other aspect, but that's the one I personally struggle with the most, which is, what is the...
It is very easy to say, oh, I can code wherever I want, and if I have an idea I just type it in my phone and send it off and so on. Certainly true, right. There are some use cases for this, but I'm not quite sure how. I mean, I want to work on my iPad, that's clear. But this is just a replacement for, I just sit in a different place, I don't want to use my laptop, these kinds of things. But really, what's the role of mobile, of smaller form-factor devices? How much do I really want to do on my phone? I can see maybe voice playing a big role in this one, but other than that, I don't want to review code on my phone.
B
I was going to say, kicking things off: great. Reviewing code...
C
That's right, that's right. Then I think there's that last part. And again, it's actually not that surprising when you think about this, right? We like to be creative, and in what environments are we creative? And I really could see that we're still in very traditional kinds of collaboration forms. As we said, you could have a Slack channel with your AI agent, for example, so these kinds of interactions. But I think the other one is, what is it that we really like as humans? We like to stand at a whiteboard together, sit huddled together and do something together. There are materials on the table that we shuffle around in order to talk. These kinds of things. How would you replicate these kinds of things? I think at some point Microsoft had this Studio PC, a large screen that you could lay down flat. But you now take something like this, right, where you can draw on the screen, particularly when you do UI development, for example, right? You go, you draw on the screen, you say, this is what I want, right? And so on. And then you can actually talk to it at the same time while you're drawing and saying, no, this here, right, should be a little bit more over here. What do you think about this? Give me two alternatives and so on. And all of a sudden you have a very, very interactive work style that makes us happy as humans. There's a lot of dopamine in this, right? And it can be very well AI supported, by voice, by actually multimodal inputs and so on, right? And there might or might not be code involved in this. There's code involved in it behind the scenes, right? But again, right, this as a work style, I can clearly see, right? And I think we will see a lot of this coming forward.
B
I think that's a great cut point.
Software Engineering Daily | January 6, 2026
Host: K Ball (Kevin Ball)
Guest: Kai Maetzel (Engineering Manager, VS Code, Microsoft)
This episode explores the evolution of Visual Studio Code (VS Code) from a simple code editor to a powerful, AI-integrated development platform. Kai Maetzel, Engineering Manager for VS Code, discusses the origins of the tool, the influence of AI (notably through GitHub Copilot), the rise of agentic programming models, and how extension and interaction patterns are rapidly changing in the wake of new AI capabilities. The conversation offers deep technical insights into product design, user experience, model integration, developer happiness, extensibility, and the future of software development.
Kai’s Background:
Market Opportunity:
Rapid Change:
GitHub Copilot Integration:
Full AI Integration:
Technical Integration Challenges:
Feature Evolution:
Model Suggestions vs. Developer Flow:
User Variety & Adaptation:
Chat UX Frustrations:
Emergent Behavior Patterns:
Agent Modes in VS Code:
Agentic Loop Fundamentals:
Custom Agents:
Evaluating Agent Performance:
Model-Specific Prompting:
Context Management:
Foreground vs. Background vs. Cloud Agents:
User Experience and Trust:
Evolution of Extensibility:
Security Challenges:
Isolation & Scaling:
User Experience Evolution:
Mobile & Multimodal Futures:
Creative Collaboration:
On AI Integration:
“It was super clear that this needs to be core part of the experience ... AI really infuses in every single aspect.” (Kai, 07:15)
On UX Calibration:
“Happiness is a combination of how productive you feel, of how well you actually thought you could go through your thought processes, how focused you could be, how little annoyed you were...” (Kai, 13:27)
On Chat Speed:
“We spend years optimizing ... to keep you in flow state and then we put you into a chat and now the answer takes 20 seconds ... You only tolerate this because you still think that the outcome at the end is quicker, is better.” (Kai, 16:22)
On Developer Preferences:
“There are different breeds of developers with different, different interests and different preferences, right? And you've got to be giving the right tools ... so that they can find a place that they're happy.” (Kai, 20:59)
On Extensibility and Security:
“Now you put this together with autonomy of an agent and you end up in a potentially scary world ... you need to think through the security implications ... But this creates the most awful user experience.” (Kai, 54:32)
On Cognitive Bottlenecks:
“As a human, we are pretty much the weakest link in the chain ... how do you actually really work with this level of parallelism?” (Kai, 63:51)
On the Future of Developer Collaboration:
“We like to stand at a whiteboard together, sit huddled together and do something together ... How would you replicate these kinds of things?” (Kai, 66:50)