Transcript
A (0:00)
Today on the AI Daily Brief, we are doing a 101 on one of the most important concepts in AI right now: harness engineering. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy, Drata, and Mercury. To get an ad-free version of the show, go to patreon.com/aidailybrief or you can subscribe on Apple Podcasts. Ad-free is just $3 a month. If you are interested in sponsoring the show, or really finding out anything else about the show, head on over to aidailybrief.ai or shoot us a note at sponsors@aidailybrief.ai. One final note before we dive in. Today is hopefully the last day for a while that I will be on the road traveling, so this episode was recorded at the end of last week. If for some reason Sam Altman decided to release Spud over the weekend and you're wondering why the heck this is the episode you're getting, that is why. But I will be back, I promise, very soon. In the meantime, this gave me a chance to dive a little deeper on something that I think is extremely important and have wanted to explore for a while, which is harness engineering. Today we are digging into a topic that, first, you might have heard floating around a little bit, but second, even if you haven't, if you are among the subset of the audience that has been dabbling with Claude Code or Codex or even using OpenClaw, you have been living in and doing this thing, whether you realize it or not. I'm talking about harness engineering, and you might notice that there is kind of a lineage of engineerings that we focus on that have changed over the years in AI. In 2023 and 2024, we talked a lot about prompt engineering, the art and the science of finding the right ways to prompt the model to get the results that you wanted. There was so much in prompt engineering that people spent so much time on. Think about the things that everyone used to recommend, like getting the model to adopt a persona, or later on, the whole idea of JSON engineering, where people hyper-structure their prompts in the way that an engineer might. Now, last year in 2025, we started to talk a lot more about context engineering. The idea of context engineering was that it turned out that what mattered for AI performance was not just the way you spoke to the model, but what set of information or context that model had access to. Take the example of asking ChatGPT to help you create a marketing campaign: one part of getting good results, sure, might be what you prompt it for and how you ask it. But obviously it's kind of intuitive that if ChatGPT had access to information about the performance of all your past marketing campaigns, it might be more informed in how it helped you. So context engineering was all about the way that we brought together different context and gave AI access to it. Now, interestingly, context engineering actually has kind of had divergent meanings for different people. For engineers and developers, context engineering has often been about designing the systems that surround AI and agents in order to better interact with and use context, dealing with problems like persistence in memory and state. And in a way, this is kind of a part of what we'll talk about with harness engineering. For laypeople, for non-technical users, context engineering has been much more about: what's the best way to give AI access to the information it needs to do its job?
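To make that layperson's version concrete, here is a minimal sketch in code: the same request, but with relevant background packed into the prompt first. This is illustrative only; the call_llm helper and the campaign data are hypothetical stand-ins, not any particular product's API.

```python
# Minimal sketch of context engineering: the user's request stays the
# same, but the model is handed relevant background before it answers.
# `call_llm` is a hypothetical placeholder for whatever chat API you use.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError("wire up your model provider here")

# Context the model would otherwise never see: past campaign results.
past_campaigns = [
    {"name": "Spring launch", "channel": "email", "ctr": 0.042},
    {"name": "Summer promo", "channel": "social", "ctr": 0.018},
]

def build_messages(user_request: str) -> list[dict]:
    # Serialize the relevant history into the system message so the
    # model can ground its suggestions in what actually worked before.
    context_lines = [
        f"- {c['name']} ({c['channel']}): CTR {c['ctr']:.1%}"
        for c in past_campaigns
    ]
    system = (
        "You are a marketing assistant. Past campaign performance:\n"
        + "\n".join(context_lines)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("Help me plan our fall campaign.")
```

The prompt itself did not get any smarter; the model just got handed the context it needed.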
Now, it's important to note that while prompt engineering might have decreased a little bit on the importance scale, context engineering is still very much alive and important. In fact, I did an entire episode about a week ago about how to build a personal context portfolio so that you could transport your personal context from LLM to LLM or agent to agent without having to repeat yourself every time. But the term du jour right now is harness engineering, which is effectively about everything you put around a model: the systems, the tooling, the access that help it do what it's meant to do. And when one starts to look around, you kind of start to see the harness engineering conversation popping up everywhere. At the beginning of April, Cursor launched its newest version, Cursor 3. In their announcement post, they wrote: software development is changing, and so is Cursor. In the last year, we moved from manually editing files to working with agents that write most of our code. How we create software will continue to evolve as we enter the third era of software development, where fleets of agents work autonomously to ship improvements. We're building towards this future, but there is a lot of work left to make it happen. Engineers are still micromanaging individual agents, trying to keep track of different conversations and jumping between multiple terminals, tools, and windows. We're introducing Cursor 3, a unified workspace for building software with agents. The new Cursor interface brings clarity to the work agents produce, pulling you up to a higher level of abstraction with the ability to dig deeper when you want. It's faster, cleaner, and more powerful, with a multi-repo layout, seamless handoff between local and cloud agents, and the option to switch back to the Cursor IDE at any time. So all of the features they then go on to announce, having all of your agents in one place, the ability to run many agents in parallel, new UX for handoff between local and cloud: all of this is the instantiation of harness engineering into a product. Even more recently, we got Claude Managed Agents. In their announcement post, they said explicitly that it pairs an agent harness tuned for performance with production infrastructure. And in the accompanying blog post, they basically say this is kind of all about harnesses. The post was called Scaling Managed Agents: Decoupling the Brain from the Hands. Now of course, in this metaphor, the brain is the model and the hands are the harness. Harnesses, they write, encode assumptions that go stale as models improve. Managed Agents, then, is built around interfaces that stay stable as harnesses change. Now, we'll maybe come back later to some of the specifics of that new product, but again, the point here is that harness engineering is kind of everywhere. At the beginning of March, Latent Space dropped a post called Is Harness Engineering Real? And to provide another analogy, their team references back to when they worked in finance. It doesn't say for sure, but I assume this is swyx's writing, because this was part of his experience set. But whoever it was wrote: a common debate in my finance days was about the value of the human versus the value of the seat. If a trader made 3 million in profits, how much of it was because of her skills, and how much was because of the position, institution, and brand she is in? And could any generally competent human have made the same results?
The same debate, they continue, is currently raging in harness engineering, the system subset of agent engineering and the main job of Agent Labs. Agent Labs, by the way, are how the Latent Space team refers to everyone like Cursor, Cognition, etc. The central tension, they continue, is between Big Model and Big Harness. An AI framework founder you all know once confided in me at an OpenAI event: I'm not sure these guys even want me to exist. To define harness, they write: in every engineering discipline, a harness is the same thing, the layer that connects, protects, and orchestrates components without doing the work itself. They continue: talking with the Big Model guys, you really see it. Every podcast with Boris Cherny and Cat Wu, the creators of Claude Code, emphasizes how minimal the harness of Claude Code is, meaning their job is mostly letting the model express its full power in the way that only the model maker knows best. In one interview, Boris said: I would like to say there's nothing that secret in the sauce. Generally, our approach is that all the secret sauce is in the model, and this is the thinnest possible wrapper over the model. We literally could not build anything more minimal. Cat added: it is very much the simplest thing, I think by design. Noam Brown from OpenAI seems to agree. They quote him as saying: before the reasoning models emerged, there was like all of this work that went into engineering agentic systems that made a lot of calls to GPT-4o or these non-reasoning models to get reasoning behavior. And then it turns out we just created reasoning models and you don't need this complex behavior. In fact, in many ways it makes it worse. You just give the reasoning model the same question without any sort of scaffolding and it just does it. And so people are building scaffolding on top of the reasoning models right now. But I think in many ways these scaffolds will just be replaced by the reasoning models and models in general becoming more capable. On the other side, says Latent Space, are the Big Harness guys. Jerry Liu from LlamaIndex wrote a post on this on X that he titled The Model Harnesses Everything. He added a picture that sums up his point, saying agent reasoning is exponentially improving, but models are blank slates. The biggest barrier to AI value is the user's own ability to context and workflow engineer the models. The more complex the business process, the more complex the prompt that users need to define. Now, where Latent Space comes out is that while they might have some bias towards the big model thesis, actually referencing the bitter lesson that we talked about in an episode a couple of weeks ago, they also acknowledge that harness engineering has real value. So let's dive a little deeper into what harness engineering actually is. And for part of our guide, we're going to use a post from humanlayer.dev from the middle of March called Skill Harness Engineering for Coding Agents. Author Kyle writes: we've spent the last year watching coding agents fail in every conceivable way, ignoring instructions, executing dangerous commands unprompted, and going in circles on the simplest of tasks. Every time, the instinct was the same: we just need better models. GPT-6 will fix it. We just need better instruction following. It'll work when the niche library I'm using is in the training data. But over the course of dozens of projects and hundreds of agent sessions, we kept arriving at the same conclusion: it's not a model problem, it's a configuration problem.
Yes, models will get smarter, and yes, some existing failure modes will disappear, and then, because they are smarter, we will give them new problems which are bigger and harder, and they will continue to fail in unexpected ways. Unexpected failure modes are a fundamental problem for non-deterministic systems. So instead of praying for GPT-6.4 Codex Ultra High Extended to save us all, what if we focused instead on answering the question: how do we get the most out of today's models? And the next point that Kyle makes is the one that I was saying before, which is that most of us who have been dabbling in these systems, be it OpenClaw or Claude Code or Codex, have been doing harness engineering whether we realize it or not. There are lots of ways, he continues, to get better performance out of your coding agent. If you use coding agents for moderately hard tasks, you've probably configured your coding agent a bit. Have you used skills? MCP servers? Subagents? Memory? AGENTS.md files? A coding agent equals AI model plus harness. These are all technically separate concepts, but they are all part of the coding agent's configuration surface. Basically: what does the model use to interact with its environment? Harness engineering, they write, describes the practice of leveraging these configuration points to customize and improve your coding agent's output quality and reliability. They continue by arguing that context engineering is the subset of harness engineering that primarily involves leveraging harness configuration points to carefully manage the context window of coding agents. Harness engineering answers: how do we give our coding agents new capabilities? How do we teach them things about our code base that aren't in the training data? How do we increase task success rates beyond magic prompts? And one of the things that they point out is that harnesses aren't just one thing. To some extent, harnesses work backwards from what models can't do natively to create some component to solve for that. In another post from Viv from LangChain, called The Anatomy of an Agent Harness, Viv added a chart that showed the desired agent behavior versus what the harness adds. For example, the simple one that's a part of every Claude Code session: if the desired agent behavior is to write and execute code, the harness adds bash and code execution. If the desired agent behavior is safe execution and default tooling, the harness adds sandboxed environments and tooling. If the desired agent behavior is remembering and accessing new knowledge, the harness is going to need to provide memory files, web search, and MCPs. And importantly, when you've heard about all of these techniques like Karpathy's auto-research or the Ralph Wiggum loops, those are harness additions to get to the desired agent behavior of completing long-horizon work. They also point out that this is something that the big labs are talking about quite a bit now too. Back in February, OpenAI dropped a post called Harness Engineering: Leveraging Codex in an Agent-First World. The place that they start from in this post is the goal of building and shipping an internal beta of a software product with zero lines of manually written code. That has been the context through which they have had to figure out what needed to be part of the harness that they were designing. One of the big lessons from their experiments was effectively that in this new approach to engineering, they had to uncover new ways of giving the agent progressively more context.
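Before we go deeper on that, it's worth making Viv's mapping concrete. Here is a minimal sketch of what a harness adds around a bare model: a bash tool for code execution, a crude allowlist standing in for a real sandbox, and a memory file for persistence. The tool names and the allowlist policy here are my own illustration, not LangChain's or any product's actual implementation.

```python
# Minimal sketch of the desired-behavior -> harness-addition mapping.
# Everything here is illustrative, not any coding agent's real internals.
import subprocess
from pathlib import Path

MEMORY_FILE = Path("AGENT_MEMORY.md")           # harness-provided persistence
ALLOWED_CMDS = {"ls", "cat", "grep", "python"}  # crude stand-in for a sandbox

def run_bash(command: str) -> str:
    """Desired behavior: write and execute code. Harness addition: a bash
    tool. A real harness would isolate this in a sandboxed environment;
    an allowlist like this is only a toy approximation."""
    parts = command.split()
    if not parts or parts[0] not in ALLOWED_CMDS:
        return f"blocked: {command!r} is outside the sandbox policy"
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout or result.stderr

def remember(note: str) -> str:
    """Desired behavior: retain knowledge across sessions.
    Harness addition: an append-only memory file."""
    with MEMORY_FILE.open("a") as f:
        f.write(note + "\n")
    return "noted"

def recall(_: str = "") -> str:
    """Read back whatever previous sessions wrote down."""
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "(no memory yet)"

# The model never changes; the harness decides what it can reach.
TOOLS = {"bash": run_bash, "remember": remember, "recall": recall}
```

None of this makes the model smarter; it just changes what the model can touch, which is the whole point of the anatomy framing.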
This idea of giving the agent progressively more context is one you might have heard me talk about before, called progressive disclosure, and it is a key part of the way that agent skills have been designed: skills that provide context effectively unfold, with the agent able to access the minimum amount of information needed to know whether it should go deeper into that skill, without having to crowd out its context window with all sorts of unnecessary information. The key part of the story, though, is in some of the last lines in the post. They conclude: our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal of building and maintaining complex, reliable software at scale. That is a very different proposition than just making a model better. All right folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is we bought some tools, you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise: how work gets done, how teams collaborate, how decisions move, not as a tech initiative, but as a total operating model shift. And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us/AI. That's www.kpmg.us/AI. With the emergence of AI code generation in 2022, Nvidia master inventor and Harvard engineer Sid Pareshi took a contrarian stance: inference-time compute and agent orchestration, not pre-training, would be the key to unlocking high-quality AI-driven software development in the enterprise. He believed the real breakthrough wasn't in how fast AI could generate code, but in how deeply it could reason to build enterprise-grade applications. While the rest of the world focused on copilots, he architected something fundamentally different: Blitzy, the first autonomous software development platform leveraging thousands of agents that is purpose-built for enterprise-scale code bases. Fortune 500 leaders are unlocking 5x engineering velocity and delivering months of engineering work in a matter of days with Blitzy. Transform the way you develop software. Discover how at blitzy.com. That's blitzy.com. Let's face it: if you're leading GRC at your organization, chances are you're drowning in spreadsheets. Balancing security, risk, and compliance across shifting threats and regulatory frameworks can feel like running a never-ending marathon. Enter Drata's agentic trust management platform. Designed for leaders like you, Drata automates the tedious tasks like security questionnaire responses, continuous evidence collection, and much more, saving you hundreds of hours each year. With Drata, you spend less time chasing documents and more time solving real security problems. But it's more than just a time saver. It's built to scale and adapt to your organization's needs, whether you're running a startup or leading GRC for a global enterprise. With Drata, you get one centralized platform to manage your risk and compliance program. Drata gives you a holistic view of your GRC program and real-time reporting your stakeholders can act on.
With Drata, you can also unlock a powerful trust center, a live, customizable product that supports you in expediting your never-ending security review requests in the deal process. Share your security posture with stakeholders or potential customers, cut down on back-and-forth questions, and build trust at every interaction. If you are ready to modernize your GRC program and take back your time, visit drata.com to learn more. This podcast is brought to you by Mercury, banking designed to work the way modern software does. One thing I've always found weird as a founder is that almost every tool you use to run a company is modern. Your analytics tools, your email tools, your AI tools, they all feel like software built in, you know, the last decade. Then you go to banking and suddenly it feels like you've time traveled back to the 70s. That's why I use Mercury. It's business banking that actually works like the rest of the tools founders rely on: clean interface, everything where you expect it, and basic things like wires, cards, or permissions taking a couple clicks instead of a phone call and three forms. For the whole AIDB ecosystem, it is just dramatically simpler. You can see everything from the dashboard, control spend, and give the right people access without handing over the whole account. If you run a company and you're tired of banking feeling like the one tool that never modernized, check out Mercury. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. Now, as we get more discourse about harness engineering, we also get more maps and guides for what it actually means. In an Aetna Labs post, for example, they describe the key components of a harness as a three-layer architecture: an information layer, which determines what information an agent can see and what capabilities it can invoke (that is, memory and context management, plus tools and skills); an execution layer, which determines how work is decomposed, how agents collaborate, and how to recover in the case of failure (basically orchestration and coordination, plus infrastructure and guardrails); and a feedback layer, which determines how the system can improve over time, whether the results of the execution are verified, and whether each failure is recorded and transformed (so evaluation and verification, plus tracing and observability). And there is increasing evidence out there of the power of harnesses. Blitzy, who has over the last year been a frequent sponsor and collaborator on the show, recently posted a 66.5% score on SWE-bench Pro, much higher than, for example, GPT-5.4's 57.7%. Now, effectively, Blitzy's whole thesis could be reframed as being that the harness layer, the agent scaffolding, the orchestration, and the context infrastructure wrapped around the foundation models can unlock bigger performance gains than the models themselves. One of the key things that they found when auditing their performance versus GPT-5.4 is that in many cases GPT-5.4's failures weren't catastrophic. It got close on every problem, but missed intricate details in corner cases. When Blitzy succeeded on those same tasks, it succeeded because its knowledge graph gave its agents deep codebase context that a raw model doing a single pass couldn't match. LangChain has also recently been writing about how they've been improving agent performance with harness engineering as well.
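As a rough sketch of how that three-layer framing might show up as an actual harness configuration, here is one way to write it down in code. The names and fields are my own paraphrase of the layers as described, not anyone's real schema.

```python
# A sketch of the three-layer harness framing as configuration objects.
# Field names are illustrative paraphrases of the layers, not a real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class InformationLayer:
    # What the agent can see and what capabilities it can invoke.
    tools: dict[str, Callable] = field(default_factory=dict)
    skills: list[str] = field(default_factory=list)
    memory_paths: list[str] = field(default_factory=list)

@dataclass
class ExecutionLayer:
    # How work is decomposed, how agents collaborate, how to recover.
    max_subagents: int = 4
    max_retries: int = 2
    sandboxed: bool = True

@dataclass
class FeedbackLayer:
    # How the system verifies results and improves over time.
    verifier: Callable[[str], bool] = lambda result: bool(result)
    trace_log: list[dict] = field(default_factory=list)

@dataclass
class Harness:
    information: InformationLayer
    execution: ExecutionLayer
    feedback: FeedbackLayer

# A default harness: empty tools, conservative execution, trivial verifier.
harness = Harness(InformationLayer(), ExecutionLayer(), FeedbackLayer())
```

The useful part of the framing is that each layer can be tuned independently: you can swap in better tools without touching orchestration, or tighten verification without changing what the agent sees.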
Nicholas Charier thinks that some amount of consensus around the power of the harness is starting to happen. The AI entrepreneur recently wrote a post on X called The Great Convergence, and it runs through some similar themes as my episode Every AI Product Is Becoming Every Other AI Product, but puts them in the context of the harness. Nicholas writes: over the last year, a strange thing has happened in tech. Very different companies have started moving towards the same product shape, and it feels like everyone is building the same thing. Linear announced last week that they're building coding agents. OpenAI is deprecating Sora and focusing entirely on Codex. Anthropic is obviously all in on Claude Code and Cowork. Notion is building agents for work. So are Google, Microsoft, Meta, Lovable, Retool, and many others. What changed, he writes, is not just that models got better, although that's a major part of it. He argues that the important shift is the invention of what he calls the general harness. In the simple harness architecture diagram that he describes, a user input hits context engineering, which moves to the model, which calls tools, which feed back into context engineering in a loop until the task result comes out on the other side. Nicholas writes: Claude Code was a massive breakthrough. Although initially invented for coding use cases, it turns out that a smart looping agent generalizes incredibly well towards any computer-based task if you give it the right tools. So this new technique emerges and turns out to be a general problem solving machine. It also scales on a very unique dimension: it can keep running for a long time. It takes the shape of a model, a harness, a goal, and a set of tools. It runs in a loop, calling tools until it stops and produces a result. He points out that so many of the new agents that we're seeing all come back to just this: a harness, a looping agent architecture with the right tools and context management. Ultimately, he predicts that by the end of 2026, many software companies will look like they are selling the same thing. He writes: that's not because the industry lost imagination, but because the architecture and economics are pushing everyone towards the same destination, self-improving software systems that can take a goal, use tools, and produce business outcomes. The harness, he writes, explains the convergence. The self-improvement explains the acceleration. Once agents can be monitored, evaluated, orchestrated, and improved by changing their own code and context, the companies that own more of the loop will improve faster and their progress will compound. The winners, he says, will not just have better models; they will have distribution, trusted workflows, positioning, proprietary context, and the shortest path from observation to improvement. Which brings us back to Anthropic's Managed Agents, because what they show is that we're now at the point where we're not only recognizing harness engineering, we're starting to build towards inevitable changes in what harness engineering is. Remember, the subtitle from their accompanying blog post is: harnesses encode assumptions that go stale as models improve. Managed Agents is built around interfaces that stay stable as harnesses change. Here's how they set it up: a running topic on the engineering blog is how to build effective agents and design harnesses for long-running work. A common thread across this work is that harnesses encode assumptions about what Claude can't do on its own.
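Here is a minimal sketch of that general harness shape: a model, a goal, and a set of tools, run in a loop until it produces a result. The call_model function is a hypothetical stand-in for a real chat API that returns either a tool call or a final answer; the dict shapes here are my own invention.

```python
# Minimal sketch of the "general harness" loop: model calls tools in a
# loop until it signals completion. The action format is illustrative.

def call_model(goal: str, history: list[str]) -> dict:
    """Hypothetical placeholder: a real implementation would call a chat
    API and parse its response into either a tool call or a final answer."""
    raise NotImplementedError("wire up your model provider here")

def run_agent(goal: str, tools: dict, max_steps: int = 20) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        action = call_model(goal, history)        # the brain: the model
        if action.get("done"):                    # model signals completion
            return action["result"]
        tool = tools[action["tool"]]              # the hands: the harness
        observation = tool(action["input"])
        # Feed the observation back in: context engineering, in a loop.
        history.append(f"{action['tool']}({action['input']!r}) -> {observation}")
    return "stopped: step budget exhausted"
```

Swap out the tools dictionary and the goal, and the same dozen lines describe a coding agent, a research agent, or a work agent, which is exactly why the shape generalizes.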
That line about encoding assumptions goes back to the idea from the LangChain blog that the harness is about adding things, things that aren't in the model natively, to address a certain desired agent behavior. Continuing, Anthropic writes: however, those assumptions need to be frequently questioned, because they can go stale as models improve. As just one example, in prior work we found that Claude Sonnet 4.5 would wrap up tasks prematurely as it sensed its context limit approaching, a behavior sometimes called context anxiety. We addressed this by adding context resets to the harness, but when we used the same harness on Claude Opus 4.5, we found that the behavior was gone; the resets had become dead weight. We expect harnesses to continue evolving, so we built Managed Agents, a hosted service in the Claude platform that runs long-horizon agents on your behalf through a small set of interfaces meant to outlast any particular implementation, including the ones we run today. Basically, they say, building Managed Agents meant solving an old problem in computing: how to design a system for programs as yet unthought of. Another way to put it is that Anthropic is building a meta-harness, a system that's deliberately unopinionated about what any specific harness should look like, because they expect harnesses to keep changing as models improve. Going back to that brain-versus-hands metaphor: effectively, Anthropic separated the agent loop (the brain) from the execution environment (the hands, or sandbox), and even separated it from the event log, which is the session. Each can fail or be replaced independently. And I think that this has big implications for the debate that we started on, of big harness versus big model. In the way that Latent Space framed it, it sort of obliterates that debate. Anthropic is saying, effectively: yes, harness engineering matters. In fact, it matters so much that we're building infrastructure to make harnesses disposable. The whole point is that any given harness is temporary. The discipline is permanent; the specific implementation is not. But let's talk about why any of us should care. First of all, if you use Claude Code, Cursor, Codex, or OpenClaw, which is by far the most successful open source harness we've had so far, you are already doing harness engineering, whether you call it that or not. Every time you write an AGENTS.md file, structure your repo so the agent can navigate it, anything like that, you're building an outer harness. Birgitta Böckeler actually distinguishes between the inner harness and the outer harness. Basically, there's the inner harness, which is built by the builders of the coding agent, that is, Anthropic or OpenAI, and the outer harness, which is built by you, the user. The outer harness is what's going to determine whether the agent produces good work around your specific code base or your specific goals. But I would argue that that is not the only reason to care about harness engineering. If you're an enterprise leader, the mental model matters because it reframes AI adoption from pick the best model to pick the best environment for agents to work in. It is a technical capstone on the larger truth which all enterprises are realizing: that AI success is not about dropping in a tool and hoping it works, but about designing a new system in which the capability set that AI has, and that AI enables among your people, can thrive. One could torture the harness engineering frame of reference and frankly extrapolate it out to the entire question of organizational design.
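To make that brain, hands, and session separation concrete, here is a rough sketch of what decoupled interfaces like that could look like. These Protocols are my own illustration of the decoupling idea, not Anthropic's actual API.

```python
# Sketch of the decoupling idea: the loop (brain), the execution
# environment (hands), and the event log (session) sit behind separate
# interfaces, so each can fail or be swapped independently.
# These Protocols are illustrative, not Anthropic's real interfaces.
from typing import Protocol

class Brain(Protocol):
    def next_action(self, events: list[dict]) -> dict: ...

class Hands(Protocol):
    def execute(self, action: dict) -> dict: ...

class Session(Protocol):
    def append(self, event: dict) -> None: ...
    def replay(self) -> list[dict]: ...

def step(brain: Brain, hands: Hands, session: Session) -> None:
    """One turn of the loop. If the sandbox (hands) dies mid-task, the
    session survives, and a fresh sandbox can resume from the replayed
    events without the brain having to start over."""
    action = brain.next_action(session.replay())
    session.append({"type": "action", "body": action})
    observation = hands.execute(action)
    session.append({"type": "observation", "body": observation})
```

Because the session is the only durable piece, any particular harness implementation really can be disposable: throw away the loop or the sandbox, keep the log, and carry on.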
Simply put, the model and the tools are necessary but insufficient; the environment you put them in is going to determine the output quality. And lastly, if you're just a consumer watching this, hopefully understanding harness engineering a little bit better will go some way toward explaining why every product seems to be turning into every other product. When we understand that the core loop of models calling tools in a loop until they are done actually turns out to be general purpose, you understand why Linear is building coding agents and why Notion is building work agents and why they're all heading towards the same place. The harness enables them to use the models to accomplish whatever goal they set out to accomplish, with a process and a pattern that starts to look really familiar. Anyways, guys, hopefully this was a valuable primer on a term that you're going to be hearing a lot more of. But for now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and until next time, peace.
