The Human in the Loop

Three stayed local. One didn't

Four coding CLIs went behind a proxy. Three stayed local. One uploaded the entire workspace. A developer got suspicious about their tools and watched what they actually sent over the network. Grok CLI was shipping everything to xAI's cloud. Full git history. The .env file, secrets included. It kept doing it when the prompt said not to read files. It kept doing it with "improve the model" switched off. Claude Code, Codex and Gemini stayed local. So this is one tool, not a rule for all of them. Nobody caught this through an audit or a disclosure. One person got curious and looked. A coding agent has to read your repo. That is the job. But reading your repo and uploading your repo are two different things, and nothing in the install flow tells you which one you agreed to. I covered this on this week's episode of The Human in the Loop, along with what it means for anyone running agents inside a regulated environment. #AISecurity #CodingAgents #TheHumanInTheLoop

Incidental Learning Is Dying

1w ago · 00:21:44

Last week I closed a bug I never understood. The agent found it, fixed it, explained it. I read the explanation, nodded, shipped. Ten minutes. A year ago that same bug would have cost me an afternoon, and I would have come out understanding the subsystem. This time I understood nothing. I just had a green check. There is a name for what I lost. Incidental learning: the understanding that was never in the ticket. It was a side effect of doing the work. Here is the uncomfortable part. Your talent pipeline is a side effect of work you can now automate. No engineer picks the slow path on purpose. Not when the fast path ships today and the sprint ends Friday. So the learning does not come back on its own. Someone has to design it back in. But how do we manage that while still meeting delivery expectations? #AIAgents #EngineeringLeadership #TheHumanInTheLoop

The AI Access Layer

3w ago · 00:20:24

The most powerful AI model in general release this week requires government approval to access. OpenAI launched Sol, Terra, and Luna (restricted to around 20 companies). That's not a GTM decision. That's a policy one. Before Fable and this one, I assumed frontier AI access was fundamentally a commercial problem. Pay enough, move fast enough, you reach the frontier. That assumption changed last week. At the same time: Microsoft shipped MAI-Code-1-Flash into GitHub Copilot, not routed through OpenAI. JetBrains made Codex the default agent in its IDE. GitHub Desktop went Copilot-native. Every major coding surface made a model allegiance decision in seven days. Here's what that means in practice: the layer controlling which model reaches your developers is becoming a real battleground. Which IDE your org standardizes on, which vendor your enterprise GitHub defaults to. Access to frontier AI is no longer just a technical problem. It's a distribution one. Most teams haven't thought about this yet. They pick a tool, they use it. But if that layer gets locked in before you notice, you may not get a second choice. Full breakdown in this week's episode of The Human in the Loop. #AIEngineering #EnterpriseAI #TheHumanInTheLoop

What happens when your AI tool gets taken away

Jun 21 · 00:23:19

One export rule. One acquisition. Either can pull your AI tool out from under you before lunch. This week both happened. A US export-control directive forced Anthropic to cut off foreign-national access to Fable 5 and Mythos 5. To comply, they disabled the models broadly. One day they're in your workflow. The next, a government decides who can call them. Then SpaceX bought Cursor in a reported $60 billion deal. If your team standardized on Cursor, the owner of your IDE just changed. The AI tool you picked is a dependency too. One you don't control. And the forces that can pull it away are bigger than a deprecated library. Export rules. Acquisitions. Regulation. So "which agent is best" now sits next to a harder question. What happens to my team if this vendor gets acquired, repriced, or regulated out of our region? If your main coding model went dark tomorrow, what's your second choice? And have you actually tried it? Full breakdown in this week's episode of The Human in the Loop. Link in the comments. #AItools #VendorRisk #TheHumanInTheLoop

Fable 5 Is Gone. Now What?

Jun 13 · 00:21:49

The model you built on can disappear. Not crash. Disappear. That's not hypothetical. Last week the US government ordered Anthropic to pull Claude Fable 5 and Mythos 5. Not throttle them. Pull them. For every customer in the world, within hours of the order arriving. The reason was a security concern, not a bug in anyone's code. A decision made somewhere you'll never see, and the thing your app depended on stopped existing the same afternoon. A model you don't host is a dependency you don't control. It can change, get priced out, or get recalled by a government. None of that lands in your backlog with a warning. Have you actually tested what happens when your main model isn't there? #LLMOps #ModelRisk #TheHumanInTheLoop
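
One way to make that last question testable is a fallback path you can exercise on purpose. The sketch below is only an illustration under assumptions: `complete_with_fallback`, `AllModelsUnavailable`, and the provider callables are hypothetical names, not any vendor's SDK. Each provider is just a function that takes a prompt and returns text, so a test can swap in one that fails.

```python
from typing import Callable, Sequence


class AllModelsUnavailable(Exception):
    """Raised when every configured provider failed (hypothetical helper)."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a recalled or unreachable model surfaces here
            errors.append(f"{name}: {exc}")
    raise AllModelsUnavailable("; ".join(errors))


# Simulate the primary model going dark and check the second choice answers.
def primary_gone(prompt: str) -> str:
    raise ConnectionError("model no longer available")


def second_choice(prompt: str) -> str:
    return "ok: " + prompt


used, reply = complete_with_fallback(
    "summarize this diff",
    [("primary", primary_gone), ("secondary", second_choice)],
)
print(used, reply)
```

Running the simulated outage in CI is the point: the second choice only counts if it has actually been called.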

When 80% of the Code Isn't Yours

Jun 9 · 00:22:00

You're still measuring AI by whether it writes good code. That's already the wrong question. According to Anthropic's latest numbers, more than 80% of the code merged into their codebase is now written by Claude. The typical engineer ships 8x as much code per day as in 2024. And the length of task an AI can finish reliably is doubling roughly every four months. Four-minute jobs two years ago. Twelve-hour jobs now. But what is the human actually still doing? Their answer: not writing. Not running experiments. Setting direction. Reviewing. Deciding what's worth building and catching what slipped through. They already run automated Claude reviewers that flag bugs their best engineers missed. That quietly reframes the whole skills conversation. Most of my career has been about making output faster and cleaner. Fewer defects, quicker delivery. If output is becoming close to free, the value moves somewhere else. To judgment. What to build. When to ship. What not to trust. I don't think most IT teams are ready for that shift yet. What do you think? #AI #TheHumanInTheLoop #AILeadership

The $36,000 Engineer: When Agentic AI Stops Being a Subscription

Jun 7 · 00:23:37

Uber blew its whole 2026 AI budget in four months. Then it set a $1,500 monthly cap on each coding tool, per engineer. Claude Code, Cursor, a dashboard to watch the spend, an approval step to go over. Simon Willison did the math. Two tools, and one engineer runs about $36,000 a year. For years AI was a flat subscription. You paid once a month and you knew the number. Agentic coding turned that into a metered bill. And a metered bill does not warn you politely. It surprises you. This is the cloud invoice all over again. A team turns something on, forgets it is metered, and finds out at the end of the month. A paper this week put numbers on the risk. 63 real cases where agents blew past their limits. Often a single retry loop, quietly burning thousands before anyone looked. A cap you cannot enforce in code is just a wish. So before you scale agents across a team, the real question is not what the budget is. It is what happens, automatically, the second someone hits the ceiling. Most teams can answer the first. Almost none can answer the second. Full breakdown in this week's episode of The Human in the Loop.
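
The arithmetic and the "what happens at the ceiling" question can both be sketched in a few lines. This is a minimal illustration, not how Uber or any vendor implements it: `SpendGuard`, `BudgetExceeded`, and `annual_ceiling` are hypothetical names, and the only figures taken from the post are the $1,500 monthly cap and the two tools.

```python
from dataclasses import dataclass, field

MONTHLY_CAP_USD = 1500.0  # per tool, per engineer (figure from the post)


class BudgetExceeded(Exception):
    """Raised when a charge would push a tool past its monthly cap."""


def annual_ceiling(cap_usd: float, tools: int, months: int = 12) -> float:
    """Worst-case yearly spend for one engineer who hits every cap."""
    return cap_usd * tools * months


@dataclass
class SpendGuard:
    """Hypothetical in-process guard: refuses a charge instead of reporting it later."""
    cap_usd: float = MONTHLY_CAP_USD
    spent: dict = field(default_factory=dict)  # tool name -> USD spent this month

    def charge(self, tool: str, cost_usd: float) -> float:
        new_total = self.spent.get(tool, 0.0) + cost_usd
        if new_total > self.cap_usd:
            # The automatic answer to "what happens at the ceiling": the call fails.
            raise BudgetExceeded(f"{tool}: {new_total:.2f} exceeds cap {self.cap_usd:.2f}")
        self.spent[tool] = new_total
        return new_total


print(annual_ceiling(MONTHLY_CAP_USD, tools=2))  # 36000.0

guard = SpendGuard()
guard.charge("agent-a", 1400.0)
try:
    guard.charge("agent-a", 200.0)  # a runaway retry loop would land here
except BudgetExceeded as err:
    print("blocked:", err)
```

The design choice worth noticing is that the check runs before the spend is recorded, so a retry loop stops at the cap rather than showing up on a dashboard afterwards.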

AI Ships Faster Than Anyone Can Review It

May 31 · 00:22:09

Meta says AI writes 80% of new code. Their own reviewers can't keep up with their own AI. Straight from their engineering blog. They built RADAR to auto-review low-risk diffs because "the share of diffs receiving timely review has declined." Their words. AI-generated code outpaced human review capacity. Read that with the rest of the week's news. Cognition says Devin merged 7x more PRs year-over-year. AI-written commits inside customer codebases jumped from 16% to 80%. Anthropic shipped Opus 4.8 on Wednesday, and every IDE, gateway, and agent runner had it the same day. They also disclosed a $47B revenue run-rate. The "is this a real business" debate is over. But here is what keeps coming back to me: Shipping more code faster is only a win if the systems that catch problems scale at the same rate. This week, the evidence says they aren't. A new arXiv study of 20,574 real coding-agent sessions documents how often agents do something other than what was asked. ITBench-AA, the first serious benchmark for agentic IT work, scored every frontier model below 50%. Adoption is real. The guardrails are not. This week's episode of The Human in the Loop covers all of it: the shipping wave, the cost-control backlash starting inside eng departments, and why ITBench-AA matters more than the score suggests.

Why Your Coding Agent's PRs Keep Getting Rejected

May 24 · 00:20:10

The model isn't the problem. I went back through 20 of my agent's pull requests and the failures looked exactly like a junior's first month. 3 of them tried to rewrite things nobody asked them to rewrite. 5 skipped the test, or wrote a test that would have passed either way. 4 fixed the bug but broke something else in the process. I used to assume model quality was the main driver. It isn't. The agent doesn't ship a one-line fix. It opens a change touching twelve files, half of them unrelated to the bug. It writes the code and skips the test. Or it writes a test that proves nothing. But notice that none of these are model problems. They're the same review failures a junior would ship. Just at higher volume and with more confidence. The practical move: stop logging "the agent failed" and start logging why. Counting changed how I prompt and how I scope tickets. It told me more about my codebase than any benchmark score ever has. If your top three match mine, you're seeing what everyone else is seeing. If they differ, that's signal about your codebase or your prompts. What's your top rejection reason? Full breakdown in this week's episode of The Human in the Loop. Link in the comments.
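
Logging why, not just that, can be as small as one label per rejected PR and a tally. The sketch below is an assumption about how you might do it, not the tooling used in the episode: the reason labels and `top_rejection_reasons` are hypothetical, and the sample log only encodes the three counts given above (3, 5, and 4).

```python
from collections import Counter


def top_rejection_reasons(reasons, n=3):
    """Return the n most frequent rejection reasons as (reason, count) pairs."""
    return Counter(reasons).most_common(n)


# One label per rejected PR; labels are illustrative, counts are from the post.
rejection_log = (
    ["unrequested_rewrite"] * 3
    + ["missing_or_vacuous_test"] * 5
    + ["fixed_bug_broke_something_else"] * 4
)

for reason, count in top_rejection_reasons(rejection_log):
    print(f"{count:>2}  {reason}")
```

A free-text field would work too, but a short fixed set of labels is what makes the top three comparable week over week.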

Counting Keystrokes to Prove the Team Can Write

May 17 · 00:24:49

Counting accepted Copilot suggestions to prove AI works is like counting keystrokes to prove the team can write. It is the cleanest number on the dashboard. It is also the one that tells you nothing. Forty years ago Fred Brooks split software work into two parts. The accidental: syntax, boilerplate, scaffolding. The essential: what to build, why, for whom, what to trade off. The accidental is what AI tools are good at. That is why the dashboards look spectacular. Lines generated. Suggestions accepted. Prompts sent. The tools were always going to win that part. The numbers that should actually move sit one layer deeper. Cycle time. Change failure rate. Time to first PR review. Defect density. These were already telling you whether the team was shipping good software, long before AI showed up. AI either bends them or it does not. If cycle time has not moved, suggestion-acceptance is a vanity stat. If change failure rate has not dropped, you are not shipping faster. You are writing more code, faster. If time to first review has not shortened, your reviewers are the bottleneck and Copilot cannot fix that. GitHub shipped team-level Copilot metrics this week. It made the wrong question easier to ignore, not harder. Which second-order metric has actually moved on your team since you rolled out an AI coding tool? Full breakdown in this week's episode of The Human in the Loop.
