AI Writes 100% of the Code. You’re Still Liable for 100% of It.
Microsoft’s new DELEGATE-52 research is the clearest evidence yet that “AI-generated” is not the same as “approved.” Here’s what engineering leaders should actually do about it.
There is a new bet running through every engineering org I talk to in 2026.
“AI can write all the code.”
The real question is how much of that code you can trust.
No matter where code comes from, you’re still responsible for it.
The AI didn’t sign your customer agreement. The AI doesn’t carry the SOC 2 attestation. The AI doesn’t get paged at 3am. The AI doesn’t get fired when production crashes.
You did. Your team did. Your company did.
That accountability hasn’t moved. You can’t point the finger at AI when things go wrong.
AI deleted your production database? Sorry, it’s your fault.
No matter how good the AI models get, that won’t change.
Which means the rule is simple:
AI can write 100% of your code, but humans still have to review 100% of it, because humans are still 100% responsible for it.
You can’t dodge accountability.
Microsoft just published new research that turns this from a philosophical argument into a measurable one.
What Microsoft’s DELEGATE-52 just showed us
DELEGATE-52 is a benchmark Microsoft Research just released. It simulates long delegated workflows across 52 professional domains, from code and databases to financial statements and music notation.
Think about your long chats with AI. You change things. You change them back. Over time, a long context of edits piles up.
That’s exactly what DELEGATE-52 was built to measure.
The test setup is deceptively simple. Give a model a document. Ask it to make a structural edit, the “forward” edit. Then ask it to undo that edit, the “backward” edit. Chain those round trips together. After 20 interactions, compare what you have to what you started with.
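To make the setup concrete, here is a minimal sketch of one round trip. Nothing below comes from the paper itself: edit_document is a hypothetical wrapper around whatever model API you use, and the plain text-similarity ratio is only a crude stand-in for the paper’s actual fidelity metric.

```python
import difflib

def edit_document(document: str, instruction: str) -> str:
    """Hypothetical wrapper around a model API: send the full document
    plus an edit instruction, get back the full rewritten document."""
    raise NotImplementedError("plug in your model client here")

def round_trip_fidelity(original: str, forward: str, backward: str, rounds: int = 10) -> float:
    """Chain forward/backward edit pairs, then measure how much of the
    original document survives (1.0 = identical, 0.0 = nothing left)."""
    doc = original
    for _ in range(rounds):
        doc = edit_document(doc, forward)   # e.g. "Move section 3 above section 2."
        doc = edit_document(doc, backward)  # e.g. "Undo that edit; restore the original order."
    # Crude text-similarity score in place of a domain-aware fidelity metric.
    return difflib.SequenceMatcher(None, original, doc).ratio()
```

Ten of these round trips is the 20-interaction mark the paper measures at.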
If the model is doing its job, you end up where you started.
Except you don’t!
Across the 19 models tested, average degradation was about 50% of the original content. The frontier tier (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) still corrupted roughly 25% of content by the end of long workflows. Only one domain consistently cleared a 98% fidelity threshold across the frontier: Python code. Almost everything else slipped, and natural-language and niche domains failed badly.
The headline number is bad enough. The detail of the failure is worse.
Weak models, the paper found, fail by deletion. Things go missing. That’s noticeable. You can spot it in a diff.
Frontier models fail by plausible rewriting. The shape of the content stays the same. The vocabulary stays the same. The output looks like what came before. But the substance has drifted. Subtly, defensibly, in ways that survive a normal review.
Read that again, because it’s the whole argument.
The better the model, the more dangerous the corruption, because the corruption looks like it isn’t there.
“I read the diff and it looks fine.”
That’s how most engineering teams review AI code today. It’s also the exact review this kind of corruption is built to slip past.
Two findings that should reset how you evaluate AI tools
Buried in the paper are two more results that matter at least as much as the corruption rate.
#1 Two prompts can’t predict twenty
Microsoft was explicit: short prompt sessions tell you nothing about how the model behaves in a long session. If your team’s “we tested it and it worked” story is built on a few prompts and a thumbs-up, that story doesn’t hold up.
The model that looked great in the demo is the same one that loses 25% of the document by the 20th prompt.
#2 Even Claude Code can’t save you
The kind of agentic loop Claude Code runs didn’t improve DELEGATE-52 scores. The problem is in the model itself, not the wrapper around it.
Want another piece of evidence?
Look at Chroma’s “context rot” research, which tested 18 frontier models and found that every one of them degrades as input context grows, well before the advertised window fills up.
Chroma found that three things go wrong at once.
Models pay attention to the start and end of the context but skim the middle. As context grows, there are quadratically more pairs of tokens to attend to, so focus dilutes. And similar-looking content actively misleads the model.
None of these are bugs. They’re how the AI models work.
Two independent research efforts came to the same conclusion:
Output quality drops as sessions get longer and context gets wider. The models can’t tell. Reviewers usually can’t either.
And nowhere does this hit harder than on real, large codebases.
The prototype trap
The research confirms what we keep reading about on LinkedIn and social media.
People rave about the amazing prototypes they built over a weekend. Meanwhile, enterprise software architects remain unimpressed with AI.
The research shows both camps can be right at the same time.
The prototype thing people rave about is real. You describe an app, AI builds it in 30 minutes, the demo works. Anyone who says otherwise hasn’t tried it.
Then point the same AI at your actual codebase. 200,000 lines of legacy Python. Fifteen internal frameworks. Business logic that took ten years to get right.
The wheels come off. The architects were right.
Bigger codebases trigger context rot immediately. Enterprise sessions are never one-shot, so corruption compounds. And the failures hide better in 200 files than in 5. A 5% silent corruption rate across an enterprise repo is invisible until production breaks.
The numbers map onto this directly. Python on simple workflows: 98% fidelity. Most other domains over long sessions: 25-50% degradation.
“If AI built that demo in 30 minutes, imagine what it’ll do for our platform.”
That’s the trap. The demo tells one story. The first production incident tells another.
How to use AI you can’t trust
If humans are responsible for the code, the engineering moves are obvious. The goal isn’t using less AI. It’s using AI in a way humans can actually review.
Tip #1: keep sessions short enough that a human can still hold the diff in their head. DELEGATE-52’s whole point is that long sessions accumulate plausible-looking corruption. If your agent ran a long session and produced a 2,000-line diff, no human is going to actually review that, no matter what the PR comment says. Cap session length. Commit and restart aggressively. Treat each agent session like a one-off contractor job, not a long-running employee.
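Here is a rough sketch of what that cap could look like in an orchestration script. The agent.step() interface, the turn limit, and the checkpoint commit are all assumptions to adapt, not any particular tool’s API.

```python
import subprocess

MAX_TURNS_PER_SESSION = 8  # assumption: tune to what a reviewer can hold in their head

def run_capped_session(agent, task: str) -> None:
    """Run an agent session, but cut it off and checkpoint after a fixed
    number of turns. `agent` is a hypothetical object whose step(task)
    method returns True when it considers the task done."""
    turns_used = 0
    for _ in range(MAX_TURNS_PER_SESSION):
        turns_used += 1
        if agent.step(task):
            break
    subprocess.run(["git", "add", "-A"], check=True)
    # The commit is allowed to no-op if the session produced no changes.
    subprocess.run(["git", "commit", "-m", f"agent checkpoint ({turns_used} turns)"])
```

The next session starts from that commit with a fresh context, instead of inheriting a long tail of earlier edits.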
Tip #2: narrow the context, don’t widen it. Repo-wide context windows feel like progress. They’re the exact thing that triggers context rot. For most edits the model only needs the file, the interfaces it touches, and the relevant test. That’s 2-10K tokens, not 200K. Build that into your tooling. The model won’t do it on its own.
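One way to make that concrete is to assemble the context by hand instead of letting the tool slurp the repo. Everything below is an assumption to adapt: the 4-characters-per-token estimate, the 10K-token budget, and the choice to fail loudly when the bundle gets too big.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4             # rough approximation for English text and code
CONTEXT_BUDGET_TOKENS = 10_000  # assumption: enough for a single targeted edit

def build_edit_context(target: str, interfaces: list[str], tests: list[str]) -> str:
    """Bundle only the files a targeted edit needs, and refuse to grow
    silently past the budget."""
    parts = []
    for path in [target, *interfaces, *tests]:
        parts.append(f"# ==== {path} ====\n{Path(path).read_text()}")
    context = "\n\n".join(parts)
    approx_tokens = len(context) // CHARS_PER_TOKEN
    if approx_tokens > CONTEXT_BUDGET_TOKENS:
        raise ValueError(
            f"Context bundle is ~{approx_tokens} tokens; trim it instead of "
            "widening it toward the whole repo."
        )
    return context
```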
Tip #3: verify by execution, not by reading. This is the move that actually catches plausible corruption. Tests, type checks, property checks, integration checks. Those are the only review steps that aren’t fooled by output that looks right. If your CI doesn’t catch what an LLM silently changes, your reviewers won’t either. The model produces exactly the kind of change designed to slip past visual review.
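A minimal sketch of an execution gate, assuming a Python repo that already has pytest and mypy; swap in whatever test runner, type checker, and property-testing tool your stack actually uses.

```python
import subprocess
import sys

# Checks that verify behavior by running it, not by reading the diff.
CHECKS = [
    ["pytest", "-q"],   # unit, property, and integration tests
    ["mypy", "src/"],   # type checks catch silently changed signatures
]

def main() -> int:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"FAILED: {' '.join(cmd)}", file=sys.stderr)
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

If a silently rewritten function still passes every one of these, either the corruption doesn’t matter yet or your checks have a gap worth closing.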
Tip #4: you can’t generate more AI code than you can review. This is the one that actually matters. Either grow review capacity, narrow where AI is allowed to work, or slow how fast AI code enters your repo. Anything else just defers the risk to a future incident.
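One blunt way to enforce that is a CI gate on diff size against a review budget. The 400-line budget below is an assumption; the point is that the number exists and is enforced, not what it is.

```python
import subprocess
import sys

REVIEW_BUDGET_LINES = 400  # assumption: lines one reviewer can genuinely review per PR

def changed_lines(base: str = "origin/main") -> int:
    """Count added plus deleted lines against the base branch."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted = line.split("\t")[:2]
        if added != "-":  # binary files report "-" instead of a count
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > REVIEW_BUDGET_LINES:
        print(f"Diff is {n} changed lines, over the {REVIEW_BUDGET_LINES}-line review budget.")
        sys.exit(1)
```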
The bottom line
The marketing pitch for AI coding tools is “AI writes code for you.”
Anyone who’s actually used AI knows the real version: AI writes the code for you, and you spend all day making sure it actually works.
DELEGATE-52 is not just an interesting benchmark. It’s evidence in the case against trusting code you did not personally read, did not personally test, and did not personally vouch for.
The teams that get this in 2026 will ship faster with AI than the teams that don’t. They’ll spend less time hunting plausible-looking bugs that no human introduced. And a lot less time explaining to a customer why something they thought was reviewed wasn’t, really.
Your AI assistant is a really productive intern.
Don’t ship its code without reading it.
You’re the one whose name is on it.
Further reading
LLMs Corrupt Your Documents When You Delegate (arXiv:2604.15597)
Context Rot: How Increasing Input Tokens Impacts LLM Performance (Chroma Research)

