The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning

“Any code that you are writing that is compensating for model unreliability will have a half-life of just months” [3]. Lucas, a Research PM at Anthropic, said that in his Code with Claude 2026 keynote, and the half-life framing has shaped how we think about every harness decision since.

The corollary is just as important: code that connects your model to your world tends to compound. Your custom tools, your data, your auth, your specific context. The model cannot absorb what it cannot see.

Most developers treat their harness like a build-once artifact. You scope it, you build it, and you move on. The people who built Claude Code do the opposite. They rewrite it continuously. The discipline is not construction, it is gardening. And the question every harness builder needs to keep asking is: which parts of your harness are on a depreciation schedule, and which ones are appreciating?

Figure 1 - Hero diagram showing harness components being pruned by model generation, with a before/after comparison of V1 and V2 architectures. Title overlay: "The Harness Evolution Principle."

Figure 1 - The harness evolution principle: Every harness component encodes an assumption about what the model cannot do alone. As the model improves, those assumptions expire. Mature harness work looks less like building structure up and more like pruning it down. One model generation is often enough to retire a component that took real engineering effort to build correctly.

Where This Fits: The Three-Era Arc

If you have not read the predecessor article on the three eras of AI engineering, here is a quick orientation. AI engineering moved from prompt engineering (2022 to 2024), to context engineering (2024 to 2025), to harness engineering (2025 onward). Each era absorbed the prior one. In the harness era, the orchestration code wrapped around the model is where the performance lives.

What the three-era framing does not capture in full is how the harness itself evolves inside Era 3. It is not static infrastructure. Every harness design decision is made at a specific point in model capability space, and that space keeps moving. This article is specifically about that motion and what it demands from practitioners.

The pruning principle is Era 3’s hardest discipline to internalize. Era 2 instincts said “add more: more context, more retrieval, more structure.” Era 3 evidence says auditing what you can now subtract is just as important as knowing what to add.

The V1/V2 Case Study: Concrete Proof

The best-documented example of this principle in production comes from Anthropic’s own engineering team. Prithvi Rajasekaran published the full story on the Anthropic engineering blog on March 24, 2026 [1]. It is worth walking through in detail, because the numbers are unusually clean.

V1: Built for the 4.5 Generation

The V1 harness was designed around the 4.5-generation capability profile (Sonnet 4.5 and Opus 4.5). Both models exhibited what the team called context anxiety: as the context window filled, the model changed behavior. It rushed through steps, wrapped up prematurely, and declared tasks done when they were not. Context compaction did not fully solve the problem because a compacted session still carries its earlier history forward rather than starting from a clean slate.

The V1 response was a sprint-based architecture. Three agents: a planner, a generator, and an evaluator. Before each sprint, the generator and evaluator negotiated a sprint contract, defining what “done” meant for that work chunk before any code was written. Between sprints, the harness ran a full context reset: clear the window entirely, write state to a progress file, and start the next sub-session with a clean slate and just the structured handoff as context.
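The reset-and-handoff loop can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `Handoff` schema, the `run_sprints` function, and the `fake_sprint` stand-in for a model call are all hypothetical names, and the sprint contracts are hard-coded strings rather than negotiated.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Handoff:
    """Structured state carried across a context reset (hypothetical schema)."""
    completed: list[str] = field(default_factory=list)
    notes: str = ""


def run_sprints(
    sprints: list[str],
    run_sprint: Callable[[str, Handoff], str],
    progress_file: Path,
) -> Handoff:
    """Run each sprint in a fresh sub-session that sees only the handoff."""
    handoff = Handoff()
    for contract in sprints:
        # Context reset: the only thing that survives from the previous
        # sub-session is whatever was written to the progress file.
        if progress_file.exists():
            handoff = Handoff(**json.loads(progress_file.read_text()))
        summary = run_sprint(contract, handoff)
        handoff.completed.append(contract)
        handoff.notes = summary
        progress_file.write_text(json.dumps(asdict(handoff)))
    return handoff


def fake_sprint(contract: str, handoff: Handoff) -> str:
    """Stand-in for a model call; a real harness would open a new session here."""
    return f"finished '{contract}' after {len(handoff.completed)} earlier sprints"


with tempfile.TemporaryDirectory() as tmp:
    final = run_sprints(
        ["render sprites", "handle input"],
        fake_sprint,
        Path(tmp) / "progress.json",
    )
print(final.completed)
```

In the V1 design described above, the contract for each sprint was negotiated between the generator and the evaluator rather than passed in as a fixed string, but the reset boundary sits in the same place: between sprints, with the progress file as the only bridge.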

This was sophisticated engineering. Sprint contracts as a coordination mechanism between agents. Context resets as a defense against context anxiety. It worked: the V1 harness produced results the solo model could not.

The team verified this with a simple experiment: build a 2D retro game. A solo Sonnet 4.5 agent ran for 20 minutes and spent $9 [1]. The game it produced had entities on screen, but they did not respond to input. It was broken. The full V1 harness ran for 6 hours and spent $200 [1]. The game was functional and playable. Per the Anthropic blog, the full harness was over 20x more expensive, “but output quality differences were immediately apparent” [1].

KEY INSIGHT: The 20-minute/$9 result is not a failure of the model. It is a measurement of where solo model capability stops and where the harness earns its cost. That boundary moves with every model release.

Figure 2 - Architecture diagram comparing V1 (left) and V2 (right) harness designs. V1 shows planner, generator, evaluator with sprint contracts and context resets between each sprint. V2 shows planner, generator, and task-conditional evaluator only, with removed components crossed out.

Figure 2 - V1 to V2: what got pruned: The V1 harness for Sonnet 4.5 needed sprint contracts and context resets because the model could not sustain long agentic runs without them. One model generation later, Opus 4.6 absorbed those capabilities natively. The sprint structure was removed entirely. The evaluator became task-conditional rather than per-sprint universal. The harness became simpler and cheaper.

V2: Built for Opus 4.6

When Anthropic upgraded to Opus 4.6, they stress-tested every assumption baked into V1. Anthropic’s Opus 4.6 launch blog described the capability changes directly: the model plans more carefully, sustains agentic tasks for longer, has better long-context retrieval, and has better code review and debugging skills to catch its own mistakes.

Those improvements map closely onto what the V1 harness was compensating for. Better long-context retrieval means sprint-based state handoffs are no longer needed. Better sustained task execution means context anxiety is reduced enough that context resets are no longer required. More careful planning means the sprint contract mechanism is unnecessary overhead. And better code review and debugging means the evaluator no longer has to check every chunk of work.

V2 removed the sprint structure entirely. Sprint contracts: gone. Context resets: gone. The builder ran coherently for “over two hours without the sprint decomposition Opus 4.5 had required” [1]. The evaluator shifted from per-sprint to task-conditional. For tasks well within the model’s capability range, the evaluator is not needed at all. For tasks at capability edges, it still provides real value. This is a crucial nuance: the evaluator was not removed wholesale. It became conditional on what the task actually demands.
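One way to picture a task-conditional evaluator is as a gate in front of the review loop. The sketch below is an assumption about shape only: the `Task` type, the `near_capability_edge` flag, and `run_task` are hypothetical, and how you decide that a task sits near the capability edge would come from your own eval data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    near_capability_edge: bool  # hypothetical flag, set from your own evals


def run_task(
    task: Task,
    generate: Callable[[Task], str],
    evaluate: Callable[[str], bool],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Return (output, evaluator_calls); the evaluator runs only when warranted."""
    output = generate(task)
    if not task.near_capability_edge:
        return output, 0  # well within range: skip evaluation entirely
    calls = 0
    while calls < max_rounds:
        calls += 1
        if evaluate(output):
            break
        if calls < max_rounds:
            output = generate(task)  # revise, then re-check on the next pass
    return output, calls


def fake_generate(task: Task) -> str:
    """Stand-in for the generator agent."""
    return f"draft for {task.name}"


easy = run_task(Task("rename a variable", False), fake_generate, lambda out: True)
hard = run_task(Task("build a DAW", True), fake_generate, lambda out: True)
print(easy, hard)
```

The easy task never pays for an evaluator call, while the hard task pays for at least one. The design choice worth noticing is that the evaluator still exists in the codebase; only its invocation is conditional.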

Anthropic put the principle on paper in the same blog post. The quote is worth using verbatim: “Every harness component encodes assumptions about model limitations, and those assumptions merit stress testing since they can quickly become outdated as models improve” [1]. That sentence is the harness evolution principle in a single line.

The V2 DAW Build: The Numbers

To verify V2 was actually better rather than just simpler, Anthropic ran a demanding test: build a fully featured DAW (Digital Audio Workstation) in the browser using the Web Audio API. The V2 harness with Opus 4.6 produced the result in 3 hours 50 minutes at a total cost of $124.70 [1].

The cost breakdown by phase tells the story of how the task actually ran [1]:

Agent and Phase	Duration	Cost
Planner	4.7 minutes	$0.46
Build Round 1	2 hours 7 minutes	$71.08
QA Round 1	8.8 minutes	$3.24
Build Round 2	1 hour 2 minutes	$36.89
QA Round 2	6.8 minutes	$3.09
Build Round 3	10.9 minutes	$5.88
QA Round 3	9.6 minutes	$4.06
Total	3 hours 50 minutes	$124.70

Most of the time and cost went to the builder, which is exactly where it should go. The planner cost $0.46. The QA rounds combined cost $10.39. The sprint contract machinery that V1 required? Not in the table. It no longer exists.
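The subtotals are easy to re-derive from the table. The short script below simply re-adds the published rows; the only things not taken from the table are the variable names and the conversion of "2 hours 7 minutes" and "1 hour 2 minutes" into minutes.

```python
from decimal import Decimal

# (phase, duration in minutes, cost in USD), copied from the table above
phases = [
    ("Planner", 4.7, Decimal("0.46")),
    ("Build Round 1", 127.0, Decimal("71.08")),
    ("QA Round 1", 8.8, Decimal("3.24")),
    ("Build Round 2", 62.0, Decimal("36.89")),
    ("QA Round 2", 6.8, Decimal("3.09")),
    ("Build Round 3", 10.9, Decimal("5.88")),
    ("QA Round 3", 9.6, Decimal("4.06")),
]

build = sum(cost for name, _, cost in phases if name.startswith("Build"))
qa = sum(cost for name, _, cost in phases if name.startswith("QA"))
total = sum(cost for _, _, cost in phases)
minutes = round(sum(mins for _, mins, _ in phases), 1)
build_share = float(build / total)

print(f"build ${build}, QA ${qa}, total ${total}")
print(f"builder share of cost: {build_share:.1%}")
print(f"total duration: {minutes} minutes")
```

The rows sum to $113.85 for the build rounds, $10.39 for QA, and $124.70 overall, so the builder accounts for roughly 91% of spend. The durations sum to 229.8 minutes, which matches the reported 3 hours 50 minutes after rounding.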

Figure 3 - Styled cost breakdown table for the V2 DAW build showing all seven phases, durations, and costs, with the three Build rounds highlighted and a total row at the bottom. Source: Anthropic engineering blog, March 24, 2026.

Figure 3 - V2 DAW build cost breakdown: Planner at $0.46, three build rounds accounting for $113.85 of the total, three QA rounds at $10.39 combined, and a $124.70 total over 3 hours 50 minutes. The sprint contracts and context resets that V1 required add nothing to this table because they no longer exist. Per Anthropic engineering blog, March 24, 2026.

Figure 4 - Side-by-side comparison showing solo agent vs full V1 harness on the retro game task: $9 / 20 min / "entities did not respond to input" vs $200 / 6 hours / "functional and playable". Over 20x cost, but only one produced a working result.

Figure 4 - Solo agent vs V1 harness on Sonnet 4.5: The $9/20-minute solo result produced a broken game. The $200/6-hour V1 harness result produced a working one. Over 20x more expensive, per the Anthropic engineering blog. This is the measurement that told the team exactly where solo model capability ended and harness ROI began. That measurement changes with every model generation.

Boris Cherny: The Human Anchor

Data and institutional case studies are useful, but sometimes you need to hear it from a person.

Boris Cherny is the creator of Claude Code. Here is what he said about the relationship between Claude Code and continuous rewriting:

“All of Claude Code has just been written and rewritten and rewritten and rewritten over and over and over. There is no part of Claude Code that was around 6 months ago” [4].

This is the permission structure for practitioners. If you build a harness and find yourself rewriting major pieces of it six months later, that is not a sign you built it wrong the first time. That is how harness engineering works. The creator of Claude Code says so. His own harness has no surviving code at six months. He keeps building it. The maintenance obligation is not a failure signal; it is a signal the model improved and your scaffolding is catching up.

KEY INSIGHT: Continuous harness rewriting is not technical debt accumulation. It is professional maturity. The discipline is distinguishing rewrites driven by model improvements (planned, expected, healthy) from rewrites driven by original design errors (avoidable, worth post-mortems).

Figure 5 - Typographic quote card with Boris Cherny's verbatim quote in large clean type and attribution "Boris Cherny, creator of Claude Code" below.

Figure 5 - Boris Cherny on harness rewriting: “There is no part of Claude Code that was around 6 months ago.” This quote reframes continuous rewriting from a problem to the expected discipline. The creator of the tool describes his own relationship to it. If he cannot freeze his harness, no practitioner should expect to freeze theirs.

The NLAH Ablation: More Structure Is Not Always Better

The Boris quote and the V1/V2 case study both point in the same direction. The Tsinghua NLAH paper [6] adds empirical rigor to the same principle.

The NLAH team ran module-by-module ablation across their agent harness on two benchmarks: SWE-bench Verified and OS World. The verifier module is the component type that most developers add instinctively. It checks whether the agent’s output actually meets the spec before accepting it. Classic “add structure for reliability” thinking.

The results were not what instinct predicts. On SWE-bench Verified, adding the verifier cost 0.8 percentage points [6]. Noticeable, but small. On OS World, adding the verifier dropped performance by 8.4 percentage points [6]. The penalty is benchmark-specific, not universal.

The mechanism is instructive. The local verifier checks outputs against its own acceptance criteria, which can diverge from the actual benchmark evaluator’s behavior. On a code benchmark (SWE-bench), the divergence is small. On a computer-use benchmark (OS World), the divergence is large enough to actively hurt results.

The takeaway is not “verifiers are bad.” The takeaway is that “add a verifier” is not universally good advice. Harness decisions are empirical, not structural. A component that looks like it should help can actively hurt on the specific workload you care about.

Cite the finding [6] if you encounter resistance to this idea. The evidence is there.
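A module-by-module ablation does not need much machinery. The sketch below is a toy, not the NLAH setup: `toy_harness`, the task names, and the resulting numbers are invented for illustration, with the verifier deliberately rigged to reject some GUI outputs that the real grader would accept.

```python
from typing import Callable

Harness = Callable[[str, frozenset[str]], bool]


def pass_rate(harness: Harness, tasks: list[str], components: frozenset[str]) -> float:
    """Percentage of tasks the harness solves with the given components enabled."""
    return 100.0 * sum(harness(t, components) for t in tasks) / len(tasks)


def ablate(harness: Harness, tasks: list[str], components: frozenset[str]) -> dict[str, float]:
    """Change in pass rate (percentage points) from removing each component alone."""
    baseline = pass_rate(harness, tasks, components)
    return {
        c: pass_rate(harness, tasks, components - {c}) - baseline
        for c in sorted(components)
    }


def toy_harness(task: str, components: frozenset[str]) -> bool:
    """Invented behavior: the verifier hurts GUI tasks, the planner helps one code task."""
    kind, idx = task.split("-")
    if "verifier" in components and kind == "gui" and int(idx) % 2 == 0:
        return False  # local verifier rejects an output the real grader accepts
    if "planner" not in components and kind == "code" and int(idx) == 0:
        return False
    return True


tasks = [f"code-{i}" for i in range(5)] + [f"gui-{i}" for i in range(5)]
deltas = ablate(toy_harness, tasks, frozenset({"planner", "verifier"}))
print(deltas)
```

A positive delta means the harness got better when the component was removed, which is the signal to prune. In this toy, removing the verifier helps and removing the planner hurts; on a real workload the only way to know is to run it.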

Figure 6 - OS World benchmark line chart showing performance below 50% twelve months before Code with Claude 2026, rising to 78% on Opus 4.7, with a projected line approaching 80%. Source: Lucas, Code with Claude 2026 keynote.

Figure 6 - OS World: sub-50% to 78% in twelve months: Lucas, a Research PM at Anthropic, reported this trajectory in his Code with Claude 2026 keynote. OS World went from “below 50% less than 12 months ago” to 78% on Opus 4.7, “about to hit 80%” [3]. This is the eval data behind the retirement of image-scaling glue. One category of compensating code became unnecessary because the model absorbed the capability it was compensating for.

The OS World Trajectory

Lucas’s OS World data is worth pausing on. OS World measures computer-use performance: can the model navigate real software interfaces autonomously? Less than twelve months before the Code with Claude 2026 keynote, Claude was “scoring below 50% on this eval” [3]. At the keynote, it was 78% on Opus 4.7, with Lucas noting it was “about to hit 80%” [3].

That curve is what drove the retirement of image-scaling glue. In 2025, computer-use agents required substantial scaffolding: downscale 1080p screens to fit pixel limits, track the downscale factor, scale click coordinates back up to native resolution after the model sampled a click location, wrap everything in retries and verify statements. This was real engineering effort that real teams wrote and maintained.
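For readers who never wrote this glue, the core of it looked roughly like the following. This is a sketch under stated assumptions: the 1280-pixel long-edge limit is a made-up value standing in for whatever pixel limit applied, and `plan_downscale` and `to_native` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downscale:
    native: tuple[int, int]   # real screen size in pixels
    scaled: tuple[int, int]   # size of the screenshot sent to the model
    factor: float             # scaled = native * factor


def plan_downscale(native: tuple[int, int], max_long_edge: int) -> Downscale:
    """Choose a scale factor so the longer edge fits within the limit."""
    w, h = native
    factor = min(1.0, max_long_edge / max(w, h))
    return Downscale(native, (round(w * factor), round(h * factor)), factor)


def to_native(click: tuple[int, int], plan: Downscale) -> tuple[int, int]:
    """Map a click the model chose on the scaled image back to native pixels."""
    x, y = click
    nx = min(plan.native[0] - 1, round(x / plan.factor))
    ny = min(plan.native[1] - 1, round(y / plan.factor))
    return nx, ny


plan = plan_downscale((1920, 1080), max_long_edge=1280)  # limit is an assumption
native_click = to_native((640, 360), plan)
print(plan.scaled, native_click)
```

Every click the model proposes has to pass through `to_native`, and any rounding or bookkeeping mistake in the factor sends the click to the wrong pixel, which is why this code tended to attract retry and verification wrappers.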

Opus 4.7 supports native-resolution screenshots at up to 1440p with one-to-one pixel coordinates [5]. The scaling math is gone. The retry loops around bad coordinate calculations are gone. Not because anyone decided to remove them, but because the model absorbed the capability they were compensating for.

The OS World trajectory tells you the pace at which the model is absorbing computer-use compensating code. Twelve months, sub-50% to nearly 80% [3]. If you are maintaining image-scaling scaffolding today, you are maintaining code that is on its way out.

Figure 7 - Two-panel bar chart showing NLAH verifier ablation results. Left panel: SWE-bench Verified, showing a 0.8 percentage point penalty from adding the verifier. Right panel: OS World, showing an 8.4 percentage point penalty. Both bars highlighted with annotation "Benchmark-specific penalty: same module, different workload."

Figure 7 - NLAH verifier ablation: benchmark-specific penalty: On SWE-bench Verified, adding the verifier costs 0.8 percentage points. On OS World, it costs 8.4 points. Same module, same harness, same model. Different benchmark, very different result. Per arXiv:2603.25723 (Pan et al., Tsinghua/Harbin Institute of Technology). Harness decisions are empirical, not structural.

The Practitioner’s Playbook

Principles are useful, but actionable steps are better. Here is how to apply the harness evolution principle when you are actually building.

1. Document each component’s assumption. Before you ship any harness component, write a single sentence: “This component compensates for [specific model limitation].” Sprint contracts compensate for the model’s inability to sustain long-horizon coherent task decomposition without a checkpoint mechanism. Context resets compensate for context anxiety as the window fills. Image-scaling math compensates for the model’s inability to handle native-resolution coordinates. Write it down. That sentence is your deprecation trigger.

2. Re-evaluate after every model upgrade. When a new model version ships, pull out your list of component assumptions and run through each one. Does this limitation still exist in the new model? If the answer is “not sure,” that is an ablation task, not a rhetorical question. Run the experiment.

3. Start minimal. Build the smallest harness that makes the task work reliably. Add components only when you have measured evidence that a specific failure mode requires them. Every component you add without measured evidence is debt you will probably be paying interest on in six months.

4. Prefer ablation over intuition. The NLAH verifier result is the reminder here. Intuition says “a verifier adds reliability.” The data says “it depends on the benchmark, and the penalty can be larger than the gain.” Run the experiment. Ablate one component at a time. Measure the delta.

5. The evaluator threshold. The evaluator’s necessity is task-conditional. If the model is well within its capability range for your task, an evaluator is overhead. If the task pushes the model’s limits, the evaluator provides real value. The V2 harness encoded this correctly: evaluator runs when the task warrants it, not on every sprint by default. Match the mechanism to the actual failure mode.
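Rules 1 and 2 can be made mechanical with a small registry. The sketch below is one possible shape, not a prescribed format: the `Component` fields, the `needs_review` helper, and the model labels are illustrative assumptions, while the assumption sentences are the ones from rule 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    compensates_for: str  # the one-sentence assumption from rule 1
    validated_on: str     # model label the assumption was last tested against


def needs_review(registry: list[Component], current_model: str) -> list[str]:
    """Names of components whose assumption has not been re-tested on this model."""
    return [c.name for c in registry if c.validated_on != current_model]


# Model labels are plain strings for illustration, not API identifiers.
registry = [
    Component(
        "sprint_contracts",
        "model cannot sustain long-horizon task decomposition without checkpoints",
        "4.5-generation",
    ),
    Component(
        "context_resets",
        "context anxiety as the window fills",
        "4.5-generation",
    ),
    Component(
        "evaluator",
        "model misses its own mistakes on tasks at the capability edge",
        "opus-4.6",
    ),
]

stale = needs_review(registry, current_model="opus-4.6")
print(stale)
```

Each name this returns is an ablation task for the new model, per rule 4. Once the experiment has run, either update `validated_on` or delete the component.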

KEY INSIGHT: The five rules above are not about reducing effort. They are about indexing that effort to where the durable value is. A component you cannot explain is a component you cannot maintain when the model improves past its assumption.

Figure 8 - Visual five-step playbook diagram showing each rule as a labeled card in a vertical flow, with annotation examples next to each rule drawn from the V1/V2 case study and NLAH ablation.

Figure 8 - The five-rule practitioner playbook: Document the assumption. Re-evaluate after each model upgrade. Start minimal. Prefer ablation over intuition. Match the evaluator to the task, not to every sprint by default. Each rule is a concrete response to a failure mode the V1/V2 case study or the NLAH ablation data surfaced.

Conclusion

The harness layer shrinks as models improve. Not immediately, not completely, but on a months-scale clock. Sprint contracts had a half-life of roughly one model generation. Image-scaling glue had a similar arc. Whatever compensating code you wrote in 2025, some portion of it is already on its way out.

The connecting layer is the inverse. Your data, your auth, your tools, your specific business context: the model cannot absorb what it cannot see. Anthropic cannot ship a model update that knows your database schema, your customer patterns, or your proprietary workflows. That infrastructure appreciates. Every harness engineering decision worth making is ultimately a bet on the connecting layer.

The discipline the harness evolution principle demands is subtraction, performed continuously, indexed to the model release cycle. Not as a reactive scramble after each release, but as a standing practice: document assumptions when you build components, stress-test those assumptions when the model improves, remove components that have expired.

Boris Cherny rewrites Claude Code continuously. Anthropic removed sprint contracts one model generation after building them correctly. The half-life rule is now on the record in a public keynote, and the V1/V2 case study is the concrete proof that the clock is real.

The engineers who internalize this are not building infrastructure once. They are gardening it. The garden changes. The skill is knowing what to prune.

The Series

This is Part 3 of the five-part Harness Fundamentals series:

1. What Is an Agent Harness, Really? Nine Components Most Builders Miss: a working definition and the nine components every modern harness needs
2. Three Eras of AI Engineering: Prompt to Context to Harness: how the discipline moved and what each era absorbed from the one before
3. The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning (this article): the V1/V2 case study, the Boris anchor, and a practitioner’s pruning playbook
4. Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions (coming soon): hands-on construction of a minimal harness with all nine components mapped to working code
5. Skills, Slash Commands, and Harnesses: A Discipline Hierarchy (coming soon): where individual skills fit inside the broader harness and how the three layers interact

References

[1] P. Rajasekaran, “Continuous reinvention: A brief history of Claude Code’s development,” Anthropic Engineering Blog, Mar 2026. https://www.anthropic.com/engineering/continuous-reinvention-a-brief-history-of-claude-code

[2] Anthropic, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, Nov 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[3] L. Roberts, “The Expanding Toolkit,” Code with Claude 2026, Anthropic, May 2026. https://www.youtube.com/@Anthropic

[4] B. Cherny, “Claude Code at the edge of model capability,” Code with Claude 2026, Anthropic, May 2026. https://www.youtube.com/@Anthropic

[5] Anthropic, “Computer use (beta),” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/agents-and-tools/tool-use/computer-use-tool

[6] Y. Pan et al., “Natural Language Agent Harnesses,” arXiv preprint arXiv:2603.25723, 2026. https://arxiv.org/abs/2603.25723

[7] X. Xie et al., “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,” 2024. https://os-world.github.io/