By Diego Sanz · March 10, 2026 · 11 min read

The Three Phases of AI Agent Evolution: From Demo to Moat

Pop Art silkscreen grid of AI agent icons in the style of Warhol's Marilyn Diptych — vivid production agents left, fading grey pilots right — Inspired by Andy Warhol's *Marilyn Diptych* (1962) — silkscreen grid of vivid production-ready agents (left) against fading grey pilots that never shipped (right); the three-generation arc from demo to competitive moat encoded in the color-to-monochrome shift.

Listen

0:00 / 15:44

The Three Phases of Agent Evolution: Why Harness Engineering Is Now the Primary Competitive Battleground

What if the reason your AI agents keep failing has nothing to do with the model you chose — and everything to do with the system you built around it?

Fifty-six percent of CEOs have seen zero cost or revenue improvement from AI investment. Not zero-or-modest — zero. PwC surveyed 4,454 business leaders across 95 countries in late 2025 and found that most of the world’s organizations have spent heavily on AI and come away with nothing measurable to show for it. What makes that number especially uncomfortable is the thing it doesn’t say: that the models weren’t good enough. They were. The bottleneck is somewhere else entirely.

The place it actually sits is the discipline most organizations have never heard of: harness engineering. And the companies that figure this out first are going to have an advantage that is genuinely difficult to replicate — because the harness, unlike the model, is where your business logic, data context, and institutional knowledge live.

This article traces the three-generation arc from prompt engineering to harness engineering, explains why the formula Agent = Model + Harness is more than a slogan, and makes the case for why the harness — not the model — is where the real competitive battle is being won and lost right now.

The Three-Generation Arc: From Prompts to Context to Harness

Every technology goes through a phase where practitioners are still solving the wrong problem. For generative AI, the first wrong problem was the prompt.

From roughly 2022 through 2024, the craft was prompt engineering — the art of the single perfect instruction. Few-shot examples, chain-of-thought structuring, role assignments. Organizations hired prompt engineers. Consultants sold prompt libraries. The implicit assumption was that if you could just craft the right input, you would get the right output. The model was the system, and the prompt was the key.

That assumption ran into a wall the moment anyone tried to build something that had to work reliably, repeatedly, at scale. A perfect prompt for one scenario is a mediocre prompt for the next one. Single-turn optimization doesn’t survive contact with the real world.

The second generation — context engineering, which came into focus through 2025 — was a significant step forward. The insight was that what you gave the model mattered as much as what you asked it to do. Retrieval-augmented generation, dynamic context injection, structured memory: practitioners started building pipelines that assembled rich information environments before handing control to the model. This was better. Production deployments improved. But context engineering still treated each model invocation as essentially a single event — one carefully assembled context window, one response. It answered the question of what the model should know. It didn’t answer the question of how the model should behave across an entire workflow.

From Prompt Engineering to Harness Engineering

Each generation answered a question the previous one couldn’t.

Generation 1 · 2022–2024

Prompt Engineering

The question it answered

“What’s the right instruction?”

What it built

Few-shot examples, chain-of-thought, role assignments, prompt libraries

Where it hit the wall

Single-turn optimization doesn’t survive contact with the real world at scale

Generation 2 · 2025

Context Engineering

The question it answered

“What should the model know?”

What it built

RAG pipelines, dynamic context injection, structured memory, information assembly

Where it hit the wall

Treated each invocation as one event. Didn’t answer how a model behaves across an entire workflow

Generation 3 · Now

Harness Engineering ★

The question it answers

“How does the model behave across every turn?”

What it builds

Control systems, feedback loops, verification logic, tool access, state management, constraint files

The competitive implication

The harness encodes your business rules and data context. The model is available to everyone. The harness belongs to you.

Harness engineering is the third generation, and it is a different category of problem. Where prompt engineering wrote instructions and context engineering assembled information, harness engineering builds the control system that governs an agent across every turn of its operation — its constraints, its feedback loops, its verification logic, its tool access, its state management. You are no longer crafting a message to a model. You are architecting the environment the model operates inside.

The Formula That Actually Explains Agent Failure

Agent = Model + Harness.

The formula is deceptively clean, and most practitioners do encounter it as a slogan and move on. The teams that internalize it build fundamentally different systems. It makes three claims that are worth pulling apart separately.

First: a model alone is not an agent. A model by itself is a stateless token predictor. It takes input, produces output, forgets everything, and waits for the next input. The agentic behavior — persistence, tool use, multi-step reasoning, error recovery — comes entirely from the harness wrapped around it. Without the harness, you don’t have an agent. You have a very capable autocomplete.

Second: the model is increasingly a commodity. The frontier offerings from OpenAI, Anthropic, and Google have converged on raw capability in ways that would have seemed implausible two years ago. You can swap the model inside a well-built harness without rewriting the harness. The specific reasoning engine underneath is becoming less important as a differentiator with every model generation.

Third, and this is the one executives need to sit with: the harness is the competitive moat. The harness encodes your business rules. Your data context. Your verification logic. Your safety constraints calibrated to your specific risk tolerance. Your domain-specific toolchain. None of that transfers when you switch models, and none of it can be replicated by a competitor who doesn’t know your business. The model is available to everyone. The harness belongs to you.

Teams building agents at scale describe the same reality in different terms. Engineers running large-scale agent deployments — at organizations whose agent infrastructure handles thousands of operations daily without fine-tuning or proprietary models — consistently report the same finding: the model is standard issue. The harness is the engineering achievement. The organizations that benchmark their agent investments against model capability comparisons are measuring the wrong variable.

Where the Investment Is Breaking Down

PwC’s diagnosis deserves a harder look, because the surface explanation — “AI is just hard” — obscures where the problems actually occur. The clearest signal in the data isn’t the 56% getting zero. It’s the gap between them and the organizations that aren’t.

PwC found that CEOs who had built what the survey calls “AI foundations” — strong technology infrastructure, enterprise-wide integration, formalized risk and governance processes — were three times more likely to report meaningful financial returns. Not modestly more likely. Three times. The organizations getting zero are not getting zero because they chose the wrong model. They’re getting zero because they deployed isolated, tactical AI projects with no coherent foundation beneath them.

The data layer is part of that foundation. PwC’s “AI foundations” framework explicitly includes the infrastructure that governs data quality — the certification, the currency, the drift monitoring. An isolated AI project, by definition, lacks that governance layer. The practical consequence is consistent with everything practitioners observe: even a well-designed agent degrades if the data context it operates on is stale, uncertified, or drifting from the underlying reality it was built to represent.

AI Investment Outcomes · PwC 29th Annual Global CEO Survey, 4,454 CEOs

56%

Zero ROI

No cost reduction, no revenue gain — despite significant AI investment

3×

More Likely to See Returns

CEOs with AI foundations — infrastructure, integration, governance — versus those running isolated AI projects

What separates the two groups: not the model chosen, not the AI provider — the infrastructure and governance built around it

Source: PwC 29th Annual Global CEO Survey (2026), 4,454 CEOs across 95 countries, released January 2026

This is not a comfortable finding for organizations that have been investing primarily in model selection and prompt optimization. It means that the engineering investment most likely to improve production reliability isn’t a better model — it’s a better harness, fed by better data. Specifically: data that is certified, current, and monitored for drift. Teams that solve the harness architecture problem without solving the data layer problem underneath it are building on sand.

There is also a unit economics argument that gets overlooked in conversations about model capability. Managing context and caching strategy within the harness — keeping system instructions stable, preventing unnecessary cache invalidation — can reduce token costs by an order of magnitude while cutting latency by more than half, without touching the underlying model at all. The harness isn’t just a reliability lever. It’s a cost lever, and for organizations running agents at any meaningful scale, that arithmetic matters.

The Organizational Gap: Why Enterprises Are Behind

Gartner estimates that less than 5% of enterprise applications featured task-specific AI agents as recently as 2025 — and projects that crossing 40% is a near-term horizon. That is an enormous change in a very short window. The organizations that show up to that window with only prompt engineering skills and model evaluation frameworks are going to have a difficult time.

The gap is organizational as much as technical. Most companies are still structured as if model selection is the strategic decision. Procurement frameworks compare foundation model providers. Evaluation rubrics test output quality on benchmarks. The team responsible for “AI” is often the same team that was responsible for “data science” — skilled at working with models, less practiced at building production control systems.

Harness engineering requires a different posture. The engineers at teams running production agents at scale describe having redefined their jobs entirely: they stopped being people who write code the model executes and became architects of control systems and feedback loops. Autonomy is granted to agents incrementally, with explicit gates. Architectural constraints are enforced by linters and validators, not by prompting the agent to follow the rules. When an agent makes a mistake, the response isn’t a better prompt — it’s an update to the harness that makes that class of mistake structurally impossible. The harness is a living system that gets more robust with each failure it survives.

The Harness Feedback Loop

Each failure makes the system more robust. The loop compounds.

Agent Failure in Production

An agent makes a wrong call. Context drifted. Schema misaligned. State degraded across turns.

Harness Update — Not Prompt Tweak

The response is a structural change. The class of mistake becomes architecturally impossible, not just discouraged.

Competitive Moat Deepens

The harness accumulates institutional knowledge. Every failure survived makes it harder to replicate from the outside.

System More Robust

Each failure the harness survives makes it more reliable than it was before. The loop goes live again.

↻

The compound effect: Organizations that build this loop first accumulate a harness that encodes years of failure recovery. Starting later means starting behind — and the gap widens with every cycle.

Most organizations are not staffed or organized for this. The capability gap is real. But it is also closeable — and the organizations that close it first acquire the moat that compounds most reliably, because the harness accumulates institutional knowledge that is genuinely proprietary.

The Skeptic’s Objection: Is This Just Infrastructure Hype?

The cynical read is that “harness engineering” is a rebranding exercise — a new vocabulary layer on top of what good software engineers have always done: build reliable systems around unreliable components. Fine. That read isn’t wrong. Every generation of software has required building wrappers, validation layers, and operational scaffolding around whatever the new primitive was.

What the cynical read misses is that the nature of the primitive has changed. When you wrap a deterministic API, the failure modes are enumerable. When you wrap a probabilistic reasoning engine that is making judgment calls across multi-step workflows — where context drift at turn three can compound into a completely wrong outcome at turn twelve — the control problem is categorically different. The harness isn’t just error handling. It’s behavioral governance of a system that can fail in ways you didn’t anticipate and can’t fully enumerate in advance.

There is also a genuinely new structural insight that harness engineering surfaces: models cannot reliably evaluate their own work. Anthropic’s engineering research on agent evaluation confirms this as a design constraint — their guidance is explicit that model-based graders require calibration with human experts, and that production systems need independent evaluation layers rather than self-assessment. The practical implication is that a production-grade harness requires separate evaluation agents — systems that assess the outputs of generative agents independently, rather than asking the generating model to check itself. This is a non-obvious architectural requirement. Organizations that haven’t internalized it are building agent systems with a structural blind spot.

The skeptic is right that this is infrastructure. The skeptic is wrong to conclude that infrastructure is therefore undifferentiated. Your harness is differentiated because your business data, rules, and verification logic are differentiated. The moat isn’t in the scaffolding itself — it’s in everything the scaffolding encodes.

What Leaders Should Actually Do in the Next Quarter

The competitive window here is real, but it isn’t infinite. Basic harness orchestration — managed runtimes, standard telemetry, generic control planes — is commoditizing into cloud infrastructure fast enough that buying it rather than building it is already the right call for most organizations. The hyperscalers and open-source frameworks will handle the plumbing.

What doesn’t commoditize is the layer above that: your domain-specific tools, your custom evaluation datasets, your business rules encoded in constraint files, your data context architecture calibrated to your specific systems. That is what you build. That is what compounds. That is what cannot be replicated by a competitor who doesn’t know your business the way you do.

So the question isn’t whether to invest in harness engineering. It’s whether your organization is structured to do it. Start there: does your AI team have the mandate and the skills to build control systems, not just evaluate models? Is your data layer — the context your harness will govern — certified, current, and monitored for drift? And when an agent fails in production, is the response a prompt adjustment, or is it a harness update?

A Practitioner’s Note

Three questions to ask before your next agent investment

The competitive window is real. The organizations that answer these correctly first acquire the moat that compounds most reliably.

Does your AI team have the mandate to build control systems, not just evaluate models? If the team is still organized around model selection and benchmark evaluation, the harness won’t get built.

Is your data layer certified, current, and monitored for drift? Even a well-designed harness fails if the data context it governs is stale. The harness and the data layer are a unit.

When an agent fails, is the response a prompt adjustment or a harness update? The answer tells you which generation your team is operating in — and how fast the moat is actually compounding.

The 56% of organizations getting zero ROI from AI aren’t failing because the models aren’t good enough. They’re failing because no one built the foundation underneath them. PwC’s own data shows what separates them from the organizations that are: not the model, not the provider — the infrastructure built around it. The model — whichever one they choose — is almost a detail.

Sources

The agentic reality check: Preparing for a silicon-based workforce

Deloitte · December 2025
Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up from Less Than 5% in 2025

Gartner · August 2025
PwC 29th Annual Global CEO Survey 2026

PwC · January 2026
Demystifying Evals for AI Agents

Anthropic Engineering · January 2026

Topics in this article