Manoranjan Rajguru

Context Engineering for AI Agents: The 2026 Practitioner's Deep Dive

Table of Contents

  1. The Model War Is Over. The Harness War Has Begun.
  2. Why Prompt Engineering Failed (And What Replaced It)
  3. Anatomy of an Agent Context Window
  4. The Harness: Context Engineering's Runtime
  5. Designing Context Policies: Fill Rules for the Window
  6. Context Anxiety and Compaction Strategies
  7. Multi-Agent Context Handoffs
  8. Building Your AGENTS.md: The Ratchet Principle
  9. MCP: Context Engineering at Scale
  10. Benchmarks and Token Economics
  11. Production Context Engineering Checklist
  12. Conclusion

1. The Model War Is Over. The Harness War Has Begun.

Here is a number that should stop you cold: on Terminal Bench 2.0, a team moved a coding agent from Top 30 to Top 5 — without changing the model once. They changed only the harness.

The same model. Different scaffolding, different context policies, tighter backpressure signals. Top 5.

That result, published by Viv Trivedy in early 2026 and independently corroborated by the HumanLayer team, encapsulates everything that is happening right now at the frontier of applied AI engineering. For the last two years, engineers have been obsessing over the left side of one equation:

AI Agent = LLM Model + Harness

Every blog post, every benchmark, every Twitter thread — focused relentlessly on which model is smarter, hallucinates less, writes better React. That conversation has been fine. But it has been missing the other half of the system entirely.

The harness — the prompts, tools, context policies, hooks, sandboxes, subagents, feedback loops, and recovery paths wrapped around the model — is where the actual leverage lives in 2026. And the discipline of designing that harness well has a name: context engineering for AI agents.

This is not a rebrand of prompt engineering. It is a fundamentally different discipline. Context engineering for AI agents has different tooling, different failure modes, and different optimization targets than anything that came before it. This post is a practitioner's deep dive into what it means, why it matters, and how to actually do it.

[Figure: Context Engineering for AI Agents, architecture overview]


2. Why Prompt Engineering Failed (And What Replaced It)

In June 2025, Andrej Karpathy posted a thread that quietly reframed how serious engineers should think about their work:

"+1 for 'context engineering' over 'prompt engineering'. People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Karpathy went further: context engineering involves task descriptions, few-shot examples, RAG, related multimodal data, tool schemas, state, history, compacting strategies — and the "art" of understanding LLM psychology well enough to know what the model needs to see to perform well.

Karpathy's post was a "+1" to one made a few days earlier by Shopify CEO Tobi Lutke:

"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

The problem with "prompt engineering" was never the craft; it was the semantic association. Most people heard "prompt engineering" and inferred "clever phrasing," and that inferred definition stuck. To many it sounded like a laughably pretentious term for typing things into a chatbot.

Context engineering doesn't have that problem. The inferred definition — carefully constructing what information flows into the model's context — is almost exactly the intended definition.

What Context Engineering Actually Covers

In a production agent system, context engineering governs:

  • What goes into the system prompt and how it is structured for model comprehension
  • Which tools are exposed and how their schemas are written (a poorly described tool is invisible to the model)
  • How memory is managed — what is retrieved from long-term stores, how much, and when
  • How conversation history is compressed to extend effective task horizon without hitting context limits
  • How state is handed off between agents or across context resets
  • What the model should not see — negative context engineering, filtering irrelevant information before it wastes tokens

None of this is "typing things into a chatbot." It is systems engineering. It requires understanding how transformers process long sequences, how attention patterns degrade with distance, and how specific models behave as context fills up. It is a real discipline, and in 2026, it is the discipline that separates functional agents from production-grade ones.


3. Anatomy of an Agent Context Window

Before you can engineer a context window, you need a precise mental model of what it contains. Most developers think of context as a single monolithic thing — the "conversation history." In production agent systems, a context window is better understood as a layered resource with competing priorities.

[Figure: Anatomy of an LLM Context Window]

Here is the canonical decomposition for a coding agent context window, ordered from most-static to most-dynamic:

Layer 1: System Prompt (3–8% of window)

The system prompt is the constitution of your agent. It defines the agent's identity, capabilities, constraints, output format expectations, and conventions. Unlike conversation turns, it cannot be compacted away. Every token here is a permanent cost per inference call.

Design rule: treat system prompts like hot-path code. Remove everything that is not load-bearing. A 10,000-token system prompt that could be 3,000 tokens is burning 7,000 tokens on every single call.
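To make the per-call cost concrete, here is a small arithmetic sketch. The function name and the 50-call run length are illustrative assumptions, and the calculation ignores any prompt caching your provider may apply.

```python
def prompt_overhead(prompt_tokens: int, lean_tokens: int, calls: int) -> int:
    """Tokens spent on non-load-bearing system prompt text across a run."""
    if lean_tokens > prompt_tokens:
        raise ValueError("lean_tokens cannot exceed prompt_tokens")
    return (prompt_tokens - lean_tokens) * calls


# A 10,000-token prompt that could be 3,000 tokens, over a 50-call agent run
wasted = prompt_overhead(prompt_tokens=10_000, lean_tokens=3_000, calls=50)
print(wasted)  # 350000
```

The 7,000 wasted tokens per call look small in isolation; multiplied by every call in a long run, they dominate.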

Layer 2: Tool Schemas (5–15% of window)

Every tool you expose occupies context — the function name, description, parameters, and examples. A poorly designed tool schema can cost 800–1,500 tokens per tool. At 20 tools, that is up to 30,000 tokens of tool overhead before the model has processed a single line of task context.

Design rule: expose only the tools relevant to the current task stage. A planning agent does not need a file-write tool. A code-execution agent does not need a web-search tool. Dynamic tool routing — selecting the active toolset based on the current sub-task — is one of the highest-leverage optimizations in context engineering.
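A minimal sketch of stage-based tool routing follows. The stage names and the `STAGE_TOOLS` mapping are assumptions made for illustration; a real harness might infer the stage from the plan or from recent messages instead.

```python
# Hypothetical mapping from task stage to the tool names that stage may use.
STAGE_TOOLS: dict[str, set[str]] = {
    "plan": {"read_file", "search_web"},
    "execute": {"read_file", "write_file", "run_bash"},
}


def select_tools_for_stage(tools: list[dict], stage: str) -> list[dict]:
    """Return only the tool schemas allowed for the given stage."""
    allowed = STAGE_TOOLS.get(stage)
    if allowed is None:
        return tools  # unknown stage: fall back to the full toolset
    return [tool for tool in tools if tool["name"] in allowed]


all_tools = [{"name": n} for n in ("read_file", "write_file", "run_bash", "search_web")]
print([t["name"] for t in select_tools_for_stage(all_tools, "plan")])
# ['read_file', 'search_web']
```

The planning stage never sees `write_file`, and the execution stage never sees `search_web`, so neither pays for schemas it cannot use.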

Layer 3: Few-Shot Examples (0–20% of window)

Few-shot examples are often among the most powerful per-token investments in your context window. Three well-chosen examples can improve output quality more than a 5,000-token explanatory system prompt. But they are also the easiest to over-provision.

Design rule: use dynamic few-shot retrieval. Store examples in a vector database. Retrieve the 2–3 most semantically similar to the current task. Do not bake 10 static examples into every prompt.

Layer 4: RAG-Retrieved Context (10–30% of window)

For agents that need external knowledge, retrieved chunks are often the largest variable component of the context window. Naive RAG — retrieve top-k and dump it all in — is a context engineering anti-pattern.

Design rule: re-rank and compress before injecting. Retrieve 20 chunks, re-rank by relevance to the current query and current agent state, then summarize the bottom half before including it.
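As a sketch of that rule, the function below keeps the top-ranked half of the chunks verbatim and collapses the rest into one summary. The `score` and `summarize` callables are assumptions supplied by the caller; in a real system they would be a re-ranker model and an LLM summarizer.

```python
from typing import Callable


def rerank_and_compress(
    chunks: list[str],
    score: Callable[[str], float],
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the top-ranked half verbatim; replace the rest with one summary."""
    ranked = sorted(chunks, key=score, reverse=True)
    cut = (len(ranked) + 1) // 2  # top half, rounded up
    top, bottom = ranked[:cut], ranked[cut:]
    result = list(top)
    if bottom:
        result.append(summarize(bottom))
    return result


# Toy stand-ins: score by length, "summarize" by counting chunks.
chunks = ["a", "bb", "ccc", "dddd"]
context = rerank_and_compress(chunks, score=len, summarize=lambda cs: f"summary of {len(cs)} chunks")
print(context)  # ['dddd', 'ccc', 'summary of 2 chunks']
```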

Layer 5: Task State and History (20–40% of window)

The conversation history — what the agent has done so far, what tools it called, what they returned — is where context windows get eaten alive on long-horizon tasks. A 100-step task where each step consumes 500 tokens of history accumulates 50,000 tokens of state before the task ends.

Design rule: compress early and compress continuously. Do not wait for the context window to fill before compacting. Set a high-water mark (e.g., 60% of window) and trigger compaction proactively.

Layer 6: Current Step Input (5–15% of window)

The actual current task, query, or observation. This is what the model is actually being asked to reason about. Engineers routinely cram so much context into layers 1–5 that the model arrives at layer 6 with degraded attention capacity.

Design rule: always reserve headroom for the task. A 128K context window that is 115K full before the task arrives is functionally worse than a 32K window with 15K free.
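The comparison above is simple arithmetic, sketched here with hypothetical per-layer fills. The layer names, the split between layers, and the 14,000-token requirement are all illustrative assumptions.

```python
def free_tokens(window: int, layers: dict[str, int]) -> int:
    """Tokens left for the current step after layers 1-5 are filled."""
    return window - sum(layers.values())


def has_headroom(window: int, layers: dict[str, int], required: int) -> bool:
    """True if the current step still has at least `required` tokens free."""
    return free_tokens(window, layers) >= required


# Hypothetical fills: a large window packed to 115K vs. a small one with room.
large = {"system": 8_000, "tools": 20_000, "few_shot": 10_000, "rag": 30_000, "history": 47_000}
small = {"system": 3_000, "tools": 4_000, "few_shot": 2_000, "rag": 3_000, "history": 5_000}

print(free_tokens(128_000, large))  # 13000
print(free_tokens(32_000, small))   # 15000
print(has_headroom(128_000, large, required=14_000))  # False
print(has_headroom(32_000, small, required=14_000))   # True
```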


4. The Harness: Context Engineering's Runtime

Context engineering does not happen in a text file — it happens in code. The harness is the runtime that makes context decisions on every call: what to inject, what to compact, which tools to activate, when to hand off, and when to stop.

Viv Trivedy's formulation:

"Agent = Model + Harness. If you're not the model, you're the harness."

A harness concretely includes:

  • System prompts, CLAUDE.md, AGENTS.md, skill files
  • Tools, MCP servers, and their descriptions
  • Bundled infrastructure (filesystem, sandbox, browser)
  • Orchestration logic (subagent spawning, handoffs, model routing)
  • Hooks and middleware (compaction triggers, continuation, lint checks)
  • Observability (logs, traces, cost and latency metering)

The distinction between scaffolding and harness matters for precise engineering:

  • Scaffolding is the behavior-defining layer: what the model sees, what tools it knows about, how its responses are parsed. It shapes the model's world-model.
  • Harness is the execution layer: what drives the loop, handles tool calls, decides when to stop.

Claude Code, Cursor, Codex, and Aider are all harnesses. They may run the same underlying model. The behavioral difference you experience between them is almost entirely attributable to harness design differences, not model differences.

Here is a minimal but structurally complete Python harness skeleton:

import anthropic

client = anthropic.Anthropic()

def run_agent_loop(
    task: str,
    tools: list[dict],
    system_prompt: str,
    context_policy: "ContextPolicy",
    max_iterations: int = 50,
) -> str:
    """
    Minimal production harness with context policy enforcement.

    The context_policy object handles:
    - Which message history to include
    - When to trigger compaction
    - Tool selection for this task stage
    """
    messages = []
    iteration = 0

    # Inject the initial task
    messages.append({"role": "user", "content": task})

    while iteration < max_iterations:
        # Apply context policy before each call
        # This is where context engineering happens
        active_messages = context_policy.prepare_messages(messages)
        active_tools = context_policy.select_tools(tools, messages)

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=8096,
            system=system_prompt,
            tools=active_tools,
            messages=active_messages,
        )

        # Check stop conditions
        if response.stop_reason == "end_turn":
            return extract_final_answer(response)

        if response.stop_reason == "tool_use":
            # Execute tool calls and append results
            tool_results = execute_tool_calls(response.content)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

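            # Apply backpressure signals (lint checks, test results, etc.)
            if backpressure := context_policy.check_backpressure(messages):
                messages.append({"role": "user", "content": backpressure})
        else:
            # Other stop reasons (e.g. "max_tokens") leave messages unchanged,
            # so looping would resend the same request; fail fast instead.
            raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason!r}")

        iteration += 1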

    raise RuntimeError(f"Agent exceeded max_iterations={max_iterations}")


def extract_final_answer(response) -> str:
    """Extract text from the final response."""
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""




def dispatch_tool(tool_name: str, tool_input: dict) -> str:
    """
    Route tool calls to their implementations.
    Replace with your actual tool registry in production.

    Example:
        tool_registry = {
            "read_file": read_file_tool,
            "write_file": write_file_tool, 
            "run_bash": run_bash_tool,
            "search_web": search_web_tool,
        }
        return tool_registry[tool_name](**tool_input)
    """
    raise NotImplementedError(f"Tool '{tool_name}' not registered. Add it to your tool registry.")

def execute_tool_calls(content_blocks: list) -> list[dict]:
    """
    Execute tool calls and format results.
    Returns a list of tool_result blocks for the next message.
    """
    results = []
    for block in content_blocks:
        if block.type == "tool_use":
            # Dispatch to actual tool implementation
            tool_output = dispatch_tool(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(tool_output),
            })
    return results

The ContextPolicy object is the heart of the harness — and the primary surface area for context engineering. Let's build that out next.


5. Designing Context Policies: Fill Rules for the Window

A context policy is the set of rules that governs what gets put into the context window on each model call. Most agent frameworks today either have no explicit context policy (everything goes in until it crashes) or a trivial one (truncate from the left when over limit).

Neither is good enough for production systems.

Here is a more complete ContextPolicy implementation:

from dataclasses import dataclass, field
from typing import Callable
import tiktoken

@dataclass
class ContextPolicy:
    """
    Governs context window management for a running agent.

    Design parameters:
    - model_context_limit: Total token limit for the model
    - compaction_threshold: Trigger compaction at this fraction of the window
    - system_prompt_budget: Max tokens reserved for system prompt
    - tool_budget: Max tokens reserved for tool schemas
    - task_headroom: Min tokens always kept free for the current task step
    """
    model_context_limit: int = 128_000
    compaction_threshold: float = 0.60       # Compact at 60% full
    system_prompt_budget: int = 6_000
    tool_budget: int = 20_000
    task_headroom: int = 16_000

    # Tool selector: given message history, return relevant tool subset
    tool_selector: Callable = field(default_factory=lambda: lambda tools, msgs: tools)

    # Compaction function: given messages, return a compressed version
    compactor: Callable = field(default_factory=lambda: default_compactor)

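    def _count_tokens(self, messages: list[dict]) -> int:
        """Approximate token count (cl100k_base is an OpenAI encoding, so this is a rough estimate for Claude)."""
        enc = tiktoken.get_encoding("cl100k_base")
        total = 0
        for msg in messages:
            content = msg["content"]
            if isinstance(content, str):
                total += len(enc.encode(content))
                continue
            for block in content:
                # Blocks are dicts (tool results) or SDK objects (assistant
                # content); count text, tool_result content, and tool_use input.
                for key in ("text", "content", "input"):
                    value = block.get(key) if isinstance(block, dict) else getattr(block, key, None)
                    if value:
                        total += len(enc.encode(str(value)))
        return total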

    @property
    def available_for_history(self) -> int:
        """Token budget remaining for conversation history."""
        return (
            self.model_context_limit
            - self.system_prompt_budget
            - self.tool_budget
            - self.task_headroom
        )

    @property
    def compaction_trigger(self) -> int:
        """Token count at which to trigger compaction."""
        return int(self.model_context_limit * self.compaction_threshold)

    def prepare_messages(self, messages: list[dict]) -> list[dict]:
        """
        Apply context policy to message history before each model call.

        Strategy:
        1. Check if current history exceeds compaction threshold
        2. If so, compact the oldest N messages into a summary
        3. Always preserve the most recent N turns verbatim (recency bias)
        4. Ensure task headroom is available
        """
        current_tokens = self._count_tokens(messages)

        if current_tokens > self.compaction_trigger:
            messages = self.compactor(messages, target_tokens=self.available_for_history)

        return messages

    def select_tools(self, all_tools: list[dict], messages: list[dict]) -> list[dict]:
        """
        Return only the tools relevant to the current task stage.
        Reduces tool-schema token overhead by 40-70% on typical agents.
        """
        return self.tool_selector(all_tools, messages)

    def check_backpressure(self, messages: list[dict]) -> str | None:
        """
        Check if backpressure signals should be injected.
        Returns a message to inject if action is needed, else None.

        Example signals: lint failures, test failures, build errors.
        These are injected as user messages to create correction loops.
        """
        # Implementation is domain-specific
        # A coding agent might run: subprocess.run(["tsc", "--noEmit"])
        return None


def default_compactor(messages: list[dict], target_tokens: int) -> list[dict]:
    """
    Default compaction: summarize oldest half, keep newest half verbatim.

    For production use, replace with an LLM-based summarizer:
    summary = llm.summarize(old_messages)
    """
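    if len(messages) < 4:
        return messages

    # Split near the midpoint, but move back to an assistant turn so a
    # tool_result is never separated from the tool_use it answers.
    split = len(messages) // 2
    while split > 0 and messages[split]["role"] != "assistant":
        split -= 1
    if split == 0:
        return messages  # no safe split point; leave history untouched
    old_messages = messages[:split]
    recent_messages = messages[split:]

    # In production: call an LLM to summarize old_messages
    summary_text = f"[COMPACTED: {len(old_messages)} earlier messages omitted. Task in progress.]"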

    return [
        {"role": "user", "content": summary_text},
        {"role": "assistant", "content": "Understood. Continuing from the compacted context."},
        *recent_messages,
    ]

The critical insight here is that context policy design is not the model's job — it is the harness engineer's job. The model cannot decide to compact its own context; it cannot choose to expose fewer tools. All of that happens in the harness layer, before the model is called.


6. Context Anxiety and Compaction Strategies

In early 2026, Anthropic introduced a term that every agent developer should have in their vocabulary: context anxiety.

From Anthropic's engineering blog:

"Models tend to lose coherence on lengthy tasks as the context window fills. Some models also exhibit 'context anxiety,' in which they begin wrapping up work prematurely as they approach what they believe is their context limit."

Context anxiety manifests as:

  • Premature "I've completed the task" declarations mid-task
  • Quality degradation on tasks in the second half of a long context
  • Increased hallucination rates as the window fills
  • Shorter, less thorough tool calls and reasoning traces

Anthropic found that Claude Sonnet 4.5 exhibited context anxiety so severely that compaction alone could not fix it. The solution they converged on: context resets — a full context window wipe paired with a structured handoff artifact.

Compaction vs. Context Reset

These are two fundamentally different strategies:

| Strategy | Mechanism | Use Case |
| --- | --- | --- |
| Compaction | Summarize old messages in-place; same agent continues | Routine history management; tasks < 40 steps |
| Context Reset | Wipe window entirely; spawn new agent with HANDOFF.md | Long-horizon tasks; context-anxious models; multi-day jobs |
| Sliding Window | Drop oldest N messages; no summarization | Conversational agents; when recency is all that matters |
| Hierarchical Compaction | Summarize at multiple granularities (step → phase → task) | Very long tasks (100+ steps); pipeline stages |
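The sliding-window row is the simplest of these to sketch. The version below makes two assumptions beyond the table: it always keeps the first message (the task statement), and it starts the kept tail on an assistant turn so a tool result is not left without the tool call it answers.

```python
def sliding_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep the first message plus the most recent ones; drop the middle."""
    if max_messages < 2:
        raise ValueError("max_messages must be at least 2")
    if len(messages) <= max_messages:
        return list(messages)
    tail = messages[-(max_messages - 1):]
    # Drop leading non-assistant turns so the tail never opens with an
    # orphaned tool result.
    while tail and tail[0]["role"] != "assistant":
        tail = tail[1:]
    return [messages[0], *tail]


history = [
    {"role": "user", "content": "task"},
    {"role": "assistant", "content": "call 1"},
    {"role": "user", "content": "result 1"},
    {"role": "assistant", "content": "call 2"},
    {"role": "user", "content": "result 2"},
    {"role": "assistant", "content": "call 3"},
    {"role": "user", "content": "result 3"},
]
print([m["content"] for m in sliding_window(history, max_messages=4)])
# ['task', 'call 3', 'result 3']
```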

Here is a production-grade compaction function that uses an LLM as the summarizer:

import anthropic

compaction_client = anthropic.Anthropic()

def llm_compactor(messages: list[dict], target_tokens: int) -> list[dict]:
    """
    LLM-based compaction: summarize the first half of message history.

    This preserves semantic meaning better than truncation,
    and captures tool call patterns that matter for task continuation.

    Args:
        messages: Full message history
        target_tokens: Target token count for the compressed history

    Returns:
        Compressed message list with summary prepended
    """
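    # Preserve roughly the most recent 20% of messages verbatim. Move the
    # cutoff back to an assistant turn so a tool_result is never separated
    # from the tool_use it answers.
    recency_cutoff = len(messages) - max(1, len(messages) // 5)
    while recency_cutoff > 0 and messages[recency_cutoff]["role"] != "assistant":
        recency_cutoff -= 1
    if recency_cutoff <= 0:
        return messages  # nothing older than the kept tail to compact
    messages_to_compact = messages[:recency_cutoff]
    messages_to_keep = messages[recency_cutoff:]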

    # Format the old messages for the summarizer
    formatted_history = format_messages_for_summary(messages_to_compact)

    summary_response = compaction_client.messages.create(
        model="claude-haiku-4-5",  # Use a cheaper/faster model for compaction
        max_tokens=2048,
        system="""You are a context compression assistant. 
        Summarize the provided agent conversation history into a dense, 
        information-rich summary. Preserve:
        - What task the agent is working on
        - What has been completed (with key outputs/results)  
        - What tools were called and their outcomes
        - Any errors encountered and how they were resolved
        - The current state and what the agent should do next

        Be dense. Omit filler. Preserve all concrete artifacts, file paths,
        function names, and decision points that the agent will need.""",
        messages=[
            {
                "role": "user",
                "content": f"Summarize this agent conversation history:\n\n{formatted_history}"
            }
        ]
    )

    summary_text = summary_response.content[0].text

    # Construct the compressed history
    return [
        {
            "role": "user",
            "content": f"[CONTEXT SUMMARY - {len(messages_to_compact)} messages compacted]\n\n{summary_text}"
        },
        {
            "role": "assistant",
            "content": "Context loaded. I understand the task state and will continue from where the summary left off."
        },
        *messages_to_keep  # Keep recent messages verbatim
    ]


def format_messages_for_summary(messages: list[dict]) -> str:
    """Format message list into a readable string for the summarizer."""
    def field(block, key, default=""):
        # Blocks are dicts for tool results but SDK objects for assistant
        # content (response.content), so read fields from either form.
        return block.get(key, default) if isinstance(block, dict) else getattr(block, key, default)

    lines = []
    for msg in messages:
        role = msg["role"].upper()
        if isinstance(msg["content"], str):
            lines.append(f"[{role}]: {msg['content'][:500]}")
            continue
        for block in msg["content"]:
            block_type = field(block, "type")
            if block_type == "text":
                lines.append(f"[{role}]: {field(block, 'text')[:500]}")
            elif block_type == "tool_use":
                lines.append(f"[{role} TOOL CALL]: {field(block, 'name')}({str(field(block, 'input'))[:200]})")
            elif block_type == "tool_result":
                lines.append(f"[TOOL RESULT]: {str(field(block, 'content'))[:300]}")
    return "\n".join(lines)

The Context Reset Pattern

For long-horizon tasks where compaction is insufficient, the context reset pattern works as follows:

  1. Planner agent creates a PLAN.md with task decomposition and success criteria
  2. Executor agent works through the plan, writing progress to PROGRESS.md
  3. At reset trigger (context high-water mark or phase boundary): executor writes HANDOFF.md
  4. New executor agent is spawned; it reads PLAN.md + PROGRESS.md + HANDOFF.md as its initial context
  5. Loop continues until all plan items are checked off

The HANDOFF.md must be rigorously structured — it is the only mechanism by which the new agent inherits state:

# HANDOFF.md — Agent Context Reset

## Task Summary
[One paragraph: what we're building, what success looks like]

## Completed Steps
- [x] Step 1: Set up project structure (output: /src/index.ts created)
- [x] Step 2: Implemented auth module (output: /src/auth/*.ts, 847 lines)
- [x] Step 3: Database schema defined (output: schema.prisma, 23 models)

## Current Step
- [ ] Step 4: Implement REST API endpoints for /users, /posts, /comments

## Key Decisions Made
- Chose Hono over Express for the API layer (reason: better TypeScript types)
- Using Zod for request validation (aligned with schema.prisma types)
- JWT expiry set to 15min access / 7d refresh (security requirement from AGENTS.md)

## Files Written (do NOT overwrite)
/src/auth/jwt.ts, /src/auth/middleware.ts, /src/db/client.ts, schema.prisma

## Errors Encountered and Resolved
- TypeScript error in auth.ts: fixed by adding explicit return type annotations
- Prisma connection issue: resolved by adding DATABASE_URL to .env

## Next Actions
1. Read /src/auth/middleware.ts to understand the auth patterns in use
2. Implement GET /users/:id endpoint in /src/routes/users.ts
3. Run `npm run test:api` after each endpoint to verify

## Environment
- Node 22, TypeScript 5.8, Prisma 6.x, Hono 4.x
- Tests: `npm test` | Lint: `npm run lint` | Build: `npm run build`

A new agent initialized with this HANDOFF.md can pick up a multi-hour task with little loss of context coherence, provided the handoff is complete and accurate.
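Because the handoff is the only state the new agent inherits, it is worth validating before the reset. The sketch below checks for a set of required headings and assembles the initial message; which sections count as required is an assumption here, as is the way the three files are joined.

```python
# Assumed minimum set of sections; adjust to your own HANDOFF.md template.
REQUIRED_SECTIONS = [
    "Task Summary",
    "Completed Steps",
    "Current Step",
    "Key Decisions Made",
    "Next Actions",
]


def missing_sections(handoff_md: str) -> list[str]:
    """Return required '## ' headings that are absent from the handoff."""
    headings = {line[3:].strip() for line in handoff_md.splitlines() if line.startswith("## ")}
    return [section for section in REQUIRED_SECTIONS if section not in headings]


def build_reset_context(plan_md: str, progress_md: str, handoff_md: str) -> list[dict]:
    """Build the first message for a freshly spawned executor agent."""
    missing = missing_sections(handoff_md)
    if missing:
        raise ValueError(f"HANDOFF.md is missing sections: {missing}")
    content = "\n\n".join([
        "# PLAN.md\n" + plan_md,
        "# PROGRESS.md\n" + progress_md,
        handoff_md,
    ])
    return [{"role": "user", "content": content}]


handoff = "\n".join(f"## {name}\n..." for name in REQUIRED_SECTIONS)
messages = build_reset_context("1. Build API", "- [x] Step 1", handoff)
print(len(messages), messages[0]["role"])  # 1 user
```

Refusing to reset on an incomplete handoff is deliberate: a new agent that starts from a partial state tends to redo or overwrite finished work.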


7. Multi-Agent Context Handoffs

Single-agent systems hit a ceiling on complex tasks. The more powerful pattern — increasingly standard in production 2026 — is the Planner → Generator → Evaluator three-agent architecture.

[Figure: Multi-Agent Context Handoff Pattern 2026]

Why Three Agents?

Anthropic's engineering team documented a critical limitation: agents reliably over-rate their own work. When a code-generating agent is asked to evaluate the code it just wrote, it responds with confidence and praise — even when a human would immediately spot problems.

The fix is architectural, not prompt-based. Separating generation from evaluation is far more effective than any prompting strategy that tries to make a generator self-critical. A standalone evaluator tuned to be skeptical produces reliable criticism because it has no ego investment in the output.

Here is the three-agent pattern in code:

import anthropic
import json

client = anthropic.Anthropic()

def run_planning_agent(task: str) -> dict:
    """
    Planner: Decomposes the task and writes a structured execution plan.
    Returns a plan dict that the generator will follow.
    """
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        system="""You are a technical planning agent. Your job is to:
        1. Decompose the given task into concrete, verifiable sub-tasks
        2. Identify success criteria for each sub-task
        3. Flag potential risks or ambiguities
        4. Output a structured JSON plan

        Be specific. Vague plans produce vague outputs.""",
        messages=[{
            "role": "user",
            "content": f"Create an execution plan for: {task}"
        }]
    )

    # Extract JSON plan from response
    plan_text = response.content[0].text
    # In production: use structured output / tool_use to enforce JSON
    return {"plan": plan_text, "task": task}


def run_generator_agent(plan: dict, tools: list[dict]) -> dict:
    """
    Generator: Executes the plan using available tools.
    Produces artifacts (code, files, data) and a generation report.
    """
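    # Revision notes are added to the plan by the pipeline after a failed
    # evaluation; include them so evaluator feedback reaches the generator.
    revision_notes = "\n".join(f"- {note}" for note in plan.get("revision_notes", []))
    if not revision_notes:
        revision_notes = "- none (first attempt)"

    generation_prompt = f"""
Execute the following plan:

{plan['plan']}

Original task: {plan['task']}

Evaluator feedback from the previous round:
{revision_notes}

Use the available tools to complete each step. 
Write progress to PROGRESS.md after each major step.
When complete, output a JSON summary with:
- completed_steps: list of completed items
- artifacts: list of files/outputs created  
- issues_encountered: any problems found
- confidence_score: your self-assessment (0-10)
"""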

    messages = [{"role": "user", "content": generation_prompt}]
    artifacts = {}

    # Run the generator loop (simplified — production version has full ContextPolicy)
    for _ in range(100):  # max iterations
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=8096,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            final_output = extract_text(response)
            return {"output": final_output, "artifacts": artifacts, "plan": plan}

        if response.stop_reason == "tool_use":
            tool_results = execute_tools_and_collect_artifacts(response.content, artifacts)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    return {"output": "max_iterations_reached", "artifacts": artifacts, "plan": plan}


def run_evaluator_agent(generation_result: dict, evaluation_criteria: list[str]) -> dict:
    """
    Evaluator: Critically assesses the generator's output.

    Key design choice: this agent is tuned to be skeptical.
    It does NOT have access to the generation history — 
    only the artifacts and evaluation criteria.
    """
    criteria_text = "\n".join(f"- {c}" for c in evaluation_criteria)

    eval_prompt = f"""
You are a rigorous code reviewer and technical evaluator. 
Your job is to critically assess the following output against specific criteria.

IMPORTANT: Be skeptical. Do not give credit for work that is incomplete, 
incorrect, or meets the criteria only superficially. When in doubt, fail the criterion.

Evaluation Criteria:
{criteria_text}

Output to evaluate:
{generation_result['output']}

Artifacts produced: {list(generation_result['artifacts'].keys())}

For each criterion, provide:
- PASS or FAIL
- Specific evidence for your judgment (quote relevant code/output)
- If FAIL: specific remediation steps

Output as JSON: {{"criteria_results": [...], "overall_pass": bool, "revision_signals": [...]}}
"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        system="You are a skeptical technical evaluator. Your bias is to find problems, not to validate work. A false negative (missing a real problem) is worse than a false positive (flagging non-problems).",
        messages=[{"role": "user", "content": eval_prompt}]
    )

    eval_text = response.content[0].text
    return {"evaluation": eval_text, "generation_result": generation_result}


def run_full_pipeline(
    task: str,
    tools: list[dict],
    evaluation_criteria: list[str],
    max_revision_rounds: int = 3
) -> dict:
    """
    Full Planner → Generator → Evaluator pipeline with revision loop.
    """
    # Phase 1: Planning
    plan = run_planning_agent(task)

    revision_round = 0
    while revision_round < max_revision_rounds:
        # Phase 2: Generation
        generation_result = run_generator_agent(plan, tools)

        # Phase 3: Evaluation
        eval_result = run_evaluator_agent(generation_result, evaluation_criteria)

        # Parse evaluation to check if we're done
        if is_evaluation_passing(eval_result["evaluation"]):
            return {
                "status": "success",
                "output": generation_result["output"],
                "artifacts": generation_result["artifacts"],
                "evaluation": eval_result["evaluation"],
                "revision_rounds": revision_round,
            }

        # Inject revision signals back into the plan for the next generator run
        revision_signals = extract_revision_signals(eval_result["evaluation"])
        plan["revision_notes"] = revision_signals
        plan["revision_round"] = revision_round + 1
        revision_round += 1

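    # Revision budget exhausted: return the last attempt
    # (assumes max_revision_rounds >= 1, so the loop ran at least once)
    return {
        "status": "max_revisions_reached",
        "output": generation_result["output"],
        "artifacts": generation_result["artifacts"],
        "evaluation": eval_result["evaluation"],
        "revision_rounds": revision_round,
    }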


def is_evaluation_passing(evaluation_text: str) -> bool:
    """Parse the evaluator's JSON output and check overall_pass."""
    import re  # local import keeps this helper self-contained

    # In production: use structured output from the evaluator
    match = re.search(r'"overall_pass":\s*(true|false)', evaluation_text)
    # `match and ...` alone would return None (not False) when nothing matches
    return match is not None and match.group(1) == "true"


def extract_revision_signals(evaluation_text: str) -> list[str]:
    """Extract actionable revision signals from evaluator output."""
    # In production: parse the JSON revision_signals array
    return ["Review evaluator feedback and address all FAIL criteria before resubmitting."]


def extract_text(response) -> str:
    for block in response.content:
        if hasattr(block, 'text'):
            return block.text
    return ""


def execute_tools_and_collect_artifacts(content_blocks, artifacts: dict) -> list[dict]:
    """Execute tools, collect file artifacts, return tool results."""
    results = []
    for block in content_blocks:
        if hasattr(block, 'type') and block.type == "tool_use":
            output = dispatch_tool(block.name, block.input)  # dispatch to tool registry
            # Track file writes as artifacts
            if block.name in ("write_file", "create_file"):
                artifacts[block.input.get("path", "unknown")] = True
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            })
    return results

The revision loop, where evaluator signals feed back into the next generator run, plays a role at inference time that is loosely analogous to an adversarial critic during training. It is one of the more reliable mechanisms for improving agent output quality without touching model weights.
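The `extract_revision_signals` stub above returns a fixed string. One way to make the feedback structured is to parse the evaluator's JSON reply directly. The sketch below assumes the evaluator follows the output format requested in the prompt (`criteria_results`, `overall_pass`, `revision_signals`) and may wrap the JSON in extra prose. The helper names `parse_evaluation`, `is_passing`, and `extract_signals` are illustrative and not part of the pipeline above.

```python
import json


def parse_evaluation(evaluation_text: str) -> dict:
    """Return the first JSON object found in the evaluator's reply, or {}."""
    decoder = json.JSONDecoder()
    start = evaluation_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores any trailing text
            parsed, _ = decoder.raw_decode(evaluation_text, start)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = evaluation_text.find("{", start + 1)
    return {}


def is_passing(evaluation_text: str) -> bool:
    """True only when overall_pass is the JSON literal true."""
    return parse_evaluation(evaluation_text).get("overall_pass") is True


def extract_signals(evaluation_text: str) -> list[str]:
    """Return the evaluator's revision_signals as a list of strings."""
    signals = parse_evaluation(evaluation_text).get("revision_signals", [])
    if not isinstance(signals, list):
        return []
    return [str(signal) for signal in signals]
```

If the reply contains no parseable JSON object, `is_passing` returns `False` and `extract_signals` returns an empty list, so a malformed evaluation never counts as a pass.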


8. Building Your AGENTS.md: The Ratchet Principle

The AGENTS.md file (or CLAUDE.md in Anthropic's ecosystem) is the long-term memory of your harness. It is where your operational knowledge about what the agent should and should not do lives — and it should grow with every production incident.

Addy Osmani coined the ratchet principle:

"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."

Every line in a mature AGENTS.md should be traceable to a specific incident. Here is an annotated example for a backend coding agent:

# AGENTS.md — Backend Coding Agent

## Identity
You are a senior backend engineer working on a Node.js/TypeScript API.
Stack: Node 22, TypeScript 5.8, Prisma 6, Hono 4, PostgreSQL 16.

## Core Constraints (Do NOT violate these)

### Safety
- NEVER run destructive database commands (DROP, TRUNCATE, DELETE without WHERE) 
  without explicit confirmation in the task spec.
  [INCIDENT: Agent dropped test table in production-connected environment, 2026-02-14]

- NEVER commit changes to main or master. Always use feature branches.
  [INCIDENT: Direct push to main caused CI failure, 2026-01-22]

- NEVER expose secrets, API keys, or credentials in code, logs, or output.
  [STANDARD: Security baseline]

### Code Quality  
- NEVER comment out tests. If a test is wrong, fix it or delete it.
  [INCIDENT: Commented-out tests reached production, undetected for 3 weeks]

- ALWAYS run `npm run lint` and `npm run build` before declaring a task complete.
  [INCIDENT: Type errors in production 4 separate times before this rule was added]

- ALWAYS add JSDoc to public functions. Private functions optional.
  [TEAM STANDARD: Code review surfaced this as consistent gap, 2026-03-01]

### Architecture
- Use Zod for ALL request/response validation. Do not use manual if-checks.
  [ARCHITECTURAL DECISION: Adopted 2025-11-15 for consistency]

- Repository pattern for DB access. Never write raw SQL in route handlers.
  [ARCHITECTURAL DECISION: Established 2025-09-01]

- Error responses MUST follow the ApiError schema in /src/types/errors.ts
  [INCIDENT: Inconsistent error formats broke mobile app, 2026-01-08]

## Workflow
1. Before starting: read PLAN.md if present, read PROGRESS.md if present
2. Write to PROGRESS.md after completing each major step
3. After completing: run full test suite, lint, and type-check before reporting done
4. If blocked by an error for > 3 attempts: write the error to BLOCKERS.md and stop

## Tool Use Priorities
1. Prefer reading existing files before writing new ones (understand before building)
2. Prefer modifying existing patterns over creating new ones
3. When writing tests: co-locate with source files (*.test.ts next to *.ts)

The power of AGENTS.md is that it is version controlled and incrementally improved. The team reviews it in sprint retrospectives. Lines that no longer apply (because a capable model made them redundant) get removed. Lines get added with every incident. Over time, it becomes a hard-won knowledge artifact that outlasts individual models.
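One way to keep the "every rule is traceable" property from eroding is a small check that can run in CI. This is a minimal sketch that assumes the conventions of the example above: rules are top-level `- ` bullets, and the origin is a bracketed tag such as `[INCIDENT: ...]` on its own line under the rule. The function name and the regex are hypothetical.

```python
import re

# Matches lines like "  [INCIDENT: ...]" or "  [TEAM STANDARD: ...]"
ORIGIN_TAG = re.compile(r"^\s*\[[A-Z][A-Z ]*:.*\]\s*$")


def rules_missing_origin(agents_md: str) -> list[str]:
    """Return the first line of every '- ' rule that has no origin tag."""
    missing = []
    rule = None      # first line of the rule currently being scanned
    tagged = False   # whether that rule has an origin tag so far

    # The trailing "" is a sentinel so the last rule is always flushed
    for line in agents_md.splitlines() + [""]:
        starts_rule = line.startswith("- ")
        # A new bullet, blank line, or heading ends the current rule
        ends_rule = starts_rule or not line.startswith(" ")

        if rule is not None and ORIGIN_TAG.match(line):
            tagged = True
            continue
        if rule is not None and ends_rule:
            if not tagged:
                missing.append(rule)
            rule = None
        if starts_rule:
            rule, tagged = line[2:].strip(), False
    return missing
```

Numbered workflow steps are ignored on purpose, since the example file only tags bulleted constraints.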


9. MCP: Context Engineering at Scale

The Model Context Protocol (MCP), introduced by Anthropic in late 2024, has become the infrastructure layer for context distribution in complex agent systems. Where AGENTS.md and system prompts are static context, MCP provides dynamic, programmatic context injection from external servers.

An MCP server is a lightweight service that exposes tools, resources, and prompts to any compliant client. The practical effect: you can build a library of context injection services — a codebase analyzer, a Jira ticket reader, a compliance rule fetcher — and connect them to any agent harness without changing the harness itself.

Here is a minimal MCP server for injecting project context:

from mcp.server.fastmcp import FastMCP
from pathlib import Path
import subprocess

mcp = FastMCP("project-context-server")

@mcp.tool()
def get_project_structure(max_depth: int = 3) -> str:
    """
    Returns the project file structure up to max_depth.
    Use this at the start of any coding task to understand the codebase layout.
    """
    result = subprocess.run(
        ["find", ".", "-maxdepth", str(max_depth), "-not", "-path", "*/node_modules/*",
         "-not", "-path", "*/.git/*"],
        capture_output=True, text=True
    )
    return result.stdout


@mcp.tool()
def get_recent_git_changes(n_commits: int = 10) -> str:
    """
    Returns the last N commits with per-file change stats (not full diffs).
    Use this to understand what has changed recently and avoid regressions.
    """
    result = subprocess.run(
        ["git", "log", f"-{n_commits}", "--oneline", "--stat"],
        capture_output=True, text=True
    )
    return result.stdout


@mcp.resource("project://agents-config")
def get_agents_config() -> str:
    """Returns the current AGENTS.md content as a resource."""
    agents_md = Path("AGENTS.md")
    if agents_md.exists():
        return agents_md.read_text()
    return "No AGENTS.md found."


@mcp.tool()
def search_codebase(query: str, file_pattern: str = "*.ts") -> str:
    """
    Text search over the codebase using ripgrep (regex matching, not semantic search).

    Args:
        query: Text to search for (supports regex)
        file_pattern: Glob pattern to limit search scope

    Returns:
        Matching lines with file paths and line numbers
    """
    result = subprocess.run(
        ["rg", "--type-add", f"custom:{file_pattern}",
         "-t", "custom", "-n", "--context", "2",
         "-e", query],  # -e keeps a query starting with "-" from being read as a flag
        capture_output=True, text=True, cwd="."
    )
    # Limit output to avoid token overload
    lines = result.stdout.split("\n")[:100]
    return "\n".join(lines)


if __name__ == "__main__":
    mcp.run(transport="stdio")

This server can be connected to any MCP-compatible harness — Claude Code, Cursor, a custom harness — and immediately injects structured project context on demand. The key context engineering insight: on-demand context is far more token-efficient than pre-loaded context. Instead of dumping the entire codebase into the system prompt, you give the agent a tool to fetch exactly what it needs, when it needs it.

In June 2026, HuggingFace demonstrated MCP integration on physical robots (Reachy Mini), allowing a language model to control hardware via the same context injection protocol used for software development tasks. The architectural pattern — MCP as a universal context distribution layer — is proving more durable than any specific application.


10. Benchmarks and Token Economics

Context engineering is not just an architectural nicety — it has directly measurable impact on cost and performance.

[Figure: Context Engineering for AI Agents, architecture overview]

The Token Efficiency Gap

HuggingFace published agent traffic analysis in June 2026 that quantified the impact of CLI design on token consumption. Agents hand-rolling curl or raw SDK calls to accomplish Hub tasks reportedly used up to 6× more tokens than agents using the purpose-built hf CLI tool.

The mechanism: tool-optimized interfaces do context engineering for the agent. Instead of the agent generating multi-step API calls, parsing intermediate JSON, and handling errors with natural language — a tool does all of that and returns a single, structured, token-dense result.

IBM Research: Knowledge-Graph Guided Agents

IBM's research team (published June 2026) reported that agents equipped with domain-specific agent logic (knowledge graphs, program analysis libraries, algorithmic decomposition) consumed up to roughly 30× fewer tokens than baseline frontier LLM approaches on the same enterprise tasks, while achieving equal or better accuracy.

Their App Insights agent for mainframe code understanding pre-indexed application structure into hundreds of interrelated database tables, allowing the agent to retrieve precisely targeted facts rather than consuming raw source files. The context engineering insight: structured retrieval beats unstructured consumption.

Terminal Bench 2.0

The most striking benchmark: Viv Trivedy's team moved a coding agent from Top 30 to Top 5 on Terminal Bench 2.0 by changing only the harness. The model was identical. The harness changes:

  • Dynamic tool selection (reduced tool schema overhead ~60%)
  • AGENTS.md with codebase-specific rules (prevented ~15 recurring failure modes)
  • Compaction policy with LLM summarizer (enabled tasks 3× longer than baseline)
  • Backpressure signals from linter and test runner (reduced "finishing broken code" incidents ~80%)

This is the clearest quantification of the harness gap available today.

Cost Implications

For a team running an agent that averages 50,000 tokens per session at 1,000 sessions/day:

  • Without context engineering: ~50M tokens/day
  • With context engineering (assuming 40% reduction via policy optimization): ~30M tokens/day
  • At $15/1M tokens (Claude Opus class): savings of ~$300/day → ~$109,000/year

Context engineering at scale is a significant cost engineering problem, not just a quality engineering problem.
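The arithmetic above can be reproduced with a few lines, which also makes it easy to plug in your own traffic numbers. This sketch assumes a single blended price per million tokens; real pricing usually differs between input and output tokens, so treat the result as a rough estimate. The function name is illustrative.

```python
def token_cost_savings(
    tokens_per_session: int,
    sessions_per_day: int,
    reduction: float,               # fraction of tokens removed, e.g. 0.40
    usd_per_million_tokens: float,  # assumed blended price
    days_per_year: int = 365,
) -> dict:
    """Estimate daily and annual savings from a token reduction."""
    baseline_tokens = tokens_per_session * sessions_per_day
    optimized_tokens = baseline_tokens * (1 - reduction)
    saved_tokens = baseline_tokens - optimized_tokens
    daily_savings = saved_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "baseline_tokens_per_day": baseline_tokens,
        "optimized_tokens_per_day": optimized_tokens,
        "daily_savings_usd": daily_savings,
        "annual_savings_usd": daily_savings * days_per_year,
    }


result = token_cost_savings(50_000, 1_000, 0.40, 15.0)
print(f"${result['daily_savings_usd']:,.0f}/day")
print(f"${result['annual_savings_usd']:,.0f}/year")
```

With the figures from the example, this works out to about $300/day and about $109,500/year.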


11. Production Context Engineering Checklist

Before shipping an agent to production, validate against this checklist:

Context Window Design

  • [ ] System prompt has been audited for token waste (target: < 5% of total window)
  • [ ] Dynamic tool selection is implemented — agent does not see all tools at all times
  • [ ] Few-shot examples are retrieved dynamically, not hardcoded
  • [ ] RAG pipeline includes re-ranking before injection (not raw top-k)
  • [ ] Task headroom is explicitly reserved (minimum 10% of context window)

Compaction and Longevity

  • [ ] Compaction policy has a defined trigger threshold (recommend: 60% full)
  • [ ] Compaction uses an LLM summarizer, not naive truncation
  • [ ] For tasks > 50 steps: context reset pattern is implemented
  • [ ] HANDOFF.md template is defined and tested with a fresh-agent reconstruction test

Multi-Agent Architecture

  • [ ] Generator and evaluator are separate agents (no self-evaluation)
  • [ ] Evaluator is explicitly tuned to be skeptical (system prompt bias toward finding problems)
  • [ ] Revision signals are structured (JSON) not freeform
  • [ ] Pipeline has a max revision round limit (prevent infinite loops)

AGENTS.md

  • [ ] AGENTS.md exists and is version controlled
  • [ ] Every rule has a traceable origin (incident, architectural decision, team standard)
  • [ ] AGENTS.md is reviewed in retrospectives for additions and removals
  • [ ] Critical safety rules are also enforced programmatically in hooks (not just via prompt)
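
The last item above can be sketched as a guard that runs before any SQL tool call. It mirrors the destructive-command rule from the example AGENTS.md (DROP, TRUNCATE, DELETE without WHERE). This is a deliberately naive keyword check, not a SQL parser: it can block harmless statements that mention these keywords inside string literals, which errs on the side of safety. The function name is hypothetical, and how it is wired into a harness (a pre-tool-use hook or a wrapper around the database tool) depends on the harness and is not shown.

```python
import re

ALWAYS_BLOCKED = re.compile(r"\b(DROP|TRUNCATE)\b", re.IGNORECASE)
DELETE_FROM = re.compile(r"\bDELETE\s+FROM\b", re.IGNORECASE)
WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def check_sql(sql: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a SQL string the agent wants to run."""
    # Naive split: a ";" inside a string literal also splits the statement
    for statement in sql.split(";"):
        if ALWAYS_BLOCKED.search(statement):
            return False, "DROP/TRUNCATE requires explicit confirmation"
        if DELETE_FROM.search(statement) and not WHERE.search(statement):
            return False, "DELETE without WHERE requires explicit confirmation"
    return True, "ok"


allowed, reason = check_sql("DELETE FROM sessions")
print(allowed, reason)
```

Because the check lives in code, it still applies when the model ignores or forgets the corresponding prompt rule.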

Observability

  • [ ] Token consumption per call is logged
  • [ ] Context window utilization % is tracked as a metric
  • [ ] Compaction events are logged with before/after token counts
  • [ ] Tool call patterns are analyzed to find unused tools (candidates for removal)
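
The observability items can start as a few structured log records. The sketch below is one possible shape, with hypothetical field and function names. It assumes your model client reports input and output token counts per call, and it measures utilization on input tokens only.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("agent.context")


@dataclass
class CallUsage:
    call_id: str
    input_tokens: int     # tokens sent to the model on this call
    output_tokens: int    # tokens the model generated
    context_window: int   # model's maximum context size in tokens


def usage_record(usage: CallUsage) -> dict:
    """Log one model call with context utilization as a percentage."""
    record = asdict(usage)
    # Assumption: utilization counts input tokens only
    record["utilization_pct"] = round(
        100 * usage.input_tokens / usage.context_window, 1
    )
    logger.info(json.dumps(record))
    return record


def compaction_record(call_id: str, tokens_before: int, tokens_after: int) -> dict:
    """Log a compaction event with before/after token counts."""
    record = {
        "event": "compaction",
        "call_id": call_id,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_freed": tokens_before - tokens_after,
    }
    logger.info(json.dumps(record))
    return record


def unused_tools(registered: list[str], called: list[str]) -> list[str]:
    """Registered tools that never appear in the call log."""
    return sorted(set(registered) - set(called))
```

Emitting one JSON object per event keeps the records easy to aggregate later, whichever metrics backend you use.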

12. Conclusion

The model war was worth fighting. Better models matter. But in 2026, the frontier of practical agent engineering has moved decisively to the harness — and context engineering for AI agents is the discipline that defines it.

To recap the core principles:

Context engineering is the art and science of filling the LLM context window with precisely the right information for the next step. It covers system prompt design, tool schema optimization, dynamic memory retrieval, compaction strategies, handoff artifacts, and backpressure signals.

The harness is the runtime that executes context decisions on every call. Agent = Model + Harness. The behavioral gap between a basic agent and a production agent is almost entirely a harness gap.

The ratchet principle compounds improvement: every agent mistake becomes a permanent harness rule. A mature AGENTS.md is a hard-won operational knowledge base that makes your agent better than any cold-start system.

Token economics matter at scale: a 40% reduction in token consumption through context policy optimization translates to real cost savings — $100K/year at moderate scale — while simultaneously improving agent quality by reducing context overload.

The developers who will ship the best AI agents over the next two years are not the ones who pick the best model. They are the ones who build the best harnesses around whichever models are available. That work starts with context engineering.


If you found this useful, follow me on dev.to for more deep-dives on AI engineering. Questions, corrections, or real-world harness patterns to share? Drop them in the comments — this is a discipline we're all building together.


Tags: ai-agents, llm, context-engineering, generative-ai, python, anthropic, machine-learning, mlops
