Mehmet TURAÇ

Posted on May 30

Harness Engineering: The Code Around the Model Is the Hard Part

#ai #agents #architecture #productionengineering

Everyone benchmarks the model. Almost nobody benchmarks the harness — the loop, the tool dispatch, the context manager, the retry logic that wraps a raw inference call and turns it into something that can run unattended against production. In my experience building agentic platforms, swapping the model is a config change you ship in an afternoon. The harness is where the months go, and it's where reliability is actually won or lost.

This is the part that doesn't show up in demos. A demo agent calls a tool, gets a clean result, and prints a tidy answer. A production agent calls a tool that times out, gets a 200 with a malformed body, hits a rate limit on retry, and now has to decide whether to keep going or give up — all while staying inside a token budget and not corrupting anything downstream. The model doesn't solve that. The harness does.

The harness is the product

When people say "we built an agent," they usually mean they wrote a prompt and a tool schema. That's the easy 20%. The other 80% is the scaffolding that decides when to call the model, what to put in front of it, whether to trust what comes back, and what to do when something fails. That scaffolding is the harness, and it's where your engineering judgment lives.

The useful mental model: the LLM is a single, expensive, non-deterministic function call. Everything that makes that call safe, bounded, observable, and repeatable is your code. Treat the model as a component you don't control and the harness as the system you do, and most architecture decisions get clearer.

Anatomy of a harness

Strip away the framework branding and every agent harness has the same moving parts:

A control loop that runs steps until the task is done, a stop condition fires, or a budget is exhausted.
A context manager that assembles the prompt each step — system instructions, relevant history, tool specs — and decides what to drop when it won't fit.
A model call wrapped in its own timeout and retry policy.
A parse-and-validate stage that turns model output into a typed, checked action before anything acts on it.
A tool dispatcher that executes the chosen action with its own timeouts, retries, and idempotency handling.
Guardrails that gate side effects — allow-lists, argument validation, rate limits.
Observability that records every step as structured data.

Frameworks give you defaults for these. The defaults are fine for prototypes and quietly wrong for production, because the right policy is domain-specific. How many steps before you bail? What's a retryable tool error versus a fatal one? What do you drop from context first? Nobody can answer those for you.
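To make the control loop and its three exits concrete, here is a minimal sketch. The names `Budget`, `RunResult`, `run_agent`, and the `step_fn` callback are hypothetical and not taken from any framework, and the default limits are placeholders you would tune for your own domain.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Budget:
    """Hypothetical per-task limits; the right numbers are domain-specific."""
    max_steps: int = 12
    max_tokens: int = 50_000
    tokens_used: int = 0


@dataclass
class RunResult:
    status: str                  # "done", "step_limit", or "token_limit"
    steps: int                   # number of steps actually executed
    answer: Optional[str] = None


def run_agent(step_fn: Callable[[int], tuple[Optional[str], int]],
              budget: Budget) -> RunResult:
    """Run steps until the task finishes or a budget is exhausted.

    step_fn(step_no) stands in for one full harness step and returns
    (answer_or_None, tokens_spent).
    """
    for step_no in range(1, budget.max_steps + 1):
        # Check the token budget before paying for another model call.
        if budget.tokens_used >= budget.max_tokens:
            return RunResult("token_limit", step_no - 1)
        answer, spent = step_fn(step_no)
        budget.tokens_used += spent
        if answer is not None:
            return RunResult("done", step_no, answer)
    # Hard ceiling: a confused agent cannot loop forever.
    return RunResult("step_limit", budget.max_steps)
```

Returning a status instead of raising keeps "ran out of budget" a normal, reportable outcome rather than a crash, which is usually what you want from an unattended run.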

Tool calls are an untrusted boundary

The single most common production failure I see is treating model output as if it were already valid. The model proposes a tool call; the harness executes it verbatim. Then one day the model emits an argument that's subtly out of range, or invents a tool name, or returns JSON with a trailing comment, and the dispatcher happily forwards garbage into a system that does real things.

A tool call from the model is a proposal, not an instruction. Validate it like input from an untrusted client, because that's exactly what it is.

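# AgentState, Tool, StepResult, ValidationError, call_model, dispatch, policy,
# and trace are assumed to be defined elsewhere in the harness.
def step(state: AgentState, tools: dict[str, Tool]) -> StepResult: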
    # 1. Assemble context within budget — drop oldest observations first
    prompt = state.context.render(token_budget=state.remaining_tokens())

    # 2. Model call is fallible: its own timeout + bounded retry
    completion = call_model(prompt, timeout_s=30, max_retries=2)
    state.spend(completion.usage)

    proposal = completion.tool_call
    if proposal is None:
        return StepResult(done=True, answer=completion.text)

    # 3. Validate the proposal BEFORE anything acts on it
    tool = tools.get(proposal.name)
    if tool is None:
        # Don't crash — feed the error back so the model can recover
        state.context.add_observation(f"error: unknown tool '{proposal.name}'")
        return StepResult(done=False)

    try:
        args = tool.schema.validate(proposal.arguments)
    except ValidationError as e:
        state.context.add_observation(f"error: invalid args: {e}")
        return StepResult(done=False)

    # 4. Guardrail: side effects must pass policy
    if tool.has_side_effects and not policy.allows(tool, args, state):
        state.context.add_observation("error: action blocked by policy")
        return StepResult(done=False)

    # 5. Dispatch with the tool's own failure handling
    observation = dispatch(tool, args, timeout_s=tool.timeout, retries=tool.retries)
    state.context.add_observation(observation)

    trace.emit(step=state.step_no, tool=tool.name, usage=completion.usage,
               latency_ms=observation.latency_ms, outcome=observation.status)
    return StepResult(done=False)

Notice what the failure paths do: they don't raise. A bad tool name, invalid arguments, or a blocked action all become observations fed back into context. The model gets to see its mistake and try again. This single pattern — turning harness-level errors into model-visible feedback — is the difference between an agent that recovers and one that dies on the first imperfect output.
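The `tool.schema.validate` call above is left abstract, so here is one way the validation stage could look using only the standard library. Everything in it is an assumption for illustration: the `ArgSpec` rule type, the `TOOLS` registry, the `refund` tool, and its limits. A real harness would more likely lean on a schema library.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ArgSpec:
    """Hypothetical argument rule: a type plus an optional numeric range."""
    type_: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical registry mapping tool name -> argument rules.
TOOLS: dict[str, dict[str, ArgSpec]] = {
    "refund": {"order_id": ArgSpec(str), "amount": ArgSpec(float, 0.01, 500.0)},
}


def validate_proposal(raw: str) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (action, None) when valid, else (None, error text for the model)."""
    try:
        data = json.loads(raw)  # rejects comments and trailing commas
    except json.JSONDecodeError as e:
        return None, f"error: malformed JSON: {e.msg}"
    if not isinstance(data, dict):
        return None, "error: expected a JSON object"
    name, args = data.get("name"), data.get("arguments")
    spec = TOOLS.get(name) if isinstance(name, str) else None
    if spec is None:
        return None, f"error: unknown tool {name!r}"
    if not isinstance(args, dict) or set(args) != set(spec):
        return None, f"error: arguments must be exactly {sorted(spec)}"
    clean: dict[str, Any] = {}
    for key, rule in spec.items():
        value = args[key]
        # JSON has one number type, so accept ints where a float is expected.
        if rule.type_ is float and type(value) is int:
            value = float(value)
        if not isinstance(value, rule.type_):
            return None, f"error: {key} must be {rule.type_.__name__}"
        if rule.min_value is not None and rule.max_value is not None:
            # Written this way so NaN fails the check instead of slipping through.
            if not (rule.min_value <= value <= rule.max_value):
                return None, f"error: {key} out of range"
        clean[key] = value
    return {"name": name, "arguments": clean}, None
```

The error string is written for the model, not for a log file: it goes back into context as an observation, so it should say what was wrong in terms the model can act on.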

Context is a budget, not a buffer

The naive harness appends everything to a growing transcript and passes it back every step. This works until it doesn't: you blow the context window, latency climbs with every step, cumulative input-token cost grows roughly quadratically with the number of steps (each step re-sends everything that came before it), and the model's attention degrades as the relevant signal drowns in old tool dumps.
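The quadratic growth is easy to check with a little arithmetic. The sketch below uses made-up numbers (a 1,000-token base prompt and 2,000 tokens of new observations per step) and ignores output tokens and any prompt caching your provider may offer, so treat it as an illustration of the shape of the curve rather than a cost model.

```python
def total_input_tokens(steps: int, base: int, per_step: int) -> int:
    """Input tokens sent over a whole task when every step re-sends the
    base prompt plus all observations accumulated so far."""
    return sum(base + per_step * i for i in range(steps))


# Assumed numbers, for illustration only.
for n in (8, 16, 32):
    print(n, total_input_tokens(n, base=1_000, per_step=2_000))
# 8 64000
# 16 256000
# 32 1024000
```

Doubling the number of steps quadruples the input tokens, which is why a long-running agent on an append-only transcript gets expensive much faster than its step count suggests.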

Context is a budget you spend deliberately each step. That means making active decisions: which prior observations still matter, which can be summarized, which can be dropped entirely. A 40KB API response that mattered three steps ago is now dead weight — keep a one-line summary of what it told you and discard the body. The control loop's job isn't to remember everything; it's to keep the useful state in front of the model and evict the rest. Get this wrong and a task that should take eight steps either runs out of window at step twelve or costs five times what it should.
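As a rough sketch of what "summarize, then evict" can look like, the function below keeps full bodies only for the newest observations, one-line summaries for older ones, and drops the oldest entries if the prompt still does not fit. The `Observation` type, `render_context`, and the four-characters-per-token estimate are all assumptions for illustration. Evicting by age is a placeholder policy too, since what still matters at a given step is domain-specific.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    ref: str      # id under which the full body is stored elsewhere
    summary: str  # one-line distillation, cheap to keep
    body: str     # full payload, e.g. a large API response


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def render_context(system: str, observations: list[Observation],
                   token_budget: int, keep_full: int = 1) -> str:
    """Render a prompt that fits the budget.

    The newest `keep_full` observations keep their bodies, older ones are
    reduced to summaries, and the oldest entries are evicted if needed.
    """
    n = len(observations)
    lines = []
    for i, obs in enumerate(observations):
        if i >= n - keep_full:
            lines.append(f"[{obs.ref}] {obs.body}")
        else:
            lines.append(f"[{obs.ref}] (summary) {obs.summary}")
    # Evict oldest entries first until the prompt fits.
    while lines and count_tokens("\n".join([system, *lines])) > token_budget:
        lines.pop(0)
    return "\n".join([system, *lines])
```

Keeping the `ref` in every rendered line means an evicted body stays addressable: a later step can ask for it by id instead of the harness carrying the payload forward on every call.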

Plan for failure, because it's the default

In a system where one step is a network call to a probabilistic model and the next is a network call to a flaky third-party API, failure isn't the exception — it's the steady state. The harness has to assume every external call can time out, return malformed data, or partially succeed.

The parts that earn their keep here are unglamorous: timeouts on every external call (model included), bounded retries with backoff, idempotency keys on any tool that mutates state so a retry doesn't double-charge or double-send, and a hard step ceiling so a confused agent can't loop forever burning tokens. None of this is novel — it's the same distributed-systems discipline we've applied for two decades. What's new is that one of the unreliable components is now the decision-maker itself, which means a retry can produce a different decision. Your harness has to be correct under that, not just under transient errors.
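Here is a minimal sketch of bounded retries with backoff and an idempotency key. `RetryableError` and `call_with_retry` are hypothetical names, and the sketch assumes the downstream service actually deduplicates on the key it is given; without that, reusing a key on retry buys you nothing.

```python
import random
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, 429s, 503s)."""


def call_with_retry(fn: Callable[[str], T], max_attempts: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(idempotency_key) with bounded retries and jittered backoff.

    The key is generated once and reused on every attempt, so a service
    that honors it can treat a retry as the same request.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(key)
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the harness
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Note that this only covers retrying the same tool call. If the harness instead re-asks the model after a failure, the model may propose a different action, so the key has to be tied to the decision and not just to the HTTP request.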

You can't fix what you can't see

A non-deterministic system you can't replay is a system you can't debug. When an agent does something wrong in production — picks the wrong tool, loops, gives up early — "it worked on my machine" is meaningless, because your machine got a different sample.

So every step has to emit structured data: the assembled context, the model's decision, the tool called, the arguments, latency, token usage, and outcome. Not log lines you grep — structured spans you can query, aggregate, and replay. With that, "the agent failed" becomes "at step 7 it called the search tool with an empty query because the previous observation got evicted from context," which is an actual bug with an actual fix. Without it, you're tuning prompts by superstition. Token and cost accounting belong in the same trace, because on a long-running agent they're a production concern, not a billing footnote.
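One simple way to get there is a typed span per step, written as one JSON object per line. The `StepSpan` fields below mirror the list above, but the names, the `context_refs` field, and the JSON Lines sink are my assumptions for this sketch, not a standard schema. In a real system you would probably emit these through your existing tracing stack.

```python
import json
import sys
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional, TextIO


@dataclass
class StepSpan:
    run_id: str
    step_no: int
    tool: Optional[str]           # None when the model answered directly
    arguments: dict[str, Any]
    outcome: str                  # e.g. "ok", "error", "blocked"
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    # Ids of the observations that were in context for this step, so a
    # bad decision can be traced back to what the model could see.
    context_refs: list[str] = field(default_factory=list)
    ts: float = field(default_factory=time.time)


def to_json_line(span: StepSpan) -> str:
    """Serialize a span as one JSON object, suitable for a JSON Lines file."""
    return json.dumps(asdict(span), sort_keys=True)


def emit(span: StepSpan, sink: TextIO = sys.stdout) -> None:
    """Write the span to the sink, one record per line."""
    sink.write(to_json_line(span) + "\n")
```

Recording which observations were in context is what turns "it called search with an empty query" into "it called search with an empty query because the observation it needed was not in `context_refs`".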

The takeaway

The model gets the headlines and the harness gets the pager. As base models keep improving, the differentiator between an agent that demos well and one that survives contact with production won't be which model you picked — it'll be the engineering quality of the code wrapped around it: how it validates, how it budgets context, how it fails, and how observable it is.

So here's what I keep coming back to: if you swapped your agent's underlying model tomorrow, how much of your reliability would survive the change — and how much was the harness carrying all along?

Runnable, tested example: https://github.com/mturac/harness-demo

Top comments (6)

Harjot Singh • May 31

This is the thesis I'd put on a billboard. The model is increasingly a commodity - everyone has access to roughly the same frontier weights - so the durable engineering value is entirely in the harness: context management, tool orchestration, error recovery, verification, and routing. "Prompt better" is a dead end; "engineer the system around the model" is the actual craft, and it's unglamorous enough that most people skip it.

I've staked a whole product on exactly this - Moonshift is a multi-agent pipeline (the harness, not a model) that takes a prompt to a shipped SaaS on your own GitHub + Vercel. The harness is where everything lives: plan-as-contract so agents re-ground instead of re-deriving, verification gates between steps, and model-routing so the cheap-80/expensive-20 split keeps a full build ~$3 flat. First run's free, no card. Genuinely excited someone's naming this - what do you consider the hardest part of the harness? For me it's context handoff between steps without either losing the thread or dragging the whole history along.

Mehmet TURAÇ • May 31

Appreciate that — and yeah, context handoff is the one I'd single out too. The framing that's worked for me: the transcript is the wrong unit. What carries between steps isn't "what happened," it's "what's still load-bearing for the next decision." Those are different sets, and conflating them is exactly what produces both failure modes you named — keep the raw transcript and you drag dead weight, trim it naively and you cut the thread.

So the handoff isn't a copy, it's a projection. Each step emits a compact, typed result — the decision, the outcome, and a one-line distilled observation — and that's what gets carried, not the 40KB body that produced it. The full artifact gets evicted but stays addressable (by id/ref) in case a later step genuinely needs to rehydrate it. Cheap to keep a pointer, expensive to keep the payload.

The hard judgment underneath is relevance decay: which prior observation still matters at step N. I don't think there's a clean general answer — it's domain-specific, and getting it wrong is what your context-eviction bug looks like in practice (the empty-query-because-it-got-evicted case). Curious how Moonshift's plan-as-contract handles re-grounding when an early step's output gets evicted but a late step needs it — do you re-derive, or do you keep the plan itself as the durable spine and let observations be disposable around it?

Harjot Singh • May 31

Context handoff is the one that quietly breaks everything at scale. The lossy summary between agents is where a good run rots: agent B re-derives what A already knew, or worse acts on A's conclusion without A's caveats and compounds the error. What helped me most was making the handoff an explicit, structured artifact (not "summarize and pass the blob") plus a verify gate before B builds on it, so a bad handoff gets caught instead of propagating. Continuity + verification is genuinely the whole ballgame for multi-agent. Great piece, you're writing the thesis I keep trying to articulate.

Harjot Singh • May 31

Context handoff is the right one to single out, it's the failure mode that shows up in no benchmark but quietly wrecks real multi-step work. The model is fine, the harness loses the thread: what the previous step decided, what's already been tried, what the actual goal was three steps back. Most "the AI got confused" moments are really "the harness didn't carry the right state forward." The fix that's worked for me is treating context as something you deliberately curate and pass, not something you dump the whole transcript into and hope. That handoff layer is most of what makes Moonshift's pipeline hold together across a 14-step build. What's your approach, structured state objects between steps, running summarization, or something else?

TxDesk • May 31

The harness-as-product framing is right and the part I'd push further is the trust-boundary section.

You called out tool calls from the model as untrusted input. The mirror image is also true and more often missed: tool RESULTS coming back from external systems are also untrusted input to the harness. Not adversarially untrusted necessarily, but semantically untrusted. The protocol or API returns bytes the model has to reason about, and if the harness doesn't decode and validate those bytes into a typed shape the model can actually understand, the model gets to hallucinate over raw blobs.

In my domain (DeFi support agents reading on-chain state), this is where most failure lives. The RPC returns hex calldata that's technically a valid response. The model dutifully tries to interpret it and makes something up. The fix is the same shape as your tool-call validation: the harness decodes against the protocol's actual ABI, attaches structured semantics, and feeds the model a typed observation. Now the model is reasoning over "user holds 14.2 WETH as collateral against 23k USDC debt, HF 1.31" instead of "0x70a08231...".

The answer to your closing question for me is: most of my reliability survives a model swap, and almost none of it survives an ABI decoder swap. Which is the same point you're making, just from the data-shape side rather than the control-flow side.

Code link's bookmarked. Curious if your harness has a notion of typed observations or if you keep them as raw text into context.

Kaspar von Grünberg • Jun 9

The framing here is right: the harness is where the engineering actually lives. But there's one level above it that most teams only discover after the third production incident.

A harness solves the single-agent reliability problem. The moment you're running dozens of agents across different teams and functions, you need the same harness logic standardized across all of them, or every team reinvents the retry policy, the context budget, the guardrails, the observability schema. That's exactly the platform problem. What you're describing as "harness" is what we'd call the execution and evaluation layers of an Agent Infrastructure substrate, and the teams that get there fastest are the ones who treat it as shared infrastructure, not per-agent scaffolding. We mapped this out architecturally here if useful: weaveintelligence.io/blog/what-is-...