Manoranjan Rajguru

Posted on May 30

Agent Harness Explained: Build Production-Ready AI Agents with Microsoft Agent Framework

#azure #ai #python #agents

Meta Description: Learn what an agent harness is, why it matters for production AI systems, and how to implement one step-by-step using Microsoft Agent Framework's create_harness_agent -- with real Python code and deep technical walkthroughs.

Introduction
What is an Agent Harness?
Microsoft Agent Framework Overview
Anatomy of create_harness_agent -- 8 Sub-Systems
Full Implementation Walkthrough
Running and Testing
Production Considerations
Conclusion

1. Introduction

You have a capable LLM. You write a chat loop in 20 lines of Python and it works -- until the context window fills up and the agent loses the thread. Or it calls a tool but forgets the result two turns later. Or it crashes in production with no trace of what went wrong.

This is the hidden complexity tax of AI agents. Every production-grade agent needs: a tool-calling loop, conversation history management, context-window compaction, a planning mechanism, durable memory, skill extensibility, and observability. If you wire each of these yourself, you spend more time on infrastructure than on actual intelligence.

The agent harness pattern solves this by pre-assembling all components into a single, tested, configurable pipeline -- so you focus on what your agent does, not how to keep it running.

In this deep dive you will learn:

What an agent harness is and why it matters
How Microsoft Agent Framework (MAF) implements the harness pattern with create_harness_agent
A component-by-component breakdown of all 8 sub-systems
A complete, production-ready Research Agent from the official MAF repository

2. What is an Agent Harness?

The concept comes from two familiar software patterns: a test harness (configures and tears down system-under-test so test authors write test logic, not setup code) and a DI container (wires and resolves components so application code never calls new directly).

An agent harness applies this to AI agents:

A factory that constructs a fully wired, ready-to-run agent by assembling all required infrastructure components -- history, tools, memory, observability, planning -- from a single configuration point.

The key insight: your agent instructions define what the agent does; the harness defines how that intent is reliably executed, persisted, and observed.

The Manual Wiring Problem

Here is what you would have to build manually:

Component	Manual Responsibility
Tool Calling Loop	Detect tool calls, dispatch, collect results, re-invoke model
History Management	Serialize/deserialize history, decide storage backend
Context Compaction	Monitor token count, evict stale messages, preserve context
Planning / Todo	Design task-tracking schema, prompt model to use it
Mode Management	Track agent state, implement approval gates
Persistent Memory	Write/read durable store, inject into context at the right time
Skills Loading	Progressive skill discovery, filter by relevance
Telemetry	Instrument every model call and tool dispatch

Harness vs. DIY -- Side by Side

Manual tool calling loop (this is just one of eight required pieces):

# DIY: manual tool loop only -- no memory, compaction, or telemetry
async def run_agent_manually(client, tools, messages):
    while True:
        response = await client.chat(messages=messages, tools=tools)
        if response.finish_reason == 'stop':
            return response.content
        if response.finish_reason == 'tool_calls':
            messages.append(response.message)
            for tool_call in response.tool_calls:
                tool_fn = tool_registry.get(tool_call.function.name)
                result = await tool_fn(**json.loads(tool_call.function.arguments))
                messages.append({
                    'role': 'tool',
                    'tool_call_id': tool_call.id,
                    'content': str(result)
                })
            # Still need: token counting, history, todos, telemetry...

Harness equivalent -- full production pipeline:

# Harness: complete pipeline in 4 lines
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential

agent = create_harness_agent(
    client=FoundryChatClient(credential=AzureCliCredential()),
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
)

3. Microsoft Agent Framework Overview

Microsoft Agent Framework (MAF) is Microsoft's open-source, production-grade framework for building AI agents and multi-agent workflows. It is the direct successor to both AutoGen and Semantic Kernel -- combining AutoGen's simple abstractions with Semantic Kernel's enterprise features.

Key capabilities:

Multi-language -- Full parity between Python and C#/.NET
Multi-provider -- Azure AI Foundry, OpenAI, Azure OpenAI, Anthropic, Gemini, Bedrock, Ollama, and more
Multi-pattern -- Single agents, sequential, concurrent, handoff, group-chat, human-in-the-loop
Multi-deployment -- Local dev, Azure Functions, Foundry Hosted Agents, Durable Task

Installation

pip install agent-framework==1.7.0

Agent Runtime Loop

Every agent.run() call executes this 6-phase deterministic loop:

User Message
    |
    v
[ 1. Context Assembly ]  -- History + ContextProviders (Memory, Skills, Mode, Todos)
[ 2. Middleware Pre   ]  -- Telemetry spans open, compaction checks
[ 3. Model Inference  ]  -- FoundryChatClient / OpenAI / Anthropic
[ 4. Tool Dispatch    ]  -- FunctionInvocationLayer (tool call -> execute -> result -> loop)
[ 5. History Persist  ]  -- Saved after every service call
[ 6. Middleware Post  ]  -- Telemetry spans close
    |
    v
Agent Response (streaming or complete)

The harness ensures all six phases are populated and correctly ordered.

4. Anatomy of create_harness_agent -- 8 Sub-Systems

from agent_framework import create_harness_agent

agent = create_harness_agent(
    client=client,                       # Required: LLM backend
    max_context_window_tokens=128_000,   # Required: total context window
    max_output_tokens=16_384,            # Required: reserved for response
    name='MyAgent',                      # Optional: agent identity
    description='What this agent does.',
    agent_instructions='System prompt.',
    disable_todo=False,                  # Toggle: TodoProvider
    disable_mode=False,                  # Toggle: AgentModeProvider
    disable_compaction=False,            # Toggle: CompactionProvider
    memory_store=None,                   # Optional: path for MemoryContextProvider
    skills_paths=None,               # Optional: path for SkillsProvider
    extra_tools=[],                      # Optional: additional tool functions
)

Sub-System 1: Function Invocation Layer

The agentic loop engine. When the model returns tool calls, this layer:

Detects tool call requests
Dispatches to registered tool functions
Collects results and injects them into message history
Re-invokes the model with updated history
Repeats until the model returns a final text response
Enforces a max iteration limit

async def search_database(query: str, limit: int = 10) -> list[dict]:
    '''Search the product database.
    Args:
        query: The search query string.
        limit: Maximum results to return.
    Returns:
        List of matching product records.
    '''
    return results  # Your implementation

# JSON schema is generated from type annotations automatically
agent = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
    extra_tools=[search_database],
)

Sub-System 2: History Persistence

Persists conversation history after every model service call -- not just at the end of a turn. If an agent makes three tool calls and crashes on the third, it resumes from the second successful state.

session = agent.create_session()  # Isolates this conversation

result_1 = await agent.run('Research quantum computing', session=session)
result_2 = await agent.run('Summarize your findings', session=session)
# Turn 2 has full memory of everything from turn 1

Sub-System 3: Compaction

Prevents context-window overflow with two strategies:

Sliding Window -- Prunes older messages when token count approaches budget; system prompt always preserved
Tool Result Compaction -- Summarizes verbose tool outputs when under token pressure

agent = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,  # GPT-4o total window
    max_output_tokens=16_384,           # Reserved for response
    # ~112K available for history; compaction fires before the limit
    disable_compaction=True,            # Or disable to self-manage
)

Sub-System 4: TodoProvider

Gives the agent an explicit task-tracking system it manages itself:

create_todo(title, description) -- Creates a work item
complete_todo(id) -- Marks done
list_todos() -- Gets pending items

User: 'Research the top 5 AI agent frameworks.'

Agent uses TodoProvider internally:
  -> create_todo('Research AutoGen')
  -> create_todo('Research LangGraph')
  -> create_todo('Research MAF')
  -> create_todo('Write comparison table')

  [Executes each autonomously]

  -> complete_todo('Research AutoGen')
  -> complete_todo('Research LangGraph')
  -> ...

Sub-System 5: AgentModeProvider -- Plan/Execute Workflow

Implements a two-phase workflow:

Phase 1 -- Plan Mode (Interactive)
The agent asks clarifying questions, builds a todo list, and waits for human approval. No autonomous work happens until approved.

Phase 2 -- Execute Mode (Autonomous)
After approval, the agent works through todos independently, streams progress, and returns to plan mode if it hits a blocking ambiguity.

This is what makes agents trustworthy -- humans approve the plan before autonomous execution begins.

Sub-System 6: MemoryContextProvider

File-based durable memory that survives compaction and persists across sessions. Previously saved memory is re-injected into context at every context assembly cycle.

from pathlib import Path

memory_dir = Path('./agent_memory')
memory_dir.mkdir(exist_ok=True)

agent = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
    memory_store=str(memory_dir),  # Enable durable memory
)
# Agent can now call memory_write(key, content) and memory_read(key)

Sub-System 7: SkillsProvider

Progressive skill loading from a directory of YAML/JSON skill definitions. The agent sees summaries first, then loads full definitions only when needed -- preserving context budget.

agent = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
    skills_paths='./skills',
)
# skills/web_research.yaml, data_analysis.yaml, code_review.yaml...

Sub-System 8: OpenTelemetry

Full distributed tracing, auto-configured. Every model call, tool invocation, compaction event, and mode switch emits OTLP spans:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Point at Azure Monitor, Jaeger, Grafana Tempo, or Datadog
exporter = OTLPSpanExporter(endpoint='http://localhost:4317')
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Harness picks up the configured tracer automatically
agent = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
)

5. Full Implementation Walkthrough

Step 1: Install and Authenticate

pip install agent-framework
az login

Step 2: Configure Environment

export FOUNDRY_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com/api/projects/your-project
export FOUNDRY_MODEL=gpt-4o

Step 3: Minimal Harness Agent

# minimal_harness.py
import asyncio
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential
from dotenv import load_dotenv
import os

async def main():
    load_dotenv()  # MAF does NOT auto-load .env
    client = FoundryChatClient(credential=AzureCliCredential(),
                               project_endpoint=os.getenv('FOUNDRY_PROJECT_ENDPOINT'),
                               model=os.getenv('FOUNDRY_MODEL'))
    agent = create_harness_agent(
        client=client,
        max_context_window_tokens=128_000,
        max_output_tokens=16_384,
    )
    session = agent.create_session()
    response = await agent.run(
        'What are the latest trends in AI agent frameworks?',
        session=session,
    )
    print(response)

asyncio.run(main())

Step 4: Full Research Agent (Official Sample)

The complete harness_research.py from the official MAF repository:

# harness_research.py
import asyncio
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential
from dotenv import load_dotenv

RESEARCH_INSTRUCTIONS = '''
## Research Assistant Instructions

You are a research assistant. Research topics thoroughly using web search.
Form good search queries but always verify claims with available tools.

### Research quality
Consult multiple sources. Cross-reference key claims.
Track your sources -- you will need them when presenting results.

### Presenting results
- Use Markdown formatting and clear section headings.
- Cite sources inline: According to [source](URL), ...
- End with a summary of key takeaways.
- Save the final report to file memory so it survives compaction.
'''


async def main() -> None:
    load_dotenv()
    client = FoundryChatClient(credential=AzureCliCredential())

    agent = create_harness_agent(
        client=client,
        max_context_window_tokens=128_000,
        max_output_tokens=16_384,
        name='ResearchAgent',
        description='A research assistant that plans and executes research tasks.',
        agent_instructions=RESEARCH_INSTRUCTIONS,
        # All 8 sub-systems active with sensible defaults
    )

    session = agent.create_session()
    print('Research Assistant (powered by create_harness_agent)')
    print('=' * 50)
    print('Enter a research topic. Type /exit to quit.')

    while True:
        user_input = input('You: ').strip()
        if not user_input:
            continue
        if user_input.lower() == '/exit':
            print('Goodbye!')
            break

        print('Assistant: ', end='', flush=True)

        # agent.run(stream=True) returns AsyncGenerator[AgentUpdate]
        # update.text = streaming text fragment
        # update.contents = list of tool calls, search events, etc.
        async for update in agent.run(user_input, session=session, stream=True):
            if update.contents:
                for content in update.contents:
                    if content.type == 'function_call':
                        print(f'  [calling: {content.name}]', flush=True)
                    elif content.type in ('search_tool_call', 'search_tool_result') \
                         and getattr(content, 'tool_name', None) == 'web_search':
                        action = None
                        if content.type == 'search_tool_result' and isinstance(content.result, dict):
                            action = content.result.get('action', {})
                        elif content.type == 'search_tool_call':
                            action = content.arguments if isinstance(content.arguments, dict) else None
                        if action:
                            t = action.get('type', 'search')
                            if t == 'search':
                                q = action.get('query', '')
                                print(f'  [Web search: {q}]', flush=True)
                            elif t == 'open_page':
                                print(f'  [Opening: {action.get("url","")}]', flush=True)
            if update.text:
                print(update.text, end='', flush=True)
        print()


if __name__ == '__main__':
    asyncio.run(main())

Customization Patterns

# Pattern 1: Lean Q&A agent
lean = create_harness_agent(
    client=client,
    max_context_window_tokens=32_000,
    max_output_tokens=4_096,
    disable_todo=True,   # No task tracking
    disable_mode=True,   # No plan/execute
)

# Pattern 2: Research agent with persistent memory
from pathlib import Path
mem = Path('./memory'); mem.mkdir(exist_ok=True)
research = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
    memory_store=str(mem),
)

# Pattern 3: Enterprise agent with custom tools
from agent_framework.tools import get_web_search_tool

async def query_internal_db(query: str) -> list[dict]:
    '''Query internal company database. Args: query: Search query. Returns: records.'''
    return []

enterprise = create_harness_agent(
    client=client,
    max_context_window_tokens=128_000,
    max_output_tokens=16_384,
    extra_tools=[get_web_search_tool(), query_internal_db],
)

6. Running and Testing

az login
python harness_research.py

Expected output:

Research Assistant (powered by create_harness_agent)
==================================================
Enter a research topic. Type /exit to quit.

You: Research AI agent frameworks in 2026

Assistant:
  [calling tool: switch_to_plan_mode]
  [calling tool: create_todo]
  [calling tool: switch_to_execute_mode]

Here is my research plan. I will:
  1. Survey major frameworks (MAF, AutoGen, LangGraph, CrewAI)
  2. Look up recent benchmarks and community activity
  3. Write a comparison table

Shall I proceed? (yes/no)

You: yes

  [Web search: "Microsoft Agent Framework 2026"]
  [Opening: https://github.com/microsoft/agent-framework]
  [Web search: "LangGraph vs AutoGen comparison 2026"]
  [calling tool: complete_todo]
  ...

7. Production Considerations

Token Budget Sizing

Leave 15-20% headroom -- Tokenizers can be imprecise; don't set to the exact model maximum
Research agents -- Use the full window (128K for GPT-4o) for multi-source research
Q&A agents -- 16K-32K is sufficient and reduces cost significantly
Coding agents -- 64K-128K to handle large file contexts

Durable History Backends

For multi-user production deployments, InMemoryHistoryProvider is insufficient -- process restarts lose all history. MAF supports pluggable backends via the IHistoryProvider interface (CosmosDB, Redis, PostgreSQL, etc.).

Security: Prompt Injection Defense

MAF ships FIDES (Flow Integrity Deterministic Enforcement System) as a middleware -- defense against prompt injection (OWASP LLM Top 10 risk #1). FIDES assigns integrity labels (trusted/untrusted) and confidentiality labels (public/private) to every piece of content. Labels propagate automatically; enforcement is deterministic, not heuristic.

Deploying to Azure Foundry Hosted Agents

Foundry Hosted Agents provides containerized Micro VM hosting with built-in identity, autoscaling, managed session state, and versioning. The agent code stays identical -- only the hosting layer changes.

# See the official hosting samples
# https://github.com/microsoft/agent-framework/tree/main/python/samples/04-hosting

Observability in Production

With OpenTelemetry pre-wired by the harness, configure your exporter for production visibility:

Azure Monitor Application Insights -- For teams already in Azure
Grafana + Tempo -- For open-source observability stacks
Datadog / Dynatrace -- For enterprise APM platforms

Key dashboards: token consumption per session, tool call latency distribution, compaction frequency, agent error rates.

8. Conclusion

The agent harness pattern is the difference between an AI demo and a production AI system. It acknowledges a fundamental truth: building the intelligence of an agent is the easy part. Keeping that intelligence reliable, observable, durable, and safe under production load is the hard part -- and the harness handles that for you.

Microsoft Agent Framework's create_harness_agent is the most complete open-source implementation of this pattern available today. In a single factory call it assembles eight battle-tested sub-systems -- function invocation, history persistence, context compaction, todo-based planning, plan/execute mode management, durable file memory, progressive skill loading, and OpenTelemetry instrumentation -- all individually configurable and working in concert.

Key takeaways:

Start with the harness, strip down if needed -- It is always easier to disable features you don't need (disable_todo=True) than to add them later
The Plan/Execute pattern is the right default for task-oriented agents -- it keeps humans in control while enabling genuine autonomy
Telemetry is non-negotiable in production -- the harness gives it for free; configure an exporter on day one
The official GitHub sample is your north star -- harness_research.py is production-quality code worth reading end to end

Get started now:

pip install agent-framework
az login
python harness_research.py

Explore the full framework at github.com/microsoft/agent-framework, join the community on Discord, and follow the latest on the official blog.

The infrastructure is handled. Go build the intelligence.

All code samples are sourced from or based on the official Microsoft Agent Framework repository (MIT License).

Top comments (1)

Harjot Singh • May 31

Glad to see "harness" entering the mainstream vocabulary - a year ago everyone said "agent" and meant the model; now people are realizing the production-readiness lives entirely in the harness around it (retries, tool schemas, state, guardrails, observability). Microsoft Agent Framework formalizing that is a good sign the industry's growing up past the demo phase.

The thing I'd flag from building one: a framework gives you the primitives, but the opinionated decisions - how context flows between steps, when to verify, which model handles which task - are still yours to make, and they're where the real quality/cost lives. In Moonshift (a multi-agent pipeline: prompt to a shipped SaaS on your own GitHub + Vercel) the harness is opinionated on exactly those: plan-as-contract, verification gates, and routing that keeps a full build ~$3 flat. First run's free, no card. Solid explainer - how does the MS framework handle model routing, if at all? That's the piece I find most frameworks leave to you, and it's the biggest cost lever.

DEV Community