Meta Description: Learn what an agent harness is, why it matters for production AI systems, and how to implement one step-by-step using Microsoft Agent Framework's
create_harness_agent-- with real Python code and deep technical walkthroughs.
Table of Contents
- Introduction
- What is an Agent Harness?
- Microsoft Agent Framework Overview
- Anatomy of create_harness_agent -- 8 Sub-Systems
- Full Implementation Walkthrough
- Running and Testing
- Production Considerations
- Conclusion
1. Introduction
You have a capable LLM. You write a chat loop in 20 lines of Python and it works -- until the context window fills up and the agent loses the thread. Or it calls a tool but forgets the result two turns later. Or it crashes in production with no trace of what went wrong.
This is the hidden complexity tax of AI agents. Every production-grade agent needs: a tool-calling loop, conversation history management, context-window compaction, a planning mechanism, durable memory, skill extensibility, and observability. If you wire each of these yourself, you spend more time on infrastructure than on actual intelligence.
The agent harness pattern solves this by pre-assembling all components into a single, tested, configurable pipeline -- so you focus on what your agent does, not how to keep it running.
In this deep dive you will learn:
- What an agent harness is and why it matters
- How Microsoft Agent Framework (MAF) implements the harness pattern with
create_harness_agent - A component-by-component breakdown of all 8 sub-systems
- A complete, production-ready Research Agent from the official MAF repository
2. What is an Agent Harness?
The concept comes from two familiar software patterns: a test harness (configures and tears down system-under-test so test authors write test logic, not setup code) and a DI container (wires and resolves components so application code never calls new directly).
An agent harness applies this to AI agents:
A factory that constructs a fully wired, ready-to-run agent by assembling all required infrastructure components -- history, tools, memory, observability, planning -- from a single configuration point.
The key insight: your agent instructions define what the agent does; the harness defines how that intent is reliably executed, persisted, and observed.
The Manual Wiring Problem
Here is what you would have to build manually:
| Component | Manual Responsibility |
|---|---|
| Tool Calling Loop | Detect tool calls, dispatch, collect results, re-invoke model |
| History Management | Serialize/deserialize history, decide storage backend |
| Context Compaction | Monitor token count, evict stale messages, preserve context |
| Planning / Todo | Design task-tracking schema, prompt model to use it |
| Mode Management | Track agent state, implement approval gates |
| Persistent Memory | Write/read durable store, inject into context at the right time |
| Skills Loading | Progressive skill discovery, filter by relevance |
| Telemetry | Instrument every model call and tool dispatch |
Harness vs. DIY -- Side by Side
Manual tool calling loop (this is just one of eight required pieces):
# DIY: manual tool loop only -- no memory, compaction, or telemetry
async def run_agent_manually(client, tools, messages):
while True:
response = await client.chat(messages=messages, tools=tools)
if response.finish_reason == 'stop':
return response.content
if response.finish_reason == 'tool_calls':
messages.append(response.message)
for tool_call in response.tool_calls:
tool_fn = tool_registry.get(tool_call.function.name)
result = await tool_fn(**json.loads(tool_call.function.arguments))
messages.append({
'role': 'tool',
'tool_call_id': tool_call.id,
'content': str(result)
})
# Still need: token counting, history, todos, telemetry...
Harness equivalent -- full production pipeline:
# Harness: complete pipeline in 4 lines
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential
agent = create_harness_agent(
client=FoundryChatClient(credential=AzureCliCredential()),
max_context_window_tokens=128_000,
max_output_tokens=16_384,
)
3. Microsoft Agent Framework Overview
Microsoft Agent Framework (MAF) is Microsoft's open-source, production-grade framework for building AI agents and multi-agent workflows. It is the direct successor to both AutoGen and Semantic Kernel -- combining AutoGen's simple abstractions with Semantic Kernel's enterprise features.
Key capabilities:
- Multi-language -- Full parity between Python and C#/.NET
- Multi-provider -- Azure AI Foundry, OpenAI, Azure OpenAI, Anthropic, Gemini, Bedrock, Ollama, and more
- Multi-pattern -- Single agents, sequential, concurrent, handoff, group-chat, human-in-the-loop
- Multi-deployment -- Local dev, Azure Functions, Foundry Hosted Agents, Durable Task
Installation
pip install agent-framework==1.7.0
Agent Runtime Loop
Every agent.run() call executes this 6-phase deterministic loop:
User Message
|
v
[ 1. Context Assembly ] -- History + ContextProviders (Memory, Skills, Mode, Todos)
[ 2. Middleware Pre ] -- Telemetry spans open, compaction checks
[ 3. Model Inference ] -- FoundryChatClient / OpenAI / Anthropic
[ 4. Tool Dispatch ] -- FunctionInvocationLayer (tool call -> execute -> result -> loop)
[ 5. History Persist ] -- Saved after every service call
[ 6. Middleware Post ] -- Telemetry spans close
|
v
Agent Response (streaming or complete)
The harness ensures all six phases are populated and correctly ordered.
4. Anatomy of create_harness_agent -- 8 Sub-Systems
from agent_framework import create_harness_agent
agent = create_harness_agent(
client=client, # Required: LLM backend
max_context_window_tokens=128_000, # Required: total context window
max_output_tokens=16_384, # Required: reserved for response
name='MyAgent', # Optional: agent identity
description='What this agent does.',
agent_instructions='System prompt.',
disable_todo=False, # Toggle: TodoProvider
disable_mode=False, # Toggle: AgentModeProvider
disable_compaction=False, # Toggle: CompactionProvider
memory_store=None, # Optional: path for MemoryContextProvider
skills_paths=None, # Optional: path for SkillsProvider
extra_tools=[], # Optional: additional tool functions
)
Sub-System 1: Function Invocation Layer
The agentic loop engine. When the model returns tool calls, this layer:
- Detects tool call requests
- Dispatches to registered tool functions
- Collects results and injects them into message history
- Re-invokes the model with updated history
- Repeats until the model returns a final text response
- Enforces a max iteration limit
async def search_database(query: str, limit: int = 10) -> list[dict]:
'''Search the product database.
Args:
query: The search query string.
limit: Maximum results to return.
Returns:
List of matching product records.
'''
return results # Your implementation
# JSON schema is generated from type annotations automatically
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
extra_tools=[search_database],
)
Sub-System 2: History Persistence
Persists conversation history after every model service call -- not just at the end of a turn. If an agent makes three tool calls and crashes on the third, it resumes from the second successful state.
session = agent.create_session() # Isolates this conversation
result_1 = await agent.run('Research quantum computing', session=session)
result_2 = await agent.run('Summarize your findings', session=session)
# Turn 2 has full memory of everything from turn 1
Sub-System 3: Compaction
Prevents context-window overflow with two strategies:
- Sliding Window -- Prunes older messages when token count approaches budget; system prompt always preserved
- Tool Result Compaction -- Summarizes verbose tool outputs when under token pressure
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000, # GPT-4o total window
max_output_tokens=16_384, # Reserved for response
# ~112K available for history; compaction fires before the limit
disable_compaction=True, # Or disable to self-manage
)
Sub-System 4: TodoProvider
Gives the agent an explicit task-tracking system it manages itself:
-
create_todo(title, description)-- Creates a work item -
complete_todo(id)-- Marks done -
list_todos()-- Gets pending items
User: 'Research the top 5 AI agent frameworks.'
Agent uses TodoProvider internally:
-> create_todo('Research AutoGen')
-> create_todo('Research LangGraph')
-> create_todo('Research MAF')
-> create_todo('Write comparison table')
[Executes each autonomously]
-> complete_todo('Research AutoGen')
-> complete_todo('Research LangGraph')
-> ...
Sub-System 5: AgentModeProvider -- Plan/Execute Workflow
Implements a two-phase workflow:
Phase 1 -- Plan Mode (Interactive)
The agent asks clarifying questions, builds a todo list, and waits for human approval. No autonomous work happens until approved.
Phase 2 -- Execute Mode (Autonomous)
After approval, the agent works through todos independently, streams progress, and returns to plan mode if it hits a blocking ambiguity.
This is what makes agents trustworthy -- humans approve the plan before autonomous execution begins.
Sub-System 6: MemoryContextProvider
File-based durable memory that survives compaction and persists across sessions. Previously saved memory is re-injected into context at every context assembly cycle.
from pathlib import Path
memory_dir = Path('./agent_memory')
memory_dir.mkdir(exist_ok=True)
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
memory_store=str(memory_dir), # Enable durable memory
)
# Agent can now call memory_write(key, content) and memory_read(key)
Sub-System 7: SkillsProvider
Progressive skill loading from a directory of YAML/JSON skill definitions. The agent sees summaries first, then loads full definitions only when needed -- preserving context budget.
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
skills_paths='./skills',
)
# skills/web_research.yaml, data_analysis.yaml, code_review.yaml...
Sub-System 8: OpenTelemetry
Full distributed tracing, auto-configured. Every model call, tool invocation, compaction event, and mode switch emits OTLP spans:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Point at Azure Monitor, Jaeger, Grafana Tempo, or Datadog
exporter = OTLPSpanExporter(endpoint='http://localhost:4317')
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Harness picks up the configured tracer automatically
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
)
5. Full Implementation Walkthrough
Step 1: Install and Authenticate
pip install agent-framework
az login
Step 2: Configure Environment
export FOUNDRY_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com/api/projects/your-project
export FOUNDRY_MODEL=gpt-4o
Step 3: Minimal Harness Agent
# minimal_harness.py
import asyncio
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential
from dotenv import load_dotenv
import os
async def main():
load_dotenv() # MAF does NOT auto-load .env
client = FoundryChatClient(credential=AzureCliCredential(),
project_endpoint=os.getenv('FOUNDRY_PROJECT_ENDPOINT'),
model=os.getenv('FOUNDRY_MODEL'))
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
)
session = agent.create_session()
response = await agent.run(
'What are the latest trends in AI agent frameworks?',
session=session,
)
print(response)
asyncio.run(main())
Step 4: Full Research Agent (Official Sample)
The complete harness_research.py from the official MAF repository:
# harness_research.py
import asyncio
from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential
from dotenv import load_dotenv
RESEARCH_INSTRUCTIONS = '''
## Research Assistant Instructions
You are a research assistant. Research topics thoroughly using web search.
Form good search queries but always verify claims with available tools.
### Research quality
Consult multiple sources. Cross-reference key claims.
Track your sources -- you will need them when presenting results.
### Presenting results
- Use Markdown formatting and clear section headings.
- Cite sources inline: According to [source](URL), ...
- End with a summary of key takeaways.
- Save the final report to file memory so it survives compaction.
'''
async def main() -> None:
load_dotenv()
client = FoundryChatClient(credential=AzureCliCredential())
agent = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
name='ResearchAgent',
description='A research assistant that plans and executes research tasks.',
agent_instructions=RESEARCH_INSTRUCTIONS,
# All 8 sub-systems active with sensible defaults
)
session = agent.create_session()
print('Research Assistant (powered by create_harness_agent)')
print('=' * 50)
print('Enter a research topic. Type /exit to quit.')
while True:
user_input = input('You: ').strip()
if not user_input:
continue
if user_input.lower() == '/exit':
print('Goodbye!')
break
print('Assistant: ', end='', flush=True)
# agent.run(stream=True) returns AsyncGenerator[AgentUpdate]
# update.text = streaming text fragment
# update.contents = list of tool calls, search events, etc.
async for update in agent.run(user_input, session=session, stream=True):
if update.contents:
for content in update.contents:
if content.type == 'function_call':
print(f' [calling: {content.name}]', flush=True)
elif content.type in ('search_tool_call', 'search_tool_result') \
and getattr(content, 'tool_name', None) == 'web_search':
action = None
if content.type == 'search_tool_result' and isinstance(content.result, dict):
action = content.result.get('action', {})
elif content.type == 'search_tool_call':
action = content.arguments if isinstance(content.arguments, dict) else None
if action:
t = action.get('type', 'search')
if t == 'search':
q = action.get('query', '')
print(f' [Web search: {q}]', flush=True)
elif t == 'open_page':
print(f' [Opening: {action.get("url","")}]', flush=True)
if update.text:
print(update.text, end='', flush=True)
print()
if __name__ == '__main__':
asyncio.run(main())
Customization Patterns
# Pattern 1: Lean Q&A agent
lean = create_harness_agent(
client=client,
max_context_window_tokens=32_000,
max_output_tokens=4_096,
disable_todo=True, # No task tracking
disable_mode=True, # No plan/execute
)
# Pattern 2: Research agent with persistent memory
from pathlib import Path
mem = Path('./memory'); mem.mkdir(exist_ok=True)
research = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
memory_store=str(mem),
)
# Pattern 3: Enterprise agent with custom tools
from agent_framework.tools import get_web_search_tool
async def query_internal_db(query: str) -> list[dict]:
'''Query internal company database. Args: query: Search query. Returns: records.'''
return []
enterprise = create_harness_agent(
client=client,
max_context_window_tokens=128_000,
max_output_tokens=16_384,
extra_tools=[get_web_search_tool(), query_internal_db],
)
6. Running and Testing
az login
python harness_research.py
Expected output:
Research Assistant (powered by create_harness_agent)
==================================================
Enter a research topic. Type /exit to quit.
You: Research AI agent frameworks in 2026
Assistant:
[calling tool: switch_to_plan_mode]
[calling tool: create_todo]
[calling tool: switch_to_execute_mode]
Here is my research plan. I will:
1. Survey major frameworks (MAF, AutoGen, LangGraph, CrewAI)
2. Look up recent benchmarks and community activity
3. Write a comparison table
Shall I proceed? (yes/no)
You: yes
[Web search: "Microsoft Agent Framework 2026"]
[Opening: https://github.com/microsoft/agent-framework]
[Web search: "LangGraph vs AutoGen comparison 2026"]
[calling tool: complete_todo]
...
7. Production Considerations
Token Budget Sizing
- Leave 15-20% headroom -- Tokenizers can be imprecise; don't set to the exact model maximum
- Research agents -- Use the full window (128K for GPT-4o) for multi-source research
- Q&A agents -- 16K-32K is sufficient and reduces cost significantly
- Coding agents -- 64K-128K to handle large file contexts
Durable History Backends
For multi-user production deployments, InMemoryHistoryProvider is insufficient -- process restarts lose all history. MAF supports pluggable backends via the IHistoryProvider interface (CosmosDB, Redis, PostgreSQL, etc.).
Security: Prompt Injection Defense
MAF ships FIDES (Flow Integrity Deterministic Enforcement System) as a middleware -- defense against prompt injection (OWASP LLM Top 10 risk #1). FIDES assigns integrity labels (trusted/untrusted) and confidentiality labels (public/private) to every piece of content. Labels propagate automatically; enforcement is deterministic, not heuristic.
Deploying to Azure Foundry Hosted Agents
Foundry Hosted Agents provides containerized Micro VM hosting with built-in identity, autoscaling, managed session state, and versioning. The agent code stays identical -- only the hosting layer changes.
# See the official hosting samples
# https://github.com/microsoft/agent-framework/tree/main/python/samples/04-hosting
Observability in Production
With OpenTelemetry pre-wired by the harness, configure your exporter for production visibility:
- Azure Monitor Application Insights -- For teams already in Azure
- Grafana + Tempo -- For open-source observability stacks
- Datadog / Dynatrace -- For enterprise APM platforms
Key dashboards: token consumption per session, tool call latency distribution, compaction frequency, agent error rates.
8. Conclusion
The agent harness pattern is the difference between an AI demo and a production AI system. It acknowledges a fundamental truth: building the intelligence of an agent is the easy part. Keeping that intelligence reliable, observable, durable, and safe under production load is the hard part -- and the harness handles that for you.
Microsoft Agent Framework's create_harness_agent is the most complete open-source implementation of this pattern available today. In a single factory call it assembles eight battle-tested sub-systems -- function invocation, history persistence, context compaction, todo-based planning, plan/execute mode management, durable file memory, progressive skill loading, and OpenTelemetry instrumentation -- all individually configurable and working in concert.
Key takeaways:
-
Start with the harness, strip down if needed -- It is always easier to disable features you don't need (
disable_todo=True) than to add them later - The Plan/Execute pattern is the right default for task-oriented agents -- it keeps humans in control while enabling genuine autonomy
- Telemetry is non-negotiable in production -- the harness gives it for free; configure an exporter on day one
-
The official GitHub sample is your north star --
harness_research.pyis production-quality code worth reading end to end
Get started now:
pip install agent-framework
az login
python harness_research.py
Explore the full framework at github.com/microsoft/agent-framework, join the community on Discord, and follow the latest on the official blog.
The infrastructure is handled. Go build the intelligence.
All code samples are sourced from or based on the official Microsoft Agent Framework repository (MIT License).
Top comments (1)
Glad to see "harness" entering the mainstream vocabulary - a year ago everyone said "agent" and meant the model; now people are realizing the production-readiness lives entirely in the harness around it (retries, tool schemas, state, guardrails, observability). Microsoft Agent Framework formalizing that is a good sign the industry's growing up past the demo phase.
The thing I'd flag from building one: a framework gives you the primitives, but the opinionated decisions - how context flows between steps, when to verify, which model handles which task - are still yours to make, and they're where the real quality/cost lives. In Moonshift (a multi-agent pipeline: prompt to a shipped SaaS on your own GitHub + Vercel) the harness is opinionated on exactly those: plan-as-contract, verification gates, and routing that keeps a full build ~$3 flat. First run's free, no card. Solid explainer - how does the MS framework handle model routing, if at all? That's the piece I find most frameworks leave to you, and it's the biggest cost lever.