Originally published at wanderclan.eu

Why I Replaced My AI Assistant With an Orchestra

You know that moment when your AI assistant loses the plot halfway through a complex task? You asked it to research a topic, draft a document, update a repo, and notify your team — and somewhere around step three it forgot what it was doing, hallucinated a file path, and burned through $2 of tokens producing nothing useful.

I lived there for months. Then I stopped asking one agent to do everything and started orchestrating many. Here's what I learned.

The Single-Agent Ceiling

A single LLM agent hits three walls fast:

Context windows are a leash. Even with 200K tokens, a complex task that involves reading codebases, API docs, and prior conversation history fills up. The model starts dropping details. You can feel it getting dumber as the context grows.

Generalists underperform specialists. A single prompt carrying instructions for research, code generation, writing, and tool usage is like asking one person to be the intern, the senior engineer, and the project manager simultaneously. The system prompt bloats, the model hedges, and quality drops across the board.

Cost scales badly. Every token of context is re-processed on every completion. A 150K-token conversation where you need a 200-token answer still bills you for 150K input tokens. Multiply that by iterative tasks, and your bill becomes a problem.
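To put numbers on it, a quick back-of-the-envelope sketch (the per-token price here is an illustrative placeholder, not any provider's actual rate):

# Cost of an iterative conversation where every turn re-sends the
# full context. The price below is illustrative only.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000   # assume $3 per million input tokens

context_tokens = 150_000   # accumulated history, re-billed every turn
turns = 20                 # iterative back-and-forth on one task

total_input_tokens = context_tokens * turns
print(f"${total_input_tokens * PRICE_PER_INPUT_TOKEN:.2f}")   # $9.00 of pure re-reading

And a fixed-size context is the generous case; in practice it grows every turn, which only makes the bill worse.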

These aren't theoretical limits. They're the daily reality of anyone building with LLM agents beyond toy demos.

The Orchestra Model

The fix is the same one humans discovered millennia ago: specialisation and coordination.

Instead of one agent, you run several:

┌─────────────────────────────────┐
│         Conductor Agent         │
│   (orchestrates, delegates,     │
│    synthesises results)         │
└──────┬──────┬──────┬───────────┘
       │      │      │
  ┌────▼──┐ ┌─▼───┐ ┌▼────────┐
  │Research│ │Code │ │Comms    │
  │Agent   │ │Agent│ │Agent    │
  └───────┘ └─────┘ └─────────┘

The conductor agent receives the user's intent, breaks it into subtasks, spawns specialist agents, and synthesises their results. Each specialist runs in its own session with a focused system prompt, a clean context window, and only the tools it needs.

This isn't a new idea. It's microservices, applied to cognition.

How Agents Actually Talk

The practical architecture matters more than the metaphor. Here's what works:

Spawn-and-report

The conductor spawns a sub-agent with a task description. The sub-agent runs independently, completes its work, and its final output is reported back to the conductor. No polling loops. No shared memory bus. Just fire-and-forget with a callback.

# Pseudocode: conductor spawning specialists.
# decompose() and spawn_agent() are illustrative helpers,
# not any specific framework's API.
tasks = decompose(user_request)

for task in tasks:
    spawn_agent(
        task=task.description,
        model=task.preferred_model,   # cheap model for simple tasks
        label=task.name,
        timeout=300,                  # seconds before giving up on the sub-agent
    )

# Results arrive asynchronously via callbacks;
# the conductor synthesises once all sub-agents report back.

Shared workspace, not shared context

Agents don't share context windows — that would defeat the purpose. Instead, they share a filesystem. Agent A writes research to /workspace/research/topic.md. Agent B reads it. The workspace is the integration layer.

This is deliberately low-tech. Files are debuggable. You can inspect what any agent produced. There's no opaque message-passing protocol to reverse-engineer when things go wrong.
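A minimal sketch of that handoff, using nothing but the standard library (paths mirror the example above):

from pathlib import Path

WORKSPACE = Path("/workspace")

# Agent A (researcher): writes findings as a plain markdown file.
notes = WORKSPACE / "research" / "topic.md"
notes.parent.mkdir(parents=True, exist_ok=True)
notes.write_text("# Topic\n\n- finding 1\n- finding 2\n")

# Agent B (engineer), in a completely separate session later:
research = (WORKSPACE / "research" / "topic.md").read_text()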

Context handoff via task descriptions

When the conductor spawns a sub-agent, it passes a focused task description containing only what that agent needs. Not the entire conversation history. Not every file in the workspace. Just: "Here's what I need you to do, here's the relevant context, go."

spawn_agent(
    task="""
    Research the current state of WebTransport browser support.
    Save findings to /workspace/research/webtransport-support.md
    Include: browser versions, known limitations, polyfill options.
    """,
    model="claude-sonnet-4-20250514"  # fast + cheap for research
)

The sub-agent gets a clean 0-token conversation, a focused mission, and returns a result. Its context window is 100% dedicated to the task.

Why Specialists Win

A "researcher" agent and an "engineer" agent outperform one generalist for the same reason a DBA and a frontend developer outperform one full-stack developer asked to do both simultaneously:

Focused system prompts. The researcher agent's prompt says: "You find information, evaluate sources, and produce structured summaries. You do not write code." The engineer agent's prompt says: "You write, test, and commit code. You do not conduct open-ended research." Each agent is better at its job because it's not trying to be good at everything.
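For illustration, that split might look like this (spawn_agent and its system parameter are the same kind of hypothetical helper as in the earlier pseudocode):

RESEARCHER_PROMPT = (
    "You find information, evaluate sources, and produce "
    "structured summaries. You do not write code."
)
ENGINEER_PROMPT = (
    "You write, test, and commit code. "
    "You do not conduct open-ended research."
)

# Each specialist is spawned with its own narrow system prompt.
spawn_agent(task="Summarise WebTransport support", system=RESEARCHER_PROMPT)
spawn_agent(task="Add a WebTransport fallback path", system=ENGINEER_PROMPT)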

Right-sized models. Not every task needs your most expensive model. Research and summarisation? A fast, cheap model handles it. Complex architectural decisions? Route that to the heavy hitter. Multi-agent lets you match model capability to task complexity.

Task                    Model              Cost
─────────────────────────────────────────────────
Research & summarise    Sonnet             $
Code review             Opus               $$$
Draft email             Haiku              ¢
Complex refactor        Opus + thinking    $$$$
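In code, the routing can be as dumb as a lookup table. A sketch (model names are placeholders echoing the table above; the mapping itself is a judgment call you tune to your workloads):

# Route each task type to the cheapest model that handles it well.
MODEL_BY_TASK = {
    "research": "sonnet",         # fast + cheap summarisation
    "draft_email": "haiku",       # trivial generation
    "code_review": "opus",        # needs the heavy hitter
    "complex_refactor": "opus",   # plus extended thinking, if available
}

def pick_model(task_type: str) -> str:
    # Unknown task types default to the cheap model, not the expensive one.
    return MODEL_BY_TASK.get(task_type, "sonnet")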

Parallel execution. While the researcher is reading docs, the engineer can be setting up scaffolding. The conductor doesn't wait for sequential completion — it fans out work and collects results.
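A sketch of that fan-out with Python's standard library, where tasks comes from the decompose step earlier and run_agent stands in for a blocking call that drives one specialist session to completion:

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent(task):
    """Placeholder: run one specialist session and return its output."""
    ...

# Fan out: every independent subtask starts immediately.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_agent, t): t for t in tasks}
    results = {}
    for future in as_completed(futures):
        results[futures[future].name] = future.result()

# The conductor synthesises once everything has landed.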

Memory and Continuity

Agents are stateless by default. Every session starts from zero. That's a feature for isolation but a problem for continuity. Here's what bridges the gap:

Daily memory files. Each day gets a memory/YYYY-MM-DD.md file with raw notes — what happened, what was decided, what's pending. Agents read recent files at session start to rebuild context.

Long-term memory. A curated MEMORY.md file acts as distilled, long-term memory. It's not a log — it's the important stuff. Periodically, an agent reviews daily files and promotes insights to long-term memory.

Workspace as state. The most reliable "memory" is just the filesystem. Code that was committed, documents that were written, configs that were changed — these persist naturally. Agents don't need to "remember" what they did if the artefacts are right there.

workspace/
├── MEMORY.md              # Long-term curated memory
├── memory/
│   ├── 2026-02-15.md      # Yesterday's notes
│   └── 2026-02-16.md      # Today's notes
├── drafts/                # Work in progress
├── research/              # Research outputs
└── projects/              # Active project files

This is intentionally simple. The fancier your memory system, the more ways it breaks.
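At session start, rebuilding context from those files can be a few lines (paths match the tree above; the two-day window is an arbitrary choice):

from pathlib import Path

def load_memory(workspace: Path, recent_days: int = 2) -> str:
    """Concatenate long-term memory with the most recent daily notes."""
    parts = [(workspace / "MEMORY.md").read_text()]
    for note in sorted((workspace / "memory").glob("*.md"))[-recent_days:]:
        parts.append(note.read_text())   # e.g. yesterday + today
    return "\n\n---\n\n".join(parts)

# Prepend this to the agent's first message to rebuild context.
context = load_memory(Path("workspace"))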

Tool Integration: Agents That DO Things

The difference between a chatbot and an agent is that an agent has hands. A well-integrated multi-agent system connects to:

  • Git/GitHub/GitLab — commit code, open PRs, review changes
  • Email & calendar — read inbox, send messages, check schedules
  • Databases & APIs — query data, update records, trigger workflows
  • File systems — read, write, organise, search
  • Browsers — navigate, scrape, fill forms, interact with web apps

Each specialist agent gets only the tools relevant to its role. The comms agent gets email and calendar. The engineer gets git and the shell. The researcher gets web search and fetch. Least privilege, applied to AI.
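A sketch of that assignment (tool names are illustrative, not any specific framework's registry):

# Least privilege: each role sees only its own tools.
TOOLS_BY_ROLE = {
    "researcher": ["web_search", "web_fetch", "fs_read"],
    "engineer":   ["git", "shell", "fs_read", "fs_write"],
    "comms":      ["email_send", "email_read", "calendar"],
}

def tools_for(role: str) -> list[str]:
    # Unknown roles get nothing, not everything.
    return TOOLS_BY_ROLE.get(role, [])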

The Real Challenges

Multi-agent isn't magic. Here's what actually goes wrong:

Coordination overhead. The conductor agent consumes tokens just deciding what to delegate. For simple tasks, the overhead exceeds the benefit. If your task fits comfortably in one context window, a single agent is faster and cheaper.

Error propagation. Agent A produces flawed research. Agent B builds on it. Agent C ships it. Without validation at each handoff, errors compound. You need the conductor to sanity-check intermediate results, which adds cost and latency.
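One cheap mitigation is a validation gate at each handoff. A sketch, where validate_with_model is a hypothetical quick pass by a small model and spawn_agent is the helper from earlier:

def handoff(artifact_path, next_task):
    """Gate a handoff: refuse to pass obviously broken output downstream."""
    content = open(artifact_path).read()

    # Cheap structural checks first: empty or truncated artefacts.
    if not content.strip():
        raise ValueError(f"{artifact_path} is empty; not handing off")

    # Then a quick review by a small, cheap model (hypothetical helper).
    verdict = validate_with_model(content, criteria=next_task)
    if not verdict.ok:
        raise ValueError(f"handoff blocked: {verdict.reason}")

    return spawn_agent(task=next_task)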

Cost management. More agents = more API calls. Parallel execution is faster but not cheaper. You need monitoring, budgets, and the discipline to use cheap models where they suffice.

Debugging is harder. When something goes wrong in a multi-agent run, you're tracing through multiple sessions, multiple context windows, and async handoffs. Good logging and a file-based workspace help, but it's still more complex than debugging one conversation.

When Multi-Agent Is Overkill

Don't use it for:

  • Single-step tasks (answer a question, write a function, summarise a doc)
  • Tasks that fit comfortably in one context window
  • Prototyping and exploration where you need tight iteration loops
  • Anything where latency matters more than quality

Do use it for:

  • Multi-step workflows spanning research → implementation → review → delivery
  • Tasks requiring different skill profiles (writing + coding + data analysis)
  • Long-running background work where you want parallel execution
  • Workloads where cost optimisation via model routing matters
  • Anything a single agent keeps failing at due to context limits

The heuristic is simple: if you find yourself copy-pasting between AI conversations to move context around, you need orchestration.

The Practical Takeaway

Multi-agent AI orchestration isn't about building something impressive. It's about recognising that the same principles that make software teams effective — specialisation, clear interfaces, focused scope, shared artefacts — apply to AI systems too.

Start with one conductor and two specialists. Give them a shared workspace. Let the conductor decompose tasks and route them. See what breaks. Fix it. Add agents as you find genuine specialisation boundaries.

The orchestra metaphor works because it captures the essential insight: the conductor doesn't play every instrument. It doesn't need to. It needs to know what each instrument does, when it should play, and how to bring them together into something coherent.

Your AI doesn't need to be smarter. It needs collaborators.
