Juan Saez

Posted on Jun 10

Why Your Multi-Turn AI Agents Lose Their Train of Thought (And How to Fix It)

#agents #ai #architecture #llm

1. The Agent That Forgot Everything

I have an agent that clarifies requirements. I give it a problem, it asks questions, I answer, it refines, and after three or four rounds it should have a spec ready. Simple.

Round one works fine. It asks reasonable questions. I answer. But when I ask it to continue — same session, same agent, next step in the pipeline — the clarifier starts from scratch. It repeats questions I already answered. It ignores constraints we already agreed on. Sometimes it contradicts its own analysis from five minutes ago.

This isn't an LLM bug. It's an architecture problem. Claude Code, OpenCode, and pretty much any coding agent that delegates work to subagents share the same default: every invocation of a subagent, even the same one continuing a conversation, creates a brand new session unless a session ID is passed explicitly. No history. No context. No memory of what that subagent already thought, asked, or decided.

The good news: the infrastructure to fix this already exists in these tools. It's just that nobody uses it by default.

2. How Coding Agents Delegate

The pattern is the same everywhere. A main agent receives your prompt, reasons about it, and when it needs specialized help, it calls a delegation tool — typically named task().

// OpenCode — tool/task.ts, lines 144-155
const session = taskID
  ? yield* sessions.get(SessionID.make(taskID))
      .pipe(Effect.catchCause(() => Effect.succeed(undefined)))
  : undefined

const nextSession =
  session ??                    // If it exists, reuse it
  (yield* sessions.create({     // Otherwise, create a new one
    parentID: ctx.sessionID,
    title: params.description + ` (@${next.name} subagent)`,
    permission: [...],
  }))

Here's the flow: the LLM decides to call task(), the system looks up the subagent definition (permissions, model, system prompt), reuses the session identified by the task ID if one was passed and still exists, otherwise creates a new session with parentID pointing to the main session, and kicks off an independent LLM loop. The subagent does its work, returns the result, and the main agent continues.

OpenCode has the task_id parameter for exactly this: pass it and the existing session is resumed; omit it and the system creates a new one. Claude Code exposes resume: sessionId in its SDK following the same pattern: pass the session ID and the agent resumes with full history; omit it and a new session is created. And since the main agent, the one calling task(), keeps no record of which task_id it used last time, the default wins every time.

The subagent gets called, works, finishes. The next time you need it — even if it's the exact same agent to continue the exact same conversation — it's born with no past.
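The default described above can be reproduced with a toy model. The sketch below is not OpenCode's or Claude Code's code: AgentSessionStore, runTask, and the message shape are hypothetical names. The only behavior it takes from the article is that a missing task_id yields a fresh session, while a known one reuses the stored history.

```typescript
// Toy model of resume-or-create delegation. All names are hypothetical and
// only illustrate the behavior described in the article.
type ChatMessage = { role: "user" | "assistant"; text: string };

interface AgentSession {
  id: string;
  parentID: string;
  messages: ChatMessage[];
}

class AgentSessionStore {
  private sessions = new Map<string, AgentSession>();
  private counter = 0;

  get(id: string): AgentSession | undefined {
    return this.sessions.get(id);
  }

  create(parentID: string): AgentSession {
    this.counter += 1;
    const session: AgentSession = { id: `ses_${this.counter}`, parentID, messages: [] };
    this.sessions.set(session.id, session);
    return session;
  }
}

// Mirrors the `session ?? create(...)` logic: reuse the session when the ID
// resolves, otherwise start a new, empty one.
function runTask(
  store: AgentSessionStore,
  parentID: string,
  prompt: string,
  taskID?: string,
): { taskID: string; historySeen: number } {
  const existing = taskID ? store.get(taskID) : undefined;
  const session = existing ?? store.create(parentID);
  const historySeen = session.messages.length; // what the subagent can "remember"
  session.messages.push({ role: "user", text: prompt });
  session.messages.push({ role: "assistant", text: `reply to: ${prompt}` });
  return { taskID: session.id, historySeen };
}

const sessionStore = new AgentSessionStore();
const first = runTask(sessionStore, "ses_main", "Clarify the requirements");
const forgotten = runTask(sessionStore, "ses_main", "Continue"); // no task_id: fresh session
const resumed = runTask(sessionStore, "ses_main", "Continue", first.taskID); // explicit resume

console.log(forgotten.historySeen); // 0
console.log(resumed.historySeen); // 2
```

The second call is the failure mode from section 1: same agent, same conversation, empty history. The third call differs only in passing the ID back.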

3. What Actually Happens to the Session

Here's what I found when I dug into OpenCode's source code: the infrastructure already persists everything. The problem isn't technical — it's a design choice.

Finding 1: Sessions store EVERYTHING.

OpenCode persists sessions in SQLite. Every message, every tool call, every output, every step of reasoning gets recorded:

SessionTable         MessageTable          PartTable
────────────         ────────────          ─────────
id (PK)              id (PK)               id (PK)
parent_id (FK→self)  session_id (FK)       message_id (FK)
title                data (JSON)           session_id (FK)
agent                                      data (JSON)
time_created                               ↑ text | tool | reasoning
time_updated                               ↑ snapshots | patches

When a subagent executes 15 tool calls, analyzes 8 files, and produces a 500-word response — all of it lands on disk. And it stays there.
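To make the relationship between the three tables concrete, here is a small in-memory sketch. It is not OpenCode's schema code: the row types copy only the column names from the diagram, loadHistory is a hypothetical helper, the sample rows are invented, and the sketch assumes rows come back in insertion order, which the diagram does not state.

```typescript
// In-memory stand-in for the three tables in the diagram above.
// Column names follow the diagram; everything else is an assumption.
interface SessionRow { id: string; parent_id: string | null; title: string; agent: string }
interface MessageRow { id: string; session_id: string; data: { role: string } }
interface PartRow {
  id: string;
  message_id: string;
  session_id: string;
  data: { type: "text" | "tool" | "reasoning"; content: string };
}

// Hypothetical helper: rebuild what a resumed subagent would see by grouping
// a session's parts under their messages.
function loadHistory(
  sessionID: string,
  messages: MessageRow[],
  parts: PartRow[],
): { role: string; parts: PartRow["data"][] }[] {
  return messages
    .filter((m) => m.session_id === sessionID)
    .map((m) => ({
      role: m.data.role,
      parts: parts.filter((p) => p.message_id === m.id).map((p) => p.data),
    }));
}

// Invented sample rows: one main session and one subagent session.
const sessionRows: SessionRow[] = [
  { id: "ses_main", parent_id: null, title: "main", agent: "build" },
  { id: "ses_sub", parent_id: "ses_main", title: "clarify", agent: "clarifier" },
];
const messageRows: MessageRow[] = [
  { id: "msg_1", session_id: "ses_sub", data: { role: "user" } },
  { id: "msg_2", session_id: "ses_sub", data: { role: "assistant" } },
  { id: "msg_3", session_id: "ses_main", data: { role: "user" } },
];
const partRows: PartRow[] = [
  { id: "prt_1", message_id: "msg_1", session_id: "ses_sub", data: { type: "text", content: "Clarify spec-42" } },
  { id: "prt_2", message_id: "msg_2", session_id: "ses_sub", data: { type: "reasoning", content: "Need the target platform" } },
  { id: "prt_3", message_id: "msg_2", session_id: "ses_sub", data: { type: "tool", content: "grep platform" } },
  { id: "prt_4", message_id: "msg_2", session_id: "ses_sub", data: { type: "text", content: "Which platform?" } },
];

const subHistory = loadHistory("ses_sub", messageRows, partRows);
const childSessions = sessionRows.filter((s) => s.parent_id === "ses_main");
console.log(childSessions.length, subHistory.length); // 1 2
console.log(subHistory.map((m) => m.parts.length)); // [ 1, 3 ]
```

The point of the shape is that reasoning, tool calls, and text are all parts of a message, so a resumed session gets back the intermediate work and not only the final answer.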

Finding 2: Subagents can't delete their sessions.

Subagents' default permissions don't include task or todowrite. There is no end_session, close, terminate, or anything similar. A subagent cannot — by accident or by design — destroy its own session.

Finding 3: The LLM loop exits — it doesn't destroy.

When the subagent finishes responding, the main loop checks an exit condition:

// OpenCode — prompt.ts, lines 1267-1274
if (
  lastAssistant?.finish &&
  !["tool-calls"].includes(lastAssistant.finish) &&
  !hasToolCalls &&
  lastUser.id < lastAssistant.id
) {
  break  // ← Exits the loop. Nothing else.
}

That break doesn't call sessions.remove(). It doesn't archive the session. It doesn't touch a single field in the database. The in-memory runner cleans itself up, but in SQLite the session sits there intact, with all its messages.
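Read as a pure predicate, the exit check has no side effects on stored data. The sketch below restates the condition outside OpenCode: the Turn type, the "stop" finish value, the scripted steps, and the assumption that IDs are strings that sort chronologically are all illustrative, not taken from the codebase.

```typescript
// Hypothetical restatement of the exit condition as a side-effect-free function.
type Turn = { id: string; finish?: string };

function shouldExit(
  lastUser: Turn,
  lastAssistant: Turn | undefined,
  hasToolCalls: boolean,
): boolean {
  return Boolean(
    lastAssistant?.finish &&
      !["tool-calls"].includes(lastAssistant.finish) &&
      !hasToolCalls &&
      lastUser.id < lastAssistant.id, // assumes IDs sort in creation order
  );
}

// Stand-in for the persisted message list; the loop only ever appends to it.
const userTurn: Turn = { id: "msg_001" };
const persisted: Turn[] = [userTurn];

// Scripted assistant steps: one that requests tools, then a final answer.
const scripted: { turn: Turn; hasToolCalls: boolean }[] = [
  { turn: { id: "msg_002", finish: "tool-calls" }, hasToolCalls: true },
  { turn: { id: "msg_003", finish: "stop" }, hasToolCalls: false },
];

let steps = 0;
for (const step of scripted) {
  persisted.push(step.turn);
  steps += 1;
  if (shouldExit(userTurn, step.turn, step.hasToolCalls)) break; // exits, deletes nothing
}

console.log(steps, persisted.length); // 2 3
```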

Finding 4: No TTL, no timeout, no automatic cleanup.

I searched for ttl, expir, timeout.*session, auto.*delete across the entire codebase. Zero results for sessions. Sessions live until someone deletes them manually. They don't expire.

The irony: the infrastructure already does exactly what we need. It persists context. It keeps the history. It destroys nothing. You just need to ask it to reuse a session. And all it takes is passing the right task_id.

4. The Handshake: 200 Lines That Change Everything

The problem isn't one of capability — it's one of indexing. OpenCode can persist sessions and resume them if you pass a task_id. What it can't do is answer questions like:

"Give me the session for spec-42's clarifier"
"Has spec-42's constructor finished?"
"Resume the conversation with the planner where we left off"

To OpenCode, a session is ses_1d6f79327ffe7JM4ZcELwlMV0D. It doesn't know what "spec-42" is, which agent ran in that session, or which step of the workflow you're on. That's domain knowledge.

The handshake is the layer that translates domain knowledge into session references. Three functions:

Discovery: given a spec ID, find the right session's task_id
Naming: instead of ses_1d6f79327ffe, you see Kael-planner, Aitana-validator
Orchestration state: is the planner running? Did the validator approve?
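A minimal sketch of a registry covering those three functions might look like the following. It is not FlowTask's implementation: HandshakeRegistry, its method names, and the status values are hypothetical, chosen only to show how a domain key can front an opaque session ID.

```typescript
// Hypothetical handshake registry: maps domain keys to session references.
type AgentStatus = "running" | "done" | "approved" | "rejected";

interface HandshakeEntry {
  taskID: string; // opaque session ID owned by the tool
  alias: string; // human-readable name, e.g. "Kael-planner"
  status: AgentStatus; // orchestration state tracked outside the session
}

class HandshakeRegistry {
  private entries = new Map<string, HandshakeEntry>();

  private key(specID: string, agent: string): string {
    return `${specID}/${agent}`;
  }

  register(specID: string, agent: string, entry: HandshakeEntry): void {
    this.entries.set(this.key(specID, agent), entry);
  }

  // Discovery: domain key -> task_id (undefined when the agent never ran).
  findTaskID(specID: string, agent: string): string | undefined {
    return this.entries.get(this.key(specID, agent))?.taskID;
  }

  // Naming: show the alias and state instead of the raw session ID.
  describe(specID: string, agent: string): string {
    const entry = this.entries.get(this.key(specID, agent));
    return entry ? `${entry.alias} [${entry.status}]` : `${specID}/${agent} (no session yet)`;
  }

  // Orchestration state: returns false when there is nothing to update.
  setStatus(specID: string, agent: string, status: AgentStatus): boolean {
    const entry = this.entries.get(this.key(specID, agent));
    if (!entry) return false;
    entry.status = status;
    return true;
  }
}

const registry = new HandshakeRegistry();
registry.register("spec-42", "planner", {
  taskID: "ses_1d6f79327ffe",
  alias: "Kael-planner",
  status: "running",
});

console.log(registry.findTaskID("spec-42", "planner")); // ses_1d6f79327ffe
console.log(registry.findTaskID("spec-42", "validator")); // undefined
registry.setStatus("spec-42", "planner", "done");
console.log(registry.describe("spec-42", "planner")); // Kael-planner [done]
```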

The analogy is DNS. A web server can serve content if you give it the right IP. DNS translates github.com to 140.82.121.3. The handshake translates spec-42 → constructor to ses_1d6e78035ffe. It doesn't replace persistence. It complements it.

In practice, two scenarios:

Scenario A — New: No task_id exists for this agent. Call task() normally. Capture the task_id from the response. Persist it in a map: "spec-42/constructor" → "ses_1d6e78035ffe".

Scenario B — Resume: A task_id already exists. Retrieve it from the map. Call task() with that task_id. OpenCode loads the full session. The agent doesn't "remember" by magic — it sees its entire history.
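The two scenarios fit in one small function. In this sketch, TaskFn stands in for the tool's task() call; its shape (a prompt plus optional taskID in, a taskID plus output out) is an assumption, and fakeTask is a stub that only counts turns so the example runs without any agent runtime.

```typescript
// Sketch of Scenario A (new) and Scenario B (resume). All names are hypothetical.
type TaskResult = { taskID: string; output: string };
type TaskFn = (input: { prompt: string; taskID?: string }) => TaskResult;

function delegate(
  sessionMap: Map<string, string>,
  task: TaskFn,
  specID: string,
  agent: string,
  prompt: string,
): TaskResult & { resumed: boolean } {
  const key = `${specID}/${agent}`;
  const known = sessionMap.get(key);
  // Scenario B when a task_id is known, Scenario A otherwise.
  const result = task(known ? { prompt, taskID: known } : { prompt });
  // Always store the ID that came back, in case the tool created a new session.
  sessionMap.set(key, result.taskID);
  return { ...result, resumed: known !== undefined };
}

// Stub task(): keeps one transcript per session and reports the turn number.
const transcripts = new Map<string, string[]>();
let nextFakeID = 0;
const fakeTask: TaskFn = ({ prompt, taskID }) => {
  const id = taskID ?? `ses_fake_${++nextFakeID}`;
  const transcript = transcripts.get(id) ?? [];
  transcript.push(prompt);
  transcripts.set(id, transcript);
  return { taskID: id, output: `turn ${transcript.length}` };
};

const sessionMap = new Map<string, string>();
const roundOne = delegate(sessionMap, fakeTask, "spec-42", "constructor", "Start building");
const roundTwo = delegate(sessionMap, fakeTask, "spec-42", "constructor", "Continue");

console.log(roundOne.resumed, roundOne.output); // false turn 1
console.log(roundTwo.resumed, roundTwo.output); // true turn 2
```

Writing the returned ID back on every call is a deliberate choice: if the stored session no longer exists and the tool silently creates a new one, the map follows the new ID and does not keep pointing at a dead reference.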

The result: a 5-agent pipeline (clarifier → planner → auditor → constructor → validator) where each agent can resume with full context. Seven iterations on the same spec without losing a single reference.

5. Three Benefits You Didn't Expect

The obvious benefit is that the agent stops repeating questions. But there are three more that only show up in production.

Fewer tokens, lower latency. When an agent resumes its session, it doesn't need to re-run grep to find the relevant files, re-read documentation it already read, or re-analyze code it already understood. All of that is in the tool call history. Every tool call not re-executed is tokens saved and seconds the user doesn't wait for.

Real iterative refinement. A clarifier that goes through three rounds of questions sharpens its understanding each time. Without session continuity, round three is just as generic as round one — the agent doesn't know what it already asked or what you already answered. With it, each iteration builds on the last.

Auditability. When something goes wrong, the session history shows you exactly what the agent did, which tools it used, and why. Without continuity, that record fragments into orphaned sessions. With the handshake, you have a traceable reasoning chain end to end.

6. This Isn't New — The Industry Already Does It

What's interesting isn't that it works. It's that the industry already solved this problem with the same pattern — just with more infrastructure.

LangGraph implements checkpoints with thread_id + checkpointer. The thread_id is the direct equivalent of our task_id. The difference is that LangGraph needs you to configure a checkpointer such as SqliteSaver or PostgresSaver, whereas FlowTask (the project where this handshake is implemented) reuses OpenCode's SQLite store, which was already there. [docs]

Temporal runs workflows with durable execution: when a worker crashes at step 5 of 10, another worker picks up the workflow, replays the event history from the beginning, skips already-completed activities, and resumes from the last checkpoint. OpenCode solves the same conceptual problem — use history to avoid repeating completed work — at the LLM context level. The difference: Temporal guarantees this against infrastructure failures with deterministic replay; OpenCode does it at the conversational context layer of the LLM. [docs]

Microsoft Agent Framework defines supersteps with checkpoint storage: each superstep captures the full state upon completion. Each agent in our pipeline is a superstep that persists its state when done. [docs]

The difference: FlowTask solves it with 200 lines of protocol. Your tool already has the rest.

7. Conclusion

Session continuity isn't an infrastructure problem — it's a design omission. The tools already persist all the context. All that's missing is the handshake that reuses it.

If your agents depend on multi-turn reasoning, don't accept the default of a fresh session every time.

The full pattern is implemented in FlowTask for reference.

Top comments (13)

Alex Shev • Jun 11

The hard part with multi-turn agents is deciding what should survive the turn boundary. Keeping everything creates noise; keeping too little makes the agent restart the task every few minutes.

I like treating state as a structured artifact: current objective, constraints, decisions already made, open questions, and evidence links. That gives the next turn a compact operating picture instead of another giant transcript to reinterpret.

Juan Saez • Jun 11

Interesting approach, and it's basically how the orchestrator passes context between subagents in FlowTask: not a raw transcript, but a structured output that says exactly what was reviewed and what reasoning led to that conclusion.
The distinction I'd draw is between a snapshot and a reasoning chain. A snapshot tells the next instance the result, not how it got there. For a new instance of the same subagent that's a problem: if the agent read file X and the snapshot only captures the conclusion, the new instance has to re-read the file and reinterpret it from scratch (which can generate doubts the previous reasoning already resolved, not always, but it's a real risk), adding unnecessary turns and spending tokens re-doing work that was already done.
For clarification agents specifically, the reasoning chain isn't overhead, it's the snapshot.

Alex Shev • Jun 11

Yes, that snapshot vs reasoning-chain distinction is the key point.

A handoff that only says “result: safe” is useful for orchestration, but it is weak for continuation. The next agent still has to rediscover why the result is safe, what evidence was checked, and where the uncertainty was. That is where token savings turn into hidden rework.

I like the clarification-agent framing. For that role, the reasoning chain is not a verbose transcript. It is the artifact: what was ambiguous, what was resolved, what evidence changed the state, and what still needs a human or a downstream agent.

The trick is keeping it structured enough that it can be inspected and resumed, without turning it back into raw chat history.

Juan Saez • Jun 11

Exactly, and that's the problem you're describing: the new agent has to rediscover what was already reasoned.
I'm currently using a domain heuristic: when the agent shifts to a different feature, I create a new instance. Works reasonably well but has its failures, sometimes it switches when it should continue and vice versa.
The evolution I'm researching is a plugin for OpenCode specifically. The idea: access the internal context data of each subagent (tokens used, how far from saturation, when it compacts), measure the saturation level, and when it hits a certain threshold, instead of letting the CLI compact on its own, read the active reasoning chain, compact it in a controlled way, and start a new instance with that chain as the first message. The new agent starts with a clean context window but with all the relevant reasoning, without the accumulated noise or system prompt instructions that no longer apply.
Work in progress, if I find a better strategy for injecting the chain I'll adjust.

Alex Shev • Jun 12

That is the right direction. The part I would be careful about is making the transferred chain auditable, not just persuasive.

If the new instance receives a compacted reasoning chain, it should also receive evidence handles: which files were inspected, which assumptions were made, what was rejected, and what is still unresolved. Otherwise the chain can become another narrative the next agent has to trust.

The best handoff format is probably not full transcript and not pure summary. It is closer to: conclusion, supporting evidence, open questions, and invalidated paths. That gives the new agent enough continuity without forcing it to inherit all the noise.

xulingfeng • Jun 11

I was nodding along until I got to the part about sessions persisting everything but nobody passing the sessionId — because I'm sitting in that exact problem right now.
I run an agent on my home PC and another on a company server, talking over MQTT. Every message exchange is essentially a fresh session. We had to build a manual queue file just to keep some shared context — otherwise each round starts from zero. "The default creates a new session" is the story of my week.
What really stood out: OpenCode stores everything in SQLite but the design doesn't reuse it. We went the other direction with our memory layer — persist everything, connect entities, track time — because without that, multi-agent coordination doesn't work at all.
This post is basically an argument for why memory systems exist. 😏
Have you experimented with sharing a session root across subagents, or do you lean toward independent sessions with return-value-only communication?

Juan Saez • Jun 11

I see two things here.
First, on independent sessions: the main reason is the context window. Current models handle between 200k and 300k tokens on average, and while 1M windows exist, the cost scales proportionally. With a shared root, subagents end up competing for that space, and as the limit approaches the model starts compacting, meaning it summarizes the history to free up room. The problem is that compaction isn't precise, it's a blend of all the shared state, and that's where hallucinations start, which in practice are worse than simply missing context.
Second, on sharing reasoning between subagents: independent sessions don't mean they can't communicate. A strategy I'm evaluating in FlowTask is having subagent A call subagent B directly with its task_id if it needs B's reasoning chain. B still has its own context intact, so if it needs to compact it does so over its own thread, not over mixed state. The result is a more precise summary with no noise from other sessions.

Aliaksei Zelianouski • Jun 10

Careful calling the fresh session a bug - half the time it's the feature. A draft/edit/critique chain works because each subagent gets clean context; one window juggling three roles and they bleed into each other. The clarifier that needs to continue the same conversation is the real exception where you want resume. So the default is correct for most delegation, and resumption is the special case you opt into - which is roughly how these tools already ship it.

Juan Saez • Jun 10

Fair point. The handshake isn't for all delegation — in FlowTask itself it applies to 4 of 11 subagents. The criterion: if the value is in the artifact the agent produces, clean session. If it's in the accumulated reasoning — clarifiers, researchers, agents that need to correct their own thread between turns — that's where continuity matters. Should have scoped that better in the article.

Aliaksei Zelianouski • Jun 11

Makes sense, and it matches how we do it (me and my AIs). Our persistent memory just lives in plain files, with hooks that compact them now and then. That's been more than enough.

But we only do that for the conversational, general-purpose agents - a personal assistant, a long-running writer bot - the ones that actually accumulate state across days and need to remember.

For coding I don't reach for memory at all. A good model with a good harness does fine on a clean context each task. The artifact is the output; the reasoning doesn't need to survive.

Juan Saez • Jun 11

I think there's confusion between 3 different problems with different solutions:

Session continuity — this isn't persistent memory. It's about the agent being able to resume its own reasoning thread within the same task. An agent writing a spec that loses context between turns repeats the same questions in round 3 that it asked in round 1 — it doesn't know what it already agreed on with the user. That's what the handshake solves.
Persistent memory for conversational agents — you're right, that's the correct tool there. I use SQLite with FTS5 for better indexing (github.com/Gentleman-Programming/e...), but the principle is the same as what you describe with plain files.
Coding agents without memory — valid, but it depends on project size. Most LLMs have 200k-300k token windows (1M exists but it's expensive). The issue is that CLIs like OpenCode use grep to search the codebase — which adds noise to the context: comments, references, imports. On small projects or low-verbosity languages it doesn't matter. On large codebases or verbose languages like C#, Java, C++ that saturation hits fast.

Aliaksei Zelianouski • Jun 11

We all are AI power users here. I believe your conclusions are coming out of practice and these things work for you.

For me, I don't see any need for session continuity in coding. Modern harnesses like Claude Code spawn lots of sub-agents to analyze your ask, and search for the code they need. They all run on small fresh contexts, so it's cheap and fast. All the reasoning, tool calls, intermediate output is thrown away - the main agent receives a well-formed context with no noise. It's cleaner than dragging old conversation summaries from session to session. And more efficient since the harness works with the up-to-date code. If I need to remember anything, I ask a harness to create/update a skill. But usually, it's about integrations or domain-specific knowledge the AI doesn't have. But that's my experience, not some generalized rule.

I have a project where multiple AIs have a very long group conversation. The longer the conversation goes the more they deviate from the system prompt. There are no sub-agents - each AI runs a single session. I have to do tricks to keep them on track. Like appending reminders about behavioral patterns to the last message. That's another reason for me to avoid large context. A secondary one - the primary reason is the price.

Juan Saez • Jun 11

Exactly, for coding agents that's the right call, and it lines up with what I mentioned earlier: if the value is in the artifact, clean context wins every time. The post was focused on clarification agents, where the accumulated reasoning is the work product. Different problems, different tools. Appreciate the breakdown of your setup, good to see how others are solving it in practice.
