Joakim William Hauge

Posted on May 25

Preventing Recursive Tool Loops in LangChain Agents

#typescript #langchain #ai #node

One of the fastest ways for LangChain agents to become unstable in production is not model quality.

It’s recursive tool loops.

A workflow starts normally:

* search
* retrieve
* summarize

Then suddenly:

* the same tool gets called repeatedly
* retries compound
* context grows
* token usage spikes
* execution drifts indefinitely

The agent technically remains “alive.”

Operationally, it stopped making progress a long time ago.

This article shows a simple way to detect and interrupt recursive tool loops in LangChain agents using TypeScript.

The Problem

A basic agent workflow often looks harmless:

```ts
const result = await agentExecutor.invoke({
  input: userPrompt
});
```

But production agents can drift into patterns like:



```txt
search_documents
→ search_documents
→ search_documents
→ search_documents
```

or:

```txt
search
→ summarize
→ retry
→ search
→ summarize
→ retry
```

This usually happens because:

* the model fails to converge
* tool outputs are ambiguous
* retries reinforce uncertainty
* the agent misinterprets partial progress

The result is runaway execution.

# Why This Is Dangerous

Most AI workflows behave normally most of the time.

The problem comes from tail events:

* recursive retries
* unstable recovery behavior
* escalating context windows
* repeated tool invocation

A tiny percentage of unstable runs can consume a disproportionate amount of:

* inference cost
* latency
* compute
* operational attention

This is not just an observability issue.

It’s a runtime governance issue.

---

# Basic Strategy

We want to:

* track recent tool usage
* detect repetition patterns
* interrupt execution safely

before the workflow spirals.

The simplest version:



```txt
“If the same tool is called too many times consecutively, stop execution.”
```

Simple.
Effective.
Easy to implement.

Step 1 — Track Tool History

We’ll maintain lightweight runtime state:

```ts
// Lightweight per-run state: the ordered names of the tools called so far.
type ExecutionState = {
  toolHistory: string[];
};
```

Initialize it:



```ts
const state: ExecutionState = {
  toolHistory: []
};
```

Step 2 — Detect Recursive Patterns

Now create a helper:

```ts
// Returns true when the last `threshold` tool calls all used the same tool.
function detectRecursiveLoop(
  toolHistory: string[],
  threshold = 3
): boolean {
  if (toolHistory.length < threshold) {
    return false;
  }

  const recent = toolHistory.slice(-threshold);

  return recent.every(
    tool => tool === recent[0]
  );
}
```

This checks:



```txt
Did the same tool run 3 times in a row?
```

Step 3 — Wrap Tool Execution

Now intercept tool calls:

```ts
async function guardedToolCall<T>(
  toolName: string,
  execute: () => Promise<T>
): Promise<T> {
  // Record the call first so the check includes the current invocation.
  state.toolHistory.push(toolName);

  if (detectRecursiveLoop(state.toolHistory)) {
    throw new Error(
      `Recursive loop detected for tool: ${toolName}`
    );
  }

  return execute();
}
```

---

# Step 4 — Use Inside LangChain Tools

Example:



```ts
const result = await guardedToolCall(
  "search_documents",
  async () => {
    return searchTool.invoke(query);
  }
);
```

That’s it.

Now your workflow can:

* detect runaway repetition
* interrupt unstable execution
* prevent unnecessary cost escalation
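
One caveat with the snippets above: `state` is a single shared object, so concurrent agent runs in the same process would mix their tool histories. A minimal sketch of one way around that, assuming you create one guard per run, is shown here. The names `createLoopGuard` and `ToolLoopError` are hypothetical helpers for illustration, not part of LangChain.

```typescript
// Hypothetical error type so callers can tell loop interruptions from tool failures.
class ToolLoopError extends Error {
  readonly toolName: string;
  readonly repeats: number;

  constructor(toolName: string, repeats: number) {
    super(`Recursive loop detected for tool: ${toolName} (${repeats} consecutive calls)`);
    this.name = "ToolLoopError";
    this.toolName = toolName;
    this.repeats = repeats;
  }
}

// Hypothetical factory: each call returns a guard with its own private history.
function createLoopGuard(threshold = 3) {
  const toolHistory: string[] = [];

  // Records one call and throws if it completes `threshold` identical calls in a row.
  function record(toolName: string): void {
    toolHistory.push(toolName);
    const recent = toolHistory.slice(-threshold);
    if (recent.length === threshold && recent.every(t => t === toolName)) {
      throw new ToolLoopError(toolName, threshold);
    }
  }

  // Same contract as guardedToolCall, but scoped to this guard's history.
  async function call<T>(toolName: string, execute: () => Promise<T>): Promise<T> {
    record(toolName);
    return execute();
  }

  return { record, call };
}
```

Create a fresh guard at the start of each run (for example, per incoming request) and route that run's tool calls through `guard.call(...)`. Catching an error whose `name` is `"ToolLoopError"` at the call site lets you return a fallback answer instead of surfacing a generic failure.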

Why Simple Detection Works Surprisingly Well

A lot of teams initially assume they need:

* anomaly detection
* reinforcement learning
* advanced telemetry pipelines

But simple operational heuristics already eliminate many expensive failures.

Especially:

* recursive retries
* retry storms
* repeated tool churn

You do not need perfect intelligence initially.

You need:

bounded execution.

Production Improvements

The minimal approach above works surprisingly well, but production systems usually add:

* semantic similarity detection
* token velocity monitoring
* execution depth limits
* tool-call budgets
* runtime ceilings
* timeout policies
* adaptive thresholds
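
Two of the items above, tool-call budgets and runtime ceilings, can be sketched in a few lines. This is only an illustration under stated assumptions: `createRunBudget` and `RunLimits` are hypothetical names, the limits are placeholders to tune per workload, and the clock is injectable purely to make the behavior easy to test.

```typescript
// Hypothetical limits for a single agent run.
type RunLimits = {
  maxToolCalls: number;  // total tool invocations allowed in the run
  maxRuntimeMs: number;  // wall-clock ceiling measured from budget creation
};

// Hypothetical helper: tracks both limits for one run.
function createRunBudget(limits: RunLimits, now: () => number = Date.now) {
  const startedAt = now();
  let toolCalls = 0;

  return {
    // Call once before each tool invocation; throws when either limit is exceeded.
    charge(toolName: string): void {
      toolCalls += 1;
      if (toolCalls > limits.maxToolCalls) {
        throw new Error(
          `Tool-call budget exceeded (${limits.maxToolCalls}) at: ${toolName}`
        );
      }
      const elapsedMs = now() - startedAt;
      if (elapsedMs > limits.maxRuntimeMs) {
        throw new Error(
          `Runtime ceiling exceeded (${limits.maxRuntimeMs} ms) at: ${toolName}`
        );
      }
    },
    // Number of tool calls charged so far.
    used(): number {
      return toolCalls;
    }
  };
}
```

Calling `budget.charge(toolName)` next to the loop check bounds a run even when no single tool repeats. Note that this sketch only checks limits between tool calls; it does not abort a call that is already in flight, which is where a separate timeout policy would come in.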

Example:

```txt
search
→ search
→ search
```

is easy to detect.

More advanced loops look like:



```txt
search
→ summarize
→ retry
→ search
→ summarize
→ retry
```

These require broader trajectory analysis.
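
As a first step toward that, one simple option is to check whether the tail of the tool history is a short sequence repeated back to back. The sketch below is a hypothetical helper, not a LangChain API, and `maxPeriod` and `minRepeats` are assumed tuning knobs. It only catches exact repetition of tool names, so it is a starting point rather than full trajectory analysis.

```typescript
// Returns the length of the shortest cycle that repeats at least `minRepeats`
// times at the end of the history, or null if no such cycle is found.
function detectCycle(
  toolHistory: string[],
  maxPeriod = 4,
  minRepeats = 3
): number | null {
  for (let period = 1; period <= maxPeriod; period++) {
    const span = period * minRepeats;
    if (toolHistory.length < span) {
      continue; // not enough history to show this many repeats
    }

    const tail = toolHistory.slice(-span);

    // Every entry must match the entry at the same position in the first cycle.
    const repeats = tail.every((tool, i) => tool === tail[i % period]);
    if (repeats) {
      return period;
    }
  }
  return null;
}

const history = ["search", "summarize", "retry", "search", "summarize", "retry"];
console.log(detectCycle(history, 4, 2)); // 3
```

With `period = 1` and `minRepeats = 3` this reduces to the same check as `detectRecursiveLoop` with its default threshold. Lower `minRepeats` values interrupt sooner but are more likely to flag legitimate retries.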

The Distributed Systems Parallel

Distributed systems eventually evolved:

* retry limits
* circuit breakers
* bounded failure domains
* timeout controls

because unconstrained retries became dangerous at scale.

Autonomous agent systems are beginning to encounter similar operational realities.

As agents become:

* more autonomous
* more persistent
* more deeply integrated

runtime governance becomes increasingly important.

Final Thoughts

Most teams focus heavily on:

* prompts
* model quality
* orchestration frameworks

But production AI systems also need:

* bounded execution
* runtime constraints
* operational safeguards
* economic stability

Because eventually:
the challenge is not just building autonomous agents.

It is building governable autonomous agents.

Top comments (2)

Marcus Chen • May 25

The pattern is right. The piece that bites hardest in voice land is the latency tax of the loop, not the cost. We had a tool-loop on a voice agent that fired 4 times in 1.8s before our 2s first-response budget blew up. Cost was tiny. The user heard 1.8s of dead air. The fix was a per-conversation latency budget that hard-caps any single tool sequence at 600ms and short-circuits to a holding phrase if it overruns. Detection is the easy part. Bounding the wall-clock impact on streaming UX is the part most LangChain teams underbuild. What heuristic did you settle on for distinguishing legit retries from runaway loops?

Joakim William Hauge • May 25

This is a really good point.

In voice systems the wall-clock impact can absolutely become the dominant failure mode before economics even matter. 1.8s of dead air is effectively a broken UX regardless of token cost.

Your “holding phrase” fallback is especially interesting because it treats governance as a streaming/runtime concern rather than just a cost-control mechanism.

Right now we’re mostly experimenting with fairly explainable heuristics:

* repeated tool invocation
* repeated node traversal
* token velocity spikes
* execution depth thresholds
* low state-transition diversity

The interesting challenge is exactly what you described though:
distinguishing productive recovery behaviour from non-converging retries in real time.

Feels increasingly similar to distributed systems retry governance honestly.