One of the fastest ways for LangChain agents to become unstable in production is not model quality.
It’s recursive tool loops.
A workflow starts normally:
- search
- retrieve
- summarize
Then suddenly:
- the same tool gets called repeatedly
- retries compound
- context grows
- token usage spikes
- execution drifts indefinitely
The agent technically remains “alive.”
Operationally, it stopped making progress a long time ago.
This article shows a simple way to detect and interrupt recursive tool loops in LangChain agents using TypeScript.
# The Problem
A basic agent workflow often looks harmless:
```ts
const result = await agentExecutor.invoke({
  input: userPrompt
});
```
But production agents can drift into patterns like:
```txt
search_documents
→ search_documents
→ search_documents
→ search_documents
```
or:
```txt
search
→ summarize
→ retry
→ search
→ summarize
→ retry
```
This usually happens because:
* the model fails to converge
* tool outputs are ambiguous
* retries reinforce uncertainty
* the agent misinterprets partial progress
The result is runaway execution.
# Why This Is Dangerous
Most AI workflows behave normally most of the time.
The problem comes from tail events:
* recursive retries
* unstable recovery behavior
* escalating context windows
* repeated tool invocation
A tiny percentage of unstable runs can consume a disproportionate amount of:
* inference cost
* latency
* compute
* operational attention
This is not just an observability issue.
It’s a runtime governance issue.
---
# Basic Strategy
We want to:
* track recent tool usage
* detect repetition patterns
* interrupt execution safely
before the workflow spirals.
The simplest version:
```txt
“If the same tool is called too many times consecutively, stop execution.”
```
Simple.
Effective.
Easy to implement.
# Step 1: Track Tool History
We’ll maintain lightweight runtime state:
```ts
type ExecutionState = {
  toolHistory: string[];
};
```
Initialize it:
```ts
const state: ExecutionState = {
  toolHistory: []
};
```
# Step 2: Detect Recursive Patterns
Now create a helper:
```ts
function detectRecursiveLoop(
  toolHistory: string[],
  threshold = 3
): boolean {
  if (toolHistory.length < threshold) {
    return false;
  }
  const recent = toolHistory.slice(-threshold);
  return recent.every(
    tool => tool === recent[0]
  );
}
```
This checks:
```txt
Did the same tool run 3 times in a row?
```
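To make the behaviour concrete, here is a small sketch that runs the helper against a few made-up histories. The helper is repeated so the snippet runs on its own, and the sample traces are illustrative assumptions rather than real agent logs.

```ts
// Same helper as above, repeated so this snippet is self-contained.
function detectRecursiveLoop(toolHistory: string[], threshold = 3): boolean {
  if (toolHistory.length < threshold) return false;
  const recent = toolHistory.slice(-threshold);
  return recent.every((tool) => tool === recent[0]);
}

// Hypothetical histories, oldest call first.
const stuck = ["search_documents", "search_documents", "search_documents"];
const progressing = ["search", "retrieve", "summarize"];
const alternating = ["search", "summarize", "retry", "search", "summarize", "retry"];

console.log(detectRecursiveLoop(stuck));       // true
console.log(detectRecursiveLoop(progressing)); // false
console.log(detectRecursiveLoop(alternating)); // false: the loop spans several tools
console.log(detectRecursiveLoop(stuck, 4));    // false: only 3 calls recorded so far
```

The third case is worth noting: a loop that cycles through several tools passes this check untouched.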
# Step 3: Wrap Tool Execution
Now intercept tool calls:
```ts
// Records the call, then aborts before executing if the loop check trips.
async function guardedToolCall<T>(
  toolName: string,
  execute: () => Promise<T>
): Promise<T> {
  state.toolHistory.push(toolName);
  if (detectRecursiveLoop(state.toolHistory)) {
    throw new Error(
      `Recursive loop detected for tool: ${toolName}`
    );
  }
  return execute();
}
```
---
# Step 4: Use Inside LangChain Tools
Example:
```ts
const result = await guardedToolCall(
  "search_documents",
  async () => {
    return searchTool.invoke(query);
  }
);
```
That’s it.
Now your workflow can:
- detect runaway repetition
- interrupt unstable execution
- prevent unnecessary cost escalation
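One caveat: if `state` lives at module scope, every run handled by the same process shares one tool history. A minimal sketch of a per-run variant follows. `createLoopGuard` and `LoopGuard` are hypothetical names introduced here, not LangChain APIs, and the tool call inside `demo` is a stand-in.

```ts
// Hypothetical per-run guard: each agent run gets its own history.
type LoopGuard = {
  call<T>(toolName: string, execute: () => Promise<T>): Promise<T>;
  history(): readonly string[];
};

function createLoopGuard(threshold = 3): LoopGuard {
  const toolHistory: string[] = [];

  return {
    async call<T>(toolName: string, execute: () => Promise<T>): Promise<T> {
      toolHistory.push(toolName);
      const recent = toolHistory.slice(-threshold);
      const looping =
        recent.length === threshold && recent.every((t) => t === toolName);
      if (looping) {
        // Abort before running the tool, as in the wrapper above.
        throw new Error(`Recursive loop detected for tool: ${toolName}`);
      }
      return execute();
    },
    history: () => toolHistory,
  };
}

// Usage: create a fresh guard at the start of every run.
async function demo(): Promise<string[]> {
  const guard = createLoopGuard();
  const outcomes: string[] = [];
  for (let i = 0; i < 4; i++) {
    try {
      // Stand-in for a real tool invocation.
      outcomes.push(await guard.call("search_documents", async () => `hit ${i}`));
    } catch (err) {
      outcomes.push((err as Error).message);
      break;
    }
  }
  return outcomes;
}
```

With the default threshold, the first two calls in `demo` succeed and the third consecutive call is rejected before it executes.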
# Why Simple Detection Works Surprisingly Well
A lot of teams initially assume they need:
- anomaly detection
- reinforcement learning
- advanced telemetry pipelines
But simple operational heuristics already eliminate many expensive failures.
Especially:
- recursive retries
- retry storms
- repeated tool churn
You do not need perfect intelligence initially.
You need bounded execution.
# Production Improvements
The minimal approach above works surprisingly well, but production systems usually add:
- semantic similarity detection
- token velocity monitoring
- execution depth limits
- tool-call budgets
- runtime ceilings
- timeout policies
- adaptive thresholds
Example:
```txt
search
→ search
→ search
```
is easy to detect.
More advanced loops look like:
```txt
search
→ summarize
→ retry
→ search
→ summarize
→ retry
```
These require broader trajectory analysis.
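As a first step toward that, one option is to look for a repeating cycle at the end of the history instead of a single repeated tool. The sketch below is an assumption about how this could be done, not a standard algorithm from LangChain. `detectCycle`, `maxPeriod`, and `minRepeats` are hypothetical names, and the check compares tool names only, so a legitimate run that repeats the same sequence with different arguments would also be flagged.

```ts
// Returns the length of the cycle repeating at the end of the history,
// or 0 if none is found. A cycle counts when the last `period` calls
// occur back-to-back at least `minRepeats` times.
function detectCycle(
  toolHistory: string[],
  maxPeriod = 4,
  minRepeats = 3
): number {
  for (let period = 1; period <= maxPeriod; period++) {
    const needed = period * minRepeats;
    // Longer periods need even more history, so stop here.
    if (toolHistory.length < needed) break;
    const tail = toolHistory.slice(-needed);
    const repeats = tail.every((tool, i) => tool === tail[i % period]);
    if (repeats) return period;
  }
  return 0;
}

// The multi-tool trace from above, oldest call first.
const trace = ["search", "summarize", "retry", "search", "summarize", "retry"];

console.log(detectCycle(trace, 4, 2));                          // 3
console.log(detectCycle(trace));                                // 0: needs 3 repeats by default
console.log(detectCycle(["search", "search", "search"]));       // 1
console.log(detectCycle(["search", "retrieve", "summarize"]));  // 0
```

A period of 1 reproduces the consecutive-call check from Step 2, so this can replace that helper rather than sit beside it.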
# The Distributed Systems Parallel
Distributed systems eventually evolved:
- retry limits
- circuit breakers
- bounded failure domains
- timeout controls
because unconstrained retries became dangerous at scale.
Autonomous agent systems are beginning to encounter similar operational realities.
As agents become:
- more autonomous
- more persistent
- more deeply integrated
runtime governance becomes increasingly important.
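The agent-side equivalents of retry limits and timeouts are a tool-call budget and a runtime ceiling. Here is a minimal sketch, assuming the check is called before every tool invocation. `createRunBudget` and `RunLimits` are hypothetical names, and the limits shown are placeholders, not recommendations.

```ts
// Hypothetical per-run limits.
type RunLimits = { maxToolCalls: number; maxRuntimeMs: number };

// `now` is injectable so the ceiling can be tested with a fake clock.
function createRunBudget(limits: RunLimits, now: () => number = Date.now) {
  const startedAt = now();
  let calls = 0;

  // Call before every tool invocation; throws once either limit is exceeded.
  return function charge(toolName: string): void {
    calls += 1;
    if (calls > limits.maxToolCalls) {
      throw new Error(`Tool-call budget exceeded at ${toolName} (call ${calls})`);
    }
    const elapsedMs = now() - startedAt;
    if (elapsedMs > limits.maxRuntimeMs) {
      throw new Error(`Runtime ceiling exceeded after ${elapsedMs}ms`);
    }
  };
}

// Usage: one budget per agent run.
const charge = createRunBudget({ maxToolCalls: 10, maxRuntimeMs: 30_000 });
charge("search_documents"); // passes while both limits hold
```

Unlike loop detection, these limits hold even when the agent never repeats itself, which makes them a useful backstop. Note that the ceiling is only checked between tool calls, so it does not interrupt a call that is already in flight.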
# Final Thoughts
Most teams focus heavily on:
- prompts
- model quality
- orchestration frameworks
But production AI systems also need:
- bounded execution
- runtime constraints
- operational safeguards
- economic stability
Because eventually, the challenge is not just building autonomous agents.
It is building governable autonomous agents.

Top comments (2)
The pattern is right. The piece that bites hardest in voice land is the latency tax of the loop, not the cost. We had a tool-loop on a voice agent that fired 4 times in 1.8s before our 2s first-response budget blew up. Cost was tiny. The user heard 1.8s of dead air. The fix was a per-conversation latency budget that hard-caps any single tool sequence at 600ms and short-circuits to a holding phrase if it overruns. Detection is the easy part. Bounding the wall-clock impact on streaming UX is the part most LangChain teams underbuild. What heuristic did you settle on for distinguishing legit retries from runaway loops?
This is a really good point.
In voice systems the wall-clock impact can absolutely become the dominant failure mode before economics even matter. 1.8s of dead air is effectively a broken UX regardless of token cost.
Your “holding phrase” fallback is especially interesting because it treats governance as a streaming/runtime concern rather than just a cost-control mechanism.
Right now we’re mostly experimenting with fairly explainable heuristics:
The interesting challenge is exactly what you described though:
distinguishing productive recovery behaviour from non-converging retries in real time.
Feels increasingly similar to distributed systems retry governance honestly.