TL;DR
I just open-sourced mastra-rlm-kit, a paper-faithful implementation of Recursive Language Models (RLMs) for Mastra.
It lets agents:
- break complex tasks into executable Python steps
- spawn recursive and batched sub-queries
- ground reasoning in code instead of vibes
- produce full, inspectable audit trails
This isn’t a prompt trick. It’s an architecture.
👉 GitHub: https://github.com/alvarofc/mastra-rlm
👉 npm: npm install mastra-rlm-kit
The Problem: Agents Are Still Bad at Thinking
If you’ve built agents with Mastra (or LangGraph, CrewAI, AutoGen…), you’ve probably tried something like:
“Given earnings reports, analyst notes, and news articles that don’t fit in a single context window, analyze renewable energy stocks in Q3 2024, compare them to traditional energy, and give me a recommendation.”
What happens?
- the model silently drops context
- key documents are ignored
- comparisons are incomplete or superficial
- the final answer sounds confident but isn’t grounded
Not because the model is weak — but because the agent architecture is.
Most agents still assume:
- one prompt
- one context window
- one response
That breaks down immediately once the task exceeds context limits or requires verification.
The Core Insight: Reasoning Needs Structure
In 2024, Chen et al. introduced Recursive Language Models (RLMs) with a simple but powerful idea:
Don’t ask the model to reason in one pass.
Force it to reason step by step, with execution and recursion.
An RLM works like this:
- A root model decomposes the task into steps
- Each step can execute Python code
- When more information is needed, it spawns recursive sub-queries
- Sub-queries can run in parallel
- Every action is logged and auditable
Instead of hoping the model reasons correctly inside a single context window, you externalize the reasoning process.
What mastra-rlm-kit Brings to Mastra
Mastra already has workflows, observability, and strong TypeScript ergonomics.
What it didn’t have was a structured reasoning layer: decomposition, execution, and recursion as first-class steps.
mastra-rlm-kit adds that missing layer with three exports:
| API | What it’s for |
|---|---|
| `createRlmTool()` | Expose RLM as a callable tool |
| `createRlmWorkflow()` | Build full recursive reasoning pipelines |
| `createRlmRunner()` | Low-level, programmatic control |
This isn’t a “conceptual” RLM — it’s paper-faithful and production-oriented.
Key Features
- ✅ Paper-faithful RLM implementation
- 🔁 Recursive sub-queries via `llm_query()` and `llm_query_batched()`
- ⚡ Parallel exploration with batched calls
- 🧪 Grounded reasoning via sandboxed Python REPL
- 📜 Deterministic artifacts: output, events, audit log, recursion tree
- 🔌 Model-agnostic: works with any Mastra-compatible model
Every run leaves a trail you can inspect, debug, and trust.
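To make that concrete, here is a rough sketch of what those artifacts tend to look like. The field names are illustrative, not the package's exported types; the real shapes live in the repo.

```typescript
// Illustrative shape only; the actual types are defined in mastra-rlm-kit.
interface RecursionNode {
  query: string;           // the (sub-)query that was asked
  depth: number;           // 0 = root, 1 = first level of recursion, ...
  children: RecursionNode[];
}

interface RlmRunArtifacts {
  output: string;          // final synthesized answer
  events: Array<{          // ordered record of REPL steps and sub-queries
    type: "repl_step" | "sub_query" | "synthesis";
    detail: unknown;
  }>;
  auditLog: string[];      // human-readable trail you can diff across runs
  recursionTree: RecursionNode; // which sub-queries spawned which
}
```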
Quick Start
```bash
npm install mastra-rlm-kit @mastra/core zod
```
Use It as a Tool
```typescript
import { createRlmTool } from "mastra-rlm-kit";

export const runRlmTool = createRlmTool({
  workspace,
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5",
    subModelId: "openrouter/minimax/minimax-m2.5",
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
```
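To actually hand that tool to an agent, a minimal sketch looks like the snippet below. The agent name, instructions, model choice, and import path for the tool are all illustrative; check Mastra's agent docs for the model setup you're using.

```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai"; // any Mastra-compatible model provider works
import { runRlmTool } from "./rlm-tool"; // the tool defined above; path is illustrative

// Sketch: let a Mastra agent delegate long-context or multi-step questions
// to the recursive RLM loop instead of answering in one pass.
export const analystAgent = new Agent({
  name: "rlm-analyst",
  instructions:
    "For questions that need decomposition or verification, call the RLM tool rather than answering directly.",
  model: openai("gpt-4o-mini"),
  tools: { runRlmTool },
});
```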
Or as a Workflow
```typescript
import { createRlmWorkflow } from "mastra-rlm-kit";

export const rlmWorkflow = createRlmWorkflow({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" },
    sub: { id: "openrouter/minimax/minimax-m2.5" },
  },
  defaults: {
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
```
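Kicking off a run then follows Mastra's usual workflow API. Treat this as a sketch: the exact run methods differ between Mastra versions, and the input fields are placeholders for whatever schema mastra-rlm-kit actually defines.

```typescript
// Sketch only: method names follow recent Mastra workflow APIs,
// and the inputData fields are placeholders, not the package's real schema.
const run = await rlmWorkflow.createRunAsync();
const result = await run.start({
  inputData: {
    prompt: "Compare Q3 2024 renewable vs. traditional energy performance.",
  },
});

console.log(result); // final output plus references to the persisted trace
```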
Where RLMs Actually Shine
| Use Case | Why RLM Helps |
|---|---|
| Long-context tasks | Breaks work across recursive calls instead of one window |
| Multi-hop Q&A | Each hop is a traceable sub-query |
| Math & logic | Python executes and verifies reasoning |
| Data analysis | Intermediate states are inspectable |
| Research synthesis | Parallel sub-queries before synthesis |
If the task exceeds a single context window or requires verification, RLMs win.
A Note on Benchmarks
mastra-rlm-kit includes strict, reproducible benchmarks — but they’re not the headline feature.
All benchmark runs:
- use datasets as-is (no rewritten questions or labels)
- run the RLM loop without prompt tuning
- score outputs using official exact-match metrics
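For context, “exact match” means the blunt check sketched below (my illustration, not the repo's scorer): light normalization, then string equality, so a near-miss scores zero.

```typescript
// Illustrative exact-match scoring, not the actual benchmark harness.
// Lowercase, strip punctuation, collapse whitespace, then compare strings.
function normalize(s: string): string {
  return s
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

function exactMatch(prediction: string, gold: string): number {
  return normalize(prediction) === normalize(gold) ? 1 : 0;
}

// exactMatch("Renewables!", "renewables") === 1
// exactMatch("7 companies", "8 companies") === 0  <- near-misses count as wrong
```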
Current Results (OolongBench)
On a recent OolongBench validation slice:
- Accuracy: 20% (exact match)
- Completion rate: 100%
- Avg sub-queries: ~8 per task
Many failures are near-misses (off-by-one values, partial lists, non-canonical names), which are not counted as correct by design.
Why This Is Still Useful
These results aren’t about leaderboard performance.
They show that RLMs:
- execute multi-step reasoning reliably
- fail deterministically (no silent hallucinations)
- produce full traces you can inspect and improve
Full benchmark commands and reports live in the repo.
How It Works Internally
- Root model receives the task
- It writes Python REPL steps
- Steps execute and store intermediate results
- Missing info → spawn `llm_query()` sub-queries
- Sub-queries batch and parallelize
- Results aggregate into a final synthesis
- Full trace is persisted
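Stripped of plumbing, the control flow is roughly the loop below. This is a simplified sketch of the idea, not the package's source; the injected functions stand in for the real internals (root model calls, the sandboxed REPL, and the batched sub-query runner).

```typescript
// Simplified sketch of the RLM control flow; not mastra-rlm-kit's implementation.
type Step = { done: boolean; code: string; subQueries: string[] };
type State = { task: string; findings: string[] };

async function rlmLoop(
  task: string,
  deps: {
    nextStep: (state: State) => Promise<Step>;                 // root model plans a REPL step
    runPythonStep: (code: string) => Promise<string>;          // sandboxed Python execution
    llmQueryBatched: (queries: string[]) => Promise<string[]>; // parallel recursive sub-queries
    synthesize: (state: State) => Promise<string>;             // final aggregation
  },
  maxIterations = 30,
) {
  const trace: unknown[] = [];
  const state: State = { task, findings: [] };

  for (let i = 0; i < maxIterations; i++) {
    // 1. The root model proposes the next Python step, or decides it is done.
    const step = await deps.nextStep(state);
    if (step.done) break;

    // 2. Execute the step in the sandboxed REPL; the result stays inspectable.
    const result = await deps.runPythonStep(step.code);
    trace.push({ step, result });

    // 3. Missing information: fan out recursive sub-queries in parallel
    //    and fold the answers back into the working state.
    if (step.subQueries.length > 0) {
      const answers = await deps.llmQueryBatched(step.subQueries);
      state.findings.push(...answers);
      trace.push({ subQueries: step.subQueries, answers });
    }
  }

  // 4. Aggregate findings into a final answer; the full trace is persisted as-is.
  return { output: await deps.synthesize(state), trace };
}
```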
Every claim is either:
- executed code, or
- traceable recursive output
That’s how hallucinations die.
Why Mastra Was the Right Fit
Mastra already gets the fundamentals right:
- TypeScript-first
- Built-in observability
- Clean workflow primitives
- Model-agnostic via Vercel AI SDK
RLMs don’t replace Mastra — they complete it.
Final Thought
The gap between agents that talk and agents that think is still massive.
Most demos fall apart the moment you ask for:
- long-context reasoning
- verification
- decomposition
- accountability
mastra-rlm-kit doesn’t add magic.
It adds structure, execution, and transparency.
Try it. Break it. Improve it.
And tell me what you build.
— Built by @metasurfero