Alvaro Fragoso

Building an RLM with Mastra: Introducing mastra-rlm-kit

TL;DR
I just open-sourced mastra-rlm-kit, a paper-faithful implementation of Recursive Language Models (RLMs) for Mastra.

It lets agents:

  • break complex tasks into executable Python steps
  • spawn recursive and batched sub-queries
  • ground reasoning in code instead of vibes
  • produce full, inspectable audit trails

This isn’t a prompt trick. It’s an architecture.

👉 GitHub: https://github.com/alvarofc/mastra-rlm
👉 npm: npm install mastra-rlm-kit


The Problem: Agents Are Still Bad at Thinking

If you’ve built agents with Mastra (or LangGraph, CrewAI, AutoGen…), you’ve probably tried something like:

“Given earnings reports, analyst notes, and news articles that don’t fit in a single context window, analyze renewable energy stocks in Q3 2024, compare them to traditional energy, and give me a recommendation.”

What happens?

  • the model silently drops context
  • key documents are ignored
  • comparisons are incomplete or superficial
  • the final answer sounds confident but isn’t grounded

Not because the model is weak — but because the agent architecture is.

Most agents still assume:

  • one prompt
  • one context window
  • one response

That breaks down immediately once the task exceeds context limits or requires verification.


The Core Insight: Reasoning Needs Structure

In 2025, Zhang and Khattab introduced Recursive Language Models (RLMs) with a simple but powerful idea:

Don’t ask the model to reason in one pass.
Force it to reason step by step, with execution and recursion.

An RLM works like this:

  1. A root model decomposes the task into steps
  2. Each step can execute Python code
  3. When more information is needed, it spawns recursive sub-queries
  4. Sub-queries can run in parallel
  5. Every action is logged and auditable

Instead of hoping the model reasons correctly inside a single context window, you externalize the reasoning process.
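
To make that concrete, here is the loop as a minimal TypeScript sketch. Everything in it is my shorthand for the five steps above (the rootStep and subQuery callbacks, the RootStep type); it is not the kit's actual internals.

type RootStep = {
  output: string;       // what the executed Python step produced
  subQueries: string[]; // prompts to recurse on (empty when none are needed)
  final?: string;       // set once the root model can synthesize an answer
};

async function rlmLoop(
  task: string,
  rootStep: (context: string) => Promise<RootStep>, // steps 1-2: decompose + execute
  subQuery: (prompt: string) => Promise<string>,    // step 3: recursive sub-call
  maxIterations = 30,
): Promise<string> {
  const log: string[] = []; // step 5: every action is recorded
  let context = task;
  for (let i = 0; i < maxIterations; i++) {
    const step = await rootStep(context);
    log.push(`iteration ${i}: ${step.subQueries.length} sub-queries`);
    if (step.final !== undefined) return step.final;
    // step 4: sub-queries run in parallel
    const answers = await Promise.all(step.subQueries.map(subQuery));
    context += `\n${step.output}\n${answers.join("\n")}`;
  }
  throw new Error("iteration budget exhausted");
}

The point is the shape: state lives outside the model, recursion is explicit, and the loop can be bounded and logged.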


What mastra-rlm-kit Brings to Mastra

Mastra already has workflows, observability, and strong TypeScript ergonomics.

What it didn’t have was a structured reasoning layer.

mastra-rlm-kit adds that missing layer with three exports:

API                    What it’s for
createRlmTool()        Expose RLM as a callable tool
createRlmWorkflow()    Build full recursive reasoning pipelines
createRlmRunner()      Low-level, programmatic control

This isn’t a “conceptual” RLM — it’s paper-faithful and production-oriented.
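
The quick start below only demonstrates the first two, so here is a rough sketch of the runner. Treat everything in it as a guess: the options mirror the other factories, and the run() method and its result shape are hypothetical; check the repo for the real signature.

import { createRlmRunner } from "mastra-rlm-kit";

// Hypothetical usage; options mirror createRlmTool() below.
const runner = createRlmRunner({
  workspace, // your Mastra workspace, defined elsewhere
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5",
    subModelId: "openrouter/minimax/minimax-m2.5",
  },
});

// Hypothetical method name and result shape.
const result = await runner.run({ query: "Compare Q3 renewables vs. traditional energy." });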


Key Features

  • Paper-faithful RLM implementation
  • 🔁 Recursive sub-queries via llm_query() and llm_query_batched()
  • Parallel exploration with batched calls
  • 🧪 Grounded reasoning via sandboxed Python REPL
  • 📜 Deterministic artifacts: output, events, audit log, recursion tree
  • 🔌 Model-agnostic: works with any Mastra-compatible model

Every run leaves a trail you can inspect, debug, and trust.
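
As a sketch, those artifacts could be modeled like this. The field names are my assumptions; the kit's real types live in its source:

interface RlmTraceNode {
  prompt: string;           // the sub-query that was asked
  depth: number;            // recursion depth, bounded by maxDepth
  children: RlmTraceNode[]; // recursive calls it spawned
}

interface RlmArtifacts {
  output: string;                              // final answer
  events: Array<{ type: string; at: string }>; // ordered log of every action
  auditLog: string[];                          // human-readable trail
  recursionTree: RlmTraceNode;                 // who asked what, at what depth
}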


Quick Start

npm install mastra-rlm-kit @mastra/core zod

Use It as a Tool

import { createRlmTool } from "mastra-rlm-kit";

// `workspace` is your Mastra workspace instance, defined elsewhere in your app.
export const runRlmTool = createRlmTool({
  workspace,
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5", // plans and decomposes
    subModelId: "openrouter/minimax/minimax-m2.5",  // answers recursive sub-queries
    // hard budgets: root-loop iterations, total model calls, recursion depth, output size
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
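From there you can hand the tool to an agent. This sketch assumes the standard @mastra/core Agent API plus an AI SDK provider (@ai-sdk/openai here); the tool's import path is whatever you used above.

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { runRlmTool } from "./tools/rlm"; // hypothetical path to the tool above

export const analystAgent = new Agent({
  name: "analyst",
  instructions:
    "For tasks that exceed one context window, call the RLM tool instead of answering directly.",
  model: openai("gpt-4o"),
  tools: { runRlmTool },
});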

Or as a Workflow

import { createRlmWorkflow } from "mastra-rlm-kit";

// `workspace` is your Mastra workspace instance, defined elsewhere.
export const rlmWorkflow = createRlmWorkflow({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" }, // plans and decomposes
    sub: { id: "openrouter/minimax/minimax-m2.5" },  // answers recursive sub-queries
  },
  defaults: {
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
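Running it should then look something like this, assuming the result follows Mastra's usual workflow run API and the input field is named query (both assumptions; the kit defines the actual input schema):

const run = await rlmWorkflow.createRunAsync();
const result = await run.start({
  inputData: { query: "Analyze renewable energy stocks in Q3 2024 vs. traditional energy." },
});
console.log(result); // final output plus the run's audit artifacts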

Where RLMs Actually Shine

Use Case             Why RLM Helps
Long-context tasks   Breaks work across recursive calls instead of one window
Multi-hop Q&A        Each hop is a traceable sub-query
Math & logic         Python executes and verifies reasoning
Data analysis        Intermediate states are inspectable
Research synthesis   Parallel sub-queries run before synthesis

If the task exceeds a single context window or requires verification, RLMs win.


A Note on Benchmarks

mastra-rlm-kit includes strict, reproducible benchmarks — but they’re not the headline feature.

All benchmark runs:

  • use datasets as-is (no rewritten questions or labels)
  • run the RLM loop without prompt tuning
  • score outputs using official exact-match metrics

Current Results (OolongBench)

On a recent OolongBench validation slice:

  • Accuracy: 20% (exact match)
  • Completion rate: 100%
  • Avg sub-queries: ~8 per task

Many failures are near-misses (off-by-one values, partial lists, non-canonical names), which are not counted as correct by design.

Why This Is Still Useful

These results aren’t about leaderboard performance.

They show that RLMs:

  • execute multi-step reasoning reliably
  • fail deterministically (no silent hallucinations)
  • produce full traces you can inspect and improve

Full benchmark commands and reports live in the repo.


How It Works Internally

  1. Root model receives the task
  2. It writes Python REPL steps
  3. Steps execute and store intermediate results
  4. Missing info → spawn llm_query() sub-queries
  5. Sub-queries batch and parallelize
  6. Results aggregate into a final synthesis
  7. Full trace is persisted

Every claim is either:

  • executed code, or
  • traceable recursive output

That’s how hallucinations die.
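
For a feel of what that trail looks like, here is a hypothetical audit-log excerpt (the kit defines the real format):

const auditLog = [
  "[root]        plan: split corpus into 12 chunks",
  "[repl]        exec: totals = [parse_revenue(c) for c in chunks]",
  "[sub depth=1] llm_query: 'Extract Q3 revenue from chunk 7'",
  "[root]        synthesize: recommendation drafted from verified totals",
];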


Why Mastra Was the Right Fit

Mastra already gets the fundamentals right:

  • TypeScript-first
  • Built-in observability
  • Clean workflow primitives
  • Model-agnostic via Vercel AI SDK

RLMs don’t replace Mastra — they complete it.


Final Thought

The gap between agents that talk and agents that think is still massive.

Most demos fall apart the moment you ask for:

  • long-context reasoning
  • verification
  • decomposition
  • accountability

mastra-rlm-kit doesn’t add magic.
It adds structure, execution, and transparency.

Try it. Break it. Improve it.
And tell me what you build.

— Built by @metasurfero
