Alvaro Fragoso

Building an RLM with Mastra: Introducing mastra-rlm-kit

TL;DR
I just open-sourced mastra-rlm-kit, a paper-faithful implementation of Recursive Language Models (RLMs) for Mastra.

It lets agents:

  • break complex tasks into executable Python steps
  • spawn recursive and batched sub-queries
  • ground reasoning in code instead of vibes
  • produce full, inspectable audit trails

This isn’t a prompt trick. It’s an architecture.

👉 GitHub: https://github.com/alvarofc/mastra-rlm
👉 npm: npm install mastra-rlm-kit


The Problem: Agents Are Still Bad at Thinking

If you’ve built agents with Mastra (or LangGraph, CrewAI, AutoGen…), you’ve probably tried something like:

“Given earnings reports, analyst notes, and news articles that don’t fit in a single context window, analyze renewable energy stocks in Q3 2024, compare them to traditional energy, and give me a recommendation.”

What happens?

  • the model silently drops context
  • key documents are ignored
  • comparisons are incomplete or superficial
  • the final answer sounds confident but isn’t grounded

Not because the model is weak — but because the agent architecture is.

Most agents still assume:

  • one prompt
  • one context window
  • one response

That breaks down immediately once the task exceeds context limits or requires verification.


The Core Insight: Reasoning Needs Structure

In 2025, Zhang and Khattab introduced Recursive Language Models (RLMs) with a simple but powerful idea:

Don’t ask the model to reason in one pass.
Force it to reason step by step, with execution and recursion.

An RLM works like this:

  1. A root model decomposes the task into steps
  2. Each step can execute Python code
  3. When more information is needed, it spawns recursive sub-queries
  4. Sub-queries can run in parallel
  5. Every action is logged and auditable

Instead of hoping the model reasons correctly inside a single context window, you externalize the reasoning process.
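
To make that concrete, here is the loop as a minimal TypeScript sketch. Everything in it is my shorthand for the five steps above (the rootStep and subQuery callbacks, the RootStep type); it is not the kit's actual internals.

type RootStep = {
  output: string;       // what the executed Python step produced
  subQueries: string[]; // prompts to recurse on (empty when none are needed)
  final?: string;       // set once the root model can synthesize an answer
};

async function rlmLoop(
  task: string,
  rootStep: (context: string) => Promise<RootStep>, // steps 1-2: decompose + execute
  subQuery: (prompt: string) => Promise<string>,    // step 3: recursive sub-call
  maxIterations = 30,
): Promise<string> {
  const log: string[] = []; // step 5: every action is recorded
  let context = task;
  for (let i = 0; i < maxIterations; i++) {
    const step = await rootStep(context);
    log.push(`iteration ${i}: ${step.subQueries.length} sub-queries`);
    if (step.final !== undefined) return step.final;
    // step 4: sub-queries run in parallel
    const answers = await Promise.all(step.subQueries.map(subQuery));
    context += `\n${step.output}\n${answers.join("\n")}`;
  }
  throw new Error("iteration budget exhausted");
}

The point is the shape: state lives outside the model, recursion is explicit, and the loop can be bounded and logged.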


What mastra-rlm-kit Brings to Mastra

Mastra already has workflows, observability, and strong TypeScript ergonomics.

What it didn’t have was a structured reasoning layer.

mastra-rlm-kit adds that missing layer with three exports:

API                    What it’s for
createRlmTool()        Expose RLM as a callable tool
createRlmWorkflow()    Build full recursive reasoning pipelines
createRlmRunner()      Low-level, programmatic control

This isn’t a “conceptual” RLM — it’s paper-faithful and production-oriented.
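
The quick start below only demonstrates the first two, so here is a rough sketch of the runner. Treat everything in it as a guess: the options mirror the other factories, and the run() method and its result shape are hypothetical; check the repo for the real signature.

import { createRlmRunner } from "mastra-rlm-kit";

// Hypothetical usage; options mirror createRlmTool() below.
const runner = createRlmRunner({
  workspace, // your Mastra workspace, defined elsewhere
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5",
    subModelId: "openrouter/minimax/minimax-m2.5",
  },
});

// Hypothetical method name and result shape.
const result = await runner.run({ query: "Compare Q3 renewables vs. traditional energy." });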


Key Features

  • Paper-faithful RLM implementation
  • 🔁 Recursive sub-queries via llm_query() and llm_query_batched()
  • Parallel exploration with batched calls
  • 🧪 Grounded reasoning via sandboxed Python REPL
  • 📜 Deterministic artifacts: output, events, audit log, recursion tree
  • 🔌 Model-agnostic: works with any Mastra-compatible model

Every run leaves a trail you can inspect, debug, and trust.
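
As a sketch, those artifacts could be modeled like this. The field names are my assumptions; the kit's real types live in its source:

interface RlmTraceNode {
  prompt: string;           // the sub-query that was asked
  depth: number;            // recursion depth, bounded by maxDepth
  children: RlmTraceNode[]; // recursive calls it spawned
}

interface RlmArtifacts {
  output: string;                              // final answer
  events: Array<{ type: string; at: string }>; // ordered log of every action
  auditLog: string[];                          // human-readable trail
  recursionTree: RlmTraceNode;                 // who asked what, at what depth
}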


Quick Start

npm install mastra-rlm-kit @mastra/core zod

Use It as a Tool

import { createRlmTool } from "mastra-rlm-kit";

// `workspace` is your Mastra workspace instance, defined elsewhere in your app.
export const runRlmTool = createRlmTool({
  workspace,
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5", // plans and decomposes
    subModelId: "openrouter/minimax/minimax-m2.5",  // answers recursive sub-queries
    // hard budgets: root-loop iterations, total model calls, recursion depth, output size
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
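From there you can hand the tool to an agent. This sketch assumes the standard @mastra/core Agent API plus an AI SDK provider (@ai-sdk/openai here); the tool's import path is whatever you used above.

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { runRlmTool } from "./tools/rlm"; // hypothetical path to the tool above

export const analystAgent = new Agent({
  name: "analyst",
  instructions:
    "For tasks that exceed one context window, call the RLM tool instead of answering directly.",
  model: openai("gpt-4o"),
  tools: { runRlmTool },
});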

Or as a Workflow

import { createRlmWorkflow } from "mastra-rlm-kit";

// `workspace` is your Mastra workspace instance, defined elsewhere.
export const rlmWorkflow = createRlmWorkflow({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" }, // plans and decomposes
    sub: { id: "openrouter/minimax/minimax-m2.5" },  // answers recursive sub-queries
  },
  defaults: {
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10000,
    },
  },
});
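Running it should then look something like this, assuming the result follows Mastra's usual workflow run API and the input field is named query (both assumptions; the kit defines the actual input schema):

const run = await rlmWorkflow.createRunAsync();
const result = await run.start({
  inputData: { query: "Analyze renewable energy stocks in Q3 2024 vs. traditional energy." },
});
console.log(result); // final output plus the run's audit artifacts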

Where RLMs Actually Shine

Use Case             Why RLM Helps
Long-context tasks   Breaks work across recursive calls instead of one window
Multi-hop Q&A        Each hop is a traceable sub-query
Math & logic         Python executes and verifies reasoning
Data analysis        Intermediate states are inspectable
Research synthesis   Parallel sub-queries run before synthesis

If the task exceeds a single context window or requires verification, RLMs win.


A Note on Benchmarks

mastra-rlm-kit includes strict, reproducible benchmarks — but they’re not the headline feature.

All benchmark runs:

  • use datasets as-is (no rewritten questions or labels)
  • run the RLM loop without prompt tuning
  • score outputs using official exact-match metrics

Current Results (OolongBench)

On a recent OolongBench validation slice:

  • Accuracy: 20% (exact match)
  • Completion rate: 100%
  • Avg sub-queries: ~8 per task

Many failures are near-misses (off-by-one values, partial lists, non-canonical names), which are not counted as correct by design.

Why This Is Still Useful

These results aren’t about leaderboard performance.

They show that RLMs:

  • execute multi-step reasoning reliably
  • fail deterministically (no silent hallucinations)
  • produce full traces you can inspect and improve

Full benchmark commands and reports live in the repo.


How It Works Internally

  1. Root model receives the task
  2. It writes Python REPL steps
  3. Steps execute and store intermediate results
  4. Missing info → spawn llm_query() sub-queries
  5. Sub-queries batch and parallelize
  6. Results aggregate into a final synthesis
  7. Full trace is persisted

Every claim is either:

  • executed code, or
  • traceable recursive output

That’s how hallucinations die.
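
For a feel of what that trail looks like, here is a hypothetical audit-log excerpt (the kit defines the real format):

const auditLog = [
  "[root]        plan: split corpus into 12 chunks",
  "[repl]        exec: totals = [parse_revenue(c) for c in chunks]",
  "[sub depth=1] llm_query: 'Extract Q3 revenue from chunk 7'",
  "[root]        synthesize: recommendation drafted from verified totals",
];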


Why Mastra Was the Right Fit

Mastra already gets the fundamentals right:

  • TypeScript-first
  • Built-in observability
  • Clean workflow primitives
  • Model-agnostic via Vercel AI SDK

RLMs don’t replace Mastra — they complete it.


Final Thought

The gap between agents that talk and agents that think is still massive.

Most demos fall apart the moment you ask for:

  • long-context reasoning
  • verification
  • decomposition
  • accountability

mastra-rlm-kit doesn’t add magic.
It adds structure, execution, and transparency.

Try it. Break it. Improve it.
And tell me what you build.

— Built by @metasurfero
