Posted on May 26

I built a token-level debugger for comparing two LLMs

#llm #mlops #rag

Same prompt, two models, different outputs. No tooling was actually showing me where they diverged.
So I built tokenflame, which gives you entropy heatmaps, tokenizer diffs, divergence markers, and token-by-token replay. One command, one HTML file.
pip install tokenflame
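
If "entropy" and "divergence marker" are new terms, here is a minimal sketch of the two ideas in plain Python. This is not tokenflame's code or API. The helper names `token_entropy` and `first_divergence` are hypothetical, and the sketch assumes you already have each model's next-token probabilities and output tokens from somewhere else.

```python
import math
from itertools import zip_longest


def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution.

    High entropy means the model was unsure at this step; near zero means
    it was close to certain. Zero-probability entries are skipped.
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)


def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    prefix of the other, the divergence is where the shorter one ends.
    """
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


# Illustrative data only, not real model output.
print(token_entropy([0.5, 0.5]))  # 1.0 (a coin flip is one bit)
print(first_divergence(["The", "cat", "sat"], ["The", "cat", "slept"]))  # 2
```

An entropy heatmap is roughly the first function applied at every generation step and mapped to a color scale. A divergence marker is roughly the second, applied to the two models' outputs for the same prompt.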

Top comments (1)

Harjot Singh • Jun 1

a token-level diff between two LLMs is genuinely useful tooling, model selection is mostly vibes otherwise. that kind of visibility is what makes routing decisions in Moonshift defensible: agents build + deploy + market a SaaS overnight, and picking the right model per step matters for cost + quality. nice tool. first run's free if you ever want a real workload to test it against.