DEV Community

BN


I built a token-level debugger for comparing two LLMs

Same prompt, two models, different outputs. None of the tooling I tried showed me where they actually diverged.
So I built tokenflame. It gives you entropy heatmaps, tokenizer diffs, divergence markers, and token-by-token replay. One command, one HTML file.
pip install tokenflame
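
To make two of those ideas concrete, here is a minimal sketch in plain Python of what "entropy" and "divergence marker" can mean at the token level. This is not tokenflame's code or API: the function names `token_entropy` and `first_divergence` are hypothetical, and the sketch assumes you already have each model's next-token probabilities and its output as a list of token strings.

```python
import math
from itertools import zip_longest


def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution.

    Hypothetical helper for illustration; `probs` is assumed to sum to 1.
    """
    # Zero-probability entries contribute nothing and would break log2.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    prefix of the other, the index is the length of the shorter one.
    """
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


if __name__ == "__main__":
    model_a = ["The", " cat", " sat"]
    model_b = ["The", " dog", " sat"]
    print(first_divergence(model_a, model_b))
    print(token_entropy([0.5, 0.5]))
```

Low entropy at a position means the model was confident about that token. High entropy marks a spot where the output could easily have gone another way, which is often where two models start to disagree.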

Top comments (1)

Harjot Singh

a token-level diff between two LLMs is genuinely useful tooling, model selection is mostly vibes otherwise. that kind of visibility is what makes routing decisions in Moonshift defensible: agents build + deploy + market a SaaS overnight, and picking the right model per step matters for cost + quality. nice tool. first run's free if you ever want a real workload to test it against.