Oleh Kem for ComparEdge

Posted on May 28 • Edited on May 30

I Built a Tool to Stop Guessing LLM API Costs. Here Is What I Learned.

#llm #api #webdev #programming

You know that moment when you check your API dashboard and the number has an extra digit you were not expecting? That is where this project started.

We were comparing models for a production pipeline, nothing exotic, just document processing, and realized we had no reliable way to answer a basic question: which model actually costs less for our workload?

So we built one: LLM Calculator. Here is what the build taught us.

The Math Problem Nobody Talks About

LLM pricing looks simple until you try to calculate it for real.

First, input and output tokens have different prices. Most models charge 2x to 5x more for output than for input. A summarization task (lots of input, little output) and a code generation task (little input, lots of output) can have wildly different costs on the same model. The "cheapest model" depends entirely on what you are doing with it.
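A minimal sketch of that arithmetic, assuming per-million-token pricing. The two models and their prices are made up for illustration and do not come from any provider's price sheet; `ModelPrice`, `request_cost`, and `cheapest` are hypothetical names, not part of the calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    input_per_mtok: float   # USD per 1M input tokens
    output_per_mtok: float  # USD per 1M output tokens


def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, pricing input and output tokens separately."""
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


def cheapest(models, input_tokens: int, output_tokens: int) -> ModelPrice:
    """Return the model with the lowest cost for this token shape."""
    return min(models, key=lambda m: request_cost(m, input_tokens, output_tokens))


# Hypothetical prices, chosen only to show the ranking flip.
MODEL_A = ModelPrice("model-a", input_per_mtok=1.00, output_per_mtok=2.00)
MODEL_B = ModelPrice("model-b", input_per_mtok=0.50, output_per_mtok=4.00)

# Summarization shape: lots of input, little output.
summarization = cheapest([MODEL_A, MODEL_B], input_tokens=10_000, output_tokens=500)
# Code generation shape: little input, lots of output.
code_generation = cheapest([MODEL_A, MODEL_B], input_tokens=500, output_tokens=10_000)
print(summarization.name, code_generation.name)
```

With these made-up numbers, model-b wins the summarization shape and model-a wins the code generation shape, even though neither price changed.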

Then there is batch pricing. OpenAI's Batch API is priced at 50% off standard rates. If your workload can tolerate asynchronous processing, that reshuffles the entire ranking. Same story with cached pricing: Anthropic's prompt caching can cut the cost of cached input tokens by up to 90% on repeated prefixes. Are you factoring that in? Most people are not.
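Here is a rough sketch of how those two discounts could be folded into a cost estimate. It rests on several simplifying assumptions: the batch discount applies to the whole bill, cached input tokens are billed at a flat discount, cache-write surcharges are ignored, and the two discounts stack. Real providers differ on all of these, so check the current price sheet before relying on it. The function name and parameters are hypothetical.

```python
def effective_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_mtok: float,
    output_per_mtok: float,
    *,
    batch: bool = False,
    cached_fraction: float = 0.0,
    batch_discount: float = 0.5,   # assumption: 50% off the whole request
    cache_discount: float = 0.9,   # assumption: 90% off cached input tokens
) -> float:
    """Estimated USD cost after optional batch and cache discounts."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")

    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached

    # Cached input tokens pay only the undiscounted remainder of the input price.
    input_cost = (fresh + cached * (1.0 - cache_discount)) * input_per_mtok
    output_cost = output_tokens * output_per_mtok
    total = (input_cost + output_cost) / 1_000_000

    if batch:
        total *= 1.0 - batch_discount
    return total


# Hypothetical prices: $2 per 1M input tokens, $8 per 1M output tokens.
baseline = effective_cost(1_000_000, 100_000, 2.0, 8.0)
batched = effective_cost(1_000_000, 100_000, 2.0, 8.0, batch=True)
cached = effective_cost(1_000_000, 100_000, 2.0, 8.0, cached_fraction=0.8)
print(baseline, batched, cached)
```

Using a `cached_fraction` rather than an on/off flag is a design choice here: real workloads rarely hit the cache on every input token.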

Now multiply this across 16 providers and 110+ models: OpenAI, Anthropic, Google, DeepSeek, Groq, Mistral, Meta, Cohere, Together, Perplexity, xAI, Fireworks, Replicate, AI21, Cloudflare, Amazon Bedrock. Prices change constantly. Your mental model of "GPT-4o costs about X" is probably already outdated.

What We Built

A free LLM token cost calculator at comparedge.com/llm-calculator, part of ComparEdge (independent, no vendor affiliations).

Feature tour, dev-to-dev:

Input/output ratio slider. Drag to match your workload profile. Rankings reshuffle in real time. This single feature changed more model decisions than anything else in our testing.

Batch and cache toggles. One click each. Toggle batch pricing for async-tolerant workloads, cached pricing for repeated-prefix scenarios. The cost landscape changes dramatically.

Stack and Compare mode. Pick up to 5 models, see them side-by-side with pricing, context windows, and cost per million tokens for your specific ratio. The "final boss" view for making a decision.

Budget filter. Set a ceiling. Everything over it disappears. Useful when you need to narrow 110 options fast.

10 export formats. PDF and CSV, sure. But also: LiteLLM JSON (for proxy configs), OpenRouter JSON, Python Dict, .env Snippet, Cursor Rules, Markdown, HTML, Plain Text. The output should drop into your actual workflow.
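For readers who want the idea behind the ratio slider in code, here is one way it could be modeled. This is our own illustration under stated assumptions, not the calculator's actual implementation. It treats the slider as an output share `r` between 0 and 1, so the blended price per million tokens is `(1 - r) * input_price + r * output_price`, and it solves for the share where two models cost the same. All function names are hypothetical.

```python
from typing import Optional


def blended_price(input_per_mtok: float, output_per_mtok: float, output_share: float) -> float:
    """USD per 1M tokens when `output_share` of all tokens are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1.0 - output_share) * input_per_mtok + output_share * output_per_mtok


def break_even_output_share(
    a_in: float, a_out: float, b_in: float, b_out: float
) -> Optional[float]:
    """Output share at which models A and B have the same blended price.

    Solves (1 - r) * a_in + r * a_out == (1 - r) * b_in + r * b_out for r.
    Returns None when the ranking never flips inside the 0..1 range.
    """
    denominator = (a_out - a_in) - (b_out - b_in)
    if denominator == 0:
        return None  # same slope: one model is cheaper (or tied) at every ratio
    share = (b_in - a_in) / denominator
    return share if 0.0 <= share <= 1.0 else None


# Hypothetical prices: A is $1 in / $2 out, B is $0.50 in / $4 out.
flip_point = break_even_output_share(1.0, 2.0, 0.5, 4.0)
print(flip_point)
```

With these made-up prices the two models tie when about 20% of tokens are output; below that share B is cheaper, above it A is.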

What We Learned Building This

Pricing data is a moving target. We thought the hard part would be the UI. It was not. It was keeping pricing accurate across 16 providers who update at different times, in different formats, with different definitions of what a token even means. Maintenance is the real product.

"Cheapest" is the wrong question. The right question is: cheapest for my specific input/output ratio, with or without batch/cache, within my context window requirements. That is a much harder question, but it is the one that actually saves money.

People do not want more data; they want fewer options. Early versions showed everything. Users were overwhelmed. The budget filter and compare mode exist because people need to go from 110 models to 3 candidates fast.
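The narrowing step described here can be sketched as a filter followed by a sort. The catalog below is entirely hypothetical (names, prices, and context windows are invented), and `shortlist` is an illustrative helper rather than anything the calculator exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_per_mtok: float   # USD per 1M input tokens
    output_per_mtok: float  # USD per 1M output tokens
    context_window: int     # maximum tokens per request


def shortlist(models, *, output_share, min_context, max_blended_price, limit=3):
    """Drop models that miss the context or budget requirement, then rank the rest."""

    def blended(m: Model) -> float:
        return (1.0 - output_share) * m.input_per_mtok + output_share * m.output_per_mtok

    candidates = [
        m for m in models
        if m.context_window >= min_context and blended(m) <= max_blended_price
    ]
    return sorted(candidates, key=blended)[:limit]


# Hypothetical catalog, for illustration only.
CATALOG = [
    Model("small", 0.20, 0.80, 32_000),
    Model("mid", 1.00, 2.00, 128_000),
    Model("large", 3.00, 15.00, 200_000),
    Model("long-context", 0.50, 4.00, 1_000_000),
]

picks = shortlist(CATALOG, output_share=0.1, min_context=100_000, max_blended_price=2.0)
print([m.name for m in picks])
```

Note that the cheapest model on paper ("small") never appears: it fails the context window requirement before price is even considered.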

The Forecasting Problem

Here is what we have not solved yet: predicting future costs.

We are working on a forecasting mode that combines a growth multiplier, agent overhead, and a Pareto concentration factor. The agent overhead part is the tricky bit. Agentic workflows multiply token consumption in ways that are hard to model because the agent decides how many calls to make.

We do not want to ship a forecasting tool that is just "multiply current cost by a number you pick." That is a spreadsheet. We want something that accounts for how LLM usage actually scales. Still in progress.
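To show why agent overhead resists a single multiplier, here is a toy model. It is not the forecasting mode described above. It assumes each agent step re-sends a fixed base prompt plus every earlier step's output, and it ignores tool results, caching, and retries. The function name and parameters are hypothetical.

```python
def agent_token_usage(steps: int, base_prompt_tokens: int, output_tokens_per_step: int):
    """Total (input, output) tokens for an agent loop that re-sends its history.

    Assumption: call k sends the base prompt plus the outputs of calls 1..k-1.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")

    total_input = 0
    history = 0  # tokens of earlier outputs carried into the next call
    for _ in range(steps):
        total_input += base_prompt_tokens + history
        history += output_tokens_per_step

    total_output = steps * output_tokens_per_step
    return total_input, total_output


four_steps = agent_token_usage(4, base_prompt_tokens=1_000, output_tokens_per_step=200)
eight_steps = agent_token_usage(8, base_prompt_tokens=1_000, output_tokens_per_step=200)
print(four_steps, eight_steps)
```

Under this assumption output tokens grow linearly with the number of steps, but input tokens grow faster than linearly. Doubling the step count more than doubles the input bill, and the step count itself is chosen by the agent at run time.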

Try It

Free at comparedge.com/llm-calculator. PDF export works without an account. If you use it and have feedback, especially on which export formats are missing or what the compare mode gets wrong, I would genuinely like to hear it in the comments.

Top comments (3)

Harjot Singh • May 31

The extra-digit-on-the-dashboard moment is exactly how every team discovers they were guessing, and the deeper lesson in your build is that there's no single cheapest model, only a cheapest model for your token shape. Input/output asymmetry is the trap nobody prices in: a summarization workload (huge input, tiny output) and a generation workload (tiny input, huge output) invert which model wins, because output tokens are often 3-5x the input price. So a calculator that takes YOUR actual input:output ratio is far more honest than a generic per-million headline, which is what most people compare on and get burned. The natural next step from a calculator is routing: once you can compute cost-per-task accurately, you stop picking one model and start sending each workload to the one that's cheapest for its specific shape while clearing the quality bar. The calculator answers which is cheaper; routing acts on it automatically. That cost-tracks-the-workload thinking is core to how I approach spend in Moonshift. Did your numbers surface a case where the obvious cheap model lost once you accounted for it being more verbose (more output tokens) on your task?

Xidao • Jun 2

The input/output ratio slider is exactly the right mental model. Teams often compare provider price sheets as if one ranking will hold across workloads, but the winner flips fast once you separate extraction, summarization, classification, and tool-heavy agent loops.

One cost bucket that still surprises people in production is everything around retries and discarded work: schema-validation failures, tool-call retries, safety re-prompts, and partial generations that never reach the user. Those tokens usually do not show up in back-of-the-envelope estimates, but they can dominate the delta between a spreadsheet forecast and the actual bill. In agent systems I have seen the control-loop overhead matter almost as much as the "successful" request path.

If you ever extend the calculator, a useful dimension might be an effective-overhead multiplier for orchestration waste plus a cache hit-rate assumption instead of a simple cache on/off switch. That gets a lot closer to the messy real-world economics than list prices alone.
