Jasmin Virdi

Posted on May 26 • Edited on May 30

An LLM API call, in 4 GIFs

#llm #javascript #ai #beginners

Statelessness and cost-saving tips

This is the first post in the Building TinyAgent series, where we build a small agent from scratch in Node.js with no frameworks, just API calls.

But before we write an agent, we need to understand what actually happens when you call an LLM. If you've only ever used an SDK, you've probably never seen the raw request or how it works. Six lines of code, an API key, and it just works, but you have no idea what happened between the request being dispatched and the response being printed on the screen.

1. The request

Here is the sample API call with each and every section explained in detail.

A few things worth noticing in the API call.

The API is stateless: no call remembers the context of a previous call. If you want a chatbot that "remembers" earlier messages, you hold the messages array and resend the whole thing every time.

max_tokens is a hard stop, not a target. If you hit the limit, the response stops mid-sentence.

The API call pattern is universal. Another provider may use a different URL, Authorization: Bearer instead of x-api-key, and a system prompt that lives inside messages rather than at the top level. But it's the same POST, the same JSON, the same {model, messages, max_tokens}. Once you understand the shape, switching providers is mostly a find-and-replace.
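To make the shape concrete, here is a rough sketch of the same logical request built in both styles. I'm assuming the x-api-key style is the Anthropic Messages API and the Bearer style is OpenAI's Chat Completions API. The function names, the placeholder key, and the model ID are made up for illustration, and parameter names can differ by provider and model, so check the docs before relying on it.

```javascript
// Sketch only: URLs and headers reflect my understanding of the two public APIs.
// "your-model-id" and "sk-placeholder" are placeholders, not real values.

function xApiKeyStyle({ apiKey, model, system, messages, maxTokens }) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    // The system prompt sits at the top level of the body.
    body: JSON.stringify({ model, max_tokens: maxTokens, system, messages }),
  };
}

function bearerStyle({ apiKey, model, system, messages, maxTokens }) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "content-type": "application/json",
    },
    // The system prompt travels as the first entry in messages.
    body: JSON.stringify({
      model,
      max_tokens: maxTokens,
      messages: [{ role: "system", content: system }, ...messages],
    }),
  };
}

const input = {
  apiKey: "sk-placeholder",
  model: "your-model-id",
  system: "You are terse.",
  messages: [{ role: "user", content: "Hi" }],
  maxTokens: 100,
};

console.log(xApiKeyStyle(input));
console.log(bearerStyle(input));
```

Either object can be handed to fetch as fetch(req.url, req). Nothing is sent here, so you can inspect the two requests side by side without an API key.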

2. The response

The API answers with a JSON blob. There are ~10 fields in it, but only four actually matter:

The one most often skipped is stop_reason.

It tells you why the model stopped. In real systems there are several possible reasons:

end_turn      → finished naturally, you're done
max_tokens    → hit your ceiling, response is truncated
tool_use      → model wants to call a tool (next post!)
stop_sequence → matched one of your stop strings

If you only check the text and ignore stop_reason, you will ship a bug at some point. The response looks fine right up until it doesn't.
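Here is a minimal sketch of branching on stop_reason instead of only reading the text. It assumes the response body has a content array of blocks with type and text fields, which is how the Anthropic Messages API returns text as far as I know. The function name and the returned shape are my own invention.

```javascript
// Sketch: turn a raw response into something the caller cannot misread.
// Assumes response.content is an array like [{ type: "text", text: "..." }].
function interpretResponse(response) {
  const text = response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");

  switch (response.stop_reason) {
    case "end_turn":
    case "stop_sequence":
      return { text, complete: true, next: "done" };
    case "max_tokens":
      // The text is cut off; the caller should retry with a higher limit or warn.
      return { text, complete: false, next: "truncated" };
    case "tool_use":
      // The model is asking for a tool call, not giving a final answer.
      return { text, complete: false, next: "run_tool" };
    default:
      // Unknown reasons are treated as incomplete rather than silently accepted.
      return { text, complete: false, next: "unknown" };
  }
}

const finished = { stop_reason: "end_turn", content: [{ type: "text", text: "Hello!" }] };
const cutOff = { stop_reason: "max_tokens", content: [{ type: "text", text: "Once upon a" }] };

console.log(interpretResponse(finished));
console.log(interpretResponse(cutOff));
```

The default branch is deliberate: a stop reason you did not plan for should fail loudly, not pass as a good answer.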

The other field worth burning in: usage. It shows you how many tokens went in and how many came out. You want these numbers in your logs from day one, not after you get a surprise bill. 🤯
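A small sketch of what "usage in your logs" could look like: one structured line per call. It assumes the response carries usage.input_tokens and usage.output_tokens. The function name and the feature label are hypothetical, so adapt them to whatever logger you already use.

```javascript
// Sketch: build one JSON log line per API call.
// `feature` is a hypothetical label so costs can be grouped later.
function usageLogLine(response, feature) {
  const { input_tokens, output_tokens } = response.usage;
  return JSON.stringify({
    ts: new Date().toISOString(),
    feature,
    model: response.model,
    stop_reason: response.stop_reason,
    input_tokens,
    output_tokens,
  });
}

// A hand-written sample response, not real API output.
const sample = {
  model: "your-model-id",
  stop_reason: "end_turn",
  usage: { input_tokens: 24, output_tokens: 9 },
};

console.log(usageLogLine(sample, "demo"));
```

Logging stop_reason next to the token counts means truncated responses show up in the same place as the cost data.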

3. Tokens

I keep saying "24 input tokens." Here's what that means:

Things that surprise people and are worth noting:

Words don't equal tokens. "Unbelievable" is one word but four tokens. The tokenizer splits on common substrings, not spaces, and the exact counts vary from one tokenizer to another.

Code costs more than it looks. def add(a, b): is 8 tokens. Brackets and commas are often their own tokens.

JSON is expensive. {"a":1} is 7 tokens. If your tool schemas are bloated, they're quietly eating into your budget on every single request.

Non-English costs more. Japanese, Hindi, and Arabic tend to run 2–4× the token count of the same content in English. If you're building for a global audience, this changes your cost math a lot.

Rule of thumb for English prose: ~1 token ≈ 4 characters ≈ 0.75 words. For everything else, run it through the tokenizer yourself before assuming.
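The rule of thumb fits in a few lines. This is only the 4-characters-per-token estimate for English prose, not a real tokenizer, and the function name is my own.

```javascript
// Sketch: rough token estimate for English prose (about 4 characters per token).
// Expect it to undercount for code, JSON, and non-English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("The quick brown fox jumps over the lazy dog."));
```

It's good enough for a quick budget check before a call. For anything you plan to bill against, compare it with the usage numbers the API returns.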

4. The bill

Two meters run on every call, and they are priced differently.

Output tokens cost roughly 3–5× more than input tokens. That's the one number to internalize about LLM pricing.

cost = (input_tokens  / 1,000,000) × input_price
     + (output_tokens / 1,000,000) × output_price
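The same formula as a function. The prices below are hypothetical numbers in dollars per million tokens, picked only to show a 5× gap between input and output. Plug in your provider's real price list.

```javascript
// Sketch: cost of one call, following the formula above.
// PRICES is hypothetical; real prices depend on provider and model.
const PRICES = { inputPerMTok: 3, outputPerMTok: 15 };

function estimateCost(inputTokens, outputTokens, prices = PRICES) {
  return (
    (inputTokens / 1_000_000) * prices.inputPerMTok +
    (outputTokens / 1_000_000) * prices.outputPerMTok
  );
}

// 24 input tokens and 100 output tokens at the hypothetical prices.
console.log(estimateCost(24, 100).toFixed(6)); // prints 0.001572
```

Feed it the input_tokens and output_tokens from the usage field and you have a per-call cost you can sum by day or by feature.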

Three things that follow from the asymmetry:

Long prompts are cheap. Long responses are expensive. Stuffing 50 KB of context into a system prompt is fine. Asking for 50 KB of output is roughly 5× more expensive.
"Thinking" tokens count as output. Reasoning models bill their internal thought at the output rate, even though you don't see it.
Tool schemas eat input on every call. They get resent with every request, just like the system prompt.

At $0.006 per call, 100k calls a day is $600 a day, or roughly $18,000 a month, from one small feature. Add usage logging now, not when you get the alert. 🚨

5. The whole thing in 20 lines

Here is the complete code of the API call we have discussed above:

Jasmin2895 / TinyAgent

No dependencies and no install step: it is just a Node file and an API key.
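For reference, a call like the one discussed above can be sketched roughly as follows. This is not necessarily the code in the repository. It assumes the Anthropic Messages API, Node 18+ for the built-in fetch, an ANTHROPIC_API_KEY environment variable, and a placeholder model ID. The fetchImpl parameter is my own addition so the function can be exercised without a network.

```javascript
// Sketch only. MODEL is a placeholder; replace it with a real model ID.
const MODEL = "your-model-id";

async function ask(prompt, apiKey, fetchImpl = fetch) {
  const res = await fetchImpl("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: MODEL,
      max_tokens: 256,
      messages: [{ role: "user", content: prompt }],
    }),
  });

  if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);

  const data = await res.json();
  return {
    // Join the text blocks; other block types are ignored here.
    text: data.content.filter((b) => b.type === "text").map((b) => b.text).join(""),
    stopReason: data.stop_reason,
    usage: data.usage,
  };
}

// Only touch the network when a key is actually set.
if (process.env.ANTHROPIC_API_KEY) {
  ask("Say hello in five words.", process.env.ANTHROPIC_API_KEY)
    .then(console.log)
    .catch(console.error);
}
```

Returning stopReason and usage alongside the text keeps the two easily ignored fields in front of whoever calls this.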

Three things to try before the next post

Run it and watch the numbers. Make ten calls, change the prompt length, and see how usage moves. You'll build a real instinct for cost faster this way than by reading any doc.
Set max_tokens: 20 and ask for something long. Watch it cut off. Check stop_reason. This is a bug you'll hit in production eventually, so it's better to meet it on purpose right now.
Build a multi-turn chat by hand. Keep a messages array, push each user message and each model reply onto it, and resend the whole thing every turn. Once you do this, you'll immediately understand why long conversations get expensive: you're paying for the full history on every call.
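If you want a starting point for the third exercise, here is a sketch of the bookkeeping. The send argument is a hypothetical function that takes the messages array and resolves to { text, usage }. The fakeSend below is a stand-in that "bills" input by the rough size of everything sent, so you can watch the growth without an API key. Swap in a real API call when you try it.

```javascript
// Sketch: one conversation turn. The whole history goes out every time.
async function chatTurn(history, userText, send) {
  history.push({ role: "user", content: userText });
  const reply = await send(history);
  history.push({ role: "assistant", content: reply.text });
  return reply;
}

// Stand-in for a real call: estimates input tokens at about 4 characters each.
async function fakeSend(messages) {
  const chars = messages.reduce((total, m) => total + m.content.length, 0);
  return { text: "ok", usage: { input_tokens: Math.ceil(chars / 4), output_tokens: 1 } };
}

async function demo() {
  const history = [];
  const inputTokensPerTurn = [];
  for (const question of ["First question?", "Second question?", "Third question?"]) {
    const reply = await chatTurn(history, question, fakeSend);
    inputTokensPerTurn.push(reply.usage.input_tokens);
  }
  return { history, inputTokensPerTurn };
}

// The input token count climbs every turn because the history is resent.
demo().then(({ inputTokensPerTurn }) => console.log(inputTokensPerTurn));
```

The model never sees anything you did not put in history, which is the statelessness from section 1 in practice.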

What's next

In the upcoming posts in this series, we will expand TinyAgent to handle a lot more than just responding.

Happy Coding! 👩‍💻

Top comments (61)

Sujala Vasanthasena Nelavai • Jun 3

Nice illustration

Jasmin Virdi • Jun 3

Thank you!

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • May 26

Very great illustration! I am a visual learner and this helped a lot! Good work :D

Jasmin Virdi • May 27

Thanks @francistrdev
I am a visual learner too. 🙋‍♀️😄

S M Tahosin • Jun 1

Great post, Jasmin! The GIFs make the stateless nature and messages array flow super clear.
Really liked your point on stop_reason — it's crucial for catching incomplete responses. Also the token cost breakdown is spot on; that output price difference completely changes how you design agents.
Looking forward to the next parts of the series. Keep it up! 👏

Jasmin Virdi • Jun 1

Thanks @tahosin

Glad you liked it. Thanks for sharing your feedback. 🙂

Ofri Peretz • May 30

Love that this starts from the raw call instead of an SDK — the stateless "resend the whole messages array" model is exactly what people skip. One thing worth flagging early for the TinyAgent series, since it bites everyone the moment the agent does more than print text: the model's output is untrusted input the instant it drives an action. As soon as a tool call, shell command, or SQL string comes from the response, you've got an injection surface — and the "just works in 6 lines" simplicity makes it easy to pipe model output straight into a sink. Validating what the model is allowed to trigger is what turns a toy agent into a safe one. Looking forward to the rest.

Jasmin Virdi • May 31

Thanks @ofri-peretz

Great point. I think adding information about validation would be great.

Valentin Monteiro • May 27

The 4 GIFs are the happy path. The 5th invisible one in prod is retry/fallback/idempotency, which is where most agent loops actually burn their budget. Pricing math also flips once you're in a tool-calling loop: output tokens usually dominate input by an order of magnitude or more, so input price arbitrage between providers stops mattering. The real comparison is output cost plus structured-output reliability.

Jasmin Virdi • May 27

Fair point @valentin_monteiro

Really appreciate you adding this. It is helping me think about this series from a broader perspective, and I will try to cover it in upcoming posts.
Quick question though, what do you mean by structured output reliability? Is that about the model consistently returning valid JSON or something broader?

zhongqiyue • May 27

Great post — the stop_reason branching is something a lot of tutorials skip, but it's essential for building reliable agents. We ran into the same need to switch providers without rewriting code, so we started using ai.interwestinfo.com as a unified gateway. The pricing has been noticeably lower than buying direct, and having one key for 300+ models simplifies a lot. Have you experimented with routing requests between providers based on cost or latency?

Jasmin Virdi • May 27

Thanks @__c1b9e06dc90a7e0a676b

Interesting, does it support multiple models? I haven't tried routing requests based on cost or latency. Could you share some more pointers on it?

Mykola Kondratiuk • May 29

most devs skip the raw API until they hit a debugging problem, and then they badly need it. starting with raw calls front-loads complexity but I get why - much easier to debug the SDK layer after.

CapeStart • May 29

As models become commoditized, understanding the mechanics around API calls, context windows, tool usage, and cost control may become a bigger advantage than model choice itself.

Jasmin Virdi • May 29

True @capestart

I believe having a complete idea of how things work under the hood helps us select the right model and tell models apart for a given use.

xulingfeng • May 29

These GIFs are brilliantly clear — they show exactly how much the SDK abstractions hide. We switched to raw API calls for our Hermes agent stack after hitting a mysterious latency issue. Turned out the SDK was polling for stream completion even on non-streaming requests, adding 300-800ms per call that didn't show up in any dashboard.

Out of curiosity — are you planning to cover streaming vs non-streaming latency differences in the TinyAgent series? That's the one gap I haven't seen well explained with visuals.

Jasmin Virdi • May 29 • Edited

Thanks @xulingfeng

This is an interesting find. I am planning to focus this series on AI basics and LLMs. Can you help me understand more about the issue? I would be happy to include it in the series if it is related!

xulingfeng • May 30

Happy to help! The key insight is that SDK abstractions hide two things worth covering: serialization overhead (what happens when your objects hit the wire) and error propagation (where retry/backoff can silently mask failures). A simple demo: compare a raw curl request vs an SDK call for the same endpoint — the latency distribution tells the real story. Happy to review a draft if you go down that path! 🙌

Jasmin Virdi • May 30

I see. @xulingfeng

I would definitely try this out and would be happy to discuss further!

Super thanks😁

mote • May 31

The GIF format choice is clever: showing token generation as a streaming process rather than a discrete response is the kind of intuition that takes a while to build.

One thing I'd add: the "wait, it's just autocomplete" realization usually hits hardest when you're debugging a prompt that's almost right. The model isn't reasoning step-by-step; it's hallucinating a plausible continuation, which means small prompt tweaks can produce wildly different outputs.

For those building LLM integrations: the mental model shift that helps most is treating the API call as a partial function. It might return something useful, something wrong, or nothing at all (timeout/error). Designing for all three cases upfront saves a lot of production incidents.

Jasmin Virdi • May 31

Thanks @motedb

The "almost right" prompt is exactly when it clicks that the model is just guessing the next words, not actually thinking. And I like the "partial function" idea: plan for a good answer, a wrong one, and no answer at all from the start. Definitely something I'll keep in mind.
