DEV Community

An LLM API call, in 4 GIFs

Jasmin Virdi on May 26, 2026

This is the first post of series Building TinyAgent where we are going to build a small agent from scratch in Node.js with no frameworks just the A...

Sujala Vasanthasena Nelavai • Jun 3

Nice illustration

Jasmin Virdi • Jun 3

Thank you!

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • May 26

Very great illustration! I am a visual learner and this helped a lot! Good work :D

Jasmin Virdi • May 27

Thanks @francistrdev
I am a visual learner too. 🙋‍♀️😄

S M Tahosin • Jun 1

Great post, Jasmin! The GIFs make the stateless nature and messages array flow super clear.
Really liked your point on stop_reason — it's crucial for catching incomplete responses. Also the token cost breakdown is spot on; that output price difference completely changes how you design agents.
Looking forward to the next parts of the series. Keep it up! 👏

Jasmin Virdi • Jun 1

Thanks @tahosin

Glad you liked it. Thanks for sharing your feedback. 🙂

Ofri Peretz • May 30

Love that this starts from the raw call instead of an SDK — the stateless "resend the whole messages array" model is exactly what people skip. One thing worth flagging early for the TinyAgent series, since it bites everyone the moment the agent does more than print text: the model's output is untrusted input the instant it drives an action. As soon as a tool call, shell command, or SQL string comes from the response, you've got an injection surface — and the "just works in 6 lines" simplicity makes it easy to pipe model output straight into a sink. Validating what the model is allowed to trigger is what turns a toy agent into a safe one. Looking forward to the rest.
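A minimal sketch of the "validate what the model is allowed to trigger" idea, in plain Node.js-style JavaScript. The tool names (`read_file`, `get_weather`), the `{ name, args }` call shape, and the `validateToolCall` helper are all hypothetical, chosen only for illustration; a real agent would map whatever its provider returns into a shape like this before checking it.

```javascript
// Hypothetical allowlist: tool name -> validator for that tool's arguments.
// Anything not listed here is rejected, whatever the model asks for.
const ALLOWED_TOOLS = {
  read_file: (args) =>
    typeof args.path === "string" &&
    !args.path.includes("..") &&
    !args.path.startsWith("/"),
  get_weather: (args) =>
    typeof args.city === "string" && args.city.length <= 80,
};

// Treats the model's requested call as untrusted input.
// Returns { ok: true, name, args } or { ok: false, reason }.
// It only validates; it never executes anything.
function validateToolCall(call) {
  if (call === null || typeof call !== "object") {
    return { ok: false, reason: "tool call is not an object" };
  }
  const { name, args } = call;
  const known =
    typeof name === "string" &&
    Object.prototype.hasOwnProperty.call(ALLOWED_TOOLS, name);
  if (!known) {
    return { ok: false, reason: `tool not allowed: ${String(name)}` };
  }
  if (args === null || typeof args !== "object" || Array.isArray(args)) {
    return { ok: false, reason: "arguments must be an object" };
  }
  if (!ALLOWED_TOOLS[name](args)) {
    return { ok: false, reason: `invalid arguments for ${name}` };
  }
  return { ok: true, name, args };
}
```

The design choice is an allowlist rather than a blocklist: the agent runs only calls that pass a check written in advance, so a new or unexpected tool name fails closed.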

Jasmin Virdi • May 31

Thanks @ofri-peretz

Great point. I think adding a section on validating model output would be great.

Valentin Monteiro • May 27

The 4 GIFs are the happy path. The 5th invisible one in prod is retry/fallback/idempotency, which is where most agent loops actually burn their budget. Pricing math also flips once you're in a tool-calling loop: output tokens usually dominate input by an order of magnitude or more, so input price arbitrage between providers stops mattering. The real comparison is output cost plus structured-output reliability.
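A rough sketch of the retry side of that "5th GIF", assuming capped exponential backoff. The helper names (`backoffDelayMs`, `withRetry`), the default delays, and the `retryable` flag on errors are illustrative assumptions, and whether a given provider honors an idempotency key at all is something to check in its documentation.

```javascript
// Capped exponential backoff: base, 2x base, 4x base, ... up to capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries callFn up to maxAttempts times.
// callFn receives the same idempotencyKey on every attempt, so a backend
// that supports such keys could deduplicate repeated requests (assumption).
// Errors carrying retryable === false (a hypothetical flag) are not retried.
async function withRetry(callFn, options = {}) {
  const { maxAttempts = 3, idempotencyKey, sleep } = options;
  const wait = sleep ?? ((ms) => new Promise((r) => setTimeout(r, ms)));
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callFn({ attempt, idempotencyKey });
    } catch (err) {
      lastError = err;
      if (err?.retryable === false) break;
      if (attempt < maxAttempts - 1) await wait(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Capping `maxAttempts` matters for the budget point: every retry of a failed generation can be billed again, so an unbounded loop is a cost risk as well as a latency one.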

Jasmin Virdi • May 27

Fair point @valentin_monteiro

Really appreciate you adding this. It is helping me think about this series from a broader perspective, and I will try to cover it in upcoming posts.
Quick question though, what do you mean by structured output reliability? Is that about the model consistently returning valid JSON or something broader?

zhongqiyue • May 27

Great post — the stop_reason branching is something a lot of tutorials skip, but it's essential for building reliable agents. We ran into the same need to switch providers without rewriting code, so we started using ai.interwestinfo.com as a unified gateway. The pricing has been noticeably lower than buying direct, and having one key for 300+ models simplifies a lot. Have you experimented with routing requests between providers based on cost or latency?

Jasmin Virdi • May 27

Thanks @__c1b9e06dc90a7e0a676b

Interesting, does it support multiple models? I haven't tried routing requests based on cost or latency. Could you share some more pointers on it?

Mykola Kondratiuk • May 29

most devs skip the raw API until they hit a debugging problem, and then they badly need it. starting with raw calls front-loads complexity but I get why - much easier to debug the SDK layer after.

CapeStart • May 29

As models become commoditized, understanding the mechanics around API calls, context windows, tool usage, and cost control may become a bigger advantage than model choice itself.

Jasmin Virdi • May 29

True @capestart

I believe having a complete idea of how things work under the hood would help us select the right model and tell models apart for a given use.

xulingfeng • May 29

These GIFs are brilliantly clear — they show exactly how much the SDK abstractions hide. We switched to raw API calls for our Hermes agent stack after hitting a mysterious latency issue. Turned out the SDK was polling for stream completion even on non-streaming requests, adding 300-800ms per call that didn't show up in any dashboard.

Out of curiosity — are you planning to cover streaming vs non-streaming latency differences in the TinyAgent series? That's the one gap I haven't seen well explained with visuals.

Jasmin Virdi • May 29 • Edited

Thanks @xulingfeng

This is an interesting find. I am planning to focus this series on AI basics and LLMs. Can you help me understand more about the issue? I would be happy to include it in the series if it is related!

xulingfeng • May 30

Happy to help! The key insight is that SDK abstractions hide two things worth covering: serialization overhead (what happens when your objects hit the wire) and error propagation (where retry/backoff can silently mask failures). A simple demo: compare a raw curl request vs an SDK call for the same endpoint — the latency distribution tells the real story. Happy to review a draft if you go down that path! 🙌

Jasmin Virdi • May 30

I see. @xulingfeng

I would definitely try this out and would be happy to discuss further!

Super thanks😁

mote • May 31

The GIF format choice is clever: showing token generation as a streaming process rather than a discrete response is the kind of intuition that takes a while to build.

One thing I'd add: the "wait, it's just autocomplete" realization usually hits hardest when you're debugging a prompt that's almost right. The model isn't reasoning step-by-step; it's hallucinating a plausible continuation, which means small prompt tweaks can produce wildly different outputs.

For those building LLM integrations: the mental model shift that helps most is treating the API call as a partial function: it might return something useful, something wrong, or nothing at all (timeout/error). Designing for all three cases upfront saves a lot of production incidents.
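One possible way to encode the three cases is a tagged result. This is a sketch only: `callAsPartial`, `describeOutcome`, the `kind` labels, and the 30-second default timeout are hypothetical names and values, and `callFn` stands in for whatever function actually makes the API request.

```javascript
// Wraps an async call so it always resolves to one of three outcomes:
//   { kind: "useful", value }   the call returned and passed the check
//   { kind: "wrong", value }    the call returned but failed the check
//   { kind: "nothing", error }  the call threw or timed out
// Note: the timeout stops the waiting; it does not cancel the request itself.
async function callAsPartial(callFn, isUseful, timeoutMs = 30000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  try {
    const value = await Promise.race([callFn(), timeout]);
    return isUseful(value)
      ? { kind: "useful", value }
      : { kind: "wrong", value };
  } catch (error) {
    return { kind: "nothing", error };
  } finally {
    clearTimeout(timer);
  }
}

// Forces the caller to decide up front what each case means.
function describeOutcome(outcome) {
  switch (outcome.kind) {
    case "useful":
      return "use the value";
    case "wrong":
      return "re-prompt or fall back";
    case "nothing":
      return "retry or surface the error";
    default:
      throw new Error(`unknown outcome: ${outcome.kind}`);
  }
}
```

Because `callAsPartial` never rejects, the calling code has a single place, the `switch`, where all three cases must be handled.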

Jasmin Virdi • May 31

Thanks @motedb

The "almost right" prompt is exactly when it clicks that the model is just guessing the next words, not actually thinking. And I like the "partial function" idea: plan for a good answer, a wrong one, and no answer at all from the start. Definitely something I'll keep in mind.

Theo Valmis • May 29

The 4-GIF framing is great precisely because it forces the question of where the boundary actually lives. Once people see how thin the request/response shell is, the more interesting question becomes what shapes the prompt before it goes out — and that's where most production complexity ends up living.

Leo Pessoa • May 30

The stop_reason warning is the one that bites hardest, and usually in production rather than in dev. The same failure mode extends one layer up: even when stop_reason is "end_turn", the content field might be valid JSON, valid JSON with unexpected keys, or a prose explanation of what the JSON would have looked like. That's the layer structured output APIs add on top of these four mechanics — not just that the model stopped, but that it delivered the contract. Would love to see a Part 2 that covers what happens after the response technically parses but the shape is wrong.
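A small sketch of checking the contract after checking `stop_reason`. The flattened `{ stop_reason, text }` response shape, the `checkContract` name, and the flat list of required keys are simplifying assumptions for illustration; real provider responses are more nested, and a real schema would also check value types.

```javascript
// Checks two layers in order:
//   1. did the model finish normally (stop_reason === "end_turn")?
//   2. is the text a JSON object with exactly the expected keys?
// `response` is a hypothetical flattened shape: { stop_reason, text }.
function checkContract(response, requiredKeys) {
  if (response.stop_reason !== "end_turn") {
    return { ok: false, reason: `stopped early: ${response.stop_reason}` };
  }
  let data;
  try {
    data = JSON.parse(response.text);
  } catch {
    // Covers the "prose explanation of what the JSON would look like" case.
    return { ok: false, reason: "not JSON" };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, reason: "JSON is not an object" };
  }
  const keys = Object.keys(data);
  const missing = requiredKeys.filter((k) => !keys.includes(k));
  const unexpected = keys.filter((k) => !requiredKeys.includes(k));
  if (missing.length > 0 || unexpected.length > 0) {
    return { ok: false, reason: "wrong shape", missing, unexpected };
  }
  return { ok: true, data };
}
```

For example, `checkContract({ stop_reason: "end_turn", text: '{"city":"Paris","mood":"sunny"}' }, ["city", "temp"])` parses fine but fails with `missing: ["temp"]` and `unexpected: ["mood"]`, which is the "technically parses but the shape is wrong" case.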

Jasmin Virdi • May 30

I see, this is a point worth noting @pessoabuilds.
The model can stop cleanly but still return the wrong shape. Adding structured output validation to the list is a great idea.

VoltageGPU • May 27

Interesting breakdown of the API call flow! When working with GPU-backed LLMs, I've seen how critical it is to manage memory and concurrency efficiently—especially when handling multiple requests. If you're scaling this up, you might want to look into how frameworks like VoltageGPU help with resource isolation.

Jasmin Virdi • May 27

Thanks @voltagegpu

Seems interesting, will check it out. I believe infra and scaling could be another topic altogether.

Felix • May 30

Great explanation! I've been working on a multi-model API relay project and this perfectly illustrates why a unified endpoint matters — having to explain these steps for 5 different providers is exactly the pain point we're solving. The streaming GIF is especially helpful for beginners. Thanks for sharing!

Jasmin Virdi • May 30

Thanks!

Glad you found it helpful!

The Seventeen • May 29

This is a really beautiful write up. Would love to see how you integrate AgentSecrets for credentials management!

agentsecrets.theseventeen.co

Jasmin Virdi • May 29

Thanks @the_seventeen

More coming soon. Stay tuned!

The Seventeen • May 29

Can't wait!

shogun 444 • May 30

One thing that took me too long to realize: LLM APIs are surprisingly simple.

Most of the magic people talk about is just sending a JSON payload, managing context manually, and keeping an eye on token usage. Understanding statelessness and token costs early saves a lot of confusion later.
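To make "managing context manually" concrete, here is a sketch of the bookkeeping only, with no network call. The helper names `buildRequest` and `recordReply` and the model name `"example-model"` are hypothetical; the `role`/`content` message shape follows the messages array discussed in the post.

```javascript
// The API is stateless, so every request must carry the whole conversation.
// Builds the JSON payload for the next call from the history so far.
function buildRequest(model, history, userText) {
  const messages = [...history, { role: "user", content: userText }];
  return { model, messages };
}

// Appends the assistant's reply so the next request can include it.
// Returns a new array rather than mutating the one passed in.
function recordReply(messages, assistantText) {
  return [...messages, { role: "assistant", content: assistantText }];
}

// Two turns: the second payload resends the first exchange in full.
const first = buildRequest("example-model", [], "Hi");
const history = recordReply(first.messages, "Hello!");
const second = buildRequest("example-model", history, "What did I just say?");
```

Here `second.messages` holds three entries (user, assistant, user), so input token usage grows with every turn even when the new message is short.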

Jasmin Virdi • May 30

Thanks @shogun444

Glad it helped!

Nahuel Nucera • May 28

Amazing!

Jasmin Virdi • May 28

Thanks @nahuel990

Anguishe • May 26

Very nice! Awesome topic to do a series on. Looking forward to seeing the rest 😍

Jasmin Virdi • May 27 • Edited

Thanks @bashsnippets

I have a bunch of things to cover in this series. This is really motivating, and I hope I do justice to the series. 😄

Vasyl • May 27 • Edited

This is actually a really clean explanation for beginners. The JSON is expensive part surprises almost everyone 😄 People use AI APIs for months without knowing what stop_reason does.

Jasmin Virdi • May 27 • Edited

Thanks @workout097collab

Glad you liked it. More coming soon. 😄

leob • May 29

Insightful, very well written! AI/LLMs explained for "the rest of us" (a.k.a. "mere mortals") :-)

Jasmin Virdi • May 29

Thanks @leob

Glad you liked it. More in the series coming soon.

leob • May 29

Looking forward to it!

Nimesh Kulkarni • May 31

Insightful..!

Jasmin Virdi • May 31

Thanks @nimay_04

Nafas Ebrahimi • May 28

Great post! I learned a few things from this post.

Jasmin Virdi • May 28

Thanks @nafasebra

Glad you liked it

UnitBuilds • May 26

And if you're using APIs, turn on flex/batch and context caching to make sure you don't burn your wallet.

Jasmin Virdi • May 27

Great point @unitbuilds

Prompt caching is good to have when we have a long prompt that does not change frequently; it helps reduce input costs. Batch, on the other hand, is perfect for anything that doesn't need a real-time response. I will make sure to cover both in upcoming parts of the series.

Microns • May 31

Good🎉

Jasmin Virdi • May 31

Thanks @microns