Streaming an LLM response, in 4 GIFs

Jasmin Virdi on May 31, 2026

We have watched tokens stream in from an LLM before, appearing one at a time as if the model were typing. If you used the Anthropic SDK's ....
 
Mudassir Khan

the delta type check is the one that bites people building tool call UIs — they filter for text_delta fine, then add tool use and suddenly input_json_delta just vanishes. saw this in production where a UI worked for months until a prompt started routing to a tool.
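A minimal sketch of that check, assuming deltas shaped like the `delta` field of Anthropic's `content_block_delta` events (`text_delta` carries `text`, `input_json_delta` carries `partial_json`). The `Delta` and `Accumulated` types and the `applyDelta` helper are hypothetical names for illustration, not SDK exports, and the sketch ignores content block indexes for brevity.

```typescript
// Simplified local types (assumptions for this sketch, not the SDK's own types).
interface Delta {
  type: string;
  text?: string;
  partial_json?: string;
}

interface Accumulated {
  text: string;
  toolInputJson: string;
  unknownTypes: string[];
}

// Route each delta by its type instead of filtering for text_delta only.
function applyDelta(acc: Accumulated, delta: Delta): Accumulated {
  switch (delta.type) {
    case "text_delta":
      return { ...acc, text: acc.text + (delta.text ?? "") };
    case "input_json_delta":
      // Tool input arrives as JSON fragments; parse only once the block is complete.
      return { ...acc, toolInputJson: acc.toolInputJson + (delta.partial_json ?? "") };
    default:
      // Record unexpected types so they do not vanish silently.
      return { ...acc, unknownTypes: [...acc.unknownTypes, delta.type] };
  }
}

const deltas: Delta[] = [
  { type: "text_delta", text: "Checking the weather. " },
  { type: "input_json_delta", partial_json: '{"city": "Par' },
  { type: "input_json_delta", partial_json: 'is"}' },
];
const empty: Accumulated = { text: "", toolInputJson: "", unknownTypes: [] };
const accumulated = deltas.reduce(applyDelta, empty);
```

The `default` branch is the design point: an unhandled delta type ends up somewhere visible (a log, a counter) rather than being dropped by a filter.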

the pop trick for incomplete SSE frames is clean. one extra pattern: buffer growing past a threshold without hitting \n\n is usually a stalled connection, not a slow model. explicit stall timeout beats waiting on the fetch signal.
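Both ideas can be sketched together, assuming frames are separated by `\n\n` (the SSE format also permits CRLF line endings, which this ignores). `createFrameBuffer`, `maxPendingChars`, and `stallMs` are hypothetical names, and the caller passes in timestamps so the logic stays testable.

```typescript
interface FrameBufferOptions {
  maxPendingChars: number; // cap on buffered text that has no frame boundary yet
  stallMs: number;         // how long without a complete frame counts as a stall
}

function createFrameBuffer(opts: FrameBufferOptions, startedAt: number) {
  let pending = "";
  let lastFrameAt = startedAt;
  return {
    // Returns the complete frames in this chunk and keeps the trailing partial one.
    push(chunk: string, now: number): string[] {
      const parts = (pending + chunk).split("\n\n");
      pending = parts.pop() ?? ""; // the "pop trick": last piece may be incomplete
      if (parts.length > 0) lastFrameAt = now;
      if (pending.length > opts.maxPendingChars) {
        throw new Error("SSE buffer exceeded threshold without a frame boundary");
      }
      return parts;
    },
    // True when no complete frame has arrived for longer than stallMs.
    isStalled(now: number): boolean {
      return now - lastFrameAt > opts.stallMs;
    },
  };
}

const buf = createFrameBuffer({ maxPendingChars: 64, stallMs: 5000 }, 0);
const first = buf.push('data: {"a":1}\n\ndata: {"b"', 100);
const second = buf.push(':2}\n\n', 200);
```

In a real reader loop, `isStalled` would be polled on a timer and used to abort the request explicitly.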

are you planning to cover streaming with tool calls since the delta types diverge pretty sharply from pure text streaming?

Jasmin Virdi

Hi @mudassirworks

Yes, I'll be covering it in upcoming posts.

S M Tahosin

Great follow-up, Jasmin!
The GIFs make the streaming flow really easy to understand, especially how you handle the chunks and combine them properly.
Loved seeing the delta.content part explained clearly. Streaming definitely makes the UX feel much more responsive.
Looking forward to the tool calling and memory parts next. Keep these coming! 👏

Jasmin Virdi

Thanks again, @tahosin.
Glad you liked this one too, it's very motivating. No pressure, but I’ll make sure to keep the momentum going! 😁

Mykola Kondratiuk

silent recovery on stream errors is the sneaky footgun - output comes back wrong but no exception thrown.

Jasmin Virdi

So true @itskondrat, the silence is the dangerous part. I started treating a missing finish_reason as a hard error instead of assuming no exception means it worked. Truncation shouldn’t get to fail quietly.
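Roughly, that check could look like the sketch below. It assumes OpenAI-style chat completion chunks, where text arrives in `choices[0].delta.content` and the last chunk carries a non-null `finish_reason`. `ChatChunk` is a trimmed-down local type and `collectStream` is a hypothetical helper; a real stream would be an async iterable rather than an array.

```typescript
// Trimmed-down chunk shape (an assumption for this sketch).
interface ChatChunk {
  choices: { delta: { content?: string | null }; finish_reason: string | null }[];
}

function collectStream(chunks: ChatChunk[]): { text: string; finishReason: string } {
  let text = "";
  let finishReason: string | null = null;
  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (!choice) continue;
    text += choice.delta.content ?? "";
    if (choice.finish_reason !== null) finishReason = choice.finish_reason;
  }
  // No exception during iteration does not mean the stream completed.
  if (finishReason === null) {
    throw new Error(`Stream ended without finish_reason after ${text.length} chars`);
  }
  return { text, finishReason };
}

const completeStream: ChatChunk[] = [
  { choices: [{ delta: { content: "Hel" }, finish_reason: null }] },
  { choices: [{ delta: { content: "lo" }, finish_reason: null }] },
  { choices: [{ delta: {}, finish_reason: "stop" }] },
];
const collected = collectStream(completeStream);
```

Returning `finishReason` also lets the caller tell `"stop"` apart from `"length"`, which is a truncation of a different kind.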

Harjot Singh

Streaming is one of those UX details that quietly makes or breaks an AI product. The same response feels 3x faster when it streams vs. landing all at once, even at identical total latency. The gotchas people hit: handling partial or aborted streams, backpressure, and rendering markdown/code incrementally without flicker. Worth getting right once and reusing. I lean on streamed output in Moonshift so users watch the agent working in real time instead of a spinner; perceived progress is half the experience. Did you hit the partial-markdown-rendering problem, or keep it plain text?

Jasmin Virdi • Edited

Agreed @harjjotsinghh

Yeah, the perceived speed thing is wild: same latency, totally different feel. A spinner just can't compete with watching tokens land.

I kept these GIFs plain text on purpose, to keep the focus on the streaming itself. But partial markdown is a tricky one for sure; flickering half-done code blocks are a pain. How do you handle it in Moonshift: buffer till the block closes, or re-parse each chunk?
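For reference, the "buffer till the block closes" option might look something like this sketch. It only tracks triple-backtick fences at the start of a line and nothing else in markdown; `splitRenderable` is a hypothetical helper, not how Moonshift or any particular renderer does it.

```typescript
const FENCE = "`".repeat(3); // a triple-backtick code fence marker

// Splits accumulated markdown into text that is safe to render now and a
// tail that is held back because it sits inside a still-open code fence.
function splitRenderable(markdown: string): { ready: string; held: string } {
  const lines = markdown.split("\n");
  let openAt = -1; // line index of the currently unclosed fence, or -1
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim().startsWith(FENCE)) {
      openAt = openAt === -1 ? i : -1; // toggle: open a fence or close it
    }
  }
  if (openAt === -1) return { ready: markdown, held: "" };
  return {
    ready: lines.slice(0, openAt).join("\n"),
    held: lines.slice(openAt).join("\n"),
  };
}

const midStream = splitRenderable("Intro\n" + FENCE + "ts\nconst a = 1;");
const afterClose = splitRenderable("Intro\n" + FENCE + "ts\nconst a = 1;\n" + FENCE);
```

The trade-off is that long code blocks appear all at once when the closing fence arrives, whereas re-parsing each chunk keeps them streaming at the cost of handling half-open blocks in the renderer.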

Felix

Nice work! The GIF format actually makes the streaming concept click way faster than reading the docs. One thing I'd add: when you're switching between OpenAI and Anthropic, the streaming format differs (OpenAI sends chat completion chunks with the text in choices[0].delta.content, while Anthropic sends typed events such as content_block_delta with the text in delta.text). It's a difference that can break your UI if you're not handling both. Definitely worth a follow-up GIF comparing the two!
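A rough sketch of handling both behind one function, assuming OpenAI-style chunks carry text in `choices[0].delta.content` and Anthropic-style events carry it in a `content_block_delta` event whose `delta.type` is `text_delta`. The interfaces are trimmed-down local shapes and the helper names are hypothetical.

```typescript
// Minimal local shapes (assumptions for this sketch, not the SDKs' types).
interface OpenAIChunk {
  choices?: { delta?: { content?: string | null } }[];
}
interface AnthropicEvent {
  type?: string;
  delta?: { type?: string; text?: string };
}

function textFromOpenAI(chunk: OpenAIChunk): string {
  return chunk.choices?.[0]?.delta?.content ?? "";
}

function textFromAnthropic(streamEvent: AnthropicEvent): string {
  // Only content_block_delta events with a text_delta payload carry visible text.
  if (streamEvent.type !== "content_block_delta") return "";
  if (streamEvent.delta?.type !== "text_delta") return "";
  return streamEvent.delta?.text ?? "";
}

const fromOpenAI = textFromOpenAI({ choices: [{ delta: { content: "Hi" } }] });
const fromAnthropic = textFromAnthropic({
  type: "content_block_delta",
  delta: { type: "text_delta", text: "Hi" },
});
const ignored = textFromAnthropic({ type: "message_start" });
```

With both providers reduced to plain strings, the UI layer can append text without knowing which API produced it.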

Jasmin Virdi

Thanks Felix!

That's a great point. I will definitely look into it. Your feedback really helps a lot! 🙂

Devdatta Gawali

Great article! I also just started my web dev journey as a beginner. Learning so much!

Jasmin Virdi

Thanks @devdatta_gawali_28
Glad you liked it.