DEV Community

Streaming an LLM response, in 4 GIFs

Jasmin Virdi on May 31, 2026

We have all watched tokens stream in from an LLM before, appearing one at a time as if the model were typing. If you used the Anthropic SDK's ....

Mudassir Khan • Jun 6

the delta type check is the one that bites people building tool call UIs — they filter for text_delta fine, then add tool use and suddenly input_json_delta just vanishes. saw this in production where a UI worked for months until a prompt started routing to a tool.
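A minimal sketch of that delta type check, assuming events have already been parsed into dicts shaped like Anthropic's streaming events (`content_block_start`, `content_block_delta`, `content_block_stop`). The function name `collect_blocks` and the `skipped` list are illustrative, not part of any SDK. The point is that delta types the code does not handle are recorded rather than silently dropped.

```python
import json
from typing import Any, Iterable


def collect_blocks(
    events: Iterable[dict[str, Any]],
) -> tuple[dict[int, dict[str, Any]], list[str]]:
    """Accumulate text and tool-input deltas per content block index.

    Returns (blocks, skipped), where skipped lists every delta type
    this function did not know how to handle.
    """
    blocks: dict[int, dict[str, Any]] = {}
    skipped: list[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "content_block_start":
            start = event["content_block"]
            blocks[event["index"]] = {
                "type": start["type"],
                "name": start.get("name"),  # only present on tool_use blocks
                "text": "",
                "partial_json": "",
            }
        elif kind == "content_block_delta":
            delta = event["delta"]
            block = blocks[event["index"]]
            if delta["type"] == "text_delta":
                block["text"] += delta["text"]
            elif delta["type"] == "input_json_delta":
                # Tool arguments arrive as JSON fragments; parse only at block stop.
                block["partial_json"] += delta["partial_json"]
            else:
                # Make the gap visible instead of letting the delta vanish.
                skipped.append(delta["type"])
        elif kind == "content_block_stop":
            block = blocks[event["index"]]
            if block["type"] == "tool_use":
                block["input"] = json.loads(block["partial_json"] or "{}")
    return blocks, skipped
```

A UI can log or alert on a non-empty `skipped` list, so a prompt that starts routing to a tool shows up as a warning instead of a blank panel.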

the pop trick for incomplete SSE frames is clean. one extra pattern: buffer growing past a threshold without hitting \n\n is usually a stalled connection, not a slow model. explicit stall timeout beats waiting on the fetch signal.
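A rough sketch of the pop trick with a stall guard added. It assumes frames are delimited by `\n\n` only (no `\r\n\r\n` handling), and the names `FrameBuffer`, `StalledStream`, `max_pending`, and `stall_seconds` are hypothetical, with thresholds picked arbitrarily. The clock is injectable so the timeout can be exercised without sleeping.

```python
import time
from typing import Callable


class StalledStream(Exception):
    """Raised when the stream looks stuck rather than slow."""


class FrameBuffer:
    def __init__(
        self,
        max_pending: int = 64 * 1024,
        stall_seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_pending = max_pending
        self.stall_seconds = stall_seconds
        self.clock = clock
        self.pending = ""
        self.last_frame_at = clock()

    def feed(self, chunk: str) -> list[str]:
        """Add a decoded chunk and return every complete frame it finished."""
        parts = (self.pending + chunk).split("\n\n")
        # The pop trick: the last piece is either "" or an incomplete frame,
        # so keep it in the buffer until the rest of it arrives.
        self.pending = parts.pop()
        if parts:
            self.last_frame_at = self.clock()
        if len(self.pending) > self.max_pending:
            raise StalledStream(
                f"{len(self.pending)} chars buffered without a frame delimiter"
            )
        return [part for part in parts if part]

    def check_stall(self) -> None:
        """Call periodically; raises if no complete frame arrived in time."""
        idle = self.clock() - self.last_frame_at
        if idle > self.stall_seconds:
            raise StalledStream(f"no complete frame for {idle:.1f}s")
```

Calling `check_stall()` from a timer keeps the stall decision in application code, separate from whatever timeout the HTTP client applies.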

are you planning to cover streaming with tool calls since the delta types diverge pretty sharply from pure text streaming?

Jasmin Virdi • Jun 6

Hi @mudassirworks

Yes, I'll be covering that in upcoming posts.

S M Tahosin • Jun 1

Great follow-up, Jasmin!
The GIFs make the streaming flow really easy to understand, especially how you handle the chunks and combine them properly.
Loved seeing the delta.content part explained clearly. Streaming definitely makes the UX feel much more responsive.
Looking forward to the tool calling and memory parts next. Keep these coming! 👏

Jasmin Virdi • Jun 1

Thanks again, @tahosin.
Glad you liked this one too, that's very motivating. No pressure, but I’ll make sure to keep the momentum going! 😁

Mykola Kondratiuk • Jun 2

silent recovery on stream errors is the sneaky footgun - output comes back wrong but no exception thrown.

Jasmin Virdi • Jun 2

So true @itskondrat, the silence is the dangerous part. I started treating a missing finish_reason as a hard error instead of assuming no exception means it worked. Truncation shouldn’t get to fail quietly.
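One way to sketch that check, assuming OpenAI-style chat completion chunks already parsed into dicts, with text at `choices[].delta.content` and a `finish_reason` that stays `None` until the final chunk. `collect_text` and `TruncatedStream` are made-up names for illustration, and the sketch assumes a single choice per stream.

```python
from typing import Any, Iterable, Optional


class TruncatedStream(Exception):
    """Raised when a stream ends without ever reporting a finish_reason."""


def collect_text(chunks: Iterable[dict[str, Any]]) -> tuple[str, str]:
    """Join streamed text and return (text, finish_reason).

    A stream that simply stops is treated as an error, even though
    no exception was thrown while reading it.
    """
    parts: list[str] = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    if finish_reason is None:
        raise TruncatedStream("stream ended without a finish_reason")
    return "".join(parts), finish_reason
```

The caller still has to look at the returned value: a `finish_reason` of `"length"` means the model hit its token limit, which is a different kind of truncation from a dropped connection.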

Harjot Singh • May 31

Streaming is one of those UX details that quietly makes or breaks an AI product. The same response feels 3x faster when it streams than when it lands all at once, even at identical total latency. The gotchas people hit: handling partial or aborted streams, backpressure, and rendering markdown/code incrementally without flicker. Worth getting right once and reusing. I lean on streamed output in Moonshift so users watch the agent working in real time instead of a spinner; perceived progress is half the experience. Did you hit the partial-markdown-rendering problem, or keep it plain text?

Jasmin Virdi • May 31 • Edited

Agreed @harjjotsinghh

Yeah, the perceived speed thing is wild: same latency, totally different feel. A spinner just can't compete with watching tokens land.

I kept these GIFs plain text on purpose, to keep the focus on the streaming itself. But partial markdown is a tricky one for sure; flickering half-done code blocks are a pain. How do you handle it in Moonshift: buffer till the block closes, or re-parse each chunk?
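For the "buffer till the block closes" option, a small sketch of the idea. It is not how Moonshift or any particular renderer does it. It assumes code blocks are delimited by lines starting with three backticks and ignores edge cases such as a fence marker split across two chunks; `split_renderable` is a hypothetical helper.

```python
FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block


def split_renderable(text: str) -> tuple[str, str]:
    """Split streamed markdown into (safe_to_render, held_back).

    Everything from the start of a still-open code fence onward is held
    back, so a half-finished code block is never handed to the renderer.
    """
    lines = text.splitlines(keepends=True)
    open_at = None  # index of the line that opened a fence not yet closed
    for i, line in enumerate(lines):
        if line.lstrip().startswith(FENCE):
            # A fence line toggles state: it opens a block or closes one.
            open_at = i if open_at is None else None
    if open_at is None:
        return text, ""
    return "".join(lines[:open_at]), "".join(lines[open_at:])
```

Re-running this on the full accumulated text after each chunk is the simplest approach. The trade-off is that code appears a block at a time rather than token by token.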

Felix • Jun 1

Nice work! The GIF format actually makes the streaming concept click way faster than reading the docs. One thing I'd add — when you're switching between OpenAI and Anthropic, the streaming format differs slightly (OpenAI sends multiple delta chunks vs Anthropic sends longer content blocks). It's a subtle difference that can break your UI if you're not handling both. Definitely worth a follow-up GIF comparing the two!
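A hedged sketch of handling both formats behind one function. It assumes OpenAI Chat Completions chunks carry text at `choices[0].delta.content`, and Anthropic Messages events carry it in `content_block_delta` events whose delta has type `text_delta`. In both cases the text arrives incrementally; what this sketch normalizes is the envelope around each piece. The `provider` labels and the name `text_from_event` are illustrative only.

```python
from typing import Any, Optional


def text_from_event(provider: str, event: dict[str, Any]) -> Optional[str]:
    """Return the text carried by one parsed stream event, or None if it has none."""
    if provider == "openai":
        choices = event.get("choices") or []
        if not choices:
            return None
        # Role-only and final chunks have no content, so normalize to None.
        return choices[0].get("delta", {}).get("content") or None
    if provider == "anthropic":
        # message_start, content_block_start, ping, etc. carry no text.
        if event.get("type") != "content_block_delta":
            return None
        delta = event.get("delta", {})
        if delta.get("type") != "text_delta":
            return None
        return delta.get("text") or None
    raise ValueError(f"unknown provider: {provider!r}")
```

With this in place the rendering loop only ever sees strings, so switching providers does not touch the UI code.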