DEV Community

Cover image for Streaming an LLM response, in 4 GIFs
Jasmin Virdi
Jasmin Virdi Subscriber

Posted on

Streaming an LLM response, in 4 GIFs

Perceived speed vs actual latency

We have watched tokens stream in from an LLM before where they appeared one at a time, like the model was typing. If you used the Anthropic SDK's .stream() method, it just worked and you probably never saw what was on the wire.

This post will majorly focus on how a stream response works and how bugs are handled by SDK behind the hood.

1. Why Streaming exists

To enable the streaming option we would need to make one change in the post request that is a single field "stream": true and it will change the response experience.

non-streaming vs streaming, side by side

Here are the pointers we take from the gif.

  1. The left side shows no streaming as the cursor blinks for 4 seconds then the whole response lands at once.
  2. The right side shows the streaming where the first word shows up in about 300 milliseconds. Words flow in as the model generates them.

Both the sides have same model, same prompt, same total time it is just the right side started giving response almost 4 seconds earlier. The 4 seconds wait time for a full reply feels broken. A streamed reply that finishes in four seconds feels fast. Streaming doesn't make the model faster it makes the wait disappear.


2. What's on the wire

When you set stream: true, the API stops sending a single JSON blob. It opens a persistent HTTP connection and pushes events down the line as the model generates them. The format is Server-Sent Events (SSE) a web standard. Any SSE debugger will read this stream.

Here's what comes through:

raw SSE chunks streaming in with delta.text and stop_reason highlighted
A few things to notice:

The text lives in delta.text, nested inside content_block_delta events. Those are the events we should look after.

stop_reason moved. In post 1, we saw it right there in the response JSON. Here, it arrives at the very end inside a message_delta event, just before message_stop. If the loop bails out as soon as the text stops arriving we will never see it.

Chunks don't line up with tokens or words. You might get "Hello" in one chunk and " world" in the next, or both in one. The network decides where the cuts happens and it is not the model, not the API.

That's what the SDK has been hiding from you.


3. Reading the stream

Streaming sounds complicated until we write the loop. It's just reading bytes, buffering them, splitting on blank lines, and parsing JSON.

Here's the flow:

  1. The response body is a ReadableStream which can be iterated with for await.
  2. Each iteration gives us bytes which we can decode to string.
  3. Buffer the string. A chunk might end mid-message.
  4. Split the buffer on \n\n — that's the SSE message separator.
  5. Keep the last item in the buffer. It might be incomplete.
  6. For each complete message, find the data: line, strip the prefix, and parse the JSON.
  7. If the type is content_block_delta, print delta.text.
  8. If it's message_delta, you've got your stop_reason.

code on left highlights line by line, output appears on right

Here is the complete sample code you can use to try out:

const prompt = process.argv[2] ?? "Count to 10, slowly.";

const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.ANTHROPIC_API_KEY,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  }),
});

const decoder = new TextDecoder();
let buffer = "";

for await (const chunk of response.body) {
  buffer += decoder.decode(chunk, { stream: true });

  const messages = buffer.split("\n\n");
  buffer = messages.pop() ?? "";

  for (const message of messages) {
    const dataLine = message.split("\n").find(l => l.startsWith("data: "));
    if (!dataLine) continue;

    const data = JSON.parse(dataLine.slice(6));

    if (data.type === "content_block_delta" && data.delta.type === "text_delta") {
      process.stdout.write(data.delta.text);
    }

    if (data.type === "message_delta") {
      process.stderr.write(`\n\n[stop_reason: ${data.delta.stop_reason}]\n`);
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

The way it is working is that when the chunk ends in the middle of a message split("\n\n") leaves an incomplete fragment as the last item. pop() pulls it back into the buffer so the next chunk can finish it. Without this line, every split message crashes the parser.

data.delta.type === "text_delta" this check matters because content_block_delta can carry other delta types too: input_json_delta for tool arguments, thinking_delta for extended thinking, signature_delta for verification. For now we only care about text.

You can find the full implementation here on GitHub as well.


4. Three bugs

The code above works on a good day. Here's what breaks it on a bad one.

three bugs — ghost stream, silent truncation, split packet

The ghost stream. The issue is user navigates away with the stream keeps running and tokens keep arriving with nobody to read them. In order to fix this pass an AbortController signal to fetch and call abort() when you're done.

The fix is an AbortController:

const controller = new AbortController();
const response = await fetch(url, { signal: controller.signal, ...options });
// later, when the user navigates away:
controller.abort();
Enter fullscreen mode Exit fullscreen mode

The silent truncation. The API can send an error event mid stream during overload. If the loop only handles content_block_delta, the error gets skipped and you end up with a truncated response and no exception. The fix is to handle data.type === "error" explicitly.

if (data.type === "error") {
  throw new Error(`Stream error: ${data.error.message}`);
}

Enter fullscreen mode Exit fullscreen mode

The split packet. A single SSE message can arrive in two TCP packets. Without buffering, JSON.parse throws on the half. This is what buffer = messages.pop() ?? "" fixes, it holds the incomplete piece until the next chunk completes it.

stop_reason, in a stream

In post 1, stop_reason was right there in the response JSON. In a stream, it's the same four values end_turn, max_tokens, tool_use, stop_sequence but they arrive inside a message_delta event near the end of the stream.

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn",...}}
Enter fullscreen mode Exit fullscreen mode

The same rule from post 1 applies: if you ignore stop_reason, you'll ship a bug. A max_tokens cutoff in a streamed response looks exactly like a normal end of stream. You won't know the model was cut off unless you read this event.

Three things to try before the next post

1. Run the streaming code. Then change "stream": true to false and run it again. Notice how long you wait before seeing anything. That gap is what your users feel.

2. Add console.error(chunk.length) inside the for await loop, before any parsing. Run the code and watch the numbers. You'll see chunks of wildly different sizes it could be 8 bytes here, 400 bytes there. The network decides, not the model. Tokens and chunks are not the same thing.

3. Start a stream, then disconnect your wifi mid response. Watch what happens. The loop hangs, then eventually throws but only if we have added error handling. This sets up the error handling post later in the series.

What's next

TinyAgent can now stream a response. Tokens land as they arrive. stop_reason shows up at the end. It still has no memory though every call starts blank.

In the upcoming post series we will capture another important details. 😁

Happy Coding! 👩‍💻

Top comments (14)

Collapse
 
mudassirworks profile image
Mudassir Khan

the delta type check is the one that bites people building tool call UIs — they filter for text_delta fine, then add tool use and suddenly input_json_delta just vanishes. saw this in production where a UI worked for months until a prompt started routing to a tool.

the pop trick for incomplete SSE frames is clean. one extra pattern: buffer growing past a threshold without hitting \n\n is usually a stalled connection, not a slow model. explicit stall timeout beats waiting on the fetch signal.

are you planning to cover streaming with tool calls since the delta types diverge pretty sharply from pure text streaming?

Collapse
 
jasmin profile image
Jasmin Virdi

Hi @mudassirworks

Yes, will be covering in the upcoming posts.

Collapse
 
tahosin profile image
S M Tahosin

Great follow-up, Jasmin!
The GIFs make the streaming flow really easy to understand, especially how you handle the chunks and combine them properly.
Loved seeing the delta.content part explained clearly. Streaming definitely makes the UX feel much more responsive.
Looking forward to the tool calling and memory parts next. Keep these coming! 👏

Collapse
 
jasmin profile image
Jasmin Virdi

Thanks again, @tahosin.
Glad you liked this one too very motivating. No pressure, but I’ll make sure to keep the momentum going! 😁

Collapse
 
itskondrat profile image
Mykola Kondratiuk

silent recovery on stream errors is the sneaky footgun - output comes back wrong but no exception thrown.

Collapse
 
jasmin profile image
Jasmin Virdi

So true @itskondrat, the silence is the dangerous part. I started treating a missing finish_reason as a hard error instead of assuming no exception means it worked. Truncation shouldn’t get to fail quietly.

Collapse
 
harjjotsinghh profile image
Harjot Singh

Streaming is one of those UX details that quietly makes or breaks an AI product, the same response feels 3x faster when it streams vs lands all at once, even at identical total latency. The gotchas people hit: handling partial or aborted streams, backpressure, and rendering markdown/code incrementally without flicker. Worth getting right once and reusing. I lean on streamed output in Moonshift so users watch the agent working in real time instead of a spinner, perceived progress is half the experience. Did you hit the partial-markdown-rendering problem, or keep it plain text?

Collapse
 
jasmin profile image
Jasmin Virdi • Edited

Agreed @harjjotsinghh

Yeah, the perceived speed thing is wild which is same latency, totally different feel. A spinner just can't compete with watching tokens land.

I kept these gifs plain text on purpose, to keep the focus on the streaming itself. But partial markdown is a tricky one for sure, flickering half done code blocks are a pain. How do you handle it in Moonshift, buffer till the block closes or re-parse each chunk?

Collapse
 
__5b6e8f677243ba4b2f60f profile image
Felix

Nice work! The GIF format actually makes the streaming concept click way faster than reading the docs. One thing I'd add — when you're switching between OpenAI and Anthropic, the streaming format differs slightly (OpenAI sends multiple delta chunks vs Anthropic sends longer content blocks). It's a subtle difference that can break your UI if you're not handling both. Definitely worth a follow-up GIF comparing the two!

Collapse
 
jasmin profile image
Jasmin Virdi

Thanks Felix!

That's a great point. I would definitely look into it. Your feedback really helps a lot!🙂

Collapse
 
devdatta_gawali_28 profile image
Devdatta Gawali

"Great article! I also just started my web dev journey as a beginner. Learning so much!"

Collapse
 
jasmin profile image
Jasmin Virdi

Thanks @devdatta_gawali_28
Glad you liked it.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.