Pablo Albaladejo

Streaming LLM Responses Through API Gateway REST with Lambda

The problem with buffered LLM responses

When you call an LLM from a Lambda function and wait for the full response before returning it, the user stares at a spinner for 5 to 15 seconds. The model is generating tokens the entire time, but the client receives nothing until the very last one. For short completions this is tolerable. For structured output (summaries, extracted entities, multi-field JSON), the delay kills the experience.

The fix is response streaming: send tokens to the client as the model produces them. The user sees content appearing incrementally. Time-to-first-byte drops from seconds to hundreds of milliseconds. The perceived latency improves dramatically even though the total generation time stays the same.

In November 2025, AWS added native response streaming support to API Gateway REST APIs. Before that, streaming from Lambda required Function URLs (limited auth, no VPC support) or WebSocket APIs (more infrastructure, more complexity). Now we get streaming through the same REST API most teams already have in production.

This is the first post in a four-part series:

  1. The infrastructure and handler setup (this post)
  2. A subtle middleware timing bug that breaks observability when streaming (Part 2)
  3. A deferred observability pattern that solves it (Part 3)
  4. Production-grade testing for the whole thing (Part 4)

Every post has a companion step in the repository with working code you can deploy.

TL;DR: Stream structured LLM responses from Bedrock Claude through API Gateway REST and Lambda using Middy's streamifyResponse and AI SDK v6 streamText() with Output.object(). The CDK stack uses a CfnMethod escape hatch to enable ResponseTransferMode: STREAM. Time-to-first-byte drops from seconds to hundreds of milliseconds. Full working code included.

What this post covers

We will wire up a complete streaming pipeline: API Gateway REST receives a request, invokes a Lambda function in streaming mode, the Lambda uses Middy and the Vercel AI SDK to call Bedrock Claude via streamText, and the structured JSON streams back to the client chunk by chunk.

By the end, you will have a deployed endpoint you can curl and watch tokens arrive in real time.

How do you stream LLM responses through API Gateway REST?

Here is the data flow:

Client (curl / browser)
  |
  | POST /stream
  v
API Gateway REST (responseTransferMode: STREAM)
  |
  | response-streaming-invocations
  v
Lambda (Middy, streamifyResponse: true)
  |
  | streamText()
  v
Bedrock Claude 3.5 Sonnet

The client sends a POST request. API Gateway forwards it to Lambda using the streaming invocation endpoint. Lambda calls Bedrock Claude through the AI SDK's streamText, which returns a ReadableStream.

Middy detects the stream in the response body and pipes it to Lambda's response stream. API Gateway forwards each chunk to the client as it arrives.

Why API Gateway REST instead of Function URLs? Three reasons:

  • VPC support. Function URLs do not support response streaming from within a VPC. If your Lambda runs in a VPC to reach a database or private service, Function URLs are not an option.
  • Built-in auth and policies. REST APIs support IAM and Cognito authorizers, request validation, and usage plans that most teams already depend on.
  • Battle-tested CORS. CORS handling through mock integrations and gateway responses is mature and well understood. Function URLs require you to handle CORS entirely in your Lambda code.

How do you configure API Gateway REST for response streaming?

Two properties on the API Gateway method integration enable streaming:

  1. ResponseTransferMode: STREAM tells API Gateway to forward response bytes to the client as they arrive, rather than buffering the entire response.

  2. The streaming invocation URI replaces the standard Lambda invoke path. Instead of .../invocations, you use .../response-streaming-invocations. This tells API Gateway to use the InvokeWithResponseStream API instead of the standard Invoke.

Wire format

When Lambda streams through an API Gateway proxy integration, the stream must follow a specific wire format:

  1. JSON metadata: a valid JSON object containing statusCode, headers, and optionally cookies. This can be as minimal as {}.
  2. An 8-null-byte delimiter: eight 0x00 bytes that separate metadata from body.
  3. The response payload: the actual content chunks.

The delimiter must appear within the first 16 KB of the stream. If you use Middy with streamifyResponse: true, it handles this format automatically. You never write the delimiter yourself.
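
For illustration only, here is roughly what the first bytes of a streamed response look like on the wire. This is a sketch of the framing Middy produces for you, not something your handler assembles:

// Sketch of the wire framing (produced by Middy, never written by hand).
const metadata = JSON.stringify({
  statusCode: 200,
  headers: { 'content-type': 'text/plain; charset=utf-8' },
});

const delimiter = Buffer.alloc(8, 0); // eight 0x00 bytes between metadata and body

const firstBytes = Buffer.concat([
  Buffer.from(metadata, 'utf8'),
  delimiter,
  Buffer.from('{"summary":"', 'utf8'), // body chunks follow immediately
]);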

Timeouts and bandwidth

API Gateway REST streaming supports timeouts up to 15 minutes. There are two idle timeouts to be aware of:

  • Regional and private endpoints: 5 minutes of idle time before the connection terminates.
  • Edge-optimized endpoints: 30 seconds. For LLM streaming, avoid edge-optimized endpoints.

Bandwidth is throttled at two layers:

Layer        Uncapped window   Throttled rate
Lambda       First 6 MB        2 MB/s
API Gateway  First 10 MB       2 MB/s

For LLM responses (typically tens of KB), you will never hit these limits. They matter for large file downloads, not token streaming.

There are additional quotas to be aware of when using API Gateway REST with streaming:

Quota                               Value        Notes
Max streamed payload (API Gateway)  10 MB        Hard limit, not increasable
Max streamed payload (Lambda)       200 MB       Via InvokeWithResponseStream
Integration timeout (default)       29 seconds   Increasable for regional and private APIs
Idle connection timeout             310 seconds  General API Gateway limit (separate from the streaming idle timeouts above)

For LLM streaming, the integration timeout is the one that matters most. The default 29-second limit may be too short for complex prompts. You can request an increase through Service Quotas, or use a longer timeout in your CDK stack (we use 60 seconds).

Pricing. Lambda response streaming uses InvokeWithResponseStream, which is billed identically to standard Invoke: duration plus memory.

However, streaming responses are not interrupted when the client disconnects. You are billed for the full function duration even if the client closes the connection mid-stream. Keep this in mind when setting timeouts.
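
One way to bound that exposure is to cap the generation itself, independent of what the client does. The sketch below passes an abort signal to the AI SDK call we set up later in this post; the 55-second budget and the boundedStreamingService name are assumptions for illustration, not part of the deployed service.

// Hypothetical variation of lambda/streaming-service.ts: cap the model call so a
// disconnected client cannot keep the function generating for the full timeout.
import { bedrock } from '@ai-sdk/amazon-bedrock';
import { streamText } from 'ai';

export const boundedStreamingService = (params: { prompt: string }) => {
  const model = bedrock('anthropic.claude-3-5-sonnet-20241022-v2:0');

  return streamText({
    model,
    prompt: params.prompt,
    // AbortSignal.timeout() (Node 18+) aborts the Bedrock stream after 55 seconds,
    // keeping the invocation under our 60-second integration timeout.
    abortSignal: AbortSignal.timeout(55_000),
    maxOutputTokens: 2000,
  });
};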

CDK infrastructure

CDK's L2 LambdaIntegration construct does not expose ResponseTransferMode or the streaming invocation URI. We use the CfnMethod escape hatch to override the underlying CloudFormation resource directly.

Here is the full stack. Read it top to bottom. We will break down the important parts after.

// lib/streaming-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import type { Construct } from 'constructs';
import * as path from 'node:path';

export class StreamingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const fn = new lambda.Function(this, 'StreamingHandler', {
      runtime: lambda.Runtime.NODEJS_22_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, '..', 'lambda'), {
        bundling: {
          image: lambda.Runtime.NODEJS_22_X.bundlingImage,
          command: [
            'bash', '-c',
            'npx esbuild index.ts --bundle --platform=node --target=node22 --outfile=/asset-output/index.mjs --format=esm --external:@aws-sdk/* && cp /asset-output/index.mjs /asset-output/index.js',
          ],
        },
      }),
      memorySize: 512,
      timeout: cdk.Duration.seconds(60),
      logRetention: logs.RetentionDays.ONE_WEEK,
    });

    fn.addToRolePolicy(
      new iam.PolicyStatement({
        actions: [
          'bedrock:InvokeModel',
          'bedrock:InvokeModelWithResponseStream',
        ],
        resources: ['arn:aws:bedrock:*::foundation-model/*'],
      }),
    );

    const api = new apigateway.RestApi(this, 'StreamingApi', {
      restApiName: 'streaming-llm-api',
      deployOptions: { stageName: 'v1' },
    });

    const streamResource = api.root.addResource('stream');

    streamResource.addCorsPreflight({
      allowOrigins: apigateway.Cors.ALL_ORIGINS,
      allowMethods: ['POST', 'OPTIONS'],
      allowHeaders: ['Content-Type'],
    });

    const postMethod = streamResource.addMethod(
      'POST',
      new apigateway.LambdaIntegration(fn, { proxy: true }),
    );

    // Escape hatch: override CloudFormation for streaming
    const cfnMethod = postMethod.node.defaultChild as apigateway.CfnMethod;

    cfnMethod.addPropertyOverride(
      'Integration.ResponseTransferMode', 'STREAM',
    );
    cfnMethod.addPropertyOverride(
      'Integration.TimeoutInMillis', 60_000,
    );
    cfnMethod.addPropertyOverride(
      'Integration.Uri',
      cdk.Fn.sub(
        'arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${FnArn}/response-streaming-invocations',
        { FnArn: fn.functionArn },
      ),
    );

    new cdk.CfnOutput(this, 'ApiUrl', {
      value: `${api.url}stream`,
    });
  }
}

The three addPropertyOverride calls are the entire difference between a buffered API and a streaming one. Everything else is standard CDK. Note that we also grant bedrock:InvokeModelWithResponseStream because Bedrock needs this permission for streaming model calls, not just InvokeModel.

Lambda handler with Middy

The Lambda handler is where the streaming behavior is configured on the compute side. Middy's streamifyResponse: true option wraps the handler with awslambda.streamifyResponse() internally, so we do not call the native AWS streaming API ourselves.

// lambda/index.ts
import middy from '@middy/core';
import httpCors from '@middy/http-cors';
import type { APIGatewayProxyEvent } from 'aws-lambda';
import { streamingService } from './streaming-service';

type HttpStreamResponse = {
  body: ReadableStream<Uint8Array> | string;
  headers: Record<string, string>;
  statusCode: number;
};

const streamHandler = async (
  event: APIGatewayProxyEvent,
): Promise<HttpStreamResponse> => {
  const body = JSON.parse(event.body ?? '{}');
  const prompt: string =
    body.prompt ?? 'Summarize the benefits of serverless architecture';

  const result = streamingService({ prompt });
  const response = result.toTextStreamResponse();

  return {
    body: response.body ?? '',
    headers: Object.fromEntries(response.headers.entries()),
    statusCode: response.status,
  };
};

export const handler = middy<APIGatewayProxyEvent, HttpStreamResponse>({
  streamifyResponse: true,
})
  .use(httpCors({ origins: ['*'] }))
  .handler(streamHandler);

The handler returns an object with { body, headers, statusCode }. When body is a ReadableStream<Uint8Array>, Middy detects this and pipes it to Lambda's response stream. When body is a plain string, Middy writes it as a buffered response.

This is the key difference from the native awslambda.streamifyResponse() pattern, where you receive a Writable stream as a handler argument and write to it directly.
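
For contrast, here is roughly what that native pattern looks like without Middy. This is a sketch, not the code we deploy; the awslambda global is injected by the Lambda Node.js runtime, so the declaration below only exists to satisfy TypeScript.

// Native pattern, for contrast (sketch only).
declare const awslambda: {
  streamifyResponse: (
    handler: (event: unknown, responseStream: NodeJS.WritableStream) => Promise<void>,
  ) => unknown;
  HttpResponseStream: {
    from: (
      stream: NodeJS.WritableStream,
      metadata: { statusCode: number; headers: Record<string, string> },
    ) => NodeJS.WritableStream;
  };
};

export const handler = awslambda.streamifyResponse(async (event, responseStream) => {
  // Attach HTTP metadata first, then write body chunks directly to the stream.
  const httpStream = awslambda.HttpResponseStream.from(responseStream, {
    statusCode: 200,
    headers: { 'content-type': 'text/plain; charset=utf-8' },
  });

  httpStream.write('first chunk...');
  httpStream.write('second chunk...');
  httpStream.end();
});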

The Middy pattern is valuable because it preserves the middleware chain. Middlewares like httpCors run in the before phase (setting up CORS headers) and the after phase (attaching them to the response).

The handler remains a pure function that takes an event and returns a response object; the streaming mechanics are abstracted away.

The HttpStreamResponse type accepts either ReadableStream<Uint8Array> or string for the body. This matters: if your handler needs to return an error before starting the stream (validation failure, auth error), it returns a plain string body with the appropriate status code. The stream is only used for the happy path.
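
A minimal sketch of that early-exit branch, placed at the top of streamHandler right after the JSON.parse call (the exact rule and message are illustrative):

// Sketch: buffered error response, returned before any stream is created.
if (typeof body.prompt !== 'string' || body.prompt.trim() === '') {
  return {
    statusCode: 400,
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ error: 'prompt must be a non-empty string' }),
  };
}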

How do you use AI SDK streamText with Bedrock Claude on Lambda?

The service layer calls Bedrock Claude through the AI SDK's streamText function with Output.object() for structured generation. This sends the prompt to the model and returns a streaming result that produces structured JSON validated against a Zod schema.

// lambda/streaming-service.ts
import { bedrock } from '@ai-sdk/amazon-bedrock';
import { Output, streamText } from 'ai';
import { z } from 'zod';

const responseSchema = z.object({
  summary: z.string().describe('A concise summary of the topic'),
  keywords: z.array(z.string()).describe('Relevant keywords'),
});

type StreamParams = {
  prompt: string;
};

export const streamingService = (params: StreamParams) => {
  const model = bedrock('anthropic.claude-3-5-sonnet-20241022-v2:0');

  return streamText({
    model,
    prompt: params.prompt,
    output: Output.object({ schema: responseSchema }),
    system:
      'You are a helpful assistant. Respond with a structured JSON object containing a summary and keywords about the topic the user asks about.',
    maxOutputTokens: 2000,
    temperature: 0.7,
  });
};

streamText with Output.object() does three things for us. It instructs the model to produce JSON matching our Zod schema. It streams the response token by token. And it exposes toTextStreamResponse(), which wraps the token stream in a standard Web Response object with a ReadableStream body.

That Response is what we destructure in the handler to extract body, headers, and status.

Why Output.object() instead of plain streamText? We want structured JSON output. The output: Output.object({ schema }) option instructs the model to generate a JSON object conforming to our Zod schema, and the AI SDK validates the final result against it.

Without the output option, streamText returns free-form text. For structured output use cases (data extraction, form generation, classification results), Output.object() is the right tool.

Note that partial objects streamed during generation are not validated against the schema. Only the final complete object is. If you need validation during streaming, you would validate partial results yourself. For most UI rendering use cases, the partial objects are good enough to display incrementally.
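
If you do want to check the completed object on the consuming side, a minimal sketch could reuse the same Zod schema, accumulate the streamed text, and validate once the stream ends. The readAndValidate name and the URL argument are placeholders for illustration:

// Sketch of consumer-side validation: accumulate the streamed text, then validate
// the completed object with the same Zod schema used on the Lambda side.
import { z } from 'zod';

const responseSchema = z.object({
  summary: z.string(),
  keywords: z.array(z.string()),
});

export const readAndValidate = async (url: string, prompt: string) => {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let text = '';

  // Render `text` incrementally if you like; it only parses as JSON once complete.
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    text += decoder.decode(value, { stream: true });
  }

  return responseSchema.parse(JSON.parse(text));
};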

Testing the endpoint

Deploy the stack and grab the API URL from the CloudFormation output:

cd step-1
npm install
npx cdk deploy

Then test with curl. The --no-buffer flag is critical. Without it, curl buffers output and you will not see the streaming effect:

curl -i --no-buffer \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the CAP theorem"}' \
  https://your-api-id.execute-api.us-east-1.amazonaws.com/v1/stream

You will see the JSON object build up incrementally:

HTTP/2 200
content-type: text/plain; charset=utf-8
access-control-allow-origin: *

{"summary":"The CAP theorem, also known as Brewer's theorem, states
that a distributed data store can only guarantee two of the following
three properties simultaneously...","keywords":["CAP theorem",
"consistency","availability","partition tolerance",...]}

A note on first-byte latency. You may notice a 1 to 2 second delay before the first chunk arrives, even though streaming is working correctly. This is expected Lambda behavior: Lambda buffers chunks smaller than approximately 10 KB before flushing them to the client.

The first few tokens from the model are small, so they accumulate in the buffer. Once enough data builds up or the buffer timeout triggers, the first chunk flushes and subsequent chunks flow smoothly. This is not a bug in your code or configuration.

What we built

We now have a working streaming pipeline. A POST request hits API Gateway, which invokes Lambda in streaming mode. Middy wraps the handler with streamifyResponse, the AI SDK calls Bedrock Claude with streamText and Output.object(), and the structured JSON streams back to the client as it generates.

The key pieces are:

  • API Gateway: ResponseTransferMode: STREAM and the response-streaming-invocations URI, set through CDK escape hatches.
  • Lambda/Middy: streamifyResponse: true with a handler that returns { body: ReadableStream }.
  • AI SDK: streamText with Output.object() and a Zod schema, converted to a Web Response via toTextStreamResponse().

What is next

This clean architecture hides a subtle timing bug. When Middy streams a response, the after middleware hook fires at the wrong moment: before the stream has finished writing. Any observability logic you put in after (logging LLM metrics, recording token counts, publishing to CloudWatch) will execute with incomplete data or not at all.

Nobody has documented this behavior, and it breaks every middleware-based observability pattern you would naturally reach for.

In the next post, we will reproduce this bug, trace it through Middy's source code, and understand exactly why it happens.


All code for this series is available in the companion repository. Each post corresponds to a step-N/* branch with a self-contained, deployable CDK project.
