Zac_A_Clifton
Benchmarking Vercel AI Gateway against the native Anthropic SDK

We're building SalesSage (not fully announced yet), an AI-powered platform with the goal of turning anyone into a salesperson.

One of our core features is real-time audio transcript analysis with AI systems.

That means making a lot of calls and sending a lot of context to Claude and other AIs.

Latency matters for us because we want to respond in near real time to what's being discussed.

So we wanted to answer a simple question: is routing our API calls through the Vercel AI Gateway slower than hitting Anthropic directly?

TL;DR

  • At small prompts (~10 tokens), the native Anthropic SDK is ~15-20% faster than the Vercel AI Gateway
  • At large context (120K tokens, 60% of the context window), the difference between native and gateway nearly vanishes
  • Gateway has occasional latency spikes that blow up tail latency — p99 TTFB spiked to 5.6s on one Sonnet call, though it's not statistically significant.
  • Tier 1 Anthropic rate limits (30K input tokens/min) make large context calls through the native SDK impractical without significant delays
  • The gateway handles rate limits for you, which is a real advantage

The setup

We wrote a benchmark suite in TypeScript that tests two providers:

  1. Native Anthropic SDK (@anthropic-ai/sdk) — direct API calls to Anthropic
  2. Vercel AI Gateway (gateway() from the ai package) — Anthropic routed through Vercel's proxy
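
For reference, the two code paths are wired up roughly like this (a minimal sketch; API keys come from the environment, and the gateway model slugs are our assumption about the exact identifiers):

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { gateway } from "ai";

// Path 1: native SDK client -- requests go straight to Anthropic's API.
// Reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

// Path 2: the same Anthropic models, addressed through Vercel's AI Gateway.
// Gateway model IDs are "provider/model" slugs; exact slugs assumed here.
const gatewaySonnet = gateway("anthropic/claude-sonnet-4");
const gatewayOpus = gateway("anthropic/claude-opus-4");
```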

Each provider was tested with Claude Sonnet 4 and Claude Opus 4, measuring:

  • TTFB (time to first token) via streaming
  • Total completion time via non-streaming

We ran two variants: a small prompt (~10 tokens, 5 iterations) and a large context prompt (120K tokens calibrated via Anthropic's countTokens API, 3 iterations).
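
Here's roughly how the TTFB measurement works on each path (a minimal sketch; the model IDs, prompts, and token limits are placeholders rather than our exact benchmark code):

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { gateway, streamText } from "ai";

const anthropic = new Anthropic();

// Native SDK: stream the response and record when the first content delta arrives.
async function nativeTtfbMs(prompt: string): Promise<number> {
  const start = performance.now();
  const stream = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514", // assumed model ID
    max_tokens: 256,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta") {
      return performance.now() - start; // first token seen
    }
  }
  throw new Error("stream ended without any content");
}

// Gateway: same idea via the AI SDK's streamText and its textStream iterator.
async function gatewayTtfbMs(prompt: string): Promise<number> {
  const start = performance.now();
  const result = streamText({
    model: gateway("anthropic/claude-sonnet-4"), // assumed gateway slug
    prompt,
  });
  for await (const _chunk of result.textStream) {
    return performance.now() - start; // first text chunk seen
  }
  throw new Error("stream ended without any content");
}
```

Total completion time is the same idea without streaming: wrap the non-streaming call in the same timer and record the full duration.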

We created a calibration step that used binary search with client.messages.countTokens() to build a prompt that lands within 500 tokens of our 120K target.
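
The calibration is a straightforward binary search over prompt length (a minimal sketch of the idea; the filler text and model ID are placeholders):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();
const MODEL = "claude-sonnet-4-20250514"; // assumed model ID
const TARGET_TOKENS = 120_000;
const TOLERANCE = 500;

// Count tokens for a candidate prompt using the API's exact tokenizer.
async function countTokens(prompt: string): Promise<number> {
  const res = await anthropic.messages.countTokens({
    model: MODEL,
    messages: [{ role: "user", content: prompt }],
  });
  return res.input_tokens;
}

// Binary-search the number of repetitions of a filler passage until the
// measured token count lands within TOLERANCE of TARGET_TOKENS.
async function buildPrompt(filler: string): Promise<string> {
  // Tokens contributed by one copy of the filler; used to bound the search range.
  const perRep = Math.max(await countTokens(filler), 1);
  let lo = 1;
  let hi = Math.ceil((TARGET_TOKENS / perRep) * 1.5);
  let best = filler;

  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    const candidate = filler.repeat(mid);
    const tokens = await countTokens(candidate);

    if (Math.abs(tokens - TARGET_TOKENS) <= TOLERANCE) return candidate;
    if (tokens < TARGET_TOKENS) {
      best = candidate; // closest-from-below candidate so far
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best;
}
```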

Small prompt results

With a tiny prompt, the native SDK wins consistently:

Small prompt latency comparison — Native SDK vs Gateway

No real surprise here. The gateway adds a proxy hop, and at small payloads that hop is a measurable percentage of the total request time. Sonnet shows roughly 200ms of overhead; Opus is nearly identical.

Both models perform similarly at small context sizes, with Sonnet and Opus TTFB within 120ms of each other.

Large context results — this is where it gets interesting

We filled 60% of the 200K context window (~120K tokens of business meeting notes) and re-ran everything:

Large context latency comparison — Native SDK vs Gateway at 120K tokens

Wait — Opus through the gateway is basically the same speed as native? And total completion is actually faster through the gateway?

At large context sizes, the model's processing time dominates so heavily that the gateway proxy hop becomes noise. The ~200ms overhead that mattered at 10 tokens is irrelevant when the model is chewing through 120K tokens for 5+ seconds.

Opus gets hit harder than Sonnet

Context size doesn't affect all models equally:

Context size impact — slowdown multiplier going from 10 tokens to 120K

Opus TTFB jumps 4x when you fill the context window, vs Sonnet's 2.5x. At small prompts they're nearly identical, but at 120K tokens Opus takes almost twice as long as Sonnet to produce the first token (4.8s vs 2.6s).

Since time to first token matters for our product, this told us that at large context sizes we have to build exclusively around streaming UI and real-time summaries.

The tail tells a different story

p50 looks clean. p99 does not.

p99 tail latency at small prompt — Gateway Sonnet spikes 4.5x

That red bar on Gateway Sonnet? A 5.6s TTFB on a small prompt. That's a 4.5x multiplier over p50. The native SDK stays tight at 1.4x.

At large context sizes, the tail is less dramatic, but the pattern holds:

p99 tail latency at large context — Gateway Opus reaches 6.7s

The native SDK stays tight and consistent: p99 is 1.0-1.1x of p50 across the board. The gateway has wider variance and can spike to 6.7s on Opus.
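
For what it's worth, the percentiles come from a small helper over the raw samples (a minimal sketch; we use a simple nearest-rank method, and at small iteration counts p99 is effectively the worst observed sample):

```typescript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Example with hypothetical TTFB samples in ms.
const ttfb = [1180, 1210, 1240, 1300, 5600];
console.log(percentile(ttfb, 50)); // 1240
console.log(percentile(ttfb, 99)); // 5600 -- one spike dominates the tail
```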

We're not sure how much weight to give this yet. It's something we'll keep evaluating for our latency-sensitive paths, since we haven't decided how much predictable tail behavior we actually need.

The rate limit surprise

Here's what we didn't expect to be the biggest finding: Tier 1 Anthropic rate limits make large context calls through the native SDK impractical.

Our org is currently on Tier 1 since we are in closed alpha. This means we can only send 30K input tokens per minute for Sonnet and Opus. A single 120K-token request consumes 4 minutes of rate limit budget. Not 4 minutes of clock time — 4 minutes of token budget.

We initially tried 75-second delays between calls. Still got 429'd. Had to bump to 240-second (4-minute) delays, turning a benchmark into a 90-minute affair.
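
The arithmetic behind the 240-second delay is simple: at 30K input tokens per minute, a 120K-token request eats four minutes of budget, so requests have to be paced accordingly (a minimal sketch; the limit mirrors our Tier 1 numbers):

```typescript
// Tier 1 input-token budget for Sonnet/Opus (tokens per minute).
const INPUT_TOKENS_PER_MINUTE = 30_000;

// Minimum delay (in ms) before the next request of a given size, assuming the
// previous request consumed its full share of the per-minute budget.
function minDelayMs(inputTokens: number): number {
  const minutesOfBudget = inputTokens / INPUT_TOKENS_PER_MINUTE;
  return Math.ceil(minutesOfBudget * 60_000);
}

console.log(minDelayMs(120_000)); // 240000 ms -> the 4-minute delay we ended up using
console.log(minDelayMs(10));      // tiny prompts are effectively unconstrained
```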

The Vercel AI Gateway? Processed every 120K token request without a single rate limit error. The gateway operates on its own rate limit tier, which for large context workloads is a real competitive advantage.

What we'd do differently

A few things we learned running these benchmarks:

  • @ai-sdk/anthropic@3.x is incompatible with ai@5.x at runtime. The SDK returns LanguageModelV3 models, but ai@5.x only supports V2. Type-casting compiles fine but the runtime check rejects it. We had to drop the Vercel AI SDK provider entirely.
  • Use countTokens for calibration. Don't estimate token counts from character length. The API gives you exact numbers — use it.
  • Budget ~$20 for a full benchmark run. Large context calls at 120K tokens add up fast across multiple providers, models, and iterations.

The bottom line

For short prompts, the native Anthropic SDK is ~15% faster with tighter tail latency. If you're making lots of small, fast calls, go direct.

For large context (which is our actual production use case at SalesSage — meeting prep briefs packed with transcripts, CRM data, and company research), the gateway is effectively the same speed, handles rate limits transparently, and gives you observability through the Vercel dashboard for free.

Conclusion:
Based on this data and the other benefits of an AI gateway, we're going with the Vercel AI Gateway, since most of our calls will be large-context.
