A single torch.cuda.synchronize() in the wrong place can erase every optimization you spent weeks building. Your GPU sits idle, your pipeline stalls, and your inference latency doubles. In vLLM's distributed serving stack, tensors move between GPUs constantly: billions of parameters shuffled during weight updates, and key-value cache blocks shipped between nodes during live inference. The codebase solves these two problems with two radically different async patterns -- and studying them side-by-side reveals a masterclass in GPU concurrency.
The Problem Is Waiting
Imagine you need to get 14GB of model weights (a 7B parameter model in BF16) from a trainer process onto your inference GPU. The naive approach: transfer everything, wait, then resume inference.
At NVLink bandwidth (~900 GB/s), that's ~15ms. Not terrible. But at PCIe 5.0 (~40 GB/s effective), it's 350ms of dead time. And during RLHF training, this happens every few hundred steps.
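The arithmetic behind those numbers, for the skeptical:

# Rough transfer-time math for 7B parameters in BF16 (~14 GB).
model_bytes = 7e9 * 2                               # 2 bytes per BF16 parameter
for name, gb_per_s in [("NVLink", 900), ("PCIe 5.0 effective", 40)]:
    ms = model_bytes / (gb_per_s * 1e9) * 1e3
    print(f"{name}: {ms:.1f} ms")                   # NVLink: 15.6 ms, PCIe: 350.0 ms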
vLLM faces this problem in two distinct contexts:
- Weight distribution: Getting updated model weights from a trainer to inference workers (bulk, periodic)
- KV cache migration: Shipping key-value cache blocks between disaggregated prefill and decode nodes (streaming, continuous)
Each context demands a different async pattern. Let's build the understanding from scratch.
CUDA Streams: Separate Lanes on the Same Highway
Before we can overlap anything, we need the mechanism that makes overlap possible.
Think of a factory with multiple conveyor belts feeding the same set of assembly stations. Each belt keeps its items in order. The stations (GPU compute units) pull work from any belt that has items ready:
Stream 0 (default): [kernel A] → [kernel B] → [kernel C]
Stream 1 (transfer): [NCCL broadcast chunk 0] → [NCCL broadcast chunk 1]
Kernel A and the broadcast can run AT THE SAME TIME on the GPU.
There is NO ordering between streams unless you add one.
A CUDA stream is just an ordered queue of GPU operations. Operations within a stream execute sequentially. Operations across streams can overlap. That's the entire mental model.
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    # Everything here goes on the new stream, not the default one
    do_transfer()
# Back on the default stream -- transfer and default work can overlap
The key primitive: stream.synchronize() blocks the CPU until all operations on that one stream finish. It does NOT wait for other streams. This surgical waiting is what makes pipelining possible.
Quick recap: A CUDA stream is an ordered queue. Two streams run independently. stream.synchronize() waits for just one. Everything that follows builds on this single idea.
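Here is a minimal, self-contained illustration of that recap: a host-to-device copy queued on its own stream overlapping a matmul on the default stream, with a sync on the copy stream only. The tensors and sizes are arbitrary, not vLLM code.

import torch

copy_stream = torch.cuda.Stream()

x = torch.randn(4096, 4096, device="cuda")
host_buf = torch.randn(4096, 4096, pin_memory=True)   # pinned memory enables a truly async copy
dev_buf = torch.empty_like(x)

with torch.cuda.stream(copy_stream):
    dev_buf.copy_(host_buf, non_blocking=True)         # queued on copy_stream

y = x @ x                                              # queued on the default stream, overlaps the copy

copy_stream.synchronize()   # wait for the copy only; the matmul may still be running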
The uint8 Trick: One Blob to Rule Them All
Before we can pipeline transfers, we need to solve a packaging problem. A model has weights in different dtypes -- bfloat16, float32, maybe int8 quantization scales. An NCCL broadcast moves one contiguous buffer. And you can't torch.cat tensors of different dtypes.
The solution: everything is just bytes in memory.
# packed_tensor.py, lines 62-69
# Get weight tensor (any dtype: bf16, fp32, int8...)
tensor = (
    post_iter_func(next(iterator))
    .contiguous()           # Ensure contiguous memory
    .view(torch.uint8)      # Reinterpret as raw bytes -- NO COPY
    .view(-1)               # Flatten to 1D for heterogeneous cat
)
packing_tensor_list[buffer_idx].append(tensor)
packing_tensor_sizes[buffer_idx] += tensor.numel()
.view(torch.uint8) is NOT a cast or a copy. It reinterprets the same memory as raw bytes -- zero cost. A bfloat16 tensor of shape [4096, 4096] becomes 33,554,432 uint8 values. Now every weight looks the same, and you can torch.cat them into one contiguous blob.
The consumer reverses the process using stored metadata (name, shape, dtype):
# Consumer unpacks: raw bytes → original dtype → original shape
tensor.contiguous().view(dtype).view(*shape)
Both sides iterate in the same order. If the order mismatches, you get silent corruption -- more on this in the antipatterns section.
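Here's the whole trick as a self-contained round trip you can run on CPU. The tensor names and the metadata tuple layout are made up for illustration; vLLM keeps its own metadata alongside the packed buffer.

import torch

weights = {
    "layer0.weight": torch.randn(4, 8, dtype=torch.bfloat16),
    "scales": torch.rand(8, dtype=torch.float32),
}

# Producer: reinterpret every tensor as raw bytes, remember (name, shape, dtype, nbytes) in order.
meta = [(name, t.shape, t.dtype, t.numel() * t.element_size()) for name, t in weights.items()]
blob = torch.cat([t.contiguous().view(torch.uint8).view(-1) for t in weights.values()])

# Consumer: walk the blob in the SAME order, slicing by byte size.
restored, offset = {}, 0
for name, shape, dtype, nbytes in meta:
    chunk = blob[offset:offset + nbytes].clone()   # copy so the byte view starts at offset 0
    restored[name] = chunk.view(dtype).view(*shape)
    offset += nbytes

assert torch.equal(restored["scales"], weights["scales"])

The round trip only works because both loops walk the same insertion order -- exactly the invariant above.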
Double-Buffered Weight Transfer: The Assembly Line
This is the centerpiece. vLLM's packed_tensor.py defines two constants:
# packed_tensor.py, lines 13-14
DEFAULT_PACKED_BUFFER_SIZE_BYTES = 1024 * 1024 * 1024 # 1GB
DEFAULT_PACKED_NUM_BUFFERS = 2
1GB per buffer. Two buffers. The model weights get chunked into ~1GB pieces, and the two buffers alternate roles: one is being packed with fresh weights while the other's broadcast is still in flight.
Here's the actual producer loop:
# packed_tensor.py, lines 39-86 (simplified)
streams = [torch.cuda.Stream() for _ in range(num_buffers)]
buffer_idx = 0
while True:
    streams[buffer_idx].synchronize()  # GATE: wait for buffer
    with torch.cuda.stream(streams[buffer_idx]):
        try:
            packing_tensor_list[buffer_idx] = []
            packing_tensor_sizes[buffer_idx] = 0
            while True:
                tensor = (post_iter_func(next(iterator))
                          .contiguous().view(torch.uint8).view(-1))
                packing_tensor_list[buffer_idx].append(tensor)
                packing_tensor_sizes[buffer_idx] += tensor.numel()
                if packing_tensor_sizes[buffer_idx] > target_size:
                    break
            packed = torch.cat(packing_tensor_list[buffer_idx])
            group.broadcast(packed, src=src)  # ASYNC on GPU!
            buffer_idx = (buffer_idx + 1) % num_buffers  # ROTATE
        except StopIteration:
            # Flush final partial buffer
            if len(packing_tensor_list[buffer_idx]) > 0:
                packed = torch.cat(packing_tensor_list[buffer_idx])
                group.broadcast(packed, src=src)
            break
The timeline shows the overlap:
Time ──────────────────────────────────────────────────────►
Stream 0: ┌──pack──┐┌──broadcast──┐          ┌──pack──┐┌──broadcast──┐
          │ buf[0] ││    buf[0]   │          │ buf[0] ││    buf[0]   │
          └────────┘└─────────────┘          └────────┘└─────────────┘
          ▲ sync                             ▲ sync
Stream 1:           ┌──pack──┐┌──broadcast──┐          ┌──pack──┐
                    │ buf[1] ││    buf[1]   │          │ buf[1] │
                    └────────┘└─────────────┘          └────────┘
                    ▲ sync
Iteration 0 (buffer 0, Stream 0): Pack ~1GB of weights, broadcast via NCCL. The broadcast is GPU-async -- returns to CPU immediately.
Iteration 1 (buffer 1, Stream 1): streams[1].synchronize() returns instantly (empty stream). Pack into buffer 1, broadcast. Stream 0's broadcast may still be running. This is the overlap.
Iteration 2 (buffer 0, Stream 0): streams[0].synchronize() -- the critical call. We're about to reuse buffer 0. Must wait for Stream 0's previous broadcast to finish. Then safely overwrite.
Why Does the Broadcast Return Instantly?
Because NCCL operations are asynchronous on the GPU:
# pynccl.py, lines 342-366
def broadcast(self, tensor: torch.Tensor, src: int, stream=None):
    if stream is None:
        stream = current_stream()
    if src == self.rank:
        sendbuff = buffer_type(tensor.data_ptr())
        recvbuff = buffer_type(tensor.data_ptr())  # Required by NCCL
    else:
        sendbuff = buffer_type()  # Null -- receiver doesn't send
        recvbuff = buffer_type(tensor.data_ptr())
    self.nccl.ncclBroadcast(
        sendbuff, recvbuff, tensor.numel(),
        ncclDataTypeEnum.from_torch(tensor.dtype),
        src, self.comm,
        cudaStream_t(stream.cuda_stream),
    )
    # No synchronize()! Returns to CPU immediately.
    # Caller synchronizes at TOP of next iteration.
No synchronize() after the call. The synchronization happens at the top of the next iteration, when we need to reuse the buffer.
Also critical: NCCL broadcast is a collective. Every process must call it the same number of times, in the same order. Mismatch = deadlock.
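Here's roughly what the consumer side has to look like to satisfy both constraints: one broadcast call per producer broadcast, unpacked in the same order. This is a sketch; the names chunk_sizes, chunk_metadata, and state_dict are illustrative, not vLLM's exact API.

# Sketch of a receiving rank, one iteration per producer broadcast, in the same order.
for chunk_nbytes, metadata in zip(chunk_sizes, chunk_metadata):
    recv_buffer = torch.empty(chunk_nbytes, dtype=torch.uint8, device="cuda")
    group.broadcast(recv_buffer, src=src)        # pairs 1:1 with a producer broadcast
    torch.cuda.current_stream().synchronize()    # bytes are valid only after this

    offset = 0
    for name, shape, dtype, nbytes in metadata:  # same order the producer packed
        chunk = recv_buffer[offset:offset + nbytes].clone()  # own the bytes before reuse
        state_dict[name] = chunk.view(dtype).view(*shape)
        offset += nbytes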
Checkpoint: We've covered streams (independent lanes), uint8 packing (uniform bytes), and double-buffering (overlap via alternating buffers). This pattern handles bulk, structured transfers -- but still blocks the caller. What if we need to not block at all?
KV Connector Background Threads: Transfer Without Stopping Inference
In disaggregated serving, prefill nodes compute KV caches and ship them to decode nodes. This must happen during live inference without stalling the decode engine.
vLLM's p2p_nccl_engine.py takes a fundamentally different approach: background Python threads with producer-consumer queues.
The main inference thread does this:
# p2p_nccl_engine.py, lines 254-258
if self.send_type == "PUT_ASYNC":
    with self.send_queue_cv:
        self.send_queue.append(item)
        self.send_queue_cv.notify()  # Wake background thread
    return True  # Returns IMMEDIATELY to inference loop
Five lines. Append to a deque, notify a condition variable, return. Zero blocking.
Meanwhile, a daemon thread drains the queue:
# p2p_nccl_engine.py, lines 476-484
def send_async(self):
    """Background daemon thread -- drains send queue."""
    while True:
        with self.send_queue_cv:
            while not self.send_queue:
                self.send_queue_cv.wait()  # Releases GIL, sleeps
            item = self.send_queue.popleft()
            if not self.send_queue:
                self.send_queue_cv.notify()  # Signal: queue drained
        self.send_sync(item)  # NCCL send + stream.synchronize()
The background thread sleeps on send_queue_cv.wait() (zero CPU cost), wakes when notified, and performs the NCCL P2P send on a dedicated CUDA stream (self.send_stream).
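Roughly what that send_sync step has to do, as a sketch (the item layout and the self.nccl.send wrapper are assumptions, not the exact vLLM code): issue the NCCL send on the dedicated stream, then wait for that stream only.

def send_sync(self, item):
    """Sketch: push one tensor over NCCL on the dedicated send stream."""
    tensor_id, remote_rank, tensor = item            # assumed item layout
    with torch.cuda.stream(self.send_stream):        # dedicated stream, never the default
        self.nccl.send(tensor, dst=remote_rank)      # assumed wrapper around ncclSend
    self.send_stream.synchronize()                   # tensor can now be freed or reused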
Inference Thread                      Background Threads
────────────────                      ──────────────────
decode step                           ┌───────────────────┐
send_tensor()                         │    Send Thread    │
  .append() + .notify()  ────────────►│    cv.wait()      │
  return True (instantly!)            │    ncclSend(item) │
next decode step                      └───────────────────┘

recv_tensor()                         ┌───────────────────┐
  cv.wait()  ◄────────────────────────│  Listener Thread  │
  tensor = recv_store[id]             │    zmq.poll()     │
  continue inference                  │  ncclRecv(tensor) │
                                      │    cv.notify()    │
                                      └───────────────────┘
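For completeness, here's a sketch of what the listener side of that diagram could look like. The socket setup, message format, and helper names are assumptions; the point is the shape: block on ZMQ for metadata, receive on a dedicated stream, publish under the condition variable.

def listen_for_tensors(self):
    """Sketch of a listener daemon on the decode node (names are illustrative)."""
    while True:
        meta = self.zmq_socket.recv_pyobj()              # blocks outside the GIL
        tensor = torch.empty(meta["shape"], dtype=meta["dtype"], device="cuda")
        with torch.cuda.stream(self.recv_stream):        # dedicated transfer stream
            self.nccl.recv(tensor, src=meta["src_rank"]) # assumed wrapper around ncclRecv
        self.recv_stream.synchronize()
        with self.recv_store_cv:
            self.recv_store[meta["tensor_id"]] = tensor
            self.recv_store_cv.notify_all()              # wake any waiting recv_tensor() call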
The GIL Question
Python's GIL means only one thread runs Python at a time. How does this overlap?
NCCL calls are C library calls that release the GIL. While the send thread is inside ncclSend(), the inference thread is free to run. The threads work because they spend 99% of their time in GIL-releasing C calls:
Main thread:  [Python][Python][Python][Python]
Send thread:  [cv.wait][ncclSend...........][cv.wait]
              GIL-free  GIL-free (C lib)    GIL-free
Implicit vs. Explicit Coordination
| Dimension | Weight Transfer | KV Connector |
|---|---|---|
| Coordination | Implicit (NCCL collective) | Explicit (ZMQ + queues) |
| Overlap target | Pack ↔ Broadcast | Transfer ↔ Inference |
| Blocks inference? | Yes | No |
| Concurrency | 2 CUDA streams | OS threads + dedicated streams |
| Failure mode | Hang (collective mismatch) | Timeout (missed ZMQ) |
| Best for | Bulk, periodic | Streaming, continuous |
Weight transfer coordinates implicitly through collectives -- simple but rigid. KV connector coordinates explicitly through ZMQ messages -- more machinery but fully async.
Eight Antipatterns
Synchronization
1. Syncing the wrong stream. streams[1].synchronize() when you needed streams[0]. Buffer 0 gets corrupted mid-broadcast. Silent wrong answers.
2. Missing synchronize entirely. Works under light load, race condition under heavy load. Nondeterministic, passes tests, fails in production.
3. torch.cuda.synchronize() instead of stream.synchronize(). Waits for ALL streams. Correctness maintained, performance destroyed (2-5x slower). See the sketch at the end of this section.
Buffers
4. One buffer (num_buffers=1). Must sync after every broadcast. All complexity, zero overlap.
5. Too many buffers (num_buffers=8). 8GB of VRAM wasted. Near-zero benefit over 2 buffers -- bandwidth is the bottleneck, not packing speed.
Coordination
6. Mismatched iteration order. Producer: [layer0, layer1, layer2]. Consumer: [layer2, layer0, layer1]. Every collective matches. No hang. Wrong bytes in wrong layers. Silent catastrophe.
7. Forgetting NCCL is a collective. One process skips a broadcast via early-exit. Others block forever. Deadlock, no error message.
8. Default stream from background thread. Send thread doesn't specify a stream. NCCL work goes on the default stream. Serializes with inference. Zero overlap despite threading complexity. Fix: self.send_stream = torch.cuda.Stream().
Bonus: The GIL Illusion
Python threads only provide parallelism when threads spend time in C extensions that release the GIL. If your background thread does tensor slicing in Python between NCCL calls, it holds the GIL and blocks inference. Keep the thread body a tight loop of C calls.
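To make antipattern 3 concrete (illustrative, not from vLLM):

# Antipattern 3: waits for EVERY stream on the device --
# including whatever inference kernels are still in flight.
torch.cuda.synchronize()

# Surgical version: waits only for the stream whose buffer you are about to overwrite.
streams[buffer_idx].synchronize()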
The Takeaway
These two patterns form a reusable vocabulary for GPU concurrency:
- Double-buffered CUDA streams: Overlap data packing with transfer. Use for bulk, periodic transfers.
- Background threads + producer-consumer queues: Decouple transfer from the critical path. Use for streaming, continuous transfers.
The meta-lesson: the hardest part of GPU concurrency is not making things fast -- it's making things correct while fast. stream.synchronize() is one line of code that separates "works" from "silently corrupts." The antipatterns exist because the failure mode of a missing sync is not a crash but wrong answers.
The next time you see a stream.synchronize() call in GPU code, don't skip over it. That one line is the load-bearing wall.
Code from vLLM's main branch: vllm/distributed/weight_transfer/packed_tensor.py and vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py.