Mayank Ketkar

Two Ways to Move Tensors Without Stopping: Inside vLLM's Async GPU Transfer Patterns

A single torch.cuda.synchronize() in the wrong place can erase every optimization you spent weeks building. Your GPU sits idle, your pipeline stalls, and your inference latency doubles. In vLLM's distributed serving stack, tensors move between GPUs constantly: billions of parameters shuffled during weight updates, and key-value cache blocks shipped between nodes during live inference. The codebase solves these two problems with two radically different async patterns -- and studying them side-by-side reveals a masterclass in GPU concurrency.

The Problem Is Waiting

Imagine you need to get 14GB of model weights (a 7B parameter model in BF16) from a trainer process onto your inference GPU. The naive approach: transfer everything, wait, then resume inference.

At NVLink bandwidth (~900 GB/s), that's ~15ms. Not terrible. But at PCIe 5.0 (~40 GB/s effective), it's 350ms of dead time. And during RLHF training, this happens every few hundred steps.
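
The arithmetic is worth doing once by hand. A quick back-of-the-envelope sketch (just the numbers from the paragraph above, nothing vLLM-specific):

# Rough transfer-time estimates for a 7B-parameter BF16 model
model_bytes = 7e9 * 2        # 2 bytes per BF16 parameter, ~14 GB total
nvlink_bw = 900e9            # ~900 GB/s
pcie5_bw = 40e9              # ~40 GB/s effective

print(f"NVLink:   {model_bytes / nvlink_bw * 1e3:.0f} ms")   # ~16 ms
print(f"PCIe 5.0: {model_bytes / pcie5_bw * 1e3:.0f} ms")    # ~350 ms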

vLLM faces this problem in two distinct contexts:

  1. Weight distribution: Getting updated model weights from a trainer to inference workers (bulk, periodic)
  2. KV cache migration: Shipping key-value cache blocks between disaggregated prefill and decode nodes (streaming, continuous)

Each context demands a different async pattern. Let's build the understanding from scratch.

CUDA Streams: Separate Lanes on the Same Highway

Before we can overlap anything, we need the mechanism that makes overlap possible.

Think of a factory with multiple conveyor belts feeding the same set of assembly stations. Each belt keeps its items in order. The stations (GPU compute units) pull work from any belt that has items ready:

Stream 0 (default):  [kernel A] → [kernel B] → [kernel C]
Stream 1 (transfer):  [NCCL broadcast chunk 0] → [NCCL broadcast chunk 1]

Kernel A and the broadcast can run AT THE SAME TIME on the GPU.
There is NO ordering between streams unless you add one.

A CUDA stream is just an ordered queue of GPU operations. Operations within a stream execute sequentially. Operations across streams can overlap. That's the entire mental model.

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    # Everything here goes on the new stream, not the default one
    do_transfer()
# Back on the default stream -- transfer and default work can overlap

The key primitive: stream.synchronize() blocks the CPU until all operations on that one stream finish. It does NOT wait for other streams. This surgical waiting is what makes pipelining possible.

Quick recap: A CUDA stream is an ordered queue. Two streams run independently. stream.synchronize() waits for just one. Everything that follows builds on this single idea.
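
To make that surgical waiting concrete, here is a minimal standalone sketch (not vLLM code): queue a host-to-device copy on its own stream, keep computing on the default stream, then wait for the transfer stream only.

import torch

transfer_stream = torch.cuda.Stream()

weights = torch.randn(4096, 4096, device="cuda")
host_buf = torch.randn(16 * 1024 * 1024, pin_memory=True)  # pinned memory enables async H2D copy
device_buf = torch.empty_like(host_buf, device="cuda")

with torch.cuda.stream(transfer_stream):
    device_buf.copy_(host_buf, non_blocking=True)  # queued on transfer_stream

result = weights @ weights     # default stream -- can overlap with the copy above

transfer_stream.synchronize()  # CPU waits for the copy only, not the matmul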

The uint8 Trick: One Blob to Rule Them All

Before we can pipeline transfers, we need to solve a packaging problem. A model has weights in different dtypes -- bfloat16, float32, maybe int8 quantization scales. NCCL broadcasts a single contiguous buffer. You can't torch.cat tensors of different dtypes.

The solution: everything is just bytes in memory.

# packed_tensor.py, lines 62-69
# Get weight tensor (any dtype: bf16, fp32, int8...)
tensor = (
    post_iter_func(next(iterator))
    .contiguous()           # Ensure contiguous memory
    .view(torch.uint8)      # Reinterpret as raw bytes — NO COPY
    .view(-1)               # Flatten to 1D for heterogeneous cat
)
packing_tensor_list[buffer_idx].append(tensor)
packing_tensor_sizes[buffer_idx] += tensor.numel()

.view(torch.uint8) is NOT a cast or a copy. It reinterprets the same memory as raw bytes -- zero cost. A bfloat16 tensor of shape [4096, 4096] becomes 33,554,432 uint8 values. Now every weight looks the same, and you can torch.cat them into one contiguous blob.

The consumer reverses the process using stored metadata (name, shape, dtype):

# Consumer unpacks: raw bytes → original dtype → original shape
tensor.contiguous().view(dtype).view(*shape)

Both sides iterate in the same order. If the order mismatches, you get silent corruption -- more on this in the antipatterns section.
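
Here is the whole round trip in miniature -- an illustrative sketch, not the vLLM helper itself, with the metadata boiled down to a list of (shape, dtype) pairs:

import torch

w0 = torch.randn(4, 8, dtype=torch.bfloat16)
w1 = torch.randint(-128, 127, (16,), dtype=torch.int8)

# Producer: record (shape, dtype), then flatten everything to raw bytes
meta = [(w.shape, w.dtype) for w in (w0, w1)]
blob = torch.cat([w.contiguous().view(torch.uint8).view(-1) for w in (w0, w1)])

# Consumer: walk the blob in the SAME order, reversing the byte view
offset, restored = 0, []
for shape, dtype in meta:
    nbytes = shape.numel() * dtype.itemsize
    chunk = blob[offset:offset + nbytes]
    restored.append(chunk.contiguous().view(dtype).view(*shape))
    offset += nbytes

assert torch.equal(restored[0], w0) and torch.equal(restored[1], w1)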

Double-Buffered Weight Transfer: The Assembly Line

This is the centerpiece. vLLM's packed_tensor.py defines two constants:

# packed_tensor.py, lines 13-14
DEFAULT_PACKED_BUFFER_SIZE_BYTES = 1024 * 1024 * 1024  # 1GB
DEFAULT_PACKED_NUM_BUFFERS = 2

1GB per buffer. Two buffers. The model weights get chunked into ~1GB pieces, and the two buffers alternate roles: one is being packed with fresh weights while the other's broadcast is still in flight.

Here's the actual producer loop:

# packed_tensor.py, lines 39-86 (simplified)
streams = [torch.cuda.Stream() for _ in range(num_buffers)]
buffer_idx = 0

while True:
    streams[buffer_idx].synchronize()  # GATE: wait for buffer
    with torch.cuda.stream(streams[buffer_idx]):
        try:
            packing_tensor_list[buffer_idx] = []
            packing_tensor_sizes[buffer_idx] = 0
            while True:
                tensor = (post_iter_func(next(iterator))
                    .contiguous().view(torch.uint8).view(-1))
                packing_tensor_list[buffer_idx].append(tensor)
                packing_tensor_sizes[buffer_idx] += tensor.numel()
                if packing_tensor_sizes[buffer_idx] > target_size:
                    break
            packed = torch.cat(packing_tensor_list[buffer_idx])
            group.broadcast(packed, src=src)  # ASYNC on GPU!
            buffer_idx = (buffer_idx + 1) % num_buffers  # ROTATE
        except StopIteration:
            # Flush final partial buffer
            if len(packing_tensor_list[buffer_idx]) > 0:
                packed = torch.cat(packing_tensor_list[buffer_idx])
                group.broadcast(packed, src=src)
            break

The timeline shows the overlap:

Time ──────────────────────────────────────────────────────►

Stream 0:  ┌──pack──┐┌──broadcast──┐          ┌──pack──┐┌──broadcast──┐
           │ buf[0] ││   buf[0]    │          │ buf[0] ││   buf[0]    │
           └────────┘└─────────────┘          └────────┘└─────────────┘
                                    ▲ sync                              ▲ sync
Stream 1:            ┌──pack──┐┌──broadcast──┐          ┌──pack──┐
                     │ buf[1] ││   buf[1]    │          │ buf[1] │
                     └────────┘└─────────────┘          └────────┘
                                              ▲ sync

Iteration 0 (buffer 0, Stream 0): Pack ~1GB of weights, broadcast via NCCL. The broadcast is GPU-async -- returns to CPU immediately.

Iteration 1 (buffer 1, Stream 1): streams[1].synchronize() returns instantly (empty stream). Pack into buffer 1, broadcast. Stream 0's broadcast may still be running. This is the overlap.

Iteration 2 (buffer 0, Stream 0): streams[0].synchronize() -- the critical call. We're about to reuse buffer 0. Must wait for Stream 0's previous broadcast to finish. Then safely overwrite.

Why Does the Broadcast Return Instantly?

Because NCCL operations are asynchronous on the GPU:

# pynccl.py, lines 342-366
def broadcast(self, tensor: torch.Tensor, src: int, stream=None):
    if stream is None:
        stream = current_stream()
    if src == self.rank:
        sendbuff = buffer_type(tensor.data_ptr())
        recvbuff = buffer_type(tensor.data_ptr())  # Required by NCCL
    else:
        sendbuff = buffer_type()        # Null — receiver doesn't send
        recvbuff = buffer_type(tensor.data_ptr())
    self.nccl.ncclBroadcast(
        sendbuff, recvbuff, tensor.numel(),
        ncclDataTypeEnum.from_torch(tensor.dtype),
        src, self.comm,
        cudaStream_t(stream.cuda_stream),
    )
    # No synchronize()! Returns to CPU immediately.
    # Caller synchronizes at TOP of next iteration.

No synchronize() after the call. The synchronization happens at the top of the next iteration, when we need to reuse the buffer.

Also critical: NCCL broadcast is a collective. Every process must call it the same number of times, in the same order. Mismatch = deadlock.
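
For completeness, the receiving side looks roughly like this -- a hedged sketch in which packed_nbytes, chunk_metadata, and state_dict are placeholder names, not the real consumer code. The key property: one broadcast call per producer broadcast, in the same order, followed by an unpack that walks the shared metadata.

# Receiver: one broadcast call per producer broadcast, same order
recv_buf = torch.empty(packed_nbytes, dtype=torch.uint8, device="cuda")
group.broadcast(recv_buf, src=src)          # matches exactly one producer call
torch.cuda.current_stream().synchronize()   # bytes must have landed before unpacking

offset = 0
for name, shape, dtype in chunk_metadata:   # same iteration order as the producer
    nbytes = shape.numel() * dtype.itemsize
    weight = recv_buf[offset:offset + nbytes].contiguous().view(dtype).view(*shape)
    state_dict[name].copy_(weight)
    offset += nbytes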

Checkpoint: We've covered streams (independent lanes), uint8 packing (uniform bytes), and double-buffering (overlap via alternating buffers). This pattern handles bulk, structured transfers -- but still blocks the caller. What if we need to not block at all?

KV Connector Background Threads: Transfer Without Stopping Inference

In disaggregated serving, prefill nodes compute KV caches and ship them to decode nodes. This must happen during live inference without stalling the decode engine.

vLLM's p2p_nccl_engine.py takes a fundamentally different approach: background Python threads with producer-consumer queues.

The main inference thread does this:

# p2p_nccl_engine.py, lines 254-258
if self.send_type == "PUT_ASYNC":
    with self.send_queue_cv:
        self.send_queue.append(item)
        self.send_queue_cv.notify()  # Wake background thread
    return True  # Returns IMMEDIATELY to inference loop

Five lines. Append to a deque, notify a condition variable, return. Zero blocking.

Meanwhile, a daemon thread drains the queue:

# p2p_nccl_engine.py, lines 476-484
def send_async(self):
    """Background daemon thread — drains send queue."""
    while True:
        with self.send_queue_cv:
            while not self.send_queue:
                self.send_queue_cv.wait()   # Releases GIL, sleeps
            item = self.send_queue.popleft()
            if not self.send_queue:
                self.send_queue_cv.notify() # Signal: queue drained
        self.send_sync(item)  # NCCL send + stream.synchronize()

The background thread sleeps on send_queue_cv.wait() (zero CPU cost), wakes when notified, and performs the NCCL P2P send on a dedicated CUDA stream (self.send_stream).
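
send_sync is where the dedicated stream earns its keep. A rough sketch of what its body amounts to (the item layout and communicator API here are assumptions, not the verbatim vLLM code):

def send_sync(self, item):
    tensor_id, dst_rank, tensor = item           # assumed layout of a queued item
    with torch.cuda.stream(self.send_stream):    # dedicated stream, never the default
        self.comm.send(tensor, dst=dst_rank)     # NCCL P2P send, GPU-async
    self.send_stream.synchronize()               # only now is it safe to release the tensor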

 Inference Thread                    Background Threads
 ────────────────                    ──────────────────
   decode step                       ┌───────────────────┐
     send_tensor()                   │   Send Thread      │
       .append() + .notify() ──────►│   cv.wait()        │
     return True (instantly!)        │   ncclSend(item)   │
   next decode step                  └───────────────────┘
     recv_tensor()                   ┌───────────────────┐
       cv.wait() ◄──────────────────│   Listener Thread  │
       tensor = recv_store[id]       │   zmq.poll()       │
   continue inference                │   ncclRecv(tensor)  │
                                     │   cv.notify() ─────┘
                                     └───────────────────┘

The GIL Question

Python's GIL means only one thread runs Python at a time. How does this overlap?

NCCL calls are C library calls that release the GIL. While the send thread is inside ncclSend(), the inference thread is free to run. The threads work because they spend 99% of their time in GIL-releasing C calls:

Main thread:  [Python][Python][Python][Python]
Send thread:  [cv.wait][ncclSend...........][cv.wait]
               GIL-free  GIL-free (C lib)    GIL-free
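
You can see the effect with a toy experiment on any machine -- illustrative only, no NCCL involved: a background thread grinding through GIL-releasing BLAS calls barely slows down pure-Python work on the main thread.

import threading
import time
import torch

def background():
    a = torch.randn(2048, 2048)
    for _ in range(50):
        _ = a @ a.T   # heavy C/BLAS call; the GIL is released for its duration

threading.Thread(target=background, daemon=True).start()

start = time.perf_counter()
total = sum(range(5_000_000))   # pure-Python loop on the main thread
print(f"main thread work: {time.perf_counter() - start:.3f}s")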

Implicit vs. Explicit Coordination

| Dimension         | Weight Transfer            | KV Connector                   |
|-------------------|----------------------------|--------------------------------|
| Coordination      | Implicit (NCCL collective) | Explicit (ZMQ + queues)        |
| Overlap target    | Pack ↔ Broadcast           | Transfer ↔ Inference           |
| Blocks inference? | Yes                        | No                             |
| Concurrency       | 2 CUDA streams             | OS threads + dedicated streams |
| Failure mode      | Hang (collective mismatch) | Timeout (missed ZMQ)           |
| Best for          | Bulk, periodic             | Streaming, continuous          |

Weight transfer coordinates implicitly through collectives -- simple but rigid. KV connector coordinates explicitly through ZMQ messages -- more machinery but fully async.

Eight Antipatterns

Synchronization

1. Syncing the wrong stream. streams[1].synchronize() when you needed streams[0]. Buffer 0 gets corrupted mid-broadcast. Silent wrong answers.

2. Missing synchronize entirely. Works under light load, race condition under heavy load. Nondeterministic, passes tests, fails in production.

3. torch.cuda.synchronize() instead of stream.synchronize(). Waits for ALL streams. Correctness maintained, performance destroyed (2-5x slower).
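
In code, the difference is a single identifier (a sketch reusing the names from the producer loop above):

# Antipattern 3: stalls until EVERY stream is idle, including the
# other buffer's still-running broadcast
torch.cuda.synchronize()

# Fix: wait only for the stream that owns the buffer you are about to overwrite
streams[buffer_idx].synchronize()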

Buffers

4. One buffer (num_buffers=1). Must sync after every broadcast. All complexity, zero overlap.

5. Too many buffers (num_buffers=8). 8GB of VRAM tied up in staging buffers for near-zero benefit over 2 buffers -- bandwidth is the bottleneck, not packing speed.

Coordination

6. Mismatched iteration order. Producer: [layer0, layer1, layer2]. Consumer: [layer2, layer0, layer1]. Every collective matches. No hang. Wrong bytes in wrong layers. Silent catastrophe.

7. Forgetting NCCL is a collective. One process skips a broadcast via early-exit. Others block forever. Deadlock, no error message.

8. Default stream from background thread. Send thread doesn't specify a stream. NCCL work goes on the default stream. Serializes with inference. Zero overlap despite threading complexity. Fix: self.send_stream = torch.cuda.Stream().

Bonus: The GIL Illusion
Python threads only provide parallelism when threads spend time in C extensions that release the GIL. If your background thread does tensor slicing in Python between NCCL calls, it holds the GIL and blocks inference. Keep the thread body a tight loop of C calls.

The Takeaway

These two patterns form a reusable vocabulary for GPU concurrency:

  • Double-buffered CUDA streams: Overlap data packing with transfer. Use for bulk, periodic transfers.
  • Background threads + producer-consumer queues: Decouple transfer from the critical path. Use for streaming, continuous transfers.

The meta-lesson: the hardest part of GPU concurrency is not making things fast -- it's making things correct while fast. stream.synchronize() is one line of code that separates "works" from "silently corrupts." The antipatterns exist because the failure mode of a missing sync is not a crash but wrong answers.

The next time you see a stream.synchronize() call in GPU code, don't skip over it. That one line is the load-bearing wall.


Code from vLLM's main branch: vllm/distributed/weight_transfer/packed_tensor.py and vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py.

Top comments (1)

Peter Vivo

Great to see a post about LLM inner workings with such clear clarification. Even for me, with very limited Python knowledge and no idea how to write GPU-targeted code in Python, the whole mechanism is covered by some mystic fog, but this post is a great starting point.