GPU topology changes what “8 GPUs” really means: NCCL step time, multi-node InfiniBand efficiency, and inference p99. NVLink can hide bad PCIe wiring; NUMA never does. On PCIe-only cards like the NVIDIA L40S (PCIe Gen4 x16, no NVLink), the scheduler must respect PCIe and socket locality, or you’ll pay for GPUs and get bus contention.
GPU topology is scheduling input, not trivia
If the scheduler can’t see topology, it will place jobs that look valid and run slow.
Schedulers allocate counts (gpus=8). They don’t allocate paths (“these 4 GPUs share a PCIe switch and sit on the same socket as the NIC”). That gap creates two common outcomes:
- Training: NCCL all-reduce stalls on the slowest hop.
- Inference: p99 latency spikes when CPU threads and DMA traffic cross sockets.
If you run a GPU-topology-sensitive fleet on a GPU cloud server, you either encode topology into placement rules or you accept variance as a “feature.”
The three wires that matter
We’ll tie each wire to the exact failure mode you see in Kubernetes and Slurm.
PCIe fabric
PCIe is fast until multiple devices converge on the same upstream link.
PCIe isn’t one big flat bus. It’s a tree: endpoints → switches → root complex → CPU socket. When traffic funnels through a shared upstream link, bandwidth becomes shared and latency jumps.
L40S-specific reality: L40S uses PCIe Gen4 x16 (64 GB/s bidirectional) and does not support NVLink. That means GPU↔GPU traffic stays on PCIe. No fast side-channel.
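You can see that tree for your own cards straight from sysfs. A minimal sketch, assuming a Linux host with lspci available; the NVIDIA match is loose, so GPU audio functions will show up too:

# Walk the PCIe path for each NVIDIA function: any bridge that two
# GPUs share is a link their traffic will contend on.
for bdf in $(lspci -D | awk '/NVIDIA/ {print $1}'); do
  echo "== $bdf =="
  readlink -f "/sys/bus/pci/devices/$bdf"   # root port -> switches -> endpoint
done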
NVLink and NVSwitch
NVLink changes the GPU↔GPU fast path, which changes how forgiving the node is.
NCCL supports PCIe and NVLink/NVSwitch, and it will route collectives differently based on what it detects.
If you’re on a node with NVLink, you can sometimes “get away with” weaker PCIe placement. On L40S, you can’t.
NUMA sockets
NUMA decides whether “local” memory and PCIe devices are actually local.
Dual-socket servers have two NUMA domains. Each domain has its own memory controller and PCIe root complex resources. Cross-socket traffic uses the CPU interconnect (UPI/IF/QPI-class links). That’s where you pay the “SYS hop” penalty in many topology maps.
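A quick way to tie devices to sockets without vendor tools; the PCI class prefixes below are the only assumption, and -1 means the platform didn’t report locality (common in VMs):

# Report the NUMA node of every GPU and NIC on the box.
for dev in /sys/bus/pci/devices/*; do
  class=$(cat "$dev/class")
  case "$class" in
    0x0300*|0x0302*) kind=GPU ;;   # VGA / 3D controller
    0x0200*|0x0207*) kind=NIC ;;   # Ethernet / InfiniBand controller
    *) continue ;;
  esac
  echo "$kind $(basename "$dev") numa_node=$(cat "$dev/numa_node")"
done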
What topology looks like in real metrics
This is how topology shows up when you’re staring at slow training jobs.
NCCL is topology-aware, but not topology-proof
NCCL will pick a comm graph, but it can’t invent a faster hop.
NCCL provides collectives across GPUs within and across nodes and supports PCIe, NVLink, and InfiniBand.
It also exposes knobs that make the topology model explicit.
NCCL documents path cutoffs like PIX / PXB / PHB / SYS for peer-to-peer decisions (same PCI switch → across PCI switches → same NUMA node → across NUMA nodes).
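A minimal sketch of those knobs, assuming a reasonably recent NCCL: NCCL_P2P_LEVEL caps how far peer-to-peer is allowed to reach, and NCCL_TOPO_DUMP_FILE writes out the topology NCCL detected so you can diff it across nodes. train.py is a placeholder for whatever launches your job.

# Dump the detected topology and cap P2P at the host bridge, so
# cross-socket GPU pairs fall back to staged copies instead of slow P2P.
export NCCL_DEBUG=INFO
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml   # written during init
export NCCL_P2P_LEVEL=PHB                       # LOC < PIX < PXB < PHB < SYS
python train.py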
That matters because NCCL all-reduce behaves like this:
- One slow edge in the ring/tree drags the whole step.
- Cross-socket edges are the usual culprit on PCIe-only nodes.
“InfiniBand is fine” but the job is still slow
GPU↔NIC locality can bottleneck before you hit the fabric.
GPUDirect RDMA provides a direct data path between GPU memory and a third-party PCIe device such as a NIC.
If the NIC sits under the other socket, you can still get extra hops and host involvement depending on topology and configuration.
Result: you scale nodes and don’t scale throughput.
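Worth checking before you blame the fabric: nvidia-smi topo -m prints NIC columns next to the GPUs, and NCCL_IB_HCA lets you steer NCCL toward a specific HCA. A hedged sketch; the mlx5_0 name is an assumption, list yours with ibv_devices:

# Look for NODE/SYS between a GPU and the HCA it will use: that means
# extra hops before traffic ever reaches InfiniBand.
nvidia-smi topo -m
ibv_devices                        # list HCAs present on the node
export NCCL_IB_HCA=mlx5_0          # assumption: mlx5_0 is the socket-local HCA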
Inference p99 gets ugly under mixed load
p99 spikes happen when you add jitter to the CPU↔GPU feeding path.
Inference often looks fine at p50 and fails at p99. On L40S nodes, the usual trigger is cross-socket CPU placement or PCIe contention from a neighboring workload.
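The cheap mitigation is pinning the serving process to the GPU’s socket. A minimal sketch, assuming the GPU lives on NUMA node 0 and serve.py stands in for your inference entrypoint:

# Keep CPU threads, their allocations, and the DMA staging buffers on
# the same socket as the GPU they feed.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 python serve.py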
Step 1: Print the topology map on every node class
You can’t script scheduling rules until you can prove the wiring.
Run this on each node SKU you plan to deploy:
# GPU and link map
nvidia-smi topo -m
# NUMA layout
lscpu | grep -E "Socket|NUMA"
numactl --hardware
# PCIe tree
lspci -tv
If you rent capacity, ask your provider for nvidia-smi topo -m output before you commit. AceCloud can hand you those maps for specific GPU cloud server SKUs so you can design job shapes that fit the hardware.
Step 2: Turn topology into job shapes
Asking for “8 GPUs” is vague. Asking for “a 4-GPU island” is schedulable.
On PCIe-only nodes, “8 GPUs” often means “two 4-GPU islands.” If your job spans islands, you introduce slow hops.
Define shapes up front:
| Workload | Shape | Why it works |
| --- | --- | --- |
| NCCL training, single node | 4-GPU island or full 8-GPU node | avoids cross-island P2P penalties |
| NCCL training, multi-node IB | N nodes × (same shape) | keeps rank topology consistent |
| Inference | 1 GPU per pod/task | reduces contention, easier NUMA pinning |
Now you can configure scheduling rules around shapes instead of raw counts.
Kubernetes: make topology visible or accept random placement
Vanilla K8s schedules extended resources, not PCIe and NUMA reality.
Use Topology Manager for NUMA alignment
This is how you keep CPUs, devices, and memory on the same NUMA node.
Kubernetes Topology Manager with the single-numa-node policy can reject pods that can’t be placed with a single NUMA affinity, using hints from the kubelet’s hint providers.
Device plugins can provide NUMA TopologyInfo so kubelet can make locality-aware assignments.
This is the difference between:
- CPU threads on socket 0 feeding GPUs on socket 0, and
- CPU threads on socket 1 starving GPUs on socket 0 through remote memory traffic.
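To enforce the first case, a minimal kubelet sketch; it assumes kubelet reads /var/lib/kubelet/config.yaml (path and reserved-CPU range vary by distro), and single-numa-node only bites when the static CPU manager and an explicit CPU reservation are also set:

# Merge these fields into the node's KubeletConfiguration, then restart
# kubelet. Pods requesting exclusive CPUs and GPUs either get a single
# NUMA node or are rejected with a TopologyAffinityError.
cat <<'EOF' >> /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
reservedSystemCPUs: "0-3"
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
EOF
# Changing cpuManagerPolicy also requires clearing the old state file.
rm -f /var/lib/kubelet/cpu_manager_state
systemctl restart kubelet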
Encode GPU islands as labels
K8s can’t pick “the 4 GPUs under this PCIe switch,” so you model islands at the node-pool layer.
Practical pattern:
1. Split node pools by hardware topology class (consistent wiring).
2. Label nodes with the shape they support.
Example labels:
- topo.gpu.shape=4island
- topo.ib.local=true
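Applying them is one command per node (the node name is a placeholder):

# Tag nodes with the island shape and NIC locality they can actually deliver.
kubectl label node <node-name> topo.gpu.shape=4island topo.ib.local=true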
Then select with affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topo.gpu.shape
          operator: In
          values: ["4island","8full"]
This is boring. That’s why it works.
On managed Kubernetes, topology discipline is mostly a node-pool problem: mixing GPU SKUs or wiring revisions inside one pool guarantees inconsistent NCCL graphs.
If you depend on Kubernetes node autoscaling, make sure it scales the right topology-labeled pool; adding “more nodes” that don’t match your GPU island shape can make jobs slower, not faster.
Slurm: ask for socket-local GPUs and bind them
Slurm exposes more of the machinery you need for topology-aware placement.
Slurm schedules GPUs via GRES and has GPU-specific allocation features.
srun supports --gpus-per-socket and --gpu-bind.
On some HPC systems, --gpu-bind=closest binds each task to the GPU in the same NUMA domain as the CPU core the rank runs on.
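For --gpu-bind=closest (or --gpus-per-socket) to mean anything, slurmd has to know which cores sit next to which GPUs. A hedged gres.conf sketch for a dual-socket, 8-GPU L40S node; the device files and core ranges are assumptions, and AutoDetect=nvml can discover this for you if Slurm was built against NVML:

# gres.conf: tie each GPU to its local socket's cores so Slurm's CPU
# and GPU bindings reflect the real wiring.
Name=gpu Type=l40s File=/dev/nvidia[0-3] Cores=0-31    # socket 0
Name=gpu Type=l40s File=/dev/nvidia[4-7] Cores=32-63   # socket 1
# Alternative: let Slurm read the topology from NVML.
# AutoDetect=nvml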
Example pattern for dual-socket nodes:
srun -N1 \
  --sockets-per-node=2 \
  --gpus-per-socket=4 \
  --cpus-per-gpu=8 \
  --cpu-bind=cores \
  --gpu-bind=closest \
  python train.py
Two wins:
- You don’t spread GPUs across sockets unless you mean to.
- You keep CPU threads close to the GPUs they feed.
Step 3: Verify placement inside the job
If you don’t verify, you’ll blame code for a wiring problem.
Kubernetes
Check what you got, not what you asked for.
kubectl exec -it <pod> -- bash -lc '
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
nvidia-smi topo -m
'
Slurm
Same idea, different launcher.
srun -N1 --gpus=4 bash -lc '
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
nvidia-smi topo -m
'
Then grep for cross-socket indicators (SYS or NODE entries) in the topo matrix and fix placement before you touch model code.
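A serviceable filter, assuming the usual matrix labels (PIX/PXB/PHB/NODE/SYS) and rows prefixed with GPU; the exact layout can vary by driver version:

# Count GPU rows whose path to some peer crosses a host bridge or socket.
nvidia-smi topo -m | grep '^GPU' | grep -cE 'SYS|NODE'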
Step 4: Microbenchmarks that catch topology regressions
Two short tests will tell you if the node can actually scale.
NCCL collectives: nccl-tests
If all-reduce is slow here, training will be slow everywhere.
nccl-tests is the standard harness for NCCL collective performance.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH
mpirun -np 8 ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1 | tee nccl.log
Now compare nccl.log across nodes of the “same” SKU. If graphs differ, your fleet isn’t topology-consistent.
InfiniBand baseline: ib_write_bw
Prove the fabric, then prove your NUMA binding assumptions.
ib_write_bw is part of the perftest utilities used for InfiniBand performance testing.
Server:
ib_write_bw --report_gbits
Client:
ib_write_bw <server-ip> --report_gbits
Then rerun with NUMA binding. Yandex’s guide shows the exact pattern: bind CPU and NIC by NUMA to isolate the path.
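A hedged version of that rerun; the NUMA node and HCA name are assumptions, read yours from numactl --hardware and ibv_devices:

# Client side, pinned to the NIC's NUMA node and told which HCA to use.
numactl --cpunodebind=1 --membind=1 \
  ib_write_bw -d mlx5_1 --report_gbits <server-ip>
# If this beats the unpinned run by a wide margin, default CPU placement
# is fighting the NIC's locality.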
Decision checklist for GPU topology on a GPU cloud server
Use this to pick nodes and placement rules that won’t surprise you.
1. Confirm the interconnect
- If you’re on L40S, assume PCIe-only GPU↔GPU. NVLink isn’t there.
2. Pick a job shape
- 4-GPU island for most NCCL training on PCIe-only nodes.
- Full 8-GPU only when the topology map shows a clean fabric.
3. Make NUMA a hard requirement
- K8s: configure Topology Manager + device plugin topology hints.
- Slurm: use --gpus-per-socket and bind.
4. Verify inside the allocation
- Run nvidia-smi topo -m.
- Pipe logs and grep for the patterns that correlate with slow steps.
5. Benchmark once per node class
- nccl-tests for collectives.
- ib_write_bw for fabric sanity.
Bottom line
Scheduling outcomes depend on wiring, so treat wiring as an input.
PCIe, NVLink, and NUMA decide whether your scheduler produces fast placements or expensive slow ones. On L40S-based fleets, you don’t get NVLink to mask bad decisions. Encode GPU topology into node pools and constraints. Verify allocations with nvidia-smi topo -m. Then scale your training jobs with fewer surprises, on-prem or on a GPU cloud server.