GPU topology changes what “8 GPUs” really means: NCCL step time, multi-node InfiniBand efficiency, and inference p99. NVLink can hide bad PCIe wiring; NUMA never does. On PCIe-only cards like the NVIDIA L40S (PCIe Gen4 x16, no NVLink), the scheduler must respect PCIe and socket locality, or you’ll pay for GPUs and get bus contention.
GPU topology is scheduling input, not trivia
If the scheduler can’t see topology, it will place jobs that look valid and run slow.
Schedulers allocate counts (gpus=8). They don’t allocate paths (“these 4 GPUs share a PCIe switch and sit on the same socket as the NIC”). That gap creates two common outcomes:
- Training: NCCL all-reduce stalls on the slowest hop.
- Inference: p99 latency spikes when CPU threads and DMA traffic cross sockets.
If you run a GPU-topology-sensitive fleet on a GPU cloud server, you either encode topology into placement rules or you accept variance as a “feature.”
The three wires that matter
We’ll tie each wire to the exact failure mode you see in Kubernetes and Slurm.
PCIe fabric
PCIe is fast until multiple devices converge on the same upstream link.
PCIe isn’t one big flat bus. It’s a tree: endpoints → switches → root complex → CPU socket. When traffic funnels through a shared upstream link, bandwidth becomes shared and latency jumps.
L40S-specific reality: L40S uses PCIe Gen4 x16 (64 GB/s bidirectional) and does not support NVLink. That means GPU↔GPU traffic stays on PCIe. No fast side-channel.
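You can see that tree for your own cards straight from sysfs. A minimal sketch, assuming a Linux host with lspci available; the NVIDIA match is loose, so GPU audio functions will show up too:

# Walk the PCIe path for each NVIDIA function: any bridge that two
# GPUs share is a link their traffic will contend on.
for bdf in $(lspci -D | awk '/NVIDIA/ {print $1}'); do
  echo "== $bdf =="
  readlink -f "/sys/bus/pci/devices/$bdf"   # root port -> switches -> endpoint
done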
NVLink and NVSwitch
NVLink changes the GPU↔GPU fast path, which changes how forgiving the node is.
NCCL supports PCIe and NVLink/NVSwitch, and it will route collectives differently based on what it detects.
If you’re on a node with NVLink, you can sometimes “get away with” weaker PCIe placement. On L40S, you can’t.
NUMA sockets
NUMA decides whether “local” memory and PCIe devices are actually local.
Dual-socket servers have two NUMA domains. Each domain has its own memory controller and PCIe root complex resources. Cross-socket traffic uses the CPU interconnect (UPI/IF/QPI-class links). That’s where you pay the “SYS hop” penalty in many topology maps.
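A quick way to tie devices to sockets without vendor tools; the PCI class prefixes below are the only assumption, and -1 means the platform didn’t report locality (common in VMs):

# Report the NUMA node of every GPU and NIC on the box.
for dev in /sys/bus/pci/devices/*; do
  class=$(cat "$dev/class")
  case "$class" in
    0x0300*|0x0302*) kind=GPU ;;   # VGA / 3D controller
    0x0200*|0x0207*) kind=NIC ;;   # Ethernet / InfiniBand controller
    *) continue ;;
  esac
  echo "$kind $(basename "$dev") numa_node=$(cat "$dev/numa_node")"
done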
What topology looks like in real metrics
This is how topology shows up when you’re staring at slow training jobs.
NCCL is topology-aware, but not topology-proof
NCCL will pick a comm graph, but it can’t invent a faster hop.
NCCL provides collectives across GPUs within and across nodes and supports PCIe, NVLink, and InfiniBand.
It also exposes knobs that make the topology model explicit.
NCCL documents path cutoffs like PIX / PXB / PHB / SYS for peer-to-peer decisions (same PCI switch → across PCI switches → same NUMA node → across NUMA nodes).
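A minimal sketch of those knobs, assuming a reasonably recent NCCL: NCCL_P2P_LEVEL caps how far peer-to-peer is allowed to reach, and NCCL_TOPO_DUMP_FILE writes out the topology NCCL detected so you can diff it across nodes. train.py is a placeholder for whatever launches your job.

# Dump the detected topology and cap P2P at the host bridge, so
# cross-socket GPU pairs fall back to staged copies instead of slow P2P.
export NCCL_DEBUG=INFO
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml   # written during init
export NCCL_P2P_LEVEL=PHB                       # LOC < PIX < PXB < PHB < SYS
python train.py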
That matters because NCCL all-reduce behaves like this:
- One slow edge in the ring/tree drags the whole step.
- Cross-socket edges are the usual culprit on PCIe-only nodes.
“InfiniBand is fine” but the job is still slow
GPU↔NIC locality can bottleneck before you hit the fabric.
GPUDirect RDMA provides a direct data path between GPU memory and a third-party PCIe device such as a NIC.
If the NIC sits under the other socket, you can still get extra hops and host involvement depending on topology and configuration.
Result: you scale nodes and don’t scale throughput.
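Worth checking before you blame the fabric: nvidia-smi topo -m prints NIC columns next to the GPUs, and NCCL_IB_HCA lets you steer NCCL toward a specific HCA. A hedged sketch; the mlx5_0 name is an assumption, list yours with ibv_devices:

# Look for NODE/SYS between a GPU and the HCA it will use: that means
# extra hops before traffic ever reaches InfiniBand.
nvidia-smi topo -m
ibv_devices                        # list HCAs present on the node
export NCCL_IB_HCA=mlx5_0          # assumption: mlx5_0 is the socket-local HCA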
Inference p99 gets ugly under mixed load
p99 spikes happen when you add jitter to the CPU↔GPU feeding path.
Inference often looks fine at p50 and fails at p99. On L40S nodes, the usual trigger is cross-socket CPU placement or PCIe contention from a neighboring workload.
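The cheap mitigation is pinning the serving process to the GPU’s socket. A minimal sketch, assuming the GPU lives on NUMA node 0 and serve.py stands in for your inference entrypoint:

# Keep CPU threads, their allocations, and the DMA staging buffers on
# the same socket as the GPU they feed.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 python serve.py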
Step 1: Print the topology map on every node class
You can’t script scheduling rules until you can prove the wiring.
Run this on each node SKU you plan to deploy:
# GPU and link map
nvidia-smi topo -m
# NUMA layout
lscpu | grep -E "Socket|NUMA"
numactl --hardware
# PCIe tree
lspci -tv
If you rent capacity, ask your provider for nvidia-smi topo -m output before you commit. AceCloud can hand you those maps for specific GPU cloud server SKUs so you can design job shapes that fit the hardware.
Step 2: Turn topology into job shapes
Asking for “8 GPUs” is vague. Asking for “a 4-GPU island” is schedulable.
On PCIe-only nodes, “8 GPUs” often means “two 4-GPU islands.” If your job spans islands, you introduce slow hops.
Define shapes up front:
| Workload | Shape | Why it works |
| --- | --- | --- |
| NCCL training, single node | 4-GPU island or full 8-GPU node | avoids cross-island P2P penalties |
| NCCL training, multi-node IB | N nodes × (same shape) | keeps rank topology consistent |
| Inference | 1 GPU per pod/task | reduces contention, easier NUMA pinning |
Now you can configure scheduling rules around shapes instead of raw counts.
Kubernetes: make topology visible or accept random placement
Vanilla K8s schedules extended resources, not PCIe and NUMA reality.
Use Topology Manager for NUMA alignment
This is how you keep CPUs, devices, and memory on the same NUMA node.
Kubernetes Topology Manager with the single-numa-node policy can reject pods that can’t be placed with a single NUMA affinity, using hints from the kubelet’s hint providers.
Device plugins can provide NUMA TopologyInfo so kubelet can make locality-aware assignments.
This is the difference between:
- CPU threads on socket 0 feeding GPUs on socket 0, and
- CPU threads on socket 1 starving GPUs on socket 0 through remote memory traffic.
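To enforce the first case, a minimal kubelet sketch; it assumes kubelet reads /var/lib/kubelet/config.yaml (path and reserved-CPU range vary by distro), and single-numa-node only bites when the static CPU manager and an explicit CPU reservation are also set:

# Merge these fields into the node's KubeletConfiguration, then restart
# kubelet. Pods requesting exclusive CPUs and GPUs either get a single
# NUMA node or are rejected with a TopologyAffinityError.
cat <<'EOF' >> /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
reservedSystemCPUs: "0-3"
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
EOF
# Changing cpuManagerPolicy also requires clearing the old state file.
rm -f /var/lib/kubelet/cpu_manager_state
systemctl restart kubelet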
Encode GPU islands as labels
K8s can’t pick “the 4 GPUs under this PCIe switch,” so you model islands at the node-pool layer.
Practical pattern:
1. Split node pools by hardware topology class (consistent wiring).
2. Label nodes with the shape they support.
Example labels:
- topo.gpu.shape=4island
- topo.ib.local=true
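Applying them is one command per node (the node name is a placeholder):

# Tag nodes with the island shape and NIC locality they can actually deliver.
kubectl label node <node-name> topo.gpu.shape=4island topo.ib.local=true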
Then select with affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topo.gpu.shape
          operator: In
          values: ["4island","8full"]
This is boring. That’s why it works.
On managed Kubernetes, topology discipline is mostly a node-pool problem: mixing GPU SKUs or wiring revisions inside one pool guarantees inconsistent NCCL graphs.
If you depend on Kubernetes node autoscaling, make sure it scales the right topology-labeled pool; adding “more nodes” that don’t match your GPU island shape can make jobs slower, not faster.
Slurm: ask for socket-local GPUs and bind them
Slurm exposes more of the machinery you need for topology-aware placement.
Slurm schedules GPUs via GRES and has GPU-specific allocation features.
srun supports --gpus-per-socket and --gpu-bind.
On some HPC systems, --gpu-bind=closest binds each task to the GPU in the same NUMA domain as the CPU core the rank runs on.
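For --gpu-bind=closest (or --gpus-per-socket) to mean anything, slurmd has to know which cores sit next to which GPUs. A hedged gres.conf sketch for a dual-socket, 8-GPU L40S node; the device files and core ranges are assumptions, and AutoDetect=nvml can discover this for you if Slurm was built against NVML:

# gres.conf: tie each GPU to its local socket's cores so Slurm's CPU
# and GPU bindings reflect the real wiring.
Name=gpu Type=l40s File=/dev/nvidia[0-3] Cores=0-31    # socket 0
Name=gpu Type=l40s File=/dev/nvidia[4-7] Cores=32-63   # socket 1
# Alternative: let Slurm read the topology from NVML.
# AutoDetect=nvml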
Example pattern for dual-socket nodes:
srun -N1 \
  --sockets-per-node=2 \
  --gpus-per-socket=4 \
  --cpus-per-gpu=8 \
  --cpu-bind=cores \
  --gpu-bind=closest \
  python train.py
Two wins:
- You don’t spread GPUs across sockets unless you mean to.
- You keep CPU threads close to the GPUs they feed.
Step 3: Verify placement inside the job
If you don’t verify, you’ll blame code for a wiring problem.
Kubernetes
Check what you got, not what you asked for.
kubectl exec -it <pod> -- bash -lc '
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
nvidia-smi topo -m
'
Slurm
Same idea, different launcher.
srun -N1 --gpus=4 bash -lc '
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
nvidia-smi topo -m
'
Then grep for cross-socket indicators (SYS or NODE entries) in the topo matrix and fix placement before you touch model code.
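A serviceable filter, assuming the usual matrix labels (PIX/PXB/PHB/NODE/SYS) and rows prefixed with GPU; the exact layout can vary by driver version:

# Count GPU rows whose path to some peer crosses a host bridge or socket.
nvidia-smi topo -m | grep '^GPU' | grep -cE 'SYS|NODE'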
Step 4: Microbenchmarks that catch topology regressions
Two short tests will tell you if the node can actually scale.
NCCL collectives: nccl-tests
If all-reduce is slow here, training will be slow everywhere.
nccl-tests is the standard harness for NCCL collective performance.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH
mpirun -np 8 ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1 | tee nccl.log
Now compare nccl.log across nodes of the “same” SKU. If graphs differ, your fleet isn’t topology-consistent.
InfiniBand baseline: ib_write_bw
Prove the fabric, then prove your NUMA binding assumptions.
ib_write_bw is part of the perftest utilities used for InfiniBand performance testing.
Server:
ib_write_bw --report_gbits
Client:
ib_write_bw <server-ip> --report_gbits
Then rerun with NUMA binding. Yandex’s guide shows the exact pattern: bind CPU and NIC by NUMA to isolate the path.
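A hedged version of that rerun; the NUMA node and HCA name are assumptions, read yours from numactl --hardware and ibv_devices:

# Client side, pinned to the NIC's NUMA node and told which HCA to use.
numactl --cpunodebind=1 --membind=1 \
  ib_write_bw -d mlx5_1 --report_gbits <server-ip>
# If this beats the unpinned run by a wide margin, default CPU placement
# is fighting the NIC's locality.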
Decision checklist for GPU topology on a GPU cloud server
Use this to pick nodes and placement rules that won’t surprise you.
1. Confirm the interconnect
- If you’re on L40S, assume PCIe-only GPU↔GPU. NVLink isn’t there.
2. Pick a job shape
- 4-GPU island for most NCCL training on PCIe-only nodes.
- Full 8-GPU only when the topology map shows a clean fabric.
3. Make NUMA a hard requirement
- K8s: configure Topology Manager + device plugin topology hints.
- Slurm: use --gpus-per-socket and bind.
4. Verify inside the allocation
- Run nvidia-smi topo -m.
- Pipe logs and grep for the patterns that correlate with slow steps.
5. Benchmark once per node class
- nccl-tests for collectives.
- ib_write_bw for fabric sanity.
Bottom line
Scheduling outcomes depend on wiring, so treat wiring as an input.
PCIe, NVLink, and NUMA decide whether your scheduler produces fast placements or expensive slow ones. On L40S-based fleets, you don’t get NVLink to mask bad decisions. Encode GPU topology into node pools and constraints. Verify allocations with nvidia-smi topo -m. Then scale your training jobs with fewer surprises, on-prem or on a GPU cloud server.