You ran nsys profile, got a 2GB .nsys-rep file, exported it to SQLite, and found yourself staring at 88 tables with names like CUPTI_ACTIVITY_KIND_KERNEL and ENUM_WDDM_PAGING_QUEUE_TYPE. The kernel names are integer IDs. The timestamps are in nanoseconds. Nothing is human-readable. You closed the file and went back to guessing.
This post exists so you never have to guess again.
I'm going to teach you to read any nsys trace in under 10 minutes — using four tables, one SQL join pattern, and four queries. Then we'll use those tools to solve a real mystery: why does a model give different results at batch size 1 vs batch size 16, even though both traces show exactly 8,955 kernel launches?
What nsys actually records
Imagine you're standing in a factory watching an assembly line. You have a stopwatch, and your job is to write down: when each machine started, when it stopped, which machine it was, and what it was building.
That's what NVIDIA Nsight Systems does for your GPU. It records every kernel launch, every memory copy, every synchronization event — with nanosecond timestamps.
The output is two files:
profile.nsys-rep ← visual report (open in the Nsight GUI)
profile.sqlite ← raw data in a database (query with SQL)
The .sqlite file is the gold mine. Everything in the .nsys-rep is derived from it.
88 tables, but only 4 matter
Open an nsys SQLite export and you'll find dozens of tables (88 in this trace). Most are enum lookup tables or metadata. You need these four:
| Table | What it records |
|---|---|
| CUPTI_ACTIVITY_KIND_KERNEL | Every GPU kernel execution: start, end, grid, block, name |
| CUPTI_ACTIVITY_KIND_MEMCPY | Every memory transfer: CPU→GPU, GPU→CPU, GPU→GPU |
| CUPTI_ACTIVITY_KIND_MEMSET | Every memory fill (zeroing buffers) |
| NVTX_EVENTS | Human-readable markers programmers add to their code |
Plus one helper table: StringIds — the Rosetta Stone that maps integer IDs to actual names.
Here's how to discover them:
import sqlite3
# nsys export --type=sqlite profile.nsys-rep -> produces profile.sqlite
conn = sqlite3.connect('profile.sqlite')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [r[0] for r in cursor.fetchall()]
print(f"Total tables: {len(tables)}") # 88 tables!
# But only 4 matter:
# CUPTI_ACTIVITY_KIND_KERNEL ← GPU kernel executions
# CUPTI_ACTIVITY_KIND_MEMCPY ← memory transfers
# CUPTI_ACTIVITY_KIND_MEMSET ← memory fills
# NVTX_EVENTS ← human-readable markers
# + StringIds ← name lookup table
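While you're poking around, one more line shows you the columns of any table before you write your first query. It works from the sqlite3 CLI or through the same cursor:
-- Inspect a table's columns before querying it
PRAGMA table_info(CUPTI_ACTIVITY_KIND_KERNEL);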
The #1 gotcha: names are integers
This is the single most confusing thing when you first look at nsys data. Kernel names are not stored as strings. They're stored as integer foreign keys into the StringIds table.
When you query the kernel table, you'll see:
demangledName = 58
shortName = 59
These are pointers. To get the actual name, you JOIN:
JOIN StringIds s ON k.shortName = s.id
And suddenly 59 becomes vectorized_elementwise_kernel.
Every query you write will include this JOIN. It becomes muscle memory after your second trace.
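Here it is in full. A minimal sketch that resolves both name columns at once, relying only on what we've seen so far: demangledName and shortName are integer keys into StringIds.
-- Resolve both name columns for a few kernel launches
SELECT
    k.start,
    d.value AS demangled_name,   -- the long templated C++ name
    s.value AS short_name        -- e.g. vectorized_elementwise_kernel
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds d ON k.demangledName = d.id
JOIN StringIds s ON k.shortName = s.id
ORDER BY k.start
LIMIT 5;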
Decoding kernel names: the Rosetta Stone
Once you run that JOIN, you'll see names like nvjet_tst_192x192_64x4_2x1_v_bz_coopB_TNN. This isn't gibberish — every character means something:
nvjet_tst _ 192x192 _ 64x4 _ 2x1 _ v _ bz _ coopB _ TNN
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ transpose: T=yes, N=no
│ │ │ │ │ │ │ TNN = A^T × B → C
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ cooperative mode
│ │ │ │ │ │ coopA/coopB = how SMs share work
│ │ │ │ │ │
│ │ │ │ │ └─ block-zero init strategy
│ │ │ │ │
│ │ │ │ └─ layout: v=vertical, h=horizontal
│ │ │ │ (how tiles map to SMs)
│ │ │ │
│ │ │ └─ warp tiling: 2 warps in M, 1 in N
│ │ │
│ │ └─ block tile: 64×4 threads per block
│ │
│ └─ output tile: 192×192 chunk per SM
│
└─ nvjet_tst = NVIDIA JIT persistent kernel (deterministic!)
Think of it like a car's VIN. Once you know the format, you can read any GPU kernel name at a glance.
Besides nvjet_tst_*, you'll encounter:
- device_kernel — output of torch.compile. Opaque, but often 70%+ of GPU time.
- vectorized_elementwise_kernel — PyTorch's generic ops (add, multiply, cast).
- rms_norm_kernel / act_and_mul_kernel — normalization and activation.
- *_splitK_* — Split-K GEMM with atomic reduction. A potential source of non-determinism.
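A quick way to see which of these families actually show up in your own trace (a sketch; tweak the LIKE patterns for whatever you're hunting):
-- Which kernel-name families appear in this trace?
SELECT DISTINCT s.value AS kernel_name
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
WHERE LOWER(s.value) LIKE '%nvjet%'
   OR LOWER(s.value) LIKE '%splitk%'
   OR LOWER(s.value) LIKE '%elementwise%'
ORDER BY kernel_name;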
Grid and block: mapping kernels to hardware
The kernel table has gridX/Y/Z and blockX/Y/Z columns. These map to physical GPU hardware.
An H200 has 132 Streaming Multiprocessors (SMs) — 132 independent assembly lines, each chewing through thread blocks on its own. (An SM can hold several blocks at once, but the simple mental model for this post is one block per SM.)
GPU with 132 SMs:
┌─────┬─────┬─────┬─────┬─── ── ──┬─────┐
│SM 0 │SM 1 │SM 2 │SM 3 │ ... │SM131│
│blk 0│blk 1│blk 2│blk 3│ │blk │
│128 │128 │128 │128 │ │131 │
│thrds│thrds│thrds│thrds│ │thrds│
└─────┴─────┴─────┴─────┴─── ── ──┴─────┘
grid=132 × block=128 = 16,896 total threads
- gridX=1: One SM active, 131 idle. Tiny work.
- gridX=132: Every SM busy. What persistent GEMM targets.
- gridX=64,000: Blocks queue in waves. GPU stays saturated.
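You can read this straight off the kernel table. A sketch using the gridX/Y/Z and blockX/Y/Z columns described above; the smallest launches are your under-utilization suspects:
-- Smallest launches first: candidates for idle SMs
SELECT
    s.value AS kernel_name,
    k.gridX * k.gridY * k.gridZ    AS blocks,
    k.blockX * k.blockY * k.blockZ AS threads_per_block,
    (k.gridX * k.gridY * k.gridZ) * (k.blockX * k.blockY * k.blockZ) AS total_threads
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
ORDER BY total_threads ASC
LIMIT 10;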
Four SQL queries that answer every question
Every GPU investigation follows the same four-step pattern: The Census, The Lineup, The Stakeout, and The Timeline.
Level 1: The Census — "What's slow?"
-- Where is the GPU spending its time?
SELECT
s.value AS kernel_name,
COUNT(*) AS call_count,
ROUND(SUM(k.end - k.start) / 1e6, 2) AS total_ms,
ROUND(AVG(k.end - k.start) / 1e3, 1) AS avg_us
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
GROUP BY s.value
ORDER BY total_ms DESC
LIMIT 10;
On our H200 running vLLM inference:
| Kernel | Calls | Total (ms) | % |
|---|---|---|---|
| device_kernel (torch.compile) | 748 | 877.38 | 70.2% |
| vectorized_elementwise_kernel | 883 | 60.98 | 4.9% |
| nvjet_tst_192x192_64x4_2x1... | 28 | 51.94 | 4.2% |
70% of GPU time in one kernel type — the compiled vision encoder.
Level 2: The Lineup — "What category?"
SELECT
CASE
WHEN LOWER(s.value) LIKE '%nvjet%' THEN 'Persistent GEMM'
WHEN LOWER(s.value) LIKE '%cublas%' THEN 'cuBLAS (DANGER)'
WHEN LOWER(s.value) LIKE '%device_kernel%' THEN 'torch.compile'
WHEN LOWER(s.value) LIKE '%flash%' THEN 'Flash Attention'
WHEN LOWER(s.value) LIKE '%elementwise%' THEN 'Elementwise'
WHEN LOWER(s.value) LIKE '%norm%' THEN 'Normalization'
ELSE 'Other'
END AS category,
COUNT(*) AS calls,
ROUND(SUM(k.end-k.start)/1e6, 2) AS total_ms
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
GROUP BY category ORDER BY total_ms DESC;
Results: 17.3% Persistent GEMM, 7.7% Elementwise, 2% Normalization, and... 0.1% Flash Attention. Just 1.2ms. Barely a blip.
Remember that 0.1%. A 5.15x jump in that sliver becomes the smoking gun.
Level 3: The Stakeout — "What changed?"
-- Drill into every launch of a specific kernel
SELECT
k.start,
ROUND((k.end - k.start) / 1e3, 1) AS dur_us,
k.gridX, k.gridY, k.gridZ,
k.blockX, k.registersPerThread,
k.dynamicSharedMemory
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
WHERE s.value = 'reshape_and_cache_flash_kernel'
ORDER BY k.start
LIMIT 20;
Level 4: The Timeline — "What happened when?"
-- Unified GPU timeline: kernels + memory transfers
SELECT 'KERNEL' AS type,
k.start,
ROUND((k.end - k.start) / 1e3, 1) AS dur_us,
s.value AS name
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
UNION ALL
SELECT 'MEMCPY',
m.start,
ROUND((m.end - m.start) / 1e3, 1),
'copyKind=' || m.copyKind
FROM CUPTI_ACTIVITY_KIND_MEMCPY m
ORDER BY start
LIMIT 50;
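That covers three of the four tables. NVTX_EVENTS holds the markers developers added in code, and it's worth a look too. A sketch, with one assumption about the schema: depending on the nsys version the marker text is either stored inline in a text column or behind a textId that points at StringIds, so this query tries both.
-- Longest NVTX ranges (instant markers have no end, so skip them)
SELECT
    n.start,
    ROUND((n.end - n.start) / 1e6, 2) AS dur_ms,
    COALESCE(n.text, s.value) AS marker
FROM NVTX_EVENTS n
LEFT JOIN StringIds s ON n.textId = s.id
WHERE n.end IS NOT NULL
ORDER BY dur_ms DESC
LIMIT 20;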
Case study: the cascade attention smoking gun
Here's the mystery: We're running vLLM inference with batch-invariant mode, which guarantees bitwise-identical results regardless of batch size. BS=1 works perfectly. BS=16 gives different results. Both traces: exactly 8,955 kernel launches. Where's the bug?
Step 1: The Census — nothing obvious
| Metric | BS=1 | BS=16 | Ratio |
|---|---|---|---|
| Total kernels | 8,955 | 8,955 | 1.00x |
| Total GPU time | 1.25s | 1.37s | 1.10x |
| Memcpy ops | 1,099 | 1,147 | 1.04x |
| Memset ops | 345 | 429 | 1.24x |
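Producing that side-by-side is one query if you attach the second export to the first. The filenames bs1.sqlite and bs16.sqlite are placeholders for your own two exports:
-- Run against bs1.sqlite, with the BS=16 export attached alongside it
ATTACH DATABASE 'bs16.sqlite' AS b16;

SELECT 'BS=1' AS trace,
       COUNT(*) AS kernels,
       ROUND(SUM(k.end - k.start) / 1e9, 2) AS gpu_seconds
FROM CUPTI_ACTIVITY_KIND_KERNEL k
UNION ALL
SELECT 'BS=16',
       COUNT(*),
       ROUND(SUM(k16.end - k16.start) / 1e9, 2)
FROM b16.CUPTI_ACTIVITY_KIND_KERNEL k16;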
Step 2: The Lineup — the outlier appears
Most categories grow modestly. Persistent GEMM +35%, Normalization +52%. Expected.
But Flash Attention: 1.20ms → 6.17ms — 5.15x increase.
In absolute terms it's only about 6 ms. But next to every other category, that ratio is a glaring outlier.
Step 3: The Stakeout — same calls, more data
reshape_and_cache_flash_kernel runs 392 times in both profiles (same count!), but takes 5.15x longer per call in BS=16. More data per call, not more calls.
Memory: 83% more device-to-device copies (53 vs 29 ops).
GEMM: BS=16 launches 7 kernel variants (88 ms total) that never appear in BS=1, while BS=1 used 7 different variants adding up to only 18 ms.
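The memcpy clue comes straight out of CUPTI_ACTIVITY_KIND_MEMCPY. A sketch of the grouping, assuming the export carries a bytes column next to copyKind (recent nsys versions do):
-- Memory traffic by direction (copyKind: 1=CPU→GPU, 2=GPU→CPU, 8=GPU→GPU)
SELECT m.copyKind,
       COUNT(*) AS ops,
       ROUND(SUM(m.bytes) / 1e6, 1) AS total_mb
FROM CUPTI_ACTIVITY_KIND_MEMCPY m
GROUP BY m.copyKind
ORDER BY total_mb DESC;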
Step 4: The code — root cause
Every clue points to one place:
# vllm/v1/attention/backends/flash_attn.py
def use_cascade_attention(common_prefix_len, query_lens, ...):
# Too short prefix — not worth splitting
if common_prefix_len < 256:
return False # ← BS=1 exits here
# Too few requests — not worth splitting
num_reqs = len(query_lens)
if num_reqs < 8:
return False # ← BS=1-7 exits here
# BS=16 with shared system prompt (>256 tokens):
# → cascade ON → split prefix/suffix → LSE merge
# → mathematically equivalent, but FP-different
return True # ← determinism breaks here
Cascade attention splits attention into prefix and suffix passes, then merges with LSE arithmetic. Mathematically equivalent. Floating-point different. That's the determinism break.
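Concretely, the merge is the standard log-sum-exp combination of two partial attention results (written here from the general formula, not lifted from vLLM's source). With prefix output $O_p$ and log-sum-exp $L_p$, and suffix output $O_s$, $L_s$:

$$
L = \log\!\left(e^{L_p} + e^{L_s}\right), \qquad
O = e^{L_p - L}\,O_p + e^{L_s - L}\,O_s
$$

Each exponential and the final sum round differently than a single softmax over the whole sequence, so the bits move even though the math doesn't.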
The fix: disable_cascade_attn=True.
The Evidence Board
| Clue | Evidence | Verdict |
|------|----------|---------|
| 5.15x attention slowdown | reshape_and_cache_flash_kernel: 1.20ms → 6.17ms | More data per call |
| 83% more D-to-D copies | 29 → 53 ops, 81 → 163 MB | Internal tensor splitting |
| 7 new GEMM variants | 88ms new vs 18ms removed | Autotuner adapting |
| Identical kernel count | 8,955 both profiles | Same graph, different PATH |
| All clues → | CASCADE ATTENTION | flash_attn.py:673 |
The cheat sheet
┌─────────────────────────────────────────────────────────────┐
│ NSYS SQLITE CHEAT SHEET │
├─────────────────────────────────────────────────────────────┤
│ │
│ THE 4 TABLES: │
│ CUPTI_ACTIVITY_KIND_KERNEL → kernel executions │
│ CUPTI_ACTIVITY_KIND_MEMCPY → memory transfers │
│ CUPTI_ACTIVITY_KIND_MEMSET → memory fills │
│ NVTX_EVENTS → human-added markers │
│ + StringIds → name lookup │
│ │
│ THE JOIN (everywhere): │
│ JOIN StringIds s ON k.shortName = s.id │
│ │
│ TIMESTAMPS: nanoseconds │
│ / 1e3 = μs / 1e6 = ms / 1e9 = seconds │
│ │
│ COPYKIND: 1=CPU→GPU 2=GPU→CPU 8=GPU→GPU │
│ │
│ GRID/BLOCK: │
│ grid × block = total threads │
│ grid=132 → all H200 SMs active │
│ grid=1 → one SM, 131 idle │
│ │
│ DANGER ZONE: │
│ *cublas* → non-deterministic GEMM │
│ *splitK* → non-deterministic reduction │
│ cascade/merge_attn → FP divergence │
│ │
└─────────────────────────────────────────────────────────────┘
Your first 10 minutes
# 1. Capture
nsys profile -o my_trace --stats=true python my_script.py
# 2. Export to SQLite
nsys export --type=sqlite my_trace.nsys-rep
# 3. Find your bottleneck
python3 -c "
import sqlite3
conn = sqlite3.connect('my_trace.sqlite')
cur = conn.cursor()
cur.execute('''
SELECT s.value, COUNT(*), ROUND(SUM(k.end-k.start)/1e6,2) AS ms
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
GROUP BY s.value ORDER BY ms DESC LIMIT 10
''')
for row in cur.fetchall():
print(f'{row[0]:50s} calls={row[1]:5d} total={row[2]:8.2f} ms')
"
In 10 minutes you'll know which kernel is your bottleneck. GPU profiling is not a dark art. It's four tables, one JOIN, and four queries.
The answer is in the trace. Go look.
All data from real H200 GPU traces: 8,955 kernel launches during Qwen3-VL-2B inference with vLLM 0.15.2.