Hector Li

Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU

How a developer paired with an AI agent to find and fix five layered bugs in ONNX Runtime's GPU shader pipeline, without being an expert in WGSL or bit-packing. Here's the ONNX Runtime PR (merged).


The Problem

A developer needed to enable 2-bit (Q2) quantized model inference on ONNX Runtime's WebGPU backend. The 4-bit path worked, but 2-bit with zero points crashed immediately. The codebase involved C++ GPU kernels, WGSL shader templates, TypeScript shader generators, Emscripten WASM builds, and multiple build systems. A deep stack where any single layer could silently produce wrong numbers.

Rather than spending days manually tracing shader bit logic, the developer partnered with an AI coding agent (GitHub Copilot in VS Code) to systematically find and fix every issue.

Here's how that collaboration actually worked.


Step 1: "Why does it crash?" — The Agent Reads the Error

The developer shared the error message:

"Currently, zero points are not supported for Q2 quantization"

The agent searched the codebase, found the ORT_ENFORCE guard in matmul_nbits.cc and the nbits_ == 4 check in matmul_nbits.h, and identified a missing bit_mask constant in the WGSL template. Instead of just pointing these out, the agent directly applied all three fixes — removing the guards, adding the mask, and guarding the DP4A codepath that couldn't handle Q2 zero points — across three files in a single edit operation.
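
To make the Q2-vs-Q4 difference concrete, here is a tiny Python sketch of how the mask and shift depend on the bit width. This is an illustration only, not ORT's WGSL; unpack_byte is a hypothetical helper.

```python
# Illustrative sketch (not ORT's actual shader code): the per-element mask
# and shift depend on the bit width. Q4 uses mask 0xF; Q2 needs mask 0x3.
def unpack_byte(byte_value: int, bits: int) -> list[int]:
    """Unpack all quantized values stored in one byte, lowest bits first."""
    mask = (1 << bits) - 1          # 0x3 for Q2, 0xF for Q4
    per_byte = 8 // bits            # 4 values per byte for Q2, 2 for Q4
    return [(byte_value >> (i * bits)) & mask for i in range(per_byte)]

print(unpack_byte(0b11_10_01_00, bits=2))  # [0, 1, 2, 3]
print(unpack_byte(0xA3, bits=4))           # [3, 10]
```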

What the agent did well: Cross-file root cause analysis from a single error message. The developer didn't need to know which files to look at.


Step 2: "Tests pass but output is wrong" — The Agent Spots a Math Bug

With the crash fixed, the developer built and ran tests. Six of eight failed with wrong numerical output. The developer asked the agent to investigate.

The agent read the zero-point buffer stride calculation and identified that the formula n_blocks_per_col + 1 was a Q4-only shortcut. For Q2, where four values pack per byte, the stride must round up to the nearest multiple of 4. The agent wrote the corrected ceiling formula and applied it.
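
The post doesn't show the exact expression, but under the assumption that zero points are packed 8 / bits per byte, the stride reduces to a standard ceiling division. A hedged sketch (zero_point_stride_bytes is a made-up name; the expression in ORT may be written differently):

```python
# Hedged sketch of the zero-point stride calculation described above.
# Assumption: zero points are packed `8 // bits` per byte, so the
# per-column stride in bytes must round n_blocks_per_col up.
def zero_point_stride_bytes(n_blocks_per_col: int, bits: int) -> int:
    vals_per_byte = 8 // bits                                       # 2 for Q4, 4 for Q2
    return (n_blocks_per_col + vals_per_byte - 1) // vals_per_byte  # ceiling division

# For Q4, ceil(n / 2) matches the old "+1" shortcut; for Q2 the divisor
# is 4, which the shortcut got wrong.
print(zero_point_stride_bytes(5, bits=4))  # 3
print(zero_point_stride_bytes(5, bits=2))  # 2
```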

What the agent did well: Pattern recognition in quantization math. The "+1" looked innocuous but encoded a Q4 assumption the developer might have glossed over.


Step 3: "JSEP still gives wrong results" — Diving into TypeScript Shader Generators

After the native C++ path was fixed, the developer reported that the browser-facing JSEP path still produced garbage. This is where the collaboration got interesting.

The JSEP shaders are generated at runtime by TypeScript code — template strings that emit WGSL. The agent needed to understand code that writes shader code, not the shader itself.

The agent traced through matmulnbits.ts, identified that the multi-pass loop used pass * 8 as a bit shift — which works for Q4 (one pass) but for Q2 (two passes) shifts into the wrong byte — and fixed the formula to pass * bits * 4.

What the agent did well: Reasoning through meta-programming. The bug wasn't in the TypeScript or the WGSL — it was in the relationship between them.


Step 4: "Still wrong" — The Agent Writes Verification Scripts

After the shift fix, the developer tested again: "the result changed, but still not correct."

At this point, staring at code wasn't enough. The agent wrote Python simulation scripts that replicated the shader's bit extraction logic step by step. The first script (verify_extraction.py) proved the shift fix was necessary but insufficient. A second script (verify_extraction2.py) revealed the deeper bug:

The Q4 extraction pattern unpack4xU8(b_value & 0x0F0F0F0F) extracts the same bit position from all four bytes simultaneously. For Q4, that gives four sequential values. For Q2, it gives values v0, v4, v8, v12 — completely out of order relative to the sequential A-data they're multiplied with.
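
A small reconstruction in the spirit of those scripts (the originals aren't reproduced here; pack_q2 and the Python unpack4xU8 are stand-in helpers) makes the misordering visible:

```python
# Reconstruction in the spirit of the verification scripts (not the
# originals): which Q2 elements does the Q4-style per-byte mask extract?
def pack_q2(values):
    """Pack 16 two-bit values into a u32, element i at bits [2*i, 2*i + 2)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0x3) << (2 * i)
    return word

def unpack4xU8(word):
    """WGSL-style unpack: the four bytes of a u32, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

# Light up one element at a time and see where the Q4-style mask puts it.
for idx in range(16):
    values = [0] * 16
    values[idx] = 3
    lanes = unpack4xU8(pack_q2(values) & 0x03030303)
    if any(lanes):
        print(f"element {idx:2d} -> extraction lane {lanes.index(3)}")
# Prints only elements 0, 4, 8, 12 (in lanes 0..3) -- out of order relative
# to the sequential A values they are multiplied with.
```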

The agent designed a "nibble-spread" technique: take two bytes per pass, spread each nibble into its own byte of a synthetic u32, then apply the standard extraction. It wrote yet another verification script (verify_nibble_spread2.py) with a non-repeating test pattern to confirm the extraction produces values in the correct order, then applied the fix to both shader paths in the TypeScript.
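
Here is one way such a spread-then-extract step could look, assuming the low and high 2-bit lanes are re-interleaved the way the Q4 path interleaves its nibble lanes. This is a Python sketch of the idea, not the shipped TypeScript or WGSL:

```python
# Sketch of one reading of the nibble-spread idea (not the shipped shader):
# spread each nibble of two Q2-packed bytes into its own byte of a
# synthetic u32, then reuse a Q4-style low/high extraction and interleave.
def nibble_spread_extract(byte0: int, byte1: int) -> list[int]:
    # Spread the four nibbles into the four bytes of a synthetic u32.
    synthetic = ((byte0 & 0x0F)
                 | (byte0 >> 4) << 8
                 | (byte1 & 0x0F) << 16
                 | (byte1 >> 4) << 24)
    lo = [(synthetic >> (8 * i)) & 0x3 for i in range(4)]        # v0 v2 v4 v6
    hi = [(synthetic >> (8 * i + 2)) & 0x3 for i in range(4)]    # v1 v3 v5 v7
    out = []
    for l, h in zip(lo, hi):   # interleave the lanes, Q4-style (assumption)
        out += [l, h]
    return out

# Non-periodic test pattern so a misordered extraction would be detected.
vals = [0, 1, 2, 3, 3, 2, 1, 0]
b0 = vals[0] | vals[1] << 2 | vals[2] << 4 | vals[3] << 6
b1 = vals[4] | vals[5] << 2 | vals[6] << 4 | vals[7] << 6
assert nibble_spread_extract(b0, b1) == vals   # values come out in order
```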

What the agent did well: When code reading hit a wall, the agent pivoted to writing executable proofs. Each script answered a specific yes/no question about the bit logic, building confidence incrementally rather than guessing.


Step 5: "Almost — but still off" — The Last Bug

The developer tested again: "the result changed, but still not correct." Three fixes in, still wrong.

The agent wrote verify_a_offset.py, a script that traced how the A-data (activation) pointer advances across passes. It found the final bug: pass 0's inner loop increments input_offset eight times, and pass 1 then computes its start as input_offset + 8/aComponents. Because input_offset has already advanced, this double-counts the offset: pass 1 reads A[16] instead of A[8], skipping eight activation values.

The fix was a one-line change: pass 1 uses input_offset directly instead of adding an offset to an already-advanced pointer.
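
A toy model of that bookkeeping (simplified from the description above; the real generator's loop is more involved) shows the off-by-a-block read:

```python
# Simplified trace of the offset bookkeeping described above (a toy model,
# not the real shader generator): where does pass 1 start reading A?
def pass1_start(a_components: int, buggy: bool) -> int:
    input_offset = 0
    for _ in range(8):                # pass 0's inner loop advances the pointer
        input_offset += 1
    if buggy:
        # old code: add another block on top of the already-advanced pointer
        return input_offset + 8 // a_components
    return input_offset               # fix: the pointer already sits where pass 1 starts

print(pass1_start(a_components=1, buggy=True))   # 16 -> reads A[16], skips 8 values
print(pass1_start(a_components=1, buggy=False))  # 8  -> reads A[8], correct
```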

The developer tested: "the result is correct now."

What the agent did well: Maintained state across a long debugging session. By this point, the agent had built a mental model of how word_offset, input_offset, pass indices, and aComponents interact across the shader generator's nested loops — context that would take a human significant time to reconstruct after each failed attempt.


Step 6: "Do we need to update the tests?" — The Agent Adds Coverage

With all fixes working, the developer asked whether tests needed updating. The agent:

  1. Read the existing test file to assess coverage gaps
  2. Identified that block_size=64 (the real-model configuration that exercised the zero-point padding bug) had no test
  3. Added three new test cases covering block_size=64, symmetric variants, and multi-word extraction scenarios
  4. Figured out which build target to compile (onnxruntime_provider_test, not onnxruntime_test_all)
  5. Built and ran all nine tests — all passed

What the agent did well: End-to-end task completion. The developer asked a yes/no question; the agent answered by doing the work, including navigating an unfamiliar build system to find the right test binary.


The Collaboration Pattern

Looking back, the session followed a repeating cycle:

Developer: "It's broken" / "Still wrong"
    → Agent: Search, read, analyze, hypothesize
    → Agent: Write verification script OR apply code fix
    → Agent: Build
    → Developer: Test with real model
    → (repeat until correct)

The developer brought domain context (which model to test, what "correct" looks like, the build commands) and judgment (when to test, when to push back). The agent brought tireless code reading, cross-file tracing, bit-level arithmetic verification, and the ability to maintain context across a multi-hour, multi-bug debugging session without losing track of which fixes were already applied.

Key moments where the agent added outsized value:

| Situation | Without agent | With agent |
| --- | --- | --- |
| Finding all Q4-hardcoded guards | Grep + manual reading across C++, WGSL, TypeScript | Agent searched and identified all three in one pass |
| Understanding shader generator meta-programming | Mentally compile TypeScript → WGSL → GPU execution | Agent traced the template logic and identified the generated shift values |
| Verifying bit extraction ordering | Pen-and-paper binary arithmetic | Agent wrote executable Python proofs with non-repeating test patterns |
| Tracking pointer advancement across nested loops | Extremely error-prone mental simulation | Agent wrote a trace script that showed exact index values at each step |
| Maintaining context across 5 sequential bugs | Each "still wrong" resets human working memory | Agent retained cumulative understanding of every prior fix |

What Didn't Work (and What the Developer Still Had to Do)

The agent couldn't run the actual model on WebGPU — the developer had a test project with a browser environment and a real 2-bit transformer model. Each "is it correct now?" required the developer to run the model, compare output against CPU baseline, and report back. The agent operated on code structure and logic; the developer operated on ground truth.

The build system was also a friction point. The agent had to discover — through trial and error — that tests lived in onnxruntime_provider_test.exe rather than onnxruntime_test_all.exe, and that the VS 2026 Insiders vcvarsall path was non-standard. These are the kinds of environmental details where the developer's existing knowledge was essential.


Takeaways for Developers

  1. Describe symptoms, not solutions. Saying "it gives wrong results on WebGPU but correct on CPU" gave the agent more to work with than "I think the bit shift is wrong."

  2. Let the agent write verification scripts. When the bug is in bit-level arithmetic inside a shader generator, reading code has diminishing returns. Executable proofs are faster and more reliable.

  3. Iterate in tight loops. The five-bug sequence would have been demoralizing solo, each fix revealing another failure. With the agent maintaining context and proposing the next investigation immediately, the cycle stayed fast.

  4. Keep ground truth in human hands. The developer's ability to test with a real model and say "correct" or "still wrong" was the irreplaceable signal that drove the entire session. The agent can analyze and fix; only the developer can validate against the actual use case.

  5. The agent is most valuable on cross-cutting, multi-layer bugs. A bug in one file is easy. Five bugs spanning C++, WGSL templates, TypeScript shader generators, and build configuration — each masked by the previous one — is where an agent that doesn't lose context across files and hours earns its keep.
