
Hector Li

I Shipped a 5-Bug Fix to ONNX Runtime — By Telling an AI Agent "Still Wrong"

I shipped a 5-file, production-quality PR to ONNX Runtime in one session — and I wrote almost none of the code myself.


Know Your Goal (or the Problem)

I had an ONNX model with a 2-bit quantized MatMulNBits operator. It ran correctly on CPU. I wanted to run it in a web project using ONNX Runtime's WebGPU backend. I tried, and got this error:

Error running model: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: .../matmul_nbits.cc:123 ... nbits != 2 was false. Currently, zero points are not supported for Q2 quantization.

From the error message, I knew that 2-bit MatMulNBits was partially supported in WebGPU, but there was a feature gap — it didn't support models that include a zero_points input.
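To see why zero points matter, here is a minimal Python sketch of block-wise 2-bit dequantization with a zero point, i.e. w = (q - zero_point) * scale with four 2-bit values packed per byte. The packing order (LSB first) and function name are my assumptions for illustration, not ONNX Runtime's actual MatMulNBits layout:

```python
def dequantize_q2_block(packed: list[int], scale: float, zero_point: int) -> list[float]:
    """Unpack a block of 2-bit weights (4 values per byte, LSB first)
    and dequantize each as (q - zero_point) * scale.

    Illustrative only: the packing order and block layout here are an
    assumption, not necessarily ONNX Runtime's real layout.
    """
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):        # four 2-bit fields per byte
            q = (byte >> shift) & 0b11    # quantized value in [0, 3]
            out.append((q - zero_point) * scale)
    return out

# One 8-value block stored in 2 bytes, scale 0.5, zero point 2
packed = [0b11100100, 0b00011011]         # values 0,1,2,3 then 3,2,1,0
print(dequantize_q2_block(packed, 0.5, 2))
# -> [-1.0, -0.5, 0.0, 0.5, 0.5, 0.0, -0.5, -1.0]
```

Without zero-point support, the backend can only assume a fixed midpoint, which is exactly the gap the error message describes.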

As a former ONNX Runtime developer, I knew something about low-bit quantization, T-MAC, and the 2-bit implementation on CPU, but I had no experience with ONNX Runtime WebGPU development. Let's see what an AI coding agent can do with this.


Ask the AI Agent to Do the Work

  1. Open VS Code with the local ONNX Runtime repository.
  2. Copy the error message directly into the AI agent (GitHub Copilot with Claude Opus 4.6).

Round 1: Remove the Gate

From the error message, the agent located the source file that threw the error and started investigating.

The agent read the code and reasoned about it, then found the root cause and made the changes.

The agent removed the restriction — an ORT_ENFORCE(nbits != 2, ...) guard that explicitly blocked Q2 with zero points. I knew from experience that simply removing a guard wouldn't be enough to make the feature work correctly — the underlying shader logic still assumed 4-bit. But I asked the agent to build it anyway to establish a baseline. I ran it with my model. Of course, it produced wrong results.

My role: Domain judgment — knowing the guard removal was necessary but insufficient, and choosing to proceed anyway to see what broke next.

Round 2: Fix the Buffer Stride

I copied the error to the agent, and it started to investigate. It found the problem and made the changes.

The agent found that the zero-point buffer stride calculation used a Q4-only shortcut (+1) that didn't generalize to Q2's 4-values-per-byte packing. It rewrote the formula with proper ceiling arithmetic.
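In Python terms, the difference between the Q4-only shortcut and the general ceiling formula looks like this (function and variable names are mine; this is a sketch of the arithmetic, not the actual C++):

```python
def zp_bytes_q4_shortcut(n_blocks: int) -> int:
    # Q4-only: two 4-bit zero points fit in a byte, so "+1" rounds up
    return (n_blocks + 1) // 2

def zp_bytes_general(n_blocks: int, bits: int) -> int:
    # Proper ceiling: ceil(n_blocks * bits / 8) packed bytes
    return (n_blocks * bits + 7) // 8

# For 4-bit the shortcut agrees with the general formula...
assert all(zp_bytes_q4_shortcut(n) == zp_bytes_general(n, 4)
           for n in range(1, 100))

# ...but for 2-bit (4 zero points per byte) it computes the wrong stride
print(zp_bytes_q4_shortcut(5), zp_bytes_general(5, 2))  # -> 3 2
```

The "+1" trick is just ceil(n/2) in disguise, which is why it silently worked for every 4-bit model and broke only when Q2 packing arrived.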

I rebuilt and tested with my project. The result was still not correct.

My role: Testing against ground truth in a browser environment the agent couldn't access.

Round 3: Write Unit Tests as a Diagnostic Tool

At this point, staring at shader generator code wasn't productive. I asked the agent to create unit tests — not just for coverage, but as a diagnostic strategy to isolate which configurations were failing.

I asked the agent to create some unit tests to see if it could surface issues. It created the tests, found bugs, and fixed them.

The agent wrote a MatMul2BitsWebGpu test suite, found that 6 of 8 test cases failed, traced the failures to bit-shift and value-extraction ordering bugs in the TypeScript shader generator, and fixed them.
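The value-ordering bug can be reproduced in plain Python. This is my reconstruction of the failure mode, not the actual shader code: WGSL's unpack4xU8 splits a u32 into four bytes, and reading the same 2-bit position from all four bytes yields a strided order instead of the sequential order the rest of the kernel expects.

```python
def unpack4xU8(word: int) -> list[int]:
    """Simulate WGSL's unpack4xU8: split a u32 into its four bytes, LSB first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def extract_q2_buggy(word: int) -> list[int]:
    # Same 2-bit position from all four bytes first -> strided order
    bytes_ = unpack4xU8(word)
    return [(b >> shift) & 0b11 for shift in (0, 2, 4, 6) for b in bytes_]

def extract_q2_correct(word: int) -> list[int]:
    # All four 2-bit values of a byte before moving to the next byte
    bytes_ = unpack4xU8(word)
    return [(b >> shift) & 0b11 for b in bytes_ for shift in (0, 2, 4, 6)]

# Pack the sequence 0,1,2,3 repeated four times, 2 bits per value
word = 0
for i, q in enumerate([0, 1, 2, 3] * 4):
    word |= q << (2 * i)

print(extract_q2_correct(word))  # [0, 1, 2, 3, 0, 1, 2, 3, ...]
print(extract_q2_buggy(word))    # [0, 0, 0, 0, 1, 1, 1, 1, ...]
```

Both orders produce the same multiset of values, which is why this kind of bug passes casual inspection and only falls out of a numeric unit test.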

I rebuilt and tested with my project. The result was still not correct.

My role: Choosing the right diagnostic approach — unit tests revealed bugs that code reading alone couldn't surface.

Round 4: Feed It the Real Model

The unit tests were passing, but my real model still gave wrong output. I provided the agent the actual 2-bit quantized transformer model I was using.

I asked the agent to investigate with the real model. The agent walked through the code with the data and node attributes from the real model. That was amazing! It found the root cause and made the fix.

This was the most impressive round. The agent wrote Python scripts to simulate the shader's bit extraction logic step by step, using real data from my model. It discovered that the A-data (activation) pointer was being double-advanced across multi-pass loops — pass 1 was reading A[16] instead of A[8], silently skipping 8 values. A one-line fix resolved it.
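The double-advance bug is easy to reproduce in a toy loop. This is a sketch of the failure mode as I understand it (the real code is WGSL generated from TypeScript, and the names here are mine):

```python
A = list(range(32))       # toy activation buffer
VALUES_PER_PASS = 8

def first_read_per_pass(n_passes: int, double_advance: bool) -> list[int]:
    """Return the index each pass starts reading from."""
    base = 0
    starts = []
    for p in range(n_passes):
        offset = base + p * VALUES_PER_PASS   # per-pass offset already scales with p
        starts.append(A[offset])
        if double_advance:
            base += VALUES_PER_PASS           # bug: base ALSO advances each pass
    return starts

print(first_read_per_pass(2, double_advance=False))  # [0, 8]  -- correct
print(first_read_per_pass(2, double_advance=True))   # [0, 16] -- 8 values skipped
```

With both the per-pass offset and the base advancing, pass 1 lands on A[16] instead of A[8], exactly the symptom the agent's simulation uncovered.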

My role: Providing the real model — something the agent couldn't obtain on its own. This was the input that unlocked the final bug.

Round 5: Fill the Test Gaps

The result was correct with my test project. I asked the agent to add more test cases to cover all the changes. The agent said the existing tests already had good coverage, but were missing cases that match the configuration in my real model.

The result was finally correct! I asked the agent to update test coverage. It identified that the existing tests didn't include block_size=64 (the configuration my real model used, which exercises zero-point padding edge cases) and added three new test cases. All 9 tests passed.
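Why block_size matters for padding: per column, the number of zero points is ceil(K / block_size), and whether that count fills the packed bytes evenly depends on the combination. A simplified per-column sketch, with arbitrary example K values and layout assumptions that are mine, not taken from the real model:

```python
def zp_layout(K: int, block_size: int, bits: int = 2) -> tuple[int, int, int]:
    """Per-column zero-point layout: (block count, packed bytes, padded slots)."""
    n_blocks = -(-K // block_size)              # ceil(K / block_size)
    zp_bytes = (n_blocks * bits + 7) // 8       # packed zero-point bytes
    padding = zp_bytes * 8 // bits - n_blocks   # unused zero-point slots
    return n_blocks, zp_bytes, padding

print(zp_layout(1024, 64))  # (16, 4, 0) -- bytes fill exactly, no padding
print(zp_layout(1088, 64))  # (17, 5, 3) -- 3 padded slots to handle correctly
```

A block count that is not a multiple of four (for 2-bit packing) is the edge case the new tests needed to exercise.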

My role: Validating the final result against the real model and asking for coverage of the actual production configuration.


What Changed

Five bugs across five files, each hidden behind the last:

| Bug | File(s) | Issue |
| --- | --- | --- |
| Q2+ZP blocked | matmul_nbits.cc, matmul_nbits.h, WGSL template | Hard-coded guards rejecting Q2 with zero points; missing bit mask |
| Buffer stride | matmul_nbits.cc | Zero-point stride used a Q4-only +1 rounding instead of a proper ceiling formula |
| Bit shift | matmulnbits.ts | Multi-pass shift pass * 8 crossed byte boundaries; should be pass * bits * 4 |
| Value ordering | matmulnbits.ts | unpack4xU8 extracts the same bit position from all 4 bytes, the wrong order for Q2 |
| A-data offset | matmulnbits.ts | Pass 1 double-advanced the activation pointer, skipping 8 values |

The PR

All work done! Time to push the changes to GitHub and create a PR: Improve WebGPU MatMulNBits to support zero pointer for 2bits

It's worth noting that the PR didn't receive any review comments directly related to the code changes — only a future improvement request. The agent's code was production-quality on the first submission.


Bonus: Ask the Agent to Write the Blog

I asked the agent to create a blog post from what we had done.

First attempt — a technical summary of the bugs and fixes:
Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend

That's useful, but what I wanted was a blog showing how I paired with the AI agent. So I asked again:

I asked the agent to create another blog post.
Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU

Reading that second blog, you'll notice it emphasizes "what the agent did well", "tireless code reading", "the agent is most valuable on...". And you might wonder: what exactly did the developer do? Just keep saying "result is not correct!" and "why don't the tests cover all cases?" 😄


What I Actually Did

But that framing misses the point. Here's what the developer contributed that the agent couldn't:

  • Defined the problem — provided the error message, the model, and the expected behavior
  • Made strategic choices — when to build, when to switch to unit tests, when to provide the real model
  • Held ground truth — tested in a real browser environment the agent had no access to
  • Applied domain judgment — knew the guard removal was insufficient, knew which model configurations mattered

The developer's job wasn't to write code — it was to define the problem, validate the result, and make judgment calls about what to try next. That turned out to be enough.
