
Hector Li

I Shipped a 5-Bug Fix to ONNX Runtime — By Telling an AI Agent "Still Wrong"

I shipped a 5-file, production-quality PR to ONNX Runtime in one session — and I wrote almost none of the code myself.


Know Your Goal (or the Problem)

I had an ONNX model with a 2-bit quantized MatMulNBits operator. It ran correctly on CPU. I wanted to run it in a web project using ONNX Runtime's WebGPU backend. I tried, and got this error:

Error running model: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: .../matmul_nbits.cc:123 ... nbits != 2 was false. Currently, zero points are not supported for Q2 quantization.

From the error message, I knew that 2-bit MatMulNBits was partially supported in WebGPU, but there was a feature gap — it didn't support models that include a zero_points input.
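To see why zero points matter, here is a minimal Python sketch of block-wise 2-bit dequantization with a zero point, i.e. w = (q - zero_point) * scale with four 2-bit values packed per byte. The packing order (LSB first) and function name are my assumptions for illustration, not ONNX Runtime's actual MatMulNBits layout:

```python
def dequantize_q2_block(packed: list[int], scale: float, zero_point: int) -> list[float]:
    """Unpack a block of 2-bit weights (4 values per byte, LSB first)
    and dequantize each as (q - zero_point) * scale.

    Illustrative only: the packing order and block layout here are an
    assumption, not necessarily ONNX Runtime's real layout.
    """
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):        # four 2-bit fields per byte
            q = (byte >> shift) & 0b11    # quantized value in [0, 3]
            out.append((q - zero_point) * scale)
    return out

# One 8-value block stored in 2 bytes, scale 0.5, zero point 2
packed = [0b11100100, 0b00011011]         # values 0,1,2,3 then 3,2,1,0
print(dequantize_q2_block(packed, 0.5, 2))
# -> [-1.0, -0.5, 0.0, 0.5, 0.5, 0.0, -0.5, -1.0]
```

Without zero-point support, the backend can only assume a fixed midpoint, which is exactly the gap the error message describes.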

As a former ONNX Runtime developer, I knew something about low-bit quantization, T-MAC, and the 2-bit implementation on CPU, but I had no experience with ONNX Runtime WebGPU development. Let's see what an AI coding agent can do with this.


Ask the AI Agent to Do the Work

  1. Open VS Code with the local ONNX Runtime repository.
  2. Copy the error message directly into the AI agent (GitHub Copilot with Claude Opus 4.6).

Round 1: Remove the Gate

From the error message, the agent located the source file that threw the error and started investigating.

The agent read the code and reasoned about it, then found the root cause and made the changes.

The agent removed the restriction — an ORT_ENFORCE(nbits != 2, ...) guard that explicitly blocked Q2 with zero points. I knew from experience that simply removing a guard wouldn't be enough to make the feature work correctly — the underlying shader logic still assumed 4-bit. But I asked the agent to build it anyway to establish a baseline. I ran it with my model. Of course, it produced wrong results.

My role: Domain judgment — knowing the guard removal was necessary but insufficient, and choosing to proceed anyway to see what broke next.

Round 2: Fix the Buffer Stride

I copied the error to the agent, and it started to investigate. It found the problem and made the changes.

The agent found that the zero-point buffer stride calculation used a Q4-only shortcut (+1) that didn't generalize to Q2's 4-values-per-byte packing. It rewrote the formula with proper ceiling arithmetic.
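In Python terms, the difference between the Q4-only shortcut and the general ceiling formula looks like this (function and variable names are mine; this is a sketch of the arithmetic, not the actual C++):

```python
def zp_bytes_q4_shortcut(n_blocks: int) -> int:
    # Q4-only: two 4-bit zero points fit in a byte, so "+1" rounds up
    return (n_blocks + 1) // 2

def zp_bytes_general(n_blocks: int, bits: int) -> int:
    # Proper ceiling: ceil(n_blocks * bits / 8) packed bytes
    return (n_blocks * bits + 7) // 8

# For 4-bit the shortcut agrees with the general formula...
assert all(zp_bytes_q4_shortcut(n) == zp_bytes_general(n, 4)
           for n in range(1, 100))

# ...but for 2-bit (4 zero points per byte) it computes the wrong stride
print(zp_bytes_q4_shortcut(5), zp_bytes_general(5, 2))  # -> 3 2
```

The "+1" trick is just ceil(n/2) in disguise, which is why it silently worked for every 4-bit model and broke only when Q2 packing arrived.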

I rebuilt and tested with my project. The result was still not correct.

My role: Testing against ground truth in a browser environment the agent couldn't access.

Round 3: Write Unit Tests as a Diagnostic Tool

At this point, staring at shader generator code wasn't productive. I asked the agent to create unit tests — not just for coverage, but as a diagnostic strategy to isolate which configurations were failing.

I asked the agent to create some unit tests to see if it could surface issues. It created the tests, found bugs, and fixed them.

The agent wrote a MatMul2BitsWebGpu test suite, found that 6 of 8 test cases failed, traced the failures to bit-shift and value-extraction ordering bugs in the TypeScript shader generator, and fixed them.
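The value-ordering bug can be reproduced in plain Python. This is my reconstruction of the failure mode, not the actual shader code: WGSL's unpack4xU8 splits a u32 into four bytes, and reading the same 2-bit position from all four bytes yields a strided order instead of the sequential order the rest of the kernel expects.

```python
def unpack4xU8(word: int) -> list[int]:
    """Simulate WGSL's unpack4xU8: split a u32 into its four bytes, LSB first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def extract_q2_buggy(word: int) -> list[int]:
    # Same 2-bit position from all four bytes first -> strided order
    bytes_ = unpack4xU8(word)
    return [(b >> shift) & 0b11 for shift in (0, 2, 4, 6) for b in bytes_]

def extract_q2_correct(word: int) -> list[int]:
    # All four 2-bit values of a byte before moving to the next byte
    bytes_ = unpack4xU8(word)
    return [(b >> shift) & 0b11 for b in bytes_ for shift in (0, 2, 4, 6)]

# Pack the sequence 0,1,2,3 repeated four times, 2 bits per value
word = 0
for i, q in enumerate([0, 1, 2, 3] * 4):
    word |= q << (2 * i)

print(extract_q2_correct(word))  # [0, 1, 2, 3, 0, 1, 2, 3, ...]
print(extract_q2_buggy(word))    # [0, 0, 0, 0, 1, 1, 1, 1, ...]
```

Both orders produce the same multiset of values, which is why this kind of bug passes casual inspection and only falls out of a numeric unit test.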

I rebuilt and tested with my project. The result was still not correct.

My role: Choosing the right diagnostic approach — unit tests revealed bugs that code reading alone couldn't surface.

Round 4: Feed It the Real Model

The unit tests were passing, but my real model still gave wrong output. I provided the agent the actual 2-bit quantized transformer model I was using.

I asked the agent to investigate with the real model. The agent walked through the code with the data and node attributes from the real model. That was amazing! It found the root cause and made the fix.

This was the most impressive round. The agent wrote Python scripts to simulate the shader's bit extraction logic step by step, using real data from my model. It discovered that the A-data (activation) pointer was being double-advanced across multi-pass loops — pass 1 was reading A[16] instead of A[8], silently skipping 8 values. A one-line fix resolved it.
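The double-advance bug is easy to reproduce in a toy loop. This is a sketch of the failure mode as I understand it (the real code is WGSL generated from TypeScript, and the names here are mine):

```python
A = list(range(32))       # toy activation buffer
VALUES_PER_PASS = 8

def first_read_per_pass(n_passes: int, double_advance: bool) -> list[int]:
    """Return the index each pass starts reading from."""
    base = 0
    starts = []
    for p in range(n_passes):
        offset = base + p * VALUES_PER_PASS   # per-pass offset already scales with p
        starts.append(A[offset])
        if double_advance:
            base += VALUES_PER_PASS           # bug: base ALSO advances each pass
    return starts

print(first_read_per_pass(2, double_advance=False))  # [0, 8]  -- correct
print(first_read_per_pass(2, double_advance=True))   # [0, 16] -- 8 values skipped
```

With both the per-pass offset and the base advancing, pass 1 lands on A[16] instead of A[8], exactly the symptom the agent's simulation uncovered.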

My role: Providing the real model — something the agent couldn't obtain on its own. This was the input that unlocked the final bug.

Round 5: Fill the Test Gaps

The result was correct with my test project. I asked the agent to add more test cases to cover all the changes. The agent said the existing tests already had good coverage, but were missing cases that match the configuration in my real model.

The result was finally correct! I asked the agent to update test coverage. It identified that the existing tests didn't include block_size=64 (the configuration my real model used, which exercises zero-point padding edge cases) and added three new test cases. All 9 tests passed.
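Why block_size matters for padding: per column, the number of zero points is ceil(K / block_size), and whether that count fills the packed bytes evenly depends on the combination. A simplified per-column sketch, with arbitrary example K values and layout assumptions that are mine, not taken from the real model:

```python
def zp_layout(K: int, block_size: int, bits: int = 2) -> tuple[int, int, int]:
    """Per-column zero-point layout: (block count, packed bytes, padded slots)."""
    n_blocks = -(-K // block_size)              # ceil(K / block_size)
    zp_bytes = (n_blocks * bits + 7) // 8       # packed zero-point bytes
    padding = zp_bytes * 8 // bits - n_blocks   # unused zero-point slots
    return n_blocks, zp_bytes, padding

print(zp_layout(1024, 64))  # (16, 4, 0) -- bytes fill exactly, no padding
print(zp_layout(1088, 64))  # (17, 5, 3) -- 3 padded slots to handle correctly
```

A block count that is not a multiple of four (for 2-bit packing) is the edge case the new tests needed to exercise.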

My role: Validating the final result against the real model and asking for coverage of the actual production configuration.


What Changed

Five bugs across five files, each hidden behind the last:

| Bug | File(s) | Issue |
| --- | --- | --- |
| Q2+ZP blocked | matmul_nbits.cc, matmul_nbits.h, WGSL template | Hard-coded guards rejecting Q2 with zero points; missing bit mask |
| Buffer stride | matmul_nbits.cc | Zero-point stride used a Q4-only +1 rounding instead of a proper ceiling formula |
| Bit shift | matmulnbits.ts | Multi-pass shift pass * 8 crossed byte boundaries; should be pass * bits * 4 |
| Value ordering | matmulnbits.ts | unpack4xU8 extracts the same bit position from all 4 bytes, the wrong order for Q2 |
| A-data offset | matmulnbits.ts | Pass 1 double-advanced the activation pointer, skipping 8 values |

The PR

All work done! Time to push the changes to GitHub and create a PR: Improve WebGPU MatMulNBits to support zero pointer for 2bits

It's worth noting that the PR didn't receive any review comments directly related to the code changes — only a future improvement request. The agent's code was production-quality on the first submission.


Bonus: Ask the Agent to Write the Blog

I asked the agent to create a blog post from what we had done.

First attempt — a technical summary of the bugs and fixes:
Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend

That's useful, but what I wanted was a blog showing how I paired with the AI agent. So I asked again:

I asked the agent to create another blog post.
Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU

Reading that second blog, you'll notice it emphasizes "what the agent did well", "tireless code reading", "the agent is most valuable on...". And you might wonder: what exactly did the developer do? Just keep saying "result is not correct!" and "why don't the tests cover all cases?" 😄


What I Actually Did

But that framing misses the point. Here's what the developer contributed that the agent couldn't:

  • Defined the problem — provided the error message, the model, and the expected behavior
  • Made strategic choices — when to build, when to switch to unit tests, when to provide the real model
  • Held ground truth — tested in a real browser environment the agent had no access to
  • Applied domain judgment — knew the guard removal was insufficient, knew which model configurations mattered

The developer's job wasn't to write code — it was to define the problem, validate the result, and make judgment calls about what to try next. That turned out to be enough.
