
Shankar

How I Built a Chrome Extension That Runs Llama, DeepSeek, and Mistral Locally Using WebGPU (No Ollama, No Server)

No external software, no cloud, no complexity. Just install and start chatting.

TL;DR: I built a Chrome extension that runs Llama, DeepSeek, Qwen, and other LLMs entirely in-browser using WebGPU, Transformers.js, and Chrome's Prompt API. No server, no Ollama, no API keys. Here's the architecture and what I learned.

Try it: noaibills.app

So far, in-browser WebGPU LLMs have mostly shown up as demos and proofs of concept (GitHub repos, standalone sites). As far as I know, this is the first Chrome extension that brings the same experience to people who just want to install and go: no dev setup, no API keys, no server. Here's how I built it and what I learned along the way.


Why I built it

Honestly? I was frustrated.

Cloud AI wants your data. Every time I tried using ChatGPT or similar tools for anything remotely sensitive—drafting a work email, reviewing code, journaling—I'd pause and think: "Wait, this is going to their servers." That bugged me. I wanted something that just... stayed on my machine.

"Local AI" usually means yak-shaving. Sure, Ollama exists. It's great. But every time I recommended it to a non-dev friend, I'd get the same look. "Open terminal... run this command... wait, what's a model?" And forget about it if you're on a locked-down work laptop where you can't install anything. I wanted something my mom could use. Okay, maybe not my mom—but you get the idea.

$20/month adds up. I don't use AI heavily enough to justify a subscription. I just want to fix some grammar, summarize a doc, or get unstuck on a coding problem a few times a week. Paying monthly for that felt wrong.

So I set out to build something private, simple, and free. A Chrome extension that runs models inside the browser itself—no server, no Ollama, no Docker, no nonsense. Just install and chat.

Turns out WebGPU makes this actually possible now. Here's how I put it together.


Why client-side?

Three reasons:

Privacy. Your messages and the model weights stay on your machine. Nothing leaves the browser. That's not a marketing claim—it's just how the architecture works.

Cost. After you download a model once, inference is free. No API calls, no usage billing, no surprises.

Offline works. Once a model is cached, you can use it on a plane, in the subway, wherever. No internet needed.

The tradeoff? You're limited to smaller models: quantized Llama, SmolLM, Phi, and distilled DeepSeek R1 variants. Nothing massive. But for everyday stuff like drafting, summarizing, and coding help? More than enough.


High-level architecture

Here's the bird's eye view:

The app is a Next.js front end that talks to the UI through useChat from the Vercel AI SDK. But instead of hitting an API endpoint, it uses a custom transport that runs streamText directly in the browser using browser-ai providers—WebLLM, Transformers.js, or Chrome's built-in Prompt API.

A small model manager handles the messy parts: picking the right backend, caching model instances, showing download progress. The whole thing can also run as a Chrome extension (side panel) via static export.

Same UI, same code, but the "backend" is your GPU. No Node server anywhere.
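
To make that concrete, here's a minimal sketch of the wiring, assuming the AI SDK 5 transport hookup. BrowserAIChatTransport is the custom transport described below; the import path and component name are illustrative.

import { useChat } from "@ai-sdk/react";
// Custom transport that streams from the local model instead of an API route (path is illustrative).
import { BrowserAIChatTransport } from "@/lib/browser-ai-chat-transport";

export function Chat() {
  // useChat streams tokens from whatever the transport returns; no /api/chat route involved.
  const { messages, sendMessage } = useChat({
    transport: new BrowserAIChatTransport(),
  });
  // ...render messages, the input box, and the model picker here.
  return null;
}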


1. One transport to rule them all

The AI SDK's useChat doesn't care where messages go. It just wants a transport that takes messages and returns a stream.

So I built BrowserAIChatTransport. It:

  • Grabs the current provider and model from a Zustand store
  • Gets the actual language model from the model manager
  • For reasoning models like DeepSeek R1, wraps everything with extractReasoningMiddleware to parse <think>...</think> tags
  • Calls streamText and returns result.toUIMessageStream()

One gotcha: not every model supports every option. Some don't have topP, others ignore presencePenalty. So the transport only passes options that (a) the current provider actually supports and (b) are explicitly set. Learned that one the hard way.
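
In code, that filtering looks roughly like this. The supportsOption helper and the local option variables are illustrative, not part of the AI SDK.

// Only forward sampling options that the provider supports AND the user has explicitly set.
const candidateOptions = { temperature, topP, topK, presencePenalty, frequencyPenalty };
const streamOptions = {};
for (const [key, value] of Object.entries(candidateOptions)) {
  if (value !== undefined && supportsOption(provider, key)) {
    streamOptions[key] = value;
  }
}

With streamOptions built this way, the core of the transport's send path is: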

// From the Vercel AI SDK ("ai" package); modelManager and isReasoningModel are my own helpers.
import { streamText, wrapLanguageModel, extractReasoningMiddleware } from "ai";

// Resolve the cached in-browser model for the currently selected provider.
const baseModel = modelManager.getModel(provider, modelId);

// Reasoning models (e.g. DeepSeek R1) emit <think>...</think>; the middleware splits that
// out into reasoning parts so the UI can render it separately from the answer.
const model = isReasoningModel(modelId)
  ? wrapLanguageModel({
      model: baseModel,
      middleware: extractReasoningMiddleware({ tagName: "think", startWithReasoning: true }),
    })
  : baseModel;

const result = streamText({ model, messages: modelMessages, ...streamOptions });
return result.toUIMessageStream();

One transport. Multiple providers. UI doesn't know the difference.


2. Three providers, one interface

I support three backends:

  • WebLLM (MLC): WebGPU-based. Best for bigger models like Llama 3.2, Qwen, DeepSeek R1. Fast if you have a decent GPU.
  • Transformers.js: Runs on CPU via WASM. Smaller footprint. Good for lighter models like SmolLM.
  • Browser AI (Prompt API): Chrome's built-in model (Gemini Nano). No download at all—just works.

They all wire into the same LanguageModelV3 interface. The model manager instantiates the right adapter, caches instances so switching threads doesn't re-download anything, and fires progress callbacks for the loading UI.
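
Stripped down, the manager is a keyed cache in front of the adapters. The create* factory names below are placeholders for my actual adapter constructors.

// One instance per provider/model pair, so switching threads never re-downloads weights.
const cache = new Map();

export const modelManager = {
  getModel(provider, modelId, onProgress) {
    const key = `${provider}:${modelId}`;
    if (!cache.has(key)) {
      const adapter =
        provider === "webllm"
          ? createWebLLMModel(modelId, { onProgress })         // WebGPU (MLC)
          : provider === "transformers"
            ? createTransformersModel(modelId, { onProgress }) // WASM / CPU
            : createPromptApiModel();                          // Chrome's built-in Gemini Nano
      cache.set(key, adapter);
    }
    return cache.get(key);
  },
};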

I keep all the model IDs in a single models module—filtered for low VRAM, tagged for "supports reasoning" or "supports vision." That way the transport and UI both know what each model can do.
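
The entries themselves are plain objects with capability flags, roughly like this. The IDs shown are examples, not the full catalog.

// Single source of truth for what each model is and what it can do.
export const MODELS = [
  { id: "Llama-3.2-1B-Instruct-q4f16_1-MLC",   provider: "webllm",       reasoning: false, vision: false },
  { id: "DeepSeek-R1-Distill-Qwen-1.5B",       provider: "webllm",       reasoning: true,  vision: false },
  { id: "HuggingFaceTB/SmolLM2-360M-Instruct", provider: "transformers", reasoning: false, vision: false },
];

export const isReasoningModel = (modelId) =>
  MODELS.some((m) => m.id === modelId && m.reasoning);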


3. The loading experience

Browser models need to download weights. Sometimes gigabytes of them. I didn't want users staring at a blank screen wondering if it was broken.

So I built a useModelInitialization hook that:

  1. Checks if the model is already cached (availability() returns "available")
  2. If not, kicks off a minimal streamText call to trigger the download
  3. Pipes progress updates to the UI

The tricky part: progress can come from two places—the model manager's callback OR the stream itself (via data-modelDownloadProgress parts). I ended up merging both into the same state so users see one smooth progress bar.
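
The merge itself is just "take the max of whatever anyone reports". The hook and part names below mirror my setup but are simplified.

import { useCallback, useState } from "react";

export function useDownloadProgress() {
  // Both sources report 0-100; keeping the max means the bar never jumps backwards
  // when the slower source catches up.
  const [progress, setProgress] = useState(0);
  const reportProgress = useCallback((pct) => {
    setProgress((prev) => Math.max(prev, Math.round(pct)));
  }, []);
  return { progress, reportProgress };
}

// Source 1: the model manager's download callback.
//   modelManager.getModel(provider, modelId, ({ progress }) => reportProgress(progress * 100));
// Source 2: custom data parts on the UI message stream.
//   if (part.type === "data-modelDownloadProgress") reportProgress(part.data.progress);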


4. Reasoning models and think tags

DeepSeek R1 and similar models emit their reasoning in <think>...</think> blocks before giving the final answer. I wanted to show that separately—collapsible, so you can peek at the model's thought process.

The AI SDK's extractReasoningMiddleware handles the parsing. On the UI side, I check each message part: if it's reasoning, render a <Reasoning> component; if it's text, render the normal response. Same stream, two different displays.
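
On the rendering side that's a small switch over message parts. <Reasoning> and <Markdown> stand in for my actual components; the part shape follows the AI SDK's UI message format.

// Each UI message is a list of typed parts; reasoning and answer text render differently.
function MessageBody({ message }) {
  return message.parts.map((part, index) => {
    switch (part.type) {
      case "reasoning":
        // Collapsible "thought process" extracted from <think>...</think> by the middleware.
        return <Reasoning key={index} text={part.text} />;
      case "text":
        return <Markdown key={index}>{part.text}</Markdown>;
      default:
        return null;
    }
  });
}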


5. Making it work as a Chrome extension

This was the fiddly part.

Static export. Next.js builds with output: "export", drops everything into extension/ui/. The side panel loads ui/index.html.
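
For reference, the config side is just the standard static-export setup; copying the exported files into extension/ui/ happens in a separate build step.

// next.config.mjs: build to static files so the whole app can run inside an extension.
/** @type {import('next').NextConfig} */
const nextConfig = {
  output: "export",               // no Node server; next build emits plain HTML/JS/CSS
  images: { unoptimized: true },  // needed if you use next/image, since the optimizer is server-side
};

export default nextConfig;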

CSP hell. Chrome extensions don't allow inline scripts. So I wrote a post-build script that extracts every inline <script> from the HTML, saves them as separate files, and rewrites the HTML to reference them. Fun times.
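
Here's a simplified, regex-based sketch of that step (a real HTML parser would be sturdier; this only shows the index.html case).

// scripts/externalize-inline-scripts.mjs: pull inline <script> bodies out into real files.
import { readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const dir = "extension/ui";
const htmlPath = join(dir, "index.html");
let html = readFileSync(htmlPath, "utf8");
let count = 0;

html = html.replace(/<script>([\s\S]*?)<\/script>/g, (_match, body) => {
  const file = `inline-${count++}.js`;
  writeFileSync(join(dir, file), body);
  // Reference the extracted file instead, which the extension CSP allows.
  return `<script src="./${file}"></script>`;
});

writeFileSync(htmlPath, html);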

WASM loading. Transformers.js needs ONNX Runtime WASM files. Can't load those from a CDN in an extension. So the build script copies them into extension/transformers/ and I set web_accessible_resources accordingly.

End result: one codebase, one build process. Dev mode runs at localhost:3000, production becomes a Chrome extension.


6. Chat history in IndexedDB

I wanted conversations to survive tab closes and browser restarts. Used Dexie (a nice IndexedDB wrapper) with a simple schema: conversation id, title, model, provider, created date, and the full messages array.
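
With Dexie the schema is only a few lines. The key string lists the primary key plus indexed fields; non-indexed fields like the messages array are stored without being declared. The database name and helper below are illustrative.

import Dexie from "dexie";

export const db = new Dexie("chat-history"); // database name is illustrative

db.version(1).stores({
  // primary key first, then indexed fields; messages[] is stored but not indexed
  conversations: "id, title, model, provider, createdAt",
});

// Save or update a whole conversation, messages array included.
export const saveConversation = (conversation) =>
  db.table("conversations").put(conversation);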

When you pick a conversation from history, it rehydrates everything—including which model you were using—so you can keep chatting right where you left off.

I also had to migrate from an older localStorage-based format. On first load, the app checks for legacy data, bulk-inserts into IndexedDB, then cleans up. Nobody loses their old chats.
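
The migration is a one-shot check on startup, roughly like this; the localStorage key name is a placeholder.

// Move any legacy localStorage history into IndexedDB, then clean up the old copy.
export async function migrateLegacyHistory() {
  const raw = localStorage.getItem("chat-history-v1"); // placeholder key name
  if (!raw) return;
  try {
    const conversations = JSON.parse(raw);
    await db.table("conversations").bulkPut(conversations); // bulk insert, keyed by id
    localStorage.removeItem("chat-history-v1");
  } catch {
    // If parsing fails, leave the legacy data alone rather than risk losing chats.
  }
}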


What I learned

Abstract early. The single transport pattern saved me a ton of headaches. Adding a new provider is just "wire up the adapter, add the model IDs." The UI doesn't care.

Browser limitations are real but manageable. CSP, WASM loading, storage quotas—all solvable with the right build scripts. Just budget time for it.

Progress feedback matters. Users will wait for a 2GB download if they can see it happening. A blank screen with no feedback? They'll close the tab.

Local AI is good enough for most things. I'm not claiming it replaces GPT-4. But for the 80% of tasks—drafts, summaries, quick coding questions—a 3B parameter model running locally is plenty.

It's not positioned as a cloud LLM replacement. It's for local inference on basic text tasks (writing, communication, drafts) with zero internet dependency, no API costs, and complete privacy.

The core fit: people in organizations with data restrictions that block cloud AI and don't allow installing desktop tools like Ollama or LM Studio. It covers quick drafts, grammar checks, and basic reasoning without budget or setup barriers.

Need real-time knowledge or complex reasoning? Use cloud models. This serves a different niche—not every problem needs a sledgehammer 😄.


If you're building something similar, hopefully this saves you some time. The patterns here—single transport, model manager, static export with build-time fixes—should generalize to whatever in-browser runtime you're targeting.

Give it a try: noaibills.app

And if you have questions or feedback, I'd love to hear it.
