I built a desktop AI-powered writing assistant, and cloud inference with Gemini worked great.
But I needed one more mode: offline AI.
I wanted the app to work on flights, unstable connections, and restricted environments where outbound API access is unavailable. That pushed me toward local inference with llama.cpp and GGUF models.
The biggest friction was native packaging. Cross-platform C++ setup is powerful, but it usually means extra work before product work: toolchains, linker/runtime quirks, per-architecture builds, and CI maintenance.
So I started llamadart with one practical goal:
- Make local LLM inference in Dart/Flutter usable without forcing app developers to manually build llama.cpp first.
What llamadart is
llamadart is a Dart/Flutter package for local GGUF inference on top of llama.cpp, with a Dart-first API and streaming support.
Core references:
- Package: https://pub.dev/packages/llamadart
- Source: https://github.com/leehack/llamadart
- Main docs: https://github.com/leehack/llamadart/blob/main/README.md
Architecture overview
High-level structure:
```
App Layer (Flutter UI / CLI / API server)
        |
        v
LlamaEngine (stateless orchestration)
        |
        +--> ChatSession (stateful history + context trimming)
        |
        +--> ChatTemplateEngine (detect/render/parse/tool grammar)
        |
        v
LlamaBackend abstraction
        |
        +--> NativeLlamaBackend (isolate + FFI)
        |
        +--> Web backend (bridge runtime path)
```
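To make the layering concrete, here is a heavily simplified Dart sketch of how the pieces relate. The class names mirror the diagram, but the method shapes (generate, send, the naive prompt join) are placeholders for this sketch rather than llamadart's actual API; only loadModel and dispose also appear in the real usage example later in this post.

```dart
// Heavily simplified sketch of the layering above; not llamadart's real API.

/// Backend abstraction: the native (isolate + FFI) and web backends plug in
/// here. In the package itself, constructing a backend selects the platform
/// implementation, as the usage example at the end of this post shows.
abstract class LlamaBackend {
  Future<void> loadModel(String path);
  Stream<String> generate(String prompt); // placeholder method name
  Future<void> dispose();
}

/// Stateless orchestration: owns a backend and forwards lifecycle calls.
class LlamaEngine {
  LlamaEngine(this._backend);
  final LlamaBackend _backend;

  Future<void> loadModel(String path) => _backend.loadModel(path);
  Stream<String> generate(String prompt) => _backend.generate(prompt);
  Future<void> dispose() => _backend.dispose();
}

/// Stateful chat: keeps history and turns it into a prompt before handing it
/// to the engine. In the real package, ChatTemplateEngine performs the
/// model-specific rendering and context trimming instead of this naive join.
class ChatSession {
  ChatSession(this._engine, {this.systemPrompt});
  final LlamaEngine _engine;
  final String? systemPrompt;
  final List<String> _history = [];

  Stream<String> send(String userMessage) {
    _history.add(userMessage);
    final prompt =
        [if (systemPrompt != null) systemPrompt!, ..._history].join('\n');
    return _engine.generate(prompt);
  }
}
```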
Implementation links:
- LlamaEngine: https://github.com/leehack/llamadart/blob/main/lib/src/core/engine/engine.dart
- ChatSession: https://github.com/leehack/llamadart/blob/main/lib/src/core/engine/chat_session.dart
- ChatTemplateEngine: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/chat_template_engine.dart
- LlamaBackend interface: https://github.com/leehack/llamadart/blob/main/lib/src/backends/backend.dart
- Native backend: https://github.com/leehack/llamadart/blob/main/lib/src/backends/llama_cpp/llama_cpp_backend.dart
- Web backend: https://github.com/leehack/llamadart/blob/main/lib/src/backends/webgpu/webgpu_backend.dart
Build hook deep dive: how native binaries are delivered
This is the part I cared about most.
llamadart uses Dart build hooks (official docs: https://dart.dev/tools/hooks). Build hooks are executed automatically by the Dart SDK during run, build, and test to compile or download native assets.
For FFI context, Dart also documents this under C interop: https://dart.dev/interop/c-interop#build-hooks
Hook entry point
The hook entry point is the package's build.dart: it reads target metadata from the hook input (BuildInput) and reports the resolved assets through a BuildOutputBuilder as CodeAsset entries.
Related official APIs:
- Hooks package: https://pub.dev/packages/hooks
- BuildInput and BuildOutputBuilder: https://pub.dev/documentation/hooks/latest/hooks/
- Code assets package: https://pub.dev/packages/code_assets
- CodeAsset API: https://pub.dev/documentation/code_assets/latest/code_assets/CodeAsset-class.html
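To show the shape of such a hook, here is a stripped-down sketch built on package:hooks and package:code_assets. It skips the local/cache/download resolution described in the next section, the asset name and the helper function are placeholders, and the exact accessor names on BuildInput/BuildOutputBuilder may differ between package versions, so treat it as an approximation rather than llamadart's actual build.dart.

```dart
// hook/build.dart -- simplified sketch, not llamadart's actual hook.
import 'package:code_assets/code_assets.dart';
import 'package:hooks/hooks.dart';

void main(List<String> args) async {
  await build(args, (BuildInput input, BuildOutputBuilder output) async {
    // Target OS and architecture come from the hook input. (Accessor paths
    // are an assumption here; check the hooks/code_assets version you use.)
    final os = input.config.code.targetOS;
    final arch = input.config.code.targetArchitecture;

    // Resolve a prebuilt dynamic library for (os, arch). In llamadart this is
    // the local -> cache -> GitHub Releases pipeline described below.
    final Uri libraryFile = await resolvePrebuiltLibrary(os, arch, input);

    // Report the library as a bundled code asset so Dart/Flutter ships it
    // with the app and can load it at runtime.
    output.assets.code.add(
      CodeAsset(
        package: input.packageName,
        name: 'llamadart_bindings', // illustrative asset name
        linkMode: DynamicLoadingBundled(),
        file: libraryFile,
      ),
    );
  });
}

// Placeholder for the resolution pipeline; purely illustrative.
Future<Uri> resolvePrebuiltLibrary(
    OS os, Architecture arch, BuildInput input) async {
  throw UnimplementedError('local -> cache -> GitHub Releases download');
}
```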
What happens at build/run time
The hook runs a deterministic local-first pipeline:
- Resolve target platform and architecture from hook input.
- Map (os, arch) to a concrete release asset filename (for example, libllamadart-linux-x64.so).
- Try local packaged binary (third_party/bin/...) first.
- If missing, try cache (.dart_tool/llamadart/binaries/...).
- If cache miss, download from GitHub Releases.
- Copy to hook output directory with standardized runtime filename.
- Report the asset as DynamicLoadingBundled() so Dart can bundle and load it.
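In code, that local-first resolution looks roughly like the sketch below. The directory layout follows the list above; the function names are hypothetical placeholders, not llamadart's, and resolution relative to the package root is omitted.

```dart
import 'dart:io';

// Rough sketch of the local-first pipeline; the paths come from the list
// above, the function names are placeholders.
Future<File> resolveBinary(String assetFileName) async {
  // 1. Binary packaged with the source (third_party/bin/...).
  final packaged = File('third_party/bin/$assetFileName');
  if (await packaged.exists()) return packaged;

  // 2. Cache from a previous build (.dart_tool/llamadart/binaries/...).
  final cached = File('.dart_tool/llamadart/binaries/$assetFileName');
  if (await cached.exists()) return cached;

  // 3. Cache miss: download from the pinned GitHub release (see the next
  //    section for the URL), then keep it in the cache for later builds.
  await downloadReleaseAsset(assetFileName, into: cached);
  return cached;
}

// Placeholder; a concrete download sketch follows in the next section.
Future<void> downloadReleaseAsset(String name, {required File into}) async =>
    throw UnimplementedError();
```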
On Apple targets, the hook also handles binary thinning when needed (lipo) so the reported artifact matches the active architecture.
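The thinning step itself boils down to invoking lipo on a universal binary. A minimal sketch, assuming standard lipo flags and an already-resolved universal dylib (the surrounding logic in the real hook differs):

```dart
import 'dart:io';

// Illustrative: extract a single architecture (e.g. 'arm64' or 'x86_64')
// from a universal dylib with `lipo -thin`, so the reported asset matches
// the active target architecture.
Future<void> thinMacBinary(String input, String arch, String output) async {
  final result = await Process.run(
    'lipo',
    [input, '-thin', arch, '-output', output],
  );
  if (result.exitCode != 0) {
    throw ProcessException(
        'lipo', [input], result.stderr as String, result.exitCode);
  }
}
```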
How GitHub is used
llamadart uses GitHub Releases as the binary distribution layer.
In the hook, the release URL is pinned by a llama.cpp tag constant:
```dart
const _llamaCppTag = 'b8011';
const _baseUrl =
    'https://github.com/leehack/llamadart/releases/download/$_llamaCppTag';
```
That gives stable, tag-based binary resolution instead of "latest" behavior.
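Composing the actual download URL from those constants is then straightforward. The fetch sketch below uses dart:io's HttpClient and is an illustrative sketch, not the package's actual downloader.

```dart
import 'dart:io';

const _llamaCppTag = 'b8011';
const _baseUrl =
    'https://github.com/leehack/llamadart/releases/download/$_llamaCppTag';

// Illustrative downloader: fetch a pinned release asset (for example,
// libllamadart-linux-x64.so) from _baseUrl into the local cache file.
Future<void> downloadReleaseAsset(String name, {required File into}) async {
  final uri = Uri.parse('$_baseUrl/$name');
  final client = HttpClient();
  try {
    final request = await client.getUrl(uri);
    // GitHub release downloads redirect to a CDN; follow the redirects.
    request.followRedirects = true;
    final response = await request.close();
    if (response.statusCode != HttpStatus.ok) {
      throw HttpException('Download failed (${response.statusCode})', uri: uri);
    }
    await into.parent.create(recursive: true);
    await response.pipe(into.openWrite());
  } finally {
    client.close();
  }
}
```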
The release assets are produced by a workflow in the repository that builds artifacts across target OS/arch matrices, uploads the release assets, regenerates bindings when needed, and updates the pinned hook tag.
Runtime FFI resolution
At runtime, FFI symbols are linked to the hook-provided asset via annotations like:
- @Native: https://api.dart.dev/dart-ffi/Native-class.html
- @DefaultAsset: https://api.dart.dev/dart-ffi/DefaultAsset-class.html
In llamadart, the generated bindings use the @DefaultAsset pattern so that FFI symbols resolve against the hook-provided asset.
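A minimal sketch of that pattern follows; the asset id and the bound symbol are placeholders, not copied from llamadart's generated bindings.

```dart
// bindings.dart -- sketch only; the asset id and symbol are placeholders.
@DefaultAsset('package:llamadart/llamadart_bindings')
library;

import 'dart:ffi';

// Binds the C symbol against the code asset the build hook reported.
@Native<Void Function()>()
external void llama_backend_init();
```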
That avoids per-platform dynamic library path logic in app code.
Why this matters in practice
This hook + release model gives a practical DX benefit:
- Package users avoid manual local llama.cpp build steps in the common case.
- First build can fetch the correct binary for the target.
- Subsequent builds use cache/local artifacts.
- App code stays focused on model lifecycle and inference logic, not native packaging scripts.
A quick note on dinja
While implementing chat-template compatibility, I ran into another gap: GGUF models expose Jinja templates, but template behavior differs significantly across model families.
I could not find a Dart Jinja package tuned for this multi-template LLM use case, so I built dinja (source: https://github.com/leehack/dinja).
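To illustrate the kind of divergence involved, here are two simplified, generic template fragments (ChatML-style vs. Llama-2-chat-style) embedded as Dart raw strings; they are representative of the families, not exact templates from any specific GGUF file.

```dart
// ChatML-style: role markers with <|im_start|>/<|im_end|>.
const chatMlStyle = r'''
{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}''';

// Llama-2-chat-style: user turns wrapped in [INST] ... [/INST] markers.
const llama2Style = r'''
{% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% else %}{{ message['content'] }}{% endif %}{% endfor %}''';
```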
In llamadart:
- Template handlers render prompts with dinja.
- JinjaAnalyzer does AST-level capability checks (system role, tool-call shape, typed content, thinking tags).
- ChatTemplateEngine routes handlers and supports custom overrides/registration.
Relevant links:
- Handlers directory: https://github.com/leehack/llamadart/tree/main/lib/src/core/template/handlers
- Jinja analyzer: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/jinja/jinja_analyzer.dart
- Template engine: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/chat_template_engine.dart
Minimal usage example
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('path/to/model.gguf');

    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful writing assistant.',
    );

    await for (final chunk in session.create([
      LlamaTextContent('Rewrite this paragraph in a concise style.'),
    ])) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
```
Closing
I built llamadart because my app needed both online model quality and reliable offline execution, without turning native build management into the main project.
If you are building AI features in Dart/Flutter and need local inference, I would love feedback on the GitHub repo: https://github.com/leehack/llamadart