I built a desktop AI-powered writing assistant, and cloud inference with Gemini worked great.
But I needed one more mode: offline AI.
I wanted the app to work on flights, unstable connections, and restricted environments where outbound API access is unavailable. That pushed me toward local inference with llama.cpp and GGUF models.
The biggest friction was native packaging. Cross-platform C++ setup is powerful, but it usually means extra work before product work: toolchains, linker/runtime quirks, per-architecture builds, and CI maintenance.
So I started llamadart with one practical goal:
- Make local LLM inference in Dart/Flutter usable without forcing app developers to manually build llama.cpp first.
What llamadart is
llamadart is a Dart/Flutter package for local GGUF inference on top of llama.cpp, with a Dart-first API and streaming support.
Core references:
- Package: https://pub.dev/packages/llamadart
- Source: https://github.com/leehack/llamadart
- Main docs: https://github.com/leehack/llamadart/blob/main/README.md
Architecture overview
High-level structure:
```
App Layer (Flutter UI / CLI / API server)
        |
        v
LlamaEngine (stateless orchestration)
        |
        +--> ChatSession (stateful history + context trimming)
        |
        +--> ChatTemplateEngine (detect/render/parse/tool grammar)
        |
        v
LlamaBackend abstraction
        |
        +--> NativeLlamaBackend (isolate + FFI)
        |
        +--> Web backend (bridge runtime path)
```
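To make the layering concrete, here is a heavily simplified Dart sketch of how the pieces relate. The class names mirror the diagram, but the method shapes (generate, send, the naive prompt join) are placeholders for this sketch rather than llamadart's actual API; only loadModel and dispose also appear in the real usage example later in this post.

```dart
// Heavily simplified sketch of the layering above; not llamadart's real API.

/// Backend abstraction: the native (isolate + FFI) and web backends plug in
/// here. In the package itself, constructing a backend selects the platform
/// implementation, as the usage example at the end of this post shows.
abstract class LlamaBackend {
  Future<void> loadModel(String path);
  Stream<String> generate(String prompt); // placeholder method name
  Future<void> dispose();
}

/// Stateless orchestration: owns a backend and forwards lifecycle calls.
class LlamaEngine {
  LlamaEngine(this._backend);
  final LlamaBackend _backend;

  Future<void> loadModel(String path) => _backend.loadModel(path);
  Stream<String> generate(String prompt) => _backend.generate(prompt);
  Future<void> dispose() => _backend.dispose();
}

/// Stateful chat: keeps history and turns it into a prompt before handing it
/// to the engine. In the real package, ChatTemplateEngine performs the
/// model-specific rendering and context trimming instead of this naive join.
class ChatSession {
  ChatSession(this._engine, {this.systemPrompt});
  final LlamaEngine _engine;
  final String? systemPrompt;
  final List<String> _history = [];

  Stream<String> send(String userMessage) {
    _history.add(userMessage);
    final prompt =
        [if (systemPrompt != null) systemPrompt!, ..._history].join('\n');
    return _engine.generate(prompt);
  }
}
```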
Implementation links:
- LlamaEngine: https://github.com/leehack/llamadart/blob/main/lib/src/core/engine/engine.dart
- ChatSession: https://github.com/leehack/llamadart/blob/main/lib/src/core/engine/chat_session.dart
- ChatTemplateEngine: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/chat_template_engine.dart
- LlamaBackend interface: https://github.com/leehack/llamadart/blob/main/lib/src/backends/backend.dart
- Native backend: https://github.com/leehack/llamadart/blob/main/lib/src/backends/llama_cpp/llama_cpp_backend.dart
- Web backend: https://github.com/leehack/llamadart/blob/main/lib/src/backends/webgpu/webgpu_backend.dart
Build hook deep dive: how native binaries are delivered
This is the part I cared about most.
llamadart uses Dart build hooks (official docs: https://dart.dev/tools/hooks). Build hooks are executed automatically by the Dart SDK during run, build, and test to compile or download native assets.
For FFI context, Dart also documents this under C interop: https://dart.dev/interop/c-interop#build-hooks
Hook entry point
The hook entry point is the package's build.dart: it reads target metadata from the hook input (BuildInput) and reports the resolved assets through a BuildOutputBuilder as CodeAsset entries.
Related official APIs:
- Hooks package: https://pub.dev/packages/hooks
- BuildInput and BuildOutputBuilder: https://pub.dev/documentation/hooks/latest/hooks/
- Code assets package: https://pub.dev/packages/code_assets
- CodeAsset API: https://pub.dev/documentation/code_assets/latest/code_assets/CodeAsset-class.html
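To show the shape of such a hook, here is a stripped-down sketch built on package:hooks and package:code_assets. It skips the local/cache/download resolution described in the next section, the asset name and the helper function are placeholders, and the exact accessor names on BuildInput/BuildOutputBuilder may differ between package versions, so treat it as an approximation rather than llamadart's actual build.dart.

```dart
// hook/build.dart -- simplified sketch, not llamadart's actual hook.
import 'package:code_assets/code_assets.dart';
import 'package:hooks/hooks.dart';

void main(List<String> args) async {
  await build(args, (BuildInput input, BuildOutputBuilder output) async {
    // Target OS and architecture come from the hook input. (Accessor paths
    // are an assumption here; check the hooks/code_assets version you use.)
    final os = input.config.code.targetOS;
    final arch = input.config.code.targetArchitecture;

    // Resolve a prebuilt dynamic library for (os, arch). In llamadart this is
    // the local -> cache -> GitHub Releases pipeline described below.
    final Uri libraryFile = await resolvePrebuiltLibrary(os, arch, input);

    // Report the library as a bundled code asset so Dart/Flutter ships it
    // with the app and can load it at runtime.
    output.assets.code.add(
      CodeAsset(
        package: input.packageName,
        name: 'llamadart_bindings', // illustrative asset name
        linkMode: DynamicLoadingBundled(),
        file: libraryFile,
      ),
    );
  });
}

// Placeholder for the resolution pipeline; purely illustrative.
Future<Uri> resolvePrebuiltLibrary(
    OS os, Architecture arch, BuildInput input) async {
  throw UnimplementedError('local -> cache -> GitHub Releases download');
}
```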
What happens at build/run time
The hook runs a deterministic local-first pipeline:
- Resolve target platform and architecture from hook input.
- Map (os, arch) to a concrete release asset filename (for example, libllamadart-linux-x64.so).
- Try local packaged binary (third_party/bin/...) first.
- If missing, try cache (.dart_tool/llamadart/binaries/...).
- If cache miss, download from GitHub Releases.
- Copy to hook output directory with standardized runtime filename.
- Report the asset as DynamicLoadingBundled() so Dart can bundle and load it.
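In code, that local-first resolution looks roughly like the sketch below. The directory layout follows the list above; the function names are hypothetical placeholders, not llamadart's, and resolution relative to the package root is omitted.

```dart
import 'dart:io';

// Rough sketch of the local-first pipeline; the paths come from the list
// above, the function names are placeholders.
Future<File> resolveBinary(String assetFileName) async {
  // 1. Binary packaged with the source (third_party/bin/...).
  final packaged = File('third_party/bin/$assetFileName');
  if (await packaged.exists()) return packaged;

  // 2. Cache from a previous build (.dart_tool/llamadart/binaries/...).
  final cached = File('.dart_tool/llamadart/binaries/$assetFileName');
  if (await cached.exists()) return cached;

  // 3. Cache miss: download from the pinned GitHub release (see the next
  //    section for the URL), then keep it in the cache for later builds.
  await downloadReleaseAsset(assetFileName, into: cached);
  return cached;
}

// Placeholder; a concrete download sketch follows in the next section.
Future<void> downloadReleaseAsset(String name, {required File into}) async =>
    throw UnimplementedError();
```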
On Apple targets, the hook also handles binary thinning when needed (lipo) so the reported artifact matches the active architecture.
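The thinning step itself boils down to invoking lipo on a universal binary. A minimal sketch, assuming standard lipo flags and an already-resolved universal dylib (the surrounding logic in the real hook differs):

```dart
import 'dart:io';

// Illustrative: extract a single architecture (e.g. 'arm64' or 'x86_64')
// from a universal dylib with `lipo -thin`, so the reported asset matches
// the active target architecture.
Future<void> thinMacBinary(String input, String arch, String output) async {
  final result = await Process.run(
    'lipo',
    [input, '-thin', arch, '-output', output],
  );
  if (result.exitCode != 0) {
    throw ProcessException(
        'lipo', [input], result.stderr as String, result.exitCode);
  }
}
```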
How GitHub is used
llamadart uses GitHub Releases as the binary distribution layer.
In the hook, the release URL is pinned by a llama.cpp tag constant:
```dart
const _llamaCppTag = 'b8011';
const _baseUrl =
    'https://github.com/leehack/llamadart/releases/download/$_llamaCppTag';
```
That gives stable, tag-based binary resolution instead of "latest" behavior.
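Composing the actual download URL from those constants is then straightforward. The fetch sketch below uses dart:io's HttpClient and is an illustrative sketch, not the package's actual downloader.

```dart
import 'dart:io';

const _llamaCppTag = 'b8011';
const _baseUrl =
    'https://github.com/leehack/llamadart/releases/download/$_llamaCppTag';

// Illustrative downloader: fetch a pinned release asset (for example,
// libllamadart-linux-x64.so) from _baseUrl into the local cache file.
Future<void> downloadReleaseAsset(String name, {required File into}) async {
  final uri = Uri.parse('$_baseUrl/$name');
  final client = HttpClient();
  try {
    final request = await client.getUrl(uri);
    // GitHub release downloads redirect to a CDN; follow the redirects.
    request.followRedirects = true;
    final response = await request.close();
    if (response.statusCode != HttpStatus.ok) {
      throw HttpException('Download failed (${response.statusCode})', uri: uri);
    }
    await into.parent.create(recursive: true);
    await response.pipe(into.openWrite());
  } finally {
    client.close();
  }
}
```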
The release assets are produced by a workflow in the repository that builds artifacts across target OS/arch matrices, uploads the release assets, regenerates bindings when needed, and updates the pinned hook tag.
Runtime FFI resolution
At runtime, FFI symbols are linked to the hook-provided asset via annotations like:
- @Native: https://api.dart.dev/dart-ffi/Native-class.html
- @DefaultAsset: https://api.dart.dev/dart-ffi/DefaultAsset-class.html
In llamadart, the generated bindings use the @DefaultAsset pattern so that FFI symbols resolve against the hook-provided asset.
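A minimal sketch of that pattern follows; the asset id and the bound symbol are placeholders, not copied from llamadart's generated bindings.

```dart
// bindings.dart -- sketch only; the asset id and symbol are placeholders.
@DefaultAsset('package:llamadart/llamadart_bindings')
library;

import 'dart:ffi';

// Binds the C symbol against the code asset the build hook reported.
@Native<Void Function()>()
external void llama_backend_init();
```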
That avoids per-platform dynamic library path logic in app code.
Why this matters in practice
This hook + release model gives a practical DX benefit:
- Package users avoid manual local llama.cpp build steps in the common case.
- First build can fetch the correct binary for the target.
- Subsequent builds use cache/local artifacts.
- App code stays focused on model lifecycle and inference logic, not native packaging scripts.
A quick note on dinja
While implementing chat-template compatibility, I ran into another gap: GGUF models expose Jinja templates, but template behavior differs significantly across model families.
I could not find a Dart Jinja package tuned for this multi-template LLM use case, so I built dinja (source: https://github.com/leehack/dinja).
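To illustrate the kind of divergence involved, here are two simplified, generic template fragments (ChatML-style vs. Llama-2-chat-style) embedded as Dart raw strings; they are representative of the families, not exact templates from any specific GGUF file.

```dart
// ChatML-style: role markers with <|im_start|>/<|im_end|>.
const chatMlStyle = r'''
{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}''';

// Llama-2-chat-style: user turns wrapped in [INST] ... [/INST] markers.
const llama2Style = r'''
{% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% else %}{{ message['content'] }}{% endif %}{% endfor %}''';
```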
In llamadart:
- Template handlers render prompts with dinja.
- JinjaAnalyzer does AST-level capability checks (system role, tool-call shape, typed content, thinking tags).
- ChatTemplateEngine routes handlers and supports custom overrides/registration.
Relevant links:
- Handlers directory: https://github.com/leehack/llamadart/tree/main/lib/src/core/template/handlers
- Jinja analyzer: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/jinja/jinja_analyzer.dart
- Template engine: https://github.com/leehack/llamadart/blob/main/lib/src/core/template/chat_template_engine.dart
Minimal usage example
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('path/to/model.gguf');

    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful writing assistant.',
    );

    await for (final chunk in session.create([
      LlamaTextContent('Rewrite this paragraph in a concise style.'),
    ])) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
```
Closing
I built llamadart because my app needed both online model quality and reliable offline execution, without turning native build management into the main project.
If you are building AI features in Dart/Flutter and need local inference, I would love feedback on the GitHub repo: https://github.com/leehack/llamadart