Zero external dependencies. Zero backend. Ten real datasets. Two live statistical terminals per page. 128 KB of WebAssembly. This is the story of building it from the ground up.
Most ML tutorials end at the Jupyter notebook. Train a model, print an accuracy score, call it done. What happens after — packaging, serving, making it actually usable by someone who isn't you — is left as an exercise.
This article is about what happens after. Specifically: what happens when you take the constraint "the model must run entirely in the browser with no server" seriously, follow it all the way down to its logical conclusion, and build every layer of the stack yourself in Rust.
The result is a live web demo with ten real ML datasets — Iris, Breast Cancer, Titanic, California Housing, Heart Disease, and five more — each with a dynamically-built slider interface, live probability bars, and two statistical analysis terminals that update in real time as you drag. The entire thing is one 128 KB WebAssembly binary plus small per-dataset model files. No server. No Python. No cloud.
Here's the GitHub repo: thomascherickal/Ferrum
Here's the live demo: Live Demo
The constraint that drove everything
The wasm32-unknown-unknown WebAssembly target has no operating system, no file system, and no libc. It is one of the most minimal compilation targets Rust supports. Crates like rand, ndarray, or reqwest, and virtually any library that makes OS calls, either fail to build for it or need significant plumbing (feature flags, JavaScript shims) before they work.
That constraint is the source of everything interesting in this project. Once you accept "no external dependencies, standard library only," you have to build the tensor type yourself, the matrix operations yourself, the activation functions yourself, the training loop yourself, the binary model format yourself. That sounds painful. It is actually clarifying.
The payoff is threefold:
The WASM binary is genuinely small. 128 KB contains the entire ML engine: tensor math, normalisation, inference, and the statistical terminal logic. That's less than a medium-resolution JPEG.
Every line of code that runs in the user's browser is in your repository. No transitive dependencies, no supply-chain risk, nothing opaque.
Inference is private. User inputs never leave the browser tab. There is no server to log requests, no cloud API to rate-limit you, no GDPR surface area. The model runs locally.
The architecture: twelve modules, strictly layered
The ML engine lives in ferrum_core, a library crate with twelve modules arranged in a strict dependency stack. Each module imports only from those above it in this list. There are no cycles, no forward references.
```text
error       ← InferError enum, Result<T> alias
tensor      ← Tensor: flat Vec<f32> + shape, row-major storage
ops         ← matmul, bias-add, transpose, argmax, softmax
activation  ← ReLU, Sigmoid, Tanh, Softmax, Identity (serialisable enum)
layer       ← Layer trait, Linear (y = xW+b), ActivationLayer
model       ← Sequential: Vec<Box<dyn Layer>>, forward()
rng         ← seeded xorshift64* PRNG, Box-Muller normal samples
loss        ← softmax cross-entropy + MSE (both with analytic gradients)
optim       ← SGD with momentum, stateless over parameters
csv         ← CSV parser, Normalizer, ModelMetadata, TaskType detection
train       ← DenseT, ReluT, Net (trainable MLP), backpropagation
loader      ← FINF v3 binary format (weights + normalizer + metadata JSON)
```
Read these files top to bottom and the entire engine unfolds with no surprises. The stack tells you the dependency direction for free.
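To make the rng entry in the stack concrete, here is a minimal sketch of a seeded xorshift64* generator with Box-Muller normal samples. The shift constants and multiplier are the standard xorshift64* ones; the struct name, the method names, and the zero-seed fallback are my assumptions for illustration, not necessarily what ferrum_core uses.

```rust
/// Hypothetical seeded generator; the name `XorShift64Star` is illustrative.
pub struct XorShift64Star {
    state: u64,
}

impl XorShift64Star {
    /// The xorshift state must never be zero, so a zero seed is replaced
    /// with an arbitrary non-zero constant (an assumption of this sketch).
    pub fn new(seed: u64) -> Self {
        Self { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    /// One xorshift64* step: three xor-shifts, then a multiplicative scramble.
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1), built from the top 24 bits
    /// (an f32 has a 24-bit significand, so every value is exact).
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }

    /// Standard normal sample via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f32 {
        // Map u1 into (0, 1] so that ln(u1) stays finite.
        let u1 = 1.0 - self.next_f32();
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}
```

Two generators built from the same seed produce the same stream, which is what makes weight initialisation and training runs reproducible without pulling in rand.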
The tensor: one struct, everything else follows
```rust
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}
```
That is the whole data model. A 3×4 matrix is twelve contiguous floats in a Vec. There is no broadcasting, no views, no strides, no CUDA. Every operation returns a new Tensor. This costs allocations and buys clarity — the data flow through the network is always explicit.
The key primitive is map:
```rust
// Method on `impl Tensor`: apply `f` to every element, keeping the shape.
pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
    Tensor {
        shape: self.shape.clone(),
        data: self.data.iter().copied().map(f).collect(),
    }
}
```
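This single method carries every element-wise activation: ReLU, Sigmoid, and Tanh are each one closure passed to map. The activation module mostly exists to give those closures a serialisable name. Softmax is the exception, because it normalises across a whole row rather than element by element.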
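As a rough sketch of what "one closure per activation" looks like, the block below repeats the Tensor struct and map from above and adds three free functions. The function names relu, sigmoid, and tanh are my own; the real crate may expose them differently, for example through its activation enum.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

impl Tensor {
    /// Apply `f` to every element and return a new tensor of the same shape.
    pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
        Tensor {
            shape: self.shape.clone(),
            data: self.data.iter().copied().map(f).collect(),
        }
    }
}

/// ReLU: clamp negatives to zero.
pub fn relu(t: &Tensor) -> Tensor {
    t.map(|x| x.max(0.0))
}

/// Logistic sigmoid: 1 / (1 + e^-x).
pub fn sigmoid(t: &Tensor) -> Tensor {
    t.map(|x| 1.0 / (1.0 + (-x).exp()))
}

/// Hyperbolic tangent, straight from the standard library.
pub fn tanh(t: &Tensor) -> Tensor {
    t.map(f32::tanh)
}
```

Each call allocates a fresh Tensor, which matches the "every operation returns a new Tensor" rule described earlier.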
The matmul: one loop swap, real cache benefit
The performance-critical operation is matrix multiply. The textbook i-j-k order walks down a column of B in its innermost loop, which in a row-major layout means jumping n floats per step: cache-unfriendly. The i-k-j order walks both B and the output buffer contiguously in the innermost loop:
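```rust
// `out` is a zero-initialised buffer of m * n floats; `a` is m x ka, `b` is ka x n.
for i in 0..m {
    let a_row = i * ka;
    let o_row = i * n;
    for k in 0..ka {
        let a_ik = a.data[a_row + k];
        let b_row = k * n;
        // Innermost loop: contiguous in both `b` and `out`.
        for j in 0..n {
            out[o_row + j] += a_ik * b.data[b_row + j];
        }
    }
}
```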
Same FLOP count. Better cache behaviour. This is still naive single-threaded f32 — no SIMD, no BLAS. But "naive, correct loop order" is the right baseline when readability is the goal.
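To see the two loop orders side by side, here is a self-contained sketch that works on plain slices instead of the Tensor type. The function names matmul_ijk and matmul_ikj and the slice-based signatures are mine, chosen so the block compiles on its own.

```rust
/// Textbook i-j-k order: C = A (m x k) times B (k x n), all row-major.
pub fn matmul_ijk(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m * k values");
    assert_eq!(b.len(), k * n, "B must hold k * n values");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // b[p * n + j] jumps n floats per step: a strided column walk.
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}

/// i-k-j order: the inner loop is contiguous in both B and the output row.
pub fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m * k values");
    assert_eq!(b.len(), k * n, "B must hold k * n values");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            let out_row = &mut out[i * n..(i + 1) * n];
            for (o, &bv) in out_row.iter_mut().zip(b_row) {
                *o += a_ip * bv;
            }
        }
    }
    out
}
```

Both functions accumulate each output element in the same order over the shared dimension, so they should agree exactly; only the memory access pattern differs. How much the swap helps in practice depends on matrix sizes, and for the tiny layers in this project the gain is likely modest.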
Two loss functions, two gradient derivations
The engine handles both classification and regression, which requires two loss functions — each of which I derived by hand and verified with finite differences.
Softmax cross-entropy (classification)
If p = softmax(z) and the true class is t, the gradient of the loss with respect to the logits z is simply:
dL/dz = (p - onehot(t)) / batch_size
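No softmax Jacobian to materialise. No separate backward pass through the softmax. This is what you get when you fuse the softmax and the cross-entropy into one operation: the chain rule still applies, but the ugliness of the softmax derivative cancels perfectly against the cross-entropy derivative, leaving only p minus the one-hot target. Fusing also avoids taking the log of a separately computed, possibly tiny probability, provided the softmax itself subtracts the row maximum before exponentiating.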
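The fused gradient can be sketched in a few lines. This is a minimal version under my own assumptions: logits arrive as a flat row-major batch, targets are class indices, and the probability is clamped at 1e-12 before the log. The function names and the clamp value are illustrative rather than taken from the repository.

```rust
/// Numerically stable softmax over one row of logits.
pub fn softmax(z: &[f32]) -> Vec<f32> {
    // Subtracting the row maximum keeps exp() from overflowing.
    let max = z.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = z.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Mean cross-entropy loss and its gradient with respect to the logits.
/// `logits` is batch x classes, row-major; `targets` holds one class index per row.
pub fn softmax_cross_entropy(
    logits: &[f32],
    targets: &[usize],
    classes: usize,
) -> (f32, Vec<f32>) {
    let batch = targets.len();
    assert_eq!(logits.len(), batch * classes);
    let mut loss = 0.0f32;
    let mut grad = vec![0.0f32; logits.len()];
    for (row, &t) in targets.iter().enumerate() {
        let p = softmax(&logits[row * classes..(row + 1) * classes]);
        // Clamp before ln() so a vanishing probability cannot produce -inf.
        loss -= p[t].max(1e-12).ln();
        for c in 0..classes {
            let onehot = if c == t { 1.0 } else { 0.0 };
            // dL/dz = (p - onehot(t)) / batch_size
            grad[row * classes + c] = (p[c] - onehot) / batch as f32;
        }
    }
    (loss / batch as f32, grad)
}
```

A quick sanity property: each row of the gradient sums to zero, because both p and the one-hot vector sum to one.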
MSE (regression)
For regression, the gradient is even simpler:
grad[i] = 2.0 * (pred[i] - target[i]) / batch_size;
Both gradients are verified by the same test: perturb each parameter by ε = 0.001, measure (L(w+ε) - L(w-ε)) / 2ε, confirm it matches the analytic gradient to within 1e-2. If the calculus is wrong, this test catches it.
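Here is roughly what such a check looks like for the MSE loss. Note the simplification: this sketch perturbs the predictions directly rather than the network weights, so it verifies only the loss gradient, not a full backward pass. All three function names are mine.

```rust
/// Mean squared error over a batch of scalar predictions.
pub fn mse(pred: &[f32], target: &[f32]) -> f32 {
    let n = pred.len() as f32;
    pred.iter().zip(target).map(|(p, t)| (p - t) * (p - t)).sum::<f32>() / n
}

/// Analytic gradient of `mse` with respect to each prediction.
pub fn mse_grad(pred: &[f32], target: &[f32]) -> Vec<f32> {
    let n = pred.len() as f32;
    pred.iter().zip(target).map(|(p, t)| 2.0 * (p - t) / n).collect()
}

/// Central-difference estimate of the same gradient:
/// (L(x + eps) - L(x - eps)) / (2 * eps), one coordinate at a time.
pub fn numeric_grad(pred: &[f32], target: &[f32], eps: f32) -> Vec<f32> {
    (0..pred.len())
        .map(|i| {
            let mut plus = pred.to_vec();
            let mut minus = pred.to_vec();
            plus[i] += eps;
            minus[i] -= eps;
            (mse(&plus, target) - mse(&minus, target)) / (2.0 * eps)
        })
        .collect()
}
```

In f32 the step size is a balancing act: too large and the estimate picks up curvature, too small and rounding noise dominates. That is one reason a tolerance as loose as 1e-2 is reasonable here.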
The FINF v3 format: three things in one file
Real production engines use GGUF, SafeTensors, or ONNX. This project defines its own minimal binary format — FINF (Ferrum Inference) — and serialises it by hand using nothing but std::fs::write.
FINF v3 embeds three things in a single file:
```text
4 bytes   b"FINF"
u32       version = 3
u32       normalizer_len
[bytes]   "mean0,std0;mean1,std1;…"   per-column z-score statistics
u32       metadata_len
[bytes]   { JSON }                    ModelMetadata (see below)
u32       num_layers
[layers]  tag byte + layer parameters
```
The two embedded payloads are the engineering choices worth explaining.
The normalizer is baked into the model file because a model that receives un-normalised inputs fails silently. The most common deployment bug in tabular ML is forgetting to apply the same preprocessing statistics at inference that you used during training. Embedding them in the model file makes this mistake structurally impossible — there is no separate statistics file to forget.
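A sketch of how the front of such a file could be written and parsed with nothing but the standard library. I am assuming little-endian u32 fields, which the layout above does not state, and I stop before the layer records, whose tag scheme is not shown. The function names are hypothetical.

```rust
/// Append a u32 length prefix followed by the bytes themselves.
fn put_blob(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// Build the header: magic, version, normalizer string, metadata JSON.
/// Layer records would follow; they are omitted from this sketch.
pub fn write_header(normalizer: &str, metadata_json: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"FINF");
    out.extend_from_slice(&3u32.to_le_bytes()); // assumed little-endian
    put_blob(&mut out, normalizer.as_bytes());
    put_blob(&mut out, metadata_json.as_bytes());
    out
}

/// Read one little-endian u32 and advance the cursor.
fn take_u32(buf: &[u8], pos: &mut usize) -> Result<u32, String> {
    let end = (*pos).checked_add(4).ok_or("offset overflow")?;
    let raw = buf.get(*pos..end).ok_or("truncated u32")?;
    *pos = end;
    Ok(u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]))
}

/// Read a length-prefixed UTF-8 string and advance the cursor.
fn take_string(buf: &[u8], pos: &mut usize) -> Result<String, String> {
    let len = take_u32(buf, pos)? as usize;
    let end = (*pos).checked_add(len).ok_or("offset overflow")?;
    let raw = buf.get(*pos..end).ok_or("truncated payload")?;
    *pos = end;
    String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())
}

/// Parse the header back into (normalizer, metadata_json).
pub fn read_header(buf: &[u8]) -> Result<(String, String), String> {
    if buf.get(..4) != Some(&b"FINF"[..]) {
        return Err("bad magic".into());
    }
    let mut pos = 4;
    let version = take_u32(buf, &mut pos)?;
    if version != 3 {
        return Err(format!("unsupported version {version}"));
    }
    let normalizer = take_string(buf, &mut pos)?;
    let metadata = take_string(buf, &mut pos)?;
    Ok((normalizer, metadata))
}
```

Every read goes through a bounds-checked slice lookup, so a truncated or corrupt file comes back as an Err instead of a panic inside the browser.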
The metadata JSON is baked in because the browser needs it to build the UI dynamically. ModelMetadata carries:
```rust
pub struct ModelMetadata {
    pub dataset_name: String,
    pub task: TaskType,                 // Classification or Regression
    pub feature_names: Vec<String>,     // read from the CSV header
    pub feature_ranges: Vec<[f32; 2]>,  // [min, max] per feature in raw data
    pub class_names: Vec<String>,       // label strings in index order
    pub target_name: String,
    pub target_range: [f32; 2],
    pub input_dim: usize,
    pub output_dim: usize,
}
```
The browser extracts this metadata once after loading the model, then uses it to build slider labels, set slider min/max ranges, name probability bars, and power both statistical terminals — without any per-dataset JavaScript.
The generic WASM bindings
The tabular_wasm crate exposes a single TabularModel struct to JavaScript via wasm-bindgen:
```rust
#[wasm_bindgen]
pub struct TabularModel {
    model: Sequential,
    norm: Normalizer,
    meta_json: String,
    task: TaskType,
}

#[wasm_bindgen]
impl TabularModel {
    pub fn new(bytes: &[u8]) -> Result<TabularModel, JsValue> { ... }
    pub fn metadata(&self) -> String { ... }      // ModelMetadata as JSON
    pub fn norm_encoded(&self) -> String { ... }  // "mean0,std0;…" for JS stats
    pub fn predict(&self, values: &[f32]) -> Result<String, JsValue> { ... }
}
```
predict returns JSON whose shape depends on the task type:
```js
// Classification:
{ "type": "classification", "class_index": 0,
  "confidence": 0.981, "probabilities": [0.981, 0.013, 0.006] }

// Regression:
{ "type": "regression", "value": 247300.0, "value_norm": -0.331 }
```
The same WASM binary handles all ten datasets. The JavaScript reads metadata() to discover what kind of dataset it is and builds the appropriate UI — sliders, probability bars, statistical terminals — from scratch.
norm_encoded() is a deliberate design decision: the normaliser statistics are needed in JavaScript to compute per-feature z-scores for the statistical terminals. Rather than re-parsing the metadata JSON or making a second fetch, this method returns the compact mean,std;mean,std;… string directly so the JS can reconstruct z-scores with one split.
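The demo does this parsing in JavaScript; the Rust sketch below mirrors the same logic so the string format is unambiguous. The function names, the error messages, and the choice to map a zero standard deviation to a z-score of 0.0 are my assumptions.

```rust
/// Parse one "mean,std" pair.
fn parse_pair(part: &str) -> Result<(f32, f32), String> {
    let (mean, std) = part
        .split_once(',')
        .ok_or_else(|| format!("missing comma in {part:?}"))?;
    let mean: f32 = mean.trim().parse().map_err(|_| format!("bad mean in {part:?}"))?;
    let std: f32 = std.trim().parse().map_err(|_| format!("bad std in {part:?}"))?;
    Ok((mean, std))
}

/// Parse "mean0,std0;mean1,std1;..." into (mean, std) pairs.
/// Empty segments (for example from a trailing ';') are skipped.
pub fn parse_norm(encoded: &str) -> Result<Vec<(f32, f32)>, String> {
    encoded
        .split(';')
        .filter(|part| !part.is_empty())
        .map(parse_pair)
        .collect()
}

/// Z-score for each raw feature value. A zero std maps to 0.0
/// instead of dividing by zero (an assumption of this sketch).
pub fn z_scores(values: &[f32], stats: &[(f32, f32)]) -> Vec<f32> {
    values
        .iter()
        .zip(stats)
        .map(|(&v, &(mean, std))| if std == 0.0 { 0.0 } else { (v - mean) / std })
        .collect()
}
```

For example, with statistics "5.0,2.0;10.0,4.0" the raw inputs 7.0 and 6.0 sit one standard deviation above and one below their respective means.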
The automatic task detection
The train_cli binary accepts any CSV and figures out classification versus regression automatically:
```rust
// Requires: use std::collections::HashSet;
let distinct_targets: HashSet<String> = raw_rows.iter().map(|(_, t)| t.clone()).collect();
let all_numeric = distinct_targets.iter().all(|t| t.parse::<f64>().is_ok());
// Small datasets get a proportional threshold instead of the fixed 15.
let reg_threshold = if raw_rows.len() > 50 { 15 } else { raw_rows.len() / 3 };
let task = if all_numeric && distinct_targets.len() > reg_threshold {
    TaskType::Regression
} else {
    TaskType::Classification
};
```
If the target column is numeric and has more than 15 distinct values, it's regression. For datasets of 50 rows or fewer, the threshold is a third of the row count instead. Otherwise it's classification. This heuristic works correctly on all ten current datasets, including edge cases like wine quality (11 distinct integer values → classification) versus housing prices (thousands of distinct floats → regression).
The trainer also reads feature names from the CSV header, computes per-feature [min, max] ranges, and packages everything into the FINF file alongside the weights. There is no configuration file, no schema definition, no argument to specify field names.
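To make the heuristic easy to poke at in isolation, here is the same rule wrapped in a standalone function. The function name detect_task and the slice-of-strings signature are mine, and the local TaskType enum is a stand-in for the one in ferrum_core.

```rust
use std::collections::HashSet;

/// Local stand-in for the crate's TaskType.
#[derive(Debug, PartialEq)]
pub enum TaskType {
    Classification,
    Regression,
}

/// Classify a target column as regression or classification.
/// `targets` holds the raw label strings, one per CSV row.
pub fn detect_task(targets: &[&str]) -> TaskType {
    // Distinctness is by string, so "1" and "1.0" count as two values.
    let distinct: HashSet<&str> = targets.iter().copied().collect();
    let all_numeric = distinct.iter().all(|t| t.parse::<f64>().is_ok());
    let threshold = if targets.len() > 50 { 15 } else { targets.len() / 3 };
    if all_numeric && distinct.len() > threshold {
        TaskType::Regression
    } else {
        TaskType::Classification
    }
}
```

A column of 100 rows cycling through eleven integer scores stays classification, while 100 distinct floats tip over into regression. A column of string labels is always classification, no matter how many distinct values it has.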
The live statistical terminals
Every dataset page has two terminals that update on every slider drag. They are implemented entirely in JavaScript — no additional Rust code — using the normalizer statistics that norm_encoded() exposes.
Terminal 1 — Model Statistics
For each feature, the terminal computes and displays:
- Raw value: the current slider position
- Z-score: (value − μ) / σ, using the training-set mean and standard deviation from the embedded normalizer. Colour-coded: green for |z| < 1, yellow for 1–2, red for > 2
- Range bar: the value's position within [dataset min, dataset max]
- Centred z-bar: direction and distance from the training mean, as a horizontal bar
Below the feature table, a static architecture card shows layer dimensions, task type, normaliser type, and file format.
Terminal 2 — Quantitative Report
For classification, the report shows:
- A confidence badge: Certain / Confident / Uncertain / Toss-up, derived from the Shannon entropy of the output distribution
- The full probability table: P, a bar, log P, and the odds ratio for every class
- Shannon entropy H(p) on a gauge, from 0 nats (model certain) to ln(C) nats (maximally confused across C classes)
- The top-2 margin: P(winner) − P(runner-up)
Shannon entropy is the right measure of model certainty here because it captures the full distribution, not just the top probability. A model that outputs [0.60, 0.39, 0.01] has the same top probability as one that outputs [0.60, 0.20, 0.20], but the second is meaningfully more uncertain — entropy catches that, argmax-confidence doesn't.
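The demo computes these quantities in JavaScript; the sketch below shows the same arithmetic in Rust. The function names are mine, and I deliberately leave out the badge thresholds, since the cut-offs between Certain, Confident, Uncertain, and Toss-up are not given here.

```rust
/// Shannon entropy in nats. Zero-probability terms contribute nothing.
pub fn entropy(p: &[f32]) -> f32 {
    p.iter().filter(|&&x| x > 0.0).map(|&x| -x * x.ln()).sum()
}

/// Entropy divided by ln(C), so 0.0 means certain and 1.0 means uniform.
pub fn normalized_entropy(p: &[f32]) -> f32 {
    if p.len() < 2 {
        0.0
    } else {
        entropy(p) / (p.len() as f32).ln()
    }
}

/// Top-2 margin: P(winner) - P(runner-up).
pub fn top2_margin(p: &[f32]) -> f32 {
    let mut sorted = p.to_vec();
    // Sort descending; total_cmp gives a total order over f32.
    sorted.sort_by(|a, b| b.total_cmp(a));
    match sorted.as_slice() {
        [a, b, ..] => a - b,
        [a] => *a,
        [] => 0.0,
    }
}
```

On the two distributions above, [0.60, 0.39, 0.01] comes out at roughly 0.72 nats and [0.60, 0.20, 0.20] at roughly 0.95 nats, against a three-class maximum of ln(3) ≈ 1.10. Same top probability, different entropy.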
For regression, the report shows:
- The predicted value on a visual scale spanning [training_min, training_max], with the training mean marked
- The prediction's z-score relative to the training target distribution
- How far the prediction is from the dataset mean, as a percentage
- A ±1σ reference interval from the training targets
- An approximate quartile (Q1/Q2/Q3/Q4)
The critical implementation detail: the normalizer for regression models stores one extra (mean, std) pair at the end — the target variable's statistics. This is what denormalise_target() uses to convert the normalised network output back to the original scale, and what the JavaScript uses to compute the target z-score without any additional server call.
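A small sketch of the "last pair is the target" convention. The source names denormalise_target() but does not show its signature, so the one below is an assumption, and split_target is a hypothetical helper of my own.

```rust
/// Split parsed (mean, std) pairs into feature pairs and the trailing target pair.
/// Returns None if there are no pairs at all.
pub fn split_target(stats: &[(f32, f32)]) -> Option<(&[(f32, f32)], (f32, f32))> {
    let (target, features) = stats.split_last()?;
    Some((features, *target))
}

/// Undo z-score normalisation on a network output:
/// raw = normalised * std + mean. The signature is assumed.
pub fn denormalise_target(value_norm: f32, target: (f32, f32)) -> f32 {
    let (mean, std) = target;
    value_norm * std + mean
}
```

With a target mean of 200.0 and standard deviation of 50.0, a normalised output of -0.5 maps back to 175.0 in the original units. The value_norm field in the regression JSON is already the prediction's z-score, so the JavaScript side only needs this inverse to recover the raw value.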
The ten datasets
Training one dataset takes 1–5 seconds. All ten train in about a minute, single-threaded.
| Dataset | Task | Rows | Features | Result |
|---|---|---|---|---|
| 🌸 Iris Species | 3-class | 150 | 4 | 98.7% acc |
| 🐧 Palmer Penguins | 3-class | 342 | 4 | 99.4% acc |
| 🌾 Wheat Seeds | 3-class | 210 | 7 | 99.5% acc |
| 🍷 Wine Quality | 3-class | 1,599 | 11 | 80.9% acc |
| 🩺 Pima Diabetes | binary | 768 | 8 | 93.0% acc |
| ❤️ Heart Disease | binary | 297 | 13 | 96.3% acc |
| 🔬 Breast Cancer | binary | 569 | 30 | 99.3% acc |
| 🚢 Titanic Survival | binary | 891 | 6 | 86.9% acc |
| 🚗 Auto MPG | regression | 392 | 6 | RMSE 1.95 mpg |
| 🏠 CA Housing Prices | regression | 20,433 | 8 | RMSE ~$52k |
Each model file is 1.5–10.5 KB and is self-contained: weights, normaliser statistics, and the full metadata JSON are all packed into one FINF v3 binary. The largest is Breast Cancer (30 features, 10.5 KB). The smallest is Iris (4 features, 1.5 KB).
The numbers
| Property | Value |
|---|---|
| WASM binary | 128 KB |
| JS glue | 9 KB |
| Shared JS + CSS | 38 KB |
| Largest model file | 10.5 KB (Breast Cancer, 30 features) |
| Smallest model file | 1.5 KB (Iris, 4 features) |
| Total page weight | ~230 KB for all ten datasets |
| External dependencies | 0 |
| Tests | 131 (0 failures) |
| Source lines (Rust) | ~3,700 |
| Source lines (JS) | ~920 |
The test suite: 131 tests, three layers
86 unit tests in ferrum_core — every module in isolation. Highlights:
- backprop_gradient_check: perturbs individual weights by ε, confirms the analytic gradient from backward() matches (L(w+ε) - L(w-ε)) / 2ε to within 1e-2. This is the proof that the calculus is correct.
- mse_gradient_finite_difference: same check for the regression loss
- metadata_json_roundtrip: ModelMetadata::to_json() then from_json() must produce identical structs; this is what the browser depends on
- normalizer_zero_mean and normalizer_produces_unit_variance: fit-and-transform must actually standardise the data
39 integration tests — the complete pipeline for all ten datasets. Key tests:
- all_ten_model_files_load_and_produce_finite_outputs: loads every trained model from disk, runs inference on a test input, asserts no NaN or Inf in output
- all_classification_models_output_valid_distributions: for every classification model, two distinct inputs must produce probabilities that sum to 1.0 within 1e-4
- both_regression_models_produce_plausible_values: housing predictions must be in [$10k, $10M]; MPG predictions must be in [5, 60] mpg
- batch_inference_matches_individual: running three inputs as a batch must produce identical results to running them one at a time
6 WASM glue tests — the Rust side of the bindings: load from bytes, infer, check metadata JSON fields, verify norm_encoded format, batch/individual agreement, corrupt-byte rejection.
The CI/CD pipeline
A single GitHub Actions workflow handles the complete lifecycle on every push to main:
steps:
- Check formatting (cargo fmt --check)
- Clippy (cargo clippy -- -D warnings)
- Run all 131 tests (cargo test --workspace)
- Train all 10 models (cargo run -p train_cli --release -- ...)
- Compile to WASM (cargo build --target wasm32-unknown-unknown --release)
- Generate JS bindings (wasm-bindgen ...)
- Verify all 10 model files have FINF magic bytes
- Deploy web/ to GitHub Pages
Total CI time: ~4 minutes on GitHub's free runners.
The "verify FINF magic bytes" step is a deliberate defensive check: if the trainer emits a corrupt file, or the wrong file gets copied to web/datasets/*/model.bin, the deploy fails loudly before any user sees a broken page.
Adding a new dataset in three steps
The generic architecture means adding an eleventh dataset requires no Rust changes and almost no JavaScript changes.
Step 1 — Prepare the CSV (numeric features, label in last column, header optional):
age,sex,cp,trestbps,chol,fbs,...,target
63,1,1,145,233,1,...,0
...
Step 2 — Train:
cargo run -p train_cli --release -- new_data.csv \
web/datasets/newds/model.bin "Dataset Name" 48 500
The trainer auto-detects classification vs regression, reads feature names from the header, computes ranges, and embeds everything in the model file.
Step 3 — Write one HTML file by copying any existing dataset page and changing the title, subtitle, preset button values, and source URL. The slider labels, slider ranges, probability bar class names, and both statistical terminals build themselves from the embedded metadata. No JavaScript to modify.
The engineering lessons, in order of surprise
The format is the API. FINF v3 encodes not just weights but the full contract between training and inference: the normalizer statistics and the UI metadata. The format is what makes the browser able to build the right UI for any dataset without a configuration file.
Shannon entropy beats argmax-confidence. Reporting max(probabilities) as "confidence" is misleading when a model outputs [0.55, 0.44, 0.01] — the model is not 55% confident, it is nearly maximally uncertain between two classes. Entropy captures the whole distribution. The terminal shows both: the top probability for human readability and the entropy for actual information content.
The norm_encoded() method is not a convenience — it is a privacy guarantee. The normaliser statistics live in the model file. JavaScript reads them from there. There is no separate fetch to a stats endpoint, no server that knows which inputs the user is testing. The statistical terminals are fully client-side.
The i-k-j matmul loop order is a free lunch. Swapping the inner two loops in a triple-loop matmul costs nothing and improves cache behaviour. Any project doing its own matrix multiply should do this.
Regression needs one extra normaliser slot. For regression, the target variable must be normalised before training (otherwise the loss is dominated by the raw scale of the target). The engine stores the target's mean and standard deviation as an extra pair at the end of the normaliser string. denormalise_target() uses the last pair. JavaScript does the same. The fact that this is implicit rather than explicit (a separate struct field) is a trade-off I'd reconsider if the project grew.
Finite differences are non-negotiable. Both gradients — softmax cross-entropy and MSE — are verified against numerical finite differences in the test suite. Writing backprop by hand without this check is guessing. The check costs ten lines of test code and has caught at least one sign error during development.
Deployment in three commands
# 1. Train all models
bash scripts/train_all.sh
# 2. Compile to WebAssembly
bash scripts/build_wasm.sh
# 3. Serve locally
python3 -m http.server 8080 --directory web
For GitHub Pages, push the repository, activate Pages under Settings → Pages → Source → GitHub Actions, and the included workflow handles the rest on every subsequent push.
The web/ directory is fully self-contained: copy it to any static host — Netlify drop, Cloudflare Pages, an S3 bucket, a USB stick — and the demo works. There is no backend to configure.
What's next
The architecture is designed so extensions are local. Some natural next steps:
- Embeddings: the first Linear layer already acts as an embedding table when inputs are one-hot — a proper embedding lookup layer would let the engine handle word-level inputs
- Parallel kernels: ops.rs is the single place where arithmetic lives; dropping in rayon for the matmul would parallelize inference without touching any other module
- Adam optimizer: optim.rs is the single place where parameter updates live; Adam adds two more buffers per parameter and one more expression
- More loss functions: focal loss for class imbalance, Huber loss for robust regression; all live in loss.rs, and none require changes above it
Built with Rust 1.95, wasm-bindgen 0.2.122, and ten public datasets from UCI ML Repository, Kaggle, and Palmer Station Antarctica.