Thomas Cherickal

Posted on May 29

I Built a Complete ML Engine from Scratch in Rust, Compiled It to WebAssembly, and Now Ten Datasets Run Live in the Browser

#neural #rust #webassembly #classification

Zero external dependencies. Zero backend. Ten real datasets. Two live statistical terminals per page. 128 KB of WebAssembly. This is the story of building it from the ground up.

Most ML tutorials end at the Jupyter notebook. Train a model, print an accuracy score, call it done. What happens after — packaging, serving, making it actually usable by someone who isn't you — is left as an exercise.

This article is about what happens after. Specifically: what happens when you take the constraint "the model must run entirely in the browser with no server" seriously, follow it all the way down to its logical conclusion, and build every layer of the stack yourself in Rust.

The result is a live web demo with ten real ML datasets — Iris, Breast Cancer, Titanic, California Housing, Heart Disease, and five more — each with a dynamically-built slider interface, live probability bars, and two statistical analysis terminals that update in real time as you drag. The entire thing is one 128 KB WebAssembly binary plus small per-dataset model files. No server. No Python. No cloud.

Here's the GitHub repo: thomascherickal/Ferrum

Here's the live demo: Live Demo

The constraint that drove everything

The wasm32-unknown-unknown WebAssembly target has no operating system, no file system, and no libc. It is about as minimal as Rust compilation targets get. Pulling in rand, ndarray, reqwest, or virtually any library that expects OS services can fail to build for it, or work only after significant plumbing.

That constraint is the source of everything interesting in this project. Once you accept "no external dependencies, standard library only," you have to build the tensor type yourself, the matrix operations yourself, the activation functions yourself, the training loop yourself, the binary model format yourself. That sounds painful. It is actually clarifying.

The payoff is threefold:

The WASM binary is genuinely small. 128 KB contains the entire ML engine: tensor math, normalisation, inference, and the statistical terminal logic. That's less than a medium-resolution JPEG.
Every line of the ML engine that runs in the user's browser is in your repository. The core crate has no transitive dependencies, which keeps supply-chain risk minimal and leaves nothing opaque.
Inference is private. User inputs never leave the browser tab. There is no server to log requests, no cloud API to rate-limit you, no GDPR surface area. The model runs locally.

The architecture: twelve modules, strictly layered

The ML engine lives in ferrum_core, a library crate with twelve modules arranged in a strict dependency stack. Each module imports only from those above it in this list. There are no cycles, no forward references.

error      ← InferError enum, Result<T> alias
tensor     ← Tensor: flat Vec<f32> + shape, row-major storage
ops        ← matmul, bias-add, transpose, argmax, softmax
activation ← ReLU, Sigmoid, Tanh, Softmax, Identity (serialisable enum)
layer      ← Layer trait, Linear (y = xW+b), ActivationLayer
model      ← Sequential: Vec<Box<dyn Layer>>, forward()
rng        ← seeded xorshift64* PRNG, Box-Muller normal samples
loss       ← softmax cross-entropy + MSE (both with analytic gradients)
optim      ← SGD with momentum, stateless over parameters
csv        ← CSV parser, Normalizer, ModelMetadata, TaskType detection
train      ← DenseT, ReluT, Net (trainable MLP), backpropagation
loader     ← FINF v3 binary format (weights + normalizer + metadata JSON)

Read these files top to bottom and the entire engine unfolds with no surprises. The stack tells you the dependency direction for free.
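To make one of those modules concrete, here is a minimal sketch of what a seeded xorshift64* generator with Box-Muller sampling can look like using only the standard library. The struct name, method names, and the zero-seed fallback constant are assumptions for illustration; the real rng module may differ in all of them.

```rust
/// Hypothetical stand-in for the engine's `rng` module.
pub struct XorShift64Star {
    state: u64,
}

impl XorShift64Star {
    /// The state must never be zero, so a zero seed is swapped for a fixed constant.
    pub fn new(seed: u64) -> Self {
        Self { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    /// One xorshift64* step: three xor-shifts, then a multiply that scrambles the output.
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1) built from the top 24 bits, which an f32 holds exactly.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }

    /// Standard normal sample via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f32 {
        // Flip the uniform sample into (0, 1] so that ln(u1) stays finite.
        let u1 = 1.0 - self.next_f32();
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}
```

Because the generator is seeded, two instances created with the same seed produce the same weight initialisation, which is what makes training runs reproducible.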

The tensor: one struct, everything else follows

pub struct Tensor {
    pub shape: Vec<usize>,
    pub data:  Vec<f32>,
}

That is the whole data model. A 3×4 matrix is twelve contiguous floats in a Vec. There is no broadcasting, no views, no strides, no CUDA. Every operation returns a new Tensor. This costs allocations and buys clarity — the data flow through the network is always explicit.

The key primitive is map:

pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
    Tensor {
        shape: self.shape.clone(),
        data: self.data.iter().copied().map(f).collect(),
    }
}

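This single method is the entire implementation of ReLU, Sigmoid, and Tanh: each element-wise activation is just a closure passed to map. Softmax is the exception, because it normalises across a whole row rather than one value at a time. Element-wise activations don't need their own kernels; they need a closure.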
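As a rough illustration of that claim, the sketch below repeats the Tensor struct and map from above and builds three element-wise activations on top of it. The free-function names relu, sigmoid, and tanh are assumptions; the engine exposes its activations through an enum instead.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

impl Tensor {
    /// Apply `f` to every element and return a new tensor with the same shape.
    pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
        Tensor {
            shape: self.shape.clone(),
            data: self.data.iter().copied().map(f).collect(),
        }
    }
}

/// ReLU: negative values are clamped to zero.
pub fn relu(x: &Tensor) -> Tensor {
    x.map(|v| v.max(0.0))
}

/// Logistic sigmoid: squashes every value into (0, 1).
pub fn sigmoid(x: &Tensor) -> Tensor {
    x.map(|v| 1.0 / (1.0 + (-v).exp()))
}

/// Hyperbolic tangent, delegated to the standard library.
pub fn tanh(x: &Tensor) -> Tensor {
    x.map(f32::tanh)
}
```

Each function allocates a fresh Tensor, matching the "every operation returns a new Tensor" rule described earlier.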

The matmul: one loop swap, real cache benefit

The performance-critical operation is matrix multiply. In the textbook i-j-k order, the innermost loop walks down a column of B, which in a row-major layout means jumping n floats between consecutive reads: cache-unfriendly. The i-k-j order walks both B and the output buffer contiguously in the innermost loop:

for i in 0..m {
    let a_row = i * ka;
    let o_row = i * n;
    for k in 0..ka {
        let a_ik = a.data[a_row + k];
        let b_row = k * n;
        for j in 0..n {
            out[o_row + j] += a_ik * b.data[b_row + j];
        }
    }
}

Same FLOP count. Better cache behaviour. This is still naive single-threaded f32 — no SIMD, no BLAS. But "naive, correct loop order" is the right baseline when readability is the goal.
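To see that the two loop orders really compute the same product, here is a self-contained sketch that puts them side by side on plain slices. The function names and the slice-based signatures are assumptions made for the example; the engine's own matmul works on Tensor values.

```rust
/// Row-major (m x k) times row-major (k x n), i-k-j order: the inner loop
/// reads one row of `b` and writes one row of `out`, both contiguously.
pub fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must hold m*k values");
    assert_eq!(b.len(), k * n, "b must hold k*n values");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for kk in 0..k {
            let a_ik = a[i * k + kk];
            let b_row = &b[kk * n..(kk + 1) * n];
            let o_row = &mut out[i * n..(i + 1) * n];
            for (o, &bv) in o_row.iter_mut().zip(b_row) {
                *o += a_ik * bv;
            }
        }
    }
    out
}

/// Textbook i-j-k order for comparison: the inner loop strides through `b`
/// by `n` floats per step.
pub fn matmul_ijk(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must hold m*k values");
    assert_eq!(b.len(), k * n, "b must hold k*n values");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for kk in 0..k {
                acc += a[i * k + kk] * b[kk * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}
```

For a 2×3 matrix [1..6] times a 3×2 matrix [7..12], both functions return [58, 64, 139, 154]. Only the memory access pattern differs.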

Two loss functions, two gradient derivations

The engine handles both classification and regression, which requires two loss functions — each of which I derived by hand and verified with finite differences.

Softmax cross-entropy (classification)

If p = softmax(z) and the true class is t, the gradient of the loss with respect to the logits z is simply:

dL/dz = (p - onehot(t)) / batch_size

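No softmax Jacobian. No explicit chain-rule composition. No division by a small probability. This is what you get when you fuse the softmax and the cross-entropy into one operation. The gradient is so clean because the ugliness of the softmax derivative cancels perfectly against the cross-entropy derivative when you do them together. The softmax itself still has to be computed carefully (typically by subtracting the row maximum before exponentiating), but the gradient expression adds no instability of its own.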
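A fused implementation along those lines might look like the sketch below. The function name, the flat row-major logits layout, and the choice to return the mean loss are assumptions for the example, not a copy of the loss module.

```rust
/// Fused softmax + cross-entropy for a batch of row-major logits.
/// Returns (mean loss, gradient with respect to the logits).
pub fn softmax_cross_entropy(
    logits: &[f32],
    targets: &[usize],
    classes: usize,
) -> (f32, Vec<f32>) {
    let batch = targets.len();
    assert_eq!(logits.len(), batch * classes, "logits must be batch x classes");
    let mut grad = vec![0.0f32; logits.len()];
    let mut loss = 0.0f32;
    for (row, &t) in targets.iter().enumerate() {
        assert!(t < classes, "target index out of range");
        let z = &logits[row * classes..(row + 1) * classes];
        let g = &mut grad[row * classes..(row + 1) * classes];
        // Subtract the row maximum before exponentiating so exp() cannot overflow.
        let max = z.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for (gi, &zi) in g.iter_mut().zip(z) {
            *gi = (zi - max).exp();
            sum += *gi;
        }
        // -ln p[t] written as ln(sum) - (z[t] - max), which avoids ln of a tiny p.
        loss += sum.ln() - (z[t] - max);
        // Gradient: (p - onehot(t)) / batch_size, exactly the formula above.
        for (j, gi) in g.iter_mut().enumerate() {
            let p = *gi / sum;
            let onehot = if j == t { 1.0 } else { 0.0 };
            *gi = (p - onehot) / batch as f32;
        }
    }
    (loss / batch as f32, grad)
}
```

With three equal logits and true class 1, the loss is ln(3) and the gradient is [1/3, -2/3, 1/3]. Each row of the gradient sums to zero, which is a cheap sanity check in its own right.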

MSE (regression)

For regression, the gradient is even simpler:

grad[i] = 2.0 * (pred[i] - target[i]) / batch_size;

Both gradients are verified by the same test: perturb each parameter by ε = 0.001, measure the central difference (L(w+ε) - L(w-ε)) / (2ε), and confirm it matches the analytic gradient to within 1e-2. If the calculus is wrong, this test catches it.
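The check described above fits in a few lines. The sketch below applies it to the MSE gradient, assuming one output per sample so that the number of predictions equals batch_size; the helper names mse, mse_grad, and numeric_grad are hypothetical.

```rust
/// Mean squared error over all predictions.
pub fn mse(pred: &[f32], target: &[f32]) -> f32 {
    let n = pred.len() as f32;
    pred.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum::<f32>() / n
}

/// Analytic gradient of `mse` with respect to each prediction.
pub fn mse_grad(pred: &[f32], target: &[f32]) -> Vec<f32> {
    let n = pred.len() as f32;
    pred.iter().zip(target).map(|(p, t)| 2.0 * (p - t) / n).collect()
}

/// Central-difference estimate of d(loss)/d(w[i]) for every i.
pub fn numeric_grad<F: Fn(&[f32]) -> f32>(loss: F, w: &[f32], eps: f32) -> Vec<f32> {
    let mut w = w.to_vec();
    let mut grad = Vec::with_capacity(w.len());
    for i in 0..w.len() {
        let original = w[i];
        w[i] = original + eps;
        let up = loss(&w);
        w[i] = original - eps;
        let down = loss(&w);
        w[i] = original; // restore before moving to the next parameter
        grad.push((up - down) / (2.0 * eps));
    }
    grad
}
```

Calling numeric_grad(|p| mse(p, &target), &pred, 1e-3) and comparing the result element by element against mse_grad(&pred, &target) reproduces the test: a sign error or a missing factor of 2 shows up as a mismatch far larger than the 1e-2 tolerance.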

The FINF v3 format: three things in one file

Real production engines use GGUF, SafeTensors, or ONNX. This project defines its own minimal binary format — FINF (Ferrum Inference) — and serialises it by hand using nothing but std::fs::write.

FINF v3 embeds three things in a single file:

4 bytes   b"FINF"
u32       version = 3
u32       normalizer_len
[bytes]   "mean0,std0;mean1,std1;…"   per-column z-score statistics
u32       metadata_len
[bytes]   { JSON }                    ModelMetadata (see below)
u32       num_layers
[layers]  tag byte + layer parameters
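Reading that layout back needs nothing beyond slices and the standard library. The sketch below parses everything up to the layer list. It assumes the u32 fields are little-endian, which the layout above does not state, and the FinfHeader struct and function names are invented for the example. It stops before the layers because the per-layer encoding is not shown here.

```rust
/// Hypothetical view of the fixed part of a FINF v3 file.
pub struct FinfHeader {
    pub version: u32,
    pub normalizer: String,
    pub metadata_json: String,
    pub num_layers: u32,
    /// Byte offset where the layer records begin.
    pub layers_offset: usize,
}

/// Borrow `len` bytes starting at `pos`, advancing `pos` on success.
fn take<'a>(bytes: &'a [u8], pos: &mut usize, len: usize) -> Result<&'a [u8], String> {
    let end = (*pos).checked_add(len).ok_or("length overflow")?;
    let chunk = bytes.get(*pos..end).ok_or("unexpected end of file")?;
    *pos = end;
    Ok(chunk)
}

/// Assumption: u32 fields are stored little-endian.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Result<u32, String> {
    let c = take(bytes, pos, 4)?;
    Ok(u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
}

fn read_string(bytes: &[u8], pos: &mut usize, len: usize) -> Result<String, String> {
    let chunk = take(bytes, pos, len)?;
    String::from_utf8(chunk.to_vec()).map_err(|e| e.to_string())
}

pub fn parse_finf_header(bytes: &[u8]) -> Result<FinfHeader, String> {
    if !bytes.starts_with(b"FINF") {
        return Err("missing FINF magic bytes".to_string());
    }
    let mut pos = 4;
    let version = read_u32(bytes, &mut pos)?;
    if version != 3 {
        return Err(format!("unsupported FINF version {version}"));
    }
    let norm_len = read_u32(bytes, &mut pos)? as usize;
    let normalizer = read_string(bytes, &mut pos, norm_len)?;
    let meta_len = read_u32(bytes, &mut pos)? as usize;
    let metadata_json = read_string(bytes, &mut pos, meta_len)?;
    let num_layers = read_u32(bytes, &mut pos)?;
    Ok(FinfHeader { version, normalizer, metadata_json, num_layers, layers_offset: pos })
}
```

Every read is bounds-checked through take, so a truncated or corrupt file produces an error instead of a panic.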

The two embedded payloads are the engineering choices worth explaining.

The normalizer is baked into the model file because a model that receives un-normalised inputs fails silently. The most common deployment bug in tabular ML is forgetting to apply the same preprocessing statistics at inference that you used during training. Embedding them in the model file makes this mistake structurally impossible — there is no separate statistics file to forget.
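As a sketch of how the embedded statistics can be consumed, the code below parses the "mean,std;mean,std;…" string and applies the z-score transform. The function names and the zero-variance fallback are assumptions; the engine's Normalizer type may handle those details differently.

```rust
/// Parse "mean0,std0;mean1,std1;…" into one (mean, std) pair per column.
pub fn parse_normalizer(encoded: &str) -> Result<Vec<(f32, f32)>, String> {
    encoded
        .split(';')
        .filter(|s| !s.is_empty()) // tolerate a trailing ';'
        .map(|pair| -> Result<(f32, f32), String> {
            let (m, s) = pair
                .split_once(',')
                .ok_or_else(|| format!("missing comma in {pair:?}"))?;
            let mean = m.trim().parse::<f32>().map_err(|e| e.to_string())?;
            let std = s.trim().parse::<f32>().map_err(|e| e.to_string())?;
            Ok((mean, std))
        })
        .collect()
}

/// Z-score each raw feature with its training-set statistics.
/// Assumption: a column with zero variance maps to 0.0 rather than dividing by zero.
pub fn normalise(raw: &[f32], stats: &[(f32, f32)]) -> Vec<f32> {
    raw.iter()
        .zip(stats)
        .map(|(&x, &(mean, std))| if std > 0.0 { (x - mean) / std } else { 0.0 })
        .collect()
}
```

Because zip stops at the shorter input, passing only the feature values works even when the string carries more pairs than there are features.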

The metadata JSON is baked in because the browser needs it to build the UI dynamically. ModelMetadata carries:

pub struct ModelMetadata {
    pub dataset_name:   String,
    pub task:           TaskType,       // Classification or Regression
    pub feature_names:  Vec<String>,    // read from the CSV header
    pub feature_ranges: Vec<[f32; 2]>,  // [min, max] per feature in raw data
    pub class_names:    Vec<String>,    // label strings in index order
    pub target_name:    String,
    pub target_range:   [f32; 2],
    pub input_dim:      usize,
    pub output_dim:     usize,
}

The browser extracts this metadata once after loading the model, then uses it to build slider labels, set slider min/max ranges, name probability bars, and power both statistical terminals — without any per-dataset JavaScript.

The generic WASM bindings

The tabular_wasm crate exposes a single TabularModel struct to JavaScript via wasm-bindgen:

#[wasm_bindgen]
pub struct TabularModel {
    model:     Sequential,
    norm:      Normalizer,
    meta_json: String,
    task:      TaskType,
}

#[wasm_bindgen]
impl TabularModel {
    pub fn new(bytes: &[u8]) -> Result<TabularModel, JsValue> { ... }
    pub fn metadata(&self)     -> String { ... }  // ModelMetadata as JSON
    pub fn norm_encoded(&self) -> String { ... }  // "mean0,std0;…" for JS stats
    pub fn predict(&self, values: &[f32]) -> Result<String, JsValue> { ... }
}

predict returns JSON whose shape depends on the task type:

// Classification:
{ "type": "classification", "class_index": 0,
  "confidence": 0.981, "probabilities": [0.981, 0.013, 0.006] }

// Regression:
{ "type": "regression", "value": 247300.0, "value_norm": -0.331 }

The same WASM binary handles all ten datasets. The JavaScript reads metadata() to discover what kind of dataset it is and builds the appropriate UI — sliders, probability bars, statistical terminals — from scratch.
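With no serde available, a payload of that shape can be assembled with format! alone. The sketch below is one possible way to do it, assuming finite inputs and fixed decimal precision; the function names are hypothetical and this is not the crate's actual predict body.

```rust
/// Build the classification payload from a probability vector.
/// Assumption: all probabilities are finite, so the output is valid JSON.
pub fn classification_json(probs: &[f32]) -> String {
    // Argmax by linear scan; ties keep the earliest index.
    let mut best = 0;
    for (i, &p) in probs.iter().enumerate() {
        if p > probs[best] {
            best = i;
        }
    }
    let confidence = probs.get(best).copied().unwrap_or(0.0);
    let list: Vec<String> = probs.iter().map(|p| format!("{p:.3}")).collect();
    format!(
        "{{ \"type\": \"classification\", \"class_index\": {best}, \"confidence\": {confidence:.3}, \"probabilities\": [{}] }}",
        list.join(", ")
    )
}

/// Build the regression payload from the denormalised and normalised outputs.
pub fn regression_json(value: f32, value_norm: f32) -> String {
    format!(
        "{{ \"type\": \"regression\", \"value\": {value:.1}, \"value_norm\": {value_norm:.3} }}"
    )
}
```

Returning a string keeps the binding surface small: JavaScript calls JSON.parse once and branches on the type field.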

norm_encoded() is a deliberate design decision: the normaliser statistics are needed in JavaScript to compute per-feature z-scores for the statistical terminals. Rather than re-parsing the metadata JSON or making a second fetch, this method returns the compact mean,std;mean,std;… string directly so the JS can reconstruct z-scores with one split.

The automatic task detection

The train_cli binary accepts any CSV and figures out classification versus regression automatically:

let distinct_targets: HashSet<String> = raw_rows.iter().map(|(_, t)| t.clone()).collect();
let all_numeric = distinct_targets.iter().all(|t| t.parse::<f64>().is_ok());
let reg_threshold = if raw_rows.len() > 50 { 15 } else { raw_rows.len() / 3 };

let task = if all_numeric && distinct_targets.len() > reg_threshold {
    TaskType::Regression
} else {
    TaskType::Classification
};

If the target column is numeric and has more than 15 distinct values, it's regression; for datasets of 50 rows or fewer, the cutoff drops to a third of the row count. Otherwise it's classification. This heuristic works correctly on all ten current datasets, including edge cases like wine quality (11 distinct integer values → classification) versus housing prices (thousands of distinct floats → regression).
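Pulled out of the CLI, the heuristic is easy to exercise on its own. The sketch below wraps the same three lines in a function over a slice of target strings. The detect_task name, the simplified TaskType enum, and the input type are assumptions made so the example stands alone.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum TaskType {
    Classification,
    Regression,
}

/// Decide the task from the raw target column, one string per row.
pub fn detect_task(targets: &[&str]) -> TaskType {
    let distinct: HashSet<&str> = targets.iter().copied().collect();
    let all_numeric = distinct.iter().all(|t| t.parse::<f64>().is_ok());
    // Large datasets use a fixed cutoff; small ones scale it with the row count.
    let reg_threshold = if targets.len() > 50 { 15 } else { targets.len() / 3 };
    if all_numeric && distinct.len() > reg_threshold {
        TaskType::Regression
    } else {
        TaskType::Classification
    }
}
```

Sixty rows cycling through eleven integer labels come out as classification, sixty rows with sixty distinct numeric values come out as regression, and any non-numeric label forces classification regardless of how many distinct values there are.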

The trainer also reads feature names from the CSV header, computes per-feature [min, max] ranges, and packages everything into the FINF file alongside the weights. There is no configuration file, no schema definition, no argument to specify field names.

The live statistical terminals

Every dataset page has two terminals that update on every slider drag. They are implemented entirely in JavaScript — no additional Rust code — using the normalizer statistics that norm_encoded() exposes.

Terminal 1 — Model Statistics

For each feature, the terminal computes and displays:

Raw value: the current slider position
Z-score: (value − μ) / σ using the training-set mean and standard deviation from the embedded normalizer. Colour-coded: green for |z| < 1, yellow for 1–2, red for >2
Range bar: the value's position within [dataset min, dataset max]
Centred z-bar: direction and distance from the training mean, as a horizontal bar

Below the feature table, a static architecture card shows layer dimensions, task type, normaliser type, and file format.

Terminal 2 — Quantitative Report

For classification, the report shows:

A confidence badge: Certain / Confident / Uncertain / Toss-up, derived from the Shannon entropy of the output distribution
The full probability table: P, a bar, log P, and the odds ratio for every class
Shannon entropy H(p) on a gauge, from 0 nats (model certain) to ln(C) nats (maximally confused across C classes)
The top-2 margin: P(winner) − P(runner-up)

Shannon entropy is the right measure of model certainty here because it captures the full distribution, not just the top probability. A model that outputs [0.60, 0.39, 0.01] has the same top probability as one that outputs [0.60, 0.20, 0.20], but the second is meaningfully more uncertain — entropy catches that, argmax-confidence doesn't.
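The two distributions in that example can be compared directly. The demo computes these quantities in JavaScript; the sketch below restates the same arithmetic in Rust for consistency with the rest of the article, with hypothetical function names.

```rust
/// Shannon entropy in nats: H(p) = -sum(p_i * ln p_i), skipping zero entries
/// (the limit of p * ln p as p goes to 0 is 0).
pub fn entropy_nats(p: &[f32]) -> f32 {
    p.iter()
        .filter(|&&pi| pi > 0.0)
        .map(|&pi| -pi * pi.ln())
        .sum()
}

/// Top-2 margin: P(winner) - P(runner-up).
pub fn top2_margin(p: &[f32]) -> f32 {
    let mut sorted = p.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    match sorted.as_slice() {
        [first, second, ..] => first - second,
        [only] => *only,
        [] => 0.0,
    }
}
```

For [0.60, 0.39, 0.01] the entropy is about 0.72 nats and the margin is 0.21; for [0.60, 0.20, 0.20] the entropy is about 0.95 nats and the margin is 0.40. Both have a top probability of 0.60, yet entropy ranks the second as more spread out while the margin flags the first as the tighter two-way race, which is why the report shows the two measures side by side.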

For regression, the report shows:

The predicted value on a visual scale spanning [training_min, training_max], with the training mean marked
The prediction's z-score relative to the training target distribution
How far the prediction is from the dataset mean, as a percentage
A ±1σ reference interval from the training targets
An approximate quartile (Q1/Q2/Q3/Q4)

The critical implementation detail: the normalizer for regression models stores one extra (mean, std) pair at the end — the target variable's statistics. This is what denormalise_target() uses to convert the normalised network output back to the original scale, and what the JavaScript uses to compute the target z-score without any additional server call.
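The "last pair is the target" convention can be sketched in a few lines. The signatures below are assumptions for illustration (the article names denormalise_target() but does not show it), and the statistics used in the usage note are made-up round numbers, not values from any of the ten models.

```rust
/// Convert a normalised network output back to the original target scale,
/// using the final (mean, std) pair as the target's statistics.
/// Returns None when no statistics are available.
pub fn denormalise_target(stats: &[(f32, f32)], y_norm: f32) -> Option<f32> {
    let &(mean, std) = stats.last()?;
    Some(y_norm * std + mean)
}

/// Z-score of a raw prediction relative to the training target distribution.
/// Returns None when there are no statistics or the target has zero variance.
pub fn target_z_score(stats: &[(f32, f32)], y_raw: f32) -> Option<f32> {
    let &(mean, std) = stats.last()?;
    if std > 0.0 {
        Some((y_raw - mean) / std)
    } else {
        None
    }
}
```

With a hypothetical target mean of 200000 and standard deviation of 100000 in the last slot, a network output of -0.5 denormalises to 150000, and feeding 150000 back into target_z_score returns -0.5.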

The ten datasets

Training one dataset takes 1–5 seconds. All ten train in about a minute, single-threaded.

Dataset	Task	Rows	Features	Result
🌸 Iris Species	3-class	150	4	98.7% acc
🐧 Palmer Penguins	3-class	342	4	99.4% acc
🌾 Wheat Seeds	3-class	210	7	99.5% acc
🍷 Wine Quality	3-class	1,599	11	80.9% acc
🩺 Pima Diabetes	binary	768	8	93.0% acc
❤️ Heart Disease	binary	297	13	96.3% acc
🔬 Breast Cancer	binary	569	30	99.3% acc
🚢 Titanic Survival	binary	891	6	86.9% acc
🚗 Auto MPG	regression	392	6	RMSE 1.95 mpg
🏠 CA Housing Prices	regression	20,433	8	RMSE ~$52k

Each model file is 1.5–10.5 KB and is self-contained: weights, normaliser statistics, and the full metadata JSON are all packed into one FINF v3 binary. The largest is Breast Cancer (30 features, 10.5 KB). The smallest is Iris (4 features, 1.5 KB).

The numbers

Property	Value
WASM binary	128 KB
JS glue	9 KB
Shared JS + CSS	38 KB
Largest model file	10.5 KB (Breast Cancer, 30 features)
Smallest model file	1.5 KB (Iris, 4 features)
Total page weight	~230 KB for all ten datasets
External dependencies	0
Tests	131 (0 failures)
Source lines (Rust)	~3,700
Source lines (JS)	~920

The test suite: 131 tests, three layers

86 unit tests in ferrum_core — every module in isolation. Highlights:

backprop_gradient_check: perturbs individual weights by ε, confirms the analytic gradient from backward() matches (L(w+ε) - L(w-ε)) / (2ε) to within 1e-2. This is the check that the calculus is correct.
mse_gradient_finite_difference: same check for the regression loss
metadata_json_roundtrip: ModelMetadata::to_json() then from_json() must produce identical structs — this is what the browser depends on
normalizer_zero_mean and normalizer_produces_unit_variance: fit-and-transform must actually standardise the data

39 integration tests — the complete pipeline for all ten datasets. Key tests:

all_ten_model_files_load_and_produce_finite_outputs: loads every trained model from disk, runs inference on a test input, asserts no NaN or Inf in output
all_classification_models_output_valid_distributions: for every classification model, two distinct inputs must produce probabilities that sum to 1.0 within 1e-4
both_regression_models_produce_plausible_values: housing predictions must be in [$10k, $10M]; MPG predictions must be in [5, 60] mpg
batch_inference_matches_individual: running three inputs as a batch must produce identical results to running them one at a time

6 WASM glue tests — the Rust side of the bindings: load from bytes, infer, check metadata JSON fields, verify norm_encoded format, batch/individual agreement, corrupt-byte rejection.

The CI/CD pipeline

A single GitHub Actions workflow handles the complete lifecycle on every push to main:

steps:
  - Check formatting (cargo fmt --check)
  - Clippy (cargo clippy -- -D warnings)
  - Run all 131 tests (cargo test --workspace)
  - Train all 10 models (cargo run -p train_cli --release -- ...)
  - Compile to WASM (cargo build --target wasm32-unknown-unknown --release)
  - Generate JS bindings (wasm-bindgen ...)
  - Verify all 10 model files have FINF magic bytes
  - Deploy web/ to GitHub Pages

Total CI time: ~4 minutes on GitHub's free runners.

The "verify FINF magic bytes" step is a deliberate defensive check: if the trainer emits a corrupt file, or the wrong file gets copied to web/datasets/*/model.bin, the deploy fails loudly before any user sees a broken page.

Adding a new dataset in three steps

The generic architecture means adding an eleventh dataset requires no Rust changes and almost no JavaScript changes.

Step 1 — Prepare the CSV (numeric features, label in last column, header optional):

age,sex,cp,trestbps,chol,fbs,...,target
63,1,1,145,233,1,...,0
...

Step 2 — Train:

cargo run -p train_cli --release -- new_data.csv \
  web/datasets/newds/model.bin "Dataset Name" 48 500

The trainer auto-detects classification vs regression, reads feature names from the header, computes ranges, and embeds everything in the model file.

Step 3 — Write one HTML file by copying any existing dataset page and changing the title, subtitle, preset button values, and source URL. The slider labels, slider ranges, probability bar class names, and both statistical terminals build themselves from the embedded metadata. No JavaScript to modify.

The engineering lessons, in order of surprise

The format is the API. FINF v3 encodes not just weights but the full contract between training and inference: the normalizer statistics and the UI metadata. The format is what makes the browser able to build the right UI for any dataset without a configuration file.

Shannon entropy beats argmax-confidence. Reporting max(probabilities) as "confidence" is misleading when a model outputs [0.55, 0.44, 0.01] — the model is not 55% confident, it is nearly maximally uncertain between two classes. Entropy captures the whole distribution. The terminal shows both: the top probability for human readability and the entropy for actual information content.

The norm_encoded() method is not a convenience — it is a privacy guarantee. The normaliser statistics live in the model file. JavaScript reads them from there. There is no separate fetch to a stats endpoint, no server that knows which inputs the user is testing. The statistical terminals are fully client-side.

The i-k-j matmul loop order is a free lunch. Swapping the inner two loops in a triple-loop matmul costs nothing and improves cache behaviour. Any project doing its own matrix multiply should do this.

Regression needs one extra normaliser slot. For regression, the target variable must be normalised before training (otherwise the loss is dominated by the raw scale of the target). The engine stores the target's mean and standard deviation as an extra pair at the end of the normaliser string. denormalise_target() uses the last pair. JavaScript does the same. The fact that this is implicit rather than explicit (a separate struct field) is a trade-off I'd reconsider if the project grew.

Finite differences are non-negotiable. Both gradients — softmax cross-entropy and MSE — are verified against numerical finite differences in the test suite. Writing backprop by hand without this check is guessing. The check costs ten lines of test code and has caught at least one sign error during development.

Deployment in three commands

# 1. Train all models
bash scripts/train_all.sh

# 2. Compile to WebAssembly
bash scripts/build_wasm.sh

# 3. Serve locally
python3 -m http.server 8080 --directory web

For GitHub Pages, push the repository, activate Pages under Settings → Pages → Source → GitHub Actions, and the included workflow handles the rest on every subsequent push.

The web/ directory is fully self-contained: copy it to any static host — Netlify drop, Cloudflare Pages, an S3 bucket, a USB stick — and the demo works. There is no backend to configure.

What's next

The architecture is designed so extensions are local. Some natural next steps:

Embeddings: the first Linear layer already acts as an embedding table when inputs are one-hot — a proper embedding lookup layer would let the engine handle word-level inputs
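Parallel kernels: ops.rs is the single place where arithmetic lives; dropping in rayon for the matmul would parallelize native training and inference without touching any other module, though it would end the zero-dependency rule, and in the browser it would additionally need WebAssembly threads support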
Adam optimizer: optim.rs is the single place where parameter updates live; Adam adds two more buffers per parameter and one more expression
More loss functions: focal loss for class imbalance, Huber loss for robust regression — all live in loss.rs, none require changes above it

Built with Rust 1.95, wasm-bindgen 0.2.122, and ten public datasets from UCI ML Repository, Kaggle, and Palmer Station Antarctica.
