luckrig: a concept for tasting LLM rigs, not just models

#llm #opensource #showdev #machinelearning

HuggingFace Spaces lets you try models.
LMSys Arena lets you compare models.

Neither lets you try a specific rig.

Exact GPU. Exact quantization. Exact context length.
Someone's actual tuning notes — with your own prompt, right now.

That's the gap. luckrig is a concept to fill it.

If Arena maps models, luckrig maps the rigs.

Service       What you taste          Hardware visible?
HF Spaces     Author's model wrap     Whatever they printed
LMSys Arena   Blind A/B models        Model name. Nothing else.
AI Horde      Any worker that fits    Abstracted away
luckrig       A specific rig          GPU · quant · ctx · tuning

AI Horde abstracts the worker away.
luckrig makes the hardware the star.

Access earned by contribution, not money.

Inspired by Hotline Connect, the late-1990s Mac P2P tool where
contribution score, not payment, determined access rights.
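
One way to picture contribution-gated access is a small tier lookup. This is only a sketch: the post does not say how luckrig scores contribution or what it unlocks, so the `Member` type, the `TIERS` thresholds, and the idea of a daily prompt quota are all assumptions made up for illustration. It is written in Python for brevity.

```python
from dataclasses import dataclass

# Hypothetical tiers: (minimum contribution score, prompts allowed per day).
# These numbers are illustrative only and do not come from the luckrig spec.
TIERS = [(0, 5), (10, 50), (100, 500)]


@dataclass
class Member:
    name: str
    contribution_score: int  # assumed: points earned by hosting or serving


def daily_prompt_quota(member: Member) -> int:
    """Return the quota of the highest tier whose threshold the member meets."""
    quota = 0
    for min_score, allowed in TIERS:
        if member.contribution_score >= min_score:
            quota = allowed
    return quota


newcomer = Member("newcomer", 0)
host = Member("node-host", 120)
print(daily_prompt_quota(newcomer), daily_prompt_quota(host))
```

Nothing in the lookup involves payment: the only input is the score, which is the property the Hotline comparison points at.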

Three seed nodes exist in the POC — not yet public.

first-5090-qwen3 — RTX 5090, Qwen3-35B-A3B, Q4_K_XL, 267 tok/s
weekend-m3max — Apple M3 Max, Qwen2.5-14B, Q5_K_M
shed-pi5 — Raspberry Pi 5, llama3.2-1B, 2.3 tok/s

These are local test nodes to demonstrate the concept.
Looking for early contributors who want to register a real node.
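
To make "a specific rig" concrete, the three seed nodes above could be written as plain records. This is a sketch, not the POC's actual schema: the `RigNode` class and its field names are assumptions, and fields the post does not state (context length, tuning notes, some throughput figures) are left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RigNode:
    """Hypothetical node record; field names are not taken from the spec."""
    node_id: str
    hardware: str
    model: str
    quant: Optional[str] = None            # None where the post gives no value
    tokens_per_sec: Optional[float] = None
    context_length: Optional[int] = None   # not stated for any seed node
    tuning_notes: str = ""


SEED_NODES = [
    RigNode("first-5090-qwen3", "RTX 5090", "Qwen3-35B-A3B", "Q4_K_XL", 267.0),
    RigNode("weekend-m3max", "Apple M3 Max", "Qwen2.5-14B", "Q5_K_M"),
    RigNode("shed-pi5", "Raspberry Pi 5", "llama3.2-1B", tokens_per_sec=2.3),
]


def card(node: RigNode) -> str:
    """Render one line per node, skipping fields that are not recorded."""
    parts = [node.hardware, node.model]
    if node.quant:
        parts.append(node.quant)
    if node.tokens_per_sec is not None:
        parts.append(f"{node.tokens_per_sec:g} tok/s")
    return f"{node.node_id}: " + " · ".join(parts)


for node in SEED_NODES:
    print(card(node))
```

Keeping unknown fields as `None` means a node card only shows what its owner actually reported.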

Rarity-first, not leaderboard.

The Pi node ranks higher than the 5090 because it's rarer.
Not a speed competition — a showcase of diversity.
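
A rarity-first ordering might look like the sketch below. The post does not define how rarity is measured, so this assumes it is the share of registered nodes with the same hardware, scored as the negative log of that share. The `REGISTRY` counts are invented for illustration; the POC itself has only three seed nodes.

```python
from collections import Counter
from math import log2

# Hypothetical registry: one hardware label per registered node.
# The counts are invented; they are not data from the luckrig POC.
REGISTRY = ["RTX 5090"] * 6 + ["Apple M3 Max"] * 3 + ["Raspberry Pi 5"]


def rarity_scores(registry: list[str]) -> dict[str, float]:
    """Score each hardware label by -log2 of its share of the registry."""
    counts = Counter(registry)
    total = len(registry)
    return {hw: -log2(n / total) for hw, n in counts.items()}


def rank_by_rarity(registry: list[str]) -> list[str]:
    """Rarest hardware first; throughput plays no part in the order."""
    scores = rarity_scores(registry)
    return sorted(scores, key=scores.get, reverse=True)


print(rank_by_rarity(REGISTRY))
```

With these made-up counts the Pi comes first and the 5090 last, even though the 5090 is by far the fastest node.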

Working POC. No external dependencies.

git clone https://github.com/prospectorlabs/luckrig
cd luckrig
npm start
→ http://127.0.0.1:8787

Concept + full spec + working code, all open.

https://github.com/prospectorlabs/luckrig
https://prospectorlabs.dev/luckrig/

Top comments (1)

Tekeshwar Singh • May 23

The gap you're identifying is real, and I'd add one more dimension that hardware specs alone can't capture: inference quality consistency under realistic load patterns.

Two rigs with identical GPU/quant/ctx specs can behave very differently when you run 50 concurrent requests vs 5. The 5090 with Qwen3-35B might give you 267 tok/s in isolation but degrade to 180 tok/s with queue depth 10 — and the output quality often degrades before the throughput does. Smaller context requests start hallucinating earlier when KV cache is under pressure.

What I'd love to see luckrig expose (maybe as optional fields in tuning notes): sustained throughput at different concurrency levels, and — harder but more valuable — a quality degradation curve. At what queue depth does the rig start producing outputs that would fail a basic factual eval?

The Pi node rarity framing is clever. Though in practice for production workloads, what matters most is: at my expected p95 request rate, does this rig maintain output quality? The rarity metric is interesting for discovery; quality-under-load is what determines whether a rig is actually usable.

Solid concept — the harness-aware comparison space is completely empty right now.