DEV Community

gen

luckrig: a concept for tasting LLM rigs, not just models

Hugging Face Spaces lets you try models.
LMSys Arena lets you compare models.

Neither lets you try a specific rig.

Exact GPU. Exact quantization. Exact context length.
Someone's actual tuning notes — with your own prompt, right now.

That's the gap. luckrig is a concept to fill it.


If Arena maps models, luckrig maps the rigs.

| Service     | What you taste       | Hardware visible?          |
|-------------|----------------------|----------------------------|
| HF Spaces   | Author's model wrap  | Whatever they printed      |
| LMSys Arena | Blind A/B models     | Model name. Nothing else.  |
| AI Horde    | Any worker that fits | Abstracted away            |
| luckrig     | A specific rig       | GPU · quant · ctx · tuning |

AI Horde abstracts the worker away.
luckrig makes the hardware the star.


Access earned by contribution, not money.

Inspired by Hotline Connect — the early-2000s Mac P2P tool where
contribution score, not payment, determined access rights.

Register a node → write tuning notes → upload timing measurements.
That's how you earn access to other people's rigs.
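
The earn-by-contributing flow above could be sketched roughly like this. The point values, the access threshold, and every name here (`Contributor`, `contribution_score`, `can_access_rigs`) are hypothetical illustrations, not taken from the luckrig spec or code.

```python
from dataclasses import dataclass

# Hypothetical point values; the actual luckrig spec may weigh these differently.
POINTS_PER_NODE = 10
POINTS_PER_TUNING_NOTE = 3
POINTS_PER_TIMING_UPLOAD = 1
ACCESS_THRESHOLD = 10  # assumed minimum score needed to try other people's rigs


@dataclass
class Contributor:
    nodes_registered: int = 0
    tuning_notes: int = 0
    timing_uploads: int = 0


def contribution_score(c: Contributor) -> int:
    """Sum the three contribution types named in the post."""
    return (
        c.nodes_registered * POINTS_PER_NODE
        + c.tuning_notes * POINTS_PER_TUNING_NOTE
        + c.timing_uploads * POINTS_PER_TIMING_UPLOAD
    )


def can_access_rigs(c: Contributor) -> bool:
    """Access depends only on contribution; there is no payment input at all."""
    return contribution_score(c) >= ACCESS_THRESHOLD


newcomer = Contributor()
owner = Contributor(nodes_registered=1, tuning_notes=2, timing_uploads=5)
print(contribution_score(owner), can_access_rigs(owner))
print(contribution_score(newcomer), can_access_rigs(newcomer))
```

The only inputs are things a contributor did, which is the whole idea: nothing in the function signature could represent money.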


Three seed nodes exist in the POC — not yet public.

  • first-5090-qwen3 — RTX 5090, Qwen3-35B-A3B, Q4_K_XL, 267 tok/s
  • weekend-m3max — Apple M3 Max, Qwen2.5-14B, Q5_K_M
  • shed-pi5 — Raspberry Pi 5, llama3.2-1B, 2.3 tok/s

These are local test nodes to demonstrate the concept.
Looking for early contributors who want to register a real node.
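
As a rough sketch of what a node record might hold, here are the three seed nodes as data. The `RigNode` class and its field names are assumptions made for illustration, not the schema from the luckrig spec. Values the post does not state (context length, the Pi's quant, the M3 Max's speed) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RigNode:
    # Hypothetical schema: field names are illustrative, not from the spec.
    node_id: str
    hardware: str
    model: str
    quant: Optional[str] = None             # None where the post gives no value
    tokens_per_sec: Optional[float] = None  # None where the post gives no value
    context_length: Optional[int] = None    # not stated for any seed node
    tuning_notes: str = ""


SEED_NODES = [
    RigNode(node_id="first-5090-qwen3", hardware="RTX 5090",
            model="Qwen3-35B-A3B", quant="Q4_K_XL", tokens_per_sec=267.0),
    RigNode(node_id="weekend-m3max", hardware="Apple M3 Max",
            model="Qwen2.5-14B", quant="Q5_K_M"),
    RigNode(node_id="shed-pi5", hardware="Raspberry Pi 5",
            model="llama3.2-1B", tokens_per_sec=2.3),
]


def summary(node: RigNode) -> str:
    """Render only the fields that are actually known for this node."""
    parts = [node.hardware, node.model]
    if node.quant is not None:
        parts.append(node.quant)
    if node.tokens_per_sec is not None:
        parts.append(f"{node.tokens_per_sec:g} tok/s")
    return f"{node.node_id}: " + " · ".join(parts)


for node in SEED_NODES:
    print(summary(node))
```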


Rarity-first, not leaderboard.

The Pi node ranks higher than the 5090 because it's rarer.
Not a speed competition — a showcase of diversity.
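
One way rarity-first ordering could work is to sort by how few registered nodes share the same hardware and ignore speed entirely. This is a minimal sketch under that assumption. The post does not say how luckrig computes rarity, and the two extra 5090 entries below are invented placeholders so the counts differ.

```python
from collections import Counter

# Illustrative registry of (node_id, hardware, tok/s).
# Only "first-5090-qwen3" and "shed-pi5" come from the post; the
# "example-*" entries are made up to give the 5090 a higher count.
REGISTRY = [
    ("first-5090-qwen3", "RTX 5090", 267.0),
    ("example-5090-b", "RTX 5090", None),
    ("example-5090-c", "RTX 5090", None),
    ("shed-pi5", "Raspberry Pi 5", 2.3),
]


def rarity_ranking(registry):
    """Return node ids ordered rarest hardware first.

    Rarity here is simply "fewest nodes with the same hardware".
    The tok/s value is deliberately never read. Ties keep registration
    order because Python's sort is stable.
    """
    counts = Counter(hardware for _, hardware, _ in registry)
    ranked = sorted(registry, key=lambda node: counts[node[1]])
    return [node_id for node_id, _, _ in ranked]


print(rarity_ranking(REGISTRY))
```

Under these placeholder counts the Pi lands first despite being the slowest node, which matches the ordering described above.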


Working POC. No external dependencies.

git clone https://github.com/prospectorlabs/luckrig
cd luckrig
npm start

Then open http://127.0.0.1:8787 in a browser.

Concept + full spec + working code, all open.

https://github.com/prospectorlabs/luckrig
https://prospectorlabs.dev/luckrig/

Top comments (1)

Tekeshwar Singh

The gap you're identifying is real, and I'd add one more dimension that hardware specs alone can't capture: inference quality consistency under realistic load patterns.

Two rigs with identical GPU/quant/ctx specs can behave very differently when you run 50 concurrent requests vs 5. The 5090 with Qwen3-35B might give you 267 tok/s in isolation but degrade to 180 tok/s with queue depth 10 — and the output quality often degrades before the throughput does. Smaller context requests start hallucinating earlier when KV cache is under pressure.

What I'd love to see luckrig expose (maybe as optional fields in tuning notes): sustained throughput at different concurrency levels, and — harder but more valuable — a quality degradation curve. At what queue depth does the rig start producing outputs that would fail a basic factual eval?

The Pi node rarity framing is clever. Though in practice for production workloads, what matters most is: at my expected p95 request rate, does this rig maintain output quality? The rarity metric is interesting for discovery; quality-under-load is what determines whether a rig is actually usable.

Solid concept — the harness-aware comparison space is completely empty right now.