
Yanko Alexandrov

Originally published at openclawhardware.dev

I Built a Full Voice Pipeline on a €399 Edge AI Box (Whisper + Kokoro on Tensor Cores)

We just shipped a feature I've been wanting for months: full bidirectional voice on local hardware.

No cloud speech APIs. No audio leaving your network. No round-trip latency on the voice side. Just a €399 box on your desk that listens and talks back.

The Setup

Hardware: NVIDIA Jetson Orin Nano (67 TOPS, 1024 CUDA cores, tensor cores)
Speech-to-Text: OpenAI Whisper (runs locally on GPU)
Text-to-Speech: Kokoro TTS (82M params, natural human voice)
AI Brain: OpenClaw (connects to Claude/GPT via API)
Power draw: 15 watts total
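
If you're wiring this up yourself, the first thing worth checking is that PyTorch actually sees the Orin's GPU. A quick sanity check, assuming a JetPack-compatible PyTorch build with CUDA support:

```python
# Quick sanity check that PyTorch can see the Orin Nano's GPU.
# Assumes a JetPack-compatible PyTorch build with CUDA support.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)                  # e.g. "Orin"
    major, minor = torch.cuda.get_device_capability(0)    # Ampere reports 8.x
    print(f"GPU: {name}, compute capability {major}.{minor}")
else:
    print("No CUDA device found; check the JetPack / PyTorch install")
```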

How It Works

```
You speak → Whisper (tensor cores) → text → AI thinks → Kokoro (tensor cores) → natural voice response
```

The entire voice loop runs on the NVIDIA tensor cores. Whisper transcribes your speech in real-time across 90+ languages. Kokoro generates natural human speech with multiple voice options. The AI brain (OpenClaw) handles the thinking.
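
To make that concrete, here's a rough sketch of the loop in plain Python. This is not the OpenClaw implementation: it uses the open-source openai-whisper and kokoro packages, and the think() function is just a stub for whichever LLM you plug in. The KPipeline interface and voice name follow the kokoro package docs, so double-check them against your installed version.

```python
# Minimal local voice loop: speech in -> text -> LLM -> speech out.
# Assumes the openai-whisper, kokoro, soundfile and numpy packages are installed.
import numpy as np
import soundfile as sf
import whisper
from kokoro import KPipeline

stt = whisper.load_model("small")      # ~461 MB checkpoint, loads onto the GPU
tts = KPipeline(lang_code="a")         # "a" = American English voice pack

def think(prompt: str) -> str:
    # Stub for the AI brain: swap in a Claude/GPT API call or a local model.
    return f"You said: {prompt}"

def voice_loop(in_path: str, out_path: str) -> None:
    # 1. Speech-to-text on the GPU; fp16 keeps the work on the tensor cores.
    text = stt.transcribe(in_path, fp16=True)["text"].strip()

    # 2. Produce a reply.
    reply = think(text)

    # 3. Text-to-speech; Kokoro yields (graphemes, phonemes, audio) chunks at 24 kHz.
    audio = np.concatenate([np.asarray(a) for _, _, a in tts(reply, voice="af_heart")])
    sf.write(out_path, audio, 24000)

voice_loop("question.ogg", "answer.wav")
```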

Why Tensor Cores Matter

The Jetson Orin Nano's Ampere GPU has dedicated tensor cores — specialized hardware for matrix operations that AI models depend on. This means:

  • Whisper runs in real-time (not 10x slower like on a Raspberry Pi)
  • Kokoro generates speech faster than real-time
  • Both can run simultaneously with CUDA cores to spare
  • All at 15 watts — less than a light bulb
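
What "using the tensor cores" boils down to in practice is running the models in half precision on the GPU. Here is an illustrative (not benchmarked) sketch that times Whisper's fp16 GPU path against a plain CPU run on the same clip; the file name is a placeholder and exact numbers depend on your clip, power mode, and model size.

```python
# Illustrative timing: Whisper on the GPU in fp16 vs. a plain fp32 CPU run.
# Not a benchmark; results depend on the clip, power mode, and model size.
import time
import whisper

def timed_transcribe(model, clip, **kwargs):
    start = time.perf_counter()
    text = model.transcribe(clip, **kwargs)["text"]
    return text, time.perf_counter() - start

clip = "sample.wav"   # any short recording you have lying around

gpu = whisper.load_model("small", device="cuda")
_, gpu_s = timed_transcribe(gpu, clip, fp16=True)    # half precision, tensor cores

cpu = whisper.load_model("small", device="cpu")
_, cpu_s = timed_transcribe(cpu, clip, fp16=False)   # reference run on the CPU

print(f"GPU fp16: {gpu_s:.1f}s   CPU fp32: {cpu_s:.1f}s")
```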

Real-World Usage

I send a voice message on Telegram. My AI assistant:

  1. Transcribes it locally (Whisper)
  2. Understands the request (Claude API)
  3. Takes action (browser automation, email, calendar)
  4. Responds in natural speech (Kokoro)
  5. Sends back a voice message on Telegram

The whole loop takes a few seconds. No audio ever leaves the device for transcription.
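
For the curious, the Telegram side of that loop is not much code either. Here is a rough sketch with python-telegram-bot (v20+), reusing the voice_loop() helper from the earlier sketch. OpenClaw's real integration is more involved, and the bot token and file paths here are placeholders.

```python
# Sketch of the Telegram round trip with python-telegram-bot v20+.
# BOT_TOKEN is a placeholder, and voice_loop() is the helper from the earlier sketch.
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

BOT_TOKEN = "YOUR_BOT_TOKEN"

async def handle_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # 1. Download the incoming voice note (OGG/Opus) to local storage.
    voice_file = await update.message.voice.get_file()
    await voice_file.download_to_drive("incoming.ogg")

    # 2. Run the local STT -> LLM -> TTS loop entirely on the box.
    voice_loop("incoming.ogg", "reply.wav")

    # 3. Send the reply back. (Telegram prefers OGG/Opus for voice notes,
    #    so you may want to convert the WAV first or use reply_audio instead.)
    with open("reply.wav", "rb") as audio:
        await update.message.reply_voice(voice=audio)

app = Application.builder().token(BOT_TOKEN).build()
app.add_handler(MessageHandler(filters.VOICE, handle_voice))
app.run_polling()
```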

Privacy Angle

Every word you say is processed on your hardware. Your voice data never hits a cloud server for STT/TTS. The only cloud call is to the LLM API — and even that's optional if you run a local model.
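
If you do want to cut that last cloud call, anything that speaks the OpenAI-compatible chat API can stand in. A minimal sketch, assuming a local server such as Ollama listening on localhost; the endpoint and model name are assumptions, so use whatever your setup actually exposes.

```python
# Swapping the "AI brain" to a local OpenAI-compatible endpoint (e.g. Ollama).
# The base_url, api_key and model name are assumptions; use whatever your
# local server actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",   # Ollama's OpenAI-compatible API
    api_key="unused",                       # local servers ignore the key
)

def think(prompt: str) -> str:
    response = client.chat.completions.create(
        model="llama3.2",                   # any model you've pulled locally
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(think("Summarize today's calendar in one sentence."))
```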

Compare this to Alexa, Siri, or Google Assistant, where voice requests are routinely shipped off to vendor servers to be processed and retained.

The Numbers

| Component | Model | Size | Speed / Notes |
| --- | --- | --- | --- |
| STT | Whisper Small | 461MB | Real-time |
| TTS | Kokoro-82M | ~200MB | Faster than real-time |
| Total VRAM | - | ~1GB | Leaves 7GB free |
| Power | - | - | 15W total system |
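
Want to verify the memory numbers on your own box? A quick check with PyTorch's CUDA queries after both models are loaded; actual figures will vary with model variants and whatever else is running.

```python
# Rough VRAM check once both models are loaded on the GPU.
# Jetson memory is shared with the CPU, so treat these as ballpark figures.
import torch

free_b, total_b = torch.cuda.mem_get_info()      # free / total bytes on device 0
allocated_b = torch.cuda.memory_allocated()      # bytes held by PyTorch tensors

print(f"Total:   {total_b / 1e9:.1f} GB")
print(f"Free:    {free_b / 1e9:.1f} GB")
print(f"PyTorch: {allocated_b / 1e9:.2f} GB allocated")
```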

Try It

The hardware is called ClawBox — a pre-configured Jetson Orin Nano with OpenClaw, Whisper, and Kokoro pre-installed. Plug in, connect to Telegram, start talking.

Or build your own: grab a Jetson Orin Nano, install OpenClaw, and follow the setup guide.


What's your local voice setup? Running Piper, XTTS, or something else? Would love to hear what's working for people.
