We just shipped a feature I've been wanting for months: full bidirectional voice on local hardware.
No cloud speech service. No speech API keys. No network round-trip for your audio. Just a €399 box on your desk that listens and talks back.
The Setup
- Hardware: NVIDIA Jetson Orin Nano (67 TOPS, 1024 CUDA cores, tensor cores)
- Speech-to-Text: OpenAI Whisper (runs locally on GPU)
- Text-to-Speech: Kokoro TTS (82M params, natural human voice)
- AI Brain: OpenClaw (connects to Claude/GPT via API)
- Power draw: 15 watts total
How It Works
You speak → Whisper (tensor cores) → text → AI thinks → Kokoro (tensor cores) → natural voice response
The speech side of the loop runs entirely on the Jetson's GPU and its tensor cores. Whisper transcribes your speech in real time across 90+ languages. Kokoro generates natural human speech with multiple voice options. The AI brain (OpenClaw) handles the thinking, which is the only step that calls out to an API.
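To make the flow concrete, here is a minimal sketch of the loop as three swappable stages. The names (`VoiceLoop`, `fake_transcribe`, and so on) are hypothetical and are not OpenClaw's actual API. The stub stages only shuffle bytes and strings so the example runs anywhere; on the real box the first and last stage would wrap Whisper and Kokoro.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable so real models or test stubs can be swapped in.
Transcribe = Callable[[bytes], str]   # audio in -> text (e.g. a Whisper wrapper)
Think = Callable[[str], str]          # prompt -> reply (e.g. an LLM API call)
Synthesize = Callable[[str], bytes]   # text -> audio (e.g. a Kokoro wrapper)


@dataclass
class VoiceLoop:
    transcribe: Transcribe
    think: Think
    synthesize: Synthesize

    def run(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)   # local speech-to-text
        reply = self.think(text)           # only text is passed to the LLM stage
        return self.synthesize(reply)      # local text-to-speech


# Stub stages for illustration only; they do not call any real model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_think(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


loop = VoiceLoop(fake_transcribe, fake_think, fake_synthesize)
```

The point of the shape is that the LLM stage only ever sees text, so swapping a cloud model for a local one changes one callable and nothing else.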
Why Tensor Cores Matter
The Jetson Orin Nano's Ampere GPU has dedicated tensor cores — specialized hardware for matrix operations that AI models depend on. This means:
- Whisper keeps up with live speech (rather than running around 10x slower than real time, as on a Raspberry Pi)
- Kokoro generates speech faster than real-time
- Both can run simultaneously with CUDA cores to spare
- All at 15 watts, less than a traditional incandescent light bulb
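"Real-time" and "faster than real-time" are usually expressed as a real-time factor (RTF): processing time divided by audio duration. Here is a small sketch of how you could measure it yourself. The helper names are my own, and the example numbers in the usage note are illustrative, not benchmarks from this box.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """An RTF at or below 1.0 means the model keeps up with live audio."""
    return rtf <= 1.0


def measure_rtf(stage: Callable[[], object], audio_seconds: float) -> float:
    """Time one call to `stage` and relate it to the clip length."""
    start = time.perf_counter()
    stage()
    return real_time_factor(time.perf_counter() - start, audio_seconds)
```

For example, a 10-second clip processed in 3 seconds gives an RTF of 0.3, while "10x slower than real time" corresponds to an RTF of 10.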
Real-World Usage
I send a voice message on Telegram. My AI assistant:
- Transcribes it locally (Whisper)
- Understands the request (Claude API)
- Takes action (browser automation, email, calendar)
- Responds in natural speech (Kokoro)
- Sends back a voice message on Telegram
The whole loop takes a few seconds. No audio ever leaves the device for transcription.
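If you want to see where those few seconds go, a per-stage timer is enough. This is a rough sketch with a hypothetical `StageTimer` helper; the three stages below are placeholders standing in for the Whisper, LLM, and Kokoro calls, so the timings it prints mean nothing until you drop real calls in.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collects wall-clock time per named stage, in insertion order."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] = time.perf_counter() - start

    def total(self) -> float:
        return sum(self.seconds.values())

    def report(self) -> str:
        lines = [f"{name:<12}{secs * 1000:8.1f} ms"
                 for name, secs in self.seconds.items()]
        lines.append(f"{'total':<12}{self.total() * 1000:8.1f} ms")
        return "\n".join(lines)


timer = StageTimer()
with timer.stage("transcribe"):
    text = "what's on my calendar"     # placeholder for the local Whisper call
with timer.stage("llm"):
    reply = f"Checking: {text}"        # placeholder for the LLM API call
with timer.stage("synthesize"):
    audio = reply.encode("utf-8")      # placeholder for the local Kokoro call

print(timer.report())
```

In a setup like this one I would expect the LLM API call to be the stage to watch, since it is the only one that depends on the network.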
Privacy Angle
Every word you say is processed on your hardware. Your voice data never hits a cloud server for STT/TTS. The only cloud call is to the LLM API — and even that's optional if you run a local model.
Compare this to Alexa, Siri, or Google Assistant, where voice requests are typically sent to cloud servers for processing.
The Numbers
| Component | Model | Size | Speed / notes |
|---|---|---|---|
| STT | Whisper Small | 461MB | Real-time |
| TTS | Kokoro-82M | ~200MB | Faster than real-time |
| Total GPU memory (shared with CPU) | - | ~1GB | Leaves ~7GB of 8GB free |
| Power | - | - | 15W total system |
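The table's arithmetic is easy to check. This is a back-of-the-envelope sketch using only the figures above; the 8GB total is my assumption, implied by "~1GB used, 7GB free", and the Kokoro size is the approximate value from the table.

```python
# Figures taken from the table; approximate values are marked.
WHISPER_SMALL_MB = 461
KOKORO_MB = 200            # "~200MB", approximate
USED_GB = 1.0              # "~1GB" total, approximate
TOTAL_MEMORY_GB = 8.0      # assumption: implied by ~1GB used + 7GB free
POWER_W = 15

# Model weights alone, converted to GB.
weights_gb = (WHISPER_SMALL_MB + KOKORO_MB) / 1024

# Whatever is left of the ~1GB is runtime overhead (buffers, CUDA context).
overhead_gb = USED_GB - weights_gb

# Memory left for everything else on the box.
free_gb = TOTAL_MEMORY_GB - USED_GB

# Energy if the box runs flat out around the clock.
kwh_per_day = POWER_W * 24 / 1000

print(f"weights:  {weights_gb:.2f} GB")
print(f"overhead: {overhead_gb:.2f} GB")
print(f"free:     {free_gb:.0f} GB")
print(f"energy:   {kwh_per_day:.2f} kWh/day")
```

So the two models account for roughly 0.65GB of weights, the rest of the ~1GB is runtime overhead, and a full day at 15W works out to 0.36 kWh.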
Try It
The hardware is called ClawBox: a Jetson Orin Nano that ships with OpenClaw, Whisper, and Kokoro pre-installed. Plug it in, connect it to Telegram, and start talking.
Or build your own: grab a Jetson Orin Nano, install OpenClaw, and follow the setup guide.
What's your local voice setup? Running Piper, XTTS, or something else? Would love to hear what's working for people.