A test on implementing Voxtral using Bob!
Introduction — What is Voxtral?
Voxtral is a next-generation family of open-source speech-to-text models developed by Mistral AI. It is designed to bridge the gap between high-latency offline transcription and fast but often less accurate real-time systems.
The family features two primary models:
- Voxtral Mini 4B Realtime: A lightweight, 4-billion parameter model purpose-built for live, streaming transcription with ultra-low latency (<500ms).
- Voxtral Mini Transcribe V2: A batch-optimized version that delivers state-of-the-art accuracy, outperforming models like Whisper large-v3 and GPT-4o mini Transcribe in word error rate (WER) and cost efficiency.
Unlike traditional models that process audio in fixed chunks, Voxtral uses a novel streaming architecture and a custom causal audio encoder. This allows it to transcribe audio as it arrives, making it one of the first open-source solutions to match offline accuracy in a real-time environment.
What Can it Be Used For?
The “Voxtral Test” explores the model’s versatility across several high-impact domains:
- Real-Time Voice Agents: Powering conversational AI and virtual assistants with sub-200ms latency, enabling natural-feeling voice interfaces.
- Live Subtitling & Broadcasting: Generating near-instant multilingual subtitles for media, broadcasts, and conferences.
- Meeting Intelligence: Transcribing long-form recordings (up to 3 hours) with speaker diarization to clearly identify who is speaking and when.
- Contact Center Automation: Analyzing customer calls in real time to suggest responses, track sentiment, and automate CRM documentation.
- Edge & Private Deployments: Because it is released under the Apache 2.0 license and runs efficiently on single GPUs (requiring ~16GB VRAM), it is ideal for privacy-first applications where data cannot leave the local environment.
- Domain-Specific Accuracy: Using context biasing, developers can provide the model with technical terms or proper nouns (up to 100 phrases) to ensure precise spelling in specialized industries like legal or medical fields; a sketch follows this list.
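To make that last point concrete: the proxy server shown later in this post forwards session.update messages to the backend untouched, so biasing phrases could travel inside that payload. The sketch below is purely illustrative; the biasing_phrases field name is hypothetical and not confirmed against Mistral's realtime API schema, so check the official documentation before relying on it.
import json

# Hypothetical session.update payload carrying context-biasing phrases.
# NOTE: the "biasing_phrases" key is an illustrative placeholder, not a
# confirmed field of Mistral's realtime API; consult the official docs.
session_update = {
    "type": "session.update",
    "session": {
        "biasing_phrases": [  # the model accepts up to 100 phrases
            "tachycardia",
            "habeas corpus",
            "Voxtral-Mini-4B-Realtime-2602",
        ],
    },
}

# The message would be sent as a text frame over the realtime WebSocket:
print(json.dumps(session_update))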
As usual, my go-to partner Bob helped me craft a user-friendly application to showcase exactly what Voxtral can do.
The Vision: Real-Time Transcription for Everyone
The primary goal of this implementation is to provide a high-performance, user-friendly interface for testing Mistral AI’s Voxtral-Mini-4B-Realtime-2602. By leveraging a modern web-based UI, users can capture live audio directly from their microphones and witness state-of-the-art speech-to-text transcription with ultra-low latency — typically under 500ms. The project bridges the gap between complex AI model hosting and the end-user, offering a dashboard that monitors everything from audio waveforms to real-time latency statistics.
Getting Started: From Clone to Capture

The “Quick Start” workflow is designed for rapid deployment, guiding developers through a simple five-step process. After setting up the environment and installing dependencies, the core of the GPU-enabled experience relies on running a vLLM server. This server hosts the Voxtral model using a specialized container image optimized for NVIDIA GPUs. Once the backend is active, a Python-based proxy server establishes the bridge to the web browser, allowing users to simply open an HTML file and start transcribing immediately.
# server.py
#!/usr/bin/env python3
"""
Voxtral Realtime Transcription Server
Connects to vLLM server running Voxtral-Mini-4B-Realtime-2602
"""
import argparse
import asyncio
import base64
import json
import logging
import time
from typing import Optional

import numpy as np
import websockets
from websockets.server import WebSocketServerProtocol

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class VoxtralServer:
    def __init__(self, vllm_host: str = "127.0.0.1", vllm_port: int = 8000):
        self.vllm_host = vllm_host
        self.vllm_port = vllm_port
        self.vllm_url = f"ws://{vllm_host}:{vllm_port}/v1/realtime"
        self.sessions = {}

    async def handle_client(self, websocket: WebSocketServerProtocol, path: str):
        """Handle incoming client WebSocket connection"""
        client_id = id(websocket)
        logger.info(f"Client {client_id} connected from {websocket.remote_address}")
        vllm_ws = None
        try:
            # Connect to vLLM server
            logger.info(f"Connecting to vLLM server at {self.vllm_url}")
            vllm_ws = await websockets.connect(self.vllm_url)
            logger.info(f"Connected to vLLM server for client {client_id}")

            # Wait for session.created from vLLM
            session_response = await vllm_ws.recv()
            session_data = json.loads(session_response)
            if session_data.get('type') == 'session.created':
                logger.info(f"Session created: {session_data.get('id')}")

            # Forward session.created to client
            await websocket.send(session_response)

            # Create tasks for bidirectional communication
            client_to_vllm = asyncio.create_task(
                self.forward_client_to_vllm(websocket, vllm_ws, client_id)
            )
            vllm_to_client = asyncio.create_task(
                self.forward_vllm_to_client(vllm_ws, websocket, client_id)
            )

            # Wait for either task to complete
            done, pending = await asyncio.wait(
                [client_to_vllm, vllm_to_client],
                return_when=asyncio.FIRST_COMPLETED
            )

            # Cancel pending tasks
            for task in pending:
                task.cancel()
                try:
                    await task
                except asyncio.CancelledError:
                    pass
        except websockets.exceptions.WebSocketException as e:
            logger.error(f"WebSocket error for client {client_id}: {e}")
            try:
                await websocket.send(json.dumps({
                    'type': 'error',
                    'error': f'Connection error: {str(e)}'
                }))
            except:
                pass
        except Exception as e:
            logger.error(f"Error handling client {client_id}: {e}", exc_info=True)
            try:
                await websocket.send(json.dumps({
                    'type': 'error',
                    'error': f'Server error: {str(e)}'
                }))
            except:
                pass
        finally:
            if vllm_ws:
                await vllm_ws.close()
            logger.info(f"Client {client_id} disconnected")

    async def forward_client_to_vllm(
        self,
        client_ws: WebSocketServerProtocol,
        vllm_ws,
        client_id: int
    ):
        """Forward messages from client to vLLM server"""
        try:
            async for message in client_ws:
                try:
                    data = json.loads(message)
                    msg_type = data.get('type', 'unknown')
                    logger.debug(f"Client {client_id} -> vLLM: {msg_type}")

                    # Forward to vLLM
                    await vllm_ws.send(message)

                    # Log audio chunks
                    if msg_type == 'input_audio_buffer.append':
                        audio_len = len(data.get('audio', ''))
                        logger.debug(f"Forwarded audio chunk ({audio_len} bytes)")
                except json.JSONDecodeError as e:
                    logger.error(f"Invalid JSON from client {client_id}: {e}")
                except Exception as e:
                    logger.error(f"Error forwarding client message: {e}")
        except websockets.exceptions.ConnectionClosed:
            logger.info(f"Client {client_id} connection closed")
        except Exception as e:
            logger.error(f"Error in client->vLLM forwarding: {e}")

    async def forward_vllm_to_client(
        self,
        vllm_ws,
        client_ws: WebSocketServerProtocol,
        client_id: int
    ):
        """Forward messages from vLLM server to client"""
        try:
            async for message in vllm_ws:
                try:
                    data = json.loads(message)
                    msg_type = data.get('type', 'unknown')
                    logger.debug(f"vLLM -> Client {client_id}: {msg_type}")

                    # Forward to client
                    await client_ws.send(message)

                    # Log transcription results
                    if msg_type == 'transcription.delta':
                        delta = data.get('delta', '')
                        logger.info(f"Transcription delta: {delta}")
                    elif msg_type == 'transcription.done':
                        text = data.get('text', '')
                        logger.info(f"Transcription complete: {text}")
                except json.JSONDecodeError as e:
                    logger.error(f"Invalid JSON from vLLM: {e}")
                except Exception as e:
                    logger.error(f"Error forwarding vLLM message: {e}")
        except websockets.exceptions.ConnectionClosed:
            logger.info(f"vLLM connection closed for client {client_id}")
        except Exception as e:
            logger.error(f"Error in vLLM->client forwarding: {e}")

    async def start(self, host: str = "0.0.0.0", port: int = 8080):
        """Start the WebSocket server"""
        logger.info(f"Starting Voxtral proxy server on {host}:{port}")
        logger.info(f"Proxying to vLLM server at {self.vllm_url}")
        async with websockets.serve(self.handle_client, host, port):
            logger.info(f"Server ready and listening on ws://{host}:{port}")
            await asyncio.Future()  # Run forever


def main():
    parser = argparse.ArgumentParser(
        description="Voxtral Realtime Transcription Proxy Server"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="0.0.0.0",
        help="Host to bind the proxy server (default: 0.0.0.0)"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8080,
        help="Port to bind the proxy server (default: 8080)"
    )
    parser.add_argument(
        "--vllm-host",
        type=str,
        default="127.0.0.1",
        help="vLLM server host (default: 127.0.0.1)"
    )
    parser.add_argument(
        "--vllm-port",
        type=int,
        default=8000,
        help="vLLM server port (default: 8000)"
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Enable debug logging"
    )
    args = parser.parse_args()

    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    server = VoxtralServer(vllm_host=args.vllm_host, vllm_port=args.vllm_port)
    try:
        asyncio.run(server.start(host=args.host, port=args.port))
    except KeyboardInterrupt:
        logger.info("Server stopped by user")
    except Exception as e:
        logger.error(f"Server error: {e}", exc_info=True)


if __name__ == "__main__":
    main()

# Made with Bob
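To sanity-check either backend without opening the browser UI, a small Python client can replay a WAV file over the same WebSocket protocol. The sketch below is not part of the repository; it mirrors the message types handled by the servers in this post (the Vosk server's commit/append flow in particular, which the proxy forwards verbatim to vLLM), and it assumes a 16 kHz mono 16-bit PCM file named test.wav.
# test_client.py (illustrative sketch, not part of the repository)
import asyncio
import base64
import json
import wave

import websockets

async def transcribe_file(path: str = "test.wav"):
    async with websockets.connect("ws://localhost:8080") as ws:
        print(json.loads(await ws.recv()))  # session.created greeting

        # Open the recording turn (the Vosk server gates buffering on this)
        await ws.send(json.dumps({'type': 'input_audio_buffer.commit'}))

        # Stream the raw PCM payload in small base64-encoded chunks
        with wave.open(path, 'rb') as wav:
            while chunk := wav.readframes(4096):
                await ws.send(json.dumps({
                    'type': 'input_audio_buffer.append',
                    'audio': base64.b64encode(chunk).decode('ascii'),
                }))

        # Close the turn, then drain deltas until the final result arrives
        await ws.send(json.dumps({'type': 'input_audio_buffer.commit', 'final': True}))
        async for message in ws:
            event = json.loads(message)
            if event.get('type') == 'transcription.delta':
                print(event.get('delta', ''), end='', flush=True)
            elif event.get('type') == 'transcription.done':
                print('\n=>', event.get('text', ''))
                break

asyncio.run(transcribe_file())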
Inclusivity: The CPU-Only Alternative

Recognizing that not every tester has access to an NVIDIA GPU with 16GB of VRAM, the project includes a dedicated CPU-Only setup. This path utilizes Vosk, an offline-compatible speech recognition engine that runs efficiently on standard consumer hardware. While Voxtral on a GPU offers the highest accuracy and lowest latency, the Vosk-based CPU alternative still provides impressive real-time performance with 85–95% accuracy. This ensures that the user interface and workflow can be evaluated on virtually any modern machine, from laptops to basic servers.
# server_vosk.py
#!/usr/bin/env python3
"""
CPU-Compatible Realtime Transcription Server using Vosk
Works on CPU without GPU requirements
"""
import argparse
import asyncio
import base64
import json
import logging
import os
import struct
from typing import Optional

import websockets
from websockets.server import WebSocketServerProtocol

try:
    from vosk import Model, KaldiRecognizer
    VOSK_AVAILABLE = True
except ImportError:
    VOSK_AVAILABLE = False
    print("⚠️ Vosk not installed. Install with: pip install vosk")
    print("⚠️ Download a model from: https://alphacephei.com/vosk/models")

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class VoskTranscriptionServer:
    def __init__(self, model_path: str):
        if not VOSK_AVAILABLE:
            raise ImportError("Vosk is not installed. Run: pip install vosk")
        if not os.path.exists(model_path):
            raise FileNotFoundError(
                f"Model not found at {model_path}\n"
                f"Download from: https://alphacephei.com/vosk/models\n"
                f"Example: vosk-model-small-en-us-0.15"
            )
        logger.info(f"Loading Vosk model from {model_path}")
        self.model = Model(model_path)
        logger.info("Model loaded successfully")

    async def handle_client(self, websocket: WebSocketServerProtocol, path: str):
        """Handle incoming client WebSocket connection"""
        client_id = id(websocket)
        logger.info(f"Client {client_id} connected from {websocket.remote_address}")

        # Create recognizer for this session
        recognizer = KaldiRecognizer(self.model, 16000)
        recognizer.SetWords(True)
        session_id = f"vosk-session-{client_id}"

        try:
            # Send session.created
            await websocket.send(json.dumps({
                'type': 'session.created',
                'id': session_id,
                'model': 'vosk-cpu',
                'object': 'realtime.session'
            }))
            logger.info(f"Session created: {session_id}")

            audio_buffer = bytearray()
            is_recording = False

            async for message in websocket:
                try:
                    data = json.loads(message)
                    msg_type = data.get('type', 'unknown')
                    logger.debug(f"Client {client_id}: {msg_type}")

                    if msg_type == 'session.update':
                        # Acknowledge session update
                        await websocket.send(json.dumps({
                            'type': 'session.updated',
                            'session': {
                                'id': session_id,
                                'model': 'vosk-cpu'
                            }
                        }))
                    elif msg_type == 'input_audio_buffer.commit':
                        is_recording = True
                        if data.get('final', False):
                            # Process final audio
                            if recognizer.AcceptWaveform(bytes(audio_buffer)):
                                result = json.loads(recognizer.Result())
                                text = result.get('text', '')
                                if text:
                                    await websocket.send(json.dumps({
                                        'type': 'transcription.done',
                                        'text': text,
                                        'object': 'realtime.transcription'
                                    }))
                                    logger.info(f"Final transcription: {text}")
                            # Get any remaining text
                            final_result = json.loads(recognizer.FinalResult())
                            final_text = final_result.get('text', '')
                            if final_text:
                                await websocket.send(json.dumps({
                                    'type': 'transcription.done',
                                    'text': final_text,
                                    'object': 'realtime.transcription'
                                }))
                            # Reset for next session
                            audio_buffer.clear()
                            recognizer = KaldiRecognizer(self.model, 16000)
                            recognizer.SetWords(True)
                            is_recording = False
                    elif msg_type == 'input_audio_buffer.append':
                        if is_recording:
                            # Decode base64 audio
                            audio_b64 = data.get('audio', '')
                            audio_bytes = base64.b64decode(audio_b64)
                            # Add to buffer
                            audio_buffer.extend(audio_bytes)

                            # Process in chunks
                            if len(audio_buffer) >= 8192:  # Process every 8KB
                                chunk = bytes(audio_buffer[:8192])
                                audio_buffer = audio_buffer[8192:]
                                if recognizer.AcceptWaveform(chunk):
                                    result = json.loads(recognizer.Result())
                                    text = result.get('text', '')
                                    if text:
                                        # Send partial result
                                        await websocket.send(json.dumps({
                                            'type': 'transcription.delta',
                                            'delta': text + ' ',
                                            'object': 'realtime.transcription.delta'
                                        }))
                                        logger.info(f"Partial transcription: {text}")
                                else:
                                    # Send partial result
                                    partial = json.loads(recognizer.PartialResult())
                                    partial_text = partial.get('partial', '')
                                    if partial_text:
                                        await websocket.send(json.dumps({
                                            'type': 'transcription.delta',
                                            'delta': partial_text,
                                            'object': 'realtime.transcription.delta'
                                        }))
                except json.JSONDecodeError as e:
                    logger.error(f"Invalid JSON from client {client_id}: {e}")
                    await websocket.send(json.dumps({
                        'type': 'error',
                        'error': f'Invalid JSON: {str(e)}'
                    }))
                except Exception as e:
                    logger.error(f"Error processing message: {e}", exc_info=True)
                    await websocket.send(json.dumps({
                        'type': 'error',
                        'error': f'Processing error: {str(e)}'
                    }))
        except websockets.exceptions.ConnectionClosed:
            logger.info(f"Client {client_id} connection closed")
        except Exception as e:
            logger.error(f"Error handling client {client_id}: {e}", exc_info=True)
        finally:
            logger.info(f"Client {client_id} disconnected")

    async def start(self, host: str = "0.0.0.0", port: int = 8080):
        """Start the WebSocket server"""
        logger.info(f"Starting Vosk transcription server on {host}:{port}")
        logger.info("This server runs on CPU - no GPU required!")
        async with websockets.serve(self.handle_client, host, port):
            logger.info(f"Server ready and listening on ws://{host}:{port}")
            await asyncio.Future()  # Run forever


def main():
    parser = argparse.ArgumentParser(
        description="CPU-Compatible Realtime Transcription Server using Vosk"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="0.0.0.0",
        help="Host to bind the server (default: 0.0.0.0)"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8080,
        help="Port to bind the server (default: 8080)"
    )
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to Vosk model directory (e.g., vosk-model-small-en-us-0.15)"
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Enable debug logging"
    )
    args = parser.parse_args()

    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    if not VOSK_AVAILABLE:
        print("\n❌ Vosk is not installed!")
        print("\nInstall with:")
        print("  pip install vosk")
        print("\nDownload a model from:")
        print("  https://alphacephei.com/vosk/models")
        print("\nExample models:")
        print("  - vosk-model-small-en-us-0.15 (40MB, fast)")
        print("  - vosk-model-en-us-0.22 (1.8GB, accurate)")
        return 1

    try:
        server = VoskTranscriptionServer(model_path=args.model)
        asyncio.run(server.start(host=args.host, port=args.port))
    except KeyboardInterrupt:
        logger.info("Server stopped by user")
    except Exception as e:
        logger.error(f"Server error: {e}", exc_info=True)
        return 1
    return 0


if __name__ == "__main__":
    exit(main())

# Made with Bob
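Before wiring Vosk into the WebSocket flow, it can be worth verifying the downloaded model on its own. The short sketch below uses only the recognizer calls already seen above (Model, KaldiRecognizer, AcceptWaveform, FinalResult); the model directory and the audio.wav file name are placeholders, and the file is expected to be 16 kHz mono 16-bit PCM.
# vosk_offline_check.py (standalone sanity check, sketch only)
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")  # path to the unpacked model directory

with wave.open("audio.wav", "rb") as wav:  # expects 16 kHz mono 16-bit PCM
    recognizer = KaldiRecognizer(model, wav.getframerate())
    recognizer.SetWords(True)
    while chunk := wav.readframes(4000):
        recognizer.AcceptWaveform(chunk)  # feed audio; partial results ignored here
    print(json.loads(recognizer.FinalResult()).get("text", ""))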
Under the Hood: A Multi-Layered Architecture
The system follows a robust, multi-layered architecture designed for efficiency and scalability.
- The Client Layer: A browser-based UI that captures raw audio via the Web Audio API and visualizes it in real-time.
- The Application Layer: A Python proxy server that manages WebSocket connections and routes data.
- The Server Layer: Where the “brain” resides — either a vLLM server running the 4-billion parameter Voxtral model on a GPU, or a Vosk server processing audio on a CPU. This decoupled design allows for horizontal scaling, where a single load balancer can distribute traffic across multiple GPU or CPU instances to support many concurrent users; the example invocations after this list show how.
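Because every backend detail is a command-line flag, adding capacity is mostly a matter of launching more instances and registering them with the load balancer. The invocations below are illustrative only (hosts and ports are made up), but each flag comes straight from the argparse definitions in the two server scripts.
# GPU path: one proxy per vLLM instance
python server.py --vllm-host 10.0.0.5 --vllm-port 8000 --port 8081
python server.py --vllm-host 10.0.0.6 --vllm-port 8000 --port 8082

# CPU path: a Vosk instance behind the same load balancer
python server_vosk.py --model vosk-model-small-en-us-0.15 --port 8083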
Conclusion
In conclusion, this project provides a functional and accessible entry point for exploring the cutting edge of real-time speech-to-text technology. By implementing a dual-mode architecture, the application successfully bridges the gap between high-end hardware requirements and common consumer setups, allowing users to experience Mistral AI’s Voxtral on GPUs or the efficient Vosk engine on CPUs. From the live audio visualization in the browser to the multi-layered backend that handles high-speed WebSocket streaming, the core infrastructure for sub-500ms transcription is now firmly in place.
Disclaimer & Status: Please note that this application is currently in Alpha development. It is intended solely as a rudimentary testing tool and a proof-of-concept for real-time transcription workflows. Users should expect potential bugs and stability issues, as it is not yet optimized for production environments. We have a roadmap of future enhancements planned — including multi-user support, speaker diarization, and automated punctuation — to transform this testbed into a more robust solution. Your feedback is invaluable during this early stage; comments, bug reports, and contributions are highly welcome as we continue to refine the experience.
>>> Thanks for reading <<<
Links
- Voxtral gist (This guide covers running and trying out the Red Hat AI Inference Server to serve the Mistral Voxtral-Mini-4B-Realtime-2602 model, powered by vLLM.): https://gist.github.com/dougbtv/f27d05fd6de68e07be3651c453bf37c7
- Voxtral Hugging Face page: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
- Voxtral page on Mistral.ai site: https://mistral.ai/news/voxtral-transcribe-2
- Code repository for the application: https://github.com/aairom/voxtral-test
- IBM Project Bob: https://www.ibm.com/products/bob