Abhishek Mishra

Posted on • Originally published at smallest.ai

Build Voice AI in Python: Complete Speech-to-Text Developer Guide (2026)

TL;DR – Quick Integration Overview

API Platform: Pulse STT by Smallest AI – a state-of-the-art speech-to-text API supporting real-time streaming and batch audio transcription.

Key Features:

  • Transcribes in 32+ languages with automatic language detection
  • Ultra-low latency: ~64ms time-to-first-transcript for streaming
  • Rich metadata: word timestamps, speaker diarization, emotion detection, age/gender estimation, PII redaction

Integration Methods:

  • Pre-Recorded Audio: POST https://waves-api.smallest.ai/api/v1/pulse/get_text – upload files for batch processing
  • Real-Time Streaming: wss://waves-api.smallest.ai/api/v1/pulse/get_text – WebSocket for live transcription

Developer Experience: Use any HTTP/WebSocket client or official SDKs (Python, Node.js). Authentication via a single API key.

Why Pulse STT? Compared to other providers, Pulse offers faster response (64ms vs 200-500ms for typical cloud STT) and all-in-one features (no need for separate services for speaker ID, sentiment, or PII masking).


Introduction: Why Voice Integration Matters

Voice is becoming the next frontier for user interaction. From virtual assistants and voice bots to real-time transcription in meetings, speech interfaces are making software more accessible and user-friendly. Developers today have access to Automatic Speech Recognition (ASR) APIs that convert voice to text, opening up possibilities for hands-free control, live captions, voice search, and more.

However, integrating voice AI is more than just getting raw text from audio. Modern use cases demand speed and accuracy – a voice assistant needs to transcribe commands almost instantly, and a call center analytics tool might need not just the transcript but also who spoke when and how they said it.

Latency is critical. A delay of even a second feels laggy in conversation. Traditional cloud speech APIs often have 500–1200ms latency for live transcription, with better ones hovering around 200–250ms. This has pushed the industry toward ultra-low latency – under 300ms – to enable seamless real-time interactions.

In this guide, we'll walk through how to integrate an AI voice & speech API that meets these modern demands using Smallest AI's Pulse STT. By the end, you'll know how to:

  1. Transcribe audio files (WAV/MP3) to text using a simple HTTP API
  2. Stream live audio for instantaneous transcripts via WebSockets
  3. Leverage advanced features like timestamps, speaker diarization, and emotion detection
  4. Use both Python and Node.js to integrate voice capabilities

Understanding Pulse STT

Pulse is the speech-to-text (ASR, automatic speech recognition) model from Smallest AI's "Waves" platform. It's designed for fast, accurate, and rich transcription with industry-leading latency – around 64 milliseconds time-to-first-transcript (TTFT) for streaming audio, an order of magnitude faster than many alternatives.

Highlight Features

Feature | Description
--- | ---
Real-Time & Batch Modes | Stream live audio via WebSocket or upload files via HTTP POST
32+ Languages | English, Spanish, Hindi, French, German, Arabic, Japanese, and more, with auto-detection
Word/Sentence Timestamps | Know exactly when each word was spoken (great for subtitles)
Speaker Diarization | Differentiate speakers: "Speaker A said X, Speaker B said Y"
Emotion Detection | Tag segments with emotions: happy, angry, neutral, etc.
Age/Gender Estimation | Infer speaker demographics for analytics
PII/PCI Redaction | Automatically mask credit cards, SSNs, and personal info
64ms Latency | Time-to-first-transcript in streaming mode

Getting Started: Authentication

Step 1: Get Your API Key

Sign up on the Smallest AI Console and generate an API key. This key authenticates all your requests.

Step 2: Test Your Key

curl -H "Authorization: Bearer $SMALLEST_API_KEY" \
  https://waves-api.smallest.ai/api/v1/lightning-v3.1/get_voices

Authentication Header

All requests require this header:

Authorization: Bearer <YOUR_API_KEY>

Part 1: Transcribing Audio Files (REST API)

The Pre-Recorded API is perfect for batch processing voicemails, podcasts, meeting recordings, or any existing audio files.

Endpoint

POST https://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

Parameter | Type | Description
--- | --- | ---
model | string | Model identifier: pulse (required)
language | string | ISO code (en, es, hi) or multi for auto-detect
word_timestamps | boolean | Include word-level timing data
diarize | boolean | Enable speaker diarization
emotion_detection | boolean | Detect speaker emotions
age_detection | boolean | Estimate speaker age group
gender_detection | boolean | Estimate speaker gender

Supported Languages (32+)

Italian, Spanish, English, Portuguese, Hindi, German, French, Ukrainian, Russian, Kannada, Malayalam, Polish, Marathi, Gujarati, Czech, Slovak, Telugu, Odia, Dutch, Bengali, Latvian, Estonian, Romanian, Punjabi, Finnish, Swedish, Bulgarian, Tamil, Hungarian, Danish, Lithuanian, Maltese, and auto-detection (multi).

cURL Example

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true&word_timestamps=true&emotion_detection=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"

Python Example

import os
import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
audio_file = "meeting_recording.wav"

url = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
params = {
    "model": "pulse",
    "language": "en",
    "word_timestamps": "true",
    "diarize": "true",
    "emotion_detection": "true"
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "audio/wav"
}

with open(audio_file, "rb") as f:
    audio_data = f.read()

response = requests.post(url, params=params, headers=headers, data=audio_data)
result = response.json()

# Print transcription
print("Transcription:", result.get("transcription"))

# Print word-level details with speaker info
for word in result.get("words", []):
    speaker = word.get("speaker", "N/A")
    print(f"  [Speaker {speaker}] [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")

# Check emotions
if "emotions" in result:
    print("\nEmotions detected:")
    for emotion, score in result["emotions"].items():
        if score > 0.1:
            print(f"  {emotion}: {score:.1%}")

[Screenshot: Pulse STT Python transcription demo]

Node.js Example

const fs = require('fs');
const axios = require('axios');

const API_KEY = process.env.SMALLEST_API_KEY;
const audioFile = 'meeting_recording.wav';

const url = 'https://waves-api.smallest.ai/api/v1/pulse/get_text';
const params = new URLSearchParams({
  model: 'pulse',
  language: 'en',
  word_timestamps: 'true',
  diarize: 'true',
  emotion_detection: 'true'
});

const audioData = fs.readFileSync(audioFile);

axios.post(`${url}?${params}`, audioData, {
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'audio/wav'
  }
})
.then(res => {
  console.log('Transcription:', res.data.transcription);

  // Print words with speaker info
  res.data.words?.forEach(word => {
    console.log(`  [Speaker ${word.speaker}] [${word.start}s - ${word.end}s] ${word.word}`);
  });
})
.catch(err => {
  console.error('Error:', err.response?.data || err.message);
});

Example Response

{
  "status": "success",
  "transcription": "Hello, this is a test transcription.",
  "words": [
    {"start": 0.0, "end": 0.88, "word": "Hello,", "confidence": 0.82, "speaker": 0, "speaker_confidence": 0.61},
    {"start": 0.88, "end": 1.04, "word": "this", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.76},
    {"start": 1.04, "end": 1.20, "word": "is", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.20, "end": 1.36, "word": "a", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.36, "end": 1.68, "word": "test", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.68, "end": 2.16, "word": "transcription.", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99}
  ],
  "utterances": [
    {"start": 0.0, "end": 2.16, "text": "Hello, this is a test transcription.", "speaker": 0}
  ],
  "age": "adult",
  "gender": "female",
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  },
  "metadata": {
    "duration": 1.97,
    "fileSize": 63236
  }
}

Part 2: Real-Time Streaming (WebSocket API)

For live audio – voice assistants, live captioning, call center analytics – use the WebSocket API for sub-second latency with partial results as audio streams in.

WebSocket Endpoint

wss://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
language | string | en | Language code or multi for auto-detect
encoding | string | linear16 | Audio format: linear16, linear32, alaw, mulaw, opus
sample_rate | string | 16000 | Sample rate: 8000, 16000, 22050, 24000, 44100, 48000
word_timestamps | string | true | Include word-level timestamps
full_transcript | string | false | Include cumulative transcript
sentence_timestamps | string | false | Include sentence-level timestamps
redact_pii | string | false | Redact personal information
redact_pci | string | false | Redact payment card information
diarize | string | false | Enable speaker diarization

Python Streaming Example

From the official cookbook:

import asyncio
import json
import os
import sys
import numpy as np
import websockets
import librosa
from urllib.parse import urlencode

WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"

# Configurable features
LANGUAGE = "en"
ENCODING = "linear16"
SAMPLE_RATE = 16000
WORD_TIMESTAMPS = False
FULL_TRANSCRIPT = True
SENTENCE_TIMESTAMPS = False
DIARIZE = False
REDACT_PII = False
REDACT_PCI = False

async def transcribe(audio_file: str, api_key: str):
    params = {
        "language": LANGUAGE,
        "encoding": ENCODING,
        "sample_rate": SAMPLE_RATE,
        "word_timestamps": str(WORD_TIMESTAMPS).lower(),
        "full_transcript": str(FULL_TRANSCRIPT).lower(),
        "sentence_timestamps": str(SENTENCE_TIMESTAMPS).lower(),
        "diarize": str(DIARIZE).lower(),
        "redact_pii": str(REDACT_PII).lower(),
        "redact_pci": str(REDACT_PCI).lower(),
    }

    url = f"{WS_URL}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {api_key}"}

    # Load audio with librosa (handles any format)
    audio, _ = librosa.load(audio_file, sr=SAMPLE_RATE, mono=True)
    chunk_duration = 0.1  # 100ms chunks
    chunk_size = int(chunk_duration * SAMPLE_RATE)

    async with websockets.connect(url, additional_headers=headers) as ws:
        print("✅ Connected to Pulse STT WebSocket")

        async def send_audio():
            for i in range(0, len(audio), chunk_size):
                chunk = audio[i:i + chunk_size]
                pcm16 = (chunk * 32768.0).astype(np.int16).tobytes()
                await ws.send(pcm16)
                await asyncio.sleep(chunk_duration)
            await ws.send(json.dumps({"type": "end"}))
            print("📤 Sent end signal")

        async def receive_responses():
            async for message in ws:
                result = json.loads(message)

                if result.get("is_final"):
                    print(f"{result.get('transcript')}")

                    if result.get("is_last"):
                        if result.get("full_transcript"):
                            print(f"\n{'='*60}")
                            print("FULL TRANSCRIPT")
                            print(f"{'='*60}")
                            print(result.get("full_transcript"))
                        break

        await asyncio.gather(send_audio(), receive_responses())

# Usage
if __name__ == "__main__":
    api_key = os.environ.get("SMALLEST_API_KEY")
    audio_path = sys.argv[1] if len(sys.argv) > 1 else "recording.wav"
    asyncio.run(transcribe(audio_path, api_key))

Install dependencies:

pip install websockets librosa numpy

Run:

export SMALLEST_API_KEY="your-api-key"
python transcribe.py recording.wav

Node.js Streaming Example

From the official cookbook:

const fs = require("fs");
const WebSocket = require("ws");
const wav = require("wav");

const WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text";

// Configurable features
const LANGUAGE = "en";
const ENCODING = "linear16";
const SAMPLE_RATE = 16000;
const WORD_TIMESTAMPS = false;
const FULL_TRANSCRIPT = true;
const DIARIZE = false;
const REDACT_PII = false;
const REDACT_PCI = false;

async function loadAudio(audioFile) {
  return new Promise((resolve, reject) => {
    const reader = new wav.Reader();
    const chunks = [];

    reader.on("format", (format) => {
      reader.on("data", (chunk) => chunks.push(chunk));
      reader.on("end", () => {
        const buffer = Buffer.concat(chunks);
        const samples = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
        resolve(samples);
      });
    });

    reader.on("error", reject);
    fs.createReadStream(audioFile).pipe(reader);
  });
}

async function transcribe(audioFile, apiKey) {
  const params = new URLSearchParams({
    language: LANGUAGE,
    encoding: ENCODING,
    sample_rate: SAMPLE_RATE,
    word_timestamps: WORD_TIMESTAMPS,
    full_transcript: FULL_TRANSCRIPT,
    diarize: DIARIZE,
    redact_pii: REDACT_PII,
    redact_pci: REDACT_PCI,
  });

  const url = `${WS_URL}?${params}`;
  const audio = await loadAudio(audioFile);
  const chunkDuration = 0.1; // 100ms
  const chunkSize = Math.floor(chunkDuration * SAMPLE_RATE);

  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });

    ws.on("open", async () => {
      console.log("✅ Connected to Pulse STT WebSocket");

      for (let i = 0; i < audio.length; i += chunkSize) {
        const chunk = audio.slice(i, i + chunkSize);
        ws.send(Buffer.from(chunk.buffer, chunk.byteOffset, chunk.byteLength));
        await new Promise((r) => setTimeout(r, chunkDuration * 1000));
      }
      ws.send(JSON.stringify({ type: "end" }));
      console.log("📤 Sent end signal");
    });

    ws.on("message", (data) => {
      const result = JSON.parse(data.toString());

      if (result.is_final) {
        console.log(`✓ ${result.transcript}`);

        if (result.is_last) {
          if (result.full_transcript) {
            console.log("\n" + "=".repeat(60));
            console.log("FULL TRANSCRIPT");
            console.log("=".repeat(60));
            console.log(result.full_transcript);
          }
          ws.close();
        }
      }
    });

    ws.on("close", resolve);
    ws.on("error", reject);
  });
}

// Usage
const apiKey = process.env.SMALLEST_API_KEY;
const audioFile = process.argv[2] || "recording.wav";
transcribe(audioFile, apiKey).then(() => console.log("Done!"));

Install dependencies:

npm install ws wav

Run:

export SMALLEST_API_KEY="your-api-key"
node transcribe.js recording.wav

WebSocket Response Format

{
  "session_id": "sess_12345abcde",
  "transcript": "Hello, how are you?",
  "full_transcript": "Hello, how are you?",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {"word": "Hello,", "start": 0.0, "end": 0.5, "confidence": 0.98, "speaker": 0},
    {"word": "how", "start": 0.5, "end": 0.7, "confidence": 0.99, "speaker": 0},
    {"word": "are", "start": 0.7, "end": 0.9, "confidence": 0.97, "speaker": 0},
    {"word": "you?", "start": 0.9, "end": 1.2, "confidence": 0.99, "speaker": 0}
  ]
}

Key Response Fields

Field | Description
--- | ---
is_final | false = partial/interim transcript; true = finalized segment
is_last | true when the entire session is complete
transcript | Current segment text
full_transcript | Accumulated text from the entire session (if enabled)
words | Word-level timestamps (if enabled)
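
How a client might consume these fields in practice: below is a minimal Python sketch (assuming the message shape above) that shows interim hypotheses in place and only commits a segment once is_final arrives.

import json

def handle_message(message: str, committed: list) -> None:
    """Render streaming results: interim text is redrawn in place,
    final segments are appended to the committed transcript."""
    result = json.loads(message)

    if result.get("is_final"):
        committed.append(result.get("transcript", ""))
        print(f"\r✓ {committed[-1]}")  # finalize the segment on its own line
    else:
        # Interim hypothesis: overwrite the current line, don't store it
        print(f"\r… {result.get('transcript', '')}", end="", flush=True)

    if result.get("is_last"):
        print("\nSession complete:", " ".join(committed))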

[Screenshot: Pulse STT Node.js streaming WebSocket demo]


Part 3: Advanced Features

Speaker Diarization

Enable diarize=true to identify different speakers:

params = {"model": "pulse", "language": "en", "diarize": "true"}

Response includes speaker labels:

{
  "words": [
    {"word": "Hello", "speaker": 0, "speaker_confidence": 0.95},
    {"word": "Hi", "speaker": 1, "speaker_confidence": 0.92}
  ],
  "utterances": [
    {"text": "Hello, how can I help?", "speaker": 0},
    {"text": "I have a question.", "speaker": 1}
  ]
}
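
To turn that payload into a readable dialogue, here is a small Python sketch (assuming the utterances field shown above) that prints one line per speaker turn:

def format_dialogue(result: dict) -> str:
    """Build a "Speaker N: text" transcript from diarized utterances."""
    lines = []
    for utt in result.get("utterances", []):
        lines.append(f"Speaker {utt.get('speaker', '?')}: {utt['text']}")
    return "\n".join(lines)

# With the response above this prints:
# Speaker 0: Hello, how can I help?
# Speaker 1: I have a question.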

Emotion Detection

Enable emotion_detection=true to analyze speaker sentiment:

{
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  }
}
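
The scores are per-emotion values between 0 and 1. A quick way to act on them, sketched below, is to pick the dominant emotion and only surface it above a threshold (the 0.2 cutoff is an arbitrary example, not an API default):

def dominant_emotion(result: dict, threshold: float = 0.2):
    """Return the highest-scoring emotion if it clears the threshold, else None."""
    emotions = result.get("emotions", {})
    if not emotions:
        return None
    label, score = max(emotions.items(), key=lambda item: item[1])
    return label if score >= threshold else None

# dominant_emotion({"emotions": {"happiness": 0.28, "anger": 0.0}})  ->  "happiness"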

PII/PCI Redaction

For compliance (HIPAA, PCI-DSS), enable redact_pii=true or redact_pci=true:

{
  "transcript": "My credit card is [CREDITCARD_1] and SSN is [SSN_1]",
  "redacted_entities": ["[CREDITCARD_1]", "[SSN_1]"]
}
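
On the streaming side, redaction is enabled with the redact_pii and redact_pci query parameters from the WebSocket table in Part 2. A minimal sketch of building such a connection URL (the other parameter values are just example choices):

from urllib.parse import urlencode

WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"

params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": 16000,
    "redact_pii": "true",  # mask SSNs, names, phone numbers, ...
    "redact_pci": "true",  # mask card numbers and other payment data
}
url = f"{WS_URL}?{urlencode(params)}"
# Connect with websockets.connect(url, additional_headers=...) as in Part 2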

Age and Gender Detection

Enable age_detection=true and gender_detection=true:

{
  "age": "adult",
  "gender": "female"
}

Comparing STT Providers

Provider | Latency | Languages | Price (per 1,000 min)
--- | --- | --- | ---
Pulse STT | ~64ms | 32+ | Competitive
Google Cloud STT | 200-300ms | 125+ | ~$16
Deepgram | 100-200ms | 36+ | ~$4-5
AssemblyAI | 200-400ms | 30+ | ~$3.50
OpenAI Whisper | Batch only | 99+ | ~$6

Why Pulse STT stands out:

  • Fastest time-to-first-transcript (64ms)
  • All-in-one features (no separate services needed)
  • Competitive accuracy across diverse accents
  • Built for real-time voice AI applications

Best Practices

Audio Quality

  • Use 16kHz, mono, 16-bit PCM for best results (see the conversion sketch below)
  • WAV or FLAC formats are ideal
  • Minimize background noise when possible
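
If your source audio isn't already in that shape, one way to normalize it is with librosa and soundfile (a sketch, assuming both packages are installed):

import librosa
import soundfile as sf

def to_pcm16_wav(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Resample any input file to 16 kHz mono and write 16-bit PCM WAV."""
    audio, _ = librosa.load(src, sr=sample_rate, mono=True)
    sf.write(dst, audio, sample_rate, subtype="PCM_16")

to_pcm16_wav("meeting_recording.mp3", "meeting_recording.wav")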

Error Handling

import time
import requests

MAX_RETRIES = 3

for retry_count in range(MAX_RETRIES):
    try:
        response = requests.post(url, params=params, headers=headers,
                                 data=audio_data, timeout=120)
        response.raise_for_status()
        break  # success
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            # Rate limited - back off exponentially before retrying
            time.sleep(2 ** retry_count)
        elif e.response.status_code == 401:
            # Invalid API key - retrying will not help
            raise ValueError("Invalid API key") from e
        else:
            raise

Rate Limiting

  • Add 500ms+ delay between batch requests (see the batching sketch below)
  • Use webhooks for long audio files
  • Implement exponential backoff for 429 errors
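
Putting the first and last points together, here is a minimal sketch that batches a folder of WAV files through the REST endpoint from Part 1 with a fixed delay between requests (the 0.5s spacing and en language are example choices):

import os
import time
from pathlib import Path

import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
REQUEST_SPACING = 0.5  # seconds between batch requests

def transcribe_folder(folder: str) -> dict:
    """Transcribe every WAV file in a folder, pausing between requests."""
    results = {}
    for path in sorted(Path(folder).glob("*.wav")):
        resp = requests.post(
            URL,
            params={"model": "pulse", "language": "en"},
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "audio/wav",
            },
            data=path.read_bytes(),
            timeout=120,
        )
        resp.raise_for_status()
        results[path.name] = resp.json().get("transcription")
        time.sleep(REQUEST_SPACING)  # simple throttle between batch requests
    return results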

Bonus: Full Demo Application

Want to see everything working together? Check out the demo app in the code samples repository — a complete Next.js web application featuring:

[Screenshot: Pulse STT demo application]

  • File upload transcription with word-level timestamps (hover to see timing)
  • Real-time microphone streaming with live transcript display
  • Secure WebSocket proxy that keeps your API key server-side
  • Modern UI with Smallest AI brand colors
  • Language selection (English, Hindi, Spanish, French, German, Portuguese, Auto-detect)
  • Emotion detection and speaker diarization display

Quick Start

cd demo-app
npm install

Create a .env.local file with your API key:

echo 'SMALLEST_API_KEY=your-api-key' > .env.local

Start both servers (Next.js + WebSocket proxy):

npm run dev:all

Then open http://localhost:3000 in Chrome or Safari (for microphone access).

How It Works

The demo runs two servers:

  1. Next.js (port 3000) — Serves the React UI and handles file upload via /api/transcribe

  2. WebSocket Proxy (port 3001) — Securely proxies audio from browser to Pulse STT WebSocket API

Browser → WebSocket Proxy (3001) → Pulse STT (wss://waves-api.smallest.ai)
Browser → Next.js API (3000) → Pulse STT (REST API)

This architecture keeps your API key secure on the server while enabling real-time streaming.
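
The demo's proxy lives in ws-server.js in the repo; as an illustration of the same relay pattern (not the demo code itself), here is a minimal Python sketch that accepts browser audio on port 3001 and forwards it to Pulse with the API key attached server-side. The language and encoding parameters are example choices.

import asyncio
import os
from urllib.parse import urlencode

import websockets

UPSTREAM = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"
API_KEY = os.environ["SMALLEST_API_KEY"]

async def handle_client(client_ws):
    """Relay audio from the browser to Pulse and transcripts back,
    keeping the API key on the server. Simplified: a production proxy
    also needs to handle disconnects and forward the end signal."""
    params = urlencode({"language": "en", "encoding": "linear16", "sample_rate": 16000})
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(f"{UPSTREAM}?{params}", additional_headers=headers) as upstream:

        async def client_to_upstream():
            async for message in client_ws:
                await upstream.send(message)

        async def upstream_to_client():
            async for message in upstream:
                await client_ws.send(message)

        await asyncio.gather(client_to_upstream(), upstream_to_client())

async def main():
    async with websockets.serve(handle_client, "localhost", 3001):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())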

Project Structure

demo-app/
├── src/
│   └── app/
│       ├── api/
│       │   └── transcribe/
│       │       └── route.ts    # REST API for file upload
│       ├── page.tsx            # Main UI
│       └── layout.tsx
├── ws-server.js                # WebSocket proxy server
├── .env.local                  # Your API key (create this)
└── package.json

Scripts

Command | Description
--- | ---
npm run dev | Start Next.js only
npm run dev:ws | Start WebSocket proxy only
npm run dev:all | Start both (recommended)

This architecture pattern is recommended for production apps — API keys stay server-side while the React frontend provides a smooth user experience with both file upload and real-time microphone transcription.


Conclusion

Integrating voice and speech capabilities into your workflow and apps can greatly enhance user experience. With Pulse STT, developers can achieve high-accuracy, low-latency transcription with just a few API calls.

When to use REST API:

  • Podcast transcription
  • Meeting recordings
  • Voicemail processing
  • Batch analytics

When to use WebSocket API:

  • Live captioning
  • Voice assistants
  • Call center real-time analytics
  • Interactive voice applications

The code patterns in this guide translate directly to production. Start with the REST API for prototyping, then add WebSocket streaming when real-time interaction becomes a requirement.

