Abhishek Mishra

Posted on • Originally published at smallest.ai

Build Voice AI in Python: Complete Speech-to-Text Developer Guide (2026)

TL;DR – Quick Integration Overview

API Platform: Pulse STT by Smallest AI – a state-of-the-art speech-to-text API supporting real-time streaming and batch audio transcription.

Key Features:

  • Transcribes in 32+ languages with automatic language detection
  • Ultra-low latency: ~64ms time-to-first-transcript for streaming
  • Rich metadata: word timestamps, speaker diarization, emotion detection, age/gender estimation, PII redaction

Integration Methods:

  • Pre-Recorded Audio: POST https://waves-api.smallest.ai/api/v1/pulse/get_text – upload files for batch processing
  • Real-Time Streaming: wss://waves-api.smallest.ai/api/v1/pulse/get_text – WebSocket for live transcription

Developer Experience: Use any HTTP/WebSocket client or official SDKs (Python, Node.js). Authentication via a single API key.

Why Pulse STT? Compared to other providers, Pulse offers faster response (64ms vs 200-500ms for typical cloud STT) and all-in-one features (no need for separate services for speaker ID, sentiment, or PII masking).


Introduction: Why Voice Integration Matters

Voice is becoming the next frontier for user interaction. From virtual assistants and voice bots to real-time transcription in meetings, speech interfaces are making software more accessible and user-friendly. Developers today have access to Automatic Speech Recognition (ASR) APIs that convert voice to text, opening up possibilities for hands-free control, live captions, voice search, and more.

However, integrating voice AI is more than just getting raw text from audio. Modern use cases demand speed and accuracy – a voice assistant needs to transcribe commands almost instantly, and a call center analytics tool might need not just the transcript but also who spoke when and how they said it.

Latency is critical. A delay of even a second feels laggy in conversation. Traditional cloud speech APIs often have 500–1200ms latency for live transcription, with better ones hovering around 200–250ms. This has pushed the industry toward ultra-low latency – under 300ms – to enable seamless real-time interactions.

In this guide, we'll walk through how to integrate an AI voice & speech API that meets these modern demands using Smallest AI's Pulse STT. By the end, you'll know how to:

  1. Transcribe audio files (WAV/MP3) to text using a simple HTTP API
  2. Stream live audio for instantaneous transcripts via WebSockets
  3. Leverage advanced features like timestamps, speaker diarization, and emotion detection
  4. Use both Python and Node.js to integrate voice capabilities

Understanding Pulse STT

Pulse is the speech-to-text (ASR, automatic speech recognition) model from Smallest AI's "Waves" platform. It's designed for fast, accurate, and rich transcription with industry-leading latency – around 64 milliseconds time-to-first-transcript (TTFT) for streaming audio, an order of magnitude faster than many alternatives.

Highlight Features

Feature | Description
--- | ---
Real-Time & Batch Modes | Stream live audio via WebSocket or upload files via HTTP POST
32+ Languages | English, Spanish, Hindi, French, German, Arabic, Japanese, and more, with auto-detection
Word/Sentence Timestamps | Know exactly when each word was spoken (great for subtitles)
Speaker Diarization | Differentiate speakers: "Speaker A said X, Speaker B said Y"
Emotion Detection | Tag segments with emotions: happy, angry, neutral, etc.
Age/Gender Estimation | Infer speaker demographics for analytics
PII/PCI Redaction | Automatically mask credit cards, SSNs, and personal info
64ms Latency | Time-to-first-transcript in streaming mode

Getting Started: Authentication

Step 1: Get Your API Key

Sign up on the Smallest AI Console and generate an API key. This key authenticates all your requests.

Step 2: Test Your Key

curl -H "Authorization: Bearer $SMALLEST_API_KEY" \
  https://waves-api.smallest.ai/api/v1/lightning-v3.1/get_voices

Authentication Header

All requests require this header:

Authorization: Bearer <YOUR_API_KEY>

Part 1: Transcribing Audio Files (REST API)

The Pre-Recorded API is perfect for batch processing voicemails, podcasts, meeting recordings, or any existing audio files.

Endpoint

POST https://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

Parameter | Type | Description
--- | --- | ---
model | string | Model identifier: pulse (required)
language | string | ISO code (en, es, hi) or multi for auto-detect
word_timestamps | boolean | Include word-level timing data
diarize | boolean | Enable speaker diarization
emotion_detection | boolean | Detect speaker emotions
age_detection | boolean | Estimate speaker age group
gender_detection | boolean | Estimate speaker gender

Supported Languages (32+)

Italian, Spanish, English, Portuguese, Hindi, German, French, Ukrainian, Russian, Kannada, Malayalam, Polish, Marathi, Gujarati, Czech, Slovak, Telugu, Odia, Dutch, Bengali, Latvian, Estonian, Romanian, Punjabi, Finnish, Swedish, Bulgarian, Tamil, Hungarian, Danish, Lithuanian, Maltese, and auto-detection (multi).

cURL Example

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true&word_timestamps=true&emotion_detection=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"

Python Example

import os
import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
audio_file = "meeting_recording.wav"

url = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
params = {
    "model": "pulse",
    "language": "en",
    "word_timestamps": "true",
    "diarize": "true",
    "emotion_detection": "true"
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "audio/wav"
}

with open(audio_file, "rb") as f:
    audio_data = f.read()

response = requests.post(url, params=params, headers=headers, data=audio_data)
result = response.json()

# Print transcription
print("Transcription:", result.get("transcription"))

# Print word-level details with speaker info
for word in result.get("words", []):
    speaker = word.get("speaker", "N/A")
    print(f"  [Speaker {speaker}] [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")

# Check emotions
if "emotions" in result:
    print("\nEmotions detected:")
    for emotion, score in result["emotions"].items():
        if score > 0.1:
            print(f"  {emotion}: {score:.1%}")

[Screenshot: Pulse STT Python transcription demo]

Node.js Example

const fs = require('fs');
const axios = require('axios');

const API_KEY = process.env.SMALLEST_API_KEY;
const audioFile = 'meeting_recording.wav';

const url = 'https://waves-api.smallest.ai/api/v1/pulse/get_text';
const params = new URLSearchParams({
  model: 'pulse',
  language: 'en',
  word_timestamps: 'true',
  diarize: 'true',
  emotion_detection: 'true'
});

const audioData = fs.readFileSync(audioFile);

axios.post(`${url}?${params}`, audioData, {
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'audio/wav'
  }
})
.then(res => {
  console.log('Transcription:', res.data.transcription);

  // Print words with speaker info
  res.data.words?.forEach(word => {
    console.log(`  [Speaker ${word.speaker}] [${word.start}s - ${word.end}s] ${word.word}`);
  });
})
.catch(err => {
  console.error('Error:', err.response?.data || err.message);
});

Example Response

{
  "status": "success",
  "transcription": "Hello, this is a test transcription.",
  "words": [
    {"start": 0.0, "end": 0.88, "word": "Hello,", "confidence": 0.82, "speaker": 0, "speaker_confidence": 0.61},
    {"start": 0.88, "end": 1.04, "word": "this", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.76},
    {"start": 1.04, "end": 1.20, "word": "is", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.20, "end": 1.36, "word": "a", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.36, "end": 1.68, "word": "test", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.68, "end": 2.16, "word": "transcription.", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99}
  ],
  "utterances": [
    {"start": 0.0, "end": 2.16, "text": "Hello, this is a test transcription.", "speaker": 0}
  ],
  "age": "adult",
  "gender": "female",
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  },
  "metadata": {
    "duration": 1.97,
    "fileSize": 63236
  }
}

Part 2: Real-Time Streaming (WebSocket API)

For live audio – voice assistants, live captioning, call center analytics – use the WebSocket API for sub-second latency with partial results as audio streams in.

WebSocket Endpoint

wss://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
language | string | en | Language code or multi for auto-detect
encoding | string | linear16 | Audio format: linear16, linear32, alaw, mulaw, opus
sample_rate | string | 16000 | Sample rate: 8000, 16000, 22050, 24000, 44100, 48000
word_timestamps | string | true | Include word-level timestamps
full_transcript | string | false | Include cumulative transcript
sentence_timestamps | string | false | Include sentence-level timestamps
redact_pii | string | false | Redact personal information
redact_pci | string | false | Redact payment card information
diarize | string | false | Enable speaker diarization

Python Streaming Example

From the official cookbook:

import asyncio
import json
import os
import sys
import numpy as np
import websockets
import librosa
from urllib.parse import urlencode

WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"

# Configurable features
LANGUAGE = "en"
ENCODING = "linear16"
SAMPLE_RATE = 16000
WORD_TIMESTAMPS = False
FULL_TRANSCRIPT = True
SENTENCE_TIMESTAMPS = False
DIARIZE = False
REDACT_PII = False
REDACT_PCI = False

async def transcribe(audio_file: str, api_key: str):
    params = {
        "language": LANGUAGE,
        "encoding": ENCODING,
        "sample_rate": SAMPLE_RATE,
        "word_timestamps": str(WORD_TIMESTAMPS).lower(),
        "full_transcript": str(FULL_TRANSCRIPT).lower(),
        "sentence_timestamps": str(SENTENCE_TIMESTAMPS).lower(),
        "diarize": str(DIARIZE).lower(),
        "redact_pii": str(REDACT_PII).lower(),
        "redact_pci": str(REDACT_PCI).lower(),
    }

    url = f"{WS_URL}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {api_key}"}

    # Load audio with librosa (handles any format)
    audio, _ = librosa.load(audio_file, sr=SAMPLE_RATE, mono=True)
    chunk_duration = 0.1  # 100ms chunks
    chunk_size = int(chunk_duration * SAMPLE_RATE)

    async with websockets.connect(url, additional_headers=headers) as ws:
        print("✅ Connected to Pulse STT WebSocket")

        async def send_audio():
            for i in range(0, len(audio), chunk_size):
                chunk = audio[i:i + chunk_size]
                pcm16 = (chunk * 32768.0).astype(np.int16).tobytes()
                await ws.send(pcm16)
                await asyncio.sleep(chunk_duration)
            await ws.send(json.dumps({"type": "end"}))
            print("📤 Sent end signal")

        async def receive_responses():
            async for message in ws:
                result = json.loads(message)

                if result.get("is_final"):
                    print(f"{result.get('transcript')}")

                    if result.get("is_last"):
                        if result.get("full_transcript"):
                            print(f"\n{'='*60}")
                            print("FULL TRANSCRIPT")
                            print(f"{'='*60}")
                            print(result.get("full_transcript"))
                        break

        await asyncio.gather(send_audio(), receive_responses())

# Usage
if __name__ == "__main__":
    api_key = os.environ.get("SMALLEST_API_KEY")
    audio_path = sys.argv[1] if len(sys.argv) > 1 else "recording.wav"
    asyncio.run(transcribe(audio_path, api_key))

Install dependencies:

pip install websockets librosa numpy

Run:

export SMALLEST_API_KEY="your-api-key"
python transcribe.py recording.wav

Node.js Streaming Example

From the official cookbook:

const fs = require("fs");
const WebSocket = require("ws");
const wav = require("wav");

const WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text";

// Configurable features
const LANGUAGE = "en";
const ENCODING = "linear16";
const SAMPLE_RATE = 16000;
const WORD_TIMESTAMPS = false;
const FULL_TRANSCRIPT = true;
const DIARIZE = false;
const REDACT_PII = false;
const REDACT_PCI = false;

async function loadAudio(audioFile) {
  return new Promise((resolve, reject) => {
    const reader = new wav.Reader();
    const chunks = [];

    reader.on("format", (format) => {
      reader.on("data", (chunk) => chunks.push(chunk));
      reader.on("end", () => {
        const buffer = Buffer.concat(chunks);
        const samples = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
        resolve(samples);
      });
    });

    reader.on("error", reject);
    fs.createReadStream(audioFile).pipe(reader);
  });
}

async function transcribe(audioFile, apiKey) {
  const params = new URLSearchParams({
    language: LANGUAGE,
    encoding: ENCODING,
    sample_rate: SAMPLE_RATE,
    word_timestamps: WORD_TIMESTAMPS,
    full_transcript: FULL_TRANSCRIPT,
    diarize: DIARIZE,
    redact_pii: REDACT_PII,
    redact_pci: REDACT_PCI,
  });

  const url = `${WS_URL}?${params}`;
  const audio = await loadAudio(audioFile);
  const chunkDuration = 0.1; // 100ms
  const chunkSize = Math.floor(chunkDuration * SAMPLE_RATE);

  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });

    ws.on("open", async () => {
      console.log("✅ Connected to Pulse STT WebSocket");

      for (let i = 0; i < audio.length; i += chunkSize) {
        const chunk = audio.slice(i, i + chunkSize);
        ws.send(Buffer.from(chunk.buffer, chunk.byteOffset, chunk.byteLength));
        await new Promise((r) => setTimeout(r, chunkDuration * 1000));
      }
      ws.send(JSON.stringify({ type: "end" }));
      console.log("📤 Sent end signal");
    });

    ws.on("message", (data) => {
      const result = JSON.parse(data.toString());

      if (result.is_final) {
        console.log(`✓ ${result.transcript}`);

        if (result.is_last) {
          if (result.full_transcript) {
            console.log("\n" + "=".repeat(60));
            console.log("FULL TRANSCRIPT");
            console.log("=".repeat(60));
            console.log(result.full_transcript);
          }
          ws.close();
        }
      }
    });

    ws.on("close", resolve);
    ws.on("error", reject);
  });
}

// Usage
const apiKey = process.env.SMALLEST_API_KEY;
const audioFile = process.argv[2] || "recording.wav";
transcribe(audioFile, apiKey).then(() => console.log("Done!"));

Install dependencies:

npm install ws wav

Run:

export SMALLEST_API_KEY="your-api-key"
node transcribe.js recording.wav

WebSocket Response Format

{
  "session_id": "sess_12345abcde",
  "transcript": "Hello, how are you?",
  "full_transcript": "Hello, how are you?",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {"word": "Hello,", "start": 0.0, "end": 0.5, "confidence": 0.98, "speaker": 0},
    {"word": "how", "start": 0.5, "end": 0.7, "confidence": 0.99, "speaker": 0},
    {"word": "are", "start": 0.7, "end": 0.9, "confidence": 0.97, "speaker": 0},
    {"word": "you?", "start": 0.9, "end": 1.2, "confidence": 0.99, "speaker": 0}
  ]
}

Key Response Fields

Field | Description
--- | ---
is_final | false = partial/interim transcript; true = finalized segment
is_last | true when the entire session is complete
transcript | Current segment text
full_transcript | Accumulated text from the entire session (if enabled)
words | Word-level timestamps (if enabled)
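
How a client might consume these fields in practice: below is a minimal Python sketch (assuming the message shape above) that shows interim hypotheses in place and only commits a segment once is_final arrives.

import json

def handle_message(message: str, committed: list) -> None:
    """Render streaming results: interim text is redrawn in place,
    final segments are appended to the committed transcript."""
    result = json.loads(message)

    if result.get("is_final"):
        committed.append(result.get("transcript", ""))
        print(f"\r✓ {committed[-1]}")  # finalize the segment on its own line
    else:
        # Interim hypothesis: overwrite the current line, don't store it
        print(f"\r… {result.get('transcript', '')}", end="", flush=True)

    if result.get("is_last"):
        print("\nSession complete:", " ".join(committed))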

[Screenshot: Pulse STT Node.js streaming WebSocket demo]


Part 3: Advanced Features

Speaker Diarization

Enable diarize=true to identify different speakers:

params = {"model": "pulse", "language": "en", "diarize": "true"}

Response includes speaker labels:

{
  "words": [
    {"word": "Hello", "speaker": 0, "speaker_confidence": 0.95},
    {"word": "Hi", "speaker": 1, "speaker_confidence": 0.92}
  ],
  "utterances": [
    {"text": "Hello, how can I help?", "speaker": 0},
    {"text": "I have a question.", "speaker": 1}
  ]
}
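
To turn that payload into a readable dialogue, here is a small Python sketch (assuming the utterances field shown above) that prints one line per speaker turn:

def format_dialogue(result: dict) -> str:
    """Build a "Speaker N: text" transcript from diarized utterances."""
    lines = []
    for utt in result.get("utterances", []):
        lines.append(f"Speaker {utt.get('speaker', '?')}: {utt['text']}")
    return "\n".join(lines)

# With the response above this prints:
# Speaker 0: Hello, how can I help?
# Speaker 1: I have a question.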

Emotion Detection

Enable emotion_detection=true to analyze speaker sentiment:

{
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  }
}
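
The scores are per-emotion values between 0 and 1. A quick way to act on them, sketched below, is to pick the dominant emotion and only surface it above a threshold (the 0.2 cutoff is an arbitrary example, not an API default):

def dominant_emotion(result: dict, threshold: float = 0.2):
    """Return the highest-scoring emotion if it clears the threshold, else None."""
    emotions = result.get("emotions", {})
    if not emotions:
        return None
    label, score = max(emotions.items(), key=lambda item: item[1])
    return label if score >= threshold else None

# dominant_emotion({"emotions": {"happiness": 0.28, "anger": 0.0}})  ->  "happiness"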

PII/PCI Redaction

For compliance (HIPAA, PCI-DSS), enable redact_pii=true or redact_pci=true:

{
  "transcript": "My credit card is [CREDITCARD_1] and SSN is [SSN_1]",
  "redacted_entities": ["[CREDITCARD_1]", "[SSN_1]"]
}
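
On the streaming side, redaction is enabled with the redact_pii and redact_pci query parameters from the WebSocket table in Part 2. A minimal sketch of building such a connection URL (the other parameter values are just example choices):

from urllib.parse import urlencode

WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"

params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": 16000,
    "redact_pii": "true",  # mask SSNs, names, phone numbers, ...
    "redact_pci": "true",  # mask card numbers and other payment data
}
url = f"{WS_URL}?{urlencode(params)}"
# Connect with websockets.connect(url, additional_headers=...) as in Part 2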

Age and Gender Detection

Enable age_detection=true and gender_detection=true:

{
  "age": "adult",
  "gender": "female"
}

Comparing STT Providers

Provider | Latency | Languages | Price (per 1,000 min)
--- | --- | --- | ---
Pulse STT | ~64ms | 32+ | Competitive
Google Cloud STT | 200-300ms | 125+ | ~$16
Deepgram | 100-200ms | 36+ | ~$4-5
AssemblyAI | 200-400ms | 30+ | ~$3.50
OpenAI Whisper | Batch only | 99+ | ~$6

Why Pulse STT stands out:

  • Fastest time-to-first-transcript (64ms)
  • All-in-one features (no separate services needed)
  • Competitive accuracy across diverse accents
  • Built for real-time voice AI applications

Best Practices

Audio Quality

  • Use 16kHz, mono, 16-bit PCM for best results (see the conversion sketch below)
  • WAV or FLAC formats are ideal
  • Minimize background noise when possible
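
If your source audio isn't already in that shape, one way to normalize it is with librosa and soundfile (a sketch, assuming both packages are installed):

import librosa
import soundfile as sf

def to_pcm16_wav(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Resample any input file to 16 kHz mono and write 16-bit PCM WAV."""
    audio, _ = librosa.load(src, sr=sample_rate, mono=True)
    sf.write(dst, audio, sample_rate, subtype="PCM_16")

to_pcm16_wav("meeting_recording.mp3", "meeting_recording.wav")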

Error Handling

import time
import requests

MAX_RETRIES = 3

for retry_count in range(MAX_RETRIES):
    try:
        response = requests.post(url, params=params, headers=headers,
                                 data=audio_data, timeout=120)
        response.raise_for_status()
        break  # success
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            # Rate limited - back off exponentially before retrying
            time.sleep(2 ** retry_count)
        elif e.response.status_code == 401:
            # Invalid API key - retrying will not help
            raise ValueError("Invalid API key") from e
        else:
            raise

Rate Limiting

  • Add 500ms+ delay between batch requests (see the batching sketch below)
  • Use webhooks for long audio files
  • Implement exponential backoff for 429 errors
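
Putting the first and last points together, here is a minimal sketch that batches a folder of WAV files through the REST endpoint from Part 1 with a fixed delay between requests (the 0.5s spacing and en language are example choices):

import os
import time
from pathlib import Path

import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
REQUEST_SPACING = 0.5  # seconds between batch requests

def transcribe_folder(folder: str) -> dict:
    """Transcribe every WAV file in a folder, pausing between requests."""
    results = {}
    for path in sorted(Path(folder).glob("*.wav")):
        resp = requests.post(
            URL,
            params={"model": "pulse", "language": "en"},
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "audio/wav",
            },
            data=path.read_bytes(),
            timeout=120,
        )
        resp.raise_for_status()
        results[path.name] = resp.json().get("transcription")
        time.sleep(REQUEST_SPACING)  # simple throttle between batch requests
    return results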

Bonus: Full Demo Application

Want to see everything working together? Check out the demo app in the code samples repository — a complete Next.js web application featuring:

[Screenshot: Pulse STT demo application]

  • File upload transcription with word-level timestamps (hover to see timing)
  • Real-time microphone streaming with live transcript display
  • Secure WebSocket proxy that keeps your API key server-side
  • Modern UI with Smallest AI brand colors
  • Language selection (English, Hindi, Spanish, French, German, Portuguese, Auto-detect)
  • Emotion detection and speaker diarization display

Quick Start

cd demo-app
npm install

Create a .env.local file with your API key:

echo 'SMALLEST_API_KEY=your-api-key' > .env.local

Start both servers (Next.js + WebSocket proxy):

npm run dev:all

Then open http://localhost:3000 in Chrome or Safari (for microphone access).

How It Works

The demo runs two servers:

  1. Next.js (port 3000) — Serves the React UI and handles file upload via /api/transcribe

  2. WebSocket Proxy (port 3001) — Securely proxies audio from browser to Pulse STT WebSocket API

Browser → WebSocket Proxy (3001) → Pulse STT (wss://waves-api.smallest.ai)
Browser → Next.js API (3000) → Pulse STT (REST API)

This architecture keeps your API key secure on the server while enabling real-time streaming.
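
The demo's proxy lives in ws-server.js in the repo; as an illustration of the same relay pattern (not the demo code itself), here is a minimal Python sketch that accepts browser audio on port 3001 and forwards it to Pulse with the API key attached server-side. The language and encoding parameters are example choices.

import asyncio
import os
from urllib.parse import urlencode

import websockets

UPSTREAM = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"
API_KEY = os.environ["SMALLEST_API_KEY"]

async def handle_client(client_ws):
    """Relay audio from the browser to Pulse and transcripts back,
    keeping the API key on the server. Simplified: a production proxy
    also needs to handle disconnects and forward the end signal."""
    params = urlencode({"language": "en", "encoding": "linear16", "sample_rate": 16000})
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(f"{UPSTREAM}?{params}", additional_headers=headers) as upstream:

        async def client_to_upstream():
            async for message in client_ws:
                await upstream.send(message)

        async def upstream_to_client():
            async for message in upstream:
                await client_ws.send(message)

        await asyncio.gather(client_to_upstream(), upstream_to_client())

async def main():
    async with websockets.serve(handle_client, "localhost", 3001):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())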

Project Structure

demo-app/
├── src/
│   └── app/
│       ├── api/
│       │   └── transcribe/
│       │       └── route.ts    # REST API for file upload
│       ├── page.tsx            # Main UI
│       └── layout.tsx
├── ws-server.js                # WebSocket proxy server
├── .env.local                  # Your API key (create this)
└── package.json

Scripts

Command | Description
--- | ---
npm run dev | Start Next.js only
npm run dev:ws | Start WebSocket proxy only
npm run dev:all | Start both (recommended)

This architecture pattern is recommended for production apps — API keys stay server-side while the React frontend provides a smooth user experience with both file upload and real-time microphone transcription.


Conclusion

Integrating voice and speech capabilities into your workflow and apps can greatly enhance user experience. With Pulse STT, developers can achieve high-accuracy, low-latency transcription with just a few API calls.

When to use REST API:

  • Podcast transcription
  • Meeting recordings
  • Voicemail processing
  • Batch analytics

When to use WebSocket API:

  • Live captioning
  • Voice assistants
  • Call center real-time analytics
  • Interactive voice applications

The code patterns in this guide translate directly to production. Start with the REST API for prototyping, then add WebSocket streaming when real-time interaction becomes a requirement.

