TL;DR – Quick Integration Overview
API Platform: Pulse STT by Smallest AI – a state-of-the-art speech-to-text API supporting real-time streaming and batch audio transcription.
Key Features:
- Transcribes in 32+ languages with automatic language detection
- Ultra-low latency: ~64ms time-to-first-transcript for streaming
- Rich metadata: word timestamps, speaker diarization, emotion detection, age/gender estimation, PII redaction
Integration Methods:
- Pre-Recorded Audio: `POST https://waves-api.smallest.ai/api/v1/pulse/get_text` – upload files for batch processing
- Real-Time Streaming: `wss://waves-api.smallest.ai/api/v1/pulse/get_text` – WebSocket for live transcription
Developer Experience: Use any HTTP/WebSocket client or official SDKs (Python, Node.js). Authentication via a single API key.
Why Pulse STT? Compared to other providers, Pulse offers faster response (64ms vs 200-500ms for typical cloud STT) and all-in-one features (no need for separate services for speaker ID, sentiment, or PII masking).
Quick Links:
- API Console – Get your API key
- Documentation – Full API reference
- Python SDK – Official client
Introduction: Why Voice Integration Matters
Voice is becoming the next frontier for user interaction. From virtual assistants and voice bots to real-time transcription in meetings, speech interfaces are making software more accessible and user-friendly. Developers today have access to Automatic Speech Recognition (ASR) APIs that convert voice to text, opening up possibilities for hands-free control, live captions, voice search, and more.
However, integrating voice AI is more than just getting raw text from audio. Modern use cases demand speed and accuracy – a voice assistant needs to transcribe commands almost instantly, and a call center analytics tool might need not just the transcript but also who spoke when and how they said it.
Latency is critical. A delay of even a second feels laggy in conversation. Traditional cloud speech APIs often have 500–1200ms latency for live transcription, with better ones hovering around 200–250ms. This has pushed the industry toward ultra-low latency – under 300ms – to enable seamless real-time interactions.
In this guide, we'll walk through how to integrate an AI voice & speech API that meets these modern demands using Smallest AI's Pulse STT. By the end, you'll know how to:
- Transcribe audio files (WAV/MP3) to text using a simple HTTP API
- Stream live audio for instantaneous transcripts via WebSockets
- Leverage advanced features like timestamps, speaker diarization, and emotion detection
- Use both Python and Node.js to integrate voice capabilities
Understanding Pulse STT
Pulse is the speech-to-text (automatic speech recognition, or ASR) model from Smallest AI's "Waves" platform. It's designed for fast, accurate, and rich transcription with industry-leading latency – a time-to-first-transcript (TTFT) of around 64 milliseconds for streaming audio, an order of magnitude faster than many alternatives.
Highlight Features
| Feature | Description |
|---|---|
| Real-Time & Batch Modes | Stream live audio via WebSocket or upload files via HTTP POST |
| 32+ Languages | English, Spanish, Hindi, French, German, Arabic, Japanese, and more with auto-detection |
| Word/Sentence Timestamps | Know exactly when each word was spoken (great for subtitles) |
| Speaker Diarization | Differentiate speakers: "Speaker A said X, Speaker B said Y" |
| Emotion Detection | Tag segments with emotions: happy, angry, neutral, etc. |
| Age/Gender Estimation | Infer speaker demographics for analytics |
| PII/PCI Redaction | Automatically mask credit cards, SSNs, and personal info |
| 64ms Latency | Time-to-first-transcript in streaming mode |
Getting Started: Authentication
Step 1: Get Your API Key
Sign up on the Smallest AI Console and generate an API key. This key authenticates all your requests.
Step 2: Test Your Key
curl -H "Authorization: Bearer $SMALLEST_API_KEY" \
https://waves-api.smallest.ai/api/v1/lightning-v3.1/get_voices
Authentication Header
All requests require this header:
Authorization: Bearer <YOUR_API_KEY>
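Tip: if you're making several calls from Python, you can attach the header to a requests.Session once instead of repeating it per request. This is just a convenience sketch; the Authorization header shown above is all the API needs.

import os
import requests

# Reuse one session so the Authorization header rides along on every request
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {os.getenv('SMALLEST_API_KEY')}"})

# Any call made through this session is now authenticated, e.g.:
# response = session.post("https://waves-api.smallest.ai/api/v1/pulse/get_text", params=..., data=...)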
Part 1: Transcribing Audio Files (REST API)
The Pre-Recorded API is perfect for batch processing voicemails, podcasts, meeting recordings, or any existing audio files.
Endpoint
POST https://waves-api.smallest.ai/api/v1/pulse/get_text
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Model identifier: `pulse` (required) |
| `language` | string | ISO code (`en`, `es`, `hi`) or `multi` for auto-detect |
| `word_timestamps` | boolean | Include word-level timing data |
| `diarize` | boolean | Enable speaker diarization |
| `emotion_detection` | boolean | Detect speaker emotions |
| `age_detection` | boolean | Estimate speaker age group |
| `gender_detection` | boolean | Estimate speaker gender |
Supported Languages (32+)
Italian, Spanish, English, Portuguese, Hindi, German, French, Ukrainian, Russian, Kannada, Malayalam, Polish, Marathi, Gujarati, Czech, Slovak, Telugu, Odia, Dutch, Bengali, Latvian, Estonian, Romanian, Punjabi, Finnish, Swedish, Bulgarian, Tamil, Hungarian, Danish, Lithuanian, Maltese, and auto-detection (multi).
cURL Example
curl --request POST \
--url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true&word_timestamps=true&emotion_detection=true" \
--header "Authorization: Bearer $SMALLEST_API_KEY" \
--header "Content-Type: audio/wav" \
--data-binary "@/path/to/audio.wav"
Python Example
import os
import requests
API_KEY = os.getenv("SMALLEST_API_KEY")
audio_file = "meeting_recording.wav"
url = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
params = {
"model": "pulse",
"language": "en",
"word_timestamps": "true",
"diarize": "true",
"emotion_detection": "true"
}
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "audio/wav"
}
with open(audio_file, "rb") as f:
audio_data = f.read()
response = requests.post(url, params=params, headers=headers, data=audio_data)
result = response.json()
# Print transcription
print("Transcription:", result.get("transcription"))
# Print word-level details with speaker info
for word in result.get("words", []):
speaker = word.get("speaker", "N/A")
print(f" [Speaker {speaker}] [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")
# Check emotions
if "emotions" in result:
print("\nEmotions detected:")
for emotion, score in result["emotions"].items():
if score > 0.1:
print(f" {emotion}: {score:.1%}")
Node.js Example
const fs = require('fs');
const axios = require('axios');
const API_KEY = process.env.SMALLEST_API_KEY;
const audioFile = 'meeting_recording.wav';
const url = 'https://waves-api.smallest.ai/api/v1/pulse/get_text';
const params = new URLSearchParams({
model: 'pulse',
language: 'en',
word_timestamps: 'true',
diarize: 'true',
emotion_detection: 'true'
});
const audioData = fs.readFileSync(audioFile);
axios.post(`${url}?${params}`, audioData, {
headers: {
'Authorization': `Bearer ${API_KEY}`,
'Content-Type': 'audio/wav'
}
})
.then(res => {
console.log('Transcription:', res.data.transcription);
// Print words with speaker info
res.data.words?.forEach(word => {
console.log(` [Speaker ${word.speaker}] [${word.start}s - ${word.end}s] ${word.word}`);
});
})
.catch(err => {
console.error('Error:', err.response?.data || err.message);
});
Example Response
{
"status": "success",
"transcription": "Hello, this is a test transcription.",
"words": [
{"start": 0.0, "end": 0.88, "word": "Hello,", "confidence": 0.82, "speaker": 0, "speaker_confidence": 0.61},
{"start": 0.88, "end": 1.04, "word": "this", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.76},
{"start": 1.04, "end": 1.20, "word": "is", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
{"start": 1.20, "end": 1.36, "word": "a", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
{"start": 1.36, "end": 1.68, "word": "test", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99},
{"start": 1.68, "end": 2.16, "word": "transcription.", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99}
],
"utterances": [
{"start": 0.0, "end": 2.16, "text": "Hello, this is a test transcription.", "speaker": 0}
],
"age": "adult",
"gender": "female",
"emotions": {
"happiness": 0.28,
"sadness": 0.0,
"anger": 0.0,
"fear": 0.0,
"disgust": 0.0
},
"metadata": {
"duration": 1.97,
"fileSize": 63236
}
}
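Because the response carries word-level timing, you can derive artifacts like subtitles directly from it. The sketch below is not part of the API; it simply groups the words array from the response above into SRT-style captions (the group size and formatting are arbitrary choices):

def words_to_srt(words, words_per_caption=7):
    """Convert Pulse word timestamps into a simple SRT string (illustrative only)."""
    def fmt(t):
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    captions = []
    for idx in range(0, len(words), words_per_caption):
        group = words[idx:idx + words_per_caption]
        start, end = group[0]["start"], group[-1]["end"]
        text = " ".join(w["word"] for w in group)
        captions.append(f"{idx // words_per_caption + 1}\n{fmt(start)} --> {fmt(end)}\n{text}\n")
    return "\n".join(captions)

# Example: print(words_to_srt(result.get("words", [])))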
Part 2: Real-Time Streaming (WebSocket API)
For live audio – voice assistants, live captioning, call center analytics – use the WebSocket API for sub-second latency with partial results as audio streams in.
WebSocket Endpoint
wss://waves-api.smallest.ai/api/v1/pulse/get_text
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code or `multi` for auto-detect |
| `encoding` | string | `linear16` | Audio format: `linear16`, `linear32`, `alaw`, `mulaw`, `opus` |
| `sample_rate` | string | `16000` | Sample rate: `8000`, `16000`, `22050`, `24000`, `44100`, `48000` |
| `word_timestamps` | string | `true` | Include word-level timestamps |
| `full_transcript` | string | `false` | Include cumulative transcript |
| `sentence_timestamps` | string | `false` | Include sentence-level timestamps |
| `redact_pii` | string | `false` | Redact personal information |
| `redact_pci` | string | `false` | Redact payment card information |
| `diarize` | string | `false` | Enable speaker diarization |
Python Streaming Example
From the official cookbook:
import asyncio
import json
import os
import numpy as np
import websockets
import librosa
from urllib.parse import urlencode
WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"
# Configurable features
LANGUAGE = "en"
ENCODING = "linear16"
SAMPLE_RATE = 16000
WORD_TIMESTAMPS = False
FULL_TRANSCRIPT = True
SENTENCE_TIMESTAMPS = False
DIARIZE = False
REDACT_PII = False
REDACT_PCI = False
async def transcribe(audio_file: str, api_key: str):
params = {
"language": LANGUAGE,
"encoding": ENCODING,
"sample_rate": SAMPLE_RATE,
"word_timestamps": str(WORD_TIMESTAMPS).lower(),
"full_transcript": str(FULL_TRANSCRIPT).lower(),
"sentence_timestamps": str(SENTENCE_TIMESTAMPS).lower(),
"diarize": str(DIARIZE).lower(),
"redact_pii": str(REDACT_PII).lower(),
"redact_pci": str(REDACT_PCI).lower(),
}
url = f"{WS_URL}?{urlencode(params)}"
headers = {"Authorization": f"Bearer {api_key}"}
# Load audio with librosa (handles any format)
audio, _ = librosa.load(audio_file, sr=SAMPLE_RATE, mono=True)
chunk_duration = 0.1 # 100ms chunks
chunk_size = int(chunk_duration * SAMPLE_RATE)
async with websockets.connect(url, additional_headers=headers) as ws:
print("✅ Connected to Pulse STT WebSocket")
async def send_audio():
for i in range(0, len(audio), chunk_size):
chunk = audio[i:i + chunk_size]
pcm16 = (chunk * 32768.0).astype(np.int16).tobytes()
await ws.send(pcm16)
await asyncio.sleep(chunk_duration)
await ws.send(json.dumps({"type": "end"}))
print("📤 Sent end signal")
async def receive_responses():
async for message in ws:
result = json.loads(message)
if result.get("is_final"):
print(f"✓ {result.get('transcript')}")
if result.get("is_last"):
if result.get("full_transcript"):
print(f"\n{'='*60}")
print("FULL TRANSCRIPT")
print(f"{'='*60}")
print(result.get("full_transcript"))
break
await asyncio.gather(send_audio(), receive_responses())
# Usage: python transcribe.py <audio_file>
if __name__ == "__main__":
    import sys
    api_key = os.environ.get("SMALLEST_API_KEY")
    audio_path = sys.argv[1] if len(sys.argv) > 1 else "recording.wav"
    asyncio.run(transcribe(audio_path, api_key))
Install dependencies:
pip install websockets librosa numpy
Run:
export SMALLEST_API_KEY="your-api-key"
python transcribe.py recording.wav
Node.js Streaming Example
From the official cookbook:
const fs = require("fs");
const WebSocket = require("ws");
const wav = require("wav");
const WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text";
// Configurable features
const LANGUAGE = "en";
const ENCODING = "linear16";
const SAMPLE_RATE = 16000;
const WORD_TIMESTAMPS = false;
const FULL_TRANSCRIPT = true;
const DIARIZE = false;
const REDACT_PII = false;
const REDACT_PCI = false;
async function loadAudio(audioFile) {
return new Promise((resolve, reject) => {
const reader = new wav.Reader();
const chunks = [];
reader.on("format", (format) => {
reader.on("data", (chunk) => chunks.push(chunk));
reader.on("end", () => {
const buffer = Buffer.concat(chunks);
const samples = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
resolve(samples);
});
});
reader.on("error", reject);
fs.createReadStream(audioFile).pipe(reader);
});
}
async function transcribe(audioFile, apiKey) {
const params = new URLSearchParams({
language: LANGUAGE,
encoding: ENCODING,
sample_rate: SAMPLE_RATE,
word_timestamps: WORD_TIMESTAMPS,
full_transcript: FULL_TRANSCRIPT,
diarize: DIARIZE,
redact_pii: REDACT_PII,
redact_pci: REDACT_PCI,
});
const url = `${WS_URL}?${params}`;
const audio = await loadAudio(audioFile);
const chunkDuration = 0.1; // 100ms
const chunkSize = Math.floor(chunkDuration * SAMPLE_RATE);
return new Promise((resolve, reject) => {
const ws = new WebSocket(url, {
headers: { Authorization: `Bearer ${apiKey}` },
});
ws.on("open", async () => {
console.log("✅ Connected to Pulse STT WebSocket");
for (let i = 0; i < audio.length; i += chunkSize) {
const chunk = audio.slice(i, i + chunkSize);
ws.send(Buffer.from(chunk.buffer, chunk.byteOffset, chunk.byteLength));
await new Promise((r) => setTimeout(r, chunkDuration * 1000));
}
ws.send(JSON.stringify({ type: "end" }));
console.log("📤 Sent end signal");
});
ws.on("message", (data) => {
const result = JSON.parse(data.toString());
if (result.is_final) {
console.log(`✓ ${result.transcript}`);
if (result.is_last) {
if (result.full_transcript) {
console.log("\n" + "=".repeat(60));
console.log("FULL TRANSCRIPT");
console.log("=".repeat(60));
console.log(result.full_transcript);
}
ws.close();
}
}
});
ws.on("close", resolve);
ws.on("error", reject);
});
}
// Usage: node transcribe.js <audio_file>
const apiKey = process.env.SMALLEST_API_KEY;
const audioFile = process.argv[2] || "recording.wav";
transcribe(audioFile, apiKey).then(() => console.log("Done!"));
Install dependencies:
npm install ws wav
Run:
export SMALLEST_API_KEY="your-api-key"
node transcribe.js recording.wav
WebSocket Response Format
{
"session_id": "sess_12345abcde",
"transcript": "Hello, how are you?",
"full_transcript": "Hello, how are you?",
"is_final": true,
"is_last": false,
"language": "en",
"words": [
{"word": "Hello,", "start": 0.0, "end": 0.5, "confidence": 0.98, "speaker": 0},
{"word": "how", "start": 0.5, "end": 0.7, "confidence": 0.99, "speaker": 0},
{"word": "are", "start": 0.7, "end": 0.9, "confidence": 0.97, "speaker": 0},
{"word": "you?", "start": 0.9, "end": 1.2, "confidence": 0.99, "speaker": 0}
]
}
Key Response Fields
| Field | Description |
|---|---|
| `is_final` | `false` = partial/interim transcript; `true` = finalized segment |
| `is_last` | `true` when the entire session is complete |
| `transcript` | Current segment text |
| `full_transcript` | Accumulated text from entire session (if enabled) |
| `words` | Word-level timestamps (if enabled) |
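For live captioning UIs, a common pattern is to keep overwriting the current line while is_final is false and commit the text once a final segment arrives. Here's a minimal sketch of that handling, assuming messages shaped like the response above:

import json
import sys

def handle_message(raw: str) -> bool:
    """Render interim results in place, commit finals; returns True when the session ends."""
    msg = json.loads(raw)
    if msg.get("is_final"):
        # Finalized segment: print it on its own line
        sys.stdout.write("\r" + msg.get("transcript", "") + "\n")
    else:
        # Interim result: overwrite the current line as it refines
        sys.stdout.write("\r" + msg.get("transcript", ""))
    sys.stdout.flush()
    return bool(msg.get("is_last"))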
Part 3: Advanced Features
Speaker Diarization
Enable diarize=true to identify different speakers:
params = {"model": "pulse", "language": "en", "diarize": "true"}
Response includes speaker labels:
{
"words": [
{"word": "Hello", "speaker": 0, "speaker_confidence": 0.95},
{"word": "Hi", "speaker": 1, "speaker_confidence": 0.92}
],
"utterances": [
{"text": "Hello, how can I help?", "speaker": 0},
{"text": "I have a question.", "speaker": 1}
]
}
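The utterances array is usually the most convenient unit for display. Here's a small sketch (not part of the API) that formats it into a readable, speaker-labeled conversation:

def format_conversation(result):
    """Turn diarized utterances into 'Speaker N: text' lines."""
    lines = []
    for utt in result.get("utterances", []):
        speaker = utt.get("speaker", "?")
        lines.append(f"Speaker {speaker}: {utt['text']}")
    return "\n".join(lines)

# Example output:
# Speaker 0: Hello, how can I help?
# Speaker 1: I have a question.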
Emotion Detection
Enable emotion_detection=true to analyze speaker sentiment:
{
"emotions": {
"happiness": 0.28,
"sadness": 0.0,
"anger": 0.0,
"fear": 0.0,
"disgust": 0.0
}
}
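Since emotions come back as a score dictionary, picking a dominant label is straightforward. A small sketch follows; the 0.2 threshold is an arbitrary choice, not an API default:

def dominant_emotion(emotions, threshold=0.2):
    """Return the strongest emotion label, or 'neutral' if nothing clears the threshold."""
    if not emotions:
        return "neutral"
    label, score = max(emotions.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "neutral"

# dominant_emotion({"happiness": 0.28, "sadness": 0.0, "anger": 0.0}) -> "happiness"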
PII/PCI Redaction
For compliance (HIPAA, PCI-DSS), enable redact_pii=true or redact_pci=true:
{
"transcript": "My credit card is [CREDITCARD_1] and SSN is [SSN_1]",
"redacted_entities": ["[CREDITCARD_1]", "[SSN_1]"]
}
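Before persisting redacted transcripts, it can be useful to log which placeholder types appeared, e.g. for audit trails. This is a defensive pattern layered on top of the API, not part of it:

from collections import Counter

def redaction_summary(result):
    """Count redaction placeholders by type, e.g. {'CREDITCARD': 1, 'SSN': 1}."""
    counts = Counter()
    for entity in result.get("redacted_entities", []):
        # Placeholders look like "[CREDITCARD_1]"; strip brackets and the trailing index
        counts[entity.strip("[]").rsplit("_", 1)[0]] += 1
    return dict(counts)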
Age and Gender Detection
Enable age_detection=true and gender_detection=true:
{
"age": "adult",
"gender": "female"
}
Comparing STT Providers
| Provider | Latency | Languages | Diarization | Emotion | PII Redaction | Price (per 1000 min) |
|---|---|---|---|---|---|---|
| Pulse STT | ~64ms | 32+ | ✅ | ✅ | ✅ | Competitive |
| Google Cloud STT | 200-300ms | 125+ | ✅ | ❌ | ❌ | ~$16 |
| Deepgram | 100-200ms | 36+ | ✅ | ❌ | ✅ | ~$4-5 |
| AssemblyAI | 200-400ms | 30+ | ✅ | ✅ | ✅ | ~$3.50 |
| OpenAI Whisper | Batch only | 99+ | ❌ | ❌ | ❌ | ~$6 |
Why Pulse STT stands out:
- Fastest time-to-first-transcript (64ms)
- All-in-one features (no separate services needed)
- Competitive accuracy across diverse accents
- Built for real-time voice AI applications
Best Practices
Audio Quality
- Use 16kHz, mono, 16-bit PCM for best results (a conversion sketch follows this list)
- WAV or FLAC formats are ideal
- Minimize background noise when possible
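If your source audio isn't already in that shape, resample it before upload. Here's a sketch using librosa and soundfile (pip install librosa soundfile); any resampler will do:

import librosa
import soundfile as sf

def to_pulse_friendly_wav(src_path, dst_path="converted_16k_mono.wav"):
    """Resample any audio file to 16 kHz mono, 16-bit PCM WAV."""
    audio, _ = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, 16000, subtype="PCM_16")
    return dst_path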
Error Handling
import time
import requests

MAX_RETRIES = 3
for retry_count in range(MAX_RETRIES):
    try:
        response = requests.post(url, params=params, headers=headers, data=audio_data, timeout=120)
        response.raise_for_status()
        break  # success
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            # Rate limited - back off exponentially before retrying
            time.sleep(2 ** retry_count)
        elif e.response.status_code == 401:
            # Invalid API key - retrying will not help
            raise ValueError("Invalid API key") from e
        else:
            raise
Rate Limiting
- Add 500ms+ delay between batch requests
- Use webhooks for long audio files
- Implement exponential backoff for 429 errors
Bonus: Full Demo Application
Want to see everything working together? Check out the demo app in the code samples repository — a complete Next.js web application featuring:
- File upload transcription with word-level timestamps (hover to see timing)
- Real-time microphone streaming with live transcript display
- Secure WebSocket proxy that keeps your API key server-side
- Modern UI with Smallest AI brand colors
- Language selection (English, Hindi, Spanish, French, German, Portuguese, Auto-detect)
- Emotion detection and speaker diarization display
Quick Start
cd demo-app
npm install
Create a .env.local file with your API key:
echo 'SMALLEST_API_KEY=your-api-key' > .env.local
Start both servers (Next.js + WebSocket proxy):
npm run dev:all
Then open http://localhost:3000 in Chrome or Safari (for microphone access).
How It Works
The demo runs two servers:
1. Next.js (port 3000) — Serves the React UI and handles file upload via `/api/transcribe`
2. WebSocket Proxy (port 3001) — Securely proxies audio from browser to Pulse STT WebSocket API
Browser → WebSocket Proxy (3001) → Pulse STT (wss://waves-api.smallest.ai)
Browser → Next.js API (3000) → Pulse STT (REST API)
This architecture keeps your API key secure on the server while enabling real-time streaming.
Project Structure
demo-app/
├── src/
│ └── app/
│ ├── api/
│ │ └── transcribe/
│ │ └── route.ts # REST API for file upload
│ ├── page.tsx # Main UI
│ └── layout.tsx
├── ws-server.js # WebSocket proxy server
├── .env.local # Your API key (create this)
└── package.json
Scripts
| Command | Description |
|---|---|
| `npm run dev` | Start Next.js only |
| `npm run dev:ws` | Start WebSocket proxy only |
| `npm run dev:all` | Start both (recommended) |
This architecture pattern is recommended for production apps — API keys stay server-side while the React frontend provides a smooth user experience with both file upload and real-time microphone transcription.
Conclusion
Integrating voice and speech capabilities into your workflow and apps can greatly enhance user experience. With Pulse STT, developers can achieve high-accuracy, low-latency transcription with just a few API calls.
When to use REST API:
- Podcast transcription
- Meeting recordings
- Voicemail processing
- Batch analytics
When to use WebSocket API:
- Live captioning
- Voice assistants
- Call center real-time analytics
- Interactive voice applications
The code patterns in this guide translate directly to production. Start with the REST API for prototyping, then add WebSocket streaming when real-time interaction becomes a requirement.
Resources
- Smallest AI Console — API key management
- Waves Documentation — Full API reference
- Discord Community — Developer support