Text-to-Speech Integration: Build Voice-Enabled AI Apps with TypeScript

Your users want voice. They want to listen while commuting, hear responses while cooking, and interact hands-free while multitasking. But adding text-to-speech to your AI application means wrestling with audio encoding, managing voice configurations, and handling streaming audio buffers.

Voice integration should take minutes, not weeks. NeuroLink makes it happen. Pass a single tts option to your existing generate() call and receive both text and audio in one response. No separate API calls. No audio processing libraries. No voice configuration headaches.

This guide walks you through complete TTS integration with NeuroLink. You will learn voice selection, streaming audio, multi-speaker podcasts, and voice assistant patterns.

TL;DR

  • One API call produces text + audio output
  • Google Cloud TTS with Studio, Neural2, WaveNet, and Standard voices
  • Real-time streaming audio for immediate playback
  • Multi-speaker podcast generation
  • 40+ languages supported

Why Voice Matters for AI Apps

Voice transforms how users interact with AI. Reading text requires attention and focus. Listening frees users to do other things.

The Accessibility Advantage

Voice output makes your application accessible to users with visual impairments. Natural AI-generated speech provides better context and nuance than screen readers. Voice also helps users with reading difficulties or those who prefer audio content.

The Engagement Difference

Voice creates emotional connection. A well-chosen voice with appropriate pacing builds trust and personality. Users remember voice interactions more vividly than text exchanges.

What NeuroLink TTS Provides

  • Unified API - Same generate() call produces text and audio
  • Google Cloud Voices - Access to Studio, Neural2, WaveNet, and Standard voices
  • Streaming Support - Real-time audio chunks for immediate playback
  • Format Options - MP3, WAV (LINEAR16), and OGG Opus output
  • Voice Control - Speaking rate, pitch, and volume adjustment

Quick Start: Your First TTS Request

Getting started takes five minutes. You need Google Cloud credentials and the NeuroLink package.

Step 1: Configure Google Cloud TTS

Enable the Cloud Text-to-Speech API in your Google Cloud Console. Create a service account and download the credentials JSON file:

# Required - Path to Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json

# For LLM provider (any supported provider)
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...

Step 2: Generate Your First Audio Response

Install the package:

pnpm add @juspay/neurolink
# or
npm install @juspay/neurolink

Then generate both text and audio in one call:
import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

async function main() {
  const ai = new NeuroLink();

  // Generate AI response with TTS audio output
  const result = await ai.generate({
    input: {
      text: "Write a friendly welcome message for new users",
      systemPrompt: "You are a helpful assistant with a warm tone",
    },
    tts: {
      enabled: true,
      provider: "google-tts",
      voice: "en-US-Studio-M",
      outputFormat: "mp3",
    },
  });

  // Save the audio file
  if (result.audio?.buffer) {
    fs.writeFileSync("welcome.mp3", result.audio.buffer);
    console.log("Audio saved to welcome.mp3");
  }

  console.log("\nText Response:", result.content);
}

main().catch(console.error);

That's it. One generate() call produces both text and audio. The TTS option integrates seamlessly with any LLM provider.

CLI equivalent:

npx @juspay/neurolink generate "Write a welcome message" \
  --tts \
  --tts-voice "en-US-Studio-M" \
  --output welcome.mp3

Voice Selection Guide

Google TTS offers four voice tiers with different quality levels and pricing.

Voice Quality Tiers

| Voice Type | Quality | Use Case | Cost per 1M chars |
|------------|---------|----------|-------------------|
| Studio | Premium | Production apps, customer-facing | ~$160 |
| Neural2 | High | Standard production apps | ~$16 |
| WaveNet | High | Natural-sounding speech | ~$16 |
| Standard | Good | Development, testing | ~$4 |
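
A quick back-of-the-envelope cost check based on the approximate rates above (a sketch; actual Google Cloud pricing may differ by region and free tier):

// Rates are the approximate per-1M-character prices from the table above
const ratePerMillionChars = { studio: 160, neural2: 16, standard: 4 };

function monthlyCost(chars: number, tier: keyof typeof ratePerMillionChars): number {
  return (chars / 1_000_000) * ratePerMillionChars[tier];
}

// 5M characters per month: Studio ≈ $800, Standard ≈ $20
console.log(monthlyCost(5_000_000, "studio"));   // 800
console.log(monthlyCost(5_000_000, "standard")); // 20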

Voice Selection Recommendations

| Scenario | Recommended Voice | Rationale |
|----------|-------------------|-----------|
| Development/Testing | en-US-Standard-A | Low cost, fast iteration |
| Internal Tools | en-US-Neural2-A | Good quality, reasonable cost |
| Customer-Facing Apps | en-US-Studio-M | Premium quality, professional |
| Podcasts/Content | en-US-Studio-O | Broadcast quality |
| High-Volume Processing | en-US-Standard-* | Cost-effective at scale |
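
A common pattern implied by this table is switching tiers by environment. A minimal sketch, using the voice names from the table above:

// Cheap Standard voice while developing, Studio voice in production
const voiceName =
  process.env.NODE_ENV === "production"
    ? "en-US-Studio-M"
    : "en-US-Standard-A";

const tts = { enabled: true, provider: "google-tts", voice: voiceName };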

Discovering Available Voices

import { NeuroLink } from "@juspay/neurolink";

async function listVoices() {
  const ai = new NeuroLink();
  const voices = await ai.tts.getVoices();

  console.log(`Total voices available: ${voices.length}`);

  // Filter by language
  const englishVoices = voices.filter((v) => v.language.startsWith("en"));
  console.log(`English voices: ${englishVoices.length}`);

  englishVoices.slice(0, 10).forEach((voice) => {
    console.log(
      `  ${voice.name} - ${voice.gender} - ${voice.language} (${voice.type})`
    );
  });
}

listVoices().catch(console.error);

CLI equivalent:

npx @juspay/neurolink tts voices --provider google-tts --language en-US

Streaming Audio

Real-time audio streaming enables immediate playback. Users hear the response as it generates instead of waiting for completion.

import { NeuroLink } from "@juspay/neurolink";

async function streamWithAudio() {
  const ai = new NeuroLink();

  const stream = await ai.stream({
    input: { text: "Explain the history of artificial intelligence" },
    tts: {
      enabled: true,
      streaming: true,
      voice: "en-US-Neural2-A",
    },
  });

  let textContent = "";
  let audioChunks = 0;

  for await (const chunk of stream) {
    if (chunk.content) {
      process.stdout.write(chunk.content);
      textContent += chunk.content;
    }

    if (chunk.audio) {
      audioChunks++;
    }
  }

  console.log(`\nTotal characters: ${textContent.length}`);
  console.log(`Audio chunks received: ${audioChunks}`);
}

streamWithAudio().catch(console.error);

Streaming Benefits

  1. Reduced Latency - Users hear audio within seconds, not after full generation
  2. Memory Efficiency - Process chunks as they arrive instead of buffering entire responses (see the sketch after this list)
  3. Progressive Enhancement - Degrades gracefully to text-only if audio playback fails
  4. Real-time Feedback - Users know the system is working
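
To illustrate chunk-by-chunk processing, here is a minimal sketch that writes streamed audio to disk as it arrives. It assumes each audio chunk exposes its bytes as chunk.audio.buffer, mirroring the result.audio.buffer shape from the non-streaming example; check the shape your installed version actually returns.

import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

async function streamToFile() {
  const ai = new NeuroLink();
  const out = fs.createWriteStream("history.mp3");

  const stream = await ai.stream({
    input: { text: "Explain the history of artificial intelligence" },
    tts: { enabled: true, streaming: true, voice: "en-US-Neural2-A" },
  });

  for await (const chunk of stream) {
    // Write each audio chunk as it arrives rather than buffering the full response
    if (chunk.audio?.buffer) {
      out.write(chunk.audio.buffer);
    }
  }

  out.end();
}

streamToFile().catch(console.error);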

Podcast Generation Pipeline

Generate multi-speaker podcast episodes with different voices for each speaker:

import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

interface PodcastSection {
  speaker: "host" | "guest";
  text: string;
}

async function generatePodcastEpisode(script: PodcastSection[]) {
  const ai = new NeuroLink();
  const audioSegments: Buffer[] = [];

  for (let i = 0; i < script.length; i++) {
    const section = script[i];
    console.log(`Processing section ${i + 1}/${script.length} (${section.speaker})...`);

    const result = await ai.generate({
      input: {
        text: section.text,
        systemPrompt: `Speak naturally as a ${section.speaker}`,
      },
      tts: {
        enabled: true,
        voice:
          section.speaker === "host"
            ? "en-US-Studio-M"   // Male host voice
            : "en-US-Studio-O", // Female guest voice
        speakingRate: 0.95,
      },
    });

    if (result.audio?.buffer) {
      audioSegments.push(result.audio.buffer);
    }
  }

  return Buffer.concat(audioSegments);
}

This pattern works for any multi-speaker format: interviews, dialogues, audiobooks, or educational content.
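
A usage sketch for the function above. Directly concatenating MP3 segments plays fine in most players, though a production pipeline might re-encode the joined audio with a dedicated tool:

import fs from "fs";

const script: PodcastSection[] = [
  { speaker: "host", text: "Welcome back to the show. Today: voice-enabled AI apps." },
  { speaker: "guest", text: "Thanks for having me. Voice changes how users interact with software." },
  { speaker: "host", text: "Let's get into how that works in practice." },
];

generatePodcastEpisode(script)
  .then((episode) => {
    fs.writeFileSync("episode.mp3", episode);
    console.log(`Episode saved (${episode.length} bytes)`);
  })
  .catch(console.error);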


Voice Assistant Integration

Build conversational voice assistants with memory and context:

import { NeuroLink } from "@juspay/neurolink";

// Create the client once so conversation memory persists across turns;
// constructing a new instance per query would reset the conversation
const ai = new NeuroLink({
  conversationMemory: { enabled: true },
});

async function voiceAssistant(userQuery: string) {
  const result = await ai.generate({
    input: { text: userQuery },
    tts: {
      enabled: true,
      voice: "en-US-Neural2-A",
    },
  });

  return {
    text: result.content,
    audio: result.audio?.buffer,
    conversationId: result.conversationId,
  };
}

Because the client is created once and reused, the assistant maintains conversation context across turns. Each response includes both text and audio.
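
A minimal multi-turn usage sketch, assuming conversation memory is keyed to the shared client instance above:

import fs from "fs";

async function demo() {
  // Turn 1 establishes context; turn 2 relies on conversation memory
  const first = await voiceAssistant("What is the capital of France?");
  console.log(first.text);

  const followUp = await voiceAssistant("What is its population?");
  console.log(followUp.text);

  if (followUp.audio) {
    fs.writeFileSync("answer.mp3", followUp.audio);
  }
}

demo().catch(console.error);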


CLI Workflows

The NeuroLink CLI provides quick access to TTS features for testing and prototyping.

# Basic TTS generation
npx @juspay/neurolink generate "Welcome to our platform!" \
  --tts --tts-voice "en-US-Studio-M" --output welcome.mp3

# Stream with voice
npx @juspay/neurolink stream "Tell me a bedtime story" \
  --tts --tts-voice "en-US-Studio-O"

# List available voices
npx @juspay/neurolink tts voices

# Test a specific voice
npx @juspay/neurolink tts test "Hello, this is a voice test" \
  --voice "en-US-Studio-M" --output test.mp3

Audio Quality Settings

Fine-tune audio output with configuration options:

const ttsConfig = {
  tts: {
    enabled: true,
    provider: "google-tts",
    voice: "en-US-Studio-M",
    audioEncoding: "MP3",     // Options: MP3, LINEAR16, OGG_OPUS
    speakingRate: 1.0,        // Range: 0.25 to 4.0
    pitch: 0.0,               // Range: -20.0 to 20.0
    volumeGainDb: 0.0         // Range: -96.0 to 16.0
  }
};
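
The config object drops straight into a generate() call. A minimal sketch reusing the ttsConfig defined above:

import { NeuroLink } from "@juspay/neurolink";

async function run() {
  const ai = new NeuroLink();
  const result = await ai.generate({
    input: { text: "Read today's briefing" },
    ...ttsConfig, // spreads the tts block defined above into the request
  });
  console.log(result.content);
}

run().catch(console.error);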

Audio Format Comparison

| Format | Use Case | File Size | Quality |
|--------|----------|-----------|---------|
| MP3 | General use, web apps | Small | Good |
| LINEAR16 | Professional audio, editing | Large | Lossless |
| OGG_OPUS | Low-latency streaming | Small | Excellent |

Speaking Rate Guidelines

| Rate | Effect | Best For |
|------|--------|----------|
| 0.75 | Slow, deliberate | Accessibility, complex content |
| 1.0 | Normal speed | General use |
| 1.15 | Slightly faster | Notifications, quick updates |
| 1.5 | Fast | Speed listeners, time-sensitive |
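
For example, an accessibility-oriented preset based on the table above might look like this (a sketch; pass it as the tts option on any generate() call):

// Slow, deliberate delivery for accessibility-focused output
const accessibleTts = {
  enabled: true,
  voice: "en-US-Neural2-A",
  speakingRate: 0.75,
};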

Summary

Voice transforms AI applications from tools into companions. NeuroLink makes this transformation effortless.

You learned how to:

  • Generate audio output with a single tts option in generate()
  • Select the right voice tier for your use case and budget
  • Stream audio chunks for real-time playback
  • Build multi-speaker podcasts with distinct voices
  • Create conversational voice assistants with memory
  • Use CLI workflows for rapid TTS prototyping
  • Fine-tune audio quality with encoding and modulation settings

Stop building separate audio pipelines. Start shipping voice features.


Found this helpful? Drop a comment below with your questions!
