Text-to-Speech Integration: Build Voice-Enabled AI Apps with TypeScript

Your users want voice. They want to listen while commuting, hear responses while cooking, and interact hands-free while multitasking. But adding text-to-speech to your AI application means wrestling with audio encoding, managing voice configurations, and handling streaming audio buffers.

Voice integration should take minutes, not weeks. NeuroLink makes it happen. Pass a single tts option to your existing generate() call and receive both text and audio in one response. No separate API calls. No audio processing libraries. No voice configuration headaches.

This guide walks you through complete TTS integration with NeuroLink. You will learn voice selection, streaming audio, multi-speaker podcasts, and voice assistant patterns.

TL;DR

  • One API call produces text + audio output
  • Google Cloud TTS with Studio, Neural2, WaveNet, and Standard voices
  • Real-time streaming audio for immediate playback
  • Multi-speaker podcast generation
  • 40+ languages supported

Why Voice Matters for AI Apps

Voice transforms how users interact with AI. Reading text requires attention and focus. Listening frees users to do other things.

The Accessibility Advantage

Voice output makes your application accessible to users with visual impairments. Natural AI-generated speech provides better context and nuance than screen readers. Voice also helps users with reading difficulties or those who prefer audio content.

The Engagement Difference

Voice creates emotional connection. A well-chosen voice with appropriate pacing builds trust and personality. Users remember voice interactions more vividly than text exchanges.

What NeuroLink TTS Provides

  • Unified API - Same generate() call produces text and audio
  • Google Cloud Voices - Access to Studio, Neural2, WaveNet, and Standard voices
  • Streaming Support - Real-time audio chunks for immediate playback
  • Format Options - MP3, WAV (LINEAR16), and OGG Opus output
  • Voice Control - Speaking rate, pitch, and volume adjustment

Quick Start: Your First TTS Request

Getting started takes five minutes. You need Google Cloud credentials and the NeuroLink package.

Step 1: Configure Google Cloud TTS

Enable the Cloud Text-to-Speech API in your Google Cloud Console. Create a service account and download the credentials JSON file:

# Required - Path to Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json

# For LLM provider (any supported provider)
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...

Step 2: Generate Your First Audio Response

Install the package:

pnpm add @juspay/neurolink
# or
npm install @juspay/neurolink

Then generate both text and audio in one call:
import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

async function main() {
  const ai = new NeuroLink();

  // Generate AI response with TTS audio output
  const result = await ai.generate({
    input: {
      text: "Write a friendly welcome message for new users",
      systemPrompt: "You are a helpful assistant with a warm tone",
    },
    tts: {
      enabled: true,
      provider: "google-tts",
      voice: "en-US-Studio-M",
      outputFormat: "mp3",
    },
  });

  // Save the audio file
  if (result.audio?.buffer) {
    fs.writeFileSync("welcome.mp3", result.audio.buffer);
    console.log("Audio saved to welcome.mp3");
  }

  console.log("\nText Response:", result.content);
}

main().catch(console.error);

That's it. One generate() call produces both text and audio. The TTS option integrates seamlessly with any LLM provider.

CLI equivalent:

npx @juspay/neurolink generate "Write a welcome message" \
  --tts \
  --tts-voice "en-US-Studio-M" \
  --output welcome.mp3

Voice Selection Guide

Google TTS offers four voice tiers with different quality levels and pricing.

Voice Quality Tiers

| Voice Type | Quality | Use Case | Cost per 1M chars |
|------------|---------|----------|-------------------|
| Studio | Premium | Production apps, customer-facing | ~$160 |
| Neural2 | High | Standard production apps | ~$16 |
| WaveNet | High | Natural-sounding speech | ~$16 |
| Standard | Good | Development, testing | ~$4 |
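
A quick back-of-the-envelope cost check based on the approximate rates above (a sketch; actual Google Cloud pricing may differ by region and free tier):

// Rates are the approximate per-1M-character prices from the table above
const ratePerMillionChars = { studio: 160, neural2: 16, standard: 4 };

function monthlyCost(chars: number, tier: keyof typeof ratePerMillionChars): number {
  return (chars / 1_000_000) * ratePerMillionChars[tier];
}

// 5M characters per month: Studio ≈ $800, Standard ≈ $20
console.log(monthlyCost(5_000_000, "studio"));   // 800
console.log(monthlyCost(5_000_000, "standard")); // 20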

Voice Selection Recommendations

| Scenario | Recommended Voice | Rationale |
|----------|-------------------|-----------|
| Development/Testing | en-US-Standard-A | Low cost, fast iteration |
| Internal Tools | en-US-Neural2-A | Good quality, reasonable cost |
| Customer-Facing Apps | en-US-Studio-M | Premium quality, professional |
| Podcasts/Content | en-US-Studio-O | Broadcast quality |
| High-Volume Processing | en-US-Standard-* | Cost-effective at scale |
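
A common pattern implied by this table is switching tiers by environment. A minimal sketch, using the voice names from the table above:

// Cheap Standard voice while developing, Studio voice in production
const voiceName =
  process.env.NODE_ENV === "production"
    ? "en-US-Studio-M"
    : "en-US-Standard-A";

const tts = { enabled: true, provider: "google-tts", voice: voiceName };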

Discovering Available Voices

import { NeuroLink } from "@juspay/neurolink";

async function listVoices() {
  const ai = new NeuroLink();
  const voices = await ai.tts.getVoices();

  console.log(`Total voices available: ${voices.length}`);

  // Filter by language
  const englishVoices = voices.filter((v) => v.language.startsWith("en"));
  console.log(`English voices: ${englishVoices.length}`);

  englishVoices.slice(0, 10).forEach((voice) => {
    console.log(
      `  ${voice.name} - ${voice.gender} - ${voice.language} (${voice.type})`
    );
  });
}

listVoices().catch(console.error);

CLI equivalent:

npx @juspay/neurolink tts voices --provider google-tts --language en-US

Streaming Audio

Real-time audio streaming enables immediate playback. Users hear the response as it generates instead of waiting for completion.

import { NeuroLink } from "@juspay/neurolink";

async function streamWithAudio() {
  const ai = new NeuroLink();

  const stream = await ai.stream({
    input: { text: "Explain the history of artificial intelligence" },
    tts: {
      enabled: true,
      streaming: true,
      voice: "en-US-Neural2-A",
    },
  });

  let textContent = "";
  let audioChunks = 0;

  for await (const chunk of stream) {
    if (chunk.content) {
      process.stdout.write(chunk.content);
      textContent += chunk.content;
    }

    if (chunk.audio) {
      audioChunks++;
    }
  }

  console.log(`\nTotal characters: ${textContent.length}`);
  console.log(`Audio chunks received: ${audioChunks}`);
}

streamWithAudio().catch(console.error);

Streaming Benefits

  1. Reduced Latency - Users hear audio within seconds, not after full generation
  2. Memory Efficiency - Process chunks as they arrive instead of buffering entire responses (see the sketch after this list)
  3. Progressive Enhancement - Degrades gracefully to text-only if audio playback fails
  4. Real-time Feedback - Users know the system is working
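
To illustrate chunk-by-chunk processing, here is a minimal sketch that writes streamed audio to disk as it arrives. It assumes each audio chunk exposes its bytes as chunk.audio.buffer, mirroring the result.audio.buffer shape from the non-streaming example; check the shape your installed version actually returns.

import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

async function streamToFile() {
  const ai = new NeuroLink();
  const out = fs.createWriteStream("history.mp3");

  const stream = await ai.stream({
    input: { text: "Explain the history of artificial intelligence" },
    tts: { enabled: true, streaming: true, voice: "en-US-Neural2-A" },
  });

  for await (const chunk of stream) {
    // Write each audio chunk as it arrives rather than buffering the full response
    if (chunk.audio?.buffer) {
      out.write(chunk.audio.buffer);
    }
  }

  out.end();
}

streamToFile().catch(console.error);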

Podcast Generation Pipeline

Generate multi-speaker podcast episodes with different voices for each speaker:

import { NeuroLink } from "@juspay/neurolink";
import fs from "fs";

interface PodcastSection {
  speaker: "host" | "guest";
  text: string;
}

async function generatePodcastEpisode(script: PodcastSection[]) {
  const ai = new NeuroLink();
  const audioSegments: Buffer[] = [];

  for (let i = 0; i < script.length; i++) {
    const section = script[i];
    console.log(`Processing section ${i + 1}/${script.length} (${section.speaker})...`);

    const result = await ai.generate({
      input: {
        text: section.text,
        systemPrompt: `Speak naturally as a ${section.speaker}`,
      },
      tts: {
        enabled: true,
        voice:
          section.speaker === "host"
            ? "en-US-Studio-M"   // Male host voice
            : "en-US-Studio-O", // Female guest voice
        speakingRate: 0.95,
      },
    });

    if (result.audio?.buffer) {
      audioSegments.push(result.audio.buffer);
    }
  }

  return Buffer.concat(audioSegments);
}

This pattern works for any multi-speaker format: interviews, dialogues, audiobooks, or educational content.
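
A usage sketch for the function above. Directly concatenating MP3 segments plays fine in most players, though a production pipeline might re-encode the joined audio with a dedicated tool:

import fs from "fs";

const script: PodcastSection[] = [
  { speaker: "host", text: "Welcome back to the show. Today: voice-enabled AI apps." },
  { speaker: "guest", text: "Thanks for having me. Voice changes how users interact with software." },
  { speaker: "host", text: "Let's get into how that works in practice." },
];

generatePodcastEpisode(script)
  .then((episode) => {
    fs.writeFileSync("episode.mp3", episode);
    console.log(`Episode saved (${episode.length} bytes)`);
  })
  .catch(console.error);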


Voice Assistant Integration

Build conversational voice assistants with memory and context:

import { NeuroLink } from "@juspay/neurolink";

// Create the client once so conversation memory persists across turns;
// constructing a new instance per query would reset the conversation
const ai = new NeuroLink({
  conversationMemory: { enabled: true },
});

async function voiceAssistant(userQuery: string) {
  const result = await ai.generate({
    input: { text: userQuery },
    tts: {
      enabled: true,
      voice: "en-US-Neural2-A",
    },
  });

  return {
    text: result.content,
    audio: result.audio?.buffer,
    conversationId: result.conversationId,
  };
}

Because the client is created once and reused, the assistant maintains conversation context across turns. Each response includes both text and audio.
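
A minimal multi-turn usage sketch, assuming conversation memory is keyed to the shared client instance above:

import fs from "fs";

async function demo() {
  // Turn 1 establishes context; turn 2 relies on conversation memory
  const first = await voiceAssistant("What is the capital of France?");
  console.log(first.text);

  const followUp = await voiceAssistant("What is its population?");
  console.log(followUp.text);

  if (followUp.audio) {
    fs.writeFileSync("answer.mp3", followUp.audio);
  }
}

demo().catch(console.error);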


CLI Workflows

The NeuroLink CLI provides quick access to TTS features for testing and prototyping.

# Basic TTS generation
npx @juspay/neurolink generate "Welcome to our platform!" \
  --tts --tts-voice "en-US-Studio-M" --output welcome.mp3

# Stream with voice
npx @juspay/neurolink stream "Tell me a bedtime story" \
  --tts --tts-voice "en-US-Studio-O"

# List available voices
npx @juspay/neurolink tts voices

# Test a specific voice
npx @juspay/neurolink tts test "Hello, this is a voice test" \
  --voice "en-US-Studio-M" --output test.mp3

Audio Quality Settings

Fine-tune audio output with configuration options:

const ttsConfig = {
  tts: {
    enabled: true,
    provider: "google-tts",
    voice: "en-US-Studio-M",
    audioEncoding: "MP3",     // Options: MP3, LINEAR16, OGG_OPUS
    speakingRate: 1.0,        // Range: 0.25 to 4.0
    pitch: 0.0,               // Range: -20.0 to 20.0
    volumeGainDb: 0.0         // Range: -96.0 to 16.0
  }
};
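
The config object drops straight into a generate() call. A minimal sketch reusing the ttsConfig defined above:

import { NeuroLink } from "@juspay/neurolink";

async function run() {
  const ai = new NeuroLink();
  const result = await ai.generate({
    input: { text: "Read today's briefing" },
    ...ttsConfig, // spreads the tts block defined above into the request
  });
  console.log(result.content);
}

run().catch(console.error);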

Audio Format Comparison

| Format | Use Case | File Size | Quality |
|--------|----------|-----------|---------|
| MP3 | General use, web apps | Small | Good |
| LINEAR16 | Professional audio, editing | Large | Lossless |
| OGG_OPUS | Low-latency streaming | Small | Excellent |

Speaking Rate Guidelines

| Rate | Effect | Best For |
|------|--------|----------|
| 0.75 | Slow, deliberate | Accessibility, complex content |
| 1.0 | Normal speed | General use |
| 1.15 | Slightly faster | Notifications, quick updates |
| 1.5 | Fast | Speed listeners, time-sensitive |
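
For example, an accessibility-oriented preset based on the table above might look like this (a sketch; pass it as the tts option on any generate() call):

// Slow, deliberate delivery for accessibility-focused output
const accessibleTts = {
  enabled: true,
  voice: "en-US-Neural2-A",
  speakingRate: 0.75,
};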

Summary

Voice transforms AI applications from tools into companions. NeuroLink makes this transformation effortless.

You learned how to:

  • Generate audio output with a single tts option in generate()
  • Select the right voice tier for your use case and budget
  • Stream audio chunks for real-time playback
  • Build multi-speaker podcasts with distinct voices
  • Create conversational voice assistants with memory
  • Use CLI workflows for rapid TTS prototyping
  • Fine-tune audio quality with encoding and modulation settings

Stop building separate audio pipelines. Start shipping voice features.


Found this helpful? Drop a comment below with your questions!
