Thor 雷神 Schaeff for Google AI

Originally published at fishjam.io

How Fishjam.io Built a Multi-Speaker AI Game using Gemini Live

Picture a lively dinner party: glasses clinking, half-finished sentences, and three people laughing at the same time. To a human, navigating this is instinctual. To an AI, it is a nightmare. Developers have effectively mastered the predictable flow of a one-on-one chat. But handling a group conversation, where people interrupt and talk over each other, is much more difficult.

Bernard Gawor and the Fishjam team at Software Mansion set out to showcase their Selective Forwarding Unit solution by building a unique demo app that solves this problem. That’s how the Deep Sea Stories game came to life.

The premise is simple: a group of detectives enters a conference room to solve a mystery. The twist? The "Riddle Master", the entity that knows the secret solution and answers the players' questions, is actually a Gemini Voice AI Agent. This required the agent to listen, understand, and respond to a group of users in real time.

The Anatomy of a Voice Agent

First, let’s look at how an AI Voice Agent typically processes data. It operates through a modular pipeline that includes the following steps (a minimal sketch of one conversational turn follows the diagram below):

  • Speech-to-Text (S2T): The system converts the user’s spoken input into text using models like Google Speech-to-Text, OpenAI Whisper, or ElevenLabs’ transcription service.
  • Large Language Model (LLM): The transcribed text is processed by an LLM (e.g. Gemini, GPT-4, Claude) to understand the context and generate an appropriate text response.
  • Text-to-Speech (TTS): The text response is converted back into natural-sounding speech using services like Google Cloud TTS, ElevenLabs, or Azure TTS.
  • Real-time Audio Streaming: The audio is delivered back to the user with minimal latency.

Standard Voice AI Pipeline
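
Here is that sketch: one conversational turn through the modular pipeline, in TypeScript. Every name in it (handleTurn, the stage types) is illustrative, not part of any specific vendor SDK; each stage is a pluggable service you would back with your chosen provider.

// Each stage is a pluggable service; these types are illustrative,
// not part of any specific vendor SDK.
type SpeechToText = (pcmAudio: Buffer) => Promise<string>;
type LanguageModel = (userText: string) => Promise<string>;
type TextToSpeech = (replyText: string) => Promise<Buffer>;

// One conversational turn through the modular pipeline:
// user audio -> transcript -> LLM reply -> synthesized audio.
async function handleTurn(
  userAudio: Buffer,
  stt: SpeechToText,
  llm: LanguageModel,
  tts: TextToSpeech,
  playback: (pcm: Buffer) => void,
): Promise<void> {
  const transcript = await stt(userAudio);   // 1. Speech-to-Text
  const replyText = await llm(transcript);   // 2. LLM generates a response
  const replyAudio = await tts(replyText);   // 3. Text-to-Speech
  playback(replyAudio);                      // 4. Stream audio back to the user
}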

A second architecture gaining popularity, and notably used in the newest Gemini Live API models, is Speech-to-Speech. Unlike traditional pipelines that convert speech to text and back again, this architecture feeds raw audio directly into the model and generates audio output in a single step. This unified approach not only reduces latency but also preserves non-verbal features, enabling the model to recognize and replicate subtle human emotions, tone, and pacing with high fidelity.

One-to-One vs. Group Contexts

Most standard SDKs make setting up a one-on-one conversation relatively simple. For example, using the Gemini Live API SDK:

import { GoogleGenAI, Modality } from "@google/genai";

// 1. Setup
const ai = new GoogleGenAI({ apiKey: "YOUR_API_KEY" });

async function startAgent() {
  // 2. Connect, registering callbacks for the session lifecycle
  const session = await ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: { responseModalities: [Modality.AUDIO] },
    callbacks: {
      onopen: () => console.log("Agent Connected!"),
      // 3. Listen for the Agent's Voice
      onmessage: (msg) => {
        // Fires every time the AI sends a message (e.g. an audio chunk)
        const audioData =
          msg.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
        if (audioData) {
          console.log(`Received Audio Chunk (${audioData.length} bytes)`);
          // In a real app, you would send 'audioData' to your audio output device
        }
      },
      onerror: (e) => console.error("Session error:", e.message),
      onclose: () => console.log("Session closed."),
    },
  });

  // 4. Send Your Voice (Simulated)
  // Real apps pipe microphone data here continuously
  console.log("Sending audio...");
  session.sendRealtimeInput({
    audio: {
      mimeType: "audio/pcm;rate=16000",
      data: "BASE64_ENCODED_PCM_AUDIO_STRING_GOES_HERE",
    },
  });
}

startAgent();

However, these SDKs assume a single audio input stream. In a conference room, audio streams are distinct, asynchronous, and overlapping. The Fishjam team had to determine how to aggregate these inputs for the Riddle Master without losing context or introducing unacceptable latency.

They evaluated three specific architectural strategies to handle the multi-speaker environment:

  1. Server-Side Aggregation: This method involves mixing all player audio streams into a single channel before sending it to the AI Agent (a naive-mixing sketch follows this list). While simple to implement, mixing audio makes it incredibly difficult for the Speech-to-Text (S2T) model to transcribe accurately, especially when users talk over one another. This results in “hallucinations” or missed queries.
  2. Agent per Client: This approach assigns a separate Voice AI agent to every single player in the room. This creates a chaotic user experience (all agents speaking at once) and prevents a shared game state. It is also cost-prohibitive, as every user stream consumes separate processing tokens.
  3. Server-Side Filtering using VAD: In this approach, they implemented a centralized gatekeeper using Voice Activity Detection (VAD). They wait for a player to speak, lock the “input slot” and forward only that specific player’s audio to the AI agent. Once they stop speaking, the lock is released, allowing another player to ask questions. This is the solution they finally went with.
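
For contrast, here is a minimal sketch of the naive mixing behind Strategy 1 (illustrative only, not the team’s code): summing 16-bit PCM samples from every player into one channel. The moment two people overlap, the model receives a single blended waveform it cannot cleanly separate.

// Naive server-side aggregation: sum 16-bit PCM samples from every
// player into one mixed channel (illustrative only).
function mixPcm16(streams: Int16Array[]): Int16Array {
  const length = Math.max(0, ...streams.map((s) => s.length));
  const mixed = new Int16Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const stream of streams) sum += stream[i] ?? 0;
    // Clamp to the 16-bit range to avoid wrap-around distortion
    mixed[i] = Math.max(-32768, Math.min(32767, sum));
  }
  return mixed;
}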

Beyond One-on-One: A “Deep Sea Stories” Game Web App

Key Technologies

  • Fishjam: A real-time communication platform handling peer-to-peer audio streaming via WebRTC (SFU). (Not familiar with WebRTC/SFUs? Check out their guide)
  • Gemini GenAI Voice Agent: Provides an SDK that makes creating voice agents and initializing audio conversations simple.

Architecture Overview

The game logic is handled on the backend, which manages the conferencing room and peer connections.

  • Player Connection: When players join the game using the frontend client, they connect audio/video via the Fishjam Web SDK. (See: Fishjam React Quick Start).
  • The Bridge: When the game starts, the backend creates a Fishjam Agent. This agent acts like a “ghost peer” in the audio-video room; its sole purpose is to capture the players' audio and forward it to the AI, and vice versa.
  • The Brain: The backend initiates a WebSocket connection with the Gemini agent and forwards the audio stream from players to Gemini and vice versa.

Architecture Diagram

Implementation Details

1. Initializing Clients and Game Room

import { FishjamClient } from '@fishjam-cloud/js-server-sdk';
import GeminiIntegration from '@fishjam-cloud/js-server-sdk/gemini';

// Fishjam server-side client, authenticated with the management token
const fishjamClient = new FishjamClient({
  fishjamId: process.env.FISHJAM_ID!,
  managementToken: process.env.FISHJAM_TOKEN!,
});

// Gemini client created through Fishjam's integration helper
const genAi = GeminiIntegration.createClient({
  apiKey: process.env.GOOGLE_API_KEY!,
});

// The conference room that players (and later the agent) will join
const gameRoom = await fishjamClient.createRoom();

2. Creating the Fishjam Agent

When the first player joins the game room, the backend creates the Fishjam Agent that captures the players' audio.

import GeminiIntegration from "@fishjam-cloud/js-server-sdk/gemini";

const { agent } = await fishjamClient.createAgent(gameRoom.id, {
  subscribeMode: "auto",
  // Use Fishjam's preset to match the required audio format (16 kHz)
  output: GeminiIntegration.geminiInputAudioSettings,
});
// agentTrack is used to send audio back to the players
const agentTrack = agent.createTrack(
  GeminiIntegration.geminiOutputAudioSettings,
);

3. Configuring and Initializing the AI Riddle Master

When users select a story scenario, the backend configures the Gemini agent with the specific context (the riddle solution and the “Game Master” persona).

import { Modality } from "@google/genai";

const session = await genAi.live.connect({
  model: GEMINI_MODEL,
  config: {
    responseModalities: [Modality.AUDIO],
    systemInstruction:
      "here's the story: ..., and its solution: ... you should answer only yes or no questions about this story",
  },
  callbacks: {
    // Gemini -> Fishjam
    onmessage: (msg) => {
      if (msg.data) {
        // send Riddle Master's audio responses back to players
        const pcmData = Buffer.from(msg.data, "base64");
        agent.sendData(agentTrack.id, pcmData);
      }

      if (msg.serverContent?.interrupted) {
        console.log("Agent was interrupted by user.");
        // Clears the buffer on the Fishjam media server
        agent.interruptTrack(agentTrack.id);
      }
    },
  },
});

4. Bridging Audio (The Glue)

The final piece of the puzzle is the bridge between the SFU and the AI. They capture audio streams from the Fishjam agent (what the players are saying) and pass them through a custom VAD (Voice Activity Detection) filter. This filter implements a “mutex” lock mechanism: it identifies the first active speaker, locks the channel to their ID, and forwards only their audio to Gemini. All other simultaneous audio is ignored until the active speaker finishes their turn.

VAD Logic Diagram

Below is a simplified version of this logic:

// State to track who currently "holds the floor"
let activeSpeakerId: string | null = null;

// Capture audio chunks from ALL players in the room
agent.on("audioTrack", (userId, pcmChunk) => {
  vadService.process(userId, pcmChunk);
});

// VAD Processor Logic
vadService.on("activity", (userId, isSpeaking, audioData) => {
  if (activeSpeakerId === null && isSpeaking) {
    activeSpeakerId = userId; // Lock the floor
  }

  // Only forward audio from the person currently holding the lock
  if (userId === activeSpeakerId) {
    voiceAgentSession.sendAudio(audioData);

    // If the active speaker stops speaking (silence detected), release the lock
    if (!isSpeaking) {
      // (Optional: add a debounce delay here to prevent cutting off pauses; see the sketch below)
      activeSpeakerId = null;
    }
  }
});
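
A concrete version of that optional debounce: releasing the lock on the first silent frame would cut speakers off mid-pause, so the release can be deferred until silence has lasted a while. Here is a minimal sketch reusing the vadService and voiceAgentSession objects from above; the 800 ms window is an assumed value to tune, not one from the team.

let activeSpeakerId: string | null = null;
let releaseTimer: ReturnType<typeof setTimeout> | null = null;
const SILENCE_RELEASE_MS = 800; // assumed value; tune for your players

vadService.on("activity", (userId, isSpeaking, audioData) => {
  if (activeSpeakerId === null && isSpeaking) {
    activeSpeakerId = userId; // Lock the floor
  }
  if (userId !== activeSpeakerId) return; // Ignore everyone else

  voiceAgentSession.sendAudio(audioData);

  if (isSpeaking && releaseTimer) {
    // The speaker resumed before the window elapsed: keep the lock
    clearTimeout(releaseTimer);
    releaseTimer = null;
  } else if (!isSpeaking && !releaseTimer) {
    // Start the countdown; release only after sustained silence
    releaseTimer = setTimeout(() => {
      activeSpeakerId = null;
      releaseTimer = null;
    }, SILENCE_RELEASE_MS);
  }
});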

Challenges in Group AI

Building a multi-user voice interface introduces unique challenges compared to 1-on-1 chats:

  • Floor Control: Standard Speech-to-Text models can struggle when multiple players speak simultaneously. Determining which player the AI should respond to, or whether it should simply listen, requires careful handling.
  • Latency: Real-time responsiveness is critical for immersion. The entire pipeline (Audio → Text → LLM → Audio) must complete within a few hundred milliseconds to feel conversational.
  • Audio Quality: Maintaining clear audio through transcoding and streaming across different networks is essential.

Fortunately, Fishjam’s WebRTC implementation largely solves the latency and audio quality issues. Floor control needed a carefully structured implementation on the backend, but it turned out not to be that hard!

Try the Game Yourself!

The team has implemented the functionality described above in a live demo. Gather friends and try to solve a mystery with their AI Riddle Master!

If you are working on AI-based features with real-time video or audio and need assistance, reach out to the team on Discord.
