Audio Segmentation with YAMNet: Detecting Speech, Music, and Silence

This article explains a Python program that analyzes an audio file and automatically segments it into Speech, Music, or Silence. The system combines volume-based filtering with a deep learning model to produce structured time-based segments, followed by boundary refinement for improved accuracy.


Overview of the Approach

The pipeline consists of five main stages:

  1. Audio loading and preprocessing
  2. Volume-based silence detection
  3. Audio classification using YAMNet
  4. Coarse segmentation
  5. Fine boundary refinement

The final output is a list of time-stamped segments showing when speech, music, or silence occurs in the audio file.
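
As an orientation, here is the main flow, condensed from the full script at the end of this post; each stage is covered in its own section below.

# Condensed from the script's __main__ block
y, sr = librosa.load(audio_path, sr=TARGET_SR, mono=True)       # 1. load, resample to 16 kHz, downmix to mono
coarse = coarse_pass(y)                                         # 2-4. volume check + YAMNet label per 8 s chunk
segments = build_segments(coarse)                               # merge adjacent chunks with the same label
refined_segments = refine_segments_with_finepass(y, segments)   # 5. fine boundary refinement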


Installation and Dependencies

The required libraries are:

pip install tensorflow tensorflow-hub librosa numpy

Libraries Used

  • TensorFlow – deep learning framework
  • TensorFlow Hub – loads pretrained YAMNet model
  • Librosa – audio loading and processing
  • NumPy – numerical computation

Configuration Parameters

The system is controlled by several configuration values:

Audio Settings

  • Target sample rate: 16,000 Hz
  • Audio is converted to mono.

Coarse Segmentation

  • Window size: 8 seconds

Fine Boundary Refinement

  • Search radius: ±0.8 seconds
  • Fine window length: 0.96 seconds
  • Hop size: 0.48 seconds

Volume Thresholds

  • Silence: below −55 dBFS
  • Low-level noise: below −45 dBFS

Classification Thresholds

  • Speech and music scores are compared directly.
  • The higher score determines the label.
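
These map directly to the module-level constants defined at the top of the script:

TARGET_SR = 16000              # resample everything to 16 kHz mono
CHUNK_SEC = 8.0                # coarse window length (seconds)
FINE_SEARCH_RADIUS_SEC = 0.8   # search ±0.8 s around each coarse boundary
FINE_WIN_SEC = 0.96            # fine window length
FINE_HOP_SEC = 0.48            # fine hop size
SILENCE_DBFS = -55.0           # below this RMS level: "Silence"
LOW_NOISE_DBFS = -45.0         # below this RMS level: "Silence2" (low-level noise)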

Loading the YAMNet Model

The program loads YAMNet, a pretrained audio event classification model from TensorFlow Hub:

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

YAMNet outputs probability scores across many audio classes. From this list, the program extracts specific categories:

Speech-related classes

  • Speech
  • Child speech, kid speaking
  • Conversation
  • Narration, monologue
  • Babbling
  • Speech synthesizer
  • Shout / Yell
  • Whispering
  • Singing

Music-related classes

  • Music
  • Musical instrument

The model’s per-frame class probabilities are averaged across frames, then summed over the selected class indices to compute:

  • speech_score
  • music_score
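
The computation, as implemented in classify_chunk_with_yamnet in the full script:

waveform = tf.convert_to_tensor(chunk_16k, dtype=tf.float32)
scores, embeddings, spectrogram = yamnet(waveform)    # scores has shape (num_frames, 521)
mean_scores = scores.numpy().mean(axis=0)             # average each class over all frames
speech_score = float(mean_scores[speech_idxs].sum())  # sum over the speech-related classes
music_score = float(mean_scores[music_idxs].sum())    # sum over the music-related classes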

Volume-Based Silence Detection

Before running the deep learning model, the program evaluates signal volume using:

  • RMS (Root Mean Square) level
  • Peak amplitude level

Both are converted to dBFS (decibels relative to full scale).

Logic

  • If RMS < −55 dBFS → label as "Silence"
  • Else if RMS < −45 dBFS → label as "Silence2" (low-level noise)
  • Otherwise → proceed to YAMNet classification

This step avoids unnecessary model computation for very quiet sections.
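
The dBFS conversion and thresholding, condensed from the script:

def rms_dbfs(x):
    rms = np.sqrt(np.mean(np.square(x)) + 1e-12)   # epsilon avoids log(0) on pure silence
    return 20.0 * np.log10(rms + 1e-12)

def classify_volume_level(x):
    r, p = rms_dbfs(x), peak_dbfs(x)
    if r < SILENCE_DBFS:        # quieter than -55 dBFS
        return "Silence", r, p
    if r < LOW_NOISE_DBFS:      # quieter than -45 dBFS (low-level noise)
        return "Silence2", r, p
    return None, r, p           # loud enough: hand the chunk to YAMNet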


Chunk Classification Logic

For each chunk:

  1. Check the volume level.
  2. If the chunk is not silence:
     • Run YAMNet.
     • Compute speech and music scores.
     • Assign the label:

label = "Speech" if speech_score >= music_score else "Music"

The result includes:

  • Label
  • RMS level
  • Peak level
  • Speech score
  • Music score

Coarse Segmentation

The entire audio file is divided into 8-second chunks.

Each chunk is:

  • Zero-padded if necessary
  • Classified independently

The output is a list of labeled intervals with start and end times.

Example:

00:00:00 - 00:00:08  Speech
00:00:08 - 00:00:16  Music
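
The chunking itself, condensed from coarse_pass in the full script:

chunk_len = int(CHUNK_SEC * TARGET_SR)               # 8 s at 16 kHz = 128,000 samples
n_chunks = int(np.ceil(len(y) / chunk_len))
for i in range(n_chunks):
    chunk = y[i * chunk_len:(i + 1) * chunk_len]
    if len(chunk) < chunk_len:                       # zero-pad the final partial chunk
        chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
    result = decide_label(chunk)                     # volume gate + YAMNet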

Building Continuous Segments

Adjacent chunks with the same label are merged.

For example:

Chunk     Label
0–8s      Speech
8–16s     Speech
16–24s    Music

Becomes:

  • 0–16s → Speech
  • 16–24s → Music

This creates cleaner segment boundaries.
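
The merge is a single pass over the coarse results (build_segments in the full script): extend the last segment while the label repeats, otherwise start a new one.

segs = [{"label": coarse[0]["label"], "start": coarse[0]["start_sec"], "end": coarse[0]["end_sec"]}]
for r in coarse[1:]:
    if r["label"] == segs[-1]["label"]:
        segs[-1]["end"] = r["end_sec"]      # same label: extend the running segment
    else:                                   # label changed: start a new segment
        segs.append({"label": r["label"], "start": r["start_sec"], "end": r["end_sec"]})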


Fine Boundary Refinement

Coarse segmentation uses large windows, so boundaries may not be precise.

To improve accuracy:

For each boundary between two segments with different labels:

  1. Search within ±0.8 seconds of the coarse boundary.
  2. Slide a smaller 0.96-second window across that range (0.48-second hop).
  3. Reclassify each fine window.
  4. Choose the boundary position that maximizes label consistency:
  • Windows to the left should match the previous label.
  • Windows to the right should match the next label.

This optimization improves transition accuracy between speech and music.
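
The consistency score is computed with prefix/suffix counts over the fine-window labels (refine_single_boundary in the full script): for each candidate split index k, count how many windows up to k carry the left label and how many windows after k carry the right label, then keep the split with the highest total.

# left_prefix[i]  = how many of the first i windows match the left (previous) label
# right_suffix[i] = how many windows from index i onward match the right (next) label
for k in range(len(labels) - 1):
    score = left_prefix[k + 1] + right_suffix[k + 1]
    if score > best_score:
        best_score, best_k = score, k
refined_t = starts[best_k] + win    # place the boundary at the end of the best left-side window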


Time Formatting

Segments are formatted as:

HH:MM:SS

A helper function converts the second counts into hours, minutes, and seconds.
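
The helper, as defined in the script:

def sec_to_hms(sec):
    total = int(round(sec))          # round to the nearest whole second
    h = total // 3600
    m = (total % 3600) // 60
    s = total % 60
    return f"{h:02d}:{m:02d}:{s:02d}"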


Example Output

The program prints two results:

Coarse Segments

=== Segments (Coarse) ===
00:00:00 - 00:00:16  Speech
00:00:16 - 00:00:32  Music

Refined Segments

=== Segments (Refined) ===
00:00:00 - 00:00:14  Speech
00:00:14 - 00:00:32  Music

In this example, refinement moves the Speech/Music boundary from 16 s to 14 s based on the fine-window analysis.


Execution

Run the script from the command line:

python script.py input_audio.wav

The program:

  1. Loads the audio.
  2. Performs coarse segmentation.
  3. Refines segment boundaries.
  4. Prints time-aligned labels.

Summary

This system performs structured audio segmentation using:

  • Signal energy analysis for silence detection
  • A pretrained deep learning model (YAMNet) for classification
  • Two-stage segmentation (coarse + refined) for improved boundary precision

It is suitable for applications such as:

  • Podcast structure analysis
  • Automatic content indexing
  • Speech/music separation preprocessing
  • Media archiving workflows

The combination of deterministic signal processing and neural network inference provides both computational efficiency and classification robustness.

Full Code

import csv
import numpy as np
import librosa
import tensorflow as tf
import tensorflow_hub as hub
from sys import argv

# ---------------------------- 
# Configuration 
# ---------------------------- 
TARGET_SR = 16000 

# Coarse window 
CHUNK_SEC = 8.0 

# Fine boundary refinement 
FINE_SEARCH_RADIUS_SEC = 0.8 
FINE_WIN_SEC = 0.96 
FINE_HOP_SEC = 0.48 

# Volume thresholds 
SILENCE_DBFS = -55.0 
LOW_NOISE_DBFS = -45.0 

# Speech / Music thresholds (reserved for tuning; the current logic simply
# compares speech_score and music_score directly)
SPEECH_TH = 0.20
MUSIC_TH = 0.20

# ---------------------------- 
# Load YAMNet 
# ---------------------------- 
yamnet = hub.load("https://tfhub.dev/google/yamnet/1") 

class_map_path = yamnet.class_map_path().numpy().decode("utf-8")

# The class map is a CSV with columns index,mid,display_name. Some display
# names contain commas (e.g. "Child speech, kid speaking"), so parse with the
# csv module instead of a naive split(",").
with open(class_map_path, "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip header row
    class_names = [row[2] for row in reader]

SPEECH_CLASSES = { 
    "Speech", 
    "Child speech, kid speaking", 
    "Conversation", 
    "Narration, monologue", 
    "Babbling", 
    "Speech synthesizer", 
    "Shout", 
    "Yell", 
    "Whispering", 
    "Singing", 
} 

MUSIC_CLASSES = { 
    "Music", 
    "Musical instrument", 
} 

name_to_idx = {n: i for i, n in enumerate(class_names)} 
speech_idxs = [name_to_idx[n] for n in SPEECH_CLASSES if n in name_to_idx] 
music_idxs = [name_to_idx[n] for n in MUSIC_CLASSES if n in name_to_idx] 

# ---------------------------- 
# Volume utilities 
# ---------------------------- 
def rms_dbfs(x): 
    rms = np.sqrt(np.mean(np.square(x)) + 1e-12) 
    return 20.0 * np.log10(rms + 1e-12) 

def peak_dbfs(x): 
    peak = np.max(np.abs(x)) + 1e-12 
    return 20.0 * np.log10(peak + 1e-12) 

def classify_volume_level(x): 
    r = rms_dbfs(x) 
    p = peak_dbfs(x) 

    if r < SILENCE_DBFS: 
        return "Silence", r, p 
    if r < LOW_NOISE_DBFS: 
        return "Silence2", r, p 
    return None, r, p 

# ---------------------------- 
# YAMNet classification 
# ---------------------------- 
def classify_chunk_with_yamnet(chunk_16k): 
    waveform = tf.convert_to_tensor(chunk_16k, dtype=tf.float32) 
    scores, embeddings, spectrogram = yamnet(waveform) 
    mean_scores = scores.numpy().mean(axis=0) 

    speech_score = float(mean_scores[speech_idxs].sum()) if speech_idxs else 0.0 
    music_score = float(mean_scores[music_idxs].sum()) if music_idxs else 0.0 

    return speech_score, music_score 

def decide_label(chunk_16k): 
    volume_label, r_db, p_db = classify_volume_level(chunk_16k) 

    if volume_label is not None: 
        return { 
            "label": volume_label, 
            "rms_dbfs": r_db, 
            "peak_dbfs": p_db, 
            "speech_score": 0.0, 
            "music_score": 0.0, 
        } 

    speech_score, music_score = classify_chunk_with_yamnet(chunk_16k) 

    label = "Speech" if speech_score >= music_score else "Music" 

    return { 
        "label": label, 
        "rms_dbfs": r_db, 
        "peak_dbfs": p_db, 
        "speech_score": speech_score, 
        "music_score": music_score, 
    } 

# ---------------------------- 
# Window slicing 
# ---------------------------- 
def slice_window(y, start_s, win_s): 
    start = int(round(start_s * TARGET_SR)) 
    win = int(round(win_s * TARGET_SR)) 

    if start < 0: 
        pad_left = -start 
        start = 0 
    else: 
        pad_left = 0 

    end = start + win 
    chunk = y[start:end] 

    if pad_left > 0: 
        chunk = np.pad(chunk, (pad_left, 0)) 
    if len(chunk) < win: 
        chunk = np.pad(chunk, (0, win - len(chunk))) 

    return chunk 

# ---------------------------- 
# Coarse pass 
# ---------------------------- 
def coarse_pass(y): 
    chunk_len = int(CHUNK_SEC * TARGET_SR) 
    n_chunks = int(np.ceil(len(y) / chunk_len)) 
    out = [] 

    for i in range(n_chunks): 
        start = i * chunk_len 
        end = min((i + 1) * chunk_len, len(y)) 
        chunk = y[start:end] 

        if len(chunk) < chunk_len: 
            chunk = np.pad(chunk, (0, chunk_len - len(chunk))) 

        result = decide_label(chunk) 

        out.append({ 
            "chunk_index": i, 
            "start_sec": i * CHUNK_SEC, 
            "end_sec": (i + 1) * CHUNK_SEC, 
            **result, 
        }) 

    return out 

# ---------------------------- 
# Build segments 
# ---------------------------- 
def build_segments(coarse): 
    if not coarse: 
        return [] 

    segs = [{ 
        "label": coarse[0]["label"], 
        "start": coarse[0]["start_sec"], 
        "end": coarse[0]["end_sec"] 
    }] 

    for r in coarse[1:]: 
        if r["label"] == segs[-1]["label"]: 
            segs[-1]["end"] = r["end_sec"] 
        else: 
            segs.append({ 
                "label": r["label"], 
                "start": r["start_sec"], 
                "end": r["end_sec"] 
            }) 

    return segs 

# ---------------------------- 
# Fine boundary refinement 
# ---------------------------- 
def refine_single_boundary(y, t0, left_label, right_label): 
    R = FINE_SEARCH_RADIUS_SEC 
    win = FINE_WIN_SEC 
    hop = FINE_HOP_SEC 

    start_min = t0 - R 
    start_max = t0 + R - win 
    if start_max < start_min: 
        return t0 

    starts = [] 
    labels = [] 

    s = start_min 
    while s <= start_max + 1e-9: 
        chunk = slice_window(y, s, win) 
        r = decide_label(chunk) 
        starts.append(s) 
        labels.append(r["label"]) 
        s += hop 

    if len(labels) < 2: 
        return t0 

    best_k = None 
    best_score = -1 

    left_prefix = [0] * (len(labels) + 1) 
    for i, lab in enumerate(labels): 
        left_prefix[i + 1] = left_prefix[i] + (1 if lab == left_label else 0) 

    right_suffix = [0] * (len(labels) + 1) 
    for i in range(len(labels) - 1, -1, -1): 
        right_suffix[i] = right_suffix[i + 1] + (1 if labels[i] == right_label else 0) 

    for k in range(len(labels) - 1): 
        score = left_prefix[k + 1] + right_suffix[k + 1] 
        if score > best_score: 
            best_score = score 
            best_k = k 

    if best_k is None: 
        return t0 

    return starts[best_k] + win 

def refine_segments_with_finepass(y, segments): 
    if len(segments) < 2: 
        return segments 

    refined = [dict(segments[0])] 

    for i in range(1, len(segments)): 
        prev_seg = refined[-1] 
        cur_seg = dict(segments[i]) 

        t0 = prev_seg["end"] 
        refined_t = refine_single_boundary( 
            y, t0, 
            prev_seg["label"], 
            cur_seg["label"] 
        ) 

        refined_t = max(prev_seg["start"], min(refined_t, cur_seg["end"])) 

        prev_seg["end"] = refined_t 
        cur_seg["start"] = refined_t 

        refined.append(cur_seg) 

    return refined 

# ---------------------------- 
# Time formatting 
# ---------------------------- 
def sec_to_hms(sec): 
    total = int(round(sec)) 
    h = total // 3600 
    m = (total % 3600) // 60 
    s = total % 60 
    return f"{h:02d}:{m:02d}:{s:02d}" 

# ---------------------------- 
# Main 
# ---------------------------- 
if __name__ == "__main__":
    if len(argv) < 2:
        print("Usage: python script.py input_audio.wav")
        raise SystemExit(1)

    audio_path = argv[1]

    y, sr = librosa.load(audio_path, sr=TARGET_SR, mono=True) 

    coarse = coarse_pass(y) 
    segments = build_segments(coarse) 
    refined_segments = refine_segments_with_finepass(y, segments) 

    print("=== Segments (Coarse) ===") 
    for s in segments: 
        print(f"{sec_to_hms(s['start'])} - {sec_to_hms(s['end'])}  {s['label']}") 

    print("\n=== Segments (Refined) ===") 
    for s in refined_segments: 
        print(f"{sec_to_hms(s['start'])} - {sec_to_hms(s['end'])}  {s['label']}") 
