Lalit Mishra

Building the Grid: Dynamic Video Compositing with GStreamer and Python

The "Gallery View" Gap

In the lifecycle of a WebRTC project, there is a specific feature request that signals the end of the "easy mode" phase: "Can we record the session as a single video file?"

When engineers first approach this, they often implement "Stream Dumping." They spin up a process to save the raw UDP packets from every participant to disk. The result? If you have a 5-person meeting, you end up with 5 separate .webm or .mkv files, all starting at different timestamps, with different resolutions.

This is useless to the end-user. They don't want five files; they want a Composite Recording. They want a single MP4 file where the video layout dynamically adjusts—showing one person, then a split-screen for two, then a 2x2 grid for four—just like the Zoom or Google Meet client experience.

Implementing this requires shifting from a "Forwarding" architecture (SFU) to a "Mixing" architecture (MCU) specifically for the recording pipeline. You need a system that can decode, resize, position, and re-encode multiple live video streams in real-time.

(Meme image: what users see when they hit Record vs. what engineers see behind the scenes.)

Every engineer who has worked on real-time media systems has lived this meme.

What users see: a red REC button and a polished MP4 file.
What we see: jitter buffers, timestamp drift, packet loss, broken keyframes, and that one participant with 4% packet loss ruining everyone's sync.

The “A/V Sync” fire in the meme isn’t a joke — it’s a rite of passage.
You don’t truly understand distributed systems until you’ve chased a 200ms audio delay across five time domains at 3 AM.

This is where WebRTC engineering stops being just code and becomes craft.
It’s not about saving streams. It’s about orchestrating chaos into something that feels simple.

And that’s the quiet beauty of building media systems — turning network disorder into human moments that look effortless.

The Engine: Why GStreamer?

To build a dynamic compositor, static transcoding tools like FFmpeg (CLI) are often insufficient because they struggle with input sources that appear and disappear mid-stream. We need a dynamic media graph.

GStreamer is the industry standard for this. Unlike a linear transcoder, GStreamer operates as a graph of elements. The critical component for our use case is the compositor element (or glvideomixer for GPU acceleration).

The compositor accepts multiple "Sink Pads" (inputs). Each pad has properties: xpos, ypos, width, height, and zorder. By manipulating these properties at runtime via Python, we can move video streams around the canvas without stopping the pipeline.

(Diagram: multiple input streams flowing left to right into the compositor and out as a single composited output.)

The Pipeline Anatomy

A production-grade compositing pipeline for WebRTC typically follows this chain for each participant:

  1. udpsrc: Receives encrypted SRTP or raw RTP packets.
  2. rtpjitterbuffer: The most critical element. It buffers packets to handle network jitter and reorders them before decoding.
  3. rtph264depay: Extracts the H.264 bitstream from the RTP payload.
  4. avdec_h264: Decodes the compressed bitstream into raw video frames (I420/NV12).
  5. videoscale & capsfilter: Resizes the raw image to fit the target grid cell (e.g., scaling 1080p down to 640x360).
  6. compositor: The canvas where this stream is painted.

The output of the compositor (the full mixed canvas) then flows through:
x264enc -> mp4mux -> filesink.
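
Before wiring this up dynamically in Python (next section), it helps to see the whole chain for a single participant as one pipeline description. This is a minimal sketch, assuming plain (unencrypted) RTP on UDP port 5000 and a 640x360 target cell; both values are placeholders:

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)

# One participant's chain, end to end: RTP in -> decoded -> scaled -> composited -> MP4
pipeline = Gst.parse_launch(
    'udpsrc port=5000 caps="application/x-rtp,media=video,'
    'clock-rate=90000,encoding-name=H264,payload=96" '
    '! rtpjitterbuffer latency=200 '
    '! rtph264depay ! avdec_h264 '
    '! videoscale ! video/x-raw,width=640,height=360 '
    '! compositor ! x264enc tune=zerolatency '
    '! mp4mux ! filesink location=single.mp4'
)
pipeline.set_state(Gst.State.PLAYING)
# Note: mp4mux only writes a playable file after EOS, so send
# pipeline.send_event(Gst.Event.new_eos()) before shutting down.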

Implementation: The Python Controller

We will use PyGObject to interface with GStreamer. The core logic involves calculating the grid coordinates and requesting new pads from the compositor dynamically.

1. The Grid Calculation

First, we need a deterministic algorithm to calculate xpos, ypos, width, and height based on the number of participants (N).

import math

CANVAS_WIDTH = 1920
CANVAS_HEIGHT = 1080

def calculate_layout(n_participants):
    if n_participants == 0:
        return []

    # Calculate rows and columns (e.g., 4 users -> 2x2, 5 users -> 3x2)
    cols = math.ceil(math.sqrt(n_participants))
    rows = math.ceil(n_participants / cols)

    cell_w = CANVAS_WIDTH // cols
    cell_h = CANVAS_HEIGHT // rows

    layout = []
    for i in range(n_participants):
        row = i // cols
        col = i % cols
        layout.append({
            'x': col * cell_w,
            'y': row * cell_h,
            'w': cell_w,
            'h': cell_h
        })
    return layout

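For example, five participants produce a 3x2 grid of 640x540 cells; the sixth cell simply shows the compositor background (black, given the background setting used below):

print(calculate_layout(5))
# [{'x': 0, 'y': 0, 'w': 640, 'h': 540}, {'x': 640, 'y': 0, 'w': 640, 'h': 540},
#  {'x': 1280, 'y': 0, 'w': 640, 'h': 540}, {'x': 0, 'y': 540, 'w': 640, 'h': 540},
#  {'x': 640, 'y': 540, 'w': 640, 'h': 540}]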

2. Initializing the Pipeline

We set up the static part of the pipeline (Compositor -> Encoder -> File) first.

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GObject

Gst.init(None)

class CompositeRecorder:
    def __init__(self, filename):
        self.pipeline = Gst.Pipeline.new("recorder")

        # Core Elements
        self.compositor = Gst.ElementFactory.make("compositor", "comp")
        self.encoder = Gst.ElementFactory.make("x264enc", "enc")
        self.muxer = Gst.ElementFactory.make("mp4mux", "mux")
        self.sink = Gst.ElementFactory.make("filesink", "fs")

        # Configuration
        self.sink.set_property("location", filename)
        self.encoder.set_property("tune", "zerolatency")
        self.compositor.set_property("background", 1) # Black background

        # Add and Link
        for elem in [self.compositor, self.encoder, self.muxer, self.sink]:
            self.pipeline.add(elem)

        self.compositor.link(self.encoder)
        self.encoder.link(self.muxer)
        self.muxer.link(self.sink)

        self.inputs = {} # Map port -> {bin, pad}
        self.pipeline.set_state(Gst.State.PLAYING)

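One thing the class does not show is shutdown: mp4mux only finalizes a playable file after it receives EOS. A minimal usage sketch follows; the stop_recording helper and the 60-second timeout are illustrative additions, not part of the class above:

from gi.repository import GLib

recorder = CompositeRecorder("meeting.mp4")
loop = GLib.MainLoop()

def stop_recording():
    # Push EOS through the pipeline so mp4mux can write its headers,
    # then wait for the EOS message before tearing everything down.
    recorder.pipeline.send_event(Gst.Event.new_eos())
    bus = recorder.pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS)
    recorder.pipeline.set_state(Gst.State.NULL)
    loop.quit()
    return False  # do not reschedule the timeout

# recorder.add_participant(port) calls happen as users join...
GLib.timeout_add_seconds(60, stop_recording)  # e.g. stop after 60 seconds
loop.run()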

3. Dynamic Participant Joining

When a user joins, we create a "Source Bin" and link it to the compositor.

(Diagram: a new participant's source bin being attached to the running GStreamer pipeline.)

    def add_participant(self, port, codec="H264"):
        # 1. Request a new Pad from the Compositor
        # This is the "slot" where the video will enter
        sink_pad = self.compositor.get_request_pad("sink_%u")

        # 2. Create the Source Bin (UDP -> Jitter Buffer -> Depay -> Decode -> Scale)
        bin_name = f"user_{port}"
        src_bin = Gst.Bin.new(bin_name)

        udpsrc = Gst.ElementFactory.make("udpsrc")
        udpsrc.set_property("port", port)
        caps = Gst.Caps.from_string("application/x-rtp,media=video,clock-rate=90000,encoding-name=H264")
        udpsrc.set_property("caps", caps)

        jitter = Gst.ElementFactory.make("rtpjitterbuffer")
        depay = Gst.ElementFactory.make("rtph264depay")
        decode = Gst.ElementFactory.make("avdec_h264")
        scale = Gst.ElementFactory.make("videoscale")

        # Add the elements to the bin and link them in order
        for elem in [udpsrc, jitter, depay, decode, scale]:
            src_bin.add(elem)
        udpsrc.link(jitter)
        jitter.link(depay)
        depay.link(decode)
        decode.link(scale)

        # Expose the scaler's output on the bin boundary via a ghost pad
        ghost_pad = Gst.GhostPad.new("src", scale.get_static_pad("src"))
        src_bin.add_pad(ghost_pad)

        # 3. Add the bin to the pipeline and link it to the Compositor Pad
        self.pipeline.add(src_bin)
        ghost_pad.link(sink_pad)

        # 4. Store reference and re-layout
        self.inputs[port] = {'bin': src_bin, 'pad': sink_pad}
        self.update_layout()

        # 5. Sync state
        src_bin.sync_state_with_parent()

    def update_layout(self):
        layout = calculate_layout(len(self.inputs))
        for i, (port, data) in enumerate(self.inputs.items()):
            pad = data['pad']
            coords = layout[i]

            # Dynamically set properties on the live pad
            pad.set_property("xpos", coords['x'])
            pad.set_property("ypos", coords['y'])
            pad.set_property("width", coords['w'])
            pad.set_property("height", coords['h'])


The Unlinking Dance: Handling Leavers

Adding is easy; removing is dangerous. In GStreamer, if you simply unlink an element while data is flowing, you risk a crash or a pipeline stall (Internal Data Flow Error).

To remove a participant safely:

  1. Block the Pad: Install a "Blocking Probe" on the source pad of the participant's bin. This ensures no data is moving across the connection we are about to sever.
  2. Unlink: Once blocked, unlink the bin from the compositor.
  3. Release: Release the request pad back to the compositor (release_request_pad).
  4. Remove: Set the bin state to NULL and remove it from the pipeline.
  5. Re-layout: Recalculate grid positions for remaining users.
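
Here is a minimal sketch of such a removal method for the CompositeRecorder class above, assuming each source bin exposes its output as a "src" ghost pad (as in add_participant):

    def remove_participant(self, port):
        data = self.inputs.pop(port)
        src_bin, comp_pad = data['bin'], data['pad']
        bin_src_pad = src_bin.get_static_pad("src")

        def on_blocked(pad, info):
            # No buffers are flowing past this point now; safe to tear down.
            pad.unlink(comp_pad)
            self.compositor.release_request_pad(comp_pad)
            src_bin.set_state(Gst.State.NULL)
            self.pipeline.remove(src_bin)
            self.update_layout()
            return Gst.PadProbeReturn.REMOVE

        # 1. Block downstream data flow before touching the link.
        #    In production, defer the state change and removal to the main
        #    loop (GLib.idle_add) instead of doing it in the probe callback.
        bin_src_pad.add_probe(Gst.PadProbeType.BLOCK_DOWNSTREAM, on_blocked)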

The Invisible Complexity: Audio Mixing

Video is visual, but audio is critical. You cannot just "composite" audio. You must mix it.
We use the audiomixer element. The logic mirrors the video path:

  1. Ingest RTP Opus packets.
  2. rtpjitterbuffer -> rtpopusdepay -> opusdec.
  3. Link to audiomixer (which requests sink_%u pads just like the compositor).
  4. Output mixed audio -> opusenc (or AAC) -> mp4mux.

Synchronization Challenge: Video pipelines often have higher latency (decoding + scaling) than audio pipelines. If you simply mix them, lips will move before the voice is heard. You must use queue elements with min-threshold-time to buffer audio slightly to match video processing latency.
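
A hedged sketch of the per-participant audio branch, mirroring add_participant. It assumes a self.audiomixer (an audiomixer element) was created during setup and linked to an audio encoder and mp4mux (not shown); the port and the 200 ms threshold are placeholder values to tune against your measured video latency:

    def add_audio_participant(self, port):
        mix_pad = self.audiomixer.get_request_pad("sink_%u")

        abin = Gst.Bin.new(f"audio_{port}")
        udpsrc = Gst.ElementFactory.make("udpsrc")
        udpsrc.set_property("port", port)
        udpsrc.set_property("caps", Gst.Caps.from_string(
            "application/x-rtp,media=audio,clock-rate=48000,encoding-name=OPUS"))
        jitter = Gst.ElementFactory.make("rtpjitterbuffer")
        depay = Gst.ElementFactory.make("rtpopusdepay")
        dec = Gst.ElementFactory.make("opusdec")
        queue = Gst.ElementFactory.make("queue")
        # Hold ~200 ms of audio so it lines up with the slower video path
        queue.set_property("min-threshold-time", 200 * Gst.MSECOND)

        for elem in [udpsrc, jitter, depay, dec, queue]:
            abin.add(elem)
        udpsrc.link(jitter)
        jitter.link(depay)
        depay.link(dec)
        dec.link(queue)

        abin.add_pad(Gst.GhostPad.new("src", queue.get_static_pad("src")))
        self.pipeline.add(abin)
        abin.get_static_pad("src").link(mix_pad)
        abin.sync_state_with_parent()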

Time & Sync: The Jitter Buffer

In WebRTC, packets arrive out of order, or not at all. If you feed raw UDP packets directly into a decoder, it will produce garbage artifacts (smearing/tearing).

The rtpjitterbuffer is mandatory. It reorders packets and waits for retransmissions (NACKs).
Crucially, all streams in your GStreamer pipeline must share a common Clock.
GStreamer pipelines select a global clock (usually the system clock). Incoming RTP streams have their own timestamps. The rtpjitterbuffer translates RTP time to GStreamer running time.

(Diagram: jittery, out-of-order packet streams entering the jitter buffer and leaving as smooth, ordered streams.)

If your output video speeds up (Benny Hill style) or lags, it is almost always a timestamp issue. Ensure your depayloaders are not discarding timestamps and that do-timestamp=true is set on sources where applicable.
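
Applied to the jitter and udpsrc elements from add_participant above, the tuning might look like this; the values are illustrative defaults for a recorder, not universal settings:

jitter.set_property("latency", 500)           # ms of buffering; a recorder can afford a deep buffer
jitter.set_property("do-lost", True)          # emit lost-packet events instead of stalling
jitter.set_property("drop-on-latency", False) # keep late packets rather than dropping them
udpsrc.set_property("do-timestamp", True)     # stamp buffers with the pipeline clock on arrival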

Performance Analysis & Hardware Acceleration

Compositing is expensive.

  • Decoding: 4x 1080p H.264 streams will consume ~1-2 vCPUs in software (avdec_h264).
  • Encoding: Re-encoding the mixed canvas to H.264 is the heaviest task. Software encoding (x264enc) at 1080p30 requires significant CPU.

For production, Hardware Acceleration is key:

  • Intel: Use vaapih264dec and vaapih264enc.
  • NVIDIA: Use nvv4l2decoder and nvv4l2h264enc.

Switching elements in Python is just a string change (ElementFactory.make("nvv4l2h264enc")), but it drastically changes the viability of your recorder. A software recorder might handle 4 participants; a GPU-accelerated one can handle 20+.
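
A small sketch of how that swap might be wired up, falling back to software when no hardware plugin is installed; the exact element names available depend on your platform and installed plugin set:

def make_h264_encoder():
    # Prefer hardware encoders when their plugins are present, else fall back.
    for name in ("nvv4l2h264enc", "vaapih264enc", "x264enc"):
        if Gst.ElementFactory.find(name) is not None:
            return Gst.ElementFactory.make(name, "enc")
    raise RuntimeError("No H.264 encoder available")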

Conclusion: The Custom MCU

By building this pipeline, you have effectively built a specialized, write-only Multipoint Control Unit (MCU). Unlike standard MCUs that must minimize latency for real-time interaction, your recorder can afford a few seconds of latency (using larger jitter buffers) to ensure higher quality output.

While complex, this GStreamer approach offers total control: you can add watermarks, overlay names, switch layouts programmatically, and output to any format (HLS, RTMP, File). It moves you from "dumping bits" to "producing content."


🚀 Learn More on YouTube

If you enjoyed this deep dive into WebRTC, GStreamer, and real-time media architecture, I regularly break down complex backend systems, RTC internals, and production-grade engineering patterns on my YouTube channel:

👉 The Lalit Official
🔗 https://www.youtube.com/@lalit_096/videos

I share practical breakdowns, system design insights, and real-world debugging stories from building scalable media systems.

If this blog helped you understand compositing at a deeper level, you’ll definitely enjoy the long-form technical breakdowns there.

Subscribe and join the journey from “it works” to “it works at scale.”
