The "Gallery View" Gap
In the lifecycle of a WebRTC project, there is a specific feature request that signals the end of the "easy mode" phase: "Can we record the session as a single video file?"
When engineers first approach this, they often implement "Stream Dumping." They spin up a process to save the raw UDP packets from every participant to disk. The result? If you have a 5-person meeting, you end up with 5 separate .webm or .mkv files, all starting at different timestamps, with different resolutions.
This is useless to the end-user. They don't want five files; they want a Composite Recording. They want a single MP4 file where the video layout dynamically adjusts—showing one person, then a split-screen for two, then a 2x2 grid for four—just like the Zoom or Google Meet client experience.
Implementing this requires shifting from a "Forwarding" architecture (SFU) to a "Mixing" architecture (MCU) specifically for the recording pipeline. You need a system that can decode, resize, position, and re-encode multiple live video streams in real-time.
Every engineer who has worked on real-time media systems has lived this meme.
What users see: a red REC button and a polished MP4 file.
What we see: jitter buffers, timestamp drift, packet loss, broken keyframes, and that one participant with 4% packet loss ruining everyone's sync.
The “A/V Sync” fire in the meme isn’t a joke — it’s a rite of passage.
You don’t truly understand distributed systems until you’ve chased a 200ms audio delay across five time domains at 3 AM.
This is where WebRTC engineering stops being just code and becomes craft.
It’s not about saving streams. It’s about orchestrating chaos into something that feels simple.
And that’s the quiet beauty of building media systems — turning network disorder into human moments that look effortless.
The Engine: Why GStreamer?
To build a dynamic compositor, static transcoding tools like FFmpeg (CLI) are often insufficient because they struggle with input sources that appear and disappear mid-stream. We need a dynamic media graph.
GStreamer is the industry standard for this. Unlike a linear transcoder, GStreamer operates as a graph of elements. The critical component for our use case is the compositor element (or glvideomixer for GPU acceleration).
The compositor accepts multiple "Sink Pads" (inputs). Each pad has properties: xpos, ypos, width, height, and zorder. By manipulating these properties at runtime via Python, we can move video streams around the canvas without stopping the pipeline.
The Pipeline Anatomy
A production-grade compositing pipeline for WebRTC typically follows this chain for each participant:
- `udpsrc`: Receives encrypted SRTP or raw RTP packets.
- `rtpjitterbuffer`: The most critical element. It buffers packets to handle network jitter and reorders them before decoding.
- `rtph264depay`: Extracts the H.264 bitstream from the RTP payload.
- `avdec_h264`: Decodes the compressed bitstream into raw video frames (I420/NV12).
- `videoscale` & `capsfilter`: Resizes the raw image to fit the target grid cell (e.g., scaling 1080p down to 640x360).
- `compositor`: The canvas where this stream is painted.
The output of the compositor (the full mixed canvas) then flows through:
`x264enc` -> `mp4mux` -> `filesink`.
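Before wiring this up dynamically in Python, it helps to sanity-check the per-participant chain in isolation. A throwaway sketch using `Gst.parse_launch`, assuming H.264 RTP is arriving on local port 5004 and previewing to a window instead of the compositor (the port, caps, and `autovideosink` are placeholders for your own setup):

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib

Gst.init(None)

# Single-participant chain: RTP in -> jitter buffer -> depay -> decode -> scale -> preview
pipeline = Gst.parse_launch(
    'udpsrc port=5004 caps="application/x-rtp,media=video,'
    'clock-rate=90000,encoding-name=H264" '
    '! rtpjitterbuffer latency=200 '
    '! rtph264depay ! avdec_h264 '
    '! videoscale ! video/x-raw,width=640,height=360 '
    '! autovideosink')

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()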
Implementation: The Python Controller
We will use PyGObject to interface with GStreamer. The core logic involves calculating the grid coordinates and requesting new pads from the compositor dynamically.
1. The Grid Calculation
First, we need a deterministic algorithm to calculate xpos, ypos, width, and height based on the number of participants (N).
import math

CANVAS_WIDTH = 1920
CANVAS_HEIGHT = 1080

def calculate_layout(n_participants):
    if n_participants == 0:
        return []

    # Calculate rows and columns (e.g., 4 users -> 2x2, 5 users -> 3x2)
    cols = math.ceil(math.sqrt(n_participants))
    rows = math.ceil(n_participants / cols)

    cell_w = CANVAS_WIDTH // cols
    cell_h = CANVAS_HEIGHT // rows

    layout = []
    for i in range(n_participants):
        row = i // cols
        col = i % cols
        layout.append({
            'x': col * cell_w,
            'y': row * cell_h,
            'w': cell_w,
            'h': cell_h
        })
    return layout
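As a quick sanity check, five participants on the 1920x1080 canvas land on a 3x2 grid of 640x540 cells, with the unused sixth cell simply showing the compositor background:

print(calculate_layout(5))
# [{'x': 0,    'y': 0,   'w': 640, 'h': 540},
#  {'x': 640,  'y': 0,   'w': 640, 'h': 540},
#  {'x': 1280, 'y': 0,   'w': 640, 'h': 540},
#  {'x': 0,    'y': 540, 'w': 640, 'h': 540},
#  {'x': 640,  'y': 540, 'w': 640, 'h': 540}]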
2. Initializing the Pipeline
We set up the static part of the pipeline (Compositor -> Encoder -> File) first.
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GObject

Gst.init(None)

class CompositeRecorder:
    def __init__(self, filename):
        self.pipeline = Gst.Pipeline.new("recorder")

        # Core Elements
        self.compositor = Gst.ElementFactory.make("compositor", "comp")
        self.encoder = Gst.ElementFactory.make("x264enc", "enc")
        self.muxer = Gst.ElementFactory.make("mp4mux", "mux")
        self.sink = Gst.ElementFactory.make("filesink", "fs")

        # Configuration
        self.sink.set_property("location", filename)
        self.encoder.set_property("tune", "zerolatency")
        self.compositor.set_property("background", 1)  # Black background

        # Add and Link
        for elem in [self.compositor, self.encoder, self.muxer, self.sink]:
            self.pipeline.add(elem)

        self.compositor.link(self.encoder)
        self.encoder.link(self.muxer)
        self.muxer.link(self.sink)

        self.inputs = {}  # Map port -> {bin, pad}
        self.pipeline.set_state(Gst.State.PLAYING)
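One practical detail: `mp4mux` only finalizes a playable MP4 (it writes the moov atom at the end) once it receives end-of-stream, so the recording should be stopped by sending EOS rather than by killing the process. A minimal sketch of a `stop()` method, a hypothetical addition to `CompositeRecorder` rather than part of the snippet above:

    def stop(self):
        # mp4mux needs EOS to write the MP4 headers; tearing down
        # without it leaves an unplayable file.
        self.pipeline.send_event(Gst.Event.new_eos())

        # Wait for the EOS (or an error) to reach the bus before tearing down.
        bus = self.pipeline.get_bus()
        bus.timed_pop_filtered(
            Gst.CLOCK_TIME_NONE,
            Gst.MessageType.EOS | Gst.MessageType.ERROR)
        self.pipeline.set_state(Gst.State.NULL)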
3. Dynamic Participant Joining
When a user joins, we create a "Source Bin" and link it to the compositor.
def add_participant(self, port, codec="H264"):
    # 1. Request a new Pad from the Compositor
    # This is the "slot" where the video will enter
    sink_pad = self.compositor.get_request_pad("sink_%u")

    # 2. Create the Source Bin (UDP -> Jitter Buffer -> Depay -> Decode -> Scale)
    bin_name = f"user_{port}"
    src_bin = Gst.Bin.new(bin_name)

    udpsrc = Gst.ElementFactory.make("udpsrc")
    udpsrc.set_property("port", port)
    caps = Gst.Caps.from_string(
        "application/x-rtp,media=video,clock-rate=90000,encoding-name=H264")
    udpsrc.set_property("caps", caps)

    jitter = Gst.ElementFactory.make("rtpjitterbuffer")
    depay = Gst.ElementFactory.make("rtph264depay")
    decode = Gst.ElementFactory.make("avdec_h264")
    scale = Gst.ElementFactory.make("videoscale")

    # Add the elements to the bin and link them in order
    for elem in [udpsrc, jitter, depay, decode, scale]:
        src_bin.add(elem)
    udpsrc.link(jitter)
    jitter.link(depay)
    depay.link(decode)
    decode.link(scale)

    # Expose the scaler's output on the bin boundary via a ghost pad
    ghost_src = Gst.GhostPad.new("src", scale.get_static_pad("src"))
    src_bin.add_pad(ghost_src)

    self.pipeline.add(src_bin)

    # 3. Link the Bin to the Compositor Pad
    ghost_src.link(sink_pad)

    # 4. Store reference and re-layout
    self.inputs[port] = {'bin': src_bin, 'pad': sink_pad}
    self.update_layout()

    # 5. Sync state with the already-playing pipeline
    src_bin.sync_state_with_parent()

def update_layout(self):
    layout = calculate_layout(len(self.inputs))

    for i, (port, data) in enumerate(self.inputs.items()):
        pad = data['pad']
        coords = layout[i]

        # Dynamically set properties on the live pad
        pad.set_property("xpos", coords['x'])
        pad.set_property("ypos", coords['y'])
        pad.set_property("width", coords['w'])
        pad.set_property("height", coords['h'])
The Unlinking Dance: Handling Leavers
Adding is easy; removing is dangerous. In GStreamer, if you simply unlink an element while data is flowing, you risk a crash or a pipeline stall (Internal Data Flow Error).
To remove a participant safely (a code sketch follows this list):
- Block the Pad: Install a "Blocking Probe" on the source pad of the participant's bin. This ensures no data is moving across the connection we are about to sever.
- Unlink: Once blocked, unlink the bin from the compositor.
- Release: Release the request pad back to the compositor (`release_request_pad`).
- Remove: Set the bin state to `NULL` and remove it from the pipeline.
- Re-layout: Recalculate grid positions for the remaining users.
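A hedged sketch of that sequence, assuming the source bin exposes its output through a ghost pad named "src" as in `add_participant` above (the method name and callback structure are my own, and the exact teardown ordering may need tuning for your pipeline):

from gi.repository import GLib

def remove_participant(self, port):
    entry = self.inputs.pop(port)
    src_bin, comp_pad = entry['bin'], entry['pad']
    ghost_src = src_bin.get_static_pad("src")

    def on_blocked(pad, probe_info):
        # Data flow is stopped past this point; safe to sever the link.
        pad.unlink(comp_pad)
        self.compositor.release_request_pad(comp_pad)

        # Drop the bin from the main loop, not from the streaming thread.
        def drop_bin():
            src_bin.set_state(Gst.State.NULL)
            self.pipeline.remove(src_bin)
            return False
        GLib.idle_add(drop_bin)

        self.update_layout()
        return Gst.PadProbeReturn.REMOVE

    # Block the pad: the callback fires once no buffers are in flight.
    ghost_src.add_probe(Gst.PadProbeType.BLOCK_DOWNSTREAM, on_blocked)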
The Invisible Complexity: Audio Mixing
Video is visual, but audio is critical. You cannot just "composite" audio. You must mix it.
We use the audiomixer element. The logic mirrors the video path:
- Ingest RTP Opus packets.
- `rtpjitterbuffer` -> `rtpopusdepay` -> `opusdec`.
- Link to `audiomixer` (which requests `sink_%u` pads just like the compositor).
- Output mixed audio -> `opusenc` (or AAC) -> `mp4mux`.
Synchronization Challenge: Video pipelines often have higher latency (decoding + scaling) than audio pipelines. If you simply mix them, lips will move before the voice is heard. You must use queue elements with min-threshold-time to buffer audio slightly to match video processing latency.
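A minimal sketch of the static half of that audio branch, sitting next to the video elements in `CompositeRecorder.__init__` (the choice of `avenc_aac`, the element names, and the 200 ms threshold are assumptions to adapt to your own install):

# Static audio branch: audiomixer -> convert/resample -> queue -> AAC -> the same mp4mux
self.audiomixer = Gst.ElementFactory.make("audiomixer", "amix")
aconvert = Gst.ElementFactory.make("audioconvert", "aconv")
aresample = Gst.ElementFactory.make("audioresample", "ares")
aqueue = Gst.ElementFactory.make("queue", "aq")
aencoder = Gst.ElementFactory.make("avenc_aac", "aenc")

# Extra buffering headroom on the audio path, as described above
aqueue.set_property("min-threshold-time", 200 * Gst.MSECOND)

for elem in [self.audiomixer, aconvert, aresample, aqueue, aencoder]:
    self.pipeline.add(elem)

self.audiomixer.link(aconvert)
aconvert.link(aresample)
aresample.link(aqueue)
aqueue.link(aencoder)
aencoder.link(self.muxer)  # mp4mux requests an audio pad automatically

Each participant's audio then gets its own `udpsrc -> rtpjitterbuffer -> rtpopusdepay -> opusdec -> audioconvert` chain linked to a requested `sink_%u` pad on the `audiomixer`, mirroring `add_participant`.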
Time & Sync: The Jitter Buffer
In WebRTC, packets arrive out of order, or not at all. If you feed raw UDP packets directly into a decoder, it will produce garbage artifacts (smearing/tearing).
The rtpjitterbuffer is mandatory. It reorders packets and waits for retransmissions (NACKs).
Crucially, all streams in your GStreamer pipeline must share a common Clock.
GStreamer pipelines select a global clock (usually the system clock). Incoming RTP streams have their own timestamps. The rtpjitterbuffer translates RTP time to GStreamer running time.
If your output video speeds up (Benny Hill style) or lags, it is almost always a timestamp issue. Ensure your depayloaders are not discarding timestamps and that do-timestamp=true is set on sources where applicable.
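In Python this mostly comes down to a couple of properties on the ingest elements inside `add_participant`; the 200 ms latency below is a typical starting point, not a universal value:

# Inside the per-participant source bin:
udpsrc.set_property("do-timestamp", True)      # stamp buffers with the pipeline clock on arrival
jitter.set_property("latency", 200)            # ms of headroom for reordering / retransmissions
jitter.set_property("drop-on-latency", True)   # drop hopelessly late packets instead of stalling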
Performance Analysis & Hardware Acceleration
Compositing is expensive.
- Decoding: 4x 1080p H.264 streams will consume ~1-2 vCPUs in software (`avdec_h264`).
- Encoding: Re-encoding the mixed canvas to H.264 is the heaviest task. Software encoding (`x264enc`) at 1080p30 requires significant CPU.
For production, Hardware Acceleration is key:
- Intel: Use `vaapih264dec` and `vaapih264enc`.
- NVIDIA: Use `nvv4l2decoder` and `nvv4l2h264enc`.
Switching elements in Python is just a string change (ElementFactory.make("nvv4l2h264enc")), but it drastically changes the viability of your recorder. A software recorder might handle 4 participants; a GPU-accelerated one can handle 20+.
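A small sketch of that idea, probing for a hardware encoder and falling back to software (whether each element exists depends on the machine's drivers and installed GStreamer plugins):

def make_h264_encoder():
    # Prefer NVIDIA, then Intel VA-API, then software x264.
    for name in ("nvv4l2h264enc", "vaapih264enc", "x264enc"):
        encoder = Gst.ElementFactory.make(name)
        if encoder is not None:
            print(f"Using encoder: {name}")
            return encoder
    raise RuntimeError("No H.264 encoder available")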
Conclusion: The Custom MCU
By building this pipeline, you have effectively built a specialized, write-only Multipoint Control Unit (MCU). Unlike standard MCUs that must minimize latency for real-time interaction, your recorder can afford a few seconds of latency (using larger jitter buffers) to ensure higher quality output.
While complex, this GStreamer approach offers total control: you can add watermarks, overlay names, switch layouts programmatically, and output to any format (HLS, RTMP, File). It moves you from "dumping bits" to "producing content."
🚀 Learn More on YouTube
If you enjoyed this deep dive into WebRTC, GStreamer, and real-time media architecture, I regularly break down complex backend systems, RTC internals, and production-grade engineering patterns on my YouTube channel:
👉 The Lalit Official
🔗 https://www.youtube.com/@lalit_096/videos
I share practical breakdowns, system design insights, and real-world debugging stories from building scalable media systems.
If this blog helped you understand compositing at a deeper level, you’ll definitely enjoy the long-form technical breakdowns there.
Subscribe and join the journey from “it works” to “it works at scale.”



