ahmet gedik

Posted on Jun 12

Deduplicating Video URLs at Scale with a PHP Canonicalization Pipeline

#php #sqlite #backend #webdev

Last month I opened our admin dashboard and found the trailer for one Korean drama sitting in the discovery feed eleven separate times. Eleven rows, eleven thumbnails, eleven primary keys, all pointing at the exact same video. At TrendVidStream we fan out discovery crons across eight regions, and each region had ingested that clip with its own mix of tracking parameters, a different host alias, and in two cases a shortened link. To a person they are obviously the same video. To a videos table they were eleven unrelated facts. This is the URL canonicalization problem, and if you run any kind of content aggregator it will quietly wreck your data quality before you ever notice it.

Why one video becomes eleven rows

The variation comes from every layer of the stack at once. The same upstream video reaches us as all of these:

https://www.youtube.com/watch?v=dQw4w9WgXcQ
https://youtu.be/dQw4w9WgXcQ?si=8fJ2kLm
https://m.youtube.com/watch?v=dQw4w9WgXcQ&feature=share
https://www.youtube.com/embed/dQw4w9WgXcQ
https://www.youtube.com/shorts/dQw4w9WgXcQ

Region crons receive tracking parameters that the upstream API hands back. Mobile endpoints return m. hosts. Share buttons append si and feature. Embed and shorts surfaces use entirely different path shapes for the identical video. If your dedup strategy is INSERT OR IGNORE keyed on the raw URL string, which is where almost everyone starts, every one of those becomes a brand new row, because as strings they genuinely are all different.

The damage is not just cosmetic. A duplicated catalogue poisons everything downstream: your trending ranker counts the same video five times and over-promotes it, your FTS5 search returns five identical hits on the first page, your region-availability badges are wrong, and your sitemap ships five URLs that search engines treat as duplicate content. We were paying that tax on every one of eight regions, every cron cycle.

What canonical actually means here

Canonicalization is the process of reducing many equivalent representations of a thing down to one chosen representative. For video URLs I split it into two layers, because each layer solves a different failure mode.

The first layer is string normalization: take a URL and produce a deterministic, cleaned-up version of the same URL. Same scheme, lowercased host, no tracking junk, sorted query string. This collapses ?v=X&utm_source=... and ?utm_source=...&v=X into one.

The second layer is identity extraction: pull the platform-stable video ID out of the normalized URL, because two completely different URL shapes such as youtu.be/X and youtube.com/embed/X still point at the same video, and string normalization alone will never make them equal. The identity is what we actually deduplicate on. The normalized URL is what we store and serve.

Treat these as separate steps. Bolting identity logic into your normalizer produces a tangled function that is impossible to test, and you will be testing this code constantly as platforms keep changing their URL formats.

Layer one normalizing the URL string

Here is the normalizer we run on PHP 8.4. It is deliberately boring: parse, rewrite the parts we care about, drop the rest.

<?php
declare(strict_types=1);

final class UrlNormalizer
{
    private const STRIP_PARAMS = [
        'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
        'feature', 'gclid', 'fbclid', 'si', 'pp', 'ab_channel',
    ];

    private const HOST_ALIASES = [
        'm.youtube.com'     => 'www.youtube.com',
        'youtube.com'       => 'www.youtube.com',
        'youtu.be'          => 'www.youtube.com',
        'm.dailymotion.com' => 'www.dailymotion.com',
    ];

    public function normalize(string $raw): string
    {
        $parts = parse_url(trim($raw));
        if ($parts === false || empty($parts['host'])) {
            throw new InvalidArgumentException('unparseable URL: ' . $raw);
        }

        $host = strtolower($parts['host']);
        $host = self::HOST_ALIASES[$host] ?? $host;
        $path = rtrim($parts['path'] ?? '/', '/') ?: '/';

        $query = [];
        if (!empty($parts['query'])) {
            parse_str($parts['query'], $query);
            foreach (self::STRIP_PARAMS as $junk) {
                unset($query[$junk]);
            }
            ksort($query);
        }

        $rebuilt = 'https://' . $host . $path;
        if ($query !== []) {
            $rebuilt .= '?' . http_build_query($query);
        }
        return $rebuilt;
    }
}

A few decisions worth calling out:

Host aliasing is explicit, not clever. youtu.be, m.youtube.com and bare youtube.com all map to www.youtube.com. A lookup table you can read beats a regex you cannot.
Tracking parameters are an allowlist-to-strip, not a denylist-to-keep. We strip known junk and keep everything else, because dropping an unknown parameter risks dropping something semantically meaningful like a playlist index.
The query string is sorted. ksort($query) is the single line that makes ?a=1&b=2 and ?b=2&a=1 produce identical output. Without it your normalization is non-deterministic against input ordering.
Trailing slashes are removed so /watch/ and /watch agree.

The function throws on unparseable input instead of returning a best-effort string. In a pipeline you want a bad record to stop loudly at the boundary, not flow downstream as a silently malformed canonical key.

Layer two extracting a platform-stable identity

Normalization gets us a clean string, but www.youtube.com/embed/X and www.youtube.com/watch?v=X are both already normalized and still describe the same video. That is what identity extraction is for.

<?php
declare(strict_types=1);

final class VideoIdentity
{
    /** @return array{platform: string, id: string} */
    public function resolve(string $normalizedUrl): array
    {
        $u = parse_url($normalizedUrl);
        $host = $u['host'] ?? '';
        $path = $u['path'] ?? '';
        parse_str($u['query'] ?? '', $q);

        return match (true) {
            str_contains($host, 'youtube.com') => [
                'platform' => 'youtube',
                'id'       => $this->youtubeId($path, $q),
            ],
            str_contains($host, 'dailymotion.com') => [
                'platform' => 'dailymotion',
                'id'       => $this->lastSegment($path),
            ],
            str_contains($host, 'vimeo.com') => [
                'platform' => 'vimeo',
                'id'       => $this->lastSegment($path),
            ],
            default => [
                'platform' => 'generic',
                'id'       => hash('xxh128', $normalizedUrl),
            ],
        };
    }

    private function youtubeId(string $path, array $q): string
    {
        if (isset($q['v']) && preg_match('/^[A-Za-z0-9_-]{11}$/', $q['v'])) {
            return $q['v'];
        }
        if (preg_match('#/(?:embed|shorts|v)/([A-Za-z0-9_-]{11})#', $path, $m)) {
            return $m[1];
        }
        $seg = $this->lastSegment($path);
        if (preg_match('/^[A-Za-z0-9_-]{11}$/', $seg)) {
            return $seg;
        }
        throw new RuntimeException('no YouTube id in ' . $path);
    }

    private function lastSegment(string $path): string
    {
        $segments = array_values(array_filter(explode('/', $path)));
        return $segments ? end($segments) : '';
    }
}

The match (true) block dispatches on host and each platform gets a small handler. YouTube is the messy one: the ID can live in the v query parameter, or in an /embed/, /shorts/, or /v/ path segment, or, once youtu.be has been aliased and its path reduced to /<id>, as the last path segment. Every YouTube ID is validated against the canonical eleven-character [A-Za-z0-9_-] pattern, so a malformed path raises instead of producing a junk identity.

Anything we do not recognize falls through to a generic platform whose identity is just a hash of the normalized URL. That is a deliberate floor: an unknown source still gets a stable key, it just cannot be deduplicated across URL shapes the way a known platform can. When we add a new source we add a handler; until then nothing crashes.

The canonical key and an idempotent write

Now we can define the canonical key: platform:id, for example youtube:dQw4w9WgXcQ. It is human-readable in logs, collision-resistant across platforms, and makes a perfect primary key. Here is the schema:

CREATE TABLE videos (
    canonical_key TEXT PRIMARY KEY,   -- e.g. youtube:dQw4w9WgXcQ
    platform      TEXT NOT NULL,
    video_id      TEXT NOT NULL,
    title         TEXT NOT NULL,
    first_seen    INTEGER NOT NULL,
    last_seen     INTEGER NOT NULL
);

CREATE TABLE video_regions (
    canonical_key TEXT NOT NULL REFERENCES videos(canonical_key) ON DELETE CASCADE,
    region        TEXT NOT NULL,
    PRIMARY KEY (canonical_key, region)
);

-- Full-text discovery index, kept in sync with triggers on videos.
CREATE VIRTUAL TABLE videos_fts USING fts5(
    title,
    content='videos',
    content_rowid='rowid'
);

The videos row is the canonical record. Region membership lives in a separate video_regions junction table, because a single canonical video legitimately appears in many regions and that is a many-to-many relationship, not a comma-joined string you have to parse and re-serialize on every write. The FTS5 virtual table is kept in sync with AFTER INSERT, AFTER UPDATE and AFTER DELETE triggers on videos, so search always sees exactly one row per real video.

The write itself is a single idempotent upsert:

<?php
declare(strict_types=1);

final class CanonicalStore
{
    public function __construct(private PDO $db) {}

    public function canonicalKey(string $platform, string $id): string
    {
        return $platform . ':' . $id;
    }

    public function upsert(string $platform, string $id, string $title, string $region): void
    {
        $key = $this->canonicalKey($platform, $id);
        $this->db->beginTransaction();

        $video = $this->db->prepare(
            'INSERT INTO videos (canonical_key, platform, video_id, title, first_seen, last_seen)
             VALUES (:key, :platform, :vid, :title, unixepoch(), unixepoch())
             ON CONFLICT(canonical_key) DO UPDATE SET
                 last_seen = unixepoch(),
                 title     = excluded.title'
        );
        $video->execute([
            ':key'      => $key,
            ':platform' => $platform,
            ':vid'      => $id,
            ':title'    => $title,
        ]);

        $this->db->prepare(
            'INSERT OR IGNORE INTO video_regions (canonical_key, region) VALUES (:key, :region)'
        )->execute([':key' => $key, ':region' => $region]);

        $this->db->commit();
    }
}

ON CONFLICT(canonical_key) DO UPDATE means re-ingesting a video we already have is a no-op on identity: it just bumps last_seen and refreshes the title. Region membership is a separate INSERT OR IGNORE, so the eighth region to see a video adds itself to video_regions without touching or duplicating the canonical row. Run the same cron twice, run all eight regions concurrently, replay yesterday's feed, and the table converges to the same state every time. Idempotency is the property that lets you stop being afraid of your own cron.

Short links and redirects

One case the normalizer cannot handle on its own is a genuine redirect, a t.co or bare share link that only reveals its real target after an HTTP round trip. Resolving those inline would be a disaster: you would add network latency and an external failure mode to your ingest path, and a slow upstream would back up the whole cron.

So we resolve redirects out of band. A short link gets a single HEAD request with a three-second timeout and CURLOPT_FOLLOWLOCATION capped at three hops; the resolved target is cached in a url_redirects(short, resolved, resolved_at) table keyed on the short URL. The next time any region sees that short link, and across eight regions they will see it repeatedly, it is a local SQLite lookup instead of a network call. Redirect resolution is the one place this pipeline touches the network, so it is the one place we cache aggressively and fail soft: if the HEAD request times out we keep the short URL as-is and try again next cycle, rather than dropping the video.

Wiring it into the multi-region cron

The eight region crons do not each need their own canonicalization logic; they call the same three steps. A region fetch does: pull the raw feed, and for each item run normalize, then resolve identity, then upsert(region). Because region membership accumulates through the junction table, the order regions run in does not matter and they can overlap. The US cron at 02:00 and the JP cron at 02:05 both contributing the same video just produce two video_regions rows against one videos row.

This matters operationally because we deploy over FTP. There is no build step or container on the host; the cron entry points are plain PHP files synced up and run by the host's own scheduler. The canonicalization classes are pure, dependency-free PHP that behaves identically on a laptop and on the LiteSpeed box it lands on. The less environment-specific magic in your ingest path, the less there is to go wrong on a host you reach through an FTP client.

Backfilling the rows we already had

That left the eleven-row mess already sitting in the table. A one-shot Python script handled the backfill, Python because our offline tooling already lives there and plain sqlite3 is more than enough for a few hundred thousand rows.

import sqlite3
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

def youtube_id(url):
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if 'youtu.be' in host:
        return parts.path.strip('/').split('/')[0] or None
    if 'youtube.com' in host:
        q = parse_qs(parts.query)
        if 'v' in q:
            return q['v'][0]
        seg = parts.path.strip('/').split('/')
        if len(seg) == 2 and seg[0] in {'embed', 'shorts', 'v'}:
            return seg[1]
    return None

def canonical_key(url):
    vid = youtube_id(url)
    return f'youtube:{vid}' if vid else None

con = sqlite3.connect('data/videos.db')
con.row_factory = sqlite3.Row
rows = con.execute('SELECT rowid, url, region FROM legacy_videos').fetchall()

groups = defaultdict(list)
for r in rows:
    key = canonical_key(r['url'])
    if key:
        groups[key].append(r)

merged, dropped = 0, 0
for key, members in groups.items():
    keeper = min(m['rowid'] for m in members)
    regions = {m['region'] for m in members}
    merged += 1
    dropped += len(members) - 1
    # upsert keeper into videos, INSERT OR IGNORE each region, delete the rest

print(f'{merged} canonical videos, {dropped} duplicate rows collapsed')
con.close()

The strategy is read everything, group by canonical key, keep the lowest rowid as the survivor, union every duplicate's region into the survivor, and delete the rest, all inside one transaction with a dry-run flag that prints the counts without writing. The first real run collapsed about 38,000 rows into roughly 24,000 canonical videos. Always dry-run a destructive backfill against a copy of the database first; the cost of being wrong about your canonical key is deleting real data.

What it bought us

Concrete results after shipping this across all eight regions:

Row count dropped by a third and stopped growing super-linearly with region count.
FTS5 search returns one hit per video. First-page duplicates were the single most common complaint and they are gone.
Region badges are correct because membership is tracked explicitly instead of inferred from whichever URL variant happened to land first.
The trending ranker stopped double-counting, which quietly improved the entire discovery feed.
The sitemap shrank and the duplicate-content warnings in Search Console cleared.

Conclusion

URL canonicalization looks like a string-cleaning chore and turns out to be a data-modeling decision. The leverage is in separating the two layers, normalize the string and then extract a platform-stable identity, and in making the write idempotent so that eight regions hammering the same endpoint converge instead of multiply. None of the code here is exotic; it is parse_url, a lookup table, a match, and an ON CONFLICT. The discipline is in choosing one canonical representative and routing every write through it. Build that boundary once, test it hard because platform URL formats will keep shifting under you, and your catalogue stays honest no matter how many regions you fan out to.

DEV Community