Building Video Metadata Semantic Search with pgvector and OpenAI Embeddings

#postgres #php #ai #webdev

Our keyword search was quietly failing a third of our users. Someone would type "funny cat falls off table" into ViralVidVault, and our SQLite LIKE '%funny%cat%' query would return nothing — because the actual trending clip was titled "Maine Coon yeets itself into the void." Same video, zero lexical overlap. We track viral video metadata across European markets, and viral content is exactly where titles drift furthest from the words people use to search. Slang, regional phrasing, emoji-laden descriptions, machine-translated captions — keyword matching breaks on all of it.

The fix was semantic search: embed every video's metadata into a vector, embed the query into the same space, and rank by cosine similarity. This post is the production write-up of how we built that on top of pgvector and OpenAI's text-embedding-3-small, how we kept it GDPR-clean, and how we glued it into a stack that is otherwise PHP 8.4, SQLite in WAL mode, LiteSpeed, and a Cloudflare Worker at the edge.

Why Postgres for vectors when the app runs on SQLite

The main site runs on SQLite with WAL enabled. It is fast, file-based, and survives traffic spikes on LiteSpeed without a separate daemon. But vector search wants three things SQLite does not give you natively: a vector column type, an approximate-nearest-neighbour index, and a distance operator that the planner understands. There are SQLite extensions (sqlite-vec is good), but our embedding corpus is ~2.1M rows and growing, and we wanted HNSW indexing with tunable recall. So we run a dedicated Postgres 16 instance with the pgvector extension purely as a search sidecar. The canonical metadata still lives in SQLite; Postgres holds a denormalised projection plus the vectors.

This split matters for a GDPR reason too. The embedding service is the only component that talks to a US-based API (OpenAI). By isolating it, we can document exactly what data crosses the border and keep personal data out of it entirely. We embed video metadata — titles, descriptions, tags, channel names — never user queries tied to an identity, never IP-linked search history. The Postgres box lives in the EU (Hetzner Falkenstein) and logs nothing query-side.

Schema first. The extension and the table:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE video_embeddings (
    video_id     TEXT PRIMARY KEY,
    region       TEXT NOT NULL,        -- 'DE','FR','PL', etc.
    title        TEXT NOT NULL,
    embedding    vector(1536) NOT NULL,
    content_hash TEXT NOT NULL,        -- sha256 of the embedded text
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for cosine distance. m and ef_construction
-- trade build time/memory for recall.
CREATE INDEX video_embeddings_hnsw
    ON video_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- We filter by region constantly, so index it.
CREATE INDEX video_embeddings_region ON video_embeddings (region);

The content_hash column is the unglamorous hero here. Embeddings cost money and rate limit. Re-embedding a video whose metadata has not changed is pure waste. We hash the exact string we send to OpenAI; if the hash already exists with the same value, we skip the API call entirely. On our corpus this drops re-embed cost by about 94% on each nightly refresh, because viral videos churn in ranking far more than they churn in metadata.

Generating embeddings without melting your rate limit

The embedding pipeline is a Python service — Python because the OpenAI SDK and async batching are pleasant there, and because it runs as a cron job outside the request path. It reads new and changed rows from a queue table, batches them (the embeddings endpoint accepts up to 2048 inputs per call), and upserts into Postgres.

The text you embed is the highest-leverage decision in the whole system. Do not just embed the title. We concatenate a weighted, structured document so the vector captures channel and tag context:

import hashlib
import asyncio
import os
import psycopg
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
MODEL = "text-embedding-3-small"  # 1536 dims, cheap, plenty good

def build_document(row: dict) -> str:
    # Order matters: titles first, they carry the most signal.
    parts = [
        row["title"],
        f"Channel: {row['channel']}",
        f"Tags: {', '.join(row['tags'][:12])}",
        (row["description"] or "")[:500],
    ]
    return "\n".join(p for p in parts if p)

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

async def embed_batch(rows: list[dict]) -> list[dict]:
    docs, hashes = [], []
    for r in rows:
        doc = build_document(r)
        docs.append(doc)
        hashes.append(content_hash(doc))

    resp = await client.embeddings.create(model=MODEL, input=docs)
    out = []
    for r, h, item in zip(rows, hashes, resp.data):
        out.append({
            "video_id": r["video_id"],
            "region": r["region"],
            "title": r["title"],
            "embedding": item.embedding,
            "content_hash": h,
        })
    return out

async def upsert(conn, records: list[dict]) -> None:
    async with conn.cursor() as cur:
        for rec in records:
            await cur.execute(
                """
                INSERT INTO video_embeddings
                    (video_id, region, title, embedding, content_hash, updated_at)
                VALUES (%(video_id)s, %(region)s, %(title)s,
                        %(embedding)s, %(content_hash)s, now())
                ON CONFLICT (video_id) DO UPDATE SET
                    region = EXCLUDED.region,
                    title = EXCLUDED.title,
                    embedding = EXCLUDED.embedding,
                    content_hash = EXCLUDED.content_hash,
                    updated_at = now()
                WHERE video_embeddings.content_hash
                      IS DISTINCT FROM EXCLUDED.content_hash
                """,
                rec,
            )
    await conn.commit()

Two things worth calling out. First, the WHERE ... IS DISTINCT FROM clause on the upsert means even if the queue feeds us a row we have already embedded, the write is a no-op unless the content actually changed — belt and braces on top of the pre-call hash check. Second, text-embedding-3-small at 1536 dimensions is deliberate. The -large model (3072 dims) scored maybe 2–3% higher on our internal relevance evals but doubled both storage and index memory. For viral video discovery, where the corpus turns over weekly anyway, that is not worth it.

Batching strategy in practice:

Chunk to 256 inputs per request, not the 2048 max. Smaller chunks fail cheaper on transient errors and keep individual request latency under a second.
Back off on 429 with jittered exponential delay. The embeddings endpoint rate-limits on tokens-per-minute, not just requests, so a burst of long descriptions can trip it unexpectedly.
Run embedding as a nightly cron, never inline. Search must keep working if OpenAI is down; you serve slightly stale vectors, nobody notices.

Querying from PHP 8.4

The search request itself is dead simple once the index exists. The PHP front end embeds the user's query (one tiny API call), then asks Postgres for nearest neighbours. We connect to Postgres via PDO only for this one feature — the rest of the app stays on SQLite.

Here is the search service. Note the set_config for hnsw.ef_search, which is the runtime recall knob: higher means more accurate, slower; we found 80 a good balance.

<?php
declare(strict_types=1);

final class SemanticVideoSearch
{
    public function __construct(
        private readonly \PDO $pg,
        private readonly string $openAiKey,
    ) {}

    /** @return list<array{video_id:string,title:string,score:float}> */
    public function search(string $query, string $region, int $limit = 20): array
    {
        $vector = $this->embedQuery($query);
        $literal = '[' . implode(',', $vector) . ']';

        // Tune recall for this session only.
        $this->pg->exec('SET LOCAL hnsw.ef_search = 80');

        $sql = <<<SQL
            SELECT video_id, title,
                   1 - (embedding <=> :vec) AS score
            FROM video_embeddings
            WHERE region = :region
            ORDER BY embedding <=> :vec
            LIMIT :limit
        SQL;

        $stmt = $this->pg->prepare($sql);
        $stmt->bindValue(':vec', $literal);
        $stmt->bindValue(':region', $region);
        $stmt->bindValue(':limit', $limit, \PDO::PARAM_INT);
        $stmt->execute();

        return array_map(
            static fn(array $r): array => [
                'video_id' => $r['video_id'],
                'title'    => $r['title'],
                'score'    => (float) $r['score'],
            ],
            $stmt->fetchAll(\PDO::FETCH_ASSOC),
        );
    }

    /** @return list<float> */
    private function embedQuery(string $query): array
    {
        $ch = curl_init('https://api.openai.com/v1/embeddings');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 4,
            CURLOPT_HTTPHEADER    => [
                'Authorization: Bearer ' . $this->openAiKey,
                'Content-Type: application/json',
            ],
            CURLOPT_POSTFIELDS => json_encode([
                'model' => 'text-embedding-3-small',
                'input' => $query,
            ], JSON_THROW_ON_ERROR),
        ]);

        $raw = curl_exec($ch);
        if ($raw === false) {
            throw new \RuntimeException('embedding failed: ' . curl_error($ch));
        }
        curl_close($ch);

        $data = json_decode($raw, true, 512, JSON_THROW_ON_ERROR);
        return $data['data'][0]['embedding'];
    }
}

The <=> operator is pgvector's cosine distance. We return 1 - distance as a 0..1 similarity score so the front end can show a relevance bar and, more usefully, threshold out garbage — anything under 0.25 we drop rather than show a weak match. The SET LOCAL is scoped to the transaction, so it never leaks into other connections from the pool.

One gotcha that cost me an afternoon: passing the vector as a bound parameter. pgvector accepts the [1,2,3] text literal, and PDO sends it as a string, which is fine. But if you try to use a server-side prepared statement with a typed vector parameter through some drivers, you get cryptic cast errors. The text-literal-as-string approach is portable and the planner still uses the HNSW index. Verify with EXPLAIN ANALYZE that you see an Index Scan using video_embeddings_hnsw and not a sequential scan — if you see seq scan, your ORDER BY distance expression does not match the index's vector_cosine_ops.

Caching at the edge with a Cloudflare Worker

Embedding the query is one network round-trip to the US before we even hit Postgres. For popular queries — and viral search is extremely head-heavy, the same dozen trending terms dominate — that round-trip is wasted. We cache full search responses at the edge in a Cloudflare Worker keyed by normalised query plus region. Popular queries never touch the origin, never touch OpenAI, and never leave the EU edge node closest to the user.

export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const q = (url.searchParams.get("q") || "").trim().toLowerCase();
    const region = url.searchParams.get("region") || "DE";
    if (q.length < 2) return new Response("[]", { status: 400 });

    // Normalise so "Funny Cat " and "funny cat" share a cache entry.
    const norm = q.replace(/\s+/g, " ");
    const cacheKey = new Request(
      `https://cache.internal/search?q=${encodeURIComponent(norm)}&region=${region}`,
      { method: "GET" }
    );

    const cache = caches.default;
    let hit = await cache.match(cacheKey);
    if (hit) return hit;

    const origin = await fetch(
      `${env.ORIGIN}/api/semantic-search?q=${encodeURIComponent(norm)}&region=${region}`,
      { headers: { "X-Edge": "vvv" } }
    );

    const body = await origin.text();
    const resp = new Response(body, {
      status: origin.status,
      headers: {
        "Content-Type": "application/json",
        // Viral terms shift hourly; 600s is a sane TTL.
        "Cache-Control": "public, max-age=600",
      },
    });

    if (origin.ok) ctx.waitUntil(cache.put(cacheKey, resp.clone()));
    return resp;
  },
};

The 600-second TTL is a deliberate tradeoff: viral rankings move fast, but the set of videos matching "funny cat" semantically is stable across ten minutes even as their view counts climb. We invalidate aggressively only on the nightly re-embed by bumping a version prefix in the cache key, which the Worker reads from an environment variable. Edge hit rate sits around 70% during peak hours, which means 70% of searches cost us nothing in API calls or Postgres CPU.

Hybrid ranking: semantic is not always right

Pure vector search has a failure mode: it is too fuzzy. Someone searching for an exact channel name or a specific meme phrase wants the literal match first, not a vibes-based neighbour. So we run a cheap hybrid. The semantic results come back from Postgres, and we re-rank with a small lexical boost computed in PHP against the title:

If the query appears as a substring in the title, add a fixed boost to the score.
If query tokens overlap with the channel name, add a smaller boost.
Otherwise, trust the cosine score as-is.

This is not Reciprocal Rank Fusion or anything academic — it is a five-line nudge that fixed the most common complaint ("I searched the exact title and it was third"). The lesson generalises: semantic search is a recall machine, not a precision machine. Pair it with a cheap lexical signal and you get both.

We also clamp results by region at the SQL level rather than post-filtering, because a German user searching trending content does not want Polish-only virals diluting the top 20. Because region is indexed and the HNSW scan respects the WHERE, this stays fast even with the filter.

What it cost and what it bought

Numbers from the first month in production:

Initial backfill of 2.1M videos: about $42 in embedding API calls, run over two nights to stay under rate limits.
Nightly incremental: roughly $0.40/day thanks to the content-hash skip — only genuinely changed metadata gets re-embedded.
Storage: 1536 floats × 4 bytes × 2.1M rows ≈ 13 GB of vectors, plus the HNSW index at about 4 GB resident. Fits comfortably on a 32 GB Postgres box.
Query latency: p50 of 38 ms at the origin (Postgres only, query already embedded), p95 of 110 ms. With edge caching, p50 perceived latency is under 10 ms for cached terms.

The relevance win is harder to put one number on, but our "search returned zero results" rate fell from 31% to 4%. That 27-point swing is almost entirely the long tail of slang, translations, and creative titles that keyword matching could never reach.

Conclusion

Semantic search over video metadata turned out to be one of the highest-leverage, lowest-complexity features we have shipped. The architecture is boring on purpose: Postgres with pgvector as an isolated search sidecar, OpenAI embeddings generated offline in a nightly Python cron, a thin PHP query layer, and a Cloudflare Worker absorbing the head of the distribution at the edge. Nothing about it required abandoning our SQLite-on-LiteSpeed core.

If you take three things away: embed a structured document and not just the title; hash your content so you never pay to re-embed unchanged rows; and pair the semantic recall with a cheap lexical boost so exact matches still win. Keep the US-bound API surface tiny and metadata-only, and the whole thing stays GDPR-defensible by construction. The viral clip titled "Maine Coon yeets itself into the void" now shows up for "funny cat falls off table" — which is the entire point.