ahmet gedik

Posted on Jun 11

Building a Fast Video Metadata Service with Litestar in Python

#python #litestar #async #sqlite

The 280ms problem that started this

When DailyWatch crossed a few million indexed videos, the endpoint that powers our discovery feed started misbehaving. It is a deceptively simple endpoint: give it a query and a region, and it returns a page of videos with title, channel, duration, thumbnail URLs, and a relevance score. Nothing fancy. But under load it was sitting at a p95 of 280ms, and almost none of that was CPU. It was waiting.

Most of DailyWatch runs on PHP 8.4 behind LiteSpeed, with SQLite (FTS5 for full-text search) as the primary store and Cloudflare in front. That stack is genuinely excellent for cached HTML pages — a category page or a watch page is rendered once, cached at the edge, and served thousands of times for free. But the metadata API is a different animal. It does lots of small concurrent reads, it fans out to validate thumbnail availability, and once in a while it calls an external enrichment service for channel data. Under PHP-FPM, every one of those waits held a worker process hostage. Add enough concurrent feed requests and you run out of workers long before you run out of CPU.

I did not want to rewrite the whole site. I wanted to peel off this one chatty, I/O-bound service and run it on something built for concurrency. I rebuilt it in Python on Litestar (the framework formerly known as Starlite). This is the honest write-up of how that went, with the code I actually shipped.

Why Litestar and not FastAPI or Flask

I evaluated FastAPI first because everyone does. It is a fine framework. But three things pushed me to Litestar for this specific job:

msgspec serialization by default. Litestar uses msgspec for encoding and decoding out of the box, rather than going through Pydantic v2 models. For a service whose entire job is shuffling JSON in and out, the serializer is not a detail; it is the hot path. msgspec validates and encodes in C and is measurably faster on wide objects like video records.
A real dependency injection system. FastAPI's Depends is clever but everything is a function call graph. Litestar has layered DI you can attach at the app, router, controller, or handler level, which maps cleanly onto "this connection pool lives for the whole app, this request context lives for one request."
Controllers and layered config. I prefer grouping related routes into a class with shared dependencies and guards over a pile of decorated module functions. Litestar's Controller makes that first-class.

None of this means FastAPI is wrong. It means that for a serialization-bound, I/O-bound microservice, Litestar's defaults lined up with what I needed without me fighting them.

Modeling the payload with msgspec

The first thing I did was define the wire format with msgspec structs instead of dataclasses or Pydantic models. msgspec structs are roughly as cheap to instantiate as a plain tuple, and Litestar knows how to serialize them directly.

from __future__ import annotations

import msgspec
from litestar import Litestar, get
from litestar.controller import Controller
from litestar.params import Parameter


class VideoMeta(msgspec.Struct, frozen=True):
    id: str
    title: str
    channel: str
    duration_seconds: int
    thumbnail: str
    region: str
    score: float


class FeedResponse(msgspec.Struct, frozen=True):
    query: str
    region: str
    count: int
    results: list[VideoMeta]


class FeedController(Controller):
    path = "/feed"

    @get()
    async def search(
        self,
        db: "VideoStore",
        q: str = Parameter(min_length=1, max_length=120),
        region: str = Parameter(default="US", max_length=2),
        limit: int = Parameter(default=24, ge=1, le=100),
    ) -> FeedResponse:
        rows = await db.search(q, region=region, limit=limit)
        return FeedResponse(query=q, region=region, count=len(rows), results=rows)

A few things worth calling out. The Parameter markers give me validation and OpenAPI docs for free — a query shorter than one character or a limit over 100 is rejected before my handler runs. The handler is async, so while one request waits on SQLite, the event loop is free to serve others. And the db argument is not parsed from the request; it is injected. That VideoStore comes from dependency injection, which I will get to.

Because VideoMeta and FeedResponse are msgspec structs, Litestar can hand the whole FeedResponse to msgspec's encoder without first building intermediate dicts. (frozen=True is not what makes this fast; it only makes the instances immutable.) On a 24-item page that is at least 24 fewer dict allocations per request, which adds up when you are doing thousands of requests a second.

Async all the way down to SQLite

The interesting constraint is that I wanted to keep SQLite. It is the right database for DailyWatch — a single file, no server to operate, FTS5 built in, and Cloudflare absorbs most of the read traffic anyway. The catch is that SQLite is synchronous and the C library does blocking I/O. If you call it naively from an async handler you block the event loop, and you have thrown away the entire reason you went async.

The pragmatic answer is aiosqlite, which runs each connection in a dedicated thread and gives you an awaitable interface. Python's sqlite3 module releases the GIL while it is executing inside SQLite's C code, so a handful of reader connections genuinely run in parallel. I also turn on WAL mode so readers are not blocked by the writer.
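
To make the threading model concrete, here is a rough stdlib-only sketch of the same idea: one worker thread owns a sqlite3 connection, and coroutines await results from it. It is an approximation for illustration, not aiosqlite's actual implementation, and the ThreadedConnection class and its method names are hypothetical.

```python
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor


class ThreadedConnection:
    """Hypothetical wrapper: one worker thread owns one sqlite3 connection."""

    def __init__(self, path: str) -> None:
        self._path = path
        # max_workers=1 pins every call to the same thread, which is what
        # sqlite3's default check_same_thread=True requires.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._conn = None

    async def _run(self, fn, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    async def open(self) -> None:
        self._conn = await self._run(sqlite3.connect, self._path)

    async def fetchall(self, sql: str, params: tuple = ()) -> list[tuple]:
        def work() -> list[tuple]:
            # Runs on the worker thread; the event loop stays free meanwhile.
            return self._conn.execute(sql, params).fetchall()

        return await self._run(work)

    async def close(self) -> None:
        await self._run(self._conn.close)
        self._executor.shutdown(wait=True)


async def main() -> list[tuple]:
    db = ThreadedConnection(":memory:")
    await db.open()
    try:
        return await db.fetchall("SELECT 1 + 1")
    finally:
        await db.close()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The handler never touches the connection directly; it only awaits a future that the worker thread completes, which is why a slow query does not stall other requests.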

import aiosqlite


class VideoStore:
    def __init__(self, conn: aiosqlite.Connection) -> None:
        self._conn = conn

    @classmethod
    async def connect(cls, path: str) -> "VideoStore":
        conn = await aiosqlite.connect(path)
        await conn.execute("PRAGMA journal_mode=WAL")
        await conn.execute("PRAGMA synchronous=NORMAL")
        await conn.execute("PRAGMA cache_size=-16000")  # ~16 MB page cache
        conn.row_factory = aiosqlite.Row
        return cls(conn)

    async def search(self, q: str, *, region: str, limit: int) -> list[VideoMeta]:
        # bm25() returns a distance: lower is better, so negate for a score.
        sql = """
            SELECT v.id, v.title, v.channel, v.duration_seconds,
                   v.thumbnail, v.region,
                   -bm25(video_fts, 8.0, 2.0) AS score
            FROM video_fts
            JOIN videos v ON v.rowid = video_fts.rowid
            WHERE video_fts MATCH :match
              AND (v.region = :region OR v.region = 'GLOBAL')
            ORDER BY score DESC
            LIMIT :limit
        """
        match = self._fts_query(q)
        async with self._conn.execute(
            sql, {"match": match, "region": region, "limit": limit}
        ) as cur:
            rows = await cur.fetchall()
        return [
            VideoMeta(
                id=r["id"],
                title=r["title"],
                channel=r["channel"],
                duration_seconds=r["duration_seconds"],
                thumbnail=r["thumbnail"],
                region=r["region"],
                score=r["score"],
            )
            for r in rows
        ]

    @staticmethod
    def _fts_query(raw: str) -> str:
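        # Turn free text into a safe prefix query: 'cat video' -> '"cat"* "video"*'.
        # Quoting each term stops FTS5 from reading characters such as - or :
        # and bare keywords (AND, OR, NOT) as query syntax.
        words = raw.replace('"', " ").split()
        terms = [t for t in words if any(c.isalnum() for c in t)]
        return " ".join(f'"{t}"*' for t in terms) or '""'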

The bm25() call is the part I care most about. FTS5 ships with BM25 ranking built in, and the two numbers after the table name are per-column weights, applied to the columns of video_fts in the order they were declared. Here the title column counts for 8x and the channel for 2x, because a query match in the title is far more meaningful than a match in a channel name. I negate the result because FTS5's bm25() returns a value where lower is more relevant, and I want a score where higher is better.
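
The effect of the weights is easy to check in isolation. The sketch below uses the standard sqlite3 module against an in-memory table and assumes a two-column video_fts(title, channel) layout with made-up rows; the real schema is not shown in this post, so treat the column order as an assumption.

```python
import sqlite3


def ranked_ids(query: str) -> list[tuple[int, float]]:
    conn = sqlite3.connect(":memory:")
    # Assumed layout: title is the first column, channel the second, so the
    # weights 8.0 and 2.0 below apply to them in that order.
    conn.execute("CREATE VIRTUAL TABLE video_fts USING fts5(title, channel)")
    conn.executemany(
        "INSERT INTO video_fts(rowid, title, channel) VALUES (?, ?, ?)",
        [
            (1, "cat compilation", "funny clips"),  # match in the title
            (2, "dog compilation", "cat clips"),    # match in the channel only
            (3, "cooking pasta", "kitchen daily"),
            (4, "guitar lesson", "music school"),
            (5, "city walk", "travel diary"),
        ],
    )
    rows = conn.execute(
        "SELECT rowid, -bm25(video_fts, 8.0, 2.0) AS score "
        "FROM video_fts WHERE video_fts MATCH ? ORDER BY score DESC",
        (query,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for rowid, score in ranked_ids("cat"):
        print(rowid, round(score, 3))
```

Both rows contain the term exactly once and have the same length, so the only thing separating them is the column weight: the title match (rowid 1) sorts ahead of the channel match (rowid 2).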

The _fts_query helper is small but it matters. Passing raw user input straight into a MATCH clause is how you get FTS5 syntax errors (or worse) from a stray quote or operator. Splitting on whitespace, stripping quotes, and appending * for prefix matching gives users the "search as you type" behavior they expect while keeping the query well-formed.
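
A quick way to see the failure mode is to run a stray quote through MATCH with and without sanitizing. This is a minimal sketch using the standard sqlite3 module and a throwaway table; to_prefix_query is a hypothetical standalone helper that wraps each term in double quotes before appending *, and the sample titles are invented.

```python
import sqlite3


def to_prefix_query(raw: str) -> str:
    """Hypothetical helper: quote each term, then mark it as a prefix."""
    words = raw.replace('"', " ").split()
    terms = [t for t in words if any(c.isalnum() for c in t)]
    return " ".join(f'"{t}"*' for t in terms) or '""'


def match_titles(conn: sqlite3.Connection, match: str) -> list[str]:
    rows = conn.execute(
        "SELECT title FROM video_fts WHERE video_fts MATCH ? ORDER BY rowid",
        (match,),
    )
    return [r[0] for r in rows]


def demo() -> dict:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE video_fts USING fts5(title)")
    conn.executemany(
        "INSERT INTO video_fts(title) VALUES (?)",
        [("cat video",), ("catalog of videos",), ("dog video",)],
    )
    raw = 'cat "vid'  # unbalanced quote, as a user might type mid-search
    try:
        match_titles(conn, raw)
        raw_failed = False
    except sqlite3.OperationalError:
        raw_failed = True  # FTS5 rejects the unterminated string
    safe = to_prefix_query(raw)
    result = {"raw_failed": raw_failed, "safe": safe, "titles": match_titles(conn, safe)}
    conn.close()
    return result


if __name__ == "__main__":
    print(demo())
```

Binding the MATCH string as a parameter prevents SQL injection, but it does not stop FTS5 from parsing the string as its own query language, which is why the sanitizing step is still needed.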

Dependency injection instead of globals

The VideoStore needs to exist once for the lifetime of the app, get its connection opened on startup, and closed on shutdown. Litestar's lifespan hooks plus app-level dependencies handle this without a single module-level global.

from contextlib import asynccontextmanager
from collections.abc import AsyncIterator

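from litestar import Litestar
from litestar.datastructures import State
from litestar.di import Provide


@asynccontextmanager
async def lifespan(app: Litestar) -> AsyncIterator[None]:
    store = await VideoStore.connect("/var/data/videos.db")
    app.state.store = store
    try:
        yield
    finally:
        await store._conn.close()


# Litestar injects the application state into a parameter named `state`;
# annotate it so the dependency's signature can be parsed.
async def provide_store(state: State) -> VideoStore:
    return state.store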


app = Litestar(
    route_handlers=[FeedController],
    lifespan=[lifespan],
    dependencies={"db": Provide(provide_store)},
)

Now every handler that declares a db: VideoStore parameter gets the shared store, and nothing reaches into a global. In tests I swap provide_store for one that returns an in-memory store seeded with fixtures, and the handler code does not change at all. That testability is the quiet payoff of DI: it is not about ceremony, it is about being able to swap the slow real thing for a fast fake at the edges.

For heavier setups you would open several reader connections and round-robin them, since each aiosqlite.Connection is single-threaded. For DailyWatch's feed traffic, a small pool of three to five readers plus one dedicated writer connection is plenty, and WAL mode means the writer never blocks them.
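
As a rough illustration of the round-robin idea, the sketch below rotates reads across a few connections. It is not the pool DailyWatch runs: it uses the standard sqlite3 module with asyncio.to_thread instead of aiosqlite, and the ReaderPool class, the videos table, and the temporary database file are all hypothetical.

```python
import asyncio
import itertools
import os
import sqlite3
import tempfile


class ReaderPool:
    """Hypothetical round-robin pool of sqlite3 reader connections."""

    def __init__(self, path: str, size: int = 3) -> None:
        # check_same_thread=False lets asyncio.to_thread use any worker thread;
        # the per-connection lock keeps each connection to one query at a time.
        self._slots = [
            (sqlite3.connect(path, check_same_thread=False), asyncio.Lock())
            for _ in range(size)
        ]
        self._next = itertools.cycle(self._slots)

    async def fetchall(self, sql: str, params: tuple = ()) -> list[tuple]:
        conn, lock = next(self._next)
        async with lock:
            return await asyncio.to_thread(
                lambda: conn.execute(sql, params).fetchall()
            )

    def close(self) -> None:
        for conn, _ in self._slots:
            conn.close()


async def demo() -> list[int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "videos.db")
        seed = sqlite3.connect(path)
        seed.execute("PRAGMA journal_mode=WAL")
        seed.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")
        seed.executemany("INSERT INTO videos (title) VALUES (?)", [("a",), ("b",), ("c",)])
        seed.commit()
        seed.close()

        pool = ReaderPool(path, size=3)
        try:
            results = await asyncio.gather(
                *(pool.fetchall("SELECT COUNT(*) FROM videos") for _ in range(6))
            )
        finally:
            pool.close()
        return [rows[0][0] for rows in results]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Six concurrent reads land on three connections, two per connection, and the lock makes the second read on each connection wait its turn rather than interleave.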

Keeping the PHP stack honest

I did not throw away PHP. The site's pages still render in PHP 8.4 under LiteSpeed, and they call this Litestar service over the local network for the dynamic feed slot. The PHP side stays simple — it just asks for JSON and trusts the schema, which msgspec guarantees.

<?php
declare(strict_types=1);

function fetch_feed(string $query, string $region = 'US', int $limit = 24): array
{
    $url = 'http://127.0.0.1:8001/feed?' . http_build_query([
        'q'      => $query,
        'region' => $region,
        'limit'  => $limit,
    ]);

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT_MS     => 400,
        CURLOPT_FAILONERROR    => true,
    ]);

    $raw = curl_exec($ch);
    if ($raw === false) {
        curl_close($ch);
        return [];           // degrade gracefully: render cached fallback
    }
    curl_close($ch);

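    try {
        $data = json_decode($raw, true, flags: JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return [];           // malformed body: same fallback as a failed call
    }
    return is_array($data) ? ($data['results'] ?? []) : [];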
}

The important design choice here is the 400ms timeout and the empty-array fallback. The feed is an enhancement, not a hard dependency. If the Litestar service is restarting or slow, PHP renders the cached "trending" block instead of throwing a 500. A free video discovery site lives and dies on never showing the user a broken page, so every cross-service call has a fallback that is good enough to ship.

This boundary also keeps the two languages playing to their strengths. PHP under LiteSpeed is unbeatable at serving cached, templated HTML. Python under an async server is better at the chatty, concurrent, I/O-bound work. Drawing the line at "HTML is PHP, dynamic JSON fan-out is Litestar" meant I migrated one service instead of rewriting a site.

Caching at the edge with Cloudflare

The feed responses are not unique per user; they are unique per (query, region). That makes them cacheable, and Cloudflare is already in front of everything. I set explicit cache headers from Litestar so the edge does the heavy lifting and the Python service mostly handles cache misses and warmups. Note that Cloudflare does not cache JSON responses by default, so the /feed path also has to be made eligible with a cache rule.

from litestar import get
from litestar.response import Response
from litestar.enums import MediaType


# Method of FeedController; the "/feed" prefix comes from the controller's path.
@get(cache=False)
async def search(self, db: VideoStore, q: str, region: str = "US") -> Response[FeedResponse]:
    rows = await db.search(q, region=region, limit=24)
    body = FeedResponse(query=q, region=region, count=len(rows), results=rows)
    return Response(
        content=body,
        media_type=MediaType.JSON,
        headers={
            # Edge caches for 5 min, serves stale for an hour while revalidating.
            "Cache-Control": "public, max-age=300, stale-while-revalidate=3600",
            "Vary": "Accept-Encoding",
        },
    )

The stale-while-revalidate directive is the trick that keeps latency flat. When a cached feed expires, Cloudflare serves the stale copy immediately and refreshes it in the background, so a user never waits on revalidation. With these values a response can be up to 65 minutes old in the worst case (5 minutes fresh plus an hour of stale-while-revalidate), which is fine for a discovery feed where "the trending list from a few minutes ago" is the common case. This turns the vast majority of requests into edge hits that never touch Python at all. The origin only sees genuinely cold queries and background refreshes.

One caveat I learned the hard way: be careful what you put in Vary. Varying on a cookie or a fine-grained header fragments the cache into uselessness. Vary on encoding, key your cache on the query and region in the URL, and keep the cacheable surface boring on purpose.
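
Since the URL is the cache key, requests that differ only in casing, whitespace, or parameter order end up as separate cache entries. One way to keep the surface boring is to build feed URLs through a single normalizing helper on the calling side. The sketch below is an assumption about how that could look, not code from the service; canonical_feed_url is a hypothetical name, and lowercasing the query assumes the search itself is case-insensitive.

```python
from urllib.parse import urlencode


def canonical_feed_url(q: str, region: str = "US", limit: int = 24) -> str:
    """Hypothetical helper: one URL per logical (query, region, limit)."""
    params = [
        ("limit", str(limit)),
        ("q", " ".join(q.lower().split())),  # collapse case and whitespace
        ("region", region.upper()),
    ]
    # A fixed parameter order means equivalent requests share one cache key.
    return "/feed?" + urlencode(params)


if __name__ == "__main__":
    print(canonical_feed_url("  Cat   Video ", "us"))
```

Two users typing "Cat Video" and "cat  video" then hit the same edge entry instead of warming two.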

What the numbers looked like

I will not pretend this was a perfectly controlled benchmark — it was a production cutover with real traffic. But the before-and-after on the same hardware was clear:

p95 latency at the origin dropped from ~280ms to ~45ms for cache misses, mostly because requests no longer queued behind blocked workers.
Concurrency before saturation went up sharply. The PHP-FPM version started shedding requests around a few hundred concurrent feed calls because workers were stuck in I/O wait. The async version held steady well past that on the same box, since waiting requests cost an event-loop slot, not a whole process.
Edge hit ratio settled above 90% once stale-while-revalidate was tuned, which means the Python service is mostly idle and only earns its keep on the long tail of unique queries.

The serialization win from msgspec is real but it is not where the headline number comes from. The headline number comes from not blocking a worker on I/O. The msgspec speedup shows up as lower CPU per request, which matters for the bill, not the latency.

What I would do differently

A few honest regrets and notes:

I should have added a writer queue sooner. With WAL and a single writer connection, concurrent writes serialize fine, but I initially shared the writer connection across coroutines and hit "database is locked" under bursty ingestion. The fix was a dedicated writer with its own task and an async queue feeding it.
Prefix-only FTS5 matching is greedy. Appending * to every term is great for short queries and bad for long ones, where it over-matches. I now cap prefix expansion to the last token, which is what users are usually still typing.
Connection-per-thread is a real ceiling. aiosqlite is the right tool at this scale, but if read volume ever outgrows a small pool, the move is to a read-replica file or to switch the hot table to a server-based store. SQLite earned its place; I just want to know where the wall is before I hit it.
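
The writer-queue fix from the first point can be sketched with the standard library alone. This is an illustration under assumptions, not the code running on DailyWatch: it uses sqlite3 with asyncio.to_thread in place of aiosqlite, and the WriteQueue class, its methods, and the videos table are hypothetical.

```python
import asyncio
import sqlite3


class WriteQueue:
    """Hypothetical single writer: one task drains a queue of statements."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._queue: asyncio.Queue = asyncio.Queue()
        self._task = None

    def start(self) -> None:
        self._task = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while True:
            item = await self._queue.get()
            if item is None:  # sentinel: stop once earlier writes are done
                break
            sql, params, done = item
            try:
                # Only this task ever touches the connection, so writes
                # run strictly one after another.
                await asyncio.to_thread(self._write, sql, params)
                done.set_result(None)
            except sqlite3.Error as exc:
                done.set_exception(exc)

    def _write(self, sql: str, params: tuple) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(sql, params)

    async def execute(self, sql: str, params: tuple = ()) -> None:
        done = asyncio.get_running_loop().create_future()
        await self._queue.put((sql, params, done))
        await done  # resolves after the write has been committed

    async def stop(self) -> None:
        await self._queue.put(None)
        await self._task
        self._conn.close()


async def demo() -> int:
    writer = WriteQueue(":memory:")
    writer.start()
    await writer.execute("CREATE TABLE videos (id TEXT PRIMARY KEY, title TEXT)")
    await asyncio.gather(
        *(writer.execute("INSERT INTO videos VALUES (?, ?)", (f"v{i}", f"title {i}"))
          for i in range(50))
    )
    # All writes have completed, so peeking through the same connection is safe here.
    count = writer._conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    await writer.stop()
    return count


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Fifty coroutines submit inserts at once, but the database only ever sees one writer, so bursts turn into queue depth instead of lock contention.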

Conclusion

The lesson was not "rewrite everything in async Python." It was "find the one service that is I/O-bound and chatty, and move only that." PHP 8.4 under LiteSpeed still renders every page on the site, SQLite with FTS5 is still the store, and Cloudflare still absorbs most of the traffic. Litestar slotted in as a focused metadata service that does the concurrent, waiting-heavy work the blocking stack was bad at. msgspec kept the serialization cheap, async kept the workers free, and dependency injection kept the whole thing testable. If you have a corner of your stack that spends its life waiting on I/O, peeling it off onto Litestar is a small, low-risk bet that paid off cleanly for us.
