Scaling From 3 Cinemas to 240+ Venues: What Broke and What Evolved

When I started scraping London cinema listings, I had three venues and a simple script. Fetch a page, parse it, done. Fast forward to today: 240+ venues, half a dozen different platform types, and a pipeline that runs daily across both GitHub's cloud runners and a cluster of 6 Raspberry Pis in my living room.

Here's what I learned about building extraction systems that scale, and the architectural decisions that emerged from necessity rather than planning.

The Retrieve/Transform Split: How Purity Became Practical

Early on, I had a simple mental model: retrieve grabs the main page, transform figures out what to do with it. If transform needed more data, it just... made more requests. Simple enough, right?

Wrong 😅

This made transform impure. It was making network calls, which created a cascading set of problems:

  • Debugging was a nightmare - request code wasn't all in one place
  • Caching became complicated - you now have to cache in two different jobs. If you clear the cache of one job, what impact will that have on the other job?
  • Testing was fragile - you couldn't test transform logic without network access

The solution wasn't about network topology or runner management. It was about simplicity and separation of concerns.

The new contract is simple:

  • retrieve does all the fetching - even if it needs to parse HTML to find links to follow
  • transform makes zero network calls - it takes inputs and produces data that adheres to the schema, that's the guarantee

Each function has a single responsibility. Retrieve handles the messy, stateful, network-dependent work. Transform does the pure, testable, repeatable work.

In practice, this means retrieve might fetch a main page, parse it for film listing URLs, fetch all of those, and hand everything to transform as a bundle. Transform just processes what it's given.
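To make that concrete, here's a minimal sketch of the contract in TypeScript - the names are illustrative, not lifted from the real codebase:

```typescript
// Illustrative types only - not the real codebase's names.
interface VenueConfig {
  id: string;
  url: string;
}

// What retrieve hands to transform: every raw body it fetched, keyed by URL.
interface RawBundle {
  fetchedAt: string;             // ISO timestamp of the retrieve run
  pages: Record<string, string>; // URL -> raw HTML or JSON body
}

// Retrieve is async and impure: it owns every network call.
type Retrieve = (config: VenueConfig) => Promise<RawBundle>;

// Transform is pure: bundle in, schema-conforming data out.
// No fetch, no reads outside its arguments.
type Transform = (bundle: RawBundle) => VenueEvents;

interface VenueEvents {
  venueId: string;
  showings: unknown[]; // whatever the shared schema defines
}
```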

This matters for more than just clean code. Once all retrieves complete, the pipeline creates a GitHub release with an immutable blob of all the raw data. Then transform jobs run against that release. If I change downstream code later, I can re-run transforms on old data without hitting anyone's servers again. That only works if transforms are pure functions.

The retrieve workflow lives in one repository, transform in another. Each creates releases named by timestamp. Clean separation all the way down.

The Variety of Retrieval Strategies

With 240 venues, you see every possible variation of how a cinema might publish its data. Here's what emerged:

Single Page: The Dream

Example: Prince Charles Cinema

One big page with everything you need. Parse it once, you're done. These are vanishingly rare and I treasure them.

Main Page + Listing Pages: The Common Pattern

Example: The Castle Cinema

This is by far the most common pattern. You fetch the main "what's on" page to discover what films are showing, then fetch each film's individual listing page for the rich data you need for proper matching - full synopsis, runtime, cast, directors.

It's two-stage, but predictable. Retrieve handles both stages, transform gets a complete dataset.
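A rough sketch of what that two-stage retrieve looks like with Cheerio - the selector and URL handling here are placeholders, since every venue needs its own:

```typescript
import * as cheerio from "cheerio";

// Hypothetical two-stage retrieve: selectors and bundle shape are made up.
async function retrieveTwoStage(whatsOnUrl: string) {
  const mainHtml = await (await fetch(whatsOnUrl)).text();
  const $ = cheerio.load(mainHtml);

  // Stage 1: discover each film's listing page from the "what's on" grid.
  const listingUrls = $("a.film-card")
    .map((_, el) => new URL($(el).attr("href")!, whatsOnUrl).href)
    .get();

  // Stage 2: fetch every listing page so transform gets a complete dataset.
  const pages: Record<string, string> = { [whatsOnUrl]: mainHtml };
  for (const url of listingUrls) {
    pages[url] = await (await fetch(url)).text();
  }
  return { fetchedAt: new Date().toISOString(), pages };
}
```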

JSON/API Endpoints: The Developer's Joy

When a cinema exposes a proper API, everything gets easier.

Normal JSON: Cineworld has straightforward endpoints. Hit them, parse the response, done.

Big Standard (OCAPI): This is where it gets interesting. Open Commerce API (OCAPI) is a standardised ticketing platform API used by both Curzon and ODEON. One unified codebase handles two of the biggest cinema chains in London. When you discover a new cinema runs on OCAPI, it's trivial to add - just point the existing module at their endpoints.

Weird JSON: Metro Cinema technically has a JSON API, but it requires signed requests with a hard-coded API key baked into the front end. There's a bunch of hoop-jumping involved. Still better than parsing HTML, but barely.

GraphQL: Same Benefits, Different Query Language

Example: ActOne Cinema

Like JSON endpoints, but with GraphQL queries. You get structured data without HTML wrangling. The learning curve is steeper than REST, but the payoff is the same - no HTML parsing.
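Under the hood it's still just a POST request - something along these lines, with an invented query shape:

```typescript
// Hypothetical GraphQL retrieve - the endpoint and field names are placeholders.
async function retrieveGraphql(endpoint: string) {
  const query = `
    query Screenings {
      screenings {
        filmTitle
        startsAt
        bookingUrl
      }
    }`;
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ query }),
  });
  return res.json(); // structured data, zero HTML wrangling
}
```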

The HTML Parsing Toolkit: Cheerio, Playwright, and date-fns

When there's no API and you're parsing HTML, you need the right tools for the job.

Cheerio - For sites that let you just fetch their HTML. Cheerio is like jQuery but without an actual DOM. You can do CSS selectors and extraction without spinning up a browser. Fast and lightweight.

Playwright - For sites that won't let you just fetch HTML. Maybe they have bot detection, maybe they're heavily client-side rendered, maybe they need requests from residential IPs (hello, cluster of 6 Pis). You need a real browser to make it work.

The BFI is the worst offender for needing this. Both BFI Southbank and BFI IMAX run on the same slow, inconsistent site. Pages load in pieces asynchronously and often time out. It's the longest-running retrieve in the entire pipeline. There's no API. It's just a slog 😭
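For reference, a Playwright retrieve looks something like this - the selector and timeout are placeholders, but the shape is the same whether it runs on GitHub's runners or a Pi:

```typescript
import { chromium } from "playwright";

// Hypothetical browser-based retrieve for client-rendered or bot-protected sites.
async function retrieveWithBrowser(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });
    // Wait for the listings to actually render before grabbing the HTML.
    await page.waitForSelector(".listing", { timeout: 60_000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```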

date-fns - Once you've extracted the data, you still have to parse it. Cinema websites output dates and times in wildly different formats. date-fns handles converting these strings into date objects so we can generate the timestamps the schema requires. Anyone who's worked with dates knows how much of a headache they can be without a good library!
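Two made-up but representative examples of the kind of strings you end up parsing:

```typescript
import { parse } from "date-fns";

// Format strings here are examples, not an exhaustive list of what venues emit.
const a = parse("Fri 21st June 2024 8:30pm", "EEE do MMMM yyyy h:mmaaa", new Date());
const b = parse("21/06/2024 20:30", "dd/MM/yyyy HH:mm", new Date());
```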

Complex Multi-Page: When Listings and Booking Are Separate

Example: Science Museum

This is where it gets properly complicated:

  1. Retrieve "products" from their JSON API
  2. Filter for movies (because they sell all kinds of products)
  3. Now we've got the titles - but nothing else; there are no links to detail pages in this data
  4. Use their HTML search page to search for each title and scrape the first match (this only works because the Science Museum doesn't show many films and they have distinct titles)
  5. Fetch the listing page HTML for each match to get full movie details

It's a multi-stage dance between JSON and HTML, search and direct fetch, just to get a complete dataset. And Retrieve handles all of this. Transform just processes the final bundle.
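Compressed into a sketch (every URL, field, and selector below is a placeholder):

```typescript
import * as cheerio from "cheerio";

// Hypothetical multi-stage retrieve mixing JSON, search, and HTML.
async function retrieveMuseumStyle() {
  // Steps 1-2: pull products from the JSON API and keep only the films.
  const products: { title: string; type: string }[] =
    await (await fetch("https://example.org/api/products")).json();
  const films = products.filter((p) => p.type === "film");

  // Steps 3-5: search for each title, scrape the first hit, fetch its listing page.
  const pages: Record<string, string> = {};
  for (const film of films) {
    const searchUrl = `https://example.org/search?q=${encodeURIComponent(film.title)}`;
    const searchHtml = await (await fetch(searchUrl)).text();
    const href = cheerio.load(searchHtml)("a.search-result").first().attr("href");
    if (href) pages[href] = await (await fetch(href)).text();
  }
  return pages;
}
```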

Shared Cinema Platforms: When Adding Venues Becomes Trivial

The absolute best moment in maintaining this pipeline is discovering a new cinema runs on a platform you already support.

OCAPI powers ODEON and Curzon. One codebase, two major chains, dozens of screens.

Savoy is the big one for independent cinemas - when you find a new independent cinema and realize it's running Savoy's platform, you just configure a new venue to point at it. No new extraction code needed.

Indy Cinema Group and AdmitOne both power multiple cinemas in the dataset. Same pattern - write the platform integration once, point it at new venues as you discover them.

When a cinema migrates between platforms you already know, updating is a trivial config change. This is what makes scaling from a few venues to 200+ feasible - you're not writing 200 different scrapers, you're pointing a dozen implementations at different configurations.
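In config terms, it looks roughly like this (the shape and names are invented for illustration):

```typescript
// Hypothetical venue registry: one platform module, many venue configs.
interface PlatformVenue {
  platform: "ocapi" | "savoy" | "indy-cinema-group" | "admit-one";
  venueId: string;
  endpoint: string;
}

// A newly discovered Savoy-powered cinema is one more entry, zero new code.
const venues: PlatformVenue[] = [
  { platform: "ocapi", venueId: "odeon-leicester-square", endpoint: "https://example.com/ocapi" },
  { platform: "savoy", venueId: "new-indie-cinema", endpoint: "https://example.com/savoy" },
];
```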

Event Platforms: When Venues Don't Have Their Own Sites

Not every screening venue maintains its own website with listings. Some only publish events on platforms like Eventbrite, Dice, or OutSavvy (in the codebase we call them "sources").

Here's how the pipeline handles this:

Once per retrieve run, pull all London film-specific events from each source. How we get those varies:

  • Some sources let you filter directly on "Films"
  • For others we search "Films" and "Theatre" (to catch theatre-on-film like NT Live)
  • Some require keyword searches and some processing once we have the data

From the source, we now have a bunch of events for lots of different venues, some of which may not even be in London. This is where the setup for sources differs - sources don't transform, they "find". Using the venue attributes - name, address, coordinates, alternative names - they find matching events that the venue's transform function can then incorporate when outputting the final list of venue events.

Each source is responsible for matching based on what data it has. Most compare against the venue name (and list of alternative names like "The Ritzy" vs "Ritzy Picturehouse") plus either:

  • Coordinate match within 100m, or
  • Postcode match (some listings have wrong coordinates but correct addresses)

Name matching is fuzzy - basic normalization before comparing. I've never seen false positives because the matching is pretty specific, so we're more likely to miss events than mismatch them. There are analysis scripts for each source showing which events matched and which didn't, so we can manually review for missing events.
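A hedged sketch of those rules - the normalization and the distance check are illustrative, not the exact implementation:

```typescript
// Hypothetical matching logic: name must match AND location must corroborate.
interface Venue { name: string; alternativeNames: string[]; lat: number; lon: number; postcode: string; }
interface EventListing { venueName: string; lat: number; lon: number; postcode: string; }

// Basic normalization: lowercase, drop a leading "the", strip punctuation.
const normalize = (s: string) =>
  s.toLowerCase().replace(/^the\s+/, "").replace(/[^a-z0-9]/g, "");

// Haversine distance check, in metres.
function withinMetres(aLat: number, aLon: number, bLat: number, bLon: number, max: number) {
  const R = 6371000; // Earth radius in metres
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLon = toRad(bLon - aLon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h)) <= max;
}

function matches(event: EventListing, venue: Venue): boolean {
  const names = [venue.name, ...venue.alternativeNames].map(normalize);
  const nameHit = names.includes(normalize(event.venueName));
  const locationHit =
    withinMetres(event.lat, event.lon, venue.lat, venue.lon, 100) ||
    event.postcode === venue.postcode;
  return nameHit && locationHit;
}
```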

Event-source-only venues don't have a website to retrieve at all. And their transform just returns whatever the sources found.

Example: BFI Stephen Street - a private hire screen that only appears on event platforms when someone books it for a public screening.

The beauty of this pattern: when a new venue shows up on Eventbrite, adding it is minimal effort. The event data is already being pulled daily. You just register the venue metadata and let the matching happen.

What This Looks Like In Practice

Here's the flow:

  1. Retrieve jobs run - some on GitHub's cloud runners, some on the local cluster for sites that need residential IPs
  2. Data gets aggregated into a GitHub release in the retrieve repository
  3. Transform jobs pull that release and run on GitHub's cloud
  4. Each transform is pure - it processes the data it's given, optionally merging in matched events from the event sources
  5. Output is data conforming to a standardized schema, regardless of whether the source was a single HTML page, a GraphQL API, or an Eventbrite search
  6. Final transformed data gets published as a release in the transform repository

The system isn't elegant because I designed it to be. It's elegant because each constraint - rate limits, IP restrictions, venue variety, platform diversity - forced a clean separation of concerns.

And somehow, it all runs daily, for 240+ venues, without falling over* 🍿

* it sometimes falls over


Next post: Getting the Data Model Right: Movie -> Showings -> Performances
