Imagine your web scraper has been running perfectly for weeks. Your logs show a "200 OK" status for every request, and your database is filling up with thousands of new rows every hour. You assume everything is fine until a data scientist mentions that the latest dashboard is empty.
Upon investigation, you find the nightmare: a website update changed a single CSS class. Your scraper didn't crash, but it spent the last 48 hours extracting null values for every product price. This is a silent failure. In the world of web scraping, these are far more dangerous than a script that simply crashes.
This guide covers how to prevent "zombie data" scenarios by using Pydantic to validate JSONL (JSON Lines) output. By building a validation layer, you ensure your data is accurate, complete, and ready for production.
Why try/except Isn't Enough
When building scrapers, we often focus on connectivity. We wrap requests in try/except blocks to handle timeouts or 404 errors. While this prevents the script from dying, it does nothing to ensure the integrity of the data being saved.
Websites are volatile. A minor UI tweak can cause "Data Drift," where your selectors still find something, but not what you expected. Common issues include:
- A price field that used to be an integer (49) becomes a string with symbols ($49.00).
- A required field like a product title becomes an empty string due to a new layout.
- A URL selector returns relative paths (/p/123) instead of absolute URLs.
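For instance, here is a minimal sketch of how this kind of failure stays invisible (assuming BeautifulSoup and an already-parsed soup object): the try/except keeps the run alive, but a renamed class means every record quietly gets None.

def extract_price(soup):
    try:
        # The site renamed its .price class, so this selector now finds nothing
        tag = soup.select_one(".price")
        return tag.get_text(strip=True) if tag else None
    except Exception:
        # Swallowing the error keeps the logs green while the data rots
        return None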
If you pipe this data directly into a production database, you risk breaking downstream applications. You need a schema that acts as a gatekeeper.
Step 1: Defining the Data Schema
Pydantic is a data validation library that enforces type hints at runtime. If the data doesn't match the schema, Pydantic raises an error immediately.
First, install Pydantic:
pip install pydantic
Suppose we are scraping Product Hunt. We want to ensure every product has a name, a valid URL, and a non-negative vote count. Here is how to define that schema:
from typing import Optional
from pydantic import BaseModel, HttpUrl, Field

class ProductModel(BaseModel):
    # Expects a non-empty string
    name: str = Field(min_length=1)
    # Built-in validation for URLs
    url: HttpUrl
    # Upvotes must be an integer and cannot be negative
    upvotes: int = Field(ge=0)
    # Tagline is optional
    tagline: Optional[str] = None
This model does more than just type-checking. Field(min_length=1) ensures that an empty string "" triggers a validation failure, while HttpUrl confirms the string is a properly formatted web address.
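To see the gatekeeper in action, here is a quick sketch with made-up values; each bad field produces its own entry in the ValidationError:

from pydantic import ValidationError

bad_record = {"name": "", "url": "/p/123", "upvotes": -3}

try:
    ProductModel.model_validate(bad_record)
except ValidationError as e:
    # Reports all three problems: empty name, relative URL, negative upvotes
    print(e)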
Step 2: Validating a JSONL Stream
Most large-scale scrapers export data in JSONL format. Unlike a standard JSON array, JSONL stores one JSON object per line. This is the industry standard for scraping because you can append new data without loading the entire file into memory.
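For reference, each line of a raw_data.jsonl file is a standalone JSON object. A hypothetical export might look like this, where the second record is exactly the kind of dirty row the validator should catch:

{"name": "Acme Analytics", "url": "https://www.producthunt.com/posts/acme-analytics", "upvotes": 412, "tagline": "Dashboards in minutes"}
{"name": "", "url": "/p/123", "upvotes": -1, "tagline": null}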
To maintain that efficiency, process the file line-by-line:
import json
from pydantic import ValidationError
def validate_jsonl_stream(input_file: str, output_file: str, error_log: str):
    valid_count = 0
    error_count = 0

    with open(input_file, 'r') as infile, \
         open(output_file, 'w') as outfile, \
         open(error_log, 'w') as errfile:

        for line_number, line in enumerate(infile, 1):
            try:
                data = json.loads(line)
                # Validate against the Pydantic model
                product = ProductModel.model_validate(data)
                # Write valid data to the clean file
                outfile.write(product.model_dump_json() + '\n')
                valid_count += 1
            except (ValidationError, json.JSONDecodeError) as e:
                # Log the specific error and the line number
                error_entry = {
                    "line": line_number,
                    "error": str(e),
                    "raw_data": line.strip()
                }
                errfile.write(json.dumps(error_entry) + '\n')
                error_count += 1

    print(f"Validation Complete: {valid_count} passed, {error_count} failed.")

validate_jsonl_stream('raw_data.jsonl', 'clean_data.jsonl', 'validation_errors.jsonl')
Wrapping model_validate() in a try/except lets the process continue even when a single line fails. Shunting "dirty" data into an error log helps you inspect why the scraper is failing without losing the valid data already collected.
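Because the error log is itself JSONL, triage is easy to script. Here is a rough sketch (assuming the validation_errors.jsonl produced above) that counts which fields show up most often in the error messages:

import json
from collections import Counter

def summarize_errors(error_log: str) -> Counter:
    # Rough heuristic: Pydantic's error text names the failing field,
    # so substring matching gives a quick per-field tally
    field_counts = Counter()
    with open(error_log, 'r') as f:
        for line in f:
            entry = json.loads(line)
            for field in ("name", "url", "upvotes", "tagline"):
                if field in entry["error"]:
                    field_counts[field] += 1
    return field_counts

print(summarize_errors('validation_errors.jsonl'))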
Step 3: Implementing Business Logic Validators
Basic type checking is a good start, but real-world data quality often requires custom rules. For instance, you might want to ensure your Product Hunt scraper hasn't accidentally captured an external ad link.
Pydantic's @field_validator decorator enforces these specific rules:
from pydantic import field_validator

class ProductModel(BaseModel):
    name: str = Field(min_length=1)
    url: HttpUrl
    upvotes: int = Field(ge=0)

    @field_validator('url')
    @classmethod
    def must_be_product_hunt_domain(cls, v: HttpUrl) -> HttpUrl:
        # Ensure we aren't scraping external ads
        if 'producthunt.com' not in (v.host or ''):
            raise ValueError('URL must be a Product Hunt domain')
        return v

    @field_validator('name')
    @classmethod
    def prevent_placeholder_titles(cls, v: str) -> str:
        # Catch common scraping errors where placeholders are captured
        if v.lower() in ["loading...", "n/a", "none"]:
            raise ValueError('Invalid placeholder title detected')
        return v
These validators turn your schema into a diagnostic tool. If the website's HTML changes and your scraper starts picking up "Sponsored" links instead of products, the must_be_product_hunt_domain validator will catch it.
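For instance, validating a hypothetical sponsored record now fails loudly instead of slipping through:

from pydantic import ValidationError

sponsored = {
    "name": "Great Product",
    "url": "https://ads.example.com/click?id=42",
    "upvotes": 10,
}

try:
    ProductModel.model_validate(sponsored)
except ValidationError as e:
    # The custom validator rejects the external host
    print(e)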
Step 4: Automating Alerts
Validation is only useful if you act on the results. In a production environment, you shouldn't have to manually check error logs.
You can implement threshold logic: if the percentage of failed records exceeds a limit (like 5%), trigger an alert.
def check_threshold(valid_count, error_count, threshold=0.05):
    total = valid_count + error_count
    if total == 0:
        return
    failure_rate = error_count / total
    if failure_rate > threshold:
        send_alert(f"Scraper Alert: Failure rate is {failure_rate:.2%}. Check selectors!")

def send_alert(message):
    # Connect this to Slack, Email, or PagerDuty
    print(f"ALARM: {message}")
This approach allows for "acceptable noise." If 1 out of 1,000 pages has a unique layout that breaks your logic, it might not require immediate attention. If 500 fail, you know the site has changed and the scraper needs an update.
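As one possible wiring, send_alert could post to a Slack incoming webhook; this sketch assumes a SLACK_WEBHOOK_URL environment variable and the requests library:

import os
import requests

def send_alert(message: str) -> None:
    # Hypothetical setup: the webhook URL lives in an environment variable
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        # Fall back to a console alarm if no webhook is configured
        print(f"ALARM: {message}")
        return
    requests.post(webhook_url, json={"text": message}, timeout=10)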
To Wrap Up
Data validation separates hobbyist scripts from professional data pipelines. Using Pydantic to validate JSONL output provides:
- Immediate Detection: Catch silent failures the moment they happen.
- Clear Documentation: Pydantic models serve as a living reference for your data structure.
- Clean Downstream Pipelines: Ensure your database and analytics tools only receive verified data.
Consider integrating these Pydantic checks into your CI/CD pipeline or as a final step in your orchestration workflows. The small upfront cost of defining a schema saves hours of cleaning corrupted data later.
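One way to do that is a thin wrapper that exits non-zero when the failure rate crosses the threshold, so the CI job fails visibly; this sketch assumes validate_jsonl_stream is adjusted to return its counts:

import sys

def main() -> int:
    # Assumes validate_jsonl_stream returns (valid_count, error_count)
    valid_count, error_count = validate_jsonl_stream(
        'raw_data.jsonl', 'clean_data.jsonl', 'validation_errors.jsonl'
    )
    total = valid_count + error_count
    failure_rate = error_count / total if total else 1.0
    # Fail the pipeline step if more than 5% of records are invalid
    return 1 if failure_rate > 0.05 else 0

if __name__ == "__main__":
    sys.exit(main())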