Imagine your web scraper has been running perfectly for weeks. Your logs show a "200 OK" status for every request, and your database is filling up with thousands of new rows every hour. You assume everything is fine until a data scientist mentions that the latest dashboard is empty.
Upon investigation, you find the nightmare: a website update changed a single CSS class. Your scraper didn't crash, but it spent the last 48 hours extracting null values for every product price. This is a silent failure. In the world of web scraping, these are far more dangerous than a script that simply crashes.
This guide covers how to prevent "zombie data" scenarios by using Pydantic to validate JSONL (JSON Lines) output. By building a validation layer, you ensure your data is accurate, complete, and ready for production.
Why try/except Isn't Enough
When building scrapers, we often focus on connectivity. We wrap requests in try/except blocks to handle timeouts or 404 errors. While this prevents the script from dying, it does nothing to ensure the integrity of the data being saved.
Websites are volatile. A minor UI tweak can cause "Data Drift," where your selectors still find something, but not what you expected. Common issues include:
- A price field that used to be an integer (49) becomes a string with symbols ($49.00).
- A required field like a product title becomes an empty string due to a new layout.
- A URL selector returns relative paths (/p/123) instead of absolute URLs.
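For instance, here is a minimal sketch of how this kind of failure stays invisible (assuming BeautifulSoup and an already-parsed soup object): the try/except keeps the run alive, but a renamed class means every record quietly gets None.

def extract_price(soup):
    try:
        # The site renamed its .price class, so this selector now finds nothing
        tag = soup.select_one(".price")
        return tag.get_text(strip=True) if tag else None
    except Exception:
        # Swallowing the error keeps the logs green while the data rots
        return None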
If you pipe this data directly into a production database, you risk breaking downstream applications. You need a schema that acts as a gatekeeper.
Step 1: Defining the Data Schema
Pydantic is a data validation library that enforces type hints at runtime. If the data doesn't match the schema, Pydantic raises an error immediately.
First, install Pydantic:
pip install pydantic
Suppose we are scraping Product Hunt. We want to ensure every product has a name, a valid URL, and a non-negative vote count. Here is how to define that schema:
from typing import Optional
from pydantic import BaseModel, HttpUrl, Field

class ProductModel(BaseModel):
    # Expects a non-empty string
    name: str = Field(min_length=1)
    # Built-in validation for URLs
    url: HttpUrl
    # Upvotes must be an integer and cannot be negative
    upvotes: int = Field(ge=0)
    # Tagline is optional
    tagline: Optional[str] = None
This model does more than just type-checking. Field(min_length=1) ensures that an empty string "" triggers a validation failure, while HttpUrl confirms the string is a properly formatted web address.
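To see the gatekeeper in action, here is a quick sketch with made-up values; each bad field produces its own entry in the ValidationError:

from pydantic import ValidationError

bad_record = {"name": "", "url": "/p/123", "upvotes": -3}

try:
    ProductModel.model_validate(bad_record)
except ValidationError as e:
    # Reports all three problems: empty name, relative URL, negative upvotes
    print(e)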
Step 2: Validating a JSONL Stream
Most large-scale scrapers export data in JSONL format. Unlike a standard JSON array, JSONL stores one JSON object per line. This is the industry standard for scraping because you can append new data without loading the entire file into memory.
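For reference, each line of a raw_data.jsonl file is a standalone JSON object. A hypothetical export might look like this, where the second record is exactly the kind of dirty row the validator should catch:

{"name": "Acme Analytics", "url": "https://www.producthunt.com/posts/acme-analytics", "upvotes": 412, "tagline": "Dashboards in minutes"}
{"name": "", "url": "/p/123", "upvotes": -1, "tagline": null}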
To maintain that efficiency, process the file line-by-line:
import json
from pydantic import ValidationError
def validate_jsonl_stream(input_file: str, output_file: str, error_log: str):
    valid_count = 0
    error_count = 0

    with open(input_file, 'r') as infile, \
         open(output_file, 'w') as outfile, \
         open(error_log, 'w') as errfile:

        for line_number, line in enumerate(infile, 1):
            try:
                data = json.loads(line)
                # Validate against the Pydantic model
                product = ProductModel.model_validate(data)
                # Write valid data to the clean file
                outfile.write(product.model_dump_json() + '\n')
                valid_count += 1
            except (ValidationError, json.JSONDecodeError) as e:
                # Log the specific error and the line number
                error_entry = {
                    "line": line_number,
                    "error": str(e),
                    "raw_data": line.strip()
                }
                errfile.write(json.dumps(error_entry) + '\n')
                error_count += 1

    print(f"Validation Complete: {valid_count} passed, {error_count} failed.")

validate_jsonl_stream('raw_data.jsonl', 'clean_data.jsonl', 'validation_errors.jsonl')
Wrapping model_validate() in a try/except lets the process continue even when a single line fails. Shunting "dirty" data into an error log helps you inspect why the scraper is failing without losing the valid data already collected.
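Because the error log is itself JSONL, triage is easy to script. Here is a rough sketch (assuming the validation_errors.jsonl produced above) that counts which fields show up most often in the error messages:

import json
from collections import Counter

def summarize_errors(error_log: str) -> Counter:
    # Rough heuristic: Pydantic's error text names the failing field,
    # so substring matching gives a quick per-field tally
    field_counts = Counter()
    with open(error_log, 'r') as f:
        for line in f:
            entry = json.loads(line)
            for field in ("name", "url", "upvotes", "tagline"):
                if field in entry["error"]:
                    field_counts[field] += 1
    return field_counts

print(summarize_errors('validation_errors.jsonl'))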
Step 3: Implementing Business Logic Validators
Basic type checking is a good start, but real-world data quality often requires custom rules. For instance, you might want to ensure your Product Hunt scraper hasn't accidentally captured an external ad link.
Pydantic's @field_validator decorator enforces these specific rules:
from pydantic import field_validator

class ProductModel(BaseModel):
    name: str = Field(min_length=1)
    url: HttpUrl
    upvotes: int = Field(ge=0)

    @field_validator('url')
    @classmethod
    def must_be_product_hunt_domain(cls, v: HttpUrl) -> HttpUrl:
        # Ensure we aren't scraping external ads
        if 'producthunt.com' not in (v.host or ''):
            raise ValueError('URL must be a Product Hunt domain')
        return v

    @field_validator('name')
    @classmethod
    def prevent_placeholder_titles(cls, v: str) -> str:
        # Catch common scraping errors where placeholders are captured
        if v.lower() in ["loading...", "n/a", "none"]:
            raise ValueError('Invalid placeholder title detected')
        return v
These validators turn your schema into a diagnostic tool. If the website's HTML changes and your scraper starts picking up "Sponsored" links instead of products, the must_be_product_hunt_domain validator will catch it.
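For instance, validating a hypothetical sponsored record now fails loudly instead of slipping through:

from pydantic import ValidationError

sponsored = {
    "name": "Great Product",
    "url": "https://ads.example.com/click?id=42",
    "upvotes": 10,
}

try:
    ProductModel.model_validate(sponsored)
except ValidationError as e:
    # The custom validator rejects the external host
    print(e)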
Step 4: Automating Alerts
Validation is only useful if you act on the results. In a production environment, you shouldn't have to manually check error logs.
You can implement threshold logic: if the percentage of failed records exceeds a limit (like 5%), trigger an alert.
def check_threshold(valid_count, error_count, threshold=0.05):
    total = valid_count + error_count
    if total == 0:
        return
    failure_rate = error_count / total
    if failure_rate > threshold:
        send_alert(f"Scraper Alert: Failure rate is {failure_rate:.2%}. Check selectors!")

def send_alert(message):
    # Connect this to Slack, Email, or PagerDuty
    print(f"ALARM: {message}")
This approach allows for "acceptable noise." If 1 out of 1,000 pages has a unique layout that breaks your logic, it might not require immediate attention. If 500 fail, you know the site has changed and the scraper needs an update.
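As one possible wiring, send_alert could post to a Slack incoming webhook; this sketch assumes a SLACK_WEBHOOK_URL environment variable and the requests library:

import os
import requests

def send_alert(message: str) -> None:
    # Hypothetical setup: the webhook URL lives in an environment variable
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        # Fall back to a console alarm if no webhook is configured
        print(f"ALARM: {message}")
        return
    requests.post(webhook_url, json={"text": message}, timeout=10)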
To Wrap Up
Data validation separates hobbyist scripts from professional data pipelines. Using Pydantic to validate JSONL output provides:
- Immediate Detection: Catch silent failures the moment they happen.
- Clear Documentation: Pydantic models serve as a living reference for your data structure.
- Clean Downstream Pipelines: Ensure your database and analytics tools only receive verified data.
Consider integrating these Pydantic checks into your CI/CD pipeline or as a final step in your orchestration workflows. The small upfront cost of defining a schema saves hours of cleaning corrupted data later.
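One way to do that is a thin wrapper that exits non-zero when the failure rate crosses the threshold, so the CI job fails visibly; this sketch assumes validate_jsonl_stream is adjusted to return its counts:

import sys

def main() -> int:
    # Assumes validate_jsonl_stream returns (valid_count, error_count)
    valid_count, error_count = validate_jsonl_stream(
        'raw_data.jsonl', 'clean_data.jsonl', 'validation_errors.jsonl'
    )
    total = valid_count + error_count
    failure_rate = error_count / total if total else 1.0
    # Fail the pipeline step if more than 5% of records are invalid
    return 1 if failure_rate > 0.05 else 0

if __name__ == "__main__":
    sys.exit(main())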