The Scraper’s Dilemma
Testing a normal Rails app is straightforward: you control the database, you control the code. But with scrapers, the "database" is a third-party website that can change its HTML structure at 3:00 AM on a Sunday.
If you don't test your scrapers, you end up with Silent Failures: your script runs perfectly, but it starts saving nil into your database because .product-price was renamed to .price-current.
To survive, you need a Three-Tier Testing Strategy.
Tier 1: Deterministic Unit Tests (VCR)
You should never hit the live network during your standard test runs. It’s slow, it’s flaky, and it's bad etiquette. Instead, use the VCR gem to "record" a successful interaction once and replay it forever.
The Setup
First, configure VCR in your test helper:
# test/test_helper.rb
require "vcr"
require "webmock/minitest"
VCR.configure do |config|
  config.cassette_library_dir = "test/vcr_cassettes"
  config.hook_into :webmock
end
The Test
This ensures that your parsing logic is correct for a specific, known version of the HTML without needing an internet connection.
# test/services/price_scraper_test.rb
require "test_helper"
class PriceScraperTest < ActiveSupport::TestCase
  test "correctly extracts the price from the cached HTML" do
    VCR.use_cassette("amazon_product") do
      result = PriceScraper.call("https://amazon.com/item")

      assert_equal 29.99, result.price
      assert_equal "USD", result.currency
    end
  end
end
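When the target site genuinely changes, you refresh the cassette rather than the test: delete the cassette file and re-run, or temporarily switch VCR's record mode. A minimal sketch using the built-in :all record mode:

# Temporarily re-record against the live site, then revert to the default :once
VCR.use_cassette("amazon_product", record: :all) do
  PriceScraper.call("https://amazon.com/item")
end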
Tier 2: Defensive Parsing (Validation)
Don't just trust that the parser found what it was looking for. Treat the scraped data like untrusted user input. Use ActiveModel::Validations to ensure the "Contract" between the website and your app is still valid.
# app/models/scraped_product.rb
class ScrapedProduct
  include ActiveModel::Validations

  attr_accessor :name, :price

  validates :name, presence: true, length: { minimum: 5 }
  validates :price, presence: true, numericality: { greater_than: 0 }

  def initialize(attributes = {})
    @name = attributes[:name]
    @price = attributes[:price]
  end
end
In your scraper:
product = ScrapedProduct.new(name: doc.at_css('.title')&.text, price: parsed_price)
unless product.valid?
  raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
end
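A quick unit test proves the contract actually fires when a selector comes back empty. A minimal sketch (the file path is illustrative):

# test/models/scraped_product_test.rb
require "test_helper"

class ScrapedProductTest < ActiveSupport::TestCase
  test "is invalid when the name selector returned nothing" do
    product = ScrapedProduct.new(name: nil, price: 29.99)

    assert_not product.valid?
    assert product.errors[:name].any?
  end
end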
Tier 3: The "Smoke Test" (Monitoring)
VCR tests (Tier 1) only prove your code works with yesterday's HTML. They don't tell you if the website changed today.
For this, you need a Production Smoke Test. This is a small script that runs every few hours in your production environment (via Sidekiq or a Cron job). It hits the live site and checks if the key selectors still exist.
# lib/tasks/scraper_monitor.rake
namespace :scraper do
  task monitor: :environment do
    sample_url = "https://example.com/product/1"
    html = HTTP.get(sample_url).to_s
    doc = Nokogiri::HTML(html)

    required_selectors = ['.price', '.title', '.description']
    missing = required_selectors.reject { |s| doc.at_css(s).present? }

    if missing.any?
      # Send an alert to Slack, Discord, or Email
      SlackNotifier.alert("Scraper broken on #{sample_url}! Missing: #{missing.join(', ')}")
    end
  end
end
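Scheduling is up to you. As one example, assuming the whenever gem, the task could run every few hours like this (sidekiq-cron or plain cron work just as well):

# config/schedule.rb
every 3.hours do
  rake "scraper:monitor"
end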
The Golden Rule: Separate Fetching from Parsing
The biggest mistake developers make is putting the network request and the Nokogiri logic in the same method. This makes testing a nightmare.
Do this instead:
- The Fetcher: A class that just returns raw HTML (easy to mock); a minimal sketch follows below.
- The Parser: A class that takes a string of HTML and returns a Hash (easy to unit test with local files).
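The fetcher side can be almost trivially small. A minimal sketch, reusing the http gem from the rake task above (the class name and file path are illustrative):

# app/services/amazon_fetcher.rb
class AmazonFetcher
  def self.call(url)
    # Network I/O lives here and nowhere else
    HTTP.get(url).to_s
  end
end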
# Simple to test with any HTML string!
class AmazonParser
  def initialize(html)
    @doc = Nokogiri::HTML(html)
  end

  def price
    # Safe navigation and cleanup
    @doc.at_css('.price')&.text&.gsub(/[^\d.]/, '')&.to_f
  end
end
# Your test now needs NO network and NO VCR:
class AmazonParserTest < ActiveSupport::TestCase
  test "extracts price from raw string" do
    html = "<div class='price'>$29.99</div>"
    parser = AmazonParser.new(html)

    assert_equal 29.99, parser.price
  end
end
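Putting the tiers together, here is one way the PriceScraper.call from Tier 1 might compose the pieces. A sketch only: the Result struct, the parser's title method, and the hard-coded currency are assumptions, not code shown earlier:

# app/services/price_scraper.rb
class PriceScraper
  Result = Struct.new(:price, :currency, keyword_init: true)

  def self.call(url)
    html   = AmazonFetcher.call(url)   # fetching: network only
    parser = AmazonParser.new(html)    # parsing: pure string in, data out

    product = ScrapedProduct.new(name: parser.title, price: parser.price)
    unless product.valid?
      raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
    end

    Result.new(price: product.price, currency: "USD")
  end
end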
Summary: A Resilient Pipeline
- VCR: Use it for fast, offline development and CI.
- Validations: Use them to catch nil or malformed data before it hits your DB.
- Smoke Tests: Run them against the live web to detect when the target site "drifts."
Web scraping is a game of "When, not If." By building a test suite that acknowledges the instability of the web, you stop being a firefighter and start being an architect.
How do you handle site changes? Do you have an "early warning system" or do you wait for the bug reports? Let's discuss in the comments! 👇