The Scraper’s Dilemma
Testing a normal Rails app is straightforward: you control the database, you control the code. But with scrapers, the "database" is a third-party website that can change its HTML structure at 3:00 AM on a Sunday.
If you don't test your scrapers, you end up with Silent Failures: your script runs perfectly, but it starts saving nil into your database because .product-price was renamed to .price-current.
To survive, you need a Three-Tier Testing Strategy.
Tier 1: Deterministic Unit Tests (VCR)
You should never hit the live network during your standard test runs. It’s slow, it’s flaky, and it's bad etiquette. Instead, use the VCR gem to "record" a successful interaction once and replay it forever.
The Setup
First, configure VCR in your test helper:
# test/test_helper.rb
require "vcr"
require "webmock/minitest"
VCR.configure do |config|
  config.cassette_library_dir = "test/vcr_cassettes"
  config.hook_into :webmock
end
The Test
This ensures that your parsing logic is correct for a specific, known version of the HTML without needing an internet connection.
# test/services/price_scraper_test.rb
require "test_helper"
class PriceScraperTest < ActiveSupport::TestCase
  test "correctly extracts the price from the cached HTML" do
    VCR.use_cassette("amazon_product") do
      result = PriceScraper.call("https://amazon.com/item")

      assert_equal 29.99, result.price
      assert_equal "USD", result.currency
    end
  end
end
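When the target site genuinely changes, you refresh the cassette rather than the test: delete the cassette file and re-run, or temporarily switch VCR's record mode. A minimal sketch using the built-in :all record mode:

# Temporarily re-record against the live site, then revert to the default :once
VCR.use_cassette("amazon_product", record: :all) do
  PriceScraper.call("https://amazon.com/item")
end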
Tier 2: Defensive Parsing (Validation)
Don't just trust that the parser found what it was looking for. Treat the scraped data like untrusted user input. Use ActiveModel::Validations to ensure the "Contract" between the website and your app is still valid.
# app/models/scraped_product.rb
class ScrapedProduct
  include ActiveModel::Validations

  attr_accessor :name, :price

  validates :name, presence: true, length: { minimum: 5 }
  validates :price, presence: true, numericality: { greater_than: 0 }

  def initialize(attributes = {})
    @name = attributes[:name]
    @price = attributes[:price]
  end
end
In your scraper:
product = ScrapedProduct.new(name: doc.at_css('.title')&.text, price: parsed_price)
unless product.valid?
  raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
end
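A quick unit test proves the contract actually fires when a selector comes back empty. A minimal sketch (the file path is illustrative):

# test/models/scraped_product_test.rb
require "test_helper"

class ScrapedProductTest < ActiveSupport::TestCase
  test "is invalid when the name selector returned nothing" do
    product = ScrapedProduct.new(name: nil, price: 29.99)

    assert_not product.valid?
    assert product.errors[:name].any?
  end
end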
Tier 3: The "Smoke Test" (Monitoring)
VCR tests (Tier 1) only prove your code works with yesterday's HTML. They don't tell you if the website changed today.
For this, you need a Production Smoke Test. This is a small script that runs every few hours in your production environment (via Sidekiq or a Cron job). It hits the live site and checks if the key selectors still exist.
# lib/tasks/scraper_monitor.rake
namespace :scraper do
  task monitor: :environment do
    sample_url = "https://example.com/product/1"
    html = HTTP.get(sample_url).to_s
    doc = Nokogiri::HTML(html)

    required_selectors = ['.price', '.title', '.description']
    missing = required_selectors.reject { |s| doc.at_css(s).present? }

    if missing.any?
      # Send an alert to Slack, Discord, or Email
      SlackNotifier.alert("Scraper broken on #{sample_url}! Missing: #{missing.join(', ')}")
    end
  end
end
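Scheduling is up to you. As one example, assuming the whenever gem, the task could run every few hours like this (sidekiq-cron or plain cron work just as well):

# config/schedule.rb
every 3.hours do
  rake "scraper:monitor"
end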
The Golden Rule: Separate Fetching from Parsing
The biggest mistake developers make is putting the network request and the Nokogiri logic in the same method. This makes testing a nightmare.
Do this instead:
- The Fetcher: A class that just returns raw HTML (easy to mock); a minimal sketch follows below.
- The Parser: A class that takes a string of HTML and returns a Hash (easy to unit test with local files).
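The fetcher side can be almost trivially small. A minimal sketch, reusing the http gem from the rake task above (the class name and file path are illustrative):

# app/services/amazon_fetcher.rb
class AmazonFetcher
  def self.call(url)
    # Network I/O lives here and nowhere else
    HTTP.get(url).to_s
  end
end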
# Simple to test with any HTML string!
class AmazonParser
  def initialize(html)
    @doc = Nokogiri::HTML(html)
  end

  def price
    # Safe navigation and cleanup
    @doc.at_css('.price')&.text&.gsub(/[^\d.]/, '')&.to_f
  end
end
# Your test now needs NO network and NO VCR:
class AmazonParserTest < ActiveSupport::TestCase
  test "extracts price from raw string" do
    html = "<div class='price'>$29.99</div>"
    parser = AmazonParser.new(html)

    assert_equal 29.99, parser.price
  end
end
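Putting the tiers together, here is one way the PriceScraper.call from Tier 1 might compose the pieces. A sketch only: the Result struct, the parser's title method, and the hard-coded currency are assumptions, not code shown earlier:

# app/services/price_scraper.rb
class PriceScraper
  Result = Struct.new(:price, :currency, keyword_init: true)

  def self.call(url)
    html   = AmazonFetcher.call(url)   # fetching: network only
    parser = AmazonParser.new(html)    # parsing: pure string in, data out

    product = ScrapedProduct.new(name: parser.title, price: parser.price)
    unless product.valid?
      raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
    end

    Result.new(price: product.price, currency: "USD")
  end
end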
Summary: A Resilient Pipeline
- VCR: Use it for fast, offline development and CI.
- Validations: Use them to catch nil or malformed data before it hits your DB.
- Smoke Tests: Run them against the live web to detect when the target site "drifts."
Web scraping is a game of "When, not If." By building a test suite that acknowledges the instability of the web, you stop being a firefighter and start being an architect.
How do you handle site changes? Do you have an "early warning system" or do you wait for the bug reports? Let's discuss in the comments! 👇