Most Ruby scraping tutorials stop at Nokogiri. They show you how to parse a simple HTML page that isn't trying to hide its data.
In the real world, you aren't scraping static blogs. You are scraping sites protected by Cloudflare, Akamai, and DataDome. These systems look for "bot-like" behavior, TLS fingerprints, and IP reputation. If you show up with a basic `HTTP.get` request, you'll be blocked in milliseconds.
To build professional-grade scrapers in Ruby, you need to think about stealth, scale, and data integrity. Here is the enterprise playbook.
1. The Stealth Layer: Bypassing Anti-Bots
Modern anti-bot systems don't just look at your User-Agent. They look at your TLS Fingerprint (how your computer negotiates the HTTPS connection) and your Browser Fingerprint (canvas rendering, hardware concurrency).
The Solution: Ferrum + Stealth
If you must use a browser, Ferrum is the best choice because it uses the Chrome DevTools Protocol (CDP). But to stay hidden, you need to modify how that browser presents itself.
require "ferrum"
# Use a specific window size and disable automation flags
browser = Ferrum::Browser.new(
browser_options: {
"disable-blink-features": "AutomationControlled", # Hides 'navigator.webdriver'
"no-sandbox": nil
},
window_size: [1920, 1080]
)
# Randomize the User-Agent on every session
browser.headers.set("User-Agent" => UserAgentRandomizer.run)
browser.goto("https://target-site.com")
2. The Proxy Strategy: Residential vs. Datacenter
If you scrape 10,000 pages from a single IP, you will be banned. Professional scrapers use Proxy Rotation.
- Datacenter Proxies: Fast and cheap, but easily identified as "server traffic." Best for sites with low protection.
- Residential Proxies: IPs from real home internet connections. Extremely hard to block, but expensive.
The Pro Approach: Use a proxy aggregator (like Bright Data, Oxylabs, or Smartproxy) that provides a single entry point and handles the rotation and cool-down of IPs for you.
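To make this concrete, here is a minimal sketch of per-request proxy rotation with Faraday. The proxy URLs and credentials are placeholders; with an aggregator you would normally point every request at its single gateway endpoint instead of maintaining your own pool.

```ruby
require "faraday"

# Placeholder proxy endpoints -- replace with your provider's gateway or your own pool
PROXIES = [
  "http://user:pass@proxy-1.example.com:8000",
  "http://user:pass@proxy-2.example.com:8000"
].freeze

def fetch_through_proxy(url)
  conn = Faraday.new(
    url: url,
    proxy: PROXIES.sample,                     # pick a different exit IP per request
    request: { timeout: 15, open_timeout: 5 }  # don't hang on a dead exit node
  )
  conn.get
end

response = fetch_through_proxy("https://target-site.com/products")
puts response.status
```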
3. The Architecture: Sidekiq Orchestration
Professional scraping isn't a single script; it’s a distributed system. You need to handle retries, failures, and rate limits.
The Stack: Rails + Redis + Sidekiq.
```ruby
# app/sidekiq/scrape_job.rb
class ScrapeJob
  include Sidekiq::Job

  sidekiq_options retry: 5, queue: :scraping

  def perform(url)
    # 1. Fetch via a rotating proxy
    # 2. Parse with Nokogiri or Ferrum
    # 3. Store result
  rescue Net::ReadTimeout, Ferrum::TimeoutError
    # Sidekiq handles the exponential backoff for us
    raise
  end
end
```
By using Sidekiq, you can run 50 scrapers in parallel across multiple workers, dramatically increasing your throughput.
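Fanning a batch of URLs out to those workers is then just an enqueue loop. A small sketch (the URLs are placeholders, and ScrapeJob is the class defined above):

```ruby
# e.g. in a Rake task or a scheduler
urls = ["https://target-site.com/page/1", "https://target-site.com/page/2"]

# One Redis round-trip per URL...
urls.each { |url| ScrapeJob.perform_async(url) }

# ...or push them in batches with perform_bulk (Sidekiq 6.3+)
ScrapeJob.perform_bulk(urls.map { |url| [url] })
```

The actual parallelism comes from Sidekiq's concurrency setting and the number of worker processes you run, not from the enqueue side.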
4. Solving the "JS-Heavy" Problem with Internal APIs
As we’ve discussed before, the "Pro" move is to avoid the browser entirely. If the site is built with React/Vue, it’s talking to a JSON API.
Instead of rendering the whole page, reverse-engineer the API.
- Find the API call in the Network tab.
- Identify the required headers (often an `X-CSRF-Token` or a `Bearer` token).
- Simulate the request using a lightweight client like `Faraday` (sketched below).
This is 10x more reliable and 100x faster than using a headless browser.
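Here is a minimal sketch of that pattern with Faraday. The endpoint, header names, and token are hypothetical values you would copy from the Network tab:

```ruby
require "faraday"
require "json"

# Hypothetical values lifted from the browser's Network tab
API_URL   = "https://target-site.com/api/v2/products"
API_TOKEN = "copied-bearer-token"

conn = Faraday.new(url: API_URL) do |f|
  f.headers["Authorization"] = "Bearer #{API_TOKEN}"
  f.headers["Accept"]        = "application/json"
end

response = conn.get(nil, { page: 1 }) # hits the JSON endpoint directly, no rendering
products = JSON.parse(response.body)
```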
5. Data Integrity: Validation and Schema
Websites change. A professional scraper assumes the site will break.
The Pattern:
- Contract Testing: Use a tool like `dry-validation` or a JSON Schema to validate the scraped data before saving it.
- Alerting: If your scraper returns `nil` for a mandatory field (like `price`), trigger a Slack/PagerDuty alert immediately so you can fix the parser before your database is filled with garbage.
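As a sketch of the contract-testing idea, here is what a dry-validation contract for a scraped product might look like; the field names and the notify_team helper are made up for illustration:

```ruby
require "dry/validation"

# Describes what a "good" scraped record looks like
class ProductContract < Dry::Validation::Contract
  params do
    required(:title).filled(:string)
    required(:price).filled(:decimal, gt?: 0)
    optional(:sku).maybe(:string)
  end
end

result = ProductContract.new.call(title: "Widget", price: nil)

if result.failure?
  # Don't persist -- alert instead (hypothetical Slack/PagerDuty helper)
  notify_team("Scraper contract failed: #{result.errors.to_h}")
end
```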
6. The Legal & Ethical Boundary
Professional scraping requires respecting the target's infrastructure.
- Rate Limiting: Don't kill their server. Monitor the response times; if they go up, slow your scraper down (a minimal sketch follows this list).
- Respect Robots.txt: Unless you have a legal reason not to, follow the crawl-delay directives.
- User-Agent: Include a "Contact" email in your User-Agent if possible, so their admins can reach out if you're causing issues.
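The "slow down when they slow down" rule can be as simple as an adaptive delay. A rough sketch with arbitrary thresholds, reusing the fetch_through_proxy helper from the proxy sketch above (and assuming `urls` is your work list):

```ruby
# Naive adaptive throttle: back off when the target starts responding slowly
delay = 1.0 # seconds between requests

urls.each do |url|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  fetch_through_proxy(url) # result handling omitted
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  # Over ~2s per response? Double the delay (capped at 30s).
  # Healthy again? Ease back toward the 1s baseline.
  delay = elapsed > 2.0 ? [delay * 2, 30.0].min : [delay * 0.9, 1.0].max

  sleep delay
end
```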
Summary
The difference between a script and a professional scraper is determinism.
- Stealth: Mask your fingerprint.
- Proxies: Rotate residential IPs.
- Workers: Use Sidekiq for parallel, resilient jobs.
- Validation: Treat scraped data like untrusted user input.
Ruby is an incredible language for this because its ecosystem (Ferrum, Nokogiri, Sidekiq) allows you to build these complex systems with very little boilerplate.
Are you building a scraper that keeps getting blocked? Describe the behavior in the comments and let's troubleshoot! 👇