Most Ruby scraping tutorials stop at Nokogiri. They show you how to parse a simple HTML page that isn't trying to hide its data.
In the real world, you aren't scraping static blogs. You are scraping sites protected by Cloudflare, Akamai, and DataDome. These systems look for "bot-like" behavior, TLS fingerprints, and IP reputation. If you show up with a basic `HTTP.get` request, you'll be blocked in milliseconds.
To build professional-grade scrapers in Ruby, you need to think about stealth, scale, and data integrity. Here is the enterprise playbook.
1. The Stealth Layer: Bypassing Anti-Bots
Modern anti-bot systems don't just look at your User-Agent. They look at your TLS Fingerprint (how your computer negotiates the HTTPS connection) and your Browser Fingerprint (canvas rendering, hardware concurrency).
The Solution: Ferrum + Stealth
If you must use a browser, Ferrum is the best choice because it uses the Chrome DevTools Protocol (CDP). But to stay hidden, you need to modify how that browser presents itself.
require "ferrum"
# Use a specific window size and disable automation flags
browser = Ferrum::Browser.new(
browser_options: {
"disable-blink-features": "AutomationControlled", # Hides 'navigator.webdriver'
"no-sandbox": nil
},
window_size: [1920, 1080]
)
# Randomize the User-Agent on every session
browser.headers.set("User-Agent" => UserAgentRandomizer.run)
browser.goto("https://target-site.com")
2. The Proxy Strategy: Residential vs. Datacenter
If you scrape 10,000 pages from a single IP, you will be banned. Professional scrapers use Proxy Rotation.
- Datacenter Proxies: Fast and cheap, but easily identified as "server traffic." Best for sites with low protection.
- Residential Proxies: IPs from real home internet connections. Extremely hard to block, but expensive.
The Pro Approach: Use a proxy aggregator (like Bright Data, Oxylabs, or Smartproxy) that provides a single entry point and handles the rotation and cool-down of IPs for you.
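To make this concrete, here is a minimal sketch of per-request proxy rotation with Faraday. The proxy URLs and credentials are placeholders; with an aggregator you would normally point every request at its single gateway endpoint instead of maintaining your own pool.

```ruby
require "faraday"

# Placeholder proxy endpoints -- replace with your provider's gateway or your own pool
PROXIES = [
  "http://user:pass@proxy-1.example.com:8000",
  "http://user:pass@proxy-2.example.com:8000"
].freeze

def fetch_through_proxy(url)
  conn = Faraday.new(
    url: url,
    proxy: PROXIES.sample,                     # pick a different exit IP per request
    request: { timeout: 15, open_timeout: 5 }  # don't hang on a dead exit node
  )
  conn.get
end

response = fetch_through_proxy("https://target-site.com/products")
puts response.status
```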
3. The Architecture: Sidekiq Orchestration
Professional scraping isn't a single script; it’s a distributed system. You need to handle retries, failures, and rate limits.
The Stack: Rails + Redis + Sidekiq.
```ruby
# app/sidekiq/scrape_job.rb
class ScrapeJob
  include Sidekiq::Job

  sidekiq_options retry: 5, queue: :scraping

  def perform(url)
    # 1. Fetch via a rotating proxy
    # 2. Parse with Nokogiri or Ferrum
    # 3. Store result
  rescue Net::ReadTimeout, Ferrum::TimeoutError
    # Sidekiq handles the exponential backoff for us
    raise
  end
end
```
By using Sidekiq, you can run 50 scrapers in parallel across multiple workers, dramatically increasing your throughput.
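Fanning a batch of URLs out to those workers is then just an enqueue loop. A small sketch (the URLs are placeholders, and ScrapeJob is the class defined above):

```ruby
# e.g. in a Rake task or a scheduler
urls = ["https://target-site.com/page/1", "https://target-site.com/page/2"]

# One Redis round-trip per URL...
urls.each { |url| ScrapeJob.perform_async(url) }

# ...or push them in batches with perform_bulk (Sidekiq 6.3+)
ScrapeJob.perform_bulk(urls.map { |url| [url] })
```

The actual parallelism comes from Sidekiq's concurrency setting and the number of worker processes you run, not from the enqueue side.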
4. Solving the "JS-Heavy" Problem with Internal APIs
As we’ve discussed before, the "Pro" move is to avoid the browser entirely. If the site is built with React/Vue, it’s talking to a JSON API.
Instead of rendering the whole page, reverse-engineer the API.
- Find the API call in the Network tab.
- Identify the required headers (often an `X-CSRF-Token` or a `Bearer` token).
- Simulate the request using a lightweight client like `Faraday` (sketched below).
This is 10x more reliable and 100x faster than using a headless browser.
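Here is a minimal sketch of that pattern with Faraday. The endpoint, header names, and token are hypothetical values you would copy from the Network tab:

```ruby
require "faraday"
require "json"

# Hypothetical values lifted from the browser's Network tab
API_URL   = "https://target-site.com/api/v2/products"
API_TOKEN = "copied-bearer-token"

conn = Faraday.new(url: API_URL) do |f|
  f.headers["Authorization"] = "Bearer #{API_TOKEN}"
  f.headers["Accept"]        = "application/json"
end

response = conn.get(nil, { page: 1 }) # hits the JSON endpoint directly, no rendering
products = JSON.parse(response.body)
```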
5. Data Integrity: Validation and Schema
Websites change. A professional scraper assumes the site will break.
The Pattern:
- Contract Testing: Use a tool like `dry-validation` or a JSON Schema to validate the scraped data before saving it.
- Alerting: If your scraper returns `nil` for a mandatory field (like `price`), trigger a Slack/PagerDuty alert immediately so you can fix the parser before your database is filled with garbage.
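As a sketch of the contract-testing idea, here is what a dry-validation contract for a scraped product might look like; the field names and the notify_team helper are made up for illustration:

```ruby
require "dry/validation"

# Describes what a "good" scraped record looks like
class ProductContract < Dry::Validation::Contract
  params do
    required(:title).filled(:string)
    required(:price).filled(:decimal, gt?: 0)
    optional(:sku).maybe(:string)
  end
end

result = ProductContract.new.call(title: "Widget", price: nil)

if result.failure?
  # Don't persist -- alert instead (hypothetical Slack/PagerDuty helper)
  notify_team("Scraper contract failed: #{result.errors.to_h}")
end
```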
6. The Legal & Ethical Boundary
Professional scraping requires respecting the target's infrastructure.
- Rate Limiting: Don't kill their server. Monitor the response times; if they go up, slow your scraper down (a minimal sketch follows this list).
- Respect Robots.txt: Unless you have a legal reason not to, follow the crawl-delay directives.
- User-Agent: Include a "Contact" email in your User-Agent if possible, so their admins can reach out if you're causing issues.
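The "slow down when they slow down" rule can be as simple as an adaptive delay. A rough sketch with arbitrary thresholds, reusing the fetch_through_proxy helper from the proxy sketch above (and assuming `urls` is your work list):

```ruby
# Naive adaptive throttle: back off when the target starts responding slowly
delay = 1.0 # seconds between requests

urls.each do |url|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  fetch_through_proxy(url) # result handling omitted
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  # Over ~2s per response? Double the delay (capped at 30s).
  # Healthy again? Ease back toward the 1s baseline.
  delay = elapsed > 2.0 ? [delay * 2, 30.0].min : [delay * 0.9, 1.0].max

  sleep delay
end
```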
Summary
The difference between a script and a professional scraper is determinism.
- Stealth: Mask your fingerprint.
- Proxies: Rotate residential IPs.
- Workers: Use Sidekiq for parallel, resilient jobs.
- Validation: Treat scraped data like untrusted user input.
Ruby is an incredible language for this because its ecosystem (Ferrum, Nokogiri, Sidekiq) allows you to build these complex systems with very little boilerplate.
Are you building a scraper that keeps getting blocked? Describe the behavior in the comments and let's troubleshoot! 👇