The Missing Piece
In the Ruby scraping ecosystem, we have excellent low-level tools.
- Nokogiri: Great for parsing HTML.
- Ferrum: Great for controlling a headless Chrome browser via CDP.
- HTTP: Great for making requests.
But what if you need to build a crawler?
A crawler isn't just a script that visits one page. It's a system that visits a page, extracts data, finds the "Next Page" link, adds it to a queue, manages concurrency, and exports the data.
For years, Python developers laughed at us because they had Scrapy.
Rubyists had Kimurai, but it has largely gone unmaintained.
Enter Vessel.
Built by the same team behind Ferrum, Vessel is the modern, high-level web crawling framework Ruby has been waiting for.
Why Vessel is a Game Changer
Vessel is built on top of Ferrum. This means:
- It handles JavaScript natively: It drives a real Chrome browser. React, Vue, and Angular sites are no problem.
- It manages the Pool: You don't have to manually spawn threads or manage browser contexts. Vessel handles the concurrency for you.
- It is structured: It forces you to organize your code into "Spiders" (called Cargo in Vessel), making it scalable.
Step 1: The Setup
Vessel requires a Chrome or Chromium installation on your machine (which you likely already have).
gem install vessel
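If your crawler lives inside a larger project, you can pull it in with Bundler instead. A minimal Gemfile sketch:
# Gemfile
source "https://rubygems.org"

gem "vessel"
Then run bundle install and require "vessel" as shown below.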
Step 2: Your First Spider
Let's scrape a hypothetical quotes website. We create a class that inherits from Vessel::Cargo.
require "vessel"
class QuotesSpider < Vessel::Cargo
domain "quotes.toscrape.com"
start_urls "https://quotes.toscrape.com/js/"
# Configuration (Optional)
headers "User-Agent" => "Vessel-Bot/1.0"
threads 4 # Run 4 browser tabs in parallel
delay 1 # Wait 1 second between requests
def parse
# 'page' is a Ferrum::Page object
page.css("div.quote").each do |quote|
# Extract data and yield a Hash
yield({
text: quote.at_css("span.text").text,
author: quote.at_css("small.author").text
})
end
# Handle Pagination (The "Crawling" part)
next_link = page.at_css("li.next > a")
if next_link
# Tell Vessel to visit this URL next
yield request(url: next_link.attribute("href"))
end
end
end
Step 3: Running It
You can run this directly from a Ruby script. Vessel gives you built-in export options.
# Run the spider and print results
QuotesSpider.run do |item|
puts "Got quote: #{item[:text]}"
end
# OR Export directly to JSON
# vessel run quotes_spider.rb --export quotes.json
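The block form above also makes manual exports trivial. Here's a minimal sketch that streams every yielded item into a CSV file using Ruby's standard library (the :text and :author keys come from the spider in Step 2; the filename is just an example):
require "csv"

# Write the spider's output straight into quotes.csv as it arrives
CSV.open("quotes.csv", "w", write_headers: true, headers: %w[text author]) do |csv|
  QuotesSpider.run do |item|
    csv << [item[:text], item[:author]]
  end
end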
The Magic Under the Hood
1. The Browser Pool
When you set threads 4, Vessel boots up a Chromium instance and opens 4 contexts (tabs). It distributes the URLs in your queue across these tabs automatically. If a page crashes a tab, Vessel handles the restart.
2. Middleware
Just like Rails, Vessel supports middleware. You can intercept requests before they happen.
- Want to rotate User-Agents? Use Middleware.
- Want to inject authentication cookies? Use Middleware.
- Want to block image requests to save bandwidth? Middleware.
# inside your class
middleware do |browser, request|
  request.abort if request.resource_type == :image
end
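The same hook extends naturally to other resource types. Here's a sketch building on the example above; it assumes fonts and stylesheets are reported as :font and :stylesheet, so check what resource_type actually returns in your setup:
# inside your class -- assumes these resource type symbols
middleware do |browser, request|
  request.abort if %i[image font stylesheet].include?(request.resource_type)
end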
3. Ferrum API Access
Because page is just a Ferrum object, you can do anything a browser can do inside your parse method:
- page.screenshot(path: "error.png")
- page.mouse.click(x: x, y: y)
- page.keyboard.type("hello")
- page.network.wait_for_idle
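For example, here is a sketch of a parse method that waits for late XHR traffic to settle before querying the DOM and grabs a screenshot when something in the browser layer fails. The rescue/screenshot pattern is just one debugging approach, not a built-in Vessel feature:
def parse
  # Let JS-driven requests finish before reading the DOM
  page.network.wait_for_idle

  page.css("div.quote").each do |quote|
    yield({ text: quote.at_css("span.text").text })
  end
rescue Ferrum::Error
  # Keep a snapshot of what the browser saw when it failed
  page.screenshot(path: "error.png")
  raise
end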
Vessel vs. Kimurai
If you used Kimurai in the past, Vessel feels very similar. The main difference is the engine.
- Kimurai: Supported Selenium and Mechanize.
- Vessel: Focused entirely on Ferrum (CDP).
This focus makes Vessel significantly faster and more stable than the old Selenium-based crawlers because it communicates directly with the browser engine, skipping the WebDriver overhead.
Summary
If you are scraping a single page, ferrum or httpx is fine.
But if you are building a data pipeline that needs to crawl thousands of pages, handle retries, and manage concurrency, Vessel is the tool you should be using.
It brings the structure of Python's Scrapy to the elegance of Ruby.
Have you tried Vessel yet? How does it compare to your custom scraping scripts? Let me know in the comments!