DEV Community

Zil Norvilis


Meet Vessel: The "Scrapy" of the Ruby World

The Missing Piece

In the Ruby scraping ecosystem, we have excellent low-level tools.

  • Nokogiri: Great for parsing HTML.
  • Ferrum: Great for controlling a headless Chrome browser via the Chrome DevTools Protocol (CDP).
  • HTTP: Great for making requests.

But what if you need to build a Crawler?
A crawler isn't just a script that visits one page. It’s a system that visits a page, extracts data, finds the "Next Page" link, adds it to a queue, manages concurrency, and exports the data.
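
To make that loop concrete, here is a minimal sketch in plain Ruby. It uses no scraping library at all: `FAKE_SITE` and `crawl` are hypothetical names, and the "site" is an in-memory hash standing in for real pages. It only illustrates the visit, extract, enqueue cycle, not how Vessel implements it.

```ruby
require "set"

# Hypothetical in-memory "site": each URL maps to the items found on that
# page and the URL of the next page (nil on the last page).
FAKE_SITE = {
  "/page/1" => { items: ["quote A", "quote B"], next: "/page/2" },
  "/page/2" => { items: ["quote C"],            next: "/page/3" },
  "/page/3" => { items: ["quote D"],            next: nil }
}.freeze

def crawl(start_url, site)
  queue   = [start_url] # URLs waiting to be visited
  seen    = Set.new     # URLs already visited, so link cycles do not loop forever
  results = []

  until queue.empty?
    url = queue.shift
    next if seen.include?(url)
    seen << url

    page = site.fetch(url)              # "visit" the page
    results.concat(page[:items])        # extract data
    queue << page[:next] if page[:next] # schedule the "Next Page" link
  end

  results
end

crawl_results = crawl("/page/1", FAKE_SITE)
puts crawl_results.inspect
```

A framework earns its keep by adding what this sketch leaves out: real network access, concurrency, retries, and export.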

For years, Python developers laughed at us because they had Scrapy.
Rubyists had Kimurai, but it has largely gone unmaintained.

Enter Vessel.
Built by the same team behind Ferrum, Vessel is the modern, high-level web crawling framework Ruby has been waiting for.


Why Vessel is a Game Changer

Vessel is built on top of Ferrum. This means:

  1. It handles JavaScript natively: It drives a real Chrome browser. React, Vue, and Angular sites are no problem.
  2. It manages the Pool: You don't have to manually spawn threads or manage browser contexts. Vessel handles the concurrency for you.
  3. It is structured: It forces you to organize your code into "Spiders" (called Cargo in Vessel), making it scalable.

Step 1: The Setup

Vessel requires a standard Chrome or Chromium installation on your machine (which you likely already have).

gem install vessel

Step 2: Your First Spider

Let's scrape a hypothetical quotes website. We create a class that inherits from Vessel::Cargo.

require "vessel"

class QuotesSpider < Vessel::Cargo
  domain "quotes.toscrape.com"
  start_urls "https://quotes.toscrape.com/js/"

  # Configuration (Optional)
  headers "User-Agent" => "Vessel-Bot/1.0"
  threads 4  # Run 4 browser tabs in parallel
  delay 1    # Wait 1 second between requests

  def parse
    # 'page' is a Ferrum::Page object
    page.css("div.quote").each do |quote|
      # Extract data and yield a Hash
      yield({
        text: quote.at_css("span.text").text,
        author: quote.at_css("small.author").text
      })
    end

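    # Handle Pagination (The "Crawling" part)
    next_link = page.at_css("li.next > a")
    if next_link
      # The href is usually relative (e.g. "/js/page/2/"), so resolve it
      # against the current page before telling Vessel to visit it next
      yield request(url: absolute_url(next_link.attribute("href")))
    end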
  end
end
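
One detail worth checking when you handle pagination: the href on a "Next" link is often relative. The sketch below uses only Ruby's standard `uri` library to show how a relative href resolves against the page it was found on. `resolve_link` is a hypothetical helper, and the href values are assumed for illustration.

```ruby
require "uri"

# Resolve an href found on a page against the URL of that page.
def resolve_link(page_url, href)
  URI.join(page_url, href).to_s
end

# A root-relative and a document-relative href, both assumed examples.
puts resolve_link("https://quotes.toscrape.com/js/", "/js/page/2/")
puts resolve_link("https://quotes.toscrape.com/js/", "page/2/")
```

Both calls produce the same absolute URL, which is the form a crawler's queue should hold.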

Step 3: Running It

You can run this directly from a Ruby script. Vessel gives you built-in export options.

# Run the spider and print results
QuotesSpider.run do |item|
  puts "Got quote: #{item[:text]}"
end

# OR Export directly to JSON
# vessel run quotes_spider.rb --export quotes.json
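
If you would rather control the export yourself, the block form is enough. Below is a minimal sketch of a JSON Lines exporter built on the standard library. `JsonLinesExporter` is a hypothetical class, and the items are made-up hashes shaped like the ones the spider yields; nothing here is part of Vessel.

```ruby
require "json"
require "stringio"

# Writes each scraped item as one JSON object per line (JSON Lines).
class JsonLinesExporter
  def initialize(io)
    @io = io
  end

  def call(item)
    @io.puts(JSON.generate(item))
  end
end

# Hypothetical items, shaped like the hashes the spider yields.
items = [
  { text: "First quote",  author: "Author A" },
  { text: "Second quote", author: "Author B" }
]

output   = StringIO.new
exporter = JsonLinesExporter.new(output)
items.each { |item| exporter.call(item) }

puts output.string
```

With a real spider, you would presumably open a file with `File.open("quotes.jsonl", "w")`, pass it to the exporter, and call `exporter.call(item)` inside the `QuotesSpider.run` block.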

The Magic Under the Hood

1. The Browser Pool

When you set threads 4, Vessel boots up a Chromium instance and opens 4 contexts (tabs). It distributes the URLs in your queue across these tabs automatically. If a page crashes a tab, Vessel handles the restart.
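
The general pattern behind this is a worker pool pulling from a shared queue. Here is a minimal sketch of that pattern with plain Ruby threads. `visit` and `run_pool` are hypothetical names, and `visit` only fakes the work of loading a page. Vessel's actual internals may differ.

```ruby
# Hypothetical stand-in for "open the URL in a tab and parse it".
def visit(url)
  "visited #{url}"
end

def run_pool(urls, workers: 4)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # no more work will be added in this simplified version

  results = Queue.new # thread-safe collection of results

  threads = Array.new(workers) do
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (url = queue.pop)
        results << visit(url)
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }
end

pool_urls    = (1..10).map { |n| "/page/#{n}" }
pool_results = run_pool(pool_urls, workers: 4)
puts pool_results.size
```

A real crawler cannot close the queue up front, because workers keep discovering new URLs while they run. Handling that bookkeeping is part of what the framework does for you.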

2. Middleware

Just like Rails, Vessel supports middleware. You can intercept requests before they happen.

  • Want to rotate User-Agents? Use Middleware.
  • Want to inject authentication cookies? Use Middleware.
  • Want to block image requests to save bandwidth? Use Middleware.

# inside your class
middleware do |browser, request|
  request.abort if request.resource_type == :image
end
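
If the middleware idea is new to you, it boils down to a chain of small callables that each get a chance to modify or reject a request. The sketch below shows that idea in plain Ruby. `FakeRequest` and `apply_middleware` are hypothetical and are not Vessel or Ferrum classes.

```ruby
# Hypothetical request object; not a Vessel or Ferrum class.
FakeRequest = Struct.new(:url, :resource_type, :headers, :aborted) do
  def abort
    self.aborted = true
  end
end

# Each middleware is a callable that receives the request and may change or abort it.
block_images   = ->(request) { request.abort if request.resource_type == :image }
set_user_agent = ->(request) { request.headers["User-Agent"] = "Vessel-Bot/1.0" }

def apply_middleware(request, chain)
  chain.each do |middleware|
    middleware.call(request)
    break if request.aborted # stop once a middleware rejects the request
  end
  request
end

chain = [block_images, set_user_agent]
image = apply_middleware(FakeRequest.new("/logo.png", :image, {}, false), chain)
page  = apply_middleware(FakeRequest.new("/js/", :document, {}, false), chain)

puts image.aborted
puts page.headers["User-Agent"]
```

Keeping each concern in its own callable means you can reorder or drop one without touching the spider's `parse` logic.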

3. Ferrum API Access

Because page is just a Ferrum object, you can do anything a browser can do inside your parse method:

  • page.screenshot(path: "error.png")
  • page.mouse.click(x: x, y: y)
  • page.keyboard.type("hello")
  • page.network.wait_for_idle

Vessel vs. Kimurai

If you used Kimurai in the past, Vessel feels very similar. The main difference is the engine.

  • Kimurai: Supported Selenium and Mechanize.
  • Vessel: Focused entirely on Ferrum (CDP).

This focus tends to make Vessel faster and more stable than the old Selenium-based crawlers, because it talks to the browser directly over the Chrome DevTools Protocol and skips the WebDriver layer in between.

Summary

If you are scraping a single page, Ferrum or a plain HTTP client such as httpx is fine.
But if you are building a data pipeline that needs to crawl thousands of pages, handle retries, and manage concurrency, Vessel is the tool you should be using.

It brings the structure of Python's Scrapy to the elegance of Ruby.


Have you tried Vessel yet? How does it compare to your custom scraping scripts? Let me know in the comments! 👇
