Erika S. Adkins
Hardcoded Selectors vs. AI Prompts: A Resilience Benchmark on Etsy

Every developer managing a web scraping pipeline at scale knows the "Monday Morning Breakage." You start your week, check the logs, and realize your primary data source updated its layout at 2:00 AM on Sunday. Your CSS selectors, once surgical and precise, now return None or empty strings that pollute your database.

Traditional scraping relies on the structural integrity of HTML. We target specific nodes like div.wt-mb-xs-2 > span.currency-value. But when a site like Etsy—notorious for aggressive A/B testing and dynamic class generation—changes that span to a p, the scraper dies.

This article benchmarks the traditional hardcoded selector approach against semantic AI-powered parsing. By simulating a breaking change on an Etsy product page, we can see if LLMs truly "understand" data when structure fails and analyze the real-world costs of both methods.

The Contenders: Selectors vs. Semantics

Before running the benchmark, we need to define the two philosophies competing for your infrastructure budget.

Approach A: Hardcoded Selectors

This is the industry standard. We use CSS Selectors or XPath to map the precise path to a piece of data.

# A typical (and brittle) Etsy selector
price = soup.select_one("div.wt-display-flex-xs.wt-align-items-center > p.wt-text-title-03").text
  • Pros: Near-zero latency, negligible cost, and 100% predictable.
  • Cons: Extremely brittle. If the site moves the price into a different container for a mobile update, the selector breaks.

Approach B: AI Prompts (Semantic Parsing)

This approach treats HTML as a document to be read rather than a tree to be traversed. We send the raw or cleaned HTML to a Large Language Model (LLM) with natural language instructions.

prompt = "Extract the current price and currency from this Etsy product HTML. Return JSON."
  • Pros: High resilience. AI understands that a number next to a "$" sign in a large font is likely the price, regardless of the <div> structure.
  • Cons: High latency (seconds vs. milliseconds), significant API costs, and the risk of hallucinations.
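That hallucination risk is manageable if you never trust the model's reply blindly. Below is a minimal sketch of a response guard that validates the LLM's JSON before it touches your database; the `price`/`currency` key names are our own convention for the prompt, not anything the API guarantees.

```python
import json

def parse_price_response(raw):
    """Validate an LLM's JSON reply; return (price, currency) or None on drift."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned prose instead of JSON
    price, currency = data.get("price"), data.get("currency")
    # Guard against hallucinated or malformed values
    try:
        price = float(str(price).replace("$", "").replace(",", ""))
    except (TypeError, ValueError):
        return None
    return price, currency

print(parse_price_response('{"price": "$24.99", "currency": "USD"}'))
# (24.99, 'USD')
print(parse_price_response("Sure! The price is $24.99."))
# None
```

A guard like this turns a hallucination into a logged failure instead of a poisoned row.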

The Benchmark: Simulating a Layout Change

To test resilience, we’ll use a Python script that simulates a common scenario: Etsy updates its product page layout. We'll start with an "Original" HTML snippet and then "mutate" it by changing class names and nesting levels—exactly the changes that break traditional scrapers.

Setup

You'll need beautifulsoup4 for the selector test and the openai library for the AI test.

pip install beautifulsoup4 openai

The Experiment Script

The following script defines the original HTML and a mutated version where class names are replaced with randomized strings to simulate obfuscation.

from bs4 import BeautifulSoup
import openai
import json

# 1. The Original HTML
original_html = """
<div class="listing-page-content">
    <h1 class="wt-text-body-01">Handmade Ceramic Mug</h1>
    <div class="price-container">
        <p class="wt-text-title-03">$24.99</p>
    </div>
</div>
"""

# 2. The Mutated HTML (The "Monday Morning" surprise)
mutated_html = """
<div class="listing-page-content">
    <h1 class="header-v2">Handmade Ceramic Mug</h1>
    <div class="flex-wrapper-random-123">
        <div class="price-box-new">
            <span class="text-large-bold-green">$24.99</span>
        </div>
    </div>
</div>
"""

def test_css_selector(html):
    soup = BeautifulSoup(html, 'html.parser')
    # This selector targets the original structure
    price_el = soup.select_one("div.price-container > p.wt-text-title-03")
    return price_el.text if price_el else None

def test_ai_prompt(html):
    client = openai.OpenAI(api_key="YOUR_API_KEY")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # Note: response_format={"type": "json_object"} requires the word
            # "JSON" to appear somewhere in the messages, or the API errors out.
            {"role": "system", "content": "You are a data extraction tool. Extract the price from the HTML and return JSON with a 'price' key."},
            {"role": "user", "content": f"HTML: {html}"}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

print(f"Selector on Original: {test_css_selector(original_html)}")
print(f"Selector on Mutated: {test_css_selector(mutated_html)}")
# Requires a real API key:
print(f"AI on Mutated: {test_ai_prompt(mutated_html)}")

The Results

  • CSS Selector: Successfully extracted $24.99 from the original, but returned None for the mutated HTML. The scraper is broken and requires manual intervention.
  • AI Prompt: Even with the mutated HTML, the LLM correctly identified $24.99 as the price. It didn't matter that the class was text-large-bold-green instead of wt-text-title-03.

Cost Analysis: Tokens vs. Time

While the AI won the resilience test, we have to consider the trade-off between direct costs and maintenance.

Direct Cost (API Fees)

Using a model like gpt-4o-mini, parsing a typical Etsy product page (after stripping scripts and styles) costs roughly $0.0001 to $0.0005 per page.

  • 10,000 pages: ~$5.00
  • 1,000,000 pages: ~$500.00

Traditional selectors cost $0 in API fees.

The Hidden Cost (Developer Time)

If your Etsy scraper breaks twice a month and takes a senior developer two hours to fix and redeploy, that’s $240/month in maintenance (at $60/hr).

| Metric | Hardcoded Selectors | AI-Powered Parsing |
| --- | --- | --- |
| Cost per 1k Pages | $0.00 | ~$0.10 - $0.50 |
| Latency | < 10ms | 800ms - 2s |
| Resilience | Low | High |
| Maintenance | High | Near Zero |

The Verdict: If you are scraping 10 million pages a day, AI is cost-prohibitive as a primary parser. If you are scraping 50,000 high-value leads a month, the developer time saved by using AI outweighs the API costs.
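You can sanity-check that verdict with some quick arithmetic. The figures below simply reuse the article's assumptions: roughly $0.0005 per AI-parsed page at the high end, and two breakages a month costing two hours of senior time each.

```python
# Rough monthly cost model: selectors cost developer time, AI costs API fees
PAGES_PER_MONTH = 50_000
AI_COST_PER_PAGE = 0.0005   # upper-end gpt-4o-mini estimate
FIXES_PER_MONTH = 2         # "Monday Morning Breakages"
HOURS_PER_FIX = 2
DEV_RATE = 60               # $/hr

selector_cost = FIXES_PER_MONTH * HOURS_PER_FIX * DEV_RATE  # maintenance only
ai_cost = PAGES_PER_MONTH * AI_COST_PER_PAGE                # API fees only

print(f"Selectors: ${selector_cost:.2f}/mo, AI: ${ai_cost:.2f}/mo")
# Selectors: $240.00/mo, AI: $25.00/mo
# At 10M pages/month, the same rate would put AI at $5,000 — maintenance wins.
```

The crossover point shifts with your page volume and your team's hourly rate, so plug in your own numbers before committing either way.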

The Hybrid Approach: "Self-Healing" Scrapers

You don't have to choose between cheap and resilient. Most sophisticated scraping teams use Hybrid Fallback Logic.

Use the fast CSS selector by default. If it fails, trigger an AI fallback. The AI extracts the data and can even suggest a new CSS selector for future use.

Implementation: The Try-Except-Heal Pattern

def extract_price(html):
    # Step 1: Try the fast, free way
    price = test_css_selector(html)

    if price:
        return price, "selector"

    # Step 2: Fallback to AI if the selector failed
    print("Selector failed. Triggering AI healing...")
    ai_data = test_ai_prompt(html)

    # Step 3: Return data and log for manual review
    return json.loads(ai_data)['price'], "ai_fallback"

This ensures 99% of requests remain free and fast, while the pipeline stays active during layout changes.
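To close the "healing" loop, an AI-suggested replacement selector should be verified before it overwrites the old one. Here's a minimal sketch of that validation step, using the same beautifulsoup4 dependency as the benchmark; in a real pipeline, `suggested` would come from an extra LLM call asking for a new selector, which is stubbed out here.

```python
from bs4 import BeautifulSoup

def validate_selector(html, selector, expected_text):
    """Check that a suggested selector actually finds the expected value."""
    el = BeautifulSoup(html, "html.parser").select_one(selector)
    return el is not None and el.get_text(strip=True) == expected_text

# Suppose the AI fallback extracted "$24.99" and proposed this new selector
suggested = "div.price-box-new > span.text-large-bold-green"
mutated_html = """
<div class="price-box-new"><span class="text-large-bold-green">$24.99</span></div>
"""

if validate_selector(mutated_html, suggested, "$24.99"):
    # Safe to persist: future requests go back to the cheap selector path
    print("Selector healed:", suggested)
```

Only persisting selectors that pass this check keeps a hallucinated suggestion from silently replacing a working one.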

Best Practices for Etsy & E-commerce

To maximize resilience without overspending, follow these guidelines:

  1. Target JSON-LD First: Before using AI or CSS, check for <script type="application/ld+json">. Etsy often embeds structured data for SEO. This is the most resilient non-AI method because it's machine-readable and rarely changes.
  2. Minimize Token Usage: Never send the entire HTML document to an LLM. Use BeautifulSoup to strip <script>, <style>, and <svg> tags first. This can reduce your token count by over 80%.
  3. Use Schema-Aware Prompts: Tell the AI to look for specific Schema.org types. For example: "Find the 'Offer' schema and extract the 'price' property."
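Practice #1 can be sketched with nothing but the standard library. The JSON-LD shape below is a typical Schema.org `Product`/`Offer` payload rather than Etsy's exact markup, and the regex is deliberately narrow; a production version would use a real HTML parser.

```python
import json
import re

def extract_jsonld_price(html):
    """Pull the price from embedded Schema.org JSON-LD, if present."""
    for match in re.finditer(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
    ):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block; keep scanning
        offer = data.get("offers", {})
        if "price" in offer:
            return offer["price"], offer.get("priceCurrency")
    return None, None

html = """
<script type="application/ld+json">
{"@type": "Product", "name": "Handmade Ceramic Mug",
 "offers": {"@type": "Offer", "price": "24.99", "priceCurrency": "USD"}}
</script>
"""
print(extract_jsonld_price(html))  # ('24.99', 'USD')
```

Checking JSON-LD first means the CSS and AI paths only run on pages where structured data is genuinely absent.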

To Wrap Up

The choice between hardcoded selectors and AI prompts isn't binary. While traditional selectors remain the backbone of high-volume scraping due to their speed, AI has introduced a new era of resilient data extraction.

  • Use Hardcoded Selectors for high-volume, low-margin scraping where speed is critical.
  • Use AI Prompts for low-volume, high-complexity sites or as a self-healing fallback.
  • The Hybrid Model is the most balanced approach, minimizing both developer burnout and API costs.

By moving toward semantic understanding, you can stop spending Mondays fixing broken selectors and focus on the data itself.
