I spent three weeks building a web scraper for a side project that aggregated job listings from multiple startup boards. Every site had its own HTML quirks. One used <div class="job-title">, another hid titles inside <h2> tags with no class, and a third relied on JavaScript rendering that even Selenium struggled with. I’d fix one site, and a week later they’d update their markup — my script would crash. I was playing whack-a-mole with CSS selectors and regex.
I needed something that could just understand the content, not memorize its structure. That’s when I turned to large language models (LLMs).
The Breaking Point
I remember spending an entire Saturday debugging why my BeautifulSoup parser returned None for a listing that clearly existed. The site had added a random aria-label change that broke my selector chain. My code looked like this:
from bs4 import BeautifulSoup
import requests
resp = requests.get("https://example-startup.com/jobs")
soup = BeautifulSoup(resp.text, "html.parser")
titles = soup.select("div.job-listing > h2.job-title")
That’s brittle. One class rename and I’m back to inspecting elements.
What I Tried First
- Regex: Great for patterns, terrible for variable content. I wrote a monstrous regex that matched 80% of cases. The other 20% broke silently.
- CSS selectors + XPath: Powerful but site-specific. Maintaining 20 different selector sets was a nightmare.
- JSON-LD / microdata: Only works if the site includes structured data (most don’t).
I even tried machine learning classifiers to identify sections of the page, but that required labeled data and retraining every time a site changed.
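For sites that do publish structured data, the JSON-LD route from the list above needs no LLM at all. Here is a minimal sketch, assuming the page embeds a schema.org JobPosting object in a script tag of type application/ld+json. The sample HTML and the names JsonLdCollector and extract_job_postings are made up for illustration, and it uses the standard library's HTMLParser instead of BeautifulSoup only so that it runs without extra dependencies.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_job_postings(html):
    """Return JobPosting objects embedded as JSON-LD, or [] if there are none."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    postings = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                postings.append(item)
    return postings


# Hypothetical page used only to exercise the sketch
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "JobPosting",
 "title": "Backend Engineer",
 "hiringOrganization": {"@type": "Organization", "name": "Example Startup"}}
</script>
</head><body><h2>Backend Engineer</h2></body></html>
"""

postings = extract_job_postings(SAMPLE_HTML)
print([p["title"] for p in postings])
```

The sketch only looks at top-level objects and arrays. Pages that nest their postings inside an @graph wrapper would need one more unwrapping step, and pages with no JSON-LD simply return an empty list.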
The Lightbulb: Ask an LLM to Read the Page
I realized that what I really wanted was a system that could parse the semantic meaning of a web page, not its syntactic structure. Humans can glance at a job listing and extract title, company, location — why couldn’t an AI do the same?
Modern LLMs (like GPT-4 or Claude) are trained on massive amounts of HTML and natural language. They can take a page’s visible text (or even its raw HTML) and answer questions about it. So I stopped writing parsers and started writing prompts.
The Approach in Practice
Here’s a simple Python snippet that uses OpenAI’s API to extract structured data from a web page’s text content:
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # openai>=1.0; reads OPENAI_API_KEY from the environment

client = OpenAI()

# Step 1: Get the page and extract visible text
resp = requests.get("https://example-startup.com/jobs", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Remove non-content elements: scripts, styles, navigation, footer
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
page_text = soup.get_text(separator="\n", strip=True)

# Step 2: Define the extraction prompt
page_excerpt = page_text[:5000]  # Truncate to limit tokens
prompt = f"""Extract all job listings from the following web page text. Return a JSON array of objects, each with fields: "title", "company", "location", "link" (if available). If a field is missing, use null.
Page text:
{page_excerpt}
"""

# Step 3: Call the LLM
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Cheaper than full GPT-4
    messages=[
        {"role": "system", "content": "You are a helpful assistant that extracts structured data from web pages."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.1,
    max_tokens=2000,
)

# Step 4: Parse the JSON output
result = response.choices[0].message.content
try:
    jobs = json.loads(result)
    print(f"Found {len(jobs)} jobs")
    for job in jobs:
        print(job["title"])
except json.JSONDecodeError:
    print("Failed to parse JSON, raw output:", result)
In my tests this held up remarkably well. I ran it against five different job boards, and it correctly extracted titles, companies, and locations from every single one, including a site where the job titles were inside <strong> tags with no clear pattern.
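One practical wrinkle with the json.loads step: chat models sometimes wrap their JSON in a Markdown code fence or return an object that contains the array instead of the bare array. The sketch below is one hedged way to tolerate both cases. The helper name parse_llm_json and the sample reply are hypothetical, and the unwrapping rule (take the first list found inside an object) is an assumption that may not suit every prompt.

```python
import json
import re

# The fence marker is built at runtime so this sketch contains no literal fence
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_llm_json(raw):
    """Parse model output into a list of dicts, tolerating common wrappers."""
    text = raw.strip()
    match = FENCED_BLOCK.match(text)
    if match:
        text = match.group(1)  # keep only what is inside the code fence
    data = json.loads(text)  # still raises json.JSONDecodeError on real garbage
    if isinstance(data, dict):
        # Handle a wrapped array such as {"jobs": [...]}; else treat as one job
        lists = [value for value in data.values() if isinstance(value, list)]
        data = lists[0] if lists else [data]
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of job objects")
    return [item for item in data if isinstance(item, dict)]


# Hypothetical model reply wrapped in a Markdown fence
fenced_reply = FENCE + 'json\n[{"title": "Backend Engineer", "company": null}]\n' + FENCE
print(parse_llm_json(fenced_reply))
```

Output that is not JSON at all still raises json.JSONDecodeError, so the existing try/except around the parsing step keeps working if this helper replaces the direct json.loads call.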
Trade-offs and Realities
Of course, this isn’t a silver bullet. Here are the downsides I’ve encountered:
- Cost: Every request burns tokens. For a single page of ~5k characters, a GPT-4o-mini call costs about $0.002. If you’re scraping thousands of pages, it adds up fast.
- Latency: The API call takes 1–3 seconds. A traditional parser is sub-second.
- Hallucinations: Sometimes the LLM invents a field that wasn’t there. I’ve seen it create a “remote” flag for a job listing that explicitly said “on-site.”
- Token limits: You can’t throw entire pages at it. Large pages need to be trimmed (the code above truncates to 5000 chars). You might miss listings at the bottom.
- Rate limits: OpenAI’s rate limits on the free tier are strict. You’ll need proper queuing.
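The truncation problem in the token-limits point can be softened by splitting the page into overlapping windows, extracting from each one, and merging the results. This is a minimal sketch under stated assumptions: chunk_text and merge_jobs are hypothetical helper names, the 5000 and 500 character sizes are arbitrary, and duplicates are detected by an exact match on title, company, and location.

```python
def chunk_text(text, max_chars=5000, overlap=500):
    """Split text into windows of at most max_chars that overlap by `overlap`."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window reached the end of the text
        start += max_chars - overlap
    return chunks


def merge_jobs(job_lists):
    """Combine per-chunk results, dropping duplicates caused by the overlap."""
    seen = set()
    merged = []
    for jobs in job_lists:
        for job in jobs:
            key = (job.get("title"), job.get("company"), job.get("location"))
            if key not in seen:
                seen.add(key)
                merged.append(job)
    return merged


# Synthetic 12,000-character "page" to show the window sizes
sample_text = "".join(str(i % 10) for i in range(12000))
chunks = chunk_text(sample_text)
print([len(chunk) for chunk in chunks])

# Hypothetical per-chunk results where one listing fell inside the overlap
per_chunk = [
    [{"title": "Backend Engineer", "company": "Example Startup", "location": "Berlin"}],
    [
        {"title": "Backend Engineer", "company": "Example Startup", "location": "Berlin"},
        {"title": "Designer", "company": "Example Startup", "location": None},
    ],
]
print(merge_jobs(per_chunk))
```

Each chunk is a separate API call, so this trades extra cost and latency for coverage. Cutting at fixed character offsets can still split a listing in half, and the overlap only helps when a listing is shorter than the overlap size.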
When Not to Use This
If you’re pulling from a well-structured API or scraping a site with consistent semantic HTML, traditional parsing is faster, cheaper, and more reliable. Also, if you need millisecond responses (e.g., real-time pricing), the AI approach will be too slow.
Use AI parsing when:
- Sites change their HTML frequently.
- You’re dealing with many different sources.
- The content is in unstructured text (stories, descriptions).
- You have a small scale (tens to hundreds of pages per day).
What I’d Do Differently Next Time
I would build a hybrid pipeline: first try a lightweight selector-based parser for each known site, then fall back to an LLM for unknown or broken ones. I’d also cache the LLM results aggressively and validate the output against a schema.
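A rough sketch of what that hybrid pipeline could look like, with the network and API calls stubbed out so it runs on its own. Everything here is an assumption for illustration: the function names, the in-memory dict used as a cache, the required field list, and the grounding rule that a title must appear verbatim in the page text.

```python
import hashlib

REQUIRED_FIELDS = ("title", "company", "location")


def is_valid_job(job, page_text):
    """Accept a job only if it has the expected shape and its title is on the page."""
    if not isinstance(job, dict):
        return False
    if any(field not in job for field in REQUIRED_FIELDS):
        return False
    title = job["title"]
    if not isinstance(title, str) or not title.strip():
        return False
    # Grounding check: a title that is not in the source text was likely invented
    return title.lower() in page_text.lower()


def extract_jobs(page_text, selector_parser, llm_parser, cache):
    """Try the cheap parser first, then the LLM, caching results by content hash."""
    key = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]
    try:
        jobs = selector_parser(page_text)
    except Exception:
        jobs = []  # a broken selector should trigger the fallback, not a crash
    if not jobs:
        jobs = llm_parser(page_text)
    jobs = [job for job in jobs if is_valid_job(job, page_text)]
    cache[key] = jobs
    return jobs


# Stubs standing in for a real selector-based parser and a real LLM call
def broken_selector_parser(page_text):
    return []  # simulates selectors that no longer match


calls = {"llm": 0}


def fake_llm_parser(page_text):
    calls["llm"] += 1
    return [
        {"title": "Backend Engineer", "company": "Example Startup", "location": "Berlin"},
        # Simulated hallucination: this title is not in the page text
        {"title": "Chief Vibes Officer", "company": "Example Startup", "location": None},
    ]


cache = {}
page = "Backend Engineer\nExample Startup\nBerlin (on-site)"
first = extract_jobs(page, broken_selector_parser, fake_llm_parser, cache)
second = extract_jobs(page, broken_selector_parser, fake_llm_parser, cache)
print(first)
print("LLM calls:", calls["llm"])
```

Keying the cache on a hash of the page text means an unchanged page never triggers a second LLM call, while any content change invalidates the entry automatically. The verbatim title check is deliberately strict and would reject titles the model has normalized or reworded, so a real pipeline might need a looser comparison.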
There are also dedicated services that handle this extraction for you — some are full APIs that combine AI with traditional scraping. I’ve seen tools like the one at https://ai.interwestinfo.com/ that wrap this exact idea into a single endpoint. They take a URL and a schema, and return structured JSON without you writing any prompts. That’s cool if you don’t want to manage API keys and prompt engineering yourself.
But the technique itself — asking a language model to extract data from text — is what made my fragile scraper resilient. It’s not perfect, but it saved my side project from a weekend of frustration.
Let’s Discuss
Have you tried using LLMs for scraping? Did you run into the same hallucination problems, or did you find a way to validate the output? I’m curious what your setup looks like for keeping scrapers alive.
Top comments (2)
This is an excellent example of using LLMs to make web scraping resilient. I really appreciate how you highlight the limitations of traditional parsers—CSS selectors, XPath, and regex—when sites change frequently, and how leveraging semantic understanding from an LLM can drastically reduce maintenance overhead.
The approach of combining BeautifulSoup for pre-processing and then using an LLM to extract structured JSON is elegant. I also like that you address trade-offs—latency, hallucinations, token cost—and suggest hybrid pipelines with caching. These are practical considerations that most tutorials skip.
I’d love to collaborate and explore scalable, multi-site AI scraping frameworks, including schema validation, cross-site consistency checks, and fallback pipelines for unstructured pages. Sharing strategies for hallucination detection, output verification, and caching could benefit developers maintaining long-term scraping projects.
Would you be open to discussing a collaboration to prototype robust AI-assisted web scraping pipelines?
As a beginner in web development, I found this really interesting. I didn't know LLMs could be used to make web scrapers more resilient. Thanks, zhongqiue