DEV Community

KazKN

AI Agents Make Scraping Look Easy. Marketplace State Is Where They Lie.

AI agents make scraping look easy.

Give the agent a page.

Ask for JSON.

Get back a clean object.

{
  "title": "Leather bag",
  "price": 1200,
  "url": "https://example.com/item/123"
}

That feels useful.

It is often not enough.

The hard part of marketplace scraping is not always extracting the card.

The hard part is knowing whether the card still means what you think it means.

200 OK can still be wrong

A scraper can return a valid HTTP response, parse the page, extract the title and price, and still give you bad operational data.

The page did load.

The selector did work.

The JSON did validate.

But the marketplace state may be wrong.

Examples:

  • The item is visible in search but already sold.
  • The item disappeared, but you do not know whether it sold or was deleted.
  • The same listing appears across several market locales.
  • The seller country matters more than the page country.
  • The search term matches a nearby model, not the model you wanted.
  • The current price looks low, but the item condition explains why.
  • The product card hides details that only exist on the item page.

This is where AI extraction can create false confidence.

It makes the data look cleaner than the marketplace really is.

The mistake: treating marketplaces like catalogs

A catalog has products.

A marketplace has state.

For a catalog scrape, this might be enough:

{
  "title": "Product name",
  "price": 49,
  "availability": "In stock"
}

For a resale marketplace, I want a different shape:

{
  "recordType": "listing",
  "listingId": "123",
  "displayStatus": "Available",
  "isSold": false,
  "country": "FR",
  "sellerCountry": "IT",
  "condition": "Very good condition",
  "price": 1200,
  "priceHistory": [
    { "price": 1400, "observedAt": "2026-06-01T12:00:00.000Z" },
    { "price": 1200, "observedAt": "2026-06-08T12:00:00.000Z" }
  ],
  "requiresManualReview": false
}

That object is less pretty.

It is also more honest.
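
To make the difference concrete, here is a minimal Python sketch that reports which state fields a record is missing. The field names come from the listing shape above; the `REQUIRED_STATE_FIELDS` tuple and the `missing_state_fields` helper are my own illustrative names, not part of any library.

```python
# Fields taken from the listing shape above. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_STATE_FIELDS = (
    "recordType", "listingId", "displayStatus", "isSold",
    "country", "sellerCountry", "condition", "price",
)

def missing_state_fields(record):
    """Return the state fields that are absent or None in a record."""
    return [
        field for field in REQUIRED_STATE_FIELDS
        if field not in record or record[field] is None
    ]

catalog_style = {"title": "Product name", "price": 49, "availability": "In stock"}
missing = missing_state_fields(catalog_style)
print(missing)
```

The catalog-style record passes a JSON check but fails this one on every field except `price`, which is the gap the rest of this post is about.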

The seven checks I use now

Before trusting marketplace data, I want seven checks.

1. Live and sold records should not be mixed

Active listings tell you supply.

Sold listings tell you demand.

If both are pushed into the same flat dataset without a clear recordType, sold prices leak into supply counts and asking prices leak into demand estimates.
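
A rough sketch of keeping the two apart, assuming every record carries a `recordType` key. The `"sold"` value and the `"unknown"` bucket are hypothetical names I chose for the example; only `"listing"` and `"risk_signal"` appear in this post.

```python
from collections import defaultdict

def split_by_record_type(records):
    """Group records by recordType; untyped records land in 'unknown'."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("recordType", "unknown")].append(record)
    return dict(groups)

records = [
    {"recordType": "listing", "listingId": "123", "price": 1200},
    {"recordType": "sold", "listingId": "456", "price": 1100},  # hypothetical type
    {"listingId": "789", "price": 900},  # no recordType at all
]
groups = split_by_record_type(records)
print(sorted(groups))
```

Routing untyped records to their own bucket, instead of guessing, keeps them out of both the supply and the demand numbers until someone looks at them.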

2. Disappearance is a signal

If an item was present yesterday and missing today, the scraper should not silently forget it.

It should emit a tracking record.

Something changed.

That change may be more useful than another active listing.
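
One way this could look, assuming each run stores a snapshot keyed by listing ID. The `"tracking"` record type, the `event` field, and the `lastSeenPrice` field are hypothetical names for this sketch.

```python
def diff_snapshots(previous, current, observed_at):
    """Emit one tracking record per listing seen before but absent now."""
    missing_ids = sorted(set(previous) - set(current))
    return [
        {
            "recordType": "tracking",  # hypothetical record type
            "listingId": listing_id,
            # "disappeared", not "sold": the cause is unknown at this point
            "event": "disappeared",
            "lastSeenPrice": previous[listing_id].get("price"),
            "observedAt": observed_at,
        }
        for listing_id in missing_ids
    ]

yesterday = {"123": {"price": 1200}, "456": {"price": 800}}
today = {"456": {"price": 800}}
events = diff_snapshots(yesterday, today, "2026-06-09T12:00:00.000Z")
print(events)
```

The event deliberately says "disappeared" and nothing more. Whether the item sold or was deleted needs another source, such as the item page.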

3. Page country is not seller country

On international marketplaces, the locale you searched and the seller location are different concepts.

If you only store one country field, you will eventually confuse yourself.

I prefer:

{
  "country": "FR",
  "sellerCountry": "IT"
}

One tells me where I searched.

The other tells me where the seller appears to be.

4. Search terms need precision filters

Marketplace search is fuzzy.

That is useful for browsing.

It is dangerous for automation.

If the query is "classic flap", I want the option to require both words in the result itself, not merely trust the search engine.
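
A minimal sketch of such a filter, assuming the check runs against the listing title. The `matches_all_terms` name is mine, and the tokenizer only handles ASCII letters and digits, so titles in other scripts would need a different split.

```python
import re

def matches_all_terms(query, title):
    """True only if every query word appears as a whole word in the title."""
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    query_words = re.findall(r"[a-z0-9]+", query.lower())
    return bool(query_words) and all(word in title_words for word in query_words)

titles = ["Classic Flap bag, medium", "Mini flap bag", "Classic tote"]
kept = [title for title in titles if matches_all_terms("classic flap", title)]
print(kept)
```

Matching whole words rather than substrings means a title containing "flapjack" does not satisfy the term "flap".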

5. Condition changes the meaning of price

A low price without condition is not a deal.

It is just incomplete data.

I want condition and conditionSource visible in the output.
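
As a sketch, a record without a condition can be routed to review instead of being treated as a deal. The `annotate_condition` helper is illustrative, and the `"item_page"` value for `conditionSource` is an assumption; this post does not define the allowed values.

```python
def annotate_condition(record):
    """Return a copy that makes a missing condition explicit and reviewable."""
    annotated = dict(record)  # leave the caller's record untouched
    annotated.setdefault("condition", None)
    annotated.setdefault("conditionSource", None)
    if annotated["condition"] is None:
        # A price without a condition cannot be interpreted yet.
        annotated["requiresManualReview"] = True
    return annotated

cheap = annotate_condition({"listingId": "123", "price": 300})
graded = annotate_condition({
    "listingId": "456",
    "price": 1200,
    "condition": "Very good condition",
    "conditionSource": "item_page",  # hypothetical source label
})
print(cheap["requiresManualReview"], graded["condition"])
```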

6. Price history matters more than current price

The current price is a snapshot.

The price movement is the signal.

If a scraper runs repeatedly, it should preserve enough state to answer:

  • did the price drop?
  • did it rise?
  • how long has it stayed at this level?
  • did it disappear after the drop?
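
Here is one way to keep that state, using the same `priceHistory` entry shape as the listing object earlier. The function names and the returned labels (`"first_seen"`, `"unchanged"`, `"drop"`, `"rise"`) are assumptions for this sketch.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches the observedAt values above

def record_price(history, price, observed_at):
    """Append an entry only when the price changes; return what happened."""
    if not history:
        history.append({"price": price, "observedAt": observed_at})
        return "first_seen"
    last_price = history[-1]["price"]
    if price == last_price:
        return "unchanged"
    history.append({"price": price, "observedAt": observed_at})
    return "drop" if price < last_price else "rise"

def days_at_current_price(history, now):
    """Whole days since the last recorded price change."""
    last_change = datetime.strptime(history[-1]["observedAt"], TIMESTAMP_FORMAT)
    return (datetime.strptime(now, TIMESTAMP_FORMAT) - last_change).days

history = []
first = record_price(history, 1400, "2026-06-01T12:00:00.000Z")
same = record_price(history, 1400, "2026-06-04T12:00:00.000Z")
change = record_price(history, 1200, "2026-06-08T12:00:00.000Z")
age = days_at_current_price(history, "2026-06-10T12:00:00.000Z")
print(first, same, change, age)
```

Storing only the changes keeps the history short, and the timestamp of the last entry answers how long the price has stayed at its current level. The last question, whether the listing disappeared after the drop, needs this history combined with disappearance tracking.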

7. Risk signals should be review queues, not accusations

If two listings look unusually similar, that is worth surfacing.

But a scraper should not declare fraud.

The safer output is:

{
  "recordType": "risk_signal",
  "signalType": "similar_listing_cluster",
  "confidence": "review_required"
}

That gives the human a queue.

It does not pretend the scraper knows more than it does.
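
A deliberately crude sketch of producing that record: it treats listings with the same normalized title and the same price as "similar". That matching rule and the `listingIds` field are assumptions for illustration; a real similarity check would likely need more than this.

```python
from collections import defaultdict

def similar_listing_signals(listings):
    """Emit a review-queue record for each group of look-alike listings."""
    clusters = defaultdict(list)
    for listing in listings:
        # Normalize case and whitespace so trivial differences do not split a cluster.
        normalized_title = " ".join(listing["title"].lower().split())
        clusters[(normalized_title, listing["price"])].append(listing["listingId"])
    return [
        {
            "recordType": "risk_signal",
            "signalType": "similar_listing_cluster",
            "confidence": "review_required",
            "listingIds": sorted(ids),  # hypothetical field
        }
        for ids in clusters.values()
        if len(ids) > 1
    ]

listings = [
    {"listingId": "456", "title": "Leather  bag", "price": 1200},
    {"listingId": "123", "title": "leather bag", "price": 1200},
    {"listingId": "789", "title": "Canvas tote", "price": 90},
]
signals = similar_listing_signals(listings)
print(signals)
```

The output names the listings and stops there. It says nothing about intent, which is left to whoever works the queue.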

Where AI helps

AI can still help a lot.

It can normalize messy titles.

It can classify categories.

It can summarize product descriptions.

It can help detect weird edge cases in extracted records.

But I do not want AI to hide uncertainty.

For marketplace scraping, the best output is not the cleanest JSON.

The best output is the JSON that preserves enough context to make a decision.

My rule now

If a scraper cannot explain state, I do not trust its clean output.

I want to know:

  • what exists now;
  • what disappeared;
  • what sold;
  • where it was found;
  • where the seller appears to be;
  • how the price changed;
  • what needs manual review.

Selectors are only the first layer.

The real product is the state model.

What is the most dangerous false positive your scraper has returned?
