Apogee Watcher

Posted on • Originally published at apogeewatcher.hashnode.dev

Bots read fast pages too: what we reprioritised after an AI-crawler audit

We ran a small audit last winter because a client asked whether GPTBot could "see" their new help centre. Search Console looked fine. PageSpeed Insights on the homepage looked fine. Server logs told a different story: long-tail URLs timing out, a category template returning a JavaScript shell on the first response, and a /robots.txt rule that blocked one path the marketing team had already pitched for AI Overviews.

Nothing in that list was exotic. It was the kind of drift you only notice when you stop testing the three URLs everyone bookmarks and start reading what bots actually request.

That audit changed our monitoring priorities more than any slide about "optimising for ChatGPT." Bots read fast pages too. They also abandon slow ones, skip empty HTML, and respect robots.txt literally. Below is what we reprioritised, what we stopped overclaiming, and where scheduled PageSpeed monitoring fits once crawlability is on the board.

Why we audited GPTBot and other AI crawler traffic in server logs

Large language models and AI search products reach your content through partner indexes and dedicated crawlers. OpenAI publishes GPTBot; Google still sends Googlebot for Search and related features. Other vendors document their own user-agents. The exact mix varies by site and industry.

We started with logs, not Lighthouse scores:

  1. Filter requests by known AI and major search user-agents for two weeks.
  2. List the top requested paths and median time to first byte (TTFB).
  3. Flag non-200 responses, responses over five seconds, and HTML bodies under a sensible size threshold.
  4. Compare that list to the URLs we actually monitored every week.
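
Steps 1 to 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used in the audit: the `Hit` record, its field names, the `BOT_MARKERS` tuple, and the 2,048-byte size floor are all hypothetical, and it assumes your logs are already parsed into one record per request with both TTFB and total response time available.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

# Substrings to look for in the User-Agent header. GPTBot and Googlebot are
# the two named in the article; extend with whatever else you want to track.
BOT_MARKERS = ("GPTBot", "Googlebot")


@dataclass
class Hit:
    """One parsed log record (hypothetical shape, adapt to your log format)."""
    user_agent: str
    path: str
    status: int
    ttfb_s: float      # time to first byte, in seconds
    total_s: float     # full response time, in seconds
    body_bytes: int    # size of the response body


def is_bot(hit):
    ua = hit.user_agent.lower()
    return any(marker.lower() in ua for marker in BOT_MARKERS)


def audit(hits, slow_s=5.0, min_bytes=2048, top_n=10):
    """Return the most-requested bot paths with median TTFB and a problem flag."""
    by_path = defaultdict(list)
    for hit in hits:
        if is_bot(hit):
            by_path[hit.path].append(hit)

    # Most-requested paths first.
    top = sorted(by_path.items(), key=lambda kv: len(kv[1]), reverse=True)[:top_n]

    report = []
    for path, group in top:
        report.append({
            "path": path,
            "requests": len(group),
            "median_ttfb_s": median(h.ttfb_s for h in group),
            # Flag non-200s, responses over the time limit, and tiny bodies.
            "flagged": any(
                h.status != 200 or h.total_s > slow_s or h.body_bytes < min_bytes
                for h in group
            ),
        })
    return report
```

The size floor is the part most worth tuning per template: a threshold that suits a long help article will misfire on a legitimately short glossary entry.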

The overlap was embarrassingly thin. We were watching home, pricing, and one campaign lander. Bots were hitting long-form guides, filtered category pages, and legacy blog paths nobody had opened in PSI for months.

That gap is the audit. AI crawler performance, in our usage, means "can the bot fetch a complete response in time?" not "will Perplexity cite us tomorrow?"

LLM crawler crawlability: timeouts, JavaScript shells, and empty HTML

Crawlers behave like impatient clients with limited rendering budgets. Common failure modes we saw:

  • Timeouts on deep URLs: Pagination and faceted category routes that humans reach through internal search but bots request directly. TTFB spiked when cache keys multiplied.
  • Client-rendered shells: Initial HTML with a loading spinner and the article body injected after a large bundle. Many crawlers never execute that JavaScript; they store the shell.
  • Accidental blocks: A staging Disallow copied into production, or a path blocked while the sitemap still listed it.
  • Soft 404s: HTTP 200 with "product unavailable" and almost no text. Fine for a human with context; useless for anything parsing structure.

These are crawlability problems first. They also show up in Core Web Vitals work: a page that fails a bot fetch often fails real users on slow mobile networks, just on a longer timeline.

We added one lab check per template: fetch the URL with a simple HTTP client, measure TTFB, and confirm the primary content appears in the first HTML chunk before we trust a green Lighthouse score. It is crude. It caught issues PSI alone did not, because we were not running PSI on those URLs at all.
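
A rough sketch of that kind of check, using only the Python standard library. The function names, the 16 KB chunk size, and the 500-character floor are illustrative assumptions rather than the values used in the audit. Timing `urlopen` until the headers arrive only approximates TTFB, because it also includes DNS, connect, and TLS time.

```python
import time
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text that sits outside script, style, and similar elements."""
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Text a fetch-only client would see, without executing any JavaScript."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def has_primary_content(html, must_contain, min_chars=500):
    """True if the HTML carries enough text and includes an expected phrase."""
    text = visible_text(html)
    return len(text) >= min_chars and must_contain.lower() in text.lower()


def fetch_first_chunk(url, user_agent, chunk_bytes=16_384, timeout=10.0):
    """Return (approximate TTFB in seconds, first chunk of the body as text).

    urlopen raises HTTPError on 4xx/5xx responses; callers should treat that
    as a failed fetch.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        ttfb_s = time.perf_counter() - start
        chunk = response.read(chunk_bytes)
    return ttfb_s, chunk.decode("utf-8", errors="replace")
```

Passing a phrase you expect in the body, such as the article heading, as `must_contain` separates a real page from a spinner-only shell, and the character floor catches soft 404s that return 200 with almost no text.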

AI crawler performance and Core Web Vitals: fetch readiness, not citation rank

Core Web Vitals still matter in this conversation, but the job description is narrower than social posts imply.

  • LCP often correlates with main content arriving early enough for a fetch-only crawler to capture text and headings.
  • INP is secondary for a plain GET, but pages drowning the main thread tend to produce messy, slow loads for everyone.
  • CLS is mostly about human mis-taps; stable layout still helps parsers that walk the DOM in order.

We do not tell clients that improving LCP increases LLM citation rates. Google's AI Overviews documentation does not list page speed as a citation factor. We do say that slow or broken fetches reduce the chance your content enters the pool at all. Fetch readiness is the step before relevance, authority, and structure.

That distinction cleaned up our backlog. We stopped pitching "CWV for AI rankings" and started tagging tickets as access (bot can read the page) vs representation (schema, clear headings, FAQ markup for how you want to be quoted). Both matter; only one belongs in a PageSpeed alert policy.

A third lane appeared in lab tooling after the audit: whether agents can use the interface, not only read the HTML. Chrome's experimental Lighthouse Agentic Browsing category covers WebMCP registration, agent-centric accessibility checks, CLS (layout shift breaks programmatic clicks on moving targets), and optional llms.txt discoverability. Scoring is a pass ratio, not another Performance 0–100, which fits a standard that is still moving. We log those audits on scheduled runs as research, not as a client-facing KPI. For what each check means and why we did not promote pass ratios to leadership, see Lighthouse Agentic Browsing scoring on the Watcher blog.

For the full technical baseline (robots rules, sitemaps, SSR vs client rendering), our Watcher article on why AI crawlers need fast, crawlable pages goes deeper. This post is the monitoring shift after we read the logs.

Website optimisation for AI crawlers: what we changed in our URL lists

Before the audit, our default monitoring pack mirrored SEO reporting: homepage, top landing page, maybe /blog. After the audit, we grouped URLs by how bots actually behave:

| Group | What we added | Why |
| --- | --- | --- |
| Long-form content | Top help articles, comparison pages, glossary entries | High bot fetch volume in logs |
| Faceted routes | One category URL with filters applied | Timeout and cache risk |
| Template exemplars | PLP, PDP, or docs template per site | JS shell risk differs by template |
| Edge paths | Pagination page 2+, print-friendly URLs | Often missing from manual PSI habits |

We also split monitoring frequency by group. Homepage daily; long-tail content twice weekly; faceted routes after any deploy touching search or filters. That is more runs, but fewer surprises than a quarterly "AI readiness" deck built from three green scores.

Agencies managing many sites copied the shape, not the URLs. Each client gets a short bot-traffic appendix: five paths from logs plus two template exemplars. Updating that appendix quarterly takes less time than one fire drill when a blocked path surfaces in a stakeholder call.

PageSpeed Insights vs scheduled monitoring for AI-ready pages

PSI remains our first tool for a single URL in a hurry. It is official, shows lab data plus field data when CrUX has enough of it, and answers "what does Lighthouse think right now?"

It does not tell you:

  • Which URLs bots requested last night while your team slept.
  • Whether TTFB regressed on page 2 of a category after a cache change.
  • That /robots.txt flipped during a deploy two days before anyone opened PSI.

After the audit we kept PSI for ad-hoc checks and moved portfolio baselines to scheduled runs with stored history and budget alerts. That is the same split we document for agencies in PageSpeed Insights vs automated monitoring: manual for one decision, automation when regressions must not wait for a calendar reminder.

For AI crawlability specifically, we added two thresholds beside LCP and INP:

  1. TTFB on bot-heavy URLs (internal band, stricter than marketing pages).
  2. A simple "content in first response" check on templates we know use client-side rendering.

Neither threshold proves you will be cited. Both catch "the bot got nothing useful" earlier than a monthly PSI spot check on the homepage.
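
As an illustration of how two such thresholds might sit in an alert policy, here is a small Python sketch. The group names, the `Budget` type, and especially the millisecond values are placeholders invented for the example; the article does not publish its internal bands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    ttfb_ms: int            # alert when measured TTFB exceeds this
    require_content: bool   # alert when the first response lacks primary content


# Placeholder numbers: the only property taken from the article is that
# bot-heavy URLs get a stricter TTFB band than marketing pages.
BUDGETS = {
    "bot_heavy": Budget(ttfb_ms=600, require_content=True),
    "marketing": Budget(ttfb_ms=1000, require_content=False),
}


def alerts(group, ttfb_ms, content_in_first_response):
    """Return a list of human-readable alert messages for one scheduled run."""
    budget = BUDGETS[group]
    messages = []
    if ttfb_ms > budget.ttfb_ms:
        messages.append(f"TTFB {ttfb_ms} ms exceeds {budget.ttfb_ms} ms budget")
    if budget.require_content and not content_in_first_response:
        messages.append("primary content missing from first response")
    return messages
```

Keeping the content check as a per-group switch means server-rendered marketing pages do not generate noise while client-rendered templates stay covered.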

AI search technical SEO checks we added to onboarding

Technical SEO for AI search still overlaps classic crawl hygiene. We added explicit onboarding steps so new sites do not inherit the homepage-only habit:

  1. robots.txt review: Confirm GPTBot and Googlebot rules match what leadership thinks is public. Block intentionally; do not block by accident.
  2. Sitemap cross-check: Every URL in the sitemap returns 200 and is not disallowed. Remove stale paths; bots follow sitemaps into dead ends.
  3. Render strategy note: Document which templates serve meaningful HTML on first response vs which rely on client bundles. Prioritise fixes on high-traffic templates in the long-form content group from the URL table above.
  4. Optional llms.txt: Some teams publish llms.txt to point systems at preferred docs. Optional for most architectures; Lighthouse's Agentic Browsing category includes a discoverability check if you publish one (see the scoring post linked above).
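
The robots.txt and sitemap steps overlap in one checkable question: is any sitemap URL disallowed for a crawler you care about? A minimal sketch with Python's standard `urllib.robotparser` follows. The function name `blocked_sitemap_urls` is hypothetical, and the sketch assumes you have already loaded the robots.txt text and the sitemap URL list. Python's parser applies simple prefix matching, so treat a hit as a prompt to check the crawler vendor's own documentation rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt, sitemap_urls, agents=("GPTBot", "Googlebot")):
    """Map each sitemap URL to the agents that robots.txt disallows for it.

    URLs that every agent may fetch are left out of the result.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    blocked = {}
    for url in sitemap_urls:
        denied = [agent for agent in agents if not parser.can_fetch(agent, url)]
        if denied:
            blocked[url] = denied
    return blocked
```

An empty result does not prove the sitemap is healthy; confirming that each URL returns 200 still needs a real fetch.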

These steps live in the project wiki, not a PDF nobody opens. When monitoring fires on TTFB for a help article, the engineer sees whether that URL is supposed to be bot-accessible or deliberately restricted.

What we stopped saying after the audit

Three phrases left our client calls:

  • "Fix Core Web Vitals and ChatGPT will cite you." (Overclaim; conflates fetch with ranking and citation.)
  • "We are AI optimised because the homepage scores 90+." (Wrong URL sample; wrong metric story.)
  • "Block all AI bots to be safe." (Sometimes valid for licensing reasons; not a performance strategy, and it removes you from retrieval pools you may want.)

Replacements that survived legal and SEO review:

  • "We keep public content fetchable, fast on first response, and listed consistently in robots.txt and sitemaps."
  • "We monitor the URLs bots actually request, not only the homepage."
  • "Citation and answer quality are content and authority problems; we handle the access layer with performance monitoring and crawl checks."

Calmer language. Fewer disappointed stakeholders when a competitor still gets quoted despite similar scores.

Next step: run a one-week AI crawler log audit on one site

Pick a site where AI search came up in the last quarter. For seven days:

  1. Export server logs (or CDN logs) and filter by GPTBot, Googlebot, and other agents your host documents.
  2. List the ten most-fetched paths and their median TTFB.
  3. Compare that list to the URLs in your current monitoring schedule.
  4. Run PSI or a scheduled test on any high-traffic path you have not checked this month.
  5. Fetch the worst TTFB URL with a simple HTTP client and inspect the first HTML chunk for real content.
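
For the export-and-compare steps, a short Python sketch is below. It assumes your server writes the widely used Combined Log Format; CDN exports often differ, so the regular expression is an assumption to adjust. The names `top_bot_paths` and `unmonitored` are illustrative. Combined Log Format does not carry TTFB, so the median in step 2 needs a log format that records timing.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Agents named in the article; add any others you want to track.
BOTS = ("GPTBot", "Googlebot")


def top_bot_paths(lines, n=10):
    """Return the n most-requested paths among bot requests as (path, count)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        if any(bot in match["ua"] for bot in BOTS):
            counts[match["path"]] += 1
    return counts.most_common(n)


def unmonitored(top_paths, monitored):
    """Paths bots request that are absent from the monitoring schedule."""
    return [path for path, _ in top_paths if path not in monitored]
```

Query strings are kept as part of the path on purpose: page 2 of a category and a filtered view are different cache keys, and the audit found they fail differently.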

If the lists do not overlap, update monitoring before you commission new schema markup. Bots read fast pages; they also skip the ones you never test.

For crawl rules, sitemaps, and the honest line on CWV vs citation, read Why AI crawlers need fast, crawlable pages. For when manual PSI stops scaling across a portfolio, pair it with PageSpeed Insights vs automated monitoring.

Access first. Representation second. Monitoring is how you keep the first from drifting while everyone focuses on the second.

Top comments (4)

Harjot Singh

The AI-crawler angle is underrated. Your meta, structured data, and first-paint content are increasingly read by bots (AI answer engines) as much as by humans, and the bots are less forgiving of JS-gated content they can't execute. Fast, server-rendered, semantically clear pages win twice now: real users and the models that cite you. Reprioritising after a crawler audit is exactly the right instinct. I bake clean SSR and meta into Moonshift's generated sites for this reason. What surprised you most in the audit, render-blocking, or content the crawler just couldn't see?

Apogee Watcher

Good point on SSR and first response. That matches what we saw: bots are less patient than Lighthouse runs on the homepage.

If we had to pick one surprise, it was content the crawler never received, not render-blocking on the URLs we already tested. The homepage and pricing page looked fine in PSI. Server logs showed GPTBot and Googlebot on category and help URLs where the first HTML chunk was basically a shell (spinner, layout, almost no article body) or a soft 404 (HTTP 200 with almost no text). Pagination and faceted routes were the other stand-outs: timeouts and high TTFB on paths nobody had opened in PSI for months.

Classic render-blocking on the hero still matters for humans. In this audit, it was secondary because we were not even running lab checks on the templates bots actually fetched. Your Moonshift angle (clean SSR and meta on generated pages) is the right default for anything you want quoted or indexed.

Practical follow-up if you want to replicate the audit on one site: filter logs for a week, list the ten most-requested bot paths, then fetch the worst URL with curl and read the first HTML chunk before you trust a green Lighthouse score on the homepage.

Petteri Pucilowski

The "read logs, not Lighthouse scores" point is the part most AI-crawler posts miss, so good to see it framed that way.

One addition worth flagging in your user-agent filter step: CCBot (Common Crawl's crawler) deserves its own row. It's easy to dismiss as just training-data scraping, but it's actually the upstream source for a whole tier of downstream tools - the Common Crawl webgraph feeds backlink-intelligence tools (Crawlgraph and others), academic research datasets, and parts of Bing's grounding pipeline. So a robots.txt rule that blocks CCBot, or a slow origin that times out on it, doesn't just cost you "AI training" inclusion, it makes you invisible in any product built on that quarterly release for the next ~3 months until the following crawl.

The asymmetry that bites people: CCBot crawls infrequently (quarterly-ish for the main release), so if your category template was returning a JS shell during the window CCBot visited, you eat that gap until the next crawl. Unlike Googlebot which re-crawls and self-corrects within days. Worth adding "was CCBot served a complete response during its last visit" to the same audit you described.

Solid writeup, the long-tail-timeout finding is the kind of thing only log analysis surfaces.

Apogee Watcher

Agreed on logs first. Lighthouse is useful when you have already picked the URL; it does not tell you which paths bots requested overnight.

CCBot is a fair addition we underplayed in the post. We had GPTBot and Googlebot in the first pass; Common Crawl deserves its own row in the appendix, for the reasons you outline: infrequent crawl windows and downstream use of that corpus. A JS shell or a slow origin during CCBot’s visit is a different failure mode from Googlebot re-crawling within days.

We have added two lines to our onboarding checklist: filter CCBot separately in the two-week log export, and note the date of its last successful fetch for any template you care about for downstream training or retrieval. “Was CCBot served a complete response on its last visit?” is going into the same one-week audit we describe at the end of the article.

The long-tail timeout finding is the one that still changes client URL lists fastest once someone actually exports logs. Thanks for the CCBot nudge; it sharpens the audit without turning it into a promise of ranking.