- I generate sitemap indexes for 8,177 job pages.
- I cap each sitemap at 45,000 URLs.
- I prioritize state + category pages first.
- I ship it in Next.js 14 with Supabase.
Context
I’m building a PMHNP job board. Next.js 14. Supabase Postgres. Vercel.
This week my data looked like this:
- 8,177 active jobs
- 2,004 companies
- 1,311 new jobs added
- 2,345 jobs expired/removed (net -1,034)
That churn matters.
If I don’t keep sitemaps fresh, Google crawls garbage. Expired URLs linger. New URLs show up late.
I tried “just let internal links handle it”. Brutal.
GSC showed a bunch of URLs stuck in “Discovered — currently not indexed”. I needed a boring, repeatable pipeline: generate sitemaps from the DB, ship them from Next.js, and make sure they don’t explode when I have 8K pages today and 80K later.
1) I start with the DB query. Not my router.
I don’t build sitemaps from the filesystem.
My job pages are database-backed. So the sitemap is database-backed too.
I store a slug and updated_at for each job. That’s enough.
Here’s the exact query I use from the sitemap route.
```ts
// lib/sitemaps/jobs.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  // Service role key. Server-only.
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export type JobSitemapRow = {
  slug: string;
  updated_at: string;
};

export async function fetchActiveJobSlugs(limit: number, offset: number) {
  const { data, error } = await supabase
    .from("jobs")
    .select("slug, updated_at")
    .eq("status", "active")
    .order("updated_at", { ascending: false })
    .range(offset, offset + limit - 1);

  if (error) throw new Error(`fetchActiveJobSlugs: ${error.message}`);
  return (data ?? []) as JobSitemapRow[];
}

export async function countActiveJobs() {
  const { count, error } = await supabase
    .from("jobs")
    .select("id", { count: "exact", head: true })
    .eq("status", "active");

  if (error) throw new Error(`countActiveJobs: ${error.message}`);
  return count ?? 0;
}
```
Two notes.
First: SUPABASE_SERVICE_ROLE_KEY never hits the client. This runs in a route handler only.
Second: I order by updated_at so the freshest jobs get crawled earlier. When I add 1,311 jobs in a week, I want those URLs discovered first.
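One caveat worth flagging: Supabase's Data API has a "Max rows" setting (1,000 by default) that caps every response, so a single `.range()` call can quietly return fewer rows than `JOB_SITEMAP_SIZE`. If your project still has the default cap, batching the same query is one way around it. A minimal sketch built on the file above; `fetchActiveJobSlugsBatched` and `BATCH` are names I'm making up here:

```ts
// lib/sitemaps/jobs.ts (optional sketch — only needed if the Data API
// "Max rows" cap is below JOB_SITEMAP_SIZE)
const BATCH = 1_000; // assumption: at or below the project's Max rows setting

export async function fetchActiveJobSlugsBatched(limit: number, offset: number) {
  const rows: JobSitemapRow[] = [];
  for (let start = offset; start < offset + limit; start += BATCH) {
    const size = Math.min(BATCH, offset + limit - start);
    const batch = await fetchActiveJobSlugs(size, start);
    rows.push(...batch);
    if (batch.length < size) break; // no more active jobs past this point
  }
  return rows;
}
```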
2) I chunk sitemaps. Because Google has limits.
A sitemap can’t be infinite.
Hard limit: 50,000 URLs per sitemap.
I don’t go near the edge. I use 45,000.
So I serve:
- `/sitemap.xml` as an index
- `/sitemaps/jobs/0.xml`, `/sitemaps/jobs/1.xml`, ... for job URLs
I learned the hard way that a single giant sitemap becomes a maintenance trap. You’ll want to add more URL types later (states, cities, categories). Indexing is the clean escape hatch.
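The chunk math is just `Math.ceil`, the same formula the index route uses below. A quick sanity check with my numbers:

```ts
// parts per the formula in the index route below
const JOB_SITEMAP_SIZE = 45_000;
console.log(Math.max(1, Math.ceil(8_177 / JOB_SITEMAP_SIZE)));  // 1 part this week
console.log(Math.max(1, Math.ceil(80_000 / JOB_SITEMAP_SIZE))); // 2 parts at ~80K pages
```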
This is my index route.
```ts
// app/sitemap.xml/route.ts
import { NextResponse } from "next/server";
import { countActiveJobs } from "@/lib/sitemaps/jobs";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!; // e.g. https://example.com
const JOB_SITEMAP_SIZE = 45_000;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET() {
  const total = await countActiveJobs();
  const parts = Math.max(1, Math.ceil(total / JOB_SITEMAP_SIZE));
  const now = new Date().toISOString();

  const urls = Array.from({ length: parts }, (_, i) => {
    const loc = `${BASE_URL}/sitemaps/jobs/${i}.xml`;
    return `\n  <sitemap>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${now}</lastmod>\n  </sitemap>`;
  }).join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urls}
</sitemapindex>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      // I keep this short because jobs churn.
      "cache-control": "public, s-maxage=900, stale-while-revalidate=86400",
    },
  });
}
```
That cache header matters.
My listings expire in batches. This week 2,345 got removed.
If I cache for a day, I’m literally telling crawlers about dead pages for a day. Not catastrophic. But it stacks.
15 minutes (s-maxage=900) feels right.
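Rough math behind that, using this week's churn and assuming removals are spread evenly through the day:

```ts
// how stale a cached sitemap can get before it refreshes
const removedPerDay = 2_345 / 7;             // ≈ 335 dead URLs per day
console.log(Math.round(removedPerDay));      // up to ≈ 335 stale entries with a 24h cache
console.log(Math.round(removedPerDay / 96)); // up to ≈ 3 stale entries with s-maxage=900 (96 windows/day)
```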
3) I generate each job sitemap with a route param.
Now the meat.
A job sitemap is just a paginated query + XML.
I keep `<changefreq>` and `<priority>` out. Google ignores them half the time.
I do include `<lastmod>`. That one’s real.
```ts
// app/sitemaps/jobs/[part]/route.ts
import { NextResponse } from "next/server";
import { fetchActiveJobSlugs } from "@/lib/sitemaps/jobs";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!;
const JOB_SITEMAP_SIZE = 45_000;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET(
  _req: Request,
  ctx: { params: Promise<{ part: string }> }
) {
  const { part } = await ctx.params;
  // The index links to /sitemaps/jobs/0.xml, so the segment arrives as "0.xml".
  // Strip the extension before parsing the part number.
  const page = Number(part.replace(/\.xml$/, ""));

  if (!Number.isFinite(page) || page < 0) {
    return new NextResponse("Not found", { status: 404 });
  }

  const offset = page * JOB_SITEMAP_SIZE;
  const rows = await fetchActiveJobSlugs(JOB_SITEMAP_SIZE, offset);

  // If someone requests /sitemaps/jobs/999.xml, don't return an empty 200.
  if (rows.length === 0) {
    return new NextResponse("Not found", { status: 404 });
  }

  const urlset = rows
    .map((r) => {
      const loc = `${BASE_URL}/jobs/${encodeURIComponent(r.slug)}`;
      const lastmod = new Date(r.updated_at).toISOString();
      return `\n  <url>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </url>`;
    })
    .join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urlset}
</urlset>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      "cache-control": "public, s-maxage=900, stale-while-revalidate=86400",
    },
  });
}
```
The “don’t return an empty 200” thing cost me time.
I shipped empty sitemap parts as valid XML. Google happily fetched them. Wasted crawl budget. And it looked like “it’s working” because there were no errors.
Now I 404 them.
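A quick spot check I can run from a scratch `.mjs` file against a dev server (hypothetical part numbers):

```ts
// in-range parts should 200, out-of-range parts should 404
const ok = await fetch("http://localhost:3000/sitemaps/jobs/0.xml");
const missing = await fetch("http://localhost:3000/sitemaps/jobs/999.xml");
console.log(ok.status, missing.status); // expect: 200 404
```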
4) I include non-job pages. And I order them.
If you only submit job URLs, you’re missing the pages that help discovery.
On my board, state and category pages are the real hubs:
- States: Washington (197), California (159), Massachusetts (131), New York (117), Colorado (74)
- Categories I added: Telehealth, Per Diem (only 30 listings), New Grad (only 29 listings), Travel/Locum
I want crawlers to hit those early.
So I put them in a small “static-ish” sitemap.
```ts
// app/sitemaps/pages.xml/route.ts
import { NextResponse } from "next/server";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET() {
  const now = new Date().toISOString();

  // Keep this list small and intentional.
  const urls = [
    "/",
    "/states",
    "/categories/telehealth",
    "/categories/per-diem",
    "/categories/new-grad",
    "/categories/travel-locum",
    "/states/washington",
    "/states/california",
    "/states/massachusetts",
    "/states/new-york",
    "/states/colorado",
  ];

  const urlset = urls
    .map((path) => {
      const loc = `${BASE_URL}${path}`;
      return `\n  <url>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${now}</lastmod>\n  </url>`;
    })
    .join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urlset}
</urlset>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      "cache-control": "public, s-maxage=3600, stale-while-revalidate=86400",
    },
  });
}
```
Then I add it to the sitemap index.
I didn’t want a second index.
So I just append one more `<sitemap>` entry.
```ts
// app/sitemap.xml/route.ts (add this near the top)
// ...existing imports

// inside GET(), after computing `now`
const extra = `\n  <sitemap>\n    <loc>${BASE_URL}/sitemaps/pages.xml</loc>\n    <lastmod>${now}</lastmod>\n  </sitemap>`;

// and include it in the final <sitemapindex> body
// ...job parts... ${extra}
```
Simple.
And it’s honest.
Those pages don’t change every 15 minutes, so I cache them for an hour.
5) I validate the sitemap output with a script. Every time.
I don’t trust my eyes.
I run a quick Node script locally to:
- fetch `/sitemap.xml`
- parse sitemap locations
- fetch a couple sitemap parts
- sanity-check: “Do these contain `<url>` tags?”
No fancy XML parser. Just enough.
```js
// scripts/check-sitemaps.mjs
const BASE = process.env.BASE_URL || "http://localhost:3000";

async function getText(path) {
  const res = await fetch(`${BASE}${path}`, {
    headers: { "user-agent": "sitemap-check/1.0" },
  });
  const text = await res.text();
  return { status: res.status, text };
}

function extractLocs(xml) {
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
}

const index = await getText("/sitemap.xml");
console.log("/sitemap.xml", index.status);
if (index.status !== 200) process.exit(1);

const locs = extractLocs(index.text);
console.log("sitemaps:", locs.length);

for (const loc of locs.slice(0, 3)) {
  const path = new URL(loc).pathname;
  const sm = await getText(path);
  console.log(path, sm.status, "url tags:", (sm.text.match(/<url>/g) || []).length);
  if (sm.status !== 200) process.exit(1);
}
```
I run it like:
`BASE_URL="http://localhost:3000" node scripts/check-sitemaps.mjs`
This caught two dumb bugs for me:
- I accidentally served `text/plain` once.
- I returned an empty sitemap part with a 200.
Both would’ve wasted days if I waited for GSC to complain.
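Since the `text/plain` one bit me, a content-type assertion is the obvious thing to bolt on. A standalone sketch, not part of the script above (the filename is made up):

```js
// scripts/check-content-type.mjs (sketch)
const BASE = process.env.BASE_URL || "http://localhost:3000";

const res = await fetch(`${BASE}/sitemap.xml`);
const ct = res.headers.get("content-type") || "";

if (!ct.includes("application/xml")) {
  console.error(`unexpected content-type: ${ct}`);
  process.exit(1);
}
console.log("content-type ok:", ct);
```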
Results
After shipping sitemap indexing + chunked job sitemaps, I had a clean structure for 8,177 active job URLs and a separate sitemap for hub pages (states + categories). This week’s churn was high — 1,311 new jobs and 2,345 removed — and the 15-minute cache kept the sitemap from drifting too far from reality. I also stopped serving empty sitemap parts as 200s, which cut a bunch of pointless fetches when I fat-finger a part number during testing.
Key takeaways
- Build sitemaps from the DB, not your routes.
- Always use a sitemap index once you’re past a few thousand pages.
- Cap sitemap size below 50,000. I use 45,000.
- 404 missing sitemap parts. Empty 200s waste crawling.
- Validate output with a script before you touch GSC.
Closing
I’m about to add schema markup across 8K+ job pages, but the sitemap work had to come first.
If you’re running dynamic pages at this scale: do you regenerate sitemaps on-demand (like this), or do you precompute them on a cron and store them in object storage?