- I generate sitemap indexes for 8,177 job pages.
- I cap each sitemap at 45,000 URLs.
- I prioritize state + category pages first.
- I ship it in Next.js 14 with Supabase.
Context
I’m building a PMHNP job board. Next.js 14. Supabase Postgres. Vercel.
This week my data looked like this:
- 8,177 active jobs
- 2,004 companies
- 1,311 new jobs added
- 2,345 jobs expired/removed (net -1,034)
That churn matters.
If I don’t keep sitemaps fresh, Google crawls garbage. Expired URLs linger. New URLs show up late.
I tried “just let internal links handle it”. Brutal.
GSC showed a bunch of URLs stuck in “Discovered — currently not indexed”. I needed a boring, repeatable pipeline: generate sitemaps from the DB, ship them from Next.js, and make sure they don’t explode when I have 8K pages today and 80K later.
1) I start with the DB query. Not my router.
I don’t build sitemaps from the filesystem.
My job pages are database-backed. So the sitemap is database-backed too.
I store a slug and updated_at for each job. That’s enough.
Here’s the exact query I use from the sitemap route.
```ts
// lib/sitemaps/jobs.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  // Service role key. Server-only.
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export type JobSitemapRow = {
  slug: string;
  updated_at: string;
};

export async function fetchActiveJobSlugs(limit: number, offset: number) {
  const { data, error } = await supabase
    .from("jobs")
    .select("slug, updated_at")
    .eq("status", "active")
    .order("updated_at", { ascending: false })
    .range(offset, offset + limit - 1);

  if (error) throw new Error(`fetchActiveJobSlugs: ${error.message}`);
  return (data ?? []) as JobSitemapRow[];
}

export async function countActiveJobs() {
  const { count, error } = await supabase
    .from("jobs")
    .select("id", { count: "exact", head: true })
    .eq("status", "active");

  if (error) throw new Error(`countActiveJobs: ${error.message}`);
  return count ?? 0;
}
```
Two notes.
First: SUPABASE_SERVICE_ROLE_KEY never hits the client. This runs in a route handler only.
Second: I order by updated_at so the freshest jobs get crawled earlier. When I add 1,311 jobs in a week, I want those URLs discovered first.
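One caveat worth flagging: Supabase's Data API has a "Max rows" setting (1,000 by default) that caps every response, so a single `.range()` call can quietly return fewer rows than `JOB_SITEMAP_SIZE`. If your project still has the default cap, batching the same query is one way around it. A minimal sketch built on the file above; `fetchActiveJobSlugsBatched` and `BATCH` are names I'm making up here:

```ts
// lib/sitemaps/jobs.ts (optional sketch — only needed if the Data API
// "Max rows" cap is below JOB_SITEMAP_SIZE)
const BATCH = 1_000; // assumption: at or below the project's Max rows setting

export async function fetchActiveJobSlugsBatched(limit: number, offset: number) {
  const rows: JobSitemapRow[] = [];
  for (let start = offset; start < offset + limit; start += BATCH) {
    const size = Math.min(BATCH, offset + limit - start);
    const batch = await fetchActiveJobSlugs(size, start);
    rows.push(...batch);
    if (batch.length < size) break; // no more active jobs past this point
  }
  return rows;
}
```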
2) I chunk sitemaps. Because Google has limits.
A sitemap can’t be infinite.
Hard limit: 50,000 URLs per sitemap.
I don’t go near the edge. I use 45,000.
So I serve:
- `/sitemap.xml` as an index
- `/sitemaps/jobs/0.xml`, `/sitemaps/jobs/1.xml`, ... for job URLs
I learned the hard way that a single giant sitemap becomes a maintenance trap. You’ll want to add more URL types later (states, cities, categories). Indexing is the clean escape hatch.
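The chunk math is just `Math.ceil`, the same formula the index route uses below. A quick sanity check with my numbers:

```ts
// parts per the formula in the index route below
const JOB_SITEMAP_SIZE = 45_000;
console.log(Math.max(1, Math.ceil(8_177 / JOB_SITEMAP_SIZE)));  // 1 part this week
console.log(Math.max(1, Math.ceil(80_000 / JOB_SITEMAP_SIZE))); // 2 parts at ~80K pages
```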
This is my index route.
```ts
// app/sitemap.xml/route.ts
import { NextResponse } from "next/server";
import { countActiveJobs } from "@/lib/sitemaps/jobs";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!; // e.g. https://example.com
const JOB_SITEMAP_SIZE = 45_000;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET() {
  const total = await countActiveJobs();
  const parts = Math.max(1, Math.ceil(total / JOB_SITEMAP_SIZE));
  const now = new Date().toISOString();

  const urls = Array.from({ length: parts }, (_, i) => {
    const loc = `${BASE_URL}/sitemaps/jobs/${i}.xml`;
    return `\n  <sitemap>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${now}</lastmod>\n  </sitemap>`;
  }).join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urls}
</sitemapindex>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      // I keep this short because jobs churn.
      "cache-control": "public, s-maxage=900, stale-while-revalidate=86400",
    },
  });
}
```
That cache header matters.
My listings expire in batches. This week 2,345 got removed.
If I cache for a day, I’m literally telling crawlers about dead pages for a day. Not catastrophic. But it stacks.
15 minutes (s-maxage=900) feels right.
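Rough math behind that, using this week's churn and assuming removals are spread evenly through the day:

```ts
// how stale a cached sitemap can get before it refreshes
const removedPerDay = 2_345 / 7;             // ≈ 335 dead URLs per day
console.log(Math.round(removedPerDay));      // up to ≈ 335 stale entries with a 24h cache
console.log(Math.round(removedPerDay / 96)); // up to ≈ 3 stale entries with s-maxage=900 (96 windows/day)
```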
3) I generate each job sitemap with a route param.
Now the meat.
A job sitemap is just a paginated query + XML.
I keep `<changefreq>` and `<priority>` out. Google ignores them half the time.
I do include `<lastmod>`. That one’s real.
```ts
// app/sitemaps/jobs/[part]/route.ts
import { NextResponse } from "next/server";
import { fetchActiveJobSlugs } from "@/lib/sitemaps/jobs";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!;
const JOB_SITEMAP_SIZE = 45_000;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET(
  _req: Request,
  ctx: { params: Promise<{ part: string }> }
) {
  const { part } = await ctx.params;
  // The index links to /sitemaps/jobs/0.xml, so the segment arrives as "0.xml".
  // Strip the extension before parsing the part number.
  const page = Number(part.replace(/\.xml$/, ""));

  if (!Number.isFinite(page) || page < 0) {
    return new NextResponse("Not found", { status: 404 });
  }

  const offset = page * JOB_SITEMAP_SIZE;
  const rows = await fetchActiveJobSlugs(JOB_SITEMAP_SIZE, offset);

  // If someone requests /sitemaps/jobs/999.xml, don't return an empty 200.
  if (rows.length === 0) {
    return new NextResponse("Not found", { status: 404 });
  }

  const urlset = rows
    .map((r) => {
      const loc = `${BASE_URL}/jobs/${encodeURIComponent(r.slug)}`;
      const lastmod = new Date(r.updated_at).toISOString();
      return `\n  <url>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </url>`;
    })
    .join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urlset}
</urlset>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      "cache-control": "public, s-maxage=900, stale-while-revalidate=86400",
    },
  });
}
```
The “don’t return an empty 200” thing cost me time.
I shipped empty sitemap parts as valid XML. Google happily fetched them. Wasted crawl budget. And it looked like “it’s working” because there were no errors.
Now I 404 them.
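A quick spot check I can run from a scratch `.mjs` file against a dev server (hypothetical part numbers):

```ts
// in-range parts should 200, out-of-range parts should 404
const ok = await fetch("http://localhost:3000/sitemaps/jobs/0.xml");
const missing = await fetch("http://localhost:3000/sitemaps/jobs/999.xml");
console.log(ok.status, missing.status); // expect: 200 404
```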
4) I include non-job pages. And I order them.
If you only submit job URLs, you’re missing the pages that help discovery.
On my board, state and category pages are the real hubs:
- States: Washington (197), California (159), Massachusetts (131), New York (117), Colorado (74)
- Categories I added: Telehealth, Per Diem (only 30 listings), New Grad (only 29 listings), Travel/Locum
I want crawlers to hit those early.
So I put them in a small “static-ish” sitemap.
```ts
// app/sitemaps/pages.xml/route.ts
import { NextResponse } from "next/server";

export const runtime = "nodejs";

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL!;

function xmlEscape(s: string) {
  return s
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

export async function GET() {
  const now = new Date().toISOString();

  // Keep this list small and intentional.
  const urls = [
    "/",
    "/states",
    "/categories/telehealth",
    "/categories/per-diem",
    "/categories/new-grad",
    "/categories/travel-locum",
    "/states/washington",
    "/states/california",
    "/states/massachusetts",
    "/states/new-york",
    "/states/colorado",
  ];

  const urlset = urls
    .map((path) => {
      const loc = `${BASE_URL}${path}`;
      return `\n  <url>\n    <loc>${xmlEscape(loc)}</loc>\n    <lastmod>${now}</lastmod>\n  </url>`;
    })
    .join("");

  const body = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urlset}
</urlset>`;

  return new NextResponse(body, {
    headers: {
      "content-type": "application/xml; charset=utf-8",
      "cache-control": "public, s-maxage=3600, stale-while-revalidate=86400",
    },
  });
}
```
Then I add it to the sitemap index.
I didn’t want a second index.
So I just append one more `<sitemap>` entry.
```ts
// app/sitemap.xml/route.ts (add this near the top)
// ...existing imports

// inside GET(), after computing `now`
const extra = `\n  <sitemap>\n    <loc>${BASE_URL}/sitemaps/pages.xml</loc>\n    <lastmod>${now}</lastmod>\n  </sitemap>`;

// and include it in the final <sitemapindex> body
// ...job parts... ${extra}
```
Simple.
And it’s honest.
Those pages don’t change every 15 minutes, so I cache them for an hour.
5) I validate the sitemap output with a script. Every time.
I don’t trust my eyes.
I run a quick Node script locally to:
- fetch `/sitemap.xml`
- parse sitemap locations
- fetch a couple sitemap parts
- sanity-check: “Do these contain `<url>` tags?”
No fancy XML parser. Just enough.
```js
// scripts/check-sitemaps.mjs
const BASE = process.env.BASE_URL || "http://localhost:3000";

async function getText(path) {
  const res = await fetch(`${BASE}${path}`, {
    headers: { "user-agent": "sitemap-check/1.0" },
  });
  const text = await res.text();
  return { status: res.status, text };
}

function extractLocs(xml) {
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
}

const index = await getText("/sitemap.xml");
console.log("/sitemap.xml", index.status);
if (index.status !== 200) process.exit(1);

const locs = extractLocs(index.text);
console.log("sitemaps:", locs.length);

for (const loc of locs.slice(0, 3)) {
  const path = new URL(loc).pathname;
  const sm = await getText(path);
  console.log(path, sm.status, "url tags:", (sm.text.match(/<url>/g) || []).length);
  if (sm.status !== 200) process.exit(1);
}
```
I run it like:
`BASE_URL="http://localhost:3000" node scripts/check-sitemaps.mjs`
This caught two dumb bugs for me:
- I accidentally served `text/plain` once.
- I returned an empty sitemap part with a 200.
Both would’ve wasted days if I waited for GSC to complain.
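Since the `text/plain` one bit me, a content-type assertion is the obvious thing to bolt on. A standalone sketch, not part of the script above (the filename is made up):

```js
// scripts/check-content-type.mjs (sketch)
const BASE = process.env.BASE_URL || "http://localhost:3000";

const res = await fetch(`${BASE}/sitemap.xml`);
const ct = res.headers.get("content-type") || "";

if (!ct.includes("application/xml")) {
  console.error(`unexpected content-type: ${ct}`);
  process.exit(1);
}
console.log("content-type ok:", ct);
```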
Results
After shipping sitemap indexing + chunked job sitemaps, I had a clean structure for 8,177 active job URLs and a separate sitemap for hub pages (states + categories). This week’s churn was high — 1,311 new jobs and 2,345 removed — and the 15-minute cache kept the sitemap from drifting too far from reality. I also stopped serving empty sitemap parts as 200s, which cut a bunch of pointless fetches when I fat-finger a part number during testing.
Key takeaways
- Build sitemaps from the DB, not your routes.
- Always use a sitemap index once you’re past a few thousand pages.
- Cap sitemap size below 50,000. I use 45,000.
- 404 missing sitemap parts. Empty 200s waste crawling.
- Validate output with a script before you touch GSC.
Closing
I’m about to add schema markup across 8K+ job pages, but the sitemap work had to come first.
If you’re running dynamic pages at this scale: do you regenerate sitemaps on-demand (like this), or do you precompute them on a cron and store them in object storage?