Chudi Nnorukam

Originally published at chudi.dev

Serverless PDF Processing: Why unpdf Beats pdf-parse


It was 2 AM. StatementSync was ready to deploy. I pushed to Vercel and watched the build fail.

Error: Cannot find module 'canvas'
    at Function.Module._resolveFilename

Canvas? I'm processing PDFs, not drawing graphics. Three hours later, I learned why pdf-parse breaks on serverless.

The Problem

pdf-parse is the go-to library for PDF text extraction in Node.js:

import fs from 'fs';
import pdf from 'pdf-parse';

// Read the PDF from disk and extract its text
const dataBuffer = fs.readFileSync('statement.pdf');
const data = await pdf(dataBuffer);
console.log(data.text);

Works perfectly locally. Crashes spectacularly on Vercel.

Why It Fails

pdf-parse is built on PDF.js, Mozilla's PDF engine, which is published on npm as pdfjs-dist. pdfjs-dist has optional dependencies:

{
  "optionalDependencies": {
    "canvas": "^2.x",
    "node-fetch": "^2.x"
  }
}

Canvas is a native module that requires:

  • Python
  • node-gyp
  • C++ build tools

Vercel's serverless runtime doesn't have these. The build either:

  1. Fails outright with missing module errors
  2. Succeeds but crashes at runtime with segfaults

The second case is worse: the build passes, the function crashes the first time it processes a PDF, and you discover the problem in production rather than at deploy time.
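
One way to surface a missing module earlier is to probe for it when the function starts, so the failure shows up in the first log line instead of mid-request. The sketch below is an illustration under assumptions, not part of pdf-parse or Vercel: `canLoadModule` and `warnIfMissing` are hypothetical helpers, and they only report whether a module can be imported in the current runtime.

```typescript
// Hypothetical helper: reports whether a module can be loaded in this runtime.
async function canLoadModule(name: string): Promise<boolean> {
  try {
    await import(name);
    return true;
  } catch {
    // Missing package or failed native binding: treat both as "not loadable"
    return false;
  }
}

// Hypothetical helper: collect and log every module that fails to load.
async function warnIfMissing(names: string[]): Promise<string[]> {
  const missing: string[] = [];
  for (const name of names) {
    if (!(await canLoadModule(name))) missing.push(name);
  }
  if (missing.length > 0) {
    console.warn(`Modules not loadable: ${missing.join(', ')}`);
  }
  return missing;
}
```

Calling `warnIfMissing(['canvas'])` at cold start would log a warning when canvas cannot be imported. A probe like this catches load errors only; it cannot catch a native crash such as a segfault, and some bundlers need extra configuration for dynamic imports with a variable name.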

The Debugging Journey

Attempt 1: Exclude Canvas

"Just mark canvas as external," Stack Overflow said.

// next.config.js
module.exports = {
  webpack: (config) => {
    config.externals = [...(config.externals || []), 'canvas'];
    return config;
  },
};

Result: Different error.

Error: Could not load the "canvas" module

pdfjs-dist tries to load canvas at runtime, not just build time.

Attempt 2: Legacy Build

"Use pdf-parse legacy mode," another answer suggested.

const pdf = require('pdf-parse/lib/pdf-parse');

Result: Still fails. The dependency chain remains.

Attempt 3: pdfjs-dist Directly

"Skip pdf-parse, use pdfjs-dist with worker disabled."

import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '';

const pdf = await pdfjsLib.getDocument({ data: buffer }).promise;

Result: Works locally, memory errors on Vercel.

On the free tier, Vercel functions have a 1GB memory limit, and pdfjs-dist's memory usage is unpredictable with large PDFs.

The Solution: unpdf

After three hours, I found unpdf:

import { getDocumentProxy, extractText } from 'unpdf';

// PDF.js expects a plain Uint8Array rather than a Node Buffer
const pdf = await getDocumentProxy(new Uint8Array(buffer));
const { text } = await extractText(pdf, { mergePages: true });

Result: Works. First try.

Why unpdf Works

unpdf is built specifically for serverless:

| Feature | pdf-parse | unpdf |
| --- | --- | --- |
| Native deps | Yes (canvas) | No |
| Vercel compatible | No | Yes |
| Edge runtime | No | Yes |
| Bundle size | Large | Small |
| Memory usage | Unpredictable | Controlled |

The library ships a serverless build of PDF.js that is pure JavaScript, with no native modules. No build-time compilation, no runtime loading issues.

Implementation

Here's the complete pattern for serverless PDF processing:

import { getDocumentProxy, extractText } from 'unpdf';

interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

async function processPdf(buffer: Buffer): Promise<Transaction[]> {
  // Load PDF (PDF.js expects a plain Uint8Array, not a Node Buffer)
  const pdf = await getDocumentProxy(new Uint8Array(buffer));

  // Extract text (by default, one string per page)
  const { text } = await extractText(pdf);

  // Parse transactions (pattern-based for bank statements)
  const transactions = parseTransactions(text.join('\n'));

  // Cleanup
  await pdf.destroy();

  return transactions;
}

function parseTransactions(text: string): Transaction[] {
  // Bank-specific parsing patterns
  const lines = text.split('\n');
  const transactions: Transaction[] = [];

  for (const line of lines) {
    const match = line.match(/(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/);
    if (match) {
      transactions.push({
        date: match[1],
        description: match[2].trim(),
        amount: parseFloat(match[3].replace(/[$,]/g, '')),
        type: match[3].startsWith('-') ? 'debit' : 'credit'
      });
    }
  }

  return transactions;
}
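
To make the regex concrete, here is a small self-contained run against made-up statement text. The sample lines and the `parseLine` helper are illustrative assumptions; real statements vary by bank, so treat the pattern as a starting point rather than a general parser.

```typescript
interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

// Same pattern as in parseTransactions: MM/DD, description, then a dollar amount
const LINE_PATTERN = /(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/;

// Hypothetical helper: parse one line, or return null if it is not a transaction
function parseLine(line: string): Transaction | null {
  const match = line.match(LINE_PATTERN);
  if (!match) return null;
  return {
    date: match[1],
    description: match[2].trim(),
    amount: parseFloat(match[3].replace(/[$,]/g, '')),
    type: match[3].startsWith('-') ? 'debit' : 'credit',
  };
}

// Made-up sample text, not taken from a real statement
const sample = [
  'Opening balance as of 01/01',
  '01/15  COFFEE SHOP  -$4.50',
  '01/16  PAYROLL DEPOSIT  $2,500.00',
].join('\n');

const parsed = sample
  .split('\n')
  .map(parseLine)
  .filter((t): t is Transaction => t !== null);

console.log(parsed);
```

The header line is skipped because nothing follows its date, and the two transaction lines parse to a debit of -4.5 and a credit of 2500. Note that debit amounts keep their negative sign.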

Performance

On Vercel's free tier (1GB memory, 10s timeout):

| PDF Size | Processing Time | Memory Used |
| --- | --- | --- |
| 1 page | 1-2 seconds | ~100MB |
| 5 pages | 3-4 seconds | ~200MB |
| 10 pages | 5-6 seconds | ~350MB |
| 20 pages | 8-9 seconds | ~500MB |

Comfortable margins for typical bank statements (1-5 pages). At 20 pages, the 8-9 second processing time runs close to the 10s timeout.
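
Figures like these depend on the documents and the runtime, so it is worth measuring your own. A minimal sketch, assuming Node.js: the `measure` helper is hypothetical, and the RSS delta is only a rough indicator because garbage collection can run at any point.

```typescript
interface Measurement<T> {
  result: T;
  elapsedMs: number;
  rssDeltaMb: number;
}

// Hypothetical helper: time an async task and report the change in resident memory
async function measure<T>(task: () => Promise<T>): Promise<Measurement<T>> {
  const rssBefore = process.memoryUsage().rss;
  const start = Date.now();

  const result = await task();

  const elapsedMs = Date.now() - start;
  const rssDeltaMb = (process.memoryUsage().rss - rssBefore) / (1024 * 1024);
  return { result, elapsedMs, rssDeltaMb };
}
```

Wrapping the call, for example `await measure(() => processPdf(buffer))`, gives an elapsed time and memory delta you can log per request and compare against your plan's limits.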

Always call pdf.destroy() after processing. unpdf holds the document in memory until explicitly released.
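
Cleanup is easy to skip when parsing throws halfway through. One way to guarantee it is a small wrapper built on try/finally. This is a generic sketch: `withDocument` and the `Destroyable` interface are hypothetical names, and the only assumption about the document object is that it exposes a `destroy()` method.

```typescript
// Anything with a destroy() method, such as a loaded PDF document
interface Destroyable {
  destroy(): Promise<void> | void;
}

// Hypothetical helper: open a document, use it, and always release it afterwards
async function withDocument<D extends Destroyable, T>(
  open: () => Promise<D>,
  use: (doc: D) => Promise<T>,
): Promise<T> {
  const doc = await open();
  try {
    return await use(doc);
  } finally {
    // Runs on success and on error, so the document is never left in memory
    await doc.destroy();
  }
}
```

The trade-off is one extra level of nesting at the call site in exchange for not having to remember the cleanup call on every error path.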

Pattern-Based vs LLM Extraction

For structured documents like bank statements, pattern-based extraction beats LLM extraction on cost and speed:

| Approach | Accuracy | Cost per statement | Speed |
| --- | --- | --- | --- |
| Pattern-based | 99% | $0 | 3-5s |
| LLM (GPT-5) | 99.5% | $0.01-0.05 | 10-30s |
| OCR + LLM | 95% | $0.02-0.08 | 15-45s |

For StatementSync processing 1000 statements/month:

  • Pattern-based: $0
  • LLM: $10-50/month

The 0.5% accuracy difference doesn't justify the cost for this use case.
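
The monthly figures follow directly from the per-statement costs. As a quick check of the arithmetic, here is a sketch that works in cents to avoid floating-point rounding; `monthlyCostUsd` is a hypothetical helper, and the inputs are the numbers quoted above.

```typescript
// Hypothetical helper: monthly cost in dollars from a per-statement cost in cents
function monthlyCostUsd(statementsPerMonth: number, centsPerStatement: number): number {
  return (statementsPerMonth * centsPerStatement) / 100;
}

const statements = 1000;
const patternBased = monthlyCostUsd(statements, 0); // $0 per statement
const llmLow = monthlyCostUsd(statements, 1);       // $0.01 per statement
const llmHigh = monthlyCostUsd(statements, 5);      // $0.05 per statement

console.log(`Pattern-based: $${patternBased}/month`); // Pattern-based: $0/month
console.log(`LLM: $${llmLow}-$${llmHigh}/month`);     // LLM: $10-$50/month
```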

When to Use What

Use unpdf when:

  • Deploying to Vercel, Netlify, or Cloudflare
  • Processing structured documents (statements, invoices)
  • Need low memory footprint
  • Running on edge runtimes

Use pdf-parse when:

  • Running on traditional servers (EC2, DigitalOcean)
  • Need advanced PDF features (annotations, forms)
  • Have native build tools available

Use LLM extraction when:

  • Documents are unstructured or variable
  • Accuracy is more important than cost
  • Processing low volumes

The Lesson

The right library matters more than clever workarounds. I spent 3 hours trying to make pdf-parse work on serverless. unpdf worked in 10 minutes.

If you're building PDF processing for serverless, start with unpdf. Save yourself the 2 AM debugging.


Related: From Pain Point to MVP: StatementSync in One Week | Portfolio: StatementSync
