DEV Community

Cover image for Automate Accounting PDFs from Gmail to Google Drive (No Re-Imports, No Duplicates)
Alexander Schneider
Alexander Schneider

Posted on

Automate Accounting PDFs from Gmail to Google Drive (No Re-Imports, No Duplicates)

If you do bookkeeping once a month, you’ve probably lived this loop:

  • invoices arrive via Gmail (sometimes forwarded, replied-to, or resent)
  • you download the PDFs
  • you store them in Drive
  • you file them during bookkeeping (move/delete/rename)
  • and then automation re-imports the same PDFs again 😩

This post shows a Google Apps Script setup that:

✅ scans your entire Gmail (excluding spam/trash)

✅ exports PDF attachments into Google Drive folders by month (YYYY-MM)

✅ detects duplicates by content hash (MD5)

never re-imports the same email again—even if you move/delete files in Drive

✅ still labels threads in Gmail for visibility (but does not rely on labels for correctness)


The core idea

Relying on a Gmail thread label to skip work is tempting, but it breaks this common scenario:

A thread gets labeled “processed”… then later a new reply comes in with a new PDF attachment.

If you skip by label, you miss that PDF.

Instead, we do two things:

1) Message-ID tracking (stored in PropertiesService)

  • Every Gmail message is processed exactly once.
  • This is what prevents re-imports after you manually move/delete files.

2) MD5 de-duplication per month folder

  • Forwarded invoices and resent PDFs won’t create duplicates in Drive.

We still set a label (pdf-imported) for visibility, but the automation does not depend on it.


What you’ll get in Google Drive

A structure like this:

4_Gmail-Files/
  2026-02/
    2026-02-11__MESSAGEID__A0__Invoice_123.pdf
    2026-02-14__MESSAGEID__A0__Receipt.pdf
  2026-03/
    ...
Enter fullscreen mode Exit fullscreen mode

You can do monthly bookkeeping and manually move/delete files however you like. The importer won’t recreate them.


Step 1: Create a Drive folder

Create a folder in Google Drive (e.g. 4_Gmail-Files) and copy its folder ID from the URL:

https://drive.google.com/drive/folders/<FOLDER_ID>
Enter fullscreen mode Exit fullscreen mode

Step 2: Add the Apps Script

Go to script.google.comNew project → paste this code into Code.gs.

Replace ROOT_FOLDER_ID with your own folder ID.

/**
 * Gmail → PDF attachments → Google Drive monthly folders (YYYY-MM)
 *
 * ✅ scans ALL mail (in:anywhere), excluding spam/trash
 * ✅ catches new replies in old threads via LOOKBACK window
 * ✅ stores PDFs in <ROOT>/<YYYY-MM>/
 * ✅ prevents duplicates:
 *    - message-id tracking (each Gmail message processed once)
 *    - MD5 content de-dupe (identical PDFs per month folder stored once)
 * ✅ Gmail thread label for visibility (not used for skipping)
 */

const ROOT_FOLDER_ID = "PASTE_YOUR_FOLDER_ID_HERE";

// Lookback window: must be large enough to catch “old thread, new reply”
const LOOKBACK_DAYS = 365;

const GMAIL_QUERY =
  `in:anywhere -in:spam -in:trash newer_than:${LOOKBACK_DAYS}d has:attachment filename:pdf`;

// Label threads where we actually saved at least one PDF (visibility only)
const THREAD_LABEL = "pdf-imported";

// Mark thread as read after saving something?
const MARK_THREAD_AS_READ = false;

// How many threads to process per run
const MAX_THREADS_PER_RUN = 50;

// Tracker sizing: large enough so you don’t “forget” processed messages
const MAX_TRACKED_MESSAGES = 30000;

// Must be >= LOOKBACK_DAYS (recommend bigger)
const KEEP_MESSAGE_IDS_DAYS = 900;

function run() {
  const root = DriveApp.getFolderById(ROOT_FOLDER_ID);
  const labelObj = THREAD_LABEL ? getOrCreateGmailLabel_(THREAD_LABEL) : null;

  const tracker = loadMessageTracker_();
  pruneMessageTracker_(tracker);

  const threads = GmailApp.search(GMAIL_QUERY, 0, MAX_THREADS_PER_RUN);
  if (!threads.length) {
    saveMessageTracker_(tracker);
    return;
  }

  // monthKey -> { monthFolder, md5Set }
  const monthCache = new Map();

  for (const thread of threads) {
    const messages = thread.getMessages();
    let savedAnyInThread = false;

    for (const msg of messages) {
      const messageId = msg.getId();

      // Prevent re-imports even if files are moved/deleted in Drive
      if (tracker.ids[messageId]) continue;

      const msgDate = msg.getDate();
      const monthKey = formatMonthKey_(msgDate);
      const { monthFolder, md5Set } = getMonthContext_(root, monthKey, monthCache);

      const attachments = msg.getAttachments({ includeInlineImages: false });

      for (let i = 0; i < attachments.length; i++) {
        const att = attachments[i];

        const originalName = att.getName() || `attachment_${i}.pdf`;
        const isPdf =
          att.getContentType() === "application/pdf" ||
          originalName.toLowerCase().endsWith(".pdf");
        if (!isPdf) continue;

        const blob = att.copyBlob();
        const md5 = md5Hex_(blob);

        // Content de-dupe in the month folder
        if (md5Set.has(md5)) continue;

        const safeName = sanitizeFilename_(originalName);
        const datePrefix = Utilities.formatDate(
          msgDate,
          Session.getScriptTimeZone(),
          "yyyy-MM-dd"
        );

        const uniqueDriveName = `${datePrefix}__${messageId}__A${i}__${safeName}`;

        monthFolder.createFile(blob).setName(uniqueDriveName);
        md5Set.add(md5);

        savedAnyInThread = true;
      }

      // Mark this Gmail message as processed (even if it had no PDFs)
      tracker.ids[messageId] = Date.now();
      tracker.order.push(messageId);
    }

    if (savedAnyInThread) {
      if (labelObj) thread.addLabel(labelObj);
      if (MARK_THREAD_AS_READ) thread.markRead();
    }
  }

  saveMessageTracker_(tracker);
}

// ===== Tracker (PropertiesService) =====

function loadMessageTracker_() {
  const props = PropertiesService.getUserProperties();
  const raw = props.getProperty("PDF_TRACKER_V1");
  if (!raw) return { ids: {}, order: [] };

  try {
    const obj = JSON.parse(raw);
    if (!obj.ids) obj.ids = {};
    if (!obj.order) obj.order = [];
    return obj;
  } catch (e) {
    return { ids: {}, order: [] };
  }
}

function saveMessageTracker_(tracker) {
  pruneMessageTracker_(tracker);
  PropertiesService.getUserProperties().setProperty("PDF_TRACKER_V1", JSON.stringify(tracker));
}

function pruneMessageTracker_(tracker) {
  const cutoff = Date.now() - KEEP_MESSAGE_IDS_DAYS * 24 * 60 * 60 * 1000;

  const newOrder = [];
  for (const id of tracker.order) {
    const ts = tracker.ids[id];
    if (!ts) continue;
    if (ts >= cutoff) newOrder.push(id);
    else delete tracker.ids[id];
  }
  tracker.order = newOrder;

  if (tracker.order.length > MAX_TRACKED_MESSAGES) {
    const removeCount = tracker.order.length - MAX_TRACKED_MESSAGES;
    for (let i = 0; i < removeCount; i++) {
      const id = tracker.order[i];
      delete tracker.ids[id];
    }
    tracker.order = tracker.order.slice(removeCount);
  }
}

// ===== Month context + MD5 =====

function getMonthContext_(rootFolder, monthKey, monthCache) {
  if (monthCache.has(monthKey)) return monthCache.get(monthKey);

  const monthFolder = getOrCreateSubfolder_(rootFolder, monthKey);

  const md5Set = new Set();
  const files = monthFolder.getFiles();
  while (files.hasNext()) {
    const f = files.next();
    const md5 = f.getMd5Checksum();
    if (md5) md5Set.add(String(md5).toLowerCase());
  }

  const ctx = { monthFolder, md5Set };
  monthCache.set(monthKey, ctx);
  return ctx;
}

function md5Hex_(blob) {
  const bytes = blob.getBytes();
  const digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, bytes);
  return digest
    .map(b => {
      const v = (b < 0 ? b + 256 : b);
      return (v < 16 ? "0" : "") + v.toString(16);
    })
    .join("");
}

// ===== Helpers =====

function formatMonthKey_(dateObj) {
  return Utilities.formatDate(dateObj, Session.getScriptTimeZone(), "yyyy-MM");
}

function getOrCreateSubfolder_(parentFolder, name) {
  const it = parentFolder.getFoldersByName(name);
  if (it.hasNext()) return it.next();
  return parentFolder.createFolder(name);
}

function sanitizeFilename_(name) {
  return String(name)
    .replace(/[\/\:*?"<>|]/g, "_")
    .replace(/\s+/g, " ")
    .trim();
}

function getOrCreateGmailLabel_(labelName) {
  let label = GmailApp.getUserLabelByName(labelName);
  if (!label) label = GmailApp.createLabel(labelName);
  return label;
}
Enter fullscreen mode Exit fullscreen mode

Step 3: Authorize the script

In Apps Script:

1) Select function run

2) Click Run

3) Accept permissions (Gmail read + Drive access)


Step 4: Set a trigger (every 6 hours)

1) Click the clock icon (“Triggers”)

2) Add Trigger:

  • Function: run
  • Event source: Time-driven
  • Type: Hour timer
  • Interval: Every 6 hours

Why this works for monthly bookkeeping

When you do bookkeeping, you’ll usually move/delete/rename PDFs inside the month folder.

This automation doesn’t care what happens in Drive afterward.

  • It remembers processed Gmail message IDs.
  • So the same email is never imported again, even if you remove the stored PDF.
  • And it still catches new PDFs posted later in an old thread.

Tuning tips

  • If you get invoices as replies to very old threads, increase LOOKBACK_DAYS.
  • Always keep KEEP_MESSAGE_IDS_DAYS >= LOOKBACK_DAYS (recommended: much bigger).
  • If you have high email volume, increase MAX_THREADS_PER_RUN (and keep an eye on Apps Script quotas).

Happy automating 👋

Top comments (0)