Wahyu Tricahyo

MongoDB Advanced Techniques: Going Beyond the Basics

This post covers the MongoDB patterns and strategies I reach for most often when building applications that need to handle real traffic. Aggregation pipelines, schema design patterns with working examples, index strategies that survive production, and the operational features that tie it all together.

What's covered:

- The aggregation pipeline: $facet and $lookup with sub-pipelines
- Schema design patterns: bucket, computed, and outlier, plus native time-series collections
- Index strategies: the ESR rule, partial indexes, and wildcard indexes
- Change streams and resume tokens
- Multi-document transactions, read/write concerns, and retry logic
- Reading explain() output

The Aggregation Pipeline: Your Most Powerful Tool

The aggregation pipeline is where MongoDB truly shines. Think of it as a series of stages that your data flows through, each one transforming it along the way. While most developers know $match and $group, the real power comes from combining lesser-known stages.
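
For a baseline, here's the familiar shape - a handful of stages chained together, each feeding the next (the collection and field names are purely illustrative):

db.orders.aggregate([
  { $match: { status: "completed" } },                            // filter early so later stages see less data
  { $group: { _id: "$customerId", spent: { $sum: "$total" } } },  // summarize per customer
  { $sort: { spent: -1 } },                                       // order by the computed total
  { $limit: 10 }                                                  // keep the top ten
]);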

$facet lets you run multiple aggregation pipelines in parallel on the same set of input documents. This is incredibly useful when you need to compute different summaries from the same dataset - say, a total count, a breakdown by category, and an average price - all in a single query instead of three separate round trips.

db.products.aggregate([
  { $facet: {
    totalCount: [{ $count: "count" }],
    byCategory: [{ $group: { _id: "$category", total: { $sum: 1 } } }],
    priceStats: [{ $group: { _id: null, avgPrice: { $avg: "$price" }, maxPrice: { $max: "$price" } } }]
  }}
]);

$lookup with pipelines is another game-changer. Instead of a basic left outer join, you can embed a full sub-pipeline inside the lookup. This means you can filter, project, and even sort the joined documents before they're merged into your results. It's the closest thing MongoDB has to a SQL subquery, and it's incredibly flexible.

db.orders.aggregate([
  { $lookup: {
    from: "products",
    let: { productIds: "$items.productId" },
    pipeline: [
      { $match: { $expr: { $in: ["$_id", "$$productIds"] } } },
      { $project: { name: 1, price: 1 } }
    ],
    as: "productDetails"
  }}
]);

Schema Design Patterns That Scale

MongoDB's flexibility with schema is a double-edged sword. Without discipline, your database can become a mess. Here are patterns that hold up under pressure.

The Bucket Pattern works wonders for time-series data. Instead of storing one document per event, you group events into buckets - say, one document per hour or per day. This reduces the total document count, makes range queries faster, and plays nicely with indexes. If you're logging sensor data, analytics events, or user activity, this pattern can cut your storage and query times dramatically.

// Instead of one document per sensor reading (millions of docs):
// { sensorId: "temp-01", value: 22.5, timestamp: ISODate("2025-02-14T10:00:01Z") }
// { sensorId: "temp-01", value: 22.7, timestamp: ISODate("2025-02-14T10:00:02Z") }

// Bucket them into hourly documents:
{
  sensorId: "temp-01",
  date: ISODate("2025-02-14T10:00:00Z"),
  count: 120,
  readings: [
    { value: 22.5, timestamp: ISODate("2025-02-14T10:00:01Z") },
    { value: 22.7, timestamp: ISODate("2025-02-14T10:00:02Z") },
    // ... up to a cap (e.g., 200 readings per bucket)
  ],
  summary: { min: 21.8, max: 23.4, avg: 22.6 }
}

// Insert a new reading into the current bucket
db.sensorData.updateOne(
  { sensorId: "temp-01", date: ISODate("2025-02-14T10:00:00Z"), count: { $lt: 200 } },
  {
    $push: { readings: { value: 22.9, timestamp: new Date() } },
    $inc: { count: 1 },
    $min: { "summary.min": 22.9 },
    $max: { "summary.max": 22.9 },
  },
  { upsert: true }
);

// Query an entire day with just 24 documents instead of thousands
db.sensorData.find({
  sensorId: "temp-01",
  date: { $gte: ISODate("2025-02-14"), $lt: ISODate("2025-02-15") }
});

The Computed Pattern is all about precomputing values that are expensive to calculate on the fly. If your application frequently displays a product's average rating, don't aggregate all reviews every time. Instead, update a ratingSummary field whenever a new review is added. You trade slightly more work on writes for significantly faster reads.

// Product document with precomputed rating data
{
  _id: ObjectId("..."),
  name: "Wireless Headphones",
  price: 79.99,
  ratingSummary: {
    average: 4.3,
    count: 256,
    total: 1100,         // sum of all ratings, used to recalculate average
    distribution: { 1: 8, 2: 12, 3: 30, 4: 86, 5: 120 }
  }
}

// When a new 5-star review comes in, recalculate the average atomically
// using a pipeline-style update (note the array wrapper - this is required
// to use aggregation expressions like $divide and $add inside updateOne)
db.products.updateOne(
  { _id: productId },
  [
    {
      $set: {
        "ratingSummary.count": { $add: ["$ratingSummary.count", 1] },
        "ratingSummary.total": { $add: ["$ratingSummary.total", 5] },
        "ratingSummary.distribution.5": { $add: ["$ratingSummary.distribution.5", 1] },
        "ratingSummary.average": {
          $round: [
            { $divide: [{ $add: ["$ratingSummary.total", 5] }, { $add: ["$ratingSummary.count", 1] }] },
            1
          ]
        }
      }
    }
  ]
);

// Now displaying a product page requires zero aggregation - just read the document
db.products.findOne({ _id: productId }, { name: 1, price: 1, ratingSummary: 1 });

The key detail here is the array wrapper around the update document. A regular update ({ $inc: ..., $set: ... }) doesn't support aggregation operators like $divide or $add. Wrapping the update in an array tells MongoDB to treat it as an aggregation pipeline, which unlocks the full expression language. Without the array, those operator-looking objects aren't evaluated - depending on server version they're either rejected or stored as literal values.
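
For contrast, if you only needed the counters and were happy to recompute the average elsewhere (say, at read time), classic update operators would do the job with no array wrapper - a hypothetical simpler variant:

db.products.updateOne(
  { _id: productId },
  {
    $inc: {
      "ratingSummary.count": 1,
      "ratingSummary.total": 5,
      "ratingSummary.distribution.5": 1
    }
  }
);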

The Outlier Pattern handles the reality that not all documents are created equal. Most users might have 10 orders, but a handful might have 10,000. Embedding all orders in the user document works for the majority, but blows up for the outliers. The solution: embed up to a threshold, then overflow into a separate collection with a reference back to the parent.

// User document with embedded orders (works for 99% of users)
{
  _id: "user:1001",
  name: "Alice",
  hasOverflow: false,
  orders: [
    { orderId: "ord-001", total: 49.99, date: ISODate("2025-01-15") },
    { orderId: "ord-002", total: 129.00, date: ISODate("2025-02-01") },
    // embedded up to a threshold, say 50 orders
  ]
}

// When a user exceeds the threshold, flip the flag and overflow to a separate collection
const MAX_EMBEDDED = 50;

async function addOrder(userId, order) {
  const result = await db.users.updateOne(
    { _id: userId, hasOverflow: false, [`orders.${MAX_EMBEDDED - 1}`]: { $exists: false } },
    { $push: { orders: order } }
  );

  // No match: the user is at the embed threshold (or has already overflowed)
  if (result.matchedCount === 0) {
    // Mark the user as overflowed
    await db.users.updateOne({ _id: userId }, { $set: { hasOverflow: true } });
    // Store in the overflow collection
    await db.orderOverflow.insertOne({ userId, ...order });
  }
}

// Note: there's a small race window between the failed update and setting
// hasOverflow - see the note after this snippet for why that's usually fine

// Querying: check the flag to decide where to look
async function getUserOrders(userId) {
  const user = await db.users.findOne({ _id: userId });

  if (!user.hasOverflow) {
    return user.orders; // Fast path: everything is embedded
  }

  // Combine embedded orders with overflow
  const overflowOrders = await db.orderOverflow.find({ userId }).toArray();
  return [...user.orders, ...overflowOrders];
}

A note on the addOrder function: there's a subtle race condition between checking matchedCount === 0 and setting hasOverflow: true. If two writes land at the same time when the user is at exactly the threshold, both could fail the initial updateOne, and one could insert into the overflow collection before the flag is set. In practice this is rarely an issue since hitting the exact threshold concurrently is uncommon, and the hasOverflow flag is only used as a read-path optimization, not a consistency guarantee. If you need stronger guarantees, wrapping the overflow transition in a transaction eliminates the race entirely.
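
For completeness, here's a rough sketch of that transactional variant - assuming the Node.js driver, a connected MongoClient named client, and the same db.users / db.orderOverflow handles and MAX_EMBEDDED constant as above. withTransaction handles commit, abort, and retrying transient errors for you:

async function addOrderTransactional(userId, order) {
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      // Read and write through the same session so the length check and the
      // overflow decision happen inside one transaction
      const user = await db.users.findOne({ _id: userId }, { session });
      const embeddedCount = user?.orders?.length ?? 0;

      if (!user?.hasOverflow && embeddedCount < MAX_EMBEDDED) {
        // Still under the threshold: keep embedding
        await db.users.updateOne({ _id: userId }, { $push: { orders: order } }, { session });
      } else {
        // At or past the threshold: flip the flag and overflow atomically
        await db.users.updateOne({ _id: userId }, { $set: { hasOverflow: true } }, { session });
        await db.orderOverflow.insertOne({ userId, ...order }, { session });
      }
    });
  } finally {
    await session.endSession();
  }
}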

Time-Series Collections: The Native Approach

If you're considering the bucket pattern for time-series data, it's worth knowing that MongoDB 5.0+ provides native time-series collections. These handle bucketing, compression, and optimized storage automatically under the hood, so you don't have to manage bucket boundaries or summary fields yourself.

// Create a time-series collection
db.createCollection("sensorReadings", {
  timeseries: {
    timeField: "timestamp",       // required: the field containing the datetime
    metaField: "sensorId",        // optional: groups measurements from the same source
    granularity: "minutes"        // hint for internal bucketing: "seconds", "minutes", or "hours"
  },
  expireAfterSeconds: 2592000     // optional: auto-delete data older than 30 days
});

// Inserts look like regular documents - no manual bucketing needed
db.sensorReadings.insertMany([
  { sensorId: "temp-01", timestamp: new Date(), value: 22.5, unit: "celsius" },
  { sensorId: "temp-01", timestamp: new Date(), value: 22.7, unit: "celsius" },
  { sensorId: "humidity-01", timestamp: new Date(), value: 65.2, unit: "percent" },
]);

// Queries and aggregations work as expected
db.sensorReadings.aggregate([
  { $match: { sensorId: "temp-01", timestamp: { $gte: ISODate("2025-02-14"), $lt: ISODate("2025-02-15") } } },
  { $group: { _id: { $hour: "$timestamp" }, avgTemp: { $avg: "$value" }, count: { $sum: 1 } } },
  { $sort: { _id: 1 } }
]);

The manual bucket pattern still has its place - it gives you full control over bucket size, lets you store precomputed summaries, and works on older MongoDB versions. But for most time-series workloads on MongoDB 5.0+, the native collections are less code, better compression, and one fewer thing to maintain.

Index Strategies That Actually Matter

Indexes are easy to create and easy to get wrong. Here's what I've learned the hard way.

Compound index order matters more than you think. The ESR rule - Equality, Sort, Range - should guide how you order fields in a compound index. Fields used in equality matches come first, followed by sort fields, then range conditions. Getting this order wrong can be the difference between a millisecond index scan and a blocking in-memory sort over far more documents than you ever return.

// If your query is: { status: "active", createdAt: { $gte: lastWeek } } sorted by { score: -1 }
// The optimal index is:
db.posts.createIndex({ status: 1, score: -1, createdAt: 1 });
// Equality (status) -> Sort (score) -> Range (createdAt)

Partial indexes save you space and speed by only indexing documents that match a filter expression. If you only query active users, there's no reason to index the inactive ones. This is especially valuable in large collections where a significant portion of documents are irrelevant to your hot queries.

db.users.createIndex(
  { email: 1 },
  { partialFilterExpression: { isActive: true } }
);
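
One gotcha with partial indexes: the planner only uses them when the query itself guarantees the filter condition, so the predicate has to appear in your query too (the email value is just an example):

// Can use the partial index - the query includes isActive: true
db.users.find({ email: "alice@example.com", isActive: true });

// Cannot use it - nothing here guarantees the documents are active
db.users.find({ email: "alice@example.com" });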

Wildcard indexes are useful when you have dynamic or unpredictable field names - common in IoT data or user-defined metadata. Rather than creating an index for every possible field, a wildcard index covers them all.

// Index all fields inside the metadata subdocument
db.devices.createIndex({ "metadata.$**": 1 });

// Query any dynamic field under metadata
db.devices.find({ "metadata.firmware_version": "2.1.0" });
db.devices.find({ "metadata.region": "us-east-1" });

There are real tradeoffs to understand before using them. Every write to a field under the wildcard path creates or updates index entries for each matching field, so write-heavy workloads will see higher I/O and CPU overhead compared to a targeted compound index. Wildcard indexes also consume more memory since they maintain entries across all indexed field paths. Most importantly, a single wildcard index can only be used for one field per query plan - it can serve a filter on metadata.region or metadata.firmware_version individually, but it can't index both predicates in the same query. If you know your access patterns, a compound index on the specific fields will almost always outperform a wildcard. Reserve wildcards for the genuinely unpredictable cases where you can't anticipate which fields will be queried.
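
For example, if it turns out that region and firmware version are in fact always queried together, a targeted compound index on those two metadata fields (reusing the field names from the example above) is the better tool:

db.devices.createIndex({ "metadata.region": 1, "metadata.firmware_version": 1 });

// Both predicates are now satisfied by a single index scan
db.devices.find({ "metadata.region": "us-east-1", "metadata.firmware_version": "2.1.0" });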

Change Streams: Real-Time Without the Complexity

Change streams let you watch for real-time changes to your data without polling. They sit on top of MongoDB's oplog and give you a cursor that emits events whenever documents are inserted, updated, replaced, or deleted.

The practical applications are everywhere: syncing data to a search engine, triggering notifications, maintaining materialized views, or feeding an event-driven architecture. Unlike tailing the oplog directly, change streams are resumable - if your application crashes, you can pick up exactly where you left off using a resume token.

Every change event includes an _id field that acts as the resume token. By saving it to a dedicated collection after processing each event and passing it back as resumeAfter on startup, you create a checkpoint that survives crashes and restarts. Keep in mind that resume tokens are only valid while the oplog still contains the corresponding entry - typically 24-72 hours depending on your cluster. If your application is down longer than that, the token expires and you'll need to handle a full re-sync. This gives you at-least-once delivery, so if your processing isn't idempotent, deduplicate on the consumer side.

const RESUME_TOKEN_COLLECTION = "changeStreamTokens";

async function saveResumeToken(streamName, token) {
  await db.collection(RESUME_TOKEN_COLLECTION).updateOne(
    { _id: streamName },
    { $set: { token, updatedAt: new Date() } },
    { upsert: true }
  );
}

async function getResumeToken(streamName) {
  const doc = await db.collection(RESUME_TOKEN_COLLECTION).findOne({ _id: streamName });
  return doc?.token ?? null;
}

async function watchTickets() {
  const pipeline = [{ $match: { "fullDocument.priority": "high" } }];
  const options = { fullDocument: "updateLookup" };

  const resumeToken = await getResumeToken("tickets-high-priority");
  if (resumeToken) {
    options.resumeAfter = resumeToken;
    console.log("Resuming change stream from stored token");
  }

  const changeStream = db.collection("tickets").watch(pipeline, options);

  changeStream.on("change", async (event) => {
    console.log("High-priority ticket changed:", event.fullDocument);
    // Persist the token after successful processing
    await saveResumeToken("tickets-high-priority", event._id);
  });

  changeStream.on("error", (err) => {
    console.error("Change stream error:", err);
    // Reconnect after a delay - the saved token lets us pick up where we left off
    setTimeout(() => watchTickets(), 5000);
  });
}

Transactions: When You Actually Need Them

Multi-document transactions in MongoDB exist, but they shouldn't be your default approach. If you find yourself needing transactions frequently, it might be a signal that your schema design needs rethinking. That said, there are legitimate cases - financial operations, inventory management, or any workflow where partial updates would leave your data in an inconsistent state.

const session = client.startSession();
try {
  session.startTransaction({
    readConcern: { level: "snapshot" },
    writeConcern: { w: "majority" },
    readPreference: "primary"
  });

  await db.accounts.updateOne({ _id: from }, { $inc: { balance: -amount } }, { session });
  await db.accounts.updateOne({ _id: to }, { $inc: { balance: amount } }, { session });
  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}

Read and write concerns matter inside transactions. The readConcern: "snapshot" ensures the transaction sees a consistent point-in-time view of the data, so reads within the transaction won't see writes from other concurrent operations. The writeConcern: { w: "majority" } guarantees that committed data has been acknowledged by a majority of replica set members before the transaction reports success. Without majority write concern, a committed transaction could be rolled back if the primary fails before replicating the writes. For financial or critical operations, these aren't optional.

Retryable writes and the retry loop. MongoDB supports retryable writes (retryWrites=true in your connection string, which is the default in modern drivers). This handles transient network errors on individual write operations automatically. But transactions need their own retry logic because the entire transaction can fail due to write conflicts or transient errors, not just a single operation within it. The recommended pattern wraps the transaction in a loop:

async function transferWithRetry(client, from, to, amount, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const session = client.startSession();
    try {
      session.startTransaction({
        readConcern: { level: "snapshot" },
        writeConcern: { w: "majority" },
      });

      const sender = await db.accounts.findOne({ _id: from }, { session });
      if (sender.balance < amount) {
        throw new Error("Insufficient funds"); // the catch block below aborts the transaction
      }

      await db.accounts.updateOne({ _id: from }, { $inc: { balance: -amount } }, { session });
      await db.accounts.updateOne({ _id: to }, { $inc: { balance: amount } }, { session });
      await session.commitTransaction();
      return; // success

    } catch (error) {
      await session.abortTransaction().catch(() => {}); // ignore abort errors (e.g. after a failed commit)

      // TransientTransactionError: safe to retry the whole transaction
      if (error.errorLabels?.includes("TransientTransactionError") && attempt < maxRetries - 1) {
        console.log(`Transient error on attempt ${attempt + 1}, retrying...`);
        continue;
      }
      throw error;
    } finally {
      session.endSession();
    }
  }
}

A few ground rules for transactions: keep them under a second if possible - long-running transactions hold WiredTiger snapshots open, increasing cache pressure and blocking other operations. Don't use them across collections that live on different shards unless absolutely necessary, as distributed transactions carry substantially more latency. And if your application is using transactions to coordinate writes across three or more collections regularly, that's usually a sign to revisit whether those collections should be restructured.

Query Performance: Reading explain() Like a Pro

Every developer should be comfortable reading the output of explain("executionStats"). The key metrics to watch are totalDocsExamined versus nReturned. If you're examining 100,000 documents to return 10 results, something is wrong with your indexing strategy.

Look for COLLSCAN in the winning plan - that's a collection scan, and outside of very small collections or analytics workloads, it's usually a red flag. You want to see IXSCAN followed by a FETCH, or ideally a covered query where the index alone satisfies the request without touching the documents at all.
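
A covered query just needs every filtered and returned field to live in the index, with _id excluded from the projection. A quick sketch with an illustrative index:

db.orders.createIndex({ status: 1, total: 1 });

// Covered: both the filter and the projection are satisfied by the index,
// so explain() reports totalDocsExamined: 0
db.orders.find({ status: "pending" }, { _id: 0, status: 1, total: 1 });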

db.orders.find({ status: "pending", total: { $gt: 100 } })
  .sort({ createdAt: -1 })
  .explain("executionStats");
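
To pull the interesting numbers out programmatically, capture the same explain output in a variable (the paths below assume an unsharded deployment - sharded output nests plans per shard):

const stats = db.orders.find({ status: "pending", total: { $gt: 100 } })
  .sort({ createdAt: -1 })
  .explain("executionStats");

// The ratio that matters: documents examined per document returned
print("docs examined:", stats.executionStats.totalDocsExamined);
print("keys examined:", stats.executionStats.totalKeysExamined);
print("returned:", stats.executionStats.nReturned);
print("time (ms):", stats.executionStats.executionTimeMillis);

// The stage tree: IXSCAN is good, COLLSCAN is a warning sign, SORT means an in-memory sort
printjson(stats.queryPlanner.winningPlan);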

Wrapping Up

Each section here solves a specific class of problem: the aggregation pipeline replaces what would otherwise be multi-query application logic, schema patterns prevent document bloat before it starts, proper indexing keeps reads fast without tanking writes, and transactions with the right concerns protect data integrity when single-document atomicity isn't enough.

If any of these sections felt unfamiliar, the best next step is to run explain() on your slowest queries and see what comes back. That tends to reveal which of these techniques your application actually needs.
