By default, MongoDB guarantees durability (the D in ACID) over the network with strong consistency (the C in the CAP theorem). It still maintains high availability: in the event of a network partition, the majority side continues to serve consistent reads and writes transparently, without raising errors to the application.
Cluster state + operation log replication
A Raft‑inspired consensus (terms, elections, majority commit) is used to achieve this distributed consistency at two levels:
- Writes are directed to the shard's primary, which coordinates consistency between the collection and its indexes. A Raft-like election designates one replica as the primary, with the others acting as secondaries.
- Writes to the shard's primary are replicated to the secondaries and acknowledged once a majority has guaranteed durability on persistent storage. The equivalent of the Raft log is the data itself: the operation log (oplog). The sketch below shows how to inspect both levels.
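To make this concrete, here is a minimal mongosh sketch (assuming a connection to any member of a running replica set; it is not part of the original setup): rs.status() exposes the Raft-like election state (term and member roles), and the local.oplog.rs collection is the replicated operation log.

// Election level: the current term and which member is primary.
const status = rs.status();
print("election term:", status.term);
status.members.forEach(m => print(m.name, m.stateStr));
// Replication level: the oplog in the local database plays the role of the Raft log.
printjson(db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).toArray());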
Comparison with other databases
It’s important to distinguish between two kinds of consensus: consensus used to control replica roles (leader election) and consensus used to replicate data. In PostgreSQL, for example, failover automation tools such as Patroni use a consensus system (e.g., etcd) to elect a primary, but data replication via WAL streaming is not itself governed by a consensus protocol. As a result, failures during replication can leave replicas in inconsistent states that must be resolved afterward (e.g., via pg_rewind).
The PostgreSQL community has discussed adding built-in Raft replication on the hackers mailing list, citing MongoDB as an example (the "Built-in Raft replication" thread). PostgreSQL forks such as YugabyteDB already replicate data using Raft.
Whether they use consensus or not, all databases balance protection, availability, and performance. In PostgreSQL, for example, setting synchronous_commit to on and synchronous_standby_names to an ANY quorum of standbys ensures durability over the network while remaining available if one standby fails, similar to MongoDB's default (w: "majority"). Conversely, setting synchronous_commit to off in PostgreSQL, or using w: 1 in MongoDB, favors performance with asynchronous replication.
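On the MongoDB side of that comparison, the durability level is just a write-concern option, set per operation or in the connection string; a minimal sketch (the collection and document names are made up for illustration):

// Durable across the network, analogous to synchronous_commit = on with a quorum of standbys:
db.orders.insertOne({ _id: "order-1" }, { writeConcern: { w: "majority" } });
// Asynchronous replication, analogous to synchronous_commit = off:
db.orders.insertOne({ _id: "metric-1" }, { writeConcern: { w: 1 } });
// The same default can also be set for a whole connection in the URI:
// mongodb://m1,m2,m3/?replicaSet=rs0&w=majority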
Trade-offs between performance and protection
Consensus on writes increases latency, especially in multi-region deployments, because it requires synchronous replication across the network, but it guarantees no data loss in disaster recovery scenarios (RPO = 0). Some workloads may prefer lower latency and accept limited data loss (for example, a couple of seconds of RPO when a data center burns down). If you ingest data from IoT devices, you may favor fast ingestion at the risk of losing some data in such a disaster. Similarly, when migrating from another database, you might prefer fast synchronization and, in case of an infrastructure failure, simply restart the migration from the point before the failure. In such cases, you can use the {w:1} write concern in MongoDB instead of the default {w:"majority"}.
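Rather than passing the write concern on every operation, such workloads can also change the cluster-wide default; a hedged sketch using the setDefaultRWConcern administrative command (available since MongoDB 4.4; the collection name is illustrative):

// Make { w: 1 } the default write concern for the whole deployment.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: 1 }
});
// Individual operations can still opt back into majority durability:
db.events.insertOne({ _id: "evt-1" }, { writeConcern: { w: "majority" } });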
Most failures are not full-scale disasters in which an entire data center is lost, but rather transient issues involving short network disconnections. With {w:1}, the primary risk is not data loss—since writes can eventually be synchronized—but split brain, where both sides of a network partition continue to accept writes. This is where the two levels of consensus matter:
- A new primary is elected, and the old primary steps down, limiting the split-brain window to a few seconds.
- With the default {w:"majority"}, writes that cannot reach a majority are not acknowledged on the side of the partition that has no quorum, which prevents split brain. With {w:1}, however, those writes are acknowledged until the old primary steps down. The sketch below shows the election settings that bound this window.
Because the failure is transient, when the old primary rejoins, no data is physically lost: writes from both sides still exist. However, these writes may conflict, resulting in a diverging database state with two branches. As with any asynchronous replication, this requires conflict resolution. MongoDB handles this as follows:
- Writes from the new primary are preserved, as this is where the application has continued to make progress.
- Writes that occurred on the old primary during the brief split-brain window are rolled back, and the old primary pulls the more recent writes from the new primary.
Thus, when you use {w:1}, you accept the possibility of limited data loss in the event of a failure. Once the node is back, these writes are not entirely lost, but they cannot be merged automatically. MongoDB stores them as BSON files in a rollback directory so you can inspect them and perform manual conflict resolution if needed.
This conflict resolution is documented as Recover To a Timestamp (RTT).
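On the read side, an application that must never observe writes that could later be rolled back can request majority-committed data; a minimal mongosh sketch (the collection name is illustrative):

// Only returns documents acknowledged by a majority of the replica set,
// so a later rollback can never make them disappear from the result.
db.demo.find({}).readConcern("majority");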
Demo on a Docker lab
Let's try it. I start 3 containers as a replica set:
docker network create lab
docker run --network lab --name m1 --hostname m1 -d mongo --replSet rs0
docker run --network lab --name m2 --hostname m2 -d mongo --replSet rs0
docker run --network lab --name m3 --hostname m3 -d mongo --replSet rs0
docker exec -it m1 mongosh --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "m1:27017", priority: 3 },
    { _id: 1, host: "m2:27017", priority: 2 },
    { _id: 2, host: "m3:27017", priority: 1 }
  ]
})
'
until
docker exec -it m1 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
grep -C3 "m1:27017 PRIMARY"
do sleep 1 ; done
The last command waits until m1 is the primary, as set by its priority. I do that to make the demo reproducible with simple copy-paste.
I insert "XXX-10" when connected to m1:
docker exec -it m1 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-10" , date:new Date() },
{ writeConcern: {w: "1"} }
)
'
{ acknowledged: true, insertedId: 'XXX-10' }
I disconnect the secondary m2:
docker network disconnect lab m2
With a replication factor of 3, the replica set tolerates one failure, so I insert "XXX-11" while connected to the primary:
docker exec -it m1 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-11" , date:new Date() },
{ writeConcern: {w: "1"} }
)
'
{ acknowledged: true, insertedId: 'XXX-11' }
I disconnect m1, the current primary, reconnect m2, and immediately insert "XXX-12" while still connected to m1:
docker network disconnect lab m1
docker network connect lab m2
docker exec -it m1 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-12" , date:new Date() },
{ writeConcern: {w: "1"} }
)
'
{ acknowledged: true, insertedId: 'XXX-12' }
Here, m1 is still a primary for a short period, before it detects that it cannot reach a majority of replicas and steps down. If the write concern were {w: "majority"}, the write would have waited and eventually failed, unable to reach a quorum; with {w: 1}, replication is asynchronous and the write is acknowledged as soon as it is written to the local disk.
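To see the difference, the same insert with a majority write concern and a timeout should not be acknowledged on this minority side; a hedged sketch (the _id and the 5-second wtimeout are arbitrary; expect a write concern timeout error instead of the acknowledgment above):

docker exec -it m1 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-12-majority" , date:new Date() },
{ writeConcern: {w: "majority", wtimeout: 5000} }
)
'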
Two seconds later, a similar write fails because the primary stepped down:
sleep 2
docker exec -it m1 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-13" , date:new Date() },
{ writeConcern: {w: "1"} }
)
'
MongoServerError: not primary
I wait until m2 is the new primary, as set by its priority, and connect to it to insert "XXX-20":
until
docker exec -it m2 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
grep -C3 "m2:27017 PRIMARY"
do sleep 1 ; done
docker exec -it m2 mongosh --eval '
db.demo.insertOne(
{ _id:"XXX-20" , date:new Date() },
{ writeConcern: {w: "1"} }
)
'
{ acknowledged: true, insertedId: 'XXX-20' }
No node is down; this is only a network partition, and I can still read from every node because docker exec connects locally rather than through the partitioned network. I query the collection on each side:
docker exec -it m1 mongosh --eval 'db.demo.find()'
docker exec -it m2 mongosh --eval 'db.demo.find()'
docker exec -it m3 mongosh --eval 'db.demo.find()'
The inconsistency is visible: "XXX-12" exists only on m1, while "XXX-20" exists only on m2 and m3.
I reconnect m1 so that all nodes can communicate and synchronize their state:
docker network connect lab m1
I query again, and all nodes show the same values: "XXX-12" has disappeared, and all nodes are now synchronized to the current state. When it rejoined, m1 rolled back the operations that occurred during the split-brain window. This is expected and acceptable, since the write used a { w: 1 } write concern, which explicitly allows limited data loss in case of failure in order to avoid cross-network latency on each write.
The rolled-back operations are not lost: MongoDB logged them in a rollback directory in BSON format, with the rolled-back document as well as the related oplog entry.
I read and decode all BSON files in the rollback directory:
docker exec -i m1 bash -c '
for f in /data/db/rollback/*/removed.*.bson
do
  echo "$f"
  bsondump --pretty "$f"
done
' | egrep --color=auto '^|^/.*|.*("op":|"XXX-..").*'
The deleted document is in /data/db/rollback/0ae03154-0a51-4276-ac62-50d73ad31fe0/removed.2026-02-10T10-40-58.1.bson:
{
  "_id": "XXX-12",
  "date": {
    "$date": {
      "$numberLong": "1770719868965"
    }
  }
}
The deleted oplog for the related insert is in /data/db/rollback/local.oplog.rs/removed.2026-02-10T10-40-58.0.bson:
{
  "lsid": {
    "id": {
      "$binary": {
        "base64": "erR2AoFXS3mbcX4BJSiWjw==",
        "subType": "04"
      }
    },
    "uid": {
      "$binary": {
        "base64": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=",
        "subType": "00"
      }
    }
  },
  "txnNumber": {
    "$numberLong": "1"
  },
  "op": "i",
  "ns": "test.demo",
  "ui": {
    "$binary": {
      "base64": "CuAxVApRQnasYlDXOtMf4A==",
      "subType": "04"
    }
  },
  "o": {
    "_id": "XXX-12",
    "date": {
      "$date": {
        "$numberLong": "1770719868965"
      }
    }
  },
  "o2": {
    "_id": "XXX-12"
  },
  "stmtId": {
    "$numberInt": "0"
  },
  "ts": {
    "$timestamp": {
      "t": 1770719868,
      "i": 1
    }
  },
  "t": {
    "$numberLong": "1"
  },
  "v": {
    "$numberLong": "2"
  },
  "wall": {
    "$date": {
      "$numberLong": "1770719868983"
    }
  },
  "prevOpTime": {
    "ts": {
      "$timestamp": {
        "t": 0,
        "i": 0
      }
    },
    "t": {
      "$numberLong": "-1"
    }
  }
}
The value "XXX-12" that disappeared from the collection is still available here, both as its after-image and as its oplog entry.
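If, after inspection, the rolled-back document should be re-applied, one option is to restore the BSON file into a staging collection and reconcile it with the current data from there; a hedged sketch (the staging collection name is made up, and the path is the one from this particular run):

docker exec -it m1 mongorestore --db test --collection demo_rollback_review \
  /data/db/rollback/0ae03154-0a51-4276-ac62-50d73ad31fe0/removed.2026-02-10T10-40-58.1.bson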
Conclusion: beyond Raft
By default, MongoDB favors strong consistency and durability: writes use { w: "majority" } and are acknowledged only once majority-committed, committed writes are never rolled back, and reads with readConcern: "majority" never observe data that could be rolled back. In this mode, MongoDB behaves like a classic Raft system: once an operation is committed, it is final.
MongoDB also lets you explicitly relax that guarantee by choosing a weaker write concern such as { w: 1 }. In doing so, you tell the system: "Prioritize availability and latency over immediate global consistency". The demo shows what that implies:
- During a transient network partition, two primaries can briefly accept writes.
- Both branches of history are durably written to disk.
- When the partition heals, MongoDB deterministically chooses the majority branch.
- Operations from the losing branch are rolled back—but not discarded. They are preserved as BSON files with their oplog entries.
- The node then recovers to a majority-committed timestamp (RTT) and rolls forward.
This rollback behavior is where MongoDB intentionally diverges from vanilla Raft.
In classic Raft, the replicated log is the source of truth, and committed log entries are never rolled back. Raft assumes a linearizable, strongly consistent state machine where the application does not expect divergence. MongoDB, by contrast, comes from a NoSQL and event-driven background, where asynchronous replication, eventual consistency, and application-level reconciliation are sometimes acceptable trade-offs.
As a result:
- MongoDB still uses Raft semantics for leader election and terms, so two primaries are never elected in the same term.
- For data replication, MongoDB extends the model with Recover To a Timestamp (RTT) rollback.
- This allows MongoDB to safely support lower write concerns, fast ingestion, multi-region latency optimization, and migration workloads.
In short, MongoDB replication is based on Raft, but adds rollback semantics to support real-world distributed application patterns. Rollbacks happen only when you explicitly allow them, never with majority writes, and they are fully auditable and recoverable.
