PostgreSQL vs MongoDB vs Cassandra: Multi-Node and How to Choose
PostgreSQL Multi-Node
The honest starting point: PostgreSQL was designed as a single-node database. It's the most important thing to understand before running PostgreSQL at scale. Every multi-node capability PostgreSQL has - streaming replication, logical replication, connection pooling, sharding - was added on top of a core that was built to run on one machine, with one set of files, with one process managing all reads and writes.
The foundation is excellent. The multi-node layer requires additional tooling and carries additional operational complexity as a result. If you are evaluating PostgreSQL for a distributed system, you are evaluating PostgreSQL plus the tooling around it: Patroni or repmgr for high availability, PgBouncer for connection pooling, and potentially Citus for horizontal write scaling. Each is mature and well-understood. None is built in.
Streaming replication
When a write lands on the PostgreSQL primary, it follows the path from Post 3: WAL record written, shared_buffers updated, heap and index pages modified. In a replicated cluster, the WAL record does a second job: it's also shipped to standby nodes via streaming replication.
A background process called the WAL sender on the primary streams WAL records to one or more standby nodes in near real time. Each standby has a WAL receiver that writes those records to its own WAL, and a recovery process that replays them against the standby's heap and index files. The standby is perpetually in crash recovery mode - i.e., applying WAL from the primary rather than from a local crash.
This is the same WAL mechanism from Post 3, but the purpose is different. In crash recovery, WAL is replaying changes to reconstruct a consistent state after a failure. In streaming replication, WAL is shipping live changes to a secondary machine to keep it current.
Synchronous vs asynchronous replication is the most consequential configuration choice in a PostgreSQL cluster:
In asynchronous mode (the default), the primary writes its WAL and acknowledges the commit to the application immediately. The WAL sender ships the record to standbys afterward, but the primary doesn't wait. This means commits are fast. The application sees write latency close to local WAL flush time. The risk: if the primary crashes before the WAL record reaches a standby, and that standby is promoted to primary, those last few committed transactions are gone. The application was told "committed" and the data doesn't exist on the new primary.
In synchronous mode, the primary waits for at least one designated standby to confirm receipt of the WAL record before acknowledging the commit. This eliminates the data loss window but adds the round-trip latency to the standby - so every write now pays network cost. In the same data center, this is typically 1–3ms. Across data centers, it can be 50–100ms. Whether that cost is acceptable depends entirely on your write SLA.
Replication lag is the delay between when a WAL record is generated on the primary and when it's applied on a standby. Under normal conditions, lag is milliseconds. Under heavy write load or network contention, it can grow to seconds. If your application writes on the primary and immediately reads from a standby, the data may not be there yet. This "read your own writes" consistency failure is the most common source of subtle bugs in PostgreSQL deployments that use read replicas.
Read scaling vs write scaling
Read scaling with PostgreSQL is straightforward. Add more standbys, route read traffic to them. Each standby is a full copy of the data, capable of serving any read query. This works well and is the standard pattern for PostgreSQL at read-heavy scale.
Write scaling is fundamentally different. All writes must go to the single primary. There is no mechanism in base PostgreSQL to accept writes on two nodes simultaneously. The primary is the sole source of truth, and its write throughput is bounded by how much that one machine can handle.
Citus is the primary extension for horizontal write scaling. It adds sharding on top of PostgreSQL by distributing rows across worker nodes based on a distribution column. For the orders table, you'd distribute by order_id so each worker owns a range of order_id values and handles reads and writes for that range. A coordinator node routes queries to the appropriate worker. Single-shard queries (i.e., lookups by order_id) route to one worker efficiently. Multi-shard queries (aggregations across all orders) fan out to all workers and aggregate at the coordinator.
Citus works and is used in production at significant scale. But consider what you're taking on: a coordinator that can become a bottleneck for multi-shard queries, shard rebalancing operations when you add workers, and the coordination overhead below.
Transactions across nodes
On a single primary, PostgreSQL transactions are fully ACID. In a Citus sharded cluster, a transaction that touches rows on multiple shards requires two-phase commit (2PC). 2PC runs in two rounds: first, the coordinator asks each involved shard to prepare (lock the rows, write the WAL, but don't commit yet). If all shards confirm they're prepared, the coordinator sends commit to all of them. If any shard fails during preparation, the coordinator sends rollback.
The cost is real: two network round trips to all participating shards, plus locks held during the prepare phase. Under high concurrency, prepared transactions holding locks become a contention source. The failure mode - like a coordinator crash between prepare and commit - leaves shards in a prepared-but-undecided state that requires manual recovery.
2PC is correct, mature, and operationally manageable. It is also slower and more complex than a local transaction. This is one of PostgreSQL's genuine scaling limitations: the richer the transaction semantics, the harder it is to distribute.
CAP position
PostgreSQL is CP - meaning it prioritizes consistency over availability during a network partition. When the network separating the primary from its standbys is partitioned, the primary doesn't know if standbys are down or if it is the isolated node. With automatic failover tooling (Patroni, repmgr), the cluster will eventually elect a new primary from the standby side. But during the election window, writes may be unavailable. PostgreSQL will not allow two nodes to simultaneously believe they're the primary and accept writes, because that would produce divergent state with no way to reconcile it.
The practical consequence: during a partition event, your application may see write unavailability for the duration of the failover - typically seconds to tens of seconds depending on your tooling's detection and election timeout configuration. Reads on standbys may continue; writes are blocked until leadership is resolved. For systems where write availability is more important than strict consistency, this is a constraint to plan around.
MongoDB Multi-Node
Replica sets - replication as a prime feature
MongoDB designed replication as a native feature from early in its history, and it shows. Spinning up a 3-node replica set is a supported, documented, first-class operation. There's no external Patroni equivalent required for basic scaling.
A replica set consists of one primary and one or more secondary nodes. The primary accepts all writes. Secondaries replicate from the primary and can serve reads. Automatic failover happens via an election: when secondaries can't reach the primary, they hold an election among themselves and promote the node with the most up-to-date oplog. The entire process - detection, election, promotion - typically completes in under 10 seconds with default settings.
The replication mechanism is the oplog (operations log), a special capped collection in the local database on every replica set member. When WiredTiger applies a write to the primary's collection B-Tree, MongoDB also appends a logical description of that operation to the oplog: "insert this document with this _id" or "update this field on this document." Secondaries continuously tail the primary's oplog and apply each operation to their own WiredTiger storage.
This is structurally similar to PostgreSQL's WAL streaming, but the oplog is logical - meaning it records operations at a higher semantic level - while PostgreSQL's WAL is physical, so it records exact byte changes to specific pages. The logical oplog is more portable and can be consumed by change data capture systems. The physical WAL is more efficient for pure replication because it doesn't require re-interpreting operations.
Write concern in a cluster
You know write concern from Post 4's single-node discussion. In a replica set, the options become meaningfully different with a quorum-based approach:
w: 1acknowledges as soon as the primary's WiredTiger cache and journal accept the write. Fast. But if the primary crashes before replicating and a secondary is promoted, that write is gone. The new primary never had it. MongoDB will write the un-replicated operations to a rollback file when the old primary rejoins, but the application already received "success."w: majoritywaits until a majority of voting replica set members (for a 3-node set, that's 2 nodes) have written the operation to their journals. This write survives primary failure. If the primary crashes immediately after acknowledging, the operation is on at least one secondary. The newly elected primary will have it. The cost is the round-trip to the nearest secondary that confirms - typically a few milliseconds in the same data center.
The right choice depends entirely on your durability requirements. Financial writes, inventory mutations, anything where "we told the user it succeeded" must be true: use w: majority. High-volume analytics events, activity logs, telemetry: w: 1 is probably fine.
Native sharding
MongoDB's sharded cluster has three components:
mongos is the query router. A stateless process that sits between your application and the shards. Applications connect to
mongosas if it were a regular MongoDB instance.mongosholds a cached view of which shard owns which chunk of the key space and routes operations accordingly.Config servers are a 3-node replica set that stores the authoritative cluster metadata: which shard key ranges map to which shards, which chunks have been migrated, and the overall topology.
mongosreads this at startup and caches it.Shard replica sets are individual replica sets that each own a portion of the data. Each shard is independently a fully functional replica set with its own primary, secondaries, WiredTiger storage, and oplog.
The shard key is the most consequential choice in a MongoDB sharded cluster. For orders, order_id is a reasonable shard key - UUID values hash uniformly across the key space, distributing writes across shards evenly. status would be a terrible shard key. There are four status values. MongoDB would create chunks for those four values and they'd end up on four shards. The overwhelming majority of new orders arrive as pending - so essentially all new writes would hit one shard, defeating the entire purpose of horizontal scaling. This is a write hotspot, and it's a real production incident pattern.
Cross-shard transactions
MongoDB added multi-document, multi-collection transactions in 4.0, extended to sharded clusters in 4.2. They use 2PC internally (same mechanism as Citus). Two network round trips to all involved shards, locks held during the prepare phase, coordinator failure complexity.
MongoDB's documentation is direct about the recommendation: avoid cross-shard transactions where possible. The idiomatic alternative is to model your data so that atomic operations touch a single document or a single shard. Embed related data in one document rather than referencing across documents. Ensure that operations that need to be atomic share the same shard key so they land on the same shard.
This is the same philosophy as Cassandra's table-per-query model - move complexity to the data model so the engine doesn't have to coordinate at runtime. MongoDB gives you more flexibility in when and how you apply this principle. Cassandra enforces it structurally.
CAP position
MongoDB leans CP when using majority read/write concerns. But MongoDB makes the CP/AP tradeoff configurable per operation via write concern and read preference. With readPreference: secondary and w: 1, MongoDB shifts toward AP - reads may be stale (eventual consistency from secondary reads) and writes acknowledge quickly without waiting for durability guarantees. With w: majority and readPreference: primary, it's firmly CP.
The application controls this dial, per operation if needed. This flexibility is genuinely useful. A single MongoDB cluster can serve both a user-facing write path (strong consistency, w: majority) and an analytics read path (eventual consistency, secondary reads) from the same data set.
Comments
No comments yet. Start the discussion.