High Availability & Consistency

Imagine you are building the backend for a high-frequency trading platform. A user transfers $10,000 from their savings to their checking account. The system confirms the transfer. Two seconds later, the database server crashes. When it reboots, the $10,000 is gone from savings, but hasn’t appeared in checking.

Welcome to the harsh reality of distributed systems. You are forced to choose between consistency (safety) and latency (speed). MongoDB doesn’t make this choice for you; instead, it gives you granular control over this trade-off via Write Concern and Read Concern.

1. Write Concern: When is a Write “Safe”?

Write Concern determines exactly when MongoDB considers a write operation to be “successful” and acknowledges it to the client application.

The Levels of Acknowledgment

  • w: 1 (The Default): MongoDB acknowledges the write as soon as the Primary node applies it to memory.
    • Pros: Extremely fast (low latency).
    • Cons: Risky. If the Primary crashes before this data is replicated to a Secondary, the data is permanently lost (rolled back).
  • w: "majority" (The Gold Standard): MongoDB acknowledges the write only after it has propagated to a majority of nodes in the Replica Set (e.g., 2 out of 3 nodes).
    • Pros: Safe against rollbacks. Even if the Primary dies, the data survives.
    • Cons: Higher latency, as it involves network round-trips to Secondary nodes.
  • j: true (Disk Journaling): Wait for the write to be flushed to the on-disk Journal before acknowledging.
    • Pros: Ensures durability even if the server loses power completely and immediately.
    • Cons: Maximum latency (disk I/O is slow).

⚔️ War Story: The Phantom Order

An e-commerce company used the default w: 1 for checkout processing. During Black Friday, a user placed an order for a $2,000 laptop. The Primary node confirmed the order in memory, and the application sent a "Success" email. Half a second later, the Primary node experienced a kernel panic and died before replicating the order. A Secondary took over, but it didn't have the order data. The user was charged, but the warehouse never received the shipping request. They switched to w: "majority" the very next day.

Primary Secondary Secondary w: 1 (Fast) w: majority (Safe)

2. Read Concern: Avoiding the Ghosts of Data Past

While Write Concern dictates how safely data is written, Read Concern dictates how “fresh” and “permanent” the data you read is.

In a distributed database, just because you can read a piece of data doesn’t mean it’s permanent. If you read from a node that has data which hasn’t yet replicated to the majority, that data might be rolled back if a crash occurs.

  • local: Return the most recent data available on the specific node you are querying. Warning: This data might be rolled back if it hasn’t achieved majority replication yet.
  • majority: Only return data that has been fully acknowledged by a majority of the replica set. This guarantees the data you read is durable and will never be rolled back.
  • linearizable: Guarantees that the read returns the absolute latest data, reflecting all successful writes that completed before the read started. This involves checking with the majority of nodes at read-time and is highly expensive in terms of latency.

Step-by-Step Example: The Dirty Read

  1. Time T=0: Primary processes a write balance = $500 with w: 1.
  2. Time T=1: Application reads the Primary with readConcern: local. It sees balance = $500.
  3. Time T=2: Primary crashes before replicating to Secondaries.
  4. Time T=3: Secondary is promoted to Primary. It never received the $500 update. The balance reverts to the old value (e.g., $100).
  5. Result: Your application read data that effectively never existed. Using readConcern: majority would have prevented this by forcing the read to wait or return the older, safe value.

3. Interactive: Consistency vs. Latency Simulator

Experience the trade-off firsthand. Adjust the slider to see how increasing data safety impacts system latency.

w: 1 (Primary Ack) w: majority (Replica Ack) j: true (Disk Flush)
Latency Impact
2 ms
Durability Guarantee
Low
System Ready. Adjust slider to simulate.

4. Case Study: Global Financial Ledger (PEDALS Framework)

Let’s apply these concepts using the PEDALS framework to design a globally available financial ledger.

  • P - Process Requirements: The system must process financial transactions across North America, Europe, and Asia. Absolute data safety is required (zero financial loss). System must remain available even if an entire AWS region goes offline.
  • E - Estimate: 10,000 transactions per second (TPS). High read-to-write ratio (users checking balances 10x more than transferring).
  • D - Data Model: A transactions collection and an accounts collection.
  • A - Architecture:
    • Deploy a multi-region Replica Set spanning us-east, eu-west, and ap-south.
    • Set Write Concern to majority to ensure no transaction is ever lost.
    • Set Read Concern to majority to ensure users never see “ghost” balances.
  • L - Localized Details (Zone Sharding):
    • Zone Sharding allows pinning data to specific geographic regions.
    • Tag Shard A as “EU” and define a shard key range (region: "EU") to enforce GDPR compliance and keep local EU traffic low-latency.
  • S - Scale: As TPS grows, add more Shards within each region, maintaining the multi-region replica set architecture for high availability.

5. Coding Consistency

You can strictly enforce these concerns at the Client, Database, Collection, or individual Operation level in your application code.

Java Implementation

import com.mongodb.WriteConcern;
import com.mongodb.ReadConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ConsistencyLevels {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {

            // Enforce Majority Rules at the Collection Level
            MongoCollection<Document> safeCol = client.getDatabase("bank")
                .getCollection("transfers")
                .withWriteConcern(WriteConcern.MAJORITY)
                .withReadConcern(ReadConcern.MAJORITY);

            // This insert completely blocks until acknowledged by a majority of nodes
            safeCol.insertOne(new Document("amount", 100));
            System.out.println("Secure write fully replicated and complete.");
        }
    }
}

Go Implementation

package main

import (
    "context"
    "log"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    "go.mongodb.org/mongo-driver/mongo/writeconcern"
    "go.mongodb.org/mongo-driver/mongo/readconcern"
)

func main() {
    // Connect to the cluster
    client, _ := mongo.Connect(context.TODO(), options.Client().ApplyURI("mongodb://localhost:27017"))

    // Define strict consistency options
    wc := writeconcern.New(writeconcern.WMajority())
    rc := readconcern.Majority()

    opts := options.Collection().
        SetWriteConcern(wc).
        SetReadConcern(rc)

    // Apply options to the collection
    coll := client.Database("bank").Collection("transfers", opts)

    // This insert is durable and safe from rollbacks
    _, err := coll.InsertOne(context.TODO(), map[string]interface{}{"amount": 100})
    if err != nil {
        log.Fatal("Write failed to replicate safely:", err)
    }

    log.Println("Secure write fully replicated and complete.")
}

6. Summary: The Golden Rules of Consistency

  • Write Concern (w: majority) is your shield against data loss during unexpected crashes. It is the gold standard for production systems handling sensitive data.
  • Read Concern (majority) ensures your application only reads truth that will not be erased by a rollback.
  • Zone Sharding is the ultimate architectural tool for Multi-Region deployments, ensuring both legal compliance (e.g., GDPR) and low-latency access by pinning data to geographically local shards.
  • Always deliberately balance Latency (user experience) against Data Safety (business survival). Defaulting blindly to w: 1 is a recipe for disaster in financial or critical path systems.