Resilience: Surviving Split Brain
[!NOTE] This module explores the core principles of Resilience: Surviving Split Brain, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Nightmare: Split Brain
Imagine it is Black Friday. Your 2-Node Elasticsearch cluster (Node A, Node B) holds the entire product catalog and inventory. Both nodes are healthy and replicating data seamlessly. Suddenly, a core network switch fails, severing the connection between the two nodes.
Because they can no longer communicate with each other, they each make a dangerous assumption:
- Node A thinks: “I cannot ping Node B. Node B must have crashed. I am the only survivor, so I must declare myself the Master to keep the cluster alive.”
- Node B thinks: “I cannot ping Node A. Node A must have crashed. I am the only survivor, so I must declare myself the Master to keep the cluster alive.”
Result: Split Brain. You now have TWO distinct, independent clusters, both believing they are the single source of truth. Your application load balancer continues sending checkout requests to both Node A and Node B.
When the network switch is repaired and the partition heals, you face catastrophic Data Divergence. Both nodes accepted conflicting writes (e.g., Node A processed an order for the last smartphone, while Node B processed an order for the same smartphone). You cannot simply merge the changes back together. Data loss and inconsistency are practically guaranteed.
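The divergence described above can be sketched in a few lines of Go. This is a toy model only, not Elasticsearch internals: two in-memory maps stand in for the two nodes, and the helper name `splitBrainWrites` is invented for illustration.

```go
package main

import "fmt"

// splitBrainWrites is a toy model (not Elasticsearch code): during a
// partition, each node independently accepts a write for the same document.
// After the network heals, the two copies hold conflicting values that no
// automatic merge can reconcile.
func splitBrainWrites() (string, string) {
	nodeA := map[string]string{} // Node A's copy of the inventory
	nodeB := map[string]string{} // Node B's copy of the inventory

	// Both sides believe they are the master and accept the checkout.
	nodeA["last-smartphone"] = "sold to customer via Node A"
	nodeB["last-smartphone"] = "sold to customer via Node B"

	return nodeA["last-smartphone"], nodeB["last-smartphone"]
}

func main() {
	a, b := splitBrainWrites()
	fmt.Println("Node A says:", a)
	fmt.Println("Node B says:", b)
	fmt.Println("diverged:", a != b) // true: the two histories conflict
}
```

The point of the sketch is that once both copies have accepted a write for the same key, there is no general rule for picking a winner; prevention (quorum) beats reconciliation.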
2. The Solution: Quorum ((N/2) + 1)
Distributed systems solve the Split Brain problem using a voting mechanism called Quorum.
Think of Quorum like a corporate Board of Directors. For any major decision to pass, you need a strict majority vote. In Elasticsearch, to be elected Master, a node needs votes from a strict majority of the Master-Eligible nodes in the cluster.
The formula is: Floor(N/2) + 1
Let’s test this with concrete examples:
- 2 Nodes: floor(2/2) + 1 = 2 votes required.
  - If a network partition occurs, neither node can secure 2 votes. The cluster blocks all writes to prevent divergence. This protects your data, but if a single node legitimately crashes, the cluster halts. This is not High Availability.
- 3 Nodes: floor(3/2) + 1 = 2 votes required.
  - If 1 node dies (or gets partitioned), the remaining 2 nodes can still communicate. Together, they have 2 votes (a quorum). They successfully elect a master, and the cluster survives.
- 4 Nodes: floor(4/2) + 1 = 3 votes required.
  - If 2 nodes are partitioned from the other 2, neither side can secure 3 votes. The cluster halts. You have the same fault tolerance as a 3-node cluster (you can only survive 1 node loss), but you paid for an extra server. This is why we avoid even numbers.
Rule: Always provision an odd number of Master-Eligible Nodes (3, 5, 7, etc.). Never 2 or 4. For massive production clusters, provision 3 small, dedicated Master nodes whose only job is cluster state management (no data storage or search traffic).
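The quorum arithmetic above is easy to encode and test. The following Go sketch uses two hypothetical helpers, `quorum` and `survivesPartition`, invented here for illustration (Go's integer division already floors, so `n/2 + 1` is exactly floor(N/2) + 1):

```go
package main

import "fmt"

// quorum returns the majority vote count for n master-eligible nodes:
// floor(n/2) + 1. Go's integer division truncates, which floors for n >= 0.
func quorum(n int) int {
	return n/2 + 1
}

// survivesPartition reports whether either side of a two-way partition
// (sideA + sideB nodes in total) can still muster a quorum and elect a master.
func survivesPartition(sideA, sideB int) bool {
	q := quorum(sideA + sideB)
	return sideA >= q || sideB >= q
}

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("%d nodes -> quorum of %d votes\n", n, quorum(n))
	}
	fmt.Println("3 nodes, 1 lost:", survivesPartition(2, 1))   // true: 2 >= quorum(3)
	fmt.Println("4 nodes, 2/2 split:", survivesPartition(2, 2)) // false: neither side has 3
}
```

Running the checks confirms the text: 3 nodes survive a single failure, while a 4-node cluster split down the middle halts, so the fourth node buys no extra fault tolerance.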
4. Troubleshooting: Red vs Yellow Cluster
Before panicking during an incident, check the cluster health color. It indicates data availability.
- Green: All Primary & Replica shards are assigned. Perfect health (All clear).
- Yellow: All Primaries are assigned, but some Replicas are missing.
- Data Loss? No. You can still search everything. (Degraded but functional, like driving on a spare tire).
- Cause: You only have 1 node, but requested 1 Replica (Replicas cannot live on the same node as the Primary). Or a node crashed, and ES is currently rebuilding replicas on other nodes.
- Red: Some Primary shards are missing. Data is unavailable.
- Data Loss? Possibly. At minimum the data is unavailable: searches will return incomplete results, and writes to those specific shards will fail. If the missing node never comes back and there were no replicas, the loss is permanent. (Engine failure).
- Cause: Multiple nodes crashed simultaneously, or a single node crashed and it held the only copy of an un-replicated index.
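The decision tree above can be captured as a tiny triage helper. This is a sketch in Go with an invented function name (`triage`), not part of any Elasticsearch client; it simply maps the health color string to the operator takeaway described in the bullets:

```go
package main

import "fmt"

// triage maps a cluster health color to the operator takeaway described
// in the text. Toy helper for illustration only.
func triage(status string) string {
	switch status {
	case "green":
		return "all primaries and replicas assigned; no action needed"
	case "yellow":
		return "all primaries assigned but replicas missing; searchable, fix replication"
	case "red":
		return "primary shards missing; run the cluster allocation explain API"
	default:
		return "unknown status: " + status
	}
}

func main() {
	fmt.Println("yellow ->", triage("yellow"))
	fmt.Println("red    ->", triage("red"))
}
```

The key operational point it encodes: yellow is degraded but serving, while red means some data cannot be read or written until allocation is fixed.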
Diagnosing a Red Cluster
If the cluster is Red, the first step is always to ask the Master why it is refusing to assign the shard. This process is called Allocation Explanation. It reveals the exact constraint (e.g., disk full, node missing, shard corrupted) blocking the assignment.
Java
```java
// Java: requesting cluster health and an allocation explanation
// using the Elasticsearch Java API client (co.elastic.clients).
// Accessor names follow the generated client; check your client version.
import co.elastic.clients.elasticsearch._types.HealthStatus;
import co.elastic.clients.elasticsearch.cluster.AllocationExplainRequest;
import co.elastic.clients.elasticsearch.cluster.AllocationExplainResponse;
import co.elastic.clients.elasticsearch.cluster.HealthResponse;
import co.elastic.clients.elasticsearch.cluster.allocation_explain.NodeAllocationExplanation;

HealthResponse healthResponse = client.cluster().health();
System.out.println("Cluster Status: " + healthResponse.status());

if (healthResponse.status() == HealthStatus.Red) {
    // Ask the master why it is refusing to assign the shard.
    AllocationExplainRequest request = AllocationExplainRequest.of(e -> e);
    AllocationExplainResponse response = client.cluster().allocationExplain(request);

    System.out.println("Unassigned Reason: " + response.unassignedInfo().reason());
    for (NodeAllocationExplanation nodeExp : response.nodeAllocationDecisions()) {
        // Print the first decider that vetoed allocation on this node.
        System.out.println("Node " + nodeExp.nodeName()
                + " rejected: " + nodeExp.deciders().get(0).explanation());
    }
}
```
Go
```go
// Go: requesting cluster health and an allocation explanation
// using the low-level go-elasticsearch esapi package.
import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/elastic/go-elasticsearch/v8/esapi"
)

req := esapi.ClusterHealthRequest{}
res, err := req.Do(context.Background(), esClient)
if err != nil {
	log.Fatalf("Error getting health: %s", err)
}
defer res.Body.Close()

// For allocation explain
explainReq := esapi.ClusterAllocationExplainRequest{}
explainRes, err := explainReq.Do(context.Background(), esClient)
if err != nil {
	log.Fatalf("Error explaining allocation: %s", err)
}
defer explainRes.Body.Close()

// Read the body to see why shards are unassigned,
// e.g. "NODE_LEFT", "ALLOCATION_FAILED", "CLUSTER_RECOVERED"
body, err := io.ReadAll(explainRes.Body)
if err != nil {
	log.Fatalf("Error reading explanation: %s", err)
}
fmt.Println(string(body))
```