Failover Strategies: Surviving the Crash
Replication creates a copy of your data. But a copy does not take over on its own: if the Primary server catches fire, your application still goes down.
Failover is the process of promoting a Standby to become the new Primary.
It sounds simple: “If Primary is dead, promote Standby.” But in distributed systems, defining “dead” is the hardest problem.
The Real-World Hook: Imagine running a massive e-commerce platform during Black Friday. At 2:00 PM, the primary database server’s network interface starts flapping, dropping packets intermittently. The standby server assumes the primary is dead and promotes itself. But the primary is still alive and accepting writes from a subset of application servers. You now have two distinct realities. Two customers each buy the last TV in stock. When the network stabilizes, you have thousands of conflicting orders. This is the Split-Brain nightmare. GitHub’s October 2018 outage is a well-known example of a brief network partition triggering a failover that left unreplicated writes on both sides.
1. The Nightmare: Split-Brain
In an ideal world, a server is either 100% alive or 100% dead. In reality, servers experience “gray failures”: partial network partitions, CPU starvation causing missed heartbeats, or faulty switches.
Consider the classic failure scenario:
- Partition: The network cable between the Primary and Standby breaks, but both can still talk to the application.
- False Assumption: The Standby stops receiving WAL (Write-Ahead Log) records and assumes the Primary is dead.
- Rogue Promotion: The Standby promotes itself to Primary.
- Divergence: You now have Two Primaries. Both are accepting divergent writes.
When the network is restored, the two histories cannot be merged automatically: each node has committed transactions the other never saw, so their WAL timelines, sequence values, and transaction IDs have diverged. The only fix is to pick one node as the survivor and discard the other node’s divergent writes (for example by rewinding or rebuilding it), which means lost data.
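The failure sequence above can be reproduced in a few lines. This is a toy model, not Postgres: the `Node` class and the `naive_failover` rule are hypothetical names used only to show why "promote on a missed heartbeat" is unsafe without an external arbiter.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    is_primary: bool
    orders: list = field(default_factory=list)

    def write(self, order: str) -> bool:
        """Accept a write only if this node believes it is the Primary."""
        if self.is_primary:
            self.orders.append(order)
        return self.is_primary


def naive_failover(standby: Node, heartbeat_received: bool) -> None:
    """Unsafe rule: promote as soon as a heartbeat is missed, with no arbiter."""
    if not heartbeat_received:
        standby.is_primary = True


primary = Node("node1", is_primary=True, orders=["order-1"])
standby = Node("node2", is_primary=False, orders=["order-1"])  # replicated copy

# The replication link breaks, but both nodes still reach the application.
naive_failover(standby, heartbeat_received=False)

# Different application servers now write to different "primaries".
primary.write("order-2: last TV -> alice")
standby.write("order-2: last TV -> bob")

both_primary = primary.is_primary and standby.is_primary
diverged = primary.orders != standby.orders
print(both_primary, diverged)
```

Both nodes share `order-1` but disagree on `order-2`, and nothing in either node's local state says which version should win.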
2. Interactive: Split-Brain Simulator
A quorum-based system (like Patroni + etcd) prevents Split-Brain with one rule: a node can only be Primary while it holds the “Leader Key” in the Distributed Configuration Store (DCS).
3. The Solution: Patroni & Distributed Consensus
You should never write your own failover scripts (like a basic bash ping script). Distributed consensus is notoriously difficult to get right. Instead, use Patroni, a widely used open-source manager for Postgres High Availability (HA).
Patroni acts as an intelligent supervisor running alongside Postgres. It solves the Split-Brain problem by introducing an external source of truth: a Distributed Configuration Store (DCS) like etcd, Consul, or ZooKeeper.
How Patroni Achieves Consensus
- The DCS Cluster: The DCS runs as a highly available cluster (usually 3 or 5 nodes) that agrees on every change through a consensus algorithm (Raft in etcd and Consul, ZAB in ZooKeeper). It holds the definitive state of the Postgres cluster.
- The Leader Key: Only one Postgres node can hold the “Leader Key” in the DCS. The DCS enforces this using atomic locks.
- Heartbeats & TTL: The Primary must constantly renew its lease on the Leader Key (e.g., every 5 seconds) to prove it is alive. If the Primary crashes or loses network connectivity to the DCS, its lease expires (Time-To-Live, or TTL, lapses).
- The Race for Promotion: Once the Leader Key expires, all healthy Standbys detect the absence of a leader. They race to acquire the lock in the DCS. The winner is promoted to the new Primary.
The Analogy: Think of the Leader Key as a “Talking Stick” in a meeting. You can only speak (write to the database) if you hold the stick. The DCS is the strict moderator holding the stick. If you fall asleep (crash), the moderator takes the stick back and hands it to the next person who asks for it.
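The lease mechanics above can be sketched with a toy in-memory store. `LeaderKeyStore` is a hypothetical stand-in for the DCS, not Patroni's or etcd's API, and the 30-second TTL and the timestamps are illustrative values chosen for the example.

```python
class LeaderKeyStore:
    """Toy stand-in for the DCS: one key, one holder, expires after ttl seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the key if it is free or expired, or renew it if node holds it."""
        expired = now >= self.expires_at
        if self.holder is None or expired or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds a live lease

    def leader(self, now: float):
        """Return the current leader, or None if the lease has lapsed."""
        return self.holder if now < self.expires_at else None


dcs = LeaderKeyStore(ttl=30.0)

node1_wins = dcs.acquire_or_renew("node1", now=0.0)      # node1 becomes leader
node2_blocked = dcs.acquire_or_renew("node2", now=5.0)   # key is held, node2 loses
dcs.acquire_or_renew("node1", now=10.0)                  # heartbeat: lease now ends at t=40

# node1 loses contact with the DCS after t=10 and stops renewing.
leader_at_39 = dcs.leader(now=39.0)
leader_at_40 = dcs.leader(now=40.0)

# Both standbys notice the missing leader and race; only the first call succeeds.
node2_promoted = dcs.acquire_or_renew("node2", now=41.0)
node3_promoted = dcs.acquire_or_renew("node3", now=41.0)
print(leader_at_39, leader_at_40, node2_promoted, node3_promoted)
```

In a real DCS the check-and-set inside `acquire_or_renew` is a single atomic operation agreed on by a majority of DCS nodes, which is what makes "only one holder" enforceable during a partition.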
Fencing and STONITH: Ensuring the Dead Stay Dead
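What happens if the old Primary was just temporarily frozen (e.g., by a VM pause or a long I/O stall) and wakes up thinking it’s still the leader? It might keep accepting writes after a Standby has been promoted, which is Split-Brain all over again.
To prevent this, Patroni relies on Fencing, with a watchdog device as the last line of defense.
- Self-demotion: If Patroni cannot renew the Leader Key because it has lost contact with the DCS, it demotes Postgres to a read-only Standby before the key’s TTL lapses.
- Watchdog: If the Patroni process itself hangs or crashes, it stops resetting the hardware (or Linux softdog) watchdog timer, and the watchdog force-reboots the server before the Leader Key can expire. This is self-fencing, a close cousin of STONITH (Shoot The Other Node In The Head), in which a surviving node powers off the suspect one.
- Either way, the old Primary is stopped from accepting writes before a Standby can acquire the key, which is what makes the takeover safe.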
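One way to picture watchdog-based fencing is as a timer that must fire before the Leader Key can expire. The sketch below is a simplified model, not Patroni's implementation: the `Watchdog` class is hypothetical, and the TTL and safety margin are illustrative values.

```python
class Watchdog:
    """Toy watchdog timer: resets the node unless it is pinged in time."""

    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_ping = now

    def ping(self, now: float) -> None:
        """Called by a healthy supervisor on every successful loop."""
        self.last_ping = now

    def would_reset(self, now: float) -> bool:
        """True once the timeout has elapsed since the last ping."""
        return now - self.last_ping >= self.timeout


TTL = 30.0            # leader key lifetime in the DCS (illustrative)
SAFETY_MARGIN = 5.0   # fire the watchdog this long before the key can expire
wd = Watchdog(timeout=TTL - SAFETY_MARGIN)

# t=0: the supervisor renews the leader key and pings the watchdog.
wd.ping(now=0.0)
key_expiry = 0.0 + TTL

# The supervisor then freezes: no more pings, no more renewals.
reset_time = next(t for t in range(0, 61) if wd.would_reset(float(t)))

fenced_before_expiry = reset_time < key_expiry
print(reset_time, key_expiry, fenced_before_expiry)
```

The ordering is the whole point: the frozen node is rebooted at `reset_time`, strictly before `key_expiry`, so no Standby can acquire the key while the old Primary might still be writing.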
4. Client-Side Failover & Connection Routing
Once the Standby is promoted, how does your application know to send writes to the new Primary’s IP address?
Option A: VIP (Virtual IP) Floating
A classic approach uses a tool like keepalived to float a Virtual IP address (e.g., 10.0.0.100).
- The application is hardcoded to connect to `10.0.0.100`.
- During failover, the VIP is aggressively moved from the dead node to the new Primary using ARP broadcasting.
- Drawback: VIPs don’t work well across different subnets or in modern cloud environments (like AWS), which block arbitrary ARP traffic.
Option B: Smart Drivers (libpq Multi-Host)
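Drivers built on libpq (version 10 or later) accept a list of hosts and try them in order. You can provide all database hosts in your application’s connection string.
Example connection string for libpq-based drivers such as Python/psycopg (pgJDBC for Java uses its own URL syntax, where targetServerType=primary plays the same role):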
host=node1,node2,node3 port=5432 target_session_attrs=read-write
How it works:
- The driver connects to `node1` and checks whether the session is read-write (conceptually, the equivalent of running `SELECT pg_is_in_recovery()`).
- If the answer is `true` (indicating a Standby), the driver immediately disconnects and moves to `node2`.
- If the answer is `false` (indicating a Primary), the driver keeps this connection and hands it to the application.
- Advantage: This eliminates the need for an external load balancer, removing a single point of failure and reducing network hops.
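The host-walking behaviour described above can be simulated without a database. In this sketch `pick_read_write_host` and its `in_recovery` probe are hypothetical helpers that mimic the driver's logic. `connect_read_write` shows the real-driver equivalent under two assumptions: psycopg 3 is installed, and the database is named `app`.

```python
from typing import Callable, Iterable, Optional


def pick_read_write_host(
    hosts: Iterable[str],
    in_recovery: Callable[[str], Optional[bool]],
) -> Optional[str]:
    """Return the first host that reports it is not in recovery.

    in_recovery(host) is a hypothetical probe: True for a Standby,
    False for a Primary, None if the host is unreachable.
    """
    for host in hosts:
        state = in_recovery(host)
        if state is False:
            return host  # Primary found: keep this connection
        # Standby or unreachable: fall through to the next host
    return None


def connect_read_write():
    """Real-driver equivalent, assuming psycopg 3 and a database named 'app'."""
    import psycopg  # imported lazily so the simulation runs without it

    return psycopg.connect(
        "host=node1,node2,node3 port=5432 dbname=app target_session_attrs=read-write"
    )


# Simulated cluster just after failover: node1 is down, node2 was promoted.
cluster = {"node1": None, "node2": False, "node3": True}
chosen = pick_read_write_host(["node1", "node2", "node3"], cluster.get)
print(chosen)
```

Because the host list is tried in order, every new connection pays for the dead and standby hosts listed ahead of the Primary. Keeping `connect_timeout` short in the real connection string limits that cost.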
Option C: HAProxy & PgBouncer
For massive scale, applications connect to a local PgBouncer (for connection pooling), which forwards requests to an HAProxy load balancer. HAProxy constantly health-checks the Postgres nodes (using a Patroni REST API endpoint like /primary) and routes write traffic dynamically. This shields the application from knowing the underlying cluster topology.
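The health check that the load balancer performs can be approximated in a few lines. This is a sketch rather than HAProxy's own logic: `is_primary` and `write_backend` are hypothetical helpers, port 8008 is assumed to be where the Patroni REST API listens, and the check relies on `/primary` answering 200 only on the leader.

```python
import urllib.error
import urllib.request


def is_primary(host: str, port: int = 8008, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET /primary answers 200, as a load balancer check would."""
    url = f"http://{host}:{port}/primary"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. a 503 from a replica) is a URLError subclass;
        # timeouts and refused connections also land here.
        return False


def write_backend(hosts, check=is_primary):
    """Pick the single host that should receive write traffic, if any."""
    healthy = [h for h in hosts if check(h)]
    # Route writes only when exactly one node claims to be Primary.
    return healthy[0] if len(healthy) == 1 else None
```

Returning `None` when zero or several nodes pass the check is a deliberately conservative choice for this sketch: refusing writes for a few seconds is cheaper than sending them to the wrong node.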
5. Summary
| Strategy | Speed | Complexity | Risk |
|---|---|---|---|
| Manual | Slow (Minutes) | Low | High (Human Error, Split-brain) |
| Repmgr | Fast | Medium | Medium (Requires manual STONITH configuration) |
| Patroni + DCS | Fast (typically <30s) | High (Requires etcd/Consul cluster) | Low (DCS-enforced single leader, watchdog support) |