Module Review: Scaling, Reliability & Operations
[!NOTE] This review chapter consolidates the critical concepts of capacity planning, resilience, and multi-tenancy so you can confirm that you have internalized the hardware realities of distributed search.
1. Key Takeaways
- Shard Sizing: Keep shards between 10GB and 50GB. Many small shards cause “Oversharding” (per-shard Lucene overhead exhausts the JVM heap); oversized shards cause slow recoveries.
- ILM (Index Lifecycle Management): Automatically move data across Hot (NVMe), Warm (HDD), and Cold (S3) nodes based on age and hardware constraints to minimize costs.
- Split Brain: A network partition causing a cluster to elect two independent masters, leading to divergent, un-mergeable data.
- Quorum Rule: To elect a master, you need strictly more than half of the voting members: floor(N/2) + 1. Always run 3 (or 5, 7) master-eligible nodes. Never 2.
- Multi-Tenancy: “Index-per-Tenant” gives strong isolation but causes oversharding. “Shared Index with Custom Routing” provides high hardware density but requires strict `routing` parameters to avoid scatter-gather latency.
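The quorum and shard-sizing rules above reduce to a little arithmetic. The following is a minimal sketch, not an Elasticsearch API: the function names are hypothetical, and the 30GB default target is an assumption chosen only because it sits inside the 10GB to 50GB band.

```python
import math


def quorum(voting_nodes: int) -> int:
    """Smallest strict majority of the voting (master-eligible) nodes."""
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    return voting_nodes // 2 + 1  # floor(N/2) + 1


def tolerated_failures(voting_nodes: int) -> int:
    """Voting nodes that can be lost while a majority can still form."""
    return voting_nodes - quorum(voting_nodes)


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shards needed so each shard stays at or below target_shard_gb.

    target_shard_gb is an illustrative default, not an official value.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 7):
        print(f"{n} voting nodes: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
    print("600GB index ->", primary_shard_count(600), "primary shards")
```

With two voting nodes the quorum is also two, so losing either node blocks master election. That is why the rule says 3, never 2.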
3. Operations Cheat Sheet
Use this cheat sheet to remember key architectural limits and rules.
| Concept | Limit / Rule | Reason |
|---|---|---|
| Shard Size | 10GB → 50GB | Smaller wastes JVM heap (Lucene overhead). Larger slows down cluster recovery (network/disk transfer). |
| Quorum Math | Floor(N/2) + 1 | Prevents Split Brain. Ensures a strict majority of master-eligible nodes vote. |
| Minimum Masters | 3 Nodes | 2-node clusters cannot achieve High Availability under Quorum rules. |
| Low Watermark | 85% Disk Full | Elasticsearch stops allocating new shards to the node. |
| High Watermark | 90% Disk Full | Elasticsearch aggressively moves existing shards away from the node. |
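| Flood Stage | 95% Disk Full | Indices with a shard on the node become read-only except for deletes (`index.blocks.read_only_allow_delete`). Writes are rejected (HTTP 403 in older releases, 429 in newer ones). |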
| Cluster Yellow | Replicas Missing | Missing redundancy, but all data is fully readable and writable. |
| Cluster Red | Primary Missing | Hard outage for the affected indices. Data is lost or temporarily unavailable. |
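The disk watermark and cluster health rows in the table can be read as simple threshold checks. The sketch below is illustrative only: the function names are hypothetical, the thresholds are the defaults listed in the table, and treating each threshold as inclusive is an assumption made for simplicity.

```python
def disk_stage(used_pct: float,
               low: float = 85.0,
               high: float = 90.0,
               flood: float = 95.0) -> str:
    """Map a node's disk usage (percent) to the watermark stage it has reached."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if used_pct >= flood:
        return "flood_stage"      # writes to affected indices are blocked
    if used_pct >= high:
        return "high_watermark"   # shards are relocated away from the node
    if used_pct >= low:
        return "low_watermark"    # no new shards are allocated to the node
    return "ok"


def cluster_color(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Derive the health color from counts of unassigned shards."""
    if unassigned_primaries > 0:
        return "red"      # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # data is fully available, redundancy is reduced
    return "green"


if __name__ == "__main__":
    for pct in (70.0, 86.5, 92.0, 97.0):
        print(f"{pct}% used -> {disk_stage(pct)}")
    print(cluster_color(unassigned_primaries=0, unassigned_replicas=2))
```

Checking the most severe condition first keeps each function to one pass and means a node at 97% is reported as flood stage rather than merely past the low watermark.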
4. Next Steps
You have now mastered the operational and reliability principles of running Elasticsearch at scale.
- Review definitions in the Elasticsearch Glossary.
- Proceed to the next module: Data Pipelines & Ingestion.