Capacity Planning: The Math of Shards
> [!NOTE]
> This module derives shard-sizing and capacity-planning rules from hardware constraints and JVM realities, rather than treating them as magic numbers.
1. The Goldilocks Zone: 10GB - 50GB
The #1 question: “How many shards should I have?”
The Rule: Keep shard size between 10GB and 50GB.
To understand why, we must look at the physical realities of the underlying hardware and the JVM architecture:
- Too Small (< 1GB) (The “Oversharding” Problem):
- Overhead: Each shard is a full Apache Lucene index. It consumes JVM heap memory for term dictionaries, thread pools for searching, and holds open file descriptors.
- The Math: A node can typically support 20 shards per GB of heap space. If you give Elasticsearch a 30GB heap, a node shouldn’t have more than 600 shards. Having thousands of 50MB shards exhausts the heap quickly.
- Symptom: “OOM (Out of Memory)”, frequent long Garbage Collection (GC) pauses, and “Cluster State Explosion” leading to slow master operations.
- Too Large (> 50GB):
- Recovery Hell: Moving a shard across the network during a node failure or rebalancing takes time. A 100GB shard can take the better part of an hour to transfer even on a 10Gbps link, because recovery is throttled and competes for disk I/O and CPU; a failed node holding many such shards can leave the cluster degraded for hours.
- Search Latency: Searching across a massive single shard requires more memory for aggregations and caching, often leading to cache thrashing.
- Symptom: “Cluster Red” status persists for hours after a node reboot, and p99 search latencies spike during heavy aggregation queries.
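Both limits above reduce to quick arithmetic. A sketch of the two calculations, assuming the 20-shards-per-GB-of-heap rule of thumb from this section and a 40 MB/s recovery throttle (a common default for `indices.recovery.max_bytes_per_sec`; your cluster's setting may differ):

```go
package main

import "fmt"

func main() {
	// Rule of thumb: ~20 shards per GB of JVM heap.
	heapGB := 30.0
	shardBudget := 20 * heapGB
	fmt.Printf("30GB heap -> budget of %.0f shards per node\n", shardBudget)

	// Recovery time for one shard, assuming a 40 MB/s recovery throttle.
	shardGB := 100.0
	throttleMBps := 40.0
	minutes := shardGB * 1024 / throttleMBps / 60
	fmt.Printf("100GB shard @ %.0f MB/s -> ~%.0f minutes to recover\n", throttleMBps, minutes)
}
```

At that throttle, a single oversized shard ties up recovery for ~43 minutes; a dead node holding a dozen of them is an afternoon of yellow/red cluster status.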
2. Hot-Warm-Cold Architecture
When dealing with time-series data (like logs or metrics), you don’t treat all data equally. Instead, you map hardware constraints to the data lifecycle.
- Hot Nodes (NVMe SSD, High CPU): Active writes and frequent, latency-sensitive searches. Typically the last 7-14 days of data.
- Warm Nodes (HDD/Cheap SSD): Read-only data. Queries here are infrequent and can tolerate slightly higher latencies (e.g., Day 15 to 30).
- Cold Nodes (S3 / searchable snapshots): “Frozen” indices for long-term compliance storage, backed by snapshots in object storage. Very slow searches, but massive cost savings.
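Allocation between tiers is driven by node attributes. A minimal sketch of the legacy attribute-based tagging that the ILM `allocate` action below keys on (the attribute name `data` is this module's convention; newer clusters typically use the built-in data tier roles such as `data_hot` and `data_warm` instead):

```yaml
# elasticsearch.yml on a hot-tier node
node.attr.data: hot

# elasticsearch.yml on a warm-tier node
node.attr.data: warm
```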
Automating the Lifecycle (ILM)
Elasticsearch automates this transition using Index Lifecycle Management (ILM) policies. ILM handles “Rollover” (creating a new hot index when the current one reaches 50GB or 30 days) and moving older indices to cheaper nodes.
```java
// Java: creating an ILM policy using the official Java API Client.
// Assumes an already-configured ElasticsearchClient named `client`.
import co.elastic.clients.elasticsearch._types.Time;
import co.elastic.clients.elasticsearch.ilm.PutLifecycleRequest;

PutLifecycleRequest request = PutLifecycleRequest.of(b -> b
    .name("logs_policy")
    .policy(p -> p
        .phases(ph -> ph
            .hot(h -> h
                .actions(a -> a
                    .rollover(r -> r
                        .maxSize("50gb")
                        .maxAge(Time.of(t -> t.time("30d")))
                    )
                )
            )
            .warm(w -> w
                .minAge(Time.of(t -> t.time("30d")))
                .actions(a -> a
                    .forcemerge(fm -> fm.maxNumSegments(1))
                    .shrink(s -> s.numberOfShards(1))
                    .allocate(al -> al.require("data", "warm"))
                )
            )
        )
    )
);
client.ilm().putLifecycle(request);
```
```go
// Go: creating an ILM policy with the official go-elasticsearch client,
// using raw JSON for the request body (the simplest path with the generic client).
package main

import (
	"context"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esapi"
)

func main() {
	esClient, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatalf("Error creating the client: %s", err)
	}

	policyJSON := `{
	  "policy": {
	    "phases": {
	      "hot": {
	        "actions": {
	          "rollover": {
	            "max_size": "50gb",
	            "max_age": "30d"
	          }
	        }
	      },
	      "warm": {
	        "min_age": "30d",
	        "actions": {
	          "forcemerge": { "max_num_segments": 1 },
	          "shrink": { "number_of_shards": 1 },
	          "allocate": { "require": { "data": "warm" } }
	        }
	      }
	    }
	  }
	}`

	req := esapi.ILMPutLifecycleRequest{
		Policy: "logs_policy",
		Body:   strings.NewReader(policyJSON),
	}
	res, err := req.Do(context.Background(), esClient)
	if err != nil {
		log.Fatalf("Error creating ILM policy: %s", err)
	}
	defer res.Body.Close()
}
```
3. Capacity Calculator
How many nodes do you need? The back-of-the-envelope formula:
Required Storage = Raw Data × (1 + Replicas) × 1.2 (indexing and operational overhead)
From the required storage you can derive the daily primary shard count (storage ÷ target shard size) and the node count (storage ÷ usable disk per node).
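The formula turns into a small calculator. A sketch with hypothetical inputs (10TB of raw data, 1 replica, 2TB disks), applying the 85% usable-disk limit described in the watermark section:

```go
package main

import (
	"fmt"
	"math"
)

// nodesRequired applies: storage = rawData * (1 + replicas) * 1.2 (overhead),
// then divides by the usable disk per node (85% of raw capacity, per the
// low-watermark rule) and rounds up.
func nodesRequired(rawDataGB float64, replicas int, nodeDiskGB float64) (storageGB float64, nodes int) {
	storageGB = rawDataGB * float64(1+replicas) * 1.2
	usablePerNode := nodeDiskGB * 0.85
	nodes = int(math.Ceil(storageGB / usablePerNode))
	return
}

func main() {
	// Hypothetical example: 10240GB raw logs, 1 replica, 2048GB disk per node.
	storage, nodes := nodesRequired(10240, 1, 2048)
	fmt.Printf("Storage needed: %.0f GB, nodes required: %d\n", storage, nodes)
}
```

Note that the replica multiplier doubles (or more) the raw footprint before the overhead factor is even applied, which is why node counts surprise teams who size only for primary data.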
4. The 85% Watermark
Elasticsearch monitors disk usage to prevent nodes from running out of space.
- Low Watermark (85%): Elasticsearch stops allocating new shards to the node.
- High Watermark (90%): Elasticsearch proactively attempts to move existing shards away from this node to others.
- Flood Stage Watermark (95%): Crisis mode. Elasticsearch enforces a read_only_allow_delete block on every index that has a shard on this node. Writes start failing with a cluster_block_exception (HTTP 429 on recent versions; older versions returned 403 Forbidden).
Lesson: Always provision a 15-20% free space buffer. The usable space on a 1TB drive is functionally only ~850GB before cluster operations degrade.
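Plugging a 1TB (1000GB, decimal) data disk into these default percentages gives the concrete thresholds, a sketch:

```go
package main

import "fmt"

func main() {
	// Default watermark percentages applied to a 1000GB data disk.
	diskGB := 1000.0
	low := diskGB * 0.85   // past this: no new shards allocated to the node
	high := diskGB * 0.90  // past this: existing shards relocated away
	flood := diskGB * 0.95 // past this: indices become read-only (allow-delete)
	fmt.Printf("low=%.0fGB high=%.0fGB flood=%.0fGB usable~%.0fGB\n", low, high, flood, low)
}
```

The low watermark is the number that matters for capacity math: everything beyond 850GB on that drive is headroom for rebalancing and merges, not usable storage.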