Foundations — Review & Checklist
[!NOTE] This module reviews the core principles covered in Foundations: the inverted index, the hardware constraints that shape it, sharding and replication, and the indexing lifecycle.
1. Key Takeaways
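- The Inverted Index: Instead of mapping document IDs to text, Elasticsearch maps each term to the list of document IDs that contain it. A query looks the term up directly, in time that is nearly independent of the number of documents, instead of scanning every document (O(N)).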
- Hardware Physics: Elasticsearch avoids slow random disk I/O where it can. Segments are immutable and written sequentially, and frequently used segment data is served from the filesystem cache, so most lookups and posting-list intersections happen in memory.
- Horizontal Scalability: Data is partitioned into Shards (mini search engines) distributed across multiple Nodes, so storage capacity and query parallelism grow as nodes are added.
- High Availability: Replica Shards provide redundant copies of Primary Shards, enabling failover with minimal interruption when a node is lost, and increased read throughput.
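- Indexing Lifecycle: Each write is added to the Memory Buffer (not searchable, not safe) and appended to the Translog (safe once it is synced to disk). A Refresh turns the buffer into a new Segment in the filesystem cache (searchable, but not yet committed to disk). A Flush commits segments to disk and clears the Translog (searchable, safe on disk).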
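The inverted index from the first takeaway can be sketched in a few lines of Python. This is a toy illustration only: it assumes whitespace tokenization and lowercase matching, whereas Elasticsearch's real analysis and on-disk structures are far more involved. The names `build_inverted_index` and `search_all` are hypothetical helpers, not part of any Elasticsearch API.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return IDs of documents that contain every term in the query (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}
index = build_inverted_index(docs)
print(index["brown"])                  # [1, 2]
print(search_all(index, "quick fox"))  # [1, 3]
```

Each query term costs one dictionary lookup plus a set intersection over the matching posting lists, so the work depends on how many documents match rather than on how many documents exist.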
2. Flashcards
What is an Inverted Index?
A data structure mapping terms (words) to the list of documents containing them, enabling direct term lookups instead of scanning every document.
What is the difference between a Shard and a Replica?
A Shard is a data partition (Lucene index). A Replica is an exact copy for high availability and read scaling.
What happens during a Refresh?
Documents in the memory buffer are written to a new Segment in the filesystem cache, making them searchable.
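The refresh and flush behavior described above can be modeled with a small in-memory class. This is a simplified sketch, not how Elasticsearch or Lucene is implemented: `ToyShard` and its attributes are hypothetical, "durable" is represented by a counter only, and the toy flush writes out any buffered documents before committing.

```python
class ToyShard:
    """Toy model of the write path: buffer -> translog -> refresh -> flush."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer: not searchable
        self.translog = []   # operations recorded since the last flush
        self.segments = []   # immutable, searchable segments
        self.committed = 0   # how many segments a flush has made durable

    def index(self, doc):
        # A write lands in the buffer and is also recorded in the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn buffered documents into a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Commit all segments and clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, word):
        # Only documents that reached a segment are visible to search.
        return [doc for seg in self.segments for doc in seg if word in doc.split()]


shard = ToyShard()
shard.index("hello world")
print(shard.search("hello"))  # [] because the document is still in the buffer
shard.refresh()
print(shard.search("hello"))  # ['hello world']
shard.flush()
print(len(shard.translog), shard.committed)  # 0 1
```

The gap between `index` and `refresh` is the reason search is called near real-time: a document is accepted immediately but only becomes visible after the next refresh.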
3. Cheat Sheet
| Concept | Purpose | Analogy |
|---|---|---|
| Inverted Index | Fast text search lookup | Book index at the back |
| Cluster | Collection of all nodes | The entire company |
| Node | Single JVM server instance | A single employee |
| Shard | Horizontal data partition | A specialized department |
| Replica | Copy of a primary shard | The backup department |
| Segment | Immutable disk file | A finalized filing cabinet |
| Refresh | Makes data searchable | Printing temporary documents |
| Flush | Makes data durable on disk | Filing documents permanently |
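To make the Shard, Replica, and Node rows concrete, here is a minimal sketch of how a document might be routed to a primary shard and how copies might be spread across nodes. It rests on stated assumptions: CRC32 stands in for Elasticsearch's own routing hash, the round-robin placement is invented for illustration (the real allocator weighs many more factors), and `route_to_shard`, `place_shards`, and the node names are hypothetical.

```python
import zlib


def route_to_shard(doc_id, num_primary_shards):
    """Pick a primary shard by hashing the routing value and taking the modulo."""
    # Stand-in hash: Elasticsearch uses its own hash function, not CRC32.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards, num_replicas, nodes):
    """Round-robin placement that keeps a replica off its primary's node."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    layout = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replicas = [nodes[(shard + r + 1) % len(nodes)] for r in range(num_replicas)]
        layout[shard] = {"primary": primary, "replicas": replicas}
    return layout


nodes = ["node-a", "node-b", "node-c"]
layout = place_shards(num_primary_shards=3, num_replicas=1, nodes=nodes)
for shard, copies in layout.items():
    print(shard, copies)

shard = route_to_shard("user-42", 3)
print("user-42 is stored on shard", shard, "hosted by", layout[shard]["primary"])
```

Because the shard number depends on the modulo of the primary shard count, changing that count would send existing documents to different shards. That is one reason the number of primary shards is normally chosen when the index is created.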
4. Quick Revision
- The Problem with SQL: `LIKE '%text%'` requires full table scans (O(N)), causing high latency for search operations.
- Elasticsearch Scale: An Index is just a logical namespace. Shards do the actual work. You can scale horizontally by distributing Shards across Nodes.
- Failover: If a node holding a Primary Shard dies, one of its Replicas is promoted to Primary, so the data stays available.
- Performance Trade-offs: You can increase `refresh_interval` for better indexing throughput, at the cost of new documents taking longer to become searchable.
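As an illustration of the `refresh_interval` trade-off, the sketch below builds the request for the update index settings API without sending it. The index name `my-index` and the helper `refresh_interval_request` are placeholders chosen for this example. The 30s value is arbitrary and should be tuned for the workload.

```python
import json


def refresh_interval_request(index_name, interval):
    """Return (method, path, body) for an update-index-settings request."""
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


# Refresh less often during a bulk load: fewer small segments, higher throughput,
# but new documents can take up to 30 seconds to appear in search results.
method, path, body = refresh_interval_request("my-index", "30s")
print(method, path, body)
# PUT /my-index/_settings {"index": {"refresh_interval": "30s"}}
```

A common pattern is to raise the interval before a large import and set it back to a short value such as 1s afterwards, so interactive searches see fresh data again.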
5. Next Steps
Continue to the next module to learn about mapping and analysis: Elasticsearch course index.
Don’t forget to check the Elasticsearch Glossary if you need a refresher on the terminology used in this module!