Mapping: The Schema of Search
1. The Hook: “Schemaless” is a Lie
Elasticsearch claims to be “schemaless”. Reality: If you don’t define a schema (Mapping), Elasticsearch guesses. And it usually guesses WRONG.
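- String → text + keyword (indexed twice, which can roughly double the field's disk footprint).
- Timestamp → date (Good).
- Floating point → float (Bad if you only need limited precision: a smaller type such as half_float or scaled_float would do).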
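To make the guessing concrete, here is a minimal Python sketch of the kind of rules dynamic mapping applies to a single JSON value. It is an approximation, not Elasticsearch's implementation: `guess_field_mapping` is a hypothetical helper, and `datetime.fromisoformat` only stands in for the real date-detection formats.

```python
from datetime import datetime


def guess_field_mapping(value):
    """Roughly imitate what dynamic mapping picks for a single JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        try:
            # Stand-in for date detection; the real format list differs.
            datetime.fromisoformat(value)
            return {"type": "date"}
        except ValueError:
            # Default for strings: analyzed text plus an exact keyword sub-field.
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    raise TypeError(f"unsupported JSON value: {value!r}")


# Hypothetical log document used only for illustration.
doc = {"message": "disk almost full", "created": "2024-01-15T10:30:00", "load": 0.75}
guessed = {field: guess_field_mapping(value) for field, value in doc.items()}
```

Every plain string ends up with two index structures (text and keyword), which is where the extra disk usage comes from.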
War Story: The Mapping Explosion
The Scenario: A company ingested JSON application logs where developers dynamically used user UUIDs as JSON keys (e.g., {"user_id_1234": "login_success"}).
The Catastrophe: Elasticsearch creates a new mapping field for every unique key. The “Cluster State” (which stores the mapping and must be synced to all nodes) grew to 500MB. Nodes spent all their CPU trying to sync the state, garbage-collection pauses piled up, and the entire cluster collapsed.
The Fix: Never use unbounded dynamic keys. Use nested objects or flatten them ({"user": "1234", "action": "login_success"}), and set dynamic: "strict" to reject unexpected fields.
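The difference between the two document shapes can be sketched in a few lines of Python. This is a toy model that only counts distinct top-level keys; `mapped_fields` and the 500 user IDs are made up for illustration, and the `strict_mapping` dict shows one plausible shape for the fixed schema.

```python
def mapped_fields(docs):
    """Collect every distinct top-level key, as dynamic mapping would."""
    fields = set()
    for doc in docs:
        fields.update(doc.keys())
    return fields


user_ids = range(1000, 1500)  # 500 hypothetical users

# Anti-pattern: the user ID is part of the key, so every user adds a field.
dynamic_key_docs = [{f"user_id_{uid}": "login_success"} for uid in user_ids]

# Fix: the user ID is a value, so the mapping stays at two fields.
flat_docs = [{"user": str(uid), "action": "login_success"} for uid in user_ids]

# With dynamic set to "strict", documents with unlisted fields are rejected.
strict_mapping = {
    "dynamic": "strict",
    "properties": {
        "user": {"type": "keyword"},
        "action": {"type": "keyword"},
    },
}

print(len(mapped_fields(dynamic_key_docs)))  # 500
print(len(mapped_fields(flat_docs)))         # 2
```

The field count in the first shape grows with the number of users; in the second it stays constant no matter how many users log in.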
2. Text vs Keyword (The Billion Dollar Question)
| Feature | text Field | keyword Field |
|---|---|---|
| Use Case | Full-text search ("find 'fox' in body") | Exact filtering ("status='active'") |
| Analysis | Tokenized ("The Fox" → [the, fox]) | Untouched ("The Fox" → [The Fox]) |
| Data Structure | Inverted Index | Inverted Index + Doc Values |
| Sorting/Aggs | Disabled by default (Too much RAM) | Fast (Uses Doc Values) |
Golden Rule:
- Do you need to search for words inside it? → text
- Do you need to Filter, Sort, or Aggregate? → keyword
- Both? → Multi-field ("title": { "type": "text", "fields": { "raw": { "type": "keyword" } } })
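The Analysis row of the table can be illustrated with a toy analyzer in Python. This is a rough sketch: `analyze_text` only lowercases and splits on non-alphanumeric characters, which approximates but does not reproduce the default analyzer, and both function names are hypothetical.

```python
import re


def analyze_text(value):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", value.lower()) if token]


def analyze_keyword(value):
    """keyword fields are not analyzed: the whole value is a single term."""
    return [value]


# Multi-field mapping from the Golden Rule:
# search on "title", sort or aggregate on "title.raw".
title_mapping = {
    "title": {"type": "text", "fields": {"raw": {"type": "keyword"}}}
}

print(analyze_text("The Fox"))     # ['the', 'fox']
print(analyze_keyword("The Fox"))  # ['The Fox']
```

A query for "fox" can match the first term list but not the second, which is why exact filters belong on the keyword side.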
3. Storage Internals: How Mappings Hit the Disk
A. The Inverted Index (Search)
Maps Term → List<DocIDs>.
- Analogy: Like the index at the back of a textbook. You look up a specific word (term) to find the exact pages (DocIDs) where it appears.
- Used for: match, term queries.
- Structure: Sorted list of terms (Trie/FST) pointing to Posting Lists.
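The Term → List<DocIDs> idea fits in a few lines of Python. This is a toy in-memory version with whitespace tokenization and made-up documents; it says nothing about how Lucene actually encodes terms or posting lists on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to a sorted list of the doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted terms and sorted posting lists mirror the real layout in spirit only.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick fox jumps"}
index = build_inverted_index(docs)

print(index["fox"])          # [1, 3]
print(index["the"])          # [1, 2]
print(index.get("cat", []))  # []
```

Looking up a term is a single dictionary access, regardless of how many documents exist, which is the property the textbook-index analogy describes.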
B. Doc Values (Sorting & Aggregations)
Maps DocID → Value.
- Analogy: Like a spreadsheet. Each row is a document, and each column is a field. Scanning a single column to calculate an average or sort is extremely fast.
- This is a Columnar Store (like Parquet).
- Used for: sort, aggs, script.
- Hardware: Stored on disk, loaded into OS Page Cache.
- Performance: Sequential access pattern. Fast!
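The spreadsheet analogy can be sketched as plain Python lists, one per field. This is an in-memory illustration with made-up documents; real doc values are compressed on-disk structures, which this does not attempt to model.

```python
# Row-oriented view: one dict per document, keyed by doc ID.
docs = {
    0: {"status": "active", "price": 120.0},
    1: {"status": "inactive", "price": 80.0},
    2: {"status": "active", "price": 100.0},
}

# Columnar view: one list per field, where position i holds the value for doc ID i.
doc_values = {
    field: [docs[doc_id][field] for doc_id in sorted(docs)]
    for field in ("status", "price")
}

# Aggregation: one sequential pass over a single column.
average_price = sum(doc_values["price"]) / len(doc_values["price"])

# Sort: order doc IDs by their value in the column.
ids_by_price = sorted(
    range(len(doc_values["price"])), key=doc_values["price"].__getitem__
)

print(average_price)  # 100.0
print(ids_by_price)   # [1, 2, 0]
```

Neither operation touches the "status" column, which is the point of storing each field separately.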
C. BKD Trees (Numbers & Geo)
Elasticsearch 5.0+ changed everything for numbers.
- Old: Numbers were encoded as terms in the Inverted Index, like strings.
- New: Block K-Dimensional (BKD) Trees.
- Why: Optimized for range queries (price > 100).
- Speed: Faster than B-Trees for multi-dimensional data (e.g., Lat/Lon + Date).
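To see why block-based structures suit range queries, here is a deliberately simplified one-dimensional sketch in Python. It is not a real BKD tree (there is no tree and only one dimension): it just sorts values into fixed-size blocks and uses each block's min and max to skip or accept whole blocks. The function names, block size, and prices are all illustrative assumptions.

```python
def build_blocks(points, block_size=4):
    """Sort (value, doc_id) pairs and cut them into fixed-size leaf blocks."""
    ordered = sorted(points)
    blocks = []
    for start in range(0, len(ordered), block_size):
        leaf = ordered[start:start + block_size]
        # Keep each block's min and max so whole blocks can be skipped or accepted.
        blocks.append({"min": leaf[0][0], "max": leaf[-1][0], "points": leaf})
    return blocks


def range_query(blocks, low, high):
    """Return the sorted doc IDs whose value lies in [low, high]."""
    hits = []
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # block is entirely outside the range
        if low <= block["min"] and block["max"] <= high:
            # Block is entirely inside the range: take every point unchecked.
            hits.extend(doc_id for _, doc_id in block["points"])
        else:
            # Block straddles a boundary: check its points one by one.
            hits.extend(
                doc_id for value, doc_id in block["points"] if low <= value <= high
            )
    return sorted(hits)


# (price, doc_id) pairs with integer prices.
prices = [(30, 0), (250, 1), (101, 2), (99, 3), (100, 4), (180, 5), (75, 6), (120, 7)]
blocks = build_blocks(prices)

# price > 100 on integer prices is the inclusive range [101, +inf).
print(range_query(blocks, 101, float("inf")))  # [1, 2, 5, 7]
```

For this query the first block is skipped and the second is accepted wholesale, so no individual value is compared against the bounds.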
4. Interactive: Mapping Designer
See how your choice changes disk usage and capability.
[Interactive widget: toggle the Inverted Index and Doc Values (Columnar) options to see which capabilities (Search, Sort, Aggs) each choice enables.]
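The trade-off this section explores can be written as a small Python function. It is a simplified model based only on the storage section above (search needs the inverted index; sort and aggs need doc values); `capabilities` is a hypothetical helper, and actual behaviour can vary by field type and Elasticsearch version.

```python
def capabilities(index=True, doc_values=True):
    """Which operations a field supports under the simplified model above."""
    return {
        "search": index,     # served by the inverted index
        "sort": doc_values,  # served by doc values
        "aggs": doc_values,  # served by doc values
    }


print(capabilities(index=True, doc_values=True))
# {'search': True, 'sort': True, 'aggs': True}
print(capabilities(index=False, doc_values=True))
# {'search': False, 'sort': True, 'aggs': True}
print(capabilities(index=False, doc_values=False))
# {'search': False, 'sort': False, 'aggs': False}
```

Each structure you switch off saves disk but removes the capabilities that depend on it.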
5. Staff Tip: Disabling Fields
Most JSON logs have fields you never search.
"user_agent": "Mozilla/5.0 ..."
Do you search this? Probably not.
Optimize:
"user_agent": {
  "type": "keyword",
  "index": false,      // No Inverted Index (Save Disk)
  "doc_values": true   // Still Aggregatable (Top User Agents)
}
- Result: 30% Disk Savings on high-volume log clusters.
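As a sketch of how this might look end to end, the Python below builds a create-index body containing the field above, plus a terms aggregation for the "Top User Agents" use case. The `message` field, the function name, and the aggregation name are assumptions for illustration; the bodies are only printed, not sent to a cluster.

```python
import json


def log_index_body():
    """Build a create-index body with a doc-values-only user_agent field."""
    return {
        "mappings": {
            "properties": {
                # Hypothetical searchable field, included for contrast.
                "message": {"type": "text"},
                # Not searchable via the inverted index, still aggregatable.
                "user_agent": {
                    "type": "keyword",
                    "index": False,
                    "doc_values": True,
                },
            }
        }
    }


# Terms aggregation that relies only on doc values.
top_user_agents_query = {
    "size": 0,
    "aggs": {"top_user_agents": {"terms": {"field": "user_agent", "size": 10}}},
}

body = log_index_body()
print(json.dumps(body, indent=2))
print(json.dumps(top_user_agents_query, indent=2))
```

Serializing through `json.dumps` also drops the inline `//` comments, which strict JSON parsers do not accept.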