Mapping: The Schema of Search

1. The Hook: “Schemaless” is a Lie

Elasticsearch is often described as “schemaless”. Reality: If you don’t define a schema (a Mapping), Elasticsearch guesses each field's type through dynamic mapping. And it often guesses WRONG.

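String → text + keyword sub-field (Every string is indexed twice, which can roughly double its footprint on disk).
Date-formatted string (e.g., ISO 8601 timestamp) → date (Good).
Floating point → float (Bad if you need more precision than a 32-bit float offers, e.g., for money).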
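The guessing rules above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's actual code: the function name `guess_field_mapping` and the `ISO_DATE` regex are assumptions made for illustration, and real date detection uses configurable date formats rather than a single pattern.

```python
import re

# Rough stand-in for default date detection (ISO-8601-like strings only).
# Assumption: real detection is driven by configurable date formats.
ISO_DATE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def guess_field_mapping(value):
    """Return roughly what default dynamic mapping would pick for a JSON value."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if ISO_DATE.match(value):
            return {"type": "date"}
        # Any other string is indexed twice: analyzed text plus a keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value type: {type(value).__name__}")


sample = {
    "status": "active",
    "created": "2024-01-15T10:30:00Z",
    "latency": 12.5,
    "retries": 3,
}
guessed = {name: guess_field_mapping(value) for name, value in sample.items()}
```

Note how `status`, which you would only ever filter on, still gets a full `text` field. An explicit mapping that declares it as `keyword` avoids that second copy.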

War Story: The Mapping Explosion

The Scenario: A company ingested JSON application logs where developers dynamically used user UUIDs as JSON keys (e.g., {"user_id_1234": "login_success"}).

The Catastrophe: Elasticsearch creates a new mapping field for every unique key. The “Cluster State” (which stores the mapping and must be synced to all nodes) grew to 500MB. Nodes spent all their CPU trying to sync the state, garbage collection pauses piled up, and the entire cluster collapsed.

The Fix: Never use unbounded dynamic keys. Restructure the document with nested objects or flatten it so the variable part becomes a value instead of a key ({"user": "1234", "action": "login_success"}), and set dynamic: "strict" to reject unexpected fields.
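As a minimal sketch of that fix, the Python below reshapes the problematic documents before indexing and declares a strict mapping. The helper names (`normalize_event`, `violates_strict`) and the `user_id_<id>` key pattern are assumptions taken from the example above, not part of any Elasticsearch client.

```python
import re

# Hypothetical key shape from the war story: "user_id_<id>" used as a JSON key.
DYNAMIC_KEY = re.compile(r"^user_id_(?P<user>.+)$")

# Fixed schema; "dynamic": "strict" makes indexing fail on any undeclared field.
STRICT_MAPPING = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "user": {"type": "keyword"},
            "action": {"type": "keyword"},
        },
    }
}


def normalize_event(raw):
    """Turn {"user_id_1234": "login_success"} into one fixed-shape doc per key."""
    events = []
    for key, value in raw.items():
        match = DYNAMIC_KEY.match(key)
        if match is None:
            raise ValueError(f"unexpected key: {key!r}")
        events.append({"user": match.group("user"), "action": value})
    return events


def violates_strict(doc, mapping=STRICT_MAPPING):
    """Client-side pre-check: list fields a strict mapping would reject."""
    allowed = mapping["mappings"]["properties"].keys()
    return sorted(set(doc) - set(allowed))
```

With this shape the mapping holds two fields no matter how many users exist, so the cluster state stays small.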

2. Text vs Keyword (The Billion Dollar Question)

Feature	`text` Field	`keyword` Field
Use Case	Full-text search (“find ‘fox’ in body”)	Exact filtering (status = "active")
Analysis	Tokenized (`"The Fox"` → `[the, fox]`)	Untouched (`"The Fox"` → `[The Fox]`)
Data Structure	Inverted Index	Inverted Index + Doc Values
Sorting/Aggs	Disabled by default (requires fielddata, which loads terms into JVM heap)	Fast (Uses Doc Values)
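The Analysis row is the one that trips people up, so here is a rough Python illustration. The function names are hypothetical, and `analyze_as_text` is only a crude approximation of the standard analyzer (split on word characters, lowercase); the real analyzer follows Unicode segmentation rules.

```python
import re


def analyze_as_text(value):
    """Crude stand-in for the standard analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def analyze_as_keyword(value):
    """keyword fields index the whole string as a single, unmodified term."""
    return [value]


def term_matches(indexed_terms, query_term):
    """An exact term lookup only hits if the term was indexed verbatim."""
    return query_term in indexed_terms
```

Looking up `fox` hits the `text` version of "The Fox" but not the `keyword` version, which only matches the full original string.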

Golden Rule:

Do you need to search for words inside it? → text
Do you need to Filter, Sort, or Aggregate? → keyword
Both? → Multi-field ("title": { "type": "text", "fields": { "raw": { "type": "keyword" } } })
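A sketch of the multi-field case, written as plain Python dictionaries that serialize to the JSON request bodies. The field name `title` and sub-field `raw` come from the rule above; the aggregation name `top_titles` is made up for the example.

```python
import json

# Multi-field: "title" is analyzed for full-text search,
# "title.raw" keeps the exact value for sorting and aggregations.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Full-text search targets the analyzed field; sorting uses the keyword sub-field.
sort_body = {
    "query": {"match": {"title": "fox"}},
    "sort": [{"title.raw": "asc"}],
}

# Aggregations also use the keyword sub-field.
agg_body = {
    "size": 0,
    "aggs": {"top_titles": {"terms": {"field": "title.raw"}}},
}

mapping_json = json.dumps(mapping, indent=2)
```

The cost is that the value is stored in both structures, so reserve multi-fields for fields that really need both behaviours.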

3. Storage Internals: How Mappings Hit the Disk

A. The Inverted Index (Search)

Maps Term → List<DocIDs>.

Analogy: Like the index at the back of a textbook. You look up a specific word (term) to find the exact pages (DocIDs) where it appears.
Used for: match, term queries.
Structure: Sorted list of terms (Trie/FST) pointing to Posting Lists.
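To make the Term → List<DocIDs> idea concrete, here is a toy inverted index in Python. It is a teaching sketch only: it uses whitespace tokenization and a plain dictionary, whereas Lucene stores terms in an FST and compresses the posting lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep terms sorted, mirroring the sorted term dictionary described above.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search_all(index, *terms):
    """AND query: intersect the posting lists of every term."""
    lists = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


docs = {1: "the quick fox", 2: "the lazy dog", 3: "quick brown fox"}
index = build_inverted_index(docs)
```

Here `index["fox"]` is `[1, 3]`, and `search_all(index, "the", "fox")` returns `[1]` because only document 1 contains both terms.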

B. Doc Values (Sorting & Aggregations)

Maps DocID → Value.

Analogy: Like a spreadsheet. Each row is a document, and each column is a field. Scanning a single column to calculate an average or sort is extremely fast.
This is a Columnar Store (similar in spirit to Parquet): all values of one field are stored together.
Used for: sort, aggs, script.
Hardware: Stored on disk, loaded into OS Page Cache.
Performance: Sequential access pattern. Fast!
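The spreadsheet analogy translates directly into code. In this toy Python model (the column names and values are invented), each field is one list and the list position is the doc ID; real doc values are compressed on-disk structures, not Python lists.

```python
# One array per field; position i holds the value for doc ID i.
columns = {
    "price": [30.0, 10.0, 20.0, 40.0],
    "status": ["active", "inactive", "active", "active"],
}


def average(column):
    """Aggregation: one sequential pass over a single column."""
    return sum(column) / len(column)


def sort_doc_ids(column, descending=False):
    """Sorting: order doc IDs by their value in the column."""
    return sorted(range(len(column)), key=column.__getitem__, reverse=descending)


def terms_agg(column):
    """Terms aggregation: count docs per distinct value."""
    counts = {}
    for value in column:
        counts[value] = counts.get(value, 0) + 1
    return counts
```

None of these functions touch any column other than the one they need, which is why this layout suits sorting and aggregations.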

C. BKD Trees (Numbers & Geo)

Elasticsearch 5.0 (built on Lucene 6) changed how numbers are indexed.

Old: Numbers were encoded as terms in the Inverted Index.
New: Block K-Dimensional (BKD) Trees.
Why: Optimized for range queries (price > 100).
Speed: Designed for multi-dimensional data (e.g., Lat/Lon + Date), where a one-dimensional structure such as a B-Tree struggles.
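The core trick behind fast range queries can be shown with a one-dimensional toy in Python. This is not Lucene's BKD implementation: it only illustrates the idea of sorting points into leaf blocks with min/max bounds so that whole blocks can be skipped or accepted. The function names and price values are invented for the example.

```python
def build_blocks(values, block_size=4):
    """Sort (value, doc_id) pairs and cut them into leaf blocks with bounds."""
    points = sorted((value, doc_id) for doc_id, value in enumerate(values))
    blocks = []
    for start in range(0, len(points), block_size):
        leaf = points[start:start + block_size]
        blocks.append({"min": leaf[0][0], "max": leaf[-1][0], "points": leaf})
    return blocks


def range_query(blocks, low, high):
    """Return doc IDs with low <= value <= high and the number of blocks scanned."""
    hits, scanned = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # Entirely outside the range: skip without reading points.
        if low <= block["min"] and block["max"] <= high:
            # Entirely inside the range: accept every point without comparing.
            hits.extend(doc_id for _, doc_id in block["points"])
            continue
        scanned += 1  # Boundary block: check each point individually.
        hits.extend(
            doc_id for value, doc_id in block["points"] if low <= value <= high
        )
    return sorted(hits), scanned


prices = [5, 250, 80, 120, 99, 101, 300, 15, 60, 180, 40, 220]
blocks = build_blocks(prices)
matches, scanned_blocks = range_query(blocks, 101, float("inf"))  # price > 100
```

For these twelve integer prices the query inspects points in only one of the three blocks; the other two are skipped or accepted wholesale based on their bounds.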

4. Interactive: Mapping Designer

See how your choice changes disk usage and capability.

Input Data: "User_123"

Default state, before any mapping option is enabled:

Inverted Index: Empty
Doc Values (Columnar): Disabled
Capabilities: Search ❌, Sort ❌, Aggs ❌
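The trade-off the designer explores can be approximated with a small Python function. This is a deliberately simplified model with a hypothetical name (`capabilities`): it covers only `text` and `keyword`, treats "search" as index-backed search, and ignores fielddata and any version-specific ability to query doc values directly.

```python
def capabilities(field_type, index=True, doc_values=True):
    """Report which operations a field supports under a simplified model."""
    if field_type == "text":
        # text fields have no doc values; sort/aggs would need fielddata.
        return {"search": index, "sort": False, "aggs": False}
    if field_type == "keyword":
        return {"search": index, "sort": doc_values, "aggs": doc_values}
    raise ValueError(f"unsupported field type: {field_type}")
```

For example, `capabilities("keyword", index=False)` reports no search but working sort and aggregations, while turning off both options leaves the field with no capabilities at all.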

5. Staff Tip: Disabling Fields

Most JSON logs have fields you never search, for example "user_agent": "Mozilla/5.0 ...". Do you search this? Probably not. Optimize by dropping the inverted index and keeping doc values:

"user_agent": {
  "type": "keyword",
  "index": false,    // No Inverted Index (Save Disk)
  "doc_values": true // Still Aggregatable (Top User Agents)
}

Result: Around 30% disk savings on high-volume log clusters. The exact figure depends on how many fields you can stop indexing.
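When many fields follow this pattern, generating the mapping is less error-prone than writing it by hand. The Python sketch below builds a request body as a dictionary; the helper names and the `status` field are assumptions for illustration. Serializing with `json.dumps` also produces strict JSON, which has no comments, unlike the annotated snippet above.

```python
import json


def aggregate_only_keyword():
    """keyword field with no inverted index but with doc values kept."""
    return {"type": "keyword", "index": False, "doc_values": True}


def build_log_mapping(searchable, aggregate_only):
    """Assemble index mappings; both arguments are lists of field names."""
    properties = {name: {"type": "keyword"} for name in searchable}
    properties.update({name: aggregate_only_keyword() for name in aggregate_only})
    return {"mappings": {"properties": properties}}


body = build_log_mapping(searchable=["status"], aggregate_only=["user_agent"])
request_json = json.dumps(body, indent=2)
```

The resulting `request_json` string can be sent as the body of an index-creation request with whatever HTTP or Elasticsearch client you already use.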
