Mapping: The Schema of Search


1. The Hook: “Schemaless” is a Lie

Elasticsearch claims to be “schemaless”. Reality: If you don’t define a schema (Mapping), Elasticsearch guesses. And it usually guesses WRONG.

  • String → text + keyword sub-field (the value is indexed twice, which can roughly double that field’s disk footprint).
  • Timestamp → date (Good).
  • Floating point → float (Bad if you only need a fixed precision, e.g. prices in cents; scaled_float is more compact).
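To make the guessing concrete, here is a rough Python sketch of what dynamic mapping does with each JSON value. It is a simplification, not Elasticsearch's actual code: the function name `guess_mapping` and the single ISO-8601 pattern are assumptions made for illustration, while the text + keyword result with `ignore_above: 256` reflects the default dynamic string mapping.

```python
import re

# Assumption: one ISO-8601-style pattern stands in for Elasticsearch's
# default date detection, which accepts a few more formats.
ISO_DATE = re.compile(
    r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?"
)

def guess_mapping(value):
    """Return roughly the field mapping that dynamic mapping would pick."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if ISO_DATE.fullmatch(value):
            return {"type": "date"}
        # Every other string is indexed twice: analyzed text plus a keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"type": "object"}
    raise TypeError(f"unsupported JSON value: {value!r}")
```

Calling `guess_mapping("login_success")` returns the double-indexed text + keyword mapping, which is usually more than a status code or an ID needs.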

War Story: The Mapping Explosion

The Scenario: A company ingested JSON application logs in which developers used user UUIDs as JSON keys (e.g., {"user_id_1234": "login_success"}).

The Catastrophe: Elasticsearch creates a new mapping field for every unique key. The “Cluster State” (which stores the mapping and must be synced to all nodes) grew to 500MB. Nodes spent all their CPU trying to sync the state, garbage-collection pauses piled up, and the entire cluster collapsed.

The Fix: Never use unbounded dynamic keys. Use nested objects or flatten them into fixed fields ({"user": "1234", "action": "login_success"}), and set dynamic: "strict" to reject unexpected fields.
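A minimal sketch of the fix, assuming the keys follow the `user_id_<id>` pattern from the example. The helper `flatten_event`, the regex, and the choice of `keyword` for both fields are illustrative assumptions, not part of the original incident.

```python
import re

# Hypothetical key pattern taken from the example: "user_id_<id>".
USER_KEY = re.compile(r"^user_id_(?P<user>.+)$")

def flatten_event(raw):
    """Turn {"user_id_1234": "login_success"} into documents with fixed fields."""
    events = []
    for key, value in raw.items():
        match = USER_KEY.match(key)
        if match is None:
            raise ValueError(f"unexpected key: {key!r}")
        # The variable part moves from the key into a value, so the
        # mapping never grows beyond "user" and "action".
        events.append({"user": match.group("user"), "action": value})
    return events

# Index body with a bounded set of fields. With dynamic set to "strict",
# Elasticsearch rejects documents that contain any field not listed here.
STRICT_INDEX_BODY = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "user": {"type": "keyword"},
            "action": {"type": "keyword"},
        },
    }
}
```

Flattening happens in the ingest pipeline or the application, before the document reaches Elasticsearch; the strict mapping is the safety net if a producer slips.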


2. Text vs Keyword (The Billion Dollar Question)

| Feature | text Field | keyword Field |
| --- | --- | --- |
| Use Case | Full-text search (“find ‘fox’ in body”) | Exact filtering (status = "active") |
| Analysis | Tokenized ("The Fox" → [the, fox]) | Untouched ("The Fox" → [The Fox]) |
| Data Structure | Inverted Index | Inverted Index + Doc Values |
| Sorting/Aggs | Disabled by default (Too much RAM) | Fast (Uses Doc Values) |

Golden Rule:

  • Do you need to search for words inside it? → text
  • Do you need to Filter, Sort, or Aggregate? → keyword
  • Both? → Multi-field ("title": { "type": "text", "fields": { "raw": { "type": "keyword" } } })
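The difference between the two field types can be sketched in a few lines of Python. The two `analyze_*` helpers are rough stand-ins (the real standard analyzer follows Unicode segmentation rules), and the query bodies assume the `title` / `title.raw` multi-field from the rule above.

```python
import re

def analyze_text(value):
    """Rough stand-in for the standard analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def analyze_keyword(value):
    """keyword fields index the whole value as one unchanged term."""
    return [value]

# Multi-field mapping: "title" is analyzed text, "title.raw" is the exact value.
TITLE_MAPPING = {
    "title": {
        "type": "text",
        "fields": {"raw": {"type": "keyword"}},
    }
}

# Full-text search goes against the analyzed field ...
full_text_query = {"match": {"title": "fox"}}
# ... while exact filters, sorts, and aggregations target the keyword sub-field.
exact_filter = {"term": {"title.raw": "The Fox"}}
sort_clause = [{"title.raw": "asc"}]
```

Note that a `term` query for "fox" against `title.raw` would not match "The Fox", because the keyword term is the entire original string.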

3. Storage Internals: How Mappings Hit the Disk

A. Inverted Index (Search)

Maps Term → List<DocIDs>.

  • Analogy: Like the index at the back of a textbook. You look up a specific word (term) to find the exact pages (DocIDs) where it appears.
  • Used for: match, term queries.
  • Structure: Sorted list of terms (Trie/FST) pointing to Posting Lists.
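As a toy illustration of the term-to-postings idea, the sketch below builds an inverted index with a plain dictionary. Lucene's real term dictionary and compressed posting lists are far more elaborate; the function name and the whitespace tokenizer are assumptions for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):                       # ascending IDs keep postings sorted
        for term in set(docs[doc_id].lower().split()):
            postings[term].append(doc_id)
    return dict(sorted(postings.items()))             # terms kept in sorted order

docs = {0: "The quick fox", 1: "The lazy dog", 2: "Fox and dog"}
index = build_inverted_index(docs)

# A single-term lookup is one dictionary access.
fox_docs = index["fox"]                               # [0, 2]
# A two-term AND is an intersection of posting lists.
fox_and_dog = sorted(set(index["fox"]) & set(index["dog"]))   # [2]
```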

B. Doc Values (Sorting & Aggregations)

Maps DocID → Value.

  • Analogy: Like a spreadsheet. Each row is a document, and each column is a field. Scanning a single column to calculate an average or sort is extremely fast.
  • This is a Columnar Store (like Parquet).
  • Used for: sort, aggs, script.
  • Hardware: Stored on disk, loaded into OS Page Cache.
  • Performance: Sequential access pattern. Fast!
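The spreadsheet analogy can be sketched as one Python list per field, where position i holds the value for doc ID i. The field name `response_ms` and both helpers are hypothetical; real doc values are compressed, disk-backed columns.

```python
from statistics import mean

# One column per field; index i is the value for doc ID i.
response_ms = [120, 35, 480, 75]

def sort_doc_ids(column, descending=False):
    """Return doc IDs ordered by their value in this column."""
    return sorted(range(len(column)), key=column.__getitem__, reverse=descending)

def average(column):
    """An aggregation is a single sequential pass over one column."""
    return mean(column)

fastest_first = sort_doc_ids(response_ms)             # [1, 3, 0, 2]
avg_latency = average(response_ms)                    # 177.5
```

Neither operation touches any other field of the documents, which is why scanning a column is so much cheaper than loading whole documents.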

C. BKD Trees (Numbers & Geo)

Elasticsearch 5.0+ changed everything for numbers.

  • Old: Numbers were strings in the Inverted Index.
  • New: Block K-Dimensional (BKD) Trees.
  • Why: Optimized for range queries (price > 100).
  • Speed: Faster than B-Trees for multi-dimensional data (e.g., Lat/Lon + Date).
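The core trick behind fast range queries can be sketched in one dimension: sort the values, cut them into leaf blocks, and record each block's min and max so whole blocks can be skipped or accepted without inspecting their values. This is a deliberately simplified illustration, not Lucene's BKD implementation, which arranges the blocks in a tree and handles several dimensions; all names here are assumptions.

```python
def build_blocks(points, leaf_size=2):
    """points: list of (value, doc_id). Returns leaf blocks sorted by value."""
    ordered = sorted(points)
    blocks = []
    for start in range(0, len(ordered), leaf_size):
        leaf = ordered[start:start + leaf_size]
        blocks.append({"min": leaf[0][0], "max": leaf[-1][0], "points": leaf})
    return blocks

def range_query(blocks, low, high):
    """Return the doc IDs whose value lies in [low, high]."""
    hits = []
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue                                   # entirely outside: skip the block
        if low <= block["min"] and block["max"] <= high:
            # Entirely inside: accept every point without checking values.
            hits.extend(doc for _, doc in block["points"])
        else:
            # Partial overlap: only here are individual values compared.
            hits.extend(doc for value, doc in block["points"] if low <= value <= high)
    return sorted(hits)

prices = [(50, 0), (120, 1), (99, 2), (300, 3), (101, 4), (7, 5)]
blocks = build_blocks(prices)
over_100 = range_query(blocks, 101, float("inf"))      # price > 100 for integer prices
```

Here the block holding 7 and 50 is skipped outright and the block holding 120 and 300 is accepted whole; only the block spanning 99 to 101 needs per-value checks.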

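4. Interactive: Mapping Designer

[Interactive widget: pick a field mapping to see how your choice changes disk usage and capability. It shows whether the Inverted Index and Doc Values (Columnar) are built, and whether Search, Sort, and Aggs are available. With nothing selected, all three capabilities are off.]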

5. Staff Tip: Disabling Fields

Most JSON logs have fields you never search, for example "user_agent": "Mozilla/5.0 ...". Do you search this? Probably not. Optimize:

"user_agent": {
  "type": "keyword",
  "index": false,    // No Inverted Index (Save Disk)
  "doc_values": true // Still Aggregatable (Top User Agents)
}
  • Result: 30% Disk Savings on high-volume log clusters.
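To reason about what a mapping like this gives up and keeps, a small Python helper can derive the on-disk structures and resulting capabilities from the `type`, `index`, and `doc_values` settings. The function is a simplified model under stated assumptions (both settings default to true, and text fields have no doc values); it is not an Elasticsearch API.

```python
def capabilities(field_mapping):
    """Derive which structures a field builds and what they enable (simplified)."""
    field_type = field_mapping["type"]
    indexed = field_mapping.get("index", True)
    # Assumption: text fields never have doc values; other types default to True.
    doc_values = field_type != "text" and field_mapping.get("doc_values", True)
    return {
        "inverted_index": indexed,
        "doc_values": doc_values,
        "search": indexed,        # fast, index-backed search
        "sort": doc_values,
        "aggs": doc_values,
    }

user_agent = {"type": "keyword", "index": False, "doc_values": True}
result = capabilities(user_agent)
# result["search"] is False, result["aggs"] is True
```

For the `user_agent` mapping, the helper reports no inverted index and no indexed search, while sorting and aggregations stay available through doc values.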