Querying & Relevance Engineering — Review & Checklist

[!NOTE] This module reviews the core principles of querying and relevance engineering in Elasticsearch, with key takeaways, flashcards, and a cheat sheet for quick revision.

Key Takeaways

  1. Filter vs. Query Context: The fundamental dichotomy. Use filter context (yes/no, unscored, cacheable) for exact matches such as status, IDs, and date ranges. Use query context (scored, computationally more expensive) only when relevance ranking is required, as in full-text search.
  2. BitSet Caching: In filter context, Elasticsearch can cache the matching document set of frequently used filters as compact BitSets, so complex boolean logic reduces to fast bitwise AND operations.
  3. BM25 Scoring Fundamentals: _score is driven by Term Frequency (TF, which saturates quickly), Inverse Document Frequency (IDF, which rewards rarity), and Field Length Norm (which rewards shorter fields).
  4. Aggregations Architecture: Think of aggregations as SQL GROUP BY. They are divided into Buckets (grouping docs) and Metrics (calculating stats within buckets).
  5. Global Ordinals: For fast aggregations on keyword strings, Elasticsearch uses global ordinals (a mapping from strings to integers). By default they are built lazily, so the first aggregation after a refresh can be slow; set eager_global_ordinals to build them at refresh time when low latency matters.
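
To make the filter vs. query split in the first takeaway concrete, here is a minimal sketch of a bool query body built as a Python dict. The field names (title, status, published_at) and the helper build_search_body are hypothetical and chosen only for illustration; the clause names (bool, must, filter, match, term, range) are standard Query DSL.

```python
import json

def build_search_body(text, status, since):
    """Build a bool query: one scored full-text clause plus unscored filters."""
    return {
        "query": {
            "bool": {
                # Query context: this clause contributes to _score.
                "must": [{"match": {"title": text}}],
                # Filter context: yes/no matching only, no scoring, cacheable.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"published_at": {"gte": since}}},
                ],
            }
        }
    }

# Hypothetical inputs; the resulting dict is what a client would send as the search body.
body = build_search_body("relevance tuning", "published", "2024-01-01")
print(json.dumps(body, indent=2))
```

Only the match clause influences ranking here; the status and date constraints narrow the candidate set without affecting _score.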

Flashcards

Test your understanding of the core concepts.

Query Context

What question does this context answer, and what is the output?

Answers "How well does this match?"

Outputs a calculated `_score` (Float). It is slower than filter context because every matching document must be scored.

Filter Context

What question does this context answer, and how does it achieve high performance?

Answers "Does this match? (Yes/No)".

It skips scoring entirely, and the results of frequently used filters can be cached in memory-efficient BitSets for rapid boolean operations.
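
As a rough mental model of why cached BitSets make boolean logic cheap, the sketch below uses plain Python integers as bitsets, where bit i is set when document i matches. This is a conceptual illustration only, not how Lucene or Elasticsearch implement their bitsets; the filter names and document IDs are made up.

```python
def to_bitset(doc_ids):
    """Pack matching doc IDs into an int used as a bitset (bit i = doc i)."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def to_doc_ids(bits):
    """Unpack a bitset back into a sorted list of doc IDs."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Two hypothetical cached filter results over an 8-document segment.
status_published = to_bitset([0, 2, 3, 5, 7])
year_2024 = to_bitset([1, 2, 5, 6, 7])

# Combining both filters is a single bitwise AND, with no per-document scoring.
both = status_published & year_2024
matching = to_doc_ids(both)
print(matching)  # [2, 5, 7]
```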

BM25: TF Saturation

How does BM25 handle Term Frequency differently from Classic TF-IDF?

BM25 applies a non-linear saturation curve. Finding a term 100 times is only slightly better than finding it 10 times, preventing spammy documents from dominating.
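
The saturation effect can be checked numerically. The sketch below computes only the term-frequency component of the textbook BM25 formula, using the commonly cited defaults k1 = 1.2 and b = 0.75 and assuming a field of exactly average length. The function name bm25_tf is hypothetical, and Lucene's actual implementation differs in constant factors that do not change ranking.

```python
def bm25_tf(tf, k1=1.2, b=0.75, field_len=100, avg_field_len=100):
    """Term-frequency component of textbook BM25 (IDF is left out)."""
    length_norm = 1 - b + b * (field_len / avg_field_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# The contribution grows quickly at first, then flattens toward k1 + 1 = 2.2.
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_tf(tf), 4))

# A shorter field scores higher for the same term frequency.
short_field = bm25_tf(2, field_len=20)
long_field = bm25_tf(2, field_len=500)
```

With these settings, 10 occurrences contribute roughly 1.96 and 100 occurrences roughly 2.17, so the extra 90 occurrences add very little.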

Global Ordinals

What are they, and why are they critical for Aggregations?

A mapping of unique strings to integer IDs. They allow ES to group by integers rather than comparing string bytes, drastically speeding up bucket aggregations on high-cardinality fields.
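
The idea can be sketched in a few lines: assign each unique string an integer, count by integer, and translate back to strings only once at the end. This is a simplified model under the assumption of a single in-memory list of values; real global ordinals are built across segments, and the names used here (build_ordinals, doc_values) are hypothetical.

```python
def build_ordinals(values):
    """Assign each unique string an integer ordinal in sorted order."""
    terms = sorted(set(values))
    return terms, {term: i for i, term in enumerate(terms)}

# Hypothetical keyword values, one per document.
doc_values = ["shoes", "hats", "shoes", "bags", "hats", "shoes"]

terms, ordinal_of = build_ordinals(doc_values)
doc_ordinals = [ordinal_of[v] for v in doc_values]

# Bucketing is now an array increment per document, with no string comparison.
counts = [0] * len(terms)
for ordinal in doc_ordinals:
    counts[ordinal] += 1

# Resolve ordinals back to strings only when producing the final buckets.
buckets = {terms[i]: count for i, count in enumerate(counts)}
print(buckets)  # {'bags': 1, 'hats': 2, 'shoes': 3}
```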


Cheat Sheet

| Concept | The “Why” | When to Use |
| --- | --- | --- |
| bool query | Combines logic: must (score), filter (cache), should (boost), must_not (exclude). | The foundation of 99% of complex Elasticsearch queries. |
| BitSets | Arrays of 1s and 0s representing matched documents, combined with bitwise operations over whole machine words. | Underpins the speed of filter context. |
| BM25 | The math behind _score. Relies on TF, IDF, and Field Length. | The default scoring algorithm for full-text relevance. |
| Buckets | Bins documents (e.g., terms, date_histogram). Similar to SQL GROUP BY. | Creating faceted navigation or segmenting data. |
| Metrics | Calculates numbers (e.g., avg, sum) inside buckets. Similar to SQL SELECT AVG(). | Extracting statistics from grouped data. |
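
The Buckets and Metrics rows combine naturally in one request: a terms bucket with an avg metric nested inside it, roughly the equivalent of SELECT AVG(price) ... GROUP BY category. The sketch below builds that request body as a Python dict. The field names (category, price), the aggregation names (by_group, avg_value), and the helper build_facet_request are hypothetical; terms, avg, aggs, and size are standard request keys.

```python
import json

def build_facet_request(group_field, metric_field, hits=0, top_n=10):
    """Build a request with a terms bucket and a nested avg metric."""
    return {
        # hits=0 returns only the aggregation; a larger value also returns documents.
        "size": hits,
        "aggs": {
            "by_group": {
                # Bucket aggregation: one bucket per distinct value.
                "terms": {"field": group_field, "size": top_n},
                "aggs": {
                    # Metric aggregation: computed inside each bucket.
                    "avg_value": {"avg": {"field": metric_field}}
                },
            }
        },
    }

request = build_facet_request("category", "price")
print(json.dumps(request, indent=2))
```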

Quick Revision

  • Always prefer Filter Context unless you explicitly need documents ranked by relevance.
  • The bool query is your orchestrator: use filter for hard constraints and must/should for relevance.
  • BM25 rewards rarity and brevity: A rare word in a short field yields the highest score.
  • Aggregations ride along with search: A single request can return both the search results and the analytical summary in one round-trip (set size to 0 when only the summary is needed).
  • Beware the first aggregation penalty: If latency is critical, use eager_global_ordinals to pre-build the string-to-int mappings for aggregations.
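
For the last point, a minimal sketch of a mapping body that enables eager_global_ordinals on a keyword field is shown below. The field name category and the helper keyword_mapping are hypothetical. The trade-off to keep in mind is that the build cost moves from the first aggregation to refresh time, so it is best reserved for fields that are aggregated often.

```python
import json

def keyword_mapping(field, eager=True):
    """Build a mapping body for a keyword field with eager global ordinals."""
    return {
        "properties": {
            field: {
                "type": "keyword",
                # Build global ordinals at refresh time instead of on first use.
                "eager_global_ordinals": eager,
            }
        }
    }

mapping = keyword_mapping("category")
print(json.dumps(mapping, indent=2))
```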

Next Steps

Now that you understand how to query and rank documents efficiently, it’s time to learn how to scale the system that handles these requests.

Continue to Scaling & Operations

Need a refresher on specific terminology? View the Elasticsearch Glossary