Aggregations: Real-Time Analytics

[!NOTE] This module explores the core principles of Aggregations: Real-Time Analytics, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Pivot: From Search to SQL Group By

Imagine you’re building Amazon’s search page. The user types “laptop”. Finding the 10,000 matching laptops is just the beginning. The real magic—and the real technical challenge—is instantly rendering that left-hand sidebar: “Brand (Dell: 4,000, Apple: 2,000)”, “Price Range”, and “RAM”.

Search finds needles. Aggregations describe the haystack.

  • SQL: SELECT type, AVG(price) FROM products GROUP BY type
  • Elasticsearch: One request matches documents AND builds the summary table.

Benefit: You get the “Search Result List” and the “Faceted Sidebar” (price ranges, categories) from a single query, without a separate full table scan or a secondary analytics database.
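As a rough sketch of what "one request" looks like, the Python below builds a search body that carries both a query and an aggs section. The field names (name, brand, price), the price boundaries, and the helper name are assumptions made for illustration; they are not taken from a real catalog.

```python
import json


def build_search_with_facets(user_query: str) -> dict:
    """Build one request body that asks for hits and facet counts together."""
    return {
        # Full-text match on a hypothetical "name" field.
        "query": {"match": {"name": user_query}},
        # Only the first page of hits is returned...
        "size": 10,
        # ...but the aggregations summarize every matching document.
        "aggs": {
            "brands": {"terms": {"field": "brand", "size": 10}},
            "price_ranges": {
                "range": {
                    "field": "price",
                    "ranges": [
                        {"to": 500},
                        {"from": 500, "to": 1000},
                        {"from": 1000},
                    ],
                }
            },
        },
    }


body = build_search_with_facets("laptop")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a search request; the response would then contain both the hits and one entry per aggregation name.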


2. The Anatomy of an Aggregation

Aggregations come in three main types:

A. Buckets (The “Group By”)

Creates bins of documents.

  • terms: Group by “Category”.
  • date_histogram: Group by “Month”.
  • range: Group by custom numeric ranges, e.g. “Price > 100”.

B. Metrics (The “Select”)

Calculates numbers inside a bucket.

  • avg, sum, min, max.
  • cardinality (Approximate Distinct Count - HyperLogLog).
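To make the bucket/metric split concrete, here is a hedged sketch of a request body that nests metrics inside a terms bucket, roughly mirroring SELECT type, AVG(price) ... GROUP BY type. The field names (type, price, brand) and the aggregation names are assumptions.

```python
import json

# "size": 0 skips the hit list, so only aggregation results come back.
group_by_type = {
    "size": 0,
    "aggs": {
        # Bucket aggregation: one bucket per distinct value of "type".
        "by_type": {
            "terms": {"field": "type"},
            # Metric aggregations: computed separately inside each bucket.
            "aggs": {
                "avg_price": {"avg": {"field": "price"}},
                "distinct_brands": {"cardinality": {"field": "brand"}},
            },
        }
    },
}

print(json.dumps(group_by_type, indent=2))
```

The nesting is the important part: anything placed under a bucket's own aggs key is evaluated once per bucket rather than once for the whole result set.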

C. Pipeline Aggregations (The “Having”)

Input is another aggregation, not documents.

  • derivative: Calculate rate of change.
  • moving_fn: Smooth out noise with a moving average (replaces moving_avg, which was deprecated in 6.4 and removed in 8.0).
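Because pipeline aggregations consume bucket outputs rather than documents, their logic can be sketched on a plain list of per-bucket values. The functions below are an illustration only: the monthly figures are made up, and the trailing-window definition used here is a simplification that does not claim to match Elasticsearch's exact window semantics.

```python
from typing import List, Optional


def derivative(values: List[float]) -> List[Optional[float]]:
    """Difference between each bucket and the previous one.

    The first bucket has no predecessor, so its derivative is None.
    """
    if not values:
        return []
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing mean over up to `window` buckets, including the current one."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical output of a date_histogram with a sum metric per month.
monthly_sales = [100.0, 150.0, 130.0, 170.0]

print(derivative(monthly_sales))         # [None, 50.0, -20.0, 40.0]
print(moving_average(monthly_sales, 2))  # [100.0, 125.0, 140.0, 150.0]
```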

3. Interactive: The Aggregation Tree

[Interactive diagram: Documents flow into Buckets (Terms: Color), and each bucket computes a Metric (Avg Price).]
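The flow of documents into color buckets and then into an average-price metric can be imitated in a few lines of plain Python. This is an in-memory illustration with invented sample documents, not how Elasticsearch executes the aggregation internally.

```python
from collections import defaultdict

# Invented sample documents.
docs = [
    {"color": "red", "price": 10.0},
    {"color": "blue", "price": 20.0},
    {"color": "red", "price": 30.0},
    {"color": "blue", "price": 40.0},
    {"color": "green", "price": 50.0},
]


def terms_avg(documents, bucket_field, metric_field):
    """Group documents by one field, then average another field per group."""
    buckets = defaultdict(list)
    for doc in documents:
        # Step 1 (bucket): route each document to the bin for its term.
        buckets[doc[bucket_field]].append(doc[metric_field])
    # Step 2 (metric): reduce each bin to a count and an average.
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


result = terms_avg(docs, "color", "price")
for color, stats in result.items():
    print(color, stats)
```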


4. Hardware Reality: Global Ordinals

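How does ES group by strings (“Color”) so fast? String comparisons are computationally expensive. Instead of comparing "Red" and "Blue" millions of times, Elasticsearch works with a dictionary of integers: each distinct term in a field gets an ordinal assigned in sorted order (for example, "Blue" becomes 0 and "Red" becomes 1). Ordinals are stored per segment, so to aggregate across a whole shard Elasticsearch builds a mapping from segment ordinals to shard-wide ones. This mapping (Global Ordinals) is built lazily by default: it is computed at query time by the first request that needs it after a refresh.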
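The dictionary idea can be sketched in Python: build a sorted list of distinct terms, replace each string with its position in that list, and count with an integer-indexed array. This is a toy model with invented data and hypothetical function names; the real data structures live in Lucene and are considerably more involved.

```python
def build_ordinals(values):
    """Return (dictionary, ordinals): sorted distinct terms and one int per value."""
    dictionary = sorted(set(values))
    term_to_ord = {term: i for i, term in enumerate(dictionary)}
    return dictionary, [term_to_ord[v] for v in values]


def count_by_ordinal(ordinals, num_terms):
    """Count occurrences using array indexing instead of string comparison."""
    counts = [0] * num_terms
    for ordinal in ordinals:
        counts[ordinal] += 1
    return counts


colors = ["Red", "Blue", "Red", "Green", "Blue", "Red"]
dictionary, ords = build_ordinals(colors)
counts = count_by_ordinal(ords, len(dictionary))

# Strings are looked up only once per bucket, when results are rendered.
facets = {dictionary[i]: count for i, count in enumerate(counts)}
print(dictionary)  # ['Blue', 'Green', 'Red']
print(ords)        # [2, 0, 2, 1, 0, 2]
print(facets)      # {'Blue': 2, 'Green': 1, 'Red': 3}
```

The expensive step is build_ordinals, which has to visit every distinct term. That is the part whose cost grows with field cardinality.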

War Story: The E-Commerce Black Friday Crash

A major retailer experienced severe latency spikes every time they pushed a new product catalog update during a flash sale. The issue? They were aggregating on a high-cardinality keyword field (Product IDs). Every index refresh invalidated the Global Ordinals map. The subsequent search requests (a Thundering Herd) all tried to rebuild this massive map simultaneously, choking the JVM heap.

The Fix (Pre-loading): Set eager_global_ordinals to true in the field mapping if you rely on low-latency aggregations on fields that update frequently. This shifts the computational cost of building the ordinal map from query time to refresh time, so the first search after a refresh no longer pays for the rebuild. The trade-off is that each refresh takes longer.
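A minimal sketch of such a mapping, written as a Python dict that would serve as the body of a create-index request. The field name product_id and the helper function are assumptions for illustration; eager_global_ordinals is the mapping parameter named above.

```python
import json


def keyword_with_eager_ordinals(field_name: str) -> dict:
    """Build an index body whose keyword field preloads global ordinals."""
    return {
        "mappings": {
            "properties": {
                field_name: {
                    "type": "keyword",
                    # Build global ordinals at refresh time, not on first query.
                    "eager_global_ordinals": True,
                }
            }
        }
    }


# "product_id" is a hypothetical field name.
index_body = keyword_with_eager_ordinals("product_id")
print(json.dumps(index_body, indent=2))
```

It is usually worth enabling this only on the fields that are actually aggregated on, since every eagerly loaded field adds work to each refresh.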