Snowflake Architecture

[!TIP] The Game Changer: Before Snowflake (founded 2012), Data Warehousing was a “buy a bigger box” game. Because storage and compute were bundled on the same machines, needing more processing power meant buying more disks, and vice versa. Snowflake pioneered the Separation of Storage and Compute, allowing each to scale independently in the cloud.

1. Problem Statement: The Coupled Nightmare

Traditional Data Warehouses (like Teradata or early Redshift) were coupled.

  • The Constraint: Storage and Compute lived on the same servers.
  • The Cost: If you needed 2x more processing power for an end-of-quarter report, you had to double your storage nodes—even if you only had 10GB of data.
  • The Risk: One “heavy” query from the Data Science team could freeze the CEO’s real-time dashboard. No Compute Isolation.

Snowflake’s Solution: A multi-cluster, shared-data architecture designed natively for the Cloud.

[!NOTE] War Story: The End-of-Quarter Reporting Freeze. At a mid-sized e-commerce company, the data science team ran a massive machine learning pipeline on their traditional shared-nothing data warehouse on the last day of the quarter. The query consumed 100% of the CPU. Meanwhile, the CFO tried to refresh the quarterly revenue dashboard. Because storage and compute were tightly coupled, the CFO’s dashboard timed out for 4 hours. By migrating to a shared-data, multi-cluster architecture, they isolated the data science warehouse from the BI warehouse: both teams queried the exact same underlying sales data simultaneously without ever impacting each other’s performance.

2. Requirements & Goals

Functional Requirements

  • SQL Mastery: Support full ANSI SQL, including complex JOINs across billions of rows.
  • Semi-Structured Ease: Ingest JSON, Avro, and Parquet as “First-Class Citizens” without pre-defined schemas.
  • Zero Management: No vacuuming, indexing, or manual distribution keys.

Non-Functional Requirements

  • Unlimited Elasticity: Spin up an “X-Large” cluster in seconds and shut it down when done.
  • Strong Isolation: Zero contention between different workloads (ETL vs. Reporting).
  • Pay-per-Second: Only pay for compute while queries are actually running.

3. The 3-Layer Architecture

Snowflake is built as a three-tier system: Centralized Storage, Multi-Cluster Compute, and Cloud Services.

Architecture: The Snowflake 3-Layer Design
Separation of Storage & Compute | Multi-Cluster Shared Data

  • Layer 3: Cloud Services (The Brain) — Query Optimizer, Metadata Management, Security & Auth, Access Control.
  • Layer 2: Multi-Cluster Compute (Virtual Warehouses)
    • Warehouse “BI Reporting” — Size: Small (2 nodes), warm local SSD cache.
    • Warehouse “Data Science” — Size: 4X-Large (128 nodes), heavy workload, fully isolated.
  • Layer 1: Centralized Storage (Cloud Object Store) — shared S3 / Azure Blob / GCS holding Micro-Partitions A, B, C, D.

4. Layer 1: Centralized Storage (The Base)

Snowflake stores data in Micro-Partitions.

  • Proprietary Format: Each micro-partition is an immutable, encrypted blob (50MB–500MB of uncompressed data) stored in Cloud Object Storage (S3).
  • Columnar Storage: Within each partition, data is stored by column, not row.
  • Min-Max Pruning: Each micro-partition has a header containing the min and max values for every column.
  • Example: If a partition covers dates 2023-01-01 to 2023-01-15, and you query for June, Snowflake skips that partition without ever downloading it.

Worked Example: Min-Max Pruning

Suppose the query filters on the value 25. The engine checks each micro-partition’s min/max header to decide which partitions to scan and which to “prune” (skip):

  • Partition A — Range: [1 - 20] → Pruned (25 falls outside the range).
  • Partition B — Range: [21 - 40] → Scanned (25 falls inside the range).
  • Partition C — Range: [41 - 60] → Pruned.

Only Partition B is ever downloaded from object storage; A and C are never read.
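The pruning check reduces to a simple range test over per-partition metadata. A minimal sketch (partition names, ranges, and structures here are illustrative, not Snowflake’s real metadata format):

```python
# Illustrative per-partition min/max metadata, as kept by Cloud Services.
PARTITIONS = {
    "A": (1, 20),
    "B": (21, 40),
    "C": (41, 60),
}

def partitions_to_scan(value, partitions=PARTITIONS):
    """Return partitions whose [min, max] range could contain `value`;
    every other partition is pruned without ever being read from storage."""
    return [name for name, (lo, hi) in partitions.items() if lo <= value <= hi]

# Filtering on 25 touches only partition B; A and C are pruned.
print(partitions_to_scan(25))   # ['B']
```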

5. Layer 2: Virtual Warehouses (The Muscles)

A “Virtual Warehouse” is a group of cloud compute instances (VMs).

  • Compute Isolation: Multiple warehouses can access the same data simultaneously. A heavy ETL job won’t affect a BI user.
  • Local Caching: Each node in the warehouse has a fast SSD. When it reads data from S3, it caches it locally. Subsequent queries for the same data are near-instant.
  • Statelessness: Warehouses are ephemeral. You can kill a warehouse or resize it from 1 node (X-Small) to 128 nodes (4X-Large) in under a minute without moving data.
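The local-cache behavior above can be sketched as an LRU keyed by micro-partition id. This is a hedged model, not Snowflake’s real implementation; `fetch_from_object_store` is a stand-in for an S3/Blob/GCS read:

```python
from collections import OrderedDict

class LocalCache:
    """Toy model of a warehouse node's local SSD cache (LRU eviction)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.remote_reads = 0          # counts round trips to object storage

    def fetch_from_object_store(self, partition_id):
        self.remote_reads += 1         # simulated slow remote read
        return f"data:{partition_id}"

    def read(self, partition_id):
        if partition_id in self.entries:           # warm cache: no S3 round trip
            self.entries.move_to_end(partition_id)
            return self.entries[partition_id]
        data = self.fetch_from_object_store(partition_id)
        self.entries[partition_id] = data
        if len(self.entries) > self.capacity:      # evict least recently used
            self.entries.popitem(last=False)
        return data

cache = LocalCache()
cache.read("A"); cache.read("B"); cache.read("A")  # second "A" hits the cache
print(cache.remote_reads)   # 2
```

Repeated queries over the same partitions become near-instant because only cold reads pay the object-storage latency.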

6. Layer 3: Cloud Services (The Brains)

This is the “Control Plane” that never sleeps.

  • Security: Manages authentication and encryption at rest/transit.
  • Optimization: Rewrites SQL queries for maximum performance on the specific compute cluster size.
  • Metadata Management: This is the secret sauce. Cloud Services tracks all Micro-Partitions, their versions (Time Travel), and their Min/Max stats.

7. Key Performance Driver: Query Pruning

Query Pruning is why Snowflake can scan Petabytes in seconds.

  1. Metadata Check: The Cloud Services layer identifies which Micro-Partitions actually need to be scanned using Min/Max stats.
  2. Column Scans: The Virtual Warehouse only downloads the specific Columns required for the query from S3.
  3. Result: In a well-pruned query, the vast majority of the data sitting in S3 is never even touched by the compute nodes.
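Step 2 (column scans) can be sketched as projection over a columnar partition: because each column is stored separately, a query touching one column downloads only that column’s bytes. The data structures below are illustrative, not Snowflake’s file format:

```python
# Illustrative columnar layout of one micro-partition: each column is a
# separately stored block.
micro_partition = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bo", "cy"],
    "amount":   [10, 25, 7],
}

def scan(partition, columns):
    """Simulate reading only the requested column blocks from storage."""
    return {col: partition[col] for col in columns}

# A `SELECT SUM(amount)` style query reads one of the three columns.
result = scan(micro_partition, ["amount"])
print(sum(result["amount"]))   # 42
```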

8. Requirements Traceability Matrix

| Requirement | Design Choice |
| --- | --- |
| Separation of Concerns | Decoupled Storage (S3) and Compute (Virtual Warehouses). |
| Performance (OLAP) | Columnar Micro-Partitions + Min-Max Static Pruning. |
| Workload Isolation | Dedicated Virtual Warehouses per department/workload. |
| Cost Efficiency | Auto-resume / Auto-suspend compute nodes. Pay for storage only when idle. |
| Time Travel | Immutable micro-partitions allow querying past states (Copy-on-Write). |
| Semi-Structured | VARIANT data type stores JSON/Parquet natively with sub-column pruning. |
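The semi-structured entry above rests on “shredding” documents into per-key sub-columns at ingest, which is what makes sub-column pruning possible. A minimal sketch, assuming a flat key layout (Snowflake’s internal representation is more sophisticated):

```python
import json
from collections import defaultdict

# Toy VARIANT shredding: split JSON documents into per-key sub-columns so
# the engine can later scan one key's values without touching the others.
docs = [
    '{"user": "ann", "amount": 10}',
    '{"user": "bo",  "amount": 25}',
]

def shred(json_rows):
    columns = defaultdict(list)
    for row in json_rows:
        for key, value in json.loads(row).items():
            columns[key].append(value)
    return dict(columns)

cols = shred(docs)
# A query filtering on `amount` reads only that sub-column.
print(cols["amount"])   # [10, 25]
```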

9. Interview Gauntlet

  1. Explain the “Shared-Data, Multi-Cluster” architecture.
    • One central storage layer (Shared Data), but many independent compute clusters (Multi-Cluster) accessing it.
  2. What is a Virtual Warehouse?
    • A cluster of compute resources (CPU/RAM/SSD) that executes queries. It is stateless and decoupled from storage.
  3. How does Snowflake handle “Time Travel”?
    • Because micro-partitions are immutable, Snowflake simply keeps the old versions for X days. You can query the table AS OF a specific timestamp.
  4. What is Micro-Partitioning?
    • The process of automatically splitting tables into small, contiguous chunks of storage (50-500MB) stored in a columnar format.
  5. Explain “Min-Max Pruning”.
    • The ability to skip reading files from S3 if their metadata indicates they don’t contain the requested value range.
  6. Does Snowflake use Indexes?
    • No. It uses Micro-Partition metadata (pruning) instead of traditional B-Tree or BitMap indexes to find data.
  7. How do you scale Snowflake?
    • Scaling Up (Resizing a warehouse for complex queries) or Scaling Out (Multi-cluster warehouses for concurrency).
  8. What is the benefit of a local SSD cache on compute nodes?
    • To avoid the latency of pulling the same data from S3 repeatedly. The cache is “warm” for similar subsequent queries.
  9. What is the “VARIANT” data type?
    • A specialized type that allows storing JSON/ORC/Avro. Snowflake automatically flattens it internally for columnar access.
  10. Snowflake vs. Redshift (Original)?
    • Original Redshift is “Shared-Nothing” (Compute and Storage are tied). Snowflake is “Shared-Data” (Decoupled).
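The Time Travel answer (#3) follows directly from immutability: every write produces a new table version pointing at a new set of partitions, while old versions remain queryable until they age out of retention. A minimal sketch under those assumptions:

```python
import itertools

class VersionedTable:
    """Toy model of Time Travel over immutable micro-partitions."""

    def __init__(self):
        self.versions = []                 # list of (version_no, partition_set)
        self._counter = itertools.count(1)

    def write(self, partitions):
        """Each write records a new version; old partition blobs are untouched."""
        v = next(self._counter)
        self.versions.append((v, frozenset(partitions)))
        return v

    def as_of(self, version_no):
        """Query the table AS OF an older version: just read old metadata."""
        for v, parts in self.versions:
            if v == version_no:
                return parts
        raise KeyError(version_no)

t = VersionedTable()
t.write({"P1", "P2"})
t.write({"P1", "P3"})          # P2 replaced by P3; P2's blob still exists
print(sorted(t.as_of(1)))      # ['P1', 'P2']
```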

10. Summary: The Whiteboard Strategy

1. 3-Layer Split

  • Services: Metadata & Optimizer.
  • Compute: Clusters (VW) with SSD Cache.
  • Storage: S3 (Micro-Partitions).

2. Storage Format

[Micro-Partition 1]
Col A | Col B | Col C
Header: Min/Max Info

[Micro-Partition 2]
Col A | Col B | Col C
Header: Min/Max Info

* Columnar: Only read Col C if needed.
* Pruning: Skip a partition if the query’s value range does not overlap [Min, Max].

3. Scaling Types

Scale UP: Small → 4XL (Speed).
Scale OUT: 1 cluster → 10 clusters (Concurrency).
Zero-Copy Clone: Copy metadata only.
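The Zero-Copy Clone line can be sketched in a few lines: a clone duplicates only the metadata (the list of micro-partition ids), never the partition blobs themselves. The dict layout is illustrative:

```python
def zero_copy_clone(table_metadata):
    """Return new table metadata referencing the same immutable partitions;
    no partition data is copied."""
    return {
        "name": table_metadata["name"] + "_clone",
        "partitions": list(table_metadata["partitions"]),
    }

source = {"name": "sales", "partitions": ["P1", "P2", "P3"]}
clone = zero_copy_clone(source)

# Both tables point at the same physical partitions.
print(clone["partitions"] == source["partitions"])   # True
```

Because partitions are immutable, later writes to either table simply create new partitions for that table, so the clone and the source diverge safely (copy-on-write).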

4. Why Snowflake?

  • Isolation: Zero contention between groups.
  • Agility: No manual vacuuming or indexing.
  • Economy: Independent scaling of Storage and Compute.