Snowflake Architecture
[!TIP] The Game Changer: Before Snowflake (founded 2012), Data Warehousing was a "buy a bigger box" game. If you needed more processing power, you had to buy more disks along with it, because the two were bundled. Snowflake pioneered the Separation of Storage and Compute, allowing both to scale independently in the cloud.
1. Problem Statement: The Coupled Nightmare
Traditional Data Warehouses (like Teradata or early Redshift) were coupled.
- The Constraint: Storage and Compute lived on the same servers.
- The Cost: If you needed 2x more processing power for an end-of-quarter report, you had to double your storage nodes—even if you only had 10GB of data.
- The Risk: One “heavy” query from the Data Science team could freeze the CEO’s real-time dashboard. No Compute Isolation.
Snowflake’s Solution: A multi-cluster, shared-data architecture designed natively for the Cloud.
[!NOTE] War Story: The End-of-Quarter Reporting Freeze
At a mid-sized e-commerce company, the data science team ran a massive machine learning pipeline on their traditional shared-nothing data warehouse on the last day of the quarter. The query consumed 100% of the CPU. Meanwhile, the CFO tried to refresh the quarterly revenue dashboard. Because storage and compute were tightly coupled, the CFO's dashboard timed out for 4 hours. By migrating to a shared-data, multi-cluster architecture, they isolated the data science warehouse from the BI warehouse. Both teams queried the exact same underlying sales data simultaneously without ever impacting each other's performance.
2. Requirements & Goals
Functional Requirements
- SQL Mastery: Support full ANSI SQL, including complex JOINs across billions of rows.
- Semi-Structured Ease: Ingest JSON, Avro, and Parquet as “First-Class Citizens” without pre-defined schemas.
- Zero Management: No vacuuming, indexing, or manual distribution keys.
Non-Functional Requirements
- Unlimited Elasticity: Spin up an “X-Large” cluster in seconds and shut it down when done.
- Strong Isolation: Zero contention between different workloads (ETL vs. Reporting).
- Pay-per-Second: Only pay for compute while queries are actually running.
3. The 3-Layer Architecture
Snowflake is built as a three-tier system: Centralized Storage, Multi-Cluster Compute, and Cloud Services.
4. Layer 1: Centralized Storage (The Base)
Snowflake stores data in Micro-Partitions.
- Proprietary Format: Each micro-partition is an immutable, encrypted blob (50–500 MB of uncompressed data) stored in Cloud Object Storage (S3).
- Columnar Storage: Within each partition, data is stored by column, not row.
- Min-Max Pruning: Each micro-partition has a header containing the min and max values for every column.
- Example: If a partition covers dates 2023-01-01 to 2023-01-15 and you query for June, Snowflake skips that partition without ever downloading it.
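The min/max skipping logic above can be sketched in a few lines. This is a minimal, illustrative model of the idea, not Snowflake's actual metadata format; the class and function names are invented for this example.

```python
# Toy model of min-max pruning: each micro-partition records the min/max of a
# column in its header, so the engine can skip non-overlapping files without
# ever downloading them from object storage.
from dataclasses import dataclass

@dataclass
class MicroPartition:
    name: str
    min_val: str  # per-column min from the partition header
    max_val: str  # per-column max from the partition header

def prune(partitions, lo, hi):
    """Keep only partitions whose [min, max] range overlaps [lo, hi]."""
    return [p for p in partitions if p.max_val >= lo and p.min_val <= hi]

partitions = [
    MicroPartition("p1", "2023-01-01", "2023-01-15"),
    MicroPartition("p2", "2023-01-16", "2023-02-10"),
    MicroPartition("p3", "2023-06-01", "2023-06-30"),
]

# WHERE order_date BETWEEN '2023-06-05' AND '2023-06-20'
to_scan = prune(partitions, "2023-06-05", "2023-06-20")
print([p.name for p in to_scan])  # only p3 survives; p1 and p2 are pruned
```

Note that the decision uses only the header metadata, which is exactly why the Cloud Services layer can prune without touching the data itself.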
5. Layer 2: Virtual Warehouses (The Muscles)
A “Virtual Warehouse” is a group of cloud compute instances (VMs).
- Compute Isolation: Multiple warehouses can access the same data simultaneously. A heavy ETL job won’t affect a BI user.
- Local Caching: Each node in the warehouse has a fast SSD. When it reads data from S3, it caches it locally. Subsequent queries for the same data are near-instant.
- Statelessness: Warehouses are ephemeral. You can kill a warehouse or resize it from a single node (X-Small) to 128 nodes (4X-Large) in under a minute without moving data.
6. Layer 3: Cloud Services (The Brains)
This is the “Control Plane” that never sleeps.
- Security: Manages authentication and encryption at rest/transit.
- Optimization: Rewrites SQL queries for maximum performance on the specific compute cluster size.
- Metadata Management: This is the secret sauce. Cloud Services tracks all Micro-Partitions, their versions (Time Travel), and their Min/Max stats.
7. Key Performance Driver: Query Pruning
Query Pruning is why Snowflake can scan Petabytes in seconds.
- Metadata Check: The Cloud Services layer identifies which Micro-Partitions actually need to be scanned using Min/Max stats.
- Column Scans: The Virtual Warehouse only downloads the specific Columns required for the query from S3.
- Result: For a selective query, the overwhelming majority of the data in S3 is never even touched by the compute nodes.
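The second step, column scans, can be sketched as a simple projection: because each micro-partition stores its data per column, a query touching two of five columns downloads only those two column blobs. This is an illustrative model, not Snowflake's storage format.

```python
# Toy model of columnar scanning within one surviving micro-partition:
# each column is a separately addressable blob in object storage.
partition = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "amount":   [10.0, 20.0, 30.0],
    "country":  ["US", "DE", "US"],
    "comment":  ["...", "...", "..."],
}

def scan(partition, needed_columns):
    """Download only the requested column blobs, not the whole partition."""
    return {col: partition[col] for col in needed_columns}

# SELECT customer, amount FROM orders
result = scan(partition, ["customer", "amount"])
print(sorted(result))  # ['amount', 'customer'] -- 2 of 5 columns read
```

Combined with partition pruning, this is why the bytes actually read can be a tiny fraction of the table's total size.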
8. Requirements Traceability Matrix
| Requirement | Design Choice |
|---|---|
| Separation of Concerns | Decoupled Storage (S3) and Compute (Virtual Warehouses). |
| Performance (OLAP) | Columnar Micro-Partitions + Min-Max Static Pruning. |
| Workload Isolation | Dedicated Virtual Warehouses per department/workload. |
| Cost Efficiency | Auto-resume / auto-suspend compute. While warehouses are suspended, you pay for storage only. |
| Time Travel | Immutable micro-partitions allow querying past states (Copy-on-Write). |
| Semi-Structured | VARIANT data type stores JSON/Parquet natively with sub-column pruning. |
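The last row of the table deserves a sketch: the idea behind VARIANT is that common JSON paths are shredded internally into hidden sub-columns, each with its own statistics, so semi-structured data gets the same columnar access and pruning as regular columns. The code below is a toy illustration of shredding, not Snowflake's actual storage format.

```python
# Toy model of VARIANT shredding: flatten each JSON document into
# path -> value sub-columns that can be scanned and pruned like real columns.
import json

rows = [
    '{"user": {"id": 7, "country": "US"}, "amount": 12}',
    '{"user": {"id": 9, "country": "DE"}, "amount": 40}',
]

def shred(json_rows):
    """Flatten documents into path -> list-of-values sub-columns."""
    columns = {}
    for raw in json_rows:
        def walk(obj, prefix):
            for key, val in obj.items():
                path = f"{prefix}.{key}" if prefix else key
                if isinstance(val, dict):
                    walk(val, path)       # recurse into nested objects
                else:
                    columns.setdefault(path, []).append(val)
        walk(json.loads(raw), "")
    return columns

cols = shred(rows)
print(cols["user.country"])                      # ['US', 'DE']
print(min(cols["amount"]), max(cols["amount"]))  # stats usable for pruning
```

Once the paths exist as sub-columns, the engine can record min/max stats per path and prune micro-partitions on JSON predicates, just as it does for ordinary columns.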
9. Interview Gauntlet
- Explain the “Shared-Data, Multi-Cluster” architecture.
- One central storage layer (Shared Data), but many independent compute clusters (Multi-Cluster) accessing it.
- What is a Virtual Warehouse?
- A cluster of compute resources (CPU/RAM/SSD) that executes queries. It is stateless and decoupled from storage.
- How does Snowflake handle “Time Travel”?
- Because micro-partitions are immutable, Snowflake simply keeps the old versions for X days. You can query the table AT or BEFORE a specific timestamp.
- What is Micro-Partitioning?
- The process of automatically splitting tables into small, contiguous chunks of storage (50-500MB) stored in a columnar format.
- Explain “Min-Max Pruning”.
- The ability to skip reading files from S3 if their metadata indicates they don’t contain the requested value range.
- Does Snowflake use Indexes?
- No. It uses Micro-Partition metadata (pruning) instead of traditional B-Tree or BitMap indexes to find data.
- How do you scale Snowflake?
- Scaling Up (Resizing a warehouse for complex queries) or Scaling Out (Multi-cluster warehouses for concurrency).
- What is the benefit of a local SSD cache on compute nodes?
- To avoid the latency of pulling the same data from S3 repeatedly. The cache is “warm” for similar subsequent queries.
- What is the “VARIANT” data type?
- A specialized type that allows storing JSON/ORC/Avro. Snowflake automatically flattens it internally for columnar access.
- Snowflake vs. Redshift (Original)?
- Original Redshift is “Shared-Nothing” (Compute and Storage are tied). Snowflake is “Shared-Data” (Decoupled).
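The Time Travel and Zero-Copy Clone answers above share one mechanism: because micro-partitions are immutable, a table "version" is just a list of partition names held by the metadata layer. The sketch below is an invented toy model of that idea, not Snowflake's metadata service.

```python
# Toy model of Time Travel: each commit records which immutable partitions
# make up the table at that moment; old versions are kept for the retention
# window, so querying the past means reading an older partition list.
class TableMetadata:
    def __init__(self):
        self.versions = []  # list of (timestamp, [partition names])

    def commit(self, ts, partitions):
        self.versions.append((ts, list(partitions)))

    def at(self, ts):
        """Return the newest version committed at or before ts."""
        live = [parts for t, parts in self.versions if t <= ts]
        return live[-1] if live else []

meta = TableMetadata()
meta.commit(100, ["p1", "p2"])
meta.commit(200, ["p1", "p3"])   # an UPDATE rewrote p2's rows into a new p3

print(meta.at(150))  # ['p1', 'p2'] -- the table as of t=150
print(meta.at(250))  # ['p1', 'p3'] -- the current state

# A zero-copy clone copies only this metadata list; no data blobs move.
clone_partitions = list(meta.at(250))
```

The same copy-on-write trick explains why an UPDATE never modifies a file in place: it writes a new partition and points the newest version at it.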
10. Summary: The Whiteboard Strategy
1. 3-Layer Split
- Services: Metadata & Optimizer.
- Compute: Clusters (VW) with SSD Cache.
- Storage: S3 (Micro-Partitions).
2. Storage Format
- Columnar: Only read column C if the query needs it.
- Pruning: Skip a file if the query range ∉ [Min, Max].
3. Scaling Types
- Scale UP: Resize a single warehouse for complex queries.
- Scale OUT: 1 cluster → 10 clusters (Concurrency).
- Zero-Copy Clone: Copy metadata only.
4. Why Snowflake?
- Isolation: Zero contention between groups.
- Agility: No manual vacuuming or indexing.
- Economy: Independent scaling of Storage and Compute.