Snowflake Architecture

[!NOTE] This module explores the core principles of Snowflake Architecture, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Problem Statement

Traditional Data Warehouses (like Teradata or early Redshift) were tightly coupled: Storage and Compute lived on the same servers. If you needed more processing power, you had to buy more storage along with it, even if you didn’t need it.

Snowflake’s Innovation: Separation of Storage and Compute. This allows you to store petabytes of data cheaply and only spin up “Compute Engines” when you actually need to run a query.
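A back-of-the-envelope sketch of why this matters for cost. All prices below are assumptions for illustration only, not actual Snowflake or S3 rates: storage is billed continuously but cheaply, while compute is billed only for the minutes it runs.

```python
# Hypothetical prices, for intuition only -- not actual Snowflake/S3 rates.
storage_tb = 500                  # TB stored for the whole month
storage_price_per_tb_month = 23   # $/TB-month (assumed)
compute_hours = 10                # hours of queries actually run this month
compute_price_per_hour = 16       # $/hour for a compute cluster (assumed)

# Because storage and compute are billed independently, idle compute costs nothing.
monthly_cost = (storage_tb * storage_price_per_tb_month
                + compute_hours * compute_price_per_hour)
print(monthly_cost)  # -> 11660
```

In a coupled system, keeping 500 TB online would mean paying for the attached compute around the clock, whether or not queries were running.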


2. Requirements & Goals

Functional Requirements

  1. SQL Queries: Support complex JOINs and aggregations across billions of rows.
  2. Data Loading: Ingest structured (CSV) and semi-structured (JSON) data.
  3. Scalability: Scale the number of concurrent queries without slowing down individual ones.

Non-Functional Requirements

  1. Elasticity: Spin up or shut down compute resources in seconds.
  2. Isolation: One department’s heavy query shouldn’t slow down another department’s dashboard.
  3. High Availability: Minimal downtime for the metadata and management layer.

3. The 3-Layer Architecture

Snowflake is designed as a “Cloud Native” system, leveraging existing cloud services like Amazon S3 and Azure Blob Storage.

  • Cloud Services: Metadata, Security, Access Control, and the Query Optimizer.
  • Centralized Storage: S3 / Blob Storage in an optimized Columnar Format.
  • Virtual Warehouse A: Finance queries, 8 nodes.
  • Virtual Warehouse B: Marketing dashboards, 32 nodes.

4. Layer 1: Centralized Storage (The Base)

Instead of local disks, data is stored in Amazon S3 in a proprietary, encrypted Columnar Format (similar to Parquet).

  • Min-Max Metadata: Each file contains metadata about the values inside (e.g., “Min Price: 10, Max Price: 50”).
  • Query Pruning: If a query asks for Price = 100, the engine reads the metadata, realizes this file doesn’t have it, and skips the file entirely. No need to download gigabytes of data.
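The two bullets above can be sketched in a few lines of Python. File names and metadata values here are made up for illustration:

```python
# Hypothetical per-file metadata: each file records the min/max of the
# values it contains, so the engine can decide to skip it without reading it.
files = [
    {"name": "part_001", "min_price": 10, "max_price": 50},
    {"name": "part_002", "min_price": 60, "max_price": 120},
    {"name": "part_003", "min_price": 95, "max_price": 300},
]

def prune(files, target_price):
    """Return only the files whose [min, max] range could contain the target."""
    return [f["name"] for f in files
            if f["min_price"] <= target_price <= f["max_price"]]

# A query for Price = 100 skips part_001 entirely: 100 is outside [10, 50].
print(prune(files, 100))  # -> ['part_002', 'part_003']
```

Only the surviving files are ever downloaded from S3; the rest are eliminated using metadata alone.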

5. Layer 2: Virtual Warehouses (The Muscles)

“Virtual Warehouse” is Snowflake’s term for a Compute Cluster.

  1. Isolation: The Finance team can run a 6-hour report on one cluster without affecting the Marketing team’s dashboard on a separate cluster.
  2. Local Caching: While the “Main Data” is in S3, each compute node has a fast Local SSD Cache. It stores the most recently read data.
  3. Statelessness: If a node dies, or you shut down the warehouse, your data remains safe in S3.
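Point 2 can be sketched as a tiny cache in front of remote reads. This assumes a simple least-recently-used policy; Snowflake’s actual cache management is internal and more sophisticated:

```python
from collections import OrderedDict

class LocalCache:
    """Toy node-local SSD cache in front of remote S3 reads (LRU eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.remote_reads = 0  # counts round trips to S3

    def read(self, file_id):
        if file_id in self.cache:
            self.cache.move_to_end(file_id)   # cache hit: no S3 round trip
            return self.cache[file_id]
        self.remote_reads += 1                # cache miss: fetch from S3
        data = f"data-for-{file_id}"
        self.cache[file_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used file
        return data

cache = LocalCache(capacity=2)
cache.read("f1"); cache.read("f2"); cache.read("f1")  # second f1 read is a hit
print(cache.remote_reads)  # -> 2
```

Because the cache is only an accelerator, losing a node (point 3) loses nothing: the authoritative copy of every file stays in S3.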

[!TIP] Analogy: The Public Library. Storage is the millions of books on the shelves (S3); Compute is the people reading the books (Virtual Warehouses).


6. Layer 3: Cloud Services (The Brains)

This layer manages the entire system. It acts as the “Traffic Controller.”

  1. Metadata Management: Stores which files in S3 belong to which table.
  2. Query Optimizer: Analyzes your SQL and creates the most efficient “Execution Plan.”
  3. Security: Manages encryption keys and user permissions.
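The metadata role in point 1 can be pictured as a catalog mapping tables to their underlying files. Table names and S3 paths here are made up for illustration:

```python
# Toy metadata catalog: which S3 files make up which table.
# The compute layer consults this before touching storage.
catalog = {
    "sales":     ["s3://bucket/sales/part_001", "s3://bucket/sales/part_002"],
    "customers": ["s3://bucket/customers/part_001"],
}

def files_for(table):
    """Return the list of storage files backing a table (empty if unknown)."""
    return catalog.get(table, [])

print(files_for("sales"))  # -> ['s3://bucket/sales/part_001', 's3://bucket/sales/part_002']
```

Keeping this mapping in a separate, highly available layer is what lets any Virtual Warehouse, new or old, find the same data.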

7. Interview Gauntlet

  1. What is the benefit of Separation of Storage and Compute?
    • Ans: Scalability and Cost. You can scale storage to exabytes for pennies and scale compute to thousands of cores for just the minutes you use them.
  2. How is data stored for fast queries?
    • Ans: Columnar format. If you query SUM(sales), the engine only reads the sales column, not the entire row.
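The columnar answer can be sketched with a toy table (made-up values): the same rows stored two ways, where summing sales touches only one contiguous column.

```python
# Row-oriented layout: each record stored together; SUM(sales) must scan
# every field of every row.
rows = [
    {"order_id": 1, "region": "EU", "sales": 120},
    {"order_id": 2, "region": "US", "sales": 75},
    {"order_id": 3, "region": "EU", "sales": 200},
]

# Column-oriented layout: each column stored contiguously; SUM(sales)
# reads exactly one list and ignores the others.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "sales":    [120, 75, 200],
}

print(sum(columns["sales"]))  # -> 395
```

On billions of rows, reading one column instead of every column is the difference between scanning gigabytes and scanning terabytes.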

8. Summary

  • Storage: Persistent, columnar files on S3.
  • Compute: Disposable, isolated clusters with local caching.
  • Metadata: Centralized management layer for coordination.
  • Efficiency: Min-Max pruning and Columnar access.