Data Warehousing 101: Star Schema vs Snowflake Schema

If you have ever written a SQL query that joins six tables to answer a simple business question, you have felt the pain of poorly designed data models. Data warehousing exists to solve this problem â€” organizing data so that analytics and reporting are fast, intuitive, and reliable.

At the heart of data warehouse design is a choice between two schema patterns: star schema and snowflake schema. Understanding when and why to use each is fundamental knowledge for any data engineer, analyst, or BI developer.

What is a Data Warehouse?

A data warehouse is a centralized repository designed specifically for analytical queries and reporting. Unlike your transactional database (which handles day-to-day operations), a data warehouse is optimized for reading and aggregating large volumes of historical data.

Key characteristics:

Subject-oriented â€” Organized around business subjects (sales, customers, inventory) rather than application processes
Integrated â€” Consolidates data from multiple source systems into a consistent format
Time-variant â€” Maintains historical data to enable trend analysis over time
Non-volatile â€” Data is loaded in bulk and rarely updated or deleted once written

OLTP vs OLAP

Before diving into schema design, it is crucial to understand the distinction between the two systems:

Feature	OLTP (Transactional)	OLAP (Analytical)
Purpose	Day-to-day operations	Reporting and analysis
Query type	INSERT, UPDATE, DELETE (single rows)	SELECT with aggregations (millions of rows)
Data model	Highly normalized (3NF)	Denormalized (star/snowflake)
Users	Application users, software	Analysts, business users, dashboards
Data freshness	Real-time current state	Periodic loads (hourly/daily) or near-real-time
Optimization	Write performance	Read performance
Example	MySQL/PostgreSQL for your application	BigQuery/Redshift/Snowflake for analytics
Schema design	ER diagrams, normalized tables	Dimensional modeling (facts + dimensions)

Your application database (OLTP) is not designed for analytics. Running complex aggregation queries on it will slow down your production application. That is why data is extracted, transformed, and loaded (ETL) into a separate data warehouse (OLAP) that is optimized for those queries.

Dimensional Modeling: Facts and Dimensions

Data warehouse schemas are built on dimensional modeling, a technique introduced by Ralph Kimball. The idea is simple but powerful:

Fact Tables

A fact table stores the measurable events of your business â€” transactions, clicks, orders, payments. Each row represents one event.

Fact tables contain:

Foreign keys linking to dimension tables
Measures (numeric values you want to aggregate) â€” revenue, quantity, discount, cost

Example: A fact_sales table with columns like date_key, product_key, store_key, customer_key, quantity_sold, revenue, discount_amount.

Dimension Tables

A dimension table stores the descriptive context around your facts â€” the who, what, where, when.

Example dimensions:

dim_product â€” product name, category, subcategory, brand, price tier
dim_customer â€” customer name, segment, city, state, registration date
dim_store â€” store name, location, region, format (online/offline)
dim_date â€” date, day of week, month, quarter, year, fiscal period, is_holiday

Dimension tables are typically wide (many columns) but short (thousands to millions of rows), while fact tables are narrow (few columns) but tall (millions to billions of rows).

Star Schema Explained

The star schema is the simplest and most widely used data warehouse schema. It gets its name from its shape â€” a central fact table connected directly to dimension tables, forming a star pattern.

Structure

The fact table sits at the center. Each dimension table connects directly to the fact table via a foreign key relationship. Dimension tables are denormalized â€” all descriptive attributes are stored in a single flat table.

For an e-commerce data warehouse:

Center: fact_orders (order_key, date_key, product_key, customer_key, store_key, quantity, revenue, discount)
Points of the star:
- dim_date (date_key, full_date, day_name, month, quarter, year)
- dim_product (product_key, product_name, category, subcategory, brand)
- dim_customer (customer_key, name, email, city, state, segment)
- dim_store (store_key, store_name, city, state, region, format)

Star Schema Query Example

-- Total revenue by product category and quarter
SELECT
    d.quarter,
    p.category,
    SUM(f.revenue) AS total_revenue,
    COUNT(DISTINCT f.order_key) AS total_orders
FROM fact_orders f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2025
GROUP BY d.quarter, p.category
ORDER BY d.quarter, total_revenue DESC;

Notice how clean this query is â€” one fact table, two simple joins, and you have your answer. This is the power of star schema.

Advantages of Star Schema

Simple queries â€” Fewer joins, intuitive structure, easy for analysts to understand
Fast performance â€” Denormalized dimensions mean fewer joins at query time
BI tool friendly â€” Tools like Tableau, Power BI, and Looker work naturally with star schemas
Easy to maintain â€” Adding a new dimension is straightforward

Snowflake Schema Explained

The snowflake schema is a variation of the star schema where dimension tables are normalized into multiple related tables. Instead of storing all product attributes in one dim_product table, you split them into separate tables.

Structure

Using the same e-commerce example:

Center: fact_orders (same as star schema)
Normalized dimensions:
- dim_product (product_key, product_name, subcategory_key, brand_key)
- dim_subcategory (subcategory_key, subcategory_name, category_key)
- dim_category (category_key, category_name)
- dim_brand (brand_key, brand_name, country_of_origin)

The dim_product table no longer contains category or brand names directly â€” it references them through foreign keys to normalized lookup tables.

Snowflake Schema Query Example

-- Total revenue by product category and quarter (snowflake version)
SELECT
    d.quarter,
    cat.category_name,
    SUM(f.revenue) AS total_revenue,
    COUNT(DISTINCT f.order_key) AS total_orders
FROM fact_orders f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_subcategory sub ON p.subcategory_key = sub.subcategory_key
JOIN dim_category cat ON sub.category_key = cat.category_key
WHERE d.year = 2025
GROUP BY d.quarter, cat.category_name
ORDER BY d.quarter, total_revenue DESC;

Notice the extra joins â€” you must traverse through dim_product to dim_subcategory to dim_category to get the category name.

Star Schema vs Snowflake Schema

Feature	Star Schema	Snowflake Schema
Dimension structure	Denormalized (flat)	Normalized (multiple tables)
Number of joins	Fewer (simpler queries)	More (complex queries)
Query performance	Faster (fewer joins)	Slower (more joins)
Storage efficiency	More redundancy	Less redundancy
Ease of understanding	Very intuitive	More complex to navigate
ETL complexity	Simpler to load	More complex transformations
BI tool compatibility	Excellent	Good (but more configuration)
Data integrity	Lower (denormalized)	Higher (normalized)
Maintenance	Easier	Harder (more tables to manage)
Best for	Analytics, dashboards, BI	Large-scale warehouses with strict storage constraints

When to Use Star Schema

Your primary users are business analysts and BI tools
Query performance is the top priority
Your dimension tables are not excessively large
You want simplicity in both querying and ETL

When to Use Snowflake Schema

Storage cost is a major concern and dimension data has significant redundancy
You need strict data integrity with no duplication
Your ETL processes already normalize data naturally
Dimension tables are very large with many shared attributes

In practice, star schema is the dominant choice for most modern data warehouses. The storage savings from snowflake schema are negligible with today's columnar storage engines (BigQuery, Redshift, Snowflake), and the query simplicity of star schema is a massive productivity gain.

Real-World Example: E-Commerce Data Warehouse

Let us design a simple data warehouse for an Indian e-commerce company (think Flipkart or Meesho):

Fact Table: fact_orders

CREATE TABLE fact_orders (
    order_key       BIGINT PRIMARY KEY,
    date_key        INT REFERENCES dim_date(date_key),
    product_key     INT REFERENCES dim_product(product_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    seller_key      INT REFERENCES dim_seller(seller_key),
    payment_key     INT REFERENCES dim_payment_method(payment_key),
    quantity        INT,
    gross_amount    DECIMAL(12,2),
    discount        DECIMAL(12,2),
    net_revenue     DECIMAL(12,2),
    shipping_cost   DECIMAL(10,2),
    is_returned     BOOLEAN
);

Sample Dimension: dim_customer

CREATE TABLE dim_customer (
    customer_key    INT PRIMARY KEY,
    customer_id     VARCHAR(20),    -- natural key from source system
    full_name       VARCHAR(100),
    email           VARCHAR(100),
    phone           VARCHAR(15),
    city            VARCHAR(50),
    state           VARCHAR(50),
    pin_code        VARCHAR(6),
    tier_city       VARCHAR(10),    -- Tier 1, Tier 2, Tier 3
    segment         VARCHAR(20),    -- Premium, Regular, New
    registration_date DATE,
    effective_from  DATE,           -- SCD Type 2 fields
    effective_to    DATE,
    is_current      BOOLEAN
);

Slowly Changing Dimensions (SCD)

Real-world dimension data changes over time. A customer moves to a new city. A product gets reclassified into a different category. How do you handle these changes?

SCD Type 1: Overwrite

Simply overwrite the old value with the new value. No history is maintained.

Customer moves from Mumbai to Pune? Update the city column to "Pune".
Pros: Simple, no extra storage
Cons: You lose historical context. Past orders will appear as if they were from Pune.

SCD Type 2: Add New Row

Create a new row for the changed record, with version tracking fields (effective_from, effective_to, is_current).

Customer moves from Mumbai to Pune? The old row gets effective_to = 2025-05-04, is_current = false. A new row is inserted with city = Pune, effective_from = 2025-05-05, is_current = true.
Pros: Full history preserved. Past orders correctly link to the Mumbai version.
Cons: Table grows larger. Queries need to filter on is_current = true for current state.

SCD Type 3: Add New Column

Add a column to store the previous value alongside the current value.

Add previous_city and current_city columns.
Pros: Simple to query both current and previous state
Cons: Only tracks one level of history. Does not scale for multiple changes.

SCD Type 2 is the most commonly used approach in production data warehouses because it preserves complete history while maintaining accurate relationships with fact data.

Modern Data Warehouse Tools

The data warehouse landscape has shifted dramatically toward cloud-native, serverless platforms:

Google BigQuery â€” Serverless, pay-per-query, excellent for organizations already on GCP. Supports nested/repeated fields (semi-structured data). Widely used by Indian startups.
Amazon Redshift â€” Columnar storage, integrates tightly with the AWS ecosystem. Offers Redshift Serverless for variable workloads.
Snowflake â€” Multi-cloud (AWS, Azure, GCP), separates compute from storage (scale independently). Known for its data sharing and marketplace features.
Databricks Lakehouse â€” Combines data lake and data warehouse on Delta Lake. Strong for teams that need both ML and BI on the same platform.
Apache Hive / Presto â€” Open-source options for organizations running on-premise Hadoop clusters. Still used in many Indian enterprises.

For most new projects in India, BigQuery and Snowflake are the most popular choices due to their ease of setup and managed infrastructure.

Best Practices for Schema Design

Start with business questions â€” Design your schema to answer the queries your business actually needs, not to mirror your source system structure
Prefer star schema â€” Unless you have a strong, specific reason for snowflake schema, star is almost always the better choice for analytics
Use surrogate keys â€” Generate integer primary keys for dimension tables instead of using natural keys from source systems. This insulates your warehouse from source system changes.
Build a robust dim_date â€” Pre-populate a date dimension with every date for 10+ years, including fiscal periods, holidays (Indian holidays are critical for retail analytics), and flags like is_weekend
Implement SCD Type 2 for critical dimensions â€” At minimum, track history for customer, product, and employee dimensions
Avoid wide fact tables â€” Keep fact tables narrow with only foreign keys and measures. If you need descriptive attributes in query results, join to dimension tables.
Document your grain â€” Clearly define what one row in your fact table represents (one order line item? one transaction? one daily aggregate?). This is the most important design decision.
Test with real queries â€” Before finalizing your schema, write the top 20 business queries your team needs and verify they can be expressed cleanly against your design

Data warehouse schema design is one of those skills where fundamentals matter more than tools. Platforms will change â€” BigQuery might be replaced by something new in five years â€” but the principles of dimensional modeling have remained relevant for three decades. Master the concepts, and the tools become interchangeable.

Data Warehousing 101: Star Schema vs Snowflake Schema

Data Warehousing 101: Star Schema vs Snowflake Schema

What is a Data Warehouse?

OLTP vs OLAP

Dimensional Modeling: Facts and Dimensions

Fact Tables

Dimension Tables

Star Schema Explained

Structure

Star Schema Query Example

Advantages of Star Schema

Snowflake Schema Explained

Structure

Snowflake Schema Query Example

Star Schema vs Snowflake Schema

When to Use Star Schema

When to Use Snowflake Schema

Real-World Example: E-Commerce Data Warehouse

Fact Table: fact_orders

Sample Dimension: dim_customer

Slowly Changing Dimensions (SCD)

SCD Type 1: Overwrite

SCD Type 2: Add New Row

SCD Type 3: Add New Column

Modern Data Warehouse Tools

Best Practices for Schema Design

Investment Banking Career Path: Roles, Promotions, and Growth Explained?

Investment Banking Resume Guide: How to Build a Winning CV?

Multi-Agent Systems in 2026: When to Use One Agent vs Many

Investment Banking Salary in India: Complete 2026 Guide

Your Model Accuracy Dashboard Looks Fine. Your Business Metric Is Collapsing.

Recommended

Apache Kafka: Building Real-Time Data Streaming Pipelines

Building Your First Data Pipeline: A Beginner's Guide to ETL

Explore Meritshot

Resources

Company

FAQs