Data Modeling for BI
Dimensional Modeling Fundamentals
Dimensional Model Overview
Core components of dimensional models and their relationships
Legend
Components
Connection Types
Key Dimensional Modeling Concepts
Facts vs. Dimensions
The distinction between facts and dimensions is central to business analytics:
- Facts represent business processes and events that can be measured
- Dimensions provide context and filtering capabilities for analysis
- Proper modeling allows flexible "slicing and dicing" of metrics
- This approach aligns with how business users naturally think about data
- Enables both high-level summaries and detailed drill-downs
Facts vs. Dimensions
Technical implementation details for facts and dimensions:
- Fact tables contain foreign keys to dimensions and numerical measures
- Facts are typically normalized and partitioned by date
- Dimension tables contain descriptive attributes and hierarchies
- Dimensions often use surrogate keys generated during ETL
- Fact grain determines the level of detail captured in analysis
Star Schema Design
Star Schema Architecture
The structure of a star schema with a central fact table and related dimension tables
Legend
Components
Connection Types
Example Star Schema Design
-- Dimension: Date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE UNIQUE,
day_of_week VARCHAR(10),
day_of_month INT,
month_number INT,
month_name VARCHAR(10),
quarter INT,
year INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN,
fiscal_year INT,
fiscal_quarter INT
);
-- Dimension: Product
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(50) UNIQUE, -- natural key
product_name VARCHAR(100),
product_description TEXT,
brand VARCHAR(50),
category VARCHAR(50),
subcategory VARCHAR(50),
department VARCHAR(50),
unit_cost DECIMAL(10,2),
unit_price DECIMAL(10,2),
current_flag BOOLEAN,
effective_date DATE,
expiration_date DATE
);
-- Dimension: Customer
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE, -- natural key
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
address_line1 VARCHAR(100),
address_line2 VARCHAR(100),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(50),
customer_segment VARCHAR(50),
acquisition_date DATE,
lifetime_value DECIMAL(10,2),
current_flag BOOLEAN,
effective_date DATE,
expiration_date DATE
);
-- Fact: Sales
CREATE TABLE fact_sales (
sales_key BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
store_key INT REFERENCES dim_store(store_key),
promotion_key INT REFERENCES dim_promotion(promotion_key),
transaction_id VARCHAR(50),
sales_amount DECIMAL(12,2),
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
net_amount DECIMAL(12,2),
tax_amount DECIMAL(10,2),
gross_profit DECIMAL(12,2),
transaction_timestamp TIMESTAMP,
return_flag BOOLEAN
);
Star vs. Snowflake Schema
The choice between star and snowflake has business implications:
- Star schemas provide faster reports and dashboards for end users
- Snowflake schemas offer better data consistency for reference data
- Star schemas are more intuitive for business users creating reports
- Snowflake schemas may be better for managing hierarchical dimensions
- Most organizations prioritize query performance over storage efficiency
Star vs. Snowflake Schema
Technical comparison of star and snowflake schemas:
- Star schemas denormalize dimensions for simpler queries and faster joins
- Snowflake schemas normalize dimensions to reduce redundancy
- Star schemas typically have fewer joins and better query performance
- Snowflake schemas save storage space but add query complexity
- Star schemas are preferred in most modern data warehouses
Slowly Changing Dimensions
Slowly Changing Dimension Types
SCD strategies allow businesses to manage changing reference data:
- Track customer attribute changes for accurate point-in-time reporting
- Maintain historical product categorizations for trend analysis
- Support compliance requirements by preserving data lineage
- Enable accurate historical comparisons despite master data changes
- Balance history preservation with performance and storage needs
Slowly Changing Dimension Types
Technical implementation of SCD types:
- Type 1: Overwrite the attribute value (no history)
- Type 2: Add a new row with effective dates and current flag
- Type 3: Add new columns for previous values (limited history)
- Type 4: Use history tables for full change tracking
- Type 6: Hybrid approach combining Types 1, 2, and 3
SCD Type 2 Implementation Example
-- SCD Type 2 dimension with history tracking
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY, -- surrogate key
customer_id VARCHAR(50), -- natural key (business key)
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(100),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),
segment VARCHAR(50),
effective_date DATE NOT NULL, -- when this version became effective
expiration_date DATE, -- when this version expired (null if current)
current_flag BOOLEAN NOT NULL, -- indicates if this is current version
-- Create an index on the combination of natural key and current flag
-- for efficient lookups during ETL
CONSTRAINT idx_customer_current UNIQUE (customer_id, current_flag)
WHERE current_flag = TRUE
);
-- Query to handle a customer address change in Type 2 SCD
BEGIN TRANSACTION;
-- Step 1: Set expiration and current flag on existing record
UPDATE dim_customer
SET
expiration_date = CURRENT_DATE - 1,
current_flag = FALSE
WHERE
customer_id = '12345'
AND current_flag = TRUE;
-- Step 2: Insert new record with updated address
INSERT INTO dim_customer (
customer_id, first_name, last_name, email, phone,
address, city, state, postal_code, segment,
effective_date, expiration_date, current_flag
)
SELECT
customer_id, first_name, last_name, email, phone,
'New Address', city, state, postal_code, segment, -- Only address changes
CURRENT_DATE, NULL, TRUE
FROM dim_customer
WHERE
customer_id = '12345'
AND expiration_date = CURRENT_DATE - 1;
COMMIT;
Aggregate Tables and Materialized Views
Aggregation Strategies
Aggregation strategies deliver significant business benefits:
- Dashboard performance improvements of 10-100x for large datasets
- Reduced operational costs through efficient resource utilization
- Support for interactive data exploration at scale
- Enablement of mobile BI with low latency requirements
- Cost-effective scaling of analytics to more business users
Aggregation Strategies
Technical implementation of aggregation strategies:
- Precalculated aggregate tables with dimension foreign keys
- Materialized views with incremental refresh mechanisms
- Summary tables with different temporal granularities
- Aggregate navigation in semantic layers
- Indexing strategies for aggregate tables
Materialized Aggregate Example
-- Create a daily sales aggregate table
CREATE TABLE agg_daily_sales (
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
-- Aggregate measures
total_sales_amount DECIMAL(14,2),
total_quantity INT,
total_discount DECIMAL(12,2),
transaction_count INT,
customer_count INT,
-- Meta information
last_updated_timestamp TIMESTAMP,
-- Primary key
PRIMARY KEY (date_key, product_key, store_key)
);
-- Population/refresh logic
INSERT INTO agg_daily_sales
SELECT
f.date_key,
f.product_key,
f.store_key,
SUM(f.sales_amount) AS total_sales_amount,
SUM(f.quantity) AS total_quantity,
SUM(f.discount_amount) AS total_discount,
COUNT(DISTINCT f.transaction_id) AS transaction_count,
COUNT(DISTINCT f.customer_key) AS customer_count,
CURRENT_TIMESTAMP AS last_updated_timestamp
FROM
fact_sales f
WHERE
f.date_key >= (SELECT MAX(date_key) FROM agg_daily_sales) - 7 -- Incremental refresh
GROUP BY
f.date_key,
f.product_key,
f.store_key
ON CONFLICT (date_key, product_key, store_key)
DO UPDATE SET
total_sales_amount = EXCLUDED.total_sales_amount,
total_quantity = EXCLUDED.total_quantity,
total_discount = EXCLUDED.total_discount,
transaction_count = EXCLUDED.transaction_count,
customer_count = EXCLUDED.customer_count,
last_updated_timestamp = CURRENT_TIMESTAMP;
Conformed Dimensions and Facts
Enterprise Data Warehouse Architecture
Structure of a data warehouse with shared conformed dimensions across multiple fact tables
Legend
Components
Connection Types
Conformed Dimensions
Conformed dimensions deliver essential business capabilities:
- Ensures consistent reporting across different business processes
- Enables integrated dashboards that combine metrics from multiple areas
- Facilitates drill-across capabilities between related subject areas
- Creates a shared business vocabulary across the organization
- Supports enterprise-wide analytics governance
Conformed Dimensions
Technical implementation of conformed dimensions:
- Shared dimension tables referenced by multiple fact tables
- Consistent surrogate key generation across data pipelines
- Master data management integration for consistency
- Attribute standardization and centralized business rules
- Dimension versioning coordination across subject areas
Data Vault Modeling
Data Vault Architecture
Data Vault modeling addresses key business challenges:
- Adaptability to changing business requirements without model redesign
- Auditability for regulatory compliance with full data lineage
- Integration of disparate data sources while preserving source information
- Scalability for enterprise data volumes with parallel loading
- Historical preservation across the enterprise data landscape
Data Vault Architecture
Technical implementation of Data Vault modeling:
- Hub tables that store business keys with minimal context
- Link tables that maintain relationships between business keys
- Satellite tables that store descriptive attributes and history
- Hash keys for performance and referential integrity
- Layered architecture with raw, business, and presentation vaults
Data Vault Core Components
-- Hub: Customer (business entity)
CREATE TABLE hub_customer (
customer_hk CHAR(32) PRIMARY KEY, -- hash key of business key
customer_bk VARCHAR(50) NOT NULL, -- business key
record_source VARCHAR(100) NOT NULL,
load_date TIMESTAMP NOT NULL,
UNIQUE(customer_bk)
);
-- Link: Customer to Account relationship
CREATE TABLE link_customer_account (
link_hk CHAR(32) PRIMARY KEY, -- hash of relationship
customer_hk CHAR(32) NOT NULL REFERENCES hub_customer(customer_hk),
account_hk CHAR(32) NOT NULL REFERENCES hub_account(account_hk),
record_source VARCHAR(100) NOT NULL,
load_date TIMESTAMP NOT NULL,
UNIQUE(customer_hk, account_hk)
);
-- Satellite: Customer details
CREATE TABLE sat_customer_details (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customer(customer_hk),
hashdiff CHAR(32) NOT NULL, -- hash of all attributes
effective_from TIMESTAMP NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,
-- Descriptive attributes
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(200),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),
-- Audit/metadata attributes
is_current BOOLEAN,
expiry_date TIMESTAMP,
PRIMARY KEY (customer_hk, effective_from)
);
Best Practices for BI Data Modeling
Design Principles
-
Prioritize Query Performance
- Denormalize dimensions where appropriate
- Create indexes aligned with common query patterns
- Implement aggregates for frequently used metrics
-
Plan for Scalability
- Design partitioning strategies aligned with data volumes
- Implement incremental processing patterns
- Consider distribution keys in MPP databases
-
Balance Flexibility and Standardization
- Create conformed dimensions for cross-subject analysis
- Document business rules and calculations
- Establish naming conventions and metadata standards
-
Design for Understandability
- Use business terminology in model naming
- Document metrics and dimension definitions
- Create intuitive hierarchies for drill-down analysis
-
Implement Proper History Management
- Choose appropriate SCD types based on business requirements
- Document time variance policies for dimensions
- Consider bi-temporal modeling for regulated industries