Introduction
Data warehousing powers reporting and analytics across enterprises. If you’re learning or documenting data warehouse concepts for a WordPress blog, this SEO-optimized post covers the must-know keywords, fact table types (transaction, snapshot, accumulating), dimension behavior (SCD types), and concise examples you can use right away.
Core Concepts
- Data Warehouse (DW) – Central repository for integrated historical and current data used for reporting and analytics.
- ETL (Extract, Transform, Load) – Traditional pipeline: extract from sources → transform → load into DW.
- ELT (Extract, Load, Transform) – Modern approach: load raw data into DW/cloud, then transform inside it.
- OLTP (Online Transaction Processing) – Systems for day-to-day transactional processing (source systems).
- OLAP (Online Analytical Processing) – Systems/tools optimized for multi-dimensional analysis and complex queries.
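To make the ETL/ELT distinction concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (raw_orders, orders_clean) and the sample rows are hypothetical; a real pipeline would use a dedicated engine and tooling.

```python
import sqlite3

# Hypothetical raw extract from an OLTP source: (order_id, amount as text, country)
source_rows = [(1, " 19.99", "us"), (2, "5.00 ", "DE"), (3, "12.50", "de")]

def run_etl(conn, rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    cleaned = [(oid, float(amt.strip()), country.upper()) for oid, amt, country in rows]
    conn.execute("CREATE TABLE orders_clean (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", cleaned)

def run_elt(conn, rows):
    """ELT: load raw rows as-is, then transform with SQL inside the database."""
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.execute(
        """CREATE TABLE orders_clean AS
           SELECT order_id,
                  CAST(TRIM(amount) AS REAL) AS amount,
                  UPPER(country) AS country
           FROM raw_orders"""
    )

etl_conn = sqlite3.connect(":memory:")
elt_conn = sqlite3.connect(":memory:")
run_etl(etl_conn, source_rows)
run_elt(elt_conn, source_rows)

query = "SELECT order_id, amount, country FROM orders_clean ORDER BY order_id"
etl_result = etl_conn.execute(query).fetchall()
elt_result = elt_conn.execute(query).fetchall()
```

Both paths end with the same cleaned table. The difference is where the transformation runs, and that ELT keeps the raw rows available for later reprocessing.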
Types of Data Warehouses
- Enterprise Data Warehouse (EDW) – Organization-wide, authoritative repository.
- Data Mart – Department-focused subset (e.g., Sales Data Mart).
- Operational Data Store (ODS) – Near-real-time store for operational reporting and short-term history.
- Cloud Data Warehouse – Fully managed cloud services (e.g., Snowflake, BigQuery, Azure Synapse).
Schema & Design
- Star Schema – One central fact table joined to denormalized dimension tables. Simple and fast for queries.
- Snowflake Schema – Dimension tables are normalized into multiple related tables, which reduces redundancy but requires more joins.
- Fact Table – Stores measurements/measurable events (e.g., sales amount, quantity).
- Dimension Table – Describes context for facts (e.g., Customer, Product, Date).
- Granularity – Level of detail (e.g., transaction-level vs daily aggregate).
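As an illustration of a star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical aggregate query. All table names, column names, and rows are hypothetical examples, not a recommended production design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
-- Grain: one row per order line
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    sales_amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240105, '2024-01-05', '2024-01'), (20240210, '2024-02-10', '2024-02');
INSERT INTO fact_sales VALUES (1, 20240105, 2, 2000.0), (2, 20240105, 1, 300.0), (1, 20240210, 1, 1000.0);
""")

# Facts supply the numbers; dimensions supply the labels to group and filter by.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Each query needs only one join per dimension, which is what makes the star layout simple to query.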
Types of Fact Tables
Fact tables represent events or measures. Main types:
- Transaction Facts – Each row is an individual event (e.g., a single order line, a payment). Finest grain and append-heavy.
- Snapshot Facts – Capture the state of an entity at a specific time (e.g., month-end balance, daily inventory snapshot).
- Accumulating Facts – Track lifecycle/process milestones and get updated as steps complete (e.g., order → fulfillment → delivery). Useful to measure elapsed times between milestones.
- Factless Facts – Record events or coverage without numeric measures (e.g., student attendance, promotion eligibility).
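The difference between these fact types is easier to see with data. The sketch below uses plain Python structures with hypothetical field names: transaction rows are appended per event, snapshot rows are derived as a state at a date, and an accumulating row is updated in place as milestones complete.

```python
from datetime import date

# Transaction facts: one row per event (hypothetical account movements).
transactions = [
    {"account": "A", "date": date(2024, 1, 10), "amount": 100},
    {"account": "A", "date": date(2024, 1, 25), "amount": -30},
    {"account": "A", "date": date(2024, 2, 3), "amount": 50},
]

def month_end_balance(rows, account, as_of):
    """Snapshot fact: state at a point in time, derived here by summing events."""
    return sum(r["amount"] for r in rows if r["account"] == account and r["date"] <= as_of)

snapshots = [
    {"account": "A", "snapshot_date": d, "balance": month_end_balance(transactions, "A", d)}
    for d in (date(2024, 1, 31), date(2024, 2, 29))
]

# Accumulating fact: one row per order, updated in place as milestones complete.
order_fact = {"order_id": 1001, "ordered": date(2024, 3, 1), "shipped": None,
              "delivered": None, "days_to_deliver": None}

def record_milestone(row, milestone, when):
    """Fill in a milestone date and recompute the elapsed-time measure."""
    row[milestone] = when
    if row["ordered"] and row["delivered"]:
        row["days_to_deliver"] = (row["delivered"] - row["ordered"]).days
    return row

record_milestone(order_fact, "shipped", date(2024, 3, 3))
record_milestone(order_fact, "delivered", date(2024, 3, 6))
```

Note that the accumulating row is the only one that gets updated after it is first written; the other two types are insert-only in this sketch.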
Types of Dimensions
- Conformed Dimension – Reused across multiple facts/data marts (e.g., a single Customer dimension used by sales and support).
- Role-Playing Dimension – Same dimension used for multiple roles (e.g., Date as order_date, ship_date, invoice_date).
- Degenerate Dimension – A dimension attribute stored directly in the fact table (e.g., invoice number), with no separate dimension table.
- Junk Dimension – Combines low-cardinality flags and indicators into a single small dimension table to avoid cluttering the fact table with many columns.
- Slowly Changing Dimension (SCD) – Describes strategies to handle changes to dimension attributes over time. See next section for details.
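One way to picture a junk dimension is as a small lookup table with one row per combination of flag values. The sketch below is an assumption-laden example: the flag names and values are hypothetical, and the surrogate keys are simply assigned in enumeration order.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise be separate fact columns.
flag_values = {
    "is_gift": (False, True),
    "payment_type": ("card", "cash"),
    "is_express": (False, True),
}

# Junk dimension: one row per combination of flag values, each with a surrogate key.
junk_dim = {
    key: dict(zip(flag_values, combo))
    for key, combo in enumerate(product(*flag_values.values()), start=1)
}
junk_lookup = {tuple(row.values()): key for key, row in junk_dim.items()}

def junk_key(is_gift, payment_type, is_express):
    """Resolve a combination of flags to its junk-dimension surrogate key."""
    return junk_lookup[(is_gift, payment_type, is_express)]

# The fact row carries one junk_key instead of three flag columns.
# order_id stays on the fact row as a degenerate dimension.
fact_row = {"order_id": 1001, "amount": 25.0, "junk_key": junk_key(True, "card", False)}
```

With three flags of two values each, the dimension has only eight rows, regardless of how many fact rows reference it.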
Slowly Changing Dimensions (SCDs) — Types & Examples
SCDs define how historical changes in dimension attributes are handled. Choose a type based on analytic requirements and storage cost:
SCD Type 0 — No Change
The attribute never changes (static). Example: a legacy product code that must remain as originally loaded.
SCD Type 1 — Overwrite
New values overwrite existing records. No history retained. Example: correct a customer’s misspelled name and replace the old value.
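A minimal sketch of a Type 1 overwrite, using a plain dictionary as a stand-in for the dimension table (the keys and values are hypothetical):

```python
# Hypothetical customer dimension keyed by business key.
customer_dim = {101: {"customer_key": 1, "name": "Jhon Smith", "city": "Austin"}}

def apply_scd1(dim, customer_id, changes):
    """Overwrite attributes in place; the previous values are lost."""
    dim[customer_id].update(changes)
    return dim[customer_id]

apply_scd1(customer_dim, 101, {"name": "John Smith"})
```

The surrogate key and the row count stay the same, so every existing fact row now reports under the corrected name.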
SCD Type 2 — Add New Row (Full History)
Each change inserts a new row with an effective date range or version number, so history is preserved. Typical implementations use effective_from and effective_to columns, a current-row flag, or both.
Example: Customer moves city — SCD2 creates a new customer dimension row with a new surrogate key, while the old row stays for historical reporting.
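The city-change example can be sketched as follows. This is one possible implementation under stated assumptions: the column names customer_key, customer_id, and is_current are illustrative, an open-ended effective_to is represented as None, and date ranges are treated as half-open (effective_from inclusive, effective_to exclusive).

```python
from datetime import date

# Hypothetical customer dimension; customer_id is the business (natural) key.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "city": "Austin",
     "effective_from": date(2020, 1, 1), "effective_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """Close the current row and append a new version with a new surrogate key."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["is_current"])
    if current["city"] == new_city:
        return current  # nothing changed, keep the existing version
    current["effective_to"] = change_date
    current["is_current"] = False
    new_row = {"customer_key": max(r["customer_key"] for r in dim) + 1,
               "customer_id": customer_id, "city": new_city,
               "effective_from": change_date, "effective_to": None, "is_current": True}
    dim.append(new_row)
    return new_row

def city_as_of(dim, customer_id, as_of):
    """Look up the version that was valid on a given date."""
    for r in dim:
        if (r["customer_id"] == customer_id and r["effective_from"] <= as_of
                and (r["effective_to"] is None or as_of < r["effective_to"])):
            return r["city"]
    return None

apply_scd2(customer_dim, "C100", "Denver", date(2023, 6, 1))
```

Facts loaded before the move keep pointing at surrogate key 1 (Austin), while new facts reference key 2 (Denver), which is what lets reports reflect the city as it was at the time of each event.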
SCD Type 3 — Partial History
Keep limited history by adding columns like previous_value and current_value. Only the last change or limited changes are tracked.
Example: A customer’s previous country and current country stored as separate columns.
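A small sketch of the Type 3 pattern, assuming hypothetical column names current_country and previous_country on a single row:

```python
# Hypothetical dimension row with one "previous" column.
customer = {"customer_key": 1, "current_country": "France", "previous_country": None}

def apply_scd3(row, new_country):
    """Shift the current value into the 'previous' column, then overwrite."""
    if new_country != row["current_country"]:
        row["previous_country"] = row["current_country"]
        row["current_country"] = new_country
    return row

apply_scd3(customer, "Spain")
apply_scd3(customer, "Italy")  # after this call, France is no longer recoverable
```

After the second change only Spain and Italy remain on the row, which illustrates why Type 3 is described as partial history.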
Hybrid / Mixed SCD
Combine SCD strategies for different attributes on the same table. E.g., overwrite some fields (Type 1), keep full history for address (Type 2), and store last value for preferred language (Type 3).
Data Types — Logical & Technical
Logical Classification (what the data represents)
- Fact Data – Measured values (sales amount, clicks).
- Dimension Data – Descriptive/contextual (product name, customer segment).
- Aggregated Data – Summaries for performance (daily totals, monthly averages).
- Operational Data – Near real-time transactional data from source systems.
- Metadata – Data about data: schema, lineage, source system mapping.
Typical Database Data Types (technical)
- Numeric – INTEGER, BIGINT, DECIMAL/NUMERIC (exact; suited to monetary amounts), FLOAT (approximate; suited to measurements where small rounding error is acceptable).
- Character – CHAR, VARCHAR, TEXT (names, descriptions, codes).
- Date/Time – DATE, TIMESTAMP, DATETIME (order date, event time).
- Boolean – BOOLEAN / BIT (flags, true/false attributes).
- Binary / BLOB – Binary large objects for images or files (rare in DW fact/dimension tables).
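The choice between exact and approximate numeric types matters most for monetary amounts. The short sketch below uses Python's float and decimal.Decimal as rough analogues of FLOAT and DECIMAL columns; the sample amounts are hypothetical.

```python
from decimal import Decimal

line_amounts = ["0.10", "0.20"]  # hypothetical amounts as extracted text

# Binary floating point cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = sum(float(a) for a in line_amounts)

# Decimal arithmetic keeps the values exact.
decimal_total = sum(Decimal(a) for a in line_amounts)
```

Here float_total is not equal to 0.3, while decimal_total equals Decimal("0.30") exactly. Across millions of fact rows, such small errors can show up in reconciliations.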
Processing & Storage
- Staging Area – Temporary workspace where raw extracts land and are cleansed before loading into the DW.
- Data Lake – Repository for raw/unstructured/semi-structured data often used as the source for DW/ELT.
- Cold/Warm/Hot Storage – Classify data by access patterns and cost requirements (hot = frequently accessed).
Performance & Optimization
- Indexing – Speed up lookups (use carefully on large DW tables).
- Partitioning – Split large tables by date or key for faster scans and management.
- Materialized Views – Precomputed query results for faster reporting.
- Denormalization – Favor read performance for OLAP workloads (e.g., star schema).
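Partitioning and materialized views can be illustrated without a database. The sketch below is a conceptual model in plain Python, not how any particular engine implements them: rows are grouped into monthly partitions so a query scans only the relevant one, and an aggregate is precomputed once so reads do not rescan the facts. All names and rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

fact_sales = [
    {"sale_date": date(2024, 1, 5), "amount": 100},
    {"sale_date": date(2024, 1, 20), "amount": 50},
    {"sale_date": date(2024, 2, 2), "amount": 75},
]

# Partitioning: group rows by month so a query can skip irrelevant partitions.
partitions = defaultdict(list)
for row in fact_sales:
    partitions[row["sale_date"].strftime("%Y-%m")].append(row)

def monthly_total(month):
    """Scan only the partition for the requested month (partition pruning)."""
    return sum(r["amount"] for r in partitions.get(month, []))

# Materialized-view idea: precompute the aggregate once, then serve reads from it.
monthly_sales_mv = {month: sum(r["amount"] for r in rows) for month, rows in partitions.items()}
```

The trade-off is freshness: a precomputed result like monthly_sales_mv must be refreshed whenever the underlying facts change.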
Governance & Quality
- Data Cleansing – Standardize and correct data before it becomes authoritative.
- Data Lineage – Trace where values came from and how they changed (essential for trust & audits).
- Master Data Management (MDM) – Centralize canonical entities like customer and product.
- Data Governance – Policies, roles, and rules to manage data quality, privacy, access and compliance.
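Data cleansing is often implemented as a set of small, testable standardization rules. The function below is a sketch with illustrative rules only (whitespace normalization, case standardization, and two basic validity checks); real rules depend on the source systems and governance policies in place.

```python
def cleanse_customer(record):
    """Standardize a raw customer record; rules here are illustrative only."""
    name = " ".join(record.get("name", "").split()).title()
    email = record.get("email", "").strip().lower()
    country = record.get("country", "").strip().upper()
    issues = []
    if "@" not in email:
        issues.append("invalid_email")
    if len(country) != 2:
        issues.append("invalid_country_code")
    return {"name": name, "email": email, "country": country, "issues": issues}

clean = cleanse_customer({"name": "  ada   LOVELACE ", "email": " Ada@Example.COM ", "country": "gb "})
```

Returning a list of issues alongside the cleaned values, rather than silently dropping bad rows, keeps a record that can support lineage and audits.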
Quick Cheatsheet
| Term | Short Explanation | Example |
|---|---|---|
| Transaction Fact | Row-per-event with measures. | Each order line with price and qty. |
| Snapshot Fact | State captured at a time. | Monthly account balances. |
| Accumulating Fact | Progress of a process; rows updated. | Order lifecycle status and timestamps. |
| SCD Type 2 | Keep history by adding new rows per change. | Customer address history over time. |
| Conformed Dimension | Shared dimension across marts/facts. | One Customer table used by Sales & Support. |
FAQ
Q: When should I use SCD Type 2 vs Type 1?
A: Use SCD Type 2 when historical accuracy is required (e.g., reporting by customer’s historical region). Use Type 1 when only the latest value matters and history is not needed (e.g., correcting a typo).
Q: Should I store images or documents in the data warehouse?
A: Generally no — store large binaries in object storage (data lake or blob store) and keep references/URLs in the DW.
Conclusion
This post provides a compact but comprehensive reference for data warehouse keywords, fact/dimension types, and SCD strategies. Use it as a template for documentation or training, or as SEO-optimized content for your WordPress blog.