Realtime Data Warehouse Basics
Data Categories
-
Event (fact tables): Event data captures measurements at a point in time, often at the grain of an event.
1.1. Examples:
fact_user_login (each login event: user_id, device_id, ts, location, success/fail)
fact_order (order submission: order_id, user_id, product_id, ts, revenue, discount)
1.2. Properties: Immutable, append-only, usually very large, partitioned by time.
-
Entity (dimension tables): Entity data captures the attributes of entities used to enrich facts. They provide context (“the who, what, where”).
2.1. Examples:
dim_user (user_id, demographics, signup_date, etc.)
dim_device (device_id, os, model, manufacturer, etc.)
dim_product (product_id, category, price, etc.)
2.2. Properties: Mutable, often modeled with Slowly Changing Dimensions (SCD) to preserve history. Much smaller than fact/event tables.
-
Aggregation (summary/mart tables): Pre-computed aggregates of facts, optimized for queries.
3.1. Examples:
agg_daily_active_users
agg_monthly_revenue_by_region
3.2. Properties: Not raw events or entities; derived, materialized layer for performance.
Entity History Preserving Strategy
Entity state changes over time. Because they are often used to enrich events, preserving history for point-in-time query is crucial.
-
Daily Snapshot (point-in-time rollups)
1.1. Idea: Once per day (or hour), materialize a snapshot for selected, fast-changing attributes (e.g., last_viewed_page).
1.2. Pros: Much smaller history for noisy fields; queries like “as of 2025-08-20 00:00” land on the day’s snapshot.
1.3. Cons: Coarser than SCD; intra-day fidelity is lost unless you increase frequency.
-
Type 2 SCD (row-versioned)
2.1. Idea: Keep one row per version of a user with [valid_from, valid_to) and the current record is when valid_to = '9999-12-31 23:59'.
2.2. Pros: Precise “as-of” queries (e.g., balance on 2025-01-01). Works for any attribute.
2.3. Cons: More writes; need discipline to set valid ranges in stream.
-
Change (audit) log
3.1. Idea: Persist CDC facts (insert/update/delete) with system_time (arrival) and optional valid_time.
3.2. Pros: Forensics, replay, and flexible reconstructions.
3.3. Cons: Heavier storage and more complex queries; usually complement SCD, not replace it.
-
Bitemporal (valid time + system time)
4.1. Idea: SCD2 + audit columns (arrive_ts, retract_ts).
4.2. Pros: Handle late/corrected data cleanly (“what we believed on date X?” vs “what was true on business date Y?”).
4.3. Cons: Most complex; adopt only if you truly need it.
RESINK.AI Recommendations
By default, you should go with the daily snapshot strategy for simplicity
and cost efficiency.