Skip to main content

Realtime Data Warehouse Basics

Data Categories

  1. Event (fact tables): Event data captures measurements at a point in time, often at the grain of an event.
    1.1. Examples:
    • fact_user_login (each login event: user_id, device_id, ts, location, success/fail)
    • fact_order (order submission: order_id, user_id, product_id, ts, revenue, discount)
      1.2. Properties: Immutable, append-only, usually very large, partitioned by time.
  2. Entity (dimension tables): Entity data captures the attributes of entities used to enrich facts. They provide context (“the who, what, where”).
    2.1. Examples:
    • dim_user (user_id, demographics, signup_date, etc.)
    • dim_device (device_id, os, model, manufacturer, etc.)
    • dim_product (product_id, category, price, etc.)
      2.2. Properties: Mutable, often modeled with Slowly Changing Dimensions (SCD) to preserve history. Much smaller than fact/event tables.
  3. Aggregation (summary/mart tables): Pre-computed aggregates of facts, optimized for queries.
    3.1. Examples:
    • agg_daily_active_users
    • agg_monthly_revenue_by_region
      3.2. Properties: Not raw events or entities; derived, materialized layer for performance.

Entity History Preserving Strategy

Entity state changes over time. Because they are often used to enrich events, preserving history for point-in-time query is crucial.
  1. Daily Snapshot (point-in-time rollups)
    1.1. Idea: Once per day (or hour), materialize a snapshot for selected, fast-changing attributes (e.g., last_viewed_page).
    1.2. Pros: Much smaller history for noisy fields; queries like “as of 2025-08-20 00:00” land on the day’s snapshot.
    1.3. Cons: Coarser than SCD; intra-day fidelity is lost unless you increase frequency.
  2. Type 2 SCD (row-versioned)
    2.1. Idea: Keep one row per version of a user with [valid_from, valid_to) and the current record is when valid_to = '9999-12-31 23:59'.
    2.2. Pros: Precise “as-of” queries (e.g., balance on 2025-01-01). Works for any attribute.
    2.3. Cons: More writes; need discipline to set valid ranges in stream.
  3. Change (audit) log
    3.1. Idea: Persist CDC facts (insert/update/delete) with system_time (arrival) and optional valid_time.
    3.2. Pros: Forensics, replay, and flexible reconstructions.
    3.3. Cons: Heavier storage and more complex queries; usually complement SCD, not replace it.
  4. Bitemporal (valid time + system time)
    4.1. Idea: SCD2 + audit columns (arrive_ts, retract_ts).
    4.2. Pros: Handle late/corrected data cleanly (“what we believed on date X?” vs “what was true on business date Y?”).
    4.3. Cons: Most complex; adopt only if you truly need it.

RESINK.AI Recommendations

By default, you should go with the daily snapshot strategy for simplicity and cost efficiency.