Scope
- Use cases
- Landing raw topic data into cost-efficient, queryable storage.
- Preserving event-time, watermarks, and minimal transformations.
- Non-use cases
- Entity enrichment or SCD history building.
- Business-level aggregations or feature generation.
Common steps
Build context
- Identify source topics and sample events (e.g., Search, ProductClicked).
- Decide target table format and partitioning strategy (e.g., Iceberg daily partitions).
Implementation notes
- Prefer event-time partitioning (e.g., PARTITIONED BY (days(event_time))) to reduce small files while keeping predictable pruning.
- Use WATERMARK to model lateness and support downstream time-based operations.
- Keep the raw zone schema close to the source; avoid lossy casts or complex transformations.
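The notes above could be sketched as a Kafka source table, assuming Flink SQL. This is an illustrative sketch rather than the reference example: the table name raw_events_src, the column names, the topic name, the broker address, the consumer group, and the 5-second lateness bound are all assumptions to be replaced with values from the actual producers.

```sql
-- Hypothetical raw-zone source table; names and values are placeholders.
CREATE TABLE raw_events_src (
  event_id   STRING,
  event_type STRING,        -- e.g. 'Search' or 'ProductClicked'
  payload    STRING,        -- kept as-is to avoid lossy casts in the raw zone
  event_time TIMESTAMP(3),
  -- Declares event_time as the event-time attribute and tolerates
  -- records arriving up to 5 seconds out of order (assumed bound).
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',                                  -- assumed topic name
  'properties.bootstrap.servers' = 'localhost:9092',   -- assumed broker
  'properties.group.id' = 'raw-landing',               -- assumed group id
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
```

The watermark delay trades completeness against latency: a larger interval admits more late events into downstream time-based operations but delays their results.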
RESINK.AI recommendations
Example
Variations
- S3-backed Iceberg catalog (catalog-first)
- Paimon catalog alternative
- Partitioning strategies
- Iceberg: PARTITIONED BY (days(event_time)), or PARTITIONED BY (hours(event_time)) for very high volume.
- Paimon: PARTITIONED BY (dt), where dt is a DATE derived from event_time.
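For the Paimon variation, deriving dt from event_time might look like the sketch below, assuming Flink SQL. The catalog name paimon_raw, the warehouse path, the table and column names, and the existence of a source table default_catalog.default_database.raw_events_src are all hypothetical; an S3 warehouse also assumes the matching Paimon filesystem dependency is available to the job.

```sql
-- Hypothetical Paimon catalog; the warehouse path is a placeholder.
CREATE CATALOG paimon_raw WITH (
  'type' = 'paimon',
  'warehouse' = 's3://example-bucket/warehouse'
);

-- Raw-zone table partitioned by a DATE column derived from event_time.
CREATE TABLE IF NOT EXISTS paimon_raw.`default`.raw_events (
  event_id   STRING,
  event_type STRING,
  payload    STRING,
  event_time TIMESTAMP(3),
  dt         DATE
) PARTITIONED BY (dt);

-- Land the raw events, computing dt at write time.
-- The source table name is an assumption for illustration.
INSERT INTO paimon_raw.`default`.raw_events
SELECT
  event_id,
  event_type,
  payload,
  event_time,
  CAST(event_time AS DATE) AS dt
FROM default_catalog.default_database.raw_events_src;
```

Because dt is an ordinary column here, queries need to filter on dt (not only on event_time) to benefit from partition pruning.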
Troubleshooting
Kafka deserialization or schema mismatch
Ensure the format and JSON field names match the producers. Consider adding optional fields with NULL defaults to avoid hard failures during schema evolution.
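One way to apply this advice, assuming Flink SQL with the JSON format, is sketched below. The table name, column names, topic, and broker address are hypothetical, and referrer stands in for a field that only newer producers emit.

```sql
-- Hypothetical source table that tolerates missing optional fields.
CREATE TABLE raw_events_lenient (
  event_id   STRING,
  event_type STRING,
  event_time TIMESTAMP(3),
  referrer   STRING          -- nullable; assumed to be absent in older events
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',                                  -- assumed topic name
  'properties.bootstrap.servers' = 'localhost:9092',   -- assumed broker
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json',
  -- Missing fields are read as NULL instead of failing the job.
  'json.fail-on-missing-field' = 'false',
  -- Malformed records are skipped instead of failing the job.
  'json.ignore-parse-errors' = 'true'
);
```

Skipping malformed records keeps the job running but drops data silently, so json.ignore-parse-errors is probably best enabled only when losing unparseable events is acceptable for the raw zone.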
