Scope
- Use cases
  - Deduplicate events with identical keys/timestamps.
  - Include late events within a bounded lateness.
- Non-use cases
  - Complex reorder buffering beyond watermark tolerance.
Common steps
Build context
- Identify natural keys and timestamps for dedup (e.g., user_id, event_time).
- Choose acceptable lateness (e.g., 2–10 minutes).
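One way to ground the lateness choice is to measure how late events actually arrive. The following is a minimal sketch, not a prescribed method: the `(event_time, arrival_time)` sample pairs, the `lateness_percentile` helper, and the 90th-percentile target are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def lateness_percentile(samples, pct):
    """Return the pct-th percentile (1..100) of arrival delay, nearest-rank method.

    `samples` is a list of (event_time, arrival_time) pairs (hypothetical shape).
    """
    if not samples:
        raise ValueError("no samples")
    # Clamp negative delays (clock skew) to zero.
    delays = sorted(max(arrival - event, timedelta(0)) for event, arrival in samples)
    # Integer ceiling of pct * n / 100, with a floor of rank 1.
    rank = max(1, (pct * len(delays) + 99) // 100)
    return delays[rank - 1]


base = datetime(2024, 1, 1, 12, 0, 0)
# Hypothetical observed delays, in seconds, between event time and arrival.
delay_seconds = [1, 2, 2, 3, 5, 8, 30, 90, 240, 600]
samples = [(base, base + timedelta(seconds=s)) for s in delay_seconds]

p90 = lateness_percentile(samples, 90)
print(p90)  # 0:04:00
```

With this made-up sample, a 4-minute bound would cover 90% of events, while covering every event would need the full 10 minutes. The trade-off is that a larger bound keeps more state open for longer.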
Implementation notes
- ROW_NUMBER() OVER (PARTITION BY <keys> ORDER BY <time> DESC) is a common pattern for keeping only the latest row per key: keep the rows where the row number is 1.
- For aggregations, use allowed lateness via watermarks and updatable windows.
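The ROW_NUMBER() pattern can be tried locally with Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer). This is a sketch under assumptions: the `events` table, its columns, and the use of an `ingested_at` column as the "latest" tiebreaker are hypothetical, and your engine's SQL dialect may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, event_time TEXT, ingested_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        # Two copies share the natural key (user_id, event_time).
        ("u1", "2024-01-01T10:00:00", "2024-01-01T10:00:01", "first copy"),
        ("u1", "2024-01-01T10:00:00", "2024-01-01T10:00:05", "retry copy"),
        ("u1", "2024-01-01T10:05:00", "2024-01-01T10:05:02", "only copy"),
        ("u2", "2024-01-01T10:00:00", "2024-01-01T10:00:03", "only copy"),
    ],
)

DEDUP_SQL = """
SELECT user_id, event_time, payload
FROM (
    SELECT
        user_id, event_time, payload,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, event_time   -- natural dedup key
            ORDER BY ingested_at DESC          -- latest copy gets rn = 1
        ) AS rn
    FROM events
)
WHERE rn = 1
ORDER BY user_id, event_time
"""

deduped = conn.execute(DEDUP_SQL).fetchall()
for row in deduped:
    print(row)
```

Of the two rows sharing `("u1", "2024-01-01T10:00:00")`, only the "retry copy" survives, because it has the later `ingested_at`.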
RESINK.AI recommendations
Example
Variations
- Deduplicate with sequence numbers
- Late event tolerant daily aggregates
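Both variations can be illustrated with a small in-memory sketch. It is a simplified model, not a streaming engine: the event dictionaries, the `dedup_by_sequence` and `daily_counts` helpers, and the 5-minute allowed lateness are assumptions. The watermark here is simply the largest event time seen so far, and a daily window keeps accepting updates until the watermark reaches the window end plus the allowed lateness.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def dedup_by_sequence(events):
    """Keep the first event seen for each (key, seq) pair, preserving order."""
    seen = set()
    unique = []
    for ev in events:
        ident = (ev["key"], ev["seq"])
        if ident not in seen:
            seen.add(ident)
            unique.append(ev)
    return unique


def daily_counts(events, allowed_lateness):
    """Count events per (key, day), processing events in arrival order.

    An event is dropped once the watermark has reached the end of its
    daily window plus `allowed_lateness`.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = datetime.min
    for ev in events:
        ts = ev["event_time"]
        watermark = max(watermark, ts)
        day_start = datetime.combine(ts.date(), datetime.min.time())
        window_end = day_start + timedelta(days=1)
        if watermark >= window_end + allowed_lateness:
            dropped.append(ev)  # too late: the window is closed
        else:
            counts[(ev["key"], ts.date())] += 1  # on time or tolerably late
    return dict(counts), dropped


# Hypothetical events, listed in arrival order.
arrivals = [
    {"key": "u1", "seq": 1, "event_time": datetime(2024, 1, 1, 23, 58)},
    {"key": "u1", "seq": 1, "event_time": datetime(2024, 1, 1, 23, 58)},  # duplicate
    {"key": "u1", "seq": 2, "event_time": datetime(2024, 1, 2, 0, 1)},
    {"key": "u1", "seq": 3, "event_time": datetime(2024, 1, 1, 23, 59)},  # late, accepted
    {"key": "u1", "seq": 4, "event_time": datetime(2024, 1, 2, 0, 10)},
    {"key": "u1", "seq": 5, "event_time": datetime(2024, 1, 1, 23, 50)},  # too late
]

unique = dedup_by_sequence(arrivals)
counts, dropped = daily_counts(unique, allowed_lateness=timedelta(minutes=5))
```

In this sample, the event with `seq` 3 arrives after midnight but still updates the January 1 count, while `seq` 5 arrives after the watermark has passed 00:05 and is dropped.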
Troubleshooting
Too many duplicates slip through
Check that partition keys exactly match the natural dedup key and that sources are not reformatting timestamps.
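To see why reformatted timestamps defeat dedup, compare raw strings with parsed values. The sketch below assumes ISO 8601 inputs and treats timestamps without an offset as UTC; both are assumptions, and `normalize_ts` is a hypothetical helper.

```python
from datetime import datetime, timezone


def normalize_ts(raw):
    """Parse an ISO 8601 timestamp and return one canonical UTC string."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# The same instant, as three sources might format it (hypothetical values).
raw_values = [
    "2024-01-01T10:00:00Z",
    "2024-01-01 12:00:00+02:00",
    "2024-01-01T10:00:00.000+00:00",
]

distinct_raw = set(raw_values)
distinct_normalized = {normalize_ts(v) for v in raw_values}
print(len(distinct_raw), len(distinct_normalized))  # 3 1
```

Partitioning on the raw strings would treat these as three different events. Partitioning on the normalized value treats them as one.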
High lateness increases state size
Reduce watermark delay or split streams by key ranges. Consider compaction on sinks.
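Splitting by key ranges can be sketched as a routing function that sends each key to one of several substreams, so each dedup or aggregation operator holds state for only its own range. The split points and the `range_shard` helper below are hypothetical. Real boundaries should come from the observed key distribution.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical split points: shard 0 is keys < "h", shard 1 is "h" <= key < "p",
# shard 2 is keys >= "p".
BOUNDARIES = ["h", "p"]


def range_shard(key, boundaries=BOUNDARIES):
    """Return the index of the key range that `key` falls into."""
    return bisect_right(boundaries, key)


keys = ["alice", "henry", "zoe", "alice", "pat"]
by_shard = defaultdict(list)
for k in keys:
    by_shard[range_shard(k)].append(k)

print(dict(by_shard))  # {0: ['alice', 'alice'], 1: ['henry'], 2: ['zoe', 'pat']}
```

Because a given key always maps to the same shard, duplicates of that key still meet in the same operator.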

