Introduction

De-duplicator operator (deduper) eliminates duplicate events from the processing pipeline. An event is considered a duplicate if it was seen within a configurable time-range. This range could be large and extend to days. In that case the state of the deduper will grow substantially and can cause regular check-pointing to take a long time to save this state.

Incremental Check-pointing

The deduper employs a bucketing system where new events are persisted every window. The bucket manager groups events in buckets based on a criterion that can be easily overriden. For each bucket, the manager maintain two sub-buckets:

  • transient sub-bucket: events which are already persisted. This state is not check-pointed by the engine.
  • non-transient sub-bucket: events which have not been persisted yet.

At the end of every application window the bucket manager writes the non-transient sub-bucket of all the buckets and transfers the data from them to the corresponding transient sub-bucket. During recovery the buckets are lazily loaded from the persistent store which also reduces the time in reconstructing the operator's state when it is re-deployed.