library/src/main/java/com/datatorrent/lib/dedup/CheckpointingScheme.md - apex-malhar - Git at Google

 Introduction
 ============
 De-duplicator operator (deduper) eliminates duplicate events from the processing pipeline. An event is considered a duplicate if it was seen within a configurable time-range.
 This range could be large and extend to days. In that case the state of the deduper will grow substantially and can cause regular check-pointing to take a long time to save this state.

 Incremental Check-pointing
 ==========================
 The deduper employs a bucketing system where new events are persisted every window. The bucket manager groups events in buckets based on a criterion that can be easily overriden.
 For each bucket, the manager maintain two sub-buckets:
   * transient sub-bucket: events which are already persisted. This state is not check-pointed by the engine.
   * non-transient sub-bucket: events which have not been persisted yet.

 At the end of every application window the bucket manager writes the non-transient sub-bucket of all the buckets and transfers the data from them to the corresponding transient sub-bucket.
 During recovery the buckets are lazily loaded from the persistent store which also reduces the time in reconstructing the operator's state when it is re-deployed.
	Introduction
	============
	De-duplicator operator (deduper) eliminates duplicate events from the processing pipeline. An event is considered a duplicate if it was seen within a configurable time-range.
	This range could be large and extend to days. In that case the state of the deduper will grow substantially and can cause regular check-pointing to take a long time to save this state.

	Incremental Check-pointing
	==========================
	The deduper employs a bucketing system where new events are persisted every window. The bucket manager groups events in buckets based on a criterion that can be easily overriden.
	For each bucket, the manager maintain two sub-buckets:
	* transient sub-bucket: events which are already persisted. This state is not check-pointed by the engine.
	* non-transient sub-bucket: events which have not been persisted yet.

	At the end of every application window the bucket manager writes the non-transient sub-bucket of all the buckets and transfers the data from them to the corresponding transient sub-bucket.
	During recovery the buckets are lazily loaded from the persistent store which also reduces the time in reconstructing the operator's state when it is re-deployed.