
This sample application shows how to use the Dedup operator to de-duplicate a stream of incoming data. The application consists of the following operators:

  1. Random data generator (RandomGenerator) which emits POJO tuples as records
  2. Dedup operator (Deduper) which accepts the POJO tuples and classifies each one as unique, duplicate, or expired
  3. Console operator (ConsoleUnique) for unique tuples
  4. Console operator (ConsoleDuplicate) for duplicate tuples
  5. Console operator (ConsoleExpired) for expired tuples
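
To make the flow between these operators concrete, here is a minimal plain-Java sketch of the same pipeline shape. It does not use the Apex API: the class, the `Event` POJO, and the `generate` and `dedup` methods are all hypothetical stand-ins, and the expired output is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Plain-Java model of the pipeline; every name here is hypothetical, not an Apex API.
class DedupPipelineSketch {

    /** Hypothetical POJO tuple with a single key field. */
    static final class Event {
        final int id;
        Event(int id) { this.id = id; }
    }

    /** Stand-ins for the console operators on the unique and duplicate outputs. */
    static final class Sinks {
        final List<Event> unique = new ArrayList<>();
        final List<Event> duplicate = new ArrayList<>();
    }

    /** Plays the role of RandomGenerator: emits `count` events with ids in [0, idRange). */
    static List<Event> generate(int count, int idRange, long seed) {
        Random random = new Random(seed);
        List<Event> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(new Event(random.nextInt(idRange)));
        }
        return out;
    }

    /** Plays the role of Deduper: the first event per id is unique, later ones are duplicates. */
    static Sinks dedup(List<Event> input) {
        Sinks sinks = new Sinks();
        Set<Integer> seen = new HashSet<>();
        for (Event e : input) {
            if (seen.add(e.id)) {
                sinks.unique.add(e);
            } else {
                sinks.duplicate.add(e);
            }
        }
        return sinks;
    }

    public static void main(String[] args) {
        Sinks sinks = dedup(generate(20, 5, 42L));
        System.out.println("unique=" + sinks.unique.size()
                + " duplicate=" + sinks.duplicate.size());
    }
}
```

In the real application each stage is a separate operator connected by streams; the sketch collapses them into method calls only to show which tuples end up on which output.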

The application is configured through the following properties:

  1. dt.application.DedupExample.operator.RandomGenerator.prop.tuplesPerWindow - A limit on the number of tuples generated by the RandomGenerator operator.
  2. dt.application.DedupExample.operator.Deduper.prop.keyExpression - The pseudo-Java expression used to derive the key fields from the incoming POJO.
  3. dt.application.DedupExample.operator.Deduper.prop.timeExpression - The pseudo-Java expression used to derive the time field from the incoming POJO. If timeExpression is not specified, the system time is used to compute the expiration of tuples.
  4. dt.application.DedupExample.operator.Deduper.prop.expireBefore - The expiry period for incoming tuples, in seconds. A key is expired from the system once it is more than expireBefore seconds old.
  5. dt.application.DedupExample.operator.Deduper.prop.bucketSpan - The span of a single expiry bucket. When the expiry time elapses, a bucket is discarded from the system as a whole, so bucketSpan should be chosen with the largest unit that can be discarded at once in mind. For example, if expireBefore is set to 1 hour and new data arrives every minute, it makes sense to set bucketSpan to 1 minute or 5 minutes.

Example values for these parameters have been specified in src/main/resources/META-INF/properties.xml.
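
The interaction between expireBefore and bucketSpan can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the Dedup operator's actual implementation: the class and method names are hypothetical, both settings are taken to be in seconds, and a key is looked up across all live buckets.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of time-bucketed de-duplication; hypothetical names throughout.
class BucketedDedupSketch {
    enum Decision { UNIQUE, DUPLICATE, EXPIRED }

    private final long expireBeforeSecs;
    private final long bucketSpanSecs; // assumed to be in seconds for this sketch
    // bucket index -> keys first seen in that bucket
    private final Map<Long, Set<String>> buckets = new HashMap<>();
    private long latestTimeSecs = Long.MIN_VALUE;

    BucketedDedupSketch(long expireBeforeSecs, long bucketSpanSecs) {
        this.expireBeforeSecs = expireBeforeSecs;
        this.bucketSpanSecs = bucketSpanSecs;
    }

    Decision process(String key, long eventTimeSecs) {
        // The newest time seen so far defines where the expiry point lies.
        latestTimeSecs = Math.max(latestTimeSecs, eventTimeSecs);
        long expiryPoint = latestTimeSecs - expireBeforeSecs;
        long oldestLiveBucket = Math.floorDiv(expiryPoint, bucketSpanSecs);

        // Whole buckets are discarded at once, never individual keys.
        buckets.keySet().removeIf(bucket -> bucket < oldestLiveBucket);

        long bucket = Math.floorDiv(eventTimeSecs, bucketSpanSecs);
        if (bucket < oldestLiveBucket) {
            return Decision.EXPIRED;
        }
        // Simplification: a key counts as a duplicate if any live bucket holds it.
        for (Set<String> keys : buckets.values()) {
            if (keys.contains(key)) {
                return Decision.DUPLICATE;
            }
        }
        buckets.computeIfAbsent(bucket, b -> new HashSet<>()).add(key);
        return Decision.UNIQUE;
    }

    public static void main(String[] args) {
        // expireBefore = 1 hour, bucketSpan = 1 minute, as in the example above.
        BucketedDedupSketch dedup = new BucketedDedupSketch(3600, 60);
        System.out.println(dedup.process("a", 0));    // first sighting of "a"
        System.out.println(dedup.process("a", 30));   // same key, same bucket
        System.out.println(dedup.process("b", 4000)); // advances time past the first bucket
        System.out.println(dedup.process("a", 100));  // falls in an already discarded bucket
    }
}
```

Because expiry works on whole buckets, a smaller bucketSpan tracks the expireBefore boundary more closely at the cost of more buckets to manage, while a larger one keeps some keys slightly longer than strictly required.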