Sketches are different from traditional sampling techniques in that sketches process all the elements of a stream, touching each element only once, and have some form of randomization that forms the basis of their stochastic nature. This “one-touch” property makes sketches ideally suited for real-time data processing.
As an example, the first stage of a count-distinct sketching process is a transformation that gives the input data stream the property of white noise, or equivalently, a uniform distribution of values. This is commonly achieved by coordinated hashing of the input unique keys and then normalizing the result to be a uniform random number between zero and one.
The second stage of the sketch is a data structure that follows a set of rules for retaining a small set of the hash values it receives from the transform stage. Sketches also differ from simple sampling schemes in that the size of the sketch often has a configurable, fixed upper bound, which enables straightforward memory management.
The final element of the sketch process is a set of estimator algorithms that upon a request examine the sketch data structure and return a result value. This result value will be approximate but will have well established and mathematically proven error distribution bounds.
Sketches are typically
With this background, let's examine some of the Key Features of the DataSketches library.