layout: doc_page

Designed for Large-scale Computing Systems

Minimal Dependencies

  • Can be integrated into virtually any Java-based system environment.

  • The core library (including Memory) has no runtime dependencies outside the JVM.

Maven Deployable

  • Registered with The Central Repository

Speed

  • These single-pass, “one-touch” algorithms are fast enough to enable real-time processing.

  • Coupled with the compact binary representations, this eliminates the need for costly serialization and deserialization in many cases.

  • The sketch data structures are “additive” and embarrassingly parallelizable. The Theta sketches can be merged without losing accuracy.
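
As an illustration of mergeability, here is a minimal sketch of combining two Theta sketches with a Union set operation. It assumes the org.apache.datasketches.theta package of a recent Apache DataSketches release (older releases used com.yahoo.sketches.theta and name the merge call update(...) rather than union(...)); the stream sizes and class name are illustrative only.

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaMergeExample {
  public static void main(String[] args) {
    // Two sketches built independently, e.g. on different partitions of the data.
    UpdateSketch partA = UpdateSketch.builder().build();
    UpdateSketch partB = UpdateSketch.builder().build();
    for (long i = 0; i < 1_000_000L; i++) { partA.update(i); }
    for (long i = 500_000L; i < 1_500_000L; i++) { partB.update(i); }

    // Merging operates only on the hash values already retained in the sketches;
    // the raw input is never touched again.
    Union union = SetOperation.builder().buildUnion();
    union.union(partA);   // named update(...) in older releases
    union.union(partB);
    CompactSketch merged = union.getResult();

    // The true distinct count of the combined streams is 1,500,000.
    System.out.println("estimated distinct count: " + merged.getEstimate());
  }
}
```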

Integration for Hive, Pig, Druid and Spark

  • Hadoop / Hive Adaptors.

  • Hadoop / Pig Adaptors.

  • Druid Adaptors.

    • For documentation, see druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html
  • Spark Examples

Specific Theta Sketch Features for Large Data

  • Hash Seed Handling. Additional protection for managing hash seeds, which is particularly important when processing sensitive user identifiers; see the configuration sketch following this list.

  • Sampling. Built-in up-front sampling for cases where additional control is required to limit overall memory consumption when dealing with millions of sketches.

  • Off-Heap Memory Package. Large query systems often need to manage memory outside the JVM heap in order to better control garbage-collection latencies. The sketches in this package are designed to operate either on-heap or off-heap; see the example following this list.

  • Built-in Upper-Bound and Lower-Bound Estimators. You are never in the dark about how good an estimate the sketch is providing: every sketch can report upper and lower bounds on its estimate at a given confidence level, as illustrated in the example following this list.

  • User-configurable trade-offs between accuracy and storage space, as well as other performance-tuning options.

  • Additional protection of sensitive data: the user-configured hash seed is not stored with the serialized data.

  • Small Footprint Per Sketch. The operating and storage footprint, for both row- and column-oriented storage, is minimized by compact binary representations, which are much smaller than the raw input stream and have a well-defined upper bound on size.
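
The Theta sketch builder exposes the seed, sampling, and sizing options described above. The following is a minimal configuration sketch, assuming the org.apache.datasketches.theta package of a recent Apache DataSketches release (older releases used com.yahoo.sketches.theta); the seed value, sampling probability, and nominal-entries setting are illustrative, not recommendations.

```java
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaConfigExample {
  public static void main(String[] args) {
    // Application-managed seed; only a short hash of it travels with serialized sketches.
    final long seed = 9001L;

    UpdateSketch sketch = UpdateSketch.builder()
        .setSeed(seed)               // hash seed protection for sensitive identifiers
        .setP(0.5f)                  // up-front sampling probability to cap memory use
        .setNominalEntries(1 << 14)  // accuracy vs. storage trade-off (the default is 1 << 12)
        .build();

    sketch.update("user-42");
    System.out.println("estimate: " + sketch.getEstimate());
  }
}
```

A sketch serialized with a non-default seed must be read back with the same seed; the serialized image carries only a hash of the seed for validation, never the seed itself.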
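
For the off-heap case, a sketch can be built directly inside a caller-provided WritableMemory buffer from the Memory package. The following is a minimal sketch assuming the org.apache.datasketches packages of a recent release (older releases used com.yahoo.memory and com.yahoo.sketches.theta); for brevity it uses an on-heap buffer, but the same pattern applies when the WritableMemory is backed by direct (off-heap) memory, whose allocation API varies by release.

```java
import org.apache.datasketches.memory.WritableMemory;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaMemoryBackedExample {
  public static void main(String[] args) {
    // A buffer comfortably larger than the maximum size of a default sketch
    // (4096 nominal entries need roughly 64 KB); sized generously for clarity.
    WritableMemory wmem = WritableMemory.allocate(1 << 20);

    // The sketch's hash table lives in the supplied buffer rather than in
    // ordinary heap objects, so the identical code path works for off-heap buffers.
    UpdateSketch sketch = UpdateSketch.builder().build(wmem);
    for (long i = 0; i < 100_000L; i++) { sketch.update(i); }

    System.out.println("estimate: " + sketch.getEstimate());
  }
}
```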
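
Finally, a minimal sketch of the built-in error bounds and the compact serialized form, under the same package assumptions as above and with the default seed; the stream size is illustrative.

```java
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaBoundsExample {
  public static void main(String[] args) {
    UpdateSketch updatable = UpdateSketch.builder().build();
    for (long i = 0; i < 10_000_000L; i++) { updatable.update(i); }

    // Bounds at roughly 95% confidence (2 standard deviations on the estimator).
    double est = updatable.getEstimate();
    double lb = updatable.getLowerBound(2);
    double ub = updatable.getUpperBound(2);
    System.out.printf("estimate=%.0f, bounds=[%.0f, %.0f]%n", est, lb, ub);

    // The compact, read-only image has a well-defined upper bound on its size
    // and is typically far smaller than the raw input stream.
    CompactSketch compact = updatable.compact();
    byte[] bytes = compact.toByteArray();
    System.out.println("serialized bytes: " + bytes.length);

    // Wrapping reads the image where it sits, avoiding a costly deserialization step.
    Sketch restored = Sketch.wrap(Memory.wrap(bytes));
    System.out.println("restored estimate: " + restored.getEstimate());
  }
}
```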