layout: doc_page

Architecture

The DataSketches Library is organized into the following repository groups:

Java

incubator-datasketches-java

This repository contains the core Java sketching classes, which some of the other repositories build on.
Its only dependencies are the DataSketches memory repository, Java itself, and TestNG for unit tests. This code is versioned and the latest release can be obtained from incubator-datasketches-java.

High-level Repositories Structure

Sketches-core Packages

| Package | Description |
|---|---|
| org.apache.datasketches | Common functions and utilities |
| org.apache.datasketches.cpc | New Unique Counting Sketch with better accuracy per size than HLL |
| org.apache.datasketches.fdt | Frequent Distinct Tuples Sketch |
| org.apache.datasketches.frequencies | Frequent Item Sketches, for both longs and generics |
| org.apache.datasketches.hash | The 128-bit MurmurHash3 and adaptors |
| org.apache.datasketches.hll | Unique counting HLL sketches for both heap and off-heap |
| org.apache.datasketches.hllmap | The (HLL) Unique Count Map Sketch |
| org.apache.datasketches.kll | New quantiles sketch with better accuracy per size than the standard quantiles sketch |
| org.apache.datasketches.quantiles | Sketches for quantiles, PMF and CDF functions, both doubles and generics and for heap and off-heap |
| org.apache.datasketches.sampling | Weighted and uniform reservoir sampling with generics |
| org.apache.datasketches.theta | Unique counting Theta Sketches for both heap and off-heap |
| org.apache.datasketches.tuple | Tuple sketches for both primitives and generics |
| org.apache.datasketches.tuple.adouble | A Tuple sketch with a Summary of a single double |
| org.apache.datasketches.tuple.aninteger | A Tuple sketch with a Summary of a single integer |
| org.apache.datasketches.tuple.strings | A Tuple sketch with a Summary of an array of Strings |
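
To give a feel for what the unique-counting packages listed above do, here is a minimal, standard-library-only toy in the spirit of the K-Minimum-Values (KMV) idea. It is an illustration under stated assumptions, not the library's algorithm or API: the class name `KmvDemo`, its methods, and the choice of hash mixer are all hypothetical and exist only for this sketch. It keeps the k smallest hash values seen and estimates the number of distinct items from how far into the hash space the k-th smallest value reaches.

```java
import java.util.TreeSet;

/** Toy K-Minimum-Values distinct-count estimator (hypothetical; not library code). */
final class KmvDemo {
    private final int k;                                     // maximum number of retained hashes
    private final TreeSet<Long> smallest = new TreeSet<>();  // the k smallest hashes seen so far

    KmvDemo(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be >= 2");
        this.k = k;
    }

    /** 64-bit mixer (the SplitMix64 finalizer), masked to a non-negative long. */
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x ^= (x >>> 31);
        return x & Long.MAX_VALUE;
    }

    /** Offers one item; duplicates hash to the same value and are ignored by the set. */
    void update(long item) {
        long h = hash(item);
        if (smallest.size() < k) {
            smallest.add(h);
        } else if (h < smallest.last() && smallest.add(h)) {
            smallest.pollLast();  // evict the largest so exactly k hashes remain
        }
    }

    /** Exact count while under capacity, otherwise (k - 1) / theta. */
    double estimate() {
        if (smallest.size() < k) return smallest.size();
        // theta: the k-th smallest hash expressed as a fraction of the hash space
        double theta = smallest.last() / (double) Long.MAX_VALUE;
        return (k - 1) / theta;
    }

    public static void main(String[] args) {
        KmvDemo sketch = new KmvDemo(1024);
        // Feed the same 100,000 distinct values three times; repeats must not inflate the count.
        for (int pass = 0; pass < 3; pass++) {
            for (long i = 0; i < 100_000; i++) sketch.update(i);
        }
        System.out.printf("estimate = %.0f (true distinct count = 100000)%n", sketch.estimate());
    }
}
```

The toy retains at most k values regardless of stream length, which is the size-versus-accuracy trade-off the package descriptions refer to. The production sketches in the packages above use different, more refined data structures and estimators.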

incubator-datasketches-memory

This code is versioned and the latest release can be obtained from incubator-datasketches-memory.

Memory Packages

| Package | Description |
|---|---|
| org.apache.datasketches.memory | Low level, high-performance Memory data-structure management primarily for off-heap. |

incubator-datasketches-hive

This repository contains Hive UDFs and UDAFs for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from incubator-datasketches-hive.

Sketches-hive Packages

| Package | Description |
|---|---|
| org.apache.datasketches.hive.cpc | Hive UDFs and UDAFs for CPC sketches |
| org.apache.datasketches.hive.frequencies | Hive UDFs and UDAFs for Frequent Items sketches |
| org.apache.datasketches.hive.hll | Hive UDFs and UDAFs for HLL sketches |
| org.apache.datasketches.hive.kll | Hive UDFs and UDAFs for KLL sketches |
| org.apache.datasketches.hive.quantiles | Hive UDFs and UDAFs for Quantiles sketches |
| org.apache.datasketches.hive.theta | Hive UDFs and UDAFs for Theta sketches |
| org.apache.datasketches.hive.tuple | Hive UDFs and UDAFs for Tuple sketches |

incubator-datasketches-pig

This repository contains Pig User Defined Functions (UDFs) for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from incubator-datasketches-pig.

Sketches-pig Packages

| Package | Description |
|---|---|
| org.apache.datasketches.pig.cpc | Pig UDFs for CPC sketches |
| org.apache.datasketches.pig.frequencies | Pig UDFs for Frequent Items sketches |
| org.apache.datasketches.pig.hash | Pig UDFs for MurmurHash3 |
| org.apache.datasketches.pig.hll | Pig UDFs for HLL sketches |
| org.apache.datasketches.pig.kll | Pig UDFs for KLL sketches |
| org.apache.datasketches.pig.quantiles | Pig UDFs for Quantiles sketches |
| org.apache.datasketches.pig.sampling | Pig UDFs for Sampling sketches |
| org.apache.datasketches.pig.theta | Pig UDFs for Theta sketches |
| org.apache.datasketches.pig.tuple | Pig UDFs for Tuple sketches |

incubator-datasketches-characterization

This relatively new repository holds the code that we use to characterize the accuracy and speed performance of the sketches in the library. It is constantly being updated. Examples of the job command files used for various tests can be found in the src/main/resources directory. Some of these tests can run for hours, depending on their configuration.

Characterization Packages

| Package | Description |
|---|---|
| org.apache.datasketches.characterization | Common functions and utilities |
| org.apache.datasketches.characterization.hash | Hash function performance |
| org.apache.datasketches.characterization.memory | Memory performance |
| org.apache.datasketches.characterization.quantiles | Quantiles performance |
| org.apache.datasketches.characterization.uniquecount | Performance of Theta and HLL sketches |

incubator-datasketches-vector

This is a new repository dedicated to sketches for vector and matrix operations. It is still somewhat experimental.

C++ and Python

incubator-datasketches-cpp

This repository holds the evolving C++ implementations of the same sketches that are available in Java. These implementations are binary compatible with their Java counterparts. In other words, a sketch created and stored in C++ can be opened and read in Java and vice versa.
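
As a rough illustration of what binary compatibility across languages requires, the following standard-library-only sketch writes and reads a byte image with a fixed field layout and an explicit byte order. Everything about it is an assumption made for illustration: the class name `PortableImageDemo`, the version tag, and the layout are hypothetical and are not the DataSketches serialization format. The point is only that a reader in any language can decode the image if both sides agree on the layout and byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical fixed-layout byte image; NOT the DataSketches serialization format. */
final class PortableImageDemo {
    static final byte FORMAT_VERSION = 1;  // made-up version tag for this illustration

    /** Layout: 1-byte version, 4-byte count, then count 8-byte values, all little-endian. */
    static byte[] serialize(long[] retained) {
        ByteBuffer buf = ByteBuffer
            .allocate(1 + Integer.BYTES + retained.length * Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(FORMAT_VERSION).putInt(retained.length);
        for (long v : retained) buf.putLong(v);
        return buf.array();
    }

    /** Reads the same layout back, rejecting images with an unknown version tag. */
    static long[] deserialize(byte[] image) {
        ByteBuffer buf = ByteBuffer.wrap(image).order(ByteOrder.LITTLE_ENDIAN);
        byte version = buf.get();
        if (version != FORMAT_VERSION) {
            throw new IllegalArgumentException("unknown version " + version);
        }
        long[] retained = new long[buf.getInt()];
        for (int i = 0; i < retained.length; i++) retained[i] = buf.getLong();
        return retained;
    }

    public static void main(String[] args) {
        long[] values = {42L, -7L, 1L << 40};
        byte[] image = serialize(values);
        System.out.println("round trip ok: " + Arrays.equals(values, deserialize(image)));
    }
}
```

Fixing the byte order in the format, rather than relying on whatever the host platform uses, is what lets an image written on one platform or in one language be read on another.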

This repository also contains our Python adaptors, which wrap the C++ implementations and make their high performance available from Python.

incubator-datasketches-postgres

This repository provides the Postgres-specific adaptors that wrap the C++ implementations, making them available to Postgres database users.

Web Site

incubator-datasketches-website (was DataSketches.github.io)

This is the source of the DataSketches web site. It is continually updated with new material and to stay current with the GitHub master. This site is not versioned.

Command-Line Tool

These repositories supply a command-line tool that gives access to the following sketches:

  • Frequent Items
  • HLL
  • Quantiles
  • Reservoir Sampling
  • Theta Sketches
  • VarOpt Sampling

This tool can be installed from Homebrew.

sketches-cmd

homebrew-sketches

homebrew-sketches-cmd

Deprecated sites

The code in these sites is no longer maintained and will eventually be removed.

sketches-android

This is a new repository dedicated to sketches designed to be run in a mobile client, such as a cell phone. It is still in development and should be considered experimental.

experimental

This repository is an experimental staging area for code that will eventually end up in another repository. This code is not versioned.

sketches-misc

Demos, command-line access, characterization testing and other code not related to production deployment.

This code is offered “as is” and primarily as a reference so that users can understand how some of the performance characterization plots were obtained. This code has few unit tests, if any, and was never intended for production use. Nonetheless, some folks have found it useful. If you find it useful, go for it. This code is not versioned.

Sketches-misc Packages

| Package | Description |
|---|---|
| org.apache.datasketches | Utility functions used by the sketches-misc packages |
| org.apache.datasketches.cmd | Support for Command Line functions (being redesigned) |
| org.apache.datasketches.demo | Simple demo for brute-force vs Theta and HLL sketches (will be superseded by Command Line functions) |
| org.apache.datasketches.quantiles | Utility for computing & printing space table for Quantiles Sketches (only in the test branch) |
| org.apache.datasketches.sampling | Benchmarks and Entropy testing for sampling sketches |

characterization-cpp

This is the C++ counterpart of the Java characterization repository, with the same objective.

experimental-cpp

This repository is an experimental staging area for C++ code that will eventually end up in another repository.