layout: doc_page

Architecture

The DataSketches Library is organized into the following repositories:

sketches-core

This repository has two modules released with separate jars: the core sketching classes, and the Memory package.
These two modules are used by all the other repositories. Apart from Java itself, the only external dependency is TestNG, which is used for unit tests. This code is versioned and the latest release can be obtained from Maven Central.
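As a rough illustration of how the two separately released jars might be declared in a Maven project, the fragment below lists each one as its own dependency. The `com.yahoo.datasketches` group ID and the `sketches-core` and `memory` artifact IDs are assumptions that should be checked against Maven Central, and the two version properties are hypothetical placeholders rather than real release numbers.

```xml
<!-- Hypothetical placeholders: replace with the latest releases listed on Maven Central. -->
<properties>
  <sketches-core.version>REPLACE_WITH_LATEST_RELEASE</sketches-core.version>
  <memory.version>REPLACE_WITH_LATEST_RELEASE</memory.version>
</properties>

<dependencies>
  <!-- Core sketching classes (assumed coordinates). -->
  <dependency>
    <groupId>com.yahoo.datasketches</groupId>
    <artifactId>sketches-core</artifactId>
    <version>${sketches-core.version}</version>
  </dependency>
  <!-- Memory package, released as a separate jar (assumed coordinates). -->
  <dependency>
    <groupId>com.yahoo.datasketches</groupId>
    <artifactId>memory</artifactId>
    <version>${memory.version}</version>
  </dependency>
</dependencies>
```

If the published sketches-core POM already declares the Memory jar as a dependency, Maven will resolve it transitively. In that case the second entry is only needed by projects that use `com.yahoo.memory` directly.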

High-level Package Structure

| Sketches-core Module | Description |
|---|---|
| com.yahoo.sketches | Common functions and utilities |
| com.yahoo.sketches.frequencies | Frequent Item Sketches, for both longs and generics |
| com.yahoo.sketches.hash | The 128-bit MurmurHash3 and adaptors |
| com.yahoo.sketches.hll | HLL sketches, and HLL Map sketches |
| com.yahoo.sketches.quantiles | Sketches for quantiles, PMF and CDF functions, both doubles and generics |
| com.yahoo.sketches.sampling | Reservoir sampling with generics |
| com.yahoo.sketches.theta | Theta sketches |
| com.yahoo.sketches.tuple | Tuple sketches for both primitives and generics |

| Memory Module | Description |
|---|---|
| com.yahoo.memory | Low-level Memory data-structure management, primarily for off-heap use |

sketches-pig

This repository contains Pig User Defined Functions (UDFs) for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Maven Central.
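A minimal sketch of what bringing in the Pig UDFs through Maven could look like is shown here. The group and artifact IDs are assumptions to be confirmed on Maven Central, and the version string is a hypothetical placeholder.

```xml
<!-- Pig UDFs for the sketches (assumed coordinates; placeholder version). -->
<dependency>
  <groupId>com.yahoo.datasketches</groupId>
  <artifactId>sketches-pig</artifactId>
  <version>REPLACE_WITH_LATEST_RELEASE</version>
</dependency>
```

Which of the sketches-core, Hadoop, and Pig dependencies Maven then resolves automatically depends on the scopes declared in the published POM. Running `mvn dependency:tree` in the consuming project shows what was actually pulled in.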

High-level Structure

| Package | Description |
|---|---|
| com.yahoo.sketches.pig.frequencies | Pig UDFs for Frequent Items sketches |
| com.yahoo.sketches.pig.hash | Pig UDF for MurmurHash3 |
| com.yahoo.sketches.pig.quantiles | Pig UDFs for Quantiles sketches |
| com.yahoo.sketches.pig.theta | Pig UDFs for Theta sketches |
| com.yahoo.sketches.pig.tuple | Pig UDFs for Tuple sketches |

sketches-hive

This repository contains Hive UDFs and UDAFs for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Maven Central.

High-level Structure

| Package | Description |
|---|---|
| com.yahoo.sketches.hive.frequencies | Hive UDF and UDAFs for Frequent Items sketches |
| com.yahoo.sketches.hive.quantiles | Hive UDF and UDAFs for Quantiles sketches |
| com.yahoo.sketches.hive.theta | Hive UDF and UDAFs for Theta sketches |
| com.yahoo.sketches.hive.tuple | Hive UDF and UDAFs for Tuple sketches |

sketches-misc

This repository contains demos, command-line access, characterization testing, and other code not related to production deployment.

This code is offered “as is” and primarily as a reference so that users can understand how some of the performance characterization plots were obtained. This code has few unit tests, if any, and was never intended for production use. Nonetheless, some folks have found it useful. If you find it useful, go for it. This code is versioned and the latest release can be obtained from Maven Central.

High-level Structure

| Package | Description |
|---|---|
| com.yahoo.sketches.benchmark | Benchmarking code for the HLL sketches |
| com.yahoo.sketches.cmd | Support for command-line functions |
| com.yahoo.sketches.demo | Simple demo for brute-force vs Theta & HLL sketches |
| com.yahoo.sketches.hll | Error characterization and command-line functions for experimenting with CountUniqueMap |
| com.yahoo.sketches.performance | Speed and error characterization of Theta and HLL sketches |
| com.yahoo.sketches.sampling | Benchmarks and entropy testing |

experimental

This repository is an experimental staging area for code that will eventually end up in another repository. This code is not versioned and not registered with Maven Central.

DataSketches.github.io

This is the DataSketches.github.io web site. It is continually updated with new material and kept current with the latest releases of the registered repositories. This site is not versioned and is not registered with Maven Central.