blob: 6ae3578dcb7ffb317461377b9b8e2649a9b7e074 [file] [log] [blame] [view]
---
layout: doc_page
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Sketches by [Component Repository](https://github.com/apache?utf8=%E2%9C%93&q=datasketches)
The DataSketches Library is organized into the following repository groups:
## Java
### datasketches-java
This repository has the core-java sketching classes, which are leveraged by some of the other repositories.
This repository has no external dependencies outside of the DataSketches/memory repository, Java and TestNG for unit tests.
This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).
<b>High-level Repositories Structure</b>
Sketches-core Packages. | Package Description
-------------------------------|---------------------
org.apache.datasketches | Common functions and utilities
org.apache.datasketches.cpc | New Unique Counting Sketch with better accuracy per size than HLL
org.apache.datasketches.fdt | Frequent Distinct Tuples Sketch.
org.apache.datasketches.frequencies | Frequent Item Sketches, for both longs and generics
org.apache.datasketches.hash | The 128-bit MurmurHash3 and adaptors
org.apache.datasketches.hll | Unique counting HLL sketches for both heap and off-heap.
org.apache.datasketches.hllmap | The (HLL) Unique Count Map Sketch
org.apache.datasketches.kll | Quantiles sketch with better accuracy per size than the standard quantiles sketch. Includes PMF, CDF functions, for floats, doubles. On-heap & off-heap.
org.apache.datasketches.quantiles | Standard Quantiles sketch, plus PMF and CDF functions, for doubles and generics. On-heap & off-heap.
org.apache.datasketches.req | Relative Error Quantiles (REQ) sketch, plus PMF and CDF functions for floats, on-heap. Extremely high accuracy for very high ranks (e.g., 99.999%ile), or very low ranks (e.g., .00001%ile.
org.apache.datasketches.sampling | Weighted and uniform reservoir sampling with generics
org.apache.datasketches.theta | Unique counting Theta Sketches for both on-heap & off-heap
org.apache.datasketches.tuple | Tuple sketches for both primitives and generics
org.apache.datasketches.tuple.adouble | A Tuple sketch with a Summary of a single double
org.apache.datasketches.tuple.aninteger | A Tuple sketch with a Summary of a single integer
org.apache.datasketches.tuple.Strings | A Tuple sketch with a Summary of an array of Strings
### datasketches-memory
This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).
Memory Packages | Package Description
-------------------------------|---------------------
org.apache.datasketches.memory | Low level, high-performance Memory data-structure management primarily for off-heap.
### datasketches-hive
This repository contains Hive UDFs and UDAFs for use within Hadoop grid enviornments.
This code has dependencies on sketches-core as well as Hadoop and Hive.
Users of this code are advised to use Maven to bring in all the required dependencies.
This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).
Sketches-hive Packages | Package Description
-------------------------------------|---------------------
org.apache.datasketches.hive.cpc | Hive UDF and UDAFs for CPC sketches
org.apache.datasketches.hive.frequencies | Hive UDF and UDAFs for Frequent Items sketches
org.apache.datasketches.hive.hll | Hive UDF and UDAFs for HLL sketches
org.apache.datasketches.hive.kll | Hive UDF and UDAFs for KLL sketches
org.apache.datasketches.hive.quantiles | Hive UDF and UDAFs for Quantiles sketches
org.apache.datasketches.hive.theta | Hive UDF and UDAFs for Theta sketches
org.apache.datasketches.hive.tuple | Hive UDF and UDAFs for Tuple sketches
### datasketches-pig
This repository contains Pig User Defined Functions (UDF) for use within Hadoop grid environments.
This code has dependencies on sketches-core as well as Hadoop and Pig.
Users of this code are advised to use Maven to bring in all the required dependencies.
This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).
Sketches-pig Packages | Package Description
-----------------------------------|---------------------
org.apache.datasketches.pig.cpc | Pig UDFs for CPC sketches
org.apache.datasketches.pig.frequencies | Pig UDFs for Frequent Items sketches
org.apache.datasketches.pig.hash | Pig UDFs for MurmerHash3
org.apache.datasketches.pig.hll | Pig UDFs for HLL sketches
org.apache.datasketches.pig.kll | Pig UDFs for KLL sketches
org.apache.datasketches.pig.quantiles | Pig UDFs for Quantiles sketches
org.apache.datasketches.pig.sampling. | Pig UDFs for Sampling sketches
org.apache.datasketches.pig.theta | Pig UDFs for Theta sketches
org.apache.datasketches.pig.tuple | Pig UDFs for Tuple sketches
### datasketches-characterization
This relatively new repository is for Java and C++ code that we use to characterize the accuracy and speed performance of the sketches in
the library and is constantly being updated. Examples of the job command files used for various tests can be found in the src/main/resources directory.
Some of these tests can run for hours depending on its configuration. This component is not formally released and code must be obtained from
the [GitHub site](https://github.com/apache/datasketches-characterization).
Characterization Packages | Package Description
------------------------------------------------|---------------------
org.apache.datasketches.characterization | Common functions and utilities
org.apache.datasketches.characterization.concurrent | Concurrent Theta Sketch
org.apache.datasketches.characterization.cpc | Compressed Probabilistic Counting Sketch
org.apache.datasketches.characterization.fdt | Frequent Distinct Tuples Sketch
org.apache.datasketches.characterization.frequencies | Frequent Items Sketches
org.apache.datasketches.characterization.hash | Hash function performance
org.apache.datasketches.characterization.hll | HyperLogLog Sketcch
org.apache.datasketches.characterization.memory | Memory performance
org.apache.datasketches.characterization.quantiles | Quantiles performance
org.apache.datasketches.characterization.theta | Theta Sketch
org.apache.datasketches.characterization.uniquecount | Base Profiles for Unique Counting Sketches
### datasketches-server
This is a new repository for our experimental docker/container server that enables easy access to the core sketches in the library via HTTP.
This component is not formally released and code must be obtained from
the [GitHub site](https://github.com/apache/datasketches-server).
#### C++ Characterizations
* CPC
* Frequent Items
* HLL
* KLL
* Theta
### datasketches-vector
This component implements the [Frequent Directions Algorithm](/docs/Community/Research.html) [GLP16]. It is still experimental in that the theoretical work has not yet supplied a suitable measure of error for production work. It can be used as is, but it will not go through a formal Apache Release until we can find a way to provide better error properties. It has a dependence on the Memory component.
This component is not formally released and code must be obtained from
the [GitHub site](https://github.com/apache/datasketches-vector).
## C++ and Python
### datasketches-cpp
This is the evolving C++ implementations of the same sketches that are available in Java.
These implementations are *binary compatible* with their counterparts in Java.
In other words, a sketch created and stored in C++ can be opened and read in Java and visa-versa.
This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).
This site also has our [Python adaptors](https://github.com/apache/datasketches-cpp/tree/master/python) that basically wrap the C++ implementations, making the high performance C++ implementations available from Python.
### datasketches-postgresql
This site provides the postgres-specific adaptors that wrap the C++ implementations making
them available to the PostgreSQL database users. PostgreSQL users should download the PostgreSQL extension from [pgxn.org](https://pgxn.org/dist/datasketches/). For examples refer to the README on the component site. This code is versioned and the latest release can be obtained from
[Downloads](https://datasketches.apache.org/docs/Community/Downloads.html).