Core C++ Sketch Library

Apache DataSketches Core C++ Library Component

This is the core C++ component of the Apache DataSketches library. It contains all the key sketching algorithms from the Java implementation and can be accessed directly by user applications.

This component is also a dependency of other library components that create adaptors for target systems, such as PostgreSQL.

Note that we have parallel core library components for Java, Python, and Go implementations of many of the same sketch algorithms.

Please visit the main Apache DataSketches website for more information.

If you are interested in making contributions to this site, please see our Community page for how to contact us.


This code requires C++11.

This library is header-only. The provided build process is only for unit tests.

Building the unit tests requires CMake 3.12.0 or higher.

Installing the latest CMake on OSX: brew install cmake.

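Building and running unit tests using CMake for OSX and Linux (the -S and -B options require CMake 3.13 or later, and passing several targets to -t requires CMake 3.15 or later):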

cmake -S . -B build/Release -DCMAKE_BUILD_TYPE=Release
cmake --build build/Release -t all test

Building and running unit tests using CMake for Windows from the command line:

cd build
cmake ..
cd ..
cmake --build build --config Release
cmake --build build --config Release --target RUN_TESTS

To install a local distribution (OSX and Linux), use the following commands. The CMAKE_INSTALL_PREFIX variable controls the destination. If it is not specified, CMake's default prefix on Unix-like systems is /usr/local (/usr/local/include, /usr/local/lib, etc.). With the commands below, the installation goes to /tmp/install/DataSketches (/tmp/install/DataSketches/include, /tmp/install/DataSketches/lib, etc.).

cmake -S . -B build/Release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/install/DataSketches
cmake --build build/Release -t install

To generate an installable package using CMake's built-in cpack packaging tool, use the following commands. The type of packaging is controlled by the CPACK_GENERATOR variable (a semicolon-separated list). CMake usually supports packaging formats such as RPM, DEB, STGZ, TGZ, TZ, and ZIP.

cmake -S . -B build/Release -DCMAKE_BUILD_TYPE=Release -DCPACK_GENERATOR="RPM;STGZ;TGZ" 
cmake --build build/Release -t package

The DataSketches project can be included in other projects' CMakeLists.txt files in one of two ways.

If DataSketches has been installed on the host (using an RPM, DEB, "make install" into /usr/local, or some other way), then CMake's find_package command can be used like this:

find_package(DataSketches 3.2 REQUIRED)
target_link_libraries(my_dependent_target PUBLIC ${DATASKETCHES_LIB})

When used with find_package, DataSketches exports several variables, including

  • DATASKETCHES_VERSION: The version number of the datasketches package that was imported.
  • DATASKETCHES_INCLUDE_DIR: The directory that should be added to access DataSketches include files. Because CMake automatically adds the interface include directories of linked targets when using target_link_libraries, there is normally no need to reference this variable directly.
  • DATASKETCHES_LIB: The name of the DataSketches target to include as a dependency. Projects pulling in DataSketches should reference this with target_link_libraries in order to set up all the correct dependencies and include paths.
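
As a rough illustration of how these variables fit together, a dependent project's CMakeLists.txt might look like the sketch below. The project name my_app and the source file main.cpp are placeholders invented for this example, not part of DataSketches; the rest uses only the find_package call and the exported variables described above.

```cmake
# Hypothetical consumer project: "my_app" and main.cpp are placeholder names.
cmake_minimum_required(VERSION 3.12)
project(my_app LANGUAGES CXX)

# The library requires C++11.
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Locate an installed DataSketches; configuration stops here if it is not found.
find_package(DataSketches 3.2 REQUIRED)
message(STATUS "Found DataSketches ${DATASKETCHES_VERSION}")

add_executable(my_app main.cpp)

# Linking against the exported target is expected to bring in the include
# paths, so DATASKETCHES_INCLUDE_DIR is not referenced here.
target_link_libraries(my_app PRIVATE ${DATASKETCHES_LIB})
```

PRIVATE is used here because an executable has no consumers of its own; a library target that exposes DataSketches types in its public headers would more likely use PUBLIC, as in the snippet above.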

If you don't have DataSketches installed locally, dependent projects can pull it directly from GitHub using CMake's ExternalProject module. The code would look something like this:

cmake_policy(SET CMP0097 NEW)
include(ExternalProject)
ExternalProject_Add(datasketches
    GIT_REPOSITORY https://github.com/apache/datasketches-cpp.git
    GIT_TAG 3.2.0
    GIT_SHALLOW true
    GIT_SUBMODULES ""
    INSTALL_DIR /tmp/datasketches-prefix
    CMAKE_ARGS -DBUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DCMAKE_INSTALL_PREFIX=/tmp/datasketches-prefix

    # Override the install command to add DESTDIR
    # This is necessary to work around an oddity in the RPM (but not other) package
    # generation, as CMake otherwise picks up the DataSketches files when building
    # an RPM for a dependent package. (RPM scans the directory for files in addition to installing
    # those files referenced in an "install" rule in the cmake file)
    INSTALL_COMMAND env DESTDIR= ${CMAKE_COMMAND} --build . --target install
)
ExternalProject_Get_Property(datasketches INSTALL_DIR)
set(datasketches_INSTALL_DIR ${INSTALL_DIR})
message("Install dir of datasketches = ${datasketches_INSTALL_DIR}")
target_include_directories(my_dependent_target 
                            PRIVATE ${datasketches_INSTALL_DIR}/include/DataSketches)
add_dependencies(my_dependent_target datasketches)