Apache DataSketches library functions for Google BigQuery

User-Defined Aggregate Functions (UDAFs) and non-aggregate (scalar) functions (UDFs) for BigQuery SQL engine.

DataSketches are probabilistic data structures that can process massive amounts of data and return very accurate results with a small memory footprint. Because of this, DataSketches are particularly useful for “big data” use cases such as streaming analytics and data warehousing.

Please visit the main Apache DataSketches website for more information about the DataSketches library.

If you are interested in contributing to this project, please see our Community page to learn how to contact us.

Requirements

  • Requires Emscripten (emcc compiler)

    git clone https://github.com/emscripten-core/emsdk.git \
    && cd emsdk \
    && ./emsdk install latest \
    && ./emsdk activate latest \
    && source ./emsdk_env.sh \
    && cd ..
    

    Alternatively, Emscripten can be installed with ‘brew install emscripten’ on macOS.

  • Requires a link to datasketches-cpp in this repository

    make datasketches-cpp
    

    This target is part of the default target ‘all’ and requires wget and unzip.

  • Requires make utility

  • Requires Google Cloud CLI

    curl https://sdk.cloud.google.com | bash 
    
  • Requires npm and @dataform/cli package

    npm install -g @dataform/cli
    
  • Requires setting the following environment variables to your own values (see the example after this list):

    export JS_BUCKET=    # GCS bucket to hold compiled artifacts (must include gs://)
    export BQ_PROJECT=   # GCP project in which the SQL functions (routines) are created
    export BQ_DATASET=   # BigQuery dataset in which the SQL functions (routines) are created
    export BQ_LOCATION=  # location of BQ_DATASET
    
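For example, these variables might be set as follows (the bucket, project, and dataset names here are placeholders; substitute your own):

export JS_BUCKET=gs://my-artifacts-bucket/js   # placeholder bucket
export BQ_PROJECT=my-gcp-project               # placeholder project
export BQ_DATASET=my_dataset                   # placeholder dataset
export BQ_LOCATION=US                          # region or multi-region of the dataset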

Building, Installing, and Testing

Install All DataSketches

Run the following steps in this repo's root directory to install everything via Cloud Build:

gcloud builds submit \
  --project=$BQ_PROJECT \
  --substitutions=_BQ_LOCATION=$BQ_LOCATION,_BQ_DATASET=$BQ_DATASET,_JS_BUCKET=$JS_BUCKET \
  .
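
Once the build finishes, you can check that the routines were created, for example with the bq command-line tool that ships with the Google Cloud CLI (a quick sanity check, assuming bq is on your PATH):

bq ls --routines --project_id=$BQ_PROJECT $BQ_DATASET

This should list the installed functions as routines in $BQ_DATASET.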

Install All DataSketches (Local Build)

Run the following steps in this repo's root directory to build locally and install everything:

gcloud auth application-default login # for authentication
make          # compile C++ code and produce .js, .mjs and .wasm artifacts
make install  # upload artifacts to $JS_BUCKET and create SQLX functions in $BQ_PROJECT.$BQ_DATASET
make test     # run tests in BigQuery

The “install” target consists of “upload” and “create”, which can be used separately if desired, as shown below.
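
For example, the full sequence with the two phases separated might look like this (a sketch based on the target names described above):

gcloud auth application-default login # for authentication
make          # compile C++ code and produce .js, .mjs and .wasm artifacts
make upload   # upload artifacts to $JS_BUCKET
make create   # create SQLX functions in $BQ_PROJECT.$BQ_DATASET
make test     # run tests in BigQuery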

Install Specific DataSketches

To install a specific sketch, use targets of the form dir.target. For example, to install the Theta sketch only:

gcloud auth application-default login # for authentication
make theta          # compile C++ code and produce .js, .mjs and .wasm artifacts
make theta.install  # upload artifacts to $JS_BUCKET and create SQLX functions in $BQ_PROJECT.$BQ_DATASET

Currently there is no way to run tests for a specific sketch only; however, “make example” can be run in an individual sketch directory, as shown below.
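
A minimal sketch of that workflow, assuming the Theta sketch lives in the theta directory (as implied by the make targets above):

cd theta        # enter an individual sketch directory
make example    # build and run that sketch's example
cd ..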