Apache® DataSketches™ Spark Library

This repo is still an early-stage work in progress.

There have been multiple attempts to integrate Apache DataSketches into Apache Spark, including one built into Spark itself as of v3.5. All are useful work, but each has limitations. Some limit the types of sketches available (e.g. native Spark provides only HLL). Others limit flexibility and functionality, e.g. by forcing HLL and Theta to use a common interface, which precludes set operations that HLL cannot support, or by using global parameters to control the sizes of all sketch instances in a query. In each case, the library places undesirable constraints on developers looking to use sketches in their queries or data systems. This library aims to restore that choice to developers.

Build and Test Instructions

Building the library requires sbt, a commonly used build system for Scala projects. There are several environment variables that can be used to configure the project:

  • Java version, typically via $JAVA_HOME: Default is 11
  • $SCALA_VERSION: Default is 2.12.20
  • $SPARK_VERSION: Default is 3.5.4

To build the package, run sbt package. To run the tests, run sbt test.
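
As a rough sketch, a build from the repository root might look like the following. It pins the two version variables to the defaults listed above and leaves JAVA_HOME alone, since the JDK path is machine-specific. The check for build.sbt assumes the standard sbt layout with the build file at the repository root, and the guard around the sbt calls is only a convenience added for this example.

```shell
# Use the documented defaults unless the caller has already set a value.
export SCALA_VERSION="${SCALA_VERSION:-2.12.20}"
export SPARK_VERSION="${SPARK_VERSION:-3.5.4}"

# JAVA_HOME should point at a suitable JDK (the default is Java 11).
# It is not set here because the install path differs per machine.

# Assumption: this is run from the repository root, where build.sbt lives.
if command -v sbt >/dev/null 2>&1 && [ -f build.sbt ]; then
  sbt package   # build the package
  sbt test      # run the tests
else
  echo "sbt or build.sbt not found; run this from the repository root with sbt installed" >&2
fi
```

To build against a different Spark or Scala version, set the corresponding variable before running the commands, for example by prefixing the sbt call with SPARK_VERSION=... on the same line.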

If building for the pyspark package, please also read python/README.md.