Apache® DataSketches™ Spark Library

This repo is still an early-stage work in progress.

There have been multiple attempts to integrate Apache DataSketches into Apache Spark, including one built into Spark itself as of v3.5. All are useful work, but each has limitations. Some limit the types of sketches available (e.g. native Spark provides only HLL). Others limit flexibility and functionality, e.g. by forcing HLL and Theta to use a common interface, which precludes set operations that HLL cannot support, or by using global parameters to control the sizes of all sketch instances in a query. In each case, the other libraries place undesirable constraints on developers looking to use sketches in their queries or data systems. This library aims to restore that choice to developers.