A Comparison of Python Sketching Libraries

These notebooks aim to provide a fair comparison between the Python support for Apache Software Foundation (ASF) DataSketches and the datasketch library.
The notebooks are not an attempt to fully characterize either library's implementation, rather to highlight differences that manifest as a result of design choices for each library.

Summary

As of the time of writing, November 2023, the versions under comparison are:

Name: datasketches
Version: 4.1.0

Name: datasketch
Version: 1.6.4

1. Distinct Counting

1a. HyperLogLog

Each library implements various algorithms that fall under the HyperLogLog umbrella. They can be summarized as follows

ASFdatasketch
HyperLogLogHyperLogLogHyperLogLog++
Hash functions64 bit MurmurHashSHA1 32 bit. Others are possible.SHA1 64 bit. Others are possible.
Bucket sizes (bits)4, 6, 848
EstimatorExact mode for small cardinalities Historic Inverse Probability (HIP) for a single sketch; Harmonic mean after merging multiple sketchesNo exact mode. Harmonic mean for all sketchesHarmonic mean for all sketches (single or merged) with bias corrections at small and large cardinalities

Since datasketch.hyperloglog.HyperLogLog uses only $32$ bit hash functions, we do not regard this as an appropriate solution for industry-applications because the error estimates will vary wildly when the input is large enough. To ensure the fairest comparison between the sketches, we will only compare the ASF implementations with datasketch.hyperloglog.HyperLogLogPlusPlus with the MurmurHash function over $64$ bits using mmh3.hash64(x, signed=False)[0].