These notebooks aim to provide a fair comparison between the Python support for Apache Software Foundation (ASF) DataSketches and the datasketch library.
The notebooks are not an attempt to fully characterize either library's implementation, rather to highlight differences that manifest as a result of design choices for each library.
As of the time of writing, November 2023, the versions under comparison are:
Name: datasketches Version: 4.1.0 Name: datasketch Version: 1.6.4
Each library implements various algorithms that fall under the HyperLogLog umbrella. They can be summarized as follows
ASF | datasketch | ||
---|---|---|---|
HyperLogLog | HyperLogLog | HyperLogLog++ | |
Hash functions | 64 bit MurmurHash | SHA1 32 bit. Others are possible. | SHA1 64 bit. Others are possible. |
Bucket sizes (bits) | 4, 6, 8 | 4 | 8 |
Estimator | Exact mode for small cardinalities Historic Inverse Probability (HIP) for a single sketch; Harmonic mean after merging multiple sketches | No exact mode. Harmonic mean for all sketches | Harmonic mean for all sketches (single or merged) with bias corrections at small and large cardinalities |
Since datasketch.hyperloglog.HyperLogLog
uses only $32$ bit hash functions, we do not regard this as an appropriate solution for industry-applications because the error estimates will vary wildly when the input is large enough. To ensure the fairest comparison between the sketches, we will only compare the ASF implementations with datasketch.hyperloglog.HyperLogLogPlusPlus
with the MurmurHash function over $64$ bits using mmh3.hash64(x, signed=False)[0]
.