blob: ed03e6188d19b30dbd65d001a45878bdc5f6b132 [file] [log] [blame] [view]
# A Comparison of Python Sketching Libraries
These notebooks aim to provide a fair comparison between the Python support for
[Apache Software Foundation (ASF) DataSketches](https://pypi.org/project/datasketches/) and the [datasketch](https://pypi.org/project/datasketch/)
library.
The notebooks are not an attempt to fully characterize either library's implementation, rather to highlight differences that manifest as a result
of design choices for each library.
## Summary
As of the time of writing, November 2023, the versions under comparison are:
```
Name: datasketches
Version: 4.1.0
Name: datasketch
Version: 1.6.4
```
## 1. Distinct Counting
### 1a. HyperLogLog
Each library implements various algorithms that fall under the HyperLogLog umbrella.
They can be summarized as follows
| | ASF | datasketch | |
|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| | HyperLogLog | HyperLogLog | HyperLogLog++ |
| Hash functions | 64 bit MurmurHash | SHA1 32 bit. Others are possible. | SHA1 64 bit. Others are possible. |
| Bucket sizes (bits) | 4, 6, 8 | 4 | 8 |
| Estimator | Exact mode for small cardinalities Historic Inverse Probability (HIP) for a single sketch; Harmonic mean after merging multiple sketches | No exact mode. Harmonic mean for all sketches | Harmonic mean for all sketches (single or merged) with bias corrections at small and large cardinalities |
Since `datasketch.hyperloglog.HyperLogLog` uses only $32$ bit hash functions, we do not regard this as an appropriate solution for
industry-applications because the error estimates will vary wildly when the input is large enough.
To ensure the fairest comparison between the sketches, we will only compare the ASF implementations with `datasketch.hyperloglog.HyperLogLogPlusPlus`
with the MurmurHash function over $64$ bits using `mmh3.hash64(x, signed=False)[0]`.