Approximate Counting using HyperLogLog

count(distinct value) can often cause memory exhausted errors where input data and the cardinality of value are large.

HyperLogLog is an efficient algorithm for approximating the number of distinct elements in a multiset. Hivemall implements HyperLogLog++ in approx_count_distinct.

Usage

approx_count_distinct is less accurate than COUNT(DISTINCT expression), but performs better on huge input.

select
    count(distinct rowid) as actual,
    approx_count_distinct(rowid) as default_p 
from
    train;
actualdefault_p
4584061745567770
select
    approx_count_distinct(rowid, '-p 4') as p4,
    approx_count_distinct(rowid, '-p 6 -sp 6') as p6_sp6,
    approx_count_distinct(rowid, '-p 14') as p14,
    approx_count_distinct(rowid, '-p 15') as p15,
    approx_count_distinct(rowid, '-p 16') as p16,
    approx_count_distinct(rowid, '-p 24') as p24,
    approx_count_distinct(rowid, '-p 25') as p25,
    approx_count_distinct(rowid, '-p 15 -sp 15') as p15_sp15
from
    train;
p4p6_sp6p14p15p16p24p25p15_sp15
3803306649332600450510154556777045614484458313594583228045567770

Note

p controls expected precision and memory consumption tradeoff and default p=15 generally works well. Find More information on this paper.

Function Signature

You can find the function signature and options of approx_count_distinct is as follows:

select 
    approx_count_distinct(rowid, '-help')
from
    train;
usage: HLLEvaluator [-help] [-p <arg>] [-sp <arg>]
 -help       Show function help
 -p <arg>    The size of registers for the normal set. `p` MUST be in the
             range [4,sp] and 15 by the default
 -sp <arg>   The size of registers for the sparse set. `sp` MUST be in the
             range [4,32] and 25 by the defaul