Approximate Counting using HyperLogLog

count(distinct value) can often cause memory exhausted errors where input data and the cardinality of value are large.

HyperLogLog is an efficient algorithm for approximating the number of distinct elements in a multiset. Hivemall implements HyperLogLog++ in approx_count_distinct.

Usage

approx_count_distinct is less accurate than COUNT(DISTINCT expression), but performs better on huge input.

select
    count(distinct rowid) as actual,
    approx_count_distinct(rowid) as default_p 
from
    train;

actual	default_p
45840617	45567770

select
    approx_count_distinct(rowid, '-p 4') as p4,
    approx_count_distinct(rowid, '-p 6 -sp 6') as p6_sp6,
    approx_count_distinct(rowid, '-p 14') as p14,
    approx_count_distinct(rowid, '-p 15') as p15,
    approx_count_distinct(rowid, '-p 16') as p16,
    approx_count_distinct(rowid, '-p 24') as p24,
    approx_count_distinct(rowid, '-p 25') as p25,
    approx_count_distinct(rowid, '-p 15 -sp 15') as p15_sp15
from
    train;

p4	p6_sp6	p14	p15	p16	p24	p25	p15_sp15
38033066	49332600	45051015	45567770	45614484	45831359	45832280	45567770

Note
p controls expected precision and memory consumption tradeoff and default p=15 generally works well. Find More information on this paper.

Function Signature

You can find the function signature and options of approx_count_distinct is as follows:

select 
    approx_count_distinct(rowid, '-help')
from
    train;

usage: HLLEvaluator [-help] [-p <arg>] [-sp <arg>]
 -help       Show function help
 -p <arg>    The size of registers for the normal set. `p` MUST be in the
             range [4,sp] and 15 by the default
 -sp <arg>   The size of registers for the sparse set. `sp` MUST be in the
             range [4,32] and 25 by the defaul