commit 23dd5b13cbe802fa056299b0a5ed7cf1b6ee108a
author:    Knut Nesheim <knutin@gmail.com>  Fri Jan 10 11:38:35 2014 +0100
committer: Knut Nesheim <knutin@gmail.com>  Fri Jan 10 11:38:35 2014 +0100
tree:      00c08c30cb5f18fb9f0a3316c96ec029f1474006
parent:    1392b478028ee8a4f8b0fc4143e021c6e656c480
Moved more heavy lifting into the backend module to allow for more optimizations. Introduced hyper_binary, using a fixed size binary as an array.
# hyper

`hyper` is an implementation of the HyperLogLog algorithm with the bias correction from HLL++, as [described by Google](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/40671.pdf).
Using `hyper` you can estimate the cardinality of large data sets using constant memory. The union of two HLL filters is lossless, which allows perfect parallelization.
```erlang
1> hyper:insert(<<"foobar">>, hyper:insert(<<"quux">>, hyper:new(4))).
{hyper,4,{hyper_gb,{2,{10,1,{8,1,nil,nil},nil}}}}
2> hyper:card(v(-1)).
2.136502281992361
```
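The core API shown above (`new`, `insert`, `card`, plus lossless unions) can be sketched in plain Python to illustrate how HyperLogLog works. This is a toy, not hyper's Erlang implementation: the hash function, the bias-correction constant, and the small-range correction are textbook HLL choices, not taken from this library.

```python
import hashlib
import math

def hll_new(p):
    # Precision P gives m = 2^P registers, each holding the longest
    # run of leading zero bits observed for that register.
    return {"p": p, "registers": [0] * (1 << p)}

def hll_insert(value, h):
    p = h["p"]
    # Hash the value to 64 bits (sha256 here is an arbitrary choice).
    x = int.from_bytes(hashlib.sha256(value).digest()[:8], "big")
    idx = x >> (64 - p)                   # first P bits select a register
    rest = x & ((1 << (64 - p)) - 1)      # remaining 64-P bits
    # rank = 1-based position of the leftmost 1-bit in the remaining bits
    rank = (64 - p) - rest.bit_length() + 1
    h["registers"][idx] = max(h["registers"][idx], rank)
    return h

def hll_card(h):
    m = len(h["registers"])
    alpha = 0.7213 / (1 + 1.079 / m)      # standard bias-correction constant
    e = alpha * m * m / sum(2.0 ** -r for r in h["registers"])
    zeros = h["registers"].count(0)
    if e <= 2.5 * m and zeros:            # small-range (linear counting) correction
        e = m * math.log(m / zeros)
    return e

def hll_union(a, b):
    # The union is lossless: elementwise max of the registers gives
    # exactly the filter you would get by inserting both streams into one.
    return {"p": a["p"],
            "registers": [max(x, y)
                          for x, y in zip(a["registers"], b["registers"])]}

h = hll_new(14)
for i in range(10000):
    hll_insert(str(i).encode(), h)
print(hll_card(h))  # close to 10000
```

The lossless union is what makes parallelization perfect: each worker builds its own filter, and merging them is a cheap elementwise max.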
The estimation error can be seen in this example:
```erlang
10> Run = fun (P, Card) ->
              hyper:card(
                lists:foldl(fun (_, H) ->
                                hyper:insert(crypto:rand_bytes(32), H)
                            end, hyper:new(P), lists:seq(1, Card)))
          end.
#Fun<erl_eval.12.80484245>
11> Run(12, 10000).
9984.777312630407
12> Run(14, 10000).
9994.719695068114
13> Run(16, 10000).
9991.34646427657
```
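The precision parameter P trades memory for accuracy: HyperLogLog's theoretical relative standard error is roughly 1.04/sqrt(m) with m = 2^P registers, and the runs above land well within those bounds. A quick check of the expected error per precision (plain Python, for illustration):

```python
import math

# Theoretical relative standard error of HyperLogLog: ~1.04 / sqrt(m),
# where m = 2^P is the number of registers.
for p in (12, 14, 16):
    m = 1 << p
    print(f"P={p}: {m} registers, expected error ~{1.04 / math.sqrt(m):.2%}")
```

Doubling the register count (P + 1) shrinks the expected error by a factor of sqrt(2), at the cost of doubling the filter's memory.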