commit | ca37f1d2aabdf41f370fb12a9cd52a4e89d5b542 | |
---|---|---|
author | saydakov <saydakov@yahoo-inc.com> | Mon Nov 12 12:25:58 2018 -0800 |
committer | saydakov <saydakov@yahoo-inc.com> | Mon Nov 12 12:25:58 2018 -0800 |
tree | 19bdd554ba3d2dae7f9d170308e2dcf43d484c43 | |
parent | 7100d7b63715908e5b1aa38f379776272be0901a | |

section headers

1 file changed

README.md

Module for PostgreSQL to support approximate algorithms based on the Datasketches core library sketches-core-cpp. See https://datasketches.github.io/ for details.

This module currently supports two sketches:

- CPC (Compressed Probabilistic Counting) sketch - very compact (when serialized) distinct-counting sketch
- KLL float quantiles sketch - for estimating distributions: quantile, rank, PMF (histogram), CDF

Exact count distinct:

    $ time psql test -c "select count(distinct id) from random_ints_100m;"
      count
    ----------
     63208457
    (1 row)

    real    1m59.060s

Approximate count distinct:

    $ time psql test -c "select cpc_sketch_distinct(id) from random_ints_100m;"
     cpc_sketch_distinct
    ---------------------
        62716231.1448033
    (1 row)

    real    0m21.811s

Note that the above one-off distinct count is just to show the basic usage. Most importantly, the sketch can be used as an “additive” distinct count metric in a data cube.
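To illustrate what “additive” means here, the following is a minimal Python sketch of a mergeable distinct-count summary. It is not the CPC algorithm and does not use this module's API: it is a toy K-minimum-values (KMV) sketch, and the class name `KMVSketch`, the parameter `k`, and the sample data are all assumptions made for illustration. The point is only that summaries built separately (for example, one per day or per cube cell) can be merged into a summary of the union, whereas plain distinct counts cannot simply be summed.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-minimum-values distinct-count sketch (illustration only, not CPC)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated) holding the k smallest hashes
        self._members = set()  # same hashes, for duplicate checks and merging

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._add_hash(self._hash(item))

    def merge(self, other):
        # The k smallest hashes of the union are always among the retained
        # hashes of the two inputs, so merging loses no information.
        merged = KMVSketch(self.k)
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        return (self.k - 1) / -self._heap[0]


# Two hypothetical partitions with 10,000 ids in common.
day1, day2 = KMVSketch(), KMVSketch()
for i in range(0, 30_000):
    day1.update(i)
for i in range(20_000, 50_000):
    day2.update(i)

merged = day1.merge(day2)
print("day1 estimate:  ", day1.estimate())
print("day2 estimate:  ", day2.estimate())
print("merged estimate:", merged.estimate())  # approximates the 50,000 distinct ids
```

Adding the two per-day estimates would count the shared ids twice, while the merged sketch estimates the size of the union directly.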

Table “normal” has 1 million values drawn from the normal distribution with mean=0 and stddev=1. We can build a sketch that represents the distribution and store it in a table (create table kll_float_sketch_test(sketch kll_float_sketch)):

    $ psql test -c "insert into kll_float_sketch_test select kll_float_sketch_build(value) from normal";
    INSERT 0 1

We expect the value with rank 0.5 (median) to be approximately 0:

    $ psql test -c "select kll_float_sketch_get_quantile(sketch, 0.5) from kll_float_sketch_test";
     kll_float_sketch_get_quantile
    -------------------------------
                        0.00648344

In reverse: we expect the rank of value 0 (true median) to be approximately 0.5:

    $ psql test -c "select kll_float_sketch_get_rank(sketch, 0) from kll_float_sketch_test";
     kll_float_sketch_get_rank
    ---------------------------
                      0.496289
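The relationship between the two queries above can be checked on exact data. The following Python sketch does not use the KLL sketch at all: it sorts a sample and computes exact quantiles and ranks, which is what the sketch approximates in far less space. The function names, the seed, the sample size, and the definition of rank as the fraction of values strictly below the argument are assumptions for illustration.

```python
import bisect
import random

random.seed(42)  # hypothetical seed, only to make the run repeatable
values = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))


def get_rank(sorted_values, value):
    """Fraction of values strictly less than `value` (normalized rank)."""
    return bisect.bisect_left(sorted_values, value) / len(sorted_values)


def get_quantile(sorted_values, rank):
    """Value found at normalized rank `rank`, with rank in [0, 1]."""
    index = min(int(rank * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


median = get_quantile(values, 0.5)
print("quantile at rank 0.5:", median)                   # close to 0
print("rank of value 0:     ", get_rank(values, 0.0))    # close to 0.5
print("rank of the median:  ", get_rank(values, median)) # round trip to about 0.5
```

Quantile and rank are inverse lookups on the same sorted order, so a value obtained for rank 0.5 maps back to a rank of about 0.5.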

Note that the normal distribution was used just to show the basic usage. The sketch does not make any assumptions about the distribution.