| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## HLL Sketch Examples" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Basic Sketch Usage" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "from datasketches import hll_sketch, hll_union, tgt_hll_type" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We'll create a sketch with log2(k) = 12" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "sk = hll_sketch(12)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### HLL SKETCH SUMMARY: \n", |
| " Log Config K : 12\n", |
| " Hll Target : HLL_4\n", |
| " Current Mode : HLL\n", |
| " LB : 2.06958e+06\n", |
| " Estimate : 2.09635e+06\n", |
| " UB : 2.12379e+06\n", |
| " OutOfOrder flag: 0\n", |
| " CurMin : 7\n", |
| " NumAtCurMin : 72\n", |
| " HipAccum : 2.09635e+06\n", |
| " KxQ0 : 5.80703\n", |
| " KxQ1 : 0\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "n = 1 << 21\n", |
| "for i in range(0, n):\n", |
| " sk.update(i)\n", |
| "print(sk)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Upper bound (1 std. dev) as % of true value: 101.2703\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 5, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Estimate as % of true value: 99.9618\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Lower bound (1 std. dev) as % of true value: 98.6852\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Finally, we can serialize and deserialize the sketch, which will give us back the same structure." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 7, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "2096" |
| ] |
| }, |
| "execution_count": 7, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "sk_bytes = sk.serialize_compact()\n", |
| "len(sk_bytes)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 8, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### HLL SKETCH SUMMARY: \n", |
| " Log Config K : 12\n", |
| " Hll Target : HLL_4\n", |
| " Current Mode : HLL\n", |
| " LB : 2.06958e+06\n", |
| " Estimate : 2.09635e+06\n", |
| " UB : 2.12379e+06\n", |
| " OutOfOrder flag: 0\n", |
| " CurMin : 7\n", |
| " NumAtCurMin : 72\n", |
| " HipAccum : 2.09635e+06\n", |
| " KxQ0 : 5.80703\n", |
| " KxQ1 : 0\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "sk2 = hll_sketch.deserialize(sk_bytes)\n", |
| "print(sk2)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Sketch Union Usage" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. For most applications we'd generally create all new data using the same size sketch, allowing differences to creep in when combining new and historica data." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "k = 12\n", |
| "n = 1 << 20\n", |
| "offset = int(3 * n / 4)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 10, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "sk1 = hll_sketch(k)\n", |
| "sk2 = hll_sketch(k + 1)\n", |
| "for i in range(0, n):\n", |
| " sk1.update(i)\n", |
| " sk2.update(i + offset)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 11, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "union = hll_union(k+1)\n", |
| "union.update(sk1)\n", |
| "union.update(sk2)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Note how log config k has automatically adopted the value of the smaller input sketch." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 12, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### HLL SKETCH SUMMARY: \n", |
| " Log Config K : 12\n", |
| " Hll Target : HLL_4\n", |
| " Current Mode : HLL\n", |
| " LB : 1.80197e+06\n", |
| " Estimate : 1.83108e+06\n", |
| " UB : 1.86121e+06\n", |
| " OutOfOrder flag: 1\n", |
| " CurMin : 6\n", |
| " NumAtCurMin : 2\n", |
| " HipAccum : 1.76932e+06\n", |
| " KxQ0 : 6.60752\n", |
| " KxQ1 : 0\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "result = union.get_result()\n", |
| "print(result)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We can again compare against the exact result, in this case 1.75*n" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 13, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Estimate as % of true value: 99.7859\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": {}, |
| "outputs": [], |
| "source": [] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.7.0" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |