| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## CPC Sketch Examples" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Basic Sketch Usage" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "from datasketches import cpc_sketch, cpc_union" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We'll create a sketch with log2(k) = 12, i.e. k = 4096." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "sk = cpc_sketch(12)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Insert n = 2^21 (about 2.1 million) distinct values. The sketch hashes every input, so sequential integers are fine for demonstration purposes." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### CPC sketch summary:\n", |
| " lgK : 12\n", |
| " seed hash : 93cc\n", |
| " C : 38212\n", |
| " flavor : 4\n", |
| " merged : false\n", |
| " compressed : false\n", |
| " intresting col : 5\n", |
| " HIP estimate : 2.09721e+06\n", |
| " kxp : 11.4725\n", |
| " offset : 6\n", |
| " table : allocated\n", |
| " num SV : 135\n", |
| " window : allocated\n", |
| "### End sketch summary\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "n = 1 << 21\n", |
| "for i in range(n):\n", |
| "    sk.update(i)\n", |
| "print(sk)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Since we know the exact value of n, we can express the estimate and its upper and lower bounds as a percentage of the true value. We'll use bounds at 1 standard deviation. In this run the true value lies within the bounds, but because the bounds are probabilistic, the true value will sometimes fall outside them, especially at only 1 standard deviation." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Upper bound (1 std. dev) as % of true value: 100.9281\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 5, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Estimate as % of true value: 100.0026\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Lower bound (1 std. dev) as % of true value: 99.0935\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))" |
| ] |
| }, |
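| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "As a rough sanity check, the half-width of the 1 standard deviation bounds above should be close to the sketch's relative standard error. The cell below is a back-of-the-envelope sketch, not the library's own calculation: it assumes the asymptotic constant sqrt(ln(2)/2) for the HIP estimator, which is an assumption brought in for illustration. The library may use empirically tuned values at small lgK, so expect only approximate agreement with the bounds printed above." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import math\n", |
| "\n", |
| "lg_k = 12      # same log2(k) as the sketch above\n", |
| "k = 1 << lg_k  # k = 2**lg_k\n", |
| "# Assumed asymptotic error constant for the HIP estimator (not read from the library)\n", |
| "hip_constant = math.sqrt(math.log(2) / 2)\n", |
| "rse = hip_constant / math.sqrt(k)\n", |
| "print(\"Approximate relative standard error (%): \", round(100 * rse, 4))" |
| ] |
| }, |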
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Finally, we can serialize the sketch to bytes and deserialize it again, which gives us back the same structure." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 7, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "2484" |
| ] |
| }, |
| "execution_count": 7, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "sk_bytes = sk.serialize()\n", |
| "len(sk_bytes)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 8, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### CPC sketch summary:\n", |
| " lgK : 12\n", |
| " seed hash : 93cc\n", |
| " C : 38212\n", |
| " flavor : 4\n", |
| " merged : false\n", |
| " compressed : false\n", |
| " intresting col : 5\n", |
| " HIP estimate : 2.09721e+06\n", |
| " kxp : 11.4725\n", |
| " offset : 6\n", |
| " table : allocated\n", |
| " num SV : 135\n", |
| " window : allocated\n", |
| "### End sketch summary\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "sk2 = cpc_sketch.deserialize(sk_bytes)\n", |
| "print(sk2)" |
| ] |
| }, |
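| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Comparing printed summaries by eye works, but a round trip can also be checked programmatically. The cell below is a minimal sketch that uses only the calls already shown in this notebook (`update`, `serialize`, `deserialize`, `get_estimate`). The names `demo` and `demo_copy` are illustrative, chosen so that the sketches above are not overwritten." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "from datasketches import cpc_sketch\n", |
| "\n", |
| "# Build a small throwaway sketch so this cell does not depend on earlier state\n", |
| "demo = cpc_sketch(12)\n", |
| "for i in range(1000):\n", |
| "    demo.update(i)\n", |
| "\n", |
| "# Serialize to bytes, rebuild a sketch from them, and compare the estimates\n", |
| "demo_copy = cpc_sketch.deserialize(demo.serialize())\n", |
| "print(\"Original estimate:     \", demo.get_estimate())\n", |
| "print(\"Deserialized estimate: \", demo_copy.get_estimate())" |
| ] |
| }, |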
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Sketch Union Usage" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Here, we'll create two sketches whose input values partially overlap. For good measure, we'll give one sketch a larger k. In most applications we'd create all new data with the same sketch size, so size differences tend to creep in only when combining new and historical data." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
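| "k = 12  # log2(k) for the smaller sketch\n", |
| "n = 1 << 20  # number of values inserted into each sketch\n", |
| "offset = 3 * n // 4  # sk2's values start here, so the two sketches share n/4 values" |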
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 10, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "sk1 = cpc_sketch(k)\n", |
| "sk2 = cpc_sketch(k + 1)\n", |
| "for i in range(n):\n", |
| "    sk1.update(i)\n", |
| "    sk2.update(i + offset)" |
| ] |
| }, |
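| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Because both inputs are plain integer ranges, the exact number of distinct values across the two sketches can be worked out directly, which gives a ground truth to compare the union estimate against. The cell below is a small arithmetic sketch that restates `n` and `offset` with the same values as above so it can run on its own. The names `overlap` and `exact_union` are illustrative." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "n = 1 << 20          # same values as in the cells above\n", |
| "offset = 3 * n // 4\n", |
| "\n", |
| "# sk1 saw the integers in [0, n); sk2 saw the integers in [offset, offset + n)\n", |
| "overlap = len(range(max(0, offset), min(n, offset + n)))\n", |
| "exact_union = 2 * n - overlap  # count the shared values only once\n", |
| "print(\"Values seen by both sketches: \", overlap)\n", |
| "print(\"Exact distinct count of the union: \", exact_union)\n", |
| "print(\"As a multiple of n: \", exact_union / n)" |
| ] |
| }, |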
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Create a union object and add both sketches to it. To demonstrate that the union handles mixed sketch sizes smoothly, we'll configure it with k+1 here." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 11, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "union = cpc_union(k+1)\n", |
| "union.update(sk1)\n", |
| "union.update(sk2)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Note how lgK in the result has automatically dropped to 12, the value of the smaller input sketch, even though the union was created with k+1." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 12, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### CPC sketch summary:\n", |
| " lgK : 12\n", |
| " seed hash : 93cc\n", |
| " C : 37418\n", |
| " flavor : 4\n", |
| " merged : true\n", |
| " compressed : false\n", |
| " intresting col : 5\n", |
| " HIP estimate : 0\n", |
| " kxp : 4096\n", |
| " offset : 6\n", |
| " table : allocated\n", |
| " num SV : 123\n", |
| " window : allocated\n", |
| "### End sketch summary\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "result = union.get_result()\n", |
| "print(result)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We can again compare against the exact result, in this case 1.75*n: each sketch saw n values, and the n/4 values from offset to n appear in both." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 13, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Estimate as % of true value: 99.6646\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))" |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.7.0" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |