{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CPC Sketch Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic Sketch Usage"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datasketches import cpc_sketch, cpc_union"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create a sketch with log2(k) = 12, i.e. k = 4096."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sk = cpc_sketch(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### CPC sketch summary:\n",
" lgK : 12\n",
" seed hash : 93cc\n",
" C : 38212\n",
" flavor : 4\n",
" merged : false\n",
" compressed : false\n",
" intresting col : 5\n",
" HIP estimate : 2.09721e+06\n",
" kxp : 11.4725\n",
" offset : 6\n",
" table : allocated\n",
" num SV : 135\n",
" window : allocated\n",
"### End sketch summary\n",
"\n"
]
}
],
"source": [
"n = 1 << 21\n",
"for i in range(0, n):\n",
" sk.update(i)\n",
"print(sk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we know the exact value of n, we can look at the estimate and the upper and lower bounds as a percentage of the true value. We'll use bounds at 1 standard deviation. In this case the true value does lie within the bounds, but because the bounds are probabilistic, the true value will sometimes fall outside them (especially at 1 standard deviation)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Upper bound (1 std. dev) as % of true value: 100.9281\n"
]
}
],
"source": [
"print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimate as % of true value: 100.0026\n"
]
}
],
"source": [
"print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lower bound (1 std. dev) as % of true value: 99.0935\n"
]
}
],
"source": [
"print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can serialize and deserialize the sketch, which will give us back the same structure."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2484"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sk_bytes = sk.serialize()\n",
"len(sk_bytes)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### CPC sketch summary:\n",
" lgK : 12\n",
" seed hash : 93cc\n",
" C : 38212\n",
" flavor : 4\n",
" merged : false\n",
" compressed : false\n",
" intresting col : 5\n",
" HIP estimate : 2.09721e+06\n",
" kxp : 11.4725\n",
" offset : 6\n",
" table : allocated\n",
" num SV : 135\n",
" window : allocated\n",
"### End sketch summary\n",
"\n"
]
}
],
"source": [
"sk2 = cpc_sketch.deserialize(sk_bytes)\n",
"print(sk2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch Union Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we'll create two sketches with partially overlapping values. For good measure, we'll give one sketch a larger k. In most applications we'd create all new data with the same sketch size, with differences creeping in only when combining new and historical data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"k = 12  # log2 of the sketch size (lgK), as passed to cpc_sketch\n",
"n = 1 << 20\n",
"offset = int(3 * n / 4)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sk1 = cpc_sketch(k)\n",
"sk2 = cpc_sketch(k + 1)\n",
"for i in range(0, n):\n",
" sk1.update(i)\n",
" sk2.update(i + offset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a union object and add the sketches to it. To demonstrate that multiple sketch sizes are handled smoothly, we'll use a size of k+1 here."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"union = cpc_union(k+1)\n",
"union.update(sk1)\n",
"union.update(sk2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how lgK in the result has automatically dropped to that of the smaller input sketch (12), even though the union was created with k+1."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### CPC sketch summary:\n",
" lgK : 12\n",
" seed hash : 93cc\n",
" C : 37418\n",
" flavor : 4\n",
" merged : true\n",
" compressed : false\n",
" intresting col : 5\n",
" HIP estimate : 0\n",
" kxp : 4096\n",
" offset : 6\n",
" table : allocated\n",
" num SV : 123\n",
" window : allocated\n",
"### End sketch summary\n",
"\n"
]
}
],
"source": [
"result = union.get_result()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can again compare against the exact result, which in this case is 1.75*n."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimate as % of true value: 99.6646\n"
]
}
],
"source": [
"print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}