blob: 93332ce879d5d3bbdbebcd69dc87eb26fd0f9105 [file] [log] [blame]
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HLL Sketch Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic Sketch Usage"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datasketches import hll_sketch, hll_union, tgt_hll_type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create a sketch with log2(k) = 12"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sk = hll_sketch(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### HLL SKETCH SUMMARY: \n",
" Log Config K : 12\n",
" Hll Target : HLL_4\n",
" Current Mode : HLL\n",
" LB : 2.06958e+06\n",
" Estimate : 2.09635e+06\n",
" UB : 2.12379e+06\n",
" OutOfOrder flag: 0\n",
" CurMin : 7\n",
" NumAtCurMin : 72\n",
" HipAccum : 2.09635e+06\n",
" KxQ0 : 5.80703\n",
" KxQ1 : 0\n",
"\n"
]
}
],
"source": [
"n = 1 << 21\n",
"for i in range(0, n):\n",
" sk.update(i)\n",
"print(sk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Upper bound (1 std. dev) as % of true value: 101.2703\n"
]
}
],
"source": [
"print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimate as % of true value: 99.9618\n"
]
}
],
"source": [
"print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lower bound (1 std. dev) as % of true value: 98.6852\n"
]
}
],
"source": [
"print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can serialize and deserialize the sketch, which will give us back the same structure."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2096"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sk_bytes = sk.serialize_compact()\n",
"len(sk_bytes)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### HLL SKETCH SUMMARY: \n",
" Log Config K : 12\n",
" Hll Target : HLL_4\n",
" Current Mode : HLL\n",
" LB : 2.06958e+06\n",
" Estimate : 2.09635e+06\n",
" UB : 2.12379e+06\n",
" OutOfOrder flag: 0\n",
" CurMin : 7\n",
" NumAtCurMin : 72\n",
" HipAccum : 2.09635e+06\n",
" KxQ0 : 5.80703\n",
" KxQ1 : 0\n",
"\n"
]
}
],
"source": [
"sk2 = hll_sketch.deserialize(sk_bytes)\n",
"print(sk2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch Union Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. For most applications we'd generally create all new data using the same size sketch, allowing differences to creep in when combining new and historica data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"k = 12\n",
"n = 1 << 20\n",
"offset = int(3 * n / 4)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sk1 = hll_sketch(k)\n",
"sk2 = hll_sketch(k + 1)\n",
"for i in range(0, n):\n",
" sk1.update(i)\n",
" sk2.update(i + offset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"union = hll_union(k+1)\n",
"union.update(sk1)\n",
"union.update(sk2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how log config k has automatically adopted the value of the smaller input sketch."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### HLL SKETCH SUMMARY: \n",
" Log Config K : 12\n",
" Hll Target : HLL_4\n",
" Current Mode : HLL\n",
" LB : 1.80197e+06\n",
" Estimate : 1.83108e+06\n",
" UB : 1.86121e+06\n",
" OutOfOrder flag: 1\n",
" CurMin : 6\n",
" NumAtCurMin : 2\n",
" HipAccum : 1.76932e+06\n",
" KxQ0 : 6.60752\n",
" KxQ1 : 0\n",
"\n"
]
}
],
"source": [
"result = union.get_result()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can again compare against the exact result, in this case 1.75*n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimate as % of true value: 99.7859\n"
]
}
],
"source": [
"print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}