blob: 28c6157604476633d24542fcddc6cb3a13d285f5 [file] [log] [blame]
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Theta Sketch Tutorial\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
" * [Overview](#Overview)\n",
" * [Set-up](#Set-up)\n",
" * [Basic Sketch Usage](#Basic-Sketch-Usage)\n",
" * [Sketch Unions](#Sketch-Unions)\n",
" * [Sketch Intersections](#Sketch-Intersections)\n",
" * [Set Difference (A-not-B)](#Set-Difference-(A-not-B))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This tutorial covers basic operation of the Theta sketch for distinct counting. We will demonstrate how to create and feed data into sketches as well as the various set operations. We will also include the HLL sketch for comparison.\n",
"\n",
"Characterization tests of the hash function we use, Murmur3, have shown that it has excellent independence properties. As a reuslt, we can achieve reasonable performance for demonstration purposes by feeding in sequential integers. This lets us experiment with the set operations in a controlled but still realistic manner, and to know the exact result without resorting to an expensive computation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set-up\n",
"\n",
"This tutorial assuems you have already downloaded and installed the python wrapper for the DataSketches library. See the [DataSketches Downloads](http://datasketches.apache.org/docs/Community/Downloads.html) page for details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasketches import theta_sketch, update_theta_sketch, compact_theta_sketch\n",
"from datasketches import theta_union, theta_intersection, theta_a_not_b\n",
"\n",
"from datasketches import hll_sketch, hll_union"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic Sketch Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start, we'll create a sketch with ~1 million points in order to demonstrate basic sketch operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n = 2**20\n",
"k = 12\n",
"sk1 = update_theta_sketch(k)\n",
"hll1 = hll_sketch(k)\n",
"for i in range(0, n):\n",
" sk1.update(i)\n",
" hll1.update(i)\n",
"print(sk1)\n",
"print(hll1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary contains most data of interest, but we can also query for specific information. And in this case, since we know the exact number of distinct items presented to the sketch, we can look at the estimate, upper, and lower bounds as a percentage of the exact value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'Exact result:\\t\\t\\t{n}')\n",
"print('')\n",
"print(f'Theta upper bound (1 std. dev):\\t{sk1.get_upper_bound(1):.1f}\\t({100*sk1.get_upper_bound(1) / n - 100:.2f}%)')\n",
"print(f'Theta estimate:\\t\\t\\t{sk1.get_estimate():.1f}\\t({100*sk1.get_estimate() / n - 100:.2f}%)')\n",
"print(f'Theta lower bound (1 std. dev):\\t{sk1.get_lower_bound(1):.1f}\\t({100*sk1.get_lower_bound(1) / n - 100:.2f}%)')\n",
"print('')\n",
"print(f'HLL upper bound (1 std. dev):\\t{hll1.get_upper_bound(1):.1f}\\t({100*hll1.get_upper_bound(1) / n - 100:.2f}%)')\n",
"print(f'HLL estimate:\\t\\t\\t{hll1.get_estimate():.1f}\\t({100*hll1.get_estimate() / n - 100:.2f}%)')\n",
"print(f'HLL lower bound (1 std. dev):\\t{hll1.get_lower_bound(1):.1f}\\t({100*hll1.get_lower_bound(1) / n - 100:.2f}%)')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can serialize and reconstruct the sketch. If we compact the sketch prior to serialization, we can still query the rebuilt sketch but cannot update it further. When reconstructed, we can see that the estimate is exactly the same."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sk1_bytes = sk1.compact().serialize()\n",
"len(sk1_bytes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_sk1 = theta_sketch.deserialize(sk1_bytes)\n",
"print(f'Estimate (original):\\t{sk1.get_estimate()}')\n",
"print(f'Estimate (new):\\t\\t{new_sk1.get_estimate()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch Unions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Theta Sketch unions make use of a separate union object. The union will accept input sketches with different values of $k$.\n",
"\n",
"For this example, we will create a sketch with distinct values that partially overlap those in `sk1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"offset = int(15 * n / 16)\n",
"sk2 = update_theta_sketch(k+1)\n",
"hll2 = hll_sketch(k+1)\n",
"for i in range(0, n):\n",
" sk2.update(i + offset)\n",
" hll2.update(i + offset)\n",
"print(sk2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now feed the sketches into the union. As constructed, the exact number of unique values presented to the two sketches is $\\frac{31}{16}n$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"union = theta_union(k)\n",
"union.update(sk1)\n",
"union.update(sk2)\n",
"\n",
"union_hll = hll_union(k)\n",
"union_hll.update(hll1)\n",
"union_hll.update(hll2)\n",
"\n",
"exact = int(31 * n / 16);\n",
"result = union.get_result()\n",
"theta_bound_pct = 100 * (result.get_upper_bound(1) - result.get_estimate()) / exact\n",
"\n",
"hll_result = union_hll.get_result()\n",
"hll_bound_pct = 100 * (hll_result.get_upper_bound(1) - hll_result.get_estimate()) / exact\n",
"\n",
"\n",
"print(f'Exact result:\\t{exact}')\n",
"print(f'Theta Estimate:\\t{result.get_estimate():.1f} ({100*(result.get_estimate()/exact - 1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
"print(f'HLL Estimate:\\t{hll_result.get_estimate():.1f} ({100*(hll_result.get_estimate()/exact - 1):.2f}% +- {hll_bound_pct:.2f}%)')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch Intersections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond unions, theta sketches also support intersctions through the use of an intersection object. For comparison, we also present the HLL estimate here using, using the Inclusion-Exclusion formula: $|A \\cup B| = |A| + |B| - |A \\cap B|$.\n",
"\n",
"That formula might not seem too bad when intersecting 2 sketches, but as the number of sketches increases the formula becomes increasingly comples, and the error compounds rapidly. By comparison, the Theta set operations can be applied to an arbitrary number of sketches. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"intersection = theta_intersection()\n",
"intersection.update(sk1)\n",
"intersection.update(sk2)\n",
"\n",
"hll_inter_est = hll1.get_estimate() + hll2.get_estimate() - hll_result.get_estimate()\n",
"\n",
"print(\"Has result: \", intersection.has_result())\n",
"result = intersection.get_result()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we expect the sets to have an overlap of $\\frac{1}{16}n$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exact = int(n / 16)\n",
"theta_bound_pct = 100 * (result.get_upper_bound(1) - result.get_estimate()) / exact\n",
"\n",
"print(f'Exact result:\\t\\t{exact}')\n",
"print(f'Theta Estimate:\\t\\t{result.get_estimate():.1f} ({100*(result.get_estimate()/exact - 1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
"print(f'HLL Estimate:\\t\\t{hll_inter_est:.1f} ({100*(hll_inter_est/exact - 1):.2f}%)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set Difference (A-not-B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we have the set difference operation. Unlike `theta_union` and `theta_intersection`, `theta_a_not_b` is currently stateless: The object takes as input 2 sketches at a time, namely $a$ and $b$, and directly returns the result as a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"anb = theta_a_not_b()\n",
"result = anb.compute(sk1, sk2)\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using the same two sketches as before, the expected result here is $\\frac{15}{16}n$.\n",
"\n",
"Our HLL estimate comes from manipulating the Inclusion-Exclusion formula above to obtain $|A| - |A \\cap B| = |A \\cup B| - |B|$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exact = int(15 * n /16)\n",
"theta_bound_pct = 100 * (result.get_upper_bound(1) - result.get_estimate()) / exact\n",
"hll_diff_est = hll_result.get_estimate() - hll2.get_estimate()\n",
"\n",
"print(f'Exact result:\\t{exact}')\n",
"print(f'Theta estimate:\\t{result.get_estimate():.1f} ({100*(result.get_estimate()/exact -1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
"print(f'HLL estimate:\\t{hll_diff_est:.1f} ({100*(hll_diff_est/exact - 1):.2f}%)')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}