{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# KLL Sketch Tutorial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
" * [Overview](#Overview)\n",
" * [Set-up](#Set-up)\n",
" * [Creating a KLL Sketch](#Creating-a-KLL-Sketch)\n",
" * [Querying the sketch](#Querying-the-sketch)\n",
" * [Merging Sketches](#Merging-Sketches)\n",
" * [Serializing Sketches for Transportation](#Serializing-Sketches-for-Transportation)\n",
" * [Using in a Data Cube](#Using-in-a-Data-Cube)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This tutorual will focus on the KLL sketch. We will demonstrate how to create and feed data into sketches, and also show an option for moving sketches between systems. We will rely on synthetic data to help us better reason about expected results when visualizing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set-up\n",
"\n",
"This tutorial assumes you have already downloaded and installed the python wrapper for the DataSketches library. See the [DataSketches Downloads](http://datasketches.apache.org/docs/Community/Downloads.html) page for details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasketches import kll_floats_sketch\n",
"\n",
"import base64\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"sns.set(color_codes=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Creating a KLL Sketch\n",
"\n",
"Sketch creation is simple: As with all the sketches in the library, you simply need to decide on your error tolerance, which determines the maximum size of the sketch. The DataSketches library refers to that value as $k$.\n",
"\n",
"We can get an estimate of the expected error bound (99th percentile) without instantiating anything."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(kll_floats_sketch.get_normalized_rank_error(160, False))\n",
"print(kll_floats_sketch.get_normalized_rank_error(200, False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the (one-sided) error with $k=160$ is about 1.67% versus 1.33% at $k=200$. For the rest of the examples, we will use $200$. We can now instantiate a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"k = 200\n",
"sk = kll_floats_sketch(k)\n",
"print(sk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch has seen no data so far (N=0) and is consequently storing nothing (Retained items=0). Storage bytes refers to how much space would be required to save the sketch as an array of bytes, which in this case is fairly minimal.\n",
"Next, we can add some data."
]
},
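{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can confirm the storage figure directly by serializing the empty sketch and counting the bytes. This is a small check added to the tutorial, assuming `serialize()`, which we also use later when transporting sketches."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The serialized image of an empty sketch is only a few bytes of metadata\n",
"print(f'Empty sketch serializes to {len(sk.serialize())} bytes')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can add some data."
]
},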
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sk.update(np.random.exponential(size=150))\n",
"print(sk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We added 150 samples, which is few enough that the sketch is still in exact mode, meaning it is storing everything rather than sampling. To be able to compare the sketch to an exact computation, we will generate new data -- and a lot more of it. We will also create a sketch with a much larger $k$ to demonstrate the effect of increasing the sketch size."
]
},
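{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief check, assuming the wrapper exposes `is_estimation_mode()` as the C++ sketch does: it should return `False` while the sketch still holds every input item."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# False while the sketch retains every update; True once it starts sampling\n",
"print(sk.is_estimation_mode())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To be able to compare the sketch to an exact computation, we will generate new data -- and a lot more of it. We will also create a sketch with a much larger $k$ to demonstrate the effect of increasing the sketch size."
]
},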
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sk = kll_floats_sketch(k)\n",
"sk_large = kll_floats_sketch(10*k)\n",
"data = np.random.exponential(size=2**24)\n",
"sk.update(data)\n",
"sk_large.update(data)\n",
"print(sk)\n",
"print(sk_large)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the sketch is well into sampling territory, having processed nearly 17 million items. We can see that the sketch is retaining only 645 items. The 2676 bytes of storage compares to 64MB for raw data using 4-byte floats. Next we will start querying the sketch to better understand the performance. Even the much larger sketch uses less 24k bytes with fewer than 6000 points."
]
},
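{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is that quick verification, assuming the wrapper exposes `get_num_retained()` alongside the `serialize()` method used later in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retained items and serialized size for the two sketch sizes\n",
"print(f'k=200 : {sk.get_num_retained()} items, {len(sk.serialize())} bytes')\n",
"print(f'k=2000: {sk_large.get_num_retained()} items, {len(sk_large.serialize())} bytes')"
]
},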
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying the sketch\n",
"\n",
"The median for an exponential distribution is $\\frac{ln 2}{\\lambda}$, and the default numpy exponential distribution has $\\lambda = 1.0$, so the median should be close to $0.693$. Similarly, if we ask for the rank $ln 2$, we should get a value close to $0.5$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'Theoretical median : {np.log(2):.6f}')\n",
"print(f'Estimated median, k=200 : {sk.get_quantile(0.5):.6f}')\n",
"print(f'Estimated median, k=2000 : {sk_large.get_quantile(0.5):.6f}')\n",
"print('')\n",
"print(f'Exact Quantile of ln(2) : 0.5')\n",
"print(f'Est. Quantile of ln(2), k=200 : {sk.get_rank(np.log(2)):.6f}')\n",
"print(f'Est. Quantile of ln(2), k=2000 : {sk_large.get_rank(np.log(2)):.6f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the common use cases of a quantiles sketch like KLL is visualizing data with a histogram. We can create one from the sketch easily enough, but for this tutorial we also want to know how well we are doing. Fortunately, we can still compute a histogram on this data directly for comparison.\n",
"\n",
"Note that the sketch returns a PMF while the histogram computes data only for the bins between the provided points and must be converted to a PMF. The sketch also returns a bin containing all the mass less than the minimum provided point. In this case that will always be 0, so we discard it for plotting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xmin = 0 # could use sk.get_min_value() but we know the bound here\n",
"xmax = sk.get_quantile(0.99995) # this will exclude a little data from the exact distribution\n",
"num_splits = 40\n",
"step = (xmax - xmin) / num_splits\n",
"splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
"x = splits.copy()\n",
"x.append(xmax)\n",
"\n",
"pmf = sk.get_pmf(splits)[1:]\n",
"pmf_large = sk_large.get_pmf(splits)[1:]\n",
"exact_pmf = np.histogram(data, bins=x)[0] / sk.get_n()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(12,12))\n",
"plt.subplot(2,1,1)\n",
"plt.title('PMF, k = 200')\n",
"plt.ylabel('Probability')\n",
"plt.bar(x=splits, height=pmf, align='edge', width=-.07, color='blue')\n",
"plt.bar(x=splits, height=exact_pmf, align='edge', width=.07, color='red')\n",
"plt.legend(['KLL, k=200','Exact'])\n",
"\n",
"plt.subplot(2,1,2)\n",
"plt.title('PMF, k = 2000')\n",
"plt.ylabel('Probability')\n",
"plt.bar(x=splits, height=pmf_large, align='edge', width=-.07, color='blue')\n",
"plt.bar(x=splits, height=exact_pmf, align='edge', width=.07, color='red')\n",
"plt.legend(['KLL, k=2000','Exact'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch with $k=200$ clearly provides a good approximation. In the case of this exponential distribution, we sometimes observe that there is additional mass near the right edge of the tail compared to the true PMF, although still within the provided error bound with high probability. While this is not problematic given the guarantees of the sketch, certain use cases requiring high precision at extreme quantiles may find it less satisfactory. With the larger $k=2000$ sketch, the accuracy is improved while using much less space thena the raw data, although larger than the smaller-$k$ sketch.\n",
"\n",
"We will eventaully provide what we call a Relative Error Quantiles sketch that will have tighter error bounds as you approach the tail of the distribution, which will be useful if you care primarily about accuray in the tail, but that will require a larger sketch."
]
},
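{
"cell_type": "markdown",
"metadata": {},
"source": [
"The a priori bounds quantify the size/accuracy tradeoff we saw between $k=200$ and $k=2000$. Since we are comparing PMFs here, we can use the same static method as earlier, but with the double-sided (PMF) variant:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Double-sided (PMF) normalized rank error bounds at 99% confidence\n",
"print(f'k=200 : {kll_floats_sketch.get_normalized_rank_error(k, True):.6f}')\n",
"print(f'k=2000: {kll_floats_sketch.get_normalized_rank_error(10*k, True):.6f}')"
]
},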
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Merging sketches\n",
"\n",
"A single sketch is certainly useful, but the real power of sketches comes from the ability to merge them. Here, we will create two simple sketches to demonstrate. For good measure, we'll use different values of $k$ for the sketches, as well as feed them different numbers of points."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sk1 = kll_floats_sketch(k)\n",
"sk2 = kll_floats_sketch(int(1.5 * k))\n",
"\n",
"data1 = np.random.normal(loc=-2.0, size=2**24)\n",
"data2 = np.random.normal(loc=2.0, size=2**25)\n",
"sk1.update(data1)\n",
"sk2.update(data2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the KLL sketch, there is no separate object for unions. We can either create another empty sketch and use that as a merge target or we can merge sketch 2 into sketch 1. Taking the latter approach and plotting the resulting histogram gives us the expected distribution. Note that one sketch has twice as many points as the other."
]
},
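{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal illustration of the merge-target variant, assuming `merge()` leaves its argument unmodified as in the C++ API:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge both sketches into a third, leaving sk1 and sk2 untouched\n",
"merge_target = kll_floats_sketch(k)\n",
"merge_target.merge(sk1)\n",
"merge_target.merge(sk2)\n",
"print(merge_target.get_n())"
]
},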
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sk1.merge(sk2)\n",
"print(sk1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saved the input data so that we can again compute an exact distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xmin = sk1.get_min_value()\n",
"xmax = sk1.get_max_value()\n",
"num_splits = 20\n",
"step = (xmax - xmin) / num_splits\n",
"splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
"x = splits.copy()\n",
"x.append(xmax)\n",
"\n",
"pmf = sk1.get_pmf(splits)[1:]\n",
"cdf = sk1.get_cdf(splits)[1:]\n",
"exact_pmf = (np.histogram(data1, bins=x)[0] + np.histogram(data2, bins=x)[0]) / sk1.get_n()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(12,6))\n",
"plt.subplot(1,2,1)\n",
"plt.bar(x=splits, height=pmf, align='edge', width=-.3, color='blue')\n",
"plt.bar(x=splits, height=exact_pmf, align='edge', width=.3, color='red')\n",
"plt.legend(['KLL','Exact'])\n",
"plt.ylabel('Probability')\n",
"plt.title('Merged PMF')\n",
"\n",
"plt.subplot(1,2,2)\n",
"plt.bar(x=splits, height=cdf, align='edge', width=-.3, color='blue')\n",
"plt.bar(x=splits, height=np.cumsum(exact_pmf), align='edge', width=.3, color='red')\n",
"plt.legend(['KLL','Exact'])\n",
"plt.ylabel('Probability')\n",
"plt.title('Merged CDF')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that we do not need to do anything special to merge the sketches despite the different values of $k$, and the 2:1 relative ratio of weights of the two distributions was preserved despite the input sketch size difference."
]
},
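{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is that check. One third of the merged stream came from the Gaussian centered at $-2$ and two thirds from the one centered at $+2$, so the rank of $0.0$ should land just above $1/3$ once we account for each component's small tail across zero. This verification is an addition to the tutorial, using only `get_rank()` from above and Python's standard `math` module."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import erf, sqrt\n",
"\n",
"# Standard normal CDF via the error function\n",
"phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))\n",
"\n",
"# 1/3 of the mass is N(-2,1) and 2/3 is N(2,1), so the expected rank of 0.0\n",
"# is (phi(2) + 2*phi(-2)) / 3, just over one third\n",
"expected = (phi(2.0) + 2.0 * phi(-2.0)) / 3.0\n",
"print(f'Expected rank of 0.0 : {expected:.6f}')\n",
"print(f'Estimated rank of 0.0: {sk1.get_rank(0.0):.6f}')"
]
},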
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Serializing Sketches for Transportation\n",
"\n",
"Being able to move sketches between platforms is important. One of the useful aspects of the DataSketches library in particular is binary compatibility across languages. While this section will remain within python, sketches serialized from C++- or Java-based systems would work identically.\n",
"\n",
"In this section, we will start by creating a tab-separated file with a handfull\n",
"of sketches and then load it in as a dataframe. We will encode each binary sketch image as base64.\n",
"\n",
"To simplify sketch creation, the first step will be to define a simple function to generate a line for the file with the given parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_sketch(family: str, n: int, mean: float, var: float) -> str:\n",
" sk = kll_floats_sketch(200)\n",
" if (family == 'normal'):\n",
" sk.update(np.random.normal(loc=mean, scale=var, size=n))\n",
" elif (family == 'uniform'):\n",
" b = mean + np.sqrt(3 * var)\n",
" a = 2 * mean - b\n",
" sk.update(np.random.uniform(low=a, high=b, size=n))\n",
" else:\n",
" return None\n",
" sk_b64 = base64.b64encode(sk.serialize()).decode('utf-8')\n",
" return f'{family}\\t{n}\\t{mean}\\t{var}\\t{sk_b64}\\n'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"filename = 'kll_tutorial.tsv'\n",
"with open(filename, 'w') as f:\n",
" f.write('family\\tn\\tmean\\tvariance\\tkll\\n')\n",
" f.write(generate_sketch('normal', 2**23, -4.0, 0.5))\n",
" f.write(generate_sketch('normal', 2**24, 0.0, 1.0))\n",
" f.write(generate_sketch('normal', 2**25, 2.0, 0.5))\n",
" f.write(generate_sketch('normal', 2**23, 4.0, 0.2))\n",
" f.write(generate_sketch('normal', 2**22, -2.0, 2.0))\n",
" f.write(generate_sketch('uniform', 2**21, 0.5, 1.0/12))\n",
" f.write(generate_sketch('uniform', 2**22, 5.0, 1.0/12))\n",
" f.write(generate_sketch('uniform', 2**20, -0.5, 1.0/3))\n",
" f.write(generate_sketch('uniform', 2**23, 0.0, 4.0/3))\n",
" f.write(generate_sketch('uniform', 2**22, -4.0, 1.0/3)) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your system has a *nix shell, you can inspect the resulting file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!head -2 {filename}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using in a Data Cube"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our file with 10 sketches, we can use pandas to load them in. To ensure that we load the sketches as useful objects, we need to define a converter function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"deserialize_kll = lambda x : kll_floats_sketch.deserialize(base64.b64decode(x))\n",
"\n",
"df = pd.read_csv(filename,\n",
" sep='\\t',\n",
" header=0,\n",
" dtype={'family':'category', 'n':int, 'mean':float, 'var':float},\n",
" converters={'kll':deserialize_kll}\n",
" )\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch column is represented by the string equivalent, which is not very useful for viewing here but does show that the column contains the actual objects."
]
},
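{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the first row's sketch was built from a normal distribution with mean $-4.0$, so querying its median straight out of the dataframe should return a value close to $-4.0$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query one deserialized sketch directly from the dataframe\n",
"print(df.loc[0, 'kll'].get_quantile(0.5))"
]
},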
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, we can now perform queries on the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_result = kll_floats_sketch(10*k)\n",
"for sk in df.loc[df['family'] == 'normal'].itertuples(index=False):\n",
" query_result.merge(sk.kll)\n",
"print(query_result)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that the resulting sketch has processed 71 million items (272MB of data) and is summarizing it using 563 items, and can be serialized into only 2352 bytes, which includes some sketch metadata.\n",
"\n",
"Finally, we want to visualize this data. Remember that we have a mixture of 5 Gaussian distributions:\n",
"\n",
"| $\\mu$ | $\\sigma^2$ | n |\n",
"|-----:|----:|---------:|\n",
"| -4.0 | 0.5 | $2^{23}$ |\n",
"| 0.0 | 1.0 | $2^{24}$ |\n",
"| 2.0 | 0.5 | $2^{25}$ |\n",
"| 4.0 | 0.2 | $2^{23}$ |\n",
"| -2.0 | 2.0 | $2^{22}$ |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xmin = query_result.get_quantile(0.0005)\n",
"xmax = query_result.get_quantile(0.9995)\n",
"num_splits = 50\n",
"step = (xmax - xmin) / num_splits\n",
"splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
"\n",
"pmf = query_result.get_pmf(splits)\n",
"x = splits.copy()\n",
"x.append(xmax)\n",
"plt.figure(figsize=(12,6))\n",
"plt.title('PMF')\n",
"plt.ylabel('Probability')\n",
"plt.bar(x=x, height=pmf, align='edge', width=-0.15)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}