| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Frequent Items Sketch Examples" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Basic Sketch Usage" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a \"top N\" list. We assume readers have already familiarized themselves with the [sketch documentation](https://datasketches.github.io/docs/Frequency/FrequentItemsOverview.html) and are aware of the key concepts around use of this sketch." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "from datasketches import frequent_strings_sketch, frequent_items_error_type" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "k = 3\n", |
| "fi = frequent_strings_sketch(k)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place.\n", |
| "\n", |
| "We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "Update 1: 1 items\n", |
| "Update 2: 2 items\n", |
| "Update 3: 3 items\n", |
| "Update 4: 4 items\n", |
| "Update 5: 5 items\n", |
| "Update 6: 6 items\n", |
| "Update 7: 3 items\n", |
| "Update 8: 4 items\n" |
| ] |
| } |
| ], |
| "source": [ |
| "n = 8\n", |
| "for i in range(0,n):\n", |
| " fi.update(str(i), 2 ** (n-i))\n", |
| " i += 1\n", |
| " print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items')" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives.\n", |
| " - `NO_FALSE_POSITIVES` returns all items with a _lower_ bound above the a posteriori error\n", |
| " - `NO_FALSE_NEGATIVES` returns all items with an _upper_ bound above the a posteriori error\n", |
| "\n", |
| "The latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 5, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "[('0', 256, 224, 256), ('1', 128, 96, 128)]" |
| ] |
| }, |
| "execution_count": 5, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "[('0', 256, 224, 256),\n", |
| " ('1', 128, 96, 128),\n", |
| " ('2', 64, 32, 64),\n", |
| " ('7', 34, 2, 34)]" |
| ] |
| }, |
| "execution_count": 6, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The sketch also allows us to query for individual items directly." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 7, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "256\n", |
| "64\n", |
| "2\n" |
| ] |
| } |
| ], |
| "source": [ |
| "print(fi.get_estimate(\"0\"))\n", |
| "print(fi.get_upper_bound(\"2\"))\n", |
| "print(fi.get_lower_bound(\"7\"))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 8, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "0" |
| ] |
| }, |
| "execution_count": 8, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "fi.get_estimate(\"5\")" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The sketch may also be serialized for archiving, and reconstructed." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "84" |
| ] |
| }, |
| "execution_count": 9, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "sk_bytes = fi.serialize()\n", |
| "len(sk_bytes)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 11, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "### Frequent items sketch summary:\n", |
| " lg cur map size : 3\n", |
| " lg max map size : 3\n", |
| " num active items : 4\n", |
| " total weight : 510\n", |
| " max error : 32\n", |
| "### End sketch summary\n", |
| "\n" |
| ] |
| } |
| ], |
| "source": [ |
| "fi2 = frequent_strings_sketch.deserialize(sk_bytes)\n", |
| "print(fi2)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Merging Example" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Frequent Items sketches support `merge()` to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches.\n", |
| "\n", |
| "We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 12, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "fi2 = frequent_strings_sketch(k)\n", |
| "wt = fi.get_total_weight()\n", |
| "for i in range(0,4*wt):\n", |
| " fi2.update(str(i))\n", |
| "fi.merge(fi2)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 13, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "0" |
| ] |
| }, |
| "execution_count": 13, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "We do, however, see a few potentially heavy items if we request no false negatives." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 14, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "3" |
| ] |
| }, |
| "execution_count": 14, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))" |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.7.0" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |