python/jupyter/FrequentItemsSketch.ipynb - datasketches-cpp - Git at Google

 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Frequent Items Sketch Examples"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Basic Sketch Usage"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a \"top N\" list. We assume readers have already familiarized themselves with the [sketch documentation](https://datasketches.github.io/docs/Frequency/FrequentItemsOverview.html) and are aware of the key concepts around use of this sketch."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
    "metadata": {},
    "outputs": [],
    "source": [
     "from datasketches import frequent_strings_sketch, frequent_items_error_type"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
     "k = 3\n",
     "fi = frequent_strings_sketch(k)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place.\n",
     "\n",
     "We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Update 1: 1 items\n",
       "Update 2: 2 items\n",
       "Update 3: 3 items\n",
       "Update 4: 4 items\n",
       "Update 5: 5 items\n",
       "Update 6: 6 items\n",
       "Update 7: 3 items\n",
       "Update 8: 4 items\n"
      ]
     }
    ],
    "source": [
     "n = 8\n",
     "for i in range(0,n):\n",
     "    fi.update(str(i), 2 ** (n-i))\n",
     "    i += 1\n",
     "    print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives.\n",
     "  - `NO_FALSE_POSITIVES` returns all items with a _lower_ bound above the a posteriori error\n",
     "  - `NO_FALSE_NEGATIVES` returns all items with an _upper_ bound above the a posteriori error\n",
     "\n",
     "The latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[('0', 256, 224, 256), ('1', 128, 96, 128)]"
       ]
      },
      "execution_count": 5,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[('0', 256, 224, 256),\n",
        " ('1', 128, 96, 128),\n",
        " ('2', 64, 32, 64),\n",
        " ('7', 34, 2, 34)]"
       ]
      },
      "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The sketch also allows us to query for individual items directly."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "256\n",
       "64\n",
       "2\n"
      ]
     }
    ],
    "source": [
     "print(fi.get_estimate(\"0\"))\n",
     "print(fi.get_upper_bound(\"2\"))\n",
     "print(fi.get_lower_bound(\"7\"))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "0"
       ]
      },
      "execution_count": 8,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "fi.get_estimate(\"5\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The sketch may also be serialized for archiving, and reconstructed."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "84"
       ]
      },
      "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "sk_bytes = fi.serialize()\n",
     "len(sk_bytes)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 11,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "### Frequent items sketch summary:\n",
       "   lg cur map size  : 3\n",
       "   lg max map size  : 3\n",
       "   num active items : 4\n",
       "   total weight     : 510\n",
       "   max error        : 32\n",
       "### End sketch summary\n",
       "\n"
      ]
     }
    ],
    "source": [
     "fi2 = frequent_strings_sketch.deserialize(sk_bytes)\n",
     "print(fi2)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Merging Example"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Frequent Items sketches support `merge()` to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches.\n",
     "\n",
     "We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 12,
    "metadata": {},
    "outputs": [],
    "source": [
     "fi2 = frequent_strings_sketch(k)\n",
     "wt = fi.get_total_weight()\n",
     "for i in range(0,4*wt):\n",
     "    fi2.update(str(i))\n",
     "fi.merge(fi2)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 13,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "0"
       ]
      },
      "execution_count": 13,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We do, however, see a few potentially heavy items if we request no false negatives."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "3"
       ]
      },
      "execution_count": 14,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.7.0"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Frequent Items Sketch Examples"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Basic Sketch Usage"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a \"top N\" list. We assume readers have already familiarized themselves with the [sketch documentation](https://datasketches.github.io/docs/Frequency/FrequentItemsOverview.html) and are aware of the key concepts around use of this sketch."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"from datasketches import frequent_strings_sketch, frequent_items_error_type"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"k = 3\n",
	"fi = frequent_strings_sketch(k)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place.\n",
	"\n",
	"We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Update 1: 1 items\n",
	"Update 2: 2 items\n",
	"Update 3: 3 items\n",
	"Update 4: 4 items\n",
	"Update 5: 5 items\n",
	"Update 6: 6 items\n",
	"Update 7: 3 items\n",
	"Update 8: 4 items\n"
	]
	}
	],
	"source": [
	"n = 8\n",
	"for i in range(0,n):\n",
	" fi.update(str(i), 2 ** (n-i))\n",
	" i += 1\n",
	" print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives.\n",
	" - `NO_FALSE_POSITIVES` returns all items with a _lower_ bound above the a posteriori error\n",
	" - `NO_FALSE_NEGATIVES` returns all items with an _upper_ bound above the a posteriori error\n",
	"\n",
	"The latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"[('0', 256, 224, 256), ('1', 128, 96, 128)]"
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"[('0', 256, 224, 256),\n",
	" ('1', 128, 96, 128),\n",
	" ('2', 64, 32, 64),\n",
	" ('7', 34, 2, 34)]"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The sketch also allows us to query for individual items directly."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"256\n",
	"64\n",
	"2\n"
	]
	}
	],
	"source": [
	"print(fi.get_estimate(\"0\"))\n",
	"print(fi.get_upper_bound(\"2\"))\n",
	"print(fi.get_lower_bound(\"7\"))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"fi.get_estimate(\"5\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The sketch may also be serialized for archiving, and reconstructed."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"84"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"sk_bytes = fi.serialize()\n",
	"len(sk_bytes)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"### Frequent items sketch summary:\n",
	" lg cur map size : 3\n",
	" lg max map size : 3\n",
	" num active items : 4\n",
	" total weight : 510\n",
	" max error : 32\n",
	"### End sketch summary\n",
	"\n"
	]
	}
	],
	"source": [
	"fi2 = frequent_strings_sketch.deserialize(sk_bytes)\n",
	"print(fi2)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Merging Example"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Frequent Items sketches support `merge()` to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches.\n",
	"\n",
	"We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [],
	"source": [
	"fi2 = frequent_strings_sketch(k)\n",
	"wt = fi.get_total_weight()\n",
	"for i in range(0,4*wt):\n",
	" fi2.update(str(i))\n",
	"fi.merge(fi2)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0"
	]
	},
	"execution_count": 13,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We do, however, see a few potentially heavy items if we request no false negatives."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"3"
	]
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.0"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}