{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/github/apache/beam/blob/master//Users/dcavazos/src/beam/examples/notebooks/documentation/transforms/python/elementwise/filter-py.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "view-the-docs-top"
},
"source": [
"<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/filter\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "_-code"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\")\n",
"# Licensed to the Apache Software Foundation (ASF) under one\n",
"# or more contributor license agreements. See the NOTICE file\n",
"# distributed with this work for additional information\n",
"# regarding copyright ownership. The ASF licenses this file\n",
"# to you under the Apache License, Version 2.0 (the\n",
"# \"License\"); you may not use this file except in compliance\n",
"# with the License. You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing,\n",
"# software distributed under the License is distributed on an\n",
"# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
"# KIND, either express or implied. See the License for the\n",
"# specific language governing permissions and limitations\n",
"# under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "filter"
},
"source": [
"# Filter\n",
"\n",
"<script type=\"text/javascript\">\n",
"localStorage.setItem('language', 'language-py')\n",
"</script>\n",
"\n",
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.Filter\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>\n",
"\n",
"Given a predicate, filter out all elements that don't satisfy that predicate.\n",
"May also be used to filter based on an inequality with a given value based\n",
"on the comparison ordering of the element."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "setup"
},
"source": [
"## Setup\n",
"\n",
"To run a code cell, you can click the **Run cell** button at the top left of the cell,\n",
"or select it and press **`Shift+Enter`**.\n",
"Try modifying a code cell and re-running it to see what happens.\n",
"\n",
"> To learn more about Colab, see\n",
"> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).\n",
"\n",
"First, let's install the `apache-beam` module."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "setup-code"
},
"outputs": [],
"source": [
"!pip install --quiet -U apache-beam"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "examples"
},
"source": [
"## Examples\n",
"\n",
"In the following examples, we create a pipeline with a `PCollection` of produce with their icon, name, and duration.\n",
"Then, we apply `Filter` in multiple ways to filter out produce by their duration value.\n",
"\n",
"`Filter` accepts a function that keeps elements that return `True`, and filters out the remaining elements."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-1-filtering-with-a-function"
},
"source": [
"### Example 1: Filtering with a function\n",
"\n",
"We define a function `is_perennial` which returns `True` if the element's duration equals `'perennial'`, and `False` otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-1-filtering-with-a-function-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"def is_perennial(plant):\n",
" return plant['duration'] == 'perennial'\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" perennials = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'},\n",
" ])\n",
" | 'Filter perennials' >> beam.Filter(is_perennial)\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-1-filtering-with-a-function-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-2-filtering-with-a-lambda-function"
},
"source": [
"### Example 2: Filtering with a lambda function\n",
"\n",
"We can also use lambda functions to simplify **Example 1**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-2-filtering-with-a-lambda-function-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" perennials = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'},\n",
" ])\n",
" | 'Filter perennials' >> beam.Filter(\n",
" lambda plant: plant['duration'] == 'perennial')\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-2-filtering-with-a-lambda-function-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-3-filtering-with-multiple-arguments"
},
"source": [
"### Example 3: Filtering with multiple arguments\n",
"\n",
"You can pass functions with multiple arguments to `Filter`.\n",
"They are passed as additional positional arguments or keyword arguments to the function.\n",
"\n",
"In this example, `has_duration` takes `plant` and `duration` as arguments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-3-filtering-with-multiple-arguments-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"def has_duration(plant, duration):\n",
" return plant['duration'] == duration\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" perennials = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'},\n",
" ])\n",
" | 'Filter perennials' >> beam.Filter(has_duration, 'perennial')\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-3-filtering-with-multiple-arguments-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-4-filtering-with-side-inputs-as-singletons"
},
"source": [
"### Example 4: Filtering with side inputs as singletons\n",
"\n",
"If the `PCollection` has a single value, such as the average from another computation,\n",
"passing the `PCollection` as a *singleton* accesses that value.\n",
"\n",
"In this example, we pass a `PCollection` the value `'perennial'` as a singleton.\n",
"We then use that value to filter out perennials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-4-filtering-with-side-inputs-as-singletons-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" perennial = pipeline | 'Perennial' >> beam.Create(['perennial'])\n",
"\n",
" perennials = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'},\n",
" ])\n",
" | 'Filter perennials' >> beam.Filter(\n",
" lambda plant, duration: plant['duration'] == duration,\n",
" duration=beam.pvalue.AsSingleton(perennial),\n",
" )\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-4-filtering-with-side-inputs-as-singletons-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-5-filtering-with-side-inputs-as-iterators"
},
"source": [
"### Example 5: Filtering with side inputs as iterators\n",
"\n",
"If the `PCollection` has multiple values, pass the `PCollection` as an *iterator*.\n",
"This accesses elements lazily as they are needed,\n",
"so it is possible to iterate over large `PCollection`s that won't fit into memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-5-filtering-with-side-inputs-as-iterators-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" valid_durations = pipeline | 'Valid durations' >> beam.Create([\n",
" 'annual',\n",
" 'biennial',\n",
" 'perennial',\n",
" ])\n",
"\n",
" valid_plants = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'PERENNIAL'},\n",
" ])\n",
" | 'Filter valid plants' >> beam.Filter(\n",
" lambda plant, valid_durations: plant['duration'] in valid_durations,\n",
" valid_durations=beam.pvalue.AsIter(valid_durations),\n",
" )\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-5-filtering-with-side-inputs-as-iterators-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>\n",
"\n",
"> **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`,\n",
"> but this requires that all the elements fit into memory."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-6-filtering-with-side-inputs-as-dictionaries"
},
"source": [
"### Example 6: Filtering with side inputs as dictionaries\n",
"\n",
"If a `PCollection` is small enough to fit into memory, then that `PCollection` can be passed as a *dictionary*.\n",
"Each element must be a `(key, value)` pair.\n",
"Note that all the elements of the `PCollection` must fit into memory for this.\n",
"If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "example-6-filtering-with-side-inputs-as-dictionaries-code"
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"\n",
"with beam.Pipeline() as pipeline:\n",
" keep_duration = pipeline | 'Duration filters' >> beam.Create([\n",
" ('annual', False),\n",
" ('biennial', False),\n",
" ('perennial', True),\n",
" ])\n",
"\n",
" perennials = (\n",
" pipeline\n",
" | 'Gardening plants' >> beam.Create([\n",
" {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},\n",
" {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},\n",
" {'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'},\n",
" {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},\n",
" {'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'},\n",
" ])\n",
" | 'Filter plants by duration' >> beam.Filter(\n",
" lambda plant, keep_duration: keep_duration[plant['duration']],\n",
" keep_duration=beam.pvalue.AsDict(keep_duration),\n",
" )\n",
" | beam.Map(print)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "example-6-filtering-with-side-inputs-as-dictionaries-2"
},
"source": [
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "related-transforms"
},
"source": [
"## Related transforms\n",
"\n",
"* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for\n",
" each input it might produce zero or more outputs.\n",
"* [ParDo](https://beam.apache.org/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping\n",
" operation, and includes other abilities such as multiple output collections and side-inputs.\n",
"\n",
"<table align=\"left\" style=\"margin-right:1em\">\n",
" <td>\n",
" <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.Filter\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
" </td>\n",
"</table>\n",
"\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "view-the-docs-bottom"
},
"source": [
"<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/filter\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
]
}
],
"metadata": {
"colab": {
"name": "Filter - element-wise transform",
"toc_visible": true
},
"kernelspec": {
"display_name": "python3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}