| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-in-github" |
| }, |
| "source": [ |
| "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/documentation/transforms/python/elementwise/map-py.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"/></a>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-the-docs-top" |
| }, |
| "source": [ |
| "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/map\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "cellView": "form", |
| "id": "_-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "#@title Licensed under the Apache License, Version 2.0 (the \"License\")\n", |
| "# Licensed to the Apache Software Foundation (ASF) under one\n", |
| "# or more contributor license agreements. See the NOTICE file\n", |
| "# distributed with this work for additional information\n", |
| "# regarding copyright ownership. The ASF licenses this file\n", |
| "# to you under the Apache License, Version 2.0 (the\n", |
| "# \"License\"); you may not use this file except in compliance\n", |
| "# with the License. You may obtain a copy of the License at\n", |
| "#\n", |
| "# http://www.apache.org/licenses/LICENSE-2.0\n", |
| "#\n", |
| "# Unless required by applicable law or agreed to in writing,\n", |
| "# software distributed under the License is distributed on an\n", |
| "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| "# KIND, either express or implied. See the License for the\n", |
| "# specific language governing permissions and limitations\n", |
| "# under the License." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "map" |
| }, |
| "source": [ |
| "# Map\n", |
| "\n", |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.Map\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>\n", |
| "\n", |
| "Applies a simple 1-to-1 mapping function over each element in the collection." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "setup" |
| }, |
| "source": [ |
| "## Setup\n", |
| "\n", |
| "To run a code cell, you can click the **Run cell** button at the top left of the cell,\n", |
| "or select it and press **`Shift+Enter`**.\n", |
| "Try modifying a code cell and re-running it to see what happens.\n", |
| "\n", |
| "> To learn more about Colab, see\n", |
| "> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).\n", |
| "\n", |
| "First, let's install the `apache-beam` module." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "setup-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "!pip install --quiet -U apache-beam" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "examples" |
| }, |
| "source": [ |
| "## Examples\n", |
| "\n", |
| "In the following examples, we create a pipeline with a `PCollection` of produce, each with an icon, name, and duration.\n", |
| "Then, we apply `Map` in multiple ways to transform every element in the `PCollection`.\n", |
| "\n", |
| "`Map` accepts a function that returns a single element for every input element in the `PCollection`." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-1-map-with-a-predefined-function" |
| }, |
| "source": [ |
| "### Example 1: Map with a predefined function\n", |
| "\n", |
| "We use the built-in method `str.strip`, which takes a single `str` element and returns a `str`.\n", |
| "It removes leading and trailing whitespace from the input element, including newlines and tabs." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-1-map-with-a-predefined-function-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          ' 🍓Strawberry \\n',\n", |
| "          ' 🥕Carrot \\n',\n", |
| "          ' 🍆Eggplant \\n',\n", |
| "          ' 🍅Tomato \\n',\n", |
| "          ' 🥔Potato \\n',\n", |
| "      ])\n", |
| "      | 'Strip' >> beam.Map(str.strip)\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-1-map-with-a-predefined-function-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-2-map-with-a-function" |
| }, |
| "source": [ |
| "### Example 2: Map with a function\n", |
| "\n", |
| "We define a function `strip_header_and_newline` that strips any leading or trailing `'#'`, `' '`, and `'\\n'` characters from each element." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-2-map-with-a-function-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "def strip_header_and_newline(text):\n", |
| "  return text.strip('# \\n')\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          '# 🍓Strawberry\\n',\n", |
| "          '# 🥕Carrot\\n',\n", |
| "          '# 🍆Eggplant\\n',\n", |
| "          '# 🍅Tomato\\n',\n", |
| "          '# 🥔Potato\\n',\n", |
| "      ])\n", |
| "      | 'Strip header' >> beam.Map(strip_header_and_newline)\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-2-map-with-a-function-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-3-map-with-a-lambda-function" |
| }, |
| "source": [ |
| "### Example 3: Map with a lambda function\n", |
| "\n", |
| "We can also use lambda functions to simplify **Example 2**." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-3-map-with-a-lambda-function-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          '# 🍓Strawberry\\n',\n", |
| "          '# 🥕Carrot\\n',\n", |
| "          '# 🍆Eggplant\\n',\n", |
| "          '# 🍅Tomato\\n',\n", |
| "          '# 🥔Potato\\n',\n", |
| "      ])\n", |
| "      | 'Strip header' >> beam.Map(lambda text: text.strip('# \\n'))\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-3-map-with-a-lambda-function-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-4-map-with-multiple-arguments" |
| }, |
| "source": [ |
| "### Example 4: Map with multiple arguments\n", |
| "\n", |
| "You can pass functions with multiple arguments to `Map`.\n", |
| "Any extra arguments given to `Map` are forwarded to the function as additional positional or keyword arguments.\n", |
| "\n", |
| "In this example, `strip` takes `text` and `chars` as arguments." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-4-map-with-multiple-arguments-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "def strip(text, chars=None):\n", |
| "  return text.strip(chars)\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          '# 🍓Strawberry\\n',\n", |
| "          '# 🥕Carrot\\n',\n", |
| "          '# 🍆Eggplant\\n',\n", |
| "          '# 🍅Tomato\\n',\n", |
| "          '# 🥔Potato\\n',\n", |
| "      ])\n", |
| "      | 'Strip header' >> beam.Map(strip, chars='# \\n')\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-4-map-with-multiple-arguments-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-5-maptuple-for-key-value-pairs" |
| }, |
| "source": [ |
| "### Example 5: MapTuple for key-value pairs\n", |
| "\n", |
| "If your `PCollection` consists of `(key, value)` pairs,\n", |
| "you can use `MapTuple` to unpack them into different function arguments." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-5-maptuple-for-key-value-pairs-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          ('🍓', 'Strawberry'),\n", |
| "          ('🥕', 'Carrot'),\n", |
| "          ('🍆', 'Eggplant'),\n", |
| "          ('🍅', 'Tomato'),\n", |
| "          ('🥔', 'Potato'),\n", |
| "      ])\n", |
| "      | 'Format' >> beam.MapTuple(\n", |
| "          lambda icon, plant: '{}{}'.format(icon, plant))\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-5-maptuple-for-key-value-pairs-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-6-map-with-side-inputs-as-singletons" |
| }, |
| "source": [ |
| "### Example 6: Map with side inputs as singletons\n", |
| "\n", |
| "If the `PCollection` has a single value, such as the average from another computation,\n", |
| "passing the `PCollection` as a *singleton* accesses that value.\n", |
| "\n", |
| "In this example, we pass a `PCollection` containing the single value `'# \\n'` as a singleton.\n", |
| "We then use that value as the characters to strip with the `str.strip` method." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-6-map-with-side-inputs-as-singletons-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  chars = pipeline | 'Create chars' >> beam.Create(['# \\n'])\n", |
| "\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          '# 🍓Strawberry\\n',\n", |
| "          '# 🥕Carrot\\n',\n", |
| "          '# 🍆Eggplant\\n',\n", |
| "          '# 🍅Tomato\\n',\n", |
| "          '# 🥔Potato\\n',\n", |
| "      ])\n", |
| "      | 'Strip header' >> beam.Map(\n", |
| "          lambda text, chars: text.strip(chars),\n", |
| "          chars=beam.pvalue.AsSingleton(chars),\n", |
| "      )\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-6-map-with-side-inputs-as-singletons-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-7-map-with-side-inputs-as-iterators" |
| }, |
| "source": [ |
| "### Example 7: Map with side inputs as iterators\n", |
| "\n", |
| "If the `PCollection` has multiple values, pass the `PCollection` as an *iterator*.\n", |
| "This accesses elements lazily as they are needed,\n", |
| "so it is possible to iterate over large `PCollection`s that won't fit into memory." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-7-map-with-side-inputs-as-iterators-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  chars = pipeline | 'Create chars' >> beam.Create(['#', ' ', '\\n'])\n", |
| "\n", |
| "  plants = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          '# 🍓Strawberry\\n',\n", |
| "          '# 🥕Carrot\\n',\n", |
| "          '# 🍆Eggplant\\n',\n", |
| "          '# 🍅Tomato\\n',\n", |
| "          '# 🥔Potato\\n',\n", |
| "      ])\n", |
| "      | 'Strip header' >> beam.Map(\n", |
| "          lambda text, chars: text.strip(''.join(chars)),\n", |
| "          chars=beam.pvalue.AsIter(chars),\n", |
| "      )\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-7-map-with-side-inputs-as-iterators-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>\n", |
| "\n", |
| "> **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`,\n", |
| "> but this requires that all the elements fit into memory." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-8-map-with-side-inputs-as-dictionaries" |
| }, |
| "source": [ |
| "### Example 8: Map with side inputs as dictionaries\n", |
| "\n", |
| "If a `PCollection` is small enough to fit into memory, it can be passed as a *dictionary*.\n", |
| "Each element must be a `(key, value)` pair.\n", |
| "If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-8-map-with-side-inputs-as-dictionaries-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "def replace_duration(plant, durations):\n", |
| "  # Copy the element so the input is not mutated in place.\n", |
| "  plant = dict(plant)\n", |
| "  plant['duration'] = durations[plant['duration']]\n", |
| "  return plant\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| "  durations = pipeline | 'Durations' >> beam.Create([\n", |
| "      (0, 'annual'),\n", |
| "      (1, 'biennial'),\n", |
| "      (2, 'perennial'),\n", |
| "  ])\n", |
| "\n", |
| "  plant_details = (\n", |
| "      pipeline\n", |
| "      | 'Gardening plants' >> beam.Create([\n", |
| "          {'icon': '🍓', 'name': 'Strawberry', 'duration': 2},\n", |
| "          {'icon': '🥕', 'name': 'Carrot', 'duration': 1},\n", |
| "          {'icon': '🍆', 'name': 'Eggplant', 'duration': 2},\n", |
| "          {'icon': '🍅', 'name': 'Tomato', 'duration': 0},\n", |
| "          {'icon': '🥔', 'name': 'Potato', 'duration': 2},\n", |
| "      ])\n", |
| "      | 'Replace duration' >> beam.Map(\n", |
| "          replace_duration,\n", |
| "          durations=beam.pvalue.AsDict(durations),\n", |
| "      )\n", |
| "      | beam.Map(print)\n", |
| "  )" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-8-map-with-side-inputs-as-dictionaries-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "related-transforms" |
| }, |
| "source": [ |
| "## Related transforms\n", |
| "\n", |
| "* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for\n", |
| " each input it may produce zero or more outputs.\n", |
| "* [Filter](https://beam.apache.org/documentation/transforms/python/elementwise/filter) is useful if the function is just\n", |
| " deciding whether to output an element or not.\n", |
| "* [ParDo](https://beam.apache.org/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping\n", |
| " operation, and includes other abilities such as multiple output collections and side inputs.\n", |
| "\n", |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.Map\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-the-docs-bottom" |
| }, |
| "source": [ |
| "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/map\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>" |
| ] |
| } |
| ], |
| "metadata": { |
| "colab": { |
| "name": "Map - element-wise transform", |
| "toc_visible": true |
| }, |
| "kernelspec": { |
| "display_name": "python3", |
| "name": "python3" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |