 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "view-in-github"
    },
    "source": [
     "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/documentation/transforms/python/elementwise/regex-py.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"/></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "view-the-docs-top"
    },
    "source": [
     "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "cellView": "form",
     "id": "_-code"
    },
    "outputs": [],
    "source": [
     "#@title Licensed under the Apache License, Version 2.0 (the \"License\")\n",
     "# Licensed to the Apache Software Foundation (ASF) under one\n",
     "# or more contributor license agreements. See the NOTICE file\n",
     "# distributed with this work for additional information\n",
     "# regarding copyright ownership. The ASF licenses this file\n",
     "# to you under the Apache License, Version 2.0 (the\n",
     "# \"License\"); you may not use this file except in compliance\n",
     "# with the License. You may obtain a copy of the License at\n",
     "#\n",
     "#   http://www.apache.org/licenses/LICENSE-2.0\n",
     "#\n",
     "# Unless required by applicable law or agreed to in writing,\n",
     "# software distributed under the License is distributed on an\n",
     "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
     "# KIND, either express or implied. See the License for the\n",
     "# specific language governing permissions and limitations\n",
     "# under the License."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "regex"
    },
    "source": [
     "# Regex\n",
     "\n",
     "<script type=\"text/javascript\">\n",
     "localStorage.setItem('language', 'language-py')\n",
     "</script>\n",
     "\n",
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>\n",
     "\n",
     "Filters input string elements based on a regex. May also transform them based on the matching groups."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "setup"
    },
    "source": [
     "## Setup\n",
     "\n",
     "To run a code cell, you can click the **Run cell** button at the top left of the cell,\n",
     "or select it and press **`Shift+Enter`**.\n",
     "Try modifying a code cell and re-running it to see what happens.\n",
     "\n",
     "> To learn more about Colab, see\n",
     "> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).\n",
     "\n",
     "First, let's install the `apache-beam` module."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "setup-code"
    },
    "outputs": [],
    "source": [
     "!pip install --quiet -U apache-beam"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "examples"
    },
    "source": [
     "## Examples\n",
     "\n",
     "In the following examples, we create a pipeline with a `PCollection` of text strings.\n",
     "Then, we use the `Regex` transform to search, replace, and split the text elements using\n",
     "[regular expressions](https://docs.python.org/3/library/re.html).\n",
     "\n",
     "You can use tools to help you create and test your regular expressions, such as\n",
     "[regex101](https://regex101.com/).\n",
     "Make sure to specify the Python flavor in the left sidebar.\n",
     "\n",
     "Let's look at the\n",
     "[regular expression `(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)`](https://regex101.com/r/Z7hTTj/3)\n",
     "for example.\n",
     "It matches one or more characters that are neither whitespace `\\s` (`[ \\t\\n\\r\\f\\v]`) nor a comma `,`\n",
     "and stores them in the named group `icon`.\n",
     "This can match even `utf-8` strings.\n",
     "Then it matches a comma and any number of spaces, followed by at least one word character\n",
     "`\\w` (`[a-zA-Z0-9_]`), which is stored in the second group for the *name*.\n",
     "It does the same with the third group for the *duration*.\n",
     "\n",
     "> *Note:* To avoid unexpected string escaping in your regular expressions,\n",
     "> it is recommended to use\n",
     "> [raw strings](https://docs.python.org/3/reference/lexical_analysis.html?highlight=raw#string-and-bytes-literals)\n",
     "> such as `r'raw-string'` instead of `'escaped-string'`."
    ]
   },
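   {
    "cell_type": "markdown",
    "metadata": {
     "id": "examples-regex-groups-sketch"
    },
    "source": [
     "As a rough illustration of the groups described above, the following sketch applies the same\n",
     "regular expression to a single string with Python's built-in `re` module, outside of any pipeline.\n",
     "The sample string is an assumption made for this illustration.\n",
     "The last line shows why raw strings help: `r'\\s'` keeps the backslash without extra escaping."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "examples-regex-groups-sketch-code"
    },
    "outputs": [],
    "source": [
     "import re\n",
     "\n",
     "# Same regular expression as described above, written as a raw string.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "\n",
     "# Hypothetical sample input, for illustration only.\n",
     "match = re.match(regex, '🍓, Strawberry, perennial')\n",
     "\n",
     "print(match.group(0))       # the entire match\n",
     "print(match.group('icon'))  # the named group 'icon'\n",
     "print(match.group(2))       # the second group (name)\n",
     "print(match.group(3))       # the third group (duration)\n",
     "\n",
     "# A raw string keeps the backslash as-is, so both spellings are the same pattern.\n",
     "print(r'\\s' == '\\\\s')"
    ]
   },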
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-1-regex-match"
    },
    "source": [
     "### Example 1: Regex match\n",
     "\n",
     "`Regex.matches` keeps only the elements that match the regular expression,\n",
     "returning the matched group.\n",
     "The argument `group` is set to `0` (the entire match) by default,\n",
     "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
     "\n",
     "`Regex.matches` starts to match the regular expression at the beginning of the string.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "To start matching at any point instead of the beginning of the string, use\n",
     "[`Regex.find(regex)`](#example-4-regex-find)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-1-regex-match-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_matches = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial ignoring trailing words',\n",
     "          '🍆, Eggplant, perennial',\n",
     "          '🍅, Tomato, annual',\n",
     "          '🥔, Potato, perennial',\n",
     "          '# 🍌, invalid, format',\n",
     "          'invalid, 🍉, format',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.matches(regex)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-1-regex-match-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
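   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-1-regex-match-group-sketch"
    },
    "source": [
     "The `group` argument described above can be sketched as follows, assuming the same regular expression\n",
     "and a shortened, illustrative list of inputs.\n",
     "Here `group='icon'` selects the named group instead of the entire match."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-1-regex-match-group-sketch-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_icons = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial ignoring trailing words',\n",
     "          '# 🍌, invalid, format',\n",
     "      ])\n",
     "      # Keep only the named group 'icon' of each matching element.\n",
     "      | 'Parse icons' >> beam.Regex.matches(regex, group='icon')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },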
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-2-regex-match-with-all-groups"
    },
    "source": [
     "### Example 2: Regex match with all groups\n",
     "\n",
     "`Regex.all_matches` keeps only the elements that match the regular expression,\n",
     "returning *all groups* as a list.\n",
     "The groups are returned in the order encountered in the regular expression,\n",
     "including `group 0` (the entire match) as the first group.\n",
     "\n",
     "`Regex.all_matches` starts to match the regular expression at the beginning of the string.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "To start matching at any point instead of the beginning of the string, use\n",
     "[`Regex.find_all(regex, group=Regex.ALL, outputEmpty=False)`](#example-5-regex-find-all)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-2-regex-match-with-all-groups-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_all_matches = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial ignoring trailing words',\n",
     "          '🍆, Eggplant, perennial',\n",
     "          '🍅, Tomato, annual',\n",
     "          '🥔, Potato, perennial',\n",
     "          '# 🍌, invalid, format',\n",
     "          'invalid, 🍉, format',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.all_matches(regex)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-2-regex-match-with-all-groups-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-3-regex-match-into-key-value-pairs"
    },
    "source": [
     "### Example 3: Regex match into key-value pairs\n",
     "\n",
     "`Regex.matches_kv` keeps only the elements that match the regular expression,\n",
     "returning a key-value pair using the specified groups.\n",
     "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n",
     "The argument `valueGroup` is set to `0` (the entire match) by default,\n",
     "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
     "\n",
     "`Regex.matches_kv` starts to match the regular expression at the beginning of the string.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "To start matching at any point instead of the beginning of the string, use\n",
     "[`Regex.find_kv(regex, keyGroup)`](#example-6-regex-find-as-key-value-pairs)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-3-regex-match-into-key-value-pairs-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_matches_kv = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial ignoring trailing words',\n",
     "          '🍆, Eggplant, perennial',\n",
     "          '🍅, Tomato, annual',\n",
     "          '🥔, Potato, perennial',\n",
     "          '# 🍌, invalid, format',\n",
     "          'invalid, 🍉, format',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-3-regex-match-into-key-value-pairs-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
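   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-3-regex-match-into-key-value-pairs-value-group-sketch"
    },
    "source": [
     "As a minimal sketch of the `valueGroup` argument described above, the following variation\n",
     "uses the named group `'icon'` as the key and the second group (the *name*) as the value.\n",
     "The shortened input list is an assumption made for this illustration."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-3-regex-match-into-key-value-pairs-value-group-sketch-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_icon_name = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial ignoring trailing words',\n",
     "          '🍅, Tomato, annual',\n",
     "      ])\n",
     "      # Key: the named group 'icon'. Value: group 2 instead of the entire match.\n",
     "      | 'Icon to name' >> beam.Regex.matches_kv(\n",
     "          regex, keyGroup='icon', valueGroup=2)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },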
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-4-regex-find"
    },
    "source": [
     "### Example 4: Regex find\n",
     "\n",
     "`Regex.find` keeps only the elements that match the regular expression,\n",
     "returning the matched group.\n",
     "The argument `group` is set to `0` (the entire match) by default,\n",
     "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
     "\n",
     "`Regex.find` matches the first occurrence of the regular expression in the string.\n",
     "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "If you need to match from the start only, consider using\n",
     "[`Regex.matches(regex)`](#example-1-regex-match)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-4-regex-find-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_matches = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '# 🍓, Strawberry, perennial',\n",
     "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
     "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
     "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
     "          '# 🥔, Potato, perennial',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.find(regex)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-4-regex-find-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-5-regex-find-all"
    },
    "source": [
     "### Example 5: Regex find all\n",
     "\n",
     "`Regex.find_all` returns a list of all the matches of the regular expression,\n",
     "returning the matched group.\n",
     "The argument `group` is set to `0` by default, but can be set to a group number like `3`, to a named group like `'icon'`, or to `Regex.ALL` to return all groups.\n",
     "The argument `outputEmpty` is set to `True` by default, but can be set to `False` to skip elements where no matches were found.\n",
     "\n",
     "`Regex.find_all` matches the regular expression anywhere it is found in the string.\n",
     "To start matching at the beginning, add `'^'` at the start of the regular expression.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "If you need to match all groups from the start only, consider using\n",
     "[`Regex.all_matches(regex)`](#example-2-regex-match-with-all-groups)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-5-regex-find-all-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_find_all = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '# 🍓, Strawberry, perennial',\n",
     "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
     "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
     "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
     "          '# 🥔, Potato, perennial',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.find_all(regex)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-5-regex-find-all-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
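   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-5-regex-find-all-group-sketch"
    },
    "source": [
     "The `group` argument of `Regex.find_all` can be sketched in the same way.\n",
     "This variation, which assumes a shortened and illustrative input list,\n",
     "asks for the named group `'icon'` of every match instead of the entire match."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-5-regex-find-all-group-sketch-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_find_all_icons = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '# 🍓, Strawberry, perennial',\n",
     "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
     "      ])\n",
     "      # For each element, collect the 'icon' group of every match into a list.\n",
     "      | 'Find icons' >> beam.Regex.find_all(regex, group='icon')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },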
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-6-regex-find-as-key-value-pairs"
    },
    "source": [
     "### Example 6: Regex find as key-value pairs\n",
     "\n",
     "`Regex.find_kv` finds the matches of the regular expression in each element,\n",
     "returning a key-value pair for each match using the specified groups.\n",
     "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n",
     "The argument `valueGroup` is set to `0` (the entire match) by default,\n",
     "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
     "\n",
     "`Regex.find_kv` matches the regular expression anywhere it is found in the string.\n",
     "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n",
     "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
     "\n",
     "If you need to match as key-value pairs from the start only, consider using\n",
     "[`Regex.matches_kv(regex)`](#example-3-regex-match-into-key-value-pairs)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-6-regex-find-as-key-value-pairs-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "# Matches a named group 'icon', and then two comma-separated groups.\n",
     "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_matches_kv = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '# 🍓, Strawberry, perennial',\n",
     "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
     "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
     "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
     "          '# 🥔, Potato, perennial',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-6-regex-find-as-key-value-pairs-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-7-regex-replace-all"
    },
    "source": [
     "### Example 7: Regex replace all\n",
     "\n",
     "`Regex.replace_all` returns the string with all the occurrences of the regular expression replaced by another string.\n",
     "You can also use\n",
     "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n",
     "on the `replacement`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-7-regex-replace-all-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_replace_all = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓 : Strawberry : perennial',\n",
     "          '🥕 : Carrot : biennial',\n",
     "          '🍆\\t:\\tEggplant\\t:\\tperennial',\n",
     "          '🍅 : Tomato : annual',\n",
     "          '🥔 : Potato : perennial',\n",
     "      ])\n",
     "      | 'To CSV' >> beam.Regex.replace_all(r'\\s*:\\s*', ',')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-7-regex-replace-all-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
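   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-7-regex-replace-all-backreference-sketch"
    },
    "source": [
     "As a sketch of the backreferences mentioned above, the following variation captures each field\n",
     "that follows a colon in a group and refers to it as `\\1` in the `replacement`,\n",
     "wrapping the field in square brackets.\n",
     "The regular expression, the replacement, and the shortened input list are assumptions made for this illustration."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-7-regex-replace-all-backreference-sketch-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_bracketed = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓 : Strawberry : perennial',\n",
     "          '🥕 : Carrot : biennial',\n",
     "      ])\n",
     "      # Group 1 captures the word after each colon; '\\1' inserts it back.\n",
     "      | 'Bracket fields' >> beam.Regex.replace_all(r'\\s*:\\s*(\\w+)', r' [\\1]')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },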
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-8-regex-replace-first"
    },
    "source": [
     "### Example 8: Regex replace first\n",
     "\n",
     "`Regex.replace_first` returns the string with the first occurrence of the regular expression replaced by another string.\n",
     "You can also use\n",
     "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n",
     "on the `replacement`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-8-regex-replace-first-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_replace_first = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓, Strawberry, perennial',\n",
     "          '🥕, Carrot, biennial',\n",
     "          '🍆,\\tEggplant, perennial',\n",
     "          '🍅, Tomato, annual',\n",
     "          '🥔, Potato, perennial',\n",
     "      ])\n",
     "      | 'As dictionary' >> beam.Regex.replace_first(r'\\s*,\\s*', ': ')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-8-regex-replace-first-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-9-regex-split"
    },
    "source": [
     "### Example 9: Regex split\n",
     "\n",
     "`Regex.split` returns the list of strings that were delimited by the specified regular expression.\n",
     "The argument `outputEmpty` is set to `False` by default, but can be set to `True` to keep empty items in the output list."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-9-regex-split-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_split = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          '🍓 : Strawberry : perennial',\n",
     "          '🥕 : Carrot : biennial',\n",
     "          '🍆\\t:\\tEggplant : perennial',\n",
     "          '🍅 : Tomato : annual',\n",
     "          '🥔 : Potato : perennial',\n",
     "      ])\n",
     "      | 'Parse plants' >> beam.Regex.split(r'\\s*:\\s*')\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-9-regex-split-2"
    },
    "source": [
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
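   {
    "cell_type": "markdown",
    "metadata": {
     "id": "example-9-regex-split-output-empty-sketch"
    },
    "source": [
     "The `outputEmpty` argument described above can be sketched as follows.\n",
     "The hypothetical input has two consecutive delimiters, so splitting it produces an empty item;\n",
     "with `outputEmpty=True` that empty item is expected to stay in the output list."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "id": "example-9-regex-split-output-empty-sketch-code"
    },
    "outputs": [],
    "source": [
     "import apache_beam as beam\n",
     "\n",
     "with beam.Pipeline() as pipeline:\n",
     "  plants_split_keep_empty = (\n",
     "      pipeline\n",
     "      | 'Garden plants' >> beam.Create([\n",
     "          # Hypothetical input with a missing field between two delimiters.\n",
     "          '🍓 : Strawberry : : perennial',\n",
     "      ])\n",
     "      # Keep empty items instead of dropping them from the list.\n",
     "      | 'Parse plants' >> beam.Regex.split(r'\\s*:\\s*', outputEmpty=True)\n",
     "      | beam.Map(print)\n",
     "  )"
    ]
   },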
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "related-transforms"
    },
    "source": [
     "## Related transforms\n",
     "\n",
     "* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for\n",
     "  each input it may produce zero or more outputs.\n",
     "* [Map](https://beam.apache.org/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection.\n",
     "\n",
     "<table align=\"left\" style=\"margin-right:1em\">\n",
     "  <td>\n",
     "    <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
     "  </td>\n",
     "</table>\n",
     "\n",
     "<br/><br/><br/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "id": "view-the-docs-bottom"
    },
    "source": [
     "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
    ]
   }
  ],
  "metadata": {
   "colab": {
    "name": "Regex - element-wise transform",
    "toc_visible": true
   },
   "kernelspec": {
    "display_name": "python3",
    "name": "python3"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }