| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-in-github" |
| }, |
| "source": [ |
| "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/documentation/transforms/python/elementwise/regex-py.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"/></a>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-the-docs-top" |
| }, |
| "source": [ |
| "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "cellView": "form", |
| "id": "_-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "#@title Licensed under the Apache License, Version 2.0 (the \"License\")\n", |
| "# Licensed to the Apache Software Foundation (ASF) under one\n", |
| "# or more contributor license agreements. See the NOTICE file\n", |
| "# distributed with this work for additional information\n", |
| "# regarding copyright ownership. The ASF licenses this file\n", |
| "# to you under the Apache License, Version 2.0 (the\n", |
| "# \"License\"); you may not use this file except in compliance\n", |
| "# with the License. You may obtain a copy of the License at\n", |
| "#\n", |
| "# http://www.apache.org/licenses/LICENSE-2.0\n", |
| "#\n", |
| "# Unless required by applicable law or agreed to in writing,\n", |
| "# software distributed under the License is distributed on an\n", |
| "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| "# KIND, either express or implied. See the License for the\n", |
| "# specific language governing permissions and limitations\n", |
| "# under the License." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "regex" |
| }, |
| "source": [ |
| "# Regex\n", |
| "\n", |
| "<script type=\"text/javascript\">\n", |
| "localStorage.setItem('language', 'language-py')\n", |
| "</script>\n", |
| "\n", |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>\n", |
| "\n", |
| "Filters input string elements based on a regex. May also transform them based on the matching groups." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "setup" |
| }, |
| "source": [ |
| "## Setup\n", |
| "\n", |
| "To run a code cell, you can click the **Run cell** button at the top left of the cell,\n", |
| "or select it and press **`Shift+Enter`**.\n", |
| "Try modifying a code cell and re-running it to see what happens.\n", |
| "\n", |
| "> To learn more about Colab, see\n", |
| "> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).\n", |
| "\n", |
| "First, let's install the `apache-beam` module." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "setup-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "!pip install --quiet -U apache-beam" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "examples" |
| }, |
| "source": [ |
| "## Examples\n", |
| "\n", |
| "In the following examples, we create a pipeline with a `PCollection` of text strings.\n", |
| "Then, we use the `Regex` transform to search, replace, and split through the text elements using\n", |
| "[regular expressions](https://docs.python.org/3/library/re.html).\n", |
| "\n", |
| "You can use tools to help you create and test your regular expressions, such as\n", |
| "[regex101](https://regex101.com/).\n", |
| "Make sure to specify the Python flavor at the left side bar.\n", |
| "\n", |
| "Lets look at the\n", |
| "[regular expression `(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)`](https://regex101.com/r/Z7hTTj/3)\n", |
| "for example.\n", |
| "It matches anything that is not a whitespace `\\s` (`[ \\t\\n\\r\\f\\v]`) or comma `,`\n", |
| "until a comma is found and stores that in the named group `icon`,\n", |
| "this can match even `utf-8` strings.\n", |
| "Then it matches any number of whitespaces, followed by at least one word character\n", |
| "`\\w` (`[a-zA-Z0-9_]`), which is stored in the second group for the *name*.\n", |
| "It does the same with the third group for the *duration*.\n", |
| "\n", |
| "> *Note:* To avoid unexpected string escaping in your regular expressions,\n", |
| "> it is recommended to use\n", |
| "> [raw strings](https://docs.python.org/3/reference/lexical_analysis.html?highlight=raw#string-and-bytes-literals)\n", |
| "> such as `r'raw-string'` instead of `'escaped-string'`." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-1-regex-match" |
| }, |
| "source": [ |
| "### Example 1: Regex match\n", |
| "\n", |
| "`Regex.matches` keeps only the elements that match the regular expression,\n", |
| "returning the matched group.\n", |
| "The argument `group` is set to `0` (the entire match) by default,\n", |
| "but can be set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "\n", |
| "`Regex.matches` starts to match the regular expression at the beginning of the string.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "To start matching at any point instead of the beginning of the string, use\n", |
| "[`Regex.find(regex)`](#example-4-regex-find)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-1-regex-match-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_matches = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓, Strawberry, perennial',\n", |
| " '🥕, Carrot, biennial ignoring trailing words',\n", |
| " '🍆, Eggplant, perennial',\n", |
| " '🍅, Tomato, annual',\n", |
| " '🥔, Potato, perennial',\n", |
| " '# 🍌, invalid, format',\n", |
| " 'invalid, 🍉, format',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.matches(regex)\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-1-regex-match-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-2-regex-match-with-all-groups" |
| }, |
| "source": [ |
| "### Example 2: Regex match with all groups\n", |
| "\n", |
| "`Regex.all_matches` keeps only the elements that match the regular expression,\n", |
| "returning *all groups* as a list.\n", |
| "The groups are returned in the order encountered in the regular expression,\n", |
| "including `group 0` (the entire match) as the first group.\n", |
| "\n", |
| "`Regex.all_matches` starts to match the regular expression at the beginning of the string.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "To start matching at any point instead of the beginning of the string, use\n", |
| "[`Regex.find_all(regex, group=Regex.ALL, outputEmpty=False)`](#example-5-regex-find-all)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-2-regex-match-with-all-groups-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_all_matches = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓, Strawberry, perennial',\n", |
| " '🥕, Carrot, biennial ignoring trailing words',\n", |
| " '🍆, Eggplant, perennial',\n", |
| " '🍅, Tomato, annual',\n", |
| " '🥔, Potato, perennial',\n", |
| " '# 🍌, invalid, format',\n", |
| " 'invalid, 🍉, format',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.all_matches(regex)\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-2-regex-match-with-all-groups-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-3-regex-match-into-key-value-pairs" |
| }, |
| "source": [ |
| "### Example 3: Regex match into key-value pairs\n", |
| "\n", |
| "`Regex.matches_kv` keeps only the elements that match the regular expression,\n", |
| "returning a key-value pair using the specified groups.\n", |
| "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "The argument `valueGroup` is set to `0` (the entire match) by default,\n", |
| "but can be set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "\n", |
| "`Regex.matches_kv` starts to match the regular expression at the beginning of the string.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "To start matching at any point instead of the beginning of the string, use\n", |
| "[`Regex.find_kv(regex, keyGroup)`](#example-6-regex-find-as-key-value-pairs)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-3-regex-match-into-key-value-pairs-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_matches_kv = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓, Strawberry, perennial',\n", |
| " '🥕, Carrot, biennial ignoring trailing words',\n", |
| " '🍆, Eggplant, perennial',\n", |
| " '🍅, Tomato, annual',\n", |
| " '🥔, Potato, perennial',\n", |
| " '# 🍌, invalid, format',\n", |
| " 'invalid, 🍉, format',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-3-regex-match-into-key-value-pairs-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-4-regex-find" |
| }, |
| "source": [ |
| "### Example 4: Regex find\n", |
| "\n", |
| "`Regex.find` keeps only the elements that match the regular expression,\n", |
| "returning the matched group.\n", |
| "The argument `group` is set to `0` (the entire match) by default,\n", |
| "but can be set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "\n", |
| "`Regex.find` matches the first occurrence of the regular expression in the string.\n", |
| "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "If you need to match from the start only, consider using\n", |
| "[`Regex.matches(regex)`](#example-1-regex-match)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-4-regex-find-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_matches = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '# 🍓, Strawberry, perennial',\n", |
| " '# 🥕, Carrot, biennial ignoring trailing words',\n", |
| " '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n", |
| " '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n", |
| " '# 🥔, Potato, perennial',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.find(regex)\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-4-regex-find-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-5-regex-find-all" |
| }, |
| "source": [ |
| "### Example 5: Regex find all\n", |
| "\n", |
| "`Regex.find_all` returns a list of all the matches of the regular expression,\n", |
| "returning the matched group.\n", |
| "The argument `group` is set to `0` by default, but can be set to a group number like `3`, to a named group like `'icon'`, or to `Regex.ALL` to return all groups.\n", |
| "The argument `outputEmpty` is set to `True` by default, but can be set to `False` to skip elements where no matches were found.\n", |
| "\n", |
| "`Regex.find_all` matches the regular expression anywhere it is found in the string.\n", |
| "To start matching at the beginning, add `'^'` at the start of the regular expression.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "If you need to match all groups from the start only, consider using\n", |
| "[`Regex.all_matches(regex)`](#example-2-regex-match-with-all-groups)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-5-regex-find-all-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_find_all = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '# 🍓, Strawberry, perennial',\n", |
| " '# 🥕, Carrot, biennial ignoring trailing words',\n", |
| " '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n", |
| " '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n", |
| " '# 🥔, Potato, perennial',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.find_all(regex)\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-5-regex-find-all-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-6-regex-find-as-key-value-pairs" |
| }, |
| "source": [ |
| "### Example 6: Regex find as key-value pairs\n", |
| "\n", |
| "`Regex.find_kv` returns a list of all the matches of the regular expression,\n", |
| "returning a key-value pair using the specified groups.\n", |
| "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "The argument `valueGroup` is set to `0` (the entire match) by default,\n", |
| "but can be set to a group number like `3`, or to a named group like `'icon'`.\n", |
| "\n", |
| "`Regex.find_kv` matches the first occurrence of the regular expression in the string.\n", |
| "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n", |
| "To match until the end of the string, add `'$'` at the end of the regular expression.\n", |
| "\n", |
| "If you need to match as key-value pairs from the start only, consider using\n", |
| "[`Regex.matches_kv(regex)`](#example-3-regex-match-into-key-value-pairs)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-6-regex-find-as-key-value-pairs-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "# Matches a named group 'icon', and then two comma-separated groups.\n", |
| "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_matches_kv = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '# 🍓, Strawberry, perennial',\n", |
| " '# 🥕, Carrot, biennial ignoring trailing words',\n", |
| " '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n", |
| " '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n", |
| " '# 🥔, Potato, perennial',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-6-regex-find-as-key-value-pairs-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-7-regex-replace-all" |
| }, |
| "source": [ |
| "### Example 7: Regex replace all\n", |
| "\n", |
| "`Regex.replace_all` returns the string with all the occurrences of the regular expression replaced by another string.\n", |
| "You can also use\n", |
| "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n", |
| "on the `replacement`." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-7-regex-replace-all-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_replace_all = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓 : Strawberry : perennial',\n", |
| " '🥕 : Carrot : biennial',\n", |
| " '🍆\\t:\\tEggplant\\t:\\tperennial',\n", |
| " '🍅 : Tomato : annual',\n", |
| " '🥔 : Potato : perennial',\n", |
| " ])\n", |
| " | 'To CSV' >> beam.Regex.replace_all(r'\\s*:\\s*', ',')\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-7-regex-replace-all-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-8-regex-replace-first" |
| }, |
| "source": [ |
| "### Example 8: Regex replace first\n", |
| "\n", |
| "`Regex.replace_first` returns the string with the first occurrence of the regular expression replaced by another string.\n", |
| "You can also use\n", |
| "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n", |
| "on the `replacement`." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-8-regex-replace-first-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_replace_first = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓, Strawberry, perennial',\n", |
| " '🥕, Carrot, biennial',\n", |
| " '🍆,\\tEggplant, perennial',\n", |
| " '🍅, Tomato, annual',\n", |
| " '🥔, Potato, perennial',\n", |
| " ])\n", |
| " | 'As dictionary' >> beam.Regex.replace_first(r'\\s*,\\s*', ': ')\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-8-regex-replace-first-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-9-regex-split" |
| }, |
| "source": [ |
| "### Example 9: Regex split\n", |
| "\n", |
| "`Regex.split` returns the list of strings that were delimited by the specified regular expression.\n", |
| "The argument `outputEmpty` is set to `False` by default, but can be set to `True` to keep empty items in the output list." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "id": "example-9-regex-split-code" |
| }, |
| "outputs": [], |
| "source": [ |
| "import apache_beam as beam\n", |
| "\n", |
| "with beam.Pipeline() as pipeline:\n", |
| " plants_split = (\n", |
| " pipeline\n", |
| " | 'Garden plants' >> beam.Create([\n", |
| " '🍓 : Strawberry : perennial',\n", |
| " '🥕 : Carrot : biennial',\n", |
| " '🍆\\t:\\tEggplant : perennial',\n", |
| " '🍅 : Tomato : annual',\n", |
| " '🥔 : Potato : perennial',\n", |
| " ])\n", |
| " | 'Parse plants' >> beam.Regex.split(r'\\s*:\\s*')\n", |
| " | beam.Map(print))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "example-9-regex-split-2" |
| }, |
| "source": [ |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "related-transforms" |
| }, |
| "source": [ |
| "## Related transforms\n", |
| "\n", |
| "* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for\n", |
| " each input it may produce zero or more outputs.\n", |
| "* [Map](https://beam.apache.org/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection\n", |
| "\n", |
| "<table align=\"left\" style=\"margin-right:1em\">\n", |
| " <td>\n", |
| " <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n", |
| " </td>\n", |
| "</table>\n", |
| "\n", |
| "<br/><br/><br/></icon>" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "view-the-docs-bottom" |
| }, |
| "source": [ |
| "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>" |
| ] |
| } |
| ], |
| "metadata": { |
| "colab": { |
| "name": "Regex - element-wise transform", |
| "toc_visible": true |
| }, |
| "kernelspec": { |
| "display_name": "python3", |
| "name": "python3" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 4 |
| } |