blob: 40d648ab65fb4b8d0b72c60dd5985413597ffbcf [file] [log] [blame]
{
"license": [
"Licensed to the Apache Software Foundation (ASF) under one",
"or more contributor license agreements. See the NOTICE file",
"distributed with this work for additional information",
"regarding copyright ownership. The ASF licenses this file",
"to you under the Apache License, Version 2.0 (the",
"\"License\"); you may not use this file except in compliance",
"with the License. You may obtain a copy of the License at",
"",
" http://www.apache.org/licenses/LICENSE-2.0",
"",
"Unless required by applicable law or agreed to in writing,",
"software distributed under the License is distributed on an",
"\"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY",
"KIND, either express or implied. See the License for the",
"specific language governing permissions and limitations",
"under the License."
],
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Try Apache Beam - Java",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-java.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"metadata": {
"id": "lNKIMlEDZ_Vw",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Try Apache Beam - Java\n",
"\n",
"In this notebook, we set up a Java development environment and work through a simple example using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capatibility Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
"\n",
"To navigate through different sections, use the table of contents. From **View** drop-down list, select **Table of contents**.\n",
"\n",
"To run a code cell, you can click the **Run cell** button at the top left of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.\n",
"\n",
"To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
]
},
{
"metadata": {
"id": "Fz6KSQ13_3Rr",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Setup\n",
"\n",
"First, you need to set up your environment."
]
},
{
"metadata": {
"id": "GOOk81Jj_yUy",
"colab_type": "code",
"outputId": "68240031-2990-41fa-a327-38e15dc9fdf9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
}
},
"cell_type": "code",
"source": [
"# Run and print a shell command.\n",
"def run(cmd):\n",
" print('>> {}'.format(cmd))\n",
" !{cmd} # This is magic to run 'cmd' in the shell.\n",
" print('')\n",
"\n",
"# Copy the input file into the local filesystem.\n",
"run('mkdir -p data')\n",
"run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/')"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
">> mkdir -p data\n",
"\n",
">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/\n",
"Copying gs://dataflow-samples/shakespeare/kinglear.txt...\n",
"/ [1 files][153.6 KiB/153.6 KiB] \n",
"Operation completed over 1 objects/153.6 KiB. \n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "Hmto8JTSWwUK",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Installing development tools\n",
"\n",
"Let's start by installing Java. We'll use the `default-jdk`, which uses [OpenJDK](https://openjdk.java.net/). This will take a while, so feel free to go for a walk or do some stretching.\n",
"\n",
"**Note:** Alternatively, you could install the propietary [Oracle JDK](https://www.oracle.com/technetwork/java/javase/downloads/index.html) instead."
]
},
{
"metadata": {
"id": "ONYtX0doWpFz",
"colab_type": "code",
"outputId": "04bfa861-0bf8-4352-e878-0f24c6c7b61e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
}
},
"cell_type": "code",
"source": [
"# Update and upgrade the system before installing anything else.\n",
"run('apt-get update > /dev/null')\n",
"run('apt-get upgrade > /dev/null')\n",
"\n",
"# Install the Java JDK.\n",
"run('apt-get install default-jdk > /dev/null')\n",
"\n",
"# Check the Java version to see if everything is working well.\n",
"run('javac -version')"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
">> apt-get update > /dev/null\n",
"\n",
">> apt-get upgrade > /dev/null\n",
"Extracting templates from packages: 100%\n",
"\n",
">> apt-get install default-jdk > /dev/null\n",
"\n",
">> javac -version\n",
"javac 10.0.2\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "Wab7H4IZW9xZ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Now, let's install [Gradle](https://gradle.org/), which we'll need to automate the build and running processes for our application. \n",
"\n",
"**Note:** Alternatively, you could install and configure [Maven](https://maven.apache.org/) instead."
]
},
{
"metadata": {
"id": "xS3Oeu3DW7vy",
"colab_type": "code",
"outputId": "1b2c1f11-5e35-4d22-8002-814ea61224c9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 595
}
},
"cell_type": "code",
"source": [
"import os\n",
"\n",
"# Download the gradle source.\n",
"gradle_version = 'gradle-5.0'\n",
"gradle_path = f\"/opt/{gradle_version}\"\n",
"if not os.path.exists(gradle_path):\n",
" run(f\"wget -q -nc -O gradle.zip https://services.gradle.org/distributions/{gradle_version}-bin.zip\")\n",
" run('unzip -q -d /opt gradle.zip')\n",
" run('rm -f gradle.zip')\n",
"\n",
"# We're choosing to use the absolute path instead of adding it to the $PATH environment variable.\n",
"def gradle(args):\n",
" run(f\"{gradle_path}/bin/gradle --console=plain {args}\")\n",
"\n",
"gradle('-v')"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
">> wget -q -nc -O gradle.zip https://services.gradle.org/distributions/gradle-5.0-bin.zip\n",
"\n",
">> unzip -q -d /opt gradle.zip\n",
"\n",
">> rm -f gradle.zip\n",
"\n",
">> /opt/gradle-5.0/bin/gradle --console=plain -v\n",
"\u001b[m\n",
"Welcome to Gradle 5.0!\n",
"\n",
"Here are the highlights of this release:\n",
" - Kotlin DSL 1.0\n",
" - Task timeouts\n",
" - Dependency alignment aka BOM support\n",
" - Interactive `gradle init`\n",
"\n",
"For more details see https://docs.gradle.org/5.0/release-notes.html\n",
"\n",
"\n",
"------------------------------------------------------------\n",
"Gradle 5.0\n",
"------------------------------------------------------------\n",
"\n",
"Build time: 2018-11-26 11:48:43 UTC\n",
"Revision: 7fc6e5abf2fc5fe0824aec8a0f5462664dbcd987\n",
"\n",
"Kotlin DSL: 1.0.4\n",
"Kotlin: 1.3.10\n",
"Groovy: 2.5.4\n",
"Ant: Apache Ant(TM) version 1.9.13 compiled on July 10 2018\n",
"JVM: 10.0.2 (Oracle Corporation 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)\n",
"OS: Linux 4.14.79+ amd64\n",
"\n",
"\u001b[m\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "YTkkapX9KVhA",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## build.gradle\n",
"\n",
"We'll also need a [`build.gradle`](https://guides.gradle.org/creating-new-gradle-builds/) file which will allow us to invoke some useful commands."
]
},
{
"metadata": {
"id": "oUqfqWyMuIfR",
"colab_type": "code",
"outputId": "292a06b2-ce06-46b6-8598-480d83974bbb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"%%writefile build.gradle\n",
"\n",
"plugins {\n",
" // id 'idea' // Uncomment for IntelliJ IDE\n",
" // id 'eclipse' // Uncomment for Eclipse IDE\n",
"\n",
" // Apply java plugin and make it a runnable application.\n",
" id 'java'\n",
" id 'application'\n",
"\n",
" // 'shadow' allows us to embed all the dependencies into a fat jar.\n",
" id 'com.github.johnrengelman.shadow' version '4.0.3'\n",
"}\n",
"\n",
"// This is the path of the main class, stored within ./src/main/java/\n",
"mainClassName = 'samples.quickstart.WordCount'\n",
"\n",
"// Declare the sources from which to fetch dependencies.\n",
"repositories {\n",
" mavenCentral()\n",
"}\n",
"\n",
"// Java version compatibility.\n",
"sourceCompatibility = 1.8\n",
"targetCompatibility = 1.8\n",
"\n",
"// Use the latest Apache Beam major version 2.\n",
"// You can also lock into a minor version like '2.9.+'.\n",
"ext.apacheBeamVersion = '2.+'\n",
"\n",
"// Declare the dependencies of the project.\n",
"dependencies {\n",
" shadow \"org.apache.beam:beam-sdks-java-core:$apacheBeamVersion\"\n",
"\n",
" runtime \"org.apache.beam:beam-runners-direct-java:$apacheBeamVersion\"\n",
" runtime \"org.slf4j:slf4j-api:1.+\"\n",
" runtime \"org.slf4j:slf4j-jdk14:1.+\"\n",
"\n",
" testCompile \"junit:junit:4.+\"\n",
"}\n",
"\n",
"// Configure 'shadowJar' instead of 'jar' to set up the fat jar.\n",
"shadowJar {\n",
" baseName = 'WordCount' // Name of the fat jar file.\n",
" classifier = null // Set to null, otherwise 'shadow' appends a '-all' to the jar file name.\n",
" manifest {\n",
" attributes('Main-Class': mainClassName) // Specify where the main class resides.\n",
" }\n",
"}"
],
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": [
"Writing build.gradle\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "cwZcqmFgoLJ9",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Creating the directory structure\n",
"\n",
"Java and Gradle expect a specific [directory structure](https://docs.gradle.org/current/userguide/organizing_gradle_projects.html). This helps organize large projects into a standard structure.\n",
"\n",
"For now, we only need a place where our quickstart code will reside. That has to go within `./src/main/java/`."
]
},
{
"metadata": {
"id": "Mr1KTQznbd9F",
"colab_type": "code",
"outputId": "2e4635b9-0577-4399-b8d6-078183ff9da2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
}
},
"cell_type": "code",
"source": [
"run('mkdir -p src/main/java/samples/quickstart')"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
">> mkdir -p src/main/java/samples/quickstart\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "cPvvFB19uXNw",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Minimal word count\n",
"\n",
"The following example is the \"Hello, World!\" of data processing, a basic implementation of word count. We're creating a simple data processing pipeline that reads a text file and counts the number of occurrences of every word.\n",
"\n",
"There are many scenarios where all the data does not fit in memory. Notice that the outputs of the pipeline go to the file system, which allows for large processing jobs in distributed environments."
]
},
{
"metadata": {
"id": "Fl3iUat7KYIE",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## WordCount.java"
]
},
{
"metadata": {
"id": "5l3S2mjMBKhT",
"colab_type": "code",
"outputId": "6e55ec70-e727-44c9-a425-4afed97188fe",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"%%writefile src/main/java/samples/quickstart/WordCount.java\n",
"\n",
"package samples.quickstart;\n",
"\n",
"import org.apache.beam.sdk.Pipeline;\n",
"import org.apache.beam.sdk.io.TextIO;\n",
"import org.apache.beam.sdk.options.PipelineOptions;\n",
"import org.apache.beam.sdk.options.PipelineOptionsFactory;\n",
"import org.apache.beam.sdk.transforms.Count;\n",
"import org.apache.beam.sdk.transforms.Filter;\n",
"import org.apache.beam.sdk.transforms.FlatMapElements;\n",
"import org.apache.beam.sdk.transforms.MapElements;\n",
"import org.apache.beam.sdk.values.KV;\n",
"import org.apache.beam.sdk.values.TypeDescriptors;\n",
"\n",
"import java.util.Arrays;\n",
"\n",
"public class WordCount {\n",
" public static void main(String[] args) {\n",
" String inputsDir = \"data/*\";\n",
" String outputsPrefix = \"outputs/part\";\n",
"\n",
" PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();\n",
" Pipeline pipeline = Pipeline.create(options);\n",
" pipeline\n",
" .apply(\"Read lines\", TextIO.read().from(inputsDir))\n",
" .apply(\"Find words\", FlatMapElements.into(TypeDescriptors.strings())\n",
" .via((String line) -> Arrays.asList(line.split(\"[^\\\\p{L}]+\"))))\n",
" .apply(\"Filter empty words\", Filter.by((String word) -> !word.isEmpty()))\n",
" .apply(\"Count words\", Count.perElement())\n",
" .apply(\"Write results\", MapElements.into(TypeDescriptors.strings())\n",
" .via((KV<String, Long> wordCount) ->\n",
" wordCount.getKey() + \": \" + wordCount.getValue()))\n",
" .apply(TextIO.write().to(outputsPrefix));\n",
" pipeline.run();\n",
" }\n",
"}"
],
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": [
"Writing src/main/java/samples/quickstart/WordCount.java\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "yoO4xHnaKiz9",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Build and run"
]
},
{
"metadata": {
"id": "giJMbbcq2OPu",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's first check how the final file system structure looks like. These are all the files required to build and run our application.\n",
"\n",
"* `build.gradle` - build configuration for Gradle\n",
"* `src/main/java/samples/quickstart/WordCount.java` - application source code\n",
"* `data/kinglear.txt` - input data, this could be any file or files\n",
"\n",
"We are now ready to build the application using `gradle build`."
]
},
{
"metadata": {
"id": "urmCmtG08F-0",
"colab_type": "code",
"outputId": "a2b65437-4244-4844-82d2-1789d5cfd7ca",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 510
}
},
"cell_type": "code",
"source": [
"# Build the project.\n",
"gradle('build')\n",
"\n",
"# Check the generated build files.\n",
"run('ls -lh build/libs/')"
],
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": [
">> /opt/gradle-5.0/bin/gradle --console=plain build\n",
"\u001b[mStarting a Gradle Daemon (subsequent builds will be faster)\n",
"> Task :compileJava\n",
"> Task :processResources NO-SOURCE\n",
"> Task :classes\n",
"> Task :jar\n",
"> Task :startScripts\n",
"> Task :distTar\n",
"> Task :distZip\n",
"> Task :shadowJar\n",
"> Task :startShadowScripts\n",
"> Task :shadowDistTar\n",
"> Task :shadowDistZip\n",
"> Task :assemble\n",
"> Task :compileTestJava NO-SOURCE\n",
"> Task :processTestResources NO-SOURCE\n",
"> Task :testClasses UP-TO-DATE\n",
"> Task :test NO-SOURCE\n",
"> Task :check UP-TO-DATE\n",
"> Task :build\n",
"\n",
"BUILD SUCCESSFUL in 56s\n",
"9 actionable tasks: 9 executed\n",
"\u001b[m\n",
">> ls -lh build/libs/\n",
"total 40M\n",
"-rw-r--r-- 1 root root 2.9K Mar 4 22:59 content.jar\n",
"-rw-r--r-- 1 root root 40M Mar 4 23:00 WordCount.jar\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "LrRFNZHD8dtu",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"There are two files generated:\n",
"* The `content.jar` file, the application generated from the regular `build` command. It's only a few kilobytes in size.\n",
"* The `WordCount.jar` file, with the `baseName` we specified in the `shadowJar` section of the `gradle.build` file. It's a several megabytes in size, with all the required libraries it needs to run embedded in it.\n",
"\n",
"The file we're actually interested in is the fat JAR file `WordCount.jar`. To run the fat JAR, we'll use the `gradle runShadow` command."
]
},
{
"metadata": {
"id": "CgTXBdTsBn1F",
"colab_type": "code",
"outputId": "5e447cf9-a01a-4a82-9237-676f0091d4bd",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1822
}
},
"cell_type": "code",
"source": [
"# Run the shadow (fat jar) build.\n",
"gradle('runShadow')\n",
"\n",
"# Sample the first 20 results, remember there are no ordering guarantees.\n",
"run('head -n 20 outputs/part-00000-of-*')"
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
">> /opt/gradle-5.0/bin/gradle --console=plain runShadow\n",
"\u001b[m> Task :compileJava UP-TO-DATE\n",
"> Task :processResources NO-SOURCE\n",
"> Task :classes UP-TO-DATE\n",
"> Task :shadowJar UP-TO-DATE\n",
"> Task :startShadowScripts UP-TO-DATE\n",
"> Task :installShadowDist\n",
"\n",
"> Task :runShadow\n",
"WARNING: An illegal reflective access operation has occurred\n",
"WARNING: Illegal reflective access by org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil (file:/content/build/install/content-shadow/lib/WordCount.jar) to field java.nio.Buffer.address\n",
"WARNING: Please consider reporting this to the maintainers of org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil\n",
"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
"WARNING: All illegal access operations will be denied in a future release\n",
"Mar 04, 2019 11:00:24 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes\n",
"INFO: Filepattern data/* matched 1 files with total size 157283\n",
"Mar 04, 2019 11:00:24 PM org.apache.beam.sdk.io.FileBasedSource split\n",
"INFO: Splitting filepattern data/* into bundles of size 52427 took 1 ms and produced 1 files and 3 bundles\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 7d12bbc4-9165-4493-8563-fb710b827daa for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17 pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 5b1bdb18-9f9a-47cd-80d2-a65baa31aa60 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17 pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 6302e6b5-5428-48e6-b571-76a9282d7f45 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17 pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/7d12bbc4-9165-4493-8563-fb710b827daa\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/5b1bdb18-9f9a-47cd-80d2-a65baa31aa60\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 493e3ec4-f0f7-4d10-8209-079f7ac4db16 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17 pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/6302e6b5-5428-48e6-b571-76a9282d7f45\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/493e3ec4-f0f7-4d10-8209-079f7ac4db16\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn process\n",
"INFO: Finalizing 4 file results\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation createMissingEmptyShards\n",
"INFO: Finalizing for destination null num shards 4.\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-23-1/7d12bbc4-9165-4493-8563-fb710b827daa, shard=3, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00003-of-00004\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-23-1/5b1bdb18-9f9a-47cd-80d2-a65baa31aa60, shard=2, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00002-of-00004\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-23-1/6302e6b5-5428-48e6-b571-76a9282d7f45, shard=1, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00001-of-00004\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-23-1/493e3ec4-f0f7-4d10-8209-079f7ac4db16, shard=0, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@3ad2e17, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00000-of-00004\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/6302e6b5-5428-48e6-b571-76a9282d7f45\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/5b1bdb18-9f9a-47cd-80d2-a65baa31aa60\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/7d12bbc4-9165-4493-8563-fb710b827daa\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-23-1/493e3ec4-f0f7-4d10-8209-079f7ac4db16\n",
"Mar 04, 2019 11:00:43 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"WARNING: Failed to match temporary files under: [/content/outputs/.temp-beam-2019-03-04_23-00-23-1/].\n",
"\n",
"BUILD SUCCESSFUL in 24s\n",
"5 actionable tasks: 2 executed, 3 up-to-date\n",
"\u001b[m\n",
">> head -n 20 outputs/part-00000-of-*\n",
"==> outputs/part-00000-of-00001 <==\n",
"(u'canker', 1)\n",
"(u'bounty', 2)\n",
"(u'provision', 3)\n",
"(u'to', 438)\n",
"(u'terms', 2)\n",
"(u'unnecessary', 2)\n",
"(u'tongue', 5)\n",
"(u'knives', 1)\n",
"(u'Commend', 1)\n",
"(u'Hum', 2)\n",
"(u'Set', 2)\n",
"(u'smell', 6)\n",
"(u'dreadful', 3)\n",
"(u'frowning', 1)\n",
"(u'World', 1)\n",
"(u'tike', 1)\n",
"(u'yes', 3)\n",
"(u'oldness', 1)\n",
"(u'boat', 1)\n",
"(u\"in's\", 1)\n",
"\n",
"==> outputs/part-00000-of-00004 <==\n",
"retinue: 1\n",
"stink: 1\n",
"beaks: 1\n",
"Ten: 1\n",
"riots: 2\n",
"Service: 1\n",
"dealing: 1\n",
"stop: 2\n",
"detain: 1\n",
"beware: 1\n",
"pilferings: 1\n",
"swimming: 1\n",
"The: 124\n",
"Been: 1\n",
"behavior: 1\n",
"impetuous: 1\n",
"Thy: 20\n",
"Tis: 24\n",
"Soldiers: 7\n",
"Juno: 1\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "T_oqlIM55MzM",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Distributing your application\n",
"\n",
"We can run our fat JAR file as long as we have a Java Runtime Environment installed.\n",
"\n",
"To distribute, we copy the fat JAR file and run it with `java -jar`."
]
},
{
"metadata": {
"id": "b3YSRjYnavpd",
"colab_type": "code",
"outputId": "ef88153a-f75f-4e80-8434-ac452a77a199",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1907
}
},
"cell_type": "code",
"source": [
"# You can now distribute and run your Java application as a standalone jar file.\n",
"run('cp build/libs/WordCount.jar .')\n",
"run('java -jar WordCount.jar')\n",
"\n",
"# Sample the first 20 results, remember there are no ordering guarantees.\n",
"run('head -n 20 outputs/part-00000-of-*')"
],
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": [
">> cp build/libs/WordCount.jar .\n",
"\n",
">> java -jar WordCount.jar\n",
"WARNING: An illegal reflective access operation has occurred\n",
"WARNING: Illegal reflective access by org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil (file:/content/WordCount.jar) to field java.nio.Buffer.address\n",
"WARNING: Please consider reporting this to the maintainers of org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil\n",
"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
"WARNING: All illegal access operations will be denied in a future release\n",
"Mar 04, 2019 11:00:49 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes\n",
"INFO: Filepattern data/* matched 1 files with total size 157283\n",
"Mar 04, 2019 11:00:49 PM org.apache.beam.sdk.io.FileBasedSource split\n",
"INFO: Splitting filepattern data/* into bundles of size 52427 took 1 ms and produced 1 files and 3 bundles\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 273fc5ad-09b8-4e87-95c9-5d9ec72ed294 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer c224db69-5869-4259-bd43-ca0431ec77fe for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer f45d6e07-d37a-4af3-ad8b-bc316cef7d99 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/f45d6e07-d37a-4af3-ad8b-bc316cef7d99\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/c224db69-5869-4259-bd43-ca0431ec77fe\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/273fc5ad-09b8-4e87-95c9-5d9ec72ed294\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn process\n",
"INFO: Finalizing 3 file results\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation createMissingEmptyShards\n",
"INFO: Finalizing for destination null num shards 3.\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-49-1/273fc5ad-09b8-4e87-95c9-5d9ec72ed294, shard=2, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00002-of-00003\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-49-1/c224db69-5869-4259-bd43-ca0431ec77fe, shard=0, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00000-of-00003\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-00-49-1/f45d6e07-d37a-4af3-ad8b-bc316cef7d99, shard=1, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@7a362b6b, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00001-of-00003\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/273fc5ad-09b8-4e87-95c9-5d9ec72ed294\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/f45d6e07-d37a-4af3-ad8b-bc316cef7d99\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-00-49-1/c224db69-5869-4259-bd43-ca0431ec77fe\n",
"Mar 04, 2019 11:01:10 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"WARNING: Failed to match temporary files under: [/content/outputs/.temp-beam-2019-03-04_23-00-49-1/].\n",
"\n",
">> head -n 20 outputs/part-00000-of-*\n",
"==> outputs/part-00000-of-00001 <==\n",
"(u'canker', 1)\n",
"(u'bounty', 2)\n",
"(u'provision', 3)\n",
"(u'to', 438)\n",
"(u'terms', 2)\n",
"(u'unnecessary', 2)\n",
"(u'tongue', 5)\n",
"(u'knives', 1)\n",
"(u'Commend', 1)\n",
"(u'Hum', 2)\n",
"(u'Set', 2)\n",
"(u'smell', 6)\n",
"(u'dreadful', 3)\n",
"(u'frowning', 1)\n",
"(u'World', 1)\n",
"(u'tike', 1)\n",
"(u'yes', 3)\n",
"(u'oldness', 1)\n",
"(u'boat', 1)\n",
"(u\"in's\", 1)\n",
"\n",
"==> outputs/part-00000-of-00003 <==\n",
"With: 31\n",
"justification: 1\n",
"hither: 15\n",
"make: 46\n",
"opposed: 2\n",
"prince: 5\n",
"Burn: 1\n",
"waking: 1\n",
"waked: 3\n",
"inform: 6\n",
"mercy: 5\n",
"about: 11\n",
"danger: 6\n",
"Croak: 1\n",
"happier: 1\n",
"stick: 2\n",
"oppressed: 1\n",
"erlook: 1\n",
"untented: 1\n",
"myself: 10\n",
"\n",
"==> outputs/part-00000-of-00004 <==\n",
"retinue: 1\n",
"stink: 1\n",
"beaks: 1\n",
"Ten: 1\n",
"riots: 2\n",
"Service: 1\n",
"dealing: 1\n",
"stop: 2\n",
"detain: 1\n",
"beware: 1\n",
"pilferings: 1\n",
"swimming: 1\n",
"The: 124\n",
"Been: 1\n",
"behavior: 1\n",
"impetuous: 1\n",
"Thy: 20\n",
"Tis: 24\n",
"Soldiers: 7\n",
"Juno: 1\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "k-HubCrk-h_G",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Word count with comments\n",
"\n",
"Below is mostly the same code as above, but with comments explaining every line in more detail."
]
},
{
"metadata": {
"id": "wvnWyYklCXer",
"colab_type": "code",
"outputId": "275507a3-05e9-44ca-8625-d745154d5720",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"%%writefile src/main/java/samples/quickstart/WordCount.java\n",
"\n",
"package samples.quickstart;\n",
"\n",
"import org.apache.beam.sdk.Pipeline;\n",
"import org.apache.beam.sdk.io.TextIO;\n",
"import org.apache.beam.sdk.options.PipelineOptions;\n",
"import org.apache.beam.sdk.options.PipelineOptionsFactory;\n",
"import org.apache.beam.sdk.transforms.Count;\n",
"import org.apache.beam.sdk.transforms.Filter;\n",
"import org.apache.beam.sdk.transforms.FlatMapElements;\n",
"import org.apache.beam.sdk.transforms.MapElements;\n",
"import org.apache.beam.sdk.values.KV;\n",
"import org.apache.beam.sdk.values.PCollection;\n",
"import org.apache.beam.sdk.values.TypeDescriptors;\n",
"\n",
"import java.util.Arrays;\n",
"\n",
"public class WordCount {\n",
" public static void main(String[] args) {\n",
" String inputsDir = \"data/*\";\n",
" String outputsPrefix = \"outputs/part\";\n",
"\n",
" PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();\n",
" Pipeline pipeline = Pipeline.create(options);\n",
"\n",
" // Store the word counts in a PCollection.\n",
" // Each element is a KeyValue of (word, count) of types KV<String, Long>.\n",
" PCollection<KV<String, Long>> wordCounts =\n",
" // The input PCollection is an empty pipeline.\n",
" pipeline\n",
"\n",
" // Read lines from a text file.\n",
" .apply(\"Read lines\", TextIO.read().from(inputsDir))\n",
" // Element type: String - text line\n",
"\n",
" // Use a regular expression to iterate over all words in the line.\n",
" // FlatMapElements will yield an element for every element in an iterable.\n",
" .apply(\"Find words\", FlatMapElements.into(TypeDescriptors.strings())\n",
" .via((String line) -> Arrays.asList(line.split(\"[^\\\\p{L}]+\"))))\n",
" // Element type: String - word\n",
"\n",
" // Keep only non-empty words.\n",
" .apply(\"Filter empty words\", Filter.by((String word) -> !word.isEmpty()))\n",
" // Element type: String - word\n",
"\n",
" // Count each unique word.\n",
" .apply(\"Count words\", Count.perElement());\n",
" // Element type: KV<String, Long> - key: word, value: counts\n",
"\n",
" // We can process a PCollection through other pipelines, too.\n",
" // The input PCollection are the wordCounts from the previous step.\n",
" wordCounts\n",
" // Format the results into a string so we can write them to a file.\n",
" .apply(\"Write results\", MapElements.into(TypeDescriptors.strings())\n",
" .via((KV<String, Long> wordCount) ->\n",
" wordCount.getKey() + \": \" + wordCount.getValue()))\n",
" // Element type: str - text line\n",
"\n",
" // Finally, write the results to a file.\n",
" .apply(TextIO.write().to(outputsPrefix));\n",
"\n",
" // We have to explicitly run the pipeline, otherwise it's only a definition.\n",
" pipeline.run();\n",
" }\n",
"}"
],
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"text": [
"Overwriting src/main/java/samples/quickstart/WordCount.java\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "wKAJp7ON4Vpp",
"colab_type": "code",
"outputId": "9a4c7a72-70a1-4d31-89c1-cf7fb8fcdf53",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 2060
}
},
"cell_type": "code",
"source": [
"# Build and run the project. The 'runShadow' task implicitly does a 'build'.\n",
"gradle('runShadow')\n",
"\n",
"# Sample the first 20 results, remember there are no ordering guarantees.\n",
"run('head -n 20 outputs/part-00000-of-*')"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
">> /opt/gradle-5.0/bin/gradle --console=plain runShadow\n",
"\u001b[m> Task :compileJava\n",
"> Task :processResources NO-SOURCE\n",
"> Task :classes\n",
"> Task :shadowJar\n",
"> Task :startShadowScripts\n",
"> Task :installShadowDist\n",
"\n",
"> Task :runShadow\n",
"WARNING: An illegal reflective access operation has occurred\n",
"WARNING: Illegal reflective access by org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil (file:/content/build/install/content-shadow/lib/WordCount.jar) to field java.nio.Buffer.address\n",
"WARNING: Please consider reporting this to the maintainers of org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.UnsafeUtil\n",
"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
"WARNING: All illegal access operations will be denied in a future release\n",
"Mar 04, 2019 11:01:26 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes\n",
"INFO: Filepattern data/* matched 1 files with total size 157283\n",
"Mar 04, 2019 11:01:26 PM org.apache.beam.sdk.io.FileBasedSource split\n",
"INFO: Splitting filepattern data/* into bundles of size 52427 took 1 ms and produced 1 files and 3 bundles\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer e2eeada2-5a8b-4493-acc5-c706204d9669 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer 7acdc85e-ff7d-42d0-9b2f-9ce385956c0e for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn processElement\n",
"INFO: Opening writer d1a6a591-77f0-4994-affc-d83378e7b7c0 for window org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e pane PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} destination null\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/e2eeada2-5a8b-4493-acc5-c706204d9669\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/7acdc85e-ff7d-42d0-9b2f-9ce385956c0e\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$Writer close\n",
"INFO: Successfully wrote temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/d1a6a591-77f0-4994-affc-d83378e7b7c0\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn process\n",
"INFO: Finalizing 3 file results\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation createMissingEmptyShards\n",
"INFO: Finalizing for destination null num shards 3.\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-01-25-1/7acdc85e-ff7d-42d0-9b2f-9ce385956c0e, shard=2, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00002-of-00003\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-01-25-1/e2eeada2-5a8b-4493-acc5-c706204d9669, shard=0, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00000-of-00003\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation moveToOutputFiles\n",
"INFO: Will copy temporary file FileResult{tempFilename=/content/outputs/.temp-beam-2019-03-04_23-01-25-1/d1a6a591-77f0-4994-affc-d83378e7b7c0, shard=1, window=org.apache.beam.sdk.transforms.windowing.GlobalWindow@8c3619e, paneInfo=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} to final location /content/outputs/part-00001-of-00003\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/e2eeada2-5a8b-4493-acc5-c706204d9669\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/7acdc85e-ff7d-42d0-9b2f-9ce385956c0e\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"INFO: Will remove known temporary file /content/outputs/.temp-beam-2019-03-04_23-01-25-1/d1a6a591-77f0-4994-affc-d83378e7b7c0\n",
"Mar 04, 2019 11:01:46 PM org.apache.beam.sdk.io.FileBasedSink$WriteOperation removeTemporaryFiles\n",
"WARNING: Failed to match temporary files under: [/content/outputs/.temp-beam-2019-03-04_23-01-25-1/].\n",
"\n",
"BUILD SUCCESSFUL in 33s\n",
"5 actionable tasks: 5 executed\n",
"\u001b[m\n",
">> head -n 20 outputs/part-00000-of-*\n",
"==> outputs/part-00000-of-00001 <==\n",
"(u'canker', 1)\n",
"(u'bounty', 2)\n",
"(u'provision', 3)\n",
"(u'to', 438)\n",
"(u'terms', 2)\n",
"(u'unnecessary', 2)\n",
"(u'tongue', 5)\n",
"(u'knives', 1)\n",
"(u'Commend', 1)\n",
"(u'Hum', 2)\n",
"(u'Set', 2)\n",
"(u'smell', 6)\n",
"(u'dreadful', 3)\n",
"(u'frowning', 1)\n",
"(u'World', 1)\n",
"(u'tike', 1)\n",
"(u'yes', 3)\n",
"(u'oldness', 1)\n",
"(u'boat', 1)\n",
"(u\"in's\", 1)\n",
"\n",
"==> outputs/part-00000-of-00003 <==\n",
"wrath: 3\n",
"nicely: 2\n",
"hall: 1\n",
"Sure: 2\n",
"legs: 4\n",
"ten: 1\n",
"yourselves: 1\n",
"embossed: 1\n",
"poorly: 1\n",
"temper: 2\n",
"Dismissing: 1\n",
"Legitimate: 1\n",
"tyrannous: 1\n",
"turn: 13\n",
"gold: 2\n",
"minds: 1\n",
"dowers: 2\n",
"policy: 1\n",
"I: 708\n",
"V: 6\n",
"\n",
"==> outputs/part-00000-of-00004 <==\n",
"retinue: 1\n",
"stink: 1\n",
"beaks: 1\n",
"Ten: 1\n",
"riots: 2\n",
"Service: 1\n",
"dealing: 1\n",
"stop: 2\n",
"detain: 1\n",
"beware: 1\n",
"pilferings: 1\n",
"swimming: 1\n",
"The: 124\n",
"Been: 1\n",
"behavior: 1\n",
"impetuous: 1\n",
"Thy: 20\n",
"Tis: 24\n",
"Soldiers: 7\n",
"Juno: 1\n",
"\n"
],
"name": "stdout"
}
]
}
]
}