---
title: "Quickstart: Setup"
# Top navigation
top-nav-group: quickstart
top-nav-pos: 1
top-nav-title: Setup & Run Example
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
* This will be replaced by the TOC
{:toc}
Get a Flink example program up and running in a few simple steps.
## Setup: Download and Start
Flink runs on __Linux, Mac OS X, and Windows__. The only requirement is a working __Java 7.x__ (or higher) installation. Windows users should take a look at the [Flink on Windows]({{ site.baseurl }}/setup/local_setup.html#flink-on-windows) guide, which describes how to run Flink on Windows for local setups.
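You can quickly check your Java installation from a terminal; the exact output depends on your setup, but the reported version should be 1.7 or higher:

~~~bash
$ java -version
java version "1.7.0_79"   # example output; any Java 7+ installation will do
~~~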
### Download
Download a binary from the [downloads page](http://flink.apache.org/downloads.html). You can pick any Hadoop/Scala combination you like, for instance [Flink for Hadoop 2]({{ site.FLINK_DOWNLOAD_URL_HADOOP2_STABLE }}).
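If you prefer to work entirely from the terminal, you can also fetch the archive there. This is only a sketch; replace the placeholder with the link of the package you picked on the downloads page:

~~~bash
$ cd ~/Downloads
$ wget <link-to-the-flink-binary-release>   # placeholder: use the download link you chose above
~~~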
### Start a Local Flink Cluster
1. Go to the download directory.
2. Unpack the downloaded archive.
3. Start Flink.
~~~bash
$ cd ~/Downloads # Go to download directory
$ tar xzf flink-*.tgz # Unpack the downloaded archive
$ cd flink-{{site.version}}
$ bin/start-local.sh # Start Flink
~~~
Check the __JobManager's web frontend__ at [http://localhost:8081](http://localhost:8081) and make sure everything is up and running. The web frontend should report a single available TaskManager instance.
<a href="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-1.png" ><img class="img-responsive" src="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-1.png" alt="JobManager: Overview"/></a>
## Run Example
Now we are going to run the [SocketTextStreamWordCount example](https://github.com/apache/flink/blob/release-1.0.0/flink-quickstart/flink-quickstart-java/src/main/resources/archetype-resources/src/main/java/SocketTextStreamWordCount.java), which reads text from a socket and counts the occurrences of each distinct word.
* First of all, we use **netcat** to start a local server (on Mac OS X, omit the `-p` flag and run `nc -l 9000` instead):
~~~bash
$ nc -l -p 9000
~~~
* Submit the Flink program:
~~~bash
$ bin/flink run examples/streaming/SocketTextStreamWordCount.jar \
--hostname localhost \
--port 9000
Printing result to stdout. Use --output to specify output path.
03/08/2016 17:21:56 Job execution switched to status RUNNING.
03/08/2016 17:21:56 Source: Socket Stream -> Flat Map(1/1) switched to SCHEDULED
03/08/2016 17:21:56 Source: Socket Stream -> Flat Map(1/1) switched to DEPLOYING
03/08/2016 17:21:56 Keyed Aggregation -> Sink: Unnamed(1/1) switched to SCHEDULED
03/08/2016 17:21:56 Keyed Aggregation -> Sink: Unnamed(1/1) switched to DEPLOYING
03/08/2016 17:21:56 Source: Socket Stream -> Flat Map(1/1) switched to RUNNING
03/08/2016 17:21:56 Keyed Aggregation -> Sink: Unnamed(1/1) switched to RUNNING
~~~
The program connects to the socket and waits for input. You can check the web interface to verify that the job is running as expected:
<div class="row">
<div class="col-sm-6">
<a href="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-2.png" ><img class="img-responsive" src="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-2.png" alt="JobManager: Overview (cont'd)"/></a>
</div>
<div class="col-sm-6">
<a href="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-3.png" ><img class="img-responsive" src="{{ site.baseurl }}/page/img/quickstart-setup/jobmanager-3.png" alt="JobManager: Running Jobs"/></a>
</div>
</div>
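The same information is also available from the command line via Flink's CLI:

~~~bash
$ bin/flink list    # lists the scheduled and running jobs, including their JobIDs
~~~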
* Counts are printed to `stdout`. Monitor the JobManager's output file and type some text into the `nc` terminal:
~~~bash
$ nc -l -p 9000
lorem ipsum
ipsum ipsum ipsum
bye
~~~
The counts will appear in the JobManager's `.out` file almost immediately:
~~~bash
$ tail -f log/flink-*-jobmanager-*.out
(lorem,1)
(ipsum,1)
(ipsum,2)
(ipsum,3)
(ipsum,4)
(bye,1)
~~~
To **stop** Flink when you're done, type:
~~~bash
$ bin/stop-local.sh
~~~
<a href="{{ site.baseurl }}/page/img/quickstart-setup/setup.gif" ><img class="img-responsive" src="{{ site.baseurl }}/page/img/quickstart-setup/setup.gif" alt="Quickstart: Setup"/></a>
## Next Steps
Check out the [step-by-step example](run_example_quickstart.html) to get a first feel for Flink's programming APIs. When you are done with that, go ahead and read the [streaming guide]({{ site.baseurl }}/apis/streaming/).
### Cluster Setup
__Running Flink on a cluster__ is as easy as running it locally. Having __passwordless SSH__ and
__the same directory structure__ on all your cluster nodes lets you use our scripts to control
everything.
1. Copy the unpacked __flink__ directory from the downloaded archive to the same file system path
on each node of your setup.
2. Choose a __master node__ (JobManager) and set the `jobmanager.rpc.address` key in
`conf/flink-conf.yaml` to its IP or hostname. Make sure that all nodes in your cluster have the same
`jobmanager.rpc.address` configured.
3. Add the IPs or hostnames (one per line) of all __worker nodes__ (TaskManager) to the slaves file `conf/slaves`.
You can now __start the cluster__ at your master node with `bin/start-cluster.sh`.
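A minimal sketch of distributing the configured directory and bringing up the cluster from the master node; `worker1`, `worker2`, and `/path/to/flink` are placeholders for your own hostnames and installation path:

~~~bash
# Copy the configured flink directory to the same path on every worker node
$ rsync -a /path/to/flink/ worker1:/path/to/flink/
$ rsync -a /path/to/flink/ worker2:/path/to/flink/

# Start the JobManager and all TaskManagers listed in conf/slaves (uses SSH)
$ /path/to/flink/bin/start-cluster.sh
~~~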
The following __example__ illustrates the setup with three nodes (with IP addresses from _10.0.0.1_
to _10.0.0.3_ and hostnames _master_, _worker1_, _worker2_) and shows the contents of the
configuration files, which need to be accessible at the same path on all machines:
<div class="row">
<div class="col-md-6 text-center">
<img src="{{ site.baseurl }}/page/img/quickstart_cluster.png" style="width: 85%">
</div>
<div class="col-md-6">
<div class="row">
<p class="lead text-center">
/path/to/<strong>flink/conf/<br>flink-conf.yaml</strong>
<pre>jobmanager.rpc.address: 10.0.0.1</pre>
</p>
</div>
<div class="row" style="margin-top: 1em;">
<p class="lead text-center">
/path/to/<strong>flink/<br>conf/slaves</strong>
<pre>
10.0.0.2
10.0.0.3</pre>
</p>
</div>
</div>
</div>
Have a look at the [Configuration]({{ site.baseurl }}/setup/config.html) section of the documentation to see other available configuration options.
For Flink to run efficiently, a few configuration values need to be set. The most important ones are:

* the amount of available memory per TaskManager (`taskmanager.heap.mb`),
* the number of available CPUs per machine (`taskmanager.numberOfTaskSlots`),
* the total number of CPUs in the cluster (`parallelism.default`), and
* the temporary directories (`taskmanager.tmp.dirs`).
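For example, a worker with four CPU cores and a few gigabytes of memory to spare might be configured along these lines in `conf/flink-conf.yaml`; the numbers are purely illustrative and have to be adapted to your hardware:

~~~yaml
taskmanager.heap.mb: 2048          # memory available to each TaskManager
taskmanager.numberOfTaskSlots: 4   # typically the number of CPU cores per machine
parallelism.default: 8             # default parallelism used when a program sets none
taskmanager.tmp.dirs: /tmp/flink   # directories for temporary files
~~~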
### Flink on YARN
You can easily deploy Flink on your existing __YARN cluster__.
1. Download the __Flink Hadoop2 package__: [Flink with Hadoop 2]({{site.FLINK_DOWNLOAD_URL_HADOOP2_STABLE}})
2. Make sure your __HADOOP_HOME__ (or _YARN_CONF_DIR_ or _HADOOP_CONF_DIR_) __environment variable__ is set to read your YARN and HDFS configuration.
3. Run the __YARN client__ with: `./bin/yarn-session.sh`. You can run the client with options `-n 10 -tm 8192` to allocate 10 TaskManagers with 8GB of memory each.
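Put together, starting a session with the example values from above might look like this; `/etc/hadoop/conf` is a placeholder and has to point at your actual Hadoop configuration:

~~~bash
# Make the YARN and HDFS configuration visible to Flink
$ export HADOOP_CONF_DIR=/etc/hadoop/conf   # adjust to your installation

# Start a YARN session with 10 TaskManagers, 8 GB of memory each
$ ./bin/yarn-session.sh -n 10 -tm 8192
~~~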