---
layout: post
title: 'Use Stratosphere with Amazon Elastic MapReduce'
date: 2014-02-18 19:57:18
author: "Robert Metzger"
author-twitter: "rmetzger_"
categories: news
---

This step-by-step tutorial will guide you through the setup of Stratosphere using Amazon Elastic MapReduce.

Background

Amazon Elastic MapReduce (Amazon EMR) is part of Amazon Web Services. EMR lets you create Hadoop clusters that analyze data stored in Amazon S3 (AWS' cloud storage). Stratosphere runs on top of Hadoop using the recently released cluster resource manager YARN. YARN lets you run many different data analysis tools side by side in your cluster. Tools that run on YARN include, for example, Apache Giraph, Spark, and HBase. Stratosphere also runs on YARN, and that is the approach taken in this tutorial.

1. Step: Login to AWS and prepare secure access

You need SSH keys to access the Hadoop master node. If you do not have a key pair for your computer yet, generate one and keep the private key file (.pem); the SSH commands later in this tutorial need it.

2. Step: Create your Hadoop Cluster in the cloud

  • Select Elastic MapReduce from the AWS console
  • Click the blue “Create cluster” button.
  • That's it! You can now press the “Create cluster” button at the end of the form to boot it!

3. Step: Launch Stratosphere

You might need to wait a few minutes until Amazon has started your cluster. (You can monitor the progress of the instances in EC2.) Use the refresh button in the top right corner to update the view.

The master is up once the field Master public DNS (first line) contains a value. Connect to it using SSH.

{% highlight bash %}
ssh hadoop@<your master public DNS> -i <path to your .pem>
# for my example, it looks like this:
ssh hadoop@ec2-54-213-61-105.us-west-2.compute.amazonaws.com -i ~/Downloads/work-laptop.pem
{% endhighlight %}

(Windows users have to follow these instructions to SSH into the machine running the master.) Once connected to the master, download and start Stratosphere for YARN:

{% highlight bash %}
cd stratosphere-yarn-0.5-SNAPSHOT/
./bin/yarn-session.sh -n 4 -jm 1024 -tm 3000
{% endhighlight %}

The arguments have the following meaning:

  • -n number of TaskManagers (=workers). This number must not exceed the number of task instances
  • -jm memory (heapspace) for the JobManager
  • -tm memory for the TaskManagers
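To get a feeling for how much memory such a session asks YARN for, the arguments can be added up. The following is a minimal sketch, assuming the session needs one JobManager container of -jm MB plus -n TaskManager containers of -tm MB each and ignoring any per-container overhead. The helper name `session_memory_mb` is hypothetical and not part of Stratosphere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough memory estimate (in MB) for a YARN session.
# Assumption: total = JobManager memory + number of TaskManagers * TaskManager memory.
session_memory_mb() {
  local n="$1"    # value passed to -n
  local jm="$2"   # value passed to -jm
  local tm="$3"   # value passed to -tm
  echo $(( jm + n * tm ))
}

# Values from the yarn-session.sh call above: -n 4 -jm 1024 -tm 3000
session_memory_mb 4 1024 3000   # prints 13024
```

If the estimate is larger than what your instances offer, reduce -n or -tm before starting the session.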

Once the output has changed from
{% highlight bash %}
JobManager is now running on N/A:6123
{% endhighlight %}
to
{% highlight bash %}
JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123
{% endhighlight %}
Stratosphere has started the JobManager. It will take a few seconds until the TaskManagers (workers) have connected to the JobManager. To see how many TaskManagers connected, you have to access the JobManager's web interface. Follow the steps below to do that.
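If you want to check this transition in a script rather than by eye, the status line can be parsed. This is only a sketch: it assumes the line has exactly the format shown above, and the function name `jobmanager_address` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints host:port if the status line reports a started
# JobManager, and returns a non-zero exit code while the address is still N/A.
jobmanager_address() {
  local line="$1"
  local prefix="JobManager is now running on "
  local addr
  case "$line" in
    "$prefix"*) addr="${line#"$prefix"}" ;;   # strip the fixed prefix
    *) return 1 ;;                            # not a status line
  esac
  case "$addr" in
    N/A:*) return 1 ;;                        # JobManager not started yet
  esac
  echo "$addr"
}

jobmanager_address "JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123"
```

For the N/A line the function prints nothing and fails, so it can be used as the condition of a waiting loop.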

This step shows how to submit and monitor a Stratosphere Job in the Amazon Cloud.

We recommend creating a SOCKS proxy with SSH that allows you to easily connect into the cluster. (If you already have a VPN set up with EC2, you can probably use that as well.)

{% highlight bash %}
ssh -D localhost:2001 hadoop@<your master public DNS> -i <path to your .pem>
{% endhighlight %}

Notice the -D localhost:2001 argument: it opens a SOCKS proxy on your computer. Any application configured to use that proxy communicates with the master node through the SSH tunnel. This allows you to access all services in your EMR cluster, such as the HDFS NameNode or the YARN web interface.

Since you're connected to the master now, you can open several web interfaces:
YARN Resource Manager: http://<masterIPAddress>:9026/
HDFS NameNode: http://<masterIPAddress>:9101/
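A browser is not the only client that can use the proxy. As a sketch, the following prints a curl command that would fetch the YARN Resource Manager page through the SOCKS proxy opened with ssh -D. It assumes curl is installed on your computer; the helper name `yarn_ui_curl_cmd` is hypothetical, and the function only prints the command (a dry run) instead of executing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a curl command for the YARN Resource Manager UI.
# --socks5-hostname lets the proxy side resolve host names, so cluster-internal
# names also work. Nothing is sent over the network by this function.
yarn_ui_curl_cmd() {
  local master="$1"               # the masterIPAddress of your cluster
  local proxy_port="${2:-2001}"   # port given to ssh -D (2001 in this tutorial)
  echo "curl --socks5-hostname localhost:${proxy_port} http://${master}:9026/"
}

yarn_ui_curl_cmd 172.31.38.95
# prints: curl --socks5-hostname localhost:2001 http://172.31.38.95:9026/
```

Copy the printed command into a terminal on your computer while the SSH session with the proxy is open.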

You find the masterIPAddress by entering ifconfig into the terminal:
{% highlight bash %}
[hadoop@ip-172-31-38-95 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:CF:8E:CB:28:B2
          inet addr:172.31.38.95  Bcast:172.31.47.255  Mask:255.255.240.0
          inet6 addr: fe80::cf:8eff:fecb:28b2/64 Scope:Link
          RX bytes:166314967 (158.6 MiB)  TX bytes:89319246 (85.1 MiB)
{% endhighlight %}
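Instead of reading the address off the screen, you can extract it. This is a minimal sketch that assumes the older ifconfig output format shown above (with an `inet addr:` field) and that awk is available on the master. The function name `first_inet_addr` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads ifconfig output on stdin and prints the first
# IPv4 address found in an "inet addr:" field. Lines starting with "inet6"
# do not match the pattern and are skipped.
first_inet_addr() {
  awk '/inet addr:/ { sub(/^addr:/, "", $2); print $2; exit }'
}

# Example with the line from the output above:
echo "inet addr:172.31.38.95 Bcast:172.31.47.255 Mask:255.255.240.0" | first_inet_addr
# prints: 172.31.38.95
```

On the master you would call it as `ifconfig eth0 | first_inet_addr`.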

Optional: If you want to use the hostnames in Firefox (which also makes the NameNode links work), you have to enable DNS resolution over the SOCKS proxy. Open the Firefox configuration page about:config and set network.proxy.socks_remote_dns to true.

The YARN ResourceManager also allows you to connect to Stratosphere's JobManager web interface. Click the ApplicationMaster link in the “Tracking UI” column.

To run the Wordcount example, you have to upload some sample data.
{% highlight bash %}
# download a text
wget http://www.gnu.org/licenses/gpl.txt
# upload it to HDFS:
hadoop fs -copyFromLocal gpl.txt /input
{% endhighlight %}

To run a Job, enter the following command into the master's command line:
{% highlight bash %}
# optional: go to the extracted directory
cd stratosphere-yarn-0.5-SNAPSHOT/
# run the wordcount example
./bin/stratosphere run -w -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar -a 16 hdfs:///input hdfs:///output
{% endhighlight %}

Make sure that all requested TaskManagers have connected to the JobManager before you submit the job.

Let's go through the command in detail:

  • ./bin/stratosphere is the standard launcher for Stratosphere jobs from the command line.
  • The -w flag stands for “wait”. It is very useful for tracking the progress of the job.
  • -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar: the -j option sets the jar file containing the job. If you have your own application, place your jar file here.
  • -a 16 hdfs:///input hdfs:///output: the -a option specifies the job-specific arguments. In this case, the WordCount example expects the following input: <numSubTasks> <input> <output>.
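To make the order of the job arguments explicit, the command can be assembled from its parts. The following is a sketch only: `wordcount_cmd` is a hypothetical helper, and it prints the command rather than running it, so you can check it before pasting it into the master's terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the run command for the WordCount example.
# Argument order after -a follows the description above:
# <numSubTasks> <input> <output>
wordcount_cmd() {
  local subtasks="$1"
  local input="$2"
  local output="$3"
  local jar="examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar"
  echo "./bin/stratosphere run -w -j ${jar} -a ${subtasks} ${input} ${output}"
}

# Reproduces the command used in this tutorial:
wordcount_cmd 16 hdfs:///input hdfs:///output
```

Changing the first argument changes the number of subtasks, while the other two select the HDFS paths.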

You can monitor the progress of your job in the JobManager web interface. Once the job has finished (which should take less than 10 seconds), you can analyze it there. Inspect the result in HDFS using:

{% highlight bash %}
hadoop fs -tail /output
{% endhighlight %}

If you want to shut down the whole cluster in the cloud, use Amazon's web interface and click on “Terminate cluster”. If you just want to stop the YARN session, press CTRL+C in the terminal. The Stratosphere instances will be killed by YARN.