Nemo-Hadoop (YARN/HDFS) Cluster Deployment Guide
Install the operating system
- Ubuntu 14.04.4 LTS
- Other OS/versions not tested
Initialize the fresh OS
- Run initialize_fresh_ubuntu.sh on each node
Set hostnames
- Assign a name to each node (v-m, v-w1, v-w2, ...)
- Set /etc/hosts accordingly on v-m, and pscp the file to all v-w
- Set the hostname of each node using set_hostname.sh
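As a concrete sketch (the IP addresses below are hypothetical — substitute your nodes' real ones), the /etc/hosts entries for a master and three workers can be generated like this:

```shell
# Hypothetical addresses -- replace with your cluster's real IPs.
# Writes one "IP hostname" line per node into hosts_fragment.
{
  echo "10.0.0.10 v-m"
  for i in 1 2 3; do
    echo "10.0.0.$((10 + i)) v-w$i"
  done
} > hosts_fragment
cat hosts_fragment
```

Append the fragment to /etc/hosts on v-m, then push the whole file to the workers with pscp (see the example commands at the end of this guide).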
Set up YARN/HDFS cluster
- Set up the conf files (core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, ...) on v-m, and pscp them to all v-w
- hdfs namenode -format (does not always cleanly format... you may need to delete the HDFS directory first)
- start-dfs.sh && start-yarn.sh
- For more information, refer to the official Hadoop website
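As a minimal sketch of one of those conf files: core-site.xml mainly needs to point every node at the NameNode. The address below is an assumption, chosen to match the hdfs://v-m:9000 paths used in the example job command later in this guide — adjust it to your setup.

```xml
<?xml version="1.0"?>
<!-- core-site.xml: points all nodes at the NameNode on v-m.
     Port 9000 matches the hdfs:// paths in the example job below. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://v-m:9000</value>
  </property>
</configuration>
```

hdfs-site.xml and yarn-site.xml are configured similarly; the official Hadoop cluster-setup docs list the required properties for each.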
Viewing the YARN/HDFS Web UI (when the nodes can't be directly accessed from the internet)
- This is the case for our cmslab cluster
- Use ssh tunneling: ssh -D 8123 -f -C -q -N username@gateway
- Turn on the SOCKS proxy (localhost:8123) in your web browser's (e.g., Chrome) advanced settings
- Set your Mac's /etc/hosts so that the v-m/v-w hostnames resolve
- Go to v-m:8088 (YARN ResourceManager UI) and v-m:50070 (HDFS NameNode UI)
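Putting the steps together (the gateway name is a placeholder from the command above, and the IP is hypothetical):

```shell
# 1) Open a SOCKS proxy on local port 8123 through the gateway host.
ssh -D 8123 -f -C -q -N username@gateway
# 2) Point your browser's SOCKS proxy at localhost:8123.
# 3) Make the cluster hostnames resolve locally, e.g. append to /etc/hosts:
#      10.0.0.10 v-m      (hypothetical IP -- use the real one)
# 4) Then browse to http://v-m:8088 (YARN) and http://v-m:50070 (HDFS).
```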
Run Nemo in the cluster
- git clone Nemo on v-m and install
- Upload a local input file to HDFS with hdfs dfs -put
- Launch a Nemo job with -deploy_mode yarn, and HDFS paths as the input/output
- Example: ./bin/run.sh -deploy_mode yarn -job_id mr -user_main org.apache.nemo.examples.beam.WordCount -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"
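After the job finishes, you can sanity-check the result straight from HDFS. The paths below follow the example command above; depending on how the sink writes, the output may be split across several files, so adjust the glob to what -ls shows:

```shell
# See what the WordCount job wrote under the output prefix...
hdfs dfs -ls 'hdfs://v-m:9000/test_output_wordcount*'
# ...and print the first few lines of the result.
hdfs dfs -cat 'hdfs://v-m:9000/test_output_wordcount*' | head
```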
And you're all set.....?
- I hope so
- But chances are that you'll run into problems that are not covered in this guide
- When you resolve the problems (with your friend Google), please share what you've learned by updating this README
- Final tip: Learn how to use pssh, and you'll be able to maintain an n-node cluster as if you're maintaining a single server
- A great tutorial on pssh: https://www.tecmint.com/execute-commands-on-multiple-linux-servers-using-pssh/
Some Example Commands for copying files
# miss any of these and you'll have a very interesting(?) YARN/HDFS cluster
pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/core-site.xml
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
Some Example Commands for querying cluster status
# sanity check
pssh -i -h ~/parallel/hostfile 'echo $HADOOP_HOME'
pssh -i -h ~/parallel/hostfile 'echo $JAVA_HOME'
pssh -i -h ~/parallel/hostfile 'yarn classpath'
# any zombies? things running alright?
pssh -i -h ~/parallel/hostfile 'jps'
pssh -i -h ~/parallel/hostfile 'ps'
# resource usage
htop
Misc
# When HDFS is being weird... run the following commands, in order, to hard-format
pssh -i -h ~/parallel/hostfile 'rm -rf /home/ubuntu/hadoop/dfs'
hdfs namenode -format