Nemo-Hadoop (YARN/HDFS) Cluster Deployment Guide

Install the operating system

  • Ubuntu 14.04.4 LTS
  • Other OSes/versions have not been tested

Initialize the fresh OS

  • Run initialize_fresh_ubuntu.sh on each node
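
To run it on every node at once, a pssh sketch (this assumes the script has already been copied to each node's home directory, and that your hostfile lists all the nodes; run it on v-m separately if your hostfile only lists the workers):

pssh -i -h ~/parallel/hostfile 'bash ~/initialize_fresh_ubuntu.sh'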

Set hostnames

  • Assign a name to each node (v-m, v-w1, v-w2, ...)
  • Set /etc/hosts accordingly on v-m and pscp the file to all v-w (see the sketch after this list)
  • Set hostname of each node using set_hostname.sh
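
For illustration, a minimal /etc/hosts and its distribution might look like this (the IP addresses are placeholders; substitute your cluster's own):

# /etc/hosts entries on v-m (example private IPs)
10.0.0.10 v-m
10.0.0.11 v-w1
10.0.0.12 v-w2
# push the file to every worker
pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts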

Set up YARN/HDFS cluster

  • Set up the conf files (core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, ...) on v-m, and pscp them to all v-w (a minimal core-site.xml sketch follows this list)
  • hdfs namenode -format (does not always format cleanly... you may need to delete the HDFS directory first; see Misc below)
  • start-yarn.sh && start-dfs.sh
  • For more information, refer to the official Hadoop website
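
As a starting point, a minimal core-site.xml sketch (only fs.defaultFS is set, matching the hdfs://v-m:9000 paths used later in this guide; everything else is left at Hadoop's defaults, and the file still needs to be pscp'd to the workers as shown in the copy-commands section below):

cat > /home/ubuntu/hadoop/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://v-m:9000</value>
  </property>
</configuration>
EOF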

Viewing the YARN/HDFS Web UI (when the nodes can't be directly accessed from the internet)

  • This is the case for our cmslab cluster
  • Use ssh tunneling: ssh -D 8123 -f -C -q -N username@gateway
  • Turn on the SOCKS proxy in your web browser's advanced settings (e.g. Chrome)
  • Set your Mac's /etc/hosts so the cluster hostnames resolve locally (see the sketch after this list)
  • Go to v-m:8088 (YARN ResourceManager) and v-m:50070 (HDFS NameNode)
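
Putting it together (the username, gateway, and IP below are placeholders):

# open the SOCKS tunnel from your machine
ssh -D 8123 -f -C -q -N ubuntu@gateway
# add the cluster hostnames to your local /etc/hosts, e.g. "10.0.0.10 v-m"
# then browse to http://v-m:8088 and http://v-m:50070 through the proxy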

Run Nemo in the cluster

  • git clone Nemo on v-m and install
  • Upload a local input file to HDFS with hdfs dfs -put (see the example after this list)
  • Launch a Nemo job with -deploy_mode yarn, and hdfs paths as the input/output
  • Example: ./bin/run.sh -deploy_mode yarn -job_id mr -user_main org.apache.nemo.examples.beam.WordCount -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"
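
For instance, to upload the WordCount input used in the example above (the local file name is a placeholder):

hdfs dfs -put ./test_input_wordcount hdfs://v-m:9000/test_input_wordcount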

And you're all set...?

  • I hope so
  • But chances are you'll run into problems that are not covered in this guide
  • When you resolve the problems (with your friend Google), please share what you've learned by updating this README
  • Final tip: Learn how to use pssh, and you'll be able to maintain an n-node cluster as if you're maintaining a single server (a hostfile sketch follows this list)
  • A great tutorial on pssh: https://www.tecmint.com/execute-commands-on-multiple-linux-servers-using-pssh/
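
For reference, the ~/parallel/hostfile used by the pssh/pscp commands below is just a list of nodes, one per line (the contents here are an assumption; list whichever nodes the commands should reach):

# ~/parallel/hostfile
v-w1
v-w2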

Some Example Commands for copying files

# miss any of these and you'll have a very interesting(?) YARN/HDFS cluster
pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/core-site.xml
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
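# likewise for the remaining conf files from the setup step, e.g.
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml
pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/slaves /home/ubuntu/hadoop/etc/hadoop/slaves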

Some Example Commands for querying cluster status

# sanity check
pssh -i -h ~/parallel/hostfile 'echo $HADOOP_HOME'
pssh -i -h ~/parallel/hostfile 'echo $JAVA_HOME'
pssh -i -h ~/parallel/hostfile 'yarn classpath'

# any zombies? things running alright?
pssh -i -h ~/parallel/hostfile 'jps'
pssh -i -h ~/parallel/hostfile 'ps aux'

# resource usage (htop is interactive; ssh into a node to run it)
htop
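
# non-interactive alternative across all nodes at once (a sketch)
pssh -i -h ~/parallel/hostfile 'free -m'
pssh -i -h ~/parallel/hostfile 'df -h'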

Misc

# When HDFS is being weird... stop it, then run the following commands, in order, to hard-format
stop-dfs.sh
pssh -i -h ~/parallel/hostfile 'rm -rf /home/ubuntu/hadoop/dfs'
hdfs namenode -format
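# bring HDFS back up afterwards (the hard-format wipes everything in HDFS, so re-upload your inputs)
start-dfs.sh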