| # Nemo-Hadoop(YARN/HDFS) Cluster Deployment Guide |
| |
| ## Install the operating system |
| * Ubuntu 14.04.4 LTS |
| * Other OS/versions not tested |
| |
| ## Initialize the fresh OS |
| * Run `initialize_fresh_ubuntu.sh` on each node |
| |
| ## Set hostnames |
| * Assign a name to each node (v-m, v-w1, v-w2, ...) |
| * Set `/etc/hosts` accordingly on v-m and `pscp` the file to all v-w |
| * Set `hostname` of each node using `set_hostname.sh` |
| |
| ## Set up YARN/HDFS cluster |
| * Set up the conf files(core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, ...) on v-m, and `pscp` them to all v-w |
| * hdfs namenode -format (does not always cleanly format... may need to delete the hdfs directory) |
| * start-yarn.sh && start-dfs.sh |
| * For more information, refer to the official Hadoop website |
| |
| ## Viewing the YARN/HDFS Web UI (when the nodes can't be directly accessed from the internet) |
| * This is the case for our cmslab cluster |
| * Use ssh tunneling: `ssh -D 8123 -f -C -q -N username@gateway` |
| * Turn on SOCKS proxy in chrome(web browser) advanced settings |
| * Set your mac's `/etc/hosts` |
| * Go to `v-m:8088` and `v-m:50070` |
| |
| ## Run Nemo in the cluster |
| * git clone Nemo on v-m and install |
| * Upload a local input file to HDFS with `hdfs -put` |
| * Launch a Nemo job with `-deploy_mode yarn`, and hdfs paths as the input/output |
| * Example: `./bin/run.sh -deploy_mode yarn -job_id mr -user_main org.apache.nemo.examples.beam.WordCount -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"` |
| |
| ## And you're all set.....? |
| * I hope so |
| * But the chances are that you'll run into problems that are not covered in this guide |
| * When you resolve the problems(with your friend Google), please share what you've learned by updating this README |
| * Final tip: Learn how to use `pssh`, and you'll be able to maintain a n-node cluster as if you're maintaining a single server |
| * A great tutorial on pssh: https://www.tecmint.com/execute-commands-on-multiple-linux-servers-using-pssh/ |
| |
| ## Some Example Commands for copying files |
| ```bash |
| # miss any of these and you'll have a very intersting(?) YARN/HDFS cluster |
| pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts |
| pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/core-site.xml |
| pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml |
| ``` |
| |
| ## Some Example Commands for querying cluster status |
| ```bash |
| # sanity check |
| pssh -i -h ~/parallel/hostfile 'echo $HADOOP_HOME' |
| pssh -i -h ~/parallel/hostfile 'echo $JAVA_HOME' |
| pssh -i -h ~/parallel/hostfile 'yarn classpath' |
| |
| # any zombies? things running alright? |
| pssh -i -h ~/parallel/hostfile 'jps' |
| pssh -i -h ~/parallel/hostfile 'ps' |
| |
| # resource usage |
| htop |
| ``` |
| |
| ## Misc |
| ```bash |
| # When HDFS is being weird... run the following commands in order to hard-format |
| pssh -i -h ~/parallel/hostfile 'rm -rf /home/ubuntu/hadoop/dfs' |
| hdfs namenode -format |
| ``` |
| |