deploy/README.md - incubator-nemo - Git at Google

 # Nemo-Hadoop(YARN/HDFS) Cluster Deployment Guide

 ## Install the operating system
 * Ubuntu 14.04.4 LTS
 * Other OS/versions not tested

 ## Initialize the fresh OS
 * Run `initialize_fresh_ubuntu.sh` on each node

 ## Set hostnames
 * Assign a name to each node (v-m, v-w1, v-w2, ...)
 * Set `/etc/hosts` accordingly on v-m and `pscp` the file to all v-w
 * Set `hostname` of each node using `set_hostname.sh`

 ## Set up YARN/HDFS cluster
 * Set up the conf files(core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, ...) on v-m, and `pscp` them to all v-w
 * hdfs namenode -format (does not always cleanly format... may need to delete the hdfs directory)
 * start-yarn.sh && start-dfs.sh
 * For more information, refer to the official Hadoop website

 ## Viewing the YARN/HDFS Web UI (when the nodes can't be directly accessed from the internet)
 * This is the case for our cmslab cluster
 * Use ssh tunneling: `ssh -D 8123 -f -C -q -N username@gateway`
 * Turn on SOCKS proxy in chrome(web browser) advanced settings
 * Set your mac's `/etc/hosts`
 * Go to `v-m:8088` and `v-m:50070`

 ## Run Nemo in the cluster
 * git clone Nemo on v-m and install
 * Upload a local input file to HDFS with `hdfs -put`
 * Launch a Nemo job with `-deploy_mode yarn`, and hdfs paths as the input/output
 * Example: `./bin/run.sh -deploy_mode yarn -job_id mr -user_main org.apache.nemo.examples.beam.WordCount -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"`

 ## And you're all set.....?
 * I hope so
 * But the chances are that you'll run into problems that are not covered in this guide
 * When you resolve the problems(with your friend Google), please share what you've learned by updating this README
 * Final tip: Learn how to use `pssh`, and you'll be able to maintain a n-node cluster as if you're maintaining a single server
 * A great tutorial on pssh: https://www.tecmint.com/execute-commands-on-multiple-linux-servers-using-pssh/

 ## Some Example Commands for copying files
 ```bash
 # miss any of these and you'll have a very intersting(?) YARN/HDFS cluster
 pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts
 pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/core-site.xml
 pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
 ```

 ## Some Example Commands for querying cluster status
 ```bash
 # sanity check
 pssh -i -h ~/parallel/hostfile 'echo $HADOOP_HOME'
 pssh -i -h ~/parallel/hostfile 'echo $JAVA_HOME'
 pssh -i -h ~/parallel/hostfile 'yarn classpath'

 # any zombies? things running alright?
 pssh -i -h ~/parallel/hostfile 'jps'
 pssh -i -h ~/parallel/hostfile 'ps'

 # resource usage
 htop
 ```

 ## Misc
 ```bash
 # When HDFS is being weird... run the following commands in order to hard-format
 pssh -i -h ~/parallel/hostfile 'rm -rf /home/ubuntu/hadoop/dfs'
 hdfs namenode -format
 ```
	# Nemo-Hadoop(YARN/HDFS) Cluster Deployment Guide

	## Install the operating system
	* Ubuntu 14.04.4 LTS
	* Other OS/versions not tested

	## Initialize the fresh OS
	* Run `initialize_fresh_ubuntu.sh` on each node

	## Set hostnames
	* Assign a name to each node (v-m, v-w1, v-w2, ...)
	* Set `/etc/hosts` accordingly on v-m and `pscp` the file to all v-w
	* Set `hostname` of each node using `set_hostname.sh`

	## Set up YARN/HDFS cluster
	* Set up the conf files(core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, ...) on v-m, and `pscp` them to all v-w
	* hdfs namenode -format (does not always cleanly format... may need to delete the hdfs directory)
	* start-yarn.sh && start-dfs.sh
	* For more information, refer to the official Hadoop website

	## Viewing the YARN/HDFS Web UI (when the nodes can't be directly accessed from the internet)
	* This is the case for our cmslab cluster
	* Use ssh tunneling: `ssh -D 8123 -f -C -q -N username@gateway`
	* Turn on SOCKS proxy in chrome(web browser) advanced settings
	* Set your mac's `/etc/hosts`
	* Go to `v-m:8088` and `v-m:50070`

	## Run Nemo in the cluster
	* git clone Nemo on v-m and install
	* Upload a local input file to HDFS with `hdfs -put`
	* Launch a Nemo job with `-deploy_mode yarn`, and hdfs paths as the input/output
	* Example: `./bin/run.sh -deploy_mode yarn -job_id mr -user_main org.apache.nemo.examples.beam.WordCount -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"`

	## And you're all set.....?
	* I hope so
	* But the chances are that you'll run into problems that are not covered in this guide
	* When you resolve the problems(with your friend Google), please share what you've learned by updating this README
	* Final tip: Learn how to use `pssh`, and you'll be able to maintain a n-node cluster as if you're maintaining a single server
	* A great tutorial on pssh: https://www.tecmint.com/execute-commands-on-multiple-linux-servers-using-pssh/

	## Some Example Commands for copying files
	```bash
	# miss any of these and you'll have a very intersting(?) YARN/HDFS cluster
	pscp -h ~/parallel/hostfile /etc/hosts /etc/hosts
	pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/core-site.xml
	pscp -h ~/parallel/hostfile /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
	```

	## Some Example Commands for querying cluster status
	```bash
	# sanity check
	pssh -i -h ~/parallel/hostfile 'echo $HADOOP_HOME'
	pssh -i -h ~/parallel/hostfile 'echo $JAVA_HOME'
	pssh -i -h ~/parallel/hostfile 'yarn classpath'

	# any zombies? things running alright?
	pssh -i -h ~/parallel/hostfile 'jps'
	pssh -i -h ~/parallel/hostfile 'ps'

	# resource usage
	htop
	```

	## Misc
	```bash
	# When HDFS is being weird... run the following commands in order to hard-format
	pssh -i -h ~/parallel/hostfile 'rm -rf /home/ubuntu/hadoop/dfs'
	hdfs namenode -format
	```