src/site/confluence/quick-start-guide.confluence - whirr - Git at Google

 h1. Getting Started with Whirr

 The Whirr CLI provides the most convenient way to launch clusters.

 h3. Pre-requisites
 * Java 6
 * An account with a cloud provider, such as Amazon EC2 or Rackspace Cloud Servers
 * An SSH client

 h3. Install Whirr

 [Download|http://www.apache.org/dyn/closer.cgi/incubator/whirr/] or
 [build|https://cwiki.apache.org/confluence/display/WHIRR/How+To+Contribute] Whirr.

 You can test that Whirr is working by running:

 {code}
 % bin/whirr version
 {code}

 This displays the version of Whirr that is installed.

 To get usage instructions, type:

 {code}
 % bin/whirr
 {code}

 h3. Configure a Hadoop cluster

 First, create a properties file to define the cluster. The name doesn't matter,
 but here we will assume it is called _hadoop.properties_ and located in your home directory.
 This file defines a cluster
 with a single machine for the namenode and jobtracker, and
 a further machine for a datanode and tasktracker. You can see how to launch
 other services by consulting the sample configurations in the _recipes_
 directory of the distribution.

 {code}
 whirr.cluster-name=myhadoopcluster
 whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
 whirr.provider=ec2
 whirr.identity=${env:AWS_ACCESS_KEY_ID}
 whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
 whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
 whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
 {code}

 Note that we haven't specified a particular cloud image, since Whirr
 provides a default for each provider which should work well enough. However, for
 larger clusters you will likely use larger hardware sizes or particular images.
 See the _recipes_ files and the [Configuration Guide|configuration-guide] for
 details.
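
 As a rough sketch of what such overrides might look like, the fragment below adds
 a hardware size and an image to the properties file. The property names
 {{whirr.hardware-id}} and {{whirr.image-id}} and the values shown are assumptions
 for illustration; confirm them against the [Configuration Guide|configuration-guide]
 for your Whirr version.

 {code}
 # Assumed property names and placeholder values; verify before use
 whirr.hardware-id=m1.large
 whirr.image-id=<region>/<ami-id>
 {code}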

 In this configuration file the cloud identity and credential are read from
 environment variables; you can equally well put them directly in the
 configuration file if you wish.
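
 For the environment-variable approach, the variables need to be set in the shell
 that runs Whirr. A minimal sketch, with placeholder values that you would replace
 with your own credentials:

 {code}
 # Placeholder values: substitute your own AWS credentials
 export AWS_ACCESS_KEY_ID='your-access-key-id'
 export AWS_SECRET_ACCESS_KEY='your-secret-access-key'
 {code}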

 The {{private-key-file}} and {{public-key-file}} properties specify an SSH
 keypair. You can generate a keypair with:

 {code}
 % ssh-keygen -t rsa -P ''
 {code}

 You should use only RSA SSH keys, since DSA keys are not accepted yet.

 *Note*: the keypair specified by these properties is not the same as the AWS
 keypair generated with the {{ec2-add-keypair}} command or the AWS Management
 Console (since these don't place _both_ of the keys on your local machine). The
 PEM-encoded X.509 Certificate and Private Key (e.g. pk-XXXXXX.pem) cannot be
 used as a keypair either.
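
 Before launching, it can help to confirm that both halves of an RSA keypair are
 present locally. The following is a sketch only: {{check\_keypair}} is a
 hypothetical helper, not part of Whirr, and it assumes the public key sits next
 to the private key with a _.pub_ suffix, as {{ssh-keygen}} writes it.

 {code}
 # Hypothetical helper: verify that an RSA keypair exists on this machine
 check_keypair() {
   private_key="$1"
   public_key="$1.pub"
   # Both halves of the keypair must be on the local machine
   for f in "$private_key" "$public_key"; do
     if [ ! -f "$f" ]; then
       echo "missing: $f" >&2
       return 1
     fi
   done
   # An OpenSSH RSA public key starts with the key type "ssh-rsa"
   key_type=$(cut -d' ' -f1 "$public_key")
   if [ "$key_type" != "ssh-rsa" ]; then
     echo "not an RSA key: $public_key ($key_type)" >&2
     return 1
   fi
   echo "RSA keypair found: $private_key"
 }

 # Usage: check_keypair "$HOME/.ssh/id_rsa"
 {code}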

 h3. Launch a Hadoop cluster

 Run the following command to launch a cluster:

 {code}
 % bin/whirr launch-cluster --config hadoop.properties
 {code}

 Messages will be logged to the console as the cluster starts. You can see
 debug-level logging in a file named _whirr.log_ in the directory you ran the
 _whirr_ command from.

 A message will be printed out when the cluster has started, with a URL that you
 can use to access the web UI.

 h3. Run a proxy

 For security reasons, traffic from the network your client is running on is
 proxied through the master node of the cluster using an SSH tunnel
 (a SOCKS proxy on port 6666).

 A script to launch the proxy is created when you launch the cluster, and may be found in
 _~/.whirr/<cluster-name>_. Run it as follows (in a new terminal window):

 {code}
 % . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
 {code}

 To stop the proxy, just kill the process with Ctrl-C.
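
 To check from the command line that the proxy is forwarding traffic, one option
 is to fetch a cluster web page through it with curl. The sketch below assumes
 curl is installed; {{fetch\_via\_proxy}} is a hypothetical helper, not something
 Whirr provides, and the URL in the usage line is a placeholder for the one
 printed when the cluster started.

 {code}
 # Hypothetical helper: fetch a URL through the SOCKS proxy on port 6666.
 # --socks5-hostname makes curl resolve the host name on the proxy side.
 fetch_via_proxy() {
   curl --silent --socks5-hostname localhost:6666 "$1"
 }

 # Usage: pass the web UI URL printed by launch-cluster
 # fetch_via_proxy 'http://<master-host>/'
 {code}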

 Web browsers need to be configured to use this proxy too, so you can view pages
 served by worker nodes in the cluster. The most convenient way to do this is to
 use a
 [proxy auto-config (PAC) file|http://en.wikipedia.org/wiki/Proxy\_auto-config],
 such as [this one|http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac] for
 Hadoop EC2 clusters.

 If you are using Firefox, then you may find
 [FoxyProxy|http://foxyproxy.mozdev.org/] useful for managing PAC files.

 h3. Run a MapReduce job

 After you launch a cluster, a _hadoop-site.xml_ file is created in the directory
 _~/.whirr/<cluster-name>_. You can use this to connect to the cluster by setting
 the {{HADOOP\_CONF\_DIR}} environment variable (it is also possible to pass the
 configuration file as a {{-conf}} option to Hadoop tools):

 {code}
 % export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
 {code}

 You should now be able to browse HDFS:

 {code}
 % hadoop fs -ls /
 {code}

 Note that the version of Hadoop installed locally should match the version
 installed on the cluster. You should also make sure that the {{HADOOP\_HOME}}
 environment variable is set.
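
 A quick way to check these settings before running a job is sketched below.
 {{check\_hadoop\_env}} is a hypothetical helper, not part of Hadoop or Whirr; it
 only confirms that {{HADOOP\_HOME}} is set and that {{HADOOP\_CONF\_DIR}} contains
 the generated _hadoop-site.xml_.

 {code}
 # Hypothetical helper: report whether the local Hadoop environment is set up
 check_hadoop_env() {
   if [ -z "${HADOOP_HOME:-}" ]; then
     echo "HADOOP_HOME is not set" >&2
     return 1
   fi
   if [ ! -f "${HADOOP_CONF_DIR:-}/hadoop-site.xml" ]; then
     echo "no hadoop-site.xml in HADOOP_CONF_DIR" >&2
     return 1
   fi
   echo "HADOOP_HOME=$HADOOP_HOME"
   echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
 }

 # Usage: check_hadoop_env && hadoop version
 {code}

 Running {{hadoop version}} afterwards prints the locally installed version, which
 you can compare with the version on the cluster.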

 Here's how you can run a MapReduce job:

 {code}
 hadoop fs -mkdir input
 hadoop fs -put $HADOOP_HOME/LICENSE.txt input
 hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output
 hadoop fs -cat output/part-* | head
 {code}

 h3. Configuration

 Whirr is configured using a properties file, and optionally using command line arguments when using the CLI. Command line arguments take precedence over properties specified in a properties file.

 For example, instead of using the properties file above, you could launch a
 Hadoop cluster with the following command line (note that the {{whirr.}} prefix
 for properties is not reflected in the command line argument):

 {code}
 % bin/whirr launch-cluster \
     --cluster-name=myhadoopcluster \
     --instance-templates='1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker' \
     --provider=ec2 \
     --identity=$AWS_ACCESS_KEY_ID \
     --credential=$AWS_SECRET_ACCESS_KEY \
     --private-key-file=~/.ssh/id_rsa \
     --public-key-file=~/.ssh/id_rsa.pub
 {code}

 Notice that here we took advantage of the fact that the AWS credentials have
 been defined in environment variables.
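
 The two approaches can also be combined, since command line arguments take
 precedence over the properties file. As a sketch, the wrapper below reuses
 _hadoop.properties_ but launches the cluster under a name passed on the command
 line; {{launch\_named}} is an illustrative name, not part of Whirr.

 {code}
 # Hypothetical wrapper: same properties file, cluster name overridden
 launch_named() {
   bin/whirr launch-cluster --config hadoop.properties --cluster-name="$1"
 }

 # Usage: launch_named mysecondcluster
 {code}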

 See the [configuration guide|configuration-guide] for a list of all the configuration
 properties you can set.

 h3. Destroy a cluster

 When you've finished using a cluster, you can terminate the instances and clean up resources with the following command.

 *WARNING: All data will be deleted when you destroy the cluster.*

 {code}
 % bin/whirr destroy-cluster --config hadoop.properties
 {code}

 At this point you can shut down the SSH proxy to the cluster if you started one
 earlier.