Getting Started With Hadoop On Demand (HOD)
===========================================
1. Pre-requisites:
==================
Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.
Software:
The following components are assumed to be installed before using HOD:
* Torque:
(http://www.clusterresources.com/pages/products/torque-resource-manager.php)
Currently HOD supports Torque out of the box. We assume that you are
familiar with configuring Torque. You can get information about this
from the following link:
http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
We require version 2.5.1 of Python.
The following components are optional and can be installed for additional
HOD functionality:
* Twisted Python: This can be used for improving the scalability of HOD
(http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the
cluster. However, it can also use a pre-installed version of Hadoop,
if it is available on all nodes in the cluster.
(http://hadoop.apache.org/core)
HOD currently supports Hadoop 0.15 and above.
NOTE: HOD configuration requires these components to be installed at the
same location on all nodes in the cluster. Configuration is also simpler
if the same location is used on the submit nodes.
2. Resource Manager Configuration Pre-requisites:
=================================================
For using HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
compute nodes, and PBS client tools on all compute nodes and submit
nodes.
* Create a queue for submitting jobs on the pbs_server.
* Specify a name for all nodes in the cluster, by setting a 'node
property' on all the nodes.
This can be done by using the 'qmgr' command. For example:
qmgr -c "set node <node-name> properties=cluster-name"
* Ensure that jobs can be submitted to the nodes. This can be done by
using the 'qsub' command. For example:
echo "sleep 30" | qsub -l nodes=3
* More information about setting up Torque can be found by referring
to the documentation under:
http://www.clusterresources.com/pages/products/torque-resource-manager.php
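As a quick sanity check of the resource manager setup, the steps above can
be exercised end-to-end from a submit node. In this sketch the node name
'compute-01', the property 'cluster-name' and the queue 'batch' are
illustrative values; substitute your own:
$ qmgr -c "set node compute-01 properties=cluster-name"
$ pbsnodes -a                                # verify node state and properties
$ echo "sleep 30" | qsub -l nodes=3 -q batch # submit a test job
$ qstat                                      # the test job should appear and run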
3. Setting up HOD:
==================
* HOD is available under the 'contrib' section of Hadoop, in the root
directory 'hod'.
* Distribute the files under this directory to all the nodes in the
cluster. Note that the location where the files are copied should be
the same on all the nodes.
* On the node from where you want to run hod, edit the file hodrc
which can be found in the <install dir>/conf directory. This file
contains the minimal set of values required for running hod.
* Specify values suitable to your environment for the following
variables defined in the configuration file. Note that some of these
variables are defined at more than one place in the file.
* ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
1.5.x
* ${CLUSTER_NAME}: Name of the cluster which is specified in the
'node property' as mentioned in resource manager configuration.
* ${HADOOP_HOME}: Location of Hadoop installation on the compute and
submit nodes.
* ${RM_QUEUE}: Queue configured for submitting jobs in the resource
manager configuration.
* ${RM_HOME}: Location of the resource manager installation on the
compute and submit nodes.
* The following environment variables *may* need to be set depending on
your environment. These variables must be defined where you run the
HOD client, and also be specified in the HOD configuration file as the
value of the key resource_manager.env-vars. Multiple variables can be
specified as a comma separated list of key=value pairs.
* HOD_PYTHON_HOME: If Python is installed in a non-default location
on the compute nodes or submit nodes, this variable must be defined
to point to the python executable in that non-standard location.
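For example, if Python 2.5.1 is installed under /opt/python-2.5.1 (an
illustrative path) on the compute and submit nodes, the variable would be
exported where the HOD client runs:
$ export HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python
and listed in hodrc as the value of the resource_manager.env-vars key:
env-vars = HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python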
NOTE:
You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.
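To illustrate how the variables above fit together, a minimal hodrc might
look like the following sketch. The section and key names mirror the sample
file shipped in <install dir>/conf, but all values here are placeholders;
refer to config.txt and the shipped sample for the authoritative layout:
[hod]
java-home    = /usr/java/jdk1.5.0
cluster      = cluster-name

[resource_manager]
id           = torque
queue        = batch
batch-home   = /usr/local/torque

[gridservice-mapred]
pkgs         = /path/to/hadoop

[gridservice-hdfs]
pkgs         = /path/to/hadoop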
4. Running HOD:
===============
4.1 Overview:
-------------
A typical HOD session involves at least three steps: allocate,
run Hadoop jobs, deallocate.
4.1.1 Operation allocate
------------------------
The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:
hod -c config_file -t hadoop_tarball_location -o "allocate \
cluster_dir number_of_nodes"
The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must
exist before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.
For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:
$ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
~/hadoop-cluster 10"
HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:
$ export HOD_CONF_DIR=~/hod-config
$ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"
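On success, the generated file can be inspected to find the allocated
cluster's daemons. The host names and ports below are only illustrative;
the property names are the standard JobTracker and NameNode addresses used
by Hadoop releases of this vintage:
$ cat ~/hadoop-cluster/hadoop-site.xml
...
<property>
  <name>mapred.job.tracker</name>
  <value>compute-01.example.com:50321</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>compute-02.example.com:50317</value>
</property>
...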
4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------
Now, one can run Hadoop jobs using the allocated cluster in the usual manner:
hadoop --config cluster_dir hadoop_command hadoop_command_args
Continuing our example, the following command will run a wordcount example on
the allocated cluster:
$ hadoop --config ~/hadoop-cluster jar \
/path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output
4.1.3 Operation deallocate
--------------------------
The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:
hod -o "deallocate cluster_dir"
Continuing our example, the following command will deallocate the cluster:
$ hod -o "deallocate ~/hadoop-cluster"
4.2 Command Line Options
------------------------
This section covers the major command line options available via the hod
command:
--help
Prints the help message describing the basic options.
--verbose-help
All configuration options provided in the hodrc file can be passed on the
command line, using the syntax --section_name.option_name[=value]. When
provided this way, the value provided on command line overrides the option
provided in hodrc. The verbose-help command lists all the available options in
the hodrc file. This is also a nice way to see the meaning of the
configuration options.
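For example, assuming hodrc defines a 'debug' option in its 'hod' section
(as the sample configuration does), that option could be overridden for a
single invocation as follows:
$ hod -c ~/hod-config/hodrc --hod.debug=4 -o "list"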
-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, alleviating the need to
specify the configuration file in each HOD command.
-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD. 4 is
most verbose.
-o "help"
Lists the operations available in the operation mode.
-o "allocate cluster_dir number_of_nodes"
Allocates a cluster on the given number of cluster nodes, and stores the
allocation information in cluster_dir for use with subsequent hadoop commands.
Note that the cluster_dir must exist before running the command.
-o "list"
Lists the clusters allocated by this user. Information provided includes the
Torque job id corresponding to the cluster, the cluster directory where the
allocation information is stored, and whether the Map/Reduce daemon is still
active or not.
-o "info cluster_dir"
Lists information about the cluster whose allocation information is stored in
the specified cluster directory.
-o "deallocate cluster_dir"
Deallocates the cluster whose allocation information is stored in the
specified cluster directory.
-t hadoop_tarball
Provisions Hadoop from the given tar.gz file. This option is only applicable
to the allocate operation. For better distribution performance it is
recommended that the Hadoop tarball contain only the libraries and binaries,
and not the source or documentation.
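For instance, a slimmed-down tarball could be built as shown below; the
version directory 'hadoop-0.15.3' is illustrative, and 'docs' and 'src'
are the usual candidates to leave out:
$ cd /path/to/hadoop/installs
$ tar -czf ~/share/hadoop.tar.gz \
    --exclude='hadoop-0.15.3/docs' --exclude='hadoop-0.15.3/src' \
    hadoop-0.15.3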
-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.
-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.
-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from where jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
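These three options can be combined with an allocate operation. In the
following sketch the parameter names (mapred.tasktracker.tasks.maximum,
dfs.replication, mapred.reduce.tasks) are examples of Hadoop configuration
keys of this vintage, and the values are illustrative:
$ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz \
    -Mmapred.tasktracker.tasks.maximum=2 \
    -Hdfs.replication=2 \
    -Cmapred.reduce.tasks=4 \
    -o "allocate ~/hadoop-cluster 10"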