<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Cluster Setup</title>
</header>
<body>
<section>
<title>Purpose</title>
<p>This document describes how to install, configure and manage non-trivial
Hadoop clusters ranging from a few nodes to extremely large clusters with
thousands of nodes.</p>
<p>
To play with Hadoop, you may first want to install Hadoop on a single machine (see <a href="ext:single-node-setup"> Hadoop Quick Start</a>).
</p>
</section>
<section>
<title>Pre-requisites</title>
<ol>
<li>
Make sure all <a href="ext:single-node-setup/PreReqs">requisite</a> software
is installed on all nodes in your cluster.
</li>
<li>
<a href="ext:single-node-setup/Download">Get</a> the Hadoop software.
</li>
</ol>
</section>
<section>
<title>Installation</title>
<p>Installing a Hadoop cluster typically involves unpacking the software
on all the machines in the cluster.</p>
<p>Typically one machine in the cluster is designated as the
    <code>NameNode</code> and another machine as the <code>JobTracker</code>,
exclusively. These are the <em>masters</em>. The rest of the machines in
the cluster act as both <code>DataNode</code> <em>and</em>
<code>TaskTracker</code>. These are the <em>slaves</em>.</p>
<p>The root of the distribution is referred to as
<code>HADOOP_HOME</code>. All machines in the cluster usually have the same
<code>HADOOP_HOME</code> path.</p>
</section>
<section>
<title>Configuration</title>
<p>The following sections describe how to configure a Hadoop cluster.</p>
<section>
<title>Configuration Files</title>
<p>Hadoop configuration is driven by two types of important
configuration files:</p>
<ol>
<li>
Read-only default configuration -
<a href="ext:common-default">src/core/core-default.xml</a>,
<a href="ext:hdfs-default">src/hdfs/hdfs-default.xml</a>,
<a href="ext:mapred-default">src/mapred/mapred-default.xml</a> and
<a href="ext:mapred-queues">conf/mapred-queues.xml.template</a>.
</li>
<li>
Site-specific configuration -
<a href="#core-site.xml">conf/core-site.xml</a>,
<a href="#hdfs-site.xml">conf/hdfs-site.xml</a>,
<a href="#mapred-site.xml">conf/mapred-site.xml</a> and
<a href="#mapred-queues.xml">conf/mapred-queues.xml</a>.
</li>
</ol>
<p>To learn more about how the Hadoop framework is controlled by these
configuration files, look
<a href="ext:api/org/apache/hadoop/conf/configuration">here</a>.</p>
        <p>Additionally, you can control the Hadoop scripts found in the
        <code>bin/</code> directory of the distribution by setting site-specific
        values via <code>conf/hadoop-env.sh</code>.</p>
</section>
<section>
<title>Site Configuration</title>
<p>To configure the Hadoop cluster you will need to configure the
<em>environment</em> in which the Hadoop daemons execute as well as
the <em>configuration parameters</em> for the Hadoop daemons.</p>
<p>The Hadoop daemons are <code>NameNode</code>/<code>DataNode</code>
and <code>JobTracker</code>/<code>TaskTracker</code>.</p>
<section>
<title>Configuring the Environment of the Hadoop Daemons</title>
<p>Administrators should use the <code>conf/hadoop-env.sh</code> script
to do site-specific customization of the Hadoop daemons' process
environment.</p>
          <p>At the very least you should specify <code>JAVA_HOME</code>
          so that it is correctly defined on each remote node.</p>
          <p>Administrators can configure individual daemons using the
          configuration options <code>HADOOP_*_OPTS</code>. The available
          options are shown in the table below.</p>
<table>
<tr><th>Daemon</th><th>Configure Options</th></tr>
<tr><td>NameNode</td><td>HADOOP_NAMENODE_OPTS</td></tr>
<tr><td>DataNode</td><td>HADOOP_DATANODE_OPTS</td></tr>
<tr><td>SecondaryNamenode</td>
<td>HADOOP_SECONDARYNAMENODE_OPTS</td></tr>
<tr><td>JobTracker</td><td>HADOOP_JOBTRACKER_OPTS</td></tr>
<tr><td>TaskTracker</td><td>HADOOP_TASKTRACKER_OPTS</td></tr>
</table>
          <p> For example, to configure the NameNode to use parallel GC, the
          following statement should be added to <code>hadoop-env.sh</code>:
<br/><code>
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
</code><br/></p>
<p>Other useful configuration parameters that you can customize
include:</p>
<ul>
<li>
<code>HADOOP_LOG_DIR</code> - The directory where the daemons'
log files are stored. They are automatically created if they don't
exist.
</li>
            <li>
              <code>HADOOP_HEAPSIZE</code> - The maximum heap size to use,
              in MB, e.g. <code>1000</code>. This is used to configure the
              heap size for the Hadoop daemons. By default, the value is
              <code>1000</code> (i.e. 1000 MB).
            </li>
</ul>
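          <p>Putting these together, a minimal site-specific
          <code>conf/hadoop-env.sh</code> customization might look like the
          following sketch. The paths and sizes shown are illustrative
          examples, not recommendations; adjust them for your environment.</p>
<source>
# Required: point to the root of your Java installation.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Illustrative site-specific settings.
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_HEAPSIZE=2000
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
</source>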
</section>
<section>
<title>Configuring the Hadoop Daemons</title>
<p>This section deals with important parameters to be specified in the
following:</p>
<anchor id="core-site.xml"/><p><code>conf/core-site.xml</code>:</p>
<table>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td>fs.default.name</td>
<td>URI of <code>NameNode</code>.</td>
<td><em>hdfs://hostname/</em></td>
</tr>
</table>
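          <p>For example, a minimal <code>conf/core-site.xml</code> might look
          like the following sketch; the hostname and port are placeholders
          for your NameNode's address.</p>
<source>
&lt;?xml version="1.0"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://namenode.example.com:9000/&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</source>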
<anchor id="hdfs-site.xml"/><p><code>conf/hdfs-site.xml</code>:</p>
<table>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td>dfs.name.dir</td>
<td>
Path on the local filesystem where the <code>NameNode</code>
stores the namespace and transactions logs persistently.</td>
<td>
If this is a comma-delimited list of directories then the name
table is replicated in all of the directories, for redundancy.
</td>
</tr>
<tr>
<td>dfs.data.dir</td>
<td>
Comma separated list of paths on the local filesystem of a
<code>DataNode</code> where it should store its blocks.
</td>
<td>
If this is a comma-delimited list of directories, then data will
be stored in all named directories, typically on different
devices.
</td>
</tr>
</table>
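          <p>As a sketch, a <code>conf/hdfs-site.xml</code> using
          comma-delimited lists for both parameters might look as follows;
          the directory paths are hypothetical.</p>
<source>
&lt;?xml version="1.0"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.name.dir&lt;/name&gt;
    &lt;!-- The name table is replicated to both directories for redundancy. --&gt;
    &lt;value&gt;/disk1/hdfs/name,/disk2/hdfs/name&lt;/value&gt;
  &lt;/property&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.data.dir&lt;/name&gt;
    &lt;!-- Blocks are spread across directories, typically one per device. --&gt;
    &lt;value&gt;/disk1/hdfs/data,/disk2/hdfs/data&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</source>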
<anchor id="mapred-site.xml"/><p><code>conf/mapred-site.xml</code>:</p>
<table>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td>mapreduce.jobtracker.address</td>
<td>Host or IP and port of <code>JobTracker</code>.</td>
<td><em>host:port</em> pair.</td>
</tr>
<tr>
<td>mapreduce.jobtracker.system.dir</td>
<td>
                  Path on HDFS where the Map/Reduce framework stores
                  system files, e.g. <code>/hadoop/mapred/system/</code>.
</td>
<td>
This is in the default filesystem (HDFS) and must be accessible
from both the server and client machines.
</td>
</tr>
<tr>
<td>mapreduce.cluster.local.dir</td>
<td>
Comma-separated list of paths on the local filesystem where
temporary Map/Reduce data is written.
</td>
<td>Multiple paths help spread disk i/o.</td>
</tr>
<tr>
<td>mapred.tasktracker.{map|reduce}.tasks.maximum</td>
<td>
                  The maximum number of map and reduce tasks run
                  simultaneously on a given <code>TaskTracker</code>,
                  configured individually for maps and reduces.
</td>
<td>
Defaults to 2 (2 maps and 2 reduces), but vary it depending on
your hardware.
</td>
</tr>
<tr>
<td>dfs.hosts/dfs.hosts.exclude</td>
<td>List of permitted/excluded DataNodes.</td>
<td>
If necessary, use these files to control the list of allowable
datanodes.
</td>
</tr>
<tr>
<td>mapreduce.jobtracker.hosts.filename/mapreduce.jobtracker.hosts.exclude.filename</td>
<td>List of permitted/excluded TaskTrackers.</td>
<td>
If necessary, use these files to control the list of allowable
TaskTrackers.
</td>
</tr>
</table>
          <p>Typically all the above parameters are marked as
          <a href="ext:api/org/apache/hadoop/conf/configuration/final_parameters">
          final</a> to ensure that they cannot be overridden by user applications.
          </p>
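          <p>For illustration, a <code>conf/mapred-site.xml</code> combining
          some of the above parameters, marked as final, might look like the
          following sketch; the hostname, port and paths are placeholders.</p>
<source>
&lt;?xml version="1.0"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;mapreduce.jobtracker.address&lt;/name&gt;
    &lt;value&gt;jobtracker.example.com:9001&lt;/value&gt;
    &lt;final&gt;true&lt;/final&gt;
  &lt;/property&gt;
  &lt;property&gt;
    &lt;name&gt;mapreduce.cluster.local.dir&lt;/name&gt;
    &lt;value&gt;/disk1/mapred/local,/disk2/mapred/local&lt;/value&gt;
    &lt;final&gt;true&lt;/final&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</source>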
<anchor id="mapred-queues.xml"/><p><code>conf/mapred-queues.xml
</code>:</p>
<p>This file is used to configure the queues in the Map/Reduce
system. Queues are abstract entities in the JobTracker that can be
used to manage collections of jobs. They provide a way for
administrators to organize jobs in specific ways and to enforce
certain policies on such collections, thus providing varying
levels of administrative control and management functions on jobs.
</p>
<p>One can imagine the following sample scenarios:</p>
<ul>
<li> Jobs submitted by a particular group of users can all be
submitted to one queue. </li>
<li> Long running jobs in an organization can be submitted to a
queue. </li>
<li> Short running jobs can be submitted to a queue and the number
of jobs that can run concurrently can be restricted. </li>
</ul>
<p>The usage of queues is closely tied to the scheduler configured
at the JobTracker via <em>mapreduce.jobtracker.taskscheduler</em>.
The degree of support of queues depends on the scheduler used. Some
schedulers support a single queue, while others support more complex
configurations. Schedulers also implement the policies that apply
to jobs in a queue. Some schedulers, such as the Fairshare scheduler,
implement their own mechanisms for collections of jobs and do not rely
on queues provided by the framework. The administrators are
encouraged to refer to the documentation of the scheduler they are
interested in for determining the level of support for queues.</p>
<p>The Map/Reduce framework supports some basic operations on queues
such as job submission to a specific queue, access control for queues,
queue states, viewing configured queues and their properties
and refresh of queue properties. In order to fully implement some of
these operations, the framework takes the help of the configured
scheduler.</p>
<p>The following types of queue configurations are possible:</p>
<ul>
            <li> Single queue: The default configuration in Map/Reduce comprises
            a single queue, as supported by the default scheduler. All jobs
            are submitted to this default queue, which maintains jobs in a
            priority-based FIFO order.</li>
<li> Multiple single level queues: Multiple queues are defined, and
jobs can be submitted to any of these queues. Different policies
can be applied to these queues by schedulers that support this
configuration to provide a better level of support. For example,
the <a href="capacity_scheduler.html">capacity scheduler</a>
provides ways of configuring different
capacity and fairness guarantees on these queues.</li>
<li> Hierarchical queues: Hierarchical queues are a configuration in
which queues can contain other queues within them recursively. The
queues that contain other queues are referred to as
container queues. Queues that do not contain other queues are
            referred to as leaf or job queues. Jobs can only be submitted to leaf
queues. Hierarchical queues can potentially offer a higher level
of control to administrators, as schedulers can now build a
hierarchy of policies where policies applicable to a container
queue can provide context for policies applicable to queues it
contains. It also opens up possibilities for delegating queue
administration where administration of queues in a container queue
can be turned over to a different set of administrators, within
the context provided by the container queue. For example, the
<a href="capacity_scheduler.html">capacity scheduler</a>
            uses hierarchical queues to partition the capacity of a cluster
            among container queues, allowing the queues they contain to divide
            that capacity further.</li>
</ul>
<p>Most of the configuration of the queues can be refreshed/reloaded
without restarting the Map/Reduce sub-system by editing this
configuration file as described in the section on
<a href="commands_manual.html#RefreshQueues">reloading queue
configuration</a>.
          Of course, not all configuration properties can be reloaded,
          as the description of each property below explains.</p>
          <p>The format of conf/mapred-queues.xml is different from that of
          the other configuration files, as it uses nested configuration
          elements to express hierarchical queues. The format is as follows:
</p>
<source>
&lt;queues aclsEnabled="$aclsEnabled"&gt;
  &lt;queue&gt;
    &lt;name&gt;$queue-name&lt;/name&gt;
    &lt;state&gt;$state&lt;/state&gt;
    &lt;queue&gt;
      &lt;name&gt;$child-queue1&lt;/name&gt;
      &lt;properties&gt;
        &lt;property key="$key" value="$value"/&gt;
        ...
      &lt;/properties&gt;
      &lt;queue&gt;
        &lt;name&gt;$grand-child-queue1&lt;/name&gt;
        ...
      &lt;/queue&gt;
    &lt;/queue&gt;
    &lt;queue&gt;
      &lt;name&gt;$child-queue2&lt;/name&gt;
      ...
    &lt;/queue&gt;
    ...
    ...
    ...
    &lt;queue&gt;
      &lt;name&gt;$leaf-queue&lt;/name&gt;
      &lt;acl-submit-job&gt;$acls&lt;/acl-submit-job&gt;
      &lt;acl-administer-jobs&gt;$acls&lt;/acl-administer-jobs&gt;
      &lt;properties&gt;
        &lt;property key="$key" value="$value"/&gt;
        ...
      &lt;/properties&gt;
    &lt;/queue&gt;
  &lt;/queue&gt;
&lt;/queues&gt;
</source>
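          <p>As a concrete sketch of this format, the following defines two
          hypothetical queues: a plain <em>default</em> queue and a stopped
          <em>research</em> queue with ACLs. The names and values are
          examples only.</p>
<source>
&lt;queues aclsEnabled="true"&gt;
  &lt;queue&gt;
    &lt;name&gt;default&lt;/name&gt;
    &lt;state&gt;running&lt;/state&gt;
  &lt;/queue&gt;
  &lt;queue&gt;
    &lt;name&gt;research&lt;/name&gt;
    &lt;state&gt;stopped&lt;/state&gt;
    &lt;acl-submit-job&gt;user1,user2 group1&lt;/acl-submit-job&gt;
    &lt;acl-administer-jobs&gt; group1&lt;/acl-administer-jobs&gt;
  &lt;/queue&gt;
&lt;/queues&gt;
</source>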
<table>
<tr>
<th>Tag/Attribute</th>
<th>Value</th>
<th>
<a href="commands_manual.html#RefreshQueues">Refresh-able?</a>
</th>
<th>Notes</th>
</tr>
<tr>
<td><anchor id="queues_tag"/>queues</td>
<td>Root element of the configuration file.</td>
            <td>Not applicable</td>
<td>All the queues are nested inside this root element of the
file. There can be only one root queues element in the file.</td>
</tr>
<tr>
<td>aclsEnabled</td>
<td>Boolean attribute to the
<a href="#queues_tag"><em>&lt;queues&gt;</em></a> tag
specifying whether ACLs are supported for controlling job
submission and administration for <em>all</em> the queues
configured.
</td>
<td>Yes</td>
<td>If <em>false</em>, ACLs are ignored for <em>all</em> the
configured queues. <br/><br/>
If <em>true</em>, the user and group details of the user
are checked against the configured ACLs of the corresponding
job-queue while submitting and administering jobs. ACLs can be
specified for each queue using the queue-specific tags
"acl-$acl_name", defined below. ACLs are checked only against
the job-queues, i.e. the leaf-level queues; ACLs configured
for the rest of the queues in the hierarchy are ignored.
</td>
</tr>
<tr>
<td><anchor id="queue_tag"/>queue</td>
<td>A child element of the
<a href="#queues_tag"><em>&lt;queues&gt;</em></a> tag or another
<a href="#queue_tag"><em>&lt;queue&gt;</em></a>. Denotes a queue
in the system.
</td>
<td>Not applicable</td>
<td>Queues can be hierarchical and so this element can contain
children of this same type.</td>
</tr>
<tr>
<td>name</td>
<td>Child element of a
<a href="#queue_tag"><em>&lt;queue&gt;</em></a> specifying the
name of the queue.</td>
<td>No</td>
<td>Name of the queue cannot contain the character <em>":"</em>
which is reserved as the queue-name delimiter when addressing a
queue in a hierarchy.</td>
</tr>
<tr>
<td>state</td>
<td>Child element of a
<a href="#queue_tag"><em>&lt;queue&gt;</em></a> specifying the
state of the queue.
</td>
<td>Yes</td>
<td>Each queue has a corresponding state. A queue in
<em>'running'</em> state can accept new jobs, while a queue in
<em>'stopped'</em> state will stop accepting any new jobs. State
is defined and respected by the framework only for the
leaf-level queues and is ignored for all other queues.
<br/><br/>
              The state of the queue can be viewed from the command line using
              the <code>'bin/mapred queue'</code> command and also on the Web
              UI.<br/><br/>
Administrators can stop and start queues at runtime using the
feature of <a href="commands_manual.html#RefreshQueues">reloading
queue configuration</a>. If a queue is stopped at runtime, it
will complete all the existing running jobs and will stop
accepting any new jobs.
</td>
</tr>
<tr>
<td>acl-submit-job</td>
<td>Child element of a
<a href="#queue_tag"><em>&lt;queue&gt;</em></a> specifying the
list of users and groups that can submit jobs to the specified
queue.</td>
<td>Yes</td>
<td>
Applicable only to leaf-queues.<br/><br/>
              The lists of users and groups are both comma-separated
              lists of names. The two lists are separated by a blank.
              Example: <em>user1,user2 group1,group2</em>.
              If you wish to specify only a list of groups, provide
              a blank at the beginning of the value.
<br/><br/>
</td>
</tr>
<tr>
            <td>acl-administer-jobs</td>
<td>Child element of a
<a href="#queue_tag"><em>&lt;queue&gt;</em></a> specifying the
list of users and groups that can change the priority of a job
or kill a job that has been submitted to the specified queue.
</td>
<td>Yes</td>
<td>
Applicable only to leaf-queues.<br/><br/>
              The lists of users and groups are both comma-separated
              lists of names. The two lists are separated by a blank.
              Example: <em>user1,user2 group1,group2</em>.
              If you wish to specify only a list of groups, provide
              a blank at the beginning of the value. Note that the
              owner of a job can always change the priority of, or kill,
              his/her own job, irrespective of the ACLs.
</td>
</tr>
<tr>
<td><anchor id="properties_tag"/>properties</td>
<td>Child element of a
<a href="#queue_tag"><em>&lt;queue&gt;</em></a> specifying the
scheduler specific properties.</td>
<td>Not applicable</td>
<td>The scheduler specific properties are the children of this
element specified as a group of &lt;property&gt; tags described
below. The JobTracker completely ignores these properties. These
can be used as per-queue properties needed by the scheduler
being configured. Please look at the scheduler specific
documentation as to how these properties are used by that
particular scheduler.
</td>
</tr>
<tr>
<td><anchor id="property_tag"/>property</td>
<td>Child element of
<a href="#properties_tag"><em>&lt;properties&gt;</em></a> for a
specific queue.</td>
<td>Not applicable</td>
<td>A single scheduler specific queue-property. Ignored by
the JobTracker and used by the scheduler that is configured.</td>
</tr>
<tr>
<td>key</td>
<td>Attribute of a
<a href="#property_tag"><em>&lt;property&gt;</em></a> for a
specific queue.</td>
<td>Scheduler-specific</td>
<td>The name of a single scheduler specific queue-property.</td>
</tr>
<tr>
<td>value</td>
<td>Attribute of a
<a href="#property_tag"><em>&lt;property&gt;</em></a> for a
specific queue.</td>
<td>Scheduler-specific</td>
<td>The value of a single scheduler specific queue-property.
The value can be anything that is left for the proper
interpretation by the scheduler that is configured.</td>
</tr>
</table>
<p>Once the queues are configured properly and the Map/Reduce
system is up and running, from the command line one can
<a href="commands_manual.html#QueuesList">get the list
of queues</a> and
<a href="commands_manual.html#QueuesInfo">obtain
information specific to each queue</a>. This information is also
available from the web UI. On the web UI, queue information can be
seen by going to queueinfo.jsp, linked to from the queues table-cell
in the cluster-summary table. The queueinfo.jsp prints the hierarchy
of queues as well as the specific information for each queue.
</p>
<p> Users can submit jobs only to a
leaf-level queue by specifying the fully-qualified queue-name for
the property name <em>mapreduce.job.queuename</em> in the job
          configuration. The character ':' is the queue-name delimiter. So,
          for example, to submit a job to a configured job-queue 'Queue-C'
          that is a sub-queue of 'Queue-B', which in turn is a sub-queue of
          'Queue-A', the job configuration should set the property
          <em>mapreduce.job.queuename</em> to the value
          <em>Queue-A:Queue-B:Queue-C</em>.</p>
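          <p>For example, assuming the job implements the <code>Tool</code>
          interface, the queue can be selected at submission time from the
          command line; the jar, class and queue names below are
          hypothetical.</p>
<source>
bin/hadoop jar my-job.jar MyJob \
    -Dmapreduce.job.queuename=Queue-A:Queue-B:Queue-C \
    input output
</source>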
</section>
<section>
<title>Real-World Cluster Configurations</title>
<p>This section lists some non-default configuration parameters which
have been used to run the <em>sort</em> benchmark on very large
clusters.</p>
<ul>
<li>
<p>Some non-default configuration values used to run sort900,
that is 9TB of data sorted on a cluster with 900 nodes:</p>
<table>
<tr>
<th>Configuration File</th>
<th>Parameter</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td>conf/hdfs-site.xml</td>
<td>dfs.block.size</td>
<td>134217728</td>
<td>HDFS blocksize of 128MB for large file-systems.</td>
</tr>
<tr>
<td>conf/hdfs-site.xml</td>
<td>dfs.namenode.handler.count</td>
<td>40</td>
<td>
More NameNode server threads to handle RPCs from large
number of DataNodes.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.reduce.shuffle.parallelcopies</td>
<td>20</td>
<td>
Higher number of parallel copies run by reduces to fetch
outputs from very large number of maps.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.map.java.opts</td>
<td>-Xmx512M</td>
<td>
Larger heap-size for child jvms of maps.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.reduce.java.opts</td>
<td>-Xmx512M</td>
<td>
Larger heap-size for child jvms of reduces.
</td>
</tr>
<tr>
<td>conf/core-site.xml</td>
<td>fs.inmemory.size.mb</td>
<td>200</td>
<td>
Larger amount of memory allocated for the in-memory
file-system used to merge map-outputs at the reduces.
</td>
</tr>
<tr>
<td>conf/core-site.xml</td>
<td>mapreduce.task.io.sort.factor</td>
<td>100</td>
<td>More streams merged at once while sorting files.</td>
</tr>
<tr>
<td>conf/core-site.xml</td>
<td>mapreduce.task.io.sort.mb</td>
<td>200</td>
<td>Higher memory-limit while sorting data.</td>
</tr>
<tr>
<td>conf/core-site.xml</td>
<td>io.file.buffer.size</td>
<td>131072</td>
<td>Size of read/write buffer used in SequenceFiles.</td>
</tr>
</table>
</li>
<li>
<p>Updates to some configuration values to run sort1400 and
sort2000, that is 14TB of data sorted on 1400 nodes and 20TB of
data sorted on 2000 nodes:</p>
<table>
<tr>
<th>Configuration File</th>
<th>Parameter</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.jobtracker.handler.count</td>
<td>60</td>
<td>
More JobTracker server threads to handle RPCs from large
number of TaskTrackers.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.reduce.shuffle.parallelcopies</td>
<td>50</td>
<td></td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.tasktracker.http.threads</td>
<td>50</td>
<td>
More worker threads for the TaskTracker's http server. The
http server is used by reduces to fetch intermediate
map-outputs.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.map.java.opts</td>
<td>-Xmx512M</td>
<td>
Larger heap-size for child jvms of maps.
</td>
</tr>
<tr>
<td>conf/mapred-site.xml</td>
<td>mapreduce.reduce.java.opts</td>
<td>-Xmx1024M</td>
<td>Larger heap-size for child jvms of reduces.</td>
</tr>
</table>
</li>
</ul>
</section>
<section>
<title> Memory management</title>
<p>Users/admins can also specify the maximum virtual memory
of the launched child-task, and any sub-process it launches
          recursively, using <code>mapred.{map|reduce}.child.ulimit</code>. Note
          that the value set here is a per-process limit.
          The value for <code>mapred.{map|reduce}.child.ulimit</code> should be
          specified in kilobytes (KB), and it must be greater than or equal
          to the -Xmx value passed to the JVM, else the VM might not start.
          </p>
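          <p>As an illustrative sketch, the following
          <code>mapred-site.xml</code> fragment pairs a 512 MB map-task heap
          with a 1 GB (1048576 KB) ulimit, satisfying the requirement that
          the ulimit be at least the -Xmx value; the numbers are examples
          only.</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapred.map.child.java.opts&lt;/name&gt;
  &lt;value&gt;-Xmx512M&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;!-- 1048576 KB = 1 GB, comfortably above the 512 MB heap. --&gt;
  &lt;name&gt;mapred.map.child.ulimit&lt;/name&gt;
  &lt;value&gt;1048576&lt;/value&gt;
&lt;/property&gt;
</source>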
          <p>Note: <code>mapred.{map|reduce}.child.java.opts</code> are used
          only for configuring the child tasks launched by the TaskTracker.
          Configuring the memory options for daemons is documented in
          <a href="cluster_setup.html#Configuring+the+Environment+of+the+Hadoop+Daemons">
          cluster_setup.html</a>.</p>
<p>The memory available to some parts of the framework is also
configurable. In map and reduce tasks, performance may be influenced
          by adjusting parameters that influence the concurrency of operations
          and the frequency with which data hits disk. Monitoring the filesystem
          counters for a job, particularly relative to byte counts from the map
          and into the reduce, is invaluable to the tuning of these
          parameters.</p>
</section>
<section>
<title> Memory monitoring</title>
          <p>A <code>TaskTracker</code> (TT) can be configured to monitor memory
usage of tasks it spawns, so that badly-behaved jobs do not bring
down a machine due to excess memory consumption. With monitoring
enabled, every task is assigned a task-limit for virtual memory (VMEM).
In addition, every node is assigned a node-limit for VMEM usage.
A TT ensures that a task is killed if it, and
its descendants, use VMEM over the task's per-task limit. It also
          ensures that one or more tasks are killed if the sum total of VMEM
          usage by all tasks, and their descendants, crosses the node-limit.</p>
<p>Users can, optionally, specify the VMEM task-limit per job. If no
such limit is provided, a default limit is used. A node-limit can be
set per node.</p>
          <p>Currently, memory monitoring and management is only supported
          on the Linux platform.</p>
<p>To enable monitoring for a TT, the
following parameters all need to be set:</p>
<table>
<tr><th>Name</th><th>Type</th><th>Description</th></tr>
<tr><td>mapred.tasktracker.vmem.reserved</td><td>long</td>
<td>A number, in bytes, that represents an offset. The total VMEM on
the machine, minus this offset, is the VMEM node-limit for all
tasks, and their descendants, spawned by the TT.
</td></tr>
<tr><td>mapred.task.default.maxvmem</td><td>long</td>
<td>A number, in bytes, that represents the default VMEM task-limit
associated with a task. Unless overridden by a job's setting,
this number defines the VMEM task-limit.
</td></tr>
<tr><td>mapred.task.limit.maxvmem</td><td>long</td>
<td>A number, in bytes, that represents the upper VMEM task-limit
associated with a task. Users, when specifying a VMEM task-limit
for their tasks, should not specify a limit which exceeds this amount.
</td></tr>
</table>
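          <p>A sketch of the three required parameters in
          <code>mapred-site.xml</code> follows; the byte values (an 8 GB
          reserved offset, a 2 GB default task-limit and a 4 GB upper
          task-limit) are illustrative only.</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapred.tasktracker.vmem.reserved&lt;/name&gt;
  &lt;value&gt;8589934592&lt;/value&gt;   &lt;!-- 8 GB offset --&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.task.default.maxvmem&lt;/name&gt;
  &lt;value&gt;2147483648&lt;/value&gt;   &lt;!-- 2 GB default task-limit --&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.task.limit.maxvmem&lt;/name&gt;
  &lt;value&gt;4294967296&lt;/value&gt;   &lt;!-- 4 GB upper task-limit --&gt;
&lt;/property&gt;
</source>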
<p>In addition, the following parameters can also be configured.</p>
<table>
<tr><th>Name</th><th>Type</th><th>Description</th></tr>
<tr><td>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</td>
<td>long</td>
          <td>The time interval, in milliseconds, at which the TT
          checks for any memory violation. The default value is 5000 ms
          (5 seconds).
</td></tr>
</table>
<p>Here's how the memory monitoring works for a TT.</p>
<ol>
          <li>If one or more of the configuration parameters described
          above are missing or set to -1, memory monitoring is
          disabled for the TT.
</li>
<li>In addition, monitoring is disabled if
<code>mapred.task.default.maxvmem</code> is greater than
<code>mapred.task.limit.maxvmem</code>.
</li>
<li>If a TT receives a task whose task-limit is set by the user
to a value larger than <code>mapred.task.limit.maxvmem</code>, it
logs a warning but executes the task.
</li>
<li>Periodically, the TT checks the following:
<ul>
<li>If any task's current VMEM usage is greater than that task's
            VMEM task-limit, the task is killed and the reason for killing
            the task is logged in the task diagnostics. Such a task is considered
            failed, i.e., the killing counts towards the task's failure count.
</li>
<li>If the sum total of VMEM used by all tasks and descendants is
greater than the node-limit, the TT kills enough tasks, in the
            order of least progress made, till the overall VMEM usage falls
            below the node-limit. Such killed tasks are not considered failed
and their killing does not count towards the tasks' failure counts.
</li>
</ul>
</li>
</ol>
<p>Schedulers can choose to ease the monitoring pressure on the TT by
preventing too many tasks from running on a node and by scheduling
          tasks only if the TT has enough VMEM free. In addition, schedulers may
          choose to consider the physical memory (RAM) available on the node
          as well. To enable scheduler support, TTs report their memory settings
          to the JobTracker in every heartbeat. Before getting into details,
          consider the following additional memory-related parameters that can be
          configured to enable better scheduling:</p>
<table>
<tr><th>Name</th><th>Type</th><th>Description</th></tr>
<tr><td>mapred.tasktracker.pmem.reserved</td><td>int</td>
<td>A number, in bytes, that represents an offset. The total
physical memory (RAM) on the machine, minus this offset, is the
          recommended RAM node-limit. The RAM node-limit is a hint to a
          scheduler to schedule only so many tasks that the sum
          total of their RAM requirements does not exceed this limit.
RAM usage is not monitored by a TT.
</td></tr>
</table>
<p>A TT reports the following memory-related numbers in every
heartbeat:</p>
<ul>
<li>The total VMEM available on the node.</li>
<li>The value of <code>mapred.tasktracker.vmem.reserved</code>,
if set.</li>
<li>The total RAM available on the node.</li>
<li>The value of <code>mapred.tasktracker.pmem.reserved</code>,
if set.</li>
</ul>
</section>
<section>
<title>Task Controllers</title>
<p>Task controllers are classes in the Hadoop Map/Reduce
        framework that define how users' map and reduce tasks
are launched and controlled. They can
be used in clusters that require some customization in
the process of launching or controlling the user tasks.
For example, in some
clusters, there may be a requirement to run tasks as
the user who submitted the job, instead of as the task
tracker user, which is how tasks are launched by default.
This section describes how to configure and use
task controllers.</p>
        <p>The following task controllers are available in
        Hadoop.
</p>
<table>
<tr><th>Name</th><th>Class Name</th><th>Description</th></tr>
<tr>
<td>DefaultTaskController</td>
<td>org.apache.hadoop.mapred.DefaultTaskController</td>
<td> The default task controller which Hadoop uses to manage task
execution. The tasks run as the task tracker user.</td>
</tr>
<tr>
<td>LinuxTaskController</td>
<td>org.apache.hadoop.mapred.LinuxTaskController</td>
<td>This task controller, which is supported only on Linux,
runs the tasks as the user who submitted the job. It requires
these user accounts to be created on the cluster nodes
where the tasks are launched. It
uses a setuid executable that is included in the Hadoop
distribution. The task tracker uses this executable to
launch and kill tasks. The setuid executable switches to
the user who has submitted the job and launches or kills
the tasks. For maximum security, this task controller
sets up restricted permissions and user/group ownership of
local files and directories used by the tasks such as the
job jar files, intermediate files, task log files and distributed
        cache files. Note in particular that, because of this, no user
        other than the job owner and the TaskTracker can access any of the
        local files/directories, including those localized as part of the
        distributed cache.
</td>
</tr>
</table>
<section>
<title>Configuring Task Controllers</title>
        <p>The task controller to be used can be configured by setting the
        value of the following key in mapred-site.xml:</p>
<table>
<tr>
<th>Property</th><th>Value</th><th>Notes</th>
</tr>
<tr>
<td>mapreduce.tasktracker.taskcontroller</td>
<td>Fully qualified class name of the task controller class</td>
<td>Currently there are two implementations of task controller
in the Hadoop system, DefaultTaskController and LinuxTaskController.
Refer to the class names mentioned above to determine the value
to set for the class of choice.
</td>
</tr>
</table>
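        <p>For example, to select the LinuxTaskController, the following
        fragment would be added to <code>mapred-site.xml</code>:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapreduce.tasktracker.taskcontroller&lt;/name&gt;
  &lt;value&gt;org.apache.hadoop.mapred.LinuxTaskController&lt;/value&gt;
&lt;/property&gt;
</source>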
</section>
<section>
<title>Using the LinuxTaskController</title>
<p>This section of the document describes the steps required to
use the LinuxTaskController.</p>
<p>In order to use the LinuxTaskController, a setuid executable
should be built and deployed on the compute nodes. The
executable is named task-controller. To build the executable,
execute
<em>ant task-controller -Dhadoop.conf.dir=/path/to/conf/dir.
</em>
The path passed in <em>-Dhadoop.conf.dir</em> should be the path
on the cluster nodes where a configuration file for the setuid
executable would be located. The executable would be built to
<em>build.dir/dist.dir/bin</em> and should be installed to
<em>$HADOOP_HOME/bin</em>.
</p>
<p>
The executable must have specific permissions: <em>6050 or
--Sr-s---</em>, user-owned by root (the superuser) and group-owned
by a group of which the TaskTracker's user is the sole member.
For example, say the TaskTracker runs as user
<em>mapred</em>, who belongs to the groups <em>users</em> and
<em>mapredGroup</em>, either of which may be the primary group.
Say also that <em>users</em> has both <em>mapred</em> and
another user <em>X</em> as its members, while <em>mapredGroup</em>
has only <em>mapred</em> as its member. Going by the above
description, the setuid/setgid executable should be set to
<em>6050 or --Sr-s---</em>, user-owned by root and
group-owned by <em>mapredGroup</em>, which has
only <em>mapred</em> as its member (and not <em>users</em>, which
also has <em>X</em> as a member besides <em>mapred</em>).
</p>
<p>The executable requires a configuration file called
<em>taskcontroller.cfg</em> to be
present in the configuration directory passed to the ant target
mentioned above. If the binary was not built with a specific
conf directory, the path defaults to
<em>/path-to-binary/../conf</em>. The configuration file must be
owned by the user running the TaskTracker (user <em>mapred</em> in the
above example), may be group-owned by any group, and must have the
permissions <em>0400 or r--------</em>.
</p>
<p>The executable requires the following configuration items to be
present in the <em>taskcontroller.cfg</em> file, specified as simple
<em>key=value</em> pairs.
</p>
<table><tr><th>Name</th><th>Description</th></tr>
<tr>
<td>mapreduce.cluster.local.dir</td>
<td>Path to the Map/Reduce local directories. Must be the same as the
value provided for this key in mapred-site.xml. This is required to
validate paths passed to the setuid executable, in order to prevent
arbitrary paths being passed to it.</td>
</tr>
<tr>
<td>hadoop.log.dir</td>
<td>Path to the hadoop log directory. Must be the same as the value
the TaskTracker is started with. This is required to set proper
permissions on the log files so that they can be written to by the user's
tasks and read by the TaskTracker for serving on the web UI.</td>
</tr>
</table>
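<p>As an illustrative sketch, a <em>taskcontroller.cfg</em> might look
like the following; the paths shown are example values and should match
your cluster's actual settings:</p>
<source>
mapreduce.cluster.local.dir=/hadoop/mapred/local
hadoop.log.dir=/var/log/hadoop
</source>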
<p>
The LinuxTaskController requires that the directories specified in
<em>mapreduce.cluster.local.dir</em> and <em>hadoop.log.dir</em>, and
all paths leading up to them, have 755 permissions.
</p>
</section>
</section>
<section>
<title>Monitoring Health of TaskTracker Nodes</title>
<p>Hadoop Map/Reduce provides a mechanism by which administrators
can configure the TaskTracker to run an administrator supplied
script periodically to determine if a node is healthy or not.
Administrators can determine if the node is in a healthy state
by performing any checks of their choice in the script. If the
script detects the node to be in an unhealthy state, it must print
a line to standard output beginning with the string <em>ERROR</em>.
The TaskTracker spawns the script periodically and checks its
output. If the script's output contains the string <em>ERROR</em>,
as described above, the node's status is reported as 'unhealthy'
and the node is blacklisted on the JobTracker. No further tasks
will be assigned to this node. However, the
TaskTracker continues to run the script, so that if the node
becomes healthy again, it will be removed from the blacklisted
nodes on the JobTracker automatically. The node's health
along with the output of the script, if it is unhealthy, is
available to the administrator in the JobTracker's web interface.
The time since the node was healthy is also displayed on the
web interface.
</p>
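<p>As a sketch, a health check script typically tests a local resource
and prints a line beginning with <em>ERROR</em> when the check fails.
The script below flags the node unhealthy when disk usage on a mount
point exceeds a threshold; the mount point and threshold are
illustrative assumptions, not Hadoop requirements:</p>
<source>
#!/bin/sh
# Illustrative node health check for the TaskTracker.
# check_disk MOUNT THRESHOLD: print an ERROR line when disk usage
# on MOUNT exceeds THRESHOLD percent.
check_disk() {
  MOUNT="$1"
  THRESHOLD="$2"
  USAGE=$(df -P "$MOUNT" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$USAGE" -gt "$THRESHOLD" ]; then
    # A line beginning with ERROR tells the TaskTracker the node is unhealthy.
    echo "ERROR disk usage on $MOUNT is ${USAGE}%"
  fi
}

check_disk / 90
</source>
<p>A healthy run produces no output; only an <em>ERROR</em> prefix on
standard output matters to the TaskTracker.</p>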
<section>
<title>Configuring the Node Health Check Script</title>
<p>The following parameters can be used to control the node health
monitoring script in <em>mapred-site.xml</em>.</p>
<table>
<tr><th>Name</th><th>Description</th></tr>
<tr><td><code>mapreduce.tasktracker.healthchecker.script.path</code></td>
<td>Absolute path to the script which is periodically run by the
TaskTracker to determine if the node is
healthy or not. The file should be executable by the TaskTracker.
If the value of this key is empty or the file does
not exist or is not executable, node health monitoring
is not started.</td>
</tr>
<tr>
<td><code>mapreduce.tasktracker.healthchecker.interval</code></td>
<td>Frequency at which the node health script is run,
in milliseconds</td>
</tr>
<tr>
<td><code>mapreduce.tasktracker.healthchecker.script.timeout</code></td>
<td>Time after which the node health script will be killed by
the TaskTracker if unresponsive.
The node is marked unhealthy if the node health script times out.</td>
</tr>
<tr>
<td><code>mapreduce.tasktracker.healthchecker.script.args</code></td>
<td>Extra arguments that can be passed to the node health script
when launched.
These should be specified as a comma-separated list of arguments.</td>
</tr>
</table>
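<p>For example, the script and its run interval can be configured in
<em>mapred-site.xml</em> as follows; the script path and the one-minute
interval are illustrative values:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapreduce.tasktracker.healthchecker.script.path&lt;/name&gt;
  &lt;value&gt;/path/to/node_health.sh&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapreduce.tasktracker.healthchecker.interval&lt;/name&gt;
  &lt;value&gt;60000&lt;/value&gt;
&lt;/property&gt;
</source>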
</section>
</section>
</section>
<section>
<title>Slaves</title>
<p>Typically you choose one machine in the cluster to act as the
<code>NameNode</code> and one machine to act as the
<code>JobTracker</code>, exclusively. The rest of the machines act as
both a <code>DataNode</code> and <code>TaskTracker</code> and are
referred to as <em>slaves</em>.</p>
<p>List all slave hostnames or IP addresses in your
<code>conf/slaves</code> file, one per line.</p>
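<p>For example, a <code>conf/slaves</code> file for a three-slave
cluster might look like this (the hostnames and address are
illustrative):</p>
<source>
slave01.example.com
slave02.example.com
10.1.2.3
</source>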
</section>
<section>
<title>Logging</title>
<p>Hadoop uses <a href="http://logging.apache.org/log4j/">Apache
log4j</a> via the <a href="http://commons.apache.org/logging/">Apache
Commons Logging</a> framework for logging. Edit the
<code>conf/log4j.properties</code> file to customize the Hadoop
daemons' logging configuration (log-formats and so on).</p>
<section>
<title>History Logging</title>
<p>Job history files are stored in the central location specified by
<code>mapreduce.jobtracker.jobhistory.location</code>, which can also
be on DFS; its default value is <code>${HADOOP_LOG_DIR}/history</code>.
The history web UI is accessible from the JobTracker web UI.</p>
<p>The history files are also logged to the user-specified directory
<code>mapreduce.job.userhistorylocation</code>,
which defaults to the job output directory. The files are stored under
"_logs/history/" in the specified directory. Hence, by default,
they will be in "mapreduce.output.fileoutputformat.outputdir/_logs/history/".
Users can disable this logging by setting
<code>mapreduce.job.userhistorylocation</code> to the value
<code>none</code>.</p>
<p>Users can view a summary of the history logs in a specified directory
using the following command: <br/>
<code>$ bin/hadoop job -history output-dir</code><br/>
This command prints job details, plus failed and killed tip
details. <br/>
More details about the job, such as successful tasks and the
task attempts made for each task, can be viewed using the
following command: <br/>
<code>$ bin/hadoop job -history all output-dir</code><br/></p>
</section>
</section>
<p>Once all the necessary configuration is complete, distribute the files
to the <code>HADOOP_CONF_DIR</code> directory on all the machines,
typically <code>${HADOOP_HOME}/conf</code>.</p>
</section>
<section>
<title>Cluster Restartability</title>
<section>
<title>Map/Reduce</title>
<p>The JobTracker can recover running jobs on restart if
<code>mapreduce.jobtracker.restart.recover</code> is set to true and
<a href="#Logging">JobHistory logging</a> is enabled. In addition,
<code>mapreduce.jobtracker.jobhistory.block.size</code> should be
set to a value that flushes job history to disk as soon as
possible; a typical value is 3145728 (3MB).</p>
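<p>Putting the two settings together, the corresponding
<em>mapred-site.xml</em> fragment might look like this:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapreduce.jobtracker.restart.recover&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapreduce.jobtracker.jobhistory.block.size&lt;/name&gt;
  &lt;value&gt;3145728&lt;/value&gt;
&lt;/property&gt;
</source>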
</section>
</section>
<section>
<title>Hadoop Rack Awareness</title>
<p>The HDFS and the Map/Reduce components are rack-aware.</p>
<p>The <code>NameNode</code> and the <code>JobTracker</code> obtain the
<code>rack id</code> of the slaves in the cluster by invoking an API
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve">resolve</a>
in an administrator-configured
module. The API resolves the slave's DNS name (or IP address) to a
rack id. The module to use can be configured via the configuration
item <code>topology.node.switch.mapping.impl</code>. The default
implementation runs a script/command configured using
<code>topology.script.file.name</code>. If
<code>topology.script.file.name</code> is
not set, the rack id <code>/default-rack</code> is returned for any
passed IP address. An additional configuration item on the Map/Reduce
side is <code>mapred.cache.task.levels</code>, which determines the number
of levels (in the network topology) of caches. For example, if it has
the default value of 2, two levels of caches will be constructed:
one for hosts (host -> task mapping) and another for racks
(rack -> task mapping).
</p>
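<p>As a sketch, a topology script receives one or more DNS names or IP
addresses as arguments and must print one rack id per argument. The
subnet-to-rack mapping below is an illustrative assumption:</p>
<source>
#!/bin/sh
# Illustrative topology script for topology.script.file.name.
# resolve_rack HOST: print the rack id for one host; the subnet
# scheme used here is an example, not a Hadoop convention.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
}

# One rack id is printed per argument, in order.
for HOST in "$@" ; do
  resolve_rack "$HOST"
done
</source>
<p>Unknown hosts fall through to <code>/default-rack</code>, mirroring
the behavior when no script is configured at all.</p>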
</section>
<section>
<title>Hadoop Startup</title>
<p>To start a Hadoop cluster you will need to start both the HDFS and
Map/Reduce cluster.</p>
<p>
Format a new distributed filesystem:<br/>
<code>$ bin/hadoop namenode -format</code>
</p>
<p>
Start the HDFS with the following command, run on the designated
<code>NameNode</code>:<br/>
<code>$ bin/start-dfs.sh</code>
</p>
<p>The <code>bin/start-dfs.sh</code> script also consults the
<code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>NameNode</code>
and starts the <code>DataNode</code> daemon on all the listed slaves.</p>
<p>
Start Map-Reduce with the following command, run on the designated
<code>JobTracker</code>:<br/>
<code>$ bin/start-mapred.sh</code>
</p>
<p>The <code>bin/start-mapred.sh</code> script also consults the
<code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>JobTracker</code>
and starts the <code>TaskTracker</code> daemon on all the listed slaves.
</p>
</section>
<section>
<title>Hadoop Shutdown</title>
<p>
Stop HDFS with the following command, run on the designated
<code>NameNode</code>:<br/>
<code>$ bin/stop-dfs.sh</code>
</p>
<p>The <code>bin/stop-dfs.sh</code> script also consults the
<code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>NameNode</code>
and stops the <code>DataNode</code> daemon on all the listed slaves.</p>
<p>
Stop Map/Reduce with the following command, run on the designated
<code>JobTracker</code>:<br/>
<code>$ bin/stop-mapred.sh</code><br/>
</p>
<p>The <code>bin/stop-mapred.sh</code> script also consults the
<code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>JobTracker</code>
and stops the <code>TaskTracker</code> daemon on all the listed slaves.</p>
</section>
</body>
</document>