| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <document> |
| |
| <header> |
| <title>Cluster Setup</title> |
| </header> |
| |
| <body> |
| |
| <section> |
| <title>Purpose</title> |
| |
| <p>This document describes how to install, configure and manage non-trivial |
| Hadoop clusters ranging from a few nodes to extremely large clusters with |
| thousands of nodes.</p> |
| <p> |
| To play with Hadoop, you may first want to install Hadoop on a single machine (see <a href="ext:single-node-setup"> Hadoop Quick Start</a>). |
| </p> |
| </section> |
| |
| <section> |
| <title>Pre-requisites</title> |
| |
| <ol> |
| <li> |
| Make sure all <a href="ext:single-node-setup/PreReqs">requisite</a> software |
| is installed on all nodes in your cluster. |
| </li> |
| <li> |
| <a href="ext:single-node-setup/Download">Get</a> the Hadoop software. |
| </li> |
| </ol> |
| </section> |
| |
| <section> |
| <title>Installation</title> |
| |
| <p>Installing a Hadoop cluster typically involves unpacking the software |
| on all the machines in the cluster.</p> |
| |
<p>Typically one machine in the cluster is designated as the
<code>NameNode</code> and another machine as the <code>JobTracker</code>,
exclusively. These are the <em>masters</em>. The rest of the machines in
the cluster act as both <code>DataNode</code> <em>and</em>
<code>TaskTracker</code>. These are the <em>slaves</em>.</p>
| |
| <p>The root of the distribution is referred to as |
| <code>HADOOP_HOME</code>. All machines in the cluster usually have the same |
| <code>HADOOP_HOME</code> path.</p> |
| </section> |
| |
| <section> |
| <title>Configuration</title> |
| |
| <p>The following sections describe how to configure a Hadoop cluster.</p> |
| |
| <section> |
| <title>Configuration Files</title> |
| |
| <p>Hadoop configuration is driven by two types of important |
| configuration files:</p> |
| <ol> |
| <li> |
| Read-only default configuration - |
| <a href="ext:common-default">src/core/core-default.xml</a>, |
| <a href="ext:hdfs-default">src/hdfs/hdfs-default.xml</a>, |
| <a href="ext:mapred-default">src/mapred/mapred-default.xml</a> and |
| <a href="ext:mapred-queues">conf/mapred-queues.xml.template</a>. |
| </li> |
| <li> |
| Site-specific configuration - |
| <a href="#core-site.xml">conf/core-site.xml</a>, |
| <a href="#hdfs-site.xml">conf/hdfs-site.xml</a>, |
| <a href="#mapred-site.xml">conf/mapred-site.xml</a> and |
| <a href="#mapred-queues.xml">conf/mapred-queues.xml</a>. |
| </li> |
| </ol> |
| |
| <p>To learn more about how the Hadoop framework is controlled by these |
| configuration files, look |
| <a href="ext:api/org/apache/hadoop/conf/configuration">here</a>.</p> |
| |
<p>Additionally, you can control the Hadoop scripts found in the
<code>bin/</code> directory of the distribution by setting site-specific
values via <code>conf/hadoop-env.sh</code>.</p>
| </section> |
| |
| <section> |
| <title>Site Configuration</title> |
| |
| <p>To configure the Hadoop cluster you will need to configure the |
| <em>environment</em> in which the Hadoop daemons execute as well as |
| the <em>configuration parameters</em> for the Hadoop daemons.</p> |
| |
| <p>The Hadoop daemons are <code>NameNode</code>/<code>DataNode</code> |
| and <code>JobTracker</code>/<code>TaskTracker</code>.</p> |
| |
| <section> |
| <title>Configuring the Environment of the Hadoop Daemons</title> |
| |
| <p>Administrators should use the <code>conf/hadoop-env.sh</code> script |
| to do site-specific customization of the Hadoop daemons' process |
| environment.</p> |
| |
| <p>At the very least you should specify the |
| <code>JAVA_HOME</code> so that it is correctly defined on each |
| remote node.</p> |
| |
<p>Administrators can configure individual daemons using the
configuration options <code>HADOOP_*_OPTS</code>. The available
options are shown in the table below.</p>
| <table> |
| <tr><th>Daemon</th><th>Configure Options</th></tr> |
| <tr><td>NameNode</td><td>HADOOP_NAMENODE_OPTS</td></tr> |
| <tr><td>DataNode</td><td>HADOOP_DATANODE_OPTS</td></tr> |
| <tr><td>SecondaryNamenode</td> |
| <td>HADOOP_SECONDARYNAMENODE_OPTS</td></tr> |
| <tr><td>JobTracker</td><td>HADOOP_JOBTRACKER_OPTS</td></tr> |
| <tr><td>TaskTracker</td><td>HADOOP_TASKTRACKER_OPTS</td></tr> |
| </table> |
| |
<p> For example, to configure the NameNode to use parallel GC, the
following statement should be added to <code>hadoop-env.sh</code> :
| <br/><code> |
| export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}" |
| </code><br/></p> |
| |
| <p>Other useful configuration parameters that you can customize |
| include:</p> |
| <ul> |
| <li> |
| <code>HADOOP_LOG_DIR</code> - The directory where the daemons' |
| log files are stored. They are automatically created if they don't |
| exist. |
| </li> |
| <li> |
<code>HADOOP_HEAPSIZE</code> - The maximum heap size to use,
specified in MB, e.g. <code>1000</code>. This is used to
configure the heap size for the Hadoop daemons. By default,
the value is <code>1000</code> MB.
| </li> |
| </ul> |
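<p>For example, a site's <code>conf/hadoop-env.sh</code> might combine
these settings as follows. The paths and heap size shown are
illustrative values for a hypothetical installation, not defaults:</p>
<source>
# Set to the root of your Java installation
export JAVA_HOME=/usr/java/latest

# Store daemon logs outside the distribution directory
export HADOOP_LOG_DIR=/var/log/hadoop

# Give each daemon a 2 GB heap (value is in MB)
export HADOOP_HEAPSIZE=2000
</source>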
| </section> |
| |
| <section> |
| <title>Configuring the Hadoop Daemons</title> |
| |
| <p>This section deals with important parameters to be specified in the |
| following:</p> |
| <anchor id="core-site.xml"/><p><code>conf/core-site.xml</code>:</p> |
| |
| <table> |
| <tr> |
| <th>Parameter</th> |
| <th>Value</th> |
| <th>Notes</th> |
| </tr> |
| <tr> |
| <td>fs.default.name</td> |
| <td>URI of <code>NameNode</code>.</td> |
| <td><em>hdfs://hostname/</em></td> |
| </tr> |
| </table> |
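<p>As a minimal sketch, a <code>conf/core-site.xml</code> might look
like the following; the hostname and port are placeholders for your
own NameNode endpoint:</p>
<source>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
  </property>
</configuration>
</source>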
| |
| <anchor id="hdfs-site.xml"/><p><code>conf/hdfs-site.xml</code>:</p> |
| |
| <table> |
| <tr> |
| <th>Parameter</th> |
| <th>Value</th> |
| <th>Notes</th> |
| </tr> |
| <tr> |
| <td>dfs.name.dir</td> |
| <td> |
| Path on the local filesystem where the <code>NameNode</code> |
| stores the namespace and transactions logs persistently.</td> |
| <td> |
| If this is a comma-delimited list of directories then the name |
| table is replicated in all of the directories, for redundancy. |
| </td> |
| </tr> |
| <tr> |
| <td>dfs.data.dir</td> |
| <td> |
| Comma separated list of paths on the local filesystem of a |
| <code>DataNode</code> where it should store its blocks. |
| </td> |
| <td> |
| If this is a comma-delimited list of directories, then data will |
| be stored in all named directories, typically on different |
| devices. |
| </td> |
| </tr> |
| </table> |
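<p>As an illustration, a <code>conf/hdfs-site.xml</code> that replicates
the name table across two directories and spreads DataNode blocks over
two disks might look like this; all directory paths are examples only:</p>
<source>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
</configuration>
</source>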
| |
| <anchor id="mapred-site.xml"/><p><code>conf/mapred-site.xml</code>:</p> |
| |
| <table> |
| <tr> |
| <th>Parameter</th> |
| <th>Value</th> |
| <th>Notes</th> |
| </tr> |
| <tr> |
| <td>mapreduce.jobtracker.address</td> |
| <td>Host or IP and port of <code>JobTracker</code>.</td> |
| <td><em>host:port</em> pair.</td> |
| </tr> |
| <tr> |
| <td>mapreduce.jobtracker.system.dir</td> |
| <td> |
Path on HDFS where the Map/Reduce framework stores
system files, e.g. <code>/hadoop/mapred/system/</code>.
| </td> |
| <td> |
| This is in the default filesystem (HDFS) and must be accessible |
| from both the server and client machines. |
| </td> |
| </tr> |
| <tr> |
| <td>mapreduce.cluster.local.dir</td> |
| <td> |
| Comma-separated list of paths on the local filesystem where |
| temporary Map/Reduce data is written. |
| </td> |
| <td>Multiple paths help spread disk i/o.</td> |
| </tr> |
| <tr> |
| <td>mapred.tasktracker.{map|reduce}.tasks.maximum</td> |
| <td> |
The maximum number of map and reduce tasks, respectively,
that are run simultaneously on a given <code>TaskTracker</code>.
| </td> |
| <td> |
| Defaults to 2 (2 maps and 2 reduces), but vary it depending on |
| your hardware. |
| </td> |
| </tr> |
| <tr> |
| <td>dfs.hosts/dfs.hosts.exclude</td> |
| <td>List of permitted/excluded DataNodes.</td> |
| <td> |
| If necessary, use these files to control the list of allowable |
| datanodes. |
| </td> |
| </tr> |
| <tr> |
| <td>mapreduce.jobtracker.hosts.filename/mapreduce.jobtracker.hosts.exclude.filename</td> |
| <td>List of permitted/excluded TaskTrackers.</td> |
| <td> |
| If necessary, use these files to control the list of allowable |
| TaskTrackers. |
| </td> |
| </tr> |
| </table> |
| |
| <p>Typically all the above parameters are marked as |
| <a href="ext:api/org/apache/hadoop/conf/configuration/final_parameters"> |
final</a> to ensure that they cannot be overridden by user-applications.
| </p> |
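<p>A site-specific parameter is marked final by adding a
<final>true</final> element to its <property> definition in the
site configuration file, as in this sketch (the hostname is a
placeholder):</p>
<source>
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>jobtracker.example.com:9001</value>
  <final>true</final>
</property>
</source>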
| |
| <anchor id="mapred-queues.xml"/><p><code>conf/mapred-queues.xml |
| </code>:</p> |
| <p>This file is used to configure the queues in the Map/Reduce |
| system. Queues are abstract entities in the JobTracker that can be |
| used to manage collections of jobs. They provide a way for |
| administrators to organize jobs in specific ways and to enforce |
| certain policies on such collections, thus providing varying |
| levels of administrative control and management functions on jobs. |
| </p> |
| <p>One can imagine the following sample scenarios:</p> |
| <ul> |
| <li> Jobs submitted by a particular group of users can all be |
| submitted to one queue. </li> |
| <li> Long running jobs in an organization can be submitted to a |
| queue. </li> |
| <li> Short running jobs can be submitted to a queue and the number |
| of jobs that can run concurrently can be restricted. </li> |
| </ul> |
| <p>The usage of queues is closely tied to the scheduler configured |
| at the JobTracker via <em>mapreduce.jobtracker.taskscheduler</em>. |
| The degree of support of queues depends on the scheduler used. Some |
| schedulers support a single queue, while others support more complex |
| configurations. Schedulers also implement the policies that apply |
| to jobs in a queue. Some schedulers, such as the Fairshare scheduler, |
| implement their own mechanisms for collections of jobs and do not rely |
| on queues provided by the framework. The administrators are |
| encouraged to refer to the documentation of the scheduler they are |
| interested in for determining the level of support for queues.</p> |
| <p>The Map/Reduce framework supports some basic operations on queues |
| such as job submission to a specific queue, access control for queues, |
| queue states, viewing configured queues and their properties |
| and refresh of queue properties. In order to fully implement some of |
| these operations, the framework takes the help of the configured |
| scheduler.</p> |
| <p>The following types of queue configurations are possible:</p> |
| <ul> |
<li> Single queue: The default configuration in Map/Reduce comprises
a single queue, as supported by the default scheduler. All jobs
are submitted to this default queue, which maintains jobs in a
priority-based FIFO order.</li>
| <li> Multiple single level queues: Multiple queues are defined, and |
| jobs can be submitted to any of these queues. Different policies |
| can be applied to these queues by schedulers that support this |
| configuration to provide a better level of support. For example, |
| the <a href="capacity_scheduler.html">capacity scheduler</a> |
| provides ways of configuring different |
| capacity and fairness guarantees on these queues.</li> |
| <li> Hierarchical queues: Hierarchical queues are a configuration in |
| which queues can contain other queues within them recursively. The |
| queues that contain other queues are referred to as |
container queues. Queues that do not contain other queues are
referred to as leaf or job queues. Jobs can only be submitted to leaf
| queues. Hierarchical queues can potentially offer a higher level |
| of control to administrators, as schedulers can now build a |
| hierarchy of policies where policies applicable to a container |
| queue can provide context for policies applicable to queues it |
| contains. It also opens up possibilities for delegating queue |
| administration where administration of queues in a container queue |
| can be turned over to a different set of administrators, within |
| the context provided by the container queue. For example, the |
| <a href="capacity_scheduler.html">capacity scheduler</a> |
uses hierarchical queues to partition the capacity of a cluster
among container queues, allowing the queues they contain to divide
that capacity in more ways.</li>
| </ul> |
| |
| <p>Most of the configuration of the queues can be refreshed/reloaded |
| without restarting the Map/Reduce sub-system by editing this |
| configuration file as described in the section on |
| <a href="commands_manual.html#RefreshQueues">reloading queue |
| configuration</a>. |
Not all configuration properties can be reloaded, of course,
as the description of each property below explains.</p>
| |
<p>The format of conf/mapred-queues.xml differs from that of the other
configuration files: it allows nested configuration
elements in order to express hierarchical queues. The format is as follows:
</p>
| |
| <source> |
| <queues aclsEnabled="$aclsEnabled"> |
| <queue> |
| <name>$queue-name</name> |
| <state>$state</state> |
| <queue> |
| <name>$child-queue1</name> |
| <properties> |
| <property key="$key" value="$value"/> |
| ... |
| </properties> |
| <queue> |
| <name>$grand-child-queue1</name> |
| ... |
| </queue> |
| </queue> |
| <queue> |
| <name>$child-queue2</name> |
| ... |
| </queue> |
| ... |
| ... |
| ... |
| <queue> |
| <name>$leaf-queue</name> |
| <acl-submit-job>$acls</acl-submit-job> |
| <acl-administer-jobs>$acls</acl-administer-jobs> |
| <properties> |
| <property key="$key" value="$value"/> |
| ... |
| </properties> |
| </queue> |
| </queue> |
| </queues> |
| </source> |
| <table> |
| <tr> |
| <th>Tag/Attribute</th> |
| <th>Value</th> |
| <th> |
| <a href="commands_manual.html#RefreshQueues">Refresh-able?</a> |
| </th> |
| <th>Notes</th> |
| </tr> |
| |
| <tr> |
| <td><anchor id="queues_tag"/>queues</td> |
| <td>Root element of the configuration file.</td> |
<td>Not applicable</td>
| <td>All the queues are nested inside this root element of the |
| file. There can be only one root queues element in the file.</td> |
| </tr> |
| |
| <tr> |
| <td>aclsEnabled</td> |
| <td>Boolean attribute to the |
| <a href="#queues_tag"><em><queues></em></a> tag |
| specifying whether ACLs are supported for controlling job |
| submission and administration for <em>all</em> the queues |
| configured. |
| </td> |
| <td>Yes</td> |
| <td>If <em>false</em>, ACLs are ignored for <em>all</em> the |
| configured queues. <br/><br/> |
| If <em>true</em>, the user and group details of the user |
| are checked against the configured ACLs of the corresponding |
| job-queue while submitting and administering jobs. ACLs can be |
| specified for each queue using the queue-specific tags |
| "acl-$acl_name", defined below. ACLs are checked only against |
| the job-queues, i.e. the leaf-level queues; ACLs configured |
| for the rest of the queues in the hierarchy are ignored. |
| </td> |
| </tr> |
| |
| <tr> |
| <td><anchor id="queue_tag"/>queue</td> |
| <td>A child element of the |
| <a href="#queues_tag"><em><queues></em></a> tag or another |
| <a href="#queue_tag"><em><queue></em></a>. Denotes a queue |
| in the system. |
| </td> |
| <td>Not applicable</td> |
| <td>Queues can be hierarchical and so this element can contain |
| children of this same type.</td> |
| </tr> |
| |
| <tr> |
| <td>name</td> |
| <td>Child element of a |
| <a href="#queue_tag"><em><queue></em></a> specifying the |
| name of the queue.</td> |
| <td>No</td> |
<td>The name of the queue cannot contain the character <em>":"</em>,
which is reserved as the queue-name delimiter when addressing a
queue in a hierarchy.</td>
| </tr> |
| |
| <tr> |
| <td>state</td> |
| <td>Child element of a |
| <a href="#queue_tag"><em><queue></em></a> specifying the |
| state of the queue. |
| </td> |
| <td>Yes</td> |
| <td>Each queue has a corresponding state. A queue in |
| <em>'running'</em> state can accept new jobs, while a queue in |
| <em>'stopped'</em> state will stop accepting any new jobs. State |
| is defined and respected by the framework only for the |
| leaf-level queues and is ignored for all other queues. |
| <br/><br/> |
The state of the queue can be viewed from the command line using
the <code>'bin/mapred queue'</code> command and also on the Web
UI.<br/><br/>
| Administrators can stop and start queues at runtime using the |
| feature of <a href="commands_manual.html#RefreshQueues">reloading |
| queue configuration</a>. If a queue is stopped at runtime, it |
| will complete all the existing running jobs and will stop |
| accepting any new jobs. |
| </td> |
| </tr> |
| |
| <tr> |
| <td>acl-submit-job</td> |
| <td>Child element of a |
| <a href="#queue_tag"><em><queue></em></a> specifying the |
| list of users and groups that can submit jobs to the specified |
| queue.</td> |
| <td>Yes</td> |
| <td> |
| Applicable only to leaf-queues.<br/><br/> |
The lists of users and groups are both comma-separated
lists of names. The two lists are separated by a blank.
Example: <em>user1,user2 group1,group2</em>.
If you wish to define only a list of groups, provide
a blank at the beginning of the value.
| <br/><br/> |
| </td> |
| </tr> |
| |
| <tr> |
<td>acl-administer-jobs</td>
| <td>Child element of a |
| <a href="#queue_tag"><em><queue></em></a> specifying the |
| list of users and groups that can change the priority of a job |
| or kill a job that has been submitted to the specified queue. |
| </td> |
| <td>Yes</td> |
| <td> |
| Applicable only to leaf-queues.<br/><br/> |
The lists of users and groups are both comma-separated
lists of names. The two lists are separated by a blank.
Example: <em>user1,user2 group1,group2</em>.
If you wish to define only a list of groups, provide
a blank at the beginning of the value. Note that the
owner of a job can always change the priority of or kill
his/her own job, irrespective of the ACLs.
| </td> |
| </tr> |
| |
| <tr> |
| <td><anchor id="properties_tag"/>properties</td> |
| <td>Child element of a |
| <a href="#queue_tag"><em><queue></em></a> specifying the |
| scheduler specific properties.</td> |
| <td>Not applicable</td> |
| <td>The scheduler specific properties are the children of this |
| element specified as a group of <property> tags described |
| below. The JobTracker completely ignores these properties. These |
| can be used as per-queue properties needed by the scheduler |
| being configured. Please look at the scheduler specific |
| documentation as to how these properties are used by that |
| particular scheduler. |
| </td> |
| </tr> |
| |
| <tr> |
| <td><anchor id="property_tag"/>property</td> |
| <td>Child element of |
| <a href="#properties_tag"><em><properties></em></a> for a |
| specific queue.</td> |
| <td>Not applicable</td> |
| <td>A single scheduler specific queue-property. Ignored by |
| the JobTracker and used by the scheduler that is configured.</td> |
| </tr> |
| |
| <tr> |
| <td>key</td> |
| <td>Attribute of a |
| <a href="#property_tag"><em><property></em></a> for a |
| specific queue.</td> |
| <td>Scheduler-specific</td> |
| <td>The name of a single scheduler specific queue-property.</td> |
| </tr> |
| |
| <tr> |
| <td>value</td> |
| <td>Attribute of a |
| <a href="#property_tag"><em><property></em></a> for a |
| specific queue.</td> |
| <td>Scheduler-specific</td> |
| <td>The value of a single scheduler specific queue-property. |
| The value can be anything that is left for the proper |
| interpretation by the scheduler that is configured.</td> |
| </tr> |
| |
| </table> |
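<p>As an illustration of the format described above, the following
sketch configures two top-level queues, one of which contains a stopped
child queue; all queue names, user names and group names are examples
only:</p>
<source>
<queues aclsEnabled="true">
  <queue>
    <name>production</name>
    <state>running</state>
    <acl-submit-job>user1,user2 group1</acl-submit-job>
    <acl-administer-jobs>user1 group1</acl-administer-jobs>
  </queue>
  <queue>
    <name>research</name>
    <state>running</state>
    <queue>
      <name>experiments</name>
      <state>stopped</state>
      <acl-submit-job>user3 group2</acl-submit-job>
      <acl-administer-jobs>user3 group2</acl-administer-jobs>
    </queue>
  </queue>
</queues>
</source>
<p>Here <code>research</code> is a container queue, so jobs can be
submitted only to its leaf queue <code>experiments</code>, addressed
as <code>research:experiments</code>.</p>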
| |
| <p>Once the queues are configured properly and the Map/Reduce |
| system is up and running, from the command line one can |
| <a href="commands_manual.html#QueuesList">get the list |
| of queues</a> and |
| <a href="commands_manual.html#QueuesInfo">obtain |
| information specific to each queue</a>. This information is also |
available from the web UI. On the web UI, queue information can be
seen by going to queueinfo.jsp, linked from the queues table-cell
in the cluster-summary table. queueinfo.jsp prints the hierarchy
of queues as well as the specific information for each queue.
| </p> |
| |
| <p> Users can submit jobs only to a |
| leaf-level queue by specifying the fully-qualified queue-name for |
| the property name <em>mapreduce.job.queuename</em> in the job |
configuration. The character ':' is the queue-name delimiter, so,
for example, if one wants to submit to a configured job-queue 'Queue-C'
which is one of the sub-queues of 'Queue-B', which in turn is a
sub-queue of 'Queue-A', then the job configuration should contain the
property <em>mapreduce.job.queuename</em> set to the value
<em>Queue-A:Queue-B:Queue-C</em></p>
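<p>For example, such a job can be directed to a leaf queue from the
command line, assuming the job's main class uses ToolRunner to parse
generic options; the jar and class names below are placeholders:</p>
<source>
bin/hadoop jar wordcount.jar org.example.WordCount \
  -Dmapreduce.job.queuename=Queue-A:Queue-B:Queue-C \
  input output
</source>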
| </section> |
| <section> |
| <title>Real-World Cluster Configurations</title> |
| |
| <p>This section lists some non-default configuration parameters which |
| have been used to run the <em>sort</em> benchmark on very large |
| clusters.</p> |
| |
| <ul> |
| <li> |
| <p>Some non-default configuration values used to run sort900, |
| that is 9TB of data sorted on a cluster with 900 nodes:</p> |
| <table> |
| <tr> |
| <th>Configuration File</th> |
| <th>Parameter</th> |
| <th>Value</th> |
| <th>Notes</th> |
| </tr> |
| <tr> |
| <td>conf/hdfs-site.xml</td> |
| <td>dfs.block.size</td> |
| <td>134217728</td> |
| <td>HDFS blocksize of 128MB for large file-systems.</td> |
| </tr> |
| <tr> |
| <td>conf/hdfs-site.xml</td> |
| <td>dfs.namenode.handler.count</td> |
| <td>40</td> |
| <td> |
| More NameNode server threads to handle RPCs from large |
| number of DataNodes. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.reduce.shuffle.parallelcopies</td> |
| <td>20</td> |
| <td> |
| Higher number of parallel copies run by reduces to fetch |
| outputs from very large number of maps. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.map.java.opts</td> |
| <td>-Xmx512M</td> |
| <td> |
| Larger heap-size for child jvms of maps. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.reduce.java.opts</td> |
| <td>-Xmx512M</td> |
| <td> |
| Larger heap-size for child jvms of reduces. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/core-site.xml</td> |
| <td>fs.inmemory.size.mb</td> |
| <td>200</td> |
| <td> |
| Larger amount of memory allocated for the in-memory |
| file-system used to merge map-outputs at the reduces. |
| </td> |
| </tr> |
| <tr> |
<td>conf/mapred-site.xml</td>
| <td>mapreduce.task.io.sort.factor</td> |
| <td>100</td> |
| <td>More streams merged at once while sorting files.</td> |
| </tr> |
| <tr> |
<td>conf/mapred-site.xml</td>
| <td>mapreduce.task.io.sort.mb</td> |
| <td>200</td> |
| <td>Higher memory-limit while sorting data.</td> |
| </tr> |
| <tr> |
| <td>conf/core-site.xml</td> |
| <td>io.file.buffer.size</td> |
| <td>131072</td> |
| <td>Size of read/write buffer used in SequenceFiles.</td> |
| </tr> |
| </table> |
| </li> |
| <li> |
| <p>Updates to some configuration values to run sort1400 and |
| sort2000, that is 14TB of data sorted on 1400 nodes and 20TB of |
| data sorted on 2000 nodes:</p> |
| <table> |
| <tr> |
| <th>Configuration File</th> |
| <th>Parameter</th> |
| <th>Value</th> |
| <th>Notes</th> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.jobtracker.handler.count</td> |
| <td>60</td> |
| <td> |
| More JobTracker server threads to handle RPCs from large |
| number of TaskTrackers. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.reduce.shuffle.parallelcopies</td> |
| <td>50</td> |
| <td></td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.tasktracker.http.threads</td> |
| <td>50</td> |
| <td> |
| More worker threads for the TaskTracker's http server. The |
| http server is used by reduces to fetch intermediate |
| map-outputs. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.map.java.opts</td> |
| <td>-Xmx512M</td> |
| <td> |
| Larger heap-size for child jvms of maps. |
| </td> |
| </tr> |
| <tr> |
| <td>conf/mapred-site.xml</td> |
| <td>mapreduce.reduce.java.opts</td> |
| <td>-Xmx1024M</td> |
| <td>Larger heap-size for child jvms of reduces.</td> |
| </tr> |
| </table> |
| </li> |
| </ul> |
| </section> |
| <section> |
<title>Memory management</title>
| <p>Users/admins can also specify the maximum virtual memory |
| of the launched child-task, and any sub-process it launches |
recursively, using <code>mapred.{map|reduce}.child.ulimit</code>. Note
that the value set here is a per-process limit.
The value for <code>mapred.{map|reduce}.child.ulimit</code> should be
specified in kilobytes (KB), and it must be greater than
or equal to the -Xmx passed to the JVM, else the VM might not start.
</p>
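<p>For instance, if child JVMs run with <code>-Xmx512M</code>
(524288 KB), a consistent ulimit could be set in
<code>conf/mapred-site.xml</code> as sketched below; the 1 GB
(1048576 KB) figure is an illustrative choice that leaves headroom
above the 512 MB heap:</p>
<source>
<property>
  <name>mapred.map.child.ulimit</name>
  <value>1048576</value>
</property>
<property>
  <name>mapred.reduce.child.ulimit</name>
  <value>1048576</value>
</property>
</source>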
| |
<p>Note: <code>mapred.{map|reduce}.child.java.opts</code> are used only
for configuring the child tasks launched by the TaskTracker. Configuring
the memory options for daemons is documented under
<a href="cluster_setup.html#Configuring+the+Environment+of+the+Hadoop+Daemons">
Configuring the Environment of the Hadoop Daemons</a>.</p>
| |
| <p>The memory available to some parts of the framework is also |
| configurable. In map and reduce tasks, performance may be influenced |
| by adjusting parameters influencing the concurrency of operations and |
the frequency with which data will hit disk. Monitoring the filesystem
counters for a job, particularly relative to byte counts from the map
and into the reduce, is invaluable to the tuning of these
parameters.</p>
| </section> |
| |
| <section> |
<title>Memory monitoring</title>
| <p>A <code>TaskTracker</code>(TT) can be configured to monitor memory |
| usage of tasks it spawns, so that badly-behaved jobs do not bring |
| down a machine due to excess memory consumption. With monitoring |
| enabled, every task is assigned a task-limit for virtual memory (VMEM). |
| In addition, every node is assigned a node-limit for VMEM usage. |
| A TT ensures that a task is killed if it, and |
| its descendants, use VMEM over the task's per-task limit. It also |
ensures that one or more tasks are killed if the sum total of VMEM
usage by all tasks, and their descendants, crosses the node-limit.</p>
| |
| <p>Users can, optionally, specify the VMEM task-limit per job. If no |
| such limit is provided, a default limit is used. A node-limit can be |
| set per node.</p> |
<p>Currently, memory monitoring and management is supported
only on the Linux platform.</p>
| <p>To enable monitoring for a TT, the |
| following parameters all need to be set:</p> |
| |
| <table> |
| <tr><th>Name</th><th>Type</th><th>Description</th></tr> |
| <tr><td>mapred.tasktracker.vmem.reserved</td><td>long</td> |
| <td>A number, in bytes, that represents an offset. The total VMEM on |
| the machine, minus this offset, is the VMEM node-limit for all |
| tasks, and their descendants, spawned by the TT. |
| </td></tr> |
| <tr><td>mapred.task.default.maxvmem</td><td>long</td> |
| <td>A number, in bytes, that represents the default VMEM task-limit |
| associated with a task. Unless overridden by a job's setting, |
| this number defines the VMEM task-limit. |
| </td></tr> |
| <tr><td>mapred.task.limit.maxvmem</td><td>long</td> |
| <td>A number, in bytes, that represents the upper VMEM task-limit |
| associated with a task. Users, when specifying a VMEM task-limit |
| for their tasks, should not specify a limit which exceeds this amount. |
| </td></tr> |
| </table> |
| |
| <p>In addition, the following parameters can also be configured.</p> |
| |
| <table> |
| <tr><th>Name</th><th>Type</th><th>Description</th></tr> |
| <tr><td>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</td> |
| <td>long</td> |
| <td>The time interval, in milliseconds, between which the TT |
| checks for any memory violation. The default value is 5000 msec |
| (5 seconds). |
| </td></tr> |
| </table> |
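<p>Putting these parameters together, a sketch of a monitoring
configuration in <code>conf/mapred-site.xml</code> might look as
follows; all byte values are illustrative and should be tuned to the
node's actual memory:</p>
<source>
<!-- Reserve 8 GB of VMEM on each node for non-task use -->
<property>
  <name>mapred.tasktracker.vmem.reserved</name>
  <value>8589934592</value>
</property>
<!-- Default per-task VMEM limit: 2 GB -->
<property>
  <name>mapred.task.default.maxvmem</name>
  <value>2147483648</value>
</property>
<!-- Upper bound on any per-task VMEM limit: 4 GB -->
<property>
  <name>mapred.task.limit.maxvmem</name>
  <value>4294967296</value>
</property>
</source>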
| |
| <p>Here's how the memory monitoring works for a TT.</p> |
| <ol> |
<li>If one or more of the configuration parameters described
above are missing or -1 is specified, memory monitoring is
disabled for the TT.
| </li> |
| <li>In addition, monitoring is disabled if |
| <code>mapred.task.default.maxvmem</code> is greater than |
| <code>mapred.task.limit.maxvmem</code>. |
| </li> |
| <li>If a TT receives a task whose task-limit is set by the user |
| to a value larger than <code>mapred.task.limit.maxvmem</code>, it |
| logs a warning but executes the task. |
| </li> |
| <li>Periodically, the TT checks the following: |
| <ul> |
<li>If any task's current VMEM usage is greater than that task's
VMEM task-limit, the task is killed and the reason for killing
the task is logged in the task diagnostics. Such a task is considered
failed, i.e., the killing counts towards the task's failure count.
| </li> |
<li>If the sum total of VMEM used by all tasks and descendants is
greater than the node-limit, the TT kills enough tasks, in the
order of least progress made, till the overall VMEM usage falls
below the node-limit. Such killed tasks are not considered failed
and their killing does not count towards the tasks' failure counts.
| </li> |
| </ul> |
| </li> |
| </ol> |
| |
| <p>Schedulers can choose to ease the monitoring pressure on the TT by |
| preventing too many tasks from running on a node and by scheduling |
| tasks only if the TT has enough VMEM free. In addition, Schedulers may |
| choose to consider the physical memory (RAM) available on the node |
| as well. To enable Scheduler support, TTs report their memory settings |
to the JobTracker in every heartbeat. Before getting into details,
consider the following additional memory-related parameters that can be
configured to enable better scheduling:</p>
| |
| <table> |
| <tr><th>Name</th><th>Type</th><th>Description</th></tr> |
| <tr><td>mapred.tasktracker.pmem.reserved</td><td>int</td> |
| <td>A number, in bytes, that represents an offset. The total |
| physical memory (RAM) on the machine, minus this offset, is the |
recommended RAM node-limit. The RAM node-limit is a hint to a
Scheduler to schedule only so many tasks that the sum
total of their RAM requirements does not exceed this limit.
RAM usage is not monitored by a TT.
| </td></tr> |
| </table> |
| |
| <p>A TT reports the following memory-related numbers in every |
| heartbeat:</p> |
| <ul> |
| <li>The total VMEM available on the node.</li> |
| <li>The value of <code>mapred.tasktracker.vmem.reserved</code>, |
| if set.</li> |
| <li>The total RAM available on the node.</li> |
| <li>The value of <code>mapred.tasktracker.pmem.reserved</code>, |
| if set.</li> |
| </ul> |
| </section> |
| |
| <section> |
| <title>Task Controllers</title> |
| <p>Task controllers are classes in the Hadoop Map/Reduce |
| framework that define how user's map and reduce tasks |
| are launched and controlled. They can |
| be used in clusters that require some customization in |
| the process of launching or controlling the user tasks. |
| For example, in some |
| clusters, there may be a requirement to run tasks as |
| the user who submitted the job, instead of as the task |
| tracker user, which is how tasks are launched by default. |
| This section describes how to configure and use |
| task controllers.</p> |
<p>The following task controllers are available in
Hadoop.
</p>
| <table> |
| <tr><th>Name</th><th>Class Name</th><th>Description</th></tr> |
| <tr> |
| <td>DefaultTaskController</td> |
| <td>org.apache.hadoop.mapred.DefaultTaskController</td> |
| <td> The default task controller which Hadoop uses to manage task |
| execution. The tasks run as the task tracker user.</td> |
| </tr> |
| <tr> |
| <td>LinuxTaskController</td> |
| <td>org.apache.hadoop.mapred.LinuxTaskController</td> |
| <td>This task controller, which is supported only on Linux, |
| runs the tasks as the user who submitted the job. It requires |
| these user accounts to be created on the cluster nodes |
| where the tasks are launched. It |
| uses a setuid executable that is included in the Hadoop |
| distribution. The task tracker uses this executable to |
| launch and kill tasks. The setuid executable switches to |
| the user who has submitted the job and launches or kills |
| the tasks. For maximum security, this task controller |
| sets up restricted permissions and user/group ownership of |
| local files and directories used by the tasks such as the |
| job jar files, intermediate files, task log files and distributed |
| cache files. Particularly note that, because of this, except the |
| job owner and tasktracker, no other user can access any of the |
| local files/directories including those localized as part of the |
| distributed cache. |
| </td> |
| </tr> |
| </table> |
| <section> |
| <title>Configuring Task Controllers</title> |
<p>The task controller to be used can be configured by setting the
value of the following key in <code>mapred-site.xml</code>:</p>
| <table> |
| <tr> |
| <th>Property</th><th>Value</th><th>Notes</th> |
| </tr> |
| <tr> |
| <td>mapreduce.tasktracker.taskcontroller</td> |
| <td>Fully qualified class name of the task controller class</td> |
| <td>Currently there are two implementations of task controller |
| in the Hadoop system, DefaultTaskController and LinuxTaskController. |
| Refer to the class names mentioned above to determine the value |
| to set for the class of choice. |
| </td> |
| </tr> |
| </table> |
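<p>For example, to select the LinuxTaskController, a fragment like the
following could be added to mapred-site.xml:</p>

```xml
<property>
  <name>mapreduce.tasktracker.taskcontroller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
```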
| </section> |
| <section> |
| <title>Using the LinuxTaskController</title> |
| <p>This section of the document describes the steps required to |
| use the LinuxTaskController.</p> |
| |
<p>In order to use the LinuxTaskController, a setuid executable
must be built and deployed on the compute nodes. The
executable is named task-controller. To build the executable,
run
<em>ant task-controller -Dhadoop.conf.dir=/path/to/conf/dir.
</em>
The path passed in <em>-Dhadoop.conf.dir</em> should be the path
on the cluster nodes where the configuration file for the setuid
executable is located. The executable is built to
<em>build.dir/dist.dir/bin</em> and should be installed to
<em>$HADOOP_HOME/bin</em>.
</p>
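<p>As a sketch, the build and install steps might look like the
following; the configuration path and the exact build output path
are examples only:</p>

```shell
# Build the setuid task-controller executable; /etc/hadoop/conf is an
# example path where taskcontroller.cfg will live on the cluster nodes.
ant task-controller -Dhadoop.conf.dir=/etc/hadoop/conf

# Install the built executable (found under build.dir/dist.dir/bin;
# the path below is illustrative) into $HADOOP_HOME/bin.
cp build/hadoop-dist/bin/task-controller "$HADOOP_HOME/bin/"
```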
| |
<p>
The executable must have specific permissions: <em>6050 or
--Sr-s---</em>, user-owned by root (the superuser) and group-owned
by a group of which only the TaskTracker's user is a member.
For example, suppose the TaskTracker runs as the user
<em>mapred</em>, who is part of the groups <em>users</em> and
<em>mapredGroup</em>, either of which may be the primary group.
Suppose also that <em>users</em> has both <em>mapred</em> and
another user <em>X</em> as its members, while <em>mapredGroup</em>
has only <em>mapred</em> as its member. Going by the above
description, the setuid/setgid executable should be set to
<em>6050 or --Sr-s---</em> with user-owner as <em>root</em> and
group-owner as <em>mapredGroup</em>, which has
only <em>mapred</em> as its member (and not <em>users</em>, which
also has <em>X</em> as a member besides <em>mapred</em>).
</p>
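<p>With the example users and groups above, the required ownership and
permissions could be set as follows; the path is illustrative:</p>

```shell
# Owned by root, group mapredGroup; 6050 = --Sr-s--- (setuid + setgid).
chown root:mapredGroup "$HADOOP_HOME/bin/task-controller"
chmod 6050 "$HADOOP_HOME/bin/task-controller"
```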
| |
<p>The executable requires a configuration file called
<em>taskcontroller.cfg</em> to be
present in the configuration directory passed to the ant target
mentioned above. If the binary was not built with a specific
conf directory, the path defaults to
<em>/path-to-binary/../conf</em>. The configuration file must be
owned by the user running the TaskTracker (user <em>mapred</em> in
the above example), may be group-owned by any group, and must have
the permissions <em>0400 or r--------</em>.
</p>
| |
<p>The executable requires the following configuration items to be
present in the <em>taskcontroller.cfg</em> file, specified as simple
<em>key=value</em> pairs.
</p>
| <table><tr><th>Name</th><th>Description</th></tr> |
| <tr> |
| <td>mapreduce.cluster.local.dir</td> |
<td>Path to the local directories used by Map/Reduce. Must be the
same as the value of <code>mapreduce.cluster.local.dir</code> in
mapred-site.xml. This is required to
validate the paths passed to the setuid executable in order to
prevent arbitrary paths being passed to it.</td>
| </tr> |
| <tr> |
| <td>hadoop.log.dir</td> |
<td>Path to the hadoop log directory. Must be the same as the value
with which the TaskTracker is started. This is required to set proper
permissions on the log files so that the user's tasks can write to
them and the TaskTracker can read them for serving on the web UI.</td>
| </tr> |
| </table> |
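<p>Putting the two items together, a taskcontroller.cfg might read as
follows; the paths shown are illustrative:</p>

```
mapreduce.cluster.local.dir=/var/hadoop/mapred/local
hadoop.log.dir=/var/log/hadoop
```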
| |
<p>
The LinuxTaskController requires that the directories specified in
<em>mapreduce.cluster.local.dir</em> and <em>hadoop.log.dir</em>,
including all paths leading up to them, have 755
permissions.
</p>
| </section> |
| |
| </section> |
| <section> |
| <title>Monitoring Health of TaskTracker Nodes</title> |
<p>Hadoop Map/Reduce provides a mechanism by which administrators
can configure the TaskTracker to run an administrator-supplied
script periodically to determine if a node is healthy or not.
Administrators can determine if the node is in a healthy state
by performing any checks of their choice in the script. If the
script detects the node to be in an unhealthy state, it must print
a line to standard output beginning with the string <em>ERROR</em>.
The TaskTracker spawns the script periodically and checks its
output. If the script's output contains the string <em>ERROR</em>,
as described above, the node's status is reported as 'unhealthy'
and the node is blacklisted on the JobTracker. No further tasks
will be assigned to this node. However, the
TaskTracker continues to run the script, so that if the node
becomes healthy again, it is removed from the blacklisted
nodes on the JobTracker automatically. The node's health,
along with the output of the script if it is unhealthy, is
available to the administrator in the JobTracker's web interface.
The time since the node was last healthy is also displayed on the
web interface.
</p>
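<p>As an illustration only, a minimal health script might flag the node
as unhealthy when the root filesystem is nearly full; the check and the
90% threshold below are assumptions for the example, not part of
Hadoop:</p>

```shell
#!/bin/bash
# Hypothetical node health script: prints a line beginning with ERROR
# when disk usage exceeds a limit, which marks the node unhealthy.
check_disk() {
  local usage=$1 limit=$2
  if [ "$usage" -gt "$limit" ]; then
    echo "ERROR: root filesystem is ${usage}% full"
  else
    echo "OK: root filesystem is ${usage}% full"
  fi
}

# Current root filesystem usage, with the trailing '%' stripped.
usage=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')
check_disk "$usage" 90
```

Any administrator-chosen checks (memory pressure, daemon liveness, and
so on) can be substituted, as long as an unhealthy node produces a line
starting with <em>ERROR</em>.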
| |
| <section> |
| <title>Configuring the Node Health Check Script</title> |
| <p>The following parameters can be used to control the node health |
| monitoring script in <em>mapred-site.xml</em>.</p> |
| <table> |
| <tr><th>Name</th><th>Description</th></tr> |
| <tr><td><code>mapreduce.tasktracker.healthchecker.script.path</code></td> |
| <td>Absolute path to the script which is periodically run by the |
| TaskTracker to determine if the node is |
| healthy or not. The file should be executable by the TaskTracker. |
| If the value of this key is empty or the file does |
| not exist or is not executable, node health monitoring |
| is not started.</td> |
| </tr> |
| <tr> |
| <td><code>mapreduce.tasktracker.healthchecker.interval</code></td> |
| <td>Frequency at which the node health script is run, |
| in milliseconds</td> |
| </tr> |
| <tr> |
| <td><code>mapreduce.tasktracker.healthchecker.script.timeout</code></td> |
<td>Time after which the node health script is killed by
the TaskTracker if unresponsive.
The node is marked unhealthy if the node health script times out.</td>
| </tr> |
| <tr> |
| <td><code>mapreduce.tasktracker.healthchecker.script.args</code></td> |
<td>Extra arguments that can be passed to the node health script
when launched.
These should be specified as a comma-separated list of arguments.</td>
| </tr> |
| </table> |
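<p>For example, a mapred-site.xml fragment enabling a health script
might look like the following; the script path and the interval and
timeout values are illustrative:</p>

```xml
<property>
  <name>mapreduce.tasktracker.healthchecker.script.path</name>
  <value>/usr/local/bin/node-health.sh</value>
</property>
<property>
  <!-- Run the script every 60 seconds. -->
  <name>mapreduce.tasktracker.healthchecker.interval</name>
  <value>60000</value>
</property>
<property>
  <!-- Kill the script, and mark the node unhealthy, after 30 seconds. -->
  <name>mapreduce.tasktracker.healthchecker.script.timeout</name>
  <value>30000</value>
</property>
```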
| </section> |
| </section> |
| |
| </section> |
| |
| <section> |
| <title>Slaves</title> |
| |
<p>Typically you choose one machine in the cluster to act as the
<code>NameNode</code> and one machine to act as the
<code>JobTracker</code>, exclusively. The rest of the machines act as
both a <code>DataNode</code> and a <code>TaskTracker</code> and are
referred to as <em>slaves</em>.</p>
| |
| <p>List all slave hostnames or IP addresses in your |
| <code>conf/slaves</code> file, one per line.</p> |
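<p>For example, a conf/slaves file for a three-slave cluster might
read as follows; the hostnames and address are illustrative:</p>

```
slave01.example.com
slave02.example.com
10.1.2.3
```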
| </section> |
| |
| <section> |
| <title>Logging</title> |
| |
<p>Hadoop uses <a href="http://logging.apache.org/log4j/">Apache
log4j</a> via the <a href="http://commons.apache.org/logging/">Apache
Commons Logging</a> framework for logging. Edit the
<code>conf/log4j.properties</code> file to customize the Hadoop
daemons' logging configuration (log formats and so on).</p>
| |
| <section> |
| <title>History Logging</title> |
| |
<p>The job history files are stored in the central location given by
<code>mapreduce.jobtracker.jobhistory.location</code>, which can also
be on DFS and defaults to <code>${HADOOP_LOG_DIR}/history</code>.
The history web UI is accessible from the JobTracker web UI.</p>

<p>The history files are also logged to the user-specified directory
<code>mapreduce.job.userhistorylocation</code>,
which defaults to the job output directory. The files are stored
under "_logs/history/" in the specified directory; hence, by default,
they will be in "mapreduce.output.fileoutputformat.outputdir/_logs/history/". Users can disable this
logging by setting <code>mapreduce.job.userhistorylocation</code> to
the value <code>none</code>.</p>
| |
<p>Users can view a summary of the history logs in a specified
directory using the following command: <br/>
<code>$ bin/hadoop job -history output-dir</code><br/>
This command prints job details, plus failed and killed tip
details. <br/>
More details about the job, such as successful tasks and the
task attempts made for each task, can be viewed using the
following command: <br/>
<code>$ bin/hadoop job -history all output-dir</code><br/></p>
| </section> |
| </section> |
| |
| <p>Once all the necessary configuration is complete, distribute the files |
| to the <code>HADOOP_CONF_DIR</code> directory on all the machines, |
| typically <code>${HADOOP_HOME}/conf</code>.</p> |
| </section> |
| <section> |
| <title>Cluster Restartability</title> |
| <section> |
| <title>Map/Reduce</title> |
<p>On restart, the JobTracker can recover jobs that were running if
<code>mapreduce.jobtracker.restart.recover</code> is set to true and
<a href="#Logging">JobHistory logging</a> is enabled. Also,
<code>mapreduce.jobtracker.jobhistory.block.size</code> should be
set to an optimal value so that job history is flushed to disk as
soon as possible; a typical value is 3145728 (3MB).</p>
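<p>A mapred-site.xml fragment enabling job recovery might therefore
look like the following:</p>

```xml
<property>
  <name>mapreduce.jobtracker.restart.recover</name>
  <value>true</value>
</property>
<property>
  <!-- 3 MB, so job history is flushed to disk frequently. -->
  <name>mapreduce.jobtracker.jobhistory.block.size</name>
  <value>3145728</value>
</property>
```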
| </section> |
| </section> |
| |
| <section> |
| <title>Hadoop Rack Awareness</title> |
| <p>The HDFS and the Map/Reduce components are rack-aware.</p> |
<p>The <code>NameNode</code> and the <code>JobTracker</code> obtain the
<code>rack id</code> of the slaves in the cluster by invoking an API
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve
">resolve</a> in an administrator-configured
module. The API resolves the slave's DNS name (or IP address) to a
rack id. The module to use can be configured via the configuration
item <code>topology.node.switch.mapping.impl</code>. The default
implementation runs a script/command configured using
<code>topology.script.file.name</code>. If topology.script.file.name is
not set, the rack id <code>/default-rack</code> is returned for any
passed IP address. An additional configuration item in the Map/Reduce
part is <code>mapred.cache.task.levels</code>, which determines the
number of levels (in the network topology) of caches. For example, at
the default value of 2, two levels of caches are constructed:
one for hosts (host -> task mapping) and another for racks
(rack -> task mapping).
</p>
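<p>As a sketch of such a script, the following maps each hostname or IP
address argument to a rack id using a flat "host rack" map file; the
map file's location and format are assumptions for illustration:</p>

```shell
#!/bin/bash
# Hypothetical topology script for topology.script.file.name.
# Looks each argument up in a "host rack" map file; unknown hosts
# fall back to /default-rack, matching Hadoop's default behaviour.
MAP_FILE=${MAP_FILE:-/etc/hadoop/topology.map}

lookup_rack() {
  local rack
  rack=$(awk -v h="$1" '$1 == h { print $2; exit }' "$MAP_FILE" 2>/dev/null)
  echo "${rack:-/default-rack}"
}

# Hadoop invokes the script with one or more hosts and expects one
# rack id per line, in the same order.
for host in "$@"; do
  lookup_rack "$host"
done
```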
| </section> |
| |
| <section> |
| <title>Hadoop Startup</title> |
| |
<p>To start a Hadoop cluster you will need to start both the HDFS and
the Map/Reduce clusters.</p>
| |
| <p> |
| Format a new distributed filesystem:<br/> |
| <code>$ bin/hadoop namenode -format</code> |
| </p> |
| |
| <p> |
| Start the HDFS with the following command, run on the designated |
| <code>NameNode</code>:<br/> |
| <code>$ bin/start-dfs.sh</code> |
| </p> |
| <p>The <code>bin/start-dfs.sh</code> script also consults the |
| <code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>NameNode</code> |
| and starts the <code>DataNode</code> daemon on all the listed slaves.</p> |
| |
| <p> |
| Start Map-Reduce with the following command, run on the designated |
| <code>JobTracker</code>:<br/> |
| <code>$ bin/start-mapred.sh</code> |
| </p> |
| <p>The <code>bin/start-mapred.sh</code> script also consults the |
| <code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>JobTracker</code> |
| and starts the <code>TaskTracker</code> daemon on all the listed slaves. |
| </p> |
| </section> |
| |
| <section> |
| <title>Hadoop Shutdown</title> |
| |
| <p> |
| Stop HDFS with the following command, run on the designated |
| <code>NameNode</code>:<br/> |
| <code>$ bin/stop-dfs.sh</code> |
| </p> |
| <p>The <code>bin/stop-dfs.sh</code> script also consults the |
| <code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>NameNode</code> |
| and stops the <code>DataNode</code> daemon on all the listed slaves.</p> |
| |
<p>
Stop Map/Reduce with the following command, run on the designated
<code>JobTracker</code>:<br/>
<code>$ bin/stop-mapred.sh</code><br/>
</p>
| <p>The <code>bin/stop-mapred.sh</code> script also consults the |
| <code>${HADOOP_CONF_DIR}/slaves</code> file on the <code>JobTracker</code> |
| and stops the <code>TaskTracker</code> daemon on all the listed slaves.</p> |
| </section> |
| </body> |
| |
| </document> |