| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <document> |
| |
| <header> |
| <title>Capacity Scheduler</title> |
| </header> |
| |
| <body> |
| |
| <section> |
| <title>Purpose</title> |
| |
| <p>This document describes the Capacity Scheduler, a pluggable |
| MapReduce scheduler for Hadoop which provides a way to share |
| large clusters.</p> |
| </section> |
| |
| <section> |
| <title>Features</title> |
| |
| <p>The Capacity Scheduler supports the following features:</p> |
| <ul> |
| <li> |
| Multiple queues, possibly hierarchical/recursive, where a job is |
| submitted to a queue. |
| </li> |
| <li> |
| Queues are allocated a fraction of the capacity of the grid in the |
| sense that a certain capacity of resources will be at their |
| disposal. All jobs submitted to a queue will have access to the |
| capacity allocated to the queue. |
| </li> |
        <li>
          Free resources can be allocated to any queue beyond its capacity.
          When queues running below capacity demand these resources at a
          future point in time, the resources are assigned to jobs in those
          queues as the tasks currently scheduled on them complete.
        </li>
| <li> |
| Queues optionally support job priorities (disabled by default). |
| </li> |
| <li> |
| Within a queue, jobs with higher priority will have access to the |
| queue's resources before jobs with lower priority. However, once a |
| job is running, it will not be preempted for a higher priority job, |
| though new tasks from the higher priority job will be |
| preferentially scheduled. |
| </li> |
        <li>
          To prevent one or more users from monopolizing a queue's
          resources, each queue enforces a limit on the percentage of
          resources allocated to a single user at any given time, if there
          is competition for them.
        </li>
        <li>
          Queues can use the idle resources of other queues. To prevent
          particular queues from monopolizing resources, each queue can be
          given a cap on the maximum number of resources it can expand to
          when other queues in the cluster have idle resources.
        </li>
| <li> |
| Support for memory-intensive jobs, wherein a job can optionally |
| specify higher memory-requirements than the default, and the tasks |
| of the job will only be run on TaskTrackers that have enough memory |
| to spare. |
| </li> |
| <li> |
| Support for refreshing/reloading some of the queue-properties |
| without restarting the JobTracker, taking advantage of the |
| <a href="cluster_setup.html#Refreshing+queue+configuration"> |
| queue-refresh</a> feature in the framework. |
| </li> |
| </ul> |
| </section> |
| |
| <section> |
| <title>Picking a Task to Run</title> |
| |
| <p>Note that many of these steps can be, and will be, enhanced over time |
| to provide better algorithms.</p> |
| |
      <p>Whenever a TaskTracker is free, the Capacity Scheduler picks
      the queue with the most free space, i.e., the queue whose ratio of
      running slots to capacity is the lowest.</p>
| |
| <p>Once a queue is selected, the Scheduler picks a job in the queue. Jobs |
| are sorted based on when they're submitted and their priorities (if the |
| queue supports priorities). Jobs are considered in order, and a job is |
| selected if its user is within the user-quota for the queue, i.e., the |
| user is not already using queue resources above his/her limit. The |
| Scheduler also makes sure that there is enough free memory in the |
      TaskTracker to run the job's task, in case the job has special memory
| requirements.</p> |
| |
| <p>Once a job is selected, the Scheduler picks a task to run. This logic |
| to pick a task remains unchanged from earlier versions.</p> |
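
      <p>The selection steps above can be summarized in pseudocode as
      follows. This is a simplified sketch of the description in this
      section, not the actual CapacityTaskScheduler logic:</p>
<source>
when a TaskTracker reports a free slot:
  queue = queue with the lowest ratio of running slots to capacity
  for each job in queue, ordered by priority (if the queue supports
      priorities) and then by submission time:
    if job's user is already at the queue's user limit:
      skip this job
    else if the TaskTracker lacks the free memory the job's tasks need:
      assign no task, so the high-memory job is not starved
    else:
      pick a task from this job (unchanged from earlier versions)
</source>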
| |
| </section> |
| |
| <section> |
| <title>Installation</title> |
| |
| <p>The Capacity Scheduler is available as a JAR file in the Hadoop |
    tarball under the <em>contrib/capacity-scheduler</em> directory. The name of
    the JAR file will be along the lines of hadoop-*-capacity-scheduler.jar.</p>
| <p>You can also build the Scheduler from source by executing |
| <em>ant package</em>, in which case it would be available under |
| <em>build/contrib/capacity-scheduler</em>.</p> |
| <p>To run the Capacity Scheduler in your Hadoop installation, you need |
| to put it on the <em>CLASSPATH</em>. The easiest way is to copy the |
    <code>hadoop-*-capacity-scheduler.jar</code> from the directory described
    above to <code>HADOOP_HOME/lib</code>. Alternatively, you can modify
| <em>HADOOP_CLASSPATH</em> to include this jar, in |
| <code>conf/hadoop-env.sh</code>.</p> |
| </section> |
| |
| <section> |
| <title>Configuration</title> |
| |
| <section> |
| <title>Using the Capacity Scheduler</title> |
| <p> |
| To make the Hadoop framework use the Capacity Scheduler, set up |
| the following property in the site configuration:</p> |
| <table> |
| <tr> |
| <th>Name</th> |
| <th>Value</th> |
| </tr> |
| <tr> |
| <td>mapreduce.jobtracker.taskscheduler</td> |
| <td>org.apache.hadoop.mapred.CapacityTaskScheduler</td> |
| </tr> |
| </table> |
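      <p>For example, the corresponding entry in the site configuration
      file would look like the following:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapreduce.jobtracker.taskscheduler&lt;/name&gt;
  &lt;value&gt;org.apache.hadoop.mapred.CapacityTaskScheduler&lt;/value&gt;
&lt;/property&gt;
</source>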
| </section> |
| |
| <section> |
| <title>Setting Up Queues</title> |
| <p> |
| You can define multiple, possibly hierarchical queues to which users |
| can submit jobs with the Capacity Scheduler. To define queues, |
| various properties should be set in two configuration files - |
| <a href="cluster_setup.html#mapred-queues.xml">mapred-queues.xml</a> |
| and |
| <a href="ext:capacity-scheduler-conf">conf/capacity-scheduler.xml</a> |
| .</p> |
      <p><em>conf/capacity-scheduler.xml</em> can be used to configure (1)
      properties related to the job-initialization poller and (2) the
      default values for various queue properties.</p>
| <p><em>conf/mapred-queues.xml</em> contains the actual queue |
| configuration including (1) framework specific properties like ACLs |
| for controlling which users or groups have access to the queues and |
| state of the queues and (2) the scheduler specific properties for |
      each queue. If any of these scheduler-specific properties are
      missing for a queue, then the values in
      <em>conf/capacity-scheduler.xml</em> are used as defaults.
      The properties that can be configured, and their semantics, are
      described below. Also, a default template for mapred-queues.xml,
      tailored for use with the Capacity Scheduler, can be found
      <a href="ext:mapred-queues-capacity-scheduler">here</a>.</p>
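      <p>As an illustrative sketch, a <em>conf/mapred-queues.xml</em> defining
      two queues might look like the following. The queue names and capacity
      values here are examples only; refer to the template linked above for
      the exact syntax:</p>
<source>
&lt;queues&gt;
  &lt;queue&gt;
    &lt;name&gt;default&lt;/name&gt;
    &lt;properties&gt;
      &lt;property key="capacity" value="70"/&gt;
    &lt;/properties&gt;
  &lt;/queue&gt;
  &lt;queue&gt;
    &lt;name&gt;research&lt;/name&gt;
    &lt;properties&gt;
      &lt;property key="capacity" value="30"/&gt;
    &lt;/properties&gt;
  &lt;/queue&gt;
&lt;/queues&gt;
</source>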
| </section> |
| |
| <section> |
| <title>Configuring Properties for Queues</title> |
| |
| <p>The Capacity Scheduler can be configured with several properties |
| for each queue that control the behavior of the Scheduler. As |
| described above, this scheduler specific configuration has to be in |
| the <em>conf/mapred-queues.xml</em> along with the rest of the |
| framework specific configuration. By |
| default, the configuration is set up for one queue, named |
| <em>default</em>.</p> |
      <p>To specify a property for a specific queue that is defined in
      mapred-queues.xml, you should set the corresponding property in a
      &lt;property&gt; tag, as explained
      <a href="cluster_setup.html#property_tag">here</a>.
      </p>
| |
| <p>The properties defined for queues and their descriptions are |
| listed in the table below:</p> |
| |
| <table> |
| <tr> |
| <th>Name</th> |
| <th> |
| <a href="commands_manual.html#RefreshQueues"> |
| Refresh-able?</a> |
| </th> |
| <th>Applicable to?</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td>capacity</td> |
| <td>Yes</td> |
| <td>Container queues as well as leaf queues</td> |
| <td>For a root-level container queue, this is the percentage of the |
| number of slots in the cluster that will be available for all its |
| immediate children together. For a root-level leaf-queue, this is |
| the percentage of the number of slots in the cluster that will be |
| available for all its jobs. For a non-root level container queue, |
| this is the percentage of the number of slots in its parent queue |
| that will be available for all its children together. For a |
| non-root-level leaf queue, this is the percentage of the number of |
| slots in its parent queue that will be available for jobs in this |
          queue. The sum of capacities for all children of a container queue
          should be less than or equal to 100. Likewise, the sum of
          capacities of all the root-level queues should be less than or
          equal to 100.
| </td> |
| </tr> |
| <tr> |
| <td>maximum-capacity</td> |
| <td>Yes</td> |
| <td>Container queues as well as leaf queues</td> |
| <td> |
| A limit in percentage beyond which a non-root-level queue cannot use |
| the capacity of its parent queue; for a root-level queue, this is |
| the limit in percentage beyond which it cannot use the |
| cluster-capacity. This property provides a means to limit how much |
          excess capacity a queue can use. It can be used to prevent queues
          with long-running jobs from occupying more than a certain
          percentage of the parent queue or the cluster, which, in the
          absence of preemption, could affect the capacity guarantees of
          other queues.
| |
          The maximum-capacity of a queue must be greater than or equal to
          its capacity. By default, there is no limit: a non-root-level
          queue can expand up to the maximum-capacity of its parent, and a
          root-level queue can occupy the whole cluster. A value of 100
          implies that a queue can use the complete capacity of its parent,
          or the complete cluster-capacity in the case of root-level queues.
| </td> |
| </tr> |
| <tr> |
| <td>supports-priority</td> |
| <td>No</td> |
| <td>Leaf queues only</td> |
| <td>If true, priorities of jobs will be taken into account in scheduling |
| decisions. |
| </td> |
| </tr> |
| <tr> |
| <td>minimum-user-limit-percent</td> |
| <td>Yes</td> |
| <td>Leaf queues only</td> |
| <td>Each queue enforces a limit on the percentage of resources |
| allocated to a user at any given time, if there is competition |
| for them. This user limit can vary between a minimum and maximum |
| value. The former depends on the number of users who have submitted |
| jobs, and the latter is set to this property value. For example, |
| suppose the value of this property is 25. If two users have |
| submitted jobs to a queue, no single user can use more than 50% |
| of the queue resources. If a third user submits a job, no single |
| user can use more than 33% of the queue resources. With 4 or more |
| users, no user can use more than 25% of the queue's resources. A |
| value of 100 implies no user limits are imposed. |
| </td> |
| </tr> |
| <tr> |
| <td>maximum-initialized-jobs-per-user</td> |
| <td>Yes</td> |
| <td>Leaf queues only</td> |
| <td> |
          The maximum number of jobs allowed to be pre-initialized for a
          particular user in the queue. Once a job is scheduled, i.e., it
          starts running, it is no longer counted when the scheduler
          computes the maximum number of jobs a user is allowed to
          initialize.
| </td> |
| </tr> |
| </table> |
| <p>See <a href="ext:mapred-queues-capacity-scheduler"> |
| this configuration file</a> for a default configuration of queues in |
| capacity-scheduler.</p> |
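      <p>For instance, a leaf queue guaranteed 50% of its parent's slots,
      capped at 80%, with job priorities enabled and a 25% minimum per-user
      limit, might carry the following scheduler-specific properties. The
      values are illustrative only; see the linked file for the exact
      syntax:</p>
<source>
&lt;properties&gt;
  &lt;property key="capacity" value="50"/&gt;
  &lt;property key="maximum-capacity" value="80"/&gt;
  &lt;property key="supports-priority" value="true"/&gt;
  &lt;property key="minimum-user-limit-percent" value="25"/&gt;
&lt;/properties&gt;
</source>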
| </section> |
| |
| <section> |
| <title>Memory Management</title> |
| |
| <p>The Capacity Scheduler supports scheduling of tasks on a |
| <code>TaskTracker</code>(TT) based on a job's memory requirements |
| and the availability of RAM and Virtual Memory (VMEM) on the TT node. |
| See the |
| <a href="mapred_tutorial.html">MapReduce Tutorial</a> |
| for details on how the TT monitors memory usage.</p> |
      <p>Currently, memory-based scheduling is supported only on the Linux platform.</p>
| <p>Memory-based scheduling works as follows:</p> |
| <ol> |
        <li>Memory-based scheduling is disabled, just as memory monitoring
          is disabled for a TT, if any one of the three config parameters
          <code>mapred.tasktracker.vmem.reserved</code>,
          <code>mapred.task.default.maxvmem</code>, or
          <code>mapred.task.limit.maxvmem</code> is absent or set to -1.
          These config parameters are described in the
          <a href="mapred_tutorial.html">MapReduce Tutorial</a>.
          The value of
          <code>mapred.tasktracker.vmem.reserved</code> is
          obtained from the TT via its heartbeat.
| </li> |
| <li>If all the three mandatory parameters are set, the Scheduler |
| enables VMEM-based scheduling. First, the Scheduler computes the free |
          VMEM on the TT. This is the difference between the available VMEM on the
          TT (the node's total VMEM minus the offset, both of which are sent by
          the TT on each heartbeat) and the sum of the VMEM already allocated to
          running tasks (i.e., the sum of the VMEM task-limits). Next, the Scheduler
| looks at the VMEM requirements for the job that's first in line to |
| run. If the job's VMEM requirements are less than the available VMEM on |
| the node, the job's task can be scheduled. If not, the Scheduler |
| ensures that the TT does not get a task to run (provided the job |
| has tasks to run). This way, the Scheduler ensures that jobs with |
| high memory requirements are not starved, as eventually, the TT |
| will have enough VMEM available. If the high-mem job does not have |
| any task to run, the Scheduler moves on to the next job. |
| </li> |
| <li>In addition to VMEM, the Capacity Scheduler can also consider |
| RAM on the TT node. RAM is considered the same way as VMEM. TTs report |
| the total RAM available on their node, and an offset. If both are |
| set, the Scheduler computes the available RAM on the node. Next, |
| the Scheduler figures out the RAM requirements of the job, if any. |
| As with VMEM, users can optionally specify a RAM limit for their job |
| (<code>mapred.task.maxpmem</code>, described in the MapReduce |
| tutorial). The Scheduler also maintains a limit for this value |
| (<code>mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem</code>, |
| described below). All these three values must be set for the |
| Scheduler to schedule tasks based on RAM constraints. |
| </li> |
        <li>The Scheduler ensures that jobs cannot ask for more RAM or VMEM
          than the configured limits. A job that does is failed at
          submission time.
| </li> |
| </ol> |
| |
| <p>As described above, the additional scheduler-based config |
| parameters are as follows:</p> |
| |
| <table> |
| <tr><th>Name</th><th>Description</th></tr> |
| <tr><td>mapred.capacity-scheduler.task.default-pmem-<br/>percentage-in-vmem</td> |
| <td>A percentage of the default VMEM limit for jobs |
| (<code>mapred.task.default.maxvmem</code>). This is the default |
| RAM task-limit associated with a task. Unless overridden by a |
| job's setting, this number defines the RAM task-limit.</td> |
| </tr> |
| <tr><td>mapred.capacity-scheduler.task.limit.maxpmem</td> |
          <td>An upper limit on the maximum physical memory that a job can
          specify. If a job requires more physical memory than this limit,
          it is rejected.</td>
| </tr> |
| </table> |
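      <p>For example, to set the default RAM task-limit to 50% of the
      default VMEM limit and cap job-specified physical memory at 2 GB
      (both values are illustrative only, and the limit is assumed here to
      be in bytes), <em>conf/capacity-scheduler.xml</em> could contain:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem&lt;/name&gt;
  &lt;value&gt;50&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.capacity-scheduler.task.limit.maxpmem&lt;/name&gt;
  &lt;value&gt;2147483648&lt;/value&gt;
&lt;/property&gt;
</source>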
| </section> |
| <section> |
| <title>Job Initialization Parameters</title> |
      <p>The Capacity Scheduler initializes jobs lazily before they are
      scheduled, to reduce the memory footprint on the JobTracker. The
      following parameters, which can be configured in
      capacity-scheduler.xml, control this lazy initialization:
| </p> |
| |
| <table> |
| <tr><th>Name</th><th>Description</th></tr> |
| <tr> |
| <td> |
| mapred.capacity-scheduler.init-poll-interval |
| </td> |
| <td> |
          The interval, in milliseconds, at which the scheduler's job
          queues are polled for jobs to be initialized.
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.capacity-scheduler.init-worker-threads |
| </td> |
| <td> |
          The number of worker threads used by the initialization poller to
          initialize jobs in the queues. If this number equals the number
          of job queues, a thread is assigned jobs from exactly one queue.
          If it is less than the number of queues, a thread can get jobs
          from more than one queue, which it initializes in a round-robin
          fashion. If it is greater than the number of queues, the number
          of threads spawned equals the number of job queues.
| </td> |
| </tr> |
| </table> |
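      <p>For example, to poll every 5 seconds using 5 worker threads (the
      values are illustrative only), <em>capacity-scheduler.xml</em> could
      contain:</p>
<source>
&lt;property&gt;
  &lt;name&gt;mapred.capacity-scheduler.init-poll-interval&lt;/name&gt;
  &lt;value&gt;5000&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.capacity-scheduler.init-worker-threads&lt;/name&gt;
  &lt;value&gt;5&lt;/value&gt;
&lt;/property&gt;
</source>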
| </section> |
| <section> |
| <title>Reviewing the Configuration of the Capacity Scheduler</title> |
| <p> |
        Once installation and configuration are complete, you can review
        the setup from the admin UI after starting the MapReduce cluster.
| </p> |
| <ul> |
| <li>Start the MapReduce cluster as usual.</li> |
| <li>Open the JobTracker web UI.</li> |
| <li>The queues you have configured should be listed under the <em>Scheduling |
| Information</em> section of the page.</li> |
| <li>The properties for the queues should be visible in the <em>Scheduling |
| Information</em> column against each queue.</li> |
| </ul> |
| </section> |
| |
| </section> |
| </body> |
| |
| </document> |