<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Capacity Scheduler</title>
</header>
<body>
<section>
<title>Purpose</title>
<p>This document describes the Capacity Scheduler, a pluggable
MapReduce scheduler for Hadoop which provides a way to share
large clusters.</p>
</section>
<section>
<title>Features</title>
<p>The Capacity Scheduler supports the following features:</p>
<ul>
<li>
Multiple queues, possibly hierarchical/recursive, where a job is
submitted to a queue.
</li>
<li>
Queues are allocated a fraction of the capacity of the grid in the
sense that a certain capacity of resources will be at their
disposal. All jobs submitted to a queue will have access to the
capacity allocated to the queue.
</li>
<li>
Free resources can be allocated to any queue beyond its capacity.
When queues running below capacity later demand these resources,
the resources are assigned to jobs in those queues as the tasks
currently scheduled on them complete.
</li>
<li>
Queues optionally support job priorities (disabled by default).
</li>
<li>
Within a queue, jobs with higher priority will have access to the
queue's resources before jobs with lower priority. However, once a
job is running, it will not be preempted for a higher priority job,
though new tasks from the higher priority job will be
preferentially scheduled.
</li>
<li>
In order to prevent one or more users from monopolizing its
resources, each queue enforces a limit on the percentage of
resources allocated to a user at any given time, if there is
competition for them.
</li>
<li>
Queues can use idle resources of other queues. To prevent
particular queues from monopolizing resources, a cap can be set on
each queue that limits the maximum amount of resources it can expand
to when other queues of the cluster have idle resources.
</li>
<li>
Support for memory-intensive jobs, wherein a job can optionally
specify higher memory-requirements than the default, and the tasks
of the job will only be run on TaskTrackers that have enough memory
to spare.
</li>
<li>
Support for refreshing/reloading some of the queue-properties
without restarting the JobTracker, taking advantage of the
<a href="cluster_setup.html#Refreshing+queue+configuration">
queue-refresh</a> feature in the framework.
</li>
</ul>
</section>
<section>
<title>Picking a Task to Run</title>
<p>Note that many of these steps can be, and will be, enhanced over time
to provide better algorithms.</p>
<p>Whenever a TaskTracker is free, the Capacity Scheduler picks
the queue with the most free space, that is, the queue whose ratio of
running slots to capacity is the lowest.</p>
<p>Once a queue is selected, the Scheduler picks a job in the queue. Jobs
are sorted based on when they're submitted and their priorities (if the
queue supports priorities). Jobs are considered in order, and a job is
selected if its user is within the user-quota for the queue, i.e., the
user is not already using queue resources above his/her limit. The
Scheduler also makes sure that there is enough free memory in the
TaskTracker to run the job's task, in case the job has special memory
requirements.</p>
<p>Once a job is selected, the Scheduler picks a task to run. This logic
to pick a task remains unchanged from earlier versions.</p>
</section>
<section>
<title>Installation</title>
<p>The Capacity Scheduler is available as a JAR file in the Hadoop
tarball under the <em>contrib/capacity-scheduler</em> directory. The
JAR file is named along the lines of <code>hadoop-*-capacity-scheduler.jar</code>.</p>
<p>You can also build the Scheduler from source by executing
<em>ant package</em>, in which case it is available under
<em>build/contrib/capacity-scheduler</em>.</p>
<p>To run the Capacity Scheduler in your Hadoop installation, you need
to put it on the <em>CLASSPATH</em>. The easiest way is to copy the
<code>hadoop-*-capacity-scheduler.jar</code> file
to <code>HADOOP_HOME/lib</code>. Alternatively, you can modify
<em>HADOOP_CLASSPATH</em> in <code>conf/hadoop-env.sh</code>
to include this JAR.</p>
</section>
<section>
<title>Configuration</title>
<section>
<title>Using the Capacity Scheduler</title>
<p>
To make the Hadoop framework use the Capacity Scheduler, set up
the following property in the site configuration:</p>
<table>
<tr>
<th>Name</th>
<th>Value</th>
</tr>
<tr>
<td>mapreduce.jobtracker.taskscheduler</td>
<td>org.apache.hadoop.mapred.CapacityTaskScheduler</td>
</tr>
</table>
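<p>For illustration, the property in the table above could be written in
the usual Hadoop <code>name</code>/<code>value</code> form as shown here.
This is a minimal sketch that assumes the site configuration file is
<code>conf/mapred-site.xml</code>, a file name this document does not
itself state:</p>
<source><![CDATA[
<!-- Assumed location: conf/mapred-site.xml -->
<property>
  <name>mapreduce.jobtracker.taskscheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
]]></source>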
</section>
<section>
<title>Setting Up Queues</title>
<p>
With the Capacity Scheduler, you can define multiple, possibly
hierarchical, queues to which users can submit jobs. To define queues,
set the relevant properties in two configuration files:
<a href="cluster_setup.html#mapred-queues.xml">mapred-queues.xml</a>
and
<a href="ext:capacity-scheduler-conf">conf/capacity-scheduler.xml</a>.</p>
<p><em>conf/capacity-scheduler.xml</em> can be used to configure (1)
properties related to the job-initialization poller and (2) the
default values for various queue properties.</p>
<p><em>conf/mapred-queues.xml</em> contains the actual queue
configuration, including (1) framework-specific properties, such as the
ACLs that control which users or groups have access to the queues and
the state of the queues, and (2) the scheduler-specific properties for
each queue. If any of these scheduler-specific properties is
not configured for a queue, the corresponding property in
<em>conf/capacity-scheduler.xml</em> supplies the default value.
The properties that can be configured, and their
semantics, are described below. A default template for
mapred-queues.xml tailored for use with the
Capacity Scheduler can be found
<a href="ext:mapred-queues-capacity-scheduler">here</a>.</p>
</section>
<section>
<title>Configuring Properties for Queues</title>
<p>The Capacity Scheduler can be configured with several properties
for each queue that control the behavior of the Scheduler. As
described above, this scheduler-specific configuration belongs in
<em>conf/mapred-queues.xml</em>, along with the rest of the
framework-specific configuration. By
default, the configuration is set up for one queue, named
<em>default</em>.</p>
<p>To specify a property for a specific queue that is defined in the
mapred-queues.xml, you should set the corresponding property in a
&lt;property&gt; tag explained
<a href="cluster_setup.html#property_tag">here</a>.
</p>
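<p>As a hedged sketch, a single leaf queue carrying scheduler-specific
properties might look like the fragment below. The element layout
(<code>queues</code>, <code>queue</code>, <code>name</code>,
<code>properties</code>, and <code>property</code> with
<code>key</code> and <code>value</code> attributes) is an assumption
based on the queue configuration format covered in the cluster setup
guide and should be checked against it. The values are illustrative
only, and the property names are those described in the table below.</p>
<source><![CDATA[
<!-- Assumed location: conf/mapred-queues.xml -->
<queues>
  <queue>
    <name>default</name>
    <properties>
      <!-- Scheduler-specific properties for this queue -->
      <property key="capacity" value="100"/>
      <property key="supports-priority" value="false"/>
      <property key="minimum-user-limit-percent" value="100"/>
    </properties>
  </queue>
</queues>
]]></source>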
<p>The properties defined for queues and their descriptions are
listed in the table below:</p>
<table>
<tr>
<th>Name</th>
<th>
<a href="commands_manual.html#RefreshQueues">
Refresh-able?</a>
</th>
<th>Applicable to?</th>
<th>Description</th>
</tr>
<tr>
<td>capacity</td>
<td>Yes</td>
<td>Container queues as well as leaf queues</td>
<td>For a root-level container queue, this is the percentage of the
number of slots in the cluster that will be available for all its
immediate children together. For a root-level leaf-queue, this is
the percentage of the number of slots in the cluster that will be
available for all its jobs. For a non-root level container queue,
this is the percentage of the number of slots in its parent queue
that will be available for all its children together. For a
non-root-level leaf queue, this is the percentage of the number of
slots in its parent queue that will be available for jobs in this
queue. The sum of capacities for all children of a container queue
should be less than or equal to 100. The sum of capacities of all the
root-level queues should be less than or equal to 100.
</td>
</tr>
<tr>
<td>maximum-capacity</td>
<td>Yes</td>
<td>Container queues as well as leaf queues</td>
<td>
A limit in percentage beyond which a non-root-level queue cannot use
the capacity of its parent queue; for a root-level queue, this is
the limit in percentage beyond which it cannot use the
cluster-capacity. This property provides a means to limit how much
excess capacity a queue can use. It can be used to prevent queues
with long running jobs from occupying more than a certain percentage
of the parent-queue or the cluster, which, in the absence of
pre-emption, can lead to capacity guarantees of other queues getting
affected.
The maximum-capacity of a queue must be greater than or equal to
its capacity. By default, there is no limit for a queue. For a
non-root-level queue, this means it can occupy up to the
maximum-capacity of its parent; for a root-level queue, it means that
it can occupy the whole cluster. A value of 100 implies that a queue
can use the complete capacity of its parent, or the complete
cluster capacity in the case of root-level queues.
</td>
</tr>
<tr>
<td>supports-priority</td>
<td>No</td>
<td>Leaf queues only</td>
<td>If true, priorities of jobs will be taken into account in scheduling
decisions.
</td>
</tr>
<tr>
<td>minimum-user-limit-percent</td>
<td>Yes</td>
<td>Leaf queues only</td>
<td>Each queue enforces a limit on the percentage of resources
allocated to a user at any given time, if there is competition
for them. This user limit can vary between a minimum and maximum
value. The former is set to this property value, and the latter
depends on the number of users who have submitted jobs. For example,
suppose the value of this property is 25. If two users have
submitted jobs to a queue, no single user can use more than 50%
of the queue resources. If a third user submits a job, no single
user can use more than 33% of the queue resources. With 4 or more
users, no user can use more than 25% of the queue's resources. A
value of 100 implies no user limits are imposed.
</td>
</tr>
<tr>
<td>maximum-initialized-jobs-per-user</td>
<td>Yes</td>
<td>Leaf queues only</td>
<td>
Maximum number of jobs that are allowed to be pre-initialized for
a particular user in the queue. Once a job is scheduled, i.e.,
it starts running, it is no longer counted
when the scheduler computes the maximum number of jobs a user is
allowed to initialize.
</td>
</tr>
</table>
<p>See <a href="ext:mapred-queues-capacity-scheduler">
this configuration file</a> for a default configuration of queues in
capacity-scheduler.</p>
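<p>To make the capacity arithmetic in the table concrete, the following
sketch defines a small queue hierarchy. The queue names
(<code>research</code>, <code>batch</code>, <code>adhoc</code>,
<code>production</code>) and all values are hypothetical, and the
nesting of <code>queue</code> elements to express the hierarchy, like the
<code>key</code>/<code>value</code> attribute form, is an assumption
to be verified against the cluster setup guide.</p>
<source><![CDATA[
<!-- Assumed location: conf/mapred-queues.xml -->
<queues>
  <!-- Root-level container queue: 60% of the cluster's slots,
       shared by its immediate children -->
  <queue>
    <name>research</name>
    <properties>
      <property key="capacity" value="60"/>
    </properties>
    <!-- Leaf queue: 50% of its parent, i.e. 30% of the cluster.
         It may grow to at most 80% of the parent's capacity. -->
    <queue>
      <name>batch</name>
      <properties>
        <property key="capacity" value="50"/>
        <property key="maximum-capacity" value="80"/>
        <property key="minimum-user-limit-percent" value="25"/>
      </properties>
    </queue>
    <!-- Leaf queue: the remaining 50% of its parent -->
    <queue>
      <name>adhoc</name>
      <properties>
        <property key="capacity" value="50"/>
      </properties>
    </queue>
  </queue>
  <!-- Root-level leaf queue: 40% of the cluster's slots for its jobs -->
  <queue>
    <name>production</name>
    <properties>
      <property key="capacity" value="40"/>
      <property key="supports-priority" value="true"/>
    </properties>
  </queue>
</queues>
]]></source>
<p>In this sketch the root-level capacities sum to 100 (60 + 40), and the
children of <code>research</code> also sum to 100 (50 + 50), which
satisfies the constraints stated for the <code>capacity</code>
property.</p>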
</section>
<section>
<title>Memory Management</title>
<p>The Capacity Scheduler supports scheduling of tasks on a
<code>TaskTracker</code> (TT) based on a job's memory requirements
and the availability of RAM and Virtual Memory (VMEM) on the TT node.
See the
<a href="mapred_tutorial.html">MapReduce Tutorial</a>
for details on how the TT monitors memory usage.</p>
<p>Currently, memory-based scheduling is supported only on the Linux platform.</p>
<p>Memory-based scheduling works as follows:</p>
<ol>
<li>Memory-based scheduling is disabled, just as memory monitoring
is disabled for a TT, if any of the following three config
parameters is absent or set to -1:
<code>mapred.tasktracker.vmem.reserved</code>,
<code>mapred.task.default.maxvmem</code>, or
<code>mapred.task.limit.maxvmem</code>. These
config parameters are described in the
<a href="mapred_tutorial.html">MapReduce Tutorial</a>.
The value of
<code>mapred.tasktracker.vmem.reserved</code> is
obtained from the TT via its heartbeat.
</li>
<li>If all three mandatory parameters are set, the Scheduler
enables VMEM-based scheduling. First, the Scheduler computes the free
VMEM on the TT. This is the difference between the available VMEM on the
TT (the node's total VMEM minus the offset, both of which are sent by
the TT on each heartbeat) and the sum of VMEM already allocated to
running tasks (i.e., the sum of the VMEM task-limits). Next, the Scheduler
looks at the VMEM requirements for the job that's first in line to
run. If the job's VMEM requirements are less than the available VMEM on
the node, the job's task can be scheduled. If not, the Scheduler
ensures that the TT does not get a task to run (provided the job
has tasks to run). This way, the Scheduler ensures that jobs with
high memory requirements are not starved, as eventually, the TT
will have enough VMEM available. If the high-mem job does not have
any task to run, the Scheduler moves on to the next job.
</li>
<li>In addition to VMEM, the Capacity Scheduler can also consider
RAM on the TT node. RAM is considered the same way as VMEM. TTs report
the total RAM available on their node, and an offset. If both are
set, the Scheduler computes the available RAM on the node. Next,
the Scheduler figures out the RAM requirements of the job, if any.
As with VMEM, users can optionally specify a RAM limit for their job
(<code>mapred.task.maxpmem</code>, described in the MapReduce
tutorial). The Scheduler also maintains a limit for this value
(<code>mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem</code>,
described below). All three of these values must be set for the
Scheduler to schedule tasks based on RAM constraints.
</li>
<li>The Scheduler ensures that jobs cannot ask for RAM or VMEM higher
than the configured limits. A job that does so is failed when it
is submitted.
</li>
</ol>
<p>As described above, the additional scheduler-based config
parameters are as follows:</p>
<table>
<tr><th>Name</th><th>Description</th></tr>
<tr><td>mapred.capacity-scheduler.task.default-pmem-<br/>percentage-in-vmem</td>
<td>A percentage of the default VMEM limit for jobs
(<code>mapred.task.default.maxvmem</code>). This is the default
RAM task-limit associated with a task. Unless overridden by a
job's setting, this number defines the RAM task-limit.</td>
</tr>
<tr><td>mapred.capacity-scheduler.task.limit.maxpmem</td>
<td>An upper limit on the maximum physical
memory that a job can specify. If a job requires more
physical memory than this limit, the job
is rejected.</td>
</tr>
</table>
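<p>As a small illustration of the first parameter, the fragment below
sets the default RAM task-limit to half of the default VMEM limit. The
value 50 is an arbitrary example, and placing the property in
<code>conf/capacity-scheduler.xml</code> is an assumption, since this
document does not state which file holds it.</p>
<source><![CDATA[
<!-- Assumed location: conf/capacity-scheduler.xml -->
<property>
  <name>mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem</name>
  <!-- Illustrative value: default RAM limit = 50% of
       mapred.task.default.maxvmem -->
  <value>50</value>
</property>
]]></source>
<p>With such a setting, a job that does not specify its own RAM limit
would get a RAM task-limit equal to 50% of
<code>mapred.task.default.maxvmem</code>.</p>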
</section>
<section>
<title>Job Initialization Parameters</title>
<p>The Capacity Scheduler initializes jobs lazily, before they are
scheduled, to reduce the memory footprint on the JobTracker.
The following parameters, which control how lazily jobs are
initialized, can be
configured in capacity-scheduler.xml:
</p>
<table>
<tr><th>Name</th><th>Description</th></tr>
<tr>
<td>
mapred.capacity-scheduler.init-poll-interval
</td>
<td>
Amount of time, in milliseconds, between polls of the scheduler
job queue to look for jobs to be initialized.
</td>
</tr>
<tr>
<td>
mapred.capacity-scheduler.init-worker-threads
</td>
<td>
Number of worker threads used by the initialization
poller to initialize jobs in a set of queues. If the number specified
in this property equals the number of job queues, each thread is
assigned jobs from one queue. If the number configured is less than the
number of queues, a thread can get jobs from more than one queue,
which it initializes in a round-robin fashion. If the number configured
is greater than the number of queues, the number of threads spawned
equals the number of job queues.
</td>
</tr>
</table>
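<p>A minimal sketch of how these two parameters might appear in
<code>capacity-scheduler.xml</code> follows. The values are
illustrative only and are not recommended defaults.</p>
<source><![CDATA[
<!-- conf/capacity-scheduler.xml -->
<property>
  <name>mapred.capacity-scheduler.init-poll-interval</name>
  <!-- Illustrative: poll the job queues every 5000 milliseconds -->
  <value>5000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.init-worker-threads</name>
  <!-- Illustrative: up to 3 initialization worker threads -->
  <value>3</value>
</property>
]]></source>
<p>Following the rules described in the table, with this setting and five
queues, each thread would initialize jobs from more than one queue in a
round-robin fashion; with only two queues, two threads would be
spawned.</p>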
</section>
<section>
<title>Reviewing the Configuration of the Capacity Scheduler</title>
<p>
Once installation and configuration are complete, you can review
the configuration from the admin UI after starting the MapReduce cluster.
</p>
<ul>
<li>Start the MapReduce cluster as usual.</li>
<li>Open the JobTracker web UI.</li>
<li>The queues you have configured should be listed under the <em>Scheduling
Information</em> section of the page.</li>
<li>The properties for the queues should be visible in the <em>Scheduling
Information</em> column against each queue.</li>
</ul>
</section>
</section>
</body>
</document>