| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version |
| 2.0 (the "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 Unless required by |
| applicable law or agreed to in writing, software distributed under |
| the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES |
| OR CONDITIONS OF ANY KIND, either express or implied. See the |
| License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| <document> |
| <header> |
| <title>Fair Scheduler Guide</title> |
| </header> |
| <body> |
| |
| <section> |
| <title>Purpose</title> |
| |
| <p>This document describes the Fair Scheduler, a pluggable |
| MapReduce scheduler for Hadoop which provides a way to share |
| large clusters.</p> |
| </section> |
| |
| <section> |
| <title>Introduction</title> |
| <p>Fair scheduling is a method of assigning resources to jobs |
| such that all jobs get, on average, an equal share of resources |
| over time. When there is a single job running, that job uses the |
| entire cluster. When other jobs are submitted, tasks slots that |
| free up are assigned to the new jobs, so that each job gets |
| roughly the same amount of CPU time. Unlike the default Hadoop |
| scheduler, which forms a queue of jobs, this lets short jobs finish |
| in reasonable time while not starving long jobs. It is also a |
| reasonable way to share a cluster between a number of users. Finally, |
| fair sharing can also work with job priorities - the priorities are |
| used as weights to determine the fraction of total compute time that |
| each job should get. |
| </p> |
| <p> |
| The scheduler actually organizes jobs further into "pools", and |
| shares resources fairly between these pools. By default, there is a |
| separate pool for each user, so that each user gets the same share |
| of the cluster no matter how many jobs they submit. However, it is |
| also possible to set a job's pool based on the user's Unix group or |
| any other jobconf property, such as the queue name property used by |
| <a href="capacity_scheduler.html">Capacity Scheduler</a>. |
| Within each pool, fair sharing is used to share capacity between |
| the running jobs. Pools can also be given weights to share the |
| cluster non-proportionally in the config file. |
| </p> |
| <p> |
| In addition to providing fair sharing, the Fair Scheduler allows |
| assigning guaranteed minimum shares to pools, which is useful for |
| ensuring that certain users, groups or production applications |
| always get sufficient resources. When a pool contains jobs, it gets |
| at least its minimum share, but when the pool does not need its full |
| guaranteed share, the excess is split between other running jobs. |
| This lets the scheduler guarantee capacity for pools while utilizing |
| resources efficiently when these pools don't contain jobs. |
| </p> |
| <p> |
| The Fair Scheduler lets all jobs run by default, but it is also |
| possible to limit the number of running jobs per user and per pool |
| through the config file. This can be useful when a user must submit |
| hundreds of jobs at once, or in general to improve performance if |
| running too many jobs at once would cause too much intermediate data |
| to be created or too much context-switching. Limiting the jobs does |
| not cause any subsequently submitted jobs to fail, only to wait in the |
| sheduler's queue until some of the user's earlier jobs finish. Jobs to |
| run from each user/pool are chosen in order of priority and then |
| submit time, as in the default FIFO scheduler in Hadoop. |
| </p> |
| <p> |
| Finally, the fair scheduler provides several extension points where |
| the basic functionality can be extended. For example, the weight |
| calculation can be modified to give a priority boost to new jobs, |
| implementing a "shortest job first" policy which reduces response |
| times for interactive jobs even further. |
| </p> |
| </section> |
| |
| <section> |
| <title>Installation</title> |
| <p> |
| To run the fair scheduler in your Hadoop installation, you need to put |
| it on the CLASSPATH. The easiest way is to copy the |
| <em>hadoop-fairscheduler-*.jar</em> from |
| <em>HADOOP_HOME/contrib/fairscheduler</em> to <em>HADOOP_HOME/lib</em>. |
| Alternatively you can modify <em>HADOOP_CLASSPATH</em> to include this jar, in |
| <em>HADOOP_CONF_DIR/hadoop-env.sh</em> |
| </p> |
| <p> |
| In order to compile fair scheduler, from sources execute <em> ant |
| package</em> in source folder and copy the |
| <em>build/contrib/fair-scheduler/hadoop-fairscheduler-*.jar</em> |
| to <em>HADOOP_HOME/lib</em> |
| </p> |
| <p> |
| You will also need to set the following property in the Hadoop config |
| file <em>HADOOP_CONF_DIR/mapred-site.xml</em> to have Hadoop use |
| the fair scheduler: <br/> |
| <code><property></code><br/> |
| <code> <name>mapred.jobtracker.taskScheduler</name></code><br/> |
| <code> <value>org.apache.hadoop.mapred.FairScheduler</value></code><br/> |
| <code></property></code> |
| </p> |
| <p> |
| Once you restart the cluster, you can check that the fair scheduler |
| is running by going to http://<jobtracker URL>/scheduler |
| on the JobTracker's web UI. A "job scheduler administration" page should |
| be visible there. This page is described in the Administration section. |
| </p> |
| </section> |
| |
| <section> |
| <title>Configuring the Fair scheduler</title> |
| <p> |
| The following properties can be set in mapred-site.xml to configure |
| the fair scheduler: |
| </p> |
| <table> |
| <tr> |
| <th>Name</th><th>Description</th> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.allocation.file |
| </td> |
| <td> |
| Specifies an absolute path to an XML file which contains the |
| allocations for each pool, as well as the per-pool and per-user |
| limits on number of running jobs. If this property is not |
| provided, allocations are not used.<br/> |
| This file must be in XML format, and can contain three types of |
| elements: |
| <ul> |
| <li>pool elements, which may contain elements for minMaps, |
| minReduces, maxRunningJobs (limit the number of jobs from the |
| pool to run at once),and weight (to share the cluster |
| non-proportionally with other pools). |
| </li> |
| <li>user elements, which may contain a maxRunningJobs to limit |
| jobs. Note that by default, there is a separate pool for each |
| user, so these may not be necessary; they are useful, however, |
| if you create a pool per user group or manually assign jobs |
| to pools.</li> |
| <li>A userMaxJobsDefault element, which sets the default running |
| job limit for any users whose limit is not specified.</li> |
| </ul> |
| <br/> |
| Example Allocation file is listed below :<br/> |
| <code><?xml version="1.0"?> </code> <br/> |
| <code><allocations></code> <br/> |
| <code> <pool name="sample_pool"></code><br/> |
| <code> <minMaps>5</minMaps></code><br/> |
| <code> <minReduces>5</minReduces></code><br/> |
| <code> <weight>2.0</weight></code><br/> |
| <code> </pool></code><br/> |
| <code> <user name="sample_user"></code><br/> |
| <code> <maxRunningJobs>6</maxRunningJobs></code><br/> |
| <code> </user></code><br/> |
| <code> <userMaxJobsDefault>3</userMaxJobsDefault></code><br/> |
| <code></allocations></code> |
| <br/> |
| This example creates a pool sample_pool with a guarantee of 5 map |
| slots and 5 reduce slots. The pool also has a weight of 2.0, meaning |
| it has a 2x higher share of the cluster than other pools (the default |
| weight is 1). Finally, the example limits the number of running jobs |
| per user to 3, except for sample_user, who can run 6 jobs concurrently. |
| Any pool not defined in the allocations file will have no guaranteed |
| capacity and a weight of 1.0. Also, any pool or user with no max |
| running jobs set in the file will be allowed to run an unlimited |
| number of jobs. |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.assignmultiple |
| </td> |
| <td> |
| Allows the scheduler to assign both a map task and a reduce task |
| on each heartbeat, which improves cluster throughput when there |
| are many small tasks to run. Boolean value, default: false. |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.sizebasedweight |
| </td> |
| <td> |
| Take into account job sizes in calculating their weights for fair |
| sharing.By default, weights are only based on job priorities. |
| Setting this flag to true will make them based on the size of the |
| job (number of tasks needed) as well,though not linearly |
| (the weight will be proportional to the log of the number of tasks |
| needed). This lets larger jobs get larger fair shares while still |
| providing enough of a share to small jobs to let them finish fast. |
| Boolean value, default: false. |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.poolnameproperty |
| </td> |
| <td> |
| Specify which jobconf property is used to determine the pool that a |
| job belongs in. String, default: user.name (i.e. one pool for each |
| user). Some other useful values to set this to are: <br/> |
| <ul> |
| <li> group.name (to create a pool per Unix group).</li> |
| <li>mapred.job.queue.name (the same property as the queue name in |
| <a href="capacity_scheduler.html">Capacity Scheduler</a>).</li> |
| </ul> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.weightadjuster |
| </td> |
| <td> |
| An extensibility point that lets you specify a class to adjust the |
| weights of running jobs. This class should implement the |
| <em>WeightAdjuster</em> interface. There is currently one example |
| implementation - <em>NewJobWeightBooster</em>, which increases the |
| weight of jobs for the first 5 minutes of their lifetime to let |
| short jobs finish faster. To use it, set the weightadjuster |
| property to the full class name, |
| <code>org.apache.hadoop.mapred.NewJobWeightBooster</code> |
| NewJobWeightBooster itself provides two parameters for setting the |
| duration and boost factor. <br/> |
| <ol> |
| <li> <em>mapred.newjobweightbooster.factor</em> |
| Factor by which new jobs weight should be boosted. Default is 3</li> |
| <li><em>mapred.newjobweightbooster.duration</em> |
| Duration in milliseconds, default 300000 for 5 minutes</li> |
| </ol> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.loadmanager |
| </td> |
| <td> |
| An extensibility point that lets you specify a class that determines |
| how many maps and reduces can run on a given TaskTracker. This class |
| should implement the LoadManager interface. By default the task caps |
| in the Hadoop config file are used, but this option could be used to |
| make the load based on available memory and CPU utilization for example. |
| </td> |
| </tr> |
| <tr> |
| <td> |
| mapred.fairscheduler.taskselector: |
| </td> |
| <td> |
| An extensibility point that lets you specify a class that determines |
| which task from within a job to launch on a given tracker. This can be |
| used to change either the locality policy (e.g. keep some jobs within |
| a particular rack) or the speculative execution algorithm (select |
| when to launch speculative tasks). The default implementation uses |
| Hadoop's default algorithms from JobInProgress. |
| </td> |
| </tr> |
| </table> |
| </section> |
| <section> |
| <title> Administration</title> |
| <p> |
| The fair scheduler provides support for administration at runtime |
| through two mechanisms: |
| </p> |
| <ol> |
| <li> |
| It is possible to modify pools' allocations |
| and user and pool running job limits at runtime by editing the allocation |
| config file. The scheduler will reload this file 10-15 seconds after it |
| sees that it was modified. |
| </li> |
| <li> |
| Current jobs, pools, and fair shares can be examined through the |
| JobTracker's web interface, at http://<jobtracker URL>/scheduler. |
| On this interface, it is also possible to modify jobs' priorities or |
| move jobs from one pool to another and see the effects on the fair |
| shares (this requires JavaScript). |
| </li> |
| </ol> |
| <p> |
| The following fields can be seen for each job on the web interface: |
| </p> |
| <ul> |
| <li><em>Submitted</em> - Date and time job was submitted.</li> |
| <li><em>JobID, User, Name</em> - Job identifiers as on the standard |
| web UI.</li> |
| <li><em>Pool</em> - Current pool of job. Select another value to move job to |
| another pool.</li> |
| <li><em>Priority</em> - Current priority. Select another value to change the |
| job's priority</li> |
| <li><em>Maps/Reduces Finished</em>: Number of tasks finished / total tasks.</li> |
| <li><em>Maps/Reduces Running</em>: Tasks currently running.</li> |
| <li><em>Map/Reduce Fair Share</em>: The average number of task slots that this |
| job should have at any given time according to fair sharing. The actual |
| number of tasks will go up and down depending on how much compute time |
| the job has had, but on average it will get its fair share amount.</li> |
| </ul> |
| <p> |
| In addition, it is possible to turn on an "advanced" view for the web UI, |
| by going to http://<jobtracker URL>/scheduler?advanced. This view shows |
| four more columns used for calculations internally: |
| </p> |
| <ul> |
| <li><em>Maps/Reduce Weight</em>: Weight of the job in the fair sharing |
| calculations. This depends on priority and potentially also on |
| job size and job age if the <em>sizebasedweight</em> and |
| <em>NewJobWeightBooster</em> are enabled.</li> |
| <li><em>Map/Reduce Deficit</em>: The job's scheduling deficit in machine- |
| seconds - the amount of resources it should have gotten according to |
| its fair share, minus how many it actually got. Positive deficit means |
| the job will be scheduled again in the near future because it needs to |
| catch up to its fair share. The scheduler schedules jobs with higher |
| deficit ahead of others. Please see the Implementation section of |
| this document for details.</li> |
| </ul> |
| </section> |
| <section> |
| <title>Implementation</title> |
| <p>There are two aspects to implementing fair scheduling: Calculating |
| each job's fair share, and choosing which job to run when a task slot |
| becomes available.</p> |
| <p>To select jobs to run, the scheduler then keeps track of a |
| "deficit" for each job - the difference between the amount of |
| compute time it should have gotten on an ideal scheduler, and the amount |
| of compute time it actually got. This is a measure of how |
| "unfair" we've been to the job. Every few hundred |
| milliseconds, the scheduler updates the deficit of each job by looking |
| at how many tasks each job had running during this interval vs. its |
| fair share. Whenever a task slot becomes available, it is assigned to |
| the job with the highest deficit. There is one exception - if there |
| were one or more jobs who were not meeting their pool capacity |
| guarantees, we only choose among these "needy" jobs (based |
| again on their deficit), to ensure that the scheduler meets pool |
| guarantees as soon as possible.</p> |
| <p> |
| The fair shares are calculated by dividing the capacity of the cluster |
| among runnable jobs according to a "weight" for each job. By |
| default the weight is based on priority, with each level of priority |
| having 2x higher weight than the next (for example, VERY_HIGH has 4x the |
| weight of NORMAL). However, weights can also be based on job sizes and ages, |
| as described in the Configuring section. For jobs that are in a pool, |
| fair shares also take into account the minimum guarantee for that pool. |
| This capacity is divided among the jobs in that pool according again to |
| their weights. |
| </p> |
| <p>Finally, when limits on a user's running jobs or a pool's running jobs |
| are in place, we choose which jobs get to run by sorting all jobs in order |
| of priority and then submit time, as in the standard Hadoop scheduler. Any |
| jobs that fall after the user/pool's limit in this ordering are queued up |
| and wait idle until they can be run. During this time, they are ignored |
| from the fair sharing calculations and do not gain or lose deficit (their |
| fair share is set to zero).</p> |
| </section> |
| </body> |
| </document> |