| % |
| % Licensed to the Apache Software Foundation (ASF) under one |
| % or more contributor license agreements. See the NOTICE file |
| % distributed with this work for additional information |
| % regarding copyright ownership. The ASF licenses this file |
| % to you under the Apache License, Version 2.0 (the |
| % "License"); you may not use this file except in compliance |
| % with the License. You may obtain a copy of the License at |
| % |
| % http://www.apache.org/licenses/LICENSE-2.0 |
| % |
| % Unless required by applicable law or agreed to in writing, |
| % software distributed under the License is distributed on an |
| % "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| % KIND, either express or implied. See the License for the |
| % specific language governing permissions and limitations |
| % under the License. |
| % |
| % Create well-known link to this spot for HTML version |
| \ifpdf |
| \else |
| \HCode{<a name='DUCC_RM'></a>} |
| \fi |
| \chapter{Resource Management} |
| \label{chap:rm} |
| \section{Overview} |
| |
| The DUCC Resource Manager is responsible for allocating cluster resources among the various |
| requests for work in the system. DUCC recognizes several categories of work: |
| |
| \begin{description} |
| \item[Managed Jobs] |
| Managed jobs are Java applications implemented in the UIMA framework |
| and are scaled out by DUCC as some number of discrete processes. Processes which |
| compose managed jobs are always restartable and usually preemptable. Preemption |
| occurs as a consequence of enforcing fair-share scheduling policies. |
| |
| \item[Services] |
| Services are long-running processes which perform some (common) function on behalf of |
jobs or other services. From the RM's point of view, services are scaled out as a set of
unrelated, non-preemptable processes.
| |
| \item[Reservations] |
| A reservation provides non-preemptable, persistent, dedicated use of a full machine or |
| some part of a machine to a specific user. |
| |
| \item[Arbitrary Processes] |
| An {\em arbitrary process} or {\em managed reservation} is any process at all, which may |
| or may not have anything to do with UIMA. These processes are typically used to |
| run non-UIMA tasks such as application builds, large Eclipse workspaces for debugging, |
| etc. These processes are usually scheduled as non-preemptable allocations, |
| occupying either a dedicated machine or some portion of a machine. |
| |
| \end{description} |
| |
To apportion the cumulative memory resource among requests, the Resource Manager
defines some minimum unit of memory and allocates machines such that a ``fair'' number of
``memory units'' is awarded to every user of the system. This minimum quantity is called a share
quantum, or simply, a share. The scheduling goal is to award an equitable number of memory shares to
every user of the system. The memory shares in a system are divided equally among all the users
who have work in the system. Once an allocation is assigned to a user, that user's jobs are in turn
assigned an equal number of shares out of the user's allocation. Finally, the Resource
Manager maps the share allotments to physical resources. To do so, the Resource
Manager considers the amount of memory that each job declares it
requires for each process. That per-process memory requirement is translated into the minimum
number of collocated quantum shares required for the process to run.
| |
To compute the memory requirements for a job, the declared per-process memory is rounded up to the
nearest multiple of the share quantum, giving the number of quanta each process requires. The total
number of quantum shares awarded to the job is then divided by the quanta per process to arrive at
the number of processes to allocate. The output of each scheduling cycle is always in terms of processes,
| where each process is allowed to occupy some number of shares. The DUCC agents implement a |
| mechanism to ensure that no user's job processes exceed their allocated memory assignments. |
| |
| For example, suppose the share quantum is 15GB. A job that declares it requires 14GB per process |
| is assigned one quantum share per process. If that job is assigned 20 shares, it will be allocated 20 |
| processes across the cluster. A job that declares 28GB per process would be assigned two quanta |
| per process. If that job is assigned 20 shares, it is allocated 10 processes across the cluster. Both |
| jobs occupy the same amount of memory; they consume the same level of system resources. The |
| second job does so in half as many processes. |
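
As an illustration of this arithmetic, the following fragment computes the quanta per process and
the resulting process count for the two example jobs above. It is a minimal sketch with
hypothetical names; it is not part of the RM implementation.

\begin{verbatim}
// Illustrative sketch only; names such as quantaPerProcess are hypothetical.
public class QuantumMath {
    // Quanta per process: declared memory rounded up to the next quantum multiple.
    static int quantaPerProcess(int declaredGB, int quantumGB) {
        return (declaredGB + quantumGB - 1) / quantumGB;        // ceiling division
    }

    // Processes allocated: the job's awarded quantum shares divided by quanta per process.
    static int processesForJob(int awardedShares, int declaredGB, int quantumGB) {
        return awardedShares / quantaPerProcess(declaredGB, quantumGB);
    }

    public static void main(String[] args) {
        int quantum = 15;                                       // 15GB share quantum
        System.out.println(processesForJob(20, 14, quantum));   // prints 20
        System.out.println(processesForJob(20, 28, quantum));   // prints 10
    }
}
\end{verbatim}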
| |
| |
| Some work may be deemed to be more "important" than other work. To accommodate this, the RM |
| implements a weighted fair-share scheduler. During the fair share |
calculations, jobs with higher weights are assigned proportionally more shares; jobs
with lower weights are assigned proportionally fewer shares. Jobs with equal weights are assigned
| an equal number of shares. |
| |
The abstraction used to organize jobs by fair-share weight is the
job class, or simply {\em class}. Every job submission is associated with a job class; if none
| is declared, a default class is chosen by DUCC. As jobs enter the system they are |
| grouped with other jobs of the same class weight. The class abstraction |
| and its attributes are described in \hyperref[sec:rm.job-classes]{subsequent sections}. |
| |
| The scheduler executes in three primary phases: |
| \begin{enumerate} |
| |
| \item The How-Much phase: every job is assigned some number of |
| quantum shares, which is converted to the number of |
| processes of the declared size. |
| |
| \item The What-Of phase: physical machines are found which can |
| accommodate the number of processes allocated by the |
| How-Much phase. Jobs are mapped to physical machines such |
| that the total declared per-process amount of memory for all |
jobs scheduled to a machine does not exceed the physical
| memory on the machine. |
| |
\item Defragmentation. If the What-Of phase cannot allocate
      space according to the output of the How-Much phase, the
      system is said to be {\em fragmented.} The RM scans for
      ``rich'' jobs and will attempt to preempt some small number
      of processes sufficient to guarantee every job gets at least
      one process allocation. (Note that sometimes this is not possible,
      in which case unscheduled work remains pending until such
      time as space is freed up.)
| |
| \end{enumerate} |
| |
The How-Much phase is itself subdivided into three sub-phases, sketched in code after the list:
| \begin{enumerate} |
| |
\item Class counts: Apply weighted fair-share to all the job classes that have jobs assigned to
| them. This apportions all shares in the system among all the classes according to their |
| weights. This phase takes into account all users and all jobs in the system. |
| |
| \item User counts: For each class, collect all the users with |
| jobs submitted to that class, and apply fair-share (with |
| equal weights) to equally divide all the class shares among |
| the users. |
| |
| \item Job counts: For each user, collect all jobs |
| assigned to that user and equally divide all the user's shares among |
| the jobs. This apportions all shares given to this user for each class among the user's |
| jobs in that class. |
| \end{enumerate} |
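
A compressed sketch of these three sub-phases follows. The names and data are hypothetical, and
the real algorithm additionally handles remainders, caps, and per-job limits; the fragment only
illustrates the weight-proportional and equal divisions described above.

\begin{verbatim}
import java.util.*;

// Illustrative sketch of the How-Much sub-phases; not the actual RM code.
public class HowMuchSketch {
    // 1. Class counts: apportion 'total' shares to classes in proportion to their weights.
    static Map<String, Integer> byWeight(int total, Map<String, Integer> weights) {
        int weightSum = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> out = new LinkedHashMap<>();
        weights.forEach((name, w) -> out.put(name, total * w / weightSum));
        return out;
    }

    // 2 and 3. User counts and job counts: divide 'total' shares equally.
    static Map<String, Integer> equally(int total, List<String> names) {
        Map<String, Integer> out = new LinkedHashMap<>();
        names.forEach(n -> out.put(n, total / names.size()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> classShares = byWeight(120, Map.of("normal", 100, "low", 50));
        Map<String, Integer> userShares  = equally(classShares.get("normal"),
                                                   List.of("alice", "bob"));
        Map<String, Integer> jobShares   = equally(userShares.get("alice"),
                                                   List.of("job1", "job2"));
        System.out.println(classShares + " " + userShares + " " + jobShares);
    }
}
\end{verbatim}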
| |
| All non-preemptable allocations are restricted to one allocation per request. If space is |
| available, the request succeeds immediately. If space can be made for the request through |
| preemptions, the preemptions are scheduled and the reservation is deferred until space |
| is available. If space cannot be found by means of preemption, the reservation remains |
| pending until it either succeeds (by cancelation of other non-preemptive work, by |
| adding resources to the system, or by increasing the user's non-preemptive allotment), or until |
| it is canceled by the user or an administrator. |
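
The disposition of a single non-preemptable request can be summarized as follows; this is a
minimal sketch with hypothetical names, and the real RM tracks considerably more state.

\begin{verbatim}
// Illustrative sketch of how one non-preemptable request is dispositioned.
public class NonPreemptableRequest {
    enum Disposition { ALLOCATED, DEFERRED_PENDING_PREEMPTION, PENDING }

    static Disposition disposition(boolean spaceAvailable, boolean spaceFreedByPreemption) {
        if (spaceAvailable)          return Disposition.ALLOCATED;  // succeeds immediately
        if (spaceFreedByPreemption)  return Disposition.DEFERRED_PENDING_PREEMPTION;
        return Disposition.PENDING;  // waits for cancelations, new resources, or a larger allotment
    }
}
\end{verbatim}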
| |
| \section{Preemption vs Eviction} |
| The RM makes a subtle distinction between {\em preemption} and {\em eviction}. |
| |
| {\em Preemption} occurs only as a result of fair-share |
| calculations or defragmentation. Preemption is the process of |
deallocating shares from jobs belonging to users whose current
allocation exceeds their fair share; conversely, only processes
| belonging to fair-share jobs can be preempted. This is generally |
| dynamic: more jobs in the system result in a smaller fair-share |
| for any given user, and fewer jobs result in a higher fair-share |
| allocation. |
| |
| {\em Eviction} occurs only as a result of system-detected errors, |
| changes in node configuration, or changes in class |
| configuration. Eviction may affect both preemptable work and some |
| types of non-preemptable work. |
| |
Work that is non-preemptable but restartable can be evicted. Such work consists of service
| processes (which are automatically resubmitted by the Service Manager), and managed reservations, |
| which can be resubmitted by the user. |
| |
| Unmanaged reservations are never evicted for any reason. If something occurs that |
| would result in the reservation being (fatally) misplaced, the node is marked |
| unschedulable and remains as such until the condition is corrected or the reservation |
is canceled. Once the condition is repaired (either the reservation is canceled, or
| the problem is corrected), the node becomes schedulable again. |
| |
| \section{Scheduling Policies} |
| |
| The Resource Manager implements three scheduling policies. Scheduling policies are |
| associated with \hyperref[sec:rm.job-classes]{\em classes}. |
| \begin{description} |
| \item[FAIR\_SHARE] This is weighted-fair-share. All processes scheduled under |
| fair-share are always {\em preemptable}. |
| |
| \item[FIXED\_SHARE] The FIXED\_SHARE policy is used to allocate non-preemptable |
| shares. The shares might be {\em evicted} as described above, but they are |
never {\em preempted}. Fixed-share allocations are restricted to one
| allocation per request and may be subject to \hyperref[sec:rm.allotment]{allotment caps}. |
| |
| FIXED\_SHARE allocations have several uses: |
| \begin{itemize} |
\item Unmanaged reservations. In this case DUCC starts no work in the share(s); the user must
| log in (or run something via ssh), and then manually release the reservation to free |
| the resources. This is often used for testing and debugging. |
| \item Services. If a service is registered to run in a FIXED\_SHARE allocation, |
| DUCC allocates the resources, starts and manages the service, and releases the |
| resource if the service is stopped or unregistered. |
| \item UIMA jobs. A ``normal'' UIMA job may be submitted to a FIXED\_SHARE |
| class. In this case, the processes are never preempted, allowing constant and |
| predictable execution of the job. The resources are automatically released when |
| the job exits. |
| \item Managed reservations. The \hyperref[sec:cli.viaducc]{\em viaducc} utility is provided |
| as a convenience for running managed reservations. |
| \end{itemize} |
| |
\item[RESERVE] The RESERVE policy is used to allocate a full, dedicated machine.
      The allocation may be {\em evicted} but it is never {\em preempted}. It is
      restricted to a single machine per request. The memory size
      specified in the reservation must match the machine size
      exactly, within the limits of rounding to the next highest multiple of the
      quantum, as sketched after this list. DUCC will not ``promote'' a reservation
      request to a larger machine than is asked for. A reservation that does not
      adequately match any machine remains pending until resources are made available
      or it is canceled by the user or an administrator. Reservations may be
      subject to \hyperref[sec:rm.allotment]{allotment caps}.
| |
| \end{description} |
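
One plausible reading of the RESERVE matching rule is sketched below with hypothetical names. In
particular, the way the machine's own memory is quantized here is an assumption, not a statement
of the actual RM behavior.

\begin{verbatim}
// Illustrative sketch: a reservation fits a machine when the request, rounded up
// to the next quantum multiple, equals the machine's quantized size.
public class ReserveMatch {
    static int quanta(int gb, int quantumGB) {
        return (gb + quantumGB - 1) / quantumGB;    // round up to the quantum
    }

    // No "promotion": a small request does not reserve a much larger machine.
    static boolean matches(int requestedGB, int machineGB, int quantumGB) {
        return quanta(requestedGB, quantumGB) == quanta(machineGB, quantumGB);
    }
}
\end{verbatim}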
| |
| \section{Allotment} |
| \label{sec:rm.allotment} |
| |
| Allotment is a new concept introduced with DUCC 2.0.0 to prevent non-preemptable |
| requests from dominating a cluster. This replaces the DUCC version 1 class |
| policies of max-processes and max-machines. |
| |
| It is possible to associate a maximum share allotment with any non-preemptable class. |
| Allotment is assigned per user and is global across all non-preemptable classes. It is configured |
in \hyperref[sec:ducc.properties]{ducc.properties} with {\em ducc.rm.global\_allotment}.
| |
| A simple user registry provides per-user overrides of the global allotment as needed. The |
| registry may be included in the class definition file (specified in ducc.properties under |
| ducc.rm.class.definitions), or in a separate file, specified in ducc.properties as |
| {\em ducc.rm.user.registry}. |
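
The relevant entries in ducc.properties might look like the following. The property names are
those given above; the values and file names shown are purely illustrative.

\begin{verbatim}
# Global non-preemptable allotment, applied per user across all
# non-preemptable classes (the value shown is illustrative only).
ducc.rm.global_allotment = 360

# Class definition file; the user registry may live here or in its own file.
ducc.rm.class.definitions = ducc.classes

# Optional separate user registry with per-user allotment overrides.
ducc.rm.user.registry     = ducc.users
\end{verbatim}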
| |
| |
| \section{Priority vs Weight} |
| |
It is possible that the various policies may interfere with each other. It is also possible that
the fair-share weights are not sufficient to guarantee that adequate resources are allocated to
high-importance jobs. Class-based priorities are used to resolve these conflicts.
| |
| Simply: priority is used to specify the order of evaluation of the |
| job classes. Weight is used to proportionally allocate the number |
| of shares to all classes of the same priority under the weighted |
| fair-share policies. |
| |
| \paragraph{Priority.} |
| |
When a scheduling cycle starts, the scheduling classes are ordered from ``best'' to ``worst'' priority.
The scheduler first attempts to allocate {\em all} of the system's resources to the best-priority class.
If any resources are left, the scheduler proceeds to schedule classes at the next best
priority, and so on, until either all the
resources are exhausted or there is no more work to schedule.
| |
It is possible to have multiple job classes of the same priority. In that case, resources
are allocated to that set of job classes from the same pool of resources at the same time, usually
under weighted fair-share. (It would be unusual to have multiple non-preemptable classes at the
same priority; if this is configured, the class requests are filled arbitrarily with no attempt
to divide the resources fairly or equitably.) Resources for higher priority
classes will have already been allocated; resources for lower priority classes may never become
available.
| |
| To constrain high priority jobs from completely monopolizing the |
| system, fair-share weights are used for FAIR\_SHARE classes, and |
| allotment is used for non-preemptable classes. |
| |
| \paragraph{Weight.} Weight is used to determine the relative importance of jobs in a set of job classes of |
| the same priority when doing fair-share allocation. All job classes of the same priority are assigned |
| shares from the full set of available resources according to their weights using weighted fair-share. |
| Weights are used only for fair-share allocation. |
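
Putting priority and weight together: a scheduling cycle can be thought of as iterating over
priorities from best to worst and splitting whatever remains among the classes at each priority
according to their weights. The sketch below is illustrative only; the names and data are
hypothetical, lower numbers are assumed to be ``better'' priorities, and integer remainders are
simply ignored.

\begin{verbatim}
import java.util.*;

// Illustrative sketch of priority ordering plus weighted fair-share within a priority.
public class PriorityAndWeight {
    record JobClass(String name, int priority, int weight, int demand) {}

    public static void main(String[] args) {
        List<JobClass> classes = List.of(
            new JobClass("urgent",      5, 100, 40),
            new JobClass("normal",     10, 100, 40),
            new JobClass("background", 10,  50, 40));
        int remaining = 60;                                  // total quantum shares available

        TreeMap<Integer, List<JobClass>> byPriority = new TreeMap<>();
        for (JobClass jc : classes)
            byPriority.computeIfAbsent(jc.priority(), p -> new ArrayList<>()).add(jc);

        for (List<JobClass> group : byPriority.values()) {   // best priority first
            int totalWeight = group.stream().mapToInt(JobClass::weight).sum();
            int pool = remaining;                            // what this priority may divide
            for (JobClass jc : group) {
                int award = Math.min(jc.demand(), pool * jc.weight() / totalWeight);
                remaining -= award;
                System.out.println(jc.name() + " -> " + award + " shares");
            }
        }
    }
}
\end{verbatim}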
| |
| \section{Node Pools} |
| It may be desired or necessary to constrain certain types of resource allocations to a specific |
| subset of the resources. Some nodes may have special hardware, or perhaps it is desired to |
| prevent certain types of jobs from being scheduled on some specific set of machines. Nodepools |
| are designed to provide this function. |
| |
| Nodepools impose hierarchical partitioning on the set of available machines. A nodepool is a |
| subset of the full set of machines in the cluster. Nodepools may not overlap. A nodepool may |
| itself contain non-overlapping subpools. |
| |
Job classes are associated with nodepools. The scheduler treats preemptable work and
non-preemptable work differently with regard to nodepools:
| \begin{description} |
\item[Preemptable work.] The scheduler will attempt to allocate preemptable work in
  the nodepool associated with the work's class. If this nodepool becomes exhausted,
  and there are subpools, the scheduler proceeds to try to allocate resources within
  the subpools, recursively, until either all work is scheduled or there is no more
  work to schedule, as sketched after this list. (Allocations made within subpools are
  referred to as ``squatters''; allocations made in the directly associated nodepool
  are referred to as ``residents''.)
| |
| During eviction, the scheduler attempts to evict squatters first and only evicts |
| residents once all the squatters are gone. |
| |
\item[Non-Preemptable work.] Non-preemptable work can only be allocated directly
  in the nodepool associated with the work's class. Such work can never become a
  squatter. The reason is that non-preemptable squatters cannot be evicted, and so
  could dominate pools intended for other work.
| \end{description} |
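
The resident/squatter distinction for preemptable work, and the restriction of non-preemptable
work to the class's own nodepool, are sketched below. The names are hypothetical, and the real
scheduler works in terms of machines and shares rather than a simple free-share counter.

\begin{verbatim}
import java.util.*;

// Illustrative sketch of nodepool traversal; not the actual RM data structures.
public class NodepoolSketch {
    static class Nodepool {
        final String name;
        int freeShares;
        final List<Nodepool> subpools = new ArrayList<>();
        Nodepool(String name, int freeShares) { this.name = name; this.freeShares = freeShares; }
    }

    // Preemptable work: fill the class's own pool first (residents), then recurse
    // into subpools (squatters) until the request is satisfied or space runs out.
    static int allocatePreemptable(Nodepool pool, int wanted) {
        int placed = Math.min(pool.freeShares, wanted);
        pool.freeShares -= placed;
        for (Nodepool sub : pool.subpools) {
            if (placed == wanted) break;
            placed += allocatePreemptable(sub, wanted - placed);
        }
        return placed;
    }

    // Non-preemptable work: only the class's own pool may be used, never a subpool.
    static int allocateNonPreemptable(Nodepool pool, int wanted) {
        int placed = Math.min(pool.freeShares, wanted);
        pool.freeShares -= placed;
        return placed;
    }

    public static void main(String[] args) {
        Nodepool top = new Nodepool("top", 4);
        top.subpools.add(new Nodepool("subpool", 8));
        System.out.println(allocatePreemptable(top, 10));     // 4 residents + 6 squatters
        System.out.println(allocateNonPreemptable(top, 10));  // 0: the top pool is now empty
    }
}
\end{verbatim}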
| |
| More information on nodepools and their configuration can be \hyperref[subsec:nodepools]{found here}. |
| |
| \section{Scheduling Classes} |
| \label{sec:rm.job-classes} |
| The primary abstraction to control and configure the scheduler is the {\em class}. A class is simply a set |
| of rules used to parametrize how resources are assigned to work requests. Every request that enters the system is |
| associated with a single class. |
| |
The scheduling class defines the following rules; an illustrative summary follows the list:
| |
| \begin{description} |
| \item[Priority] This is the order of evaluation and assignment of resources to this class. See |
the discussion of Priority vs Weight for details.
| |
| \item[Weight] This is used for the weighted fair-share calculations. |
| |
\item[Scheduling Policy] This defines the policy (FAIR\_SHARE, FIXED\_SHARE, or RESERVE) used to
  schedule the jobs in this class.
| |
\item[Nodepool] A class may be associated with exactly one nodepool. Fair-share jobs submitted to the class
  are assigned only resources which lie in that nodepool, or in any of the subpools defined
  within that nodepool. Non-preemptable requests must always be fulfilled from the nodepool
  assigned to the class; they are never assigned to that nodepool's subpools.
| |
| \item[Prediction] For the type of work that DUCC is designed to run, new processes typically take |
| a great deal of time initializing. It is not unusual to experience 30 minutes or more of |
| initialization before work items start to be processed. |
| |
| When a job is expanding (i.e. the number of assigned processes is allowed to dynamically |
| increase), it may be that the job will complete before the new processes can be assigned and |
| the analytics within them complete initialization. In this situation it is wasteful to allow the |
| job to expand, even if its fair-share is greater than the number of processes it currently has |
| assigned. |
| |
  When prediction is enabled, the scheduler considers the average initialization time of the
  processes in this job and the current rate of work completion, and predicts the number of
  processes needed to complete the job in the optimal amount of time. If this number is less
  than the job's fair share, the actual allocation is capped at the predicted need.
| |
| \item[Prediction Fudge] When doing prediction, it may be desired to look some time into the |
| future past initialization times to predict if the job will end soon after it is expanded. |
| The prediction fudge specifies a time past the expected initialization time that is used to |
| predict the number of future shares needed. This avoids wasteful preemption of work to make space |
| for other work that will be completing very soon anyway. |
| |
| \item[Initialization cap] Because of the long initialization time of processes in most DUCC jobs, |
| process failure during the initialization phase can be very expensive in terms of wasted |
| resources. If a process is going to fail because of bugs, missing services, or any other |
| reason, it is best to catch it early. |
| |
| The initialization cap is used to limit the number of processes assigned to a job until it is |
| known that at least one process has successfully passed from initialization to running. As soon |
| as this occurs the scheduler will proceed to assign the job its full fair-share of resources. |
| |
| \item[Expand By Doubling] Even after initialization has succeeded, it may be desired to throttle |
| the rate of expansion of a job into new processes. |
| |
| When expand-by-doubling is enabled, the scheduler allocates either twice the number of |
  resources a job currently has, or its fair-share of resources, whichever is smaller.
| |
| If expand-by-doubling is disabled, jobs are allocated their full fair-share immediately. |
| |
| \end{description} |
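
The attributes above can be collected into a single container, as in the sketch below. The field
names are hypothetical and chosen for readability; they are not the keywords used in the actual
class definition file.

\begin{verbatim}
// Illustrative summary of the per-class scheduling attributes; not actual DUCC code.
public record SchedulingClass(
        String  name,
        int     priority,            // order of evaluation; best priority is scheduled first
        int     weight,              // proportion used by weighted fair-share
        String  policy,              // FAIR_SHARE, FIXED_SHARE, or RESERVE
        String  nodepool,            // the single nodepool this class draws resources from
        boolean prediction,          // cap expansion by the predicted need
        long    predictionFudgeMs,   // look-ahead beyond the expected initialization time
        int     initializationCap,   // processes allowed until one initialization succeeds
        boolean expandByDoubling) {} // throttle expansion after initialization succeeds
\end{verbatim}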
| |
More information on scheduling classes and their configuration can be \hyperref[subsubsec:class.configuration]{found here}.