%
% Licensed to the Apache Software Foundation (ASF) under one
% or more contributor license agreements. See the NOTICE file
% distributed with this work for additional information
% regarding copyright ownership. The ASF licenses this file
% to you under the Apache License, Version 2.0 (the
% "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing,
% software distributed under the License is distributed on an
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
% KIND, either express or implied. See the License for the
% specific language governing permissions and limitations
% under the License.
%
\section{Resource Manager Configuration: Classes and Nodepools}
\label{sec:ducc.classes}
The class configuration file configures the rules used by the Resource Manager for job
scheduling. See the \hyperref[chap:rm]{Resource Manager chapter} for a detailed description of the DUCC
scheduler, scheduling classes, and how classes are used to configure the scheduling process.
The scheduler configuration file is specified in ducc.properties by the property
{\em ducc.rm.class.definitions}; the default name is ducc.classes.
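For example, the relevant entry in ducc.properties might look like the following (a minimal
sketch, using the default name):
\begin{verbatim}
# ducc.properties: file containing the class and nodepool definitions
ducc.rm.class.definitions = ducc.classes
\end{verbatim}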
\subsection{Nodepools}
\label{subsec:nodepools}
\subsubsection{Overview}
A {\em nodepool} is a grouping of a subset of the physical nodes to allow differing
scheduling policies to be applied to different nodes in the system. Some typical
nodepool groupings might include:
\begin{enumerate}
\item Group Intel and Power nodes separately so that users may submit jobs that run
only on the Intel architecture, only on Power, or ``don't care''.
\item Designate a group of nodes with large locally attached disks such that users
can run jobs that require those disks.
\item Designate a specific set of nodes with specialized hardware such as high-speed
network, such that jobs can be scheduled to run only on those nodes.
\end{enumerate}
A Nodepool is a subset of some larger collection of nodes. Nodepools themselves may be
further subdivided. Nodepools may not overlap: every node belongs to exactly
one nodepool. During system start-up the consistency of the nodepool definitions is checked
and the system will refuse to start if the configuration is incorrect.
NOTE: The administrative command {\em check\_ducc -c} may be used to verify and validate
your class configuration before attempting to start DUCC. {\em check\_ducc -cv} may be used
to additionally ``pretty-print'' the ducc.classes configuration to the console to reveal
class nesting and inheritance.
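An illustrative invocation (output details may vary by release):
\begin{verbatim}
# Validate the class and nodepool configuration without starting DUCC
check_ducc -c

# Additionally pretty-print the compiled configuration, showing
# nodepool nesting and class inheritance
check_ducc -cv
\end{verbatim}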
For example, the diagram below is an abstract representation of all the nodes in a
system. There are five nodepools defined:
\begin{itemize}
\item Nodepool ``NpAllOfThem'' is subdivided into three pools, NP1, NP2, and NP3. All
the nodes not contained in NP1, NP2, and NP3 belong to the pool called ``NpAllOfThem''.
\item Nodepool NP1 is not further subdivided.
\item Nodepool NP2 is not further subdivided.
\item Nodepool NP3 is further subdivided to form NP4. All nodes within NP3 but
not in NP4 belong directly to NP3.
\item Nodepool NP4 is not further subdivided.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=5.5in]{images/Nodepool1.jpg}
\caption{Nodepool Example}
\label{fig:Nodepools1}
\end{figure}
In the figure below the Nodepools are incorrectly defined for two reasons:
\begin{enumerate}
\item NP1 and NP2 overlap.
\item NP4 overlaps both nodepool ``NpAllOfThem'' and NP3.
\end{enumerate}
\begin{figure}[H]
\centering
\includegraphics[width=5.5in]{images/Nodepool2.jpg}
\caption{Nodepools: Overlapping Pools are Incorrect}
\label{fig:Nodepools2}
\end{figure}
Multiple ``top-level'' nodepools are allowed. A ``top-level'' nodepool has no containing
pool. Multiple top-level pools logically divide a cluster of machines into {\em multiple
independent clusters} from the standpoint of the scheduler. Work scheduled over one
pool in no way affects work scheduled over the other pool. The figure below shows an
abstract nodepool configuration with two top-level nodepools, ``Top-NP1'' and ``Top-NP2''.
\begin{figure}[H]
\centering
\includegraphics[width=5.5in]{images/Nodepool3.jpg}
\caption{Nodepools: Multiple top-level Nodepools}
\label{fig:Nodepools3}
\end{figure}
\subsubsection{Scheduling considerations}
A primary goal of the scheduler is to ensure that no resources are left idle if there
is pending work that is able to use those resources. Therefore, work scheduled to
a class defined over a specific nodepool (say, NpAllOfThem) may be scheduled on nodes
in any of the nodepools contained within NpAllOfThem. If work defined over a
subpool (such as NP1) arrives, processes on nodes in NP1 that were scheduled for
NpAllOfThem are considered {\em squatters} and are the most likely candidates for
eviction. (Processes assigned to their proper nodepools are considered {\em residents}
and are evicted only after all {\em squatters} have been evicted.) The scheduler strives
to avoid creating {\em squatters}.
Because non-preemptable allocations cannot be preempted, work submitted to a class
implementing one of the non-preemptable policies (FIXED\_SHARE or RESERVE) is never allowed
to ``squat'' in other nodepools and is scheduled only on nodes in its
proper nodepool.
In the case of multiple top-level nodepools: these nodepools and their sub-pools
form independent scheduling groups. Specifically,
\begin{itemize}
\item Fair-share allocations over any nodepool in one top-level pool do NOT affect the
fair-share allocations for jobs in any other top-level nodepool. Top-level nodepools
define independently scheduled pools of resources within a single DUCC cluster.
\item Work submitted to classes under one top-level nodepool does NOT get expanded to
nodes under another top-level nodepool, even if there is sufficient capacity.
\end{itemize}
Most installations will want to assign the majority of nodes to a single top-level
nodepool (or its subpools), using other top-level pools for nodes that cannot be
shared with other work.
\subsubsection{Configuration}
\label{subsubsec:nodepool.configuration}
DUCC uses simple named stanzas containing key/value pairs to configure nodepools.
At least one nodepool definition is required. This nodepool need not have any subpools or node
definitions. The first top-level nodepool is considered the ``default'' nodepool. Any node that
checks in with DUCC but is not named specifically in one of the node files is assigned to this
first, {\em default} nodepool.
Thus, if only one nodepool is defined with no other attributes, all nodes are
assigned to that pool.
A nodepool definition consists of the token ``Nodepool'' followed by the
name of the nodepool, followed by a block delimited with ``curly'' braces \{ and \}. This
block contains the attributes of the nodepool as key/value pairs.
Line ends are ignored. A semicolon ``$;$'' may optionally be used to
delimit key/value pairs for readability, and an equals sign ``='' may optionally
be used to delimit keys from values, also just for readability. See the
sample configuration \hyperref[fig:nodepool.configuration]{below}.
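For instance, the two definitions below are equivalent; the first uses the single-line
style with semicolons, the second spreads the key/value pairs over multiple lines and
uses equals signs (both styles appear in the sample configurations later in this section):
\begin{verbatim}
Nodepool intel { nodefile intel.nodes ; parent --default-- }

Nodepool intel {
    nodefile = intel.nodes
    parent   = --default--
}
\end{verbatim}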
The attributes of a Nodepool are:
\begin{description}
\item[domain] This is valid only in the ``default'' (first) nodepool. Any node
in any nodefile which does not have a domain, and any node which checks
in to the Resource Manager without a domain name, is assigned this domain name
so that the scheduler may deal entirely with fully-qualified node names.
If no {\em domain} is specified, DUCC will attempt to guess the domain based
on the domain name returned on the node where the Resource Manager resides.
\item[nodefile] This is the name of a file containing the names of the nodes
which are members of this nodepool.
\item[parent] This is used to indicate which nodepool is the logical parent.
Any nodepool without a {\em parent} is considered a top-level nodepool.
\end{description}
The following example defines six nodepools:
\begin{enumerate}
\item A top-level nodepool called ``--default--''. All nodes not named
in any nodefile are assigned to this nodepool.
\item A top-level nodepool called ``jobdriver'', consisting of the nodes
named in the file {\em jobdriver.nodes}.
\item A subpool of ``--default--'' called ``intel'', consisting of the
nodes named in {\em intel.nodes}.
\item A subpool of ``--default--'' called ``power'', consisting of the
nodes named in the file {\em power.nodes}.
\item A subpool of ``intel'' called ``nightly-test'', consisting of the
nodes named in {\em nightly-test.nodes}.
\item And a subpool of ``power'' called ``timing-p7'', consisting of the
nodes named in {\em timing-p7.nodes}.
\end{enumerate}
\begin{figure}[H]
\begin{verbatim}
Nodepool --default-- { domain mydomain.net }
Nodepool jobdriver { nodefile jobdriver.nodes }
Nodepool intel { nodefile intel.nodes ; parent --default-- }
Nodepool power { nodefile power.nodes ; parent --default-- }
Nodepool nightly-test { nodefile nightly-test.nodes ; parent intel }
Nodepool timing-p7 { nodefile timing-p7.nodes ; parent power }
\end{verbatim}
\caption{Sample Nodepool Configuration}
\label{fig:nodepool.configuration}
\end{figure}
\subsection{Class Definitions}
\label{subsubsec:class.configuration}
Scheduler classes are defined in the same simple block language as
nodepools.
A simple inheritance (or ``template'') scheme is supported for classes. Any
class may be configured to ``derive'' from any other class. In this case, the
child class acquires all the attributes of the parent class, any of which may
be selectively overridden. Multiple inheritance is not supported but
nested inheritance is; that is, class A may inherit from class B which inherits
from class C and so on. In this way, generalized templates for the site's
class structure may be defined.
The general form of a class definition consists of the keyword Class, followed
by the name of the class, and then optionally by the name of a ``parent'' class
whose characteristics it inherits. Following the name (and optionally parent class
name) are the attributes of the class, also within a \{ block \} as for nodepools, and
with lines and key/value pairs optionally delimited by ``$;$'' and ``$=$'', respectively.
See the sample \hyperref[fig:class.configuration]{below}.
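For instance, a derived class needs to restate only the attributes it overrides; the
fragment below is drawn from the sample configuration later in this section:
\begin{verbatim}
# A template ("abstract") class and two classes derived from it
Class fair-base {
   policy   = FAIR_SHARE
   nodepool = intel
   priority = 10
   weight   = 100
   abstract = true
}
Class high fair-base { weight = 200 }
Class high-p7 high   { nodepool = power }
\end{verbatim}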
The attributes defined for classes are:
\begin{description}
\item[abstract] If specified, this indicates this class is a template ONLY. It is used
as a model for other classes. Values are ``true'' or ``false''. The default is
``false''. This class is never passed to the scheduler and may not be referenced
by jobs.
\item[debug] FAIR\_SHARE only. This specifies the name of a class to substitute
for jobs submitted for debug. For example, if class {\em normal} specifies
\begin{verbatim}
debug = fixed
\end{verbatim}
then any job submitted to this class with debugging requested is actually scheduled
in class {\em fixed}. (For example, one probably does not want a debugging job
scheduled as FAIR\_SHARE and possibly preempted, preferring the non-preemptable
class {\em fixed}.)
\item[default] This specifies the class to be used as the default class for work submission
if no class is explicitly given. Only one class of type FAIR\_SHARE may contain this
designation, in which case it names the default FAIR\_SHARE class. Only one class of type
FIXED\_SHARE or RESERVE may contain this designation, in which case it names the default
class to use for reservations. (Note that either the FIXED\_SHARE or the RESERVE scheduling
policy is valid for reservations.)
\item[expand-by-doubling] FAIR\_SHARE only. If ``true'', and the {\em initialization-cap} is
set, then after any process has initialized, the job will expand to its maximum allowable
shares by doubling in size each scheduling cycle.
If not specified, the global value set in \hyperref[sec:ducc.properties]{ducc.properties} is used.
\item[initialization-cap] FAIR\_SHARE only. If specified, this is the largest number of processes this job
may be assigned until at least one process has successfully completed initialization.
If not specified, the global value set in \hyperref[sec:ducc.properties]{ducc.properties} is used.
\item[max-processes] FAIR\_SHARE and FIXED\_SHARE only. This is the largest number of
processes any single job in this class may be assigned.
Omit this property, or set it to 0, to disable the cap.
\item[prediction-fudge] FAIR\_SHARE only. When the scheduler is considering expanding the
number of processes for a job it tries to determine if the job may complete before those
processes are allocated and initialized. The {\em prediction-fudge} adds some amount of
time (in milliseconds) to the projected completion time. This allows installations to
prevent jobs from expanding when they would otherwise end in a few minutes
anyway.
If not specified, the global value set in \hyperref[sec:ducc.properties]{ducc.properties} is used.
\item[nodepool] Jobs for this class are assigned to nodes in this nodepool. The
value must be the name of one of the configured nodepools.
\item[policy] This is the scheduling policy, one of FAIR\_SHARE, FIXED\_SHARE, or RESERVE. This
attribute is required (there is no default).
\item[priority] This is the scheduling priority for jobs in this class.
\item[weight] FAIR\_SHARE only. This is the fair-share weight for jobs in this class.
\end{description}
The following figure illustrates a representative class configuration for a large cluster,
consisting of mixed Intel and Power nodes. This class definition assumes the
\hyperref[fig:nodepool.configuration]{nodepool configuration} shown above. FAIR\_SHARE,
FIXED\_SHARE, and RESERVE classes are defined over each machine architecture, Intel and Power,
and over the combined pool.
\begin{figure}[H]
\begin{verbatim}
# --------------------- Fair share definitions ---------------
Class fair-base {
policy = FAIR_SHARE
nodepool = intel
priority = 10
weight = 100
abstract = true
debug = fixed
}
Class nightly-test fair-base { weight = 100; nodepool nightly-test; priority = 7}
Class background fair-base { weight = 20 }
Class low fair-base { weight = 50 }
Class normal fair-base { weight = 100; default = true }
Class high fair-base { weight = 200 }
Class weekly fair-base { weight = 400 }
Class background-p7 background { nodepool = power }
Class low-p7 low { nodepool = power }
Class normal-p7 normal { nodepool = power }
Class high-p7 high { nodepool = power }
Class weekly-p7 weekly { nodepool = power }
Class background-all background { nodepool = --default-- }
Class low-all low { nodepool = --default-- }
Class normal-all normal { nodepool = --default-- }
Class high-all high { nodepool = --default-- }
Class weekly-all weekly { nodepool = --default-- }
# --------------------- Fixed share definitions ---------------
Class fixed-base {
policy = FIXED_SHARE
nodepool = intel
priority = 5
abstract = true
max-processes = 10
}
Class fixed fixed-base { }
Class fixed-p7 fixed-base { nodepool = power; default = true; }
Class JobDriver fixed-base { nodepool = jobdriver; priority = 0 }
# --------------------- Reserve definitions ---------------
Class reserve-base {
policy = RESERVE
nodepool = intel
priority = 1
abstract = true
}
Class reserve reserve-base { }
Class reserve-p7 reserve-base { nodepool = power }
Class timing-p7 reserve-base { nodepool = timing-p7 }
\end{verbatim}
\caption{Sample Class Configuration}
\label{fig:class.configuration}
\end{figure}
\subsection{Validation}
The administrative command, \hyperref[subsec:admin.check-ducc]{\em check\_ducc} may be used to
validate a configuration, with the {\em -c} and {\em -v} options. This reads the entire configuration and
nodefiles, validates consistency of the definitions, and ensures the nodepools do not overlap.
The \hyperref[subsec:admin.start-ducc]{\em start\_ducc} command always runs full validation, and if the
configuration is found to be incorrect, the cluster is not started.
Configuration checking is done internally by the DUCC java utility {\em
org.apache.uima.ducc.common.NodeConfiguration}. This utility contains a public
API as described in the Javadoc. It may be invoked from the command line as follows:
\paragraph{Usage:}
\begin{verbatim}
java org.apache.uima.ducc.common.NodeConfiguration [-p] [-v nodefile] configfile
\end{verbatim}
\paragraph{Options:}
\begin{description}
\item[$-p$] Pretty-print the compiled configuration to stdout. This illustrates
nodepool nesting, and shows the fully-completed scheduling classes after inheritance.
\item[$-v$ nodefile] This should be the master nodelist used to start DUCC. This
is assumed to be constructed to reflect the nodepool organization as
\hyperref[sec:admin-ducc.nodes]{described here}. If provided,
the nodepools are validated and checked for overlaps.
\item[configfile] This is the name of the file containing the configuration.
\end{description}