%
% Licensed to the Apache Software Foundation (ASF) under one
% or more contributor license agreements. See the NOTICE file
% distributed with this work for additional information
% regarding copyright ownership. The ASF licenses this file
% to you under the Apache License, Version 2.0 (the
% "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing,
% software distributed under the License is distributed on an
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
% KIND, either express or implied. See the License for the
% specific language governing permissions and limitations
% under the License.
%
% Create well-known link to this spot for HTML version
\section{Overview}
This chapter provides an overview of the DUCC process logs and how to interpret the
entries therein.
Each of the DUCC ``head node'' processes writes a detailed log of its operation to
the directory \ducchome/logs. The logs are managed by Apache log4j. All logs are
managed by a single log4j configuration file
\begin{verbatim}
$DUCC_HOME/resources/log4j.xml
\end{verbatim}
The DUCC logger is configured to check for updates to the log4j.xml
configuration file and automatically update without the need to restart any of
the DUCC processes. The update may take up to 60 seconds to take effect.
The DUCC logger is loaded and configured through the log4j API such that other
log4j configuration files that might be in the classpath are ignored. This also
means that log4j configuration files in the user's classpath will not interfere
with DUCC's logger.
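The automatic reload described above is standard log4j 1.x behavior. The following is
a minimal sketch (not DUCC source code) of how such a watched configuration is typically
wired with the log4j API; the 60-second delay here is an assumption matching the behavior
described above:
\begin{verbatim}
import org.apache.log4j.Logger;
import org.apache.log4j.xml.DOMConfigurator;

public class DuccLogBootstrap {
    public static void main(String[] args) {
        // Hypothetical path; DUCC resolves this under $DUCC_HOME/resources.
        String config = System.getenv("DUCC_HOME") + "/resources/log4j.xml";

        // Re-read the configuration roughly once a minute so that edits
        // take effect without restarting the process.
        DOMConfigurator.configureAndWatch(config, 60 * 1000L);

        Logger log = Logger.getLogger("org.apache.uima.ducc");
        log.info("Logger configured from " + config);
    }
}
\end{verbatim}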
The logs are set to roll after reaching a given size and the number of generations
is limited to prevent overrunning disk space. In general the log level is set to
provide sufficient diagnostic output to resolve most issues.
Each DUCC component writes its own log as defined in the following table:
\begin{tabular} {| l | l |}
\hline
Component & Log Name \\
\hline
\hline
Resource Manager & rm.log \\
\hline
Service Manager & sm.log \\
\hline
Orchestrator & or.log \\
\hline
Process Manager & pm.log \\
\hline
Web Server & ws.log \\
\hline
Agent & {\em [hostname].agent.log } \\
\hline
\end{tabular}
Because there may be many agents, the agent log is prefixed with the name of the host for
each running agent.
The log4j file may be customized for each installation to change the format or content of the
log files, according to the rules defined by log4j itself.
The general format of a log message is as follows:
\begin{verbatim}
Timestamp LOGLEVEL COMPONENT.sourceFileName method-name J[Jobid-or-NA] T[TID] text
\end{verbatim}
where
\begin{description}
\item[Timestamp] is the time of the occurrence. By default, the timestamp uses millisecond granularity.
\item[LOGLEVEL] This is one of the log4j log levels: INFO, ERROR, DEBUG, etc.
\item[COMPONENT] This identifies the DUCC component emitting the message. The components include
\begin{description}
\item[SM] Service Manager
\item[RM] Resource Manager
\item[PM] Process Manager
\item[OR] Orchestrator
\item[WS] Web Server
\item[Agent] Agent
\item[JD] Job Driver. These logs are written to the log directory specified in each
job submission.
\item[JobProcessComponent] Job process, also known as JP. These logs are written to the
log directory specified in each job submission.
\end{description}
\item[sourceFileName] This is the name of the Java source file from which the message is emitted.
\item[method-name] This is the name of the method in {\em sourceFileName} which emitted the message.
\item[{J[Workid-or-NA]}] This is the DUCC assigned id of the work being processed, when relevant. If the
message is not associated with work, this field shows ``N/A''. Some logs (such as JP and JD logs)
pertain ONLY to a specific job and do not contain this field.
\item[{T[TID]}] This is the ID of the thread emitting the message. Some logs (such as RM) do not use
this field so it is omitted.
\item[text] This is the component-specific text content of the message. Key messages are described
in detail in subsequent sections.
\end{description}
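As an illustration only (this is not part of DUCC), a line in this general format could be
split into its fields with a sketch such as the following. The timestamp in the sample line
is invented, and the pattern assumes both the J[...] and T[...] fields are present, which,
as noted above, is not true of every log:
\begin{verbatim}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineSketch {
    // Timestamp LOGLEVEL COMPONENT.sourceFileName method-name J[id] T[tid] text
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)\\s+(\\w+)\\s+([\\w.]+)\\s+(\\S+)\\s+J\\[([^\\]]+)\\]\\s+T\\[([^\\]]+)\\]\\s+(.*)$");

    public static void main(String[] args) {
        String line = "2015-05-29T10:32:04,014 INFO RM.RmJob initJobCap J[7483] T[29] sample text";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("level="      + m.group(2)
                             + " component=" + m.group(3)
                             + " method="    + m.group(4)
                             + " work="      + m.group(5));
        }
    }
}
\end{verbatim}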
\section{Resource Manager Log (rm.log)}
The RM log is designed to show all phases of resource scheduling. Much of the flow of a job can
be observed in this log alone. The following specific information is available and is explained in
more detail below:
\begin{itemize}
\item Bootstrap configuration
\item Node arrival and missed heartbeats
\item Node occupancy
\item Job arrival and status updates
\item Calculation of job caps
\item How-much - fair share
\item What-of - host assignment and preemption
\item Defragmentation
\item Internal schedule
\item Published schedule
\end{itemize}
Most useful messages are emitted under log level INFO, but a wealth of details can
be seen by increasing the log level to DEBUG or TRACE. To do so,
edit the file
{\em \ducchome/resources/log4j.xml}
and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza
similar to that shown here. Within about a minute the logger will
pick up the change and increase its log level.
\begin{verbatim}
<category name="org.apache.uima.ducc.rm" additivity="true">
  <priority value="debug"/>
  <appender-ref ref="rmlog" />
</category>
\end{verbatim}
\subsection{Bootstrap Configuration}
The RM summarizes its entire configuration when it starts up and prints it to the log to
provide context for subsequent data and as verification that the RM is configured in the
way it was thought to be. All the following are found in the bootstrap section and are mostly
self-explanatory:
\begin{itemize}
\item A pretty-print of the class configuration. This is the same as produced by the {\em check\_ducc -c -v}
command.
\item A summary of all classes, one per line. This is a more concise display and is similar to the
DUCC Classes page in the web server.
\item A listing of all RM configuration parameters and the environment including things such as the
version of Java, the operating system, etc.
\item Nodepool occupancy. As host names are parsed from the {\em ducc.nodes} files, the RM log
shows exactly which nodepool each node is added to.
\end{itemize}
The RM logs can wrap quickly under high load, in which case this information is lost.
The following are key RM log lines to search for when examining or verifying RM
initialization. (Parts of the message headers are removed here to shorten the
lines for publication.)
\paragraph{Initial RM start}
The first logged line of any RM start will contain the string {\em Starting component: resourceManager}:
\begin{verbatim}
RM.ResourceManagerComponent- N/A boot ... Starting Component: resourceManager
\end{verbatim}
\paragraph{RM Node and Class Configuration}
The first configuration lines show the reading and validation of the node and class configuration. Look
for the string {\em printNodepool} to find these lines:
\begin{verbatim}
RM.Config- N/A printNodepool Nodepool --default--
RM.Config- N/A printNodepool Search Order: 100
RM.Config- N/A printNodepool Node File: None
RM.Config- N/A printNodepool <None>
RM.Config- N/A printNodepool Classes: background low normal high normal-all nightly-test reserve
RM.Config- N/A printNodepool Subpools: jobdriver power intel
...
\end{verbatim}
\paragraph{RM Scheduling Configuration}
Next the RM reads its scheduling configuration and emits the resulting parameters. It also emits information
about its environment: the ActiveMQ broker, JVM information, OS information, DUCC version, etc. To find
this, search for the string {\em init Scheduler}.
\begin{verbatim}
init Scheduler running with share quantum : 15 GB
init reserved DRAM : 0 GB
init DRAM override : 0 GB
init scheduler : org.apache.uima.ducc.rm.scheduler.NodepoolScheduler
... (more lines) ...
init DUCC home : /home/challngr/ducc_runtime
init ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true
init JVM : Oracle Corporation 1.7.0_45
init JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre
init JVM Path : /users/challngr/jdk1.7.0_45/bin/java
init JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2099/jmxrmi
init OS Architecture : amd64
init OS Name : Linux
init DUCC Version : 2.0.0-beta
init RM Version : 2.0.0
\end{verbatim}
\paragraph{RM Begins to Schedule}
The next lines will show the nodes checking in and which nodepools they are assigned to. When the scheduler is
ready to accept Orchestrator requests you will see assignment of the JobDriver reservation. At this point
RM is fully operational. The confirmation of JobDriver assignment is similar to this:
\begin{verbatim}
Reserved:
ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst
R______7434 Job_Drive System JobDriver 1 1 1 0 2 0 0 0 1
\end{verbatim}
\subsection{Node Arrival and Missed Heartbeats}
\subsubsection{Node Arrival}
As each node ``checks in'' with the RM a line is printed with details about the node. Some fields
are redundant but are produced by different components processing the node arrival and thus serve
as confirmation that all parts are operating correctly.
A node arrival entry is of the form:
\begin{verbatim}
LOGHEADER Nodepool: power Host added: power : bluej290-18 shares 3 total 9: 48128 <none>
\end{verbatim}
where the fields mean (if the field isn't described here, the value is not relevant to node arrival):
\begin{description}
\item[LOGHEADER] is the log entry header as described above.
\item[Nodepool:power] The node is added to the ``power'' nodepool
\item[bluej290-18] This is the name of the node
\item[shares 3] The number of full shares supported on this machine.
\item[total 9] This is the total shares in the system after this node arrives.
\item[48128] This is the memory, in MB, on that host.
\end{description}
\subsubsection{Missed Heartbeats}
The DUCC Agents send out regular ``heartbeat'' messages with current node statistics. These
messages are used by RM to determine if a node has failed. If a heartbeat does not arrive
at the specified time this is noted in the log as a {\em missing heartbeat}. If a specific (configurable) number
of consecutive heartbeats is missed, the RM marks the node offline and instructs the
DUCC Orchestrator to purge the shares so they can be rescheduled.
A missed heartbeat log entry is of the form
\begin{verbatim}
[LOGHEADER] "*** Missed heartbeat ***" NODENAME count[NN]
\end{verbatim}
where the fields mean:
\begin{description}
\item[LOGHEADER] is the log entry header as described above.
\item[*** Missed heartbeat ***] Indicates this is a missing heartbeat message.
\item[NODENAME] This is the name of the (possibly) errant host.
\item[{count[NN]}] This is the number of CONSECUTIVE missing heartbeats.
\end{description}
Note that it is not unusual to miss the occasional heartbeat or two due to general network or system load.
As soon as a heartbeat is received the count is reset to 0.
If the number of missing heartbeats exceeds the value {\em ducc.rm.node.stability} configured in
{\em ducc.properties} the node is marked offline and this message is emitted:
\begin{verbatim}
HEADER "*** ! Notification of node death:" NODENAME
\end{verbatim}
If the node recovers and rejoins, the NodeArrives message as described above is emitted.
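The bookkeeping behind these messages can be pictured with a small, purely hypothetical
sketch such as the one below (this is not the RM implementation; the threshold parameter
corresponds to {\em ducc.rm.node.stability}):
\begin{verbatim}
public class HeartbeatWatcher {
    private final int stability;   // e.g. the ducc.rm.node.stability value
    private int missed = 0;
    private boolean online = true;

    public HeartbeatWatcher(int stability) { this.stability = stability; }

    // Called when the expected heartbeat interval elapses with no message.
    public void onMissedHeartbeat(String node) {
        missed++;
        System.out.println("*** Missed heartbeat *** " + node + " count[" + missed + "]");
        if (online && missed > stability) {
            online = false;
            System.out.println("*** ! Notification of node death: " + node);
        }
    }

    // Called whenever a heartbeat arrives; the consecutive-miss count resets to 0.
    public void onHeartbeat(String node) {
        missed = 0;
        online = true;
    }
}
\end{verbatim}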
\subsection{Node Occupancy}
{\em Node occupancy} describes, for each node, the capacity of the node, the work assigned to
that node, and the number of open shares on that node. The RM writes the node occupancy
to its log before assignment of every new schedule. The occupancy can be found under the log header line:
\begin{verbatim}
[LOGHEADER] Machine occupancy before schedule
\end{verbatim}
NOTE: The current node occupancy can be queried interactively with the
\hyperref[subsec:admin.rm-qoccupancy]{rm\_occupancy} command:
\begin{verbatim}
DUCC_HOME/admin/rm_qoccupancy
\end{verbatim}
Sample node occupancy as displayed in the log follows. The header is included in the log.
\begin{verbatim}
Name Order Active Shares Unused Shares Memory (MB) Jobs
--------------- ----- ------------- ------------- ----------- ------ ...
f1n2.bluej.net 16 16 0 255459 206710 206715 207878 206719 207900
f1n4.bluej.net 16 0 16 255459 <none>[16]
f7n2.bluej.net 16 0 16 255459 <none>[16]
f9n10.bluej.net 16 0 16 255459 <none>[16]
f6n1.bluej.net 16 0 16 255459 <none>[16]
f7n1.bluej.net 16 3 13 255459 203408 [13]
f7n3.bluej.net 16 16 0 255459 206716 207904 206720 206717 206718
f4n10.bluej.net 16 15 1 255459 209155 208975 209153 209155 [1]
f7n5.bluej.net 16 16 0 255459 208960
f1n3.bluej.net 16 16 0 255459 205608 206695 207906 206693 206693
f1n1.bluej.net 16 3 13 255459 208913 [13]
f6n10.bluej.net 16 3 13 255459 208977 [13]
f6n7.bluej.net 16 0 16 255459 <none> [16]
f7n6.bluej.net 16 15 1 255459 209155 209151 206701 209155 [1]
\end{verbatim}
The meaning of each column is:
\begin{description}
\item[Name] The host name.
\item[Order] This is the share order of the node. The number represents the number of quantum shares
that can be scheduled on this node. (Recall that an actual process may and usually does
occupy multiple quantum shares.)
\item[Active Shares] This is the number of quantum shares on the node which are scheduled
for work.
\item[Unused Shares] This is the number of quantum shares available for new work.
\item[Memory] This is the real memory capacity of the node, as reported by the node's
Agent process.
\item[Jobs] Each entry here is the DUCC-assigned id of a job with a process assigned to
this node. Each entry corresponds to one process. If an ID appears more than
once the job has more than one process assigned to the node; see for example, the
node {\bf f1n3.bluej.net} with multiple entries for job {\em 206693}.
When no work is assigned to the node, the string {\bf $<$none$>$} is displayed.
When there is a number in brackets, e.g. {\bf [13]} for node {\bf f7n1.bluej.net}, the
number represents the number of quantum shares available to be scheduled on the node.
\end{description}
\subsection{Job Arrival and Status Updates}
\paragraph{Orchestrator State Arrival}
On a regular basis the Orchestrator publishes the full state of
work which may require resources. This is the prime input to the
RM's scheduler and must arrive on a regular basis. Every arrival
of an Orchestrator publication is flagged in the log as follows.
If these aren't observed every
\hyperref[itm:props-or.state.publish.rate]{Orchestrator publish
interval} something is wrong; most likely the Orchestrator or
the ActiveMQ broker has a problem.
\begin{verbatim}
RM.ResourceManagerComponent- N/A onJobManagerStateUpdate -------> OR state arrives
\end{verbatim}
\paragraph{Job State}
Immediately after the OR state arrival is logged, the state of all work needing scheduling
is logged. This work is tracked by the {\em JobManagerConverter} module in the
RM and is logged similarly to the following. The log shows each piece of work
of interest and, if its state has changed since the last publication, the new state.
\begin{verbatim}
...
RM.JobManagerConverter- 7433 eventArrives Received non-schedulable job, state = Completed
RM.JobManagerConverter- 7434 eventArrives [SPR] State: WaitingForResources -> Assigned
...
\end{verbatim}
\subsection{Calculation Of Job Caps}
Prior to every schedule, and immediately after receipt of the Orchestrator state,
the RM examines every piece of work and calculates the maximum level of resources the
job can physically use at the moment. This handles the {\em expand-by-doubling}
function, the {\em prediction} function, and accounts for the amount of work left
relative to the resources the work already possesses.
The curious or intrepid can see the code that implements this in {\em RmJob.java} method
{\em initJobCap()}.
The calculation is done in two steps:
\begin{enumerate}
\item Calculate the projected cap. This uses the prediction logic and amount of
work remaining to
calculate the {\em largest} number of resources the job can use, if the system
had unlimited resources. This is an upper bound on the actual resources
assigned to the job.
\item Adjust the cap down using expand-by-doubling and the initialization state of
      the work. The result of this step is always a number {\em smaller than or equal to}
      the projected cap.
\end{enumerate}
The goal of this step is to calculate the largest number of resources the job can
actually use at the moment. The FAIR\_SHARE calculations may further revise this
down, but will never revise it up.
If there is no data yet on the initialization state of work, the projected cap cannot
be calculated and a line such as the following is emitted:
\begin{verbatim}
RM.RmJob - 7483 getPrjCap Hilaria Cannot predict cap: init_wait true || time_per_item 0.0
\end{verbatim}
If the job has completed initialization the projected cap is calculated based on
the average initialization time of all the job processes and the current rate of
work-item completion. A line such as this is emitted:
\begin{verbatim}
RM.RmJob- 7483 Hilaria O 2 T 58626 NTh 28 TI 18626 TR 12469.0 R 2.2456e-03 QR 1868 \
P 132 F 1736 ST 1433260775524 return 434
\end{verbatim}
In this particular line:
\begin{description}
\item[7483] is the job id
\item[Hilaria] is the job's owner (userid)
\item[O 2] This says the job is an {\em order 2} job: each process will occupy two quantum shares.
\item[T 58626] is the smallest number of milliseconds until a new process for this job
can be made runnable, based on the average initialization time for processes in
this job, the Orchestrator publish rate, and the
\hyperref[itm:props-rm.prediction.fudge]{\em RM prediction fudge.}
\item[Nth] This is the number of threads currently executing for this job. It is
calculated as the (number of currently allocated processes) * (the number of threads
per process).
\item[TI] This is the average initialization time in milliseconds for processes in this job.
\item[TR] This is the average execution time in milliseconds for work items in this job.
\item[R] This is the current rate at which the job is completing work items,
calculated as (Nth / TR).
\item[QR] This is the number of work items (questions) remaining to be executed.
\item[P] This is the projected number of questions that can be completed
in the time from ``now'' until a new process can be started and initialized
(in this case 58626 milliseconds from now, see above), with the currently
allocated resources, calculated as (T * R).
\item[F] This is the number of questions that will remain unanswered at the
end of the target (T) period, calculated as (QR - P).
\item[ST] This is the time the job was submitted.
\item[return] This is the projected cap, the largest number of processes this
job can physically use, calculated as (F / threads-per-process).
If the returned projected cap is 0, it is adjusted up to the number of
processes currently allocated.
\end{description}
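To make the arithmetic concrete, the sample line above can be replayed as follows. This is
a sketch only; the value of 4 threads per process is an assumption inferred from the numbers,
since the per-process thread count does not appear in the line itself:
\begin{verbatim}
public class ProjectedCapExample {
    public static void main(String[] args) {
        // Values taken from the sample log line for job 7483.
        double T   = 58626;     // ms until a new process could become runnable
        double nTh = 28;        // currently executing threads
        double TR  = 12469.0;   // average ms per work item
        double QR  = 1868;      // work items (questions) remaining
        int threadsPerProcess = 4;   // assumption: 28 threads over 7 processes

        double R = nTh / TR;    // completion rate: ~2.2456e-03 items/ms
        double P = T * R;       // items finished before a new process could help: ~132
        double F = QR - P;      // items still remaining at that point: ~1736
        int projectedCap = (int) (F / threadsPerProcess);   // ~434 processes

        System.out.printf("R=%.4e P=%.0f F=%.0f cap=%d%n", R, P, F, projectedCap);
    }
}
\end{verbatim}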
Once the projected cap is calculated a final check is made to avoid several problems:
\begin{itemize}
\item Preemption of processes that contain active work but are not using all their
threads. This occurs when a job is ``winding down'' and may have more
processes than it technically needs, but all processes still are performing work.
\item The job may have declared a maximum number of processes to allocate, which is
less than the number it could otherwise be awarded.
\item If prediction is being used, revise the estimate down to the smaller
of the projected cap and the resources currently allocated.
\item If initialization caps are being applied and no process in the job has
successfully initialized, revise the estimate down to the initialization cap.
\item If expand-by-doubling is being used, potentially revise the estimate down
to no more than double the currently allocated processes.
\end{itemize}
The final cap is emitted in a line such as:
\begin{verbatim}
RM.RmJob- 7483 initJobCap Hilaria O 2 Base cap: 7 Expected future cap: 434 potential cap 7 actual cap 7
\end{verbatim}
In this line:
\begin{description}
\item[7483] is the job id.
\item[Hilaria] is the job's user name.
\item[O 2] indicates this job uses two quantum shares per process.
\item[Base cap:] This is an upper-bound on the number of processes
that can be used in a perfect world. It is calculated by
dividing the number of questions by the number of threads per
process. It is then revised down by the declared max-processes
in the job. In the example above, the job declared
max-processes of 7.
\item[Expected future cap] This is the projected cap, described above.
\item[Potential cap] This is the base cap, possibly revised downward
  by the future cap, if it is projected that fewer processes
  would be useful.
\item[Actual cap] This is the assigned maximum processes to be
scheduled for this job, possibly adjusted based on the
initialization status of the job and the expand-by-doubling policy.
\end{description}
The {\em actual cap} is the one used to calculate the job's FAIR\_SHARE and
is always the largest number of processes usable in a perfect world. Note
that the FAIR\_SHARE calculation may result in further reduction of this
number.
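One plausible reading of how these four numbers relate is sketched below. This is
hypothetical code, not the RM source; the question count and the 4 threads per process
are stand-ins taken from the discussion above, and the initialization-cap adjustment
is omitted:
\begin{verbatim}
public class JobCapSketch {
    static int actualCap(int nQuestions, int threadsPerProcess, int declaredMaxProcesses,
                         int futureCap, int currentProcesses) {
        // Base cap: processes usable in a perfect world, limited by declared max-processes.
        int baseCap = Math.min((nQuestions + threadsPerProcess - 1) / threadsPerProcess,
                               declaredMaxProcesses);
        // Potential cap: the base cap, possibly revised down by the projected (future) cap.
        int potentialCap = Math.min(baseCap, futureCap);
        // Actual cap: only the expand-by-doubling adjustment is sketched here; the RM
        // also considers the initialization status of the job.
        return Math.min(potentialCap, Math.max(2 * currentProcesses, 1));
    }

    public static void main(String[] args) {
        // Echoes the sample for job 7483: base cap 7, future cap 434,
        // potential cap 7, actual cap 7.
        System.out.println(actualCap(1868, 4, 7, 434, 7));
    }
}
\end{verbatim}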
\subsection{The ``how much'' calculations}
The RM log includes a section that details the fair-share calculations. The details
of this are rather involved and out-of-scope for this section. Interested parties
are welcome to read the scheduler source, in the file {\em NodePoolScheduler.java},
methods {\em countClassShares, countJobShares, countUserShares}, and {\em apportion\_qshares}.
The logs reveal the inputs to each of the methods above. The overall logic is as follows
and can be followed in the logs.
\begin{itemize}
\item All job classes of equal priority are bundled together and handed to the
{\em countClassShares} method. This method assigns some number of shares
to each class based on the weighted fair-share logic, using the configured
class weights. The start of this can be seen under log lines similar to this:
\begin{verbatim}
INFO RM.NodepoolScheduler- N/A apportion_qshares countClassShares RmCounter Start
\end{verbatim}
\item All users for each class are passed to the {\em countUserShares} method
and then assigned some number of shares from the
pool of shares assigned to the class, again
using the fair-share computations, but with equal weights.
The start of this can be seen under log lines similar to this:
\begin{verbatim}
INFO RM.NodepoolScheduler- N/A apportion_qshares countJobShares RmCounter Start
\end{verbatim}
\item All jobs for each user are passed to the {\em countJobShares} method
and assigned some number of shares from the pool assigned to the user, using
the fair-share calculator with equal weights.
The start of this can be seen under log lines similar to this:
\begin{verbatim}
INFO RM.NodepoolScheduler- N/A apportion_qshares countUserShares RmCounter Start
\end{verbatim}
\item The method {\em apportion\_qshares} is the common fair-share
calculator, used by the three routines above.
\end{itemize}
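As a purely illustrative sketch of weighted fair-share apportioning (this is not the DUCC
algorithm, which also handles nodepools, caps, and rounding of fractional shares), a pool of
quantum shares could be divided among weighted classes like this; the class names and
weights are hypothetical:
\begin{verbatim}
import java.util.LinkedHashMap;
import java.util.Map;

public class FairShareSketch {
    // Divide totalShares among entities in proportion to their weights.
    static Map<String, Integer> apportion(Map<String, Integer> weights, int totalShares) {
        int weightSum = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            result.put(e.getKey(), totalShares * e.getValue() / weightSum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> classWeights = new LinkedHashMap<>();
        classWeights.put("normal", 100);
        classWeights.put("low",     50);
        classWeights.put("high",   200);
        // With 70 quantum shares available: normal gets 20, low 10, high 40.
        System.out.println(apportion(classWeights, 70));
    }
}
\end{verbatim}
For users within a class, and again for jobs within a user, the same kind of calculation
is applied with equal weights, as described above.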
\subsection{The ``what of'' calculations}
These calculations are also too involved to discuss in detail for this section.
Interested parties may look in {\em NodePoolScheduler.java}, method
{\em whatOfFairShare}, and {\em NodePool.java} method {\em traverseNodepoolsForExpansion}
to see details.
The logs track the general flow through the methods above and generally contain
enough information to diagnose problems should they arise.
The key log message here, other than those sketching logic flow, shows the
assignment of specific processes to jobs as seen below.
\begin{verbatim}
RM.NodePool- 7483 connectShare share bluej290-12.461 order 2 machine \
bluej290-12 false 2 0 2 31744 <none>
\end{verbatim}
This shows job {\em 7483} being assigned a process on host {\em bluej290-12} as
RM share id {\em 461}, which consists of {\em 2 quantum shares} (order 2). Host
bluej290-12 is a 32GB machine with {\em 31744} MB of usable, schedulable memory.
\subsection{Defragmentation}
The RM considers the system's memory pool to be fragmented if the counted
resources from the ``how much'' phase of scheduling cannot be fully
mapped to real physical resources in the ``what of'' phase. In short, the
``how much'' phase assumes an ideal, unfragmented virtual cluster. The ``what of''
phase may be unable to make the necessary physical assignments without excessive
preemption of jobs that are legitimately at or below their fair share allocations.
Intuitively, the ``how much'' phase guarantees that if you could do unlimited
shuffling around of the allocated resources, everything would ``fit''. The
system is considered fragmented if such shuffling is actually needed. The
defragmentation process attempts that shuffling, with the constraint of
interrupting the smallest amount of productive work possible.
One scheduling goal, however, is to attempt to guarantee that every job gets
at least some minimal number of its fairly-counted processes. This minimal number
is called the
\hyperref[itm:props-rm.defragmentation.threshold]{defragmentation threshold} and
is configured in ducc.properties. This threshold is used to rigorously define
the ``smallest amount of productive work'' mentioned in the previous paragraph.
The defragmentation threshold is used in
two ways:
\begin{enumerate}
\item Attempt to get every work request resources allocated at least up
to the level of the defragmentation threshold.
\item Never steal resources beyond the defragmentation threshold during
the ``take from the rich'' phase of defragmentation, described below.
\end{enumerate}
To accomplish this, a final stage, ``defragmentation'', is
performed before publishing the new schedule to the Orchestrator
for deployment.
Defragmentation consists of several steps. The details are again involved,
but an understanding of the logic will make following the log relatively
straightforward.
\begin{itemize}
\item Examine every job and determine whether it was assigned
all the processes from the ``how much'' phase. If not, it is marked
as POTENTIALLY NEEDY.
This step is logged with the tag {\em detectFragmentation}.
\item Examine every POTENTIALLY NEEDY job to determine if there are
sufficient preemptions pending such that the ``how much'' phase will be able
to complete as soon as the preemptions complete. If not, the
job is marked ACTUALLY NEEDY.
This step is also logged with the tag {\em detectFragmentation}.
\item For every job marked ACTUALLY NEEDY, examine all jobs in the
system already assigned shares to determine which ones can
donate some resources to the ACTUALLY NEEDY jobs. These are typically
jobs with more processes than their FAIR SHARE, but which, in a
perfect, unfragmented layout, would be allocated more resources. These
jobs are called {\em rich} jobs.
This step is logged with the tags {\em insureFullEviction} and
{\em doFinalEvictions}.
\item Attempt to match allocations from ``rich'' jobs with jobs that
are ACTUALLY NEEDY. If the ACTUALLY NEEDY job is able to use
one of the ``rich job'' allocations, the allocation is scheduled for
preemption. (Note there are many reasons that a rich job may not
have appropriate resources to donate: mismatched nodepool, physical
host too small, not preemptable, etc.).
This step is logged with the tag {\em takeFromTheRich}. If this
step has any successes, the log will also show lines with the
tags {\em clearShare} and {\em shrinkByOne} as the resources
are scheduled for reuse.
\item The needy job is placed in a list of jobs which are given the
highest priority for assignment of new processes, at the start of each
subsequent scheduling cycle, until such time
as they are no longer needy.
This step is logged with the tag {\em Expand needy}.
\end{itemize}
Those who wish to see the details of defragmentation can find them in
{\em NodepoolScheduler.java}, starting with the method {\em detectFragmentation}
and tracing the flows from there.
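For orientation only, the needy-detection steps can be pictured with a hypothetical
sketch such as the following; it is not the NodepoolScheduler code, and it omits the
defragmentation threshold and the matching of rich donors to needy jobs:
\begin{verbatim}
import java.util.ArrayList;
import java.util.List;

public class DefragSketch {
    static class Job {
        final String id;
        final int counted;              // processes counted in the "how much" phase
        final int assigned;             // processes actually placed in the "what of" phase
        final int pendingPreemptions;   // evictions already scheduled that this job could use
        Job(String id, int counted, int assigned, int pendingPreemptions) {
            this.id = id; this.counted = counted;
            this.assigned = assigned; this.pendingPreemptions = pendingPreemptions;
        }
    }

    // A job is POTENTIALLY NEEDY if it did not receive its full count, and
    // ACTUALLY NEEDY if pending preemptions will not make up the difference.
    static List<Job> detectActuallyNeedy(List<Job> jobs) {
        List<Job> needy = new ArrayList<>();
        for (Job j : jobs) {
            boolean potentiallyNeedy = j.assigned < j.counted;
            boolean coveredByPreemptions = j.assigned + j.pendingPreemptions >= j.counted;
            if (potentiallyNeedy && !coveredByPreemptions) {
                needy.add(j);   // candidate for take-from-the-rich
            }
        }
        return needy;
    }
}
\end{verbatim}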
\subsection{Published Schedule}
The schedule gets printed to the log twice on every scheduling cycle. The first
form is a pretty-printed summary of all known jobs, showing which ones are
getting more resources, {\em expanding}, those which are losing resources,
{\em shrinking}, and those which are not changing, {\em stable.}
The second form is a {\em toString()} of the structure sent to the Orchestrator,
showing the exact resources currently assigned, added, or lost this cycle.
\paragraph{The pretty-printed schedule}
This entry is divided into five sections. Each section contains one line for
each relevant job, with largely self-explanatory headers. An example follows
(wrapped here so it fits within a printed page):
\begin{verbatim}
ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst
J____7485 mega-2 bob low 7 2 14 4 24 11510 11495 false 7
J____7486 mega-1 mary normal 93 2 186 4 28 14768 14764 false 93
\end{verbatim}
Here,
\begin{description}
\item[ID] is the unique DUCC ID of the work, prefixed with an indication of what kind of
work it is: Job (J), a Service (S), a Reservation (R), or Managed Reservation (M).
\item[JobName] is the user-supplied name / description of the job.
\item[User] is the owner of the work.
\item[Class] is the scheduling class used to schedule the work.
\item[Shares] is the number of allocations awarded, which might be processes, or simply reserved space. It
  is a human-readable convenience, calculated as (QShares / Order).
\item[Order] is the number of share quanta per allocation.
\item[QShares] is the total quantum shares awarded to the work.
\item[Nth] is the declared number of threads per process.
\item[Memory] is the amount of memory in GB for each allocation.
\item[nQuest] is the number of work items (questions) for the job, where relevant.
\item[Ques Rem] is the number of work items not yet completed.
\item[InitWait] is either {\em true} or {\em false}, indicating whether at least one process
has successfully completed initialization.
\item[Max P/Nst] is the job-declared maximum processes / instances for the job.
\end{description}
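For the first sample line above, the relationships among these columns can be checked
directly, assuming the 15 GB share quantum shown in the bootstrap section:
\begin{verbatim}
Order   = ceil(Memory / quantum) = ceil(24 / 15) = 2
QShares = Shares * Order         = 7 * 2         = 14
\end{verbatim}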
The five subsections of this log section are:
\begin{description}
\item[Expanded] This is the list of all work that is receiving more resources this cycle.
\item[Shrunken] This is the list of work that is losing resources this cycle.
\item[Stable] This is the list of work whose assigned resources do not change this cycle.
\item[Dormant] This is the list of work that is unable to receive any resources this cycle.
\item[Reserved] This is the list of reservations.
\end{description}
\paragraph{The Orchestrator Structure}
This is a list containing up to four lines per scheduled work.
The specific resources shown here are formatted thus:
\begin{verbatim}
hostname.RM share id^Initialization time
\end{verbatim}
The {\em hostname} is the name of the host where the resource is assigned. The {\em RM Share}
is the unique (to RM only) id of the share assigned to this resource. The {\em Initialization time}
is the amount of time spent by the process residing within this resource in its initialization phase.
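As an illustration only (not DUCC code), such a token can be pulled apart as follows:
\begin{verbatim}
public class ResourceTokenSketch {
    public static void main(String[] args) {
        String token = "bluej290-11.465^19430";   // hostname.RMshareId^initMillis
        int caret = token.lastIndexOf('^');
        int dot   = token.lastIndexOf('.', caret);
        String host  = token.substring(0, dot);
        String share = token.substring(dot + 1, caret);
        long initMs  = Long.parseLong(token.substring(caret + 1));
        System.out.println(host + " share=" + share + " init=" + initMs + "ms");
    }
}
\end{verbatim}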
The lines are:
\begin{enumerate}
\item The type of work and its DUCC ID, for example:
\begin{verbatim}
Reservation 7438
\end{verbatim}
\item The complete set of all resources currently assigned to the work, for example:
\begin{verbatim}
Existing[1]: bluej537-7-73.1^0
\end{verbatim}
The resources here include all resources the RM tracks as being owned by the job, including
older resources, newly assigned resources, and resources scheduled for eviction. The specific
resources which are being added or removed are shown in the next lines.
\item The complete set of resources the RM has scheduled for eviction, but which are not
yet confirmed freed. For example, here 7 resources are scheduled for eviction:
\begin{verbatim}
Removals[7]: bluej290-11.465^19430 bluej290-12.461^11802 bluej290-4.460^12672 \
bluej290-5.464^23004 bluej290-2.467^22909 bluej290-7.463^20636 \
bluej290-6.466^19931
\end{verbatim}
\item The complete set of resources which are being added to the work in this cycle. For
example:
\begin{verbatim}
Additions[4]: bluej291-43.560^0 bluej291-42.543^0 bluej290-23.544^0 bluej291-44.559^0
\end{verbatim}
\end{enumerate}
In most cases, if resources cannot be awarded, this section also shows the reason
string which is published for the benefit of the web server and the Orchestrator's job monitor:
\begin{verbatim}
Job 7487 Waiting for defragmentation.
Existing[0]:
Additions[0]:
Removals[0]:
\end{verbatim}
In some cases, a job may show BOTH Additions and Removals. This usually
occurs as a result of the defragmentation step. The job will have been found in need of
new resources during the initial fair-share computation, but later, during defragmentation,
it is also found to be a ``rich'' job that must donate resources to under-allocated work.
Not all of the ``rich'' job's processes may be appropriate for the needy job,
in which case the rich job is allowed to expand even as it donates some resources to the
under-allocated work.
This can also occur if resources were previously preempted and, for some reason, the
preemption is taking a long time. If other resources have since been freed, the job
can now re-expand. It is not possible to reverse a preemption (because the actual
state of the preemption is not knowable), so both expansion and shrinkage can be
in progress for the same job.
\section{Service Manager Log (sm.log)}
The service manager log shows the events involved in managing services. These events include
\begin{itemize}
\item Bootstrap configuration
\item Receipt and analysis of Orchestrator state
\item CLI requests: register, modify, start, stop, unregister, etc.
\item Dispatching / startup of service instances
\item Progression of Service state
\item Starting and logging of pingers
\end{itemize}
To enable finer-grained messages, edit the file
{\em \ducchome/resources/log4j.xml}
and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza
similar to that shown here. Within about a minute the logger will
pick up the change and increase its log level.
\begin{verbatim}
<category name="org.apache.uima.ducc.sm" additivity="true">
<priority value="debug"/>
<appender-ref ref="smlog" />
</category>
\end{verbatim}
\subsection{Bootstrap configuration}
\paragraph{Initial SM start}
The first logged line of any SM start will contain the string {\em Starting component: serviceManager}:
\begin{verbatim}
SM.ServiceManagerComponent - N/A boot ... Starting Component: serviceManager
\end{verbatim}
This is followed by a summary of the environment in which the Service Manager is running, including
configuration data, information about the JRE and about hosting hardware.
\begin{verbatim}
Service Manager starting:
DUCC home : /home/challngr/ducc_runtime
ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true
JVM : Oracle Corporation 1.7.0_45
JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre
JVM Path : /users/challngr/jdk1.7.0_45/bin/java
JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2102/jmxrmi
OS Architecture : amd64
Crypto enabled : true
Test mode enabled : true
Service ping rate : 15000
Service ping timeout : 15000
Service ping stability : 10
Default ping class : org.apache.uima.ducc.cli.UimaAsPing
Init Failure Max : 1
Instance Failure Max : 3
Instance Failure Window : 10
DUCC Version : 2.0.0-beta
SM Version : 2.0.0
\end{verbatim}
\paragraph{Initialize the service registry}
Following this are entries showing the reading and internal initialization of the
service registry, for example:
\begin{verbatim}
SM.ServiceHandler - 411 ServiceStateHandler.registerService \
adding UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 411
\end{verbatim}
\paragraph{Wait for the Resource Manager and Orchestrator to become ready}
The Service Manager waits until the Resource Manager and Orchestrator are ready. This usually
results in lines similar to the following, which are normal and expected. If these lines
continue for more than a few minutes it is possible that some part of DUCC has not correctly
started:
\begin{verbatim}
SM.ServiceManagerComponent - N/A orchestratorStateArrives \
Orchestrator JD node not assigned, ignoring Orchestrator state update.
\end{verbatim}
\paragraph{Begin work}
When the Service Manager is ready for business, lines similar to the following are emitted on
a regular basis:
\begin{verbatim}
INFO SM.ServiceManagerComponent - N/A processIncoming ===== Orchestrator State Arrives =====
\end{verbatim}
\subsection{Receipt and analysis of Orchestrator State}
\subsection{CLI Requests}
CLI Requests are all logged at log level INFO. Initial receipt of the command
is tagged with the string {\em ServiceManagerComponent} and contains the name of
the command issued. Information pertinent to the command issued is logged.
For example, this shows registration of a new service.
\begin{verbatim}
INFO SM.ServiceManagerComponent - 428 register ServiceRegisterEvent \
[ninstances=2, autostart=Unset, \
endpoint=UIMA-AS:FixedSleepAE_5:tcp://bluej537:61617, user=challngr]
\end{verbatim}
Some commands require service ownership or administrative authority. Here we show
a service being started, and the authorization check being made.
\begin{verbatim}
INFO SM.ServiceManagerComponent - N/A start Starting service ServiceStartEvent \
[friendly=430, user=challngr, instances=-1]
INFO SM.ServiceHandler - 430 authorized start request from challngr allowed.
\end{verbatim}
\subsection{Dispatching / Startup of Service Instances}
The full set of startup messages is logged when an instance is dispatched to DUCC to
be scheduled and started. These starts are usually preceded by a message or two
indicating what triggered the start. For instance, this shows service 427 being
started because it was referenced by job 7676:
\begin{verbatim}
INFO SM.ServiceSet - 427 reference Reference start requested by 7676
INFO SM.ServiceSet - 427 reference References job/service 7676 count[1] implementors [0]
INFO SM.ServiceSet - 427 reference Reference starting new service instances.
\end{verbatim}
This is followed by a line indicating how many instances are to be started:
\begin{verbatim}
INFO SM.ServiceSet - 427 start Starting instance. Current count 0 needed 2
\end{verbatim}
Shortly thereafter the {\em stdout} messages from the submission to the
DUCC Orchestrator are shown. These are tagged with the keyword {\em ServiceInstance}.
\begin{verbatim}
INFO SM.ServiceInstance - 427 start START INSTANCE
INFO SM.ServiceInstance - 427 start Start stdout: 050 ducc_ling Version 1.1.2 \
compiled Aug 4 2014 at 06:45:31
\end{verbatim}
A few lines later the DUCC-assigned ID of the instance is shown, in
this case, ID 7677, for service 427.
\begin{verbatim}
INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted
INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677
\end{verbatim}
The next lines show the environment used for the service submission.
\begin{verbatim}
INFO SM.ServiceInstance - 427 start Start stdout: 1104 Running with user and group: \
id 2087 gid 2001 eid 2087 egid 2001
INFO SM.ServiceInstance - 427 start Start stdout: 300 Bypassing redirect of log.
INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CORE soft[1024] hard[-1]
INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CPU soft[-1] hard[-1]
INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: DATA soft[-1] hard[-1]
... (more environment) ...
INFO SM.ServiceInstance - 427 start Start stdout: Environ[0] = DUCC_SERVICE_INSTANCE=0
INFO SM.ServiceInstance - 427 start Start stdout: Environ[1] = JAVA_HOME=/opt/ibm-jdk1.7
INFO SM.ServiceInstance - 427 start Start stdout: Environ[2] = LESSCLOSE=lessclose.sh %s %s
INFO SM.ServiceInstance - 427 start Start stdout: Environ[3] = ENV=/etc/bash.bashrc
\end{verbatim}
Next the exact command line executed to submit the instance is shown. If the submission
is successful, look for {\em START INSTANCE COMPLETE}. The exact progression of
the scheduling and deployment of an
instance is not shown in this log. If it is desired to observe that, look in
the RM log, searching on the service instance id. In the examples shown here, that
id is 7677.
\begin{verbatim}
INFO SM.ServiceInstance - 427 start Start stdout: 1000 Command to exec: /opt/ibm-jdk1.7/bin/java
INFO SM.ServiceInstance - 427 start Start stdout: arg[1]: -cp
INFO SM.ServiceInstance - 427 start <INHIBITED CP>
INFO SM.ServiceInstance - 427 start <INHIBITED CP>
INFO SM.ServiceInstance - 427 start Start stdout: arg[4]: --specification
INFO SM.ServiceInstance - 427 start Start stdout: arg[5]: /home/challngr/ducc_runtime/state/services/427.svc
INFO SM.ServiceInstance - 427 start Start stdout: arg[6]: --service_id
INFO SM.ServiceInstance - 427 start Start stdout: arg[7]: 427
INFO SM.ServiceInstance - 427 start Start stdout: 1001 Command launching...
INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted
INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677
INFO SM.ServiceInstance - 427 start START INSTANCE COMPLETE
\end{verbatim}
The state progression of the service as it starts is then available, as described in the next section.
\subsection{Progression of Service State}
The state of each service as it starts and stops is tracked at log level INFO. To find the state progression
for any particular service search on {\em service-id setState} where service-id is the ID of the service.
Here we show the progression for service 427 as it is started and progresses to fully functional (there are
other lines logged between these of course). Note that if a service has multiple instances defined, the
overall service state is determined by an aggregate of the states of the individual instances. In these
messages, the ``Inst'' field of each message shows the DUCC ID of the most recently updated service instance
that triggered the overall state change.
\begin{verbatim}
SM.ServiceSet - 427 setState State update \
from[Stopped] to[Starting] via[Starting] Inst[7677/Received]
SM.ServiceSet - 427 setState State update \
from[Starting] to[Initializing] via[Initializing] Inst[7677/Initializing]
SM.ServiceSet - 427 setState State update \
from[Initializing] to[Waiting] via[Waiting] Inst[7677/Running]
SM.ServiceSet - 427 setState State update \
from[Waiting] to[Available] via[Available] Inst[7677/Running]
\end{verbatim}
\subsection{Starting and Logging Pingers}
When a pinger is started (or restarted) the event is logged and tagged
with the string {\em service-id startPingThread}. Following are a few
lines stating the name of the pinger and the Java class used to
implement the pinger. For example:
\begin{verbatim}
INFO SM.ServiceSet - 430 startPingThread Starting service monitor.
INFO SM.PingDriver - 430 find RegisteredPinger \
Loading site-registered service monitor from org.apache.uima.ducc.cli.UimaAsPing
INFO SM.PingDriver - 430 <ctr> Using ping class org.apache.uima.ducc.cli.UimaAsPing
\end{verbatim}
If the pingers do not declare loggers, their {\em stdout} and {\em stderr} are
captured in the logs, under the tag {\em handleStatistics}. As well, every
ping is recorded with its ``info'' string, making it possible to see the
exact state of the pinger. For example:
\begin{verbatim}
INFO SM.PingDriver - 411 handleStatistics Ping ok: UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 \
Alive[true] Healthy[true] + Info: QDEPTH[0] AveNQ[0] Consum[26] Prod[0] \
minNQ[0] maxNQ[0] expCnt[0] inFlt[0] DQ[0] NQ[0] NDisp[0] \
MetaNode[192.168.4.36] MetaPid[8892:67]
\end{verbatim}
Pingers always return state to the Service Manager and some of that state
affects the SM's operation; in particular, pingers can start and stop
specific service instances or change the autostart setting for a
service. They also must return success and failure status to the SM.
The ping state is logged under the tag {\em service-id signalRebalance}
as seen below. If {\em Additions:} or {\em Deletions:} is non-zero, you
can expect to see SM automatically start or stop specific instances for
the service.
\begin{verbatim}
INFO SM.ServiceSet - 430 signalRebalance PING: Additions: 0 deletions: 0 \
excessive failures: false implementors 1 references 0
\end{verbatim}
\subsection{Publishing State}
By default the SM log does not include the state as published to the
Orchestrator because it can be voluminous and cause the logs to
become cluttered and to wrap too fast. To see it, increase
the log level to DEBUG as described at the start of this section.
The published state will be emitted with lines similar to the following. Most
entries will show ``Available'', which means either they do not depend on
the Service Manager, or the service they depend on is in state Available. If there
is some exceptional condition pertaining to a job, that is shown. In the
sample below, Job 251214 is waiting for a specific service whose state itself
is ``waiting''.
\begin{verbatim}
INFO SM.ServiceHandler - 251261 handleModifiedobs No service dependencies, no updates made.
INFO SM.ServiceManagerComponent - N/A publish Publishing State, active job count = 102
INFO SM.ServiceManagerComponent - N/A publish Service Map
Job 251159 Service state Available
Job 251263 Service state Available
Job 251192 Service state Available
Job 251214 Service state Waiting \
[UIMA-AS:fastqa-Dedicated-Staging-2015-08_cache:tcp://broker42:62616 : waiting]
\end{verbatim}
\section{Orchestrator Log (or.log)}
To be filled in.
\section{Process Manager Log (pm.log)}
To be filled in.
\section{Agent Log (hostname.agent.log)}
To be filled in.