| % |
| % Licensed to the Apache Software Foundation (ASF) under one |
| % or more contributor license agreements. See the NOTICE file |
| % distributed with this work for additional information |
| % regarding copyright ownership. The ASF licenses this file |
| % to you under the Apache License, Version 2.0 (the |
| % "License"); you may not use this file except in compliance |
| % with the License. You may obtain a copy of the License at |
| % |
| % http://www.apache.org/licenses/LICENSE-2.0 |
| % |
| % Unless required by applicable law or agreed to in writing, |
| % software distributed under the License is distributed on an |
| % "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| % KIND, either express or implied. See the License for the |
| % specific language governing permissions and limitations |
| % under the License. |
| % |
| % Create well-known link to this spot for HTML version |
| \section{Overview} |
| |
| This chapter provides an overview of the DUCC process logs and how to interpret the |
| entries therein. |
| |
| Each of the DUCC ``head node'' processes writes a detailed log of its operation to |
| the directory \ducchome/logs. The logs are managed by Apache log4j. All logs are |
| managed by a single log4j configuration file |
| \begin{verbatim} |
| $DUCC_HOME/resources/log4j.xml |
| \end{verbatim} |
| |
| The DUCC logger is configured to check for updates to the log4j.xml |
| configuration file and automatically update without the need to restart any of |
| the DUCC processes. The update may take up to 60 seconds to take effect. |
| |
| The DUCC logger is loaded and configured through the log4j API such that other |
| log4j configuration files that might be in the classpath are ignored. This also |
| means that log4j configuration files in the user's classpath will not interfere |
| with DUCC's logger. |
| |
| The logs are set to roll after reaching a given size and the number of generations |
| is limited to prevent overrunning disk space. In general the log level is set to |
| provide sufficient diagnostic output to resolve most issues. |
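
Log rolling is controlled by the appender definitions in {\em log4j.xml}. The
following sketch shows the general shape of such a stanza; the appender name,
file, and limits here are illustrative only, not DUCC's shipped defaults:
\begin{verbatim}
<!-- Illustrative sketch only; names and values are not DUCC's defaults -->
<appender name="rmlog" class="org.apache.log4j.RollingFileAppender">
   <param name="File" value="${DUCC_HOME}/logs/rm.log"/>
   <param name="MaxFileSize" value="20MB"/>      <!-- roll after this size -->
   <param name="MaxBackupIndex" value="10"/>     <!-- generations kept -->
   <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{DATE} %5p %c{1} %m%n"/>
   </layout>
</appender>
\end{verbatim}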
| |
| Each DUCC component writes its own log as defined in the following table: |
| |
| \begin{tabular} {| l | l |} |
| \hline |
| Component & Log Name \\ |
| \hline |
| \hline |
| Resource Manager & rm.log \\ |
| \hline |
| Service Manager & sm.log \\ |
| \hline |
| Orchestrator & or.log \\ |
| \hline |
| Process Manager & pm.log \\ |
| \hline |
| Web Server & ws.log \\ |
| \hline |
| Agent & {\em [hostname].agent.log } \\ |
| \hline |
| \end{tabular} |
| |
| Because there may be many agents, the agent log is prefixed with the name of the host for |
| each running agent. |
| |
| The log4j file may be customized for each installation to change the format or content of the |
| log files, according to the rules defined by log4j itself. |
| |
| The general format of a log message is as follows: |
| \begin{verbatim} |
| Timestamp LOGLEVEL COMPONENT.sourceFileName method-name J[Jobid-or-NA] T[TID] text |
| \end{verbatim} |
| where |
| \begin{description} |
| \item[Timestamp] is the time of the occurrence. By default, the timestamp uses millisecond granularity. |
\item[LOGLEVEL] This is one of the log4j log levels: INFO, ERROR, DEBUG, etc.
| \item[COMPONENT] This identifies the DUCC component emitting the message. The components include |
| \begin{description} |
| \item[SM] Service Manager |
| \item[RM] Resource Manager |
| \item[PM] Process Manager |
| \item[OR] Orchestrator |
| \item[WS] Web Server |
| \item[Agent] Agent |
| \item[JD] Job Driver. These logs are written to the log directory specified in each |
| job submission. |
\item[JobProcessComponent] Job process, also known as JP. These logs are written to the
  log directory specified in each job submission.
| \end{description} |
| \item[sourceFileName] This is the name of the Java source file from which the message is emitted. |
| \item[method-name] This is the name of the method in {\em sourceFileName} which emitted the message. |
| \item[{J[Workid-or-NA]}] This is the DUCC assigned id of the work being processed, when relevant. If the |
| message is not associated with work, this field shows ``N/A''. Some logs (such as JP and JD logs) |
| pertain ONLY to a specific job and do not contain this field. |
| \item[{T[TID]}] This is the ID of the thread emitting the message. Some logs (such as RM) do not use |
| this field so it is omitted. |
| \item[text] This is the component-specific text content of the message. Key messages are described |
| in detail in subsequent sections. |
| |
| \end{description} |
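
For illustration only, a message following this format has the shape shown
below; the timestamp, ids, and placeholder text are hypothetical:
\begin{verbatim}
04 Jun 2015 09:21:03,118 INFO RM.NodePool connectShare J[7483] T[29] <message text>
\end{verbatim}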
| |
| \section{Resource Manager Log (rm.log)} |
| |
| The RM log is designed to show all phases of resource scheduling. Much of the flow of a job can |
| be observed in this log alone. The following specific information is available and is explained in |
| more detail below: |
| \begin{itemize} |
| \item Bootstrap configuration |
| \item Node arrival and missed heartbeats |
| \item Node occupancy |
| \item Job arrival and status updates |
| \item Calculation of job caps |
| \item How-much - fair share |
| \item What-of - host assignment and preemption |
| \item Defragmentation |
| \item Internal schedule |
| \item Published schedule |
| \end{itemize} |
| |
Most useful messages are emitted under log level INFO, but a wealth of detail can
be seen by increasing the log level to DEBUG or TRACE. To do so,
| edit the file |
| {\em \ducchome/resources/log4j.xml} |
| and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza |
| similar to that shown here. Within about a minute the logger will |
| pick up the change and increase its log level. |
| \begin{verbatim} |
    <category name="org.apache.uima.ducc.rm" additivity="true">
      <priority value="debug"/>
      <appender-ref ref="rmlog" />
| </category> |
| \end{verbatim} |
| |
| \subsection{Bootstrap Configuration} |
The RM summarizes its entire configuration when it starts up and prints it to the log, both to
provide context for subsequent data and to verify that the RM is configured as
intended. All of the following are found in the bootstrap section and are mostly
self-explanatory:
| |
| \begin{itemize} |
| \item A pretty-print of the class configuration. This is the same as produced by the {\em check\_ducc -c -v} |
| command. |
| \item A summary of all classes, one per line. This is a more concise display and is similar to the |
| DUCC Classes page in the web server. |
| \item A listing of all RM configuration parameters and the environment including things such as the |
| version of Java, the operating system, etc. |
| \item Nodepool occupancy. As host names are parsed from the {\em ducc.nodes} files, the RM log |
| shows exactly which nodepool each node is added to. |
| \end{itemize} |
| |
The RM logs can wrap quickly under high load, in which case this information is lost.
| |
The following are key RM log lines to search for when examining or verifying RM
initialization. (The leading portions of these messages are trimmed here to shorten the
lines for publication.)
| |
| \paragraph{Initial RM start} |
The first logged line of any RM start will contain the string {\em Starting Component: resourceManager}:
| \begin{verbatim} |
| RM.ResourceManagerComponent- N/A boot ... Starting Component: resourceManager |
| \end{verbatim} |
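
For example, a search such as the following locates that line (a sketch;
adjust the log path for your installation):
\begin{verbatim}
grep "Starting Component" $DUCC_HOME/logs/rm.log
\end{verbatim}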
| |
| \paragraph{RM Node and Class Configuration} |
| The first configuration lines show the reading and validation of the node and class configuration. Look |
| for the string {\em printNodepool} to find these lines: |
| \begin{verbatim} |
| RM.Config- N/A printNodepool Nodepool --default-- |
| RM.Config- N/A printNodepool Search Order: 100 |
| RM.Config- N/A printNodepool Node File: None |
| RM.Config- N/A printNodepool <None> |
| RM.Config- N/A printNodepool Classes: background low normal high normal-all nightly-test reserve |
| RM.Config- N/A printNodepool Subpools: jobdriver power intel |
| ... |
| \end{verbatim} |
| |
| \paragraph{RM Scheduling Configuration} |
Next the RM reads and configures its scheduling parameters and emits the information. It also emits
information about its environment: the ActiveMQ broker, JVM information, OS information, DUCC version, etc.
To find this, search for the string {\em init Scheduler}.
| \begin{verbatim} |
| init Scheduler running with share quantum : 15 GB |
| init reserved DRAM : 0 GB |
| init DRAM override : 0 GB |
| init scheduler : org.apache.uima.ducc.rm.scheduler.NodepoolScheduler |
| ... (more lines) ... |
| |
| init DUCC home : /home/challngr/ducc_runtime |
| init ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true |
| init JVM : Oracle Corporation 1.7.0_45 |
| init JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre |
| init JVM Path : /users/challngr/jdk1.7.0_45/bin/java |
| init JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2099/jmxrmi |
| init OS Architecture : amd64 |
| init OS Name : Linux |
| init DUCC Version : 2.0.0-beta |
| init RM Version : 2.0.0 |
| \end{verbatim} |
| |
| \paragraph{RM Begins to Schedule} |
| The next lines will show the nodes checking in and which nodepools they are assigned to. When the scheduler is |
| ready to accept Orchestrator requests you will see assignment of the JobDriver reservation. At this point |
| RM is fully operational. The confirmation of JobDriver assignment is similar to this: |
| \begin{verbatim} |
| Reserved: |
| ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst |
| R______7434 Job_Drive System JobDriver 1 1 1 0 2 0 0 0 1 |
| \end{verbatim} |
| |
| \subsection{Node Arrival and Missed Heartbeats} |
| \subsubsection{Node Arrival} |
| As each node ``checks in'' with the RM a line is printed with details about the node. Some fields |
| are redundant but are produced by different components processing the node arrival and thus serve |
| as confirmation that all parts are operating correctly. |
| |
| A node arrival entry is of the form: |
| \begin{verbatim} |
| LOGHEADER Nodepool: power Host added: power : bluej290-18 shares 3 total 9: 48128 <none> |
| \end{verbatim} |
| where the fields mean (if the field isn't described here, the value is not relevant to node arrival): |
| \begin{description} |
| \item[LOGHEADER] is the log entry header as described above. |
\item[Nodepool:power] The node is added to the ``power'' nodepool.
\item[bluej290-18] This is the name of the node.
| \item[shares 3] The number of full shares supported on this machine. |
| \item[total 9] This is the total shares in the system after this node arrives. |
\item[48128] This is the schedulable memory, in MB, on that host.
| \end{description} |
| |
| \subsubsection{Missed Heartbeats} |
| The DUCC Agents send out regular ``heartbeat'' messages with current node statistics. These |
messages are used by the RM to determine if a node has failed. If a heartbeat does not arrive
at the expected time, this is noted in the log as a {\em missed heartbeat}. If a specific (configurable) number
| of consecutive heartbeats is missed, the RM marks the node offline and instructs the |
| DUCC Orchestrator to purge the shares so they can be rescheduled. |
| |
| A missed heartbeat log entry is of the form |
| \begin{verbatim} |
| [LOGHEADER] "*** Missed heartbeat ***" NODENAME count[NN] |
| \end{verbatim} |
| where the fields mean: |
| \begin{description} |
| \item[LOGHEADER] is the log entry header as described above. |
\item[*** Missed heartbeat ***] Indicates this is a missed-heartbeat message.
\item[NODENAME] This is the name of the (possibly) errant host.
\item[{count[NN]}] This is the number of CONSECUTIVE missed heartbeats.
| \end{description} |
| |
| Note that it is not unusual to miss the occasional heartbeat or two due to general network or system load. |
| As soon as a heartbeat is received the count is reset to 0. |
| |
If the number of missed heartbeats exceeds the value {\em ducc.rm.node.stability} configured in
{\em ducc.properties}, the node is marked offline and this message is emitted:
| \begin{verbatim} |
[LOGHEADER] "*** ! Notification of node death:" NODENAME
| \end{verbatim} |
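
As a sketch, the corresponding {\em ducc.properties} entry has this shape;
the value shown is illustrative, not necessarily the shipped default:
\begin{verbatim}
# Consecutive missed heartbeats tolerated before a node is marked offline
# (illustrative value)
ducc.rm.node.stability = 5
\end{verbatim}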
| |
| If the node recovers and rejoins, the NodeArrives message as described above is emitted. |
| |
| \subsection{Node Occupancy} |
| {\em Node occupancy} describes, for each node, the capacity of the node, the work assigned to |
that node, and the number of open shares on that node. The RM writes the node occupancy
to its log before computing every new schedule. The occupancy can be found under the log header line:
| \begin{verbatim} |
| [LOGHEADER] Machine occupancy before schedule |
| \end{verbatim} |
| |
NOTE: The current node occupancy can be queried interactively with the
\hyperref[subsec:admin.rm-qoccupancy]{rm\_qoccupancy} command:
| \begin{verbatim} |
| DUCC_HOME/admin/rm_qoccupancy |
| \end{verbatim} |
| |
| Sample node occupancy as displayed in the log follows. The header is included in the log. |
| \begin{verbatim} |
| Name Order Active Shares Unused Shares Memory (MB) Jobs |
| --------------- ----- ------------- ------------- ----------- ------ ... |
| f1n2.bluej.net 16 16 0 255459 206710 206715 207878 206719 207900 |
| f1n4.bluej.net 16 0 16 255459 <none>[16] |
| f7n2.bluej.net 16 0 16 255459 <none>[16] |
| f9n10.bluej.net 16 0 16 255459 <none>[16] |
| f6n1.bluej.net 16 0 16 255459 <none>[16] |
| f7n1.bluej.net 16 3 13 255459 203408 [13] |
| f7n3.bluej.net 16 16 0 255459 206716 207904 206720 206717 206718 |
| f4n10.bluej.net 16 15 1 255459 209155 208975 209153 209155 [1] |
| f7n5.bluej.net 16 16 0 255459 208960 |
| f1n3.bluej.net 16 16 0 255459 205608 206695 207906 206693 206693 |
| f1n1.bluej.net 16 3 13 255459 208913 [13] |
| f6n10.bluej.net 16 3 13 255459 208977 [13] |
| f6n7.bluej.net 16 0 16 255459 <none> [16] |
| f7n6.bluej.net 16 15 1 255459 209155 209151 206701 209155 [1] |
| \end{verbatim} |
| |
| The meaning of each column is: |
| \begin{description} |
| \item[Name] The host name. |
| \item[Order] This is the share order of the node. The number represents the number of quantum shares |
| that can be scheduled on this node. (Recall that an actual process may and usually does |
| occupy multiple quantum shares.) |
| \item[Active Shares] This is the number of quantum shares on the node which are scheduled |
| for work. |
| \item[Unused Shares] This is the number of quantum shares available for new work. |
| \item[Memory] This is the real memory capacity of the node, as reported by the node's |
| Agent process. |
\item[Jobs] Each entry here is the DUCC-assigned id of a job with a process assigned to
            this node. Each entry corresponds to one process. If an ID appears more than
            once, the job has more than one process assigned to the node; see, for example, the
            node {\bf f1n3.bluej.net} with multiple entries for job {\em 206693}.
| |
| When no work is assigned to the node, the string {\bf $<$none$>$} is displayed. |
| |
| When there is a number in brackets, e.g. {\bf [13]} for node {\bf f7n1.bluej.net}, the |
| number represents the number of quantum shares available to be scheduled on the node. |
| \end{description} |
| |
| \subsection{Job Arrival and Status Updates} |
| |
| \paragraph{Orchestrator State Arrival} |
| |
| On a regular basis the Orchestrator publishes the full state of |
| work which may require resources. This is the prime input to the |
| RM's scheduler and must arrive on a regular basis. Every arrival |
| of an Orchestrator publication is flagged in the log as follows. |
If these aren't observed every
\hyperref[itm:props-or.state.publish.rate]{Orchestrator publish
interval}, something is wrong; most likely the Orchestrator or
the ActiveMQ broker has a problem.
| |
| \begin{verbatim} |
| RM.ResourceManagerComponent- N/A onJobManagerStateUpdate -------> OR state arrives |
| \end{verbatim} |
| |
| \paragraph{Job State} |
Immediately after the OR state arrival is logged, the state of all work needing scheduling
is logged. This is tracked by the {\em JobManagerConverter} module in the
RM and logged similarly to the following. It shows the state of each piece of work
of interest and, if that state has changed since the last publication, what the new state is.
| \begin{verbatim} |
| ... |
| RM.JobManagerConverter- 7433 eventArrives Received non-schedulable job, state = Completed |
| RM.JobManagerConverter- 7434 eventArrives [SPR] State: WaitingForResources -> Assigned |
| ... |
| \end{verbatim} |
| |
| \subsection{Calculation Of Job Caps} |
| Prior to every schedule, and immediately after receipt of the Orchestrator state, |
| the RM examines every piece of work and calculates the maximum level of resources the |
| job can physically use at the moment. This handles the {\em expand-by-doubling} |
| function, the {\em prediction} function, and accounts for the amount of work left |
| relative to the resources the work already possesses. |
| |
| The curious or intrepid can see the code that implements this in {\em RmJob.java} method |
| {\em initJobCap()}. |
| |
| The calculation is done in two steps: |
| \begin{enumerate} |
| \item Calculate the projected cap. This uses the prediction logic and amount of |
| work remaining to |
| calculate the {\em largest} number of resources the job can use, if the system |
| had unlimited resources. This is an upper bound on the actual resources |
| assigned to the job. |
\item Adjust the cap down using expand-by-doubling and the initialization state of
      the work. The result of this step is always {\em less than or equal to}
      the projected cap.
| \end{enumerate} |
The goal of this step is to calculate the largest number of resources the job can
actually use at the moment. The FAIR\_SHARE calculations may further revise this
down, but will never revise it up.
| |
If there is no data yet on the initialization state of work, the projected cap cannot
be calculated and a line such as the following is emitted:
| \begin{verbatim} |
| RM.RmJob - 7483 getPrjCap Hilaria Cannot predict cap: init_wait true || time_per_item 0.0 |
| \end{verbatim} |
| |
| If the job has completed initialization the projected cap is calculated based on |
| the average initialization time of all the job processes and the current rate of |
| work-item completion. A line such as this is emitted: |
| \begin{verbatim} |
| RM.RmJob- 7483 Hilaria O 2 T 58626 NTh 28 TI 18626 TR 12469.0 R 2.2456e-03 QR 1868 \ |
| P 132 F 1736 ST 1433260775524 return 434 |
| \end{verbatim} |
| In this particular line: |
| \begin{description} |
| \item[7483] is the job id |
| \item[Hilaria] is the job's owner (userid) |
\item[O 2] This says the job is {\em order 2}: each process will occupy two quantum shares.
| \item[T 58626] is the smallest number of milliseconds until a new process for this job |
| can be made runnable, based on the average initialization time for processes in |
| this job, the Orchestrator publish rate, and the |
| \hyperref[itm:props-rm.prediction.fudge]{\em RM prediction fudge.} |
\item[NTh] This is the number of threads currently executing for this job. It is
| calculated as the (number of currently allocated processes) * (the number of threads |
| per process). |
| \item[TI] This is the average initialization time in milliseconds for processes in this job. |
| \item[TR] This is the average execution time in milliseconds for work items in this job. |
| \item[R] This is the current rate at which the job is completing work items, |
| calculated as (Nth / TR). |
\item[QR] This is the number of work items (questions) remaining to be executed.
| \item[P] This is the projected number of questions that can be completed |
| in the time from ``now'' until a new process can be started and initialized |
| (in this case 58626 milliseconds from now, see above), with the currently |
| allocated resources, calculated as (T * R). |
| \item[F] This is the number of questions that will remain unanswered at the |
| end of the target (T) period, calculated as (QR - P). |
| \item[ST] This is the time the job was submitted. |
| \item[return] This is the projected cap, the largest number of processes this |
| job can physically use, calculated as (F / threads-per-process). |
| |
| If the returned projected cap is 0, it is adjusted up to the number of |
| processes currently allocated. |
| \end{description} |
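
Plugging the numbers from the sample line back into these formulas verifies how
the fields relate (threads-per-process is 4 here, consistent with NTh 28 across
the job's 7 allocated processes):
\begin{eqnarray*}
R & = & \mbox{NTh} / \mbox{TR} = 28 / 12469.0 \approx 2.2456 \times 10^{-3} \\
P & = & T \times R = 58626 \times 2.2456 \times 10^{-3} \approx 132 \\
F & = & \mbox{QR} - P = 1868 - 132 = 1736 \\
\mbox{return} & = & F / \mbox{threads-per-process} = 1736 / 4 = 434
\end{eqnarray*}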
| |
| Once the projected cap is calculated a final check is made to avoid several problems: |
| \begin{itemize} |
| \item Preemption of processes that contain active work but are not using all their |
| threads. This occurs when a job is ``winding down'' and may have more |
| processes than it technically needs, but all processes still are performing work. |
| \item The job may have declared a maximum number of processes to allocate, which is |
| less than the number it could otherwise be awarded. |
| \item If prediction is being used, revise the estimate down to the smaller |
| of the projected cap and the resources currently allocated. |
| \item If initialization caps are being applied and no process in the job has |
| successfully initialized, revise the estimate down to the initialization cap. |
| \item If expand-by-doubling is being used, potentially revise the estimate down |
| to no more than double the currently allocated processes. |
| \end{itemize} |
| |
| The final cap is emitted in a line such as: |
| \begin{verbatim} |
| RM.RmJob- 7483 initJobCap Hilaria O 2 Base cap: 7 Expected future cap: 434 potential cap 7 actual cap 7 |
| \end{verbatim} |
| In this line: |
| \begin{description} |
| \item[7483] is the job id. |
| \item[Hilaria] is the job's user name. |
| \item[O 2] indicates this job uses two quantum shares per processes. |
| \item[Base cap:] This is an upper-bound on the number of processes |
| that can be used in a perfect world. It is calculated by |
| dividing the number of questions by the number of threads per |
| process. It is then revised down by the declared max-processes |
| in the job. In the example above, the job declared |
| max-processes of 7. |
| \item[Expected future cap] This is the projected cap, described above. |
\item[Potential cap] This is the base cap, possibly revised downward
  by the future cap, if it is projected that fewer processes
  would be useful.
| \item[Actual cap] This is the assigned maximum processes to be |
| scheduled for this job, possibly adjusted based on the |
| initialization status of the job and the expand-by-doubling policy. |
| \end{description} |
| |
The {\em actual cap} is the one used to calculate the job's FAIR\_SHARE and
is always the largest number of processes usable in a perfect world. Note
that the FAIR\_SHARE calculation may result in further reduction of this
number.
| |
| \subsection{The ``how much'' calculations} |
| The RM log includes a section that details the fair-share calculations. The details |
| of this are rather involved and out-of-scope for this section. Interested parties |
are welcome to read the scheduler source, in the file {\em NodepoolScheduler.java},
methods {\em countClassShares, countJobShares, countUserShares}, and {\em apportion\_qshares}.
| |
| The logs reveal the inputs to each of the methods above. The overall logic is as follows |
| and can be followed in the logs. |
| |
| \begin{itemize} |
| \item All job classes of equal priority are bundled together and handed to the |
| {\em countClassShares} method. This method assigns some number of shares |
| to each class based on the weighted fair-share logic, using the configured |
| class weights. The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
| INFO RM.NodepoolScheduler- N/A apportion_qshares countClassShares RmCounter Start |
| \end{verbatim} |
| |
| \item All users for each class are passed to the {\em countUserShares} method |
| and then assigned some number of shares from the |
| pool of shares assigned to the class, again |
      using the fair-share computations, but with equal weights.
| The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
   INFO RM.NodepoolScheduler- N/A apportion_qshares countUserShares RmCounter Start
| \end{verbatim} |
| |
| \item All jobs for each user are passed to the {\em countJobShares} method |
| and assigned some number of shares from the pool assigned to the user, using |
| the fair-share calculator with equal weights. |
| The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
   INFO RM.NodepoolScheduler- N/A apportion_qshares countJobShares RmCounter Start
| \end{verbatim} |
| |
| \item The method {\em apportion\_qshares} is the common fair-share |
| calculator, used by the three routines above. |
| \end{itemize} |
| |
| \subsection{The ``what of'' calculations} |
These calculations are also too involved to discuss in detail in this section.
| |
Interested parties may look in {\em NodepoolScheduler.java}, method
{\em whatOfFairShare}, and {\em NodePool.java} method {\em traverseNodepoolsForExpansion}
to see details.
| |
| The logs track the general flow through the methods above and generally contain |
| enough information to diagnose problems should they arise. |
| |
| The key log message here, other than those sketching logic flow, shows the |
| assignment of specific processes to jobs as seen below. |
| |
| \begin{verbatim} |
| RM.NodePool- 7483 connectShare share bluej290-12.461 order 2 machine \ |
| bluej290-12 false 2 0 2 31744 <none> |
| \end{verbatim} |
This shows job {\em 7483} being assigned a process on host {\em bluej290-12} as
RM share id {\em 461}, which consists of {\em 2 quantum shares} (order 2). Host
bluej290-12 is a 32GB machine with {\em 31744} MB of usable, schedulable memory.
| |
| \subsection{Defragmentation} |
| The RM considers the system's memory pool to be fragmented if the counted |
resources from the ``how much'' phase of scheduling cannot be fully
| mapped to real physical resources in the ``what of'' phase. In short, the |
| ``how much'' phase assumes an ideal, unfragmented virtual cluster. The ``what of'' |
| phase may be unable to make the necessary physical assignments without excessive |
| preemption of jobs that are legitimately at or below their fair share allocations. |
| |
| Intuitively, the ``how much'' phase guarantees that if you could do unlimited |
| shuffling around of the allocated resources, everything would ``fit''. The |
| system is considered fragmented if such shuffling is actually needed. The |
defragmentation process attempts that shuffling, under the constraint of
interrupting the smallest amount of productive work possible.
| |
One scheduling goal, however, is to attempt to guarantee that every job gets
at least some minimal number of its fairly-counted processes. This minimal number
is called the
\hyperref[itm:props-rm.defragmentation.threshold]{defragmentation threshold} and
is configured in ducc.properties. This threshold rigorously defines the
``smallest amount of productive work'' mentioned in the previous paragraph.
| The defragmentation threshold is used in |
| two ways: |
| |
| \begin{enumerate} |
\item Attempt to allocate resources to every work request at least up
      to the level of the defragmentation threshold.
| \item Never steal resources beyond the defragmentation threshold during |
| the ``take from the rich'' phase of defragmentation, described below. |
| \end{enumerate} |
| To accomplish this, a final stage, ``defragmentation'', is |
| performed before publishing the new schedule to the Orchestrator |
| for deployment. |
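
The threshold itself is set in {\em ducc.properties}; a sketch, with an
illustrative value:
\begin{verbatim}
# Minimum processes each job should hold before defragmentation will
# steal from it, and the level up to which needy jobs are topped up
# (illustrative value)
ducc.rm.defragmentation.threshold = 2
\end{verbatim}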
| |
| Defragmentation consists of several steps. The details are again involved, |
| but an understanding of the logic will make following the log relatively |
| straightforward. |
| \begin{itemize} |
| \item Examine every job and determine whether it was assigned |
| all the processes from the ``how much'' phase. If not, it is marked |
| as POTENTIALLY NEEDY. |
| |
| This step is logged with the tag {\em detectFragmentation}. |
| |
| \item Examine every POTENTIALLY NEEDY job to determine if there are |
| sufficient preemptions pending such that the ``how much'' phase will be able |
| to complete as soon as the preemptions complete. If not, the |
| job is marked ACTUALLY NEEDY. |
| |
| This step is also logged with the tag {\em detectFragmentation}. |
| |
| \item For every job marked ACTUALLY NEEDY, examine all jobs in the |
| system already assigned shares to determine which ones can |
| donate some resources to the ACTUALLY NEEDY jobs. These are typically |
| jobs with more processes than their FAIR SHARE, but which, in a |
| perfect, unfragmented layout, would be allocated more resources. These |
| jobs are called {\em rich} jobs. |
| |
| This step is logged with the tags {\em insureFullEviction} and |
| {\em doFinalEvictions}. |
| |
| \item Attempt to match allocations from ``rich'' jobs with jobs that |
| are ACTUALLY NEEDY. If the ACTUALLY NEEDY job is able to use |
| one of the ``rich job'' allocations, the allocation is scheduled for |
| preemption. (Note there are many reasons that a rich job may not |
| have appropriate resources to donate: mismatched nodepool, physical |
| host too small, not preemptable, etc.). |
| |
| This step is logged with the tag {\em takeFromTheRich}. If this |
| step has any successes, the log will also show lines with the |
| tags {\em clearShare} and {\em shrinkByOne} as the resources |
| are scheduled for reuse. |
| |
| \item The needy job is placed in a list of jobs which are given the |
| highest priority for assignment of new processes, at the start of each |
| subsequent scheduling cycle, until such time |
| as they are no longer needy. |
| |
| This step is logged with the tag {\em Expand needy}. |
| \end{itemize} |
| |
| Those who wish to see the details of defragmentation can find them in |
| {\em NodepoolScheduler.java}, starting with the method {\em detectFragmentation} |
| and tracing the flows from there. |
| |
| \subsection{Published Schedule} |
| |
| The schedule gets printed to the log twice on every scheduling cycle. The first |
| form is a pretty-printed summary of all known jobs, showing which ones are |
| getting more resources, {\em expanding}, those which are losing resources, |
| {\em shrinking}, and those which are not changing, {\em stable.} |
| |
| The second form is a {\em toString()} of the structure sent to the Orchestrator, |
| showing the exact resources currently assigned, added, or lost this cycle. |
| |
| \paragraph{The pretty-printed schedule} |
| This entry is divided into five sections. Each section contains one line for |
each relevant job, with largely self-explanatory headers. An example follows
(wrapped here so it fits within a printed page):
| \begin{verbatim} |
| ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst |
| J____7485 mega-2 bob low 7 2 14 4 24 11510 11495 false 7 |
| J____7486 mega-1 mary normal 93 2 186 4 28 14768 14764 false 93 |
| |
| \end{verbatim} |
| Here, |
| |
| \begin{description} |
| \item[ID] is the unique DUCC ID of the work, prefixed with an indication of what kind of |
| work it is: Job (J), a Service (S), a Reservation (R), or Managed Reservation (M). |
| \item[JobName] is the user-supplied name / description of the job. |
| \item[User] is the owner of the work. |
| \item[Class] is the scheduling class used to schedule the work. |
\item[Shares] is the number of allocations awarded, which might be processes, or simply reserved space. It
              is a human-readable convenience, calculated as (QShares / Order).
| \item[Order] is the number of share quanta per allocation. |
| \item[QShares] is the total quantum shares awarded to the work. |
\item[NTh] is the declared number of threads per process.
| \item[Memory] is the amount of memory in GB for each allocation. |
| \item[nQuest] is the number of work items (questions) for the job, where relevant. |
| \item[Ques Rem] is the number of work items not yet completed. |
| \item[InitWait] is either {\em true} or {\em false}, indicating whether at least one process |
| has successfully completed initialization. |
| \item[Max P/Nst] is the job-declared maximum processes / instances for the job. |
| \end{description} |
| |
| The five subsections of this log section are: |
| \begin{description} |
| \item[Expanded] This is the list of all work that is receiving more resources this cycle. |
| \item[Shrunken] This is the list of work that is losing resources this cycle. |
| \item[Stable] This is the list of work whose assigned resources do not change this cycle. |
| \item[Dormant] This is the list of work that is unable to receive any resources this cycle. |
| \item[Reserved] This is the list of reservations. |
| \end{description} |
| |
| \paragraph{The Orchestrator Structure} |
| This is a list containing up to four lines per scheduled work. |
| |
| The specific resources shown here are formatted thus: |
| \begin{verbatim} |
| hostname.RM share id^Initialization time |
| \end{verbatim} |
| The {\em hostname} is the name of the host where the resource is assigned. The {\em RM Share} |
| is the unique (to RM only) id of the share assigned to this resource. The {\em Initialization time} |
| is the amount of time spent by the process residing within this resource in its initialization phase. |
| |
| The lines are: |
| \begin{enumerate} |
\item The type of work and its DUCC ID, for example:
| \begin{verbatim} |
| Reservation 7438 |
| \end{verbatim} |
| \item The complete set of all resources currently assigned to the work, for example: |
| \begin{verbatim} |
| Existing[1]: bluej537-7-73.1^0 |
| \end{verbatim} |
| The resources here include all resources the RM tracks as being owned by the job, including |
| older resources, newly assigned resources, and resources scheduled for eviction. The specific |
| resources which are being added or removed are shown in the next lines. |
| |
\item The complete set of resources the RM has scheduled for eviction, but which are not
      yet confirmed freed. For example, here we see 7 resources scheduled for eviction:
| \begin{verbatim} |
| Removals[7]: bluej290-11.465^19430 bluej290-12.461^11802 bluej290-4.460^12672 \ |
| bluej290-5.464^23004 bluej290-2.467^22909 bluej290-7.463^20636 \ |
| bluej290-6.466^19931 |
| \end{verbatim} |
| |
| \item The complete set of resources which are being added to the work in this cycle. For |
| example: |
| |
| \begin{verbatim} |
| Additions[4]: bluej291-43.560^0 bluej291-42.543^0 bluej290-23.544^0 bluej291-44.559^0 |
| \end{verbatim} |
| \end{enumerate} |
| |
| In most cases, if resources cannot be awarded, this section also shows the reason |
| string which is published for the benefit of the web server and the Orchestrator's job monitor: |
| \begin{verbatim} |
| Job 7487 Waiting for defragmentation. |
| Existing[0]: |
| Additions[0]: |
| Removals[0]: |
| \end{verbatim} |
| |
In some cases, a job may show BOTH Additions and Removals. This usually
occurs as a result of the defragmentation step. The job will have been found in need of
new resources during the initial fair-share computation, but later, during defragmentation,
it is also found to be a ``rich'' job which must donate resources to under-allocated work.
Not all of the ``rich'' job's processes may be appropriate for the needy work,
in which case the job may be allowed to expand even as it donates some processes to the
under-allocated work.
| |
This can also occur if resources were previously preempted and, for some reason, the
preemption is taking a long time. If other resources have since been freed, the job
can now re-expand. It is not possible to reverse a preemption (because the actual
state of the preemption is not knowable), so both expansion and shrinkage can be
in progress for the same job.
| |
| \section{Service Manager Log (sm.log)} |
| |
| The service manager log shows the events involved in managing services. These events include |
| \begin{itemize} |
| \item Bootstrap configuration |
\item Receipt and analysis of Orchestrator state
| \item CLI requests: register, modify, start, stop, unregister, etc. |
| \item Dispatching / startup of service instances |
| \item Progression of Service state |
| \item Starting and logging of pingers |
| \end{itemize} |
| |
| To enable finer-grained messages, edit the file |
| {\em \ducchome/resources/log4j.xml} |
| and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza |
| similar to that shown here. Within about a minute the logger will |
| pick up the change and increase its log level. |
| \begin{verbatim} |
| <category name="org.apache.uima.ducc.sm" additivity="true"> |
| <priority value="debug"/> |
| <appender-ref ref="smlog" /> |
| </category> |
| \end{verbatim} |
| |
| \subsection{Bootstrap configuration} |
| |
| \paragraph{Initial SM start} |
The first logged line of any SM start will contain the string {\em Starting Component: serviceManager}:
| \begin{verbatim} |
| SM.ServiceManagerComponent - N/A boot ... Starting Component: serviceManager |
| \end{verbatim} |
| |
This is followed by a summary of the environment in which the Service Manager is running, including
configuration data and information about the JRE and the hosting hardware.
| |
| \begin{verbatim} |
| Service Manager starting: |
| DUCC home : /home/challngr/ducc_runtime |
| ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true |
| |
| JVM : Oracle Corporation 1.7.0_45 |
| JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre |
| JVM Path : /users/challngr/jdk1.7.0_45/bin/java |
| JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2102/jmxrmi |
| |
| OS Architecture : amd64 |
| Crypto enabled : true |
| |
| Test mode enabled : true |
| |
| Service ping rate : 15000 |
| Service ping timeout : 15000 |
| Service ping stability : 10 |
| Default ping class : org.apache.uima.ducc.cli.UimaAsPing |
| |
| Init Failure Max : 1 |
| Instance Failure Max : 3 |
| Instance Failure Window : 10 |
| |
| DUCC Version : 2.0.0-beta |
| SM Version : 2.0.0 |
| \end{verbatim} |
| |
| \paragraph{Initialize the service registry} |
Following this are entries showing the reading and internal initialization of the
service registry, for example:
| \begin{verbatim} |
| SM.ServiceHandler - 411 ServiceStateHandler.registerService \ |
| adding UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 411 |
| \end{verbatim} |
| |
| \paragraph{Wait for the Resource Manager and Orchestrator to become ready} |
| |
| The Service Manager waits until the Resource Manager and Orchestrator are ready. This usually |
| results in lines similar to the following, which are normal and expected. If these lines |
| continue for more than a few minutes it is possible that some part of DUCC has not correctly |
| started: |
| \begin{verbatim} |
| SM.ServiceManagerComponent - N/A orchestratorStateArrives \ |
| Orchestrator JD node not assigned, ignoring Orchestrator state update. |
| \end{verbatim} |
| |
| \paragraph{Begin work} |
| When the Service Manager is ready for business, lines similar to the following are emitted on |
| a regular basis: |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - N/A processIncoming ===== Orchestrator State Arrives ===== |
| \end{verbatim} |
| |
| \subsection{Receipt and analysis of Orchestrator State} |
| |
| \subsection{CLI Requests} |
| |
CLI requests are all logged at log level INFO. Initial receipt of a command
is tagged with the string {\em ServiceManagerComponent} and contains the name of
the command issued, along with its pertinent details.
For example, this shows registration of a new service.
| |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - 428 register ServiceRegisterEvent \ |
| [ninstances=2, autostart=Unset, \ |
| endpoint=UIMA-AS:FixedSleepAE_5:tcp://bluej537:61617, user=challngr] |
| \end{verbatim} |
| |
| Some commands require service ownership or administrative authority. Here we show |
| a service being started, and the authorization check being made. |
| |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - N/A start Starting service ServiceStartEvent \ |
| [friendly=430, user=challngr, instances=-1] |
| INFO SM.ServiceHandler - 430 authorized start request from challngr allowed. |
| \end{verbatim} |
| |
| \subsection{Dispatching / Startup of Service Instances} |
| The full set of startup messages is logged when an instance is dispatched to DUCC to |
be scheduled and started. These starts are usually preceded by a message or two
| indicating what triggered the start. For instance, this shows service 427 being |
| started because it was referenced by job 7676: |
| |
| \begin{verbatim} |
| INFO SM.ServiceSet - 427 reference Reference start requested by 7676 |
| INFO SM.ServiceSet - 427 reference References job/service 7676 count[1] implementors [0] |
| INFO SM.ServiceSet - 427 reference Reference starting new service instances. |
| \end{verbatim} |
| |
| This is followed by a line indicating how many instances are to be started: |
| \begin{verbatim} |
| INFO SM.ServiceSet - 427 start Starting instance. Current count 0 needed 2 |
| \end{verbatim} |
| |
| Shortly thereafter the {\em stdout} messages from the submission to the |
| DUCC Orchestrator are shown. These are tagged with the keyword {\em ServiceInstance}. |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start START INSTANCE |
| INFO SM.ServiceInstance - 427 start Start stdout: 050 ducc_ling Version 1.1.2 \ |
| compiled Aug 4 2014 at 06:45:31 |
| \end{verbatim} |
| |
| A few lines later the DUCC-assigned ID of the instance is shown, in |
| this case, ID 7677, for service 427. |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted |
| INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677 |
| \end{verbatim} |
| |
The next lines show the environment used for the service submit.
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: 1104 Running with user and group: \ |
| id 2087 gid 2001 eid 2087 egid 2001 |
| INFO SM.ServiceInstance - 427 start Start stdout: 300 Bypassing redirect of log. |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CORE soft[1024] hard[-1] |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CPU soft[-1] hard[-1] |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: DATA soft[-1] hard[-1] |
| ... (more environment) ... |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[0] = DUCC_SERVICE_INSTANCE=0 |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[1] = JAVA_HOME=/opt/ibm-jdk1.7 |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[2] = LESSCLOSE=lessclose.sh %s %s |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[3] = ENV=/etc/bash.bashrc |
| \end{verbatim} |
| |
Next the exact command line executed to submit the instance is shown. If the submission
is successful, the log shows {\em START INSTANCE COMPLETE}. The exact progression of
the scheduling and deployment of an
instance is not shown in this log. To observe that, look in
the RM log, searching on the service instance id. In the examples shown here, that
id is 7677.
| |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: 1000 Command to exec: /opt/ibm-jdk1.7/bin/java |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[1]: -cp |
| INFO SM.ServiceInstance - 427 start <INHIBITED CP> |
| INFO SM.ServiceInstance - 427 start <INHIBITED CP> |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[4]: --specification |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[5]: /home/challngr/ducc_runtime/state/services/427.svc |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[6]: --service_id |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[7]: 427 |
| INFO SM.ServiceInstance - 427 start Start stdout: 1001 Command launching... |
| INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted |
| INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677 |
| INFO SM.ServiceInstance - 427 start START INSTANCE COMPLETE |
| \end{verbatim} |
| |
| The state progression of the service as it starts is then available, as described in the next section. |
| |
| |
| \subsection{Progression of Service State} |
| |
The state of each service as it starts and stops is tracked at log level INFO. To find the state progression
for any particular service, search on {\em service-id setState}, where service-id is the ID of the service.
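
For example, to follow the progression of service 427 (a sketch; adjust the
log path for your installation):
\begin{verbatim}
grep " 427 setState" $DUCC_HOME/logs/sm.log
\end{verbatim}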
| |
| Here we show the progression for service 427 as it is started and progresses to fully functional (there are |
| other lines logged between these of course). Note that if a service has multiple instances defined, the |
| overall service state is determined by an aggregate of the states of the individual instances. In these |
messages, the ``Inst'' field of each message shows the DUCC ID of the most recently updated service instance
| that triggered the overall state change. |
| |
| \begin{verbatim} |
| SM.ServiceSet - 427 setState State update \ |
| from[Stopped] to[Starting] via[Starting] Inst[7677/Received] |
| SM.ServiceSet - 427 setState State update \ |
| from[Starting] to[Initializing] via[Initializing] Inst[7677/Initializing] |
| SM.ServiceSet - 427 setState State update \ |
| from[Initializing] to[Waiting] via[Waiting] Inst[7677/Running] |
| SM.ServiceSet - 427 setState State update \ |
| from[Waiting] to[Available] via[Available] Inst[7677/Running] |
| \end{verbatim} |
| |
| \subsection{Starting and Logging Pingers} |
| |
| When a pinger is started (or restarted) the event is logged and tagged |
| with the string {\em service-id startPingThread}. Following are a few |
| lines stating the name of the pinger and the Java class used to |
| implement the pinger. For example: |
| |
| \begin{verbatim} |
| INFO SM.ServiceSet - 430 startPingThread Starting service monitor. |
| INFO SM.PingDriver - 430 find RegisteredPinger \ |
| Loading site-registered service monitor from org.apache.uima.ducc.cli.UimaAsPing |
| INFO SM.PingDriver - 430 <ctr> Using ping class org.apache.uima.ducc.cli.UimaAsPing |
| \end{verbatim} |
| |
| If the pingers do not declare loggers, their {\em stdout} and {\em stderr} are |
| captured in the logs, under the tag {\em handleStatistics}. As well, every |
| ping is recorded with its ``info'' string, making it possible to see the |
| exact state of the pinger. For example: |
| |
| \begin{verbatim} |
| INFO SM.PingDriver - 411 handleStatistics Ping ok: UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 \ |
| Alive[true] Healthy[true] + Info: QDEPTH[0] AveNQ[0] Consum[26] Prod[0] \ |
| minNQ[0] maxNQ[0] expCnt[0] inFlt[0] DQ[0] NQ[0] NDisp[0] \ |
| MetaNode[192.168.4.36] MetaPid[8892:67] |
| \end{verbatim} |
| |
| Pingers always return state to the Service Manager and some of that state |
| affects the SM's operation; in particular, pingers can start and stop |
| specific service instances or change the autostart setting for a |
| service. They also must return success and failure status to the SM. |
| The ping state is logged under the tag {\em service-id signalRebalance} |
| as seen below. If {\em Additions:} or {\em Deletions:} is non-zero, you |
| can expect to see SM automatically start or stop specific instances for |
| the service. |
| \begin{verbatim} |
| INFO SM.ServiceSet - 430 signalRebalance PING: Additions: 0 deletions: 0 \ |
| excessive failures: false implementors 1 references 0 |
| \end{verbatim} |
| |
| \subsection{Publishing State} |
By default the SM log does not include the state as published to the
Orchestrator because it can be voluminous, cluttering the logs and causing
them to wrap too fast. It may be necessary to
increase the log level to DEBUG as described at the start of this section.
| |
The published state is emitted in lines similar to the following. Most
entries show ``Available'', which means either the work has no service
dependencies or the services it depends on are in state Available. If there
is some exceptional condition pertaining to a job, that is shown. In the
sample below, job 251214 is waiting for a specific service whose state itself
is ``waiting''.
| |
| \begin{verbatim} |
| INFO SM.ServiceHandler - 251261 handleModifiedobs No service dependencies, no updates made. |
| INFO SM.ServiceManagerComponent - N/A publish Publishing State, active job count = 102 |
| INFO SM.ServiceManagerComponent - N/A publish Service Map |
| Job 251159 Service state Available |
| Job 251263 Service state Available |
| Job 251192 Service state Available |
| Job 251214 Service state Waiting \ |
| [UIMA-AS:fastqa-Dedicated-Staging-2015-08_cache:tcp://broker42:62616 : waiting] |
| \end{verbatim} |
| |
| |
\section{Orchestrator Log (or.log)}
| To be filled in. |
| |
| \section{Process Manager Log (pm.log)} |
| To be filled in. |
| |
\section{Agent Log (hostname.agent.log)}
| To be filled in. |