| % |
| % Licensed to the Apache Software Foundation (ASF) under one |
| % or more contributor license agreements. See the NOTICE file |
| % distributed with this work for additional information |
| % regarding copyright ownership. The ASF licenses this file |
| % to you under the Apache License, Version 2.0 (the |
| % "License"); you may not use this file except in compliance |
| % with the License. You may obtain a copy of the License at |
| % |
| % http://www.apache.org/licenses/LICENSE-2.0 |
| % |
| % Unless required by applicable law or agreed to in writing, |
| % software distributed under the License is distributed on an |
| % "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| % KIND, either express or implied. See the License for the |
| % specific language governing permissions and limitations |
| % under the License. |
| % |
| % Create well-known link to this spot for HTML version |
| \section{Overview} |
| |
| This chapter provides an overview of the DUCC process logs and how to interpret the |
| entries therein. |
| |
| Each of the DUCC ``head node'' processes writes a detailed log of its operation to |
| the directory \ducchome/logs. The logs are managed by Apache log4j. All logs are |
| managed by a single log4j configuration file |
| \begin{verbatim} |
| $DUCC_HOME/resources/log4j.xml |
| \end{verbatim} |
| |
| The DUCC logger is configured to check for updates to the log4j.xml |
| configuration file and automatically update without the need to restart any of |
| the DUCC processes. The update may take up to 60 seconds to take effect. |
| |
| The DUCC logger is loaded and configured through the log4j API such that other |
| log4j configuration files that might be in the classpath are ignored. This also |
| means that log4j configuration files in the user's classpath will not interfere |
| with DUCC's logger. |
| |
| The logs are set to roll after reaching a given size and the number of generations |
| is limited to prevent overrunning disk space. In general the log level is set to |
| provide sufficient diagnostic output to resolve most issues. |
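
Log rolling is controlled by the appender definitions in {\em log4j.xml}. The
following sketch shows the general shape of such a stanza; the appender name,
file, and limits here are illustrative only, not DUCC's shipped defaults:
\begin{verbatim}
<!-- Illustrative sketch only; names and values are not DUCC's defaults -->
<appender name="rmlog" class="org.apache.log4j.RollingFileAppender">
   <param name="File" value="${DUCC_HOME}/logs/rm.log"/>
   <param name="MaxFileSize" value="20MB"/>      <!-- roll after this size -->
   <param name="MaxBackupIndex" value="10"/>     <!-- generations kept -->
   <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{DATE} %5p %c{1} %m%n"/>
   </layout>
</appender>
\end{verbatim}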
| |
| Each DUCC component writes its own log as defined in the following table: |
| |
| \begin{tabular} {| l | l |} |
| \hline |
| Component & Log Name \\ |
| \hline |
| \hline |
| Resource Manager & rm.log \\ |
| \hline |
| Service Manager & sm.log \\ |
| \hline |
| Orchestrator & or.log \\ |
| \hline |
| Process Manager & pm.log \\ |
| \hline |
| Web Server & ws.log \\ |
| \hline |
| Agent & {\em [hostname].agent.log } \\ |
| \hline |
| \end{tabular} |
| |
| Because there may be many agents, the agent log is prefixed with the name of the host for |
| each running agent. |
| |
| The log4j file may be customized for each installation to change the format or content of the |
| log files, according to the rules defined by log4j itself. |
| |
| The general format of a log message is as follows: |
| \begin{verbatim} |
| Timestamp LOGLEVEL COMPONENT.sourceFileName method-name J[Jobid-or-NA] T[TID] text |
| \end{verbatim} |
| where |
| \begin{description} |
| \item[Timestamp] is the time of the occurrence. By default, the timestamp uses millisecond granularity. |
\item[LOGLEVEL] This is one of the log4j log levels: INFO, ERROR, DEBUG, etc.
| \item[COMPONENT] This identifies the DUCC component emitting the message. The components include |
| \begin{description} |
| \item[SM] Service Manager |
| \item[RM] Resource Manager |
| \item[PM] Process Manager |
| \item[OR] Orchestrator |
| \item[WS] Web Server |
| \item[Agent] Agent |
| \item[JD] Job Driver. These logs are written to the log directory specified in each |
| job submission. |
\item[JobProcessComponent] Job process, also known as JP. These logs are written to the
  log directory specified in each job submission.
| \end{description} |
| \item[sourceFileName] This is the name of the Java source file from which the message is emitted. |
| \item[method-name] This is the name of the method in {\em sourceFileName} which emitted the message. |
| \item[{J[Workid-or-NA]}] This is the DUCC assigned id of the work being processed, when relevant. If the |
| message is not associated with work, this field shows ``N/A''. Some logs (such as JP and JD logs) |
| pertain ONLY to a specific job and do not contain this field. |
| \item[{T[TID]}] This is the ID of the thread emitting the message. Some logs (such as RM) do not use |
| this field so it is omitted. |
| \item[text] This is the component-specific text content of the message. Key messages are described |
| in detail in subsequent sections. |
| |
| \end{description} |
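
For illustration only, a message following this format has the shape shown
below; the timestamp, ids, and placeholder text are hypothetical:
\begin{verbatim}
04 Jun 2015 09:21:03,118 INFO RM.NodePool connectShare J[7483] T[29] <message text>
\end{verbatim}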
| |
| \section{Resource Manager Log (rm.log)} |
| |
| The RM log is designed to show all phases of resource scheduling. Much of the flow of a job can |
| be observed in this log alone. The following specific information is available and is explained in |
| more detail below: |
| \begin{itemize} |
| \item Bootstrap configuration |
| \item Node arrival and missed heartbeats |
| \item Node occupancy |
| \item Job arrival and status updates |
| \item Calculation of job caps |
| \item How-much - fair share |
| \item What-of - host assignment and preemption |
| \item Defragmentation |
| \item Internal schedule |
| \item Published schedule |
| \end{itemize} |
| |
Most useful messages are emitted under log level INFO, but a wealth of detail can
be seen by increasing the log level to DEBUG or TRACE. To do so,
| edit the file |
| {\em \ducchome/resources/log4j.xml} |
| and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza |
| similar to that shown here. Within about a minute the logger will |
| pick up the change and increase its log level. |
| \begin{verbatim} |
    <category name="org.apache.uima.ducc.rm" additivity="true">
      <priority value="debug"/>
      <appender-ref ref="rmlog" />
| </category> |
| \end{verbatim} |
| |
| \subsection{Bootstrap Configuration} |
The RM summarizes its entire configuration when it starts up and prints it to the log, both to
provide context for subsequent data and to verify that the RM is configured as
intended. All of the following are found in the bootstrap section and are mostly
self-explanatory:
| |
| \begin{itemize} |
| \item A pretty-print of the class configuration. This is the same as produced by the {\em check\_ducc -c -v} |
| command. |
| \item A summary of all classes, one per line. This is a more concise display and is similar to the |
| DUCC Classes page in the web server. |
| \item A listing of all RM configuration parameters and the environment including things such as the |
| version of Java, the operating system, etc. |
| \item Nodepool occupancy. As host names are parsed from the {\em ducc.nodes} files, the RM log |
| shows exactly which nodepool each node is added to. |
| \end{itemize} |
| |
The RM logs can wrap quickly under high load, in which case this information is lost.
| |
The following are key RM log lines to search for when examining or verifying RM
initialization. (The leading portions of these messages are trimmed here to shorten the
lines for publication.)
| |
| \paragraph{Initial RM start} |
The first logged line of any RM start will contain the string {\em Starting Component: resourceManager}:
| \begin{verbatim} |
| RM.ResourceManagerComponent- N/A boot ... Starting Component: resourceManager |
| \end{verbatim} |
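
For example, a search such as the following locates that line (a sketch;
adjust the log path for your installation):
\begin{verbatim}
grep "Starting Component" $DUCC_HOME/logs/rm.log
\end{verbatim}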
| |
| \paragraph{RM Node and Class Configuration} |
| The first configuration lines show the reading and validation of the node and class configuration. Look |
| for the string {\em printNodepool} to find these lines: |
| \begin{verbatim} |
| RM.Config- N/A printNodepool Nodepool --default-- |
| RM.Config- N/A printNodepool Search Order: 100 |
| RM.Config- N/A printNodepool Node File: None |
| RM.Config- N/A printNodepool <None> |
| RM.Config- N/A printNodepool Classes: background low normal high normal-all nightly-test reserve |
| RM.Config- N/A printNodepool Subpools: jobdriver power intel |
| ... |
| \end{verbatim} |
| |
| \paragraph{RM Scheduling Configuration} |
Next the RM reads and configures its scheduling parameters and emits the information. It also emits
information about its environment: the ActiveMQ broker, JVM information, OS information, DUCC version, etc.
To find this, search for the string {\em init Scheduler}.
| \begin{verbatim} |
| init Scheduler running with share quantum : 15 GB |
| init reserved DRAM : 0 GB |
| init DRAM override : 0 GB |
| init scheduler : org.apache.uima.ducc.rm.scheduler.NodepoolScheduler |
| ... (more lines) ... |
| |
| init DUCC home : /home/challngr/ducc_runtime |
| init ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true |
| init JVM : Oracle Corporation 1.7.0_45 |
| init JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre |
| init JVM Path : /users/challngr/jdk1.7.0_45/bin/java |
| init JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2099/jmxrmi |
| init OS Architecture : amd64 |
| init OS Name : Linux |
| init DUCC Version : 2.0.0-beta |
| init RM Version : 2.0.0 |
| \end{verbatim} |
| |
| \paragraph{RM Begins to Schedule} |
| The next lines will show the nodes checking in and which nodepools they are assigned to. When the scheduler is |
| ready to accept Orchestrator requests you will see assignment of the JobDriver reservation. At this point |
| RM is fully operational. The confirmation of JobDriver assignment is similar to this: |
| \begin{verbatim} |
| Reserved: |
| ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst |
| R______7434 Job_Drive System JobDriver 1 1 1 0 2 0 0 0 1 |
| \end{verbatim} |
| |
| \subsection{Node Arrival and Missed Heartbeats} |
| \subsubsection{Node Arrival} |
| As each node ``checks in'' with the RM a line is printed with details about the node. Some fields |
| are redundant but are produced by different components processing the node arrival and thus serve |
| as confirmation that all parts are operating correctly. |
| |
| A node arrival entry is of the form: |
| \begin{verbatim} |
| LOGHEADER Nodepool: power Host added: power : bluej290-18 shares 3 total 9: 48128 <none> |
| \end{verbatim} |
| where the fields mean (if the field isn't described here, the value is not relevant to node arrival): |
| \begin{description} |
| \item[LOGHEADER] is the log entry header as described above. |
\item[Nodepool:power] The node is added to the ``power'' nodepool.
\item[bluej290-18] This is the name of the node.
| \item[shares 3] The number of full shares supported on this machine. |
| \item[total 9] This is the total shares in the system after this node arrives. |
\item[48128] This is the schedulable memory, in MB, on that host.
| \end{description} |
| |
| \subsubsection{Missed Heartbeats} |
| The DUCC Agents send out regular ``heartbeat'' messages with current node statistics. These |
messages are used by the RM to determine if a node has failed. If a heartbeat does not arrive
at the expected time, this is noted in the log as a {\em missed heartbeat}. If a specific (configurable) number
| of consecutive heartbeats is missed, the RM marks the node offline and instructs the |
| DUCC Orchestrator to purge the shares so they can be rescheduled. |
| |
| A missed heartbeat log entry is of the form |
| \begin{verbatim} |
| [LOGHEADER] "*** Missed heartbeat ***" NODENAME count[NN] |
| \end{verbatim} |
| where the fields mean: |
| \begin{description} |
| \item[LOGHEADER] is the log entry header as described above. |
\item[*** Missed heartbeat ***] Indicates this is a missed-heartbeat message.
\item[NODENAME] This is the name of the (possibly) errant host.
\item[{count[NN]}] This is the number of CONSECUTIVE missed heartbeats.
| \end{description} |
| |
| Note that it is not unusual to miss the occasional heartbeat or two due to general network or system load. |
| As soon as a heartbeat is received the count is reset to 0. |
| |
If the number of missed heartbeats exceeds the value {\em ducc.rm.node.stability} configured in
{\em ducc.properties}, the node is marked offline and this message is emitted:
| \begin{verbatim} |
[LOGHEADER] "*** ! Notification of node death:" NODENAME
| \end{verbatim} |
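
As a sketch, the corresponding {\em ducc.properties} entry has this shape;
the value shown is illustrative, not necessarily the shipped default:
\begin{verbatim}
# Consecutive missed heartbeats tolerated before a node is marked offline
# (illustrative value)
ducc.rm.node.stability = 5
\end{verbatim}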
| |
| If the node recovers and rejoins, the NodeArrives message as described above is emitted. |
| |
| \subsection{Node Occupancy} |
| {\em Node occupancy} describes, for each node, the capacity of the node, the work assigned to |
that node, and the number of open shares on that node. The RM writes the node occupancy
to its log before computing every new schedule. The occupancy can be found under the log header line:
| \begin{verbatim} |
| [LOGHEADER] Machine occupancy before schedule |
| \end{verbatim} |
| |
NOTE: The current node occupancy can be queried interactively with the
\hyperref[subsec:admin.rm-qoccupancy]{rm\_qoccupancy} command:
| \begin{verbatim} |
| DUCC_HOME/admin/rm_qoccupancy |
| \end{verbatim} |
| |
| Sample node occupancy as displayed in the log follows. The header is included in the log. |
| \begin{verbatim} |
| Name Order Active Shares Unused Shares Memory (MB) Jobs |
| --------------- ----- ------------- ------------- ----------- ------ ... |
| f1n2.bluej.net 16 16 0 255459 206710 206715 207878 206719 207900 |
| f1n4.bluej.net 16 0 16 255459 <none>[16] |
| f7n2.bluej.net 16 0 16 255459 <none>[16] |
| f9n10.bluej.net 16 0 16 255459 <none>[16] |
| f6n1.bluej.net 16 0 16 255459 <none>[16] |
| f7n1.bluej.net 16 3 13 255459 203408 [13] |
| f7n3.bluej.net 16 16 0 255459 206716 207904 206720 206717 206718 |
| f4n10.bluej.net 16 15 1 255459 209155 208975 209153 209155 [1] |
| f7n5.bluej.net 16 16 0 255459 208960 |
| f1n3.bluej.net 16 16 0 255459 205608 206695 207906 206693 206693 |
| f1n1.bluej.net 16 3 13 255459 208913 [13] |
| f6n10.bluej.net 16 3 13 255459 208977 [13] |
| f6n7.bluej.net 16 0 16 255459 <none> [16] |
| f7n6.bluej.net 16 15 1 255459 209155 209151 206701 209155 [1] |
| \end{verbatim} |
| |
| The meaning of each column is: |
| \begin{description} |
| \item[Name] The host name. |
| \item[Order] This is the share order of the node. The number represents the number of quantum shares |
| that can be scheduled on this node. (Recall that an actual process may and usually does |
| occupy multiple quantum shares.) |
| \item[Active Shares] This is the number of quantum shares on the node which are scheduled |
| for work. |
| \item[Unused Shares] This is the number of quantum shares available for new work. |
| \item[Memory] This is the real memory capacity of the node, as reported by the node's |
| Agent process. |
\item[Jobs] Each entry here is the DUCC-assigned id of a job with a process assigned to
            this node. Each entry corresponds to one process. If an ID appears more than
            once, the job has more than one process assigned to the node; see, for example, the
            node {\bf f1n3.bluej.net} with multiple entries for job {\em 206693}.
| |
| When no work is assigned to the node, the string {\bf $<$none$>$} is displayed. |
| |
| When there is a number in brackets, e.g. {\bf [13]} for node {\bf f7n1.bluej.net}, the |
| number represents the number of quantum shares available to be scheduled on the node. |
| \end{description} |
| |
| \subsection{Job Arrival and Status Updates} |
| |
| \paragraph{Orchestrator State Arrival} |
| |
| On a regular basis the Orchestrator publishes the full state of |
| work which may require resources. This is the prime input to the |
| RM's scheduler and must arrive on a regular basis. Every arrival |
| of an Orchestrator publication is flagged in the log as follows. |
If these aren't observed every
\hyperref[itm:props-or.state.publish.rate]{Orchestrator publish
interval}, something is wrong; most likely the Orchestrator or
the ActiveMQ broker has a problem.
| |
| \begin{verbatim} |
| RM.ResourceManagerComponent- N/A onJobManagerStateUpdate -------> OR state arrives |
| \end{verbatim} |
| |
| \paragraph{Job State} |
Immediately after the OR state arrival is logged, the state of all work needing scheduling
is logged. This is tracked by the {\em JobManagerConverter} module in the
RM and logged similarly to the following. It shows the state of each piece of work
of interest and, if that state has changed since the last publication, what the new state is.
| \begin{verbatim} |
| ... |
| RM.JobManagerConverter- 7433 eventArrives Received non-schedulable job, state = Completed |
| RM.JobManagerConverter- 7434 eventArrives [SPR] State: WaitingForResources -> Assigned |
| ... |
| \end{verbatim} |
| |
| \subsection{Calculation Of Job Caps} |
| Prior to every schedule, and immediately after receipt of the Orchestrator state, |
| the RM examines every piece of work and calculates the maximum level of resources the |
| job can physically use at the moment. This handles the {\em expand-by-doubling} |
| function, the {\em prediction} function, and accounts for the amount of work left |
| relative to the resources the work already possesses. |
| |
| The curious or intrepid can see the code that implements this in {\em RmJob.java} method |
| {\em initJobCap()}. |
| |
| The calculation is done in two steps: |
| \begin{enumerate} |
| \item Calculate the projected cap. This uses the prediction logic and amount of |
| work remaining to |
| calculate the {\em largest} number of resources the job can use, if the system |
| had unlimited resources. This is an upper bound on the actual resources |
| assigned to the job. |
\item Adjust the cap down using expand-by-doubling and the initialization state of
      the work. The result of this step is always {\em less than or equal to}
      the projected cap.
| \end{enumerate} |
The goal of this step is to calculate the largest number of resources the job can
actually use at the moment. The FAIR\_SHARE calculations may further revise this
down, but will never revise it up.
| |
If there is no data yet on the initialization state of work, the projected cap cannot
be calculated and a line such as the following is emitted:
| \begin{verbatim} |
| RM.RmJob - 7483 getPrjCap Hilaria Cannot predict cap: init_wait true || time_per_item 0.0 |
| \end{verbatim} |
| |
| If the job has completed initialization the projected cap is calculated based on |
| the average initialization time of all the job processes and the current rate of |
| work-item completion. A line such as this is emitted: |
| \begin{verbatim} |
| RM.RmJob- 7483 Hilaria O 2 T 58626 NTh 28 TI 18626 TR 12469.0 R 2.2456e-03 QR 1868 \ |
| P 132 F 1736 ST 1433260775524 return 434 |
| \end{verbatim} |
| In this particular line: |
| \begin{description} |
| \item[7483] is the job id |
| \item[Hilaria] is the job's owner (userid) |
\item[O 2] This says the job is {\em order 2}: each process will occupy two quantum shares.
| \item[T 58626] is the smallest number of milliseconds until a new process for this job |
| can be made runnable, based on the average initialization time for processes in |
| this job, the Orchestrator publish rate, and the |
| \hyperref[itm:props-rm.prediction.fudge]{\em RM prediction fudge.} |
\item[NTh] This is the number of threads currently executing for this job. It is
| calculated as the (number of currently allocated processes) * (the number of threads |
| per process). |
| \item[TI] This is the average initialization time in milliseconds for processes in this job. |
| \item[TR] This is the average execution time in milliseconds for work items in this job. |
| \item[R] This is the current rate at which the job is completing work items, |
| calculated as (Nth / TR). |
\item[QR] This is the number of work items (questions) remaining to be executed.
| \item[P] This is the projected number of questions that can be completed |
| in the time from ``now'' until a new process can be started and initialized |
| (in this case 58626 milliseconds from now, see above), with the currently |
| allocated resources, calculated as (T * R). |
| \item[F] This is the number of questions that will remain unanswered at the |
| end of the target (T) period, calculated as (QR - P). |
| \item[ST] This is the time the job was submitted. |
| \item[return] This is the projected cap, the largest number of processes this |
| job can physically use, calculated as (F / threads-per-process). |
| |
| If the returned projected cap is 0, it is adjusted up to the number of |
| processes currently allocated. |
| \end{description} |
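
Plugging the numbers from the sample line back into these formulas verifies how
the fields relate (threads-per-process is 4 here, consistent with NTh 28 across
the job's 7 allocated processes):
\begin{eqnarray*}
R & = & \mbox{NTh} / \mbox{TR} = 28 / 12469.0 \approx 2.2456 \times 10^{-3} \\
P & = & T \times R = 58626 \times 2.2456 \times 10^{-3} \approx 132 \\
F & = & \mbox{QR} - P = 1868 - 132 = 1736 \\
\mbox{return} & = & F / \mbox{threads-per-process} = 1736 / 4 = 434
\end{eqnarray*}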
| |
| Once the projected cap is calculated a final check is made to avoid several problems: |
| \begin{itemize} |
| \item Preemption of processes that contain active work but are not using all their |
| threads. This occurs when a job is ``winding down'' and may have more |
| processes than it technically needs, but all processes still are performing work. |
| \item The job may have declared a maximum number of processes to allocate, which is |
| less than the number it could otherwise be awarded. |
| \item If prediction is being used, revise the estimate down to the smaller |
| of the projected cap and the resources currently allocated. |
| \item If initialization caps are being applied and no process in the job has |
| successfully initialized, revise the estimate down to the initialization cap. |
| \item If expand-by-doubling is being used, potentially revise the estimate down |
| to no more than double the currently allocated processes. |
| \end{itemize} |
| |
| The final cap is emitted in a line such as: |
| \begin{verbatim} |
| RM.RmJob- 7483 initJobCap Hilaria O 2 Base cap: 7 Expected future cap: 434 potential cap 7 actual cap 7 |
| \end{verbatim} |
| In this line: |
| \begin{description} |
| \item[7483] is the job id. |
| \item[Hilaria] is the job's user name. |
| \item[O 2] indicates this job uses two quantum shares per processes. |
| \item[Base cap:] This is an upper-bound on the number of processes |
| that can be used in a perfect world. It is calculated by |
| dividing the number of questions by the number of threads per |
| process. It is then revised down by the declared max-processes |
| in the job. In the example above, the job declared |
| max-processes of 7. |
| \item[Expected future cap] This is the projected cap, described above. |
\item[Potential cap] This is the base cap, possibly revised downward
  by the future cap, if it is projected that fewer processes
  would be useful.
| \item[Actual cap] This is the assigned maximum processes to be |
| scheduled for this job, possibly adjusted based on the |
| initialization status of the job and the expand-by-doubling policy. |
| \end{description} |
| |
The {\em actual cap} is the one used to calculate the job's FAIR\_SHARE and
is always the largest number of processes usable in a perfect world. Note
that the FAIR\_SHARE calculation may result in further reduction of this
number.
| |
| \subsection{The ``how much'' calculations} |
| The RM log includes a section that details the fair-share calculations. The details |
| of this are rather involved and out-of-scope for this section. Interested parties |
are welcome to read the scheduler source, in the file {\em NodepoolScheduler.java},
methods {\em countClassShares, countJobShares, countUserShares}, and {\em apportion\_qshares}.
| |
| The logs reveal the inputs to each of the methods above. The overall logic is as follows |
| and can be followed in the logs. |
| |
| \begin{itemize} |
| \item All job classes of equal priority are bundled together and handed to the |
| {\em countClassShares} method. This method assigns some number of shares |
| to each class based on the weighted fair-share logic, using the configured |
| class weights. The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
| INFO RM.NodepoolScheduler- N/A apportion_qshares countClassShares RmCounter Start |
| \end{verbatim} |
| |
| \item All users for each class are passed to the {\em countUserShares} method |
| and then assigned some number of shares from the |
| pool of shares assigned to the class, again |
      using the fair-share computations, but with equal weights.
| The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
   INFO RM.NodepoolScheduler- N/A apportion_qshares countUserShares RmCounter Start
| \end{verbatim} |
| |
| \item All jobs for each user are passed to the {\em countJobShares} method |
| and assigned some number of shares from the pool assigned to the user, using |
| the fair-share calculator with equal weights. |
| The start of this can be seen under log lines similar to this: |
| \begin{verbatim} |
   INFO RM.NodepoolScheduler- N/A apportion_qshares countJobShares RmCounter Start
| \end{verbatim} |
| |
| \item The method {\em apportion\_qshares} is the common fair-share |
| calculator, used by the three routines above. |
| \end{itemize} |
| |
| \subsection{The ``what of'' calculations} |
These calculations are also too involved to discuss in detail in this section.
| |
Interested parties may look in {\em NodepoolScheduler.java}, method
{\em whatOfFairShare}, and {\em NodePool.java} method {\em traverseNodepoolsForExpansion}
to see details.
| |
| The logs track the general flow through the methods above and generally contain |
| enough information to diagnose problems should they arise. |
| |
| The key log message here, other than those sketching logic flow, shows the |
| assignment of specific processes to jobs as seen below. |
| |
| \begin{verbatim} |
| RM.NodePool- 7483 connectShare share bluej290-12.461 order 2 machine \ |
| bluej290-12 false 2 0 2 31744 <none> |
| \end{verbatim} |
This shows job {\em 7483} being assigned a process on host {\em bluej290-12} as
RM share id {\em 461}, which consists of {\em 2 quantum shares} (order 2). Host
bluej290-12 is a 32GB machine with {\em 31744} MB of usable, schedulable memory.
| |
| \subsection{Defragmentation} |
| The RM considers the system's memory pool to be fragmented if the counted |
resources from the ``how much'' phase of scheduling cannot be fully
| mapped to real physical resources in the ``what of'' phase. In short, the |
| ``how much'' phase assumes an ideal, unfragmented virtual cluster. The ``what of'' |
| phase may be unable to make the necessary physical assignments without excessive |
| preemption of jobs that are legitimately at or below their fair share allocations. |
| |
| Intuitively, the ``how much'' phase guarantees that if you could do unlimited |
| shuffling around of the allocated resources, everything would ``fit''. The |
| system is considered fragmented if such shuffling is actually needed. The |
defragmentation process attempts that shuffling, under the constraint of
interrupting the smallest amount of productive work possible.
| |
One scheduling goal, however, is to attempt to guarantee that every job gets
at least some minimal number of its fairly-counted processes. This minimal number
is called the
\hyperref[itm:props-rm.defragmentation.threshold]{defragmentation threshold} and
is configured in ducc.properties. This threshold rigorously defines the
``smallest amount of productive work'' mentioned in the previous paragraph.
| The defragmentation threshold is used in |
| two ways: |
| |
| \begin{enumerate} |
\item Attempt to allocate resources to every work request at least up
      to the level of the defragmentation threshold.
| \item Never steal resources beyond the defragmentation threshold during |
| the ``take from the rich'' phase of defragmentation, described below. |
| \end{enumerate} |
| To accomplish this, a final stage, ``defragmentation'', is |
| performed before publishing the new schedule to the Orchestrator |
| for deployment. |
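
The threshold itself is set in {\em ducc.properties}; a sketch, with an
illustrative value:
\begin{verbatim}
# Minimum processes each job should hold before defragmentation will
# steal from it, and the level up to which needy jobs are topped up
# (illustrative value)
ducc.rm.defragmentation.threshold = 2
\end{verbatim}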
| |
| Defragmentation consists of several steps. The details are again involved, |
| but an understanding of the logic will make following the log relatively |
| straightforward. |
| \begin{itemize} |
| \item Examine every job and determine whether it was assigned |
| all the processes from the ``how much'' phase. If not, it is marked |
| as POTENTIALLY NEEDY. |
| |
| This step is logged with the tag {\em detectFragmentation}. |
| |
| \item Examine every POTENTIALLY NEEDY job to determine if there are |
| sufficient preemptions pending such that the ``how much'' phase will be able |
| to complete as soon as the preemptions complete. If not, the |
| job is marked ACTUALLY NEEDY. |
| |
| This step is also logged with the tag {\em detectFragmentation}. |
| |
| \item For every job marked ACTUALLY NEEDY, examine all jobs in the |
| system already assigned shares to determine which ones can |
| donate some resources to the ACTUALLY NEEDY jobs. These are typically |
| jobs with more processes than their FAIR SHARE, but which, in a |
| perfect, unfragmented layout, would be allocated more resources. These |
| jobs are called {\em rich} jobs. |
| |
| This step is logged with the tags {\em insureFullEviction} and |
| {\em doFinalEvictions}. |
| |
| \item Attempt to match allocations from ``rich'' jobs with jobs that |
| are ACTUALLY NEEDY. If the ACTUALLY NEEDY job is able to use |
| one of the ``rich job'' allocations, the allocation is scheduled for |
| preemption. (Note there are many reasons that a rich job may not |
| have appropriate resources to donate: mismatched nodepool, physical |
| host too small, not preemptable, etc.). |
| |
| This step is logged with the tag {\em takeFromTheRich}. If this |
| step has any successes, the log will also show lines with the |
| tags {\em clearShare} and {\em shrinkByOne} as the resources |
| are scheduled for reuse. |
| |
| \item The needy job is placed in a list of jobs which are given the |
| highest priority for assignment of new processes, at the start of each |
| subsequent scheduling cycle, until such time |
| as they are no longer needy. |
| |
| This step is logged with the tag {\em Expand needy}. |
| \end{itemize} |
| |
| Those who wish to see the details of defragmentation can find them in |
| {\em NodepoolScheduler.java}, starting with the method {\em detectFragmentation} |
| and tracing the flows from there. |
| |
| \subsection{Published Schedule} |
| |
| The schedule gets printed to the log twice on every scheduling cycle. The first |
| form is a pretty-printed summary of all known jobs, showing which ones are |
| getting more resources, {\em expanding}, those which are losing resources, |
| {\em shrinking}, and those which are not changing, {\em stable.} |
| |
| The second form is a {\em toString()} of the structure sent to the Orchestrator, |
| showing the exact resources currently assigned, added, or lost this cycle. |
| |
| \paragraph{The pretty-printed schedule} |
| This entry is divided into five sections. Each section contains one line for |
each relevant job, with largely self-explanatory headers. An example follows
(wrapped here so it fits within a printed page):
| \begin{verbatim} |
| ID JobName User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst |
| J____7485 mega-2 bob low 7 2 14 4 24 11510 11495 false 7 |
| J____7486 mega-1 mary normal 93 2 186 4 28 14768 14764 false 93 |
| |
| \end{verbatim} |
| Here, |
| |
| \begin{description} |
| \item[ID] is the unique DUCC ID of the work, prefixed with an indication of what kind of |
| work it is: Job (J), a Service (S), a Reservation (R), or Managed Reservation (M). |
| \item[JobName] is the user-supplied name / description of the job. |
| \item[User] is the owner of the work. |
| \item[Class] is the scheduling class used to schedule the work. |
\item[Shares] is the number of allocations awarded, which might be processes, or simply reserved space. It
              is a human-readable convenience, calculated as (QShares / Order).
| \item[Order] is the number of share quanta per allocation. |
| \item[QShares] is the total quantum shares awarded to the work. |
\item[NTh] is the declared number of threads per process.
| \item[Memory] is the amount of memory in GB for each allocation. |
| \item[nQuest] is the number of work items (questions) for the job, where relevant. |
| \item[Ques Rem] is the number of work items not yet completed. |
| \item[InitWait] is either {\em true} or {\em false}, indicating whether at least one process |
| has successfully completed initialization. |
| \item[Max P/Nst] is the job-declared maximum processes / instances for the job. |
| \end{description} |
| |
| The five subsections of this log section are: |
| \begin{description} |
| \item[Expanded] This is the list of all work that is receiving more resources this cycle. |
| \item[Shrunken] This is the list of work that is losing resources this cycle. |
| \item[Stable] This is the list of work whose assigned resources do not change this cycle. |
| \item[Dormant] This is the list of work that is unable to receive any resources this cycle. |
| \item[Reserved] This is the list of reservations. |
| \end{description} |
| |
| \paragraph{The Orchestrator Structure} |
| This is a list containing up to four lines per scheduled work. |
| |
| The specific resources shown here are formatted thus: |
| \begin{verbatim} |
| hostname.RM share id^Initialization time |
| \end{verbatim} |
| The {\em hostname} is the name of the host where the resource is assigned. The {\em RM Share} |
| is the unique (to RM only) id of the share assigned to this resource. The {\em Initialization time} |
| is the amount of time spent by the process residing within this resource in its initialization phase. |
| |
| The lines are: |
| \begin{enumerate} |
\item The type of work and its DUCC ID, for example:
| \begin{verbatim} |
| Reservation 7438 |
| \end{verbatim} |
| \item The complete set of all resources currently assigned to the work, for example: |
| \begin{verbatim} |
| Existing[1]: bluej537-7-73.1^0 |
| \end{verbatim} |
| The resources here include all resources the RM tracks as being owned by the job, including |
| older resources, newly assigned resources, and resources scheduled for eviction. The specific |
| resources which are being added or removed are shown in the next lines. |
| |
\item The complete set of resources the RM has scheduled for eviction, but which are not
      yet confirmed freed. For example, here we see 7 resources scheduled for eviction:
| \begin{verbatim} |
| Removals[7]: bluej290-11.465^19430 bluej290-12.461^11802 bluej290-4.460^12672 \ |
| bluej290-5.464^23004 bluej290-2.467^22909 bluej290-7.463^20636 \ |
| bluej290-6.466^19931 |
| \end{verbatim} |
| |
| \item The complete set of resources which are being added to the work in this cycle. For |
| example: |
| |
| \begin{verbatim} |
| Additions[4]: bluej291-43.560^0 bluej291-42.543^0 bluej290-23.544^0 bluej291-44.559^0 |
| \end{verbatim} |
| \end{enumerate} |
| |
| In most cases, if resources cannot be awarded, this section also shows the reason |
| string which is published for the benefit of the web server and the Orchestrator's job monitor: |
| \begin{verbatim} |
| Job 7487 Waiting for defragmentation. |
| Existing[0]: |
| Additions[0]: |
| Removals[0]: |
| \end{verbatim} |
| |
In some cases, a job may show BOTH Additions and Removals. This usually
occurs as a result of the defragmentation step. The job will have been found in need of
new resources during the initial fair-share computation, but later, during defragmentation,
it is also found to be a ``rich'' job which must donate resources to under-allocated work.
Not all of the ``rich'' job's processes may be appropriate for the needy work,
in which case the job may be allowed to expand even as it donates some processes to the
under-allocated work.
| |
This can also occur if resources were previously preempted and, for some reason, the
preemption is taking a long time. If other resources have since been freed, the job
can now re-expand. It is not possible to reverse a preemption (because the actual
state of the preemption is not knowable), so both expansion and shrinkage can be
in progress for the same job.
| |
| \section{Service Manager Log (sm.log)} |
| |
| The service manager log shows the events involved in managing services. These events include |
| \begin{itemize} |
| \item Bootstrap configuration |
\item Receipt and analysis of Orchestrator state
| \item CLI requests: register, modify, start, stop, unregister, etc. |
| \item Dispatching / startup of service instances |
| \item Progression of Service state |
| \item Starting and logging of pingers |
| \end{itemize} |
| |
| To enable finer-grained messages, edit the file |
| {\em \ducchome/resources/log4j.xml} |
| and change the {\em priority} value to {\em debug} (or {\em trace}) in the stanza |
| similar to that shown here. Within about a minute the logger will |
| pick up the change and increase its log level. |
| \begin{verbatim} |
| <category name="org.apache.uima.ducc.sm" additivity="true"> |
| <priority value="debug"/> |
| <appender-ref ref="smlog" /> |
| </category> |
| \end{verbatim} |
| |
| \subsection{Bootstrap configuration} |
| |
| \paragraph{Initial SM start} |
The first logged line of any SM start will contain the string {\em Starting Component: serviceManager}:
| \begin{verbatim} |
| SM.ServiceManagerComponent - N/A boot ... Starting Component: serviceManager |
| \end{verbatim} |
| |
This is followed by a summary of the environment in which the Service Manager is running, including
configuration data and information about the JRE and the hosting hardware.
| |
| \begin{verbatim} |
| Service Manager starting: |
| DUCC home : /home/challngr/ducc_runtime |
| ActiveMQ URL : tcp://bluej537:61617?jms.useCompression=true |
| |
| JVM : Oracle Corporation 1.7.0_45 |
| JAVA_HOME : /users1/challngr/jdk1.7.0_45/jre |
| JVM Path : /users/challngr/jdk1.7.0_45/bin/java |
| JMX URL : service:jmx:rmi:///jndi/rmi://bluej537:2102/jmxrmi |
| |
| OS Architecture : amd64 |
| Crypto enabled : true |
| |
| Test mode enabled : true |
| |
| Service ping rate : 15000 |
| Service ping timeout : 15000 |
| Service ping stability : 10 |
| Default ping class : org.apache.uima.ducc.cli.UimaAsPing |
| |
| Init Failure Max : 1 |
| Instance Failure Max : 3 |
| Instance Failure Window : 10 |
| |
| DUCC Version : 2.0.0-beta |
| SM Version : 2.0.0 |
| \end{verbatim} |
| |
| \paragraph{Initialize the service registry} |
Following this are entries showing the reading and internal initialization of the
service registry, for example:
| \begin{verbatim} |
| SM.ServiceHandler - 411 ServiceStateHandler.registerService \ |
| adding UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 411 |
| \end{verbatim} |
| |
| \paragraph{Wait for the Resource Manager and Orchestrator to become ready} |
| |
| The Service Manager waits until the Resource Manager and Orchestrator are ready. This usually |
| results in lines similar to the following, which are normal and expected. If these lines |
| continue for more than a few minutes it is possible that some part of DUCC has not correctly |
| started: |
| \begin{verbatim} |
| SM.ServiceManagerComponent - N/A orchestratorStateArrives \ |
| Orchestrator JD node not assigned, ignoring Orchestrator state update. |
| \end{verbatim} |
| |
| \paragraph{Begin work} |
| When the Service Manager is ready for business, lines similar to the following are emitted on |
| a regular basis: |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - N/A processIncoming ===== Orchestrator State Arrives ===== |
| \end{verbatim} |
| |
| \subsection{Receipt and analysis of Orchestrator State} |
| |
| \subsection{CLI Requests} |
| |
CLI requests are all logged at log level INFO. Initial receipt of a command
is tagged with the string {\em ServiceManagerComponent} and contains the name of
the command issued, along with its pertinent details.
For example, this shows registration of a new service.
| |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - 428 register ServiceRegisterEvent \ |
| [ninstances=2, autostart=Unset, \ |
| endpoint=UIMA-AS:FixedSleepAE_5:tcp://bluej537:61617, user=challngr] |
| \end{verbatim} |
| |
| Some commands require service ownership or administrative authority. Here we show |
| a service being started, and the authorization check being made. |
| |
| \begin{verbatim} |
| INFO SM.ServiceManagerComponent - N/A start Starting service ServiceStartEvent \ |
| [friendly=430, user=challngr, instances=-1] |
| INFO SM.ServiceHandler - 430 authorized start request from challngr allowed. |
| \end{verbatim} |
| |
| \subsection{Dispatching / Startup of Service Instances} |
| The full set of startup messages is logged when an instance is dispatched to DUCC to |
be scheduled and started. These starts are usually preceded by a message or two
| indicating what triggered the start. For instance, this shows service 427 being |
| started because it was referenced by job 7676: |
| |
| \begin{verbatim} |
| INFO SM.ServiceSet - 427 reference Reference start requested by 7676 |
| INFO SM.ServiceSet - 427 reference References job/service 7676 count[1] implementors [0] |
| INFO SM.ServiceSet - 427 reference Reference starting new service instances. |
| \end{verbatim} |
| |
| This is followed by a line indicating how many instances are to be started: |
| \begin{verbatim} |
| INFO SM.ServiceSet - 427 start Starting instance. Current count 0 needed 2 |
| \end{verbatim} |
| |
| Shortly thereafter the {\em stdout} messages from the submission to the |
| DUCC Orchestrator are shown. These are tagged with the keyword {\em ServiceInstance}. |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start START INSTANCE |
| INFO SM.ServiceInstance - 427 start Start stdout: 050 ducc_ling Version 1.1.2 \ |
| compiled Aug 4 2014 at 06:45:31 |
| \end{verbatim} |
| |
| A few lines later the DUCC-assigned ID of the instance is shown, in |
| this case, ID 7677, for service 427. |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted |
| INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677 |
| \end{verbatim} |
| |
The next lines show the environment used for the service submit.
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: 1104 Running with user and group: \ |
| id 2087 gid 2001 eid 2087 egid 2001 |
| INFO SM.ServiceInstance - 427 start Start stdout: 300 Bypassing redirect of log. |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CORE soft[1024] hard[-1] |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: CPU soft[-1] hard[-1] |
| INFO SM.ServiceInstance - 427 start Start stdout: 4050 Limits: DATA soft[-1] hard[-1] |
| ... (more environment) ... |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[0] = DUCC_SERVICE_INSTANCE=0 |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[1] = JAVA_HOME=/opt/ibm-jdk1.7 |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[2] = LESSCLOSE=lessclose.sh %s %s |
| INFO SM.ServiceInstance - 427 start Start stdout: Environ[3] = ENV=/etc/bash.bashrc |
| \end{verbatim} |
| |
Next the exact command line executed to submit the instance is shown. If the submission
is successful, the log shows {\em START INSTANCE COMPLETE}. The exact progression of
the scheduling and deployment of an
instance is not shown in this log. To observe that, look in
the RM log, searching on the service instance id. In the examples shown here, that
id is 7677.
| |
| \begin{verbatim} |
| INFO SM.ServiceInstance - 427 start Start stdout: 1000 Command to exec: /opt/ibm-jdk1.7/bin/java |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[1]: -cp |
| INFO SM.ServiceInstance - 427 start <INHIBITED CP> |
| INFO SM.ServiceInstance - 427 start <INHIBITED CP> |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[4]: --specification |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[5]: /home/challngr/ducc_runtime/state/services/427.svc |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[6]: --service_id |
| INFO SM.ServiceInstance - 427 start Start stdout: arg[7]: 427 |
| INFO SM.ServiceInstance - 427 start Start stdout: 1001 Command launching... |
| INFO SM.ServiceInstance - 427 start Start stdout: Service instance 7677 submitted |
| INFO SM.ServiceInstance - N/A start Request to start service 427 accepted as service instance 7677 |
| INFO SM.ServiceInstance - 427 start START INSTANCE COMPLETE |
| \end{verbatim} |
| |
| The state progression of the service as it starts is then available, as described in the next section. |
| |
| |
| \subsection{Progression of Service State} |
| |
The state of each service as it starts and stops is tracked at log level INFO. To find the state progression
for any particular service, search on {\em service-id setState}, where service-id is the ID of the service.
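
For example, to follow the progression of service 427 (a sketch; adjust the
log path for your installation):
\begin{verbatim}
grep " 427 setState" $DUCC_HOME/logs/sm.log
\end{verbatim}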
| |
| Here we show the progression for service 427 as it is started and progresses to fully functional (there are |
| other lines logged between these of course). Note that if a service has multiple instances defined, the |
| overall service state is determined by an aggregate of the states of the individual instances. In these |
messages, the ``Inst'' field of each message shows the DUCC ID of the most recently updated service instance
| that triggered the overall state change. |
| |
| \begin{verbatim} |
| SM.ServiceSet - 427 setState State update \ |
| from[Stopped] to[Starting] via[Starting] Inst[7677/Received] |
| SM.ServiceSet - 427 setState State update \ |
| from[Starting] to[Initializing] via[Initializing] Inst[7677/Initializing] |
| SM.ServiceSet - 427 setState State update \ |
| from[Initializing] to[Waiting] via[Waiting] Inst[7677/Running] |
| SM.ServiceSet - 427 setState State update \ |
| from[Waiting] to[Available] via[Available] Inst[7677/Running] |
| \end{verbatim} |
| |
| \subsection{Starting and Logging Pingers} |
| |
| When a pinger is started (or restarted) the event is logged and tagged |
| with the string {\em service-id startPingThread}. Following are a few |
| lines stating the name of the pinger and the Java class used to |
| implement the pinger. For example: |
| |
| \begin{verbatim} |
| INFO SM.ServiceSet - 430 startPingThread Starting service monitor. |
| INFO SM.PingDriver - 430 find RegisteredPinger \ |
| Loading site-registered service monitor from org.apache.uima.ducc.cli.UimaAsPing |
| INFO SM.PingDriver - 430 <ctr> Using ping class org.apache.uima.ducc.cli.UimaAsPing |
| \end{verbatim} |
| |
| If the pingers do not declare loggers, their {\em stdout} and {\em stderr} are |
| captured in the logs, under the tag {\em handleStatistics}. As well, every |
| ping is recorded with its ``info'' string, making it possible to see the |
| exact state of the pinger. For example: |
| |
| \begin{verbatim} |
| INFO SM.PingDriver - 411 handleStatistics Ping ok: UIMA-AS:FixedSleepAE_1:tcp://bluej537:61617 \ |
| Alive[true] Healthy[true] + Info: QDEPTH[0] AveNQ[0] Consum[26] Prod[0] \ |
| minNQ[0] maxNQ[0] expCnt[0] inFlt[0] DQ[0] NQ[0] NDisp[0] \ |
| MetaNode[192.168.4.36] MetaPid[8892:67] |
| \end{verbatim} |
| |
| Pingers always return state to the Service Manager and some of that state |
| affects the SM's operation; in particular, pingers can start and stop |
| specific service instances or change the autostart setting for a |
| service. They also must return success and failure status to the SM. |
| The ping state is logged under the tag {\em service-id signalRebalance} |
| as seen below. If {\em Additions:} or {\em Deletions:} is non-zero, you |
| can expect to see SM automatically start or stop specific instances for |
| the service. |
| \begin{verbatim} |
| INFO SM.ServiceSet - 430 signalRebalance PING: Additions: 0 deletions: 0 \ |
| excessive failures: false implementors 1 references 0 |
| \end{verbatim} |
| |
| \subsection{Publishing State} |
By default the SM log does not include the state as published to the
Orchestrator because it can be voluminous, cluttering the logs and causing
them to wrap too fast. It may be necessary to
increase the log level to DEBUG as described at the start of this section.
| |
The published state is emitted in lines similar to the following. Most
entries show ``Available'', which means either the work has no service
dependencies or the services it depends on are in state Available. If there
is some exceptional condition pertaining to a job, that is shown. In the
sample below, job 251214 is waiting for a specific service whose state itself
is ``waiting''.
| |
| \begin{verbatim} |
| INFO SM.ServiceHandler - 251261 handleModifiedobs No service dependencies, no updates made. |
| INFO SM.ServiceManagerComponent - N/A publish Publishing State, active job count = 102 |
| INFO SM.ServiceManagerComponent - N/A publish Service Map |
| Job 251159 Service state Available |
| Job 251263 Service state Available |
| Job 251192 Service state Available |
| Job 251214 Service state Waiting \ |
| [UIMA-AS:fastqa-Dedicated-Staging-2015-08_cache:tcp://broker42:62616 : waiting] |
| \end{verbatim} |
| |
| |
\section{Orchestrator Log (or.log)}
| To be filled in. |
| |
| \section{Process Manager Log (pm.log)} |
| To be filled in. |
| |
\section{Agent Log (hostname.agent.log)}
| To be filled in. |