%
% Licensed to the Apache Software Foundation (ASF) under one
% or more contributor license agreements. See the NOTICE file
% distributed with this work for additional information
% regarding copyright ownership. The ASF licenses this file
% to you under the Apache License, Version 2.0 (the
% "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing,
% software distributed under the License is distributed on an
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
% KIND, either express or implied. See the License for the
% specific language governing permissions and limitations
% under the License.
%
\section{System Pages}
\label{sec:system-details}
These pages show information relating to the {\DUCC} System itself:
\begin{description}
\item[Administration] This displays the system administrators and provides
the interface to various administrative controls.
\item[Broker] This shows selected information for the system's broker.
\item[Classes] This shows the system's scheduling class definitions.
\item[Daemons] This shows the status of all {\DUCC} processes.
\item[DuccBook] This is a link to the book you are reading.
\item[Machines] This shows details of all the machines (nodes) in the {\DUCC} cluster.
\end{description}
\subsection{Administration}
This page has two tabs:
\begin{description}
\item[Administrators] This shows the user-ids that are authorized to administer
{\DUCC}. In addition to executing the ``Control'' functions described below,
administrators may cancel any job, reservation, or service, and may modify
services they do not own.
In order to perform administrative functions, the following must be satisfied:
\begin{enumerate}
\item The user is logged-in to the web server.
\item The user is a registered administrator.
\item The user has set the role as ``administrator'' in the {\DUCC} Preferences
page. This is a safeguard so that administrators who are also users
are less likely to inadvertently affect other people's jobs.
\end{enumerate}
\item[Control] Currently {\DUCC} supports a single administrative control function
via the web server: blocking new job submissions and re-enabling them. While submissions
are blocked, all existing work runs normally, but no new work is accepted.
\end{description}
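The three conditions listed under Administrators combine as a simple conjunction: if any one
of them fails, administrative functions are unavailable. The following Python sketch is purely
illustrative. The names \texttt{Session} and \texttt{may\_administer}, and the field names,
are hypothetical and are not part of {\DUCC}.
\begin{verbatim}
```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical view of a web server session (not a DUCC class)."""
    user: str        # user-id of the person using the web server
    logged_in: bool  # True once the user has logged in to the web server
    role: str        # role chosen on the Preferences page


def may_administer(session: Session, registered_admins: set[str]) -> bool:
    """Return True only when all three documented conditions hold."""
    return (
        session.logged_in
        and session.user in registered_admins
        and session.role == "administrator"
    )
```
\end{verbatim}
Under these assumptions, a registered administrator who leaves the role set to an ordinary
user in Preferences is treated like any other user, which is the safeguard described above.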
\subsection{Broker}
This page shows selected information for the system's broker.
The information includes the host, port, version, uptime, memory used, threads, load average, topics, and queues.
\subsection{Classes}
This page shows the definitions of the {\DUCC} scheduling classes. The scheduling classes are
discussed in more detail in the \hyperref[sec:rm.job-classes]{Resource Manager} section.
\subsection{Daemons}
\label{sec:system-details.daemons}
This page shows the current state of all {\DUCC} processes. By default, only the administrative
processes (Broker, Database, Orchestrator, ProcessManager, ResourceManager, ServiceManager, and Webserver) are
shown. A button in the upper left of the page titled ``Show Agents'' enables display of
the status of all the {\DUCC} agents as well. (Agents are suppressed by default because the
page is expensive to render for large systems.)
The columns shown on this page include:
\begin{description}
\item[Status] \hfill \\
This indicates whether the daemon is running and broadcasting state ({\em up})
or not ({\em down}).
All {\DUCC} daemons broadcast a heartbeat containing process state. If the Status
is {\em down}, either the daemon is not functioning, or something is preventing
state from reaching the web server via {\DUCC}'s ActiveMQ instance.
\item[Daemon Name] \hfill \\
This is the name of the process.
\item[Boot Time] \hfill \\
This shows the date and time of the latest boot of the specific process.
\item[Host IP] \hfill \\
This is the IP address of the processor where the process is running.
\item[Host Name] \hfill \\
This shows the hostname of the processor where the process is running.
\item[PID] \hfill \\
This is the Unix process ID of the {\DUCC} process.
\item[Publication Size (last)] \hfill \\
This shows the size of the most recent state publication of the process, in bytes.
\item[Publication Size (max)] \hfill \\
This shows the size of the largest state publication of the process, in bytes.
\item[Heartbeat (last)] \hfill \\
This shows the number of seconds since the last state publication for the process.
Large numbers here indicate potential cluster or {\DUCC} problems.
\item[Heartbeat (max)] \hfill \\
This shows the longest delay since a state publication for the process was received
at the web server. Large numbers here indicate potential cluster or {\DUCC} problems.
\item[Heartbeat (max) TOD] \hfill \\
This shows the time of day (TOD) at which the longest delay of a state publication occurred.
\item[JConsole URL] \hfill \\
This is the jconsole URL for the process.
\end{description}
\subsection{Machines}
This page shows the states of all the machines (nodes) in the {\DUCC} cluster.
The columns shown on this page include:
\begin{description}
\item[Status] \hfill \\
This shows the current state of a machine. Values include:
\begin{description}
\item[defined] The node is in the {\DUCC}
\hyperref[sec:admin-ducc.nodes]{nodes file}, but no {\DUCC} process has been
started there, or else there is a communication problem and
the state messages are not being delivered.
\item[up] The node has a {\DUCC} Agent process running on it and the
resource manager is receiving regular heartbeat packets from it.
\item[down] The node had a healthy {\DUCC} Agent on it at some point
in the past (since the last {\DUCC} boot), but the resource manager
has stopped receiving heartbeats from it.
The agent may have been manually shut down, may have crashed, or there
may be a communication problem.
Additionally, very heavy loads from jobs running on the node can cause
the {\DUCC} Agent's heartbeats to be delayed.
\end{description}
\item[IP] \hfill \\
This is the IP address of the node.
\item[Name] \hfill \\
This is the hostname of the node.
\item[Nodepool] \hfill \\
This is the nodepool to which the node belongs.
\item[Memory(GB) usable] \hfill \\
This is the amount of usable memory, in GB, as reported by each machine.
This is the maximum amount that can be allocated by the resource manager.
Usually the amount will be slightly less than the installed memory. This is because
a small bit of memory is usually reserved by the hardware for its own purposes. For
example, a machine with 48GB of installed memory may report only 47GB available.
\item[Memory(GB) free] \hfill \\
This is the amount of free memory, in GB, as reported by each machine.
This is the amount not presently allocated by the resource manager.
\item[CPU] \hfill \\
This is the host's one-minute CPU load average.
\item[Swap(GB) inuse] \hfill \\
This is the total size of in-use swap data. {\DUCC} shows any value greater than 0 in
red because swapping can significantly slow applications. However, swap use does
not always mean there is a performance problem. {\DUCC} flags it simply
as an alert of a potential problem.
\item[Swap(GB) free] \hfill \\
This is the amount of swap space that is currently free, in GB.
\item[C-Groups] \hfill \\
If {\em on}, C-Groups are in use and processes deployed by {\DUCC} will
be limited in resource consumption.
\item[Alien PIDs] \hfill \\
This shows the number of processes on each node that are not owned by {\DUCC}, the
operating system, or jobs scheduled on that node. The Unix process IDs of these processes
are displayed in a hover.
{\DUCC} preconfigures many of the standard operating
\hyperref[itm:props-rogue.process]{system processes} and
\hyperref[itm:props-rogue.user]{userids}. This list may be updated by each
installation.
A common cause of alien PIDs is errant processes run in unmanaged reservations. A
user may reserve a machine for use as a sandbox. If the reservation is released
without properly terminating all the processes, they may linger. When {\DUCC}
schedules the node for other purposes, significant performance penalties may be
paid due to competition between the legitimately scheduled work and the leftover
``alien'' processes. The purpose of this column is to bring attention to this situation.
\item[Heartbeat(last)] \hfill \\
This shows the number of seconds since the last agent heartbeat from this machine.
\end{description}
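The three Status values described above ({\em defined}, {\em up}, and {\em down}) can be read as
a function of the node's heartbeat history. The following Python sketch is one way to picture
that reading and is not {\DUCC}'s implementation. The function name \texttt{machine\_status} and
the \texttt{timeout} parameter, including its default value, are assumptions made for
illustration only.
\begin{verbatim}
```python
from typing import Optional


def machine_status(last_heartbeat_age: Optional[float],
                   timeout: float = 60.0) -> str:
    """Classify a node that is listed in the nodes file.

    last_heartbeat_age: seconds since the last agent heartbeat, or None if
        no heartbeat has been received since the last DUCC boot.
    timeout: hypothetical threshold in seconds; not a DUCC default.
    """
    if last_heartbeat_age is None:
        # Listed in the nodes file, but no agent has been heard from.
        return "defined"
    if last_heartbeat_age <= timeout:
        # Heartbeats are arriving regularly.
        return "up"
    # An agent was healthy earlier, but its heartbeats have stopped.
    return "down"
```
\end{verbatim}
In this reading, a node can move from {\em up} to {\em down} because of delayed heartbeats
alone, which is why a {\em down} status on a heavily loaded node does not necessarily mean
the agent has crashed.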