Monitoring
==========
The [current Whimsy server status](https://whimsy.apache.org/status/) is represented
as a tree of named nodes, which is built by the [status monitoring code](https://github.com/apache/whimsy/tree/master/www/status/).
Nodes, names, and strings
-------------------------
Each major branch is produced by a [monitor](monitors). Each monitor can
return a tree of nodes, or a single String, or an array of Strings. The name
of the monitor is used as the name of the node produced.
Leaf nodes consist of a String, an array of Strings, or a Hash where one
element in the Hash has a key of `data` with a value of either a String or an
array of Strings.
Non-leaf nodes consist of a Hash where one element in the Hash has a key of
`data` with a value that is a Hash of names and child nodes.
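
As an illustration of the shapes described above, here is a minimal sketch of the three leaf forms and one non-leaf form. The node names, the strings, the use of symbol keys, and the `leaf?` helper are all assumptions made for this example; they are not taken from the monitoring code.

```ruby
# Hypothetical node values; every name and string below is invented.

# Leaf node: a single String.
UPTIME_NODE = 'server responding'

# Leaf node: a Hash whose `data` is an array of Strings.
DISK_NODE = {
  data: ['/ is 41% full', '/srv is 63% full']
}

# Non-leaf node: a Hash whose `data` is a Hash of names and child nodes.
SERVICES_NODE = {
  data: {
    'mail' => 'queue empty',                      # leaf (String)
    'svn'  => ['checkout ok', 'update ok'],       # leaf (array of Strings)
    'git'  => { data: 'all clones up to date' }   # leaf (Hash)
  }
}

# Hypothetical helper: a node is a leaf unless its `data` is itself a Hash
# of child nodes.
def leaf?(node)
  !(node.is_a?(Hash) && node[:data].is_a?(Hash))
end
```

With these definitions, `leaf?(SERVICES_NODE)` is false, while `leaf?` is true for the other two nodes and for each child of `SERVICES_NODE`.
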
Levels
------
Each node is associated with a status *level*. Valid levels are `success`,
`info`, `warning`, `danger`, and `fatal`. (The first four levels are modelled
after Bootstrap [alerts](http://getbootstrap.com/components/#alerts)).
The default level for valid leaf nodes is `success`. Invalid leaf nodes (e.g., a
node consisting of a `nil` value) have a level of `danger`. Only leaf nodes
that are in the form of a Hash can carry an explicit level; leaf nodes that
are not Hashes will be normalized into a Hash with a `level` and `data`.
The default level for a non-leaf node is the highest level among its child
nodes (where `fatal` > `danger` > `warning` > `info` > `success`). Normally
monitors will not assign level values for non-leaf nodes.
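
The level defaults can be sketched as a small recursive function. This is not the actual normalization code: the method name `normalize_level`, the symbol keys, and the choice of `success` for a non-leaf node with no children are assumptions made for illustration.

```ruby
# Levels in increasing order of severity, as listed in this document.
LEVELS = %w[success info warning danger fatal].freeze

# Hypothetical sketch: fill in default levels for a node and its children.
def normalize_level(node)
  # Leaf nodes that are not Hashes are wrapped in a Hash with level and data;
  # a nil value is treated as an invalid leaf.
  unless node.is_a?(Hash)
    return { level: node.nil? ? 'danger' : 'success', data: node }
  end

  # Hash leaf: keep an explicit level, otherwise default to success.
  return { level: 'success' }.merge(node) unless node[:data].is_a?(Hash)

  # Non-leaf: normalize the children, then take the highest child level
  # unless the monitor assigned a level explicitly.
  children = node[:data].transform_values { |child| normalize_level(child) }
  highest = children.values.map { |c| c[:level] }
                    .max_by { |level| LEVELS.index(level).to_i }
  { level: highest || 'success' }.merge(node).merge(data: children)
end
```

For example, a branch containing one `success` leaf, one `warning` leaf, and one `nil` leaf would be given the level `danger`.
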
Titles
------
Non-leaf nodes have a *title* describing the contents of the children. Titles
show up as tooltips in the browser.
The default title is either a list or a count of the names of the child nodes
with the highest status level. Again, monitors normally will not assign title
values for nodes.
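
A minimal sketch of how such a default title might be computed follows. The method name `default_title`, the `max_names` threshold that switches from a list to a count, and the exact wording of the generated title are assumptions; the text above only states that the title is a list or a count.

```ruby
# Levels in increasing order of severity, as listed in this document.
TITLE_LEVELS = %w[success info warning danger fatal].freeze

# Hypothetical sketch: `children` maps names to normalized child nodes,
# each of which already has a :level.
def default_title(children, max_names: 3)
  return nil if children.empty?

  # Find the highest level present among the children.
  top = children.values.map { |child| child[:level] }
                .max_by { |level| TITLE_LEVELS.index(level).to_i }

  # Collect the names of the children at that level.
  names = children.select { |_name, child| child[:level] == top }.keys

  # List the names when there are few of them, otherwise report a count.
  if names.length <= max_names
    "#{top}: #{names.join(', ')}"
  else
    "#{names.length} #{top} items"
  end
end
```
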
Text
----
Although it is somewhat rare, a node may have *text*, which is used in place
of the name of the node for display purposes (the name continues to be used to
produce the anchor id for the element for linking purposes).
Internally, exceptions returned by a monitor are converted to a leaf node with
a name of `exception`, a title containing the exception, and data consisting
of a stack traceback.
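
The exception handling described above might look roughly like the following sketch. The wrapper method `run_monitor`, the symbol keys, and the `danger` level given to the exception node are assumptions for illustration; only the node name `exception`, the title, and the traceback data come from the description above.

```ruby
# Hypothetical wrapper: run a monitor block and convert any exception it
# raises into a branch holding a single leaf node named `exception`.
def run_monitor
  yield
rescue StandardError => e
  {
    data: {
      'exception' => {
        level: 'danger',          # assumed level for a failed monitor
        title: e.inspect,         # title containing the exception
        data: e.backtrace || []   # stack traceback, one String per frame
      }
    }
  }
end
```

A monitor that completes normally has its result passed through unchanged, so `run_monitor { 'ok' }` returns `'ok'`.
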
Href
----
Leaf nodes may have an *href*, which will be used as the target for the link
used to display the contents of the leaf node (either a single String or an
array of Strings).
Mtime
-----
Anchors and the top of each major branch emanating from the root have an
*mtime* value which indicates when that data was last updated. This is
described in the Control Flow section below.
Leaf nodes can have an mtime value in place of data. Such values will be
converted to local time and displayed as the last update value. Hovering over
such items will show the specified time in GMT, in ISO-8601 format.
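
To make the two representations concrete, here is a small sketch of the conversion. It assumes the mtime is stored as an ISO-8601 string and performs the conversion in Ruby purely for illustration; the method name `describe_mtime` is hypothetical, and the status page may well do this step in the browser instead.

```ruby
require 'time'

# Hypothetical sketch: derive the displayed text and the hover text
# from an mtime given as an ISO-8601 string.
def describe_mtime(mtime)
  utc = Time.iso8601(mtime).utc
  {
    text: utc.getlocal.strftime('%Y-%m-%d %H:%M'),  # local time, shown inline
    hover: utc.iso8601                              # GMT, ISO-8601, shown on hover
  }
end
```
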
Control Flow
============
Fetching the [status](https://whimsy.apache.org/status/) web page, which
can be done either by browsers or pings, results in a call to
[index.cgi](https://github.com/apache/whimsy/blob/master/www/status/index.cgi).
If it has been more than 60 seconds since the last status update, index.cgi
will call
[monitor.rb](https://github.com/apache/whimsy/blob/master/www/status/monitor.rb).
monitor.rb will load and then call each of the monitors defined in the
[monitors](https://github.com/apache/whimsy/tree/master/www/status/monitors)
subdirectory.
Monitors are simple class methods. Monitors can assume that they are called
no more often than once a minute, and are passed the normalized results of the
previous call.
As monitors are called in response to a ping, they are expected to produce
results in sub-second time in order to avoid the ping timing out. (Monitors
are run in separate threads to minimize the total elapsed time.) Monitors
that perform activities that take a substantial amount of time may elect to do
so less frequently than once a minute, and can take advantage of the `mtime`
values to determine when to do so.
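
As a sketch of that throttling pattern, a monitor written as a class method could reuse its previous result until enough time has passed. The module name `ExampleMonitor`, the method names, the ten-minute interval, the symbol keys, and the assumption that the previous status exposes its `mtime` as an ISO-8601 string are all hypothetical.

```ruby
require 'time'

# Hypothetical monitor illustrating mtime-based throttling.
module ExampleMonitor
  SLOW_CHECK_INTERVAL = 600 # seconds between expensive runs (assumed value)

  # previous_status: the normalized result of the previous call.
  # now: injectable clock, which makes the method easy to exercise.
  def self.slow_check(previous_status, now: Time.now)
    last = previous_status[:mtime]

    # Reuse the earlier result if it is still recent enough.
    if last && now - Time.parse(last) < SLOW_CHECK_INTERVAL
      return previous_status
    end

    # Otherwise do the expensive work and record when it was done.
    { data: expensive_probe, mtime: now.getutc.iso8601 }
  end

  # Placeholder for work that takes a substantial amount of time.
  def self.expensive_probe
    'probe completed'
  end
end
```

Passing the clock in as a keyword argument is a design choice for this example only; it allows both the fresh and the stale path to be checked without waiting.
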
Results are collected into a hash, and that hash is then normalized.
Normalization resolves default values for items like levels and titles
recursively.
The normalized status is written to disk as [status.json](status.json), and
is used as the response to pings that occur less than a minute after the
previous status update.
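
The caching step in this flow can be sketched as follows. This is not the code in index.cgi: the method name `current_status`, the use of the file's modification time to decide freshness, and the block that stands in for running the monitors are assumptions made for illustration.

```ruby
require 'json'

MAX_AGE = 60 # seconds, matching the interval described above

# Hypothetical sketch: return the cached status if it is recent enough,
# otherwise call the block to recompute it and write the result to disk.
def current_status(cache_path, now: Time.now)
  if File.exist?(cache_path) && now - File.mtime(cache_path) < MAX_AGE
    return JSON.parse(File.read(cache_path))
  end

  status = yield # stands in for loading and calling the monitors
  File.write(cache_path, JSON.pretty_generate(status))
  status
end
```

A caller would pass the path to status.json and a block that collects and normalizes the monitor results; a second call within a minute returns the stored JSON without invoking the block.
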
Alerts
======
The Apache Software Foundation infrastructure team uses
[Nodeping](https://nodeping.com/reports/status/70MTNEPXE6) to monitor
status. A dozen or more servers around the world check status regularly,
and will report failure results to the infrastructure
[Slack](https://the-asf.slack.com/) channel. _Important:_ The Infrastructure
team ensures the underlying VM is up; the Whimsy PMC is responsible for
the server software running inside the VM.
There are currently two Nodeping checks:
- Public facing: this checks the status return from the public URL https://whimsy.apache.org/incubator/podlings/by-age; this gives [Public check results](https://nodeping.com/reports/statusevents/check/2018042000290QH9Q-OZZ2KBZC)
- Full status: this checks the status return from https://whimsy.apache.org/status/; this gives [Status results](https://nodeping.com/reports/statusevents/check/2018042000290QH9Q-UMFGNACX)
While the full status for Whimsy is represented as a tree of nodes, each
assigned one of the levels described above, and containing either child nodes
or one or more strings, all the infrastructure team is currently concerned
with is a boolean status (`success` and `info` are treated as success, and
`warning` and `danger` are treated as failure) and the computed title for the
root node.
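
The reduction from the full tree to what the infrastructure team looks at can be sketched in a few lines. The method name `ping_summary` and the string keys are assumptions, and so is the treatment of any level other than `success` or `info` (including `fatal`, which the text above does not mention) as a failure.

```ruby
# Levels that count as success for alerting purposes.
PASSING_LEVELS = %w[success info].freeze

# Hypothetical sketch: reduce the normalized root node to a boolean
# status plus the computed title.
def ping_summary(root)
  {
    ok: PASSING_LEVELS.include?(root['level']),
    title: root['title']
  }
end
```
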