 Node Task Registry Design
 =========================

 .. toctree::
    :maxdepth: 2
    :caption: Contents:


 ****************************
 Node Tasks
 ****************************

 #####################
 Basic Task Design
 #####################
 A Warble node can have one or more (or all) tasks assigned to it. Each
 task consists of a target to test, as well as what to test and how to
 test it, encapsulated in a payload object. Each check you wish to
 perform requires its own task, but a task may be performed by multiple
 nodes. Thus, testing whether your main web site works on port 80
 requires one task, and testing HTTPS on port 443 requires another, as
 they are technically two distinct targets. Specific tasks may have
 optional tests built into them, for instance an SSL certificate check
 on an HTTPS site.
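
 As an illustration, a task could be modeled as a target plus a payload
 object. This is a minimal sketch only: the class and field names
 (``Task``, ``target``, ``payload``, ``check_certificate``) are
 hypothetical and are not taken from the Warble code base.

 .. code-block:: python

     from dataclasses import dataclass, field


     @dataclass
     class Task:
         """Hypothetical task record: one target plus a payload describing the test."""
         name: str
         target: str                                   # what to test
         payload: dict = field(default_factory=dict)   # what to test and how


     # Port 80 and port 443 are distinct targets, so each gets its own task.
     http_task = Task("main site (http)", "www.example.org:80", {"type": "http"})
     https_task = Task(
         "main site (https)",
         "www.example.org:443",
         # An optional built-in test, here a certificate check (assumed key name).
         {"type": "https", "check_certificate": True},
     )

 Several nodes could then be handed the same ``Task`` object, which
 matches the idea that one task may be performed by multiple nodes.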

 #####################
 Task Status
 #####################
 A task can be enabled, disabled, or muted. Disabling a task prevents it
 from running on nodes, whereas muting a task will still cause nodes to
 perform it, but alerting will be silenced. Muting is useful when you
 still need to monitor a situation but do not need to be reminded
 whenever the test results change.
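
 The difference between the three states can be sketched as two
 independent questions: does the task run, and does it alert? The enum
 and function names below are illustrative assumptions, not Warble's
 actual API.

 .. code-block:: python

     from enum import Enum


     class TaskStatus(Enum):
         ENABLED = "enabled"
         DISABLED = "disabled"
         MUTED = "muted"


     def task_runs(status: TaskStatus) -> bool:
         """Nodes perform the task unless it is disabled."""
         return status is not TaskStatus.DISABLED


     def task_alerts(status: TaskStatus) -> bool:
         """Only enabled tasks alert; muted tasks run but stay silent."""
         return status is TaskStatus.ENABLED

 A muted task is therefore the only state in which ``task_runs`` is true
 while ``task_alerts`` is false.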

 #####################
 Task Sensitivity
 #####################
 A task can also have a specific sensitivity set. Sensitivity denotes
 how failures are treated, and when to alert about state changes:

 - **low**: Alerting only happens if all currently active nodes agree that
   the test has failed, e.g. the service is down completely.

 - **default**: Alerting happens if a majority of nodes agree that the test
   has failed. This is the default behavior and balances out the need for
   speedy alerting versus the need for fewer false positives.

 - **high**: Alerting happens if more than one node sees failures. While
   more sensitive than the default, it still removes a fair number of
   false positives by requiring confirmation of a reported failure by at
   least one other node.

 - **twitchy**: Alerting happens if any node registers a failure.
   This may be useful for services that have guaranteed service level
   agreements, but can lead to a lot of false positives.

 Note that if you run Warble with only one or very few nodes attached,
 the sensitivity levels may differ very little in when alerting happens,
 as the definition of quorum changes based on how many nodes are active
 at any given time.
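
 The four levels can be read as different failure thresholds over the
 set of active nodes. The sketch below is one possible interpretation,
 not Warble's implementation: it assumes "majority" means a strict
 majority, and that the **high** threshold of two nodes is capped at the
 number of active nodes. The function names are hypothetical.

 .. code-block:: python

     SENSITIVITIES = ("low", "default", "high", "twitchy")


     def required_failures(sensitivity: str, active_nodes: int) -> int:
         """Return how many failing nodes are needed before alerting."""
         if active_nodes < 1:
             raise ValueError("at least one active node is required")
         if sensitivity == "low":
             return active_nodes             # every active node must agree
         if sensitivity == "default":
             return active_nodes // 2 + 1    # strict majority (assumption)
         if sensitivity == "high":
             return min(2, active_nodes)     # more than one node, capped (assumption)
         if sensitivity == "twitchy":
             return 1                        # any single failure
         raise ValueError(f"unknown sensitivity: {sensitivity!r}")


     def quorum_reached(sensitivity: str, failed_nodes: int, active_nodes: int) -> bool:
         """True if enough active nodes report a failure to trigger an alert."""
         if active_nodes < 1:
             return False
         return failed_nodes >= required_failures(sensitivity, active_nodes)


     # With five active nodes the levels differ; with one node they collapse.
     five = [required_failures(s, 5) for s in SENSITIVITIES]   # [5, 3, 2, 1]
     one = [required_failures(s, 1) for s in SENSITIVITIES]    # [1, 1, 1, 1]

 Under these assumptions a single-node setup alerts on the first failure
 regardless of the chosen sensitivity, which is why the levels matter
 most when several nodes are active.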

 *******************
 Task Categories
 *******************
 Each task is assigned a task category, which helps you separate tasks
 into easily recognizable groups and define access to them.

 Each task category has a distinct alerting and escalation path, meaning
 you can assign different teams to different categories, and have alerts
 go to that team, independent of other task categories. This can be
 useful for having front-end issues go to a specific team, while back-end
 issues go to another team.
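
 Per-category routing of this kind can be pictured as a lookup from
 category to an ordered escalation path. The mapping, the addresses, and
 the ``recipients_for`` helper below are purely illustrative assumptions;
 Warble's actual alerting configuration may look quite different.

 .. code-block:: python

     # Hypothetical configuration: category name -> ordered escalation path.
     ALERT_PATHS = {
         "front-end": ["frontend-oncall@example.org", "frontend-lead@example.org"],
         "back-end": ["backend-oncall@example.org", "backend-lead@example.org"],
     }


     def recipients_for(category: str, escalation_step: int = 0) -> list:
         """Return the contacts to notify at the given escalation step."""
         path = ALERT_PATHS.get(category, [])
         # Notify everyone up to and including the current escalation step.
         return path[: escalation_step + 1]

 Because each category owns its own path, an alert in one category never
 reaches the contacts of another.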

 #####################
 Task Category Access
 #####################
 Users can be assigned the following access levels to categories, on a
 per-user basis:

 1. Read-only access: The user can read and analyze test results, but
    cannot edit or remove tasks, nor see the specific payload details
    (thus, if you add a test with credentials, users with read-only
    access cannot see the credentials).

 2. Read/write access: The user can read, modify, and remove existing
    tests. They can also add new tests to the category.

 3. Admin access: The user can, besides permissions listed above, also
    modify or remove the category altogether or change its alerting
    options. This access level should generally be reserved for power
    users only.

 Note that `super users` on the system (such as the account you create
 at setup) can freely access and modify any aspect of tasks and
 categories.
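
 Since each access level includes the permissions of the levels below
 it, the rules can be sketched as an ordered enum with a minimum level
 per action. The action names are hypothetical, and the sketch assumes
 that read/write users may view payload details, which the text implies
 but does not state outright.

 .. code-block:: python

     from enum import IntEnum


     class Access(IntEnum):
         READ_ONLY = 1
         READ_WRITE = 2
         ADMIN = 3


     # Minimum level assumed for each action (action names are illustrative).
     REQUIRED_LEVEL = {
         "view_results": Access.READ_ONLY,
         "view_payload": Access.READ_WRITE,
         "add_task": Access.READ_WRITE,
         "edit_task": Access.READ_WRITE,
         "remove_task": Access.READ_WRITE,
         "edit_category": Access.ADMIN,
         "remove_category": Access.ADMIN,
         "edit_alerting": Access.ADMIN,
     }


     def is_allowed(action: str, level: Access, superuser: bool = False) -> bool:
         """Check an action against the user's level for one category."""
         if superuser:
             return True   # super users bypass per-category levels
         return level >= REQUIRED_LEVEL[action]

 For example, ``is_allowed("view_payload", Access.READ_ONLY)`` is false,
 so credentials stored in a payload stay hidden from read-only users.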