 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 # Distributed Traffic Monitor

 ## Problem Description
 Currently, TM polls all caches in a CDN. As CDNs grow, this becomes a major
 pain point as TM is limited by the amount of bandwidth and CPU it requires to
 receive and process data from every cache on the CDN, and scaling vertically by
 running it on better hardware is only feasible up to a certain point. Also, the
 performance of a cache observed by a TM which is very far away from it does not
 always reflect the performance observed by clients that are actually using the
 cache (because the clients are typically much closer to it).

 ## Proposed Change
 TM should have the ability to poll only a subset of caches in a CDN and peer
 with other TMs which are monitoring other subsets in order to get a full view
 of the CDN's health. This would allow us to run TM in a more distributed manner
 across the CDN, giving us a view of cache health that is closer to what clients
 actually observe and enabling us to scale TM horizontally. Additionally, we
 would like to have the option to disable _stat polling_ in order for these
 distributed TMs to focus on _health polling_.

 ### Traffic Portal Impact
 This proposal does not require any TP changes.

 ### Traffic Ops Impact
 This proposal might have a limited impact on TO. The existing TO API endpoints
 already provide the data that TM will need to run in a distributed manner, and
 any changes made to TM APIs that TO uses will remain backwards-compatible.
 However, if TO uses any stat-polling-related TM APIs, it may need to be updated
 so that it only sends those requests to TMs that have stat polling enabled.

 ### t3c Impact
 This proposal does not require `t3c` changes. Note: the `tc-health-client`
 periodically polls a random TM to get cache health states. Because distributed
 TMs will still serve the cache health states of all caches in a CDN, the
 `tc-health-client` is not affected: it can continue to poll any random TM and
 still get the cache health data for the entire CDN.

 ### Traffic Monitor Impact
 TM will gain at least two more configuration options:
 - `distributed_polling_enabled` (default: false) - when set to true, TM will
   run in _distributed mode_ (more details on this below). When set to false, TM
   will run in its legacy, normal mode.
 - `stat_polling_disabled` (default: false) - when set to true, TM will _not_ do
   stat polling for caches. When set to false, TM will do stat polling for
   caches (legacy, normal behavior). Initially, this must be set to true if
   `distributed_polling_enabled` is also set to true. In a later phase of
   development, we will add the ability to enable stat polling in distributed
   mode.
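The constraint between the two options can be sketched in Go as follows. This is a minimal sketch, not TM's actual configuration code: the `Config` type and `ParseConfig` function are hypothetical, and it assumes the options appear as top-level JSON keys with the names given above.

```go
package main

import (
	"encoding/json"
	"errors"
)

// Config holds only the two options proposed in this blueprint.
// Hypothetical type: a real TM configuration has many more fields.
type Config struct {
	DistributedPollingEnabled bool `json:"distributed_polling_enabled"`
	StatPollingDisabled       bool `json:"stat_polling_disabled"`
}

// ParseConfig decodes the JSON document and enforces the initial-phase rule
// that distributed mode requires stat polling to be disabled. Keys that are
// absent decode to false, which matches the proposed defaults.
func ParseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, err
	}
	if cfg.DistributedPollingEnabled && !cfg.StatPollingDisabled {
		return Config{}, errors.New(
			"stat_polling_disabled must be true when distributed_polling_enabled is true")
	}
	return cfg, nil
}
```

Rejecting the invalid combination at load time, rather than silently overriding one of the options, keeps the operator's intent explicit. That is a design choice of this sketch, not something the blueprint prescribes.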

 Note: these are configuration options as opposed to profile parameters because
 we currently do not have the capability to have per-profile monitoring.json
 snapshots (or per-TM-server configuration in one snapshot).

 To use _distributed mode_, generally all TMs in the CDN need to be running in
 distributed mode (if they're taking part in the health protocol). It should
 still be possible to run TMs in the _legacy_ (non-distributed) mode in order to
 provide cache stat polling (which is important for Traffic Stats), but they
 should not be set to `ONLINE` in order to keep them from interfering with the
 health protocol.

 While in _distributed mode_, a TM instance will only monitor a subset of
 cachegroups in its given CDN. The number of cachegroups each TM will monitor
 depends on the number of cachegroups that contain TM servers for the CDN. Those
 TM-containing cachegroups will be referred to as "TM groups." A TM group
 contains one or more TM servers, and a CDN can have one or more TM groups. If
 there are N TM groups, each TM group will monitor roughly 1/N of the
 cachegroups in the CDN. Each TM in the group will monitor all of the caches in
 the 1/N portion of cachegroups that its TM group is responsible for. For
 example, if there are 10 cachegroups and 3 TM groups:
 - TM group 1 monitors cachegroups 1-4
 - TM group 2 monitors cachegroups 5-7
 - TM group 3 monitors cachegroups 8-10
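One way to produce a split like the one above is to sort the cachegroup names and hand out contiguous, near-equal chunks. The blueprint does not specify the assignment algorithm, so the following Go sketch is only an assumption for illustration; `AssignCachegroups` and the group names used with it are hypothetical.

```go
package main

import "sort"

// AssignCachegroups splits the cachegroups into contiguous, near-equal chunks,
// one chunk per TM group. Both inputs are copied and sorted first so that every
// TM computes the same assignment from the same data. When the division is not
// even, the first (len(cachegroups) % len(tmGroups)) TM groups get one extra
// cachegroup each.
func AssignCachegroups(cachegroups, tmGroups []string) map[string][]string {
	cgs := append([]string(nil), cachegroups...)
	tgs := append([]string(nil), tmGroups...)
	sort.Strings(cgs)
	sort.Strings(tgs)

	assignment := make(map[string][]string, len(tgs))
	if len(tgs) == 0 {
		return assignment
	}
	base, extra := len(cgs)/len(tgs), len(cgs)%len(tgs)
	start := 0
	for i, tg := range tgs {
		size := base
		if i < extra {
			size++
		}
		assignment[tg] = cgs[start : start+size]
		start += size
	}
	return assignment
}
```

With 10 cachegroups and 3 TM groups, this yields chunks of 4, 3, and 3 cachegroups, matching the example. Because the result depends only on the sorted names, it stays stable as long as no cachegroups or TM groups are added or removed.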

 Because every TM can serve the health state of every cache, distributed TMs
 will need to peer not only with their own group members but also with the
 other TM groups. However, instead of requesting cache health states from all
 out-of-group peers at once, each distributed TM will simultaneously request
 cache health states from just one TM in each of the other TM groups,
 alternating between that group's members in a deterministic, round-robin
 fashion. For this out-of-group peering, a new TM API route will be added that
 returns only the cache health states for caches that the responding TM's group
 is responsible for polling.
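The out-of-group peer selection described above can be sketched in Go. This is a minimal illustration under stated assumptions: `PeersForPoll`, the poll `iteration` counter, and the representation of TM groups as a map from group name to member hostnames are all hypothetical, and the blueprint does not say how the rotation is keyed.

```go
package main

import "sort"

// PeersForPoll returns one TM hostname from every TM group other than ownGroup.
// The member chosen from each group rotates with the poll iteration, so
// successive polls cycle through a group's members in round-robin order.
// Group names and members are sorted so the choice is deterministic.
func PeersForPoll(ownGroup string, groups map[string][]string, iteration uint64) []string {
	names := make([]string, 0, len(groups))
	for name, members := range groups {
		if name != ownGroup && len(members) > 0 {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	peers := make([]string, 0, len(names))
	for _, name := range names {
		members := append([]string(nil), groups[name]...)
		sort.Strings(members)
		peers = append(peers, members[iteration%uint64(len(members))])
	}
	return peers
}
```

A TM in group `A` would call `PeersForPoll("A", groups, n)` on its n-th peer poll and query each returned hostname concurrently, so the number of out-of-group requests per poll grows with the number of TM groups rather than with the number of TM servers.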

 A safety feature will be added to TM (while running in distributed mode) to
 ensure that all cachegroups are polled by at least 1 TM group, and an
 additional profile parameter override will be available in order to manually
 assign cachegroups to TM groups for polling.
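The blueprint does not describe how this safety feature works. As a sketch of the detection half only, the following Go function reports which cachegroups no TM group is assigned to poll, for example after manual overrides have been applied. `UnpolledCachegroups` and the assignment map layout are hypothetical, and what TM does with the result (log, alert, or reassign) is left open.

```go
package main

import "sort"

// UnpolledCachegroups returns the sorted names of cachegroups that do not
// appear in any TM group's assignment. The assignment maps a TM group name to
// the cachegroups that group polls.
func UnpolledCachegroups(all []string, assignment map[string][]string) []string {
	covered := make(map[string]bool)
	for _, cgs := range assignment {
		for _, cg := range cgs {
			covered[cg] = true
		}
	}
	var missing []string
	for _, cg := range all {
		if !covered[cg] {
			missing = append(missing, cg)
		}
	}
	sort.Strings(missing)
	return missing
}
```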

 ### Traffic Router Impact
 This proposal should have no impact on TR.

 ### Traffic Stats Impact
 Because we will be able to disable stat polling on TM, TS will need to poll
 only TMs that actually have stat polling enabled. TMs with stat polling enabled
 should be given a specific server status (other than `ONLINE`) that TS will be
 configured to poll, which might mean creating a new server status specifically
 for that purpose.

 ### Traffic Vault Impact
 This proposal has no impact on Traffic Vault.

 ### Documentation Impact
 Any new configuration options added to TM should be documented, and the steps
 necessary to run TM in a distributed manner as well as how it works should be
 described in some form of documentation (probably the TM admin docs).

 ### Testing Impact
 New TM unit and integration tests should be added where applicable. We also
 recommend running both types of TMs in production (distributed and
 non-distributed) and comparing the cache health states that each type reports.
 This would help uncover any issues with running TM in a distributed manner
 using data from a production environment. However, TR should still get health
 states from the non-distributed TMs until we are confident in the health states
 reported by distributed TMs.
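The comparison between the two types of TMs could be automated with something like the Go sketch below. It assumes each TM's report has already been fetched and reduced to a map from cache name to availability. That representation and the `DiffHealth` name are hypothetical, and the sketch does not model any TM API response format.

```go
package main

import "sort"

// DiffHealth returns the sorted names of caches whose availability differs
// between the two reports, plus caches that appear in only one of them.
func DiffHealth(legacy, distributed map[string]bool) []string {
	var diff []string
	for name, available := range legacy {
		other, ok := distributed[name]
		if !ok || other != available {
			diff = append(diff, name)
		}
	}
	for name := range distributed {
		if _, ok := legacy[name]; !ok {
			diff = append(diff, name)
		}
	}
	sort.Strings(diff)
	return diff
}
```

Some short-lived disagreement is to be expected, since the two TMs poll at different moments, so a check like this is most useful when it flags caches that stay in the diff across several consecutive comparisons.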

 ### Performance Impact
 This proposal allows TM to be scaled horizontally, so operators can increase
 the number of TM groups in order to get the desired amount of load per TM.

 ### Security Impact
 This proposal does not have much impact on security, but allowing TM to scale
 horizontally means that there may be more firewall rules that will need to be
 applied to any new TM servers that are deployed. However, TM will not need any
 _new_ ports opened, assuming the same `httpListener` and `httpsListener`
 configuration is used.

 ### Upgrade Impact
 TMs running in a distributed manner can be upgraded in the same way that
 non-distributed TMs are upgraded today. For instance, we would likely upgrade
 the `OFFLINE` TMs, then set the upgraded TMs to `ONLINE` while simultaneously
 setting the old TMs to `OFFLINE`.

 ### Operations Impact
 There should be little impact on operations other than the effort necessary to
 provision and deploy new TM servers to run in a distributed manner. Existing
 automation can still be used for upgrades, configuration, etc., but automation
 may need a way to differentiate between non-distributed and distributed TMs
 within the same environment so that both types are configured differently.

 Troubleshooting distributed TMs might be more difficult than troubleshooting
 non-distributed TMs as there will be more servers involved. However, the health
 of a cache should always be determined by the same TMs (assuming no new TM
 groups are added to the system), so it would be best to investigate the TM
 servers in the "authoritative" TM group for the cache under investigation. To
 aid this kind of troubleshooting, we may want TM to have an API that returns
 information about which TM groups it thinks are currently monitoring which
 cachegroups.

 ### Developer Impact
 Developers should know that once this change is implemented, there will be two
 different "run modes" for TM -- distributed and non-distributed. TM will do
 certain things differently in the distributed mode compared to the
 non-distributed mode even though the vast majority of things will be the same.
 Therefore, developers will need to take care to ensure the proper behavior is
 followed depending on which "run mode" TM is in.

 Also, because this proposal will allow TMs to monitor only a subset of caches,
 it may make it easier to set up a development environment using production-like
 data and caches. It is somewhat infeasible for most TM development environments
 to poll an entire, large CDN, but with distributed TM groups, developers could
 essentially choose how many caches they want their local TM to poll.

 ## Alternatives

 - Cache Self-Monitoring: Make caches monitor themselves by using remap rules,
   essentially replacing TM's Cache Health Monitoring. The
   [Proof-of-Concept](https://github.com/apache/trafficcontrol/pull/4529) has
   more details.

 ## Dependencies
 This proposal does not intend to add any new dependencies.

 ## References
 The following mailing list threads were related to this blueprint:
 - [Proposal: Distributed Health Monitoring](https://lists.apache.org/thread.html/rf3307f824c0f82892cbb0fea74a5c6a274c8ea4f303d125e8f1212da%40%3Cdev.trafficcontrol.apache.org%3E)
 - [Distributed Traffic Monitor Feedback/Requirements](https://lists.apache.org/thread.html/rf985a2b9e8a440d396a0097a71882919bff5b3cb5f8d6c3a53143162%40%3Cdev.trafficcontrol.apache.org%3E)