<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Distributed Traffic Monitor
## Problem Description
Currently, TM polls all caches in a CDN. As CDNs grow, this becomes a major
pain point: TM is limited by the bandwidth and CPU required to receive and
process data from every cache in the CDN, and scaling vertically by running it
on better hardware is only feasible up to a point. In addition, the performance
of a cache observed by a TM that is very far away from it does not always
reflect the performance observed by the clients actually using the cache
(because those clients are typically much closer to it).
## Proposed Change
TM should have the ability to poll only a subset of caches in a CDN and peer
with other TMs which are monitoring other subsets in order to get a full view
of the CDN's health. This would allow us to run TM in a more distributed manner
across the CDN, giving us a view of cache health that is closer to what clients
actually observe and enabling us to scale TM horizontally. Additionally, we
would like to have the option to disable _stat polling_ in order for these
distributed TMs to focus on _health polling_.
### Traffic Portal Impact
This proposal does not require any TP changes.
### Traffic Ops Impact
This proposal might have limited impact on TO. The existing TO API endpoints
already provide the data that TM will need to run in a distributed manner, and
any changes made to TM APIs that TO uses will remain backwards-compatible.
However, TO may need to be updated if it uses any stat-polling-related TM APIs
so that it only sends those requests to TMs that have stat polling enabled.
### t3c Impact
This proposal does not require `t3c` changes. Note: the `tc-health-client`
periodically polls a random TM to get cache health states, and because
distributed TMs will still serve the cache health states of all caches in a
CDN, there will be no impact on the `tc-health-client`. It can continue to poll
any random TM and still get all the cache health data for the entire CDN.
### Traffic Monitor Impact
TM will gain at least two more configuration options:
- `distributed_polling_enabled` (default: false) - when set to true, TM will
run in _distributed mode_ (more details on this below). When set to false, TM
will run in its legacy, normal mode.
- `stat_polling_disabled` (default: false) - when set to true, TM will _not_ do
stat polling for caches. When set to false, TM will do stat polling for
caches (legacy, normal behavior). Initially, this must be set to true if
`distributed_polling_enabled` is also set to true. In a later phase of
development, we will add the ability to enable stat polling in distributed
mode.
Note: these are configuration options, as opposed to profile parameters,
because there is currently no support for per-profile monitoring.json
snapshots (or for per-TM-server configuration in a single snapshot).
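As a rough sketch of how these options might surface in TM's JSON
configuration (the struct, field names, and validation shown here are
assumptions for illustration, not the final implementation):

```go
package config

import (
	"encoding/json"
	"errors"
)

// Config is a minimal sketch of TM's configuration with the two new options.
type Config struct {
	// ... existing TM options elided ...
	DistributedPollingEnabled bool `json:"distributed_polling_enabled"`
	StatPollingDisabled       bool `json:"stat_polling_disabled"`
}

// Load parses the configuration. Both new options default to false (Go's
// zero value), which preserves the legacy, non-distributed behavior.
func Load(raw []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return Config{}, err
	}
	// Initially, distributed mode requires stat polling to be disabled.
	if cfg.DistributedPollingEnabled && !cfg.StatPollingDisabled {
		return Config{}, errors.New("distributed_polling_enabled requires stat_polling_disabled")
	}
	return cfg, nil
}
```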
Generally, to use _distributed mode_, all TMs in the CDN that take part in the
health protocol need to be running in distributed mode. It should
still be possible to run TMs in the _legacy_ (non-distributed) mode in order to
provide cache stat polling (which is important for Traffic Stats), but they
should not be set to `ONLINE` in order to keep them from interfering with the
health protocol.
While in _distributed mode_, a TM instance will only monitor a subset of
cachegroups in its given CDN. The number of cachegroups each TM will monitor
depends on the number of cachegroups that contain TM servers for the CDN. These
will be referred to as "TM groups." A TM group contains 1 to many TM servers,
and a CDN can have 1 to many TM groups. If there are N TM groups, each TM group
will monitor roughly 1/N of the cachegroups in the CDN. Each TM in a group
will monitor all of the caches in the 1/N portion of cachegroups that its TM group
is responsible for. For example, if there are 10 cachegroups and 3 TM groups:
- TM group 1 monitors cachegroups 1-4
- TM group 2 monitors cachegroups 5-7
- TM group 3 monitors cachegroups 8-10
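A minimal sketch of one way this assignment could be computed
deterministically (sort the cachegroup names, then split them into N nearly
equal contiguous slices); this illustrates the idea rather than prescribing
the actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// assignCachegroups returns the cachegroups that the TM group at
// tmGroupIndex (0-based, out of numTMGroups) is responsible for. Sorting
// first guarantees that every TM computes the same assignment.
func assignCachegroups(cachegroups []string, numTMGroups, tmGroupIndex int) []string {
	sorted := append([]string(nil), cachegroups...)
	sort.Strings(sorted)
	base := len(sorted) / numTMGroups
	extra := len(sorted) % numTMGroups // the first `extra` groups take one more
	start := tmGroupIndex * base
	size := base
	if tmGroupIndex < extra {
		start += tmGroupIndex
		size++
	} else {
		start += extra
	}
	return sorted[start : start+size]
}

func main() {
	cgs := []string{"cg01", "cg02", "cg03", "cg04", "cg05",
		"cg06", "cg07", "cg08", "cg09", "cg10"}
	for g := 0; g < 3; g++ {
		// Prints the 4/3/3 split from the example above.
		fmt.Printf("TM group %d monitors %v\n", g+1, assignCachegroups(cgs, 3, g))
	}
}
```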
Because every TM can serve the health state of every cache, distributed TMs
will need to peer not only with their own group members but also with other
groups. However, instead of simultaneously requesting cache health
states from all out-of-group peers, each distributed TM will simultaneously
request cache health states from 1 TM in every other TM group, alternating
between group members in a deterministic, round-robin fashion. For this
out-of-group peering, a new TM API route will be added that returns only the
cache health states for caches that the TM group is responsible for polling.
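As a hedged sketch of that round-robin selection (the types and names here
are assumptions for illustration): on each polling cycle, a TM picks the next
member, in sorted order, from every TM group other than its own.

```go
package main

import (
	"fmt"
	"sort"
)

// peerSelector cycles deterministically through each other TM group's
// members, advancing once per polling cycle.
type peerSelector struct {
	cycle int
}

// pick returns one peer hostname from every TM group other than `self`.
// groups maps TM group name -> member hostnames.
func (s *peerSelector) pick(groups map[string][]string, self string) []string {
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for determinism
	var picked []string
	for _, name := range names {
		if name == self || len(groups[name]) == 0 {
			continue
		}
		members := append([]string(nil), groups[name]...)
		sort.Strings(members)
		picked = append(picked, members[s.cycle%len(members)])
	}
	s.cycle++
	return picked
}

func main() {
	s := &peerSelector{}
	groups := map[string][]string{
		"tmgroup-a": {"tm-a-01", "tm-a-02"},
		"tmgroup-b": {"tm-b-01", "tm-b-02", "tm-b-03"},
		"tmgroup-c": {"tm-c-01"},
	}
	for i := 0; i < 3; i++ {
		fmt.Println(s.pick(groups, "tmgroup-a")) // alternates through group b's members
	}
}
```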
A safety feature will be added to TM (while running in distributed mode) to
ensure that all cachegroups are polled by at least 1 TM group, and an
additional profile parameter override will be available in order to manually
assign cachegroups to TM groups for polling.
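The coverage half of that safety feature could be as simple as verifying that
the union of all TM groups' assignments (including any manual overrides)
claims every cachegroup; a minimal sketch, with assumed names:

```go
package main

import "fmt"

// uncoveredCachegroups returns the cachegroups not claimed by any TM group,
// given a map of TM group name -> assigned cachegroups. A non-empty result
// would trigger the safety behavior (e.g. logging an error and recomputing
// the assignments).
func uncoveredCachegroups(all []string, assignments map[string][]string) []string {
	covered := make(map[string]bool)
	for _, cgs := range assignments {
		for _, cg := range cgs {
			covered[cg] = true
		}
	}
	var missing []string
	for _, cg := range all {
		if !covered[cg] {
			missing = append(missing, cg)
		}
	}
	return missing
}

func main() {
	all := []string{"cg01", "cg02", "cg03"}
	assignments := map[string][]string{
		"tmgroup-a": {"cg01"},
		"tmgroup-b": {"cg02"},
	}
	fmt.Println(uncoveredCachegroups(all, assignments)) // [cg03]
}
```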
### Traffic Router Impact
This proposal should have no impact on TR.
### Traffic Stats Impact
Because it will be possible to disable stat polling on TM, TS will need to
poll only the TMs that actually have stat polling enabled. TMs with stat
polling enabled should be given a specific server status (other than
`ONLINE`), which TS will be configured to poll; that might mean creating a new
server status specifically for this purpose.
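For illustration only (the `Monitor` type and the status name `ONLINE_STATS`
are assumptions, not a decided design), TS's monitor discovery might filter
on such a status like this:

```go
package main

import "fmt"

// Monitor is an assumed, simplified view of a TM server as seen by TS.
type Monitor struct {
	Hostname string
	Status   string
}

// statPollableMonitors returns only the TMs whose server status marks them
// as having stat polling enabled.
func statPollableMonitors(monitors []Monitor, statStatus string) []Monitor {
	var out []Monitor
	for _, m := range monitors {
		if m.Status == statStatus {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	monitors := []Monitor{
		{Hostname: "tm-dist-01", Status: "ONLINE"},         // distributed, no stat polling
		{Hostname: "tm-legacy-01", Status: "ONLINE_STATS"}, // legacy, stat polling enabled
	}
	fmt.Println(statPollableMonitors(monitors, "ONLINE_STATS"))
}
```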
### Traffic Vault Impact
This proposal has no impact on Traffic Vault.
### Documentation Impact
Any new configuration options added to TM should be documented, and the steps
necessary to run TM in a distributed manner as well as how it works should be
described in some form of documentation (probably the TM admin docs).
### Testing Impact
New TM unit and integration tests should be added where applicable. It is also
recommended to run both types of TM in production (distributed and
non-distributed) and compare the reported cache health states between the two.
This would help discover any issues with running TM in a distributed
manner using data from a production environment. However, TR should still get
health states from the non-distributed TMs until we are confident in the health
states reported by distributed TMs.
### Performance Impact
This proposal allows TM to be scaled horizontally, so operators can increase
the number of TM groups in order to get the desired amount of load per TM.
### Security Impact
This proposal does not have much impact on security, but allowing TM to scale
horizontally means that firewall rules may need to be applied to any new TM
servers that are deployed. However, TM will not need any
_new_ ports opened, assuming the same `httpListener` and `httpsListener`
configuration is used.
### Upgrade Impact
TMs running in a distributed manner can be upgraded in the same way that
non-distributed TMs are upgraded today. For instance, we would likely upgrade
the `OFFLINE` TMs, then set the upgraded TMs to `ONLINE` while simultaneously
setting the old TMs to `OFFLINE`.
### Operations Impact
There should be little impact on operations other than the effort necessary to
provision and deploy new TM servers to run in a distributed manner. Existing
automation can still be used for upgrades, configuration, etc., but automation
may need a way to differentiate between non-distributed and distributed TMs
within the same environment so that each type can be configured appropriately.
Troubleshooting distributed TMs might be more difficult than non-distributed
TMs as there will be more servers involved. However, the health of a cache
should always be determined by the same TMs (assuming no new TM groups are
added to the system), so it would be best to investigate the TM servers in the
"authoritative" TM group for the cache under investigation. To help aid this
kind of troubleshooting, we may want TM to have an API that returns information
about which TM groups it thinks are currently monitoring which cache groups.
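Such an endpoint might look something like the following sketch; the route
path, types, and field names are all hypothetical:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// monitoringAssignment reports which cachegroups this TM believes a given
// TM group is currently responsible for polling.
type monitoringAssignment struct {
	TMGroup     string   `json:"tmGroup"`
	Cachegroups []string `json:"cachegroups"`
}

func main() {
	assignments := []monitoringAssignment{
		{TMGroup: "tmgroup-east", Cachegroups: []string{"cg01", "cg02"}},
		{TMGroup: "tmgroup-west", Cachegroups: []string{"cg03", "cg04"}},
	}
	// GET /api/monitoring-assignments returns the full mapping as JSON.
	http.HandleFunc("/api/monitoring-assignments", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(assignments)
	})
	http.ListenAndServe(":8080", nil) // illustrative port
}
```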
### Developer Impact
Developers should know that once this change is implemented, there will be two
different "run modes" for TM -- distributed and non-distributed. TM will do
certain things differently in the distributed mode compared to the
non-distributed mode even though the vast majority of things will be the same.
Therefore, developers will need to take care to ensure the proper behavior is
followed depending on which "run mode" TM is in.
Also, because this proposal will allow TMs to monitor only a subset of caches,
it may make it easier to set up a development environment using production-like
data and caches. It is somewhat infeasible for most TM development environments
to poll an entire, large CDN, but with distributed TM groups, developers could
essentially choose how many caches they want their local TM to poll.
## Alternatives
- Cache Self-Monitoring: Make caches monitor themselves by using remap rules,
essentially replacing TM's Cache Health Monitoring. The
[Proof-of-Concept](https://github.com/apache/trafficcontrol/pull/4529) has
more details.
## Dependencies
This proposal does not intend to add any new dependencies.
## References
The following mailing list threads were related to this blueprint:
- [Proposal: Distributed Health Monitoring](https://lists.apache.org/thread.html/rf3307f824c0f82892cbb0fea74a5c6a274c8ea4f303d125e8f1212da%40%3Cdev.trafficcontrol.apache.org%3E)
- [Distributed Traffic Monitor Feedback/Requirements](https://lists.apache.org/thread.html/rf985a2b9e8a440d396a0097a71882919bff5b3cb5f8d6c3a53143162%40%3Cdev.trafficcontrol.apache.org%3E)