Distributed Traffic Monitor

Problem Description

Currently, TM polls every cache in a CDN. As CDNs grow, this becomes a major pain point: TM is limited by the bandwidth and CPU required to receive and process data from every cache in the CDN, and scaling vertically by running it on better hardware is only feasible up to a point. Additionally, the performance of a cache as observed by a TM that is far away from it does not always reflect the performance observed by the clients actually using that cache, because those clients are typically much closer to it.

Proposed Change

TM should have the ability to poll only a subset of caches in a CDN and peer with other TMs which are monitoring other subsets in order to get a full view of the CDN's health. This would allow us to run TM in a more distributed manner across the CDN, giving us a view of cache health that is closer to what clients actually observe and enabling us to scale TM horizontally. Additionally, we would like to have the option to disable stat polling in order for these distributed TMs to focus on health polling.

Traffic Portal Impact

This proposal does not require any TP changes.

Traffic Ops Impact

This proposal might have limited impact on TO. The existing TO API endpoints already provide the data that TM needs in order to run in a distributed manner, and any changes made to TM APIs that TO uses will remain backwards-compatible. However, if TO uses any stat-polling-related TM APIs, it may need to be updated to send those requests only to TMs that have stat polling enabled.

t3c Impact

This proposal does not require t3c changes. Note: the tc-health-client periodically polls a random TM to get cache health states, and because distributed TMs will still serve the health states of all caches in a CDN, there will be no impact on the tc-health-client. It can continue to poll any random TM and still get the health data for every cache in the CDN.

Traffic Monitor Impact

TM will gain at least two more configuration options:

  • distributed_polling_enabled (default: false) - when set to true, TM will run in distributed mode (more details on this below). When set to false, TM will run in its legacy, normal mode.
  • stat_polling_disabled (default: false) - when set to true, TM will not do stat polling for caches. When set to false, TM will do stat polling for caches (legacy, normal behavior). Initially, this must be set to true if distributed_polling_enabled is also set to true. In a later phase of development, we will add the ability to enable stat polling in distributed mode.

Note: these are configuration options as opposed to profile parameters because we currently do not have the capability to have per-profile monitoring.json snapshots (or per-TM-server configuration in one snapshot).
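
As a rough sketch, the new options might be wired into TM's configuration along these lines (the struct and method below are illustrative assumptions, not the actual implementation):

  package config

  import "errors"

  // Hypothetical additions to TM's configuration struct; the JSON keys
  // mirror the proposed option names, and both default to false.
  type Config struct {
      DistributedPollingEnabled bool `json:"distributed_polling_enabled"`
      StatPollingDisabled       bool `json:"stat_polling_disabled"`
  }

  // Validate enforces the initial constraint: stat polling must be disabled
  // whenever distributed mode is enabled (to be relaxed in a later phase).
  func (c Config) Validate() error {
      if c.DistributedPollingEnabled && !c.StatPollingDisabled {
          return errors.New("stat_polling_disabled must be true when distributed_polling_enabled is true")
      }
      return nil
  }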

To use distributed mode, all TMs in the CDN that take part in the health protocol generally need to be running in distributed mode. It should still be possible to run TMs in the legacy (non-distributed) mode in order to provide cache stat polling (which is important for Traffic Stats), but those TMs should not be set to ONLINE, in order to keep them from interfering with the health protocol.

While in distributed mode, a TM instance will only monitor a subset of the cachegroups in its CDN. The number of cachegroups each TM monitors depends on the number of cachegroups that contain TM servers for the CDN; these will be referred to as “TM groups.” A TM group contains 1 to many TM servers, and a CDN can have 1 to many TM groups. If there are N TM groups, each TM group will monitor roughly 1/N of the cachegroups in the CDN, and each TM in a group will monitor all of the caches in the 1/N portion of cachegroups that its TM group is responsible for. For example, if there are 10 cachegroups and 3 TM groups (one possible assignment is sketched after this list):

  • TM group 1 monitors cachegroups 1-4
  • TM group 2 monitors cachegroups 5-7
  • TM group 3 monitors cachegroups 8-10
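
One way to compute this assignment deterministically is a simple ordered split, as in the sketch below (function and variable names are invented for illustration); with 10 cachegroups and 3 TM groups it reproduces the 4/3/3 split above:

  package main

  import (
      "fmt"
      "sort"
  )

  // assignCachegroups partitions a CDN's cachegroups among its TM groups:
  // with N TM groups, each group gets roughly 1/N of the cachegroups.
  func assignCachegroups(cachegroups, tmGroups []string) map[string][]string {
      sort.Strings(cachegroups) // sort both inputs so that every TM
      sort.Strings(tmGroups)    // computes exactly the same assignment

      base := len(cachegroups) / len(tmGroups)
      extra := len(cachegroups) % len(tmGroups)
      assignments := make(map[string][]string, len(tmGroups))
      start := 0
      for i, tg := range tmGroups {
          size := base
          if i < extra { // the first len%N TM groups take one extra cachegroup
              size++
          }
          assignments[tg] = cachegroups[start : start+size]
          start += size
      }
      return assignments
  }

  func main() {
      cachegroups := []string{"cg01", "cg02", "cg03", "cg04", "cg05",
          "cg06", "cg07", "cg08", "cg09", "cg10"}
      tmGroups := []string{"tm-group-1", "tm-group-2", "tm-group-3"}
      assignments := assignCachegroups(cachegroups, tmGroups)
      for _, tg := range tmGroups {
          fmt.Println(tg, assignments[tg]) // 4/3/3 split, as in the example
      }
  }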

Because every TM can serve the health state of every cache, distributed TMs will need to peer not only with their own group members but also with members of other groups. However, instead of simultaneously requesting cache health states from all out-of-group peers, each distributed TM will simultaneously request cache health states from 1 TM in every other TM group, alternating between group members in a deterministic, round-robin fashion. For this out-of-group peering, a new TM API route will be added that returns only the health states of the caches that the TM's group is responsible for polling.
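
The out-of-group peer selection might look something like the following sketch, assuming each TM knows the members of every TM group (all names here are illustrative):

  package main

  import (
      "fmt"
      "sort"
  )

  // pickOutOfGroupPeers chooses one peer from every other TM group for a
  // given polling iteration, rotating through each group's members in a
  // deterministic round-robin.
  func pickOutOfGroupPeers(myGroup string, groups map[string][]string, iteration int) []string {
      groupNames := make([]string, 0, len(groups))
      for g := range groups {
          groupNames = append(groupNames, g)
      }
      sort.Strings(groupNames) // deterministic iteration over groups

      var peers []string
      for _, group := range groupNames {
          members := groups[group]
          if group == myGroup || len(members) == 0 {
              continue // in-group peering is handled separately
          }
          sort.Strings(members) // deterministic ordering within a group
          peers = append(peers, members[iteration%len(members)])
      }
      return peers
  }

  func main() {
      groups := map[string][]string{
          "tm-group-1": {"tm-1a", "tm-1b"},
          "tm-group-2": {"tm-2a", "tm-2b", "tm-2c"},
          "tm-group-3": {"tm-3a"},
      }
      // A TM in group 1 alternates among tm-2a/tm-2b/tm-2c and always polls
      // tm-3a; in the real design the selected peers are polled concurrently.
      for i := 0; i < 3; i++ {
          fmt.Println(pickOutOfGroupPeers("tm-group-1", groups, i))
      }
  }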

A safety feature will be added to TM (while running in distributed mode) to ensure that every cachegroup is polled by at least 1 TM group, and a profile parameter will be available to override the automatic assignment by manually assigning cachegroups to TM groups for polling.
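
The safety check itself could be as simple as verifying that the union of all TM group assignments covers every cachegroup, as in this hypothetical sketch:

  package health

  import "fmt"

  // verifyFullCoverage returns an error if any cachegroup in the CDN is not
  // assigned to at least one TM group (the function name is illustrative).
  func verifyFullCoverage(allCachegroups []string, assignments map[string][]string) error {
      covered := make(map[string]bool, len(allCachegroups))
      for _, cgs := range assignments {
          for _, cg := range cgs {
              covered[cg] = true
          }
      }
      for _, cg := range allCachegroups {
          if !covered[cg] {
              return fmt.Errorf("cachegroup %q is not polled by any TM group", cg)
          }
      }
      return nil
  }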

Traffic Router Impact

This proposal should have no impact on TR.

Traffic Stats Impact

Because it will be possible to disable stat polling on TM, TS will need to poll only the TMs that actually have stat polling enabled. TMs with stat polling enabled should be given a specific server status (other than ONLINE) that TS will be configured to poll, which might mean creating a new server status specifically for that purpose.
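
For illustration only, TS's selection logic might filter TM servers by that status, along these lines (the Server type and the notion of a configured stat-polling status are assumptions, not actual TS code):

  package trafficstats

  // Server is a minimal stand-in for a Traffic Ops server object; only the
  // fields this sketch needs are included.
  type Server struct {
      HostName string
      Status   string
  }

  // selectStatPollingTMs keeps only the TMs whose status matches the status
  // designated for stat-polling TMs (whatever that status ends up being).
  func selectStatPollingTMs(tms []Server, statPollingStatus string) []Server {
      var out []Server
      for _, tm := range tms {
          if tm.Status == statPollingStatus {
              out = append(out, tm)
          }
      }
      return out
  }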

Traffic Vault Impact

This proposal has no impact on Traffic Vault.

Documentation Impact

Any new configuration options added to TM should be documented, and the steps necessary to run TM in a distributed manner, as well as how distributed mode works, should be described in some form of documentation (probably the TM admin docs).

Testing Impact

New TM unit and integration tests should be added where applicable. It is also recommended to run both types of TM in production (distributed and non-distributed) and compare the cache health states reported by each. This would help discover any issues with running TM in a distributed manner using data from a production environment. However, TR should still get health states from the non-distributed TMs until we are confident in the health states reported by distributed TMs.

Performance Impact

This proposal allows TM to be scaled horizontally, so operators can increase the number of TM groups in order to get the desired amount of load per TM.

Security Impact

This proposal does not have much impact on security, but allowing TM to scale horizontally means that more firewall rules may need to be applied to newly deployed TM servers. However, TM will not need any new ports opened, assuming the same httpListener and httpsListener configuration is used.

Upgrade Impact

TMs running in a distributed manner can be upgraded in the same way that non-distributed TMs are upgraded today. For instance, we would likely upgrade the OFFLINE TMs, then set the upgraded TMs to ONLINE while simultaneously setting the old TMs to OFFLINE.

Operations Impact

There should be little impact on operations other than the effort necessary to provision and deploy new TM servers to run in a distributed manner. Existing automation can still be used for upgrades, configuration, etc., but automation may need a way to differentiate between non-distributed and distributed TMs within the same environment so that each type can be configured appropriately.

Troubleshooting distributed TMs might be more difficult than troubleshooting non-distributed TMs, as more servers will be involved. However, the health of a given cache should always be determined by the same TMs (assuming no new TM groups are added to the system), so it is best to investigate the TM servers in the “authoritative” TM group for the cache under investigation. To aid this kind of troubleshooting, we may want TM to have an API that returns which TM groups it thinks are currently monitoring which cachegroups.
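
For example, such an endpoint might return a body shaped like the following hypothetical Go type (the route name and field names are not yet decided):

  package datareq

  // MonitoringAssignments reports which cachegroups this TM believes each
  // TM group is currently responsible for polling.
  type MonitoringAssignments struct {
      // TM group name -> cachegroups assigned to that group
      AssignmentsByTMGroup map[string][]string `json:"assignmentsByTmGroup"`
  }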

Developer Impact

Developers should know that once this change is implemented, there will be two different “run modes” for TM: distributed and non-distributed. Although the vast majority of behavior will be shared, TM will do certain things differently in distributed mode, so developers will need to take care to ensure the proper behavior is followed depending on which “run mode” TM is in.

Also, because this proposal will allow TMs to monitor only a subset of caches, it may make it easier to set up a development environment using production-like data and caches. It is somewhat infeasible for most TM development environments to poll an entire, large CDN, but with distributed TM groups, developers could essentially choose how many caches they want their local TM to poll.

Alternatives

  • Cache Self-Monitoring: Make caches monitor themselves by using remap rules, essentially replacing TM's Cache Health Monitoring. The Proof-of-Concept has more details.

Dependencies

This proposal does not intend to add any new dependencies.

References

The following mailing list threads were related to this blueprint: