Helix Tutorial: Customizing Health Checks

In this chapter, we'll learn how to customize health checks based on metrics of your distributed system.

Health Checks

Note: this feature is currently in development and is not yet ready for production.

Helix provides the ability for each node in the system to report health metrics on a periodic basis.

Helix supports multiple ways to aggregate these metrics:

  • SUM
  • AVG
  • EXPONENTIAL DECAY
  • WINDOW

Helix persists the aggregated value only.
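
The four aggregation types above can be sketched in plain Java. This is only an illustration of the general techniques, not Helix's implementation: the class names, the `report` method, the `alpha` smoothing factor, and the choice to average over the window are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: none of these classes are part of the Helix API.
final class MetricAggregators {

    /** SUM: keeps a running total of every reported value. */
    static final class Sum {
        private double total;

        double report(double sample) {
            total += sample;
            return total;
        }
    }

    /** AVG: keeps a running total and count, returns their ratio. */
    static final class Avg {
        private double total;
        private long count;

        double report(double sample) {
            total += sample;
            count++;
            return total / count;
        }
    }

    /** EXPONENTIAL DECAY: recent samples weigh more than older ones. */
    static final class ExponentialDecay {
        private final double alpha; // assumed smoothing factor in (0, 1]
        private double value;
        private boolean seeded;

        ExponentialDecay(double alpha) {
            if (alpha <= 0.0 || alpha > 1.0) {
                throw new IllegalArgumentException("alpha must be in (0, 1]");
            }
            this.alpha = alpha;
        }

        double report(double sample) {
            // The first sample seeds the aggregate; later ones are blended in.
            value = seeded ? alpha * sample + (1.0 - alpha) * value : sample;
            seeded = true;
            return value;
        }
    }

    /** WINDOW: here assumed to mean the average of the last `size` samples. */
    static final class Window {
        private final int size;
        private final Deque<Double> samples = new ArrayDeque<>();
        private double total;

        Window(int size) {
            if (size <= 0) {
                throw new IllegalArgumentException("size must be positive");
            }
            this.size = size;
        }

        double report(double sample) {
            samples.addLast(sample);
            total += sample;
            if (samples.size() > size) {
                total -= samples.removeFirst(); // evict the oldest sample
            }
            return total / samples.size();
        }
    }

    public static void main(String[] args) {
        Sum sum = new Sum();
        Avg avg = new Avg();
        ExponentialDecay decay = new ExponentialDecay(0.5);
        Window window = new Window(2);
        for (double sample : new double[] {10.0, 20.0, 30.0}) {
            System.out.println("sum=" + sum.report(sample)
                + " avg=" + avg.report(sample)
                + " decay=" + decay.report(sample)
                + " window=" + window.report(sample));
        }
    }
}
```

Each aggregator returns the updated aggregate from `report`, so a caller can store just that one value after every periodic report. The windowed variant is the exception among the four in that it has to keep the most recent samples in memory in order to evict the oldest one.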

Applications can define a threshold on the aggregate values according to their SLAs; when an SLA is violated, Helix fires an alert. Currently Helix only fires an alert, but in a future release we plan to use these metrics to either mark the node dead or load-balance the partitions. This feature will be valuable for distributed systems that support multi-tenancy and have large variations in workload patterns. In addition, it can be used to detect skewed partitions (hotspots) and rebalance the cluster.
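
The threshold check described above can be sketched as follows. This is a minimal illustration and not the Helix alert API: the `ThresholdAlert` class, its `check` method, the `latencyMs` metric name, and the node names are all hypothetical, and the sketch assumes the SLA is violated when the aggregate rises above the threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: this class is not part of the Helix API.
final class ThresholdAlert {
    private final String metricName;
    private final double threshold;
    private final Consumer<String> alertSink; // receives one message per violation

    ThresholdAlert(String metricName, double threshold, Consumer<String> alertSink) {
        this.metricName = metricName;
        this.threshold = threshold;
        this.alertSink = alertSink;
    }

    /** Fires an alert and returns true when the aggregate exceeds the threshold. */
    boolean check(String node, double aggregate) {
        if (aggregate <= threshold) {
            return false;
        }
        alertSink.accept(metricName + " on " + node + ": "
            + aggregate + " exceeds " + threshold);
        return true;
    }

    public static void main(String[] args) {
        List<String> alerts = new ArrayList<>();
        // Hypothetical SLA: aggregated latency must stay at or below 200 ms.
        ThresholdAlert latencyAlert = new ThresholdAlert("latencyMs", 200.0, alerts::add);
        latencyAlert.check("node_1", 150.0); // within the SLA, no alert
        latencyAlert.check("node_2", 250.0); // violates the SLA, one alert
        alerts.forEach(System.out::println);
    }
}
```

Passing the alert sink in as a `Consumer<String>` keeps the check separate from the reaction, so the same check could later drive a different response, such as marking a node dead or triggering a rebalance, without changing the comparison itself.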