PIP-382: Add a label named reason for topic_load_failed_total

Background knowledge

Pulsar has a metric, topic_load_failed_total, that counts failed topic loads. It is incremented in the following cases:

  • The target bundle is unloading.
  • Failed to load policies.
  • Failed to load the managed ledger.
  • Failed to read from the metadata store.
  • Topic initialization failed, for example when rebuilding the deduplication info failed.
  • Topic load timed out.
  • Other causes.

Motivation & Goals

Adding a label to the metric topic_load_failed_total lets us see quickly which kind of error occurred, so that we can fix the issue sooner.

Metrics

Add a label named reason to topic_load_failed_total:

  • label name: reason
  • label values:
    • bundle_unloading
    • failed_load_policies
    • failed_load_ml
    • failed_access_metadata_store
    • failed_init
    • timeout
    • others
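
The mapping from the failure cases listed under Background knowledge to these label values could be sketched as follows. This is a minimal, hypothetical illustration, not the broker's implementation: the enum TopicLoadFailedReason, the methods classify and toSampleLine, and the marker exception classes are names assumed here for the example. Only the label name, the label values, and the metric name come from this proposal. The exposed metric may carry a prefix and further labels that are omitted here.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

// Hypothetical marker exceptions standing in for the broker's real failure types.
class BundleUnloadingException extends RuntimeException {}
class PoliciesLoadException extends RuntimeException {}
class ManagedLedgerLoadException extends RuntimeException {}
class MetadataStoreAccessException extends RuntimeException {}
class TopicInitException extends RuntimeException {}

// One constant per value of the proposed "reason" label.
enum TopicLoadFailedReason {
    BUNDLE_UNLOADING("bundle_unloading"),
    FAILED_LOAD_POLICIES("failed_load_policies"),
    FAILED_LOAD_ML("failed_load_ml"),
    FAILED_ACCESS_METADATA_STORE("failed_access_metadata_store"),
    FAILED_INIT("failed_init"),
    TIMEOUT("timeout"),
    OTHERS("others");

    private final String labelValue;

    TopicLoadFailedReason(String labelValue) {
        this.labelValue = labelValue;
    }

    String labelValue() {
        return labelValue;
    }

    // Maps a load failure to a label value; anything unrecognized becomes "others".
    static TopicLoadFailedReason classify(Throwable error) {
        Throwable cause = error;
        // Failures from async code usually arrive wrapped in CompletionException.
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        if (cause instanceof BundleUnloadingException) return BUNDLE_UNLOADING;
        if (cause instanceof PoliciesLoadException) return FAILED_LOAD_POLICIES;
        if (cause instanceof ManagedLedgerLoadException) return FAILED_LOAD_ML;
        if (cause instanceof MetadataStoreAccessException) return FAILED_ACCESS_METADATA_STORE;
        if (cause instanceof TopicInitException) return FAILED_INIT;
        if (cause instanceof TimeoutException) return TIMEOUT;
        return OTHERS;
    }

    // Renders one sample in the Prometheus text exposition format.
    String toSampleLine(long count) {
        return "topic_load_failed_total{reason=\"" + labelValue + "\"} " + count;
    }
}
```

Keeping the set of values closed in an enum bounds the label's cardinality: every failure lands in exactly one of the seven series, with others as the catch-all.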

Monitoring & Alternatives

  • If the counter with reason = bundle_unloading increases for a short time and then stops increasing, everything is fine.
    • Otherwise, the load balancer may have encountered an error.
  • If the counter with reason = timeout increases for a short time and then stops increasing, too many topics were probably being loaded at the same time, which may be acceptable.
    • Otherwise, the broker may have hit a deadlock, or its resources may be insufficient for the current workload.
  • For any other label value, something unexpected happened, and the label value tells us which kind of failure it was.
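
The "increases for a short time, then stops" rule for bundle_unloading and timeout can be checked mechanically. The sketch below is illustrative only: the class TopicLoadFailureTrend, its method names, and the parameter quietSamples (how many trailing sampling intervals without an increase count as "stopped") are assumptions made for this example, not part of the proposal. It takes periodic samples of the counter for a single reason value, oldest first.

```java
import java.util.List;

final class TopicLoadFailureTrend {

    enum Trend { NO_FAILURES, SETTLED, STILL_INCREASING }

    // samples: counter values for one reason label, oldest first.
    // quietSamples: number of trailing intervals that must show no increase.
    static Trend classify(List<Long> samples, int quietSamples) {
        if (quietSamples < 1 || samples.size() <= quietSamples) {
            throw new IllegalArgumentException("need more than quietSamples samples");
        }
        int last = samples.size() - 1;
        if (!increasedBetween(samples, 0, last)) {
            return Trend.NO_FAILURES; // the counter never moved in this window
        }
        return increasedBetween(samples, last - quietSamples, last)
                ? Trend.STILL_INCREASING // failures are still happening
                : Trend.SETTLED;         // a burst occurred, then stopped
    }

    // True if any consecutive pair in [from, to] shows an increase.
    // A decrease (for example a counter reset after a restart) is not an increase.
    private static boolean increasedBetween(List<Long> samples, int from, int to) {
        for (int i = from + 1; i <= to; i++) {
            if (samples.get(i) > samples.get(i - 1)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, SETTLED for reason = bundle_unloading or reason = timeout corresponds to the benign case described above, while STILL_INCREASING suggests looking at the load balancer, or at a possible deadlock or resource shortage on the broker.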

General Notes

Links