This RFC proposes a new metric API in replace of the old perf-counter API.
The perf-counter API has bad naming convention to be parsed and queried over the external monitoring system like Prometheus, or open-falcon.
Here are some examples of the perf-counter it exposes:
replica*eon.nfs_client*recent_copy_data_size
zion*profiler*RPC_RRDB_RRDB_MULTI_GET.latency.server.p999
collector*app.pegasus*app.stat.put_qps#xiaomi_algo_browser
replica*app.pegasus*check_and_mutate_qps@15.34
Apart from the perf-counter name, there are 3 “tags” for each counter:
replica*eon.nfs_client*recent_copy_data_size | service=pegasus, job=replica, port=35801
From our experiences, this naming rules have the problems listed below:
The name includes invalid characters for Prometheus: ‘*’, ‘.’, ‘@’, ‘#’. While they are legal in open-falcon, in order to support Prometheus we must convert all these characters to ‘_’. It causes that Pegasus has to expose different styles of counter name to the two monitoring systems.
Strange and meaningless words like “zion”, “eon” are included in the counter name.
Information like “table name”, “replica id” are weirdly encoded into the counter name following some obsecure rules, such as “#<table name>
” and “@<replica id>
”. This increases the difficulty on the engineering of perf-counters, that needs additional parsing.
Firstly, let's take a look on the new naming. For the above perf-counters:
nfs_client_recent_copy_data_size | entity=server
RPC_RRDB_RRDB_MULTI_GET_server_latency | p=999,entity=server
put_qps | table=xiaomi_algo_browser,entity=table
check_and_mutate_qps | table=13,partition=34,entity=replica
What's changed:
I apply the naming rules of prometheus, only use underscores ‘_’ for word separator.
Better utilization of “tags”/“labels” (optional key-value pairs for attributes of each metric).
#<table name>
” is replaced with tag “table=<table name>
”@<replica id>
” is replaced with tags “table=<table id>,partition=<partition id>
”.I introduce a new tag called “entity”, which is used to identify the level of the perf-counter. There are 3 types of entity in Pegasus:
As the naming is changed now, the perf-counter API must be correspondingly evolved. That is the new metric API.
To declare a perf-counter, take get_qps
of a replica as an example, what was it like:
class rocksdb_store { private: perf_counter_wrapper _get_qps; }; rocksdb_store::rocksdb_store() { _get_qps = init_app_counter( "app.pegasus", fmt::format("get_qps@{}", str_gpid), COUNTER_TYPE_RATE, "statistic the qps of GET request"); }
After using the new API:
METRIC_DEFINE_meter(server, get_qps, kRequestsPerSecond, "the qps of GET requests") METRIC_DEFINE_entity(replica); class rocksdb_store { private: meter_ptr _get_qps; }; rocksdb_store::rocksdb_store() { _replica_entity = METRIC_ENTITY_replica.instantiate(str_gpid, { { "table" : get_gpid().app_id() }, { "partition" : get_gpid().partition_index() }, }); _get_qps = METRIC_get_qps.instantiate(replica_entity); }
The code might be seemingly more complicated, but in fact it's simpler:
Each instance of METRIC_ENTITY_replica
is attributed with a replica's ID. Instantiated once, the instance (_replica_entity
) can be used to create every metric belongs to this replica, compared to old API, where every perf-counter needs to encode ID into its name (fmt::format("get_qps@{}", str_gpid)
).
“app.pegasus” in the old API is called a “section”. We now use entity instead. Remembering what a section name means is not friendly to new-comers.
Metric definition and metric instantiation is decoupled. A metric is defined in global scope, its instantiation in the constructor of rocksdb_store
is reduced to one line of code.
The API is inspired by kudu metrics, which is a well-documented and mature implementation of metrics in C++.
The documentations and monitoring templates must adapt to the latest metric name when this refactoring is released.