| = Metrics History |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| == Design |
| === Round-robin databases |
| When Solr runs in "cloud" mode it collects long-term history of certain key metrics. This information |
| can be used for very simple monitoring and troubleshooting, but also some Solr Cloud components |
| (eg. autoscaling) can use this data for making informed decisions based on long-term |
| trends of selected metrics. |
| |
| [IMPORTANT] |
| ==== |
| Metrics history is available ONLY in SolrCloud mode, it's not supported in standalone Solr. Also, |
| the `.system` collection must exist if metrics history should be persisted. |
| ==== |
| |
| This data is maintained as multi-resolution time series, with a fixed total number of data points |
| per metric history (a fixed size window). Multi-resolution refers to the fact that data from the most detailed |
| time series is periodically resampled to create coarser-grained time series, which in turn |
| are periodically resampled again to build even coarser-grained series. |
| |
| In the default configuration selected metrics are sampled every 60 seconds, and the following |
| time series are built: |
| |
| * 240 samples, every 60 sec (4 hours) |
| * 288 samples, every 600 sec (48 hours) |
| * 336 samples, every 1h (2 weeks) |
| * 180 samples, every 4h (2 months) |
| * 365 samples, every 1 day (1 year) |
| |
| This means that the total number of samples in all data series is constant, and consequently |
| the size of this data structure is also constant (because the size of the moving window is fixed, and |
| older samples are replaced by newer ones). This arrangement is referred to as a |
| round-robin database, and Solr uses implementation of this concept provided by RRD4j library. |
| |
| === Storage |
| Databases created with RRD4j are compact - for the time series specified above the total size |
| of data is ca. 11kB for each of the primary time series, including its resampled data. Each database may contain |
| several primary time series ("datasources" in RRD4j parlance) and their re-sampled versions (called |
| "archives"). |
| |
| This data is updated in memory and then periodically stored in the `.system` |
| collection in the form of Solr documents with a binary `data_bin` field, each document |
| containing data of one full database. This method of storage is much more compact and generates less |
| update operations than storing each data point in a separate Solr document. Metrics history API allows retrieving |
| detailed data from each database, including retrieval of all individual datapoints. |
| |
| Databases are identified primarily by their corresponding metric registry name, so for databases that |
| keep track of aggregated metrics this will be eg. `solr.jvm`, `solr.node`, `solr.collection.gettingstarted`, |
| and for databases with non-aggregated metrics this will be eg. `solr.jvm.localhost:8983_solr`, |
| `solr.node.localhost:7574_solr`, `solr.core.gettingstarted.shard1.replica_n1`. |
| |
| === Collected metrics |
| Currently the following selected metrics are tracked: |
| |
| * `solr.core` and `solr.collection` metrics: |
| ** `QUERY./select.requests` |
| ** `UPDATE./update.requests` |
| ** `INDEX.sizeInBytes` |
| ** `numShards` (aggregated, active shards) |
| ** `numReplicas` (aggregated, active replicas) |
| |
| * `solr.node` metrics: |
| ** `CONTAINER.fs.coreRoot.usableSpace` |
| ** `numNodes` (aggregated, number of live nodes) |
| * `solr.jvm` metrics: |
| ** `memory.heap.used` |
| ** `os.processCpuLoad` |
| ** `os.systemLoadAverage` |
| |
| Separate databases are created for each of these groups, and each database keeps data for |
| all metrics listed in that group. |
| |
| === SolrRrdBackendFactory |
| This component is responsible for managing in-memory databases and periodically saving them |
| to the `.system` collection. If the `.system` collection is not available the updates to the |
| databases will be kept in memory, until the time when `.system` collection becomes available. |
| |
| If the `.system` collection is permanently unavailable then data will not be saved and it will |
| be lost when the Solr node is shut down. |
| |
| === MetricsHistoryHandler |
| This component provides a REST API for accessing the metrics history. It is also responsible for |
| collecting and periodically updating the in-memory databases. |
| |
| This handler also performs aggregation of metrics on per-collection level, and on a cluster level. |
| By default only these aggregated metrics are tracked - historic data from each node and each replica |
| in each collection is not collected separately. Aggregated databases are managed on the Overseer leader |
| node. |
| |
| The handler assumes that a simple aggregation (sum of partial metric values from each resource) is |
| sufficient. This happens to make sense for the default built-in sets of metrics. Future extensions will |
| provide other aggregation strategies (average, max, min, ...). |
| |
| == Metrics History Configuration |
| There are two mechanisms for configuring this subsystem: |
| |
| * `/clusterprops.json` - this is the primary mechanism. It uses the cluster properties JSON |
| file in ZooKeeper. Configuration is stored in the `/metrics/history` element in a JSON map. |
| |
| * `solr.xml` - this is the secondary mechanism, which is not recommended but provided for consistency |
| with the existing metrics configuration section in this file. Configuration is stored in the |
| `/solr/metrics/history` element of this file. |
| |
| Currently the following configuration options are supported: |
| |
| `enable`:: boolean, default is true. If this if false then metrics history is not collected |
| but can still be retrieved from existing databases. When this is true then metrics are |
| periodically collected, aggregated and saved. |
| |
| `enableReplicas`:: boolean, default is false. When this is true non-aggregated history will be |
| collected for each replica in each collection. When this is false then only aggregated history |
| is collected for each collection. |
| |
| `enableNodes`:: boolean, default is false. When this is true then non-aggregated history will be |
| collected separately for each node (for node and JVM metrics), with database names consisting of |
| base registry name with appended node name, eg. `solr.jvm.localhost:8983_solr`. When this is false |
| then only aggregated history will be collected in a single `solr.jvm` and `solr.node` cluster-wide |
| databases. |
| |
| `collectPeriod`:: integer, in seconds, default is 60. Metrics values will be collected and respective |
| databases updated every `collectPeriod` seconds. |
| [IMPORTANT] |
| ==== |
| Value of `collectPeriod` must be at least 1, and if it's changed then all previously existing databases |
| with their historic data must be manually removed (new databases will be created automatically). |
| ==== |
| |
| `syncPeriod`:: integer, in seconds, default is 60. Data from modified databases will be saved to Solr |
| every `syncPeriod` seconds. When accessing the databases via REST API the visibility of most recent |
| data depends on this period, because requests accessing the data from other nodes see only the |
| version of the data that is stored in the `.system` collection. |
| |
| === Example configuration |
| Example `/clusterprops.json` file with metrics history configuration that turns on the collection of |
| per-node metrics history for node and JVM metrics. Note: typically this file will also contain other |
| properties unrelated to metrics history API. |
| |
| [source,json] |
| ---- |
| { |
| "metrics" : { |
| "history" : { |
| "enable" : true, |
| "enableNodes" : true, |
| "syncPeriod" : 300 |
| } |
| } |
| } |
| ---- |
| |
| == Metrics History API |
| Main entry point for accessing metrics history is `/admin/metrics/history` (or `/api/cluster/metrics/history` for |
| v2 API). |
| |
| The following sections describe actions available in this API. All calls have at least one |
| required parameter `action`. |
| |
| === List databases (`action=list`) |
| This call produces a list of available databases. It supports the following parameters: |
| |
| `rows`:: optional integer, default is 500. Maximum number of results to return |
| |
| Example: |
| [source,bash] |
| ---- |
| curl http://localhost:8983/solr/admin/metrics/history?action=list&rows=10 |
| ---- |
| [source,json] |
| ---- |
| { |
| "responseHeader": { |
| "status": 0, |
| "QTime": 16 |
| }, |
| "metrics": [ |
| "solr.collection..system", |
| "solr.collection.gettingstarted", |
| "solr.jvm", |
| "solr.node" |
| ] |
| } |
| ---- |
| |
| === Database status (`action=status`) |
| This call provides detailed status of the selected database. |
| |
| The following parameters are supported: |
| |
| `name`:: string, required: database name |
| |
| Example: |
| [source,bash] |
| ---- |
| curl http://localhost:8983/solr/admin/metrics/history?action=status&name=solr.collection.gettingstarted |
| ---- |
| [source,json] |
| ---- |
| { |
| "responseHeader": { |
| "status": 0, |
| "QTime": 38 |
| }, |
| "metrics": [ |
| "solr.collection.gettingstarted", |
| [ |
| "status", |
| { |
| "lastModified": 1527268438, |
| "step": 60, |
| "datasourceCount": 5, |
| "archiveCount": 5, |
| "datasourceNames": [ |
| "numShards", |
| "numReplicas", |
| "QUERY./select.requests", |
| "UPDATE./update.requests", |
| "INDEX.sizeInBytes" |
| ], |
| "datasources": [ |
| { |
| "datasource": "DS:numShards:GAUGE:120:U:U", |
| "lastValue": 2 |
| }, |
| { |
| "datasource": "DS:QUERY./select.requests:COUNTER:120:U:U", |
| "lastValue": 8786 |
| }, |
| ... |
| ], |
| "archives": [ |
| { |
| "archive": "RRA:AVERAGE:0.5:1:240", |
| "steps": 1, |
| "consolFun": "AVERAGE", |
| "xff": 0.5, |
| "startTime": 1527254040, |
| "endTime": 1527268380, |
| "rows": 240 |
| }, |
| { |
| "archive": "RRA:AVERAGE:0.5:10:288", |
| "steps": 10, |
| "consolFun": "AVERAGE", |
| "xff": 0.5, |
| "startTime": 1527096000, |
| "endTime": 1527268200, |
| "rows": 288 |
| }, |
| ... |
| ] |
| } |
| ] |
| ] |
| } |
| ---- |
| |
| === Get database data (`action=get`) |
| This call retrieves all data collected in the specified database. |
| |
| The following parameters are supported: |
| |
| `name`:: string, required: database name |
| `format`:: string, optional, default is `list`. Format of the data. Currently the |
| following formats are supported: |
| |
| * `list` - each datapoint is returned as separate JSON element. For efficiency, for each |
| datasource in a database for each time series the timestamps are provided separately from |
| values (because points from all datasources in a given time series share the same timestamps). |
| * `string` - all datapoint values and timestamps are returned as strings, with values separated by new line character. |
| * `graph` - data is returned as PNG images, Base64-encoded, containing graphs of each time series values over time. |
| |
| In each case the response is structured in a similar way: archive identifiers are keys in a JSON map, |
| and timestamps / datapoints / graphs are values. |
| |
| ==== Examples |
| This is the output using the default `list` format: |
| [source,bash] |
| ---- |
| curl http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted |
| ---- |
| [source,json] |
| ---- |
| { |
| "responseHeader": { |
| "status": 0, |
| "QTime": 36 |
| }, |
| "metrics": [ |
| "solr.collection.gettingstarted", |
| [ |
| "data", |
| { |
| "RRA:AVERAGE:0.5:1:240": { |
| "timestamps":1527254460, |
| "timestamps":1527254520, |
| "timestamps":1527254580, |
| ... |
| "values": { |
| "numShards": "NaN", |
| "numShards": 2.0, |
| "numShards": 2.0, |
| ... |
| "numReplicas": "NaN", |
| "numReplicas": 4.0, |
| "numReplicas": 4.0, |
| ... |
| "QUERY./select.requests": "NaN", |
| "QUERY./select.requests": 123, |
| "QUERY./select.requests": 456, |
| ... |
| } |
| }, |
| "RRA:AVERAGE:0.5:10:288": { |
| ... |
| ---- |
| |
| This is the output when using the `string` format: |
| [source,bash] |
| ---- |
| curl http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted&format=string |
| ---- |
| [source,json] |
| ---- |
| { |
| "responseHeader": { |
| "status": 0, |
| "QTime": 11 |
| }, |
| "metrics": [ |
| "solr.collection.gettingstarted", |
| [ |
| "data", |
| { |
| "RRA:AVERAGE:0.5:1:240": { |
| "timestamps": "1527254820\n1527254880\n1527254940\n...", |
| "values": { |
| "numShards": "NaN\n2.0\n2.0\n2.0\n2.0\n2.0\n2.0\n...", |
| "numReplicas": "NaN\n4.0\n4.0\n4.0\n4.0\n4.0\n4.0\n...", |
| "QUERY./select.requests": "NaN\n123\n456\n789\n...", |
| ... |
| } |
| }, |
| "RRA:AVERAGE:0.5:10:288": { |
| ... |
| ---- |
| |
| This is the output when using the `graph` format: |
| [source,bash] |
| ---- |
| curl http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted&format=graph |
| ---- |
| [source,json] |
| ---- |
| { |
| "responseHeader": { |
| "status": 0, |
| "QTime": 2275 |
| }, |
| "metrics": [ |
| "solr.collection.gettingstarted", |
| [ |
| "data", |
| { |
| "RRA:AVERAGE:0.5:1:240": { |
| "values": { |
| "numShards": "iVBORw0KGgoAAAANSUhEUgAAAkQAAA...", |
| "numReplicas": "iVBORw0KGgoAAAANSUhEUgAAAkQA...", |
| "QUERY./select.requests": "iVBORw0KGgoAAAANS...", |
| ... |
| } |
| }, |
| "RRA:AVERAGE:0.5:10:288": { |
| "values": { |
| "numShards": "iVBORw0KGgoAAAANSUhEUgAAAkQAAA...", |
| ... |
| }, |
| ... |
| ---- |
| |
| .Example 60 sec resolution history graph for `QUERY./select.requests` metric |
| image::images/metrics-history/query-graph-60s.png[image] |
| |
| |
| .Example 10 min resolution history graph for `QUERY./select.requests` metric |
| image::images/metrics-history/query-graph-10min.png[image] |
| |
| |
| .Example 60 sec resolution history graph for `UPDATE./update.requests` metric |
| image::images/metrics-history/update-graph-60s.png[image] |
| |
| .Example 60 sec resolution history graph for `memory.heap.used` metric |
| image::images/metrics-history/memHeap-60s.png[image] |
| |
| .Example 60 sec resolution history graph for `os.systemLoadAverage` metric |
| image::images/metrics-history/loadAvg-60s.png[image] |