= Metrics History
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
== Design
=== Round-robin databases
When Solr runs in SolrCloud mode it collects long-term history of certain key metrics. This information
can be used for very simple monitoring and troubleshooting, and some SolrCloud components
(e.g., autoscaling) can also use it to make informed decisions based on long-term
trends of selected metrics.
[IMPORTANT]
====
Metrics history is available ONLY in SolrCloud mode; it is not supported in standalone Solr. Also,
the `.system` collection must exist if metrics history is to be persisted.
====
This data is maintained as multi-resolution time series, with a fixed total number of data points
per metric history (a fixed size window). Multi-resolution refers to the fact that data from the most detailed
time series is periodically resampled to create coarser-grained time series, which in turn
are periodically resampled again to build even coarser-grained series.
In the default configuration selected metrics are sampled every 60 seconds, and the following
time series are built:
* 240 samples, every 60 sec (4 hours)
* 288 samples, every 600 sec (48 hours)
* 336 samples, every 1h (2 weeks)
* 180 samples, every 4h (30 days)
* 365 samples, every 1 day (1 year)
This means that the total number of samples in all data series is constant, and consequently
the size of this data structure is also constant (because the size of the moving window is fixed, and
older samples are replaced by newer ones). This arrangement is referred to as a
round-robin database, and Solr uses the implementation of this concept provided by the RRD4j library.
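The window covered by each archive follows directly from its sample count and sampling interval. The following is a minimal sketch of that arithmetic, not Solr code; the names `ARCHIVES` and `coverage_hours` are made up for illustration.

```python
# Default archives as (samples, seconds between samples), taken from the list above.
ARCHIVES = [
    (240, 60),
    (288, 600),
    (336, 3600),
    (180, 4 * 3600),
    (365, 24 * 3600),
]


def coverage_hours(samples: int, step_seconds: int) -> float:
    """Return the time window covered by one archive, in hours."""
    return samples * step_seconds / 3600


for samples, step in ARCHIVES:
    print(f"{samples} samples x {step} s = {coverage_hours(samples, step):g} h")

# The total never grows, which is why the size of the structure stays constant.
total_samples = sum(samples for samples, _ in ARCHIVES)
print("total samples per metric:", total_samples)
```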
=== Storage
Databases created with RRD4j are compact: for the time series specified above, the total size
of the data is about 11 kB for each primary time series, including its resampled data. Each database may contain
several primary time series ("datasources" in RRD4j parlance) and their resampled versions (called
"archives").
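The size figure above is consistent with a back-of-the-envelope estimate. The sketch below assumes that each sample occupies 8 bytes (one double-precision value) and ignores header and metadata overhead; both are assumptions made for illustration rather than statements about the RRD4j file layout.

```python
BYTES_PER_SAMPLE = 8  # assumption: one 8-byte double per sample
SAMPLES_PER_DATASOURCE = 240 + 288 + 336 + 180 + 365  # default archives


def estimated_size_bytes(datasource_count: int) -> int:
    """Rough payload size of one database, ignoring headers and metadata."""
    return datasource_count * SAMPLES_PER_DATASOURCE * BYTES_PER_SAMPLE


print(estimated_size_bytes(1))  # one primary time series
print(estimated_size_bytes(5))  # a database with five datasources
```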
This data is updated in memory and then periodically stored in the `.system`
collection in the form of Solr documents with a binary `data_bin` field, each document
containing the data of one full database. This method of storage is much more compact and generates fewer
update operations than storing each data point in a separate Solr document. The metrics history API allows retrieving
detailed data from each database, including all individual data points.
Databases are identified primarily by their corresponding metric registry name. For databases that
keep track of aggregated metrics this will be, e.g., `solr.jvm`, `solr.node`, `solr.collection.gettingstarted`,
and for databases with non-aggregated metrics this will be, e.g., `solr.jvm.localhost:8983_solr`,
`solr.node.localhost:7574_solr`, `solr.core.gettingstarted.shard1.replica_n1`.
=== Collected metrics
Currently the following selected metrics are tracked:
* `solr.core` and `solr.collection` metrics:
** `QUERY./select.requests`
** `UPDATE./update.requests`
** `INDEX.sizeInBytes`
** `numShards` (aggregated, active shards)
** `numReplicas` (aggregated, active replicas)
* `solr.node` metrics:
** `CONTAINER.fs.coreRoot.usableSpace`
** `numNodes` (aggregated, number of live nodes)
* `solr.jvm` metrics:
** `memory.heap.used`
** `os.processCpuLoad`
** `os.systemLoadAverage`
Separate databases are created for each of these groups, and each database keeps data for
all metrics listed in that group.
=== SolrRrdBackendFactory
This component is responsible for managing in-memory databases and periodically saving them
to the `.system` collection. If the `.system` collection is not available, updates to the
databases are kept in memory until the `.system` collection becomes available.
If the `.system` collection is permanently unavailable, the data is never saved and is
lost when the Solr node is shut down.
=== MetricsHistoryHandler
This component provides a REST API for accessing the metrics history. It is also responsible for
collecting and periodically updating the in-memory databases.
This handler also aggregates metrics at the per-collection level and at the cluster level.
By default only these aggregated metrics are tracked: historic data from each node and each replica
in each collection is not collected separately. Aggregated databases are managed on the Overseer leader
node.
The handler assumes that a simple aggregation (the sum of partial metric values from each resource) is
sufficient. This happens to make sense for the default built-in sets of metrics. Future extensions will
provide other aggregation strategies (average, max, min, ...).
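The sum-based aggregation described above can be illustrated with a short sketch. This is not the handler's implementation; `aggregate_by_sum` and the sample values are hypothetical and only show how partial values from several replicas collapse into one value per collection and metric.

```python
from collections import defaultdict


def aggregate_by_sum(partials):
    """Sum partial metric values given as (group, metric, value) tuples."""
    totals = defaultdict(float)
    for group, metric, value in partials:
        totals[(group, metric)] += value
    return dict(totals)


# Hypothetical per-replica values for a single collection.
partials = [
    ("solr.collection.gettingstarted", "INDEX.sizeInBytes", 1000.0),
    ("solr.collection.gettingstarted", "INDEX.sizeInBytes", 1500.0),
    ("solr.collection.gettingstarted", "QUERY./select.requests", 40.0),
    ("solr.collection.gettingstarted", "QUERY./select.requests", 60.0),
]
totals = aggregate_by_sum(partials)
print(totals)
```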
== Metrics History Configuration
There are two mechanisms for configuring this subsystem:
* `/clusterprops.json` - this is the primary mechanism. It uses the cluster properties JSON
file in ZooKeeper. Configuration is stored in the `/metrics/history` element in a JSON map.
* `solr.xml` - this is the secondary mechanism, which is not recommended but provided for consistency
with the existing metrics configuration section in this file. Configuration is stored in the
`/solr/metrics/history` element of this file.
Currently the following configuration options are supported:
`enable`:: boolean, default is true. If this is false then metrics history is not collected
but can still be retrieved from existing databases. When this is true then metrics are
periodically collected, aggregated and saved.
`enableReplicas`:: boolean, default is false. When this is true non-aggregated history will be
collected for each replica in each collection. When this is false then only aggregated history
is collected for each collection.
`enableNodes`:: boolean, default is false. When this is true then non-aggregated history will be
collected separately for each node (for node and JVM metrics), with database names consisting of
the base registry name with the node name appended, e.g., `solr.jvm.localhost:8983_solr`. When this is false
then only aggregated history will be collected, in the cluster-wide `solr.jvm` and `solr.node`
databases.
`collectPeriod`:: integer, in seconds, default is 60. Metrics values will be collected and respective
databases updated every `collectPeriod` seconds.
[IMPORTANT]
====
The value of `collectPeriod` must be at least 1. If it is changed, all previously existing databases
with their historic data must be manually removed (new databases will be created automatically).
====
`syncPeriod`:: integer, in seconds, default is 60. Data from modified databases will be saved to Solr
every `syncPeriod` seconds. When accessing the databases via the REST API, the visibility of the most recent
data depends on this period, because requests accessing the data from other nodes see only the
version of the data that is stored in the `.system` collection.
=== Example configuration
The following example `/clusterprops.json` file contains a metrics history configuration that turns on the collection of
per-node metrics history for node and JVM metrics. Note: typically this file will also contain other
properties unrelated to the metrics history API.
[source,json]
----
{
  "metrics" : {
    "history" : {
      "enable" : true,
      "enableNodes" : true,
      "syncPeriod" : 300
    }
  }
}
----
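To see which values are effectively in force for a given `/clusterprops.json`, the documented defaults can be overlaid with the `metrics`/`history` element. The sketch below is illustrative only and does not reflect how Solr itself reads the configuration; `effective_history_config` is a hypothetical helper.

```python
import json

# Defaults as documented in the option list above.
DEFAULTS = {
    "enable": True,
    "enableReplicas": False,
    "enableNodes": False,
    "collectPeriod": 60,
    "syncPeriod": 60,
}


def effective_history_config(clusterprops_text: str) -> dict:
    """Overlay the metrics/history element of clusterprops JSON on the defaults."""
    props = json.loads(clusterprops_text)
    overrides = props.get("metrics", {}).get("history", {})
    config = {**DEFAULTS, **overrides}
    if config["collectPeriod"] < 1:
        raise ValueError("collectPeriod must be at least 1")
    return config


example = '{"metrics": {"history": {"enable": true, "enableNodes": true, "syncPeriod": 300}}}'
print(effective_history_config(example))
```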
== Metrics History API
The main entry point for accessing metrics history is `/admin/metrics/history` (or `/api/cluster/metrics/history` for
the v2 API).
The following sections describe the actions available in this API. All calls have at least one
required parameter, `action`.
=== List databases (`action=list`)
This call produces a list of available databases. It supports the following parameters:
`rows`:: optional integer, default is 500. Maximum number of results to return.
Example:
[source,bash]
----
curl 'http://localhost:8983/solr/admin/metrics/history?action=list&rows=10'
----
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 16
  },
  "metrics": [
    "solr.collection..system",
    "solr.collection.gettingstarted",
    "solr.jvm",
    "solr.node"
  ]
}
----
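When calling the API from a script rather than from `curl`, letting a library encode the query string avoids quoting problems. This is a minimal sketch that assumes the host and port used in the examples; `history_url` and `database_names` are hypothetical helper names, and the sample response is abbreviated from the one above.

```python
from urllib.parse import urlencode

# Host and port are taken from the examples in this section.
BASE_URL = "http://localhost:8983/solr/admin/metrics/history"


def history_url(action: str, **params) -> str:
    """Build a metrics history URL with an encoded query string."""
    return BASE_URL + "?" + urlencode({"action": action, **params})


def database_names(response: dict) -> list:
    """Extract the database names from an action=list response."""
    return list(response["metrics"])


print(history_url("list", rows=10))

# A live call could use urllib.request.urlopen(history_url("list")) together
# with json.load; here an abbreviated sample response is used instead.
sample = {
    "responseHeader": {"status": 0, "QTime": 16},
    "metrics": ["solr.collection..system", "solr.jvm", "solr.node"],
}
print(database_names(sample))
```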
=== Database status (`action=status`)
This call provides detailed status of the selected database.
The following parameters are supported:
`name`:: string, required: database name
Example:
[source,bash]
----
curl 'http://localhost:8983/solr/admin/metrics/history?action=status&name=solr.collection.gettingstarted'
----
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 38
  },
  "metrics": [
    "solr.collection.gettingstarted",
    [
      "status",
      {
        "lastModified": 1527268438,
        "step": 60,
        "datasourceCount": 5,
        "archiveCount": 5,
        "datasourceNames": [
          "numShards",
          "numReplicas",
          "QUERY./select.requests",
          "UPDATE./update.requests",
          "INDEX.sizeInBytes"
        ],
        "datasources": [
          {
            "datasource": "DS:numShards:GAUGE:120:U:U",
            "lastValue": 2
          },
          {
            "datasource": "DS:QUERY./select.requests:COUNTER:120:U:U",
            "lastValue": 8786
          },
          ...
        ],
        "archives": [
          {
            "archive": "RRA:AVERAGE:0.5:1:240",
            "steps": 1,
            "consolFun": "AVERAGE",
            "xff": 0.5,
            "startTime": 1527254040,
            "endTime": 1527268380,
            "rows": 240
          },
          {
            "archive": "RRA:AVERAGE:0.5:10:288",
            "steps": 10,
            "consolFun": "AVERAGE",
            "xff": 0.5,
            "startTime": 1527096000,
            "endTime": 1527268200,
            "rows": 288
          },
          ...
        ]
      }
    ]
  ]
}
----
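The archive identifiers in this response encode the same fields that are listed next to them (`consolFun`, `xff`, `steps`, `rows`). The following sketch splits such an identifier and derives the resolution and window of the archive from the database `step`. The helper `parse_archive` and the derived field names `resolutionSeconds` and `windowSeconds` are hypothetical and not part of the API.

```python
def parse_archive(definition: str, base_step: int) -> dict:
    """Split an 'RRA:<consolFun>:<xff>:<steps>:<rows>' identifier into fields."""
    tag, consol_fun, xff, steps, rows = definition.split(":")
    if tag != "RRA":
        raise ValueError(f"not an archive definition: {definition}")
    steps, rows = int(steps), int(rows)
    return {
        "consolFun": consol_fun,
        "xff": float(xff),
        "steps": steps,
        "rows": rows,
        # Each archive row consolidates `steps` primary samples.
        "resolutionSeconds": steps * base_step,
        "windowSeconds": steps * base_step * rows,
    }


info = parse_archive("RRA:AVERAGE:0.5:10:288", base_step=60)
print(info["resolutionSeconds"], info["windowSeconds"])
```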
=== Get database data (`action=get`)
This call retrieves all data collected in the specified database.
The following parameters are supported:
`name`:: string, required: database name
`format`:: string, optional, default is `list`. Format of the data. Currently the
following formats are supported:
* `list` - each datapoint is returned as a separate JSON element. For efficiency, for each
datasource in a database for each time series the timestamps are provided separately from
values (because points from all datasources in a given time series share the same timestamps).
* `string` - all datapoint values and timestamps are returned as strings, with values separated by newline characters.
* `graph` - data is returned as Base64-encoded PNG images containing graphs of each time series' values over time.
In each case the response is structured in a similar way: archive identifiers are keys in a JSON map,
and timestamps / datapoints / graphs are values.
==== Examples
This is the output using the default `list` format:
[source,bash]
----
curl 'http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted'
----
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 36
  },
  "metrics": [
    "solr.collection.gettingstarted",
    [
      "data",
      {
        "RRA:AVERAGE:0.5:1:240": {
          "timestamps":1527254460,
          "timestamps":1527254520,
          "timestamps":1527254580,
          ...
          "values": {
            "numShards": "NaN",
            "numShards": 2.0,
            "numShards": 2.0,
            ...
            "numReplicas": "NaN",
            "numReplicas": 4.0,
            "numReplicas": 4.0,
            ...
            "QUERY./select.requests": "NaN",
            "QUERY./select.requests": 123,
            "QUERY./select.requests": 456,
            ...
          }
        },
        "RRA:AVERAGE:0.5:10:288": {
          ...
----
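In the `list` output shown above, the same key (for example `timestamps`) is repeated once per datapoint. Many JSON parsers keep only the last value for a repeated key, so a client may need to collect the repeats explicitly. The sketch below assumes the response really has the shape shown in the example and uses the `object_pairs_hook` parameter of Python's `json.loads`; the helper `collect_repeated` and the abbreviated sample are hypothetical.

```python
import json


def collect_repeated(pairs):
    """Gather the values of repeated keys into lists instead of dropping them."""
    merged = {}
    for key, value in pairs:
        merged.setdefault(key, []).append(value)
    # Keys that occur once keep their plain value. Note that a series with a
    # single datapoint would therefore come back as a scalar, not a list.
    return {k: v[0] if len(v) == 1 else v for k, v in merged.items()}


sample = """
{"RRA:AVERAGE:0.5:1:240": {
    "timestamps": 1527254460, "timestamps": 1527254520, "timestamps": 1527254580,
    "values": {"numShards": "NaN", "numShards": 2.0, "numShards": 2.0}}}
"""
archive = json.loads(sample, object_pairs_hook=collect_repeated)["RRA:AVERAGE:0.5:1:240"]

# Pair each timestamp with its value; float("NaN") yields a NaN placeholder.
series = list(zip(archive["timestamps"], map(float, archive["values"]["numShards"])))
print(series)
```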
This is the output when using the `string` format:
[source,bash]
----
curl 'http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted&format=string'
----
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 11
  },
  "metrics": [
    "solr.collection.gettingstarted",
    [
      "data",
      {
        "RRA:AVERAGE:0.5:1:240": {
          "timestamps": "1527254820\n1527254880\n1527254940\n...",
          "values": {
            "numShards": "NaN\n2.0\n2.0\n2.0\n2.0\n2.0\n2.0\n...",
            "numReplicas": "NaN\n4.0\n4.0\n4.0\n4.0\n4.0\n4.0\n...",
            "QUERY./select.requests": "NaN\n123\n456\n789\n...",
            ...
          }
        },
        "RRA:AVERAGE:0.5:10:288": {
          ...
----
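The `string` format is easy to turn back into numeric series, because timestamps and values are newline-separated and share the same positions. This is a minimal sketch with a hypothetical helper name, `parse_string_series`, using shortened input strings.

```python
def parse_string_series(timestamps: str, values: str):
    """Pair newline-separated timestamps with values; 'NaN' parses to a float NaN."""
    ts = [int(t) for t in timestamps.split("\n") if t]
    vs = [float(v) for v in values.split("\n") if v]
    if len(ts) != len(vs):
        raise ValueError("timestamps and values differ in length")
    return list(zip(ts, vs))


points = parse_string_series(
    "1527254820\n1527254880\n1527254940",
    "NaN\n2.0\n2.0",
)
# Skip the leading NaN sample, for which no value had been recorded yet.
print(points[1:])
```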
This is the output when using the `graph` format:
[source,bash]
----
curl 'http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.collection.gettingstarted&format=graph'
----
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 2275
  },
  "metrics": [
    "solr.collection.gettingstarted",
    [
      "data",
      {
        "RRA:AVERAGE:0.5:1:240": {
          "values": {
            "numShards": "iVBORw0KGgoAAAANSUhEUgAAAkQAAA...",
            "numReplicas": "iVBORw0KGgoAAAANSUhEUgAAAkQA...",
            "QUERY./select.requests": "iVBORw0KGgoAAAANS...",
            ...
          }
        },
        "RRA:AVERAGE:0.5:10:288": {
          "values": {
            "numShards": "iVBORw0KGgoAAAANSUhEUgAAAkQAAA...",
            ...
          },
          ...
----
.Example 60 sec resolution history graph for `QUERY./select.requests` metric
image::images/metrics-history/query-graph-60s.png[image]
.Example 10 min resolution history graph for `QUERY./select.requests` metric
image::images/metrics-history/query-graph-10min.png[image]
.Example 60 sec resolution history graph for `UPDATE./update.requests` metric
image::images/metrics-history/update-graph-60s.png[image]
.Example 60 sec resolution history graph for `memory.heap.used` metric
image::images/metrics-history/memHeap-60s.png[image]
.Example 60 sec resolution history graph for `os.systemLoadAverage` metric
image::images/metrics-history/loadAvg-60s.png[image]