// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= JMX Metrics
:table_opts: cols="3,2,8,2", opts="stretch,header"
== Overview
Ignite exposes a large number of metrics useful for monitoring your cluster or application.
You can use a monitoring tool, such as JConsole, to access these metrics via JMX.
You can also access them programmatically.
On this page, we've collected the most useful metrics and grouped them into categories based on the monitoring task.
// link:monitoring-metrics/configuring-metrics[Configuring Metrics]
== Understanding MBean's ObjectName
Every JMX MBean has an https://docs.oracle.com/javase/8/docs/api/javax/management/ObjectName.html[ObjectName,window=_blank].
The ObjectName is used to identify the bean.
The ObjectName consists of a domain and a list of key properties, and can be represented as a string as follows:
----
domain:key1=value1,key2=value2
----
All Ignite metrics have the same domain: `org.apache.<classloaderId>`, where the classloader ID is optional (it is omitted if you set `IGNITE_MBEAN_APPEND_CLASS_LOADER_ID=false`). In addition, each MBean name has two key properties: `group` and `name`.
For example:
----
org.apache:group=SPIs,name=TcpDiscoverySpi
----
This MBean provides various metrics related to node discovery.
The MBean ObjectName can be used to identify the bean in UI tools like JConsole.
For example, JConsole displays MBeans in a tree-like structure where all beans are first grouped by domain and then by the 'group' property:
image::images/jconsole.png[]
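You can also read the same beans programmatically through the standard `javax.management` API. The following is a minimal sketch, not a complete monitoring tool: it assumes the node's JVM was started with remote JMX enabled on port 9999 (the host and port are placeholders) and reads the `Coordinator` attribute of the `TcpDiscoverySpi` bean shown above. The `org.apache*` domain pattern accounts for the optional classloader ID. Later examples on this page reuse the `mbs` connection obtained here.

[source,java]
----
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadIgniteMetric {
    public static void main(String[] args) throws Exception {
        // Placeholder address: assumes the node's JVM was started with remote JMX
        // enabled, e.g. -Dcom.sun.management.jmxremote.port=9999
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // The domain may be org.apache.<classloaderId>, so match it with a pattern.
            ObjectName pattern =
                new ObjectName("org.apache*:group=SPIs,name=TcpDiscoverySpi,*");

            for (ObjectName name : mbs.queryNames(pattern, null))
                System.out.println(name + " -> Coordinator = "
                    + mbs.getAttribute(name, "Coordinator"));
        }
    }
}
----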
{sp}+
== Monitoring the Amount of Data
If you do not use link:persistence/native-persistence[Native persistence] (i.e., all your data is kept in memory), you will want to monitor RAM usage.
If you use Native persistence, in addition to RAM, you should monitor the size of the data storage on disk.
The size of the data loaded into a node is available at different levels of aggregation. You can monitor for:
* The total size of the data the node keeps on disk or in RAM. This amount is the sum of the size of each configured data region (in the simplest case, only the default data region) plus the sizes of the system data regions.
* The size of a specific link:memory-configuration/data-regions[data region] on that node. The data region size is the sum of the sizes of all cache groups.
* The size of a specific cache/cache group on that node, including the backup partitions.
These metrics can be enabled/disabled for each level separately and are exposed via different JMX beans listed below.
=== Allocated Space vs. Actual Size of Data
There is no way to get the exact size of the data, either in RAM or on disk. Instead, there are two ways to estimate it.
You can get the size of the space _allocated_ for storing the data.
(The "space" here refers either to the space in RAM or on disk depending on whether you use Native persistence or not.)
Space is allocated when the storage runs out of free space and more entries need to be added.
However, when you remove entries from caches, the space is not deallocated.
It is reused when new entries need to be added to the storage on subsequent write operations. Therefore, the allocated size does not decrease when you remove entries from the caches.
The allocated size is available at the level of data storage, data region, and cache group metrics.
The metric is called `TotalAllocatedSize`.
You can also get an estimate of the actual size of data by multiplying the number of link:memory-centric-storage#data-pages[data pages] in use by the page size and the fill factor. The fill factor is the ratio of the size of the data in a page to the page size, averaged over all pages. The number of pages in use and the fill factor are available at the level of data <<Data Region Size,region metrics>>.
Add up the estimated size of all data regions to get the estimated total amount of data on the node.
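As an illustration, the estimate for a single data region can be computed from the metrics described below. This is a sketch, not a product API: it assumes an `MBeanServerConnection` named `mbs` obtained as in the earlier example, a hypothetical data region named `default`, the default 4 KiB page size, and that neither a classloader ID nor an instance name is appended to the ObjectName.

[source,java]
----
// Estimated data size ~ TotalUsedPages * pageSize * PagesFillFactor.
// The region name and page size below are assumptions; substitute your own.
ObjectName region = new ObjectName("org.apache:group=DataRegionMetrics,name=default");

long usedPages = (Long) mbs.getAttribute(region, "TotalUsedPages");
float fillFactor = (Float) mbs.getAttribute(region, "PagesFillFactor");
long pageSize = 4096; // bytes

long estimatedBytes = (long) (usedPages * pageSize * fillFactor);
System.out.println("Estimated data size: " + estimatedBytes + " bytes");
----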
:allocsize_note: Note that when Native persistence is disabled, this metric shows the total size of the allocated space in RAM.
=== Monitoring RAM Memory Usage
The amount of data in RAM can be monitored for each data region through the following MBeans:
Mbean's Object Name: ::
+
--
----
group=DataRegionMetrics,name=<Data Region name>
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| PagesFillFactor| float | The average size of data in pages as a ratio of the page size. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk). | Node
| TotalUsedPages | long | The number of data pages that are currently in use. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk).| Node
| PhysicalMemoryPages |long | The number of pages allocated in RAM. | Node
| PhysicalMemorySize |long |The size of the allocated space in RAM in bytes. | Node
|===
--
If you have multiple data regions, add up the sizes of all data regions to get the total size of the data on the node.
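For example, you can do this programmatically by querying all beans in the `DataRegionMetrics` group and summing their `PhysicalMemorySize` attributes. This sketch assumes an `MBeanServerConnection` named `mbs` obtained as in the earlier example; the `org.apache*` pattern accounts for the optional classloader ID.

[source,java]
----
// Sum the RAM allocated by all data regions of the node,
// including the system data regions.
ObjectName pattern = new ObjectName("org.apache*:group=DataRegionMetrics,*");

long totalBytes = 0;
for (ObjectName region : mbs.queryNames(pattern, null))
    totalBytes += (Long) mbs.getAttribute(region, "PhysicalMemorySize");

System.out.println("RAM allocated for data regions: " + totalBytes + " bytes");
----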
=== Monitoring Storage Size
Persistent storage, when enabled, saves all application data on disk.
The total amount of data each node keeps on disk consists of the persistent storage (application data), the link:persistence/native-persistence#write-ahead-log[WAL files], and link:persistence/native-persistence#wal-archive[WAL Archive] files.
==== Persistent Storage Size
To monitor the size of the persistent storage on disk, use the following metrics:
Mbean's Object Name: ::
+
--
----
group="Persistent Store",name=DataStorageMetrics
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalAllocatedSize | long | The size of the space allocated on disk for the entire data storage (in bytes). {allocsize_note} | Node
| WalTotalSize | long | Total size of the WAL files in bytes, including the WAL archive files. | Node
| WalArchiveSegments | int | The number of WAL segments in the archive. | Node
|===
[cols="1,4",opts="header"]
|===
|Operation | Description
| enableMetrics | Enable collection of metrics related to the persistent storage at runtime.
| disableMetrics | Disable metrics collection.
|===
--
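If metrics collection is disabled, you can turn it on at runtime by invoking the `enableMetrics` operation, either from JConsole or programmatically. The sketch below assumes an `MBeanServerConnection` named `mbs` as in the earlier examples; the ObjectName pattern makes it work regardless of the optional classloader ID and instance name properties.

[source,java]
----
// Enable collection of persistent storage metrics at runtime.
ObjectName pattern = new ObjectName(
    "org.apache*:group=\"Persistent Store\",name=DataStorageMetrics,*");

for (ObjectName storage : mbs.queryNames(pattern, null))
    mbs.invoke(storage, "enableMetrics", new Object[0], new String[0]);
----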
==== Data Region Size
For each configured data region, Ignite creates a separate JMX Bean that exposes specific information about the region. Metrics collection for data regions is disabled by default. You can link:monitoring-metrics/configuring-metrics#enabling-data-region-metrics[enable it in the data region configuration, or via JMX at runtime] (see the Bean's operations below).
The size of the data region on a node comprises the size of all partitions (including backup partitions) that this node owns for all caches in that data region.
Data region metrics are available in the following MBean:
Mbean's Object Name: ::
+
--
----
group=DataRegionMetrics,name=<Data Region name>
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalAllocatedSize | long | The size of the space allocated for this data region (in bytes). {allocsize_note} | Node
| PagesFillFactor| float | The average amount of data in pages as a ratio of the page size. | Node
| TotalUsedPages | long | The number of data pages that are currently in use. | Node
| PhysicalMemoryPages |long |The number of data pages in this data region held in RAM. | Node
| PhysicalMemorySize | long |The size of the allocated space in RAM in bytes.| Node
|===
[cols="1,4",opts="header"]
|===
|Operation | Description
| enableMetrics | Enable metrics collection for this data region.
| disableMetrics | Disable metrics collection for this data region.
|===
--
==== Cache Group Size
If you don't use link:configuring-caches/cache-groups[cache groups], each cache will be its own group.
There is a separate JMX bean for each cache group.
The name of the bean corresponds to the name of the group.
Mbean's Object Name: ::
+
--
----
group="Cache groups",name=<Cache group name>
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
|TotalAllocatedSize |long | The amount of space allocated for the cache group on this node, in bytes. | Node
|===
--
== Monitoring Checkpointing Operations
Checkpointing may slow down cluster operations.
You may want to monitor how much time each checkpoint operation takes, so that you can tune the properties that affect checkpointing.
You may also want to monitor disk performance to see if a slowdown is caused by external factors.
See link:persistence/persistence-tuning#pages-writes-throttling[Pages Writes Throttling] and link:persistence/persistence-tuning#adjusting-checkpointing-buffer-size[Checkpointing Buffer Size] for performance tips.
Mbean's Object Name: ::
+
--
group="Persistent Store",name=DataStorageMetrics
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| DirtyPages | long | The number of pages in memory that have been modified but not yet synchronized to disk. They will be written to disk during the next checkpoint. | Node
|LastCheckpointDuration | long | The time in milliseconds it took to create the last checkpoint. | Node
|CheckpointBufferSize | long | The size of the checkpointing buffer. | Global
|===
--
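As a simple health check, you can periodically read `LastCheckpointDuration` and report checkpoints that take longer than some threshold. The sketch below assumes an `MBeanServerConnection` named `mbs` as in the earlier examples; the 30-second threshold is an arbitrary example value.

[source,java]
----
// Report checkpoints that took longer than 30 seconds (example threshold).
ObjectName pattern = new ObjectName(
    "org.apache*:group=\"Persistent Store\",name=DataStorageMetrics,*");

for (ObjectName storage : mbs.queryNames(pattern, null)) {
    long lastCpMillis = (Long) mbs.getAttribute(storage, "LastCheckpointDuration");

    if (lastCpMillis > 30_000)
        System.out.println("Slow checkpoint: " + lastCpMillis + " ms");
}
----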
== Monitoring Rebalancing
link:data-rebalancing[Rebalancing] is the process of moving partitions between the cluster nodes so that the data is always distributed in a balanced manner. Rebalancing is triggered when a new node joins, or an existing node leaves the cluster.
If you have multiple caches, they will be rebalanced sequentially.
There are several metrics that you can use to monitor the progress of the rebalancing process for a specific cache.
Mbean's Object Name: ::
+
--
----
group=<cache name>,name=org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
|RebalancingStartTime | long | The time (in milliseconds) when rebalancing of local partitions started for this cache. The metric returns 0 if the local partitions do not participate in the rebalancing. | Node
| EstimatedRebalancingFinishTime | long | Expected time of completion of the rebalancing process. | Node
| KeysToRebalanceLeft | long | The number of keys on the node that remain to be rebalanced. You can monitor this metric to learn when the rebalancing process finishes.| Node
|===
--
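For example, to wait until the rebalancing of a particular cache completes on a node, you can poll `KeysToRebalanceLeft` until it reaches zero. This sketch assumes an `MBeanServerConnection` named `mbs` as in the earlier examples and a hypothetical cache named `myCache`; it is meant to run in a method that declares `throws Exception`.

[source,java]
----
// Poll until the local node has no keys left to rebalance for the cache.
ObjectName pattern = new ObjectName("org.apache*:group=myCache," +
    "name=org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl,*");

for (ObjectName cache : mbs.queryNames(pattern, null)) {
    long left;

    while ((left = (Long) mbs.getAttribute(cache, "KeysToRebalanceLeft")) > 0) {
        System.out.println("Keys left to rebalance: " + left);
        Thread.sleep(1_000);
    }
}
----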
== Monitoring Topology
Topology refers to the set of nodes in a cluster. A number of metrics expose information about the topology of the cluster. If the topology changes too frequently, or its size differs from what you expect, you may want to look into whether there are network problems.
Mbean's Object Name: ::
+
--
----
group=Kernal,name=ClusterMetricsMXBeanImpl
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalServerNodes| long |The number of server nodes in the cluster.| Global
| TotalClientNodes| long |The number of client nodes in the cluster. | Global
| TotalBaselineNodes | long | The number of nodes that are registered in the link:clustering/baseline-topology[baseline topology]. When a node goes down, it remains registered in the baseline topology and you need to remove it manually. | Global
| ActiveBaselineNodes | long | The number of nodes that are currently active in the baseline topology. | Global
|===
--
Mbean's Object Name: ::
+
--
----
group=SPIs,name=TcpDiscoverySpi
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| Coordinator | String | The node ID of the current coordinator node.| Global
| CoordinatorNodeFormatted|String a|
Detailed information about the coordinator node.
....
TcpDiscoveryNode [id=e07ad289-ff5b-4a73-b3d4-d323a661b6d4,
consistentId=fa65ff2b-e7e2-4367-96d9-fd0915529c25,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.25.4.200],
sockAddrs=[mymachine.local/172.25.4.200:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500,
order=2, intOrder=2, lastExchangeTime=1568187777249, loc=false,
ver=8.7.5#20190520-sha1:d159cd7a, isClient=false]
....
| Global
|===
--
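For example, you can detect baseline nodes that are currently offline by comparing `TotalBaselineNodes` with `ActiveBaselineNodes`. The sketch below assumes an `MBeanServerConnection` named `mbs` as in the earlier examples.

[source,java]
----
// Warn if some baseline nodes are not part of the current topology.
ObjectName pattern =
    new ObjectName("org.apache*:group=Kernal,name=ClusterMetricsMXBeanImpl,*");

for (ObjectName cluster : mbs.queryNames(pattern, null)) {
    long total = ((Number) mbs.getAttribute(cluster, "TotalBaselineNodes")).longValue();
    long active = ((Number) mbs.getAttribute(cluster, "ActiveBaselineNodes")).longValue();

    if (active < total)
        System.out.println((total - active) + " baseline node(s) are offline");
}
----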
== Monitoring Caches
Cache-related metrics. For each cache, Ignite creates two JMX MBeans that expose metrics specific to that cache. One MBean shows cluster-wide information about the cache, such as the total number of entries in the cache. The other MBean shows local information about the cache, such as the number of entries of the cache that are located on the local node.
Global Cache Mbean's Object Name: ::
+
--
....
group=<Cache Name>,name="org.apache.ignite.internal.processors.cache.CacheClusterMetricsMXBeanImpl"
....
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| CacheSize | long | The total number of entries in the cache across all nodes. | Global
|===
--
Local Cache Mbean's Object Name: ::
+
--
----
group=<Cache Name>,name="org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl"
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| CacheSize | long | The number of entries of the cache that are stored on the local node. | Node
|===
--
== Monitoring Transactions
The following metrics expose information about transactions executed on the node. Note that if a transaction spans multiple nodes (i.e., if the keys it changes are located on multiple nodes), the counters increase on each of those nodes. For example, the 'TransactionsCommittedNumber' counter increases on each node where the keys affected by the transaction are stored.
Mbean's Object Name: ::
+
--
----
group=TransactionMetrics,name=TransactionMetricsMxBeanImpl
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| LockedKeysNumber | long | The number of keys locked on the node. | Node
| TransactionsCommittedNumber |long | The number of transactions that have been committed on the node. | Node
| TransactionsRolledBackNumber | long | The number of transactions that were rolled back. | Node
| OwnerTransactionsNumber | long | The number of transactions initiated on the node. | Node
| TransactionsHoldingLockNumber | long | The number of open transactions that hold a lock on at least one key on the node.| Node
|===
--
////
this isn't in 8.7.6 yet
{sp}+
Mbean's Object Name: ::
`group=Transactions,name=TransactionsMXBeanImpl`
*Attributes:*::
{sp}
+
--
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalNodeSystemTime | long | system time | Node
| TotalNodeUserTime | | | Node
| NodeSystemTimeHistogram | | | Node
| NodeUserTimeHistogram | | | Node
|===
--
////
////
{sp}+
== Monitoring Compute Jobs
Mbean's Object Name: ::
`group= ,name=`
*Attributes:*::
{sp}
+
--
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| | | |
|===
--
////
////
== Monitoring Snapshots
Mbean's Object Name: ::
+
--
----
group=TODO ,name= TODO
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| LastSnapshotOperation | | |
| LastSnapshotStartTime || |
| SnapshotInProgress | | |
|===
--
////
== Monitoring Data Center Replication
Refer to the link:data-center-replication/managing-and-monitoring#dr_jmx[Managing and Monitoring Replication] page.
////
== Monitoring Memory Consumption
JVM memory
Mbean's Object Name: ::
+
----
group=Kernal,name=ClusterMetricsMXBeanImpl
----
*Attributes:*::
+
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| HeapMemoryUsed | long | The Java heap size on the node. | Node
|===
////
== Monitoring Client Connections
Metrics related to JDBC/ODBC or thin client connections.
Mbean's Object Name: ::
+
--
----
group=Clients,name=ClientListenerProcessor
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| Connections | java.util.List<String> a| A list of strings, each string containing information about a connection:
....
JdbcClient [id=4294967297, user=<anonymous>,
rmtAddr=127.0.0.1:39264, locAddr=127.0.0.1:10800]
....
| Node
|===
[cols="1,4",opts="header"]
|===
|Operation | Description
| dropConnection (id)| Disconnect a specific client.
| dropAllConnections | Disconnect all clients.
|===
--
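For example, you can list the active connections and invoke the operations above programmatically. The sketch below assumes an `MBeanServerConnection` named `mbs` as in the earlier examples; dropping all connections is shown only to illustrate the operation.

[source,java]
----
// Print the active client connections, then disconnect them all.
ObjectName pattern =
    new ObjectName("org.apache*:group=Clients,name=ClientListenerProcessor,*");

for (ObjectName listener : mbs.queryNames(pattern, null)) {
    @SuppressWarnings("unchecked")
    java.util.List<String> connections =
        (java.util.List<String>) mbs.getAttribute(listener, "Connections");

    connections.forEach(System.out::println);

    // Use dropConnection(id) to disconnect a single client instead.
    mbs.invoke(listener, "dropAllConnections", new Object[0], new String[0]);
}
----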
== Monitoring Message Queues
If thread pool queues are growing, it means that the node cannot keep up with the load, or that there was an error while processing messages in the queue.
Continuous growth of the queue size can lead to OOM errors.
=== Communication Message Queue
The queue of outgoing communication messages contains messages that are waiting to be sent to other nodes.
A continuously growing queue size indicates a problem with message delivery.
Mbean's Object Name: ::
+
--
----
group=SPIs,name=TcpCommunicationSpi
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| OutboundMessagesQueueSize | int | The size of the queue of outgoing communication messages. | Node
|===
--
=== Discovery Messages Queue
The queue of discovery messages.
Mbean's Object Name: ::
+
--
----
group=SPIs,name=TcpDiscoverySpi
----
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| MessageWorkerQueueSize | int | The size of the queue of discovery messages that are waiting to be sent to other nodes. | Node
|AvgMessageProcessingTime|long| Average message processing time. | Node
|===
--
////
== Monitoring Executor Queue Size
There is a number of executor thread pools running within each node that are dedicated to specific tasks.
You may want to monitor the size of the executor's queues.
You can read more about the thread pools on the link:perf-troubleshooting-guide/thread-pools-tuning[Thread Tuning Page]
There is a JMX Bean for each thread pool.
////