 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements.  See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License.  You may obtain a copy of the License at
 //
 // http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
 = Tracing

 :javaFile: {javaCodeDir}/Tracing.java

 WARNING: This feature is experimental.

 A number of APIs in Ignite are instrumented for tracing with OpenCensus.
 You can collect distributed traces of various tasks executed in your cluster and use this information to diagnose latency problems.

 We suggest that you familiarize yourself with the OpenCensus tracing documentation before reading this chapter: https://opencensus.io/tracing/[^].

 The following Ignite APIs are instrumented for tracing:

 * Discovery
 * Communication
 * Exchange
 * Transactions
 * SQL queries


 To view traces, you must export them into an external system.
 You can use one of the OpenCensus exporters or write your own, but in any case, you will have to write code that registers an exporter in Ignite.
 Refer to <<Exporting Traces>> for details.


 == Configuring Tracing

 Enable OpenCensus tracing in the node configuration. All nodes in the cluster must use the same tracing configuration.

 [tabs]
 --
 tab:XML[]
 [source, xml]
 ----
 include::code-snippets/xml/tracing.xml[tags=ignite-config;!discovery, indent=0]
 ----

 tab:Java[]
 [source, java]
 ----
 include::{javaFile}[tags=config, indent=0]
 ----
 tab:C#/.NET[]

 tab:C++[unsupported]
 --


 == Enabling Trace Sampling

 When you start your cluster with the above configuration, Ignite does not collect traces.
 You have to enable trace sampling for a specific API at runtime.
 You can turn trace sampling on and off at will, for example, only for the period when you are troubleshooting a problem.

 You can do this in two ways:

 * via the control script from the command line
 * programmatically

 Traces are collected at a given probabilistic sampling rate.
 The rate is specified as a value between 0.0 and 1.0 inclusive: `0` means no sampling, `1` means always sampling.

 When the sampling rate is set to a value greater than 0, Ignite collects traces.
 To disable trace collection, set the sampling rate to 0.
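The effect of a probabilistic sampling rate can be illustrated with a minimal, self-contained sketch. It is not Ignite or OpenCensus code: the class and method names are hypothetical, and the sketch only assumes that each operation is sampled independently with probability equal to the rate.

```java
import java.util.Random;

/** Illustration only: shows what a probabilistic sampling rate means. Not Ignite code. */
final class SamplingRateSketch {
    /** Decides whether one operation is traced, given a rate in [0.0, 1.0]. */
    static boolean shouldSample(double rate, Random rnd) {
        if (!(rate >= 0.0 && rate <= 1.0))
            throw new IllegalArgumentException("Sampling rate must be in [0.0, 1.0]: " + rate);

        // nextDouble() returns a value in [0.0, 1.0), so rate 0 never samples and rate 1 always does.
        return rnd.nextDouble() < rate;
    }

    /** Counts how many of the given number of operations would be traced. */
    static int countSampled(int operations, double rate, Random rnd) {
        int sampled = 0;

        for (int i = 0; i < operations; i++) {
            if (shouldSample(rate, rnd))
                sampled++;
        }

        return sampled;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // Fixed seed so that repeated runs give the same counts.

        System.out.println("rate 0.0  -> " + countSampled(10_000, 0.0, rnd));
        System.out.println("rate 0.25 -> " + countSampled(10_000, 0.25, rnd));
        System.out.println("rate 1.0  -> " + countSampled(10_000, 1.0, rnd));
    }
}
```

With a rate of 0.25, roughly a quarter of the operations end up traced, which is why an intermediate rate can be a reasonable compromise between trace coverage and overhead.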

 The following sections describe the two ways of enabling trace sampling.

 === Using Control Script

 Go to the `{IGNITE_HOME}/bin` directory of your Ignite installation.
 Enable experimental commands in the control script:

 [source, shell]
 ----
 export IGNITE_ENABLE_EXPERIMENTAL_COMMAND=true
 ----

 Enable tracing for a specific API:

 [source, shell]
 ----
 ./control.sh --tracing-configuration set --scope TX --sampling-rate 1
 ----

 Refer to the link:control-script#tracing-configuration[Control Script] section for the list of all parameters.

 === Programmatically

 Once you start the node, you can enable trace sampling as follows:

 [source, java]
 ----
 include::{javaFile}[tags=enable-sampling, indent=0]
 ----


 In the control script command, the `--scope` parameter specifies the API you want to trace.
 The following APIs are instrumented for tracing:

 * `DISCOVERY` — discovery events
 * `EXCHANGE` —  exchange events
 * `COMMUNICATION` — communication events
 * `TX` — transactions
 * `SQL` — SQL queries

 The `--sampling-rate` is the probabilistic sampling rate, a number between `0` and `1`:

 * `0` means no sampling,
 * `1` means always sampling.
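As a sketch of how the two parameters combine, the helper below validates a scope and a sampling rate and assembles the corresponding control script arguments. The class and method names are hypothetical and are not part of Ignite. The flags and scope values are the ones listed in this chapter. Passing the rate in decimal form (for example, `1.0`) is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: assembles the control script arguments shown in this chapter. */
final class TracingCommandSketch {
    /** Scope values listed in this chapter. */
    static final List<String> SCOPES =
        Arrays.asList("DISCOVERY", "EXCHANGE", "COMMUNICATION", "TX", "SQL");

    /** Builds the argument list for control.sh after validating the scope and the rate. */
    static List<String> setSamplingArgs(String scope, double rate) {
        if (!SCOPES.contains(scope))
            throw new IllegalArgumentException("Unknown scope: " + scope);

        if (!(rate >= 0.0 && rate <= 1.0))
            throw new IllegalArgumentException("Sampling rate must be between 0 and 1: " + rate);

        return Arrays.asList(
            "--tracing-configuration", "set",
            "--scope", scope,
            "--sampling-rate", String.valueOf(rate));
    }

    public static void main(String[] args) {
        // Arguments for tracing every SQL query, then for switching SQL tracing off again.
        System.out.println("./control.sh " + String.join(" ", setSamplingArgs("SQL", 1.0)));
        System.out.println("./control.sh " + String.join(" ", setSamplingArgs("SQL", 0.0)));
    }
}
```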


 == Exporting Traces

 To view traces, you need to export them to an external backend using one of the available exporters.
 OpenCensus supports a number of exporters out-of-the-box, and you can write a custom one.
 Refer to the link:https://opencensus.io/exporters/[OpenCensus Exporters^] for details.

 In this section, we will show how to export traces to link:https://zipkin.io[Zipkin^].

 . Follow link:https://zipkin.io/pages/quickstart.html[this guide^] to launch Zipkin on your machine.
 . Register `ZipkinTraceExporter` in the application where you start Ignite:
 +
 --
 [source, java]
 ----
 include::{javaFile}[tags=export-to-zipkin, indent=0]
 ----
 --


 . Open http://localhost:9411/zipkin[^] in your browser and click the search icon.
 +
 --
 This is what a trace of the transaction looks like:

 image::images/trace_in_zipkin.png[]
 --

 == Analyzing Trace Data

 A trace is recorded information about the execution of a specific event.
 Each trace consists of a tree of _spans_.
 A span is an individual unit of work performed by the system in order to process the event.

 Because of the distributed nature of Ignite, an operation usually involves multiple nodes.
 Therefore, a trace can include spans from multiple nodes.
 Each span contains information about the node where the corresponding operation was executed.

 In the image of the transaction trace presented above, you can see that the trace contains the spans associated with the following operations:

 * acquire locks (`transactions.colocated.lock.map`),
 * get (`transactions.near.enlist.read`),
 * put (`transactions.near.enlist.write`),
 * commit (`transactions.commit`), and
 * close (`transactions.close`).

 The commit operation, in turn, consists of two operations: prepare and finish.
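The structure described above can be sketched as a small tree model. This is an illustration only, not the OpenCensus or Ignite data model: the `Span` class, the root span name `tx`, the child names `prepare` and `finish`, the node IDs, and the exact nesting are all assumptions made for the example. Only the `transactions.*` span names come from the trace shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: a trace modeled as a tree of spans. */
final class SpanTreeSketch {
    /** One unit of work, labeled with the node that executed it. */
    static final class Span {
        final String name;
        final String nodeId;
        final List<Span> children = new ArrayList<>();

        Span(String name, String nodeId) {
            this.name = name;
            this.nodeId = nodeId;
        }

        /** Adds a child span and returns this span so that calls can be chained. */
        Span child(Span span) {
            children.add(span);
            return this;
        }
    }

    /** Collects the IDs of all nodes that contributed spans to the subtree. */
    static Set<String> nodes(Span root) {
        Set<String> ids = new TreeSet<>();
        collect(root, ids);
        return ids;
    }

    private static void collect(Span span, Set<String> ids) {
        ids.add(span.nodeId);
        for (Span c : span.children)
            collect(c, ids);
    }

    /** Renders the subtree with two spaces of indentation per level. */
    static String render(Span span, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++)
            sb.append("  ");
        sb.append(span.name).append(" [").append(span.nodeId).append("]\n");
        for (Span c : span.children)
            sb.append(render(c, depth + 1));
        return sb.toString();
    }

    /** Builds a made-up transaction trace that spans two nodes. */
    static Span sampleTrace() {
        Span commit = new Span("transactions.commit", "node-A")
            .child(new Span("prepare", "node-B"))  // Placeholder span name.
            .child(new Span("finish", "node-B"));  // Placeholder span name.

        return new Span("tx", "node-A")            // Hypothetical root span name.
            .child(new Span("transactions.colocated.lock.map", "node-A"))
            .child(new Span("transactions.near.enlist.read", "node-A"))
            .child(new Span("transactions.near.enlist.write", "node-A"))
            .child(commit)
            .child(new Span("transactions.close", "node-A"));
    }

    public static void main(String[] args) {
        Span trace = sampleTrace();
        System.out.print(render(trace, 0));
        System.out.println("Nodes involved: " + nodes(trace));
    }
}
```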

 You can click on each span to view the annotations and tags attached to it.


 image::images/span.png[Span]


 == Tracing SQL Queries

 To enable tracing of SQL queries, set the `scope` parameter to `SQL` in the link:https://ignite.apache.org/docs/latest/monitoring-metrics/tracing#enabling-trace-sampling[trace sampling configuration, window=_blank].
 If tracing of SQL queries is enabled, execution of each SQL query on any cluster node will produce a separate trace.

 [IMPORTANT]
 ====
 [discrete]
 Enabling tracing for SQL queries severely degrades SQL engine performance.
 ====

 The table below provides descriptions, a list of tags, and annotations for each span that can be a part of the SQL query trace tree.

 [NOTE]
 ====
 [discrete]
 Depending on the SQL query type and its execution plan, some spans may not be present in the SQL query span tree.
 ====

 [cols="2,5,5",opts="header"]
 |===
 |Span Name | Description | Tags and Annotations
 | sql.query | Execution of an SQL query from the moment of registration until the used resources on the query initiator node are released a|
 * sql.query.text - SQL query text
 * sql.schema - SQL schema
 | sql.cursor.open | SQL query cursor opening |
 | sql.cursor.close | SQL query cursor closure |
 | sql.cursor.cancel | SQL query cursor cancellation |
 | sql.query.parse | Parsing of SQL query a|
 * sql.parser.cache.hit - Whether parsing of the SQL query was skipped due to the cached result
 | sql.query.execute.request | Processing of SQL query execution request a|
 * sql.query.text - SQL query text
 | sql.next.page.request | Processing of a request for the next page of a local SQL query execution result |
 | sql.page.response | Processing of a message that contains a page of a node-local SQL query execution result |
 | sql.query.execute | Execution of query by H2 SQL engine a|
 * sql.query.text - SQL query text
 | sql.page.prepare | Reading rows from the cursor and preparing a result page a|
 * sql.page.rows - Number of rows that a result page contains
 | sql.fail.response | Processing of a message that indicates failure of SQL query execution |
 | sql.dml.query.execute.request | Processing of SQL DML query execution request a|
 * sql.query.text - SQL query text
 | sql.dml.query.response | Processing of SQL DML query execution result by query initiator node |
 | sql.query.cancel.request | Processing of SQL query cancel request |
 | sql.iterator.open | SQL query iterator opening |
 | sql.iterator.close | SQL query iterator closure |
 | sql.page.fetch | Fetching SQL query result page a|
 * sql.page.rows - Number of rows that a result page contains
 | sql.page.wait | Waiting for an SQL query result page to be received from a remote node |
 | sql.index.range.request | Processing SQL index range request a|
 * sql.index - SQL index name
 * sql.table - SQL table name
 * sql.index.range.rows - Number of rows that an index range request result contains
 | sql.index.range.response | Processing SQL index range response |
 | sql.dml.query.execute |  Execution of SQL DML query |
 | sql.command.query.execute | Execution of an SQL command query, which is either a DDL query or an Ignite native command |
 | sql.partitions.reserve | Reservation of data partitions used to execute a query a|
 * Annotation message that indicates reservation of data partitions for a particular cache - `Cache partitions were reserved [cache=<name of the cache>, partitions=[<partitions numbers>]`
 | sql.cache.update | Cache update as a result of SQL DML query execution a|
 * sql.cache.updates - Number of cache entries to be updated as a result of DML query
 | sql.batch.process| Processing of SQL batch update |
 |===
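The tags in the table can be used to aggregate data from an exported trace. The sketch below sums the `sql.page.rows` tag over all `sql.page.fetch` spans. It is an illustration only: the `Span` class and the sample values are made up, and treating tag values as strings is an assumption about how a trace backend returns them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: reads tags of exported SQL spans. The span data below is made up. */
final class SqlSpanTagsSketch {
    /** A span reduced to its name and its tags. */
    static final class Span {
        final String name;
        final Map<String, String> tags;

        Span(String name, Map<String, String> tags) {
            this.name = name;
            this.tags = tags;
        }
    }

    /** Sums the sql.page.rows tag over all spans with the given name. */
    static long totalRows(List<Span> spans, String spanName) {
        long total = 0;

        for (Span s : spans) {
            String rows = s.tags.get("sql.page.rows");

            // Spans of other types, or spans without the tag, are skipped.
            if (s.name.equals(spanName) && rows != null)
                total += Long.parseLong(rows);
        }

        return total;
    }

    /** Builds a made-up list of spans from a single SQL query trace. */
    static List<Span> sampleSpans() {
        Map<String, String> queryTags = new HashMap<>();
        queryTags.put("sql.query.text", "SELECT * FROM Person");
        queryTags.put("sql.schema", "PUBLIC");

        return Arrays.asList(
            new Span("sql.query", queryTags),
            new Span("sql.query.parse", Collections.singletonMap("sql.parser.cache.hit", "true")),
            new Span("sql.cursor.open", Collections.<String, String>emptyMap()),
            new Span("sql.page.fetch", Collections.singletonMap("sql.page.rows", "1024")),
            new Span("sql.page.fetch", Collections.singletonMap("sql.page.rows", "310")));
    }

    public static void main(String[] args) {
        System.out.println("Rows fetched: " + totalRows(sampleSpans(), "sql.page.fetch"));
    }
}
```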

 ////
 TODO: describe annotations and tags
 === Annotations

 === Tags

 The `node.id` and `node.consistentId` are the ID and consistent ID of the node where the root operation started.
 ////