 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements.  See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License.  You may obtain a copy of the License at
 //
 // http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
 = Handling Exceptions

 This section lists the common exceptions thrown by Ignite and explains how to configure
 and use the critical failure handler.

 == Handling Ignite Exceptions

 The table below describes the exceptions thrown by the Ignite API and the action to take for each.
 For checked exceptions, see the _throws_ clause in the Javadoc.

 [cols="25%,35%,30%,10%", width="100%"]
 |=======================================================================
 |Exception	|Description	|Action	|Runtime exception

 | `CacheInvalidStateException`
 | Thrown when you try to perform an operation on a cache in which some partitions have been lost. Depending on the partition
 loss policy configured for the cache, this exception is thrown on read operations, write operations, or both.
 See link:configuring-caches/partition-loss-policy[Partition Loss Policy] for details.
 | Reset lost partitions. You may want to restore the data by returning the nodes that caused the partition loss to the cluster.
 | Yes

 |`IgniteException`
 |Indicates an error condition in the cluster.
 |Operation failed. Exit from the method.
 |Yes

 |`IgniteClientDisconnectedException`
 |Thrown by the Ignite API when a client node is disconnected from the cluster. Thrown from cache operations, the Compute API, and data structures.
 |Wait and use retry logic.
 |Yes
 |`IgniteAuthenticationException`
 |Thrown when there is either a node authentication failure or security authentication failure.
 |Operation failed. Exit from the method.
 |No
 |`IgniteClientException`
 |Can be thrown from Cache operations.
 |Check exception message for the action to be taken.
 |Yes
 |`IgniteDeploymentException`
 |Thrown when the Ignite API fails to deploy a job or task on a node. Thrown from the Compute API.
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteInterruptedException`
 |Used to wrap the standard `InterruptedException` into `IgniteException`.
 |Retry after clearing the interrupted flag.
 |Yes
 |`IgniteSpiException`
 |Thrown by various SPIs (`CollisionSpi`, `LoadBalancingSpi`, `TcpDiscoveryIpFinder`, `FailoverSpi`, `UriDeploymentSpi`, etc.).
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteSQLException`
 |Thrown when there is a SQL query processing error. This exception also provides query-specific error codes.
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteAccessControlException`
 |Thrown when there is an authentication or authorization failure.
 |Operation failed. Exit from the method.
 |No
 |`IgniteCacheRestartingException`
 |Thrown from Ignite cache API if a cache is restarting.
 |Wait and use retry logic.
 |Yes
 |`IgniteFutureTimeoutException`
 |Thrown when a future computation times out.
 |Either increase the timeout limit or exit from the method.
 |Yes
 |`IgniteFutureCancelledException`
 |Thrown when a future computation cannot be retrieved because it was cancelled.
 |Use retry logic.
 |Yes
 |`IgniteIllegalStateException`
 |Indicates that the Ignite instance is in an invalid state for the requested operation.
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteNeedReconnectException`
 |Indicates that a node should try to reconnect to the cluster.
 |Use retry logic.
 |No
 |`IgniteDataIntegrityViolationException`
 |Thrown if a data integrity violation is found.
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteOutOfMemoryException`
 |Thrown when the system does not have enough memory to process Ignite operations. Thrown from Cache operations.
 |Operation failed. Exit from the method.
 |Yes
 |`IgniteTxOptimisticCheckedException`
 |Thrown when a transaction fails optimistically.
 |Use retry logic.
 |No
 |`IgniteTxRollbackCheckedException`
 |Thrown when a transaction has been automatically rolled back.
 |Use retry logic.
 |No
 |`IgniteTxTimeoutCheckedException`
 |Thrown when a transaction times out.
 |Use retry logic.
 |No
 |`ClusterTopologyException`
 |Indicates an error with the cluster topology (e.g., a crashed node). Thrown from the Compute and Events APIs.
 |Wait on future and use retry logic.
 |Yes
 |=======================================================================
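
Several rows above recommend "retry logic". The following is a minimal sketch of one way to structure it: a bounded number of attempts with a doubling delay between them. It is self-contained and does not call the Ignite API. `RetrySketch`, `TransientFailure`, and `callWithRetry` are hypothetical names introduced for illustration; in real code you would catch the specific retryable Ignite exception instead of the stand-in.

[source,java]
----
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Hypothetical stand-in for a retryable exception such as a client disconnect. */
    public static class TransientFailure extends RuntimeException {
        public TransientFailure(String msg) {
            super(msg);
        }
    }

    /**
     * Calls op up to maxAttempts times, sleeping between attempts and doubling
     * the delay each time. Exceptions other than TransientFailure propagate at once.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be at least 1");

        long delayMs = initialDelayMs;

        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            }
            catch (TransientFailure e) {
                // Give up once the attempt budget is spent.
                if (attempt == maxAttempts)
                    throw e;

                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        // Simulated operation: fails twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3)
                throw new TransientFailure("simulated disconnect");

            return "ok";
        }, 5, 10);

        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
----

Bounding the number of attempts keeps a persistent failure from turning into an endless loop; the caller still sees the last exception and can exit from the method.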

 == Critical Failures Handling

 Ignite is a robust and fault-tolerant system. In the real world, however, unpredictable problems can arise
 that affect the state of an individual node or of the whole cluster. Such problems can be detected at
 runtime and handled by a preconfigured critical failure handler.

 === Critical Failures

 The following failures are treated as critical:

 * System critical errors (e.g. `OutOfMemoryError`).

 * Unintentional system worker termination (e.g. due to an unhandled exception).

 * System workers hanging.

 * Cluster node segmentation.

 A system-critical error is an error that makes the system inoperable. For example:

 * File I/O errors - typically an `IOException` thrown by a file read or write operation. This can happen when Ignite
 native persistence is enabled (e.g., when no space is left or on a device error), and also in in-memory
 mode, because Ignite uses disk storage to keep some metadata (e.g., when the file descriptor limit is
 exceeded or file access is prohibited).

 * Out of memory error - when the Ignite memory management system fails to allocate more space
 (`IgniteOutOfMemoryException`).

 * Out of memory error - when a cluster node runs out of Java heap (`OutOfMemoryError`).

 === Failures Handling

 When Ignite detects a critical failure, it handles the failure according to a preconfigured failure handler.
 The failure handler can be configured as follows:

 :javaFile: code-snippets/java/src/main/java/org/apache/ignite/snippets/FailureHandler.java

 [tabs]
 --
 tab:XML[]
 [source,xml]
 ----
 <bean class="org.apache.ignite.configuration.IgniteConfiguration">
     <property name="failureHandler">
         <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
     </property>
 </bean>
 ----
 tab:Java[]
 [source,java]
 ----
 include::{javaFile}[tag=configure-handler,indent=0]
 ----
 --

 Ignite supports the following failure handlers:

 [width=100%,cols="30%,70%"]
 |=======================================================================
 |Class |Description

 |`NoOpFailureHandler`
 |Ignores any failures. Useful for testing and debugging.
 |`RestartProcessFailureHandler`
 |A specific implementation that can be used only with `ignite.sh\|bat`. The process must be terminated by using the `Ignition.restart(true)` method.
 |`StopNodeFailureHandler`
 |Stops the node in case of critical errors by calling the `Ignition.stop(true)` or `Ignition.stop(nodeName, true)` methods.
 |`StopNodeOrHaltFailureHandler`
 |This is the default handler, which tries to stop a node. If the node can't be stopped, the handler terminates the JVM process.

 |=======================================================================
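
To illustrate the "try to stop, otherwise terminate" behavior described for `StopNodeOrHaltFailureHandler`, here is a minimal, self-contained sketch. It is not Ignite's implementation and does not use the Ignite API: `StopOrHaltSketch`, `Outcome`, and `stopOrHalt` are hypothetical names, the timeout is an assumption made for the example, and the sketch only reports that a halt would be needed instead of terminating the JVM.

[source,java]
----
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StopOrHaltSketch {
    public enum Outcome { STOPPED, HALT_REQUIRED }

    /**
     * Runs the stop action on a separate daemon thread and waits up to timeoutMs.
     * Reports HALT_REQUIRED if the action fails or does not finish in time.
     */
    public static Outcome stopOrHalt(Runnable stopNode, long timeoutMs) throws InterruptedException {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true); // Do not keep the JVM alive because of a stuck stop attempt.
            return t;
        });

        try {
            exec.submit(stopNode).get(timeoutMs, TimeUnit.MILLISECONDS);

            return Outcome.STOPPED;
        }
        catch (TimeoutException | ExecutionException e) {
            // A real handler would terminate the JVM at this point.
            return Outcome.HALT_REQUIRED;
        }
        finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A stop action that completes immediately.
        System.out.println(stopOrHalt(() -> { }, 100)); // prints STOPPED

        // A stop action that hangs longer than the timeout.
        System.out.println(stopOrHalt(() -> {
            try {
                Thread.sleep(10_000);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100)); // prints HALT_REQUIRED
    }
}
----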

 === Critical Workers Health Check

 Ignite has a number of internal workers that are essential for the cluster to function correctly. If one of them is
 terminated, the node can become inoperative.

 The following system workers are considered mission critical:

 * Discovery worker - discovery event handling.
 * TCP communication worker - peer-to-peer communication between nodes.
 * Exchange worker - partition map exchange.
 * Workers of the system's striped pool.
 * Data Streamer striped pool workers.
 * Timeout worker - timeout handling.
 * Checkpoint thread - checkpointing in Ignite persistence.
 * WAL workers - write-ahead logging, segment archiving, and compression.
 * Expiration worker - TTL-based expiration.
 * NIO workers - base networking.

 Ignite has an internal mechanism for verifying that critical workers are operational.
 Each worker is regularly checked to confirm that it is alive and updating its heartbeat timestamp.
 If a worker is not alive or has stopped updating its heartbeat, it is regarded as blocked and Ignite prints a message to the log file.
 You can set the allowed period of inactivity with the `IgniteConfiguration.systemWorkerBlockedTimeout` property.
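
The heartbeat check described above can be pictured with a small, self-contained sketch: each worker records the time of its last heartbeat, and a checker flags any worker whose heartbeat is older than the timeout. This is an illustration only, not Ignite's internal code; `HeartbeatCheckSketch`, `findBlocked`, and the worker names and timestamps are hypothetical.

[source,java]
----
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatCheckSketch {
    /** Returns the names of workers whose last heartbeat is older than timeoutMs. */
    public static List<String> findBlocked(Map<String, Long> lastHeartbeatMs, long nowMs, long timeoutMs) {
        List<String> blocked = new ArrayList<>();

        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            // Time since the worker last reported progress.
            if (nowMs - e.getValue() > timeoutMs)
                blocked.add(e.getKey());
        }

        return blocked;
    }

    public static void main(String[] args) {
        Map<String, Long> heartbeats = new LinkedHashMap<>();

        heartbeats.put("exchange-worker", 9_500L);   // Updated 500 ms before "now".
        heartbeats.put("checkpoint-thread", 2_000L); // Updated 8000 ms before "now".

        // With "now" at 10000 ms and a 5000 ms timeout, only the second worker is blocked.
        System.out.println(findBlocked(heartbeats, 10_000L, 5_000L)); // prints [checkpoint-thread]
    }
}
----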

 Even though Ignite considers an unresponsive system worker to be a critical error, it doesn't handle this situation automatically,
 other than printing a message to the log file.
 To make a failure handler react to unresponsive system workers of all types, clear the
 `ignoredFailureTypes` property of the handler as shown below:

 [tabs]
 --
 tab:XML[]
 [source,xml]
 ----
 <bean class="org.apache.ignite.configuration.IgniteConfiguration">

     <property name="systemWorkerBlockedTimeout" value="#{60 * 60 * 1000}"/>

     <property name="failureHandler">
         <bean class="org.apache.ignite.failure.StopNodeFailureHandler">

             <!-- Enable this handler to react to unresponsive critical workers. -->
             <property name="ignoredFailureTypes">
                 <list>
                 </list>
             </property>

         </bean>
     </property>
 </bean>
 ----
 tab:Java[]
 [source,java]
 ----
 include::{javaFile}[tag=failure-types,indent=0]
 ----
 --