| // Licensed to the Apache Software Foundation (ASF) under one or more |
| // contributor license agreements. See the NOTICE file distributed with |
| // this work for additional information regarding copyright ownership. |
| // The ASF licenses this file to You under the Apache License, Version 2.0 |
| // (the "License"); you may not use this file except in compliance with |
| // the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| = Handling Exceptions |
| |
| This section outlines basic exceptions that can be generated by Ignite, and explains how to set |
| up and use the critical failures handler. |
| |
| == Handling Ignite Exceptions |
| |
| The table below lists the exceptions thrown by the Ignite API and the action to take for each of them. |
| For checked exceptions, see the _throws_ clause in the Javadoc. |
| |
| [cols="25%,35%,30%,10%", width="100%"] |
| |======================================================================= |
| |Exception |Description |Action |Runtime exception |
| |
| | `CacheInvalidStateException` |
| | Thrown when you try to perform an operation on a cache in which some partitions have been lost. Depending on the partition |
| loss policy configured for the cache, this exception is thrown on read operations, write operations, or both. |
| See link:configuring-caches/partition-loss-policy[Partition Loss Policy] for details. |
| | Reset lost partitions. You may want to restore the data by returning the nodes that caused the partition loss to the cluster. |
| | Yes |
| |
| |`IgniteException` |
| |Indicates an error condition in the cluster. |
| |Operation failed. Exit from the method. |
| |Yes |
| |
| |`IgniteClientDisconnectedException` |
| |Thrown by the Ignite API when a client node gets disconnected from the cluster. Thrown from cache operations, the Compute API, and data structures. |
| |Wait and use retry logic. |
| |Yes |
| |`IgniteAuthenticationException` |
| |Thrown when there is either a node authentication failure or security authentication failure. |
| |Operation failed. Exit from the method. |
| |No |
| |`IgniteClientException` |
| |Can be thrown from Cache operations. |
| |Check exception message for the action to be taken. |
| |Yes |
| |`IgniteDeploymentException` |
| |Thrown when the Ignite API fails to deploy a job or task on a node. Thrown from the Compute API. |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteInterruptedException` |
| |Used to wrap the standard `InterruptedException` into `IgniteException`. |
| |Retry after clearing the interrupted flag. |
| |Yes |
| |`IgniteSpiException` |
| |Thrown by various SPIs (`CollisionSpi`, `LoadBalancingSpi`, `TcpDiscoveryIpFinder`, `FailoverSpi`, `UriDeploymentSpi`, etc.). |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteSQLException` |
| |Thrown when there is a SQL query processing error. This exception also provides query-specific error codes. |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteAccessControlException` |
| |Thrown when there is an authentication / authorization failure. |
| |Operation failed. Exit from the method. |
| |No |
| |`IgniteCacheRestartingException` |
| |Thrown from Ignite cache API if a cache is restarting. |
| |Wait and use retry logic. |
| |Yes |
| |`IgniteFutureTimeoutException` |
| |Thrown when a future computation times out. |
| |Either increase the timeout limit or exit from the method. |
| |Yes |
| |`IgniteFutureCancelledException` |
| |Thrown when a future computation cannot be retrieved because it was cancelled. |
| |Use retry logic. |
| |Yes |
| |`IgniteIllegalStateException` |
| |Indicates that the Ignite instance is in an invalid state for the requested operation. |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteNeedReconnectException` |
| |Indicates that a node should try to reconnect to the cluster. |
| |Use retry logic. |
| |No |
| |`IgniteDataIntegrityViolationException` |
| |Thrown if a data integrity violation is found. |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteOutOfMemoryException` |
| |Thrown when the system does not have enough memory to process Ignite operations. Thrown from Cache operations. |
| |Operation failed. Exit from the method. |
| |Yes |
| |`IgniteTxOptimisticCheckedException` |
| |Thrown when a transaction fails optimistically. |
| |Use retry logic. |
| |No |
| |`IgniteTxRollbackCheckedException` |
| |Thrown when a transaction has been automatically rolled back. |
| |Use retry logic. |
| |No |
| |`IgniteTxTimeoutCheckedException` |
| |Thrown when a transaction times out. |
| |Use retry logic. |
| |No |
| |`ClusterTopologyException` |
| |Indicates an error with the cluster topology (e.g., a crashed node). Thrown from the Compute and Events APIs. |
| |Wait on future and use retry logic. |
| |Yes |
| |======================================================================= |
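
Several rows in the table above recommend retry logic. The following is a minimal sketch of such a loop in plain Java; it is not part of the Ignite API. The class name `RetrySketch`, the method `callWithRetry`, and the fixed delay between attempts are all assumptions made for illustration. Which exceptions count as retryable is left to the caller-supplied predicate.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not an Ignite class. */
final class RetrySketch {
    private RetrySketch() {
    }

    /**
     * Calls {@code op} until it succeeds, a non-retryable exception is thrown,
     * or {@code maxAttempts} attempts have been made.
     */
    static <T> T callWithRetry(Callable<T> op, Predicate<Exception> retryable,
                               int maxAttempts, long delayMs) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");

        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            }
            catch (Exception e) {
                // Non-retryable failures and the final attempt propagate to the caller.
                if (!retryable.test(e) || attempt >= maxAttempts)
                    throw e;

                try {
                    // Assumption: a fixed pause between attempts is good enough here.
                    Thread.sleep(delayMs);
                }
                catch (InterruptedException ie) {
                    // Restore the interrupted flag so callers can observe it.
                    Thread.currentThread().interrupt();
                    throw ie;
                }
            }
        }
    }
}
```

In an Ignite application, the predicate could be something like `e -> e instanceof IgniteClientDisconnectedException`. For that particular exception, waiting on the future returned by its `reconnectFuture()` method is likely a better choice than a fixed delay.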
| |
| == Critical Failures Handling |
| |
| Ignite is designed to be robust and fault tolerant. Nevertheless, unpredictable problems can arise in production |
| that affect the state of an individual node or of the whole cluster. Such problems can be detected at |
| runtime and handled by a preconfigured critical failure handler. |
| |
| === Critical Failures |
| |
| The following failures are treated as critical: |
| |
| * System critical errors (e.g. `OutOfMemoryError`). |
| |
| * Unintentional system worker termination (e.g. due to an unhandled exception). |
| |
| * System workers hanging. |
| |
| * Cluster node segmentation. |
| |
| A system critical error is an error which leads to the system's inoperability. For example: |
| |
| * File I/O errors - file read/write operations usually throw `IOException`. These errors can occur when Ignite |
| native persistence is enabled (e.g., when no space is left or on a device error), and also in in-memory |
| mode, because Ignite keeps some metadata on disk (e.g., when the file descriptor limit is |
| exceeded or file access is prohibited). |
| |
| * Out of memory error - when the Ignite memory management system fails to allocate more space |
| (`IgniteOutOfMemoryException`). |
| |
| * Out of memory error - when a cluster node runs out of Java heap (`OutOfMemoryError`). |
| |
| === Failures Handling |
| |
| When Ignite detects a critical failure, it handles the failure according to a preconfigured failure handler. |
| The failure handler can be configured as follows: |
| |
| :javaFile: code-snippets/java/src/main/java/org/apache/ignite/snippets/FailureHandler.java |
| |
| [tabs] |
| -- |
| tab:XML[] |
| [source,xml] |
| ---- |
| <bean class="org.apache.ignite.configuration.IgniteConfiguration"> |
|     <property name="failureHandler"> |
|         <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/> |
|     </property> |
| </bean> |
| ---- |
| tab:Java[] |
| [source,java] |
| ---- |
| include::{javaFile}[tag=configure-handler,indent=0] |
| ---- |
| -- |
| |
| Ignite supports the following failure handlers: |
| |
| [width=100%,cols="30%,70%"] |
| |======================================================================= |
| |Class |Description |
| |
| |`NoOpFailureHandler` |
| |Ignores any failures. Useful for testing and debugging. |
| |`RestartProcessFailureHandler` |
| |A specific implementation that can be used only with `ignite.sh\|bat`. The process must be terminated by using the `Ignition.restart(true)` method. |
| |`StopNodeFailureHandler` |
| |Stops the node in case of critical errors by calling the `Ignition.stop(true)` or `Ignition.stop(nodeName, true)` methods. |
| |`StopNodeOrHaltFailureHandler` |
| |This is the default handler, which tries to stop a node. If the node can't be stopped, then the handler terminates the JVM process. |
| |
| |======================================================================= |
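
The "try to stop, otherwise terminate" policy described for `StopNodeOrHaltFailureHandler` can be sketched in plain Java as follows. This is an illustration of the idea only and not Ignite's implementation: the class `StopOrHaltSketch`, the `stopNode` and `haltJvm` callbacks, and the use of a timeout to decide that the node "can't be stopped" are assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a stop-or-halt policy; not an Ignite class. */
final class StopOrHaltSketch {
    private StopOrHaltSketch() {
    }

    /**
     * Runs {@code stopNode} in a separate thread and waits up to {@code timeoutMs}.
     * If the stop does not complete in time (or fails), runs {@code haltJvm}.
     *
     * @return {@code true} if the node stopped gracefully, {@code false} if halt was invoked.
     */
    static boolean stopOrHalt(Runnable stopNode, Runnable haltJvm, long timeoutMs)
        throws InterruptedException {
        CountDownLatch stopped = new CountDownLatch(1);

        Thread stopper = new Thread(() -> {
            try {
                stopNode.run();
                stopped.countDown();
            }
            catch (RuntimeException ignored) {
                // A failed stop is treated like a stop that never finished.
            }
        }, "node-stopper");

        // Daemon thread: a hung stop attempt must not keep the JVM alive.
        stopper.setDaemon(true);
        stopper.start();

        if (stopped.await(timeoutMs, TimeUnit.MILLISECONDS))
            return true;

        // Assumption: in a real process this callback would end the JVM.
        haltJvm.run();
        return false;
    }
}
```

In a real deployment the `haltJvm` callback might wrap `Runtime.getRuntime().halt(...)`. Passing it in as a `Runnable` keeps the sketch testable without terminating the process.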
| |
| === Critical Workers Health Check |
| |
| Ignite has a number of internal workers that are essential for the cluster to function correctly. If one of them is |
| terminated, the node can become inoperative. |
| |
| The following system workers are considered mission critical: |
| |
| * Discovery worker - discovery events handling. |
| * TCP communication worker - peer-to-peer communication between nodes. |
| * Exchange worker - partition map exchange. |
| * Workers of the system's striped pool. |
| * Data Streamer striped pool workers. |
| * Timeout worker - timeouts handling. |
| * Checkpoint thread - checkpointing in Ignite persistence. |
| * WAL workers - write-ahead logging, segments archiving, and compression. |
| * Expiration worker - TTL based expiration. |
| * NIO workers - base networking. |
| |
| Ignite has an internal mechanism for verifying that critical workers are operational. |
| Each worker is regularly checked to confirm that it is alive and updating its heartbeat timestamp. |
| If a worker is not alive or has stopped updating its heartbeat, it is regarded as blocked and Ignite prints a message to the log file. |
| You can set the period of inactivity after which a worker is regarded as blocked via the `IgniteConfiguration.systemWorkerBlockedTimeout` property. |
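
To make the heartbeat check concrete, here is a minimal sketch of a timestamp-based watchdog in plain Java. It only illustrates the general technique and does not reflect how Ignite implements it internally. The class `HeartbeatWatchdogSketch` and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat watchdog for illustration; not an Ignite class. */
final class HeartbeatWatchdogSketch {
    /** Last heartbeat timestamp (milliseconds) per worker name. */
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Plays the role of a "blocked timeout": maximum allowed silence in milliseconds. */
    private final long blockedTimeoutMs;

    HeartbeatWatchdogSketch(long blockedTimeoutMs) {
        this.blockedTimeoutMs = blockedTimeoutMs;
    }

    /** Called by a worker to report that it is still making progress. */
    void heartbeat(String worker, long nowMs) {
        lastHeartbeat.put(worker, nowMs);
    }

    /** Returns the sorted names of workers whose last heartbeat is older than the timeout. */
    List<String> blockedWorkers(long nowMs) {
        List<String> blocked = new ArrayList<>();

        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > blockedTimeoutMs)
                blocked.add(e.getKey());
        }

        Collections.sort(blocked);
        return blocked;
    }
}
```

A periodic checker would call `blockedWorkers(System.currentTimeMillis())` and log or escalate every name it returns.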
| |
| Even though Ignite considers an unresponsive system worker to be a critical error, it does not handle this situation automatically; |
| it only prints a message to the log file. |
| To make a failure handler react to unresponsive system workers of all types, clear the handler's |
| `ignoredFailureTypes` property as shown below: |
| |
| [tabs] |
| -- |
| tab:XML[] |
| [source,xml] |
| ---- |
| <bean class="org.apache.ignite.configuration.IgniteConfiguration"> |
| |
|     <property name="systemWorkerBlockedTimeout" value="#{60 * 60 * 1000}"/> |
| |
|     <property name="failureHandler"> |
|         <bean class="org.apache.ignite.failure.StopNodeFailureHandler"> |
| |
|             <!-- An empty list makes this handler react to unresponsive critical workers. --> |
|             <property name="ignoredFailureTypes"> |
|                 <list> |
|                 </list> |
|             </property> |
| |
|         </bean> |
| |
|     </property> |
| </bean> |
| ---- |
| tab:Java[] |
| [source,java] |
| ---- |
| include::{javaFile}[tag=failure-types,indent=0] |
| ---- |
| -- |
| |