manual/core/retries/README.md - cassandra-java-driver - Git at Google

 ## Retries

 ### Quick overview

 What to do when a request failed on a node: retry (same or other node), rethrow, or ignore.

 * `advanced.retry-policy` in the configuration. Default policy retries at most once, in cases that
   have a high chance of success; you can also write your own.
 * can have per-profile policies.
 * only kicks in if the query is idempotent.

 -----

 When a query fails, it sometimes makes sense to retry it: the error might be temporary, or the query
 might work on a different node. The driver uses a *retry policy* to determine when and how to retry.
 It is defined in the [configuration](../configuration/):

 ```
 datastax-java-driver.advanced.retry-policy {
   class = DefaultRetryPolicy
 }
 ```

 The behavior of the default policy will be detailed in the sections below. You can also use your
 own policy by specifying the fully-qualified name of a class that implements [RetryPolicy].

 The policy has several methods that cover different error cases. Each method returns a decision to
 indicate what to do next:

 * retry on the same node;
 * retry on the next node in the [query plan](../load_balancing/) for this statement;
 * rethrow the exception to the user code (from the `session.execute` call, or as a failed future if
   using the asynchronous API);
 * ignore the exception. That is, mark the request as successful, and return an empty result set.

 ### onUnavailable

 A request reached the coordinator, but there weren't enough live replicas to achieve the requested
 consistency level. The coordinator replied with an `UNAVAILABLE` error.

 If the policy rethrows the error, the user code will get an [UnavailableException]. You can inspect
 the exception's fields to get the amount of replicas that were known to be *alive* when the error
 was triggered, as well as the amount of replicas that where *required* by the requested consistency
 level.

 The default policy triggers a maximum of one retry, to the next node in the query plan. The
 rationale is that the first coordinator might have been network-isolated from all other nodes
 (thinking they're down), but still able to communicate with the client; in that case, retrying on
 the same node has almost no chance of success, but moving to the next node might solve the issue.

 ### onReadTimeout

 A read request reached the coordinator, which initially believed that there were enough live
 replicas to process it. But one or several replicas were too slow to answer within the predefined
 timeout (`read_request_timeout_in_ms` in `cassandra.yaml`); therefore the coordinator replied to the
 client with a `READ_TIMEOUT` error.

 This could be due to a temporary overloading of these replicas, or even that they just failed or
 were turned off. During reads, Cassandra doesn't request data from every replica to minimize
 internal network traffic; instead, some replicas are only asked for a checksum of the data. A read
 timeout may occur even if enough replicas responded to fulfill the consistency level, but only
 checksum responses were received (the method's `dataPresent` parameter allow you to check if you're
 in that situation).

 If the policy rethrows the error, the user code will get a [ReadTimeoutException] \(do not confuse
 this error with [DriverTimeoutException], which happens when the coordinator didn't reply at all to
 the client).

 The default policy triggers a maximum of one retry (to the same node), and only if enough replicas
 had responded to the read request but data was not retrieved amongst those. That usually means that
 enough replicas are alive to satisfy the consistency, but the coordinator picked a dead one for data
 retrieval, not having detected that replica as dead yet. The reasoning is that by the time we get
 the timeout, the dead replica will likely have been detected as dead and the retry has a high chance
 of success.

 ### onWriteTimeout

 This is similar to `onReadTimeout`, but for write operations. The reason reads and writes are
 handled separately is because a read is obviously a non mutating operation, whereas a write is
 likely to be. If a write times out at the coordinator level, there is no way to know whether the
 mutation was applied or not on the non-answering replica.

 If the policy rethrows the error, the user code will get a [WriteTimeoutException].

 This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
 bypasses the retry policy and always rethrows the error.

 The default policy triggers a maximum of one retry (to the same node), and only for a `BATCH_LOG`
 write. The reasoning is that the coordinator tries to write the distributed batch log against a
 small subset of nodes in the local datacenter; a timeout usually means that none of these nodes were
 alive but the coordinator hadn't detected them as dead yet. By the time we get the timeout, the dead
 nodes will likely have been detected as dead, and the retry has a high chance of success.

 ### onRequestAborted

 The request was aborted before we could get a response from the coordinator. This can happen in two
 cases:

 * if the connection was closed due to an external event. This will manifest as a
   [ClosedConnectionException] \(network failure) or [HeartbeatException] \(missed
   [heartbeat](../pooling/#heartbeat));
 * if there was an unexpected error while decoding the response (this can only be a driver bug).

 This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
 bypasses the retry policy and always rethrows the error.

 The default policy retries on the next node if the connection was closed, and rethrows (assuming a
 driver bug) in all other cases.

 ### onErrorResponse

 The coordinator replied with an error other than `READ_TIMEOUT`, `WRITE_TIMEOUT` or `UNAVAILABLE`.
 Namely, this covers [OverloadedException], [ServerError], [TruncateException],
 [ReadFailureException] and [WriteFailureException].

 This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
 bypasses the retry policy and always rethrows the error.

 The default policy rethrows read and write failures, and retries other errors on the next node.

 ### Hard-coded rules

 There are a few cases where retrying is always the right thing to do. These are not covered by
 `RetryPolicy`, but instead hard-coded in the driver:

 * **any error before a network write was attempted**: to send a query, the driver selects a node,
   borrows a connection from the host's [connection pool](../pooling/), and then writes the message
   to the connection. Errors can occur before the write was even attempted, for example if the
   connection pool is saturated, or if the node went down right after we borrowed. In those cases, it
   is always safe to retry since the request wasn't sent, so the driver will transparently move to
   the next node in the query plan.
 * **re-preparing a statement**: when the driver executes a prepared statement, it may find out that
   the coordinator doesn't know about it, and need to re-prepare it on the fly (this is described in
   detail [here](../statements/prepared/)). The query is then retried on the same node.
 * **trying to communicate with a node that is bootstrapping**: this is a rare edge case, as in
   practice the driver should never try to communicate with a bootstrapping node (the only way is if
   it was specified as a contact point). It is again safe to assume that the query was not executed
   at all, so the driver moves to the next node.

 Similarly, some errors have no chance of being solved by a retry. They will always be rethrown
 directly to the user. These include:

 * [QueryValidationException] and any of its subclasses;
 * [FunctionFailureException];
 * [ProtocolError].

 ### Using multiple policies

 The retry policy can be overridden in [execution profiles](../configuration/#profiles):

 ```
 datastax-java-driver {
   advanced.retry-policy {
     class = DefaultRetryPolicy
   }
   profiles {
     custom-retries {
       advanced.retry-policy {
         class = CustomRetryPolicy
       }
     }
     slow {
       request.timeout = 30 seconds
     }
   }
 }
 ```

 The `custom-retries` profile uses a dedicated policy. The `slow` profile inherits the default
 profile's. Note that this goes beyond configuration inheritance: the driver only creates a single
 `DefaultRetryPolicy` instance and reuses it (this also occurs if two sibling profiles have the same
 configuration).

 Each request uses its declared profile's policy. If it doesn't declare any profile, or if the
 profile doesn't have a dedicated policy, then the default profile's policy is used.

 [AllNodesFailedException]:   https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/AllNodesFailedException.html
 [ClosedConnectionException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/connection/ClosedConnectionException.html
 [DriverTimeoutException]:    https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/DriverTimeoutException.html
 [FunctionFailureException]:  https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/FunctionFailureException.html
 [HeartbeatException]:        https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/connection/HeartbeatException.html
 [ProtocolError]:             https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ProtocolError.html
 [OverloadedException]:       https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/OverloadedException.html
 [QueryValidationException]:  https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/QueryValidationException.html
 [ReadFailureException]:      https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ReadFailureException.html
 [ReadTimeoutException]:      https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ReadTimeoutException.html
 [RetryDecision]:             https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/retry/RetryDecision.html
 [RetryPolicy]:               https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/retry/RetryPolicy.html
 [ServerError]:               https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ServerError.html
 [TruncateException]:         https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/TruncateException.html
 [UnavailableException]:      https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/UnavailableException.html
 [WriteFailureException]:     https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/WriteFailureException.html
 [WriteTimeoutException]:     https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/WriteTimeoutException.html
	## Retries

	### Quick overview

	What to do when a request failed on a node: retry (same or other node), rethrow, or ignore.

	* `advanced.retry-policy` in the configuration. Default policy retries at most once, in cases that
	have a high chance of success; you can also write your own.
	* can have per-profile policies.
	* only kicks in if the query is idempotent.

	-----

	When a query fails, it sometimes makes sense to retry it: the error might be temporary, or the query
	might work on a different node. The driver uses a retry policy to determine when and how to retry.
	It is defined in the [configuration](../configuration/):

	```
	datastax-java-driver.advanced.retry-policy {
	class = DefaultRetryPolicy
	}
	```

	The behavior of the default policy will be detailed in the sections below. You can also use your
	own policy by specifying the fully-qualified name of a class that implements [RetryPolicy].

	The policy has several methods that cover different error cases. Each method returns a decision to
	indicate what to do next:

	* retry on the same node;
	* retry on the next node in the [query plan](../load_balancing/) for this statement;
	* rethrow the exception to the user code (from the `session.execute` call, or as a failed future if
	using the asynchronous API);
	* ignore the exception. That is, mark the request as successful, and return an empty result set.

	### onUnavailable

	A request reached the coordinator, but there weren't enough live replicas to achieve the requested
	consistency level. The coordinator replied with an `UNAVAILABLE` error.

	If the policy rethrows the error, the user code will get an [UnavailableException]. You can inspect
	the exception's fields to get the amount of replicas that were known to be alive when the error
	was triggered, as well as the amount of replicas that where required by the requested consistency
	level.

	The default policy triggers a maximum of one retry, to the next node in the query plan. The
	rationale is that the first coordinator might have been network-isolated from all other nodes
	(thinking they're down), but still able to communicate with the client; in that case, retrying on
	the same node has almost no chance of success, but moving to the next node might solve the issue.

	### onReadTimeout

	A read request reached the coordinator, which initially believed that there were enough live
	replicas to process it. But one or several replicas were too slow to answer within the predefined
	timeout (`read_request_timeout_in_ms` in `cassandra.yaml`); therefore the coordinator replied to the
	client with a `READ_TIMEOUT` error.

	This could be due to a temporary overloading of these replicas, or even that they just failed or
	were turned off. During reads, Cassandra doesn't request data from every replica to minimize
	internal network traffic; instead, some replicas are only asked for a checksum of the data. A read
	timeout may occur even if enough replicas responded to fulfill the consistency level, but only
	checksum responses were received (the method's `dataPresent` parameter allow you to check if you're
	in that situation).

	If the policy rethrows the error, the user code will get a [ReadTimeoutException] \(do not confuse
	this error with [DriverTimeoutException], which happens when the coordinator didn't reply at all to
	the client).

	The default policy triggers a maximum of one retry (to the same node), and only if enough replicas
	had responded to the read request but data was not retrieved amongst those. That usually means that
	enough replicas are alive to satisfy the consistency, but the coordinator picked a dead one for data
	retrieval, not having detected that replica as dead yet. The reasoning is that by the time we get
	the timeout, the dead replica will likely have been detected as dead and the retry has a high chance
	of success.

	### onWriteTimeout

	This is similar to `onReadTimeout`, but for write operations. The reason reads and writes are
	handled separately is because a read is obviously a non mutating operation, whereas a write is
	likely to be. If a write times out at the coordinator level, there is no way to know whether the
	mutation was applied or not on the non-answering replica.

	If the policy rethrows the error, the user code will get a [WriteTimeoutException].

	This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
	bypasses the retry policy and always rethrows the error.

	The default policy triggers a maximum of one retry (to the same node), and only for a `BATCH_LOG`
	write. The reasoning is that the coordinator tries to write the distributed batch log against a
	small subset of nodes in the local datacenter; a timeout usually means that none of these nodes were
	alive but the coordinator hadn't detected them as dead yet. By the time we get the timeout, the dead
	nodes will likely have been detected as dead, and the retry has a high chance of success.

	### onRequestAborted

	The request was aborted before we could get a response from the coordinator. This can happen in two
	cases:

	* if the connection was closed due to an external event. This will manifest as a
	[ClosedConnectionException] \(network failure) or [HeartbeatException] \(missed
	[heartbeat](../pooling/#heartbeat));
	* if there was an unexpected error while decoding the response (this can only be a driver bug).

	This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
	bypasses the retry policy and always rethrows the error.

	The default policy retries on the next node if the connection was closed, and rethrows (assuming a
	driver bug) in all other cases.

	### onErrorResponse

	The coordinator replied with an error other than `READ_TIMEOUT`, `WRITE_TIMEOUT` or `UNAVAILABLE`.
	Namely, this covers [OverloadedException], [ServerError], [TruncateException],
	[ReadFailureException] and [WriteFailureException].

	This method is only invoked for [idempotent](../idempotence/) statements. Otherwise, the driver
	bypasses the retry policy and always rethrows the error.

	The default policy rethrows read and write failures, and retries other errors on the next node.

	### Hard-coded rules

	There are a few cases where retrying is always the right thing to do. These are not covered by
	`RetryPolicy`, but instead hard-coded in the driver:

	* any error before a network write was attempted: to send a query, the driver selects a node,
	borrows a connection from the host's [connection pool](../pooling/), and then writes the message
	to the connection. Errors can occur before the write was even attempted, for example if the
	connection pool is saturated, or if the node went down right after we borrowed. In those cases, it
	is always safe to retry since the request wasn't sent, so the driver will transparently move to
	the next node in the query plan.
	* re-preparing a statement: when the driver executes a prepared statement, it may find out that
	the coordinator doesn't know about it, and need to re-prepare it on the fly (this is described in
	detail [here](../statements/prepared/)). The query is then retried on the same node.
	* trying to communicate with a node that is bootstrapping: this is a rare edge case, as in
	practice the driver should never try to communicate with a bootstrapping node (the only way is if
	it was specified as a contact point). It is again safe to assume that the query was not executed
	at all, so the driver moves to the next node.

	Similarly, some errors have no chance of being solved by a retry. They will always be rethrown
	directly to the user. These include:

	* [QueryValidationException] and any of its subclasses;
	* [FunctionFailureException];
	* [ProtocolError].

	### Using multiple policies

	The retry policy can be overridden in [execution profiles](../configuration/#profiles):

	```
	datastax-java-driver {
	advanced.retry-policy {
	class = DefaultRetryPolicy
	}
	profiles {
	custom-retries {
	advanced.retry-policy {
	class = CustomRetryPolicy
	}
	}
	slow {
	request.timeout = 30 seconds
	}
	}
	}
	```

	The `custom-retries` profile uses a dedicated policy. The `slow` profile inherits the default
	profile's. Note that this goes beyond configuration inheritance: the driver only creates a single
	`DefaultRetryPolicy` instance and reuses it (this also occurs if two sibling profiles have the same
	configuration).

	Each request uses its declared profile's policy. If it doesn't declare any profile, or if the
	profile doesn't have a dedicated policy, then the default profile's policy is used.

	[AllNodesFailedException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/AllNodesFailedException.html
	[ClosedConnectionException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/connection/ClosedConnectionException.html
	[DriverTimeoutException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/DriverTimeoutException.html
	[FunctionFailureException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/FunctionFailureException.html
	[HeartbeatException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/connection/HeartbeatException.html
	[ProtocolError]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ProtocolError.html
	[OverloadedException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/OverloadedException.html
	[QueryValidationException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/QueryValidationException.html
	[ReadFailureException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ReadFailureException.html
	[ReadTimeoutException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ReadTimeoutException.html
	[RetryDecision]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/retry/RetryDecision.html
	[RetryPolicy]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/retry/RetryPolicy.html
	[ServerError]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/ServerError.html
	[TruncateException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/TruncateException.html
	[UnavailableException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/UnavailableException.html
	[WriteFailureException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/WriteFailureException.html
	[WriteTimeoutException]: https://docs.datastax.com/en/drivers/java/4.3/com/datastax/oss/driver/api/core/servererrors/WriteTimeoutException.html