<table class="configuration table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 20%">Key</th>
<th class="text-left" style="width: 15%">Default</th>
<th class="text-left" style="width: 10%">Type</th>
<th class="text-left" style="width: 55%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><h5>cluster.io-pool.size</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>Integer</td>
<td>The size of the IO executor pool used by the cluster to execute blocking IO operations (Master as well as TaskManager processes). By default it will use 4 * the number of CPU cores (hardware contexts) that the cluster process has access to. Increasing the pool size allows more IO operations to run concurrently.</td>
</tr>
<tr>
<td><h5>cluster.registration.error-delay</h5></td>
<td style="word-wrap: break-word;">10000</td>
<td>Long</td>
<td>The pause, in milliseconds, made after a registration attempt caused an exception (other than a timeout).</td>
</tr>
<tr>
<td><h5>cluster.registration.initial-timeout</h5></td>
<td style="word-wrap: break-word;">100</td>
<td>Long</td>
<td>Initial registration timeout between cluster components in milliseconds.</td>
</tr>
<tr>
<td><h5>cluster.registration.max-timeout</h5></td>
<td style="word-wrap: break-word;">30000</td>
<td>Long</td>
<td>Maximum registration timeout between cluster components in milliseconds.</td>
</tr>
<tr>
<td><h5>cluster.registration.refused-registration-delay</h5></td>
<td style="word-wrap: break-word;">30000</td>
<td>Long</td>
<td>The pause, in milliseconds, made after a registration attempt was refused.</td>
</tr>
<tr>
<td><h5>cluster.services.shutdown-timeout</h5></td>
<td style="word-wrap: break-word;">30000</td>
<td>Long</td>
<td>The shutdown timeout, in milliseconds, for cluster services such as executors.</td>
</tr>
<tr>
<td><h5>heartbeat.interval</h5></td>
<td style="word-wrap: break-word;">10000</td>
<td>Long</td>
<td>Time interval, in milliseconds, between heartbeat RPC requests from the sender to the receiver side.</td>
</tr>
<tr>
<td><h5>heartbeat.rpc-failure-threshold</h5></td>
<td style="word-wrap: break-word;">2</td>
<td>Integer</td>
<td>The number of consecutive failed heartbeat RPCs until a heartbeat target is marked as unreachable. Failed heartbeat RPCs can be used to detect dead targets faster because they no longer receive the RPCs. The detection time is <code class="highlighter-rouge">heartbeat.interval</code> * <code class="highlighter-rouge">heartbeat.rpc-failure-threshold</code>. In environments with a flaky network, setting this value too low can produce false positives. In this case, we recommend increasing this value, but not higher than <code class="highlighter-rouge">heartbeat.timeout</code> / <code class="highlighter-rouge">heartbeat.interval</code>. The mechanism can be disabled by setting this option to <code class="highlighter-rouge">-1</code>.</td>
</tr>
<tr>
<td><h5>heartbeat.timeout</h5></td>
<td style="word-wrap: break-word;">50000</td>
<td>Long</td>
<td>Timeout, in milliseconds, for requesting and receiving heartbeats on both the sender and receiver sides.</td>
</tr>
<tr>
<td><h5>jobmanager.execution.failover-strategy</h5></td>
<td style="word-wrap: break-word;">"region"</td>
<td>String</td>
<td>This option specifies how the job computation recovers from task failures. Accepted values are:<ul><li>'full': Restarts all tasks to recover the job.</li><li>'region': Restarts all tasks that could be affected by the task failure. More details can be found <a href="../../ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy">here</a>.</li></ul></td>
</tr>
</tbody>
</table>
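<p>As a rough illustration only, the options above can be set in the Flink configuration file (e.g. <code class="highlighter-rouge">flink-conf.yaml</code>). The values below are assumptions chosen to sketch a tuning for a flaky-network scenario, not recommended defaults:</p>
<pre>
# Sketch of a flink-conf.yaml excerpt; values are illustrative assumptions, not recommendations.

# Allow more blocking IO operations to run concurrently on JobManager/TaskManager processes.
cluster.io-pool.size: 16

# Send heartbeat RPCs every 10 s; consider a target dead after 60 s without a heartbeat.
heartbeat.interval: 10000
heartbeat.timeout: 60000

# Tolerate up to 4 consecutive failed heartbeat RPCs before marking a target unreachable
# (kept below heartbeat.timeout / heartbeat.interval = 6 to avoid false positives).
heartbeat.rpc-failure-threshold: 4

# Restart only the pipelined region affected by a task failure.
jobmanager.execution.failover-strategy: region
</pre>
<p>Note that <code class="highlighter-rouge">heartbeat.rpc-failure-threshold</code> only shortens detection of unreachable targets; <code class="highlighter-rouge">heartbeat.timeout</code> remains the upper bound on how long a missing heartbeat is tolerated.</p>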