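Slave recovery is a feature of Mesos that allows executors and tasks to keep running while the slave process is down, and allows a restarted slave process to reconnect with the executors and tasks still running on that slave.
A Mesos slave may be restarted for an upgrade or after a crash. This feature was introduced in the 0.14.0 release.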
Slave recovery works by having the slave checkpoint enough information (e.g., Task Info, Executor Info, Status Updates) about the running tasks and executors to local disk. Once the slave and the framework(s) have enabled checkpointing, any subsequent slave restart recovers the checkpointed information and reconnects with the executors. Note that if the host running the slave process is rebooted, all the executors/tasks are killed.
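The checkpoint-then-recover idea described above can be illustrated with a toy sketch. This is not the Mesos implementation or its on-disk format: the JSON file, the `checkpoint` and `recover` function names, and the shape of the state dictionary are all assumptions made for illustration.

```python
import json
import os
import tempfile


def checkpoint(state, path):
    """Write `state` to `path` so that a later process can read it back.

    The data goes to a temporary file first and is then renamed into place,
    so a crash in the middle of a write does not leave a half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def recover(path):
    """Return the last checkpointed state, or None if nothing was checkpointed."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

A restarted process would call `recover` on the same path and use the returned task and executor records to decide which executors to reconnect with.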
NOTE: To enable recovery, the framework must explicitly request checkpointing. Alternatively, a framework that doesn't want the disk I/O overhead of checkpointing can opt out of checkpointing.
NOTE: From Mesos 0.22.0 slave checkpointing will be automatically enabled for all slaves.
As part of this feature, four new flags were added to the slave.
checkpoint
: Whether to checkpoint slave and frameworks information to disk [Default: true].
NOTE: From Mesos 0.22.0 this flag will be removed as it will be enabled for all slaves.
strict
: Whether to do recovery in strict mode [Default: true].
recover
: Whether to recover status updates and reconnect with old executors [Default: reconnect].
recovery_timeout
: Amount of time allotted for the slave to recover [Default: 15 mins].
If the slave takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the slave will self-terminate. NOTE: This flag is only applicable when --checkpoint is enabled.
NOTE: If none of the frameworks have enabled checkpointing, executors/tasks of frameworks die when the slave dies and are not recovered.
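As a rough sketch of how the four flags above fit together, the helper below assembles a slave command line from them. The function name `slave_command` is hypothetical, the defaults mirror the ones listed above, and every other option a real slave needs (such as the master address) is deliberately left out.

```python
def slave_command(checkpoint=True, strict=True, recover="reconnect",
                  recovery_timeout="15mins", binary="./mesos-slave.sh"):
    """Build an argument list containing the slave recovery flags.

    Pass checkpoint=None to omit --checkpoint entirely, e.g. on versions
    where that flag has been removed.
    """
    def flag(name, value):
        # Booleans are rendered as lowercase true/false on the command line.
        if isinstance(value, bool):
            value = "true" if value else "false"
        return "--{}={}".format(name, value)

    args = [binary]
    if checkpoint is not None:
        args.append(flag("checkpoint", checkpoint))
    args.append(flag("strict", strict))
    args.append(flag("recover", recover))
    args.append(flag("recovery_timeout", recovery_timeout))
    return args
```

For example, `slave_command(strict=False)` yields a list that could be handed to `subprocess.Popen` once the remaining required options are appended.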
A restarted slave should re-register with the master within a timeout (currently 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g., using monit).
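To make the restart automation concrete, here is a minimal supervisor loop. It is only a stand-in for a dedicated tool such as monit, not a replacement for one; the function name `supervise` and its parameters are assumptions for illustration, and the 75-second constant is simply the timeout quoted above.

```python
import subprocess
import time

# Master-side re-registration timeout quoted in the text (seconds).
REREGISTER_TIMEOUT_SECONDS = 75


def supervise(cmd, max_restarts, delay_seconds=1.0):
    """Run `cmd`, restarting it each time it exits, up to `max_restarts` times.

    Returns (number_of_restarts, last_exit_code).
    """
    if delay_seconds >= REREGISTER_TIMEOUT_SECONDS:
        # Waiting this long would guarantee the slave misses the timeout.
        raise ValueError("restart delay must be shorter than the re-registration timeout")
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if restarts >= max_restarts:
            return restarts, exit_code
        restarts += 1
        time.sleep(delay_seconds)
```

The restart delay is kept well below the re-registration timeout because the slave still needs time to recover and re-register after the process comes back up.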
For the complete list of slave options: ./mesos-slave.sh --help
As part of this feature, FrameworkInfo has been updated to include an optional checkpoint field. A framework that would like to opt in to checkpointing should set FrameworkInfo.checkpoint=True before registering with the master.
NOTE: Frameworks that have enabled checkpointing will only get offers from checkpointing slaves. So, before setting checkpoint=True on FrameworkInfo, ensure that there are slaves in your cluster that have enabled checkpointing. If there are no checkpointing slaves, the framework will not get any offers and hence cannot launch any tasks/executors!
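The opt-in and the caveat in the note above can be sketched as follows. The `FrameworkInfo` class here is a stand-in so that the example runs without Mesos installed; with the real Python bindings the same assignment would presumably be made on the protobuf message. The `check_offers_possible` helper and its list-of-booleans input are hypothetical and model only the rule stated in the note.

```python
from dataclasses import dataclass


@dataclass
class FrameworkInfo:
    """Stand-in for the Mesos FrameworkInfo message (illustration only)."""
    user: str
    name: str
    checkpoint: bool = False  # optional field; the framework must opt in


def check_offers_possible(framework, slave_checkpoint_flags):
    """Raise if a checkpointing framework could never receive an offer.

    `slave_checkpoint_flags` holds one boolean per slave, True when that
    slave has checkpointing enabled (a hypothetical input for this sketch).
    """
    if framework.checkpoint and not any(slave_checkpoint_flags):
        raise RuntimeError(
            "no checkpointing slaves: the framework would receive no offers")


# Opt in before registering with the master.
framework = FrameworkInfo(user="", name="example-framework")
framework.checkpoint = True
```

Running `check_offers_possible(framework, [False, True])` passes silently, while a cluster described by `[False, False]` raises, which is the situation the note warns about.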
If you want to upgrade a running Mesos cluster to 0.14.0 to take advantage of slave recovery, please follow the upgrade instructions.