| --- |
| title: Apache Mesos - Agent Recovery |
| layout: documentation |
| --- |
| |
| # Agent Recovery |
| |
| If the `mesos-agent` process on a host exits (perhaps due to a Mesos bug or |
| because the operator kills the process while [upgrading Mesos](upgrades.md)), |
| any executors/tasks that were being managed by the `mesos-agent` process will |
| continue to run. |
| |
| By default, all the executors/tasks that were being managed by the old |
| `mesos-agent` process are expected to gracefully exit on their own, and |
| will be shut down after the agent restarts if they have not. |
| |
| However, if a framework enabled _checkpointing_ when it registered with the |
| master, any executors belonging to that framework can reconnect to the new |
| `mesos-agent` process and continue running uninterrupted. Hence, enabling |
| framework checkpointing allows tasks to tolerate Mesos agent upgrades and |
| unexpected `mesos-agent` crashes without experiencing any downtime. |
| |
| Agent recovery works by having the agent checkpoint information about its own |
| state and about the tasks and executors it is managing to local disk, for |
| example the `SlaveInfo`, `FrameworkInfo` and `ExecutorInfo` messages or the |
| unacknowledged status updates of running tasks. |
| |
| When the agent restarts, it will verify that its current configuration, set |
| from the environment variables and command-line flags, is compatible with the |
| checkpointed information and will refuse to restart if not. |
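
The compatibility check can be pictured with a toy model. The sketch below is not the Mesos implementation: `AgentConfig` and `is_compatible` are hypothetical names, resources are reduced to plain scalar numbers, and only the two `reconfiguration_policy` values described under Agent Configuration are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AgentConfig:
    """Hypothetical, simplified stand-in for checkpointed agent information."""
    attributes: Dict[str, str] = field(default_factory=dict)
    resources: Dict[str, float] = field(default_factory=dict)  # e.g. {"cpus": 4.0}
    fault_domain: Optional[str] = None


def is_compatible(old: AgentConfig, new: AgentConfig, policy: str = "equal") -> bool:
    """Return True if `new` may recover the state checkpointed under `old`."""
    if policy == "equal":
        # No configuration changes are accepted.
        return old == new
    if policy != "additive":
        raise ValueError(f"unknown reconfiguration policy: {policy!r}")

    # Existing attributes must be unchanged; extra attributes are allowed.
    attrs_ok = all(new.attributes.get(k) == v for k, v in old.attributes.items())
    # Every existing resource must be kept or increased.
    resources_ok = all(new.resources.get(k, 0.0) >= v for k, v in old.resources.items())
    # A fault domain may be added, but an existing one must not change.
    domain_ok = old.fault_domain is None or old.fault_domain == new.fault_domain
    return attrs_ok and resources_ok and domain_ok
```

In this model, adding an attribute or raising `cpus` passes under `additive` but fails under `equal`, while removing anything fails under both.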
| |
| A special case occurs when the agent detects that its host system was rebooted |
| since the last run of the agent: the agent will try to recover its previous ID |
| as usual, but if that fails it will erase the information from the |
| previous run and register with the master as a new agent. |
| |
| Note that executors and tasks that exited between agent shutdown and restart |
| are not automatically restarted during agent recovery. |
| |
| ## Framework Configuration |
| |
| A framework can control whether its executors will be recovered by setting |
| the `checkpoint` flag in its `FrameworkInfo` when registering with the master. |
| Enabling this feature results in increased I/O overhead at each agent that runs |
| tasks launched by the framework. By default, frameworks do **not** checkpoint |
| their state. |
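
As a minimal sketch, assuming the framework talks to the master through the v1 scheduler HTTP API with JSON bodies, the `checkpoint` field could be set when building the `SUBSCRIBE` call as shown below. The helper name `build_subscribe_call` and the user and framework names are hypothetical.

```python
import json


def build_subscribe_call(user: str, name: str, checkpoint: bool = True) -> str:
    """Build a JSON SUBSCRIBE call whose FrameworkInfo enables checkpointing."""
    call = {
        "type": "SUBSCRIBE",
        "subscribe": {
            "framework_info": {
                "user": user,
                "name": name,
                # Opt in to agent recovery for this framework's executors.
                "checkpoint": checkpoint,
            }
        },
    }
    return json.dumps(call)


if __name__ == "__main__":
    # Hypothetical user and framework name, for illustration only.
    print(build_subscribe_call("root", "example-framework"))
```

The resulting body would be sent to the master's `/api/v1/scheduler` endpoint. Frameworks built on a scheduler driver library set the same `checkpoint` field on their `FrameworkInfo` message instead.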
| |
| ## Agent Configuration |
| |
| Four [configuration flags](configuration/agent.md) control the recovery |
| behavior of a Mesos agent: |
| |
| * `strict`: Whether to do agent recovery in strict mode [Default: true]. |
| - If strict=true, all recovery errors are considered fatal. |
| - If strict=false, any errors (e.g., corruption in checkpointed data) during |
| recovery are ignored and as much state as possible is recovered. |
| |
| * `reconfiguration_policy`: Which kind of configuration changes are accepted |
| when trying to recover [Default: equal]. |
| - If reconfiguration_policy=equal, no configuration changes are accepted. |
| - If reconfiguration_policy=additive, the agent will allow the new |
| configuration to contain additional attributes, increased resources or an |
| additional fault domain. For a more detailed description, see |
| [`src/slave/compatibility.hpp`](https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob;f=src/slave/compatibility.hpp;h=78b421a01abe5d2178c93832577577a7ba282b38;hb=HEAD#l37). |
| |
| * `recover`: Whether to recover status updates and reconnect with old |
| executors [Default: reconnect]. |
| - If recover=reconnect, reconnect with any old live executors, provided |
| the executor's framework enabled checkpointing. |
| - If recover=cleanup, kill any old live executors and exit. Use this |
| option when doing an incompatible agent or executor upgrade! |
| **NOTE:** If no checkpointing information exists, no recovery is performed |
| and the agent registers with the master as a new agent. |
| |
| * `recovery_timeout`: Amount of time allotted for the agent to |
| recover [Default: 15 mins]. |
| - If the agent takes longer than `recovery_timeout` to recover, any |
| executors that are waiting to reconnect to the agent will self-terminate. |
| **NOTE:** If none of the frameworks have enabled checkpointing, the |
| executors and tasks running at an agent die when the agent dies and are |
| not recovered. |
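
The four flags above can be assembled programmatically, for example by a deployment script. The sketch below only builds the argument list and does not start an agent. The helper name `recovery_flags` is hypothetical, and a real invocation needs further flags that are outside the scope of this page.

```python
from typing import List

# Values accepted by the flags as described in the list above.
ALLOWED_RECOVER = {"reconnect", "cleanup"}
ALLOWED_POLICY = {"equal", "additive"}


def recovery_flags(
    strict: bool = True,
    reconfiguration_policy: str = "equal",
    recover: str = "reconnect",
    recovery_timeout: str = "15mins",
) -> List[str]:
    """Return the recovery-related mesos-agent flags as command-line arguments."""
    if recover not in ALLOWED_RECOVER:
        raise ValueError(f"recover must be one of {sorted(ALLOWED_RECOVER)}")
    if reconfiguration_policy not in ALLOWED_POLICY:
        raise ValueError(
            f"reconfiguration_policy must be one of {sorted(ALLOWED_POLICY)}"
        )
    return [
        f"--strict={'true' if strict else 'false'}",
        f"--reconfiguration_policy={reconfiguration_policy}",
        f"--recover={recover}",
        f"--recovery_timeout={recovery_timeout}",
    ]


if __name__ == "__main__":
    # Example: clean up old executors before an incompatible upgrade.
    print(["mesos-agent", *recovery_flags(recover="cleanup")])
```

Validating the values up front catches a mistyped policy before the agent is restarted, which matters because a slow or failed restart can cost running executors.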
| |
| A restarted agent should reregister with the master within a timeout (75 seconds |
| by default: see the `--max_agent_ping_timeouts` and `--agent_ping_timeout` |
| [configuration flags](configuration.md)). If the agent takes longer than this |
| timeout to reregister, the master shuts down the agent, which in turn will |
| shut down any live executors/tasks. |
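
The 75-second figure is the product of the two master flags. The sketch below works through that arithmetic, assuming the defaults of 15 seconds for `--agent_ping_timeout` and 5 for `--max_agent_ping_timeouts`. The function name is hypothetical.

```python
def reregistration_timeout_secs(
    agent_ping_timeout_secs: float = 15.0,
    max_agent_ping_timeouts: int = 5,
) -> float:
    """Time the master waits before shutting down an unresponsive agent."""
    return agent_ping_timeout_secs * max_agent_ping_timeouts


if __name__ == "__main__":
    # With the assumed defaults: 15 s * 5 = 75 s.
    print(reregistration_timeout_secs())
```

An operator who raises either flag on the master lengthens the window in which a restarted agent can reregister without losing its executors.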
| |
| Therefore, it is highly recommended to automate the process of restarting an |
| agent, e.g. using a process supervisor such as [monit](http://mmonit.com/monit/) |
| or `systemd`. |
| |
| ## Known issues with `systemd` and process lifetime |
| |
| There is a known issue when using `systemd` to launch the `mesos-agent`. A |
| description of the problem can be found in [MESOS-3425](https://issues.apache.org/jira/browse/MESOS-3425) |
| and all relevant work can be tracked in the epic [MESOS-3007](https://issues.apache.org/jira/browse/MESOS-3007). |
| |
| This problem was fixed in Mesos `0.25.0` for the Mesos containerizer when |
| cgroups isolation is enabled. Further fixes for the POSIX isolators and Docker |
| containerizer are available in `0.25.1`, `0.26.1`, `0.27.1`, and `0.28.0`. |
| |
| It is recommended that you use the default [KillMode](http://www.freedesktop.org/software/systemd/man/systemd.kill.html) |
| for systemd processes, `control-group`, which kills all child processes |
| when the agent stops. This ensures that "side-car" processes such as the |
| `fetcher` and `perf` are terminated alongside the agent. |
| The systemd patches for Mesos explicitly move executors and their children into |
| a separate systemd slice, dissociating their lifetime from the agent. This |
| ensures the executors survive agent restarts. |
| |
| The following excerpt of a `systemd` unit configuration file shows how to set |
| `KillMode` explicitly: |
| |
| ``` |
| [Service] |
| ExecStart=/usr/bin/mesos-agent |
| KillMode=control-group |
| ``` |