commit | 0cb1591b709e3c9f32093d943b8e2ddcdcf7999f | |
---|---|---|
author | Charles-Francois Natali <cf.natali@gmail.com> | Sat May 02 01:41:09 2020 +0100 |
committer | Andrei Budnik <abudnik@apache.org> | Thu May 07 13:25:32 2020 +0200 |
tree | 0ecf2eb2a7c4908e43911573394bcf58fff37416 | |
parent | a32513a1fc6a149b30f04721f866e3cbb6003661 | |
Keep retrying to remove cgroup on EBUSY.

This is a follow-up to MESOS-10107, which introduced retries when calling `rmdir` on a seemingly empty cgroup fails with `EBUSY` because of various kernel bugs. At the time, the fix introduced a bounded number of retries, using an exponential backoff summing up to slightly over 1s. This was done because it was similar to what Docker does, and worked during testing.

However, after 1 month without seeing this error in our cluster at work, we finally experienced one case where the 1s timeout wasn't enough. It could be because the machine was busy at the time, or some other random factor. So instead of only trying for 1s, I think it might make sense to just keep retrying, until the top-level container destruction timeout - set at 1 minute - kicks in.

This actually makes more sense, and avoids having a magical timeout in the cgroup code. We just need to ensure that when the destroyer is finalized, it discards the future in charge of doing the periodic remove.

This closes #362
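The retry policy described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the Mesos code, which is built on libprocess futures. The helper name `removeWithRetry`, its parameters, the delay values, and the use of an atomic flag to stand in for a discarded future are all assumptions made for illustration. The idea is to retry only on `EBUSY`, back off between attempts, and stop when the caller signals that its own timeout has fired, so that the retry code has no deadline of its own.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the Mesos implementation.
//
// `tryRemove` performs one removal attempt and returns 0 on success or an
// errno value on failure (for example, a thin wrapper around rmdir(2)).
// `discarded` stands in for discarding the future: the caller sets it when
// its own top-level timeout fires. The delay values are illustrative only.
//
// Returns 0 on success, ECANCELED if the caller gave up, or the first
// error other than EBUSY.
inline int removeWithRetry(
    const std::function<int()>& tryRemove,
    const std::atomic<bool>& discarded,
    std::chrono::milliseconds delay = std::chrono::milliseconds(1),
    std::chrono::milliseconds maxDelay = std::chrono::milliseconds(100))
{
  assert(delay.count() > 0 && maxDelay >= delay);

  // No retry count and no local deadline: only the caller can stop the loop.
  while (!discarded.load()) {
    const int error = tryRemove();
    if (error != EBUSY) {
      return error;  // Success (0), or an error that retrying will not fix.
    }

    std::this_thread::sleep_for(delay);

    // Exponential backoff, capped so that attempts stay reasonably frequent.
    delay = std::min<std::chrono::milliseconds>(delay * 2, maxDelay);
  }

  return ECANCELED;
}
```

In this sketch the caller owns the only deadline: it would start its own timer (one minute for container destruction, according to the message above) and set `discarded` when that timer expires. The flag is checked between attempts, so the loop can overrun the deadline by up to one backoff interval, which is the reason for capping the delay.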
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, Jenkins, Spark, Aurora, and other frameworks on a dynamically shared pool of nodes.
Visit us at mesos.apache.org.
Documentation is available in the docs/ directory. Additionally, a rendered HTML version can be found on the Mesos website's Documentation page.
Installation instructions are included on the Getting Started page.
Apache Mesos is licensed under the Apache License, Version 2.0.
For additional information, see the LICENSE and NOTICE files.