Introduce mesos disk collector

When disk isolation is enabled in a Mesos agent, the agent calculates the disk usage for each container.
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal to delegate disk monitoring to the agent so that it is done only once. With this approach, when disk collection improves in the agent (for instance by implementing XFS isolation), we benefit from it without any code change. More information about the problem is provided in AURORA-1918.
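
To illustrate why a `du`-style collector gets expensive, here is a minimal sketch of a recursive size walk. It is not the `twitter.common.dirutil` code; `directory_size_bytes` is a hypothetical helper, written for Python 3. The point is that every file in the sandbox costs one stat call on each collection pass, so the work grows with the number of files and directories rather than with how much has changed.

```python
import os


def directory_size_bytes(root):
    """Sum the apparent size of every file under root (hypothetical helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links pointing outside
                # the sandbox are counted as links, not as their targets.
                total += os.lstat(path).st_size
            except OSError:
                # Files can disappear while a task is running; skip them.
                continue
    return total
```

A collector built this way repeats the full walk on every sampling interval, which is the duplicated work described above.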

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.
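
As a rough sketch of the lookup (not the `MesosDiskCollector` code in this patch), the snippet below assumes the agent's `/containers` endpoint on the default port 5051 and a response that is a JSON list whose entries carry an `executor_id` and a `statistics` object. The function names and the way the container is matched are assumptions made for illustration, and the code is written for Python 3.

```python
import json
from urllib.request import urlopen


def disk_used_bytes(containers, executor_id):
    """Return disk_used_bytes for executor_id, or None if it is not reported.

    `containers` is the decoded JSON list from the agent (assumed shape).
    """
    for container in containers:
        if container.get("executor_id") == executor_id:
            return container.get("statistics", {}).get("disk_used_bytes")
    return None


def fetch_containers(agent_url="http://localhost:5051/containers", timeout=5):
    """Fetch and decode the container list; the URL is an assumed default."""
    with urlopen(agent_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP call makes the lookup easy to unit test with a canned payload, and a missing entry returns `None` so a caller can decide how to handle it.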

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work end to end.

Testing Done:
- I added unit tests.
- Tested in vagrant and it works as intended.
- I also built and deployed in our test environment. To measure the improved performance, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).

Here is one specific test setup: On two hosts I created two tasks. Each task creates an identical nested directory structure with files in it. The overall size is 30GB. test_host_1 runs the current version of Observer, and test_host_2 runs Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:

```
rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
Thu Mar 22 04:36:27 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:38 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:48 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:58 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:08 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:18 UTC 2018
observer.observer_cpu 111.0

rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:30 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:40 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:50 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:00 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:10 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:20 UTC 2018
observer.observer_cpu 1.8
```
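
To compare the two hosts with a single number, the samples above can be averaged. The sketch below is only a convenience for reading the output; `mean_cpu` is a hypothetical helper (Python 3), and the two strings simply repeat the `observer.observer_cpu` lines captured above.

```python
def mean_cpu(log_text, metric="observer.observer_cpu"):
    """Average every sample of `metric` found in captured /vars output."""
    samples = [
        float(line.split()[1])
        for line in log_text.splitlines()
        if line.startswith(metric + " ")
    ]
    if not samples:
        raise ValueError("no samples found for " + metric)
    return sum(samples) / len(samples)


# Samples copied from the two terminal sessions above.
HOST_1_LOG = """\
observer.observer_cpu 108.9
observer.observer_cpu 123.2
observer.observer_cpu 123.2
observer.observer_cpu 123.2
observer.observer_cpu 111.0
observer.observer_cpu 111.0
observer.observer_cpu 111.0
"""

HOST_2_LOG = """\
observer.observer_cpu 1.3
observer.observer_cpu 1.3
observer.observer_cpu 1.3
observer.observer_cpu 1.2
observer.observer_cpu 1.2
observer.observer_cpu 1.2
observer.observer_cpu 1.8
"""

print(round(mean_cpu(HOST_1_LOG), 1), round(mean_cpu(HOST_2_LOG), 1))
```

On these seven samples per host the averages come out at roughly 115.9 for test_host_1 and 1.3 for test_host_2.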

Reviewed at https://reviews.apache.org/r/66103/
15 files changed
tree: 357e08d294009648f38fa25dc18b478e56da753a
README.md

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

To very concisely describe Aurora, it is like a distributed monit or distributed supervisord that you can instruct to do things like run 100 of these, somewhere, forever.

Features

Aurora is built for users and operators.

  • User-facing Features:

  • Under the hood, to help you rest easy:

    • Preemption: important services can ‘steal’ resources when they need it
    • High-availability: resists machine failures and disk failures
    • Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
    • Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.

You can also file bugs/issues in our JIRA queue.

License

Except as otherwise noted this software is licensed under the Apache License, Version 2.0

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.