a3f8aef6b4dcc13c79d60fc1ce02bcfcdb6e097d - aurora

commit	a3f8aef6b4dcc13c79d60fc1ce02bcfcdb6e097d
author	Reza Motamedi <reza.motamedi@gmail.com>	Mon Mar 26 13:47:13 2018 -0700
committer	Santhosh Kumar <sshanmugham@twitter.com>	Mon Mar 26 13:47:13 2018 -0700
tree	357e08d294009648f38fa25dc18b478e56da753a
parent	03eb337998b5c394a3f6238922b4701b20fb392b

Introduce mesos disk collector

When disk isolation is enabled in a Mesos agent, the agent calculates the disk usage for each container.
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal to delegate the disk monitoring task to the agent so that it is done only once. With this approach, when disk collection improves in the agent (for instance by implementing XFS isolation), we benefit from it without any code change. More information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there using the `du` implementation; that can be changed in a later patch.
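To illustrate the lookup, here is a minimal sketch and not the code in this patch. It assumes the agent serves a JSON list of containers at `/containers` on port 5051, with each entry carrying an `executor_id` and a `statistics` object. The helper names `disk_used_bytes` and `fetch_containers`, the URL, and the sample executor IDs are hypothetical; only the `disk_used_bytes` field name comes from the description above.

```python
import json
from urllib.request import urlopen


def disk_used_bytes(containers, executor_id):
    """Return disk_used_bytes for the container run by executor_id, or None."""
    for container in containers:
        if container.get("executor_id") == executor_id:
            # "statistics" may be empty or absent if the agent has no sample yet.
            return container.get("statistics", {}).get("disk_used_bytes")
    return None


def fetch_containers(agent_url="http://localhost:5051/containers", timeout=5):
    """Fetch and decode the agent's container list (the URL is an assumption)."""
    with urlopen(agent_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline example: a hand-written payload shaped like the assumed response.
sample = json.loads("""
[
  {"executor_id": "thermos-example-task-0",
   "statistics": {"disk_used_bytes": 4096, "disk_limit_bytes": 1048576}},
  {"executor_id": "thermos-example-task-1", "statistics": {}}
]
""")
print(disk_used_bytes(sample, "thermos-example-task-0"))
```

Keeping the parsing separate from the HTTP call means one fetch can serve every task on the host, which matches the goal of measuring disk usage only once.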

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.

Testing Done:
- I added unit tests.
- Tested in vagrant and it works as intended.
- I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and observed a reduction of at least 60% in the CPU utilization of the Observer process (from 1.5 CPU cores to 0.4 CPU cores).

Here is one specific test setup: on two hosts I created two tasks. Each task creates identical nested directory structures with files in them. The overall size is 30GB. test_host_1 runs the current version of Observer, and test_host_2 runs Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:

```
rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
Thu Mar 22 04:36:27 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:38 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:48 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:58 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:08 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:18 UTC 2018
observer.observer_cpu 111.0

rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:30 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:40 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:50 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:00 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:10 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:20 UTC 2018
observer.observer_cpu 1.8
```
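The two transcripts can be summarized with a short script. This is a sketch: the helpers `parse_cpu_samples` and `mean` are hypothetical, the sample strings are excerpts of the output above (all seven metric lines per host, with one date line kept to show that it is skipped), and the units of `observer.observer_cpu` are taken as reported without interpretation.

```python
def parse_cpu_samples(text, metric="observer.observer_cpu"):
    """Collect the numeric values of `metric` from /vars-style output."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        # Metric lines look like "<name> <value>"; date lines have more fields.
        if len(parts) == 2 and parts[0] == metric:
            samples.append(float(parts[1]))
    return samples


def mean(values):
    return sum(values) / len(values)


host_1 = """\
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
observer.observer_cpu 123.2
observer.observer_cpu 123.2
observer.observer_cpu 123.2
observer.observer_cpu 111.0
observer.observer_cpu 111.0
observer.observer_cpu 111.0
"""

host_2 = """\
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
observer.observer_cpu 1.3
observer.observer_cpu 1.3
observer.observer_cpu 1.2
observer.observer_cpu 1.2
observer.observer_cpu 1.2
observer.observer_cpu 1.8
"""

before = mean(parse_cpu_samples(host_1))
after = mean(parse_cpu_samples(host_2))
reduction = 1 - after / before
print("mean before: %.1f, mean after: %.1f, reduction: %.1f%%"
      % (before, after, 100 * reduction))
```

For this particular pair of hosts, the relative drop in the reported metric is well above the 60% figure quoted for the test environment.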

Reviewed at https://reviews.apache.org/r/66103/

15 files changed

README.md

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

To very concisely describe Aurora, it is like a distributed monit or distributed supervisord that you can instruct to do things like run 100 of these, somewhere, forever.

Features

Aurora is built for users and operators.

User-facing Features:
- Management of long-running services
- Cron jobs
- Resource quotas: provide guaranteed resources for specific applications
- Rolling job updates, with automatic rollback
- Multi-user support
- Sophisticated DSL: supports templating, allowing you to establish common patterns and avoid redundant configurations
- Dedicated machines: for things like stateful services that must always run on the same machines
- Service registration: announce services in ZooKeeper for discovery by various clients
- Scheduling constraints to run on specific machines, or to mitigate impact of issues like machine and rack failure

Under the hood, to help you rest easy:
- Preemption: important services can ‘steal’ resources when they need them
- High-availability: resists machine failures and disk failures
- Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
- Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.

You can also file bugs/issues in our JIRA queue.

License

Except as otherwise noted, this software is licensed under the Apache License, Version 2.0.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.