Terminate the executor on unhandled errors

This commit consists of two independent parts:

a) ensure we interrupt the main thread when there are unhandled exceptions
b) ensure the main thread of the executor can be interrupted
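
The two parts can be illustrated with a minimal, self-contained sketch. This is not the code from this patch: `InterruptingThread`, `failing_worker`, and `main_loop` are hypothetical names, and the sketch uses Python 3's standard `_thread.interrupt_main()` rather than the executor's actual exception hook. It assumes `main_loop` is called from the main thread.

```python
import _thread
import signal
import sys
import threading
import time


class InterruptingThread(threading.Thread):
    """Hypothetical helper: interrupts the main thread on unhandled errors (part a)."""

    def run(self):
        try:
            super().run()
        except Exception as e:
            sys.stderr.write("Unhandled error in %r: %s. Interrupting main thread.\n" % (self, e))
            # Simulates SIGINT arriving in the main thread.
            _thread.interrupt_main()


def failing_worker():
    # Stand-in for the exception injected during manual verification.
    raise RuntimeError("Woops!")


def main_loop(timeout=5.0):
    """Part b: keep the main thread interruptible by waiting in short slices."""
    # Make sure SIGINT raises KeyboardInterrupt even if it was ignored at startup.
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGINT, signal.default_int_handler)
    worker = InterruptingThread(target=failing_worker, daemon=True)
    deadline = time.monotonic() + timeout
    try:
        worker.start()
        while time.monotonic() < deadline:
            time.sleep(0.05)
    except KeyboardInterrupt:
        return "interrupted"
    return "timed out"


if __name__ == "__main__":
    print(main_loop())
```

The short sleep slices stand in for whatever blocking wait the main thread performs; the point is only that the wait has to return to the interpreter often enough for the pending interrupt to be delivered.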

Testing Done:
This bug is pretty hard to reproduce and test. I therefore opted for a manual
verification and injected an exception throw shortly before the last statement
of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in
hanging executors on the host. With this patch everything is terminated as
expected.

For details of the successful run, please see the executor logs below. Please
note that the `apport.fileutils` ImportError is due to Ubuntu messing with its
Python installation. This is not critical.

```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d

ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
    sys.excepthook(*sys.exc_info())
  File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
    self._former_hook()(exc_type, value, trace)
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils

twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139968622749504)>
twitter.common.app debug:   Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
twitter.common.app debug: Exiting cleanly.
```

Corresponding agent logs, indicating that Mesos knows about the crash on teardown
(the executor exits with status 130):
```
I1031 15:59:54.692739  1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834  1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
I1031 15:59:54.692996  1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```

Bugs closed: AURORA-1955

Reviewed at https://reviews.apache.org/r/63443/
5 files changed
tree: fd819944f066d79204a3eddb8d9b5833d339244f
README.md

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

Put very concisely, Aurora is like a distributed monit or distributed supervisord that you can instruct to do things like "run 100 of these, somewhere, forever."

Features

Aurora is built for users and operators.

  • User-facing Features:

  • Under the hood, to help you rest easy:

    • Preemption: important services can ‘steal’ resources when they need them
    • High-availability: resists machine failures and disk failures
    • Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
    • Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.

You can also file bugs/issues in our JIRA queue.

License

Except as otherwise noted this software is licensed under the Apache License, Version 2.0

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.