commit | 06d25fd4d5abbf1e5e59683bfa488b176b3721e8 | |
---|---|---|
author | Stephan Erb <serb@apache.org> | Thu Nov 02 12:01:40 2017 +0100 |
committer | Stephan Erb <serb@apache.org> | Thu Nov 02 12:01:40 2017 +0100 |
tree | fd819944f066d79204a3eddb8d9b5833d339244f | |
parent | d106b4ecc9537b8e844c4edc2210b9fe1853ccc4 | |
Terminate the executor on unhandled errors

This commit consists of two independent parts:

a) ensure we interrupt the main thread when there are unhandled exceptions
b) ensure the main thread of the executor can be interrupted

Testing Done:
This bug is pretty hard to reproduce and test. I therefore opted for a manual verification and injected an exception throw shortly before the last statement of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in hanging executors on the host. With this patch everything is terminated as expected.

For details of the successful run, please see the executor logs below. Please note that the `apport.fileutils` import error is due to Ubuntu messing with its Python installation. This is not critical.

```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
    sys.excepthook(*sys.exc_info())
  File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
    self._former_hook()(exc_type, value, trace)
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils
twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug: Active thread: <_MainThread(MainThread, started 139968622749504)>
twitter.common.app debug: Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
twitter.common.app debug: Exiting cleanly.
```

Corresponding agent logs, indicating that Mesos knows about the crash on teardown:

```
I1031 15:59:54.692739 1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834 1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
I1031 15:59:54.692996 1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```

Bugs closed: AURORA-1955

Reviewed at https://reviews.apache.org/r/63443/
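The two parts of the change can be illustrated with a minimal Python 3 sketch. It is not the code from this commit: the executor in the logs runs on Python 2.7 with `twitter.common`, and every name below (`InterruptingThread`, `wait_interruptibly`, `run_executor`, the two worker functions) is hypothetical. Part (a) is modelled by a thread wrapper that calls `_thread.interrupt_main()` on an unhandled error. Part (b) is modelled by a main thread that waits in short slices, so a pending interrupt is noticed rather than blocked on indefinitely.

```python
import _thread
import threading
import traceback

EXIT_INTERRUPTED = 130  # 128 + SIGINT (2); the status shown in the agent log


class InterruptingThread(threading.Thread):
    """Hypothetical helper: an unhandled error interrupts the main thread."""

    def run(self):
        try:
            super().run()
        except Exception:
            traceback.print_exc()
            # Schedules a KeyboardInterrupt in the main thread.
            _thread.interrupt_main()


def wait_interruptibly(stop_event, poll_interval=0.1):
    """Wait in short slices so the main thread regularly gets a chance
    to notice a pending interrupt instead of blocking indefinitely."""
    while not stop_event.is_set():
        stop_event.wait(timeout=poll_interval)


def run_executor(worker_target):
    """Hypothetical stand-in for the executor's main function."""
    stop_event = threading.Event()
    try:
        worker = InterruptingThread(
            target=worker_target, args=(stop_event,), daemon=True)
        worker.start()
        wait_interruptibly(stop_event)
    except KeyboardInterrupt:
        return EXIT_INTERRUPTED
    return 0


def clean_worker(stop_event):
    # Assumed happy path: the worker signals a normal shutdown.
    stop_event.set()


def failing_worker(stop_event):
    # Mirrors the injected failure from the manual test.
    raise RuntimeError("Woops!")
```

When called from the main thread with Python's default SIGINT handler installed, `run_executor(failing_worker)` returns 130 and `run_executor(clean_worker)` returns 0. Polling with a timeout is a deliberate choice in this sketch: `_thread.interrupt_main()` only marks an interrupt as pending, so a main thread blocked in an untimed wait would not necessarily observe it.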
Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.
Put concisely, Aurora is like a distributed monit or distributed supervisord that you can instruct to do things like "run 100 of these, somewhere, forever."
Aurora is built for users and operators.
User-facing Features:
Under the hood, to help you rest easy:
Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.
However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.
Are you using Aurora too? Let us know, or submit a patch to join the list!
If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.
You can also file bugs/issues in our JIRA queue.
Except as otherwise noted, this software is licensed under the Apache License, Version 2.0.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.