commit | 73c28280c518996714a7bcacdd9e687baafaf35d | |
---|---|---|
author | Julia Wang <jwang98052@yahoo.com> | Thu Oct 20 19:22:18 2016 -0700 |
committer | Mariia Mykhailova <mariia@apache.org> | Mon Oct 31 11:01:27 2016 -0700 |
tree | ef7476fe787cfb8e09508c2fdd20b0654b6498b9 | |
parent | 1a7c6d1d3caeabd34cac8f9ba430a8990f574f4e | |
[REEF-1482] Driver does not exit even if all the tasks exit normally

Currently, when all the tasks and all the evaluators have completed, the driver sometimes still doesn't shut down and hangs forever. This happens intermittently. With many nodes, for example 500 nodes in IMRU runs on a YARN cluster, the issue can occur once in every 2 or 3 runs. The investigation shows a potential deadlock in ResourceManagerStatus.

This PR resolves the issue by reducing the scope of code under locks. With the fix, I tested 10 times with 500 nodes in the cluster and could no longer reproduce the issue.

JIRA: [REEF-1482](https://issues.apache.org/jira/browse/REEF-1482)

Pull request: This closes #1162
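The commit message names the remedy only in general terms ("reducing the scope of code under locks"). As a rough illustration of that pattern, and not the actual change made to ResourceManagerStatus, the sketch below updates shared state inside a short critical section and runs listener callbacks only after the lock is released. The class `StatusTracker` and all of its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical status tracker, invented for illustration only.
 * It is NOT REEF's ResourceManagerStatus and does not mirror its API.
 */
final class StatusTracker {
  private final Object lock = new Object();
  private final List<Runnable> idleListeners = new ArrayList<>();
  private int outstandingRequests = 0;

  void addIdleListener(final Runnable listener) {
    synchronized (lock) {
      idleListeners.add(listener);
    }
  }

  void onRequestSubmitted() {
    synchronized (lock) {
      outstandingRequests++;
    }
  }

  void onRequestCompleted() {
    final List<Runnable> toNotify;
    // Narrow critical section: only mutate state and snapshot the listeners.
    synchronized (lock) {
      outstandingRequests--;
      if (outstandingRequests > 0) {
        return;
      }
      toNotify = new ArrayList<>(idleListeners);
    }
    // Callbacks run with the lock released, so a listener that takes its own
    // lock (or calls back into this object from another thread) cannot form
    // a lock-ordering cycle with this tracker's lock.
    for (final Runnable listener : toNotify) {
      listener.run();
    }
  }

  boolean isIdle() {
    synchronized (lock) {
      return outstandingRequests == 0;
    }
  }
}
```

The design choice is to treat the lock as protecting data rather than behavior: anything that may block or call foreign code happens outside the `synchronized` block, at the cost of listeners possibly observing state that has changed again since the snapshot was taken.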
Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
Detailed information on REEF can be found in the following places:
The developer mailing list is the best way to reach REEF's developers when the above aren't sufficient.
Component | OS | Status |
---|---|---|
REEF Java | Ubuntu | |
REEF.NET | Windows | |
Task | Java | .NET |
---|---|---|
Build & run unit tests | java\BUILD.md | cs\BUILD.md |