commit | e2ea191473397691605602c6e40c6aad8a56d81a | |
---|---|---|
author | Jordan Ly <jordan.ly8@gmail.com> | Mon Feb 19 21:33:21 2018 -0800 |
committer | Jordan Ly <jly@twitter.com> | Mon Feb 19 21:33:21 2018 -0800 |
tree | a3e4f2af8919da4c5c872bf4fb60b601573c1d6b | |
parent | c69ccb9f13acdb99f25385a6c783c71376839599 | |
Fix cron id collision bug by avoiding state in Quartz jobs

There is a rare situation that can cause the scheduler to crash. The steps are:

1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start

The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use the Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion", in order to ensure there are no concurrent runs. Descheduling a cron removes the "work in progress" token but leaves the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token from a previous schedule may be seen by mistake, causing a concurrent run.

For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished, since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run, causing an ID collision.

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.

Reviewed at https://reviews.apache.org/r/65680/
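The stale-token problem described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Aurora or Quartz code: `CronGuardSketch`, `Guard`, `LeakyGuard`, and `ScopedGuard` are hypothetical names, a plain `HashMap` stands in for the per-schedule `JobDataMap`, and a `HashSet` stands in for `killFollowups`. It deliberately simplifies the six-step timing to the smallest sequence that shows a completion token leaking across a deschedule/reschedule.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the concurrency guard; every name in this file is hypothetical. */
class CronGuardSketch {

  interface Guard {
    void reschedule();            // deschedule + reschedule: per-schedule state is dropped
    boolean trigger(String key);  // true if this trigger is allowed to start a run
    void finish(String key);      // called when a run completes
  }

  /** Buggy variant: completion lives in a set that outlives the schedule. */
  static class LeakyGuard implements Guard {
    // Stand-in for the per-schedule JobDataMap ("work in progress" tokens).
    private Map<String, Boolean> jobData = new HashMap<>();
    // Stand-in for killFollowups ("completion" tokens); never cleared on reschedule.
    private final Set<String> completed = new HashSet<>();

    public void reschedule() { jobData = new HashMap<>(); }

    public boolean trigger(String key) {
      if (jobData.containsKey(key) && !completed.remove(key)) {
        return false; // a run is in progress and no completion token was found
      }
      jobData.put(key, Boolean.TRUE);
      return true;
    }

    public void finish(String key) { completed.add(key); }
  }

  /** Variant in the spirit of the fix: only the per-schedule map holds state. */
  static class ScopedGuard implements Guard {
    private Map<String, Boolean> jobData = new HashMap<>();

    public void reschedule() { jobData = new HashMap<>(); }

    public boolean trigger(String key) {
      if (jobData.containsKey(key)) {
        return false; // presence of a value means a run is still in progress
      }
      jobData.put(key, Boolean.TRUE);
      return true;
    }

    public void finish(String key) { jobData.remove(key); }
  }

  /** Returns true if a second trigger is allowed while the first run is still active. */
  static boolean concurrentRunAfterReschedule(Guard guard) {
    String key = "role/env/job";
    guard.trigger(key);  // run under the first schedule
    guard.finish(key);   // that run completes
    guard.reschedule();  // cron is descheduled, then rescheduled
    guard.trigger(key);  // first run under the new schedule, never finished
    return guard.trigger(key); // next trigger arrives while that run is active
  }

  public static void main(String[] args) {
    System.out.println("leaky:  " + concurrentRunAfterReschedule(new LeakyGuard()));
    System.out.println("scoped: " + concurrentRunAfterReschedule(new ScopedGuard()));
  }
}
```

In this model the leaky guard lets the last trigger through because it consumes a completion token left over from the first schedule, while the scoped guard rejects it because all of its state was discarded together with the old schedule.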
Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.
Put very concisely, Aurora is like a distributed monit or distributed supervisord that you can instruct to do things like "run 100 of these, somewhere, forever."
Aurora is built for users and operators.
User-facing Features:
Under the hood, to help you rest easy:
Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.
However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.
Are you using Aurora too? Let us know, or submit a patch to join the list!
If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.
You can also file bugs/issues in our JIRA queue.
Except as otherwise noted, this software is licensed under the Apache License, Version 2.0.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.