Fix cron id collision bug by avoiding state in Quartz jobs

In a rare sequence of events, the scheduler can crash.

The steps are:

1. Schedule and start a cron (runs every minute, graceful shutdown period
   > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. The scheduler crashes on the 3rd run due to an ID collision between the already running cron and
   a new cron trying to start

The bug occurs because some state persists across cron scheduling/descheduling via
`killFollowups`. We use the Quartz `JobDataMap` to hold a "work in progress" token, while the
`killFollowups` set indicates "completion", in order to ensure there are no concurrent runs.
Descheduling a cron removes the "work in progress" token but leaves the "completion" token in
`killFollowups`. Later, after the cron is rescheduled, a new "work in progress" token may be added
and the stale "completion" token from the previous schedule may be mistaken for its completion,
causing a concurrent run.

For the example above, the runs in step 2 add the key to the set to show that all runs are
finished and another run can start. The 3rd run in step 5 then mistakenly concludes that the 2nd
run has started and finished, because the "completion" token was preserved from the first set of
runs in step 2. This erroneously triggers a concurrent run, causing an ID collision.
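
The sequence can be replayed in a toy model. This is a sketch, not Aurora's code: `StatefulCronModel`, its methods, and the `Outcome` enum are hypothetical names, and a plain `HashMap` stands in for the Quartz `JobDataMap`. It assumes that only a run which must first kill a predecessor's tasks records a "work in progress" token, and that no tasks are left over when the cron is rescheduled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the buggy bookkeeping. All names are hypothetical. */
class StatefulCronModel {
  enum Outcome { STARTED, SKIPPED, COLLISION }

  private static final String KEY = "role/env/job";

  // Outlives deschedule/reschedule: this is the leaked state.
  private final Set<String> killFollowups = new HashSet<>();
  // Stand-in for the per-schedule Quartz JobDataMap.
  private Map<String, Object> jobDataMap = new HashMap<>();
  private boolean tasksActive = false;   // tasks from an earlier run exist
  private boolean killInFlight = false;  // ground truth: a run is still working

  /** Descheduling drops the JobDataMap but leaves killFollowups untouched. */
  void deschedule() {
    jobDataMap = new HashMap<>();
    tasksActive = false;   // assumption: earlier tasks have ended
    killInFlight = false;
  }

  Outcome trigger() {
    if (jobDataMap.containsKey(KEY)) {
      if (killFollowups.contains(KEY)) {
        // "Completion" seen: reset both tokens and allow a new run.
        jobDataMap.remove(KEY);
        killFollowups.remove(KEY);
      } else {
        return Outcome.SKIPPED;  // previous run still in progress
      }
    }
    if (killInFlight) {
      return Outcome.COLLISION;  // two runs are now active at once
    }
    if (tasksActive) {
      jobDataMap.put(KEY, Boolean.TRUE);  // "work in progress" token
      killInFlight = true;
    } else {
      tasksActive = true;  // nothing to kill, launch directly
    }
    return Outcome.STARTED;
  }

  /** The in-flight run finishes and records its "completion" token. */
  void finishKill() {
    killInFlight = false;
    killFollowups.add(KEY);
  }

  /** Replays steps 1-6 and returns the outcome of the final run. */
  static Outcome reproduce() {
    StatefulCronModel cron = new StatefulCronModel();
    cron.trigger();         // step 2, run 1: launches directly
    cron.trigger();         // step 2, run 2: records the token
    cron.finishKill();      // run 2 completes, completion token added
    cron.deschedule();      // step 3 (step 4 reuses the fresh map)
    cron.trigger();         // step 5, run 1
    cron.trigger();         // step 5, run 2: still in flight
    return cron.trigger();  // step 5, run 3: sees the stale completion
  }
}
```

In this model `StatefulCronModel.reproduce()` ends in `COLLISION`, whereas three triggers on a fresh instance end in `SKIPPED`, because no stale "completion" token exists yet.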

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz
`JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as
`killFollowups`.
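
One possible reading of this approach, as a toy model rather than the actual change: the key being present with no value means "work in progress", and the key being present with a value means "completed". `StatelessCronModel`, its methods, and the `Outcome` enum are hypothetical names, a plain `HashMap` stands in for the Quartz `JobDataMap`, and the model assumes only a run that must first kill a predecessor's tasks records a token.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model keeping all run state in the per-schedule map. Names are hypothetical. */
class StatelessCronModel {
  enum Outcome { STARTED, SKIPPED, COLLISION }

  private static final String KEY = "role/env/job";

  // Stand-in for the per-schedule Quartz JobDataMap; the only state kept.
  private Map<String, Object> jobDataMap = new HashMap<>();
  private boolean tasksActive = false;   // tasks from an earlier run exist
  private boolean killInFlight = false;  // ground truth: a run is still working

  /** Descheduling discards the map, and with it every token. */
  void deschedule() {
    jobDataMap = new HashMap<>();
    tasksActive = false;   // assumption: earlier tasks have ended
    killInFlight = false;
  }

  Outcome trigger() {
    if (jobDataMap.containsKey(KEY)) {
      if (jobDataMap.get(KEY) == null) {
        return Outcome.SKIPPED;  // key without a value: still in progress
      }
      jobDataMap.remove(KEY);    // value present: previous run completed
    }
    if (killInFlight) {
      return Outcome.COLLISION;  // two runs are now active at once
    }
    if (tasksActive) {
      jobDataMap.put(KEY, null);  // "work in progress": key with no value
      killInFlight = true;
    } else {
      tasksActive = true;  // nothing to kill, launch directly
    }
    return Outcome.STARTED;
  }

  /** The in-flight run finishes and marks completion in the same map. */
  void finishKill() {
    killInFlight = false;
    if (jobDataMap.containsKey(KEY)) {
      jobDataMap.put(KEY, Boolean.TRUE);
    }
  }

  /** Replays the six steps and returns the outcome of the final run. */
  static Outcome replay() {
    StatelessCronModel cron = new StatelessCronModel();
    cron.trigger();         // step 2, run 1: launches directly
    cron.trigger();         // step 2, run 2: records the token
    cron.finishKill();      // run 2 completes
    cron.deschedule();      // step 3 (step 4 reuses the fresh map)
    cron.trigger();         // step 5, run 1
    cron.trigger();         // step 5, run 2: still in flight
    return cron.trigger();  // step 5, run 3: no stale state to misread
  }
}
```

Here `StatelessCronModel.replay()` ends in `SKIPPED`: the 3rd run sees only the in-progress marker from the current schedule, since nothing survives descheduling.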

Reviewed at https://reviews.apache.org/r/65680/
2 files changed
tree: a3e4f2af8919da4c5c872bf4fb60b601573c1d6b
README.md

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

To very concisely describe Aurora, it is like a distributed monit or distributed supervisord that you can instruct to do things like run 100 of these, somewhere, forever.

Features

Aurora is built for users and operators.

  • User-facing Features:

  • Under the hood, to help you rest easy:

    • Preemption: important services can ‘steal’ resources when they need them
    • High-availability: resists machine failures and disk failures
    • Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
    • Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and chef. Aurora can manage applications, while these tools are still useful to manage Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren’t answered in our documentation, you can reach out to one of our mailing lists. We’re also often available in IRC: #aurora on irc.freenode.net.

You can also file bugs/issues in our JIRA queue.

License

Except as otherwise noted this software is licensed under the Apache License, Version 2.0

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.