Job configurations can be updated at any point in their lifecycle. Usually updates are done incrementally using a process called a rolling upgrade, in which Tasks are upgraded in small groups, one group at a time. Updates are done using various Aurora Client commands.
There are several sub-commands to manage job updates:
aurora update start <job key> <configuration file> aurora update info <job key> aurora update pause <job key> aurora update resume <job key> aurora update abort <job key> aurora update list <cluster>
start a job update, the command will return once it has sent the instructions to the scheduler. At that point, you may view detailed progress for the update with the
info subcommand, in addition to viewing graphical progress in the web browser. You may also get a full listing of in-progress updates in a cluster with
Once an update has been started, you can
pause to keep the update but halt progress. This can be useful for doing things like debug a partially-updated job to determine whether you would like to proceed. You can
resume to proceed.
abort a job update regardless of the state it is in. This will instruct the scheduler to completely abandon the job update and leave the job in the current (possibly partially-updated) state.
For a configuration update, the Aurora Scheduler calculates required changes by examining the current job config state and the new desired job config. It then starts a rolling batched update process by going through every batch and performing these operations, in order:
The Aurora Scheduler continues through the instance list until all tasks are updated and in
RUNNING. If the scheduler determines the update is not going well (based on the criteria specified in the UpdateConfig), it cancels the update.
Update cancellation runs a procedure similar to the described above update sequence, but in reverse order. New instance configs are swapped with old instance configs and batch updates proceed backwards from the point where the update failed. E.g.; (0,1,2) (3,4,5) (6,7, 8-FAIL) results in a rollback in order (8,7,6) (5,4,3) (2,1,0).
For details on how to control a job update, please see the UpdateConfig configuration object.
Some Aurora services may benefit from having more control over updates by explicitly acknowledging (“heartbeating”) job update progress. This may be helpful for mission-critical service updates where explicit job health monitoring is vital during the entire job update lifecycle. Such job updates would rely on an external service (or a custom client) periodically pulsing an active coordinated job update via a pulseJobUpdate RPC.
A coordinated update is defined by setting a positive pulse_interval_secs value in job configuration file. If no pulses are received within specified interval the update will be blocked. A blocked update is unable to continue rolling forward (or rolling back) but retains its active status. It may only be unblocked by a fresh
NOTE: A coordinated update starts in
ROLL_FORWARD_AWAITING_PULSE state and will not make any progress until the first pulse arrives. However, a paused update (
ROLL_BACK_PAUSED) is still considered active and upon resuming will immediately make progress provided the pulse interval has not expired.
Updates can take advantage of Custom SLA Requirements and specify the
sla_aware=True option within UpdateConfig to only update instances if the action will maintain the task's SLA requirements. This feature allows updates to avoid killing too many instances in the face of unexpected failures outside of the update range.
See the Using the
sla_aware option for more information on how to use this feature.
Canary deployments are a pattern for rolling out updates to a subset of job instances, in order to test different code versions alongside the actual production job. It is a risk-mitigation strategy for job owners and commonly used in a form where job instance 0 runs with a different configuration than the instances 1-N.
For example, consider a job with 4 instances that each request 1 core of cpu, 1 GB of RAM, and 1 GB of disk space as specified in the configuration file
hello_world.aurora. If you want to update it so it requests 2 GB of RAM instead of 1. You can create a new configuration file to do that called
new_hello_world.aurora and issue
aurora update start <job_key_value>/0-1 new_hello_world.aurora
This results in instances 0 and 1 having 1 cpu, 2 GB of RAM, and 1 GB of disk space, while instances 2 and 3 have 1 cpu, 1 GB of RAM, and 1 GB of disk space. If instance 3 dies and restarts, it restarts with 1 cpu, 1 GB RAM, and 1 GB disk space.
So that means there are two simultaneous task configurations for the same job at the same time, just valid for different ranges of instances. While this isn't a recommended pattern, it is valid and supported by the Aurora scheduler.