Aurora Configuration Reference

Don't know where to start? The Aurora configuration schema is very powerful, and configurations can become quite complex for advanced use cases.

For examples of simple configurations to get something up and running quickly, check out the Tutorial. When you feel comfortable with the basics, move on to the Configuration Tutorial for more in-depth coverage of configuration design.

Process Schema

Process objects consist of required name and cmdline attributes. You can customize Process behavior with its optional attributes. Remember, Processes are handled by Thermos.

Process Objects

| Attribute Name | Type | Description |
| -------------- | ---- | ----------- |
| name | String | Process name (Required) |
| cmdline | String | Command line (Required) |
| max_failures | Integer | Maximum process failures (Default: 1) |
| daemon | Boolean | When True, this is a daemon process. (Default: False) |
| ephemeral | Boolean | When True, this is an ephemeral process. (Default: False) |
| min_duration | Integer | Minimum duration between process restarts in seconds. (Default: 5) |
| final | Boolean | When True, this process is a finalizing one that should run last. (Default: False) |
| logger | Logger | Struct defining the log behavior for the process. (Default: Empty) |

name

The name is any valid UNIX filename string (specifically no slashes, NULLs or leading periods). Within a Task object, each Process name must be unique.

cmdline

The command line run by the process. The command line is invoked in a bash subshell, so it can involve full-blown bash scripts. However, nothing is supplied for command-line arguments, so $* is unspecified.

max_failures

The maximum number of failures (non-zero exit statuses) this process can have before being marked permanently failed and not retried. If a process permanently fails, Thermos looks at the failure limit of the task containing the process (usually 1) to determine if the task has failed as well.

Setting max_failures to 0 makes the process retry indefinitely until it achieves a successful (zero) exit status. It retries at most once every min_duration seconds to prevent an effective denial of service attack on the coordinating Thermos scheduler.

daemon

By default, Thermos processes are non-daemon. If daemon is set to True, a successful (zero) exit status does not prevent future process runs. Instead, the process reinvokes after min_duration seconds. However, the maximum failure limit still applies. A combination of daemon=True and max_failures=0 causes a process to retry indefinitely regardless of exit status. This should be avoided for very short-lived processes because of the accumulation of checkpointed state for each process run. When running in Mesos specifically, max_failures is capped at 100.
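
For example, a minimal sketch of a supervised worker that should be reinvoked whenever it exits, whatever the exit status (the command line is illustrative):

    worker = Process(
      name = 'worker',
      cmdline = './consume_queue.sh',  # illustrative command
      daemon = True,       # successful exits do not stop future runs
      max_failures = 0)    # failed exits do not stop future runs either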

ephemeral

By default, Thermos processes are non-ephemeral. If ephemeral is set to True, the process' status is not used to determine if its containing task has completed. For example, consider a task with a non-ephemeral webserver process and an ephemeral logsaver process that periodically checkpoints its log files to a centralized data store. The task is considered finished once the webserver process has completed, regardless of the logsaver's current status.
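
A sketch of the task described above, with illustrative command lines:

    webserver = Process(name = 'webserver', cmdline = './run_server.sh')
    logsaver = Process(
      name = 'logsaver',
      cmdline = './checkpoint_logs.sh',  # illustrative periodic log checkpointer
      ephemeral = True)  # ignored when deciding whether the task is finished

    # The task completes when webserver completes, whatever logsaver is doing.
    task = Task(processes = [webserver, logsaver])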

min_duration

Processes may succeed or fail multiple times during a single task's duration. Each of these is called a process run. min_duration is the minimum number of seconds the scheduler waits before running the same process.

final

Processes can be grouped into two classes: ordinary processes and finalizing processes. By default, Thermos processes are ordinary. They run as long as the task is considered healthy (i.e., no failure limits have been reached.) But once all regular Thermos processes finish or the task reaches a certain failure threshold, it moves into a “finalization” stage and runs all finalizing processes. These are typically processes necessary for cleaning up the task, such as log checkpointers, or perhaps e-mail notifications that the task completed.

Finalizing processes may not depend upon ordinary processes, nor vice versa; however, finalizing processes may depend upon other finalizing processes and otherwise run as a typical process schedule.
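
For example, a finalizing cleanup step might be sketched as follows (the command line is illustrative):

    cleanup = Process(
      name = 'cleanup',
      cmdline = 'rm -rf /tmp/scratch',  # illustrative cleanup action
      final = True)  # runs in the finalization stage, after ordinary processes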

logger

The default behavior of Thermos is to store stderr/stdout logs in files which grow unbounded. In the event that you have large log volume, you may want to configure Thermos to automatically rotate logs after they grow to a certain size, which can prevent your job from using more than its allocated disk space.

Logger objects specify a destination for Process logs, which is by default file: a pair of stdout and stderr files. It's also possible to specify console to send logs to the Process stdout and stderr streams, none to suppress log output entirely, or both to send logs to files and console streams.

The default Logger mode is standard which lets the stdout and stderr streams grow without bound.

| Attribute Name | Type | Description |
| -------------- | ---- | ----------- |
| destination | LoggerDestination | Destination of logs. (Default: file) |
| mode | LoggerMode | Mode of the logger. (Default: standard) |
| rotate | RotatePolicy | An optional rotation policy. (Default: Empty) |

A RotatePolicy describes log rotation behavior when mode is set to rotate; it is ignored otherwise. If rotate is Empty or RotatePolicy() while mode is set to rotate, the defaults below are used.

| Attribute Name | Type | Description |
| -------------- | ---- | ----------- |
| log_size | Integer | Maximum size (in bytes) of an individual log file. (Default: 100 MiB) |
| backups | Integer | The maximum number of backups to retain. (Default: 5) |

An example process configuration is as follows:

    process = Process(
      name='process',
      cmdline='./run_server.sh',  # cmdline is required; this command is illustrative
      logger=Logger(
        destination=LoggerDestination('both'),
        mode=LoggerMode('rotate'),
        rotate=RotatePolicy(log_size=5*MB, backups=5)
      )
    )

Task Schema

Tasks fundamentally consist of a name and a list of Process objects stored as the value of the processes attribute. Processes can be further constrained with constraints. By default, name's value inherits from the first Process in the processes list, so for simple Task objects with one Process, name can be omitted. In Mesos, resources is also required.

Task Object

| param | type | description |
| ----- | ---- | ----------- |
| name | String | Task name. (Default: the name of the first Process in processes) |
| processes | List of Process objects | List of Process objects bound to this task. (Required) |
| constraints | List of Constraint objects | List of Constraint objects constraining processes. |
| resources | Resource object | Resource footprint. (Required) |
| max_failures | Integer | Maximum process failures before being considered failed. (Default: 1) |
| max_concurrency | Integer | Maximum number of concurrent processes. (Default: 0, unlimited concurrency) |
| finalization_wait | Integer | Amount of time allocated for finalizing processes, in seconds. (Default: 30) |

name

name is a string denoting the name of this task. It defaults to the name of the first Process in the list of Processes associated with the processes attribute.

processes

processes is an unordered list of Process objects. To constrain the order in which they run, use constraints.

constraints

A list of Constraint objects. Currently it supports only one type, the order constraint. order is a list of process names that should run in the order given. For example,

    process = Process(cmdline = "echo hello {{name}}")
    task = Task(name = "echoes",
                processes = [process(name = "jim"), process(name = "bob")],
                constraints = [Constraint(order = ["jim", "bob"])])

Constraints can be supplied ad-hoc and in duplicate. Not all Processes need be constrained; however, Tasks with cycles are rejected by the Thermos scheduler.

Use the order function as shorthand to generate Constraint lists. The following:

    order(process1, process2)

is shorthand for

    [Constraint(order = [process1.name(), process2.name()])]

The order function accepts Process name strings ('foo', 'bar') or the processes themselves, e.g. foo=Process(name='foo', ...), bar=Process(name='bar', ...), constraints=order(foo, bar).

resources

Takes a Resource object, which specifies the amounts of CPU, memory, and disk space resources to allocate to the Task.
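
A minimal sketch, assuming the Resources struct and size constants (MB, GB) provided by the Aurora DSL, with an illustrative process name:

    task = Task(
      processes = [main_process],  # assumed to be defined elsewhere
      resources = Resources(cpu = 1.0, ram = 1*GB, disk = 2*GB))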

max_failures

max_failures is the number of failed processes needed for the Task to be marked as failed.

For example, assume a Task has two Processes and a max_failures value of 2:

    template = Process(max_failures=10)
    task = Task(
      name = "fail",
      processes = [
         template(name = "failing", cmdline = "exit 1"),
         template(name = "succeeding", cmdline = "exit 0")
      ],
      max_failures=2)

The failing Process could fail 10 times before being marked as permanently failed, while the succeeding Process could succeed on the first run. The Task would nevertheless succeed: there would be 10 failed process runs but only 1 failed Process, and with max_failures=2 both Processes would have to fail for the Task to fail.

max_concurrency

For Tasks with a number of expensive but otherwise independent processes, you may want to limit the amount of concurrency the Thermos scheduler provides rather than artificially constraining it via order constraints. For example, a test framework may generate a task with 100 test run processes, but wants to run it on a machine with only 4 cores. You can limit the amount of parallelism to 4 by setting max_concurrency=4 in your task configuration.

For example, the following task spawns 180 Processes (“mappers”) to compute individual elements of a 180 degree sine table, all dependent upon one final Process (“reducer”) to tabulate the results:

    def make_mapper(id):
      return Process(
        name = "mapper%03d" % id,
        cmdline = "echo 'scale=50;s(%d*4*a(1)/180)' | bc -l > temp.sine_table.%03d" % (id, id))

    def make_reducer():
      return Process(
        name = "reducer",
        cmdline = "cat temp.* | nl > sine_table.txt && rm -f temp.*")

    processes = map(make_mapper, range(180))

    task = Task(
      name = "mapreduce",
      processes = processes + [make_reducer()],
      constraints = [Constraint(order = [mapper.name(), 'reducer'])
                     for mapper in processes],
      max_concurrency = 8)

finalization_wait

Process execution is organized into three stages: ACTIVE, CLEANING, and FINALIZING. The ACTIVE stage is when ordinary processes run. This stage lasts as long as Processes are running and the Task is healthy. The moment either all Processes have finished successfully or the Task has reached a maximum Process failure limit, it goes into the CLEANING stage and sends SIGTERMs to all currently running Processes and their process trees. Once all Processes have terminated, the Task goes into the FINALIZING stage and invokes the schedule of all Processes with the final attribute set to True.

This whole process from the end of ACTIVE stage to the end of FINALIZING must happen within finalization_wait seconds. If it does not finish during that time, all remaining Processes are sent SIGKILLs (or if they depend upon uncompleted Processes, are never invoked.)

When running on Aurora, the finalization_wait is capped at 60 seconds.
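
For example, a sketch of a Task whose finalizers need more than the default 30 seconds (the process and resource names are assumed to be defined elsewhere):

    task = Task(
      processes = [main_process, log_checkpointer],
      resources = resources,
      finalization_wait = 60)  # Aurora caps this at 60 seconds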

Constraint Object

Constraint objects currently support only a single ordering constraint, order, which specifies that its processes run sequentially in the order given. By default, all processes run in parallel when bound to a Task without ordering constraints.

| param | type | description |
| ----- | ---- | ----------- |
| order | List of String | List of processes by name (String) that should be run serially. |

Resource Object

Specifies the amount of CPU, RAM, and disk resources the task needs. See the Resource Isolation document for suggested values and to understand how resources are allocated.

| param | type | description |
| ----- | ---- | ----------- |
| cpu | Float | Fractional number of cores required by the task. |
| ram | Integer | Bytes of RAM required by the task. |
| disk | Integer | Bytes of disk required by the task. |
| gpu | Integer | Number of GPU cores required by the task. |

Job Schema

Job Objects

Note: Specifying a Container object as the value of the container property is deprecated in favor of setting its value directly to the appropriate Docker or Mesos container type.

Note: Specifying preemption behavior of tasks through production flag is deprecated in favor of electing appropriate task tier via tier attribute.

| name | type | description |
| ---- | ---- | ----------- |
| task | Task | The Task object to bind to this job. Required. |
| name | String | Job name. (Default: inherited from the task attribute's name) |
| role | String | Job role account. Required. |
| cluster | String | Cluster in which this job is scheduled. Required. |
| environment | String | Job environment, default devel. By default must be one of prod, devel, test or staging<number>, but it can be changed by the Cluster operator using the scheduler option allowed_job_environments. |
| contact | String | Best email address to reach the owner of the job. For production jobs, this is usually a team mailing list. |
| instances | Integer | Number of instances (sometimes referred to as replicas or shards) of the task to create. (Default: 1) |
| cron_schedule | String | Cron schedule in cron format. May only be used with non-service jobs. See Cron Jobs for more information. (Default: None, not a cron job) |
| cron_collision_policy | String | Policy to use when a cron job is triggered while a previous run is still active. KILL_EXISTING: kill the previous run and schedule the new run. CANCEL_NEW: let the previous run continue and cancel the new run. (Default: KILL_EXISTING) |
| update_config | UpdateConfig object | Parameters for controlling the rate and policy of rolling updates. |
| constraints | dict | Scheduling constraints for the tasks. See the section on the constraint specification language. |
| service | Boolean | If True, restart tasks regardless of success or failure. (Default: False) |
| max_task_failures | Integer | Maximum number of failures after which the task is considered to have failed. (Default: 1) Set to -1 to allow for infinite failures. |
| priority | Integer | Preemption priority to give the task. (Default: 0) Tasks with higher priorities may preempt tasks at lower priorities. |
| production | Boolean | (Deprecated) Whether or not this is a production task that may preempt other tasks. (Default: False) Production job role must have the appropriate quota. |
| health_check_config | HealthCheckConfig object | Parameters for controlling a task's health checks. HTTP health check is only used if a health port was assigned with a command line wildcard. |
| container | Choice of Container, Docker or Mesos object | An optional container to run all processes inside of. |
| lifecycle | LifecycleConfig object | An optional task lifecycle configuration that dictates commands to be executed on startup/teardown. HTTP lifecycle is enabled by default if the “health” port is requested. See LifecycleConfig Objects for more information. |
| tier | String | Task tier type. The default scheduler tier configuration allows for 3 tiers: revocable, preemptible, and preferred. If a tier is not elected, Aurora assigns the task to a tier based on its choice of production (that is, preferred for production and preemptible for non-production jobs). See the section on Configuration Tiers for more information. |
| announce | Announcer object | Optionally enable Zookeeper ServerSet announcements. See Announcer Objects for more information. |
| enable_hooks | Boolean | Whether to enable Client Hooks for this job. (Default: False) |
| partition_policy | PartitionPolicy object | An optional partition policy that allows job owners to define how to handle partitions for running tasks (in partition-aware Aurora clusters). |
| metadata | list of Metadata objects | List of Metadata objects for the user's customized metadata information. |
| executor_config | ExecutorConfig object | Allows choosing an alternative executor defined in custom_executor_config to be used instead of Thermos. Tasks will be launched with Thermos as the executor by default. See Custom Executors for more info. |
| sla_policy | Choice of CountSlaPolicy, PercentageSlaPolicy or CoordinatorSlaPolicy object | An optional SLA policy that allows job owners to describe the SLA requirements for the job. See SlaPolicy Objects for more information. |
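
A minimal sketch of a service Job, assuming a task defined as in the Task Schema section; the cluster, role, and job names are illustrative:

    jobs = [Job(
      cluster = 'devcluster',   # illustrative cluster name
      role = 'www-data',        # illustrative role account
      environment = 'prod',
      name = 'hello_world',
      instances = 2,
      service = True,           # restart tasks regardless of exit status
      task = task)]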

UpdateConfig Objects

Parameters for controlling the rate and policy of rolling updates.

| object | type | description |
| ------ | ---- | ----------- |
| batch_size | Integer | Maximum number of shards to be updated in one iteration. (Default: 1) |
| watch_secs | Integer | Minimum number of seconds a shard must remain in RUNNING state before considered a success. (Default: 45) |
| max_per_shard_failures | Integer | Maximum number of restarts per shard during update. Increments total failure count when this limit is exceeded. (Default: 0) |
| max_total_failures | Integer | Maximum number of shard failures to be tolerated in total during an update. Cannot be greater than or equal to the total number of tasks in a job. (Default: 0) |
| rollback_on_failure | boolean | When False, prevents auto rollback of a failed update. (Default: True) |
| wait_for_batch_completion | boolean | When True, all threads from a given batch will be blocked from picking up new instances until the entire batch is updated. This essentially simulates the legacy sequential updater algorithm. (Default: False) |
| pulse_interval_secs | Integer | Indicates a coordinated update. If no pulses are received within the provided interval the update will be blocked. Beta-updater only. Will fail on submission when used with client updater. (Default: None) |
| update_strategy | Choice of QueueUpdateStrategy, BatchUpdateStrategy, or VariableBatchUpdateStrategy object | Indicates which update strategy to use for this update. |
| sla_aware | boolean | When True, updates will only update an instance if it does not break the task's specified SLA requirements. (Default: None) |
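
A sketch of a conservative rolling-update policy:

    update_config = UpdateConfig(
      batch_size = 2,           # update two shards per iteration
      watch_secs = 60,          # a shard must stay RUNNING 60s to count as updated
      max_per_shard_failures = 1,
      max_total_failures = 3)   # assumes the job has more than 3 instances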

QueueUpdateStrategy Objects

Update strategy which will keep the active updating instances at size batch_size throughout the update until there are no more instances left to update.

| object | type | description |
| ------ | ---- | ----------- |
| batch_size | Integer | Maximum number of shards to be updated in one iteration. (Default: 1) |

BatchUpdateStrategy Objects

Update strategy which will wait until a maximum of batch_size number of instances are updated before continuing on to the next group until all instances are updated.

| object | type | description |
| ------ | ---- | ----------- |
| batch_size | Integer | Maximum number of shards to be updated in one iteration. (Default: 1) |

VariableBatchUpdateStrategy Objects

Similar to the Batch update strategy, this strategy will wait until all instances in a current group are updated before updating more instances. However, instead of maintaining a static group size, the size of each group may change as the update progresses. For example, an update which modifies a total of 10 instances may be done in batch sizes of 2, 3, and 5. If the number of instances to be updated is greater than the sum of the groups, the last group size will be used in perpetuity until all instances are updated. Following the previous example, if 20 instances are modified instead of 10, the update groups would become: 2, 3, 5, 5, 5.

| object | type | description |
| ------ | ---- | ----------- |
| batch_sizes | List(Integer) | Maximum number of shards to be updated per iteration. As each iteration completes, the next iteration's group size may change. If there are still instances that need to be updated after all sizes are used, the last size will be reused for the remainder of the update. |
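
The 10-instance example above could be sketched as:

    update_config = UpdateConfig(
      update_strategy = VariableBatchUpdateStrategy(batch_sizes = [2, 3, 5]))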

Using the sla_aware option

There are some nuances around the sla_aware option that users should be aware of:

  • SLA-aware updates work in tandem with maintenance. Draining a host that has an instance of the job being updated affects the SLA and thus will be taken into account when the update determines whether or not it is safe to update another instance.
  • SLA-aware updates will use the SLAPolicy of the newest configuration when determining whether or not it is safe to update an instance. For example, if the current configuration specifies a PercentageSlaPolicy that allows for 5% of instances to be down and the updated configuration increases this value to 10%, the SLA calculation will be done using the 10% policy. Be mindful of this when doing an update that modifies the SLAPolicy since it may be possible to put the old configuration in a bad state that the new configuration would not be affected by. Additionally, if the update is rolled back, then the rollback will use the old SLAPolicy (or none if there was not one previously).
  • If using the CoordinatorSlaPolicy, it is important to pay attention to the batch_size of the update. If you have a complex SLA requirement, then you may be limiting the throughput of your updates with an insufficient batch_size. For example, imagine you have a job with 9 instances that represents three replicated caches, and you can only update one instance per replica set: [0 1 2] [3 4 5] [6 7 8] (the number indicates the instance ID and the brackets represent replica sets). If your batch_size is 3, then you will slowly update one replica set at a time. If your batch_size is 9, then you can update all replica sets in parallel, thus speeding up the update.
  • If an instance fails an SLA check for an update, then it will be rechecked starting at a delay from sla_aware_kill_retry_min_delay and exponentially increasing up to sla_aware_kill_retry_max_delay. These are cluster-operator set values.

HealthCheckConfig Objects

Parameters for controlling a task's health checks via HTTP or a shell command.

| param | type | description |
| ----- | ---- | ----------- |
| health_checker | HealthCheckerConfig | Configure what kind of health check to use. |
| initial_interval_secs | Integer | Initial grace period (during which health-check failures are ignored) while performing health checks. (Default: 15) |
| interval_secs | Integer | Interval on which to check the task's health. (Default: 10) |
| max_consecutive_failures | Integer | Maximum number of consecutive failures that will be tolerated before considering a task unhealthy. (Default: 0) |
| min_consecutive_successes | Integer | Minimum number of consecutive successful health checks required before considering a task healthy. (Default: 1) |
| timeout_secs | Integer | Health check timeout. (Default: 1) |

HealthCheckerConfig Objects

| param | type | description |
| ----- | ---- | ----------- |
| http | HttpHealthChecker | Configure health check to use HTTP. (Default) |
| shell | ShellHealthChecker | Configure health check via a shell command. |

HttpHealthChecker Objects

| param | type | description |
| ----- | ---- | ----------- |
| endpoint | String | HTTP endpoint to check. (Default: /health) |
| expected_response | String | If not empty, fail the HTTP health check if the response differs. Case insensitive. (Default: ok) |
| expected_response_code | Integer | If not zero, fail the HTTP health check if the response code differs. (Default: 0) |

ShellHealthChecker Objects

| param | type | description |
| ----- | ---- | ----------- |
| shell_command | String | An alternative to HTTP health checking. Specifies a shell command that will be executed. Any non-zero exit status will be interpreted as a health check failure. |
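
A sketch wiring a shell-based health check into a Job (the command is illustrative):

    health_check_config = HealthCheckConfig(
      health_checker = HealthCheckerConfig(
        shell = ShellHealthChecker(
          shell_command = 'test -f /tmp/worker.ready')),  # illustrative check
      interval_secs = 10,
      timeout_secs = 5)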

PartitionPolicy Objects

| param | type | description |
| ----- | ---- | ----------- |
| reschedule | Boolean | Whether or not to reschedule when running tasks become partitioned. (Default: True) |
| delay_secs | Integer | How long to delay transitioning to LOST when running tasks are partitioned. (Default: 0) |

Metadata Objects

Describes a piece of user metadata as a key-value pair.

| param | type | description |
| ----- | ---- | ----------- |
| key | String | The key of the user-provided metadata. |
| value | String | The metadata content for the corresponding key. |

ExecutorConfig Objects

Describes an Executor name and data to pass to the Mesos Task.

| param | type | description |
| ----- | ---- | ----------- |
| name | String | Name of the executor to use for this task. Must match the name of an executor in custom_executor_config or Thermos (AuroraExecutor). (Default: AuroraExecutor) |
| data | String | Data blob to pass on to the executor. (Default: "") |

Announcer Objects

If the announce field in the Job configuration is set, each task will be registered in the ServerSet /aurora/role/environment/jobname in the zookeeper ensemble configured by the executor (which can optionally be overridden by specifying the zk_path parameter). If no Announcer object is specified, no announcement will take place. For more information about ServerSets, see the Service Discovery documentation.

By default, the hostname in the registered endpoints will be the --hostname parameter that is passed to the mesos agent. To override the hostname value, the executor can be started with --announcer-hostname=<overridden_value>. If you decide to use --announcer-hostname and the overridden value needs to change for every executor, then the executor has to be started inside a wrapper; see Executor Wrapper.

For example, if you want the hostname in the endpoint to be an IP address instead of the hostname, the --hostname parameter to the mesos agent can be set to the machine IP or the executor can be started with --announcer-hostname=<host_ip> while wrapping the executor inside a script.

| object | type | description |
| ------ | ---- | ----------- |
| primary_port | String | Which named port to register as the primary endpoint in the ServerSet. (Default: http) |
| portmap | dict | A mapping of additional endpoints to be announced in the ServerSet. (Default: { 'aurora': '{{primary_port}}' }) |
| zk_path | String | Zookeeper serverset path override (executor must be started with the --announcer-allow-custom-serverset-path parameter). |

Port aliasing with the Announcer portmap

The primary endpoint registered in the ServerSet is the one allocated to the port specified by the primary_port in the Announcer object, by default the http port. This port can be referenced from anywhere within a configuration as {{thermos.ports[http]}}.

Without the port map, each named port would be allocated a unique port number. The portmap allows two different named ports to be aliased together. The default portmap aliases the aurora port (i.e. {{thermos.ports[aurora]}}) to the http port. Even though the two ports can be referenced independently, only one port is allocated by Mesos. Any port referenced in a Process object but which is not in the portmap will be allocated dynamically by Mesos and announced as well.

It is possible to use the portmap to alias names to static port numbers, e.g. {'http': 80, 'https': 443, 'aurora': 'http'}. In this case, referencing {{thermos.ports[aurora]}} would look up {{thermos.ports[http]}} then find a static port 80. No port would be requested of or allocated by Mesos.

Static ports should be used cautiously as Aurora does nothing to prevent two tasks with the same static port allocations from being co-scheduled. External constraints such as agent attributes should be used to enforce such guarantees should they be needed.
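
A sketch combining the static-port aliasing described above with the default aurora alias:

    announce = Announcer(
      primary_port = 'http',
      portmap = {'http': 80, 'https': 443, 'aurora': 'http'})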

Container Objects

Describes the container the job's processes will run inside. If not using Docker or the Mesos unified-container, the container can be omitted from your job config.

| param | type | description |
| ----- | ---- | ----------- |
| mesos | Mesos | A native Mesos container to use. |
| docker | Docker | A Docker container to use (via Docker engine). |

Mesos Object

| param | type | description |
| ----- | ---- | ----------- |
| image | Choice(AppcImage, DockerImage) | An optional filesystem image to use within this container. |
| volumes | List(Volume) | An optional list of volume mounts for this container. |

Volume Object

| param | type | description |
| ----- | ---- | ----------- |
| container_path | String | Mount point in the container. |
| host_path | String | Path on the host to mount. |
| mode | Enum | Mode of the mount, can be ‘RW’ or ‘RO’. |

AppcImage

Describes an AppC filesystem image.

| param | type | description |
| ----- | ---- | ----------- |
| name | String | The name of the appc image. |
| image_id | String | The image id of the appc image. |

DockerImage

Describes a Docker filesystem image.

| param | type | description |
| ----- | ---- | ----------- |
| name | String | The name of the docker image. |
| tag | String | The tag that identifies the docker image. |

Docker Object

Note: In order to correctly execute processes inside a job, the Docker container must have Python 2.7 installed.

Note: For a private Docker registry, Mesos mandates that the Docker credential file be named .dockercfg, even though Docker may create a credential file with a different name on various platforms. Also, the .dockercfg file needs to be copied into the sandbox using the -thermos_executor_resources flag, specified while starting Aurora.

| param | type | description |
| ----- | ---- | ----------- |
| image | String | The name of the docker image to execute. If the image does not exist locally it will be pulled with docker pull. |
| parameters | List(Parameter) | Additional parameters to pass to the Docker engine. |

Docker Parameter Object

Docker CLI parameters. These need to be enabled with the scheduler's -allow_docker_parameters option. See the Docker Command Line Reference for valid parameters.

| param | type | description |
| ----- | ---- | ----------- |
| name | String | The name of the docker parameter. E.g. volume |
| value | String | The value of the parameter. E.g. /usr/local/bin:/usr/bin:rw |
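
A sketch of a Docker container passing an extra engine parameter (the image name and parameter values are illustrative, and assume -allow_docker_parameters is enabled):

    container = Docker(
      image = 'python:2.7',  # must include Python 2.7, per the note above
      parameters = [Parameter(name = 'memory', value = '2G')])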

LifecycleConfig Objects

Note: The only lifecycle configuration supported is the HTTP lifecycle via the HttpLifecycleConfig.

| param | type | description |
| ----- | ---- | ----------- |
| http | HttpLifecycleConfig | Configure the lifecycle manager to send lifecycle commands to the task via HTTP. |

HttpLifecycleConfig Objects

Note: The combined graceful_shutdown_wait_secs and shutdown_wait_secs is implicitly upper bounded by the --stop_timeout_in_secs flag exposed by the executor (see options here; the default is 2 minutes). Therefore, if the user specifies values that add up to more than --stop_timeout_in_secs, the task will be killed earlier than the user anticipates (see the termination lifecycle here). Furthermore, --stop_timeout_in_secs itself is implicitly upper bounded by two scheduler options: transient_task_state_timeout and preemption_slot_hold_time (see reference here). If --stop_timeout_in_secs exceeds either of these scheduler options, tasks could be designated as LOST, or tasks utilizing preemption could lose their desired slot, respectively. Cluster operators should be aware of these timings should they change the defaults.

| param | type | description |
| ----- | ---- | ----------- |
| port | String | The named port to send POST commands to. (Default: health) |
| graceful_shutdown_endpoint | String | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit) |
| shutdown_endpoint | String | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort) |
| graceful_shutdown_wait_secs | Integer | The amount of time (in seconds) to wait after hitting the graceful_shutdown_endpoint before proceeding with the task termination lifecycle. (Default: 5) |
| shutdown_wait_secs | Integer | The amount of time (in seconds) to wait after hitting the shutdown_endpoint before proceeding with the task termination lifecycle. (Default: 5) |

graceful_shutdown_endpoint

If the Job is listening on the port specified by the HttpLifecycleConfig (default: health), an HTTP POST request will be sent over localhost to this endpoint to request that the task gracefully shut itself down. This is a courtesy call before the shutdown_endpoint is invoked graceful_shutdown_wait_secs seconds later.

shutdown_endpoint

If the Job is listening on the port specified by the HttpLifecycleConfig (default: health), an HTTP POST request will be sent over localhost to this endpoint as a final warning before the task is shut down. If the task does not shut down on its own after shutdown_wait_secs seconds, it will be forcefully killed.
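
A sketch overriding the default endpoints and waits:

    lifecycle = LifecycleConfig(
      http = HttpLifecycleConfig(
        port = 'health',
        graceful_shutdown_endpoint = '/quitquitquit',
        graceful_shutdown_wait_secs = 10,
        shutdown_endpoint = '/abortabortabort',
        shutdown_wait_secs = 10))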

SlaPolicy Objects

Configuration for specifying custom SLA requirements for a job. There are three supported SLA policies: CountSlaPolicy, PercentageSlaPolicy, and CoordinatorSlaPolicy.

CountSlaPolicy Objects

| param | type | description |
| ----- | ---- | ----------- |
| count | Integer | The number of active instances required every duration_secs. |
| duration_secs | Integer | Minimum time duration a task needs to be RUNNING to be treated as active. |

PercentageSlaPolicy Objects

| param | type | description |
| ----- | ---- | ----------- |
| percentage | Float | The percentage of active instances required every duration_secs. |
| duration_secs | Integer | Minimum time duration a task needs to be RUNNING to be treated as active. |

CoordinatorSlaPolicy Objects

| param | type | description |
| ----- | ---- | ----------- |
| coordinator_url | String | The URL to the Coordinator service to be contacted before performing SLA affecting actions (job updates, host drains, etc.). |
| status_key | String | The field in the Coordinator response that indicates the SLA status for working on the task. (Default: drain) |
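
For example, a job that requires 95% of its instances to stay up, counting an instance as active once it has been RUNNING for 30 minutes, might be sketched as:

    sla_policy = PercentageSlaPolicy(
      percentage = 95,
      duration_secs = 1800)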

Specifying Scheduling Constraints

In the Job object there is a map constraints from String to String allowing the user to tailor the schedulability of tasks within the job.

The constraint map's key is the attribute name on which we constrain Tasks within our Job. The value is how we constrain them. There are two types of constraints: limit constraints and value constraints.

| constraint | description |
| ---------- | ----------- |
| Limit | A string that specifies a limit for a constraint. Starts with 'limit:' followed by an Integer and a closing single quote, such as 'limit:1'. |
| Value | A string that specifies a value for a constraint. To include a list of values, separate the values using commas. To negate the values of a constraint, start with a !. |

Further details can be found in the Scheduling Constraints feature description.
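
For example, assuming agents expose host and rack attributes, a sketch that spreads instances one per host and pins them to two illustrative rack values:

    constraints = {
      'host': 'limit:1',      # limit constraint: at most one instance per host
      'rack': 'rack1,rack2'}  # value constraint: only these rack values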

Template Namespaces

Currently, a few Pystachio namespaces have special semantics. Using them in your configuration allows you to tailor application behavior through environment introspection or to interact in special ways with the Aurora client or Aurora-provided services.

mesos Namespace

The mesos namespace contains variables which relate to the mesos agent which launched the task. The instance variable can be used to distinguish between Task replicas.

| variable name | type | description |
| ------------- | ---- | ----------- |
| instance | Integer | The instance number of the created task. A job with 5 replicas has instance numbers 0, 1, 2, 3, and 4. |
| hostname | String | The hostname of the machine that the task instance was launched on. |

Please note, there is no uniqueness guarantee for instance in the presence of network partitions. If that is required, it should be baked in at the application level using a distributed coordination service such as Zookeeper.
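
For example, instance can be interpolated into a command line so that each replica gets a distinct identity (the flag name is illustrative):

    process = Process(
      name = 'server',
      cmdline = './server --shard-id={{mesos.instance}}')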

thermos Namespace

The thermos namespace contains variables that work directly on the Thermos platform in addition to Aurora. This namespace is fully compatible with Tasks invoked via the thermos CLI.

| variable | type | description |
| -------- | ---- | ----------- |
| ports | map of string to Integer | A map of names to port numbers. |
| task_id | string | The task ID assigned to this task. |

The thermos.ports namespace is automatically populated by Aurora when invoking tasks on Mesos. When running the thermos command directly, these ports must be explicitly mapped with the -P option.

For example, if ‘{{thermos.ports[http]}}’ is specified in a Process configuration, it is automatically extracted and auto-populated by Aurora, but must be specified with, for example, thermos -P http:12345 to map http to port 12345 when running via the CLI.
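
A sketch of a Process binding to its allocated port looks the same under both Aurora and the thermos CLI; only the source of the port number differs:

    http_server = Process(
      name = 'http_server',
      cmdline = './server --port={{thermos.ports[http]}}')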