 .. Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

Scheduler
==========

The Airflow scheduler monitors all tasks and DAGs, then triggers the
task instances once their dependencies are complete. Behind the scenes,
the scheduler spins up a subprocess, which monitors and stays in sync with all
DAGs in the specified DAG directory. Once per minute, by default, the scheduler
collects DAG parsing results and checks whether any active tasks can be triggered.

The Airflow scheduler is designed to run as a persistent service in an
Airflow production environment. To kick it off, all you need to do is
execute the ``airflow scheduler`` command. It uses the configuration specified in
``airflow.cfg``.

The scheduler uses the configured :doc:`Executor </executor/index>` to run tasks that are ready.

To start a scheduler, simply run the command:

.. code-block:: bash

    airflow scheduler

Your DAGs will start executing once the scheduler is running successfully.

.. note::

    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
    Subsequent DAG Runs are created by the scheduler process, based on your DAG's ``schedule_interval``,
    sequentially.

The scheduler won't trigger your tasks until the period it covers has ended, e.g., a job with ``schedule_interval`` set to ``@daily`` runs after the day
has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed.
In the UI, it appears as if Airflow is running your tasks a day **late**.

.. note::

    If you run a DAG on a ``schedule_interval`` of one day, the run with ``execution_date`` ``2019-11-21`` triggers soon after ``2019-11-21T23:59``.

    **Let's Repeat That**, the scheduler runs your job one ``schedule_interval`` AFTER the start date, at the END of the period.

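To make this behavior concrete, here is a minimal illustrative DAG (the ``dag_id`` and task are
examples, not part of this guide). With the ``start_date`` and ``schedule_interval`` below, the first
run has ``execution_date`` ``2019-11-21`` and is triggered soon after ``2019-11-21T23:59``:

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_daily",              # illustrative name
        start_date=datetime(2019, 11, 21),
        schedule_interval="@daily",
    ) as dag:
        # Runs once per day, after the day the run covers has ended.
        print_date = BashOperator(task_id="print_date", bash_command="date")
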
You should refer to :doc:`dag-run` for details on scheduling a DAG.

Triggering DAG with Future Date
-------------------------------

If you want to use an external trigger to run future-dated execution dates, set ``allow_trigger_in_future = True`` in the ``scheduler`` section of ``airflow.cfg``.
This only has effect if your DAG has no ``schedule_interval``.
If you keep the default ``allow_trigger_in_future = False`` and externally trigger a run with a future-dated execution date,
the scheduler won't execute it now, but will execute it once the current date rolls over to the execution date.

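As a sketch of what such a DAG looks like, the example below (illustrative names) is defined with no
``schedule_interval``, which is the only case where ``allow_trigger_in_future`` applies:

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Assumes airflow.cfg contains:
    #   [scheduler]
    #   allow_trigger_in_future = True
    with DAG(
        dag_id="manual_only",                # illustrative name
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,              # no schedule: external triggers only
    ) as dag:
        noop = BashOperator(task_id="noop", bash_command="true")

A future-dated run of this DAG can then be requested with ``airflow dags trigger --exec-date <date> manual_only``.
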
Running More Than One Scheduler
-------------------------------

.. versionadded:: 2.0.0

Airflow supports running more than one scheduler concurrently -- both for performance reasons and for
resiliency.

Overview
""""""""

The :abbr:`HA (highly available)` scheduler is designed to take advantage of the existing metadata database.
This was primarily done for operational simplicity: every component already has to speak to this DB, and by
not using direct communication or a consensus algorithm between schedulers (Raft, Paxos, etc.), nor another
consensus tool (Apache Zookeeper or Consul, for instance), we have kept the "operational surface area" to a
minimum.

The scheduler now uses the serialized DAG representation to make its scheduling decisions. The rough
outline of the scheduling loop is (a simplified code sketch follows the list):

- Check for any DAGs needing a new DagRun, and create them
- Examine a batch of DagRuns for schedulable TaskInstances or complete DagRuns
- Select schedulable TaskInstances, and whilst respecting Pool limits and other concurrency limits, enqueue
  them for execution

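A deliberately simplified sketch of that loop is shown below; the function names and bodies are
placeholders for illustration only, not Airflow's actual implementation:

.. code-block:: python

    import time

    def create_needed_dagruns():
        """Step 1: create a DagRun for any DAG whose next run is due."""

    def schedulable_task_instances():
        """Step 2: examine a batch of DagRuns, yield TaskInstances ready to run."""
        return []

    def within_pool_and_concurrency_limits(ti):
        """Step 3: check Pool slots and other concurrency limits."""
        return True

    def enqueue_to_executor(ti):
        """Hand the TaskInstance to the configured executor."""

    while True:
        create_needed_dagruns()
        for ti in schedulable_task_instances():
            if within_pool_and_concurrency_limits(ti):
                enqueue_to_executor(ti)
        time.sleep(1)  # illustrative pacing; the real loop is driven by config intervals
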
This does, however, place some requirements on the database.

.. _scheduler:ha:db_requirements:

Database Requirements
"""""""""""""""""""""

The short version is that users of PostgreSQL 9.6+ or MySQL 8+ are all ready to go -- you can start running as
many copies of the scheduler as you like -- no further setup or config options are needed. If you are
using a different database please read on.

To maintain performance and throughput there is one part of the scheduling loop that does a number of
calculations in memory (because having to round-trip to the DB for each TaskInstance would be too slow) so we
need to ensure that only a single scheduler is in this critical section at once -- otherwise limits would not
be correctly respected. To achieve this we use database row-level locks (using ``SELECT ... FOR UPDATE``).

This critical section is where TaskInstances go from scheduled state and are enqueued to the executor, whilst
ensuring the various concurrency and pool limits are respected. The critical section is obtained by asking for
a row-level write lock on every row of the Pool table (roughly equivalent to ``SELECT * FROM slot_pool FOR
UPDATE NOWAIT`` but the exact query is slightly different).

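For illustration, the locking pattern can be sketched with plain SQLAlchemy. This is not Airflow's
internal code, and the connection string is a placeholder:

.. code-block:: python

    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import OperationalError

    # Placeholder DSN; point this at your metadata database.
    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    with engine.begin() as conn:
        try:
            # Take a row-level write lock on every pool row, failing fast
            # if another scheduler already holds the lock.
            conn.execute(text("SELECT * FROM slot_pool FOR UPDATE NOWAIT"))
            # ... critical section: enqueue scheduled TaskInstances ...
        except OperationalError:
            pass  # another scheduler is in the critical section; retry next loop
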
The following databases are fully supported and provide an "optimal" experience:

- PostgreSQL 9.6+
- MySQL 8+

.. warning::

    MariaDB does not implement the ``SKIP LOCKED`` or ``NOWAIT`` SQL clauses (see `MDEV-13115
    <https://jira.mariadb.org/browse/MDEV-13115>`_). Without these features, running multiple schedulers is not
    supported and deadlock errors have been reported.

.. warning::

    MySQL 5.x also does not support ``SKIP LOCKED`` or ``NOWAIT``, and additionally is more prone to deciding
    queries are deadlocked, so running with more than a single scheduler on MySQL 5.x is neither supported nor
    recommended.

.. note::

    Microsoft SQLServer has not been tested with HA.

.. _scheduler:ha:tunables:

Scheduler Tuneables
"""""""""""""""""""

The following config settings can be used to control aspects of the Scheduler HA loop (a snippet for
inspecting their current values follows this list).

- :ref:`config:scheduler__max_dagruns_to_create_per_loop`

  This changes the number of DAGs that are locked by each scheduler when
  creating DAG runs. One possible reason for setting this lower is that if you
  have huge DAGs and are running multiple schedulers, you won't want one
  scheduler to do all the work.

- :ref:`config:scheduler__max_dagruns_per_loop_to_schedule`

  How many DagRuns should a scheduler examine (and lock) when scheduling
  and queuing tasks. Increasing this limit will allow more throughput for
  smaller DAGs but will likely slow down throughput for larger (>500
  tasks for example) DAGs. Setting this too high when using multiple
  schedulers could also lead to one scheduler taking all the DAG runs,
  leaving no work for the others.

- :ref:`config:scheduler__use_row_level_locking`

  Should the scheduler issue ``SELECT ... FOR UPDATE`` in relevant queries?
  If this is set to False then you should not run more than a single
  scheduler at once.

- :ref:`config:scheduler__pool_metrics_interval`

  How often (in seconds) should pool usage stats be sent to statsd (if
  ``statsd_on`` is enabled). This is a *relatively* expensive query to
  compute, so it should be set to match the same period as your statsd
  roll-up period.

- :ref:`config:scheduler__clean_tis_without_dagrun_interval`

  How often should each scheduler run a check to "clean up" TaskInstance rows
  that are found to no longer have a matching DagRun row.

  In normal operation the scheduler won't create such rows; they can only arise
  when rows are deleted via the UI, or directly in the DB. You can set this
  lower if this check is not important to you -- tasks will be left in whatever
  state they are in until the cleanup happens, at which point they will be
  set to failed.

- :ref:`config:scheduler__orphaned_tasks_check_interval`

  How often (in seconds) should the scheduler check for orphaned tasks or dead
  SchedulerJobs.

  This setting controls how quickly a dead scheduler will be noticed, so that the tasks it
  was "supervising" get picked up by another scheduler. (The tasks will stay
  running, so there is no harm in not detecting this for a while.)

  When a SchedulerJob is detected as "dead" (as determined by
  :ref:`config:scheduler__scheduler_health_check_threshold`), any running or
  queued tasks that were launched by the dead process will be "adopted" and
  monitored by this scheduler instead.

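To check the values of these settings currently in effect, a small sketch (not an official Airflow
command) using Airflow's configuration accessor:

.. code-block:: python

    from airflow.configuration import conf

    # Print the effective value of each HA tunable discussed above.
    for option in (
        "max_dagruns_to_create_per_loop",
        "max_dagruns_per_loop_to_schedule",
        "use_row_level_locking",
        "pool_metrics_interval",
        "clean_tis_without_dagrun_interval",
        "orphaned_tasks_check_interval",
    ):
        print(option, "=", conf.get("scheduler", option))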