.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

What is Airflow?
=========================================
`Apache Airflow <https://github.com/apache/airflow>`_ is an open-source platform for developing, scheduling,
and monitoring batch-oriented workflows. Airflow's extensible Python framework enables you to build workflows
connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is
deployable in many ways, varying from a single process on your laptop to a distributed setup to support even
the biggest workflows.

Workflows as code
=========================================

The main characteristic of Airflow workflows is that all workflows are defined in Python code. "Workflows as
code" serves several purposes:

- **Dynamic**: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
- **Extensible**: The Airflow framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
- **Flexible**: Workflow parameterization is built-in leveraging the `Jinja <https://jinja.palletsprojects.com>`_ templating engine.

Take a look at the following snippet of code:

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task
    from airflow.operators.bash import BashOperator

    # A DAG represents a workflow, a collection of tasks
    with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule="0 0 * * *") as dag:
        # Tasks are represented as operators
        hello = BashOperator(task_id="hello", bash_command="echo hello")

        @task()
        def airflow():
            print("airflow")

        # Set dependencies between tasks
        hello >> airflow()

Here you see:

- A DAG named "demo", starting on Jan 1st 2022 and running once a day. A DAG is Airflow's representation of a workflow.
- Two tasks, a BashOperator running a Bash script and a Python function defined using the ``@task`` decorator
- ``>>`` between the tasks defines a dependency and controls in which order the tasks will be executed
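
The demo above keeps things static, so it does not yet show the *Dynamic* and *Flexible* points from the
earlier list. A minimal, purely illustrative sketch of both could look like this (the DAG ids, source names,
and schedule are made up):

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Dynamic: ordinary Python, here a loop, generates several similar DAGs
    for source in ["customers", "orders", "payments"]:  # illustrative names
        with DAG(
            dag_id=f"export_{source}",
            start_date=datetime(2022, 1, 1),
            schedule="@daily",
        ) as dag:
            # Flexible: operator arguments are Jinja-templated; "{{ ds }}" is
            # rendered to the run's logical date when the task executes
            BashOperator(
                task_id="export",
                bash_command="echo exporting " + source + " for {{ ds }}",
            )

        # Expose each generated DAG at module level so Airflow discovers it
        globals()[dag.dag_id] = dag

Each generated DAG then shows up in the web interface just like the "demo" DAG.
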
Airflow evaluates the demo script and executes the tasks at the set interval and in the defined order. The status
of the "demo" DAG is visible in the web interface:

.. image:: /img/hello_world_graph_view.png
  :alt: Demo DAG in the Graph View, showing the status of one DAG run

This example demonstrates a simple Bash and Python script, but these tasks can run any arbitrary code. Think
of running a Spark job, moving data between two buckets, or sending an email. The same structure can also be
seen running over time:

.. image:: /img/hello_world_grid_view.png
  :alt: Demo DAG in the Grid View, showing the status of all DAG runs

Each column represents one DAG run. These are two of the most used views in Airflow, but there are several
other views which allow you to deep dive into the state of your workflows.

Why Airflow?
=========================================
Airflow is a batch workflow orchestration platform. The Airflow framework contains operators to connect with
many technologies and is easily extensible to connect with a new technology. If your workflows have a clear
start and end, and run at regular intervals, they can be programmed as an Airflow DAG.

If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code which
means:

- Workflows can be stored in version control so that you can roll back to previous versions
- Workflows can be developed by multiple people simultaneously
- Tests can be written to validate functionality (see the sketch after this list)
- Components are extensible and you can build on a wide collection of existing components
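
Such a test could, for example, load every DAG file and assert that none of them fails to import. A minimal
sketch, assuming pytest as the test runner and the DAG folder configured as usual:

.. code-block:: python

    from airflow.models import DagBag


    def test_dags_import_without_errors():
        # DagBag parses every file in the configured DAG folder
        dag_bag = DagBag(include_examples=False)

        # any exception raised while importing a DAG file is collected here
        assert dag_bag.import_errors == {}


    def test_demo_dag_structure():
        # the "demo" DAG from the snippet earlier on this page
        dag = DagBag(include_examples=False).get_dag("demo")

        assert dag is not None
        assert len(dag.tasks) == 2
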
Rich scheduling and execution semantics enable you to easily define complex pipelines, running at regular
intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic.
And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
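
Scheduling and backfilling both revolve around each run's logical date: re-running a pipeline over a
historical period simply produces one run per past interval. A minimal sketch (the DAG id, schedule, and
report logic are illustrative):

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task

    with DAG(
        dag_id="daily_report",  # illustrative name
        start_date=datetime(2022, 1, 1),
        schedule="@daily",
        catchup=True,  # create a run for every past interval that has not run yet
    ) as dag:

        @task
        def build_report():
            from airflow.operators.python import get_current_context

            # every run, including a backfilled one, sees its own logical date,
            # so the run for 2022-01-05 processes exactly that day's data
            ds = get_current_context()["ds"]
            print(f"building report for {ds}")

        build_report()

A historical date range can then be (re-)run with the ``airflow dags backfill`` CLI command by passing a
start date, an end date, and the DAG id.
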
Airflow's user interface provides both in-depth views of pipelines and individual tasks, and an overview of
pipelines over time. From the interface, you can inspect logs and manage tasks, for example retrying a task in
case of failure.

The open-source nature of Airflow ensures you work on components developed, tested, and used by many other
`companies <https://github.com/apache/airflow/blob/main/INTHEWILD.md>`_ around the world. In the active
`community <https://airflow.apache.org/community>`_ you can find plenty of helpful resources in the form of
blog posts, articles, conferences, books, and more. You can connect with other peers via several channels
such as `Slack <https://s.apache.org/airflow-slack>`_ and mailing lists.

Why not Airflow?
=========================================
Airflow was built for finite batch workflows. While the CLI and REST API do allow triggering workflows,
Airflow was not built for infinitely-running event-based workflows. Airflow is not a streaming solution.
However, a streaming system such as Apache Kafka is often seen working together with Apache Airflow: Kafka
ingests and processes events in real time, the event data is written to a storage location, and Airflow
periodically starts a workflow that processes a batch of that data.
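
A rough sketch of that division of labor, in which everything (the storage layout, DAG id, and schedule) is
hypothetical:

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task

    # A Kafka consumer outside of Airflow continuously writes event files to
    # storage, partitioned by hour, e.g. /data/events/2022-01-01T03/
    with DAG(
        dag_id="process_event_batches",  # illustrative name
        start_date=datetime(2022, 1, 1),
        schedule="@hourly",
    ) as dag:

        @task
        def process_events():
            from airflow.operators.python import get_current_context

            # each run processes only the finite batch belonging to its own hour
            logical_date = get_current_context()["logical_date"]
            batch_path = f"/data/events/{logical_date:%Y-%m-%dT%H}/"
            print(f"processing batch at {batch_path}")
            # ... read the files, aggregate, load into a warehouse, etc.

        process_events()

Kafka handles the never-ending stream; Airflow only ever sees finite, per-interval batches.
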
If you prefer clicking over coding, Airflow is probably not the right solution. The web interface aims to make
managing workflows as easy as possible and the Airflow framework is continuously improved to make the
developer experience as smooth as possible. However, the philosophy of Airflow is to define workflows as code,
so coding will always be required.

.. toctree::
    :hidden:
    :caption: Content

    Overview <self>
    project
    license
    start
    installation/index
    upgrading-from-1-10/index
    tutorial/index
    howto/index
    ui
    concepts/index
    executor/index
    dag-run
    plugins
    security/index
    logging-monitoring/index
    timezone
    Using the CLI <usage-cli>
    integration
    kubernetes
    lineage
    listeners
    dag-serialization
    modules_management
    Release Policies <release-process>
    release_notes
    best-practices
    production-deployment
    faq
    privacy_notice

.. toctree::
    :hidden:
    :caption: References

    Operators and hooks <operators-and-hooks-ref>
    CLI <cli-and-env-variables-ref>
    Templates <templates-ref>
    Python API <python-api-ref>
    Stable REST API <stable-rest-api-ref>
    deprecated-rest-api-ref
    Configurations <configurations-ref>
    Extra packages <extra-packages-ref>

.. toctree::
    :hidden:
    :caption: Internal DB details

    Database Migrations <migrations-ref>
    Database ERD Schema <database-erd-ref>