The full changelog is about 3,000 lines long (already excluding everything backported to 1.10), so for now I’ll simply share some of the major features in 2.0.0 compared to 1.10.14:

A new way of writing dags: the TaskFlow API (AIP-31)

(Known in 2.0.0alphas as Functional DAGs.)

DAGs are now much nicer to author, especially when using PythonOperator. Dependencies are handled more clearly and XCom is nicer to use.

Read more here:

TaskFlow API Tutorial
TaskFlow API Documentation

A quick teaser of what DAGs can now look like:

```
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago


@dag(default_args={'owner': 'airflow'}, schedule_interval=None, start_date=days_ago(2))
def tutorial_taskflow_api_etl():
    @task
    def extract():
        # Simulated source data: order ID -> order value
        return {"1001": 301.27, "1002": 433.21, "1003": 502.22}

    # multiple_outputs=True pushes each key of the returned dict as its own XCom,
    # which is what lets order_summary["total_order_value"] resolve below
    @task(multiple_outputs=True)
    def transform(order_data_dict: dict) -> dict:
        total_order_value = 0

        for value in order_data_dict.values():
            total_order_value += value

        return {"total_order_value": total_order_value}

    @task
    def load(total_order_value: float):
        print("Total order value is: %.2f" % total_order_value)

    # Calling the decorated functions wires up the task dependencies
    order_data = extract()
    order_summary = transform(order_data)
    load(order_summary["total_order_value"])


tutorial_etl_dag = tutorial_taskflow_api_etl()
```

Fully specified REST API (AIP-32)

We now have a fully supported, no-longer-experimental API with a comprehensive OpenAPI specification.

Read more here:

REST API Documentation.
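
As a rough illustration of what talking to the stable API can look like, the sketch below builds (but does not send) a request to the DAG listing endpoint using only the standard library. The base URL, the credentials, and the assumption that your deployment accepts HTTP basic auth are placeholders for illustration, not something this release guarantees; check the REST API documentation for the auth setup that applies to you.

```python
import base64
import urllib.request


def build_list_dags_request(
    base_url: str, username: str, password: str, limit: int = 10
) -> urllib.request.Request:
    """Build a GET request for the stable API's DAG collection endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/dags?limit={limit}"
    # Assumption: the deployment is configured to accept HTTP basic auth
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )


# Hypothetical local deployment and credentials
request = build_list_dags_request("http://localhost:8080", "admin", "admin")
# To actually send it: urllib.request.urlopen(request)
```

Because every operation requires authorization, a request without valid credentials should be rejected rather than served anonymously.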

Massive Scheduler performance improvements

As part of AIP-15 (Scheduler HA+performance) and other work Kamil did, we significantly improved the performance of the Airflow Scheduler. It now starts tasks much, MUCH quicker.

Over at Astronomer.io we’ve benchmarked the scheduler, and it’s fast (we had to triple-check the numbers as we didn’t quite believe them at first!).

Scheduler is now HA compatible (AIP-15)

It’s now possible and supported to run more than a single scheduler instance. This is super useful for both resiliency (in case a scheduler goes down) and scheduling performance.

To fully use this feature you need Postgres 9.6+ or MySQL 8+ (MySQL 5 and MariaDB won’t work with more than one scheduler, I’m afraid).

There’s no config or other set up required to run more than one scheduler—just start up a scheduler somewhere else (ensuring it has access to the DAG files) and it will cooperate with your existing schedulers through the database.

For more information, read the Scheduler HA documentation.
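
To give a feel for how several schedulers can share work without talking to each other, here is a toy model. It is not Airflow code: as I understand it, the real scheduler relies on database row-level locks (the `SELECT ... FOR UPDATE SKIP LOCKED` style of query), while this sketch stands in for that with one in-process `threading.Lock` per row. The `Row` class, `run_scheduler` function, and scheduler names are all made up for illustration.

```python
import threading


class Row:
    """A stand-in for one database row with its own row-level lock."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.lock = threading.Lock()
        self.scheduled_by = None


def run_scheduler(scheduler_id: str, rows: list, claimed: dict) -> None:
    for row in rows:
        # Mimic SKIP LOCKED: if another scheduler holds the row, move on instead of waiting
        if not row.lock.acquire(blocking=False):
            continue
        try:
            if row.scheduled_by is None:
                row.scheduled_by = scheduler_id
                claimed[scheduler_id].append(row.name)
        finally:
            row.lock.release()


rows = [Row(f"task_{i}") for i in range(100)]
scheduler_ids = ["scheduler_a", "scheduler_b"]
claimed = {sid: [] for sid in scheduler_ids}

threads = [
    threading.Thread(target=run_scheduler, args=(sid, rows, claimed))
    for sid in scheduler_ids
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

all_claimed = sorted(name for names in claimed.values() for name in names)
```

Each row ends up claimed exactly once, whichever scheduler reaches it first. Neither scheduler ever blocks waiting on the other.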

Task Groups (AIP-34)

SubDAGs were commonly used for grouping tasks in the UI, but they had many drawbacks in their execution behaviour (primarily that they only executed a single task in parallel!). To improve this experience, we’ve introduced “Task Groups”: a method for organizing tasks which provides the same grouping behaviour as a SubDAG without any of the execution-time drawbacks.

SubDAGs will still work for now, but we think that any previous use of SubDAGs can now be replaced with task groups. If you find an example where this isn’t the case, please let us know by opening an issue on GitHub.

For more information, check out the Task Group documentation.

Refreshed UI

We’ve given the Airflow UI a visual refresh and updated some of the styling. Check out the UI section of the docs for screenshots.

We have also added an option to auto-refresh task states in Graph View so you no longer need to continuously press the refresh button :).

Smart Sensors

If you make heavy use of sensors in your Airflow cluster, you might find that sensor execution takes up a significant proportion of your cluster even with “reschedule” mode. To improve this, we’ve added a new mode called “Smart Sensors”.

This feature is in “early-access”: it’s been well-tested by Airbnb and is “stable”/usable, but we reserve the right to make backwards-incompatible changes to it in a future release (if we have to; we’ll try very hard not to!).

Read more about it in the Smart Sensors documentation.

Simplified KubernetesExecutor

For Airflow 2.0, we have re-architected the KubernetesExecutor in a fashion that is simultaneously faster, easier to understand, and more flexible for Airflow users. Users will now be able to access the full Kubernetes API to create a .yaml pod_template_file instead of specifying parameters in their airflow.cfg.

We have also replaced the executor_config dictionary with the pod_override parameter, which takes a Kubernetes V1Pod object for a 1:1 setting override. These changes have removed over three thousand lines of code from the KubernetesExecutor, which makes it run faster and creates fewer potential errors.

Read more here:

Docs on pod_template_file
Docs on pod_override

Airflow core and providers: Splitting Airflow into 60+ packages

Airflow 2.0 is not a monolithic “one to rule them all” package. We’ve split Airflow into core and 61 (for now) provider packages. Each provider package is for either a particular external service (Google, Amazon, Microsoft, Snowflake), a database (Postgres, MySQL), or a protocol (HTTP/FTP). Now you can create a custom Airflow installation from “building” blocks and choose only what you need, plus add whatever other requirements you might have. Some of the common providers are installed automatically (ftp, http, imap, sqlite) as they are commonly used. Other providers are automatically installed when you choose appropriate extras when installing Airflow.

The provider architecture should make it much easier to get a fully customized, yet consistent runtime with the right set of Python dependencies.

But that’s not all: you can write your own custom providers and add things like custom connection types, customizations of the Connection Forms, and extra links to your operators in a manageable way. You can build your own provider and install it as a Python package and have your customizations visible right in the Airflow UI.

Our very own Jarek Potiuk has written about providers in much more detail on the Polidea blog.

Docs on the providers concept and writing custom providers
Docs on all the provider packages available

Security

As part of the Airflow 2.0 effort, there has been a conscious focus on security and reducing areas of exposure. This is represented across different functional areas in different forms. For example, in the new REST API, all operations now require authorization. Similarly, in the configuration settings, the Fernet key is now required to be specified.
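
If you need a Fernet key to specify, a minimal sketch follows. It uses only the standard library and produces the same shape of key as the `cryptography` package's `Fernet.generate_key()`: 32 random bytes, URL-safe base64 encoded. The helper name `generate_fernet_key` is made up for this example, and setting the environment variable in-process is just one hypothetical way to hand the key to Airflow.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a fresh key in the format Fernet expects."""
    # 32 random bytes, URL-safe base64 encoded (44 characters including padding)
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


key = generate_fernet_key()

# Airflow reads configuration from AIRFLOW__{SECTION}__{KEY} environment variables;
# this only affects the current process and anything it launches
os.environ["AIRFLOW__CORE__FERNET_KEY"] = key
```

Keep the generated key somewhere safe: values encrypted with one key cannot be decrypted with a different one.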

Configuration

Configuration in the form of the airflow.cfg file has been rationalized further into distinct sections, specifically around “core”. Additionally, a significant number of configuration options have been deprecated or moved to individual component-specific configuration files, such as the pod-template-file for Kubernetes execution-related configuration.

Thanks to all of you

We’ve tried to make as few breaking changes as possible and to provide a deprecation path in the code, especially for anything called in the DAG. That said, please read through UPDATING.md to check what might affect you. For example: we re-organized the layout of operators (they now all live under airflow.providers.*), but the old names should continue to work; you’ll just notice a lot of DeprecationWarnings that need to be fixed up.
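
One way to find out which of your DAGs still use old names is to collect the DeprecationWarnings they raise. The sketch below shows the general pattern with the standard `warnings` module; `old_operator_factory` and `new_operator_factory` are invented stand-ins for a deprecated alias and its replacement, not real Airflow names.

```python
import warnings


def new_operator_factory() -> str:
    """Stand-in for something that lives at its new location."""
    return "operator from new location"


def old_operator_factory() -> str:
    """Stand-in for an old name kept alive as a thin, deprecated alias."""
    warnings.warn(
        "old_operator_factory is deprecated; use new_operator_factory instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_operator_factory()


# Record every warning raised while exercising code that still uses old names
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_operator_factory()

deprecations = [
    str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
]
```

For a stricter check, running Python with `-W error::DeprecationWarning` turns these warnings into exceptions, so any leftover old name fails loudly.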

Thank you so much to all the contributors who got us to this point, in no particular order: Kaxil Naik, Daniel Imberman, Jarek Potiuk, Tomek Urbaszek, Kamil Breguła, Gerard Casas Saez, Xiaodong DENG, Kevin Yang, James Timmins, Yingbo Wang, Qian Yu, Ryan Hamilton and the 100s of others who keep making Airflow better for everyone.
Explicitly shutdown logging in tasks so concurrent.futures can be used (#13057)

This fixes three problems:

1. That remote logs weren't being uploaded due to the fork change
2. That the S3 hook attempted to fetch credentials from the DB, but the
   ORM had already been disposed.
3. That even if forking was disabled, that S3 logs would fail due to use
   of concurrent.futures. See https://bugs.python.org/issue33097
5 files changed
tree: b48a22f9cbf2833f7559d1db489660b9deac7802
README.md

Apache Airflow

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Project Focus

Airflow works best with workflows that are mostly static and slowly changing. When DAG structure is similar from one run to the next, it allows for clarity around unit of work and continuity. Other similar projects include Luigi, Oozie and Azkaban.

Airflow is commonly used to process data, but has the opinion that tasks should ideally be idempotent (i.e. results of the task will be the same, and will not create duplicated data in a destination system), and should not pass large quantities of data from one task to the next (though tasks can pass metadata using Airflow's XCom feature). For high-volume, data-intensive tasks, a best practice is to delegate to external services that specialize in that type of work.
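
To make the idempotency point concrete, the sketch below contrasts two ways a task might write its output. Everything here is hypothetical and in-memory: `load_partition` overwrites the data for one run date, while `append_rows` blindly appends, and the loop simulates the same task instance running twice (for example, after a retry).

```python
from typing import Dict, List


def load_partition(destination: Dict[str, List[dict]], run_date: str, rows: List[dict]) -> None:
    """Idempotent load: replace the partition for run_date instead of appending to it."""
    destination[run_date] = list(rows)


def append_rows(destination: List[dict], rows: List[dict]) -> None:
    """Non-idempotent load: every rerun adds the same rows again."""
    destination.extend(rows)


rows = [{"order_id": "1001", "value": 301.27}, {"order_id": "1002", "value": 433.21}]

partitioned: Dict[str, List[dict]] = {}
appended: List[dict] = []

# Simulate the same task instance running twice with the same input
for _ in range(2):
    load_partition(partitioned, "2020-12-17", rows)
    append_rows(appended, rows)
```

After two runs the partitioned destination still holds two rows, while the appended one holds four, which is the duplicated data the idempotency guideline is meant to avoid.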

Airflow is not a streaming solution, but it is often used to process real-time data, pulling data off streams in batches.

Principles

  • Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
  • Extensible: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers.

Requirements

Apache Airflow is tested with:

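|            | Master version (2.0.0dev) | Stable version (1.10.14) |
|------------|---------------------------|--------------------------|
| Python     | 3.6, 3.7, 3.8             | 2.7, 3.5, 3.6, 3.7, 3.8  |
| PostgreSQL | 9.6, 10, 11, 12, 13       | 9.6, 10, 11, 12, 13      |
| MySQL      | 5.7, 8                    | 5.6, 5.7                 |
| SQLite     | latest stable             | latest stable            |
| Kubernetes | 1.16.9, 1.17.5, 1.18.6    | 1.16.9, 1.17.5, 1.18.6   |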

Note: MariaDB and MySQL 5.x are unable to or have limitations with running multiple schedulers -- please see the “Scheduler” docs.

Note: SQLite is used in Airflow tests. Do not use it in production.

Additional notes on Python version requirements

  • Stable version requires at least Python 3.5.3 when using Python 3

Getting started

Visit the official Airflow website documentation (latest stable release) for help with installing Airflow, getting started, or walking through a more complete tutorial.

Note: If you're looking for documentation for the master branch (the latest development branch), you can find it on s.apache.org/airflow-docs.

For more information on Airflow's Roadmap or Airflow Improvement Proposals (AIPs), visit the Airflow Wiki.

Official Docker (container) images for Apache Airflow are described in IMAGES.rst.

Installing from PyPI

We publish Apache Airflow as the apache-airflow package on PyPI. Installing it can sometimes be tricky, however, because Airflow is a bit of both a library and an application. Libraries usually keep their dependencies open and applications usually pin them, but we should do neither and both at the same time. We decided to keep our dependencies as open as possible (in setup.py) so users can install different versions of libraries if needed. This means that from time to time a plain pip install apache-airflow will not work or will produce an unusable Airflow installation.

To make installation repeatable, starting with Airflow 1.10.10 (and updated in Airflow 1.10.12) we also keep a set of “known-to-be-working” constraint files in the orphan constraints-master and constraints-1-10 branches. We keep these constraint files separately for each major/minor Python version. You can use them as constraint files when installing Airflow from PyPI. Note that you have to specify the correct Airflow tag/version/branch and Python version in the URL.

  1. Installing just Airflow:

NOTE!!!

In November 2020, a new version of pip (20.3) was released with a new 2020 resolver. This resolver does not yet work with Apache Airflow and might lead to installation errors, depending on your choice of extras. To install Airflow, either downgrade pip to version 20.2.4 with pip install --upgrade pip==20.2.4 or, if you use pip 20.3, add the option --use-deprecated legacy-resolver to your pip install command.

    pip install apache-airflow==1.10.14 \
     --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.14/constraints-3.7.txt"

  2. Installing with extras (for example postgres,google):

    pip install "apache-airflow[postgres,google]==1.10.14" \
     --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.14/constraints-3.7.txt"
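
The constraint URLs in the commands above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes that pattern (a `constraints-<airflow version>` tag containing one `constraints-<python major.minor>.txt` file) holds for the version you want; the helper names `constraints_url` and `pip_install_args` are made up for this example.

```python
import sys
from typing import List, Optional


def constraints_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Build the constraints file URL for an Airflow release and a major.minor Python version."""
    if python_version is None:
        # Default to the interpreter running this script
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_args(
    airflow_version: str,
    extras: Optional[List[str]] = None,
    python_version: Optional[str] = None,
    legacy_resolver: bool = False,
) -> List[str]:
    """Assemble the pip command line as a list of arguments."""
    extras_part = f"[{','.join(extras)}]" if extras else ""
    args = [
        "pip",
        "install",
        f"apache-airflow{extras_part}=={airflow_version}",
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]
    if legacy_resolver:
        # Needed with pip 20.3, per the note above
        args += ["--use-deprecated", "legacy-resolver"]
    return args


command = pip_install_args("1.10.14", extras=["postgres", "google"], python_version="3.7")
```

The resulting list can be passed to `subprocess.run(command, check=True)`; building it as a list avoids shell quoting issues with the square brackets in the extras.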

For information on installing backport providers, check /docs/backport-providers.rst.

Official source code

Apache Airflow is an Apache Software Foundation (ASF) project, and our official source code releases:

Following the ASF rules, the source packages released must be sufficient for a user to build and test the release provided they have access to the appropriate platform and tools.

Convenience packages

There are other ways of installing and using Airflow. Those are “convenience” methods - they are not “official releases” as stated by the ASF Release Policy, but they can be used by the users who do not want to build the software themselves.

Those are - in the order of most common ways people install Airflow:

  • PyPI releases to install Airflow using standard pip tool
  • Docker Images to install airflow via docker tool, use them in Kubernetes, Helm Charts, docker-compose, docker swarm etc. You can read more about using, customising, and extending the images in the Latest docs, and learn details on the internals in the IMAGES.rst document.
  • Tags in GitHub to retrieve the git project sources that were used to generate official source packages via git

All those artifacts are not official releases, but they are prepared using officially released sources. Some of those artifacts are “development” or “pre-release” ones, and they are clearly marked as such following the ASF Policy.

User Interface

  • DAGs: Overview of all DAGs in your environment.

  • Tree View: Tree representation of a DAG that spans across time.

  • Graph View: Visualization of a DAG's dependencies and their current status for a specific run.

  • Task Duration: Total time spent on different tasks over time.

  • Gantt View: Duration and overlap of a DAG.

  • Code View: Quick way to view source code of a DAG.

Contributing

Want to help build Apache Airflow? Check out our contributing documentation.

Who uses Apache Airflow?

More than 350 organizations are using Apache Airflow in the wild.

Who Maintains Apache Airflow?

Airflow is the work of the community, but the core committers/maintainers are responsible for reviewing and merging PRs as well as steering conversation around new feature requests. If you would like to become a maintainer, please review the Apache Airflow committer requirements.

Can I use the Apache Airflow logo in my presentation?

Yes! Be sure to abide by the Apache Foundation trademark policies and the Apache Airflow Brandbook. The most up to date logos are found in this repo and on the Apache Software Foundation website.

Airflow merchandise

If you would love to have Apache Airflow stickers, t-shirts, etc., then check out the Redbubble Shop.

Links