---
title: "Airflow Survey 2020"
linkTitle: "Airflow Survey 2020"
author: "Tomek Urbaszek"
twitter: "turbaszek"
github: "turbaszek"
linkedin: "tomaszurbaszek"
description: "We observe steady growth in the number of users as well as in the number of active contributors, so listening to and understanding our community is of high importance."
tags: ["community", "survey", "users"]
date: "2021-03-09"
---

Apache Airflow Survey 2020

The world of data processing tools is growing steadily. Apache Airflow already seems to be considered a crucial component of this complex ecosystem. We observe steady growth in the number of users as well as in the number of active contributors, so listening to and understanding our community is of high importance.

It's worth noting that the 2020 survey was still mostly about the 1.10.x versions of Apache Airflow, and many of the drawbacks may have been addressed in version 2.0, which was released in December 2020. Whether this is true, we will learn next year!

Overview of the user

What best describes your current occupation? (single choice)

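| Answer | No. | % |
| --- | --- | --- |
| Data Engineer | 115 | 56.65 |
| Developer | 28 | 13.79 |
| DevOps | 17 | 8.37 |
| Solutions Architect | 14 | 6.9 |
| Data Scientist | 12 | 5.91 |
| Other | 10 | 4.93 |
| Data Analyst | 4 | 1.97 |
| Support Engineer | 3 | 1.48 |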

These results are not a surprise, as Airflow is a tool dedicated to data-related tasks. The majority of our users are data engineers, scientists or analysts. The 2020 results are similar to those from 2019, with a visible slight increase in ML use cases.

Additionally, 79% of users use Airflow on a daily basis and 16% interact with it at least once a week.
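As a rough illustration of how the percentage column in these tables relates to the raw counts, here is a minimal Python sketch. It assumes that every respondent picked exactly one occupation, so the number of respondents is the sum of the counts; the names `occupation_counts` and `share` are made up for this example.

```python
# Counts copied from the occupation table above.
occupation_counts = {
    "Data Engineer": 115,
    "Developer": 28,
    "DevOps": 17,
    "Solutions Architect": 14,
    "Data Scientist": 12,
    "Other": 10,
    "Data Analyst": 4,
    "Support Engineer": 3,
}

# Assumption: single-choice question, so respondents == sum of counts.
total_respondents = sum(occupation_counts.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


occupation_shares = {
    role: share(count, total_respondents)
    for role, count in occupation_counts.items()
}

for role, pct in occupation_shares.items():
    print(f"{role}: {pct}%")
```

Under that assumption the counts add up to 203 respondents, and 115 data engineers correspond to the 56.65% reported in the table.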

How many people work in your company? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| 200+ | 107 | 52.71 |
| 51-200 | 44 | 21.67 |
| 11-50 | 37 | 18.23 |
| 1-10 | 15 | 7.39 |

How many people in your company use Airflow? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| 1-5 | 84 | 41.38 |
| 6-20 | 75 | 36.95 |
| 21-50 | 23 | 11.33 |
| 50+ | 21 | 10.34 |

Airflow is software that is used and trusted by big companies. We can also see that Airflow works well for teams of different sizes. However, in some cases users may use multiple Airflow instances.

Are you considering moving to other workflow engines? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| No, we are happy with Airflow | 174 | 85.71 |
| Yes | 29 | 14.29 |

Nearly 1 out of 7 users is considering migrating to other workflow engines. Their decision is usually justified by the need for an easier workflow-writing experience (12.32%), a better UI/UX and a faster scheduler (8.37% each).

While the first point may be addressed by the TaskFlow API in Airflow 2.0, the other two are definitely addressed in the new major version, and early feedback from 2.0 users seems to confirm it.

The alternative engines considered by users are mainly Prefect and Argo. Some participants also mentioned Luigi, Kubeflow or custom solutions.

Are you or your team actively participating in Airflow development - contributing? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| I wish we could | 99 | 48.77 |
| No | 59 | 29.06 |
| Yes | 45 | 22.17 |

This is a really heart-warming result. It means that 1 out of 5 users contributes actively to our project! But it would be good to learn whether something other than time is stopping people who wish to contribute from doing so. If there are other obstacles, we would definitely like to learn about them so we can improve. That said, if you know of something we can improve, please reach out via Slack, the dev list or GitHub discussions.

How likely are you to recommend Apache Airflow? (single choice)

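| Answer | No. | 2020 % | 2019 % |
| --- | --- | --- | --- |
| Very Likely | 125 | 61.58 | 45.45% |
| Likely | 62 | 30.54 | 40.26% |
| Neutral | 11 | 5.42 | 10.71% |
| Unlikely | 3 | 1.48 | 2.60% |
| Very unlikely | 2 | 0.99 | 0.97% |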

Here is some good news! It seems that people are more willing to recommend Apache Airflow than the year before.
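One way to put a number on this shift is to add up the two positive answers for each year. The sketch below does that in Python with the percentages from the table above; treating "Very Likely" plus "Likely" as the positive share is an assumption of this example, not a metric defined by the survey.

```python
# Percentages copied from the recommendation table above.
likelihood_2020 = {
    "Very Likely": 61.58,
    "Likely": 30.54,
    "Neutral": 5.42,
    "Unlikely": 1.48,
    "Very unlikely": 0.99,
}
likelihood_2019 = {
    "Very Likely": 45.45,
    "Likely": 40.26,
    "Neutral": 10.71,
    "Unlikely": 2.60,
    "Very unlikely": 0.97,
}


def positive_share(shares: dict) -> float:
    """Sum the two positive answers (an assumption of this sketch)."""
    return round(shares["Very Likely"] + shares["Likely"], 2)


positive_2020 = positive_share(likelihood_2020)
positive_2019 = positive_share(likelihood_2019)
positive_change = round(positive_2020 - positive_2019, 2)

print(f"2020: {positive_2020}%  2019: {positive_2019}%  change: {positive_change:+.2f} points")
```

With these inputs the positive share is 92.12% in 2020 against 85.71% in 2019, and most of the gain comes from answers moving into "Very Likely".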

What is your source of information about Airflow? (multiple choice)

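| Answer | No. | % |
| --- | --- | --- |
| Documentation | 154 | 75.86 |
| Airflow website | 139 | 68.47 |
| Slack | 128 | 63.05 |
| GitHub | 127 | 62.56 |
| Stack Overflow | 72 | 35.47 |
| Airflow Summit Videos | 44 | 21.67 |
| The dev mailing list | 33 | 16.26 |
| Awesome Apache Airflow repository | 21 | 10.34 |
| Other | 15 | 7.39 |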

Here we see that the Airflow documentation is the crucial source of information. Interestingly, more than 60% of users get information from GitHub and Slack channels.

Airflow use cases

Do you have any customisation of Airflow? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| No, we use vanilla Airflow | 154 | 75.86 |
| Yes, we have small patches (no fork) | 34 | 16.75 |
| Yes, we have separate fork | 15 | 7.39 |

When onboarding new members to Airflow, what is the biggest problem? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| No guide on best practises on developing DAGs | 102 | 50.25 |
| There is no easy option to launch Airflow | 64 | 31.53 |
| Small number of tutorials on different aspects of using Airflow | 57 | 28.08 |
| Documentation is not clear enough | 53 | 26.11 |
| There is no easy option to deploy DAGs to an Airflow instance | 52 | 25.62 |
| No problems | 34 | 16.75 |
| Small number of blogs regarding Airflow | 30 | 14.78 |

Which interface(s) of Airflow do you use as part of your current role? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Original Airflow Graphical User Interface | 199 | 98.03 |
| CLI | 88 | 43.35 |
| API | 48 | 23.65 |
| Custom (own created) Airflow Graphical User Interface | 12 | 5.91 |
| Other | 3 | 1.48 |

Do you combine multiple DAGs? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Yes, by triggering another DAG | 87 | 42.86 |
| No, I don't combine multiple DAGs | 79 | 38.92 |
| Yes, through SubDAG | 40 | 19.7 |
| Other | 18 | 8.87 |

How do you integrate with external services? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Using existing dedicated operators / hooks | 147 | 72.41 |
| Using Bash / Python operator | 140 | 68.97 |
| Using own custom operators / hooks | 138 | 67.98 |
| Other | 12 | 5.91 |

What external services do you use in your Airflow DAGs? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Amazon Web Services | 121 | 59.61 |
| Internal company systems | 113 | 55.67 |
| Google Cloud Platform / Google APIs | 97 | 47.78 |
| Hadoop / Spark / Flink / Other Apache software | 72 | 35.47 |
| Microsoft Azure | 21 | 10.34 |
| Other | 19 | 9.36 |
| I do not use external services in my Airflow DAGs | 5 | 2.46 |

Do you use Airflow Plugins? If yes, what do you use them for? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Adding new operators/sensors and hooks | 119 | 58.62 |
| I don't use Airflow plugins | 69 | 33.99 |
| Adding AppBuilder views & menu items | 27 | 13.3 |
| Adding new executors | 17 | 8.37 |
| Adding OperatorExtraLinks | 13 | 6.4 |
| Other | | |

Do you use Airflow's data lineage feature? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| No, I will use such feature if fully supported in Airflow | 105 | 51.72 |
| No, data lineage isn’t a concern for my usage. | 68 | 33.5 |
| Yes, I use another data lineage product | 24 | 11.82 |
| Yes, I use custom implementation | 5 | 2.46 |
| Yes, I use Airflow's experimental data lineage feature | 1 | 0.49 |

When asked which lineage product they use, users gave answers that varied from custom tools to known products like Amundsen, Atlas or dbt.

Deployment

How many active DAGs do you have in your largest Airflow instance? (open question)

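| Number of DAGs | No. | % |
| --- | --- | --- |
| < 20 | 64 | 32 |
| 21-40 | 33 | 16 |
| 41-60 | 13 | 6 |
| 61-100 | 32 | 16 |
| 101-200 | 31 | 15 |
| 201-300 | 8 | 4 |
| 301-999 | 12 | 6 |
| 1000+ | 10 | 5 |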

What is the maximum number of tasks that you have used in one DAG? (open question)

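| Number of tasks | No. | % |
| --- | --- | --- |
| < 10 | 42 | 21 |
| 11-20 | 31 | 15 |
| 21-30 | 15 | 7 |
| 31-40 | 11 | 5 |
| 41-50 | 22 | 11 |
| 51-100 | 39 | 19 |
| 101-200 | 16 | 8 |
| 201-500 | 16 | 8 |
| 501+ | 11 | 5 |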

Which version of Airflow do you use currently? (single choice)

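| Version | No. | % |
| --- | --- | --- |
| 1.10.14 | 55 | 27.09 |
| 2.0.0+ | 45 | 22.17 |
| 1.10.12 | 27 | 13.3 |
| 1.10.10 | 26 | 12.81 |
| 1.10.11 | 14 | 6.9 |
| 1.10.5 or older | 10 | 4.93 |
| 1.10.9 | 8 | 3.94 |
| 1.10.13 | 7 | 3.45 |
| 1.10.6 | 4 | 1.97 |
| 1.10.7 | 4 | 1.97 |
| 1.10.8 | 3 | 1.48 |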

This was probably one of the most important questions in the survey. While it's good to see that more than 60% of users use one of the three latest Airflow versions, it's worrying that the rest are using versions that are old or have known security vulnerabilities.

Additionally, more than 20% of users are already using 2.0.0+ versions, which is encouraging.

What meta-database do you use? (single choice)

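| Database | No. | % |
| --- | --- | --- |
| Postgres 12 | 36 | 17.73 |
| Postgres 9.6 | 33 | 16.26 |
| Postgres 11 | 31 | 15.27 |
| MySQL 5.7 | 27 | 13.3 |
| MySQL 8.0 | 20 | 9.85 |
| Postgres 10 | 20 | 9.85 |
| Other | 19 | 9.36 |
| Postgres 13 | 18 | 8.87 |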

This means that about 69% of users choose Postgres as their meta-database. MySQL is the choice of nearly 24% of users. The other responses included MySQL variants like MariaDB and cloud-hosted databases like Cloud SQL (used by Google Composer) or AWS Aurora.

It's good to know that users tend to avoid SQLite in production deployments!

What executor type do you use? (single choice)

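| Executor | No. | 2020 | 2019 |
| --- | --- | --- | --- |
| Celery | 100 | 49.26% | 44.81% |
| Kubernetes | 48 | 23.65% | 16.88% |
| Local | 40 | 19.7% | 27.60% |
| Sequential | 10 | 4.93% | 7.14% |
| Other | 5 | 2.46% | 3.57% |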

Compared to the previous year, more users now use the Celery and Kubernetes executors, while LocalExecutor usage dropped by nearly 8 percentage points. This may suggest that users' deployments are growing and that they need more scalable solutions.

Among CeleryExecutor users, 78% use Redis as a broker, 19% use RabbitMQ, and the rest use other brokers or are not sure what is used in their deployments.
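The year-over-year shift can be checked directly from the executor table. The Python sketch below subtracts the 2019 share from the 2020 share for each executor and reports the difference in percentage points; the helper name `point_change` is hypothetical and used only for this illustration.

```python
# Percentages copied from the executor table above.
executor_share_2020 = {
    "Celery": 49.26,
    "Kubernetes": 23.65,
    "Local": 19.7,
    "Sequential": 4.93,
    "Other": 2.46,
}
executor_share_2019 = {
    "Celery": 44.81,
    "Kubernetes": 16.88,
    "Local": 27.60,
    "Sequential": 7.14,
    "Other": 3.57,
}


def point_change(current: dict, previous: dict) -> dict:
    """Return the change per executor in percentage points."""
    return {
        name: round(current[name] - previous[name], 2)
        for name in current
    }


executor_changes = point_change(executor_share_2020, executor_share_2019)

for name, delta in executor_changes.items():
    print(f"{name}: {delta:+.2f} percentage points")
```

Celery and Kubernetes come out at +4.45 and +6.77 points respectively, while Local shows a change of -7.90 points.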

What metrics do you use to monitor Airflow? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| I do not use monitoring | 65 | 32.02 |
| External monitoring service | 60 | 29.56 |
| Information from metadatabase | 51 | 25.12 |
| Statsd | 49 | 24.14 |
| Other | 31 | 15.27 |

The other responses mostly named specific tools, including DataDog and Prometheus exporter.

How do you deploy Airflow? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| On virtual machines (for example using AWS EC2) | 64 | 31.53 |
| Using a managed service like Astronomer, Google Composer or AWS MWAA | 35 | 17.24 |
| On Kubernetes (using custom deployments) | 29 | 14.29 |
| On premises | 28 | 13.79 |
| On Kubernetes (using another helm chart) | 20 | 9.85 |
| On Kubernetes (using Apache Airflow's helm chart) | 17 | 8.37 |
| Other | 12 | 5.91 |

Nearly 33% of users deploy Airflow using some kind of Kubernetes deployment. This is about 10 percent more than in 2019. There is also a slight increase in the use of Airflow via managed services (14.61% in 2019).
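The combined Kubernetes figure is the sum of the three Kubernetes rows in the deployment table. A minimal Python sketch of that aggregation follows; selecting rows by the "On Kubernetes" label prefix is simply a convenience of this example.

```python
# Percentages copied from the deployment table above.
deployment_share = {
    "On virtual machines (for example using AWS EC2)": 31.53,
    "Using a managed service like Astronomer, Google Composer or AWS MWAA": 17.24,
    "On Kubernetes (using custom deployments)": 14.29,
    "On premises": 13.79,
    "On Kubernetes (using another helm chart)": 9.85,
    "On Kubernetes (using Apache Airflow's helm chart)": 8.37,
    "Other": 5.91,
}

# Add up every option whose label starts with "On Kubernetes".
kubernetes_share = round(
    sum(
        pct
        for option, pct in deployment_share.items()
        if option.startswith("On Kubernetes")
    ),
    2,
)

print(f"Kubernetes-based deployments: {kubernetes_share}%")
```

The three rows add up to 32.51%, which is the "nearly 33%" quoted above.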

Do you use containerisation for deployment? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| Yes, using helm chart / kubernetes | 58 | 28.57 |
| No, I don’t use containerisation | 57 | 28.08 |
| Yes, single docker image | 49 | 24.14 |
| Yes, using docker compose | 39 | 19.21 |

Among users who do not use Kubernetes-based deployments, 58% use containerisation. About 42% of those users use docker-compose for deployments.

How do you distribute your DAGs? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| Using a synchronizing process (Git sync, GCS fuse, etc) | 79 | 38.92 |
| Bake them into the docker image | 56 | 27.59 |
| Shared files system | 34 | 16.75 |
| Other | 20 | 9.85 |
| I don’t know | 14 | 6.9 |

The most popular way of distributing DAGs seems to be using a synchronizing process. About 40% of users use this process together with Kubernetes deployments.

Future of Airflow

In your opinion, what could be improved in Airflow? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| Web UI | 100 | 49.26 |
| Examples, how-to, onboarding documentation | 90 | 44.33 |
| Logging, monitoring and alerting | 90 | 44.33 |
| Technical documentation | 90 | 44.33 |
| Scheduler performance | 83 | 40.89 |
| DAG authoring | 64 | 31.53 |
| Authentication and authorization | 58 | 28.57 |
| REST API | 51 | 25.12 |
| Other | 44 | 21.67 |
| Reliability | 41 | 20.2 |
| External integration e.g. AWS, GCP, Apache products | 36 | 17.73 |
| Security | 28 | 13.79 |
| CLI | 20 | 9.85 |
| Everything works fine for me | 14 | 6.9 |
| I don’t know | 4 | 1.97 |

Which features would most interest you? (multiple choice)

| Answer | No. | % |
| --- | --- | --- |
| DAG versioning | 109 | 53.69 |
| Builtin statistics | 71 | 34.98 |
| Improved data lineage | 65 | 32.02 |
| Scheduling at the start of the interval | 63 | 31.03 |
| Stateless workers | 59 | 29.06 |
| More option to configure schedules (time units, increments) | 57 | 28.08 |
| Multi-tenant deployment | 49 | 24.14 |
| DAG fetcher (AIP-5) | 39 | 19.21 |
| Generic transfer operator | 34 | 16.75 |
| Other | 33 | 16.26 |
| I have everything I need | 11 | 5.42 |
| Nothing | 11 | 5.42 |

Will you consider migrating to Airflow 2.0? (single choice)

| Answer | No. | % |
| --- | --- | --- |
| Yes, as soon as possible | 81 | 39.9 |
| Yes, once it’s mature (for example after 2.1) | 72 | 35.47 |
| I am already using Airflow 2.0+ | 39 | 19.21 |
| I don't know yet | 8 | 3.94 |
| No, I do not plan to migrate | 3 | 1.48 |

What are the features of Airflow 2.0 you are most excited about? (multiple choice)

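| Answer | No. | % |
| --- | --- | --- |
| General performance improvements | 133 | 65.52 |
| Refreshed WebUI | 102 | 50.25 |
| Scheduler HA | 99 | 48.77 |
| Official docker image | 84 | 41.38 |
| @task decorator | 56 | 27.59 |
| Official helm chart | 51 | 25.12 |
| Providers packages | 41 | 20.2 |
| Configurable XCom backends | 33 | 16.26 |
| CeleryKubernetesExecutor | 31 | 15.27 |
| Other | 12 | 5.91 |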

Summary

From an open-source point of view, it is good to see that many people would love to contribute to Apache Airflow. This means that there are resources that, if unleashed, may make our community even stronger. From a product perspective, it is important to know that users usually run the latest versions of our software and are willing to upgrade to new ones.

Finally, there are still some things to improve: documentation, onboarding guides and plug-and-play Airflow deployments. However, we hope that as adoption increases, more people will be willing to share their experience and tools.