---
title: "Airflow Survey 2019"
linkTitle: "Airflow Survey 2019"
author: "Tomek Urbaszek"
twitter: "Nuclearriot"
github: "nuclearpinguin"
linkedin: "tomaszurbaszek"
description: "Receiving and adjusting to our users’ feedback is a must. Let’s see who Airflow users are, how they play with it, and what they miss."
tags: ["community", "survey", "users"]
date: "2019-12-11"
---

Apache Airflow Survey 2019

Apache Airflow is growing faster than ever, so receiving and adjusting to our users’ feedback is a must. We ran a survey and received 308 responses. Let’s see who Airflow users are, how they play with it, and what they miss.

Overview of the user

What best describes your current occupation?

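| Answer | No. | % |
|---|---|---|
| Data Engineer | 194 | 62.99% |
| Developer | 34 | 11.04% |
| Architect | 23 | 7.47% |
| Data Scientist | 19 | 6.17% |
| Data Analyst | 13 | 4.22% |
| DevOps | 13 | 4.22% |
| IT Administrator | 2 | 0.65% |
| Machine Learning Engineer | 2 | 0.65% |
| Manager | 2 | 0.65% |
| Operations | 2 | 0.65% |
| Chief Data Officer | 1 | 0.32% |
| Engineering Manager | 1 | 0.32% |
| Intern | 1 | 0.32% |
| Product owner | 1 | 0.32% |
| Quant | 1 | 0.32% |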

In your day to day job, what do you use Airflow for?

| Answer | No. | % |
|---|---|---|
| Data processing (ETL) | 298 | 96.75% |
| Artificial Intelligence and Machine Learning Pipelines | 90 | 29.22% |
| Automating DevOps operations | 64 | 20.78% |

According to the survey, most Airflow users are “data” people. Moreover, 28.57% use Airflow for both ETL and ML pipelines, which suggests that the two fields are closely connected. Only five respondents use Airflow for DevOps operations alone, which means that the other 59 people who use it for DevOps also use it for ETL or ML.
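Overlaps like these are easy to count once the multi-choice answers are stored as 0/1 flags per option. Below is a minimal sketch in plain Python. The rows are toy data, not the survey responses, and the column names (`etl`, `ml`, `devops`) are made up for illustration.

```python
# Toy rows, NOT the survey data: each row is one respondent and each column
# is a 0/1 flag for one multi-choice option. Column names are hypothetical.
rows = [
    {"etl": 1, "ml": 1, "devops": 0},
    {"etl": 1, "ml": 0, "devops": 0},
    {"etl": 1, "ml": 1, "devops": 1},
    {"etl": 0, "ml": 0, "devops": 1},
    {"etl": 1, "ml": 0, "devops": 1},
]


def count_where(rows, **flags):
    """Count rows whose columns match all of the given 0/1 flags."""
    return sum(all(row[col] == val for col, val in flags.items()) for row in rows)


# Respondents who ticked both ETL and ML.
both_etl_and_ml = count_where(rows, etl=1, ml=1)

# Respondents who ticked DevOps and nothing else.
devops_only = count_where(rows, etl=0, ml=0, devops=1)

# Respondents who ticked DevOps together with at least one other option.
devops_plus_other = count_where(rows, devops=1) - devops_only

share_both = 100 * both_etl_and_ml / len(rows)
print(both_etl_and_ml, devops_only, devops_plus_other, share_both)
```

The same subtraction gives the figure quoted above: 64 DevOps users minus 5 DevOps-only users leaves 59.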

How many active DAGs do you have in your largest Airflow instance?

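| Answer | No. | % |
|---|---|---|
| 0-20 | 115 | 37.34% |
| 21-40 | 65 | 21.10% |
| 41-60 | 44 | 14.29% |
| 61-100 | 28 | 9.09% |
| 101-200 | 28 | 9.09% |
| 201-300 | 7 | 2.27% |
| 301-999 | 8 | 2.60% |
| 1000+ | 13 | 4.22% |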

The majority of users do not exceed 100 active DAGs per Airflow instance. However, some users run more than a thousand DAGs, with a reported maximum of 5000.

What is the maximum number of tasks that you have used in one DAG?

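| Answer | No. | % |
|---|---|---|
| 0-10 | 61 | 19.81% |
| 11-20 | 60 | 19.48% |
| 21-30 | 31 | 10.06% |
| 31-40 | 21 | 6.82% |
| 41-50 | 26 | 8.44% |
| 51-100 | 36 | 11.69% |
| 101-200 | 28 | 9.09% |
| 201-500 | 21 | 6.82% |
| 501+ | 24 | 7.79% |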

The highest reported number of tasks in a single DAG was 10 000 (!). The number of tasks depends on the purpose of a DAG, so it’s hard to say whether users have “simple” or “complicated” workflows.

When onboarding new members to Airflow, what is the biggest problem?

| Answer | No. | % |
|---|---|---|
| No guide on best practises on developing DAGs | 160 | 51.95% |
| Small number of tutorials on different aspects of using Airflow | 57 | 18.51% |
| Documentation is not clear enough | 42 | 13.64% |
| Small number of blogs regarding Airflow | 6 | 1.95% |
| Other | 43 | 13.96% |

This is an important result. Using Airflow is all about writing and scheduling DAGs, so the lack of a guide or any other complete resource on best practices for developing DAGs is a big problem. Digging into the “Other” answers, we find that:

  • Airflow’s “magic” (scheduler, executors, schedule times) is hard to understand
  • DAG testing is not easy to do and to explain
  • Airflow UI needs some love.

How likely are you to recommend Apache Airflow?

| Answer | No. | % |
|---|---|---|
| Very Likely | 140 | 45.45% |
| Likely | 124 | 40.26% |
| Neutral | 33 | 10.71% |
| Unlikely | 8 | 2.60% |
| Very unlikely | 3 | 0.97% |

This means that more than 85% of people who use Airflow like it, so it seems Airflow does its job nicely. However, we have to remember that this survey is likely biased: people are more likely to respond to a survey if they like the tool they use. Should we then focus on the 11 people who did not like Airflow? It’s a good question.
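The shares quoted here can be reproduced from the counts in the table. A minimal sketch in plain Python, using only the numbers reported above (the variable and function names are ours):

```python
# Counts copied from the "How likely are you to recommend Apache Airflow?" table.
recommend = {
    "Very Likely": 140,
    "Likely": 124,
    "Neutral": 33,
    "Unlikely": 8,
    "Very unlikely": 3,
}

total = sum(recommend.values())  # all responses to this question


def share(*answers):
    """Return the combined share of the given answers, in percent."""
    return round(100 * sum(recommend[a] for a in answers) / total, 2)


likely = share("Very Likely", "Likely")
unlikely = share("Unlikely", "Very unlikely")
print(total, likely, unlikely)
```

With these counts, the two positive answers add up to 264 of 308 responses and the two negative ones to 11.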

Airflow usage

Which interface(s) of Airflow do you use as part of your current role?

| Answer | No. | % |
|---|---|---|
| Original Airflow Graphical User Interface | 297 | 96.43% |
| CLI | 126 | 40.91% |
| Original Airflow Graphical User Interface, CLI | 117 | 37.99% |
| API | 60 | 19.48% |
| Original Airflow Graphical User Interface, CLI, API | 32 | 10.39% |
| Custom (own created) Airflow Graphical User Interface | 25 | 8.12% |

CLI usage clearly goes hand in hand with using the Airflow web UI. Our survey also included some UX-related questions to help us understand how users work with the Airflow webserver.

What do you use the Graphical User Interface for?

What do you use CLI for?

In Airflow, which UI view(s) are important for you?

Here we see that the majority uses Web UI mostly for monitoring purposes:

  • Monitoring DAGs
  • Accessing logs

An interesting result is that many people seem not to use backfilling, as the CLI is the only way to do it.

What executor type do you use?

| Answer | No. | % |
|---|---|---|
| Celery | 138 | 44.81% |
| Local | 85 | 27.60% |
| Kubernetes | 52 | 16.88% |
| Sequential | 22 | 7.14% |
| Other | 11 | 3.57% |

The “Other” answers mostly came from people who use several executor types or are migrating from one executor to another. Compared with the results of an earlier survey done by Ash, usage of the Local and Kubernetes executors has increased.

Do you use Kubernetes-based deployments for Airflow?

| Answer | No. | % |
|---|---|---|
| No - we do not plan to use Kubernetes near term | 88 | 28.57% |
| Yes - setup on our own via Helm Chart or similar | 65 | 21.10% |
| Not yet - but we use Kubernetes in our organization and we could move | 61 | 19.81% |
| Yes - via managed service in the cloud (Composer / Astronomer etc.) | 45 | 14.61% |
| Not yet - but we plan to deploy Kubernetes in our organization soon | 42 | 13.64% |
| Other | 7 | 2.27% |

The most interesting thing is that nearly 30% of users do not use Kubernetes and do not plan to move to it. This means we should keep other deployment options in mind when working on Airflow 2.0. On the other hand, almost 70% of users either already use Kubernetes or consider it a viable option.

Do you combine multiple DAGs?

| Answer | No. | % |
|---|---|---|
| No, I don't combine multiple DAGs | 127 | 41.23% |
| Yes, through SubDAG | 73 | 23.70% |
| Yes, by triggering another DAG | 72 | 23.38% |
| Other | 36 | 11.69% |

In the “Other” category, 9 people explicitly mentioned using ExternalTaskSensor, which I think could be grouped with combining DAGs by triggering other DAGs.

Do you use Airflow Plugins? If yes, what do you use it for?

| Answer | No. | % |
|---|---|---|
| Adding new operators/sensors and hooks | 187 | 60.71% |
| I don't use Airflow plugins | 109 | 35.39% |
| Adding AppBuilder views & menu items | 31 | 10.06% |
| Adding new executor | 18 | 5.84% |
| Adding OperatorExtraLinks | 7 | 2.27% |

The high share (60%) for “Adding new operators/sensors and hooks” surprised some of us, especially since you do not need the plugin mechanism to add any of those. They are standard Python objects: you can simply place your hook, operator, and sensor code in a directory listed in the PYTHONPATH environment variable, and it will work. This may be a result of the lack of a best practices guide.
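To illustrate the import mechanism only, here is a minimal sketch in plain Python. It does not use Airflow at all: `my_operators` and `HelloOperator` are hypothetical names, the class is a stand-in (a real custom operator would subclass Airflow's `BaseOperator`), and a temporary directory plays the role of a folder listed in PYTHONPATH.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module content, used only for illustration.
MODULE_SOURCE = '''
class HelloOperator:
    """Stand-in for a custom operator; a real one would subclass BaseOperator."""

    def __init__(self, name):
        self.name = name

    def execute(self, context=None):
        return f"Hello, {self.name}!"
'''

# Simulate a directory that is listed in PYTHONPATH.
shared_code = Path(tempfile.mkdtemp())
(shared_code / "my_operators.py").write_text(MODULE_SOURCE)

# PYTHONPATH entries end up on sys.path when the interpreter starts;
# appending the directory here has the same effect for this process.
sys.path.append(str(shared_code))
importlib.invalidate_caches()

# A plain import is enough: no plugin registration is involved.
from my_operators import HelloOperator

result = HelloOperator(name="Airflow").execute()
print(result)
```

The point of the sketch is that nothing beyond a regular import is needed once the code is importable.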

Plugins are more useful for adding views and menu items, yet only 10% of respondents use them that way. OperatorExtraLinks are an even more useful (though relatively new) feature, so it’s not entirely surprising that they are hardly used.

It was also somewhat surprising that anyone uses plugins to add their own executors. We recently considered removing that option, but now we have to rethink our approach.

What metrics do you use to monitor Airflow?

There were a lot of different responses. Some use Prometheus and other services, others do not use any monitoring. One of the interesting responses linked to this solution for airflow_operators_metrics.

External services

What external services do you use in your Airflow DAGs?

| Answer | No. | % |
|---|---|---|
| Amazon Web Services | 160 | 51.95% |
| Internal company systems | 150 | 48.7% |
| Hadoop / Spark / Flink / Other Apache software | 119 | 38.64% |
| Google Cloud Platform / Google APIs | 112 | 36.36% |
| Microsoft Azure | 28 | 9.09% |
| I do not use external services in my Airflow DAGs | 18 | 5.84% |

It’s not surprising that Amazon Web Services leads the way, as it is considered the most mature cloud provider. Internal systems and other Apache products in the next two positions are understandable, given that the majority use Airflow for ETL processes.

What external services do you use in your Airflow DAGs? (Mixed providers)

| Answer | No. | % |
|---|---|---|
| Google Cloud Platform / Google APIs, Amazon Web Services | 44 | 14.29% |
| Amazon Web Services, Microsoft Azure | 5 | 1.62% |
| Google Cloud Platform / Google APIs, Microsoft Azure | 4 | 1.3% |

This result is not surprising because companies usually prefer to stick with one cloud provider.

How do you integrate with external services?

| Answer | No. | % |
|---|---|---|
| Using Bash / Python operator | 220 | 71.43% |
| Using existing, dedicated operators / hooks | 217 | 70.45% |
| Using own, custom operators / hooks | 216 | 70.13% |

We had some anecdotal evidence that people use more Python/Bash operators than the dedicated ones - but it looks like all ways of using Airflow to connect to external services are equally popular.

What can be improved

In your opinion, what could be improved in Airflow?

| Answer | No. | % |
|---|---|---|
| Scheduler performance | 189 | 61.36% |
| Web UI | 180 | 58.44% |
| Logging, monitoring and alerting | 145 | 47.08% |
| Examples, how-to, onboarding documentation | 143 | 46.43% |
| Technical documentation | 137 | 44.48% |
| Reliability | 112 | 36.36% |
| REST API | 96 | 31.17% |
| Authentication and authorization | 89 | 28.9% |
| External integration e.g. AWS, GCP, Apache product | 49 | 15.91% |
| CLI | 41 | 13.31% |
| I don’t know | 5 | 1.62% |

The results are fairly self-explanatory. Improved performance, a better UI, and more telemetry are desirable. But this should go hand in hand with improved documentation and resources about using Airflow, especially when we take into account the problem of onboarding new users.

Another interesting point from that question is that only 16% think that operators should be extended and improved. This suggests that we should focus on improving Airflow core instead of adding more and more integrations.

What would be the most interesting feature for you?

| Answer | No. | % |
|---|---|---|
| Production-ready Airflow docker image | 175 | 56.82% |
| Declarative way of writing DAGs / automated DAGs generation | 155 | 50.32% |
| Horizontal Autoscaling | 122 | 39.61% |
| Asynchronous Operators | 97 | 31.49% |
| Stateless web server | 81 | 26.3% |
| Knative Executor | 48 | 15.58% |
| I already have all I need | 13 | 4.22% |

The production Docker image wins, and it’s not a surprise. We all know that deploying Airflow is not a plug-and-play process, and that’s why the official image is being worked on by Jarek Potiuk. An unexpected result is that half of the users would like to have a declarative way of creating DAGs. That seems to go “against Airflow”, as we always emphasize the possibility of writing workflows in pure Python. Stories about DAG generators are not new, and they confirm that there’s a need for a way to declare DAGs.
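To make the idea of a declarative DAG concrete, here is a minimal sketch in plain Python. It is not Airflow code and not an existing DAG generator: the spec layout, `build_order`, and the task names are all hypothetical. It only shows how a declared set of tasks and dependencies can be validated and ordered, using the standard library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical declarative spec: each task lists the tasks it depends on.
spec = {
    "dag_id": "example_etl",
    "tasks": {
        "extract": {"depends_on": []},
        "transform": {"depends_on": ["extract"]},
        "load": {"depends_on": ["transform"]},
    },
}


def build_order(spec):
    """Validate the spec and return task names in a valid execution order."""
    tasks = spec["tasks"]
    for name, task in tasks.items():
        for upstream in task["depends_on"]:
            if upstream not in tasks:
                raise ValueError(f"{name} depends on unknown task {upstream}")
    # Map each task to the set of tasks that must run before it.
    graph = {name: set(task["depends_on"]) for name, task in tasks.items()}
    # static_order() raises graphlib.CycleError if the spec contains a cycle.
    return list(TopologicalSorter(graph).static_order())


order = build_order(spec)
print(order)
```

A real generator would go one step further and turn each declared task into an operator inside a DAG, but the declaration and validation step would look much the same.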

Data

If you think I missed something and you want to look for insights on your own, the data is available here:

In the processed data, multi-choice options are one-hot encoded. If you find any interesting insight, please update the article (open a PR against the Airflow site).