<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Apache Airflow – Use Cases</title>
<link>/use-cases/</link>
<description>Recent content in Use Cases on Apache Airflow</description>
<generator>Hugo -- gohugo.io</generator>
<atom:link href="/use-cases/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Use-Cases: Adobe</title>
<link>/use-cases/adobe/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/adobe/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;Modern big data platforms need sophisticated data pipelines that connect to many backend services to enable complex workflows. These workflows need to be deployed, monitored, and run either on regular schedules or triggered by external events. Adobe Experience Platform component services architected and built an orchestration service to enable their users to author, schedule, and monitor complex hierarchical (including sequential and parallel) workflows for Apache Spark (TM) and non-Spark jobs.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Adobe Experience Platform built an orchestration service to meet our user and customer requirements. It is architected around the guiding principle of leveraging an off-the-shelf, open-source orchestration engine that is abstracted behind an API for other services and extensible to any application through a pluggable framework. The Adobe Experience Platform orchestration service leverages the Apache Airflow execution engine for scheduling and executing various workflows. Apache Airflow is highly extensible, and with support for the Kubernetes Executor it can scale to meet our requirements. Its rich web UI provides various workflow-related insights. Airflow’s active community, which addresses issues and feature requests, made it additionally attractive to us.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Adobe Experience Platform uses Apache Airflow&amp;rsquo;s plugin interface to write custom operators that meet our use cases. With the Kubernetes Executor, we can scale it to run thousands of concurrent workflows. Adobe and Adobe Experience Platform teams can focus on business use cases because all scheduling, dependency management, and retry logic is offloaded to Apache Airflow.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Adyen</title>
<link>/use-cases/adyen/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/adyen/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;Many years ago we started out with our own orchestration framework. Given all the custom functionality we required, it made sense at the time. However, we quickly realized that creating an orchestration tool is not to be underestimated. With a quickly increasing number of users and teams, time spent fixing issues grew, severely limiting development speed. Furthermore, because it was not open source, we constantly had to make the effort ourselves to stay up to date with industry standards and tools. We needed a tool for our Big Data Platform to schedule and execute many ETL jobs while, at the same time, giving our users the ability to redo or undo their tasks.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Apache Airflow enabled us to extend the already existing operators and sensors to make writing ETL DAGs as easy as possible. Within a couple of minutes of training, data scientists are able to write their own DAGs containing an Apache Spark job and its corresponding dependencies. The web UI allows our data scientists to closely monitor the status and logs of the jobs so that they can quickly intervene if something is not going as planned. We created our own access groups such that teams have full privileges on their own DAGs and only read privileges on other teams&amp;rsquo; DAGs.&lt;/p&gt;
&lt;p&gt;One powerful feature of Apache Airflow is the ability to backfill. This is helpful when new tasks are introduced or old jobs need to be rerun. By creating our own plugin for Apache Airflow, we built a simple tool to streamline backfilling. Besides clearing the runs, it also clears the underlying data that was generated by the Spark job. Coming from Apache Airflow 1.10, this plugin required only minor changes to support Apache Airflow 2.0.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;We started out with a full team working just on our orchestration tool. With the help of Apache Airflow, we managed to give the responsibility of maintaining DAGs back to the data scientist teams. This allowed us to grow faster than ever, to 20 teams that together own approximately 200 DAGs and over 5000 tasks. In the meantime, our team has been able to extend Apache Airflow further while also focusing on getting other exciting new technologies on board. With Airflow, we now spend our time making progress instead of getting stuck fixing all sorts of issues.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Big Fish Games</title>
<link>/use-cases/big-fish-games/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/big-fish-games/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;The main challenge is the lack of a standardized ETL workflow orchestration tool. In-house PowerShell and Python-based ETL frameworks are currently used for scheduling and running analytical workloads. However, there is no web UI through which we can monitor these workflows, and maintaining the framework requires additional effort. These scheduled jobs, built around external dependencies, are not well suited to modern Big Data platforms and their complex workflows. Although we experimented with Apache Oozie for certain workflows, it did not handle failed jobs properly. When data arrives late, these tools are not flexible enough to enforce retries for failed jobs.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Apache Airflow helps us programmatically control our workflows in Python by setting task dependencies and monitoring tasks within each DAG in a web UI. Airflow lets us view detailed logs for each task in these complex workflows. It has built-in connectors for Hive, MySQL, the Google Cloud APIs, and others. It also gives us the flexibility to create our own custom connectors (for example, for a Netezza database) using the JdbcHook and JdbcOperator, or to extend existing operators such as the HiveOperator. For complex workflows, we can design ETLs in Airflow that run certain tasks only on weekdays. A powerful feature of Airflow is its support for backfilling data: when we add a new task to a DAG, we can backfill for that task alone. Airflow also allows us to set external DAG dependencies, alongside features such as the SqlSensor, which waits on a database table before running a specific task.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;We seek to run tasks concurrently within DAGs, and DAGs concurrently, using Apache Airflow, in the hope of completing our entire ETL workload faster. Airflow helps our analysts and developers focus on the analyses, rather than labor over building an ETL framework to schedule and monitor our applications. Airflow also facilitates a seamless ETL migration to the Google Cloud Platform (GCP), as GCP maintains Cloud Composer, a managed Apache Airflow service.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Business Operations</title>
<link>/use-cases/business_operations/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/business_operations/</guid>
<description>
&lt;div style=&#34;display: flex; justify-content: center; align-items: center;&#34;&gt;
&lt;h1 id=&#34;use-airflow-for-business-operations-pipelines&#34;&gt;Use Airflow for Business Operations pipelines&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Airflow can be the starting point for your business idea! For many companies, Airflow delivers the data that powers their core business applications. Whether you need to aggregate user data to power personalized recommendations, display analytics in a user-facing dashboard, or prepare the input data for an LLM, Airflow is the perfect orchestrator.&lt;/p&gt;
&lt;p&gt;This video shows an example of using Airflow to run the pipelines that power a customer-facing analytics dashboard. You can find the code shown in this example &lt;a href=&#34;https://github.com/astronomer/business-operations-structure-example&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;videoContainer&#34; style=&#34;display: flex; justify-content: center; align-items: center; border: 2px solid #ccc; width: 75%; margin: auto; padding: 20px;&#34;&gt;
&lt;a href=&#34;https://www.youtube.com/embed/2CEApKN0z1U?autoplay=1&#34;&gt;
&lt;img id=&#34;videoPlaceholder&#34; src=&#34;/usecase-video-placeholders/placeholder_business_ops_video.png&#34; style=&#34;cursor: pointer; width: 100%; max-width: 560px;&#34; alt=&#34;Click to play a one minute video showing the use case&#34; title=&#34;Click to play video&#34;/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;why-use-airflow-for-business-operations&#34;&gt;Why use Airflow for Business Operations?&lt;/h2&gt;
&lt;p&gt;Airflow is trusted and tested by many companies to deliver their data on time. Airflow is a popular choice to build your business upon, because it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool agnostic&lt;/strong&gt;: Using Airflow future-proofs your business, as it can be used to orchestrate actions in nearly any external tool or service. This means you can always switch to the newest and best tools, without needing to change your whole orchestration layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensible&lt;/strong&gt;: There are many Airflow modules available to connect to popular data tools, and you can write your own custom operators and hooks for specific use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic&lt;/strong&gt;: In Airflow you can define &lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html&#34;&gt;dynamic tasks&lt;/a&gt;, which serve as placeholders to adapt at runtime based on changing input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable&lt;/strong&gt;: Airflow can be scaled to handle virtually unlimited numbers of tasks and workflows, given enough computing power. If you choose Airflow, your business will be able to grow with it.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h2 id=&#34;airflow-features-for-business-operations&#34;&gt;Airflow features for Business Operations&lt;/h2&gt;
&lt;p&gt;Airflow has several key features that make it a great option for orchestrating business operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html&#34;&gt;&lt;strong&gt;Dynamic task mapping&lt;/strong&gt;&lt;/a&gt;: Oftentimes business operations are not static. You may design your pipelines to have one task per customer or report, and those lists will always be changing. Dynamic task mapping allows you to build flexibility into your pipelines, so they can adjust at runtime based on changing input.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html&#34;&gt;&lt;strong&gt;Datasets&lt;/strong&gt;&lt;/a&gt;: It is unlikely that you will have one team, much less one pipeline, responsible for all of the data that powers your business. Datasets allow you to make your pipelines event-based, scheduling them to run when all data prerequisites are available rather than at a specific time. With this type of scheduling, you can create smaller, more modular pipelines that can be managed by the team responsible for that data, making your operations more efficient and easier to manage.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/notifications.html&#34;&gt;&lt;strong&gt;Notifications&lt;/strong&gt;&lt;/a&gt;: When relying on an orchestrator to power your business applications, it&amp;rsquo;s critical that you know promptly when something goes wrong. Airflow has a suite of notifications available so you can send alerts to your system of preference.&lt;/li&gt;
&lt;/ul&gt;
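&lt;p&gt;As a rough illustration of the dynamic task mapping and Dataset features listed above, here is a minimal sketch, assuming Airflow 2.4 or later; the Dataset URI, customer list, and task names are hypothetical. The DAG runs only after upstream customer data has been updated and maps one report task per customer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch: Dataset scheduling plus dynamic task mapping.
# The Dataset URI, customer list, and task names are hypothetical.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

customer_data = Dataset('s3://example-bucket/customer_data.parquet')


@dag(schedule=[customer_data], start_date=datetime(2024, 1, 1), catchup=False)
def customer_reports():
    @task
    def list_customers():
        # In practice this would query your own customer store.
        return ['customer_a', 'customer_b', 'customer_c']

    @task
    def generate_report(customer: str):
        print(f'Generating report for {customer}')

    # One mapped task instance per customer, expanded at runtime.
    generate_report.expand(customer=list_customers())


customer_reports()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the customer list is produced at runtime, the number of mapped task instances adjusts automatically as customers are added or removed.&lt;/p&gt;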
</description>
</item>
<item>
<title>Use-Cases: Dish</title>
<link>/use-cases/dish/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/dish/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;We faced increasing complexity managing lengthy crontabs. Scheduling was an issue that required careful planning of timing due to resource constraints and usage patterns, and especially custom code for retry logic: in the latter case, we had to verify the success of previous jobs and/or steps before running the next. Furthermore, time to results is important, but in an effort to rely less on custom code and logic we were increasingly depending on processing buffers, where work effectively sat idle, not processing, while waiting for the next stage.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Relying on existing, community-built hooks and operators for the majority of the cloud services we use has allowed us to focus on business outcomes rather than operations.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Airflow addresses many of our pain points and lets us benefit from the overall ecosystem and community. We are able to reduce the end-to-end delivery time of data products by making our processing flows event-driven (in our first use, for example, we removed over 2 hours, on average, of waiting between stages). Furthermore, we can arrive at and iterate on products more quickly because we need fewer custom or roll-our-own solutions. Our code base is smaller and simpler, it is easier to follow, and to a large extent our DAGs serve as sufficient documentation for new contributors to understand what is going on.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: ETL/ELT</title>
<link>/use-cases/etl_analytics/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/etl_analytics/</guid>
<description>
&lt;div style=&#34;display: flex; justify-content: center; align-items: center;&#34;&gt;
&lt;h1 id=&#34;use-airflow-for-etlelt-pipelines&#34;&gt;Use Airflow for ETL/ELT pipelines&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) data pipelines are the most common use case for Apache Airflow. 90% of respondents in the 2023 Apache Airflow survey are using Airflow for ETL/ELT to power analytics use cases.&lt;/p&gt;
&lt;p&gt;The video below shows a simple ETL/ELT pipeline in Airflow that extracts climate data from a CSV file, as well as weather data from an API, runs transformations and then loads the results into a database to power a dashboard. You can find the code for this example &lt;a href=&#34;https://github.com/astronomer/airflow-quickstart&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;videoContainer&#34; style=&#34;display: flex; justify-content: center; align-items: center; border: 2px solid #ccc; width: 75%; margin: auto; padding: 20px;&#34;&gt;
&lt;a href=&#34;https://www.youtube.com/embed/ljBU_VyihVQ?autoplay=1&#34;&gt;
&lt;img id=&#34;videoPlaceholder&#34; src=&#34;/usecase-video-placeholders/placeholder_etl_video.png&#34; style=&#34;cursor: pointer; width: 100%; max-width: 560px;&#34; alt=&#34;Click to play a one minute video showing the use case&#34; title=&#34;Click to play video&#34;/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;why-use-airflow-for-etlelt-pipelines&#34;&gt;Why use Airflow for ETL/ELT pipelines?&lt;/h2&gt;
&lt;p&gt;Airflow is the de facto standard for defining ETL/ELT pipelines as Python code. Airflow is popular for this use case because it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool agnostic&lt;/strong&gt;: Airflow can be used to orchestrate ETL/ELT pipelines for any data source or destination.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensible&lt;/strong&gt;: There are many Airflow modules available to connect to any data source or destination, and you can write your own custom operators and hooks for specific use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic&lt;/strong&gt;: In Airflow you can define &lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html&#34;&gt;dynamic tasks&lt;/a&gt;, which serve as placeholders to adapt at runtime based on changing input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable&lt;/strong&gt;: Airflow can be scaled to handle virtually unlimited numbers of tasks and workflows, given enough computing power.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;airflow-features-for-etlelt-pipelines&#34;&gt;Airflow features for ETL/ELT pipelines&lt;/h2&gt;
&lt;p&gt;Airflow has several key features that make it a great option for ETL/ELT:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html&#34;&gt;Datasets&lt;/a&gt;&lt;/strong&gt;: In Airflow you can schedule your DAGs in a data-driven way, based on updates to Datasets from any other task in your Airflow instance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html&#34;&gt;Object Storage&lt;/a&gt;&lt;/strong&gt;: The Airflow Object Storage is an abstraction over the &lt;a href=&#34;https://docs.python.org/3/library/pathlib.html&#34;&gt;Path API&lt;/a&gt; that simplifies interaction with object storage systems such as Amazon S3, Google Cloud Storage, and Azure Blob Storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow-providers/index.html&#34;&gt;Airflow providers&lt;/a&gt;&lt;/strong&gt;: Airflow providers extend core Airflow functionality with additional modules to simplify integration with popular data tools. You can find a list of active providers &lt;a href=&#34;https://airflow.apache.org/docs/#active-providers&#34;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
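&lt;p&gt;To illustrate how these features can combine, here is a minimal sketch, assuming Airflow 2.8+ with the Amazon provider installed; the bucket names and the &lt;code&gt;aws_default&lt;/code&gt; connection are hypothetical. A transform DAG is scheduled on a Dataset updated by an upstream extract DAG and reads the raw file through the Object Storage abstraction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch: Dataset-scheduled transform step reading via Object Storage.
# Bucket names and the connection ID are hypothetical placeholders.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

raw_weather = Dataset('s3://example-raw/weather.csv')  # updated by an upstream extract DAG
base = ObjectStoragePath('s3://example-raw/', conn_id='aws_default')


@dag(schedule=[raw_weather], start_date=datetime(2024, 1, 1), catchup=False)
def transform_weather():
    @task
    def transform():
        # Read the raw file through the Path-like Object Storage API.
        with (base / 'weather.csv').open('r') as f:
            rows = [line.strip().split(',') for line in f]
        # ... transform the rows and load them into your warehouse here ...
        print(f'Read {len(rows)} rows')

    transform()


transform_weather()
&lt;/code&gt;&lt;/pre&gt;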
</description>
</item>
<item>
<title>Use-Cases: Experity</title>
<link>/use-cases/experity/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/experity/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;We had to deploy our complex, flagship app to multiple nodes in multiple ways. This required tasks to communicate across Windows nodes and coordinate timing perfectly. We did not want to buy an expensive enterprise scheduling tool and needed ultimate flexibility.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Ultimately we decided that flexible, multi-node, DAG-capable tooling was key, and Airflow was one of the few tools that fit the bill. Its open-source foundation and Python code base were large factors that upheld our core principles. At the time, Airflow was missing a Windows hook and operator, so we contributed the WinRM hook and operator back to the community. Given its flexibility, we also use DAG generators so that our metadata drives our DAGs, which keeps maintenance costs down.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;We have a very flexible deployment framework that allows us to be as nimble as possible. The reliability is something we have grown to trust as long as we use the tool correctly. The scalability has also allowed us to decrease the time it takes to operate on our fleet of servers.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Infrastructure Management</title>
<link>/use-cases/infrastructure-management/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/infrastructure-management/</guid>
<description>
&lt;div style=&#34;display: flex; justify-content: center; align-items: center;&#34;&gt;
&lt;h1 id=&#34;use-airflow-for-infrastructure-management&#34;&gt;Use Airflow for Infrastructure Management&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Airflow can interact with any API, which makes it a great tool for managing your infrastructure, such as Kubernetes or Spark clusters running in any cloud. As of Airflow 2.7, the setup/teardown feature is available: a special type of task with intelligent behavior to spin up and tear down infrastructure exactly when you need it.&lt;/p&gt;
&lt;p&gt;Infrastructure management is often needed within the context of other use cases, such as MLOps, or implementing data quality checks. This video shows an example of how it might be used for an MLOps pipeline. You can find the code shown in this example &lt;a href=&#34;https://github.com/astronomer/use-case-setup-teardown-data-quality&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;videoContainer&#34; style=&#34;display: flex; justify-content: center; align-items: center; border: 2px solid #ccc; width: 75%; margin: auto; padding: 20px;&#34;&gt;
&lt;a href=&#34;https://www.youtube.com/embed/JkURWnl76GQ?autoplay=1&#34;&gt;
&lt;img id=&#34;videoPlaceholder&#34; src=&#34;/usecase-video-placeholders/placeholder_infra_video.png&#34; style=&#34;cursor: pointer; width: 100%; max-width: 560px;&#34; alt=&#34;Click to play a one minute video showing the use case&#34; title=&#34;Click to play video&#34;/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;why-use-airflow-for-infrastructure-management&#34;&gt;Why use Airflow for Infrastructure Management?&lt;/h2&gt;
&lt;p&gt;Airflow is a popular choice for pipelines that require managing infrastructure because it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python native&lt;/strong&gt;: Pipelines as Python code make it easy to turn custom functions into tasks. Any logic you need to manage your infrastructure, you can implement in Airflow with Python.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensible&lt;/strong&gt;: Infrastructure management is needed for many use cases, including MLOps, data quality checks, and more. Airflow&amp;rsquo;s flexibility and wide array of providers makes it suitable for any use case that you may need to implement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable&lt;/strong&gt;: Airflow can be scaled to handle virtually unlimited numbers of tasks and workflows, given enough computing power. If you choose Airflow, your business will be able to grow with it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;airflow-features-for-infrastructure-management&#34;&gt;Airflow features for Infrastructure Management&lt;/h2&gt;
&lt;p&gt;Airflow 2.7 introduced a key new feature that makes it an even better option for managing infrastructure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/howto/setup-and-teardown.html&#34;&gt;&lt;strong&gt;Setup/teardown tasks&lt;/strong&gt;&lt;/a&gt;: Setup/teardown tasks are a special type of task that can be used to manage the infrastructure needed to run other tasks. They have special behavior to support the pattern of setting up resources and configuration (e.g. a Spark cluster or other compute resources) before a task runs, and then tearing down that infrastructure after the task has completed, even if the task fails.&lt;/li&gt;
&lt;/ul&gt;
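&lt;p&gt;The following minimal sketch illustrates the setup/teardown pattern, assuming Airflow 2.7 or later; &lt;code&gt;create_cluster&lt;/code&gt;, &lt;code&gt;run_job&lt;/code&gt;, and &lt;code&gt;delete_cluster&lt;/code&gt; are hypothetical placeholders for real infrastructure calls:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch of the setup/teardown pattern (Airflow 2.7+).
# create_cluster, run_job, and delete_cluster are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def ephemeral_cluster():
    @task
    def create_cluster():
        print('spin up a transient Spark cluster')
        return 'cluster-123'

    @task
    def run_job(cluster_id: str):
        print(f'run the job on {cluster_id}')

    @task
    def delete_cluster():
        print('tear down the cluster')

    cluster = create_cluster()
    # The teardown runs even if run_job fails, so the cluster is always removed.
    run_job(cluster) &gt;&gt; delete_cluster().as_teardown(setups=cluster)


ephemeral_cluster()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Marking the final task as a teardown of the cluster-creation setup task ensures the infrastructure is torn down even when the job in between fails.&lt;/p&gt;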
</description>
</item>
<item>
<title>Use-Cases: MLOps</title>
<link>/use-cases/mlops/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/mlops/</guid>
<description>
&lt;div style=&#34;display: flex; justify-content: center; align-items: center;&#34;&gt;
&lt;h1 id=&#34;use-airflow-for-machine-learning-operations-mlops&#34;&gt;Use Airflow for Machine Learning Operations (MLOps)&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Machine Learning Operations (MLOps) is a broad term encompassing everything needed to run machine learning models in production. MLOps is a rapidly evolving field with many different best practices and behavioral patterns, and Apache Airflow provides tool-agnostic orchestration capabilities for all steps. An emerging subset of MLOps is Large Language Model Operations (LLMOps), which focuses on developing pipelines around applications of large language models like GPT-4 or Command.&lt;/p&gt;
&lt;p&gt;The following video shows an example of using Airflow and Weaviate to create an automated retrieval-augmented generation (RAG) pipeline that ingests and embeds data from news articles and provides trading advice. You can find the code shown in this example &lt;a href=&#34;https://github.com/astronomer/use-case-airflow-llm-rag-finance&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;videoContainer&#34; style=&#34;display: flex; justify-content: center; align-items: center; border: 2px solid #ccc; width: 75%; margin: auto; padding: 20px;&#34;&gt;
&lt;a href=&#34;https://www.youtube.com/embed/QcBdh_n4es4?autoplay=1&#34;&gt;
&lt;img id=&#34;videoPlaceholder&#34; src=&#34;/usecase-video-placeholders/placeholder_mlops_video.png&#34; style=&#34;cursor: pointer; width: 100%; max-width: 560px;&#34; alt=&#34;Click to play a one minute video showing the use case&#34; title=&#34;Click to play video&#34;/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;why-use-airflow-for-mlops&#34;&gt;Why use Airflow for MLOps?&lt;/h2&gt;
&lt;p&gt;Airflow is a popular choice for orchestrating MLOps workflows because it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python native&lt;/strong&gt;: You use Python code to define Airflow pipelines, which makes it easy to integrate the most popular machine learning tools and embed your ML operations in a best practice CI/CD workflow. By using the decorators of the TaskFlow API you can turn existing scripts into Airflow tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensible&lt;/strong&gt;: Airflow itself is written in Python, which makes it extensible with &lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html&#34;&gt;custom modules&lt;/a&gt; and &lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/plugins.html&#34;&gt;Airflow plugins&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data agnostic&lt;/strong&gt;: Airflow is data agnostic, which means it can be used to orchestrate any data pipeline, regardless of the data format or storage solution. You can plug in any new data storage, such as the latest vector database or your favorite RDBMS, with minimal effort.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;airflow-features-for-mlops&#34;&gt;Airflow features for MLOps&lt;/h2&gt;
&lt;p&gt;Airflow has several key features that make it a great option for orchestrating MLOps workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Monitoring and alerting&lt;/strong&gt;: Airflow comes with production-ready monitoring and alerting modules like Airflow notifiers, extensive logging features, and Airflow listeners. They enable you to have fine-grained control over how you monitor your ML operations and how Airflow alerts you if something goes wrong.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Features for day 2 ops&lt;/strong&gt;: Simple features like automatic retries, complex dependencies and branching logic, as well as the option to make pipelines dynamic make a big difference when orchestrating MLOps pipelines. Airflow has all of these built-in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://airflow.apache.org/docs/apache-airflow-providers/index.html&#34;&gt;Airflow providers&lt;/a&gt;&lt;/strong&gt;: Airflow providers extend core Airflow functionality with additional modules to simplify integration with popular data tools, including many popular MLOps tools. You can find a list of active providers &lt;a href=&#34;https://airflow.apache.org/docs/#active-providers&#34;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
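&lt;p&gt;As a minimal illustration of these day 2 features, the sketch below adds automatic retries and a failure alert to a training task; the &lt;code&gt;notify_failure&lt;/code&gt; callback is a hypothetical stand-in for an Airflow notifier from your provider of choice (Slack, email, and so on):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch: retries plus a failure callback on an ML training task.
# notify_failure is a hypothetical stand-in for a provider notifier.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


def notify_failure(context):
    # Replace with a provider notifier (for example Slack or SMTP) in a real setup.
    ti = context['task_instance']
    print(f'ALERT: task {ti.task_id} failed in DAG {ti.dag_id}')


@dag(
    schedule='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={'retries': 2, 'retry_delay': timedelta(minutes=5)},
)
def train_model():
    @task(on_failure_callback=notify_failure)
    def train():
        # ... fit, evaluate, and register your model here ...
        print('training run complete')

    train()


train_model()
&lt;/code&gt;&lt;/pre&gt;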
</description>
</item>
<item>
<title>Use-Cases: Onefootball</title>
<link>/use-cases/onefootball/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/onefootball/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;With millions of daily active users, managing the complexity of data engineering at Onefootball is a constant challenge. Lengthy crontabs, a multiplication of custom API clients, erosion of confidence in the analytics served, and increasing heroism (&amp;ldquo;only one person can solve this issue&amp;rdquo;): these are the challenges that most teams face unless they consciously invest in their tools and processes.&lt;/p&gt;
&lt;p&gt;On top of that, new data tools appear each month: third-party data sources, cloud provider solutions, different storage technologies&amp;hellip; Managing all those integrations is costly and brittle, especially for small data engineering teams that are trying to do more with less.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Airflow had been on our radar for a while until one day we took the leap. We used the DAG paradigm to migrate the pipelines running on crontabs. We benefited from the community hooks and operators to remove parts of our code and to refactor the API clients specific to our business. We use the alerts, SLAs, and the web UI to regain confidence in our analytics. We use our internal Airflow PRs as catalysts for team discussion and to challenge our technical designs.&lt;/p&gt;
&lt;p&gt;We have DAGs orchestrating SQL transformations in our data warehouse, but also DAGs orchestrating functions run against our Kubernetes cluster, both for training machine learning models and for sending daily analytics emails.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;The learning curve was steep, but in about 100 days we were able to use Airflow efficiently to manage the complexity of our data engineering. We currently have 17 DAGs (adding on average 1 per week), we have made 2 contributions to apache/airflow, and we have 7 internal hooks and operators, with plans to add more as our migration efforts continue.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Plarium Krasnodar</title>
<link>/use-cases/plarium-krasnodar/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/plarium-krasnodar/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;Our Research &amp;amp; Development department carries out various experiments, and in all of them, we need to create workflow orchestrations for solving tasks in game dev. Previously, we didn&amp;rsquo;t have any suitable tools with a sufficient number of built-in functions, and we had to orchestrate processes manually and entirely from scratch every time. This led to difficulties with dependencies and monitoring when building complex workflows. We needed a tool that would provide a more centralized approach so that we could see all the logs, the number of retries, and the task performance time. The most important thing that we lacked was the ability to backfill historical data and restart failed tasks.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Apache Airflow offers lots of convenient built-in solutions, including integrations with other tools. The DAG model helps us avoid errors and follow general patterns when building workflows. In addition, the platform has a large community where we can find plenty of sensors and operators that cover 90% of our cases. This saves us a great deal of time.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Thanks to Apache Airflow, we&amp;rsquo;ve managed to simplify the process of building complex workflows. Many procedures that are so important for game development, such as working with the churn rate, processing messages to the support team, and sorting bank offers, now run efficiently, and all issues are resolved centrally. In addition, Apache Airflow is widely used in the industry, allowing us to onboard new people to our team more quickly and smoothly.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: RancherBySUSE</title>
<link>/use-cases/suse/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/suse/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;Our aim was to build, package, test, and distribute curated and trusted containers at scale in an automated way. Those containers can be of any nature, meaning that we need a solution that allows us to build any kind of software with any kind of build tooling, such as Maven, Rust, Java, Ant, or Go.&lt;/p&gt;
&lt;p&gt;The construction of these containers requires the installation of several libraries (which may even conflict) and the orchestration of complex workflows with several integrations, executed either on a scheduled basis or triggered by events from external systems.&lt;/p&gt;
&lt;p&gt;Finally, our build pipeline must be triggered by the release of sources upstream. This means that we need to trigger our pipeline whenever a new version is released by the owner of the software.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Apache Airflow has proven to be the perfect solution for implementing and controlling our pipelines. Its capability to orchestrate complex workflows programmatically and monitor their execution is complemented by a comprehensive graphical interface and detailed logs view.&lt;/p&gt;
&lt;p&gt;Being extendable with a high-level language like Python has allowed us to customize our workflows as code with incredible flexibility and quality. Apache Airflow has enabled us to dynamically create and execute tasks derived from external sources, scheduling them to run in batches, thus reliably executing large-scale processes.&lt;/p&gt;
&lt;p&gt;Apache Airflow also allows the execution of dependent tasks across nodes of different natures. This helped us to orchestrate the steps to build each container on the appropriate worker node. It offers multiple pre-built functionalities to facilitate integrations with external APIs, notifying events via Slack or e-mail as they occur. Its ability to isolate task execution allows us to scale, sparing us the need to worry about low-level details. Its complete REST API has allowed us to trigger workflows through events produced by external sources.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Thanks to Apache Airflow, we have been able to automate the lifecycle for creating our collections of containers in record time. We can execute concurrent processes much faster and more reliably, controlling aspects such as upstream failure handling and task-level concurrency through straightforward configuration.&lt;/p&gt;
&lt;hr&gt;
</description>
</item>
<item>
<title>Use-Cases: Seniorlink</title>
<link>/use-cases/seniorlink/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/seniorlink/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;Here at Seniorlink, we provide services, support, and technology that engages family caregivers. One of our focuses is using data to bolster our knowledge and improve the experience of our users. Like many looking to build an effective data stack, we adopted a Python, Spark, Redshift, and Tableau core toolset.&lt;/p&gt;
&lt;p&gt;We had built a robust stack of batch processes to deliver value to the business, deploying these data services in AWS using a mixture of EMR, ECS, Lambda, and EC2. Moving fast, as many new endeavors do, we ultimately ended up with one monolithic batch process with many smaller satellite jobs. Given the scale and quantity of jobs, we began to lose transparency as to what was happening. Additionally, many jobs were launched in a single EMR cluster and were so tightly coupled that a failure in one job required recomputing all the jobs run on that cluster. These behaviors are highly inefficient and difficult to debug, and they result in long iteration periods given the duration of these batch jobs.&lt;/p&gt;
&lt;p&gt;We were beginning to lose precious time manually managing the schedules via AWS Data Pipeline, AWS Lambda, and ECS tasks. Much of our development effort was spent waiting for the monolith to finish running in order to examine a smaller job within it. Our best chance at keeping system transparency was active documentation in our internal wiki.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Airflow gave us a way to orchestrate our disparate tools into a single place. Instead of dealing with multiple schedules, we have a straightforward UI to consider. We gained a great deal of transparency, being able to monitor the status of tasks, re-run or restart tasks from any given point in a workflow, and manage the dependencies between jobs using DAGs. We were able to decouple our monolith and schedule the resulting smaller tasks confidently.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Airflow increased the visibility of our batch processes through the use of DAGs and the UI. Our end-to-end run time decreased by 20%, thanks to our ability to decouple our monolithic batch jobs into several smaller ones. Our development and debugging time was reduced by our ability to manage and isolate all our tasks. We were able to merge our diverse toolset into a more central location. Lastly, with its broad adoption, we were able to quickly push this new framework to our production environment.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Sift</title>
<link>/use-cases/sift/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/sift/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;At Sift, we’re constantly training machine learning models that feed into the core of Sift’s Digital Trust &amp;amp; Safety platform. The platform gives our customers a way to discern suspicious online behavior from trustworthy behavior, allowing our customers to protect their online transactions, maintain the integrity of their content platforms, and keep their users’ accounts secure. To make this possible, we’ve built model training pipelines that consist of hundreds of steps in MapReduce and Spark, with complex requirements between them.&lt;/p&gt;
&lt;p&gt;When we built these workflows, we found that we needed a centralized way to organize the interactions between the many steps in each workflow. But before Airflow, we didn’t have an easy way to express those dependencies. And as we added steps to the workflows, it became increasingly difficult to coordinate their dependencies and keep ML experiments in sync.&lt;/p&gt;
&lt;p&gt;It soon became clear that we needed a way to orchestrate both the scheduled execution of our jobs and the dependencies between steps of not only a single workflow, but of multiple workflows. We needed a way to dynamically create several experimental ML workflows at once that could each have their own code, dependencies, and tasks. Additionally, we needed a way to be able to monitor the status of tasks, and re-run or restart tasks from any given point in a workflow with ease.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;Airflow makes it easy to clearly define the interactions between various jobs, expanding the scope of what we can do in our model training pipelines. We now have the ability to schedule and coordinate all jobs while managing the dependencies between them using DAGs. Each of our main workflows, including our model training pipeline and ETL pipelines, has its own DAG code that manages its tasks’ dependencies and the execution schedule for the pipeline. We even define dependencies between separate DAGs by using Airflow’s ExternalTaskSensor. This allows our DAGs to actually depend on each other and keep each one of them focused and compact in its scope.&lt;/p&gt;
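&lt;p&gt;A minimal, hypothetical sketch of such a cross-DAG dependency (the DAG and task IDs are placeholders) could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch: one DAG waiting on a task in another DAG via ExternalTaskSensor.
# DAG and task IDs are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.sensors.external_task import ExternalTaskSensor


@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def model_training():
    # Wait for the feature-engineering DAG to publish features for the same
    # logical date before training starts.
    wait_for_features = ExternalTaskSensor(
        task_id='wait_for_features',
        external_dag_id='feature_engineering',
        external_task_id='publish_features',
    )

    @task
    def train_model():
        print('train the model on the published features')

    wait_for_features &gt;&gt; train_model()


model_training()
&lt;/code&gt;&lt;/pre&gt;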
&lt;p&gt;As part of our custom Airflow setup, we’ve built out a separate Airflow ecosystem for short-lived experimental DAGs as well, so that we can test changes to our jobs or run separate model training pipelines in isolation. Using deployment scripts that edit our DAGs when we upload them to Airflow, the same code that powers an existing DAG can be deployed in a separate, isolated environment with experimental edits. This means that each experiment can have its own isolated code, running in parallel with other pipelines, without accidentally touching others’ jobs or dependencies.&lt;/p&gt;
&lt;p&gt;Finally, Airflow has given us the ability to manage our tasks’ successes and failures through its user interface. Airflow allows us to track our tasks’ failures, duration, history, and logs in one central UI, and that same UI also allows us to easily retry single tasks, branches of a DAG, or entire DAGs.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;Airflow initially gave us a way to solve our existing problems: we used Airflow to replace rigid crons with well-defined DAG dependencies, to build isolated ML experiments using short-lived DAGs, and to track our pipelines’ successes and failures.&lt;/p&gt;
&lt;p&gt;But even after that, Airflow helped us to grow beyond those initial challenges, and expanded the scope of what we could feasibly tackle. Airflow not only made it easier to manage our ever-expanding ML pipelines, but also allowed us to create entirely new pipelines, ranging from workflows that back up our production data to complex ETL pipelines that transform data into experimentation-ready formats.&lt;/p&gt;
&lt;p&gt;Airflow also allowed us to support a more diverse toolset. Shell scripts, Java, Python, Jupyter notebooks, and more - all of these can be managed from an Airflow DAG, allowing developers to utilize our data to test new ideas, generate insights, and improve our models with ease.&lt;/p&gt;
</description>
</item>
<item>
<title>Use-Cases: Snapp</title>
<link>/use-cases/snapp/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/use-cases/snapp/</guid>
<description>
&lt;h5 id=&#34;what-was-the-problem&#34;&gt;What was the problem?&lt;/h5&gt;
&lt;p&gt;As the Map team at Snapp, one of the largest and fastest-growing internet companies in the Middle East, we have experienced significant growth over the past couple of years, expanding from a team of 7 to a team of 60. However, with this growth came the realization that some of our crucial tasks were being performed manually. This manual approach consumed valuable time and hindered our ability to execute these tasks efficiently.&lt;/p&gt;
&lt;p&gt;To address this challenge and streamline our operations, we recognized the need for an orchestration tool that could automate these tasks, saving time and energy while increasing the reliability and monitoring of our runs. After conducting thorough research and evaluating various options, we ultimately decided to implement Airflow. Airflow is widely regarded as the leading open-source platform for task orchestration, making it an ideal choice for the diverse use cases of our Map team.&lt;/p&gt;
&lt;p&gt;By leveraging Airflow, we aim to automate our critical tasks, enabling us to execute them more efficiently and effectively. This automation will not only enhance our productivity but also provide us with greater control and visibility over our workflows. With Airflow&amp;rsquo;s robust features and flexibility, we are confident that it will significantly improve our team&amp;rsquo;s performance and contribute to the continued success of Snapp.&lt;/p&gt;
&lt;h5 id=&#34;how-did-apache-airflow-help-to-solve-this-problem&#34;&gt;How did Apache Airflow help to solve this problem?&lt;/h5&gt;
&lt;p&gt;After implementing Apache Airflow on our cloud platform, specifically utilizing the KubernetesExecutor, we experienced a significant improvement in our task management capabilities. With Airflow, each sub-team within the Map team was able to create and manage their own DAGs, automating various tasks seamlessly. This included essential procedures such as data updates, model training pipelines, and project deployments, leveraging the SparkKubernetesOperator and other relevant tools.&lt;/p&gt;
&lt;p&gt;One notable example of Airflow&amp;rsquo;s impact was the creation of a DAG specifically designed to update the traffic congestion colorization for our streets. This DAG runs every 10 minutes, ensuring that our congestion data remains up-to-date and accurate. The intuitive Airflow UI also proved to be invaluable, as it enabled our non-technical teammates to easily work with DAGs and monitor their progress.&lt;/p&gt;
&lt;p&gt;By utilizing Airflow, we have not only automated our tasks but also improved collaboration and efficiency within our team. The ability to manage and monitor workflows through Airflow has significantly reduced manual effort and increased reliability. We are now able to focus more on analyzing and utilizing the data rather than spending time on repetitive and time-consuming manual tasks. Overall, Apache Airflow has proven to be an indispensable tool for our Map team, enabling us to streamline our operations and achieve greater productivity.&lt;/p&gt;
&lt;h5 id=&#34;what-are-the-results&#34;&gt;What are the results?&lt;/h5&gt;
&lt;p&gt;The implementation of Airflow has yielded significant results for our team. By automating and scheduling various tasks, ranging from data-related operations to deployments and data updates for the Map, we have successfully saved approximately 40 hours of manual work per week. This substantial time savings has allowed our team members to focus on more strategic and value-added activities, ultimately enhancing our overall productivity.&lt;/p&gt;
&lt;p&gt;Furthermore, Airflow&amp;rsquo;s intuitive UI has enhanced visibility into our workflows. We can easily check DAG and task logs through the Airflow UI, enabling us to monitor the progress and performance of our tasks effectively. This improved visibility has not only increased our confidence in the reliability of our processes but has also facilitated troubleshooting and issue resolution, leading to smoother operations and reduced downtime.&lt;/p&gt;
&lt;p&gt;Overall, the results of implementing Airflow have been highly beneficial for our team. The significant reduction in manual work hours has freed up valuable time and resources, allowing us to allocate them towards more critical tasks. Additionally, the improved visibility and monitoring capabilities offered by Airflow have enhanced our operational efficiency and reliability. We are extremely pleased with the positive impact Airflow has had on our team&amp;rsquo;s productivity and look forward to leveraging its capabilities further in the future.&lt;/p&gt;
</description>
</item>
</channel>
</rss>