blob: 16a5b769ec825384511f9bb404da3583538a898c [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _howto/operator:DatabricksWorkflowTaskGroup:
DatabricksWorkflowTaskGroup
===========================
Use the :class:`~airflow.providers.databricks.operators.databricks_workflow.DatabricksWorkflowTaskGroup` to launch and monitor
Databricks notebook job runs as Airflow tasks. The task group launches a `Databricks Workflow <https://docs.databricks.com/en/workflows/index.html/>`_ and runs the notebook jobs from within it, resulting in a `75% cost reduction <https://www.databricks.com/product/pricing>`_ ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute) when compared to executing ``DatabricksNotebookOperator`` outside of ``DatabricksWorkflowTaskGroup``.
There are a few advantages to defining your Databricks Workflows in Airflow:
======================================= ============================================= =================================
Authoring interface via Databricks (Web-based with Databricks UI) via Airflow(Code with Airflow Dag)
======================================= ============================================= =================================
Workflow compute pricing ✅ ✅
Notebook code in source control ✅ ✅
Workflow structure in source control ✅ ✅
Retry from beginning ✅ ✅
Retry single task ✅ ✅
Task groups within Workflows ✅
Trigger workflows from other Dags ✅
Workflow-level parameters ✅
======================================= ============================================= =================================
Examples
--------
Example of what a Dag looks like with a DatabricksWorkflowTaskGroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. exampleinclude:: /../../databricks/tests/system/databricks/example_databricks_workflow.py
:language: python
:start-after: [START howto_databricks_workflow_notebook]
:end-before: [END howto_databricks_workflow_notebook]
With this example, Airflow will produce a job named ``<dag_name>.test_workflow_<USER>_<GROUP_ID>`` that will
run task ``notebook_1`` and then ``notebook_2``. The job will be created in the databricks workspace
if it does not already exist. If the job already exists, it will be updated to match
the workflow defined in the Dag.
The following image displays the resulting Databricks Workflow in the Airflow UI (based on the above example provided)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../img/databricks_workflow_task_group_airflow_graph_view.png
The corresponding Databricks Workflow in the Databricks UI for the run triggered from the Airflow Dag is depicted below
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../img/workflow_run_databricks_graph_view.png
To minimize update conflicts, we recommend that you keep parameters in the ``notebook_params`` of the
``DatabricksWorkflowTaskGroup`` and not in the ``DatabricksNotebookOperator`` whenever possible.
This is because, tasks in the ``DatabricksWorkflowTaskGroup`` are passed in on the job trigger time and
do not modify the job definition.