| |
| |
| |
| |
| <!DOCTYPE html> |
| <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> |
| <head> |
| <meta charset="utf-8"> |
| |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <title>Tutorial — Airflow Documentation</title> |
| |
| |
| |
| |
| |
| |
| |
| |
| <script type="text/javascript" src="_static/js/modernizr.min.js"></script> |
| |
| |
| <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script> |
| <script type="text/javascript" src="_static/jquery.js"></script> |
| <script type="text/javascript" src="_static/underscore.js"></script> |
| <script type="text/javascript" src="_static/doctools.js"></script> |
| <script type="text/javascript" src="_static/language_data.js"></script> |
| |
| <script type="text/javascript" src="_static/js/theme.js"></script> |
| |
| |
| |
| |
| <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> |
| <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> |
| <link rel="index" title="Index" href="genindex.html" /> |
| <link rel="search" title="Search" href="search.html" /> |
| <link rel="next" title="How-to Guides" href="howto/index.html" /> |
| <link rel="prev" title="Installation" href="installation.html" /> |
| |
| <script> |
| document.addEventListener('DOMContentLoaded', function() { |
| var el = document.getElementById('changelog'); |
| if (el !== null ) { |
| // [AIRFLOW-...] |
| el.innerHTML = el.innerHTML.replace( |
| /\[(AIRFLOW-[\d]+)\]/g, |
| `<a href="https://issues.apache.org/jira/browse/$1">[$1]</a>` |
| ); |
| // (#...) |
| el.innerHTML = el.innerHTML.replace( |
| /\(#([\d]+)\)/g, |
| `<a href="https://github.com/apache/airflow/pull/$1">(#$1)</a>` |
| ); |
| }; |
| }) |
| </script> |
| <style> |
| .example-header { |
| position: relative; |
| background: #9AAA7A; |
| padding: 8px 16px; |
| margin-bottom: 0; |
| } |
| .example-header--with-button { |
| padding-right: 166px; |
| } |
| .example-header:after{ |
| content: ''; |
| display: table; |
| clear: both; |
| } |
| .example-title { |
| display:block; |
| padding: 4px; |
| margin-right: 16px; |
| color: white; |
| overflow-x: auto; |
| } |
| .example-header-button { |
| top: 8px; |
| right: 16px; |
| position: absolute; |
| } |
| .example-header + .highlight-python { |
| margin-top: 0 !important; |
| } |
| .viewcode-button { |
| display: inline-block; |
| padding: 8px 16px; |
| border: 0; |
| margin: 0; |
| outline: 0; |
| border-radius: 2px; |
| -webkit-box-shadow: 0 3px 5px 0 rgba(0,0,0,.3); |
| box-shadow: 0 3px 6px 0 rgba(0,0,0,.3); |
| color: #404040; |
| background-color: #e7e7e7; |
| cursor: pointer; |
| font-size: 16px; |
| font-weight: 500; |
| line-height: 1; |
| text-decoration: none; |
| text-overflow: ellipsis; |
| overflow: hidden; |
| text-transform: uppercase; |
| -webkit-transition: background-color .2s; |
| transition: background-color .2s; |
| vertical-align: middle; |
| white-space: nowrap; |
| } |
| .viewcode-button:visited { |
| color: #404040; |
| } |
| .viewcode-button:hover, .viewcode-button:focus { |
| color: #404040; |
| background-color: #d6d6d6; |
| } |
| </style> |
| <link rel="canonical" href="https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html" /> |
| </head> |
| |
| |
| <body class="wy-body-for-nav"> |
| |
| |
| <div class="wy-grid-for-nav"> |
| |
| <nav data-toggle="wy-nav-shift" class="wy-nav-side"> |
| <div class="wy-side-scroll"> |
| <div class="wy-side-nav-search" > |
| |
| |
| |
| <a href="index.html" class="icon icon-home"> Airflow |
| |
| |
| |
| </a> |
| |
| |
| |
| |
| <div class="version"> |
| 1.10.4 |
| </div> |
| |
| |
| |
| |
| <div role="search"> |
| <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> |
| <input type="text" name="q" placeholder="Search docs" /> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </form> |
| </div> |
| |
| |
| </div> |
| |
| <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> |
| |
| |
| |
| |
| |
| |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="start.html">Quick Start</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li> |
| <li class="toctree-l1 current"><a class="current reference internal" href="#">Tutorial</a><ul> |
| <li class="toctree-l2"><a class="reference internal" href="#example-pipeline-definition">Example Pipeline definition</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#it-s-a-dag-definition-file">It’s a DAG definition file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#importing-modules">Importing Modules</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#default-arguments">Default Arguments</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#instantiate-a-dag">Instantiate a DAG</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#tasks">Tasks</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#templating-with-jinja">Templating with Jinja</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#setting-up-dependencies">Setting up Dependencies</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#recap">Recap</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#testing">Testing</a><ul> |
| <li class="toctree-l3"><a class="reference internal" href="#running-the-script">Running the Script</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#command-line-metadata-validation">Command Line Metadata Validation</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#id1">Testing</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#backfill">Backfill</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="howto/index.html">How-to Guides</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="ui.html">UI / Screenshots</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="concepts.html">Concepts</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="profiling.html">Data Profiling</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="cli.html">Command Line Interface</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="scheduler.html">Scheduling & Triggers</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="plugins.html">Plugins</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="security.html">Security</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="timezone.html">Time zones</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="api.html">Experimental Rest API</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="integration.html">Integration</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="metrics.html">Metrics</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="kubernetes.html">Kubernetes</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="lineage.html">Lineage</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="changelog.html">Changelog</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="faq.html">FAQ</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="macros.html">Macros reference</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="_api/index.html">API Reference</a></li> |
| </ul> |
| |
| |
| |
| </div> |
| </div> |
| </nav> |
| |
| <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> |
| |
| |
| <nav class="wy-nav-top" aria-label="top navigation"> |
| |
| <i data-toggle="wy-nav-top" class="fa fa-bars"></i> |
| <a href="index.html">Airflow</a> |
| |
| </nav> |
| |
| |
| <div class="wy-nav-content"> |
| |
| <div class="rst-content"> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> |
| <div itemprop="articleBody"> |
| |
| <div class="section" id="tutorial"> |
| <h1>Tutorial<a class="headerlink" href="#tutorial" title="Permalink to this headline">¶</a></h1> |
| <p>This tutorial walks you through some of the fundamental Airflow concepts, |
| objects, and their usage while writing your first pipeline.</p> |
| <div class="section" id="example-pipeline-definition"> |
| <h2>Example Pipeline definition<a class="headerlink" href="#example-pipeline-definition" title="Permalink to this headline">¶</a></h2> |
| <p>Here is an example of a basic pipeline definition. Do not worry if this looks |
| complicated; a line-by-line explanation follows below.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="sd">"""</span> |
| <span class="sd">Code that goes along with the Airflow tutorial located at:</span> |
| <span class="sd">https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py</span> |
| <span class="sd">"""</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="kn">import</span> <span class="n">BashOperator</span> |
| <span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| |
| <span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| |
| <span class="c1"># t1, t2 and t3 are examples of tasks created by instantiating operators</span> |
| <span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7)}}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| <span class="n">t3</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="it-s-a-dag-definition-file"> |
| <h2>It’s a DAG definition file<a class="headerlink" href="#it-s-a-dag-definition-file" title="Permalink to this headline">¶</a></h2> |
| <p>One thing to wrap your head around (it may not be very intuitive for everyone |
| at first) is that this Airflow Python script is really |
| just a configuration file specifying the DAG’s structure as code. |
| The actual tasks defined here will run in a different context from |
| the context of this script. Different tasks run on different workers |
| at different points in time, which means that this script cannot be used |
| to cross-communicate between tasks. For that |
| purpose, Airflow has a more advanced feature called <code class="docutils literal notranslate"><span class="pre">XCom</span></code>.</p> |
| <p>People sometimes think of the DAG definition file as a place where they |
| can do some actual data processing. That is not the case at all! |
| The script’s purpose is to define a DAG object. It needs to evaluate |
| quickly (seconds, not minutes) since the scheduler will execute it |
| periodically to pick up any changes.</p> |
| </div> |
| <div class="section" id="importing-modules"> |
| <h2>Importing Modules<a class="headerlink" href="#importing-modules" title="Permalink to this headline">¶</a></h2> |
| <p>An Airflow pipeline is just a Python script that happens to define an |
| Airflow DAG object. Let’s start by importing the libraries we will need.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># The DAG object; we'll need this to instantiate a DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span> |
| |
| <span class="c1"># Operators; we need this to operate!</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="kn">import</span> <span class="n">BashOperator</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="default-arguments"> |
| <h2>Default Arguments<a class="headerlink" href="#default-arguments" title="Permalink to this headline">¶</a></h2> |
| <p>We’re about to create a DAG and some tasks, and we have the choice to |
| explicitly pass a set of arguments to each task’s constructor |
| (which would become redundant), or (better!) we can define a dictionary |
| of default parameters that we can use when creating tasks.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| </pre></div> |
| </div> |
| <p>For more information about the BaseOperator’s parameters and what they do, |
| refer to the <a class="reference internal" href="_api/airflow/models/index.html#airflow.models.BaseOperator" title="airflow.models.BaseOperator"><code class="xref py py-class docutils literal notranslate"><span class="pre">airflow.models.BaseOperator</span></code></a> documentation.</p> |
| <p>Also, note that you could easily define different sets of arguments that |
| would serve different purposes. An example of that would be to have |
| different settings between a production and development environment.</p> |
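| <p>As a minimal sketch of that idea, the snippet below picks one of two argument sets based on an environment variable. The variable name <code class="docutils literal notranslate"><span class="pre">DEPLOY_ENV</span></code>, the helper <code class="docutils literal notranslate"><span class="pre">select_default_args</span></code>, and the production values are assumptions made for illustration; none of them come from Airflow.</p> |

```python
import os
from datetime import datetime, timedelta

# Settings shared by every environment (taken from the tutorial's default_args).
BASE_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Hypothetical per-environment overrides; the values are illustrative only.
ENV_OVERRIDES = {
    'development': {'email_on_failure': False},
    'production': {'email_on_failure': True, 'retries': 3},
}


def select_default_args(env=None):
    """Return a default_args dict for the given environment name."""
    # DEPLOY_ENV is an assumed variable name, not an Airflow setting.
    env = env or os.environ.get('DEPLOY_ENV', 'development')
    args = dict(BASE_ARGS)
    # Unknown environment names fall back to the base settings.
    args.update(ENV_OVERRIDES.get(env, {}))
    return args


default_args = select_default_args()
```

| <p>The resulting <code class="docutils literal notranslate"><span class="pre">default_args</span></code> dictionary can then be passed to the DAG in the same way as the hand-written one.</p> |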
| </div> |
| <div class="section" id="instantiate-a-dag"> |
| <h2>Instantiate a DAG<a class="headerlink" href="#instantiate-a-dag" title="Permalink to this headline">¶</a></h2> |
| <p>We’ll need a DAG object to nest our tasks into. Here we pass a string |
| that defines the <code class="docutils literal notranslate"><span class="pre">dag_id</span></code>, which serves as a unique identifier for your DAG. |
| We also pass the default argument dictionary that we just defined and |
| define a <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> of 1 day for the DAG.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span> |
| <span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| </pre></div> |
| </div> |
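| <p>To make the 1 day interval concrete, this standard-library sketch lists the first few dates obtained by stepping from the tutorial’s <code class="docutils literal notranslate"><span class="pre">start_date</span></code> by the <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>. The helper <code class="docutils literal notranslate"><span class="pre">execution_dates</span></code> is hypothetical and only does the date arithmetic; it does not model when the scheduler actually triggers a run.</p> |

```python
from datetime import datetime, timedelta
from itertools import islice


def execution_dates(start_date, schedule_interval):
    """Yield successive dates spaced one schedule_interval apart."""
    current = start_date
    while True:
        yield current
        current += schedule_interval


# Same start_date and interval as the tutorial DAG.
first_three = list(islice(execution_dates(datetime(2015, 6, 1), timedelta(days=1)), 3))
for d in first_three:
    print(d.date().isoformat())
```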
| </div> |
| <div class="section" id="tasks"> |
| <h2>Tasks<a class="headerlink" href="#tasks" title="Permalink to this headline">¶</a></h2> |
| <p>Tasks are generated when instantiating operator objects. An object |
| instantiated from an operator is called a task. The first argument |
| <code class="docutils literal notranslate"><span class="pre">task_id</span></code> acts as a unique identifier for the task.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>Notice how we pass a mix of operator-specific arguments (<code class="docutils literal notranslate"><span class="pre">bash_command</span></code>) and |
| an argument common to all operators (<code class="docutils literal notranslate"><span class="pre">retries</span></code>), inherited |
| from BaseOperator, to the operator’s constructor. This is simpler than |
| passing every argument for every constructor call. Also, notice that in |
| the second task we override the <code class="docutils literal notranslate"><span class="pre">retries</span></code> parameter with <code class="docutils literal notranslate"><span class="pre">3</span></code>.</p> |
| <p>The precedence rules for a task are as follows:</p> |
| <ol class="arabic simple"> |
| <li><p>Explicitly passed arguments</p></li> |
| <li><p>Values that exist in the <code class="docutils literal notranslate"><span class="pre">default_args</span></code> dictionary</p></li> |
| <li><p>The operator’s default value, if one exists</p></li> |
| </ol> |
| <p>A task must include or inherit the arguments <code class="docutils literal notranslate"><span class="pre">task_id</span></code> and <code class="docutils literal notranslate"><span class="pre">owner</span></code>; |
| otherwise, Airflow will raise an exception.</p> |
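| <p>The precedence order listed above can be sketched in plain Python. The function <code class="docutils literal notranslate"><span class="pre">resolve_argument</span></code> and the <code class="docutils literal notranslate"><span class="pre">operator_defaults</span></code> values are hypothetical; this only illustrates the lookup order and is not how Airflow implements it.</p> |

```python
def resolve_argument(name, explicit, default_args, operator_defaults):
    """Pick a value for `name` using the three-step precedence order."""
    if name in explicit:            # 1. explicitly passed arguments
        return explicit[name]
    if name in default_args:        # 2. values in the default_args dictionary
        return default_args[name]
    if name in operator_defaults:   # 3. the operator's default value, if any
        return operator_defaults[name]
    raise KeyError('no value available for %r' % name)


default_args = {'owner': 'airflow', 'retries': 1}
operator_defaults = {'retries': 0, 'queue': 'default'}  # hypothetical values

# t1 passes no retries, so it inherits the value from default_args.
t1_retries = resolve_argument(
    'retries', {'task_id': 'print_date'}, default_args, operator_defaults)
# t2 passes retries=3 explicitly, which wins over default_args.
t2_retries = resolve_argument(
    'retries', {'task_id': 'sleep', 'retries': 3}, default_args, operator_defaults)
print(t1_retries, t2_retries)
```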
| </div> |
| <div class="section" id="templating-with-jinja"> |
| <h2>Templating with Jinja<a class="headerlink" href="#templating-with-jinja" title="Permalink to this headline">¶</a></h2> |
| <p>Airflow leverages the power of |
| <a class="reference external" href="http://jinja.pocoo.org/docs/dev/">Jinja Templating</a> and provides |
| the pipeline author |
| with a set of built-in parameters and macros. Airflow also provides |
| hooks for the pipeline author to define their own parameters, macros and |
| templates.</p> |
| <p>This tutorial barely scratches the surface of what you can do with |
| templating in Airflow, but the goal of this section is to let you know |
| this feature exists, get you familiar with double curly brackets, and |
| point to the most common template variable: <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code> (the execution |
| date as a “date stamp”, formatted as YYYY-MM-DD).</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7) }}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>Notice that the <code class="docutils literal notranslate"><span class="pre">templated_command</span></code> contains code logic in <code class="docutils literal notranslate"><span class="pre">{%</span> <span class="pre">%}</span></code> blocks, |
| references parameters like <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code>, calls a function as in |
| <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">macros.ds_add(ds,</span> <span class="pre">7)}}</span></code>, and references a user-defined parameter |
| in <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">params.my_param</span> <span class="pre">}}</span></code>.</p> |
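| <p>To show what the <code class="docutils literal notranslate"><span class="pre">macros.ds_add(ds,</span> <span class="pre">7)</span></code> call computes, here is a standard-library sketch of the same date arithmetic. It assumes the date stamp is a <code class="docutils literal notranslate"><span class="pre">YYYY-MM-DD</span></code> string and is a re-implementation for illustration, not Airflow’s own code; the example date is arbitrary.</p> |

```python
from datetime import datetime, timedelta


def ds_add(ds, days):
    """Shift a YYYY-MM-DD date stamp by a number of days (may be negative)."""
    shifted = datetime.strptime(ds, '%Y-%m-%d') + timedelta(days=days)
    return shifted.strftime('%Y-%m-%d')


ds = '2015-06-01'  # example date stamp
print(ds_add(ds, 7))
print(ds_add(ds, -1))
```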
| <p>The <code class="docutils literal notranslate"><span class="pre">params</span></code> hook in <code class="docutils literal notranslate"><span class="pre">BaseOperator</span></code> allows you to pass a dictionary of |
| parameters and/or objects to your templates. Please take the time |
| to understand how the parameter <code class="docutils literal notranslate"><span class="pre">my_param</span></code> makes it through to the template.</p> |
| <p>Files can also be passed to the <code class="docutils literal notranslate"><span class="pre">bash_command</span></code> argument, like |
| <code class="docutils literal notranslate"><span class="pre">bash_command='templated_command.sh'</span></code>, where the file location is relative to |
| the directory containing the pipeline file (<code class="docutils literal notranslate"><span class="pre">tutorial.py</span></code> in this case). This |
| may be desirable for many reasons, like separating your script’s logic and |
| pipeline code, allowing for proper code highlighting in files composed in |
| different languages, and general flexibility in structuring pipelines. You can |
| also set <code class="docutils literal notranslate"><span class="pre">template_searchpath</span></code> in the DAG constructor call to point to |
| additional folders that contain templates.</p> |
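| <p>The sketch below lists where a template file name might be looked up: first next to the pipeline file, then in each <code class="docutils literal notranslate"><span class="pre">template_searchpath</span></code> folder. The helper <code class="docutils literal notranslate"><span class="pre">template_candidates</span></code>, every path shown, and the lookup order itself are assumptions for illustration rather than a description of Airflow’s loader.</p> |

```python
import posixpath


def template_candidates(template_name, dag_file, template_searchpath=()):
    """List the paths where a template file would be looked for, in order."""
    # Assumed ordering: the folder of the pipeline file comes first,
    # followed by any extra folders from template_searchpath.
    folders = [posixpath.dirname(dag_file)] + list(template_searchpath)
    return [posixpath.join(folder, template_name) for folder in folders]


# All paths below are hypothetical.
paths = template_candidates(
    'templated_command.sh',
    dag_file='/home/airflow/dags/tutorial.py',
    template_searchpath=['/opt/shared/templates'],
)
for path in paths:
    print(path)
```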
| <p>Using that same DAG constructor call, it is possible to define |
| <code class="docutils literal notranslate"><span class="pre">user_defined_macros</span></code> which allow you to specify your own variables. |
| For example, passing <code class="docutils literal notranslate"><span class="pre">dict(foo='bar')</span></code> to this argument allows you |
| to use <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">foo</span> <span class="pre">}}</span></code> in your templates. Moreover, specifying |
| <code class="docutils literal notranslate"><span class="pre">user_defined_filters</span></code> allows you to register your own filters. For example, |
| passing <code class="docutils literal notranslate"><span class="pre">dict(hello=lambda</span> <span class="pre">name:</span> <span class="pre">'Hello</span> <span class="pre">%s'</span> <span class="pre">%</span> <span class="pre">name)</span></code> to this argument allows |
| you to use <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">'world'</span> <span class="pre">|</span> <span class="pre">hello</span> <span class="pre">}}</span></code> in your templates. For more information |
| regarding custom filters, have a look at the |
| <a class="reference external" href="http://jinja.pocoo.org/docs/dev/api/#writing-filters">Jinja Documentation</a>.</p> |
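| <p>The two dictionaries from this paragraph can be tried out in plain Python. The sketch below does no Jinja rendering; it only evaluates by hand what the two template expressions would look up and call. The variable names <code class="docutils literal notranslate"><span class="pre">foo_value</span></code> and <code class="docutils literal notranslate"><span class="pre">greeting</span></code> are chosen for this example.</p> |

```python
# Dictionaries in the shape described for the DAG constructor arguments.
user_defined_macros = dict(foo='bar')
user_defined_filters = dict(hello=lambda name: 'Hello %s' % name)

# In a template, {{ foo }} refers to the 'foo' entry ...
foo_value = user_defined_macros['foo']

# ... and {{ 'world' | hello }} applies the 'hello' filter to 'world'.
greeting = user_defined_filters['hello']('world')

print(foo_value)
print(greeting)
```

| <p>In a pipeline file these dictionaries would be passed to the DAG constructor through its <code class="docutils literal notranslate"><span class="pre">user_defined_macros</span></code> and <code class="docutils literal notranslate"><span class="pre">user_defined_filters</span></code> arguments.</p> |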
| <p>For more information on the variables and macros that can be referenced |
| in templates, make sure to read through the <a class="reference internal" href="macros.html"><span class="doc">Macros reference</span></a>.</p> |
| </div> |
| <div class="section" id="setting-up-dependencies"> |
| <h2>Setting up Dependencies<a class="headerlink" href="#setting-up-dependencies" title="Permalink to this headline">¶</a></h2> |
| <p>We have tasks <cite>t1</cite>, <cite>t2</cite> and <cite>t3</cite> that do not depend on each other. Here are a few ways |
| you can define dependencies between them:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">t1</span><span class="o">.</span><span class="n">set_downstream</span><span class="p">(</span><span class="n">t2</span><span class="p">)</span> |
| |
| <span class="c1"># This means that t2 will depend on t1</span> |
| <span class="c1"># running successfully to run.</span> |
| <span class="c1"># It is equivalent to:</span> |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| |
| <span class="c1"># The bit shift operator can also be</span> |
| <span class="c1"># used to chain operations:</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="n">t2</span> |
| |
| <span class="c1"># And the upstream dependency with the</span> |
| <span class="c1"># bit shift operator:</span> |
| <span class="n">t2</span> <span class="o"><<</span> <span class="n">t1</span> |
| |
| <span class="c1"># Chaining multiple dependencies becomes</span> |
| <span class="c1"># concise with the bit shift operator:</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="n">t2</span> <span class="o">>></span> <span class="n">t3</span> |
| |
| <span class="c1"># A list of tasks can also be set as</span> |
| <span class="c1"># dependencies. These operations</span> |
| <span class="c1"># all have the same effect:</span> |
| <span class="n">t1</span><span class="o">.</span><span class="n">set_downstream</span><span class="p">([</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">])</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="p">[</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">]</span> |
| <span class="p">[</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">]</span> <span class="o"><<</span> <span class="n">t1</span> |
| </pre></div> |
| </div> |
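| <p>To make the operators less magical, here is a minimal sketch of how <code class="docutils literal notranslate"><span class="pre">&gt;&gt;</span></code> and <code class="docutils literal notranslate"><span class="pre">&lt;&lt;</span></code> can be mapped onto <code class="docutils literal notranslate"><span class="pre">set_downstream</span></code> and <code class="docutils literal notranslate"><span class="pre">set_upstream</span></code> with Python’s operator overloading. <code class="docutils literal notranslate"><span class="pre">SketchTask</span></code> and <code class="docutils literal notranslate"><span class="pre">_as_list</span></code> are hypothetical names written for this illustration, not Airflow’s operator class, and the sketch ignores everything except the dependency bookkeeping.</p> |

```python
def _as_list(tasks):
    """Wrap a single task in a list so both forms can be handled alike."""
    return tasks if isinstance(tasks, list) else [tasks]


class SketchTask:
    """Hypothetical stand-in for an operator; tracks dependencies only."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, others):
        for other in _as_list(others):
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def set_upstream(self, others):
        for other in _as_list(others):
            other.set_downstream(self)

    def __rshift__(self, other):
        # self >> other; returning `other` is what lets a >> b >> c chain.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # self << other
        self.set_upstream(other)
        return other

    def __rlshift__(self, others):
        # [a, b] << self: a list has no <<, so Python falls back to this.
        self.set_downstream(others)
        return self

    def __rrshift__(self, others):
        # [a, b] >> self
        self.set_upstream(others)
        return self


t1, t2, t3 = SketchTask('t1'), SketchTask('t2'), SketchTask('t3')
t1 >> t2 >> t3  # t1 -> t2 -> t3
print(sorted(t2.upstream), sorted(t2.downstream))  # ['t1'] ['t3']

a, b, c = SketchTask('a'), SketchTask('b'), SketchTask('c')
[b, c] << a  # same effect as a >> [b, c]
print(sorted(a.downstream))  # ['b', 'c']
```

| <p>Returning the right-hand operand from <code class="docutils literal notranslate"><span class="pre">__rshift__</span></code> is the design choice that makes chained expressions read left to right.</p> |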
| <p>Note that when executing your script, Airflow will raise exceptions when |
| it finds cycles in your DAG or when a dependency is referenced more |
| than once.</p> |
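| <p>The cycle check can be pictured as a depth-first search over the dependency graph. The sketch below is a generic illustration using a plain dict of task ids; <code class="docutils literal notranslate"><span class="pre">has_cycle</span></code> and the example graphs are hypothetical and do not reproduce Airflow’s own implementation or its exception types.</p> |

```python
def has_cycle(downstream):
    """Return True if the graph {task_id: [downstream task_ids]} has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for child in downstream.get(node, []):
            state = color.get(child, WHITE)
            if state == GREY:
                return True  # back edge: child is already on the current path
            if state == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in downstream)


# print_date runs first, then sleep and templated (as in this tutorial).
tutorial_graph = {'print_date': ['sleep', 'templated'], 'sleep': [], 'templated': []}
cyclic_graph = {'t1': ['t2'], 't2': ['t3'], 't3': ['t1']}
print(has_cycle(tutorial_graph))  # False
print(has_cycle(cyclic_graph))    # True
```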
| </div> |
| <div class="section" id="recap"> |
| <h2>Recap<a class="headerlink" href="#recap" title="Permalink to this headline">¶</a></h2> |
| <p>Alright, so we have a pretty basic DAG. At this point your code should look |
| something like this:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="sd">"""</span> |
| <span class="sd">Code that goes along with the Airflow tutorial located at:</span> |
| <span class="sd">https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py</span> |
| <span class="sd">"""</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="kn">import</span> <span class="n">BashOperator</span> |
| <span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| |
| <span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span> |
| <span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| |
| <span class="c1"># t1, t2 and t3 are examples of tasks created by instantiating operators</span> |
| <span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7)}}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| <span class="n">t3</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| </pre></div> |
| </div> |
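| <p>To preview what the templated command in <code class="docutils literal notranslate"><span class="pre">t3</span></code> prints for a given run, the sketch below reproduces the date arithmetic in plain Python. <code class="docutils literal notranslate"><span class="pre">ds_add_sketch</span></code> is a hypothetical re-implementation written for illustration, assuming <code class="docutils literal notranslate"><span class="pre">ds</span></code> is the execution date formatted as <code class="docutils literal notranslate"><span class="pre">YYYY-MM-DD</span></code> and that <code class="docutils literal notranslate"><span class="pre">macros.ds_add</span></code> shifts it by a number of days; the Macros reference holds the authoritative definitions.</p> |

```python
from datetime import datetime, timedelta


def ds_add_sketch(ds, days):
    """Shift a YYYY-MM-DD date string by `days` days (illustrative only)."""
    shifted = datetime.strptime(ds, '%Y-%m-%d') + timedelta(days=days)
    return shifted.strftime('%Y-%m-%d')


ds = '2015-06-01'
print(ds)                    # 2015-06-01
print(ds_add_sketch(ds, 7))  # 2015-06-08
```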
| </div> |
| <div class="section" id="testing"> |
| <h2>Testing<a class="headerlink" href="#testing" title="Permalink to this headline">¶</a></h2> |
| <div class="section" id="running-the-script"> |
| <h3>Running the Script<a class="headerlink" href="#running-the-script" title="Permalink to this headline">¶</a></h3> |
| <p>Time to run some tests. First, let’s make sure the pipeline |
| is parsed successfully.</p> |
| <p>Let’s assume we’re saving the code from the previous step in |
| <code class="docutils literal notranslate"><span class="pre">tutorial.py</span></code> in the DAGs folder referenced in your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. |
| The default location for your DAGs is <code class="docutils literal notranslate"><span class="pre">~/airflow/dags</span></code>.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python ~/airflow/dags/tutorial.py |
| </pre></div> |
| </div> |
| <p>If the script does not raise an exception it means that you haven’t done |
| anything horribly wrong, and that your Airflow environment is somewhat |
| sound.</p> |
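| <p>The same check can be scripted, for example inside a unit test. The sketch below uses only the standard library to execute a file and report whether it raised; <code class="docutils literal notranslate"><span class="pre">parses_cleanly</span></code> is a hypothetical helper, and the temporary files merely stand in for a DAG file such as <code class="docutils literal notranslate"><span class="pre">~/airflow/dags/tutorial.py</span></code>.</p> |

```python
import os
import runpy
import tempfile


def parses_cleanly(path):
    """Run the Python file at `path`; return (ok, error) instead of raising."""
    try:
        runpy.run_path(path)
    except Exception as exc:  # any exception means the script is broken
        return False, exc
    return True, None


# Stand-in files; point parses_cleanly at your own DAG file instead.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good_dag.py')
    bad = os.path.join(tmp, 'bad_dag.py')
    with open(good, 'w') as f:
        f.write("x = 1\n")
    with open(bad, 'w') as f:
        f.write("raise ValueError('broken')\n")
    good_ok, _ = parses_cleanly(good)
    bad_ok, bad_error = parses_cleanly(bad)

print(good_ok, bad_ok)  # True False
```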
| </div> |
| <div class="section" id="command-line-metadata-validation"> |
| <h3>Command Line Metadata Validation<a class="headerlink" href="#command-line-metadata-validation" title="Permalink to this headline">¶</a></h3> |
| <p>Let’s run a few commands to validate this script further.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># print the list of active DAGs</span> |
| airflow list_dags |
| |
| <span class="c1"># prints the list of tasks in the "tutorial" DAG</span> |
| airflow list_tasks tutorial |
| |
| <span class="c1"># prints the hierarchy of tasks in the tutorial DAG</span> |
| airflow list_tasks tutorial --tree |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="id1"> |
| <h3>Testing<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h3> |
| <p>Let’s test by running the actual task instances on a specific date. The |
| date specified in this context is an <code class="docutils literal notranslate"><span class="pre">execution_date</span></code>, which simulates the |
| scheduler running your task or DAG at a specific date and time:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># command layout: command subcommand dag_id task_id date</span> |
| |
| <span class="c1"># testing print_date</span> |
| airflow <span class="nb">test</span> tutorial print_date <span class="m">2015</span>-06-01 |
| |
| <span class="c1"># testing sleep</span> |
| airflow <span class="nb">test</span> tutorial sleep <span class="m">2015</span>-06-01 |
| </pre></div> |
| </div> |
| <p>Now remember what we did with templating earlier? See how this template |
| gets rendered and executed by running this command:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># testing templated</span> |
| airflow <span class="nb">test</span> tutorial templated <span class="m">2015</span>-06-01 |
| </pre></div> |
| </div> |
| <p>This should result in displaying a verbose log of events and ultimately |
| running your bash command and printing the result.</p> |
| <p>Note that the <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">test</span></code> command runs task instances locally, outputs |
| their log to stdout (on screen), doesn’t bother with dependencies, and |
| doesn’t communicate state (running, success, failed, …) to the database. |
| It simply allows testing a single task instance.</p> |
| </div> |
| <div class="section" id="backfill"> |
| <h3>Backfill<a class="headerlink" href="#backfill" title="Permalink to this headline">¶</a></h3> |
| <p>Everything looks like it’s running fine, so let’s run a backfill. |
| <code class="docutils literal notranslate"><span class="pre">backfill</span></code> will respect your dependencies, emit logs into files and talk to |
| the database to record status. If you have a web server running, you can track |
| the progress visually as your backfill proceeds; <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">webserver</span></code> |
| will start one.</p> |
| <p>Note that if you use <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code>, individual task instances |
| will depend on the success of the preceding task instance, except for the |
| instance at the specified start_date itself, for which this dependency is disregarded.</p> |
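| <p>The rule can be sketched as a small predicate. This is a simplified illustration with hypothetical names (<code class="docutils literal notranslate"><span class="pre">can_run</span></code>, <code class="docutils literal notranslate"><span class="pre">previous_state</span></code>), assuming a task instance is identified only by its execution date; Airflow’s real dependency evaluation considers more than this.</p> |

```python
from datetime import date


def can_run(execution_date, start_date, previous_state, depends_on_past=True):
    """Decide whether depends_on_past allows this task instance to run."""
    if not depends_on_past:
        return True
    if execution_date == start_date:
        return True  # first run: there is no preceding instance to wait for
    return previous_state == 'success'


start = date(2015, 6, 1)
print(can_run(date(2015, 6, 1), start, None))       # True
print(can_run(date(2015, 6, 2), start, 'failed'))   # False
print(can_run(date(2015, 6, 2), start, 'success'))  # True
```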
| <p>The date range in this context is a <code class="docutils literal notranslate"><span class="pre">start_date</span></code> and optionally an <code class="docutils literal notranslate"><span class="pre">end_date</span></code>, |
| which are used to populate the run schedule with task instances from this DAG.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># optional, start a web server in debug mode in the background</span> |
| <span class="c1"># airflow webserver --debug &</span> |
| |
| <span class="c1"># start your backfill on a date range</span> |
| airflow backfill tutorial -s <span class="m">2015</span>-06-01 -e <span class="m">2015</span>-06-07 |
| </pre></div> |
| </div> |
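| <p>For intuition about which runs this backfill covers, the sketch below enumerates daily execution dates between the two bounds. It assumes the daily <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> from the tutorial DAG and that both the start and end dates are included; <code class="docutils literal notranslate"><span class="pre">daily_execution_dates</span></code> is a hypothetical helper, not part of Airflow.</p> |

```python
from datetime import date, timedelta


def daily_execution_dates(start, end):
    """List every date from start to end (assumed inclusive), one day apart."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


dates = daily_execution_dates(date(2015, 6, 1), date(2015, 6, 7))
print(len(dates))                                   # 7
print(dates[0].isoformat(), dates[-1].isoformat())  # 2015-06-01 2015-06-07
```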
| </div> |
| </div> |
| <div class="section" id="what-s-next"> |
| <h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> |
| <p>That’s it! You’ve written, tested and backfilled your very first Airflow |
| pipeline. Merging your code into a repository that has a master scheduler |
| running against it should get the pipeline triggered and run every day.</p> |
| <p>Here are a few things you might want to do next:</p> |
| <ul> |
| <li><p>Take an in-depth tour of the UI - click all the things!</p></li> |
| <li><p>Keep reading the docs! Especially the sections on:</p> |
| <blockquote> |
| <div><ul class="simple"> |
| <li><p>Command line interface</p></li> |
| <li><p>Operators</p></li> |
| <li><p>Macros</p></li> |
| </ul> |
| </div></blockquote> |
| </li> |
| <li><p>Write your first pipeline!</p></li> |
| </ul> |
| </div> |
| </div> |
| |
| |
| </div> |
| |
| </div> |
| |
| </div> |
| </div> |
| |
| </section> |
| |
| </div> |
| |
| |
| |
| <script type="text/javascript"> |
| jQuery(function () { |
| SphinxRtdTheme.Navigation.enable(true); |
| }); |
| </script> |
| |
| |
| |
| |
| |
| |
| </body> |
| </html> |