| |
| |
| <!DOCTYPE html> |
| <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> |
| <head> |
| <meta charset="utf-8"> |
| |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <title>Tutorial — Airflow Documentation</title> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> |
| <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> |
| <link rel="index" title="Index" href="genindex.html" /> |
| <link rel="search" title="Search" href="search.html" /> |
| <link rel="next" title="How-to Guides" href="howto/index.html" /> |
| <link rel="prev" title="Installation" href="installation.html" /> |
| |
| |
| <script src="_static/js/modernizr.min.js"></script> |
| |
| </head> |
| |
| <body class="wy-body-for-nav"> |
| |
| |
| <div class="wy-grid-for-nav"> |
| |
| |
| <nav data-toggle="wy-nav-shift" class="wy-nav-side"> |
| <div class="wy-side-scroll"> |
| <div class="wy-side-nav-search"> |
| |
| |
| |
| <a href="index.html" class="icon icon-home"> Airflow |
| |
| |
| |
| </a> |
| |
| |
| |
| |
| |
| |
| |
| <div role="search"> |
| <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> |
| <input type="text" name="q" placeholder="Search docs" /> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </form> |
| </div> |
| |
| |
| </div> |
| |
| <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> |
| |
| |
| |
| |
| |
| |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="start.html">Quick Start</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li> |
| <li class="toctree-l1 current"><a class="current reference internal" href="#">Tutorial</a><ul> |
| <li class="toctree-l2"><a class="reference internal" href="#example-pipeline-definition">Example Pipeline definition</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#it-s-a-dag-definition-file">It’s a DAG definition file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#importing-modules">Importing Modules</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#default-arguments">Default Arguments</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#instantiate-a-dag">Instantiate a DAG</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#tasks">Tasks</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#templating-with-jinja">Templating with Jinja</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#setting-up-dependencies">Setting up Dependencies</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#recap">Recap</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#testing">Testing</a><ul> |
| <li class="toctree-l3"><a class="reference internal" href="#running-the-script">Running the Script</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#command-line-metadata-validation">Command Line Metadata Validation</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#id1">Testing</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#backfill">Backfill</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="howto/index.html">How-to Guides</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="ui.html">UI / Screenshots</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="concepts.html">Concepts</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="profiling.html">Data Profiling</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="cli.html">Command Line Interface</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="scheduler.html">Scheduling & Triggers</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="plugins.html">Plugins</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="security.html">Security</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="timezone.html">Time zones</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="api.html">Experimental Rest API</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="integration.html">Integration</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="lineage.html">Lineage</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="faq.html">FAQ</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="code.html">API Reference</a></li> |
| </ul> |
| |
| |
| |
| </div> |
| </div> |
| </nav> |
| |
| <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> |
| |
| |
| <nav class="wy-nav-top" aria-label="top navigation"> |
| |
| <i data-toggle="wy-nav-top" class="fa fa-bars"></i> |
| <a href="index.html">Airflow</a> |
| |
| </nav> |
| |
| |
| <div class="wy-nav-content"> |
| |
| <div class="rst-content"> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <div role="navigation" aria-label="breadcrumbs navigation"> |
| |
| <ul class="wy-breadcrumbs"> |
| |
| <li><a href="index.html">Docs</a> »</li> |
| |
| <li>Tutorial</li> |
| |
| |
| <li class="wy-breadcrumbs-aside"> |
| |
| |
| <a href="_sources/tutorial.rst.txt" rel="nofollow"> View page source</a> |
| |
| |
| </li> |
| |
| </ul> |
| |
| |
| <hr/> |
| </div> |
| <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> |
| <div itemprop="articleBody"> |
| |
| <div class="section" id="tutorial"> |
| <h1>Tutorial<a class="headerlink" href="#tutorial" title="Permalink to this headline">¶</a></h1> |
| <p>This tutorial walks you through some of the fundamental Airflow concepts, |
| objects, and their usage while writing your first pipeline.</p> |
| <div class="section" id="example-pipeline-definition"> |
| <h2>Example Pipeline definition<a class="headerlink" href="#example-pipeline-definition" title="Permalink to this headline">¶</a></h2> |
| <p>Here is an example of a basic pipeline definition. Do not worry if this looks |
| complicated, a line by line explanation follows below.</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="sd">"""</span> |
| <span class="sd">Code that goes along with the Airflow tutorial located at:</span> |
| <span class="sd">https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py</span> |
| <span class="sd">"""</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="k">import</span> <span class="n">DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="k">import</span> <span class="n">BashOperator</span> |
| <span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| |
| <span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| |
| <span class="c1"># t1, t2 and t3 are examples of tasks created by instantiating operators</span> |
| <span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7)}}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| <span class="n">t3</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="it-s-a-dag-definition-file"> |
| <h2>It’s a DAG definition file<a class="headerlink" href="#it-s-a-dag-definition-file" title="Permalink to this headline">¶</a></h2> |
| <p>One thing to wrap your head around (it may not be very intuitive for everyone |
| at first) is that this Airflow Python script is really |
| just a configuration file specifying the DAG’s structure as code. |
| The actual tasks defined here will run in a different context from |
| the context of this script. Different tasks run on different workers |
| at different points in time, which means that this script cannot be used |
| to cross communicate between tasks. Note that for this |
| purpose we have a more advanced feature called <code class="docutils literal notranslate"><span class="pre">XCom</span></code>.</p> |
| <p>People sometimes think of the DAG definition file as a place where they |
| can do some actual data processing - that is not the case at all! |
| The script’s purpose is to define a DAG object. It needs to evaluate |
| quickly (seconds, not minutes) since the scheduler will execute it |
| periodically to reflect the changes if any.</p> |
| </div> |
| <div class="section" id="importing-modules"> |
| <h2>Importing Modules<a class="headerlink" href="#importing-modules" title="Permalink to this headline">¶</a></h2> |
| <p>An Airflow pipeline is just a Python script that happens to define an |
| Airflow DAG object. Let’s start by importing the libraries we will need.</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># The DAG object; we'll need this to instantiate a DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="k">import</span> <span class="n">DAG</span> |
| |
| <span class="c1"># Operators; we need this to operate!</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="k">import</span> <span class="n">BashOperator</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="default-arguments"> |
| <h2>Default Arguments<a class="headerlink" href="#default-arguments" title="Permalink to this headline">¶</a></h2> |
| <p>We’re about to create a DAG and some tasks, and we have the choice to |
| explicitly pass a set of arguments to each task’s constructor |
| (which would become redundant), or (better!) we can define a dictionary |
| of default parameters that we can use when creating tasks.</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| </pre></div> |
| </div> |
| <p>For more information about the BaseOperator’s parameters and what they do, |
| refer to the <a class="reference internal" href="code.html#airflow.models.BaseOperator" title="airflow.models.BaseOperator"><code class="xref py py-class docutils literal notranslate"><span class="pre">airflow.models.BaseOperator</span></code></a> documentation.</p> |
| <p>Also, note that you could easily define different sets of arguments that |
| would serve different purposes. An example of that would be to have |
| different settings between a production and development environment.</p> |
| </div> |
| <div class="section" id="instantiate-a-dag"> |
| <h2>Instantiate a DAG<a class="headerlink" href="#instantiate-a-dag" title="Permalink to this headline">¶</a></h2> |
| <p>We’ll need a DAG object to nest our tasks into. Here we pass a string |
| that defines the <code class="docutils literal notranslate"><span class="pre">dag_id</span></code>, which serves as a unique identifier for your DAG. |
| We also pass the default argument dictionary that we just defined and |
| define a <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> of 1 day for the DAG.</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span> |
| <span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="tasks"> |
| <h2>Tasks<a class="headerlink" href="#tasks" title="Permalink to this headline">¶</a></h2> |
| <p>Tasks are generated when instantiating operator objects. An object |
| instantiated from an operator is called a constructor. The first argument |
| <code class="docutils literal notranslate"><span class="pre">task_id</span></code> acts as a unique identifier for the task.</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>Notice how we pass a mix of operator specific arguments (<code class="docutils literal notranslate"><span class="pre">bash_command</span></code>) and |
| an argument common to all operators (<code class="docutils literal notranslate"><span class="pre">retries</span></code>) inherited |
| from BaseOperator to the operator’s constructor. This is simpler than |
| passing every argument for every constructor call. Also, notice that in |
| the second task we override the <code class="docutils literal notranslate"><span class="pre">retries</span></code> parameter with <code class="docutils literal notranslate"><span class="pre">3</span></code>.</p> |
| <p>The precedence rules for a task are as follows:</p> |
| <ol class="arabic simple"> |
| <li>Explicitly passed arguments</li> |
| <li>Values that exist in the <code class="docutils literal notranslate"><span class="pre">default_args</span></code> dictionary</li> |
| <li>The operator’s default value, if one exists</li> |
| </ol> |
| <p>A task must include or inherit the arguments <code class="docutils literal notranslate"><span class="pre">task_id</span></code> and <code class="docutils literal notranslate"><span class="pre">owner</span></code>, |
| otherwise Airflow will raise an exception.</p> |
| </div> |
| <div class="section" id="templating-with-jinja"> |
| <h2>Templating with Jinja<a class="headerlink" href="#templating-with-jinja" title="Permalink to this headline">¶</a></h2> |
| <p>Airflow leverages the power of |
| <a class="reference external" href="http://jinja.pocoo.org/docs/dev/">Jinja Templating</a> and provides |
| the pipeline author |
| with a set of built-in parameters and macros. Airflow also provides |
| hooks for the pipeline author to define their own parameters, macros and |
| templates.</p> |
| <p>This tutorial barely scratches the surface of what you can do with |
| templating in Airflow, but the goal of this section is to let you know |
| this feature exists, get you familiar with double curly brackets, and |
| point to the most common template variable: <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code> (today’s “date |
| stamp”).</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7) }}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>Notice that the <code class="docutils literal notranslate"><span class="pre">templated_command</span></code> contains code logic in <code class="docutils literal notranslate"><span class="pre">{%</span> <span class="pre">%}</span></code> blocks, |
| references parameters like <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code>, calls a function as in |
| <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">macros.ds_add(ds,</span> <span class="pre">7)}}</span></code>, and references a user-defined parameter |
| in <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">params.my_param</span> <span class="pre">}}</span></code>.</p> |
| <p>The <code class="docutils literal notranslate"><span class="pre">params</span></code> hook in <code class="docutils literal notranslate"><span class="pre">BaseOperator</span></code> allows you to pass a dictionary of |
| parameters and/or objects to your templates. Please take the time |
| to understand how the parameter <code class="docutils literal notranslate"><span class="pre">my_param</span></code> makes it through to the template.</p> |
| <p>Files can also be passed to the <code class="docutils literal notranslate"><span class="pre">bash_command</span></code> argument, like |
| <code class="docutils literal notranslate"><span class="pre">bash_command='templated_command.sh'</span></code>, where the file location is relative to |
| the directory containing the pipeline file (<code class="docutils literal notranslate"><span class="pre">tutorial.py</span></code> in this case). This |
| may be desirable for many reasons, like separating your script’s logic and |
| pipeline code, allowing for proper code highlighting in files composed in |
| different languages, and general flexibility in structuring pipelines. It is |
| also possible to define your <code class="docutils literal notranslate"><span class="pre">template_searchpath</span></code> as pointing to any folder |
| locations in the DAG constructor call.</p> |
| <p>Using that same DAG constructor call, it is possible to define |
| <code class="docutils literal notranslate"><span class="pre">user_defined_macros</span></code> which allow you to specify your own variables. |
| For example, passing <code class="docutils literal notranslate"><span class="pre">dict(foo='bar')</span></code> to this argument allows you |
| to use <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">foo</span> <span class="pre">}}</span></code> in your templates. Moreover, specifying |
| <code class="docutils literal notranslate"><span class="pre">user_defined_filters</span></code> allow you to register you own filters. For example, |
| passing <code class="docutils literal notranslate"><span class="pre">dict(hello=lambda</span> <span class="pre">name:</span> <span class="pre">'Hello</span> <span class="pre">%s'</span> <span class="pre">%</span> <span class="pre">name)</span></code> to this argument allows |
| you to use <code class="docutils literal notranslate"><span class="pre">{{</span> <span class="pre">'world'</span> <span class="pre">|</span> <span class="pre">hello</span> <span class="pre">}}</span></code> in your templates. For more information |
| regarding custom filters have a look at the |
| <a class="reference external" href="http://jinja.pocoo.org/docs/dev/api/#writing-filters">Jinja Documentation</a></p> |
| <p>For more information on the variables and macros that can be referenced |
| in templates, make sure to read through the <a class="reference internal" href="code.html#macros"><span class="std std-ref">Macros</span></a> section</p> |
| </div> |
| <div class="section" id="setting-up-dependencies"> |
| <h2>Setting up Dependencies<a class="headerlink" href="#setting-up-dependencies" title="Permalink to this headline">¶</a></h2> |
| <p>We have tasks <cite>t1</cite>, <cite>t2</cite> and <cite>t3</cite> that do not depend on each other. Here’s a few ways |
| you can define dependencies between them:</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">t1</span><span class="o">.</span><span class="n">set_downstream</span><span class="p">(</span><span class="n">t2</span><span class="p">)</span> |
| |
| <span class="c1"># This means that t2 will depend on t1</span> |
| <span class="c1"># running successfully to run.</span> |
| <span class="c1"># It is equivalent to:</span> |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| |
| <span class="c1"># The bit shift operator can also be</span> |
| <span class="c1"># used to chain operations:</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="n">t2</span> |
| |
| <span class="c1"># And the upstream dependency with the</span> |
| <span class="c1"># bit shift operator:</span> |
| <span class="n">t2</span> <span class="o"><<</span> <span class="n">t1</span> |
| |
| <span class="c1"># Chaining multiple dependencies becomes</span> |
| <span class="c1"># concise with the bit shift operator:</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="n">t2</span> <span class="o">>></span> <span class="n">t3</span> |
| |
| <span class="c1"># A list of tasks can also be set as</span> |
| <span class="c1"># dependencies. These operations</span> |
| <span class="c1"># all have the same effect:</span> |
| <span class="n">t1</span><span class="o">.</span><span class="n">set_downstream</span><span class="p">([</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">])</span> |
| <span class="n">t1</span> <span class="o">>></span> <span class="p">[</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">]</span> |
| <span class="p">[</span><span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">]</span> <span class="o"><<</span> <span class="n">t1</span> |
| </pre></div> |
| </div> |
| <p>Note that when executing your script, Airflow will raise exceptions when |
| it finds cycles in your DAG or when a dependency is referenced more |
| than once.</p> |
| </div> |
| <div class="section" id="recap"> |
| <h2>Recap<a class="headerlink" href="#recap" title="Permalink to this headline">¶</a></h2> |
| <p>Alright, so we have a pretty basic DAG. At this point your code should look |
| something like this:</p> |
| <div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="sd">"""</span> |
| <span class="sd">Code that goes along with the Airflow tutorial located at:</span> |
| <span class="sd">https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py</span> |
| <span class="sd">"""</span> |
| <span class="kn">from</span> <span class="nn">airflow</span> <span class="k">import</span> <span class="n">DAG</span> |
| <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="k">import</span> <span class="n">BashOperator</span> |
| <span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> |
| |
| |
| <span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> |
| <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> |
| <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> |
| <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'airflow@example.com'</span><span class="p">],</span> |
| <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> |
| <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> |
| <span class="s1">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> |
| <span class="c1"># 'queue': 'bash_queue',</span> |
| <span class="c1"># 'pool': 'backfill',</span> |
| <span class="c1"># 'priority_weight': 10,</span> |
| <span class="c1"># 'end_date': datetime(2016, 1, 1),</span> |
| <span class="p">}</span> |
| |
| <span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span> |
| <span class="s1">'tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> |
| |
| <span class="c1"># t1, t2 and t3 are examples of tasks created by instantiating operators</span> |
| <span class="n">t1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'print_date'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'sleep'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="s1">'sleep 5'</span><span class="p">,</span> |
| <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> |
| <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> |
| <span class="s2"> echo "{{ ds }}"</span> |
| <span class="s2"> echo "{{ macros.ds_add(ds, 7)}}"</span> |
| <span class="s2"> echo "{{ params.my_param }}"</span> |
| <span class="s2"> {</span><span class="si">% e</span><span class="s2">ndfor %}</span> |
| <span class="s2">"""</span> |
| |
| <span class="n">t3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span> |
| <span class="n">task_id</span><span class="o">=</span><span class="s1">'templated'</span><span class="p">,</span> |
| <span class="n">bash_command</span><span class="o">=</span><span class="n">templated_command</span><span class="p">,</span> |
| <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s1">'my_param'</span><span class="p">:</span> <span class="s1">'Parameter I passed in'</span><span class="p">},</span> |
| <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span> |
| |
| <span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| <span class="n">t3</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="testing"> |
| <h2>Testing<a class="headerlink" href="#testing" title="Permalink to this headline">¶</a></h2> |
| <div class="section" id="running-the-script"> |
| <h3>Running the Script<a class="headerlink" href="#running-the-script" title="Permalink to this headline">¶</a></h3> |
| <p>Time to run some tests. First let’s make sure that the pipeline |
| parses. Let’s assume we’re saving the code from the previous step in |
| <code class="docutils literal notranslate"><span class="pre">tutorial.py</span></code> in the DAGs folder referenced in your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. |
| The default location for your DAGs is <code class="docutils literal notranslate"><span class="pre">~/airflow/dags</span></code>.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python ~/airflow/dags/tutorial.py |
| </pre></div> |
| </div> |
| <p>If the script does not raise an exception it means that you haven’t done |
| anything horribly wrong, and that your Airflow environment is somewhat |
| sound.</p> |
| </div> |
| <div class="section" id="command-line-metadata-validation"> |
| <h3>Command Line Metadata Validation<a class="headerlink" href="#command-line-metadata-validation" title="Permalink to this headline">¶</a></h3> |
| <p>Let’s run a few commands to validate this script further.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># print the list of active DAGs</span> |
| airflow list_dags |
| |
| <span class="c1"># prints the list of tasks the "tutorial" dag_id</span> |
| airflow list_tasks tutorial |
| |
| <span class="c1"># prints the hierarchy of tasks in the tutorial DAG</span> |
| airflow list_tasks tutorial --tree |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="id1"> |
| <h3>Testing<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h3> |
| <p>Let’s test by running the actual task instances on a specific date. The |
| date specified in this context is an <code class="docutils literal notranslate"><span class="pre">execution_date</span></code>, which simulates the |
| scheduler running your task or dag at a specific date + time:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># command layout: command subcommand dag_id task_id date</span> |
| |
| <span class="c1"># testing print_date</span> |
| airflow <span class="nb">test</span> tutorial print_date <span class="m">2015</span>-06-01 |
| |
| <span class="c1"># testing sleep</span> |
| airflow <span class="nb">test</span> tutorial sleep <span class="m">2015</span>-06-01 |
| </pre></div> |
| </div> |
| <p>Now remember what we did with templating earlier? See how this template |
| gets rendered and executed by running this command:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># testing templated</span> |
| airflow <span class="nb">test</span> tutorial templated <span class="m">2015</span>-06-01 |
| </pre></div> |
| </div> |
| <p>This should result in displaying a verbose log of events and ultimately |
| running your bash command and printing the result.</p> |
| <p>Note that the <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">test</span></code> command runs task instances locally, outputs |
| their log to stdout (on screen), doesn’t bother with dependencies, and |
| doesn’t communicate state (running, success, failed, …) to the database. |
| It simply allows testing a single task instance.</p> |
| </div> |
| <div class="section" id="backfill"> |
| <h3>Backfill<a class="headerlink" href="#backfill" title="Permalink to this headline">¶</a></h3> |
| <p>Everything looks like it’s running fine so let’s run a backfill. |
| <code class="docutils literal notranslate"><span class="pre">backfill</span></code> will respect your dependencies, emit logs into files and talk to |
| the database to record status. If you do have a webserver up, you’ll be able |
| to track the progress. <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">webserver</span></code> will start a web server if you |
| are interested in tracking the progress visually as your backfill progresses.</p> |
| <p>Note that if you use <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code>, individual task instances |
| will depend on the success of the preceding task instance, except for the |
| start_date specified itself, for which this dependency is disregarded.</p> |
| <p>The date range in this context is a <code class="docutils literal notranslate"><span class="pre">start_date</span></code> and optionally an <code class="docutils literal notranslate"><span class="pre">end_date</span></code>, |
| which are used to populate the run schedule with task instances from this dag.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># optional, start a web server in debug mode in the background</span> |
| <span class="c1"># airflow webserver --debug &</span> |
| |
| <span class="c1"># start your backfill on a date range</span> |
| airflow backfill tutorial -s <span class="m">2015</span>-06-01 -e <span class="m">2015</span>-06-07 |
| </pre></div> |
| </div> |
| </div> |
| </div> |
| <div class="section" id="what-s-next"> |
| <h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> |
| <p>That’s it, you’ve written, tested and backfilled your very first Airflow |
| pipeline. Merging your code into a code repository that has a master scheduler |
| running against it should get it to get triggered and run every day.</p> |
| <p>Here’s a few things you might want to do next:</p> |
| <ul> |
| <li><p class="first">Take an in-depth tour of the UI - click all the things!</p> |
| </li> |
| <li><p class="first">Keep reading the docs! Especially the sections on:</p> |
| <blockquote> |
| <div><ul class="simple"> |
| <li>Command line interface</li> |
| <li>Operators</li> |
| <li>Macros</li> |
| </ul> |
| </div></blockquote> |
| </li> |
| <li><p class="first">Write your first pipeline!</p> |
| </li> |
| </ul> |
| </div> |
| </div> |
| |
| |
| </div> |
| |
| </div> |
| <footer> |
| |
| <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation"> |
| |
| <a href="howto/index.html" class="btn btn-neutral float-right" title="How-to Guides" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a> |
| |
| |
| <a href="installation.html" class="btn btn-neutral" title="Installation" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a> |
| |
| </div> |
| |
| |
| <hr/> |
| |
| <div role="contentinfo"> |
| <p> |
| |
| </p> |
| </div> |
| Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. |
| |
| </footer> |
| |
| </div> |
| </div> |
| |
| </section> |
| |
| </div> |
| |
| |
| |
| |
| |
| |
| |
| <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script> |
| <script type="text/javascript" src="_static/jquery.js"></script> |
| <script type="text/javascript" src="_static/underscore.js"></script> |
| <script type="text/javascript" src="_static/doctools.js"></script> |
| |
| |
| |
| |
| <script type="text/javascript" src="_static/js/theme.js"></script> |
| |
| <script type="text/javascript"> |
| jQuery(function () { |
| SphinxRtdTheme.Navigation.enable(true); |
| }); |
| </script> |
| |
| </body> |
| </html> |