.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Modules Management
==================
Airflow allows you to use your own Python modules in DAGs and in the
Airflow configuration. This article describes how to create your own
module so that Airflow can load it correctly, and how to diagnose
problems when modules are not loaded properly.
Packages Loading in Python
--------------------------
The list of directories from which Python tries to load a module is given
by the variable ``sys.path``. Python tries to
`intelligently determine the contents <https://stackoverflow.com/a/38403654>`_
of this variable, depending on the operating system, how Python
is installed, and which Python version is used.
You can check the contents of this variable for the current Python environment
by running an interactive terminal as in the example below:
.. code-block:: pycon
>>> import sys
>>> from pprint import pprint
>>> pprint(sys.path)
['',
'/home/arch/.pyenv/versions/3.7.4/lib/python37.zip',
'/home/arch/.pyenv/versions/3.7.4/lib/python3.7',
'/home/arch/.pyenv/versions/3.7.4/lib/python3.7/lib-dynload',
'/home/arch/venvs/airflow/lib/python3.7/site-packages']
``sys.path`` is initialized during program startup. The first precedence is
given to the current directory, i.e., ``path[0]`` is the directory containing
the script that was used to invoke the Python interpreter, or an empty string if
Python was started as an interactive shell. Second precedence is given to ``PYTHONPATH``, if provided,
followed by the installation-dependent default paths, which are managed by the
`site <https://docs.python.org/3/library/site.html#module-site>`_ module.
``sys.path`` can also be modified during a Python session by simply using append
(for example, ``sys.path.append("/path/to/custom/package")``). Python will start
searching for packages in the newer paths once they're added. Airflow makes use
of this feature as described in the section :ref:`Additional modules in Airflow <additional-modules-in-airflow>`.
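As a minimal sketch of the runtime approach described above, the snippet below appends a directory to ``sys.path`` and then imports a module from it. The helper ``import_from_directory``, the temporary directory, and the module name ``my_custom_module`` are hypothetical names chosen for illustration; they are not part of Airflow.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_directory(directory, module_name):
    """Append ``directory`` to sys.path (if missing) and import ``module_name``."""
    if directory not in sys.path:
        sys.path.append(directory)
    # Clear cached directory listings so that a freshly created file is found.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Hypothetical module written to a temporary directory for the demonstration.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "my_custom_module.py").write_text("GREETING = 'hello'\n")

my_custom_module = import_from_directory(demo_dir, "my_custom_module")
print(my_custom_module.GREETING)  # prints: hello
```

Because the directory is appended rather than inserted at the front, modules with the same name found earlier in ``sys.path`` would still take precedence.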
``sys.path`` includes a ``site-packages`` directory, which contains the
installed **external packages**. This means that packages installed with
``pip`` or ``anaconda`` can be used in Airflow.
In the next section, you will learn how to create your own simple
installable package and how to specify additional directories to be added
to ``sys.path`` using the environment variable :envvar:`PYTHONPATH`.
Creating a package in Python
----------------------------
1. Before starting, install the following packages:
``setuptools``: setuptools is a package development process library designed
for creating and distributing Python packages.
``wheel``: The wheel package provides a ``bdist_wheel`` command for setuptools. It
creates a ``.whl`` file, which is directly installable through the ``pip install``
command. We can then upload the same file to pypi.org.
.. code-block:: bash
pip install --upgrade pip setuptools wheel
2. Create the package directory - in our case, we will call it ``airflow_operators``.
.. code-block:: bash
mkdir airflow_operators
3. Create the file ``__init__.py`` inside the package and add the following code:
.. code-block:: python
print("Hello from airflow_operators")
When we import this package, it should print the above message.
4. Create ``setup.py``:
.. code-block:: python
import setuptools
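setuptools.setup(
    name='airflow_operators',
    # List the packages explicitly so that the wheel contains airflow_operators.
    packages=setuptools.find_packages(),
)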
5. Build the wheel:
.. code-block:: bash
python setup.py bdist_wheel
This will create a few directories in the project, and the overall structure will
look like the following:
.. code-block:: bash
.
├── airflow_operators
│   └── __init__.py
├── airflow_operators.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   └── top_level.txt
├── build
│   └── bdist.macosx-10.15-x86_64
├── dist
│   └── airflow_operators-0.0.0-py3-none-any.whl
└── setup.py
6. Install the .whl file using pip:
.. code-block:: bash
pip install dist/airflow_operators-0.0.0-py3-none-any.whl
7. The package is now ready to use!
.. code-block:: pycon
>>> import airflow_operators
Hello from airflow_operators
>>>
The package can be removed using the ``pip`` command:
.. code-block:: bash
pip uninstall airflow_operators
For more details on how to create and publish Python packages,
see `Packaging Python Projects <https://packaging.python.org/tutorials/packaging-projects/>`_.
Adding directories to the path
------------------------------
You can specify additional directories to be added to ``sys.path`` using the
environment variable :envvar:`PYTHONPATH`. Start the Python shell with the path
to the root of your project by using the following command:
.. code-block:: bash
PYTHONPATH=/home/arch/projects/airflow_operators python
The ``sys.path`` variable will then look like this:
.. code-block:: pycon
>>> import sys
>>> from pprint import pprint
>>> pprint(sys.path)
['',
'/home/arch/projects/airflow_operators',
'/home/arch/.pyenv/versions/3.7.4/lib/python37.zip',
'/home/arch/.pyenv/versions/3.7.4/lib/python3.7',
'/home/arch/.pyenv/versions/3.7.4/lib/python3.7/lib-dynload',
'/home/arch/venvs/airflow/lib/python3.7/site-packages']
As we can see, the directory we provided has been added to the path. Let's
try to import the package now:
.. code-block:: pycon
>>> import airflow_operators
Hello from airflow_operators
>>>
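To check the effect of :envvar:`PYTHONPATH` without an interactive session, one option is to start a child interpreter with the variable set and read back its ``sys.path``. This is only a sketch; the helper name ``sys_path_with_pythonpath`` and the temporary directory standing in for a project root are assumptions made for the example.

```python
import json
import os
import subprocess
import sys
import tempfile


def sys_path_with_pythonpath(extra_dir):
    """Return the sys.path of a child interpreter started with PYTHONPATH=extra_dir."""
    env = dict(os.environ, PYTHONPATH=extra_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# A temporary directory stands in for a real project root.
project_root = tempfile.mkdtemp()
child_path = sys_path_with_pythonpath(project_root)
print(project_root in child_path)
```

The variable is set only for the child process, so the ``sys.path`` of the calling interpreter is left unchanged.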
We can also use the :envvar:`PYTHONPATH` variable with ``airflow`` commands.
For example, if we run the following command:
.. code-block:: bash
PYTHONPATH=/home/arch/projects/airflow_operators airflow info
We'll see that ``Python PATH`` now includes the :envvar:`PYTHONPATH`
value, as shown below:
.. code-block:: none
Python PATH: [/home/arch/venv/bin:/home/arch/projects/airflow_operators:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/arch/venv/lib/python3.8/site-packages:/home/arch/airflow/dags:/home/arch/airflow/config:/home/arch/airflow/plugins]
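When a module is not loaded as expected, it can help to ask Python where it would load the module from, without actually importing it. The following sketch uses ``importlib.util.find_spec`` from the standard library; ``locate_module`` is a hypothetical helper name. Note that namespace packages are reported as ``None`` here as well, because they have no single origin file.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised, for example, when a parent package of a dotted name is missing.
        return None
    return spec.origin if spec is not None else None


# Path of the standard library's json package.
print(locate_module("json"))
# None unless airflow_operators is installed or its parent directory is on sys.path.
print(locate_module("airflow_operators"))
```

Running this with and without :envvar:`PYTHONPATH` set shows whether the extra directory is the one that provides the module.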
.. _additional-modules-in-airflow:
Additional modules in Airflow
-----------------------------
Airflow adds three additional directories to ``sys.path``:
- DAGs folder: It is configured with the option ``dags_folder`` in the section ``[core]``.
- Config folder: By default, it is ``{AIRFLOW_HOME}/config``, so it is configured by setting the ``AIRFLOW_HOME`` variable.
- Plugins folder: It is configured with the option ``plugins_folder`` in the section ``[core]``.
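For illustration, the three directories can be derived from ``AIRFLOW_HOME`` when the default settings are in use. The sketch below assumes the defaults (``AIRFLOW_HOME`` falling back to ``~/airflow``, with ``dags`` and ``plugins`` subfolders) and does not read ``airflow.cfg``; ``default_airflow_extra_paths`` is a hypothetical helper, not an Airflow API.

```python
import os
from pathlib import Path


def default_airflow_extra_paths(airflow_home=None):
    """Return the default DAGs, config and plugins folders for an Airflow home.

    Assumption: ``dags_folder`` and ``plugins_folder`` keep their default
    values; overrides in ``airflow.cfg`` are not read here.
    """
    home = Path(airflow_home or os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()
    return [str(home / name) for name in ("dags", "config", "plugins")]


print(default_airflow_extra_paths("/home/arch/airflow"))
```

If ``dags_folder`` or ``plugins_folder`` is overridden in the configuration, the actual entries will differ from this list.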
You can also see the exact paths using the ``airflow info`` command,
and use them in the same way as directories specified with the environment variable
:envvar:`PYTHONPATH`. An example of the contents of ``sys.path``
reported by this command is shown below:
.. code-block:: none
Python PATH: [/home/rootcss/venvs/airflow/bin:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/rootcss/venvs/airflow/lib/python3.8/site-packages:/home/rootcss/airflow/dags:/home/rootcss/airflow/config:/home/rootcss/airflow/plugins]
.. seealso:: :ref:`plugins:loading`
Below is the sample output of the ``airflow info`` command:
.. code-block:: none
Apache Airflow: 2.0.0b3
System info
OS | Linux
architecture | x86_64
uname | uname_result(system='Linux', node='85cd7ab7018e', release='4.19.76-linuxkit', version='#1 SMP Tue May 26 11:42:35 UTC 2020', machine='x86_64', processor='')
locale | ('en_US', 'UTF-8')
python_version | 3.8.6 (default, Nov 25 2020, 02:47:44) [GCC 8.3.0]
python_location | /usr/local/bin/python
Tools info
git | git version 2.20.1
ssh | OpenSSH_7.9p1 Debian-10+deb10u2, OpenSSL 1.1.1d 10 Sep 2019
kubectl | NOT AVAILABLE
gcloud | NOT AVAILABLE
cloud_sql_proxy | NOT AVAILABLE
mysql | mysql Ver 8.0.22 for Linux on x86_64 (MySQL Community Server - GPL)
sqlite3 | 3.27.2 2019-02-25 16:06:06 bd49a8271d650fa89e446b42e513b595a717b9212c91dd384aab871fc1d0alt1
psql | psql (PostgreSQL) 11.9 (Debian 11.9-0+deb10u1)
Paths info
airflow_home | /root/airflow
system_path | /opt/bats/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
python_path | /usr/local/bin:/opt/airflow:/files/plugins:/usr/local/lib/python38.zip:/usr/local/lib/python3.8:/usr/
| local/lib/python3.8/lib-dynload:/usr/local/lib/python3.8/site-packages:/files/dags:/root/airflow/conf
| ig:/root/airflow/plugins
airflow_on_path | True
Config info
executor | LocalExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
sql_alchemy_conn | postgresql+psycopg2://postgres:airflow@postgres/airflow
dags_folder | /files/dags
plugins_folder | /root/airflow/plugins
base_log_folder | /root/airflow/logs
Providers info
apache-airflow-providers-amazon | 1.0.0b2
apache-airflow-providers-apache-cassandra | 1.0.0b2
apache-airflow-providers-apache-druid | 1.0.0b2
apache-airflow-providers-apache-hdfs | 1.0.0b2
apache-airflow-providers-apache-hive | 1.0.0b2