.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _build:build_image:
Building the image
==================
Before you dive deeply into the way the Airflow image is built, let us first explain why you might need
to build a custom container image, and show a few typical ways you can do it.
Quick start scenarios of image extending
----------------------------------------
The most common scenarios where you want to build your own image are adding a new ``apt`` package,
adding a new ``PyPI`` dependency and embedding DAGs into the image.
Example Dockerfiles for those scenarios are below. More complex cases, which might involve either
extending or customizing the image, are described further in this document, but if your goal is to
quickly extend the Airflow image with a new provider or package, here is a quick start for you.
Adding a new ``apt`` package
............................
The following example adds ``vim`` to the Airflow image. When adding packages via ``apt`` you should
switch to the ``root`` user when running the ``apt`` commands, but do not forget to switch back to the
``airflow`` user after installation is complete.
.. exampleinclude:: docker-examples/extending/add-apt-packages/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Adding a new ``PyPI`` package
.............................
The following example adds the ``lxml`` Python package from PyPI to the image. When adding packages via
``pip`` you need to use the ``airflow`` user rather than ``root``. Attempts to install ``pip`` packages
as ``root`` will fail with an appropriate error message.
.. exampleinclude:: docker-examples/extending/add-pypi-packages/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Embedding DAGs
..............
The following example adds ``test_dag.py`` to your image in the ``/opt/airflow/dags`` folder.
.. exampleinclude:: docker-examples/extending/embedding-dags/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
.. exampleinclude:: docker-examples/extending/embedding-dags/test_dag.py
:language: Python
:start-after: [START dag]
:end-before: [END dag]
Extending vs. customizing the image
-----------------------------------
You might want to know quickly whether you need to extend or customize the existing image
for Apache Airflow. This chapter gives you a short answer to that question.
Here is the comparison of the two approaches:
+------------------------------------------------+-----------+-------------+
|                                                | Extending | Customizing |
+================================================+===========+=============+
| Uses familiar 'FROM' pattern of image building | Yes       | No          |
+------------------------------------------------+-----------+-------------+
| Requires only basic knowledge about images     | Yes       | No          |
+------------------------------------------------+-----------+-------------+
| Builds quickly                                 | Yes       | No          |
+------------------------------------------------+-----------+-------------+
| Produces image heavily optimized for size      | No        | Yes         |
+------------------------------------------------+-----------+-------------+
| Can build from custom airflow sources (forks)  | No        | Yes         |
+------------------------------------------------+-----------+-------------+
| Can build on an air-gapped system              | No        | Yes         |
+------------------------------------------------+-----------+-------------+
TL;DR: If you need to build a custom image, it is easier to start with "Extending". However, if your
dependencies require compilation steps or you need to build the image from security-vetted
packages, switching to "Customizing" the image provides much more optimized images. For example,
if we compare equivalent images built by "Extending" and "Customizing", they end up being
1.1 GB and 874 MB respectively - a 20% improvement in size for the customized image.
.. note::
You can also combine both - customizing and extending the image in one. You can first build your
optimized base image using the ``customization`` method (for example by your admin team) with all
the heavy dependencies that require compilation, publish it in your registry and let others
``extend`` your image using ``FROM`` and add their own lightweight dependencies. This reflects well
the typical split where "casual" users will extend the image and "power users" will customize it.
Airflow Summit 2020's `Production Docker Image <https://youtu.be/wDr3Y7q2XoI>`_ talk provides more
details about the context, architecture and customization/extension methods for the Production Image.
Why customize the image?
-------------------------
The Apache Airflow community releases Docker images which are ``reference images`` for Apache Airflow.
However, Airflow has more than 60 community-managed providers (installable via extras) and some of the
default extras/providers installed are not used by everyone, sometimes other extras/providers
are needed, and sometimes (very often actually) you need to add your own custom dependencies,
packages or even custom providers.
In Kubernetes and Docker terms this means that you need another image with your specific requirements.
This is why you should learn how to build your own Docker (or more properly Container) image.
You might be tempted to use the ``reference image`` and dynamically install the new packages while
starting your containers, but this is a bad idea for multiple reasons - starting from the fragility of the build
and ending with the extra time needed to install those packages - which has to happen every time every
container starts. The only viable way to deal with new dependencies and requirements in production is to
build and use your own image. You should only install dependencies dynamically in
"hobbyist" and "quick start" scenarios when you want to iterate quickly to try things out, and later
replace it with your own images.
Building images primer
----------------------
.. note::
The ``Dockerfile`` does not strictly follow the `SemVer <https://semver.org/>`_ approach of
Apache Airflow when it comes to features and backwards compatibility. While the Airflow code strictly
follows it, the ``Dockerfile`` is really a way to conveniently package Airflow using the standard container
approach. Occasionally there are some changes in the building process or in the entrypoint of the image
that require slight adaptation. Details of the changes and the adaptation needed can be found in the
:doc:`Changelog <changelog>`.
There are several typical scenarios that you will encounter, and here is a quick recipe for how to achieve
your goal. You can read further to understand the details, but for the simple cases using
typical tools, here are the simple examples.
In the simplest case building your image consists of those steps:
1) Create your own ``Dockerfile`` (name it ``Dockerfile``) where you add:
* information about what your image should be based on (for example ``FROM apache/airflow:|airflow-version|-python3.8``)
* additional steps that should be executed in your image (typically in the form of ``RUN <command>``)
2) Build your image. This can be done with the ``docker`` CLI tools, and the examples below assume ``docker`` is used.
There are other tools like ``kaniko`` or ``podman`` that allow you to build the image, but ``docker`` is
so far the most popular and developer-friendly tool out there. A typical way of building the image looks
as follows (``my-image:0.0.1`` is the custom tag of your image containing the version).
In case you use some kind of registry from which the image will be used, it is usually named
in the form of ``registry/image-name``. The name of the image also has to be configured for the deployment
method your image will be deployed with. This can be set for example as the image name in the
`docker-compose file <running-airflow-in-docker>`_ or in the `Helm chart <helm-chart>`_.
.. code-block:: shell
docker build . -f Dockerfile --pull --tag my-image:0.0.1
3) [Optional] Test the image. Airflow contains a tool that allows you to test the image. This step, however,
requires locally checked out or extracted Airflow sources. If you happen to have the sources you can
test the image by running this command (in the Airflow root folder). The output will tell you if the image
is "good-to-go".
.. code-block:: shell
./scripts/ci/tools/verify_docker_image.sh PROD my-image:0.0.1
4) Once you build the image locally, you usually have several options to make it available for your deployment:
* For ``docker-compose`` deployment, if you've already built your image, and want to continue
building the image manually when needed with ``docker build``, you can edit the
``docker-compose.yaml`` file and replace the "apache/airflow:<version>" image with the
image you've just built ``my-image:0.0.1`` - it will be used from your local Docker
Engine cache. You can also simply set the ``AIRFLOW_IMAGE_NAME`` variable to
point to your image and ``docker-compose`` will use it automatically without you having
to modify the file (see the sketch after this list).
* Also for ``docker-compose`` deployment, you can delegate the image building to docker-compose.
To do that, open your ``docker-compose.yaml`` file and search for the phrase "In order to add custom dependencies".
Follow the instructions there to comment out the "image" line and uncomment the "build" line.
This is a standard docker-compose feature and you can read about it in the
`Docker Compose build reference <https://docs.docker.com/compose/reference/build/>`_.
Run ``docker-compose build`` to build the images. As in the previous case, the
image is stored in the Docker engine cache and Docker Compose will use it from there.
Under the hood, the ``docker-compose build`` command uses the same ``docker build`` command that
you can run manually.
* For some development-targeted Kubernetes deployments you can load the images directly into a
Kubernetes cluster. Clusters such as ``kind`` or ``minikube`` have a dedicated ``load`` command to load
images into the cluster (see the sketch after this list).
* Last but not least - you can push your image to a remote registry, which is the most common way
of storing and exposing images, and it is the most portable way of publishing the image. Both
Docker Compose and Kubernetes can make use of images exposed via registries.
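Below is a minimal sketch of those options, assuming you have built ``my-image:0.0.1`` locally as above
(``registry.example.com`` is only an illustrative placeholder for your registry):

.. code-block:: bash

    # Use the locally built image with docker-compose without editing the file
    AIRFLOW_IMAGE_NAME=my-image:0.0.1 docker-compose up -d

    # Load the local image into a development Kubernetes cluster
    kind load docker-image my-image:0.0.1
    minikube image load my-image:0.0.1

    # Push the image to a remote registry
    docker tag my-image:0.0.1 registry.example.com/my-image:0.0.1
    docker push registry.example.com/my-image:0.0.1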
Extending the image
-------------------
Extending the image is easiest if you just need to add some dependencies that do not require
compiling. The compilation framework of Linux (so-called ``build-essential``) is pretty big, and
for the production images, size is a really important factor to optimize for, so our Production Image
does not contain ``build-essential``. If you need a compiler like ``gcc`` or ``g++``, or tools like
``make``/``cmake`` - those are not found in the image and it is recommended that you follow the
"customize" route instead.
Extending the image works the way you are most likely familiar with - simply
build a new image using the Dockerfile's ``FROM`` directive and add whatever you need. You can add your
Debian dependencies with ``apt``, your PyPI dependencies with ``pip install``, or anything else you need,
as in the sketch below.
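A minimal sketch of extending the image (the Airflow version, ``vim`` and ``lxml`` are only illustrative;
adjust them to your needs):

.. code-block:: bash

    # Write a small Dockerfile that extends the reference image and build it
    cat > Dockerfile <<'EOF'
    FROM apache/airflow:2.3.0
    USER root
    RUN apt-get update \
      && apt-get install -y --no-install-recommends vim \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists/*
    USER airflow
    RUN pip install --no-cache-dir lxml
    EOF
    docker build . --pull --tag my-image:0.0.1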
Base images
...........
There are two types of images you can extend your image from:
1) The regular Airflow image, which contains the most common extras and providers, and all supported backend
database clients for the AMD64 platform (and Postgres only for the ARM64 platform).
2) The slim Airflow image, which is a minimal image that contains all supported backend database clients
for the AMD64 platform (and Postgres only for the ARM64 platform), but contains no extras or providers except
the 4 default providers.
.. note:: Differences between the slim image and the regular image.
The slim image is small compared to the regular image (~500 MB vs. ~1.1 GB) and you might need to add a
lot more packages and providers in order to make it useful for your case (but if you use only a
small subset of providers, it might be a good starting point for you).
The slim images might have dependencies in different versions than those used when providers are
preinstalled, simply because core Airflow might have fewer limits on the versions on its own.
When you install some providers, they might require downgrading some dependencies if the providers
require different limits for the same dependencies.
Naming conventions for the images:
+----------------+------------------+---------------------------------+--------------------------------------+
| Image          | Python           | Standard image                  | Slim image                           |
+================+==================+=================================+======================================+
| Latest default | 3.7              | apache/airflow:latest           | apache/airflow:slim-latest           |
+----------------+------------------+---------------------------------+--------------------------------------+
| Default        | 3.7              | apache/airflow:X.Y.Z            | apache/airflow:slim-X.Y.Z            |
+----------------+------------------+---------------------------------+--------------------------------------+
| Latest         | 3.7,3.8,3.9,3.10 | apache/airflow:latest-pythonN.M | apache/airflow:slim-latest-pythonN.M |
+----------------+------------------+---------------------------------+--------------------------------------+
| Specific       | 3.7,3.8,3.9,3.10 | apache/airflow:X.Y.Z-pythonN.M  | apache/airflow:slim-X.Y.Z-pythonN.M  |
+----------------+------------------+---------------------------------+--------------------------------------+
* The "latest" image is always the latest released stable version available.
.. spelling::
pythonN
Important notes for the base images
-----------------------------------
You should be aware of a few things:
* The production image of Airflow uses the ``airflow`` user, so if you want to add some tools
as the ``root`` user, you need to switch to it with the ``USER`` directive of the Dockerfile and switch back to
the ``airflow`` user when you are done. Also remember to follow the
`best practices of Dockerfiles <https://docs.docker.com/develop/develop-images/dockerfile_best-practices/>`_
to make sure your image is lean and small.
* The PyPI dependencies in Apache Airflow are installed in the user library of the ``airflow`` user, so
PIP packages are installed to the ``~/.local`` folder as if the ``--user`` flag was specified when running PIP.
Note also that using ``--no-cache-dir`` is a good idea that can help to make your image smaller.
.. note::
Only as of the ``2.0.1`` image is the ``--user`` flag turned on by default, by setting the ``PIP_USER`` environment
variable to ``true``. This can be disabled by unsetting the variable or by setting it to ``false``. In the
2.0.0 image you had to add the ``--user`` flag yourself, as in ``pip install --user``.
* If your apt or PyPI dependencies require some of the ``build-essential`` or other packages that are needed
to compile your Python dependencies, then your best choice is to follow the "Customizing the image" route,
because you can build a highly optimized (for size) image this way. However, it requires you to use
the Dockerfile that is released as part of the Apache Airflow sources (also available at
`Dockerfile <https://github.com/apache/airflow/blob/main/Dockerfile>`_).
* You can also embed your DAGs in the image by simply adding them with the ``COPY`` directive of the Dockerfile.
The DAGs in the production image are in the ``/opt/airflow/dags`` folder.
* You can build your image without any need for the Airflow sources. It is enough that you place the
``Dockerfile`` and any files that are referred to (such as DAG files) in a separate directory and run
the command ``docker build . --pull --tag my-image:my-tag`` (where ``my-image`` is the name you want to give
the image and ``my-tag`` is the tag you want to tag the image with).
* If your way of extending the image requires creating writable directories, you MUST remember to add a
``umask 0002`` step in your ``RUN`` command. This is necessary in order to accommodate our approach for
running the image with an arbitrary user. Such a user will always run with ``GID=0`` -
the entrypoint will prevent non-root GIDs. You can read more about it in the
:ref:`arbitrary docker user <arbitrary-docker-user>` documentation for the entrypoint. The
``umask 0002`` is set as default when you enter the image, so any directories you create by default
at runtime will have ``GID=0`` and will be group-writable (see the sketch below).
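A minimal sketch of the last point (the directory name ``/opt/airflow/my-writable-dir`` is only an example):

.. code-block:: bash

    cat > Dockerfile <<'EOF'
    FROM apache/airflow:2.3.0
    # Create a directory that remains group-writable when the image runs as an arbitrary user
    RUN umask 0002; \
        mkdir -p /opt/airflow/my-writable-dir
    EOF
    docker build . --pull --tag my-image:0.0.1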
.. note::
When you build an image for an Airflow version < ``2.1`` (for example 2.0.2 or 1.10.15), the image is built with
PIP 20.2.4, because ``PIP 21+`` is only supported for ``Airflow 2.1+``.
.. note::
Only as of ``2.0.2`` is the default group of the ``airflow`` user ``root``. Previously it was ``airflow``,
so if you are building your images based on an earlier image, you need to manually change the default
group of the ``airflow`` user:
.. code-block:: docker
RUN usermod -g 0 airflow
Examples of image extending
---------------------------
Example of customizing Airflow Provider packages
................................................
The :ref:`Airflow Providers <providers:community-maintained-providers>` are released independently of core
Airflow and sometimes you might want to upgrade only specific providers to fix some problems or
use features available in that provider version. Here is an example of how you can do it:
.. exampleinclude:: docker-examples/extending/custom-providers/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Example of adding Airflow Provider package and ``apt`` package
..............................................................
The following example adds the ``apache-spark`` Airflow provider, which requires both ``java`` and a
Python package from PyPI.
.. exampleinclude:: docker-examples/extending/add-providers/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Example of adding ``apt`` package
.................................
The following example adds ``vim`` to the Airflow image.
.. exampleinclude:: docker-examples/extending/add-apt-packages/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Example of adding ``PyPI`` package
..................................
The following example adds the ``lxml`` Python package from PyPI to the image.
.. exampleinclude:: docker-examples/extending/add-pypi-packages/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Example when writable directory is needed
.........................................
The following example adds a new directory that is supposed to be writable for any arbitrary user
running the container.
.. exampleinclude:: docker-examples/extending/writable-directory/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
Example when you add packages requiring compilation
...................................................
The following example adds the ``mpi4py`` package, which requires both ``build-essential`` and an MPI compiler.
.. exampleinclude:: docker-examples/extending/add-build-essential-extend/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
The size of this image is ~1.1 GB when built. As you will see further, you can achieve a 20% reduction in
the size of the image if you use "Customizing" rather than "Extending" the image.
Example when you want to embed DAGs
...................................
The following example adds ``test_dag.py`` to your image in the ``/opt/airflow/dags`` folder.
.. exampleinclude:: docker-examples/extending/embedding-dags/Dockerfile
:language: Dockerfile
:start-after: [START Dockerfile]
:end-before: [END Dockerfile]
.. exampleinclude:: docker-examples/extending/embedding-dags/test_dag.py
:language: Python
:start-after: [START dag]
:end-before: [END dag]
Customizing the image
---------------------
.. warning::
BREAKING CHANGE! As of Airflow 2.3.0 you need to use
`Buildkit <https://docs.docker.com/develop/develop-images/build_enhancements/>`_ to build a customized
Airflow Docker image. We are using new features of Buildkit (and the ``dockerfile:1.4`` syntax)
to make our image faster to build and "standalone" - i.e. not needing any extra files from
Airflow in order to be built. As of Airflow 2.3.0, the ``Dockerfile`` that is released with Airflow
does not need any extra folders or files and can be copied and used from any folder.
Previously you needed to copy Airflow sources together with the Dockerfile as some scripts were
needed to make it work. You also need to use ``DOCKER_CONTEXT_FILES`` build arg if you want to
use your own custom files during the build (see
:ref:`Using docker context files <using-docker-context-files>` for details).
.. note::
You can usually use the latest ``Dockerfile`` released by Airflow to build previous Airflow versions.
Note, however, that there are slight changes in the Dockerfile and entrypoint scripts that can make it
behave slightly differently, depending on which Dockerfile version you used. Details of what has changed
in each of the released versions of the Docker image can be found in the :doc:`Changelog <changelog>`.
Prerequisites for building a customized Docker image:
* You need to enable `Buildkit <https://docs.docker.com/develop/develop-images/build_enhancements/>`_ to
build the image. This can be done by setting ``DOCKER_BUILDKIT=1`` as an environment variable
or by installing `the buildx plugin <https://docs.docker.com/buildx/working-with-buildx/>`_
and running the ``docker buildx build`` command (see the sketch after this list).
* You need to have a recent version of Docker installed to handle the ``1.4`` syntax of the Dockerfile.
Docker version ``20.10.7`` and above is known to work.
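For example, either of the following invocations satisfies the Buildkit prerequisite
(a sketch; ``my-image:0.0.1`` is an example tag):

.. code-block:: bash

    # Option 1: enable Buildkit for the classic "docker build" command
    DOCKER_BUILDKIT=1 docker build . --tag my-image:0.0.1

    # Option 2: use the buildx plugin (assuming it is installed)
    docker buildx build . --tag my-image:0.0.1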
Before attempting to customize the image, you need to download the flexible and customizable ``Dockerfile``.
You can extract the officially released version of the Dockerfile from the
`released sources <https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-sources.html>`_.
You can also conveniently download the latest released version
`from GitHub <https://raw.githubusercontent.com/apache/airflow/|version|/Dockerfile>`_. You can save it
in any directory - there is no need for any other files to be present there. If you wish to use your own
files (for example a custom configuration of ``pip``, your own ``requirements`` file or custom dependencies),
you need to use the ``DOCKER_CONTEXT_FILES`` build arg and place the files in the directory pointed to by
the arg (see :ref:`Using docker context files <using-docker-context-files>` for details).
Customizing the image is an optimized way of adding your own dependencies to the image - better
suited to prepare highly optimized (for size) production images, especially when you have dependencies
that need to be compiled before installing (such as ``mpi4py``).
It also allows more sophisticated usages, needed by "power users" - for example using a forked version
of Airflow, or building the images from security-vetted sources.
The big advantage of this method is that it produces an optimized image even if you need some compile-time
dependencies that are not needed in the final image.
One disadvantage is that building the image takes longer and requires you to use
the Dockerfile that is released as part of the Apache Airflow sources.
Another disadvantage is that the pattern of building Docker images with ``--build-arg`` is less familiar
to developers of such images. However, it is quite well known to "power users". That's why the
customizing flow is better suited for those users who have more familiarity and have more custom
requirements.
The image also usually takes much longer to build than the equivalent "extended" image, because instead of
extending the layers that already come from the base image, it rebuilds the layers needed
to add extra dependencies at the early stages of image building.
When customizing the image you can choose a number of options for how to install Airflow:
* From the PyPI releases (default)
* From custom installation sources - using additional or replacing the original apt or PyPI repositories
* From local sources. This is used mostly during development.
* From a tag or branch, or a specific commit from a GitHub Airflow repository (or fork). This is particularly
useful when you build an image for a custom version of Airflow that you keep in your fork and you do not
want to release the custom Airflow version to PyPI.
* From locally stored binary packages for Airflow, Airflow Providers and other dependencies. This is
particularly useful if you want to build Airflow in a highly secure environment where all such packages
must be vetted by your security team and stored in your private artifact registry. This also
allows you to build the Airflow image in an air-gapped environment.
* Side note. Building ``Airflow`` in an ``air-gapped`` environment sounds pretty funny, doesn't it?
You can also add a range of customizations while building the image:
* base python image you use for Airflow
* version of Airflow to install
* extras to install for Airflow (or even removing some default extras)
* additional apt/python dependencies to use while building Airflow (DEV dependencies)
* add ``requirements.txt`` file to ``docker-context-files`` directory to add extra requirements
* additional apt/python dependencies to install for runtime version of Airflow (RUNTIME dependencies)
* additional commands and variables to set if needed during building or preparing Airflow runtime
* choosing constraint file to use when installing Airflow
Additional explanation is needed for the last point. Airflow uses constraints to make sure
that it can be predictably installed, even if some new versions of Airflow dependencies are
released (or even dependencies of our dependencies!). The Docker image and accompanying scripts
usually automatically determine the right versions of constraints to be used, based on the Airflow
version installed and the Python version. For example, version 2.0.2 of Airflow installed from PyPI
uses constraints from the ``constraints-2.0.2`` tag. However, in some cases - when installing Airflow from
GitHub for example - you have to manually specify the version of constraints used, otherwise
it will default to the latest version of the constraints, which might not be compatible with the
version of Airflow you use.
You can also download any version of the Airflow constraints, adapt it by manually setting your own
versions of dependencies, and use the version of constraints that you manually prepared, as in the
sketch below.
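A sketch of that flow could look as follows (the Airflow and Python versions used in the constraints URL
are illustrative; see :doc:`build-arg-ref` for the authoritative list of build args):

.. code-block:: bash

    mkdir -p docker-context-files
    # Download the published constraints for the Airflow/Python versions you target ...
    curl -L -o docker-context-files/constraints.txt \
      "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.0/constraints-3.7.txt"
    # ... optionally edit docker-context-files/constraints.txt to pin your own versions ...

    # ... and build using your local constraints file
    docker build . \
      --build-arg DOCKER_CONTEXT_FILES=docker-context-files \
      --build-arg AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/constraints.txt \
      --tag my-image:0.0.1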
You can read more about constraints in :doc:`apache-airflow:installation/installing-from-pypi`.
Note that if you place ``requirements.txt`` in the ``docker-context-files`` folder, it will be
used to install all requirements declared there. It is recommended that the file
pins the versions of the dependencies to add with the ``==`` version specifier, to achieve a
stable set of requirements, independent of whether someone releases a newer version. However, you have
to make sure to update those requirements and rebuild the images to account for the latest security fixes
(see the sketch below).
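A minimal sketch of this approach (the package names and versions are only illustrative):

.. code-block:: bash

    mkdir -p docker-context-files
    cat > docker-context-files/requirements.txt <<'EOF'
    lxml==4.9.1
    beautifulsoup4==4.11.1
    EOF
    docker build . \
      --build-arg DOCKER_CONTEXT_FILES=docker-context-files \
      --tag my-image:0.0.1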
Choosing Debian version when customizing the image
--------------------------------------------------
The reference Airflow image currently uses the ``bullseye`` version of Debian (also known as Debian 11) as the base
image, however when you want to build a custom image, you can also use the ``buster`` version of the base images.
Airflow supports both versions of Debian. You choose which version of Debian to use by choosing the
right version of the Python base image (a full build command is sketched below):
* ``--build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster"`` uses the buster version of Debian (Debian 10)
* ``--build-arg PYTHON_BASE_IMAGE="python:3.7-slim-bullseye"`` uses the bullseye version of Debian (Debian 11)
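For example, a full build command choosing the ``bullseye`` base image could look like this
(a sketch; the tag and Python version are illustrative):

.. code-block:: bash

    DOCKER_BUILDKIT=1 docker build . \
      --pull \
      --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-bullseye" \
      --tag my-image:0.0.1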
.. _using-docker-context-files:
Using docker-context-files
--------------------------
When customizing the image, you can optionally make Airflow install custom binaries or provide custom
configuration for your pip in ``docker-context-files``. In order to enable it, you need to add the
``--build-arg DOCKER_CONTEXT_FILES=docker-context-files`` build arg when you build the image.
You can pass any subdirectory of your Docker context; it will always be mapped to ``/docker-context-files``
during the build.
You can use ``docker-context-files`` for the following purposes:
* you can place ``requirements.txt`` with any ``pip`` packages you want to install in the
``docker-context-files`` folder. Those requirements will be automatically installed during the build.
.. exampleinclude:: docker-examples/customizing/own-requirements.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
* you can place ``pip.conf`` (and legacy ``.piprc``) in the ``docker-context-files`` folder and they
will be used for all ``pip`` commands (for example you can configure your own sources
or authentication mechanisms)
.. exampleinclude:: docker-examples/customizing/custom-pip.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
* you can place ``.whl`` packages that you downloaded and install them with
``INSTALL_PACKAGES_FROM_CONTEXT`` set to ``true``. This is useful if you build the image in
security-restricted environments (see :ref:`image-build-secure-environments` for details):
.. exampleinclude:: docker-examples/restricted/restricted_environments.sh
:language: bash
:start-after: [START download]
:end-before: [END download]
.. note::
You can also pass ``--build-arg DOCKER_CONTEXT_FILES=.`` if you want to place your ``requirements.txt``
in the main directory without creating a dedicated folder. However, it is good practice to keep any files
that you copy to the image context in a sub-folder. This makes it easier to separate things that
are used on the host from those that are passed in the Docker context. Of course, by default when you run
``docker build .`` the whole folder is available as the "Docker build context" and sent to the Docker
engine, but the ``DOCKER_CONTEXT_FILES`` are always copied to the ``build`` segment of the image, so
copying your whole local folder might unnecessarily increase the time needed to build the image, and your
cache will be invalidated every time any of the files in your local folder change.
.. warning::
BREAKING CHANGE! As of Airflow 2.3.0 you need to specify an additional flag:
``--build-arg DOCKER_CONTEXT_FILES=docker-context-files`` in order to use the files placed
in ``docker-context-files``. Previously that switch was not needed. Unfortunately this change is needed
in order to enable the ``Dockerfile`` to work as a standalone Dockerfile without any extra files. As of
Airflow 2.3.0 the ``Dockerfile`` that is released with Airflow does not need any extra folders or files and can
be copied and used from any folder. Previously you needed to copy the Airflow sources together with the
Dockerfile as some scripts were needed to make it work. With Airflow 2.3.0, we are using ``Buildkit``
features that enable us to make the ``Dockerfile`` a completely standalone file that can be used "as-is".
Examples of image customizing
-----------------------------
.. _image-build-pypi:
Building from PyPI packages
...........................
This is the basic way of building custom images.
The following example builds the production image with Python ``3.7`` and the latest PyPI-released Airflow,
with the default set of Airflow extras and dependencies. The latest PyPI-released Airflow constraints are used automatically.
.. exampleinclude:: docker-examples/customizing/stable-airflow.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
The following example builds the production image with Python ``3.7`` and default extras from the ``2.3.0`` Airflow
package. The ``2.3.0`` constraints are used automatically.
.. exampleinclude:: docker-examples/customizing/pypi-selected-version.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
The following example builds the production image with Python ``3.8``, additional Airflow extras
(``mssql,hdfs``) from the ``2.3.0`` PyPI package, and an additional dependency (``oauth2client``).
.. exampleinclude:: docker-examples/customizing/pypi-extras-and-deps.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
The following example adds the ``mpi4py`` package, which requires both ``build-essential`` and an MPI compiler.
.. exampleinclude:: docker-examples/customizing/add-build-essential-custom.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
The above image is equivalent to the "extended" image from the previous chapter, but its size is only
874 MB. Compared to the 1.1 GB of the "extended" image, this is about 230 MB less, so you can achieve a ~20%
improvement in the size of the image by using "customization" vs. extension. The savings can increase if you
have more complex dependencies to build.
.. _image-build-optimized:
Building optimized images
.........................
The following example builds the production image with Python ``3.7`` and additional Airflow extras from the ``2.0.2``
PyPI package, but it includes additional apt dev and runtime dependencies.
The dev dependencies are those that require ``build-essential`` and usually involve recompiling
some Python dependencies, so those packages might require some additional DEV dependencies to be
present during recompilation. Those packages are not needed at runtime, so we only install them for the
"build" time. They are not installed in the final image, thus producing much smaller images.
In this case ``pandas`` requires recompilation, so it also needs ``gcc`` and ``g++`` as dev APT dependencies.
The ``jre-headless`` package does not require recompiling, so it can be installed as a runtime APT dependency.
.. exampleinclude:: docker-examples/customizing/pypi-dev-runtime-deps.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
.. _image-build-github:
Building from GitHub
....................
This method is usually used for development purposes. But if you have your own fork, you can point
the build to your forked version of the source code without having to release it to PyPI. It is enough to have
a branch or tag in your repository and use the tag or branch in the URL that you point the installation to.
In the case of GitHub builds you need to pass the constraints reference manually if you want to use
specific constraints, otherwise the default ``constraints-main`` is used.
The following example builds the production image with Python ``3.7`` and default extras from the latest main version;
the constraints are taken from the latest version of the ``constraints-main`` branch in GitHub.
.. exampleinclude:: docker-examples/customizing/github-main.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
The following example builds the production image with default extras from the
latest ``v2-*-test`` version, and constraints are taken from the latest version of
the ``constraints-2-*`` branch in GitHub (for example the ``v2-2-test`` branch matches ``constraints-2-2``).
Note that this command might fail occasionally, as only the "released version" constraints (when building a
released version) and the "main" constraints (when building main) are guaranteed to work.
.. exampleinclude:: docker-examples/customizing/github-v2-2-test.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
You can also specify another repository to build from. If you also want to use a different constraints
repository source, you must specify it as an additional ``CONSTRAINTS_GITHUB_REPOSITORY`` build arg.
The following example builds the production image using the ``potiuk/airflow`` fork of Airflow, and constraints
are also downloaded from that repository.
.. exampleinclude:: docker-examples/customizing/github-different-repository.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
.. _image-build-custom:
Using custom installation sources
.................................
You can customize more aspects of the image - such as additional commands executed before apt dependencies
are installed, or adding extra sources to install your dependencies from. You can see all the arguments
described below, but here is an example of a rather complex command to customize the image,
based on the example in `this comment <https://github.com/apache/airflow/issues/8605#issuecomment-690065621>`_:
In case you need to use your custom PyPI package indexes, you can also customize the PyPI sources used during
the image build by adding a ``docker-context-files/pip.conf`` file when building the image.
This ``pip.conf`` will not be committed to the repository (it is added to ``.gitignore``) and it will not be
present in the final production image. It is added and used only in the build segment of the image.
Therefore this ``pip.conf`` file can safely contain the list of package indexes you want to use, and the
usernames and passwords used for authentication (see the sketch below). More details about the ``pip.conf``
file can be found in the `pip configuration <https://pip.pypa.io/en/stable/topics/configuration/>`_ documentation.
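A minimal sketch of such a ``pip.conf`` (the index URL and credentials are placeholders):

.. code-block:: bash

    mkdir -p docker-context-files
    cat > docker-context-files/pip.conf <<'EOF'
    [global]
    index-url = https://username:password@pypi.example.com/simple/
    EOF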
If you used the ``.piprc`` file before (some older versions of ``pip`` used it for customization), you can put it
in the ``docker-context-files/.piprc`` file and it will be automatically copied to the ``HOME`` directory
of the ``airflow`` user.
Note that those customizations are only available in the ``build`` segment of the Airflow image and they
are not present in the ``final`` image. If you wish to extend the final image and add custom ``.piprc`` and
``pip.conf``, you should add them in your own Dockerfile used to extend the Airflow image.
Such customizations are independent of the way Airflow is installed.
.. note::
Similar results could be achieved by modifying the Dockerfile manually (see below) and injecting the
commands needed, but by specifying the customizations via build args, you avoid the need to
synchronize the changes from future Airflow Dockerfiles. Those customizations should work with
future versions of Airflow's official ``Dockerfile`` with at most minimal modifications of parameter
names (if any), so using the build command for your customizations makes your custom image more
future-proof.
The following - rather complex - example shows the capabilities of:
* Adding Airflow extras (``slack``, ``odbc``)
* Adding PyPI dependencies (``azure-storage-blob, oauth2client, beautifulsoup4, dateparser, rocketchat_API, typeform``)
* Adding custom environment variables while installing ``apt`` dependencies - both DEV and RUNTIME
(``ACCEPT_EULA=Y``)
* Adding a custom curl command for adding keys and configuring additional apt sources needed to install
``apt`` dependencies (both DEV and RUNTIME)
* Adding custom ``apt`` dependencies, both DEV (``msodbcsql17 unixodbc-dev g++``) and runtime (``msodbcsql17 unixodbc git procps vim``)
.. exampleinclude:: docker-examples/customizing/custom-sources.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
.. _image-build-secure-environments:
Build images in security restricted environments
................................................
You can also make sure your image is only built using a local constraints file and locally downloaded
wheel files. This is often useful in enterprise environments where the binary files are verified and
vetted by the security teams. It is also the most complex way of building the image. You should be an
expert in building and using Dockerfiles, and have specific security needs, in order to follow that route.
The build below builds the production image with packages and constraints used from the local
``docker-context-files`` folder rather than installed from PyPI or GitHub. It also disables MySQL client
installation as it uses an external installation method.
Note that as a prerequisite you need to have downloaded the wheel files. In the example below we
first download such a constraints file locally and then use ``pip download`` to get the ``.whl`` files needed,
but in the most likely scenario those wheel files should be copied from an internal repository of such ``.whl``
files. Note that ``AIRFLOW_VERSION_SPECIFICATION`` is only there for reference; the Apache Airflow ``.whl`` file
in the right version is part of the ``.whl`` files downloaded.
Note that ``pip download`` will only work on a Linux host, as some of the packages need to be compiled from
sources and you cannot download them by providing the ``--platform`` switch. They also need to be downloaded using
the same Python version as the target image.
The ``pip download`` step might happen in a separate environment. The files can be committed to a separate
binary repository and vetted/verified by the security team and used subsequently to build images
of Airflow when needed on an air-gapped system.
Example of preparing the constraints file and wheel files. Note that the ``mysql`` dependency is removed,
as ``mysqlclient`` is installed from Oracle's ``apt`` repository; if you want to add it, you need
to provide this library from your own repository if you want to build the Airflow image in an "air-gapped" system.
.. exampleinclude:: docker-examples/restricted/restricted_environments.sh
:language: bash
:start-after: [START download]
:end-before: [END download]
After this step is finished, your ``docker-context-files`` folder will contain all the packages that
are needed to install Airflow from.
Those downloaded packages and the constraints file can be pre-vetted by your security team before you attempt
to build the image. You can also store those downloaded binary packages in your private artifact registry,
which allows for a flow where you download the packages on one machine, submit only new packages for
security vetting, and only use the new packages once they have been vetted.
On a separate (air-gapped) system, all the PyPI packages can be copied to ``docker-context-files``,
where you can build the image using the downloaded packages by passing these build args:
* ``INSTALL_PACKAGES_FROM_CONTEXT="true"`` - to use packages present in ``docker-context-files``
* ``AIRFLOW_PRE_CACHED_PIP_PACKAGES="false"`` - to not pre-cache packages from PyPI when building image
* ``AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/YOUR_CONSTRAINT_FILE.txt`` - to use the downloaded constraints file
* (Optional) ``INSTALL_MYSQL_CLIENT="false"`` if you do not want to install ``MySQL``
client from the Oracle repositories.
* (Optional) ``INSTALL_MSSQL_CLIENT="false"`` if you do not want to install ``MsSQL``
client from the Microsoft repositories.
* (Optional) ``INSTALL_POSTGRES_CLIENT="false"`` if you do not want to install ``Postgres``
client from the Postgres repositories.
Note that the solution we have for installing Python packages from local packages only solves the problem
of "air-gapped" Python installation. The Docker image also downloads ``apt`` dependencies and ``node-modules``.
Those types of dependencies are, however, more likely to be available in your "air-gapped" system via transparent
proxies, so the build should automatically reach out to your private registries. However, in the future the
solution might be applied to both of those installation steps.
You can also use the techniques described in the previous chapter to make ``docker build`` use your private
apt sources or private PyPI repositories (via ``.pypirc``), which can be security-vetted.
If you fulfill all the criteria, you can build the image on an air-gapped system by running a command similar
to the one below:
.. exampleinclude:: docker-examples/restricted/restricted_environments.sh
:language: bash
:start-after: [START build]
:end-before: [END build]
Modifying the Dockerfile
........................
The build arg approach is a convenience method if you do not want to manually modify the ``Dockerfile``.
Our approach is flexible enough to be able to accommodate most requirements and
customizations out-of-the-box. When you use it, you do not need to worry about adapting the image every
time a new version of Airflow is released. However, sometimes it is not enough if you have very
specific needs and want to build a very custom image. In such a case you can simply modify the
``Dockerfile`` manually as you see fit and store it in your forked repository. However, you will have to
make sure to rebase your changes whenever a new version of Airflow is released, because we might modify
the approach of our Dockerfile builds in the future and you might need to resolve conflicts
and rebase your changes.
There are a few things to remember when you modify the ``Dockerfile``:
* We are using the widely recommended pattern of ``.dockerignore`` where everything is ignored by default
and only the required folders are added back through exclusions (``!``). This allows keeping the Docker context
small, because there are many binary artifacts generated in the sources of Airflow, and if they were added to
the context, the time to build the image would increase significantly. If you want any new
folders to be available in the image, you must add them here with a leading ``!``
.. code-block:: text
# Ignore everything
**
# Allow only these directories
!airflow
...
* The ``docker-context-files`` folder is automatically added to the context of the image, so if you want
to add individual files, binaries, requirements files etc., you can add them there. The
``docker-context-files`` folder is copied to the ``/docker-context-files`` folder of the build segment of the
image, so it is not present in the final image - which makes the final image smaller in case you want
to use those files only in the ``build`` segment. If you want the files in your final image (in the main
image segment), you must copy them from the directory manually, using the ``COPY`` command.
More details
------------
Build Args reference
....................
The detailed ``--build-arg`` reference can be found in :doc:`build-arg-ref`.
The architecture of the images
..............................
You can read more details about the images - the context, their parameters and internal structure in the
`IMAGES.rst <https://github.com/apache/airflow/blob/main/IMAGES.rst>`_ document.