.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Regular development tasks
=========================
The regular Breeze development tasks are available as top-level commands. Those tasks are used most
often during development, which is why they are available without any sub-command. More advanced
commands are grouped into sub-commands.
.. contents:: :local:
Entering Breeze shell
---------------------
This is the most often used feature of breeze. It simply allows you to enter the shell inside the Breeze
development environment (inside the Breeze container).
You can use additional ``breeze`` flags to choose your environment. You can specify the Python
version to use and the backend (the metadata database). Thanks to that, with Breeze, you can recreate the same
environments as we have in the matrix builds in CI. See the next chapter for backend selection.
For example, you can choose to run Python 3.8 tests with MySQL as the backend and with MySQL version 8.0
as follows:
.. code-block:: bash
breeze --python 3.8 --backend mysql --mysql-version 8.0
.. note:: Note for Windows WSL2 users
You may encounter error messages such as:
.. code-block:: bash
Current context is now "..."
protocol not available
Error 1 returned
Try adding ``--builder=default`` to your command. For example:
.. code-block:: bash
breeze --builder=default --python 3.8 --backend mysql --mysql-version 8.0
The choices you make are persisted in the ``./.build/`` cache directory so that the next time you use the
``breeze`` script, it can reuse the values that were used previously. This way you do not have to specify
them when you run the script. You can delete the ``.build/`` directory if you want to restore the
default settings.
You can also run breeze with ``SKIP_SAVING_CHOICES`` set to a non-empty value, in which case the breeze invocation
will not save the used values to the cache - this is useful when you run non-interactive scripts with ``breeze shell``
and want to - for example - force the Python version used only for that execution without changing the Python version
that was used last time.
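For example, a non-interactive invocation might look like this (the Python version and the command run here are just an illustration):

.. code-block:: bash

    SKIP_SAVING_CHOICES="true" breeze shell --python 3.9 "airflow version"
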
The parameters whose values can be stored persistently in the cache are marked with >VALUE<
in the help of the commands (for example in the output of ``breeze config --help``).
Selecting Backend
-----------------
When you run breeze commands, you can additionally select which backend you want to use. Currently Airflow
supports SQLite, MySQL and Postgres as backends - MySQL and Postgres are supported in various versions.
You can choose which backend to use by adding the ``--backend`` flag, and you can additionally select the
version of the backend if you want to start a different one (for example for ``--backend postgres``
you can specify ``--postgres-version 13`` to start Postgres 13). The ``--help`` option of breeze commands
will show you which backends are supported and which versions are available for each backend.
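For example, to enter the Breeze shell with Postgres 13 as the backend:

.. code-block:: bash

    breeze --backend postgres --postgres-version 13
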
The choices you make for backend and version are ``sticky`` - the last used selection is cached in the
``.build`` folder and the next time you run any of the ``breeze`` commands that use a backend, they will use the
last selected backend and version.
.. note::
You can also (temporarily, for the time of running a single command) override the backend version
used via the ``BACKEND_VERSION`` environment variable. This is used mostly in CI, where we have a common way of
running tests for all backends and we want to specify different parameters. In order to override the
backend version, it has to be a valid version for the backend you are using. For example, if you set
``BACKEND_VERSION`` to ``13`` and you are using ``--backend postgres``, Postgres 13 will be used, but
if you set ``BACKEND_VERSION`` to ``8.0`` and you are using ``--backend postgres``, the last used Postgres
version will be used.
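For example, an override like the following (a sketch of how it might look in a CI-style invocation) would start Postgres 13 regardless of the cached choice:

.. code-block:: bash

    BACKEND_VERSION="13" breeze --backend postgres
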
Breeze will inform you at startup which backend and version it is using:
.. raw:: html
<div align="center">
<img src="images/version_information.png" width="640" alt="Version information printed by Breeze">
</div>
Port Forwarding
---------------
When you run Airflow Breeze, the following ports are automatically forwarded:
.. code-block::
* 12322 -> forwarded to Airflow ssh server -> airflow:22
* 28080 -> forwarded to Airflow webserver -> airflow:8080
* 25555 -> forwarded to Flower dashboard -> airflow:5555
* 25433 -> forwarded to Postgres database -> postgres:5432
* 23306 -> forwarded to MySQL database -> mysql:3306
* 26379 -> forwarded to Redis broker -> redis:6379
You can connect to these ports/databases using:
.. code-block::
* ssh connection for remote debugging: ssh -p 12322 airflow@127.0.0.1 pw: airflow
* Webserver: http://127.0.0.1:28080
* Flower: http://127.0.0.1:25555
* Postgres: jdbc:postgresql://127.0.0.1:25433/airflow?user=postgres&password=airflow
* Mysql: jdbc:mysql://127.0.0.1:23306/airflow?user=root
* Redis: redis://127.0.0.1:26379/0
If you do not use the ``start-airflow`` command, you can start the webserver manually with
the ``airflow webserver`` command. You can use ``tmux`` to multiplex terminals.
You may need to create a user prior to running the webserver in order to log in.
This can be done with the following command:
.. code-block:: bash
airflow users create --role Admin --username admin --password admin --email admin@example.com --firstname foo --lastname bar
For databases, you need to run ``airflow db reset`` at least once (or run some tests) after you start
Airflow Breeze to get the database/tables created. You can connect to the databases with your IDE or any other
database client:
.. raw:: html
<div align="center">
<img src="images/database_view.png" width="640" alt="Airflow Breeze - Database view">
</div>
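For example, assuming you have the ``psql`` client installed on your host, you can connect to the Postgres backend like this (the password is ``airflow``):

.. code-block:: bash

    psql -h 127.0.0.1 -p 25433 -U postgres airflow
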
You can change the used host port numbers by setting appropriate environment variables:
* ``SSH_PORT``
* ``WEBSERVER_HOST_PORT``
* ``POSTGRES_HOST_PORT``
* ``MYSQL_HOST_PORT``
* ``MSSQL_HOST_PORT``
* ``FLOWER_HOST_PORT``
* ``REDIS_HOST_PORT``
If you set these variables, the new ports should be in effect the next time you enter the environment.
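For example, to expose the webserver on a different host port (the port number here is arbitrary):

.. code-block:: bash

    export WEBSERVER_HOST_PORT=28081
    breeze
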
Building the documentation
--------------------------
To build documentation in Breeze, use the ``build-docs`` command:
.. code-block:: bash
breeze build-docs
Results of the build can be found in the ``docs/_build`` folder.
The documentation build consists of three steps:
* verifying consistency of indexes
* building documentation
* spell checking
You can run only one of the last two stages by providing the ``--spellcheck-only`` or ``--docs-only`` flag:
.. code-block:: bash
breeze build-docs --spellcheck-only
This process can take some time, so you can shorten it by filtering by package, using the package's
short ``provider id`` (you can specify more than one).
.. code-block:: bash
breeze build-docs <provider id> <provider id>
or you can use a package filter. The filters are glob patterns matching full
package names and can be used to select more than one package with a single filter.
.. code-block:: bash
breeze build-docs --package-filter apache-airflow-providers-*
Errors during documentation generation often come from the docstrings of auto-api generated classes.
During the docs build, the auto-api generated files are stored in the ``docs/_api`` folder. This helps you
easily identify the location the documentation problems originated from.
These are all available flags of ``build-docs`` command:
.. image:: ./images/output_build-docs.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_build-docs.svg
:width: 100%
:alt: Breeze build documentation
While you can use the full name of a doc package starting with ``apache-airflow-providers-`` in the package filter,
you can also use the shorthand version - just take the remaining part and replace every dash (``-``) with
a dot (``.``).
Example:
If the provider name is ``apache-airflow-providers-cncf-kubernetes``, the shorthand is ``cncf.kubernetes``.
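For example, both of the following invocations should build the documentation for the same provider package:

.. code-block:: bash

    breeze build-docs cncf.kubernetes
    breeze build-docs --package-filter apache-airflow-providers-cncf-kubernetes
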
Note: For building docs for the apache-airflow-providers index, use ``apache-airflow-providers``
as the shorthand.
Running static checks
---------------------
You can run static checks via Breeze. You can also run them via the pre-commit command, but with auto-completion
Breeze makes it easier to run selective static checks. If you press <TAB> after ``static-checks`` and
have auto-completion set up, you should see the list of all available checks.
For example, the following command:
.. code-block:: bash
breeze static-checks --type mypy-airflow
will run the mypy check for currently staged files inside ``airflow/``, excluding providers.
Selecting files to run static checks on
---------------------------------------
By default, pre-commit checks run on the changes you have staged locally. They run on all the
files you have added with ``git add`` and ignore any changes that you have modified but not staged.
If you want to run the checks on all your modified files, you should stage them with the ``git add`` command first.
With ``--all-files`` you can run static checks on all files in the repository. This is useful when you
want to be sure they will not fail in CI, or when you have just rebased your changes and want to
re-run the latest pre-commits on them, but it can take a long time (a few minutes) to wait for the result.
.. code-block:: bash
breeze static-checks --type mypy-airflow --all-files
The above will run the mypy check for all files.
You can limit that by selecting specific files you want to run static checks on. You can do that by
specifying the ``--file`` flag (it can be used multiple times).
.. code-block:: bash
breeze static-checks --type mypy-airflow --file airflow/utils/code_utils.py --file airflow/utils/timeout.py
The above will run the mypy check for those two files (note: autocomplete should work for the file selection).
However, often you do not remember which files you modified and you want to run checks for files that belong
to specific commits you already have in your branch. You can use ``breeze static-checks`` to run the checks
only on changed files you have already committed to your branch - either for a specific commit, for the last
commit, for all changes in your branch since you branched off from main, or for a specific range
of commits you choose.
.. code-block:: bash
breeze static-checks --type mypy-airflow --last-commit
The above will run the mypy check for all files changed in the last commit in your branch.
.. code-block:: bash
breeze static-checks --type mypy-airflow --only-my-changes
The above will run the mypy check on the files changed in all commits added to your branch since you branched off from main.
.. code-block:: bash
breeze static-checks --type mypy-airflow --commit-ref 639483d998ecac64d0fef7c5aa4634414065f690
The above will run the mypy check for all files changed in the 639483d998ecac64d0fef7c5aa4634414065f690 commit.
Any ``commit-ish`` reference from Git will work here (branch, tag, short/long hash etc.)
.. code-block:: bash
breeze static-checks --type identity --verbose --from-ref HEAD^^^^ --to-ref HEAD
The above will run the check for the last 4 commits in your branch. You can use any ``commit-ish`` references
in ``--from-ref`` and ``--to-ref`` flags.
These are all available flags of ``static-checks`` command:
.. image:: ./images/output_static-checks.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_static-checks.svg
:width: 100%
:alt: Breeze static checks
.. note::
When you run static checks, some of the artifacts (the mypy cache) are stored in a docker-compose volume
so that they can speed up static check execution significantly. However, sometimes the cache might
get broken, in which case you should run ``breeze down`` to clean it up.
.. note::
You cannot change the Python version for static checks that are run within Breeze containers.
The ``--python`` flag has no effect on them. They are always run with the lowest supported Python version.
The main reason is to keep consistency in the results of static checks and to make sure that
our code is fine when running the lowest supported version.
Starting Airflow
----------------
For testing Airflow you often want to start multiple components (in multiple terminals). Breeze has
a built-in ``start-airflow`` command that starts the breeze container, launches multiple terminals using tmux
and launches all the necessary Airflow components in those terminals.
When you start airflow from local sources, the www assets are compiled automatically beforehand.
.. code-block:: bash
breeze --python 3.8 --backend mysql start-airflow
You can also use it to start a different executor.
.. code-block:: bash
breeze start-airflow --executor CeleryExecutor
You can also use it to start any released version of Airflow from ``PyPI`` with the
``--use-airflow-version`` flag - useful for testing and looking at issues raised for a specific version.
.. code-block:: bash
breeze start-airflow --python 3.8 --backend mysql --use-airflow-version 2.7.0
When you install a version from PyPI, it is also possible to specify extras that should be used
when installing Airflow - you can provide several extras separated by commas - for example to install
providers together with the Airflow version you are installing. For example, when you use the celery executor
in Airflow 2.7.0+ you need to add the ``celery`` extra.
.. code-block:: bash
breeze start-airflow --use-airflow-version 2.7.0 --executor CeleryExecutor --airflow-extras celery
These are all available flags of ``start-airflow`` command:
.. image:: ./images/output_start-airflow.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_start-airflow.svg
:width: 100%
:alt: Breeze start-airflow
Launching multiple terminals in the same environment
----------------------------------------------------
Often, if you want to run full airflow in the Breeze environment, you need to launch multiple terminals and
run ``airflow webserver``, ``airflow scheduler``, ``airflow worker`` in separate terminals.
This can be achieved either via ``tmux`` or via exec-ing into the running container from the host. Tmux
is installed inside the container and you can launch it with the ``tmux`` command. Tmux provides you with the
capability of creating multiple virtual terminals and multiplexing between them. More about ``tmux`` can be
found at the `tmux GitHub wiki page <https://github.com/tmux/tmux/wiki>`_ . Tmux has several useful shortcuts
that allow you to split the terminals, open new tabs, etc. - it is well worth learning.
Another way is to exec into the Breeze container from the host's terminal. Often you
have multiple terminals on the host (Linux/MacOS/WSL2 on Windows) and you can simply use those terminals
to enter the running container. It's as easy as launching ``breeze exec`` while the
Breeze environment is already started. You will be dropped into bash and environment variables will be read in the same
way as when you enter the environment. You can do it multiple times and open as many terminals as you need.
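For example, a typical workflow might look like this (a sketch; run each pair of commands in a separate host terminal once the Breeze environment is up):

.. code-block:: bash

    # first host terminal: enter the container and start the scheduler
    breeze exec
    airflow scheduler

    # second host terminal: enter the container and start the webserver
    breeze exec
    airflow webserver
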
These are all available flags of ``exec`` command:
.. image:: ./images/output_exec.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_exec.svg
:width: 100%
:alt: Breeze exec
Compiling www assets
--------------------
The Airflow webserver needs the www assets to be prepared - compiled with node and yarn. The ``compile-www-assets``
command takes care of it. This is needed when you want to run the webserver inside breeze.
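To compile the assets, run:

.. code-block:: bash

    breeze compile-www-assets
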
.. image:: ./images/output_compile-www-assets.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_compile-www-assets.svg
:width: 100%
:alt: Breeze compile-www-assets
Breeze cleanup
--------------
Sometimes you need to clean up your docker environment (and it is recommended you do that regularly). There
are several reasons why you might want to do that.
Breeze uses docker images heavily. Those images are rebuilt periodically and might leave dangling, unused
images in the docker cache. This might cause extra disk usage. Also, running various docker compose commands
(for example running tests with ``breeze testing tests``) might create additional docker networks that might
prevent new networks from being created. Those networks are not removed automatically by docker-compose.
Also, Breeze uses its own cache to keep information about all images.
All those unused images, networks and cache can be removed by running the ``breeze cleanup`` command. By default
it will not remove the most recent images that you might need to run breeze commands, but you
can also remove those breeze images to clean up everything by adding the ``--all`` flag (note that you will
need to build the images again from scratch - pulling from the registry might take a while).
Breeze will ask you to confirm each step, unless you specify the ``--answer yes`` flag.
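For example:

.. code-block:: bash

    # remove unused images, networks and cache
    breeze cleanup

    # also remove the breeze images themselves
    breeze cleanup --all
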
These are all available flags of ``cleanup`` command:
.. image:: ./images/output_cleanup.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_cleanup.svg
:width: 100%
:alt: Breeze cleanup
Database volumes in Breeze
--------------------------
Breeze keeps the data for all its integrations in named docker volumes. Each backend and integration
keeps data in its own volume. Those volumes are persisted until the ``breeze down`` command is run.
You can preserve the volumes by adding the ``--preserve-volumes`` flag when you run the command.
Then, the next time you start Breeze, it will have the data pre-populated.
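For example, to stop the environment while keeping the database data:

.. code-block:: bash

    breeze down --preserve-volumes
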
These are all available flags of ``down`` command:
.. image:: ./images/output_down.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_down.svg
:width: 100%
:alt: Breeze down
Running arbitrary commands in container
---------------------------------------
A more sophisticated way of using the breeze shell is the ``breeze shell`` command - it has more parameters
and you can also use it to execute arbitrary commands inside the container.
.. code-block:: bash
breeze shell "ls -la"
These are all available flags of ``shell`` command:
.. image:: ./images/output_shell.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_shell.svg
:width: 100%
:alt: Breeze shell
Running Breeze with Metrics
---------------------------
Running Breeze with a StatsD Metrics Stack
..........................................
You can launch an instance of Breeze pre-configured to emit StatsD metrics using
``breeze start-airflow --integration statsd``. This will launch an Airflow webserver
within the Breeze environment as well as containers running StatsD, Prometheus, and
Grafana. The integration configures the "Targets" in Prometheus, the "Datasources" in
Grafana, and includes a default dashboard in Grafana.
When you run Airflow Breeze with this integration, in addition to the standard ports
(See "Port Forwarding" below), the following are also automatically forwarded:
* 29102 -> forwarded to StatsD Exporter -> breeze-statsd-exporter:9102
* 29090 -> forwarded to Prometheus -> breeze-prometheus:9090
* 23000 -> forwarded to Grafana -> breeze-grafana:3000
You can connect to these ports/databases using:
* StatsD Metrics: http://127.0.0.1:29102/metrics
* Prometheus Targets: http://127.0.0.1:29090/targets
* Grafana Dashboards: http://127.0.0.1:23000/dashboards
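For example, once the stack is up, you can quickly verify that metrics are flowing by fetching the StatsD Exporter endpoint from your host:

.. code-block:: bash

    curl http://127.0.0.1:29102/metrics
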
Running Breeze with an OpenTelemetry Metrics Stack
..................................................
----
[Work in Progress]
NOTE: This will launch the stack as described below but Airflow integration is
still a Work in Progress. This should be considered experimental and likely to
change by the time Airflow fully supports emitting metrics via OpenTelemetry.
----
You can launch an instance of Breeze pre-configured to emit OpenTelemetry metrics
using ``breeze start-airflow --integration otel``. This will launch Airflow within
the Breeze environment as well as containers running OpenTelemetry-Collector,
Prometheus, and Grafana. The integration handles all configuration of the
"Targets" in Prometheus and the "Datasources" in Grafana, so it is ready to use.
When you run Airflow Breeze with this integration, in addition to the standard ports
(See "Port Forwarding" below), the following are also automatically forwarded:
* 28889 -> forwarded to OpenTelemetry Collector -> breeze-otel-collector:8889
* 29090 -> forwarded to Prometheus -> breeze-prometheus:9090
* 23000 -> forwarded to Grafana -> breeze-grafana:3000
You can connect to these ports using:
* OpenTelemetry Collector: http://127.0.0.1:28889/metrics
* Prometheus Targets: http://127.0.0.1:29090/targets
* Grafana Dashboards: http://127.0.0.1:23000/dashboards
Running Breeze with OpenLineage
...............................
You can launch an instance of Breeze pre-configured to emit OpenLineage metrics using
``breeze start-airflow --integration openlineage``. This will launch an Airflow webserver
within the Breeze environment as well as containers running a `Marquez <https://marquezproject.ai/>`_
webserver and API server.
When you run Airflow Breeze with this integration, in addition to the standard ports
(See "Port Forwarding" below), the following are also automatically forwarded:
* MARQUEZ_API_HOST_PORT (default 25000) -> forwarded to Marquez API -> marquez:5000
* MARQUEZ_API_ADMIN_HOST_PORT (default 25001) -> forwarded to Marquez Admin API -> marquez:5001
* MARQUEZ_HOST_PORT (default 23100) -> forwarded to Marquez -> marquez_web:3000
You can connect to these services using:
* Marquez Webserver: http://127.0.0.1:23100
* Marquez API: http://127.0.0.1:25000/api/v1
* Marquez Admin API: http://127.0.0.1:25001
Make sure to substitute the port numbers if you have customized them via the above env vars.
Stopping the environment
------------------------
After starting up, the environment runs in the background and takes up quite some memory, which you might
want to free for other things you are running on your host.
You can always stop it via:
.. code-block:: bash
breeze down
These are all available flags of ``down`` command:
.. image:: ./images/output_down.svg
:target: https://raw.githubusercontent.com/apache/airflow/main/dev/breeze/images/output_down.svg
:width: 100%
:alt: Breeze down
Using local virtualenv environment in Your Host IDE
---------------------------------------------------
You can set up your host IDE (for example, IntelliJ's PyCharm/Idea) to work with Breeze
and benefit from all the features provided by your IDE, such as local and remote debugging,
language auto-completion, documentation support, etc.
To use your host IDE with Breeze:
1. Create a local virtual environment:
You can use any of the following wrappers to create and manage your virtual environments:
`pyenv <https://github.com/pyenv/pyenv>`_, `pyenv-virtualenv <https://github.com/pyenv/pyenv-virtualenv>`_,
or `virtualenvwrapper <https://virtualenvwrapper.readthedocs.io/en/latest/>`_.
2. Use the right command to activate the virtualenv (``workon`` if you use virtualenvwrapper or
``pyenv activate`` if you use pyenv).
3. Initialize the created local virtualenv:
.. code-block:: bash
./scripts/tools/initialize_virtualenv.py
.. warning::
Make sure that you use the right Python version in this command - matching the Python version you have
in your local virtualenv. If you don't, you will get strange conflicts.
4. Select the virtualenv you created as the project's default virtualenv in your IDE.
Note that you can also use the local virtualenv for Airflow development without Breeze.
This is a lightweight solution that has its own limitations.
More details on using the local virtualenv are available in the
`Local Virtualenv <../../../contributing-docs/07_local_virtualenv.rst>`_ document.
------
Next step: Follow the `Troubleshooting <04_troubleshooting.rst>`_ guide to troubleshoot your Breeze environment.