.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

Scaling Out with Celery
=======================

``CeleryExecutor`` is one of the ways you can scale out the number of
workers. For this to work, you need to set up a Celery backend
(**RabbitMQ**, **Redis**, ...) and change your ``airflow.cfg`` to point the
executor parameter to ``CeleryExecutor`` and provide the related Celery
settings.
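
For example, a minimal ``airflow.cfg`` sketch might look like the
following. The broker and result backend URLs below are placeholders;
substitute your own connection strings, and note that option names can
vary between Airflow versions:

.. code-block:: ini

    [core]
    # Dispatch tasks through Celery instead of running them locally
    executor = CeleryExecutor

    [celery]
    # Placeholder broker URL; point this at your RabbitMQ or Redis instance
    broker_url = redis://localhost:6379/0
    # Placeholder result backend; a database-backed backend is recommended
    result_backend = db+postgresql://airflow:airflow@localhost/airflow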

For more information about setting up a Celery broker, refer to the
exhaustive `Celery documentation on the topic <http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html>`_.

Here are a few essential requirements for your workers:

- ``airflow`` needs to be installed, and the CLI needs to be in the path
- Airflow configuration settings should be homogeneous across the cluster
- Operators that are executed on the worker need to have their dependencies
  met in that context. For example, if you use the ``HiveOperator``,
  the Hive CLI needs to be installed on that box, or if you use the
  ``MySqlOperator``, the required Python library needs to be available on
  the ``PYTHONPATH``
- The worker needs to have access to its ``DAGS_FOLDER``, and you need to
  synchronize the filesystems by your own means. A common setup is to
  store your ``DAGS_FOLDER`` in a Git repository and sync it across machines
  using Chef, Puppet, Ansible, or whatever you use to configure machines in
  your environment (a minimal sync sketch follows this list). If all your
  boxes have a common mount point, sharing your pipeline files there should
  work as well
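
As a sketch of that last point, a hypothetical cron job on each worker
could keep the pipeline files current (the repository path and branch
are placeholders):

.. code-block:: bash

    # Pull the latest DAG definitions into the shared DAGS_FOLDER
    cd /path/to/dags_folder && git pull --ff-only origin master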


To kick off a worker, you need to set up Airflow and start the worker
subcommand:

.. code-block:: bash

    airflow worker

Your worker should start picking up tasks as soon as they get fired in
its direction.
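
You can also restrict a worker to one or more named queues with the ``-q``
flag. For example, assuming a hypothetical queue called ``spark``:

.. code-block:: bash

    # Consume only tasks routed to the (hypothetical) "spark" queue
    airflow worker -q spark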

Note that you can also run "Celery Flower", a web UI built on top of Celery,
to monitor your workers. You can use the shortcut command ``airflow flower``
to start a Flower web server.
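
For example:

.. code-block:: bash

    # Start Flower; by default it serves its web UI on http://localhost:5555
    airflow flower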

Please note that you must already have the ``flower`` Python library
installed on your system. The recommended way is to install the Airflow
Celery bundle:

.. code-block:: bash

    pip install 'apache-airflow[celery]'


Some caveats:

- Make sure to use a database-backed result backend
- Make sure to set a visibility timeout in ``[celery_broker_transport_options]``
  that exceeds the ETA of your longest-running task (see the sketch after
  this list)
- Tasks can consume resources. Make sure your worker has enough resources
  to run ``worker_concurrency`` tasks
- Queue names are limited to 256 characters, but each broker backend might
  have its own restrictions
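
As a sketch of the visibility timeout caveat, assuming your longest task
could run for up to six hours (the value below is illustrative), you might
set:

.. code-block:: ini

    [celery_broker_transport_options]
    # Must exceed the ETA of your longest-running task; 21600 s = 6 hours
    visibility_timeout = 21600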