Adds automated user creation in production image (#13728)

* Adds automated user creation in the production image

This PR implements automated user creation for the production image
controlled by environment variables.

This is a solution for anyone who would like to quickly test the
production image and needs to:

* init/upgrade the DB automatically
* create a user

This is particularly useful for internal SQLite db initialization
but can also be used to initialize the user in docker-compose
or similar cases where there is no equivalent of init containers
that are usually used to perform the initialization.
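
For example, a quick test of the production image against the internal SQLite DB could look like this
(the image tag is just an example - use whatever production image you built or pulled):

    docker run -it -p 8080:8080 \
      --env "_AIRFLOW_DB_UPGRADE=true" \
      --env "_AIRFLOW_WWW_USER_CREATE=true" \
      --env "_AIRFLOW_WWW_USER_PASSWORD=admin" \
        apache/airflow:master-python3.8 webserver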

Closes #860
diff --git a/docs/apache-airflow/production-deployment.rst b/docs/apache-airflow/production-deployment.rst
index 95e466e..18f234e 100644
--- a/docs/apache-airflow/production-deployment.rst
+++ b/docs/apache-airflow/production-deployment.rst
@@ -18,22 +18,24 @@
 Production Deployment
 ^^^^^^^^^^^^^^^^^^^^^
 
-It is time to deploy your DAG in production. To do this, first, you need to make sure that the Airflow is itself production-ready.
-Let's see what precautions you need to take.
+It is time to deploy your DAG in production. To do this, first, you need to make sure that the Airflow
+is itself production-ready. Let's see what precautions you need to take.
 
 Database backend
 ================
 
-Airflow comes with an ``SQLite`` backend by default. This allows the user to run Airflow without any external database.
-However, such a setup is meant to be used for testing purposes only; running the default setup in production can lead to data loss in multiple scenarios.
-If you want to run production-grade Airflow, make sure you :doc:`configure the backend <howto/set-up-database>` to be an external database such as PostgreSQL or MySQL.
+Airflow comes with an ``SQLite`` backend by default. This allows the user to run Airflow without any external
+database. However, such a setup is meant to be used for testing purposes only; running the default setup
+in production can lead to data loss in multiple scenarios. If you want to run production-grade Airflow,
+make sure you :doc:`configure the backend <howto/set-up-database>` to be an external database
+such as PostgreSQL or MySQL.
 
 You can change the backend using the following config
 
 .. code-block:: ini
 
- [core]
- sql_alchemy_conn = my_conn_string
+    [core]
+    sql_alchemy_conn = my_conn_string
 
 Once you have changed the backend, airflow needs to create all the tables required for operation.
 Create an empty DB and give airflow's user the permission to ``CREATE/ALTER`` it.
@@ -41,39 +43,45 @@
 
 .. code-block:: bash
 
- airflow db upgrade
+    airflow db upgrade
 
 ``upgrade`` keeps track of migrations already applied, so it's safe to run as often as you need.
 
 .. note::
 
- Do not use ``airflow db init`` as it can create a lot of default connections, charts, etc. which are not required in production DB.
+    Do not use ``airflow db init`` as it can create a lot of default connections, charts, etc. which are not
+    required in production DB.
 
 
 Multi-Node Cluster
 ==================
 
-Airflow uses :class:`airflow.executors.sequential_executor.SequentialExecutor` by default. However, by its nature, the user is limited to executing at most
-one task at a time. ``Sequential Executor`` also pauses the scheduler when it runs a task, hence not recommended in a production setup.
-You should use the :class:`Local executor <airflow.executors.local_executor.LocalExecutor>` for a single machine.
-For a multi-node setup, you should use the :doc:`Kubernetes executor <../executor/kubernetes>` or the :doc:`Celery executor <../executor/celery>`.
+Airflow uses :class:`~airflow.executors.sequential_executor.SequentialExecutor` by default. However, by its
+nature, the user is limited to executing at most one task at a time. ``Sequential Executor`` also pauses
+the scheduler when it runs a task, hence not recommended in a production setup. You should use the
+:class:`~airflow.executors.local_executor.LocalExecutor` for a single machine.
+For a multi-node setup, you should use the :doc:`Kubernetes executor <../executor/kubernetes>` or
+the :doc:`Celery executor <../executor/celery>`.
 
 
-Once you have configured the executor, it is necessary to make sure that every node in the cluster contains the same configuration and dags.
-Airflow sends simple instructions such as "execute task X of dag Y", but does not send any dag files or configuration. You can use a simple cronjob or
-any other mechanism to sync DAGs and configs across your nodes, e.g., checkout DAGs from git repo every 5 minutes on all nodes.
+Once you have configured the executor, it is necessary to make sure that every node in the cluster contains
+the same configuration and dags. Airflow sends simple instructions such as "execute task X of dag Y", but
+does not send any dag files or configuration. You can use a simple cronjob or any other mechanism to sync
+DAGs and configs across your nodes, e.g., checkout DAGs from git repo every 5 minutes on all nodes.
 
 
 Logging
 =======
 
-If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system (DFS) such as ``S3`` and ``GCS``, or external services such as
-Stackdriver Logging, Elasticsearch or Amazon CloudWatch.
-This way, the logs are available even after the node goes down or gets replaced. See :doc:`logging-monitoring/logging-tasks` for configurations.
+If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system
+(DFS) such as ``S3`` and ``GCS``, or external services such as Stackdriver Logging, Elasticsearch or
+Amazon CloudWatch. This way, the logs are available even after the node goes down or gets replaced.
+See :doc:`logging-monitoring/logging-tasks` for configurations.
 
 .. note::
 
-    The logs only appear in your DFS after the task has finished. You can view the logs while the task is running in UI itself.
+    The logs only appear in your DFS after the task has finished. You can view the logs while the task is
+    running in the UI itself.
 
 
 Configuration
@@ -105,7 +113,8 @@
 
 Strategies for mitigation:
 
-* When running on kubernetes, use a ``livenessProbe`` on the scheduler deployment to fail if the scheduler has not heartbeat in a while.
+* When running on kubernetes, use a ``livenessProbe`` on the scheduler deployment to fail if the scheduler
+  has not sent a heartbeat in a while.
   `Example: <https://github.com/apache/airflow/blob/190066cf201e5b0442bbbd6df74efecae523ee76/chart/templates/scheduler/scheduler-deployment.yaml#L118-L136>`_.
 
 .. _docker_image:
@@ -274,9 +283,13 @@
         rocketchat_API \
         typeform" \
     --build-arg ADDITIONAL_DEV_APT_DEPS="msodbcsql17 unixodbc-dev g++" \
-    --build-arg ADDITIONAL_DEV_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add --no-tty - && curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
+    --build-arg ADDITIONAL_DEV_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
+    apt-key add --no-tty - && \
+    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
     --build-arg ADDITIONAL_DEV_ENV_VARS="ACCEPT_EULA=Y" \
-    --build-arg ADDITIONAL_RUNTIME_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add --no-tty - && curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
+    --build-arg ADDITIONAL_RUNTIME_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
+    apt-key add --no-tty - && \
+    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
     --build-arg ADDITIONAL_RUNTIME_APT_DEPS="msodbcsql17 unixodbc git procps vim" \
     --build-arg ADDITIONAL_RUNTIME_ENV_VARS="ACCEPT_EULA=Y" \
     --tag my-image
@@ -617,7 +630,7 @@
 |                                          |                                          | when installing runtime deps.            |
 +------------------------------------------+------------------------------------------+------------------------------------------+
 | ``AIRFLOW_HOME``                         | ``/opt/airflow``                         | Airflow’s HOME (that’s where logs and    |
-|                                          |                                          | sqlite databases are stored).            |
+|                                          |                                          | SQLite databases are stored).            |
 +------------------------------------------+------------------------------------------+------------------------------------------+
 | ``AIRFLOW_UID``                          | ``50000``                                | Airflow user UID.                        |
 +------------------------------------------+------------------------------------------+------------------------------------------+
@@ -749,6 +762,130 @@
     --build-arg ADDITIONAL_RUNTIME_APT_DEPS="default-jre-headless"
 
 
+Actions executed at image start
+-------------------------------
+
+If you are using the default entrypoint of the production image,
+there are a few actions that are automatically performed when the container starts.
+In some cases, you can pass environment variables to the image to trigger some of that behaviour.
+
+The variables that control the "execution" behaviour start with ``_AIRFLOW`` to distinguish them
+from the variables used to build the image, which start with ``AIRFLOW``.
+
+Creating system user
+....................
+
+The Airflow image is OpenShift-compatible, which means that you can start it with a random user ID and
+group id 0. Airflow will automatically create such a user and make its home directory point to
+``/home/airflow``. You can read more about it in the "Support arbitrary user ids" chapter in the
+`OpenShift best practices <https://docs.openshift.com/container-platform/4.1/openshift_images/create-images.html#images-create-guide-openshift_create-images>`_.
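+
+For example, the following emulates an OpenShift-style arbitrary user id locally (the uid ``50001`` is an
+arbitrary example; if ``/etc/passwd`` is group-writable in the image, as expected for OpenShift
+compatibility, ``whoami`` resolves to the dynamically created user and ``HOME`` points to ``/home/airflow``):
+
+.. code-block:: bash
+
+  docker run -it --user 50001:0 apache/airflow:master-python3.8 bash -c 'whoami; echo "${HOME}"'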
+
+Waits for Airflow DB connection
+...............................
+
+If a Postgres or MySQL DB is used, the entrypoint will wait until the Airflow DB connection becomes
+available. This always happens when you use the default entrypoint.
+
+The script detects the backend type depending on the URL scheme and assigns default port numbers if they
+are not specified in the URL. Then it loops until a connection to the specified host/port can be
+established. It tries ``CONNECTION_CHECK_MAX_COUNT`` times and sleeps ``CONNECTION_CHECK_SLEEP_TIME``
+seconds between checks.
+
+Supported schemes:
+
+* ``postgres://`` - default port 5432
+* ``mysql://``    - default port 3306
+* ``sqlite://``
+
+With the SQLite backend, there is no connection to establish, so waiting is skipped.
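+
+For example, with a Postgres backend you can tune the check as follows (the connection string and image
+tag are placeholders - adjust them to your environment):
+
+.. code-block:: bash
+
+  docker run -it \
+    --env "AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow" \
+    --env "CONNECTION_CHECK_MAX_COUNT=60" \
+    --env "CONNECTION_CHECK_SLEEP_TIME=5" \
+      apache/airflow:master-python3.8 webserver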
+
+Upgrading Airflow DB
+....................
+
+If you set the ``_AIRFLOW_DB_UPGRADE`` variable to a non-empty value, the entrypoint will run
+the ``airflow db upgrade`` command right after verifying the connection. You can also use this
+when you are running Airflow with the internal SQLite database (the default) to upgrade the DB and create
+an admin user at entrypoint, so that you can start the webserver immediately. Note: using SQLite is
+intended only for testing purposes; never use SQLite in production as it has severe limitations when it
+comes to concurrency.
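+
+For example, the following runs only the database upgrade against the internal SQLite database and then
+prints the Airflow version (without a mounted volume the SQLite file stays inside the container, so this
+is only useful as a smoke test):
+
+.. code-block:: bash
+
+  docker run -it \
+    --env "_AIRFLOW_DB_UPGRADE=true" \
+      apache/airflow:master-python3.8 version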
+
+
+Creating admin user
+...................
+
+The entrypoint can also create a webserver user automatically when the container starts. To enable this,
+set ``_AIRFLOW_WWW_USER_CREATE`` to a non-empty value. This is not intended for
+production, it is only useful if you would like to run a quick test with the production image.
+To create such a user you need to pass at least a password, via ``_AIRFLOW_WWW_USER_PASSWORD`` or
+``_AIRFLOW_WWW_USER_PASSWORD_CMD``. As with other ``*_CMD`` variables, the content of
+the ``*_CMD`` variable is evaluated as a shell command and its output is used as the password.
+
+User creation will fail if neither of the password variables is set - there is no default
+password, for security reasons.
+
++-----------+--------------------------+----------------------------------------------------------------------+
+| Parameter | Default                  | Environment variable                                                 |
++===========+==========================+======================================================================+
+| username  | admin                    | ``_AIRFLOW_WWW_USER_USERNAME``                                       |
++-----------+--------------------------+----------------------------------------------------------------------+
+| password  |                          | ``_AIRFLOW_WWW_USER_PASSWORD_CMD`` or ``_AIRFLOW_WWW_USER_PASSWORD`` |
++-----------+--------------------------+----------------------------------------------------------------------+
+| firstname | Airflow                  | ``_AIRFLOW_WWW_USER_FIRSTNAME``                                      |
++-----------+--------------------------+----------------------------------------------------------------------+
+| lastname  | Admin                    | ``_AIRFLOW_WWW_USER_LASTNAME``                                       |
++-----------+--------------------------+----------------------------------------------------------------------+
+| email     | airflowadmin@example.com | ``_AIRFLOW_WWW_USER_EMAIL``                                          |
++-----------+--------------------------+----------------------------------------------------------------------+
+| role      | Admin                    | ``_AIRFLOW_WWW_USER_ROLE``                                           |
++-----------+--------------------------+----------------------------------------------------------------------+
+
+If the password is specified, the entrypoint will attempt to create the user, but it will not fail
+if the attempt fails (this accounts for the case where the user already exists).
+
+You can, for example, start the webserver in the production image, initializing the internal SQLite
+database and creating an ``admin/admin`` user (with the ``Admin`` role), with the following command:
+
+.. code-block:: bash
+
+  docker run -it -p 8080:8080 \
+    --env "_AIRFLOW_DB_UPGRADE=true" \
+    --env "_AIRFLOW_WWW_USER_CREATE=true" \
+    --env "_AIRFLOW_WWW_USER_PASSWORD=admin" \
+      apache/airflow:master-python3.8 webserver
+
+or equivalently, passing the password via a ``*_CMD`` variable that is evaluated at container start:
+
+.. code-block:: bash
+
+  docker run -it -p 8080:8080 \
+    --env "_AIRFLOW_DB_UPGRADE=true" \
+    --env "_AIRFLOW_WWW_USER_CREATE=true" \
+    --env "_AIRFLOW_WWW_USER_PASSWORD_CMD=echo admin" \
+      apache/airflow:master-python3.8 webserver
+
+The commands above initialize the SQLite database, create an ``admin`` user with the password ``admin``
+and the ``Admin`` role. They also forward local port ``8080`` to the webserver port and finally start
+the webserver.
+
+
+Waits for celery broker connection
+..................................
+
+If one of the ``scheduler``, ``celery``, ``worker``, or ``flower`` commands is used and a Celery broker
+is configured, the entrypoint will wait until the Celery broker connection becomes available.
+
+The script detects the backend type depending on the URL scheme and assigns default port numbers if they
+are not specified in the URL. Then it loops until a connection to the specified host/port can be
+established. It tries ``CONNECTION_CHECK_MAX_COUNT`` times and sleeps ``CONNECTION_CHECK_SLEEP_TIME``
+seconds between checks.
+
+Supported schemes:
+
+* ``amqp(s)://``  (rabbitmq) - default port 5672
+* ``redis://``               - default port 6379
+* ``postgres://``            - default port 5432
+* ``mysql://``               - default port 3306
+* ``sqlite://``
+
+With the SQLite backend, there is no connection to establish, so waiting is skipped.
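+
+For example, a Celery worker could be started as shown below (hostnames, credentials and the image tag
+are placeholders for your own setup); the ``celery`` command triggers the broker connection check
+described above in addition to the Airflow DB check:
+
+.. code-block:: bash
+
+  docker run -it \
+    --env "AIRFLOW__CORE__EXECUTOR=CeleryExecutor" \
+    --env "AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow" \
+    --env "AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0" \
+      apache/airflow:master-python3.8 celery worker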
+
+
 Recipes
 -------
 
@@ -761,7 +898,8 @@
 
 Some operators, such as :class:`airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator`,
 :class:`airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator`, require
-the installation of `Google Cloud SDK <https://cloud.google.com/sdk>`__ (includes ``gcloud``). You can also run these commands with BashOperator.
+the installation of `Google Cloud SDK <https://cloud.google.com/sdk>`__ (includes ``gcloud``).
+You can also run these commands with BashOperator.
 
 Create a new Dockerfile like the one shown below.
 
@@ -845,37 +983,64 @@
 Secured Server and Service Access on Google Cloud
 =================================================
 
-This section describes techniques and solutions for securely accessing servers and services when your Airflow environment is deployed on Google Cloud, or you connect to Google services, or you are connecting to the Google API.
+This section describes techniques and solutions for securely accessing servers and services when your Airflow
+environment is deployed on Google Cloud, when you connect to Google services, or when you are connecting
+to the Google API.
 
 IAM and Service Accounts
 ------------------------
 
-You should do not rely on internal network segmentation or firewalling as our primary security mechanisms. To protect your organization's data, every request you make should contain sender identity. In the case of Google Cloud, the identity is provided by `the IAM and Service account <https://cloud.google.com/iam/docs/service-accounts>`__. Each Compute Engine instance has an associated service account identity. It provides cryptographic credentials that your workload can use to prove its identity when making calls to Google APIs or third-party services. Each instance has access only to short-lived credentials. If you use Google-managed service account keys, then the private key is always held in escrow and is never directly accessible.
+You should not rely on internal network segmentation or firewalling as your primary security mechanisms.
+To protect your organization's data, every request you make should contain sender identity. In the case of
+Google Cloud, the identity is provided by
+`the IAM and Service account <https://cloud.google.com/iam/docs/service-accounts>`__. Each Compute Engine
+instance has an associated service account identity. It provides cryptographic credentials that your workload
+can use to prove its identity when making calls to Google APIs or third-party services. Each instance has
+access only to short-lived credentials. If you use Google-managed service account keys, then the private
+key is always held in escrow and is never directly accessible.
 
-If you are using Kubernetes Engine, you can use `Workload Identity <https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>`__ to assign an identity to individual pods.
+If you are using Kubernetes Engine, you can use
+`Workload Identity <https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>`__ to assign
+an identity to individual pods.
 
 For more information about service accounts in the Airflow, see :ref:`howto/connection:gcp`
 
 Impersonate Service Accounts
 ----------------------------
 
-If you need access to other service accounts, you can :ref:`impersonate other service accounts <howto/connection:gcp:impersonation>` to exchange the token with the default identity to another service account. Thus, the account keys are still managed by Google and cannot be read by your workload.
+If you need access to other service accounts, you can
+:ref:`impersonate other service accounts <howto/connection:gcp:impersonation>` to exchange the token with
+the default identity to another service account. Thus, the account keys are still managed by Google
+and cannot be read by your workload.
 
-It is not recommended to generate service account keys and store them in the metadata database or the secrets backend. Even with the use of the backend secret, the service account key is available for your workload.
+It is not recommended to generate service account keys and store them in the metadata database or the
+secrets backend. Even with the use of the backend secret, the service account key is available for
+your workload.
 
 Access to Compute Engine Instance
 ---------------------------------
 
-If you want to establish an SSH connection to the Compute Engine instance, you must have the network address of this instance and credentials to access it. To simplify this task, you can use :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` instead of :class:`~airflow.providers.ssh.hooks.ssh.SSHHook`
+If you want to establish an SSH connection to the Compute Engine instance, you must have the network address
+of this instance and credentials to access it. To simplify this task, you can use
+:class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook`
+instead of :class:`~airflow.providers.ssh.hooks.ssh.SSHHook`.
 
-The :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` support authorization with Google OS Login service. It is an extremely robust way to manage Linux access properly as it stores short-lived ssh keys in the metadata service, offers PAM modules for access and sudo privilege checking and offers nsswitch user lookup into the metadata service as well.
+The :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` supports authorization with
+the Google OS Login service. It is an extremely robust way to manage Linux access properly as it stores
+short-lived ssh keys in the metadata service, offers PAM modules for access and sudo privilege checking,
+and offers ``nsswitch`` user lookup into the metadata service as well.
 
-It also solves the discovery problem that arises as your infrastructure grows. You can use the instance name instead of the network address.
+It also solves the discovery problem that arises as your infrastructure grows. You can use the
+instance name instead of the network address.
 
 Access to Amazon Web Service
 ----------------------------
 
-Thanks to `Web Identity Federation <https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_oidc.html>`__, you can exchange the Google Cloud Platform identity to the Amazon Web Service identity, which effectively means access to Amazon Web Service platform. For more information, see: :ref:`howto/connection:aws:gcp-federation`
+Thanks to
+`Web Identity Federation <https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_oidc.html>`__,
+you can exchange the Google Cloud Platform identity for an Amazon Web Services identity,
+which effectively means access to the Amazon Web Services platform.
+For more information, see: :ref:`howto/connection:aws:gcp-federation`
 
 .. spelling::
 
diff --git a/scripts/in_container/prod/entrypoint_prod.sh b/scripts/in_container/prod/entrypoint_prod.sh
index 699038b..00bec58 100755
--- a/scripts/in_container/prod/entrypoint_prod.sh
+++ b/scripts/in_container/prod/entrypoint_prod.sh
@@ -39,121 +39,216 @@
     nc -zvvn "${ip}" "${port}"
 }
 
-function verify_db_connection {
-    DB_URL="${1}"
+function wait_for_connection {
+    # Waits for connection to the backend specified via the URL passed as the first parameter.
+    # Detects the backend type depending on the URL scheme and assigns
+    # default port numbers if not specified in the URL.
+    # Then it loops until a connection to the specified host/port can be established.
+    # It tries `CONNECTION_CHECK_MAX_COUNT` times and sleeps `CONNECTION_CHECK_SLEEP_TIME` seconds between checks.
+    local connection_url
+    connection_url="${1}"
 
-    DB_CHECK_MAX_COUNT=${MAX_DB_CHECK_COUNT:=20}
-    DB_CHECK_SLEEP_TIME=${DB_CHECK_SLEEP_TIME:=3}
-
-    local DETECTED_DB_BACKEND=""
-    local DETECTED_DB_HOST=""
-    local DETECTED_DB_PORT=""
+    local detected_backend=""
+    local detected_host=""
+    local detected_port=""
 
 
-    if [[ ${DB_URL} != sqlite* ]]; then
+    if [[ ${connection_url} != sqlite* ]]; then
         # Auto-detect DB parameters
-        [[ ${DB_URL} =~ ([^:]*)://([^:]*[@.*]?):([^@]*)@?([^/:]*):?([0-9]*)/([^\?]*)\??(.*) ]] && \
-            DETECTED_DB_BACKEND=${BASH_REMATCH[1]} &&
+        [[ ${connection_url} =~ ([^:]*)://([^:]*[@.*]?):([^@]*)@?([^/:]*):?([0-9]*)/([^\?]*)\??(.*) ]] && \
+            detected_backend=${BASH_REMATCH[1]} &&
             # Not used USER match
             # Not used PASSWORD match
-            DETECTED_DB_HOST=${BASH_REMATCH[4]} &&
-            DETECTED_DB_PORT=${BASH_REMATCH[5]} &&
+            detected_host=${BASH_REMATCH[4]} &&
+            detected_port=${BASH_REMATCH[5]} &&
             # Not used SCHEMA match
             # Not used PARAMS match
 
-        echo DB_BACKEND="${DB_BACKEND:=${DETECTED_DB_BACKEND}}"
+        echo BACKEND="${BACKEND:=${detected_backend}}"
+        readonly BACKEND
 
-        if [[ -z "${DETECTED_DB_PORT=}" ]]; then
-            if [[ ${DB_BACKEND} == "postgres"* ]]; then
-                DETECTED_DB_PORT=5432
-            elif [[ ${DB_BACKEND} == "mysql"* ]]; then
-                DETECTED_DB_PORT=3306
+        if [[ -z "${detected_port=}" ]]; then
+            if [[ ${BACKEND} == "postgres"* ]]; then
+                detected_port=5432
+            elif [[ ${BACKEND} == "mysql"* ]]; then
+                detected_port=3306
+            elif [[ ${BACKEND} == "redis"* ]]; then
+                detected_port=6379
+            elif [[ ${BACKEND} == "amqp"* ]]; then
+                detected_port=5672
             fi
         fi
 
-        DETECTED_DB_HOST=${DETECTED_DB_HOST:="localhost"}
+        detected_host=${detected_host:="localhost"}
 
         # Allow the DB parameters to be overridden by environment variable
-        echo DB_HOST="${DB_HOST:=${DETECTED_DB_HOST}}"
-        echo DB_PORT="${DB_PORT:=${DETECTED_DB_PORT}}"
+        echo DB_HOST="${DB_HOST:=${detected_host}}"
+        readonly DB_HOST
 
+        echo DB_PORT="${DB_PORT:=${detected_port}}"
+        readonly DB_PORT
+        local countdown
+        countdown="${CONNECTION_CHECK_MAX_COUNT}"
         while true
         do
             set +e
-            LAST_CHECK_RESULT=$(run_nc "${DB_HOST}" "${DB_PORT}" >/dev/null 2>&1)
-            RES=$?
+            local last_check_result
+            local res
+            last_check_result=$(run_nc "${DB_HOST}" "${DB_PORT}" >/dev/null 2>&1)
+            res=$?
             set -e
-            if [[ ${RES} == 0 ]]; then
+            if [[ ${res} == 0 ]]; then
                 echo
                 break
             else
                 echo -n "."
-                DB_CHECK_MAX_COUNT=$((DB_CHECK_MAX_COUNT-1))
+                countdown=$((countdown-1))
             fi
-            if [[ ${DB_CHECK_MAX_COUNT} == 0 ]]; then
+            if [[ ${countdown} == 0 ]]; then
                 echo
-                echo "ERROR! Maximum number of retries (${DB_CHECK_MAX_COUNT}) reached while checking ${DB_BACKEND} db. Exiting"
+                echo "ERROR! Maximum number of retries (${CONNECTION_CHECK_MAX_COUNT}) reached."
+                echo "       while checking ${BACKEND} connection."
                 echo
-                break
+                echo "Last check result:"
+                echo
+                echo "${last_check_result}"
+                echo
+                exit 1
             else
-                sleep "${DB_CHECK_SLEEP_TIME}"
+                sleep "${CONNECTION_CHECK_SLEEP_TIME}"
             fi
         done
-        if [[ ${RES} != 0 ]]; then
-            echo "        ERROR: ${DB_URL} db could not be reached!"
-            echo
-            echo "${LAST_CHECK_RESULT}"
-            echo
-            export EXIT_CODE=${RES}
+    fi
+}
+
+function create_www_user() {
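+    # Creates the www (webserver) user based on the _AIRFLOW_WWW_USER_* environment variables.
+    # Tries the Airflow 2 CLI (`airflow users create`) first and falls back to the older
+    # `airflow create_user` CLI; a failed attempt (for example because the user already exists)
+    # does not abort container startup.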
+    local local_password=""
+    # Warning: command environment variables (*_CMD) have priority over usual configuration variables
+    # for configuration parameters that require sensitive information. This is the case for the SQL database
+    # and the broker backend in this entrypoint script.
+    if [[ -n "${_AIRFLOW_WWW_USER_PASSWORD_CMD=}" ]]; then
+        local_password=$(eval "${_AIRFLOW_WWW_USER_PASSWORD_CMD}")
+        unset _AIRFLOW_WWW_USER_PASSWORD_CMD
+    elif [[ -n "${_AIRFLOW_WWW_USER_PASSWORD=}" ]]; then
+        local_password="${_AIRFLOW_WWW_USER_PASSWORD}"
+        unset _AIRFLOW_WWW_USER_PASSWORD
+    fi
+    if [[ -z ${local_password} ]]; then
+        echo
+        echo "ERROR! Airflow Admin password not set via _AIRFLOW_WWW_USER_PASSWORD or _AIRFLOW_WWW_USER_PASSWORD_CMD variables!"
+        echo
+        exit 1
+    fi
+
+    airflow users create \
+       --username "${_AIRFLOW_WWW_USER_USERNAME="admin"}" \
+       --firstname "${_AIRFLOW_WWW_USER_FIRSTNAME="Airflow"}" \
+       --lastname "${_AIRFLOW_WWW_USER_LASTNAME="Admin"}" \
+       --email "${_AIRFLOW_WWW_USER_EMAIL="airflowadmin@example.com"}" \
+       --role "${_AIRFLOW_WWW_USER_ROLE="Admin"}" \
+       --password "${local_password}" ||
+    airflow create_user \
+       --username "${_AIRFLOW_WWW_USER_USERNAME="admin"}" \
+       --firstname "${_AIRFLOW_WWW_USER_FIRSTNAME="Airflow"}" \
+       --lastname "${_AIRFLOW_WWW_USER_LASTNAME="Admin"}" \
+       --email "${_AIRFLOW_WWW_USER_EMAIL="airflowadmin@example.com"}" \
+       --role "${_AIRFLOW_WWW_USER_ROLE="Admin"}" \
+       --password "${local_password}" || true
+}
+
+function create_system_user_if_missing() {
+    # This is needed in case of OpenShift-compatible container execution. OpenShift starts the image
+    # with a random user id while keeping group 0 as the user group. Our production image is
+    # OpenShift compatible, so permissions on all folders are set so that group 0 can exercise the
+    # same privileges as the default "airflow" user. This code checks whether the user is already
+    # present in /etc/passwd and, if not, creates the system user dynamically, including setting its
+    # HOME directory to /home/airflow so that (for example) the ${HOME}/.local folder where airflow
+    # is installed can be automatically added to PYTHONPATH.
+    if ! whoami &> /dev/null; then
+      if [[ -w /etc/passwd ]]; then
+        echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${AIRFLOW_USER_HOME_DIR}:/sbin/nologin" \
+            >> /etc/passwd
+      fi
+      export HOME="${AIRFLOW_USER_HOME_DIR}"
+    fi
+}
+
+function wait_for_airflow_db() {
+    # Verifies connection to the Airflow DB
+    if [[ -n "${AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=}" ]]; then
+        wait_for_connection "$(eval "${AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD}")"
+    else
+        # if no DB configured - use sqlite db by default
+        AIRFLOW__CORE__SQL_ALCHEMY_CONN="${AIRFLOW__CORE__SQL_ALCHEMY_CONN:="sqlite:///${AIRFLOW_HOME}/airflow.db"}"
+        wait_for_connection "${AIRFLOW__CORE__SQL_ALCHEMY_CONN}"
+    fi
+}
+
+function upgrade_db() {
+    # Runs airflow db upgrade
+    airflow db upgrade || airflow upgradedb || true
+}
+
+function wait_for_celery_backend() {
+    # Verifies connection to Celery Broker
+    if [[ -n "${AIRFLOW__CELERY__BROKER_URL_CMD=}" ]]; then
+        wait_for_connection "$(eval "${AIRFLOW__CELERY__BROKER_URL_CMD}")"
+    else
+        AIRFLOW__CELERY__BROKER_URL=${AIRFLOW__CELERY__BROKER_URL:=}
+        if [[ -n ${AIRFLOW__CELERY__BROKER_URL=} ]]; then
+            wait_for_connection "${AIRFLOW__CELERY__BROKER_URL}"
         fi
     fi
 }
 
-if ! whoami &> /dev/null; then
-  if [[ -w /etc/passwd ]]; then
-    echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${AIRFLOW_USER_HOME_DIR}:/sbin/nologin" \
-        >> /etc/passwd
-  fi
-  export HOME="${AIRFLOW_USER_HOME_DIR}"
+function exec_to_bash_or_python_command_if_specified() {
+    # If the 'bash' or 'python' command is used, exec the corresponding interpreter with the
+    # remaining command line parameters (the 'airflow' command itself is handled further below)
+    if [[ ${AIRFLOW_COMMAND} == "bash" ]]; then
+       shift
+       exec "/bin/bash" "${@}"
+    elif [[ ${AIRFLOW_COMMAND} == "python" ]]; then
+       shift
+       exec "python" "${@}"
+    fi
+}
+
+
+CONNECTION_CHECK_MAX_COUNT=${CONNECTION_CHECK_MAX_COUNT:=20}
+readonly CONNECTION_CHECK_MAX_COUNT
+
+CONNECTION_CHECK_SLEEP_TIME=${CONNECTION_CHECK_SLEEP_TIME:=3}
+readonly CONNECTION_CHECK_SLEEP_TIME
+
+create_system_user_if_missing
+wait_for_airflow_db
+
+if [[ -n "${_AIRFLOW_DB_UPGRADE=}" ]] ; then
+    upgrade_db
 fi
 
-# Warning: command environment variables (*_CMD) have priority over usual configuration variables
-# for configuration parameters that require sensitive information. This is the case for the SQL database
-# and the broker backend in this entrypoint script.
-
-
-if [[ -n "${AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=}" ]]; then
-    verify_db_connection "$(eval "$AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD")"
-else
-    # if no DB configured - use sqlite db by default
-    AIRFLOW__CORE__SQL_ALCHEMY_CONN="${AIRFLOW__CORE__SQL_ALCHEMY_CONN:="sqlite:///${AIRFLOW_HOME}/airflow.db"}"
-    verify_db_connection "${AIRFLOW__CORE__SQL_ALCHEMY_CONN}"
+if [[ -n "${_AIRFLOW_WWW_USER_CREATE=}" ]] ; then
+    create_www_user
 fi
 
+# The `bash` and `python` commands should also verify the basic connections,
+# so they are run after the DB check
+exec_to_bash_or_python_command_if_specified "${@}"
 
-# The Bash and python commands still should verify the basic connections so they are run after the
-# DB check but before the broker check
-if [[ ${AIRFLOW_COMMAND} == "bash" ]]; then
-   shift
-   exec "/bin/bash" "${@}"
-elif [[ ${AIRFLOW_COMMAND} == "python" ]]; then
-   shift
-   exec "python" "${@}"
-elif [[ ${AIRFLOW_COMMAND} == "airflow" ]]; then
+# Remove "airflow" if it is specified as airflow command
+# This way both command types work the same way:
+#
+#     docker run IMAGE airflow webserver
+#     docker run IMAGE webserver
+#
+if [[ ${AIRFLOW_COMMAND} == "airflow" ]]; then
    AIRFLOW_COMMAND="${2}"
    shift
 fi
 
 # Note: the broker backend configuration concerns only a subset of Airflow components
 if [[ ${AIRFLOW_COMMAND} =~ ^(scheduler|celery|worker|flower)$ ]]; then
-    if [[ -n "${AIRFLOW__CELERY__BROKER_URL_CMD=}" ]]; then
-        verify_db_connection "$(eval "$AIRFLOW__CELERY__BROKER_URL_CMD")"
-    else
-        AIRFLOW__CELERY__BROKER_URL=${AIRFLOW__CELERY__BROKER_URL:=}
-        if [[ -n ${AIRFLOW__CELERY__BROKER_URL=} ]]; then
-            verify_db_connection "${AIRFLOW__CELERY__BROKER_URL}"
-        fi
-    fi
+    wait_for_celery_backend "${@}"
 fi
 
 exec "airflow" "${@}"