.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

SQL Operators
=============

These operators perform various queries against a SQL database, including
column- and table-level data quality checks.

.. _howto/operator:SQLColumnCheckOperator:

Check SQL Table Columns
~~~~~~~~~~~~~~~~~~~~~~~

Use the :class:`~airflow.providers.common.sql.operators.sql.SQLColumnCheckOperator` to run data quality
checks against columns of a given table. In addition to a connection ID and table, a ``column_mapping``
describing the relationship between columns and the checks to run must be supplied. An example column
mapping is a set of three nested dictionaries and looks like:

.. code-block:: python

    column_mapping = {
        "col_name": {
            "null_check": {
                "equal_to": 0,
            },
            "min": {
                "greater_than": 5,
                "leq_to": 10,
                "tolerance": 0.2,
            },
            "max": {"less_than": 1000, "geq_to": 10, "tolerance": 0.01},
        }
    }

Here ``col_name`` is the name of the column to run checks on, and each entry in its dictionary is a check.
The valid checks are:

- ``null_check``: checks the number of NULL values in the column
- ``distinct_check``: checks the COUNT of distinct values in the column
- ``unique_check``: checks the number of distinct values in a column against the number of rows
- ``min``: checks the minimum value in the column
- ``max``: checks the maximum value in the column

Each entry in a check's dictionary is either a condition for success or the tolerance. The
conditions for success are:

- ``greater_than``
- ``geq_to``
- ``less_than``
- ``leq_to``
- ``equal_to``

When specifying conditions, ``equal_to`` is not compatible with the other conditions. A lower-bound
and an upper-bound condition may be specified together in the same check. The tolerance is the
percentage by which the result may be out of bounds and still be considered successful; for example,
``leq_to: 10`` with ``tolerance: 0.2`` accepts results up to ``10 * (1 + 0.2) = 12``.
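
Putting these pieces together, a column-check task can be declared as in the following minimal
sketch; the connection ID ``my_db_conn`` and table ``my_table`` are illustrative placeholders, not
names from this guide:

.. code-block:: python

    from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

    check_columns = SQLColumnCheckOperator(
        task_id="check_columns",
        conn_id="my_db_conn",  # placeholder connection ID
        table="my_table",  # placeholder table name
        column_mapping={
            "id": {
                "null_check": {"equal_to": 0},  # no NULL values allowed in "id"
                "unique_check": {"equal_to": 0},  # rows minus distinct values; 0 means all unique
            },
        },
    )

If any of the supplied checks fails, the task fails.
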
The example below demonstrates how to instantiate the SQLColumnCheckOperator task.

.. exampleinclude:: /../../tests/system/providers/common/sql/example_sql_column_table_check.py
    :language: python
    :dedent: 4
    :start-after: [START howto_operator_sql_column_check]
    :end-before: [END howto_operator_sql_column_check]

.. _howto/operator:SQLTableCheckOperator:

Check SQL Table Values
~~~~~~~~~~~~~~~~~~~~~~

Use the :class:`~airflow.providers.common.sql.operators.sql.SQLTableCheckOperator` to run data quality
checks against a given table. In addition to a connection ID and table, a ``checks`` dictionary
describing the relationship between the table and the tests to run must be supplied. An example
``checks`` argument is a set of two nested dictionaries and looks like:

.. code-block:: python

    checks = {
        "row_count_check": {
            "check_statement": "COUNT(*) = 1000",
        },
        "column_sum_check": {"check_statement": "col_a + col_b < col_c"},
    }

The top-level keys are the check names, which are referenced in the templated query the operator builds.
The dictionary key under each check name must be ``check_statement``, with the value a SQL statement that
resolves to a boolean (this can be any string or int that resolves to a boolean via
``airflow.operators.sql.parse_boolean``).
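
As with the column checks, a table-level check task can be declared as in the following minimal
sketch; ``my_db_conn`` and ``my_table`` are again illustrative placeholders:

.. code-block:: python

    from airflow.providers.common.sql.operators.sql import SQLTableCheckOperator

    check_table = SQLTableCheckOperator(
        task_id="check_table",
        conn_id="my_db_conn",  # placeholder connection ID
        table="my_table",  # placeholder table name
        checks={
            # each check_statement must resolve to a boolean per the rules above
            "row_count_check": {"check_statement": "COUNT(*) >= 1"},  # table must not be empty
        },
    )
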
The example below demonstrates how to instantiate the SQLTableCheckOperator task.

.. exampleinclude:: /../../tests/system/providers/common/sql/example_sql_column_table_check.py
    :language: python
    :dedent: 4
    :start-after: [START howto_operator_sql_table_check]
    :end-before: [END howto_operator_sql_table_check]