[GRIFFIN-358] Added measure configuration guide for duplication and sparkSql measures.

commit: 15bcfa44969252865e47c34733a7b4e2a0584664 [log] [tgz]
author: chitralverma <chitralverma@gmail.com> Sun May 02 23:09:43 2021 +0530
committer: chitralverma <chitralverma@gmail.com> Mon May 03 02:21:57 2021 +0530
tree: 38a0b44b4eefeffd1d8b0bf3ce2a85c3dae399bf
parent: aaa1bf3baad389871a7828549866789c62ea76ad [diff]
diff --git a/griffin-doc/measure/dimensions.md b/griffin-doc/measure/dimensions.md
index 577beaa..123b8fb 100644
--- a/griffin-doc/measure/dimensions.md
+++ b/griffin-doc/measure/dimensions.md

@@ -86,12 +86,12 @@
 
 ## SparkSQL
 
-In some cases, the above-mentioned quality dimensions/ measures may not be able to model a complete data quality
-definition. For such cases, Apache Griffin allows the definition of complex custom user-defined checks as SparkSQL
-queries.
+In some cases, the above-mentioned dimensions/ measures may not enough to model a complete data quality definition. For
+such cases, Apache Griffin allows the definition of complex custom user-defined checks as SparkSQL queries.
 
-SparkSQL measure is like a pro mode and allows advanced users to put complex custom checks that are not covered by other
-measures. These SparkSQL queries may contain clauses like select/ from/ where/ group-by/ order-by/ limit, etc.
+SparkSQL measure is like a pro mode that allows advanced users to configure complex custom checks that are not covered
+by other measures. These SparkSQL queries may contain clauses like `select`, `from`, `where`, `group-by`, `order-by`
+, `limit`, etc.
 
 ## Profiling
 

diff --git a/griffin-doc/measure/measure-configuration-guide/duplication.md b/griffin-doc/measure/measure-configuration-guide/duplication.md
new file mode 100644
index 0000000..1a2c5ec
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/duplication.md

@@ -0,0 +1,247 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Duplication Measure - Configuration Guide
+=====================================
+
+### Introduction
+
+Asserting the measure of duplication of the entities within a data set implies that no entity exists more than once
+within the data set and that there is a key that can be used to uniquely access each entity. For example, in a master
+product table, each product must appear once and be assigned a unique identifier that represents that product within a
+system or across multiple applications/ systems.
+
+Redundancies in a dataset can be measured in terms of the following metrics,
+
+- **Duplicate:** the number of values that are the same as other values in the list
+- **Distinct:** the number of non-null values that are different from each other (Non-unique + Unique)
+- **Non-Unique:** the number of values that have at least one duplicate in the list
+- **Unique:** the number of values that have no duplicates
+
+Duplication measure in Apache Griffin computes all of these metrics for a user-defined data asset.
+
+As an example, consider the data set below,
+
+```
++---+-------------+------+
+|id |name         |gender|
++---+-------------+------+
+|1  |John Smith   |Male  |
+|2  |John Smith   |Male  |
+|3  |Rebecca Davis|Female|
+|4  |Paul Adams   |Male  |
+|5  |null         |null  |
++---+-------------+------+
+```
+
+From the above definitions of redundancy measurement following results are generated (considering only `name` column),
+
+| Metric    | Count | Records         | Explanation                                                |
+|:----------|:-----:|:----------------|:-----------------------------------------------------------|
+|Duplicate  | 1     | Record 2        | There is one duplicate of the John Smith record (Record 1) |
+|Distinct   | 3     | Records 1, 3, 4 | These records contain distinct values                      |
+|Non-unique | 1     | Record 1        | John Smith has a duplicate record - it isn't unique        |
+|Unique     | 2     | Records 3 and 4 | Rebecca Davis and Paul Adams appear only once in the list, they have no duplicates |
+
+_Note:_ The above results are when only `name` column is measured by duplication measure. Results for whole row would be
+different.
+
+### Configuration
+
+The duplication measure can be configured as below,
+
+```json
+{
+  ...
+
+  "measures": [
+    {
+      "name": "duplication_measure",
+      "type": "duplication",
+      "data.source": "crime_report_source",
+      "config": {
+        "expr": "incident",
+        "bad.record.definition": "duplicate"
+      },
+      "out": [
+        {
+          "type": "metric",
+          "name": "duplication_metric",
+          "flatten": "map"
+        },
+        {
+          "type": "record",
+          "name": "duplication_records"
+        }
+      ]
+    }
+  ]
+
+  ...
+}
+ ```
+
+##### Key Parameters:
+
+| Name    | Type     | Description                            | Supported Values                                          |
+|:--------|:---------|:---------------------------------------|:----------------------------------------------------------|
+| name    | `String` | User-defined name of this measure      | -                                                         |
+| type    | `String` | Type of Measure                        | completeness, duplication, profiling, accuracy, sparksql  |
+| data.source | `String` | Name of data source on which this measure is applied  | -                                      |
+| config  | `Object` | Configuration params of the measure    | Depends on measure type ([see below](#example-config-object))                       |
+| out     | `List  ` | Define output(s) of measure execution  | [See below](#outputs)                                               |
+
+##### Example `config` Object:
+
+`config` object for duplication measure contains the following keys,
+
+- `expr`: The value for `expr` is a comma separated string of columns in the data asset on which the duplication measure
+  is to be executed. `expr` is an optional key for Duplication measure, i.e., if it is not defined, the entire row will
+  be checked by duplication measure.
+
+- `bad.record.definition`: As the key suggests, its value defines what exactly would be considered as a bad record after
+  this measure computes redundancies on the data asset. Since the redundancies are calculated as `duplicate`, `unique`
+  , `non_unique`, and  `distinct`, the value of this key must also be one of these values. This key is mandatory and
+  must be defined with appropriate value.
+
+  _Note:_ This expression describes the bad records. This means that for the config,
+  ```json
+  {
+    ...
+    
+    "config": {
+      "expr": "incident",
+      "bad.record.definition": "duplicate"
+    }
+
+    ...
+  }
+  ```
+  the records with duplicate value in `incident` column will be considered bad.
+
+  If `expr` was not defined then the duplicate records themselves will be considered bad.
+
+### Outputs
+
+Duplication measure supports the following two outputs,
+
+- Metrics
+- Records
+
+Users can choose to define any combination of these 2 outputs. For no outputs, skip this `out: [ ... ]` section from the
+measure configuration.
+
+#### Metrics Outputs
+
+To write metrics for duplication measure, configure the measure with output section as below,
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "name": "duplication_metric",
+      "type": "metric",
+      "flatten": "map"
+    }
+  ]
+
+  ...
+}
+ ```
+
+This will generate the metrics like below,
+
+```json
+{
+  ...
+
+  "value": {
+    "duplication_measure": {
+      "measure_name": "duplication_measure",
+      "measure_type": "Duplication",
+      "data_source": "crime_report_source",
+      "metrics": {
+        "duplicate": "4363",
+        "unique": "58",
+        "non_unique": "196",
+        "distinct": "254"
+      }
+    }
+  }
+
+  ...
+}
+```
+
+#### Record Outputs
+
+To write records as output for duplication measure, configure the measure with output section as below,
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "type": "record",
+      "name": "duplication_records"
+    }
+  ]
+
+  ...
+}
+ ```
+
+The above configuration will generate the records output like below,
+
+```
++-------------------+-----------------------------------------+----------------------------------+-------------+-------+-------------+--------+
+|date_time          |incident                                 |address                           |city         |zipcode|__tmst       |__status|
++-------------------+-----------------------------------------+----------------------------------+-------------+-------+-------------+--------+
+|2015-10-30 03:32:00|MUNI CODE/MISC                           |400 Block LYTTON AV               |PALO ALTO    |94301  |1619970144048|good    |
+|2015-10-31 23:00:00|MUNI CODE/MISC                           |1 Block ENCINA AV                 |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-10-31 23:10:00|MUNI CODE/MISC                           |LYTTON AV & BRYANT ST             |PALO ALTO    |94301  |1619970144048|bad     |
+|2016-02-07 07:38:00|MUNI CODE/MISC                           |200 Block UNIVERSITY AV           |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-05-31 07:27:00|THEFT PETTY ATT/FROM AUTO (488FAA)       |2000 Block EMERSON ST             |PALO ALTO    |94301  |1619970144048|good    |
+|2015-06-02 16:00:00|OUTSIDE INVESTIGATION/MISC (OUTSIDE)     |700 Block CEREZA DR               |PALO ALTO    |94306  |1619970144048|good    |
+|2015-10-31 07:21:00|VEHICLE/IMPOUNDED                        |400 Block BRYANT ST               |PALO ALTO    |94301  |1619970144048|good    |
+|2016-02-02 16:50:00|VEHICLE/IMPOUNDED                        |1400 Block BLOCK EL CAMINO REAL   |PALO ALTO    |null   |1619970144048|bad     |
+|2016-02-04 07:14:00|VEHICLE/IMPOUNDED                        |EMBARCADERO RD & GREER RD         |PALO ALTO    |94303  |1619970144048|bad     |
+|2015-05-27 05:00:00|FOUND PROPERTY/MISC (FOUND)              |HAMILTON AV & WAVERLEY ST         |PALO ALTO    |94301  |1619970144048|good    |
+|2015-05-29 20:30:00|FOUND PROPERTY/MISC (FOUND)              |700 Block GREER RD                |PALO ALTO    |94303  |1619970144048|bad     |
+|2015-05-30 22:50:00|FOUND PROPERTY/MISC (FOUND)              |600 Block GUINDA ST               |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-06-01 18:30:00|FOUND PROPERTY/MISC (FOUND)              |200 Block UNIVERSITY AV           |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-06-01 22:30:00|FOUND PROPERTY/MISC (FOUND)              |300 Block HOMER AV                |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-10-20 13:30:00|FOUND PROPERTY/MISC (FOUND)              |2600 Block MIDDLEFIELD RD         |PALO ALTO    |94306  |1619970144048|bad     |
+|2015-10-21 23:30:00|FOUND PROPERTY/MISC (FOUND)              |3200 Block BLOCK E BAYSHORE RD    |PALO ALTO    |94303  |1619970144048|bad     |
+|2015-09-28 06:35:00|STALKING/MISC (6469)                     |700 Block DE SOTO DR              |PALO ALTO    |94303  |1619970144048|good    |
+|2015-07-02 20:23:00|DRUNK IN PUBLIC (647(F)PC)               |800 Block EL CAMINO REAL          |PALO ALTO    |94301  |1619970144048|good    |
+|2015-08-01 01:30:00|DRUNK IN PUBLIC (647(F)PC)               |500 Block HIGH ST                 |PALO ALTO    |94301  |1619970144048|bad     |
+|2015-07-11 21:56:00|OUTSIDE WARRANT (O/W-MISD)               |3500 Block BLOCK EL CAMINO REAL   |PALO ALTO    |94306  |1619970144048|good    |
++-------------------+-----------------------------------------+----------------------------------+-------------+-------+-------------+--------+
+only showing top 20 rows
+ ```
+
+A new column `__status` has been added to the original data set on which this measure was executed. The value of this
+column can be either `bad` or `good` which can be used to calculate the metrics/ separate data based on quality etc.
+These values for `__status` column are based on the value of user-defined key `bad.record.definition`.
+
+_Note:_ This output is for `ConsoleSink`. 
\ No newline at end of file

diff --git a/griffin-doc/measure/measure-configuration-guide/sparksql.md b/griffin-doc/measure/measure-configuration-guide/sparksql.md
new file mode 100644
index 0000000..7486649
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/sparksql.md

@@ -0,0 +1,224 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+SparkSQL Measure - Configuration Guide
+=====================================
+
+### Introduction
+
+In some cases, the pre-defined dimensions/ measures may not enough to model a complete data quality definition. For such
+cases, Apache Griffin allows the definition of complex custom user-defined checks as SparkSQL queries.
+
+SparkSQL measure is like a pro mode that allows advanced users to configure complex custom checks that are not covered
+by other measures. These SparkSQL queries may contain clauses like `select`, `from`, `where`, `group-by`, `order-by`
+, `limit`, `join`, etc.
+
+### Configuration
+
+The SparkSQL measure can be configured as below,
+
+```json
+{
+  ...
+
+  "measures": [
+    {
+      "name": "spark_sql_measure",
+      "type": "sparkSQL",
+      "data.source": "crime_report_source",
+      "config": {
+        "expr": "SELECT t.*, sq.zip IS NULL AS __measure_spark_sql_measure FROM crime_report_source AS t LEFT OUTER JOIN (SELECT zipcode as zip, COUNT(DISTINCT city) AS city_count FROM crime_report_source GROUP BY zipcode having city_count = 1) as sq ON sq.zip=t.zipcode",
+        "bad.record.definition": "__measure_spark_sql_measure"
+      },
+      "out": [
+        {
+          "type": "metric",
+          "name": "spark_sql_metric",
+          "flatten": "map"
+        },
+        {
+          "type": "record",
+          "name": "spark_sql_records"
+        }
+      ]
+    }
+  ]
+
+  ...
+}
+ ```
+
+##### Key Parameters:
+
+| Name    | Type     | Description                            | Supported Values                                          |
+|:--------|:---------|:---------------------------------------|:----------------------------------------------------------|
+| name    | `String` | User-defined name of this measure      | -                                                         |
+| type    | `String` | Type of Measure                        | completeness, duplication, profiling, accuracy, sparksql  |
+| data.source | `String` | Name of data source on which this measure is applied  | -                                      |
+| config  | `Object` | Configuration params of the measure    | Depends on measure type ([see below](#example-config-object))                       |
+| out     | `List  ` | Define output(s) of measure execution  | [See below](#outputs)                                               |
+
+##### Example `config` Object:
+
+`config` object for SparkSQL measure contains the following keys,
+
+- `expr`: The value for `expr` is a valid SparkSQL query string. This is a mandatory parameter
+
+- `bad.record.definition`: As the key suggests, its value defines what exactly would be considered as a bad record after
+  this query executes. For example, the output of the configured query above has the following schema,
+  ```
+    root
+      |-- date_time: timestamp (nullable = true)
+      |-- incident: string (nullable = true)
+      |-- address: string (nullable = true)
+      |-- city: string (nullable = true)
+      |-- zipcode: integer (nullable = true)
+      |-- __measure_spark_sql_measure: integer (nullable = false)
+  ```
+
+  In order to separate the good data from bad data, a `bad.record.definition` expression must be set. This expression
+  can be a SparkSQL like expression and must yield a column with boolean data type.
+
+  _Note:_ This expression describes the bad records, i.e. if `bad.record.definition` = `true` for a record, it is marked
+  as bad/ incomplete record.
+
+### Outputs
+
+SparkSQL measure supports the following two outputs,
+
+- Metrics
+- Records
+
+Users can choose to define any combination of these 2 outputs. For no outputs, skip this `out: [ ... ]` section from the
+measure configuration.
+
+#### Metrics Outputs
+
+To write metrics for SparkSQL measure, configure the measure with output section as below,
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "name": "spark_sql_metric",
+      "type": "metric",
+      "flatten": "map"
+    }
+  ]
+
+  ...
+}
+ ```
+
+This will generate the metrics like below,
+
+```json
+{
+  ...
+
+  "value": {
+    "spark_sql_measure": {
+      "measure_name": "spark_sql_measure",
+      "measure_type": "SparkSQL",
+      "data_source": "crime_report_source",
+      "metrics": {
+        "total": "4617",
+        "complete": "1983",
+        "incomplete": "2634"
+      }
+    }
+  }
+
+  ...
+}
+```
+
+_Note:_ This output is similar to that of completeness measure as it produces the `total`, `complete` and `incomplete`
+metrics. For each record of the query resultant data set the `bad.record.definition` expression is evaluated, and the
+bad records are marked as incomplete records (`bad.record.definition` = `true`).
+
+#### Record Outputs
+
+To write records as output for SparkSQL measure, configure the measure with output section as below,
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "type": "record",
+      "name": "spark_sql_records"
+    }
+  ]
+
+  ...
+}
+ ```
+
+The above configuration will generate the records output like below,
+
+```
++-------------------+----------------------------------+-------------------------------+---------+-------+-------------+--------+
+|date_time          |incident                          |address                        |city     |zipcode|__tmst       |__status|
++-------------------+----------------------------------+-------------------------------+---------+-------+-------------+--------+
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI)         |3900 Block BLOCK EL CAMINO REAL|PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 05:56:00|DRUNK IN PUBLIC ADULT/MISC (647FA)|3900 Block BLOCK EL CAMINO REAL|PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI)         |3900 Block BLOCK EL CAMINO REAL|PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI)         |3900 Block BLOCK EL CAMINO REAL|PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 08:00:00|N&D/POSSESSION (11350)            |WILKIE WAY & JAMES RD          |PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 08:00:00|N&D/PARAPHERNALIA (11364)         |WILKIE WAY & JAMES RD          |PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 08:00:00|TRAFFIC/SUSPENDED LICENSE (14601) |WILKIE WAY & JAMES RD          |PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 10:30:00|TRAFFIC/MISC (TRAFMISC)           |EL CAMINO REAL & UNIVERSITY AV |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 11:31:00|PENAL CODE/MISC (PENALMI)         |400 Block ALMA ST              |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 12:04:00|B&P/MISC (B&PMISC)                |700 Block URBAN LN             |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 12:25:00|PENAL CODE/MISC (PENALMI)         |500 Block HIGH ST              |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 13:06:00|WARRANT/OTHER AGENCY (OWARRANT)   |800 Block BRYANT ST            |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 13:30:00|THEFT GRAND/BIKE/BIKE PARTS (487B)|1 Block UNIVERSITY AV          |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 13:35:00|PENAL CODE/MISC (PENALMI)         |800 Block BRYANT ST            |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 14:30:00|F&W/BRANDISHING (417)             |2200 Block EL CAMINO REAL      |PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 14:43:00|ACCIDENT MINOR INJURY (1181)      |3300 Block BLOCK EL CAMINO REAL|PALO ALTO|94306  |1619986534335|bad     |
+|2015-05-26 15:22:00|THEFT PETTY/MISC (488M)           |300 Block POE ST               |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 16:31:00|B&P/MISC (B&PMISC)                |500 Block WAVERLEY ST          |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 16:48:00|MUNI CODE/MISC (1090MISC)         |200 Block UNIVERSITY AV        |PALO ALTO|94301  |1619986534335|good    |
+|2015-05-26 16:48:00|MUNI CODE/MISC (1090MISC)         |200 Block UNIVERSITY AV        |PALO ALTO|94301  |1619986534335|good    |
++-------------------+----------------------------------+-------------------------------+---------+-------+-------------+--------+
+only showing top 20 rows
+ ```
+
+A new column `__status` has been added to the original data set on which this measure was executed. The value of this
+column can be either `bad` or `good` which can be used to calculate the metrics/ separate data based on quality etc.
+These values for `__status` column are based on the value of user-defined key `bad.record.definition`.
+
+_Note:_ This output is for `ConsoleSink`.
+
+**Further Reading**
+
+Unlike the other measures, the record output of SparkSQL measure can be different from the source data set (in terms of
+schema and/ or content). This is because the output is dependent on a user-defined query. In the example described
+above, the query has been explicitly written in a way that the output and input data sets have the same schema.
+
+Users can do the same by explicitly defining a boolean column (like `bad.record.definition`) with
+alias `__measure_{measure_name}` directly in the SQL query and defining
+`bad.record.definition: "__measure_{measure_name}"` in the measure config.
+
+For example if the user-defined name of the measure is `spark_sql_measure`, the alias column name
+becomes `__measure_spark_sql_measure`.
\ No newline at end of file

diff --git a/measure/src/main/resources/config-batch-preproc.json b/measure/src/main/resources/config-batch-preproc.json
index 5a091e6..b40f57a 100644
--- a/measure/src/main/resources/config-batch-preproc.json
+++ b/measure/src/main/resources/config-batch-preproc.json

@@ -74,22 +74,22 @@
       ]
     },
     {
-      "name": "query_measure",
+      "name": "spark_sql_measure",
       "type": "sparkSQL",
       "data.source": "crime_report_source",
       "config": {
-        "expr": "select * from crime_report_source",
-        "bad.record.definition": "zipcode is null and city is null"
+        "expr": "SELECT t.*, sq.zip IS NULL AS __measure_spark_sql_measure FROM crime_report_source AS t LEFT OUTER JOIN (SELECT zipcode as zip, COUNT(DISTINCT city) AS city_count FROM crime_report_source GROUP BY zipcode having city_count = 1) as sq ON sq.zip=t.zipcode",
+        "bad.record.definition": "__measure_spark_sql_measure"
       },
       "out": [
         {
           "type": "metric",
-          "name": "sql_metric",
+          "name": "spark_sql_metric",
           "flatten": "map"
         },
         {
           "type": "record",
-          "name": "sql_records"
+          "name": "spark_sql_records"
         }
       ]
     },
commit	15bcfa44969252865e47c34733a7b4e2a0584664	[log] [tgz]
author	chitralverma <chitralverma@gmail.com>	Sun May 02 23:09:43 2021 +0530
committer	chitralverma <chitralverma@gmail.com>	Mon May 03 02:21:57 2021 +0530
tree	38a0b44b4eefeffd1d8b0bf3ce2a85c3dae399bf
parent	aaa1bf3baad389871a7828549866789c62ea76ad [diff]