[GRIFFIN-358] Added general documentation for new dimensions/ measures and completeness measure configuration guide.
diff --git a/griffin-doc/measure/dimensions.md b/griffin-doc/measure/dimensions.md
new file mode 100644
index 0000000..577beaa
--- /dev/null
+++ b/griffin-doc/measure/dimensions.md
@@ -0,0 +1,114 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+Dimensions of Data Quality
+==========================
+
+In light of the management axiom “what gets measured gets managed” (Willcocks and Lester, 1996), the dimensions of Data
+Quality signify a crucial management element in the domain of data quality. Over time, researchers and
+practitioners have suggested several views of data quality measures/ dimensions, many of which have overlapping, and
+sometimes conflicting, interpretations.
+
+While it is important to embrace the diversity of views of data quality measures/ dimensions, it is equally important
+for the data quality research and practitioner community to be united in the consistent interpretation of this
+foundational concept.
+
+Apache Griffin takes a step towards this consistent interpretation through a systematic review of research and
+practitioner literature, and measures/ assesses the quality of user-defined data assets in terms of the following
+dimensions/ measures:
+
+- [Accuracy](#accuracy)
+- [Completeness](#completeness)
+- [Duplication](#duplication)
+- [Custom User-defined (SparkSQL based)](#sparksql)
+- [Profiling](#profiling)
+
+## Accuracy
+
+Data accuracy refers to the degree to which the values of a given attribute agree with an identified reference truth
+data (source of correct information). Inaccurate data may come from different sources like,
+
+- dynamically computed values,
+- the result of a manual workflow,
+- irate customers, etc.
+
+The accuracy measure quantifies the extent to which data sets contain correct, reliable and certified values that are
+free of error. Higher accuracy values signify that the said data set represents the “real-life” values/ objects that it
+intends to model.
+
+## Completeness
+
+Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
+unavailability (missing records), this does not represent a lack of completeness. As far as an individual datum is
+concerned, only two situations are possible - either a value is assigned to the attribute in question or not. The latter
+case is usually represented by a `null` value.
+
+The definition of Completeness and its scope may vary from one user to another. Thus, Apache Griffin allows
+users to define SQL-like expressions which describe their definition of completeness. For a tabular data set with
+columns `name`, `email` and `age`, some examples of such completeness definitions (each expression identifies the
+incomplete records) are mentioned below,
+
+- `name is NULL`
+- `name is NULL and age is NULL`
+- `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
+
+## Duplication
+
+Asserting the measure of duplication of the entities within a data set implies that no entity exists more than once
+within the data set and that there is a key that can be used to uniquely access each entity. For example, in a master
+product table, each product must appear once and be assigned a unique identifier that represents that product within a
+system or across multiple applications/ systems.
+
+Redundancies in a dataset can be measured in terms of the following metrics,
+
+- **Duplicate:** the number of values that are the same as other values in the list
+- **Distinct:** the number of non-null values that are different from each other (Non-unique + Unique)
+- **Non-Unique:** the number of values that have at least one duplicate in the list
+- **Unique:** the number of values that have no duplicates
+
+The duplication measure in Apache Griffin computes all of these metrics for a user-defined data asset.
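+
+As a hedged worked example (reading the definitions above so that Distinct = Non-Unique + Unique holds), consider a
+column containing the values `[a, a, b, b, b, c]`,
+
+```
+duplicate:   5   (a, a, b, b, b: each value equals at least one other value)
+distinct:    3   (a, b, c)
+non-unique:  2   (a, b: distinct values that occur more than once)
+unique:      1   (c: occurs exactly once)
+```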
+
+## SparkSQL
+
+In some cases, the above-mentioned quality dimensions/ measures may not be able to model a complete data quality
+definition. For such cases, Apache Griffin allows the definition of complex custom user-defined checks as SparkSQL
+queries.
+
+The SparkSQL measure is a pro mode that allows advanced users to define complex custom checks that are not covered by
+other measures. These SparkSQL queries may contain clauses like select/ from/ where/ group-by/ order-by/ limit, etc.
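+
+For example, a hypothetical check (the table and column names here are purely illustrative, not part of Griffin) could
+flag orders whose stored total disagrees with the sum of their line items,
+
+```sql
+-- Hypothetical tables: orders(order_id, total), line_items(order_id, amount)
+SELECT o.order_id
+FROM orders o
+JOIN line_items l ON o.order_id = l.order_id
+GROUP BY o.order_id, o.total
+HAVING o.total <> SUM(l.amount)
+```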
+
+## Profiling
+
+Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
+quality. Data profiling helps to find data quality rules and requirements that will support a more thorough data quality
+assessment in a later step. Data profiling can help us to,
+
+- **Discover Structure of data**
+
+ Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum,
+ minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of
+ phone numbers do not have the correct number of digits.
+
+- **Discover Content of data**
+
+ Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table
+ contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
+
+Data profiling thus creates a great deal of insight into the quality levels of our data.
diff --git a/griffin-doc/measure/measure-configuration-guide/completeness.md b/griffin-doc/measure/measure-configuration-guide/completeness.md
new file mode 100644
index 0000000..eda2ef2
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/completeness.md
@@ -0,0 +1,209 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Completeness Measure - Configuration Guide
+==========================================
+
+### Introduction
+
+Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
+unavailability (missing records), this does not represent a lack of completeness. As far as an individual datum is
+concerned, only two situations are possible - either a value is assigned to the attribute in question or not. The latter
+case is usually represented by a `null` value.
+
+The definition of Completeness and its scope may vary from one user to another. Thus, Apache Griffin allows
+users to define SQL-like expressions which describe their definition of completeness. For a tabular data set with
+columns `name`, `email` and `age`, some examples of such completeness definitions (each expression identifies the
+incomplete records) are mentioned below,
+
+- `name is NULL`
+- `name is NULL and age is NULL`
+- `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
+
+### Configuration
+
+The completeness measure can be configured as below,
+
+```json
+{
+ ...
+
+ "measures": [
+ {
+ "name": "completeness_measure",
+ "type": "completeness",
+ "data.source": "crime_report_source",
+ "config": {
+ "expr": "zipcode is null OR city is null"
+ },
+ "out": [
+ {
+ "type": "metric",
+ "name": "comp_metric",
+ "flatten": "map"
+ },
+ {
+ "type": "record",
+ "name": "comp_records"
+ }
+ ]
+ }
+ ]
+
+ ...
+}
+```
+
+##### Key Parameters:
+
+| Name | Type | Description | Supported Values |
+|:--------|:---------|:---------------------------------------|:----------------------------------------------------------|
+| name | `String` | User-defined name of this measure | - |
+| type | `String` | Type of Measure | completeness, duplication, profiling, accuracy, sparksql |
+| data.source | `String` | Name of data source on which this measure is applied | - |
+| config | `Object` | Configuration params of the measure | Depends on measure type ([see below](#example-config-object)) |
+| out | `List` | Define output(s) of measure execution | [See below](#outputs) |
+
+##### Example `config` Object:
+
+The `config` object for the completeness measure contains only one key, `expr`. The value of `expr` is a SQL-like
+expression string which defines completeness for this measure. More complex definitions can be built by combining
+expressions with `AND` and `OR`.
+
+_Note:_ This expression describes the bad or incomplete records. This means that for `"expr": "zipcode is NULL"`, the
+records which contain `null` in the zipcode column are considered incomplete.
+
+It can be defined as mentioned below,
+
+```json
+{
+ ...
+
+ "config": {
+ "expr": "zipcode is null OR city is null"
+ }
+
+ ...
+}
+```
+
+### Outputs
+
+The completeness measure supports the following two outputs,
+
+- Metrics
+- Records
+
+Users can choose to define any combination of these two outputs. To produce no outputs, omit the `out: [ ... ]` section
+from the measure configuration.
+
+#### Metrics Outputs
+
+To write metrics for the completeness measure, configure it with an output section as below,
+
+```json
+{
+ ...
+
+ "out": [
+ {
+ "name": "comp_metric",
+ "type": "metric",
+ "flatten": "map"
+ }
+ ]
+
+ ...
+}
+```
+
+This will generate metrics like the following,
+
+```json
+{
+ ...
+
+ "value": {
+ "completeness_measure": {
+ "measure_name": "completeness_measure",
+ "measure_type": "Completeness",
+ "data_source": "crime_report_source",
+ "metrics": {
+ "total": "4617",
+ "complete": "4459",
+ "incomplete": "158"
+ }
+ }
+ }
+
+ ...
+}
+```
+
+#### Record Outputs
+
+To write records as output for the completeness measure, configure it with an output section as below,
+
+```json
+{
+ ...
+
+ "out": [
+ {
+ "type": "record",
+ "name": "comp_records"
+ }
+ ]
+
+ ...
+}
+```
+
+The above configuration will generate a records output like the following,
+
+```
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+|date_time |incident |address |city |zipcode|__tmst |__status|
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 05:56:00|DRUNK IN PUBLIC ADULT/MISC (647FA) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 22:55:00|DRIVER'S LICENSE SUSPENDED/ALC (14601.2(A)VC)|EL CAMINO REAL & N SAN ANTONIO RD |"" |null |1619969055893|bad |
+|2015-05-26 22:55:00|TRAFFIC/SUSPENDED LICENSE (14601) |EL CAMINO REAL & N SAN ANTONIO RD |"" |null |1619969055893|bad |
+|2015-06-01 01:41:00|TRAFFIC/SUSPENDED LICENSE (14601) |QUARRY RD & ARBORETUM RD |PALO ALTO |94304 |1619969055893|good |
+|2015-06-01 02:49:00|TRAFFIC/SUSPENDED LICENSE (14601) |2000 Block BLOCK E BAYSHORE RD |PALO ALTO |94303 |1619969055893|good |
+|2015-06-01 03:13:00|DRIVING WITH A SUSPENDED OR RE (14601.1(A)VC)|100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 03:13:00|TRAFFIC/SUSPENDED LICENSE (14601) |100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 03:13:00|WARRANT/PALO ALTO (PWARRANT) |100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 16:20:00|BURGLARY ATTEMPT/AUTO (459AA) |300 Block LOWELL AV |PALO ALTO |94301 |1619969055893|good |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |800 Block EL CAMINO REAL |PALO ALTO |94301 |1619969055893|good |
+|2015-06-01 16:30:00|ASSAULT WITH DEADLY WEAPON (245) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |2000 Block EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-06-01 16:30:00|TRAFFIC/SUSPENDED LICENSE (14601) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|PENAL CODE/RESISTING ARREST (148RA) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|ASSAULT WITH DEADLY WEAPON (245) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 17:30:00|IDENTITY THEFT/MISC. (530M) |300 Block CREEKSIDE DR |PALO ALTO |94306 |1619969055893|good |
+|2015-06-01 17:30:00|PENAL CODE/TERRORIST THREATS (422) |100 Block CALIFORNIA AV |PALO ALTO |94306 |1619969055893|good |
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+only showing top 20 rows
+```
+
+A new column `__status` has been added to the original data set on which this measure was executed. The value of this
+column is either `bad` or `good`, and it can be used to calculate metrics or to separate data based on quality.
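+
+For example, assuming the records are registered as a temporary view named `comp_records` (an assumption for
+illustration, not something Griffin does automatically), the incomplete rows can be separated with a query like,
+
+```sql
+-- Select only the rows that the completeness measure marked as incomplete
+SELECT * FROM comp_records WHERE __status = 'bad'
+```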
+
+_Note:_ This output is for `ConsoleSink`.
\ No newline at end of file