[GRIFFIN-358] Added profiling measure configuration guide.

commit: 89e38a83e49cd97a6041dc068a19013dea68b8b1
author: chitralverma <chitralverma@gmail.com> Mon May 03 03:03:52 2021 +0530
committer: chitralverma <chitralverma@gmail.com> Mon May 03 03:03:52 2021 +0530
tree: e10643ed3cfaad9c886d5c2cf533c45a4dae779a
parent: 15bcfa44969252865e47c34733a7b4e2a0584664
diff --git a/griffin-doc/measure/dimensions.md b/griffin-doc/measure/dimensions.md
index 123b8fb..d253179 100644
--- a/griffin-doc/measure/dimensions.md
+++ b/griffin-doc/measure/dimensions.md

@@ -53,6 +53,8 @@
 free of error. Higher accuracy values signify that the said data set represents the “real-life” values/ objects that it
 intends to model.
 
+A detailed measure configuration guide is available [here](measure-configuration-guide/accuracy.md).
+
 ## Completeness
 
 Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
@@ -68,6 +70,8 @@
 - `name is NULL and age is NULL`
 - `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
 
+A detailed measure configuration guide is available [here](measure-configuration-guide/completeness.md).
+
 ## Duplication
 
 Asserting the measure of duplication of the entities within a data set implies that no entity exists more than once
@@ -84,6 +88,8 @@
 
 Duplication measure in Apache Griffin computes all of these metrics for a user-defined data asset.
 
+A detailed measure configuration guide is available [here](measure-configuration-guide/duplication.md).
+
 ## SparkSQL
 
 In some cases, the above-mentioned dimensions/ measures may not be enough to model a complete data quality definition. For
@@ -93,6 +99,8 @@
 by other measures. These SparkSQL queries may contain clauses like `select`, `from`, `where`, `group-by`, `order-by`
 , `limit`, etc.
 
+A detailed measure configuration guide is available [here](measure-configuration-guide/sparksql.md).
+
 ## Profiling
 
 Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
@@ -112,3 +120,5 @@
 
 Data Profiling helps us create a huge amount of insight into the quality levels of our data and helps to find data
 quality rules and requirements that will support a more thorough data quality assessment in a later step.
+
+A detailed measure configuration guide is available [here](measure-configuration-guide/profiling.md).
\ No newline at end of file

diff --git a/griffin-doc/measure/measure-configuration-guide/profiling.md b/griffin-doc/measure/measure-configuration-guide/profiling.md
new file mode 100644
index 0000000..c3bf602
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/profiling.md

@@ -0,0 +1,207 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Profiling Measure - Configuration Guide
+=====================================
+
+### Introduction
+
+Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
+quality. Data profiling helps to find data quality rules and requirements that will support a more thorough data quality
+assessment in a later step. Data profiling can help us to,
+
+- **Discover Structure of data**
+
+  Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum,
+  minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of
+  phone numbers do not have the correct number of digits.
+
+- **Discover Content of data**
+
+  Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table
+  contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
+
+The process of Data profiling involves:
+
+- Collecting descriptive statistics like min, max, count and sum
+- Collecting data types, length and recurring patterns
+- Discovering metadata and assessing its accuracy, etc.
+
+A common problem in data management circles is the confusion around what is meant by Data profiling as opposed to Data
+Quality Assessment due to the interchangeable use of these 2 terms.
+
+Data Profiling helps us create a huge amount of insight into the quality levels of our data and helps to find data
+quality rules and requirements that will support a more thorough data quality assessment in a later step. For example,
+data profiling can help us to discover value frequencies, formats and patterns for each attribute in the data asset.
+Using data profiling alone, we can find some perceived defects and outliers in the data asset, and we end up with a
+range of clues from which the right quality assessment measures (completeness, distinctness, etc.) can be defined.
+
+### Configuration
+
+The Profiling measure can be configured as below,
+
+```json
+{
+  ...
+
+  "measures": [
+    {
+      "name": "profiling_measure",
+      "type": "profiling",
+      "data.source": "crime_report_source",
+      "config": {
+        "expr": "city,zipcode",
+        "approx.distinct.count": true,
+        "round.scale": 2
+      },
+      "out": [
+        {
+          "type": "metric",
+          "name": "prof_metric",
+          "flatten": "map"
+        }
+      ]
+    }
+  ]
+
+  ...
+}
+ ```
+
+##### Key Parameters:
+
+| Name        | Type     | Description                                          | Supported Values                                              |
+|:------------|:---------|:-----------------------------------------------------|:--------------------------------------------------------------|
+| name        | `String` | User-defined name of this measure                    | -                                                             |
+| type        | `String` | Type of Measure                                      | completeness, duplication, profiling, accuracy, sparksql      |
+| data.source | `String` | Name of data source on which this measure is applied | -                                                             |
+| config      | `Object` | Configuration params of the measure                  | Depends on measure type ([see below](#example-config-object)) |
+| out         | `List`   | Define output(s) of measure execution                | [See below](#outputs)                                         |
+
+##### Example `config` Object:
+
+`config` object for Profiling measure contains the following keys,
+
+- `expr`: The value for `expr` is a comma-separated string of columns in the data asset on which the profiling measure
+  is to be executed. `expr` is an optional key for the Profiling measure, i.e., if it is not defined, all columns in the
+  data set will be profiled.
+
+- `approx.distinct.count`: The value for this key is boolean. If this is `true`, the distinct counts will be
+  approximated to allow up to 5% error. Approximate counts are usually faster but less accurate. If this is set
+  to `false`, then the counts will be 100% accurate.
+
+- `round.scale`: Several resultant metrics of the profiling measure are floating-point numbers. This key controls the
+  extent to which these floating-point numbers are rounded. For example, if `round.scale = 2` then all floating-point
+  metric values will be rounded to 2 decimal places.
+
+### Outputs
+
+Unlike other measures, Profiling does not produce record outputs, so only metric outputs need to be configured.
+
+#### Metrics Outputs
+
+For each column in the data set, the profile contains the following,
+
+- avg_col_len: Average length of the column value across all rows
+- max_col_len: Maximum length of the column value across all rows
+- min_col_len: Minimum length of the column value across all rows
+- avg: Average column value across all rows
+- max: Maximum column value across all rows
+- min: Minimum column value across all rows
+- null_count: Count of null values across all rows for this column
+- approx_distinct_count **OR** distinct_count: Count of (approx) distinct values across all rows for this column
+- variance: Variance measures variability from the average or mean.
+- kurtosis: Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
+  distribution.
+- std_dev: Standard deviation is a measure of the amount of variation or dispersion of a set of values.
+- total: Total values across all rows. This is the same for all columns.
+- data_type: Data type of this column.
+
+To write Profiling metrics, configure the measure with output section as below,
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "name": "prof_metric",
+      "type": "metric",
+      "flatten": "map"
+    }
+  ]
+
+  ...
+}
+ ```
+
+This will generate the metrics like below,
+
+```json
+{
+  ...
+
+  "value": {
+    "profiling_measure": {
+      "measure_name": "profiling_measure",
+      "measure_type": "Profiling",
+      "data_source": "crime_report_source",
+      "metrics": {
+        "column_details": {
+          "city": {
+            "avg_col_len": null,
+            "max_col_len": "25",
+            "variance": null,
+            "kurtosis": null,
+            "avg": null,
+            "min": null,
+            "null_count": "0",
+            "approx_distinct_count": "6",
+            "total": "4617",
+            "std_dev": null,
+            "data_type": "string",
+            "max": null,
+            "min_col_len": "2"
+          },
+          "zipcode": {
+            "avg_col_len": "5.0",
+            "max_col_len": "5",
+            "variance": "4.57",
+            "kurtosis": "-1.57",
+            "avg": "94303.11",
+            "min": "94301",
+            "null_count": "158",
+            "approx_distinct_count": "4",
+            "total": "4617",
+            "std_dev": "2.14",
+            "data_type": "int",
+            "max": "94306",
+            "min_col_len": "5"
+          }
+        }
+      }
+    }
+  }
+
+  ...
+}
+```
+
+_Note:_ Some mathematical metrics are bound to the type of the attribute under consideration. For example, standard
+deviation cannot be calculated for a column of string type, so the values of these metrics are null for such columns.
\ No newline at end of file
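
The closing note in the new guide (type-bound metrics are null for string columns) and the `round.scale` key can be illustrated with a small standalone sketch. This is not Griffin's implementation: the function name `profile_column`, the sample values, and the use of sample variance are all assumptions made for illustration, and only a subset of the documented metrics is computed.

```python
from statistics import mean, stdev, variance


def profile_column(values, round_scale=2):
    """Profile one column (hypothetical helper); None stands for a null value."""
    present = [v for v in values if v is not None]
    profile = {
        "total": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        # Type-bound metrics default to None, mirroring the nulls in the guide.
        "min": None, "max": None, "avg": None,
        "variance": None, "std_dev": None,
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        profile["min"] = min(present)
        profile["max"] = max(present)
        profile["avg"] = round(mean(present), round_scale)
        if len(present) > 1:
            # Assumption: sample variance / standard deviation.
            profile["variance"] = round(variance(present), round_scale)
            profile["std_dev"] = round(stdev(present), round_scale)
    return profile


# Made-up sample data, not taken from the guide's data set.
city = profile_column(["Palo Alto", "Menlo Park", "Palo Alto"])
zipcode = profile_column([94301, 94303, None, 94306])
print(city)
print(zipcode)
```

With this sample data, the string column keeps `None` for `avg`, `variance` and `std_dev`, while the integer column gets values rounded to two decimal places.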