[GRIFFIN-358] Added profiling measure configuration guide.
diff --git a/griffin-doc/measure/dimensions.md b/griffin-doc/measure/dimensions.md
index 123b8fb..d253179 100644
--- a/griffin-doc/measure/dimensions.md
+++ b/griffin-doc/measure/dimensions.md
@@ -53,6 +53,8 @@
free of error. Higher accuracy values signify that the said data set represents the “real-life” values/ objects that it
intends to model.
+A detailed measure configuration guide is available [here](measure-configuration-guide/accuracy.md).
+
## Completeness
Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
@@ -68,6 +70,8 @@
- `name is NULL and age is NULL`
- `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
+A detailed measure configuration guide is available [here](measure-configuration-guide/completeness.md).
+
## Duplication
Asserting the measure of duplication of the entities within a data set implies that no entity exists more than once
@@ -84,6 +88,8 @@
Duplication measure in Apache Griffin computes all of these metrics for a user-defined data asset.
+A detailed measure configuration guide is available [here](measure-configuration-guide/duplication.md).
+
## SparkSQL
In some cases, the above-mentioned dimensions/ measures may not enough to model a complete data quality definition. For
@@ -93,6 +99,8 @@
by other measures. These SparkSQL queries may contain clauses like `select`, `from`, `where`, `group-by`, `order-by`
, `limit`, etc.
+A detailed measure configuration guide is available [here](measure-configuration-guide/sparksql.md).
+
## Profiling
Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
@@ -112,3 +120,5 @@
Data Profiling helps us create a huge amount of insight into the quality levels of our data and helps to find data
quality rules and requirements that will support a more thorough data quality assessment in a later step.
+
+A detailed measure configuration guide is available [here](measure-configuration-guide/profiling.md).
\ No newline at end of file
diff --git a/griffin-doc/measure/measure-configuration-guide/profiling.md b/griffin-doc/measure/measure-configuration-guide/profiling.md
new file mode 100644
index 0000000..c3bf602
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/profiling.md
@@ -0,0 +1,207 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Profiling Measure - Configuration Guide
+=====================================
+
+### Introduction
+
+Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
+quality. Data profiling helps to find data quality rules and requirements that will support a more thorough data quality
+assessment in a later step. Data profiling can help us to:
+
+- **Discover Structure of data**
+
+ Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum,
+ minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of
+ phone numbers do not have the correct number of digits.
+
+- **Discover Content of data**
+
+ Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table
+ contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
+
+The process of Data profiling involves:
+
+- Collecting descriptive statistics like min, max, count and sum
+- Collecting data types, length and recurring patterns
+- Discovering metadata and assessing its accuracy, etc.
+
+A common problem in data management circles is confusion between data profiling and data quality assessment, because
+the two terms are often used interchangeably.
+
+Data profiling gives us considerable insight into the quality levels of our data and helps to find data quality
+rules and requirements that will support a more thorough data quality assessment in a later step. For example,
+data profiling can help us to discover value frequencies, formats and patterns for each attribute in the data
+asset. Using data profiling alone, we can find some perceived defects and outliers in the data asset. This leaves
+us with a range of clues from which appropriate quality assessment measures, such as completeness or
+distinctness, can be defined.
+### Configuration
+
+The Profiling measure can be configured as shown below:
+
+```json
+{
+  ...
+
+  "measures": [
+    {
+      "name": "profiling_measure",
+      "type": "profiling",
+      "data.source": "crime_report_source",
+      "config": {
+        "expr": "city,zipcode",
+        "approx.distinct.count": true,
+        "round.scale": 2
+      },
+      "out": [
+        {
+          "type": "metric",
+          "name": "prof_metric",
+          "flatten": "map"
+        }
+      ]
+    }
+  ]
+
+  ...
+}
+```
+
+##### Key Parameters:
+
+| Name        | Type     | Description                                          | Supported Values                                              |
+|:------------|:---------|:-----------------------------------------------------|:--------------------------------------------------------------|
+| name        | `String` | User-defined name of this measure                    | -                                                             |
+| type        | `String` | Type of Measure                                      | completeness, duplication, profiling, accuracy, sparksql      |
+| data.source | `String` | Name of data source on which this measure is applied | -                                                             |
+| config      | `Object` | Configuration params of the measure                  | Depends on measure type ([see below](#example-config-object)) |
+| out         | `List`   | Define output(s) of measure execution                | [See below](#outputs)                                         |
+
+##### Example `config` Object:
+
+The `config` object for the Profiling measure contains the following keys:
+
+- `expr`: A comma-separated string of the columns in the data asset on which the profiling measure is to be
+  executed. `expr` is an optional key for the Profiling measure: if it is not defined, all columns in the data set
+  are profiled.
+
+- `approx.distinct.count`: A boolean value. If this is `true`, the distinct counts are approximated, allowing up to
+  5% error. Approximate counts are usually faster but less accurate. If this is set to `false`, the counts are
+  exact.
+
+- `round.scale`: Several resultant metrics of the profiling measure are floating-point numbers. This key controls the
+  extent to which these numbers are rounded. For example, if `round.scale = 2`, all floating-point metric values
+  are rounded to 2 decimal places.
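+
+As a further illustration of these keys, the sketch below shows a hypothetical variant of the `config` object. It
+omits `expr`, so that all columns are profiled, and disables approximation, so that distinct counts are exact. The
+values are assumptions chosen for illustration and are not taken from a tested job:
+
+```json
+{
+  "approx.distinct.count": false,
+  "round.scale": 3
+}
+```
+
+With approximation disabled, the per-column profile would be expected to report `distinct_count` rather than
+`approx_distinct_count`.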
+
+### Outputs
+
+Unlike other measures, Profiling does not produce record outputs. Thus, only metric outputs need to be configured.
+
+#### Metrics Outputs
+
+For each column in the data set, the profile contains the following:
+
+- `avg_col_len`: Average length of the column value across all rows
+- `max_col_len`: Maximum length of the column value across all rows
+- `min_col_len`: Minimum length of the column value across all rows
+- `avg`: Average column value across all rows
+- `max`: Maximum column value across all rows
+- `min`: Minimum column value across all rows
+- `null_count`: Count of null values across all rows for this column
+- `approx_distinct_count` **OR** `distinct_count`: Count of (approximate) distinct values across all rows for this column
+- `variance`: Measure of variability from the average or mean
+- `kurtosis`: Measure of whether the data are heavy-tailed or light-tailed relative to a normal
+  distribution
+- `std_dev`: Standard deviation, a measure of the amount of variation or dispersion of a set of values
+- `total`: Total number of values across all rows; this is the same for all columns
+- `data_type`: Data type of this column
+
+To write Profiling metrics, configure the measure with an output section as shown below:
+
+```json
+{
+  ...
+
+  "out": [
+    {
+      "name": "prof_metric",
+      "type": "metric",
+      "flatten": "map"
+    }
+  ]
+
+  ...
+}
+```
+
+This generates metrics like the following:
+
+```json
+{
+  ...
+
+  "value": {
+    "profiling_measure": {
+      "measure_name": "profiling_measure",
+      "measure_type": "Profiling",
+      "data_source": "crime_report_source",
+      "metrics": {
+        "column_details": {
+          "city": {
+            "avg_col_len": null,
+            "max_col_len": "25",
+            "variance": null,
+            "kurtosis": null,
+            "avg": null,
+            "min": null,
+            "null_count": "0",
+            "approx_distinct_count": "6",
+            "total": "4617",
+            "std_dev": null,
+            "data_type": "string",
+            "max": null,
+            "min_col_len": "2"
+          },
+          "zipcode": {
+            "avg_col_len": "5.0",
+            "max_col_len": "5",
+            "variance": "4.57",
+            "kurtosis": "-1.57",
+            "avg": "94303.11",
+            "min": "94301",
+            "null_count": "158",
+            "approx_distinct_count": "4",
+            "total": "4617",
+            "std_dev": "2.14",
+            "data_type": "int",
+            "max": "94306",
+            "min_col_len": "5"
+          }
+        }
+      }
+    }
+  }
+
+  ...
+}
+```
+
+_Note:_ Some mathematical metrics are bound to the type of the attribute under consideration. For example, standard
+deviation cannot be calculated for a column of string type, so the values of these metrics are null for such columns.
\ No newline at end of file