[GRIFFIN-358] Added general documentation for new dimensions/ measures and completeness measure configuration guide.
diff --git a/griffin-doc/measure/dimensions.md b/griffin-doc/measure/dimensions.md
new file mode 100644
index 0000000..577beaa
--- /dev/null
+++ b/griffin-doc/measure/dimensions.md
@@ -0,0 +1,114 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+Dimensions of Data Quality
+==========================
+
+In light of the management axiom “what gets measured gets managed” (Willcocks and Lester, 1996), the dimensions of Data
+Quality signify a crucial management element in the domain of data quality. Over time, researchers and
+practitioners have suggested several views of data quality measures/ dimensions, many of which have overlapping, and
+sometimes conflicting, interpretations.
+
+While it is important to embrace the diversity of views of data quality measures/ dimensions, it is equally important
+for the data quality research and practitioner community to be united in the consistent interpretation of this
+foundational concept.
+
+Apache Griffin takes a step towards this consistent interpretation through a systematic review of research and
+practitioner literature, and measures/ assesses the quality of user-defined data assets in terms of the following
+dimensions/ measures:
+
+- [Accuracy](#accuracy)
+- [Completeness](#completeness)
+- [Duplication](#duplication)
+- [Custom User-defined (SparkSQL based)](#sparksql)
+- [Profiling](#profiling)
+
+## Accuracy
+
+Data accuracy refers to the degree to which the values of a given attribute agree with an identified reference truth
+data (source of correct information). Inaccurate data may come from different sources like,
+
+- dynamically computed values,
+- the result of a manual workflow,
+- irate customers, etc.
+
+The accuracy measure quantifies the extent to which data sets contain correct, reliable and certified values that are
+free of error. Higher accuracy values signify that the said data set represents the “real-life” values/ objects that it
+intends to model.
+
+## Completeness
+
+Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
+unavailability (missing records), this does not represent a lack of completeness. As far as an individual datum is
+concerned, only two situations are possible - either a value is assigned to the attribute in question or not. The latter
+case is usually represented by a `null` value.
+
+The definition of Completeness and its scope may vary from one user to another. Thus, Apache Griffin allows
+users to define SQL-like expressions which describe their definition of completeness. For a tabular data set with
+columns `name`, `email` and `age`, some examples of such completeness definitions (each expression identifies the
+incomplete records) are mentioned below,
+
+- `name is NULL`
+- `name is NULL and age is NULL`
+- `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
+
+## Duplication
+
+Asserting the measure of duplication of the entities within a data set implies that no entity exists more than once
+within the data set and that there is a key that can be used to uniquely access each entity. For example, in a master
+product table, each product must appear once and be assigned a unique identifier that represents that product within a
+system or across multiple applications/ systems.
+
+Redundancies in a dataset can be measured in terms of the following metrics,
+
+- **Duplicate:** the number of values that are the same as other values in the list
+- **Distinct:** the number of non-null values that are different from each other (Non-unique + Unique)
+- **Non-Unique:** the number of values that have at least one duplicate in the list
+- **Unique:** the number of values that have no duplicates
+
+The duplication measure in Apache Griffin computes all of these metrics for a user-defined data asset.
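+
+As a hedged worked example (reading the definitions above so that Distinct = Non-Unique + Unique holds), consider a
+column containing the values `[a, a, b, b, b, c]`,
+
+```
+duplicate:   5   (a, a, b, b, b: each value equals at least one other value)
+distinct:    3   (a, b, c)
+non-unique:  2   (a, b: distinct values that occur more than once)
+unique:      1   (c: occurs exactly once)
+```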
+
+## SparkSQL
+
+In some cases, the above-mentioned quality dimensions/ measures may not be able to model a complete data quality
+definition. For such cases, Apache Griffin allows the definition of complex custom user-defined checks as SparkSQL
+queries.
+
+The SparkSQL measure is a pro mode that allows advanced users to define complex custom checks that are not covered by
+other measures. These SparkSQL queries may contain clauses like select/ from/ where/ group-by/ order-by/ limit, etc.
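+
+For example, a hypothetical check (the table and column names here are purely illustrative, not part of Griffin) could
+flag orders whose stored total disagrees with the sum of their line items,
+
+```sql
+-- Hypothetical tables: orders(order_id, total), line_items(order_id, amount)
+SELECT o.order_id
+FROM orders o
+JOIN line_items l ON o.order_id = l.order_id
+GROUP BY o.order_id, o.total
+HAVING o.total <> SUM(l.amount)
+```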
+
+## Profiling
+
+Data processing and its analysis can't truly be complete without data profiling - reviewing source data for content and
+quality. Data profiling helps to find data quality rules and requirements that will support a more thorough data quality
+assessment in a later step. Data profiling can help us to,
+
+- **Discover Structure of data**
+
+ Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum,
+ minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of
+ phone numbers do not have the correct number of digits.
+
+- **Discover Content of data**
+
+ Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table
+ contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
+
+Data profiling thus creates a great deal of insight into the quality levels of our data.
diff --git a/griffin-doc/measure/measure-configuration-guide/completeness.md b/griffin-doc/measure/measure-configuration-guide/completeness.md
new file mode 100644
index 0000000..eda2ef2
--- /dev/null
+++ b/griffin-doc/measure/measure-configuration-guide/completeness.md
@@ -0,0 +1,209 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Completeness Measure - Configuration Guide
+==========================================
+
+### Introduction
+
+Completeness refers to the degree to which values are present in a data collection. When data is incomplete due to
+unavailability (missing records), this does not represent a lack of completeness. As far as an individual datum is
+concerned, only two situations are possible - either a value is assigned to the attribute in question or not. The latter
+case is usually represented by a `null` value.
+
+The definition of Completeness and its scope may vary from one user to another. Thus, Apache Griffin allows
+users to define SQL-like expressions which describe their definition of completeness. For a tabular data set with
+columns `name`, `email` and `age`, some examples of such completeness definitions (each expression identifies the
+incomplete records) are mentioned below,
+
+- `name is NULL`
+- `name is NULL and age is NULL`
+- `email NOT RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'`
+
+### Configuration
+
+The completeness measure can be configured as below,
+
+```json
+{
+ ...
+
+ "measures": [
+ {
+ "name": "completeness_measure",
+ "type": "completeness",
+ "data.source": "crime_report_source",
+ "config": {
+ "expr": "zipcode is null OR city is null"
+ },
+ "out": [
+ {
+ "type": "metric",
+ "name": "comp_metric",
+ "flatten": "map"
+ },
+ {
+ "type": "record",
+ "name": "comp_records"
+ }
+ ]
+ }
+ ]
+
+ ...
+}
+```
+
+##### Key Parameters:
+
+| Name | Type | Description | Supported Values |
+|:--------|:---------|:---------------------------------------|:----------------------------------------------------------|
+| name | `String` | User-defined name of this measure | - |
+| type | `String` | Type of Measure | completeness, duplication, profiling, accuracy, sparksql |
+| data.source | `String` | Name of data source on which this measure is applied | - |
+| config | `Object` | Configuration params of the measure | Depends on measure type ([see below](#example-config-object)) |
+| out | `List` | Define output(s) of measure execution | [See below](#outputs) |
+
+##### Example `config` Object:
+
+The `config` object for the completeness measure contains only one key, `expr`. The value of `expr` is a SQL-like
+expression string which defines completeness for this measure. More complex definitions can be built by combining
+expressions with `AND` and `OR`.
+
+_Note:_ This expression describes the bad or incomplete records. This means that for `"expr": "zipcode is NULL"`, the
+records which contain `null` in the zipcode column are considered incomplete.
+
+It can be defined as mentioned below,
+
+```json
+{
+ ...
+
+ "config": {
+ "expr": "zipcode is null OR city is null"
+ }
+
+ ...
+}
+```
+
+### Outputs
+
+The completeness measure supports the following two outputs,
+
+- Metrics
+- Records
+
+Users can choose to define any combination of these two outputs. To produce no outputs, omit the `out: [ ... ]` section
+from the measure configuration.
+
+#### Metrics Outputs
+
+To write metrics for the completeness measure, configure it with an output section as below,
+
+```json
+{
+ ...
+
+ "out": [
+ {
+ "name": "comp_metric",
+ "type": "metric",
+ "flatten": "map"
+ }
+ ]
+
+ ...
+}
+```
+
+This will generate metrics like the following,
+
+```json
+{
+ ...
+
+ "value": {
+ "completeness_measure": {
+ "measure_name": "completeness_measure",
+ "measure_type": "Completeness",
+ "data_source": "crime_report_source",
+ "metrics": {
+ "total": "4617",
+ "complete": "4459",
+ "incomplete": "158"
+ }
+ }
+ }
+
+ ...
+}
+```
+
+#### Record Outputs
+
+To write records as output for the completeness measure, configure it with an output section as below,
+
+```json
+{
+ ...
+
+ "out": [
+ {
+ "type": "record",
+ "name": "comp_records"
+ }
+ ]
+
+ ...
+}
+```
+
+The above configuration will generate a records output like the following,
+
+```
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+|date_time |incident |address |city |zipcode|__tmst |__status|
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 05:56:00|DRUNK IN PUBLIC ADULT/MISC (647FA) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 05:56:00|PENAL CODE/MISC (PENALMI) |3900 Block BLOCK EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-05-26 22:55:00|DRIVER'S LICENSE SUSPENDED/ALC (14601.2(A)VC)|EL CAMINO REAL & N SAN ANTONIO RD |"" |null |1619969055893|bad |
+|2015-05-26 22:55:00|TRAFFIC/SUSPENDED LICENSE (14601) |EL CAMINO REAL & N SAN ANTONIO RD |"" |null |1619969055893|bad |
+|2015-06-01 01:41:00|TRAFFIC/SUSPENDED LICENSE (14601) |QUARRY RD & ARBORETUM RD |PALO ALTO |94304 |1619969055893|good |
+|2015-06-01 02:49:00|TRAFFIC/SUSPENDED LICENSE (14601) |2000 Block BLOCK E BAYSHORE RD |PALO ALTO |94303 |1619969055893|good |
+|2015-06-01 03:13:00|DRIVING WITH A SUSPENDED OR RE (14601.1(A)VC)|100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 03:13:00|TRAFFIC/SUSPENDED LICENSE (14601) |100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 03:13:00|WARRANT/PALO ALTO (PWARRANT) |100 Block SAN ANTONIO RD |PALO ALTO |null |1619969055893|bad |
+|2015-06-01 16:20:00|BURGLARY ATTEMPT/AUTO (459AA) |300 Block LOWELL AV |PALO ALTO |94301 |1619969055893|good |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |800 Block EL CAMINO REAL |PALO ALTO |94301 |1619969055893|good |
+|2015-06-01 16:30:00|ASSAULT WITH DEADLY WEAPON (245) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |2000 Block EL CAMINO REAL |PALO ALTO |94306 |1619969055893|good |
+|2015-06-01 16:30:00|TRAFFIC/SUSPENDED LICENSE (14601) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|PENAL CODE/RESISTING ARREST (148RA) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|BURGLARY/AUTO (459A) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 16:30:00|ASSAULT WITH DEADLY WEAPON (245) |1100 Block N RENGSTORFF AV |MOUNTAIN VIEW|null |1619969055893|bad |
+|2015-06-01 17:30:00|IDENTITY THEFT/MISC. (530M) |300 Block CREEKSIDE DR |PALO ALTO |94306 |1619969055893|good |
+|2015-06-01 17:30:00|PENAL CODE/TERRORIST THREATS (422) |100 Block CALIFORNIA AV |PALO ALTO |94306 |1619969055893|good |
++-------------------+---------------------------------------------+---------------------------------------+-------------+-------+-------------+--------+
+only showing top 20 rows
+```
+
+A new column `__status` has been added to the original data set on which this measure was executed. The value of this
+column is either `bad` or `good`, and it can be used to calculate metrics or to separate data based on quality.
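+
+For example, assuming the records are registered as a temporary view named `comp_records` (an assumption for
+illustration, not something Griffin does automatically), the incomplete rows can be separated with a query like,
+
+```sql
+-- Select only the rows that the completeness measure marked as incomplete
+SELECT * FROM comp_records WHERE __status = 'bad'
+```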
+
+_Note:_ This output is for `ConsoleSink`.
\ No newline at end of file