update documentation

added extra information

Author: iyuriysoft <42405281+iyuriysoft@users.noreply.github.com>

Closes #479 from iyuriysoft/patch-4.

commit: 73b76e03a1b00fa11b168f055160e0f82a3d34de
author: iyuriysoft <42405281+iyuriysoft@users.noreply.github.com> Tue Jan 29 07:02:49 2019 +0800
committer: William Guo <guoyp@apache.org> Tue Jan 29 07:02:49 2019 +0800
tree: d1df92e082ce7d4ec8ae85bbfa29814873074919
parent: 07f78ad0a8bed4da96ac27b282b54cee6329d9e1
diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md
index 5296176..b9a4b8c 100644
--- a/griffin-doc/measure/dsl-guide.md
+++ b/griffin-doc/measure/dsl-guide.md

@@ -127,6 +127,14 @@
 Distinctness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is distinct.
     e.g. `name, age`, `name, (age + 1) as next_age`
 
+### Uniqueness Rule
+A uniqueness rule expression in Apache Griffin DSL is a comma-separated list of selection expressions that indicates the columns to check for uniqueness. An item is unique if it has no duplicate in the data.
+    e.g. `name, age`, `name, (age + 1) as next_age`
+
+### Completeness Rule
+A completeness rule expression in Apache Griffin DSL is a comma-separated list of selection expressions that indicates the columns to check for null values.
+    e.g. `name, age`, `name, (age + 1) as next_age`
+
 ### Timeliness Rule
 Timeliness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicates the input time and output time (calculate time as default if not set).  
 	e.g. `ts`, `ts, end_ts`
@@ -167,6 +175,12 @@
 
 After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`.
 
+### Completeness
+Completeness checks for null values: an item is incomplete if any of the columns you measure is null.
+- **total count of source**: `SELECT COUNT(*) AS total FROM source`, save as table `total_count`.
+- **incomplete metric**: `SELECT count(*) as incomplete FROM source WHERE NOT (id IS NOT NULL)`, save as table `incomplete_count`.
+- **complete metric**: `SELECT (total_count.total - incomplete_count.incomplete) AS complete FROM total_count LEFT JOIN incomplete_count`, save as table `complete_count`.
+
 ### Timeliness
 For timeliness, is to measure the latency of each item, and get the statistics of the latencies.  
 For example, the dsl rule is `ts, out_ts`, the first column means the input time of item, the second column means the output time of item, if not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below:  

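The three completeness steps added in the hunk above can be tried out on a small table. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module instead of Spark SQL, the rows in `source` are made up, the measured column is assumed to be `id`, and the last step reads the total from `total_count` and uses a cross join because both helper tables hold exactly one row.

```python
import sqlite3

# Hypothetical in-memory table standing in for the `source` data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "a"), (None, "b"), (3, "c"), (None, "d"), (5, "e")],
)

# Step 1: total count of source, saved as table `total_count`.
conn.execute("CREATE TABLE total_count AS SELECT COUNT(*) AS total FROM source")

# Step 2: count of rows whose measured column is null, saved as `incomplete_count`.
conn.execute(
    "CREATE TABLE incomplete_count AS "
    "SELECT COUNT(*) AS incomplete FROM source WHERE NOT (id IS NOT NULL)"
)

# Step 3: complete = total - incomplete (assumed to read from `total_count`).
total, incomplete, complete = conn.execute(
    "SELECT total_count.total, incomplete_count.incomplete, "
    "total_count.total - incomplete_count.incomplete AS complete "
    "FROM total_count CROSS JOIN incomplete_count"
).fetchone()

print(total, incomplete, complete)
```

With the five sample rows, two have a null `id`, so the sketch reports 5 total, 2 incomplete, and 3 complete.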
diff --git a/griffin-doc/measure/measure-configuration-guide.md b/griffin-doc/measure/measure-configuration-guide.md
index 2522ee4..1013ae6 100644
--- a/griffin-doc/measure/measure-configuration-guide.md
+++ b/griffin-doc/measure/measure-configuration-guide.md

@@ -238,10 +238,30 @@
     * num: the duplicate number name in metric, optional.
     * duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
     * with.accumulate: optional, default is true, if set as false, in streaming mode, the data set will not compare with old data to check distinctness.
+  + uniqueness dq type detail configuration
+    * source: name of data source to measure uniqueness.
+    * target: name of data source to compare with. It is either the same as source or a superset of it.
+    * unique: the unique count name in metric, optional.
+    * total: the total count name in metric, optional.
+    * dup: the duplicate count name in metric, optional.
+    * num: the duplicate number name in metric, optional.
+    * duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
+  + completeness dq type detail configuration
+    * source: name of data source to measure completeness.
+    * total: the total count name in metric, optional.
+    * complete: the column name in metric, optional. The number of non-null values.
+    * incomplete: the column name in metric, optional. The number of null values.
   + timeliness dq type detail configuration
     * source: name of data source to measure timeliness.
     * latency: the latency column name in metric, optional.
+    * total: column name, optional.
+    * avg: column name, optional. The average latency.
+    * step: column name, optional. The histogram bin of a latency, where step = floor(latency / step.size).
+    * count: column name, optional. The number of latencies that fall into a given step.
+    * percentile: column name, optional.
     * threshold: optional, if set as a time string like "1h", the items with latency more than 1 hour will be record.
+    * step.size: optional, the bin width in milliseconds used to build the histogram of latencies (e.g. "100").
+    * percentile.values: optional, the percentiles to compute, given as values between 0 and 1. For instance, [0.1, 0.9] shows the latencies at the fast and slow ends of the distribution.
 - **cache**: Cache output dataframe. Optional, valid only for "spark-sql" and "df-ops" mode. Defaults to `false` if not specified.
 - **out**: List of output sinks for the job.
   + Metric output.
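To make the new timeliness options `step.size` and `percentile.values` more concrete, here is a minimal Python sketch of the arithmetic they describe. It is not Griffin's implementation: the helper names `latency_histogram` and `nearest_rank_percentile` are hypothetical, the latencies are made-up values in milliseconds, and the nearest-rank percentile method is an assumption that may differ from what the measure module computes.

```python
import math


def latency_histogram(latencies_ms, step_size_ms):
    """Count latencies per step, where step = floor(latency / step.size)."""
    counts = {}
    for latency in latencies_ms:
        step = math.floor(latency / step_size_ms)
        counts[step] = counts.get(step, 0) + 1
    return dict(sorted(counts.items()))


def nearest_rank_percentile(latencies_ms, p):
    """Nearest-rank percentile for 0 < p <= 1 (an assumed method)."""
    ordered = sorted(latencies_ms)
    # Round before ceil so float noise such as 9.000000000000002 cannot shift the rank.
    rank = max(1, math.ceil(round(p * len(ordered), 9)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
latencies = [20, 90, 110, 150, 180, 240, 260, 310, 990, 1200]

hist = latency_histogram(latencies, 100)        # step.size = "100"
avg = sum(latencies) / len(latencies)           # the `avg` metric
p10 = nearest_rank_percentile(latencies, 0.1)   # percentile.values = [0.1, 0.9]
p90 = nearest_rank_percentile(latencies, 0.9)

print(hist, avg, p10, p90)
```

Each key of `hist` corresponds to a `step` value and each value to its `count`. The two percentiles bracket the bulk of the latencies, which is why a pair such as [0.1, 0.9] gives a quick view of the fast and slow ends.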