update documentation
added extra information
Author: iyuriysoft <42405281+iyuriysoft@users.noreply.github.com>
Closes #479 from iyuriysoft/patch-4.
diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md
index 5296176..b9a4b8c 100644
--- a/griffin-doc/measure/dsl-guide.md
+++ b/griffin-doc/measure/dsl-guide.md
@@ -127,6 +127,14 @@
Distinctness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is distinct.
e.g. `name, age`, `name, (age + 1) as next_age`
+### Uniqueness Rule
+A uniqueness rule expression in Apache Griffin DSL is a comma-separated list of selection expressions indicating the columns to check for uniqueness. An item is unique if it has no duplicate in the data.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
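To make the notion of "items without any replica" concrete, here is a minimal sketch for the rule `name, age`. It is not Griffin's actual translation: it uses Python's built-in `sqlite3` as a stand-in for Spark SQL, and the table `source`, its columns, and the helper `uniqueness_counts` are assumptions made for illustration only.

```python
import sqlite3


def uniqueness_counts(rows):
    """Return (total, unique) for (name, age) rows.

    Hypothetical helper: an item counts as unique when its
    (name, age) combination occurs exactly once in the data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO source VALUES (?, ?)", rows)

    # Total number of items in the source.
    total = conn.execute("SELECT COUNT(*) FROM source").fetchone()[0]

    # Items whose (name, age) combination has no replica.
    unique = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT name, age FROM source"
        "  GROUP BY name, age HAVING COUNT(*) = 1"
        ")"
    ).fetchone()[0]

    conn.close()
    return total, unique


sample = [("ann", 30), ("ann", 30), ("bob", 25), ("ann", 31)]
print(uniqueness_counts(sample))  # (4, 2)
```

In this sample, `("ann", 30)` appears twice and is therefore not unique, while `("bob", 25)` and `("ann", 31)` each appear once.
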
+### Completeness Rule
+A completeness rule expression in Apache Griffin DSL is a comma-separated list of selection expressions indicating the columns to check for null values.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
### Timeliness Rule
Timeliness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicates the input time and output time (calculate time as default if not set).
e.g. `ts`, `ts, end_ts`
@@ -167,6 +175,12 @@
After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`.
+### Completeness
+Completeness checks for null values: a measured column is incomplete wherever its value is null. For example, if the dsl rule is `id`, after the translation the sql rule is as below:
+- **total count of source**: `SELECT COUNT(*) AS total FROM source`, save as table `total_count`.
+- **incomplete metric**: `SELECT count(*) as incomplete FROM source WHERE NOT (id IS NOT NULL)`, save as table `incomplete_count`.
+- **complete metric**: `SELECT (total_count.total - incomplete_count.incomplete) AS complete FROM total_count LEFT JOIN incomplete_count`, save as table `complete_count`.
+
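The completeness steps above can be tried out with a small sketch. It assumes Python's built-in `sqlite3` as a stand-in for Spark SQL, a made-up `source` table with columns `id` and `name`, and a hypothetical helper `completeness_counts`. The final subtraction is done in Python rather than through a join of intermediate tables, so this illustrates the arithmetic, not Griffin's implementation.

```python
import sqlite3


def completeness_counts(rows):
    """Return (total, incomplete, complete) for the `id` column.

    Hypothetical helper: `rows` is a list of (id, name) tuples,
    where None stands for a SQL NULL.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO source VALUES (?, ?)", rows)

    # total count of source
    total = conn.execute(
        "SELECT COUNT(*) AS total FROM source"
    ).fetchone()[0]

    # incomplete metric: rows where the measured column is null
    incomplete = conn.execute(
        "SELECT COUNT(*) AS incomplete FROM source "
        "WHERE NOT (id IS NOT NULL)"
    ).fetchone()[0]

    conn.close()

    # complete metric: total minus incomplete
    complete = total - incomplete
    return total, incomplete, complete


sample = [(1, "a"), (None, "b"), (3, None), (None, None)]
print(completeness_counts(sample))  # (4, 2, 2)
```

Only nulls in the measured column `id` count as incomplete here; the null `name` in the third row does not affect the result.
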
### Timeliness
For timeliness, is to measure the latency of each item, and get the statistics of the latencies.
For example, the dsl rule is `ts, out_ts`, the first column means the input time of item, the second column means the output time of item, if not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below:
diff --git a/griffin-doc/measure/measure-configuration-guide.md b/griffin-doc/measure/measure-configuration-guide.md
index 2522ee4..1013ae6 100644
--- a/griffin-doc/measure/measure-configuration-guide.md
+++ b/griffin-doc/measure/measure-configuration-guide.md
@@ -238,10 +238,30 @@
* num: the duplicate number name in metric, optional.
* duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
* with.accumulate: optional, default is true, if set as false, in streaming mode, the data set will not compare with old data to check distinctness.
+ + uniqueness dq type detail configuration
+ * source: name of data source to measure uniqueness.
+ * target: name of data source to compare with. It is either the same as source or a superset of it.
+ * unique: the unique count name in metric, optional.
+ * total: the total count name in metric, optional.
+ * dup: the duplicate count name in metric, optional.
+ * num: the duplicate number name in metric, optional.
+ * duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
+ + completeness dq type detail configuration
+ * source: name of data source to measure completeness.
+ * total: the total count name in metric, optional.
+ * complete: the complete count name in metric, optional. The number of non-null values.
+ * incomplete: the incomplete count name in metric, optional. The number of null values.
+ timeliness dq type detail configuration
* source: name of data source to measure timeliness.
* latency: the latency column name in metric, optional.
+ * total: column name, optional.
+ * avg: column name, optional. The average latency.
+ * step: column name, optional. The histogram bin index, computed as step = floor(latency / step.size).
+ * count: column name, optional. The number of items whose latency falls in a given step.
+ * percentile: column name, optional.
* threshold: optional, if set as a time string like "1h", the items with latency more than 1 hour will be record.
+ * step.size: optional, the bin width used to build the histogram of latencies, in milliseconds (e.g. "100").
+ * percentile.values: optional, used to compute the percentile metrics, values between 0 and 1. For instance, setting [0.1, 0.9] shows the fastest and slowest latencies.
- **cache**: Cache output dataframe. Optional, valid only for "spark-sql" and "df-ops" mode. Defaults to `false` if not specified.
- **out**: List of output sinks for the job.
+ Metric output.