Griffin 2.0.0 arch (#654)
* new proposal for data quality tool
* repolish
* enterprise job scheduler might retry or stop the downstream scheduler based on standardized result
* illustrate data quality etl phase, append after business phase
* elaborate integrate with business workflow
* init DQDiagrams
* add metric storage service in DQDiagrams
* triggered on demand
* add more metrics
* update arch diagram
* init two table diff result set
* data platform upgrades data quality checking pipeline
* typo
* update it as data quality constraints
diff --git a/griffin-doc/DQDiagrams.md b/griffin-doc/DQDiagrams.md
new file mode 100644
index 0000000..c964f25
--- /dev/null
+++ b/griffin-doc/DQDiagrams.md
@@ -0,0 +1,101 @@
+# DQ Diagrams
+## Arch
+
+
+## Entities
+
+### DQMetric
+> Represents a generic data quality metric used to assess various aspects of data quality (quantitative).
+
+- **DQCompletenessMetric**
+  > Measures the completeness of data, ensuring that all required data is present.
+
+  - **DQCOUNTMetric**
+    > A specific completeness metric that counts the number of non-missing values in a dataset.
+  - **DQNULLPERCENTAGEMetric**
+    > A specific completeness metric that computes the percentage of null values in a dataset.
+
+- **DQAccuracyMetric**
+  > Measures the accuracy of data, ensuring that data values are correct and conform to a known standard.
+
+  - **DQNULLMetric**
+    > An accuracy metric that counts the number of NULL values in a dataset.
+
+- **DQPROFILEMetric**
+  > Measures the statistical profile of data, such as max, min, median, avg, and stddev.
+
+  - **DQMAXMetric**
+    > A profile metric that computes the maximum of the values in a dataset.
+
+  - **DQMINMetric**
+    > A profile metric that computes the minimum of the values in a dataset.
+
+  - **DQMEDIANMetric**
+    > A profile metric that computes the median of the values in a dataset.
+
+  - **DQAVGMetric**
+    > A profile metric that computes the average of the values in a dataset.
+
+  - **DQSTDDEVMetric**
+    > A profile metric that computes the standard deviation of the values in a dataset.
+
+  - **DQTOPKMetric**
+    > A profile metric that lists the top k most frequent values in a dataset.
+
+
+
+- **DQUniquenessMetric**
+  > Measures the uniqueness of data, ensuring that there are no duplicate records.
+
+  - **DQDISTINCTCOUNTMetric**
+    > A specific uniqueness metric that counts the distinct records in a dataset.
+
+- **DQFreshnessMetric**
+ > Measures the freshness of data, ensuring that the data is up-to-date.
+
+ - **DQTTUMetric (Time to Usable)**
+ > A freshness metric that measures the time taken for data to become usable after it is created or updated.
+
+- **DQDiffMetric**
+ > Compares data across different datasets or points in time to identify discrepancies.
+
+ - **DQTableDiffMetric**
+ > A specific diff metric that compares entire tables to identify differences.
+
+ - **DQFileDiffMetric**
+ > A specific diff metric that compares files to identify differences.
+
+- **MetricStorageService**
+  > A service that stores and retrieves data quality metrics.
+
+
+- **DQJob**
+  > An abstraction over data-quality-related jobs.
+ - **MetricCollectingJob**
+ > A job that collects data quality metrics from various sources and stores them for analysis.
+
+ - **DQCheckJob**
+ > A job that performs data quality checks based on predefined rules and metrics.
+
+ - **DQAlertJob**
+ > A job that generates alerts when data quality issues are detected.
+
+ - **DQDag**
+ > A directed acyclic graph that defines the dependencies and execution order of various data quality jobs.
+
+- **Scheduler**
+  > A system that schedules and manages the execution of data quality jobs.
+  > This is the default scheduler; it launches data quality jobs periodically.
+
+  - **DolphinSchedulerAdapter**
+    > Connects our planned data quality jobs with Apache DolphinScheduler,
+    > allowing data quality jobs to be triggered upon the completion of dependent upstream jobs.
+  - **AirflowSchedulerAdapter**
+    > Connects our planned data quality jobs with Apache Airflow,
+    > so that data quality jobs can be triggered upon the completion of dependent upstream jobs.
+
+- **Worker**
+  > Open question: do we need another worker layer, given that most of the work is done on the big data side?
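
The metric hierarchy above can be sketched as a small class hierarchy (illustrative names and methods, not the actual implementation):

```python
class DQMetric:
    """Base class for a quantitative data quality metric."""
    def __init__(self, name: str, target: str):
        self.name = name
        self.target = target

class DQCompletenessMetric(DQMetric):
    """Completeness metrics ensure that required data is present."""

class DQCountMetric(DQCompletenessMetric):
    def compute(self, values):
        # count the non-missing values
        return sum(1 for v in values if v is not None)

class DQNullPercentageMetric(DQCompletenessMetric):
    def compute(self, values):
        # percentage of null values
        return 100.0 * sum(1 for v in values if v is None) / len(values)

count = DQCountMetric("count_of_users", "user_table")
null_pct = DQNullPercentageMetric("null_percentage_of_users", "user_table")
```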
+
diff --git a/griffin-doc/DataQualityTool.md b/griffin-doc/DataQualityTool.md
new file mode 100644
index 0000000..1347000
--- /dev/null
+++ b/griffin-doc/DataQualityTool.md
@@ -0,0 +1,177 @@
+# Data Quality Tool
+
+## Introduction
+
+In the evolving landscape of data architecture, ensuring data quality remains a critical success factor for all companies.
+Data architectures have progressed significantly over recent years, transitioning from relational databases and data
+warehouses to data lakes, hybrid data lake and warehouse combinations, and modern lakehouses.
+
+Despite these advancements, data quality issues persist and have become increasingly vital, especially in the era of AI
+and data integration. Improving data quality is essential for all organizations, and maintaining it across various
+environments requires a combination of people, processes, and technology.
+
+To address these challenges, we will upgrade our data quality tool, designed to be easily adopted by any data organization.
+This tool abstracts common data quality problems and integrates seamlessly with diverse data architectures.
+
+## Data Quality Dimensions
+
+1. **Accuracy** – Data should be error-free according to business needs.
+2. **Consistency** – Data should not conflict with other values across data sets.
+3. **Completeness** – Data should not be missing.
+4. **Timeliness** – Data should be up-to-date within a limited time frame.
+5. **Uniqueness** – Data should have no duplicates.
+6. **Validity** – Data should conform to a specified format.
+
+## Our New Architecture
+
+Our new architecture consists of two primary layers: the Data Quality Layer and the Integration Layer.
+
+### Data Quality Constraints Layer
+
+This constraints layer abstracts the core concepts of the data quality lifecycle, focusing on:
+
+- **Defining Specific Data Quality Constraints**:
+ - **Metrics**: Establishing specific data quality metrics.
+ - **Anomaly Detection**: Implementing methods for detecting anomalies.
+ - **Actions**: Defining actions to be taken based on the data quality assessments.
+
+- **Measuring Data Quality**:
+ - Utilizing various connectors such as SQL, HTTP, and CMD to measure data quality across different systems.
+
+- **Unifying Data Quality Results**:
+ - Creating a standardized and structured view of data quality results across different dimensions to ensure a consistent understanding.
+
+- **Flexible Data Quality Jobs**:
+ - Designing data quality jobs within a generic, topological Directed Acyclic Graph (DAG) framework to facilitate easy plug-and-play functionality.
+
+### Integration Layer
+
+This layer provides a robust framework to enable users to integrate Griffin data quality pipelines seamlessly with their business processes. It includes:
+
+- **Scheduler Integration**:
+ - Ensuring seamless integration with typical schedulers for efficient pipeline execution.
+
+- **Apache DolphinScheduler Integration**:
+ - Facilitating effortless integration within the Java ecosystem to leverage Apache DolphinScheduler.
+
+- **Apache Airflow Integration**:
+ - Enabling smooth integration within the AI ecosystem using Apache Airflow.
+
+This architecture aims to provide a comprehensive and flexible approach to managing data quality
+and integrating it into the existing business workflows of data teams.
+
+As a result, an enterprise job scheduling system can launch optional data quality check pipelines after the usual data jobs
+finish and, based on the data quality result, schedule actions such as retrying or stopping downstream scheduling, like a circuit breaker.
+
+### Data Quality Layer
+
+#### Data Quality Constraints Definition
+
+This concept has been thoroughly discussed in the original Apache Griffin design documents. Essentially, we aim to quantify
+the data quality of a dataset based on the aforementioned dimensions. For example, to measure the count of records in a user
+table, our data quality constraint definition could be:
+
+**Simple Version:**
+
+- **Metric**
+ - Name: count_of_users
+ - Target: user_table
+ - Dimension: count
+- **Anomaly Condition:** $metric <= 0
+- **Post Action:** send alert
+
+**Advanced Version:**
+
+- **Metric**
+ - Name: count_of_users
+ - Target: user_table
+ - Filters: city = 'shanghai' and event_date = '20240601'
+ - Dimension: count
+- **Anomaly Condition:** $metric <= 0
+- **Post Action:** send alert
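
As a sketch of how such a constraint could be modeled and evaluated (all names here are illustrative, not the actual Griffin API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str
    target: str
    dimension: str
    filters: str = ""

@dataclass
class DQConstraint:
    metric: Metric
    anomaly_condition: Callable[[float], bool]   # True when the value is anomalous
    post_action: Callable[[Metric, float], None]

    def evaluate(self, value):
        """Return True when the constraint passes; fire post_action otherwise."""
        if self.anomaly_condition(value):
            self.post_action(self.metric, value)
            return False
        return True

alerts = []
constraint = DQConstraint(
    metric=Metric(name="count_of_users", target="user_table", dimension="count",
                  filters="city = 'shanghai' and event_date = '20240601'"),
    anomaly_condition=lambda v: v <= 0,  # $metric <= 0
    post_action=lambda m, v: alerts.append(f"alert: {m.name} = {v}"),  # send alert
)
```

`constraint.evaluate(0)` would trip the anomaly condition and record an alert, while a healthy count passes silently.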
+
+#### Data Quality Pipelines (DAG)
+
+We support several typical data quality pipelines:
+
+**One Dataset Profiling Pipeline:**
+
+```plaintext
+recording_target_table_metric_job -> anomaly_condition_job -> post_action_job
+```
+
+**Dataset Diff Pipeline:**
+
+```plaintext
+recording_target_table1_metric_job ->
+ \
+ -> anomaly_condition_job -> post_action_job
+ /
+recording_target_table2_metric_job ->
+```
+
+**Compute Platform Migration Pipeline:**
+
+```plaintext
+run_job_on_platform_v1 -> recording_target_table_metric_job_on_v1 ->
+ \
+ -> anomaly_condition_job -> post_action_job
+ /
+run_job_on_platform_v2 -> recording_target_table_metric_job_on_v2 ->
+```
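
The fan-in structure of these pipelines can be sketched with a topological run over the job graph (job names taken from the diagrams above; the runner itself is illustrative):

```python
from graphlib import TopologicalSorter

# dataset diff pipeline: two metric-recording jobs fan in to the anomaly check
dag = {
    "anomaly_condition_job": {"recording_target_table1_metric_job",
                              "recording_target_table2_metric_job"},
    "post_action_job": {"anomaly_condition_job"},
}

executed = []
for job in TopologicalSorter(dag).static_order():
    executed.append(job)  # a real runner would dispatch each job here
```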
+#### Data Quality Report
+
+- **Meet Expectations**
+  + Data Quality Constraint 1: Passed
+  + Data Quality Constraint 2: Passed
+- **Does Not Meet Expectations**
+  + Data Quality Constraint 3: Failed
+    - Violation details
+    - Possible root cause
+  + Data Quality Constraint 4: Failed
+    - Violation details
+    - Possible root cause
+
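A minimal sketch (hypothetical helper, not part of Griffin) of grouping constraint results into such a report:

```python
def build_report(results):
    """results: iterable of (constraint_name, passed, details) tuples."""
    report = {"meets_expectations": [], "does_not_meet_expectations": []}
    for name, passed, details in results:
        if passed:
            report["meets_expectations"].append(f"{name}: Passed")
        else:
            report["does_not_meet_expectations"].append(
                f"{name}: Failed ({details})")
    return report

report = build_report([
    ("Data Quality Constraint 1", True, ""),
    ("Data Quality Constraint 3", False, "null percentage above threshold"),
])
```
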
+#### Connectors
+
+The executor measures the data quality of the target dataset by recording the metrics. It supports many predefined protocols,
+and customers can extend the executor protocol if they want to add their own business logic.
+
+**Predefined Protocols:**
+
+- MySQL: `jdbc:mysql://hostname:port/database_name?user=username&password=password`
+- Presto: `jdbc:presto://hostname:port/catalog/schema`
+- Trino: `jdbc:trino://hostname:port/catalog/schema`
+- HTTP: `http://hostname:port/api/v1/query?query=<prometheus_query>`
+- Docker
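
As an illustrative sketch of the connector idea, using SQLite in place of the JDBC protocols listed above (table and column names are made up for the example):

```python
import sqlite3

def record_count_metric(conn, target, filters=""):
    """Measure the 'count' dimension of a target table through a SQL connector."""
    where = f" WHERE {filters}" if filters else ""
    (value,) = conn.execute(f"SELECT COUNT(*) FROM {target}{where}").fetchone()
    return {"name": f"count_of_{target}", "dimension": "count", "value": value}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (city TEXT, event_date TEXT)")
conn.executemany("INSERT INTO user_table VALUES (?, ?)",
                 [("shanghai", "20240601"), ("beijing", "20240601")])
metric = record_count_metric(conn, "user_table", "city = 'shanghai'")
```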
+
+### Integration Layer
+
+Every data team has its own existing scheduler.
+While we provide a default scheduler, for greater adoption, we will refactor
+our Apache Griffin scheduler capabilities to leverage our customers' schedulers.
+This involves redesigning our scheduler to either ingest job instances into our customers' schedulers
+or bridge our DQ pipelines to their DAGs.
+
+```plaintext
+ biz_etl_phase || data_quality_phase
+ ||
+business_etl_job -> recording_target_table1_metric_job - ->
+ || \
+ || -> anomaly_condition_job -> post_action_job
+ || /
+business_etl_job -> recording_target_table2_metric_job - ->
+ ||
+```
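
The bridging idea can be sketched with a hypothetical adapter interface (the real adapters would target the DolphinScheduler and Airflow APIs):

```python
from abc import ABC, abstractmethod

class SchedulerAdapter(ABC):
    """Bridges a DQ pipeline into an external scheduler."""
    @abstractmethod
    def register(self, dq_pipeline, upstream_job):
        """Attach dq_pipeline so it triggers after upstream_job completes."""

class RecordingAdapter(SchedulerAdapter):
    """Trivial in-memory adapter, standing in for the DolphinScheduler/Airflow bridges."""
    def __init__(self):
        self.registrations = []
    def register(self, dq_pipeline, upstream_job):
        self.registrations.append((upstream_job, dq_pipeline))

adapter = RecordingAdapter()
adapter.register(["recording_target_table1_metric_job",
                  "anomaly_condition_job",
                  "post_action_job"],
                 upstream_job="business_etl_job")
```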
+
+- Integration with a generic scheduler
+- Integration with Apache DolphinScheduler
+- Integration with Apache Airflow
+
diff --git a/griffin-doc/TwoTablesDiffResult.md b/griffin-doc/TwoTablesDiffResult.md
new file mode 100644
index 0000000..6357a56
--- /dev/null
+++ b/griffin-doc/TwoTablesDiffResult.md
@@ -0,0 +1,16 @@
+# Two tables diff result set
+We want to unify the result set for comparing two tables. When the two tables' schemas are the same,
+we can construct the result set as below so that our users can quickly find the differences between the two tables.
+
+| diff_type | col1_src | col1_target | col2_src | col2_target | col3_src | col3_target | col4_src | col4_target |
+|------------|--------------|-------------|-----------|-------------|-----------|-------------|------------|-------------|
+| missing | prefix1 | NULL | sug_vote1 | NULL | pv_total1 | NULL | 2024-01-01 | NULL |
+| additional | NULL | prefix1 | NULL | sug_vote2 | NULL | pv_total2 | NULL | 2024-01-01 |
+| missing | prefix3 | NULL | sug_vote3 | NULL | pv_total3 | NULL | 2024-01-03 | NULL |
+| additional | NULL | prefix4 | NULL | sug_vote4 | NULL | pv_total3 | NULL | 2024-01-03 |
+| missing | prefix5 | NULL | sug_vote5 | NULL | pv_total5 | NULL | 2024-01-05 | NULL |
+| additional | NULL | prefix5 | NULL | sug_vote5 | NULL | pv_total6 | NULL | 2024-01-05 |
+| missing | prefix7 | NULL | sug_vote7 | NULL | pv_total7 | NULL | 2024-01-07 | NULL |
+| additional | NULL | prefix8 | NULL | sug_vote8 | NULL | pv_total8 | NULL | 2024-01-07 |
+| missing | prefix9 | NULL | sug_vote9 | NULL | pv_total9 | NULL | 2024-01-09 | NULL |
+| additional | NULL | prefix10 | NULL | sug_vote10 | NULL | pv_total10 | NULL | 2024-01-09 |
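
A minimal sketch (illustrative, not the actual Griffin implementation) of producing such interleaved diff rows from two lists of tuples:

```python
def two_table_diff(src_rows, target_rows):
    """Build unified diff rows for two same-schema tables:
    'missing' rows exist only in src, 'additional' rows only in target."""
    src, target = set(src_rows), set(target_rows)
    rows = []
    for row in sorted(src - target):
        # interleave as (col1_src, col1_target, ...) with target side NULL
        rows.append(("missing",) + tuple(v for col in row for v in (col, None)))
    for row in sorted(target - src):
        # same interleaving with the src side NULL
        rows.append(("additional",) + tuple(v for col in row for v in (None, col)))
    return rows

diff = two_table_diff(
    src_rows=[("prefix1", "sug_vote1", "pv_total1", "2024-01-01")],
    target_rows=[("prefix1", "sug_vote2", "pv_total2", "2024-01-01")],
)
```

Rows that match exactly are dropped; every remaining source row yields a `missing` row and every remaining target row an `additional` row, mirroring the table above.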
diff --git a/griffin-doc/arch2.png b/griffin-doc/arch2.png
new file mode 100644
index 0000000..cc871bf
--- /dev/null
+++ b/griffin-doc/arch2.png
Binary files differ