I"ý<h2 id="user-story">User Story</h2>
<p>Say we have two data sets (demo_src and demo_tgt), and we need to know the data quality of the target data set, based on the source data set.</p>
<p>For simplicity, suppose both data sets have the same schema:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint
age int
desc string
dt string
hour string
</code></pre></div></div>
<p>Both dt and hour are partition columns: each day has one daily partition dt (like 20180912), and within each day there are 24 hourly partitions hour (like 00, 01, 02, …, 23).</p>
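<p>For illustration, assuming the default Hive partition directory layout and the HDFS locations used in the table definitions below, an hourly partition of demo_src would land under a path like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hdfs:///griffin/data/batch/demo_src/dt=20180912/hour=09/&lt;data files&gt;
</code></pre></div></div>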
<h2 id="environment-preparation">Environment Preparation</h2>
<p>You need to prepare the environment for the Apache Griffin measure module, including the following software:</p>
<ul>
<li>JDK (1.8+)</li>
<li>Hadoop (2.6.0+)</li>
<li>Spark (2.2.1+)</li>
<li>Hive (2.2.0)</li>
</ul>
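<p>As a quick sanity check (assuming the tools are already installed and on your PATH), you can print the installed versions before going further:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java -version
hadoop version
spark-submit --version
hive --version
</code></pre></div></div>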
<h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2>
<ol>
<li>Download Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li>
<li>Unzip the source package.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip
cd griffin-0.4.0-source-release
</code></pre></div> </div>
</li>
<li>Build Apache Griffin jars.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install
</code></pre></div> </div>
<p>Move the built Apache Griffin measure jar to your work path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar &lt;work path&gt;/griffin-measure.jar
</code></pre></div> </div>
</li>
</ol>
<h2 id="data-preparation">Data Preparation</h2>
<p>For this quick start, we will generate two Hive tables, demo_src and demo_tgt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_tgt';
</code></pre></div></div>
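<p>As a quick check (assuming the hive CLI is available and you created the tables in the default database), you can confirm the tables and their partition columns:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive -e "SHOW TABLES;"
hive -e "DESCRIBE FORMATTED demo_src;"
</code></pre></div></div>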
<p>The data could be generated like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student
2|23|engineer
3|42|cook
...
</code></pre></div></div>
<p>For demo_src and demo_tgt, some records may differ between the two data sets.
You can download the <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to get the two data source files.
Then we will load data into both tables for every hour.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
LOAD DATA LOCAL INPATH 'demo_tgt' INTO TABLE demo_tgt PARTITION (dt='20180912',hour='09');
</code></pre></div></div>
<p>Or you can just execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the directory downloaded above to generate and load data into the tables hourly.</p>
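<p>To confirm the data landed (a sketch, assuming the hive CLI and the 20180912/09 partition loaded above), you can list the partitions and peek at a few rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive -e "SHOW PARTITIONS demo_src;"
hive -e "SELECT * FROM demo_src WHERE dt='20180912' AND hour='09' LIMIT 5;"
</code></pre></div></div>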
<h2 id="define-data-quality-measure">Define data quality measure</h2>
<h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4>
<p>The environment config file: env.json</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"spark": {
"log.level": "WARN"
},
"sinks": [
{
"type": "console"
},
{
"type": "hdfs",
"config": {
"path": "hdfs:///griffin/persist"
}
},
{
"type": "elasticsearch",
"config": {
"method": "post",
"api": "http://es:9200/griffin/accuracy"
}
}
]
}
</code></pre></div></div>
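<p>The elasticsearch sink above assumes an Elasticsearch instance reachable at http://es:9200; replace the host and index with your own. As a quick check before running the job, you can verify the endpoint responds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://es:9200
</code></pre></div></div>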
<h4 id="define-griffin-data-quality">Define griffin data quality</h4>
<p>The DQ config file: dq.json</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"name": "batch_accu",
"process.type": "batch",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "hive",
"version": "1.2",
"config": {
"database": "default",
"table.name": "demo_src"
}
}
]
}, {
"name": "tgt",
"connectors": [
{
"type": "hive",
"version": "1.2",
"config": {
"database": "default",
"table.name": "demo_tgt"
}
}
]
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out": [
{
"type": "metric",
"name": "accu"
},
{
"type": "record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
</code></pre></div></div>
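<p>The accuracy rule above checks, for each record in src, whether a matching record exists in tgt; conceptually the reported metric is derived from the three counts named in the <code class="language-plaintext highlighter-rouge">details</code> block. A small illustration with made-up counts (not output from an actual run):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>accuracy (%) = matched_count / total_count * 100
             = (total_count - miss_count) / total_count * 100

example (made-up counts): total_count = 1000, miss_count = 4
accuracy = 996 / 1000 * 100 = 99.6
</code></pre></div></div>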
<h2 id="measure-data-quality">Measure data quality</h2>
<p>Submit the measure job to Spark, with config file paths as parameters.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
&lt;path&gt;/griffin-measure.jar \
&lt;path&gt;/env.json &lt;path&gt;/dq.json
</code></pre></div></div>
<h2 id="report-data-quality-metrics">Report data quality metrics</h2>
<p>You can follow the calculation log in the console; after the job finishes, the result metrics are printed. The metrics will also be saved in HDFS: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/&lt;job name&gt;/&lt;timestamp&gt;/_METRICS</code>.</p>
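<p>To inspect the persisted metrics (assuming the HDFS sink path from env.json and the job name batch_accu from dq.json), you can list the job directory and print the metrics file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hdfs dfs -ls hdfs:///griffin/persist/batch_accu/
hdfs dfs -cat hdfs:///griffin/persist/batch_accu/&lt;timestamp&gt;/_METRICS
</code></pre></div></div>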
<h2 id="refine-data-quality-report">Refine Data Quality report</h2>
<p>Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.</p>
<h2 id="more-details">More Details</h2>
<p>For more details about Apache Griffin measures, you can visit our documentation on <a href="https://github.com/apache/griffin/tree/master/griffin-doc">GitHub</a>.</p>