| <h2 id="user-story">User Story</h2> |
| <p>Say we have one data set (demo_src), partitioned by hour, and we want to know what the data looks like for each hour.</p> |
| |
| <p>For simplicity, suppose the data set has the following schema:</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint |
| age int |
| desc string |
| dt string |
| hour string |
| </code></pre></div></div> |
| <p>Both dt and hour are partition columns: each day has one daily partition dt (like 20180912), and each daily partition contains 24 hourly partitions (like 00, 01, 02, …, 23).</p> |
| |
| <h2 id="environment-preparation">Environment Preparation</h2> |
| <p>You need to prepare the environment for the Apache Griffin measure module, including the following software:</p> |
| <ul> |
| <li>JDK (1.8+)</li> |
| <li>Hadoop (2.6.0+)</li> |
| <li>Spark (2.2.1+)</li> |
| <li>Hive (2.2.0)</li> |
| </ul> |
| |
| <h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2> |
| <ol> |
| <li>Download the Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li> |
| <li>Unzip the source package. |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip |
| cd griffin-0.4.0-source-release |
| </code></pre></div> </div> |
| </li> |
| <li>Build Apache Griffin jars. |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install |
| </code></pre></div> </div> |
| |
| <p>Move the built Apache Griffin measure jar to your work path.</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar &lt;work path&gt;/griffin-measure.jar |
| </code></pre></div> </div> |
| </li> |
| </ol> |
| |
| <h2 id="data-preparation">Data Preparation</h2> |
| |
| <p>For our quick start, we will create a Hive table, demo_src.</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Create the Hive table (HQL script). |
| -- Note: replace the HDFS location with your own path. |
| CREATE EXTERNAL TABLE `demo_src`( |
| `id` bigint, |
| `age` int, |
| `desc` string) |
| PARTITIONED BY ( |
| `dt` string, |
| `hour` string) |
| ROW FORMAT DELIMITED |
| FIELDS TERMINATED BY '|' |
| LOCATION |
| 'hdfs:///griffin/data/batch/demo_src'; |
| </code></pre></div></div> |
| <p>The data could look like this:</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student |
| 2|23|engineer |
| 3|42|cook |
| ... |
| </code></pre></div></div> |
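| <p>As a rough illustration of the row format above, the following Python sketch writes and reads pipe-delimited records. The helpers <code class="language-plaintext highlighter-rouge">format_row</code> and <code class="language-plaintext highlighter-rouge">parse_row</code> are hypothetical names chosen for this example; they are not part of Apache Griffin or of the demo scripts.</p> |

```python
def format_row(record_id, age, desc):
    """Serialize one record as a '|'-delimited line: id|age|desc."""
    return f"{record_id}|{age}|{desc}"


def parse_row(line):
    """Split a '|'-delimited line back into an (id, age, desc) tuple."""
    record_id, age, desc = line.rstrip("\n").split("|")
    return int(record_id), int(age), desc


# The three sample records shown above.
rows = [(1, 18, "student"), (2, 23, "engineer"), (3, 42, "cook")]
lines = [format_row(*row) for row in rows]
```

| <p>The delimiter matches the <code class="language-plaintext highlighter-rouge">FIELDS TERMINATED BY '|'</code> clause of the table definition, so a file of such lines can be loaded into demo_src as is.</p> |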
| <p>You can download the <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to generate the data source file. |
| Then load the data into the Hive table for each hour.</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09'); |
| </code></pre></div></div> |
| <p>Alternatively, execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above to generate the data and load it into the table hourly.</p> |
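| <p>To make the hourly loading concrete, here is a minimal Python sketch that produces one <code class="language-plaintext highlighter-rouge">LOAD DATA</code> statement per hourly partition of a day. The function name <code class="language-plaintext highlighter-rouge">load_statements</code> is hypothetical, and reusing the same source file for every hour is an assumption made for illustration; the demo script may work differently.</p> |

```python
def load_statements(dt, source_file="demo_src", table="demo_src"):
    """Return one LOAD DATA statement per hourly partition (00..23) of day dt."""
    template = (
        "LOAD DATA LOCAL INPATH '{src}' INTO TABLE {table} "
        "PARTITION (dt='{dt}',hour='{hour}');"
    )
    return [
        # Hours are zero-padded to two digits, matching the partition values.
        template.format(src=source_file, table=table, dt=dt, hour=f"{h:02d}")
        for h in range(24)
    ]


statements = load_statements("20180912")
```

| <p>The resulting statements could be written to a script file and executed with your Hive client.</p> |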
| |
| <h2 id="define-data-quality-measure">Define data quality measure</h2> |
| |
| <h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4> |
| <p>The environment config file: env.json</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ |
| "spark": { |
| "log.level": "WARN" |
| }, |
| "sinks": [ |
| { |
| "type": "console" |
| }, |
| { |
| "type": "hdfs", |
| "config": { |
| "path": "hdfs:///griffin/persist" |
| } |
| }, |
| { |
| "type": "elasticsearch", |
| "config": { |
| "method": "post", |
| "api": "http://es:9200/griffin/accuracy" |
| } |
| } |
| ] |
| } |
| </code></pre></div></div> |
| |
| <h4 id="define-apache-griffin-data-quality">Define Apache Griffin data quality</h4> |
| <p>The DQ config file: dq.json</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ |
| "name": "batch_prof", |
| "process.type": "batch", |
| "data.sources": [ |
| { |
| "name": "src", |
| "baseline": true, |
| "connectors": [ |
| { |
| "type": "hive", |
| "version": "1.2", |
| "config": { |
| "database": "default", |
| "table.name": "demo_src" |
| } |
| } |
| ] |
| } |
| ], |
| "evaluate.rule": { |
| "rules": [ |
| { |
| "dsl.type": "griffin-dsl", |
| "dq.type": "profiling", |
| "out.dataframe.name": "prof", |
| "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max", |
| "out": [ |
| { |
| "type": "metric", |
| "name": "prof" |
| } |
| ] |
| } |
| ] |
| }, |
| "sinks": ["CONSOLE", "HDFS"] |
| } |
| </code></pre></div></div> |
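| <p>Before submitting a job, it can help to check that the two config files agree with each other. The Python sketch below reports sink names requested in dq.json that have no counterpart in env.json. It assumes that the names in the <code class="language-plaintext highlighter-rouge">sinks</code> list of dq.json are matched case-insensitively against the <code class="language-plaintext highlighter-rouge">type</code> values in env.json; that matching rule is an assumption of this sketch, so verify it against the documentation of your Griffin version. The function name <code class="language-plaintext highlighter-rouge">undefined_sinks</code> is hypothetical.</p> |

```python
import json

# Abbreviated copies of the two config files shown in this guide.
ENV_JSON = """
{"sinks": [{"type": "console"},
           {"type": "hdfs", "config": {"path": "hdfs:///griffin/persist"}},
           {"type": "elasticsearch",
            "config": {"method": "post", "api": "http://es:9200/griffin/accuracy"}}]}
"""
DQ_JSON = '{"name": "batch_prof", "process.type": "batch", "sinks": ["CONSOLE", "HDFS"]}'


def undefined_sinks(env, dq):
    """Return sink names requested in dq.json with no matching type in env.json.

    Assumption: names are compared case-insensitively.
    """
    defined = {sink["type"].lower() for sink in env.get("sinks", [])}
    return [name for name in dq.get("sinks", []) if name.lower() not in defined]


missing = undefined_sinks(json.loads(ENV_JSON), json.loads(DQ_JSON))
```

| <p>An empty result means every sink requested by the job is defined in the environment config.</p> |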
| |
| <h2 id="measure-data-quality">Measure data quality</h2> |
| <p>Submit the measure job to Spark, with config file paths as parameters.</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \ |
| --driver-memory 1g --executor-memory 1g --num-executors 2 \ |
| &lt;path&gt;/griffin-measure.jar \ |
| &lt;path&gt;/env.json &lt;path&gt;/dq.json |
| </code></pre></div></div> |
| |
| <h2 id="report-data-quality-metrics">Report data quality metrics</h2> |
| <p>The calculation log is printed to the console while the job runs. After the job finishes, the result metrics are printed as well. The metrics are also saved in HDFS: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/&lt;job name&gt;/&lt;timestamp&gt;/_METRICS</code>.</p> |
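| <p>To get a feel for the values to expect, the Python sketch below computes the same three metrics as the profiling rule (<code class="language-plaintext highlighter-rouge">id_count</code>, <code class="language-plaintext highlighter-rouge">age_max</code>, <code class="language-plaintext highlighter-rouge">desc_length_max</code>) over a few in-memory rows. It is plain Python that mirrors the intent of the rule, not Griffin's implementation, and it does not reproduce the layout of Griffin's output. Skipping missing values is an assumption based on usual SQL aggregate behavior.</p> |

```python
def profile(rows):
    """Compute the three profiling metrics over (id, age, desc) tuples.

    None values are skipped, as SQL aggregates usually do (an assumption here).
    """
    ids = [r[0] for r in rows if r[0] is not None]
    ages = [r[1] for r in rows if r[1] is not None]
    desc_lengths = [len(r[2]) for r in rows if r[2] is not None]
    return {
        "id_count": len(ids),
        "age_max": max(ages, default=None),
        "desc_length_max": max(desc_lengths, default=None),
    }


# The three sample records from the data preparation step.
sample = [(1, 18, "student"), (2, 23, "engineer"), (3, 42, "cook")]
metrics = profile(sample)
```

| <p>For the three sample rows this gives an id count of 3, a maximum age of 42, and a maximum desc length of 8.</p> |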
| |
| <h2 id="refine-data-quality-report">Refine Data Quality report</h2> |
| <p>Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.</p> |
| |
| <h2 id="more-details">More Details</h2> |
| <p>For more details about Apache Griffin measures, you can visit our documents on <a href="https://github.com/apache/griffin/tree/master/griffin-doc">GitHub</a>.</p> |