I"ý<h2 id="user-story">User Story</h2>
<p>Say we have two data sets (demo_src and demo_tgt), and we need to know the data quality of the target data set, based on the source data set.</p>
<p>For simplicity, suppose both data sets have the same schema:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint
age int
desc string
dt string
hour string
</code></pre></div></div>
<p>Both dt and hour are partition columns: each day has one daily partition dt (like 20180912), and within each day there are 24 hourly partitions hour (like 00, 01, 02, …, 23).</p>
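<p>For illustration, assuming the default Hive partition directory layout and the HDFS locations used in the table definitions below, an hourly partition of demo_src would land under a path like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hdfs:///griffin/data/batch/demo_src/dt=20180912/hour=09/&lt;data files&gt;
</code></pre></div></div>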
<h2 id="environment-preparation">Environment Preparation</h2>
<p>You need to prepare the environment for the Apache Griffin measure module, including the following software:</p>
<ul>
<li>JDK (1.8+)</li>
<li>Hadoop (2.6.0+)</li>
<li>Spark (2.2.1+)</li>
<li>Hive (2.2.0)</li>
</ul>
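<p>As a quick sanity check (assuming the tools are already installed and on your PATH), you can print the installed versions before going further:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java -version
hadoop version
spark-submit --version
hive --version
</code></pre></div></div>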
<h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2>
<ol>
<li>Download Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li>
<li>Unzip the source package.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip
cd griffin-0.4.0-source-release
</code></pre></div> </div>
</li>
<li>Build Apache Griffin jars.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install
</code></pre></div> </div>
<p>Move the built Apache Griffin measure jar to your work path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar &lt;work path&gt;/griffin-measure.jar
</code></pre></div> </div>
</li>
</ol>
<h2 id="data-preparation">Data Preparation</h2>
<p>For this quick start, we will generate two Hive tables, demo_src and demo_tgt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_tgt';
</code></pre></div></div>
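<p>As a quick check (assuming the hive CLI is available and you created the tables in the default database), you can confirm the tables and their partition columns:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive -e "SHOW TABLES;"
hive -e "DESCRIBE FORMATTED demo_src;"
</code></pre></div></div>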
<p>The data could be generated like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student
2|23|engineer
3|42|cook
...
</code></pre></div></div>
<p>For demo_src and demo_tgt, some records may differ between the two data sets.
You can download the <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to get the two data source files.
Then we will load data into both tables for every hour.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
LOAD DATA LOCAL INPATH 'demo_tgt' INTO TABLE demo_tgt PARTITION (dt='20180912',hour='09');
</code></pre></div></div>
<p>Or you can just execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the directory downloaded above to generate and load data into the tables hourly.</p>
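<p>To confirm the data landed (a sketch, assuming the hive CLI and the 20180912/09 partition loaded above), you can list the partitions and peek at a few rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive -e "SHOW PARTITIONS demo_src;"
hive -e "SELECT * FROM demo_src WHERE dt='20180912' AND hour='09' LIMIT 5;"
</code></pre></div></div>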
<h2 id="define-data-quality-measure">Define data quality measure</h2>
<h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4>
<p>The environment config file: env.json</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"spark": {
"log.level": "WARN"
},
"sinks": [
{
"type": "console"
},
{
"type": "hdfs",
"config": {
"path": "hdfs:///griffin/persist"
}
},
{
"type": "elasticsearch",
"config": {
"method": "post",
"api": "http://es:9200/griffin/accuracy"
}
}
]
}
</code></pre></div></div>
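<p>The elasticsearch sink above assumes an Elasticsearch instance reachable at http://es:9200; replace the host and index with your own. As a quick check before running the job, you can verify the endpoint responds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://es:9200
</code></pre></div></div>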
<h4 id="define-griffin-data-quality">Define griffin data quality</h4>
<p>The DQ config file: dq.json</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"name": "batch_accu",
"process.type": "batch",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "hive",
"version": "1.2",
"config": {
"database": "default",
"table.name": "demo_src"
}
}
]
}, {
"name": "tgt",
"connectors": [
{
"type": "hive",
"version": "1.2",
"config": {
"database": "default",
"table.name": "demo_tgt"
}
}
]
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out": [
{
"type": "metric",
"name": "accu"
},
{
"type": "record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
</code></pre></div></div>
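<p>The accuracy rule above checks, for each record in src, whether a matching record exists in tgt; conceptually the reported metric is derived from the three counts named in the <code class="language-plaintext highlighter-rouge">details</code> block. A small illustration with made-up counts (not output from an actual run):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>accuracy (%) = matched_count / total_count * 100
             = (total_count - miss_count) / total_count * 100

example (made-up counts): total_count = 1000, miss_count = 4
accuracy = 996 / 1000 * 100 = 99.6
</code></pre></div></div>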
<h2 id="measure-data-quality">Measure data quality</h2>
<p>Submit the measure job to Spark, with config file paths as parameters.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
&lt;path&gt;/griffin-measure.jar \
&lt;path&gt;/env.json &lt;path&gt;/dq.json
</code></pre></div></div>
<h2 id="report-data-quality-metrics">Report data quality metrics</h2>
<p>You can follow the calculation log in the console; after the job finishes, the result metrics are printed. The metrics will also be saved in HDFS: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/&lt;job name&gt;/&lt;timestamp&gt;/_METRICS</code>.</p>
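<p>To inspect the persisted metrics (assuming the HDFS sink path from env.json and the job name batch_accu from dq.json), you can list the job directory and print the metrics file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hdfs dfs -ls hdfs:///griffin/persist/batch_accu/
hdfs dfs -cat hdfs:///griffin/persist/batch_accu/&lt;timestamp&gt;/_METRICS
</code></pre></div></div>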
<h2 id="refine-data-quality-report">Refine Data Quality report</h2>
<p>Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.</p>
<h2 id="more-details">More Details</h2>
<p>For more details about Apache Griffin measures, you can visit our documentation on <a href="https://github.com/apache/griffin/tree/master/griffin-doc">GitHub</a>.</p>