docs/quickstart.html - griffin-site - Git at Google

 <!DOCTYPE html>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <head>
   	<meta charset="utf-8">
   	<meta http-equiv="X-UA-Compatible" content="IE=edge">

  	<title>Griffin - Quick Start</title>
  	<meta name="description" content="Apache Griffin - Big Data Quality Solution For Batch and Streaming">

 	<meta name="keywords" content="Griffin, Hadoop, Security, Real Time">
 	<meta name="author" content="eBay Inc.">

 	<meta charset="utf-8">
 	<meta name="viewport" content="initial-scale=1">

 	<link rel="stylesheet" href="/css/animate.css">
 	<link rel="stylesheet" href="/css/bootstrap.min.css">

 	<link rel="stylesheet" href="/css/font-awesome.min.css">

 	<link rel="stylesheet" href="/css/misc.css">
 	<link rel="stylesheet" href="/css/style.css">
 	<link rel="stylesheet" href="/css/styles.css">
   	<link rel="stylesheet" href="/css/main.css">
   	<link rel="alternate" type="application/rss+xml" title="Griffin" href="http://griffin.apache.org/feed.xml" />
   	<link rel="shortcut icon" href="/images/favicon.ico">

   	<!-- Baidu Analytics Tracking-->
 	<script>
 	var _hmt = _hmt || [];
 	(function() {
 	  var hm = document.createElement("script");
 	  hm.src = "//hm.baidu.com/hm.js?fedc55df2ea52777a679192e8f849ece";
 	  var s = document.getElementsByTagName("script")[0];
 	  s.parentNode.insertBefore(hm, s);
 	})();
 	</script>

 	<!-- Google Analytics Tracking -->
 	<script>
 	  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
 	  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
 	  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 	  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
 	  ga('create', 'UA-68929805-1', 'auto');
 	  ga('send', 'pageview');
 	</script>
 </head>

 <body>
 <!-- header start -->
 <div id="home_page">
   <div class="topbar">
     <div class="container">
       <div class="row" >
         <nav class="navbar navbar-default">
           <div class="container-fluid">
             <!-- Brand and toggle get grouped for better mobile display -->
             <div class="navbar-header">
               <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
               <a class="navbar-brand" href="/"><img src="/images/logo.png" height="44px" style="margin-top:-7px"></a> </div>
             </div>
           </div>
           <!-- /.container-fluid -->
         </nav>
       </div>
     </div>
   </div>

 </div>
 <!-- header end -->
 <div class="container-fluid page-content">
   <div class="row">
     <div class="col-md-10 col-md-offset-1">
       <!-- sidebar -->
       <div class="col-xs-6 col-sm-3" id="sidebar" role="navigation">
         <ul class="nav" id="adminnav">

         <li class="heading">Getting Started</li>

           <li class="sidenavli  current"><a href="/docs/quickstart.html" data-permalink="/docs/quickstart.html" id="">Quick Start</a></li>

           <li class="sidenavli  "><a href="/docs/quickstart-cn.html" data-permalink="/docs/quickstart.html" id="">Quick Start (Chinese Version)</a></li>

           <li class="sidenavli  "><a href="/docs/usecases.html" data-permalink="/docs/quickstart.html" id="">Streaming Use Cases</a></li>

           <li class="sidenavli  "><a href="/docs/profiling.html" data-permalink="/docs/quickstart.html" id="">Profiling Use Cases</a></li>

           <li class="sidenavli  "><a href="/docs/faq.html" data-permalink="/docs/quickstart.html" id="">FAQ</a></li>

           <li class="sidenavli  "><a href="/docs/community.html" data-permalink="/docs/quickstart.html" id="">Community</a></li>

           <li class="sidenavli  "><a href="/docs/conf.html" data-permalink="/docs/quickstart.html" id="">Conference</a></li>

         <li class="divider"></li>

         <li class="heading">Development</li>

           <li class="sidenavli  "><a href="/docs/contribute.html" data-permalink="/docs/quickstart.html" id="">Contribution</a></li>

           <li class="sidenavli  "><a href="/docs/contributors.html" data-permalink="/docs/quickstart.html" id="">Contributors</a></li>

         <li class="divider"></li>

         <li class="heading">Download</li>

           <li class="sidenavli  "><a href="/docs/latest.html" data-permalink="/docs/quickstart.html" id="">Latest version</a></li>

           <li class="sidenavli  "><a href="/docs/download.html" data-permalink="/docs/quickstart.html" id="">Archived</a></li>

         <li class="divider"></li>

         <li class="sidenavli">
           <a href="mailto:dev@griffin.apache.org" target="_blank">Need Help?</a>
         </li>
         </ul>
       </div>
       <div class="col-xs-6 col-sm-9 page-main-content" style="margin-left: -15px" id="loadcontent">
         <h1 class="page-header" style="margin-top: 0px">Quick Start</h1>
         <h2 id="user-story">User Story</h2>
 <p>Say we have two data set(demo_src, demo_tgt), we need to know what is the data quality for target data set, based on source data set.</p>

 <p>For simplicity, suppose both two data set have the same schema as this:</p>
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id                      bigint
 age                     int
 desc                    string
 dt                      string
 hour                    string
 </code></pre></div></div>
 <p>both dt and hour are partitions,</p>

 <p>as every day we have one daily partition dt(like 20180912),</p>

 <p>for every day we have 24 hourly partitions(like 00, 01, 02, …, 23).</p>

 <h2 id="environment-preparation">Environment Preparation</h2>
 <p>You need to prepare the environment for Apache Griffin measure module, including the following software:</p>
 <ul>
   <li>JDK (1.8+)</li>
   <li>Hadoop (2.6.0+)</li>
   <li>Spark (2.2.1+)</li>
   <li>Hive (2.2.0)</li>
 </ul>

 <h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2>
 <ol>
   <li>Download Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li>
   <li>Unzip the source package.
     <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip
 cd griffin-0.4.0-source-release
 </code></pre></div>    </div>
   </li>
   <li>Build Apache Griffin jars.
     <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install
 </code></pre></div>    </div>

     <p>Move the built apache griffin measure jar to your work path.</p>

     <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar &lt;work path&gt;/griffin-measure.jar
 </code></pre></div>    </div>
   </li>
 </ol>

 <h2 id="data-preparation">Data Preparation</h2>

 <p>For our quick start, We will generate two hive tables demo_src and demo_tgt.</p>
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
 --Note: replace hdfs location with your own path
 CREATE EXTERNAL TABLE `demo_src`(
   `id` bigint,
   `age` int,
   `desc` string)
 PARTITIONED BY (
   `dt` string,
   `hour` string)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '|'
 LOCATION
   'hdfs:///griffin/data/batch/demo_src';

 --Note: replace hdfs location with your own path
 CREATE EXTERNAL TABLE `demo_tgt`(
   `id` bigint,
   `age` int,
   `desc` string)
 PARTITIONED BY (
   `dt` string,
   `hour` string)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '|'
 LOCATION
   'hdfs:///griffin/data/batch/demo_tgt';

 </code></pre></div></div>
 <p>The data could be generated this:</p>
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student
 2|23|engineer
 3|42|cook
 ...
 </code></pre></div></div>
 <p>For demo_src and demo_tgt, there could be some different items between each other.
 You can download <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to get the two data source files.
 Then we will load data into both two tables for every hour.</p>
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
 LOAD DATA LOCAL INPATH 'demo_tgt' INTO TABLE demo_tgt PARTITION (dt='20180912',hour='09');
 </code></pre></div></div>
 <p>Or you can just execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above, to generate and load data into the tables hourly.</p>

 <h2 id="define-data-quality-measure">Define data quality measure</h2>

 <h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4>
 <p>The environment config file: env.json</p>
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
   "spark": {
     "log.level": "WARN"
   },
   "sinks": [
     {
       "type": "console"
     },
     {
       "type": "hdfs",
       "config": {
         "path": "hdfs:///griffin/persist"
       }
     },
     {
       "type": "elasticsearch",
       "config": {
         "method": "post",
         "api": "http://es:9200/griffin/accuracy"
       }
     }
   ]
 }
 </code></pre></div></div>

 <h4 id="define-griffin-data-quality">Define griffin data quality</h4>
 <p>The DQ config file: dq.json</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
   "name": "batch_accu",
   "process.type": "batch",
   "data.sources": [
     {
       "name": "src",
       "baseline": true,
       "connectors": [
         {
           "type": "hive",
           "version": "1.2",
           "config": {
             "database": "default",
             "table.name": "demo_src"
           }
         }
       ]
     }, {
       "name": "tgt",
       "connectors": [
         {

           "type": "hive",
           "version": "1.2",
           "config": {
             "database": "default",
             "table.name": "demo_tgt"
           }
         }
       ]
     }
   ],
   "evaluate.rule": {
     "rules": [
       {
         "dsl.type": "griffin-dsl",
         "dq.type": "accuracy",
         "out.dataframe.name": "accu",
         "rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
         "details": {
           "source": "src",
           "target": "tgt",
           "miss": "miss_count",
           "total": "total_count",
           "matched": "matched_count"
         },
         "out": [
           {
             "type": "metric",
             "name": "accu"
           },
           {
             "type": "record",
             "name": "missRecords"
           }
         ]
       }
     ]
   },
   "sinks": ["CONSOLE", "HDFS"]
 }
 </code></pre></div></div>

 <h2 id="measure-data-quality">Measure data quality</h2>
 <p>Submit the measure job to Spark, with config file paths as parameters.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
 --driver-memory 1g --executor-memory 1g --num-executors 2 \
 &lt;path&gt;/griffin-measure.jar \
 &lt;path&gt;/env.json &lt;path&gt;/dq.json
 </code></pre></div></div>

 <h2 id="report-data-quality-metrics">Report data quality metrics</h2>
 <p>Then you can get the calculation log in console, after the job finishes, you can get the result metrics printed. The metrics will also be saved in hdfs: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/&lt;job name&gt;/&lt;timestamp&gt;/_METRICS</code>.</p>

 <h2 id="refine-data-quality-report">Refine Data Quality report</h2>
 <p>Depends on your business, you might need to refine your data quality measure further till your are satisfied.</p>

 <h2 id="more-details">More Details</h2>
 <p>For more details about apache griffin measures, you can visit our documents in <a href="https://github.com/apache/griffin/tree/master/griffin-doc">github</a>.</p>

       </div><!--end of loadcontent-->
     </div>
     <!--end of centered content-->
   </div>
 </div>
 <!--end of container-->


 <!-- footer start -->
 <div class="footerwrapper">
     <div class="container">
         <div class="row">
             <div class="col-md-3">
                 <img src="/images/incubator_feather_egg_logo.png" height="60">
             </div>
             <div class="col-md-9">
                 <div style="margin-left:auto; margin-right:auto; text-align:center;font-size:12px;">
                     <div>
                         Apache Griffin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
                     </div>
                 </div>
             </div>
         </div>
         <div class="row" style="padding-top:10px;">
             Copyright © 2018 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br>
 			Apache Griffin, Griffin, Apache, the Apache feather logo and the Apache Griffin logo are trademarks of The Apache Software Foundation.
         </div>
 		<div class="row text-center" style="padding-top:10px;">
 			<a href="https://www.apache.org/events/current-event.html">
 				<img src="https://www.apache.org/events/current-event-234x60.png" alt="ASF Current Event">
 			</a>
 		</div>
     </div>
 </div>
 <!-- footer end -->

 <!-- JavaScripts -->
 <script src="https://code.jquery.com/jquery-2.2.4.min.js"></script>


 </body>
 </html>
	<!DOCTYPE html>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">

	<title>Griffin - Quick Start</title>
	<meta name="description" content="Apache Griffin - Big Data Quality Solution For Batch and Streaming">

	<meta name="keywords" content="Griffin, Hadoop, Security, Real Time">
	<meta name="author" content="eBay Inc.">

	<meta charset="utf-8">
	<meta name="viewport" content="initial-scale=1">

	<link rel="stylesheet" href="/css/animate.css">
	<link rel="stylesheet" href="/css/bootstrap.min.css">

	<link rel="stylesheet" href="/css/font-awesome.min.css">

	<link rel="stylesheet" href="/css/misc.css">
	<link rel="stylesheet" href="/css/style.css">
	<link rel="stylesheet" href="/css/styles.css">
	<link rel="stylesheet" href="/css/main.css">
	<link rel="alternate" type="application/rss+xml" title="Griffin" href="http://griffin.apache.org/feed.xml" />
	<link rel="shortcut icon" href="/images/favicon.ico">

	<!-- Baidu Analytics Tracking-->
	<script>
	var _hmt = _hmt \|\| [];
	(function() {
	var hm = document.createElement("script");
	hm.src = "//hm.baidu.com/hm.js?fedc55df2ea52777a679192e8f849ece";
	var s = document.getElementsByTagName("script")[0];
	s.parentNode.insertBefore(hm, s);
	})();
	</script>

	<!-- Google Analytics Tracking -->
	<script>
	(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]\|\|function(){
	(i[r].q=i[r].q\|\|[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
	m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
	})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
	ga('create', 'UA-68929805-1', 'auto');
	ga('send', 'pageview');
	</script>
	</head>

	<body>
	<!-- header start -->
	<div id="home_page">
	<div class="topbar">
	<div class="container">
	<div class="row" >
	<nav class="navbar navbar-default">
	<div class="container-fluid">
	<!-- Brand and toggle get grouped for better mobile display -->
	<div class="navbar-header">
	<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
	<a class="navbar-brand" href="/"><img src="/images/logo.png" height="44px" style="margin-top:-7px"></a> </div>
	</div>
	</div>
	<!-- /.container-fluid -->
	</nav>
	</div>
	</div>
	</div>

	</div>
	<!-- header end -->
	<div class="container-fluid page-content">
	<div class="row">
	<div class="col-md-10 col-md-offset-1">
	<!-- sidebar -->
	<div class="col-xs-6 col-sm-3" id="sidebar" role="navigation">
	<ul class="nav" id="adminnav">

	<li class="heading">Getting Started</li>

	<li class="sidenavli current"><a href="/docs/quickstart.html" data-permalink="/docs/quickstart.html" id="">Quick Start</a></li>

	<li class="sidenavli "><a href="/docs/quickstart-cn.html" data-permalink="/docs/quickstart.html" id="">Quick Start (Chinese Version)</a></li>

	<li class="sidenavli "><a href="/docs/usecases.html" data-permalink="/docs/quickstart.html" id="">Streaming Use Cases</a></li>

	<li class="sidenavli "><a href="/docs/profiling.html" data-permalink="/docs/quickstart.html" id="">Profiling Use Cases</a></li>

	<li class="sidenavli "><a href="/docs/faq.html" data-permalink="/docs/quickstart.html" id="">FAQ</a></li>

	<li class="sidenavli "><a href="/docs/community.html" data-permalink="/docs/quickstart.html" id="">Community</a></li>

	<li class="sidenavli "><a href="/docs/conf.html" data-permalink="/docs/quickstart.html" id="">Conference</a></li>

	<li class="divider"></li>

	<li class="heading">Development</li>

	<li class="sidenavli "><a href="/docs/contribute.html" data-permalink="/docs/quickstart.html" id="">Contribution</a></li>

	<li class="sidenavli "><a href="/docs/contributors.html" data-permalink="/docs/quickstart.html" id="">Contributors</a></li>

	<li class="divider"></li>

	<li class="heading">Download</li>

	<li class="sidenavli "><a href="/docs/latest.html" data-permalink="/docs/quickstart.html" id="">Latest version</a></li>

	<li class="sidenavli "><a href="/docs/download.html" data-permalink="/docs/quickstart.html" id="">Archived</a></li>

	<li class="divider"></li>

	<li class="sidenavli">
	<a href="mailto:dev@griffin.apache.org" target="_blank">Need Help?</a>
	</li>
	</ul>
	</div>
	<div class="col-xs-6 col-sm-9 page-main-content" style="margin-left: -15px" id="loadcontent">
	<h1 class="page-header" style="margin-top: 0px">Quick Start</h1>
	<h2 id="user-story">User Story</h2>
	<p>Say we have two data set(demo_src, demo_tgt), we need to know what is the data quality for target data set, based on source data set.</p>

	<p>For simplicity, suppose both two data set have the same schema as this:</p>
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint
	age int
	desc string
	dt string
	hour string
	</code></pre></div></div>
	<p>both dt and hour are partitions,</p>

	<p>as every day we have one daily partition dt(like 20180912),</p>

	<p>for every day we have 24 hourly partitions(like 00, 01, 02, …, 23).</p>

	<h2 id="environment-preparation">Environment Preparation</h2>
	<p>You need to prepare the environment for Apache Griffin measure module, including the following software:</p>
	<ul>
	<li>JDK (1.8+)</li>
	<li>Hadoop (2.6.0+)</li>
	<li>Spark (2.2.1+)</li>
	<li>Hive (2.2.0)</li>
	</ul>

	<h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2>
	<ol>
	<li>Download Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li>
	<li>Unzip the source package.
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip
	cd griffin-0.4.0-source-release
	</code></pre></div> </div>
	</li>
	<li>Build Apache Griffin jars.
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install
	</code></pre></div> </div>

	<p>Move the built apache griffin measure jar to your work path.</p>

	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar <work path>/griffin-measure.jar
	</code></pre></div> </div>
	</li>
	</ol>

	<h2 id="data-preparation">Data Preparation</h2>

	<p>For our quick start, We will generate two hive tables demo_src and demo_tgt.</p>
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
	--Note: replace hdfs location with your own path
	CREATE EXTERNAL TABLE `demo_src`(
	`id` bigint,
	`age` int,
	`desc` string)
	PARTITIONED BY (
	`dt` string,
	`hour` string)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\|'
	LOCATION
	'hdfs:///griffin/data/batch/demo_src';

	--Note: replace hdfs location with your own path
	CREATE EXTERNAL TABLE `demo_tgt`(
	`id` bigint,
	`age` int,
	`desc` string)
	PARTITIONED BY (
	`dt` string,
	`hour` string)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\|'
	LOCATION
	'hdfs:///griffin/data/batch/demo_tgt';

	</code></pre></div></div>
	<p>The data could be generated this:</p>
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1\|18\|student
	2\|23\|engineer
	3\|42\|cook
	...
	</code></pre></div></div>
	<p>For demo_src and demo_tgt, there could be some different items between each other.
	You can download <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to get the two data source files.
	Then we will load data into both two tables for every hour.</p>
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
	LOAD DATA LOCAL INPATH 'demo_tgt' INTO TABLE demo_tgt PARTITION (dt='20180912',hour='09');
	</code></pre></div></div>
	<p>Or you can just execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above, to generate and load data into the tables hourly.</p>

	<h2 id="define-data-quality-measure">Define data quality measure</h2>

	<h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4>
	<p>The environment config file: env.json</p>
	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
	"spark": {
	"log.level": "WARN"
	},
	"sinks": [
	{
	"type": "console"
	},
	{
	"type": "hdfs",
	"config": {
	"path": "hdfs:///griffin/persist"
	}
	},
	{
	"type": "elasticsearch",
	"config": {
	"method": "post",
	"api": "http://es:9200/griffin/accuracy"
	}
	}
	]
	}
	</code></pre></div></div>

	<h4 id="define-griffin-data-quality">Define griffin data quality</h4>
	<p>The DQ config file: dq.json</p>

	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
	"name": "batch_accu",
	"process.type": "batch",
	"data.sources": [
	{
	"name": "src",
	"baseline": true,
	"connectors": [
	{
	"type": "hive",
	"version": "1.2",
	"config": {
	"database": "default",
	"table.name": "demo_src"
	}
	}
	]
	}, {
	"name": "tgt",
	"connectors": [
	{

	"type": "hive",
	"version": "1.2",
	"config": {
	"database": "default",
	"table.name": "demo_tgt"
	}
	}
	]
	}
	],
	"evaluate.rule": {
	"rules": [
	{
	"dsl.type": "griffin-dsl",
	"dq.type": "accuracy",
	"out.dataframe.name": "accu",
	"rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
	"details": {
	"source": "src",
	"target": "tgt",
	"miss": "miss_count",
	"total": "total_count",
	"matched": "matched_count"
	},
	"out": [
	{
	"type": "metric",
	"name": "accu"
	},
	{
	"type": "record",
	"name": "missRecords"
	}
	]
	}
	]
	},
	"sinks": ["CONSOLE", "HDFS"]
	}
	</code></pre></div></div>

	<h2 id="measure-data-quality">Measure data quality</h2>
	<p>Submit the measure job to Spark, with config file paths as parameters.</p>

	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
	--driver-memory 1g --executor-memory 1g --num-executors 2 \
	<path>/griffin-measure.jar \
	<path>/env.json <path>/dq.json
	</code></pre></div></div>

	<h2 id="report-data-quality-metrics">Report data quality metrics</h2>
	<p>Then you can get the calculation log in console, after the job finishes, you can get the result metrics printed. The metrics will also be saved in hdfs: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS</code>.</p>

	<h2 id="refine-data-quality-report">Refine Data Quality report</h2>
	<p>Depends on your business, you might need to refine your data quality measure further till your are satisfied.</p>

	<h2 id="more-details">More Details</h2>
	<p>For more details about apache griffin measures, you can visit our documents in <a href="https://github.com/apache/griffin/tree/master/griffin-doc">github</a>.</p>

	</div><!--end of loadcontent-->
	</div>
	<!--end of centered content-->
	</div>
	</div>
	<!--end of container-->


	<!-- footer start -->
	<div class="footerwrapper">
	<div class="container">
	<div class="row">
	<div class="col-md-3">
	<img src="/images/incubator_feather_egg_logo.png" height="60">
	</div>
	<div class="col-md-9">
	<div style="margin-left:auto; margin-right:auto; text-align:center;font-size:12px;">
	<div>
	Apache Griffin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
	</div>
	</div>
	</div>
	</div>
	<div class="row" style="padding-top:10px;">
	Copyright © 2018 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br>
	Apache Griffin, Griffin, Apache, the Apache feather logo and the Apache Griffin logo are trademarks of The Apache Software Foundation.
	</div>
	<div class="row text-center" style="padding-top:10px;">
	<a href="https://www.apache.org/events/current-event.html">
	<img src="https://www.apache.org/events/current-event-234x60.png" alt="ASF Current Event">
	</a>
	</div>
	</div>
	</div>
	<!-- footer end -->

	<!-- JavaScripts -->
	<script src="https://code.jquery.com/jquery-2.2.4.min.js"></script>



	</body>
	</html>