| <!DOCTYPE html> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| |
| <title>Griffin - Quick Start</title> |
| <meta name="description" content="Apache Griffin - Big Data Quality Solution For Batch and Streaming"> |
| |
| <meta name="keywords" content="Griffin, Hadoop, Security, Real Time"> |
| <meta name="author" content="eBay Inc."> |
| |
| <meta name="viewport" content="initial-scale=1"> |
| |
| <link rel="stylesheet" href="/css/animate.css"> |
| <link rel="stylesheet" href="/css/bootstrap.min.css"> |
| |
| <link rel="stylesheet" href="/css/font-awesome.min.css"> |
| |
| <link rel="stylesheet" href="/css/misc.css"> |
| <link rel="stylesheet" href="/css/style.css"> |
| <link rel="stylesheet" href="/css/styles.css"> |
| <link rel="stylesheet" href="/css/main.css"> |
| <link rel="alternate" type="application/rss+xml" title="Griffin" href="http://griffin.apache.org/feed.xml" /> |
| <link rel="shortcut icon" href="/images/favicon.ico"> |
| |
| <!-- Baidu Analytics Tracking--> |
| <script> |
| var _hmt = _hmt || []; |
| (function() { |
| var hm = document.createElement("script"); |
| hm.src = "//hm.baidu.com/hm.js?fedc55df2ea52777a679192e8f849ece"; |
| var s = document.getElementsByTagName("script")[0]; |
| s.parentNode.insertBefore(hm, s); |
| })(); |
| </script> |
| |
| <!-- Google Analytics Tracking --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-68929805-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| </head> |
| |
| <body> |
| <!-- header start --> |
| <div id="home_page"> |
| <div class="topbar"> |
| <div class="container"> |
| <div class="row" > |
| <nav class="navbar navbar-default"> |
| <div class="container-fluid"> |
| <!-- Brand and toggle get grouped for better mobile display --> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> |
| <a class="navbar-brand" href="/"><img src="/images/logo.png" height="44px" style="margin-top:-7px"></a> </div> |
| </div> |
| </div> |
| <!-- /.container-fluid --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| |
| </div> |
| <!-- header end --> |
| <div class="container-fluid page-content"> |
| <div class="row"> |
| <div class="col-md-10 col-md-offset-1"> |
| <!-- sidebar --> |
| <div class="col-xs-6 col-sm-3" id="sidebar" role="navigation"> |
| <ul class="nav" id="adminnav"> |
| |
| <li class="heading">Getting Started</li> |
| |
| <li class="sidenavli current"><a href="/docs/quickstart.html" data-permalink="/docs/quickstart.html" id="">Quick Start</a></li> |
| |
| <li class="sidenavli "><a href="/docs/quickstart-cn.html" data-permalink="/docs/quickstart.html" id="">Quick Start (Chinese Version)</a></li> |
| |
| <li class="sidenavli "><a href="/docs/usecases.html" data-permalink="/docs/quickstart.html" id="">Streaming Use Cases</a></li> |
| |
| <li class="sidenavli "><a href="/docs/profiling.html" data-permalink="/docs/quickstart.html" id="">Profiling Use Cases</a></li> |
| |
| <li class="sidenavli "><a href="/docs/faq.html" data-permalink="/docs/quickstart.html" id="">FAQ</a></li> |
| |
| <li class="sidenavli "><a href="/docs/community.html" data-permalink="/docs/quickstart.html" id="">Community</a></li> |
| |
| <li class="sidenavli "><a href="/docs/conf.html" data-permalink="/docs/quickstart.html" id="">Conference</a></li> |
| |
| <li class="divider"></li> |
| |
| <li class="heading">Development</li> |
| |
| <li class="sidenavli "><a href="/docs/contribute.html" data-permalink="/docs/quickstart.html" id="">Contribution</a></li> |
| |
| <li class="sidenavli "><a href="/docs/contributors.html" data-permalink="/docs/quickstart.html" id="">Contributors</a></li> |
| |
| <li class="divider"></li> |
| |
| <li class="heading">Download</li> |
| |
| <li class="sidenavli "><a href="/docs/latest.html" data-permalink="/docs/quickstart.html" id="">Latest version</a></li> |
| |
| <li class="sidenavli "><a href="/docs/download.html" data-permalink="/docs/quickstart.html" id="">Archived</a></li> |
| |
| <li class="divider"></li> |
| |
| <li class="sidenavli"> |
| <a href="mailto:dev@griffin.apache.org" target="_blank">Need Help?</a> |
| </li> |
| </ul> |
| </div> |
| <div class="col-xs-6 col-sm-9 page-main-content" style="margin-left: -15px" id="loadcontent"> |
| <h1 class="page-header" style="margin-top: 0px">Quick Start</h1> |
| <h2 id="user-story">User Story</h2> |
| <p>Say we have two data sets (demo_src and demo_tgt), and we need to measure the data quality of the target data set against the source data set.</p> |
| |
| <p>For simplicity, suppose both data sets have the following schema:</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint |
| age int |
| desc string |
| dt string |
| hour string |
| </code></pre></div></div> |
| <p>Both dt and hour are partition columns: each day has one daily partition dt (like 20180912), and within each day there are 24 hourly partitions (like 00, 01, 02, …, 23).</p> |
| |
| <h2 id="environment-preparation">Environment Preparation</h2> |
| <p>Prepare the environment for the Apache Griffin measure module, which requires the following software:</p> |
| <ul> |
| <li>JDK (1.8+)</li> |
| <li>Hadoop (2.6.0+)</li> |
| <li>Spark (2.2.1+)</li> |
| <li>Hive (2.2.0)</li> |
| </ul> |
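| <p>Before building, it can help to confirm that the command-line tools are reachable. The following Python sketch only checks that executables named <code class="language-plaintext highlighter-rouge">java</code>, <code class="language-plaintext highlighter-rouge">hadoop</code>, <code class="language-plaintext highlighter-rouge">spark-submit</code> and <code class="language-plaintext highlighter-rouge">hive</code> are on the PATH. The function name is hypothetical, and the sketch does not verify the versions listed above.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import shutil |
| |
| # Executables assumed to ship with the software listed above. |
| REQUIRED_TOOLS = ["java", "hadoop", "spark-submit", "hive"] |
| |
| def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which): |
|     """Return the tools that cannot be found on the PATH.""" |
|     return [tool for tool in tools if which(tool) is None] |
| |
| if __name__ == "__main__": |
|     missing = find_missing_tools() |
|     print("missing tools:", missing if missing else "none") |
| </code></pre></div></div> |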
| |
| <h2 id="build-apache-griffin-measure-module">Build Apache Griffin Measure Module</h2> |
| <ol> |
| <li>Download the Apache Griffin source package <a href="https://www.apache.org/dist/griffin/0.4.0/">here</a>.</li> |
| <li>Unzip the source package. |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.4.0-source-release.zip |
| cd griffin-0.4.0-source-release |
| </code></pre></div> </div> |
| </li> |
| <li>Build Apache Griffin jars. |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install |
| </code></pre></div> </div> |
| |
| <p>Move the built Apache Griffin measure jar to your work path.</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.4.0.jar <work path>/griffin-measure.jar |
| </code></pre></div> </div> |
| </li> |
| </ol> |
| |
| <h2 id="data-preparation">Data Preparation</h2> |
| |
| <p>For this quick start, we will create two Hive tables, demo_src and demo_tgt.</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--HQL script that creates the Hive tables |
| --Note: replace hdfs location with your own path |
| CREATE EXTERNAL TABLE `demo_src`( |
| `id` bigint, |
| `age` int, |
| `desc` string) |
| PARTITIONED BY ( |
| `dt` string, |
| `hour` string) |
| ROW FORMAT DELIMITED |
| FIELDS TERMINATED BY '|' |
| LOCATION |
| 'hdfs:///griffin/data/batch/demo_src'; |
| |
| --Note: replace hdfs location with your own path |
| CREATE EXTERNAL TABLE `demo_tgt`( |
| `id` bigint, |
| `age` int, |
| `desc` string) |
| PARTITIONED BY ( |
| `dt` string, |
| `hour` string) |
| ROW FORMAT DELIMITED |
| FIELDS TERMINATED BY '|' |
| LOCATION |
| 'hdfs:///griffin/data/batch/demo_tgt'; |
| |
| </code></pre></div></div> |
| <p>The data could look like this:</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student |
| 2|23|engineer |
| 3|42|cook |
| ... |
| </code></pre></div></div> |
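| <p>Rows in this format can be produced with a few lines of Python. The sketch below is an illustration only, not the demo data script shipped with the project: the function name, the age range and the list of descriptions are assumptions made for the example.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random |
| |
| # Sample descriptions taken from the rows shown above. |
| DESCRIPTIONS = ["student", "engineer", "cook"] |
| |
| def make_rows(count, seed=0): |
|     """Build `count` rows in the id|age|desc format used by the tables.""" |
|     rng = random.Random(seed) |
|     rows = [] |
|     for record_id in range(1, count + 1): |
|         age = rng.randint(18, 60)  # arbitrary age range for the demo |
|         desc = rng.choice(DESCRIPTIONS) |
|         rows.append("{}|{}|{}".format(record_id, age, desc)) |
|     return rows |
| |
| if __name__ == "__main__": |
|     # Write a small source file; a target file could reuse the rows |
|     # with a few values changed to create mismatches. |
|     with open("demo_src", "w") as out: |
|         out.write("\n".join(make_rows(5)) + "\n") |
| </code></pre></div></div> |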
| <p>The demo_src and demo_tgt data can differ in some records. |
| You can download the <a href="/data/batch">demo data</a> and execute <code class="language-plaintext highlighter-rouge">./gen_demo_data.sh</code> to get the two data source files. |
| Then load the data into both tables for every hour.</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09'); |
| LOAD DATA LOCAL INPATH 'demo_tgt' INTO TABLE demo_tgt PARTITION (dt='20180912',hour='09'); |
| </code></pre></div></div> |
| <p>Alternatively, execute <code class="language-plaintext highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above to generate the data and load it into the tables hourly.</p> |
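| <p>If you load the partitions by hand, the statements for all 24 hours of a day can be generated rather than typed. This Python sketch prints them for one day. It assumes that the same local files demo_src and demo_tgt are reused for every hour, and the function name is hypothetical.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def load_statements(dt, tables=("demo_src", "demo_tgt")): |
|     """Return LOAD DATA statements for all 24 hourly partitions of `dt`.""" |
|     template = ("LOAD DATA LOCAL INPATH '{table}' INTO TABLE {table} " |
|                 "PARTITION (dt='{dt}',hour='{hour}');") |
|     statements = [] |
|     for hour in range(24): |
|         for table in tables: |
|             statements.append( |
|                 template.format(table=table, dt=dt, hour="{:02d}".format(hour))) |
|     return statements |
| |
| if __name__ == "__main__": |
|     # Paste the printed statements into the Hive CLI or save them to a file. |
|     print("\n".join(load_statements("20180912"))) |
| </code></pre></div></div> |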
| |
| <h2 id="define-data-quality-measure">Define data quality measure</h2> |
| |
| <h4 id="apache-griffin-env-configuration">Apache Griffin env configuration</h4> |
| <p>The environment config file: env.json</p> |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ |
| "spark": { |
| "log.level": "WARN" |
| }, |
| "sinks": [ |
| { |
| "type": "console" |
| }, |
| { |
| "type": "hdfs", |
| "config": { |
| "path": "hdfs:///griffin/persist" |
| } |
| }, |
| { |
| "type": "elasticsearch", |
| "config": { |
| "method": "post", |
| "api": "http://es:9200/griffin/accuracy" |
| } |
| } |
| ] |
| } |
| </code></pre></div></div> |
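| <p>A quick way to catch typos in this file is to parse it before submitting a job. The Python sketch below embeds only the sinks section of the env.json shown above and lists the declared sink types. The helper name is hypothetical, and the check is a convenience for the reader, not part of Apache Griffin.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json |
| |
| # The sinks section of env.json from above. |
| ENV_JSON = """ |
| { |
|   "sinks": [ |
|     {"type": "console"}, |
|     {"type": "hdfs", "config": {"path": "hdfs:///griffin/persist"}}, |
|     {"type": "elasticsearch", |
|      "config": {"method": "post", "api": "http://es:9200/griffin/accuracy"}} |
|   ] |
| } |
| """ |
| |
| def sink_types(env): |
|     """Return the type of every sink declared in the env config.""" |
|     return [sink["type"] for sink in env.get("sinks", [])] |
| |
| if __name__ == "__main__": |
|     # json.loads raises an error if the text is not valid JSON. |
|     env = json.loads(ENV_JSON) |
|     print(sink_types(env)) |
| </code></pre></div></div> |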
| |
| <h4 id="define-griffin-data-quality">Define Griffin data quality</h4> |
| <p>The DQ config file: dq.json</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ |
| "name": "batch_accu", |
| "process.type": "batch", |
| "data.sources": [ |
| { |
| "name": "src", |
| "baseline": true, |
| "connectors": [ |
| { |
| "type": "hive", |
| "version": "1.2", |
| "config": { |
| "database": "default", |
| "table.name": "demo_src" |
| } |
| } |
| ] |
| }, { |
| "name": "tgt", |
| "connectors": [ |
| { |
| |
| "type": "hive", |
| "version": "1.2", |
| "config": { |
| "database": "default", |
| "table.name": "demo_tgt" |
| } |
| } |
| ] |
| } |
| ], |
| "evaluate.rule": { |
| "rules": [ |
| { |
| "dsl.type": "griffin-dsl", |
| "dq.type": "accuracy", |
| "out.dataframe.name": "accu", |
| "rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc", |
| "details": { |
| "source": "src", |
| "target": "tgt", |
| "miss": "miss_count", |
| "total": "total_count", |
| "matched": "matched_count" |
| }, |
| "out": [ |
| { |
| "type": "metric", |
| "name": "accu" |
| }, |
| { |
| "type": "record", |
| "name": "missRecords" |
| } |
| ] |
| } |
| ] |
| }, |
| "sinks": ["CONSOLE", "HDFS"] |
| } |
| </code></pre></div></div> |
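| <p>To make the accuracy rule concrete, here is a plain-Python illustration of what it expresses. It assumes that a source record counts as matched when some target record has the same id, age and desc, and it reuses the metric names from the details section above. This is a sketch of the idea on three hand-written records, not Griffin's implementation.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def accuracy(source_rows, target_rows, fields=("id", "age", "desc")): |
|     """Count source rows that have a target row equal on all `fields`.""" |
|     target_keys = {tuple(row[f] for f in fields) for row in target_rows} |
|     total = len(source_rows) |
|     matched = sum(1 for row in source_rows |
|                   if tuple(row[f] for f in fields) in target_keys) |
|     return {"total_count": total, "miss_count": total - matched, |
|             "matched_count": matched} |
| |
| if __name__ == "__main__": |
|     src = [{"id": 1, "age": 18, "desc": "student"}, |
|            {"id": 2, "age": 23, "desc": "engineer"}, |
|            {"id": 3, "age": 42, "desc": "cook"}] |
|     tgt = [{"id": 1, "age": 18, "desc": "student"}, |
|            {"id": 2, "age": 24, "desc": "engineer"},  # age differs from src |
|            {"id": 3, "age": 42, "desc": "cook"}] |
|     print(accuracy(src, tgt)) |
| </code></pre></div></div> |
| <p>With these sample records, the source record with id 2 has no equal target record, so it is the one counted as a miss.</p> |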
| |
| <h2 id="measure-data-quality">Measure data quality</h2> |
| <p>Submit the measure job to Spark, with config file paths as parameters.</p> |
| |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \ |
| --driver-memory 1g --executor-memory 1g --num-executors 2 \ |
| <path>/griffin-measure.jar \ |
| <path>/env.json <path>/dq.json |
| </code></pre></div></div> |
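| <p>If the job is launched from a script, the same command can be assembled as an argument list. The Python sketch below mirrors the options of the command above. The function names are hypothetical, and the three paths are supplied by the caller.</p> |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess |
| |
| def build_submit_command(jar, env_json, dq_json): |
|     """Assemble the spark-submit argument list shown above.""" |
|     return [ |
|         "spark-submit", |
|         "--class", "org.apache.griffin.measure.Application", |
|         "--master", "yarn", |
|         "--deploy-mode", "client", |
|         "--queue", "default", |
|         "--driver-memory", "1g", |
|         "--executor-memory", "1g", |
|         "--num-executors", "2", |
|         jar, env_json, dq_json, |
|     ] |
| |
| def submit(jar, env_json, dq_json): |
|     """Run the job and return the exit code of spark-submit.""" |
|     return subprocess.call(build_submit_command(jar, env_json, dq_json)) |
| </code></pre></div></div> |
| <p>Passing the arguments as a list avoids shell quoting problems when a path contains spaces.</p> |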
| |
| <h2 id="report-data-quality-metrics">Report data quality metrics</h2> |
| <p>The calculation log is printed to the console, and once the job finishes, the result metrics are printed as well. The metrics are also saved in HDFS: <code class="language-plaintext highlighter-rouge">hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS</code>.</p> |
| |
| <h2 id="refine-data-quality-report">Refine Data Quality report</h2> |
| <p>Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.</p> |
| |
| <h2 id="more-details">More Details</h2> |
| <p>For more details about Apache Griffin measures, see our documents on <a href="https://github.com/apache/griffin/tree/master/griffin-doc">GitHub</a>.</p> |
| |
| </div><!--end of loadcontent--> |
| </div> |
| <!--end of centered content--> |
| </div> |
| </div> |
| <!--end of container--> |
| |
| |
| <!-- footer start --> |
| <div class="footerwrapper"> |
| <div class="container"> |
| <div class="row"> |
| <div class="col-md-3"> |
| <img src="/images/incubator_feather_egg_logo.png" height="60"> |
| </div> |
| <div class="col-md-9"> |
| <div style="margin-left:auto; margin-right:auto; text-align:center;font-size:12px;"> |
| <div> |
| Apache Griffin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="row" style="padding-top:10px;"> |
| Copyright © 2018 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br> |
| Apache Griffin, Griffin, Apache, the Apache feather logo and the Apache Griffin logo are trademarks of The Apache Software Foundation. |
| </div> |
| <div class="row text-center" style="padding-top:10px;"> |
| <a href="https://www.apache.org/events/current-event.html"> |
| <img src="https://www.apache.org/events/current-event-234x60.png" alt="ASF Current Event"> |
| </a> |
| </div> |
| </div> |
| </div> |
| <!-- footer end --> |
| |
| <!-- JavaScripts --> |
| <script src="https://code.jquery.com/jquery-2.2.4.min.js"></script> |
| |
| |
| |
| </body> |
| </html> |