<h2 id="apache-griffin-入门指南">Apache Griffin Getting Started Guide</h2>
<p>A data quality module is an essential component of any big data platform. <a href="http://griffin.apache.org">Apache Griffin</a> (hereafter Griffin) is an open-source big data data quality solution. It supports both batch and streaming data quality checks and can measure data assets from different dimensions (for example, verifying that source and target record counts match after an offline job finishes, or counting null values in a source table), thereby improving the accuracy and trustworthiness of the data.</p>
<p>Griffin's architecture is divided into three parts: Define, Measure and Analyze, as shown below:</p>
<p><img src="/images/arch-1.png" alt="arch" /></p>
<p>Each part has the following responsibilities:</p>
<ul>
<li>Define: defines the dimensions of the data quality measurement, such as the time span of the measurement and its target (whether source and target record counts match, the number of non-null values, distinct values, maximum, minimum, or top-5 values of a field in a data source, and so on); a sketch of such a definition is shown right after this list</li>
<li>Measure: executes the measurement jobs and produces the results</li>
<li>Analyze: stores and visualizes the measurement results</li>
</ul>
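<p>To make the Define part more concrete, below is a rough sketch of what a batch accuracy measure definition looks like when it compares a source table with a target table (the demo_src/demo_tgt tables used later in this guide). The field layout follows the batch samples in griffin-doc; treat the exact keys as an assumption and check the measure configuration guide for your Griffin version. When jobs are created through the UI, the service generates this definition for you.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical dq.json for a batch accuracy measure (layout based on the
# griffin-doc samples; verify against the docs of your Griffin version).
cat &gt; dq.json &lt;&lt;'EOF'
{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    { "name": "src", "baseline": true,
      "connectors": [ { "type": "hive", "version": "1.2",
        "config": { "database": "default", "table.name": "demo_src" } } ] },
    { "name": "tgt",
      "connectors": [ { "type": "hive", "version": "1.2",
        "config": { "database": "default", "table.name": "demo_tgt" } } ] }
  ],
  "evaluate.rule": {
    "rules": [
      { "dsl.type": "griffin-dsl", "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc" }
    ]
  },
  "sinks": ["CONSOLE", "ELASTICSEARCH"]
}
EOF
</code></pre></div></div>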
<p>Based on these capabilities, our big data platform plans to adopt Griffin as its data quality solution to implement consistency checks between source and target, null-value statistics, and similar features. The installation steps are summarized below:</p>
<h3 id="安装部署">Installation and Deployment</h3>
<h4 id="依赖准备">Prerequisites</h4>
<ul>
<li>JDK (1.8 or later versions)</li>
<li>MySQL (version 5.6 or later)</li>
<li>Hadoop (2.6.0 or later)</li>
<li>Hive (version 2.x)</li>
<li>Spark (version 2.2.1)</li>
<li>Livy (livy-0.5.0-incubating)</li>
<li>ElasticSearch (5.0 or later versions)</li>
</ul>
<h4 id="初始化">Initialization</h4>
<p>For the concrete initialization steps, see the <a href="https://github.com/apache/griffin/blob/master/griffin-doc/deploy/deploy-guide.md">Apache Griffin Deployment Guide</a>. Since the Hadoop and Hive clusters in my test environment were already set up, the Hadoop and Hive installation steps are omitted here; only the steps of copying the configuration files and pointing to the Hadoop configuration directory are kept.</p>
<p>1. MySQL:</p>
<p>Create a database named quartz in MySQL, then run the <a href="https://github.com/apache/griffin/blob/master/service/src/main/resources/Init_quartz_mysql_innodb.sql">Init_quartz_mysql_innodb.sql</a> script to initialize the tables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysql -u &lt;username&gt; -p quartz &lt; Init_quartz_mysql_innodb.sql
</code></pre></div></div>
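<p>If the quartz database has not been created yet, it can be created first (a minimal sketch; replace the credentials with your own):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create the quartz database used by the Quartz scheduler tables
mysql -u &lt;username&gt; -p -e "CREATE DATABASE quartz;"
</code></pre></div></div>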
<p>2. Hadoop and Hive:</p>
<p>Copy the Hadoop configuration files from the Hadoop server to the Livy server; here we assume they are placed under /usr/data/conf.</p>
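<p>For example, assuming the Hadoop client configuration lives under /etc/hadoop/conf and the Livy machine is reachable over SSH (the hostname and source path below are placeholders for your environment):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create the target directory on the Livy server and copy the Hadoop configs over
ssh livy-host "mkdir -p /usr/data/conf"
scp /etc/hadoop/conf/*.xml livy-host:/usr/data/conf/
</code></pre></div></div>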
<p>On the Hadoop server, create the /home/spark_conf directory in HDFS and upload Hive's configuration file hive-site.xml to it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create the /home/spark_conf directory
hadoop fs -mkdir -p /home/spark_conf
# upload hive-site.xml
hadoop fs -put hive-site.xml /home/spark_conf/
</code></pre></div></div>
<p>3. Set environment variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
export JAVA_HOME=/data/jdk1.8.0_192
# Spark home directory
export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3
# Livy bin directory
export LIVY_HOME=/usr/data/livy/bin
# Hadoop configuration directory
export HADOOP_CONF_DIR=/usr/data/conf
</code></pre></div></div>
<p>4. Livy configuration:</p>
<p>Update the livy.conf configuration file under livy/conf:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deployMode = cluster
livy.repl.enable-hive-context = true
</code></pre></div></div>
<p>Start Livy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livy-server start
</code></pre></div></div>
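<p>To confirm Livy is up, you can hit its REST endpoint (replace livy-host with your own server; 8998 is Livy's default port):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://livy-host:8998/sessions
</code></pre></div></div>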
<p>5. Elasticsearch configuration:</p>
<p>Create the griffin index in ES:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -XPUT http://es:9200/griffin -H "Content-Type: application/json" -d '
{
  "aliases": {},
  "mappings": {
    "accuracy": {
      "properties": {
        "name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "tmst": {
          "type": "date"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}
'
</code></pre></div></div>
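<p>You can verify that the index and its mapping were created as expected:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl "http://es:9200/griffin?pretty"
</code></pre></div></div>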
<h4 id="源码打包部署">Building and Deploying from Source</h4>
<p>Here I deploy Griffin by building it from source. The source repository is <a href="https://github.com/apache/griffin.git">https://github.com/apache/griffin.git</a>, and the tag I use is griffin-0.4.0. After downloading, import the project into IDEA; the expanded source structure looks like this:</p>
<p><img src="/images/project.jpg" alt="project" /></p>
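<p>To fetch that exact tag from the repository:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/griffin.git
cd griffin
git checkout griffin-0.4.0
</code></pre></div></div>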
<p>Griffin's source structure is quite clear and consists of four modules: griffin-doc, measure, service and ui. griffin-doc holds Griffin's documentation; measure interacts with Spark and executes the measurement jobs; service is a Spring Boot application that provides the RESTful API the ui module needs, persists the measurement jobs and presents the results; ui is the web front end.</p>
<p>After the source has been imported and built, the following configuration files need to be modified:</p>
<p>1. service/src/main/resources/application.properties:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Apache Griffin application name
spring.application.name=griffin_service
# MySQL database configuration
spring.datasource.url=jdbc:mysql://10.xxx.xx.xxx:3306/griffin_quartz?useSSL=false
spring.datasource.username=xxxxx
spring.datasource.password=xxxxx
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore configuration
hive.metastore.uris=thrift://namenode.test01.xxx:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry (configure if needed)
kafka.schema.registry.url=http://namenode.test01.xxx:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expiration time of a job instance: 7 days (604800000 milliseconds); only milliseconds are supported
jobInstance.expired.milliseconds=604800000
# schedule the predicate job every 5 minutes and repeat at most 12 times
# interval time units: s (second), m (minute), h (hour), d (day); only these four units are supported
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# LDAP settings, only needed when login.strategy is ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# Elasticsearch configuration
elasticsearch.host=griffindq02-test1-rgtj1-tj1
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# Livy configuration
livy.uri=http://10.104.xxx.xxx:8998/batches
# YARN url
yarn.uri=http://10.104.xxx.xxx:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
</code></pre></div></div>
<p>2. service/src/main/resources/quartz.properties:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database, set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database, set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
</code></pre></div></div>
<p>3. service/src/main/resources/sparkProperties.json:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "name": "griffin",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
  },
  "files": []
}
</code></pre></div></div>
<p>4. service/src/main/resources/env/env_batch.json:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://namenodetest01.xx.xxxx.com:9001/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://10.xxx.xxx.xxx:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}
</code></pre></div></div>
<p>Once the configuration files have been updated, run the following Maven command in the IDEA terminal to compile and package the project:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn -Dmaven.test.skip=true clean install
</code></pre></div></div>
<p>When the command completes, two jars, service-0.4.0.jar and measure-0.4.0.jar, will be produced in the target directories of the service and measure modules respectively. Copy both jars to a directory on the server. They are used as follows:</p>
<p>1. Upload measure-0.4.0.jar to the /griffin directory in HDFS with the following commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># rename the jar
mv measure-0.4.0.jar griffin-measure.jar
# upload griffin-measure.jar to the /griffin directory in HDFS
hadoop fs -put griffin-measure.jar /griffin/
</code></pre></div></div>
<p>This is necessary because, when Spark runs the job on the YARN cluster, it loads griffin-measure.jar from the /griffin directory in HDFS; without it, a class-not-found error for org.apache.griffin.measure.Application occurs.</p>
<p>2. Run service-0.4.0.jar to start the Griffin admin service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nohup java -jar service-0.4.0.jar&gt;service.out 2&gt;&amp;1 &amp;
</code></pre></div></div>
<p>A few seconds later, the default Apache Griffin UI can be accessed (by default, Spring Boot listens on port 8080):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://IP:8080
</code></pre></div></div>
<p>The UI documentation can be found here: <a href="https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md">Apache Griffin User Guide</a>. Through the UI we can create our own measurement jobs; part of the results dashboard looks like this:</p>
<p><img src="/images/dashboard-big.png" alt="dashboard" /></p>
<h4 id="功能体验">Trying It Out</h4>
<p>1. Create the demo_src and demo_tgt tables in Hive:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_tgt';
</code></pre></div></div>
<p>2. Generate test data:</p>
<p>Download all of the files from <a href="http://griffin.apache.org/data/batch/">http://griffin.apache.org/data/batch/</a> to the Hadoop server (a wget sketch for this download follows the command below), then run the gen-hive-data.sh script with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nohup ./gen-hive-data.sh&gt;gen.out 2&gt;&amp;1 &amp;
</code></pre></div></div>
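<p>The download step above can also be scripted; a minimal sketch assuming wget is available on the server (the flags fetch every file in that directory listing into the current directory):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget -r -np -nd -R "index.html*" http://griffin.apache.org/data/batch/
</code></pre></div></div>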
<p>Watch the gen.out log file and adjust as needed if errors appear. In my test environment Hadoop and Hive are installed on the same server, so the script can be run directly.</p>
<p>3. Create measurement jobs through the UI, following the <a href="https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md">Apache Griffin User Guide</a> step by step.</p>
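<p>If you want to double-check from the command line as well, the service exposes a REST API; for example, listing the measures that have been created (the /api/v1/measures path follows the Griffin service API guide — treat it as an assumption and verify it against your version):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://IP:8080/api/v1/measures
</code></pre></div></div>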
<h3 id="踩坑过程">Troubleshooting</h3>
<p>1. The gen-hive-data.sh script fails to generate data with a "no such file or directory" error.</p>
<p>Cause: the date partition directories of the form "dt=&lt;date&gt;" (for example dt=20190113) do not exist under /griffin/data/batch/demo_src/ and /griffin/data/batch/demo_tgt/ in HDFS.</p>
<p>Fix: add hadoop fs -mkdir commands to the script to create the directories. The modified script looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
#create table
hive -f create-table.hql
echo "create table done"
#current hour
sudo ./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template &gt; insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#last hour
sudo ./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template &gt; insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#next hours
set +e
while true
do
sudo ./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
next_date=`date -d "+1hour" '+%Y%m%d%H'`
dt=${next_date:0:8}
hour=${next_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template &gt; insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
sleep 3600
done
set -e
</code></pre></div></div>
<p>2. No measurement result files appear under /griffin/persist in HDFS. Check the permissions on that directory and set appropriate ones.</p>
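<p>For example (the owner and mode below are only placeholders; use whatever user your Spark/Livy jobs actually run as):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># inspect current ownership and permissions
hadoop fs -ls /griffin
# placeholder fix: hand the persist directory to the job user and open it up
hadoop fs -chown -R spark:supergroup /griffin/persist
hadoop fs -chmod -R 775 /griffin/persist
</code></pre></div></div>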
<p>3. The metric data in ES is empty. There are two possible causes (a quick way to check what ES has actually received is shown after this list):</p>
<ul>
<li>the ES configuration in service/src/main/resources/env/env_batch.json is incorrect</li>
<li>the YARN nodes running the Spark job cannot resolve the ES server's hostname, so the connection fails</li>
</ul>
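<p>To see whether any metrics have reached Elasticsearch at all, query the griffin index directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl "http://es:9200/griffin/accuracy/_search?pretty"
</code></pre></div></div>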
<p>4. After starting service-0.4.0.jar, the UI cannot be accessed even though the startup log shows no errors. Check whether the project was packaged with mvn package; if so, rebuild with mvn -Dmaven.test.skip=true clean install and restart.</p>