Apache Griffin is a model-driven, open source data quality solution for distributed data systems at any scale, in both streaming and batch contexts. When people use open source products (e.g. Hadoop, Spark, Kafka, Storm), they often need a data quality service to build confidence in the data processed by those platforms. Apache Griffin creates a unified process to define and construct data quality measurement pipelines across multiple data systems.
When people work with big data in Hadoop (or in streaming systems), data quality often becomes a major challenge. Different teams have built customized data quality tools to detect and analyze data quality issues within their own domains. We are taking a platform approach instead, providing shared infrastructure and generic features that address common data quality pain points. This enables us to build trusted data assets.
Currently it is difficult and costly to validate data quality when big data flows across multiple platforms (e.g. Oracle, Hadoop, Couchbase, Cassandra, Kafka, MongoDB). Take eBay's real-time personalization platform as an example: every day we have to validate the data quality status of roughly 600 million records (we have about 150 million active users on our website). Data quality often becomes a major challenge in both its streaming and batch pipelines.
So we identified three data quality problems at eBay:
step 2: Select the target dataset and the fields that will be used for comparison.
step 3: Map the target fields to the corresponding source fields (see the comparison sketch after these steps).
step 4: Set the partition configuration for the source dataset and the target dataset.
step 5: Set the basic configuration for your measure.
confirmation: Review the configuration, then save the measure.
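The steps above configure an accuracy measure that compares a source dataset against a target dataset on the mapped fields. Griffin's measure engine runs on Spark, but the sketch below is only an illustration of what such a comparison computes, not Griffin's actual implementation; the table names, partition value, and column mapping (`order_id` to `id`, `amount` to `total_amount`) are hypothetical placeholders for whatever you select in the UI.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: accuracy = records in the source that find a matching
// record in the target on the mapped fields, divided by total source records.
object AccuracySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accuracy-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical source and target tables, filtered to the partition
    // chosen in step 4.
    val src = spark.table("source_db.orders").where("dt = '20240101'")
    val tgt = spark.table("target_db.orders").where("dt = '20240101'")

    // Step 3's field mapping: source columns joined to their target counterparts.
    val matched = src.join(
      tgt,
      src("order_id") === tgt("id") && src("amount") === tgt("total_amount"),
      "left_semi"
    )

    val total = src.count()
    val matchedCount = matched.count()
    val accuracy = if (total == 0) 1.0 else matchedCount.toDouble / total

    println(s"matched = $matchedCount, total = $total, accuracy = $accuracy")
    spark.stop()
  }
}
```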
step 2: Define the syntax check logic that will be applied to the selected fields (see the sketch after these steps).
step 3: Set the partition configuration for the target dataset.
step 4: Set the basic configuration for your measure.
confirmation: Review the configuration, then save the measure.
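These steps define a validity (syntax-check) measure over a single dataset. As a rough illustration under assumed names, the sketch below evaluates such a check with Spark; the target table, partition filter, `email` field, and format regex are all hypothetical examples standing in for the logic defined in step 2, not anything prescribed by Griffin.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch: validity = records whose selected field matches the
// configured pattern, divided by total records in the target partition.
object SyntaxCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("syntax-check-sketch")
      .master("local[*]")
      .getOrCreate()

    // Step 3's partition configuration, applied as a filter.
    val tgt = spark.table("target_db.users").where("dt = '20240101'")

    // Step 2's syntax check logic: does the selected field match a pattern?
    val emailPattern = "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
    val total = tgt.count()
    val valid = tgt.filter(col("email").rlike(emailPattern)).count()
    val validityRate = if (total == 0) 1.0 else valid.toDouble / total

    println(s"valid = $valid, total = $total, validity = $validityRate")
    spark.stop()
  }
}
```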
Below is a list of questions to be addressed as a result of this requirements document: