Apache Griffin is a model-driven, open source data quality solution for distributed data systems at any scale, in both streaming and batch contexts. When people use open source products (e.g. Hadoop, Spark, Kafka, Storm), they often need a data quality service to build confidence in the data processed by those platforms. Apache Griffin creates a unified process to define and construct data quality measurement pipelines across multiple data systems.
When people work with big data in Hadoop (or in streaming systems), data quality often becomes a major challenge. Different teams have built customized data quality tools to detect and analyze data quality issues within their own domains. We are taking a platform approach instead, providing shared infrastructure and generic features that address common data quality pain points. This enables us to build trusted data assets.
Currently it is difficult and costly to validate data quality when big data flows across multiple platforms (e.g. Oracle, Hadoop, Couchbase, Cassandra, Kafka, MongoDB). Take eBay's real-time personalization platform as an example: every day we have to validate the data quality status of roughly 600 million records (we have about 150 million active users on our website). Data quality often becomes a major challenge in both its streaming and batch pipelines.
So we identified three data quality problems at eBay:
step 2: Select the target dataset and the fields that will be used for comparison.
step 3: Map the target fields to the corresponding source fields (see the comparison sketch after these steps).
step 4: Set the partition configuration for the source dataset and the target dataset.
step 5: Set the basic configuration for your measure.
confirmation: Review the configuration, then save the measure.
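The steps above configure an accuracy measure that compares a source dataset against a target dataset on the mapped fields. Griffin's measure engine runs on Spark, but the sketch below is only an illustration of what such a comparison computes, not Griffin's actual implementation; the table names, partition value, and column mapping (`order_id` to `id`, `amount` to `total_amount`) are hypothetical placeholders for whatever you select in the UI.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: accuracy = records in the source that find a matching
// record in the target on the mapped fields, divided by total source records.
object AccuracySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accuracy-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical source and target tables, filtered to the partition
    // chosen in step 4.
    val src = spark.table("source_db.orders").where("dt = '20240101'")
    val tgt = spark.table("target_db.orders").where("dt = '20240101'")

    // Step 3's field mapping: source columns joined to their target counterparts.
    val matched = src.join(
      tgt,
      src("order_id") === tgt("id") && src("amount") === tgt("total_amount"),
      "left_semi"
    )

    val total = src.count()
    val matchedCount = matched.count()
    val accuracy = if (total == 0) 1.0 else matchedCount.toDouble / total

    println(s"matched = $matchedCount, total = $total, accuracy = $accuracy")
    spark.stop()
  }
}
```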
step 2: Define the syntax check logic that will be applied to the selected fields (see the sketch after these steps).
step 3: Set the partition configuration for the target dataset.
step 4: Set the basic configuration for your measure.
confirmation: Review the configuration, then save the measure.
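These steps define a validity (syntax-check) measure over a single dataset. As a rough illustration under assumed names, the sketch below evaluates such a check with Spark; the target table, partition filter, `email` field, and format regex are all hypothetical examples standing in for the logic defined in step 2, not anything prescribed by Griffin.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch: validity = records whose selected field matches the
// configured pattern, divided by total records in the target partition.
object SyntaxCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("syntax-check-sketch")
      .master("local[*]")
      .getOrCreate()

    // Step 3's partition configuration, applied as a filter.
    val tgt = spark.table("target_db.users").where("dt = '20240101'")

    // Step 2's syntax check logic: does the selected field match a pattern?
    val emailPattern = "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
    val total = tgt.count()
    val valid = tgt.filter(col("email").rlike(emailPattern)).count()
    val validityRate = if (total == 0) 1.0 else valid.toDouble / total

    println(s"valid = $valid, total = $total, validity = $validityRate")
    spark.stop()
  }
}
```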
Below is a list of questions to be addressed as a result of this requirements document: