Apache Griffin measures come in two kinds: batch measures and streaming measures. This document covers only batch measure samples.
    {
      "name": "accu_batch",
      "process.type": "BATCH",
      "data.sources": [
        {
          "name": "source",
          "baseline": true,
          "connectors": [
            {
              "type": "AVRO",
              "version": "1.7",
              "config": {
                "file.name": "src/test/resources/users_info_src.avro"
              }
            }
          ]
        },
        {
          "name": "target",
          "connectors": [
            {
              "type": "AVRO",
              "version": "1.7",
              "config": {
                "file.name": "src/test/resources/users_info_target.avro"
              }
            }
          ]
        }
      ],
      "evaluate.rule": {
        "rules": [
          {
            "dsl.type": "griffin-dsl",
            "dq.type": "ACCURACY",
            "out.dataframe.name": "accu",
            "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code",
            "details": {
              "source": "source",
              "target": "target",
              "miss": "miss_count",
              "total": "total_count",
              "matched": "matched_count"
            },
            "out": [
              { "type": "metric", "name": "accu" },
              { "type": "record", "name": "missRecords" }
            ]
          }
        ]
      },
      "sinks": ["CONSOLE", "ELASTICSEARCH"]
    }
Above is the configuration file of a batch accuracy job.
In this sample, we use Avro files as both source and target.
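In this accuracy sample, the rule describes the match condition between a source record and a target record:
    source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code
Every field must be equal, and first_name is compared case-insensitively because both sides are upper-cased.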
The accuracy metrics are persisted as a metric named "accu", with the miss count in a column named "miss_count", the total count in "total_count", and the matched count in "matched_count".
The source records that have no match in the target are persisted as records named "missRecords".
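To make the three counts concrete, the following is a minimal Python sketch of what the accuracy rule means, run on a few in-memory rows. It is not Griffin's implementation: the helper names (`row`, `matches`, `accuracy`) and the sample rows are made up for illustration, and SQL null handling is ignored.

```python
# Illustrative only: mimics the meaning of the sample accuracy rule on
# in-memory rows. Not Griffin's implementation; all data here is made up.

# Fields compared exactly; first_name is compared case-insensitively below.
EXACT_FIELDS = ["user_id", "last_name", "address", "email", "phone", "post_code"]


def row(user_id, first_name, **overrides):
    """Build a hypothetical user record with default values."""
    base = {
        "user_id": user_id,
        "first_name": first_name,
        "last_name": "Doe",
        "address": "1 Main St",
        "email": "user@example.com",
        "phone": "555-0100",
        "post_code": "00000",
    }
    base.update(overrides)
    return base


def matches(src, tgt):
    """Return True when src and tgt satisfy every clause of the sample rule."""
    return (
        all(src[f] == tgt[f] for f in EXACT_FIELDS)
        and src["first_name"].upper() == tgt["first_name"].upper()
    )


def accuracy(source, target):
    """Count source rows with and without a matching target row."""
    miss_records = [s for s in source if not any(matches(s, t) for t in target)]
    total = len(source)
    miss = len(miss_records)
    metric = {
        "total_count": total,
        "miss_count": miss,
        "matched_count": total - miss,
    }
    return metric, miss_records


source = [row(1, "Ann"), row(2, "Bob"), row(3, "Cy")]
target = [row(1, "ANN"), row(2, "Bob", email="other@example.com")]

metric, miss_records = accuracy(source, target)
print(metric)        # {'total_count': 3, 'miss_count': 2, 'matched_count': 1}
print([r["user_id"] for r in miss_records])  # [2, 3]
```

In this sketch, user 1 matches despite the different casing of first_name, user 2 misses because the email differs, and user 3 misses because the target has no such row.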
    {
      "name": "prof_batch",
      "process.type": "BATCH",
      "data.sources": [
        {
          "name": "source",
          "connectors": [
            {
              "type": "HIVE",
              "version": "1.2",
              "config": {
                "database": "default",
                "table.name": "src"
              }
            }
          ]
        }
      ],
      "evaluate.rule": {
        "rules": [
          {
            "dsl.type": "griffin-dsl",
            "dq.type": "PROFILING",
            "out.dataframe.name": "prof",
            "rule": "select max(age) as `max_age`, min(age) as `min_age` from source",
            "out": [
              { "type": "metric", "name": "prof" }
            ]
          },
          {
            "dsl.type": "griffin-dsl",
            "dq.type": "PROFILING",
            "out.dataframe.name": "name_grp",
            "rule": "select name, count(*) as cnt from source group by name",
            "out": [
              { "type": "metric", "name": "name_grp", "flatten": "array" }
            ]
          }
        ]
      },
      "sinks": ["CONSOLE", "ELASTICSEARCH"]
    }
Above is the configuration file of a batch profiling job.
In this sample, we use a Hive table as the source.
In this profiling sample, the two rules describe the profiling requests:
    select max(age) as `max_age`, min(age) as `min_age` from source
    select name, count(*) as cnt from source group by name
The profiling metrics are persisted as a metric containing the maximum and minimum values of age and the record count grouped by name, like this:
    {"max_age": 53, "min_age": 11, "name_grp": [{"name": "Adam", "cnt": 13}, {"name": "Fred", "cnt": 2}]}
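As a rough illustration of how the two profiling rules combine into one metric, here is a minimal Python sketch that computes the same shape of output from a few in-memory rows. It is not how Griffin evaluates the rules; the `profile` function and the sample rows are hypothetical.

```python
# Illustrative only: reproduces the shape of the profiling metric on
# in-memory rows. Not Griffin's implementation; the data is made up.
from collections import Counter


def profile(rows):
    """Compute max/min of age and a per-name record count."""
    ages = [r["age"] for r in rows]
    counts = Counter(r["name"] for r in rows)
    return {
        "max_age": max(ages),
        "min_age": min(ages),
        # One object per group, like the array produced for name_grp.
        "name_grp": [{"name": n, "cnt": c} for n, c in counts.items()],
    }


rows = [
    {"name": "Adam", "age": 53},
    {"name": "Adam", "age": 30},
    {"name": "Fred", "age": 11},
]

result = profile(rows)
print(result)
# {'max_age': 53, 'min_age': 11,
#  'name_grp': [{'name': 'Adam', 'cnt': 2}, {'name': 'Fred', 'cnt': 1}]}
```

The first rule contributes the scalar fields max_age and min_age, and the second contributes the name_grp array with one entry per distinct name.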