RFC-49: Support sync with DataHub

Proposers

@xushiyan

Approvers

@vinothchandar
@Sivabalan

Status

JIRA: HUDI-3468

Overview

Support sync with DataHub to provide rich metadata capabilities for Hudi tables.

DataHub is an open-source metadata platform for the modern data stack.

Implementation

To sync with DataHub, we can make use of the existing hudi-sync abstraction by extending org.apache.hudi.sync.common.AbstractSyncTool.

The sync mechanism can be implemented via the DataHub Java Emitter. The main work is to:

take in the user's configurations to connect to an existing DataHub instance
compose the desired metadata for sync based on DataHub's metadata model
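
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: SyncToolStandIn is a stand-in for org.apache.hudi.sync.common.AbstractSyncTool reduced to a single syncHoodieTable() hook, and MetadataSink is a hypothetical wrapper around the Java Emitter so that the sketch has no external dependencies. The property key last_sync_commit_time is likewise made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for org.apache.hudi.sync.common.AbstractSyncTool (assumption:
// reduced to one hook; the real class has a richer constructor and helpers).
abstract class SyncToolStandIn {
    abstract void syncHoodieTable();
}

// Hypothetical abstraction over the DataHub Java Emitter.
interface MetadataSink {
    void emit(String datasetUrn, String aspectName, Map<String, String> payload);
}

// Sketch of a sync tool: compose metadata for one table, then hand it to the sink.
final class DataHubSyncSketch extends SyncToolStandIn {
    private final String datasetUrn;
    private final String lastCommitTime;
    private final MetadataSink sink;

    DataHubSyncSketch(String datasetUrn, String lastCommitTime, MetadataSink sink) {
        this.datasetUrn = datasetUrn;
        this.lastCommitTime = lastCommitTime;
        this.sink = sink;
    }

    @Override
    void syncHoodieTable() {
        // Compose key-value table properties (the key name is illustrative only).
        Map<String, String> props = new LinkedHashMap<>();
        props.put("last_sync_commit_time", lastCommitTime);
        // Emit them as one aspect of the Dataset entity.
        sink.emit(datasetUrn, "datasetProperties", props);
    }
}
```

Keeping the emitter behind a small interface is a design choice of this sketch, not of the RFC; it makes the composition logic testable without a running DataHub instance.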

Configurations

The necessary configurations will be added under the hoodie.sync.datahub.* pattern so that the sync tool can connect to the user-operated DataHub instance.
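
As a minimal sketch of how such configurations might be read, the class below pulls connection settings from java.util.Properties. The concrete key names (emitter.server, emitter.token) are assumptions made for illustration; this RFC only fixes the hoodie.sync.datahub.* prefix.

```java
import java.util.Properties;

// Sketch only: reads DataHub connection settings under the RFC's key prefix.
final class DataHubSyncConfigSketch {
    static final String PREFIX = "hoodie.sync.datahub.";
    // Hypothetical keys; the RFC does not name individual configurations.
    static final String SERVER_KEY = PREFIX + "emitter.server";
    static final String TOKEN_KEY = PREFIX + "emitter.token";

    final String server;
    final String token; // null when no token is configured

    private DataHubSyncConfigSketch(String server, String token) {
        this.server = server;
        this.token = token;
    }

    static DataHubSyncConfigSketch from(Properties props) {
        String server = props.getProperty(SERVER_KEY);
        // Fail fast: without a server address there is nothing to sync to.
        if (server == null || server.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required config: " + SERVER_KEY);
        }
        return new DataHubSyncConfigSketch(server.trim(), props.getProperty(TOKEN_KEY));
    }
}
```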

Metadata Model

A Hudi table maps to a Dataset entity in DataHub.

Identifier

A Dataset is identified by a URN that consists of the Data Platform (default hudi), the table identifier (<db>.<table>), and an optional, configurable environment suffix. For example:

urn:li:dataset:(urn:li:dataPlatform:hudi,mydb.mytable,PROD)
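
A small sketch of composing such an identifier is shown below. It assumes the three-part form of platform, table identifier (<db>.<table>), and environment, and it upper-cases the environment on the assumption that DataHub expects values such as PROD or DEV. The class and method names are hypothetical.

```java
// Sketch only: builds a DataHub dataset URN string for a Hudi table.
final class DatasetUrnSketch {
    static final String DEFAULT_PLATFORM = "hudi";

    static String build(String platform, String database, String table, String env) {
        // The table identifier is <db>.<table>, as described in the text.
        String name = database + "." + table;
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)",
            platform, name, env.toUpperCase());
    }

    // Convenience overload using the default platform.
    static String build(String database, String table, String env) {
        return build(DEFAULT_PLATFORM, database, table, env);
    }
}
```

For instance, DatasetUrnSketch.build("mydb", "mytable", "prod") yields urn:li:dataset:(urn:li:dataPlatform:hudi,mydb.mytable,PROD).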

Schema

The schema can be synced via the SchemaMetadata aspect. The platformSchema field (raw schema) will be synced using the Avro schema string persisted in the commit metadata.

Dataset Properties

Key-value table properties, e.g., the last synced commit timestamp, can be synced via the DatasetProperties aspect.

Column Stats

Column stats, e.g., min/max values of selected fields, can be retrieved from the column stats partition of the Hudi metadata table and synced via the fieldProfiles of the DatasetProfile aspect.
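
One way to derive a table-level min/max for a field is sketched below. It assumes that per-file value ranges for a numeric column have already been read from the column stats partition, and that merging them into a single range is what gets reported in the field profile; neither detail is specified by this RFC. ColumnRangeSketch is a hypothetical helper, not a Hudi or DataHub class.

```java
import java.util.List;

// Sketch only: a [min, max] range for one numeric column.
final class ColumnRangeSketch {
    final long min;
    final long max;

    ColumnRangeSketch(long min, long max) {
        this.min = min;
        this.max = max;
    }

    // Merge per-file ranges into one table-level range.
    static ColumnRangeSketch merge(List<ColumnRangeSketch> perFile) {
        if (perFile.isEmpty()) {
            throw new IllegalArgumentException("No column stats to merge");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (ColumnRangeSketch r : perFile) {
            min = Math.min(min, r.min);
            max = Math.max(max, r.max);
        }
        return new ColumnRangeSketch(min, max);
    }
}
```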

Rollout/Adoption Plan

This is a new feature that is enabled by configuration. Users can turn it on or off at any time. This feature won't interfere with operations on existing Hudi tables.

Test Plan

Unit tests
Run a PoC setup with DataHub integration to verify that the desired metadata is synced