JIRA: HUDI-3468
Support sync with DataHub to provide rich metadata capabilities for Hudi tables.
DataHub is an open-source metadata platform for the modern data stack.
Read more at https://datahubproject.io/docs/#introduction
To sync with DataHub, we can make use of the existing hudi-sync abstraction by extending org.apache.hudi.sync.common.AbstractSyncTool.
The sync mechanism can be implemented via Java Emitter. The main work is about
Necessary configurations will be added under the pattern hoodie.sync.datahub.* to connect to the user-operated DataHub instance.
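As a rough illustration of how a sync tool might pick up its own settings, the sketch below filters a java.util.Properties object by the hoodie.sync.datahub. prefix. The class name, method name, and the individual keys used in the usage note are hypothetical and chosen only for illustration; this proposal fixes only the prefix pattern.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Hudi: collects the DataHub sync settings
// from a larger set of writer properties.
final class DataHubSyncConfigSketch {
    // Prefix taken from the hoodie.sync.datahub.* pattern in this proposal.
    static final String PREFIX = "hoodie.sync.datahub.";

    // Returns only the properties under PREFIX, keyed by the part after the prefix.
    static Map<String, String> extract(Properties props) {
        Map<String, String> out = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                out.put(key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return out;
    }
}
```

For example, given properties containing a made-up key hoodie.sync.datahub.server.url alongside unrelated Hudi writer options, extract would return a map with the single entry server.url and its value.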
A Hudi table maps to a Dataset entity in DataHub.
A Dataset is identified by a URN that consists of the Data Platform (default hudi), the table identifier (<db>.<table>), and an optional, configurable environment suffix. An example:
urn:li:dataset:(urn:li:dataPlatform:hudi,mydb.mytable,PROD)
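The URN layout described above can be sketched with plain string formatting. The class and method names here are hypothetical, and the sketch assumes the environment is always supplied; it joins the database and table into the <db>.<table> identifier and uses the upper-case environment name that DataHub conventionally expects (for example PROD).

```java
// Hypothetical helper, not part of Hudi or DataHub: builds a Dataset URN string.
final class DatasetUrnSketch {
    // platform: data platform name, "hudi" by default in this proposal
    // database, table: joined as <db>.<table> to form the table identifier
    // env: environment suffix such as "PROD"
    static String buildDatasetUrn(String platform, String database, String table, String env) {
        String tableIdentifier = database + "." + table;
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)",
            platform, tableIdentifier, env);
    }
}
```

Calling buildDatasetUrn("hudi", "mydb", "mytable", "PROD") yields urn:li:dataset:(urn:li:dataPlatform:hudi,mydb.mytable,PROD).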
Schema can be synced via the SchemaMetadata aspect. platformSchema (raw schema) will be synced using the Avro schema string persisted in the commit metadata.
Key-value table properties, e.g., the last synced commit timestamp, can be synced via the DatasetProperties aspect.
Column stats, e.g., min/max values of selected fields, can be retrieved from the Hudi metadata table's column stats partition and synced via the fieldProfiles of the DatasetProfile aspect.
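One way to picture the column stats mapping is the sketch below. It assumes the stats arrive as per-file min/max ranges for numeric columns and merges them into one table-level range per selected field. FileColumnRange and FieldProfile are hypothetical stand-ins defined here for illustration; they are not the Hudi or DataHub classes, and a real implementation would populate DataHub's own field profile model instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: merges per-file column ranges into per-field profiles.
final class FieldProfileSketch {
    // Stand-in for one file's min/max of one column (assumed numeric here).
    static final class FileColumnRange {
        final String column;
        final long min;
        final long max;
        FileColumnRange(String column, long min, long max) {
            this.column = column;
            this.min = min;
            this.max = max;
        }
    }

    // Stand-in for a field profile entry; min and max are kept as strings.
    static final class FieldProfile {
        final String fieldPath;
        final String min;
        final String max;
        FieldProfile(String fieldPath, String min, String max) {
            this.fieldPath = fieldPath;
            this.min = min;
            this.max = max;
        }
    }

    // Keeps only the selected columns and takes the smallest min and largest
    // max seen across all files for each of them.
    static List<FieldProfile> toFieldProfiles(List<FileColumnRange> ranges, Set<String> selected) {
        Map<String, long[]> merged = new LinkedHashMap<>();
        for (FileColumnRange r : ranges) {
            if (!selected.contains(r.column)) {
                continue;
            }
            long[] current = merged.get(r.column);
            if (current == null) {
                merged.put(r.column, new long[] {r.min, r.max});
            } else {
                current[0] = Math.min(current[0], r.min);
                current[1] = Math.max(current[1], r.max);
            }
        }
        List<FieldProfile> profiles = new ArrayList<>();
        for (Map.Entry<String, long[]> e : merged.entrySet()) {
            long[] range = e.getValue();
            profiles.add(new FieldProfile(e.getKey(), Long.toString(range[0]), Long.toString(range[1])));
        }
        return profiles;
    }
}
```

Restricting the output to a configured set of fields mirrors the "selected fields" wording above and keeps the emitted profile small for wide tables.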
This is a new feature that is enabled by configuration. Users can turn it on or off at any time. This feature won't interfere with existing Hudi tables' operations.