GH Feature Request: https://github.com/apache/incubator-xtable/issues/590
Please keep the status updated in rfc/README.md.
Users of Apache XTable (Incubating) can today translate metadata across table formats (Iceberg, Hudi, and Delta) and use the tables on the platforms of their choice. Some usability friction remains, because users must explicitly register the tables in the catalog of their choice (Glue, HMS, Unity, BigLake, etc.) and then use that catalog in their platform to run DDL and DML queries.
XTable is built on the principle of omnidirectional interoperability, and I'm proposing an interface which allows syncing metadata of table formats to multiple catalogs in a continuous and incremental manner. With this new functionality we will be able to
Introducing the following interfaces. [PR]
- CatalogSyncClient: This interface contains the methods responsible for creating a table, refreshing table metadata, dropping a table, etc. in the target catalog. Consider this interface a translation layer between InternalTable and the catalog's table object.
- CatalogSync: This interface synchronizes the internal XTable object (InternalTable) to multiple target catalogs using the methods available in the CatalogSyncClient interface.
- CatalogTableIdentifier: Represents a catalog table identifier in a multi-level catalog system. HierarchicalTableIdentifier is an internal representation of a fully qualified table identifier within a catalog, following the three-level hierarchy convention used by all the major catalogs (Glue, HMS, Unity, etc.). In the future, we can support other conventions by implementing this interface.

For XTable users, defining source/target catalog configurations and synchronizing tables will be handled by the RunCatalogSync class. This utility class parses the user's YAML configuration, synchronizes table format metadata when necessary, and then uses the interfaces defined above to synchronize the table in the catalog. [PR]
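As a rough illustration of how these pieces could fit together, the sketch below models a per-catalog client and a fan-out loop in the spirit of CatalogSync. Everything in it is an assumption made for illustration: the type names ending in `Sketch`, the method names (`hasTable`, `createTable`, `refreshTable`), the `InMemoryCatalogClient`, and the string statuses are hypothetical and are not the signatures from the linked PR.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for XTable's InternalTable; only the fields this sketch needs.
final class InternalTableSketch {
  final String name;
  final String basePath;

  InternalTableSketch(String name, String basePath) {
    this.name = name;
    this.basePath = basePath;
  }
}

// Assumed shape of a per-catalog client. Method names are illustrative only.
interface CatalogSyncClientSketch {
  String catalogId();

  boolean hasTable(String tableIdentifier);

  void createTable(InternalTableSketch table, String tableIdentifier);

  void refreshTable(InternalTableSketch table, String tableIdentifier);
}

// Toy client that records tables in a map instead of calling a real catalog.
final class InMemoryCatalogClient implements CatalogSyncClientSketch {
  private final String id;
  final Map<String, String> registered = new HashMap<>(); // identifier -> base path

  InMemoryCatalogClient(String id) {
    this.id = id;
  }

  @Override
  public String catalogId() {
    return id;
  }

  @Override
  public boolean hasTable(String tableIdentifier) {
    return registered.containsKey(tableIdentifier);
  }

  @Override
  public void createTable(InternalTableSketch table, String tableIdentifier) {
    registered.put(tableIdentifier, table.basePath);
  }

  @Override
  public void refreshTable(InternalTableSketch table, String tableIdentifier) {
    registered.put(tableIdentifier, table.basePath);
  }
}

// Fan-out loop: create the table where it is missing, refresh it otherwise,
// and record a status per catalog so one failing catalog does not stop the rest.
final class CatalogSyncSketch {
  static Map<String, String> syncTable(
      InternalTableSketch table, Map<CatalogSyncClientSketch, String> targets) {
    Map<String, String> statusByCatalog = new LinkedHashMap<>();
    for (Map.Entry<CatalogSyncClientSketch, String> target : targets.entrySet()) {
      CatalogSyncClientSketch client = target.getKey();
      String tableIdentifier = target.getValue();
      try {
        if (client.hasTable(tableIdentifier)) {
          client.refreshTable(table, tableIdentifier);
          statusByCatalog.put(client.catalogId(), "REFRESHED");
        } else {
          client.createTable(table, tableIdentifier);
          statusByCatalog.put(client.catalogId(), "CREATED");
        }
      } catch (RuntimeException e) {
        statusByCatalog.put(client.catalogId(), "FAILED: " + e.getMessage());
      }
    }
    return statusByCatalog;
  }
}
```

Recording a status per catalog rather than aborting on the first error is a design choice of this sketch; it is meant to mirror the incremental, multi-catalog nature of the proposal, not to describe the actual error handling.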
User's YAML configuration.
- sourceCatalog: Configuration of the source catalog from which XTable will read. It must contain all the connection and access details needed to describe and list tables.
  - catalogId: A user-defined unique identifier for the catalog. It allows the user to sync a table to multiple catalogs of the same name or type, e.g. an HMS catalog with url1 and an HMS catalog with url2.
  - catalogType: The type of the source catalog. This might be a specific type understood by XTable, such as Hive or Glue.
  - catalogSyncClientImpl (optional): A fully qualified name of a class that implements the CatalogSyncClient interface. It can be used if an implementation for catalogType doesn't exist in XTable.
  - catalogConversionSourceImpl (optional): A fully qualified name of a class that implements the CatalogConversionSource interface. It can be used if an implementation for catalogType doesn't exist in XTable.
  - catalogProperties: A collection of configs used to configure access or connection properties for the catalog.
- targetCatalogs: Defines the configuration of one or more target catalogs, to which XTable will write or update tables. Unlike the source, these catalogs must be writable.
- datasets: A list of datasets that specify how a source table maps to one or more target tables.
  - sourceCatalogTableIdentifier: Identifies the source table in sourceCatalog. This can be done in two ways:
    - tableIdentifier: Specifies a source table by its three-level hierarchical fully qualified name: catalogName, databaseName and tableName. If catalogName is not provided, the default catalog will be used.
    - storageIdentifier (optional): Provides direct storage details such as a table's base path (like an S3 location) and the partition specification. This allows reading from a source even if it is not strictly registered in a catalog, as long as the format and location are known.
  - targetCatalogTableIdentifiers: A list of one or more targets that this source table should be written to.
    - catalogId: The user-defined unique identifier of the target catalog where the table will be created or updated. The id passed here must be one of the targetCatalogs defined above.
    - tableFormat: The target table format (e.g., DELTA, HUDI, ICEBERG), specifying how the data will be stored at the target.
    - tableIdentifier: Specifies a target table by its three-level hierarchical fully qualified name: catalogName, databaseName and tableName. If catalogName is not provided, the default catalog will be used.

sourceCatalog:
  catalogId: "source-1"
  catalogType: "catalog-type-1"
  catalogConversionSourceImpl: "org.apache.xtable.utilities.CustomCatalogConversionSourceImpl"
  catalogProperties:
    key01: "value01"
    key02: "value02"
    key03: "value03"
targetCatalogs:
  - catalogId: "target-1"
    catalogType: "catalog-type-2"
    catalogProperties:
      key11: "value11"
      key12: "value22"
      key13: "value33"
  - catalogId: "target-2"
    catalogSyncClientImpl: "org.apache.xtable.utilities.CustomCatalogSyncClientImpl"
    catalogProperties:
      key21: "value21"
      key22: "value22"
      key23: "value23"
datasets:
  - sourceCatalogTableIdentifier:
      tableIdentifier:
        hierarchicalId: "source-database-1.source-1"
    targetCatalogTableIdentifiers:
      - catalogId: "target-1"
        tableFormat: "DELTA"
        tableIdentifier:
          hierarchicalId: "target-database-1.target-tableName-1"
      - catalogId: "target-1"
        tableFormat: "ICEBERG"
        tableIdentifier:
          hierarchicalId: "target-database-2.target-tableName-2-iceberg"
      - catalogId: "target-2"
        tableFormat: "HUDI"
        tableIdentifier:
          hierarchicalId: "default-catalog-2.target-database-2.target-tableName-2-hudi"
  - sourceCatalogTableIdentifier:
      storageIdentifier:
        tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
        tableName: catalog_sales
        partitionSpec: cs_sold_date_sk:VALUE
        tableFormat: "HUDI"
    targetCatalogTableIdentifiers:
      - catalogId: "target-2"
        tableFormat: "ICEBERG"
        tableIdentifier:
          hierarchicalId: "target-database-2.target-tableName-2"
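
The hierarchicalId values in the example configuration use either two levels (database and table, falling back to the default catalog) or three levels (catalog, database and table). The sketch below shows one way such a string could be split, together with a check that every target catalog referenced by a dataset is declared under targetCatalogs. The class names `HierarchicalIdSketch` and `TargetCatalogCheckSketch` and their methods are hypothetical; how RunCatalogSync actually parses and validates the configuration is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical parser for "database.table" or "catalog.database.table" strings.
final class HierarchicalIdSketch {
  final String catalogName; // null means "use the default catalog"
  final String databaseName;
  final String tableName;

  private HierarchicalIdSketch(String catalogName, String databaseName, String tableName) {
    this.catalogName = catalogName;
    this.databaseName = databaseName;
    this.tableName = tableName;
  }

  static HierarchicalIdSketch parse(String hierarchicalId) {
    // Limit -1 keeps trailing empty parts so "db.table." is rejected below.
    String[] parts = hierarchicalId.split("\\.", -1);
    for (String part : parts) {
      if (part.isEmpty()) {
        throw new IllegalArgumentException("Empty level in: " + hierarchicalId);
      }
    }
    if (parts.length == 2) {
      return new HierarchicalIdSketch(null, parts[0], parts[1]);
    }
    if (parts.length == 3) {
      return new HierarchicalIdSketch(parts[0], parts[1], parts[2]);
    }
    throw new IllegalArgumentException(
        "Expected database.table or catalog.database.table, got: " + hierarchicalId);
  }
}

// Hypothetical validation: which referenced target catalogs were never declared?
final class TargetCatalogCheckSketch {
  static List<String> undeclared(Set<String> declaredTargets, List<String> referencedTargets) {
    List<String> missing = new ArrayList<>();
    for (String referenced : referencedTargets) {
      if (!declaredTargets.contains(referenced) && !missing.contains(referenced)) {
        missing.add(referenced);
      }
    }
    return missing;
  }
}
```

Failing fast on an undeclared target catalog before any sync starts is an assumption of this sketch; it keeps a typo in the configuration from surfacing only after some tables have already been written.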
RunCatalogSync processSyncResult status has been refactored to tableFormatSyncStatus for clarity.
RunSync without any issues.
We plan to add the HMS and Glue implementations for the CatalogSyncClient interface. Conversion in both directions across all table formats will be tested.