Apache XTable™ (Incubating) is a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines. Currently, Apache XTable™ supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.
Apache XTable™ simplifies data lake operations by leveraging a common model for table representation. This allows users to write data in one format while still benefiting from integrations and features available in other formats. For instance, Apache XTable™ enables existing Hudi users to seamlessly work with Databricks's Photon Engine or query Iceberg tables with Snowflake. Creating transformations from one format to another is straightforward and only requires the implementation of a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.
- Build the project with mvn clean package. Use mvn clean package -DskipTests to skip tests while building.
- Use mvn clean test or mvn test to run all unit tests. To run only a specific test, use something like mvn test -Dtest=TestDeltaSync -pl xtable-core.
- Use mvn clean verify or mvn verify to run integration tests.

Note: When using Maven version 3.9 or above, Maven automatically caches the build. To ignore build caching, add the -Dmaven.build.cache.enabled=false parameter. For example:

    mvn clean package -DskipTests -Dmaven.build.cache.enabled=false
- Use mvn spotless:check to find code style violations and mvn spotless:apply to fix them. The code style check is tied to the compile phase by default, so code style violations will lead to build failures.

Create the bundled jar with mvn install -DskipTests, then create a YAML file that follows the format below:

    sourceFormat: HUDI
    targetFormats:
      - DELTA
      - ICEBERG
    datasets:
      - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/call_center
        tableDataPath: s3://tpc-ds-datasets/1GB/hudi/call_center/data
        tableName: call_center
        namespace: my.db
      - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
        tableName: catalog_sales
        partitionSpec: cs_sold_date_sk:VALUE
      - tableBasePath: s3://hudi/multi-partition-dataset
        tableName: multi_partition_dataset
        partitionSpec: time_millis:DAY:yyyy-MM-dd,type:VALUE
      - tableBasePath: abfs://container@storage.dfs.core.windows.net/multi-partition-dataset
        tableName: multi_partition_dataset
- sourceFormat is the format of the source table that you want to convert.
- targetFormats is a list of formats you want to create from your source tables.
- tableBasePath is the base path of the table.
- tableDataPath is an optional field specifying the path to the data files. If not specified, the tableBasePath will be used. For Iceberg source tables, you will need to specify the /data path.
- namespace is an optional field specifying the namespace of the table. It is used when syncing to a catalog.
- partitionSpec is a spec that allows us to infer partition values. This is only required for Hudi source tables. If the table is not partitioned, leave it blank. If it is partitioned, specify a comma separated list of entries with the format path:type:format.
  - path is a dot separated path to the partition field.
  - type describes how the partition value was generated from the column value.
    - VALUE: an identity transform of field value to partition value
    - YEAR: data is partitioned by a field representing a date and year granularity is used
    - MONTH: same as YEAR but with month granularity
    - DAY: same as YEAR but with day granularity
    - HOUR: same as YEAR but with hour granularity
  - format: if your partition type is YEAR, MONTH, DAY, or HOUR, specify the format for the date string as it appears in your file paths.

The table format converters can be configured in a YAML file (passed with the --convertersConfig option) in the format below:

    # conversionSourceProviderClass: The class name of a table format's converter factory, where the converter is
    #     used for reading from a table of this format. All user configurations, including hadoop config
    #     and converter specific configuration, will be available to the factory for instantiation of the
    #     converter.
    # conversionTargetProviderClass: The class name of a table format's converter factory, where the converter is
    #     used for writing to a table of this format.
    # configuration: A map of configuration values specific to this converter.
    tableFormatConverters:
      HUDI:
        conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
      DELTA:
        conversionTargetProviderClass: org.apache.xtable.delta.DeltaConversionTarget
        configuration:
          spark.master: local[2]
          spark.app.name: xtable
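To make the partitionSpec layout (comma separated path:type:format entries) concrete, here is a minimal Python sketch that splits a spec string into its parts. It is an illustration only and not XTable's own parser: the names PartitionField and parse_partition_spec are hypothetical, and the rule that date-based types must carry a format is an assumption drawn from the field descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Transform types named in the partitionSpec field description.
VALID_TYPES = {"VALUE", "YEAR", "MONTH", "DAY", "HOUR"}
DATE_TYPES = VALID_TYPES - {"VALUE"}


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical holder for one path:type:format entry."""
    path: str               # dot separated path to the partition field
    type: str               # one of VALID_TYPES
    format: Optional[str]   # date format string, only for date-based types


def parse_partition_spec(spec: str) -> List[PartitionField]:
    """Split a partitionSpec string into PartitionField entries."""
    fields: List[PartitionField] = []
    if not spec or not spec.strip():
        return fields  # blank spec: the table is not partitioned
    for entry in spec.split(","):
        # maxsplit=2 keeps any ':' inside the date format intact.
        parts = entry.strip().split(":", 2)
        if len(parts) < 2:
            raise ValueError(f"expected path:type[:format], got {entry!r}")
        path, kind = parts[0], parts[1]
        fmt = parts[2] if len(parts) == 3 else None
        if kind not in VALID_TYPES:
            raise ValueError(f"unknown partition type {kind!r} in {entry!r}")
        # Assumption: a date-based type without a format is treated as an error.
        if kind in DATE_TYPES and fmt is None:
            raise ValueError(f"{kind} requires a date format in {entry!r}")
        fields.append(PartitionField(path, kind, fmt))
    return fields
```

Calling parse_partition_spec("time_millis:DAY:yyyy-MM-dd,type:VALUE") yields two entries, one DAY field with a date format and one VALUE field without. Note that this sketch splits on every comma, so a date format that itself contains a comma would not be handled.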
A catalog config YAML file for Iceberg can be passed in with the --icebergCatalogConfig option. The format of the catalog config file is:

    catalogImpl: io.my.CatalogImpl
    catalogName: name
    catalogOptions: # all other options are passed through in a map
      key1: value1
      key2: value2
Run the bundled jar with:

    java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--convertersConfig converters.yaml] [--icebergCatalogConfig catalog.yaml]

The bundled jar includes Hadoop dependencies for AWS, Azure, and GCP. Sample Hadoop configurations for configuring the converters can be found in the xtable-hadoop-defaults.xml file. Custom Hadoop configurations can be passed in with the --hadoopConfig [custom-hadoop-config-file] option. Values in the custom Hadoop config file override the default Hadoop configurations. For an example of a custom Hadoop config file, see hadoop.xml.

To build and run with Docker:

    docker build . -t xtable

    docker run \
      -v ./xtable/config.yml:/xtable/config.yml \
      -v ./xtable/core-site.xml:/xtable/core-site.xml \
      -v ./xtable/catalog.yml:/xtable/catalog.yml \
      xtable \
      --datasetConfig /xtable/config.yml --hadoopConfig /xtable/core-site.xml --icebergCatalogConfig /xtable/catalog.yml
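If you script conversions, the java -jar invocation above can be assembled programmatically. The following Python sketch only builds the argument list from the options documented in this README. The helper names build_sync_command and run_sync are hypothetical, and the default jar path is the one shown above, so adjust it to match your build output.

```python
import subprocess
from typing import List, Optional

# Jar path as shown in this README; it changes with the version you build.
BUNDLED_JAR = "xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar"


def build_sync_command(
    dataset_config: str,
    hadoop_config: Optional[str] = None,
    converters_config: Optional[str] = None,
    iceberg_catalog_config: Optional[str] = None,
    jar: str = BUNDLED_JAR,
) -> List[str]:
    """Return the argument list for running the bundled jar."""
    # --datasetConfig is the only option not shown in brackets above.
    cmd = ["java", "-jar", jar, "--datasetConfig", dataset_config]
    optional = [
        ("--hadoopConfig", hadoop_config),
        ("--convertersConfig", converters_config),
        ("--icebergCatalogConfig", iceberg_catalog_config),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


def run_sync(dataset_config: str, **options: Optional[str]) -> None:
    """Run the conversion; requires Java and the built jar on disk."""
    subprocess.run(build_sync_command(dataset_config, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when config paths contain spaces.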
To set up the repo in IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.
Found a bug or have a cool idea to contribute? Open a GitHub issue to get started. For more contribution guidelines and ways to stay involved, visit our community page.
Adding a new target format requires a developer to implement ConversionTarget. Once you have implemented that interface, you can integrate it into the ConversionController. If you think others may find that target useful, please raise a Pull Request to add it to the project.