Apache XTable™ (Incubating) is a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines. Currently, Apache XTable™ supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.
Apache XTable™ simplifies data lake operations by leveraging a common model for table representation. This allows users to write data in one format while still benefiting from integrations and features available in other formats. For instance, Apache XTable™ enables existing Hudi users to seamlessly work with Databricks's Photon Engine or query Iceberg tables with Snowflake. Creating transformations from one format to another is straightforward and only requires the implementation of a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.
Build the project with `mvn clean package`. Use `mvn clean package -DskipTests` to skip tests while building. Use `mvn clean test` or `mvn test` to run all unit tests. If you need to run only a specific test, you can do so with something like `mvn test -Dtest=TestDeltaSync -pl xtable-core`. Use `mvn clean verify` or `mvn verify` to run integration tests.

Note: When using Maven version 3.9 or above, Maven automatically caches the build. To ignore build caching, add the `-Dmaven.build.cache.enabled=false` parameter, for example `mvn clean package -DskipTests -Dmaven.build.cache.enabled=false`.
Use `mvn spotless:check` to find code style violations and `mvn spotless:apply` to fix them. The code style check is tied to the compile phase by default, so code style violations will lead to build failures.

To run the bundled jar, first build it with `mvn install -DskipTests`, then create a yaml file that follows the format below:
```yaml
sourceFormat: HUDI
targetFormats:
  - DELTA
  - ICEBERG
datasets:
  - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/call_center
    tableDataPath: s3://tpc-ds-datasets/1GB/hudi/call_center/data
    tableName: call_center
    namespace: my.db
  - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
    tableName: catalog_sales
    partitionSpec: cs_sold_date_sk:VALUE
  - tableBasePath: s3://hudi/multi-partition-dataset
    tableName: multi_partition_dataset
    partitionSpec: time_millis:DAY:yyyy-MM-dd,type:VALUE
  - tableBasePath: abfs://container@storage.dfs.core.windows.net/multi-partition-dataset
    tableName: multi_partition_dataset
```
- `sourceFormat` is the format of the source table that you want to convert
- `targetFormats` is a list of formats you want to create from your source table
- `tableBasePath` is the basePath of the table
- `tableDataPath` is an optional field specifying the path to the data files. If not specified, the `tableBasePath` will be used. For Iceberg source tables, you will need to specify the `/data` path.
- `namespace` is an optional field specifying the namespace of the table and will be used when syncing to a catalog.
- `partitionSpec` is a spec that allows us to infer partition values. This is only required for Hudi source tables. If the table is not partitioned, leave it blank. If it is partitioned, you can specify a spec with a comma separated list of entries in the format `path:type:format` (see the example after this list), where:
  - `path` is a dot separated path to the partition field
  - `type` describes how the partition value was generated from the column value
    - `VALUE`: an identity transform of field value to partition value
    - `YEAR`: data is partitioned by a field representing a date and year granularity is used
    - `MONTH`: same as `YEAR` but with month granularity
    - `DAY`: same as `YEAR` but with day granularity
    - `HOUR`: same as `YEAR` but with hour granularity
  - `format`: if your partition type is `YEAR`, `MONTH`, `DAY`, or `HOUR`, specify the format for the date string as it appears in your file paths
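As an illustration, a minimal dataset config for a hypothetical Hudi table partitioned by a `created_date` column at day granularity plus a `region` identity partition might look like the sketch below; the bucket, table, and column names are invented for this example and should be replaced with your own.

```yaml
# Hypothetical example only: bucket, table, and column names are placeholders.
sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/my_hudi_table
    tableName: my_hudi_table
    # day-granularity date partition rendered as yyyy-MM-dd in the file paths,
    # plus an identity partition on region
    partitionSpec: created_date:DAY:yyyy-MM-dd,region:VALUE
```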
The default table format converters can be replaced with custom implementations by specifying a converters config yaml file (passed with the `--convertersConfig` option) in the format below:

```yaml
# conversionSourceProviderClass: The class name of a table format's converter factory, where the converter is
#   used for reading from a table of this format. All user configurations, including hadoop config
#   and converter specific configuration, will be available to the factory for instantiation of the
#   converter.
# conversionTargetProviderClass: The class name of a table format's converter factory, where the converter is
#   used for writing to a table of this format.
# configuration: A map of configuration values specific to this converter.
tableFormatConverters:
  HUDI:
    conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
  DELTA:
    conversionTargetProviderClass: org.apache.xtable.delta.DeltaConversionTarget
    configuration:
      spark.master: local[2]
      spark.app.name: xtable
```
An Iceberg catalog can be configured with the `--icebergCatalogConfig` option. The format of the catalog config file is:

```yaml
catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2
```
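For instance, a catalog config pointing at an Iceberg HiveCatalog might look like the sketch below, assuming the Iceberg Hive catalog classes are on the classpath; the catalog name, metastore URI, and warehouse path are placeholders.

```yaml
# Illustrative sketch only: assumes org.apache.iceberg.hive.HiveCatalog is available on the classpath.
# The uri and warehouse values are placeholders for your own metastore and warehouse location.
catalogImpl: org.apache.iceberg.hive.HiveCatalog
catalogName: hive
catalogOptions:
  uri: thrift://localhost:9083
  warehouse: s3://my-bucket/iceberg-warehouse
```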
Run the conversion with:

```shell
java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--convertersConfig converters.yaml] [--icebergCatalogConfig catalog.yaml]
```
The bundled jar includes hadoop dependencies for AWS, Azure, and GCP. Sample hadoop configurations for configuring the converters can be found in the xtable-hadoop-defaults.xml file. Custom hadoop configurations can be passed in with the `--hadoopConfig [custom-hadoop-config-file]` option; the config in the custom hadoop config file overrides the default hadoop configurations. For an example of a custom hadoop config file, see hadoop.xml.

To run XTable in a Docker container, build the image with:

```shell
docker build . -t xtable
```

and then run it, mounting your config files into the container:
```shell
docker run \
  -v ./xtable/config.yml:/xtable/config.yml \
  -v ./xtable/core-site.xml:/xtable/core-site.xml \
  -v ./xtable/catalog.yml:/xtable/catalog.yml \
  xtable \
  --datasetConfig /xtable/config.yml --hadoopConfig /xtable/core-site.xml --icebergCatalogConfig /xtable/catalog.yml
```
For setting up the repo in IntelliJ, open the project and change the Java version to Java 11 under File -> Project Structure.
Found a bug or have a cool idea to contribute? Open a GitHub issue to get started. For more contribution guidelines and ways to stay involved, visit our community page.
Adding a new target format requires a developer to implement the ConversionTarget interface. Once you have implemented that interface, you can integrate it into the ConversionController. If you think others may find that target useful, please raise a Pull Request to add it to the project.