| commit | 05dd6f3bff8a7a55f91affb5e19bb8a19ed28e20 | |
|---|---|---|
| author | Vamshi Gudavarthi <107005799+vamshigv@users.noreply.github.com> | Fri Oct 27 12:37:27 2023 -0700 |
| committer | GitHub <noreply@github.com> | Fri Oct 27 12:37:27 2023 -0700 |
| tree | 21052e989348381b9d2895ca88fb3af0d491c856 | |
| parent | 80753e62db586517965cc1b578236c3779d0b71a | |
Move to dataframe writes (#132)
OneTable is a vision for standardizing data lake storage integration with data processing systems and query engines. By creating a common model for representing tables stored in a data lake, users can write data in one format and still leverage integrations and features that are only available in other formats. Currently this repo supports converting Hudi copy-on-write and read-optimized merge-on-read tables to Delta and Iceberg tables. This gives existing Hudi users the ability to work with Databricks' Photon Engine or query Iceberg tables with Snowflake. Creating a transformation from one format to another only requires implementing a few interfaces, so we hope to expand the number of supported transformations between source and target formats in the future.
To build the project and run tests:

- `mvn clean package` builds the project. Use `mvn clean package -DskipTests` to skip tests while building.
- `mvn clean test` or `mvn test` runs all unit tests. If you need to run only a specific test you can do this with something like `mvn test -Dtest=TestDeltaSync -pl core`.
- `mvn clean verify` or `mvn verify` runs integration tests.
- `mvn spotless:check` finds code style violations and `mvn spotless:apply` fixes them. The code style check is tied to the compile phase by default, so code style violations will lead to build failures.
- `mvn install -DskipTests` builds and installs the jars into your local repository without running tests.

A sync is configured with a dataset config YAML like the following:

```yaml
sourceFormat: HUDI
tableFormats:
  - DELTA
  - ICEBERG
datasets:
  - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/call_center
    tableName: call_center
  - tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
    tableName: catalog_sales
    partitionSpec: cs_sold_date_sk:VALUE
  - tableBasePath: s3://hudi/multi-partition-dataset
    tableName: multi_partition_dataset
    partitionSpec: time_millis:DAY:yyyy-MM-dd,type:VALUE
```
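To make the `partitionSpec` entries in the example concrete (each field is documented below), here is a minimal, self-contained Java sketch, not code from this repo, showing how a `DAY`-typed spec such as `time_millis:DAY:yyyy-MM-dd` would render an epoch-millis column value as the day-granularity string found in file paths:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Illustrative only: shows the mapping implied by a spec like
// "time_millis:DAY:yyyy-MM-dd,type:VALUE".
public class PartitionSpecExample {
    public static void main(String[] args) {
        // Example value of the time_millis column: 2023-10-27T00:00:00Z.
        long timeMillis = 1698364800000L;

        // DAY type with format yyyy-MM-dd: the value is rendered at day
        // granularity using the given date format (UTC is an assumption here).
        SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy-MM-dd");
        dayFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(dayFormat.format(new Date(timeMillis))); // prints 2023-10-27

        // A VALUE type (e.g. "type:VALUE") is an identity transform: the
        // column value itself is used as the partition value.
    }
}
```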
- `tableFormats` is a list of the formats you want to create from your source Hudi tables.
- `tableBasePath` is the basePath of the Hudi table.
- `partitionSpec` is a spec that allows us to infer partition values. If the table is not partitioned, leave it blank. If it is partitioned, you can specify a spec as a comma-separated list with format `path:type:format`:
  - `path` is a dot-separated path to the partition field.
  - `type` describes how the partition value was generated from the column value:
    - `VALUE`: an identity transform of field value to partition value
    - `YEAR`: data is partitioned by a field representing a date and year granularity is used
    - `MONTH`: same as `YEAR` but with month granularity
    - `DAY`: same as `YEAR` but with day granularity
    - `HOUR`: same as `YEAR` but with hour granularity
  - `format`: if your partition type is `YEAR`, `MONTH`, `DAY`, or `HOUR`, specify the format for the date string as it appears in your file paths.

Table format clients can be configured with a clients YAML file like the following:

```yaml
# sourceClientProviderClass: The class name of a table format's client factory, where the client is
#   used for reading from a table of this format. All user configurations, including hadoop config
#   and client specific configuration, will be available to the factory for instantiation of the
#   client.
# targetClientProviderClass: The class name of a table format's client factory, where the client is
#   used for writing to a table of this format.
# configuration: A map of configuration values specific to this client.
tableFormatsClients:
  HUDI:
    sourceClientProviderClass: io.onetable.hudi.HudiSourceClientProvider
  DELTA:
    targetClientProviderClass: io.onetable.delta.DeltaClient
    configuration:
      spark.master: local[2]
      spark.app.name: onetableclient
```
To run a sync:

```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--clientsConfig clients.yaml]
```

The bundled jar includes hadoop dependencies for AWS and GCP. Authentication for AWS is done with `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`. To override this setting, specify a different implementation with the `--awsCredentialsProvider` option.

For setting up the repo in IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.
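A custom provider for the `--awsCredentialsProvider` option above would implement the AWS SDK's `com.amazonaws.auth.AWSCredentialsProvider` interface. Here is a minimal sketch; the class name and environment variables are illustrative, and it is an assumption that the option expects a fully qualified class name available on the jar's classpath:

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;

// Illustrative custom credentials provider; replace the hypothetical
// environment-variable lookup with your own credential source.
public class MyCredentialsProvider implements AWSCredentialsProvider {
    @Override
    public AWSCredentials getCredentials() {
        return new BasicAWSCredentials(
                System.getenv("MY_ACCESS_KEY_ID"),
                System.getenv("MY_SECRET_ACCESS_KEY"));
    }

    @Override
    public void refresh() {
        // No-op: credentials here are statically sourced.
    }
}
```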
Have you found a bug, or do you have a cool idea you want to contribute to the project? Please file a GitHub issue here.
Adding a new target format requires a developer to implement the `TargetClient` interface. Once you have implemented that interface, you can integrate it into the `OneTableClient`. If you think others may find that target useful, please raise a Pull Request to add it to the project.
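As a rough, self-contained sketch of the pattern (the interface and every name below are stand-ins invented for this example; the project's real `TargetClient` interface defines the actual methods and the internal table model they receive):

```java
import java.util.List;

// Stand-in interface invented for illustration; the repo's real TargetClient
// declares the actual methods a target format must implement.
interface IllustrativeTargetClient {
    void syncSchema(String canonicalSchema);   // translate the canonical schema
    void syncFiles(List<String> dataFiles);    // register the current data files
    void completeSync();                       // commit the target-format metadata
}

// The general shape a new target implementation takes.
public class ExampleTargetClient implements IllustrativeTargetClient {
    @Override
    public void syncSchema(String canonicalSchema) {
        System.out.println("Mapping canonical schema to target format: " + canonicalSchema);
    }

    @Override
    public void syncFiles(List<String> dataFiles) {
        System.out.println("Recording " + dataFiles.size() + " data files in target metadata");
    }

    @Override
    public void completeSync() {
        System.out.println("Committing the target-format transaction");
    }
}
```

Whatever the exact methods turn out to be, the integration point is the same: once your class implements `TargetClient`, the `OneTableClient` can drive it just like the built-in Delta and Iceberg targets.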