import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This document walks through the steps to register a OneTable synced table on Hive Metastore (HMS).
Create the `utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the Installation page.

Create `my_config.yaml` in the cloned OneTable directory.
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'targetFormat: HUDI', value: 'hudi', },
{ label: 'targetFormat: DELTA', value: 'delta', },
{ label: 'targetFormat: ICEBERG', value: 'iceberg', },
]}
>
<TabItem value="hudi">

```yaml
sourceFormat: DELTA|ICEBERG # choose only one
targetFormats:
  - HUDI
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name
```

</TabItem>
<TabItem value="delta">

```yaml
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
```

</TabItem>
<TabItem value="iceberg">

```yaml
sourceFormat: HUDI|DELTA # choose only one
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat
```

</TabItem>
</Tabs>
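For a concrete sense of how the placeholders fit together, here is a hypothetical filled-in config for a local Hudi source synced to Iceberg; the table name, base path and partition field below are made up for illustration only.

```shell
# Hypothetical example: write my_config.yaml for syncing a local Hudi table to Iceberg.
cat > my_config.yaml <<'EOF'
sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: file:///tmp/hudi/people
    tableName: people
    partitionSpec: city:VALUE
EOF
```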
:::note Note:
Replace with appropriate values for the `sourceFormat`, `tableBasePath` and `tableName` fields.
Replace `file:///path/to/source/data` with the appropriate source data path if you have your source table in S3/GCS/ADLS, i.e.
`s3://path/to/source/data`, `gs://path/to/source/data` or `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

From your terminal under the cloned OneTable directory, run the sync process using the below command.
```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
```
:::tip Note:
At this point, if you check your bucket path, you will see a `.hoodie`, `_delta_log`, or `metadata` directory with relevant metadata files that help query engines interpret the data as a Hudi/Delta/Iceberg table.
:::
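If you want to double-check before moving on, you can simply list the table base path; the path below is the same placeholder used in the config above.

```shell
# After a successful sync you should see .hoodie (Hudi), _delta_log (Delta)
# and/or metadata (Iceberg) directories alongside the data files.
ls -a /path/to/source/data

# For tables on S3, an equivalent check with the AWS CLI would be, e.g.:
# aws s3 ls s3://path/to/source/data/
```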
Now you need to register the OneTable synced target table in Hive Metastore.
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'targetFormat: HUDI', value: 'hudi', },
{ label: 'targetFormat: DELTA', value: 'delta', },
{ label: 'targetFormat: ICEBERG', value: 'iceberg', },
]}
>
<TabItem value="hudi">

A Hudi table can be synced directly to the Hive Metastore using the Hive Sync Tool and subsequently queried by different query engines. For more information on the Hive Sync Tool, check the Hudi Hive Metastore docs.
```shell
cd $HUDI_HOME/hudi-sync/hudi-hive-sync

./run_sync_tool.sh \
  --jdbc-url <jdbc_url> \
  --user <username> \
  --pass <password> \
  --partitioned-by <partition_field> \
  --base-path <'/path/to/synced/hudi/table'> \
  --database <database_name> \
  --table <tableName>
```
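As an illustration, a run against a Hive Metastore fronted by a local HiveServer2 could look like the following; every value here is hypothetical and should be replaced with your own.

```shell
# Hypothetical example: register a synced Hudi table located at /tmp/hudi/people
# into database hudi_db, using a HiveServer2 JDBC endpoint on localhost.
cd $HUDI_HOME/hudi-sync/hudi-hive-sync

./run_sync_tool.sh \
  --jdbc-url jdbc:hive2://localhost:10000 \
  --user hive \
  --pass hive \
  --partitioned-by city \
  --base-path /tmp/hudi/people \
  --database hudi_db \
  --table people
```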
:::note Note:
Replace `file:///path/to/source/data` with the appropriate source data path if you have your source table in S3/GCS/ADLS, i.e.
`s3://path/to/source/data`, `gs://path/to/source/data` or `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

Now you will be able to query the created table directly as a Hudi table from a Spark session or using query engines like Presto and/or Trino. Check out the guides for querying the OneTable synced tables on Presto or Trino query engines for more information.
```sql
SELECT * FROM <database_name>.<table_name>;
```

</TabItem>
<TabItem value="delta">

```shell
spark-sql --packages io.delta:delta-core_2.12:2.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.sql.catalogImplementation=hive"
```
In the spark-sql shell, you need to create a schema and table like below.
```sql
CREATE SCHEMA delta_db;

CREATE TABLE delta_db.<table_name>
USING DELTA
LOCATION '/path/to/synced/delta/table';
```
:::note Note:
Replace `file:///path/to/source/data` with the appropriate source data path if you have your source table in S3/GCS/ADLS, i.e.
`s3://path/to/source/data`, `gs://path/to/source/data` or `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

Now you will be able to query the created table directly as a Delta table from the same Spark session or using query engines like Presto and/or Trino. Check out the guides for querying the OneTable synced tables on Presto or Trino query engines for more information.
```sql
SELECT * FROM delta_db.<table_name>;
```

</TabItem>
<TabItem value="iceberg">

```shell
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.1 \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" \
  --conf "spark.sql.catalog.spark_catalog.type=hive" \
  --conf "spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.hive_prod.type=hive"
```
In the spark-sql shell, you need to create a schema and register the table like below.
```sql
CREATE SCHEMA iceberg_db;

CALL hive_prod.system.register_table(
  table => 'hive_prod.iceberg_db.<table_name>',
  metadata_file => '/path/to/synced/iceberg/table/metadata/<VERSION>.metadata.json'
);
```
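If you are not sure which `<VERSION>.metadata.json` is the latest, one way is to list the metadata directory before running the `CALL` above; the path is the same placeholder used in the command.

```shell
# List Iceberg metadata files newest-first and pick the most recent one;
# pass that file to the register_table call.
ls -t /path/to/synced/iceberg/table/metadata/*.metadata.json | head -1
```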
:::note Note:
Replace the dataset path with the appropriate data path if you have your table in S3/GCS/ADLS, i.e.
`s3://path/to/source/data`, `gs://path/to/source/data` or `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

Now you will be able to query the created table directly as an Iceberg table from the same Spark session or using query engines like Presto and/or Trino. Check out the guides for querying the OneTable synced tables on Presto or Trino query engines for more information.
```sql
SELECT * FROM iceberg_db.<table_name>;
```

</TabItem>
</Tabs>
In this guide we saw how to:

1. sync a source table to create the target table format's metadata using OneTable
2. register the OneTable synced target table in Hive Metastore
3. query the target table using Spark, Presto or Trino