---
sidebar_position: 1
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Creating your first interoperable table
:::danger Important
Using Onetable to sync your source tables to different target formats involves running sync on your
current dataset using a bundled jar. You can create this bundled jar by following the instructions
on the [Installation page](https://onetable.dev/docs/setup). Read through Onetable's
[GitHub page](https://github.com/onetable-io/onetable#building-the-project-and-running-tests) for more information.
:::
In this tutorial, we will look at how to use Onetable to add interoperability between table formats.
For example, you can expose a table ingested with Hudi as an Iceberg and/or Delta Lake table without
copying or moving the underlying data files used for that table, while maintaining a similar commit
history to enable proper point-in-time queries.
## Pre-requisites
1. A compute instance where you can run Apache Spark. This can be your local machine, Docker,
or a distributed service like Amazon EMR, Google Cloud Dataproc, etc.
2. Clone the Onetable [repository](https://github.com/onetable-io/onetable) and create the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](https://onetable.dev/docs/setup)
3. Optional: Set up access to write to and/or read from distributed storage services like:
* Amazon S3 by following the steps
[here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to install AWS CLI v2
and set up access credentials by following the steps
[here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html)
* Google Cloud Storage by following the steps
[here](https://cloud.google.com/iam/docs/keys-create-delete#creating)
For the purpose of this tutorial, we will walk through the steps to use Onetable locally.
## Steps
### Initialize a pyspark shell
:::tip Note:
You can choose to follow this example with `spark-sql` or `spark-shell` as well.
:::
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'Hudi', value: 'hudi', },
{ label: 'Delta', value: 'delta', },
{ label: 'Iceberg', value: 'iceberg', },
]}
>
<TabItem value="hudi">
```shell md title="shell"
pyspark \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```
</TabItem>
<TabItem value="delta">
```shell md title="shell"
pyspark \
--packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
</TabItem>
<TabItem value="iceberg">
```shell md title="shell"
pyspark \
--packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.1 \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog"
```
</TabItem>
</Tabs>
:::tip Note:
If you instead want to write your table to Amazon S3 or Google Cloud Storage,
your Spark session will need additional configurations:
* For Amazon S3, follow the configurations specified [here](https://hudi.apache.org/docs/s3_hoodie/)
* For Google Cloud Storage, follow the configurations specified [here](https://hudi.apache.org/docs/gcs_hoodie)
:::
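For example, here is a minimal sketch of launching the Hudi pyspark shell from the previous step with S3 support added. The `hadoop-aws` version here is an assumption and should match the Hadoop version bundled with your Spark distribution; the linked Hudi docs cover the full set of options.
```shell md title="shell"
# Sketch only: hadoop-aws 3.3.4 is an assumed version; match it to your Spark's Hadoop build.
pyspark \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.4 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
--conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
```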
### Create dataset
Write a source table locally.
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'Hudi', value: 'hudi', },
{ label: 'Delta', value: 'delta', },
{ label: 'Iceberg', value: 'iceberg', },
]}
>
<TabItem value="hudi">
```python md title="python"
from pyspark.sql.types import *
# define the table name and a local base path for the dataset
table_name = "people"
local_base_path = "/tmp/hudi-dataset"
records = [
(1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
(2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
(3, 'Michael', 35, 'ORD', '2023-09-28 00:00:00'),
(4, 'Andrew', 40, 'NYC', '2023-10-28 00:00:00'),
(5, 'Bob', 28, 'SEA', '2023-09-23 00:00:00'),
(6, 'Charlie', 31, 'DFW', '2023-08-29 00:00:00')
]
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("city", StringType(), True),
StructField("create_ts", StringType(), True)
])
df = spark.createDataFrame(records, schema)
hudi_options = {
'hoodie.table.name': table_name,
'hoodie.datasource.write.partitionpath.field': 'city',
'hoodie.datasource.write.hive_style_partitioning': 'true'
}
(
df.write
.format("hudi")
.options(**hudi_options)
.save(f"{local_base_path}/{table_name}")
)
```
</TabItem>
<TabItem value="delta">
```python md title="python"
from pyspark.sql.types import *
# define the table name and a local base path for the dataset
table_name = "people"
local_base_path = "/tmp/delta-dataset"
records = [
(1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
(2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
(3, 'Michael', 35, 'ORD', '2023-09-28 00:00:00'),
(4, 'Andrew', 40, 'NYC', '2023-10-28 00:00:00'),
(5, 'Bob', 28, 'SEA', '2023-09-23 00:00:00'),
(6, 'Charlie', 31, 'DFW', '2023-08-29 00:00:00')
]
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("city", StringType(), True),
StructField("create_ts", StringType(), True)
])
df = spark.createDataFrame(records, schema)
(
df.write
.format("delta")
.partitionBy("city")
.save(f"{local_base_path}/{table_name}")
)
```
</TabItem>
<TabItem value="iceberg">
```python md title="python"
from pyspark.sql.types import *
# define the table name and a local base path for the dataset
table_name = "people"
local_base_path = "/tmp/iceberg-dataset"
records = [
(1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
(2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
(3, 'Michael', 35, 'ORD', '2023-09-28 00:00:00'),
(4, 'Andrew', 40, 'NYC', '2023-10-28 00:00:00'),
(5, 'Bob', 28, 'SEA', '2023-09-23 00:00:00'),
(6, 'Charlie', 31, 'DFW', '2023-08-29 00:00:00')
]
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("city", StringType(), True),
StructField("create_ts", StringType(), True)
])
df = spark.createDataFrame(records, schema)
(
df.write
.format("iceberg")
.partitionBy("city")
.save(f"{local_base_path}/{table_name}")
)
```
</TabItem>
</Tabs>
### Running sync
Create `my_config.yaml` in the cloned Onetable directory.
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'Hudi', value: 'hudi', },
{ label: 'Delta', value: 'delta', },
{ label: 'Iceberg', value: 'iceberg', },
]}
>
<TabItem value="hudi">
```yaml md title="yaml"
sourceFormat: HUDI
targetFormats:
- DELTA
- ICEBERG
datasets:
-
tableBasePath: file:///tmp/hudi-dataset/people
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
<TabItem value="delta">
```yaml md title="yaml"
sourceFormat: DELTA
targetFormats:
- HUDI
- ICEBERG
datasets:
-
tableBasePath: file:///tmp/delta-dataset/people
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
<TabItem value="iceberg">
```yaml md title="yaml"
sourceFormat: ICEBERG
targetFormats:
- HUDI
- DELTA
datasets:
-
tableBasePath: file:///tmp/iceberg-dataset/people
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
</Tabs>
**Optional:** If your source table exists in Amazon S3 or Google Cloud Storage,
you should use a `yaml` file similar to the one below.
<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'Hudi', value: 'hudi', },
{ label: 'Delta', value: 'delta', },
{ label: 'Iceberg', value: 'iceberg', },
]}
>
<TabItem value="hudi">
```yaml md title="yaml"
sourceFormat: HUDI
targetFormats:
- DELTA
- ICEBERG
datasets:
-
tableBasePath: s3://path/to/hudi-data # replace this with gs://path/to/hudi_data if your data is in GCS.
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
<TabItem value="delta">
```yaml md title="yaml"
sourceFormat: DELTA
targetFormats:
- HUDI
- ICEBERG
datasets:
-
tableBasePath: s3://path/to/delta-data # replace this with gs://path/to/delta_data if your data is in GCS.
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
<TabItem value="iceberg">
```yaml md title="yaml"
sourceFormat: ICEBERG
targetFormats:
- HUDI
- DELTA
datasets:
-
tableBasePath: s3://path/to/iceberg-data # replace this with gs://path/to/iceberg_data if your data is in GCS.
tableName: people
partitionSpec: city:VALUE
```
</TabItem>
</Tabs>
:::tip Note:
Authentication for AWS is done with `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`.
To override this setting, specify a different implementation with the `--awsCredentialsProvider` option.
Authentication for GCP requires service account credentials to be exported, i.e.
`export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_key.json`
:::
In your terminal, under the cloned Onetable directory, run the command below.
```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar -datasetConfig my_config.yaml
```
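If you need to override the default AWS credentials provider mentioned in the note above, pass the fully qualified class name of your implementation. Here is a minimal sketch; the provider class shown is only an illustration, assuming your credentials live in a named AWS profile.
```shell md title="shell"
# Sketch only: the provider class below is an example; substitute your own implementation.
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar \
-datasetConfig my_config.yaml \
--awsCredentialsProvider com.amazonaws.auth.profile.ProfileCredentialsProvider
```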
**Optional:**
At this point, if you check your local path, you will be able to see the necessary metadata files that contain the schema,
commit history, partitions, and column stats that help query engines interpret the data in the target table formats.
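For instance, assuming the Hudi source table created earlier in this tutorial, a quick listing of the table's base path should show the target-format metadata sitting alongside the source table's own metadata:
```shell md title="shell"
# Adjust the path if you started from a Delta or Iceberg source table.
ls -a /tmp/hudi-dataset/people
# Expect the Hudi metadata directory (.hoodie) alongside the newly created
# _delta_log/ (Delta) and metadata/ (Iceberg) directories, plus the city=... partition folders.
```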
## Conclusion
In this tutorial, we saw how to create a source table and use Onetable to create the metadata files
that can be used to query the source table in different target table formats.
## Next steps
Go through the [Catalog Integration guides](https://onetable.dev/docs/catalogs-index) to register the Onetable synced tables
in different data catalogs.