---
sidebar_position: 3
---
# Unity Catalog
This document walks through the steps to register a Onetable-synced Delta table in Unity Catalog on Databricks.
## Prerequisites
1. Source table(s) (Hudi/Iceberg) already written to external storage locations like S3/GCS.
If you don't have a source table written in S3/GCS,
you can follow the steps in [this](https://onetable.dev/docs/hms) tutorial to set it up.
2. Set up a connection to external storage locations from Databricks.
* Follow the steps outlined [here](https://docs.databricks.com/en/storage/amazon-s3.html) for Amazon S3
* Follow the steps outlined [here](https://docs.databricks.com/en/storage/gcs.html) for Google Cloud Storage
3. Create a Unity Catalog metastore in Databricks as outlined [here](https://docs.gcp.databricks.com/data-governance/unity-catalog/create-metastore.html#create-a-unity-catalog-metastore).
4. Create an external location in Databricks as outlined [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-location.html).
5. Clone the Onetable [repository](https://github.com/onetable-io/onetable) and build the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](https://onetable.dev/docs/setup) (see the sketch after this list).
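As a minimal sketch, assuming a working JDK and Maven toolchain on your machine (the Installation page is the authoritative reference for build steps), cloning and building the bundled jar looks like:
```shell md title="shell"
# clone the repository and build the bundled utilities jar
# (assumes mvn and a compatible JDK are on your PATH)
git clone https://github.com/onetable-io/onetable.git
cd onetable
mvn install -DskipTests
```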
## Steps
### Running sync
Create `my_config.yaml` in the cloned Onetable directory.
```yaml md title="yaml"
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE
```
:::tip Note:
Replace `s3://path/to/source/data` with `gs://path/to/source/data` if your source table is in GCS,
and substitute appropriate values for the `sourceFormat`, `tableName`, and `partitionSpec` fields.
:::
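For illustration, a fully filled-in config for a hypothetical Hudi table named `trips`, stored under a made-up bucket and partitioned by a `city` field, could look like this:
```yaml md title="yaml"
sourceFormat: HUDI
targetFormats:
  - DELTA
datasets:
  -
    # hypothetical bucket, table name, and partition field
    tableBasePath: s3://my-bucket/hudi-tables/trips
    tableName: trips
    partitionSpec: city:VALUE
```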
From your terminal, under the cloned Onetable directory, run the sync process using the command below.
```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar -datasetConfig my_config.yaml
```
:::tip Note:
At this point, if you check your bucket path, you will see a `_delta_log` directory containing
`00000000000000000000.json`. This commit log is what allows query engines to interpret the source table as a Delta table.
:::
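To verify, you can list the transaction log directory yourself. A quick sketch with the AWS CLI (assuming it is installed and configured; use `gsutil ls` for a GCS path):
```shell md title="shell"
# list the Delta transaction log files created by the sync
aws s3 ls s3://path/to/source/data/_delta_log/
```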
### Register the target table in Unity Catalog
In your Databricks workspace, open the SQL editor and run the following queries.
```sql md title="SQL"
CREATE CATALOG onetable;
CREATE SCHEMA onetable.synced_delta_schema;
CREATE TABLE onetable.synced_delta_schema.<table_name>
USING DELTA
LOCATION 's3://path/to/source/data';
```
:::tip Note:
Replace `s3://path/to/source/data` with `gs://path/to/source/data` if your source table is in GCS,
and replace `<table_name>` with the name of your table.
:::
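Continuing the hypothetical `trips` example from the config above, the `CREATE TABLE` statement would be filled in as:
```sql md title="SQL"
-- hypothetical table name and location matching the earlier config
CREATE TABLE onetable.synced_delta_schema.trips
USING DELTA
LOCATION 's3://my-bucket/hudi-tables/trips';
```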
### Validating the results
You can now see the synced Delta table in **Unity Catalog** under **Catalog** as `<table_name>` in the
`synced_delta_schema` schema, and you can also query the table in the SQL editor:
```sql md title="SQL"
SELECT * FROM onetable.synced_delta_schema.<table_name>;
```
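To inspect the table's metadata, such as its storage location and provider, you can also run a `DESCRIBE` statement:
```sql md title="SQL"
-- prints the column schema plus table details such as Location and Provider (delta)
DESCRIBE TABLE EXTENDED onetable.synced_delta_schema.<table_name>;
```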
## Conclusion
In this guide, we saw how to:
1. sync a source table with Onetable to generate metadata in the desired target table format
2. catalog the data in Delta format in Unity Catalog on Databricks
3. query the Delta table using the Databricks SQL editor