import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide offers a short tutorial on how to query Apache Iceberg and Apache Hudi tables in Microsoft Fabric using the translation capabilities of OneTable. The tutorial is intended solely for demonstration purposes and to verify that OneTable's output is compatible with Fabric. It relies on features currently[^1] available in Fabric, such as Shortcuts.
Microsoft Fabric is a unified platform for data analytics. It offers a wide range of capabilities, including data engineering, data science, real-time analytics, and business intelligence. At its core is the data lake, OneLake, which stores tabular data in Delta Parquet format so that a single copy of the data can be used across all Fabric analytical services such as T-SQL, Spark, and Power BI. OneLake is designed to eliminate data duplication and the need for data migration or transfers. Whether a data engineer adds data to a table using Spark or a SQL developer analyzes data with T-SQL in a data warehouse, both work on the same copy of the data stored in OneLake. Additionally, OneLake can integrate existing storage accounts via its 'Shortcut' feature, which acts as a link to data in other file systems.
The objective of the following tutorial is to translate an Iceberg or Hudi table stored in an ADLS storage account into Delta Lake format using OneTable. After translation, the table can be queried from various Fabric engines, including T-SQL, Spark, and Power BI.
This step creates a source table in Iceberg or Hudi format in the ADLS storage account. The primary actions for creating a `people` table in Iceberg or Hudi format are documented in the Creating your first interoperable table - Create Dataset tutorial section. However, instead of creating the `people` table locally, configure `local_base_path` to point to the ADLS storage account.
Assuming the container name is `mycontainer` and the storage account name is `mystorageaccount`, the `local_base_path` should be set to `abfs://mycontainer@mystorageaccount.dfs.core.windows.net/`.
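For illustration, here is a minimal PySpark sketch of this step for a Hudi source table. It is not the exact code from the Create Dataset tutorial: the schema, sample rows, and Hudi options are illustrative, the Hudi Spark bundle is assumed to be on the classpath, and the ADLS authentication settings shown below still need to be applied to the Spark session.

```python
# A minimal sketch of creating the `people` source table in Hudi format on ADLS.
# Assumptions: Hudi Spark bundle on the classpath; ADLS OAuth settings from the
# Spark configuration shown below applied to this session; illustrative schema.
from pyspark.sql import SparkSession

# Matches the tableBasePath used in the OneTable dataset configuration later on.
BASE_PATH = "abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people"

spark = (
    SparkSession.builder
    .appName("create-people-hudi")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "John", 25, "2023-11-01"), (2, "Jane", 30, "2023-11-02")],
    ["id", "name", "age", "created_at"],
)

hudi_options = {
    "hoodie.table.name": "people",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "created_at",
}

# Write the Hudi table directly to the ADLS base path instead of a local path.
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(BASE_PATH)
```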
An example Spark configuration for authenticating to the ADLS storage account is as follows:
```
spark.hadoop.fs.azure.account.auth.type=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/<tenant-id>/oauth2/token
spark.hadoop.fs.azure.account.oauth2.client.id=<client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret=<client-secret>
```
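If you prefer to set these options in code rather than in a configuration file, the same Hadoop settings can be passed when building the Spark session. A minimal sketch, with `<tenant-id>`, `<client-id>`, and `<client-secret>` as placeholders you must replace:

```python
# A sketch of applying the same ADLS OAuth settings programmatically.
# <tenant-id>, <client-id>, and <client-secret> are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("onetable-adls")
    .config("spark.hadoop.fs.azure.account.auth.type", "OAuth")
    .config("spark.hadoop.fs.azure.account.oauth.provider.type",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    .config("spark.hadoop.fs.azure.account.oauth2.client.endpoint",
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
    .config("spark.hadoop.fs.azure.account.oauth2.client.id", "<client-id>")
    .config("spark.hadoop.fs.azure.account.oauth2.client.secret", "<client-secret>")
    .getOrCreate()
)
```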
This step translates the `people` table from Iceberg or Hudi format to Delta Lake format using OneTable. The primary actions for the translation are documented in the Creating your first interoperable table - Running Sync tutorial section. However, since the table is in ADLS, you need to update the dataset paths and the Hadoop configuration.
For example, if the source table is in Iceberg format, the configuration file should look like:
```yaml
sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  - tableBasePath: abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people
    tableDataPath: abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people/data
    tableName: people
# In the configuration above `default` refers to the default spark database.
```
An example Hadoop configuration for authenticating to the ADLS storage account is as follows:
```xml
<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/[tenant-id]/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>[client-id]</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>[client-secret]</value>
  </property>
</configuration>
```
With the configuration files in place, run the sync command in the project root directory:

```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml --hadoopConfig hadoop.xml
```
Running the above command will translate the `people` table from Iceberg or Hudi format to Delta Lake format. To validate the translation, you can list the directories in the table's data path. For this tutorial, a `_delta_log` directory should be present in the `people` table's data path at `default/people/data/_delta_log`.
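As an alternative check, the translated table can be read back as a Delta table from Spark. A minimal sketch, assuming the Delta Lake Spark package is on the classpath and the Spark session is configured with the ADLS OAuth settings shown earlier:

```python
# A validation sketch: read the translated table as Delta directly from ADLS.
# Assumes the Delta Lake Spark package is available and the ADLS OAuth settings
# shown earlier are applied to this Spark session.
delta_df = spark.read.format("delta").load(
    "abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people/data"
)
delta_df.show()
```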
This step creates a shortcut in Fabric to the Delta Lake table created in the previous step. The shortcut is a link to the table's data path in the ADLS storage account.
1. In the lakehouse, go to the Tables section and select New Shortcut.
2. In the New Shortcut dialog, select Azure Data Lake Storage Gen2 as the External Source.
3. In the New Shortcut dialog, enter the Connection settings and authorize Fabric to access the storage account.
4. In the New Shortcut dialog, enter the Shortcut settings and click Create.

[^1]: Updated on 2023-11-16
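Once the shortcut appears under Tables, the table can be queried from the Fabric engines. A minimal sketch of querying it from a Fabric Spark notebook, assuming the lakehouse containing the shortcut is attached as the notebook's default lakehouse and the shortcut is named `people`:

```python
# A sketch of querying the shortcut from a Fabric notebook (Spark).
# Assumes the lakehouse holding the shortcut is the notebook's default lakehouse
# and the shortcut is named `people`.
df = spark.sql("SELECT * FROM people LIMIT 10")
df.show()
```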