---
title: "Getting Started"
weight: 200
url: getting-started
aliases:
    - "spark/getting-started"
menu:
    main:
        parent: Spark
        weight: 0
---
<!--
 - Licensed to the Apache Software Foundation (ASF) under one or more
 - contributor license agreements.  See the NOTICE file distributed with
 - this work for additional information regarding copyright ownership.
 - The ASF licenses this file to You under the Apache License, Version 2.0
 - (the "License"); you may not use this file except in compliance with
 - the License.  You may obtain a copy of the License at
 -
 -   http://www.apache.org/licenses/LICENSE-2.0
 -
 - Unless required by applicable law or agreed to in writing, software
 - distributed under the License is distributed on an "AS IS" BASIS,
 - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 - See the License for the specific language governing permissions and
 - limitations under the License.
-->

# Getting Started

The latest version of Iceberg is [{{% icebergVersion %}}](../../../releases).

Spark is currently the most feature-rich compute engine for Iceberg operations.
We recommend getting started with Spark to understand Iceberg concepts and features through examples.
Documentation for using Iceberg with other compute engines is available under the **Engines** tab.

## Using Iceberg in Spark 3

To use Iceberg in a Spark shell, use the `--packages` option:

```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:{{% icebergVersion %}}
```

{{< hint info >}}
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.2_2.12` Jar][spark-runtime-jar] to Spark's `jars` folder.
{{< /hint >}}

[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar

### Adding catalogs

Iceberg comes with [catalogs](../spark-configuration#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`.

This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:{{% icebergVersion %}} \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```
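
The same catalog properties can also be set when building the Spark session in code. A minimal Scala sketch, assuming the same catalog names and warehouse location as the command above (the application name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the --conf flags above expressed as session builder properties.
val warehousePath = sys.props("user.dir") + "/warehouse"  // same location as $PWD/warehouse above

val spark = SparkSession.builder()
  .appName("iceberg-getting-started")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", warehousePath)
  .getOrCreate()
```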

### Creating a table

To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl#create-table) command:

```sql
-- local is the path-based catalog defined above
CREATE TABLE local.db.table (id bigint, data string) USING iceberg
```
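
The same statement can be issued from code through `spark.sql(...)`. A short sketch, using the `spark` session from above:

```scala
// Sketch: run the same CREATE TABLE statement through the session
spark.sql("CREATE TABLE local.db.table (id bigint, data string) USING iceberg")
```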

Iceberg catalogs support the full range of SQL DDL commands, including the following (a short sketch follows the list):

* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl#create-table)
* [`CREATE TABLE ... AS SELECT`](../spark-ddl#create-table--as-select)
* [`ALTER TABLE`](../spark-ddl#alter-table)
* [`DROP TABLE`](../spark-ddl#drop-table)
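
For example, a table can be created with a partition spec and evolved later with `ALTER TABLE`. A brief sketch via `spark.sql`; the table and column names are illustrative:

```scala
// Sketch: create a table partitioned by a column, then add a column afterwards.
spark.sql("""
  CREATE TABLE local.db.events (id bigint, data string, category string)
  USING iceberg
  PARTITIONED BY (category)
""")

spark.sql("ALTER TABLE local.db.events ADD COLUMNS (ts timestamp)")
```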

### Writing

Once your table is created, insert data using [`INSERT INTO`](../spark-writes#insert-into):

```sql
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
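
The second statement assumes a `source` table is already available. A minimal sketch that registers one as a temporary view (the rows are illustrative; only single-character values pass the `length(data) = 1` filter):

```scala
// Sketch: a small "source" temp view so the INSERT ... SELECT above has input
import spark.implicits._

val source = Seq((4L, "d"), (5L, "ee"), (6L, "f")).toDF("id", "data")
source.createOrReplaceTempView("source")
```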

Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes#merge-into) and [`DELETE FROM`](../spark-writes#delete-from):

```sql
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
WHEN NOT MATCHED THEN INSERT *
```
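
`DELETE FROM` works the same way with an ordinary SQL predicate. A short sketch via `spark.sql`; the filter is illustrative:

```scala
// Sketch: row-level delete with an illustrative predicate
spark.sql("DELETE FROM local.db.table WHERE data = 'c'")
```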

Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes#writing-with-dataframes):

```scala
spark.table("source").select("id", "data")
    .writeTo("local.db.table").append()
```
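
The v2 API can also replace a table's contents rather than append to it. A brief sketch, assuming the same source and target; `createOrReplace` rewrites the table from the query result:

```scala
// Sketch: replace the target table's data (and schema) with the query result
spark.table("source").select("id", "data")
    .writeTo("local.db.table").createOrReplace()
```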

The old `write` API is supported, but _not_ recommended.

### Reading

To read with SQL, use an Iceberg table name in a `SELECT` query:

```sql
SELECT count(1) as count, data
FROM local.db.table
GROUP BY data
```

SQL is also the recommended way to [inspect tables](../spark-queries#inspecting-tables). To view all of the snapshots in a table, use the `snapshots` metadata table:

```sql
SELECT * FROM local.db.table.snapshots
```
```
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | ... |
| ...                     | ...            | ...       | ...       | ...                                                | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
```
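
Other metadata tables can be queried the same way. For instance, a short sketch reading the `files` metadata table through `spark.sql` (column names as documented for Iceberg metadata tables):

```scala
// Sketch: list the data files backing the table's current state
spark.sql("SELECT file_path, record_count FROM local.db.table.files").show(false)
```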

[DataFrame reads](../spark-queries#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`:

```scala
val df = spark.table("local.db.table")
df.count()
```

### Next steps

Next, you can learn more about Iceberg tables in Spark:

* [DDL commands](../spark-ddl): `CREATE`, `ALTER`, and `DROP`
* [Querying data](../spark-queries): `SELECT` queries and metadata tables
* [Writing data](../spark-writes): `INSERT INTO` and `MERGE INTO`
* [Maintaining tables](../spark-procedures) with stored procedures