---
title: "Getting Started"
weight: 200
url: getting-started
aliases:
- "spark/getting-started"
menu:
main:
parent: Spark
weight: 0
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->
# Getting Started
The latest version of Iceberg is [{{% icebergVersion %}}](../../../releases).
Spark is currently the most feature-rich compute engine for Iceberg operations.
We recommend getting started with Spark to learn Iceberg concepts and features through worked examples.
You can also find documentation for using Iceberg with other compute engines under the **Engines** tab.
## Using Iceberg in Spark 3
To use Iceberg in a Spark shell, use the `--packages` option:
```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:{{% icebergVersion %}}
```
{{< hint info >}}
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.2_2.12` Jar][spark-runtime-jar] to Spark's `jars` folder.
{{< /hint >}}
[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar
### Adding catalogs
Iceberg comes with [catalogs](../spark-configuration#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`.
This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:
```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:{{% icebergVersion %}} \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```
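The same settings can be applied programmatically when building your own `SparkSession` instead of using the shells. A minimal sketch, assuming the Iceberg runtime jar is already on the classpath; the application name and the relative `warehouse` path are illustrative placeholders:
```scala
import org.apache.spark.sql.SparkSession

// Mirrors the --conf flags above; the app name and warehouse location are
// placeholder values, not required settings.
val spark = SparkSession.builder()
  .appName("iceberg-getting-started")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "warehouse")
  .getOrCreate()
```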
### Creating a table
To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl#create-table) command:
```sql
-- local is the path-based catalog defined above
CREATE TABLE local.db.table (id bigint, data string) USING iceberg
```
Iceberg catalogs support the full range of SQL DDL commands, including:
* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl#create-table)
* [`CREATE TABLE ... AS SELECT`](../spark-ddl#create-table--as-select)
* [`ALTER TABLE`](../spark-ddl#alter-table)
* [`DROP TABLE`](../spark-ddl#drop-table)
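For example, a partitioned table can be created, evolved, and dropped from `spark.sql(...)`. The sketch below uses a hypothetical `local.db.events` table that is not part of the examples above:
```scala
// Hypothetical table used only to illustrate the DDL commands listed above
spark.sql("""
  CREATE TABLE local.db.events (id bigint, data string, ts timestamp)
  USING iceberg
  PARTITIONED BY (days(ts))
""")

// Schema evolution: add a column without rewriting existing data
spark.sql("ALTER TABLE local.db.events ADD COLUMNS (category string)")

// Remove the table when it is no longer needed
spark.sql("DROP TABLE local.db.events")
```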
### Writing
Once your table is created, insert data using [`INSERT INTO`](../spark-writes#insert-into):
```sql
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes#merge-into) and [`DELETE FROM`](../spark-writes#delete-from):
```sql
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
WHEN NOT MATCHED THEN INSERT *
```
Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes#writing-with-dataframes):
```scala
spark.table("source").select("id", "data")
.writeTo("local.db.table").append()
```
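The `writeTo` API can also create tables and overwrite data. A short sketch, reusing the `source` table from the earlier examples; `local.db.table_copy` is an illustrative name:
```scala
val src = spark.table("source").select("id", "data")

// Create (or replace) an Iceberg table from the query result
src.writeTo("local.db.table_copy").using("iceberg").createOrReplace()

// Dynamically overwrite the partitions present in the incoming data
src.writeTo("local.db.table_copy").overwritePartitions()
```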
The old `write` API is supported, but _not_ recommended.
### Reading
To read with SQL, use an Iceberg table name in a `SELECT` query:
```sql
SELECT count(1) as count, data
FROM local.db.table
GROUP BY data
```
SQL is also the recommended way to [inspect tables](../spark-queries#inspecting-tables). To view all of the snapshots in a table, use the `snapshots` metadata table:
```sql
SELECT * FROM local.db.table.snapshots
```
```
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | ... |
| ...                     | ...            | ...       | ...       | ...                                                | ... |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+
```
[DataFrame reads](../spark-queries#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`:
```scala
val df = spark.table("local.db.table")
df.count()
```
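The earlier SQL aggregation can be expressed the same way with DataFrame operations; a small illustrative sketch:
```scala
// Count rows per distinct value of the data column, mirroring the GROUP BY query above
val counts = spark.table("local.db.table").groupBy("data").count()
counts.show()
```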
### Next steps
Next, you can learn more about Iceberg tables in Spark:
* [DDL commands](../spark-ddl): `CREATE`, `ALTER`, and `DROP`
* [Querying data](../spark-queries): `SELECT` queries and metadata tables
* [Writing data](../spark-writes): `INSERT INTO` and `MERGE INTO`
* [Maintaining tables](../spark-procedures) with stored procedures