 ---
 title: "Java API"
 url: api
 aliases:
     - "java/api"
 menu:
     main:
         parent: "API"
         identifier: java_api
         weight: 200
 ---
 <!--
  - Licensed to the Apache Software Foundation (ASF) under one or more
  - contributor license agreements.  See the NOTICE file distributed with
  - this work for additional information regarding copyright ownership.
  - The ASF licenses this file to You under the Apache License, Version 2.0
  - (the "License"); you may not use this file except in compliance with
  - the License.  You may obtain a copy of the License at
  -
  -   http://www.apache.org/licenses/LICENSE-2.0
  -
  - Unless required by applicable law or agreed to in writing, software
  - distributed under the License is distributed on an "AS IS" BASIS,
  - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  - See the License for the specific language governing permissions and
  - limitations under the License.
  -->

 # Iceberg Java API

 ## Tables

 The main purpose of the Iceberg API is to manage table metadata, such as the schema and partition spec, and the data files that store table data.

 Table metadata and operations are accessed through the `Table` interface, which returns table information and creates scan and update operations.

 ### Table metadata

 The [`Table` interface](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/Table.html) provides access to the table metadata:

 * `schema` returns the current table [schema](../schemas)
 * `spec` returns the current table partition spec
 * `properties` returns a map of key-value [properties](../configuration)
 * `currentSnapshot` returns the current table snapshot
 * `snapshots` returns all valid snapshots for the table
 * `snapshot(id)` returns a specific snapshot by ID
 * `location` returns the table's base location

 Tables also provide `refresh` to update the table to the latest version, and expose helpers:

 * `io` returns the `FileIO` used to read and write table files
 * `locationProvider` returns a `LocationProvider` used to create paths for data and metadata files


 ### Scanning

 #### File level

 Iceberg table scans start by creating a `TableScan` object with `newScan`.

 ```java
 TableScan scan = table.newScan();
 ```

 To configure a scan, call `filter` and `select` on the `TableScan` to get a new `TableScan` with those changes.

 ```java
 TableScan filteredScan = scan.filter(Expressions.equal("id", 5));
 ```

 Calls to configuration methods create a new `TableScan` so that each `TableScan` is immutable and won't change unexpectedly if shared across threads.
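The immutability described above can be illustrated with a minimal sketch in plain Java. `SketchScan` is a hypothetical stand-in written for this example, not an Iceberg class, and it models filters as plain strings; it only shows the pattern in which each configuration call returns a new object and leaves the original untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a scan configuration; not an Iceberg class.
final class SketchScan {
    private final List<String> filters;
    private final List<String> columns;

    SketchScan() {
        this(Collections.<String>emptyList(), Collections.<String>emptyList());
    }

    private SketchScan(List<String> filters, List<String> columns) {
        // Defensive copies keep every instance immutable.
        this.filters = Collections.unmodifiableList(new ArrayList<>(filters));
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }

    // Returns a new scan with one more filter; this instance is not modified.
    SketchScan filter(String predicate) {
        List<String> next = new ArrayList<>(filters);
        next.add(predicate);
        return new SketchScan(next, columns);
    }

    // Returns a new scan with a different projection; this instance is not modified.
    SketchScan select(String... names) {
        return new SketchScan(filters, Arrays.asList(names));
    }

    List<String> filters() {
        return filters;
    }

    List<String> columns() {
        return columns;
    }

    public static void main(String[] args) {
        SketchScan base = new SketchScan();
        SketchScan refined = base.filter("id = 5").select("id", "data");
        System.out.println(base.filters());    // [] (the base scan is unchanged)
        System.out.println(refined.filters()); // [id = 5]
        System.out.println(refined.columns()); // [id, data]
    }
}
```

Because `base` never changes, it can be handed to another thread or reused to derive a differently configured scan without coordination.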

 When a scan is configured, `planFiles`, `planTasks`, and `schema` are used to return files, tasks, and the read projection.

 ```java
 TableScan scan = table.newScan()
     .filter(Expressions.equal("id", 5))
     .select("id", "data");

 Schema projection = scan.schema();
 Iterable<CombinedScanTask> tasks = scan.planTasks();
 ```

 Use `asOfTime` or `useSnapshot` to configure the table snapshot for time travel queries.
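As a rough illustration of time travel by timestamp, the sketch below picks the most recent snapshot committed at or before a given time. It assumes that this is how `asOfTime` chooses a snapshot; `SnapshotLogSketch` and its methods are hypothetical names for this example and are not part of Iceberg.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of snapshot selection by time; not Iceberg's implementation.
final class SnapshotLogSketch {
    // Maps commit timestamp (milliseconds) to snapshot ID, ordered by time.
    private final TreeMap<Long, Long> log = new TreeMap<>();

    void record(long timestampMillis, long snapshotId) {
        log.put(timestampMillis, snapshotId);
    }

    // Returns the ID of the most recent snapshot committed at or before the given time.
    long snapshotIdAsOf(long timestampMillis) {
        Map.Entry<Long, Long> entry = log.floorEntry(timestampMillis);
        if (entry == null) {
            throw new IllegalArgumentException("No snapshot at or before " + timestampMillis);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        SnapshotLogSketch log = new SnapshotLogSketch();
        log.record(1_000L, 101L);
        log.record(2_000L, 102L);
        log.record(3_000L, 103L);
        System.out.println(log.snapshotIdAsOf(2_500L)); // 102
        System.out.println(log.snapshotIdAsOf(3_000L)); // 103
    }
}
```

Selecting by snapshot ID, as `useSnapshot` does, skips the lookup and names the snapshot directly.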

 #### Row level

 Iceberg table scans start by creating a `ScanBuilder` object with `IcebergGenerics.read`.

 ```java
 ScanBuilder scanBuilder = IcebergGenerics.read(table);
 ```

 To configure a scan, call `where` and `select` on the `ScanBuilder` to get a new `ScanBuilder` with those changes.

 ```java
 scanBuilder.where(Expressions.equal("id", 5));
 ```

 Once the scan is configured, call `build` to execute it. `build` returns a `CloseableIterable<Record>`:

 ```java
 CloseableIterable<Record> result = IcebergGenerics.read(table)
         .where(Expressions.lessThan("id", 5))
         .build();
 ```
 Here, `Record` is the Iceberg record class from the `iceberg-data` module, `org.apache.iceberg.data.Record`.

 ### Update operations

 `Table` also exposes operations that update the table. These operations use a builder pattern, [`PendingUpdate`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/PendingUpdate.html), that commits when `PendingUpdate#commit` is called.

 For example, updating the table schema is done by calling `updateSchema`, adding updates to the builder, and finally calling `commit` to commit the pending changes to the table:

 ```java
 table.updateSchema()
     .addColumn("count", Types.LongType.get())
     .commit();
 ```

 Available operations to update a table are:

 * `updateSchema` -- update the table schema
 * `updateProperties` -- update table properties
 * `updateLocation` -- update the table's base location
 * `newAppend` -- used to append data files
 * `newFastAppend` -- used to append data files without compacting metadata
 * `newOverwrite` -- used to append data files and remove files that are overwritten
 * `newDelete` -- used to delete data files
 * `newRewrite` -- used to rewrite data files; will replace existing files with new versions
 * `newTransaction` -- create a new table-level transaction
 * `rewriteManifests` -- rewrite manifest data by clustering files, for faster scan planning
 * `rollback` -- roll back the table state to a specific snapshot

 ### Transactions

 Transactions are used to commit multiple table changes in a single atomic operation. A transaction is used to create individual operations using factory methods, like `newAppend`, just like working with a `Table`. Operations created by a transaction are committed as a group when `commitTransaction` is called.

 For example, deleting and appending a file in the same transaction:
 ```java
 Transaction t = table.newTransaction();

 // commit operations to the transaction
 t.newDelete().deleteFromRowFilter(filter).commit();
 t.newAppend().appendFile(data).commit();

 // commit all the changes to the table
 t.commitTransaction();
 ```
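The grouping behavior can be modeled with a small plain-Java sketch. `SketchTable` and `SketchTransaction` are hypothetical classes written for this illustration, with a table reduced to a set of file names; they are not Iceberg's implementation. Operations are staged on the transaction and only become visible when `commitTransaction` swaps in the new state in one step.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

final class TransactionSketch {
    // Hypothetical table: only the set of live data file names.
    static final class SketchTable {
        private Set<String> files = new LinkedHashSet<>();

        Set<String> files() {
            return new LinkedHashSet<>(files); // defensive copy
        }

        SketchTransaction newTransaction() {
            return new SketchTransaction(this);
        }
    }

    // Hypothetical transaction: stages changes and applies them as a group.
    static final class SketchTransaction {
        private final SketchTable table;
        private final List<Consumer<Set<String>>> staged = new ArrayList<>();

        SketchTransaction(SketchTable table) {
            this.table = table;
        }

        SketchTransaction appendFile(String file) {
            staged.add(files -> files.add(file));
            return this;
        }

        SketchTransaction deleteFile(String file) {
            staged.add(files -> files.remove(file));
            return this;
        }

        // Applies every staged change to a copy, then replaces the table state once.
        void commitTransaction() {
            Set<String> next = new LinkedHashSet<>(table.files);
            for (Consumer<Set<String>> op : staged) {
                op.accept(next);
            }
            table.files = next;
            staged.clear();
        }
    }

    public static void main(String[] args) {
        SketchTable table = new SketchTable();
        table.newTransaction().appendFile("a.parquet").commitTransaction();

        SketchTransaction t = table.newTransaction();
        t.deleteFile("a.parquet").appendFile("b.parquet");
        System.out.println(table.files()); // [a.parquet] (staged changes are not visible yet)

        t.commitTransaction();
        System.out.println(table.files()); // [b.parquet]
    }
}
```

Readers of the sketch table never observe a state with both files or with neither, which is the property the single commit is meant to provide.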

 ## Types

 Iceberg data types are located in the [`org.apache.iceberg.types` package](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/types/package-summary.html).

 ### Primitives

 Primitive type instances are available from static methods in each type class. Types without parameters use `get`, and types like `decimal` use factory methods:

 ```java
 Types.IntegerType.get()    // int
 Types.DoubleType.get()     // double
 Types.DecimalType.of(9, 2) // decimal(9, 2)
 ```

 ### Nested types

 Structs, maps, and lists are created using factory methods in type classes.

 Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](../evolution#correctness) and nullability.

 Struct fields are created using `NestedField.optional` or `NestedField.required`. Map value and list element nullability is set in the map and list factory methods.

 ```java
 // struct<1 id: int, 2 data: optional string>
 Types.StructType struct = Types.StructType.of(
     Types.NestedField.required(1, "id", Types.IntegerType.get()),
     Types.NestedField.optional(2, "data", Types.StringType.get())
   );
 ```
 ```java
 // map<1 key: int, 2 value: optional string>
 Types.MapType map = Types.MapType.ofOptional(
     1, 2,
     Types.IntegerType.get(),
     Types.StringType.get()
   );
 ```
 ```java
 // array<1 element: int>
 Types.ListType list = Types.ListType.ofRequired(1, Types.IntegerType.get());
 ```


 ## Expressions

 Iceberg's expressions are used to configure table scans. To create expressions, use the factory methods in [`Expressions`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/expressions/Expressions.html).

 Supported predicate expressions are:

 * `isNull`
 * `notNull`
 * `equal`
 * `notEqual`
 * `lessThan`
 * `lessThanOrEqual`
 * `greaterThan`
 * `greaterThanOrEqual`
 * `in`
 * `notIn`
 * `startsWith`
 * `notStartsWith`

 Supported expression operations are:

 * `and`
 * `or`
 * `not`

 Constant expressions are:

 * `alwaysTrue`
 * `alwaysFalse`

 ### Expression binding

 When created, expressions are unbound. Before an expression is used, it will be bound to a data type to find the field ID the expression name represents, and to convert predicate literals.

 For example, before using the expression `lessThan("x", 10)`, Iceberg needs to determine which column `"x"` refers to and convert `10` to that column's data type.

 The expression could be bound to the type `struct<1 x: long, 2 y: long>` or to `struct<11 x: int, 12 y: int>`; the resolved field ID and the type of the converted literal depend on which struct it is bound to.
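To make the binding step concrete, here is a simplified model in plain Java. `BindingSketch`, `Field`, `SketchType`, and `BoundLessThan` are hypothetical names invented for this sketch and do not correspond to Iceberg classes; the sketch only shows a column name being resolved to a field ID and a literal being converted to the column's type for the two structs mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

final class BindingSketch {
    // Hypothetical subset of primitive types, enough for this example.
    enum SketchType { INT, LONG }

    // Hypothetical struct field: an ID plus a type.
    static final class Field {
        final int id;
        final SketchType type;

        Field(int id, SketchType type) {
            this.id = id;
            this.type = type;
        }
    }

    // Result of binding lessThan(name, literal) to a struct.
    static final class BoundLessThan {
        final int fieldId;
        final Number literal;

        BoundLessThan(int fieldId, Number literal) {
            this.fieldId = fieldId;
            this.literal = literal;
        }

        @Override
        public String toString() {
            return "lessThan(field " + fieldId + ", "
                + literal.getClass().getSimpleName() + " " + literal + ")";
        }
    }

    // Resolves the name to a field ID and converts the literal to the field's type.
    static BoundLessThan bindLessThan(Map<String, Field> struct, String name, int literal) {
        Field field = struct.get(name);
        if (field == null) {
            throw new IllegalArgumentException("Cannot find field '" + name + "'");
        }
        Number converted;
        if (field.type == SketchType.LONG) {
            converted = Long.valueOf(literal);
        } else {
            converted = Integer.valueOf(literal);
        }
        return new BoundLessThan(field.id, converted);
    }

    public static void main(String[] args) {
        // struct<1 x: long, 2 y: long>
        Map<String, Field> longStruct = new HashMap<>();
        longStruct.put("x", new Field(1, SketchType.LONG));
        longStruct.put("y", new Field(2, SketchType.LONG));

        // struct<11 x: int, 12 y: int>
        Map<String, Field> intStruct = new HashMap<>();
        intStruct.put("x", new Field(11, SketchType.INT));
        intStruct.put("y", new Field(12, SketchType.INT));

        System.out.println(bindLessThan(longStruct, "x", 10)); // lessThan(field 1, Long 10)
        System.out.println(bindLessThan(intStruct, "x", 10));  // lessThan(field 11, Integer 10)
    }
}
```

The same unbound expression yields two different bound forms, which is why binding has to happen against a specific type before the expression is evaluated.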

 ### Expression example

 ```java
 table.newScan()
     .filter(Expressions.greaterThanOrEqual("x", 5))
     .filter(Expressions.lessThan("x", 10));
 ```
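The effect of the two filters in the scan above can be sketched with plain Java predicates, under the assumption that successive `filter` calls on a scan are combined with a logical AND. The helper methods below are hypothetical and only mirror the names of the `Expressions` factory methods; they operate on `int` values directly rather than on table rows.

```java
import java.util.function.IntPredicate;

final class FilterChainSketch {
    // Hypothetical counterpart of greaterThanOrEqual for a single int column.
    static IntPredicate greaterThanOrEqual(int bound) {
        return x -> x >= bound;
    }

    // Hypothetical counterpart of lessThan for a single int column.
    static IntPredicate lessThan(int bound) {
        return x -> x < bound;
    }

    public static void main(String[] args) {
        // Both conditions must hold, so only values with 5 <= x < 10 match.
        IntPredicate inRange = greaterThanOrEqual(5).and(lessThan(10));
        System.out.println(inRange.test(4));  // false
        System.out.println(inRange.test(5));  // true
        System.out.println(inRange.test(9));  // true
        System.out.println(inRange.test(10)); // false
    }
}
```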


 ## Modules

 Iceberg table support is organized in library modules:

 * `iceberg-common` contains utility classes used in other modules
 * `iceberg-api` contains the public Iceberg API, including expressions, types, tables, and operations
 * `iceberg-arrow` is an implementation of the Iceberg type system for reading and writing data stored in Iceberg tables using Apache Arrow as the in-memory data format
 * `iceberg-aws` contains implementations of the Iceberg API to be used with tables stored on AWS S3 and/or for tables defined using the AWS Glue data catalog
 * `iceberg-core` contains implementations of the Iceberg API and support for Avro data files; **this is what processing engines should depend on**
 * `iceberg-parquet` is an optional module for working with tables backed by Parquet files
 * `iceberg-orc` is an optional module for working with tables backed by ORC files (*experimental*)
 * `iceberg-hive-metastore` is an implementation of Iceberg tables backed by the Hive metastore Thrift client

 Iceberg also has modules for adding Iceberg support to processing engines and associated tooling:

 * `iceberg-spark` is an implementation of Spark's Datasource V2 API for Iceberg, with submodules for each Spark version (use runtime jars for a shaded version)
 * `iceberg-flink` is an implementation of Flink's Table and DataStream API for Iceberg (use iceberg-flink-runtime for a shaded version)
 * `iceberg-hive3` is an implementation of Hive 3-specific SerDes for Timestamp, TimestampWithZone, and Date object inspectors (use iceberg-hive-runtime for a shaded version)
 * `iceberg-mr` is an implementation of MapReduce and Hive InputFormats and SerDes for Iceberg (use iceberg-hive-runtime for a shaded version for use with Hive)
 * `iceberg-nessie` is a module used to integrate Iceberg table metadata history and operations with [Project Nessie](https://projectnessie.org/)
 * `iceberg-data` is a client library used to read Iceberg tables from JVM applications
 * `iceberg-pig` is an implementation of Pig's LoadFunc API for Iceberg
 * `iceberg-runtime` generates a shaded runtime jar for Spark to integrate with Iceberg tables