# Apache Iceberg
> Apache Iceberg source connector
## Description
Source connector for Apache Iceberg. It supports both batch and stream modes.
## Key features
- [x] [batch](../../concept/connector-v2-features.md)
- [x] [stream](../../concept/connector-v2-features.md)
- [x] [exactly-once](../../concept/connector-v2-features.md)
- [x] [schema projection](../../concept/connector-v2-features.md)
- [x] [parallelism](../../concept/connector-v2-features.md)
- [ ] [support user-defined split](../../concept/connector-v2-features.md)
- [x] data format
- [x] parquet
- [x] orc
- [x] avro
- [x] iceberg catalog
  - [x] hadoop(2.7.1, 2.7.5, 3.1.3)
  - [x] hive(2.3.9, 3.1.2)
## Options
| name | type | required | default value |
| ------------------------ | ------- | -------- | -------------------- |
| catalog_name | string | yes | - |
| catalog_type | string | yes | - |
| uri | string | no | - |
| warehouse | string | yes | - |
| namespace | string | yes | - |
| table | string | yes | - |
| case_sensitive | boolean | no | false |
| fields | array | no | - |
| start_snapshot_timestamp | long | no | - |
| start_snapshot_id | long | no | - |
| end_snapshot_id | long | no | - |
| use_snapshot_id | long | no | - |
| use_snapshot_timestamp | long | no | - |
| stream_scan_strategy | enum | no | FROM_LATEST_SNAPSHOT |
| common-options | | no | - |
### catalog_name [string]
User-specified catalog name.
### catalog_type [string]
The optional values are:
- hive: The Hive metastore catalog.
- hadoop: The Hadoop catalog.
### uri [string]
The Hive metastore's thrift URI.
### warehouse [string]
The location to store metadata files and data files.
### namespace [string]
The Iceberg database name in the backend catalog.
### table [string]
The Iceberg table name in the backend catalog.
### case_sensitive [boolean]
If data columns are selected via `fields`, this controls whether they are matched to the table schema case-sensitively.
### fields [array]
Use projection to select the data columns and their order (see the schema projection example below).
### start_snapshot_id [long]
Instructs this scan to look for changes starting from a particular snapshot (exclusive).
### start_snapshot_timestamp [long]
Instructs this scan to look for changes starting from the most recent snapshot for the table as of the given timestamp, in milliseconds since the Unix epoch.
### end_snapshot_id [long]
Instructs this scan to look for changes up to a particular snapshot (inclusive).
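For example, an incremental read of the changes between two snapshots might be configured as in the following sketch (both snapshot IDs are placeholders):
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Read changes after this snapshot (exclusive) ...
    start_snapshot_id = 3821550127947089987
    # ... up to and including this snapshot
    end_snapshot_id = 5899787588732605976
  }
}
```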
### use_snapshot_id [long]
Instructs this scan to use the given snapshot ID.
### use_snapshot_timestamp [long]
Instructs this scan to use the most recent snapshot as of the given time, in milliseconds since the Unix epoch.
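Similarly, a batch read pinned to the snapshot that was current at a given time might look like this sketch (the timestamp is a placeholder value):
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Use the most recent snapshot as of this time
    # (milliseconds since the Unix epoch; placeholder value)
    use_snapshot_timestamp = 1664841600000
  }
}
```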
### stream_scan_strategy [enum]
Starting strategy for stream mode execution. Defaults to `FROM_LATEST_SNAPSHOT` if no value is specified; a configuration sketch follows the list below.
The optional values are:
- TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.
- FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.
- FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.
- FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive.
- FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive.
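As a sketch, a stream-mode source that starts incremental reading from a specific snapshot might look like the following. The snapshot ID is a placeholder, and it is assumed here that `FROM_SNAPSHOT_ID` is paired with `start_snapshot_id`:
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Assumption: FROM_SNAPSHOT_ID starts incremental reading
    # from the snapshot given by start_snapshot_id (inclusive)
    stream_scan_strategy = FROM_SNAPSHOT_ID
    start_snapshot_id = 3821550127947089987
  }
}
```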
### common options
Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
## Example
Simple:
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hadoop"
warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
}
}
```
Or
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hive"
uri = "thrift://localhost:9083"
warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
}
}
```
Schema projection:
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hadoop"
warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
fields {
f2 = "boolean"
f1 = "bigint"
f3 = "int"
f4 = "bigint"
}
}
}
```
:::tip
To be compatible with different versions of Hadoop and Hive, the hive-exec and flink-shaded-hadoop-2 dependencies have `provided` scope in the project pom file. If you use the Flink engine, you may first need to add the following JAR packages to the <FLINK_HOME>/lib directory; if you use the Spark engine and have integrated Hadoop, you do not need to add them.
:::
```
flink-shaded-hadoop-x-xxx.jar
hive-exec-xxx.jar
libfb303-xxx.jar
```
Some versions of the hive-exec package do not include libfb303-xxx.jar, so you may also need to import that JAR manually.
## Changelog
### 2.2.0-beta 2022-09-26
- Add Iceberg Source Connector
### next version
- [Feature] Support Hadoop 3.x ([3046](https://github.com/apache/incubator-seatunnel/pull/3046))