# Apache Iceberg
> Apache Iceberg source connector
## Description
Source connector for Apache Iceberg. It supports both batch and stream modes.
## Key features
- [x] [batch](../../concept/connector-v2-features.md)
- [x] [stream](../../concept/connector-v2-features.md)
- [x] [exactly-once](../../concept/connector-v2-features.md)
- [x] [schema projection](../../concept/connector-v2-features.md)
- [x] [parallelism](../../concept/connector-v2-features.md)
- [ ] [support user-defined split](../../concept/connector-v2-features.md)
- [x] data format
- [x] parquet
- [x] orc
- [x] avro
- [x] iceberg catalog
  - [x] hadoop(2.7.1, 2.7.5, 3.1.3)
  - [x] hive(2.3.9, 3.1.2)
## Options
| name | type | required | default value |
| ------------------------ | ------- | -------- | -------------------- |
| catalog_name | string | yes | - |
| catalog_type | string | yes | - |
| uri | string | no | - |
| warehouse | string | yes | - |
| namespace | string | yes | - |
| table | string | yes | - |
| case_sensitive | boolean | no | false |
| fields | array | no | - |
| start_snapshot_timestamp | long | no | - |
| start_snapshot_id | long | no | - |
| end_snapshot_id | long | no | - |
| use_snapshot_id | long | no | - |
| use_snapshot_timestamp | long | no | - |
| stream_scan_strategy | enum | no | FROM_LATEST_SNAPSHOT |
| common-options | | no | - |
### catalog_name [string]
User-specified catalog name.
### catalog_type [string]
The optional values are:
- hive: The Hive metastore catalog.
- hadoop: The Hadoop catalog.
### uri [string]
The Hive metastore's thrift URI.
### warehouse [string]
The location to store metadata files and data files.
### namespace [string]
The Iceberg database name in the backend catalog.
### table [string]
The Iceberg table name in the backend catalog.
### case_sensitive [boolean]
If data columns are selected via `fields`, this controls whether they are matched to the table schema case-sensitively.
### fields [array]
Use projection to select the data columns and their order (see the schema projection example below).
### start_snapshot_id [long]
Instructs this scan to look for changes starting from a particular snapshot (exclusive).
### start_snapshot_timestamp [long]
Instructs this scan to look for changes starting from the most recent snapshot for the table as of the given timestamp, in milliseconds since the Unix epoch.
### end_snapshot_id [long]
Instructs this scan to look for changes up to a particular snapshot (inclusive).
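For example, an incremental read of the changes between two snapshots might be configured as in the following sketch (both snapshot IDs are placeholders):
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Read changes after this snapshot (exclusive) ...
    start_snapshot_id = 3821550127947089987
    # ... up to and including this snapshot
    end_snapshot_id = 5899787588732605976
  }
}
```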
### use_snapshot_id [long]
Instructs this scan to use the given snapshot ID.
### use_snapshot_timestamp [long]
Instructs this scan to use the most recent snapshot as of the given time, in milliseconds since the Unix epoch.
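Similarly, a batch read pinned to the snapshot that was current at a given time might look like this sketch (the timestamp is a placeholder value):
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Use the most recent snapshot as of this time
    # (milliseconds since the Unix epoch; placeholder value)
    use_snapshot_timestamp = 1664841600000
  }
}
```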
### stream_scan_strategy [enum]
Starting strategy for stream mode execution. Defaults to `FROM_LATEST_SNAPSHOT` if no value is specified; a configuration sketch follows the list below.
The optional values are:
- TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.
- FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.
- FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.
- FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive.
- FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive.
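As a sketch, a stream-mode source that starts incremental reading from a specific snapshot might look like the following. The snapshot ID is a placeholder, and it is assumed here that `FROM_SNAPSHOT_ID` is paired with `start_snapshot_id`:
```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Assumption: FROM_SNAPSHOT_ID starts incremental reading
    # from the snapshot given by start_snapshot_id (inclusive)
    stream_scan_strategy = FROM_SNAPSHOT_ID
    start_snapshot_id = 3821550127947089987
  }
}
```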
### common options
Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
## Example
Simple:
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hadoop"
warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
}
}
```
Or
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hive"
uri = "thrift://localhost:9083"
warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
}
}
```
Schema projection:
```hocon
source {
Iceberg {
catalog_name = "seatunnel"
catalog_type = "hadoop"
warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
namespace = "your_iceberg_database"
table = "your_iceberg_table"
fields {
f2 = "boolean"
f1 = "bigint"
f3 = "int"
f4 = "bigint"
}
}
}
```
:::tip
To be compatible with different versions of Hadoop and Hive, the hive-exec and flink-shaded-hadoop-2 dependencies have `provided` scope in the project pom file. If you use the Flink engine, you may first need to add the following JAR packages to the <FLINK_HOME>/lib directory; if you use the Spark engine and have integrated Hadoop, you do not need to add them.
:::
```
flink-shaded-hadoop-x-xxx.jar
hive-exec-xxx.jar
libfb303-xxx.jar
```
Some versions of the hive-exec package do not include libfb303-xxx.jar, so you may also need to import that JAR manually.
## Changelog
### 2.2.0-beta 2022-09-26
- Add Iceberg Source Connector
### next version
- [Feature] Support Hadoop 3.x ([3046](https://github.com/apache/incubator-seatunnel/pull/3046))