import ChangeLog from '../changelog/connector-paimon.md';
Paimon source connector
Read data from Apache Paimon.

| SeaTunnel Version | Paimon Version   |
|-------------------|------------------|
| 2.3.2 - 2.3.3     | 0.4-SNAPSHOT     |
| 2.3.4             | 0.6-SNAPSHOT     |
| 2.3.5 - 2.3.11    | 0.7.0-incubating |
| 2.3.12 - 2.3.13   | 1.1.1            |
Note: When upgrading the Paimon dependency from 0.7.0-incubating to 1.1.1, follow the documented upgrade steps; they help minimize risks and ensure a smooth transition to the stable version 1.1.1.

| name                    | type   | required | default value |
|-------------------------|--------|----------|---------------|
| warehouse               | String | Yes      | -             |
| catalog_type            | String | No       | filesystem    |
| catalog_uri             | String | No       | -             |
| database                | String | Yes      | -             |
| table                   | String | No       | -             |
| table_list              | Array  | No       | -             |
| user                    | String | No       | -             |
| password                | String | No       | -             |
| hdfs_site_path          | String | No       | -             |
| query                   | String | No       | -             |
| paimon.hadoop.conf      | Map    | No       | -             |
| paimon.hadoop.conf-path | String | No       | -             |

`warehouse`: Paimon warehouse path.

`catalog_type`: Catalog type of Paimon; `filesystem` and `hive` are supported.

`catalog_uri`: Catalog uri of Paimon, only needed when `catalog_type` is `hive`.

`database`: The database you want to access.

`table`: The table you want to access.

`table_list`: The list of tables to be read; you can use this configuration instead of `table`.

`hdfs_site_path`: The file path of `hdfs-site.xml`.

`query`: The filter condition for the table read, for example: `select * from st_test where id > 100`. If not specified, all rows are read. Currently, where conditions only support `<`, `<=`, `>`, `>=`, `=`, `!=`, `or`, `and`, `is null`, `is not null`, `between...and`, `in`, `not in`, and `like`; other predicates are not supported. The `Having`, `Group By`, and `Order By` clauses are currently unsupported because Paimon does not support them. You can also project specific columns, for example: `select id, name from st_test where id > 100`. `limit` will be supported in the future.

Note: When a field compared in the where condition is a string or boolean value, its value must be enclosed in single quotes, otherwise an error will be reported. For example: `name='abc'` or `tag='true'`. Only a limited set of field data types is supported in where conditions.

`paimon.hadoop.conf`: Properties in hadoop conf.

`paimon.hadoop.conf-path`: The path from which the `core-site.xml`, `hdfs-site.xml` and `hive-site.xml` files are loaded.
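
As an alternative to listing Hadoop properties inline via `paimon.hadoop.conf`, the `paimon.hadoop.conf-path` option can point at the directory holding those files. A minimal sketch, assuming the configuration files have been copied to the placeholder directory `/home/seatunnel/hadoop-conf`:

```hocon
source {
  Paimon {
    warehouse = "hdfs:///tmp/paimon"
    database = "default"
    table = "st_test"
    # Placeholder directory expected to contain core-site.xml, hdfs-site.xml and hive-site.xml
    paimon.hadoop.conf-path = "/home/seatunnel/hadoop-conf"
  }
}
```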

The Paimon source connector supports reading data from multiple file systems. Currently, the supported file systems are hdfs and s3. If you use the s3 filesystem, you can configure the `fs.s3a.access-key`, `fs.s3a.secret-key`, `fs.s3a.endpoint`, `fs.s3a.path.style.access` and `fs.s3a.aws.credentials.provider` properties in the `paimon.hadoop.conf` option. In addition, the warehouse path should start with `s3a://` (see the S3 example below).

```hocon
# Read a single table
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "default"
    table = "st_test"
  }
}
```

```hocon
# Read multiple tables, each with its own query
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "default"
    table_list = [
      {
        table = "table1"
        query = "select * from table1 where id > 100"
      },
      {
        table = "table2"
        query = "select * from table2 where id > 100"
      }
    ]
  }
}
```

```hocon
# Project specific columns and push down filter conditions
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
    query = "select c_boolean, c_tinyint from st_test where c_boolean= 'true' and c_tinyint > 116 and c_smallint = 15987 or c_decimal='2924137191386439303744.39292213'"
  }
}
```

```hocon
# Read from a warehouse on S3
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Paimon {
    warehouse = "s3a://test/"
    database = "seatunnel_namespace11"
    table = "st_test"
    paimon.hadoop.conf = {
      fs.s3a.access-key=G52pnxg67819khOZ9ezX
      fs.s3a.secret-key=SHJuAQqHsLrgZWikvMa3lJf5T0NfM5LMFliJh9HF
      fs.s3a.endpoint="http://minio4:9000"
      fs.s3a.path.style.access=true
      fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
    }
  }
}

sink {
  Console {}
}
```

```hocon
# Read from an HDFS cluster with NameNode HA
source {
  Paimon {
    catalog_name = "seatunnel_test"
    warehouse = "hdfs:///tmp/paimon"
    database = "seatunnel_namespace1"
    table = "st_test"
    query = "select * from st_test where pk_id is not null and pk_id < 3"
    paimon.hadoop.conf = {
      hadoop_user_name = "hdfs"
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
    }
  }
}
```

```hocon
# Read a table registered in a Hive catalog
source {
  Paimon {
    catalog_name = "seatunnel_test"
    catalog_type = "hive"
    catalog_uri = "thrift://hadoop04:9083"
    warehouse = "hdfs:///tmp/seatunnel"
    database = "seatunnel_test"
    table = "st_test3"
    paimon.hadoop.conf = {
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
    }
  }
}
```
If you want to read the changelog of a Paimon table, first set the `changelog-producer` table property on the Paimon source table and then use a SeaTunnel streaming task to read it.
Note: Currently, batch reads always read the latest snapshot. To read the full changelog data, you must use a streaming read, and the streaming read must be started before data is written to the Paimon table. To guarantee ordering, set the parallelism of the streaming read task to 1.
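
How the `changelog-producer` property gets onto the source table depends on how that table was created. As a hedged sketch, if the table is created and written by the SeaTunnel Paimon sink and your version exposes the sink's `paimon.table.write-props` option (not part of the source option table above), the property could be passed through when the table is first written:

```hocon
sink {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
    paimon.table.primary-keys = "c_tinyint"
    # Paimon table properties applied when the table is created;
    # "input" is one of Paimon's changelog-producer modes and uses the
    # input records directly as the changelog.
    paimon.table.write-props = {
      changelog-producer = "input"
    }
  }
}
```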

```hocon
# Streaming read from one Paimon table into another
env {
  parallelism = 1
  job.mode = "Streaming"
}

source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
  }
}

sink {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test_sink"
    paimon.table.primary-keys = "c_tinyint"
  }
}
```

```hocon
# Read with user/password authentication
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "default"
    table = "st_test"
    user = "paimon"
    password = "******"
  }
}
```