import ChangeLog from '../changelog/connector-hive.md';
Hive source connector
Read data from Hive.
When using markdown format, SeaTunnel can parse markdown files stored in Hive tables and extract structured data with elements like headings, paragraphs, lists, code blocks, and tables. Each element is converted to a row with the following schema:
element_id
: Unique identifier for the element

element_type
: Type of the element (Heading, Paragraph, ListItem, etc.)

heading_level
: Level of the heading (1-6, null for non-heading elements)

text
: Text content of the element

page_number
: Page number (default: 1)

position_index
: Position index within the document

parent_id
: ID of the parent element

child_ids
: Comma-separated list of child element IDs

Note: Markdown format only supports reading, not writing.
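
As an illustration, a source sketch that projects only part of this schema via `read_columns` might look like the following; the table name and metastore URI are placeholders, and it assumes a Hive table whose files are stored in markdown format:

```hocon
source {
  Hive {
    # Placeholder table and metastore values -- adjust to your environment
    table_name = "default.markdown_docs"
    metastore_uri = "thrift://namenode001:9083"
    # Project only a subset of the markdown element schema
    read_columns = ["element_id", "element_type", "heading_level", "text"]
  }
}
```
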
:::tip

To use this connector, you must ensure your Spark/Flink cluster has already integrated Hive. The tested Hive versions are 2.3.9 and 3.1.3.

If you use SeaTunnel Engine, you need to put seatunnel-hadoop3-3.1.4-uber.jar, hive-exec-3.1.3.jar and libfb303-0.9.3.jar in the $SEATUNNEL_HOME/lib/ directory.

:::
All data in a split is read in a single pollNext call. The splits that have been read are saved in the snapshot.
name | type | required | default value |
---|---|---|---|
table_name | string | yes | - |
metastore_uri | string | yes | - |
krb5_path | string | no | /etc/krb5.conf |
kerberos_principal | string | no | - |
kerberos_keytab_path | string | no | - |
hdfs_site_path | string | no | - |
hive_site_path | string | no | - |
hive.hadoop.conf | Map | no | - |
hive.hadoop.conf-path | string | no | - |
read_partitions | list | no | - |
read_columns | list | no | - |
compress_codec | string | no | none |
common-options | | no | - |
table_name [string]
: The target Hive table name, e.g. `db1.table1`.

metastore_uri [string]
: The Hive metastore URI.

hdfs_site_path [string]
: The path of `hdfs-site.xml`, used to load the HA configuration of the NameNodes.

hive.hadoop.conf [map]
: Properties in Hadoop conf (`core-site.xml`, `hdfs-site.xml`, `hive-site.xml`).

hive.hadoop.conf-path [string]
: The specified loading path for the `core-site.xml`, `hdfs-site.xml`, `hive-site.xml` files.
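
A sketch combining the two options above; the conf directory and the inline property are illustrative placeholders, not required values:

```hocon
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  # Directory containing core-site.xml / hdfs-site.xml / hive-site.xml (placeholder path)
  hive.hadoop.conf-path = "/etc/hadoop/conf"
  # Additional Hadoop properties supplied inline (illustrative entry)
  hive.hadoop.conf = {
    dfs.client.use.datanode.hostname = "true"
  }
}
```
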
read_partitions [list]
: The target partitions to read from the Hive table. If this parameter is not set, all data in the table is read.

Tip: Every partition in the partitions list should have the same directory depth. For example, if a Hive table has two partition keys, par1 and par2, then setting `read_partitions = [par1=xxx, par1=yyy/par2=zzz]` is illegal because the entries have different depths.
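
A sketch of a legal setting, reusing the par1/par2 keys from the tip above with placeholder values; both entries cover the same partition keys, so the directory depth matches:

```hocon
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  # Both entries have the depth par1/par2, so this is legal (values are placeholders)
  read_partitions = ["par1=xxx/par2=zzz", "par1=yyy/par2=zzz"]
}
```
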
krb5_path [string]
: The path of `krb5.conf`, used for Kerberos authentication.

kerberos_principal [string]
: The principal for Kerberos authentication.

kerberos_keytab_path [string]
: The keytab file path for Kerberos authentication.

read_columns [list]
: The read column list of the data source; users can use it to implement field projection.
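
For instance, to project only two fields (field names borrowed from the Kerberos example below; table and metastore values are placeholders):

```hocon
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  # Only these two fields are read from the source table
  read_columns = ["pk_id", "name"]
}
```
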
compress_codec [string]
: The compress codec of files. The supported codecs are `lzo` and `none`.
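
An illustrative setting for a table whose files are lzo-compressed (the table name and metastore URI are placeholders):

```hocon
Hive {
  # Hypothetical table holding lzo-compressed files
  table_name = "default.seatunnel_text_lzo"
  metastore_uri = "thrift://namenode001:9083"
  compress_codec = "lzo"
}
```
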
common-options
: Source plugin common parameters, please refer to Source Common Options for details.
```hocon
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
}
```
Note: Hive is a structured data source and should use `table_list`; `tables_configs` will be removed in the future.
```hocon
Hive {
  table_list = [
    {
      table_name = "default.seatunnel_orc_1"
      metastore_uri = "thrift://namenode001:9083"
    },
    {
      table_name = "default.seatunnel_orc_2"
      metastore_uri = "thrift://namenode001:9083"
    }
  ]
}
```
```hocon
Hive {
  tables_configs = [
    {
      table_name = "default.seatunnel_orc_1"
      metastore_uri = "thrift://namenode001:9083"
    },
    {
      table_name = "default.seatunnel_orc_2"
      metastore_uri = "thrift://namenode001:9083"
    }
  ]
}
```
```hocon
source {
  Hive {
    table_name = "default.test_hive_sink_on_hdfs_with_kerberos"
    metastore_uri = "thrift://metastore:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    plugin_output = hive_source
    hive_site_path = "/tmp/hive-site.xml"
    kerberos_principal = "hive/metastore.seatunnel@EXAMPLE.COM"
    kerberos_keytab_path = "/tmp/hive.keytab"
    krb5_path = "/tmp/krb5.conf"
  }
}
```
Description:

hive_site_path
: The path to the hive-site.xml file.

kerberos_principal
: The principal for Kerberos authentication.

kerberos_keytab_path
: The keytab file path for Kerberos authentication.

krb5_path
: The path to the krb5.conf file used for Kerberos authentication.

Run the case:
```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "default.test_hive_sink_on_hdfs_with_kerberos"
    metastore_uri = "thrift://metastore:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    plugin_output = hive_source
    hive_site_path = "/tmp/hive-site.xml"
    kerberos_principal = "hive/metastore.seatunnel@EXAMPLE.COM"
    kerberos_keytab_path = "/tmp/hive.keytab"
    krb5_path = "/tmp/krb5.conf"
  }
}

sink {
  Assert {
    plugin_input = hive_source
    rules {
      row_rules = [
        {
          rule_type = MAX_ROW
          rule_value = 3
        }
      ],
      field_rules = [
        {
          field_name = pk_id
          field_type = bigint
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = name
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = score
          field_type = int
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}
```
For Hive on S3 (EMR), create the lib dir for Hive:
```shell
mkdir -p ${SEATUNNEL_HOME}/plugins/Hive/lib
```
Download the required jars from Maven Central into the lib dir:
```shell
cd ${SEATUNNEL_HOME}/plugins/Hive/lib
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.5/hadoop-aws-2.6.5.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.9/hive-exec-2.3.9.jar
```
Copy the jars from your EMR environment to the lib dir:
```shell
cp /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.60.0.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/hadoop-common-3.3.6-amzn-1.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/javax.inject-1.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/aopalliance-1.0.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
```
Run the case.
```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "test_hive.test_hive_sink_on_s3"
    metastore_uri = "thrift://ip-192-168-0-202.cn-north-1.compute.internal:9083"
    hive.hadoop.conf-path = "/home/ec2-user/hadoop-conf"
    hive.hadoop.conf = {
      bucket="s3://ws-package"
      fs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider"
    }
    read_columns = ["pk_id", "name", "score"]
  }
}

sink {
  Hive {
    table_name = "test_hive.test_hive_sink_on_s3_sink"
    metastore_uri = "thrift://ip-192-168-0-202.cn-north-1.compute.internal:9083"
    hive.hadoop.conf-path = "/home/ec2-user/hadoop-conf"
    hive.hadoop.conf = {
      bucket="s3://ws-package"
      fs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider"
    }
  }
}
```
For Hive on OSS (Aliyun EMR), create the lib dir for Hive:
```shell
mkdir -p ${SEATUNNEL_HOME}/plugins/Hive/lib
```
Download the required jars from Maven Central into the lib dir:
```shell
cd ${SEATUNNEL_HOME}/plugins/Hive/lib
wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.9/hive-exec-2.3.9.jar
```
Copy the jars from your EMR environment to the lib dir and delete the conflicting jar:
```shell
cp -r /opt/apps/JINDOSDK/jindosdk-current/lib/jindo-*.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
rm -f ${SEATUNNEL_HOME}/lib/hadoop-aliyun-*.jar
```
Run the case.
```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "test_hive.test_hive_sink_on_oss"
    metastore_uri = "thrift://master-1-1.c-1009b01725b501f2.cn-wulanchabu.emr.aliyuncs.com:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    hive.hadoop.conf = {
      bucket="oss://emr-osshdfs.cn-wulanchabu.oss-dls.aliyuncs.com"
    }
  }
}

sink {
  Hive {
    table_name = "test_hive.test_hive_sink_on_oss_sink"
    metastore_uri = "thrift://master-1-1.c-1009b01725b501f2.cn-wulanchabu.emr.aliyuncs.com:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    hive.hadoop.conf = {
      bucket="oss://emr-osshdfs.cn-wulanchabu.oss-dls.aliyuncs.com"
    }
  }
}
```