
Storm Hive Bolt & Trident State

Hive offers a streaming API that allows data to be written continuously into Hive. Incoming data can be committed continuously, in small batches of records, into an existing Hive partition or table. Once data is committed, it is immediately visible to all Hive queries. More information on the Hive Streaming API: https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

With the help of the Hive Streaming API, HiveBolt and HiveState allow users to stream data from Storm into Hive directly. To use the Hive Streaming API, users need to create a bucketed table stored in ORC format. For example:

create table test_table ( id INT, name STRING, phone STRING, street STRING) partitioned by (city STRING, state STRING) stored as orc tblproperties ("orc.compress"="NONE");

HiveBolt (org.apache.storm.hive.bolt.HiveBolt)

HiveBolt streams tuples directly into Hive. Tuples are written using Hive transactions. The partitions that HiveBolt streams to can either be pre-created, or HiveBolt can optionally create them if they are missing. Fields from tuples are mapped to table columns. Users should make sure that tuple field names match the table column names.

DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
            .withColumnFields(new Fields(colNames));
HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper);
HiveBolt hiveBolt = new HiveBolt(hiveOptions);
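
The name matching described above can be pictured with a small stand-in that needs no Storm dependency. This is only a sketch, not the storm-hive implementation: `DelimitedMappingSketch` and `toDelimitedRecord` are hypothetical names, a `Map` plays the role of a Storm tuple, and the comma delimiter is an assumption. It shows that values are emitted in column order and that a column with no matching tuple field cannot be filled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a delimited record mapper; not the storm-hive code.
class DelimitedMappingSketch {

    // Builds one delimited record by looking up each column name in the tuple.
    // A Map stands in for a Storm tuple here (assumption for illustration).
    static String toDelimitedRecord(List<String> columnNames,
                                    Map<String, Object> tuple,
                                    String delimiter) {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < columnNames.size(); i++) {
            String column = columnNames.get(i);
            if (!tuple.containsKey(column)) {
                // A column without a tuple field of the same name cannot be mapped.
                throw new IllegalArgumentException(
                        "Tuple has no field named '" + column + "'");
            }
            if (i > 0) {
                record.append(delimiter);
            }
            record.append(tuple.get(column));
        }
        return record.toString();
    }

    public static void main(String[] args) {
        List<String> colNames = Arrays.asList("id", "name", "phone", "street");

        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("name", "Alice");
        tuple.put("id", 1);
        tuple.put("street", "1 Main St");
        tuple.put("phone", "555-0100");

        // Values follow the order of colNames, not the order of the tuple.
        System.out.println(toDelimitedRecord(colNames, tuple, ","));
    }
}
```

In this model the lookup is purely by name, which is why a renamed or misspelled tuple field surfaces as a mapping failure rather than as shifted columns.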

RecordHiveMapper

This class maps tuple field names to Hive table column names. There are two implementations available:

  • DelimitedRecordHiveMapper (org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper)
  • JsonRecordHiveMapper (org.apache.storm.hive.bolt.mapper.JsonRecordHiveMapper)
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
         .withColumnFields(new Fields(colNames))
         .withPartitionFields(new Fields(partNames));
 or
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
         .withColumnFields(new Fields(colNames))
         .withTimeAsPartitionField("yyyy/MM/dd");

| Arg | Description | Type |
|---|---|---|
| withColumnFields | Field names in a tuple to be mapped to table column names | Fields (required) |
| withPartitionFields | Field names in a tuple to be mapped to Hive table partitions | Fields |
| withTimeAsPartitionField | Uses system time as the partition in the Hive table | String (date format) |
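
The date format string deserves care. Assuming the mapper hands it to `java.text.SimpleDateFormat` (an assumption about the implementation, worth checking against the storm-hive source), pattern letters are case sensitive: `yyyy` is the calendar year and `dd` the day of the month, while `YYYY` is the week year and `DD` the day of the year. The sketch below uses only the JDK; the class `PartitionDateFormatSketch` and its helpers are hypothetical names.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper comparing two date patterns; JDK only.
class PartitionDateFormatSketch {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Formats a date with the given pattern in UTC so the result is reproducible.
    static String format(String pattern, Date date) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        formatter.setTimeZone(UTC);
        return formatter.format(date);
    }

    // Builds noon UTC on the given day; month is a Calendar constant.
    static Date utcDate(int year, int month, int day) {
        Calendar calendar = new GregorianCalendar(UTC, Locale.US);
        calendar.clear();
        calendar.set(year, month, day, 12, 0, 0);
        return calendar.getTime();
    }

    public static void main(String[] args) {
        Date date = utcDate(2015, Calendar.FEBRUARY, 10);
        System.out.println(format("yyyy/MM/dd", date)); // 2015/02/10
        System.out.println(format("YYYY/MM/DD", date)); // 2015/02/41 (day 41 of the year)
    }
}
```

Under that assumption, an upper-case pattern yields partition values that look plausible in January and drift afterwards, so the lower-case form is the safer choice for year/month/day partitions.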

HiveOptions (org.apache.storm.hive.common.HiveOptions)

HiveBolt takes in HiveOptions as a constructor arg.

HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper)
                              .withTxnsPerBatch(10)
                              .withBatchSize(1000)
                              .withIdleTimeout(10);

HiveOptions params

| Arg | Description | Type |
|---|---|---|
| metaStoreURI | Hive metastore URI (can be found in hive-site.xml) | String (required) |
| dbName | Database name | String (required) |
| tblName | Table name | String (required) |
| mapper | Mapper class to map tuple field names to table column names | DelimitedRecordHiveMapper or JsonRecordHiveMapper (required) |
| withTxnsPerBatch | Hive grants a batch of transactions instead of single transactions to streaming clients like HiveBolt. This setting configures the number of desired transactions per transaction batch. Data from all transactions in a single batch end up in a single file. HiveBolt will write a maximum of batchSize events in each transaction in the batch. This setting, in conjunction with batchSize, provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files. | Integer. Default 100 |
| withMaxOpenConnections | Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed. | Integer. Default 100 |
| withBatchSize | Maximum number of events written to Hive in a single Hive transaction | Integer. Default 15000 |
| withCallTimeout | Timeout, in milliseconds, for Hive & HDFS I/O operations such as openTxn, write, commit, abort | Integer. Default 10000 |
| withHeartBeatInterval | Interval, in seconds, between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats. | Integer. Default 240 |
| withAutoCreatePartitions | HiveBolt will automatically create the necessary Hive partitions to stream to | Boolean. Default true |
| withKerberosPrincipal | Kerberos user principal for accessing secure Hive | String |
| withKerberosKeytab | Kerberos keytab for accessing secure Hive | String |
| withTickTupleInterval | Interval in seconds. If > 0, HiveBolt will periodically flush transaction batches. Enabling this is recommended to avoid tuple timeouts while waiting for a batch to fill up. | Integer. Default 0 |
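
How withBatchSize and withTickTupleInterval interact can be sketched with a small buffer that flushes either when it is full or when a tick arrives. This is a simplified model using only the JDK, not HiveBolt's actual code: `BatchFlushSketch`, `add`, and `onTick` are hypothetical names, and a list of strings stands in for a Hive transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of size-based and tick-based flushing; not HiveBolt itself.
class BatchFlushSketch {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    BatchFlushSketch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    // Buffers one record and flushes as soon as the batch is full.
    void add(String record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Called on each tick: flushes whatever is waiting, even a partial batch.
    void onTick() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        flushed.add(new ArrayList<>(pending));
        pending.clear();
    }

    List<List<String>> flushedBatches() {
        return flushed;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        BatchFlushSketch buffer = new BatchFlushSketch(3);
        for (String record : new String[] {"a", "b", "c", "d"}) {
            buffer.add(record);
        }
        System.out.println(buffer.flushedBatches()); // [[a, b, c]]
        buffer.onTick();
        System.out.println(buffer.flushedBatches()); // [[a, b, c], [d]]
    }
}
```

Without the tick, record `d` would wait until two more records arrive, which is the kind of stall that can lead to the tuple timeouts mentioned in the withTickTupleInterval description.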

HiveState (org.apache.storm.hive.trident.HiveState)

The Hive Trident state follows a pattern similar to HiveBolt: it takes HiveOptions as an argument.

   DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
            .withColumnFields(new Fields(colNames))
            .withTimeAsPartitionField("yyyy/MM/dd");

   HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper)
                                .withTxnsPerBatch(10)
                                .withBatchSize(1000)
                                .withIdleTimeout(10);

   StateFactory factory = new HiveStateFactory().withOptions(hiveOptions);
   TridentState state = stream.partitionPersist(factory, hiveFields, new HiveUpdater(), new Fields());

Committer Sponsors