<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta name=Title content="HCatalog Streaming API">
<meta name=Keywords content="HCatalog Streaming ACID">
<meta http-equiv=Content-Type content="text/html; charset=macintosh">
<title>HCatalog Streaming API</title>
</head>
<body>
<h1>HCatalog Streaming API -- high level description</h1>
<b>NOTE: The Streaming API feature is provided as a technology
preview. The API may undergo incompatible changes in upcoming
releases.</b>
<p>
Traditionally, adding new data into Hive requires gathering a large
amount of data onto HDFS and then periodically adding a new
partition. This is essentially a <i>batch insertion</i>; inserting
new data into an existing partition or table is not done in a way that
gives consistent results to readers. The Hive Streaming API allows data to
be pumped continuously into Hive. The incoming data can be
continuously committed in small batches (of records) into a Hive
partition. Once data is committed it becomes immediately visible to
all Hive queries initiated subsequently.</p>
<p>
This API is intended for streaming clients such as NiFi, Flume and Storm,
which continuously generate data. Streaming support is built on top of
ACID-based insert/update support in Hive.</p>
<p>
The classes and interfaces that make up the Hive streaming API are broadly
categorized into two sets. The first set provides support for connection
and transaction management, while the second set provides I/O
support. Transactions are managed by the Hive MetaStore. Writes are
performed to HDFS via Hive wrapper APIs that bypass the MetaStore. </p>
<p>
<b>Note on packaging</b>: The APIs are defined in the
<b>org.apache.hive.streaming</b> and
<b>org.apache.hive.streaming.transaction</b> Java packages and are
shipped in the hive-streaming jar.</p>
<h2>STREAMING REQUIREMENTS</h2>
<p>
A few things are currently required to use streaming.
</p>
<p>
<ol>
<li> Currently, only the ORC storage format is supported, so
'<b>stored as orc</b>' must be specified during table creation (see the
example table definition after this list).</li>
<li> The Hive table may be bucketed but must not be sorted. </li>
<li> The user of the client streaming process must have the necessary
permissions to write to the table or partition and to create partitions in
the table.</li>
<li> Currently, when issuing queries on streaming tables, the query client must set
<ol>
  <li><b>hive.input.format =
  org.apache.hadoop.hive.ql.io.HiveInputFormat</b></li>
</ol>
This client setting is a temporary requirement, and the intention is to
drop the need for it in the near future.</li>
<li> Settings required in hive-site.xml for Metastore:
<ol>
<li><b>hive.txn.manager =
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</b></li>
<li><b>hive.support.concurrency = true </b> </li>
<li><b>hive.compactor.initiator.on = true</b> </li>
<li><b>hive.compactor.worker.threads > 0 </b> </li>
</ol></li>
</ol></p>
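<p>
For illustration, the following is a minimal sketch of creating a table that
satisfies the above requirements, using the Hive JDBC driver. The HiveServer2
URL, table name, columns, partition keys, and bucket count are placeholders,
and the <b>'transactional'='true'</b> table property is shown as it is
commonly set for ACID tables; adapt all of these to your deployment.</p>
<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateStreamingTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 URL; adjust host, port, and database.
    String jdbcUrl = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement stmt = conn.createStatement()) {
      // A bucketed (but not sorted) ORC table, as required by the streaming API.
      stmt.execute("CREATE TABLE alerts (id INT, msg STRING) "
          + "PARTITIONED BY (continent STRING, country STRING) "
          + "CLUSTERED BY (id) INTO 5 BUCKETS "
          + "STORED AS ORC "
          + "TBLPROPERTIES ('transactional' = 'true')");
    }
  }
}
</pre>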
<p>
<b>Note:</b> Streaming to <b>unpartitioned</b> tables is also
supported.</p>
<h2>Transaction and Connection management</h2>
<p>
The class <a href="HiveEndPoint.html"><b>HiveEndPoint</b></a> describes a
Hive endpoint to connect to. An endpoint is either a Hive table or
partition. An endpoint is cheap to create and does not internally hold
on to any network connections. Invoking the newConnection method on
it creates a new connection to the Hive MetaStore for streaming
purposes. It returns a
<a href="StreamingConnection.html"><b>StreamingConnection</b></a>
object. Multiple connections can be established on the same
endpoint. The StreamingConnection can then be used to initiate new
transactions for performing I/O. </p>
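<p>
For example, the following is a minimal sketch of creating an endpoint and
opening a streaming connection. The metastore URI, database, table, and
partition values are placeholders, and the import statements assume the
package named in the packaging note above.</p>
<pre>
import java.util.Arrays;

import org.apache.hive.streaming.HiveEndPoint;
import org.apache.hive.streaming.StreamingConnection;

public class ConnectExample {
  public static void main(String[] args) throws Exception {
    // Endpoint identifying one partition of the target table (placeholder values).
    HiveEndPoint endPoint = new HiveEndPoint(
        "thrift://localhost:9083",           // metastore URI
        "default",                           // database
        "alerts",                            // table
        Arrays.asList("Asia", "India"));     // partition values

    // Open a connection to the Hive MetaStore for streaming; 'false' means the
    // partition is expected to exist already (see Dynamic Partition Creation below).
    StreamingConnection connection = endPoint.newConnection(false);
    try {
      // ... fetch transaction batches and write records here ...
    } finally {
      connection.close();
    }
  }
}
</pre>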
<h3>Dynamic Partition Creation:</h3>
<p>
In a setup where data is being streamed continuously (e.g. via Flume), it
is often desirable to have new partitions created automatically (say, on
an hourly basis). In such cases, requiring the Hive admin to pre-create
the necessary partitions may not be reasonable. Consequently, the
streaming API allows streaming clients to create partitions as
needed. <b>HiveEndPoint.newConnection()</b> accepts an argument that
indicates whether the partition should be auto-created. Partition creation
is an atomic action, so multiple clients can race to create the
partition, but only one will succeed; streaming clients therefore need not
synchronize when creating a partition. The user of the client process
needs to be given write permissions on the Hive table in order to
create partitions.</p>
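<p>
Continuing the connection sketch above, the same (placeholder) endpoint can be
asked to create its partition automatically by passing <b>true</b> to
newConnection:</p>
<pre>
// Ask the streaming API to create the partition if it does not exist yet.
// Concurrent clients racing on the same partition are safe, as described above.
StreamingConnection connection = endPoint.newConnection(true);
</pre>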
<h3>Batching Transactions:</h3>
<p>
Transactions are implemented slightly differently than in traditional
database systems. Multiple transactions are grouped into a
<i>Transaction Batch</i>, and each transaction has an id. Data from each
transaction batch is written to a single file on HDFS, which eventually
gets compacted with other files into a larger file automatically for
efficiency.</p>
<h3>Basic Steps:</h3>
<p>
After the connection is established, a streaming client first requests a
new batch of transactions. In response it receives a set of transaction
ids that are part of the transaction batch. Subsequently the client
proceeds to consume one transaction at a time by initiating new
transactions. The client will write() one or more records per transaction
and either commit or abort the current transaction before switching to
the next one. Each
<b>TransactionBatch.write()</b> invocation automatically associates
the I/O attempt with the current transaction id. The user of the
streaming client needs to have write permissions to the partition or
table.</p>
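<p>
Putting these steps together, the following is a minimal sketch of one write
cycle, continuing the connection sketch above. The writer shown here parses
comma-delimited records and is described in the Writing Data section below;
the field names, delimiter, batch size, and records are placeholders.</p>
<pre>
// Writer that maps comma-delimited byte[] records to the table columns.
DelimitedInputWriter writer =
    new DelimitedInputWriter(new String[] {"id", "msg"}, ",", endPoint);

// Request a batch of (up to) 10 transactions on the open connection.
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
try {
  while (txnBatch.remainingTransactions() > 0) {
    txnBatch.beginNextTransaction();               // switch to the next transaction id
    txnBatch.write("1,Hello streaming".getBytes());
    txnBatch.write("2,Welcome to streaming".getBytes());
    txnBatch.commit();                             // or txnBatch.abort()
  }
} finally {
  txnBatch.close();
}
</pre>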
<p>
<b>Concurrency Note:</b> I/O can be performed on multiple
<b>TransactionBatch</b>es concurrently. However, the transactions within a
transaction batch must be consumed sequentially.</p>
<h2>Writing Data</h2>
<p>
These classes and interfaces provide support for writing the data to
Hive within a transaction.
<a href="RecordWriter.html"><b>RecordWriter</b></a> is the interface
implemented by all writers. A writer is responsible for taking a
record in the form of a <b>byte[]</b> containing data in a known
format (e.g. CSV) and writing it out in the format supported by Hive
streaming. A <b>RecordWriter</b> may reorder or drop fields from the incoming
record if necessary to map them to the corresponding columns in the
Hive Table. A streaming client will instantiate an appropriate
<b>RecordWriter</b> type and pass it to
<b>StreamingConnection.fetchTransactionBatch()</b>. The streaming client
does not directly interact with the <b>RecordWriter</b> therafter, but
relies on the <b>TransactionBatch</b> to do so.</p>
<p>
Currently, out of the box, the streaming API provides three
implementations of the <b>RecordWriter</b> interface: one for delimited
input data (such as CSV or tab-separated text), one for JSON (strict
syntax), and one for text input parsed using a regex. Support for other
input formats can be provided by additional implementations of the
<b>RecordWriter</b> interface.
<ul>
<li> <a href="DelimitedInputWriter.html"><b>DelimitedInputWriter</b></a>
- Delimited text input.</li>
<li> <a href="StrictJsonWriter.html"><b>StrictJsonWriter</b></a>
- JSON text input.</li>
<li> <a href="StrictRegexWriter.html"><b>StrictRegexWriter</b></a>
- Text input parsed using a regex.</li>
</ul></p>
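<p>
For example, a writer for strict-syntax JSON input can be constructed from the
placeholder endpoint used earlier and passed to fetchTransactionBatch in place
of the delimited writer. The record content is a placeholder, and the JSON keys
are assumed to match the table column names.</p>
<pre>
// Parses each record as a strict-syntax JSON object whose keys match column names.
StrictJsonWriter jsonWriter = new StrictJsonWriter(endPoint);

TransactionBatch txnBatch = connection.fetchTransactionBatch(10, jsonWriter);
txnBatch.beginNextTransaction();
txnBatch.write("{\"id\": 1, \"msg\": \"Hello streaming\"}".getBytes());
txnBatch.commit();
txnBatch.close();
</pre>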
<h2>Performance, Concurrency, Etc.</h2>
<p>
Each StreamingConnection writes data at the rate the underlying
FileSystem can accept it. If that is not sufficient, multiple StreamingConnection objects can
be created concurrently.
</p>
<p>
Each StreamingConnection can have at most one outstanding TransactionBatch, and each TransactionBatch
may have at most two threads operating on it.
See <a href="TransactionBatch.html"><b>TransactionBatch</b></a>.
</p>
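<p>
For instance, under the two-thread limit above, one thread can run the
beginNextTransaction()/write()/commit() cycle from the earlier sketch while a
second thread periodically keeps the batch alive. This sketch assumes the
TransactionBatch in use exposes a heartbeat() call; the scheduling interval is
a placeholder.</p>
<pre>
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// ...
ScheduledExecutorService heartbeater = Executors.newSingleThreadScheduledExecutor();
heartbeater.scheduleAtFixedRate(() -> {
  try {
    txnBatch.heartbeat();   // second thread: keep the batch's open transactions alive
  } catch (Exception e) {
    // log and decide whether to abort the batch
  }
}, 1, 1, TimeUnit.MINUTES); // placeholder interval

// first thread: beginNextTransaction()/write()/commit() as in the sketch above
</pre>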
</body>
</html>