blob: c5b0d13c5ddf05f53637fe7c2e8f3f23ce3145ad [file] [log] [blame]
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/html">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<head>
<meta charset="utf-8"/>
<title>PutBigQuery</title>
<link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css"/>
</head>
<body>
<h1>Streaming Versus Batching Data</h1>
<p>
PutBigQuery is record based and is relying on the gRPC based Write API using protocol buffers. The underlying stream supports both streaming and batching approaches.
</p>
<h3>Streaming</h3>
<p>
With streaming the appended data to the stream is instantly available in BigQuery for reading. It is configurable how many records (rows) should be appended at once.
Only one stream is established per flow file so at the conclusion of the FlowFile processing the used stream is closed and a new one is opened for the next FlowFile.
Supports exactly once delivery semantics via stream offsets.
</p>
<h3>Batching</h3>
<p>
Similarly to the streaming approach one stream is opened for each FlowFile and records are appended to the stream. However data is not available in BigQuery until it is
committed by the processor at the end of the FlowFile processing.
</p>
<h1>Improvement opportunities</h1>
<p>
<ul>
<li>The table has to exist on BigQuery side it is not created automatically</li>
<li>The Write API supports multiple streams for parallel execution and transactionality across streams. This is not utilized at the moment as this would be covered on NiFI framework level.</li>
</ul>
</p>
<p>
The official <a href="https://cloud.google.com/bigquery/docs/write-api">Google Write API documentation</a> provides additional details.
</p>
</body>
</html>