blob: 690864a5d2449143caa96c88daaa93f79d4338f7 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<head>
<meta charset="utf-8" />
<title>PutDynamoDBRecord</title>
<link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" />
</head>
<body>
<h2>Description</h2>
<p>
<i>PutDynamoDBRecord</i> intends to provide the capability to insert multiple Items into a DynamoDB table from a record-oriented FlowFile.
Compared to the <i>PutDynamoDB</i>, this processor is capable to process data based other than JSON format too and prepared to add multiple fields for a given Item.
Also, <i>PutDynamoDBRecord</i> is designed to insert bigger batches of data into the database.
</p>
<h2>Data types</h2>
<p>
The list data types supported by DynamoDB does not fully overlaps with the capabilities of the Record data structure.
Some conversions and simplifications are necessary during inserting the data. These are:
</p>
<ul>
<li>Numeric values are stored using a floating-point data structure within Items. In some cases this representation might cause issues with the accuracy.</li>
<li>Char is not a supported type within DynamoDB, these fields are converted into String values.</li>
<li>Enum types are stored as String fields, using the name of the given enum.</li>
<li>DynamoDB stores time and date related information as Strings.</li>
<li>Internal record structures are converted into maps.</li>
<li>Choice is not a supported data type, regardless of the actual wrapped data type, values enveloped in Choice are handled as Strings.</li>
<li>Unknown data types are handled as stings.</li>
</ul>
<h2>Limitations</h2>
<p>
Working with DynamoDB when batch inserting comes with two inherit limitations. First, the number of inserted Items is limited to 25 in any case.
In order to overcome this, during one execution, depending on the number or records in the incoming FlowFile, <i>PutDynamoDBRecord</i> might attempt multiple
insert calls towards the database server. Using this approach, the flow does not have to work with this limitation in most cases.
</p>
<p>
Having multiple external actions comes with the risk of having an unforeseen result at one of the steps.
For example when the incoming FlowFile is consists of 70 records, it will be split into 3 chunks, with a single insert operation for every chunk.
The first two chunks contains 25 Items to insert per chunk, and the third contains the remaining 20. In some cases it might occur that the first two insert operation succeeds but the third one fails.
In these cases we consider the FlowFile "partially processed" and we will transfer it to the "failure" or "unprocessed" Relationship according to the nature of the issue.
In order to keep the information about the successfully processed chunks the processor assigns the <i>"dynamodb.chunks.processed"</i> attribute to the FlowFile, which has the number of successfully processed chunks as value.
</p>
<p>
The most common reason for this behaviour comes from the other limitation the inserts have with DynamoDB: the database has a build in supervision over the amount of inserted data.
When a client reaches the "throughput limit", the server refuses to process the insert request until a certain amount of time. More information on this might be find <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html">here</a>.
From the perspective of the <i>PutDynamoDBRecord</i> we consider these cases as temporary issues and the FlowFile will be transferred to the "unprocessed" Relationship after which the processor will yield in order to avoid further throughput issues.
(Other kinds of failures will result transfer to the "failure" Relationship)
</p>
<h2>Retry</h2>
<p>
It is suggested to loop back the "unprocessed" Relationship to the <i>PutDynamoDBRecord</i> in some way. FlowFiles transferred to that relationship considered as healthy ones might be successfully processed in a later point.
It is possible that the FlowFile contains such a high number of records, what needs more than two attempts to fully insert.
The attribute "dynamodb.chunks.processed" is "rolled" through the attempts, which means, after each trigger it will contain the sum number of inserted chunks making it possible for the later attempts to continue from the right point without duplicated inserts.
</p>
<h2>Partition and sort keys</h2>
<p>
The processor supports multiple strategies for assigning partition key and sort key to the inserted Items. These are:
</p>
<h3>Partition Key Strategies</h3>
<h4>Partition By Field</h4>
<p>
The processors assigns one of the record fields as partition key. The name of the record field is specified by the "Partition Key Field" property and the value will be the value of the record field with the same name.
</p>
<h4>Partition By Attribute</h4>
<p>
The processor assigns the value of a FlowFile attribute as partition key. With this strategy all the Items within a FlowFile will share the same partition key value and it is suggested to use for tables also having a sort key in order to meet the primary key requirements of the DynamoDB.
The property "Partition Key Field" defines the name of the Item field and the property "Partition Key Attribute" will specify which attribute's value will be assigned to the partition key.
With this strategy the "Partition Key Field" must be different from the fields consisted by the incoming records.
</p>
<h4>Generated UUID</h4>
<p>
By using this strategy the processor will generate a UUID identifier for every single Item. This identifier will be used as value for the partition key.
The name of the field used as partition key is defined by the property "Partition Key Field".
With this strategy the "Partition Key Field" must be different from the fields consisted by the incoming records.
When using this strategy, the partition key in the DynamoDB table must have String data type.
</p>
<h3>Sort Key Strategies</h3>
<h4>None</h4>
<p>
No sort key will be assigned to the Item. In case of the table definition expects it, using this strategy will result unsuccessful inserts.
</p>
<h4>Sort By Field</h4>
<p>
The processors assigns one of the record fields as sort key. The name of the record field is specified by the "Sort Key Field" property and the value will be the value of the record field with the same name.
With this strategy the "Sort Key Field" must be different from the fields consisted by the incoming records.
</p>
<h4>Generate Sequence</h4>
<p>
The processor assigns a generated value to every Item based on the original record's position in the incoming FlowFile (regardless of the chunks).
The first Item will have the sort key 1, the second will have sort key 2 and so on. The generated keys are unique within a given FlowFile.
The name of the record field is specified by the "Sort Key Field" attribute.
With this strategy the "Sort Key Field" must be different from the fields consisted by the incoming records.
When using this strategy, the sort key in the DynamoDB table must have Number data type.
</p>
<h2>Examples</h2>
<h3>Using fields as partition and sort key</h3>
<h4>Setup</h4>
<ul>
<li>Partition Key Strategy: Partition By Field</li>
<li>Partition Key Field: class</li>
<li>Sort Key Strategy: Sort By Field</li>
<li>Sort Key Field: size</li>
</ul>
<p>
Note: both fields have to exist in the incoming records!
</p>
<h4>Result</h4>
<p>
Using this pair of strategies will result Items identical to the incoming record (not counting the representational changes from the conversion).
The field specified by the properties are added to the Items normally with the only difference of flagged as (primary) key items.
</p>
<h4>Input</h4>
<code>
[{"type": "A", "subtype": 4, "class" : "t", "size": 1}]
</code>
<h4>Output (stylized)</h4>
<ul>
<li>type: String field with value "A"</li>
<li>subtype: Number field with value 4</li>
<li>class: String field with value "t" and serving as partition key</li>
<li>size: Number field with value 1 and serving as sort key</li>
</ul>
<h3>Using FlowFile filename as partition key with generated sort key</h3>
<h4>Setup</h4>
<ul>
<li>Partition Key Strategy: Partition By Attribute</li>
<li>Partition Key Field: source</li>
<li>Partition Key Attribute: filename</li>
<li>Sort Key Strategy: Generate Sequence</li>
<li>Sort Key Field: sort</li>
</ul>
<h4>Result</h4>
<p>
The FlowFile's filename attribute will be used as partition key. In this case all the records within the same FlowFile will share the same partition key.
In order to avoid collusion, if FlowFiles contain multiple records, using sort key is suggested.
In this case a generated sequence is used which is guaranteed to be unique within a given FlowFile.
</p>
<h4>Input</h4>
<code>
[
{"type": "A", "subtype": 4, "class" : "t", "size": 1},
{"type": "B", "subtype": 5, "class" : "m", "size": 2}
]
</code>
<h4>Output (stylized)</h4>
<h5>First Item</h5>
<ul>
<li>source: String field with value "data46362.json" and serving as partition key</li>
<li>type: String field with value "A"</li>
<li>subtype: Number field with value 4</li>
<li>class: String field with value "t"</li>
<li>size: Number field with value 1</li>
<li>sort: Number field with value 1 and serving as sort key</li>
</ul>
<h5>Second Item</h5>
<ul>
<li>source: String field with value "data46362.json" and serving as partition key</li>
<li>type: String field with value "B"</li>
<li>subtype: Number field with value 5</li>
<li>class: String field with value "m"</li>
<li>size: Number field with value 2</li>
<li>sort: Number field with value 2 and serving as sort key</li>
</ul>
<h3>Using generated partition key</h3>
<h4>Setup</h4>
<ul>
<li>Partition Key Strategy: Generated UUID</li>
<li>Partition Key Field: identifier</li>
<li>Sort Key Strategy: None</li>
</ul>
<h4>Result</h4>
<p>
A generated UUID will be used as partition key. A different UUID will be generated for every Item.
</p>
<h4>Input</h4>
<code>
[{"type": "A", "subtype": 4, "class" : "t", "size": 1}]
</code>
<h4>Output (stylized)</h4>
<ul>
<li>identifier: String field with value "872ab776-ed73-4d37-a04a-807f0297e06e" and serving as partition key</li>
<li>type: String field with value "A"</li>
<li>subtype: Number field with value 4</li>
<li>class: String field with value "t"</li>
<li>size: Number field with value 1</li>
</ul>
</body>
</html>