| <!DOCTYPE html> |
| <html lang="en"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| http://www.apache.org/licenses/LICENSE-2.0 |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <head> |
| <meta charset="utf-8" /> |
| <title>SampleRecord</title> |
| |
| <link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" /> |
| </head> |
| <body> |
| <p>This processor takes in a record set and samples records from the set according to the specified sampling |
| strategy. The available sampling strategies are:</p> |
| <ul> |
| <li><b>Interval Sampling</b> |
| <p>Select every <i>N</i>th record based on the value of the Sampling Interval property. For example, if there are |
| 100 records in the set and the Sampling Interval is set to 4, there will be 25 records in the output, |
| namely every 4th record. This performs uniform sampling of the record set so is best suited for record |
| sets that are uniformly distributed. For example a record set representing user information that is |
| uniformly distributed will result in the output records also being uniformly distributed. The outgoing |
| record count is deterministic and is exactly the total number of records divided by the Sampling Interval value. |
| </p> |
| </li> |
| <li><b>Probabilistic Sampling</b> |
| <p>Select each record with probability <i>P</i>, an integer percentage specified by the Sampling Probability value. |
| For example, an incoming record set of 100 records with a Sampling Probability value of 20 should have |
| roughly 20 records in the output. Use this when you want to output record sets of roughly the same size |
| (but not exactly) and when you want each record to have the same "chance" to be selected for the output |
| set. As another example, if you send the same flow file into the processor twice, a sampling strategy of |
| Interval Sampling will always produce the same output, where Probabilistic Sampling may output different |
| records (and a different total number of records). |
| </p> |
| </li> |
| <li><b>Reservoir Sampling</b> |
| <p>Select <i>K</i> records from a record set having <i>N</i> total values, where <i>K</i> is the value of |
| the Reservoir Size property and each record has an equal probability of being selected (exactly K / N). |
| For example, an incoming record set of 100 records with a Reservoir Size value of 20 should have |
| exactly 20 records in the output, randomly chosen from the input record set. Use this when you want to |
| control the exact number of output records and have each input record have the same probability of being |
| selected. As another example, if you send the same flow file into the processor twice, a sampling strategy of |
| Interval Sampling will always produce the same output (same records and number of records), where |
| Probabilistic Sampling may output different records (and a different total number of records), and |
| Reservoir Sampling may output different records but the same total number of records. Note that the |
| reservoir is kept in-memory, so if the size of the reservoir is very large, it may cause memory issues. |
| </p> |
| </li> |
| </ul> |
| <p> |
| The "Random Seed" property applies to strategies/algorithms that use a pseudorandom random number generator, such as |
| Probabilistic Sampling and Reservoir Sampling. The property is optional but if set will guarantee the same records |
| in a flow file will be selected by the algorithm each time. This is useful for testing flows using non-deterministic |
| algorithms such as Probabilistic Sampling and Reservoir Sampling. |
| </p> |
| </body> |
| </html> |