<!DOCTYPE html>
<html lang="en">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<head>
<meta charset="utf-8" />
<title>GenerateTableFetch</title>
<link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" />
</head>
<body>
<p>
GenerateTableFetch uses its properties and the specified database connection to generate FlowFiles
containing SQL statements that can be used to fetch "pages" (aka "partitions") of data from a table.
    GenerateTableFetch executes a query against the database to determine the current row count and maximum value;
    if Maximum Value Columns are specified, it counts only the rows whose values for those columns are larger
    than the maximum values last observed by GenerateTableFetch. This allows for incremental fetching of
    "new" rows, rather than generating SQL to fetch the entire table each time. If no Maximum Value Columns
    are set, then the processor will generate SQL to fetch the entire table each time.
</p>
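<p>
    As a purely illustrative sketch (the exact statements depend on the configured properties and database
    adapter), suppose the table is "myTable", the Maximum Value Column is an integer "id", and the largest
    value observed on the previous run was 1000. The SQL generated on the next run would then be restricted
    to the new rows with a clause along the lines of:
</p>
<pre>
-- hypothetical example: only rows beyond the last observed maximum are fetched
SELECT * FROM myTable WHERE id > 1000
</pre>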
<p>
    In order to generate SQL that will fetch pages/partitions of data, by default GenerateTableFetch
    generates SQL that orders the data based on the Maximum Value Columns (if present) and uses the row
    numbers of the result set to determine each page. For example, if the Maximum Value Column is an integer
    column "id" and the Partition Size is 10, then the SQL for the first page might be "SELECT * FROM myTable
    LIMIT 10" and the second page might be "SELECT * FROM myTable OFFSET 10 LIMIT 10", and so on.
</p>
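<p>
    Continuing the example above, with a Partition Size of 10 the sequence of generated statements might
    look like the following sketch (the exact paging syntax depends on the database adapter in use):
</p>
<pre>
SELECT * FROM myTable LIMIT 10            -- page 1, rows 1-10
SELECT * FROM myTable OFFSET 10 LIMIT 10  -- page 2, rows 11-20
SELECT * FROM myTable OFFSET 20 LIMIT 10  -- page 3, rows 21-30
</pre>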
<p>
Ordering the data can be an expensive operation depending on the database, the number of rows, etc.
Alternatively, it is possible to specify a column whose values will be used to determine the pages, using
    the Column for Value Partitioning property. If set, GenerateTableFetch will determine the minimum and
    maximum values for the column and use the minimum value as the initial offset. Each page then covers a
    range of column values whose width is the Partition Size, starting from the initial offset, and the
    number of pages is determined by the total difference in values (i.e. maximum - minimum) divided by the
    Partition Size. For example, if the column "id" is used for value partitioning and contains values 100
    to 200, then with a Partition Size of 10 the SQL to fetch the first page might be "SELECT * FROM
    myTable WHERE id >= 100 AND id < 110" and the second page might be "SELECT * FROM myTable WHERE id >= 110
    AND id < 120", and so on.
</p>
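<p>
    To make the arithmetic concrete, the example above (values 100 to 200, Partition Size 10) would produce
    page boundaries that advance by the Partition Size rather than by row number, roughly as follows (an
    illustrative sketch, not necessarily the processor's exact output):
</p>
<pre>
-- min(id) = 100, max(id) = 200, Partition Size = 10
SELECT * FROM myTable WHERE id >= 100 AND id < 110   -- page 1
SELECT * FROM myTable WHERE id >= 110 AND id < 120   -- page 2
SELECT * FROM myTable WHERE id >= 120 AND id < 130   -- page 3
</pre>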
<p>
    For best performance, it is important that the Column for Value Partitioning be set to a column whose
    type can be coerced to a long integer (i.e. not a date or timestamp) and whose values are evenly
    distributed and not sparse. As a counterexample to the above, consider a column "id" whose values are 100,
2000, and 30000. If the Partition Size is 100, then the column values are relatively sparse, so the SQL
for the "second page" (see above example) will return zero rows, and so will every page until the value in
the query becomes "id >= 2000". Another counterexample is when the values are not uniformly distributed.
Consider a column "id" with values 100, 200, 201, 202, ... 299. Then the SQL for the first page (see above
example) will return one row with value id = 100, and the second page will return 100 rows with values 200
... 299. This can cause inconsistent processing times downstream, as the pages may contain a very different
number of rows. For these reasons it is recommended to use a Column for Value Partitioning that is
sufficiently dense (not sparse) and fairly evenly distributed.
</p>
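<p>
    For the sparse counterexample above (values 100, 2000, and 30000 with a Partition Size of 100), the
    generated pages would look roughly like the following, with most pages returning no rows at all:
</p>
<pre>
SELECT * FROM myTable WHERE id >= 100 AND id < 200    -- 1 row  (id = 100)
SELECT * FROM myTable WHERE id >= 200 AND id < 300    -- 0 rows
SELECT * FROM myTable WHERE id >= 300 AND id < 400    -- 0 rows
-- ... many more empty pages ...
SELECT * FROM myTable WHERE id >= 2000 AND id < 2100  -- 1 row  (id = 2000)
</pre>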
</body>
</html>