 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept rev="2.0.0" id="exec_single_node_rows_threshold">

   <title>EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (<keyword keyref="impala21"/> or higher only)</title>
   <titlealts audience="PDF"><navtitle>EXEC_SINGLE_NODE_ROWS_THRESHOLD</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Impala Query Options"/>
       <data name="Category" value="Scalability"/>
       <data name="Category" value="Performance"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p rev="2.0.0">
       <indexterm audience="hidden">EXEC_SINGLE_NODE_ROWS_THRESHOLD query option</indexterm>
      This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query
      as a <q>small</q> query, turning off optimizations such as parallel execution and native code generation. The
      overhead of these optimizations pays off for queries that process substantial amounts of data, but it
      makes sense to skip them for queries that process tiny amounts of data. Reducing the overhead for small queries
      lets Impala complete them more quickly and keeps admission control slots, CPU, and memory
      available for resource-intensive queries.
     </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

 <codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=<varname>number_of_rows</varname></codeblock>

     <p>
       <b>Type:</b> numeric
     </p>

     <p>
       <b>Default:</b> 100
     </p>

     <p>
      <b>Usage notes:</b> Typically, you increase the default value to make this optimization apply to more queries.
      If incorrect or corrupted table and column statistics cause Impala to apply this optimization
      to queries that actually involve substantial work, those queries might run slower because of
      remote reads. In that case, recompute statistics with the <codeph>COMPUTE STATS</codeph>
      or <codeph>COMPUTE INCREMENTAL STATS</codeph> statement. If you cannot collect accurate
      statistics, you can turn this feature off by setting the value to -1.
     </p>
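
    <p>
      The following sketch illustrates the two remedies described above. The table name
      <codeph>sales_data</codeph> is hypothetical and stands for any table whose statistics are suspect:
    </p>

 <codeblock>-- Hypothetical table name; substitute a table with stale or missing statistics.
 -- Option 1: recompute statistics so that row-count estimates are accurate again.
 COMPUTE STATS sales_data;

 -- Option 2: if accurate statistics cannot be collected, turn the
 -- small-query optimization off for the current session.
 SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=-1;
 </codeblock>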

     <p conref="../shared/impala_common.xml#common/internals_blurb"/>

     <p>
       This setting applies to queries where the number of rows processed can be accurately
       determined, either through table and column statistics, or by the presence of a
       <codeph>LIMIT</codeph> clause. If Impala cannot accurately estimate the number of rows,
       then this setting does not apply.
     </p>
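
    <p>
      One way to check whether Impala has the statistics it needs for this estimate is to inspect them
      directly. In this sketch, the table name <codeph>sales_data</codeph> is hypothetical. In the
      <codeph>SHOW TABLE STATS</codeph> output, a value of -1 in the <codeph>#Rows</codeph> column indicates
      that the row count is unknown:
    </p>

 <codeblock>-- Hypothetical table name, used only for illustration.
 -- Check the #Rows column; -1 means no row count has been recorded.
 SHOW TABLE STATS sales_data;

 -- Check the per-column statistics as well.
 SHOW COLUMN STATS sales_data;
 </codeblock>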

     <p rev="2.3.0">
       In <keyword keyref="impala23_full"/> and higher, where Impala supports the complex data types <codeph>STRUCT</codeph>,
       <codeph>ARRAY</codeph>, and <codeph>MAP</codeph>, if a query refers to any column of those types,
       the small-query optimization is turned off for that query regardless of the
       <codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> setting.
     </p>

     <p>
       For a query that is determined to be <q>small</q>, all work is performed on the coordinator node. This might
       result in some I/O being performed by remote reads. The savings from not distributing the query work and not
       generating native code are expected to outweigh any overhead from the remote reads.
     </p>

     <p conref="../shared/impala_common.xml#common/added_in_210"/>

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p>
       A common use case is to query just a few rows from a table to inspect typical data values. In this example,
       Impala does not parallelize the query or perform native code generation because the result set is guaranteed
       to be smaller than the threshold value from this query option:
     </p>

 <codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;
 SELECT * FROM enormous_table LIMIT 300;
 </codeblock>
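
    <p>
      To see how Impala plans such a query, you can prefix it with <codeph>EXPLAIN</codeph>. The following
      sketch reuses the table name from the example above. It assumes that a plan without exchange nodes
      indicates that all work is planned for a single node; the exact plan output varies by release:
    </p>

 <codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;
 -- Display the plan without running the query.
 EXPLAIN SELECT * FROM enormous_table LIMIT 300;
 </codeblock>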

 <!-- Don't have any other places that tie into this particular optimization technique yet.
 Potentially: conceptual topics about code generation, distributed queries

 <p conref="../shared/impala_common.xml#common/related_info"/>
 <p>
 </p>
 -->

   </conbody>

 </concept>