
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

 <meta name="copyright" content="(C) Copyright 2023" />
 <meta name="DC.rights.owner" content="(C) Copyright 2023" />
 <meta name="DC.Type" content="concept" />
 <meta name="DC.Title" content="MAX_ROW_SIZE Query Option" />
 <meta name="DC.Relation" scheme="URI" content="../topics/impala_set.html" />
 <meta name="prodname" content="Impala" />
 <meta name="version" content="Impala 3.4.x" />
 <meta name="DC.Format" content="XHTML" />
 <meta name="DC.Identifier" content="max_row_size" />
 <link rel="stylesheet" type="text/css" href="../commonltr.css" />
 <title>MAX_ROW_SIZE Query Option</title>
 </head>
 <body id="max_row_size">


   <h1 class="title topictitle1" id="ariaid-title1">MAX_ROW_SIZE Query Option</h1>


   <div class="body conbody">

     <p class="p">

       Ensures that Impala can process rows of at least the specified size. (Larger
       rows might be successfully processed, but that is not guaranteed.) Applies when
       constructing intermediate or final rows in the result set. This setting prevents
       out-of-control memory use when accessing columns containing huge strings.
     </p>


     <p class="p">
         <strong class="ph b">Type:</strong> integer
       </p>


     <p class="p">
         <strong class="ph b">Default:</strong>
       </p>

     <p class="p">
       <code class="ph codeph">524288</code> (512 KB)
     </p>


     <p class="p">
         <strong class="ph b">Units:</strong> A numeric argument represents a size in bytes; you can also use a suffix
         of <code class="ph codeph">m</code> or <code class="ph codeph">mb</code> for megabytes, or <code class="ph codeph">g</code> or
         <code class="ph codeph">gb</code> for gigabytes. If you specify a value in an unrecognized format,
         subsequent queries fail with an error.
       </p>
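
     <p class="p">
       As a minimal sketch of the formats described above, each of the following statements
       sets the option to a valid value. The specific sizes are arbitrary illustrations, not
       recommendations, and the byte count in the first comment assumes that the megabyte
       suffix means 1024 * 1024 bytes, consistent with the 524288 (512 KB) default:
     </p>

 <pre class="pre codeblock"><code>
 -- A plain number is interpreted as a size in bytes.
 set max_row_size=1048576;

 -- The same size expressed with the megabyte suffix (assuming 1 MB = 1024 * 1024 bytes).
 set max_row_size=1mb;

 -- The gigabyte suffix; a deliberately large value, shown only to illustrate the syntax.
 set max_row_size=1gb;
 </code></pre>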


     <p class="p">
         <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.10.0</span>
       </p>


     <p class="p">
         <strong class="ph b">Usage notes:</strong>
       </p>

     <p class="p">
       If a query fails because its rows contain long strings, many columns, or both,
       so that the total row size exceeds <code class="ph codeph">MAX_ROW_SIZE</code>
       bytes, increase the <code class="ph codeph">MAX_ROW_SIZE</code> setting to accommodate
       the total bytes stored in the largest row. Examine the error message for each
       failed query to see the size of the row that caused the problem.
     </p>

     <p class="p">
       Impala attempts to handle rows that exceed the <code class="ph codeph">MAX_ROW_SIZE</code>
       value where practical, so in many cases, queries succeed despite having rows
       that are larger than this setting.
     </p>

     <p class="p">
       Specifying a value that is substantially higher than actually needed can cause
       Impala to reserve more memory than is necessary to execute the query.
     </p>
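
     <p class="p">
       One way to limit that extra reservation, sketched here under the assumption that only
       one query in the session needs large rows, is to raise the option just before that
       query and restore the documented default afterward. The table
       <code class="ph codeph">wide_docs</code>, its <code class="ph codeph">body</code> column,
       and the 8 MB value are hypothetical:
     </p>

 <pre class="pre codeblock"><code>
 -- Raise the limit only for the query that is known to need it.
 set max_row_size=8mb;
 select count(distinct body) from wide_docs;

 -- Restore the documented default (524288 bytes, 512 KB) for the rest of the session.
 set max_row_size=524288;
 </code></pre>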

     <p class="p">
       In a Hadoop cluster with highly concurrent workloads and queries that process
       high volumes of data, traditional SQL tuning advice about minimizing wasted memory
       is worth remembering. For example, if a table has <code class="ph codeph">STRING</code> columns
       where a single value might be multiple megabytes, make sure that the
       <code class="ph codeph">SELECT</code> lists in queries refer only to columns that are actually
       needed in the result set, instead of using the <code class="ph codeph">SELECT *</code> shorthand.
     </p>
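
     <p class="p">
       To illustrate the advice above, the following sketch uses a hypothetical table
       <code class="ph codeph">article_text</code> with columns <code class="ph codeph">id</code>,
       <code class="ph codeph">title</code>, and <code class="ph codeph">body</code>, where
       <code class="ph codeph">body</code> might hold multi-megabyte values. Whether the first
       query actually fails depends on the data and on the current
       <code class="ph codeph">MAX_ROW_SIZE</code> setting:
     </p>

 <pre class="pre codeblock"><code>
 -- Materializes every column, including the potentially huge BODY values.
 select * from article_text where id = 42;

 -- Materializes only the small columns that the result set actually needs.
 select id, title from article_text where id = 42;
 </code></pre>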


     <p class="p">
         <strong class="ph b">Examples:</strong>
       </p>


     <p class="p">
       The following examples show the kinds of situations where it is necessary to
       adjust the <code class="ph codeph">MAX_ROW_SIZE</code> setting. First, we create a table
       containing some very long values in <code class="ph codeph">STRING</code> columns:
     </p>


 <pre class="pre codeblock"><code>
 create table big_strings (s1 string, s2 string, s3 string) stored as parquet;

 -- Turn off compression to more easily reason about data volume by doing SHOW TABLE STATS.
 -- Does not actually affect query success or failure, because MAX_ROW_SIZE applies when
 -- column values are materialized in memory.
 set compression_codec=none;
 set;
 ...
   MAX_ROW_SIZE: [524288]
 ...

 -- A very small row.
 insert into big_strings values ('one', 'two', 'three');
 -- A row right around the default MAX_ROW_SIZE limit: a 500,000-byte string and a 30,000-byte string.
 insert into big_strings values (repeat('12345',100000), 'short', repeat('123',10000));
 -- A row that is too big if the query has to materialize both S1 and S3.
 insert into big_strings values (repeat('12345',100000), 'short', repeat('12345',100000));

 </code></pre>

     <p class="p">
       With the default <code class="ph codeph">MAX_ROW_SIZE</code> setting, different queries succeed
       or fail based on which column values have to be materialized during query processing:
     </p>


 <pre class="pre codeblock"><code>
 -- All the S1 values can be materialized within the 512 KB MAX_ROW_SIZE buffer.
 select count(distinct s1) from big_strings;
 +--------------------+
 | count(distinct s1) |
 +--------------------+
 | 2                  |
 +--------------------+

 -- A row where even the S1 value is too large to materialize within MAX_ROW_SIZE.
 insert into big_strings values (repeat('12345',1000000), 'short', repeat('12345',1000000));

 -- The 5,000,000-byte (4.77 MiB) string is too large to materialize. The message reports the
 -- size of the result set row the query is attempting to materialize.
 select count(distinct(s1)) from big_strings;
 WARNINGS: Row of size 4.77 MB could not be materialized in plan node with id 1.
   Increase the max_row_size query option (currently 512.00 KB) to process larger rows.

 -- If more columns are involved, the result set row being materialized is bigger.
 select count(distinct s1, s2, s3) from big_strings;
 WARNINGS: Row of size 9.54 MB could not be materialized in plan node with id 1.
   Increase the max_row_size query option (currently 512.00 KB) to process larger rows.

 -- Column S2, containing only short strings, can still be examined.
 select count(distinct(s2)) from big_strings;
 +----------------------+
 | count(distinct (s2)) |
 +----------------------+
 | 2                    |
 +----------------------+

 -- Queries that do not materialize the big column values are OK.
 select count(*) from big_strings;
 +----------+
 | count(*) |
 +----------+
 | 4        |
 +----------+

 </code></pre>

     <p class="p">
       The following examples show how adjusting <code class="ph codeph">MAX_ROW_SIZE</code> upward
       allows queries involving the long string columns to succeed:
     </p>


 <pre class="pre codeblock"><code>
 -- Boosting MAX_ROW_SIZE moderately allows all S1 values to be materialized.
 set max_row_size=7mb;

 select count(distinct s1) from big_strings;
 +--------------------+
 | count(distinct s1) |
 +--------------------+
 | 3                  |
 +--------------------+

 -- But the combination of S1 + S3 strings is still too large.
 select count(distinct s1, s2, s3) from big_strings;
 WARNINGS: Row of size 9.54 MB could not be materialized in plan node with id 1. Increase the max_row_size query option (currently 7.00 MB) to process larger rows.

 -- Boosting MAX_ROW_SIZE to larger than the largest row in the table allows
 -- all queries to complete successfully.
 set max_row_size=12mb;

 select count(distinct s1, s2, s3) from big_strings;
 +----------------------------+
 | count(distinct s1, s2, s3) |
 +----------------------------+
 | 4                          |
 +----------------------------+

 </code></pre>

     <p class="p">
       The following examples show how to reason about appropriate values for
       <code class="ph codeph">MAX_ROW_SIZE</code>, based on the characteristics of the
       columns containing the long values:
     </p>


 <pre class="pre codeblock"><code>
 -- With a large MAX_ROW_SIZE in place, we can examine the columns to
 -- understand the practical lower limit for MAX_ROW_SIZE based on the
 -- table structure and column values.
 select max(length(s1) + length(s2) + length(s3)) / 1e6 as megabytes from big_strings;
 +-----------+
 | megabytes |
 +-----------+
 | 10.000005 |
 +-----------+

 -- We can also examine the 'Max Size' for each column after computing stats.
 compute stats big_strings;
 show column stats big_strings;
 +--------+--------+------------------+--------+----------+-----------+
 | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size  |
 +--------+--------+------------------+--------+----------+-----------+
 | s1     | STRING | 2                | -1     | 5000000  | 2500002.5 |
 | s2     | STRING | 2                | -1     | 10       | 7.5       |
 | s3     | STRING | 2                | -1     | 5000000  | 2500005   |
 +--------+--------+------------------+--------+----------+-----------+

 </code></pre>
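
     <p class="p">
       The error messages in the earlier examples report row sizes of 4.77 MB and 9.54 MB,
       while the query above reports about 10 megabytes. The difference appears to come from
       units: the messages seem to count in multiples of 1024 * 1024 bytes, the same convention
       as the 524288 (512 KB) default. As a sketch under that assumption, dividing by
       1024 * 1024 instead of 1e6 gives a figure that can be compared directly with a setting
       such as <code class="ph codeph">12mb</code>:
     </p>

 <pre class="pre codeblock"><code>
 -- Largest combined string length per row, expressed in units of 1024 * 1024 bytes.
 -- This counts only string bytes; any per-row overhead is not included.
 select max(length(s1) + length(s2) + length(s3)) / (1024 * 1024) as mebibytes from big_strings;
 </code></pre>

     <p class="p">
       For the sample data, 10000005 / 1048576 is approximately 9.54, which matches the size
       reported in the error message and is below the <code class="ph codeph">12mb</code>
       setting that allowed the query to succeed.
     </p>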

     <p class="p">
         <strong class="ph b">Related information:</strong>
       </p>

     <p class="p">
       <a class="xref" href="impala_buffer_pool_limit.html">BUFFER_POOL_LIMIT Query Option</a>,
       <a class="xref" href="impala_default_spillable_buffer_size.html">DEFAULT_SPILLABLE_BUFFER_SIZE Query Option</a>,
       <a class="xref" href="impala_min_spillable_buffer_size.html">MIN_SPILLABLE_BUFFER_SIZE Query Option</a>,
       <a class="xref" href="impala_scalability.html">Scalability Considerations for Impala</a>
     </p>


   </div>

 <div class="related-links">
 <div class="familylinks">
 <div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_set.html">SET Statement</a></div>
 </div>
 </div></body>
 </html>