docs/topics/impala_ndv.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept rev="1.2.1" id="ndv">

   <title>NDV Function</title>
   <titlealts audience="PDF"><navtitle>NDV</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Impala Functions"/>
       <data name="Category" value="Aggregate Functions"/>
       <data name="Category" value="Querying"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       <indexterm audience="hidden">ndv() function</indexterm>
       An aggregate function that returns an approximate value similar to the result of <codeph>COUNT(DISTINCT
       <varname>col</varname>)</codeph>, the <q>number of distinct values</q>. It is much faster than the
       combination of <codeph>COUNT</codeph> and <codeph>DISTINCT</codeph>, and uses a constant amount of memory and
       thus is less memory-intensive for columns with high cardinality.
     </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

 <codeblock>NDV([DISTINCT | ALL] <varname>expression</varname>)</codeblock>

     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

     <p rev="1.2.2">
       This is the mechanism used internally by the <codeph>COMPUTE STATS</codeph> statement for computing the
       number of distinct values in a column.
     </p>

     <p>
       Because this number is an estimate, it might not reflect the precise number of different values in the
       column, especially if the cardinality is very low or very high. If the estimated number is higher than the
       number of rows in the table, Impala adjusts the value internally during query planning.
     </p>

     <p conref="../shared/impala_common.xml#common/former_odd_return_type_string"/>

 <!-- <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> -->

     <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

     <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>

     <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>

     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

     <p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/>

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p>
       The following example queries a billion-row table to illustrate the relative performance of
       <codeph>COUNT(DISTINCT)</codeph> and <codeph>NDV()</codeph>. It shows how <codeph>COUNT(DISTINCT)</codeph>
       gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient.
       The <codeph>NDV()</codeph> function gives an approximate result but is much faster.
     </p>

 <codeblock>select count(distinct col1) from sample_data;
 +---------------------+
 | count(distinct col1)|
 +---------------------+
 | 100000              |
 +---------------------+
 Fetched 1 row(s) in 20.13s

 select cast(ndv(col1) as bigint) as col1 from sample_data;
 +----------+
 | col1     |
 +----------+
 | 139017   |
 +----------+
 Fetched 1 row(s) in 8.91s
 </codeblock>

     <p>
       The following example shows how you can code multiple <codeph>NDV()</codeph> calls in a single query, to
       easily learn which columns have substantially more or fewer distinct values. This technique is faster than
       running a sequence of queries with <codeph>COUNT(DISTINCT)</codeph> calls.
     </p>

 <codeblock>select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
     cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
   from sample_data;
 +----------+-----------+------------+-----------+
 | col1     | col2      | col3       | col4      |
 +----------+-----------+------------+-----------+
 | 139017   | 282       | 46         | 145636240 |
 +----------+-----------+------------+-----------+
 Fetched 1 row(s) in 34.97s

 select count(distinct col1) from sample_data;
 +---------------------+
 | count(distinct col1)|
 +---------------------+
 | 100000              |
 +---------------------+
 Fetched 1 row(s) in 20.13s

 select count(distinct col2) from sample_data;
 +----------------------+
 | count(distinct col2) |
 +----------------------+
 | 278                  |
 +----------------------+
 Fetched 1 row(s) in 20.09s

 select count(distinct col3) from sample_data;
 +-----------------------+
 | count(distinct col3)  |
 +-----------------------+
 | 46                    |
 +-----------------------+
 Fetched 1 row(s) in 19.12s

 select count(distinct col4) from sample_data;
 +----------------------+
 | count(distinct col4) |
 +----------------------+
 | 147135880            |
 +----------------------+
 Fetched 1 row(s) in 266.95s
 </codeblock>
   </conbody>
 </concept>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
	<concept rev="1.2.1" id="ndv">

	<title>NDV Function</title>
	<titlealts audience="PDF"><navtitle>NDV</navtitle></titlealts>
	<prolog>
	<metadata>
	<data name="Category" value="Impala"/>
	<data name="Category" value="SQL"/>
	<data name="Category" value="Impala Functions"/>
	<data name="Category" value="Aggregate Functions"/>
	<data name="Category" value="Querying"/>
	<data name="Category" value="Developers"/>
	<data name="Category" value="Data Analysts"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="hidden">ndv() function</indexterm>
	An aggregate function that returns an approximate value similar to the result of <codeph>COUNT(DISTINCT
	<varname>col</varname>)</codeph>, the <q>number of distinct values</q>. It is much faster than the
	combination of <codeph>COUNT</codeph> and <codeph>DISTINCT</codeph>, and uses a constant amount of memory and
	thus is less memory-intensive for columns with high cardinality.
	</p>

	<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

	<codeblock>NDV([DISTINCT \| ALL] <varname>expression</varname>)</codeblock>

	<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

	<p rev="1.2.2">
	This is the mechanism used internally by the <codeph>COMPUTE STATS</codeph> statement for computing the
	number of distinct values in a column.
	</p>

	<p>
	Because this number is an estimate, it might not reflect the precise number of different values in the
	column, especially if the cardinality is very low or very high. If the estimated number is higher than the
	number of rows in the table, Impala adjusts the value internally during query planning.
	</p>

	<p conref="../shared/impala_common.xml#common/former_odd_return_type_string"/>

	<!-- <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> -->

	<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

	<p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>

	<p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>

	<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

	<p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/>

	<p conref="../shared/impala_common.xml#common/example_blurb"/>

	<p>
	The following example queries a billion-row table to illustrate the relative performance of
	<codeph>COUNT(DISTINCT)</codeph> and <codeph>NDV()</codeph>. It shows how <codeph>COUNT(DISTINCT)</codeph>
	gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient.
	The <codeph>NDV()</codeph> function gives an approximate result but is much faster.
	</p>

	<codeblock>select count(distinct col1) from sample_data;
	+---------------------+
	\| count(distinct col1)\|
	+---------------------+
	\| 100000 \|
	+---------------------+
	Fetched 1 row(s) in 20.13s

	select cast(ndv(col1) as bigint) as col1 from sample_data;
	+----------+
	\| col1 \|
	+----------+
	\| 139017 \|
	+----------+
	Fetched 1 row(s) in 8.91s
	</codeblock>

	<p>
	The following example shows how you can code multiple <codeph>NDV()</codeph> calls in a single query, to
	easily learn which columns have substantially more or fewer distinct values. This technique is faster than
	running a sequence of queries with <codeph>COUNT(DISTINCT)</codeph> calls.
	</p>

	<codeblock>select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
	cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
	from sample_data;
	+----------+-----------+------------+-----------+
	\| col1 \| col2 \| col3 \| col4 \|
	+----------+-----------+------------+-----------+
	\| 139017 \| 282 \| 46 \| 145636240 \|
	+----------+-----------+------------+-----------+
	Fetched 1 row(s) in 34.97s

	select count(distinct col1) from sample_data;
	+---------------------+
	\| count(distinct col1)\|
	+---------------------+
	\| 100000 \|
	+---------------------+
	Fetched 1 row(s) in 20.13s

	select count(distinct col2) from sample_data;
	+----------------------+
	\| count(distinct col2) \|
	+----------------------+
	\| 278 \|
	+----------------------+
	Fetched 1 row(s) in 20.09s

	select count(distinct col3) from sample_data;
	+-----------------------+
	\| count(distinct col3) \|
	+-----------------------+
	\| 46 \|
	+-----------------------+
	Fetched 1 row(s) in 19.12s

	select count(distinct col4) from sample_data;
	+----------------------+
	\| count(distinct col4) \|
	+----------------------+
	\| 147135880 \|
	+----------------------+
	Fetched 1 row(s) in 266.95s
	</codeblock>
	</conbody>
	</concept>