 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Aggregate functions for Column operations</title>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <link rel="stylesheet" type="text/css" href="R.css" />

 <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css">
 <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
 <script>hljs.initHighlightingOnLoad();</script>
 </head><body>

 <table width="100%" summary="page for column_aggregate_functions {SparkR}"><tr><td>column_aggregate_functions {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table>

 <h2>Aggregate functions for Column operations</h2>

 <h3>Description</h3>

 <p>Aggregate functions defined for <code>Column</code>.
 </p>


 <h3>Usage</h3>

 <pre>
 approxCountDistinct(x, ...)

 collect_list(x)

 collect_set(x)

 countDistinct(x, ...)

 grouping_bit(x)

 grouping_id(x, ...)

 kurtosis(x)

 n_distinct(x, ...)

 sd(x, na.rm = FALSE)

 skewness(x)

 stddev(x)

 stddev_pop(x)

 stddev_samp(x)

 sumDistinct(x)

 var(x, y = NULL, na.rm = FALSE, use)

 variance(x)

 var_pop(x)

 var_samp(x)

 ## S4 method for signature 'Column'
 approxCountDistinct(x, rsd = 0.05)

 ## S4 method for signature 'Column'
 kurtosis(x)

 ## S4 method for signature 'Column'
 max(x)

 ## S4 method for signature 'Column'
 mean(x)

 ## S4 method for signature 'Column'
 min(x)

 ## S4 method for signature 'Column'
 sd(x)

 ## S4 method for signature 'Column'
 skewness(x)

 ## S4 method for signature 'Column'
 stddev(x)

 ## S4 method for signature 'Column'
 stddev_pop(x)

 ## S4 method for signature 'Column'
 stddev_samp(x)

 ## S4 method for signature 'Column'
 sum(x)

 ## S4 method for signature 'Column'
 sumDistinct(x)

 ## S4 method for signature 'Column'
 var(x)

 ## S4 method for signature 'Column'
 variance(x)

 ## S4 method for signature 'Column'
 var_pop(x)

 ## S4 method for signature 'Column'
 var_samp(x)

 ## S4 method for signature 'Column'
 countDistinct(x, ...)

 ## S4 method for signature 'Column'
 n_distinct(x, ...)

 ## S4 method for signature 'Column'
 collect_list(x)

 ## S4 method for signature 'Column'
 collect_set(x)

 ## S4 method for signature 'Column'
 grouping_bit(x)

 ## S4 method for signature 'Column'
 grouping_id(x, ...)
 </pre>


 <h3>Arguments</h3>

 <table summary="R argblock">
 <tr valign="top"><td><code>x</code></td>
 <td>
 <p>Column to compute on.</p>
 </td></tr>
 <tr valign="top"><td><code>...</code></td>
 <td>
 <p>additional argument(s). For example, it could be used to pass additional Columns.</p>
 </td></tr>
 <tr valign="top"><td><code>y, na.rm, use</code></td>
 <td>
 <p>currently not used.</p>
 </td></tr>
 <tr valign="top"><td><code>rsd</code></td>
 <td>
 <p>maximum relative standard deviation allowed for the estimate (default = 0.05).</p>
 </td></tr>
 </table>


 <h3>Details</h3>

 <p><code>approxCountDistinct</code>: Returns the approximate number of distinct items in a group.
 </p>
 <p><code>kurtosis</code>: Returns the kurtosis of the values in a group.
 </p>
 <p><code>max</code>: Returns the maximum value of the expression in a group.
 </p>
 <p><code>mean</code>: Returns the average of the values in a group. Alias for <code>avg</code>.
 </p>
 <p><code>min</code>: Returns the minimum value of the expression in a group.
 </p>
 <p><code>sd</code>: Alias for <code>stddev_samp</code>.
 </p>
 <p><code>skewness</code>: Returns the skewness of the values in a group.
 </p>
 <p><code>stddev</code>: Alias for <code>stddev_samp</code>.
 </p>
 <p><code>stddev_pop</code>: Returns the population standard deviation of the expression in a group.
 </p>
 <p><code>stddev_samp</code>: Returns the unbiased sample standard deviation of the expression in a group.
 </p>
 <p><code>sum</code>: Returns the sum of all values in the expression.
 </p>
 <p><code>sumDistinct</code>: Returns the sum of distinct values in the expression.
 </p>
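 <p>As a rough local illustration of how <code>sum</code> and <code>sumDistinct</code> differ, the sketch below uses base R on the built-in <code>mtcars</code> data frame rather than a SparkR <code>Column</code>. The variable names are illustrative only, and the base R calls are stand-ins for the SparkR functions, not their implementation.
 </p>
 <pre><code class="r"># Base R analogue on local data (assumption: no Spark session is involved)
 gear &lt;- mtcars$gear

 # Analogue of sum(df$gear): every row contributes
 total &lt;- sum(gear)

 # Analogue of sumDistinct(df$gear): each distinct value contributes once
 distinct_total &lt;- sum(unique(gear))

 print(c(total = total, distinct_total = distinct_total))
 </code></pre>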
 <p><code>var</code>: Alias for <code>var_samp</code>.
 </p>
 <p><code>variance</code>: Alias for <code>var_samp</code>.
 </p>
 <p><code>var_pop</code>: Returns the population variance of the values in a group.
 </p>
 <p><code>var_samp</code>: Returns the unbiased variance of the values in a group.
 </p>
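 <p>The population and sample variants of the variance and standard deviation described above differ only in the divisor. A minimal base R sketch on the local <code>mtcars</code> data, assuming the usual definitions (divide by <code>n</code> for the population form and by <code>n - 1</code> for the unbiased sample form), could look like this. The <code>*_local</code> names are hypothetical and are not part of SparkR.
 </p>
 <pre><code class="r"># Base R analogue on local data (assumption: standard textbook definitions)
 x &lt;- mtcars$mpg
 n &lt;- length(x)

 # Sum of squared deviations from the mean
 ss &lt;- sum((x - mean(x))^2)

 var_samp_local &lt;- ss / (n - 1)  # analogue of var_samp / var / variance
 var_pop_local &lt;- ss / n         # analogue of var_pop

 sd_samp_local &lt;- sqrt(var_samp_local)  # analogue of stddev_samp / sd / stddev
 sd_pop_local &lt;- sqrt(var_pop_local)    # analogue of stddev_pop

 print(c(var_samp = var_samp_local, var_pop = var_pop_local,
         sd_samp = sd_samp_local, sd_pop = sd_pop_local))
 </code></pre>
 <p>Base R's own <code>var()</code> and <code>sd()</code> use the sample form, which is why the SparkR methods of the same name are aliases for the <code>_samp</code> variants.
 </p>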
 <p><code>countDistinct</code>: Returns the number of distinct items in a group.
 </p>
 <p><code>n_distinct</code>: Returns the number of distinct items in a group.
 </p>
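 <p>When additional Columns are passed through <code>...</code>, <code>countDistinct</code> counts distinct combinations of values rather than distinct values of a single column. The base R sketch below mimics this on the local <code>mtcars</code> data. It is only an analogue with illustrative variable names, not a SparkR call.
 </p>
 <pre><code class="r"># Base R analogue on local data (assumption: no Spark session is involved)

 # Analogue of n_distinct(df$gear): distinct values of one column
 n_gear &lt;- length(unique(mtcars$gear))

 # Analogue of countDistinct(df$gear, df$cyl): distinct (gear, cyl) pairs
 n_pairs &lt;- nrow(unique(mtcars[, c(&quot;gear&quot;, &quot;cyl&quot;)]))

 print(c(n_gear = n_gear, n_pairs = n_pairs))
 </code></pre>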
 <p><code>collect_list</code>: Creates a list of objects, keeping duplicates.
 Note: the function is non-deterministic because the order of the collected results depends
 on the order of the rows, which may be non-deterministic after a shuffle.
 </p>
 <p><code>collect_set</code>: Creates a list of objects with duplicate elements eliminated.
 Note: the function is non-deterministic because the order of the collected results depends
 on the order of the rows, which may be non-deterministic after a shuffle.
 </p>
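 <p>The difference between the two collectors, and one way to make comparisons independent of row order, can be sketched in base R on local data. This is an analogue under the assumption that sorting the collected values is acceptable for the use case, and the variable names are illustrative. In SparkR, wrapping the result in <code>sort_array</code> may serve a similar purpose.
 </p>
 <pre><code class="r"># Base R analogue on local data (assumption: no Spark session is involved)
 gear &lt;- mtcars$gear[mtcars$mpg &gt; 20]

 as_list &lt;- gear          # analogue of collect_list: duplicates are kept
 as_set &lt;- unique(gear)   # analogue of collect_set: duplicates are dropped

 # Sorting gives a canonical order, so results no longer depend on row order
 as_list_sorted &lt;- sort(as_list)
 as_set_sorted &lt;- sort(as_set)

 print(as_list_sorted)
 print(as_set_sorted)
 </code></pre>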
 <p><code>grouping_bit</code>: Indicates whether a specified column in a GROUP BY list is aggregated.
 Returns 1 if the column is aggregated in the result set and 0 otherwise. Equivalent to <code>GROUPING</code>
 in SQL and the <code>grouping</code> function in Scala.
 </p>
 <p><code>grouping_id</code>: Returns the level of grouping.
 For grouping columns <code>c1, ..., cn</code>, it is equal to <code>
 grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn)
 </code>.
 </p>
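 <p>The formula above treats the grouping bits as the digits of a binary number, with the first column as the most significant bit. A small base R sketch of that arithmetic follows. The helper <code>grouping_id_from_bits</code> is hypothetical and exists only for illustration; it is not part of SparkR.
 </p>
 <pre><code class="r"># Hypothetical helper: combine grouping bits (first column first) into an id
 grouping_id_from_bits &lt;- function(bits) {
   n &lt;- length(bits)
   # The i-th bit is weighted by 2^(n - i)
   sum(bits * 2^(n - seq_len(n)))
 }

 # Three grouping columns, e.g. cyl, gear, am
 id_detail &lt;- grouping_id_from_bits(c(0, 0, 0))     # no column aggregated
 id_am_rolled &lt;- grouping_id_from_bits(c(0, 0, 1))  # only the last column aggregated
 id_total &lt;- grouping_id_from_bits(c(1, 1, 1))      # all columns aggregated

 print(c(id_detail, id_am_rolled, id_total))
 </code></pre>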


 <h3>Note</h3>

 <p>approxCountDistinct(Column) since 1.4.0
 </p>
 <p>kurtosis since 1.6.0
 </p>
 <p>max since 1.5.0
 </p>
 <p>mean since 1.5.0
 </p>
 <p>min since 1.5.0
 </p>
 <p>sd since 1.6.0
 </p>
 <p>skewness since 1.6.0
 </p>
 <p>stddev since 1.6.0
 </p>
 <p>stddev_pop since 1.6.0
 </p>
 <p>stddev_samp since 1.6.0
 </p>
 <p>sum since 1.5.0
 </p>
 <p>sumDistinct since 1.4.0
 </p>
 <p>var since 1.6.0
 </p>
 <p>variance since 1.6.0
 </p>
 <p>var_pop since 1.5.0
 </p>
 <p>var_samp since 1.6.0
 </p>
 <p>approxCountDistinct(Column, numeric) since 1.4.0
 </p>
 <p>countDistinct since 1.4.0
 </p>
 <p>n_distinct since 1.4.0
 </p>
 <p>collect_list since 2.3.0
 </p>
 <p>collect_set since 2.3.0
 </p>
 <p>grouping_bit since 2.3.0
 </p>
 <p>grouping_id since 2.3.0
 </p>


 <h3>See Also</h3>

 <p>Other aggregate functions:
 <code><a href="avg.html">avg</a>()</code>,
 <code><a href="corr.html">corr</a>()</code>,
 <code><a href="count.html">count</a>()</code>,
 <code><a href="cov.html">cov</a>()</code>,
 <code><a href="first.html">first</a>()</code>,
 <code><a href="last.html">last</a>()</code>
 </p>


 <h3>Examples</h3>

 <pre><code class="r">## Not run:
 ##D # Dataframe used throughout this doc
 ##D df &lt;- createDataFrame(cbind(model = rownames(mtcars), mtcars))
 ## End(Not run)

 ## Not run:
 ##D head(select(df, approxCountDistinct(df$gear)))
 ##D head(select(df, approxCountDistinct(df$gear, 0.02)))
 ##D head(select(df, countDistinct(df$gear, df$cyl)))
 ##D head(select(df, n_distinct(df$gear)))
 ##D head(distinct(select(df, &quot;gear&quot;)))
 ## End(Not run)

 ## Not run:
 ##D head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg)))
 ## End(Not run)

 ## Not run:
 ##D head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec)))
 ##D
 ##D # metrics by num of cylinders
 ##D tmp &lt;- agg(groupBy(df, &quot;cyl&quot;), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec))
 ##D head(orderBy(tmp, &quot;cyl&quot;))
 ##D
 ##D # car with the max mpg
 ##D mpg_max &lt;- as.numeric(collect(agg(df, max(df$mpg))))
 ##D head(where(df, df$mpg == mpg_max))
 ## End(Not run)

 ## Not run:
 ##D head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec)))
 ## End(Not run)

 ## Not run:
 ##D head(select(df, sumDistinct(df$gear)))
 ##D head(distinct(select(df, &quot;gear&quot;)))
 ## End(Not run)

 ## Not run:
 ##D head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg)))
 ## End(Not run)

 ## Not run:
 ##D df2 &lt;- df[df$mpg &gt; 20, ]
 ##D collect(select(df2, collect_list(df2$gear)))
 ##D collect(select(df2, collect_set(df2$gear)))
 ## End(Not run)

 ## Not run:
 ##D # With cube
 ##D agg(
 ##D   cube(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
 ##D   mean(df$mpg),
 ##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
 ##D )
 ##D
 ##D # With rollup
 ##D agg(
 ##D   rollup(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
 ##D   mean(df$mpg),
 ##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
 ##D )
 ## End(Not run)

 ## Not run:
 ##D # With cube
 ##D agg(
 ##D   cube(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
 ##D   mean(df$mpg),
 ##D   grouping_id(df$cyl, df$gear, df$am)
 ##D )
 ##D
 ##D # With rollup
 ##D agg(
 ##D   rollup(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
 ##D   mean(df$mpg),
 ##D   grouping_id(df$cyl, df$gear, df$am)
 ##D )
 ## End(Not run)
 </code></pre>


 <hr /><div style="text-align: center;">[Package <em>SparkR</em> version 2.4.7 <a href="00Index.html">Index</a>]</div>
 </body></html>