<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Aggregate functions for Column operations</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head><body>
<table width="100%" summary="page for column_aggregate_functions {SparkR}"><tr><td>column_aggregate_functions {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table>
<h2>Aggregate functions for Column operations</h2>
<h3>Description</h3>
<p>Aggregate functions defined for <code>Column</code>.
</p>
<h3>Usage</h3>
<pre>
approxCountDistinct(x, ...)
collect_list(x)
collect_set(x)
countDistinct(x, ...)
grouping_bit(x)
grouping_id(x, ...)
kurtosis(x)
n_distinct(x, ...)
sd(x, na.rm = FALSE)
skewness(x)
stddev(x)
stddev_pop(x)
stddev_samp(x)
sumDistinct(x)
var(x, y = NULL, na.rm = FALSE, use)
variance(x)
var_pop(x)
var_samp(x)
## S4 method for signature 'Column'
approxCountDistinct(x, rsd = 0.05)
## S4 method for signature 'Column'
kurtosis(x)
## S4 method for signature 'Column'
max(x)
## S4 method for signature 'Column'
mean(x)
## S4 method for signature 'Column'
min(x)
## S4 method for signature 'Column'
sd(x)
## S4 method for signature 'Column'
skewness(x)
## S4 method for signature 'Column'
stddev(x)
## S4 method for signature 'Column'
stddev_pop(x)
## S4 method for signature 'Column'
stddev_samp(x)
## S4 method for signature 'Column'
sum(x)
## S4 method for signature 'Column'
sumDistinct(x)
## S4 method for signature 'Column'
var(x)
## S4 method for signature 'Column'
variance(x)
## S4 method for signature 'Column'
var_pop(x)
## S4 method for signature 'Column'
var_samp(x)
## S4 method for signature 'Column'
countDistinct(x, ...)
## S4 method for signature 'Column'
n_distinct(x, ...)
## S4 method for signature 'Column'
collect_list(x)
## S4 method for signature 'Column'
collect_set(x)
## S4 method for signature 'Column'
grouping_bit(x)
## S4 method for signature 'Column'
grouping_id(x, ...)
</pre>
<h3>Arguments</h3>
<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>Column to compute on.</p>
</td></tr>
<tr valign="top"><td><code>...</code></td>
<td>
<p>Additional argument(s), for example further Columns to include in the computation.</p>
</td></tr>
<tr valign="top"><td><code>y, na.rm, use</code></td>
<td>
<p>Currently not used.</p>
</td></tr>
<tr valign="top"><td><code>rsd</code></td>
<td>
<p>Maximum relative standard deviation allowed for the estimate (default = 0.05).</p>
</td></tr>
</table>
<h3>Details</h3>
<p><code>approxCountDistinct</code>: Returns the approximate number of distinct items in a group.
</p>
<p><code>kurtosis</code>: Returns the kurtosis of the values in a group.
</p>
<p><code>max</code>: Returns the maximum value of the expression in a group.
</p>
<p><code>mean</code>: Returns the average of the values in a group. Alias for <code>avg</code>.
</p>
<p><code>min</code>: Returns the minimum value of the expression in a group.
</p>
<p><code>sd</code>: Alias for <code>stddev_samp</code>.
</p>
<p><code>skewness</code>: Returns the skewness of the values in a group.
</p>
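<p>As a rough local illustration of the two shape statistics above, the sketch below computes skewness and kurtosis in base R from central moments. It assumes the population-moment definitions, with kurtosis reported as excess kurtosis (a normal distribution scores 0). That choice is an assumption here and should be checked against the Spark version in use. The helpers <code>central_moment</code>, <code>local_skewness</code> and <code>local_kurtosis</code> are hypothetical names, not SparkR functions.
</p>
<pre><code class="r"># Base R sketch on a local numeric vector; no Spark session is involved.
# k-th central moment, dividing by n (population form; an assumption).
central_moment &lt;- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the 1.5th power of the second.
local_skewness &lt;- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Excess kurtosis: fourth central moment over the squared second, minus 3.
local_kurtosis &lt;- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

x &lt;- c(1, 2, 3, 4, 5)
local_skewness(x)  # 0, because x is symmetric around its mean
local_kurtosis(x)  # 6.8 / 2^2 - 3 = -1.3
</code></pre>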
<p><code>stddev</code>: Alias for <code>stddev_samp</code>.
</p>
<p><code>stddev_pop</code>: Returns the population standard deviation of the expression in a group.
</p>
<p><code>stddev_samp</code>: Returns the unbiased sample standard deviation of the expression in a group.
</p>
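<p>The population and sample forms above differ only in the divisor. The following base R sketch works on a local vector rather than a Spark <code>Column</code>, and the helper <code>pop_sd</code> is a hypothetical name introduced for illustration.
</p>
<pre><code class="r"># Base R sketch; stats::sd() is the sample (n - 1) standard deviation.
x &lt;- mtcars$mpg
n &lt;- length(x)

# Population standard deviation: divide the sum of squares by n.
pop_sd &lt;- function(x) sqrt(mean((x - mean(x))^2))

samp &lt;- stats::sd(x)   # analogous to stddev_samp
pop  &lt;- pop_sd(x)      # analogous to stddev_pop

# The two differ by a factor of sqrt((n - 1) / n).
all.equal(pop, samp * sqrt((n - 1) / n))
</code></pre>
<p>The population value is never larger than the sample value, and the gap shrinks as the group size grows.
</p>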
<p><code>sum</code>: Returns the sum of all values in the expression.
</p>
<p><code>sumDistinct</code>: Returns the sum of distinct values in the expression.
</p>
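<p>To make the distinction between the two sums above concrete, here is a small base R analogue on the local <code>mtcars</code> data. It is an illustration only and does not call SparkR.
</p>
<pre><code class="r"># Base R sketch: each distinct value contributes once to the "distinct" sum.
gear &lt;- mtcars$gear      # values are 3, 4 or 5, with many repeats

sum(gear)                # analogous to sum: every row contributes
sum(unique(gear))        # analogous to sumDistinct: 3 + 4 + 5 = 12
</code></pre>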
<p><code>var</code>: Alias for <code>var_samp</code>.
</p>
<p><code>variance</code>: Alias for <code>var_samp</code>.
</p>
<p><code>var_pop</code>: Returns the population variance of the values in a group.
</p>
<p><code>var_samp</code>: Returns the unbiased sample variance of the values in a group.
</p>
<p><code>countDistinct</code>: Returns the number of distinct items in a group.
</p>
<p><code>n_distinct</code>: Returns the number of distinct items in a group.
</p>
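<p>A local base R analogue of distinct counting is sketched below. It assumes that passing more than one Column counts distinct combinations of values, which in base R corresponds to counting unique rows. No SparkR functions are called.
</p>
<pre><code class="r"># Base R sketch on the local mtcars data; no Spark session is involved.
# One column: number of distinct gear values.
length(unique(mtcars$gear))                             # 3

# Two columns: number of distinct (gear, cyl) combinations.
nrow(unique(mtcars[, c(&quot;gear&quot;, &quot;cyl&quot;)]))
</code></pre>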
<p><code>collect_list</code>: Creates a list of objects, keeping duplicates.
Note: the function is non-deterministic because the order of the collected results depends
on the order of the rows, which may be non-deterministic after a shuffle.
</p>
<p><code>collect_set</code>: Creates a list of objects with duplicate elements removed.
Note: the function is non-deterministic because the order of the collected results depends
on the order of the rows, which may be non-deterministic after a shuffle.
</p>
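<p>The difference between the two collectors above can be mimicked locally in base R. The sketch below groups <code>mtcars$gear</code> by cylinder count. The variable names are illustrative, and unlike the Spark versions the local result has a fixed element order.
</p>
<pre><code class="r"># Base R sketch: group gear values by cylinder count on the local mtcars data.
by_cyl &lt;- split(mtcars$gear, mtcars$cyl)

as_list &lt;- by_cyl                   # like collect_list: duplicates kept
as_set  &lt;- lapply(by_cyl, unique)   # like collect_set: duplicates dropped

lengths(as_list)   # one element per row in each group
lengths(as_set)    # one element per distinct gear in each group
</code></pre>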
<p><code>grouping_bit</code>: Indicates whether a specified column in a GROUP BY list is aggregated or
not. Returns 1 if the column is aggregated in the result set and 0 if it is not. Same as <code>GROUPING</code>
in SQL and the <code>grouping</code> function in Scala.
</p>
<p><code>grouping_id</code>: Returns the level of grouping for the <code>n</code> grouping columns
<code>c1, ..., cn</code>. Equal to <code>
grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn)
</code>.
</p>
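<p>The formula above reads the grouping bits as the digits of a binary number, with the first column as the most significant bit. A minimal base R sketch follows, where <code>grouping_id_from_bits</code> is a hypothetical helper and not part of SparkR:
</p>
<pre><code class="r"># bits: 0/1 values of grouping_bit(c1), ..., grouping_bit(cn), in column order.
grouping_id_from_bits &lt;- function(bits) {
  exponents &lt;- rev(seq_along(bits)) - 1   # n - 1, n - 2, ..., 0
  sum(bits * 2^exponents)
}

grouping_id_from_bits(c(0, 0, 0))  # 0: no column aggregated
grouping_id_from_bits(c(0, 1, 1))  # 0*4 + 1*2 + 1*1 = 3
grouping_id_from_bits(c(1, 1, 1))  # 4 + 2 + 1 = 7: every column aggregated
</code></pre>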
<h3>Note</h3>
<p>approxCountDistinct(Column) since 1.4.0
</p>
<p>kurtosis since 1.6.0
</p>
<p>max since 1.5.0
</p>
<p>mean since 1.5.0
</p>
<p>min since 1.5.0
</p>
<p>sd since 1.6.0
</p>
<p>skewness since 1.6.0
</p>
<p>stddev since 1.6.0
</p>
<p>stddev_pop since 1.6.0
</p>
<p>stddev_samp since 1.6.0
</p>
<p>sum since 1.5.0
</p>
<p>sumDistinct since 1.4.0
</p>
<p>var since 1.6.0
</p>
<p>variance since 1.6.0
</p>
<p>var_pop since 1.5.0
</p>
<p>var_samp since 1.6.0
</p>
<p>approxCountDistinct(Column, numeric) since 1.4.0
</p>
<p>countDistinct since 1.4.0
</p>
<p>n_distinct since 1.4.0
</p>
<p>collect_list since 2.3.0
</p>
<p>collect_set since 2.3.0
</p>
<p>grouping_bit since 2.3.0
</p>
<p>grouping_id since 2.3.0
</p>
<h3>See Also</h3>
<p>Other aggregate functions: <code><a href="avg.html">avg</a></code>,
<code><a href="corr.html">corr</a></code>, <code><a href="count.html">count</a></code>,
<code><a href="cov.html">cov</a></code>, <code><a href="first.html">first</a></code>,
<code><a href="last.html">last</a></code>
</p>
<h3>Examples</h3>
<pre><code class="r">## Not run:
##D # Dataframe used throughout this doc
##D df &lt;- createDataFrame(cbind(model = rownames(mtcars), mtcars))
## End(Not run)
## Not run:
##D head(select(df, approxCountDistinct(df$gear)))
##D head(select(df, approxCountDistinct(df$gear, 0.02)))
##D head(select(df, countDistinct(df$gear, df$cyl)))
##D head(select(df, n_distinct(df$gear)))
##D head(distinct(select(df, &quot;gear&quot;)))
## End(Not run)
## Not run:
##D head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg)))
## End(Not run)
## Not run:
##D head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec)))
##D
##D # metrics by num of cylinders
##D tmp &lt;- agg(groupBy(df, &quot;cyl&quot;), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec))
##D head(orderBy(tmp, &quot;cyl&quot;))
##D
##D # car with the max mpg
##D mpg_max &lt;- as.numeric(collect(agg(df, max(df$mpg))))
##D head(where(df, df$mpg == mpg_max))
## End(Not run)
## Not run:
##D head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec)))
## End(Not run)
## Not run:
##D head(select(df, sumDistinct(df$gear)))
##D head(distinct(select(df, &quot;gear&quot;)))
## End(Not run)
## Not run:
##D head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg)))
## End(Not run)
## Not run:
##D df2 &lt;- df[df$mpg &gt; 20, ]
##D collect(select(df2, collect_list(df2$gear)))
##D collect(select(df2, collect_set(df2$gear)))
## End(Not run)
## Not run:
##D # With cube
##D agg(
##D   cube(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
##D
##D # With rollup
##D agg(
##D   rollup(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
## End(Not run)
## Not run:
##D # With cube
##D agg(
##D   cube(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
##D
##D # With rollup
##D agg(
##D   rollup(df, &quot;cyl&quot;, &quot;gear&quot;, &quot;am&quot;),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
## End(Not run)
</code></pre>
<hr /><div style="text-align: center;">[Package <em>SparkR</em> version 2.4.0 <a href="00Index.html">Index</a>]</div>
</body></html>