| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Aggregate functions for Column operations</title> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> |
| <link rel="stylesheet" type="text/css" href="R.css" /> |
| |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css"> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script> |
| <script>hljs.initHighlightingOnLoad();</script> |
| </head><body> |
| |
| <table width="100%" summary="page for column_aggregate_functions {SparkR}"><tr><td>column_aggregate_functions {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table> |
| |
| <h2>Aggregate functions for Column operations</h2> |
| |
| <h3>Description</h3> |
| |
| <p>Aggregate functions defined for <code>Column</code>. |
| </p> |
| |
| |
| <h3>Usage</h3> |
| |
| <pre> |
| approxCountDistinct(x, ...) |
| |
| collect_list(x) |
| |
| collect_set(x) |
| |
| countDistinct(x, ...) |
| |
| grouping_bit(x) |
| |
| grouping_id(x, ...) |
| |
| kurtosis(x) |
| |
| n_distinct(x, ...) |
| |
| sd(x, na.rm = FALSE) |
| |
| skewness(x) |
| |
| stddev(x) |
| |
| stddev_pop(x) |
| |
| stddev_samp(x) |
| |
| sumDistinct(x) |
| |
| var(x, y = NULL, na.rm = FALSE, use) |
| |
| variance(x) |
| |
| var_pop(x) |
| |
| var_samp(x) |
| |
| ## S4 method for signature 'Column' |
| approxCountDistinct(x, rsd = 0.05) |
| |
| ## S4 method for signature 'Column' |
| kurtosis(x) |
| |
| ## S4 method for signature 'Column' |
| max(x) |
| |
| ## S4 method for signature 'Column' |
| mean(x) |
| |
| ## S4 method for signature 'Column' |
| min(x) |
| |
| ## S4 method for signature 'Column' |
| sd(x) |
| |
| ## S4 method for signature 'Column' |
| skewness(x) |
| |
| ## S4 method for signature 'Column' |
| stddev(x) |
| |
| ## S4 method for signature 'Column' |
| stddev_pop(x) |
| |
| ## S4 method for signature 'Column' |
| stddev_samp(x) |
| |
| ## S4 method for signature 'Column' |
| sum(x) |
| |
| ## S4 method for signature 'Column' |
| sumDistinct(x) |
| |
| ## S4 method for signature 'Column' |
| var(x) |
| |
| ## S4 method for signature 'Column' |
| variance(x) |
| |
| ## S4 method for signature 'Column' |
| var_pop(x) |
| |
| ## S4 method for signature 'Column' |
| var_samp(x) |
| |
| ## S4 method for signature 'Column' |
| approxCountDistinct(x, rsd = 0.05) |
| |
| ## S4 method for signature 'Column' |
| countDistinct(x, ...) |
| |
| ## S4 method for signature 'Column' |
| n_distinct(x, ...) |
| |
| ## S4 method for signature 'Column' |
| collect_list(x) |
| |
| ## S4 method for signature 'Column' |
| collect_set(x) |
| |
| ## S4 method for signature 'Column' |
| grouping_bit(x) |
| |
| ## S4 method for signature 'Column' |
| grouping_id(x, ...) |
| </pre> |
| |
| |
| <h3>Arguments</h3> |
| |
| <table summary="R argblock"> |
| <tr valign="top"><td><code>x</code></td> |
| <td> |
| <p>Column to compute on.</p> |
| </td></tr> |
| <tr valign="top"><td><code>...</code></td> |
| <td> |
| <p>additional argument(s), for example further Columns to include in the computation.</p> |
| </td></tr> |
| <tr valign="top"><td><code>y, na.rm, use</code></td> |
| <td> |
| <p>currently not used.</p> |
| </td></tr> |
| <tr valign="top"><td><code>rsd</code></td> |
| <td> |
| <p>maximum relative standard deviation allowed for the estimate (default = 0.05).</p> |
| </td></tr> |
| </table> |
| |
| |
| <h3>Details</h3> |
| |
| <p><code>approxCountDistinct</code>: Returns the approximate number of distinct items in a group. |
| </p> |
| <p><code>kurtosis</code>: Returns the kurtosis of the values in a group. |
| </p> |
| <p><code>max</code>: Returns the maximum value of the expression in a group. |
| </p> |
| <p><code>mean</code>: Returns the average of the values in a group. Alias for <code>avg</code>. |
| </p> |
| <p><code>min</code>: Returns the minimum value of the expression in a group. |
| </p> |
| <p><code>sd</code>: Alias for <code>stddev_samp</code>. |
| </p> |
| <p><code>skewness</code>: Returns the skewness of the values in a group. |
| </p> |
| <p><code>stddev</code>: Alias for <code>stddev_samp</code>. |
| </p> |
| <p><code>stddev_pop</code>: Returns the population standard deviation of the expression in a group. |
| </p> |
| <p><code>stddev_samp</code>: Returns the unbiased sample standard deviation of the expression in a group. |
| </p> |
| <p><code>sum</code>: Returns the sum of all values in the expression. |
| </p> |
| <p><code>sumDistinct</code>: Returns the sum of distinct values in the expression. |
| </p> |
| <p><code>var</code>: Alias for <code>var_samp</code>. |
| </p> |
| <p><code>variance</code>: Alias for <code>var_samp</code>. |
| </p> |
| <p><code>var_pop</code>: Returns the population variance of the values in a group. |
| </p> |
| <p><code>var_samp</code>: Returns the unbiased variance of the values in a group. |
| </p> |
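| <p>The difference between the sample and population variants can be checked locally. The following is a minimal base R sketch, assuming the built-in <code>mtcars</code> data; it does not call SparkR, and the variable names are illustrative only. |
| </p> |
| <pre><code class="r">x <- mtcars$mpg |
| n <- length(x) |
| |
| # Sample variance divides by n - 1 (the convention var_samp follows) |
| v_samp <- var(x) |
| # Population variance divides by n (the convention var_pop follows) |
| v_pop <- sum((x - mean(x))^2) / n |
| |
| # The two differ only by the factor (n - 1) / n |
| print(all.equal(v_pop, v_samp * (n - 1) / n)) |
| |
| # Standard deviations are the square roots of the matching variances |
| sd_samp <- sd(x) |
| sd_pop <- sqrt(v_pop) |
| print(c(sd_samp = sd_samp, sd_pop = sd_pop)) |
| </code></pre> |
| <p>For small groups the two estimates can differ noticeably, so it may be worth choosing <code>var_pop</code> or <code>var_samp</code> (and <code>stddev_pop</code> or <code>stddev_samp</code>) explicitly rather than relying on an alias. |
| </p> |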
| <p><code>countDistinct</code>: Returns the number of distinct items in a group. |
| </p> |
| <p><code>n_distinct</code>: Returns the number of distinct items in a group. |
| </p> |
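| <p>As a local point of comparison, the base R sketch below computes the quantities that the exact distinct aggregates correspond to, assuming the built-in <code>mtcars</code> data. It runs without Spark; the variable names are illustrative, and the handling of missing values in Spark may differ from base R. |
| </p> |
| <pre><code class="r"># Number of distinct values in one column, like countDistinct(df$gear) |
| n_gear <- length(unique(mtcars$gear)) |
| |
| # With several columns, distinct combinations of values are counted, |
| # like countDistinct(df$gear, df$cyl) |
| n_gear_cyl <- nrow(unique(mtcars[, c("gear", "cyl")])) |
| |
| # Sum over the distinct values only, like sumDistinct(df$gear) |
| sum_gear <- sum(unique(mtcars$gear)) |
| |
| print(c(n_gear = n_gear, n_gear_cyl = n_gear_cyl, sum_gear = sum_gear)) |
| </code></pre> |
| <p><code>approxCountDistinct</code> trades this exactness for speed, so its result may deviate from the exact count within the tolerance given by <code>rsd</code>. |
| </p> |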
| <p><code>collect_list</code>: Creates a list of objects, keeping duplicates. |
| Note: the function is non-deterministic because the order of the collected results depends |
| on the order of rows, which may be non-deterministic after a shuffle. |
| </p> |
| <p><code>collect_set</code>: Creates a list of objects with duplicate elements eliminated. |
| Note: the function is non-deterministic because the order of the collected results depends |
| on the order of rows, which may be non-deterministic after a shuffle. |
| </p> |
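| <p>Because element order is not guaranteed, results of these two functions are best compared without regard to order. The base R sketch below illustrates the idea locally, assuming the built-in <code>mtcars</code> data; it is an analogy and does not call SparkR. |
| </p> |
| <pre><code class="r">gears <- mtcars$gear[mtcars$mpg > 20] |
| |
| as_list <- gears          # analogous to collect_list: duplicates kept |
| as_set <- unique(gears)   # analogous to collect_set: duplicates dropped |
| |
| # Simulate a different row order, as might occur after a shuffle |
| shuffled <- rev(gears) |
| |
| # Order-insensitive comparisons still hold for the reordered data |
| print(setequal(as_set, unique(shuffled))) |
| print(identical(sort(as_list), sort(shuffled))) |
| </code></pre> |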
| <p><code>grouping_bit</code>: Indicates whether a specified column in a GROUP BY list is aggregated or |
| not. Returns 1 if the column is aggregated and 0 if it is not aggregated in the result set. Same as <code>GROUPING</code> |
| in SQL and the <code>grouping</code> function in Scala. |
| </p> |
| <p><code>grouping_id</code>: Returns the level of grouping. |
| For grouping columns <code>c1, ..., cn</code>, it is equal to <code> |
| grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn) |
| </code>. |
| </p> |
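| <p>As a rough illustration of the formula above, the following base R sketch combines grouping bits into a single id. The helper <code>grouping_id_from_bits</code> is hypothetical and not part of SparkR; it runs locally without Spark and only mirrors the arithmetic. |
| </p> |
| <pre><code class="r"># Hypothetical helper (not a SparkR function): combines the grouping bits |
| # of columns c1..cn into one id, with c1 as the most significant bit. |
| grouping_id_from_bits <- function(bits) { |
|   n <- length(bits) |
|   sum(bits * 2^(rev(seq_len(n)) - 1)) |
| } |
| |
| print(grouping_id_from_bits(c(0, 0, 0)))  # no column aggregated |
| print(grouping_id_from_bits(c(0, 0, 1)))  # only the last column aggregated |
| print(grouping_id_from_bits(c(0, 1, 1)))  # the last two columns aggregated |
| print(grouping_id_from_bits(c(1, 1, 1)))  # all three columns aggregated |
| </code></pre> |
| <p>Under this arithmetic, a rollup over three columns would be expected to yield only the ids 0, 1, 3 and 7, whereas a cube can yield every value from 0 to 7. |
| </p> |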
| |
| |
| <h3>Note</h3> |
| |
| <p>approxCountDistinct(Column) since 1.4.0 |
| </p> |
| <p>kurtosis since 1.6.0 |
| </p> |
| <p>max since 1.5.0 |
| </p> |
| <p>mean since 1.5.0 |
| </p> |
| <p>min since 1.5.0 |
| </p> |
| <p>sd since 1.6.0 |
| </p> |
| <p>skewness since 1.6.0 |
| </p> |
| <p>stddev since 1.6.0 |
| </p> |
| <p>stddev_pop since 1.6.0 |
| </p> |
| <p>stddev_samp since 1.6.0 |
| </p> |
| <p>sum since 1.5.0 |
| </p> |
| <p>sumDistinct since 1.4.0 |
| </p> |
| <p>var since 1.6.0 |
| </p> |
| <p>variance since 1.6.0 |
| </p> |
| <p>var_pop since 1.5.0 |
| </p> |
| <p>var_samp since 1.6.0 |
| </p> |
| <p>approxCountDistinct(Column, numeric) since 1.4.0 |
| </p> |
| <p>countDistinct since 1.4.0 |
| </p> |
| <p>n_distinct since 1.4.0 |
| </p> |
| <p>collect_list since 2.3.0 |
| </p> |
| <p>collect_set since 2.3.0 |
| </p> |
| <p>grouping_bit since 2.3.0 |
| </p> |
| <p>grouping_id since 2.3.0 |
| </p> |
| |
| |
| <h3>See Also</h3> |
| |
| <p>Other aggregate functions: |
| <code><a href="avg.html">avg</a>()</code>, |
| <code><a href="corr.html">corr</a>()</code>, |
| <code><a href="count.html">count</a>()</code>, |
| <code><a href="cov.html">cov</a>()</code>, |
| <code><a href="first.html">first</a>()</code>, |
| <code><a href="last.html">last</a>()</code> |
| </p> |
| |
| |
| <h3>Examples</h3> |
| |
| <pre><code class="r">## Not run: |
| ##D # SparkDataFrame used throughout these examples |
| ##D df <- createDataFrame(cbind(model = rownames(mtcars), mtcars)) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(select(df, approxCountDistinct(df$gear))) |
| ##D head(select(df, approxCountDistinct(df$gear, 0.02))) |
| ##D head(select(df, countDistinct(df$gear, df$cyl))) |
| ##D head(select(df, n_distinct(df$gear))) |
| ##D head(distinct(select(df, "gear"))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec))) |
| ##D |
| ##D # average metrics by number of cylinders |
| ##D tmp <- agg(groupBy(df, "cyl"), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec)) |
| ##D head(orderBy(tmp, "cyl")) |
| ##D |
| ##D # car with the max mpg |
| ##D mpg_max <- as.numeric(collect(agg(df, max(df$mpg)))) |
| ##D head(where(df, df$mpg == mpg_max)) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(select(df, sumDistinct(df$gear))) |
| ##D head(distinct(select(df, "gear"))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D df2 <- df[df$mpg > 20, ] |
| ##D collect(select(df2, collect_list(df2$gear))) |
| ##D collect(select(df2, collect_set(df2$gear))) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D # With cube |
| ##D agg( |
| ##D cube(df, "cyl", "gear", "am"), |
| ##D mean(df$mpg), |
| ##D grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am) |
| ##D ) |
| ##D |
| ##D # With rollup |
| ##D agg( |
| ##D rollup(df, "cyl", "gear", "am"), |
| ##D mean(df$mpg), |
| ##D grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am) |
| ##D ) |
| ## End(Not run) |
| |
| ## Not run: |
| ##D # With cube |
| ##D agg( |
| ##D cube(df, "cyl", "gear", "am"), |
| ##D mean(df$mpg), |
| ##D grouping_id(df$cyl, df$gear, df$am) |
| ##D ) |
| ##D |
| ##D # With rollup |
| ##D agg( |
| ##D rollup(df, "cyl", "gear", "am"), |
| ##D mean(df$mpg), |
| ##D grouping_id(df$cyl, df$gear, df$am) |
| ##D ) |
| ## End(Not run) |
| </code></pre> |
| |
| |
| <hr /><div style="text-align: center;">[Package <em>SparkR</em> version 2.4.7 <a href="00Index.html">Index</a>]</div> |
| </body></html> |