blob: 0ae8cf44fc0d50b77f013fea52948f7dcb68b755 [file] [log] [blame]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Window functions for Column operations</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head><body>
<table width="100%" summary="page for column_window_functions {SparkR}"><tr><td>column_window_functions {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table>
<h2>Window functions for Column operations</h2>
<h3>Description</h3>
<p>Window functions defined for <code>Column</code>.
</p>
<h3>Usage</h3>
<pre>
cume_dist(x = "missing")
dense_rank(x = "missing")
lag(x, ...)
lead(x, offset, defaultValue = NULL)
ntile(x)
percent_rank(x = "missing")
rank(x, ...)
row_number(x = "missing")
## S4 method for signature 'missing'
cume_dist()
## S4 method for signature 'missing'
dense_rank()
## S4 method for signature 'characterOrColumn'
lag(x, offset = 1, defaultValue = NULL)
## S4 method for signature 'characterOrColumn,numeric'
lead(x, offset = 1,
defaultValue = NULL)
## S4 method for signature 'numeric'
ntile(x)
## S4 method for signature 'missing'
percent_rank()
## S4 method for signature 'missing'
rank()
## S4 method for signature 'ANY'
rank(x, ...)
## S4 method for signature 'missing'
row_number()
</pre>
<h3>Arguments</h3>
<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>In <code>lag</code> and <code>lead</code>, it is the column as a character string or a Column
to compute on. In <code>ntile</code>, it is the number of ntile groups.</p>
</td></tr>
<tr valign="top"><td><code>...</code></td>
<td>
<p>additional argument(s).</p>
</td></tr>
<tr valign="top"><td><code>offset</code></td>
<td>
<p>In <code>lag</code>, the number of rows back from the current row from which to obtain
a value. In <code>lead</code>, the number of rows after the current row from which to
obtain a value. If not specified, the default is 1.</p>
</td></tr>
<tr valign="top"><td><code>defaultValue</code></td>
<td>
<p>(optional) default to use when the offset row does not exist.</p>
</td></tr>
</table>
<h3>Details</h3>
<p><code>cume_dist</code>: Returns the cumulative distribution of values within a window partition,
i.e. the fraction of rows that are below the current row:
(number of values before and including x) / (total number of rows in the partition).
This is equivalent to the <code>CUME_DIST</code> function in SQL.
The method should be used with no argument.
</p>
<p><code>dense_rank</code>: Returns the rank of rows within a window partition, without any gaps.
The difference between rank and dense_rank is that dense_rank leaves no gaps in ranking
sequence when there are ties. That is, if you were ranking a competition using dense_rank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. Rank would give me sequential numbers, making
the person that came in third place (after the ties) would register as coming in fifth.
This is equivalent to the <code>DENSE_RANK</code> function in SQL.
The method should be used with no argument.
</p>
<p><code>lag</code>: Returns the value that is <code>offset</code> rows before the current row, and
<code>defaultValue</code> if there is less than <code>offset</code> rows before the current row. For example,
an <code>offset</code> of one will return the previous row at any given point in the window partition.
This is equivalent to the <code>LAG</code> function in SQL.
</p>
<p><code>lead</code>: Returns the value that is <code>offset</code> rows after the current row, and
<code>defaultValue</code> if there is less than <code>offset</code> rows after the current row.
For example, an <code>offset</code> of one will return the next row at any given point
in the window partition.
This is equivalent to the <code>LEAD</code> function in SQL.
</p>
<p><code>ntile</code>: Returns the ntile group id (from 1 to n inclusive) in an ordered window
partition. For example, if n is 4, the first quarter of the rows will get value 1, the second
quarter will get 2, the third quarter will get 3, and the last quarter will get 4.
This is equivalent to the <code>NTILE</code> function in SQL.
</p>
<p><code>percent_rank</code>: Returns the relative rank (i.e. percentile) of rows within a window
partition.
This is computed by: (rank of row in its partition - 1) / (number of rows in the partition - 1).
This is equivalent to the <code>PERCENT_RANK</code> function in SQL.
The method should be used with no argument.
</p>
<p><code>rank</code>: Returns the rank of rows within a window partition.
The difference between rank and dense_rank is that dense_rank leaves no gaps in ranking
sequence when there are ties. That is, if you were ranking a competition using dense_rank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. Rank would give me sequential numbers, making
the person that came in third place (after the ties) would register as coming in fifth.
This is equivalent to the <code>RANK</code> function in SQL.
The method should be used with no argument.
</p>
<p><code>row_number</code>: Returns a sequential number starting at 1 within a window partition.
This is equivalent to the <code>ROW_NUMBER</code> function in SQL.
The method should be used with no argument.
</p>
<h3>Note</h3>
<p>cume_dist since 1.6.0
</p>
<p>dense_rank since 1.6.0
</p>
<p>lag since 1.6.0
</p>
<p>lead since 1.6.0
</p>
<p>ntile since 1.6.0
</p>
<p>percent_rank since 1.6.0
</p>
<p>rank since 1.6.0
</p>
<p>row_number since 1.6.0
</p>
<h3>Examples</h3>
<pre><code class="r">## Not run:
##D # Dataframe used throughout this doc
##D df &lt;- createDataFrame(cbind(model = rownames(mtcars), mtcars))
##D ws &lt;- orderBy(windowPartitionBy(&quot;am&quot;), &quot;hp&quot;)
##D tmp &lt;- mutate(df, dist = over(cume_dist(), ws), dense_rank = over(dense_rank(), ws),
##D lag = over(lag(df$mpg), ws), lead = over(lead(df$mpg, 1), ws),
##D percent_rank = over(percent_rank(), ws),
##D rank = over(rank(), ws), row_number = over(row_number(), ws))
##D # Get ntile group id (1-4) for hp
##D tmp &lt;- mutate(tmp, ntile = over(ntile(4), ws))
##D head(tmp)
## End(Not run)
</code></pre>
<hr /><div style="text-align: center;">[Package <em>SparkR</em> version 2.4.0 <a href="00Index.html">Index</a>]</div>
</body></html>