blob: 51f7d3815f1322c6571b9abf1deb690e5024ae0b [file] [log] [blame]
# Manipulating Data - Arrays
## Introduction
An Arrow Array is roughly equivalent to an R vector - it can be used to
represent a single column of data, with all values having the same data type.
A number of base R functions which have S3 generic methods have been implemented
to work on Arrow Arrays; for example `mean`, `min`, and `max`.
## Filter by values matching a predicate or mask
You want to search for values in an Array that match a predicate condition.
### Solution
```{r, array_predicate}
my_values <- Array$create(c(1:5, NA))
my_values[my_values > 3]
```
```{r, test_array_predicate, opts.label = "test"}
test_that("array_predicate works as expected", {
expect_equal(my_values[my_values > 3], Array$create(c(4L, 5L, NA)))
})
```
### Discussion
You can refer to items in an Array using the square brackets `[]` like you can
an R vector.
## Compute Mean/Min/Max, etc value of an Array
You want to calculate the mean, minimum, or maximum of values in an array.
### Solution
```{r, array_mean_na}
my_values <- Array$create(c(1:5, NA))
mean(my_values, na.rm = TRUE)
```
```{r, test_array_mean_na, opts.label = "test"}
test_that("array_mean_na works as expected", {
expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
})
```
### Discussion
Many base R generic functions such as `mean()`, `min()`, and `max()` have been
mapped to their Arrow equivalents, and so can be called on Arrow Array objects
in the same way. They will return Arrow objects themselves.
If you want to use an R function which does not have an Arrow mapping, you can
use `as.vector()` to convert Arrow objects to base R vectors.
```{r, array_fivenum}
arrow_array <- Array$create(1:100)
# get Tukey's five-number summary
fivenum(as.vector(arrow_array))
```
```{r, test_array_fivenum, opts.label = "test"}
test_that("array_fivenum works as expected", {
# generates both an error and a warning
expect_warning(
expect_error(fivenum(arrow_array))
)
expect_identical(
fivenum(as.vector(arrow_array)),
c(1, 25.5, 50.5, 75.5, 100)
)
})
```
You can tell if a function is a standard S3 generic function by looking
at the body of the function - S3 generic functions call `UseMethod()`
to determine the appropriate version of that function to use for the object.
```{r}
mean
```
You can also use `isS3stdGeneric()` to determine if a function is an S3 generic.
```{r}
isS3stdGeneric("mean")
```
If you find an S3 generic function which isn't implemented for Arrow objects
but you would like to be able to use, please
[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
## Count occurrences of elements in an Array
You want to count repeated values in an Array.
### Solution
```{r, value_counts}
repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
value_counts(repeated_vals)
```
```{r, test_value_counts, opts.label = "test"}
test_that("value_counts works as expected", {
expect_equal(
as.vector(value_counts(repeated_vals)),
tibble(
values = as.numeric(names(table(as.vector(repeated_vals)))),
counts = as.vector(table(as.vector(repeated_vals)))
)
)
})
```
### Discussion
Some functions in the Arrow R package do not have base R equivalents. In other
cases, the base R equivalents are not generic functions so they cannot be called
directly on Arrow Array objects.
For example, the `value_counts()` function in the Arrow R package is loosely
equivalent to the base R function `table()`, which is not a generic function.
## Apply arithmetic functions to Arrays.
You want to use the various arithmetic operators on Array objects.
### Solution
```{r, add_array}
num_array <- Array$create(1:10)
num_array + 10
```
```{r, test_add_array, opts.label = "test"}
test_that("add_array works as expected", {
# need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
expect_equal(num_array + 10, Array$create(1:10 + 10))
})
```
### Discussion
You will get the same result if you pass in the value you're adding as an Arrow object.
```{r, add_array_scalar}
num_array + Scalar$create(10)
```
```{r, test_add_array_scalar, opts.label = "test"}
test_that("add_array_scalar works as expected", {
# need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10))
})
```
## Call Arrow compute functions directly on Arrays
You want to call an Arrow compute function directly on an Array.
### Solution
```{r, call_function}
first_100_numbers <- Array$create(1:100)
# Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0.
call_function("variance", first_100_numbers, options = list(ddof = 0))
```
```{r, test_call_function, opts.label = "test"}
test_that("call_function works as expected", {
expect_equal(
call_function("variance", first_100_numbers, options = list(ddof = 0)),
Scalar$create(833.25)
)
})
```
### Discussion
You can use `call_function()` to call Arrow compute functions directly on
Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object.
### See also
For a more in-depth discussion of Arrow compute functions, see the section on (using arrow functions in dplyr verbs in arrow)[#Using-arrow-functions-in-dplyr-verbs-in-arrow]