| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Manipulating Data - Arrays |
| |
| ## Introduction |
| |
| An Arrow Array is roughly equivalent to an R vector: it represents a single |
| column of data in which all values have the same data type. |
| |
| Several generic base R functions, such as `mean()`, `min()`, and `max()`, have |
| methods that work on Arrow Arrays. |
| |
| ## Filter by values matching a predicate or mask |
| |
| You want to search for values in an Array that match a predicate condition. |
| |
| ### Solution |
| |
| ```{r, array_predicate} |
| my_values <- Array$create(c(1:5, NA)) |
| my_values[my_values > 3] |
| ``` |
| ```{r, test_array_predicate, opts.label = "test"} |
| test_that("array_predicate works as expected", { |
| expect_equal(my_values[my_values > 3], Array$create(c(4L, 5L, NA))) |
| }) |
| ``` |
| |
| ### Discussion |
| |
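| You can subset an Array with square brackets `[]`, just as you can with an R |
| vector. The comparison `my_values > 3` yields a missing value for the `NA` |
| element, so the filtered result keeps that `NA` alongside 4 and 5. |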
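
The heading also mentions masks, so here is a minimal sketch of filtering with an explicit logical mask, and of dropping the `NA` rather than keeping it. It assumes that `[` accepts a plain R logical vector and that the Array's `$Filter()` method takes a `keep_na` argument; the names `mask`, `masked`, and `above_three` are illustrative only.

```{r, array_logical_mask}
library(arrow)

my_values <- Array$create(c(1:5, NA))

# An R logical vector used as a mask: keep positions 1, 3, and 5
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
masked <- my_values[mask]

# Same predicate as in the solution, but drop the NA instead of keeping it
# (assumes the `keep_na` argument of the Array's `$Filter()` method)
above_three <- my_values$Filter(my_values > 3, keep_na = FALSE)

as.vector(masked)
as.vector(above_three)
```

Using `$Filter()` directly is only needed when the default behaviour of keeping missing values is not what you want.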
| |
| ## Compute Mean/Min/Max, etc value of an Array |
| |
| You want to calculate the mean, minimum, or maximum of the values in an Array. |
| |
| ### Solution |
| |
| ```{r, array_mean_na} |
| my_values <- Array$create(c(1:5, NA)) |
| mean(my_values, na.rm = TRUE) |
| ``` |
| ```{r, test_array_mean_na, opts.label = "test"} |
| test_that("array_mean_na works as expected", { |
| expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3)) |
| }) |
| ``` |
| |
| ### Discussion |
| |
| Many base R generic functions such as `mean()`, `min()`, and `max()` have been |
| mapped to their Arrow equivalents, and so can be called on Arrow Array objects |
| in the same way. They return Arrow objects rather than R values; for example, |
| `mean()` on an Array returns an Arrow Scalar. |
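
As a small illustration of the point about return types, the sketch below calls `min()` and `max()` on the same Array and then converts the results with `as.vector()`. The variable names `lowest` and `highest` are made up for this example.

```{r, array_min_max}
library(arrow)

my_values <- Array$create(c(1:5, NA))

# Both calls return Arrow Scalar objects, not R numbers
lowest <- min(my_values, na.rm = TRUE)
highest <- max(my_values, na.rm = TRUE)
inherits(lowest, "Scalar")

# Convert to ordinary R values when you need them outside Arrow
c(as.vector(lowest), as.vector(highest))
```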
| |
| If you want to use an R function which does not have an Arrow mapping, you can |
| use `as.vector()` to convert Arrow objects to base R vectors. |
| |
| ```{r, array_fivenum} |
| arrow_array <- Array$create(1:100) |
| # get Tukey's five-number summary |
| fivenum(as.vector(arrow_array)) |
| ``` |
| ```{r, test_array_fivenum, opts.label = "test"} |
| test_that("array_fivenum works as expected", { |
| expect_identical( |
| fivenum(as.vector(arrow_array)), |
| c(1, 25.5, 50.5, 75.5, 100) |
| ) |
| }) |
| ``` |
| |
| You can tell if a function is a standard S3 generic function by looking |
| at the body of the function - S3 generic functions call `UseMethod()` |
| to determine the appropriate version of that function to use for the object. |
| |
| ```{r} |
| mean |
| ``` |
| |
| You can also use `isS3stdGeneric()` to determine if a function is an S3 generic. |
| |
| ```{r} |
| isS3stdGeneric("mean") |
| ``` |
| |
| If there is an S3 generic function that you would like to use but that isn't |
| implemented for Arrow objects, please |
| [open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues). |
| |
| ## Count occurrences of elements in an Array |
| |
| You want to count repeated values in an Array. |
| |
| ### Solution |
| |
| ```{r, value_counts} |
| repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3)) |
| value_counts(repeated_vals) |
| ``` |
| |
| ```{r, test_value_counts, opts.label = "test"} |
| test_that("value_counts works as expected", { |
| expect_equal( |
| as.vector(value_counts(repeated_vals)), |
| tibble( |
| values = as.numeric(names(table(as.vector(repeated_vals)))), |
| counts = as.vector(table(as.vector(repeated_vals))) |
| ) |
| ) |
| }) |
| ``` |
| |
| ### Discussion |
| |
| Some functions in the Arrow R package do not have base R equivalents. In other |
| cases, the base R equivalents are not generic functions so they cannot be called |
| directly on Arrow Array objects. |
| |
| For example, the `value_counts()` function in the Arrow R package is loosely |
| equivalent to the base R function `table()`, which is not a generic function. |
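
To work with the counts in R, one option is to convert the result with `as.vector()`, which gives a data frame with `values` and `counts` columns. This is a sketch rather than the only approach; `counts_df` is an illustrative name.

```{r, value_counts_as_df}
library(arrow)

repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))

# Convert the Arrow result into an R data frame
counts_df <- as.vector(value_counts(repeated_vals))

counts_df$values
counts_df$counts
```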
| |
| ## Apply arithmetic functions to Arrays |
| |
| You want to use the various arithmetic operators on Array objects. |
| |
| ### Solution |
| |
| ```{r, add_array} |
| num_array <- Array$create(1:10) |
| num_array + 10 |
| ``` |
| ```{r, test_add_array, opts.label = "test"} |
| test_that("add_array works as expected", { |
| # build the expected array as 1:10 + 10 rather than 11:20 so that it is double, not integer |
| expect_equal(num_array + 10, Array$create(1:10 + 10)) |
| }) |
| ``` |
| |
| ### Discussion |
| |
| You get the same result if the value you add is itself an Arrow object, such as a Scalar. |
| |
| ```{r, add_array_scalar} |
| num_array + Scalar$create(10) |
| ``` |
| ```{r, test_add_array_scalar, opts.label = "test"} |
| test_that("add_array_scalar works as expected", { |
| # build the expected array as 1:10 + 10 rather than 11:20 so that it is double, not integer |
| expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10)) |
| }) |
| ``` |
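
Addition is only one of the operators available. The sketch below, which assumes the other standard operators behave analogously, multiplies by an R number, subtracts one Array from another, and applies a comparison. The names `doubled`, `difference`, and `is_large` are illustrative.

```{r, array_other_ops}
library(arrow)

num_array <- Array$create(1:10)

# Multiply every element by an R number
doubled <- num_array * 2

# Element-wise subtraction of two Arrays of the same length
difference <- num_array - Array$create(10:1)

# Comparison operators return a boolean Array
is_large <- num_array > 5

as.vector(doubled)
as.vector(difference)
as.vector(is_large)
```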
| |
| ## Call Arrow compute functions directly on Arrays |
| |
| You want to call an Arrow compute function directly on an Array. |
| |
| ### Solution |
| |
| ```{r, call_function} |
| first_100_numbers <- Array$create(1:100) |
| |
| # Calculate the variance of 1 to 100 with the delta degrees of freedom (ddof) |
| # set to 0, which gives the population variance (divisor N rather than N - 1). |
| call_function("variance", first_100_numbers, options = list(ddof = 0)) |
| ``` |
| ```{r, test_call_function, opts.label = "test"} |
| test_that("call_function works as expected", { |
| expect_equal( |
| call_function("variance", first_100_numbers, options = list(ddof = 0)), |
| Scalar$create(833.25) |
| ) |
| }) |
| ``` |
| |
| ### Discussion |
| |
| You can use `call_function()` to call Arrow compute functions directly on |
| Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object. |
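
If you are not sure which compute functions exist, `list_compute_functions()` returns their names as a character vector. The sketch below looks up `"variance"` in that list and then calls a second function, `"sum"`, in the same way; `available` and `total` are illustrative names.

```{r, list_compute_functions}
library(arrow)

# Names of the compute functions that call_function() can use
available <- list_compute_functions()
"variance" %in% available

first_100_numbers <- Array$create(1:100)

# Call another compute function by name; the result is an Arrow Scalar
total <- call_function("sum", first_100_numbers)
as.vector(total)
```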
| |
| ### See also |
| |
| For a more in-depth discussion of Arrow compute functions, see the section on |
| [using Arrow functions in dplyr verbs in Arrow](#use-arrow-functions-in-dplyr-verbs-in-arrow). |