blob: 89225ab6ce2047ea0cade96e987516fd12f4702a [file] [log] [blame]
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Manipulating Data - Tables
## Introduction
One of the aims of the Arrow project is to reduce duplication between different
data frame implementations. The underlying implementation of a data frame is a
conceptually different thing to the code- or the application programming interface (API)-that you write to work with it.
You may have seen this before in packages like dbplyr which allow you to use
the dplyr API to interact with SQL databases.
The Arrow R package has been written so that the underlying Arrow Table-like
objects can be manipulated using the dplyr API, which allows you to use dplyr verbs.
For example, here's a short pipeline of data manipulation which uses dplyr exclusively:
```{r, dplyr_raw}
library(dplyr)
starwars %>%
filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft)
```
And the same results as using Arrow with dplyr syntax:
```{r, dplyr_arrow}
arrow_table(starwars) %>%
filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft) %>%
collect()
```
```{r, test_dplyr_raw_and_arrow, opts.label = "test"}
test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {
expect_equal(
starwars %>%
filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft),
arrow_table(starwars) %>%
filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft) %>%
collect()
)
})
```
You'll notice we've used `collect()` in the Arrow pipeline above. That's because
one of the ways in which Arrow is efficient is that it works out the instructions
for the calculations it needs to perform (_expressions_) and only runs them
using Arrow once you actually pull the data into your R session. This means
instead of doing lots of separate operations, it does them all at once in a
more optimised way. This is called _lazy evaluation_.
It also means that you are able to manipulate data that is larger than you can
fit into memory on the machine you're running your code on, if you only pull
data into R when you have selected the desired subset, or when using functions
which can operate on chunks of data.
You can also have data which is split across multiple files. For example, you
might have files which are stored in multiple Parquet or Feather files,
partitioned across different directories. You can open partitioned or multi-file datasets
using `open_dataset()` as discussed in a previous chapter, and then manipulate
this data using Arrow before even reading any of the data into R.
## Use dplyr verbs in Arrow
You want to use a dplyr verb in Arrow.
### Solution
```{r, dplyr_verb}
library(dplyr)
arrow_table(starwars) %>%
filter(species == "Human", homeworld == "Tatooine") %>%
collect()
```
```{r, test_dplyr_verb, opts.label = "test"}
test_that("dplyr_verb works as expected", {
out <- arrow_table(starwars) %>%
filter(species == "Human", homeworld == "Tatooine") %>%
collect()
expect_equal(nrow(out), 8)
expect_s3_class(out, "data.frame")
expect_identical(unique(out$species), "Human")
expect_identical(unique(out$homeworld), "Tatooine")
})
```
### Discussion
You can use most of the dplyr verbs directly from Arrow.
### See also
You can find examples of the various dplyr verbs in "Introduction to dplyr" -
run `vignette("dplyr", package = "dplyr")` or view on
the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
You can see more information about using `arrow_table()` to create Arrow Tables
and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
## Use R functions in dplyr verbs in Arrow
You want to use an R function inside a dplyr verb in Arrow.
### Solution
```{r, dplyr_str_detect}
arrow_table(starwars) %>%
filter(str_detect(name, "Darth")) %>%
collect()
```
```{r, test_dplyr_str_detect, opts.label = "test"}
test_that("dplyr_str_detect", {
out <- arrow_table(starwars) %>%
filter(str_detect(name, "Darth")) %>%
collect()
expect_equal(nrow(out), 2)
expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
})
```
### Discussion
The Arrow R package allows you to use dplyr verbs containing expressions which
include base R and many tidyverse functions, but call Arrow functions under the hood.
If you find any base R or tidyverse functions which you would like to see a
mapping of in Arrow, please
[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
The following packages (amongst some from others) have had many function
bindings/mappings written in arrow:
* [lubridate](https://lubridate.tidyverse.org/)
* [stringr](https://stringr.tidyverse.org/)
* [dplyr](https://dplyr.tidyverse.org/)
If you try to call a function which does not have arrow mapping, the data will
be pulled back into R, and you will see a warning message.
```{r, dplyr_func_warning}
library(stringr)
arrow_table(starwars) %>%
mutate(name_split = str_split_fixed(name, " ", 2)) %>%
collect()
```
```{r, test_dplyr_func_warning, opts.label = "test"}
test_that("dplyr_func_warning", {
expect_warning(
arrow_table(starwars) %>%
mutate(name_split = str_split_fixed(name, " ", 2)) %>%
collect(),
'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
fixed = TRUE
)
})
```
## Use Arrow functions in dplyr verbs in Arrow
You want to use a function which is implemented in Arrow's C++ library but either:
* it doesn't have a mapping to a base R or tidyverse equivalent, or
* it has a mapping but nevertheless you want to call the C++ function directly
### Solution
```{r, dplyr_arrow_func}
arrow_table(starwars) %>%
select(name) %>%
mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
collect()
```
```{r, test_dplyr_arrow_func, opts.label = "test"}
test_that("dplyr_arrow_func", {
out <- arrow_table(starwars) %>%
select(name) %>%
mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
collect()
expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
})
```
### Discussion
The vast majority of Arrow C++ compute functions have been mapped to their
base R or tidyverse equivalents, and we strongly recommend that you use
these mappings where possible, as the original functions are well documented
and the mapped versions have been tested to ensure the results returned are as
expected.
However, there may be circumstances in which you might want to use a compute
function from the Arrow C++ library which does not have a base R or tidyverse
equivalent.
You can find documentation of Arrow C++ compute functions in
[the C++ documention](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
This documentation lists all available compute functions, any associated options classes
they need, and the valid data types that they can be used with.
You can list all available Arrow compute functions from R by calling
`list_compute_functions()`.
```{r, list_compute_funcs}
list_compute_functions()
```
```{r, test_list_compute_funcs, opts.label = "test"}
test_that("list_compute_funcs", {
expect_gt(length(list_compute_functions()), 0)
})
```
The majority of functions here have been mapped to their base R or tidyverse
equivalent and can be called within a dplyr query as usual. For functions which
don't have a base R or tidyverse equivalent, or you want to supply custom
options, you can call them by prefixing their name with "arrow_".
For example, base R's `is.na()` function is the equivalent of the Arrow C++
compute function `is_null()` with the option `nan_is_null` set to `TRUE`.
A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
created in arrow.
```{r, dplyr_is_na}
demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))
arrow_table(demo_df) %>%
mutate(y = is.na(x)) %>%
collect()
```
```{r, test_dplyr_is_na, opts.label = "test"}
test_that("dplyr_is_na", {
out <- arrow_table(demo_df) %>%
mutate(y = is.na(x)) %>%
collect()
expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))
})
```
If you want to call Arrow's `is_null()` function but with `nan_is_null` set to
`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE`
when the value being examined is `NaN`), you must call `is_null()` directly and
specify the option `nan_is_null = FALSE`.
```{r, dplyr_arrow_is_null}
arrow_table(demo_df) %>%
mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
collect()
```
```{r, test_dplyr_arrow_is_null, opts.label = "test"}
test_that("dplyr_arrow_is_null", {
out <- arrow_table(demo_df) %>%
mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
collect()
expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))
})
```
#### Compute functions with options
Although not all Arrow C++ compute functions require options to be specified,
most do. For these functions to work in R, they must be linked up
with the appropriate libarrow options C++ class via the R
package's C++ code. At the time of writing, all compute functions available in
the development version of the Arrow R package had been associated with their options
classes. However, as the Arrow C++ library's functionality extends, compute
functions may be added which do not yet have an R binding. If you find a C++
compute function which you wish to use from the R package, please [open an issue
on the Github project](https://github.com/apache/arrow/issues).
## Compute Window Aggregates
You want to apply an aggregation (e.g. `mean()`) on a grouped table or within a rowwise operation like `filter()`:
### Solution
```{r, arrow_window_aggregate_joins}
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
left_join(
arrow_table(starwars) %>%
group_by(hair_color) %>%
summarize(mean_height = mean(height, na.rm = TRUE))
) %>%
filter(height < mean_height) %>%
select(!mean_height) %>%
collect()
```
```{r, test_arrow_window_aggregate_joins, opts.label = "test"}
test_that("arrow_window_aggregate_joins", {
out <- arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
left_join(
arrow_table(starwars) %>%
group_by(hair_color) %>%
summarize(mean_height = mean(height, na.rm = TRUE))
) %>%
filter(height < mean_height) %>%
select(!mean_height) %>%
collect()
expected <- starwars %>%
select(name, height, hair_color) %>%
filter(!is.na(hair_color)) %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE))
expect_equal(nrow(out), nrow(expected))
expect_equal(sum(out$height), sum(expected$height))
})
```
Or using `to_duckdb()`
```{r, arrow_window_aggregate_duckdb}
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
to_duckdb() %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE)) %>%
to_arrow() %>%
collect()
```
```{r, test_arrow_window_aggregate_duckdb, opts.label = "test"}
test_that("arrow_window_aggregate_duckdb", {
out <- arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
to_duckdb() %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE)) %>%
to_arrow() %>%
collect()
expected <- starwars %>%
select(name, height, hair_color) %>%
filter(!is.na(hair_color)) %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE))
expect_equal(nrow(out), nrow(expected))
expect_equal(sum(out$height), sum(expected$height))
})
```
### Discusson
Arrow does not support window functions, and pulls the data into R. For large tables, this sacrifices performance.
```{r, arrow_window_aggregate}
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE))
```
You can perform these window aggregate operations on Arrow tables by:
- Computing the aggregation separately, and joining the result
- Passing the data to DuckDB, and use the DuckDB query engine to perform the operations
Arrow supports zero-copy integration with DuckDB, and DuckDB can query Arrow datasets directly and stream query results back to Arrow. This integreation uses zero-copy streaming of data between DuckDB and Arrow and vice versa so that you can compose a query using both together, all the while not paying any cost to (re)serialize the data when you pass it back and forth. This is especially useful in cases where something is supported in one of Arrow or DuckDB query engines but not the other. You can find more information about this integration on the [Arrow blog post](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).