r/content/tables.Rmd - arrow-cookbook - Git at Google

 # Manipulating Data - Tables

 ## Introduction

 One of the aims of the Arrow project is to reduce duplication between different
 data frame implementations.  The underlying implementation of a data frame is a
 conceptually different thing to the code that you run to work with it - the API.

 You may have seen this before in packages like `dbplyr` which allow you to use
 the dplyr API to interact with SQL databases.

 The `arrow` package has been written so that the underlying Arrow table-like
 objects can be manipulated via use of the dplyr API via the dplyr verbs.

 For example, here's a short pipeline of data manipulation which uses dplyr exclusively:

 ```{r, dplyr_raw}
 library(dplyr)
 starwars %>%
   filter(species == "Human") %>%
   mutate(height_ft = height/30.48) %>%
   select(name, height_ft)
 ```

 And the same results as using arrow with dplyr syntax:

 ```{r, dplyr_arrow}
 Table$create(starwars) %>%
   filter(species == "Human") %>%
   mutate(height_ft = height/30.48) %>%
   select(name, height_ft) %>%
   collect()
 ```

 ```{r, test_dplyr_raw_and_arrow, opts.label = "test"}

 test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {

   expect_equal(
     starwars %>%
       filter(species == "Human") %>%
       mutate(height_ft = height/30.48) %>%
       select(name, height_ft),
     Table$create(starwars) %>%
       filter(species == "Human") %>%
       mutate(height_ft = height/30.48) %>%
       select(name, height_ft) %>%
       collect()
   )

 })

 ```


 You'll notice we've used `collect()` in the Arrow pipeline above.  That's because
 one of the ways in which `arrow` is efficient is that it works out the instructions
 for the calculations it needs to perform (_expressions_) and only runs them
 using arrow once you actually pull the data into your R session.  This means
 instead of doing lots of separate operations, it does them all at once in a
 more optimised way, _lazy evaluation_.

 It also means that you are able to manipulate data that is larger than you can
 fit into memory on the machine you're running your code on, if you only pull
 data into R when you have selected the desired subset, or when using functions
 which can operate on chunks of data.

 You can also have data which is split across multiple files.  For example, you
 might have files which are stored in multiple Parquet or Feather files,
 partitioned across different directories.  You can open multi-file datasets
 using `open_dataset()` as discussed in a previous chapter, and then manipulate
 this data using arrow before even reading any of it into R.

 ## Use dplyr verbs in arrow

 You want to use a dplyr verb in arrow.

 ### Solution

 ```{r, dplyr_verb}
 library(dplyr)
 Table$create(starwars) %>%
   filter(species == "Human", homeworld == "Tatooine") %>%
   collect()
 ```

 ```{r, test_dplyr_verb, opts.label = "test"}

 test_that("dplyr_verb works as expected", {
   out <- Table$create(starwars) %>%
     filter(species == "Human", homeworld == "Tatooine") %>%
     collect()

   expect_equal(nrow(out), 8)
   expect_s3_class(out, "data.frame")
   expect_identical(unique(out$species), "Human")
   expect_identical(unique(out$homeworld), "Tatooine")
 })

 ```

 ### Discussion

 You can use most of the dplyr verbs directly from arrow.

 ### See also

 You can find examples of the various dplyr verbs in "Introduction to dplyr" -
 run `vignette("dplyr", package = "dplyr")` or view on
 the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).

 You can see more information about using `Table$create()` to create Arrow Tables
 and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).

 ## Use R functions in dplyr verbs in arrow

 You want to use an R function inside a dplyr verb in arrow.

 ### Solution

 ```{r, dplyr_str_detect}
 Table$create(starwars) %>%
   filter(str_detect(name, "Darth")) %>%
   collect()
 ```

 ```{r, test_dplyr_str_detect, opts.label = "test"}

 test_that("dplyr_str_detect", {
   out <- Table$create(starwars) %>%
     filter(str_detect(name, "Darth")) %>%
     collect()

   expect_equal(nrow(out), 2)
   expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))

 })

 ```

 ### Discussion

 The arrow package allows you to use dplyr verbs containing expressions which
 include base R and many tidyverse functions, but call Arrow functions under the hood.
 If you find any base R or tidyverse functions which you would like to see a
 mapping of in arrow, please
 [open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).

 The following packages (amongst some from others) have had many function
 bindings/mappings written in arrow:

 * [lubridate](https://lubridate.tidyverse.org/)
 * [stringr](https://stringr.tidyverse.org/)
 * [dplyr](https://dplyr.tidyverse.org/)

 If you try to call a function which does not have arrow mapping, the data will
 be pulled back into R, and you will see a warning message.

 ```{r, dplyr_func_warning}
 library(stringr)

 Table$create(starwars) %>%
   mutate(name_split = str_split_fixed(name, " ", 2)) %>%
   collect()
 ```

 ```{r, test_dplyr_func_warning, opts.label = "test"}

 test_that("dplyr_func_warning", {

   expect_warning(
      Table$create(starwars) %>%
       mutate(name_split = str_split_fixed(name, " ", 2)) %>%
       collect(),
     'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
     fixed = TRUE
   )

 })
 ```

 It should be noted that to work with functions which do have mappings, you must
 refer to them by
 their function names.  If you use the `package::function()` syntax, this will
 result in a warning about the data being pulled back into R before the output is
 calculated.

 ```{r, dplyr_str_to_lower_r, warning=TRUE}
 Table$create(starwars) %>%
   select(name) %>%
   mutate(name_lower = stringr::str_to_lower(name)) %>%
   head() %>%
   collect()
 ```

 ```{r, test_dplyr_str_to_lower_r, opts.label = "test"}

 test_that("dplyr_str_to_lower_r", {

   expect_warning(
     Table$create(starwars) %>%
       select(name) %>%
       mutate(name_lower = stringr::str_to_lower(name)) %>%
       head() %>%
       collect(),
     "Expression stringr::str_to_lower(name) not supported in Arrow; pulling data into R",
     fixed = TRUE
   )

 })

 ```

 ## Use arrow functions in dplyr verbs in arrow

 You want to use a function which is implemented in Arrow's C++ library but either:

 * it doesn't have a mapping to a base R or tidyverse equivalent, or
 * it has a mapping but nevertheless you want to call the C++ function directly

 ### Solution

 ```{r, dplyr_arrow_func}
 Table$create(starwars) %>%
   select(name) %>%
   mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
   collect()
 ```
 ```{r, test_dplyr_arrow_func, opts.label = "test"}

 test_that("dplyr_arrow_func", {
   out <- Table$create(starwars) %>%
     select(name) %>%
     mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
     collect()

   expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)

 })

 ```
 ### Discussion

 The vast majority of Arrow C++ compute functions have been mapped to their
 base R or tidyverse equivalents, and we strongly recommend that you use
 these mappings where possible, as the original functions are well documented
 and the mapped versions have been tested to ensure the results returned are as
 expected.

 However, there may be circumstances in which you might want to use a compute
 function from the Arrow C++ library which does not have a base R or tidyverse
 equivalent.

 You can find documentation of Arrow C++ compute functions in
 [the C++ documention](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
 This documentation lists all available compute functions, any associated options classes
 they need, and the valid data types that they can be used with.

 You can list all available Arrow compute functions from R by calling
 `list_compute_functions()`.

 ```{r, list_compute_funcs}
 list_compute_functions()
 ```
 ```{r, test_list_compute_funcs, opts.label = "test"}
 test_that("list_compute_funcs", {
   expect_gt(length(list_compute_functions()), 0)
 })
 ```

 The majority of functions here have been mapped to their base R or tidyverse
 equivalent and can be called within a dplyr query as usual.  For functions which
 don't have a base R or tidyverse equivalent, or you want to supply custom
 options, you can call them by prefixing their name with "arrow_".

 For example, base R's `is.na()` function is the equivalent of the Arrow C++
 compute function `is_null()` with the option `nan_is_null` set to `TRUE`.
 A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
 created in arrow.

 ```{r, dplyr_is_na}
 demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))

 Table$create(demo_df) %>%
   mutate(y = is.na(x)) %>%
   collect()
 ```

 ```{r, test_dplyr_is_na, opts.label = "test"}
 test_that("dplyr_is_na", {
   out <- Table$create(demo_df) %>%
     mutate(y = is.na(x)) %>%
     collect()

   expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))

 })
 ```

 If you want to call Arrow's `is_null()` function but with `nan_is_null` set to
 `FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE`
 when the value being examined is `NaN`), you must call `is_null()` directly and
 specify the option `nan_is_null = FALSE`.

 ```{r, dplyr_arrow_is_null}
 Table$create(demo_df) %>%
   mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>%
   collect()
 ```

 ```{r, test_dplyr_arrow_is_null, opts.label = "test"}
 test_that("dplyr_arrow_is_null", {
   out <- Table$create(demo_df) %>%
     mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>%
     collect()

   expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))

 })
 ```

 #### Compute functions with options

 Although not all Arrow C++ compute functions require options to be specified,
 most do.  For these functions to work in R, they must be linked up
 with the appropriate libarrow options C++ class via the R
 package's C++ code.  At the time of writing, all compute functions available in
 the development version of the arrow R package had been associated with their options
 classes.  However, as the Arrow C++ library's functionality extends, compute
 functions may be added which do not yet have an R binding.  If you find a C++
 compute function which you wish to use from the R package, please [open an issue
 on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
	# Manipulating Data - Tables

	## Introduction

	One of the aims of the Arrow project is to reduce duplication between different
	data frame implementations. The underlying implementation of a data frame is a
	conceptually different thing to the code that you run to work with it - the API.

	You may have seen this before in packages like `dbplyr` which allow you to use
	the dplyr API to interact with SQL databases.

	The `arrow` package has been written so that the underlying Arrow table-like
	objects can be manipulated via use of the dplyr API via the dplyr verbs.

	For example, here's a short pipeline of data manipulation which uses dplyr exclusively:

	```{r, dplyr_raw}
	library(dplyr)
	starwars %>%
	filter(species == "Human") %>%
	mutate(height_ft = height/30.48) %>%
	select(name, height_ft)
	```

	And the same results as using arrow with dplyr syntax:

	```{r, dplyr_arrow}
	Table$create(starwars) %>%
	filter(species == "Human") %>%
	mutate(height_ft = height/30.48) %>%
	select(name, height_ft) %>%
	collect()
	```

	```{r, test_dplyr_raw_and_arrow, opts.label = "test"}

	test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {

	expect_equal(
	starwars %>%
	filter(species == "Human") %>%
	mutate(height_ft = height/30.48) %>%
	select(name, height_ft),
	Table$create(starwars) %>%
	filter(species == "Human") %>%
	mutate(height_ft = height/30.48) %>%
	select(name, height_ft) %>%
	collect()
	)

	})

	```


	You'll notice we've used `collect()` in the Arrow pipeline above. That's because
	one of the ways in which `arrow` is efficient is that it works out the instructions
	for the calculations it needs to perform (_expressions_) and only runs them
	using arrow once you actually pull the data into your R session. This means
	instead of doing lots of separate operations, it does them all at once in a
	more optimised way, _lazy evaluation_.

	It also means that you are able to manipulate data that is larger than you can
	fit into memory on the machine you're running your code on, if you only pull
	data into R when you have selected the desired subset, or when using functions
	which can operate on chunks of data.

	You can also have data which is split across multiple files. For example, you
	might have files which are stored in multiple Parquet or Feather files,
	partitioned across different directories. You can open multi-file datasets
	using `open_dataset()` as discussed in a previous chapter, and then manipulate
	this data using arrow before even reading any of it into R.

	## Use dplyr verbs in arrow

	You want to use a dplyr verb in arrow.

	### Solution

	```{r, dplyr_verb}
	library(dplyr)
	Table$create(starwars) %>%
	filter(species == "Human", homeworld == "Tatooine") %>%
	collect()
	```

	```{r, test_dplyr_verb, opts.label = "test"}

	test_that("dplyr_verb works as expected", {
	out <- Table$create(starwars) %>%
	filter(species == "Human", homeworld == "Tatooine") %>%
	collect()

	expect_equal(nrow(out), 8)
	expect_s3_class(out, "data.frame")
	expect_identical(unique(out$species), "Human")
	expect_identical(unique(out$homeworld), "Tatooine")
	})

	```

	### Discussion

	You can use most of the dplyr verbs directly from arrow.

	### See also

	You can find examples of the various dplyr verbs in "Introduction to dplyr" -
	run `vignette("dplyr", package = "dplyr")` or view on
	the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).

	You can see more information about using `Table$create()` to create Arrow Tables
	and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).

	## Use R functions in dplyr verbs in arrow

	You want to use an R function inside a dplyr verb in arrow.

	### Solution

	```{r, dplyr_str_detect}
	Table$create(starwars) %>%
	filter(str_detect(name, "Darth")) %>%
	collect()
	```

	```{r, test_dplyr_str_detect, opts.label = "test"}

	test_that("dplyr_str_detect", {
	out <- Table$create(starwars) %>%
	filter(str_detect(name, "Darth")) %>%
	collect()

	expect_equal(nrow(out), 2)
	expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))

	})

	```

	### Discussion

	The arrow package allows you to use dplyr verbs containing expressions which
	include base R and many tidyverse functions, but call Arrow functions under the hood.
	If you find any base R or tidyverse functions which you would like to see a
	mapping of in arrow, please
	[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).

	The following packages (amongst some from others) have had many function
	bindings/mappings written in arrow:

	* [lubridate](https://lubridate.tidyverse.org/)
	* [stringr](https://stringr.tidyverse.org/)
	* [dplyr](https://dplyr.tidyverse.org/)

	If you try to call a function which does not have arrow mapping, the data will
	be pulled back into R, and you will see a warning message.

	```{r, dplyr_func_warning}
	library(stringr)

	Table$create(starwars) %>%
	mutate(name_split = str_split_fixed(name, " ", 2)) %>%
	collect()
	```

	```{r, test_dplyr_func_warning, opts.label = "test"}

	test_that("dplyr_func_warning", {

	expect_warning(
	Table$create(starwars) %>%
	mutate(name_split = str_split_fixed(name, " ", 2)) %>%
	collect(),
	'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
	fixed = TRUE
	)

	})
	```

	It should be noted that to work with functions which do have mappings, you must
	refer to them by
	their function names. If you use the `package::function()` syntax, this will
	result in a warning about the data being pulled back into R before the output is
	calculated.

	```{r, dplyr_str_to_lower_r, warning=TRUE}
	Table$create(starwars) %>%
	select(name) %>%
	mutate(name_lower = stringr::str_to_lower(name)) %>%
	head() %>%
	collect()
	```

	```{r, test_dplyr_str_to_lower_r, opts.label = "test"}

	test_that("dplyr_str_to_lower_r", {

	expect_warning(
	Table$create(starwars) %>%
	select(name) %>%
	mutate(name_lower = stringr::str_to_lower(name)) %>%
	head() %>%
	collect(),
	"Expression stringr::str_to_lower(name) not supported in Arrow; pulling data into R",
	fixed = TRUE
	)

	})

	```

	## Use arrow functions in dplyr verbs in arrow

	You want to use a function which is implemented in Arrow's C++ library but either:

	* it doesn't have a mapping to a base R or tidyverse equivalent, or
	* it has a mapping but nevertheless you want to call the C++ function directly

	### Solution

	```{r, dplyr_arrow_func}
	Table$create(starwars) %>%
	select(name) %>%
	mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
	collect()
	```
	```{r, test_dplyr_arrow_func, opts.label = "test"}

	test_that("dplyr_arrow_func", {
	out <- Table$create(starwars) %>%
	select(name) %>%
	mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
	collect()

	expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)

	})

	```
	### Discussion

	The vast majority of Arrow C++ compute functions have been mapped to their
	base R or tidyverse equivalents, and we strongly recommend that you use
	these mappings where possible, as the original functions are well documented
	and the mapped versions have been tested to ensure the results returned are as
	expected.

	However, there may be circumstances in which you might want to use a compute
	function from the Arrow C++ library which does not have a base R or tidyverse
	equivalent.

	You can find documentation of Arrow C++ compute functions in
	[the C++ documention](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
	This documentation lists all available compute functions, any associated options classes
	they need, and the valid data types that they can be used with.

	You can list all available Arrow compute functions from R by calling
	`list_compute_functions()`.

	```{r, list_compute_funcs}
	list_compute_functions()
	```
	```{r, test_list_compute_funcs, opts.label = "test"}
	test_that("list_compute_funcs", {
	expect_gt(length(list_compute_functions()), 0)
	})
	```

	The majority of functions here have been mapped to their base R or tidyverse
	equivalent and can be called within a dplyr query as usual. For functions which
	don't have a base R or tidyverse equivalent, or you want to supply custom
	options, you can call them by prefixing their name with "arrow_".

	For example, base R's `is.na()` function is the equivalent of the Arrow C++
	compute function `is_null()` with the option `nan_is_null` set to `TRUE`.
	A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
	created in arrow.

	```{r, dplyr_is_na}
	demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))

	Table$create(demo_df) %>%
	mutate(y = is.na(x)) %>%
	collect()
	```

	```{r, test_dplyr_is_na, opts.label = "test"}
	test_that("dplyr_is_na", {
	out <- Table$create(demo_df) %>%
	mutate(y = is.na(x)) %>%
	collect()

	expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))

	})
	```

	If you want to call Arrow's `is_null()` function but with `nan_is_null` set to
	`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE`
	when the value being examined is `NaN`), you must call `is_null()` directly and
	specify the option `nan_is_null = FALSE`.

	```{r, dplyr_arrow_is_null}
	Table$create(demo_df) %>%
	mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
	collect()
	```

	```{r, test_dplyr_arrow_is_null, opts.label = "test"}
	test_that("dplyr_arrow_is_null", {
	out <- Table$create(demo_df) %>%
	mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
	collect()

	expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))

	})
	```

	#### Compute functions with options

	Although not all Arrow C++ compute functions require options to be specified,
	most do. For these functions to work in R, they must be linked up
	with the appropriate libarrow options C++ class via the R
	package's C++ code. At the time of writing, all compute functions available in
	the development version of the arrow R package had been associated with their options
	classes. However, as the Arrow C++ library's functionality extends, compute
	functions may be added which do not yet have an R binding. If you find a C++
	compute function which you wish to use from the R package, please [open an issue
	on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).