 ---
 title: "Writing dplyr bindings"
 description: >
   Learn how to write bindings that allow arrow to mirror the behavior
   of native R functions within dplyr pipelines
 output: rmarkdown::html_vignette
 ---

 ```{r, include=FALSE}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```

 When writing bindings between C++ compute functions and R functions, the aim is
 to expose the C++ functionality via the same interface as existing R functions. The syntax and
 functionality should match that of the existing R functions
 (though there are some exceptions) so that users are able to use existing tidyverse
 or base R syntax, whilst taking advantage of the speed and functionality of the
 underlying arrow package.

 One of the main ways in which users interact with arrow is via
 [dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects.  For
 example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
 Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
 used and, under the hood, translates the dplyr code into Arrow C++ code.

 When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
 from other packages.  The example below uses `stringr::str_detect()`.

 ```{r}
 library(dplyr)
 library(stringr)
 starwars %>%
   filter(str_detect(name, "Darth"))
 ```
 This functionality has also been implemented in Arrow, e.g.:

 ```{r}
 library(arrow)
 arrow_table(starwars) %>%
   filter(str_detect(name, "Darth")) %>%
   collect()
 ```

 This is possible as a **binding** has been created between the call to the
 stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping
 to `match_substring_regex`.  You can see this for yourself by inspecting the
 arrow data query object without retrieving the results via `collect()`.


 ```{r}
 arrow_table(starwars) %>%
   filter(str_detect(name, "Darth"))
 ```

 In the following sections, we'll walk through how to create a binding between an
 R function and an Arrow C++ function.

 # Walkthrough

 Imagine you are writing the bindings for the C++ function
 [`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
 and want to bind it to the (base) R function `startsWith()`.

 First, take a look at the docs for both of those functions.

 ## Examining the R function

 Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html)

 ```{r, echo=FALSE, out.width="50%"}
 knitr::include_graphics("./startswithdocs.png")
 ```

 It takes two parameters: `x`, the input, and `prefix`, the string to look for
 at the start of each element of `x`.
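
 As a quick check of that behaviour, the sketch below calls `startsWith()` on a
 small character vector. The input values are made up for illustration; note
 that the prefix is matched literally (not as a regular expression) and that
 missing values are propagated.

 ```{r}
 # Illustrative input, not taken from any dataset
 x <- c("Apache", "Arrow", "R", NA)

 # Returns one logical value per element of x; NA stays NA
 starts_with_a <- startsWith(x, "A")
 starts_with_a

 # The prefix is a literal string, so "." is not a wildcard
 literal_match <- startsWith(c("a.b", "axb"), "a.")
 literal_match
 ```

 Any binding written for Arrow should reproduce both of these behaviours.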

 ## Examining the C++ function

 Now, go to
 [the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
 and look for the Arrow C++ library's `starts_with()` function:

 ```{r, echo=FALSE, out.width="100%"}
 knitr::include_graphics("./starts_with_docs.png")
 ```

 The docs show that `starts_with()` is a unary function, which means that it takes a
 single data input. The data input must be a string-like class, and the returned
 value is boolean, both of which match up to R's `startsWith()`.

 There is an options class associated with `starts_with()`, called
 [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE),
 so let's take a look at that.

 ```{r, echo=FALSE, out.width="100%"}
 knitr::include_graphics("./matchsubstringoptions.png")
 ```

 Options classes allow the user to control the behaviour of the function.  In
 this case, there are two possible options which can be supplied - `pattern` and
 `ignore_case`, which are described in the docs shown above.

 ## Comparing the R and C++ functions

 What conclusions can be drawn from what you've seen so far?

 Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
 types, return equivalent data types, and as there are no options implemented in
 R that Arrow doesn't have, this should be fairly simple to map without a great
 deal of extra work.

 As `starts_with()` has an options class associated with it, we'll need to make
 sure that it's linked up with this in the R package's C++ code.

 In case you're wondering about the difference between arguments in R and options
 in Arrow: in R, arguments to functions can include the actual data to be
 analysed as well as options governing how the function works.  In the C++
 compute functions, the arguments are the data to be analysed, and the options
 specify exactly how the function works.

 So let's get started.

 ## Step 1 - Add unit tests

 We recommend a test-driven-development approach - write failing tests first,
 then check that they fail, and then write the code needed to make them pass.
 Thinking up-front about the behavior which needs testing can make it easier to
 reason about the code which needs writing later.

 Look up the R function that you want to bind the compute kernel to, and write a
 set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
 perhaps even `compare_dplyr_error()` if necessary).  These functions compare the
 output of the original function with the dplyr bindings and make sure they match.
 We recommend looking at the [documentation next to the source code for these
 functions](https://github.com/apache/arrow/blob/main/r/tests/testthat/helper-expectation.R)
 to get a better understanding of how they work.

 You should make sure you're testing all parameters of the R function in your
 tests.

 Below is a possible example test for `startsWith()`.

 ```{r, eval = FALSE}
 test_that("startsWith behaves identically in dplyr and Arrow", {
   df <- tibble(x = c("Foo", "bar", "baz", "qux"))
   compare_dplyr_binding(
     .input %>%
       filter(startsWith(x, "b")) %>%
       collect(),
     df
   )
 })
 ```
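
 To get a feel for what such a test checks, here is a rough, self-contained
 approximation that does not rely on the arrow test helpers: it runs the same
 pipeline on an Arrow Table and on the original data frame and compares the
 results. This is only a sketch of the idea, not how `compare_dplyr_binding()`
 is implemented, and the data (including the missing value) is made up for
 illustration.

 ```{r, eval = FALSE}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 library(testthat)

 test_that("startsWith() gives the same result on a Table and a data frame", {
   # Illustrative data; the NA checks that missing values are handled alike
   df <- tibble(x = c("Foo", "bar", NA, "baz"))

   expect_equal(
     arrow_table(df) %>%
       mutate(starts_with_b = startsWith(x, "b")) %>%
       collect(),
     df %>%
       mutate(starts_with_b = startsWith(x, "b"))
   )
 })
 ```

 In the package's own test suite, prefer the helper functions, since they keep
 the tests consistent with the rest of the suite.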

 ## Step 2 - Hook up the compute function with options class if necessary

 If the C++ compute function can have options specified, make sure that the
 function is linked with its options class in `make_compute_options()` in the
 file `arrow/r/src/compute.cpp`.  You can find out if a compute function requires
 options by looking in the docs here: https://arrow.apache.org/docs/cpp/compute.html

 In the case of `starts_with()`, it looks something like this:

 ```cpp
   if (func_name == "starts_with") {
     using Options = arrow::compute::MatchSubstringOptions;
     bool ignore_case = false;
     if (!Rf_isNull(options["ignore_case"])) {
       ignore_case = cpp11::as_cpp<bool>(options["ignore_case"]);
     }
     return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]),
                                      ignore_case);
   }
 ```

 You can usually copy and paste from a similar existing example.  In this case,
 as the option `ignore_case` doesn't map to any parameter of `startsWith()`, we
 give it a default value of `false`, but if it has been set, we use the set value
 instead.  As the `pattern` option maps directly to `prefix` in `startsWith()`,
 we can pass it straight through.
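
 From the R side, these options are supplied as a named list. The sketch below,
 which assumes an installed arrow build where `starts_with` is already hooked up
 to `MatchSubstringOptions` as above, shows the effect of leaving `ignore_case`
 out versus setting it. The input strings are made up for illustration.

 ```{r}
 library(arrow, warn.conflicts = FALSE)

 strings <- Array$create(c("Apache", "arrow", "R"))

 # ignore_case is not supplied, so the default of false applies
 case_sensitive <- call_function(
   "starts_with",
   strings,
   options = list(pattern = "a")
 )

 # ignore_case is supplied, so the set value is used instead
 case_insensitive <- call_function(
   "starts_with",
   strings,
   options = list(pattern = "a", ignore_case = TRUE)
 )

 as.vector(case_sensitive)
 as.vector(case_insensitive)
 ```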

 ## Step 3 - Map the R function to the C++ kernel

 The next task is writing the code which binds the R function to the C++ kernel.

 ### Step 3a - See if direct mapping is appropriate
 Compare the C++ function and R function.  If they are simple functions with no
 options, it might be possible to directly map between the C++ and R in
 `unary_function_map`, in the case of compute functions that operate on single
 columns of data, or `binary_function_map` for those which operate on 2 columns
 of data.

 As the C++ function `starts_with()` requires options, direct mapping is not appropriate.
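
 For contrast, a function pair that needs no options is a natural candidate for
 direct mapping. As an illustration (the choice of `toupper()` and the
 `utf8_upper` compute function is ours, not something this walkthrough depends
 on), the compute function can be called with just the data and gives the same
 result as the R function:

 ```{r}
 library(arrow, warn.conflicts = FALSE)

 words <- c("apache", "arrow")

 # No options list is needed: the data is the only input
 upper_arrow <- call_function("utf8_upper", Array$create(words))

 as.vector(upper_arrow)
 toupper(words)
 ```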

 ### Step 3b - If direct mapping not possible, try a modified implementation
 If the function cannot be mapped directly, some extra work may be needed to
 ensure that calling the arrow version of the function produces the same result
 as calling the R version of the function.  In this case, the function will need
 to be added to the `nse_funcs` function registry. Here is how this might look for
 `startsWith()`:

 ```{r, eval = FALSE}
 register_binding("base::startsWith", function(x, prefix) {
   Expression$create(
     "starts_with",
     x,
     options = list(pattern = prefix)
   )
 })
 ```

 In the source files, all the `register_binding()` calls are wrapped in functions
 that are called on package load. These are separated into files based on
 subject matter (e.g., `R/dplyr-funcs-math.R`, `R/dplyr-funcs-string.R`): find the
 closest analog to the function whose binding is being defined and define the
 new binding in a similar location. For example, the binding for `startsWith()`
 is registered in `dplyr-funcs-string.R` next to the binding for `endsWith()`.

 Note: we use the namespace-qualified name (i.e. `"base::startsWith"`) for a
 binding. This will register the same binding both as `startsWith()` and as
 `base::startsWith()`, which will allow us to use the `pkg::` prefix in a call.

 ```{r}
 arrow_table(starwars) %>%
   filter(stringr::str_detect(name, "Darth"))
 ```
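
 The same applies to the `startsWith()` binding from this walkthrough. A minimal
 sketch, assuming an arrow version in which bindings are registered under
 namespace-qualified names as described above (the data is made up for
 illustration):

 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)

 tbl <- arrow_table(tibble(x = c("Foo", "bar", "baz", "qux")))

 # Unqualified call
 unqualified <- tbl %>%
   filter(startsWith(x, "b")) %>%
   collect()

 # Namespace-qualified call, resolved to the same binding
 qualified <- tbl %>%
   filter(base::startsWith(x, "b")) %>%
   collect()

 identical(unqualified$x, qualified$x)
 ```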

 Hint:  you can use `call_function()` to call a compute function directly from R.
 This might be useful if you want to experiment with a compute function while
 you're writing bindings for it, e.g.

 ```{r}
 call_function(
   "starts_with",
   Array$create(c("Apache", "Arrow", "R", "package")),
   options = list(pattern = "A")
 )
 ```

 ## Step 4 - Run (and potentially add to) your tests

 In the process of implementing the function, you will need at least one test
 to make sure that your binding works and that future changes to the Arrow R
 package don't break it! Bindings are tested in files that correspond to
 the file in which they were defined (e.g., `startsWith()` is tested in
 `tests/testthat/test-dplyr-funcs-string.R`) next to the tests for `endsWith()`.

 You may end up implementing more tests, for example if you discover unusual
 edge cases.  This is fine - add them to the ones you wrote originally,
 and run them all.  If they pass, you're done and you can submit a PR.
 If you've modified the C++ code in the R package (for example, when hooking up
 a binding to its options class), you should make sure to run
 `arrow/r/lint.sh` to lint the code.