---
title: "Apache Arrow in Python and R with reticulate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The `arrow` package provides `reticulate` methods for passing data between
R and Python in the same process. This document provides a brief overview.
## Installing
To use `arrow` in Python, you'll need at least the `pyarrow` library.
To install it in a virtualenv, run:
```r
library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides virtualenv_create()
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")
```
If you want to install a development version of `pyarrow`,
add `nightly = TRUE`:
```r
install_pyarrow("arrow-env", nightly = TRUE)
```
`install_pyarrow()` also works with `conda` environments
(`conda_create()` instead of `virtualenv_create()`).
For more on installing and configuring Python,
see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
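
As a minimal sketch of the `conda` route described above (assuming `conda` is already installed and reusing the illustrative environment name `"arrow-env"`), the setup might look like this:

```r
library(arrow)
library(reticulate)

# Create a conda environment (instead of a virtualenv) and install pyarrow into it
conda_create("arrow-env")
install_pyarrow("arrow-env")

# In later sessions, activate it with use_condaenv() rather than use_virtualenv()
use_condaenv("arrow-env")
```

If you go this way, replace the `use_virtualenv()` call in the examples below with `use_condaenv()`.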
## Using
To start, load `arrow` and `reticulate`, and then import `pyarrow`.
```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")
```
The package includes support for sharing Arrow `Array` and `RecordBatch`
objects in-process between R and Python. For example, let's create an `Array`
in `pyarrow`.
```r
a <- pa$array(c(1, 2, 3))
a
## Array
## <double>
## [
## 1,
## 2,
## 3
## ]
```
`a` is now an `Array` object in our R session, even though we created it in Python.
We can apply R methods to it:
```r
a[a > 1]
## Array
## <double>
## [
## 2,
## 3
## ]
```
We can send data both ways. One reason to use `pyarrow` from R is
to take advantage of functionality that is better supported in Python than in R.
For example, `pyarrow` has a `concat_arrays` function, but as of version 0.17, this
function is not implemented in the `arrow` R package. We can use `reticulate`
to call it efficiently from R.
```r
b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b
## Array
## <double>
## [
## 1,
## 2,
## 3,
## 5,
## 6,
## 7,
## 8,
## 9
## ]
```
Now we have a single `Array` in R.
"Send", however, isn't the correct word. Internally, we're passing pointers to
the data between the R and Python interpreters running together in the same
process, without copying anything. Nothing is being sent: we're sharing and
accessing the same internal Arrow memory buffers.
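
The same sharing applies to `RecordBatch` objects. The sketch below makes the hand-off explicit with `reticulate`'s `r_to_py()` and `py_to_r()` generics. It assumes the `"arrow-env"` virtualenv from the installation step, and the column names `x` and `y` are made up for illustration.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")

# Build a RecordBatch in R (x and y are illustrative column names)
batch <- record_batch(x = c(1, 2, 3), y = c("a", "b", "c"))

# Hand it to Python explicitly; py_batch refers to a pyarrow RecordBatch
py_batch <- r_to_py(batch)

# Bring it back into R as an arrow RecordBatch
batch_again <- py_to_r(py_batch)
batch_again$num_rows
```

In everyday use these conversions usually happen implicitly, as in the `concat_arrays` example above; calling the generics directly is mainly useful when you want to control where an object lives.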
## Troubleshooting
If you get an error like
```
Error in py_get_attr_impl(x, name, silent) :
AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
```
it means that the version of `pyarrow` you're using is too old.
Support for passing data to and from R is included in versions 0.17 and later.
Check your `pyarrow` version like this:
```r
pa$`__version__`
## [1] "0.16.0"
```
Note that your `pyarrow` and `arrow` versions don't need to match each other:
both just need to be 0.17 or later.
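
The version requirement can also be checked programmatically. The sketch below is only an illustration: `version_at_least()` is a hypothetical helper, not part of `arrow` or `pyarrow`. It keeps just the leading numeric part of the version string, on the assumption that development versions append a suffix such as `.dev123`.

```r
# Hypothetical helper: is `version` at least `minimum`?
# Only the leading "major.minor.patch" digits are compared, so a
# development suffix such as ".dev123" is ignored.
version_at_least <- function(version, minimum = "0.17.0") {
  release <- regmatches(version, regexpr("^[0-9]+(\\.[0-9]+)*", version))
  numeric_version(release) >= numeric_version(minimum)
}

version_at_least("0.16.0")
## [1] FALSE
version_at_least("0.17.0")
## [1] TRUE
```

In a live session you could pass it ``pa$`__version__` `` for `pyarrow` and `as.character(packageVersion("arrow"))` for the R package.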