| --- |
| title: "Using the Arrow C++ Library in R" |
| description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package." |
| output: rmarkdown::html_vignette |
| vignette: > |
| %\VignetteIndexEntry{Using the Arrow C++ Library in R} |
| %\VignetteEngine{knitr::rmarkdown} |
| %\VignetteEncoding{UTF-8} |
| --- |
| |
| The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R. |
| |
| # Features |
| |
| ## Multi-file datasets |
| |
| The `arrow` package lets you work efficiently with large, multi-file datasets |
| using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview. |
| |
| ## Reading and writing files |
| |
| `arrow` provides some simple functions for using the Arrow C++ library to read and write files. |
| These functions are designed to drop into your normal R workflow |
| without requiring any knowledge of the Arrow C++ library |
| and use naming conventions and arguments that follow popular R packages, particularly `readr`. |
| The readers return `data.frame`s |
| (or if you use the `tibble` package, they will act like `tbl_df`s), |
| and the writers take `data.frame`s. |
| |
| Importantly, `arrow` provides basic read and write support for the [Apache |
| Parquet](https://parquet.apache.org/) columnar data file format. |
| |
| ```r |
| library(arrow) |
| df <- read_parquet("path/to/file.parquet") |
| ``` |
| |
| Just as you can read, you can write Parquet files: |
| |
| ```r |
| write_parquet(df, "path/to/different_file.parquet") |
| ``` |
| |
| The `arrow` package also includes a faster and more robust implementation of the |
| [Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and |
| `write_feather()`. This implementation depends |
| on the same underlying C++ library as the Python version does, |
| resulting in more reliable and consistent behavior across the two languages, as |
| well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/). |
| By default, `arrow` writes the Feather V2 format, |
| which supports a wider range of data types, as well as compression. |
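
As a minimal sketch of the Feather functions described above, the following round-trips a small `data.frame` through a temporary file. The data and the temporary path are made up for illustration.

```r
library(arrow)

# Illustrative data only
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

# Write to a temporary file; Feather V2 is the default format
path <- tempfile(fileext = ".feather")
write_feather(df, path)

# Read the file back into R as a data.frame
df2 <- read_feather(path)
```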
| |
| For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively. |
| `read_csv_arrow()` currently has fewer parsing options than other readers for handling |
| the CSV format variations found in the wild, but for the files it can read, it is |
| often significantly faster than other R CSV readers, such as |
| `utils::read.csv()`, `readr::read_csv()`, and `data.table::fread()`. |
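
A short sketch of reading a CSV file this way might look like the following. The file is generated on the fly so the example is self-contained; the column names are arbitrary.

```r
library(arrow)

# Create a small CSV file to read (illustrative data)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  csv_path,
  row.names = FALSE
)

# Read it into a data.frame
df <- read_csv_arrow(csv_path)

# Or keep the result as an Arrow Table instead of converting to R
tab <- read_csv_arrow(csv_path, as_data_frame = FALSE)
```

Keeping the result as a Table with `as_data_frame = FALSE` avoids the conversion to R vectors when the data is only going to be passed on to other Arrow functions.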
| |
| ## Working with Arrow data in Python |
| |
| Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you |
| share data between R and Python (`pyarrow`) efficiently, enabling you to take |
| advantage of the vibrant ecosystem of Python packages that build on top of |
| Apache Arrow. See `vignette("python", package = "arrow")` for details. |
| |
| ## Access to Arrow messages, buffers, and streams |
| |
| The `arrow` package also provides many lower-level bindings to the C++ library, which enable you |
| to access and manipulate Arrow objects. You can use these to build connectors |
| to other applications and services that use Arrow. One example is Spark: the |
| [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to |
| move data to and from Spark, yielding [significant performance |
| gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). |
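
As one small illustration of working at this level, the sketch below serializes a `data.frame` to the Arrow IPC stream format in memory and reads it back. It assumes `write_to_raw()`, `read_ipc_stream()`, and `buffer()` are available in your installed version of the package; a real connector would send the bytes over a socket or another transport rather than keep them in the same session.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Serialize to a raw vector holding an Arrow IPC stream
bytes <- write_to_raw(df, format = "stream")

# Wrap the bytes in an Arrow Buffer to inspect them as an Arrow object
buf <- buffer(bytes)
n_bytes <- buf$size

# Deserialize the stream back into a data.frame
df2 <- read_ipc_stream(bytes)
```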
| |
| # Internals |
| |
| ## Mapping of R <--> Arrow types |
| |
| Arrow has a rich data type system that includes direct parallels with R's data types and much more. |
| |
| In the tables, entries with a `-` are not currently implemented. |
| |
| ### R to Arrow |
| |
| | R type | Arrow type | |
| |--------------------------|------------| |
| | logical | boolean | |
| | integer | int32 | |
| | double ("numeric") | float64 | |
| | character | utf8^1^ | |
| | factor | dictionary | |
| | raw | uint8 | |
| | Date | date32 | |
| | POSIXct | timestamp | |
| | POSIXlt | struct | |
| | data.frame | struct | |
| | list^2^ | list | |
| | bit64::integer64 | int64 | |
| | difftime | time32 | |
| | vctrs::vctrs_unspecified | null | |
| |
| ^1^: If the character vector exceeds 2GB of strings, it will be converted to a `large_utf8` Arrow type. |
| |
| ^2^: Only lists in which all elements are the same type can be translated to the Arrow list type (which is a "list of" some type). |
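
You can check a few rows of the table above by creating Arrays from R vectors and inspecting their `$type`. This is only a sketch; the variable names are arbitrary, and the comments restate the mapping from the table rather than printed output.

```r
library(arrow)

# Each comment gives the R type and the Arrow type listed in the table
lgl_type <- Array$create(c(TRUE, FALSE, NA))$type   # logical   -> boolean
int_type <- Array$create(1:3)$type                  # integer   -> int32
dbl_type <- Array$create(c(1.5, 2.5))$type          # double    -> float64
chr_type <- Array$create(c("a", "b"))$type          # character -> utf8
fct_type <- Array$create(factor(c("a", "b")))$type  # factor    -> dictionary

# The default can be overridden by passing an explicit type
wide_type <- Array$create(1:3, type = int64())$type
```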
| |
| ### Arrow to R |
| |
| | Arrow type | R type | |
| |-------------------|------------------------------| |
| | boolean | logical | |
| | int8 | integer | |
| | int16 | integer | |
| | int32 | integer | |
| | int64 | integer^3^ | |
| | uint8 | integer | |
| | uint16 | integer | |
| | uint32 | integer^3^ | |
| | uint64 | integer^3^ | |
| | float16 | - | |
| | float32 | double | |
| | float64 | double | |
| | utf8 | character | |
| | binary | arrow_binary ^5^ | |
| | fixed_size_binary | arrow_fixed_size_binary ^5^ | |
| | date32 | Date | |
| | date64 | POSIXct | |
| time32            | hms::hms                     |
| time64            | hms::hms                     |
| | timestamp | POSIXct | |
| | duration | - | |
| | decimal | double | |
| | dictionary | factor^4^ | |
| | list | arrow_list ^6^ | |
| | fixed_size_list | arrow_fixed_size_list ^6^ | |
| | struct | data.frame | |
| | null | vctrs::vctrs_unspecified | |
| | map | - | |
| | union | - | |
| | large_utf8 | character | |
| | large_binary | arrow_large_binary ^5^ | |
| | large_list | arrow_large_list ^6^ | |
| |
| ^3^: These integer types may contain values that exceed the range of R's `integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are converted to `double` ("numeric") and `int64` is converted to `bit64::integer64`. |
| |
| ^4^: Due to the limitation of R `factor`s, Arrow `dictionary` values are coerced to string when translated to R if they are not already strings. |
| |
| ^5^: `arrow*_binary` classes are implemented as lists of raw vectors. |
| |
| ^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` with a `ptype` attribute set to what an empty Array of the value type converts to. |
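
The reverse direction can be sketched the same way, by building Arrays with explicit Arrow types and converting them back with `$as_vector()`. The values are illustrative only.

```r
library(arrow)

# int8 values come back as R integers
small_ints <- Array$create(1:3, type = int8())
r_ints <- small_ints$as_vector()

# float32 values come back as R doubles
floats <- Array$create(c(1.5, 2.5), type = float32())
r_dbl <- floats$as_vector()

# dictionary arrays come back as factors
dict <- Array$create(factor(c("x", "y", "x")))
r_fct <- dict$as_vector()
```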
| |
| ### R object attributes |
| |
| Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like `x$metadata$new_key <- "new value"`. |
| |
| This metadata is preserved when writing the table to Feather or Parquet. When you read those files into R, or call `as.data.frame()` on a Table/RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow. |
| |
| Note that the `attributes()` stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys. |
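
A minimal sketch of assigning a custom metadata key, following the `x$metadata$new_key` pattern mentioned above, could look like this. The key name and value are arbitrary, and the Feather round-trip is included only to illustrate that the key travels with the file.

```r
library(arrow)

tab <- Table$create(data.frame(x = 1:3))

# Attach a custom string value under a key of our choosing
tab$metadata$new_key <- "new value"
in_memory <- tab$metadata$new_key

# Write the Table out and read it back as a Table (not a data.frame)
path <- tempfile(fileext = ".feather")
write_feather(tab, path)
tab2 <- read_feather(path, as_data_frame = FALSE)
from_file <- tab2$metadata$new_key
```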
| |
| ## Class structure and package conventions |
| |
| C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace. |
| |
| In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`. |
| |
| Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a `snake_case` alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above. |
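
The two equivalent ways of creating a `RecordBatch` described above can be compared directly. This sketch uses the same columns as the text.

```r
library(arrow)

# R6 factory method
rb1 <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))

# snake_case alias
rb2 <- record_batch(int = 1:10, dbl = as.numeric(1:10))

# Both produce batches with the same schema and data
same <- rb1$Equals(rb2)

# Fields and methods are accessed with `$`, as on any R6 object
n_rows <- rb1$num_rows
col_names <- names(rb1)
```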
| |
| The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadTable()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used. |
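
To make the contrast concrete, the sketch below reads the same Parquet file through the high-level wrapper and through the `R6` class. It assumes your build of the package includes Parquet support and uses the reader's `$ReadTable()` method; the file contents are made up.

```r
library(arrow)

# Write a small Parquet file to work with (illustrative data)
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:3, y = c("a", "b", "c")), path)

# High-level interface: returns a data.frame
df <- read_parquet(path)

# Low-level interface: create the reader, then ask it for a Table
reader <- ParquetFileReader$create(path)
tab <- reader$ReadTable()

# Convert to a data.frame only when the data is needed in R
df_from_tab <- as.data.frame(tab)
```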