---
title: "Using the Arrow C++ Library in R"
description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using the Arrow C++ Library in R}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
# Features
## Multi-file datasets
The `arrow` package lets you work efficiently with large, multi-file datasets
using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
## Reading and writing files
`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
These functions are designed to drop into your normal R workflow
without requiring any knowledge of the Arrow C++ library
and use naming conventions and arguments that follow popular R packages, particularly `readr`.
The readers return `data.frame`s
(or if you use the `tibble` package, they will act like `tbl_df`s),
and the writers take `data.frame`s.
Importantly, `arrow` provides basic read and write support for the [Apache
Parquet](https://parquet.apache.org/) columnar data file format.
```r
library(arrow)
df <- read_parquet("path/to/file.parquet")
```
Just as you can read, you can write Parquet files:
```r
write_parquet(df, "path/to/different_file.parquet")
```
The `arrow` package also includes a faster and more robust implementation of the
[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
`write_feather()`. This implementation depends
on the same underlying C++ library as the Python version does,
resulting in more reliable and consistent behavior across the two languages, as
well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
By default, `arrow` writes the Feather V2 format,
which supports a wider range of data types as well as compression.
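A Feather round trip looks much like the Parquet example above (the file path here is a placeholder):

```r
library(arrow)

# Write a data.frame to a Feather V2 file (the default format)
write_feather(mtcars, "path/to/file.feather")

# Read it back as a data.frame
df <- read_feather("path/to/file.feather")
```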
For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
While `read_csv_arrow()` currently has fewer parsing options for dealing with
every CSV format variation in the wild, for the files it can read, it is
often significantly faster than other R CSV readers, such as
`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
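These readers follow the same pattern as the Parquet and Feather functions; for example (file paths are placeholders):

```r
library(arrow)

# Read a CSV file into a data.frame
df <- read_csv_arrow("path/to/file.csv")

# Read line-delimited JSON the same way
logs <- read_json_arrow("path/to/file.jsonl")
```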
## Working with Arrow data in Python
Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
share data between R and Python (`pyarrow`) efficiently, enabling you to take
advantage of the vibrant ecosystem of Python packages that build on top of
Apache Arrow. See `vignette("python", package = "arrow")` for details.
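As a minimal sketch of the pattern described in that vignette, assuming `pyarrow` is installed in the Python environment that `reticulate` uses, Arrow objects can be passed between the two languages:

```r
library(arrow)
library(reticulate)

# Import pyarrow into the R session
pa <- import("pyarrow")

# Create an Arrow Array in R and hand it to Python
a <- Array$create(c(1, 2, 3))
py_a <- r_to_py(a)
```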
## Access to Arrow messages, buffers, and streams
The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
to access and manipulate Arrow objects. You can use these to build connectors
to other applications and services that use Arrow. One example is Spark: the
[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
move data to and from Spark, yielding [significant performance
gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
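For example, one of the simplest of these lower-level bindings wraps a raw vector in an Arrow `Buffer`:

```r
library(arrow)

# Create an Arrow Buffer from an R raw vector
buf <- buffer(as.raw(c(1, 2, 3)))

# Buffers know their size in bytes
buf$size
```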
# Internals
## Mapping of R <--> Arrow types
Arrow has a rich data type system that includes direct parallels with R's data types and much more.
In the tables below, entries with a `-` are not currently implemented.
### R to Arrow
| R type | Arrow type |
|--------------------------|------------|
| logical | boolean |
| integer | int32 |
| double ("numeric") | float64 |
| character | utf8^1^ |
| factor | dictionary |
| raw | uint8 |
| Date | date32 |
| POSIXct | timestamp |
| POSIXlt | struct |
| data.frame | struct |
| list^2^ | list |
| bit64::integer64 | int64 |
| difftime | time32 |
| vctrs::vctrs_unspecified | null |
^1^: If the character vector exceeds 2GB of strings, it will be converted to the `large_utf8` Arrow type.
^2^: Only lists in which all elements are the same type can be translated to an Arrow list type (which is a "list of" some type).
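One way to see these mappings in action is to create an `Array` from an R vector and inspect its `$type`; per the table above, each line below should report the corresponding Arrow type:

```r
library(arrow)

# The Arrow type each R vector maps to
Array$create(1L)$type          # integer  -> int32
Array$create(1.5)$type         # double   -> float64
Array$create("a")$type         # character -> utf8
Array$create(Sys.Date())$type  # Date     -> date32
```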
### Arrow to R
| Arrow type | R type |
|-------------------|------------------------------|
| boolean | logical |
| int8 | integer |
| int16 | integer |
| int32 | integer |
| int64 | integer^3^ |
| uint8 | integer |
| uint16 | integer |
| uint32 | integer^3^ |
| uint64 | integer^3^ |
| float16 | - |
| float32 | double |
| float64 | double |
| utf8 | character |
| binary | arrow_binary ^5^ |
| fixed_size_binary | arrow_fixed_size_binary ^5^ |
| date32 | Date |
| date64 | POSIXct |
| time32 | hms::difftime |
| time64 | hms::difftime |
| timestamp | POSIXct |
| duration | - |
| decimal | double |
| dictionary | factor^4^ |
| list | arrow_list ^6^ |
| fixed_size_list | arrow_fixed_size_list ^6^ |
| struct | data.frame |
| null | vctrs::vctrs_unspecified |
| map | - |
| union | - |
| large_utf8 | character |
| large_binary | arrow_large_binary ^5^ |
| large_list | arrow_large_list ^6^ |
^3^: These integer types may contain values that exceed the range of R's `integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are converted to `double` ("numeric") and `int64` is converted to `bit64::integer64`.
^4^: Due to limitations of R's `factor` type, Arrow `dictionary` values are coerced to string when converted to R if they are not already strings.
^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` with a `ptype` attribute set to what an empty Array of the value type converts to.
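To illustrate footnote 3, converting an `int64` Array whose values exceed the 32-bit integer range should, per the table above, yield a `bit64::integer64` vector (this sketch assumes the `bit64` package is installed):

```r
library(arrow)

# A value too large for R's 32-bit integer type
big <- bit64::as.integer64("10000000000")
a <- Array$create(big)

# Per footnote 3, this is expected to be a bit64::integer64 vector
class(as.vector(a))
```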
### R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like `x$metadata$new_key <- "new value"`.
This metadata is preserved when writing the table to Feather or Parquet, and when reading those files into R, or when calling `as.data.frame()` on a Table/RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.
Note that the `attributes()` stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
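The round trip described above can be sketched as follows, using an arbitrary `"label"` attribute for illustration:

```r
library(arrow)

# Attach a custom attribute to a column
df <- data.frame(x = 1:3)
attr(df$x, "label") <- "An example label"

# Convert to an Arrow Table; column attributes go into the Schema metadata
tab <- Table$create(df)

# Assign additional string metadata under a key of your choosing
tab$metadata$new_key <- "new value"

# Converting back restores the column attributes
df2 <- as.data.frame(tab)
attr(df2$x, "label")
```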
## Class structure and package conventions
C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace.
In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of the factory methods that an R user is most likely to encounter also have a `snake_case` alias, in order to feel more familiar to contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` does the same as the `RecordBatch$create()` call above.
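Putting the two spellings side by side:

```r
library(arrow)

# R6 factory method, matching the C++ naming convention
rb1 <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))

# snake_case alias, equivalent to the call above
rb2 <- record_batch(int = 1:10, dbl = as.numeric(1:10))

rb1$num_rows  # 10
```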
The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.