---
title: "Using the Arrow C++ Library in R"
description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using the Arrow C++ Library in R}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
# Features
## Multi-file datasets
The `arrow` package lets you work efficiently with large, multi-file datasets
using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
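For example, a minimal sketch (the directory path and column names here are placeholders):

```r
library(arrow)
library(dplyr)

# Open a directory of Parquet files as a single Dataset
ds <- open_dataset("path/to/dataset_directory")

# Build the query lazily with dplyr verbs; data is only read at collect()
ds %>%
  filter(year == 2019) %>%
  select(year, value) %>%
  collect()
```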
## Reading and writing files
`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
These functions are designed to drop into your normal R workflow
without requiring any knowledge of the Arrow C++ library
and use naming conventions and arguments that follow popular R packages, particularly `readr`.
The readers return `data.frame`s
(or if you use the `tibble` package, they will act like `tbl_df`s),
and the writers take `data.frame`s.
Importantly, `arrow` provides basic read and write support for the [Apache
Parquet](https://parquet.apache.org/) columnar data file format.
```r
library(arrow)
df <- read_parquet("path/to/file.parquet")
```
Just as you can read, you can write Parquet files:
```r
write_parquet(df, "path/to/different_file.parquet")
```
The `arrow` package also includes a faster and more robust implementation of the
[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
`write_feather()`. This implementation depends
on the same underlying C++ library as the Python version does,
resulting in more reliable and consistent behavior across the two languages, as
well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
By default, `arrow` writes the Feather V2 format,
which supports a wider range of data types as well as compression.
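For example (the file path is a placeholder):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Write a data frame to a Feather (V2) file and read it back
write_feather(df, "path/to/file.feather")
df2 <- read_feather("path/to/file.feather")
```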
For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
While `read_csv_arrow()` currently has fewer parsing options for dealing with
every CSV format variation in the wild, for the files it can read, it is
often significantly faster than other R CSV readers, such as
`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
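For example (again with placeholder paths):

```r
library(arrow)

# Read a CSV file and a newline-delimited JSON file into data.frames
df <- read_csv_arrow("path/to/file.csv")
logs <- read_json_arrow("path/to/file.json")
```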
## Working with Arrow data in Python
Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
share data between R and Python (`pyarrow`) efficiently, enabling you to take
advantage of the vibrant ecosystem of Python packages that build on top of
Apache Arrow. See `vignette("python", package = "arrow")` for details.
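As a minimal sketch of what this looks like (assuming `pyarrow` is installed in the Python environment that `reticulate` is configured to use):

```r
library(arrow)
library(reticulate)

# Create an Arrow Table in R...
tab <- Table$create(x = 1:3, y = c("a", "b", "c"))

# ...and hand it to Python: with pyarrow available, reticulate converts it
# to a pyarrow.Table without copying the underlying data
py_tab <- r_to_py(tab)
py_tab$num_rows
```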
## Access to Arrow messages, buffers, and streams
The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
to access and manipulate Arrow objects. You can use these to build connectors
to other applications and services that use Arrow. One example is Spark: the
[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
move data to and from Spark, yielding [significant performance
gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
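For instance, a small sketch using a couple of these lower-level classes:

```r
library(arrow)

# Wrap an R raw vector in an Arrow Buffer...
buf <- buffer(as.raw(c(1, 2, 3, 4)))
buf$size

# ...and expose it as an input stream that Arrow readers can consume
stream <- BufferReader$create(buf)
```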
# Object hierarchy
## Metadata objects
Arrow defines the following classes for representing metadata:
| Class | Description | How to create an instance |
| ---------- | -------------------------------------------------- | -------------------------------- |
| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
| `Field` | a character string name and a `DataType` | `field(name, type)` |
| `Schema` | list of `Field`s | `schema(...)` |
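For example, you can build these up from the type functions (the field names here are arbitrary):

```r
library(arrow)

# A Field pairs a name with a DataType
f <- field("x", int32())

# A Schema is a list of Fields
s <- schema(x = int32(), y = utf8())
s
```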
## Data objects
Arrow defines the following classes for representing zero-dimensional (scalar),
one-dimensional (array/vector-like), and two-dimensional (tabular/data
frame-like) data:
| Dim | Class | Description | How to create an instance |
| --- | -------------- | ----------------------------------------- | -------------------------------------------------------------------------- |
| 0 | `Scalar` | single value and its `DataType` | `Scalar$create(value, type)` |
| 1 | `Array` | vector of values and its `DataType` | `Array$create(vector, type)` |
| 1 | `ChunkedArray` | vectors of values and their `DataType` | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)` |
| 2 | `RecordBatch` | list of `Array`s with a `Schema` | `RecordBatch$create(...)` or alias `record_batch(...)` |
| 2   | `Table`        | list of `ChunkedArray`s with a `Schema`   | `Table$create(...)` or `arrow::read_*(file, as_data_frame = FALSE)`         |
| 2 | `Dataset` | list of `Table`s with the same `Schema` | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)` |
Each of these is defined as an `R6` class in the `arrow` R package and
corresponds to a class of the same name in the Arrow C++ library. The `arrow`
package provides a variety of `R6` and S3 methods for interacting with instances
of these classes.
For convenience, the `arrow` package also defines several synthetic classes that
do not exist in the C++ library, including:
* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
* `ArrowTabular`: inherited by `RecordBatch` and `Table`
* `ArrowObject`: inherited by all Arrow objects
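For illustration, here is one instance of each data container from the table above, along with the synthetic classes that appear in their class vectors (the column names and values are arbitrary):

```r
library(arrow)

scalar  <- Scalar$create(1L)
arr     <- Array$create(1:5)
chunked <- chunked_array(1:3, 4:5)
batch   <- record_batch(x = 1:5, y = letters[1:5])
tab     <- Table$create(x = 1:5, y = letters[1:5])

class(arr)    # includes "ArrowDatum" and "ArrowObject"
class(batch)  # includes "ArrowTabular" and "ArrowObject"
```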
# Internals
## Mapping of R <--> Arrow types
Arrow has a rich data type system that includes direct parallels with R's data types and much more.
In the tables below, a `-` indicates a conversion that is not currently implemented.
### R to Arrow
| R type | Arrow type |
|--------------------------|------------|
| logical | boolean |
| integer | int32 |
| double ("numeric") | float64 |
| character | utf8^1^ |
| factor | dictionary |
| raw | uint8 |
| Date | date32 |
| POSIXct | timestamp |
| POSIXlt | struct |
| data.frame | struct |
| list^2^ | list |
| bit64::integer64 | int64 |
| difftime | time32 |
| vctrs::vctrs_unspecified | null |
^1^: If a character vector contains more than 2GB of strings, it is converted to the `large_utf8` Arrow type.
^2^: Only lists in which all elements are the same type can be converted to the Arrow list type (which is a "list of" some type).
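You can see this mapping in action by letting Arrow infer a type from an R vector, for example:

```r
library(arrow)

Array$create(c(TRUE, FALSE))$type        # boolean
Array$create(1:3)$type                   # int32
Array$create(c(1.5, 2.5))$type           # float64
Array$create(Sys.Date())$type            # date32
Array$create(factor(c("a", "b")))$type   # dictionary
```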
### Arrow to R
| Arrow type | R type |
|-------------------|------------------------------|
| boolean | logical |
| int8 | integer |
| int16 | integer |
| int32 | integer |
| int64 | integer^3^ |
| uint8 | integer |
| uint16 | integer |
| uint32 | integer^3^ |
| uint64 | integer^3^ |
| float16 | - |
| float32 | double |
| float64 | double |
| utf8 | character |
| binary | arrow_binary ^5^ |
| fixed_size_binary | arrow_fixed_size_binary ^5^ |
| date32 | Date |
| date64 | POSIXct |
| time32 | hms::difftime |
| time64 | hms::difftime |
| timestamp | POSIXct |
| duration | - |
| decimal | double |
| dictionary | factor^4^ |
| list | arrow_list ^6^ |
| fixed_size_list | arrow_fixed_size_list ^6^ |
| struct | data.frame |
| null | vctrs::vctrs_unspecified |
| map | - |
| union | - |
| large_utf8 | character |
| large_binary | arrow_large_binary ^5^ |
| large_list | arrow_large_list ^6^ |
^3^: These integer types may contain values that exceed the range of R's `integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are converted to `double` ("numeric") and `int64` is converted to `bit64::integer64`. This conversion can be disabled (so that `int64` always yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
^4^: Due to the limitations of R `factor`s, Arrow `dictionary` values are coerced to string when translated to R if they are not already strings.
^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` with a `ptype` attribute set to what an empty Array of the value type converts to.
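Going the other direction, `as.vector()` (or `as.data.frame()` for tabular objects) applies the table above. A small sketch:

```r
library(arrow)

a <- Array$create(1:3, type = int64())
class(as.vector(a))  # "integer" here, since the values fit in 32 bits

# To always get bit64::integer64 back from int64 data:
# options(arrow.int64_downcast = FALSE)
```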
### R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like `x$metadata$new_key <- "new value"`.
This metadata is preserved when writing the Table to Feather or Parquet. When those files are read back into R, or when `as.data.frame()` is called on a Table/RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round trip through Arrow.
Note that the `attributes()` stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys. For more details, see the documentation for `schema()`.
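A minimal sketch of this round trip (the `"units"` attribute here is just an arbitrary custom column attribute):

```r
library(arrow)

df <- data.frame(x = 1:3)
attr(df$x, "units") <- "meters"      # arbitrary custom column attribute

tab <- Table$create(df)
tab$metadata$new_key <- "new value"  # your own string metadata, as above

attributes(as.data.frame(tab)$x)     # the "units" attribute is restored
```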
## Class structure and package conventions
C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace.
In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of the factory methods that an R user is likely to encounter also have a `snake_case` alias, in order to feel more familiar to contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` does the same as the `RecordBatch$create()` call above.
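For example, the two spellings construct equivalent objects:

```r
library(arrow)

rb1 <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))
rb2 <- record_batch(int = 1:10, dbl = as.numeric(1:10))

rb1$Equals(rb2)  # TRUE: both build the same RecordBatch
```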
The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.