---
title: "Using the Arrow C++ Library in R"
description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using the Arrow C++ Library in R}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
# Features
## Multi-file datasets
The `arrow` package lets you work efficiently with large, multi-file datasets
using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
## Reading and writing files
`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
These functions are designed to drop into your normal R workflow
without requiring any knowledge of the Arrow C++ library
and use naming conventions and arguments that follow popular R packages, particularly `readr`.
The readers return `data.frame`s
(or if you use the `tibble` package, they will act like `tbl_df`s),
and the writers take `data.frame`s.
Importantly, `arrow` provides basic read and write support for the [Apache
Parquet](https://parquet.apache.org/) columnar data file format.
```r
library(arrow)
df <- read_parquet("path/to/file.parquet")
```
Just as you can read, you can write Parquet files:
```r
write_parquet(df, "path/to/different_file.parquet")
```
The `arrow` package also includes a faster and more robust implementation of the
[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
`write_feather()`. This implementation depends
on the same underlying C++ library as the Python version does,
resulting in more reliable and consistent behavior across the two languages, as
well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
By default, `arrow` writes the Feather V2 format,
which supports a wider range of data types as well as compression.
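A Feather round trip looks much like the Parquet example above (the file path here is a placeholder):

```r
library(arrow)

# Write a data.frame to a Feather V2 file (the default format)
write_feather(mtcars, "path/to/file.feather")

# Read it back as a data.frame
df <- read_feather("path/to/file.feather")
```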
For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
While `read_csv_arrow()` currently has fewer parsing options for dealing with
every CSV format variation in the wild, for the files it can read, it is
often significantly faster than other R CSV readers, such as
`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
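These readers follow the same pattern as the Parquet and Feather functions; for example (file paths are placeholders):

```r
library(arrow)

# Read a CSV file into a data.frame
df <- read_csv_arrow("path/to/file.csv")

# Read line-delimited JSON the same way
logs <- read_json_arrow("path/to/file.jsonl")
```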
## Working with Arrow data in Python
Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
share data between R and Python (`pyarrow`) efficiently, enabling you to take
advantage of the vibrant ecosystem of Python packages that build on top of
Apache Arrow. See `vignette("python", package = "arrow")` for details.
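As a minimal sketch of the pattern described in that vignette, assuming `pyarrow` is installed in the Python environment that `reticulate` uses, Arrow objects can be passed between the two languages:

```r
library(arrow)
library(reticulate)

# Import pyarrow into the R session
pa <- import("pyarrow")

# Create an Arrow Array in R and hand it to Python
a <- Array$create(c(1, 2, 3))
py_a <- r_to_py(a)
```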
## Access to Arrow messages, buffers, and streams
The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
to access and manipulate Arrow objects. You can use these to build connectors
to other applications and services that use Arrow. One example is Spark: the
[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
move data to and from Spark, yielding [significant performance
gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
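For example, one of the simplest of these lower-level bindings wraps a raw vector in an Arrow `Buffer`:

```r
library(arrow)

# Create an Arrow Buffer from an R raw vector
buf <- buffer(as.raw(c(1, 2, 3)))

# Buffers know their size in bytes
buf$size
```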
# Internals
## Mapping of R <--> Arrow types
Arrow has a rich data type system that includes direct parallels with R's data types and much more.
In the tables below, entries with a `-` are not currently implemented.
### R to Arrow
| R type | Arrow type |
|--------------------------|------------|
| logical | boolean |
| integer | int32 |
| double ("numeric") | float64 |
| character | utf8^1^ |
| factor | dictionary |
| raw | uint8 |
| Date | date32 |
| POSIXct | timestamp |
| POSIXlt | struct |
| data.frame | struct |
| list^2^ | list |
| bit64::integer64 | int64 |
| difftime | time32 |
| vctrs::vctrs_unspecified | null |
^1^: If the character vector exceeds 2GB of strings, it will be converted to the `large_utf8` Arrow type.
^2^: Only lists in which all elements are the same type can be translated to an Arrow list type (which is a "list of" some type).
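One way to see these mappings in action is to create an `Array` from an R vector and inspect its `$type`; per the table above, each line below should report the corresponding Arrow type:

```r
library(arrow)

# The Arrow type each R vector maps to
Array$create(1L)$type          # integer  -> int32
Array$create(1.5)$type         # double   -> float64
Array$create("a")$type         # character -> utf8
Array$create(Sys.Date())$type  # Date     -> date32
```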
### Arrow to R
| Arrow type | R type |
|-------------------|------------------------------|
| boolean | logical |
| int8 | integer |
| int16 | integer |
| int32 | integer |
| int64 | integer^3^ |
| uint8 | integer |
| uint16 | integer |
| uint32 | integer^3^ |
| uint64 | integer^3^ |
| float16 | - |
| float32 | double |
| float64 | double |
| utf8 | character |
| binary | arrow_binary ^5^ |
| fixed_size_binary | arrow_fixed_size_binary ^5^ |
| date32 | Date |
| date64 | POSIXct |
| time32 | hms::difftime |
| time64 | hms::difftime |
| timestamp | POSIXct |
| duration | - |
| decimal | double |
| dictionary | factor^4^ |
| list | arrow_list ^6^ |
| fixed_size_list | arrow_fixed_size_list ^6^ |
| struct | data.frame |
| null | vctrs::vctrs_unspecified |
| map | - |
| union | - |
| large_utf8 | character |
| large_binary | arrow_large_binary ^5^ |
| large_list | arrow_large_list ^6^ |
^3^: These integer types may contain values that exceed the range of R's `integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are converted to `double` ("numeric") and `int64` is converted to `bit64::integer64`.
^4^: Due to limitations of R's `factor` type, Arrow `dictionary` values are coerced to string when converted to R if they are not already strings.
^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` with a `ptype` attribute set to what an empty Array of the value type converts to.
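To illustrate footnote 3, converting an `int64` Array whose values exceed the 32-bit integer range should, per the table above, yield a `bit64::integer64` vector (this sketch assumes the `bit64` package is installed):

```r
library(arrow)

# A value too large for R's 32-bit integer type
big <- bit64::as.integer64("10000000000")
a <- Array$create(big)

# Per footnote 3, this is expected to be a bit64::integer64 vector
class(as.vector(a))
```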
### R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like `x$metadata$new_key <- "new value"`.
This metadata is preserved when writing the table to Feather or Parquet, and when reading those files into R, or when calling `as.data.frame()` on a Table/RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.
Note that the `attributes()` stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
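The round trip described above can be sketched as follows, using an arbitrary `"label"` attribute for illustration:

```r
library(arrow)

# Attach a custom attribute to a column
df <- data.frame(x = 1:3)
attr(df$x, "label") <- "An example label"

# Convert to an Arrow Table; column attributes go into the Schema metadata
tab <- Table$create(df)

# Assign additional string metadata under a key of your choosing
tab$metadata$new_key <- "new value"

# Converting back restores the column attributes
df2 <- as.data.frame(tab)
attr(df2$x, "label")
```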
## Class structure and package conventions
C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace.
In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of the factory methods that an R user is most likely to encounter also have a `snake_case` alias, in order to feel more familiar to contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` does the same as the `RecordBatch$create()` call above.
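Putting the two spellings side by side:

```r
library(arrow)

# R6 factory method, matching the C++ naming convention
rb1 <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))

# snake_case alias, equivalent to the call above
rb2 <- record_batch(int = 1:10, dbl = as.numeric(1:10))

rb1$num_rows  # 10
```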
The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.