Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
arrow package exposes an interface to the Arrow C++ library, enabling access to many of its features in R. It provides low-level access to the Arrow C++ library API and higher-level access through a
dplyr backend and familiar R functions.
write_parquet()), an efficient and widely used columnar format
write_feather()), a format optimized for speed and interoperability
Install the latest release of
arrow from CRAN with
Conda users can install
arrow from conda-forge with
conda install -c conda-forge --strict-channel-priority r-arrow
Installing a released version of the
arrow package requires no additional system dependencies. For macOS and Windows, CRAN hosts binary packages that contain the Arrow C++ library. On Linux, source package installation will also build necessary C++ dependencies. For a faster, more complete installation, set the environment variable
vignette("install", package = "arrow") for details.
Development versions of the package (binary and source) are built nightly and hosted at https://arrow-r-nightly.s3.amazonaws.com. To install from there:
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
Conda users can install
arrow nightly builds with
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
If you already have a version of
arrow installed, you can switch to the latest nightly development version with
arrow::install_arrow(nightly = TRUE)
These nightly package builds are not official Apache releases and are not recommended for production use. They may be useful for testing bug fixes and new features under active development.
Among the many applications of the
arrow package, two of the most accessible are:
The sections below describe these two uses and illustrate them with basic examples. The sections below mention two Arrow data structures:
Table: a tabular, column-oriented data structure capable of storing and processing large amounts of data more efficiently than R’s built-in
data.frameand with SQL-like column data types that afford better interoperability with databases and data warehouse systems
Dataset: a data structure functionally similar to
Tablebut with the capability to work on larger-than-memory data partitioned across multiple files
arrow package provides functions for reading single data files in several common formats. By default, calling any of these functions returns an R
data.frame. To return an Arrow
Table, set argument
as_data_frame = FALSE.
read_parquet(): read a file in Parquet format
read_feather(): read a file in Feather format (the Apache Arrow IPC format)
read_delim_arrow(): read a delimited text file (default delimiter is comma)
read_csv_arrow(): read a comma-separated values (CSV) file
read_tsv_arrow(): read a tab-separated values (TSV) file
read_json_arrow(): read a JSON data file
For writing data to single files, the
arrow package provides the functions
write_feather(). These can be used with R
data.frame and Arrow
For example, let’s write the Star Wars characters data that’s included in
dplyr to a Parquet file, then read it back in. Parquet is a popular choice for storing analytic data; it is optimized for reduced file sizes and fast read performance, especially for column-based access patterns. Parquet is widely supported by many tools and platforms.
First load the
library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE)
Then write the
starwars to a Parquet file at
file_path <- tempfile() write_parquet(starwars, file_path)
Then read the Parquet file into an R
sw <- read_parquet(file_path)
R object attributes are preserved when writing data to Parquet or Feather files and when reading those files back into R. This enables round-trip writing and reading of
sf::sf objects, R
data.frames with with
haven::labelled columns, and
data.frames with other custom attributes.
For reading and writing larger files or sets of multiple files,
Dataset objects and provides the functions
write_dataset(), which enable analysis and processing of bigger-than-memory data, including the ability to partition data into smaller chunks without loading the full data into memory. For examples of these functions, see
vignette("dataset", package = "arrow").
All these functions can read and write files in the local filesystem or in Amazon S3 (by passing S3 URIs beginning with
s3://). For more details, see
vignette("fs", package = "arrow")
arrow package provides a
dplyr backend enabling manipulation of Arrow tabular data with
dplyr verbs. To use it, first load both packages
dplyr. Then load data into an Arrow
Dataset object. For example, read the Parquet file written in the previous example into an Arrow
sw <- read_parquet(file_path, as_data_frame = FALSE)
Next, pipe on
result <- sw %>% filter(homeworld == "Tatooine") %>% rename(height_cm = height, mass_kg = mass) %>% mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>% arrange(desc(birth_year)) %>% select(name, height_in, mass_lbs)
arrow package uses lazy evaluation to delay computation until the result is required. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation.
result is an object with class
arrow_dplyr_query which represents all the computations to be performed:
result #> Table (query) #> name: string #> height_in: expr #> mass_lbs: expr #> #> * Filter: equal(homeworld, "Tatooine") #> * Sorted by birth_year [desc] #> See $.data for the source Arrow object
To perform these computations and materialize the result, call
compute() returns an Arrow
Table, suitable for passing to other
result %>% compute() #> Table #> 10 rows x 3 columns #> $name <string> #> $height_in <double> #> $mass_lbs <double>
collect() returns an R
data.frame, suitable for viewing or passing to other R functions for analysis or visualization:
result %>% collect() #> # A tibble: 10 x 3 #> name height_in mass_lbs #> <chr> <dbl> <dbl> #> 1 C-3PO 65.7 165. #> 2 Cliegg Lars 72.0 NA #> 3 Shmi Skywalker 64.2 NA #> 4 Owen Lars 70.1 265. #> 5 Beru Whitesun lars 65.0 165. #> 6 Darth Vader 79.5 300. #> 7 Anakin Skywalker 74.0 185. #> 8 Biggs Darklighter 72.0 185. #> 9 Luke Skywalker 67.7 170. #> 10 R5-D4 38.2 70.5
arrow package works with most single-table
dplyr verbs except those that compute aggregates, such as
dplyr verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents. The changelog lists many of them. If there are additional functions you would like to see implemented, please file an issue as described in the Getting help section below.
dplyr queries on
Table objects, if the
arrow package detects an unimplemented function within a
dplyr verb, it automatically calls
collect() to return the data as an R
data.frame before processing that
dplyr verb. For queries on
Dataset objects (which can be larger than memory), it raises an error if the function is unimplemented; you need to explicitly tell it to
Other applications of
arrow are described in the following vignettes:
vignette("python", package = "arrow"): use
reticulateto pass data between R and Python
vignette("flight", package = "arrow"): connect to Arrow Flight RPC servers to send and receive data
vignette("arrow", package = "arrow"): access and manipulate Arrow objects through low-level bindings to the C++ library
If you encounter a bug, please file an issue with a minimal reproducible example on the Apache Jira issue tracker. Create an account or log in, then click Create to file an issue. Select the project Apache Arrow (ARROW), select the component R, and begin the issue summary with
[R] followed by a space. For more information, see the Report bugs and propose features section of the Contributing to Apache Arrow page in the Arrow developer documentation.
We welcome questions, discussion, and contributions from users of the
arrow package. For information about mailing lists and other venues for engaging with the Arrow developer and user communities, please see the Apache Arrow Community page.
All participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.