---
title: "Reading and writing data files"
description: >
  Learn how to read and write CSV, Parquet, and Feather files with arrow
output: rmarkdown::html_vignette
---

The arrow package provides functions for reading single data files into
memory, in several common formats. By default, calling any of these
functions returns an R data frame. To return an Arrow Table instead, set
the argument `as_data_frame = FALSE`.

- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file

For writing data to single files, the arrow package provides the
following functions, which can be used with both R data frames and
Arrow Tables:

- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format

All of these functions can read and write files in the local filesystem or
in cloud storage. For more on cloud storage support in arrow, see the [cloud storage article](./fs.html).

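For example, if your build of arrow includes S3 support and credentials are
available, the readers can accept a URI directly. The sketch below is not
evaluated here, and the bucket and object path are hypothetical:

```{r, eval=FALSE}
# Hypothetical bucket and path, for illustration only; requires an arrow
# build with S3 support and valid credentials.
df <- read_parquet("s3://my-bucket/path/to/data.parquet")
```
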
The arrow package also supports reading larger-than-memory single data files,
and reading and writing multi-file datasets. This enables analysis and
processing of larger-than-memory data, and provides the ability to partition
data into smaller chunks without loading the full data into memory. For more
information on this topic, see the [dataset article](./dataset.html).

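As a preview of that workflow, the sketch below (not evaluated here) writes a
data frame to a directory of Parquet files partitioned by one column, then
opens that directory as a Dataset without loading it into memory:

```{r, eval=FALSE}
# Partition mtcars by the cyl column, writing one subdirectory per value,
# then open the resulting directory as a Dataset
dataset_path <- tempfile()
write_dataset(mtcars, dataset_path, partitioning = "cyl")
open_dataset(dataset_path)
```
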
## Parquet format

[Apache Parquet](https://parquet.apache.org/) is a popular
choice for storing analytics data; it is a binary format that is
optimized for reduced file sizes and fast read performance, especially
for column-based access patterns. The simplest way to read and write
Parquet data using arrow is with the `read_parquet()` and
`write_parquet()` functions. To illustrate this, we'll write the
`starwars` data included in dplyr to a Parquet file, then read it
back in. First load the arrow and dplyr packages:

```{r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Next we'll write the data frame to a Parquet file located at `file_path`:

```{r}
file_path <- tempfile()
write_parquet(starwars, file_path)
```

The size of a Parquet file is typically much smaller than the corresponding
CSV file would have been. This is in part due to the use of file compression:
by default, Parquet files written with the arrow package use
[Snappy compression](https://google.github.io/snappy/), but other options
such as gzip are also supported. See `help("write_parquet", package = "arrow")`
for more information.

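For example, to use gzip rather than the default, pass the `compression`
argument (this assumes the gzip codec is enabled in your build of arrow):

```{r}
gzip_path <- tempfile()
write_parquet(starwars, gzip_path, compression = "gzip")
file.size(file_path) # written with the snappy default
file.size(gzip_path) # written with gzip
```
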
Having written the Parquet file, we can now read it back in with `read_parquet()`:

```{r}
read_parquet(file_path)
```

The default is to return a data frame or tibble. If we want an Arrow Table instead, we set `as_data_frame = FALSE`:

```{r}
read_parquet(file_path, as_data_frame = FALSE)
```

One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality:

```{r}
read_parquet(file_path, col_select = c("name", "height", "mass"))
```

Fine-grained control over the Parquet reader is possible with the `props` argument. See `help("ParquetArrowReaderProperties", package = "arrow")` for details.

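As one illustration, the sketch below disables multithreaded reads;
`use_threads` is one of the properties accepted by
`ParquetArrowReaderProperties$create()`:

```{r}
read_parquet(
  file_path,
  props = ParquetArrowReaderProperties$create(use_threads = FALSE)
)
```
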
R object attributes are preserved when writing data to Parquet or
Arrow/Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R data frames with
`haven::labelled` columns, and data frames with other custom attributes.
To learn more about how metadata are handled in arrow, see the
[metadata article](./metadata.html).

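A minimal sketch of this behavior is shown below; the `notes` attribute is
arbitrary and chosen purely for illustration:

```{r}
df <- data.frame(x = 1:3)
# attach a custom attribute before writing
attr(df, "notes") <- "an example attribute"
tf <- tempfile()
write_parquet(df, tf)
# the attribute survives the round trip to Parquet and back
attributes(read_parquet(tf))$notes
```
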
## Arrow/Feather format

The Arrow file format was developed to provide binary columnar
serialization for data frames, to make reading and writing data frames
efficient, and to make sharing data across data analysis languages easy.
This file format is sometimes referred to as Feather because it is an
outgrowth of the original [Feather](https://github.com/wesm/feather) project
that has now been moved into the Arrow project itself. You can find the
detailed specification of version 2 of the Arrow format -- officially
referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
on the Arrow specification page.

The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:

```{r}
file_path <- tempfile()
write_feather(starwars, file_path)
```

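For example, zstd compression can be requested explicitly (assuming the zstd
codec is enabled in your build of arrow):

```{r}
zstd_path <- tempfile()
write_feather(starwars, zstd_path, compression = "zstd")
```
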
The `read_feather()` function provides a familiar interface for reading Feather files:

```{r}
read_feather(file_path)
```

Like the Parquet reader, this reader supports reading only a subset of columns, and can produce Arrow Table output:

```{r}
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
```

## CSV format

The read/write capabilities of the arrow package also include support for
CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`,
and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read
data files, where the Arrow C++ options have been mapped to arguments in a
way that mirrors the conventions used in `readr::read_delim()`, with a
`col_select` argument inspired by `vroom::vroom()`.

A simple example of writing and reading a CSV file with arrow is shown below:

```{r}
file_path <- tempfile()
write_csv_arrow(mtcars, file_path)
read_csv_arrow(file_path, col_select = starts_with("d"))
```

In addition to the options provided by the readr-style arguments (`delim`, `quote`, `escape_double`, `escape_backslash`, etc.), you can use the `schema` argument to specify column types: see `help("schema", package = "arrow")` for details. There is also the option of using `parse_options`, `convert_options`, and `read_options` to exercise fine-grained control over the Arrow CSV reader: see `help("CsvReadOptions", package = "arrow")` for details.

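As a sketch of the `schema` argument: because the schema supplies the column
names, we also set `skip = 1` so that the header row is not read as data:

```{r}
tf <- tempfile()
writeLines(c("x,y", "1,a", "2,b"), tf)
read_csv_arrow(
  tf,
  schema = schema(x = int32(), y = utf8()),
  skip = 1 # the schema supplies the column names, so skip the header row
)
```
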
## JSON format

The arrow package supports reading (but not writing) tabular data from line-delimited JSON, using the `read_json_arrow()` function. A minimal example is shown below:

```{r}
file_path <- tempfile()
writeLines('
{ "hello": 3.5, "world": false, "yo": "thing" }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "yo": null }
', file_path, useBytes = TRUE)
read_json_arrow(file_path)
```

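Like the other readers, `read_json_arrow()` also accepts the `col_select` and
`as_data_frame` arguments, so the same patterns shown above apply here:

```{r}
read_json_arrow(file_path, col_select = c("hello", "world"), as_data_frame = FALSE)
```
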
## Further reading

- To learn more about cloud storage, see the [cloud storage article](./fs.html).
- To learn more about multi-file datasets, see the [datasets article](./dataset.html).
- The Apache Arrow R cookbook has chapters on [reading and writing single files](https://arrow.apache.org/cookbook/r/reading-and-writing-data---single-files.html) into memory and working with [multi-file datasets](https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html) stored on-disk.