# Reading and writing data files
The arrow package provides functions for reading single data files into
memory, in several common formats. By default, calling any of these
functions returns an R data frame. To return an Arrow Table instead, set the
argument `as_data_frame = FALSE`.
- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format
(formerly called the Feather format)
- `read_delim_arrow()`: read a delimited text file (default delimiter
is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file
For writing data to single files, the arrow package provides the
following functions, which can be used with both R data frames and Arrow
Tables:
- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format
All of these functions can read and write files in the local filesystem or
in cloud storage. For more on cloud storage support in arrow, see the
[cloud storage article](https://arrow.apache.org/docs/r/articles/fs.md).
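For example, a remote URI can be passed directly to any of the single-file
readers. A minimal sketch, using a hypothetical bucket name:

``` r
# "my-example-bucket" is a hypothetical placeholder, not a real bucket
df <- read_parquet("s3://my-example-bucket/path/to/data.parquet")
```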
The arrow package also supports reading larger-than-memory single data
files, and reading and writing multi-file datasets. This makes it possible
to analyze and process larger-than-memory data, and to partition it into
smaller chunks without loading the full data into memory. For more
information on this topic, see the [dataset
article](https://arrow.apache.org/docs/r/articles/dataset.md).
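As a quick illustration, a multi-file dataset can be opened without loading
it into memory. This is a minimal sketch, assuming a directory of Parquet
files at a placeholder path and a hypothetical column name:

``` r
library(arrow)
library(dplyr)

# "path/to/dataset_dir" and "some_column" are placeholders;
# no rows are read into memory until collect() is called
ds <- open_dataset("path/to/dataset_dir")
ds %>%
  filter(some_column > 0) %>%
  collect()
```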
<div class="section level2">
## Parquet format
[Apache Parquet](https://parquet.apache.org/) is a popular choice for
storing analytics data; it is a binary format that is optimized for
reduced file sizes and fast read performance, especially for
column-based access patterns. The simplest way to read and write Parquet
data using arrow is with the `read_parquet()` and `write_parquet()`
functions. To illustrate this, we’ll write the `starwars` data included
in dplyr to a Parquet file, then read it back in. First load the arrow
and dplyr packages:
<div id="cb1" class="sourceCode">
``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```
Next we’ll write the data frame to a Parquet file located at
`file_path`:
<div id="cb2" class="sourceCode">
``` r
file_path <- tempfile()
write_parquet(starwars, file_path)
```
A Parquet file is typically much smaller than the
corresponding CSV file. This is in part due to the use
of file compression: by default, Parquet files written with the arrow
package use [Snappy compression](https://google.github.io/snappy/) but
other options such as gzip are also supported. See
`help("write_parquet", package = "arrow")` for more information.
Having written the Parquet file, we can now read it with
`read_parquet()`:
<div id="cb3" class="sourceCode">
``` r
read_parquet(file_path)
```
```
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```
The default is to return a data frame or tibble. If we want an Arrow
Table instead, we would set `as_data_frame = FALSE`:
<div id="cb5" class="sourceCode">
``` r
read_parquet(file_path, as_data_frame = FALSE)
```
```
## Table
## 87 rows x 14 columns
## $name <string>
## $height <int32>
## $mass <double>
## $hair_color <string>
## $skin_color <string>
## $eye_color <string>
## $birth_year <double>
## $sex <string>
## $gender <string>
## $homeworld <string>
## $species <string>
## $films: list<element <string>>
## $vehicles: list<element <string>>
## $starships: list<element <string>>
```
One useful feature of Parquet files is that they store data column-wise,
and contain metadata that allow file readers to skip to the relevant
sections of the file. That means it is possible to load only a subset of
the columns without reading the complete file. The `col_select` argument
to `read_parquet()` supports this functionality:
<div id="cb7" class="sourceCode">
``` r
read_parquet(file_path, col_select = c("name", "height", "mass"))
```
```
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # i 77 more rows
```
Fine-grained control over the Parquet reader is possible with the
`props` argument. See
`help("ParquetArrowReaderProperties", package = "arrow")` for details.
R object attributes are preserved when writing data to Parquet or
Arrow/Feather files and when reading those files back into R. This
enables round-trip writing and reading of `sf::sf` objects, R data
frames with `haven::labelled` columns, and data frames with other
custom attributes. To learn more about how metadata are handled in
arrow, see the [metadata
article](https://arrow.apache.org/docs/r/articles/metadata.md).
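As a minimal sketch of this round trip (the attribute name `my_note` is a
hypothetical example):

``` r
df <- data.frame(x = 1:3)
attr(df, "my_note") <- "kept through the round trip"  # hypothetical attribute
tmp <- tempfile()
write_parquet(df, tmp)
attr(read_parquet(tmp), "my_note")  # should return the original string
```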
## Arrow/Feather format
The Arrow file format was developed to provide binary columnar
serialization for data frames, to make reading and writing data frames
efficient, and to make sharing data across data analysis languages easy.
This file format is sometimes referred to as Feather because it is an
outgrowth of the original [Feather](https://github.com/wesm/feather)
project that has now been moved into the Arrow project itself. You can
find the detailed specification of version 2 of the Arrow format –
officially referred to as [the Arrow IPC file
format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)
– on the Arrow specification page.
The `write_feather()` function writes version 2 Arrow/Feather files by
default, and supports multiple kinds of file compression. Basic use is
shown below:
<div id="cb9" class="sourceCode">
``` r
file_path <- tempfile()
write_feather(starwars, file_path)
```
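To use a different codec, pass the `compression` argument; a minimal
sketch:

``` r
# write a version 2 Feather file compressed with zstd instead of the default
write_feather(starwars, file_path, compression = "zstd")
```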
The `read_feather()` function provides a familiar interface for reading
Feather files:
<div id="cb10" class="sourceCode">
``` r
read_feather(file_path)
```
```
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```
Like the Parquet reader, this reader supports reading only a subset of
columns, and can produce Arrow Table output:
<div id="cb12" class="sourceCode">
``` r
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
```
```
## Table
## 87 rows x 3 columns
## $name <string>
## $height <int32>
## $mass <double>
```
## CSV format
The read/write capabilities of the arrow package also include support
for CSV and other text-delimited files. The `read_csv_arrow()`,
`read_tsv_arrow()`, and `read_delim_arrow()` functions all use the Arrow
C++ CSV reader to read data files, where the Arrow C++ options have been
mapped to arguments in a way that mirrors the conventions used in
`readr::read_delim()`, with a `col_select` argument inspired by
`vroom::vroom()`.
A simple example of writing and reading a CSV file with arrow is shown
below:
<div id="cb14" class="sourceCode">
``` r
file_path <- tempfile()
write_csv_arrow(mtcars, file_path)
read_csv_arrow(file_path, col_select = starts_with("d"))
```
```
## # A tibble: 32 x 2
##     disp  drat
##    <dbl> <dbl>
##  1   160  3.9
##  2   160  3.9
##  3   108  3.85
##  4   258  3.08
##  5   360  3.15
##  6   225  2.76
##  7   360  3.21
##  8  147.  3.69
##  9  141.  3.92
## 10  168.  3.92
## # i 22 more rows
```
In addition to the options provided by the readr-style arguments
(`delim`, `quote`, `escape_double`, `escape_backslash`, and so on), you
can use the `schema` argument to specify column types: see the
`schema()` help for details. There is also the option of using
`parse_options`, `convert_options`, and `read_options` to exercise
fine-grained control over the Arrow CSV reader: see
`help("CsvReadOptions", package = "arrow")` for details.
## JSON format
The arrow package supports reading (but not writing) tabular data from
line-delimited JSON, using the `read_json_arrow()` function. A minimal
example is shown below:
<div id="cb16" class="sourceCode">
``` r
file_path <- tempfile()
writeLines('
{ "hello": 3.5, "world": false, "yo": "thing" }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "yo": null }
', file_path, useBytes = TRUE)
read_json_arrow(file_path)
```
```
## # A tibble: 3 x 3
##   hello world yo
##   <dbl> <lgl> <chr>
## 1  3.5  FALSE thing
## 2  3.25 NA    NA
## 3  0    TRUE  NA
```
## Further reading
- To learn more about cloud storage, see the [cloud storage
article](https://arrow.apache.org/docs/r/articles/fs.md).
- To learn more about multi-file datasets, see the [datasets
article](https://arrow.apache.org/docs/r/articles/dataset.md).
- The Apache Arrow R cookbook has chapters on [reading and writing
single
files](https://arrow.apache.org/cookbook/r/reading-and-writing-data---single-files.html)
into memory and working with [multi-file
datasets](https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html)
stored on-disk.