| # Reading and Writing Data |
| |
This chapter contains recipes related to reading and writing data using Apache Arrow. When reading data with Apache Arrow, there are two different objects you can choose to read the data into:
| 1. a `tibble` |
| 2. an Arrow Table |
| |
There are a number of circumstances in which you may want to read in the data as an Arrow Table:
* your dataset is large, so loading it entirely into memory may cause performance issues
* you want faster performance from your `dplyr` queries
* you want to be able to take advantage of Arrow's compute functions, as sketched below
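
As a minimal sketch of those last two points, an Arrow Table can be queried directly with `dplyr` verbs; the filtering and column selection below are evaluated by Arrow's compute engine, and `collect()` pulls only the result into R (exact pushdown behaviour depends on your arrow version):

```{r, table_dplyr_sketch}
library(dplyr)

# Query an Arrow Table with dplyr verbs; Arrow evaluates the
# filter and select, and collect() returns the result to R
Table$create(airquality) %>%
  filter(Temp > 90) %>%
  select(Month, Day, Temp) %>%
  collect()
```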
| |
| ## Converting from a tibble to an Arrow Table |
| |
| You can convert an existing `tibble` or `data.frame` into an Arrow Table. |
| |
| ```{r, table_create_from_tibble} |
| air_table <- Table$create(airquality) |
| air_table |
| ``` |
| ```{r, test_table_create_from_tibble, opts.label = "test"} |
| test_that("table_create_from_tibble chunk works as expected", { |
| expect_s3_class(air_table, "Table") |
| }) |
| ``` |
| |
| ## Converting data from an Arrow Table to a tibble |
| |
| You may want to convert an Arrow Table to a tibble to view the data or work with it in your usual analytics pipeline. You can use either `dplyr::collect()` or `as.data.frame()` to do this. |
| |
| ```{r, collect_table} |
| air_tibble <- dplyr::collect(air_table) |
| air_tibble |
| ``` |
| ```{r, test_collect_table, opts.label = "test"} |
| test_that("collect_table chunk works as expected", { |
| expect_identical(air_tibble, airquality) |
| }) |
| ``` |
| |
| ## Reading and Writing Parquet Files |
| |
| ### Writing a Parquet file |
| |
| You can write Parquet files to disk using `arrow::write_parquet()`. |
| ```{r, write_parquet} |
| # Create table |
| my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| # Write to Parquet |
| write_parquet(my_table, "my_table.parquet") |
| ``` |
| ```{r, test_write_parquet, opts.label = "test"} |
| test_that("write_parquet chunk works as expected", { |
| expect_true(file.exists("my_table.parquet")) |
| }) |
| ``` |
| |
| ### Reading a Parquet file |
| |
Given a Parquet file, you can read it back in using `arrow::read_parquet()`.
| |
| ```{r, read_parquet} |
| parquet_tbl <- read_parquet("my_table.parquet") |
| parquet_tbl |
| ``` |
| ```{r, test_read_parquet, opts.label = "test"} |
| test_that("read_parquet works as expected", { |
| expect_identical(parquet_tbl, tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| }) |
| ``` |
| |
As the argument `as_data_frame` was left set to its default value of `TRUE`, the file was read in as a tibble.
| |
| ```{r, read_parquet_2} |
| class(parquet_tbl) |
| ``` |
| ```{r, test_read_parquet_2, opts.label = "test"} |
| test_that("read_parquet_2 works as expected", { |
| expect_s3_class(parquet_tbl, "data.frame") |
| }) |
| ``` |
| If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table. |
| |
| ```{r, read_parquet_table} |
| my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE) |
| my_table_arrow_table |
| ``` |
| |
| ```{r, read_parquet_table_class} |
| class(my_table_arrow_table) |
| ``` |
| ```{r, test_read_parquet_table_class, opts.label = "test"} |
| test_that("read_parquet_table_class works as expected", { |
| expect_s3_class(my_table_arrow_table, "Table") |
| }) |
| ``` |
| |
| ### How to read a Parquet file from S3 |
| |
| You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument. |
| |
| ```{r, read_parquet_s3, eval = FALSE} |
| df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet") |
| ``` |
For more in-depth instructions, including how to work with S3 buckets that require authentication, see the guide to reading and writing data from/to S3 buckets: https://arrow.apache.org/docs/r/articles/fs.html.
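
One pattern from that guide, sketched here under the assumption that the bucket allows anonymous access, is to create a filesystem object with `s3_bucket()` and build file paths relative to it:

```{r, read_parquet_s3_bucket, eval = FALSE}
# Create a reference to the bucket, then read a file relative to it
bucket <- s3_bucket("ursa-labs-taxi-data")
df <- read_parquet(bucket$path("2019/06/data.parquet"))
```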
| |
| ### How to filter columns while reading a Parquet file |
| |
| When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument. |
| |
| ```{r, read_parquet_filter} |
| # Create table to read back in |
| dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40))) |
| # Write to Parquet |
| write_parquet(dist_time, "dist_time.parquet") |
| |
| # Read in only the "time" column |
| time_only <- read_parquet("dist_time.parquet", col_select = "time") |
| time_only |
| ``` |
| ```{r, test_read_parquet_filter, opts.label = "test"} |
| test_that("read_parquet_filter works as expected", { |
| expect_identical(time_only, tibble::tibble(time = c(43, 44, 40))) |
| }) |
| ``` |
| |
| ## Reading and Writing Feather files |
| |
| ### Write an IPC/Feather V2 file |
| |
The Arrow IPC file format is identical to the Feather version 2 format. If you call `write_arrow()`, you will get a deprecation warning telling you to use `write_ipc_stream()` or `write_feather()` instead.
| |
| ```{r, write_arrow} |
| # Create table |
| my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| write_arrow(my_table, "my_table.arrow") |
| ``` |
| ```{r, test_write_arrow, opts.label = "test"} |
| test_that("write_arrow chunk works as expected", { |
| expect_true(file.exists("my_table.arrow")) |
| expect_warning( |
| write_arrow(iris, "my_table.arrow"), |
| regexp = "Use 'write_ipc_stream' or 'write_feather' instead." |
| ) |
| }) |
| ``` |
| |
| Instead, you can use `write_feather()`. |
| |
| ```{r, write_feather} |
| my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| write_feather(my_table, "my_table.arrow") |
| ``` |
| ```{r, test_write_feather, opts.label = "test"} |
| test_that("write_feather chunk works as expected", { |
| expect_true(file.exists("my_table.arrow")) |
| }) |
| ``` |

### Write a Feather (version 1) file
| |
| For legacy support, you can write data in the original Feather format by setting the `version` parameter to `1`. |
| |
| ```{r, write_feather1} |
| # Create table |
| my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| # Write to Feather format V1 |
write_feather(my_table, "my_table.feather", version = 1)
| ``` |
| ```{r, test_write_feather1, opts.label = "test"} |
| test_that("write_feather1 chunk works as expected", { |
| expect_true(file.exists("my_table.feather")) |
| }) |
| ``` |
| |
| ### Read a Feather file |
| |
| You can read Feather files in via `read_feather()`. |
| |
| ```{r, read_feather} |
| my_feather_tbl <- read_feather("my_table.arrow") |
| ``` |
| ```{r, test_read_feather, opts.label = "test"} |
| test_that("read_feather chunk works as expected", { |
| expect_identical(dplyr::collect(my_feather_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| }) |
| ``` |
| |
| ## Reading and Writing Streaming IPC Files |
| |
| You can write to the IPC stream format using `write_ipc_stream()`. |
| |
| ```{r, write_ipc_stream} |
| # Create table |
| my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| # Write to IPC stream format |
| write_ipc_stream(my_table, "my_table.arrows") |
| ``` |
| ```{r, test_write_ipc_stream, opts.label = "test"} |
| test_that("write_ipc_stream chunk works as expected", { |
| expect_true(file.exists("my_table.arrows")) |
| }) |
| ``` |
You can read from the IPC stream format using `read_ipc_stream()`.
| |
| ```{r, read_ipc_stream} |
| my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows") |
| ``` |
| ```{r, test_read_ipc_stream, opts.label = "test"} |
| test_that("read_ipc_stream chunk works as expected", { |
| expect_equal(my_ipc_stream, tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) |
| }) |
| ``` |
| |
| ## Reading and Writing CSV files |
| |
You can use `write_csv_arrow()` to save an Arrow Table or data frame to disk as a CSV file.
| |
| ```{r, write_csv_arrow} |
| write_csv_arrow(cars, "cars.csv") |
| ``` |
| ```{r, test_write_csv_arrow, opts.label = "test"} |
| test_that("write_csv_arrow chunk works as expected", { |
| expect_true(file.exists("cars.csv")) |
| }) |
| ``` |
| |
| You can use `read_csv_arrow()` to read in a CSV file as an Arrow Table. |
| |
| ```{r, read_csv_arrow} |
| my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE) |
| ``` |
| |
| ```{r, test_read_csv_arrow, opts.label = "test"} |
| test_that("read_csv_arrow chunk works as expected", { |
  expect_equal(dplyr::collect(my_csv), cars, ignore_attr = TRUE)
| }) |
| ``` |
| |
| ## Reading and Writing Partitioned Data |
| |
| ### Writing Partitioned Data |
| |
| You can use `write_dataset()` to save data to disk in partitions based on columns in the data. |
| |
| ```{r, write_dataset} |
| write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day")) |
| list.files("airquality_partitioned") |
| ``` |
| ```{r, test_write_dataset, opts.label = "test"} |
| test_that("write_dataset chunk works as expected", { |
| # Partition by month |
| expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9")) |
| # We have enough files |
| expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153) |
| }) |
| ``` |
| As you can see, this has created folders based on the first partition variable supplied, `Month`. |
| |
| If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`. |
| |
| ```{r} |
| list.files("airquality_partitioned/Month=5") |
| ``` |
| |
Each of these folders contains one or more Parquet files containing the relevant partition of the data.
| |
| ```{r} |
| list.files("airquality_partitioned/Month=5/Day=10") |
| ``` |
| |
| ### Reading Partitioned Data |
| |
| You can use `open_dataset()` to read partitioned data. |
| |
| ```{r, open_dataset} |
| # Read data from directory |
| air_data <- open_dataset("airquality_partitioned") |
| |
| # View data |
| air_data |
| ``` |
| ```{r, test_open_dataset, opts.label = "test"} |
| test_that("open_dataset chunk works as expected", { |
| expect_equal(nrow(air_data), 153) |
| expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE) |
| }) |
| ``` |
| |