[R] Initial datasets content (#159)

diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index a76108b..06a5f3e 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -25,6 +25,7 @@
 rmd_files: [
   "index.Rmd",
   "reading_and_writing_data.Rmd",
+  "datasets.Rmd",
   "creating_arrow_objects.Rmd",
   "specify_data_types_and_schemas.Rmd",
   "arrays.Rmd",
diff --git a/r/content/datasets.Rmd b/r/content/datasets.Rmd
new file mode 100644
index 0000000..c9baf02
--- /dev/null
+++ b/r/content/datasets.Rmd
@@ -0,0 +1,399 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Reading and Writing Data - Multiple Files
+
+## Introduction
+
+When reading files into R using Apache Arrow, you can read:
+
+* a single file into memory as a data frame or an Arrow Table
+* a single file that is too large to fit in memory as an Arrow Dataset
+* multiple and partitioned files as an Arrow Dataset
+
+This chapter contains recipes for using Apache Arrow to read and write files 
+that are too large for memory, as well as multiple or partitioned files, as an 
+Arrow Dataset. There are a number of circumstances in which you may want to 
+read in your data as an Arrow Dataset:
+
+* your single data file is too large to load into memory
+* your data are partitioned among numerous files
+* you want faster performance from your `dplyr` queries
+* you want to be able to take advantage of Arrow's compute functions
+
+It is possible to read in partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or 
+other text-delimited formats.  If you are choosing a format for a partitioned, multi-file dataset, we 
+recommend Parquet or Feather (Arrow IPC), both of which can offer better performance 
+than CSV due to their capabilities around metadata and compression.
+
+## Write data to disk - Parquet
+
+You want to write data to disk in a single Parquet file.
+
+### Solution
+
+```{r, write_dataset_basic}
+write_dataset(dataset = airquality, path = "airquality_data")
+```
+
+```{r, test_write_dataset_basic, opts.label = "test"}
+test_that("write_dataset_basic works as expected", {
+  expect_true(file.exists("airquality_data"))
+  expect_length(list.files("airquality_data"), 1)
+})
+```
+
+### Discussion
+
+The default format for `open_dataset()` and `write_dataset()` is Parquet. 
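+
+If you want to be explicit about the format, both functions accept a `format` 
+argument.  The following is a minimal sketch, equivalent to the recipe above, 
+and is not evaluated here.
+
+```{r, write_dataset_explicit_format, eval = FALSE}
+# Equivalent to the call above, with the default Parquet format stated explicitly
+write_dataset(dataset = airquality, path = "airquality_data", format = "parquet")
+
+# Read the same directory back as an Arrow Dataset
+open_dataset("airquality_data", format = "parquet")
+```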
+
+## Write partitioned data - Parquet
+
+You want to save multiple Parquet data files to disk in partitions based on columns in the data.
+
+### Solution
+
+```{r, write_dataset}
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))
+```
+
+```{r, test_write_dataset, opts.label = "test"}
+test_that("write_dataset chunk works as expected", {
+  # Partition by month
+  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
+  # We have enough files
+  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 5)
+})
+```
+
+As you can see, this has created folders based on the supplied partition variable `Month`.
+
+```{r}
+list.files("airquality_partitioned")
+```
+
+### Discussion
+
+The data is written to separate folders based on the values in the `Month` 
+column.  The default behaviour is to use Hive-style partitions (i.e. 
+"col_name=value" folder names).
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_partitioned", recursive = TRUE)
+```
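+
+If you prefer plain value-only directory names (e.g. "5" rather than "Month=5"),
+`write_dataset()` also accepts a `hive_style` argument.  The sketch below is not
+evaluated here and the output path is just for illustration; data written this way
+needs the partitioning columns to be named when reading it back with `open_dataset()`.
+
+```{r, write_dataset_no_hive, eval = FALSE}
+# Write directory-style (non-Hive) partitions; path name is illustrative only
+write_dataset(airquality, "airquality_partitioned_plain",
+  partitioning = "Month", hive_style = FALSE)
+
+# Tell open_dataset() which column the directory names encode
+open_dataset("airquality_partitioned_plain", partitioning = "Month")
+```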
+
+You can specify multiple partitioning variables to add extra levels of partitioning.
+
+```{r, write_dataset_partitioned_deeper}
+write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
+list.files("airquality_partitioned_deeper")
+```
+
+```{r, test_write_dataset_partitioned_deeper, opts.label = "test"}
+test_that("write_dataset_partitioned_deeper works as expected", {
+  expect_true(file.exists("airquality_partitioned_deeper"))
+  expect_length(list.files("airquality_partitioned_deeper", recursive = TRUE), 153)
+})
+```
+
+If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
+```
+
+There are two different ways to specify the variables used for partitioning - 
+either via the `partitioning` argument, as above, or by calling `dplyr::group_by()` on your data before writing - the grouping variables will form the partitions.
+
+```{r, write_dataset_partitioned_groupby}
+write_dataset(dataset = group_by(airquality, Month, Day),
+  path = "airquality_groupby")
+```
+
+```{r, test_write_dataset_partitioned_groupby, opts.label = "test"}
+test_that("write_dataset_partitioned_groupby works as expected", {
+  expect_true(file.exists("airquality_groupby"))
+  expect_length(list.files("airquality_groupby", recursive = TRUE), 153)
+})
+```
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_groupby", recursive = TRUE)
+```
+
+Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
+
+```{r}
+list.files("airquality_groupby/Month=5/Day=10")
+```
+
+Note that rows with an `NA` value in a partition column are written to the 
+`col_name=__HIVE_DEFAULT_PARTITION__` directory.
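+
+For example, partitioning `airquality` on `Ozone` (a column containing missing
+values) would produce such a directory.  This is a sketch only, not evaluated
+here, and the output path is just for illustration.
+
+```{r, write_dataset_na_partition, eval = FALSE}
+# Rows with a missing Ozone value are written to
+# "Ozone=__HIVE_DEFAULT_PARTITION__" (path name is illustrative only)
+write_dataset(airquality, "airquality_na_partitioned", partitioning = "Ozone")
+list.files("airquality_na_partitioned")
+```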
+
+
+## Read partitioned data
+
+You want to read partitioned data files as an Arrow Dataset.
+
+### Solution
+
+```{r, open_dataset}
+# Read data from directory
+air_data <- open_dataset("airquality_partitioned_deeper")
+
+# View data
+air_data
+```
+```{r, test_open_dataset, opts.label = "test"}
+test_that("open_dataset chunk works as expected", {
+  expect_equal(nrow(air_data), 153)
+  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
+})
+```
+
+### Discussion
+
+Partitioning allows you to split data across 
+multiple files and folders, avoiding problems associated with storing all your data 
+in a single file.  This can provide further advantages when using Arrow, as Arrow will only 
+read in the partitioned files needed for any given analysis.
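+
+For example, a query which filters on a partition column only needs to touch the
+matching directories.  A minimal sketch, assuming `dplyr` is loaded (not evaluated
+here):
+
+```{r, open_dataset_partition_filter, eval = FALSE}
+# Only the files under Month=7 need to be read to answer this query
+open_dataset("airquality_partitioned_deeper") %>%
+  filter(Month == 7) %>%
+  collect()
+```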
+
+## Write data to disk - Feather/Arrow IPC format
+
+You want to write data to disk in a single Feather/Arrow IPC file.
+
+### Solution
+
+```{r, write_dataset_feather}
+write_dataset(dataset = airquality,
+  path = "airquality_data_feather",
+  format = "feather")
+```
+```{r, test_write_dataset_feather, opts.label = "test"}
+test_that("write_dataset_feather works as expected", {
+  expect_true(file.exists("airquality_data_feather"))
+  expect_length(list.files("airquality_data_feather"), 1)
+})
+```
+
+## Read in Feather/Arrow IPC data as an Arrow Dataset
+
+You want to read in Feather/Arrow IPC data as an Arrow Dataset.
+
+### Solution
+
+```{r, read_arrow_dataset}
+# write Arrow file to use in this example
+write_dataset(dataset = airquality,
+  path = "airquality_data_arrow",
+  format = "arrow")
+
+# read into R
+open_dataset("airquality_data_arrow", format = "arrow")
+```
+
+```{r, test_read_arrow_dataset, opts.label = "test"}
+test_that("read_arrow_dataset works as expected", {
+  dataset <- open_dataset("airquality_data_arrow", format = "arrow")
+  expect_s3_class(dataset, "FileSystemDataset")
+  expect_identical(dim(dataset), c(153L, 6L))
+})
+```
+
+## Write data to disk - CSV format
+
+You want to write data to disk in a single CSV file.
+
+### Solution
+
+```{r, write_dataset_csv}
+write_dataset(dataset = airquality,
+  path = "airquality_data_csv",
+  format = "csv")
+```
+
+```{r, test_write_dataset_csv, opts.label = "test"}
+test_that("write_dataset_csv works as expected", {
+  expect_true(file.exists("airquality_data_csv"))
+  expect_length(list.files("airquality_data_csv"), 1)
+})
+```
+
+
+## Read in CSV data as an Arrow Dataset
+
+You want to read in CSV data as an Arrow Dataset.
+
+### Solution
+
+```{r, read_csv_dataset}
+# write CSV file to use in this example
+write_dataset(dataset = airquality,
+  path = "airquality_data_csv",
+  format = "csv")
+
+# read into R
+open_dataset("airquality_data_csv", format = "csv")
+```
+
+```{r, test_read_csv_dataset, opts.label = "test"}
+test_that("read_csv_dataset works as expected", {
+  dataset <- open_dataset("airquality_data_csv", format = "csv")
+  expect_s3_class(dataset, "FileSystemDataset")
+  expect_identical(dim(dataset), c(153L, 6L))
+})
+```
+
+## Read in a CSV dataset (no headers)
+
+You want to read in a dataset containing CSVs with no headers.
+
+### Solution
+
+```{r, read_headerless_csv_dataset}
+# write CSV file to use in this example
+dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
+dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]
+
+dir.create("airquality")
+write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
+write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)
+
+# read into R
+open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
+```
+
+```{r, test_read_headerless_csv_dataset, opts.label = "test"}
+test_that("read_headerless_csv_dataset works as expected", {
+  data_in <- open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
+  expect_s3_class(data_in, "FileSystemDataset")
+  expect_identical(dim(data_in), c(80L, 3L))
+  expect_named(data_in, c("Month", "Day", "Temp"))
+})
+```
+
+### Discussion
+
+If your dataset is made up of headerless CSV files, you must supply the names of 
+each column.  You can do this in one of two ways - either via the `column_names` 
+parameter (as shown above) or via a schema:
+
+```{r, read_headerless_csv_dataset_schema}
+open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+```
+
+```{r, test_read_headerless_csv_dataset_schema, opts.label = "test"}
+test_that("read_headerless_csv_dataset_schema works as expected", {
+  data_in <- open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+  expect_s3_class(data_in, "FileSystemDataset")
+  expect_identical(dim(data_in), c(80L, 3L))
+  expect_named(data_in, c("Month", "Day", "Temp"))
+  expect_equal(data_in$schema, schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+})
+```
+
+One additional advantage of using a schema is that you also have control of the 
+data types of the columns. If you provide both column names and a schema, the values 
+in `column_names` must match the `schema` field names.
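+
+For example, the same schema could declare `Temp` as a floating point column
+rather than an integer.  A sketch only, not evaluated here:
+
+```{r, read_headerless_csv_dataset_types, eval = FALSE}
+# Use the schema to control column types as well as names
+open_dataset(
+  "airquality",
+  format = "csv",
+  schema = schema("Month" = int32(), "Day" = int32(), "Temp" = float64())
+)
+```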
+
+
+## Write compressed partitioned data
+
+You want to save partitioned files, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, compression = "gzip")
+```
+
+```{r}
+# View files in the directory
+list.files(td, recursive = TRUE)
+```
+```{r, test_dataset_gzip, opts.label = "test"}
+test_that("dataset_gzip", {
+  expect_true(file.exists(file.path(td, "part-0.parquet")))
+})
+```
+
+### Discussion
+
+You can supply the `compression` argument to `write_dataset()` as long as 
+the compression algorithm is compatible with the chosen format. See `?write_dataset()` 
+for more information on supported compression algorithms and default settings.
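+
+For example, other codecs can be supplied in the same way.  The sketch below
+assumes your build of Arrow includes Zstandard support and is not evaluated here.
+
+```{r, dataset_zstd, eval = FALSE}
+# Write the same dataset compressed with zstd instead of gzip
+write_dataset(iris, path = td, compression = "zstd")
+```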
+
+## Read compressed data
+
+You want to read in data which has been compressed.
+
+### Solution
+
+```{r, opendataset_compressed}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, compression = "gzip")
+
+# Read in data
+ds <- open_dataset(td) %>%
+  collect()
+
+ds
+```
+
+```{r, test_opendataset_compressed, opts.label = "test"}
+test_that("opendataset_compressed", {
+  expect_s3_class(ds, "data.frame")
+  expect_named(
+    ds,
+    c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
+  )
+})
+```
+
+### Discussion
+
+Note that Arrow automatically detects the compression and you do not have to 
+supply it in the call to `open_dataset()` or the `read_*()` functions.
+
+
+```{r cleanup_multifile, include = FALSE}
+#cleanup
+unlink("airquality", recursive = TRUE)
+unlink("airquality_data_csv", recursive = TRUE)
+unlink("airquality_data", recursive = TRUE)
+unlink("airquality_data_arrow", recursive = TRUE)
+unlink("airquality_data_feather", recursive = TRUE)
+unlink("airquality_partitioned", recursive = TRUE)
+unlink("airquality_groupby", recursive = TRUE)
+unlink("airquality_partitioned_deeper", recursive = TRUE)
+```
\ No newline at end of file
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index ef097b3..a089eb8 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -17,22 +17,29 @@
   under the License.
 -->
 
-# Reading and Writing Data
+# Reading and Writing Data - Single Files
 
 ## Introduction
 
-This chapter contains recipes related to reading and writing data using Apache 
-Arrow.  When reading files into R using Apache Arrow, you can choose to read in 
-your file as either a data frame or as an Arrow Table object.
+When reading files into R using Apache Arrow, you can read:
 
+* a single file into memory as a data frame or an Arrow Table
+* a single file that is too large to fit in memory as an Arrow Dataset
+* multiple and partitioned files as an Arrow Dataset
 
-There are a number of circumstances in which you may want to read in the data as an Arrow Table:
+This chapter contains recipes related to using Apache Arrow to read single files 
+into memory as an Arrow Table and to write single files to disk. There are a number of
+circumstances in which you may want to read in single file data as an Arrow Table:
 
-* your dataset is large and if you load it into memory, it may lead to performance issues
+* your data file is large and loading it into memory causes performance issues
 * you want faster performance from your `dplyr` queries
 * you want to be able to take advantage of Arrow's compute functions
 
-## Convert from a data frame to an Arrow Table
+If a single data file is too large to load into memory, you can use the Arrow Dataset API. 
+Recipes for using `open_dataset()` and `write_dataset()` are in the Reading and Writing Data - Multiple Files
+chapter.
+
+## Convert data from a data frame to an Arrow Table
 
 You want to convert an existing `data.frame` or `tibble` object into an Arrow Table.
 
@@ -61,7 +68,7 @@
 ```
 ```{r, test_asdf_table, opts.label = "test"}
 test_that("asdf_table chunk works as expected", {
-  expect_identical(air_df, airquality) 
+  expect_identical(air_df, airquality)
 })
 ```
 
@@ -71,7 +78,7 @@
 
 ## Write a Parquet file
 
-You want to write Parquet files to disk.
+You want to write a single Parquet file to disk.
 
 ### Solution
 
@@ -89,7 +96,7 @@
  
 ## Read a Parquet file
 
-You want to read a Parquet file.
+You want to read a single Parquet file into memory.
 
 ### Solution
 
@@ -123,6 +130,7 @@
 my_table_arrow
 ```
 
+
 ```{r, read_parquet_table_class}
 class(my_table_arrow)
 ```
@@ -134,12 +142,12 @@
 
 ## Read a Parquet file from S3 
 
-You want to read a Parquet file from S3.
+You want to read a single Parquet file from S3 into memory.
 
 ### Solution
 
 ```{r, read_parquet_s3, eval = FALSE}
-df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")
 ```
 
 ### See also
@@ -148,12 +156,12 @@
 
 ## Filter columns while reading a Parquet file 
 
-You want to specify which columns to include when reading in a Parquet file.
+You want to specify which columns to include when reading a single Parquet file into memory.
 
 ### Solution
 
 ```{r, read_parquet_filter}
-# Create table to read back in 
+# Create table to read back in
 dist_time <- arrow_table(data.frame(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
 # Write to Parquet
 write_parquet(dist_time, "dist_time.parquet")
@@ -168,9 +176,9 @@
 })
 ```
 
-## Write an IPC/Feather V2 file
+## Write a Feather V2/Arrow IPC file
 
-You want to read in a Feather file.
+You want to write a single Feather V2 file (also called an Arrow IPC file).
 
 ### Solution
 
@@ -197,13 +205,11 @@
 test_that("write_feather1 chunk works as expected", {
   expect_true(file.exists("my_table.feather"))
 })
-
-unlink("my_table.feather")
 ```
 
-## Read a Feather file
+## Read a Feather/Arrow IPC file
 
-You want to read a Feather file.
+You want to read a single Feather V1 or V2 file (also called an Arrow IPC file) into memory.
 
 ### Solution
 
@@ -217,9 +223,9 @@
 unlink("my_table.arrow")
 ```
 
-## Write streaming IPC files
+## Write streaming Arrow IPC files
 
-You want to write to the IPC stream format.
+You want to write to the Arrow IPC stream format.
 
 ### Solution
 
@@ -240,9 +246,9 @@
 })
 ```
 
-## Read streaming IPC files
+## Read streaming Arrow IPC files
 
-You want to read from the IPC stream format.
+You want to read from the Arrow IPC stream format.
 
 ### Solution
 ```{r, read_ipc_stream}
@@ -258,9 +264,9 @@
 unlink("my_table.arrows")
 ```
 
-## Write CSV files  
+## Write a CSV file 
 
-You want to write Arrow data to a CSV file.
+You want to write Arrow data to a single CSV file.
 
 ### Solution
 
@@ -273,9 +279,9 @@
 })
 ```
 
-## Read CSV files
+## Read a CSV file
 
-You want to read a CSV file.
+You want to read a single CSV file into memory.
 
 ### Solution
 
@@ -290,14 +296,14 @@
 unlink("cars.csv")
 ```
 
-## Read JSON files 
+## Read a JSON file
 
-You want to read a JSON file.
+You want to read a JSON file into memory.
 
 ### Solution
 
 ```{r, read_json_arrow}
-# Create a file to read back in 
+# Create a file to read back in
 tf <- tempfile()
 writeLines('
     {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
@@ -323,76 +329,9 @@
 unlink(tf)
 ```
 
-## Write partitioned data
+## Write a compressed single data file
 
-You want to save data to disk in partitions based on columns in the data.
-
-### Solution
-
-```{r, write_dataset}
-write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
-list.files("airquality_partitioned")
-```
-```{r, test_write_dataset, opts.label = "test"}
-test_that("write_dataset chunk works as expected", {
-  # Partition by month
-  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
-  # We have enough files
-  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
-})
-```
-As you can see, this has created folders based on the first partition variable supplied, `Month`.
-
-If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
-
-```{r}
-list.files("airquality_partitioned/Month=5")
-```
-
-Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
-
-```{r}
-list.files("airquality_partitioned/Month=5/Day=10")
-```
-
-## Read partitioned data
-
-You want to read partitioned data.
-
-### Solution
-
-```{r, open_dataset}
-# Read data from directory
-air_data <- open_dataset("airquality_partitioned")
-
-# View data
-air_data
-```
-```{r, test_open_dataset, opts.label = "test"}
-test_that("open_dataset chunk works as expected", {
-  expect_equal(nrow(air_data), 153)
-  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
-})
-```
-
-```{r}
-unlink("airquality_partitioned", recursive = TRUE)
-```
-
-```{r, include = FALSE}
-# cleanup
-unlink("my_table.arrow")
-unlink("my_table.arrows")
-unlink("cars.csv")
-unlink("my_table.feather")
-unlink("my_table.parquet")
-unlink("dist_time.parquet")
-unlink("airquality_partitioned", recursive = TRUE)
-```
-
-## Write compressed data
-
-You want to save a file, compressed with a specified compression algorithm.
+You want to save a single file, compressed with a specified compression algorithm.
 
 ### Solution
 
@@ -407,35 +346,7 @@
 
 ```{r, test_parquet_gzip, opts.label = "test"}
 test_that("parquet_gzip", {
-  file.exists(file.path(td, "iris.parquet"))
-})
-```
-
-### Discussion
-
-Note that `write_parquet()` by default already uses compression.  See 
-`default_parquet_compression()` to see what the default configured on your 
-machine is.
-
-You can also supply the `compression` argument to `write_dataset()`, as long as 
-the compression algorithm is compatible with the chosen format.
-
-```{r, dataset_gzip}
-# Create a temporary directory
-td <- tempfile()
-dir.create(td)
-
-# Write dataset to file
-write_dataset(iris, path = td, compression = "gzip")
-```
-
-```{r}
-# View files in the directory
-list.files(td, recursive = TRUE)
-```
-```{r, test_dataset_gzip, opts.label = "test"}
-test_that("dataset_gzip", {
-  file.exists(file.path(td, "part-0.parquet"))
+  expect_true(file.exists(file.path(td, "iris.parquet")))
 })
 ```
 
@@ -446,11 +357,10 @@
 
 * `?write_parquet()`
 * `?write_feather()`
-* `?write_dataset()`
 
 ## Read compressed data
 
-You want to read in data which has been compressed.
+You want to read in a single data file which has been compressed.
 
 ### Solution
 
@@ -459,13 +369,11 @@
 td <- tempfile()
 dir.create(td)
 
-# Write dataset which is to be read back in
+# Write data which is to be read back in
 write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
 
 # Read in data
-ds <- read_parquet(file.path(td, "iris.parquet")) %>%
-  collect()
-
+ds <- read_parquet(file.path(td, "iris.parquet"))
 ds
 ```
 
@@ -482,7 +390,7 @@
 ### Discussion
 
 Note that Arrow automatically detects the compression and you do not have to 
-supply it in the call to `open_dataset()` or the `read_*()` functions.
+supply it in the call to the `read_*()` or the `open_dataset()` functions.
 
 Although the CSV format does not support compression itself, Arrow supports 
 reading in CSV data which has been compressed, if the file extension is `.gz`.
@@ -492,12 +400,11 @@
 td <- tempfile()
 dir.create(td)
 
-# Write dataset which is to be read back in
+# Write data which is to be read back in
 write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)
 
 # Read in data
-ds <- open_dataset(td, format = "csv") %>%
-  collect()
+ds <- read_csv_arrow(file.path(td, "iris.csv.gz"))
 ds
 ```
 
@@ -511,4 +418,12 @@
 })
 ```
 
-
+```{r cleanup_singlefiles, include = FALSE}
+# cleanup
+unlink("my_table.arrow")
+unlink("my_table.arrows")
+unlink("cars.csv")
+unlink("my_table.feather")
+unlink("my_table.parquet")
+unlink("dist_time.parquet")
+```
\ No newline at end of file
diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
index 127a5b1..75078c2 100644
--- a/r/content/tables.Rmd
+++ b/r/content/tables.Rmd
@@ -23,13 +23,13 @@
 
 One of the aims of the Arrow project is to reduce duplication between different 
 data frame implementations.  The underlying implementation of a data frame is a 
-conceptually different thing to the code that you run to work with it - the API.
+conceptually different thing to the code - or the application programming interface (API) - that you write to work with it.
 
-You may have seen this before in packages like `dbplyr` which allow you to use 
+You may have seen this before in packages like dbplyr which allow you to use 
 the dplyr API to interact with SQL databases.
 
-The `arrow` package has been written so that the underlying Arrow table-like 
-objects can be manipulated via use of the dplyr API via the dplyr verbs.
+The Arrow R package has been written so that the underlying Arrow Table-like 
+objects can be manipulated using the dplyr API, which allows you to use dplyr verbs.
 
 For example, here's a short pipeline of data manipulation which uses dplyr exclusively:
   
@@ -41,7 +41,7 @@
   select(name, height_ft)
 ```
 
-And the same results as using arrow with dplyr syntax:
+And the same results as using Arrow with dplyr syntax:
   
 ```{r, dplyr_arrow}
 arrow_table(starwars) %>%
@@ -73,11 +73,11 @@
 
 
 You'll notice we've used `collect()` in the Arrow pipeline above.  That's because 
-one of the ways in which `arrow` is efficient is that it works out the instructions
+one of the ways in which Arrow is efficient is that it works out the instructions
 for the calculations it needs to perform (_expressions_) and only runs them 
-using arrow once you actually pull the data into your R session.  This means 
+using Arrow once you actually pull the data into your R session.  This means 
 instead of doing lots of separate operations, it does them all at once in a 
-more optimised way, _lazy evaluation_.
+more optimised way. This is called _lazy evaluation_.
 
 It also means that you are able to manipulate data that is larger than you can 
 fit into memory on the machine you're running your code on, if you only pull 
@@ -86,13 +86,13 @@
 
 You can also have data which is split across multiple files.  For example, you
 might have files which are stored in multiple Parquet or Feather files, 
-partitioned across different directories.  You can open multi-file datasets 
+partitioned across different directories.  You can open partitioned or multi-file datasets 
 using `open_dataset()` as discussed in a previous chapter, and then manipulate 
-this data using arrow before even reading any of it into R.
+this data using Arrow before even reading any of the data into R.
 
-## Use dplyr verbs in arrow
+## Use dplyr verbs in Arrow
 
-You want to use a dplyr verb in arrow.
+You want to use a dplyr verb in Arrow.
 
 ### Solution
 
@@ -120,7 +120,7 @@
 
 ### Discussion
 
-You can use most of the dplyr verbs directly from arrow.  
+You can use most of the dplyr verbs directly from Arrow.  
 
 ### See also
 
@@ -131,9 +131,9 @@
 You can see more information about using `arrow_table()` to create Arrow Tables
 and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
 
-## Use R functions in dplyr verbs in arrow
+## Use R functions in dplyr verbs in Arrow
 
-You want to use an R function inside a dplyr verb in arrow.
+You want to use an R function inside a dplyr verb in Arrow.
 
 ### Solution
 
@@ -159,10 +159,10 @@
 
 ### Discussion
 
-The arrow package allows you to use dplyr verbs containing expressions which 
+The Arrow R package allows you to use dplyr verbs containing expressions which 
 include base R and many tidyverse functions, but call Arrow functions under the hood.
 If you find any base R or tidyverse functions which you would like to see a 
-mapping of in arrow, please 
+mapping of in Arrow, please 
 [open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
 
 The following packages (amongst some from others) have had many function 
@@ -199,7 +199,7 @@
 ```
 
 
-## Use arrow functions in dplyr verbs in arrow
+## Use Arrow functions in dplyr verbs in Arrow
 
 You want to use a function which is implemented in Arrow's C++ library but either:
 
@@ -313,7 +313,7 @@
 most do.  For these functions to work in R, they must be linked up 
 with the appropriate libarrow options C++ class via the R 
 package's C++ code.  At the time of writing, all compute functions available in
-the development version of the arrow R package had been associated with their options
+the development version of the Arrow R package had been associated with their options
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue