[["index.html", "Apache Arrow R Cookbook 1 Preface 1.1 What is Arrow? 1.2 Alternative resources", " Apache Arrow R Cookbook 1 Preface This cookbook aims to provide a number of recipes showing how to perform common tasks using arrow. This version of the cookbook works with arrow >= 6.0.0, but in future we will maintain different versions for the last few major R package releases. 1.1 What is Arrow? Apache Arrow is a cross-language development platform for in-memory analytics. The arrow R package provides a low-level interface to much of the functionality available in the C++ implementation, as well as a higher-level interface to the compute functionality via an implementation of the dplyr API. 1.2 Alternative resources For a complete reference guide to the functions in arrow, as well as vignettes, see the pkgdown site. If you have any requests for new recipes, please open a ticket via the cookbook’s GitHub Issues page. If you have any Arrow feature requests to make or bugs to report, please open an issue on the project JIRA "],["reading-and-writing-data---single-files.html", "2 Reading and Writing Data - Single Files 2.1 Introduction 2.2 Convert data from a data frame to an Arrow Table 2.3 Convert data from an Arrow Table to a data frame 2.4 Write a Parquet file 2.5 Read a Parquet file 2.6 Read a Parquet file from S3 2.7 Filter columns while reading a Parquet file 2.8 Write a Feather V2/Arrow IPC file 2.9 Read a Feather/Arrow IPC file 2.10 Write streaming Arrow IPC files 2.11 Read streaming Arrow IPC files 2.12 Write a CSV file 2.13 Read a CSV file 2.14 Read a JSON file 2.15 Write a compressed single data file 2.16 Read compressed data", " 2 Reading and Writing Data - Single Files 2.1 Introduction When reading files into R using Apache Arrow, you can read: a single file into memory as a data frame or an Arrow Table a single file that is too large to fit in memory as an Arrow Dataset multiple and partitioned files as an Arrow Dataset This chapter contains recipes related to using Apache Arrow to read and write single file data into memory as an Arrow Table. There are a number of circumstances in which you may want to read in single file data as an Arrow Table: your data file is large and having performance issues you want faster performance from your dplyr queries you want to be able to take advantage of Arrow’s compute functions If a single data file is too large to load into memory, you can use the Arrow Dataset API. Recipes for using open_dataset() and write_dataset() are in the Reading and Writing Data - Multiple Files chapter. 2.2 Convert data from a data frame to an Arrow Table You want to convert an existing data.frame or tibble object into an Arrow Table. 2.2.1 Solution air_table <- arrow_table(airquality) air_table ## Table ## 153 rows x 6 columns ## $Ozone <int32> ## $Solar.R <int32> ## $Wind <double> ## $Temp <int32> ## $Month <int32> ## $Day <int32> ## ## See $metadata for additional Schema metadata 2.3 Convert data from an Arrow Table to a data frame You want to convert an Arrow Table to a data frame to view the data or work with it in your usual analytics pipeline. 
2.3.1 Solution air_df <- as.data.frame(air_table) air_df ## Ozone Solar.R Wind Temp Month Day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 ## 3 12 149 12.6 74 5 3 ## 4 18 313 11.5 62 5 4 ## 5 NA NA 14.3 56 5 5 ## 6 28 NA 14.9 66 5 6 ## 7 23 299 8.6 65 5 7 ## 8 19 99 13.8 59 5 8 ## 9 8 19 20.1 61 5 9 ## 10 NA 194 8.6 69 5 10 ## 11 7 NA 6.9 74 5 11 ## 12 16 256 9.7 69 5 12 ## 13 11 290 9.2 66 5 13 ## 14 14 274 10.9 68 5 14 ## 15 18 65 13.2 58 5 15 ## 16 14 334 11.5 64 5 16 ## 17 34 307 12.0 66 5 17 ## 18 6 78 18.4 57 5 18 ## 19 30 322 11.5 68 5 19 ## 20 11 44 9.7 62 5 20 ## 21 1 8 9.7 59 5 21 ## 22 11 320 16.6 73 5 22 ## 23 4 25 9.7 61 5 23 ## 24 32 92 12.0 61 5 24 ## 25 NA 66 16.6 57 5 25 ## 26 NA 266 14.9 58 5 26 ## 27 NA NA 8.0 57 5 27 ## 28 23 13 12.0 67 5 28 ## 29 45 252 14.9 81 5 29 ## 30 115 223 5.7 79 5 30 ## 31 37 279 7.4 76 5 31 ## 32 NA 286 8.6 78 6 1 ## 33 NA 287 9.7 74 6 2 ## 34 NA 242 16.1 67 6 3 ## 35 NA 186 9.2 84 6 4 ## 36 NA 220 8.6 85 6 5 ## 37 NA 264 14.3 79 6 6 ## 38 29 127 9.7 82 6 7 ## 39 NA 273 6.9 87 6 8 ## 40 71 291 13.8 90 6 9 ## 41 39 323 11.5 87 6 10 ## 42 NA 259 10.9 93 6 11 ## 43 NA 250 9.2 92 6 12 ## 44 23 148 8.0 82 6 13 ## 45 NA 332 13.8 80 6 14 ## 46 NA 322 11.5 79 6 15 ## 47 21 191 14.9 77 6 16 ## 48 37 284 20.7 72 6 17 ## 49 20 37 9.2 65 6 18 ## 50 12 120 11.5 73 6 19 ## 51 13 137 10.3 76 6 20 ## 52 NA 150 6.3 77 6 21 ## 53 NA 59 1.7 76 6 22 ## 54 NA 91 4.6 76 6 23 ## 55 NA 250 6.3 76 6 24 ## 56 NA 135 8.0 75 6 25 ## 57 NA 127 8.0 78 6 26 ## 58 NA 47 10.3 73 6 27 ## 59 NA 98 11.5 80 6 28 ## 60 NA 31 14.9 77 6 29 ## 61 NA 138 8.0 83 6 30 ## 62 135 269 4.1 84 7 1 ## 63 49 248 9.2 85 7 2 ## 64 32 236 9.2 81 7 3 ## 65 NA 101 10.9 84 7 4 ## 66 64 175 4.6 83 7 5 ## 67 40 314 10.9 83 7 6 ## 68 77 276 5.1 88 7 7 ## 69 97 267 6.3 92 7 8 ## 70 97 272 5.7 92 7 9 ## 71 85 175 7.4 89 7 10 ## 72 NA 139 8.6 82 7 11 ## 73 10 264 14.3 73 7 12 ## 74 27 175 14.9 81 7 13 ## 75 NA 291 14.9 91 7 14 ## 76 7 48 14.3 80 7 15 ## 77 48 260 6.9 81 7 16 ## 78 35 274 10.3 82 7 17 ## 79 61 285 6.3 84 7 18 ## 80 79 187 5.1 87 7 19 ## 81 63 220 11.5 85 7 20 ## 82 16 7 6.9 74 7 21 ## 83 NA 258 9.7 81 7 22 ## 84 NA 295 11.5 82 7 23 ## 85 80 294 8.6 86 7 24 ## 86 108 223 8.0 85 7 25 ## 87 20 81 8.6 82 7 26 ## 88 52 82 12.0 86 7 27 ## 89 82 213 7.4 88 7 28 ## 90 50 275 7.4 86 7 29 ## 91 64 253 7.4 83 7 30 ## 92 59 254 9.2 81 7 31 ## 93 39 83 6.9 81 8 1 ## 94 9 24 13.8 81 8 2 ## 95 16 77 7.4 82 8 3 ## 96 78 NA 6.9 86 8 4 ## 97 35 NA 7.4 85 8 5 ## 98 66 NA 4.6 87 8 6 ## 99 122 255 4.0 89 8 7 ## 100 89 229 10.3 90 8 8 ## 101 110 207 8.0 90 8 9 ## 102 NA 222 8.6 92 8 10 ## 103 NA 137 11.5 86 8 11 ## 104 44 192 11.5 86 8 12 ## 105 28 273 11.5 82 8 13 ## 106 65 157 9.7 80 8 14 ## 107 NA 64 11.5 79 8 15 ## 108 22 71 10.3 77 8 16 ## 109 59 51 6.3 79 8 17 ## 110 23 115 7.4 76 8 18 ## 111 31 244 10.9 78 8 19 ## 112 44 190 10.3 78 8 20 ## 113 21 259 15.5 77 8 21 ## 114 9 36 14.3 72 8 22 ## 115 NA 255 12.6 75 8 23 ## 116 45 212 9.7 79 8 24 ## 117 168 238 3.4 81 8 25 ## 118 73 215 8.0 86 8 26 ## 119 NA 153 5.7 88 8 27 ## 120 76 203 9.7 97 8 28 ## 121 118 225 2.3 94 8 29 ## 122 84 237 6.3 96 8 30 ## 123 85 188 6.3 94 8 31 ## 124 96 167 6.9 91 9 1 ## 125 78 197 5.1 92 9 2 ## 126 73 183 2.8 93 9 3 ## 127 91 189 4.6 93 9 4 ## 128 47 95 7.4 87 9 5 ## 129 32 92 15.5 84 9 6 ## 130 20 252 10.9 80 9 7 ## 131 23 220 10.3 78 9 8 ## 132 21 230 10.9 75 9 9 ## 133 24 259 9.7 73 9 10 ## 134 44 236 14.9 81 9 11 ## 135 21 259 15.5 76 9 12 ## 136 28 238 6.3 77 9 13 ## 137 9 24 10.9 71 9 14 ## 138 13 112 11.5 71 9 15 ## 
139 46 237 6.9 78 9 16 ## 140 18 224 13.8 67 9 17 ## 141 13 27 10.3 76 9 18 ## 142 24 238 10.3 68 9 19 ## 143 16 201 8.0 82 9 20 ## 144 13 238 12.6 64 9 21 ## 145 23 14 9.2 71 9 22 ## 146 36 139 10.3 81 9 23 ## 147 7 49 10.3 69 9 24 ## 148 14 20 16.6 63 9 25 ## 149 30 193 6.9 70 9 26 ## 150 NA 145 13.2 77 9 27 ## 151 14 191 14.3 75 9 28 ## 152 18 131 8.0 76 9 29 ## 153 20 223 11.5 68 9 30 2.3.2 Discussion You can use dplyr::collect() to return a tibble or as.data.frame() to return a data.frame. 2.4 Write a Parquet file You want to write a single Parquet file to disk. 2.4.1 Solution # Create table my_table <- arrow_table(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))) # Write to Parquet write_parquet(my_table, "my_table.parquet") 2.5 Read a Parquet file You want to read a single Parquet file into memory. 2.5.1 Solution parquet_tbl <- read_parquet("my_table.parquet") parquet_tbl ## # A tibble: 3 × 2 ## group score ## <chr> <dbl> ## 1 A 99 ## 2 B 97 ## 3 C 99 As the argument as_data_frame was left set to its default value of TRUE, the file was read in as a tibble. class(parquet_tbl) ## [1] "tbl_df" "tbl" "data.frame" 2.5.2 Discussion If you set as_data_frame to FALSE, the file will be read in as an Arrow Table. my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE) my_table_arrow ## Table ## 3 rows x 2 columns ## $group <string> ## $score <double> class(my_table_arrow) ## [1] "Table" "ArrowTabular" "ArrowObject" "R6" 2.6 Read a Parquet file from S3 You want to read a single Parquet file from S3 into memory. 2.6.1 Solution df <- read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet") 2.6.2 See also For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html. 2.7 Filter columns while reading a Parquet file You want to specify which columns to include when reading a single Parquet file into memory. 2.7.1 Solution # Create table to read back in dist_time <- arrow_table(data.frame(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40))) # Write to Parquet write_parquet(dist_time, "dist_time.parquet") # Read in only the "time" column time_only <- read_parquet("dist_time.parquet", col_select = "time") time_only ## # A tibble: 3 × 1 ## time ## <dbl> ## 1 43 ## 2 44 ## 3 40 2.8 Write a Feather V2/Arrow IPC file You want to write a single Feather V2 file (also called an Arrow IPC file). 2.8.1 Solution my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99))) write_feather(my_table, "my_table.arrow") 2.8.2 Discussion For legacy support, you can write data in the original Feather format by setting the version parameter to 1. # Create table my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99))) # Write to Feather format V1 write_feather(my_table, "my_table.feather", version = 1) 2.9 Read a Feather/Arrow IPC file You want to read a single Feather V1 or V2 file (also called an Arrow IPC file) into memory. 2.9.1 Solution my_feather_tbl <- read_feather("my_table.arrow") 2.10 Write streaming Arrow IPC files You want to write to the Arrow IPC stream format. 2.10.1 Solution # Create table my_table <- arrow_table( data.frame( group = c("A", "B", "C"), score = c(99, 97, 99) ) ) # Write to IPC stream format write_ipc_stream(my_table, "my_table.arrows") 2.11 Read streaming Arrow IPC files You want to read from the Arrow IPC stream format. 
2.11.1 Solution my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows") 2.12 Write a CSV file You want to write Arrow data to a single CSV file. 2.12.1 Solution write_csv_arrow(cars, "cars.csv") 2.13 Read a CSV file You want to read a single CSV file into memory. 2.13.1 Solution my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE) 2.14 Read a JSON file You want to read a JSON file into memory. 2.14.1 Solution # Create a file to read back in tf <- tempfile() writeLines(' {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38} {"country": "France", "code": "FR", "long": 2.21, "lat": 46.23} {"country": "Germany", "code": "DE", "long": 10.45, "lat": 51.17} ', tf, useBytes = TRUE) # Read in the data countries <- read_json_arrow(tf, col_select = c("country", "long", "lat")) countries ## # A tibble: 3 × 3 ## country long lat ## <chr> <dbl> <dbl> ## 1 United Kingdom -3.44 55.4 ## 2 France 2.21 46.2 ## 3 Germany 10.4 51.2 2.15 Write a compressed single data file You want to save a single file, compressed with a specified compression algorithm. 2.15.1 Solution # Create a temporary directory td <- tempfile() dir.create(td) # Write data compressed with the gzip algorithm instead of the default write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip") 2.15.2 See also Some formats write compressed data by default. For more information on the supported compression algorithms and default settings, see: ?write_parquet() ?write_feather() 2.16 Read compressed data You want to read in a single data file which has been compressed. 2.16.1 Solution # Create a temporary directory td <- tempfile() dir.create(td) # Write data which is to be read back in write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip") # Read in data ds <- read_parquet(file.path(td, "iris.parquet")) ds ## # A tibble: 150 × 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fct> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ℹ 140 more rows 2.16.2 Discussion Note that Arrow automatically detects the compression and you do not have to supply it in the call to the read_*() or the open_dataset() functions. Although the CSV format does not support compression itself, Arrow supports reading in CSV data which has been compressed, if the file extension is .gz. 
# Create a temporary directory td <- tempfile() dir.create(td) # Write data which is to be read back in write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE) # Read in data ds <- read_csv_arrow(file.path(td, "iris.csv.gz")) ds ## # A tibble: 150 × 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ℹ 140 more rows "],["reading-and-writing-data---multiple-files.html", "3 Reading and Writing Data - Multiple Files 3.1 Introduction 3.2 Write data to disk - Parquet 3.3 Write partitioned data - Parquet 3.4 Read partitioned data 3.5 Write data to disk - Feather/Arrow IPC format 3.6 Read in Feather/Arrow IPC data as an Arrow Dataset 3.7 Write data to disk - CSV format 3.8 Read in CSV data as an Arrow Dataset 3.9 Read in a CSV dataset (no headers) 3.10 Write compressed partitioned data 3.11 Read compressed data", " 3 Reading and Writing Data - Multiple Files 3.1 Introduction When reading files into R using Apache Arrow, you can read: a single file into memory as a data frame or an Arrow Table a single file that is too large to fit in memory as an Arrow Dataset multiple and partitioned files as an Arrow Dataset This chapter contains recipes related to using Apache Arrow to read and write files too large for memory and multiple or partitioned files as an Arrow Dataset. There are a number of circumstances in which you may want to read in the data as an Arrow Dataset: your single data file is too large to load into memory your data are partitioned among numerous files you want faster performance from your dplyr queries you want to be able to take advantage of Arrow’s compute functions It is possible to read in partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or other text-delimited formats. If you are choosing a partitioned, multi-file format, we recommend Parquet or Feather (Arrow IPC), both of which can have improved performance compared to CSVs due to their capabilities around metadata and compression. 3.2 Write data to disk - Parquet You want to write data to disk in a single Parquet file. 3.2.1 Solution write_dataset(dataset = airquality, path = "airquality_data") 3.2.2 Discussion The default format for open_dataset() and write_dataset() is Parquet. 3.3 Write partitioned data - Parquet You want to save multiple Parquet data files to disk in partitions based on columns in the data. 3.3.1 Solution write_dataset(airquality, "airquality_partitioned", partitioning = c("Month")) As you can see, this has created folders based on the supplied partition variable Month. list.files("airquality_partitioned") ## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9" 3.3.2 Discussion The data is written to separate folders based on the values in the Month column. The default behaviour is to use Hive-style (i.e. “col_name=value” folder names) partitions. # Take a look at the files in this directory list.files("airquality_partitioned", recursive = TRUE) ## [1] "Month=5/part-0.parquet" "Month=6/part-0.parquet" "Month=7/part-0.parquet" ## [4] "Month=8/part-0.parquet" "Month=9/part-0.parquet" You can specify multiple partitioning variables to add extra levels of partitioning. 
write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day")) list.files("airquality_partitioned_deeper") ## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9" If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, Day. # Take a look at the files in this directory list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE) ## [1] "Day=1/part-0.parquet" "Day=10/part-0.parquet" "Day=11/part-0.parquet" ## [4] "Day=12/part-0.parquet" "Day=13/part-0.parquet" "Day=14/part-0.parquet" ## [7] "Day=15/part-0.parquet" "Day=16/part-0.parquet" "Day=17/part-0.parquet" ## [10] "Day=18/part-0.parquet" "Day=19/part-0.parquet" "Day=2/part-0.parquet" ## [13] "Day=20/part-0.parquet" "Day=21/part-0.parquet" "Day=22/part-0.parquet" ## [16] "Day=23/part-0.parquet" "Day=24/part-0.parquet" "Day=25/part-0.parquet" ## [19] "Day=26/part-0.parquet" "Day=27/part-0.parquet" "Day=28/part-0.parquet" ## [22] "Day=29/part-0.parquet" "Day=3/part-0.parquet" "Day=30/part-0.parquet" ## [25] "Day=31/part-0.parquet" "Day=4/part-0.parquet" "Day=5/part-0.parquet" ## [28] "Day=6/part-0.parquet" "Day=7/part-0.parquet" "Day=8/part-0.parquet" ## [31] "Day=9/part-0.parquet" There are two different ways to specify variables to use for partitioning - either via the partitioning variable as above, or by using dplyr::group_by() on your data - the group variables will form the partitions. write_dataset(dataset = group_by(airquality, Month, Day), path = "airquality_groupby") # Take a look at the files in this directory list.files("airquality_groupby", recursive = TRUE) ## [1] "Month=5/Day=1/part-0.parquet" "Month=5/Day=10/part-0.parquet" ## [3] "Month=5/Day=11/part-0.parquet" "Month=5/Day=12/part-0.parquet" ## [5] "Month=5/Day=13/part-0.parquet" "Month=5/Day=14/part-0.parquet" ## [7] "Month=5/Day=15/part-0.parquet" "Month=5/Day=16/part-0.parquet" ## [9] "Month=5/Day=17/part-0.parquet" "Month=5/Day=18/part-0.parquet" ## [11] "Month=5/Day=19/part-0.parquet" "Month=5/Day=2/part-0.parquet" ## [13] "Month=5/Day=20/part-0.parquet" "Month=5/Day=21/part-0.parquet" ## [15] "Month=5/Day=22/part-0.parquet" "Month=5/Day=23/part-0.parquet" ## [17] "Month=5/Day=24/part-0.parquet" "Month=5/Day=25/part-0.parquet" ## [19] "Month=5/Day=26/part-0.parquet" "Month=5/Day=27/part-0.parquet" ## [21] "Month=5/Day=28/part-0.parquet" "Month=5/Day=29/part-0.parquet" ## [23] "Month=5/Day=3/part-0.parquet" "Month=5/Day=30/part-0.parquet" ## [25] "Month=5/Day=31/part-0.parquet" "Month=5/Day=4/part-0.parquet" ## [27] "Month=5/Day=5/part-0.parquet" "Month=5/Day=6/part-0.parquet" ## [29] "Month=5/Day=7/part-0.parquet" "Month=5/Day=8/part-0.parquet" ## [31] "Month=5/Day=9/part-0.parquet" "Month=6/Day=1/part-0.parquet" ## [33] "Month=6/Day=10/part-0.parquet" "Month=6/Day=11/part-0.parquet" ## [35] "Month=6/Day=12/part-0.parquet" "Month=6/Day=13/part-0.parquet" ## [37] "Month=6/Day=14/part-0.parquet" "Month=6/Day=15/part-0.parquet" ## [39] "Month=6/Day=16/part-0.parquet" "Month=6/Day=17/part-0.parquet" ## [41] "Month=6/Day=18/part-0.parquet" "Month=6/Day=19/part-0.parquet" ## [43] "Month=6/Day=2/part-0.parquet" "Month=6/Day=20/part-0.parquet" ## [45] "Month=6/Day=21/part-0.parquet" "Month=6/Day=22/part-0.parquet" ## [47] "Month=6/Day=23/part-0.parquet" "Month=6/Day=24/part-0.parquet" ## [49] "Month=6/Day=25/part-0.parquet" "Month=6/Day=26/part-0.parquet" ## [51] "Month=6/Day=27/part-0.parquet" "Month=6/Day=28/part-0.parquet" ## [53] "Month=6/Day=29/part-0.parquet" 
"Month=6/Day=3/part-0.parquet" ## [55] "Month=6/Day=30/part-0.parquet" "Month=6/Day=4/part-0.parquet" ## [57] "Month=6/Day=5/part-0.parquet" "Month=6/Day=6/part-0.parquet" ## [59] "Month=6/Day=7/part-0.parquet" "Month=6/Day=8/part-0.parquet" ## [61] "Month=6/Day=9/part-0.parquet" "Month=7/Day=1/part-0.parquet" ## [63] "Month=7/Day=10/part-0.parquet" "Month=7/Day=11/part-0.parquet" ## [65] "Month=7/Day=12/part-0.parquet" "Month=7/Day=13/part-0.parquet" ## [67] "Month=7/Day=14/part-0.parquet" "Month=7/Day=15/part-0.parquet" ## [69] "Month=7/Day=16/part-0.parquet" "Month=7/Day=17/part-0.parquet" ## [71] "Month=7/Day=18/part-0.parquet" "Month=7/Day=19/part-0.parquet" ## [73] "Month=7/Day=2/part-0.parquet" "Month=7/Day=20/part-0.parquet" ## [75] "Month=7/Day=21/part-0.parquet" "Month=7/Day=22/part-0.parquet" ## [77] "Month=7/Day=23/part-0.parquet" "Month=7/Day=24/part-0.parquet" ## [79] "Month=7/Day=25/part-0.parquet" "Month=7/Day=26/part-0.parquet" ## [81] "Month=7/Day=27/part-0.parquet" "Month=7/Day=28/part-0.parquet" ## [83] "Month=7/Day=29/part-0.parquet" "Month=7/Day=3/part-0.parquet" ## [85] "Month=7/Day=30/part-0.parquet" "Month=7/Day=31/part-0.parquet" ## [87] "Month=7/Day=4/part-0.parquet" "Month=7/Day=5/part-0.parquet" ## [89] "Month=7/Day=6/part-0.parquet" "Month=7/Day=7/part-0.parquet" ## [91] "Month=7/Day=8/part-0.parquet" "Month=7/Day=9/part-0.parquet" ## [93] "Month=8/Day=1/part-0.parquet" "Month=8/Day=10/part-0.parquet" ## [95] "Month=8/Day=11/part-0.parquet" "Month=8/Day=12/part-0.parquet" ## [97] "Month=8/Day=13/part-0.parquet" "Month=8/Day=14/part-0.parquet" ## [99] "Month=8/Day=15/part-0.parquet" "Month=8/Day=16/part-0.parquet" ## [101] "Month=8/Day=17/part-0.parquet" "Month=8/Day=18/part-0.parquet" ## [103] "Month=8/Day=19/part-0.parquet" "Month=8/Day=2/part-0.parquet" ## [105] "Month=8/Day=20/part-0.parquet" "Month=8/Day=21/part-0.parquet" ## [107] "Month=8/Day=22/part-0.parquet" "Month=8/Day=23/part-0.parquet" ## [109] "Month=8/Day=24/part-0.parquet" "Month=8/Day=25/part-0.parquet" ## [111] "Month=8/Day=26/part-0.parquet" "Month=8/Day=27/part-0.parquet" ## [113] "Month=8/Day=28/part-0.parquet" "Month=8/Day=29/part-0.parquet" ## [115] "Month=8/Day=3/part-0.parquet" "Month=8/Day=30/part-0.parquet" ## [117] "Month=8/Day=31/part-0.parquet" "Month=8/Day=4/part-0.parquet" ## [119] "Month=8/Day=5/part-0.parquet" "Month=8/Day=6/part-0.parquet" ## [121] "Month=8/Day=7/part-0.parquet" "Month=8/Day=8/part-0.parquet" ## [123] "Month=8/Day=9/part-0.parquet" "Month=9/Day=1/part-0.parquet" ## [125] "Month=9/Day=10/part-0.parquet" "Month=9/Day=11/part-0.parquet" ## [127] "Month=9/Day=12/part-0.parquet" "Month=9/Day=13/part-0.parquet" ## [129] "Month=9/Day=14/part-0.parquet" "Month=9/Day=15/part-0.parquet" ## [131] "Month=9/Day=16/part-0.parquet" "Month=9/Day=17/part-0.parquet" ## [133] "Month=9/Day=18/part-0.parquet" "Month=9/Day=19/part-0.parquet" ## [135] "Month=9/Day=2/part-0.parquet" "Month=9/Day=20/part-0.parquet" ## [137] "Month=9/Day=21/part-0.parquet" "Month=9/Day=22/part-0.parquet" ## [139] "Month=9/Day=23/part-0.parquet" "Month=9/Day=24/part-0.parquet" ## [141] "Month=9/Day=25/part-0.parquet" "Month=9/Day=26/part-0.parquet" ## [143] "Month=9/Day=27/part-0.parquet" "Month=9/Day=28/part-0.parquet" ## [145] "Month=9/Day=29/part-0.parquet" "Month=9/Day=3/part-0.parquet" ## [147] "Month=9/Day=30/part-0.parquet" "Month=9/Day=4/part-0.parquet" ## [149] "Month=9/Day=5/part-0.parquet" "Month=9/Day=6/part-0.parquet" ## [151] "Month=9/Day=7/part-0.parquet" "Month=9/Day=8/part-0.parquet" 
## [153] "Month=9/Day=9/part-0.parquet" Each of these folders contains 1 or more Parquet files containing the relevant partition of the data. list.files("airquality_groupby/Month=5/Day=10") ## [1] "part-0.parquet" Note that when there was an NA value in the partition column, these values are written to the col_name=__HIVE_DEFAULT_PARTITION__ directory. 3.4 Read partitioned data You want to read partitioned data files as an Arrow Dataset. 3.4.1 Solution # Read data from directory air_data <- open_dataset("airquality_partitioned_deeper") # View data air_data ## FileSystemDataset with 153 Parquet files ## Ozone: int32 ## Solar.R: int32 ## Wind: double ## Temp: int32 ## Month: int32 ## Day: int32 ## ## See $metadata for additional Schema metadata 3.4.2 Discussion Partitioning allows you to split data across multiple files and folders, avoiding problems associated with storing all your data in a single file. This can provide further advantages when using Arrow, as Arrow will only read in the necessary partitioned files needed for any given analysis. 3.5 Write data to disk - Feather/Arrow IPC format You want to write data to disk in a single Feather/Arrow IPC file. 3.5.1 Solution write_dataset(dataset = airquality, path = "airquality_data_feather", format = "feather") 3.6 Read in Feather/Arrow IPC data as an Arrow Dataset You want to read in Feather/Arrow IPC data as an Arrow Dataset 3.6.1 Solution # write Arrow file to use in this example write_dataset(dataset = airquality, path = "airquality_data_arrow", format = "arrow") # read into R open_dataset("airquality_data_arrow", format = "arrow") ## FileSystemDataset with 1 Feather file ## Ozone: int32 ## Solar.R: int32 ## Wind: double ## Temp: int32 ## Month: int32 ## Day: int32 ## ## See $metadata for additional Schema metadata 3.7 Write data to disk - CSV format You want to write data to disk in a single CSV file. 3.7.1 Solution write_dataset(dataset = airquality, path = "airquality_data_csv", format = "csv") 3.8 Read in CSV data as an Arrow Dataset You want to read in CSV data as an Arrow Dataset 3.8.1 Solution # write CSV file to use in this example write_dataset(dataset = airquality, path = "airquality_data_csv", format = "csv") # read into R open_dataset("airquality_data_csv", format = "csv") ## FileSystemDataset with 1 csv file ## Ozone: int64 ## Solar.R: int64 ## Wind: double ## Temp: int64 ## Month: int64 ## Day: int64 3.9 Read in a CSV dataset (no headers) You want to read in a dataset containing CSVs with no headers 3.9.1 Solution # write CSV file to use in this example dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")] dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")] dir.create("airquality") write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE) write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE) # read into R open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp")) ## FileSystemDataset with 2 csv files ## Month: int64 ## Day: int64 ## Temp: int64 3.9.2 Discussion If your dataset is made up of headerless CSV files, you must supply the names of each column. 
You can do this in multiple ways - either via the column_names parameter (as shown above) or via a schema: open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32())) ## FileSystemDataset with 2 csv files ## Month: int32 ## Day: int32 ## Temp: int32 One additional advantage of using a schema is that you also have control of the data types of the columns. If you provide both column names and a schema, the values in column_names must match the schema field names. 3.10 Write compressed partitioned data You want to save partitioned files, compressed with a specified compression algorithm. 3.10.1 Solution # Create a temporary directory td <- tempfile() dir.create(td) # Write dataset to file write_dataset(iris, path = td, compression = "gzip") # View files in the directory list.files(td, recursive = TRUE) ## [1] "part-0.parquet" 3.10.2 Discussion You can supply the compression argument to write_dataset() as long as the compression algorithm is compatible with the chosen format. See ?write_dataset() for more information on supported compression algorithms and default settings. 3.11 Read compressed data You want to read in data which has been compressed. 3.11.1 Solution # Create a temporary directory td <- tempfile() dir.create(td) # Write dataset to file write_dataset(iris, path = td, compression = "gzip") # Read in data ds <- open_dataset(td) %>% collect() ds ## # A tibble: 150 × 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fct> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ℹ 140 more rows 3.11.2 Discussion Note that Arrow automatically detects the compression and you do not have to supply it in the call to open_dataset() or the read_*() functions. "],["creating-arrow-objects.html", "4 Creating Arrow Objects 4.1 Create an Arrow Array from an R object 4.2 Create an Arrow Table from an R object 4.3 View the contents of an Arrow Table or RecordBatch 4.4 Manually create a RecordBatch from an R object.", " 4 Creating Arrow Objects 4.1 Create an Arrow Array from an R object You want to convert an existing vector in R to an Arrow Array object. 4.1.1 Solution # Create an example vector score <- c(99, 97, 99) # Convert to Arrow Array score_array <- Array$create(score) # View Array score_array ## Array ## <double> ## [ ## 99, ## 97, ## 99 ## ] 4.2 Create an Arrow Table from an R object You want to convert an existing data frame in R to an Arrow Table object. 4.2.1 Solution # Create an example data frame my_tibble <- tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)) # Convert to Arrow Table my_table <- arrow_table(my_tibble) # View table my_table ## Table ## 3 rows x 2 columns ## $group <string> ## $score <double> 4.3 View the contents of an Arrow Table or RecordBatch You want to view the contents of an Arrow Table or RecordBatch. 4.3.1 Solution # View Table dplyr::collect(my_table) ## # A tibble: 3 × 2 ## group score ## <chr> <dbl> ## 1 A 99 ## 2 B 97 ## 3 C 99 4.4 Manually create a RecordBatch from an R object. You want to convert an existing data frame in R to an Arrow RecordBatch object. 
4.4.1 Solution # Create an example data frame my_tibble <- tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)) # Convert to Arrow RecordBatch my_record_batch <- record_batch(my_tibble) # View RecordBatch my_record_batch ## RecordBatch ## 3 rows x 2 columns ## $group <string> ## $score <double> "],["defining-data-types.html", "5 Defining Data Types 5.1 Introduction 5.2 Update data type of an existing Arrow Array 5.3 Update data type of a field in an existing Arrow Table 5.4 Specify data types when creating an Arrow table from an R object 5.5 Specify data types when reading in files", " 5 Defining Data Types 5.1 Introduction As discussed in previous chapters, Arrow automatically infers the most appropriate data type when reading in data or converting R objects to Arrow objects. However, you might want to manually tell Arrow which data types to use, for example, to ensure interoperability with databases and data warehouse systems. This chapter includes recipes for: changing the data types of existing Arrow objects defining data types during the process of creating Arrow objects A table showing the default mappings between R and Arrow data types can be found in R data type to Arrow data type mappings. A table containing Arrow data types, and their R equivalents can be found in Arrow data type to R data type mapping. 5.2 Update data type of an existing Arrow Array You want to change the data type of an existing Arrow Array. 5.2.1 Solution # Create an Array to cast integer_arr <- Array$create(1:5) # Cast to an unsigned int8 type uint_arr <- integer_arr$cast(target_type = uint8()) uint_arr ## Array ## <uint8> ## [ ## 1, ## 2, ## 3, ## 4, ## 5 ## ] 5.2.2 Discussion There are some data types which are not compatible with each other. Errors will occur if you try to cast between incompatible data types. int_arr <- Array$create(1:5) int_arr$cast(target_type = binary()) ## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary 5.3 Update data type of a field in an existing Arrow Table You want to change the type of one or more fields in an existing Arrow Table. 5.3.1 Solution # Set up a tibble to use in this example oscars <- tibble::tibble( actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3) ) # Convert tibble to an Arrow table oscars_arrow <- arrow_table(oscars) # The default mapping from numeric column "num_awards" is to a double oscars_arrow ## Table ## 3 rows x 2 columns ## $actor <string> ## $num_awards <double> # Set up schema with "num_awards" as integer oscars_schema <- schema(actor = string(), num_awards = int16()) # Cast to an int16 oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema) oscars_arrow_int ## Table ## 3 rows x 2 columns ## $actor <string> ## $num_awards <int16> 5.3.2 Discussion There are some Arrow data types which do not have any R equivalent. Attempting to cast to these data types or using a schema which contains them will result in an error. 
# Set up a tibble to use in this example oscars <- tibble::tibble( actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3) ) # Convert tibble to an Arrow table oscars_arrow <- arrow_table(oscars) # Set up schema with "num_awards" as float16 which doesn't have an R equivalent oscars_schema_invalid <- schema(actor = string(), num_awards = float16()) # The default mapping from numeric column "num_awards" is to a double oscars_arrow$cast(target_schema = oscars_schema_invalid) ## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float 5.4 Specify data types when creating an Arrow table from an R object You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object. 5.4.1 Solution # Set up a tibble to use in this example oscars <- tibble::tibble( actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3) ) # Set up schema with "num_awards" as integer oscars_schema <- schema(actor = string(), num_awards = int16()) # create arrow Table containing data and schema oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema) oscars_data_arrow ## Table ## 3 rows x 2 columns ## $actor <string> ## $num_awards <int16> 5.5 Specify data types when reading in files You want to manually specify Arrow data types when reading in files. 5.5.1 Solution # Set up a tibble to use in this example oscars <- tibble::tibble( actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3) ) # write dataset to disk write_dataset(oscars, path = "oscars_data") # Set up schema with "num_awards" as integer oscars_schema <- schema(actor = string(), num_awards = int16()) # read the dataset in, using the schema instead of inferring the type automatically oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema) oscars_dataset_arrow ## FileSystemDataset with 1 Parquet file ## actor: string ## num_awards: int16 "],["manipulating-data---arrays.html", "6 Manipulating Data - Arrays 6.1 Introduction 6.2 Filter by values matching a predicate or mask 6.3 Compute Mean/Min/Max, etc value of an Array 6.4 Count occurrences of elements in an Array 6.5 Apply arithmetic functions to Arrays. 6.6 Call Arrow compute functions directly on Arrays", " 6 Manipulating Data - Arrays 6.1 Introduction An Arrow Array is roughly equivalent to an R vector - it can be used to represent a single column of data, with all values having the same data type. A number of base R functions which have S3 generic methods have been implemented to work on Arrow Arrays; for example mean, min, and max. 6.2 Filter by values matching a predicate or mask You want to search for values in an Array that match a predicate condition. 6.2.1 Solution my_values <- Array$create(c(1:5, NA)) my_values[my_values > 3] ## Array ## <int32> ## [ ## 4, ## 5, ## null ## ] 6.2.2 Discussion You can refer to items in an Array using the square brackets [] like you can an R vector. 6.3 Compute Mean/Min/Max, etc value of an Array You want to calculate the mean, minimum, or maximum of values in an array. 6.3.1 Solution my_values <- Array$create(c(1:5, NA)) mean(my_values, na.rm = TRUE) ## Scalar ## 3 6.3.2 Discussion Many base R generic functions such as mean(), min(), and max() have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way. They will return Arrow objects themselves. 
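For example, min() and max() called on the Array from the recipe above also return Arrow Scalar objects rather than R vectors. (A minimal sketch; the output shown in the comments is what we would expect given the mean() example above.)
my_values <- Array$create(c(1:5, NA))
# min() and max() dispatch to their Arrow implementations and return Arrow Scalars
min(my_values, na.rm = TRUE)
## Scalar
## 1
max(my_values, na.rm = TRUE)
## Scalar
## 5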
If you want to use an R function which does not have an Arrow mapping, you can use as.vector() to convert Arrow objects to base R vectors. arrow_array <- Array$create(1:100) # get Tukey's five-number summary fivenum(as.vector(arrow_array)) ## [1] 1.0 25.5 50.5 75.5 100.0 You can tell if a function is a standard S3 generic function by looking at the body of the function - S3 generic functions call UseMethod() to determine the appropriate version of that function to use for the object. mean ## function (x, ...) ## UseMethod("mean") ## <bytecode: 0x55d9fb32d190> ## <environment: namespace:base> You can also use isS3stdGeneric() to determine if a function is an S3 generic. isS3stdGeneric("mean") ## mean ## TRUE If you find an S3 generic function which isn’t implemented for Arrow objects but which you would like to be able to use, please open an issue on the project JIRA. 6.4 Count occurrences of elements in an Array You want to count repeated values in an Array. 6.4.1 Solution repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3)) value_counts(repeated_vals) ## StructArray ## <struct<values: double, counts: int64>> ## -- is_valid: all not null ## -- child 0 type: double ## [ ## 1, ## 2, ## 3 ## ] ## -- child 1 type: int64 ## [ ## 2, ## 1, ## 5 ## ] 6.4.2 Discussion Some functions in the Arrow R package do not have base R equivalents. In other cases, the base R equivalents are not generic functions so they cannot be called directly on Arrow Array objects. For example, the value_counts() function in the Arrow R package is loosely equivalent to the base R function table(), which is not a generic function. 6.5 Apply arithmetic functions to Arrays. You want to use the various arithmetic operators on Array objects. 6.5.1 Solution num_array <- Array$create(1:10) num_array + 10 ## Array ## <double> ## [ ## 11, ## 12, ## 13, ## 14, ## 15, ## 16, ## 17, ## 18, ## 19, ## 20 ## ] 6.5.2 Discussion You will get the same result if you pass in the value you’re adding as an Arrow object. num_array + Scalar$create(10) ## Array ## <double> ## [ ## 11, ## 12, ## 13, ## 14, ## 15, ## 16, ## 17, ## 18, ## 19, ## 20 ## ] 6.6 Call Arrow compute functions directly on Arrays You want to call an Arrow compute function directly on an Array. 6.6.1 Solution first_100_numbers <- Array$create(1:100) # Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0. call_function("variance", first_100_numbers, options = list(ddof = 0)) ## Scalar ## 833.25 6.6.2 Discussion You can use call_function() to call Arrow compute functions directly on Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object. 6.6.3 See also For a more in-depth discussion of Arrow compute functions, see the section on using arrow functions in dplyr verbs in arrow. "],["manipulating-data---tables.html", "7 Manipulating Data - Tables 7.1 Introduction 7.2 Use dplyr verbs in Arrow 7.3 Use R functions in dplyr verbs in Arrow 7.4 Use Arrow functions in dplyr verbs in Arrow 7.5 Compute Window Aggregates", " 7 Manipulating Data - Tables 7.1 Introduction One of the aims of the Arrow project is to reduce duplication between different data frame implementations. The underlying implementation of a data frame is a conceptually different thing to the code, or the application programming interface (API), that you write to work with it. You may have seen this before in packages like dbplyr which allow you to use the dplyr API to interact with SQL databases. 
The Arrow R package has been written so that the underlying Arrow Table-like objects can be manipulated using the dplyr API, which allows you to use dplyr verbs. For example, here’s a short pipeline of data manipulation which uses dplyr exclusively: library(dplyr) starwars %>% filter(species == "Human") %>% mutate(height_ft = height/30.48) %>% select(name, height_ft) ## # A tibble: 35 × 2 ## name height_ft ## <chr> <dbl> ## 1 Luke Skywalker 5.64 ## 2 Darth Vader 6.63 ## 3 Leia Organa 4.92 ## 4 Owen Lars 5.84 ## 5 Beru Whitesun Lars 5.41 ## 6 Biggs Darklighter 6.00 ## 7 Obi-Wan Kenobi 5.97 ## 8 Anakin Skywalker 6.17 ## 9 Wilhuff Tarkin 5.91 ## 10 Han Solo 5.91 ## # ℹ 25 more rows And the same results as using Arrow with dplyr syntax: arrow_table(starwars) %>% filter(species == "Human") %>% mutate(height_ft = height/30.48) %>% select(name, height_ft) %>% collect() ## # A tibble: 35 × 2 ## name height_ft ## <chr> <dbl> ## 1 Luke Skywalker 5.64 ## 2 Darth Vader 6.63 ## 3 Leia Organa 4.92 ## 4 Owen Lars 5.84 ## 5 Beru Whitesun Lars 5.41 ## 6 Biggs Darklighter 6.00 ## 7 Obi-Wan Kenobi 5.97 ## 8 Anakin Skywalker 6.17 ## 9 Wilhuff Tarkin 5.91 ## 10 Han Solo 5.91 ## # ℹ 25 more rows You’ll notice we’ve used collect() in the Arrow pipeline above. That’s because one of the ways in which Arrow is efficient is that it works out the instructions for the calculations it needs to perform (expressions) and only runs them using Arrow once you actually pull the data into your R session. This means instead of doing lots of separate operations, it does them all at once in a more optimised way. This is called lazy evaluation. It also means that you are able to manipulate data that is larger than you can fit into memory on the machine you’re running your code on, if you only pull data into R when you have selected the desired subset, or when using functions which can operate on chunks of data. You can also have data which is split across multiple files. For example, you might have files which are stored in multiple Parquet or Feather files, partitioned across different directories. You can open partitioned or multi-file datasets using open_dataset() as discussed in a previous chapter, and then manipulate this data using Arrow before even reading any of the data into R. 7.2 Use dplyr verbs in Arrow You want to use a dplyr verb in Arrow. 7.2.1 Solution library(dplyr) arrow_table(starwars) %>% filter(species == "Human", homeworld == "Tatooine") %>% collect() ## # A tibble: 8 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… ## 2 Darth Va… 202 136 none white yellow 41.9 male mascu… ## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu… ## 4 Beru Whi… 165 75 brown light blue 47 fema… femin… ## 5 Biggs Da… 183 84 black light brown 24 male mascu… ## 6 Anakin S… 188 84 blond fair blue 41.9 male mascu… ## 7 Shmi Sky… 163 NA black fair brown 72 fema… femin… ## 8 Cliegg L… 183 NA brown fair blue 82 male mascu… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list<character>>, ## # vehicles <list<character>>, starships <list<character>> 7.2.2 Discussion You can use most of the dplyr verbs directly from Arrow. 7.2.3 See also You can find examples of the various dplyr verbs in “Introduction to dplyr” - run vignette(\"dplyr\", package = \"dplyr\") or view on the pkgdown site. 
You can see more information about using arrow_table() to create Arrow Tables and collect() to view them as R data frames in Creating Arrow Objects. 7.3 Use R functions in dplyr verbs in Arrow You want to use an R function inside a dplyr verb in Arrow. 7.3.1 Solution arrow_table(starwars) %>% filter(str_detect(name, "Darth")) %>% collect() ## # A tibble: 2 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Darth Va… 202 136 none white yellow 41.9 male mascu… ## 2 Darth Ma… 175 80 none red yellow 54 male mascu… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list<character>>, ## # vehicles <list<character>>, starships <list<character>> 7.3.2 Discussion The Arrow R package allows you to use dplyr verbs containing expressions which include base R and many tidyverse functions, but call Arrow functions under the hood. If you find any base R or tidyverse functions which you would like to see mapped in Arrow, please open an issue on the project JIRA. The following packages (amongst others) have had many function bindings/mappings written in arrow: lubridate stringr dplyr If you try to call a function which does not have an arrow mapping, the data will be pulled back into R, and you will see a warning message. library(stringr) arrow_table(starwars) %>% mutate(name_split = str_split_fixed(name, " ", 2)) %>% collect() ## Warning: Expression str_split_fixed(name, " ", 2) not supported in Arrow; ## pulling data into R ## # A tibble: 87 × 15 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sk… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… ## 4 Darth V… 202 136 none white yellow 41.9 male mascu… ## 5 Leia Or… 150 49 brown light brown 19 fema… femin… ## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu… ## 7 Beru Wh… 165 75 brown light blue 47 fema… femin… ## 8 R5-D4 97 32 <NA> white, red red NA none mascu… ## 9 Biggs D… 183 84 black light brown 24 male mascu… ## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu… ## # ℹ 77 more rows ## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list<character>>, ## # vehicles <list<character>>, starships <list<character>>, ## # name_split <chr[,2]> 7.4 Use Arrow functions in dplyr verbs in Arrow You want to use a function which is implemented in Arrow’s C++ library but either: it doesn’t have a mapping to a base R or tidyverse equivalent, or it has a mapping but nevertheless you want to call the C++ function directly 7.4.1 Solution arrow_table(starwars) %>% select(name) %>% mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>% collect() ## # A tibble: 87 × 2 ## name padded_name ## <chr> <chr> ## 1 Luke Skywalker Luke Skywalker ## 2 C-3PO *****C-3PO ## 3 R2-D2 *****R2-D2 ## 4 Darth Vader Darth Vader ## 5 Leia Organa Leia Organa ## 6 Owen Lars *Owen Lars ## 7 Beru Whitesun Lars Beru Whitesun Lars ## 8 R5-D4 *****R5-D4 ## 9 Biggs Darklighter Biggs Darklighter ## 10 Obi-Wan Kenobi Obi-Wan Kenobi ## # ℹ 77 more rows 7.4.2 Discussion The vast majority of Arrow C++ compute functions have been mapped to their base R or tidyverse equivalents, and we strongly recommend that you use these mappings where possible, as the original functions are well documented and the mapped versions have been tested to ensure the 
results returned are as expected. However, there may be circumstances in which you might want to use a compute function from the Arrow C++ library which does not have a base R or tidyverse equivalent. You can find documentation of Arrow C++ compute functions in the C++ documentation. This documentation lists all available compute functions, any associated options classes they need, and the valid data types that they can be used with. You can list all available Arrow compute functions from R by calling list_compute_functions(). list_compute_functions() ## [1] "abs" "abs_checked" ## [3] "acos" "acos_checked" ## [5] "add" "add_checked" ## [7] "all" "and" ## [9] "and_kleene" "and_not" ## [11] "and_not_kleene" "any" ## [13] "approximate_median" "array_filter" ## [15] "array_sort_indices" "array_take" ## [17] "ascii_capitalize" "ascii_center" ## [19] "ascii_is_alnum" "ascii_is_alpha" ## [21] "ascii_is_decimal" "ascii_is_lower" ## [23] "ascii_is_printable" "ascii_is_space" ## [25] "ascii_is_title" "ascii_is_upper" ## [27] "ascii_lower" "ascii_lpad" ## [29] "ascii_ltrim" "ascii_ltrim_whitespace" ## [31] "ascii_reverse" "ascii_rpad" ## [33] "ascii_rtrim" "ascii_rtrim_whitespace" ## [35] "ascii_split_whitespace" "ascii_swapcase" ## [37] "ascii_title" "ascii_trim" ## [39] "ascii_trim_whitespace" "ascii_upper" ## [41] "asin" "asin_checked" ## [43] "assume_timezone" "atan" ## [45] "atan2" "binary_join" ## [47] "binary_join_element_wise" "binary_length" ## [49] "binary_repeat" "binary_replace_slice" ## [51] "binary_reverse" "binary_slice" ## [53] "bit_wise_and" "bit_wise_not" ## [55] "bit_wise_or" "bit_wise_xor" ## [57] "case_when" "cast" ## [59] "ceil" "ceil_temporal" ## [61] "choose" "coalesce" ## [63] "cos" "cos_checked" ## [65] "count" "count_all" ## [67] "count_distinct" "count_substring" ## [69] "count_substring_regex" "cumulative_max" ## [71] "cumulative_mean" "cumulative_min" ## [73] "cumulative_prod" "cumulative_prod_checked" ## [75] "cumulative_sum" "cumulative_sum_checked" ## [77] "day" "day_of_week" ## [79] "day_of_year" "day_time_interval_between" ## [81] "days_between" "dictionary_decode" ## [83] "dictionary_encode" "divide" ## [85] "divide_checked" "drop_null" ## [87] "ends_with" "equal" ## [89] "exp" "extract_regex" ## [91] "fill_null_backward" "fill_null_forward" ## [93] "filter" "find_substring" ## [95] "find_substring_regex" "first" ## [97] "first_last" "floor" ## [99] "floor_temporal" "greater" ## [101] "greater_equal" "hour" ## [103] "hours_between" "if_else" ## [105] "index" "index_in" ## [107] "index_in_meta_binary" "indices_nonzero" ## [109] "invert" "is_dst" ## [111] "is_finite" "is_in" ## [113] "is_in_meta_binary" "is_inf" ## [115] "is_leap_year" "is_nan" ## [117] "is_null" "is_valid" ## [119] "iso_calendar" "iso_week" ## [121] "iso_year" "last" ## [123] "less" "less_equal" ## [125] "list_element" "list_flatten" ## [127] "list_parent_indices" "list_slice" ## [129] "list_value_length" "ln" ## [131] "ln_checked" "local_timestamp" ## [133] "log10" "log10_checked" ## [135] "log1p" "log1p_checked" ## [137] "log2" "log2_checked" ## [139] "logb" "logb_checked" ## [141] "make_struct" "map_lookup" ## [143] "match_like" "match_substring" ## [145] "match_substring_regex" "max" ## [147] "max_element_wise" "mean" ## [149] "microsecond" "microseconds_between" ## [151] "millisecond" "milliseconds_between" ## [153] "min" "min_element_wise" ## [155] "min_max" "minute" ## [157] "minutes_between" "mode" ## [159] "month" "month_day_nano_interval_between" ## [161] "month_interval_between" "multiply" 
## [163] "multiply_checked" "nanosecond" ## [165] "nanoseconds_between" "negate" ## [167] "negate_checked" "not_equal" ## [169] "or" "or_kleene" ## [171] "pairwise_diff" "pairwise_diff_checked" ## [173] "partition_nth_indices" "power" ## [175] "power_checked" "product" ## [177] "quantile" "quarter" ## [179] "quarters_between" "random" ## [181] "rank" "replace_substring" ## [183] "replace_substring_regex" "replace_with_mask" ## [185] "round" "round_binary" ## [187] "round_temporal" "round_to_multiple" ## [189] "run_end_decode" "run_end_encode" ## [191] "second" "seconds_between" ## [193] "select_k_unstable" "shift_left" ## [195] "shift_left_checked" "shift_right" ## [197] "shift_right_checked" "sign" ## [199] "sin" "sin_checked" ## [201] "sort_indices" "split_pattern" ## [203] "split_pattern_regex" "sqrt" ## [205] "sqrt_checked" "starts_with" ## [207] "stddev" "strftime" ## [209] "string_is_ascii" "strptime" ## [211] "struct_field" "subsecond" ## [213] "subtract" "subtract_checked" ## [215] "sum" "take" ## [217] "tan" "tan_checked" ## [219] "tdigest" "true_unless_null" ## [221] "trunc" "unique" ## [223] "us_week" "us_year" ## [225] "utf8_capitalize" "utf8_center" ## [227] "utf8_is_alnum" "utf8_is_alpha" ## [229] "utf8_is_decimal" "utf8_is_digit" ## [231] "utf8_is_lower" "utf8_is_numeric" ## [233] "utf8_is_printable" "utf8_is_space" ## [235] "utf8_is_title" "utf8_is_upper" ## [237] "utf8_length" "utf8_lower" ## [239] "utf8_lpad" "utf8_ltrim" ## [241] "utf8_ltrim_whitespace" "utf8_normalize" ## [243] "utf8_replace_slice" "utf8_reverse" ## [245] "utf8_rpad" "utf8_rtrim" ## [247] "utf8_rtrim_whitespace" "utf8_slice_codeunits" ## [249] "utf8_split_whitespace" "utf8_swapcase" ## [251] "utf8_title" "utf8_trim" ## [253] "utf8_trim_whitespace" "utf8_upper" ## [255] "value_counts" "variance" ## [257] "week" "weeks_between" ## [259] "xor" "year" ## [261] "year_month_day" "years_between" The majority of functions here have been mapped to their base R or tidyverse equivalent and can be called within a dplyr query as usual. For functions which don’t have a base R or tidyverse equivalent, or you want to supply custom options, you can call them by prefixing their name with “arrow_”. For example, base R’s is.na() function is the equivalent of the Arrow C++ compute function is_null() with the option nan_is_null set to TRUE. A mapping between these functions (with nan_is_null set to TRUE) has been created in arrow. demo_df <- data.frame(x = c(1, 2, 3, NA, NaN)) arrow_table(demo_df) %>% mutate(y = is.na(x)) %>% collect() ## # A tibble: 5 × 2 ## x y ## <dbl> <lgl> ## 1 1 FALSE ## 2 2 FALSE ## 3 3 FALSE ## 4 NA TRUE ## 5 NaN TRUE If you want to call Arrow’s is_null() function but with nan_is_null set to FALSE (so it returns TRUE when a value being examined is NA but FALSE when the value being examined is NaN), you must call is_null() directly and specify the option nan_is_null = FALSE. arrow_table(demo_df) %>% mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>% collect() ## # A tibble: 5 × 2 ## x y ## <dbl> <lgl> ## 1 1 FALSE ## 2 2 FALSE ## 3 3 FALSE ## 4 NA TRUE ## 5 NaN FALSE 7.4.2.1 Compute functions with options Although not all Arrow C++ compute functions require options to be specified, most do. For these functions to work in R, they must be linked up with the appropriate libarrow options C++ class via the R package’s C++ code. At the time of writing, all compute functions available in the development version of the Arrow R package had been associated with their options classes. 
However, as the Arrow C++ library’s functionality extends, compute functions may be added which do not yet have an R binding. If you find a C++ compute function which you wish to use from the R package, please open an issue on the Github project. 7.5 Compute Window Aggregates You want to apply an aggregation (e.g. mean()) on a grouped table or within a rowwise operation like filter(): 7.5.1 Solution arrow_table(starwars) %>% select(1:4) %>% filter(!is.na(hair_color)) %>% left_join( arrow_table(starwars) %>% group_by(hair_color) %>% summarize(mean_height = mean(height, na.rm = TRUE)) ) %>% filter(height < mean_height) %>% select(!mean_height) %>% collect() ## # A tibble: 28 × 4 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Luke Skywalker 172 77 blond ## 2 Leia Organa 150 49 brown ## 3 Beru Whitesun Lars 165 75 brown ## 4 Wedge Antilles 170 77 brown ## 5 Yoda 66 17 white ## 6 Lobot 175 79 none ## 7 Ackbar 180 83 none ## 8 Wicket Systri Warrick 88 20 brown ## 9 Nien Nunb 160 68 none ## 10 Finis Valorum 170 NA blond ## # ℹ 18 more rows Or, using to_duckdb(): arrow_table(starwars) %>% select(1:4) %>% filter(!is.na(hair_color)) %>% to_duckdb() %>% group_by(hair_color) %>% filter(height < mean(height, na.rm = TRUE)) %>% to_arrow() %>% collect() ## # A tibble: 28 × 4 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Yoda 66 17 white ## 2 Luke Skywalker 172 77 blond ## 3 Finis Valorum 170 NA blond ## 4 R4-P17 96 NA none ## 5 Lobot 175 79 none ## 6 Ackbar 180 83 none ## 7 Nien Nunb 160 68 none ## 8 Darth Maul 175 80 none ## 9 Bib Fortuna 180 NA none ## 10 Ayla Secura 178 55 none ## # ℹ 18 more rows 7.5.2 Discussion Arrow does not support window functions, and instead pulls the data into R. For large tables, this sacrifices performance. arrow_table(starwars) %>% select(1:4) %>% filter(!is.na(hair_color)) %>% group_by(hair_color) %>% filter(height < mean(height, na.rm = TRUE)) ## Warning: Expression height < mean(height, na.rm = TRUE) not supported in Arrow; ## pulling data into R ## # A tibble: 28 × 4 ## # Groups: hair_color [5] ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Luke Skywalker 172 77 blond ## 2 Leia Organa 150 49 brown ## 3 Beru Whitesun Lars 165 75 brown ## 4 Wedge Antilles 170 77 brown ## 5 Yoda 66 17 white ## 6 Lobot 175 79 none ## 7 Ackbar 180 83 none ## 8 Wicket Systri Warrick 88 20 brown ## 9 Nien Nunb 160 68 none ## 10 Finis Valorum 170 NA blond ## # ℹ 18 more rows You can perform these window aggregate operations on Arrow tables by: Computing the aggregation separately, and joining the result Passing the data to DuckDB, and using the DuckDB query engine to perform the operations Arrow supports zero-copy integration with DuckDB, and DuckDB can query Arrow datasets directly and stream query results back to Arrow. This integration uses zero-copy streaming of data between DuckDB and Arrow and vice versa so that you can compose a query using both together, all the while not paying any cost to (re)serialize the data when you pass it back and forth. This is especially useful in cases where something is supported in one of Arrow or DuckDB query engines but not the other. You can find more information about this integration on the Arrow blog post. 
"],["using-pyarrow-from-r.html", "8 Using PyArrow from R 8.1 Introduction 8.2 Create an Arrow object using PyArrow in R 8.3 Call a PyArrow function from R", " 8 Using PyArrow from R 8.1 Introduction For more information on using setting up and installing PyArrow to use in R, see the “Apache Arrow in Python and R with reticulate” vignette. 8.2 Create an Arrow object using PyArrow in R You want to use PyArrow to create an Arrow object in an R session. 8.2.1 Solution library(reticulate) pa <- import("pyarrow") pyarrow_scalar <- pa$scalar(42) pyarrow_scalar ## <pyarrow.DoubleScalar: 42.0> 8.3 Call a PyArrow function from R You want to call a PyArrow function from your R session. 8.3.1 Solution table_1 <- arrow_table(mtcars[1:5,]) table_2 <- arrow_table(mtcars[11:15,]) pa$concat_tables(tables = list(table_1, table_2)) %>% collect() ## # A tibble: 10 × 11 ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4 ## 7 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 ## 8 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 ## 9 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 ## 10 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 "],["flight.html", "9 Flight 9.1 Introduction 9.2 Connect to a Flight server 9.3 Send data to a Flight server 9.4 Check what resources exist on a Flight server 9.5 Retrieve data from a Flight server", " 9 Flight 9.1 Introduction Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. Flight allows for highly efficient data transfer as it: removes the need for serialization during data transfer allows for parallel data streaming is highly optimized to take advantage of Arrow’s columnar format. The arrow package provides methods for connecting to Flight RPC servers to send and receive data. It should be noted that the Flight implementation in the R package depends on PyArrow which is called via reticulate. This is quite different from the other capabilities in the R package, nearly all of which are all implemented directly. 9.2 Connect to a Flight server You want to connect to a Flight server running on a specified host and port. 9.2.1 Solution local_client <- flight_connect(host = "127.0.0.1", port = 8089) 9.2.2 See also For an example of how to set up a Flight server from R, see the Flight vignette. 9.3 Send data to a Flight server You want to send data that you have in memory to a Flight server 9.3.1 Solution # Connect to the Flight server local_client <- flight_connect(host = "127.0.0.1", port = 8089) # Send the data flight_put( local_client, data = airquality, path = "pollution_data" ) 9.4 Check what resources exist on a Flight server You want to see what paths are available on a Flight server. 9.4.1 Solution # Connect to the Flight server local_client <- flight_connect(host = "127.0.0.1", port = 8089) # Retrieve path listing list_flights(local_client) # [1] "pollution_data" 9.5 Retrieve data from a Flight server You want to retrieve data on a Flight server from a specified path. 
9.5.1 Solution # Connect to the Flight server local_client <- flight_connect(host = "127.0.0.1", port = 8089) # Retrieve data flight_get( local_client, "pollution_data" ) # Table # 153 rows x 6 columns # $Ozone <int32> # $Solar.R <int32> # $Wind <double> # $Temp <int32> # $Month <int32> # $Day <int32> # # See $metadata for additional Schema metadata "]]