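Arrow Flight is a general-purpose client-server framework for high-performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It achieves efficient data transfer by several means, including sending data in the Arrow columnar format, so that no serialization or deserialization step is needed, and supporting parallel data streams.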
The arrow package provides methods for connecting to Flight servers to send and receive data.
At present, the arrow package in R does not supply an independent implementation of Arrow Flight. It works by calling Flight methods supplied by PyArrow, the Arrow Python library, and requires both the reticulate package and PyArrow to be installed. If you are using them for the first time, you can install them like this:
install.packages("reticulate")
arrow::install_pyarrow()
See the Python integrations article for more details on setting up pyarrow.
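Before using the Flight functions, it can help to confirm that both prerequisites are visible from the current R session. The following is a minimal sketch, assuming reticulate can locate the Python environment in which pyarrow was installed; the variable names are illustrative only.

```r
# Check that reticulate is installed, and only then ask it about pyarrow.
has_reticulate <- requireNamespace("reticulate", quietly = TRUE)
has_pyarrow <- has_reticulate && reticulate::py_module_available("pyarrow")

if (!has_pyarrow) {
  # Assumption: a fresh install is the right fix; a misconfigured Python
  # environment could also cause this check to fail.
  message("pyarrow was not found; try arrow::install_pyarrow() and restart R.")
}
```

If the check fails even after installation, reticulate may be pointing at a different Python environment than the one pyarrow was installed into.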
The package includes methods for starting a Python-based Flight server, as well as methods for connecting to a Flight server running elsewhere. To illustrate both sides, in one R process we’ll start a demo server:
library(arrow)
demo_server <- load_flight_server("demo_flight_server")
server <- demo_server$DemoFlightServer(port = 8089)
server$serve()
We’ll leave that one running.
In a different R process, let’s connect to it and put some data in it.
library(arrow)
client <- flight_connect(port = 8089)
flight_put(client, iris, path = "test_data/iris")
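The arrow package also exports `flight_path_exists()` and `list_flights()` for inspecting what a server currently holds. As a rough sketch, the hypothetical helper below (`put_if_absent()` is not part of arrow) uploads a dataset only when nothing is stored at that path yet. The usage lines are commented out because they need the demo server to be running.

```r
library(arrow)

# Hypothetical helper, not part of the arrow package:
# upload `data` only if `path` does not already exist on the server.
put_if_absent <- function(client, data, path) {
  if (flight_path_exists(client, path)) {
    return(invisible(FALSE))  # nothing uploaded
  }
  flight_put(client, data, path = path)
  invisible(TRUE)             # data was uploaded
}

# Usage, assuming the demo server is listening on port 8089:
# client <- flight_connect(port = 8089)
# put_if_absent(client, iris, "test_data/iris")
# list_flights(client)
```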
Now, in yet another R process, we can connect to the server and pull the data we put there:
library(arrow)
library(dplyr)
client <- flight_connect(port = 8089)
client |>
  flight_get("test_data/iris") |>
  group_by(Species) |>
  summarize(max_petal = max(Petal.Length))
## # A tibble: 3 x 2
##   Species    max_petal
##   <fct>          <dbl>
## 1 setosa           1.9
## 2 versicolor       5.1
## 3 virginica        6.9
Because flight_get() returns an Arrow data structure, you can directly pipe its result into a dplyr workflow. See the article on data wrangling for more information on working with Arrow objects via a dplyr interface.
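To try this pattern without a running server, the sketch below applies the same dplyr pipeline to a local Arrow Table built with `arrow_table()`. The local table stands in for the result of `flight_get()`, on the assumption that `flight_get()` returns an Arrow Table. The final `collect()` pulls the result into an R data frame.

```r
library(arrow)
library(dplyr)

# Stand-in for flight_get(): an Arrow Table created locally from iris.
iris_tbl <- arrow_table(iris)

result <- iris_tbl |>
  group_by(Species) |>
  summarize(max_petal = max(Petal.Length)) |>
  collect()  # materialize the Arrow query as an R data frame

result
```

Row order after a grouped aggregation on Arrow data is not guaranteed, so add an explicit sort if the order matters.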