| --- |
| title: "Connecting to a Flight server" |
| description: > |
|   Learn how to efficiently stream Apache Arrow data objects across a |
|   network using Arrow Flight |
| output: rmarkdown::html_vignette |
| --- |
| |
| [Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) is a general-purpose client-server framework for high-performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It achieves efficient data transfer in several ways: |
| |
| * Flight removes the need for deserialization during data transfer. |
| * Flight allows for parallel data streaming. |
| * Flight employs optimizations designed to take advantage of Arrow's columnar format. |
| |
| The arrow package provides methods for connecting to Flight servers to send and receive data. |
| |
| ## Prerequisites |
| |
| At present, the arrow package in R does not supply an independent implementation of Arrow Flight. Instead, it calls the [Flight methods supplied by PyArrow](https://arrow.apache.org/docs/python/api/flight.html) in Python, so it requires both the [reticulate](https://rstudio.github.io/reticulate/) package and the Python PyArrow library to be installed. If you are using them for the first time, you can install them like this: |
| |
| ```r |
| install.packages("reticulate") |
| arrow::install_pyarrow() |
| ``` |
| |
| See the [Python integrations article](./python.html) for more details on setting up PyArrow. |
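| |
| Before starting a server or client, it can help to confirm that R can actually see the Python side. The following is a minimal sketch, not part of the arrow package: it assumes reticulate is pointed at the Python environment where PyArrow was installed, and the variable names `has_reticulate` and `has_pyarrow` are illustrative. |
| |
| ```r |
| # Check whether the reticulate package is installed (without attaching it) |
| has_reticulate <- requireNamespace("reticulate", quietly = TRUE) |
| |
| # Check whether the Python module pyarrow can be found by reticulate |
| has_pyarrow <- has_reticulate && reticulate::py_module_available("pyarrow") |
| |
| if (!has_pyarrow) { |
|   message("PyArrow was not found; try arrow::install_pyarrow() and restart R.") |
| } |
| ``` |
| |
| If `has_pyarrow` is `FALSE` even after installation, reticulate may be bound to a different Python environment than the one PyArrow was installed into. |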
| |
| ## Example |
| |
| The package includes methods for starting a Python-based Flight server, as well |
| as methods for connecting to a Flight server running elsewhere. To illustrate both sides, in one R process we'll start a demo server: |
| |
| ```r |
| library(arrow) |
| demo_server <- load_flight_server("demo_flight_server") |
| server <- demo_server$DemoFlightServer(port = 8089) |
| server$serve() |
| ``` |
| |
| We'll leave that one running. |
| |
| In a different R process, let's connect to it and put some data in it. |
| |
| ```r |
| library(arrow) |
| client <- flight_connect(port = 8089) |
| flight_put(client, iris, path = "test_data/iris") |
| ``` |
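| |
| To avoid uploading the same data twice, you can ask the server whether a path already exists before calling `flight_put()`. The sketch below uses `flight_path_exists()` and `list_flights()` from the arrow package; the wrapper `put_if_missing()` is a hypothetical helper written for this example, not something the package provides. |
| |
| ```r |
| # Hypothetical helper (not part of arrow): upload `data` to `path` |
| # only if the server does not already hold something at that path. |
| put_if_missing <- function(client, data, path) { |
|   if (arrow::flight_path_exists(client, path)) { |
|     message("Path already exists on the server: ", path) |
|     return(invisible(FALSE)) |
|   } |
|   arrow::flight_put(client, data, path = path) |
|   invisible(TRUE) |
| } |
| |
| # Usage, assuming the demo server above is running on port 8089: |
| # client <- arrow::flight_connect(port = 8089) |
| # put_if_missing(client, iris, "test_data/iris") |
| # arrow::list_flights(client)  # lists the paths the server holds |
| ``` |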
| |
| Now, in yet another R process, we can connect to the server and pull the data we |
| put there: |
| |
| ```r |
| library(arrow) |
| library(dplyr) |
| client <- flight_connect(port = 8089) |
| client %>% |
|   flight_get("test_data/iris") %>% |
|   group_by(Species) %>% |
|   summarize(max_petal = max(Petal.Length)) %>% |
|   collect() # evaluate the query and return a tibble |
| |
| ## # A tibble: 3 x 2 |
| ## Species max_petal |
| ## <fct> <dbl> |
| ## 1 setosa 1.9 |
| ## 2 versicolor 5.1 |
| ## 3 virginica 6.9 |
| ``` |
| |
| Because `flight_get()` returns an Arrow data structure, you can pipe |
| its result directly into a [dplyr](https://dplyr.tidyverse.org/) workflow. |
| See the article on [data wrangling](./data_wrangling.html) for more information on working with Arrow objects via a dplyr interface. |
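| |
| If you would rather work with an ordinary R data frame, you can convert the result after fetching it. This is a minimal sketch that assumes `flight_get()` returns an Arrow Table (as it does in current releases of the package) and that a server is listening on the given port. The helper `fetch_as_data_frame()` is hypothetical and exists only for illustration. |
| |
| ```r |
| # Hypothetical helper (not part of arrow): fetch `path` from a Flight |
| # server on localhost and return it as a base R data frame. |
| fetch_as_data_frame <- function(path, port = 8089) { |
|   client <- arrow::flight_connect(port = port) |
|   # Close the connection when the function exits, even on error |
|   on.exit(arrow::flight_disconnect(client), add = TRUE) |
|   tbl <- arrow::flight_get(client, path) # assumed to be an Arrow Table |
|   as.data.frame(tbl) |
| } |
| |
| # Usage, assuming the demo server above is still running: |
| # iris_df <- fetch_as_data_frame("test_data/iris") |
| # head(iris_df) |
| ``` |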
| |
| ## Further reading |
| |
| - The specification of the [Flight remote procedure call protocol](https://arrow.apache.org/docs/format/Flight.html) is published on the Arrow project website. |
| - The Arrow C++ documentation contains a list of [best practices](https://arrow.apache.org/docs/cpp/flight.html#best-practices) for Arrow Flight. |
| - A detailed worked example of an Arrow Flight server in Python is provided in the [Apache Arrow Python Cookbook](https://arrow.apache.org/cookbook/py/flight.html). |
| |
| |
| |