The goal of nanoarrow is to provide minimal useful bindings to the Arrow C Data and Arrow C Stream interfaces using the nanoarrow C library.
You can install the released version of nanoarrow from CRAN with:
install.packages("nanoarrow")
You can install the development version of nanoarrow from GitHub with:
# install.packages("remotes")
remotes::install_github("apache/arrow-nanoarrow/r")
If you can load the package, you’re good to go!
library(nanoarrow)
The Arrow C Data and Arrow C Stream interfaces comprise three structures: the ArrowSchema, which represents the data type of an array; the ArrowArray, which represents the values of an array; and the ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema. All three can be wrapped by R objects using the nanoarrow R package.
Use infer_nanoarrow_schema() to get the ArrowSchema object that corresponds to a given R vector type; use as_nanoarrow_schema() to convert an object from some other data type representation (e.g., an arrow R package DataType like arrow::int32()); or use the na_XXX() functions (e.g., na_int64()) to construct a schema directly.
infer_nanoarrow_schema(1:5)
#> <nanoarrow_schema int32>
#> $ format : chr "i"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULL

as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL

na_int64()
#> <nanoarrow_schema int64>
#> $ format : chr "l"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULL
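As a small sketch of the na_XXX() constructors, the following builds a struct schema by hand. It assumes that na_struct() accepts a named list of child schemas and that schema fields can be read with `$`, as the printed structure suggests. The column names x and y are made up for illustration.

```r
library(nanoarrow)

# Hypothetical column names, used only for illustration
schema <- na_struct(list(x = na_int32(), y = na_string()))

# Fields are read with `$`, mirroring the printed structure
schema$format
names(schema$children)
schema$children$x$format
```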
Use as_nanoarrow_array() to convert an object to an ArrowArray object:
as_nanoarrow_array(1:5)
#> <nanoarrow_array int32[5]>
#> $ length : int 5
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 2
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#> $ dictionary: NULL
#> $ children : list()

as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
You can use as.vector() or as.data.frame() to get the R representation of the object back:
array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
as.data.frame(array)
#> col1
#> 1 1.1
#> 2 2.2
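The same round trip works for plain vectors via as.vector(). The sketch below uses an integer vector containing a missing value to illustrate that NA is expected to survive the conversion. The variable names are illustrative, and the null count is printed without a stated expectation.

```r
library(nanoarrow)

# An integer vector with a missing value
x <- c(1L, NA, 3L)
array <- as_nanoarrow_array(x)

# Inspect how many nulls the Arrow array reports
array$null_count

# Convert back to an R vector and compare with the input
roundtrip <- as.vector(array)
identical(roundtrip, x)
```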
Even though at the C level the ArrowArray is distinct from the ArrowSchema, at the R level we attach a schema wherever possible. You can access the attached schema using infer_nanoarrow_schema():
infer_nanoarrow_schema(array)
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
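Individual pieces of the attached schema can be picked out with `$`. This is a minimal sketch, assuming the field names shown in the printed output (format, children, and so on) are accessible this way:

```r
library(nanoarrow)

array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
schema <- infer_nanoarrow_schema(array)

# Child schemas are named after the data frame's columns
names(schema$children)

# Arrow format string for col1 ("g" denotes a 64-bit float)
schema$children$col1$format
```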
The easiest way to create an ArrowArrayStream is from a list of arrays or objects that can be converted to an array using as_nanoarrow_array():
stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)
You can pull batches from the stream using the $get_next() method. Once the stream is exhausted, $get_next() returns NULL.
stream$get_next()
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL

stream$get_next()
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL

stream$get_next()
#> NULL
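Because $get_next() returns NULL at the end of the stream, batches can be processed in a loop. The following sketch counts batches and rows; the counter names are illustrative, and a real loop would do its per-batch work where the counters are updated.

```r
library(nanoarrow)

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

n_batches <- 0L
n_rows <- 0L
repeat {
  batch <- stream$get_next()
  # NULL signals that the stream is exhausted
  if (is.null(batch)) break
  n_batches <- n_batches + 1L
  n_rows <- n_rows + batch$length
}

n_batches
n_rows
```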
You can pull all the batches into a data.frame() by calling as.data.frame() or as.vector():
stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

as.data.frame(stream)
#> col1
#> 1 1.1
#> 2 2.2
#> 3 3.3
#> 4 4.4
After consuming a stream, you should call its $release() method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.
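One way to make the release predictable is to tie it to a function's exit with on.exit(). The helper below, read_first_batch(), is hypothetical and only sketches the pattern: it reads one batch and releases the stream even if an error occurs along the way.

```r
library(nanoarrow)

# Hypothetical helper: read only the first batch of a stream
read_first_batch <- function(stream) {
  # Release the stream when the function exits, even on error
  on.exit(stream$release())
  batch <- stream$get_next()
  if (is.null(batch)) NULL else as.data.frame(batch)
}

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

first <- read_first_batch(stream)
first
```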
The nanoarrow package implements as_nanoarrow_schema(), as_nanoarrow_array(), and as_nanoarrow_array_stream() for most arrow package types. Similarly, it implements arrow::as_arrow_array(), arrow::as_record_batch(), arrow::as_arrow_table(), arrow::as_record_batch_reader(), arrow::infer_type(), arrow::as_data_type(), and arrow::as_schema() for nanoarrow objects such that you can pass equivalent nanoarrow objects into many arrow functions and vice versa.
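A minimal sketch of this interoperability, assuming the arrow package is installed (the example does nothing otherwise), converts a nanoarrow array to an arrow Array and back:

```r
library(nanoarrow)

if (requireNamespace("arrow", quietly = TRUE)) {
  # nanoarrow -> arrow
  arrow_array <- arrow::as_arrow_array(as_nanoarrow_array(1:5))

  # arrow -> nanoarrow, then back to an R vector
  roundtrip <- as.vector(as_nanoarrow_array(arrow_array))
  print(roundtrip)
}
```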