| --- |
| output: github_document |
| --- |
| |
| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <!-- README.md is generated from README.Rmd. Please edit that file --> |
| |
| ```{r, include = FALSE} |
| knitr::opts_chunk$set( |
| collapse = TRUE, |
| comment = "#>", |
| fig.path = "man/figures/README-", |
| out.width = "100%" |
| ) |
| ``` |
| |
| # nanoarrow |
| |
| <!-- badges: start --> |
| <!-- badges: end --> |
| |
| The goal of nanoarrow is to provide minimal useful bindings to the [Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) interfaces using the [nanoarrow C library](https://arrow.apache.org/nanoarrow). |
| |
| ## Installation |
| |
| You can install the released version of nanoarrow from [CRAN](https://cran.r-project.org/) with: |
| |
| ```r |
| install.packages("nanoarrow") |
| ``` |
| |
| You can install the development version of nanoarrow from [GitHub](https://github.com/) with: |
| |
| ``` r |
| # install.packages("remotes") |
| remotes::install_github("apache/arrow-nanoarrow/r") |
| ``` |
| |
| If you can load the package, you're good to go! |
| |
| ```{r} |
| library(nanoarrow) |
| ``` |
| |
| ## Example |
| |
| The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. All three can be wrapped by R objects using the nanoarrow R package. |
| |
| ### Schemas |
| |
| Use `infer_nanoarrow_schema()` to get the ArrowSchema object that corresponds to a given R vector type; use `as_nanoarrow_schema()` to convert an object from some other data type representation (e.g., an arrow R package `DataType` like `arrow::int32()`); or use `na_XXX()` functions to construct them. |
| |
| ```{r} |
| infer_nanoarrow_schema(1:5) |
| as_nanoarrow_schema(arrow::schema(col1 = arrow::float64())) |
| na_int64() |
| ``` |
| |
| ### Arrays |
| |
| Use `as_nanoarrow_array()` to convert an object to an ArrowArray object: |
| |
| ```{r} |
| as_nanoarrow_array(1:5) |
| as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2))) |
| ``` |
| |
| You can use `as.vector()` or `as.data.frame()` to get the R representation of the object back: |
| |
| ```{r} |
| array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2))) |
| as.data.frame(array) |
| ``` |
| |
| Even though at the C level the ArrowArray is distinct from the ArrowSchema, at the R level we attach a schema wherever possible. You can access the attached schema using `infer_nanoarrow_schema()`: |
| |
| ```{r} |
| infer_nanoarrow_schema(array) |
| ``` |
| |
| ### Array Streams |
| |
| The easiest way to create an ArrowArrayStream is from a list of arrays or objects that can be converted to an array using `as_nanoarrow_array()`: |
| |
| ```{r} |
| stream <- basic_array_stream( |
| list( |
| data.frame(col1 = c(1.1, 2.2)), |
| data.frame(col1 = c(3.3, 4.4)) |
| ) |
| ) |
| ``` |
| |
| You can pull batches from the stream using the `$get_next()` method. The last batch will return `NULL`. |
| |
| ```{r} |
| stream$get_next() |
| stream$get_next() |
| stream$get_next() |
| ``` |
| |
| You can pull all the batches into a `data.frame()` by calling `as.data.frame()` or `as.vector()`: |
| |
| ```{r} |
| stream <- basic_array_stream( |
| list( |
| data.frame(col1 = c(1.1, 2.2)), |
| data.frame(col1 = c(3.3, 4.4)) |
| ) |
| ) |
| |
| as.data.frame(stream) |
| ``` |
| |
| After consuming a stream, you should call the release method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object. |
| |
| ## Integration with the arrow package |
| |
| The nanoarrow package implements `as_nanoarrow_schema()`, `as_nanoarrow_array()`, and `as_nanoarrow_array_stream()` for most arrow package types. Similarly, it implements `arrow::as_arrow_array()`, `arrow::as_record_batch()`, `arrow::as_arrow_table()`, `arrow::as_record_batch_reader()`, `arrow::infer_type()`, `arrow::as_data_type()`, and `arrow::as_schema()` for nanoarrow objects such that you can pass equivalent nanoarrow objects into many arrow functions and vice versa. |