[{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":null,"dir":"","previous_headings":"","what":"Packaging checklist for CRAN release","title":"Packaging checklist for CRAN release","text":"high-level overview release process see Apache Arrow Release Management Guide.","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"before-the-release-candidate-is-cut","dir":"","previous_headings":"","what":"Before the release candidate is cut","title":"Packaging checklist for CRAN release","text":"Create GitHub issue entitled [R] CRAN packaging checklist version X.X.X copy checklist issue. Review deprecated functions advance deprecation status, including removing preprocessor directives longer apply (search ARROW_VERSION_MAJOR r/src). Evaluate status failing nightly tests nightly packaging builds. checks replicate checks CRAN runs, need passing understand failures may (though won’t necessarily) result rejection CRAN. Check current CRAN check results Ensure contents README accurate date. Run urlchecker::url_check() R directory release candidate. commit. Ignore errors badges removed CRAN release branch. Polish NEWS update version numbers (done automatically later). can find commits , example, git log --oneline aa057d0..HEAD | grep \"\\[R\\]\" Run preliminary reverse dependency checks using archery docker run r-revdepcheck. major releases, prepare tweet thread highlighting new features. Wait release candidate cut:","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"after-release-candidate-has-been-cut","dir":"","previous_headings":"","what":"After release candidate has been cut","title":"Packaging checklist for CRAN release","text":"Create CRAN-release branch release candidate commit, name new branch e.g. maint-X.X.X-r push upstream","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"prepare-and-check-the-targz-that-will-be-released-to-cran","dir":"","previous_headings":"","what":"Prepare and check the .tar.gz that will be released to CRAN.","title":"Packaging checklist for CRAN release","text":"git fetch upstream && git checkout release-X.X.X-rcXX && git clean -f -d Run make build. copies Arrow C++ tools/cpp, prunes unnecessary components, runs R CMD build generate source tarball. install package, need ensure version Arrow C++ available configure script version vendored R package (e.g., may need unset ARROW_HOME). devtools::check_built(\"arrow_X.X.X.tar.gz\") locally Run reverse dependency checks using archery docker run r-revdepcheck.","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"release-vote","dir":"","previous_headings":"","what":"Release vote","title":"Packaging checklist for CRAN release","text":"Release vote passed!","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"generate-r-package-to-submit-to-cran","dir":"","previous_headings":"","what":"Generate R package to submit to CRAN","title":"Packaging checklist for CRAN release","text":"release candidate commit updated, rebase CRAN release branch commit. Pick commits made main since release commit needed fix CRAN-related submission issues identified steps. Remove badges README.md Run urlchecker::url_check() R directory Create PR entitled WIP: [R] Verify CRAN release-10.0.1-rc0. Add comment @github-actions crossbow submit --group r run R crossbow jobs CRAN-specific release branch. Run Rscript tools/update-checksums.R <libarrow version> download checksums pre-compiled binaries ASF artifactory tools directory. 
Regenerate arrow_X.X.X.tar.gz (.e., make build) Ensure linux binary packages available: - [ ] Ensure linux binaries available artifactory: https://apache.jfrog.io/ui/repos/tree/General/arrow/r","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"check-binary-arrow-c-distributions-specific-to-the-r-package","dir":"","previous_headings":"","what":"Check binary Arrow C++ distributions specific to the R package","title":"Packaging checklist for CRAN release","text":"Upload .tar.gz win-builder (r-devel ) confirm (Nic, automatically receive email results) check clean. step completed Jeroen put binaries MinGW repository, .e. , , . Upload .tar.gz MacBuilder confirm check clean Check install.packages(\"arrow_X.X.X.tar.gz\") Ubuntu ensure hosted binaries used devtools::check_built(\"arrow_X.X.X.tar.gz\") locally one time (luck)","code":""},{"path":"https://arrow.apache.org/docs/r/PACKAGING.html","id":"cran-submission","dir":"","previous_headings":"","what":"CRAN submission","title":"Packaging checklist for CRAN release","text":"Upload arrow_X.X.X.tar.gz CRAN submit page Confirm submission email Wait CRAN… - [ ] Accepted! - [ ] Tag tip CRAN-specific release branch r-universe-release - [ ] Add new line matrix backwards compatability job - [ ] (patch releases ) Update package version ci/scripts/PKGBUILD, dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb, r/DESCRIPTION, r/NEWS.md - [ ] (CRAN-releases) Rebuild news page pkgdown::build_news() submit PR asf-site branch docs site contents arrow/r/docs/news/index.html replacing current contents arrow-site/docs/r/news/index.html - [ ] (CRAN-releases) Bump version number r/pkgdown/assets/versions.json, update asf-site branch docs site . - [ ] Update packaging checklist template reflect new realities packaging process. - [ ] Wait CRAN-hosted binaries CRAN package page reflect new version - [ ] Tweet!","code":""},{"path":"https://arrow.apache.org/docs/r/STYLE.html","id":null,"dir":"","previous_headings":"","what":"Style","title":"Style","text":"style guide writing documentation arrow.","code":""},{"path":"https://arrow.apache.org/docs/r/STYLE.html","id":"coding-style","dir":"","previous_headings":"","what":"Coding style","title":"Style","text":"Please use tidyverse coding style.","code":""},{"path":"https://arrow.apache.org/docs/r/STYLE.html","id":"referring-to-external-packages","dir":"","previous_headings":"","what":"Referring to external packages","title":"Style","text":"referring external packages, include link package first mention, subsequently refer plain text, e.g. “arrow R package provides dplyr interface Arrow Datasets. article introduces Datasets shows use dplyr analyze .”","code":""},{"path":"https://arrow.apache.org/docs/r/STYLE.html","id":"data-frames","dir":"","previous_headings":"","what":"Data frames","title":"Style","text":"referring concept, use phrase “data frame”, whereas referring object class class important, write data.frame, e.g. “can call write_dataset() tabular data objects Arrow Tables RecordBatches, R data frames. working data frames might want use tibble instead data.frame take advantage default behaviour partitioning data based grouped variables.”","code":""},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"package-conventions","dir":"Articles","previous_headings":"","what":"Package conventions","title":"Get started with Arrow","text":"arrow R package builds top Arrow C++ library, C++ object oriented language. consequence, core logic Arrow C++ library encapsulated classes methods. 
arrow R package implemented R6 classes adopt “TitleCase” naming conventions. examples include: Two-dimensional, tabular data structures Table, RecordBatch, Dataset One-dimensional, vector-like data structures Array ChunkedArray Classes reading, writing, streaming data ParquetFileReader CsvTableReader low-level interface allows interact Arrow C++ library flexible way, many common situations may never need use , arrow also supplies high-level interface using functions follow “snake_case” naming convention. examples include: arrow_table() allows create Arrow tables without directly using Table object read_parquet() allows open Parquet files without directly using ParquetFileReader object examples used article rely high-level interface. developers interested learning package structure, see developer guide.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"tabular-data-in-arrow","dir":"Articles","previous_headings":"","what":"Tabular data in Arrow","title":"Get started with Arrow","text":"critical component Apache Arrow -memory columnar format, standardized, language-agnostic specification representing structured, table-like datasets -memory. arrow R package, Table class used store objects. Tables roughly analogous data frames similar behavior. arrow_table() function allows generate new Arrow Tables much way data.frame() used create new data frames: can use [ specify subsets Arrow Table way data frame: Along lines, $ operator can used extract named columns: Note output: individual columns Arrow Table represented Chunked Arrays, one-dimensional data structures Arrow roughly analogous vectors R. Tables primary way represent rectangular data -memory using Arrow, rectangular data structure used Arrow C++ library: also Datasets used data stored -disk rather -memory, Record Batches fundamental building blocks typically used data analysis. learn different data object classes arrow, see article data objects.","code":"library(arrow, warn.conflicts = FALSE) dat <- arrow_table(x = 1:3, y = c(\"a\", \"b\", \"c\")) dat ## Table ## 3 rows x 2 columns ## $x <int32> ## $y <string> dat[1:2, 1:2] ## Table ## 2 rows x 2 columns ## $x <int32> ## $y <string> dat$y ## ChunkedArray ## <string> ## [ ## [ ## \"a\", ## \"b\", ## \"c\" ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"converting-tables-to-data-frames","dir":"Articles","previous_headings":"","what":"Converting Tables to data frames","title":"Get started with Arrow","text":"Tables data structure used represent rectangular data within memory allocated Arrow C++ library, can coerced native R data frames (tibbles) using .data.frame() coercion takes place, columns original Arrow Table must converted native R data objects. dat Table, instance, dat$x stored Arrow data type int32 inherited C++, becomes R integer type .data.frame() called. possible exercise fine-grained control conversion process. learn different types converted, see data types article.","code":"as.data.frame(dat) ## x y ## 1 1 a ## 2 2 b ## 3 3 c"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"reading-and-writing-data","dir":"Articles","previous_headings":"","what":"Reading and writing data","title":"Get started with Arrow","text":"One main ways use arrow read write data files several common formats. arrow package supplies extremely fast CSV reading writing capabilities, addition supports data formats like Parquet Arrow (also called Feather) widely supported packages. 
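A minimal sketch of that fast CSV path (the temporary file path and use of mtcars are illustrative, not from the original article; read_csv_arrow() and write_csv_arrow() are the functions named above):

tf <- tempfile(fileext = ".csv")
write_csv_arrow(mtcars, tf)                        # multi-threaded CSV writer
cars <- read_csv_arrow(tf, as_data_frame = FALSE)  # as_data_frame = FALSE returns an Arrow Table rather than a tibble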
addition, arrow package supports multi-file data sets single rectangular data set stored across multiple files.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"individual-files","dir":"Articles","previous_headings":"Reading and writing data","what":"Individual files","title":"Get started with Arrow","text":"goal read single data file memory, several functions can use: read_parquet(): read file Parquet format read_feather(): read file Arrow/Feather format read_delim_arrow(): read delimited text file read_csv_arrow(): read comma-separated values (CSV) file read_tsv_arrow(): read tab-separated values (TSV) file read_json_arrow(): read JSON data file every case except JSON, corresponding write_*() function allows write data files appropriate format. default, read_*() functions return data frame tibble, can also use read data Arrow Table. , need set as_data_frame argument FALSE. example , take starwars data provided dplyr package write Parquet file using write_parquet() can use read_parquet() load data file. shown , default behavior return data frame (sw_frame) set as_data_frame = FALSE data read Arrow Table (sw_table): learn reading writing individual data files, see read/write article.","code":"library(dplyr, warn.conflicts = FALSE) file_path <- tempfile(fileext = \".parquet\") write_parquet(starwars, file_path) sw_frame <- read_parquet(file_path) sw_table <- read_parquet(file_path, as_data_frame = FALSE) sw_table ## Table ## 87 rows x 14 columns ## $name <string> ## $height <int32> ## $mass <double> ## $hair_color <string> ## $skin_color <string> ## $eye_color <string> ## $birth_year <double> ## $sex <string> ## $gender <string> ## $homeworld <string> ## $species <string> ## $films: list<element <string>> ## $vehicles: list<element <string>> ## $starships: list<element <string>>"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"multi-file-data-sets","dir":"Articles","previous_headings":"Reading and writing data","what":"Multi-file data sets","title":"Get started with Arrow","text":"tabular data set becomes large, often good practice partition data meaningful subsets store one separate file. Among things, means one subset data relevant analysis, one (smaller) file needs read. arrow package provides Dataset interface, convenient way read, write, analyze single data file larger--memory multi-file data sets. illustrate concepts, ’ll create nonsense data set 100000 rows can split 10 subsets: might like partition data write 10 separate Parquet files, one corresponding value subset column. first specify path folder write data files: can use group_by() function dplyr specify data partitioned using subset column, pass grouped data write_dataset(): creates set 10 files, one subset. files named according “hive partitioning” format shown : Parquet files can opened individually using read_parquet() often convenient – especially large data sets – scan folder “connect” data set without loading memory. can using open_dataset(): dset object store data -memory, metadata. However, discussed next section, possible analyze data referred dset loaded. 
learn Arrow Datasets, see dataset article.","code":"set.seed(1234) nrows <- 100000 random_data <- data.frame( x = rnorm(nrows), y = rnorm(nrows), subset = sample(10, nrows, replace = TRUE) ) dataset_path <- file.path(tempdir(), \"random_data\") random_data %>% group_by(subset) %>% write_dataset(dataset_path) list.files(dataset_path, recursive = TRUE) ## [1] \"subset=1/part-0.parquet\" \"subset=10/part-0.parquet\" ## [3] \"subset=2/part-0.parquet\" \"subset=3/part-0.parquet\" ## [5] \"subset=4/part-0.parquet\" \"subset=5/part-0.parquet\" ## [7] \"subset=6/part-0.parquet\" \"subset=7/part-0.parquet\" ## [9] \"subset=8/part-0.parquet\" \"subset=9/part-0.parquet\" dset <- open_dataset(dataset_path) dset ## FileSystemDataset with 10 Parquet files ## 3 columns ## x: double ## y: double ## subset: int32"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"analyzing-arrow-data-with-dplyr","dir":"Articles","previous_headings":"","what":"Analyzing Arrow data with dplyr","title":"Get started with Arrow","text":"Arrow Tables Datasets can analyzed using dplyr syntax. possible arrow R package supplies backend translates dplyr verbs commands understood Arrow C++ library, similarly translate R expressions appear within call dplyr verb. example, although dset Dataset data frame (store data values memory), can still pass dplyr pipeline like one shown : Notice call collect() end pipeline. actual computations performed collect() (related compute() function) called. “lazy evaluation” makes possible Arrow C++ compute engine optimize computations performed. learn analyzing Arrow data, see data wrangling article. list functions available dplyr queries page may also useful.","code":"dset %>% group_by(subset) %>% summarize(mean_x = mean(x), min_y = min(y)) %>% filter(mean_x > 0) %>% arrange(subset) %>% collect() ## # A tibble: 6 x 3 ## subset mean_x min_y ## <int> <dbl> <dbl> ## 1 2 0.00486 -4.00 ## 2 3 0.00440 -3.86 ## 3 4 0.0125 -3.65 ## 4 6 0.0234 -3.88 ## 5 7 0.00477 -4.65 ## 6 9 0.00557 -3.50"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"connecting-to-cloud-storage","dir":"Articles","previous_headings":"","what":"Connecting to cloud storage","title":"Get started with Arrow","text":"Another use arrow R package read, write, analyze data sets stored remotely cloud services. package currently supports Amazon Simple Storage Service (S3) Google Cloud Storage (GCS). example illustrates can use s3_bucket() refer S3 bucket, use open_dataset() connect data set stored : learn support cloud services arrow, see cloud storage article.","code":"bucket <- s3_bucket(\"voltrondata-labs-datasets/nyc-taxi\") nyc_taxi <- open_dataset(bucket)"},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"efficient-data-interchange-between-r-and-python","dir":"Articles","previous_headings":"","what":"Efficient data interchange between R and Python","title":"Get started with Arrow","text":"reticulate package provides interface allows call Python code R. arrow package designed interoperable reticulate. Python environment pyarrow library installed (Python equivalent arrow package), can pass Arrow Table R Python using r_to_py() function reticulate shown : sw_table_python object now stored pyarrow Table: Python equivalent Table class. can see print object: important recognize transfer takes place, C++ pointer (.e., metadata referring underlying data object stored Arrow C++ library) copied. data values place within memory. 
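A hedged sketch of that zero-copy hand-off (assumes reticulate is configured with a Python environment that has pyarrow installed; the round trip back to R via py_to_r() is an illustrative addition, not part of the original article):

library(reticulate)
library(arrow, warn.conflicts = FALSE)
tbl <- arrow_table(x = 1:3, y = c("a", "b", "c"))
py_tbl <- r_to_py(tbl)       # Python receives a pointer to the same Arrow C++ Table
tbl_back <- py_to_r(py_tbl)  # converts back to an R-side Table; the underlying buffers are not copied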
consequence much faster pass Arrow Table R Python copy data frame R Pandas DataFrame Python. learn passing Arrow data R Python, see article python integrations.","code":"library(reticulate) sw_table_python <- r_to_py(sw_table) sw_table_python ## pyarrow.Table ## name: string ## height: int32 ## mass: double ## hair_color: string ## skin_color: string ## eye_color: string ## birth_year: double ## sex: string ## gender: string ## homeworld: string ## species: string ## films: list<element: string> ## child 0, element: string ## vehicles: list<element: string> ## child 0, element: string ## starships: list<element: string> ## child 0, element: string ## ---- ## name: [[\"Luke Skywalker\",\"C-3PO\",\"R2-D2\",\"Darth Vader\",\"Leia Organa\",...,\"Finn\",\"Rey\",\"Poe Dameron\",\"BB8\",\"Captain Phasma\"]] ## height: [[172,167,96,202,150,...,null,null,null,null,null]] ## mass: [[77,75,32,136,49,...,null,null,null,null,null]] ## hair_color: [[\"blond\",null,null,\"none\",\"brown\",...,\"black\",\"brown\",\"brown\",\"none\",\"none\"]] ## skin_color: [[\"fair\",\"gold\",\"white, blue\",\"white\",\"light\",...,\"dark\",\"light\",\"light\",\"none\",\"none\"]] ## eye_color: [[\"blue\",\"yellow\",\"red\",\"yellow\",\"brown\",...,\"dark\",\"hazel\",\"brown\",\"black\",\"unknown\"]] ## birth_year: [[19,112,33,41.9,19,...,null,null,null,null,null]] ## sex: [[\"male\",\"none\",\"none\",\"male\",\"female\",...,\"male\",\"female\",\"male\",\"none\",\"female\"]] ## gender: [[\"masculine\",\"masculine\",\"masculine\",\"masculine\",\"feminine\",...,\"masculine\",\"feminine\",\"masculine\",\"masculine\",\"feminine\"]] ## homeworld: [[\"Tatooine\",\"Tatooine\",\"Naboo\",\"Tatooine\",\"Alderaan\",...,null,null,null,null,null]] ## ..."},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"access-to-arrow-messages-buffers-and-streams","dir":"Articles","previous_headings":"","what":"Access to Arrow messages, buffers, and streams","title":"Get started with Arrow","text":"arrow package also provides many lower-level bindings C++ library, enable access manipulate Arrow objects. can use build connectors applications services use Arrow. One example Spark: sparklyr package support using Arrow move data Spark, yielding significant performance gains.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/arrow.html","id":"contributing-to-arrow","dir":"Articles","previous_headings":"","what":"Contributing to arrow","title":"Get started with Arrow","text":"Apache Arrow extensive project spanning multiple languages, arrow R package one part large project. number special considerations developers like contribute package. help make process easier, several articles arrow documentation discuss topics relevant arrow developers, unlikely needed users. overview development process list related articles developers, see developer guide.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"scalars","dir":"Articles","previous_headings":"","what":"Scalars","title":"Data objects","text":"Scalar object simply single value can type. might integer, string, timestamp, different DataType objects Arrow supports. users arrow R package unlikely create Scalars directly, need can calling Scalar$create() method:","code":"Scalar$create(\"hello\") ## Scalar ## hello"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"arrays","dir":"Articles","previous_headings":"","what":"Arrays","title":"Data objects","text":"Array objects ordered sets Scalar values. 
Scalars users need create Arrays directly, need arises Array$create() method allows create new Arrays: Array can subset using square brackets shown : Arrays immutable objects: Array created modified extended.","code":"integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L)) integer_array ## Array ## <int32> ## [ ## 1, ## null, ## 2, ## 4, ## 8 ## ] string_array <- Array$create(c(\"hello\", \"amazing\", \"and\", \"cruel\", \"world\")) string_array ## Array ## <string> ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ] string_array[4:5] ## Array ## <string> ## [ ## \"cruel\", ## \"world\" ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"chunked-arrays","dir":"Articles","previous_headings":"","what":"Chunked Arrays","title":"Data objects","text":"practice, users arrow R package likely use Chunked Arrays rather simple Arrays. hood, Chunked Array collection one Arrays can indexed single Array. reasons Arrow provides functionality described data object layout article present purposes sufficient notice Chunked Arrays behave like Arrays regular data analysis. illustrate, let’s use chunked_array() function: chunked_array() function just wrapper around functionality ChunkedArray$create() provides. Let’s print object: double bracketing output intended highlight fact Chunked Arrays wrappers around one Arrays. However, although comprised multiple distinct Arrays, Chunked Array can indexed laid end--end single “vector-like” object. illustrated : can use chunked_string_array illustrate : important thing note “chunking” semantically meaningful. implementation detail : users never treat chunk meaningful unit. Writing data disk, example, often results data organized different chunks. Similarly, two Chunked Arrays contain values assigned different chunks deemed equivalent. illustrate can create Chunked Array contains four four values chunked_string_array[4:7], organized one chunk rather split two: Testing equality using == produces element-wise comparison, result new Chunked Array four (boolean type) true values: short, intention users interact Chunked Arrays ordinary one-dimensional data structures without ever think much underlying chunking arrangement. Chunked Arrays mutable, specific sense: Arrays can added removed Chunked Array.","code":"chunked_string_array <- chunked_array( string_array, c(\"I\", \"love\", \"you\") ) chunked_string_array ## ChunkedArray ## <string> ## [ ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ], ## [ ## \"I\", ## \"love\", ## \"you\" ## ] ## ] chunked_string_array[4:7] ## ChunkedArray ## <string> ## [ ## [ ## \"cruel\", ## \"world\" ## ], ## [ ## \"I\", ## \"love\" ## ] ## ] cruel_world <- chunked_array(c(\"cruel\", \"world\", \"I\", \"love\")) cruel_world ## ChunkedArray ## <string> ## [ ## [ ## \"cruel\", ## \"world\", ## \"I\", ## \"love\" ## ] ## ] cruel_world == chunked_string_array[4:7] ## ChunkedArray ## <bool> ## [ ## [ ## true, ## true, ## true, ## true ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"record-batches","dir":"Articles","previous_headings":"","what":"Record Batches","title":"Data objects","text":"Record Batch tabular data structure comprised named Arrays, accompanying Schema specifies name data type associated Array. Record Batches fundamental unit data interchange Arrow, typically used data analysis. Tables Datasets usually convenient analytic contexts. Arrays can different types must length. 
Array referred one “fields” “columns” Record Batch. can create Record Batch using record_batch() function using RecordBatch$create() method. functions flexible can accept inputs several formats: can pass data frame, one named vectors, input stream, even raw vector containing appropriate binary data. example: Record Batch containing 5 rows 3 columns, conceptual structure shown : arrow package supplies $ method Record Batch objects, used extract single column name: can use double brackets [[ refer columns position. rb$ints array second column Record Batch can extract : also [ method allows extract subsets record batch way data frame. command rb[1:3, 1:2] extracts first three rows first two columns: Record Batches concatenated: comprised Arrays, Arrays immutable objects, new rows added Record Batch created.","code":"rb <- record_batch( strs = string_array, ints = integer_array, dbls = c(1.1, 3.2, 0.2, NA, 11) ) rb ## RecordBatch ## 5 rows x 3 columns ## $strs <string> ## $ints <int32> ## $dbls <double> rb$strs ## Array ## <string> ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ] rb[[2]] ## Array ## <int32> ## [ ## 1, ## null, ## 2, ## 4, ## 8 ## ] rb[1:3, 1:2] ## RecordBatch ## 3 rows x 2 columns ## $strs <string> ## $ints <int32>"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"tables","dir":"Articles","previous_headings":"","what":"Tables","title":"Data objects","text":"Table comprised named Chunked Arrays, way Record Batch comprised named Arrays. Like Record Batches, Tables include explicit Schema specifying name data type Chunked Array. can subset Tables $, [[, [ way can Record Batches. Unlike Record Batches, Tables can concatenated (comprised Chunked Arrays). Suppose second Record Batch arrives: possible create Record Batch appends data new_rb data rb, without creating entirely new objects memory. Tables, however, can: now two fragments data set represented Tables. difference Table Record Batch columns represented Chunked Arrays. Array original Record Batch one chunk corresponding Chunked Array Table: ’s underlying data – indeed immutable Array referenced – just enclosed new, flexible Chunked Array wrapper. However, wrapper allows us concatenate Tables: resulting object shown schematically : Notice Chunked Arrays within new Table retain chunking structure, none original Arrays moved:","code":"new_rb <- record_batch( strs = c(\"I\", \"love\", \"you\"), ints = c(5L, 0L, 0L), dbls = c(7.1, -0.1, 2) ) df <- arrow_table(rb) new_df <- arrow_table(new_rb) rb$strs ## Array ## <string> ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ] df$strs ## ChunkedArray ## <string> ## [ ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ] ## ] concat_tables(df, new_df) ## Table ## 8 rows x 3 columns ## $strs <string> ## $ints <int32> ## $dbls <double> df_both <- concat_tables(df, new_df) df_both$strs ## ChunkedArray ## <string> ## [ ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ], ## [ ## \"I\", ## \"love\", ## \"you\" ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"datasets","dir":"Articles","previous_headings":"","what":"Datasets","title":"Data objects","text":"Like Record Batch Table objects, Dataset used represent tabular data. abstract level, Dataset can viewed object comprised rows columns, just like Record Batches Tables, contains explicit Schema specifies name data type associated column. 
However, Tables Record Batches data explicitly represented -memory, Dataset . Instead, Dataset abstraction refers data stored -disk one files. Values stored data files loaded memory batched process. Loading takes place needed, query executed data. respect Arrow Datasets different kind object Arrow Tables, dplyr commands used analyze essentially identical. section ’ll talk Datasets structured. want learn practical details analyzing Datasets, see article analyzing multi-file datasets.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"the-on-disk-data-files","dir":"Articles","previous_headings":"Datasets","what":"The on-disk data files","title":"Data objects","text":"Reduced simplest form, -disk structure Dataset simply collection data files, storing one subset data. subsets sometimes referred “fragments”, partitioning process sometimes referred “sharding”. convention, files organized folder structure called Hive-style partition: see hive_partition() details. illustrate works, let’s write multi-file dataset disk manually, without using Arrow Dataset functionality work. ’ll start three small data frames, contains one subset data want store: intention data frames stored separate data file. can see, quite structured partitioning: data subset = \"\" belong one file, data subset = \"b\" belong another file, data subset = \"c\" belong third file. first step define create folder hold files: next step manually create Hive-style folder structure: Notice named folder “key=value” format exactly describes subset data written folder. naming structure essence Hive-style partitions. Now folders, ’ll use write_parquet() create single parquet file three subsets: wanted , subdivided dataset. folder contain multiple files (part-0.parquet, part-1.parquet, etc) wanted . Similarly, particular reason name files part-0.parquet way : fine call files subset-.parquet, subset-b.parquet, subset-c.parquet wished. written file formats wanted, don’t necessarily use Hive-style folders. can learn supported formats reading help documentation open_dataset(), learn exercise fine-grained control help(\"Dataset\", package = \"arrow\"). case, created -disk parquet Dataset using Hive-style partitioning. Dataset defined files: verify everything worked, let’s open data open_dataset() call glimpse() inspect contents: can see, ds Dataset object aggregates three separate data files. fact, particular case Dataset small values three files appear output glimpse(). noted everyday data analysis work, wouldn’t need write data files manually fashion. example entirely illustrative purposes. 
exact dataset created following command: fact, even ds happens refer data source larger memory, command still work Dataset functionality written ensure pipeline data loaded piecewise order avoid exhausting memory.","code":"df_a <- data.frame(id = 1:5, value = rnorm(5), subset = \"a\") df_b <- data.frame(id = 6:10, value = rnorm(5), subset = \"b\") df_c <- data.frame(id = 11:15, value = rnorm(5), subset = \"c\") ds_dir <- \"mini-dataset\" dir.create(ds_dir) ds_dir_a <- file.path(ds_dir, \"subset=a\") ds_dir_b <- file.path(ds_dir, \"subset=b\") ds_dir_c <- file.path(ds_dir, \"subset=c\") dir.create(ds_dir_a) dir.create(ds_dir_b) dir.create(ds_dir_c) write_parquet(df_a, file.path(ds_dir_a, \"part-0.parquet\")) write_parquet(df_b, file.path(ds_dir_b, \"part-0.parquet\")) write_parquet(df_c, file.path(ds_dir_c, \"part-0.parquet\")) list.files(ds_dir, recursive = TRUE) ## [1] \"subset=a/part-0.parquet\" \"subset=b/part-0.parquet\" ## [3] \"subset=c/part-0.parquet\" ds <- open_dataset(ds_dir) glimpse(ds) ## FileSystemDataset with 3 Parquet files ## 15 rows x 3 columns ## $ id <int32> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ## $ value <double> -1.400043517, 0.255317055, -2.437263611, -0.005571287, 0.62155~ ## $ subset <string> \"a\", \"a\", \"a\", \"a\", \"a\", \"b\", \"b\", \"b\", \"b\", \"b\", \"c\", \"c\", \"c~ ## Call `print()` for full schema details ds |> group_by(subset) |> write_dataset(\"mini-dataset\")"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"the-dataset-object","dir":"Articles","previous_headings":"Datasets","what":"The Dataset object","title":"Data objects","text":"previous section examined -disk structure Dataset. now turn -memory structure Dataset object (.e., ds previous example). Dataset object created, arrow searches dataset folder looking appropriate files, load contents files. Paths files stored active binding ds$files: thing happens open_dataset() called explicit Schema Dataset constructed stored ds$schema: default Schema inferred inspecting first file , though possible construct unified schema inspecting files. , set unify_schemas = TRUE calling open_dataset(). also possible use schema argument open_dataset() specify Schema explicitly (see schema() function details). act reading data performed Scanner object. analyzing Dataset using dplyr interface never need construct Scanner manually, explanatory purposes ’ll : Calling ToTable() method materialize Dataset (-disk) Table (-memory): scanning process multi-threaded default, necessary threading can disabled setting use_threads = FALSE calling Scanner$create().","code":"ds$files ## [1] \"/arrow/r/vignettes/mini-dataset/subset=a/part-0.parquet\" ## [2] \"/arrow/r/vignettes/mini-dataset/subset=b/part-0.parquet\" ## [3] \"/arrow/r/vignettes/mini-dataset/subset=c/part-0.parquet\" ds$schema ## Schema ## id: int32 ## value: double ## subset: string ## ## See $metadata for additional Schema metadata scan <- Scanner$create(dataset = ds) scan$ToTable() ## Table ## 15 rows x 3 columns ## $id <int32> ## $value <double> ## $subset <string> ## ## See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"querying-a-dataset","dir":"Articles","previous_headings":"Datasets","what":"Querying a Dataset","title":"Data objects","text":"query executed Dataset new scan initiated results pulled back R. 
example, consider following dplyr expression: can replicate using low-level Dataset interface creating new scan specifying filter projection arguments Scanner$create(). use arguments need know little Arrow Expressions, may find helpful read help documentation help(\"Expression\", package = \"arrow\"). scanner defined mimics dplyr pipeline shown , call .data.frame(scan$ToTable()) produce result dplyr version, though rows may appear order. get better sense happens query executes, ’ll call scan$ScanBatches(). Much like ToTable() method, ScanBatches() method executes query separately files, returns list Record Batches, one file. addition, ’ll convert Record Batches data frames individually: return dplyr query made earlier, use compute() return Table rather use collect() return data frame, can see evidence process work. Table object created concatenating three Record Batches produced query executes three data files, consequence Chunked Array defines column Table mirrors partitioning structure present data files:","code":"ds |> filter(value > 0) |> mutate(new_value = round(100 * value)) |> select(id, subset, new_value) |> collect() ## # A tibble: 6 x 3 ## id subset new_value ## <int> <chr> <dbl> ## 1 2 a 26 ## 2 5 a 62 ## 3 6 b 115 ## 4 12 c 63 ## 5 13 c 207 ## 6 15 c 51 scan <- Scanner$create( dataset = ds, filter = Expression$field_ref(\"value\") > 0, projection = list( id = Expression$field_ref(\"id\"), subset = Expression$field_ref(\"subset\"), new_value = Expression$create(\"round\", 100 * Expression$field_ref(\"value\")) ) ) lapply(scan$ScanBatches(), as.data.frame) ## [[1]] ## id subset new_value ## 1 2 a 26 ## 2 5 a 62 ## ## [[2]] ## id subset new_value ## 1 6 b 115 ## ## [[3]] ## id subset new_value ## 1 12 c 63 ## 2 13 c 207 ## 3 15 c 51 tbl <- ds |> filter(value > 0) |> mutate(new_value = round(100 * value)) |> select(id, subset, new_value) |> compute() tbl$subset ## ChunkedArray ## <string> ## [ ## [ ## \"a\", ## \"a\" ## ], ## [ ## \"b\" ## ], ## [ ## \"c\", ## \"c\", ## \"c\" ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"additional-notes","dir":"Articles","previous_headings":"Datasets","what":"Additional notes","title":"Data objects","text":"distinction ignored previous discussion FileSystemDataset InMemoryDataset objects. usual case, data comprise Dataset stored files -disk. , , primary advantage Datasets Tables. However, cases may useful make Dataset data already stored -memory. cases object created type InMemoryDataset. previous discussion assumes files stored Dataset Schema. usual case true, file conceptually subset single rectangular table. strictly required. information topics, see help(\"Dataset\", package = \"arrow\").","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_objects.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Data objects","text":"learn internal structure Arrays, see article data object layout. learn different data types used Arrow, see article data types. 
learn Arrow objects implemented, see Arrow specification page.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"motivating-example","dir":"Articles","previous_headings":"","what":"Motivating example","title":"Data types","text":"illustrate conversion needs take place, consider differences output obtain use dplyr::glimpse() inspect starwars data original format – data frame R – output obtain convert Arrow Table first calling arrow_table(): data represented essentially , descriptions data types columns changed. example: name labelled <chr> (character vector) data frame; labelled <string> (string type, also referred utf8 type) Arrow Table height labelled <int> (integer vector) data frame; labelled <int32> (32 bit signed integer) Arrow Table mass labelled <dbl> (numeric vector) data frame; labelled <double> (64 bit floating point number) Arrow Table differences purely cosmetic: integers R fact 32 bit signed integers, underlying data types Arrow R direct analogs one another. cases differences purely implementation: Arrow R different ways store vector strings, high level abstraction R character type Arrow string type can viewed direct analogs. cases, however, clear analogs: Arrow analog POSIXct (timestamp type) analog POSIXlt; conversely, R can represent 32 bit signed integers, equivalent 64 bit unsigned integer. arrow package converts R data Arrow data, first check see Schema provided – see schema() information – none available attempt guess appropriate type following default mappings. complete listing mappings provided end article, common cases depicted illustration : image, black boxes refer R data types light blue boxes refer Arrow data types. Directional arrows specify conversions (e.g., bidirectional arrow logical R type boolean Arrow type means logical R converts Arrow boolean vice versa). 
Solid lines indicate conversion rule always default; dashed lines mean sometimes applies (rules special cases described ).","code":"library(dplyr, warn.conflicts = FALSE) library(arrow, warn.conflicts = FALSE) glimpse(starwars) ## Rows: 87 ## Columns: 14 ## $ name <chr> \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or~ ## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~ ## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~ ## $ hair_color <chr> \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N~ ## $ skin_color <chr> \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"~ ## $ eye_color <chr> \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",~ ## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~ ## $ sex <chr> \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",~ ## $ gender <chr> \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini~ ## $ homeworld <chr> \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T~ ## $ species <chr> \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma~ ## $ films <list> <\"A New Hope\", \"The Empire Strikes Back\", \"Return of the J~ ## $ vehicles <list> <\"Snowspeeder\", \"Imperial Speeder Bike\">, <>, <>, <>, \"Imp~ ## $ starships <list> <\"X-wing\", \"Imperial shuttle\">, <>, <>, \"TIE Advanced x1\",~ glimpse(arrow_table(starwars)) ## Table ## 87 rows x 14 columns ## $ name <string> \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia~ ## $ height <int32> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180~ ## $ mass <double> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ~ ## $ hair_color <string> \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\"~ ## $ skin_color <string> \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\"~ ## $ eye_color <string> \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blu~ ## $ birth_year <double> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.~ ## $ sex <string> \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"femal~ ## $ gender <string> \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"fem~ ## $ homeworld <string> \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\",~ ## $ species <string> \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"H~ ## $ films <list<...>> <\"A New Hope\", \"The Empire Strikes Back\", \"Return of the~ ## $ vehicles <list<...>> <\"Snowspeeder\", \"Imperial Speeder Bike\">, <>, <>, <>, \"I~ ## $ starships <list<...>> <\"X-wing\", \"Imperial shuttle\">, <>, <>, \"TIE Advanced x1~ ## Call `print()` for full schema details"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"logicalboolean-types","dir":"Articles","previous_headings":"","what":"Logical/boolean types","title":"Data types","text":"Arrow R use three-valued logic. R, logical values can TRUE FALSE, NA used represent missing data. Arrow, corresponding boolean type can take values true, false, null, shown : strictly necessary set type = boolean() example default behavior arrow translate R logical vectors Arrow booleans vice versa. However, sake clarity specify data types explicitly throughout article. 
likewise use chunked_array() create Arrow data R objects .vector() create R data Arrow objects, similar results obtained use methods.","code":"chunked_array(c(TRUE, FALSE, NA), type = boolean()) # default ## ChunkedArray ## <bool> ## [ ## [ ## true, ## false, ## null ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"integer-types","dir":"Articles","previous_headings":"","what":"Integer types","title":"Data types","text":"Base R natively supports one type integer, using 32 bits represent signed numbers -2147483648 2147483647, though R can also support 64 bit integers via bit64 package. Arrow inherits signed unsigned integer types C++ 8 bit, 16 bit, 32 bit, 64 bit versions: default, arrow translates R integers int32 type Arrow, can override explicitly specifying another integer type: value R fall within permissible range corresponding Arrow type, arrow throws error: translating Arrow R, integer types alway translate R integers unless one following exceptions applies: value Arrow uint32 uint64 falls outside range allowed R integers, result numeric vector R value Arrow int64 variable falls outside range allowed R integers, result bit64::integer64 vector R user sets options(arrow.int64_downcast = FALSE), Arrow int64 type always yields bit64::integer64 vector R regardless value","code":"chunked_array(c(10L, 3L, 200L), type = int32()) # default ## ChunkedArray ## <int32> ## [ ## [ ## 10, ## 3, ## 200 ## ] ## ] chunked_array(c(10L, 3L, 200L), type = int64()) ## ChunkedArray ## <int64> ## [ ## [ ## 10, ## 3, ## 200 ## ] ## ] chunked_array(c(10L, 3L, 200L), type = int8()) ## Error: Invalid: value outside of range"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"floating-point-numeric-types","dir":"Articles","previous_headings":"","what":"Floating point numeric types","title":"Data types","text":"R one double-precision (64 bit) numeric type, translates Arrow 64 bit floating point type default. Arrow supports single-precision (32 bit) double-precision (64 bit) floating point numbers, specified using float32() float64() data type functions. translated doubles R. Examples shown : Note Arrow specification also permits half-precision (16 bit) floating point numbers, yet implemented.","code":"chunked_array(c(0.1, 0.2, 0.3), type = float64()) # default ## ChunkedArray ## <double> ## [ ## [ ## 0.1, ## 0.2, ## 0.3 ## ] ## ] chunked_array(c(0.1, 0.2, 0.3), type = float32()) ## ChunkedArray ## <float> ## [ ## [ ## 0.1, ## 0.2, ## 0.3 ## ] ## ] arrow_double <- chunked_array(c(0.1, 0.2, 0.3), type = float64()) as.vector(arrow_double) ## [1] 0.1 0.2 0.3"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"fixed-point-decimal-types","dir":"Articles","previous_headings":"","what":"Fixed point decimal types","title":"Data types","text":"Arrow also contains decimal() data types, numeric values specified decimal format rather binary. Decimals Arrow come two varieties, 128 bit version 256 bit version, cases users able use general decimal() data type function rather specific decimal128() decimal256() functions. decimal types Arrow fixed-precision numbers (rather floating-point), means necessary explicitly specify precision scale arguments: precision specifies number significant digits store. scale specifies number digits stored decimal point. set scale = 2, exactly two digits stored decimal point. set scale = 0, values rounded nearest whole number. 
Negative scales also permitted (handy dealing extremely large numbers), scale = -2 stores value nearest 100. R way create decimal types natively, example little circuitous. First create floating point numbers Chunked Arrays, explicitly cast decimal types within Arrow. possible Chunked Array objects possess cast() method: Though natively used R, decimal types can useful situations especially important avoid problems arise floating point arithmetic.","code":"arrow_floating <- chunked_array(c(.01, .1, 1, 10, 100)) arrow_decimals <- arrow_floating$cast(decimal(precision = 5, scale = 2)) arrow_decimals ## ChunkedArray ## <decimal128(5, 2)> ## [ ## [ ## 0.01, ## 0.10, ## 1.00, ## 10.00, ## 100.00 ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"stringcharacter-types","dir":"Articles","previous_headings":"","what":"String/character types","title":"Data types","text":"R uses single character type represent strings whereas Arrow two types. Arrow C++ library types referred strings large_strings, avoid ambiguity arrow R package defined using utf8() large_utf8() data type functions. distinction two Arrow types unlikely important R users, though difference discussed article data object layout. default behavior translate R character vectors utf8/string type, translate Arrow types R character vectors:","code":"strings <- chunked_array(c(\"oh\", \"well\", \"whatever\")) strings ## ChunkedArray ## <string> ## [ ## [ ## \"oh\", ## \"well\", ## \"whatever\" ## ] ## ] as.vector(strings) ## [1] \"oh\" \"well\" \"whatever\""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"factordictionary-types","dir":"Articles","previous_headings":"","what":"Factor/dictionary types","title":"Data types","text":"analog R factors Arrow dictionary type. Factors translate dictionaries vice versa. illustrate , let’s create small factor object R: translated Arrow, dictionary results: translated back R, recover original factor: Arrow dictionaries slightly flexible R factors: values dictionary necessarily strings, labels factor . consequence, non-string values Arrow dictionary coerced strings translated R.","code":"fct <- factor(c(\"cat\", \"dog\", \"pig\", \"dog\")) fct ## [1] cat dog pig dog ## Levels: cat dog pig dict <- chunked_array(fct, type = dictionary()) dict ## ChunkedArray ## <dictionary<values=string, indices=int32>> ## [ ## ## -- dictionary: ## [ ## \"cat\", ## \"dog\", ## \"pig\" ## ] ## -- indices: ## [ ## 0, ## 1, ## 2, ## 1 ## ] ## ] as.vector(dict) ## [1] cat dog pig dog ## Levels: cat dog pig"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"date-types","dir":"Articles","previous_headings":"","what":"Date types","title":"Data types","text":"R, dates typically represented using Date class. Internally Date object numeric type whose value counts number days since beginning Unix epoch (1 January 1970). Arrow supplies two data types can used represent dates: date32 type date64 type. date32 type similar Date class R: internally stores 32 bit integer counts number days since 1 January 1970. default arrow translate R Date objects Arrow date32 types: Arrow also supplies higher-precision date64 type, date represented 64 bit integer encodes number milliseconds since 1970-01-01 00:00 UTC: translation Arrow R differs. 
Internally date32 type similar R Date, objects translated R Dates: However, date64 types specified millisecond-level precision, translated R POSIXct times avoid possibility losing relevant information:","code":"nirvana_album_dates <- as.Date(c(\"1989-06-15\", \"1991-09-24\", \"1993-09-13\")) nirvana_album_dates ## [1] \"1989-06-15\" \"1991-09-24\" \"1993-09-13\" nirvana_32 <- chunked_array(nirvana_album_dates, type = date32()) # default nirvana_32 ## ChunkedArray ## <date32[day]> ## [ ## [ ## 1989-06-15, ## 1991-09-24, ## 1993-09-13 ## ] ## ] nirvana_64 <- chunked_array(nirvana_album_dates, type = date64()) nirvana_64 ## ChunkedArray ## <date64[ms]> ## [ ## [ ## 1989-06-15, ## 1991-09-24, ## 1993-09-13 ## ] ## ] class(as.vector(nirvana_32)) ## [1] \"Date\" class(as.vector(nirvana_64)) ## [1] \"POSIXct\" \"POSIXt\""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"temporaltimestamp-types","dir":"Articles","previous_headings":"","what":"Temporal/timestamp types","title":"Data types","text":"R two classes used represent date time information, POSIXct POSIXlt. Arrow one: timestamp type. Arrow timestamps loosely analogous POSIXct class. Internally, POSIXct object represents date numeric variable stores number seconds since 1970-01-01 00:00 UTC. Internally, Arrow timestamp 64 bit integer counting number milliseconds since 1970-01-01 00:00 UTC. Arrow R support timezone information, display differently printed object. R, local time printed timezone name adjacent : translated Arrow, POSIXct object becomes Arrow timestamp object. printed, however, temporal instant always displayed UTC rather local time: timezone information lost, however, can easily see translating sydney_newyear_arrow object back R POSIXct object: POSIXlt objects behaviour different. Internally POSIXlt object list specifying “local time” terms variety human-relevant fields. analogous class Arrow, default behaviour translate Arrow list.","code":"sydney_newyear <- as.POSIXct(\"2000-01-01 00:01\", tz = \"Australia/Sydney\") sydney_newyear ## [1] \"2000-01-01 00:01:00 AEDT\" sydney_newyear_arrow <- chunked_array(sydney_newyear, type = timestamp()) sydney_newyear_arrow ## ChunkedArray ## <timestamp[s]> ## [ ## [ ## 1999-12-31 13:01:00 ## ] ## ] as.vector(sydney_newyear_arrow) ## [1] \"1999-12-31 13:01:00 UTC\""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"time-of-day-types","dir":"Articles","previous_headings":"","what":"Time of day types","title":"Data types","text":"Base R class represent time day independent date (.e., possible specify “3pm” without referring specific day), can done help hms package. Internally, hms objects always stored number seconds since 00:00:00. Arrow two data types purposes. time32 types, data stored 32 bit integer interpreted either number seconds number milliseconds since 00:00:00. Note difference following: time64 object similar, stores time day using 64 bit integer can represent time higher precision. 
possible choose microseconds (unit = \"us\") nanoseconds (unit = \"ns\"), shown : versions time32 time64 objects Arrow translate hms times R.","code":"time_of_day <- hms::hms(56, 34, 12) chunked_array(time_of_day, type = time32(unit = \"s\")) ## ChunkedArray ## <time32[s]> ## [ ## [ ## 12:34:56 ## ] ## ] chunked_array(time_of_day, type = time32(unit = \"ms\")) ## ChunkedArray ## <time32[ms]> ## [ ## [ ## 12:34:56.000 ## ] ## ] chunked_array(time_of_day, type = time64(unit = \"us\")) ## ChunkedArray ## <time64[us]> ## [ ## [ ## 12:34:56.000000 ## ] ## ] chunked_array(time_of_day, type = time64(unit = \"ns\")) ## ChunkedArray ## <time64[ns]> ## [ ## [ ## 12:34:56.000000000 ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"duration-types","dir":"Articles","previous_headings":"","what":"Duration types","title":"Data types","text":"Lengths time represented difftime objects R. analogous data type Arrow duration type. duration type stored 64 bit integer, can represent number seconds (default, unit = \"s\"), milliseconds (unit = \"ms\"), microseconds (unit = \"us\"), nanoseconds (unit = \"ns\"). illustrate ’ll create difftime R corresponding 278 seconds: translation Arrow looks like : Regardless underlying unit, duration objects Arrow translate difftime objects R.","code":"len <- as.difftime(278, unit = \"secs\") len ## Time difference of 278 secs chunked_array(len, type = duration(unit = \"s\")) # default ## ChunkedArray ## <duration[s]> ## [ ## [ ## 278 ## ] ## ] chunked_array(len, type = duration(unit = \"ns\")) ## ChunkedArray ## <duration[ns]> ## [ ## [ ## 278000000000 ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"list-of-default-translations","dir":"Articles","previous_headings":"","what":"List of default translations","title":"Data types","text":"discussion covers common cases. two tables section provide complete list arrow translates R data types Arrow data types. table, entries - currently implemented.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"translations-from-r-to-arrow","dir":"Articles","previous_headings":"List of default translations","what":"Translations from R to Arrow","title":"Data types","text":"1: float64 double concept data type Arrow C++; however, float64() used arrow function double() already exists base R 2: character vector exceeds 2GB strings, converted large_utf8 Arrow type 3: lists elements type able translated Arrow list type (“list ” type).","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"translations-from-arrow-to-r","dir":"Articles","previous_headings":"List of default translations","what":"Translations from Arrow to R","title":"Data types","text":"1: integer types may contain values exceed range R’s integer type (32 bit signed integer). , uint32 uint64 converted double (“numeric”) int64 converted bit64::integer64. conversion can disabled (int64 always yields bit64::integer64 vector) setting options(arrow.int64_downcast = FALSE). 2: Arrow data types currently R equivalent raise error cast mapped via schema. 3: arrow*_binary classes implemented lists raw vectors. 4: Due limitation R factors, Arrow dictionary values coerced string translated R already strings. 
5: arrow*_list classes implemented subclasses vctrs_list_of ptype attribute set empty Array value type converts .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_types.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Data types","text":"learn data types specified schema() metadata, see metadata article. additional details data types, see data types article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"one-table-dplyr-verbs","dir":"Articles","previous_headings":"","what":"One-table dplyr verbs","title":"Data analysis with dplyr syntax","text":"arrow package provides support dplyr one-table verbs, allowing users construct data analysis pipelines familiar way. example shows use filter(), rename(), mutate(), arrange() select(): important note arrow uses lazy evaluation delay computation result explicitly requested. speeds processing enabling Arrow C++ library perform multiple computations one operation. consequence design choice, yet performed computations sw data. result variable object class arrow_dplyr_query represents computations performed: perform computations materialize result, call compute() collect(). difference two determines kind object returned. Calling compute() returns Arrow Table, suitable passing arrow dplyr functions: contrast, collect() returns R data frame, suitable viewing passing R functions analysis visualization: arrow package broad support single-table dplyr verbs, including compute aggregates. example, supports group_by() summarize(), well commonly-used convenience functions count(): Note, however, window functions ntile() yet supported.","code":"result <- sw %>% filter(homeworld == \"Tatooine\") %>% rename(height_cm = height, mass_kg = mass) %>% mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>% arrange(desc(birth_year)) %>% select(name, height_in, mass_lbs) result ## Table (query) ## name: string ## height_in: double (divide(cast(height, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2.54, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}))) ## mass_lbs: double (multiply_checked(mass, 2.2046)) ## ## * Filter: (homeworld == \"Tatooine\") ## * Sorted by birth_year [desc] ## See $.data for the source Arrow object compute(result) ## Table ## 10 rows x 3 columns ## $name <string> ## $height_in <double> ## $mass_lbs <double> collect(result) ## # A tibble: 10 x 3 ## name height_in mass_lbs ## <chr> <dbl> <dbl> ## 1 C-3PO 65.7 165. ## 2 Cliegg Lars 72.0 NA ## 3 Shmi Skywalker 64.2 NA ## 4 Owen Lars 70.1 265. ## 5 Beru Whitesun Lars 65.0 165. ## 6 Darth Vader 79.5 300. ## 7 Anakin Skywalker 74.0 185. ## 8 Biggs Darklighter 72.0 185. ## 9 Luke Skywalker 67.7 170. ## 10 R5-D4 38.2 70.5 sw %>% group_by(species) %>% summarize(mean_height = mean(height, na.rm = TRUE)) %>% collect() ## # A tibble: 38 x 2 ## species mean_height ## <chr> <dbl> ## 1 Human 178 ## 2 Droid 131. 
## 3 Wookiee 231 ## 4 Rodian 173 ## 5 Hutt 175 ## 6 NA 175 ## 7 Yoda's species 66 ## 8 Trandoshan 190 ## 9 Mon Calamari 180 ## 10 Ewok 88 ## # i 28 more rows sw %>% count(gender) %>% collect() ## # A tibble: 3 x 2 ## gender n ## <chr> <int> ## 1 masculine 66 ## 2 feminine 17 ## 3 NA 4"},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"two-table-dplyr-verbs","dir":"Articles","previous_headings":"","what":"Two-table dplyr verbs","title":"Data analysis with dplyr syntax","text":"Equality joins (e.g. left_join(), inner_join()) supported joining multiple tables. illustrated :","code":"jedi <- data.frame( name = c(\"C-3PO\", \"Luke Skywalker\", \"Obi-Wan Kenobi\"), jedi = c(FALSE, TRUE, TRUE) ) sw %>% select(1:3) %>% right_join(jedi) %>% collect() ## # A tibble: 3 x 4 ## name height mass jedi ## <chr> <int> <dbl> <lgl> ## 1 Luke Skywalker 172 77 TRUE ## 2 C-3PO 167 75 FALSE ## 3 Obi-Wan Kenobi 182 77 TRUE"},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"expressions-within-dplyr-verbs","dir":"Articles","previous_headings":"","what":"Expressions within dplyr verbs","title":"Data analysis with dplyr syntax","text":"Inside dplyr verbs, Arrow offers support many functions operators, common functions mapped base R tidyverse equivalents: can find list supported functions within dplyr queries function documentation. additional functions like see implemented, please file issue described Getting help guidelines.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"registering-custom-bindings","dir":"Articles","previous_headings":"","what":"Registering custom bindings","title":"Data analysis with dplyr syntax","text":"arrow package makes possible users supply bindings custom functions situations using register_scalar_function(). operate correctly, --registered function must context first argument, required query engine. example, suppose wanted implement function converts string snake case (greatly simplified version janitor::make_clean_names()). function written follows: call within arrow/dplyr pipeline, needs registered: expression, name argument specifies name recognized context arrow/dplyr pipeline fun function . in_type out_type arguments used specify expected data type input output, auto_convert specifies whether arrow automatically convert R inputs Arrow equivalents. 
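The section on expressions above notes that many base R and tidyverse functions are mapped to Arrow equivalents inside dplyr verbs. A small sketch, assuming sw is the same Arrow Table used throughout this article; toupper() and nchar() are examples of mapped functions, and the column choices are illustrative.

library(dplyr)
sw %>%
  mutate(name_upper = toupper(name), name_len = nchar(name)) %>%
  select(name, name_upper, name_len) %>%
  collect()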
registered, following works: learn , see help(\"register_scalar_function\", package = \"arrow\").","code":"to_snake_name <- function(context, string) { replace <- c(`'` = \"\", `\"` = \"\", `-` = \"\", `\\\\.` = \"_\", ` ` = \"_\") string %>% stringr::str_replace_all(replace) %>% stringr::str_to_lower() %>% stringi::stri_trans_general(id = \"Latin-ASCII\") } register_scalar_function( name = \"to_snake_name\", fun = to_snake_name, in_type = utf8(), out_type = utf8(), auto_convert = TRUE ) sw %>% mutate(name, snake_name = to_snake_name(name), .keep = \"none\") %>% collect() ## # A tibble: 87 x 2 ## name snake_name ## <chr> <chr> ## 1 Luke Skywalker luke_skywalker ## 2 C-3PO c3po ## 3 R2-D2 r2d2 ## 4 Darth Vader darth_vader ## 5 Leia Organa leia_organa ## 6 Owen Lars owen_lars ## 7 Beru Whitesun Lars beru_whitesun_lars ## 8 R5-D4 r5d4 ## 9 Biggs Darklighter biggs_darklighter ## 10 Obi-Wan Kenobi obiwan_kenobi ## # i 77 more rows"},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"handling-unsupported-expressions","dir":"Articles","previous_headings":"","what":"Handling unsupported expressions","title":"Data analysis with dplyr syntax","text":"dplyr queries Table objects, held memory usually representable data frames, arrow package detects unimplemented function within dplyr verb, automatically calls collect() return data R data frame processing dplyr verb. example, neither lm() residuals() implemented, write code computes residuals linear regression model, automatic collection takes place: queries Dataset objects – can larger memory – arrow conservative always raises error detects unsupported expression. illustrate behavior, can write starwars data disk open Dataset. use pipeline Dataset, obtain error: Calling collect() middle pipeline fixes issue: operations, can use DuckDB. supports Arrow natively, can pass Dataset query object DuckDB without paying performance penalty using helper function to_duckdb() pass object back Arrow to_arrow():","code":"sw %>% filter(!is.na(height), !is.na(mass)) %>% transmute(name, height, mass, res = residuals(lm(mass ~ height))) ## Warning: Expression residuals(lm(mass ~ height)) not supported in Arrow; ## pulling data into R ## # A tibble: 59 x 4 ## name height mass res ## <chr> <int> <dbl> <dbl> ## 1 Luke Skywalker 172 77 -18.8 ## 2 C-3PO 167 75 -17.7 ## 3 R2-D2 96 32 -16.4 ## 4 Darth Vader 202 136 21.4 ## 5 Leia Organa 150 49 -33.1 ## 6 Owen Lars 178 120 20.4 ## 7 Beru Whitesun Lars 165 75 -16.5 ## 8 R5-D4 97 32 -17.0 ## 9 Biggs Darklighter 183 84 -18.7 ## 10 Obi-Wan Kenobi 182 77 -25.1 ## # i 49 more rows # write and open starwars dataset dataset_path <- tempfile() write_dataset(starwars, dataset_path) sw2 <- open_dataset(dataset_path) # dplyr pipeline with unsupported expressions sw2 %>% filter(!is.na(height), !is.na(mass)) %>% transmute(name, height, mass, res = residuals(lm(mass ~ height))) ## Error: Expression residuals(lm(mass ~ height)) not supported in Arrow ## Call collect() first to pull data into R. 
sw2 %>% filter(!is.na(height), !is.na(mass)) %>% collect() %>% transmute(name, height, mass, res = residuals(lm(mass ~ height))) ## # A tibble: 59 x 4 ## name height mass res ## <chr> <int> <dbl> <dbl> ## 1 Luke Skywalker 172 77 -18.8 ## 2 C-3PO 167 75 -17.7 ## 3 R2-D2 96 32 -16.4 ## 4 Darth Vader 202 136 21.4 ## 5 Leia Organa 150 49 -33.1 ## 6 Owen Lars 178 120 20.4 ## 7 Beru Whitesun Lars 165 75 -16.5 ## 8 R5-D4 97 32 -17.0 ## 9 Biggs Darklighter 183 84 -18.7 ## 10 Obi-Wan Kenobi 182 77 -25.1 ## # i 49 more rows sw %>% select(1:4) %>% filter(!is.na(hair_color)) %>% to_duckdb() %>% group_by(hair_color) %>% filter(height < mean(height, na.rm = TRUE)) %>% to_arrow() %>% # perform other arrow operations... collect() ## # A tibble: 28 x 4 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Yoda 66 17 white ## 2 Watto 137 NA black ## 3 Shmi Skywalker 163 NA black ## 4 Eeth Koth 171 NA black ## 5 Luminara Unduli 170 56.2 black ## 6 Barriss Offee 166 50 black ## 7 R4-P17 96 NA none ## 8 Lobot 175 79 none ## 9 Ackbar 180 83 none ## 10 Nien Nunb 160 68 none ## # i 18 more rows"},{"path":"https://arrow.apache.org/docs/r/articles/data_wrangling.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Data analysis with dplyr syntax","text":"learn multi-file datasets, see dataset article. learn user-registered functions, see help(\"register_scalar_function\", package = \"arrow\"). learn writing dplyr bindings arrow developer, see article writing bindings.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"example-nyc-taxi-data","dir":"Articles","previous_headings":"","what":"Example: NYC taxi data","title":"Working with multi-file data sets","text":"primary motivation Arrow’s Datasets object allow users analyze extremely large datasets. example, consider New York City taxi trip record data widely used big data exercises competitions. demonstrate capabilities Apache Arrow host Parquet-formatted version data public Amazon S3 bucket: full form, version data set one large table 1.7 billion rows 24 columns, row corresponds single taxi ride sometime 2009 2022. data dictionary version NYC taxi data also available. multi-file data set comprised 158 distinct Parquet files, corresponding month data. single file typically around 400-500MB size, full data set 70GB size. small data set – slow download fit memory typical machine 🙂 – also host “tiny” version NYC taxi data formatted exactly way includes one every thousand entries original data set (.e., individual files <1MB size, “tiny” data set 70MB) Amazon S3 support enabled arrow (true users; see links end article need troubleshoot ), can connect copy “tiny taxi data” stored S3 command: Alternatively connect copy data Google Cloud Storage (GCS) using following command: want use full data set, replace nyc-taxi-tiny nyc-taxi code . Apart size – cost time, bandwidth usage, CPU cycles – difference two versions data: can test code using tiny taxi data check scales using full data set. 
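As noted above, the same code works against the full-size data set by dropping the -tiny suffix from the bucket name (a one-line variation on the connection command shown below):

bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi")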
make local copy data set stored bucket folder called \"nyc-taxi\", use copy_files() function: purposes article, assume NYC taxi dataset (either full data tiny version) downloaded locally exists \"nyc-taxi\" directory.","code":"bucket <- s3_bucket(\"voltrondata-labs-datasets/nyc-taxi-tiny\") bucket <- gs_bucket(\"voltrondata-labs-datasets/nyc-taxi-tiny\", anonymous = TRUE) copy_files(from = bucket, to = \"nyc-taxi\")"},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"opening-datasets","dir":"Articles","previous_headings":"","what":"Opening Datasets","title":"Working with multi-file data sets","text":"first step process create Dataset object points data directory: important note , data values loaded memory. Instead, Arrow scans data directory find relevant files, parses file paths looking “Hive-style partitioning” (see ), reads headers data files construct Schema contains metadata describing structure data. information Schemas see metadata article. Two questions naturally follow : kind files open_dataset() look , structure expect find file paths? Let’s start looking file types. default open_dataset() looks Parquet files can override using format argument. example data encoded CSV files set format = \"csv\" connect data. Arrow Dataset interface supports several file formats including: \"parquet\" (default) \"feather\" \"ipc\" (aliases \"arrow\"; Feather version 2 Arrow file format) \"csv\" (comma-delimited files) \"tsv\" (tab-delimited files) \"text\" (generic text-delimited files - use delimiter argument specify use) case text files, can pass following parsing options open_dataset() ensure files read correctly: delim quote escape_double escape_backslash skip_empty_rows alternative working text files use open_delim_dataset(), open_csv_dataset(), open_tsv_dataset(). functions wrappers around open_dataset() parameters mirror read_csv_arrow(), read_delim_arrow(), read_tsv_arrow() allow easy switching functions opening single files functions opening datasets. example: information arguments parsing delimited text files generally, see help documentation read_delim_arrow() open_delim_dataset(). Next, information open_dataset() expect find file paths? default, Dataset interface looks Hive-style partitioning structure folders named using “key=value” convention, data files folder contain subset data key relevant value. example, NYC taxi data file paths look like : , open_dataset() infers first listed Parquet file contains data January 2009. sense, hive-style partitioning self-describing: folder names state explicitly Dataset split across files. Sometimes directory partitioning isn’t self describing; , doesn’t contain field names. example, suppose NYC taxi data used file paths like : case, open_dataset() need hints use file paths. case, provide c(\"year\", \"month\") partitioning argument, saying first path segment gives value year, second segment month. Every row 2009/01/part-0.parquet value 2009 year 1 month, even though columns may present file. words, open data like : Either way, look Dataset, can see addition columns present every file, also columns year month. columns present files : inferred partitioning structure.","code":"ds <- open_dataset(\"nyc-taxi\") ds <- open_csv_dataset(\"nyc-taxi/csv/\") year=2009/month=1/part-0.parquet year=2009/month=2/part-0.parquet ... 2009/01/part-0.parquet 2009/02/part-0.parquet ... 
ds <- open_dataset(\"nyc-taxi\", partitioning = c(\"year\", \"month\")) ds ## ## FileSystemDataset with 158 Parquet files ## vendor_name: string ## pickup_datetime: timestamp[ms] ## dropoff_datetime: timestamp[ms] ## passenger_count: int64 ## trip_distance: double ## pickup_longitude: double ## pickup_latitude: double ## rate_code: string ## store_and_fwd: string ## dropoff_longitude: double ## dropoff_latitude: double ## payment_type: string ## fare_amount: double ## extra: double ## mta_tax: double ## tip_amount: double ## tolls_amount: double ## total_amount: double ## improvement_surcharge: double ## congestion_surcharge: double ## pickup_location_id: int64 ## dropoff_location_id: int64 ## year: int32 ## month: int32"},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"querying-datasets","dir":"Articles","previous_headings":"","what":"Querying Datasets","title":"Working with multi-file data sets","text":"Now Dataset object refers data, can construct dplyr-style queries. possible arrow supplies back end allows users manipulate tabular Arrow data using dplyr verbs. ’s example: suppose curious tipping behavior longest taxi rides. Let’s find median tip percentage rides fares greater $100 2015, broken number passengers: ’ve just selected subset Dataset contains around 2 billion rows, computed new column, aggregated . within seconds modern laptop. work? three reasons arrow can accomplish task quickly: First, arrow adopts lazy evaluation approach queries: dplyr verbs called Dataset, record actions evaluate actions data run collect(). can see taking code leaving final step: version code returns output instantly shows manipulations ’ve made, without loading data files. evaluation queries deferred, can build query selects small subset without generating intermediate data sets potentially large. Second, work pushed individual data files, depending file format, chunks data within files. result, can select subset data much larger data set collecting smaller slices file: don’t load whole data set memory slice . Third, partitioning, can ignore files entirely. example, filtering year == 2015, files corresponding years immediately excluded: don’t load order find rows match filter. Parquet files – contain row groups statistics data contained within groups – may entire chunks data can avoid scanning rows total_amount > 100. One final thing note querying Datasets. Suppose attempt call unsupported dplyr verbs unimplemented functions query Arrow Dataset. case, arrow package raises error. However, dplyr queries Arrow Table objects (already -memory), package automatically calls collect() processing dplyr verb. 
learn dplyr back end, see data wrangling article.","code":"system.time(ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = 100 * tip_amount / total_amount) %>% group_by(passenger_count) %>% summarise( median_tip_pct = median(tip_pct), n = n() ) %>% collect() %>% print()) ## ## # A tibble: 10 x 3 ## passenger_count median_tip_pct n ## <int> <dbl> <int> ## 1 1 16.6 143087 ## 2 2 16.2 34418 ## 3 5 16.7 5806 ## 4 4 11.4 4771 ## 5 6 16.7 3338 ## 6 3 14.6 8922 ## 7 0 10.1 380 ## 8 8 16.7 32 ## 9 9 16.7 42 ## 10 7 16.7 11 ## ## user system elapsed ## 4.436 1.012 1.402 ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = 100 * tip_amount / total_amount) %>% group_by(passenger_count) %>% summarise( median_tip_pct = median(tip_pct), n = n() ) ## ## FileSystemDataset (query) ## passenger_count: int64 ## median_tip_pct: double ## n: int32 ## ## See $.data for the source Arrow object"},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"batch-processing-experimental","dir":"Articles","previous_headings":"","what":"Batch processing (experimental)","title":"Working with multi-file data sets","text":"Sometimes want run R code entire Dataset, Dataset much larger memory. can use map_batches Dataset query process batch--batch. Note: map_batches experimental recommended production use. example, randomly sample Dataset, use map_batches sample percentage rows batch: function can also used aggregate summary statistics Dataset computing partial results batch aggregating partial results. Extending example , fit model sample data use map_batches compute MSE full Dataset.","code":"sampled_data <- ds %>% filter(year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% map_batches(~ as_record_batch(sample_frac(as.data.frame(.), 1e-4))) %>% mutate(tip_pct = tip_amount / total_amount) %>% collect() str(sampled_data) ## ## tibble [10,918 <U+00D7> 4] (S3: tbl_df/tbl/data.frame) ## $ tip_amount : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ... ## $ total_amount : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ... ## $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ... ## $ tip_pct : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ... model <- lm(tip_pct ~ total_amount + passenger_count, data = sampled_data) ds %>% filter(year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = tip_amount / total_amount) %>% map_batches(function(batch) { batch %>% as.data.frame() %>% mutate(pred_tip_pct = predict(model, newdata = .)) %>% filter(!is.nan(tip_pct)) %>% summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n()) %>% as_record_batch() }) %>% summarize(mse = sum(sse_partial) / sum(n_partial)) %>% pull(mse) ## ## [1] 0.1304284"},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"dataset-options","dir":"Articles","previous_headings":"","what":"Dataset options","title":"Working with multi-file data sets","text":"ways can control Dataset creation adapt special use cases.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"work-with-files-in-a-directory","dir":"Articles","previous_headings":"Dataset options","what":"Work with files in a directory","title":"Working with multi-file data sets","text":"working single file set files directory, can provide file path vector multiple file paths open_dataset(). useful , example, single CSV file big read memory. 
pass file path open_dataset(), use group_by() partition Dataset manageable chunks, use write_dataset() write chunk separate Parquet file—without needing read full CSV file R.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"explicitly-declare-column-names-and-data-types","dir":"Articles","previous_headings":"Dataset options","what":"Explicitly declare column names and data types","title":"Working with multi-file data sets","text":"can specify schema argument open_dataset() declare columns data types. useful data files different storage schema (example, column int32 one int8 another) want ensure resulting Dataset specific type. clear, ’s necessary specify schema, even example mixed integer types, Dataset constructor reconcile differences like . schema specification just lets declare want result .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"explicitly-declare-partition-format","dir":"Articles","previous_headings":"Dataset options","what":"Explicitly declare partition format","title":"Working with multi-file data sets","text":"Similarly, can provide Schema partitioning argument open_dataset() order declare types virtual columns define partitions. useful, NYC taxi data example, wanted keep month string instead integer.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"work-with-multiple-data-sources","dir":"Articles","previous_headings":"Dataset options","what":"Work with multiple data sources","title":"Working with multi-file data sets","text":"Another feature Datasets can composed multiple data sources. , may directory partitioned Parquet files one location, another directory, files haven’t partitioned. , point S3 bucket Parquet data directory CSVs local file system query together single Dataset. create multi-source Dataset, provide list Datasets open_dataset() instead file path, concatenate command like big_dataset <- c(ds1, ds2).","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"writing-datasets","dir":"Articles","previous_headings":"","what":"Writing Datasets","title":"Working with multi-file data sets","text":"can see, querying large Dataset can made quite fast storage efficient binary columnar format like Parquet Feather partitioning based columns commonly used filtering. However, data isn’t always stored way. Sometimes might start one giant CSV. first step analyzing data cleaning reshaping usable form. write_dataset() function allows take Dataset another tabular data object—Arrow Table RecordBatch, R data frame—write different file format, partitioned multiple files. Assume version NYC Taxi data CSV: can write new location translate files Feather format calling write_dataset() : Next, let’s imagine payment_type column something often filter , want partition data variable. ensure filter like payment_type == \"Cash\" touch subset files payment_type always \"Cash\". One natural way express columns want partition use group_by() method: write files directory tree looks like : Note directory names payment_type=Cash similar: Hive-style partitioning described . means call open_dataset() directory, don’t declare partitions can read file paths. (instead write bare values partition segments, .e. Cash rather payment_type=Cash, call write_dataset() hive_style = FALSE.) Perhaps, though, payment_type == \"Cash\" data ever care , just want drop rest smaller working set. , can filter() writing: thing can writing Datasets select subset columns reorder . 
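A minimal sketch of the single-large-CSV workflow described in "Work with files in a directory": open the file lazily as a Dataset, then write it back out partitioned, without reading the whole CSV into R. The file name and the year partitioning column are hypothetical.

library(arrow)
library(dplyr)
ds <- open_dataset("very-large-file.csv", format = "csv")
ds %>%
  group_by(year) %>%
  write_dataset("csv-repartitioned", format = "parquet")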
Suppose never care vendor_id, string column, can take lot space read , let’s drop : Note can select subset columns, currently rename columns writing Dataset.","code":"ds <- open_dataset(\"nyc-taxi/csv/\", format = \"csv\") write_dataset(ds, \"nyc-taxi/feather\", format = \"feather\") ds %>% group_by(payment_type) %>% write_dataset(\"nyc-taxi/feather\", format = \"feather\") system(\"tree nyc-taxi/feather\") ## feather ## ├── payment_type=1 ## │ └── part-18.arrow ## ├── payment_type=2 ## │ └── part-19.arrow ## ... ## └── payment_type=UNK ## └── part-17.arrow ## ## 18 directories, 23 files ds %>% filter(payment_type == \"Cash\") %>% write_dataset(\"nyc-taxi/feather\", format = \"feather\") ds %>% group_by(payment_type) %>% select(-vendor_id) %>% write_dataset(\"nyc-taxi/feather\", format = \"feather\")"},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"partitioning-performance-considerations","dir":"Articles","previous_headings":"","what":"Partitioning performance considerations","title":"Working with multi-file data sets","text":"Partitioning Datasets two aspects affect performance: increases number files creates directory structure around files. benefits well costs. Depending configuration size Dataset, costs can outweigh benefits. partitions split Dataset multiple files, partitioned Datasets can read written parallelism. However, additional file adds little overhead processing filesystem interaction. also increases overall Dataset size since file shared metadata. example, parquet file contains schema group-level statistics. number partitions floor number files. partition Dataset date year data, least 365 files. partition another dimension 1,000 unique values, 365,000 files. fine partitioning often leads small files mostly consist metadata. Partitioned Datasets create nested folder structures, allow us prune files loaded scan. However, adds overhead discovering files Dataset, ’ll need recursively “list directory” find data files. fine partitions can cause problems : Partitioning dataset date years worth data require 365 list calls find files; adding another column cardinality 1,000 make 365,365 calls. optimal partitioning layout depend data, access patterns, systems reading data. systems, including Arrow, work across range file sizes partitioning layouts, extremes avoid. guidelines can help avoid known worst cases: Avoid files smaller 20MB larger 2GB. Avoid partitioning layouts 10,000 distinct partitions. file formats notion groups within file, Parquet, similar guidelines apply. Row groups can provide parallelism reading allow data skipping based statistics, small groups can cause metadata significant portion file size. Arrow’s file writer provides sensible defaults group sizing cases.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"transactions-acid-guarantees","dir":"Articles","previous_headings":"","what":"Transactions / ACID guarantees","title":"Working with multi-file data sets","text":"Dataset API offers transaction support ACID guarantees. affects reading writing. Concurrent reads fine. Concurrent writes writes concurring reads may unexpected behavior. Various approaches can used avoid operating files using unique basename template writer, temporary directory new files, separate storage file list instead relying directory discovery. Unexpectedly killing process write progress can leave system inconsistent state. Write calls generally return soon bytes written completely delivered OS page cache. 
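One of the approaches mentioned above for concurrent writers is to give each writer a unique basename template. A sketch under that assumption (the "writer-1" prefix is illustrative; the template must contain "{i}"):

write_dataset(
  ds,
  "nyc-taxi/feather",
  format = "feather",
  basename_template = "writer-1-part-{i}.arrow"
)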
Even though write operation completed possible part file lost sudden power loss immediately write call. file formats magic numbers written end. means partial file write can safely detected discarded. CSV file format concept partially written CSV file may detected valid.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/dataset.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Working with multi-file data sets","text":"learn cloud storage, see cloud storage article. learn dplyr arrow, see data wrangling article. learn reading writing data, see read/write article. specific recipes reading writing multi-file Datasets, see Arrow R cookbook chapter. manually enable cloud support Linux, see article installation Linux. learn schemas metadata, see metadata article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"validity-bitmap-buffer","dir":"Articles > Developers","previous_headings":"","what":"Validity bitmap buffer","title":"Internal structure of Arrow objects","text":"validity bitmap binary-valued, contains 1 whenever corresponding slot array contains valid, non-null value. abstract level can assume contains following five bits: However slight -simplification three reasons. First, memory allocated byte-size units three trailing bits end (assumed zero), giving us bitmap 10111000. Second, written left--right, written format typically presumed represent big endian format -significant bit written first (.e., lowest-valued memory address). Arrow adopts little-endian convention, naturally correspond toa right--left ordering written English. reflect write bits right--left order: 00011101. Finally, Arrow encourages naturally aligned data structures allocated memory addresses multiple data block sizes. Arrow uses 64 byte alignment, data structure must multiple 64 bytes size. design feature exists allow efficient use modern hardware, discussed Arrow specification. buffer looks like memory:","code":"10111"},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"data-buffer","dir":"Articles > Developers","previous_headings":"","what":"Data buffer","title":"Internal structure of Arrow objects","text":"data buffer, like validity bitmap, padded length 64 bytes preserve natural alignment. ’s diagram showing physical layout: integer occupies 4 bytes, per requirements 32-bit signed integer. Notice bytes associated missing value left unspecified: space allocated value bytes filled.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"offset-buffer","dir":"Articles > Developers","previous_headings":"","what":"Offset buffer","title":"Internal structure of Arrow objects","text":"types Arrow array include third buffer known offset buffer. frequently encountered context string arrays, one: Using schematic notation , structure object. metadata shown , now three buffers: understand role offset buffer, helps note format data buffer string array: concatenates strings end end one contiguous section memory. string_array object, contents data buffer look like one long utf8-encoded string: individual strings can variable length, role offset buffer specify boundaries slots . second slot array string \"amazing\". positions data array indexed like can see string interest begins position 5 ends position 11. offset buffer consists integers store break point locations. 
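The five-bit validity bitmap and the int32 data buffer discussed above correspond to an array like the following (a minimal illustration; null_count reflects the single unset validity bit for the missing second slot):

int_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
int_array$type        # int32
int_array$null_count  # 1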
string_array might look like : difference utf8() data type large_utf8() data type utf8() data type stores 32-bit integers whereas large_utf8() type stores 64-bit integers.","code":"string_array <- Array$create(c(\"hello\", \"amazing\", \"and\", \"cruel\", \"world\")) string_array ## Array ## <string> ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ] helloamazingandcruelworld 0 5 12 15 20 25"},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"chunked-arrays","dir":"Articles > Developers","previous_headings":"","what":"Chunked arrays","title":"Internal structure of Arrow objects","text":"Arrays immutable objects: Array initialized values stores altered. ensures multiple entities can safely refer Array via pointers, run risk values change. Using immutable Arrays makes possible Arrow avoid unnecessary copies data objects. limitations immutable Arrays, notably new batches data arrive. array immutable, can’t add new information existing array. thing can don’t want disturb copy existing array create new array contains new data. preserves immutability arrays doesn’t lead unnecessary copying now new problem: data split across two arrays. array contains one “chunk” data. ideal abstraction layer allows us treat two Arrays though single “Array-like” object. problem chunked arrays solve. chunked array wrapper around list arrays, allows index contents “” single array. Physically, data still stored separate places – array one chunk, chunks don’t adjacent memory – chunked array provides us layer abstraction allows us pretend one thing. illustrate, let’s use chunked_array() function: chunked_array() function just wrapper around functionality ChunkedArray$create() provides. Let’s take look object: double bracketing output intended highlight “list-like” nature chunked arrays. three separate arrays, wrapped container object secretly list arrays, allows list behave just like regular one-dimensional data structure. Schematically looks like : figure illustrates, really three arrays , validity bitmap, offset buffer, data buffer.","code":"chunked_string_array <- chunked_array( c(\"hello\", \"amazing\", \"and\", \"cruel\", \"world\"), c(\"I\", \"love\", \"you\") ) chunked_string_array ## ChunkedArray ## <string> ## [ ## [ ## \"hello\", ## \"amazing\", ## \"and\", ## \"cruel\", ## \"world\" ## ], ## [ ## \"I\", ## \"love\", ## \"you\" ## ] ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"record-batches","dir":"Articles > Developers","previous_headings":"","what":"Record batches","title":"Internal structure of Arrow objects","text":"record batch table-like data structure comprised sequence arrays. arrays can different types must length. array referred one “fields” “columns” record batch. field must (UTF8-encoded) name, names form part metadata record batch. stored memory, record batch include physical storage values stored field: instead contains pointers relevant array objects. , however, contain validity bitmap. 
record batch containing 5 rows 3 columns: abstract level rb object behaves like two dimensional structure rows columns, terms represented memory fundamentally list arrays shown :","code":"rb <- record_batch( strs = c(\"hello\", \"amazing\", \"and\", \"cruel\", \"world\"), ints = c(1L, NA, 2L, 4L, 8L), dbls = c(1.1, 3.2, 0.2, NA, 11) ) rb ## RecordBatch ## 5 rows x 3 columns ## $strs <string> ## $ints <int32> ## $dbls <double>"},{"path":"https://arrow.apache.org/docs/r/articles/developers/data_object_layout.html","id":"tables","dir":"Articles > Developers","previous_headings":"","what":"Tables","title":"Internal structure of Arrow objects","text":"deal situations rectangular data set can grow time (data added), need tabular data structure similar record batch one exception: instead storing column array, now want store chunked array. Table class arrow . illustrate, suppose second set data arrives record batch: underlying structure Table:","code":"new_rb <- record_batch( strs = c(\"I\", \"love\", \"you\"), ints = c(5L, 0L, 0L), dbls = c(7.1, -0.1, 2) ) df <- concat_tables(arrow_table(rb), arrow_table(new_rb)) df ## Table ## 8 rows x 3 columns ## $strs <string> ## $ints <int32> ## $dbls <double>"},{"path":"https://arrow.apache.org/docs/r/articles/developers/debugging.html","id":"debugging-r-code","dir":"Articles > Developers","previous_headings":"","what":"Debugging R code","title":"Debugging strategies","text":"general, found using interactive debugging (e.g. calls browser()), can inspect objects particular environment, efficient simpler techniques print() statements.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/debugging.html","id":"getting-more-descriptive-c-error-messages-after-a-segfault","dir":"Articles > Developers","previous_headings":"","what":"Getting more descriptive C++ error messages after a segfault","title":"Debugging strategies","text":"working RStudio IDE, R session aborted segfault. re-run code command-line R session, session isn’t automatically aborted possible copy error message accompanying segfault. example bug existed time writing. output provides R traceback; however, doesn’t provide information exact line C++ code segfault originated. , need run R C++ debugger attached.","code":"> S3FileSystem$create() *** caught segfault *** address 0x1a0, cause 'memory not mapped' Traceback: 1: (function (anonymous, access_key, secret_key, session_token, role_arn, session_name, external_id, load_frequency, region, endpoint_override, scheme, background_writes) { .Call(`_arrow_fs___S3FileSystem__create`, anonymous, access_key, secret_key, session_token, role_arn, session_name, external_id, load_frequency, region, endpoint_override, scheme, background_writes)})(access_key = \"\", secret_key = \"\", session_token = \"\", role_arn = \"\", session_name = \"\", external_id = \"\", load_frequency = 900L, region = \"\", endpoint_override = \"\", scheme = \"\", background_writes = TRUE, anonymous = FALSE) 2: exec(fs___S3FileSystem__create, !!!args) 3: S3FileSystem$create()"},{"path":"https://arrow.apache.org/docs/r/articles/developers/debugging.html","id":"running-r-code-with-the-c-debugger-attached","dir":"Articles > Developers","previous_headings":"Getting more descriptive C++ error messages after a segfault","what":"Running R code with the C++ debugger attached","title":"Debugging strategies","text":"Arrow C++ code core, debugging code can sometimes tricky errors originate C++ rather R layer. 
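A tiny sketch of the interactive debugging approach mentioned earlier (the function and its use are hypothetical): a browser() call inside the code under investigation pauses execution so the objects in that environment can be inspected, which is usually quicker than sprinkling print() statements.

inspect_batch <- function(batch) {
  browser()  # execution pauses here; inspect `batch`, then continue with `c`
  as.data.frame(batch)
}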
adding new code triggers C++ bug (find one existing code), can result segfault. working RStudio, session aborted, may able retrieve error messaging needed diagnose /report bug. One way around find code causes error, run R C++ debugger. using macOS installed R using Apple installer, able run R debugger attached; please see instructions details causes workarounds. Firstly, load R debugger. common debuggers gdb (typically found Linux, sometimes macOS, Windows via MinGW Cygwin) lldb (default macOS debugger). case ’s gdb, ’re using lldb debugger (example, ’re Mac), just swap command . Next, run R. now R session C++ debugger attached. look similar normal R session, extra output. Now, run code - either directly session sourcing file. code results segfault, extra output can use diagnose problem attach issue extra information. debugger output segfault shown previous example. can see exact line triggers segfault included output.","code":"R -d gdb run > S3FileSystem$create() Thread 1 \"R\" received signal SIGSEGV, Segmentation fault. 0x00007ffff0128369 in std::__atomic_base<long>::operator++ (this=0x178) at /usr/include/c++/9/bits/atomic_base.h:318 318 operator++() noexcept"},{"path":"https://arrow.apache.org/docs/r/articles/developers/debugging.html","id":"getting-debugger-output-if-your-session-hangs","dir":"Articles > Developers","previous_headings":"Getting more descriptive C++ error messages after a segfault > Running R code with the C++ debugger attached","what":"Getting debugger output if your session hangs","title":"Debugging strategies","text":"instructions can provide valuable additional context segfault occurs. However, occasionally circumstances bug cause session hang indefinitely without segfaulting. case, may diagnostically useful interrupt debugger generate backtraces running threads. , firstly, press Ctrl/Cmd C interrupt debugger, run: generate large amount output, information useful identifying cause issue.","code":"thread apply all bt"},{"path":"https://arrow.apache.org/docs/r/articles/developers/debugging.html","id":"further-reading","dir":"Articles > Developers","previous_headings":"","what":"Further reading","title":"Debugging strategies","text":"following resources provide detailed guides debugging R code: chapter debugging ‘Advanced R’ Hadley Wickham RStudio debugging documentation excellent -depth guide using C++ debugger R, see blog post David Vaughan. can find list equivalent gdb lldb commands LLDB website.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"how-do-i-run-a-docker-container","dir":"Articles > Developers","previous_headings":"","what":"How do I run a Docker container?","title":"Using docker containers","text":"number images created convenience Arrow devs can find DockerHub repo. code shows example command use run Docker container. run root directory checkout arrow repo. Components: docker run - command run container -- run interactive terminal can run commands containers -e ARROW_DEPENDENCY_SOURCE=AUTO - set environment variable ARROW_DEPENDENCY_SOURCE value AUTO -v $(pwd):/arrow - mount current directory /arrow container apache/arrow-dev - DockerHub repo get container r-rhub-ubuntu-release-latest - image tag run command, don’t copy particular image saved locally, first downloaded container spun . 
example , mounting directory Arrow repo stored local machine, meant code built tested container.","code":"docker run -it -e ARROW_DEPENDENCY_SOURCE=AUTO -v $(pwd):/arrow apache/arrow-dev:r-rhub-ubuntu-release-latest"},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"how-do-i-exit-this-image","dir":"Articles > Developers","previous_headings":"","what":"How do I exit this image?","title":"Using docker containers","text":"Linux, press Ctrl+D.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"how-do-i-show-all-images-saved","dir":"Articles > Developers","previous_headings":"","what":"How do I show all images saved?","title":"Using docker containers","text":"","code":"docker images"},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"how-do-i-show-all-running-containers","dir":"Articles > Developers","previous_headings":"","what":"How do I show all running containers?","title":"Using docker containers","text":"","code":"docker ps"},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"how-do-i-show-all-containers","dir":"Articles > Developers","previous_headings":"","what":"How do I show all containers?","title":"Using docker containers","text":"","code":"sudo docker ps -a"},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"running-existing-workflows-from-docker-compose-yml","dir":"Articles > Developers","previous_headings":"","what":"Running existing workflows from docker-compose.yml","title":"Using docker containers","text":"number workflows outlined file docker-compose.yml arrow repo root directory. example, can use workflow called r test building installing R package. advantageous can use existing utility scripts install onto container already R . workflows also parameterized, means can specify different options (just use defaults, can found .env)","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"example---the-manual-way","dir":"Articles > Developers","previous_headings":"Running existing workflows from docker-compose.yml","what":"Example - The manual way","title":"Using docker containers","text":"wanted run RHub’s latest ubuntu-release image, run:","code":"R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose build r R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose run r"},{"path":"https://arrow.apache.org/docs/r/articles/developers/docker.html","id":"example---using-archery","dir":"Articles > Developers","previous_headings":"Running existing workflows from docker-compose.yml","what":"Example - Using Archery","title":"Using docker containers","text":"Alternatively, may prefer use Archery tool run docker images. advantage making simpler build existing Arrow CI jobs hierarchical dependencies, example, build R package container already C++ code pre-built. tool CI uses - via tool called Crossbow. want run r workflow discussed , run:","code":"R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest archery docker run r"},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"installing-libarrow-during-r-package-installation","dir":"Articles > Developers","previous_headings":"","what":"Installing libarrow during R package installation","title":"Installation details","text":"number scripts triggered R CMD INSTALL . run Arrow users, just work without configuration pull complete pieces (e.g. official binaries host). 
One jobs scripts work libarrow installed, , install . overview scripts shown : configure configure.win - scripts triggered R CMD INSTALL . non-Windows Windows platforms, respectively. handle finding libarrow, setting build variables necessary, writing package Makevars file used compile C++ code R package. tools/nixlibs.R - script called configure Linux macOS (non-windows OS environment variable FORCE_BUNDLED_BUILD=true). windows script called configure.win environment variable ARROW_HOME set. looks existing libarrow installation, can’t find one downloads appropriate libarrow binary. non-windows binary found, script sets build process bundled builds (default linux) checks dependencies. inst/build_arrow_static.sh - called tools/nixlibs.R libarrow needs built. builds libarrow bundled, static build, mirrors steps described Arrow R developer guide build script also used generate prebuilt binaries. actions taken scripts resolve dependencies install correct components described .","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"windows","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow","what":"Windows","title":"Installation details","text":"diagram shows R package finds libarrow installation Windows.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"checking-for-existing-libarrow-installations","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow > Windows","what":"Checking for existing libarrow installations","title":"Installation details","text":"install arrow R package Windows, ARROW_HOME environment variable set, install script looks existing libarrow installation. find, checks whether R_WINLIB_LOCAL environment variable set point local installation.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"downloading-libarrow","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow > Windows","what":"Downloading libarrow","title":"Installation details","text":"existing libarrow installations can found, script proceeds try download required version libarrow, first nightly builds repository Rwinlib. script first tries find version libarrow matches components according semantic versioning, case failure becomes less specific (.e. binaries found version 0.14.1.1, try find one 0.14.1).","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"non-windows","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow","what":"Non-Windows","title":"Installation details","text":"Linux macOS, core logic : FORCE_BUNDLED_BUILD=true, skip step 3. Find libarrow system. present, make sure version compatible R package. suitable libarrow found, download (allowed) build source. 
Determine features libarrow flags requires, set src/Makevars use compiling bindings.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"finding-libarrow-on-the-system","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow > Non-Windows","what":"Finding libarrow on the system","title":"Installation details","text":"configure script look libarrow three places: path environment variable ARROW_HOME, set Whatever pkg-config finds, unless ARROW_USE_PKG_CONFIG=false Homebrew, done brew install apache-arrow libarrow build found, check version C++ library matches R package. versions match, like ’ve installed system package release version development version R package, libarrow used. C++ library R package development versions, see warning message advising trouble, ensure C++ library built commit R package, development version numbers change every commit.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"prebuilt-binaries","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow > Non-Windows","what":"Prebuilt binaries","title":"Installation details","text":"libarrow found system, R package installation script next attempt download prebuilt libarrow binaries match local operating system, required dependencies (e.g. openssl version) arrow R package version. used automatically many Linux distributions (x86_64 architecture ), according allowlist. distribution isn’t list, can opt-setting NOT_CRAN environment variable call install.packages(). found, downloaded bundled R package compiles.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"building-from-source","dir":"Articles > Developers","previous_headings":"Installing libarrow during R package installation > How the R package finds libarrow > Non-Windows","what":"Building from source","title":"Installation details","text":"suitable libarrow binary found, attempt build locally. First, also look see checkout apache/arrow git repository thus libarrow source files . Otherwise, builds source files included package. Depending system, building libarrow source may slow. libarrow built source, inst/build_arrow_static.sh executed.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"using-the-r-package-with-libarrow-installed-as-a-system-package","dir":"Articles > Developers","previous_headings":"","what":"Using the R package with libarrow installed as a system package","title":"Installation details","text":"authorized install system packages ’re installing CRAN release, may want use official Apache Arrow release packages corresponding R package version via software distribution tools apt yum (though drawbacks: see “Troubleshooting” section main installation docs). See Arrow project installation page find pre-compiled binary packages common Linux distributions, including Debian, Ubuntu, CentOS. 
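As noted in the prebuilt-binaries section above, a Linux distribution that is not on the allowlist can still opt in to downloading a prebuilt libarrow binary. A hypothetical session:

Sys.setenv(NOT_CRAN = "true")
install.packages("arrow")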
developer contributing R package, system libarrow packages won’t useful versions match.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/install_details.html","id":"using-the-r-package-with-an-existing-libarrow-build","dir":"Articles > Developers","previous_headings":"","what":"Using the R package with an existing libarrow build","title":"Installation details","text":"setup much common arrow developers, may needing make changes R package libarrow source code. See developer setup docs information.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"option-1-using-nightly-libarrow-binaries","dir":"Articles > Developers","previous_headings":"","what":"Option 1: Using nightly libarrow binaries","title":"Configuring a developer environment","text":"Linux, macOS, Windows can use workflow might use another package contains compiled code (e.g., R CMD INSTALL . terminal, devtools::load_all() R prompt, Install & Restart RStudio). arrow/r/libarrow directory populated, configure script attempt download latest nightly libarrow binary, extract arrow/r/libarrow directory (macOS, Linux) arrow/r/windows directory (Windows), continue building R package usual. time, won’t need update version libarrow R package rarely changes updates C++ library; however, start get errors rebuilding R package, may remove libarrow directory (macOS, Linux) windows directory (Windows) “clean” rebuild. can terminal R CMD INSTALL . --preclean, RStudio using “Clean Install” option “Build” tab, using make clean using Makefile located root R package.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"option-2-use-a-local-arrow-c-development-build","dir":"Articles > Developers","previous_headings":"","what":"Option 2: Use a local Arrow C++ development build","title":"Configuring a developer environment","text":"need alter libarrow R package code, can’t get binary version latest libarrow elsewhere, ’ll need build source. section discusses set C++ libarrow build configured work R package. general resources, see Arrow C++ developer guide. five major steps process.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"step-1---install-dependencies","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build","what":"Step 1 - Install dependencies","title":"Configuring a developer environment","text":"building libarrow, default, system dependencies used suitable versions found. system dependencies present, libarrow build build process. dependencies need install outside build process cmake (configuring build) openssl building S3 support. faster build, may choose pre-install C++ library dependencies (lz4, zstd, etc.) 
system don’t need built source libarrow build.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"ubuntu","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build > Step 1 - Install dependencies","what":"Ubuntu","title":"Configuring a developer environment","text":"","code":"sudo apt install -y cmake libcurl4-openssl-dev libssl-dev"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"macos","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build > Step 1 - Install dependencies","what":"macOS","title":"Configuring a developer environment","text":"","code":"brew install cmake openssl"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"step-2---configure-the-libarrow-build","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build","what":"Step 2 - Configure the libarrow build","title":"Configuring a developer environment","text":"recommend configure libarrow built user-level directory rather system directory development work. development version using doesn’t overwrite released version libarrow may already installed, also able work one version libarrow (using different ARROW_HOME directories different versions). example , libarrow installed directory called dist parent directory arrow checkout. installation Arrow R package can point directory name, though recommend placing inside arrow git checkout directory unwanted changes stop working properly. Special instructions Linux: need set LD_LIBRARY_PATH lib directory set $ARROW_HOME, launching R using arrow. One way add profile (use ~/.bash_profile , might need put different file depending setup, e.g. use shell bash). macOS need macOS shared library paths hardcoded locations build time. Start navigating terminal arrow repository. need create directory C++ build put contents. recommend make build directory inside cpp directory Arrow git repository (git-ignored, won’t accidentally check ). Next, change directories inside cpp/build: ’ll first call cmake configure build make install. R package, ’ll need enable several features libarrow using -D flags:","code":"export ARROW_HOME=$(pwd)/dist mkdir $ARROW_HOME export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH echo \"export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH\" >> ~/.bash_profile pushd arrow mkdir -p cpp/build pushd cpp/build"},{"path":[]},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"linux-mac-os","dir":"Articles > Developers","previous_headings":"","what":"Configuring a developer environment","title":"Configuring a developer environment","text":"","code":"cmake \\ -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \\ -DCMAKE_INSTALL_LIBDIR=lib \\ -DARROW_COMPUTE=ON \\ -DARROW_CSV=ON \\ -DARROW_DATASET=ON \\ -DARROW_EXTRA_ERROR_CONTEXT=ON \\ -DARROW_FILESYSTEM=ON \\ -DARROW_INSTALL_NAME_RPATH=OFF \\ -DARROW_JEMALLOC=ON \\ -DARROW_JSON=ON \\ -DARROW_PARQUET=ON \\ -DARROW_WITH_SNAPPY=ON \\ -DARROW_WITH_ZLIB=ON \\ .."},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"section-1","dir":"Articles > Developers","previous_headings":"","what":"Configuring a developer environment","title":"Configuring a developer environment","text":".. 
refers C++ source directory: ’re cpp/build source cpp.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"enabling-more-arrow-features","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build > Step 2 - Configure the libarrow build","what":"Enabling more Arrow features","title":"Configuring a developer environment","text":"enable optional features including: S3 support, alternative memory allocator, additional compression libraries, add flags call cmake (trailing \\ makes easier paste bash shell new line): flags may useful: -DBoost_SOURCE=BUNDLED -DThrift_SOURCE=BUNDLED, example, dependency *_SOURCE, system version C++ dependency doesn’t work correctly Arrow. tells build compile version dependency source. -DCMAKE_BUILD_TYPE=debug -DCMAKE_BUILD_TYPE=relwithdebinfo can useful debugging. probably don’t want generally debug build much slower runtime default release build. -DARROW_BUILD_STATIC=-DARROW_BUILD_SHARED=want use static libraries instead dynamic libraries. static libraries isn’t risk R package linking wrong library, mean change C++ code recompile C++ libraries R package. Compilers typically link static libraries dynamic ones present, need set -DARROW_BUILD_SHARED=. switching compiling installing previously, may need remove .dll .files $ARROW_HOME/dist/bin. Note cmake particularly sensitive whitespacing, see errors, check don’t errant whitespace.","code":"-DARROW_GCS=ON \\ -DARROW_MIMALLOC=ON \\ -DARROW_S3=ON \\ -DARROW_WITH_BROTLI=ON \\ -DARROW_WITH_BZ2=ON \\ -DARROW_WITH_LZ4=ON \\ -DARROW_WITH_SNAPPY=ON \\ -DARROW_WITH_ZSTD=ON \\"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"step-3---building-libarrow","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build","what":"Step 3 - Building libarrow","title":"Configuring a developer environment","text":"can add -j# end command speed compilation running parallel (# number cores available).","code":"cmake --build . --target install -j8"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"step-4---build-the-arrow-r-package","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build","what":"Step 4 - Build the Arrow R package","title":"Configuring a developer environment","text":"’ve built libarrow, can install R package dependencies, along additional dev dependencies, git checkout: ---multiarch flag makes compile “main” architecture. compile architecture R path corresponds . compile one architecture switch another, make sure pass --preclean flag R package code recompiled new architecture. Otherwise, may see errors like LoadLibrary failure: %1 valid Win32 application.","code":"popd # To go back to the root directory of the project, from cpp/build pushd r R -e \"install.packages('remotes'); remotes::install_deps(dependencies = TRUE)\" R CMD INSTALL --no-multiarch ."},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"compilation-flags","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build > Step 4 - Build the Arrow R package","what":"Compilation flags","title":"Configuring a developer environment","text":"need set compilation flags building C++ extensions, can use ARROW_R_CXXFLAGS environment variable. 
example, using perf profile R extensions, may need set","code":"export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"recompiling-the-c-code","dir":"Articles > Developers","previous_headings":"Option 2: Use a local Arrow C++ development build > Step 4 - Build the Arrow R package","what":"Recompiling the C++ code","title":"Configuring a developer environment","text":"setup described , need rebuild Arrow library even C++ source R package iterate work R package. time need rebuilt changed C++ R package (even , R CMD INSTALL . need recompile files changed) libarrow C++ changed mismatch libarrow R package. find rebuilding either time install package run tests, something probably wrong set .","code":"cmake \\ -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \\ -DCMAKE_INSTALL_LIBDIR=lib \\ -DARROW_COMPUTE=ON \\ -DARROW_CSV=ON \\ -DARROW_DATASET=ON \\ -DARROW_EXTRA_ERROR_CONTEXT=ON \\ -DARROW_FILESYSTEM=ON \\ -DARROW_GCS=ON \\ -DARROW_INSTALL_NAME_RPATH=OFF \\ -DARROW_JEMALLOC=ON \\ -DARROW_JSON=ON \\ -DARROW_MIMALLOC=ON \\ -DARROW_PARQUET=ON \\ -DARROW_S3=ON \\ -DARROW_WITH_BROTLI=ON \\ -DARROW_WITH_BZ2=ON \\ -DARROW_WITH_LZ4=ON \\ -DARROW_WITH_SNAPPY=ON \\ -DARROW_WITH_ZLIB=ON \\ -DARROW_WITH_ZSTD=ON \\ .."},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"installing-a-version-of-the-r-package-with-a-specific-git-reference","dir":"Articles > Developers","previous_headings":"","what":"Installing a version of the R package with a specific git reference","title":"Configuring a developer environment","text":"need arrow installation specific repository git reference, platforms except Windows, can run: build = FALSE argument important installation can access C++ source cpp/ directory apache/arrow. installation methods, setting environment variables LIBARROW_MINIMAL=false ARROW_R_DEV=true provide full-featured version Arrow provide verbose output, respectively. example, install (fictional) branch bugfix apache/arrow run: Developers may wish use method installing specific commit separate another Arrow development environment system installation (e.g. use arrowbench install development versions libarrow isolated system install). already libarrow installed system-wide, may need set additional variables order isolate build system libraries: Setting environment variable FORCE_BUNDLED_BUILD true skip pkg-config search libarrow attempt build source repository+ref given. may also need set Makevars CPPFLAGS LDFLAGS \"\" order prevent installation process attempting link already installed system versions libarrow. One way temporarily wrapping remotes::install_github() call like :","code":"remotes::install_github(\"apache/arrow/r\", build = FALSE) Sys.setenv(LIBARROW_MINIMAL=\"false\") remotes::install_github(\"apache/arrow/r@bugfix\", build = FALSE) withr::with_makevars(list(CPPFLAGS = \"\", LDFLAGS = \"\"), remotes::install_github(...))"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"summary-of-environment-variables","dir":"Articles > Developers","previous_headings":"","what":"Summary of environment variables","title":"Configuring a developer environment","text":"See user-facing article installation large number environment variables determine build works features get built. ARROW_OFFLINE_BUILD: set true, build script download prebuilt C++ library binary , needed, cmake. 
turn features require download, unless ’re available ARROW_THIRDPARTY_DEPENDENCY_DIR tools/thirdparty_download/ subfolder. create_package_with_all_dependencies() creates subfolder.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"troubleshooting","dir":"Articles > Developers","previous_headings":"","what":"Troubleshooting","title":"Configuring a developer environment","text":"Note change libarrow, must reinstall run make clean git clean -fdx . remove cached object code r/src/ directory reinstalling R package. necessary make changes libarrow source; need manually purge object files editing R C++ code inside r/.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"arrow-library---r-package-mismatches","dir":"Articles > Developers","previous_headings":"Troubleshooting","what":"Arrow library - R package mismatches","title":"Configuring a developer environment","text":"libarrow R package diverged, see errors like: resolve , try rebuilding Arrow library.","code":"Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so Expected in: flat namespace in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so Error: loading failed Execution halted ERROR: loading failed"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"multiple-versions-of-libarrow","dir":"Articles > Developers","previous_headings":"Troubleshooting","what":"Multiple versions of libarrow","title":"Configuring a developer environment","text":"installing user-level directory, already previous installation libarrow system directory, get may get errors like following install R package: happens, need make sure don’t let R link system library building arrow. can number different ways: Setting MAKEFLAGS environment variable \"LDFLAGS=\" (see example) recommended way accomplish Using {withr}’s with_makevars(list(LDFLAGS = \"\"), ...) adding LDFLAGS= ~/.R/Makevars file (least recommended way, though common debugging approach suggested online)","code":"Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib Referenced from: /usr/local/lib/libparquet.400.dylib Reason: image not found MAKEFLAGS=\"LDFLAGS=\" R CMD INSTALL ."},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"rpath-issues","dir":"Articles > Developers","previous_headings":"Troubleshooting","what":"rpath issues","title":"Configuring a developer environment","text":"package fails install/load error like : ensure -DARROW_INSTALL_NAME_RPATH=passed (important macOS prevent problems link time -op platforms). Alternatively, try setting environment variable R_LD_LIBRARY_PATH wherever Arrow C++ put make install, e.g. 
export R_LD_LIBRARY_PATH=/usr/local/lib, retry installing R package. installing source, R C++ library versions match, installation may fail. ’ve previously installed libraries want upgrade R package, ’ll need update Arrow C++ library first. build/configuration challenges, see C++ developer guide.","code":"** testing if installed package can be loaded from temporary location Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib"},{"path":"https://arrow.apache.org/docs/r/articles/developers/setup.html","id":"other-installation-issues","dir":"Articles > Developers","previous_headings":"Troubleshooting","what":"Other installation issues","title":"Configuring a developer environment","text":"number scripts triggered arrow R package installed. package users interacting underlying code, just work without configuration pull complete pieces (e.g. official binaries host). However, knowing scripts can help package developers troubleshoot things go wrong things go wrong install. See article R package installation information.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"loading-arrow","dir":"Articles > Developers","previous_headings":"","what":"Loading arrow","title":"Developer workflows","text":"can load R package via devtools::load_all().","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"rebuilding-the-documentation","dir":"Articles > Developers","previous_headings":"","what":"Rebuilding the documentation","title":"Developer workflows","text":"R documentation uses @examplesIf tag introduced roxygen2 version 7.1.2. can use devtools::document() pkgdown::build_site() rebuild documentation preview results.","code":"remotes::install_github(\"r-lib/roxygen2\") # Update roxygen documentation devtools::document() # To preview the documentation website pkgdown::build_site(preview=TRUE)"},{"path":[]},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"r-code","dir":"Articles > Developers","previous_headings":"Styling and linting","what":"R code","title":"Developer workflows","text":"R code package follows tidyverse style. PR submission (pushes) CI run linting flag possible errors pull request annotations. run linter locally, install lintr package (note, currently use fork includes fixes yet accepted upstream, see lintr installed file ci/docker/linux-apt-lint.dockerfile current status) run can automatically change formatting code package using styler package. two ways : Use comment bot automatically command @github-actions autotune PR, commit back branch. Run styler locally either via Makefile commands: R: styler package fix many styling errors, thought lintr errors automatically fixable styler. list files intentionally style r/.styler_excludes.R.","code":"lintr::lint_package(\"arrow/r\") make style # (for only the files changed) make style-all # (for all files) # note the file that should not be styled styler::style_pkg(exclude_files = c(\"data-raw/codegen.R\"))"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"c-code","dir":"Articles > Developers","previous_headings":"Styling and linting","what":"C++ code","title":"Developer workflows","text":"arrow package uses customized tools top cpp11 prepare C++ code src/. 
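As a small sketch of the developer loop from the "Loading arrow" section above (run from the r/ directory of an arrow checkout; the smoke test is only an example):
devtools::load_all()   # compile and load the package from source without installing it
arrow_table(mtcars)    # quick smoke test that the freshly loaded package works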
features enabled built conditionally build time. change C++ code R package, need set ARROW_R_DEV environment variable true (optionally, add ~/.Renviron file persist across sessions) data-raw/codegen.R file used code generation. Makefile commands also handles automatically. use Google C++ style C++ code. easiest way accomplish use editors/IDE formats code . Many popular editors/IDEs support running clang-format C++ files save . Installing/enabling appropriate plugin may save much frustration. Check style errors Fix style issues committing lint script requires Python 3 clang-format. command isn’t found, can explicitly provide path like: can see version clang-format required following command: Note lint script requires Python 3 Python dependencies (note `cmake_format pinned specific version): autopep8 flake8 cmake_format==0.5.2","code":"./lint.sh ./lint.sh --fix CLANG_FORMAT=/opt/llvm/bin/clang-format ./lint.sh (. ../.env && echo ${CLANG_TOOLS})"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"running-tests","dir":"Articles > Developers","previous_headings":"","what":"Running tests","title":"Developer workflows","text":"Tests can run either using devtools::test() Makefile alternative. tests conditionally enabled based availability certain features package build (S3 support, compression libraries, etc.). Others generally skipped default can enabled environment variables settings: tests skipped Linux package builds without C++ libarrow. make build fail libarrow available (, test C++ build successful), set TEST_R_WITH_ARROW=true tests disabled unless ARROW_R_DEV=true Tests require allocating >2GB memory test Large types disabled unless ARROW_LARGE_MEMORY_TESTS=true Integration tests real S3 bucket disabled unless credentials set AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; available request S3 tests using MinIO locally enabled minio server process found running. ’re running MinIO custom settings, can set MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_PORT override defaults.","code":"# Run the test suite, optionally filtering file names devtools::test(filter=\"^regexp$\") # or the Makefile alternative from the arrow/r directory in a shell: make test file=regexp"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"running-checks","dir":"Articles > Developers","previous_headings":"","what":"Running checks","title":"Developer workflows","text":"can run package checks using devtools::check() check test coverage covr::package_coverage(). full package validation, can run following commands terminal.","code":"# All package checks devtools::check() # See test coverage statistics covr::report() covr::package_coverage() R CMD build . R CMD check arrow_*.tar.gz --as-cran"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"running-extended-ci-checks","dir":"Articles > Developers","previous_headings":"","what":"Running extended CI checks","title":"Developer workflows","text":"pull request, actions can trigger commenting PR. extended CI checks run nightly can also requested -demand using internal tool called crossbow. 
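As a sketch of opting in to some of the optional test suites described above before running a subset of the tests (the filter value is only an example):
Sys.setenv(ARROW_R_DEV = "true", TEST_R_WITH_ARROW = "true")
devtools::test(filter = "dataset")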
important GitHub comment commands shown .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"run-all-extended-r-ci-tasks","dir":"Articles > Developers","previous_headings":"Running extended CI checks","what":"Run all extended R CI tasks","title":"Developer workflows","text":"runs R-related CI tasks.","code":"@github-actions crossbow submit -g r"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"run-a-specific-task","dir":"Articles > Developers","previous_headings":"Running extended CI checks","what":"Run a specific task","title":"Developer workflows","text":"See r: group definition near beginning crossbow configuration list glob expression patterns match names items tasks: list .","code":"@github-actions crossbow submit {task-name}"},{"path":"https://arrow.apache.org/docs/r/articles/developers/workflow.html","id":"run-linting-and-documentation-building-tasks","dir":"Articles > Developers","previous_headings":"Running extended CI checks","what":"Run linting and documentation building tasks","title":"Developer workflows","text":"run fix lint C++ linting errors, run R documentation (among cleanup tasks), run styler changed R code, commit resulting updates branch.","code":"@github-actions autotune"},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"walkthrough","dir":"Articles > Developers","previous_headings":"","what":"Walkthrough","title":"Writing dplyr bindings","text":"Imagine writing bindings C++ function starts_with() want bind (base) R function startsWith(). First, take look docs functions.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"examining-the-r-function","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Examining the R function","title":"Writing dplyr bindings","text":"docs R’s startsWith() (also available https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html) takes 2 parameters; x - input, prefix - characters check x starts .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"examining-the-c-function","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Examining the C++ function","title":"Writing dplyr bindings","text":"Now, go compute function documentation look Arrow C++ library’s starts_with() function: docs show starts_with() unary function, means takes single data input. data input must string-like class, returned value boolean, match R’s startsWith(). options class associated starts_with() - called MatchSubstringOptions - let’s take look . Options classes allow user control behaviour function. case, two possible options can supplied - pattern ignore_case, described docs shown .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"comparing-the-r-and-c-functions","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Comparing the R and C++ functions","title":"Writing dplyr bindings","text":"conclusions can drawn ’ve seen far? Base R’s startsWith() Arrow’s starts_with() operate equivalent data types, return equivalent data types, options implemented R Arrow doesn’t , fairly simple map without great deal extra work. starts_with() options class associated , ’ll need make sure ’s linked R code. 
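As a concrete reference for the behaviour the binding has to reproduce, base R's startsWith() is vectorised over its first argument:
startsWith(c("Foo", "bar", "baz", "qux"), "b")
## [1] FALSE  TRUE  TRUE FALSE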
case ’re wondering difference arguments R options Arrow, R, arguments functions can include actual data analysed well options governing function works, whereas C++ compute functions, arguments data analysed options specifying exactly function works. let’s get started.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-1---add-unit-tests","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Step 1 - add unit tests","title":"Writing dplyr bindings","text":"recommend test-driven-development approach - write failing tests first, check fail, write code needed make pass. Thinking -front behavior needs testing can make easier reason code needs writing later. Look R function want bind compute kernel , write set unit tests use dplyr pipeline compare_dplyr_binding() (perhaps even compare_dplyr_error() necessary. functions compare output original function dplyr bindings make sure match. recommend looking documentation next source code functions get better understanding work. make sure ’re testing parameters R function tests. possible example test startsWith().","code":"test_that(\"startsWith behaves identically in dplyr and Arrow\", { df <- tibble(x = c(\"Foo\", \"bar\", \"baz\", \"qux\")) compare_dplyr_binding( .input %>% filter(startsWith(x, \"b\")) %>% collect(), df ) })"},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-2---hook-up-the-compute-function-with-options-class-if-necessary","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Step 2 - Hook up the compute function with options class if necessary","title":"Writing dplyr bindings","text":"C++ compute function can options specified, make sure function linked options class make_compute_options() file arrow/r/src/compute.cpp. can find compute function requires options looking docs : https://arrow.apache.org/docs/cpp/compute.html case starts_with(), looks something like : can usually copy paste similar existing example. case, option ignore_case doesn’t map parameters startsWith(), give default value false ’s set, use set value instead. pattern argument maps directly prefix startsWith() can pass straight .","code":"if (func_name == \"starts_with\") { using Options = arrow::compute::MatchSubstringOptions; bool ignore_case = false; if (!Rf_isNull(options[\"ignore_case\"])) { ignore_case = cpp11::as_cpp<bool>(options[\"ignore_case\"]); } return std::make_shared<Options>(cpp11::as_cpp<std::string>(options[\"pattern\"]), ignore_case); }"},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-3---map-the-r-function-to-the-c-kernel","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Step 3 - Map the R function to the C++ kernel","title":"Writing dplyr bindings","text":"next task writing code binds R function C++ kernel.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-3a---see-if-direct-mapping-is-appropriate","dir":"Articles > Developers","previous_headings":"Walkthrough > Step 3 - Map the R function to the C++ kernel","what":"Step 3a - See if direct mapping is appropriate","title":"Writing dplyr bindings","text":"Compare C++ function R function. simple functions options, might possible directly map C++ R unary_function_map, case compute functions operate single columns data, binary_function_map operate 2 columns data. 
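As an aside, the options hookup from Step 2 can be exercised directly from R with call_function() (introduced in Step 3b below), for example to confirm that ignore_case is honoured; the input strings here are arbitrary:
call_function(
  "starts_with",
  Array$create(c("Apache", "arrow")),
  options = list(pattern = "A", ignore_case = TRUE)
)
## Array
## <bool>
## [
##   true,
##   true
## ]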
startsWith() requires options, direct mapping appropriate.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-3b---if-direct-mapping-not-possible-try-a-modified-implementation","dir":"Articles > Developers","previous_headings":"Walkthrough > Step 3 - Map the R function to the C++ kernel","what":"Step 3b - If direct mapping not possible, try a modified implementation","title":"Writing dplyr bindings","text":"function mapped directly, extra work may needed ensure calling arrow version function results result calling R version function. case, function need adding nse_funcs function registry. might look startsWith(): source files, register_binding() calls wrapped functions called package load. separated files based subject matter (e.g., R/dplyr-funcs-math.R, R/dplyr-funcs-string.R): find closest analog function whose binding defined define new binding similar location. example, binding startsWith() registered dplyr-funcs-string.R next binding endsWith(). Note: use namespace-qualified name (.e. \"base::startsWith\") binding. register binding startsWith() base::startsWith(), allow us use pkg:: prefix call. Hint: can use call_function() call compute function directly R. might useful want experiment compute function ’re writing bindings , e.g.","code":"register_binding(\"base::startsWith\", function(x, prefix) { Expression$create( \"starts_with\", x, options = list(pattern = prefix) ) }) arrow_table(starwars) %>% filter(stringr::str_detect(name, \"Darth\")) ## Table (query) ## name: string ## height: int32 ## mass: double ## hair_color: string ## skin_color: string ## eye_color: string ## birth_year: double ## sex: string ## gender: string ## homeworld: string ## species: string ## films: list<item: string> ## vehicles: list<item: string> ## starships: list<item: string> ## ## * Filter: match_substring_regex(name, {pattern=\"Darth\", ignore_case=false}) ## See $.data for the source Arrow object call_function( \"starts_with\", Array$create(c(\"Apache\", \"Arrow\", \"R\", \"package\")), options = list(pattern = \"A\") ) ## Array ## <bool> ## [ ## true, ## true, ## false, ## false ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/developers/writing_bindings.html","id":"step-4---run-and-potentially-add-to-your-tests-","dir":"Articles > Developers","previous_headings":"Walkthrough","what":"Step 4 - Run (and potentially add to) your tests.","title":"Writing dplyr bindings","text":"process implementing function, need least one test make sure binding works future changes Arrow R package don’t break ! Bindings tested files correspond file defined (e.g., startsWith() tested tests/testthat/test-dplyr-funcs-string.R) next tests endsWith(). may end implementing tests, example discover unusual edge cases. fine - add ones wrote originally, run . pass, ’re done can submit PR. ’ve modified C++ code R package (example, hooking binding options class), make sure run arrow/r/lint.sh lint code.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developing.html","id":"package-structure-and-conventions","dir":"Articles","previous_headings":"","what":"Package structure and conventions","title":"Introduction for developers","text":"helps first outline structure package. C++ object-oriented language, core logic Arrow C++ library encapsulated classes methods. arrow R package, classes implemented R6 classes, exported namespace. order match C++ naming conventions, R6 classes named “TitleCase”, e.g. RecordBatch. 
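Returning to the startsWith() walkthrough above: once the binding is registered, it can be exercised end-to-end in a dplyr pipeline on Arrow data, mirroring the unit test from Step 1:
library(arrow)
library(dplyr)
df <- tibble::tibble(x = c("Foo", "bar", "baz", "qux"))
arrow_table(df) %>%
  filter(startsWith(x, "b")) %>%
  collect()   # a 2-row tibble containing "bar" and "baz"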
makes easy look relevant C++ implementations code documentation. simplify things R, C++ library namespaces generally dropped flattened; , C++ library arrow::io::FileOutputStream, just FileOutputStream R package. One exception file readers, namespace necessary disambiguate. arrow::csv::TableReader becomes CsvTableReader, arrow::json::TableReader becomes JsonTableReader. classes meant instantiated directly; may base classes kinds helpers. able create, use $create() method instantiate object. example, rb <- RecordBatch$create(int = 1:10, dbl = .numeric(1:10)) create RecordBatch. Many factory methods R user might often encounter also “snake_case” alias, order familiar contemporary R users. record_batch(int = 1:10, dbl = .numeric(1:10)) RecordBatch$create() . typical user arrow R package may never deal directly R6 objects. provide R-friendly wrapper functions higher-level interface C++ library. R user can call read_parquet() without knowing caring ’re instantiating ParquetFileReader object calling $ReadFile() method . classes available advanced programmer wants fine-grained control C++ library used.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developing.html","id":"approach-to-implementing-functionality","dir":"Articles","previous_headings":"","what":"Approach to implementing functionality","title":"Introduction for developers","text":"general philosophy implementing functionality match existing R function signatures may familiar users, whilst exposing additional functionality available via Arrow. intention allow users able use existing code minimal changes, new code approaches learn. number ways : implementing function R equivalent, support arguments available R version much possible - use original parameter names translate arrow parameter name inside function arrow parameters exist R function, allow user pass options necessary add extra arguments function signature feature doesn’t exist R Arrow (e.g., passing schema reading CSV dataset)","code":""},{"path":"https://arrow.apache.org/docs/r/articles/developing.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further Reading","title":"Introduction for developers","text":"-depth guide contributing Arrow, including step--step examples R package architectural overview Setting development environment, building R package components Common Arrow developer workflow tasks Running R C++ debugger attached -depth guide package installation works Using Docker diagnose bug test feature specific OS Writing bindings R functions Arrow Acero functions","code":""},{"path":"https://arrow.apache.org/docs/r/articles/flight.html","id":"prerequisites","dir":"Articles","previous_headings":"","what":"Prerequisites","title":"Connecting to a Flight server","text":"present arrow package R supply independent implementation Arrow Flight: works calling Flight methods supplied PyArrow Python, requires reticulate package Python PyArrow library installed. using first time can install like : See python integrations article details setting pyarrow.","code":"install.packages(\"reticulate\") arrow::install_pyarrow()"},{"path":"https://arrow.apache.org/docs/r/articles/flight.html","id":"example","dir":"Articles","previous_headings":"","what":"Example","title":"Connecting to a Flight server","text":"package includes methods starting Python-based Flight server, well methods connecting Flight server running elsewhere. illustrate sides, one R process ’ll start demo server: ’ll leave one running. different R process, let’s connect put data . 
Now, yet another R process, can connect server pull data put : flight_get() returns Arrow data structure, can directly pipe result dplyr workflow. See article data wrangling information working Arrow objects via dplyr interface.","code":"library(arrow) demo_server <- load_flight_server(\"demo_flight_server\") server <- demo_server$DemoFlightServer(port = 8089) server$serve() library(arrow) client <- flight_connect(port = 8089) flight_put(client, iris, path = \"test_data/iris\") library(arrow) library(dplyr) client <- flight_connect(port = 8089) client %>% flight_get(\"test_data/iris\") %>% group_by(Species) %>% summarize(max_petal = max(Petal.Length)) ## # A tibble: 3 x 2 ## Species max_petal ## <fct> <dbl> ## 1 setosa 1.9 ## 2 versicolor 5.1 ## 3 virginica 6.9"},{"path":"https://arrow.apache.org/docs/r/articles/flight.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Connecting to a Flight server","text":"specification Flight remote procedure call protocol listed Arrow project homepage Arrow C++ documentation contains list best practices Arrow Flight. detailed worked example Arrow Flight server Python provided Apache Arrow Python Cookbook.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"s3-and-gcs-support-on-linux","dir":"Articles","previous_headings":"","what":"S3 and GCS support on Linux","title":"Using cloud storage (S3, GCS)","text":"start, make sure arrow install support S3 /GCS enabled. users true default, Windows macOS binary packages hosted CRAN include S3 GCS support. can check whether support enabled via helper functions: return TRUE relevant support enabled. cases may find system support enabled. common case occurs Linux installing arrow source. situation S3 GCS support always enabled default, additional system requirements involved. See installation article details resolve .","code":"arrow_with_s3() arrow_with_gcs()"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"connecting-to-cloud-storage","dir":"Articles","previous_headings":"","what":"Connecting to cloud storage","title":"Using cloud storage (S3, GCS)","text":"One way working filesystems create ?FileSystem objects. ?S3FileSystem objects can created s3_bucket() function, automatically detects bucket’s AWS region. Similarly, ?GcsFileSystem objects can created gs_bucket() function. resulting FileSystem consider paths relative bucket’s path (example don’t need prefix bucket path listing directory). FileSystem object, can point specific files $path() method pass result file readers writers (read_parquet(), write_feather(), et al.). Often reason users work cloud storage real world analysis access large data sets. example discussed datasets article, new users may prefer work much smaller data set learning arrow cloud storage interface works. end, examples article rely multi-file Parquet dataset stores copy diamonds data made available ggplot2 package, documented help(\"diamonds\", package = \"ggplot2\"). cloud storage version data set consists 5 Parquet files totaling less 1MB size. diamonds data set hosted S3 GCS, bucket named voltrondata-labs-datasets. create S3FileSystem object refers bucket, use following command: GCS version data, command follows: Note anonymous = TRUE required GCS credentials configured. Within bucket folder called diamonds. can call bucket$ls(\"diamonds\") list files stored folder, bucket$ls(\"diamonds\", recursive = TRUE) recursively search subfolders. 
Note GCS, always set recursive = TRUE directories often don’t appear results. ’s get list files stored GCS bucket: 5 Parquet files , one corresponding “cut” categories diamonds data set. can specify path specific file calling bucket$path(): can use read_parquet() read path directly R: Note slower read file local.","code":"bucket <- s3_bucket(\"voltrondata-labs-datasets\") bucket <- gs_bucket(\"voltrondata-labs-datasets\", anonymous = TRUE) bucket$ls(\"diamonds\", recursive = TRUE) ## [1] \"diamonds/cut=Fair/part-0.parquet\" ## [2] \"diamonds/cut=Good/part-0.parquet\" ## [3] \"diamonds/cut=Ideal/part-0.parquet\" ## [4] \"diamonds/cut=Premium/part-0.parquet\" ## [5] \"diamonds/cut=Very Good/part-0.parquet\" parquet_good <- bucket$path(\"diamonds/cut=Good/part-0.parquet\") diamonds_good <- read_parquet(parquet_good) diamonds_good ## # A tibble: 4,906 × 9 ## carat color clarity depth table price x y z ## <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 E VS1 56.9 65 327 4.05 4.07 2.31 ## 2 0.31 J SI2 63.3 58 335 4.34 4.35 2.75 ## 3 0.3 J SI1 64 55 339 4.25 4.28 2.73 ## 4 0.3 J SI1 63.4 54 351 4.23 4.29 2.7 ## 5 0.3 J SI1 63.8 56 351 4.23 4.26 2.71 ## 6 0.3 I SI2 63.3 56 351 4.26 4.3 2.71 ## 7 0.23 F VS1 58.2 59 402 4.06 4.08 2.37 ## 8 0.23 E VS1 64.1 59 402 3.83 3.85 2.46 ## 9 0.31 H SI1 64 54 402 4.29 4.31 2.75 ## 10 0.26 D VS2 65.2 56 403 3.99 4.02 2.61 ## # … with 4,896 more rows ## # ℹ Use `print(n = ...)` to see more rows"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"connecting-directly-with-a-uri","dir":"Articles","previous_headings":"","what":"Connecting directly with a URI","title":"Using cloud storage (S3, GCS)","text":"use cases, easiest natural way connect cloud storage arrow use FileSystem objects returned s3_bucket() gs_bucket(), especially multiple file operations required. However, cases may want download file directly specifying URI. permitted arrow, functions like read_parquet(), write_feather(), open_dataset() etc accept URIs cloud resources hosted S3 GCS. format S3 URI follows: GCS, URI format looks like : example, Parquet file storing “good cut” diamonds downloaded earlier article available S3 CGS. relevant URIs follows: Note “anonymous” required GCS public buckets. Regardless version use, can pass URI read_parquet() file stored locally: URIs accept additional options query parameters (part ?) passed configure underlying file system. separated &. example, equivalent : tell S3FileSystem object allow creation new buckets talk Google Storage instead S3. latter works GCS implements S3-compatible API – see File systems emulate S3 – want better support GCS refer GcsFileSystem using URI starts gs://. Also note parameters URI need percent encoded, :// written %3A%2F%2F. S3, following options can included URI query parameters region, scheme, endpoint_override, access_key, secret_key, allow_bucket_creation, allow_bucket_deletion. GCS, supported parameters scheme, endpoint_override, retry_limit_seconds. GCS, useful option retry_limit_seconds, sets number seconds request may spend retrying returning error. 
current default 15 minutes, many interactive contexts ’s nice set lower value:","code":"s3://[access_key:secret_key@]bucket/path[?region=] gs://[access_key:secret_key@]bucket/path gs://anonymous@bucket/path uri <- \"s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet\" uri <- \"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet\" df <- read_parquet(uri) s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true bucket <- S3FileSystem$create( endpoint_override=\"https://storage.googleapis.com\", allow_bucket_creation=TRUE ) bucket$path(\"voltrondata-labs-datasets/\") gs://anonymous@voltrondata-labs-datasets/diamonds/?retry_limit_seconds=10"},{"path":[]},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"s3-authentication","dir":"Articles","previous_headings":"Authentication","what":"S3 Authentication","title":"Using cloud storage (S3, GCS)","text":"access private S3 buckets, need typically need two secret parameters: access_key, like user id, secret_key, like token password. options passing credentials: Include URI, like s3://access_key:secret_key@bucket-name/path//file. sure URL-encode secrets contain special characters like “/” (e.g., URLencode(\"123/456\", reserved = TRUE)). Pass access_key secret_key S3FileSystem$create() s3_bucket() Set environment variables named AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY, respectively. Define ~/.aws/credentials file, according AWS documentation. Use AccessRole temporary access passing role_arn identifier S3FileSystem$create() s3_bucket().","code":""},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"gcs-authentication","dir":"Articles","previous_headings":"Authentication","what":"GCS Authentication","title":"Using cloud storage (S3, GCS)","text":"simplest way authenticate GCS run gcloud command setup application default credentials: manually configure credentials, can pass either access_token expiration, using temporary tokens generated elsewhere, json_credentials, reference downloaded credentials file. haven’t configured credentials, access public buckets, must pass anonymous = TRUE anonymous user URI:","code":"gcloud auth application-default login bucket <- gs_bucket(\"voltrondata-labs-datasets\", anonymous = TRUE) fs <- GcsFileSystem$create(anonymous = TRUE) df <- read_parquet(\"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet\")"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"using-a-proxy-server","dir":"Articles","previous_headings":"","what":"Using a proxy server","title":"Using cloud storage (S3, GCS)","text":"need use proxy server connect S3 bucket, can provide URI form http://user:password@host:port proxy_options. example, local proxy server running port 1316 can used like :","code":"bucket <- s3_bucket( bucket = \"voltrondata-labs-datasets\", proxy_options = \"http://localhost:1316\" )"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"file-systems-that-emulate-s3","dir":"Articles","previous_headings":"","what":"File systems that emulate S3","title":"Using cloud storage (S3, GCS)","text":"S3FileSystem machinery enables work file system provides S3-compatible interface. example, MinIO object-storage server emulates S3 API. run minio server locally default settings, connect arrow using S3FileSystem like : , URI, (Note URL escaping : endpoint_override). 
Among applications, can useful testing code locally running remote S3 bucket.","code":"minio <- S3FileSystem$create( access_key = \"minioadmin\", secret_key = \"minioadmin\", scheme = \"http\", endpoint_override = \"localhost:9000\" ) s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"disabling-environment-variables","dir":"Articles","previous_headings":"","what":"Disabling environment variables","title":"Using cloud storage (S3, GCS)","text":"mentioned , possible make use environment variables configure access. However, wish pass connection details via URI alternative methods also existing AWS environment variables defined, may interfere session. example, may see error message like: can unset environment variables using Sys.unsetenv(), example: default, AWS SDK tries retrieve metadata user configuration, can cause conflicts passing connection details via URI (example accessing MINIO bucket). disable use AWS environment variables, can set environment variable AWS_EC2_METADATA_DISABLED TRUE.","code":"Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name Sys.unsetenv(\"AWS_DEFAULT_REGION\") Sys.unsetenv(\"AWS_S3_ENDPOINT\") Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)"},{"path":"https://arrow.apache.org/docs/r/articles/fs.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Using cloud storage (S3, GCS)","text":"learn FileSystem classes, including S3FileSystem GcsFileSystem, see help(\"FileSystem\", package = \"arrow\"). see data analysis example relies data hosted cloud storage, see dataset article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"background","dir":"Articles","previous_headings":"","what":"Background","title":"Installing on Linux","text":"Apache Arrow project implemented multiple languages, R package depends Arrow C++ library (referred libarrow). means install arrow, need R C++ versions. install arrow CRAN machine running Windows macOS, call install.packages(\"arrow\"), precompiled binary containing R package libarrow downloaded. However, CRAN host R package binaries Linux, must choose one alternative approaches. article outlines recommend approaches installing arrow Linux, starting simplest least customizable complex flexibility customize installation. primary audience document arrow R package users Linux, Arrow developers. Additional resources developers listed end article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"system-dependencies","dir":"Articles","previous_headings":"","what":"System dependencies","title":"Installing on Linux","text":"arrow package designed work minimal system requirements, things note.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"compilers","dir":"Articles","previous_headings":"System dependencies","what":"Compilers","title":"Installing on Linux","text":"version 10.0.0, arrow requires C++17 compiler build. gcc, generally means version 7 newer. contemporary Linux distributions new enough compiler; however, CentOS 7 notable exception, ships gcc 4.8. CentOS 7, build arrow need install newer devtoolset, ’ll need update R’s Makevars define CXX17 variables. script installs devtoolset-8 configures R able use C++17: Note C++17 compiler required build time. don’t need enable devtoolset every time load package. 
’s , install binary package RStudio Package Manager (see method 1a ), need set . Likewise, R CMD INSTALL --build arrow CentOS machine newer compilers, can take binary package produces install CentOS machine without compilers.","code":"#!/usr/bin/env bash yum install -y centos-release-scl yum install -y devtoolset-8 # Optional: also install cloud storage dependencies, as described below yum install -y libcurl-devel openssl-devel source /opt/rh/devtoolset-8/enable if [ ! `R CMD config CXX17` ]; then mkdir -p ~/.R echo \"CC = $(which gcc) -fPIC\" >> ~/.R/Makevars echo \"CXX17 = $(which g++) -fPIC\" >> ~/.R/Makevars echo \"CXX17STD = -std=c++17\" >> ~/.R/Makevars echo \"CXX17FLAGS = ${CXX11FLAGS}\" >> ~/.R/Makevars fi"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"libraries","dir":"Articles","previous_headings":"System dependencies","what":"Libraries","title":"Installing on Linux","text":"Optional support reading cloud storage–AWS S3 Google Cloud Storage (GCS)–requires additional system dependencies: CURL: install libcurl-devel (rpm) libcurl4-openssl-dev (deb) OpenSSL >= 1.0.2: install openssl-devel (rpm) libssl-dev (deb) prebuilt binaries come S3 GCS support enabled, need meet system requirements order use . ’re building everything source, install script check presence dependencies turn S3 GCS support build prerequisites met–installation succeed without S3 GCS functionality. afterwards install missing system requirements, ’ll need reinstall package order enable S3 GCS support.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"install-release-version-easy-way","dir":"Articles","previous_headings":"","what":"Install release version (easy way)","title":"Installing on Linux","text":"macOS Windows, run install.packages(\"arrow\") install arrow CRAN, get R binary package contains precompiled version libarrow. Installing binaries much easier installing source, CRAN host binaries Linux. means default behaviour run install.packages() Linux retrieve source version R package compile R package libarrow source. ’ll talk scenario next section (“less easy” way), first ’ll suggest two faster alternatives usually much easier.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"binary-r-package-with-libarrow-binary-via-rspmconda","dir":"Articles","previous_headings":"Install release version (easy way)","what":"Binary R package with libarrow binary via RSPM/conda","title":"Installing on Linux","text":"want quicker installation process, default fully-featured build, install arrow RStudio’s public package manager, hosts binaries Windows Linux. example, using Ubuntu 20.04 (Focal): Note User Agent header must specified example . Please check RStudio Package Manager: Admin Guide details. Linux distributions, get relevant URL, can visit RSPM site, click ‘binary’, select preferred distribution. 
Similarly, use conda manage R environment, can get latest official release R package including libarrow via:","code":"options( HTTPUserAgent = sprintf( \"R/%s R (%s)\", getRversion(), paste(getRversion(), R.version[\"platform\"], R.version[\"arch\"], R.version[\"os\"]) ) ) install.packages(\"arrow\", repos = \"https://packagemanager.rstudio.com/all/__linux__/focal/latest\") # Using the --strict-channel-priority flag on `conda install` causes very long # solve times, so we add it directly to the config conda config --set channel_priority strict conda install -c conda-forge r-arrow"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"r-source-package-with-libarrow-binary","dir":"Articles","previous_headings":"Install release version (easy way)","what":"R source package with libarrow binary","title":"Installing on Linux","text":"Another way achieving faster installation key features enabled use static libarrow binaries host. used automatically many Linux distributions (x86_64 architecture ), according allowlist. distribution isn’t list, can opt-setting NOT_CRAN environment variable call install.packages(): installs source version R package, installation process check compatible libarrow binaries host use available. binary available can’t found, option falls back onto method 2 (full source build), setting environment variable results fully-featured build default. libarrow binaries include support AWS S3 GCS, require libcurl openssl libraries installed separately, noted . don’t installed, libarrow binary won’t used, fall back full source build (S3 GCS support disabled). internet access computer doesn’t allow downloading libarrow binaries (e.g. access limited CRAN), can first identify right source version trying install offline computer: can obtain libarrow binaries (using computer internet access) transfer zip file target computer. Now just tell installer use pre-downloaded file:","code":"Sys.setenv(\"NOT_CRAN\" = \"true\") install.packages(\"arrow\") Sys.setenv(\"NOT_CRAN\" = \"true\", \"LIBARROW_BUILD\" = FALSE, \"ARROW_R_DEV\" = TRUE) install.packages(\"arrow\") # This will fail if no internet access, but will print the binaries URL # Watchout: release numbers of the pre-downloaded libarrow must match CRAN! Sys.setenv(\"ARROW_DOWNLOADED_BINARIES\" = \"/path/to/downloaded/libarrow.zip\") install.packages(\"arrow\")"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"install-release-version-less-easy","dir":"Articles","previous_headings":"","what":"Install release version (less easy)","title":"Installing on Linux","text":"“less easy” way install arrow install R package underlying Arrow C++ library (libarrow) source. method somewhat difficult compiling installing R packages C++ dependencies generally requires installing system packages, may privileges , /building C++ dependencies separately, introduces sorts additional ways things go wrong. Installing full source build arrow, compiling C++ R bindings, handle dependency management , much slower using binaries. However, using binaries isn’t option ,wish customize Linux installation, instructions section explain .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"basic-configuration","dir":"Articles","previous_headings":"Install release version (less easy)","what":"Basic configuration","title":"Installing on Linux","text":"wish install libarrow source instead looking pre-compiled binaries, can set LIBARROW_BINARY variable. 
default, set TRUE, libarrow built source environment variable set FALSE compatible binary OS can found. compiling libarrow source, power really fine-tune features install. can set environment variable LIBARROW_MINIMAL FALSE enable full-featured build including S3 support alternative memory allocators. default variable unset, builds many commonly used features Parquet support disables features costly build, like S3 GCS support. set TRUE, trimmed-version arrow installed optional features disabled. Note guide, seen us mention environment variable NOT_CRAN - convenience variable, set TRUE, automatically sets LIBARROW_MINIMAL FALSE LIBARROW_BINARY TRUE. Building libarrow source requires time resources installing binary. recommend set environment variable ARROW_R_DEV TRUE verbose output installation process anything goes wrong. set variables, call install.packages() install arrow using configuration. section discusses environment variables can set calling install.packages(\"arrow\") build source customise configuration.","code":"Sys.setenv(\"LIBARROW_BINARY\" = FALSE) Sys.setenv(\"LIBARROW_MINIMAL\" = FALSE) Sys.setenv(\"ARROW_R_DEV\" = TRUE) install.packages(\"arrow\")"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"handling-libarrow-dependencies","dir":"Articles","previous_headings":"Install release version (less easy) > Basic configuration","what":"Handling libarrow dependencies","title":"Installing on Linux","text":"build libarrow source, dependencies automatically downloaded. environment variable ARROW_DEPENDENCY_SOURCE controls whether libarrow installation also downloads installs dependencies (set BUNDLED), uses system-installed dependencies (set SYSTEM) checks system-installed dependencies first installs dependencies aren’t already present (set AUTO, default). dependencies vary platform; however, wish install prior libarrow installation, recommend take look docker file whichever CI builds (ones ending “cpp” building Arrow’s C++ libraries, aka libarrow) corresponds closely setup. contain --date information dependencies minimum versions. downloading dependencies build time option, building system disconnected behind firewall, options. See “Offline builds” .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"dependencies-for-s3-and-gcs-support","dir":"Articles","previous_headings":"Install release version (less easy) > Basic configuration","what":"Dependencies for S3 and GCS support","title":"Installing on Linux","text":"Support working data S3 GCS enabled default source build, additional system requirements described . enable , set environment variable LIBARROW_MINIMAL=false NOT_CRAN=true choose full-featured build, selectively set ARROW_S3=/ARROW_GCS=. either feature enabled, install script check presence required dependencies, prerequisites met, turn S3 GCS support–installation succeed without S3 GCS functionality. 
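As a combined sketch of the configuration variables described above, including ARROW_DEPENDENCY_SOURCE, which has no example of its own:
Sys.setenv(
  LIBARROW_BINARY = "false",            # skip prebuilt libarrow binaries
  LIBARROW_MINIMAL = "false",           # enable optional features such as S3 and GCS
  ARROW_DEPENDENCY_SOURCE = "BUNDLED",  # download and build third-party C++ dependencies too
  ARROW_R_DEV = "true"                  # verbose output during the build
)
install.packages("arrow")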
afterwards install missing system requirements, ’ll need reinstall package order enable S3 GCS support.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"advanced-configuration","dir":"Articles","previous_headings":"Install release version (less easy)","what":"Advanced configuration","title":"Installing on Linux","text":"section, describe fine-tune installation granular level.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"libarrow-configuration","dir":"Articles","previous_headings":"Install release version (less easy) > Advanced configuration","what":"libarrow configuration","title":"Installing on Linux","text":"features optional build Arrow source - can configure whether components built via use environment variables. names environment variables control features default values shown .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"r-package-configuration","dir":"Articles","previous_headings":"Install release version (less easy) > Advanced configuration","what":"R package configuration","title":"Installing on Linux","text":"number variables affect configure script bundled build script. boolean variables case-insensitive. See -depth explanations environment variables. LIBARROW_BINARY : default many distributions, explicitly set true, script determine whether prebuilt libarrow work system. can set false skip option altogether, can specify string “distro-version” corresponds binary available, override function may discover default. Possible values : “linux-openssl-1.0”, “linux-openssl-1.1”, “linux-openssl-3.0”. LIBARROW_BUILD : set false, build script attempt build C++ source. means get working arrow R package prebuilt binary found. Use want avoid compiling C++ library, may slow resource-intensive, ensure use prebuilt binary. LIBARROW_MINIMAL : set false, build script enable optional features, including S3 support additional alternative memory allocators. increase source build time results fully functional library. set true turns Parquet, Datasets, compression libraries, optional features. commonly used may helpful needing compile platform support features, e.g. Solaris. NOT_CRAN : variable set true, devtools package , build script set LIBARROW_BINARY=true LIBARROW_MINIMAL=false unless environment variables already set. provides complete fast installation experience users already NOT_CRAN=true part workflow, without requiring additional environment variables set. ARROW_R_DEV : set true, verbose messaging printed build script. arrow::install_arrow(verbose = TRUE) sets . variable also needed ’re modifying C++ code package: see developer guide article. ARROW_USE_PKG_CONFIG: set false, configure script won’t look Arrow libraries system instead look download/build . Use version mismatch installed system libraries version R package ’re installing. LIBARROW_DEBUG_DIR : C++ library building source fails (cmake), may messages telling check log file build directory. However, library built R package installation, location temp directory already deleted. capture logs, set variable absolute (relative) path log files copied . directory created exist. 
CMAKE : building C++ library source, can specify /path//cmake use different version whatever found $PATH.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"using-install_arrow","dir":"Articles","previous_headings":"","what":"Using install_arrow()","title":"Installing on Linux","text":"previous instructions useful fresh arrow installation, arrow provides function install_arrow(). three common use cases function: arrow installed want upgrade different version want try reinstall fix issues Linux C++ binaries want install development build Examples using install_arrow() shown : Although function part arrow package, also available standalone script, can access without first installing package: Notes: install_arrow() require environment variables set order satisfy C++ dependencies. unlike packages like tensorflow, blogdown, others require external dependencies, need run install_arrow() successful arrow installation.","code":"install_arrow() # latest release install_arrow(nightly = TRUE) # install development version install_arrow(verbose = TRUE) # verbose output to debug install errors source(\"https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R\")"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"offline-installation","dir":"Articles","previous_headings":"","what":"Offline installation","title":"Installing on Linux","text":"install-arrow.R file mentioned previous section includes function called create_package_with_all_dependencies(). Normally, installing computer internet access, build process download third-party dependencies needed. function provides way download advance, can useful installing Arrow computer without internet access. process follows: Step 1. Using computer internet access, download dependencies: Install arrow package source script directly using following command: Use create_package_with_all_dependencies() function create installation bundle: Copy newly created my_arrow_pkg.tar.gz file computer without internet access Step 2. computer without internet access, install prepared package: Install arrow package copied file: installation build source, cmake must available Run arrow_info() check installed capabilities Notes: arrow can installed computer without internet access without using function, many useful features disabled, depend third-party components. precisely, arrow::arrow_info()$capabilities() FALSE every capability. using binary packages shouldn’t need function. can download appropriate binary package repository, transfer offline computer, install . ’re using RStudio Package Manager Linux (RSPM), want make source bundle function, make sure set first repository options(\"repos\") mirror contains source packages. , repository needs something RSPM binary mirror URLs.","code":"source(\"https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R\") create_package_with_all_dependencies(\"my_arrow_pkg.tar.gz\") install.packages( \"my_arrow_pkg.tar.gz\", dependencies = c(\"Depends\", \"Imports\", \"LinkingTo\") )"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"offline-installation-alternative","dir":"Articles","previous_headings":"","what":"Offline installation (alternative)","title":"Installing on Linux","text":"second method offline installation little hands-. 
Follow steps wish try : Download dependency files (cpp/thirdparty/download_dependencies.sh may helpful) Copy directory dependencies offline computer Create environment variable ARROW_THIRDPARTY_DEPENDENCY_DIR offline computer, pointing copied directory. Install arrow package usual. offline installation using libarrow binaries, see Method 1b .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"troubleshooting","dir":"Articles","previous_headings":"","what":"Troubleshooting","title":"Installing on Linux","text":"intent install.packages(\"arrow\") just work handle C++ dependencies, depending system, may better results tune one several parameters. known complications ways address .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"package-failed-to-build-c-dependencies","dir":"Articles","previous_headings":"Troubleshooting","what":"Package failed to build C++ dependencies","title":"Installing on Linux","text":"see message like output package fails install, means installation failed retrieve build libarrow version compatible current version R package. Please check “Known installation issues” see apply, none apply, set environment variable ARROW_R_DEV=TRUE verbose output try installing . , please report issue include full installation output.","code":"------------------------- NOTE --------------------------- There was an issue preparing the Arrow C++ libraries. See https://arrow.apache.org/docs/r/articles/install.html ---------------------------------------------------------"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"using-system-libraries","dir":"Articles","previous_headings":"Troubleshooting","what":"Using system libraries","title":"Installing on Linux","text":"system library installed Arrow found doesn’t match R package version (example, libarrow 1.0.0 system installing R package 2.0.0), likely R bindings fail compile. Apache Arrow project active development, essential versions libarrow R package matches. install.packages(\"arrow\") download libarrow, install script ensures fetch libarrow version corresponds R package version. However, using version libarrow already system, version match isn’t guaranteed. fix version mismatch, can either update libarrow system packages match R package version, set environment variable ARROW_USE_PKG_CONFIG=FALSE tell configure script look system version libarrow. (latter default install_arrow().) System libarrow versions available corresponding CRAN releases nightly dev versions, depending R package version ’re installing, system libarrow version may option. Note also working R package installation based system (shared) libraries, update system libarrow installation, ’ll need reinstall R package match version. Similarly, ’re using libarrow system libraries, running update.packages() new release arrow package likely fail unless first update libarrow system packages.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"using-prebuilt-binaries","dir":"Articles","previous_headings":"Troubleshooting","what":"Using prebuilt binaries","title":"Installing on Linux","text":"R package finds downloads prebuilt binary libarrow, arrow package can’t loaded, perhaps “undefined symbols” errors, please report issue. likely compiler mismatch may resolvable setting environment variables instruct R compile packages match libarrow. 
workaround set environment variable LIBARROW_BINARY=FALSE retry installation: value instructs package build libarrow source instead downloading prebuilt binary. guarantee compiler settings match. prebuilt libarrow binary wasn’t found operating system think , please report issue share console output. may also set environment variable ARROW_R_DEV=TRUE additional debug messages.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"building-libarrow-from-source","dir":"Articles","previous_headings":"Troubleshooting","what":"Building libarrow from source","title":"Installing on Linux","text":"building libarrow source fails, check error message. (don’t see error message, ----- NOTE -----, set environment variable ARROW_R_DEV=TRUE increase verbosity retry installation.) install script work everywhere, libarrow fails compile, please report issue can improve script.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"known-installation-issues","dir":"Articles","previous_headings":"Troubleshooting","what":"Known installation issues","title":"Installing on Linux","text":"CentOS, building package requires modern devtoolset default system compilers. See “System dependencies” .","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"contributing","dir":"Articles","previous_headings":"","what":"Contributing","title":"Installing on Linux","text":"constantly working make installation process painless possible. find ways improve process, please report issue can document . Similarly, find Linux distribution version supported, welcome contribution Docker images (hosted Docker Hub) can use continuous integration hopefully improve coverage. contribute Docker image, minimal possible, containing R dependencies requires. reference, see images R-hub uses. can test arrow R package installation using docker-compose setup included apache/arrow git repository. example, installs arrow R package, including libarrow, rhub/ubuntu-release image.","code":"R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose build r R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose run r"},{"path":"https://arrow.apache.org/docs/r/articles/install.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Installing on Linux","text":"learn installing development versions, see article installing nightly builds. ’re contributing Arrow project, see Arrow R developers guide resources help set development environment. Arrow developers may also wish read detailed discussion code run installation process, described install details article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/install_nightly.html","id":"install-nightly-builds","dir":"Articles","previous_headings":"","what":"Install nightly builds","title":"Installing development versions","text":"Development versions package (binary source) built nightly hosted https://nightlies.apache.org/arrow/r/. nightly package builds official Apache releases recommended production use. may useful testing bug fixes new features active development. 
install arrow , use following command: Conda users can install arrow nightly builds : already version arrow installed, can switch latest nightly development version follows:","code":"install.packages(\"arrow\", repos = c(arrow = \"https://nightlies.apache.org/arrow/r\", getOption(\"repos\"))) conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow arrow::install_arrow(nightly = TRUE)"},{"path":"https://arrow.apache.org/docs/r/articles/install_nightly.html","id":"install-from-git-repository","dir":"Articles","previous_headings":"","what":"Install from git repository","title":"Installing development versions","text":"alternative way obtain development versions install R package git checkout. , type following terminal: don’t already libarrow system, installing R package source, also download build libarrow . See links build environment variables options configuring build source enabled features.","code":"git clone https://github.com/apache/arrow cd arrow/r R CMD INSTALL ."},{"path":"https://arrow.apache.org/docs/r/articles/install_nightly.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Installing development versions","text":"users looking information installing Linux, see Linux installation article. developers looking understand installation scripts, see installation details article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/metadata.html","id":"arrow-metadata-classes","dir":"Articles","previous_headings":"","what":"Arrow metadata classes","title":"Metadata","text":"arrow package defines following classes representing metadata: Schema list Field objects used describe structure tabular data object; Field specifies character string name DataType; DataType attribute controlling values represented Consider : schema automatically inferred also manually created: schema() function allows following shorthand define fields: Sometimes important specify schema manually, particularly want fine-grained control Arrow data types:","code":"df <- data.frame(x = 1:3, y = c(\"a\", \"b\", \"c\")) tb <- arrow_table(df) tb$schema ## Schema ## x: int32 ## y: string ## ## See $metadata for additional Schema metadata schema( field(name = \"x\", type = int32()), field(name = \"y\", type = utf8()) ) ## Schema ## x: int32 ## y: string schema(x = int32(), y = utf8()) ## Schema ## x: int32 ## y: string arrow_table(df, schema = schema(x = int64(), y = utf8())) ## Table ## 3 rows x 2 columns ## $x <int64> ## $y <string> ## ## See $metadata for additional Schema metadata arrow_table(df, schema = schema(x = float64(), y = utf8())) ## Table ## 3 rows x 2 columns ## $x <double> ## $y <string> ## ## See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/articles/metadata.html","id":"r-object-attributes","dir":"Articles","previous_headings":"","what":"R object attributes","title":"Metadata","text":"Arrow supports custom key-value metadata attached Schemas. convert data.frame Arrow Table RecordBatch, package stores attributes() attached columns data.frame Arrow object Schema. Attributes added objects fashion stored r key, shown : also possible assign additional string metadata key wish, using command like : Metadata attached Schema preserved writing Table Arrow/Feather Parquet formats. reading files R, calling .data.frame() Table RecordBatch, column attributes restored columns resulting data.frame. 
means custom data types, including haven::labelled, vctrs annotations, others, preserved round-trip Arrow. Note attributes stored $metadata$r understood R. write data.frame haven columns Feather file read Pandas, haven metadata won’t recognized . Similarly, Pandas writes custom metadata, R package consume. free, however, define custom metadata conventions application assign (string) values want metadata keys.","code":"# data frame with custom metadata df <- data.frame(x = 1:3, y = c(\"a\", \"b\", \"c\")) attr(df, \"df_meta\") <- \"custom data frame metadata\" attr(df$y, \"col_meta\") <- \"custom column metadata\" # when converted to a Table, the metadata is preserved tb <- arrow_table(df) tb$metadata ## $r ## $r$attributes ## $r$attributes$df_meta ## [1] \"custom data frame metadata\" ## ## ## $r$columns ## $r$columns$x ## NULL ## ## $r$columns$y ## $r$columns$y$attributes ## $r$columns$y$attributes$col_meta ## [1] \"custom column metadata\" ## ## ## $r$columns$y$columns ## NULL tb$metadata$new_key <- \"new value\""},{"path":"https://arrow.apache.org/docs/r/articles/metadata.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Metadata","text":"learn arrow metadata, see documentation schema(). learn data types, see data types article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/python.html","id":"motivation","dir":"Articles","previous_headings":"","what":"Motivation","title":"Integrating Arrow, Python, and R","text":"One reason might want use PyArrow R take advantage functionality better supported Python R current state development. example, one point time R arrow package didn’t support concat_arrays() PyArrow , good use case time. time current writing PyArrow comprehensive support Arrow Flight R package – see article Flight support arrow – another instance PyArrow benefit R users. second reason R users may want use PyArrow efficiently pass data objects R Python. large data sets, can quite costly – terms time CPU cycles – perform copy covert operations required translate native data structure R (e.g., data frame) analogous structure Python (e.g., Pandas DataFrame) vice versa. Arrow data objects Tables -memory format R Python, possible perform “zero-copy” data transfers, metadata needs passed languages. illustrated later, drastically improves performance.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/python.html","id":"installing-pyarrow","dir":"Articles","previous_headings":"","what":"Installing PyArrow","title":"Integrating Arrow, Python, and R","text":"use Arrow Python, pyarrow library needs installed. example, may wish create Python virtual environment containing pyarrow library. virtual environment specific Python installation created one project purpose. good practice use specific environments Python updating package doesn’t impact packages projects. can perform set within R. Let’s suppose want call virtual environment something like -pyarrow-env. setup code look like : want install development version pyarrow virtual environment, add nightly = TRUE install_pyarrow() command: Note don’t use virtual environments. 
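A quick sketch (assuming the virtual environment created above) to confirm from R that pyarrow is importable via reticulate before moving on:

```r
# "my-pyarrow-env" is the environment name used in the setup code above.
library(reticulate)
use_virtualenv("my-pyarrow-env")
py_module_available("pyarrow")
```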
prefer conda environments, can use setup code: learn installing configuring Python R, see reticulate documentation, discusses topic detail.","code":"virtualenv_create(\"my-pyarrow-env\") install_pyarrow(\"my-pyarrow-env\") install_pyarrow(\"my-pyarrow-env\", nightly = TRUE) conda_create(\"my-pyarrow-env\") install_pyarrow(\"my-pyarrow-env\")"},{"path":"https://arrow.apache.org/docs/r/articles/python.html","id":"importing-pyarrow","dir":"Articles","previous_headings":"","what":"Importing PyArrow","title":"Integrating Arrow, Python, and R","text":"Assuming arrow reticulate loaded R, first step make sure correct Python environment used. virtual environment, use command like : conda environment use following: done , next step import pyarrow Python session shown : Executing command R equivalent following import Python: may good idea check pyarrow version , shown : Support passing data R included pyarrow versions 0.17 greater.","code":"use_virtualenv(\"my-pyarrow-env\") use_condaenv(\"my-pyarrow-env\") pa <- import(\"pyarrow\") import pyarrow as pa pa$`__version__` ## [1] \"8.0.0\""},{"path":"https://arrow.apache.org/docs/r/articles/python.html","id":"using-pyarrow","dir":"Articles","previous_headings":"","what":"Using PyArrow","title":"Integrating Arrow, Python, and R","text":"can use reticulate function r_to_py() pass objects R Python, similarly can use py_to_r() pull objects Python session R. illustrate , let’s create two objects R: df_random R data frame containing 100 million rows random data, tb_random data stored Arrow Table: Transferring data R Python without Arrow time-consuming process underlying object copied converted Python data structure: contrast, sending Arrow Table across happens almost instantaneously: “Send”, however, isn’t really correct word. Internally, ’re passing pointers data R Python interpreters running together process, without copying anything. Nothing sent: ’re sharing accessing internal Arrow memory buffers. ’s possible send data direction also. example let’s create Array pyarrow. Notice now Array object R session – even though created Python – can apply R methods : Similarly, can combine object Arrow objects created R, can use PyArrow methods like pa$concat_arrays() : Now single Array R.","code":"set.seed(1234) nrows <- 10^8 df_random <- data.frame( x = rnorm(nrows), y = rnorm(nrows), subset = sample(10, nrows, replace = TRUE) ) tb_random <- arrow_table(df_random) system.time({ df_py <- r_to_py(df_random) }) ## user system elapsed ## 0.307 5.172 5.529 system.time({ tb_py <- r_to_py(tb_random) }) ## user system elapsed ## 0.004 0.000 0.003 a <- pa$array(c(1, 2, 3)) a ## Array ## <double> ## [ ## 1, ## 2, ## 3 ## ] a[a > 1] ## Array ## <double> ## [ ## 2, ## 3 ## ] b <- Array$create(c(5, 6, 7, 8, 9)) a_and_b <- pa$concat_arrays(list(a, b)) a_and_b ## Array ## <double> ## [ ## 1, ## 2, ## 3, ## 5, ## 6, ## 7, ## 8, ## 9 ## ]"},{"path":"https://arrow.apache.org/docs/r/articles/python.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Integrating Arrow, Python, and R","text":"learn installing configuring Python R, see reticulate documentation. learn PyArrow, see official PyArrow Documentation Apache Arrow Python Cookbook. R/Python integration Arrow also discussed PyArrow Integrations Documentation, blog post reticulate integration Arrow, blog post rpy2 integration Arrow. integration R Arrow PyArrow supported Arrow C data interface. 
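To round out the zero-copy example above, a sketch of the reverse transfer using py_to_r(), which was mentioned at the start of that section; as with r_to_py(), only metadata crosses the language boundary, so this should also be nearly instantaneous.

```r
# tb_py is the PyArrow Table created above with r_to_py(tb_random)
system.time({
  tb_back <- py_to_r(tb_py)
})
```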
learn Arrow data objects, see data objects article.","code":""},{"path":"https://arrow.apache.org/docs/r/articles/read_write.html","id":"parquet-format","dir":"Articles","previous_headings":"","what":"Parquet format","title":"Reading and writing data files","text":"Apache Parquet popular choice storing analytics data; binary format optimized reduced file sizes fast read performance, especially column-based access patterns. simplest way read write Parquet data using arrow read_parquet() write_parquet() functions. illustrate , ’ll write starwars data included dplyr Parquet file, read back . First load arrow dplyr packages: Next ’ll write data frame Parquet file located file_path: size Parquet file typically much smaller corresponding CSV file . part due use file compression: default, Parquet files written arrow package use Snappy compression options gzip also supported. See help(\"write_parquet\", package = \"arrow\") information. written Parquet file, now can read read_parquet(): default return data frame tibble. want Arrow Table instead, set as_data_frame = FALSE: One useful feature Parquet files store data column-wise, contain metadata allow file readers skip relevant sections file. means possible load subset columns without reading complete file. col_select argument read_parquet() supports functionality: Fine-grained control Parquet reader possible props argument. See help(\"ParquetArrowReaderProperties\", package = \"arrow\") details. R object attributes preserved writing data Parquet Arrow/Feather files reading files back R. enables round-trip writing reading sf::sf objects, R data frames haven::labelled columns, data frame custom attributes. learn metadata handled arrow, metadata article.","code":"library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) file_path <- tempfile() write_parquet(starwars, file_path) read_parquet(file_path) ## # A tibble: 87 x 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~ ## 2 C-3PO 167 75 NA gold yellow 112 none mascu~ ## 3 R2-D2 96 32 NA white, bl~ red 33 none mascu~ ## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~ ## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~ ## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~ ## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~ ## 8 R5-D4 97 32 NA white, red red NA none mascu~ ## 9 Biggs D~ 183 84 black light brown 24 male mascu~ ## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~ ## # i 77 more rows ## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>, ## # vehicles <list<character>>, starships <list<character>> read_parquet(file_path, as_data_frame = FALSE) ## Table ## 87 rows x 14 columns ## $name <string> ## $height <int32> ## $mass <double> ## $hair_color <string> ## $skin_color <string> ## $eye_color <string> ## $birth_year <double> ## $sex <string> ## $gender <string> ## $homeworld <string> ## $species <string> ## $films: list<element <string>> ## $vehicles: list<element <string>> ## $starships: list<element <string>> read_parquet(file_path, col_select = c(\"name\", \"height\", \"mass\")) ## # A tibble: 87 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## 7 Beru Whitesun Lars 165 75 ## 8 R5-D4 97 32 ## 9 Biggs Darklighter 183 84 ## 10 Obi-Wan Kenobi 182 77 ## # i 
77 more rows"},{"path":"https://arrow.apache.org/docs/r/articles/read_write.html","id":"arrowfeather-format","dir":"Articles","previous_headings":"","what":"Arrow/Feather format","title":"Reading and writing data files","text":"Arrow file format developed provide binary columnar serialization data frames, make reading writing data frames efficient, make sharing data across data analysis languages easy. file format sometimes referred Feather outgrowth original Feather project now moved Arrow project . can find detailed specification version 2 Arrow format – officially referred Arrow IPC file format – Arrow specification page. write_feather() function writes version 2 Arrow/Feather files default, supports multiple kinds file compression. Basic use shown : read_feather() function provides familiar interface reading feather files: Like Parquet reader, reader supports reading subset columns, can produce Arrow Table output:","code":"file_path <- tempfile() write_feather(starwars, file_path) read_feather(file_path) ## # A tibble: 87 x 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~ ## 2 C-3PO 167 75 NA gold yellow 112 none mascu~ ## 3 R2-D2 96 32 NA white, bl~ red 33 none mascu~ ## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~ ## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~ ## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~ ## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~ ## 8 R5-D4 97 32 NA white, red red NA none mascu~ ## 9 Biggs D~ 183 84 black light brown 24 male mascu~ ## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~ ## # i 77 more rows ## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>, ## # vehicles <list<character>>, starships <list<character>> read_feather( file = file_path, col_select = c(\"name\", \"height\", \"mass\"), as_data_frame = FALSE ) ## Table ## 87 rows x 3 columns ## $name <string> ## $height <int32> ## $mass <double>"},{"path":"https://arrow.apache.org/docs/r/articles/read_write.html","id":"csv-format","dir":"Articles","previous_headings":"","what":"CSV format","title":"Reading and writing data files","text":"read/write capabilities arrow package also include support CSV text-delimited files. read_csv_arrow(), read_tsv_arrow(), read_delim_arrow() functions use Arrow C++ CSV reader read data files, Arrow C++ options mapped arguments way mirrors conventions used readr::read_delim(), col_select argument inspired vroom::vroom(). simple example writing reading CSV file arrow shown : addition options provided readr-style arguments (delim, quote, escape_double, escape_backslash, etc), can use schema argument specify column types: see schema() help details. also option using parse_options, convert_options, read_options exercise fine-grained control arrow csv reader: see help(\"CsvReadOptions\", package = \"arrow\") details.","code":"file_path <- tempfile() write_csv_arrow(mtcars, file_path) read_csv_arrow(file_path, col_select = starts_with(\"d\")) ## # A tibble: 32 x 2 ## disp drat ## <dbl> <dbl> ## 1 160 3.9 ## 2 160 3.9 ## 3 108 3.85 ## 4 258 3.08 ## 5 360 3.15 ## 6 225 2.76 ## 7 360 3.21 ## 8 147. 3.69 ## 9 141. 3.92 ## 10 168. 
3.92 ## # i 22 more rows"},{"path":"https://arrow.apache.org/docs/r/articles/read_write.html","id":"json-format","dir":"Articles","previous_headings":"","what":"JSON format","title":"Reading and writing data files","text":"arrow package supports reading (writing) tabular data line-delimited JSON, using read_json_arrow() function. minimal example shown :","code":"file_path <- tempfile() writeLines(' { \"hello\": 3.5, \"world\": false, \"yo\": \"thing\" } { \"hello\": 3.25, \"world\": null } { \"hello\": 0.0, \"world\": true, \"yo\": null } ', file_path, useBytes = TRUE) read_json_arrow(file_path) ## # A tibble: 3 x 3 ## hello world yo ## <dbl> <lgl> <chr> ## 1 3.5 FALSE thing ## 2 3.25 NA NA ## 3 0 TRUE NA"},{"path":"https://arrow.apache.org/docs/r/articles/read_write.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"Reading and writing data files","text":"learn cloud storage, see cloud storage article. learn multi-file datasets, see datasets article. Apache Arrow R cookbook chapters reading writing single files memory working multi-file datasets stored -disk.","code":""},{"path":"https://arrow.apache.org/docs/r/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Neal Richardson. Author. Ian Cook. Author. Nic Crane. Author. Dewey Dunnington. Author. Romain François. Author. Jonathan Keane. Author, maintainer. Dragoș Moldovan-Grünfeld. Author. Jeroen Ooms. Author. Jacob Wujciak-Jens. Author. Javier Luraschi. Contributor. Karl Dunkle Werner. Contributor. Jeffrey Wong. Contributor. Apache Arrow. Author, copyright holder.","code":""},{"path":"https://arrow.apache.org/docs/r/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Richardson N, Cook , Crane N, Dunnington D, Fran<U+00E7>ois R, Keane J, Moldovan-Gr<U+00FC>nfeld D, Ooms J, Wujciak-Jens J, Apache Arrow (2024). arrow: Integration 'Apache' 'Arrow'. R package version 16.0.0.9000, https://arrow.apache.org/docs/r/, https://github.com/apache/arrow/.","code":"@Manual{, title = {arrow: Integration to 'Apache' 'Arrow'}, author = {Neal Richardson and Ian Cook and Nic Crane and Dewey Dunnington and Romain François and Jonathan Keane and Dragoș Moldovan-Grünfeld and Jeroen Ooms and Jacob Wujciak-Jens and {Apache Arrow}}, year = {2024}, note = {R package version 16.0.0.9000, https://arrow.apache.org/docs/r/}, url = {https://github.com/apache/arrow/}, }"},{"path":[]},{"path":"https://arrow.apache.org/docs/r/index.html","id":"overview","dir":"","previous_headings":"","what":"Overview","title":"Arrow R Package","text":"R arrow package provides access many features Apache Arrow C++ library R users. goal arrow provide Arrow C++ backend dplyr, access Arrow C++ library familiar base R tidyverse functions, R6 classes. learn Apache Arrow project, see parent documentation Arrow Project. Arrow project provides functionality wide range data analysis tasks store, process move data fast. See read/write article learn reading writing data files, data wrangling learn use dplyr syntax arrow objects, function documentation full list supported functions within dplyr queries.","code":""},{"path":"https://arrow.apache.org/docs/r/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Arrow R Package","text":"latest release arrow can installed CRAN. 
cases installing latest release work without requiring additional system dependencies, especially using Windows macOS. Alternatively, using conda can install arrow conda-forge: special cases note: macOS, R use Arrow match architecture machine using. ’re using ARM (aka M1, M2, etc.) processor use R compiled arm64. ’re using Intel based mac, use R compiled x86. Using R Arrow compiled Intel based macs ARM based mac result segfaults crashes. Linux installation process can sometimes involved CRAN host binaries Linux. information please see installation guide. compiling arrow source, please note version 10.0.0, arrow requires C++17 build. implications Windows CentOS 7. Windows users means need running R version 4.0 later. CentOS 7, means need install newer compiler default system compiler gcc. See installation details article guidance. Development versions arrow released nightly. information installl nighhtly builds please see installing nightly builds article.","code":"install.packages(\"arrow\") conda install -c conda-forge --strict-channel-priority r-arrow"},{"path":"https://arrow.apache.org/docs/r/index.html","id":"what-can-the-arrow-package-do","dir":"","previous_headings":"","what":"What can the arrow package do?","title":"Arrow R Package","text":"Arrow C++ library comprised different parts, serves specific purpose. arrow package provides binding C++ functionality wide range data analysis tasks. allows users read write data variety formats: Read write Parquet files, efficient widely used columnar format Read write Arrow (formerly known Feather) files, format optimized speed interoperability Read write CSV files excellent speed efficiency Read write multi-file larger--memory datasets Read JSON files provides access remote filesystems servers: Read write files Amazon S3 Google Cloud Storage buckets Connect Arrow Flight servers transport large datasets networks Additional features include: Manipulate analyze Arrow data dplyr verbs Zero-copy data sharing R Python Fine control column types work seamlessly databases data warehouses Toolkit building connectors applications services use Arrow","code":""},{"path":"https://arrow.apache.org/docs/r/index.html","id":"what-is-apache-arrow","dir":"","previous_headings":"","what":"What is Apache Arrow?","title":"Arrow R Package","text":"Apache Arrow cross-language development platform -memory larger--memory data. specifies standardized language-independent columnar memory format flat hierarchical data, organized efficient analytic operations modern hardware. also provides computational libraries zero-copy streaming, messaging, interprocess communication. package exposes interface Arrow C++ library, enabling access many features R. provides low-level access Arrow C++ library API higher-level access dplyr backend familiar R functions.","code":""},{"path":"https://arrow.apache.org/docs/r/index.html","id":"arrow-resources","dir":"","previous_headings":"","what":"Arrow resources","title":"Arrow R Package","text":"additional resources may find useful getting started arrow: official Arrow R package documentation Arrow R cheatsheet Apache Arrow R Cookbook R Data Science Chapter Arrow Awesome Arrow R","code":""},{"path":"https://arrow.apache.org/docs/r/index.html","id":"getting-help","dir":"","previous_headings":"","what":"Getting help","title":"Arrow R Package","text":"welcome questions, discussion, contributions users arrow package. information mailing lists venues engaging Arrow developer user communities, please see Apache Arrow Community page. 
encounter bug, please file issue minimal reproducible example GitHub issues. Log GitHub account, click New issue select type issue want create. Add meaningful title prefixed [R] followed space, issue summary select component R dropdown list. information, see Report bugs propose features section Contributing Apache Arrow page Arrow developer documentation.","code":""},{"path":"https://arrow.apache.org/docs/r/index.html","id":"code-of-conduct","dir":"","previous_headings":"","what":"Code of Conduct","title":"Arrow R Package","text":"Please note participation Apache Arrow project governed Apache Software Foundation’s code conduct.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ArrayData.html","id":null,"dir":"Reference","previous_headings":"","what":"ArrayData class — ArrayData","title":"ArrayData class — ArrayData","text":"ArrayData class allows get inspect data inside arrow::Array.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ArrayData.html","id":"usage","dir":"Reference","previous_headings":"","what":"Usage","title":"ArrayData class — ArrayData","text":"","code":"data <- Array$create(x)$data() data$type data$length data$null_count data$offset data$buffers"},{"path":"https://arrow.apache.org/docs/r/reference/ArrayData.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ArrayData class — ArrayData","text":"...","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Buffer-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Buffer class — Buffer","title":"Buffer class — Buffer","text":"Buffer object containing pointer piece contiguous memory particular size.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Buffer-class.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Buffer class — Buffer","text":"buffer() lets create arrow::Buffer R object","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Buffer-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Buffer class — Buffer","text":"$is_mutable : buffer mutable? $ZeroPadding() : zero bytes padding, .e. bytes size capacity $size : size memory, bytes $capacity: possible capacity, bytes","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Buffer-class.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Buffer class — Buffer","text":"","code":"my_buffer <- buffer(c(1, 2, 3, 4)) my_buffer$is_mutable #> [1] TRUE my_buffer$ZeroPadding() my_buffer$size #> [1] 32 my_buffer$capacity #> [1] 32"},{"path":"https://arrow.apache.org/docs/r/reference/ChunkedArray-class.html","id":null,"dir":"Reference","previous_headings":"","what":"ChunkedArray class — ChunkedArray","title":"ChunkedArray class — ChunkedArray","text":"ChunkedArray data structure managing list primitive Arrow Arrays logically one large array. Chunked arrays may grouped together Table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ChunkedArray-class.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ChunkedArray class — ChunkedArray","text":"ChunkedArray$create() factory method instantiates object various Arrays R vectors. 
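For reference, a small sketch exercising the ArrayData accessors listed above:

```r
a <- Array$create(1:5)
d <- a$data()       # an ArrayData object
d$type              # int32
d$length            # 5
d$null_count        # 0
d$offset            # 0
```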
chunked_array() alias .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ChunkedArray-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ChunkedArray class — ChunkedArray","text":"$length(): Size number elements array contains $chunk(): Extract Array chunk integer position `$nbytes() : Total number bytes consumed elements array $as_vector(): convert R vector $Slice(offset, length = NULL): Construct zero-copy slice array indicated offset length. length NULL, slice goes end array. $Take(): return ChunkedArray values positions given integers . Arrow Array ChunkedArray, coerced R vector taking. $Filter(, keep_na = TRUE): return ChunkedArray values positions logical vector Arrow boolean-type (Chunked)Array TRUE. $SortIndices(descending = FALSE): return Array integer positions can used rearrange ChunkedArray ascending descending order $cast(target_type, safe = TRUE, options = cast_options(safe)): Alter data array change type. $null_count: number null entries array $chunks: return list Arrays $num_chunks: integer number chunks ChunkedArray $type: logical type data $View(type): Construct zero-copy view ChunkedArray given type. $Validate(): Perform validation checks determine obvious inconsistencies within array's internal data. can expensive check, potentially O(length)","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/ChunkedArray-class.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"ChunkedArray class — ChunkedArray","text":"","code":"# Pass items into chunked_array as separate objects to create chunks class_scores <- chunked_array(c(87, 88, 89), c(94, 93, 92), c(71, 72, 73)) class_scores$num_chunks #> [1] 3 # When taking a Slice from a chunked_array, chunks are preserved class_scores$Slice(2, length = 5) #> ChunkedArray #> <double> #> [ #> [ #> 89 #> ], #> [ #> 94, #> 93, #> 92 #> ], #> [ #> 71 #> ] #> ] # You can combine Take and SortIndices to return a ChunkedArray with 1 chunk # containing all values, ordered. class_scores$Take(class_scores$SortIndices(descending = TRUE)) #> ChunkedArray #> <double> #> [ #> [ #> 94, #> 93, #> 92, #> 89, #> 88, #> 87, #> 73, #> 72, #> 71 #> ] #> ] # If you pass a list into chunked_array, you get a list of length 1 list_scores <- chunked_array(list(c(9.9, 9.6, 9.5), c(8.2, 8.3, 8.4), c(10.0, 9.9, 9.8))) list_scores$num_chunks #> [1] 1 # When constructing a ChunkedArray, the first chunk is used to infer type. doubles <- chunked_array(c(1, 2, 3), c(5L, 6L, 7L)) doubles$type #> Float64 #> double # Concatenating chunked arrays returns a new chunked array containing all chunks a <- chunked_array(c(1, 2), 3) b <- chunked_array(c(4, 5), 6) c(a, b) #> ChunkedArray #> <double> #> [ #> [ #> 1, #> 2 #> ], #> [ #> 3 #> ], #> [ #> 4, #> 5 #> ], #> [ #> 6 #> ] #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/Codec.html","id":null,"dir":"Reference","previous_headings":"","what":"Compression Codec class — Codec","title":"Compression Codec class — Codec","text":"Codecs allow create compressed input output streams.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Codec.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Compression Codec class — Codec","text":"Codec$create() factory method takes following arguments: type: string name compression method. Possible values \"uncompressed\", \"snappy\", \"gzip\", \"brotli\", \"zstd\", \"lz4\", \"lzo\", \"bz2\". type may upper- lower-cased. 
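A minimal sketch of the Codec factory described above; codec_is_available() can be used first to confirm that a codec was enabled at build time (support varies by build, as noted below).

```r
# gzip support is present in most builds; other codecs may not be.
codec_is_available("gzip")
gz <- Codec$create("gzip")
```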
methods may available; support depends build-time flags C++ library. See codec_is_available(). builds support least \"snappy\" \"gzip\". support \"uncompressed\". compression_level: compression level, default value (NA) uses default compression level selected compression type.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvFileFormat.html","id":null,"dir":"Reference","previous_headings":"","what":"CSV dataset file format — CsvFileFormat","title":"CSV dataset file format — CsvFileFormat","text":"CSVFileFormat FileFormat subclass holds information read parse files included CSV Dataset.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvFileFormat.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"CSV dataset file format — CsvFileFormat","text":"CsvFileFormat object","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvFileFormat.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"CSV dataset file format — CsvFileFormat","text":"CSVFileFormat$create() can take options form lists passed parse_options, read_options, convert_options parameters. Alternatively, readr-style options can passed individually. possible pass CSVReadOptions, CSVConvertOptions, CSVParseOptions objects, recommended options set objects validated compatibility.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/CsvFileFormat.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"CSV dataset file format — CsvFileFormat","text":"","code":"# Set up directory for examples tf <- tempfile() dir.create(tf) on.exit(unlink(tf)) df <- data.frame(x = c(\"1\", \"2\", \"NULL\")) write.table(df, file.path(tf, \"file1.txt\"), sep = \",\", row.names = FALSE) # Create CsvFileFormat object with Arrow-style null_values option format <- CsvFileFormat$create(convert_options = list(null_values = c(\"\", \"NA\", \"NULL\"))) open_dataset(tf, format = format) #> FileSystemDataset with 1 csv file #> 1 columns #> x: int64 # Use readr-style options format <- CsvFileFormat$create(na = c(\"\", \"NA\", \"NULL\")) open_dataset(tf, format = format) #> FileSystemDataset with 1 csv file #> 1 columns #> x: int64"},{"path":"https://arrow.apache.org/docs/r/reference/CsvReadOptions.html","id":null,"dir":"Reference","previous_headings":"","what":"File reader options — CsvReadOptions","title":"File reader options — CsvReadOptions","text":"CsvReadOptions, CsvParseOptions, CsvConvertOptions, JsonReadOptions, JsonParseOptions, TimestampParser containers various file reading options. See usage read_csv_arrow() read_json_arrow(), respectively.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvReadOptions.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"File reader options — CsvReadOptions","text":"CsvReadOptions$create() JsonReadOptions$create() factory methods take following arguments: use_threads Whether use global CPU thread pool block_size Block size request IO layer; also determines size chunks use_threads TRUE. NB: FALSE, JSON input must end empty line. CsvReadOptions$create() accepts additional arguments: skip_rows Number lines skip reading data (default 0). column_names Character vector supply column names. length-0 (default), first non-skipped row parsed generate column names, unless autogenerate_column_names TRUE. autogenerate_column_names Logical: generate column names instead using first non-skipped row (default)? 
TRUE, column names \"f0\", \"f1\", ..., \"fN\". encoding file encoding. (default \"UTF-8\") skip_rows_after_names Number lines skip column names (default 0). number can larger number rows one block, empty rows counted. order application follows: skip_rows applied (non-zero); column names read (unless column_names set); skip_rows_after_names applied (non-zero). CsvParseOptions$create() takes following arguments: delimiter Field delimiting character (default \",\") quoting Logical: strings quoted? (default TRUE) quote_char Quoting character, quoting TRUE (default '\"') double_quote Logical: quotes inside values double-quoted? (default TRUE) escaping Logical: whether escaping used (default FALSE) escape_char Escaping character, escaping TRUE (default \"\\\\\") newlines_in_values Logical: values allowed contain CR (0x0d) LF (0x0a) characters? (default FALSE) ignore_empty_lines Logical: empty lines ignored (default) generate row missing values (FALSE)? JsonParseOptions$create() accepts newlines_in_values argument. CsvConvertOptions$create() takes following arguments: check_utf8 Logical: check UTF8 validity string columns? (default TRUE) null_values character vector recognized spellings null values. Analogous na.strings argument read.csv() na readr::read_csv(). strings_can_be_null Logical: can string / binary columns null values? Similar quoted_na argument readr::read_csv(). (default FALSE) true_values character vector recognized spellings TRUE values false_values character vector recognized spellings FALSE values col_types Schema NULL infer types auto_dict_encode Logical: Whether try automatically dictionary-encode string / binary data (think stringsAsFactors). Default FALSE. setting ignored non-inferred columns (col_types). auto_dict_max_cardinality auto_dict_encode, string/binary columns dictionary-encoded number unique values (default 50), switches regular encoding. include_columns non-empty, indicates names columns CSV file actually read converted (vector's order). include_missing_columns Logical: include_columns provided, columns named found data included column type null()? default (FALSE) means reader instead raise error. timestamp_parsers User-defined timestamp parsers. one parser specified, CSV conversion logic try parsing values starting beginning vector. Possible values () NULL, default, uses ISO-8601 parser; (b) character vector strptime parse strings; (c) list TimestampParser objects. decimal_point Character use decimal point floating point numbers. Default: \".\" TimestampParser$create() takes optional format string argument. See strptime() example syntax. default use ISO-8601 format parser. CsvWriteOptions$create() factory method takes following arguments: include_header Whether write initial header line column names batch_size Maximum number rows processed time. Default 1024. null_string string written null values. Must contain quotation marks. Default empty string (\"\"). eol end line character use ending rows. delimiter Field delimiter quoting_style Quoting style: \"Needed\" (enclose values quotes need , CSV rendering can contain quotes (e.g. 
strings binary values)), \"AllValid\" (Enclose valid values quotes), \"None\" (enclose values quotes).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvReadOptions.html","id":"active-bindings","dir":"Reference","previous_headings":"","what":"Active bindings","title":"File reader options — CsvReadOptions","text":"column_names: CsvReadOptions","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvTableReader.html","id":null,"dir":"Reference","previous_headings":"","what":"Arrow CSV and JSON table reader classes — CsvTableReader","title":"Arrow CSV and JSON table reader classes — CsvTableReader","text":"CsvTableReader JsonTableReader wrap Arrow C++ CSV JSON table readers. See usage read_csv_arrow() read_json_arrow(), respectively.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvTableReader.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Arrow CSV and JSON table reader classes — CsvTableReader","text":"CsvTableReader$create() JsonTableReader$create() factory methods take following arguments: file Arrow InputStream convert_options (CSV ), parse_options, read_options: see CsvReadOptions ... additional parameters.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/CsvTableReader.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Arrow CSV and JSON table reader classes — CsvTableReader","text":"$Read(): returns Arrow Table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/DataType-class.html","id":null,"dir":"Reference","previous_headings":"","what":"DataType class — DataType","title":"DataType class — DataType","text":"DataType class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/DataType-class.html","id":"r-methods","dir":"Reference","previous_headings":"","what":"R6 Methods","title":"DataType class — DataType","text":"$ToString(): String representation DataType $Equals(): DataType equal $fields(): children fields associated type $code(namespace): Produces R call data type. Use namespace=TRUE call arrow::. also active bindings: $id: integer Arrow type id. $name: string Arrow type name. $num_fields: number child fields.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/Dataset.html","id":null,"dir":"Reference","previous_headings":"","what":"Multi-file datasets — Dataset","title":"Multi-file datasets — Dataset","text":"Arrow Datasets allow query data split across multiple files. sharding data may indicate partitioning, can accelerate queries touch partitions (files). Dataset contains one Fragments, files, potentially differing type partitioning. Dataset$create(), see open_dataset(), alias . DatasetFactory used provide finer control creation Datasets.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Dataset.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Multi-file datasets — Dataset","text":"DatasetFactory used create Dataset, inspect Schema fragments contained , declare partitioning. FileSystemDatasetFactory subclass DatasetFactory discovering files local file system, currently supported file system. DatasetFactory$create() factory method, see dataset_factory(), alias . DatasetFactory : $Inspect(unify_schemas): unify_schemas TRUE, fragments scanned unified Schema created ; FALSE (default), first fragment inspected schema. Use fast path know trust fragments identical schema. $Finish(schema, unify_schemas): Returns Dataset. 
schema provided, used Dataset; omitted, Schema created inspecting fragments (files) dataset, following unify_schemas described . FileSystemDatasetFactory$create() lower-level factory method takes following arguments: filesystem: FileSystem selector: Either FileSelector NULL paths: Either character vector file paths NULL format: FileFormat partitioning: Either Partitioning, PartitioningFactory, NULL","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Dataset.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Multi-file datasets — Dataset","text":"Dataset following methods: $NewScan(): Returns ScannerBuilder building query $WithSchema(): Returns new Dataset specified schema. method currently supports adding, removing, reordering fields schema: alter cast field types. $schema: Active binding returns Schema Dataset; may also replace dataset's schema using ds$schema <- new_schema. FileSystemDataset following methods: $files: Active binding, returns files FileSystemDataset $format: Active binding, returns FileFormat FileSystemDataset UnionDataset following methods: $children: Active binding, returns child Datasets.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/DictionaryType.html","id":null,"dir":"Reference","previous_headings":"","what":"class DictionaryType — DictionaryType","title":"class DictionaryType — DictionaryType","text":"class DictionaryType","code":""},{"path":"https://arrow.apache.org/docs/r/reference/DictionaryType.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"class DictionaryType — DictionaryType","text":"TODO","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Expression.html","id":null,"dir":"Reference","previous_headings":"","what":"Arrow expressions — Expression","title":"Arrow expressions — Expression","text":"Expressions used define filter logic passing Dataset Scanner. Expression$scalar(x) constructs Expression always evaluates provided scalar (length-1) R value. Expression$field_ref(name) used construct Expression evaluates named column Dataset evaluated. Expression$create(function_name, ..., options) builds function-call Expression containing one Expressions. Anything ... already expression wrapped Expression$scalar(). Expression$op(FUN, ...) logical arithmetic operators. Scalar inputs ... attempted cast common type Expressions call types columns Dataset preserved unnecessarily upcast, may expensive.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ExtensionArray.html","id":null,"dir":"Reference","previous_headings":"","what":"ExtensionArray class — ExtensionArray","title":"ExtensionArray class — ExtensionArray","text":"ExtensionArray class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ExtensionArray.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ExtensionArray class — ExtensionArray","text":"ExtensionArray class inherits Array, also provides access underlying storage extension. $storage(): Returns underlying Array used store values. 
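A sketch of hand-building a filter Expression from the helpers described above; in normal use the dplyr verbs construct these for you, and "greater" is one of the Arrow compute function names.

```r
# Evaluates to: column "x" is greater than 5
expr <- Expression$create(
  "greater",
  Expression$field_ref("x"),
  Expression$scalar(5)
)
```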
ExtensionArray intended subclassed extension types.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ExtensionType.html","id":null,"dir":"Reference","previous_headings":"","what":"ExtensionType class — ExtensionType","title":"ExtensionType class — ExtensionType","text":"ExtensionType class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ExtensionType.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ExtensionType class — ExtensionType","text":"ExtensionType class inherits DataType, also defines extra methods specific extension types: $storage_type(): Returns underlying DataType used store values. $storage_id(): Returns Type identifier corresponding $storage_type(). $extension_name(): Returns extension name. $extension_metadata(): Returns serialized version extension metadata raw() vector. $extension_metadata_utf8(): Returns serialized version extension metadata UTF-8 encoded string. $WrapArray(array): Wraps storage Array ExtensionArray extension type. addition, subclasses may override following methods customize behaviour extension classes. $deserialize_instance(): method called new ExtensionType initialized responsible parsing validating serialized extension_metadata (raw() vector) contents can inspected fields /methods R6 ExtensionType subclass. Implementations must also check storage_type make sure compatible extension type. $as_vector(extension_array): Convert Array ChunkedArray R vector. method called .vector() ExtensionArray objects, RecordBatch containing ExtensionArray converted data.frame(), ChunkedArray (e.g., column Table) converted R vector. default method returns converted storage array. $ToString() Return string representation printed console type Array type printed.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FeatherReader.html","id":null,"dir":"Reference","previous_headings":"","what":"FeatherReader class — FeatherReader","title":"FeatherReader class — FeatherReader","text":"class enables interact Feather files. Create one connect file InputStream, call Read() make arrow::Table. See usage read_feather().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FeatherReader.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"FeatherReader class — FeatherReader","text":"FeatherReader$create() factory method instantiates object takes following argument: file Arrow file connection object inheriting RandomAccessFile.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FeatherReader.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"FeatherReader class — FeatherReader","text":"$Read(columns): Returns Table selected columns, vector integer indices $column_names: Active binding, returns column names Feather file $schema: Active binding, returns schema Feather file $version: Active binding, returns 1 2, according Feather file version","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Field-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Field class — Field","title":"Field class — Field","text":"field() lets create arrow::Field maps DataType column name. Fields contained Schemas.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Field-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Field class — Field","text":"f$ToString(): convert string f$Equals(): test equality. 
naturally called f == ","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Field.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a Field — field","title":"Create a Field — field","text":"Create Field","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Field.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a Field — field","text":"","code":"field(name, type, metadata, nullable = TRUE)"},{"path":"https://arrow.apache.org/docs/r/reference/Field.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a Field — field","text":"name field name type logical type, instance DataType metadata currently ignored nullable TRUE field nullable","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/Field.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a Field — field","text":"","code":"field(\"x\", int32()) #> Field #> x: int32"},{"path":"https://arrow.apache.org/docs/r/reference/FileFormat.html","id":null,"dir":"Reference","previous_headings":"","what":"Dataset file formats — FileFormat","title":"Dataset file formats — FileFormat","text":"FileFormat holds information read parse files included Dataset. subclasses corresponding supported file formats (ParquetFileFormat IpcFileFormat).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileFormat.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Dataset file formats — FileFormat","text":"FileFormat$create() takes following arguments: format: string identifier file format. Currently supported values: \"parquet\" \"ipc\"/\"arrow\"/\"feather\", aliases ; Feather, note version 2 files supported \"csv\"/\"text\", aliases thing (comma default delimiter text files \"tsv\", equivalent passing format = \"text\", delimiter = \"\\t\" ...: Additional format-specific options format = \"parquet\": dict_columns: Names columns read dictionaries. Parquet options FragmentScanOptions. format = \"text\": see CsvParseOptions. Note can specify either Arrow C++ library naming (\"delimiter\", \"quoting\", etc.) readr-style naming used read_csv_arrow() (\"delim\", \"quote\", etc.). readr options currently supported; please file issue encounter one arrow support. Also, following options supported. CsvReadOptions: skip_rows column_names. Note Schema specified, column_names must match specified schema. autogenerate_column_names CsvFragmentScanOptions (values can overridden scan time): convert_options: CsvConvertOptions block_size returns appropriate subclass FileFormat (e.g. 
ParquetFileFormat)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileFormat.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Dataset file formats — FileFormat","text":"","code":"## Semi-colon delimited files # Set up directory for examples tf <- tempfile() dir.create(tf) on.exit(unlink(tf)) write.table(mtcars, file.path(tf, \"file1.txt\"), sep = \";\", row.names = FALSE) # Create FileFormat object format <- FileFormat$create(format = \"text\", delimiter = \";\") open_dataset(tf, format = format) #> FileSystemDataset with 1 csv file #> 11 columns #> mpg: double #> cyl: int64 #> disp: double #> hp: int64 #> drat: double #> wt: double #> qsec: double #> vs: int64 #> am: int64 #> gear: int64 #> carb: int64"},{"path":"https://arrow.apache.org/docs/r/reference/FileInfo.html","id":null,"dir":"Reference","previous_headings":"","what":"FileSystem entry info — FileInfo","title":"FileSystem entry info — FileInfo","text":"FileSystem entry info","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileInfo.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"FileSystem entry info — FileInfo","text":"base_name() : file base name (component last directory separator). extension() : file extension","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileInfo.html","id":"active-bindings","dir":"Reference","previous_headings":"","what":"Active bindings","title":"FileSystem entry info — FileInfo","text":"$type: file type $path: full file path filesystem $size: size bytes, available. regular files guaranteed size. $mtime: time last modification, available.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSelector.html","id":null,"dir":"Reference","previous_headings":"","what":"file selector — FileSelector","title":"file selector — FileSelector","text":"file selector","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSelector.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"file selector — FileSelector","text":"$create() factory method instantiates FileSelector given 3 fields described .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSelector.html","id":"fields","dir":"Reference","previous_headings":"","what":"Fields","title":"file selector — FileSelector","text":"base_dir: directory select files. path exists point directory, error. allow_not_found: behavior base_dir exist filesystem. FALSE, error returned. TRUE, empty selection returned recursive: Whether recurse subdirectories.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSystem.html","id":null,"dir":"Reference","previous_headings":"","what":"FileSystem classes — FileSystem","title":"FileSystem classes — FileSystem","text":"FileSystem abstract file system API, LocalFileSystem implementation accessing files local machine. SubTreeFileSystem implementation delegates another implementation prepending fixed base path","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSystem.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"FileSystem classes — FileSystem","text":"LocalFileSystem$create() returns object takes arguments. SubTreeFileSystem$create() takes following arguments: base_path, string path base_fs, FileSystem object S3FileSystem$create() optionally takes arguments: anonymous: logical, default FALSE. true, attempt look credentials using standard AWS configuration methods. 
access_key, secret_key: authentication credentials. one provided, must well. provided, override AWS configuration set environment level. session_token: optional string authentication along access_key secret_key role_arn: string AWS ARN AccessRole. provided instead access_key secret_key, temporary credentials fetched assuming role. session_name: optional string identifier assumed role session. external_id: optional unique string identifier might required assume role another account. load_frequency: integer, frequency (seconds) temporary credentials assumed role session refreshed. Default 900 (.e. 15 minutes) region: AWS region connect . omitted, AWS library provide sensible default based client configuration, falling back \"us-east-1\" alternatives found. endpoint_override: non-empty, override region connect string \"localhost:9000\". useful connecting file systems emulate S3. scheme: S3 connection transport (default \"https\") proxy_options: optional string, URI proxy use connecting S3 background_writes: logical, whether OutputStream writes issued background, without blocking (default TRUE) allow_bucket_creation: logical, TRUE, filesystem create buckets $CreateDir() called bucket level (default FALSE). allow_bucket_deletion: logical, TRUE, filesystem delete buckets $DeleteDir() called bucket level (default FALSE). request_timeout: Socket read time Windows macOS seconds. negative, AWS SDK default (typically 3 seconds). connect_timeout: Socket connection timeout seconds. negative, AWS SDK default used (typically 1 second). GcsFileSystem$create() optionally takes arguments: anonymous: logical, default FALSE. true, attempt look credentials using standard GCS configuration methods. access_token: optional string authentication. provided along expiration expiration: POSIXct. optional datetime representing point access_token expire. json_credentials: optional string authentication. Either string containing JSON credentials path location filesystem. path credentials given, file UTF-8 encoded. endpoint_override: non-empty, connect provided host name / port, \"localhost:9001\", instead default GCS ones. primarily useful testing purposes. scheme: connection transport (default \"https\") default_bucket_location: default location (\"region\") create new buckets . retry_limit_seconds: maximum amount time spend retrying filesystem encounters errors. Default 15 seconds. default_metadata: default metadata write new objects. project_id: project use creating buckets.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSystem.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"FileSystem classes — FileSystem","text":"path(x): Create SubTreeFileSystem current FileSystem rooted specified path x. cd(x): Create SubTreeFileSystem current FileSystem rooted specified path x. ls(path, ...): List files objects given path root FileSystem path provided. Additional arguments passed FileSelector$create, see FileSelector. $GetFileInfo(x): x may FileSelector character vector paths. Returns list FileInfo $CreateDir(path, recursive = TRUE): Create directory subdirectories. $DeleteDir(path): Delete directory contents, recursively. $DeleteDirContents(path): Delete directory's contents, recursively. Like $DeleteDir(), delete directory . Passing empty path (\"\") wipe entire filesystem tree. $DeleteFile(path) : Delete file. $DeleteFiles(paths) : Delete many files. default implementation issues individual delete operations sequence. $Move(src, dest): Move / rename file directory. 
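As a supplement to the FileSystem entry above, here is a minimal local sketch of a few of the documented methods (CreateDir, GetFileInfo with a FileSelector, path()/ls(), DeleteDir). It deliberately uses LocalFileSystem because S3FileSystem$create() and GcsFileSystem$create() take the arguments listed above but need credentials or network access; the scratch directory and file name are illustrative only.

library(arrow)

fs <- LocalFileSystem$create()

tmp <- tempfile()                       # illustrative scratch directory
fs$CreateDir(tmp)
writeLines("x", file.path(tmp, "a.txt"))

# GetFileInfo() accepts a FileSelector (base_dir, allow_not_found, recursive)
info <- fs$GetFileInfo(FileSelector$create(tmp, recursive = TRUE))
info[[1]]$path
info[[1]]$extension()

# A SubTreeFileSystem rooted at tmp, then clean up
sub <- fs$path(tmp)
sub$ls()
fs$DeleteDir(tmp)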
destination exists: non-empty directory, error returned otherwise, type source, replaced otherwise, behavior unspecified (implementation-dependent). $CopyFile(src, dest): Copy file. destination exists directory, error returned. Otherwise, replaced. $OpenInputStream(path): Open input stream sequential reading. $OpenInputFile(path): Open input file random access reading. $OpenOutputStream(path): Open output stream sequential writing. $OpenAppendStream(path): Open output stream appending.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSystem.html","id":"active-bindings","dir":"Reference","previous_headings":"","what":"Active bindings","title":"FileSystem classes — FileSystem","text":"$type_name: string filesystem type name, \"local\", \"s3\", etc. $region: string AWS region, S3FileSystem SubTreeFileSystem containing S3FileSystem $base_fs: SubTreeFileSystem, FileSystem contains $base_path: SubTreeFileSystem, path $base_fs considered root SubTreeFileSystem. $options: GcsFileSystem, options used create GcsFileSystem instance list","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileSystem.html","id":"notes","dir":"Reference","previous_headings":"","what":"Notes","title":"FileSystem classes — FileSystem","text":"S3FileSystem, $CreateDir() top-level directory creates new bucket. S3FileSystem creates new buckets (assuming allow_bucket_creation TRUE), pass non-default settings. AWS S3, bucket objects publicly visible, bucket policies resource tags. control buckets created, use different API create . S3FileSystem, output produced fatal errors printing return values. troubleshooting, log level can set using environment variable ARROW_S3_LOG_LEVEL (e.g., Sys.setenv(\"ARROW_S3_LOG_LEVEL\"=\"DEBUG\")). log level must set prior running code interacts S3. Possible values include 'FATAL' (default), 'ERROR', 'WARN', 'INFO', 'DEBUG' (recommended), 'TRACE', ''.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FileWriteOptions.html","id":null,"dir":"Reference","previous_headings":"","what":"Format-specific write options — FileWriteOptions","title":"Format-specific write options — FileWriteOptions","text":"FileWriteOptions holds write options specific FileFormat.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FixedWidthType.html","id":null,"dir":"Reference","previous_headings":"","what":"FixedWidthType class — FixedWidthType","title":"FixedWidthType class — FixedWidthType","text":"FixedWidthType class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FixedWidthType.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"FixedWidthType class — FixedWidthType","text":"TODO","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FragmentScanOptions.html","id":null,"dir":"Reference","previous_headings":"","what":"Format-specific scan options — FragmentScanOptions","title":"Format-specific scan options — FragmentScanOptions","text":"FragmentScanOptions holds options specific FileFormat scan operation.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/FragmentScanOptions.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Format-specific scan options — FragmentScanOptions","text":"FragmentScanOptions$create() takes following arguments: format: string identifier file format. Currently supported values: \"parquet\" \"csv\"/\"text\", aliases format. 
...: Additional format-specific options format = \"parquet\": use_buffered_stream: Read files buffered input streams rather loading entire row groups . may enabled reduce memory overhead. Disabled default. buffer_size: Size buffered stream, enabled. Default 8KB. pre_buffer: Pre-buffer raw Parquet data. can improve performance high-latency filesystems. Disabled default. thrift_string_size_limit: Maximum string size allocated decoding thrift strings. May need increased order read files especially large headers. Default value 100000000. thrift_container_size_limit: Maximum size thrift containers. May need increased order read files especially large headers. Default value 1000000. format = \"text\": see CsvConvertOptions. Note options can specified Arrow C++ library naming. Also, \"block_size\" CsvReadOptions may given. returns appropriate subclass FragmentScanOptions (e.g. CsvFragmentScanOptions).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/InputStream.html","id":null,"dir":"Reference","previous_headings":"","what":"InputStream classes — InputStream","title":"InputStream classes — InputStream","text":"RandomAccessFile inherits InputStream base class : ReadableFile reading file; MemoryMappedFile memory mapping; BufferReader reading buffer. Use various table readers.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/InputStream.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"InputStream classes — InputStream","text":"$create() factory methods instantiate InputStream object take following arguments, depending subclass: path ReadableFile, character file name x BufferReader, Buffer object can made buffer via buffer(). instantiate MemoryMappedFile, call mmap_open().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/InputStream.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"InputStream classes — InputStream","text":"$GetSize(): $supports_zero_copy(): Logical $seek(position): go position stream $tell(): return position stream $close(): close stream $Read(nbytes): read data stream, either specified nbytes , nbytes provided $ReadAt(position, nbytes): similar $seek(position)$Read(nbytes) $Resize(size): MemoryMappedFile writeable","code":""},{"path":"https://arrow.apache.org/docs/r/reference/JsonFileFormat.html","id":null,"dir":"Reference","previous_headings":"","what":"JSON dataset file format — JsonFileFormat","title":"JSON dataset file format — JsonFileFormat","text":"JsonFileFormat FileFormat subclass holds information read parse files included JSON Dataset.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/JsonFileFormat.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"JSON dataset file format — JsonFileFormat","text":"JsonFileFormat object","code":""},{"path":"https://arrow.apache.org/docs/r/reference/JsonFileFormat.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"JSON dataset file format — JsonFileFormat","text":"JsonFileFormat$create() can take options form lists passed parse_options, read_options parameters. Available read_options parameters: use_threads: Whether use global CPU thread pool. Default TRUE. FALSE, JSON input must end empty line. block_size: Block size request IO layer; also determines size chunks use_threads TRUE. Available parse_options parameters: newlines_in_values:Logical: values allowed contain CR (0x0d \\r) LF (0x0a \\n) characters? 
(default FALSE)","code":""},{"path":[]},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/MemoryPool.html","id":null,"dir":"Reference","previous_headings":"","what":"MemoryPool class — MemoryPool","title":"MemoryPool class — MemoryPool","text":"MemoryPool class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/MemoryPool.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"MemoryPool class — MemoryPool","text":"backend_name: one \"jemalloc\", \"mimalloc\", \"system\". Alternative memory allocators optionally enabled build time. Windows builds generally mimalloc, others jemalloc (used default) mimalloc. change memory allocators runtime, set environment variable ARROW_DEFAULT_MEMORY_POOL one strings prior loading arrow library. bytes_allocated max_memory","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Message.html","id":null,"dir":"Reference","previous_headings":"","what":"Message class — Message","title":"Message class — Message","text":"Message class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Message.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Message class — Message","text":"TODO","code":""},{"path":"https://arrow.apache.org/docs/r/reference/MessageReader.html","id":null,"dir":"Reference","previous_headings":"","what":"MessageReader class — MessageReader","title":"MessageReader class — MessageReader","text":"MessageReader class","code":""},{"path":"https://arrow.apache.org/docs/r/reference/MessageReader.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"MessageReader class — MessageReader","text":"TODO","code":""},{"path":"https://arrow.apache.org/docs/r/reference/OutputStream.html","id":null,"dir":"Reference","previous_headings":"","what":"OutputStream classes — OutputStream","title":"OutputStream classes — OutputStream","text":"FileOutputStream writing file; BufferOutputStream writes buffer; can create one pass table writers, example.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/OutputStream.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"OutputStream classes — OutputStream","text":"$create() factory methods instantiate OutputStream object take following arguments, depending subclass: path FileOutputStream, character file name initial_capacity BufferOutputStream, size bytes buffer.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/OutputStream.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"OutputStream classes — OutputStream","text":"$tell(): return position stream $close(): close stream $write(x): send x stream $capacity(): BufferOutputStream $finish(): BufferOutputStream $GetExtentBytesWritten(): MockOutputStream, report many bytes sent.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetArrowReaderProperties.html","id":null,"dir":"Reference","previous_headings":"","what":"ParquetArrowReaderProperties class — ParquetArrowReaderProperties","title":"ParquetArrowReaderProperties class — ParquetArrowReaderProperties","text":"class holds settings control Parquet file read ParquetFileReader.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetArrowReaderProperties.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ParquetArrowReaderProperties class — ParquetArrowReaderProperties","text":"ParquetArrowReaderProperties$create() factory 
method instantiates object takes following arguments: use_threads Logical: whether use multithreading (default TRUE)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetArrowReaderProperties.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ParquetArrowReaderProperties class — ParquetArrowReaderProperties","text":"$read_dictionary(column_index) $set_read_dictionary(column_index, read_dict) $use_threads(use_threads)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileReader.html","id":null,"dir":"Reference","previous_headings":"","what":"ParquetFileReader class — ParquetFileReader","title":"ParquetFileReader class — ParquetFileReader","text":"class enables interact Parquet files.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileReader.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ParquetFileReader class — ParquetFileReader","text":"ParquetFileReader$create() factory method instantiates object takes following arguments: file character file name, raw vector, Arrow file connection object (e.g. RandomAccessFile). props Optional ParquetArrowReaderProperties mmap Logical: whether memory-map file (default TRUE) reader_props Optional ParquetReaderProperties ... Additional arguments, currently ignored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileReader.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ParquetFileReader class — ParquetFileReader","text":"$ReadTable(column_indices): get arrow::Table file. optional column_indices= argument 0-based integer vector indicating columns retain. $ReadRowGroup(, column_indices): get arrow::Table reading ith row group (0-based). optional column_indices= argument 0-based integer vector indicating columns retain. $ReadRowGroups(row_groups, column_indices): get arrow::Table reading several row groups (0-based integers). optional column_indices= argument 0-based integer vector indicating columns retain. $GetSchema(): get arrow::Schema data file $ReadColumn(): read ith column (0-based) ChunkedArray.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileReader.html","id":"active-bindings","dir":"Reference","previous_headings":"","what":"Active bindings","title":"ParquetFileReader class — ParquetFileReader","text":"$num_rows: number rows. $num_columns: number columns. 
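To illustrate the ParquetFileReader methods listed above, here is a small sketch of selective reads. It reuses the example Parquet file shipped with the package (as in the entry's own example) and guards the data reads with codec_is_available("snappy") because the file's data columns are snappy-compressed; remember that column and row-group indices are 0-based.

library(arrow)

f <- system.file("v0.7.1.parquet", package = "arrow")
pq <- ParquetFileReader$create(f)
pq$num_row_groups

if (codec_is_available("snappy")) {
  # column_indices are 0-based integer positions
  pq$ReadTable(column_indices = c(0L, 1L))
  pq$ReadRowGroup(0, column_indices = c(0L, 1L))
  pq$ReadColumn(0)   # returns a ChunkedArray
}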
$num_row_groups: number row groups.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileReader.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"ParquetFileReader class — ParquetFileReader","text":"","code":"f <- system.file(\"v0.7.1.parquet\", package = \"arrow\") pq <- ParquetFileReader$create(f) pq$GetSchema() #> Schema #> carat: double #> cut: string #> color: string #> clarity: string #> depth: double #> table: double #> price: int64 #> x: double #> y: double #> z: double #> __index_level_0__: int64 #> #> See $metadata for additional Schema metadata if (codec_is_available(\"snappy\")) { # This file has compressed data columns tab <- pq$ReadTable() tab$schema } #> Schema #> carat: double #> cut: string #> color: string #> clarity: string #> depth: double #> table: double #> price: int64 #> x: double #> y: double #> z: double #> __index_level_0__: int64 #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html","id":null,"dir":"Reference","previous_headings":"","what":"ParquetFileWriter class — ParquetFileWriter","title":"ParquetFileWriter class — ParquetFileWriter","text":"class enables interact Parquet files.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ParquetFileWriter class — ParquetFileWriter","text":"ParquetFileWriter$create() factory method instantiates object takes following arguments: schema Schema sink arrow::io::OutputStream properties instance ParquetWriterProperties arrow_properties instance ParquetArrowWriterProperties","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ParquetFileWriter class — ParquetFileWriter","text":"WriteTable Write Table sink Close Close writer. Note: close sink. 
arrow::io::OutputStream close() method.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html","id":null,"dir":"Reference","previous_headings":"","what":"ParquetReaderProperties class — ParquetReaderProperties","title":"ParquetReaderProperties class — ParquetReaderProperties","text":"class holds settings control Parquet file read ParquetFileReader.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ParquetReaderProperties class — ParquetReaderProperties","text":"ParquetReaderProperties$create() factory method instantiates object takes arguments.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"ParquetReaderProperties class — ParquetReaderProperties","text":"$thrift_string_size_limit() $set_thrift_string_size_limit() $thrift_container_size_limit() $set_thrift_container_size_limit()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetWriterProperties.html","id":null,"dir":"Reference","previous_headings":"","what":"ParquetWriterProperties class — ParquetWriterProperties","title":"ParquetWriterProperties class — ParquetWriterProperties","text":"class holds settings control Parquet file read ParquetFileWriter.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetWriterProperties.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"ParquetWriterProperties class — ParquetWriterProperties","text":"parameters compression, compression_level, use_dictionary write_statistics` support various patterns: default NULL leaves parameter unspecified, C++ library uses appropriate default column (defaults listed ) single, unnamed, value (e.g. single string compression) applies columns unnamed vector, size number columns, specify value column, positional order named vector, specify value named columns, default value setting used supplied Unlike high-level write_parquet, ParquetWriterProperties arguments use C++ defaults. Currently means \"uncompressed\" rather \"snappy\" compression argument.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/ParquetWriterProperties.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"ParquetWriterProperties class — ParquetWriterProperties","text":"ParquetWriterProperties$create() factory method instantiates object takes following arguments: table: table write (required) version: Parquet version, \"1.0\" \"2.0\". Default \"1.0\" compression: Compression type, algorithm \"uncompressed\" compression_level: Compression level; meaning depends compression algorithm use_dictionary: Specify use dictionary encoding. Default TRUE write_statistics: Specify write statistics. Default TRUE data_page_size: Set target threshold approximate encoded size data pages within column chunk (bytes). Default 1 MiB.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/Partitioning.html","id":null,"dir":"Reference","previous_headings":"","what":"Define Partitioning for a Dataset — Partitioning","title":"Define Partitioning for a Dataset — Partitioning","text":"Pass Partitioning object FileSystemDatasetFactory's $create() method indicate file's paths interpreted define partitioning. DirectoryPartitioning describes interpret raw path segments, order. 
example, schema(year = int16(), month = int8()) define partitions file paths like \"2019/01/file.parquet\", \"2019/02/file.parquet\", etc. scheme NULL values skipped. previous example: writing dataset month NA (NULL), files placed \"2019/file.parquet\". reading, rows \"2019/file.parquet\" return NA month column. error raised outer directory NULL inner directory . HivePartitioning Hive-style partitioning, embeds field names values path segments, \"/year=2019/month=2/data.parquet\". fields named path segments, order matter. partitioning scheme allows NULL values. replaced configurable null_fallback defaults string \"__HIVE_DEFAULT_PARTITION__\" writing. reading, null_fallback string replaced NAs appropriate. PartitioningFactory subclasses instruct DatasetFactory detect partition features file paths.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Partitioning.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Define Partitioning for a Dataset — Partitioning","text":"DirectoryPartitioning$create() HivePartitioning$create() methods take Schema single input argument. helper function hive_partition(...) shorthand HivePartitioning$create(schema(...)). DirectoryPartitioningFactory$create(), can provide just names path segments (example, c(\"year\", \"month\")), DatasetFactory infer data types partition variables. HivePartitioningFactory$create() takes arguments: variable names types can inferred file paths. hive_partition() arguments returns HivePartitioningFactory.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatch-class.html","id":null,"dir":"Reference","previous_headings":"","what":"RecordBatch class — RecordBatch","title":"RecordBatch class — RecordBatch","text":"record batch collection equal-length arrays matching particular Schema. table-like data structure semantically sequence fields, contiguous Arrow Array.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatch-class.html","id":"s-methods-and-usage","dir":"Reference","previous_headings":"","what":"S3 Methods and Usage","title":"RecordBatch class — RecordBatch","text":"Record batches data-frame-like, many methods expect work data.frame implemented RecordBatch. includes [, [[, $, names, dim, nrow, ncol, head, tail. can also pull data Arrow record batch R .data.frame(). See examples. caveat $ method: RecordBatch R6 object, $ also used access object's methods (see ). Methods take precedence table's columns. , batch$Slice return \"Slice\" method function even column table called \"Slice\".","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatch-class.html","id":"r-methods","dir":"Reference","previous_headings":"","what":"R6 Methods","title":"RecordBatch class — RecordBatch","text":"addition R-friendly S3 methods, RecordBatch object following R6 methods map onto underlying C++ methods: $Equals(): Returns TRUE record batch equal $column(): Extract Array integer position batch $column_name(): Get column's name integer position $names(): Get column names (called names(batch)) $nbytes(): Total number bytes consumed elements record batch $RenameColumns(value): Set column names (called names(batch) <- value) $GetColumnByName(name): Extract Array string name $RemoveColumn(): Drops column batch integer position $SelectColumns(indices): Return new record batch selection columns, expressed 0-based integers. $Slice(offset, length = NULL): Create zero-copy view starting indicated integer offset going given length, end table NULL, default. 
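A minimal sketch of constructing the Partitioning objects described in the entry above; the year/month schema mirrors the entry's own example, and per that entry the resulting objects are what you pass to a FileSystemDatasetFactory.

library(arrow)

# Hive-style: field names are embedded in the path segments
hive_part <- hive_partition(year = int16(), month = int8())

# Directory partitioning: segments are interpreted by position, in schema order
dir_part <- DirectoryPartitioning$create(schema(year = int16(), month = int8()))

# A factory that infers the partition types from the paths, given only the names
dir_part_factory <- DirectoryPartitioningFactory$create(c("year", "month"))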
$Take(): return RecordBatch rows positions given integers (R vector Array Array) . $Filter(, keep_na = TRUE): return RecordBatch rows positions logical vector (Arrow boolean Array) TRUE. $SortIndices(names, descending = FALSE): return Array integer row positions can used rearrange RecordBatch ascending descending order first named column, breaking ties named columns. descending can logical vector length one length names. $serialize(): Returns raw vector suitable interprocess communication $cast(target_schema, safe = TRUE, options = cast_options(safe)): Alter schema record batch. also active bindings $num_columns $num_rows $schema $metadata: Returns key-value metadata Schema named list. Modify replace assigning (batch$metadata <- new_metadata). list elements coerced string. See schema() information. $columns: Returns list Arrays","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchReader.html","id":null,"dir":"Reference","previous_headings":"","what":"RecordBatchReader classes — RecordBatchReader","title":"RecordBatchReader classes — RecordBatchReader","text":"Apache Arrow defines two formats serializing data interprocess communication (IPC): \"stream\" format \"file\" format, known Feather. RecordBatchStreamReader RecordBatchFileReader interfaces accessing record batches input sources formats, respectively. guidance use classes, see examples section.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchReader.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"RecordBatchReader classes — RecordBatchReader","text":"RecordBatchFileReader$create() RecordBatchStreamReader$create() factory methods instantiate object take single argument, named according class: file character file name, raw vector, Arrow file connection object (e.g. RandomAccessFile). stream raw vector, Buffer, InputStream.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchReader.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"RecordBatchReader classes — RecordBatchReader","text":"$read_next_batch(): Returns RecordBatch, iterating Reader. batches Reader, returns NULL. $schema: Returns Schema (active binding) $batches(): Returns list RecordBatches $read_table(): Collects reader's RecordBatches Table $get_batch(): RecordBatchFileReader, return particular batch integer index. $num_record_batches(): RecordBatchFileReader, see many batches file.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchReader.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"RecordBatchReader classes — RecordBatchReader","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) batch <- record_batch(chickwts) # This opens a connection to the file in Arrow file_obj <- FileOutputStream$create(tf) # Pass that to a RecordBatchWriter to write data conforming to a schema writer <- RecordBatchFileWriter$create(file_obj, batch$schema) writer$write(batch) # You may write additional batches to the stream, provided that they have # the same schema. # Call \"close\" on the writer to indicate end-of-file/stream writer$close() # Then, close the connection--closing the IPC message does not close the file file_obj$close() # Now, we have a file we can read from. 
Same pattern: open file connection, # then pass it to a RecordBatchReader read_file_obj <- ReadableFile$create(tf) reader <- RecordBatchFileReader$create(read_file_obj) # RecordBatchFileReader knows how many batches it has (StreamReader does not) reader$num_record_batches #> [1] 1 # We could consume the Reader by calling $read_next_batch() until all are, # consumed, or we can call $read_table() to pull them all into a Table tab <- reader$read_table() # Call as.data.frame to turn that Table into an R data.frame df <- as.data.frame(tab) # This should be the same data we sent all.equal(df, chickwts, check.attributes = FALSE) #> [1] TRUE # Unlike the Writers, we don't have to close RecordBatchReaders, # but we do still need to close the file connection read_file_obj$close()"},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchWriter.html","id":null,"dir":"Reference","previous_headings":"","what":"RecordBatchWriter classes — RecordBatchWriter","title":"RecordBatchWriter classes — RecordBatchWriter","text":"Apache Arrow defines two formats serializing data interprocess communication (IPC): \"stream\" format \"file\" format, known Feather. RecordBatchStreamWriter RecordBatchFileWriter interfaces writing record batches formats, respectively. guidance use classes, see examples section.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchWriter.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"RecordBatchWriter classes — RecordBatchWriter","text":"RecordBatchFileWriter$create() RecordBatchStreamWriter$create() factory methods instantiate object take following arguments: sink OutputStream schema Schema data written use_legacy_format logical: write data formatted Arrow libraries versions 0.14 lower can read . Default FALSE. can also enable setting environment variable ARROW_PRE_0_15_IPC_FORMAT=1. metadata_version: string like \"V5\" equivalent integer indicating Arrow IPC MetadataVersion. Default (NULL) use latest version, unless environment variable ARROW_PRE_1_0_METADATA_VERSION=1, case V4.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchWriter.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"RecordBatchWriter classes — RecordBatchWriter","text":"$write(x): Write RecordBatch, Table, data.frame, dispatching methods appropriately $write_batch(batch): Write RecordBatch stream $write_table(table): Write Table stream $close(): close stream. Note indicates end--file end--stream--close connection sink. needs closed separately.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/RecordBatchWriter.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"RecordBatchWriter classes — RecordBatchWriter","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) batch <- record_batch(chickwts) # This opens a connection to the file in Arrow file_obj <- FileOutputStream$create(tf) # Pass that to a RecordBatchWriter to write data conforming to a schema writer <- RecordBatchFileWriter$create(file_obj, batch$schema) writer$write(batch) # You may write additional batches to the stream, provided that they have # the same schema. # Call \"close\" on the writer to indicate end-of-file/stream writer$close() # Then, close the connection--closing the IPC message does not close the file file_obj$close() # Now, we have a file we can read from. 
Same pattern: open file connection, # then pass it to a RecordBatchReader read_file_obj <- ReadableFile$create(tf) reader <- RecordBatchFileReader$create(read_file_obj) # RecordBatchFileReader knows how many batches it has (StreamReader does not) reader$num_record_batches #> [1] 1 # We could consume the Reader by calling $read_next_batch() until all are, # consumed, or we can call $read_table() to pull them all into a Table tab <- reader$read_table() # Call as.data.frame to turn that Table into an R data.frame df <- as.data.frame(tab) # This should be the same data we sent all.equal(df, chickwts, check.attributes = FALSE) #> [1] TRUE # Unlike the Writers, we don't have to close RecordBatchReaders, # but we do still need to close the file connection read_file_obj$close()"},{"path":"https://arrow.apache.org/docs/r/reference/Scalar-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Arrow scalars — Scalar","title":"Arrow scalars — Scalar","text":"Scalar holds single value Arrow type.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scalar-class.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Arrow scalars — Scalar","text":"Scalar$create() factory method instantiates Scalar takes following arguments: x: R vector, list, data.frame type: optional data type x. omitted, type inferred data.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scalar-class.html","id":"usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Arrow scalars — Scalar","text":"","code":"a <- Scalar$create(x) length(a) print(a) a == a"},{"path":"https://arrow.apache.org/docs/r/reference/Scalar-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Arrow scalars — Scalar","text":"$ToString(): convert string $as_vector(): convert R vector $as_array(): convert Arrow Array $Equals(): Scalar equal $ApproxEquals(): Scalar approximately equal $is_valid: Scalar valid $null_count: number invalid values - 1 0 $type: Scalar type $cast(target_type, safe = TRUE, options = cast_options(safe)): cast value different type","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scalar-class.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Arrow scalars — Scalar","text":"","code":"Scalar$create(pi) #> Scalar #> 3.141592653589793 Scalar$create(404) #> Scalar #> 404 # If you pass a vector into Scalar$create, you get a list containing your items Scalar$create(c(1, 2, 3)) #> Scalar #> list<item: double>[1, 2, 3] # Comparisons my_scalar <- Scalar$create(99) my_scalar$ApproxEquals(Scalar$create(99.00001)) # FALSE #> [1] FALSE my_scalar$ApproxEquals(Scalar$create(99.000009)) # TRUE #> [1] TRUE my_scalar$Equals(Scalar$create(99.000009)) # FALSE #> [1] FALSE my_scalar$Equals(Scalar$create(99L)) # FALSE (types don't match) #> [1] FALSE my_scalar$ToString() #> [1] \"99\""},{"path":"https://arrow.apache.org/docs/r/reference/Scanner.html","id":null,"dir":"Reference","previous_headings":"","what":"Scan the contents of a dataset — Scanner","title":"Scan the contents of a dataset — Scanner","text":"Scanner iterates Dataset's fragments returns data according given row filtering column projection. 
ScannerBuilder can help create one.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scanner.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Scan the contents of a dataset — Scanner","text":"Scanner$create() wraps ScannerBuilder interface make Scanner. takes following arguments: dataset: Dataset arrow_dplyr_query object, returned dplyr methods Dataset. projection: character vector column names select columns named list expressions filter: Expression filter scanned rows , TRUE (default) keep rows. use_threads: logical: scanning use multithreading? Default TRUE ...: Additional arguments, currently ignored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scanner.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Scan the contents of a dataset — Scanner","text":"ScannerBuilder following methods: $Project(cols): Indicate scan return columns given cols, character vector column names named list Expression. $Filter(expr): Filter rows Expression. $UseThreads(threads): logical: scan use multithreading? method's default input TRUE, must call method enable multithreading scanner default FALSE. $BatchSize(batch_size): integer: Maximum row count scanned record batches, default 32K. scanned record batches overflowing memory method can called reduce size. $schema: Active binding, returns Schema Dataset $Finish(): Returns Scanner Scanner currently single method, $ToTable(), evaluates query returns Arrow Table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Scanner.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Scan the contents of a dataset — Scanner","text":"","code":"# Set up directory for examples tf <- tempfile() dir.create(tf) on.exit(unlink(tf)) write_dataset(mtcars, tf, partitioning=\"cyl\") ds <- open_dataset(tf) scan_builder <- ds$NewScan() scan_builder$Filter(Expression$field_ref(\"hp\") > 100) #> ScannerBuilder scan_builder$Project(list(hp_times_ten = 10 * Expression$field_ref(\"hp\"))) #> ScannerBuilder # Once configured, call $Finish() scanner <- scan_builder$Finish() # Can get results as a table as.data.frame(scanner$ToTable()) #> hp_times_ten #> 1 1130 #> 2 1090 #> 3 1100 #> 4 1100 #> 5 1100 #> 6 1050 #> 7 1230 #> 8 1230 #> 9 1750 #> 10 1750 #> 11 2450 #> 12 1800 #> 13 1800 #> 14 1800 #> 15 2050 #> 16 2150 #> 17 2300 #> 18 1500 #> 19 1500 #> 20 2450 #> 21 1750 #> 22 2640 #> 23 3350 # Or as a RecordBatchReader scanner$ToRecordBatchReader() #> RecordBatchReader #> 1 columns #> hp_times_ten: double #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/Schema-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Schema class — Schema","title":"Schema class — Schema","text":"Schema Arrow object containing Fields, map names Arrow data types. Create Schema want convert R data.frame Arrow want rely default mapping R types Arrow types, want choose specific numeric precision, creating Dataset want ensure specific schema rather inferring various files. 
Many Arrow objects, including Table Dataset, $schema method (active binding) lets access schema.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Schema-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Schema class — Schema","text":"$ToString(): convert string $field(): returns field index (0-based) $GetFieldByName(x): returns field name x $WithMetadata(metadata): returns new Schema key-value metadata set. Note list elements metadata coerced character. $code(namespace): returns R code needed generate schema. Use namespace=TRUE call arrow::.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Schema-class.html","id":"active-bindings","dir":"Reference","previous_headings":"","what":"Active bindings","title":"Schema class — Schema","text":"$names: returns field names (called names(Schema)) $num_fields: returns number fields (called length(Schema)) $fields: returns list Fields Schema, suitable iterating $HasMetadata: logical: Schema extra metadata? $metadata: returns key-value metadata named list. Modify replace assigning (sch$metadata <- new_metadata). list elements coerced string.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Schema-class.html","id":"r-metadata","dir":"Reference","previous_headings":"","what":"R Metadata","title":"Schema class — Schema","text":"converting data.frame Arrow Table RecordBatch, attributes data.frame saved alongside tables object can reconstructed faithfully R (e.g. .data.frame()). metadata can top-level data.frame (e.g. attributes(df)) column (e.g. attributes(df$col_a)) list columns : element level (e.g. attributes(df[1, \"col_a\"])). example, allows storing haven columns table able faithfully re-create pulled back R. metadata separate schema (column names types) compatible Arrow clients. R metadata read R ignored clients (e.g. Pandas custom metadata). metadata stored $metadata$r. Since Schema metadata keys values must strings, metadata saved serializing R's attribute list structure string. serialized metadata exceeds 100Kb size, default compressed starting version 3.0.0. disable compression (e.g. tables compatible Arrow versions 3.0.0 include large amounts metadata), set option arrow.compress_metadata FALSE. Files compressed metadata readable older versions arrow, metadata dropped.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Table-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Table class — Table","title":"Table class — Table","text":"Table sequence chunked arrays. similar interface record batches, can composed multiple record batches chunked arrays.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Table-class.html","id":"s-methods-and-usage","dir":"Reference","previous_headings":"","what":"S3 Methods and Usage","title":"Table class — Table","text":"Tables data-frame-like, many methods expect work data.frame implemented Table. includes [, [[, $, names, dim, nrow, ncol, head, tail. can also pull data Arrow table R .data.frame(). See examples. caveat $ method: Table R6 object, $ also used access object's methods (see ). Methods take precedence table's columns. 
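A short sketch of the Schema methods and bindings described above; the field names and metadata values are illustrative.

library(arrow)

sch <- schema(x = int32(), y = utf8())
sch$names            # field names
sch$num_fields       # 2
sch$GetFieldByName("y")

# Attach key-value metadata; list elements are coerced to strings
sch2 <- sch$WithMetadata(list(source = "example"))
sch2$metadata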
, tab$Slice return \"Slice\" method function even column table called \"Slice\".","code":""},{"path":"https://arrow.apache.org/docs/r/reference/Table-class.html","id":"r-methods","dir":"Reference","previous_headings":"","what":"R6 Methods","title":"Table class — Table","text":"addition R-friendly S3 methods, Table object following R6 methods map onto underlying C++ methods: $column(): Extract ChunkedArray integer position table $ColumnNames(): Get column names (called names(tab)) $nbytes(): Total number bytes consumed elements table $RenameColumns(value): Set column names (called names(tab) <- value) $GetColumnByName(name): Extract ChunkedArray string name $field(): Extract Field table schema integer position $SelectColumns(indices): Return new Table specified columns, expressed 0-based integers. $Slice(offset, length = NULL): Create zero-copy view starting indicated integer offset going given length, end table NULL, default. $Take(): return Table rows positions given integers . Arrow Array ChunkedArray, coerced R vector taking. $Filter(, keep_na = TRUE): return Table rows positions logical vector Arrow boolean-type (Chunked)Array TRUE. $SortIndices(names, descending = FALSE): return Array integer row positions can used rearrange Table ascending descending order first named column, breaking ties named columns. descending can logical vector length one length names. $serialize(output_stream, ...): Write table given OutputStream $cast(target_schema, safe = TRUE, options = cast_options(safe)): Alter schema record batch. also active bindings: $num_columns $num_rows $schema $metadata: Returns key-value metadata Schema named list. Modify replace assigning (tab$metadata <- new_metadata). list elements coerced string. See schema() information. $columns: Returns list ChunkedArrays","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":null,"dir":"Reference","previous_headings":"","what":"Functions available in Arrow dplyr queries — acero","title":"Functions available in Arrow dplyr queries — acero","text":"arrow package contains methods 37 dplyr table functions, many \"verbs\" transformations one tables. package also mappings 212 R functions corresponding functions Arrow compute library. allow write code inside dplyr methods call R functions, including many packages like stringr lubridate, get translated Arrow run Arrow query engine (Acero). document lists mapped functions.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"dplyr-verbs","dir":"Reference","previous_headings":"","what":"dplyr verbs","title":"Functions available in Arrow dplyr queries — acero","text":"verb functions return arrow_dplyr_query object, similar spirit dbplyr::tbl_lazy. means verbs eagerly evaluate query data. run query, call either compute(), returns arrow Table, collect(), pulls resulting Table R tibble. anti_join(): copy argument ignored arrange() collapse() collect() compute() count() distinct(): .keep_all = TRUE supported explain() filter() full_join(): copy argument ignored glimpse() group_by() group_by_drop_default() group_vars() groups() inner_join(): copy argument ignored left_join(): copy argument ignored mutate() pull(): name argument supported; returns R vector default behavior deprecated return Arrow ChunkedArray future release. Provide as_vector = TRUE/FALSE control behavior, set options(arrow.pull_as_vector) globally. 
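A minimal sketch exercising a few of the Table methods listed above, assuming arrow_table() (described later in this section) to build a Table from the built-in mtcars data.frame.

library(arrow)

tab <- arrow_table(mtcars)
tab$num_rows
names(tab)                        # same as tab$ColumnNames()
tab$SelectColumns(c(0L, 1L))      # 0-based column indices
head(as.data.frame(tab$Slice(0, 5)))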
relocate() rename() rename_with() right_join(): copy argument ignored select() semi_join(): copy argument ignored show_query() slice_head(): slicing within groups supported; Arrow datasets row order, head non-deterministic; prop supported queries nrow() knowable without evaluating slice_max(): slicing within groups supported; with_ties = TRUE (dplyr default) supported; prop supported queries nrow() knowable without evaluating slice_min(): slicing within groups supported; with_ties = TRUE (dplyr default) supported; prop supported queries nrow() knowable without evaluating slice_sample(): slicing within groups supported; replace = TRUE weight_by argument supported; n supported queries nrow() knowable without evaluating slice_tail(): slicing within groups supported; Arrow datasets row order, tail non-deterministic; prop supported queries nrow() knowable without evaluating summarise(): window functions currently supported; arguments .drop = FALSE `.groups = \"rowwise\" supported tally() transmute() ungroup() union() union_all()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"function-mappings","dir":"Reference","previous_headings":"","what":"Function mappings","title":"Functions available in Arrow dplyr queries — acero","text":"list , differences behavior support Acero R function listed. notes follow function name, can assume function works Acero just R. Functions can called either pkg::fun() just fun(), .e. str_sub() stringr::str_sub() work. addition functions, can call Arrow's 262 compute functions directly. Arrow many functions map existing R function. cases R function mapping, can still call Arrow function directly want adaptations R mapping make Acero behave like R. functions listed C++ documentation, function registry R, named arrow_ prefix, arrow_ascii_is_decimal.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"arrow","dir":"Reference","previous_headings":"","what":"arrow","title":"Functions available in Arrow dplyr queries — acero","text":"add_filename() cast()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"base","dir":"Reference","previous_headings":"","what":"base","title":"Functions available in Arrow dplyr queries — acero","text":"! != %% %/% %% & * + - / < <= == > >= ISOdate() ISOdatetime() ^ abs() acos() () () .Date(): Multiple tryFormats supported Arrow. Consider using lubridate specialised parsing functions ymd(), ymd(), etc. .character() .difftime(): supports units = \"secs\" (default) .double() .integer() .logical() .numeric() asin() ceiling() cos() data.frame(): row.names check.rows arguments supported; stringsAsFactors must FALSE difftime(): supports units = \"secs\" (default); tz argument supported endsWith() exp() floor() format() grepl() gsub() ifelse() .character() .double() .factor() .finite() .infinite() .integer() .list() .logical() .na() .nan() .numeric() log() log10() log1p() log2() logb() max() mean() min() nchar(): allowNA = TRUE keepNA = TRUE supported paste(): collapse argument yet supported paste0(): collapse argument yet supported pmax() pmin() prod() round() sign() sin() sqrt() startsWith() strftime() strptime(): accepts unit argument present base function. Valid values \"s\", \"ms\" (default), \"us\", \"ns\". 
strrep() strsplit() sub() substr(): start stop must length 1 substring() sum() tan() tolower() toupper() trunc() |","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"bit-","dir":"Reference","previous_headings":"","what":"bit64","title":"Functions available in Arrow dplyr queries — acero","text":".integer64() .integer64()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"dplyr","dir":"Reference","previous_headings":"","what":"dplyr","title":"Functions available in Arrow dplyr queries — acero","text":"across() () case_when(): .ptype .size arguments supported coalesce() desc() if_all() if_any() if_else() n() n_distinct()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"lubridate","dir":"Reference","previous_headings":"","what":"lubridate","title":"Functions available in Arrow dplyr queries — acero","text":"() as_date() as_datetime() ceiling_date() date() date_decimal() day() ddays() decimal_date() dhours() dmicroseconds() dmilliseconds() dminutes() dmonths() dmy(): locale argument supported dmy_h(): locale argument supported dmy_hm(): locale argument supported dmy_hms(): locale argument supported dnanoseconds() dpicoseconds(): supported dseconds() dst() dweeks() dyears() dym(): locale argument supported epiweek() epiyear() fast_strptime(): non-default values lt cutoff_2000 supported floor_date() force_tz(): Timezone conversion non-UTC timezone supported; roll_dst values 'error' 'boundary' supported nonexistent times, roll_dst values 'error', 'pre', 'post' supported ambiguous times. format_ISO8601() hour() .Date() .POSIXct() .instant() .timepoint() isoweek() isoyear() leap_year() make_date() make_datetime(): supports UTC (default) timezone make_difftime(): supports units = \"secs\" (default); providing num ... supported mday() mdy(): locale argument supported mdy_h(): locale argument supported mdy_hm(): locale argument supported mdy_hms(): locale argument supported minute() month() (): locale argument supported myd(): locale argument supported parse_date_time(): quiet = FALSE supported Available formats H, , j, M, S, U, w, W, y, Y, R, T. Linux OS X additionally , , b, B, Om, p, r available. 
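To illustrate how the mapped functions above are used inside an Arrow dplyr query, here is a small hedged sketch; the data.frame is illustrative, dplyr and lubridate are assumed to be loaded, and collect() pulls the evaluated result back into R as a tibble.

library(arrow)
library(dplyr)
library(lubridate)

df <- data.frame(ts = c("2019-03-01", "2020-07-15"), n = c(2L, 5L))

arrow_table(df) %>%
  mutate(
    d   = ymd(ts),      # lubridate mapping
    yr  = year(d),
    len = nchar(ts)     # base mapping
  ) %>%
  filter(yr >= 2020) %>%
  collect()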
pm() qday() quarter() round_date() second() semester() tz() wday() week() with_tz() yday() ydm(): locale argument supported ydm_h(): locale argument supported ydm_hm(): locale argument supported ydm_hms(): locale argument supported year() ym(): locale argument supported ymd(): locale argument supported ymd_h(): locale argument supported ymd_hm(): locale argument supported ymd_hms(): locale argument supported yq(): locale argument supported","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"methods","dir":"Reference","previous_headings":"","what":"methods","title":"Functions available in Arrow dplyr queries — acero","text":"()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"rlang","dir":"Reference","previous_headings":"","what":"rlang","title":"Functions available in Arrow dplyr queries — acero","text":"is_character() is_double() is_integer() is_list() is_logical()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"stats","dir":"Reference","previous_headings":"","what":"stats","title":"Functions available in Arrow dplyr queries — acero","text":"median(): approximate median (t-digest) computed quantile(): probs must length 1; approximate quantile (t-digest) computed sd() var()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"stringi","dir":"Reference","previous_headings":"","what":"stringi","title":"Functions available in Arrow dplyr queries — acero","text":"stri_reverse()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"stringr","dir":"Reference","previous_headings":"","what":"stringr","title":"Functions available in Arrow dplyr queries — acero","text":"Pattern modifiers coll() boundary() supported functions. 
str_c(): collapse argument yet supported str_count(): pattern must length 1 character vector str_detect() str_dup() str_ends() str_length() str_like() str_pad() str_remove() str_remove_all() str_replace() str_replace_all() str_split(): Case-insensitive string splitting splitting 0 parts supported str_starts() str_sub(): start end must length 1 str_to_lower() str_to_title() str_to_upper() str_trim()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"tibble","dir":"Reference","previous_headings":"","what":"tibble","title":"Functions available in Arrow dplyr queries — acero","text":"tibble()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/acero.html","id":"tidyselect","dir":"Reference","previous_headings":"","what":"tidyselect","title":"Functions available in Arrow dplyr queries — acero","text":"all_of() contains() ends_with() everything() last_col() matches() num_range() one_of() starts_with()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/add_filename.html","id":null,"dir":"Reference","previous_headings":"","what":"Add the data filename as a column — add_filename","title":"Add the data filename as a column — add_filename","text":"function exists inside arrow dplyr queries, valid querying FileSystemDataset.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/add_filename.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Add the data filename as a column — add_filename","text":"","code":"add_filename()"},{"path":"https://arrow.apache.org/docs/r/reference/add_filename.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Add the data filename as a column — add_filename","text":"FieldRef Expression refers filename augmented column.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/add_filename.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Add the data filename as a column — add_filename","text":"use filenames generated function subsequent pipeline steps, must either call compute() collect() first. See Examples.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/add_filename.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Add the data filename as a column — add_filename","text":"","code":"if (FALSE) { open_dataset(\"nyc-taxi\") %>% mutate( file = add_filename() ) # To use a verb like mutate() with add_filename() we need to first call # compute() open_dataset(\"nyc-taxi\") %>% mutate(file = add_filename()) %>% compute() %>% mutate(filename_length = nchar(file)) }"},{"path":"https://arrow.apache.org/docs/r/reference/array-class.html","id":null,"dir":"Reference","previous_headings":"","what":"Array Classes — Array","title":"Array Classes — Array","text":"Array immutable data array logical type length. logical types contained base Array class; also subclasses DictionaryArray, ListArray, StructArray.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/array-class.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Array Classes — Array","text":"Array$create() factory method instantiates Array takes following arguments: x: R vector, list, data.frame type: optional data type x. omitted, type inferred data. Array$create() return appropriate subclass Array, DictionaryArray given R factor. 
compose DictionaryArray directly, call DictionaryArray$create(), takes two arguments: x: R vector Array integers dictionary indices dict: R vector Array dictionary values (like R factor levels limited strings )","code":""},{"path":"https://arrow.apache.org/docs/r/reference/array-class.html","id":"usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Array Classes — Array","text":"","code":"a <- Array$create(x) length(a) print(a) a == a"},{"path":"https://arrow.apache.org/docs/r/reference/array-class.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Array Classes — Array","text":"$IsNull(): Return true value index null. boundscheck $IsValid(): Return true value index valid. boundscheck $length(): Size number elements array contains $nbytes(): Total number bytes consumed elements array $offset: relative position another array's data, enable zero-copy slicing $null_count: number null entries array $type: logical type data $type_id(): type id $Equals() : array equal $ApproxEquals() : $Diff() : return string expressing difference two arrays $data(): return underlying ArrayData $as_vector(): convert R vector $ToString(): string representation array $Slice(offset, length = NULL): Construct zero-copy slice array indicated offset length. length NULL, slice goes end array. $Take(): return Array values positions given integers (R vector Array Array) . $Filter(, keep_na = TRUE): return Array values positions logical vector (Arrow boolean Array) TRUE. $SortIndices(descending = FALSE): return Array integer positions can used rearrange Array ascending descending order $RangeEquals(, start_idx, end_idx, other_start_idx) : $cast(target_type, safe = TRUE, options = cast_options(safe)): Alter data array change type. $View(type): Construct zero-copy view array given type. $Validate() : Perform validation checks determine obvious inconsistencies within array's internal data. can expensive check, potentially O(length)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/array-class.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Array Classes — Array","text":"","code":"my_array <- Array$create(1:10) my_array$type #> Int32 #> int32 my_array$cast(int8()) #> Array #> <int8> #> [ #> 1, #> 2, #> 3, #> 4, #> 5, #> 6, #> 7, #> 8, #> 9, #> 10 #> ] # Check if value is null; zero-indexed na_array <- Array$create(c(1:5, NA)) na_array$IsNull(0) #> [1] FALSE na_array$IsNull(5) #> [1] TRUE na_array$IsValid(5) #> [1] FALSE na_array$null_count #> [1] 1 # zero-copy slicing; the offset of the new Array will be the same as the index passed to $Slice new_array <- na_array$Slice(5) new_array$offset #> [1] 5 # Compare 2 arrays na_array2 <- na_array na_array2 == na_array # element-wise comparison #> Array #> <bool> #> [ #> true, #> true, #> true, #> true, #> true, #> null #> ] na_array2$Equals(na_array) # overall comparison #> [1] TRUE"},{"path":"https://arrow.apache.org/docs/r/reference/arrow-package.html","id":null,"dir":"Reference","previous_headings":"","what":"arrow: Integration to 'Apache' 'Arrow' — arrow-package","title":"arrow: Integration to 'Apache' 'Arrow' — arrow-package","text":"'Apache' 'Arrow' https://arrow.apache.org/ cross-language development platform -memory data. specifies standardized language-independent columnar memory format flat hierarchical data, organized efficient analytic operations modern hardware. 
package provides interface 'Arrow C++' library.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/arrow-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"arrow: Integration to 'Apache' 'Arrow' — arrow-package","text":"Maintainer: Jonathan Keane jkeane@gmail.com Authors: Neal Richardson neal.p.richardson@gmail.com Ian Cook ianmcook@gmail.com Nic Crane thisisnic@gmail.com Dewey Dunnington dewey@fishandwhistle.net (ORCID) Romain François (ORCID) Dragoș Moldovan-Grünfeld dragos.mold@gmail.com Jeroen Ooms jeroen@berkeley.edu Jacob Wujciak-Jens jacob@wujciak.de Apache Arrow dev@arrow.apache.org [copyright holder] contributors: Javier Luraschi javier@rstudio.com [contributor] Karl Dunkle Werner karldw@users.noreply.github.com (ORCID) [contributor] Jeffrey Wong jeffreyw@netflix.com [contributor]","code":""},{"path":"https://arrow.apache.org/docs/r/reference/arrow_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an Arrow Array — arrow_array","title":"Create an Arrow Array — arrow_array","text":"Create Arrow Array","code":""},{"path":"https://arrow.apache.org/docs/r/reference/arrow_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an Arrow Array — arrow_array","text":"","code":"arrow_array(x, type = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/arrow_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an Arrow Array — arrow_array","text":"x R object representable Arrow array, e.g. vector, list, data.frame. type optional data type x. omitted, type inferred data.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/arrow_array.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an Arrow Array — arrow_array","text":"","code":"my_array <- arrow_array(1:10) # Compare 2 arrays na_array <- arrow_array(c(1:5, NA)) na_array2 <- na_array na_array2 == na_array # element-wise comparison #> Array #> <bool> #> [ #> true, #> true, #> true, #> true, #> true, #> null #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/arrow_info.html","id":null,"dir":"Reference","previous_headings":"","what":"Report information on the package's capabilities — arrow_info","title":"Report information on the package's capabilities — arrow_info","text":"function summarizes number build-time configurations run-time settings Arrow package. may useful diagnostics.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/arrow_info.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Report information on the package's capabilities — arrow_info","text":"","code":"arrow_info() arrow_available() arrow_with_acero() arrow_with_dataset() arrow_with_substrait() arrow_with_parquet() arrow_with_s3() arrow_with_gcs() arrow_with_json()"},{"path":"https://arrow.apache.org/docs/r/reference/arrow_info.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Report information on the package's capabilities — arrow_info","text":"arrow_info() returns list including version information, boolean \"capabilities\", statistics Arrow's memory allocator, also Arrow's run-time information. 
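As a brief illustration of these helpers (results depend entirely on how libarrow was built):
library(arrow)
arrow_with_parquet()   # logical: Parquet support compiled in?
arrow_with_s3()        # logical: S3 filesystem support?
arrow_with_dataset()   # logical: Dataset support?
arrow_info()           # full summary of versions, capabilities, and allocator statistics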
_available() functions return logical value whether C++ library built support .","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow Array — as_arrow_array","title":"Convert an object to an Arrow Array — as_arrow_array","text":"as_arrow_array() function identical Array$create() except S3 generic, allows methods defined packages convert objects Array. Array$create() slightly faster tries convert C++ falling back as_arrow_array().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow Array — as_arrow_array","text":"","code":"as_arrow_array(x, ..., type = NULL) # S3 method for Array as_arrow_array(x, ..., type = NULL) # S3 method for Scalar as_arrow_array(x, ..., type = NULL) # S3 method for ChunkedArray as_arrow_array(x, ..., type = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow Array — as_arrow_array","text":"x object convert Arrow Array ... Passed S3 methods type type final Array. value NULL default type guessed infer_type().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_array.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow Array — as_arrow_array","text":"Array type type.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_array.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow Array — as_arrow_array","text":"","code":"as_arrow_array(1:5) #> Array #> <int32> #> [ #> 1, #> 2, #> 3, #> 4, #> 5 #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_table.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow Table — as_arrow_table","title":"Convert an object to an Arrow Table — as_arrow_table","text":"Whereas arrow_table() constructs table one columns, as_arrow_table() converts single object Arrow Table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_table.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow Table — as_arrow_table","text":"","code":"as_arrow_table(x, ..., schema = NULL) # S3 method for default as_arrow_table(x, ...) # S3 method for Table as_arrow_table(x, ..., schema = NULL) # S3 method for RecordBatch as_arrow_table(x, ..., schema = NULL) # S3 method for data.frame as_arrow_table(x, ..., schema = NULL) # S3 method for RecordBatchReader as_arrow_table(x, ...) # S3 method for Dataset as_arrow_table(x, ...) # S3 method for arrow_dplyr_query as_arrow_table(x, ...) # S3 method for Schema as_arrow_table(x, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_table.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow Table — as_arrow_table","text":"x object convert Arrow Table ... Passed S3 methods schema Schema, NULL (default) infer schema data .... 
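Stepping back to as_arrow_array(): because it is an S3 generic, another package could register a method for its own class. A hypothetical sketch (the class name "my_class" is invented for illustration):
# delegate to the default behaviour after stripping the custom class
as_arrow_array.my_class <- function(x, ..., type = NULL) {
  as_arrow_array(unclass(x), ..., type = type)
}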
providing Arrow IPC buffer, schema required.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_table.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow Table — as_arrow_table","text":"Table","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_arrow_table.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow Table — as_arrow_table","text":"","code":"# use as_arrow_table() for a single object as_arrow_table(data.frame(col1 = 1, col2 = \"two\")) #> Table #> 1 rows x 2 columns #> $col1 <double> #> $col2 <string> #> #> See $metadata for additional Schema metadata # use arrow_table() to create from columns arrow_table(col1 = 1, col2 = \"two\") #> Table #> 1 rows x 2 columns #> $col1 <double> #> $col2 <string>"},{"path":"https://arrow.apache.org/docs/r/reference/as_chunked_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow ChunkedArray — as_chunked_array","title":"Convert an object to an Arrow ChunkedArray — as_chunked_array","text":"Whereas chunked_array() constructs ChunkedArray zero Arrays R vectors, as_chunked_array() converts single object ChunkedArray.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_chunked_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow ChunkedArray — as_chunked_array","text":"","code":"as_chunked_array(x, ..., type = NULL) # S3 method for ChunkedArray as_chunked_array(x, ..., type = NULL) # S3 method for Array as_chunked_array(x, ..., type = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/as_chunked_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow ChunkedArray — as_chunked_array","text":"x object convert Arrow Chunked Array ... Passed S3 methods type type final Array. value NULL default type guessed infer_type().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_chunked_array.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow ChunkedArray — as_chunked_array","text":"ChunkedArray.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_chunked_array.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow ChunkedArray — as_chunked_array","text":"","code":"as_chunked_array(1:5) #> ChunkedArray #> <int32> #> [ #> [ #> 1, #> 2, #> 3, #> 4, #> 5 #> ] #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/as_data_type.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow DataType — as_data_type","title":"Convert an object to an Arrow DataType — as_data_type","text":"Convert object Arrow DataType","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_data_type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow DataType — as_data_type","text":"","code":"as_data_type(x, ...) # S3 method for DataType as_data_type(x, ...) # S3 method for Field as_data_type(x, ...) 
# S3 method for Schema as_data_type(x, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/as_data_type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow DataType — as_data_type","text":"x object convert Arrow DataType ... Passed S3 methods.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_data_type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow DataType — as_data_type","text":"DataType object.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_data_type.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow DataType — as_data_type","text":"","code":"as_data_type(int32()) #> Int32 #> int32"},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow RecordBatch — as_record_batch","title":"Convert an object to an Arrow RecordBatch — as_record_batch","text":"Whereas record_batch() constructs RecordBatch one columns, as_record_batch() converts single object Arrow RecordBatch.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow RecordBatch — as_record_batch","text":"","code":"as_record_batch(x, ..., schema = NULL) # S3 method for RecordBatch as_record_batch(x, ..., schema = NULL) # S3 method for Table as_record_batch(x, ..., schema = NULL) # S3 method for arrow_dplyr_query as_record_batch(x, ...) # S3 method for data.frame as_record_batch(x, ..., schema = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow RecordBatch — as_record_batch","text":"x object convert Arrow RecordBatch ... Passed S3 methods schema Schema, NULL (default) infer schema data .... 
providing Arrow IPC buffer, schema required.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow RecordBatch — as_record_batch","text":"RecordBatch","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow RecordBatch — as_record_batch","text":"","code":"# use as_record_batch() for a single object as_record_batch(data.frame(col1 = 1, col2 = \"two\")) #> RecordBatch #> 1 rows x 2 columns #> $col1 <double> #> $col2 <string> #> #> See $metadata for additional Schema metadata # use record_batch() to create from columns record_batch(col1 = 1, col2 = \"two\") #> RecordBatch #> 1 rows x 2 columns #> $col1 <double> #> $col2 <string>"},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch_reader.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","title":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","text":"Convert object Arrow RecordBatchReader","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch_reader.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","text":"","code":"as_record_batch_reader(x, ...) # S3 method for RecordBatchReader as_record_batch_reader(x, ...) # S3 method for Table as_record_batch_reader(x, ...) # S3 method for RecordBatch as_record_batch_reader(x, ...) # S3 method for data.frame as_record_batch_reader(x, ...) # S3 method for Dataset as_record_batch_reader(x, ...) # S3 method for `function` as_record_batch_reader(x, ..., schema) # S3 method for arrow_dplyr_query as_record_batch_reader(x, ...) # S3 method for Scanner as_record_batch_reader(x, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch_reader.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","text":"x object convert RecordBatchReader ... 
Passed S3 methods schema schema() must match schema returned call x x function.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch_reader.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","text":"RecordBatchReader","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_record_batch_reader.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow RecordBatchReader — as_record_batch_reader","text":"","code":"reader <- as_record_batch_reader(data.frame(col1 = 1, col2 = \"two\")) reader$read_next_batch() #> RecordBatch #> 1 rows x 2 columns #> $col1 <double> #> $col2 <string> #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/as_schema.html","id":null,"dir":"Reference","previous_headings":"","what":"Convert an object to an Arrow Schema — as_schema","title":"Convert an object to an Arrow Schema — as_schema","text":"Convert object Arrow Schema","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_schema.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Convert an object to an Arrow Schema — as_schema","text":"","code":"as_schema(x, ...) # S3 method for Schema as_schema(x, ...) # S3 method for StructType as_schema(x, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/as_schema.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Convert an object to an Arrow Schema — as_schema","text":"x object convert schema() ... Passed S3 methods.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_schema.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Convert an object to an Arrow Schema — as_schema","text":"Schema object.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/as_schema.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Convert an object to an Arrow Schema — as_schema","text":"","code":"as_schema(schema(col1 = int32())) #> Schema #> col1: int32"},{"path":"https://arrow.apache.org/docs/r/reference/buffer.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a Buffer — buffer","title":"Create a Buffer — buffer","text":"Create Buffer","code":""},{"path":"https://arrow.apache.org/docs/r/reference/buffer.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a Buffer — buffer","text":"","code":"buffer(x)"},{"path":"https://arrow.apache.org/docs/r/reference/buffer.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a Buffer — buffer","text":"x R object. raw, numeric integer vectors currently supported","code":""},{"path":"https://arrow.apache.org/docs/r/reference/buffer.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a Buffer — buffer","text":"instance Buffer borrows memory x","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":null,"dir":"Reference","previous_headings":"","what":"Call an Arrow compute function — call_function","title":"Call an Arrow compute function — call_function","text":"function provides lower-level API calling Arrow functions string function name. use directly applications. 
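buffer(), described above, has no example on this page; a minimal sketch of typical usage:
library(arrow)
buf <- buffer(as.raw(1:8))     # a Buffer borrowing the memory of the raw vector
buf
buffer(c(1.5, 2.5))            # numeric and integer vectors are also supported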
Many Arrow compute functions mapped R methods, dplyr evaluation context, Arrow functions callable arrow_ prefix.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Call an Arrow compute function — call_function","text":"","code":"call_function( function_name, ..., args = list(...), options = empty_named_list() )"},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Call an Arrow compute function — call_function","text":"function_name string Arrow compute function name ... Function arguments, may include Array, ChunkedArray, Scalar, RecordBatch, Table. args list arguments alternative specifying ... options named list C++ function options.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Call an Arrow compute function — call_function","text":"Array, ChunkedArray, Scalar, RecordBatch, Table, whatever compute function results .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Call an Arrow compute function — call_function","text":"passing indices ..., args, options, express 0-based integers (consistent C++).","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/call_function.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Call an Arrow compute function — call_function","text":"","code":"a <- Array$create(c(1L, 2L, 3L, NA, 5L)) s <- Scalar$create(4L) call_function(\"coalesce\", a, s) #> Array #> <int32> #> [ #> 1, #> 2, #> 3, #> 4, #> 5 #> ] a <- Array$create(rnorm(10000)) call_function(\"quantile\", a, options = list(q = seq(0, 1, 0.25))) #> Array #> <double> #> [ #> -3.3041822296584606, #> -0.675501909840726, #> 0.0011218985985251336, #> 0.674597899120164, #> 3.5889486327287328 #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/cast.html","id":null,"dir":"Reference","previous_headings":"","what":"Change the type of an array or column — cast","title":"Change the type of an array or column — cast","text":"wrapper around $cast() method many Arrow objects . convenient call inside dplyr pipelines method.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/cast.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Change the type of an array or column — cast","text":"","code":"cast(x, to, safe = TRUE, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/cast.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Change the type of an array or column — cast","text":"x Array, Table, Expression, similar Arrow data object. DataType cast ; Table RecordBatch, Schema. safe logical: allow type conversion data lost (truncation, overflow, etc.). Default TRUE. ... 
specific CastOptions set","code":""},{"path":"https://arrow.apache.org/docs/r/reference/cast.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Change the type of an array or column — cast","text":"Expression","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/cast.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Change the type of an array or column — cast","text":"","code":"if (FALSE) { mtcars %>% arrow_table() %>% mutate(cyl = cast(cyl, string())) }"},{"path":"https://arrow.apache.org/docs/r/reference/cast_options.html","id":null,"dir":"Reference","previous_headings":"","what":"Cast options — cast_options","title":"Cast options — cast_options","text":"Cast options","code":""},{"path":"https://arrow.apache.org/docs/r/reference/cast_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Cast options — cast_options","text":"","code":"cast_options(safe = TRUE, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/cast_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Cast options — cast_options","text":"safe logical: enforce safe conversion? Default TRUE ... additional cast options, allow_int_overflow, allow_time_truncate, allow_float_truncate, set !safe default","code":""},{"path":"https://arrow.apache.org/docs/r/reference/cast_options.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Cast options — cast_options","text":"list","code":""},{"path":"https://arrow.apache.org/docs/r/reference/chunked_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a Chunked Array — chunked_array","title":"Create a Chunked Array — chunked_array","text":"Create Chunked Array","code":""},{"path":"https://arrow.apache.org/docs/r/reference/chunked_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a Chunked Array — chunked_array","text":"","code":"chunked_array(..., type = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/chunked_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a Chunked Array — chunked_array","text":"... R objects coerce ChunkedArray. must type. type optional data type. omitted, type inferred data.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/chunked_array.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a Chunked Array — chunked_array","text":"","code":"# Pass items into chunked_array as separate objects to create chunks class_scores <- chunked_array(c(87, 88, 89), c(94, 93, 92), c(71, 72, 73)) # If you pass a list into chunked_array, you get a list of length 1 list_scores <- chunked_array(list(c(9.9, 9.6, 9.5), c(8.2, 8.3, 8.4), c(10.0, 9.9, 9.8))) # When constructing a ChunkedArray, the first chunk is used to infer type. 
infer_type(chunked_array(c(1, 2, 3), c(5L, 6L, 7L))) #> Float64 #> double # Concatenating chunked arrays returns a new chunked array containing all chunks a <- chunked_array(c(1, 2), 3) b <- chunked_array(c(4, 5), 6) c(a, b) #> ChunkedArray #> <double> #> [ #> [ #> 1, #> 2 #> ], #> [ #> 3 #> ], #> [ #> 4, #> 5 #> ], #> [ #> 6 #> ] #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/codec_is_available.html","id":null,"dir":"Reference","previous_headings":"","what":"Check whether a compression codec is available — codec_is_available","title":"Check whether a compression codec is available — codec_is_available","text":"Support compression libraries depends build-time settings Arrow C++ library. function lets know available use.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/codec_is_available.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Check whether a compression codec is available — codec_is_available","text":"","code":"codec_is_available(type)"},{"path":"https://arrow.apache.org/docs/r/reference/codec_is_available.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Check whether a compression codec is available — codec_is_available","text":"type string, one \"uncompressed\", \"snappy\", \"gzip\", \"brotli\", \"zstd\", \"lz4\", \"lzo\", \"bz2\", case-insensitive.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/codec_is_available.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Check whether a compression codec is available — codec_is_available","text":"Logical: type available?","code":""},{"path":"https://arrow.apache.org/docs/r/reference/codec_is_available.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Check whether a compression codec is available — codec_is_available","text":"","code":"codec_is_available(\"gzip\") #> [1] TRUE"},{"path":"https://arrow.apache.org/docs/r/reference/compression.html","id":null,"dir":"Reference","previous_headings":"","what":"Compressed stream classes — compression","title":"Compressed stream classes — compression","text":"CompressedInputStream CompressedOutputStream allow apply compression Codec input output stream.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/compression.html","id":"factory","dir":"Reference","previous_headings":"","what":"Factory","title":"Compressed stream classes — compression","text":"CompressedInputStream$create() CompressedOutputStream$create() factory methods instantiate object take following arguments: stream InputStream OutputStream, respectively codec Codec, either Codec instance string compression_level compression level codec argument given string","code":""},{"path":"https://arrow.apache.org/docs/r/reference/compression.html","id":"methods","dir":"Reference","previous_headings":"","what":"Methods","title":"Compressed stream classes — compression","text":"Methods inherited InputStream OutputStream, respectively","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_arrays.html","id":null,"dir":"Reference","previous_headings":"","what":"Concatenate zero or more Arrays — concat_arrays","title":"Concatenate zero or more Arrays — concat_arrays","text":"Concatenates zero Array objects single array. 
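A rough sketch of the compressed stream factories described above, assuming the gzip codec is available in your build (see codec_is_available()); the temporary file and data are invented for illustration:
library(arrow)
tf  <- tempfile(fileext = ".csv.gz")
out <- CompressedOutputStream$create(FileOutputStream$create(tf), codec = "gzip")
write_csv_arrow(data.frame(x = 1:3), out)
out$close()
inp <- CompressedInputStream$create(ReadableFile$create(tf), codec = "gzip")
read_csv_arrow(inp)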
operation make copy input; need behavior single Array need single object, use ChunkedArray.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_arrays.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Concatenate zero or more Arrays — concat_arrays","text":"","code":"concat_arrays(..., type = NULL) # S3 method for Array c(...)"},{"path":"https://arrow.apache.org/docs/r/reference/concat_arrays.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Concatenate zero or more Arrays — concat_arrays","text":"... zero Array objects concatenate type optional type describing desired type final Array.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_arrays.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Concatenate zero or more Arrays — concat_arrays","text":"single Array","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_arrays.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Concatenate zero or more Arrays — concat_arrays","text":"","code":"concat_arrays(Array$create(1:3), Array$create(4:5)) #> Array #> <int32> #> [ #> 1, #> 2, #> 3, #> 4, #> 5 #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/concat_tables.html","id":null,"dir":"Reference","previous_headings":"","what":"Concatenate one or more Tables — concat_tables","title":"Concatenate one or more Tables — concat_tables","text":"Concatenate one Table objects single table. operation copy array data, instead creates new chunked arrays column point existing array data.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_tables.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Concatenate one or more Tables — concat_tables","text":"","code":"concat_tables(..., unify_schemas = TRUE)"},{"path":"https://arrow.apache.org/docs/r/reference/concat_tables.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Concatenate one or more Tables — concat_tables","text":"... Table unify_schemas TRUE, schemas tables first unified fields name merged, table promoted unified schema concatenated. Otherwise, tables schema.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/concat_tables.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Concatenate one or more Tables — concat_tables","text":"","code":"tbl <- arrow_table(name = rownames(mtcars), mtcars) prius <- arrow_table(name = \"Prius\", mpg = 58, cyl = 4, disp = 1.8) combined <- concat_tables(tbl, prius) tail(combined)$to_data_frame() #> # A tibble: 6 x 12 #> name mpg cyl disp hp drat wt qsec vs am gear carb #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 #> 2 Ford Panter~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 #> 3 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 #> 4 Maserati Bo~ 15 8 301 335 3.54 3.57 14.6 0 1 5 8 #> 5 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 #> 6 Prius 58 4 1.8 NA NA NA NA NA NA NA NA"},{"path":"https://arrow.apache.org/docs/r/reference/contains_regex.html","id":null,"dir":"Reference","previous_headings":"","what":"Does this string contain regex metacharacters? — contains_regex","title":"Does this string contain regex metacharacters? 
— contains_regex","text":"string contain regex metacharacters?","code":""},{"path":"https://arrow.apache.org/docs/r/reference/contains_regex.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Does this string contain regex metacharacters? — contains_regex","text":"","code":"contains_regex(string)"},{"path":"https://arrow.apache.org/docs/r/reference/contains_regex.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Does this string contain regex metacharacters? — contains_regex","text":"string String tested","code":""},{"path":"https://arrow.apache.org/docs/r/reference/contains_regex.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Does this string contain regex metacharacters? — contains_regex","text":"Logical: string contain regex metacharacters?","code":""},{"path":"https://arrow.apache.org/docs/r/reference/copy_files.html","id":null,"dir":"Reference","previous_headings":"","what":"Copy files between FileSystems — copy_files","title":"Copy files between FileSystems — copy_files","text":"Copy files FileSystems","code":""},{"path":"https://arrow.apache.org/docs/r/reference/copy_files.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Copy files between FileSystems — copy_files","text":"","code":"copy_files(from, to, chunk_size = 1024L * 1024L)"},{"path":"https://arrow.apache.org/docs/r/reference/copy_files.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Copy files between FileSystems — copy_files","text":"string path local directory file, URI, SubTreeFileSystem. Files copied recursively path. string path local directory file, URI, SubTreeFileSystem. Directories created necessary chunk_size maximum size block read flushing destination file. 
larger chunk_size use memory copying may help accommodate high latency FileSystems.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/copy_files.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Copy files between FileSystems — copy_files","text":"Nothing: called side effects file system","code":""},{"path":"https://arrow.apache.org/docs/r/reference/copy_files.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Copy files between FileSystems — copy_files","text":"","code":"if (FALSE) { # Copy an S3 bucket's files to a local directory: copy_files(\"s3://your-bucket-name\", \"local-directory\") # Using a FileSystem object copy_files(s3_bucket(\"your-bucket-name\"), \"local-directory\") # Or go the other way, from local to S3 copy_files(\"local-directory\", s3_bucket(\"your-bucket-name\")) }"},{"path":"https://arrow.apache.org/docs/r/reference/cpu_count.html","id":null,"dir":"Reference","previous_headings":"","what":"Manage the global CPU thread pool in libarrow — cpu_count","title":"Manage the global CPU thread pool in libarrow — cpu_count","text":"Manage global CPU thread pool libarrow","code":""},{"path":"https://arrow.apache.org/docs/r/reference/cpu_count.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Manage the global CPU thread pool in libarrow — cpu_count","text":"","code":"cpu_count() set_cpu_count(num_threads)"},{"path":"https://arrow.apache.org/docs/r/reference/cpu_count.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Manage the global CPU thread pool in libarrow — cpu_count","text":"num_threads integer: New number threads thread pool","code":""},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"Create source bundle includes thirdparty dependencies","code":""},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"","code":"create_package_with_all_dependencies(dest_file = NULL, source_file = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"dest_file File path new tar.gz package. Defaults arrow_V.V.V_with_deps.tar.gz current directory (V.V.V version) source_file File path input tar.gz package. Defaults downloading package CRAN (whatever set first getOption(\"repos\"))","code":""},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"full path dest_file, invisibly function used setting offline build. possible download build time, use function. 
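Returning to cpu_count() and set_cpu_count(), described above, typical usage is simply:
library(arrow)
cpu_count()        # current size of libarrow's global CPU thread pool
set_cpu_count(2)   # e.g. restrict arrow to two threads for this session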
Instead, let cmake download required dependencies . downloaded dependencies used build ARROW_DEPENDENCY_SOURCE unset, BUNDLED, AUTO. https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds using binary packages need use function. download appropriate binary package repository, transfer offline computer, install . OS can create source bundle, installed Windows. (Instead, use standard Windows binary package.) Note using RStudio Package Manager Linux: still want make source bundle function, make sure set first repo options(\"repos\") mirror contains source packages (: something RSPM binary mirror URLs).","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"using-a-computer-with-internet-access-pre-download-the-dependencies-","dir":"Reference","previous_headings":"","what":"Using a computer with internet access, pre-download the dependencies:","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"Install arrow package run source(\"https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R\") Run create_package_with_all_dependencies(\"my_arrow_pkg.tar.gz\") Copy newly created my_arrow_pkg.tar.gz computer without internet access","code":""},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"on-the-computer-without-internet-access-install-the-prepared-package-","dir":"Reference","previous_headings":"","what":"On the computer without internet access, install the prepared package:","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"Install arrow package copied file install.packages(\"my_arrow_pkg.tar.gz\", dependencies = c(\"Depends\", \"Imports\", \"LinkingTo\")) installation build source, cmake must available Run arrow_info() check installed capabilities","code":""},{"path":"https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a source bundle that includes all thirdparty dependencies — create_package_with_all_dependencies","text":"","code":"if (FALSE) { new_pkg <- create_package_with_all_dependencies() # Note: this works when run in the same R session, but it's meant to be # copied to a different computer. 
install.packages(new_pkg, dependencies = c(\"Depends\", \"Imports\", \"LinkingTo\")) }"},{"path":"https://arrow.apache.org/docs/r/reference/csv_convert_options.html","id":null,"dir":"Reference","previous_headings":"","what":"CSV Convert Options — csv_convert_options","title":"CSV Convert Options — csv_convert_options","text":"CSV Convert Options","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_convert_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"CSV Convert Options — csv_convert_options","text":"","code":"csv_convert_options( check_utf8 = TRUE, null_values = c(\"\", \"NA\"), true_values = c(\"T\", \"true\", \"TRUE\"), false_values = c(\"F\", \"false\", \"FALSE\"), strings_can_be_null = FALSE, col_types = NULL, auto_dict_encode = FALSE, auto_dict_max_cardinality = 50L, include_columns = character(), include_missing_columns = FALSE, timestamp_parsers = NULL, decimal_point = \".\" )"},{"path":"https://arrow.apache.org/docs/r/reference/csv_convert_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"CSV Convert Options — csv_convert_options","text":"check_utf8 Logical: check UTF8 validity string columns? null_values Character vector recognized spellings null values. Analogous na.strings argument read.csv() na readr::read_csv(). true_values Character vector recognized spellings TRUE values false_values Character vector recognized spellings FALSE values strings_can_be_null Logical: can string / binary columns null values? Similar quoted_na argument readr::read_csv() col_types Schema NULL infer types auto_dict_encode Logical: Whether try automatically dictionary-encode string / binary data (think stringsAsFactors). setting ignored non-inferred columns (col_types). auto_dict_max_cardinality auto_dict_encode, string/binary columns dictionary-encoded number unique values (default 50), switches regular encoding. include_columns non-empty, indicates names columns CSV file actually read converted (vector's order). include_missing_columns Logical: include_columns provided, columns named found data included column type null()? default (FALSE) means reader instead raise error. timestamp_parsers User-defined timestamp parsers. one parser specified, CSV conversion logic try parsing values starting beginning vector. Possible values () NULL, default, uses ISO-8601 parser; (b) character vector strptime parse strings; (c) list TimestampParser objects. 
decimal_point Character use decimal point floating point numbers.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_convert_options.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"CSV Convert Options — csv_convert_options","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) writeLines(\"x\\n1\\nNULL\\n2\\nNA\", tf) read_csv_arrow(tf, convert_options = csv_convert_options(null_values = c(\"\", \"NA\", \"NULL\"))) #> # A tibble: 4 x 1 #> x #> <int> #> 1 1 #> 2 NA #> 3 2 #> 4 NA open_csv_dataset(tf, convert_options = csv_convert_options(null_values = c(\"\", \"NA\", \"NULL\"))) #> FileSystemDataset with 1 csv file #> 1 columns #> x: int64"},{"path":"https://arrow.apache.org/docs/r/reference/csv_parse_options.html","id":null,"dir":"Reference","previous_headings":"","what":"CSV Parsing Options — csv_parse_options","title":"CSV Parsing Options — csv_parse_options","text":"CSV Parsing Options","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_parse_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"CSV Parsing Options — csv_parse_options","text":"","code":"csv_parse_options( delimiter = \",\", quoting = TRUE, quote_char = \"\\\"\", double_quote = TRUE, escaping = FALSE, escape_char = \"\\\\\", newlines_in_values = FALSE, ignore_empty_lines = TRUE )"},{"path":"https://arrow.apache.org/docs/r/reference/csv_parse_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"CSV Parsing Options — csv_parse_options","text":"delimiter Field delimiting character quoting Logical: strings quoted? quote_char Quoting character, quoting TRUE double_quote Logical: quotes inside values double-quoted? escaping Logical: whether escaping used escape_char Escaping character, escaping TRUE newlines_in_values Logical: values allowed contain CR (0x0d) LF (0x0a) characters? 
ignore_empty_lines Logical: empty lines ignored (default) generate row missing values (FALSE)?","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_parse_options.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"CSV Parsing Options — csv_parse_options","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) writeLines(\"x\\n1\\n\\n2\", tf) read_csv_arrow(tf, parse_options = csv_parse_options(ignore_empty_lines = FALSE)) #> # A tibble: 3 x 1 #> x #> <int> #> 1 1 #> 2 NA #> 3 2 open_csv_dataset(tf, parse_options = csv_parse_options(ignore_empty_lines = FALSE)) #> FileSystemDataset with 1 csv file #> 1 columns #> x: int64"},{"path":"https://arrow.apache.org/docs/r/reference/csv_read_options.html","id":null,"dir":"Reference","previous_headings":"","what":"CSV Reading Options — csv_read_options","title":"CSV Reading Options — csv_read_options","text":"CSV Reading Options","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_read_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"CSV Reading Options — csv_read_options","text":"","code":"csv_read_options( use_threads = option_use_threads(), block_size = 1048576L, skip_rows = 0L, column_names = character(0), autogenerate_column_names = FALSE, encoding = \"UTF-8\", skip_rows_after_names = 0L )"},{"path":"https://arrow.apache.org/docs/r/reference/csv_read_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"CSV Reading Options — csv_read_options","text":"use_threads Whether use global CPU thread pool block_size Block size request IO layer; also determines size chunks use_threads TRUE. skip_rows Number lines skip reading data (default 0). column_names Character vector supply column names. length-0 (default), first non-skipped row parsed generate column names, unless autogenerate_column_names TRUE. autogenerate_column_names Logical: generate column names instead using first non-skipped row (default)? TRUE, column names \"f0\", \"f1\", ..., \"fN\". encoding file encoding. (default \"UTF-8\") skip_rows_after_names Number lines skip column names (default 0). number can larger number rows one block, empty rows counted. 
order application follows: - skip_rows applied (non-zero); - column names read (unless column_names set); - skip_rows_after_names applied (non-zero).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_read_options.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"CSV Reading Options — csv_read_options","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) writeLines(\"my file has a non-data header\\nx\\n1\\n2\", tf) read_csv_arrow(tf, read_options = csv_read_options(skip_rows = 1)) #> # A tibble: 2 x 1 #> x #> <int> #> 1 1 #> 2 2 open_csv_dataset(tf, read_options = csv_read_options(skip_rows = 1)) #> FileSystemDataset with 1 csv file #> 1 columns #> x: int64"},{"path":"https://arrow.apache.org/docs/r/reference/csv_write_options.html","id":null,"dir":"Reference","previous_headings":"","what":"CSV Writing Options — csv_write_options","title":"CSV Writing Options — csv_write_options","text":"CSV Writing Options","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_write_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"CSV Writing Options — csv_write_options","text":"","code":"csv_write_options( include_header = TRUE, batch_size = 1024L, null_string = \"\", delimiter = \",\", eol = \"\\n\", quoting_style = c(\"Needed\", \"AllValid\", \"None\") )"},{"path":"https://arrow.apache.org/docs/r/reference/csv_write_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"CSV Writing Options — csv_write_options","text":"include_header Whether write initial header line column names batch_size Maximum number rows processed time. null_string string written null values. Must contain quotation marks. delimiter Field delimiter eol end line character use ending rows quoting_style handle quotes. \"Needed\" (enclose values quotes need , CSV rendering can contain quotes (e.g. strings binary values)), \"AllValid\" (Enclose valid values quotes), \"None\" (enclose values quotes).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/csv_write_options.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"CSV Writing Options — csv_write_options","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) write_csv_arrow(airquality, tf, write_options = csv_write_options(null_string = \"-99\"))"},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":null,"dir":"Reference","previous_headings":"","what":"Create Arrow data types — data-type","title":"Create Arrow data types — data-type","text":"functions create type objects corresponding Arrow types. Use defining schema() inputs types, like struct. functions take arguments, .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create Arrow data types — data-type","text":"","code":"int8() int16() int32() int64() uint8() uint16() uint32() uint64() float16() halffloat() float32() float() float64() boolean() bool() utf8() large_utf8() binary() large_binary() fixed_size_binary(byte_width) string() date32() date64() time32(unit = c(\"ms\", \"s\")) time64(unit = c(\"ns\", \"us\")) duration(unit = c(\"s\", \"ms\", \"us\", \"ns\")) null() timestamp(unit = c(\"s\", \"ms\", \"us\", \"ns\"), timezone = \"\") decimal(precision, scale) decimal128(precision, scale) decimal256(precision, scale) struct(...) 
list_of(type) large_list_of(type) fixed_size_list_of(type, list_size) map_of(key_type, item_type, .keys_sorted = FALSE)"},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create Arrow data types — data-type","text":"byte_width byte width FixedSizeBinary type. unit time/timestamp types, time unit. time32() can take either \"s\" \"ms\", time64() can \"us\" \"ns\". timestamp() can take four values. timezone timestamp(), optional time zone string. precision decimal(), decimal128(), decimal256() number significant digits arrow decimal type can represent. maximum precision decimal128() 38 significant digits, decimal256() 76 digits. decimal() use choose type decimal return. scale decimal(), decimal128(), decimal256() number digits decimal point. can negative. ... struct(), named list types define struct columns type list_of(), data type make list--type list_size list size FixedSizeList type. key_type, item_type MapType, key item types. .keys_sorted Use TRUE assert keys MapType sorted.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create Arrow data types — data-type","text":"Arrow type object inheriting DataType.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create Arrow data types — data-type","text":"functions aliases: utf8() string() float16() halffloat() float32() float() bool() boolean() called inside arrow function, schema() cast(), double() also supported way creating float64() date32() creates datetime type \"day\" unit, like R Date class. date64() \"ms\" unit. uint32 (32 bit unsigned integer), uint64 (64 bit unsigned integer), int64 (64-bit signed integer) types may contain values exceed range R's integer type (32-bit signed integer). arrow objects translated R objects, uint32 uint64 converted double (\"numeric\") int64 converted bit64::integer64. int64 types, conversion can disabled (int64 always yields bit64::integer64 object) setting options(arrow.int64_downcast = FALSE). decimal128() creates Decimal128Type. Arrow decimals fixed-point decimal numbers encoded scalar integer. precision number significant digits decimal type can represent; scale number digits decimal point. example, number 1234.567 precision 7 scale 3. Note scale can negative. example, decimal128(7, 3) can exactly represent numbers 1234.567 -1234.567 (encoded internally 128-bit integers 1234567 -1234567, respectively), neither 12345.67 123.4567. decimal128(5, -3) can exactly represent number 12345000 (encoded internally 128-bit integer 12345), neither 123450000 1234500. scale can thought argument controls rounding. negative, scale causes number expressed using scientific notation power 10. decimal256() creates Decimal256Type, allows higher maximum precision. use cases, maximum precision offered Decimal128Type sufficient, result compact efficient encoding. decimal() creates either Decimal128Type Decimal256Type depending value precision. precision greater 38 Decimal256Type returned, otherwise Decimal128Type. 
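To make the precision/scale discussion concrete (a small sketch; printed type names may vary slightly by version):
library(arrow)
decimal128(7, 3)    # can represent 1234.567 exactly
decimal(40, 2)      # precision > 38, so a Decimal256 type is returned
decimal(10, 2)      # precision <= 38, so a Decimal128 type is returned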
Use decimal128() decimal256() names informative decimal().","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/data-type.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create Arrow data types — data-type","text":"","code":"bool() #> Boolean #> bool struct(a = int32(), b = double()) #> StructType #> struct<a: int32, b: double> timestamp(\"ms\", timezone = \"CEST\") #> Timestamp #> timestamp[ms, tz=CEST] time64(\"ns\") #> Time64 #> time64[ns] # Use the cast method to change the type of data contained in Arrow objects. # Please check the documentation of each data object class for details. my_scalar <- Scalar$create(0L, type = int64()) # int64 my_scalar$cast(timestamp(\"ns\")) # timestamp[ns] #> Scalar #> 1970-01-01 00:00:00.000000000 my_array <- Array$create(0L, type = int64()) # int64 my_array$cast(timestamp(\"s\", timezone = \"UTC\")) # timestamp[s, tz=UTC] #> Array #> <timestamp[s, tz=UTC]> #> [ #> 1970-01-01 00:00:00Z #> ] my_chunked_array <- chunked_array(0L, 1L) # int32 my_chunked_array$cast(date32()) # date32[day] #> ChunkedArray #> <date32[day]> #> [ #> [ #> 1970-01-01 #> ], #> [ #> 1970-01-02 #> ] #> ] # You can also use `cast()` in an Arrow dplyr query. if (requireNamespace(\"dplyr\", quietly = TRUE)) { library(dplyr, warn.conflicts = FALSE) arrow_table(mtcars) %>% transmute( col1 = cast(cyl, string()), col2 = cast(cyl, int8()) ) %>% compute() } #> Table #> 32 rows x 2 columns #> $col1 <string> #> $col2 <int8> #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/dataset_factory.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a DatasetFactory — dataset_factory","title":"Create a DatasetFactory — dataset_factory","text":"Dataset can constructed using one DatasetFactorys. function helps construct DatasetFactory can pass open_dataset().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dataset_factory.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a DatasetFactory — dataset_factory","text":"","code":"dataset_factory( x, filesystem = NULL, format = c(\"parquet\", \"arrow\", \"ipc\", \"feather\", \"csv\", \"tsv\", \"text\", \"json\"), partitioning = NULL, hive_style = NA, factory_options = list(), ... )"},{"path":"https://arrow.apache.org/docs/r/reference/dataset_factory.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a DatasetFactory — dataset_factory","text":"x string path directory containing data files, vector one one string paths data files, list DatasetFactory objects whose datasets combined. argument specified used construct UnionDatasetFactory arguments ignored. filesystem FileSystem object; omitted, FileSystem detected x format FileFormat object, string identifier format files x. Currently supported values: \"parquet\" \"ipc\"/\"arrow\"/\"feather\", aliases ; Feather, note version 2 files supported \"csv\"/\"text\", aliases thing (comma default delimiter text files \"tsv\", equivalent passing format = \"text\", delimiter = \"\\t\" Default \"parquet\", unless delimiter also specified, case assumed \"text\". partitioning One Schema, case file paths relative sources parsed, path segments matched schema fields. example, schema(year = int16(), month = int8()) create partitions file paths like \"2019/01/file.parquet\", \"2019/02/file.parquet\", etc. 
character vector defines field names corresponding path segments (, providing names correspond Schema types autodetected) HivePartitioning HivePartitioningFactory, returned hive_partition() parses explicit autodetected fields Hive-style path segments NULL partitioning hive_style Logical: partitioning character vector Schema, interpreted specifying Hive-style partitioning? Default NA, means inspect file paths Hive-style partitioning behave accordingly. factory_options list optional FileSystemFactoryOptions: partition_base_dir: string path segment prefix ignore discovering partition information DirectoryPartitioning. meaningful (ignored warning) HivePartitioning, valid providing vector file paths. exclude_invalid_files: logical: files valid data files excluded? Default FALSE checking files front incurs /O thus slower, especially remote filesystems. false invalid files, error scan time. FileSystemFactoryOption valid providing directory path discover files providing vector file paths. selector_ignore_prefixes: character vector file prefixes ignore discovering files directory. invalid files can excluded common filename prefix way, can avoid /O cost exclude_invalid_files. valid providing vector file paths (providing file list, can filter invalid files ). ... Additional format-specific options, passed FileFormat$create(). CSV options, note can specify either Arrow C++ library naming (\"delimiter\", \"quoting\", etc.) readr-style naming used read_csv_arrow() (\"delim\", \"quote\", etc.). readr options currently supported; please file issue encounter one arrow support.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dataset_factory.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a DatasetFactory — dataset_factory","text":"DatasetFactory object. Pass open_dataset(), list potentially DatasetFactory objects, create Dataset.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dataset_factory.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create a DatasetFactory — dataset_factory","text":"single DatasetFactory (example, single directory containing Parquet files), can call open_dataset() directly. 
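A hypothetical sketch of the multi-source case mentioned just below (directory names invented for illustration; each directory is assumed to hold files of the named format):
library(arrow)
factories <- list(
  dataset_factory("data/parquet-part", format = "parquet"),
  dataset_factory("data/csv-part", format = "csv")
)
ds <- open_dataset(factories)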
Use dataset_factory() want combine different directories, file systems, file formats.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/default_memory_pool.html","id":null,"dir":"Reference","previous_headings":"","what":"Arrow's default MemoryPool — default_memory_pool","title":"Arrow's default MemoryPool — default_memory_pool","text":"Arrow's default MemoryPool","code":""},{"path":"https://arrow.apache.org/docs/r/reference/default_memory_pool.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Arrow's default MemoryPool — default_memory_pool","text":"","code":"default_memory_pool()"},{"path":"https://arrow.apache.org/docs/r/reference/default_memory_pool.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Arrow's default MemoryPool — default_memory_pool","text":"default MemoryPool","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dictionary.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a dictionary type — dictionary","title":"Create a dictionary type — dictionary","text":"Create dictionary type","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dictionary.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a dictionary type — dictionary","text":"","code":"dictionary(index_type = int32(), value_type = utf8(), ordered = FALSE)"},{"path":"https://arrow.apache.org/docs/r/reference/dictionary.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a dictionary type — dictionary","text":"index_type DataType indices (default int32()) value_type DataType values (default utf8()) ordered ordered dictionary (default FALSE)?","code":""},{"path":"https://arrow.apache.org/docs/r/reference/dictionary.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a dictionary type — dictionary","text":"DictionaryType","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/enums.html","id":null,"dir":"Reference","previous_headings":"","what":"Arrow enums — enums","title":"Arrow enums — enums","text":"Arrow enums","code":""},{"path":"https://arrow.apache.org/docs/r/reference/enums.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Arrow enums — enums","text":"","code":"TimeUnit DateUnit Type StatusCode FileMode MessageType CompressionType FileType ParquetVersionType MetadataVersion QuantileInterpolation NullEncodingBehavior NullHandlingBehavior RoundMode JoinType"},{"path":"https://arrow.apache.org/docs/r/reference/enums.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Arrow enums — enums","text":"object class TimeUnit::type (inherits arrow-enum) length 4. object class DateUnit (inherits arrow-enum) length 2. object class Type::type (inherits arrow-enum) length 39. object class StatusCode (inherits arrow-enum) length 13. object class FileMode (inherits arrow-enum) length 3. object class MessageType (inherits arrow-enum) length 5. object class Compression::type (inherits arrow-enum) length 9. object class FileType (inherits arrow-enum) length 4. object class ParquetVersionType (inherits arrow-enum) length 4. object class MetadataVersion (inherits arrow-enum) length 5. object class QuantileInterpolation (inherits arrow-enum) length 5. object class NullEncodingBehavior (inherits arrow-enum) length 2. object class NullHandlingBehavior (inherits arrow-enum) length 3. 
object class RoundMode (inherits arrow-enum) length 10. object class JoinType (inherits arrow-enum) length 8.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_connect.html","id":null,"dir":"Reference","previous_headings":"","what":"Connect to a Flight server — flight_connect","title":"Connect to a Flight server — flight_connect","text":"Connect Flight server","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_connect.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Connect to a Flight server — flight_connect","text":"","code":"flight_connect(host = \"localhost\", port, scheme = \"grpc+tcp\")"},{"path":"https://arrow.apache.org/docs/r/reference/flight_connect.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Connect to a Flight server — flight_connect","text":"host string hostname connect port integer port connect scheme URL scheme, default \"grpc+tcp\"","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_connect.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Connect to a Flight server — flight_connect","text":"pyarrow.flight.FlightClient.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_disconnect.html","id":null,"dir":"Reference","previous_headings":"","what":"Explicitly close a Flight client — flight_disconnect","title":"Explicitly close a Flight client — flight_disconnect","text":"Explicitly close Flight client","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_disconnect.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Explicitly close a Flight client — flight_disconnect","text":"","code":"flight_disconnect(client)"},{"path":"https://arrow.apache.org/docs/r/reference/flight_disconnect.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Explicitly close a Flight client — flight_disconnect","text":"client client disconnect","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_get.html","id":null,"dir":"Reference","previous_headings":"","what":"Get data from a Flight server — flight_get","title":"Get data from a Flight server — flight_get","text":"Get data Flight server","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_get.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Get data from a Flight server — flight_get","text":"","code":"flight_get(client, path)"},{"path":"https://arrow.apache.org/docs/r/reference/flight_get.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Get data from a Flight server — flight_get","text":"client pyarrow.flight.FlightClient, returned flight_connect() path string identifier data stored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_get.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Get data from a Flight server — flight_get","text":"Table","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_put.html","id":null,"dir":"Reference","previous_headings":"","what":"Send data to a Flight server — flight_put","title":"Send data to a Flight server — flight_put","text":"Send data Flight server","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_put.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Send data to a Flight server — 
flight_put","text":"","code":"flight_put(client, data, path, overwrite = TRUE, max_chunksize = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/flight_put.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Send data to a Flight server — flight_put","text":"client pyarrow.flight.FlightClient, returned flight_connect() data data.frame, RecordBatch, Table upload path string identifier store data overwrite logical: path exists client already, replace contents data? Default TRUE; FALSE path exists, function error. max_chunksize integer: Maximum number rows RecordBatch chunks data.frame sent. Individual chunks may smaller depending chunk layout individual columns.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/flight_put.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Send data to a Flight server — flight_put","text":"client, invisibly.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/format_schema.html","id":null,"dir":"Reference","previous_headings":"","what":"Get a string representing a Dataset or RecordBatchReader object's schema — format_schema","title":"Get a string representing a Dataset or RecordBatchReader object's schema — format_schema","text":"Get string representing Dataset RecordBatchReader object's schema","code":""},{"path":"https://arrow.apache.org/docs/r/reference/format_schema.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Get a string representing a Dataset or RecordBatchReader object's schema — format_schema","text":"","code":"format_schema(obj)"},{"path":"https://arrow.apache.org/docs/r/reference/format_schema.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Get a string representing a Dataset or RecordBatchReader object's schema — format_schema","text":"obj Dataset RecordBatchReader","code":""},{"path":"https://arrow.apache.org/docs/r/reference/format_schema.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Get a string representing a Dataset or RecordBatchReader object's schema — format_schema","text":"string containing formatted representation schema obj","code":""},{"path":"https://arrow.apache.org/docs/r/reference/get_stringr_pattern_options.html","id":null,"dir":"Reference","previous_headings":"","what":"Get stringr pattern options — get_stringr_pattern_options","title":"Get stringr pattern options — get_stringr_pattern_options","text":"function assigns definitions stringr pattern modifier functions (fixed(), regex(), etc.) 
inside , uses evaluate quoted expression pattern, returning list used control pattern matching behavior internal arrow functions.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/get_stringr_pattern_options.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Get stringr pattern options — get_stringr_pattern_options","text":"","code":"get_stringr_pattern_options(pattern)"},{"path":"https://arrow.apache.org/docs/r/reference/get_stringr_pattern_options.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Get stringr pattern options — get_stringr_pattern_options","text":"pattern Unevaluated expression containing call stringr pattern modifier function","code":""},{"path":"https://arrow.apache.org/docs/r/reference/get_stringr_pattern_options.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Get stringr pattern options — get_stringr_pattern_options","text":"List containing elements pattern, fixed, ignore_case","code":""},{"path":"https://arrow.apache.org/docs/r/reference/gs_bucket.html","id":null,"dir":"Reference","previous_headings":"","what":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","title":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","text":"gs_bucket() convenience function create GcsFileSystem object holds onto relative path","code":""},{"path":"https://arrow.apache.org/docs/r/reference/gs_bucket.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","text":"","code":"gs_bucket(bucket, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/gs_bucket.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","text":"bucket string GCS bucket name path ... Additional connection options, passed GcsFileSystem$create()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/gs_bucket.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","text":"SubTreeFileSystem containing GcsFileSystem bucket's relative path. Note function's success guarantee authorized access bucket's contents.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/gs_bucket.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Connect to a Google Cloud Storage (GCS) bucket — gs_bucket","text":"","code":"if (FALSE) { bucket <- gs_bucket(\"voltrondata-labs-datasets\") }"},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":null,"dir":"Reference","previous_headings":"","what":"Construct Hive partitioning — hive_partition","title":"Construct Hive partitioning — hive_partition","text":"Hive partitioning embeds field names values path segments, \"/year=2019/month=2/data.parquet\".","code":""},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Construct Hive partitioning — hive_partition","text":"","code":"hive_partition(..., null_fallback = NULL, segment_encoding = \"uri\")"},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Construct Hive partitioning — hive_partition","text":"... 
named list data types, passed schema() null_fallback character used place missing values (NA NULL) partition columns. Default \"__HIVE_DEFAULT_PARTITION__\", Hive uses. segment_encoding Decode partition segments splitting paths. Default \"uri\" (URI-decode segments). May also \"none\" (leave -).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Construct Hive partitioning — hive_partition","text":"HivePartitioning, HivePartitioningFactory calling hive_partition() arguments.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Construct Hive partitioning — hive_partition","text":"fields named path segments, order fields passed hive_partition() matter.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/hive_partition.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Construct Hive partitioning — hive_partition","text":"","code":"hive_partition(year = int16(), month = int8()) #> HivePartitioning"},{"path":"https://arrow.apache.org/docs/r/reference/infer_schema.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract a schema from an object — infer_schema","title":"Extract a schema from an object — infer_schema","text":"Extract schema object","code":""},{"path":"https://arrow.apache.org/docs/r/reference/infer_schema.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract a schema from an object — infer_schema","text":"","code":"infer_schema(x)"},{"path":"https://arrow.apache.org/docs/r/reference/infer_schema.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract a schema from an object — infer_schema","text":"x object schema, e.g. Dataset","code":""},{"path":"https://arrow.apache.org/docs/r/reference/infer_type.html","id":null,"dir":"Reference","previous_headings":"","what":"Infer the arrow Array type from an R object — infer_type","title":"Infer the arrow Array type from an R object — infer_type","text":"type() deprecated favor infer_type().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/infer_type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Infer the arrow Array type from an R object — infer_type","text":"","code":"infer_type(x, ...) type(x)"},{"path":"https://arrow.apache.org/docs/r/reference/infer_type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Infer the arrow Array type from an R object — infer_type","text":"x R object (usually vector) converted Array ChunkedArray. ... 
Passed S3 methods","code":""},{"path":"https://arrow.apache.org/docs/r/reference/infer_type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Infer the arrow Array type from an R object — infer_type","text":"arrow data type","code":""},{"path":"https://arrow.apache.org/docs/r/reference/infer_type.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Infer the arrow Array type from an R object — infer_type","text":"","code":"infer_type(1:10) #> Int32 #> int32 infer_type(1L:10L) #> Int32 #> int32 infer_type(c(1, 1.5, 2)) #> Float64 #> double infer_type(c(\"A\", \"B\", \"C\")) #> Utf8 #> string infer_type(mtcars) #> StructType #> struct<mpg: double, cyl: double, disp: double, hp: double, drat: double, wt: double, qsec: double, vs: double, am: double, gear: double, carb: double> infer_type(Sys.Date()) #> Date32 #> date32[day] infer_type(as.POSIXlt(Sys.Date())) #> VctrsExtensionType #> POSIXlt of length 0 infer_type(vctrs::new_vctr(1:5, class = \"my_custom_vctr_class\")) #> VctrsExtensionType #> <my_custom_vctr_class[0]>"},{"path":"https://arrow.apache.org/docs/r/reference/install_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Install or upgrade the Arrow library — install_arrow","title":"Install or upgrade the Arrow library — install_arrow","text":"Use function install latest release arrow, switch nightly development version, Linux try reinstalling necessary C++ dependencies.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/install_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Install or upgrade the Arrow library — install_arrow","text":"","code":"install_arrow( nightly = FALSE, binary = Sys.getenv(\"LIBARROW_BINARY\", TRUE), use_system = Sys.getenv(\"ARROW_USE_PKG_CONFIG\", FALSE), minimal = Sys.getenv(\"LIBARROW_MINIMAL\", FALSE), verbose = Sys.getenv(\"ARROW_R_DEV\", FALSE), repos = getOption(\"repos\"), ... )"},{"path":"https://arrow.apache.org/docs/r/reference/install_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Install or upgrade the Arrow library — install_arrow","text":"nightly logical: install development version package, install CRAN (default). binary Linux, value set environment variable LIBARROW_BINARY, governs C++ binaries used, . default value, TRUE, tells installation script detect Linux distribution version find appropriate C++ library. FALSE tell script retrieve binary instead build Arrow C++ source. valid values strings corresponding Linux distribution-version, override value detected. See install guide details. use_system logical: use pkg-config look Arrow system packages? Default FALSE. TRUE, source installation may faster, risk version mismatch. sets ARROW_USE_PKG_CONFIG environment variable. minimal logical: building source, build without optional dependencies (compression libraries, example)? Default FALSE. sets LIBARROW_MINIMAL environment variable. verbose logical: Print debugging output installing? Default FALSE. sets ARROW_R_DEV environment variable. repos character vector base URLs repositories install (passed install.packages()) ... 
Additional arguments passed install.packages()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/install_arrow.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Install or upgrade the Arrow library — install_arrow","text":"Note , unlike packages like tensorflow, blogdown, others require external dependencies, need run install_arrow() successful arrow installation.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/install_pyarrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Install pyarrow for use with reticulate — install_pyarrow","title":"Install pyarrow for use with reticulate — install_pyarrow","text":"pyarrow Python package Apache Arrow. function helps installing use reticulate.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/install_pyarrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Install pyarrow for use with reticulate — install_pyarrow","text":"","code":"install_pyarrow(envname = NULL, nightly = FALSE, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/install_pyarrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Install pyarrow for use with reticulate — install_pyarrow","text":"envname name full path Python environment install . can virtualenv conda environment created reticulate. See reticulate::py_install(). nightly logical: install development version package? Default use official release version. ... additional arguments passed reticulate::py_install().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/io_thread_count.html","id":null,"dir":"Reference","previous_headings":"","what":"Manage the global I/O thread pool in libarrow — io_thread_count","title":"Manage the global I/O thread pool in libarrow — io_thread_count","text":"Manage global /O thread pool libarrow","code":""},{"path":"https://arrow.apache.org/docs/r/reference/io_thread_count.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Manage the global I/O thread pool in libarrow — io_thread_count","text":"","code":"io_thread_count() set_io_thread_count(num_threads)"},{"path":"https://arrow.apache.org/docs/r/reference/io_thread_count.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Manage the global I/O thread pool in libarrow — io_thread_count","text":"num_threads integer: New number threads thread pool. least two threads recommended support operations arrow package.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":null,"dir":"Reference","previous_headings":"","what":"List available Arrow C++ compute functions — list_compute_functions","title":"List available Arrow C++ compute functions — list_compute_functions","text":"function lists names available Arrow C++ library compute functions. 
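Stepping back briefly to the I/O thread pool helpers documented just above (io_thread_count() / set_io_thread_count()), a minimal sketch; the value 4 is purely illustrative:

library(arrow)
io_thread_count()        # query the current size of libarrow's I/O thread pool
set_io_thread_count(4)   # illustrative value; at least two threads are recommended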
can called passing call_function(), can called name arrow_ prefix inside dplyr verb.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"List available Arrow C++ compute functions — list_compute_functions","text":"","code":"list_compute_functions(pattern = NULL, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"List available Arrow C++ compute functions — list_compute_functions","text":"pattern Optional regular expression filter function list ... Additional parameters passed grep()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"List available Arrow C++ compute functions — list_compute_functions","text":"character vector available Arrow C++ function names","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"List available Arrow C++ compute functions — list_compute_functions","text":"resulting list describes capabilities arrow build. functions, string regular expression functions, require optional build-time C++ dependencies. arrow package compiled features enabled, functions appear list. functions take options need passed calling (list called options). options require custom handling C++; many functions already handling set . encounter one needs special handling options, please report issue. Note list enumerate R bindings functions. package includes Arrow methods many base R functions can called directly Arrow objects, well tidyverse-flavored versions available inside dplyr verbs.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_compute_functions.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"List available Arrow C++ compute functions — list_compute_functions","text":"","code":"available_funcs <- list_compute_functions() utf8_funcs <- list_compute_functions(pattern = \"^UTF8\", ignore.case = TRUE)"},{"path":"https://arrow.apache.org/docs/r/reference/list_flights.html","id":null,"dir":"Reference","previous_headings":"","what":"See available resources on a Flight server — list_flights","title":"See available resources on a Flight server — list_flights","text":"See available resources Flight server","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_flights.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"See available resources on a Flight server — list_flights","text":"","code":"list_flights(client) flight_path_exists(client, path)"},{"path":"https://arrow.apache.org/docs/r/reference/list_flights.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"See available resources on a Flight server — list_flights","text":"client pyarrow.flight.FlightClient, returned flight_connect() path string identifier data stored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/list_flights.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"See available resources on a Flight server — list_flights","text":"list_flights() returns character vector paths. 
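A hedged sketch of the two calling conventions just described for compute functions; "add" is used here only because it is a commonly available kernel, and the dplyr line is shown as a comment since it assumes dplyr is installed:

library(arrow)
list_compute_functions(pattern = "^add")       # kernels whose names start with "add"
a <- Array$create(c(1L, 2L, 3L))
call_function("add", a, Scalar$create(10L))    # call a kernel directly by its C++ name
# inside a dplyr verb the same kernel can be reached with the arrow_ prefix, e.g.
# arrow_table(x = 1:3) %>% dplyr::mutate(y = arrow_add(x, 10L)) %>% dplyr::collect()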
flight_path_exists() returns logical value, equivalent path %% list_flights()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/load_flight_server.html","id":null,"dir":"Reference","previous_headings":"","what":"Load a Python Flight server — load_flight_server","title":"Load a Python Flight server — load_flight_server","text":"Load Python Flight server","code":""},{"path":"https://arrow.apache.org/docs/r/reference/load_flight_server.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Load a Python Flight server — load_flight_server","text":"","code":"load_flight_server(name, path = system.file(package = \"arrow\"))"},{"path":"https://arrow.apache.org/docs/r/reference/load_flight_server.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Load a Python Flight server — load_flight_server","text":"name string Python module name path file system path Python module found. Default look inst/ directory included modules.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/load_flight_server.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Load a Python Flight server — load_flight_server","text":"","code":"if (FALSE) { load_flight_server(\"demo_flight_server\") }"},{"path":"https://arrow.apache.org/docs/r/reference/make_readable_file.html","id":null,"dir":"Reference","previous_headings":"","what":"Handle a range of possible input sources — make_readable_file","title":"Handle a range of possible input sources — make_readable_file","text":"Handle range possible input sources","code":""},{"path":"https://arrow.apache.org/docs/r/reference/make_readable_file.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Handle a range of possible input sources — make_readable_file","text":"","code":"make_readable_file(file, mmap = TRUE, random_access = TRUE)"},{"path":"https://arrow.apache.org/docs/r/reference/make_readable_file.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Handle a range of possible input sources — make_readable_file","text":"file character file name, raw vector, Arrow input stream mmap Logical: whether memory-map file (default TRUE) random_access Logical: whether result must RandomAccessFile","code":""},{"path":"https://arrow.apache.org/docs/r/reference/make_readable_file.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Handle a range of possible input sources — make_readable_file","text":"InputStream subclass one.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/map_batches.html","id":null,"dir":"Reference","previous_headings":"","what":"Apply a function to a stream of RecordBatches — map_batches","title":"Apply a function to a stream of RecordBatches — map_batches","text":"alternative calling collect() Dataset query, can use function access stream RecordBatches Dataset. lets complex operations R operate chunks data without hold entire Dataset memory . 
can include map_batches() dplyr pipeline additional dplyr methods stream data Arrow .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/map_batches.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Apply a function to a stream of RecordBatches — map_batches","text":"","code":"map_batches(X, FUN, ..., .schema = NULL, .lazy = TRUE, .data.frame = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/map_batches.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Apply a function to a stream of RecordBatches — map_batches","text":"X Dataset arrow_dplyr_query object, returned dplyr methods Dataset. FUN function purrr-style lambda expression apply batch. must return RecordBatch something coercible one via `as_record_batch()'. ... Additional arguments passed FUN .schema optional schema(). NULL, schema inferred first batch. .lazy Use TRUE evaluate FUN lazily batches read result; use FALSE evaluate FUN batches returning reader. .data.frame Deprecated argument, ignored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/map_batches.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Apply a function to a stream of RecordBatches — map_batches","text":"arrow_dplyr_query.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/map_batches.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Apply a function to a stream of RecordBatches — map_batches","text":"experimental recommended production use. also single-threaded runs R C++, fast core Arrow methods.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/match_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Value matching for Arrow objects — match_arrow","title":"Value matching for Arrow objects — match_arrow","text":"base::match() base::%% generics, just define Arrow methods . functions expose analogous functions Arrow C++ library.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/match_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Value matching for Arrow objects — match_arrow","text":"","code":"match_arrow(x, table, ...) is_in(x, table, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/match_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Value matching for Arrow objects — match_arrow","text":"x Scalar, Array ChunkedArray table Scalar, Array, ChunkedArray`, R vector lookup table. ... additional arguments, ignored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/match_arrow.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Value matching for Arrow objects — match_arrow","text":"match_arrow() returns int32-type Arrow object length type x (0-based) indexes table. 
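Assuming a small Dataset written to a temporary directory (as in the open_dataset() examples elsewhere on this page), a minimal sketch of map_batches(); the per-batch function must return something coercible to a RecordBatch:

library(arrow)
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
# count the rows of each batch without holding the whole Dataset in memory
open_dataset(tf) %>%
  map_batches(~ record_batch(rows_in_batch = .x$num_rows)) %>%
  collect()
unlink(tf, recursive = TRUE)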
is_in() returns boolean-type Arrow object length type x values indicating per element x present table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/match_arrow.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Value matching for Arrow objects — match_arrow","text":"","code":"# note that the returned value is 0-indexed cars_tbl <- arrow_table(name = rownames(mtcars), mtcars) match_arrow(Scalar$create(\"Mazda RX4 Wag\"), cars_tbl$name) #> Scalar #> 1 is_in(Array$create(\"Mazda RX4 Wag\"), cars_tbl$name) #> Array #> <bool> #> [ #> true #> ] # Although there are multiple matches, you are returned the index of the first # match, as with the base R equivalent match(4, mtcars$cyl) # 1-indexed #> [1] 3 match_arrow(Scalar$create(4), cars_tbl$cyl) # 0-indexed #> Scalar #> 2 # If `x` contains multiple values, you are returned the indices of the first # match for each value. match(c(4, 6, 8), mtcars$cyl) #> [1] 3 1 5 match_arrow(Array$create(c(4, 6, 8)), cars_tbl$cyl) #> Array #> <int32> #> [ #> 2, #> 0, #> 4 #> ] # Return type matches type of `x` is_in(c(4, 6, 8), mtcars$cyl) # returns vector #> Array #> <bool> #> [ #> true, #> true, #> true #> ] is_in(Scalar$create(4), mtcars$cyl) # returns Scalar #> Scalar #> true is_in(Array$create(c(4, 6, 8)), cars_tbl$cyl) # returns Array #> Array #> <bool> #> [ #> true, #> true, #> true #> ] is_in(ChunkedArray$create(c(4, 6), 8), cars_tbl$cyl) # returns ChunkedArray #> ChunkedArray #> <bool> #> [ #> [ #> true, #> true, #> true #> ] #> ]"},{"path":"https://arrow.apache.org/docs/r/reference/mmap_create.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a new read/write memory mapped file of a given size — mmap_create","title":"Create a new read/write memory mapped file of a given size — mmap_create","text":"Create new read/write memory mapped file given size","code":""},{"path":"https://arrow.apache.org/docs/r/reference/mmap_create.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a new read/write memory mapped file of a given size — mmap_create","text":"","code":"mmap_create(path, size)"},{"path":"https://arrow.apache.org/docs/r/reference/mmap_create.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a new read/write memory mapped file of a given size — mmap_create","text":"path file path size size bytes","code":""},{"path":"https://arrow.apache.org/docs/r/reference/mmap_create.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a new read/write memory mapped file of a given size — mmap_create","text":"arrow::io::MemoryMappedFile","code":""},{"path":"https://arrow.apache.org/docs/r/reference/mmap_open.html","id":null,"dir":"Reference","previous_headings":"","what":"Open a memory mapped file — mmap_open","title":"Open a memory mapped file — mmap_open","text":"Open memory mapped file","code":""},{"path":"https://arrow.apache.org/docs/r/reference/mmap_open.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Open a memory mapped file — mmap_open","text":"","code":"mmap_open(path, mode = c(\"read\", \"write\", \"readwrite\"))"},{"path":"https://arrow.apache.org/docs/r/reference/mmap_open.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Open a memory mapped file — mmap_open","text":"path file path mode file mode 
(read/write/readwrite)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":null,"dir":"Reference","previous_headings":"","what":"Extension types — new_extension_type","title":"Extension types — new_extension_type","text":"Extension arrays wrappers around regular Arrow Array objects provide customized behaviour /storage. common use-case extension types define customized conversion Arrow Array R object default conversion slow loses metadata important interpretation values array. types, built-vctrs extension type probably sufficient.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extension types — new_extension_type","text":"","code":"new_extension_type( storage_type, extension_name, extension_metadata = raw(), type_class = ExtensionType ) new_extension_array(storage_array, extension_type) register_extension_type(extension_type) reregister_extension_type(extension_type) unregister_extension_type(extension_name)"},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extension types — new_extension_type","text":"storage_type data type underlying storage array. extension_name extension name. namespaced using \"dot\" syntax (.e., \"some_package.some_type\"). namespace \"arrow\" reserved extension types defined Apache Arrow libraries. extension_metadata raw() character() vector containing serialized version type. Character vectors must length 1 converted UTF-8 converting raw(). type_class R6::R6Class whose $new() class method used construct new instance type. storage_array Array object underlying storage. extension_type ExtensionType instance.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extension types — new_extension_type","text":"new_extension_type() returns ExtensionType instance according type_class specified. new_extension_array() returns ExtensionArray whose $type corresponds extension_type. register_extension_type(), unregister_extension_type() reregister_extension_type() return NULL, invisibly.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Extension types — new_extension_type","text":"functions create, register, unregister ExtensionType ExtensionArray objects. use extension type : Define R6::R6Class inherits ExtensionType reimplement one methods (e.g., deserialize_instance()). Make type constructor function (e.g., my_extension_type()) calls new_extension_type() create R6 instance can used data type elsewhere package. Make array constructor function (e.g., my_extension_array()) calls new_extension_array() create Array instance extension type. Register dummy instance extension type created using constructor function using register_extension_type(). defining extension type R package, probably want use reregister_extension_type() package's .onLoad() hook since package probably get reloaded R session development register_extension_type() error called twice extension_name. 
example extension type uses features, see vctrs_extension_type().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/new_extension_type.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Extension types — new_extension_type","text":"","code":"# Create the R6 type whose methods control how Array objects are # converted to R objects, how equality between types is computed, # and how types are printed. QuantizedType <- R6::R6Class( \"QuantizedType\", inherit = ExtensionType, public = list( # methods to access the custom metadata fields center = function() private$.center, scale = function() private$.scale, # called when an Array of this type is converted to an R vector as_vector = function(extension_array) { if (inherits(extension_array, \"ExtensionArray\")) { unquantized_arrow <- (extension_array$storage()$cast(float64()) / private$.scale) + private$.center as.vector(unquantized_arrow) } else { super$as_vector(extension_array) } }, # populate the custom metadata fields from the serialized metadata deserialize_instance = function() { vals <- as.numeric(strsplit(self$extension_metadata_utf8(), \";\")[[1]]) private$.center <- vals[1] private$.scale <- vals[2] } ), private = list( .center = NULL, .scale = NULL ) ) # Create a helper type constructor that calls new_extension_type() quantized <- function(center = 0, scale = 1, storage_type = int32()) { new_extension_type( storage_type = storage_type, extension_name = \"arrow.example.quantized\", extension_metadata = paste(center, scale, sep = \";\"), type_class = QuantizedType ) } # Create a helper array constructor that calls new_extension_array() quantized_array <- function(x, center = 0, scale = 1, storage_type = int32()) { type <- quantized(center, scale, storage_type) new_extension_array( Array$create((x - center) * scale, type = storage_type), type ) } # Register the extension type so that Arrow knows what to do when # it encounters this extension type reregister_extension_type(quantized()) # Create Array objects and use them! (vals <- runif(5, min = 19, max = 21)) #> [1] 20.42644 20.32484 20.21635 20.83541 19.04193 (array <- quantized_array( vals, center = 20, scale = 2^15 - 1, storage_type = int16() ) ) #> ExtensionArray #> <QuantizedType <20;32767>> #> [ #> 13973, #> 10643, #> 7089, #> 27373, #> -31393 #> ] array$type$center() #> [1] 20 array$type$scale() #> [1] 32767 as.vector(array) #> [1] 20.42644 20.32481 20.21635 20.83538 19.04193"},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":null,"dir":"Reference","previous_headings":"","what":"Open a multi-file dataset — open_dataset","title":"Open a multi-file dataset — open_dataset","text":"Arrow Datasets allow query data split across multiple files. sharding data may indicate partitioning, can accelerate queries touch partitions (files). Call open_dataset() point directory data files return Dataset, use dplyr methods query .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Open a multi-file dataset — open_dataset","text":"","code":"open_dataset( sources, schema = NULL, partitioning = hive_partition(), hive_style = NA, unify_schemas = NULL, format = c(\"parquet\", \"arrow\", \"ipc\", \"feather\", \"csv\", \"tsv\", \"text\", \"json\"), factory_options = list(), ... 
)"},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Open a multi-file dataset — open_dataset","text":"sources One : string path URI directory containing data files FileSystem references directory containing data files (returned s3_bucket()) string path URI single file character vector paths URIs individual data files list Dataset objects created function list DatasetFactory objects created dataset_factory(). sources vector file URIs, must use protocol point files located file system format. schema Schema Dataset. NULL (default), schema inferred data sources. partitioning sources directory path/URI, one : Schema, case file paths relative sources parsed, path segments matched schema fields. character vector defines field names corresponding path segments (, providing names correspond Schema types autodetected) Partitioning PartitioningFactory, returned hive_partition() NULL partitioning default autodetect Hive-style partitions unless hive_style = FALSE. See \"Partitioning\" section details. sources directory path/URI, partitioning ignored. hive_style Logical: partitioning interpreted Hive-style? Default NA, means inspect file paths Hive-style partitioning behave accordingly. unify_schemas logical: data fragments (files, Datasets) scanned order create unified schema ? FALSE, first fragment inspected schema. Use fast path know trust fragments identical schema. default FALSE creating dataset directory path/URI vector file paths/URIs (may many files scanning may slow) TRUE sources list Datasets (Datasets list Schemas already memory). format FileFormat object, string identifier format files x. argument ignored sources list Dataset objects. Currently supported values: \"parquet\" \"ipc\"/\"arrow\"/\"feather\", aliases ; Feather, note version 2 files supported \"csv\"/\"text\", aliases thing (comma default delimiter text files \"tsv\", equivalent passing format = \"text\", delimiter = \"\\t\" \"json\", JSON format datasets Note: newline-delimited JSON (aka ND-JSON) datasets currently supported Default \"parquet\", unless delimiter also specified, case assumed \"text\". factory_options list optional FileSystemFactoryOptions: partition_base_dir: string path segment prefix ignore discovering partition information DirectoryPartitioning. meaningful (ignored warning) HivePartitioning, valid providing vector file paths. exclude_invalid_files: logical: files valid data files excluded? Default FALSE checking files front incurs /O thus slower, especially remote filesystems. false invalid files, error scan time. FileSystemFactoryOption valid providing directory path discover files providing vector file paths. selector_ignore_prefixes: character vector file prefixes ignore discovering files directory. invalid files can excluded common filename prefix way, can avoid /O cost exclude_invalid_files. valid providing vector file paths (providing file list, can filter invalid files ). ... additional arguments passed dataset_factory() sources directory path/URI vector file paths/URIs, otherwise ignored. may include format indicate file format, format-specific options (see read_csv_arrow(), read_parquet() read_feather() specify ).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Open a multi-file dataset — open_dataset","text":"Dataset R6 object. 
Use dplyr methods query data, call $NewScan() construct query directly.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":"partitioning","dir":"Reference","previous_headings":"","what":"Partitioning","title":"Open a multi-file dataset — open_dataset","text":"Data often split multiple files nested subdirectories based value one columns data. may column commonly referenced queries, may time-based, examples. Data divided way \"partitioned,\" values partitioning columns encoded file path segments. path segments effectively virtual columns dataset, values known prior reading files , can greatly speed filtered queries skipping files entirely. Arrow supports reading partition information file paths two forms: \"Hive-style\", deriving Apache Hive project common database systems. Partitions encoded \"key=value\" path segments, \"year=2019/month=1/file.parquet\". may awkward file names, advantage self-describing. \"Directory\" partitioning, Hive without key names, like \"2019/01/file.parquet\". order use , need know least names give virtual columns come path segments. default behavior open_dataset() inspect file paths contained provided directory, look like Hive-style, parse Hive. dataset Hive-style partitioning file paths, need provide anything partitioning argument open_dataset() use . provide character vector partition column names, ignored match detected, match, get error. (want rename partition columns, using select() rename() opening dataset.). provide Schema names match detected, use types defined Schema. example file path , provide Schema specify \"month\" int8() instead int32() parsed default. file paths appear Hive-style, pass hive_style = FALSE, partitioning argument used create Directory partitioning. character vector names required create partitions; may instead provide Schema map names desired column types, described . neither provided, partitioning information taken file paths.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/open_dataset.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Open a multi-file dataset — open_dataset","text":"","code":"# Set up directory for examples tf <- tempfile() dir.create(tf) on.exit(unlink(tf)) write_dataset(mtcars, tf, partitioning = \"cyl\") # You can specify a directory containing the files for your dataset and # open_dataset will scan all files in your directory. open_dataset(tf) #> FileSystemDataset with 3 Parquet files #> 11 columns #> mpg: double #> disp: double #> hp: double #> drat: double #> wt: double #> qsec: double #> vs: double #> am: double #> gear: double #> carb: double #> cyl: int32 #> #> See $metadata for additional Schema metadata # You can also supply a vector of paths open_dataset(c(file.path(tf, \"cyl=4/part-0.parquet\"), file.path(tf, \"cyl=8/part-0.parquet\"))) #> FileSystemDataset with 2 Parquet files #> 10 columns #> mpg: double #> disp: double #> hp: double #> drat: double #> wt: double #> qsec: double #> vs: double #> am: double #> gear: double #> carb: double #> #> See $metadata for additional Schema metadata ## You must specify the file format if using a format other than parquet. 
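# Hedged aside (not part of the original example): once opened, the Dataset
# supports dplyr verbs directly, e.g. filtering before collecting into R
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr, warn.conflicts = FALSE)
  open_dataset(tf) %>%
    filter(cyl == 4) %>%
    collect()
}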
tf2 <- tempfile() dir.create(tf2) on.exit(unlink(tf2)) write_dataset(mtcars, tf2, format = \"ipc\") # This line will results in errors when you try to work with the data if (FALSE) { open_dataset(tf2) } # This line will work open_dataset(tf2, format = \"ipc\") #> FileSystemDataset with 1 Feather file #> 11 columns #> mpg: double #> cyl: double #> disp: double #> hp: double #> drat: double #> wt: double #> qsec: double #> vs: double #> am: double #> gear: double #> carb: double #> #> See $metadata for additional Schema metadata ## You can specify file partitioning to include it as a field in your dataset # Create a temporary directory and write example dataset tf3 <- tempfile() dir.create(tf3) on.exit(unlink(tf3)) write_dataset(airquality, tf3, partitioning = c(\"Month\", \"Day\"), hive_style = FALSE) # View files - you can see the partitioning means that files have been written # to folders based on Month/Day values tf3_files <- list.files(tf3, recursive = TRUE) # With no partitioning specified, dataset contains all files but doesn't include # directory names as field names open_dataset(tf3) #> FileSystemDataset with 153 Parquet files #> 4 columns #> Ozone: int32 #> Solar.R: int32 #> Wind: double #> Temp: int32 #> #> See $metadata for additional Schema metadata # Now that partitioning has been specified, your dataset contains columns for Month and Day open_dataset(tf3, partitioning = c(\"Month\", \"Day\")) #> FileSystemDataset with 153 Parquet files #> 6 columns #> Ozone: int32 #> Solar.R: int32 #> Wind: double #> Temp: int32 #> Month: int32 #> Day: int32 #> #> See $metadata for additional Schema metadata # If you want to specify the data types for your fields, you can pass in a Schema open_dataset(tf3, partitioning = schema(Month = int8(), Day = int8())) #> FileSystemDataset with 153 Parquet files #> 6 columns #> Ozone: int32 #> Solar.R: int32 #> Wind: double #> Temp: int32 #> Month: int8 #> Day: int8 #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/open_delim_dataset.html","id":null,"dir":"Reference","previous_headings":"","what":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","title":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","text":"wrapper around open_dataset explicitly includes parameters mirroring read_csv_arrow(), read_delim_arrow(), read_tsv_arrow() allow easy switching functions opening single files functions opening datasets.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/open_delim_dataset.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","text":"","code":"open_delim_dataset( sources, schema = NULL, partitioning = hive_partition(), hive_style = NA, unify_schemas = NULL, factory_options = list(), delim = \",\", quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, col_names = TRUE, col_types = NULL, na = c(\"\", \"NA\"), skip_empty_rows = TRUE, skip = 0L, convert_options = NULL, read_options = NULL, timestamp_parsers = NULL, quoted_na = TRUE, parse_options = NULL ) open_csv_dataset( sources, schema = NULL, partitioning = hive_partition(), hive_style = NA, unify_schemas = NULL, factory_options = list(), quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, col_names = TRUE, col_types = NULL, na = c(\"\", \"NA\"), skip_empty_rows = TRUE, skip = 0L, convert_options = 
NULL, read_options = NULL, timestamp_parsers = NULL, quoted_na = TRUE, parse_options = NULL ) open_tsv_dataset( sources, schema = NULL, partitioning = hive_partition(), hive_style = NA, unify_schemas = NULL, factory_options = list(), quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, col_names = TRUE, col_types = NULL, na = c(\"\", \"NA\"), skip_empty_rows = TRUE, skip = 0L, convert_options = NULL, read_options = NULL, timestamp_parsers = NULL, quoted_na = TRUE, parse_options = NULL )"},{"path":"https://arrow.apache.org/docs/r/reference/open_delim_dataset.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","text":"sources One : string path URI directory containing data files FileSystem references directory containing data files (returned s3_bucket()) string path URI single file character vector paths URIs individual data files list Dataset objects created function list DatasetFactory objects created dataset_factory(). sources vector file URIs, must use protocol point files located file system format. schema Schema Dataset. NULL (default), schema inferred data sources. partitioning sources directory path/URI, one : Schema, case file paths relative sources parsed, path segments matched schema fields. character vector defines field names corresponding path segments (, providing names correspond Schema types autodetected) Partitioning PartitioningFactory, returned hive_partition() NULL partitioning default autodetect Hive-style partitions unless hive_style = FALSE. See \"Partitioning\" section details. sources directory path/URI, partitioning ignored. hive_style Logical: partitioning interpreted Hive-style? Default NA, means inspect file paths Hive-style partitioning behave accordingly. unify_schemas logical: data fragments (files, Datasets) scanned order create unified schema ? FALSE, first fragment inspected schema. Use fast path know trust fragments identical schema. default FALSE creating dataset directory path/URI vector file paths/URIs (may many files scanning may slow) TRUE sources list Datasets (Datasets list Schemas already memory). factory_options list optional FileSystemFactoryOptions: partition_base_dir: string path segment prefix ignore discovering partition information DirectoryPartitioning. meaningful (ignored warning) HivePartitioning, valid providing vector file paths. exclude_invalid_files: logical: files valid data files excluded? Default FALSE checking files front incurs /O thus slower, especially remote filesystems. false invalid files, error scan time. FileSystemFactoryOption valid providing directory path discover files providing vector file paths. selector_ignore_prefixes: character vector file prefixes ignore discovering files directory. invalid files can excluded common filename prefix way, can avoid /O cost exclude_invalid_files. valid providing vector file paths (providing file list, can filter invalid files ). delim Single character used separate fields within record. quote Single character used quote strings. escape_double file escape quotes doubling ? .e. option TRUE, value \"\"\"\" represents single quote, \\\". escape_backslash file use backslashes escape special characters? general escape_double backslashes can used escape delimiter character, quote character, add special characters like \\\\n. col_names TRUE, first row input used column names included data frame. 
FALSE, column names generated Arrow, starting \"f0\", \"f1\", ..., \"fN\". Alternatively, can specify character vector column names. col_types compact string representation column types, Arrow Schema, NULL (default) infer types data. na character vector strings interpret missing values. skip_empty_rows blank rows ignored altogether? TRUE, blank rows represented . FALSE, filled missings. skip Number lines skip reading data. convert_options see CSV conversion options read_options see CSV reading options timestamp_parsers User-defined timestamp parsers. one parser specified, CSV conversion logic try parsing values starting beginning vector. Possible values : NULL: default, uses ISO-8601 parser character vector strptime parse strings list TimestampParser objects quoted_na missing values inside quotes treated missing values (default) strings. (Note different Arrow C++ default corresponding convert option, strings_can_be_null.) parse_options see CSV parsing options. given, overrides parsing options provided arguments (e.g. delim, quote, etc.).","code":""},{"path":"https://arrow.apache.org/docs/r/reference/open_delim_dataset.html","id":"options-currently-supported-by-read-delim-arrow-which-are-not-supported-here","dir":"Reference","previous_headings":"","what":"Options currently supported by read_delim_arrow() which are not supported here","title":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","text":"file (instead, please specify files sources) col_select (instead, subset columns dataset creation) as_data_frame (instead, convert data frame dataset creation) parse_options","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/open_delim_dataset.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Open a multi-file dataset of CSV or other delimiter-separated format — open_delim_dataset","text":"","code":"# Set up directory for examples tf <- tempfile() dir.create(tf) df <- data.frame(x = c(\"1\", \"2\", \"NULL\")) file_path <- file.path(tf, \"file1.txt\") write.table(df, file_path, sep = \",\", row.names = FALSE) read_csv_arrow(file_path, na = c(\"\", \"NA\", \"NULL\"), col_names = \"y\", skip = 1) #> # A tibble: 3 x 1 #> y #> <int> #> 1 1 #> 2 2 #> 3 NA open_csv_dataset(file_path, na = c(\"\", \"NA\", \"NULL\"), col_names = \"y\", skip = 1) #> FileSystemDataset with 1 csv file #> 1 columns #> y: int64 unlink(tf)"},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a CSV or other delimited file with Arrow — read_delim_arrow","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"functions uses Arrow C++ CSV reader read tibble. 
Arrow C++ options mapped argument names follow readr::read_delim(), col_select inspired vroom::vroom().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"","code":"read_delim_arrow( file, delim = \",\", quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, schema = NULL, col_names = TRUE, col_types = NULL, col_select = NULL, na = c(\"\", \"NA\"), quoted_na = TRUE, skip_empty_rows = TRUE, skip = 0L, parse_options = NULL, convert_options = NULL, read_options = NULL, as_data_frame = TRUE, timestamp_parsers = NULL, decimal_point = \".\" ) read_csv_arrow( file, quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, schema = NULL, col_names = TRUE, col_types = NULL, col_select = NULL, na = c(\"\", \"NA\"), quoted_na = TRUE, skip_empty_rows = TRUE, skip = 0L, parse_options = NULL, convert_options = NULL, read_options = NULL, as_data_frame = TRUE, timestamp_parsers = NULL ) read_csv2_arrow( file, quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, schema = NULL, col_names = TRUE, col_types = NULL, col_select = NULL, na = c(\"\", \"NA\"), quoted_na = TRUE, skip_empty_rows = TRUE, skip = 0L, parse_options = NULL, convert_options = NULL, read_options = NULL, as_data_frame = TRUE, timestamp_parsers = NULL ) read_tsv_arrow( file, quote = \"\\\"\", escape_double = TRUE, escape_backslash = FALSE, schema = NULL, col_names = TRUE, col_types = NULL, col_select = NULL, na = c(\"\", \"NA\"), quoted_na = TRUE, skip_empty_rows = TRUE, skip = 0L, parse_options = NULL, convert_options = NULL, read_options = NULL, as_data_frame = TRUE, timestamp_parsers = NULL )"},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"file character file name URI, connection, literal data (either single string raw vector), Arrow input stream, FileSystem path (SubTreeFileSystem). file name, memory-mapped Arrow InputStream opened closed finished; compression detected file extension handled automatically. input stream provided, left open. recognised literal data, input must wrapped (). delim Single character used separate fields within record. quote Single character used quote strings. escape_double file escape quotes doubling ? .e. option TRUE, value \"\"\"\" represents single quote, \\\". escape_backslash file use backslashes escape special characters? general escape_double backslashes can used escape delimiter character, quote character, add special characters like \\\\n. schema Schema describes table. provided, used satisfy col_names col_types. col_names TRUE, first row input used column names included data frame. FALSE, column names generated Arrow, starting \"f0\", \"f1\", ..., \"fN\". Alternatively, can specify character vector column names. col_types compact string representation column types, Arrow Schema, NULL (default) infer types data. col_select character vector column names keep, \"select\" argument data.table::fread(), tidy selection specification columns, used dplyr::select(). na character vector strings interpret missing values. quoted_na missing values inside quotes treated missing values (default) strings. (Note different Arrow C++ default corresponding convert option, strings_can_be_null.) 
skip_empty_rows blank rows ignored altogether? TRUE, blank rows represented . FALSE, filled missings. skip Number lines skip reading data. parse_options see CSV parsing options. given, overrides parsing options provided arguments (e.g. delim, quote, etc.). convert_options see CSV conversion options read_options see CSV reading options as_data_frame function return tibble (default) Arrow Table? timestamp_parsers User-defined timestamp parsers. one parser specified, CSV conversion logic try parsing values starting beginning vector. Possible values : NULL: default, uses ISO-8601 parser character vector strptime parse strings list TimestampParser objects decimal_point Character use decimal point floating point numbers.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"tibble, Table as_data_frame = FALSE.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"read_csv_arrow() read_tsv_arrow() wrappers around read_delim_arrow() specify delimiter. read_csv2_arrow() uses ; delimiter , decimal point. Note readr options currently implemented . Please file issue encounter one arrow support. need control Arrow-specific reader parameters equivalent readr::read_csv(), can either provide parse_options, convert_options, read_options arguments, can use CsvTableReader directly lower-level access.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"specifying-column-types-and-names","dir":"Reference","previous_headings":"","what":"Specifying column types and names","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"default, CSV reader infer column names data types file, ways can specify directly. One way provide Arrow Schema schema argument, ordered map column name type. provided, satisfies col_names col_types arguments. good know information front. can also pass Schema col_types argument. , column names still inferred file unless also specify col_names. either case, column names Schema must match data's column names, whether explicitly provided inferred. said, Schema reference columns: omitted types inferred. Alternatively, can declare column types providing compact string representation readr uses col_types argument. means provide single string, one character per column, characters map Arrow types analogously readr type mapping: \"c\": utf8() \"\": int32() \"n\": float64() \"d\": float64() \"l\": bool() \"f\": dictionary() \"D\": date32() \"T\": timestamp(unit = \"ns\") \"t\": time32() (unit arg set default value \"ms\") \"_\": null() \"-\": null() \"?\": infer type data use compact string representation col_types, must also specify col_names. Regardless types specified, columns null() type dropped. 
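For illustration, here is a minimal sketch of the compact col_types string described above (the inline CSV and column names are made up; as the note that follows explains, col_names and skip = 1 must be supplied when the compact string is used):
read_csv_arrow(
  I("x,y\n1.5,a\n2.5,b"),
  col_types = "nc",          # "n" = float64(), "c" = utf8()
  col_names = c("x", "y"),
  skip = 1                   # skip the header row since names are supplied
)
# Columns given the null() type via "_" or "-" are dropped:
read_csv_arrow(
  I("x,y\n1.5,a\n2.5,b"),
  col_types = "_c",
  col_names = c("x", "y"),
  skip = 1
)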
Note specifying column names, whether schema col_names, CSV file header row otherwise used identify column names, need add skip = 1 skip row.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_delim_arrow.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read a CSV or other delimited file with Arrow — read_delim_arrow","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) write.csv(mtcars, file = tf) df <- read_csv_arrow(tf) dim(df) #> [1] 32 12 # Can select columns df <- read_csv_arrow(tf, col_select = starts_with(\"d\")) # Specifying column types and names write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = tf, row.names = FALSE) read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1) #> # A tibble: 2 x 2 #> x y #> <int> <chr> #> 1 1 2 #> 2 3 4 read_csv_arrow(tf, col_types = schema(y = utf8())) #> # A tibble: 2 x 2 #> x y #> <int> <chr> #> 1 1 2 #> 2 3 4 read_csv_arrow(tf, col_types = \"ic\", col_names = c(\"x\", \"y\"), skip = 1) #> # A tibble: 2 x 2 #> x y #> <int> <chr> #> 1 1 2 #> 2 3 4 # Note that if a timestamp column contains time zones, # the string \"T\" `col_types` specification won't work. # To parse timestamps with time zones, provide a [Schema] to `col_types` # and specify the time zone in the type object: tf <- tempfile() write.csv(data.frame(x = \"1970-01-01T12:00:00+12:00\"), file = tf, row.names = FALSE) read_csv_arrow( tf, col_types = schema(x = timestamp(unit = \"us\", timezone = \"UTC\")) ) #> # A tibble: 1 x 1 #> x #> <dttm> #> 1 1970-01-01 00:00:00 # Read directly from strings with `I()` read_csv_arrow(I(\"x,y\\n1,2\\n3,4\")) #> # A tibble: 2 x 2 #> x y #> <int> <int> #> 1 1 2 #> 2 3 4 read_delim_arrow(I(c(\"x y\", \"1 2\", \"3 4\")), delim = \" \") #> # A tibble: 2 x 2 #> x y #> <int> <int> #> 1 1 2 #> 2 3 4"},{"path":"https://arrow.apache.org/docs/r/reference/read_feather.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a Feather file (an Arrow IPC file) — read_feather","title":"Read a Feather file (an Arrow IPC file) — read_feather","text":"Feather provides binary columnar serialization data frames. designed make reading writing data frames efficient, make sharing data across data analysis languages easy. read_feather() can read Feather Version 1 (V1), legacy version available starting 2016, Version 2 (V2), Apache Arrow IPC file format. read_ipc_file() alias read_feather().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_feather.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a Feather file (an Arrow IPC file) — read_feather","text":"","code":"read_feather(file, col_select = NULL, as_data_frame = TRUE, mmap = TRUE) read_ipc_file(file, col_select = NULL, as_data_frame = TRUE, mmap = TRUE)"},{"path":"https://arrow.apache.org/docs/r/reference/read_feather.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a Feather file (an Arrow IPC file) — read_feather","text":"file character file name URI, connection, raw vector, Arrow input stream, FileSystem path (SubTreeFileSystem). file name URI, Arrow InputStream opened closed finished. input stream provided, left open. col_select character vector column names keep, \"select\" argument data.table::fread(), tidy selection specification columns, used dplyr::select(). as_data_frame function return tibble (default) Arrow Table? 
mmap Logical: whether memory-map file (default TRUE)","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_feather.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read a Feather file (an Arrow IPC file) — read_feather","text":"tibble as_data_frame TRUE (default), Arrow Table otherwise","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/read_feather.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read a Feather file (an Arrow IPC file) — read_feather","text":"","code":"# We recommend the \".arrow\" extension for Arrow IPC files (Feather V2). tf <- tempfile(fileext = \".arrow\") on.exit(unlink(tf)) write_feather(mtcars, tf) df <- read_feather(tf) dim(df) #> [1] 32 11 # Can select columns df <- read_feather(tf, col_select = starts_with(\"d\"))"},{"path":"https://arrow.apache.org/docs/r/reference/read_ipc_stream.html","id":null,"dir":"Reference","previous_headings":"","what":"Read Arrow IPC stream format — read_ipc_stream","title":"Read Arrow IPC stream format — read_ipc_stream","text":"Apache Arrow defines two formats serializing data interprocess communication (IPC): \"stream\" format \"file\" format, known Feather. read_ipc_stream() read_feather() read formats, respectively.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_ipc_stream.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read Arrow IPC stream format — read_ipc_stream","text":"","code":"read_ipc_stream(file, as_data_frame = TRUE, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/read_ipc_stream.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read Arrow IPC stream format — read_ipc_stream","text":"file character file name URI, connection, raw vector, Arrow input stream, FileSystem path (SubTreeFileSystem). file name URI, Arrow InputStream opened closed finished. input stream provided, left open. as_data_frame function return tibble (default) Arrow Table? ... extra parameters passed read_feather().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_ipc_stream.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read Arrow IPC stream format — read_ipc_stream","text":"tibble as_data_frame TRUE (default), Arrow Table otherwise","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a JSON file — read_json_arrow","title":"Read a JSON file — read_json_arrow","text":"Wrapper around JsonTableReader read newline-delimited JSON (ndjson) file data frame Arrow Table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a JSON file — read_json_arrow","text":"","code":"read_json_arrow( file, col_select = NULL, as_data_frame = TRUE, schema = NULL, ... )"},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a JSON file — read_json_arrow","text":"file character file name URI, connection, literal data (either single string raw vector), Arrow input stream, FileSystem path (SubTreeFileSystem). file name, memory-mapped Arrow InputStream opened closed finished; compression detected file extension handled automatically. input stream provided, left open. 
recognised literal data, input must wrapped (). col_select character vector column names keep, \"select\" argument data.table::fread(), tidy selection specification columns, used dplyr::select(). as_data_frame function return tibble (default) Arrow Table? schema Schema describes table. ... Additional options passed JsonTableReader$create()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read a JSON file — read_json_arrow","text":"tibble, Table as_data_frame = FALSE.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Read a JSON file — read_json_arrow","text":"passed path, detect handle compression file extension (e.g. .json.gz). schema provided, Arrow data types inferred data: JSON null values convert null() type, can fall back type. JSON booleans convert boolean(). JSON numbers convert int64(), falling back float64() non-integer encountered. JSON strings kind \"YYYY-MM-DD\" \"YYYY-MM-DD hh:mm:ss\" convert timestamp(unit = \"s\"), falling back utf8() conversion error occurs. JSON arrays convert list_of() type, inference proceeds recursively JSON arrays' values. Nested JSON objects convert struct() type, inference proceeds recursively JSON objects' values. as_data_frame = TRUE, Arrow types converted R types.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_json_arrow.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read a JSON file — read_json_arrow","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) writeLines(' { \"hello\": 3.5, \"world\": false, \"yo\": \"thing\" } { \"hello\": 3.25, \"world\": null } { \"hello\": 0.0, \"world\": true, \"yo\": null } ', tf, useBytes = TRUE) read_json_arrow(tf) #> # A tibble: 3 x 3 #> hello world yo #> <dbl> <lgl> <chr> #> 1 3.5 FALSE thing #> 2 3.25 NA NA #> 3 0 TRUE NA # Read directly from strings with `I()` read_json_arrow(I(c('{\"x\": 1, \"y\": 2}', '{\"x\": 3, \"y\": 4}'))) #> # A tibble: 2 x 2 #> x y #> <int> <int> #> 1 1 2 #> 2 3 4"},{"path":"https://arrow.apache.org/docs/r/reference/read_message.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a Message from a stream — read_message","title":"Read a Message from a stream — read_message","text":"Read Message stream","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_message.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a Message from a stream — read_message","text":"","code":"read_message(stream)"},{"path":"https://arrow.apache.org/docs/r/reference/read_message.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a Message from a stream — read_message","text":"stream InputStream","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_parquet.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a Parquet file — read_parquet","title":"Read a Parquet file — read_parquet","text":"'Parquet' columnar storage file format. 
function enables read Parquet files R.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_parquet.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a Parquet file — read_parquet","text":"","code":"read_parquet( file, col_select = NULL, as_data_frame = TRUE, props = ParquetArrowReaderProperties$create(), mmap = TRUE, ... )"},{"path":"https://arrow.apache.org/docs/r/reference/read_parquet.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a Parquet file — read_parquet","text":"file character file name URI, connection, raw vector, Arrow input stream, FileSystem path (SubTreeFileSystem). file name URI, Arrow InputStream opened closed finished. input stream provided, left open. col_select character vector column names keep, \"select\" argument data.table::fread(), tidy selection specification columns, used dplyr::select(). as_data_frame function return tibble (default) Arrow Table? props ParquetArrowReaderProperties mmap Use TRUE use memory mapping possible ... Additional arguments passed ParquetFileReader$create()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_parquet.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read a Parquet file — read_parquet","text":"tibble as_data_frame TRUE (default), Arrow Table otherwise.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_parquet.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read a Parquet file — read_parquet","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) write_parquet(mtcars, tf) df <- read_parquet(tf, col_select = starts_with(\"d\")) head(df) #> # A tibble: 6 x 2 #> disp drat #> <dbl> <dbl> #> 1 160 3.9 #> 2 160 3.9 #> 3 108 3.85 #> 4 258 3.08 #> 5 360 3.15 #> 6 225 2.76"},{"path":"https://arrow.apache.org/docs/r/reference/read_schema.html","id":null,"dir":"Reference","previous_headings":"","what":"Read a Schema from a stream — read_schema","title":"Read a Schema from a stream — read_schema","text":"Read Schema stream","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_schema.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read a Schema from a stream — read_schema","text":"","code":"read_schema(stream, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/read_schema.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read a Schema from a stream — read_schema","text":"stream Message, InputStream, Buffer ... 
currently ignored","code":""},{"path":"https://arrow.apache.org/docs/r/reference/read_schema.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read a Schema from a stream — read_schema","text":"Schema","code":""},{"path":"https://arrow.apache.org/docs/r/reference/record_batch.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a RecordBatch — record_batch","title":"Create a RecordBatch — record_batch","text":"Create RecordBatch","code":""},{"path":"https://arrow.apache.org/docs/r/reference/record_batch.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a RecordBatch — record_batch","text":"","code":"record_batch(..., schema = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/record_batch.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a RecordBatch — record_batch","text":"... data.frame named set Arrays vectors. given mixture data.frames vectors, inputs autospliced together (see examples). Alternatively, can provide single Arrow IPC InputStream, Message, Buffer, R raw object containing Buffer. schema Schema, NULL (default) infer schema data .... providing Arrow IPC buffer, schema required.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/record_batch.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a RecordBatch — record_batch","text":"","code":"batch <- record_batch(name = rownames(mtcars), mtcars) dim(batch) #> [1] 32 12 dim(head(batch)) #> [1] 6 12 names(batch) #> [1] \"name\" \"mpg\" \"cyl\" \"disp\" \"hp\" \"drat\" \"wt\" \"qsec\" \"vs\" \"am\" #> [11] \"gear\" \"carb\" batch$mpg #> Array #> <double> #> [ #> 21, #> 21, #> 22.8, #> 21.4, #> 18.7, #> 18.1, #> 14.3, #> 24.4, #> 22.8, #> 19.2, #> ... #> 15.2, #> 13.3, #> 19.2, #> 27.3, #> 26, #> 30.4, #> 15.8, #> 19.7, #> 15, #> 21.4 #> ] batch[[\"cyl\"]] #> Array #> <double> #> [ #> 6, #> 6, #> 4, #> 6, #> 8, #> 6, #> 8, #> 4, #> 4, #> 6, #> ... 
#> 8, #> 8, #> 8, #> 4, #> 4, #> 4, #> 8, #> 6, #> 8, #> 4 #> ] as.data.frame(batch[4:8, c(\"gear\", \"hp\", \"wt\")]) #> gear hp wt #> 1 3 110 3.215 #> 2 3 175 3.440 #> 3 3 105 3.460 #> 4 3 245 3.570 #> 5 4 62 3.190"},{"path":"https://arrow.apache.org/docs/r/reference/recycle_scalars.html","id":null,"dir":"Reference","previous_headings":"","what":"Recycle scalar values in a list of arrays — recycle_scalars","title":"Recycle scalar values in a list of arrays — recycle_scalars","text":"Recycle scalar values list arrays","code":""},{"path":"https://arrow.apache.org/docs/r/reference/recycle_scalars.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Recycle scalar values in a list of arrays — recycle_scalars","text":"","code":"recycle_scalars(arrays)"},{"path":"https://arrow.apache.org/docs/r/reference/recycle_scalars.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Recycle scalar values in a list of arrays — recycle_scalars","text":"arrays List arrays","code":""},{"path":"https://arrow.apache.org/docs/r/reference/recycle_scalars.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Recycle scalar values in a list of arrays — recycle_scalars","text":"List arrays vector/Scalar/Array/ChunkedArray values length 1 recycled","code":""},{"path":"https://arrow.apache.org/docs/r/reference/reexports.html","id":null,"dir":"Reference","previous_headings":"","what":"Objects exported from other packages — reexports","title":"Objects exported from other packages — reexports","text":"objects imported packages. Follow links see documentation. bit64 print.integer64, str.integer64 tidyselect all_of, contains, ends_with, everything, last_col, matches, num_range, one_of, starts_with","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_binding.html","id":null,"dir":"Reference","previous_headings":"","what":"Register compute bindings — register_binding","title":"Register compute bindings — register_binding","text":"register_binding() register_binding_agg() functions used populate list functions operate (return) Expressions. basis .data mask inside dplyr methods.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_binding.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Register compute bindings — register_binding","text":"","code":"register_binding( fun_name, fun, registry = nse_funcs, update_cache = FALSE, notes = character(0) ) register_binding_agg( fun_name, agg_fun, registry = agg_funcs, notes = character(0) )"},{"path":"https://arrow.apache.org/docs/r/reference/register_binding.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Register compute bindings — register_binding","text":"fun_name string containing function name form \"function\" \"package::function\". package name currently used may used future allow types function calls. fun function NULL un-register previous function. function must accept Expression objects arguments return Expression objects instead regular R objects. registry environment functions assigned. update_cache Update .cache$functions time registration. default FALSE majority usage register bindings package load, create cache . reason .cache$functions needed addition nse_funcs non-aggregate functions revisited...currently used data mask mutate, filter, aggregate (summarise) data mask list. 
notes string docs: note limitations differences behavior Arrow version R function. agg_fun aggregate function NULL un-register previous aggregate function. function must accept Expression objects arguments return list() components: fun: string function name data: list 0 Expressions options: list function options, passed call_function","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_binding.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Register compute bindings — register_binding","text":"previously registered binding NULL previously registered function existed.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_binding.html","id":"writing-bindings","dir":"Reference","previous_headings":"","what":"Writing bindings","title":"Register compute bindings — register_binding","text":"Expression$create() wrap non-Expression inputs Scalar Expressions. want try coerce scalar inputs match type Expression(s) arguments, call cast_scalars_to_common_type(args) args. example, Expression$create(\"add\", args = list(int16_field, 1)) result float64 type output 1 double R. prevent casting data int16_field float preserve int16, Expression$create(\"add\", args = cast_scalars_to_common_type(list(int16_field, 1))) Inside function, can call binding call_binding().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_scalar_function.html","id":null,"dir":"Reference","previous_headings":"","what":"Register user-defined functions — register_scalar_function","title":"Register user-defined functions — register_scalar_function","text":"functions support calling R code query engine execution (.e., dplyr::mutate() dplyr::filter() Table Dataset). Use register_scalar_function() attach Arrow input output types R function make available use dplyr interface /call_function(). Scalar functions currently type user-defined function supported. Arrow, scalar functions must stateless return output shape (.e., number rows) input.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_scalar_function.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Register user-defined functions — register_scalar_function","text":"","code":"register_scalar_function(name, fun, in_type, out_type, auto_convert = FALSE)"},{"path":"https://arrow.apache.org/docs/r/reference/register_scalar_function.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Register user-defined functions — register_scalar_function","text":"name function name used dplyr bindings fun R function rlang-style lambda expression. function called first argument context list() elements batch_size (expected length output) output_type (required DataType output) may used ensure output correct type length. Subsequent arguments passed position specified in_types. auto_convert TRUE, subsequent arguments converted R vectors passed fun output automatically constructed expected output type via as_arrow_array(). in_type DataType input type schema() functions one argument. signature used determine function appropriate given set arguments. function appropriate one signature, pass list() . out_type DataType output type function accepting single argument (types), list() DataTypes. function must return DataType. auto_convert Use TRUE convert inputs passing fun construct Array correct type output. 
Use option write functions R objects opposed functions Arrow R6 objects.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_scalar_function.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Register user-defined functions — register_scalar_function","text":"NULL, invisibly","code":""},{"path":"https://arrow.apache.org/docs/r/reference/register_scalar_function.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Register user-defined functions — register_scalar_function","text":"","code":"if (FALSE) { # arrow_with_dataset() && identical(Sys.getenv(\"NOT_CRAN\"), \"true\") library(dplyr, warn.conflicts = FALSE) some_model <- lm(mpg ~ disp + cyl, data = mtcars) register_scalar_function( \"mtcars_predict_mpg\", function(context, disp, cyl) { predict(some_model, newdata = data.frame(disp, cyl)) }, in_type = schema(disp = float64(), cyl = float64()), out_type = float64(), auto_convert = TRUE ) as_arrow_table(mtcars) %>% transmute(mpg, mpg_predicted = mtcars_predict_mpg(disp, cyl)) %>% collect() %>% head() }"},{"path":"https://arrow.apache.org/docs/r/reference/repeat_value_as_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Take an object of length 1 and repeat it. — repeat_value_as_array","title":"Take an object of length 1 and repeat it. — repeat_value_as_array","text":"Take object length 1 repeat .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/repeat_value_as_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Take an object of length 1 and repeat it. — repeat_value_as_array","text":"","code":"repeat_value_as_array(object, n)"},{"path":"https://arrow.apache.org/docs/r/reference/repeat_value_as_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Take an object of length 1 and repeat it. — repeat_value_as_array","text":"object Object length 1 repeated - vector, Scalar, Array, ChunkedArray n Number repetitions","code":""},{"path":"https://arrow.apache.org/docs/r/reference/repeat_value_as_array.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Take an object of length 1 and repeat it. — repeat_value_as_array","text":"Array length n","code":""},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":null,"dir":"Reference","previous_headings":"","what":"Connect to an AWS S3 bucket — s3_bucket","title":"Connect to an AWS S3 bucket — s3_bucket","text":"s3_bucket() convenience function create S3FileSystem object automatically detects bucket's AWS region holding onto relative path.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Connect to an AWS S3 bucket — s3_bucket","text":"","code":"s3_bucket(bucket, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Connect to an AWS S3 bucket — s3_bucket","text":"bucket string S3 bucket name path ... Additional connection options, passed S3FileSystem$create()","code":""},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Connect to an AWS S3 bucket — s3_bucket","text":"SubTreeFileSystem containing S3FileSystem bucket's relative path. 
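The returned SubTreeFileSystem can be handed to the dataset-opening functions wherever a directory is expected. A rough, not-run sketch; the bucket name and prefix are hypothetical, and network access (plus credentials, for private buckets) is assumed:
if (FALSE) {
  bucket <- s3_bucket("my-example-bucket/some/prefix")  # hypothetical location
  ds <- open_dataset(bucket)   # treat the bucket path as a dataset directory
  nrow(ds)
}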
Note: this function's success does not guarantee authorized access to the bucket's contents.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Connect to an AWS S3 bucket — s3_bucket","text":"default, s3_bucket S3FileSystem functions produce output fatal errors printing return values. troubleshooting problems, may useful increase log level. See Notes section S3FileSystem information see Examples .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/s3_bucket.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Connect to an AWS S3 bucket — s3_bucket","text":"","code":"if (FALSE) { bucket <- s3_bucket(\"voltrondata-labs-datasets\") } if (FALSE) { # Turn on debug logging. The following line of code should be run in a fresh # R session prior to any calls to `s3_bucket()` (or other S3 functions) Sys.setenv(\"ARROW_S3_LOG_LEVEL\"=\"DEBUG\") bucket <- s3_bucket(\"voltrondata-labs-datasets\") }"},{"path":"https://arrow.apache.org/docs/r/reference/scalar.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an Arrow Scalar — scalar","title":"Create an Arrow Scalar — scalar","text":"Create Arrow Scalar","code":""},{"path":"https://arrow.apache.org/docs/r/reference/scalar.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an Arrow Scalar — scalar","text":"","code":"scalar(x, type = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/scalar.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an Arrow Scalar — scalar","text":"x R vector, list, data.frame type optional data type x. omitted, type inferred data.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/scalar.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an Arrow Scalar — scalar","text":"","code":"scalar(pi) #> Scalar #> 3.141592653589793 scalar(404) #> Scalar #> 404 # If you pass a vector into scalar(), you get a list containing your items scalar(c(1, 2, 3)) #> Scalar #> list<item: double>[1, 2, 3] scalar(9) == scalar(10) #> Scalar #> false"},{"path":"https://arrow.apache.org/docs/r/reference/schema.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a schema or extract one from an object. — schema","title":"Create a schema or extract one from an object. — schema","text":"Create schema extract one object.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/schema.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a schema or extract one from an object. — schema","text":"","code":"schema(...)"},{"path":"https://arrow.apache.org/docs/r/reference/schema.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a schema or extract one from an object. — schema","text":"... fields, field name/data type pairs (list ), object extract schema","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/schema.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a schema or extract one from an object. 
— schema","text":"","code":"# Create schema using pairs of field names and data types schema(a = int32(), b = float64()) #> Schema #> a: int32 #> b: double # Create a schema using a list of pairs of field names and data types schema(list(a = int8(), b = string())) #> Schema #> a: int8 #> b: string # Create schema using fields schema( field(\"b\", double()), field(\"c\", bool(), nullable = FALSE), field(\"d\", string()) ) #> Schema #> b: double #> c: bool not null #> d: string # Extract schemas from objects df <- data.frame(col1 = 2:4, col2 = c(0.1, 0.3, 0.5)) tab1 <- arrow_table(df) schema(tab1) #> Schema #> col1: int32 #> col2: double #> #> See $metadata for additional Schema metadata tab2 <- arrow_table(df, schema = schema(col1 = int8(), col2 = float32())) schema(tab2) #> Schema #> col1: int8 #> col2: float #> #> See $metadata for additional Schema metadata"},{"path":"https://arrow.apache.org/docs/r/reference/show_exec_plan.html","id":null,"dir":"Reference","previous_headings":"","what":"Show the details of an Arrow Execution Plan — show_exec_plan","title":"Show the details of an Arrow Execution Plan — show_exec_plan","text":"function gives details logical query plan executed evaluating arrow_dplyr_query object. calls C++ ExecPlan object's print method. Functionally, similar dplyr::explain(). function used dplyr::explain() dplyr::show_query() methods.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/show_exec_plan.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Show the details of an Arrow Execution Plan — show_exec_plan","text":"","code":"show_exec_plan(x)"},{"path":"https://arrow.apache.org/docs/r/reference/show_exec_plan.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Show the details of an Arrow Execution Plan — show_exec_plan","text":"x arrow_dplyr_query print ExecPlan .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/show_exec_plan.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Show the details of an Arrow Execution Plan — show_exec_plan","text":"x, invisibly.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/show_exec_plan.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Show the details of an Arrow Execution Plan — show_exec_plan","text":"","code":"library(dplyr) mtcars %>% arrow_table() %>% filter(mpg > 20) %>% mutate(x = gear / carb) %>% show_exec_plan() #> ExecPlan with 4 nodes: #> 3:SinkNode{} #> 2:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, \"x\": divide(cast(gear, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(carb, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}))]} #> 1:FilterNode{filter=(mpg > 20)} #> 0:TableSourceNode{}"},{"path":"https://arrow.apache.org/docs/r/reference/table.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an Arrow Table — arrow_table","title":"Create an Arrow Table — arrow_table","text":"Create Arrow Table","code":""},{"path":"https://arrow.apache.org/docs/r/reference/table.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an Arrow Table — 
arrow_table","text":"","code":"arrow_table(..., schema = NULL)"},{"path":"https://arrow.apache.org/docs/r/reference/table.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an Arrow Table — arrow_table","text":"... data.frame named set Arrays vectors. given mixture data.frames named vectors, inputs autospliced together (see examples). Alternatively, can provide single Arrow IPC InputStream, Message, Buffer, R raw object containing Buffer. schema Schema, NULL (default) infer schema data .... providing Arrow IPC buffer, schema required.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/table.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an Arrow Table — arrow_table","text":"","code":"tbl <- arrow_table(name = rownames(mtcars), mtcars) dim(tbl) #> [1] 32 12 dim(head(tbl)) #> [1] 6 12 names(tbl) #> [1] \"name\" \"mpg\" \"cyl\" \"disp\" \"hp\" \"drat\" \"wt\" \"qsec\" \"vs\" \"am\" #> [11] \"gear\" \"carb\" tbl$mpg #> ChunkedArray #> <double> #> [ #> [ #> 21, #> 21, #> 22.8, #> 21.4, #> 18.7, #> 18.1, #> 14.3, #> 24.4, #> 22.8, #> 19.2, #> ... #> 15.2, #> 13.3, #> 19.2, #> 27.3, #> 26, #> 30.4, #> 15.8, #> 19.7, #> 15, #> 21.4 #> ] #> ] tbl[[\"cyl\"]] #> ChunkedArray #> <double> #> [ #> [ #> 6, #> 6, #> 4, #> 6, #> 8, #> 6, #> 8, #> 4, #> 4, #> 6, #> ... #> 8, #> 8, #> 8, #> 4, #> 4, #> 4, #> 8, #> 6, #> 8, #> 4 #> ] #> ] as.data.frame(tbl[4:8, c(\"gear\", \"hp\", \"wt\")]) #> gear hp wt #> 1 3 110 3.215 #> 2 3 175 3.440 #> 3 3 105 3.460 #> 4 3 245 3.570 #> 5 4 62 3.190"},{"path":"https://arrow.apache.org/docs/r/reference/to_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an Arrow object from a DuckDB connection — to_arrow","title":"Create an Arrow object from a DuckDB connection — to_arrow","text":"can used pipelines pass data back forth Arrow DuckDB","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an Arrow object from a DuckDB connection — to_arrow","text":"","code":"to_arrow(.data)"},{"path":"https://arrow.apache.org/docs/r/reference/to_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an Arrow object from a DuckDB connection — to_arrow","text":".data object converted","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_arrow.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create an Arrow object from a DuckDB connection — to_arrow","text":"RecordBatchReader.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_arrow.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an Arrow object from a DuckDB connection — to_arrow","text":"","code":"library(dplyr) ds <- InMemoryDataset$create(mtcars) ds %>% filter(mpg < 30) %>% to_duckdb() %>% group_by(cyl) %>% summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>% to_arrow() %>% collect() #> # A tibble: 3 x 2 #> cyl mean_mpg #> <dbl> <dbl> #> 1 4 23.7 #> 2 6 19.7 #> 3 8 15.1"},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":null,"dir":"Reference","previous_headings":"","what":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":"necessary configuration create (virtual) table DuckDB backed Arrow object 
given. data copied modified collect() compute() called query run table.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":"","code":"to_duckdb( .data, con = arrow_duck_connection(), table_name = unique_arrow_tablename(), auto_disconnect = TRUE )"},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":".data Arrow object (e.g. Dataset, Table) use DuckDB table con DuckDB connection use (default create one store options(\"arrow_duck_con\")) table_name name use DuckDB object. default unique string \"arrow_\" followed numbers. auto_disconnect table automatically cleaned resulting object removed (garbage collected)? Default: TRUE","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":"tbl new table DuckDB","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":"result dbplyr-compatible object can used d(b)plyr pipelines. auto_disconnect = TRUE, DuckDB table created configured unregistered tbl object garbage collected. helpful want extra table objects DuckDB finished using .","code":""},{"path":"https://arrow.apache.org/docs/r/reference/to_duckdb.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create a (virtual) DuckDB table from an Arrow object — to_duckdb","text":"","code":"library(dplyr) ds <- InMemoryDataset$create(mtcars) ds %>% filter(mpg < 30) %>% group_by(cyl) %>% to_duckdb() %>% slice_min(disp) #> # Source: SQL [5 x 11] #> # Database: DuckDB v0.10.1 [unknown@Linux 6.5.0-1018-azure:R 4.4.0/:memory:] #> # Groups: cyl #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 #> 2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 #> 3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 #> 4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 #> 5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3"},{"path":"https://arrow.apache.org/docs/r/reference/unify_schemas.html","id":null,"dir":"Reference","previous_headings":"","what":"Combine and harmonize schemas — unify_schemas","title":"Combine and harmonize schemas — unify_schemas","text":"Combine harmonize schemas","code":""},{"path":"https://arrow.apache.org/docs/r/reference/unify_schemas.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Combine and harmonize schemas — unify_schemas","text":"","code":"unify_schemas(..., schemas = list(...))"},{"path":"https://arrow.apache.org/docs/r/reference/unify_schemas.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Combine and harmonize schemas — unify_schemas","text":"... 
Schemas unify schemas Alternatively, list schemas","code":""},{"path":"https://arrow.apache.org/docs/r/reference/unify_schemas.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Combine and harmonize schemas — unify_schemas","text":"Schema union fields contained inputs, NULL schemas NULL","code":""},{"path":"https://arrow.apache.org/docs/r/reference/unify_schemas.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Combine and harmonize schemas — unify_schemas","text":"","code":"a <- schema(b = double(), c = bool()) z <- schema(b = double(), k = utf8()) unify_schemas(a, z) #> Schema #> b: double #> c: bool #> k: string"},{"path":"https://arrow.apache.org/docs/r/reference/value_counts.html","id":null,"dir":"Reference","previous_headings":"","what":"table for Arrow objects — value_counts","title":"table for Arrow objects — value_counts","text":"function tabulates values array returns table counts.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/value_counts.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"table for Arrow objects — value_counts","text":"","code":"value_counts(x)"},{"path":"https://arrow.apache.org/docs/r/reference/value_counts.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"table for Arrow objects — value_counts","text":"x Array ChunkedArray","code":""},{"path":"https://arrow.apache.org/docs/r/reference/value_counts.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"table for Arrow objects — value_counts","text":"StructArray containing \"values\" (type x) \"counts\" Int64.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/value_counts.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"table for Arrow objects — value_counts","text":"","code":"cyl_vals <- Array$create(mtcars$cyl) counts <- value_counts(cyl_vals)"},{"path":"https://arrow.apache.org/docs/r/reference/vctrs_extension_array.html","id":null,"dir":"Reference","previous_headings":"","what":"Extension type for generic typed vectors — vctrs_extension_array","title":"Extension type for generic typed vectors — vctrs_extension_array","text":"common R vector types converted automatically suitable Arrow data type without need extension type. vector types whose conversion suitably handled default, can create vctrs_extension_array(), passes vctrs::vec_data() Array$create() calls vctrs::vec_restore() Array converted back R vector.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/vctrs_extension_array.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extension type for generic typed vectors — vctrs_extension_array","text":"","code":"vctrs_extension_array(x, ptype = vctrs::vec_ptype(x), storage_type = NULL) vctrs_extension_type(x, storage_type = infer_type(vctrs::vec_data(x)))"},{"path":"https://arrow.apache.org/docs/r/reference/vctrs_extension_array.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extension type for generic typed vectors — vctrs_extension_array","text":"x vctr (.e., vctrs::vec_is() returns TRUE). ptype vctrs::vec_ptype(), usually zero-length version object appropriate attributes set. value serialized using serialize(), refer R object saved/reloaded. 
storage_type data type underlying storage array.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/vctrs_extension_array.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extension type for generic typed vectors — vctrs_extension_array","text":"vctrs_extension_array() returns ExtensionArray instance vctrs_extension_type(). vctrs_extension_type() returns ExtensionType instance extension name \"arrow.r.vctrs\".","code":""},{"path":"https://arrow.apache.org/docs/r/reference/vctrs_extension_array.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Extension type for generic typed vectors — vctrs_extension_array","text":"","code":"(array <- vctrs_extension_array(as.POSIXlt(\"2022-01-02 03:45\", tz = \"UTC\"))) #> ExtensionArray #> <POSIXlt of length 0> #> -- is_valid: all not null #> -- child 0 type: double #> [ #> 0 #> ] #> -- child 1 type: int32 #> [ #> 45 #> ] #> -- child 2 type: int32 #> [ #> 3 #> ] #> -- child 3 type: int32 #> [ #> 2 #> ] #> -- child 4 type: int32 #> [ #> 0 #> ] #> -- child 5 type: int32 #> [ #> 122 #> ] #> -- child 6 type: int32 #> [ #> 0 #> ] #> -- child 7 type: int32 #> [ #> 1 #> ] #> -- child 8 type: int32 #> [ #> 0 #> ] #> -- child 9 type: string #> [ #> \"UTC\" #> ] #> -- child 10 type: int32 #> [ #> 0 #> ] array$type #> VctrsExtensionType #> POSIXlt of length 0 as.vector(array) #> [1] \"2022-01-02 03:45:00 UTC\" temp_feather <- tempfile() write_feather(arrow_table(col = array), temp_feather) read_feather(temp_feather) #> # A tibble: 1 x 1 #> col #> <dttm> #> 1 2022-01-02 03:45:00 unlink(temp_feather)"},{"path":"https://arrow.apache.org/docs/r/reference/write_csv_arrow.html","id":null,"dir":"Reference","previous_headings":"","what":"Write CSV file to disk — write_csv_arrow","title":"Write CSV file to disk — write_csv_arrow","text":"Write CSV file disk","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_csv_arrow.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write CSV file to disk — write_csv_arrow","text":"","code":"write_csv_arrow( x, sink, file = NULL, include_header = TRUE, col_names = NULL, batch_size = 1024L, na = \"\", write_options = NULL, ... )"},{"path":"https://arrow.apache.org/docs/r/reference/write_csv_arrow.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write CSV file to disk — write_csv_arrow","text":"x data.frame, RecordBatch, Table sink string file path, connection, URI, OutputStream, path file system (SubTreeFileSystem) file file name. Specify this or sink, not both. include_header Whether write initial header line column names col_names identical include_header. Specify this or include_header, not both. batch_size Maximum number rows processed time. Default 1024. na value write NA values. Must not contain quote marks. Default \"\". write_options see CSV write options ... additional parameters","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_csv_arrow.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write CSV file to disk — write_csv_arrow","text":"input x, invisibly. 
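Because the input is returned invisibly, write_csv_arrow() can sit in the middle of a pipeline; a small sketch (the file is a throwaway tempfile):
library(dplyr)
tf <- tempfile(fileext = ".csv")
mtcars %>%
  write_csv_arrow(tf) %>%  # writes the file, then passes mtcars along
  nrow()
#> [1] 32
unlink(tf)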
Note sink OutputStream, stream left open.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_csv_arrow.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write CSV file to disk — write_csv_arrow","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) write_csv_arrow(mtcars, tf)"},{"path":"https://arrow.apache.org/docs/r/reference/write_dataset.html","id":null,"dir":"Reference","previous_headings":"","what":"Write a dataset — write_dataset","title":"Write a dataset — write_dataset","text":"function allows write dataset. writing efficient binary storage formats, specifying relevant partitioning, can make much faster read query.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_dataset.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write a dataset — write_dataset","text":"","code":"write_dataset( dataset, path, format = c(\"parquet\", \"feather\", \"arrow\", \"ipc\", \"csv\", \"tsv\", \"txt\", \"text\"), partitioning = dplyr::group_vars(dataset), basename_template = paste0(\"part-{i}.\", as.character(format)), hive_style = TRUE, existing_data_behavior = c(\"overwrite\", \"error\", \"delete_matching\"), max_partitions = 1024L, max_open_files = 900L, max_rows_per_file = 0L, min_rows_per_group = 0L, max_rows_per_group = bitwShiftL(1, 20), ... )"},{"path":"https://arrow.apache.org/docs/r/reference/write_dataset.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write a dataset — write_dataset","text":"dataset Dataset, RecordBatch, Table, arrow_dplyr_query, data.frame. arrow_dplyr_query, query evaluated result written. means can select(), filter(), mutate(), etc. transform data written need . path string path, URI, SubTreeFileSystem referencing directory write (directory created exist) format string identifier file format. Default use \"parquet\" (see FileFormat) partitioning Partitioning character vector columns use partition keys (written path segments). Default use current group_by() columns. basename_template string template names files written. Must contain \"{i}\", replaced autoincremented integer generate basenames datafiles. example, \"part-{i}.arrow\" yield \"part-0.arrow\", .... specified, defaults \"part-{i}.<default extension>\". hive_style logical: write partition segments Hive-style (key1=value1/key2=value2/file.ext) just bare values. Default TRUE. existing_data_behavior behavior use already data destination directory. Must one \"overwrite\", \"error\", \"delete_matching\". \"overwrite\" (default) new files created overwrite existing files \"error\" operation fail destination directory not empty \"delete_matching\" writer delete existing partitions data going written partitions leave alone partitions data written . max_partitions maximum number partitions batch may written . Default 1024L. max_open_files maximum number files can left opened write operation. greater 0 limit maximum number files can left open. attempt made open many files least recently used file closed. setting set low may end fragmenting data many small files. default 900 also allows # files open scanner hitting default Linux limit 1024. max_rows_per_file maximum number rows per file. greater 0 limit many rows placed single file. Default 0L. min_rows_per_group write row groups disk number rows accumulated. Default 0L. max_rows_per_group maximum rows allowed single group number rows exceeded, split next set rows written next group. 
value must set greater min_rows_per_group. Default 1024 * 1024. ... additional format-specific arguments. available Parquet options, see write_parquet(). available Feather options : use_legacy_format logical: write data formatted Arrow libraries versions 0.14 lower can read . Default FALSE. can also enable setting environment variable ARROW_PRE_0_15_IPC_FORMAT=1. metadata_version: string like \"V5\" equivalent integer indicating Arrow IPC MetadataVersion. Default (NULL) use latest version, unless environment variable ARROW_PRE_1_0_METADATA_VERSION=1, case V4. codec: Codec used compress body buffers written files. Default (NULL) compress body buffers. null_fallback: character used place missing values (NA NULL) using Hive-style partitioning. See hive_partition().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_dataset.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write a dataset — write_dataset","text":"input dataset, invisibly","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_dataset.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write a dataset — write_dataset","text":"","code":"# You can write datasets partitioned by the values in a column (here: \"cyl\"). # This creates a structure of the form cyl=X/part-Z.parquet. one_level_tree <- tempfile() write_dataset(mtcars, one_level_tree, partitioning = \"cyl\") list.files(one_level_tree, recursive = TRUE) #> [1] \"cyl=4/part-0.parquet\" \"cyl=6/part-0.parquet\" \"cyl=8/part-0.parquet\" # You can also partition by the values in multiple columns # (here: \"cyl\" and \"gear\"). # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet. two_levels_tree <- tempfile() write_dataset(mtcars, two_levels_tree, partitioning = c(\"cyl\", \"gear\")) list.files(two_levels_tree, recursive = TRUE) #> [1] \"cyl=4/gear=3/part-0.parquet\" \"cyl=4/gear=4/part-0.parquet\" #> [3] \"cyl=4/gear=5/part-0.parquet\" \"cyl=6/gear=3/part-0.parquet\" #> [5] \"cyl=6/gear=4/part-0.parquet\" \"cyl=6/gear=5/part-0.parquet\" #> [7] \"cyl=8/gear=3/part-0.parquet\" \"cyl=8/gear=5/part-0.parquet\" # In the two previous examples we would have: # X = {4,6,8}, the number of cylinders. # Y = {3,4,5}, the number of forward gears. # Z = {0,1,2}, the number of saved parts, starting from 0. # You can obtain the same result as as the previous examples using arrow with # a dplyr pipeline. This will be the same as two_levels_tree above, but the # output directory will be different. library(dplyr) two_levels_tree_2 <- tempfile() mtcars %>% group_by(cyl, gear) %>% write_dataset(two_levels_tree_2) list.files(two_levels_tree_2, recursive = TRUE) #> [1] \"cyl=4/gear=3/part-0.parquet\" \"cyl=4/gear=4/part-0.parquet\" #> [3] \"cyl=4/gear=5/part-0.parquet\" \"cyl=6/gear=3/part-0.parquet\" #> [5] \"cyl=6/gear=4/part-0.parquet\" \"cyl=6/gear=5/part-0.parquet\" #> [7] \"cyl=8/gear=3/part-0.parquet\" \"cyl=8/gear=5/part-0.parquet\" # And you can also turn off the Hive-style directory naming where the column # name is included with the values by using `hive_style = FALSE`. # Write a structure X/Y/part-Z.parquet. 
two_levels_tree_no_hive <- tempfile() mtcars %>% group_by(cyl, gear) %>% write_dataset(two_levels_tree_no_hive, hive_style = FALSE) list.files(two_levels_tree_no_hive, recursive = TRUE) #> [1] \"4/3/part-0.parquet\" \"4/4/part-0.parquet\" \"4/5/part-0.parquet\" #> [4] \"6/3/part-0.parquet\" \"6/4/part-0.parquet\" \"6/5/part-0.parquet\" #> [7] \"8/3/part-0.parquet\" \"8/5/part-0.parquet\""},{"path":"https://arrow.apache.org/docs/r/reference/write_delim_dataset.html","id":null,"dir":"Reference","previous_headings":"","what":"Write a dataset into partitioned flat files. — write_delim_dataset","title":"Write a dataset into partitioned flat files. — write_delim_dataset","text":"write_*_dataset() family wrappers around write_dataset allow easy switching functions writing datasets.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_delim_dataset.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write a dataset into partitioned flat files. — write_delim_dataset","text":"","code":"write_delim_dataset( dataset, path, partitioning = dplyr::group_vars(dataset), basename_template = \"part-{i}.txt\", hive_style = TRUE, existing_data_behavior = c(\"overwrite\", \"error\", \"delete_matching\"), max_partitions = 1024L, max_open_files = 900L, max_rows_per_file = 0L, min_rows_per_group = 0L, max_rows_per_group = bitwShiftL(1, 20), col_names = TRUE, batch_size = 1024L, delim = \",\", na = \"\", eol = \"\\n\", quote = c(\"needed\", \"all\", \"none\") ) write_csv_dataset( dataset, path, partitioning = dplyr::group_vars(dataset), basename_template = \"part-{i}.csv\", hive_style = TRUE, existing_data_behavior = c(\"overwrite\", \"error\", \"delete_matching\"), max_partitions = 1024L, max_open_files = 900L, max_rows_per_file = 0L, min_rows_per_group = 0L, max_rows_per_group = bitwShiftL(1, 20), col_names = TRUE, batch_size = 1024L, delim = \",\", na = \"\", eol = \"\\n\", quote = c(\"needed\", \"all\", \"none\") ) write_tsv_dataset( dataset, path, partitioning = dplyr::group_vars(dataset), basename_template = \"part-{i}.tsv\", hive_style = TRUE, existing_data_behavior = c(\"overwrite\", \"error\", \"delete_matching\"), max_partitions = 1024L, max_open_files = 900L, max_rows_per_file = 0L, min_rows_per_group = 0L, max_rows_per_group = bitwShiftL(1, 20), col_names = TRUE, batch_size = 1024L, na = \"\", eol = \"\\n\", quote = c(\"needed\", \"all\", \"none\") )"},{"path":"https://arrow.apache.org/docs/r/reference/write_delim_dataset.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write a dataset into partitioned flat files. — write_delim_dataset","text":"dataset Dataset, RecordBatch, Table, arrow_dplyr_query, data.frame. arrow_dplyr_query, query evaluated result written. means can select(), filter(), mutate(), etc. transform data written need . path string path, URI, SubTreeFileSystem referencing directory write (directory created exist) partitioning Partitioning character vector columns use partition keys (written path segments). Default use current group_by() columns. basename_template string template names files written. Must contain \"{}\", replaced autoincremented integer generate basenames datafiles. example, \"part-{}.arrow\" yield \"part-0.arrow\", .... specified, defaults \"part-{}.<default extension>\". hive_style logical: write partition segments Hive-style (key1=value1/key2=value2/file.ext) just bare values. Default TRUE. existing_data_behavior behavior use already data destination directory. 
Must one \"overwrite\", \"error\", \"delete_matching\". \"overwrite\" (default) new files created overwrite existing files \"error\" operation fail destination directory empty \"delete_matching\" writer delete existing partitions data going written partitions leave alone partitions data written . max_partitions maximum number partitions batch may written . Default 1024L. max_open_files maximum number files can left opened write operation. greater 0 limit maximum number files can left open. attempt made open many files least recently used file closed. setting set low may end fragmenting data many small files. default 900 also allows # files open scanner hitting default Linux limit 1024. max_rows_per_file maximum number rows per file. greater 0 limit many rows placed single file. Default 0L. min_rows_per_group write row groups disk number rows accumulated. Default 0L. max_rows_per_group maximum rows allowed single group number rows exceeded, split next set rows written next group. value must set greater min_rows_per_group. Default 1024 * 1024. col_names Whether write initial header line column names. batch_size Maximum number rows processed time. Default 1024L. delim Delimiter used separate values. Defaults \",\" write_delim_dataset() write_csv_dataset(), \"\\t write_tsv_dataset(). changed write_tsv_dataset(). na character vector strings interpret missing values. Quotes allowed string. default empty string \"\". eol end line character use ending rows. default \"\\n\". quote handle fields contain characters need quoted. needed - Enclose strings binary values quotes need , CSV rendering can contain quotes (default) - Enclose valid values quotes. Nulls quoted. May cause readers interpret values strings schema inferred. none - enclose values quotes. Prevents values containing quotes (\"), cell delimiters (,) line endings (\\r, \\n), (following RFC4180). values contain characters, error caused attempting write.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_delim_dataset.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write a dataset into partitioned flat files. — write_delim_dataset","text":"input dataset, invisibly.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/write_feather.html","id":null,"dir":"Reference","previous_headings":"","what":"Write a Feather file (an Arrow IPC file) — write_feather","title":"Write a Feather file (an Arrow IPC file) — write_feather","text":"Feather provides binary columnar serialization data frames. designed make reading writing data frames efficient, make sharing data across data analysis languages easy. write_feather() can write Feather Version 1 (V1), legacy version available starting 2016, Version 2 (V2), Apache Arrow IPC file format. default version V2. V1 files distinct Arrow IPC files lack many features, ability store Arrow data tyeps, compression support. 
write_ipc_file() can write V2 files.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_feather.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write a Feather file (an Arrow IPC file) — write_feather","text":"","code":"write_feather( x, sink, version = 2, chunk_size = 65536L, compression = c(\"default\", \"lz4\", \"lz4_frame\", \"uncompressed\", \"zstd\"), compression_level = NULL ) write_ipc_file( x, sink, chunk_size = 65536L, compression = c(\"default\", \"lz4\", \"lz4_frame\", \"uncompressed\", \"zstd\"), compression_level = NULL )"},{"path":"https://arrow.apache.org/docs/r/reference/write_feather.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write a Feather file (an Arrow IPC file) — write_feather","text":"x data.frame, RecordBatch, Table sink string file path, connection, URI, OutputStream, path file system (SubTreeFileSystem) version integer Feather file version, Version 1 Version 2. Version 2 default. chunk_size V2 files, number rows chunk data file. Use smaller chunk_size need faster random row access. Default 64K. option supported V1. compression Name compression codec use, . Default \"lz4\" LZ4 available build Arrow C++ library, otherwise \"uncompressed\". \"zstd\" available codec generally better compression ratios exchange slower read write performance. \"lz4\" shorthand \"lz4_frame\" codec. See codec_is_available() details. TRUE FALSE can also used place \"default\" \"uncompressed\". option supported V1. compression_level compression \"zstd\", may specify integer compression level. omitted, compression codec's default compression level used.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_feather.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write a Feather file (an Arrow IPC file) — write_feather","text":"input x, invisibly. Note sink OutputStream, stream left open.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/write_feather.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write a Feather file (an Arrow IPC file) — write_feather","text":"","code":"# We recommend the \".arrow\" extension for Arrow IPC files (Feather V2). tf1 <- tempfile(fileext = \".feather\") tf2 <- tempfile(fileext = \".arrow\") tf3 <- tempfile(fileext = \".arrow\") on.exit({ unlink(tf1) unlink(tf2) unlink(tf3) }) write_feather(mtcars, tf1, version = 1) write_feather(mtcars, tf2) write_ipc_file(mtcars, tf3)"},{"path":"https://arrow.apache.org/docs/r/reference/write_ipc_stream.html","id":null,"dir":"Reference","previous_headings":"","what":"Write Arrow IPC stream format — write_ipc_stream","title":"Write Arrow IPC stream format — write_ipc_stream","text":"Apache Arrow defines two formats serializing data interprocess communication (IPC): \"stream\" format \"file\" format, known Feather. 
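As an illustrative sketch (not from the original write_feather() page), the compression and compression_level arguments described above can be combined with codec_is_available() to fall back gracefully when a codec is not built in; the zstd level shown is an arbitrary example value:

library(arrow)
tf <- tempfile(fileext = ".arrow")
if (codec_is_available("zstd")) {
  # zstd generally trades slower reads/writes for better compression ratios
  write_feather(mtcars, tf, compression = "zstd", compression_level = 3)
} else {
  write_feather(mtcars, tf, compression = "uncompressed")
}
unlink(tf)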
write_ipc_stream() write_feather() write formats, respectively.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_ipc_stream.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write Arrow IPC stream format — write_ipc_stream","text":"","code":"write_ipc_stream(x, sink, ...)"},{"path":"https://arrow.apache.org/docs/r/reference/write_ipc_stream.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write Arrow IPC stream format — write_ipc_stream","text":"x data.frame, RecordBatch, Table sink string file path, connection, URI, OutputStream, path file system (SubTreeFileSystem) ... extra parameters passed write_feather().","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_ipc_stream.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write Arrow IPC stream format — write_ipc_stream","text":"x, invisibly.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/write_ipc_stream.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write Arrow IPC stream format — write_ipc_stream","text":"","code":"tf <- tempfile() on.exit(unlink(tf)) write_ipc_stream(mtcars, tf)"},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":null,"dir":"Reference","previous_headings":"","what":"Write Parquet file to disk — write_parquet","title":"Write Parquet file to disk — write_parquet","text":"Parquet columnar storage file format. function enables write Parquet files R.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write Parquet file to disk — write_parquet","text":"","code":"write_parquet( x, sink, chunk_size = NULL, version = \"2.4\", compression = default_parquet_compression(), compression_level = NULL, use_dictionary = NULL, write_statistics = NULL, data_page_size = NULL, use_deprecated_int96_timestamps = FALSE, coerce_timestamps = NULL, allow_truncated_timestamps = FALSE )"},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write Parquet file to disk — write_parquet","text":"x data.frame, RecordBatch, Table sink string file path, connection, URI, OutputStream, path file system (SubTreeFileSystem) chunk_size many rows data write disk . directly corresponds many rows row group parquet. NULL, best guess made optimal size (based number columns number rows), though data fewer 250 million cells (rows x cols), total number rows used. version parquet version: \"1.0\", \"2.0\" (deprecated), \"2.4\" (default), \"2.6\", \"latest\" (currently equivalent 2.6). Numeric values coerced character. compression compression algorithm. Default \"snappy\". See details. compression_level compression level. Meaning depends compression algorithm use_dictionary logical: use dictionary encoding? Default TRUE write_statistics logical: include statistics? Default TRUE data_page_size Set target threshold approximate encoded size data pages within column chunk (bytes). Default 1 MiB. use_deprecated_int96_timestamps logical: write timestamps INT96 Parquet format, deprecated? Default FALSE. coerce_timestamps Cast timestamps particular resolution. Can NULL, \"ms\" \"us\". Default NULL (casting) allow_truncated_timestamps logical: Allow loss data coercing timestamps particular resolution. E.g. 
microsecond nanosecond data lost coercing \"ms\", raise exception. Default FALSE.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write Parquet file to disk — write_parquet","text":"input x invisibly.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Write Parquet file to disk — write_parquet","text":"Due features format, Parquet files appended . want use Parquet format also want ability extend dataset, can write additional Parquet files treat whole directory files Dataset can query. See dataset article examples . parameters compression, compression_level, use_dictionary write_statistics support various patterns: default NULL leaves parameter unspecified, C++ library uses appropriate default column (defaults listed ) single, unnamed, value (e.g. single string compression) applies columns unnamed vector, size number columns, specify value column, positional order named vector, specify value named columns, default value setting used supplied compression argument can following (case-insensitive): \"uncompressed\", \"snappy\", \"gzip\", \"brotli\", \"zstd\", \"lz4\", \"lzo\" \"bz2\". \"uncompressed\" guaranteed available, \"snappy\" \"gzip\" almost always included. See codec_is_available(). default \"snappy\" used available, otherwise \"uncompressed\". disable compression, set compression = \"uncompressed\". Note \"uncompressed\" columns may still dictionary encoding.","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/reference/write_parquet.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write Parquet file to disk — write_parquet","text":"","code":"tf1 <- tempfile(fileext = \".parquet\") write_parquet(data.frame(x = 1:5), tf1) # using compression if (codec_is_available(\"gzip\")) { tf2 <- tempfile(fileext = \".gz.parquet\") write_parquet(data.frame(x = 1:5), tf2, compression = \"gzip\", compression_level = 5) }"},{"path":"https://arrow.apache.org/docs/r/reference/write_to_raw.html","id":null,"dir":"Reference","previous_headings":"","what":"Write Arrow data to a raw vector — write_to_raw","title":"Write Arrow data to a raw vector — write_to_raw","text":"write_ipc_stream() write_feather() write data sink return data (data.frame, RecordBatch, Table) given. 
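A minimal sketch (not part of the original write_parquet() page) of the workflow the Details section describes: since Parquet files cannot be appended to, write additional files into the same directory and query the whole directory as a Dataset. The directory layout and file names are illustrative:

library(arrow)
dir <- tempfile()
dir.create(dir)
write_parquet(data.frame(x = 1:5),  file.path(dir, "part-1.parquet"))
write_parquet(data.frame(x = 6:10), file.path(dir, "part-2.parquet"))
# Treat the directory of Parquet files as one queryable Dataset
ds <- open_dataset(dir)
nrow(ds)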
function wraps can serialize data buffer access buffer raw vector R.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_to_raw.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Write Arrow data to a raw vector — write_to_raw","text":"","code":"write_to_raw(x, format = c(\"stream\", \"file\"))"},{"path":"https://arrow.apache.org/docs/r/reference/write_to_raw.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Write Arrow data to a raw vector — write_to_raw","text":"x data.frame, RecordBatch, Table format one c(\"stream\", \"file\"), indicating IPC format use","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_to_raw.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Write Arrow data to a raw vector — write_to_raw","text":"raw vector containing bytes IPC serialized data.","code":""},{"path":"https://arrow.apache.org/docs/r/reference/write_to_raw.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Write Arrow data to a raw vector — write_to_raw","text":"","code":"# The default format is \"stream\" mtcars_raw <- write_to_raw(mtcars)"},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-16009000","dir":"Changelog","previous_headings":"","what":"arrow 16.0.0.9000","title":"arrow 16.0.0.9000","text":"R functions users write use functions Arrow supports dataset queries now can used queries . Previously, functions used arithmetic operators worked. example, time_hours <- function(mins) mins / 60 worked, time_hours_rounded <- function(mins) round(mins / 60) ; now work. automatic translations rather true user-defined functions (UDFs); UDFs, see register_scalar_function(). (#41223) summarize() supports complex expressions, correctly handles cases column names reused expressions. na_matches argument dplyr::*_join() functions now supported. argument controls whether NA values considered equal joining. (#41358)","code":""},{"path":[]},{"path":[]},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1501","dir":"Changelog","previous_headings":"","what":"arrow 15.0.1","title":"arrow 15.0.1","text":"CRAN release: 2024-03-12","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"new-features-15-0-0","dir":"Changelog","previous_headings":"","what":"New features","title":"arrow 15.0.0","text":"Bindings base::prod added can now use dplyr pipelines (.e., tbl |> summarize(prod(col))) without pull data R (@m-muecke, #38601). Calling dimnames colnames Dataset objects now returns useful result rather just NULL (#38377). code() method Schema objects now takes optional namespace argument , TRUE, prefixes names arrow:: makes output portable (@orgadish, #38144).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-15-0-0","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 15.0.0","text":"Don’t download cmake ARROW_OFFLINE_BUILD=true update SystemRequirements (#39602). Fallback source build gracefully binary download fails (#39587). error now thrown instead warning pulling data R sub, gsub, stringr::str_replace, stringr::str_replace_all passed length > 1 vector values pattern (@abfleishman, #39219). Missing documentation added ?open_dataset documenting use ND-JSON support added arrow 13.0.0 (@Divyansh200102, #38258). 
make debugging problems easier using arrow AWS S3 (e.g., s3_bucket, S3FileSystem), debug log level S3 can set AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem information. (#38267) Using arrow duckdb (.e., to_duckdb()) longer results warnings quitting R session. (#38495) large number minor spelling mistakes fixed (@jsoref, #38929, #38257) developer documentation updated match changes made recent releases (#38220)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-14021","dir":"Changelog","previous_headings":"","what":"arrow 14.0.2.1","title":"arrow 14.0.2.1","text":"CRAN release: 2024-02-23","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-14-0-2-1","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 14.0.2.1","text":"Check internet access building source fallback minimally scoped Arrow C++ build (#39699). Build source default macOS, use LIBARROW_BINARY=true old behavior (#39861). Support building older versions Arrow C++. currently opt-in (ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true) requires at least Arrow C++ 13.0.0 (#39739). Make possible use Arrow C++ Rtools Windows (future Rtools versions). (#39986).","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-14-0-2","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 14.0.2","text":"Fixed C++ compiler warnings caused implicit conversions (#39138, #39186). Fixed confusing dplyr warnings tests (#39076). Added missing “-framework Security” pkg-config flag prevent issues compiling strict linker settings (#38861).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-14002","dir":"Changelog","previous_headings":"","what":"arrow 14.0.0.2","title":"arrow 14.0.0.2","text":"CRAN release: 2023-12-02","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-14-0-0-2","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 14.0.0.2","text":"Fixed printf syntax align format checking (#38894) Removed bashism configure script (#38716). Fixed broken link README (#38657) Properly escape license header lintr config (#38639). Removed spurious warnings installation-script test suite (#38571). Polished installation-script refactor (#38534)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-14-0-0-2","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 14.0.0.2","text":"pkg-config fails detect required libraries additional search without pkg-config run (#38970). 
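A minimal sketch (not from the original changelog) of the AWS_S3_LOG_LEVEL environment variable mentioned in the arrow 15.0.0 notes above; the bucket name is a placeholder, and "DEBUG" is assumed to be one of the levels documented in ?S3FileSystem:

library(arrow)
# Turn on S3 debug logging before creating a filesystem object
Sys.setenv(AWS_S3_LOG_LEVEL = "DEBUG")
# bucket <- s3_bucket("my-example-bucket")  # placeholder bucket, needs credentials/network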
Fetch latest nightly Arrow C++ binary installing development Version (#38236).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-14001","dir":"Changelog","previous_headings":"","what":"arrow 14.0.0.1","title":"arrow 14.0.0.1","text":"CRAN release: 2023-11-24","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-14-0-0-1","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 14.0.0.1","text":"Add debug output build failures (#38819) Increase timeout static library download (#38767) Fix bug rosetta detection causing installation failure (#38754)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1400","dir":"Changelog","previous_headings":"","what":"arrow 14.0.0","title":"arrow 14.0.0","text":"CRAN release: 2023-11-16","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"new-features-14-0-0","dir":"Changelog","previous_headings":"","what":"New features","title":"arrow 14.0.0","text":"reading partitioned CSV datasets supplying schema open_dataset(), partition variables now included resulting dataset (#37658). New function write_csv_dataset() now wraps write_dataset() mirrors syntax write_csv_arrow() (@dgreiss, #36436). open_delim_dataset() now accepts quoted_na argument empty strings parsed NA values (#37828). schema() can now called data.frame objects retrieve inferred Arrow schema (#37843). CSVs comma character decimal mark can now read dataset reading functions new function read_csv2_arrow() (#38002).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-14-0-0","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 14.0.0","text":"Documentation CsvParseOptions object creation now contains information default values (@angela-li, #37909). Fixed code path may resulted R code called non-R thread failed allocation (#37565). Fixed bug large Parquet files read R connections (#37274). Bindings stringr helpers (e.g., fixed(), regex() etc.) now allow variables reliably used arguments (#36784). Thrift string container size limits can now configured via newly exposed ParquetReaderProperties, allowing users work Parquet files unusually large metadata (#36992). Error messages resulting use add_filename() improved (@amoeba, #37372).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-14-0-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 14.0.0","text":"macOS builds now use installation pathway Linux (@assignUser, #37684). warning message now issued package load running emulation macOS (.e., use x86 installation R M1/aarch64; #37777). R scripts run configuration installation now run using correct R interpreter (@meztez, #37225). Failed libarrow builds now return detailed output (@amoeba, #37727). 
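An illustrative sketch (not from the original changelog) of the arrow 14.0.0 feature above that lets schema() be called on a data.frame to retrieve the inferred Arrow schema:

library(arrow)
# Prints the field names with their inferred Arrow types (int32, double, string)
schema(data.frame(x = 1L, y = 2.5, z = "a"))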
create_package_with_all_dependencies() now properly escapes paths Windows (#37226).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-13001","dir":"Changelog","previous_headings":"","what":"arrow 13.0.0.1","title":"arrow 13.0.0.1","text":"CRAN release: 2023-09-22 Remove reference legacy timezones prevent CRAN check failures (#37671)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1300","dir":"Changelog","previous_headings":"","what":"arrow 13.0.0","title":"arrow 13.0.0","text":"CRAN release: 2023-08-30","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"breaking-changes-13-0-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"arrow 13.0.0","text":"Input objects inherit data.frame classes now class attribute dropped, resulting now always returning tibbles file reading functions arrow_table(), results consistency type returned objects. Calling .data.frame() Arrow Tabular objects now always returns data.frame object (#34775)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"new-features-13-0-0","dir":"Changelog","previous_headings":"","what":"New features","title":"arrow 13.0.0","text":"open_dataset() now works ND-JSON files (#35055) Calling schema() multiple Arrow objects now returns object’s schema (#35543) dplyr ./argument now supported arrow implementation dplyr verbs (@eitsupi, #35667) Binding dplyr::case_when() now accepts .default parameter match update dplyr 1.1.0 (#35502)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-13-0-0","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 13.0.0","text":"Convenience function arrow_array() can used create Arrow Arrays (#36381) Convenience function scalar() can used create Arrow Scalars (#36265) Prevent crashes passing data arrow duckdb always calling RecordBatchReader::ReadNext() DuckDB main R thread (#36307) Issue warning set_io_thread_count() num_threads < 2 (#36304) Ensure missing grouping variables added beginning variable list (#36305) CSV File reader options class objects can print selected values (#35955) Schema metadata can set named character vector (#35954) Ensure RStringViewer helper class Array references (#35812) strptime() arrow return timezone-aware timestamp %z part format string (#35671) Column ordering combining group_by() across() now matches dplyr (@eitsupi, #35473)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-13-0-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 13.0.0","text":"Link correct version OpenSSL using autobrew (#36551) Require cmake 3.16 bundled build script (#36321)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"docs-13-0-0","dir":"Changelog","previous_headings":"","what":"Docs","title":"arrow 13.0.0","text":"Split R6 classes convenience functions improve readability (#36394) Enable pkgdown built-in search (@eitsupi, #36374) Re-organise reference page pkgdown site improve readability (#36171)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-12011","dir":"Changelog","previous_headings":"","what":"arrow 12.0.1.1","title":"arrow 12.0.1.1","text":"CRAN release: 2023-07-18 Update package version reference text instead numeric due CRAN update requiring (#36353, 
#36364)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1201","dir":"Changelog","previous_headings":"","what":"arrow 12.0.1","title":"arrow 12.0.1","text":"CRAN release: 2023-06-15 Update version date library vendored Arrow C++ library compatibility tzdb 0.4.0 (#35594, #35612). Update tests compatibility waldo 0.5.1 (#35131, #35308).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1200","dir":"Changelog","previous_headings":"","what":"arrow 12.0.0","title":"arrow 12.0.0","text":"CRAN release: 2023-05-05","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"new-features-12-0-0","dir":"Changelog","previous_headings":"","what":"New features","title":"arrow 12.0.0","text":"read_parquet() read_feather() functions can now accept URL arguments (#33287, #34708). json_credentials argument GcsFileSystem$create() now accepts file path containing appropriate authentication token (@amoeba, #34421, #34524). $options member GcsFileSystem objects can now inspected (@amoeba, #34422, #34477). read_csv_arrow() read_json_arrow() functions now accept literal text input wrapped I() improve compatibility readr::read_csv() (@eitsupi, #18487, #33968). Nested fields can now accessed using $ [[ dplyr expressions (#18818, #19706).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-12-0-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 12.0.0","text":"binaries now built CentOS 7 (#32292, #34048).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-12-0-0","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 12.0.0","text":"Fix crash occurred process exit related finalizing S3 filesystem component (#15054, #33858). Implement Arrow C++ FetchNode OrderByNode improve performance simplify building query plans dplyr expressions (#34437, #34685). Fix bug different R metadata written depending subtle argument passing semantics arrow_table() (#35038, #35039). Improve error message attempting convert data.frame NULL column names Table (#15247, #34798). Vignettes updated reflect improvements open_csv_dataset() family functions (#33998, #34710). Fixed crash occurred arrow ALTREP vectors materialized converted back arrow Arrays (#34211, #34489). Improved conda install instructions (#32512, #34398). Improved documentation URL configurations (@eitsupi, #34276). Updated links JIRA issues migrated GitHub (@eitsupi, #33631, #34260). dplyr::n() function now mapped count_all kernel improve performance simplify R implementation (#33892, #33917). Improved experience using s3_bucket() filesystem helper endpoint_override fixed surprising behaviour occurred passing combinations arguments (@cboettig, #33904, #34009). raise error schema supplied col_names = TRUE open_csv_dataset() (#34217, #34092).","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-11003","dir":"Changelog","previous_headings":"","what":"arrow 11.0.0.3","title":"arrow 11.0.0.3","text":"CRAN release: 2023-03-08","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-11-0-0-3","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 11.0.0.3","text":"open_csv_dataset() allows schema specified. 
(#34217) ensure compatibility upcoming dplyr release, longer call dplyr:::check_names() (#34369)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-11002","dir":"Changelog","previous_headings":"","what":"arrow 11.0.0.2","title":"arrow 11.0.0.2","text":"CRAN release: 2023-02-12","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"breaking-changes-11-0-0-2","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"arrow 11.0.0.2","text":"map_batches() lazy default; now returns RecordBatchReader instead list RecordBatch objects unless lazy = FALSE. (#14521)","code":""},{"path":[]},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"docs-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"Docs","title":"arrow 11.0.0.2","text":"substantial reorganisation, rewrite addition , many vignettes README. (@djnavarro, #14514)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"readingwriting-data-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"Reading/writing data","title":"arrow 11.0.0.2","text":"New functions open_csv_dataset(), open_tsv_dataset(), open_delim_dataset() wrap open_dataset()- don’t provide new functionality, allow readr-style options supplied, making simpler switch individual file-reading dataset functionality. (#33614) User-defined null values can set writing CSVs datasets individual files. (@wjones127, #14679) new col_names parameter allows specification column names opening CSV dataset. (@wjones127, #14705) parse_options, read_options, convert_options parameters reading individual files (read_*_arrow() functions) datasets (open_dataset() new open_*_dataset() functions) can passed lists. (#15270) File paths containing accents can read read_csv_arrow(). (#14930)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"dplyr-compatibility-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"dplyr compatibility","title":"arrow 11.0.0.2","text":"New dplyr (1.1.0) function join_by() implemented dplyr joins Arrow objects (equality conditions ). (#33664) Output accurate multiple dplyr::group_by()/dplyr::summarise() calls used. (#14905) dplyr::summarize() works division divisor variable. (#14933) dplyr::right_join() correctly coalesces keys. (#15077) Multiple changes ensure compatibility dplyr 1.1.0. (@lionel-, #14948)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"function-bindings-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"Function bindings","title":"arrow 11.0.0.2","text":"lubridate::with_tz() lubridate::force_tz() (@eitsupi, #14093) stringr::str_remove() stringr::str_remove_all() (#14644)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-object-creation-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"Arrow object creation","title":"arrow 11.0.0.2","text":"Arrow Scalars can created POSIXlt objects. (#15277) Array$create() can create Decimal arrays. (#15211) StructArray$create() can used create StructArray objects. (#14922) Creating Array object bigger 2^31 correct length (#14929)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-11-0-0-2","dir":"Changelog","previous_headings":"New features","what":"Installation","title":"arrow 11.0.0.2","text":"Improved offline installation using pre-downloaded binaries. 
(@pgramme, #14086) package can automatically link system installations AWS SDK C++. (@kou, #14235)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"minor-improvements-and-fixes-11-0-0-2","dir":"Changelog","previous_headings":"","what":"Minor improvements and fixes","title":"arrow 11.0.0.2","text":"Calling lubridate::as_datetime() Arrow objects can handle time sub-seconds. (@eitsupi, #13890) head() can called as_record_batch_reader(). (#14518) .Date() can go timestamp[us] timestamp[s]. (#14935) curl timeout policy can configured S3. (#15166) rlang dependency must least version 1.0.0 check_dots_empty(). (@daattali, #14744)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1001","dir":"Changelog","previous_headings":"","what":"arrow 10.0.1","title":"arrow 10.0.1","text":"CRAN release: 2022-12-06 Minor improvements fixes: Fixes failing test lubridate 1.9 release (#14615) Update ensure compatibility changes dev purrr (#14581) Fix correctly handle .data pronoun dplyr::group_by() (#14484)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-1000","dir":"Changelog","previous_headings":"","what":"arrow 10.0.0","title":"arrow 10.0.0","text":"CRAN release: 2022-10-26","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-dplyr-queries-10-0-0","dir":"Changelog","previous_headings":"","what":"Arrow dplyr queries","title":"arrow 10.0.0","text":"Several new functions can used queries: dplyr::across() can used apply computation across multiple columns, () selection helper supported across(); add_filename() can used get filename row came (available querying ?Dataset); Added five functions slice_* family: dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_sample(). package now documentation lists dplyr methods R function mappings supported Arrow data, along notes differences functionality queries evaluated R versus Acero, Arrow query engine. See ?acero. new features bugfixes implemented joins: Extension arrays now supported joins, allowing, example, joining datasets contain geoarrow data. keep argument now supported, allowing separate columns left right hand side join keys join output. Full joins now coalesce join keys (keep = FALSE), avoiding issue join keys NA rows right hand side without matches left. changes improve consistency API: future release, calling dplyr::pull() return ?ChunkedArray instead R vector default. current default behavior deprecated. update new behavior now, specify pull(as_vector = FALSE) set options(arrow.pull_as_vector = FALSE) globally. Calling dplyr::compute() query grouped returns ?Table instead query object. Finally, long-running queries can now cancelled abort computation immediately.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrays-and-tables-10-0-0","dir":"Changelog","previous_headings":"","what":"Arrays and tables","title":"arrow 10.0.0","text":"as_arrow_array() can now take blob::blob ?vctrs::list_of, convert binary list arrays, respectively. Also fixed issue as_arrow_array() ignored type argument passed StructArray. unique() function works ?Table, ?RecordBatch, ?Dataset, ?RecordBatchReader.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"reading-and-writing-10-0-0","dir":"Changelog","previous_headings":"","what":"Reading and writing","title":"arrow 10.0.0","text":"write_feather() can take compression = FALSE choose writing uncompressed files. 
Also, breaking change IPC files write_dataset(): passing \"ipc\" \"feather\" format now write files .arrow extension instead .ipc .feather.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-10-0-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 10.0.0","text":"version 10.0.0, arrow requires C++17 build. means : Windows, need R >= 4.0. Version 9.0.0 last version support R 3.6. CentOS 7, can build latest version arrow, first need install newer compiler default system compiler, gcc 4.8. See vignette(\"install\", package = \"arrow\") guidance. Note need newer compiler build arrow: installing binary package, RStudio Package Manager, loading package ’ve already installed works fine system defaults.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-900","dir":"Changelog","previous_headings":"","what":"arrow 9.0.0","title":"arrow 9.0.0","text":"CRAN release: 2022-08-10","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-dplyr-queries-9-0-0","dir":"Changelog","previous_headings":"","what":"Arrow dplyr queries","title":"arrow 9.0.0","text":"dplyr::union dplyr::union_all (#13090) dplyr::glimpse (#13563) show_exec_plan() can added end dplyr pipeline show underlying plan, similar dplyr::show_query(). dplyr::show_query() dplyr::explain() also work show output, may change future. (#13541) User-defined functions supported queries. Use register_scalar_function() create . (#13397) map_batches() returns RecordBatchReader requires function maps returns something coercible RecordBatch as_record_batch() S3 function. can also run streaming fashion passed .lazy = TRUE. (#13170, #13650) Functions can called package namespace prefixes (e.g. stringr::, lubridate::) within queries. example, stringr::str_length now dispatch kernel str_length. (#13160) orders year, month, day, hours, minutes, seconds components supported. orders argument Arrow binding works follows: orders transformed formats subsequently get applied turn. select_formats parameter inference takes place (like case lubridate::parse_date_time()). lubridate date datetime parsers lubridate::ymd(), lubridate::yq(), lubridate::ymd_hms() (#13118, #13163, #13627) lubridate::fast_strptime() (#13174) lubridate::floor_date(), lubridate::ceiling_date(), lubridate::round_date() (#12154) strptime() supports tz argument pass timezones. (#13190) lubridate::qday() (day quarter) exp() sqrt(). (#13517) Count distinct now gives correct result across multiple row groups. (#13583) Aggregations partition columns return correct results. (#13518)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"reading-and-writing-9-0-0","dir":"Changelog","previous_headings":"","what":"Reading and writing","title":"arrow 9.0.0","text":"New functions read_ipc_file() write_ipc_file() added. functions almost read_feather() write_feather(), differ target IPC files (Feather V2 files), Feather V1 files. read_arrow() write_arrow(), deprecated since 1.0.0 (July 2020), removed. Instead , use read_ipc_file() write_ipc_file() IPC files, , read_ipc_stream() write_ipc_stream() IPC streams. (#13550) write_parquet() now defaults writing Parquet format version 2.4 (1.0). Previously deprecated arguments properties arrow_properties removed; need deal lower-level properties objects directly, use ParquetFileWriter, write_parquet() wraps. (#13555) UnionDatasets can unify schemas multiple InMemoryDatasets varying schemas. 
(#13088) write_dataset() preserves schema metadata . 8.0.0, drop metadata, breaking packages sfarrow. (#13105) Reading writing functions (write_csv_arrow()) automatically (de-)compress data file path contains compression extension (e.g. \"data.csv.gz\"). works locally well remote filesystems like S3 GCS. (#13183) FileSystemFactoryOptions can provided open_dataset(), allowing pass options file prefixes ignore. (#13171) default, S3FileSystem create delete buckets. enable , pass configuration option allow_bucket_creation allow_bucket_deletion. (#13206) GcsFileSystem gs_bucket() allow connecting Google Cloud Storage. (#10999, #13601)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrays-and-tables-9-0-0","dir":"Changelog","previous_headings":"","what":"Arrays and tables","title":"arrow 9.0.0","text":"Table RecordBatch $num_rows() method returns double (previously integer), avoiding integer overflow larger tables. (#13482, #13514)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"packaging-9-0-0","dir":"Changelog","previous_headings":"","what":"Packaging","title":"arrow 9.0.0","text":"arrow.dev_repo nightly builds R package prebuilt libarrow binaries now https://nightlies.apache.org/arrow/r/. Brotli BZ2 shipped macOS binaries. BZ2 shipped Windows binaries. (#13484)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-800","dir":"Changelog","previous_headings":"","what":"arrow 8.0.0","title":"arrow 8.0.0","text":"CRAN release: 2022-05-09","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"enhancements-to-dplyr-and-datasets-8-0-0","dir":"Changelog","previous_headings":"","what":"Enhancements to dplyr and datasets","title":"arrow 8.0.0","text":"correctly supports skip argument skipping header rows CSV datasets. can take list datasets differing schemas attempt unify schemas produce UnionDataset. supported RecordBatchReader. allows, example, results DuckDB streamed back Arrow rather materialized continuing pipeline. longer need materialize entire result table writing dataset query contains aggregations joins. supports dplyr::rename_with(). dplyr::count() returns ungrouped dataframe. write_dataset() options controlling row group file sizes writing partitioned datasets, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group. write_csv_arrow() accepts Dataset Arrow dplyr query. Joining one datasets option(use_threads = FALSE) longer crashes R. option set default Windows. dplyr joins support suffix argument handle overlap column names. Filtering Parquet dataset .na() longer misses rows. map_batches() correctly accepts Dataset objects.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"enhancements-to-date-and-time-support-8-0-0","dir":"Changelog","previous_headings":"","what":"Enhancements to date and time support","title":"arrow 8.0.0","text":"read_csv_arrow()’s readr-style type T mapped timestamp(unit = \"ns\") instead timestamp(unit = \"s\"). lubridate::tz() (timezone), lubridate::semester(), lubridate::dst() (daylight savings time boolean), lubridate::date(), lubridate::epiyear() (year according epidemiological week calendar), lubridate::month() works integer inputs. lubridate::make_date() & lubridate::make_datetime() + base::ISOdatetime() & base::ISOdate() create date-times numeric representations. 
lubridate::decimal_date() lubridate::date_decimal() lubridate::make_difftime() (duration constructor) ?lubridate::duration helper functions, lubridate::dyears(), lubridate::dhours(), lubridate::dseconds(). lubridate::leap_year() lubridate::as_date() lubridate::as_datetime() base::difftime base::.difftime() base::.Date() convert date Arrow timestamp date arrays support base::format() strptime() returns NA instead erroring case format mismatch, just like base::strptime(). Timezone operations supported Windows tzdb package also installed.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"extensibility-8-0-0","dir":"Changelog","previous_headings":"","what":"Extensibility","title":"arrow 8.0.0","text":"Added S3 generic conversion functions as_arrow_array() as_arrow_table() main Arrow objects. includes, Arrow tables, record batches, arrays, chunked arrays, record batch readers, schemas, data types. allows packages define custom conversions types Arrow objects, including extension arrays. Custom extension types arrays can created registered, allowing packages define array types. Extension arrays wrap regular Arrow array types provide customized behavior /storage. See description example ?new_extension_type. Implemented generic extension type as_arrow_array() methods objects vctrs::vec_is() returns TRUE (.e., object can used column tibble::tibble()), provided underlying vctrs::vec_data() can converted Arrow Array.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"concatenation-support-8-0-0","dir":"Changelog","previous_headings":"","what":"Concatenation Support","title":"arrow 8.0.0","text":"Arrow arrays tables can easily concatenated: Arrays can concatenated concat_arrays() , zero-copy desired chunking acceptable, using ChunkedArray$create(). ChunkedArrays can concatenated c(). RecordBatches Tables support cbind(). Tables support rbind(). concat_tables() also provided concatenate tables unifying schemas.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-improvements-and-fixes-8-0-0","dir":"Changelog","previous_headings":"","what":"Other improvements and fixes","title":"arrow 8.0.0","text":"Dictionary arrays support using ALTREP converting R factors. Math group generics implemented ArrowDatum. means can use base functions like sqrt(), log(), exp() Arrow arrays scalars. read_* write_* functions support R Connection objects reading writing files. Parquet writer supports Duration type columns. dataset Parquet reader consumes less memory. median() quantile() warn approximate calculations regardless interactivity. Array$cast() can cast StructArrays another struct type field names structure (subset fields) different field types. Removed special handling Solaris. CSV writer much faster writing string columns. Fixed issue set_io_thread_count() set CPU count instead IO thread count. RandomAccessFile $ReadMetadata() method provides useful metadata provided filesystem. grepl binding returns FALSE NA inputs (previously returned NA), match behavior base::grepl(). 
create_package_with_all_dependencies() works Windows Mac OS, instead Linux.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-700","dir":"Changelog","previous_headings":"","what":"arrow 7.0.0","title":"arrow 7.0.0","text":"CRAN release: 2022-02-10","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"enhancements-to-dplyr-and-datasets-7-0-0","dir":"Changelog","previous_headings":"","what":"Enhancements to dplyr and datasets","title":"arrow 7.0.0","text":"Additional lubridate features: week(), .*() functions, label argument month() implemented. complex expressions inside summarize(), ifelse(n() > 1, mean(y), mean(z)), supported. adding columns dplyr pipeline, one can now use tibble data.frame create columns tibbles data.frames respectively (e.g. ... %>% mutate(df_col = tibble(, b)) %>% ...). Dictionary columns (R factor type) supported inside coalesce(). open_dataset() accepts partitioning argument reading Hive-style partitioned files, even though required. experimental map_batches() function custom operations dataset restored.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"csv-7-0-0","dir":"Changelog","previous_headings":"","what":"CSV","title":"arrow 7.0.0","text":"Delimited files (including CSVs) encodings UTF can now read (using encoding argument reading). open_dataset() correctly ignores byte-order marks (BOMs) CSVs, already true reading single files Reading dataset internally uses asynchronous scanner default, resolves potential deadlock reading large CSV datasets. head() longer hangs large CSV datasets. improved error message conflict header file schema/column names provided arguments. write_csv_arrow() now follows signature readr::write_csv().","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-improvements-and-fixes-7-0-0","dir":"Changelog","previous_headings":"","what":"Other improvements and fixes","title":"arrow 7.0.0","text":"Many vignettes reorganized, restructured expanded improve usefulness clarity. Code generate schemas (individual data type specifications) accessible $code() method schema type. allows easily get code needed create schema object already one. Arrow Duration type mapped R’s difftime class. decimal256() type supported. decimal() function revised call either decimal256() decimal128() based value precision argument. write_parquet() uses reasonable guess chunk_size instead always writing single chunk. improves speed reading writing large Parquet files. write_parquet() longer drops attributes grouped data.frames. Chunked arrays now supported using ALTREP. ALTREP vectors backed Arrow arrays longer unexpectedly mutated sorting negation. S3 file systems can created proxy_options. segfault creating S3 file systems fixed. Integer division Arrow closely matches R’s behavior.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-7-0-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 7.0.0","text":"Source builds now default use pkg-config search system dependencies (libz) link present. new default make building Arrow source quicker systems dependencies installed already. retain previous behavior downloading building dependencies, set ARROW_DEPENDENCY_SOURCE=BUNDLED. Snappy lz4 compression libraries enabled default Linux builds. means default build Arrow, without setting environment variables, able read write snappy encoded Parquet files. Windows binary packages include brotli compression support. 
Building Arrow Windows can find locally built libarrow library. package compiles installs Raspberry Pi OS.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"under-the-hood-changes-7-0-0","dir":"Changelog","previous_headings":"","what":"Under-the-hood changes","title":"arrow 7.0.0","text":"pointers used pass data R Python made reliable. Backwards compatibility older versions pyarrow maintained. internal method registering new bindings use dplyr queries changed. See new vignette writing bindings information works. R 3.3 longer supported. glue, arrow depends transitively, dropped support .","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-601","dir":"Changelog","previous_headings":"","what":"arrow 6.0.1","title":"arrow 6.0.1","text":"CRAN release: 2021-11-20 Joins now support inclusion dictionary columns, multiple crashes fixed Grouped aggregation longer crashes working data filtered 0 rows Bindings added str_count() dplyr queries Work around critical bug AWS SDK C++ affect S3 multipart upload UBSAN warning round kernel resolved Fixes build failures Solaris old versions macOS","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-600","dir":"Changelog","previous_headings":"","what":"arrow 6.0.0","title":"arrow 6.0.0","text":"now two ways query Arrow data:","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"1-expanded-arrow-native-queries-aggregation-and-joins-6-0-0","dir":"Changelog","previous_headings":"","what":"1. Expanded Arrow-native queries: aggregation and joins","title":"arrow 6.0.0","text":"dplyr::summarize(), grouped ungrouped, now implemented Arrow Datasets, Tables, RecordBatches. data scanned chunks, can aggregate larger--memory datasets backed many files. Supported aggregation functions include n(), n_distinct(), min(), max(), sum(), mean(), var(), sd(), (), (). median() quantile() one probability also supported currently return approximate results using t-digest algorithm. Along summarize(), can also call count(), tally(), distinct(), effectively wrap summarize(). enhancement change behavior summarize() collect() cases: see “Breaking changes” details. addition summarize(), mutating filtering equality joins (inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join()) also supported natively Arrow. Grouped aggregation (especially) joins considered somewhat experimental release. expect work, may well optimized workloads. help us focus efforts improving next release, please let us know encounter unexpected behavior poor performance. New non-aggregating compute functions include string functions like str_to_title() strftime() well compute functions extracting date parts (e.g. year(), month()) dates. complete list additional compute functions; exhaustive list available compute functions see list_compute_functions(). ’ve also worked fill support data types, Decimal, functions added previous releases. type limitations mentioned previous release notes longer valid, find function implemented certain data type, please report issue.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"2-duckdb-integration-6-0-0","dir":"Changelog","previous_headings":"","what":"2. DuckDB integration","title":"arrow 6.0.0","text":"duckdb package installed, can hand Arrow Dataset query object DuckDB querying using to_duckdb() function. allows use duckdb’s dbplyr methods, well SQL interface, aggregate data. 
Filtering column projection done to_duckdb() evaluated Arrow, duckdb can push predicates Arrow well. handoff copy data, instead uses Arrow’s C-interface (just like passing arrow data R Python). means serialization data copying costs incurred. can also take duckdb tbl call to_arrow() stream data Arrow’s query engine. means single dplyr pipeline, start Arrow Dataset, evaluate steps DuckDB, evaluate rest Arrow.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"breaking-changes-6-0-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"arrow 6.0.0","text":"Row order data Dataset query longer deterministic. need stable sort order, explicitly arrange() query result. calls summarize(), can set options(arrow.summarise.sort = TRUE) match current dplyr behavior sorting grouping columns. dplyr::summarize() -memory Arrow Table RecordBatch longer eagerly evaluates. Call compute() collect() evaluate query. head() tail() also longer eagerly evaluate, -memory data Datasets. Also, row order longer deterministic, effectively give random slice data somewhere dataset unless arrange() specify sorting. Simple Feature (SF) columns longer save metadata converting Arrow tables (thus saving Parquet Feather). also includes dataframe column attributes element (words: row-level metadata). previous approach saving metadata (computationally) inefficient unreliable Arrow queries + datasets. impact saving SF columns. saving columns recommend either converting columns well-known binary representations (using sf::st_as_binary(col)) using sfarrow package handles intricacies conversion process. plans improve re-enable custom metadata like future can implement saving safe efficient way. need preserve pre-6.0.0 behavior saving metadata, can set options(arrow.preserve_row_level_metadata = TRUE). removing option coming release. strongly recommend avoiding using workaround possible since results supported future can lead surprising inaccurate results. run custom class besides sf columns impacted please report issue. Datasets officially longer supported 32-bit Windows R < 4.0 (Rtools 3.5). 32-bit Windows users upgrade newer version R order use datasets.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-on-linux-6-0-0","dir":"Changelog","previous_headings":"","what":"Installation on Linux","title":"arrow 6.0.0","text":"Package installation now fails Arrow C++ library compile. previous versions, C++ library failed compile, get successful R package installation wouldn’t much useful. can disable optional C++ components building source setting environment variable LIBARROW_MINIMAL=true. core Arrow/Feather components excludes Parquet, Datasets, compression libraries, optional features. Source packages now bundle Arrow C++ source code, downloaded order build package. source included, now possible build package offline/airgapped system. default, offline build minimal download third-party C++ dependencies required support features. allow fully featured offline build, included create_package_with_all_dependencies() function (also available GitHub without installing arrow package) download third-party C++ dependencies bundle inside R source package. Run function system connected network produce “fat” source package, copy .tar.gz package offline machine install. Special thanks @karldw huge amount work . Source builds can make use system dependencies (libz) setting ARROW_DEPENDENCY_SOURCE=AUTO. default release (BUNDLED, .e. 
download build dependencies) may become default future. JSON library components (read_json_arrow()) now optional still default; set ARROW_JSON=building disable .","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-enhancements-and-fixes-6-0-0","dir":"Changelog","previous_headings":"","what":"Other enhancements and fixes","title":"arrow 6.0.0","text":"Arrow data types use ALTREP converting R. speeds workflows significantly, others merely delays conversion Arrow R. ALTREP used default, disable , set options(arrow.use_altrep = FALSE) Field objects can now created non-nullable, schema() now optionally accepts list Fields Numeric division zero now matches R’s behavior longer raises error write_parquet() longer errors used grouped data.frame case_when() now errors cleanly expression supported Arrow open_dataset() now works CSVs without header rows Fixed minor issue short readr-style types T t reversed read_csv_arrow() Bindings log(..., base = b) b something 2, e, 10 number updates expansions vignettes Fix segfaults converting length-0 ChunkedArrays R vectors Table$create() now alias arrow_table()","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"internals-6-0-0","dir":"Changelog","previous_headings":"","what":"Internals","title":"arrow 6.0.0","text":"now use testthat 3rd edition default number large test reorganizations Style changes conform tidyverse style guide + using lintr","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-5002","dir":"Changelog","previous_headings":"","what":"arrow 5.0.0.2","title":"arrow 5.0.0.2","text":"CRAN release: 2021-09-05 patch version contains fixes sanitizer compiler warnings.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-500","dir":"Changelog","previous_headings":"","what":"arrow 5.0.0","title":"arrow 5.0.0","text":"CRAN release: 2021-07-29","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"more-dplyr-5-0-0","dir":"Changelog","previous_headings":"","what":"More dplyr","title":"arrow 5.0.0","text":"now 250 compute functions available use dplyr::filter(), mutate(), etc. Additions release include: String operations: strsplit() str_split(); strptime(); paste(), paste0(), str_c(); substr() str_sub(); str_like(); str_pad(); stri_reverse() Date/time operations: lubridate methods year(), month(), wday(), Math: logarithms (log() et al.); trigonometry (sin(), cos(), et al.); abs(); sign(); pmin() pmax(); ceiling(), floor(), trunc() Conditional functions, limitations input type release: ifelse() if_else() Decimal types; case_when() logical, numeric, temporal types ; coalesce() lists/structs. Note also release, factors/dictionaries converted strings functions. .* functions supported can used inside relocate() print method arrow_dplyr_query now includes expression resulting type columns derived mutate(). transmute() now errors passed arguments .keep, ., ., consistency behavior dplyr data.frames.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"csv-writing-5-0-0","dir":"Changelog","previous_headings":"","what":"CSV writing","title":"arrow 5.0.0","text":"write_csv_arrow() use Arrow write data.frame single CSV file write_dataset(format = \"csv\", ...) 
write Dataset CSVs, including partitioning","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"c-interface-5-0-0","dir":"Changelog","previous_headings":"","what":"C interface","title":"arrow 5.0.0","text":"Added bindings remainder C data interface: Type, Field, RecordBatchReader (experimental C stream interface). also reticulate::py_to_r() r_to_py() methods. Along addition Scanner$ToRecordBatchReader() method, can now build Dataset query R pass resulting stream batches another tool process. C interface methods exposed Arrow objects (e.g. Array$export_to_c(), RecordBatch$import_from_c()), similar pyarrow. facilitates use packages. See py_to_r() r_to_py() methods usage examples.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-enhancements-5-0-0","dir":"Changelog","previous_headings":"","what":"Other enhancements","title":"arrow 5.0.0","text":"Converting R data.frame Arrow Table uses multithreading across columns Arrow array types now use ALTREP converting R. disable , set options(arrow.use_altrep = FALSE) .na() now evaluates TRUE NaN values floating point number fields, consistency base R. .nan() now evaluates FALSE NA values floating point number fields FALSE values non-floating point fields, consistency base R. Additional methods Array, ChunkedArray, RecordBatch, Table: na.omit() friends, ()/() Scalar inputs RecordBatch$create() Table$create() recycled arrow_info() includes details C++ build, compiler version match_arrow() now converts x Array Scalar, Array ChunkedArray longer dispatches base::match(). Row-level metadata now restricted reading/writing single parquet feather files. Row-level metadata datasets ignored (warning) dataset contains row-level metadata. Writing dataset row-level metadata also ignored (warning). working robust implementation support row-level metadata (complex types) — stay tuned. working {sf} objects, {sfarrow} helpful serializing sf columns sharing geopandas.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-401","dir":"Changelog","previous_headings":"","what":"arrow 4.0.1","title":"arrow 4.0.1","text":"CRAN release: 2021-05-28 Resolved bugs new string compute kernels (#10320, #10287)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-4001","dir":"Changelog","previous_headings":"","what":"arrow 4.0.0.1","title":"arrow 4.0.0.1","text":"CRAN release: 2021-05-10 mimalloc memory allocator default memory allocator using static source build package Linux. better behavior valgrind jemalloc . full-featured build (installed LIBARROW_MINIMAL=false) includes jemalloc mimalloc, still jemalloc default, though configurable runtime ARROW_DEFAULT_MEMORY_POOL environment variable. Environment variables LIBARROW_MINIMAL, LIBARROW_DOWNLOAD, NOT_CRAN now case-insensitive Linux build script. build configuration issue macOS binary package resolved.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-400","dir":"Changelog","previous_headings":"","what":"arrow 4.0.0","title":"arrow 4.0.0","text":"CRAN release: 2021-04-27","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"dplyr-methods-4-0-0","dir":"Changelog","previous_headings":"","what":"dplyr methods","title":"arrow 4.0.0","text":"Many dplyr verbs supported Arrow objects: dplyr::mutate() now supported Arrow many applications. queries Table RecordBatch yet supported Arrow, implementation falls back pulling data -memory R data.frame first, previous release. 
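A short sketch of the CSV-writing functions added in 5.0.0; the data frame and output paths are placeholders.

```r
library(arrow)

sales <- data.frame(
  region = c("north", "south", "north"),
  amount = c(10.5, 3.2, 7.7)
)

# Write a single CSV file with Arrow
write_csv_arrow(sales, "sales.csv")

# Write a partitioned Dataset of CSVs: one subdirectory per region value
write_dataset(sales, "sales_by_region", format = "csv", partitioning = "region")
```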
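And a sketch of building a Dataset query in R and streaming the result batch by batch with Scanner$ToRecordBatchReader(), per the 5.0.0 C-interface notes above. Passing a dplyr query to Scanner$create() and looping with read_next_batch() is assumed to behave as in recent arrow releases; paths and columns are placeholders.

```r
library(arrow)
library(dplyr)

ds <- open_dataset("data/sales")               # hypothetical dataset

# Build a scanner over a filtered projection of the dataset
query   <- ds %>% filter(amount > 100) %>% select(region, amount)
scanner <- Scanner$create(query)

# Stream the result one RecordBatch at a time instead of materializing it all
reader <- scanner$ToRecordBatchReader()
batch  <- reader$read_next_batch()
while (!is.null(batch)) {
  # hand each batch to another tool here (e.g. via the C stream interface)
  batch <- reader$read_next_batch()
}
```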
queries Dataset (can larger memory), raises error function implemented. main mutate() features yet called Arrow objects (1) mutate() group_by() (typically used combination aggregation) (2) queries use dplyr::across(). dplyr::transmute() (calls mutate()) dplyr::group_by() now preserves .drop argument supports --fly definition columns dplyr::relocate() reorder columns dplyr::arrange() sort rows dplyr::compute() evaluate lazy expressions return Arrow Table. equivalent dplyr::collect(as_data_frame = FALSE), added 2.0.0. 100 functions can now called Arrow objects inside dplyr verb: String functions nchar(), tolower(), toupper(), along stringr spellings str_length(), str_to_lower(), str_to_upper(), supported Arrow dplyr calls. str_trim() also supported. Regular expression functions sub(), gsub(), grepl(), along str_replace(), str_replace_all(), str_detect(), supported. cast(x, type) dictionary_encode() allow changing type columns Arrow objects; .numeric(), .character(), etc. exposed similar type-altering conveniences dplyr::(); Arrow version also allows left right arguments columns data just scalars Additionally, Arrow C++ compute function can called inside dplyr verb. enables access Arrow functions don’t direct R mapping. See list_compute_functions() available functions, available dplyr prefixed arrow_. Arrow C++ compute functions now systematic type promotion called data different types (e.g. int32 float64). Previously, Scalars expressions always cast match type corresponding Array, new type promotion enables, among things, operations two columns (Arrays) dataset. side effect, comparisons worked prior versions longer supported: example, dplyr::filter(arrow_dataset, string_column == 3) error message type mismatch numeric 3 string type string_column.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"datasets-4-0-0","dir":"Changelog","previous_headings":"","what":"Datasets","title":"arrow 4.0.0","text":"open_dataset() now accepts vector file paths (even single file path). Among things, enables open single large file use write_dataset() partition without read whole file memory. Datasets can now detect read directory compressed CSVs write_dataset() now defaults format = \"parquet\" better validates format argument Invalid input schema open_dataset() now correctly handled Collecting 0 columns Dataset now longer returns columns Scanner$Scan() method removed; use Scanner$ScanBatches()","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-improvements-4-0-0","dir":"Changelog","previous_headings":"","what":"Other improvements","title":"arrow 4.0.0","text":"value_counts() tabulate values Array ChunkedArray, similar base::table(). StructArray objects gain data.frame-like methods, including names(), $, [[, dim(). RecordBatch columns can now added, replaced, removed assigning (<-) either $ [[ Similarly, Schema can now edited assigning new types. enables using CSV reader detect schema file, modify Schema object columns want read different type, use Schema read data. Better validation creating Table schema, columns different lengths, scalar value recycling Reading Parquet files Japanese multi-byte locales Windows longer hangs (workaround bug libstdc++; thanks @yutannihilation persistence discovering !) attempt read string data embedded nul (\\0) characters, error message now informs can set options(arrow.skip_nul = TRUE) strip . recommended set option default since code path significantly slower, string data contain nuls. 
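An illustrative sketch of calling an Arrow C++ compute kernel directly inside a dplyr verb via the arrow_ prefix, and of changing a column's type with cast(), as described in the 4.0.0 notes above. The kernel name ascii_upper is only an example; list_compute_functions() shows what your build actually provides.

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c("a", "b", "c"), y = 1:3)

list_compute_functions(pattern = "^ascii")   # discover available kernels

tbl %>%
  mutate(
    x_upper = arrow_ascii_upper(x),   # call the C++ "ascii_upper" kernel directly
    y_dbl   = cast(y, float64())      # change the column type within the query
  ) %>%
  collect()
```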
read_json_arrow() now accepts schema: read_json_arrow(\"file.json\", schema = schema(col_a = float64(), col_b = string()))","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-and-configuration-4-0-0","dir":"Changelog","previous_headings":"","what":"Installation and configuration","title":"arrow 4.0.0","text":"R package can now support working Arrow C++ library additional features (dataset, parquet, string libraries) disabled, bundled build script enables setting environment variables disable . See vignette(\"install\", package = \"arrow\") details. allows faster, smaller package build cases useful, enables minimal, functioning R package build Solaris. macOS, now possible use bundled C++ build used default Linux, along customization parameters, setting environment variable FORCE_BUNDLED_BUILD=true. arrow now uses mimalloc memory allocator default macOS, available (CRAN binaries), instead jemalloc. configuration issues jemalloc macOS, benchmark analysis shows negative effects performance, especially memory-intensive workflows. jemalloc remains default Linux; mimalloc default Windows. Setting ARROW_DEFAULT_MEMORY_POOL environment variable switch memory allocators now works correctly Arrow C++ library statically linked (usually case installing CRAN). arrow_info() function now reports additional optional features, well detected SIMD level. key features compression libraries enabled build, arrow_info() refer installation vignette guidance install complete build, desired. attempt read file compressed codec Arrow build contain support , error message now tell reinstall Arrow feature enabled. new vignette developer environment setup vignette(\"developing\", package = \"arrow\"). building source, can use environment variable ARROW_HOME point specific directory Arrow libraries . similar passing INCLUDE_DIR LIB_DIR.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-300","dir":"Changelog","previous_headings":"","what":"arrow 3.0.0","title":"arrow 3.0.0","text":"CRAN release: 2021-01-27","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"python-and-flight-3-0-0","dir":"Changelog","previous_headings":"","what":"Python and Flight","title":"arrow 3.0.0","text":"Flight methods flight_get() flight_put() (renamed push_data() release) can handle Tables RecordBatches flight_put() gains overwrite argument optionally check existence resource name list_flights() flight_path_exists() enable see available resources Flight server Schema objects now r_to_py py_to_r methods Schema metadata correctly preserved converting Tables /Python","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"enhancements-3-0-0","dir":"Changelog","previous_headings":"","what":"Enhancements","title":"arrow 3.0.0","text":"Arithmetic operations (+, *, etc.) supported Arrays ChunkedArrays can used filter expressions Arrow dplyr pipelines Table columns can now added, replaced, removed assigning (<-) either $ [[ Column names Tables RecordBatches can renamed assigning names() Large string types can now written Parquet files rlang pronouns .data .env now fully supported Arrow dplyr pipelines. Option arrow.skip_nul (default FALSE, base::scan()) allows conversion Arrow string (utf8()) type data containing embedded nul \\0 characters R. set TRUE, nuls stripped warning emitted found. 
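A hedged sketch of the Flight client methods listed in the 3.0.0 notes above. It assumes a Flight server is already running at the (hypothetical) address and that pyarrow with Flight support is available to reticulate.

```r
library(arrow)

client <- flight_connect(host = "localhost", port = 8089)   # hypothetical server

flight_put(client, data = mtcars, path = "demo/mtcars")     # upload a data.frame or Table
flight_path_exists(client, "demo/mtcars")                   # TRUE once stored
list_flights(client)                                        # enumerate resources on the server
tab <- flight_get(client, "demo/mtcars")                    # returns an Arrow Table
```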
arrow_info() overview various run-time build-time Arrow configurations, useful debugging Set environment variable ARROW_DEFAULT_MEMORY_POOL loading Arrow package change memory allocators. Windows packages built mimalloc; others built jemalloc (used default) mimalloc. alternative memory allocators generally much faster system memory allocator, used default available, sometimes useful turn debugging purposes. disable , set ARROW_DEFAULT_MEMORY_POOL=system. List columns attributes element now also included metadata saved creating Arrow tables. allows sf tibbles faithfully preserved roundtripped (#8549). R metadata exceeds 100Kb now compressed written table; see schema() details.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"bug-fixes-3-0-0","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"arrow 3.0.0","text":"Fixed performance regression converting Arrow string types R present 2.0.0 release C++ functions now trigger garbage collection needed write_parquet() can now write RecordBatches Reading Table RecordBatchStreamReader containing 0 batches longer crashes readr’s problems attribute removed converting Arrow RecordBatch table prevent large amounts metadata accumulating inadvertently (#9092) Fixed reading compressed Feather files written Arrow 0.17 (#9128) SubTreeFileSystem gains useful print method longer errors printing","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"packaging-and-installation-3-0-0","dir":"Changelog","previous_headings":"","what":"Packaging and installation","title":"arrow 3.0.0","text":"Nightly development versions conda r-arrow package available conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow Linux installation now safely supports older cmake versions Compiler version checking enabling S3 support correctly identifies active compiler Updated guidance troubleshooting vignette(\"install\", package = \"arrow\"), especially known CentOS issues Operating system detection Linux uses distro package. OS isn’t correctly identified, please report issue .","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-200","dir":"Changelog","previous_headings":"","what":"arrow 2.0.0","title":"arrow 2.0.0","text":"CRAN release: 2020-10-20","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"datasets-2-0-0","dir":"Changelog","previous_headings":"","what":"Datasets","title":"arrow 2.0.0","text":"write_dataset() Feather Parquet files partitioning. See end vignette(\"dataset\", package = \"arrow\") discussion examples. Datasets now head(), tail(), take ([) methods. head() optimized others may performant. collect() gains as_data_frame argument, default TRUE FALSE allows evaluate accumulated select filter query keep result Arrow, R data.frame read_csv_arrow() supports specifying column types, Schema compact string representation types used readr package. also gained timestamp_parsers argument lets express set strptime parse strings tried convert columns designated Timestamp type.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"aws-s3-support-2-0-0","dir":"Changelog","previous_headings":"","what":"AWS S3 support","title":"arrow 2.0.0","text":"S3 support now enabled binary macOS Windows (Rtools40 , .e. R >= 4.0) packages. enable Linux, need additional system dependencies libcurl openssl, well sufficiently modern compiler. See vignette(\"install\", package = \"arrow\") details. 
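A short sketch of the 2.0.0 Dataset and CSV-reading features described above; the paths, column names, and compact type string are illustrative only.

```r
library(arrow)
library(dplyr)

# Write a partitioned Parquet dataset, then query it lazily
write_dataset(mtcars, "mtcars_ds", format = "parquet", partitioning = "cyl")
ds <- open_dataset("mtcars_ds")
ds %>%
  filter(mpg > 25) %>%
  select(mpg, cyl, hp) %>%
  collect(as_data_frame = FALSE)   # keep the result as an Arrow Table, not a data.frame

# readr-style compact column types when reading a CSV
flights <- read_csv_arrow(
  "flights.csv",                   # hypothetical file
  col_names = c("id", "carrier", "delay"),
  col_types = "icd",               # integer, character, double
  skip = 1                         # skip the header row since names are supplied
)
```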
File readers writers (read_parquet(), write_feather(), et al.), well open_dataset() write_dataset(), allow access resources S3 (file systems emulate S3) either providing s3:// URI providing FileSystem$path(). See vignette(\"fs\", package = \"arrow\") examples. copy_files() allows recursively copy directories files one file system another, S3 local machine.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"flight-rpc-2-0-0","dir":"Changelog","previous_headings":"","what":"Flight RPC","title":"arrow 2.0.0","text":"Flight general-purpose client-server framework high performance transport large datasets network interfaces. arrow R package now provides methods connecting Flight RPC servers send receive data. See vignette(\"flight\", package = \"arrow\") overview.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"computation-2-0-0","dir":"Changelog","previous_headings":"","what":"Computation","title":"arrow 2.0.0","text":"Comparison (==, >, etc.) boolean (&, |, !) operations, along .na, %% match (called match_arrow()), Arrow Arrays ChunkedArrays now implemented C++ library. Aggregation methods min(), max(), unique() implemented Arrays ChunkedArrays. dplyr filter expressions Arrow Tables RecordBatches now evaluated C++ library, rather pulling data R evaluating. yields significant performance improvements. dim() (nrow) dplyr queries Table/RecordBatch now supported","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"packaging-and-installation-2-0-0","dir":"Changelog","previous_headings":"","what":"Packaging and installation","title":"arrow 2.0.0","text":"arrow now depends cpp11, brings robust UTF-8 handling faster compilation Linux build script now succeeds older versions R macOS binary packages now ship zstandard compression enabled","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"bug-fixes-and-other-enhancements-2-0-0","dir":"Changelog","previous_headings":"","what":"Bug fixes and other enhancements","title":"arrow 2.0.0","text":"Automatic conversion Arrow Int64 type values fit R 32-bit integer now correctly inspects chunks ChunkedArray, conversion can disabled (Int64 always yields bit64::integer64 vector) setting options(arrow.int64_downcast = FALSE). addition data.frame column metadata preserved round trip, added 1.0.0, now attributes data.frame also preserved Arrow schema metadata. File writers now respect system umask setting ParquetFileReader additional methods accessing individual columns row groups file Various segfaults fixed: invalid input ParquetFileWriter; invalid ArrowObject pointer saved R object; converting deeply nested structs Arrow R properties arrow_properties arguments write_parquet() deprecated","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-101","dir":"Changelog","previous_headings":"","what":"arrow 1.0.1","title":"arrow 1.0.1","text":"CRAN release: 2020-08-28","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"bug-fixes-1-0-1","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"arrow 1.0.1","text":"Filtering Dataset multiple partition keys using %% expression now faithfully returns relevant rows Datasets can now path segments root directory start . 
_; files subdirectories starting prefixes still ignored open_dataset(\"~/path\") now correctly expands path version option write_parquet() now correctly implemented UBSAN failure parquet-cpp library fixed bundled Linux builds, logic finding cmake robust, can now specify /path//cmake setting CMAKE environment variable","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-100","dir":"Changelog","previous_headings":"","what":"arrow 1.0.0","title":"arrow 1.0.0","text":"CRAN release: 2020-07-25","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-format-conversion-1-0-0","dir":"Changelog","previous_headings":"","what":"Arrow format conversion","title":"arrow 1.0.0","text":"vignette(\"arrow\", package = \"arrow\") includes tables explain R types converted Arrow types vice versa. Support added converting /Arrow types: uint64, binary, fixed_size_binary, large_binary, large_utf8, large_list, list structs. character vectors exceed 2GB converted Arrow large_utf8 type POSIXlt objects can now converted Arrow (struct) R attributes() preserved Arrow metadata converting Arrow RecordBatch table restored converting Arrow. means custom subclasses, haven::labelled, preserved round trip Arrow. Schema metadata now exposed named list, can modified assignment like batch$metadata$new_key <- \"new value\" Arrow types int64, uint32, uint64 now converted R integer values fit bounds Arrow date32 now converted R Date double underlying storage. Even though data values integers, provides strict round-trip fidelity converting R factor, dictionary ChunkedArrays identical dictionaries properly unified 1.0 release, Arrow IPC metadata version increased V4 V5. default, RecordBatch{File,Stream}Writer write V5, can specify alternate metadata_version. convenience, know consumer ’re writing read V5, can set environment variable ARROW_PRE_1_0_METADATA_VERSION=1 write V4 without changing code.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"datasets-1-0-0","dir":"Changelog","previous_headings":"","what":"Datasets","title":"arrow 1.0.0","text":"CSV text-delimited datasets now supported custom C++ build, possible read datasets directly S3 passing URL like ds <- open_dataset(\"s3://...\"). Note currently requires special C++ library build additional dependencies–yet available CRAN releases nightly packages. reading individual CSV JSON files, compression automatically detected file extension","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-enhancements-1-0-0","dir":"Changelog","previous_headings":"","what":"Other enhancements","title":"arrow 1.0.0","text":"Initial support C++ aggregation methods: sum() mean() implemented Array ChunkedArray Tables RecordBatches additional data.frame-like methods, including dimnames() .list() Tables ChunkedArrays can now moved /Python via reticulate","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"bug-fixes-and-deprecations-1-0-0","dir":"Changelog","previous_headings":"","what":"Bug fixes and deprecations","title":"arrow 1.0.0","text":"Non-UTF-8 strings (common Windows) correctly coerced UTF-8 passing Arrow memory appropriately re-localized converting R coerce_timestamps option write_parquet() now correctly implemented. Creating Dictionary array respects type definition provided user read_arrow write_arrow now deprecated; use read/write_feather() read/write_ipc_stream() functions depending whether ’re working Arrow IPC file stream format, respectively. 
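A sketch of the S3 access described in the 2.0.0 notes above. The bucket name, region, and object keys are placeholders, and AWS credentials are assumed to be available in the environment.

```r
library(arrow)

# Read a single Parquet file straight from S3 via a URI
tab <- read_parquet("s3://my-bucket/data/file.parquet")

# Or build a FileSystem object and open a whole dataset under a prefix
bucket <- s3_bucket("my-bucket", region = "us-east-2")
ds <- open_dataset(bucket$path("datasets/sales"))

# Recursively copy files between file systems, e.g. S3 to the local machine
copy_files("s3://my-bucket/datasets/sales", "local/sales")
```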
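And a small sketch of the metadata behaviour added in 1.0.0: schema metadata is exposed as a named list you can assign into, and R attributes on a data.frame survive the round trip through Arrow.

```r
library(arrow)

batch <- record_batch(x = 1:3, y = c("a", "b", "c"))

# Schema metadata is a named list and can be modified by assignment
batch$metadata$new_key <- "new value"
batch$metadata$new_key

# R attributes on a data.frame are stored in the schema metadata and restored
df <- data.frame(x = 1:3)
attr(df, "source") <- "demo"
tab <- Table$create(df)
attr(as.data.frame(tab), "source")   # "demo"
```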
Previously deprecated FileStats, read_record_batch, read_table removed.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-and-packaging-1-0-0","dir":"Changelog","previous_headings":"","what":"Installation and packaging","title":"arrow 1.0.0","text":"improved performance memory allocation, macOS Linux binaries now jemalloc included, Windows packages use mimalloc Linux installation: tweaks OS detection binaries, updates known installation issues vignette bundled libarrow built CC CXX values R uses Failure build bundled libarrow yields clear message Various streamlining efforts reduce library size compile time","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0171","dir":"Changelog","previous_headings":"","what":"arrow 0.17.1","title":"arrow 0.17.1","text":"CRAN release: 2020-05-19 Updates compatibility dplyr 1.0 reticulate::r_to_py() conversion now correctly works automatically, without call method Assorted bug fixes C++ library around Parquet reading","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0170","dir":"Changelog","previous_headings":"","what":"arrow 0.17.0","title":"arrow 0.17.0","text":"CRAN release: 2020-04-21","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"feather-v2-0-17-0","dir":"Changelog","previous_headings":"","what":"Feather v2","title":"arrow 0.17.0","text":"release includes support version 2 Feather file format. Feather v2 features full support Arrow data types, fixes 2GB per-column limitation large amounts string data, allows files compressed using either lz4 zstd. write_feather() can write either version 2 version 1 Feather files, read_feather() automatically detects file version reading. Related change, several functions around reading writing data reworked. read_ipc_stream() write_ipc_stream() added facilitate writing data Arrow IPC stream format, slightly different IPC file format (Feather v2 IPC file format). Behavior standardized: read_<format>() return R data.frame (default) Table argument as_data_frame = FALSE; write_<format>() functions return data object, invisibly. facilitate workflows, special write_to_raw() function added wrap write_ipc_stream() return raw vector containing buffer written. achieve standardization, read_table(), read_record_batch(), read_arrow(), write_arrow() deprecated.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"python-interoperability-0-17-0","dir":"Changelog","previous_headings":"","what":"Python interoperability","title":"arrow 0.17.0","text":"0.17 Apache Arrow release includes C data interface allows exchanging Arrow data -process C level without copying without libraries build runtime dependency . enables us use reticulate share data R Python (pyarrow) efficiently. 
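A sketch of the reworked Feather/IPC readers and writers from 0.17.0; file names are placeholders.

```r
library(arrow)

# Feather v2 (the Arrow IPC file format) is the default; version = 1 writes legacy files
write_feather(mtcars, "cars.feather")
write_feather(mtcars, "cars_v1.feather", version = 1)
cars <- read_feather("cars.feather")          # the file version is detected automatically

# The IPC stream format, plus a convenience wrapper that returns a raw vector
write_ipc_stream(mtcars, "cars.arrows")
buf   <- write_to_raw(mtcars)                 # raw vector containing the stream
cars2 <- read_ipc_stream(buf)                 # readers also accept raw vectors

# as_data_frame = FALSE returns an Arrow Table instead of a data.frame
tab <- read_feather("cars.feather", as_data_frame = FALSE)
```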
See vignette(\"python\", package = \"arrow\") details.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"datasets-0-17-0","dir":"Changelog","previous_headings":"","what":"Datasets","title":"arrow 0.17.0","text":"Dataset reading benefits many speedups fixes C++ library Datasets dim() method, sums rows across files (#6635, @boshek) Combine multiple datasets single queryable UnionDataset c() method Dataset filtering now treats NA FALSE, consistent dplyr::filter() Dataset filtering now correctly supported Arrow date/time/timestamp column types vignette(\"dataset\", package = \"arrow\") now correct, executable code","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"installation-0-17-0","dir":"Changelog","previous_headings":"","what":"Installation","title":"arrow 0.17.0","text":"Installation Linux now builds C++ library source default, compression libraries disabled. faster, richer build, set environment variable NOT_CRAN=true. See vignette(\"install\", package = \"arrow\") details options. Source installation faster reliable Linux distributions.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-bug-fixes-and-enhancements-0-17-0","dir":"Changelog","previous_headings":"","what":"Other bug fixes and enhancements","title":"arrow 0.17.0","text":"unify_schemas() create Schema containing union fields multiple schemas Timezones faithfully preserved roundtrip R Arrow read_feather() reader functions close file connections open Arrow R6 objects longer namespace collisions R.oo package also loaded FileStats renamed FileInfo, original spelling deprecated","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-01602","dir":"Changelog","previous_headings":"","what":"arrow 0.16.0.2","title":"arrow 0.16.0.2","text":"CRAN release: 2020-02-14 install_arrow() now installs latest release arrow, including Linux dependencies, either CRAN releases development builds (nightly = TRUE) Package installation Linux longer downloads C++ dependencies unless LIBARROW_DOWNLOAD NOT_CRAN environment variable set write_feather(), write_arrow() write_parquet() now return input, similar write_* functions readr package (#6387, @boshek) Can now infer type R list create ListArray list elements type (#6275, @michaelchirico)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0160","dir":"Changelog","previous_headings":"","what":"arrow 0.16.0","title":"arrow 0.16.0","text":"CRAN release: 2020-02-09","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"multi-file-datasets-0-16-0","dir":"Changelog","previous_headings":"","what":"Multi-file datasets","title":"arrow 0.16.0","text":"release includes dplyr interface Arrow Datasets, let work efficiently large, multi-file datasets single entity. Explore directory data files open_dataset() use dplyr methods select(), filter(), etc. Work done possible Arrow memory. necessary, data pulled R computation. dplyr methods conditionally loaded dplyr available; hard dependency. See vignette(\"dataset\", package = \"arrow\") details.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"linux-installation-0-16-0","dir":"Changelog","previous_headings":"","what":"Linux installation","title":"arrow 0.16.0","text":"source package installation (CRAN) now handle C++ dependencies automatically. 
common Linux distributions versions, installation retrieve prebuilt static C++ library inclusion package; binary available, package executes bundled script build Arrow C++ library system dependencies beyond R requires. See vignette(\"install\", package = \"arrow\") details.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"data-exploration-0-16-0","dir":"Changelog","previous_headings":"","what":"Data exploration","title":"arrow 0.16.0","text":"Tables RecordBatches also dplyr methods. exploration without dplyr, [ methods Tables, RecordBatches, Arrays, ChunkedArrays now support natural row extraction operations. use C++ Filter, Slice, Take methods efficient access, depending type selection vector. experimental, lazily evaluated array_expression class also added, enabling among things ability filter Table function Arrays, arrow_table[arrow_table$var1 > 5, ] without pull everything R first.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"compression-0-16-0","dir":"Changelog","previous_headings":"","what":"Compression","title":"arrow 0.16.0","text":"write_parquet() now supports compression codec_is_available() returns TRUE FALSE whether Arrow C++ library built support given compression library (e.g. gzip, lz4, snappy) Windows builds now include support zstd lz4 compression (#5814, @gnguy)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-fixes-and-improvements-0-16-0","dir":"Changelog","previous_headings":"","what":"Other fixes and improvements","title":"arrow 0.16.0","text":"Arrow null type now supported Factor types now preserved round trip Parquet format (#6135, @yutannihilation) Reading Arrow dictionary type coerces dictionary values character (R factor levels required ) instead raising error Many improvements Parquet function documentation (@karldw, @khughitt)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0151","dir":"Changelog","previous_headings":"","what":"arrow 0.15.1","title":"arrow 0.15.1","text":"CRAN release: 2019-11-04 patch release includes bugfixes C++ library around dictionary types Parquet reading.","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0150","dir":"Changelog","previous_headings":"","what":"arrow 0.15.0","title":"arrow 0.15.0","text":"CRAN release: 2019-10-07","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"breaking-changes-0-15-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"arrow 0.15.0","text":"R6 classes wrap C++ classes now documented exported renamed R-friendly. Users high-level R interface package affected. want interact Arrow C++ API directly work objects methods. part change, many functions instantiated R6 objects removed favor Class$create() methods. Notably, arrow::array() arrow::table() removed favor Array$create() Table$create(), eliminating package startup message masking base functions. information, see new vignette(\"arrow\"). Due subtle change Arrow message format, data written 0.15 version libraries may readable older versions. need send data process uses older version Arrow (example, Apache Spark server hasn’t yet updated Arrow 0.15), can set environment variable ARROW_PRE_0_15_IPC_FORMAT=1. 
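A sketch of checking codec support before writing compressed Parquet, per the 0.16.0 compression notes above.

```r
library(arrow)

codec_is_available("zstd")     # TRUE if the C++ library was built with zstd support
codec_is_available("snappy")

if (codec_is_available("zstd")) {
  write_parquet(mtcars, "cars.parquet", compression = "zstd", compression_level = 5)
}
```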
as_tibble argument read_*() functions renamed as_data_frame (#5399, @jameslamb) arrow::Column class removed, removed C++ library","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"new-features-0-15-0","dir":"Changelog","previous_headings":"","what":"New features","title":"arrow 0.15.0","text":"Table RecordBatch objects S3 methods enable work like data.frames. Extract columns, subset, . See ?Table ?RecordBatch examples. Initial implementation bindings C++ File System API. (#5223) Compressed streams now supported Windows (#5329), can also specify compression level (#5450)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"other-upgrades-0-15-0","dir":"Changelog","previous_headings":"","what":"Other upgrades","title":"arrow 0.15.0","text":"Parquet file reading much, much faster, thanks improvements Arrow C++ library. read_csv_arrow() supports parsing options, including col_names, na, quoted_na, skip read_parquet() read_feather() can ingest data raw vector (#5141) File readers now properly handle paths need expanding, ~/file.parquet (#5169) Improved support creating types schema: types’ printed names (e.g. “double”) guaranteed valid use instantiating schema (e.g. double()), time types can created human-friendly resolution strings (“ms”, “s”, etc.). (#5198, #5201)","code":""},{"path":"https://arrow.apache.org/docs/r/news/index.html","id":"arrow-0141","dir":"Changelog","previous_headings":"","what":"arrow 0.14.1","title":"arrow 0.14.1","text":"CRAN release: 2019-08-05 Initial CRAN release arrow package. Key features include: Read write support various file formats, including Parquet, Feather/Arrow, CSV, JSON. API bindings C++ library Arrow data types objects, well mapping Arrow types R data types. Tools helping C++ library configuration installation.","code":""}]
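Finally, a sketch of the reader options mentioned in the 0.15.0 notes: readr-like CSV parsing arguments and ingesting Feather data from a raw vector. File names and column names are placeholders, and write_to_raw() (a later addition) is used here only to produce example bytes.

```r
library(arrow)

# readr-like parsing options on the Arrow CSV reader
readings <- read_csv_arrow(
  "measurements.csv",
  col_names = c("id", "value", "unit"),
  skip = 1,                        # skip the original header row
  na = c("", "NA", "-999")
)

# read_feather() (and read_parquet()) can also ingest data from a raw vector
buf <- write_to_raw(mtcars, format = "file")   # IPC file format, i.e. Feather v2
tab <- read_feather(buf, as_data_frame = FALSE)
```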