arrow 0.16.0

Multi-file datasets

This release includes a dplyr interface to Arrow Datasets, which let you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with open_dataset() and then use dplyr methods to select(), filter(), etc. Work is done in Arrow memory where possible; when necessary, data is pulled into R for further computation. dplyr methods are loaded only if you have dplyr available; it is not a hard dependency.
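
A minimal sketch (the dataset path and the year, x, and y columns are hypothetical):

    library(arrow)
    library(dplyr)

    # Point at a directory of data files; nothing is read into memory yet
    ds <- open_dataset("path/to/dataset")

    # filter() and select() are evaluated in Arrow where possible;
    # collect() pulls the result into an R data frame
    ds %>%
      filter(year == 2019) %>%
      select(x, y) %>%
      collect()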

See vignette("dataset", package = "arrow") for details.

Linux installation

A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.
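
For instance, a plain source installation now just works (a minimal illustration):

    # On Linux, installing from source resolves the C++ dependency
    # automatically: a prebuilt static library is downloaded where
    # available, and the bundled build script is used otherwise
    install.packages("arrow")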

See vignette("install", package = "arrow") for details.

Data exploration

  • Tables and RecordBatches also have dplyr methods.
  • For exploration without dplyr, [ methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ Filter, Slice, and Take methods for efficient access, depending on the type of selection vector.
  • An experimental, lazily evaluated array_expression class has also been added, enabling, among other things, the ability to filter a Table with some function of its Arrays, such as arrow_table[arrow_table$var1 > 5, ], without having to pull everything into R first (see the sketch after this list).
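
A minimal sketch of both styles of extraction (the data here is hypothetical):

    library(arrow)

    tab <- Table$create(data.frame(var1 = 1:10, var2 = letters[1:10]))

    # Integer and logical selection vectors map to the C++
    # Slice, Take, and Filter methods
    tab[1:5, ]           # first five rows
    tab[c(1, 3, 5), ]    # arbitrary rows

    # Lazily evaluated array_expression: the comparison happens
    # in Arrow, not on an R vector
    tab[tab$var1 > 5, ]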

Compression

  • write_parquet() now supports compression (see the example after this list)
  • codec_is_available() returns TRUE or FALSE, indicating whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy)
  • Windows builds now include support for zstd and lz4 compression (#5814, @gnguy)
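
For example (a sketch assuming write_parquet()'s compression argument; the file name is arbitrary, and codec availability depends on how the C++ library was built):

    library(arrow)

    # Check whether this build supports a given codec
    codec_is_available("gzip")
    codec_is_available("zstd")

    # Write a compressed Parquet file
    write_parquet(mtcars, "mtcars.parquet", compression = "gzip")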

Other fixes and improvements

  • Arrow null type is now supported
  • Factor types are now preserved in round trip through Parquet format (#6135, @yutannihilation)
  • Reading an Arrow dictionary type coerces dictionary values to character (as R factor levels are required to be) instead of raising an error
  • Many improvements to Parquet function documentation (@karldw, @khughitt)

arrow 0.15.1

  • This patch release includes bugfixes in the C++ library around dictionary types and Parquet reading.

arrow 0.15.0

Breaking changes

  • The R6 classes that wrap the C++ classes are now documented and exported and have been renamed to be more R-friendly. Users of the high-level R interface in this package are not affected. Those who want to interact with the Arrow C++ API more directly should work with these objects and methods. As part of this change, many functions that instantiated these R6 objects have been removed in favor of Class$create() methods. Notably, arrow::array() and arrow::table() have been removed in favor of Array$create() and Table$create() (see the sketch after this list), eliminating the package startup message about masking base functions. For more information, see the new vignette("arrow").
  • Due to a subtle change in the Arrow message format, data written by the 0.15 version libraries may not be readable by older versions. If you need to send data to a process that uses an older version of Arrow (for example, an Apache Spark server that hasn't yet updated to Arrow 0.15), you can set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
  • The as_tibble argument in the read_*() functions has been renamed to as_data_frame (ARROW-6337, @jameslamb)
  • The arrow::Column class has been removed, as it was removed from the C++ library
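
A brief sketch of the new Class$create() methods and the compatibility setting described above:

    library(arrow)

    # Class$create() replaces the removed constructor functions
    a <- Array$create(c(1, 2, 3))    # formerly arrow::array()
    tab <- Table$create(x = a)       # formerly arrow::table()

    # Write in the pre-0.15 IPC format for older consumers (e.g. Spark)
    Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = 1)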

New features

  • Table and RecordBatch objects have S3 methods that enable you to work with them more like data.frames: extract columns, subset, and so on. See ?Table and ?RecordBatch for examples, and the sketch after this list.
  • Initial implementation of bindings for the C++ File System API. (ARROW-6348)
  • Compressed streams are now supported on Windows (ARROW-6360), and you can also specify a compression level (ARROW-6533)
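
A minimal sketch of the data.frame-like methods (the columns here are hypothetical):

    library(arrow)

    tab <- Table$create(x = 1:5, y = letters[1:5])

    # data.frame-like S3 methods
    dim(tab)      # 5 2
    names(tab)    # "x" "y"
    as.data.frame(tab)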

Other upgrades

  • Parquet file reading is much, much faster, thanks to improvements in the Arrow C++ library.
  • read_csv_arrow() supports more parsing options, including col_names, na, quoted_na, and skip (see the sketch after this list)
  • read_parquet() and read_feather() can ingest data from a raw vector (ARROW-6278)
  • File readers now properly handle paths that need expanding, such as ~/file.parquet (ARROW-6323)
  • Improved support for creating types in a schema: the types' printed names (e.g. “double”) are guaranteed to be valid to use in instantiating a schema (e.g. double()), and time types can be created with human-friendly resolution strings (“ms”, “s”, etc.). (ARROW-6338, ARROW-6364)
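
A sketch of the new options and type constructors (the file and column names are hypothetical):

    library(arrow)

    # readr-style parsing options
    df <- read_csv_arrow("data.csv", col_names = c("a", "b"), skip = 1)

    # Printed type names are valid constructors, and time types
    # accept resolution strings
    schema(a = double(), b = time32("ms"))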

arrow 0.14.1

Initial CRAN release of the arrow package. Key features include:

  • Read and write support for various file formats, including Parquet, Feather/Arrow, CSV, and JSON (see the example after this list).
  • API bindings to the C++ library for Arrow data types and objects, as well as mapping between Arrow types and R data types.
  • Tools for helping with C++ library configuration and installation.
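
For example, a Parquet round trip (a sketch written against the package's current API; exact signatures in 0.14.1 may have differed slightly):

    library(arrow)

    # Write an R data frame to Parquet and read it back
    write_parquet(mtcars, "mtcars.parquet")
    df <- read_parquet("mtcars.parquet")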