<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Using the Arrow C++ Library in R • Arrow R Package</title>
<!-- jquery --><script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js" integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo=" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://cdnjs.cloudflare.com/ajax/libs/bootswatch/3.4.0/cosmo/bootstrap.min.css" rel="stylesheet" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.4.1/js/bootstrap.min.js" integrity="sha256-nuL8/2cJ5NDSSwnKD8VqreErSWHtnEP9E7AySL+1ev4=" crossorigin="anonymous"></script><!-- bootstrap-toc --><link rel="stylesheet" href="../bootstrap-toc.css">
<script src="../bootstrap-toc.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet">
<script src="../pkgdown.js"></script><meta property="og:title" content="Using the Arrow C++ Library in R">
<meta property="og:description" content="This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package.">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Matomo -->
<script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
_paq.push(["setDoNotTrack", true]);
_paq.push(["disableCookies"]);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
</head>
<body data-spy="scroll" data-target="#toc">
<div class="container template-article">
<header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<span class="navbar-brand">
<a class="navbar-link" href="../index.html">Arrow R Package</a>
<span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Released version">2.0.0</span>
</span>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="https://arrow.apache.org/">❯❯❯</a>
</li>
<li>
<a href="../articles/arrow.html">Get started</a>
</li>
<li>
<a href="../reference/index.html">Reference</a>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
Articles
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="../articles/dataset.html">Working with Arrow Datasets and dplyr</a>
</li>
<li>
<a href="../articles/flight.html">Connecting to Flight RPC Servers</a>
</li>
<li>
<a href="../articles/fs.html">Working with Cloud Storage (S3)</a>
</li>
<li>
<a href="../articles/install.html">Installing the Arrow Package on Linux</a>
</li>
<li>
<a href="../articles/python.html">Apache Arrow in Python and R with reticulate</a>
</li>
</ul>
</li>
<li>
<a href="../news/index.html">Changelog</a>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
Project docs
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="https://arrow.apache.org/docs/format/README.html">Specification</a>
</li>
<li>
<a href="https://arrow.apache.org/docs/c_glib">C GLib</a>
</li>
<li>
<a href="https://arrow.apache.org/docs/cpp">C++</a>
</li>
<li>
<a href="https://arrow.apache.org/docs/java">Java</a>
</li>
<li>
<a href="https://arrow.apache.org/docs/js">JavaScript</a>
</li>
<li>
<a href="https://arrow.apache.org/docs/python">Python</a>
</li>
<li>
<a href="../index.html">R</a>
</li>
</ul>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/apache/arrow/">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
<!--/.nav-collapse -->
</div>
<!--/.container -->
</div>
<!--/.navbar -->
</header><div class="row">
<div class="col-md-9 contents">
<div class="page-header toc-ignore">
<h1 data-toc-skip>Using the Arrow C++ Library in R</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/master/vignettes/arrow.Rmd"><code>vignettes/arrow.Rmd</code></a></small>
<div class="hidden name"><code>arrow.Rmd</code></div>
</div>
<p>The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The <code>arrow</code> R package offers both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette gives an overview of how the pieces fit together and describes the conventions that the classes and methods follow in R.</p>
<div id="features" class="section level1">
<h1 class="hasAnchor">
<a href="#features" class="anchor"></a>Features</h1>
<div id="multi-file-datasets" class="section level2">
<h2 class="hasAnchor">
<a href="#multi-file-datasets" class="anchor"></a>Multi-file datasets</h2>
<p>The <code>arrow</code> package lets you work efficiently with large, multi-file datasets using <code>dplyr</code> methods. See <code><a href="../articles/dataset.html">vignette("dataset", package = "arrow")</a></code> for an overview.</p>
</div>
<div id="reading-and-writing-files" class="section level2">
<h2 class="hasAnchor">
<a href="#reading-and-writing-files" class="anchor"></a>Reading and writing files</h2>
<p><code>arrow</code> provides simple functions that use the Arrow C++ library to read and write files. These functions are designed to drop into your normal R workflow without requiring any knowledge of the Arrow C++ library. Their names and arguments follow the conventions of popular R packages, particularly <code>readr</code>. The readers return <code>data.frame</code>s (or, if you use the <code>tibble</code> package, objects that act like <code>tbl_df</code>s), and the writers take <code>data.frame</code>s.</p>
<p>Importantly, <code>arrow</code> provides basic read and write support for the <a href="https://parquet.apache.org/">Apache Parquet</a> columnar data file format.</p>
<div class="sourceCode"><pre class="downlit">
<span class="kw"><a href="https://rdrr.io/r/base/library.html">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/">arrow</a></span><span class="op">)</span>
<span class="va">df</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="st">"path/to/file.parquet"</span><span class="op">)</span></pre></div>
<p>Just as you can read, you can write Parquet files:</p>
<div class="sourceCode"><pre class="downlit">
<span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df</span>, <span class="st">"path/to/different_file.parquet"</span><span class="op">)</span></pre></div>
<p>The <code>arrow</code> package also includes a faster and more robust implementation of the <a href="https://github.com/wesm/feather">Feather</a> file format, providing <code><a href="../reference/read_feather.html">read_feather()</a></code> and <code><a href="../reference/write_feather.html">write_feather()</a></code>. This implementation depends on the same underlying C++ library as the Python version does, resulting in more reliable and consistent behavior across the two languages, as well as <a href="https://wesmckinney.com/blog/feather-arrow-future/">improved performance</a>. <code>arrow</code> also by default writes the Feather V2 format, which supports a wider range of data types, as well as compression.</p>
<p>For CSV and line-delimited JSON, there are <code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code> and <code><a href="../reference/read_json_arrow.html">read_json_arrow()</a></code>, respectively. <code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code> currently has fewer parsing options than other readers for handling the many CSV format variations found in the wild. For the files it can read, however, it is often significantly faster than other R CSV readers, such as <code>base::read.csv</code>, <code>readr::read_csv</code>, and <code>data.table::fread</code>.</p>
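<p>As a minimal sketch of how these readers and writers fit together, the following round-trips a small <code>data.frame</code> through Feather and CSV. The example data and the temporary file paths are hypothetical and chosen only for illustration.</p>
<div class="sourceCode"><pre class="downlit">
library(arrow)

# Hypothetical example data; any data.frame would do
df &lt;- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

# Write to a temporary Feather file, then read it back
feather_path &lt;- tempfile(fileext = ".feather")
write_feather(df, feather_path)
df_feather &lt;- read_feather(feather_path)

# Write a CSV with base R, then read it with the Arrow CSV reader
csv_path &lt;- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv &lt;- read_csv_arrow(csv_path)</pre></div>
<p>Both results can then be used like any other <code>data.frame</code> in the rest of your workflow.</p>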
</div>
<div id="working-with-arrow-data-in-python" class="section level2">
<h2 class="hasAnchor">
<a href="#working-with-arrow-data-in-python" class="anchor"></a>Working with Arrow data in Python</h2>
<p>Using <a href="https://rstudio.github.io/reticulate/"><code>reticulate</code></a>, <code>arrow</code> lets you share data between R and Python (<code>pyarrow</code>) efficiently, enabling you to take advantage of the vibrant ecosystem of Python packages that build on top of Apache Arrow. See <code><a href="../articles/python.html">vignette("python", package = "arrow")</a></code> for details.</p>
</div>
<div id="access-to-arrow-messages-buffers-and-streams" class="section level2">
<h2 class="hasAnchor">
<a href="#access-to-arrow-messages-buffers-and-streams" class="anchor"></a>Access to Arrow messages, buffers, and streams</h2>
<p>The <code>arrow</code> package also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the <a href="https://spark.rstudio.com/"><code>sparklyr</code></a> package has support for using Arrow to move data to and from Spark, yielding <a href="http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/">significant performance gains</a>.</p>
</div>
</div>
<div id="internals" class="section level1">
<h1 class="hasAnchor">
<a href="#internals" class="anchor"></a>Internals</h1>
<div id="mapping-of-r----arrow-types" class="section level2">
<h2 class="hasAnchor">
<a href="#mapping-of-r----arrow-types" class="anchor"></a>Mapping of R &lt;--&gt; Arrow types</h2>
<p>Arrow has a rich data type system that includes direct parallels with R's data types and much more.</p>
<p>In the tables, entries with a <code>-</code> are not currently implemented.</p>
<div id="r-to-arrow" class="section level3">
<h3 class="hasAnchor">
<a href="#r-to-arrow" class="anchor"></a>R to Arrow</h3>
<table class="table">
<thead><tr class="header">
<th>R type</th>
<th>Arrow type</th>
</tr></thead>
<tbody>
<tr class="odd">
<td>logical</td>
<td>boolean</td>
</tr>
<tr class="even">
<td>integer</td>
<td>int32</td>
</tr>
<tr class="odd">
<td>double ("numeric")</td>
<td>float64</td>
</tr>
<tr class="even">
<td>character</td>
<td>utf8<sup>1</sup>
</td>
</tr>
<tr class="odd">
<td>factor</td>
<td>dictionary</td>
</tr>
<tr class="even">
<td>raw</td>
<td>uint8</td>
</tr>
<tr class="odd">
<td>Date</td>
<td>date32</td>
</tr>
<tr class="even">
<td>POSIXct</td>
<td>timestamp</td>
</tr>
<tr class="odd">
<td>POSIXlt</td>
<td>struct</td>
</tr>
<tr class="even">
<td>data.frame</td>
<td>struct</td>
</tr>
<tr class="odd">
<td>list<sup>2</sup>
</td>
<td>list</td>
</tr>
<tr class="even">
<td>bit64::integer64</td>
<td>int64</td>
</tr>
<tr class="odd">
<td>difftime</td>
<td>time32</td>
</tr>
<tr class="even">
<td>vctrs::vctrs_unspecified</td>
<td>null</td>
</tr>
</tbody>
</table>
<p><sup>1</sup>: If the character vector contains more than 2GB of string data, it will be converted to a <code>large_utf8</code> Arrow type.</p>
<p><sup>2</sup>: Only lists where all elements are the same type can be translated to the Arrow list type (which is a "list of" some type).</p>
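<p>One way to check a few rows of this table yourself is to create an Arrow Array from an R vector and inspect its <code>$type</code>. This is only a sketch; the input values are arbitrary, and the comments name the type given in the table rather than the exact printed output.</p>
<div class="sourceCode"><pre class="downlit">
library(arrow)

# Each call converts an R vector to an Arrow Array and returns its data type
lgl_type &lt;- Array$create(c(TRUE, FALSE))$type     # table: boolean
int_type &lt;- Array$create(1:3)$type                # table: int32
dbl_type &lt;- Array$create(c(1.5, 2.5))$type        # table: float64
chr_type &lt;- Array$create(c("a", "b"))$type        # table: utf8
date_type &lt;- Array$create(Sys.Date())$type        # table: date32

# Types can be compared against the type constructors exported by the package
int_type$Equals(int32())</pre></div>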
</div>
<div id="arrow-to-r" class="section level3">
<h3 class="hasAnchor">
<a href="#arrow-to-r" class="anchor"></a>Arrow to R</h3>
<table class="table">
<thead><tr class="header">
<th>Arrow type</th>
<th>R type</th>
</tr></thead>
<tbody>
<tr class="odd">
<td>boolean</td>
<td>logical</td>
</tr>
<tr class="even">
<td>int8</td>
<td>integer</td>
</tr>
<tr class="odd">
<td>int16</td>
<td>integer</td>
</tr>
<tr class="even">
<td>int32</td>
<td>integer</td>
</tr>
<tr class="odd">
<td>int64</td>
<td>integer<sup>3</sup>
</td>
</tr>
<tr class="even">
<td>uint8</td>
<td>integer</td>
</tr>
<tr class="odd">
<td>uint16</td>
<td>integer</td>
</tr>
<tr class="even">
<td>uint32</td>
<td>integer<sup>3</sup>
</td>
</tr>
<tr class="odd">
<td>uint64</td>
<td>integer<sup>3</sup>
</td>
</tr>
<tr class="even">
<td>float16</td>
<td>-</td>
</tr>
<tr class="odd">
<td>float32</td>
<td>double</td>
</tr>
<tr class="even">
<td>float64</td>
<td>double</td>
</tr>
<tr class="odd">
<td>utf8</td>
<td>character</td>
</tr>
<tr class="even">
<td>binary</td>
<td>arrow_binary <sup>5</sup>
</td>
</tr>
<tr class="odd">
<td>fixed_size_binary</td>
<td>arrow_fixed_size_binary <sup>5</sup>
</td>
</tr>
<tr class="even">
<td>date32</td>
<td>Date</td>
</tr>
<tr class="odd">
<td>date64</td>
<td>POSIXct</td>
</tr>
<tr class="even">
<td>time32</td>
<td>hms::difftime</td>
</tr>
<tr class="odd">
<td>time64</td>
<td>hms::difftime</td>
</tr>
<tr class="even">
<td>timestamp</td>
<td>POSIXct</td>
</tr>
<tr class="odd">
<td>duration</td>
<td>-</td>
</tr>
<tr class="even">
<td>decimal</td>
<td>double</td>
</tr>
<tr class="odd">
<td>dictionary</td>
<td>factor<sup>4</sup>
</td>
</tr>
<tr class="even">
<td>list</td>
<td>arrow_list <sup>6</sup>
</td>
</tr>
<tr class="odd">
<td>fixed_size_list</td>
<td>arrow_fixed_size_list <sup>6</sup>
</td>
</tr>
<tr class="even">
<td>struct</td>
<td>data.frame</td>
</tr>
<tr class="odd">
<td>null</td>
<td>vctrs::vctrs_unspecified</td>
</tr>
<tr class="even">
<td>map</td>
<td>-</td>
</tr>
<tr class="odd">
<td>union</td>
<td>-</td>
</tr>
<tr class="even">
<td>large_utf8</td>
<td>character</td>
</tr>
<tr class="odd">
<td>large_binary</td>
<td>arrow_large_binary <sup>5</sup>
</td>
</tr>
<tr class="even">
<td>large_list</td>
<td>arrow_large_list <sup>6</sup>
</td>
</tr>
</tbody>
</table>
<p><sup>3</sup>: These integer types may contain values that exceed the range of R's <code>integer</code> type (32-bit signed integer). When they do, <code>uint32</code> and <code>uint64</code> are converted to <code>double</code> ("numeric") and <code>int64</code> is converted to <code><a href="https://rdrr.io/pkg/bit64/man/bit64-package.html">bit64::integer64</a></code>.</p>
<p><sup>4</sup>: Because the levels of an R <code>factor</code> must be character strings, Arrow <code>dictionary</code> values are coerced to strings when translated to R if they are not already strings.</p>
<p><sup>5</sup>: <code>arrow*_binary</code> classes are implemented as lists of raw vectors.</p>
<p><sup>6</sup>: <code>arrow*_list</code> classes are implemented as subclasses of <code>vctrs_list_of</code> with a <code>ptype</code> attribute set to what an empty Array of the value type converts to.</p>
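<p>The reverse direction can be explored in the same spirit by building Arrays with an explicit Arrow type and converting them back with <code>as.vector()</code>. The sketch below uses arbitrary values and assumes the package's default conversion settings.</p>
<div class="sourceCode"><pre class="downlit">
library(arrow)

# float32 has no direct R equivalent, so it comes back as double
f32 &lt;- Array$create(c(1.5, 2.5), type = float32())
r_f32 &lt;- as.vector(f32)

# int64 values that fit in a 32-bit integer come back as integer (footnote 3)
i64 &lt;- Array$create(1:3, type = int64())
r_i64 &lt;- as.vector(i64)

# dictionary arrays come back as factors
dict &lt;- Array$create(factor(c("x", "y", "x")))
r_dict &lt;- as.vector(dict)

class(r_f32)
class(r_i64)
class(r_dict)</pre></div>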
</div>
<div id="r-object-attributes" class="section level3">
<h3 class="hasAnchor">
<a href="#r-object-attributes" class="anchor"></a>R object attributes</h3>
<p>Arrow supports custom key-value metadata attached to Schemas. When we convert a <code>data.frame</code> to an Arrow Table or RecordBatch, the package stores any <code><a href="https://rdrr.io/r/base/attributes.html">attributes()</a></code> attached to the columns of the <code>data.frame</code> in the Arrow object's Schema. These attributes are stored under the "r" key; you can assign additional string metadata under any other key you wish, like <code>x$metadata$new_key &lt;- "new value"</code>.</p>
<p>This metadata is preserved when writing the table to Feather or Parquet. When reading those files into R, or when calling <code><a href="https://rdrr.io/r/base/as.data.frame.html">as.data.frame()</a></code> on a Table/RecordBatch, the column attributes are restored to the columns of the resulting <code>data.frame</code>. This means that custom data types, including <code>haven::labelled</code>, <code>vctrs</code> annotations, and others, are preserved when doing a round-trip through Arrow.</p>
<p>Note that the <code><a href="https://rdrr.io/r/base/attributes.html">attributes()</a></code> stored in <code>$metadata$r</code> are only understood by R. If you write a <code>data.frame</code> with <code>haven</code> columns to a Feather file and read that in Pandas, the <code>haven</code> metadata won't be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.</p>
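<p>A short sketch of both kinds of metadata follows. The <code>units</code> attribute and the <code>new_key</code> entry are hypothetical names used only for illustration.</p>
<div class="sourceCode"><pre class="downlit">
library(arrow)

# A data.frame whose column carries a custom (hypothetical) attribute
df &lt;- data.frame(x = 1:3)
attr(df$x, "units") &lt;- "cm"

# Converting to a Table stores the column attributes in the Schema metadata
tab &lt;- Table$create(df)
names(tab$metadata)

# Additional string metadata can be assigned under any other key
tab$metadata$new_key &lt;- "new value"
new_value &lt;- tab$metadata$new_key

# Converting back should restore the column attributes
df2 &lt;- as.data.frame(tab)
attributes(df2$x)</pre></div>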
</div>
</div>
<div id="class-structure-and-package-conventions" class="section level2">
<h2 class="hasAnchor">
<a href="#class-structure-and-package-conventions" class="anchor"></a>Class structure and package conventions</h2>
<p>C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as <code>R6</code> reference classes, most of which are exported from the namespace.</p>
<p>In order to match the C++ naming conventions, the <code>R6</code> classes are in TitleCase, e.g. <code>RecordBatch</code>. This makes it easy to look up the relevant C++ implementations in the <a href="https://github.com/apache/arrow/tree/master/cpp">code</a> or <a href="https://arrow.apache.org/docs/cpp/">documentation</a>. To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has <code>arrow::io::FileOutputStream</code>, it is just <code>FileOutputStream</code> in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So <code>arrow::csv::TableReader</code> becomes <code>CsvTableReader</code>, and <code>arrow::json::TableReader</code> becomes <code>JsonTableReader</code>.</p>
<p>Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the <code>$create()</code> method to instantiate an object. For example, <code>rb &lt;- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))</code> will create a <code>RecordBatch</code>. Many of these factory methods that an R user might most often encounter also have a <code>snake_case</code> alias, in order to be more familiar for contemporary R users. So <code><a href="../reference/RecordBatch.html">record_batch(int = 1:10, dbl = as.numeric(1:10))</a></code> would do the same as <code>RecordBatch$create()</code> above.</p>
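<p>Written out as a runnable sketch, the two spellings look like this. Both should produce equivalent objects, which can be checked with the <code>$Equals()</code> method.</p>
<div class="sourceCode"><pre class="downlit">
library(arrow)

# R6 factory method, matching the C++ class name
rb &lt;- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))

# snake_case alias for the same factory
rb2 &lt;- record_batch(int = 1:10, dbl = as.numeric(1:10))

# Inspect the result through the R6 object's fields and methods
rb$num_rows
rb$schema
same &lt;- rb$Equals(rb2)</pre></div>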
<p>The typical user of the <code>arrow</code> R package may never deal directly with the <code>R6</code> objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call <code><a href="../reference/read_parquet.html">read_parquet()</a></code> without knowing or caring that they're instantiating a <code>ParquetFileReader</code> object and calling the <code>$ReadTable()</code> method on it. The classes remain available to the advanced programmer who wants fine-grained control over how the C++ library is used.</p>
</div>
</div>
</div>
<div class="col-md-3 hidden-xs hidden-sm" id="pkgdown-sidebar">
<nav id="toc" data-toggle="toc"><h2 data-toc-skip>Contents</h2>
</nav>
</div>
</div>
<footer><div class="copyright">
<p>Developed by Romain François, Jeroen Ooms, Neal Richardson, Apache Arrow.</p>
</div>
<div class="pkgdown">
<p>Site built with <a href="https://pkgdown.r-lib.org/">pkgdown</a> 1.6.1.</p>
</div>
</footer>
</div>
<script type="text/javascript" src="/docs/_static/versionwarning.js"></script> </body>
</html>