 <!DOCTYPE html>
 <!-- Generated by pkgdown: do not edit by hand --><html lang="en">
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <meta charset="utf-8">
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
 <meta name="description" content="Learn about Scalar, Array, Table, and Dataset objects in arrow  (among others), how they relate to each other, as well as their  relationships to familiar R objects like data frames and vectors
 ">
 <title>Data objects • Arrow R Package</title>
 <!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png">
 <link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png">
 <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png">
 <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png">
 <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png">
 <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png">
 <script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
 <link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet">
 <script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
 <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
 <!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js" integrity="sha512-7O5pXpc0oCRrxk8RUfDYFgn0nO1t+jLuIOQdOMRp4APB7uZ4vSjspzp5y6YDtDs4VzUSTbWzBFZ/LKJhnyFOKw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet">
 <meta property="og:title" content="Data objects">
 <meta property="og:description" content="Learn about Scalar, Array, Table, and Dataset objects in arrow  (among others), how they relate to each other, as well as their  relationships to familiar R objects like data frames and vectors
 ">
 <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png">
 <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text">
 <meta name="twitter:card" content="summary_large_image">
 <meta name="twitter:creator" content="@apachearrow">
 <meta name="twitter:site" content="@apachearrow">
 <!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
 <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
 <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
 <![endif]--><!-- Matomo --><script>
   var _paq = window._paq = window._paq || [];
   /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
   /* We explicitly disable cookie tracking to avoid privacy issues */
   _paq.push(['disableCookies']);
   _paq.push(['trackPageView']);
   _paq.push(['enableLinkTracking']);
   (function() {
     var u="https://analytics.apache.org/";
     _paq.push(['setTrackerUrl', u+'matomo.php']);
     _paq.push(['setSiteId', '20']);
     var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
     g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
   })();
 </script><!-- End Matomo Code -->
 </head>
 <body>
     <a href="#main" class="visually-hidden-focusable">Skip to contents</a>


     <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">

     <a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>

     <span class="version">
       <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">16.1.0.9000</small>
     </span>


     <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
       <span class="navbar-toggler-icon"></span>
     </button>

     <div id="navbar" class="collapse navbar-collapse ms-3">
       <ul class="navbar-nav me-auto">
 <li class="nav-item">
   <a class="nav-link" href="../articles/arrow.html">Get started</a>
 </li>
 <li class="nav-item">
   <a class="nav-link" href="../reference/index.html">Reference</a>
 </li>
 <li class="active nav-item dropdown">
   <a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
   <div class="dropdown-menu" aria-labelledby="dropdown-articles">
     <h6 class="dropdown-header" data-toc-skip>Using the package</h6>
     <a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a>
     <a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a>
     <a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a>
     <a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a>
     <a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a>
     <a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a>
     <div class="dropdown-divider"></div>
     <h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6>
     <a class="dropdown-item" href="../articles/data_objects.html">Data objects</a>
     <a class="dropdown-item" href="../articles/data_types.html">Data types</a>
     <a class="dropdown-item" href="../articles/metadata.html">Metadata</a>
     <div class="dropdown-divider"></div>
     <h6 class="dropdown-header" data-toc-skip>Installation</h6>
     <a class="dropdown-item" href="../articles/install.html">Installing on Linux</a>
     <a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a>
     <div class="dropdown-divider"></div>
     <a class="dropdown-item" href="../articles/index.html">More articles...</a>
   </div>
 </li>
 <li class="nav-item">
   <a class="nav-link" href="../news/index.html">Changelog</a>
 </li>
       </ul>
 <form class="form-inline my-2 my-lg-0" role="search">
         <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off">
 </form>

       <ul class="navbar-nav">
 <li class="nav-item">
   <a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github">
     <span class="fab fa fab fa-github fa-lg"></span>

   </a>
 </li>
       </ul>
 </div>


   </div>
 </nav><div class="container template-article">


 <div class="row">
   <main id="main" class="col-md-9"><div class="page-header">
       <img src="" class="logo" alt=""><h1>Data objects</h1>


       <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/data_objects.Rmd" class="external-link"><code>vignettes/data_objects.Rmd</code></a></small>
       <div class="d-none name"><code>data_objects.Rmd</code></div>
     </div>


 <p>This article describes the various data object types supplied by
 arrow, and documents how these objects are structured.</p>
 <p>The arrow package supplies several object classes that are used to
 represent data. <code>RecordBatch</code>, <code>Table</code>, and
 <code>Dataset</code> objects are two-dimensional rectangular data
 structures used to store tabular data. For columnar, one-dimensional
 data, the <code>Array</code> and <code>ChunkedArray</code> classes are
 provided. Finally, <code>Scalar</code> objects represent individual
 values. The table below summarizes these objects and shows how you can
 create new instances using the <a href="https://r6.r-lib.org/" class="external-link"><code>R6</code></a> class object, as well
 as convenience functions that provide the same functionality in a more
 traditional R-like fashion:</p>
 <table class="table">
 <colgroup>
 <col width="2%">
 <col width="12%">
 <col width="42%">
 <col width="41%">
 </colgroup>
 <thead><tr class="header">
 <th>Dim</th>
 <th>Class</th>
 <th>How to create an instance</th>
 <th>Convenience function</th>
 </tr></thead>
 <tbody>
 <tr class="odd">
 <td>0</td>
 <td><code>Scalar</code></td>
 <td><code>Scalar$create(value, type)</code></td>
 <td></td>
 </tr>
 <tr class="even">
 <td>1</td>
 <td><code>Array</code></td>
 <td><code>Array$create(vector, type)</code></td>
 <td><code>as_arrow_array(x)</code></td>
 </tr>
 <tr class="odd">
 <td>1</td>
 <td><code>ChunkedArray</code></td>
 <td><code>ChunkedArray$create(..., type)</code></td>
 <td><code>chunked_array(..., type)</code></td>
 </tr>
 <tr class="even">
 <td>2</td>
 <td><code>RecordBatch</code></td>
 <td><code>RecordBatch$create(...)</code></td>
 <td><code>record_batch(...)</code></td>
 </tr>
 <tr class="odd">
 <td>2</td>
 <td><code>Table</code></td>
 <td><code>Table$create(...)</code></td>
 <td><code>arrow_table(...)</code></td>
 </tr>
 <tr class="even">
 <td>2</td>
 <td><code>Dataset</code></td>
 <td><code>Dataset$create(sources, schema)</code></td>
 <td><code>open_dataset(sources, schema)</code></td>
 </tr>
 </tbody>
 </table>
 <p>Later in the article we’ll look at each of these in more detail. For
 now we note that each of these object classes corresponds to a class of
 the same name in the underlying Arrow C++ library.</p>
 <p>In addition to these data objects, arrow defines the following
 classes for representing metadata:</p>
 <ul>
 <li>A <code>Schema</code> is a list of <code>Field</code> objects that
 describes the structure of a tabular data object;</li>
 <li>A <code>Field</code> pairs a character string name with a
 <code>DataType</code>; and</li>
 <li>A <code>DataType</code> is an attribute controlling how values are
 represented.</li>
 </ul>
 <p>These metadata objects play an important role in making sure data are
 represented correctly, and all three of the tabular data object types
 (Record Batch, Table, and Dataset) include explicit Schema objects used
 to represent metadata. To learn more about these metadata classes, see
 the <a href="./metadata.html">metadata article</a>.</p>
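 <p>As a minimal sketch of how these three classes fit together, the
 snippet below builds a Schema by hand using <code>schema()</code> and
 <code>field()</code>. The field names <code>strs</code> and
 <code>ints</code> are chosen purely for illustration:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# A Schema is built from Fields; each Field pairs a name with a DataType
sch &lt;- schema(
  field("strs", string()),
  field("ints", int32())
)

sch$names                        # the field names, as a character vector
sch$fields[[2]]$type$ToString()  # the DataType of the second field, as text</code></pre></div>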
 <div class="section level2">
 <h2 id="scalars">Scalars<a class="anchor" aria-label="anchor" href="#scalars"></a>
 </h2>
 <p>A Scalar object is simply a single value that can be of any type. It
 might be an integer, a string, a timestamp, or a value of any other
 <code>DataType</code> that Arrow supports. Most users of the
 arrow R package are unlikely to create Scalars directly, but should
 there be a need you can do this by calling the
 <code>Scalar$create()</code> method:</p>
 <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">Scalar</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="st">"hello"</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## Scalar</span></span>
 <span><span class="co">## hello</span></span></code></pre>
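 <p>The summary table lists a second <code>type</code> argument for
 <code>Scalar$create()</code>. As a small sketch of what it does, we can
 compare a Scalar whose type is inferred from the R value with one whose
 type is requested explicitly. The variable names here are
 illustrative only:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# Without a type argument, the Arrow type is inferred from the R value
inferred &lt;- Scalar$create(1L)
inferred$type$ToString()

# Passing a type requests a specific Arrow representation instead
explicit &lt;- Scalar$create(1L, type = int64())
explicit$type$ToString()

# Convert the Scalar back to an ordinary R value
as.vector(explicit)</code></pre></div>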
 </div>
 <div class="section level2">
 <h2 id="arrays">Arrays<a class="anchor" aria-label="anchor" href="#arrays"></a>
 </h2>
 <p>Array objects are ordered sequences of Scalar values. As with Scalars, most
 users will not need to create Arrays directly, but if the need arises
 there is an <code>Array$create()</code> method that allows you to create
 new Arrays:</p>
 <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">integer_array</span> <span class="op">&lt;-</span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">1L</span>, <span class="cn">NA</span>, <span class="fl">2L</span>, <span class="fl">4L</span>, <span class="fl">8L</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="va">integer_array</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;int32&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   1,</span></span>
 <span><span class="co">##   null,</span></span>
 <span><span class="co">##   2,</span></span>
 <span><span class="co">##   4,</span></span>
 <span><span class="co">##   8</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">string_array</span> <span class="op">&lt;-</span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"hello"</span>, <span class="st">"amazing"</span>, <span class="st">"and"</span>, <span class="st">"cruel"</span>, <span class="st">"world"</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="va">string_array</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   "hello",</span></span>
 <span><span class="co">##   "amazing",</span></span>
 <span><span class="co">##   "and",</span></span>
 <span><span class="co">##   "cruel",</span></span>
 <span><span class="co">##   "world"</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>An Array can be subset using square brackets as shown below:</p>
 <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">5</span><span class="op">]</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   "cruel",</span></span>
 <span><span class="co">##   "world"</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>Arrays are immutable objects: once an Array has been created it
 cannot be modified or extended.</p>
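 <p>The summary table also lists <code>as_arrow_array()</code> as a
 convenience function for creating Arrays. The sketch below, which uses
 an illustrative variable name, converts an R vector to an Array,
 inspects it, and converts it back:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# as_arrow_array() converts an R vector to an Arrow Array
dbl_array &lt;- as_arrow_array(c(1.5, NA, 3))

length(dbl_array)      # number of elements, including nulls
dbl_array$null_count   # R's NA is stored as an Arrow null
as.vector(dbl_array)   # convert back to an R vector</code></pre></div>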
 </div>
 <div class="section level2">
 <h2 id="chunked-arrays">Chunked Arrays<a class="anchor" aria-label="anchor" href="#chunked-arrays"></a>
 </h2>
 <p>In practice, most users of the arrow R package are likely to use
 Chunked Arrays rather than simple Arrays. Under the hood, a Chunked
 Array is a collection of one or more Arrays that can be indexed <em>as
 if</em> they were a single Array. The reasons that Arrow provides this
 functionality are described in the <a href="./developers/data_object_layout.html">data object layout
 article</a>, but for present purposes it is sufficient to note that
 Chunked Arrays behave like Arrays in regular data analysis.</p>
 <p>To illustrate, let’s use the <code><a href="../reference/chunked_array.html">chunked_array()</a></code>
 function:</p>
 <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">chunked_string_array</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/chunked_array.html">chunked_array</a></span><span class="op">(</span></span>
 <span>  <span class="va">string_array</span>,</span>
 <span>  <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"I"</span>, <span class="st">"love"</span>, <span class="st">"you"</span><span class="op">)</span></span>
 <span><span class="op">)</span></span></code></pre></div>
 <p>The <code><a href="../reference/chunked_array.html">chunked_array()</a></code> function is just a wrapper around
 the functionality that <code>ChunkedArray$create()</code> provides.
 Let’s print the object:</p>
 <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">chunked_string_array</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "hello",</span></span>
 <span><span class="co">##     "amazing",</span></span>
 <span><span class="co">##     "and",</span></span>
 <span><span class="co">##     "cruel",</span></span>
 <span><span class="co">##     "world"</span></span>
 <span><span class="co">##   ],</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "I",</span></span>
 <span><span class="co">##     "love",</span></span>
 <span><span class="co">##     "you"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>The double bracketing in this output is intended to highlight the
 fact that Chunked Arrays are wrappers around one or more Arrays.
 However, although it is composed of multiple distinct Arrays, a Chunked Array
 can be indexed as if those Arrays were laid end-to-end in a single “vector-like”
 object. This is illustrated below:</p>
 <p><img src="array_indexing.png" width="100%"></p>
 <p>We can use <code>chunked_string_array</code> to illustrate this:</p>
 <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">chunked_string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">7</span><span class="op">]</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "cruel",</span></span>
 <span><span class="co">##     "world"</span></span>
 <span><span class="co">##   ],</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "I",</span></span>
 <span><span class="co">##     "love"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>An important thing to note is that “chunking” is not semantically
 meaningful. It is an implementation detail only: users should never
 treat the chunk as a meaningful unit. Writing the data to disk, for
 example, often results in the data being organized into different
 chunks. Similarly, two Chunked Arrays that contain the same values
 assigned to different chunks are deemed equivalent. To illustrate this
 we can create a Chunked Array that contains the same four
 values as <code>chunked_string_array[4:7]</code>, but organized into one
 chunk rather than split into two:</p>
 <div class="sourceCode" id="cb14"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">cruel_world</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/chunked_array.html">chunked_array</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"cruel"</span>, <span class="st">"world"</span>, <span class="st">"I"</span>, <span class="st">"love"</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="va">cruel_world</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "cruel",</span></span>
 <span><span class="co">##     "world",</span></span>
 <span><span class="co">##     "I",</span></span>
 <span><span class="co">##     "love"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>Testing for equality using <code>==</code> produces an element-wise
 comparison, and the result is a new Chunked Array of four (boolean type)
 <code>true</code> values:</p>
 <div class="sourceCode" id="cb16"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">cruel_world</span> <span class="op">==</span> <span class="va">chunked_string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">7</span><span class="op">]</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;bool&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     true,</span></span>
 <span><span class="co">##     true,</span></span>
 <span><span class="co">##     true,</span></span>
 <span><span class="co">##     true</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>In short, the intention is that users interact with Chunked Arrays as
 if they are ordinary one-dimensional data structures without ever having
 to think much about the underlying chunking arrangement.</p>
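 <p>If you do want to inspect the chunk layout, or to compare two Chunked
 Arrays as whole objects rather than element by element, the following
 sketch may help. It builds two small Chunked Arrays with illustrative
 names, and it assumes that the <code>$Equals()</code> method compares
 values without regard to how they are split into chunks:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# Five values split across two chunks
two_chunks &lt;- chunked_array(c("a", "b"), c("c", "d", "e"))

# The same five values held in a single chunk
one_chunk &lt;- chunked_array(c("a", "b", "c", "d", "e"))

two_chunks$num_chunks   # number of underlying Arrays
one_chunk$num_chunks
length(two_chunks)      # total number of elements across all chunks

# Whole-object comparison: a single logical value, not a Chunked Array
two_chunks$Equals(one_chunk)</code></pre></div>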
 <p>Chunked Arrays are mutable, in a specific sense: Arrays can be added
 to and removed from a Chunked Array.</p>
 </div>
 <div class="section level2">
 <h2 id="record-batches">Record Batches<a class="anchor" aria-label="anchor" href="#record-batches"></a>
 </h2>
 <p>A Record Batch is a tabular data structure comprised of named Arrays,
 and an accompanying Schema that specifies the name and data type
 associated with each Array. Record Batches are a fundamental unit for
 data interchange in Arrow, but are not typically used for data analysis.
 Tables and Datasets are usually more convenient in analytic
 contexts.</p>
 <p>The Arrays in a Record Batch can be of different types but must all be the same
 length. Each Array is referred to as one of the “fields” or “columns” of
 the Record Batch. You can create a Record Batch using the
 <code><a href="../reference/record_batch.html">record_batch()</a></code> function or by using the
 <code>RecordBatch$create()</code> method. These functions are flexible
 and can accept inputs in several formats: you can pass a data frame, one
 or more named vectors, an input stream, or even a raw vector containing
 appropriate binary data. For example:</p>
 <div class="sourceCode" id="cb18"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">rb</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/record_batch.html">record_batch</a></span><span class="op">(</span></span>
 <span>  strs <span class="op">=</span> <span class="va">string_array</span>,</span>
 <span>  ints <span class="op">=</span> <span class="va">integer_array</span>,</span>
 <span>  dbls <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">1.1</span>, <span class="fl">3.2</span>, <span class="fl">0.2</span>, <span class="cn">NA</span>, <span class="fl">11</span><span class="op">)</span></span>
 <span><span class="op">)</span></span>
 <span><span class="va">rb</span></span></code></pre></div>
 <pre><code><span><span class="co">## RecordBatch</span></span>
 <span><span class="co">## 5 rows x 3 columns</span></span>
 <span><span class="co">## $strs &lt;string&gt;</span></span>
 <span><span class="co">## $ints &lt;int32&gt;</span></span>
 <span><span class="co">## $dbls &lt;double&gt;</span></span></code></pre>
 <p>This is a Record Batch containing 5 rows and 3 columns, and its
 conceptual structure is shown below:</p>
 <p><img src="record_batch.png" width="100%"></p>
 <p>The arrow package supplies a <code>$</code> method for Record Batch
 objects, used to extract a single column by name:</p>
 <div class="sourceCode" id="cb20"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">rb</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   "hello",</span></span>
 <span><span class="co">##   "amazing",</span></span>
 <span><span class="co">##   "and",</span></span>
 <span><span class="co">##   "cruel",</span></span>
 <span><span class="co">##   "world"</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>You can use double brackets <code>[[</code> to refer to columns by
 position. The <code>rb$ints</code> array is the second column in our
 Record Batch so we can extract it with this:</p>
 <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">rb</span><span class="op">[[</span><span class="fl">2</span><span class="op">]</span><span class="op">]</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;int32&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   1,</span></span>
 <span><span class="co">##   null,</span></span>
 <span><span class="co">##   2,</span></span>
 <span><span class="co">##   4,</span></span>
 <span><span class="co">##   8</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>There is also a <code>[</code> method that allows you to extract
 subsets of a Record Batch in the same way you would for a data frame.
 The command <code>rb[1:3, 1:2]</code> extracts the first three rows and
 the first two columns:</p>
 <div class="sourceCode" id="cb24"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">rb</span><span class="op">[</span><span class="fl">1</span><span class="op">:</span><span class="fl">3</span>, <span class="fl">1</span><span class="op">:</span><span class="fl">2</span><span class="op">]</span></span></code></pre></div>
 <pre><code><span><span class="co">## RecordBatch</span></span>
 <span><span class="co">## 3 rows x 2 columns</span></span>
 <span><span class="co">## $strs &lt;string&gt;</span></span>
 <span><span class="co">## $ints &lt;int32&gt;</span></span></code></pre>
 <p>Record Batches cannot be concatenated: because they are comprised of
 Arrays, and Arrays are immutable objects, new rows cannot be added to
 a Record Batch once it has been created.</p>
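 <p>As noted earlier, <code>record_batch()</code> also accepts a data
 frame. The sketch below uses a small illustrative data frame to show the
 round trip from R to a Record Batch and back, along with a few ways to
 inspect the result:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# record_batch() accepts a data frame as its input
rb_from_df &lt;- record_batch(data.frame(x = 1:3, y = c("a", "b", "c")))

dim(rb_from_df)     # rows and columns, as for a data frame
names(rb_from_df)   # column names
rb_from_df$schema   # the Schema that accompanies the columns

# Convert back to an R data frame
as.data.frame(rb_from_df)</code></pre></div>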
 </div>
 <div class="section level2">
 <h2 id="tables">Tables<a class="anchor" aria-label="anchor" href="#tables"></a>
 </h2>
 <p>A Table is comprised of named Chunked Arrays, in the same way that a
 Record Batch is comprised of named Arrays. Like Record Batches, Tables
 include an explicit Schema specifying the name and data type for each
 Chunked Array.</p>
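 <p>As a quick sketch of this structure, we can build a small Table
 directly from named R vectors with <code>arrow_table()</code> and check
 the class of one of its columns. The names <code>tbl</code>,
 <code>x</code>, and <code>y</code> are illustrative only:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(arrow)

# Build a Table directly from named R vectors
tbl &lt;- arrow_table(x = 1:3, y = c("a", "b", "c"))

class(tbl$x)       # each column is a ChunkedArray rather than an Array
tbl$x$num_chunks   # how many Arrays make up this column
tbl$schema         # the Schema describing both columns</code></pre></div>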
 <p>You can subset Tables with <code>$</code>, <code>[[</code>, and
 <code>[</code> the same way you can for Record Batches. Unlike Record
 Batches, Tables can be concatenated (because they are comprised of
 Chunked Arrays). Suppose a second Record Batch arrives:</p>
 <div class="sourceCode" id="cb26"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">new_rb</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/record_batch.html">record_batch</a></span><span class="op">(</span></span>
 <span>  strs <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"I"</span>, <span class="st">"love"</span>, <span class="st">"you"</span><span class="op">)</span>,</span>
 <span>  ints <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">5L</span>, <span class="fl">0L</span>, <span class="fl">0L</span><span class="op">)</span>,</span>
 <span>  dbls <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">7.1</span>, <span class="op">-</span><span class="fl">0.1</span>, <span class="fl">2</span><span class="op">)</span></span>
 <span><span class="op">)</span></span></code></pre></div>
 <p>It is not possible to create a Record Batch that appends the data
 from <code>new_rb</code> to the data in <code>rb</code> without
 creating entirely new objects in memory. With Tables, however, we
 can:</p>
 <div class="sourceCode" id="cb27"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">df</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">rb</span><span class="op">)</span></span>
 <span><span class="va">new_df</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">new_rb</span><span class="op">)</span></span></code></pre></div>
 <p>We now have the two fragments of the data set represented as Tables.
 The difference between the Table and the Record Batch is that the
 columns are all represented as Chunked Arrays. Each Array from the
 original Record Batch is one chunk in the corresponding Chunked Array in
 the Table:</p>
 <div class="sourceCode" id="cb28"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">rb</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div>
 <pre><code><span><span class="co">## Array</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   "hello",</span></span>
 <span><span class="co">##   "amazing",</span></span>
 <span><span class="co">##   "and",</span></span>
 <span><span class="co">##   "cruel",</span></span>
 <span><span class="co">##   "world"</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <div class="sourceCode" id="cb30"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">df</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "hello",</span></span>
 <span><span class="co">##     "amazing",</span></span>
 <span><span class="co">##     "and",</span></span>
 <span><span class="co">##     "cruel",</span></span>
 <span><span class="co">##     "world"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 <p>It’s the same underlying data – and indeed the same immutable Array
 is referenced by both – just enclosed by a new, flexible Chunked Array
 wrapper. However, it is this wrapper that allows us to concatenate
 Tables:</p>
 <div class="sourceCode" id="cb32"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="fu"><a href="../reference/concat_tables.html">concat_tables</a></span><span class="op">(</span><span class="va">df</span>, <span class="va">new_df</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## Table</span></span>
 <span><span class="co">## 8 rows x 3 columns</span></span>
 <span><span class="co">## $strs &lt;string&gt;</span></span>
 <span><span class="co">## $ints &lt;int32&gt;</span></span>
 <span><span class="co">## $dbls &lt;double&gt;</span></span></code></pre>
 <p>The resulting object is shown schematically below:</p>
 <p><img src="table.png" width="100%"></p>
 <p>Notice that the Chunked Arrays within the new Table retain this
 chunking structure, because none of the original Arrays have been
 moved:</p>
 <div class="sourceCode" id="cb34"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">df_both</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/concat_tables.html">concat_tables</a></span><span class="op">(</span><span class="va">df</span>, <span class="va">new_df</span><span class="op">)</span></span>
 <span><span class="va">df_both</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "hello",</span></span>
 <span><span class="co">##     "amazing",</span></span>
 <span><span class="co">##     "and",</span></span>
 <span><span class="co">##     "cruel",</span></span>
 <span><span class="co">##     "world"</span></span>
 <span><span class="co">##   ],</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "I",</span></span>
 <span><span class="co">##     "love",</span></span>
 <span><span class="co">##     "you"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
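 <p>As a quick check on this behavior, the sketch below builds two small
 Record Batches of its own, concatenates them as Tables, and inspects the
 chunks of the resulting column with the <code>num_chunks</code> field and
 the <code>chunk()</code> method of the Chunked Array. The names
 <code>rb_1</code>, <code>rb_2</code> and <code>tbl_demo</code> are
 illustrative and are not part of the example above:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)

# Two small Record Batches with the same schema (illustrative data)
rb_1 &lt;- record_batch(x = c(1L, 2L, 3L))
rb_2 &lt;- record_batch(x = c(4L, 5L))

# Concatenating the Tables keeps each original Array as its own chunk
tbl_demo &lt;- concat_tables(arrow_table(rb_1), arrow_table(rb_2))

tbl_demo$x$num_chunks  # how many chunks make up the column
tbl_demo$x$chunk(0)    # the Array that came from rb_1
tbl_demo$x$chunk(1)    # the Array that came from rb_2</code></pre></div>
 <p>Note that chunks are indexed from zero, so <code>chunk(0)</code>
 refers to the first chunk.</p>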
 </div>
 <div class="section level2">
 <h2 id="datasets">Datasets<a class="anchor" aria-label="anchor" href="#datasets"></a>
 </h2>
 <p>Like Record Batch and Table objects, a Dataset is used to represent
 tabular data. At an abstract level, a Dataset can be viewed as an object
 comprised of rows and columns, and just like Record Batches and Tables,
 it contains an explicit Schema that specifies the name and data type
 associated with each column.</p>
 <p>However, where Tables and Record Batches are data explicitly
 represented in-memory, a Dataset is not. Instead, a Dataset is an
 abstraction that refers to data stored on-disk in one or more files.
 Values stored in the data files are loaded into memory as a batched
 process. Loading takes place only as needed, and only when a query is
 executed against the data. In this respect Arrow Datasets are a very
 different kind of object to Arrow Tables, but the dplyr commands used to
 analyze them are essentially identical. In this section we’ll talk about
 how Datasets are structured. If you want to learn more about the
 practical details of analyzing Datasets, see the article on <a href="./dataset.html">analyzing multi-file datasets</a>.</p>
 <div class="section level3">
 <h3 id="the-on-disk-data-files">The on-disk data files<a class="anchor" aria-label="anchor" href="#the-on-disk-data-files"></a>
 </h3>
 <p>Reduced to its simplest form, the on-disk structure of a Dataset is
 simply a collection of data files, each storing one subset of the data.
 These subsets are sometimes referred to as “fragments”, and the
 partitioning process is sometimes referred to as “sharding”. By
 convention, these files are organized into a folder structure called a
 Hive-style partition: see <code><a href="../reference/hive_partition.html">hive_partition()</a></code> for details.</p>
 <p>To illustrate how this works, let’s write a multi-file dataset to
 disk manually, without using any of the Arrow Dataset functionality to
 do the work. We’ll start with three small data frames, each of which
 contains one subset of the data we want to store:</p>
 <div class="sourceCode" id="cb36"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">df_a</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">1</span><span class="op">:</span><span class="fl">5</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"a"</span><span class="op">)</span></span>
 <span><span class="va">df_b</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">6</span><span class="op">:</span><span class="fl">10</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"b"</span><span class="op">)</span></span>
 <span><span class="va">df_c</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">11</span><span class="op">:</span><span class="fl">15</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"c"</span><span class="op">)</span></span></code></pre></div>
 <p>Our intention is that each of the data frames should be stored in a
 separate data file. As you can see, the partitioning is highly
 structured: all data where <code>subset = "a"</code> belong to one
 file, all data where <code>subset = "b"</code> belong to another file,
 and all data where <code>subset = "c"</code> belong to the third
 file.</p>
 <p>The first step is to define and create a folder that will hold all
 the files:</p>
 <div class="sourceCode" id="cb37"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds_dir</span> <span class="op">&lt;-</span> <span class="st">"mini-dataset"</span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir</span><span class="op">)</span></span></code></pre></div>
 <p>The next step is to manually create the Hive-style folder
 structure:</p>
 <div class="sourceCode" id="cb38"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds_dir_a</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=a"</span><span class="op">)</span></span>
 <span><span class="va">ds_dir_b</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=b"</span><span class="op">)</span></span>
 <span><span class="va">ds_dir_c</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=c"</span><span class="op">)</span></span>
 <span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_a</span><span class="op">)</span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_b</span><span class="op">)</span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_c</span><span class="op">)</span></span></code></pre></div>
 <p>Notice that we have named each folder in a “key=value” format that
 exactly describes the subset of data that will be written into that
 folder. This naming structure is the essence of Hive-style
 partitions.</p>
 <p>Now that we have the folders, we’ll use <code><a href="../reference/write_parquet.html">write_parquet()</a></code>
 to create a single parquet file for each of the three subsets:</p>
 <div class="sourceCode" id="cb39"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_a</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_a</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_b</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_b</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_c</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_c</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
 <p>If we had wanted to, we could have further subdivided the dataset. A
 folder could contain multiple files (<code>part-0.parquet</code>,
 <code>part-1.parquet</code>, etc.) if we wanted it to. Similarly, there
 is no particular reason to name the files <code>part-0.parquet</code>
 at all: it would have been fine to call these files
 <code>subset-a.parquet</code>, <code>subset-b.parquet</code>, and
 <code>subset-c.parquet</code> if we had wished. We could have written
 other file formats if we wanted, and we don’t necessarily have to use
 Hive-style folders. You can learn more about the supported formats by
 reading the help documentation for <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, and
 learn about how to exercise fine-grained control with
 <code><a href="../reference/Dataset.html">help("Dataset", package = "arrow")</a></code>.</p>
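 <p>As a minimal sketch of that flexibility, the code below writes two
 CSV files with arbitrary names into a flat folder (no Hive-style
 subfolders) and opens them as a single Dataset by passing
 <code>format = "csv"</code> to <code>open_dataset()</code>. The folder
 name, file names and data are made up for illustration:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)

# A temporary folder holding two arbitrarily named CSV files
csv_dir &lt;- tempfile("csv-demo-")
dir.create(csv_dir)

write_csv_arrow(
  data.frame(id = 1:3, value = c(0.5, 1.5, 2.5)),
  file.path(csv_dir, "first.csv")
)
write_csv_arrow(
  data.frame(id = 4:6, value = c(3.5, 4.5, 5.5)),
  file.path(csv_dir, "second.csv")
)

# Tell open_dataset() that the files are CSV rather than parquet
csv_ds &lt;- open_dataset(csv_dir, format = "csv")
csv_ds$files</code></pre></div>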
 <p>In any case, we have created an on-disk parquet Dataset using
 Hive-style partitioning. Our Dataset is defined by these files:</p>
 <div class="sourceCode" id="cb40"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/list.files.html" class="external-link">list.files</a></span><span class="op">(</span><span class="va">ds_dir</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## [1] "subset=a/part-0.parquet" "subset=b/part-0.parquet"</span></span>
 <span><span class="co">## [3] "subset=c/part-0.parquet"</span></span></code></pre>
 <p>To verify that everything has worked, let’s open the data with
 <code><a href="../reference/open_dataset.html">open_dataset()</a></code> and call <code><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse()</a></code> to inspect
 its contents:</p>
 <div class="sourceCode" id="cb42"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="va">ds_dir</span><span class="op">)</span></span>
 <span><span class="fu"><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse</a></span><span class="op">(</span><span class="va">ds</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## FileSystemDataset with 3 Parquet files</span></span>
 <span><span class="co">## 15 rows x 3 columns</span></span>
 <span><span class="co">## $ id      <span style="color: #949494; font-style: italic;">&lt;int32&gt;</span> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</span></span>
 <span><span class="co">## $ value  <span style="color: #949494; font-style: italic;">&lt;double&gt;</span> -1.400043517, 0.255317055, -2.437263611, -0.005571287, 0.62155~</span></span>
 <span><span class="co">## $ subset <span style="color: #949494; font-style: italic;">&lt;string&gt;</span> "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c~</span></span>
 <span><span class="co">## Call `print()` for full schema details</span></span></code></pre>
 <p>As you can see, the <code>ds</code> Dataset object aggregates the
 three separate data files. In fact, in this particular case the Dataset
 is so small that values from all three files appear in the output of
 <code><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse()</a></code>.</p>
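 <p>The <code>open_dataset()</code> function also accepts a
 <code>partitioning</code> argument, which lets you state the partition
 keys and their types explicitly with <code>hive_partition()</code>. The
 sketch below uses a throwaway dataset in a temporary folder; the names
 <code>hive_dir</code> and <code>hive_ds</code> are illustrative. Here the
 data files contain only an <code>id</code> column, so the
 <code>subset</code> column in the schema comes from the folder
 names:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)

# Write a tiny Hive-style dataset to a temporary folder (illustrative data)
hive_dir &lt;- tempfile("hive-demo-")
dir.create(file.path(hive_dir, "subset=a"), recursive = TRUE)
dir.create(file.path(hive_dir, "subset=b"), recursive = TRUE)
write_parquet(data.frame(id = 1:2), file.path(hive_dir, "subset=a", "part-0.parquet"))
write_parquet(data.frame(id = 3:4), file.path(hive_dir, "subset=b", "part-0.parquet"))

# Declare the partition key and its type rather than relying on inference
hive_ds &lt;- open_dataset(
  hive_dir,
  partitioning = hive_partition(subset = string())
)
hive_ds$schema</code></pre></div>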
 <p>It should be noted that in everyday data analysis work, you wouldn’t
 need to write the data files manually in this fashion. The example
 above is entirely for illustrative purposes. The exact same dataset
 could be created with the following command:</p>
 <div class="sourceCode" id="cb44"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">subset</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"mini-dataset"</span><span class="op">)</span></span></code></pre></div>
 <p>In fact, even if <code>ds</code> happens to refer to a data source
 that is larger than memory, this command should still work: the
 Dataset functionality is designed so that, in a pipeline such as this,
 the data is loaded piecewise in order to avoid exhausting
 memory.</p>
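 <p>The same pattern also works when the starting point is an ordinary
 data frame. The sketch below is a self-contained variant that writes to a
 temporary folder, so it does not touch <code>mini-dataset</code>; the
 names <code>demo_df</code> and <code>out_dir</code> are illustrative. It
 relies on <code>write_dataset()</code> using the current
 <code>group_by()</code> columns as the partitioning:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)
library(dplyr)

# A small data frame with a column we want to partition on (illustrative data)
demo_df &lt;- data.frame(
  id = 1:6,
  subset = rep(c("a", "b", "c"), each = 2)
)
out_dir &lt;- tempfile("write-demo-")

# The grouping variable becomes the Hive-style partition key
demo_df |&gt;
  group_by(subset) |&gt;
  write_dataset(out_dir)

# One folder per value of `subset` is expected
list.files(out_dir, recursive = TRUE)</code></pre></div>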
 </div>
 <div class="section level3">
 <h3 id="the-dataset-object">The Dataset object<a class="anchor" aria-label="anchor" href="#the-dataset-object"></a>
 </h3>
 <p>In the previous section we examined the on-disk structure of a
 Dataset. We now turn to the in-memory structure of the Dataset object
 itself (i.e., <code>ds</code> in the previous example). When the Dataset
 object is created, arrow searches the dataset folder looking for
 appropriate files, but does not load the contents of those files. Paths
 to these files are stored in an active binding
 <code>ds$files</code>:</p>
 <div class="sourceCode" id="cb45"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds</span><span class="op">$</span><span class="va">files</span></span></code></pre></div>
 <pre><code><span><span class="co">## [1] "/build/r/vignettes/mini-dataset/subset=a/part-0.parquet"</span></span>
 <span><span class="co">## [2] "/build/r/vignettes/mini-dataset/subset=b/part-0.parquet"</span></span>
 <span><span class="co">## [3] "/build/r/vignettes/mini-dataset/subset=c/part-0.parquet"</span></span></code></pre>
 <p>The other thing that happens when <code><a href="../reference/open_dataset.html">open_dataset()</a></code> is
 called is that an explicit Schema for the Dataset is constructed and
 stored as <code>ds$schema</code>:</p>
 <div class="sourceCode" id="cb47"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds</span><span class="op">$</span><span class="va">schema</span></span></code></pre></div>
 <pre><code><span><span class="co">## Schema</span></span>
 <span><span class="co">## id: int32</span></span>
 <span><span class="co">## value: double</span></span>
 <span><span class="co">## subset: string</span></span>
 <span><span class="co">## </span></span>
 <span><span class="co">## See $metadata for additional Schema metadata</span></span></code></pre>
 <p>By default this Schema is inferred by inspecting the first file only,
 though it is possible to construct a unified schema after inspecting all
 files. To do this, set <code>unify_schemas = TRUE</code> when calling
 <code><a href="../reference/open_dataset.html">open_dataset()</a></code>. It is also possible to use the
 <code>schema</code> argument to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to specify
 the Schema explicitly (see the <code><a href="../reference/schema.html">schema()</a></code> function for
 details).</p>
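 <p>To make the two options concrete, here is a small sketch in which the
 second file carries an extra column that the first file lacks. The folder,
 file contents and object names are invented for illustration. With
 <code>unify_schemas = TRUE</code> the extra column should appear in the
 Dataset schema, and the same schema can also be supplied by hand:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)

# Two parquet files whose columns differ (illustrative data)
schema_dir &lt;- tempfile("schema-demo-")
dir.create(schema_dir)
write_parquet(data.frame(id = 1:2), file.path(schema_dir, "part-0.parquet"))
write_parquet(
  data.frame(id = 3:4, note = c("x", "y")),
  file.path(schema_dir, "part-1.parquet")
)

# Inspect every file and merge the schemas that are found
unified_ds &lt;- open_dataset(schema_dir, unify_schemas = TRUE)
unified_ds$schema$names

# Alternatively, state the Schema explicitly
explicit_ds &lt;- open_dataset(
  schema_dir,
  schema = schema(id = int32(), note = string())
)
explicit_ds$schema$names</code></pre></div>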
 <p>The act of reading the data is performed by a Scanner object. When
 analyzing a Dataset using the dplyr interface you never need to
 construct a Scanner manually, but for explanatory purposes we’ll do it
 here:</p>
 <div class="sourceCode" id="cb49"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">scan</span> <span class="op">&lt;-</span> <span class="va">Scanner</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span>dataset <span class="op">=</span> <span class="va">ds</span><span class="op">)</span></span></code></pre></div>
 <p>Calling the <code>ToTable()</code> method will materialize the
 Dataset (on-disk) as a Table (in-memory):</p>
 <div class="sourceCode" id="cb50"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">scan</span><span class="op">$</span><span class="fu">ToTable</span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## Table</span></span>
 <span><span class="co">## 15 rows x 3 columns</span></span>
 <span><span class="co">## $id &lt;int32&gt;</span></span>
 <span><span class="co">## $value &lt;double&gt;</span></span>
 <span><span class="co">## $subset &lt;string&gt;</span></span>
 <span><span class="co">## </span></span>
 <span><span class="co">## See $metadata for additional Schema metadata</span></span></code></pre>
 <p>This scanning process is multi-threaded by default, but if necessary
 threading can be disabled by setting <code>use_threads = FALSE</code>
 when calling <code>Scanner$create()</code>.</p>
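 <p>For instance, a single-threaded scan might be set up as in the sketch
 below. It uses a throwaway one-file dataset in a temporary folder, and
 the object names are illustrative:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)

# A one-file dataset in a temporary folder (illustrative data)
scan_dir &lt;- tempfile("scan-demo-")
dir.create(scan_dir)
write_parquet(
  data.frame(id = 1:4, value = c(-1, 2, -3, 4)),
  file.path(scan_dir, "part-0.parquet")
)
scan_ds &lt;- open_dataset(scan_dir)

# Disable multi-threading for this scan only
single_thread_scan &lt;- Scanner$create(dataset = scan_ds, use_threads = FALSE)
single_thread_scan$ToTable()</code></pre></div>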
 </div>
 <div class="section level3">
 <h3 id="querying-a-dataset">Querying a Dataset<a class="anchor" aria-label="anchor" href="#querying-a-dataset"></a>
 </h3>
 <p>When a query is executed against a Dataset, a new scan is initiated
 and the results are pulled back into R. As an example, consider the
 following dplyr expression:</p>
 <div class="sourceCode" id="cb52"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">ds</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">value</span> <span class="op">&gt;</span> <span class="fl">0</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>new_value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/Round.html" class="external-link">round</a></span><span class="op">(</span><span class="fl">100</span> <span class="op">*</span> <span class="va">value</span><span class="op">)</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">id</span>, <span class="va">subset</span>, <span class="va">new_value</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 6 x 3</span></span></span>
 <span><span class="co">##      id subset new_value</span></span>
 <span><span class="co">##   <span style="color: #949494; font-style: italic;">&lt;int&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>      <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span></span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">1</span>     2 a             26</span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">2</span>     5 a             62</span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">3</span>     6 b            115</span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">4</span>    12 c             63</span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">5</span>    13 c            207</span></span>
 <span><span class="co">## <span style="color: #BCBCBC;">6</span>    15 c             51</span></span></code></pre>
 <p>We can replicate this using the low-level Dataset interface by
 creating a new scan and specifying the <code>filter</code> and
 <code>projection</code> arguments to <code>Scanner$create()</code>. To
 use these arguments you need to know a little about Arrow Expressions,
 for which you may find it helpful to read the help documentation in
 <code><a href="../reference/Expression.html">help("Expression", package = "arrow")</a></code>.</p>
 <p>The scanner defined below mimics the dplyr pipeline shown above,</p>
 <div class="sourceCode" id="cb54"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">scan</span> <span class="op">&lt;-</span> <span class="va">Scanner</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span>
 <span>  dataset <span class="op">=</span> <span class="va">ds</span>,</span>
 <span>  filter <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"value"</span><span class="op">)</span> <span class="op">&gt;</span> <span class="fl">0</span>,</span>
 <span>  projection <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span></span>
 <span>    id <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"id"</span><span class="op">)</span>,</span>
 <span>    subset <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"subset"</span><span class="op">)</span>,</span>
 <span>    new_value <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="st">"round"</span>, <span class="fl">100</span> <span class="op">*</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"value"</span><span class="op">)</span><span class="op">)</span></span>
 <span>  <span class="op">)</span></span>
 <span><span class="op">)</span></span></code></pre></div>
 <p>and if we were to call <code>as.data.frame(scan$ToTable())</code> it
 would produce the same result as the dplyr version, though the rows may
 not appear in the same order.</p>
 <p>To get a better sense of what happens when the query executes,
 we’ll call <code>scan$ScanBatches()</code>. Much like the
 <code>ToTable()</code> method, the <code>ScanBatches()</code> method
 executes the query separately against each of the files, but it returns
 a list of Record Batches, one for each file. In addition, we’ll convert
 these Record Batches to data frames individually:</p>
 <div class="sourceCode" id="cb55"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/lapply.html" class="external-link">lapply</a></span><span class="op">(</span><span class="va">scan</span><span class="op">$</span><span class="fu">ScanBatches</span><span class="op">(</span><span class="op">)</span>, <span class="va">as.data.frame</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## [[1]]</span></span>
 <span><span class="co">##   id subset new_value</span></span>
 <span><span class="co">## 1  2      a        26</span></span>
 <span><span class="co">## 2  5      a        62</span></span>
 <span><span class="co">## </span></span>
 <span><span class="co">## [[2]]</span></span>
 <span><span class="co">##   id subset new_value</span></span>
 <span><span class="co">## 1  6      b       115</span></span>
 <span><span class="co">## </span></span>
 <span><span class="co">## [[3]]</span></span>
 <span><span class="co">##   id subset new_value</span></span>
 <span><span class="co">## 1 12      c        63</span></span>
 <span><span class="co">## 2 13      c       207</span></span>
 <span><span class="co">## 3 15      c        51</span></span></code></pre>
 <p>If we return to the dplyr query we made earlier, and use
 <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">compute()</a></code> to return a Table rather than use
 <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code> to return a data frame, we can see the evidence
 of this process at work. The Table object is created by concatenating
 the three Record Batches produced when the query executes against three
 data files, and as a consequence of this the Chunked Array that defines
 a column of the Table mirrors the partitioning structure present in the
 data files:</p>
 <div class="sourceCode" id="cb57"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="va">tbl</span> <span class="op">&lt;-</span> <span class="va">ds</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">value</span> <span class="op">&gt;</span> <span class="fl">0</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>new_value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/Round.html" class="external-link">round</a></span><span class="op">(</span><span class="fl">100</span> <span class="op">*</span> <span class="va">value</span><span class="op">)</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">id</span>, <span class="va">subset</span>, <span class="va">new_value</span><span class="op">)</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">compute</a></span><span class="op">(</span><span class="op">)</span></span>
 <span></span>
 <span><span class="va">tbl</span><span class="op">$</span><span class="va">subset</span></span></code></pre></div>
 <pre><code><span><span class="co">## ChunkedArray</span></span>
 <span><span class="co">## &lt;string&gt;</span></span>
 <span><span class="co">## [</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "a",</span></span>
 <span><span class="co">##     "a"</span></span>
 <span><span class="co">##   ],</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "b"</span></span>
 <span><span class="co">##   ],</span></span>
 <span><span class="co">##   [</span></span>
 <span><span class="co">##     "c",</span></span>
 <span><span class="co">##     "c",</span></span>
 <span><span class="co">##     "c"</span></span>
 <span><span class="co">##   ]</span></span>
 <span><span class="co">## ]</span></span></code></pre>
 </div>
 <div class="section level3">
 <h3 id="additional-notes">Additional notes<a class="anchor" aria-label="anchor" href="#additional-notes"></a>
 </h3>
 <ul>
 <li><p>A distinction ignored in the previous discussion is between
 <code>FileSystemDataset</code> and <code>InMemoryDataset</code> objects.
 In the usual case, the data that comprise a Dataset are stored in files
 on-disk. That is, after all, the primary advantage of Datasets over
 Tables. However, there are cases where it may be useful to make a
 Dataset from data that are already stored in-memory. In such cases the
 object created will have type <code>InMemoryDataset</code>.</p></li>
 <li><p>The previous discussion assumes that all files stored in the
 Dataset have the same Schema. In the usual case this will be true,
 because each file is conceptually a subset of a single rectangular
 table. But this is not strictly required.</p></li>
 </ul>
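 <p>As a brief illustration of the first point, the sketch below wraps a
 small data frame in an <code>InMemoryDataset</code> and queries it with
 the same dplyr verbs used for file-based Datasets. The name
 <code>mem_ds</code> and the data are illustrative:</p>
 <div class="sourceCode"><pre class="downlit sourceCode r">
 <code class="sourceCode R">library(arrow)
library(dplyr)

# Build a Dataset from data that already lives in memory
mem_ds &lt;- InMemoryDataset$create(
  data.frame(id = 1:4, value = c(-1, 2, -3, 4))
)
class(mem_ds)

# The dplyr interface is the same as for a FileSystemDataset
mem_ds |&gt;
  filter(value &gt; 0) |&gt;
  collect()</code></pre></div>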
 <p>For more information about these topics, see
 <code><a href="../reference/Dataset.html">help("Dataset", package = "arrow")</a></code>.</p>
 </div>
 </div>
 <div class="section level2">
 <h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a>
 </h2>
 <ul>
 <li>To learn more about the internal structure of Arrays, see the
 article on <a href="./developers/data_object_layout.html">data object
 layout</a>.</li>
 <li>To learn more about the different data types used by Arrow, see the
 article on <a href="./data_types.html">data types</a>.</li>
 <li>To learn more about how Arrow objects are implemented, see the <a href="https://arrow.apache.org/docs/format/Columnar.html" class="external-link">Arrow
 specification</a> page.</li>
 </ul>
 </div>
   </main>
 </div>


 </div>


   </body>
 </html>