| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <meta name="description" content="Learn about Scalar, Array, Table, and Dataset objects in arrow (among others), how they relate to each other, as well as their relationships to familiar R objects like data frames and vectors |
| "> |
| <title>Data objects • Arrow R Package</title> |
| <!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png"> |
| <link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png"> |
| <script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script> |
| <link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"> |
| <script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous"> |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous"> |
| <!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js" integrity="sha512-7O5pXpc0oCRrxk8RUfDYFgn0nO1t+jLuIOQdOMRp4APB7uZ4vSjspzp5y6YDtDs4VzUSTbWzBFZ/LKJhnyFOKw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"> |
| <meta property="og:title" content="Data objects"> |
| <meta property="og:description" content="Learn about Scalar, Array, Table, and Dataset objects in arrow (among others), how they relate to each other, as well as their relationships to familiar R objects like data frames and vectors |
| "> |
| <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"> |
| <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"> |
| <meta name="twitter:card" content="summary_large_image"> |
| <meta name="twitter:creator" content="@apachearrow"> |
| <meta name="twitter:site" content="@apachearrow"> |
| <!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--><!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script><!-- End Matomo Code --> |
| </head> |
| <body> |
| <a href="#main" class="visually-hidden-focusable">Skip to contents</a> |
| |
| |
| <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container"> |
| |
| <a class="navbar-brand me-2" href="../index.html">Arrow R Package</a> |
| |
| <span class="version"> |
| <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">16.1.0.9000</small> |
| </span> |
| |
| |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| <div id="navbar" class="collapse navbar-collapse ms-3"> |
| <ul class="navbar-nav me-auto"> |
| <li class="nav-item"> |
| <a class="nav-link" href="../articles/arrow.html">Get started</a> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../reference/index.html">Reference</a> |
| </li> |
| <li class="active nav-item dropdown"> |
| <a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a> |
| <div class="dropdown-menu" aria-labelledby="dropdown-articles"> |
| <h6 class="dropdown-header" data-toc-skip>Using the package</h6> |
| <a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a> |
| <a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a> |
| <a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a> |
| <a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a> |
| <a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a> |
| <a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6> |
| <a class="dropdown-item" href="../articles/data_objects.html">Data objects</a> |
| <a class="dropdown-item" href="../articles/data_types.html">Data types</a> |
| <a class="dropdown-item" href="../articles/metadata.html">Metadata</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Installation</h6> |
| <a class="dropdown-item" href="../articles/install.html">Installing on Linux</a> |
| <a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a> |
| <div class="dropdown-divider"></div> |
| <a class="dropdown-item" href="../articles/index.html">More articles...</a> |
| </div> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../news/index.html">Changelog</a> |
| </li> |
| </ul> |
| <form class="form-inline my-2 my-lg-0" role="search"> |
| <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off"> |
| </form> |
| |
| <ul class="navbar-nav"> |
| <li class="nav-item"> |
| <a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github"> |
| <span class="fab fa fab fa-github fa-lg"></span> |
| |
| </a> |
| </li> |
| </ul> |
| </div> |
| |
| |
| </div> |
| </nav><div class="container template-article"> |
| |
| |
| |
| |
| <div class="row"> |
| <main id="main" class="col-md-9"><div class="page-header"> |
| <img src="" class="logo" alt=""><h1>Data objects</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/data_objects.Rmd" class="external-link"><code>vignettes/data_objects.Rmd</code></a></small> |
| <div class="d-none name"><code>data_objects.Rmd</code></div> |
| </div> |
| |
| |
| |
| <p>This article describes the various data object types supplied by |
| arrow, and documents how these objects are structured.</p> |
| <p>The arrow package supplies several object classes that are used to |
| represent data. <code>RecordBatch</code>, <code>Table</code>, and |
| <code>Dataset</code> objects are two-dimensional rectangular data |
| structures used to store tabular data. For columnar, one-dimensional |
| data, the <code>Array</code> and <code>ChunkedArray</code> classes are |
| provided. Finally, <code>Scalar</code> objects represent individual |
| values. The table below summarizes these objects and shows how you can |
| create new instances using the <a href="https://r6.r-lib.org/" class="external-link"><code>R6</code></a> class object, as well |
| as convenience functions that provide the same functionality in a more |
| traditional R-like fashion:</p> |
| <table class="table"> |
| <colgroup> |
| <col width="2%"> |
| <col width="12%"> |
| <col width="42%"> |
| <col width="41%"> |
| </colgroup> |
| <thead><tr class="header"> |
| <th>Dim</th> |
| <th>Class</th> |
| <th>How to create an instance</th> |
| <th>Convenience function</th> |
| </tr></thead> |
| <tbody> |
| <tr class="odd"> |
| <td>0</td> |
| <td><code>Scalar</code></td> |
| <td><code>Scalar$create(value, type)</code></td> |
| <td></td> |
| </tr> |
| <tr class="even"> |
| <td>1</td> |
| <td><code>Array</code></td> |
| <td><code>Array$create(vector, type)</code></td> |
| <td><code>as_arrow_array(x)</code></td> |
| </tr> |
| <tr class="odd"> |
| <td>1</td> |
| <td><code>ChunkedArray</code></td> |
| <td><code>ChunkedArray$create(..., type)</code></td> |
| <td><code>chunked_array(..., type)</code></td> |
| </tr> |
| <tr class="even"> |
| <td>2</td> |
| <td><code>RecordBatch</code></td> |
| <td><code>RecordBatch$create(...)</code></td> |
| <td><code>record_batch(...)</code></td> |
| </tr> |
| <tr class="odd"> |
| <td>2</td> |
| <td><code>Table</code></td> |
| <td><code>Table$create(...)</code></td> |
| <td><code>arrow_table(...)</code></td> |
| </tr> |
| <tr class="even"> |
| <td>2</td> |
| <td><code>Dataset</code></td> |
| <td><code>Dataset$create(sources, schema)</code></td> |
| <td><code>open_dataset(sources, schema)</code></td> |
| </tr> |
| </tbody> |
| </table> |
| <p>Later in the article we’ll look at each of these in more detail. For |
| now we note that each of these object classes corresponds to a class of |
| the same name in the underlying Arrow C++ library.</p> |
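| <p>As a minimal sketch of the two construction styles listed in the |
| table, the code below builds the same small objects once with the R6 |
| <code>$create()</code> methods and once with the convenience functions. |
| The variable names and values are made up for illustration, and the |
| sketch assumes the arrow package is installed:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| # One-dimensional: R6 method versus convenience function |
| arr_r6 <- Array$create(c(1L, 2L, 3L)) |
| arr_fn <- as_arrow_array(c(1L, 2L, 3L)) |
| # Two-dimensional: R6 method versus convenience function |
| tbl_r6 <- Table$create(x = c(1L, 2L, 3L), y = c("a", "b", "c")) |
| tbl_fn <- arrow_table(x = c(1L, 2L, 3L), y = c("a", "b", "c")) |
| # Both routes hold the same values once converted back to R |
| identical(as.vector(arr_r6), as.vector(arr_fn)) |
| identical(as.data.frame(tbl_r6), as.data.frame(tbl_fn)) |
| </code></pre></div> |
| <p>Which style to use is largely a matter of taste: the convenience |
| functions read more like ordinary R, while the R6 methods mirror the |
| class names used throughout this article.</p> |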
| <p>In addition to these data objects, arrow defines the following |
| classes for representing metadata:</p> |
| <ul> |
| <li>A <code>Schema</code> is a list of <code>Field</code> objects used |
| to describe the structure of a tabular data object; where</li> |
| <li>A <code>Field</code> specifies a character string name and a |
| <code>DataType</code>; and</li> |
| <li>A <code>DataType</code> is an attribute controlling how values are |
| represented</li> |
| </ul> |
| <p>These metadata objects play an important role in making sure data are |
| represented correctly, and all three of the tabular data object types |
| (Record Batch, Table, and Dataset) include explicit Schema objects used |
| to represent metadata. To learn more about these metadata classes, see |
| the <a href="./metadata.html">metadata article</a>.</p> |
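| <p>The relationship between the three metadata classes can be sketched |
| as follows. This is only an illustration: the field names and types are |
| invented for the example, and the sketch assumes the arrow package is |
| installed:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| # A Field pairs a character string name with a DataType |
| f <- field("ints", int32()) |
| f$name |
| f$type |
| # A Schema is an ordered collection of Fields |
| sch <- schema(strs = utf8(), ints = int32(), dbls = float64()) |
| sch$names |
| # Tabular data objects carry a Schema describing their columns |
| tbl <- arrow_table(strs = c("a", "b"), ints = c(1L, 2L), dbls = c(0.5, 1.5)) |
| tbl$schema |
| </code></pre></div> |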
| <div class="section level2"> |
| <h2 id="scalars">Scalars<a class="anchor" aria-label="anchor" href="#scalars"></a> |
| </h2> |
| <p>A Scalar object is simply a single value that can be of any type. It |
| might be an integer, a string, a timestamp, or a value of any other |
| <code>DataType</code> that Arrow supports. Most users of the |
| arrow R package are unlikely to create Scalars directly, but should |
| there be a need you can do this by calling the |
| <code>Scalar$create()</code> method:</p> |
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">Scalar</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="st">"hello"</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Scalar</span></span> |
| <span><span class="co">## hello</span></span></code></pre> |
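| <p>The <code>type</code> argument listed in the summary table is |
| optional. As a hedged sketch (the values are illustrative, and the |
| arrow package is assumed to be installed), the code below lets arrow |
| infer the type from the R value and then requests a specific type |
| instead:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| # Let arrow infer the Arrow type from the R value |
| s_int <- Scalar$create(42L) |
| s_int$type |
| # Or request a specific Arrow type |
| s_big <- Scalar$create(42L, type = int64()) |
| s_big$type |
| # Convert the Scalar back to an R value |
| as.vector(s_int) |
| </code></pre></div> |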
| </div> |
| <div class="section level2"> |
| <h2 id="arrays">Arrays<a class="anchor" aria-label="anchor" href="#arrays"></a> |
| </h2> |
| <p>Array objects are ordered sequences of Scalar values. As with Scalars, most |
| users will not need to create Arrays directly, but if the need arises |
| there is an <code>Array$create()</code> method that allows you to create |
| new Arrays:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">integer_array</span> <span class="op"><-</span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">1L</span>, <span class="cn">NA</span>, <span class="fl">2L</span>, <span class="fl">4L</span>, <span class="fl">8L</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="va">integer_array</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <int32></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## 1,</span></span> |
| <span><span class="co">## null,</span></span> |
| <span><span class="co">## 2,</span></span> |
| <span><span class="co">## 4,</span></span> |
| <span><span class="co">## 8</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">string_array</span> <span class="op"><-</span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"hello"</span>, <span class="st">"amazing"</span>, <span class="st">"and"</span>, <span class="st">"cruel"</span>, <span class="st">"world"</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="va">string_array</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>An Array can be subset using square brackets as shown below:</p> |
| <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">5</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>Arrays are immutable objects: once an Array has been created it |
| cannot be modified or extended.</p> |
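| <p>The following sketch, which assumes the arrow package is installed |
| and uses a throwaway variable <code>x</code>, inspects a few properties |
| of an Array and shows that subsetting produces a new Array rather than |
| altering the original:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| x <- Array$create(c(1L, NA, 2L, 4L, 8L)) |
| # Number of elements (nulls included), number of nulls, and the Arrow type |
| length(x) |
| x$null_count |
| x$type |
| # Subsetting returns a new Array; x itself is unchanged |
| y <- x[3:5] |
| as.vector(y) |
| as.vector(x) |
| </code></pre></div> |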
| </div> |
| <div class="section level2"> |
| <h2 id="chunked-arrays">Chunked Arrays<a class="anchor" aria-label="anchor" href="#chunked-arrays"></a> |
| </h2> |
| <p>In practice, most users of the arrow R package are likely to use |
| Chunked Arrays rather than simple Arrays. Under the hood, a Chunked |
| Array is a collection of one or more Arrays that can be indexed <em>as |
| if</em> they were a single Array. The reasons that Arrow provides this |
| functionality are described in the <a href="./developers/data_object_layout.html">data object layout |
| article</a>, but for present purposes it is sufficient to note that |
| Chunked Arrays behave like Arrays in regular data analysis.</p> |
| <p>To illustrate, let’s use the <code><a href="../reference/chunked_array.html">chunked_array()</a></code> |
| function:</p> |
| <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">chunked_string_array</span> <span class="op"><-</span> <span class="fu"><a href="../reference/chunked_array.html">chunked_array</a></span><span class="op">(</span></span> |
| <span> <span class="va">string_array</span>,</span> |
| <span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"I"</span>, <span class="st">"love"</span>, <span class="st">"you"</span><span class="op">)</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <p>The <code><a href="../reference/chunked_array.html">chunked_array()</a></code> function is just a wrapper around |
| the functionality that <code>ChunkedArray$create()</code> provides. |
| Let’s print the object:</p> |
| <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">chunked_string_array</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ],</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "I",</span></span> |
| <span><span class="co">## "love",</span></span> |
| <span><span class="co">## "you"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>The double bracketing in this output is intended to highlight the |
| fact that Chunked Arrays are wrappers around one or more Arrays. |
| However, although it is composed of multiple distinct Arrays, a Chunked Array |
| can be indexed as if those Arrays were laid end-to-end in a single “vector-like” |
| object. This is illustrated below:</p> |
| <p><img src="array_indexing.png" width="100%"></p> |
| <p>We can use <code>chunked_string_array</code> to illustrate this:</p> |
| <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">chunked_string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">7</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ],</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "I",</span></span> |
| <span><span class="co">## "love"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>An important thing to note is that “chunking” is not semantically |
| meaningful. It is an implementation detail only: users should never |
| treat the chunk as a meaningful unit. Writing the data to disk, for |
| example, often results in the data being organized into different |
| chunks. Similarly, two Chunked Arrays that contain the same values |
| assigned to different chunks are deemed equivalent. To illustrate this |
| we can create a Chunked Array that contains the same four |
| values as <code>chunked_string_array[4:7]</code>, but organized into one |
| chunk rather than split into two:</p> |
| <div class="sourceCode" id="cb14"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">cruel_world</span> <span class="op"><-</span> <span class="fu"><a href="../reference/chunked_array.html">chunked_array</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"cruel"</span>, <span class="st">"world"</span>, <span class="st">"I"</span>, <span class="st">"love"</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="va">cruel_world</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world",</span></span> |
| <span><span class="co">## "I",</span></span> |
| <span><span class="co">## "love"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>Testing for equality using <code>==</code> produces an element-wise |
| comparison, and the result is a new Chunked Array of four (boolean type) |
| <code>true</code> values:</p> |
| <div class="sourceCode" id="cb16"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">cruel_world</span> <span class="op">==</span> <span class="va">chunked_string_array</span><span class="op">[</span><span class="fl">4</span><span class="op">:</span><span class="fl">7</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <bool></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## true,</span></span> |
| <span><span class="co">## true,</span></span> |
| <span><span class="co">## true,</span></span> |
| <span><span class="co">## true</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>In short, the intention is that users interact with Chunked Arrays as |
| if they are ordinary one-dimensional data structures without ever having |
| to think much about the underlying chunking arrangement.</p> |
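| <p>To make the point concrete, here is a small self-contained sketch |
| (assuming the arrow package is installed; the object names |
| <code>two_chunks</code> and <code>one_chunk</code> are invented for the |
| example). The two objects differ in their chunk layout, yet length, |
| indexing, and conversion to an R vector ignore that layout:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| two_chunks <- chunked_array(c("cruel", "world"), c("I", "love")) |
| one_chunk <- chunked_array(c("cruel", "world", "I", "love")) |
| # The chunk layouts differ |
| two_chunks$num_chunks |
| one_chunk$num_chunks |
| # Length and indexing treat the chunks as one sequence |
| length(two_chunks) |
| as.vector(two_chunks[2:3]) |
| # Converting to R vectors gives identical results |
| identical(as.vector(two_chunks), as.vector(one_chunk)) |
| </code></pre></div> |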
| <p>Chunked Arrays are mutable, in a specific sense: Arrays can be added |
| to and removed from a Chunked Array.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="record-batches">Record Batches<a class="anchor" aria-label="anchor" href="#record-batches"></a> |
| </h2> |
| <p>A Record Batch is a tabular data structure comprised of named Arrays, |
| and an accompanying Schema that specifies the name and data type |
| associated with each Array. Record Batches are a fundamental unit for |
| data interchange in Arrow, but are not typically used for data analysis. |
| Tables and Datasets are usually more convenient in analytic |
| contexts.</p> |
| <p>These Arrays can be of different types but must all be the same |
| length. Each Array is referred to as one of the “fields” or “columns” of |
| the Record Batch. You can create a Record Batch using the |
| <code><a href="../reference/record_batch.html">record_batch()</a></code> function or by using the |
| <code>RecordBatch$create()</code> method. These functions are flexible |
| and can accept inputs in several formats: you can pass a data frame, one |
| or more named vectors, an input stream, or even a raw vector containing |
| appropriate binary data. For example:</p> |
| <div class="sourceCode" id="cb18"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">rb</span> <span class="op"><-</span> <span class="fu"><a href="../reference/record_batch.html">record_batch</a></span><span class="op">(</span></span> |
| <span> strs <span class="op">=</span> <span class="va">string_array</span>,</span> |
| <span> ints <span class="op">=</span> <span class="va">integer_array</span>,</span> |
| <span> dbls <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">1.1</span>, <span class="fl">3.2</span>, <span class="fl">0.2</span>, <span class="cn">NA</span>, <span class="fl">11</span><span class="op">)</span></span> |
| <span><span class="op">)</span></span> |
| <span><span class="va">rb</span></span></code></pre></div> |
| <pre><code><span><span class="co">## RecordBatch</span></span> |
| <span><span class="co">## 5 rows x 3 columns</span></span> |
| <span><span class="co">## $strs <string></span></span> |
| <span><span class="co">## $ints <int32></span></span> |
| <span><span class="co">## $dbls <double></span></span></code></pre> |
| <p>This is a Record Batch containing 5 rows and 3 columns, and its |
| conceptual structure is shown below:</p> |
| <p><img src="record_batch.png" width="100%"></p> |
| <p>The arrow package supplies a <code>$</code> method for Record Batch |
| objects, used to extract a single column by name:</p> |
| <div class="sourceCode" id="cb20"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">rb</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>You can use double brackets <code>[[</code> to refer to columns by |
| position. The <code>rb$ints</code> array is the second column in our |
| Record Batch so we can extract it with this:</p> |
| <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">rb</span><span class="op">[[</span><span class="fl">2</span><span class="op">]</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <int32></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## 1,</span></span> |
| <span><span class="co">## null,</span></span> |
| <span><span class="co">## 2,</span></span> |
| <span><span class="co">## 4,</span></span> |
| <span><span class="co">## 8</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>There is also a <code>[</code> method that allows you to extract |
| subsets of a Record Batch in the same way you would for a data frame. |
| The command <code>rb[1:3, 1:2]</code> extracts the first three rows and |
| the first two columns:</p> |
| <div class="sourceCode" id="cb24"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">rb</span><span class="op">[</span><span class="fl">1</span><span class="op">:</span><span class="fl">3</span>, <span class="fl">1</span><span class="op">:</span><span class="fl">2</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## RecordBatch</span></span> |
| <span><span class="co">## 3 rows x 2 columns</span></span> |
| <span><span class="co">## $strs <string></span></span> |
| <span><span class="co">## $ints <int32></span></span></code></pre> |
| <p>Record Batches cannot be concatenated: because they are comprised of |
| Arrays, and Arrays are immutable objects, new rows cannot be added to |
| a Record Batch once it has been created.</p> |
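| <p>As noted above, <code>record_batch()</code> also accepts a data |
| frame. The sketch below is illustrative only: the object name |
| <code>rb_df</code> and its contents are invented, and the arrow package |
| is assumed to be installed. It builds a Record Batch from a data frame, |
| inspects its shape and Schema, and converts it back:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| # Build a Record Batch from a data frame |
| rb_df <- record_batch(data.frame( |
|   strs = c("a", "b", "c"), |
|   ints = c(1L, 2L, 3L) |
| )) |
| # Rows and columns, column names, and the accompanying Schema |
| dim(rb_df) |
| names(rb_df) |
| rb_df$schema |
| # Each column is an Array; the whole batch converts back to a data frame |
| class(rb_df$strs) |
| as.data.frame(rb_df) |
| </code></pre></div> |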
| </div> |
| <div class="section level2"> |
| <h2 id="tables">Tables<a class="anchor" aria-label="anchor" href="#tables"></a> |
| </h2> |
| <p>A Table is comprised of named Chunked Arrays, in the same way that a |
| Record Batch is comprised of named Arrays. Like Record Batches, Tables |
| include an explicit Schema specifying the name and data type for each |
| Chunked Array.</p> |
| <p>You can subset Tables with <code>$</code>, <code>[[</code>, and |
| <code>[</code> the same way you can for Record Batches. Unlike Record |
| Batches, Tables can be concatenated (because they are comprised of |
| Chunked Arrays). Suppose a second Record Batch arrives:</p> |
| <div class="sourceCode" id="cb26"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">new_rb</span> <span class="op"><-</span> <span class="fu"><a href="../reference/record_batch.html">record_batch</a></span><span class="op">(</span></span> |
| <span> strs <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"I"</span>, <span class="st">"love"</span>, <span class="st">"you"</span><span class="op">)</span>,</span> |
| <span> ints <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">5L</span>, <span class="fl">0L</span>, <span class="fl">0L</span><span class="op">)</span>,</span> |
| <span> dbls <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">7.1</span>, <span class="op">-</span><span class="fl">0.1</span>, <span class="fl">2</span><span class="op">)</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <p>It is not possible to append the data from <code>new_rb</code> to |
| the data in <code>rb</code> within a single Record Batch without |
| creating entirely new objects in memory. With Tables, however, we |
| can:</p> |
| <div class="sourceCode" id="cb27"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">df</span> <span class="op"><-</span> <span class="fu"><a href="../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">rb</span><span class="op">)</span></span> |
| <span><span class="va">new_df</span> <span class="op"><-</span> <span class="fu"><a href="../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">new_rb</span><span class="op">)</span></span></code></pre></div> |
| <p>We now have the two fragments of the data set represented as Tables. |
| The difference between the Table and the Record Batch is that the |
| columns are all represented as Chunked Arrays. Each Array from the |
| original Record Batch is one chunk in the corresponding Chunked Array in |
| the Table:</p> |
| <div class="sourceCode" id="cb28"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">rb</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <div class="sourceCode" id="cb30"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">df</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>It’s the same underlying data – and indeed the same immutable Array |
| is referenced by both – just enclosed by a new, flexible Chunked Array |
| wrapper. However, it is this wrapper that allows us to concatenate |
| Tables:</p> |
| <div class="sourceCode" id="cb32"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../reference/concat_tables.html">concat_tables</a></span><span class="op">(</span><span class="va">df</span>, <span class="va">new_df</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table</span></span> |
| <span><span class="co">## 8 rows x 3 columns</span></span> |
| <span><span class="co">## $strs <string></span></span> |
| <span><span class="co">## $ints <int32></span></span> |
| <span><span class="co">## $dbls <double></span></span></code></pre> |
| <p>The resulting object is shown schematically below:</p> |
| <p><img src="table.png" width="100%"></p> |
| <p>Notice that the Chunked Arrays within the new Table retain this |
| chunking structure, because none of the original Arrays have been |
| moved:</p> |
| <div class="sourceCode" id="cb34"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">df_both</span> <span class="op"><-</span> <span class="fu"><a href="../reference/concat_tables.html">concat_tables</a></span><span class="op">(</span><span class="va">df</span>, <span class="va">new_df</span><span class="op">)</span></span> |
| <span><span class="va">df_both</span><span class="op">$</span><span class="va">strs</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "hello",</span></span> |
| <span><span class="co">## "amazing",</span></span> |
| <span><span class="co">## "and",</span></span> |
| <span><span class="co">## "cruel",</span></span> |
| <span><span class="co">## "world"</span></span> |
| <span><span class="co">## ],</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "I",</span></span> |
| <span><span class="co">## "love",</span></span> |
| <span><span class="co">## "you"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
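| <p>The chunk structure can also be inspected programmatically. The |
| sketch below rebuilds the two Record Batches and checks the chunks of |
| the concatenated column. Only the <code>strs</code> values of |
| <code>rb</code> appear in this section, so the <code>ints</code> and |
| <code>dbls</code> values used for <code>rb</code> here are illustrative |
| assumptions:</p> |

```r
library(arrow)

# Illustrative values: the `ints` and `dbls` columns of `rb` are assumptions
rb <- record_batch(
  strs = c("hello", "amazing", "and", "cruel", "world"),
  ints = c(1L, 2L, 3L, 4L, 5L),
  dbls = c(1.1, 3.2, 0.2, 4.5, 11)
)
new_rb <- record_batch(
  strs = c("I", "love", "you"),
  ints = c(5L, 0L, 0L),
  dbls = c(7.1, -0.1, 2)
)

df_both <- concat_tables(arrow_table(rb), arrow_table(new_rb))

# One chunk per original Record Batch
n_chunks <- df_both$strs$num_chunks

# chunk() uses zero-based indexing, so chunk(1) is the Array from `new_rb`
second_chunk <- df_both$strs$chunk(1)
```

| <p>Here <code>n_chunks</code> should equal the number of Tables that |
| were concatenated, and <code>second_chunk</code> should hold the strings |
| that arrived in <code>new_rb</code>.</p> |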
| </div> |
| <div class="section level2"> |
| <h2 id="datasets">Datasets<a class="anchor" aria-label="anchor" href="#datasets"></a> |
| </h2> |
| <p>Like Record Batch and Table objects, a Dataset is used to represent |
| tabular data. At an abstract level, a Dataset can be viewed as an object |
| comprised of rows and columns, and just like Record Batches and Tables, |
| it contains an explicit Schema that specifies the name and data type |
| associated with each column.</p> |
| <p>However, where Tables and Record Batches are data explicitly |
| represented in-memory, a Dataset is not. Instead, a Dataset is an |
| abstraction that refers to data stored on-disk in one or more files. |
| Values stored in the data files are loaded into memory as a batched |
| process. Loading takes place only as needed, and only when a query is |
| executed against the data. In this respect Arrow Datasets are a very |
| different kind of object to Arrow Tables, but the dplyr commands used to |
| analyze them are essentially identical. In this section we’ll talk about |
| how Datasets are structured. If you want to learn more about the |
| practical details of analyzing Datasets, see the article on <a href="./dataset.html">analyzing multi-file datasets</a>.</p> |
| <div class="section level3"> |
| <h3 id="the-on-disk-data-files">The on-disk data files<a class="anchor" aria-label="anchor" href="#the-on-disk-data-files"></a> |
| </h3> |
| <p>Reduced to its simplest form, the on-disk structure of a Dataset is |
| simply a collection of data files, each storing one subset of the data. |
| These subsets are sometimes referred to as “fragments”, and the |
| partitioning process is sometimes referred to as “sharding”. By |
| convention, these files are organized into a folder structure called a |
| Hive-style partition: see <code><a href="../reference/hive_partition.html">hive_partition()</a></code> for details.</p> |
| <p>To illustrate how this works, let’s write a multi-file dataset to |
| disk manually, without using any of the Arrow Dataset functionality to |
| do the work. We’ll start with three small data frames, each of which |
| contains one subset of the data we want to store:</p> |
| <div class="sourceCode" id="cb36"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">df_a</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">1</span><span class="op">:</span><span class="fl">5</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"a"</span><span class="op">)</span></span> |
| <span><span class="va">df_b</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">6</span><span class="op">:</span><span class="fl">10</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"b"</span><span class="op">)</span></span> |
| <span><span class="va">df_c</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span>id <span class="op">=</span> <span class="fl">11</span><span class="op">:</span><span class="fl">15</span>, value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="fl">5</span><span class="op">)</span>, subset <span class="op">=</span> <span class="st">"c"</span><span class="op">)</span></span></code></pre></div> |
| <p>Our intention is that each of the data frames should be stored in a |
| separate data file. As you can see, this is a highly structured |
| partitioning: all data where <code>subset = "a"</code> belong to one |
| file, all data where <code>subset = "b"</code> belong to another file, |
| and all data where <code>subset = "c"</code> belong to the third |
| file.</p> |
| <p>The first step is to define and create a folder that will hold all |
| the files:</p> |
| <div class="sourceCode" id="cb37"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds_dir</span> <span class="op"><-</span> <span class="st">"mini-dataset"</span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir</span><span class="op">)</span></span></code></pre></div> |
| <p>The next step is to manually create the Hive-style folder |
| structure:</p> |
| <div class="sourceCode" id="cb38"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds_dir_a</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=a"</span><span class="op">)</span></span> |
| <span><span class="va">ds_dir_b</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=b"</span><span class="op">)</span></span> |
| <span><span class="va">ds_dir_c</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir</span>, <span class="st">"subset=c"</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_a</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_b</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/base/files2.html" class="external-link">dir.create</a></span><span class="op">(</span><span class="va">ds_dir_c</span><span class="op">)</span></span></code></pre></div> |
| <p>Notice that we have named each folder in a “key=value” format that |
| exactly describes the subset of data that will be written into that |
| folder. This naming structure is the essence of Hive-style |
| partitions.</p> |
| <p>Now that we have the folders, we’ll use <code><a href="../reference/write_parquet.html">write_parquet()</a></code> |
| to create a single parquet file for each of the three subsets:</p> |
| <div class="sourceCode" id="cb39"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_a</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_a</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_b</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_b</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">df_c</span>, <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="va">ds_dir_c</span>, <span class="st">"part-0.parquet"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <p>If we had wanted to, we could have further subdivided the dataset. A |
| folder could contain multiple files (<code>part-0.parquet</code>, |
| <code>part-1.parquet</code>, etc.) if we wanted it to. Similarly, there |
| is no particular reason to name the files <code>part-0.parquet</code> |
| at all: it would have been fine to call these files |
| <code>subset-a.parquet</code>, <code>subset-b.parquet</code>, and |
| <code>subset-c.parquet</code> if we had wished. We could have written |
| other file formats if we wanted, and we don’t necessarily have to use |
| Hive-style folders. You can learn more about the supported formats by |
| reading the help documentation for <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, and |
| learn about how to exercise fine-grained control with |
| <code><a href="../reference/Dataset.html">help("Dataset", package = "arrow")</a></code>.</p> |
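| <p>As a rough sketch of the points above, the following writes CSV |
| files with an arbitrary file name into Hive-style folders and opens them |
| as a Dataset. The folder <code>mini-dataset-csv</code>, the file name |
| <code>data.csv</code>, and the data values are made up for |
| illustration:</p> |

```r
library(arrow)

# Hypothetical temporary location for the example dataset
csv_dir <- file.path(tempdir(), "mini-dataset-csv")

for (key in c("a", "b")) {
  part_dir <- file.path(csv_dir, paste0("subset=", key))
  dir.create(part_dir, recursive = TRUE, showWarnings = FALSE)
  part <- data.frame(id = 1:3, value = c(0.5, -1.2, 2.0))
  # The file name is arbitrary: "data.csv" rather than "part-0.csv"
  write_csv_arrow(part, file.path(part_dir, "data.csv"))
}

# The format is stated explicitly because Parquet is the default
ds_csv <- open_dataset(csv_dir, format = "csv")
tbl_csv <- Scanner$create(ds_csv)$ToTable()
```

| <p>The resulting Table should contain the rows from both CSV files, |
| along with a <code>subset</code> column taken from the folder |
| names.</p> |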
| <p>In any case, we have created an on-disk parquet Dataset using |
| Hive-style partitioning. Our Dataset is defined by these files:</p> |
| <div class="sourceCode" id="cb40"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/list.files.html" class="external-link">list.files</a></span><span class="op">(</span><span class="va">ds_dir</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## [1] "subset=a/part-0.parquet" "subset=b/part-0.parquet"</span></span> |
| <span><span class="co">## [3] "subset=c/part-0.parquet"</span></span></code></pre> |
| <p>To verify that everything has worked, let’s open the data with |
| <code><a href="../reference/open_dataset.html">open_dataset()</a></code> and call <code><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse()</a></code> to inspect |
| its contents:</p> |
| <div class="sourceCode" id="cb42"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="va">ds_dir</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse</a></span><span class="op">(</span><span class="va">ds</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## FileSystemDataset with 3 Parquet files</span></span> |
| <span><span class="co">## 15 rows x 3 columns</span></span> |
| <span><span class="co">## $ id <span style="color: #949494; font-style: italic;"><int32></span> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</span></span> |
| <span><span class="co">## $ value <span style="color: #949494; font-style: italic;"><double></span> -1.400043517, 0.255317055, -2.437263611, -0.005571287, 0.62155~</span></span> |
| <span><span class="co">## $ subset <span style="color: #949494; font-style: italic;"><string></span> "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c~</span></span> |
| <span><span class="co">## Call `print()` for full schema details</span></span></code></pre> |
| <p>As you can see, the <code>ds</code> Dataset object aggregates the |
| three separate data files. In fact, in this particular case the Dataset |
| is so small that values from all three files appear in the output of |
| <code><a href="https://pillar.r-lib.org/reference/glimpse.html" class="external-link">glimpse()</a></code>.</p> |
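| <p>The “key=value” folder names do real work when the Dataset is |
| opened. In the sketch below (the folder <code>hive-demo</code> and the |
| data values are hypothetical) the files themselves contain no |
| <code>subset</code> column, yet the Dataset still exposes one because it |
| is recovered from the folder names:</p> |

```r
library(arrow)

hive_dir <- file.path(tempdir(), "hive-demo")  # hypothetical location

for (key in c("a", "b")) {
  part_dir <- file.path(hive_dir, paste0("subset=", key))
  dir.create(part_dir, recursive = TRUE, showWarnings = FALSE)
  # Note: no `subset` column is written into the file itself
  write_parquet(
    data.frame(id = 1:2, value = c(0.1, 0.2)),
    file.path(part_dir, "part-0.parquet")
  )
}

ds_hive <- open_dataset(hive_dir)
col_names <- names(ds_hive$schema)
subset_values <- as.data.frame(Scanner$create(ds_hive)$ToTable())$subset
```

| <p>This is also why storing the partition column inside each file, as |
| we did above, is optional when Hive-style folders are used.</p> |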
| <p>It should be noted that in everyday data analysis work, you wouldn’t |
| need to write the data files manually in this fashion. The example |
| above is entirely for illustrative purposes. The exact same dataset |
| could be created with the following command:</p> |
| <div class="sourceCode" id="cb44"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">subset</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"mini-dataset"</span><span class="op">)</span></span></code></pre></div> |
| <p>In fact, even if <code>ds</code> happens to refer to a data source |
| that is larger than memory, this command should still work because the |
| Dataset functionality is written to ensure that during a pipeline such |
| as this the data is loaded piecewise in order to avoid exhausting |
| memory.</p> |
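| <p>The same layout can be produced directly from a data frame by |
| naming the partitioning column. This is a minimal sketch: the output |
| folder <code>mini-dataset-written</code> and the combined data frame |
| <code>full_df</code> are assumptions made for the example:</p> |

```r
library(arrow)

out_dir <- file.path(tempdir(), "mini-dataset-written")  # hypothetical path

full_df <- data.frame(
  id = 1:15,
  value = rnorm(15),
  subset = rep(c("a", "b", "c"), each = 5)
)

# `partitioning` names the column(s) used to build the key=value folders
write_dataset(full_df, out_dir, partitioning = "subset")

written <- list.files(out_dir, recursive = TRUE)
```

| <p>Listing the output folder should show one Parquet file inside each |
| of the <code>subset=a</code>, <code>subset=b</code>, and |
| <code>subset=c</code> folders.</p> |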
| </div> |
| <div class="section level3"> |
| <h3 id="the-dataset-object">The Dataset object<a class="anchor" aria-label="anchor" href="#the-dataset-object"></a> |
| </h3> |
| <p>In the previous section we examined the on-disk structure of a |
| Dataset. We now turn to the in-memory structure of the Dataset object |
| itself (i.e., <code>ds</code> in the previous example). When the Dataset |
| object is created, arrow searches the dataset folder looking for |
| appropriate files, but does not load the contents of those files. Paths |
| to these files are stored in an active binding |
| <code>ds$files</code>:</p> |
| <div class="sourceCode" id="cb45"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span><span class="op">$</span><span class="va">files</span></span></code></pre></div> |
| <pre><code><span><span class="co">## [1] "/build/r/vignettes/mini-dataset/subset=a/part-0.parquet"</span></span> |
| <span><span class="co">## [2] "/build/r/vignettes/mini-dataset/subset=b/part-0.parquet"</span></span> |
| <span><span class="co">## [3] "/build/r/vignettes/mini-dataset/subset=c/part-0.parquet"</span></span></code></pre> |
| <p>The other thing that happens when <code><a href="../reference/open_dataset.html">open_dataset()</a></code> is |
| called is that an explicit Schema for the Dataset is constructed and |
| stored as <code>ds$schema</code>:</p> |
| <div class="sourceCode" id="cb47"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span><span class="op">$</span><span class="va">schema</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Schema</span></span> |
| <span><span class="co">## id: int32</span></span> |
| <span><span class="co">## value: double</span></span> |
| <span><span class="co">## subset: string</span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## See $metadata for additional Schema metadata</span></span></code></pre> |
| <p>By default this Schema is inferred by inspecting the first file only, |
| though it is possible to construct a unified schema after inspecting all |
| files. To do this, set <code>unify_schemas = TRUE</code> when calling |
| <code><a href="../reference/open_dataset.html">open_dataset()</a></code>. It is also possible to use the |
| <code>schema</code> argument to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to specify |
| the Schema explicitly (see the <code><a href="../reference/schema.html">schema()</a></code> function for |
| details).</p> |
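| <p>To see what <code>unify_schemas = TRUE</code> does, consider a |
| folder in which one file has a column that the other lacks. The folder |
| <code>mixed-schema</code> and the <code>extra</code> column below are |
| invented for this sketch:</p> |

```r
library(arrow)

mixed_dir <- file.path(tempdir(), "mixed-schema")  # hypothetical location
dir.create(mixed_dir, showWarnings = FALSE)

write_parquet(
  data.frame(id = 1:2, value = c(0.1, 0.2)),
  file.path(mixed_dir, "part-0.parquet")
)
# The second file carries an additional column
write_parquet(
  data.frame(id = 3:4, value = c(0.3, 0.4), extra = c("x", "y")),
  file.path(mixed_dir, "part-1.parquet")
)

# Inspect every file and merge the schemas instead of trusting one file
ds_unified <- open_dataset(mixed_dir, unify_schemas = TRUE)
unified_names <- names(ds_unified$schema)
```

| <p>With unification enabled, the Schema should list the columns found |
| in either file rather than only those of a single file.</p> |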
| <p>The act of reading the data is performed by a Scanner object. When |
| analyzing a Dataset using the dplyr interface you never need to |
| construct a Scanner manually, but for explanatory purposes we’ll do it |
| here:</p> |
| <div class="sourceCode" id="cb49"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">scan</span> <span class="op"><-</span> <span class="va">Scanner</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span>dataset <span class="op">=</span> <span class="va">ds</span><span class="op">)</span></span></code></pre></div> |
| <p>Calling the <code>ToTable()</code> method will materialize the |
| Dataset (on-disk) as a Table (in-memory):</p> |
| <div class="sourceCode" id="cb50"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">scan</span><span class="op">$</span><span class="fu">ToTable</span><span class="op">(</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table</span></span> |
| <span><span class="co">## 15 rows x 3 columns</span></span> |
| <span><span class="co">## $id <int32></span></span> |
| <span><span class="co">## $value <double></span></span> |
| <span><span class="co">## $subset <string></span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## See $metadata for additional Schema metadata</span></span></code></pre> |
| <p>This scanning process is multi-threaded by default, but if necessary |
| threading can be disabled by setting <code>use_threads = FALSE</code> |
| when calling <code>Scanner$create()</code>.</p> |
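| <p>A single-threaded scan might look like the following sketch, which |
| builds a small throwaway Dataset first (the folder |
| <code>scan-demo</code> and its contents are assumptions):</p> |

```r
library(arrow)

scan_dir <- file.path(tempdir(), "scan-demo")  # hypothetical location
dir.create(scan_dir, showWarnings = FALSE)
write_parquet(
  data.frame(id = 1:5, value = c(0.4, -1.1, 2.3, 0.0, -0.7)),
  file.path(scan_dir, "part-0.parquet")
)
ds_scan <- open_dataset(scan_dir)

# Disable multi-threading for this scan only
scan_single <- Scanner$create(dataset = ds_scan, use_threads = FALSE)
tbl_single <- scan_single$ToTable()
```

| <p>The resulting Table has the same contents as a threaded scan; only |
| the way the work is scheduled differs.</p> |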
| </div> |
| <div class="section level3"> |
| <h3 id="querying-a-dataset">Querying a Dataset<a class="anchor" aria-label="anchor" href="#querying-a-dataset"></a> |
| </h3> |
| <p>When a query is executed against a Dataset, a new scan is initiated |
| and the results are pulled back into R. As an example, consider the |
| following dplyr expression:</p> |
| <div class="sourceCode" id="cb52"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">value</span> <span class="op">></span> <span class="fl">0</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>new_value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/Round.html" class="external-link">round</a></span><span class="op">(</span><span class="fl">100</span> <span class="op">*</span> <span class="va">value</span><span class="op">)</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">id</span>, <span class="va">subset</span>, <span class="va">new_value</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 6 x 3</span></span></span> |
| <span><span class="co">## id subset new_value</span></span> |
| <span><span class="co">## <span style="color: #949494; font-style: italic;"><int></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">1</span> 2 a 26</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">2</span> 5 a 62</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">3</span> 6 b 115</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">4</span> 12 c 63</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">5</span> 13 c 207</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">6</span> 15 c 51</span></span></code></pre> |
| <p>We can replicate this using the low-level Dataset interface by |
| creating a new scan that specifies the <code>filter</code> and |
| <code>projection</code> arguments to <code>Scanner$create()</code>. To |
| use these arguments you need to know a little about Arrow Expressions, |
| for which you may find it helpful to read the help documentation in |
| <code><a href="../reference/Expression.html">help("Expression", package = "arrow")</a></code>.</p> |
| <p>The scanner defined below mimics the dplyr pipeline shown above,</p> |
| <div class="sourceCode" id="cb54"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">scan</span> <span class="op"><-</span> <span class="va">Scanner</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span> |
| <span> dataset <span class="op">=</span> <span class="va">ds</span>,</span> |
| <span> filter <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"value"</span><span class="op">)</span> <span class="op">></span> <span class="fl">0</span>,</span> |
| <span> projection <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span></span> |
| <span> id <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"id"</span><span class="op">)</span>,</span> |
| <span> subset <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"subset"</span><span class="op">)</span>,</span> |
| <span> new_value <span class="op">=</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="st">"round"</span>, <span class="fl">100</span> <span class="op">*</span> <span class="va">Expression</span><span class="op">$</span><span class="fu">field_ref</span><span class="op">(</span><span class="st">"value"</span><span class="op">)</span><span class="op">)</span></span> |
| <span> <span class="op">)</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <p>and if we were to call <code>as.data.frame(scan$ToTable())</code> it |
| would produce the same result as the dplyr version, though the rows may |
| not appear in the same order.</p> |
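| <p>One way to check this equivalence is to sort both results by |
| <code>id</code> before comparing them. The sketch below is |
| self-contained and uses a freshly written Dataset in a hypothetical |
| folder <code>compare-demo</code>, so its values differ from the output |
| shown above:</p> |

```r
library(arrow)
library(dplyr)

set.seed(1)
cmp_dir <- file.path(tempdir(), "compare-demo")  # hypothetical location
full_df <- data.frame(
  id = 1:15,
  value = rnorm(15),
  subset = rep(c("a", "b", "c"), each = 5)
)
write_dataset(full_df, cmp_dir, partitioning = "subset")
ds_cmp <- open_dataset(cmp_dir)

# High-level route: dplyr verbs translated by arrow
via_dplyr <- ds_cmp |>
  filter(value > 0) |>
  mutate(new_value = round(100 * value)) |>
  select(id, subset, new_value) |>
  collect()

# Low-level route: the same filter and projection given to a Scanner
scan_cmp <- Scanner$create(
  dataset = ds_cmp,
  filter = Expression$field_ref("value") > 0,
  projection = list(
    id = Expression$field_ref("id"),
    subset = Expression$field_ref("subset"),
    new_value = Expression$create("round", 100 * Expression$field_ref("value"))
  )
)
via_scanner <- as.data.frame(scan_cmp$ToTable())

# Row order is not guaranteed, so sort before comparing
via_dplyr <- via_dplyr[order(via_dplyr$id), ]
via_scanner <- via_scanner[order(via_scanner$id), ]
same_ids <- identical(via_dplyr$id, via_scanner$id)
same_values <- isTRUE(all.equal(via_dplyr$new_value, via_scanner$new_value))
```

| <p>Comparing column by column avoids spurious differences between a |
| tibble and a plain data frame.</p> |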
| <p>To get a better sense of what happens when the query executes, |
| we’ll call <code>scan$ScanBatches()</code>. Much like the |
| <code>ToTable()</code> method, the <code>ScanBatches()</code> method |
| executes the query separately against each of the files, but it returns |
| a list of Record Batches, one for each file. In addition, we’ll convert |
| these Record Batches to data frames individually:</p> |
| <div class="sourceCode" id="cb55"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/lapply.html" class="external-link">lapply</a></span><span class="op">(</span><span class="va">scan</span><span class="op">$</span><span class="fu">ScanBatches</span><span class="op">(</span><span class="op">)</span>, <span class="va">as.data.frame</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## [[1]]</span></span> |
| <span><span class="co">## id subset new_value</span></span> |
| <span><span class="co">## 1 2 a 26</span></span> |
| <span><span class="co">## 2 5 a 62</span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## [[2]]</span></span> |
| <span><span class="co">## id subset new_value</span></span> |
| <span><span class="co">## 1 6 b 115</span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## [[3]]</span></span> |
| <span><span class="co">## id subset new_value</span></span> |
| <span><span class="co">## 1 12 c 63</span></span> |
| <span><span class="co">## 2 13 c 207</span></span> |
| <span><span class="co">## 3 15 c 51</span></span></code></pre> |
| <p>If we return to the dplyr query we made earlier, and use |
| <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">compute()</a></code> to return a Table rather than using |
| <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code> to return a data frame, we can see the evidence |
| of this process at work. The Table object is created by concatenating |
| the three Record Batches produced when the query executes against three |
| data files, and as a consequence of this the Chunked Array that defines |
| a column of the Table mirrors the partitioning structure present in the |
| data files:</p> |
| <div class="sourceCode" id="cb57"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">tbl</span> <span class="op"><-</span> <span class="va">ds</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">value</span> <span class="op">></span> <span class="fl">0</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>new_value <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/Round.html" class="external-link">round</a></span><span class="op">(</span><span class="fl">100</span> <span class="op">*</span> <span class="va">value</span><span class="op">)</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">id</span>, <span class="va">subset</span>, <span class="va">new_value</span><span class="op">)</span> <span class="op">|></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">compute</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="va">tbl</span><span class="op">$</span><span class="va">subset</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "a",</span></span> |
| <span><span class="co">## "a"</span></span> |
| <span><span class="co">## ],</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "b"</span></span> |
| <span><span class="co">## ],</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "c",</span></span> |
| <span><span class="co">## "c",</span></span> |
| <span><span class="co">## "c"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
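| <p>The chunk structure can also be inspected programmatically. The following is a minimal sketch that builds a stand-in Chunked Array by hand with <code>chunked_array()</code>, using the same values as the printed output above rather than the <code>tbl</code> object itself. The name <code>subset_col</code> is an assumption made for illustration:</p> |

```r
library(arrow)

# Hand-built stand-in for tbl$subset: one chunk per (hypothetical) data file
subset_col <- chunked_array(c("a", "a"), "b", c("c", "c", "c"))

# Number of chunks, which here plays the role of the number of files
n_chunks <- subset_col$num_chunks

# Number of values held in each chunk
chunk_sizes <- sapply(subset_col$chunks, function(x) x$length())

# Chunks are indexed from zero, so chunk(0) is the first one
first_chunk <- subset_col$chunk(0)
```

| <p>The same fields and methods should be available on a column of a Table produced from a Dataset, although the exact chunk boundaries can depend on how the query was executed.</p> |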
| </div> |
| <div class="section level3"> |
| <h3 id="additional-notes">Additional notes<a class="anchor" aria-label="anchor" href="#additional-notes"></a> |
| </h3> |
| <ul> |
| <li><p>The previous discussion ignores the distinction between |
| <code>FileSystemDataset</code> and <code>InMemoryDataset</code> objects. |
| In the usual case, the data that comprise a Dataset are stored in files |
| on disk; that is, after all, the primary advantage of Datasets over |
| Tables. However, there are cases where it may be useful to make a |
| Dataset from data that are already held in memory. In such cases the |
| object created will have class <code>InMemoryDataset</code>.</p></li> |
| <li><p>The previous discussion also assumes that all files in the |
| Dataset share the same Schema. In the usual case this will be true, |
| because each file is conceptually a subset of a single rectangular |
| table, but it is not strictly required.</p></li> |
| </ul> |
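| <p>Both notes can be illustrated with a short sketch. It assumes an arrow build with Parquet support. The object names (<code>mem_ds</code>, <code>file_ds</code>), the file names, and the column names are hypothetical and chosen only for this example. The first part creates an <code>InMemoryDataset</code> from a data frame. The second part writes two files whose Schemas differ by one column and opens them with the <code>unify_schemas</code> argument of <code>open_dataset()</code>:</p> |

```r
library(arrow)

# Part 1: a Dataset backed by data that are already in memory
df <- data.frame(id = 1:3, value = c(0.5, -1.2, 2.0))
mem_ds <- InMemoryDataset$create(df)
class(mem_ds)[1]

# Part 2: two Parquet files whose Schemas are not identical
# (illustrative data written to a temporary directory)
dir <- tempfile("mixed_schema_")
dir.create(dir)
write_parquet(
  data.frame(id = 1:2, x = c(1.5, 2.5)),
  file.path(dir, "part-1.parquet")
)
write_parquet(
  data.frame(id = 3:4, y = c("a", "b")),
  file.path(dir, "part-2.parquet")
)

# unify_schemas = TRUE asks open_dataset() to scan every file and
# build a single Schema that covers the columns found in all of them
file_ds <- open_dataset(dir, unify_schemas = TRUE)
file_ds$schema$names
```

| <p>Without <code>unify_schemas = TRUE</code>, the Schema is typically inferred without scanning every file, so a column that appears only in some files may be absent from the resulting Dataset.</p> |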
| <p>For more information about these topics, see |
| <code><a href="../reference/Dataset.html">help("Dataset", package = "arrow")</a></code>.</p> |
| </div> |
| </div> |
| <div class="section level2"> |
| <h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a> |
| </h2> |
| <ul> |
| <li>To learn more about the internal structure of Arrays, see the |
| article on <a href="./developers/data_object_layout.html">data object |
| layout</a>.</li> |
| <li>To learn more about the different data types used by Arrow, see the |
| article on <a href="./data_types.html">data types</a>.</li> |
| <li>To learn more about how Arrow objects are implemented, see the <a href="https://arrow.apache.org/docs/format/Columnar.html" class="external-link">Arrow |
| specification</a> page.</li> |
| </ul> |
| </div> |
| </main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2> |
| </nav></aside> |
| </div> |
| |
| |
| |
| </div> |
| |
| |
| |
| |
| |
| </body> |
| </html> |