<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Working with multi-file data sets • Arrow R Package</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="96x96" href="../favicon-96x96.png">
<link rel="icon" type="”image/svg+xml”" href="../favicon.svg">
<link rel="apple-touch-icon" sizes="180x180" href="../apple-touch-icon.png">
<link rel="icon" sizes="any" href="../favicon.ico">
<link rel="manifest" href="../site.webmanifest">
<script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet">
<script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><link href="../deps/font-awesome-6.5.2/css/all.min.css" rel="stylesheet">
<link href="../deps/font-awesome-6.5.2/css/v4-shims.min.css" rel="stylesheet">
<script src="../deps/headroom-0.11.0/headroom.min.js"></script><script src="../deps/headroom-0.11.0/jQuery.headroom.min.js"></script><script src="../deps/bootstrap-toc-1.0.1/bootstrap-toc.min.js"></script><script src="../deps/clipboard.js-2.0.11/clipboard.min.js"></script><script src="../deps/search-1.0.0/autocomplete.jquery.min.js"></script><script src="../deps/search-1.0.0/fuse.min.js"></script><script src="../deps/search-1.0.0/mark.min.js"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet">
<meta property="og:title" content="Working with multi-file data sets">
<meta name="description" content="Learn how to use Datasets to read, write, and analyze multi-file larger-than-memory data
">
<meta property="og:description" content="Learn how to use Datasets to read, write, and analyze multi-file larger-than-memory data
">
<meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png">
<meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text">
</head>
<body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">21.0.0.9000</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto">
<li class="nav-item"><a class="nav-link" href="../articles/arrow.html">Get started</a></li>
<li class="nav-item"><a class="nav-link" href="../reference/index.html">Reference</a></li>
<li class="active nav-item dropdown">
<button class="nav-link dropdown-toggle" type="button" id="dropdown-articles" data-bs-toggle="dropdown" aria-expanded="false" aria-haspopup="true">Articles</button>
<ul class="dropdown-menu" aria-labelledby="dropdown-articles">
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Using the package</h6></li>
<li><a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a></li>
<li><a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a></li>
<li><a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a></li>
<li><a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a></li>
<li><a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a></li>
<li><a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6></li>
<li><a class="dropdown-item" href="../articles/data_objects.html">Data objects</a></li>
<li><a class="dropdown-item" href="../articles/data_types.html">Data types</a></li>
<li><a class="dropdown-item" href="../articles/metadata.html">Metadata</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Installation</h6></li>
<li><a class="dropdown-item" href="../articles/install.html">Installing on Linux</a></li>
<li><a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a></li>
<li><hr class="dropdown-divider"></li>
<li><a class="dropdown-item" href="../articles/index.html">More articles...</a></li>
</ul>
</li>
<li class="nav-item"><a class="nav-link" href="../news/index.html">Changelog</a></li>
</ul>
<form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="" autocomplete="off">
</form>
<ul class="navbar-nav">
<li class="nav-item"><a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="GitHub"><span class="fa fab fa-github fa-lg"></span></a></li>
</ul>
</div>
</div>
</nav><div class="container template-article">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<h1>Working with multi-file data sets</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/dataset.Rmd" class="external-link"><code>vignettes/dataset.Rmd</code></a></small>
<div class="d-none name"><code>dataset.Rmd</code></div>
</div>
<p>Apache Arrow lets you work efficiently with single and multi-file
data sets, even when the data is too large to be loaded into memory.
With the help of Arrow Dataset objects you can analyze this kind of data
using familiar <a href="https://dplyr.tidyverse.org/" class="external-link">dplyr</a> syntax.
This article introduces Datasets and shows you how to analyze them with
dplyr and arrow. We’ll start by ensuring both packages are loaded:</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/" class="external-link">arrow</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span></code></pre></div>
<div class="section level2">
<h2 id="example-nyc-taxi-data">Example: NYC taxi data<a class="anchor" aria-label="anchor" href="#example-nyc-taxi-data"></a>
</h2>
<p>The primary motivation for Arrow’s Dataset object is to allow users
to analyze extremely large datasets. As an example, consider the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" class="external-link">New
York City taxi trip record data</a> that is widely used in big data
exercises and competitions. To demonstrate the capabilities of Apache
Arrow we host a Parquet-formatted version of this data in a public Amazon
S3 bucket: in its full form, our version of the data set is one very
large table with about 1.7 billion rows and 24 columns, where each row
corresponds to a single taxi ride sometime between 2009 and 2022. A <a href="https://arrow-user2022.netlify.app/packages-and-data.html#data" class="external-link">data
dictionary</a> for this version of the NYC taxi data is also
available.</p>
<p>This multi-file data set comprises 158 distinct Parquet files,
each corresponding to a month of data. A single file is typically around
400-500MB in size, and the full data set is about 70GB. It is
not a small data set – it is slow to download and does not fit in memory
on a typical machine 🙂 – so we also host a “tiny” version of the NYC
taxi data that is formatted in exactly the same way but includes only
one out of every thousand entries in the original data set (i.e.,
individual files are &lt;1MB in size, and the “tiny” data set is only
about 70MB).</p>
<p>If you have Amazon S3 support enabled in arrow (true for most users;
see links at the end of this article if you need to troubleshoot this),
you can connect to a copy of the “tiny taxi data” stored on S3 with this
command:</p>
<div class="sourceCode" id="cb2"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/nyc-taxi-tiny"</span><span class="op">)</span></span></code></pre></div>
<p>Alternatively you could connect to a copy of the data on Google Cloud
Storage (GCS) using the following command:</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/nyc-taxi-tiny"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div>
<p>If you want to use the full data set, replace
<code>nyc-taxi-tiny</code> with <code>nyc-taxi</code> in the code above.
Apart from size – and with it the cost in time, bandwidth usage, and CPU
cycles – there is no difference between the two versions of the data: you can
test your code using the tiny taxi data and then check how it scales
using the full data set.</p>
<p>To copy the data set stored in the
<code>bucket</code> to a local folder called <code>"nyc-taxi"</code>, use the
<code><a href="../reference/copy_files.html">copy_files()</a></code> function:</p>
<div class="sourceCode" id="cb4"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../reference/copy_files.html">copy_files</a></span><span class="op">(</span>from <span class="op">=</span> <span class="va">bucket</span>, to <span class="op">=</span> <span class="st">"nyc-taxi"</span><span class="op">)</span></span></code></pre></div>
<p>For the purposes of this article, we assume that the NYC taxi dataset
(either the full data or the tiny version) has been downloaded locally
and exists in an <code>"nyc-taxi"</code> directory.</p>
</div>
<div class="section level2">
<h2 id="opening-datasets">Opening Datasets<a class="anchor" aria-label="anchor" href="#opening-datasets"></a>
</h2>
<p>The first step in the process is to create a Dataset object that
points at the data directory:</p>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span><span class="op">)</span></span></code></pre></div>
<p>It is important to note that when we do this, the data values are not
loaded into memory. Instead, Arrow scans the data directory to find
relevant files, parses the file paths looking for a “Hive-style
partitioning” (see below), and reads headers of the data files to
construct a Schema that contains metadata describing the structure of
the data. For more information about Schemas see the <a href="./metadata.html">metadata article</a>.</p>
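<p>For instance, a quick way to confirm what <code><a href="../reference/open_dataset.html">open_dataset()</a></code> has
inferred is to print the Dataset’s schema. A minimal sketch, reusing the
<code>ds</code> object created above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># inspect the inferred schema; no data values are read to do this
ds$schema</code></pre></div>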
<p>Two questions naturally follow from this: what kind of files does
<code><a href="../reference/open_dataset.html">open_dataset()</a></code> look for, and what structure does it expect
to find in the file paths? Let’s start by looking at the file types.</p>
<p>By default <code><a href="../reference/open_dataset.html">open_dataset()</a></code> looks for Parquet files, but
you can override this using the <code>format</code> argument. For
example, if the data were encoded as CSV files we could set
<code>format = "csv"</code> to connect to the data. The Arrow Dataset
interface supports several file formats including:</p>
<ul>
<li>
<code>"parquet"</code> (the default)</li>
<li>
<code>"feather"</code> or <code>"ipc"</code> (aliases for
<code>"arrow"</code>; as Feather version 2 is the Arrow file
format)</li>
<li>
<code>"csv"</code> (comma-delimited files) and <code>"tsv"</code>
(tab-delimited files)</li>
<li>
<code>"text"</code> (generic text-delimited files - use the
<code>delimiter</code> argument to specify which to use)</li>
</ul>
<p>In the case of text files, you can pass the following parsing options
to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to ensure that files are read
correctly (see the sketch after this list):</p>
<ul>
<li><code>delim</code></li>
<li><code>quote</code></li>
<li><code>escape_double</code></li>
<li><code>escape_backslash</code></li>
<li><code>skip_empty_rows</code></li>
</ul>
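<p>As a hedged sketch of passing these options (the directory path here is
hypothetical, and per the note above the <code>"text"</code> format takes a
<code>delimiter</code> argument):</p>
<div class="sourceCode"><pre class="sourceCode r"><code># open a directory of pipe-delimited text files
ds &lt;- open_dataset(
  "nyc-taxi/text",
  format = "text",
  delimiter = "|",
  skip_empty_rows = TRUE
)</code></pre></div>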
<p>An alternative when working with text files is to use
<code><a href="../reference/open_delim_dataset.html">open_delim_dataset()</a></code>, <code><a href="../reference/open_delim_dataset.html">open_csv_dataset()</a></code>, or
<code><a href="../reference/open_delim_dataset.html">open_tsv_dataset()</a></code>. These functions are wrappers around
<code><a href="../reference/open_dataset.html">open_dataset()</a></code> but with parameters that mirror
<code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code>, <code><a href="../reference/read_delim_arrow.html">read_delim_arrow()</a></code>, and
<code><a href="../reference/read_delim_arrow.html">read_tsv_arrow()</a></code> to allow for easy switching between
functions for opening single files and functions for opening
datasets.</p>
<p>For example:</p>
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/open_delim_dataset.html">open_csv_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/csv/"</span><span class="op">)</span></span></code></pre></div>
<p>For more information on these arguments and on parsing delimited text
files generally, see the help documentation for
<code><a href="../reference/read_delim_arrow.html">read_delim_arrow()</a></code> and
<code><a href="../reference/open_delim_dataset.html">open_delim_dataset()</a></code>.</p>
<p>Next, what information does <code><a href="../reference/open_dataset.html">open_dataset()</a></code> expect to
find in the file paths? By default, the Dataset interface looks for a <a href="https://hive.apache.org/" class="external-link">Hive</a>-style partitioning structure, in
which folders are named using a “key=value” convention and the data files
in a folder contain the subset of the data for which the key has the
relevant value. For example, in the NYC taxi data the file paths look like
this:</p>
<pre><code>year=2009/month=1/part-0.parquet
year=2009/month=2/part-0.parquet
...</code></pre>
<p>From this, <code><a href="../reference/open_dataset.html">open_dataset()</a></code> infers that the first listed
Parquet file contains the data for January 2009. In that sense, a
Hive-style partitioning is self-describing: the folder names state
explicitly how the Dataset has been split across files.</p>
<p>Sometimes the directory partitioning isn’t self-describing; that is,
it doesn’t contain field names. For example, suppose the NYC taxi data
used file paths like these:</p>
<pre><code>2009/01/part-0.parquet
2009/02/part-0.parquet
...</code></pre>
<p>In that case, <code><a href="../reference/open_dataset.html">open_dataset()</a></code> would need some hints as to
how to use the file paths. Here, you could provide
<code>c("year", "month")</code> to the <code>partitioning</code>
argument, saying that the first path segment gives the value for
<code>year</code>, and the second segment is <code>month</code>. Every
row in <code>2009/01/part-0.parquet</code> has a value of 2009 for
<code>year</code> and 1 for <code>month</code>, even though those
columns may not be present in the file. In other words, we would open
the data like this:</p>
<div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span>, partitioning <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"year"</span>, <span class="st">"month"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p>Either way, when you look at the Dataset, you can see that in
addition to the columns present in every file, there are also columns
<code>year</code> and <code>month</code>. These columns are not present
in the files themselves: they are inferred from the partitioning
structure.</p>
<div class="sourceCode" id="cb10"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span></span></code></pre></div>
<pre><code><span><span class="co">## </span></span>
<span><span class="co">## FileSystemDataset with 158 Parquet files</span></span>
<span><span class="co">## vendor_name: string</span></span>
<span><span class="co">## pickup_datetime: timestamp[ms]</span></span>
<span><span class="co">## dropoff_datetime: timestamp[ms]</span></span>
<span><span class="co">## passenger_count: int64</span></span>
<span><span class="co">## trip_distance: double</span></span>
<span><span class="co">## pickup_longitude: double</span></span>
<span><span class="co">## pickup_latitude: double</span></span>
<span><span class="co">## rate_code: string</span></span>
<span><span class="co">## store_and_fwd: string</span></span>
<span><span class="co">## dropoff_longitude: double</span></span>
<span><span class="co">## dropoff_latitude: double</span></span>
<span><span class="co">## payment_type: string</span></span>
<span><span class="co">## fare_amount: double</span></span>
<span><span class="co">## extra: double</span></span>
<span><span class="co">## mta_tax: double</span></span>
<span><span class="co">## tip_amount: double</span></span>
<span><span class="co">## tolls_amount: double</span></span>
<span><span class="co">## total_amount: double</span></span>
<span><span class="co">## improvement_surcharge: double</span></span>
<span><span class="co">## congestion_surcharge: double</span></span>
<span><span class="co">## pickup_location_id: int64</span></span>
<span><span class="co">## dropoff_location_id: int64</span></span>
<span><span class="co">## year: int32</span></span>
<span><span class="co">## month: int32</span></span></code></pre>
</div>
<div class="section level2">
<h2 id="querying-datasets">Querying Datasets<a class="anchor" aria-label="anchor" href="#querying-datasets"></a>
</h2>
<p>Now that we have a Dataset object that refers to our data, we can
construct dplyr-style queries. This is possible because arrow supplies a
back end that allows users to manipulate tabular Arrow data using dplyr
verbs. Here’s an example: suppose you are curious about tipping behavior
on the most expensive taxi rides. Let’s find the median tip percentage for
rides with fares greater than $100 in 2015, broken down by the number of
passengers:</p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/system.time.html" class="external-link">system.time</a></span><span class="op">(</span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">&gt;</span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarise</a></span><span class="op">(</span></span>
<span> median_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html" class="external-link">median</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span>,</span>
<span> n <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span></span>
<span> <span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/print.html" class="external-link">print</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## </span></span>
<span><span class="co">## # A tibble: 10 x 3</span></span>
<span><span class="co">## passenger_count median_tip_pct n</span></span>
<span><span class="co">## &lt;int&gt; &lt;dbl&gt; &lt;int&gt;</span></span>
<span><span class="co">## 1 1 16.6 143087</span></span>
<span><span class="co">## 2 2 16.2 34418</span></span>
<span><span class="co">## 3 5 16.7 5806</span></span>
<span><span class="co">## 4 4 11.4 4771</span></span>
<span><span class="co">## 5 6 16.7 3338</span></span>
<span><span class="co">## 6 3 14.6 8922</span></span>
<span><span class="co">## 7 0 10.1 380</span></span>
<span><span class="co">## 8 8 16.7 32</span></span>
<span><span class="co">## 9 9 16.7 42</span></span>
<span><span class="co">## 10 7 16.7 11</span></span>
<span><span class="co">## </span></span>
<span><span class="co">## user system elapsed</span></span>
<span><span class="co">## 4.436 1.012 1.402</span></span></code></pre>
<p>You’ve just selected a subset from a Dataset that contains around 1.7
billion rows, computed a new column, and aggregated it, all within a few
seconds on a modern laptop. How does this work?</p>
<p>There are three reasons arrow can accomplish this task so
quickly:</p>
<p>First, arrow adopts a lazy evaluation approach to queries: when dplyr
verbs are called on the Dataset, they record their actions but do not
evaluate those actions on the data until you run <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>.
We can see this by taking the same code as before and leaving off the
final step:</p>
<div class="sourceCode" id="cb14"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">&gt;</span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarise</a></span><span class="op">(</span></span>
<span> median_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html" class="external-link">median</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span>,</span>
<span> n <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span></span>
<span> <span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## </span></span>
<span><span class="co">## FileSystemDataset (query)</span></span>
<span><span class="co">## passenger_count: int64</span></span>
<span><span class="co">## median_tip_pct: double</span></span>
<span><span class="co">## n: int32</span></span>
<span><span class="co">## </span></span>
<span><span class="co">## See $.data for the source Arrow object</span></span></code></pre>
<p>This version of the code returns an output instantly and shows the
manipulations you’ve made, without loading data from the files. Because
the evaluation of these queries is deferred, you can build up a query
that selects down to a small subset without generating intermediate data
sets that could potentially be large.</p>
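<p>One consequence of this deferred evaluation is that you can save a
query object and collect it later. A minimal sketch, reusing
<code>ds</code> from above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># build the query: nothing is read from the files at this point
expensive_rides &lt;- ds %&gt;%
  filter(total_amount &gt; 100, year == 2015) %&gt;%
  select(tip_amount, total_amount, passenger_count)

# data is only scanned when you collect() the query
result &lt;- collect(expensive_rides)</code></pre></div>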
<p>Second, all work is pushed down to the individual data files, and
depending on the file format, chunks of data within files. As a result,
you can select a subset of data from a much larger data set by
collecting the smaller slices from each file: you don’t have to load the
whole data set in memory to slice from it.</p>
<p>Third, because of partitioning, you can ignore some files entirely.
In this example, by filtering <code>year == 2015</code>, all files
corresponding to other years are immediately excluded: you don’t have to
load them in order to find that no rows match the filter. For Parquet
files – which contain row groups with statistics on the data contained
within groups – there may be entire chunks of data you can avoid
scanning because they have no rows where
<code>total_amount &gt; 100</code>.</p>
<p>One final thing to note about querying Datasets: if you attempt
to call unsupported dplyr verbs or unimplemented functions in a query
on an Arrow Dataset, the arrow package raises an error.
However, for dplyr queries on Arrow Table objects (which are already
in-memory), the package automatically calls <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>
before processing that dplyr verb. To learn more about the dplyr back
end, see the <a href="./data_wrangling.html">data wrangling
article</a>.</p>
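<p>As a concrete illustration of the Dataset behavior: when a function
cannot be translated, you can collect a filtered subset first and finish
the computation in R. A hedged sketch, assuming <code>cor()</code> is one
such untranslated function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># pull a subset into R, then use any R function on the result
ds %&gt;%
  filter(year == 2015, total_amount &gt; 0) %&gt;%
  select(tip_amount, total_amount) %&gt;%
  collect() %&gt;%
  summarise(tip_total_cor = cor(tip_amount, total_amount))</code></pre></div>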
</div>
<div class="section level2">
<h2 id="batch-processing-experimental">Batch processing (experimental)<a class="anchor" aria-label="anchor" href="#batch-processing-experimental"></a>
</h2>
<p>Sometimes you want to run R code on the entire Dataset, but that
Dataset is much larger than memory. You can use <code>map_batches</code>
on a Dataset query to process it batch-by-batch.</p>
<p><strong>Note</strong>: <code>map_batches</code> is experimental and
not recommended for production use.</p>
<p>As an example, to randomly sample a Dataset, use
<code>map_batches</code> to sample a percentage of rows from each
batch:</p>
<div class="sourceCode" id="cb16"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">sampled_data</span> <span class="op">&lt;-</span> <span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/map_batches.html">map_batches</a></span><span class="op">(</span><span class="op">~</span> <span class="fu"><a href="../reference/as_record_batch.html">as_record_batch</a></span><span class="op">(</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/sample_n.html" class="external-link">sample_frac</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame</a></span><span class="op">(</span><span class="va">.</span><span class="op">)</span>, <span class="fl">1e-4</span><span class="op">)</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span>
<span></span>
<span><span class="fu"><a href="https://rdrr.io/r/utils/str.html" class="external-link">str</a></span><span class="op">(</span><span class="va">sampled_data</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## </span></span>
<span><span class="co">## tibble [10,918 &lt;U+00D7&gt; 4] (S3: tbl_df/tbl/data.frame)</span></span>
<span><span class="co">## $ tip_amount : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ...</span></span>
<span><span class="co">## $ total_amount : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ...</span></span>
<span><span class="co">## $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ...</span></span>
<span><span class="co">## $ tip_pct : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ...</span></span></code></pre>
<p>This function can also be used to aggregate summary statistics over a
Dataset by computing partial results for each batch and then aggregating
those partial results. Extending the example above, you could fit a
model to the sample data and then use <code>map_batches</code> to
compute the MSE on the full Dataset.</p>
<div class="sourceCode" id="cb18"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">model</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/stats/lm.html" class="external-link">lm</a></span><span class="op">(</span><span class="va">tip_pct</span> <span class="op">~</span> <span class="va">total_amount</span> <span class="op">+</span> <span class="va">passenger_count</span>, data <span class="op">=</span> <span class="va">sampled_data</span><span class="op">)</span></span>
<span></span>
<span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/map_batches.html">map_batches</a></span><span class="op">(</span><span class="kw">function</span><span class="op">(</span><span class="va">batch</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="va">batch</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>pred_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/predict.html" class="external-link">predict</a></span><span class="op">(</span><span class="va">model</span>, newdata <span class="op">=</span> <span class="va">.</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="op">!</span><span class="fu"><a href="https://rdrr.io/r/base/is.finite.html" class="external-link">is.nan</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>sse_partial <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="op">(</span><span class="va">pred_tip_pct</span> <span class="op">-</span> <span class="va">tip_pct</span><span class="op">)</span><span class="op">^</span><span class="fl">2</span><span class="op">)</span>, n_partial <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/as_record_batch.html">as_record_batch</a></span><span class="op">(</span><span class="op">)</span></span>
<span> <span class="op">}</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>mse <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="va">sse_partial</span><span class="op">)</span> <span class="op">/</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="va">n_partial</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">mse</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## </span></span>
<span><span class="co">## [1] 0.1304284</span></span></code></pre>
</div>
<div class="section level2">
<h2 id="dataset-options">Dataset options<a class="anchor" aria-label="anchor" href="#dataset-options"></a>
</h2>
<p>There are a few ways you can control Dataset creation to adapt to
special use cases.</p>
<div class="section level3">
<h3 id="work-with-files-in-a-directory">Work with files in a directory<a class="anchor" aria-label="anchor" href="#work-with-files-in-a-directory"></a>
</h3>
<p>If you are working with a single file or a set of files that are not
all in the same directory, you can provide a file path or a vector of
multiple file paths to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>. This is useful if,
for example, you have a single CSV file that is too big to read into
memory. You could pass the file path to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, use
<code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> to partition the Dataset into manageable chunks,
then use <code><a href="../reference/write_dataset.html">write_dataset()</a></code> to write each chunk to a separate
Parquet file—all without needing to read the full CSV file into R.</p>
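<p>A minimal sketch of that CSV-to-Parquet workflow, using hypothetical
file and directory names and assuming the CSV has a <code>year</code>
column to partition on:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># open one large CSV lazily, then rewrite it as partitioned Parquet files
big_csv &lt;- open_dataset("huge-file.csv", format = "csv")
big_csv %&gt;%
  group_by(year) %&gt;%
  write_dataset("huge-file-parquet", format = "parquet")</code></pre></div>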
</div>
<div class="section level3">
<h3 id="explicitly-declare-column-names-and-data-types">Explicitly declare column names and data types<a class="anchor" aria-label="anchor" href="#explicitly-declare-column-names-and-data-types"></a>
</h3>
<p>You can specify the <code>schema</code> argument to
<code><a href="../reference/open_dataset.html">open_dataset()</a></code> to declare the columns and their data types.
This is useful if you have data files with different storage schemas
(for example, a column could be <code>int32</code> in one and
<code>int8</code> in another) and you want to ensure that the resulting
Dataset has a specific type.</p>
<p>To be clear, it’s not necessary to specify a schema, even in this
example of mixed integer types, because the Dataset constructor will
reconcile differences like these. The schema specification just lets you
declare what you want the result to be.</p>
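<p>A hedged sketch: the directory and column names below are hypothetical,
and note that a supplied schema must describe every column in the
files:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># read `id` as int64 even if individual files store it as int32 or int8
ds &lt;- open_dataset(
  "mixed-int-files",
  schema = schema(id = int64(), value = float64())
)</code></pre></div>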
</div>
<div class="section level3">
<h3 id="explicitly-declare-partition-format">Explicitly declare partition format<a class="anchor" aria-label="anchor" href="#explicitly-declare-partition-format"></a>
</h3>
<p>Similarly, you can provide a Schema in the <code>partitioning</code>
argument of <code><a href="../reference/open_dataset.html">open_dataset()</a></code> in order to declare the types of
the virtual columns that define the partitions. This would be useful, in
the NYC taxi data example, if you wanted to keep <code>month</code> as a
string instead of an integer.</p>
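<p>For example, a sketch that keeps <code>month</code> as a string. Because
the taxi files use “key=value” paths, this uses
<code>hive_partition()</code> to declare the partition column types:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># declare the types of the virtual partition columns explicitly
ds &lt;- open_dataset(
  "nyc-taxi",
  partitioning = hive_partition(year = int32(), month = utf8())
)</code></pre></div>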
</div>
<div class="section level3">
<h3 id="work-with-multiple-data-sources">Work with multiple data sources<a class="anchor" aria-label="anchor" href="#work-with-multiple-data-sources"></a>
</h3>
<p>Another feature of Datasets is that they can be composed of multiple
data sources. That is, you may have a directory of partitioned Parquet
files in one location, and in another directory, files that haven’t been
partitioned. Or, you could point to an S3 bucket of Parquet data and a
directory of CSVs on the local file system and query them together as a
single Dataset. To create a multi-source Dataset, provide a list of
Datasets to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> instead of a file path, or
concatenate them with a command like
<code>big_dataset &lt;- c(ds1, ds2)</code>.</p>
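<p>A minimal sketch of composing two sources (the paths are hypothetical,
and the sources must have compatible schemas):</p>
<div class="sourceCode"><pre class="sourceCode r"><code>ds1 &lt;- open_dataset("local-parquet-data")
ds2 &lt;- open_dataset("local-csv-data", format = "csv")
big_dataset &lt;- open_dataset(list(ds1, ds2))</code></pre></div>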
</div>
</div>
<div class="section level2">
<h2 id="writing-datasets">Writing Datasets<a class="anchor" aria-label="anchor" href="#writing-datasets"></a>
</h2>
<p>As you can see, querying a large Dataset can be made quite fast by
storing the data in an efficient binary columnar format like Parquet or
Feather and partitioning it based on columns commonly used for filtering.
However, data isn’t always stored that way. Sometimes you might start with
one giant CSV. The first step in analyzing data is cleaning it up and
reshaping it into a more usable form.</p>
<p>The <code><a href="../reference/write_dataset.html">write_dataset()</a></code> function allows you to take a
Dataset or another tabular data object—an Arrow Table or RecordBatch, or
an R data frame—and write it to a different file format, partitioned
into multiple files.</p>
<p>Assume that you have a version of the NYC taxi data as CSV:</p>
<div class="sourceCode" id="cb20"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/csv/"</span>, format <span class="op">=</span> <span class="st">"csv"</span><span class="op">)</span></span></code></pre></div>
<p>You can write it to a new location and translate the files to the
Feather format by calling <code><a href="../reference/write_dataset.html">write_dataset()</a></code> on it:</p>
<div class="sourceCode" id="cb21"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="va">ds</span>, <span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div>
<p>Next, let’s imagine that the <code>payment_type</code> column is
something you often filter on, so you want to partition the data by that
variable. By doing so you ensure that a filter like
<code>payment_type == "Cash"</code> will touch only a subset of files
where <code>payment_type</code> is always <code>"Cash"</code>.</p>
<p>One natural way to express the columns you want to partition on is to
use the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> method:</p>
<div class="sourceCode" id="cb22"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div>
<p>This will write files to a directory tree that looks like this:</p>
<div class="sourceCode" id="cb23"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/system.html" class="external-link">system</a></span><span class="op">(</span><span class="st">"tree nyc-taxi/feather"</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## feather</span></span>
<span><span class="co">## ├── payment_type=1</span></span>
<span><span class="co">## │ └── part-18.arrow</span></span>
<span><span class="co">## ├── payment_type=2</span></span>
<span><span class="co">## │ └── part-19.arrow</span></span>
<span><span class="co">## ...</span></span>
<span><span class="co">## └── payment_type=UNK</span></span>
<span><span class="co">## └── part-17.arrow</span></span>
<span><span class="co">##</span></span>
<span><span class="co">## 18 directories, 23 files</span></span></code></pre>
<p>Note that the directory names are <code>payment_type=Cash</code> and
similar: this is the Hive-style partitioning described above. This means
that when you call <code><a href="../reference/open_dataset.html">open_dataset()</a></code> on this directory, you
don’t have to declare what the partitions are because they can be read
from the file paths. (To instead write bare values for partition
segments, i.e. <code>Cash</code> rather than
<code>payment_type=Cash</code>, call <code><a href="../reference/write_dataset.html">write_dataset()</a></code> with
<code>hive_style = FALSE</code>.)</p>
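<p>A sketch of that option, using a hypothetical output directory. Because
the resulting paths are no longer self-describing, you must declare the
partitioning when you re-open the data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code># write bare partition values: Cash/ rather than payment_type=Cash/
ds %&gt;%
  group_by(payment_type) %&gt;%
  write_dataset("nyc-taxi/feather-bare", format = "feather",
                hive_style = FALSE)

# declare the partition column when opening the bare-path Dataset
ds2 &lt;- open_dataset("nyc-taxi/feather-bare", format = "feather",
                    partitioning = "payment_type")</code></pre></div>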
<p>Perhaps, though, <code>payment_type == "Cash"</code> is the only data
you ever care about, and you just want to drop the rest and have a
smaller working set. For this, you can <code><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter()</a></code> them out
when writing:</p>
<div class="sourceCode" id="cb25"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">payment_type</span> <span class="op">==</span> <span class="st">"Cash"</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div>
<p>The other thing you can do when writing Datasets is select a subset
of columns or reorder them. Suppose you never care about
<code>vendor_name</code>, and being a string column, it can take up a lot
of space when you read it in, so let’s drop it:</p>
<div class="sourceCode" id="cb26"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="op">-</span><span class="va">vendor_id</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div>
<p>Note that while you can select a subset of columns, you cannot
currently rename columns when writing a Dataset.</p>
</div>
<div class="section level2">
<h2 id="partitioning-performance-considerations">Partitioning performance considerations<a class="anchor" aria-label="anchor" href="#partitioning-performance-considerations"></a>
</h2>
<p>Partitioning Datasets has two aspects that affect performance: it
increases the number of files and it creates a directory structure
around the files. Both of these have benefits as well as costs.
Depending on the configuration and the size of your Dataset, the costs
can outweigh the benefits.</p>
<p>Because partitions split up the Dataset into multiple files,
partitioned Datasets can be read and written with parallelism. However,
each additional file adds a little overhead in processing for filesystem
interaction. It also increases the overall Dataset size since each file
has some shared metadata. For example, each Parquet file contains the
schema and group-level statistics. The number of partitions is a floor
for the number of files. If you partition a Dataset by date with a year
of data, you will have at least 365 files. If you further partition by
another dimension with 1,000 unique values, you will have up to 365,000
files. Partitioning this finely often leads to small files that mostly
consist of metadata.</p>
<p>Partitioned Datasets create nested folder structures, and those allow
us to prune which files are loaded in a scan. However, this adds
overhead to discovering files in the Dataset, as we’ll need to
recursively “list directory” to find the data files. Partitions that are
too fine can cause problems here: partitioning a dataset by date for a
year’s worth of data will require 365 list calls to find all the files;
adding another column with cardinality 1,000 will make that 365,365
calls.</p>
<p>The optimal partitioning layout will depend on your data, access
patterns, and which systems will be reading the data. Most systems,
including Arrow, should work across a range of file sizes and
partitioning layouts, but there are extremes you should avoid. These
guidelines can help avoid some known worst cases:</p>
<ul>
<li>Avoid files smaller than 20MB and larger than 2GB.</li>
<li>Avoid partitioning layouts with more than 10,000 distinct
partitions.</li>
</ul>
<p>For file formats that have a notion of groups within a file, such as
Parquet, similar guidelines apply. Row groups can provide parallelism
when reading and allow data skipping based on statistics, but very small
groups can cause metadata to be a significant portion of file size.
Arrow’s file writer provides sensible defaults for group sizing in most
cases.</p>
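<p>If the defaults produce files outside these ranges, here is a hedged
sketch of bounding file and row-group sizes when writing (the argument
values are illustrative, not recommendations):</p>
<div class="sourceCode"><pre class="sourceCode r"><code># cap the number of rows per file and per Parquet row group
ds %&gt;%
  write_dataset("nyc-taxi/resized", format = "parquet",
                max_rows_per_file = 1e7,
                max_rows_per_group = 1e6)</code></pre></div>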
</div>
<div class="section level2">
<h2 id="transactions-acid-guarantees">Transactions / ACID guarantees<a class="anchor" aria-label="anchor" href="#transactions-acid-guarantees"></a>
</h2>
<p>The Dataset API offers no transaction support or any ACID guarantees.
This affects both reading and writing. Concurrent reads are fine.
Concurrent writes, or writes concurrent with reads, may have unexpected
behavior. Various approaches can be used to avoid operating on the same
files, such as using a unique basename template for each writer, a
temporary directory for new files, or separate storage of the file list
instead of relying on directory discovery.</p>
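<p>For example, a sketch of giving each writer a distinct basename
template so that two processes never produce the same file names (the
template and path are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode r"><code># "{i}" is replaced with an auto-incremented integer per output file
write_dataset(ds, "nyc-taxi/shared-output",
              basename_template = "writer-a-{i}.parquet")</code></pre></div>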
<p>Unexpectedly killing the process while a write is in progress can
leave the system in an inconsistent state. Write calls generally return
as soon as the bytes to be written have been completely delivered to the
OS page cache, so even though a write operation has completed, it is
possible for part of the file to be lost if there is a sudden power loss
immediately after the write call.</p>
<p>Most file formats have magic numbers that are written at the end.
This means a partial file write can safely be detected and discarded.
The CSV file format does not have any such concept, and a partially
written CSV file may be detected as valid.</p>
</div>
<div class="section level2">
<h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a>
</h2>
<ul>
<li>To learn about cloud storage, see the <a href="./fs.html">cloud
storage article</a>.</li>
<li>To learn about dplyr with arrow, see the <a href="./data_wrangling.html">data wrangling article</a>.</li>
<li>To learn about reading and writing data, see the <a href="./read_write.html">read/write article</a>.</li>
<li>For specific recipes on reading and writing multi-file Datasets, see
this <a href="https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html" class="external-link">Arrow
R cookbook chapter</a>.</li>
<li>To manually enable cloud support on Linux, see the article on <a href="./install.html">installation on Linux</a>.</li>
<li>To learn about schemas and metadata, see the <a href="./metadata.html">metadata article</a>.</li>
</ul>
</div>
</main>
</div>
</div>
</body>
</html>