| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en-US"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <title>Working with multi-file data sets • Arrow R Package</title> |
| <!-- favicons --><link rel="icon" type="image/png" sizes="96x96" href="../favicon-96x96.png"> |
| <link rel="icon" type="”image/svg+xml”" href="../favicon.svg"> |
| <link rel="apple-touch-icon" sizes="180x180" href="../apple-touch-icon.png"> |
| <link rel="icon" sizes="any" href="../favicon.ico"> |
| <link rel="manifest" href="../site.webmanifest"> |
| <script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"> |
| <script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><link href="../deps/font-awesome-6.5.2/css/all.min.css" rel="stylesheet"> |
| <link href="../deps/font-awesome-6.5.2/css/v4-shims.min.css" rel="stylesheet"> |
| <script src="../deps/headroom-0.11.0/headroom.min.js"></script><script src="../deps/headroom-0.11.0/jQuery.headroom.min.js"></script><script src="../deps/bootstrap-toc-1.0.1/bootstrap-toc.min.js"></script><script src="../deps/clipboard.js-2.0.11/clipboard.min.js"></script><script src="../deps/search-1.0.0/autocomplete.jquery.min.js"></script><script src="../deps/search-1.0.0/fuse.min.js"></script><script src="../deps/search-1.0.0/mark.min.js"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"> |
| <meta property="og:title" content="Working with multi-file data sets"> |
| <meta name="description" content="Learn how to use Datasets to read, write, and analyze multi-file larger-than-memory data |
| "> |
| <meta property="og:description" content="Learn how to use Datasets to read, write, and analyze multi-file larger-than-memory data |
| "> |
| <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"> |
| <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"> |
| <!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
</script><!-- End Matomo Code --><!-- Kapa AI --><script async src="https://widget.kapa.ai/kapa-widget.bundle.js" data-website-id="9db461d5-ac77-4b3f-a5c5-75efa78339d2" data-project-name="Apache Arrow" data-project-color="#000000" data-project-logo="https://arrow.apache.org/img/arrow-logo_chevrons_white-txt_black-bg.png" data-modal-disclaimer="This is a custom LLM with access to all of [Arrow documentation](https://arrow.apache.org/docs/). If you want an R-specific answer, please mention this in your question." data-consent-required="true" data-user-analytics-cookie-enabled="false" data-consent-screen-disclaimer="By clicking &quot;I agree, let's chat&quot;, you consent to the use of the AI assistant in accordance with kapa.ai's [Privacy Policy](https://www.kapa.ai/content/privacy-policy). This service uses reCAPTCHA, which requires your consent to Google's [Privacy Policy](https://policies.google.com/privacy) and [Terms of Service](https://policies.google.com/terms). By proceeding, you explicitly agree to both kapa.ai's and Google's privacy policies."></script><!-- End Kapa AI -->
| </head> |
| <body> |
| <a href="#main" class="visually-hidden-focusable">Skip to contents</a> |
| |
| |
| <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container"> |
| |
| <a class="navbar-brand me-2" href="../index.html">Arrow R Package</a> |
| |
| <span class="version"> |
| <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">21.0.0.9000</small> |
| </span> |
| |
| |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| <div id="navbar" class="collapse navbar-collapse ms-3"> |
| <ul class="navbar-nav me-auto"> |
| <li class="nav-item"><a class="nav-link" href="../articles/arrow.html">Get started</a></li> |
| <li class="nav-item"><a class="nav-link" href="../reference/index.html">Reference</a></li> |
| <li class="active nav-item dropdown"> |
| <button class="nav-link dropdown-toggle" type="button" id="dropdown-articles" data-bs-toggle="dropdown" aria-expanded="false" aria-haspopup="true">Articles</button> |
| <ul class="dropdown-menu" aria-labelledby="dropdown-articles"> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Using the package</h6></li> |
| <li><a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a></li> |
| <li><a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a></li> |
| <li><a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a></li> |
| <li><a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a></li> |
| <li><a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a></li> |
| <li><a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6></li> |
| <li><a class="dropdown-item" href="../articles/data_objects.html">Data objects</a></li> |
| <li><a class="dropdown-item" href="../articles/data_types.html">Data types</a></li> |
| <li><a class="dropdown-item" href="../articles/metadata.html">Metadata</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Installation</h6></li> |
| <li><a class="dropdown-item" href="../articles/install.html">Installing on Linux</a></li> |
| <li><a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><a class="dropdown-item" href="../articles/index.html">More articles...</a></li> |
| </ul> |
| </li> |
| <li class="nav-item"><a class="nav-link" href="../news/index.html">Changelog</a></li> |
| </ul> |
| <form class="form-inline my-2 my-lg-0" role="search"> |
| <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="" autocomplete="off"> |
| </form> |
| |
| <ul class="navbar-nav"> |
| <li class="nav-item"><a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="GitHub"><span class="fa fab fa-github fa-lg"></span></a></li> |
| </ul> |
| </div> |
| |
| |
| </div> |
| </nav><div class="container template-article"> |
| |
| |
| |
| |
| <div class="row"> |
| <main id="main" class="col-md-9"><div class="page-header"> |
| |
| <h1>Working with multi-file data sets</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/dataset.Rmd" class="external-link"><code>vignettes/dataset.Rmd</code></a></small> |
| <div class="d-none name"><code>dataset.Rmd</code></div> |
| </div> |
| |
| |
| |
<p>Apache Arrow lets you work efficiently with single- and multi-file
data sets, even when the data is too large to load into memory.
With the help of Arrow Dataset objects you can analyze this kind of data
using familiar <a href="https://dplyr.tidyverse.org/" class="external-link">dplyr</a> syntax.
This article introduces Datasets and shows you how to analyze them with
dplyr and arrow; we’ll start by ensuring both packages are loaded:</p>
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/" class="external-link">arrow</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span> |
| <span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span></code></pre></div> |
| <div class="section level2"> |
| <h2 id="example-nyc-taxi-data">Example: NYC taxi data<a class="anchor" aria-label="anchor" href="#example-nyc-taxi-data"></a> |
| </h2> |
<p>The primary motivation for Arrow’s Dataset objects is to allow users
to analyze extremely large datasets. As an example, consider the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" class="external-link">New
York City taxi trip record data</a> that is widely used in big data
exercises and competitions. To demonstrate the capabilities of Apache
Arrow, we host a Parquet-formatted version of this data in a public Amazon
S3 bucket: in its full form, our version of the data set is one very
large table with about 1.7 billion rows and 24 columns, where each row
| corresponds to a single taxi ride sometime between 2009 and 2022. A <a href="https://arrow-user2022.netlify.app/packages-and-data.html#data" class="external-link">data |
| dictionary</a> for this version of the NYC taxi data is also |
| available.</p> |
<p>This multi-file data set comprises 158 distinct Parquet files,
each corresponding to a month of data. A single file is typically around
400-500MB, and the full data set is about 70GB. It is
not a small data set – it is slow to download and does not fit in memory
on a typical machine 🙂 – so we also host a “tiny” version of the NYC
taxi data that is formatted in exactly the same way but includes only
one out of every thousand entries in the original data set (i.e.,
individual files are &lt;1MB in size, and the “tiny” data set is only
70MB).</p>
| <p>If you have Amazon S3 support enabled in arrow (true for most users; |
| see links at the end of this article if you need to troubleshoot this), |
| you can connect to a copy of the “tiny taxi data” stored on S3 with this |
| command:</p> |
| <div class="sourceCode" id="cb2"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/nyc-taxi-tiny"</span><span class="op">)</span></span></code></pre></div> |
| <p>Alternatively you could connect to a copy of the data on Google Cloud |
| Storage (GCS) using the following command:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/nyc-taxi-tiny"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| <p>If you want to use the full data set, replace |
| <code>nyc-taxi-tiny</code> with <code>nyc-taxi</code> in the code above. |
Apart from size – and with it the cost in time, bandwidth usage, and CPU
cycles – there is no difference between the two versions of the data: you can
| test your code using the tiny taxi data and then check how it scales |
| using the full data set.</p> |
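<p>For example, connecting to the full-size copy on S3 would look like
this (the only change from the command above is the bucket path; the
variable name here is just for illustration):</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Full-size data: same layout, about 70GB instead of 70MB
bucket_full &lt;- s3_bucket("voltrondata-labs-datasets/nyc-taxi")</code></pre></div>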
<p>To copy the data set stored in
<code>bucket</code> to a local folder called <code>"nyc-taxi"</code>, use the
<code><a href="../reference/copy_files.html">copy_files()</a></code> function:</p>
| <div class="sourceCode" id="cb4"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../reference/copy_files.html">copy_files</a></span><span class="op">(</span>from <span class="op">=</span> <span class="va">bucket</span>, to <span class="op">=</span> <span class="st">"nyc-taxi"</span><span class="op">)</span></span></code></pre></div> |
| <p>For the purposes of this article, we assume that the NYC taxi dataset |
| (either the full data or the tiny version) has been downloaded locally |
| and exists in an <code>"nyc-taxi"</code> directory.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="opening-datasets">Opening Datasets<a class="anchor" aria-label="anchor" href="#opening-datasets"></a> |
| </h2> |
| <p>The first step in the process is to create a Dataset object that |
| points at the data directory:</p> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span><span class="op">)</span></span></code></pre></div> |
| <p>It is important to note that when we do this, the data values are not |
| loaded into memory. Instead, Arrow scans the data directory to find |
| relevant files, parses the file paths looking for a “Hive-style |
| partitioning” (see below), and reads headers of the data files to |
| construct a Schema that contains metadata describing the structure of |
the data. For more information about Schemas, see the <a href="./metadata.html">metadata article</a>.</p>
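<p>If you want to inspect that Schema – or see which files were discovered –
without reading any data values, a quick sketch like the following can help
(the exact columns listed will depend on which copy of the data you
downloaded):</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Inspect the inferred Schema (metadata only; no data values are read)
ds$schema

# List the individual files that make up the Dataset
ds$files</code></pre></div>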
| <p>Two questions naturally follow from this: what kind of files does |
| <code><a href="../reference/open_dataset.html">open_dataset()</a></code> look for, and what structure does it expect |
| to find in the file paths? Let’s start by looking at the file types.</p> |
| <p>By default <code><a href="../reference/open_dataset.html">open_dataset()</a></code> looks for Parquet files but |
| you can override this using the <code>format</code> argument. For |
| example if the data were encoded as CSV files we could set |
| <code>format = "csv"</code> to connect to the data. The Arrow Dataset |
| interface supports several file formats including:</p> |
| <ul> |
| <li> |
| <code>"parquet"</code> (the default)</li> |
| <li> |
| <code>"feather"</code> or <code>"ipc"</code> (aliases for |
| <code>"arrow"</code>; as Feather version 2 is the Arrow file |
| format)</li> |
| <li> |
| <code>"csv"</code> (comma-delimited files) and <code>"tsv"</code> |
| (tab-delimited files)</li> |
| <li> |
| <code>"text"</code> (generic text-delimited files - use the |
| <code>delimiter</code> argument to specify which to use)</li> |
| </ul> |
<p>In the case of text files, you can pass the following parsing options
to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to ensure that files are read
correctly (a short sketch follows the list):</p>
| <ul> |
| <li><code>delim</code></li> |
| <li><code>quote</code></li> |
| <li><code>escape_double</code></li> |
| <li><code>escape_backslash</code></li> |
| <li><code>skip_empty_rows</code></li> |
| </ul> |
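<p>As a hedged sketch, assuming a hypothetical directory of pipe-delimited
text files (not part of the hosted taxi data), those options could be
passed like so:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Hypothetical pipe-delimited files; the readr-style parsing options
# listed above are forwarded to the delimited-text reader
ds &lt;- open_dataset(
  "path/to/pipe-delimited/",   # assumed location, for illustration only
  format = "text",
  delim = "|",
  skip_empty_rows = TRUE
)</code></pre></div>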
| <p>An alternative when working with text files is to use |
| <code><a href="../reference/open_delim_dataset.html">open_delim_dataset()</a></code>, <code><a href="../reference/open_delim_dataset.html">open_csv_dataset()</a></code>, or |
| <code><a href="../reference/open_delim_dataset.html">open_tsv_dataset()</a></code>. These functions are wrappers around |
| <code><a href="../reference/open_dataset.html">open_dataset()</a></code> but with parameters that mirror |
| <code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code>, <code><a href="../reference/read_delim_arrow.html">read_delim_arrow()</a></code>, and |
| <code><a href="../reference/read_delim_arrow.html">read_tsv_arrow()</a></code> to allow for easy switching between |
| functions for opening single files and functions for opening |
| datasets.</p> |
| <p>For example:</p> |
| <div class="sourceCode" id="cb6"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_delim_dataset.html">open_csv_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/csv/"</span><span class="op">)</span></span></code></pre></div> |
| <p>For more information on these arguments and on parsing delimited text |
| files generally, see the help documentation for |
| <code><a href="../reference/read_delim_arrow.html">read_delim_arrow()</a></code> and |
| <code><a href="../reference/open_delim_dataset.html">open_delim_dataset()</a></code>.</p> |
| <p>Next, what information does <code><a href="../reference/open_dataset.html">open_dataset()</a></code> expect to |
| find in the file paths? By default, the Dataset interface looks for <a href="https://hive.apache.org/" class="external-link">Hive</a>-style partitioning structure in |
| which folders are named using a “key=value” convention, and data files |
| in a folder contain the subset of the data for which the key has the |
relevant value. For example, in the NYC taxi data, file paths look like
| this:</p> |
| <pre><code>year=2009/month=1/part-0.parquet |
| year=2009/month=2/part-0.parquet |
| ...</code></pre> |
| <p>From this, <code><a href="../reference/open_dataset.html">open_dataset()</a></code> infers that the first listed |
Parquet file contains the data for January 2009. In that sense, a
Hive-style partitioning is self-describing: the folder names state
| explicitly how the Dataset has been split across files.</p> |
<p>Sometimes the directory partitioning isn’t self-describing; that is,
| it doesn’t contain field names. For example, suppose the NYC taxi data |
| used file paths like these:</p> |
| <pre><code>2009/01/part-0.parquet |
| 2009/02/part-0.parquet |
| ...</code></pre> |
| <p>In that case, <code><a href="../reference/open_dataset.html">open_dataset()</a></code> would need some hints as to |
| how to use the file paths. In this case, you could provide |
| <code>c("year", "month")</code> to the <code>partitioning</code> |
| argument, saying that the first path segment gives the value for |
| <code>year</code>, and the second segment is <code>month</code>. Every |
| row in <code>2009/01/part-0.parquet</code> has a value of 2009 for |
| <code>year</code> and 1 for <code>month</code>, even though those |
| columns may not be present in the file. In other words, we would open |
| the data like this:</p> |
| <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span>, partitioning <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"year"</span>, <span class="st">"month"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <p>Either way, when you look at the Dataset, you can see that in |
| addition to the columns present in every file, there are also columns |
| <code>year</code> and <code>month</code>. These columns are not present |
| in the files themselves: they are inferred from the partitioning |
| structure.</p> |
| <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span></span></code></pre></div> |
| <pre><code><span><span class="co">## </span></span> |
| <span><span class="co">## FileSystemDataset with 158 Parquet files</span></span> |
| <span><span class="co">## vendor_name: string</span></span> |
| <span><span class="co">## pickup_datetime: timestamp[ms]</span></span> |
| <span><span class="co">## dropoff_datetime: timestamp[ms]</span></span> |
| <span><span class="co">## passenger_count: int64</span></span> |
| <span><span class="co">## trip_distance: double</span></span> |
| <span><span class="co">## pickup_longitude: double</span></span> |
| <span><span class="co">## pickup_latitude: double</span></span> |
| <span><span class="co">## rate_code: string</span></span> |
| <span><span class="co">## store_and_fwd: string</span></span> |
| <span><span class="co">## dropoff_longitude: double</span></span> |
| <span><span class="co">## dropoff_latitude: double</span></span> |
| <span><span class="co">## payment_type: string</span></span> |
| <span><span class="co">## fare_amount: double</span></span> |
| <span><span class="co">## extra: double</span></span> |
| <span><span class="co">## mta_tax: double</span></span> |
| <span><span class="co">## tip_amount: double</span></span> |
| <span><span class="co">## tolls_amount: double</span></span> |
| <span><span class="co">## total_amount: double</span></span> |
| <span><span class="co">## improvement_surcharge: double</span></span> |
| <span><span class="co">## congestion_surcharge: double</span></span> |
| <span><span class="co">## pickup_location_id: int64</span></span> |
| <span><span class="co">## dropoff_location_id: int64</span></span> |
| <span><span class="co">## year: int32</span></span> |
| <span><span class="co">## month: int32</span></span></code></pre> |
| </div> |
| <div class="section level2"> |
| <h2 id="querying-datasets">Querying Datasets<a class="anchor" aria-label="anchor" href="#querying-datasets"></a> |
| </h2> |
| <p>Now that we have a Dataset object that refers to our data, we can |
| construct dplyr-style queries. This is possible because arrow supplies a |
| back end that allows users to manipulate tabular Arrow data using dplyr |
| verbs. Here’s an example: suppose you are curious about tipping behavior |
| in the longest taxi rides. Let’s find the median tip percentage for |
| rides with fares greater than $100 in 2015, broken down by the number of |
| passengers:</p> |
| <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/system.time.html" class="external-link">system.time</a></span><span class="op">(</span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">></span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarise</a></span><span class="op">(</span></span> |
| <span> median_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html" class="external-link">median</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span>,</span> |
| <span> n <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span> <span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://rdrr.io/r/base/print.html" class="external-link">print</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## </span></span> |
| <span><span class="co">## # A tibble: 10 x 3</span></span> |
| <span><span class="co">## passenger_count median_tip_pct n</span></span> |
| <span><span class="co">## <int> <dbl> <int></span></span> |
| <span><span class="co">## 1 1 16.6 143087</span></span> |
| <span><span class="co">## 2 2 16.2 34418</span></span> |
| <span><span class="co">## 3 5 16.7 5806</span></span> |
| <span><span class="co">## 4 4 11.4 4771</span></span> |
| <span><span class="co">## 5 6 16.7 3338</span></span> |
| <span><span class="co">## 6 3 14.6 8922</span></span> |
| <span><span class="co">## 7 0 10.1 380</span></span> |
| <span><span class="co">## 8 8 16.7 32</span></span> |
| <span><span class="co">## 9 9 16.7 42</span></span> |
| <span><span class="co">## 10 7 16.7 11</span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## user system elapsed</span></span> |
| <span><span class="co">## 4.436 1.012 1.402</span></span></code></pre> |
<p>You’ve just selected a subset from a Dataset that contains about 1.7
billion rows, computed a new column, and aggregated it, all within a few
seconds on a modern laptop. How does this work?</p>
| <p>There are three reasons arrow can accomplish this task so |
| quickly:</p> |
| <p>First, arrow adopts a lazy evaluation approach to queries: when dplyr |
| verbs are called on the Dataset, they record their actions but do not |
| evaluate those actions on the data until you run <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>. |
| We can see this by taking the same code as before and leaving off the |
| final step:</p> |
| <div class="sourceCode" id="cb14"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">></span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarise</a></span><span class="op">(</span></span> |
| <span> median_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html" class="external-link">median</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span>,</span> |
| <span> n <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span> <span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## </span></span> |
| <span><span class="co">## FileSystemDataset (query)</span></span> |
| <span><span class="co">## passenger_count: int64</span></span> |
| <span><span class="co">## median_tip_pct: double</span></span> |
| <span><span class="co">## n: int32</span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## See $.data for the source Arrow object</span></span></code></pre> |
<p>This version of the code returns output instantly and shows the
| manipulations you’ve made, without loading data from the files. Because |
| the evaluation of these queries is deferred, you can build up a query |
| that selects down to a small subset without generating intermediate data |
| sets that could potentially be large.</p> |
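<p>You can also store the deferred query in a variable and decide later how
to materialize it. A small sketch: <code>collect()</code> pulls the result
into an R tibble, while dplyr’s <code>compute()</code> evaluates the query
but keeps the result as an Arrow Table:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Build the query without evaluating it
expensive_rides &lt;- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count)

# Evaluate later: a tibble in R memory, or an Arrow Table in Arrow memory
result_df  &lt;- expensive_rides %>% collect()
result_tbl &lt;- expensive_rides %>% compute()</code></pre></div>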
| <p>Second, all work is pushed down to the individual data files, and |
| depending on the file format, chunks of data within files. As a result, |
| you can select a subset of data from a much larger data set by |
| collecting the smaller slices from each file: you don’t have to load the |
| whole data set in memory to slice from it.</p> |
| <p>Third, because of partitioning, you can ignore some files entirely. |
| In this example, by filtering <code>year == 2015</code>, all files |
| corresponding to other years are immediately excluded: you don’t have to |
| load them in order to find that no rows match the filter. For Parquet |
| files – which contain row groups with statistics on the data contained |
| within groups – there may be entire chunks of data you can avoid |
| scanning because they have no rows where |
| <code>total_amount > 100</code>.</p> |
<p>One final thing to note about querying Datasets: if you attempt
to call unsupported dplyr verbs or unimplemented functions in a query
on an Arrow Dataset, the arrow package raises an error.
However, for dplyr queries on Arrow Table objects (which are already
in memory), the package automatically calls <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>
before processing that dplyr verb. To learn more about the dplyr back
end, see the <a href="./data_wrangling.html">data wrangling
article</a>.</p>
| </div> |
| <div class="section level2"> |
| <h2 id="batch-processing-experimental">Batch processing (experimental)<a class="anchor" aria-label="anchor" href="#batch-processing-experimental"></a> |
| </h2> |
| <p>Sometimes you want to run R code on the entire Dataset, but that |
| Dataset is much larger than memory. You can use <code>map_batches</code> |
| on a Dataset query to process it batch-by-batch.</p> |
| <p><strong>Note</strong>: <code>map_batches</code> is experimental and |
| not recommended for production use.</p> |
| <p>As an example, to randomly sample a Dataset, use |
| <code>map_batches</code> to sample a percentage of rows from each |
| batch:</p> |
| <div class="sourceCode" id="cb16"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">sampled_data</span> <span class="op"><-</span> <span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/map_batches.html">map_batches</a></span><span class="op">(</span><span class="op">~</span> <span class="fu"><a href="../reference/as_record_batch.html">as_record_batch</a></span><span class="op">(</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/sample_n.html" class="external-link">sample_frac</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame</a></span><span class="op">(</span><span class="va">.</span><span class="op">)</span>, <span class="fl">1e-4</span><span class="op">)</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/utils/str.html" class="external-link">str</a></span><span class="op">(</span><span class="va">sampled_data</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## </span></span> |
| <span><span class="co">## tibble [10,918 <U+00D7> 4] (S3: tbl_df/tbl/data.frame)</span></span> |
| <span><span class="co">## $ tip_amount : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ...</span></span> |
| <span><span class="co">## $ total_amount : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ...</span></span> |
| <span><span class="co">## $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ...</span></span> |
| <span><span class="co">## $ tip_pct : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ...</span></span></code></pre> |
| <p>This function can also be used to aggregate summary statistics over a |
| Dataset by computing partial results for each batch and then aggregating |
| those partial results. Extending the example above, you could fit a |
| model to the sample data and then use <code>map_batches</code> to |
| compute the MSE on the full Dataset.</p> |
| <div class="sourceCode" id="cb18"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">model</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/stats/lm.html" class="external-link">lm</a></span><span class="op">(</span><span class="va">tip_pct</span> <span class="op">~</span> <span class="va">total_amount</span> <span class="op">+</span> <span class="va">passenger_count</span>, data <span class="op">=</span> <span class="va">sampled_data</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/map_batches.html">map_batches</a></span><span class="op">(</span><span class="kw">function</span><span class="op">(</span><span class="va">batch</span><span class="op">)</span> <span class="op">{</span></span> |
| <span> <span class="va">batch</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>pred_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/predict.html" class="external-link">predict</a></span><span class="op">(</span><span class="va">model</span>, newdata <span class="op">=</span> <span class="va">.</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="op">!</span><span class="fu"><a href="https://rdrr.io/r/base/is.finite.html" class="external-link">is.nan</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>sse_partial <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="op">(</span><span class="va">pred_tip_pct</span> <span class="op">-</span> <span class="va">tip_pct</span><span class="op">)</span><span class="op">^</span><span class="fl">2</span><span class="op">)</span>, n_partial <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html" class="external-link">n</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/as_record_batch.html">as_record_batch</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span> <span class="op">}</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>mse <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="va">sse_partial</span><span class="op">)</span> <span class="op">/</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="va">n_partial</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">mse</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## </span></span> |
| <span><span class="co">## [1] 0.1304284</span></span></code></pre> |
| </div> |
| <div class="section level2"> |
| <h2 id="dataset-options">Dataset options<a class="anchor" aria-label="anchor" href="#dataset-options"></a> |
| </h2> |
<p>There are a few ways you can control Dataset creation to adapt to
special use cases.</p>
| <div class="section level3"> |
| <h3 id="work-with-files-in-a-directory">Work with files in a directory<a class="anchor" aria-label="anchor" href="#work-with-files-in-a-directory"></a> |
| </h3> |
| <p>If you are working with a single file or a set of files that are not |
| all in the same directory, you can provide a file path or a vector of |
| multiple file paths to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>. This is useful if, |
| for example, you have a single CSV file that is too big to read into |
| memory. You could pass the file path to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, use |
| <code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> to partition the Dataset into manageable chunks, |
| then use <code><a href="../reference/write_dataset.html">write_dataset()</a></code> to write each chunk to a separate |
| Parquet file—all without needing to read the full CSV file into R.</p> |
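<p>A rough sketch of that workflow, assuming a single (hypothetical)
<code>huge_file.csv</code>, might look like this:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Open the single large CSV as a Dataset; no rows are read yet
big_csv &lt;- open_dataset("huge_file.csv", format = "csv")

# Partition by a column and write the chunks out as Parquet files,
# all without ever loading the full CSV into R
big_csv %>%
  group_by(payment_type) %>%
  write_dataset("huge_file_parquet", format = "parquet")</code></pre></div>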
| </div> |
| <div class="section level3"> |
| <h3 id="explicitly-declare-column-names-and-data-types">Explicitly declare column names and data types<a class="anchor" aria-label="anchor" href="#explicitly-declare-column-names-and-data-types"></a> |
| </h3> |
| <p>You can specify the <code>schema</code> argument to |
| <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to declare the columns and their data types. |
This is useful if you have data files with different storage schemas
(for example, a column could be <code>int32</code> in one and
| <code>int8</code> in another) and you want to ensure that the resulting |
| Dataset has a specific type.</p> |
| <p>To be clear, it’s not necessary to specify a schema, even in this |
| example of mixed integer types, because the Dataset constructor will |
| reconcile differences like these. The schema specification just lets you |
| declare what you want the result to be.</p> |
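<p>As a minimal sketch – assuming, purely for illustration, files that
contain just the two columns named below – declaring the schema looks like
this:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Hypothetical two-column files: force int32 even where a file stores int8
ds &lt;- open_dataset(
  "mixed-int-data",                # assumed directory, for illustration
  schema = schema(
    id    = int32(),
    value = float64()
  )
)</code></pre></div>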
| </div> |
| <div class="section level3"> |
| <h3 id="explicitly-declare-partition-format">Explicitly declare partition format<a class="anchor" aria-label="anchor" href="#explicitly-declare-partition-format"></a> |
| </h3> |
| <p>Similarly, you can provide a Schema in the <code>partitioning</code> |
| argument of <code><a href="../reference/open_dataset.html">open_dataset()</a></code> in order to declare the types of |
| the virtual columns that define the partitions. This would be useful, in |
| the NYC taxi data example, if you wanted to keep <code>month</code> as a |
| string instead of an integer.</p> |
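<p>For the non-Hive path layout shown earlier
(<code>2009/01/part-0.parquet</code> and so on), a sketch of that would
be:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Keep month as a string rather than an integer
ds &lt;- open_dataset(
  "nyc-taxi",
  partitioning = schema(year = int32(), month = utf8())
)</code></pre></div>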
| </div> |
| <div class="section level3"> |
| <h3 id="work-with-multiple-data-sources">Work with multiple data sources<a class="anchor" aria-label="anchor" href="#work-with-multiple-data-sources"></a> |
| </h3> |
| <p>Another feature of Datasets is that they can be composed of multiple |
| data sources. That is, you may have a directory of partitioned Parquet |
| files in one location, and in another directory, files that haven’t been |
| partitioned. Or, you could point to an S3 bucket of Parquet data and a |
| directory of CSVs on the local file system and query them together as a |
| single Dataset. To create a multi-source Dataset, provide a list of |
| Datasets to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> instead of a file path, or |
| concatenate them with a command like |
| <code>big_dataset <- c(ds1, ds2)</code>.</p> |
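<p>As a sketch with hypothetical paths, combining a partitioned Parquet
directory with a directory of CSVs might look like this (the two sources
need compatible schemas):</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Two independently created Datasets over assumed locations
ds1 &lt;- open_dataset("data/parquet-part")                # partitioned Parquet
ds2 &lt;- open_dataset("data/csv-part", format = "csv")    # unpartitioned CSVs

# Combine them and query the result as a single Dataset
big_dataset &lt;- c(ds1, ds2)</code></pre></div>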
| </div> |
| </div> |
| <div class="section level2"> |
| <h2 id="writing-datasets">Writing Datasets<a class="anchor" aria-label="anchor" href="#writing-datasets"></a> |
| </h2> |
<p>As you can see, querying a large Dataset can be made quite fast by
storing it in an efficient binary columnar format like Parquet or Feather
and partitioning it based on columns commonly used for filtering. However,
data isn’t always stored that way. Sometimes you might start with one
giant CSV. The first step in analyzing data is cleaning it up and
reshaping it into a more usable form.</p>
| <p>The <code><a href="../reference/write_dataset.html">write_dataset()</a></code> function allows you to take a |
| Dataset or another tabular data object—an Arrow Table or RecordBatch, or |
| an R data frame—and write it to a different file format, partitioned |
| into multiple files.</p> |
| <p>Assume that you have a version of the NYC Taxi data as CSV:</p> |
| <div class="sourceCode" id="cb20"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/csv/"</span>, format <span class="op">=</span> <span class="st">"csv"</span><span class="op">)</span></span></code></pre></div> |
| <p>You can write it to a new location and translate the files to the |
| Feather format by calling <code><a href="../reference/write_dataset.html">write_dataset()</a></code> on it:</p> |
| <div class="sourceCode" id="cb21"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="va">ds</span>, <span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div> |
| <p>Next, let’s imagine that the <code>payment_type</code> column is |
| something you often filter on, so you want to partition the data by that |
| variable. By doing so you ensure that a filter like |
| <code>payment_type == "Cash"</code> will touch only a subset of files |
| where <code>payment_type</code> is always <code>"Cash"</code>.</p> |
| <p>One natural way to express the columns you want to partition on is to |
| use the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> method:</p> |
| <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div> |
| <p>This will write files to a directory tree that looks like this:</p> |
| <div class="sourceCode" id="cb23"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/system.html" class="external-link">system</a></span><span class="op">(</span><span class="st">"tree nyc-taxi/feather"</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## feather</span></span> |
| <span><span class="co">## ├── payment_type=1</span></span> |
| <span><span class="co">## │ └── part-18.arrow</span></span> |
| <span><span class="co">## ├── payment_type=2</span></span> |
| <span><span class="co">## │ └── part-19.arrow</span></span> |
| <span><span class="co">## ...</span></span> |
| <span><span class="co">## └── payment_type=UNK</span></span> |
| <span><span class="co">## └── part-17.arrow</span></span> |
| <span><span class="co">##</span></span> |
| <span><span class="co">## 18 directories, 23 files</span></span></code></pre> |
| <p>Note that the directory names are <code>payment_type=Cash</code> and |
| similar: this is the Hive-style partitioning described above. This means |
| that when you call <code><a href="../reference/open_dataset.html">open_dataset()</a></code> on this directory, you |
| don’t have to declare what the partitions are because they can be read |
| from the file paths. (To instead write bare values for partition |
| segments, i.e. <code>Cash</code> rather than |
| <code>payment_type=Cash</code>, call <code><a href="../reference/write_dataset.html">write_dataset()</a></code> with |
| <code>hive_style = FALSE</code>.)</p> |
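<p>A minimal sketch of writing with bare partition values would be:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">ds %>%
  group_by(payment_type) %>%
  write_dataset("nyc-taxi/feather", format = "feather", hive_style = FALSE)</code></pre></div>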
<p>Perhaps, though, <code>payment_type == "Cash"</code> is the only data
you ever care about, and you just want to drop the rest and have a
smaller working set. For this, you can <code><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter()</a></code> the other rows out
when writing:</p>
| <div class="sourceCode" id="cb25"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">payment_type</span> <span class="op">==</span> <span class="st">"Cash"</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div> |
<p>The other thing you can do when writing Datasets is select a subset
of columns or reorder them. Suppose you never care about
<code>vendor_name</code>: since it is a string column, it can take up a lot
of space when you read it in, so let’s drop it:</p>
| <div class="sourceCode" id="cb26"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">ds</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="op">-</span><span class="va">vendor_id</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></span></code></pre></div> |
| <p>Note that while you can select a subset of columns, you cannot |
| currently rename columns when writing a Dataset.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="partitioning-performance-considerations">Partitioning performance considerations<a class="anchor" aria-label="anchor" href="#partitioning-performance-considerations"></a> |
| </h2> |
| <p>Partitioning Datasets has two aspects that affect performance: it |
| increases the number of files and it creates a directory structure |
| around the files. Both of these have benefits as well as costs. |
| Depending on the configuration and the size of your Dataset, the costs |
| can outweigh the benefits.</p> |
| <p>Because partitions split up the Dataset into multiple files, |
| partitioned Datasets can be read and written with parallelism. However, |
| each additional file adds a little overhead in processing for filesystem |
| interaction. It also increases the overall Dataset size since each file |
has some shared metadata. For example, each Parquet file contains the
schema and group-level statistics. The number of partitions is a floor
on the number of files: if you partition a Dataset by date with a year
of data, you will have at least 365 files, and if you further partition by
another dimension with 1,000 unique values, you will have up to 365,000
files. Partitioning this finely often leads to small files that mostly
consist of metadata.</p>
<p>Partitioned Datasets create nested folder structures, and those allow
us to prune which files are loaded in a scan. However, this adds
overhead to discovering files in the Dataset, as we’ll need to
recursively “list directory” to find the data files. Overly fine
partitioning can cause problems here: partitioning a Dataset by date for a
year’s worth of data will require 365 list calls to find all the files;
adding another column with cardinality 1,000 will make that 365,365 calls.</p>
<p>The optimal partitioning layout will depend on your data, access
| patterns, and which systems will be reading the data. Most systems, |
| including Arrow, should work across a range of file sizes and |
| partitioning layouts, but there are extremes you should avoid. These |
| guidelines can help avoid some known worst cases:</p> |
| <ul> |
| <li>Avoid files smaller than 20MB and larger than 2GB.</li> |
| <li>Avoid partitioning layouts with more than 10,000 distinct |
| partitions.</li> |
| </ul> |
| <p>For file formats that have a notion of groups within a file, such as |
| Parquet, similar guidelines apply. Row groups can provide parallelism |
| when reading and allow data skipping based on statistics, but very small |
| groups can cause metadata to be a significant portion of file size. |
| Arrow’s file writer provides sensible defaults for group sizing in most |
| cases.</p> |
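<p>If you do need to tune these sizes, <code>write_dataset()</code> exposes
arguments for it. The following is a sketch only – the thresholds shown are
illustrative, not recommendations:</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">ds %>%
  write_dataset(
    "nyc-taxi/parquet-regrouped",     # hypothetical output path
    format = "parquet",
    max_rows_per_file  = 10000000,    # cap file size by row count
    max_rows_per_group = 1000000      # cap Parquet row-group size
  )</code></pre></div>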
| </div> |
| <div class="section level2"> |
| <h2 id="transactions-acid-guarantees">Transactions / ACID guarantees<a class="anchor" aria-label="anchor" href="#transactions-acid-guarantees"></a> |
| </h2> |
<p>The Dataset API offers no transaction support or any ACID guarantees.
This affects both reading and writing. Concurrent reads are fine, but
concurrent writes, or writes concurrent with reads, may have unexpected
behavior. Various approaches can be used to avoid operating on the same
files, such as using a unique basename template for each writer, a
temporary directory for new files, or separate storage of the file list
instead of relying on directory discovery.</p>
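<p>For example, one way to keep two concurrent writers from clobbering each
other’s files is to give each its own <code>basename_template</code>. A
sketch (the template names are arbitrary):</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Writer A
write_dataset(ds, "shared-output", basename_template = "writer-a-{i}.parquet")

# Writer B, running elsewhere, uses a different template so its
# file names never collide with writer A's
write_dataset(ds, "shared-output", basename_template = "writer-b-{i}.parquet")</code></pre></div>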
<p>Unexpectedly killing the process while a write is in progress can
leave the system in an inconsistent state. Write calls generally return
as soon as the bytes to be written have been completely delivered to the
OS page cache. Even though a write operation appears to have completed, it
is possible for part of the file to be lost if there is a sudden power loss
immediately after the write call.</p>
<p>Most file formats have magic numbers that are written at the end.
This means a partial file write can safely be detected and discarded.
The CSV file format has no such concept, so a partially
written CSV file may be detected as valid.</p>
| </div> |
| <div class="section level2"> |
| <h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a> |
| </h2> |
| <ul> |
| <li>To learn about cloud storage, see the <a href="./fs.html">cloud |
| storage article</a>.</li> |
| <li>To learn about dplyr with arrow, see the <a href="./data_wrangling.html">data wrangling article</a>.</li> |
| <li>To learn about reading and writing data, see the <a href="./read_write.html">read/write article</a>.</li> |
| <li>For specific recipes on reading and writing multi-file Datasets, see |
| this <a href="https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html" class="external-link">Arrow |
| R cookbook chapter</a>.</li> |
| <li>To manually enable cloud support on Linux, see the article on <a href="./install.html">installation on Linux</a>.</li> |
| <li>To learn about schemas and metadata, see the <a href="./metadata.html">metadata article</a>.</li> |
| </ul> |
| </div> |
| </main><aside class="col-md-3"><nav id="toc" aria-label="Table of contents"><h2>On this page</h2> |
| </nav></aside> |
| </div> |
| |
| |
| |
| <footer><div class="pkgdown-footer-left"> |
| <p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p> |
| </div> |
| |
| <div class="pkgdown-footer-right"> |
| <p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.1.3.</p> |
| </div> |
| |
| </footer> |
| </div> |
| |
| |
| |
| |
| |
| </body> |
| </html> |