| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| <title>Working with Arrow Datasets and dplyr • Arrow R Package</title> |
| <!-- jquery --><script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js" integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo=" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://cdnjs.cloudflare.com/ajax/libs/bootswatch/3.4.0/cosmo/bootstrap.min.css" rel="stylesheet" crossorigin="anonymous"> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.4.1/js/bootstrap.min.js" integrity="sha256-nuL8/2cJ5NDSSwnKD8VqreErSWHtnEP9E7AySL+1ev4=" crossorigin="anonymous"></script><!-- bootstrap-toc --><link rel="stylesheet" href="../bootstrap-toc.css"> |
| <script src="../bootstrap-toc.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous"> |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous"> |
| <!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet"> |
| <script src="../pkgdown.js"></script><script src="../extra.js"></script><meta property="og:title" content="Working with Arrow Datasets and dplyr"> |
| <meta property="og:description" content="arrow"> |
| <!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--><!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| _paq.push(["setDoNotTrack", true]); |
| _paq.push(["disableCookies"]); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
|   var u="https://analytics.apache.org/"; |
|   _paq.push(['setTrackerUrl', u+'matomo.php']); |
|   _paq.push(['setSiteId', '20']); |
|   var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
|   g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script><!-- End Matomo Code --> |
| </head> |
| <body data-spy="scroll" data-target="#toc"> |
| <div class="container template-article"> |
| <header><div class="navbar navbar-default navbar-fixed-top" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| <span class="navbar-brand"> |
| <a class="navbar-link" href="../index.html">Arrow R Package</a> |
| <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Released version">5.0.0</span> |
| </span> |
| </div> |
| |
| <div id="navbar" class="navbar-collapse collapse"> |
| <ul class="nav navbar-nav"> |
| <li> |
| <a href="https://arrow.apache.org/">❯❯❯</a> |
| </li> |
| <li> |
| <a href="../articles/arrow.html">Get started</a> |
| </li> |
| <li> |
| <a href="../reference/index.html">Reference</a> |
| </li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> |
| Articles |
| |
| <span class="caret"></span> |
| </a> |
| <ul class="dropdown-menu" role="menu"> |
| <li> |
| <a href="../articles/install.html">Installing the Arrow Package on Linux</a> |
| </li> |
| <li> |
| <a href="../articles/dataset.html">Working with Arrow Datasets and dplyr</a> |
| </li> |
| <li> |
| <a href="../articles/fs.html">Working with Cloud Storage (S3)</a> |
| </li> |
| <li> |
| <a href="../articles/python.html">Apache Arrow in Python and R with reticulate</a> |
| </li> |
| <li> |
| <a href="../articles/flight.html">Connecting to Flight RPC Servers</a> |
| </li> |
| <li> |
| <a href="../articles/developing.html">Arrow R Developer Guide</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="../news/index.html">Changelog</a> |
| </li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> |
| Project docs |
| |
| <span class="caret"></span> |
| </a> |
| <ul class="dropdown-menu" role="menu"> |
| <li> |
| <a href="https://arrow.apache.org/docs/format/README.html">Specification</a> |
| </li> |
| <li> |
| <a href="https://arrow.apache.org/docs/c_glib">C GLib</a> |
| </li> |
| <li> |
| <a href="https://arrow.apache.org/docs/cpp">C++</a> |
| </li> |
| <li> |
| <a href="https://arrow.apache.org/docs/java">Java</a> |
| </li> |
| <li> |
| <a href="https://arrow.apache.org/docs/js">JavaScript</a> |
| </li> |
| <li> |
| <a href="https://arrow.apache.org/docs/python">Python</a> |
| </li> |
| <li> |
| <a href="../index.html">R</a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| <ul class="nav navbar-nav navbar-right"></ul> |
| </div> |
| <!--/.nav-collapse --> |
| </div> |
| <!--/.container --> |
| </div> |
| <!--/.navbar --> |
| |
| |
| |
| </header><div class="row"> |
| <div class="col-md-9 contents"> |
| <div class="page-header toc-ignore"> |
| <h1 data-toc-skip>Working with Arrow Datasets and dplyr</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/master/r/vignettes/dataset.Rmd"><code>vignettes/dataset.Rmd</code></a></small> |
| <div class="hidden name"><code>dataset.Rmd</code></div> |
| |
| </div> |
| |
| |
| |
| <p>Apache Arrow lets you work efficiently with large, multi-file datasets. The <code>arrow</code> R package provides a <code>dplyr</code> interface to Arrow Datasets, as well as other tools for interactive exploration of Arrow data.</p> |
| <p>This vignette introduces Datasets and shows how to use <code>dplyr</code> to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap.</p> |
| <div id="example-nyc-taxi-data" class="section level2"> |
| <h2 class="hasAnchor"> |
| <a href="#example-nyc-taxi-data" class="anchor"></a>Example: NYC taxi data</h2> |
| <p>The <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City taxi trip record data</a> is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version of about 10 years of the trip data in a public Amazon S3 bucket.</p> |
| <p>The total file size is around 37 gigabytes, even in the efficient Parquet file format. That’s bigger than memory on most people’s computers, so we can’t just read it all in and stack it into a single data frame.</p> |
| <p>S3 support is included in the Windows and macOS binary packages. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See <code><a href="../articles/install.html">vignette("install", package = "arrow")</a></code> for details. To see if your <code>arrow</code> installation has S3 support, run</p> |
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu">arrow</span><span class="fu">::</span><span class="fu"><a href="../reference/arrow_available.html">arrow_with_s3</a></span><span class="op">(</span><span class="op">)</span></code></pre></div> |
| <pre><code>## [1] TRUE</code></pre> |
| <p>Even with S3 support enabled, network speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a “nyc-taxi” directory.</p> |
| <p>If your <code>arrow</code> build has S3 support, you can sync the data locally with:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu">arrow</span><span class="fu">::</span><span class="fu"><a href="../reference/copy_files.html">copy_files</a></span><span class="op">(</span><span class="st">"s3://ursa-labs-taxi-data"</span>, <span class="st">"nyc-taxi"</span><span class="op">)</span></code></pre></div> |
| <p>If your <code>arrow</code> build doesn’t have S3 support, you can download the files with some additional code:</p> |
| <div class="sourceCode" id="cb4"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">bucket</span> <span class="op"><-</span> <span class="st">"https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"</span> |
| <span class="kw">for</span> <span class="op">(</span><span class="va">year</span> <span class="kw">in</span> <span class="fl">2009</span><span class="op">:</span><span class="fl">2019</span><span class="op">)</span> <span class="op">{</span> |
| <span class="kw">if</span> <span class="op">(</span><span class="va">year</span> <span class="op">==</span> <span class="fl">2019</span><span class="op">)</span> <span class="op">{</span> |
| <span class="co"># We only have through June 2019 there</span> |
| <span class="va">months</span> <span class="op"><-</span> <span class="fl">1</span><span class="op">:</span><span class="fl">6</span> |
| <span class="op">}</span> <span class="kw">else</span> <span class="op">{</span> |
| <span class="va">months</span> <span class="op"><-</span> <span class="fl">1</span><span class="op">:</span><span class="fl">12</span> |
| <span class="op">}</span> |
| <span class="kw">for</span> <span class="op">(</span><span class="va">month</span> <span class="kw">in</span> <span class="fu"><a href="https://rdrr.io/r/base/sprintf.html">sprintf</a></span><span class="op">(</span><span class="st">"%02d"</span>, <span class="va">months</span><span class="op">)</span><span class="op">)</span> <span class="op">{</span> |
| <span class="fu"><a href="https://rdrr.io/r/base/files2.html">dir.create</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/file.path.html">file.path</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span>, <span class="va">year</span>, <span class="va">month</span><span class="op">)</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span> |
| <span class="kw"><a href="https://rdrr.io/r/base/try.html">try</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/utils/download.file.html">download.file</a></span><span class="op">(</span> |
| <span class="fu"><a href="https://rdrr.io/r/base/paste.html">paste</a></span><span class="op">(</span><span class="va">bucket</span>, <span class="va">year</span>, <span class="va">month</span>, <span class="st">"data.parquet"</span>, sep <span class="op">=</span> <span class="st">"/"</span><span class="op">)</span>, |
| <span class="fu"><a href="https://rdrr.io/r/base/file.path.html">file.path</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span>, <span class="va">year</span>, <span class="va">month</span>, <span class="st">"data.parquet"</span><span class="op">)</span>, |
| mode <span class="op">=</span> <span class="st">"wb"</span> |
| <span class="op">)</span>, silent <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span> |
| <span class="op">}</span> |
| <span class="op">}</span></code></pre></div> |
| <p>Note that these download steps in the vignette are not executed: if you want to run with live data, you’ll have to do it yourself separately. Given the size, if you’re running this locally and don’t have a fast connection, feel free to grab only a year or two of data.</p> |
| <p>If you don’t have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version is running, let’s check whether we’re running with live data:</p> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu"><a href="https://rdrr.io/r/base/files2.html">dir.exists</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span><span class="op">)</span></code></pre></div> |
| <pre><code>## [1] FALSE</code></pre> |
| </div> |
| <div id="getting-started" class="section level2"> |
| <h2 class="hasAnchor"> |
| <a href="#getting-started" class="anchor"></a>Getting started</h2> |
| <p>Because <code>dplyr</code> is not necessary for many Arrow workflows, it is an optional (<code>Suggests</code>) dependency. So, to work with Datasets, we need to load both <code>arrow</code> and <code>dplyr</code>.</p> |
| <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="kw"><a href="https://rdrr.io/r/base/library.html">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/">arrow</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span> |
| <span class="kw"><a href="https://rdrr.io/r/base/library.html">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org">dplyr</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></code></pre></div> |
| <p>The first step is to create our Dataset object, pointing at the directory of data.</p> |
| <div class="sourceCode" id="cb8"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi"</span>, partitioning <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="st">"year"</span>, <span class="st">"month"</span><span class="op">)</span><span class="op">)</span></code></pre></div> |
| <p>The default file format for <code><a href="../reference/open_dataset.html">open_dataset()</a></code> is Parquet; if we had a directory of Arrow format files, we could include <code>format = "arrow"</code> in the call. Other supported formats include: <code>"feather"</code> (an alias for <code>"arrow"</code>, as Feather v2 is the Arrow file format), <code>"csv"</code>, <code>"tsv"</code> (for tab-delimited), and <code>"text"</code> for generic text-delimited files. For text files, you can pass any parsing options (<code>delim</code>, <code>quote</code>, etc.) to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> that you would otherwise pass to <code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code>.</p> |
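| <p>For instance, here is a minimal sketch of opening a hypothetical directory of semicolon-delimited text files (the <code>"nyc-taxi-text"</code> path and the <code>delim</code> value are illustrative only):</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">ds_txt &lt;- open_dataset("nyc-taxi-text", format = "text", delim = ";")</code></pre></div> |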
| <p>The <code>partitioning</code> argument lets us specify how the file paths provide information about how the dataset is chunked into different files. Our files in this example have file paths like</p> |
| <pre><code>2009/01/data.parquet |
| 2009/02/data.parquet |
| ...</code></pre> |
| <p>By providing a character vector to <code>partitioning</code>, we’re saying that the first path segment gives the value for <code>year</code> and the second segment is <code>month</code>. Every row in <code>2009/01/data.parquet</code> has a value of 2009 for <code>year</code> and 1 for <code>month</code>, even though those columns may not actually be present in the file.</p> |
| <p>Indeed, when we look at the dataset, we see that in addition to the columns present in every file, there are also columns <code>year</code> and <code>month</code>.</p> |
| <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span></code></pre></div> |
| <pre><code>## |
| ## FileSystemDataset with 125 Parquet files |
| ## vendor_id: string |
| ## pickup_at: timestamp[us] |
| ## dropoff_at: timestamp[us] |
| ## passenger_count: int8 |
| ## trip_distance: float |
| ## pickup_longitude: float |
| ## pickup_latitude: float |
| ## rate_code_id: null |
| ## store_and_fwd_flag: string |
| ## dropoff_longitude: float |
| ## dropoff_latitude: float |
| ## payment_type: string |
| ## fare_amount: float |
| ## extra: float |
| ## mta_tax: float |
| ## tip_amount: float |
| ## tolls_amount: float |
| ## total_amount: float |
| ## year: int32 |
| ## month: int32 |
| ## |
| ## See $metadata for additional Schema metadata</code></pre> |
| <p>The other form of partitioning currently supported is <a href="https://hive.apache.org/">Hive</a>-style, in which the partition variable names are included in the path segments. If we had saved our files in paths like</p> |
| <pre><code>year=2009/month=01/data.parquet |
| year=2009/month=02/data.parquet |
| ...</code></pre> |
| <p>we would not have had to provide the names in <code>partitioning</code>: we could have just called <code>ds <- open_dataset("nyc-taxi")</code> and the partitions would have been detected automatically.</p> |
| </div> |
| <div id="querying-the-dataset" class="section level2"> |
| <h2 class="hasAnchor"> |
| <a href="#querying-the-dataset" class="anchor"></a>Querying the dataset</h2> |
| <p>Up to this point, we haven’t loaded any data: we have walked directories to find files, we’ve parsed file paths to identify partitions, and we’ve read the headers of the Parquet files to inspect their schemas so that we can make sure they all line up.</p> |
| <p>In the current release, <code>arrow</code> supports the dplyr verbs <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">transmute()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. Aggregation is not yet supported, so before you call <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or other verbs with aggregate functions, use <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> to pull the selected subset of the data into an in-memory R data frame.</p> |
| <p>If you attempt to call unsupported <code>dplyr</code> verbs or unimplemented functions in your query on an Arrow Dataset, the <code>arrow</code> package raises an error. However, for <code>dplyr</code> queries on <code>Table</code> objects (which are typically smaller in size) the package automatically calls <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> before processing that <code>dplyr</code> verb.</p> |
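| <p>As a rough illustration, the following sketch assumes the single file <code>nyc-taxi/2009/01/data.parquet</code> has been downloaded and is read into an in-memory <code>Table</code>; there, <code>summarise()</code> triggers an implicit <code>collect()</code> instead of raising an error:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">tab &lt;- read_parquet("nyc-taxi/2009/01/data.parquet", as_data_frame = FALSE) |
| tab %>% |
|   group_by(passenger_count) %>% |
|   summarise(mean_fare = mean(fare_amount))</code></pre></div> |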
| <p>Here’s an example. Suppose I was curious about tipping behavior among the longest taxi rides. Let’s find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers:</p> |
| <div class="sourceCode" id="cb13"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu"><a href="https://rdrr.io/r/base/system.time.html">system.time</a></span><span class="op">(</span><span class="va">ds</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">></span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html">collect</a></span><span class="op">(</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise</a></span><span class="op">(</span> |
| median_tip_pct <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html">median</a></span><span class="op">(</span><span class="va">tip_pct</span><span class="op">)</span>, |
| n <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/context.html">n</a></span><span class="op">(</span><span class="op">)</span> |
| <span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://rdrr.io/r/base/print.html">print</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span></code></pre></div> |
| <pre><code>## |
| ## # A tibble: 10 x 3 |
| ## passenger_count median_tip_pct n |
| ## <int> <dbl> <int> |
| ## 1 0 9.84 380 |
| ## 2 1 16.7 143087 |
| ## 3 2 16.6 34418 |
| ## 4 3 14.4 8922 |
| ## 5 4 11.4 4771 |
| ## 6 5 16.7 5806 |
| ## 7 6 16.7 3338 |
| ## 8 7 16.7 11 |
| ## 9 8 16.7 32 |
| ## 10 9 16.7 42 |
| ## |
| ## user system elapsed |
| ## 4.436 1.012 1.402</code></pre> |
| <p>We just selected a subset out of a dataset with around 2 billion rows, computed a new column, and aggregated on it in under 2 seconds on my laptop. How does this work?</p> |
| <p>First, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>/<code><a href="https://dplyr.tidyverse.org/reference/mutate.html">transmute()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>/<code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>/<code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> record their actions but don’t evaluate on the data until you run <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>.</p> |
| <div class="sourceCode" id="cb15"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html">filter</a></span><span class="op">(</span><span class="va">total_amount</span> <span class="op">></span> <span class="fl">100</span>, <span class="va">year</span> <span class="op">==</span> <span class="fl">2015</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html">select</a></span><span class="op">(</span><span class="va">tip_amount</span>, <span class="va">total_amount</span>, <span class="va">passenger_count</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate</a></span><span class="op">(</span>tip_pct <span class="op">=</span> <span class="fl">100</span> <span class="op">*</span> <span class="va">tip_amount</span> <span class="op">/</span> <span class="va">total_amount</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by</a></span><span class="op">(</span><span class="va">passenger_count</span><span class="op">)</span></code></pre></div> |
| <pre><code>## |
| ## FileSystemDataset (query) |
| ## tip_amount: float |
| ## total_amount: float |
| ## passenger_count: int8 |
| ## tip_pct: expr |
| ## |
| ## * Filter: ((total_amount > 100) and (year == 2015)) |
| ## * Grouped by passenger_count |
| ## See $.data for the source Arrow object</code></pre> |
| <p>This returns instantly and shows the manipulations you’ve made, without loading data from the files. Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large.</p> |
| <p>Second, all work is pushed down to the individual data files and, depending on the file format, to chunks of data within the files. As a result, we can select a subset of data from a much larger dataset by collecting the smaller slices from each file: we don’t have to load the whole dataset into memory in order to slice from it.</p> |
| <p>Third, because of partitioning, we can ignore some files entirely. In this example, by filtering <code>year == 2015</code>, all files corresponding to other years are immediately excluded: we don’t have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with statistics on the data within, there may be entire chunks of data we can avoid scanning because they have no rows where <code>total_amount > 100</code>.</p> |
| </div> |
| <div id="more-dataset-options" class="section level2"> |
| <h2 class="hasAnchor"> |
| <a href="#more-dataset-options" class="anchor"></a>More dataset options</h2> |
| <p>There are a few ways you can control the Dataset creation to adapt to special use cases. For one, if you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to partition the Dataset into manageable chunks, then use <code><a href="../reference/write_dataset.html">write_dataset()</a></code> to write each chunk to a separate Parquet file—all without needing to read the full CSV file into R.</p> |
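| <p>A minimal sketch of that workflow, assuming a hypothetical oversized <code>"huge_file.csv"</code> that contains a <code>year</code> column to partition on:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">big_csv &lt;- open_dataset("huge_file.csv", format = "csv") |
| big_csv %>% |
|   group_by(year) %>% |
|   write_dataset("huge_file_parquet", format = "parquet")</code></pre></div> |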
| <p>You can specify a <code>schema</code> argument to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be <code>int32</code> in one and <code>int8</code> in another) and you want to ensure that the resulting Dataset has a specific type. To be clear, it’s not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be.</p> |
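| <p>For example, here is a sketch of declaring types up front, assuming a hypothetical directory <code>"mixed-ints"</code> whose files contain only <code>id</code> and <code>value</code> columns stored with inconsistent integer widths:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">ds_typed &lt;- open_dataset( |
|   "mixed-ints", |
|   schema = schema(id = int32(), value = int32()) |
| )</code></pre></div> |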
| <p>Similarly, you can provide a Schema in the <code>partitioning</code> argument of <code><a href="../reference/open_dataset.html">open_dataset()</a></code> in order to declare the types of the virtual columns that define the partitions. This would be useful, in our taxi dataset example, if you wanted to keep <code>month</code> as a string instead of an integer for some reason.</p> |
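| <p>Continuing the taxi example, that declaration would look something like this sketch, where the only point is that <code>month</code> is typed as a string:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">ds &lt;- open_dataset( |
|   "nyc-taxi", |
|   partitioning = schema(year = int32(), month = string()) |
| )</code></pre></div> |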
| <p>Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, and in another directory, files that haven’t been partitioned. Or, you could point to an S3 bucket of Parquet data and a directory of CSVs on the local file system and query them together as a single dataset. To create a multi-source dataset, provide a list of datasets to <code><a href="../reference/open_dataset.html">open_dataset()</a></code> instead of a file path, or simply concatenate them like <code>big_dataset <- c(ds1, ds2)</code>.</p> |
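| <p>A sketch of that, assuming two hypothetical local directories with compatible schemas:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">ds1 &lt;- open_dataset("nyc-taxi/parquet") |
| ds2 &lt;- open_dataset("nyc-taxi/csv", format = "csv") |
| big_dataset &lt;- c(ds1, ds2)</code></pre></div> |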
| </div> |
| <div id="writing-datasets" class="section level2"> |
| <h2 class="hasAnchor"> |
| <a href="#writing-datasets" class="anchor"></a>Writing datasets</h2> |
| <p>As you can see, querying a large dataset can be made quite fast by storing it in an efficient binary columnar format like Parquet or Feather and partitioning it based on columns commonly used for filtering. However, we don’t always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning it up and reshaping it into a more usable form.</p> |
| <p>The <code><a href="../reference/write_dataset.html">write_dataset()</a></code> function allows you to take a Dataset or other tabular data object—an Arrow <code>Table</code> or <code>RecordBatch</code>, or an R <code>data.frame</code>—and write it to a different file format, partitioned into multiple files.</p> |
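| <p>For instance, a data frame already in memory can be written out the same way; this sketch uses the built-in <code>mtcars</code> data and partitions it by <code>cyl</code>:</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">write_dataset(mtcars, "mtcars-parquet", format = "parquet", partitioning = "cyl")</code></pre></div> |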
| <p>Assume we have a version of the NYC Taxi data as CSV:</p> |
| <div class="sourceCode" id="cb17"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/csv/"</span>, format <span class="op">=</span> <span class="st">"csv"</span><span class="op">)</span></code></pre></div> |
| <p>We can write it to a new location and translate the files to the Feather format by calling <code><a href="../reference/write_dataset.html">write_dataset()</a></code> on it:</p> |
| <div class="sourceCode" id="cb18"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="va">ds</span>, <span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></code></pre></div> |
| <p>Next, let’s imagine that the <code>payment_type</code> column is something we often filter on, so we want to partition the data by that variable. By doing so we ensure that a filter like <code>payment_type == "Cash"</code> will touch only a subset of files where <code>payment_type</code> is always <code>"Cash"</code>.</p> |
| <p>One natural way to express the columns you want to partition on is to use the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> method:</p> |
| <div class="sourceCode" id="cb19"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></code></pre></div> |
| <p>This will write files to a directory tree that looks like this:</p> |
| <div class="sourceCode" id="cb20"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="fu"><a href="https://rdrr.io/r/base/system.html">system</a></span><span class="op">(</span><span class="st">"tree nyc-taxi/feather"</span><span class="op">)</span></code></pre></div> |
| <pre><code>## feather |
| ## ├── payment_type=1 |
| ## │ └── part-18.feather |
| ## ├── payment_type=2 |
| ## │ └── part-19.feather |
| ## ... |
| ## └── payment_type=UNK |
| ## └── part-17.feather |
| ## |
| ## 18 directories, 23 files</code></pre> |
| <p>Note that the directory names are <code>payment_type=Cash</code> and similar: this is the Hive-style partitioning described above. This means that when we call <code><a href="../reference/open_dataset.html">open_dataset()</a></code> on this directory, we don’t have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. <code>Cash</code> rather than <code>payment_type=Cash</code>, call <code><a href="../reference/write_dataset.html">write_dataset()</a></code> with <code>hive_style = FALSE</code>.)</p> |
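| <p>A sketch of the non-Hive variant (the <code>"nyc-taxi/feather-bare"</code> output path is just illustrative):</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">ds %>% |
|   group_by(payment_type) %>% |
|   write_dataset("nyc-taxi/feather-bare", format = "feather", hive_style = FALSE)</code></pre></div> |
| <p>Because the directory names would then be bare values, you would need to pass <code>partitioning = "payment_type"</code> when re-opening that directory with <code><a href="../reference/open_dataset.html">open_dataset()</a></code>.</p> |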
| <p>Perhaps, though, <code>payment_type == "Cash"</code> is the only data we ever care about, and we just want to drop the rest and have a smaller working set. For this, we can <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> them out when writing:</p> |
| <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html">filter</a></span><span class="op">(</span><span class="va">payment_type</span> <span class="op">==</span> <span class="st">"Cash"</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></code></pre></div> |
| <p>The other thing we can do when writing datasets is select a subset of columns and/or reorder them. Suppose we never care about <code>vendor_id</code>; since it is a string column, it can take up a lot of space when we read it in, so let’s drop it:</p> |
| <div class="sourceCode" id="cb23"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span class="va">ds</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by</a></span><span class="op">(</span><span class="va">payment_type</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html">select</a></span><span class="op">(</span><span class="op">-</span><span class="va">vendor_id</span><span class="op">)</span> <span class="op">%>%</span> |
| <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="st">"nyc-taxi/feather"</span>, format <span class="op">=</span> <span class="st">"feather"</span><span class="op">)</span></code></pre></div> |
| <p>Note that while you can select a subset of columns, you cannot currently rename columns when writing a dataset.</p> |
| </div> |
| </div> |
| |
| <div class="col-md-3 hidden-xs hidden-sm" id="pkgdown-sidebar"> |
| |
| <nav id="toc" data-toggle="toc"><h2 data-toc-skip>Contents</h2> |
| </nav> |
| </div> |
| |
| </div> |
| |
| |
| |
| <footer><div class="copyright"> |
| <p>Developed by Neal Richardson, Ian Cook, Nic Crane, Jonathan Keane, Romain François, Jeroen Ooms, Apache Arrow.</p> |
| </div> |
| |
| <div class="pkgdown"> |
| <p>Site built with <a href="https://pkgdown.r-lib.org/">pkgdown</a> 1.6.1.</p> |
| </div> |
| |
| </footer> |
| </div> |
| |
| |
| |
| |
| <script type="text/javascript" src="/docs/_static/versionwarning.js"></script> </body> |
| </html> |