| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <meta name="description" content="An overview of the Apache Arrow project and the arrow R package |
| "> |
| <title>Get started with Arrow • Arrow R Package</title> |
| <!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png"> |
| <link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png"> |
| <script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script> |
| <link href="../deps/bootstrap-5.2.2/bootstrap.min.css" rel="stylesheet"> |
| <script src="../deps/bootstrap-5.2.2/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous"> |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous"> |
| <!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><meta property="og:title" content="Get started with Arrow"> |
| <meta property="og:description" content="An overview of the Apache Arrow project and the arrow R package |
| "> |
| <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"> |
| <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"> |
| <meta name="twitter:card" content="summary_large_image"> |
| <meta name="twitter:creator" content="@apachearrow"> |
| <meta name="twitter:site" content="@apachearrow"> |
| <!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--><!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script><!-- End Matomo Code --> |
| </head> |
| <body> |
| <a href="#main" class="visually-hidden-focusable">Skip to contents</a> |
| |
| |
| <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container"> |
| |
| <a class="navbar-brand me-2" href="../index.html">Arrow R Package</a> |
| |
| <span class="version"> |
| <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">13.0.0</small> |
| </span> |
| |
| |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| <div id="navbar" class="collapse navbar-collapse ms-3"> |
| <ul class="navbar-nav me-auto"> |
| <li class="active nav-item"> |
| <a class="nav-link" href="../articles/arrow.html">Get started</a> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../reference/index.html">Reference</a> |
| </li> |
| <li class="nav-item dropdown"> |
| <a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a> |
| <div class="dropdown-menu" aria-labelledby="dropdown-articles"> |
| <h6 class="dropdown-header" data-toc-skip>Using the package</h6> |
| <a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a> |
| <a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a> |
| <a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a> |
| <a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a> |
| <a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a> |
| <a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6> |
| <a class="dropdown-item" href="../articles/data_objects.html">Data objects</a> |
| <a class="dropdown-item" href="../articles/data_types.html">Data types</a> |
| <a class="dropdown-item" href="../articles/metadata.html">Metadata</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Installation</h6> |
| <a class="dropdown-item" href="../articles/install.html">Installing on Linux</a> |
| <a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a> |
| <div class="dropdown-divider"></div> |
| <a class="dropdown-item" href="../articles/index.html">More articles...</a> |
| </div> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../news/index.html">Changelog</a> |
| </li> |
| </ul> |
| <form class="form-inline my-2 my-lg-0" role="search"> |
| <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off"> |
| </form> |
| |
| <ul class="navbar-nav"> |
| <li class="nav-item"> |
| <a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github"> |
| <span class="fab fa fab fa-github fa-lg"></span> |
| |
| </a> |
| </li> |
| </ul> |
| </div> |
| |
| |
| </div> |
| </nav><div class="container template-article"> |
| |
| <div class="row"> |
| <main id="main" class="col-md-9"><div class="page-header"> |
| <h1>Get started with Arrow</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/arrow.Rmd" class="external-link"><code>vignettes/arrow.Rmd</code></a></small> |
| <div class="d-none name"><code>arrow.Rmd</code></div> |
| </div> |
| |
| |
| |
| <p>Apache Arrow is a software development platform for building high-performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.</p> |
| <p>The arrow package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the <a href="https://arrow.apache.org/docs/cpp" class="external-link">Arrow C++ library</a>, and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.</p> |
| <div class="section level2"> |
| <h2 id="package-conventions">Package conventions<a class="anchor" aria-label="anchor" href="#package-conventions"></a> |
| </h2> |
| <p>The arrow R package builds on top of the Arrow C++ library, and C++ is an object-oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the arrow R package these are implemented as <a href="https://r6.r-lib.org" class="external-link"><code>R6</code></a> classes that all adopt “TitleCase” naming conventions. Some examples of these include:</p> |
| <ul> |
| <li>Two-dimensional, tabular data structures such as <code>Table</code>, <code>RecordBatch</code>, and <code>Dataset</code> |
| </li> |
| <li>One-dimensional, vector-like data structures such as <code>Array</code> and <code>ChunkedArray</code> |
| </li> |
| <li>Classes for reading, writing, and streaming data such as <code>ParquetFileReader</code> and <code>CsvTableReader</code> |
| </li> |
| </ul> |
| <p>This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because arrow also supplies a high-level interface using functions that follow a “snake_case” naming convention. Some examples of this include:</p> |
| <ul> |
| <li> |
| <code><a href="../reference/table.html">arrow_table()</a></code> allows you to create Arrow tables without directly using the <code>Table</code> object</li> |
| <li> |
| <code><a href="../reference/read_parquet.html">read_parquet()</a></code> allows you to open Parquet files without directly using the <code>ParquetFileReader</code> object</li> |
| </ul> |
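| <p>As a rough, one-off comparison of the two styles, the sketch below builds the same small table twice: once through the <code>create()</code> method of the <code>Table</code> R6 class, and once through <code>arrow_table()</code>. The object names <code>tbl_r6</code> and <code>tbl_fn</code> are placeholders chosen for this illustration only.</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">library(arrow, warn.conflicts = FALSE) |
| <span></span> |
| # Low-level interface: call a method on the TitleCase R6 class |
| tbl_r6 &lt;- Table$create(x = 1:3, y = c("a", "b", "c")) |
| <span></span> |
| # High-level interface: the snake_case wrapper function |
| tbl_fn &lt;- arrow_table(x = 1:3, y = c("a", "b", "c")) |
| <span></span> |
| # Both calls produce an object of the Table class |
| inherits(tbl_r6, "Table") |
| inherits(tbl_fn, "Table") |
| <span></span> |
| # After conversion to R, the two hold the same data |
| identical(as.data.frame(tbl_r6), as.data.frame(tbl_fn))</code></pre></div> |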
| <p>All the examples used in this article rely on this high-level interface.</p> |
| <p>For developers interested in learning more about the package structure, see the <a href="./developing.html">developer guide</a>.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="tabular-data-in-arrow">Tabular data in Arrow<a class="anchor" aria-label="anchor" href="#tabular-data-in-arrow"></a> |
| </h2> |
| <p>A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. In the arrow R package, the <code>Table</code> class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The <code><a href="../reference/table.html">arrow_table()</a></code> function allows you to generate new Arrow Tables in much the same way that <code><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame()</a></code> is used to create new data frames:</p> |
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/" class="external-link">arrow</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="va">dat</span> <span class="op"><-</span> <span class="fu"><a href="../reference/table.html">arrow_table</a></span><span class="op">(</span>x <span class="op">=</span> <span class="fl">1</span><span class="op">:</span><span class="fl">3</span>, y <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"a"</span>, <span class="st">"b"</span>, <span class="st">"c"</span><span class="op">)</span><span class="op">)</span></span> |
| <span><span class="va">dat</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table</span></span> |
| <span><span class="co">## 3 rows x 2 columns</span></span> |
| <span><span class="co">## $x <int32></span></span> |
| <span><span class="co">## $y <string></span></span></code></pre> |
| <p>You can use <code>[</code> to specify subsets of an Arrow Table in the same way you would for a data frame:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">dat</span><span class="op">[</span><span class="fl">1</span><span class="op">:</span><span class="fl">2</span>, <span class="fl">1</span><span class="op">:</span><span class="fl">2</span><span class="op">]</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table</span></span> |
| <span><span class="co">## 2 rows x 2 columns</span></span> |
| <span><span class="co">## $x <int32></span></span> |
| <span><span class="co">## $y <string></span></span></code></pre> |
| <p>Along the same lines, the <code>$</code> operator can be used to extract named columns:</p> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">dat</span><span class="op">$</span><span class="va">y</span></span></code></pre></div> |
| <pre><code><span><span class="co">## ChunkedArray</span></span> |
| <span><span class="co">## <string></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## "a",</span></span> |
| <span><span class="co">## "b",</span></span> |
| <span><span class="co">## "c"</span></span> |
| <span><span class="co">## ]</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| <p>Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R.</p> |
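| <p>To make the “chunked” part of the name concrete, here is a minimal sketch that builds a Chunked Array directly with <code>chunked_array()</code>, passing two R vectors that become two separate chunks. The name <code>ca</code> is a placeholder for this illustration; the values are arbitrary.</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">library(arrow, warn.conflicts = FALSE) |
| <span></span> |
| # Each argument becomes one chunk of the resulting ChunkedArray |
| ca &lt;- chunked_array(1:3, 4:6) |
| <span></span> |
| ca$num_chunks  # 2: the data are stored as two separate Arrays |
| length(ca)     # 6: but they behave as one logical vector |
| <span></span> |
| # Converting back to R yields an ordinary integer vector |
| as.vector(ca)</code></pre></div> |
| <p>The chunk boundaries are a storage detail: most operations treat the Chunked Array as a single vector of length six.</p> |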
| <p>Tables are the primary way to represent rectangular data in memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets, which are used for data stored on disk rather than in memory, and Record Batches, which are fundamental building blocks but not typically used in data analysis.</p> |
| <p>To learn more about the different data object classes in arrow, see the article on <a href="./data_objects.html">data objects</a>.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="converting-tables-to-data-frames">Converting Tables to data frames<a class="anchor" aria-label="anchor" href="#converting-tables-to-data-frames"></a> |
| </h2> |
| <p>Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using <code><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame()</a></code>:</p> |
| <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame</a></span><span class="op">(</span><span class="va">dat</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## x y</span></span> |
| <span><span class="co">## 1 1 a</span></span> |
| <span><span class="co">## 2 2 b</span></span> |
| <span><span class="co">## 3 3 c</span></span></code></pre> |
| <p>When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the <code>dat</code> Table, for instance, <code>dat$x</code> is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when <code><a href="https://rdrr.io/r/base/as.data.frame.html" class="external-link">as.data.frame()</a></code> is called.</p> |
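| <p>One way to see this conversion is to compare the Arrow type of each column with the class of the corresponding R column. The sketch below recreates the <code>dat</code> Table from earlier so that it can be run on its own; <code>df</code> is a placeholder name for the converted data frame.</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">library(arrow, warn.conflicts = FALSE) |
| <span></span> |
| dat &lt;- arrow_table(x = 1:3, y = c("a", "b", "c")) |
| <span></span> |
| # Arrow side: the column types as Arrow describes them |
| dat$x$type$ToString()  # "int32" |
| dat$y$type$ToString()  # "string" |
| <span></span> |
| # R side: the classes after coercion to a data frame |
| df &lt;- as.data.frame(dat) |
| class(df$x)  # "integer" |
| class(df$y)  # "character"</code></pre></div> |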
| <p>It is possible to exercise fine-grained control over this conversion process. To learn more about the different types and how they are converted, see the <a href="./data_types.html">data types</a> article.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="reading-and-writing-data">Reading and writing data<a class="anchor" aria-label="anchor" href="#reading-and-writing-data"></a> |
| </h2> |
| <p>One of the main ways to use arrow is to read and write data files in several common formats. The arrow package supplies extremely fast CSV reading and writing capabilities, and it also supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the arrow package supports multi-file data sets in which a single rectangular data set is stored across multiple files.</p> |
| <div class="section level3"> |
| <h3 id="individual-files">Individual files<a class="anchor" aria-label="anchor" href="#individual-files"></a> |
| </h3> |
| <p>When the goal is to read a single data file into memory, there are several functions you can use:</p> |
| <ul> |
| <li> |
| <code><a href="../reference/read_parquet.html">read_parquet()</a></code>: read a file in Parquet format</li> |
| <li> |
| <code><a href="../reference/read_feather.html">read_feather()</a></code>: read a file in Arrow/Feather format</li> |
| <li> |
| <code><a href="../reference/read_delim_arrow.html">read_delim_arrow()</a></code>: read a delimited text file</li> |
| <li> |
| <code><a href="../reference/read_delim_arrow.html">read_csv_arrow()</a></code>: read a comma-separated values (CSV) file</li> |
| <li> |
| <code><a href="../reference/read_delim_arrow.html">read_tsv_arrow()</a></code>: read a tab-separated values (TSV) file</li> |
| <li> |
| <code><a href="../reference/read_json_arrow.html">read_json_arrow()</a></code>: read a JSON data file</li> |
| </ul> |
| <p>In every case except JSON, there is a corresponding <code>write_*()</code> function that allows you to write data files in the appropriate format.</p> |
| <p>By default, the <code>read_*()</code> functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the <code>as_data_frame</code> argument to <code>FALSE</code>.</p> |
| <p>In the example below, we take the <code>starwars</code> data provided by the dplyr package and write it to a Parquet file using <code><a href="../reference/write_parquet.html">write_parquet()</a></code>:</p> |
| <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span>, warn.conflicts <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="va">file_path</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/tempfile.html" class="external-link">tempfile</a></span><span class="op">(</span>fileext <span class="op">=</span> <span class="st">".parquet"</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="../reference/write_parquet.html">write_parquet</a></span><span class="op">(</span><span class="va">starwars</span>, <span class="va">file_path</span><span class="op">)</span></span></code></pre></div> |
| <p>We can then use <code><a href="../reference/read_parquet.html">read_parquet()</a></code> to load the data from this file. As shown below, the default behavior is to return a data frame (<code>sw_frame</code>) but when we set <code>as_data_frame = FALSE</code> the data are read as an Arrow Table (<code>sw_table</code>):</p> |
| <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">sw_frame</span> <span class="op"><-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">file_path</span><span class="op">)</span></span> |
| <span><span class="va">sw_table</span> <span class="op"><-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">file_path</span>, as_data_frame <span class="op">=</span> <span class="cn">FALSE</span><span class="op">)</span></span> |
| <span><span class="va">sw_table</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table</span></span> |
| <span><span class="co">## 87 rows x 14 columns</span></span> |
| <span><span class="co">## $name <string></span></span> |
| <span><span class="co">## $height <int32></span></span> |
| <span><span class="co">## $mass <double></span></span> |
| <span><span class="co">## $hair_color <string></span></span> |
| <span><span class="co">## $skin_color <string></span></span> |
| <span><span class="co">## $eye_color <string></span></span> |
| <span><span class="co">## $birth_year <double></span></span> |
| <span><span class="co">## $sex <string></span></span> |
| <span><span class="co">## $gender <string></span></span> |
| <span><span class="co">## $homeworld <string></span></span> |
| <span><span class="co">## $species <string></span></span> |
| <span><span class="co">## $films: list<element <string>></span></span> |
| <span><span class="co">## $vehicles: list<element <string>></span></span> |
| <span><span class="co">## $starships: list<element <string>></span></span></code></pre> |
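| <p>The same read and write pattern applies to the other formats. As a sketch, the round trip below writes a small data frame to a temporary CSV file with <code>write_csv_arrow()</code> and reads it back as an Arrow Table with <code>read_csv_arrow()</code>. The names <code>small_df</code>, <code>csv_path</code>, and <code>csv_table</code> are placeholders for this illustration, and the data are made up.</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">library(arrow, warn.conflicts = FALSE) |
| <span></span> |
| small_df &lt;- data.frame(id = 1:3, label = c("a", "b", "c")) |
| csv_path &lt;- tempfile(fileext = ".csv") |
| <span></span> |
| # Write the data frame to disk as CSV |
| write_csv_arrow(small_df, csv_path) |
| <span></span> |
| # Read it back as an Arrow Table rather than a data frame |
| csv_table &lt;- read_csv_arrow(csv_path, as_data_frame = FALSE) |
| <span></span> |
| csv_table$num_rows  # 3 |
| names(csv_table)    # "id" "label"</code></pre></div> |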
| <p>To learn more about reading and writing individual data files, see the <a href="./read_write.html">read/write article</a>.</p> |
| </div> |
| <div class="section level3"> |
| <h3 id="multi-file-data-sets">Multi-file data sets<a class="anchor" aria-label="anchor" href="#multi-file-data-sets"></a> |
| </h3> |
| <p>When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data is relevant to an analysis, only one (smaller) file needs to be read. The arrow package provides the Dataset interface, a convenient way to read, write, and analyze both single data files that are larger than memory and multi-file data sets.</p> |
| <p>To illustrate the concepts, we’ll create a nonsense data set with 100000 rows that can be split into 10 subsets:</p> |
| <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Random.html" class="external-link">set.seed</a></span><span class="op">(</span><span class="fl">1234</span><span class="op">)</span></span> |
| <span><span class="va">nrows</span> <span class="op"><-</span> <span class="fl">100000</span></span> |
| <span><span class="va">random_data</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/data.frame.html" class="external-link">data.frame</a></span><span class="op">(</span></span> |
| <span> x <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="va">nrows</span><span class="op">)</span>,</span> |
| <span> y <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/Normal.html" class="external-link">rnorm</a></span><span class="op">(</span><span class="va">nrows</span><span class="op">)</span>,</span> |
| <span> subset <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/sample.html" class="external-link">sample</a></span><span class="op">(</span><span class="fl">10</span>, <span class="va">nrows</span>, replace <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <p>What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the <code>subset</code> column. To do this we first specify the path to a folder into which we will write the data files:</p> |
| <div class="sourceCode" id="cb13"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">dataset_path</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/file.path.html" class="external-link">file.path</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/tempfile.html" class="external-link">tempdir</a></span><span class="op">(</span><span class="op">)</span>, <span class="st">"random_data"</span><span class="op">)</span></span></code></pre></div> |
| <p>We can then use the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> function from dplyr to specify that the data will be partitioned using the <code>subset</code> column, and then pass the grouped data to <code><a href="../reference/write_dataset.html">write_dataset()</a></code>:</p> |
| <div class="sourceCode" id="cb14"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">random_data</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">subset</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="../reference/write_dataset.html">write_dataset</a></span><span class="op">(</span><span class="va">dataset_path</span><span class="op">)</span></span></code></pre></div> |
| <p>This creates a set of 10 files, one for each subset. These files are named according to the “hive partitioning” format as shown below:</p> |
| <div class="sourceCode" id="cb15"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/list.files.html" class="external-link">list.files</a></span><span class="op">(</span><span class="va">dataset_path</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## [1] "subset=1/part-0.parquet" "subset=10/part-0.parquet"</span></span> |
| <span><span class="co">## [3] "subset=2/part-0.parquet" "subset=3/part-0.parquet" </span></span> |
| <span><span class="co">## [5] "subset=4/part-0.parquet" "subset=5/part-0.parquet" </span></span> |
| <span><span class="co">## [7] "subset=6/part-0.parquet" "subset=7/part-0.parquet" </span></span> |
| <span><span class="co">## [9] "subset=8/part-0.parquet" "subset=9/part-0.parquet"</span></span></code></pre> |
| <p>Each of these Parquet files can be opened individually using <code><a href="../reference/read_parquet.html">read_parquet()</a></code> but it is often more convenient – especially for very large data sets – to scan the folder and “connect” to the data set without loading it into memory. We can do this using <code><a href="../reference/open_dataset.html">open_dataset()</a></code>:</p> |
| <div class="sourceCode" id="cb17"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">dset</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="va">dataset_path</span><span class="op">)</span></span> |
| <span><span class="va">dset</span></span></code></pre></div> |
| <pre><code><span><span class="co">## FileSystemDataset with 10 Parquet files</span></span> |
| <span><span class="co">## x: double</span></span> |
| <span><span class="co">## y: double</span></span> |
| <span><span class="co">## subset: int32</span></span></code></pre> |
| <p>This <code>dset</code> object does not store the data in memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to by <code>dset</code> as if it had been loaded.</p> |
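| <p>To get a feel for what that metadata contains, the self-contained sketch below writes a tiny partitioned data set and inspects the resulting Dataset object without collecting any rows. Everything here is illustrative: <code>demo_df</code>, <code>demo_path</code>, and <code>demo_dset</code> are placeholder names, the data are made up, and the <code>partitioning</code> argument of <code>write_dataset()</code> is used in place of <code>group_by()</code> simply to keep the example free of other packages.</p> |
| <div class="sourceCode"><pre class="sourceCode r"> |
| <code class="sourceCode R">library(arrow, warn.conflicts = FALSE) |
| <span></span> |
| demo_df &lt;- data.frame( |
|   value = c(1.5, 2.5, 3.5, 4.5), |
|   group = c(1L, 1L, 2L, 2L) |
| ) |
| demo_path &lt;- file.path(tempdir(), "demo_dataset") |
| <span></span> |
| # Write one Parquet file per value of the group column |
| write_dataset(demo_df, demo_path, partitioning = "group") |
| <span></span> |
| # Connect to the folder; no rows are loaded at this point |
| demo_dset &lt;- open_dataset(demo_path) |
| <span></span> |
| list.files(demo_path)    # one hive-style folder per group value |
| length(demo_dset$files)  # number of files the Dataset refers to |
| names(demo_dset)         # column names, taken from the schema</code></pre></div> |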
| <p>To learn more about Arrow Datasets, see the <a href="./dataset.html">dataset article</a>.</p> |
| </div> |
| </div> |
| <div class="section level2"> |
| <h2 id="analyzing-arrow-data-with-dplyr">Analyzing Arrow data with dplyr<a class="anchor" aria-label="anchor" href="#analyzing-arrow-data-with-dplyr"></a> |
| </h2> |
| <p>Arrow Tables and Datasets can be analyzed using dplyr syntax. This is possible because the arrow R package supplies a backend that translates dplyr verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a dplyr verb. For example, although the <code>dset</code> Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a dplyr pipeline like the one shown below:</p> |
| <div class="sourceCode" id="cb19"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">dset</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">subset</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>mean_x <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/mean.html" class="external-link">mean</a></span><span class="op">(</span><span class="va">x</span><span class="op">)</span>, min_y <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/Extremes.html" class="external-link">min</a></span><span class="op">(</span><span class="va">y</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">mean_x</span> <span class="op">></span> <span class="fl">0</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/arrange.html" class="external-link">arrange</a></span><span class="op">(</span><span class="va">subset</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 6 x 3</span></span></span> |
| <span><span class="co">## subset mean_x min_y</span></span> |
| <span><span class="co">## <span style="color: #949494; font-style: italic;"><int></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">1</span> 2 0.004<span style="text-decoration: underline;">86</span> -<span style="color: #BB0000;">4.00</span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">2</span> 3 0.004<span style="text-decoration: underline;">40</span> -<span style="color: #BB0000;">3.86</span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">3</span> 4 0.012<span style="text-decoration: underline;">5</span> -<span style="color: #BB0000;">3.65</span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">4</span> 6 0.023<span style="text-decoration: underline;">4</span> -<span style="color: #BB0000;">3.88</span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">5</span> 7 0.004<span style="text-decoration: underline;">77</span> -<span style="color: #BB0000;">4.65</span></span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">6</span> 9 0.005<span style="text-decoration: underline;">57</span> -<span style="color: #BB0000;">3.50</span></span></span></code></pre> |
| <p>Notice that we call <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code> at the end of the pipeline. No actual computations are performed until <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code> (or the related <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">compute()</a></code> function) is called. This “lazy evaluation” makes it possible for the Arrow C++ compute engine to optimize how the computations are performed.</p> |
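| <p>As a minimal sketch of this lazy behavior, the example below builds a query against a small in-memory Table created from the built-in <code>mtcars</code> data (an illustrative stand-in for <code>dset</code>, not the data used above). The pipeline itself returns a query object, and results are only computed when <code>collect()</code> is called:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| library(dplyr) |
| |
| # Illustrative stand-in data: an Arrow Table built from a built-in data frame |
| cars_table <- arrow_table(mtcars) |
| |
| # Building the pipeline does not run any computation yet |
| query <- cars_table %>% |
|   group_by(cyl) %>% |
|   summarize(mean_mpg = mean(mpg)) |
| |
| # The result is a lazy query object rather than a data frame |
| is_lazy <- inherits(query, "arrow_dplyr_query") |
| |
| # collect() runs the query and returns a tibble |
| result <- collect(query)</code></pre></div> |
| <p>Calling <code>compute()</code> in place of <code>collect()</code> keeps the result as an Arrow Table instead of converting it to a tibble.</p> |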
| <p>To learn more about analyzing Arrow data, see the <a href="./data_wrangling.html">data wrangling article</a>. The <a href="https://arrow.apache.org/docs/r/reference/acero.html">list of functions available in dplyr queries</a> page may also be useful.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="connecting-to-cloud-storage">Connecting to cloud storage<a class="anchor" aria-label="anchor" href="#connecting-to-cloud-storage"></a> |
| </h2> |
| <p>Another use for the arrow R package is to read, write, and analyze data sets stored remotely on cloud services. The package currently supports both Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS). The example below illustrates how you can use <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code> to refer to an S3 bucket, and use <code><a href="../reference/open_dataset.html">open_dataset()</a></code> to connect to the data set stored there:</p> |
| <div class="sourceCode" id="cb21"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/nyc-taxi"</span><span class="op">)</span></span> |
| <span><span class="va">nyc_taxi</span> <span class="op"><-</span> <span class="fu"><a href="../reference/open_dataset.html">open_dataset</a></span><span class="op">(</span><span class="va">bucket</span><span class="op">)</span></span></code></pre></div> |
| <p>To learn more about the support for cloud services in arrow, see the <a href="./fs.html">cloud storage</a> article.</p> |
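| <p>The same data set can also be referred to with a URI string. The sketch below is a hedged illustration: it assumes that you have network access and that your arrow build includes S3 support, which it checks with <code>arrow_with_s3()</code> before connecting. The variable names are illustrative:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # S3 support is optional in some arrow builds, so check before connecting |
| has_s3 <- arrow_with_s3() |
| |
| if (has_s3) { |
|   # Requires network access; the data values are not loaded into memory here |
|   nyc_taxi <- open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |
| }</code></pre></div> |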
| </div> |
| <div class="section level2"> |
| <h2 id="efficient-data-interchange-between-r-and-python">Efficient data interchange between R and Python<a class="anchor" aria-label="anchor" href="#efficient-data-interchange-between-r-and-python"></a> |
| </h2> |
| <p>The <a href="https://rstudio.github.io/reticulate/" class="external-link">reticulate</a> package provides an interface that allows you to call Python code from R. The arrow package is designed to be interoperable with reticulate. If the Python environment has the pyarrow library installed (the Python equivalent to the arrow package), you can pass an Arrow Table from R to Python using the <code><a href="https://rstudio.github.io/reticulate/reference/r-py-conversion.html" class="external-link">r_to_py()</a></code> function in reticulate as shown below:</p> |
| <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://rstudio.github.io/reticulate/" class="external-link">reticulate</a></span><span class="op">)</span></span> |
| <span></span> |
| <span><span class="va">sw_table_python</span> <span class="op"><-</span> <span class="fu"><a href="https://rstudio.github.io/reticulate/reference/r-py-conversion.html" class="external-link">r_to_py</a></span><span class="op">(</span><span class="va">sw_table</span><span class="op">)</span></span></code></pre></div> |
| <p>The <code>sw_table_python</code> object is now stored as a pyarrow Table: the Python equivalent of the Table class. You can see this when you print the object:</p> |
| <div class="sourceCode" id="cb23"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">sw_table_python</span></span></code></pre></div> |
| <pre><code><span><span class="co">## pyarrow.Table</span></span> |
| <span><span class="co">## name: string</span></span> |
| <span><span class="co">## height: int32</span></span> |
| <span><span class="co">## mass: double</span></span> |
| <span><span class="co">## hair_color: string</span></span> |
| <span><span class="co">## skin_color: string</span></span> |
| <span><span class="co">## eye_color: string</span></span> |
| <span><span class="co">## birth_year: double</span></span> |
| <span><span class="co">## sex: string</span></span> |
| <span><span class="co">## gender: string</span></span> |
| <span><span class="co">## homeworld: string</span></span> |
| <span><span class="co">## species: string</span></span> |
| <span><span class="co">## films: list<element: string></span></span> |
| <span><span class="co">## child 0, element: string</span></span> |
| <span><span class="co">## vehicles: list<element: string></span></span> |
| <span><span class="co">## child 0, element: string</span></span> |
| <span><span class="co">## starships: list<element: string></span></span> |
| <span><span class="co">## child 0, element: string</span></span> |
| <span><span class="co">## ----</span></span> |
| <span><span class="co">## name: [["Luke Skywalker","C-3PO","R2-D2","Darth Vader","Leia Organa",...,"Rey","Poe Dameron","BB8","Captain Phasma","Padmé Amidala"]]</span></span> |
| <span><span class="co">## height: [[172,167,96,202,150,...,null,null,null,null,165]]</span></span> |
| <span><span class="co">## mass: [[77,75,32,136,49,...,null,null,null,null,45]]</span></span> |
| <span><span class="co">## hair_color: [["blond",null,null,"none","brown",...,"brown","brown","none","unknown","brown"]]</span></span> |
| <span><span class="co">## skin_color: [["fair","gold","white, blue","white","light",...,"light","light","none","unknown","light"]]</span></span> |
| <span><span class="co">## eye_color: [["blue","yellow","red","yellow","brown",...,"hazel","brown","black","unknown","brown"]]</span></span> |
| <span><span class="co">## birth_year: [[19,112,33,41.9,19,...,null,null,null,null,46]]</span></span> |
| <span><span class="co">## sex: [["male","none","none","male","female",...,"female","male","none",null,"female"]]</span></span> |
| <span><span class="co">## gender: [["masculine","masculine","masculine","masculine","feminine",...,"feminine","masculine","masculine",null,"feminine"]]</span></span> |
| <span><span class="co">## homeworld: [["Tatooine","Tatooine","Naboo","Tatooine","Alderaan",...,null,null,null,null,"Naboo"]]</span></span> |
| <span><span class="co">## ...</span></span></code></pre> |
| <p>It is important to recognize that when this transfer takes place, only the C++ pointer (i.e., metadata referring to the underlying data object stored by the Arrow C++ library) is copied. The data values themselves remain in the same place in memory. As a consequence, it is much faster to pass an Arrow Table from R to Python than to copy a data frame in R to a pandas DataFrame in Python.</p> |
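| <p>A minimal round-trip sketch follows. It uses a small illustrative Table (not <code>sw_table</code>), checks that pyarrow is available in the Python environment used by reticulate, and converts the Python object back to R with <code>py_to_r()</code>:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| library(reticulate) |
| |
| # Small illustrative Table; the column names are arbitrary |
| small_table <- arrow_table(id = 1:3, label = c("a", "b", "c")) |
| |
| # Only attempt the transfer if pyarrow is installed on the Python side |
| if (py_module_available("pyarrow")) { |
|   small_table_python <- r_to_py(small_table) |
| |
|   # Attributes of the pyarrow Table are accessible with $ |
|   n_rows_python <- small_table_python$num_rows |
| |
|   # Convert the pyarrow Table back to an Arrow Table in R |
|   small_table_r <- py_to_r(small_table_python) |
| }</code></pre></div> |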
| <p>To learn more about passing Arrow data between R and Python, see the article on <a href="./python.html">Python integrations</a>.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="access-to-arrow-messages-buffers-and-streams">Access to Arrow messages, buffers, and streams<a class="anchor" aria-label="anchor" href="#access-to-arrow-messages-buffers-and-streams"></a> |
| </h2> |
| <p>The arrow package also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the <a href="https://spark.rstudio.com/" class="external-link"><code>sparklyr</code></a> package has support for using Arrow to move data to and from Spark, yielding <a href="https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/" class="external-link">significant performance gains</a>.</p> |
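| <p>As a small illustration of these lower-level tools, the sketch below serializes a Table to the Arrow IPC stream format in memory, wraps the bytes in an Arrow Buffer, and reads the stream back through a <code>BufferReader</code>. The Table contents and variable names are illustrative:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Small illustrative Table |
| tbl <- arrow_table(x = 1:3, y = c("a", "b", "c")) |
| |
| # Serialize the Table to a raw vector in the Arrow IPC stream format |
| bytes <- write_to_raw(tbl, format = "stream") |
| |
| # Wrap the raw vector in an Arrow Buffer |
| buf <- buffer(bytes) |
| |
| # Read the stream back from the Buffer as an Arrow Table |
| tbl_roundtrip <- read_ipc_stream(BufferReader$create(buf), as_data_frame = FALSE)</code></pre></div> |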
| </div> |
| <div class="section level2"> |
| <h2 id="contributing-to-arrow">Contributing to arrow<a class="anchor" aria-label="anchor" href="#contributing-to-arrow"></a> |
| </h2> |
| <p>Apache Arrow is an extensive project spanning multiple languages, and the arrow R package is only one part of this large project. Because of this, there are a number of special considerations for developers who would like to contribute to the package. To make this process easier, the arrow documentation includes several articles on topics that are relevant to arrow developers but are unlikely to be needed by users.</p> |
| <p>For an overview of the development process and a list of related articles for developers, see the <a href="./developing.html">developer guide</a>.</p> |
| </div> |
| </main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2> |
| </nav></aside> |
| </div> |
| |
| |
| |
| <footer><div class="pkgdown-footer-left"> |
| <p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p> |
| </div> |
| |
| <div class="pkgdown-footer-right"> |
| <p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.7.</p> |
| </div> |
| |
| </footer> |
| </div> |
| |
| |
| |
| |
| |
| </body> |
| </html> |