| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <meta name="description" content="Learn how to write bindings that allow arrow to mirror the behavior of native R functions within dplyr pipelines |
| "> |
| <title>Writing dplyr bindings • Arrow R Package</title> |
| <!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../../favicon-16x16.png"> |
| <link rel="icon" type="image/png" sizes="32x32" href="../../favicon-32x32.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../../apple-touch-icon.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../../apple-touch-icon-120x120.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../../apple-touch-icon-76x76.png"> |
| <link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../../apple-touch-icon-60x60.png"> |
| <script src="../../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <link href="../../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"> |
| <script src="../../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous"> |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous"> |
| <!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js" integrity="sha512-7O5pXpc0oCRrxk8RUfDYFgn0nO1t+jLuIOQdOMRp4APB7uZ4vSjspzp5y6YDtDs4VzUSTbWzBFZ/LKJhnyFOKw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../../pkgdown.js"></script><link href="../../extra.css" rel="stylesheet"> |
| <meta property="og:title" content="Writing dplyr bindings"> |
| <meta property="og:description" content="Learn how to write bindings that allow arrow to mirror the behavior of native R functions within dplyr pipelines |
| "> |
| <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"> |
| <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"> |
| <meta name="twitter:card" content="summary_large_image"> |
| <meta name="twitter:creator" content="@apachearrow"> |
| <meta name="twitter:site" content="@apachearrow"> |
| <!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--><!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script><!-- End Matomo Code --> |
| </head> |
| <body> |
| <a href="#main" class="visually-hidden-focusable">Skip to contents</a> |
| |
| |
| <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container"> |
| |
| <a class="navbar-brand me-2" href="../../index.html">Arrow R Package</a> |
| |
| <span class="version"> |
| <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">16.0.0.9000</small> |
| </span> |
| |
| |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| <div id="navbar" class="collapse navbar-collapse ms-3"> |
| <ul class="navbar-nav me-auto"> |
| <li class="nav-item"> |
| <a class="nav-link" href="../../articles/arrow.html">Get started</a> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../../reference/index.html">Reference</a> |
| </li> |
| <li class="active nav-item dropdown"> |
| <a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a> |
| <div class="dropdown-menu" aria-labelledby="dropdown-articles"> |
| <h6 class="dropdown-header" data-toc-skip>Using the package</h6> |
| <a class="dropdown-item" href="../../articles/read_write.html">Reading and writing data files</a> |
| <a class="dropdown-item" href="../../articles/data_wrangling.html">Data analysis with dplyr syntax</a> |
| <a class="dropdown-item" href="../../articles/dataset.html">Working with multi-file data sets</a> |
| <a class="dropdown-item" href="../../articles/python.html">Integrating Arrow, Python, and R</a> |
| <a class="dropdown-item" href="../../articles/fs.html">Using cloud storage (S3, GCS)</a> |
| <a class="dropdown-item" href="../../articles/flight.html">Connecting to a Flight server</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6> |
| <a class="dropdown-item" href="../../articles/data_objects.html">Data objects</a> |
| <a class="dropdown-item" href="../../articles/data_types.html">Data types</a> |
| <a class="dropdown-item" href="../../articles/metadata.html">Metadata</a> |
| <div class="dropdown-divider"></div> |
| <h6 class="dropdown-header" data-toc-skip>Installation</h6> |
| <a class="dropdown-item" href="../../articles/install.html">Installing on Linux</a> |
| <a class="dropdown-item" href="../../articles/install_nightly.html">Installing development versions</a> |
| <div class="dropdown-divider"></div> |
| <a class="dropdown-item" href="../../articles/index.html">More articles...</a> |
| </div> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" href="../../news/index.html">Changelog</a> |
| </li> |
| </ul> |
| <form class="form-inline my-2 my-lg-0" role="search"> |
| <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../../search.json" id="search-input" placeholder="Search for" autocomplete="off"> |
| </form> |
| |
| <ul class="navbar-nav"> |
| <li class="nav-item"> |
| <a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github"> |
| <span class="fab fa fab fa-github fa-lg"></span> |
| |
| </a> |
| </li> |
| </ul> |
| </div> |
| |
| |
| </div> |
| </nav><div class="container template-article"> |
| |
| |
| |
| <script src="writing_bindings_files/accessible-code-block-0.0.1/empty-anchor.js"></script><div class="row"> |
| <main id="main" class="col-md-9"><div class="page-header"> |
| <img src="" class="logo" alt=""><h1>Writing dplyr bindings</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/developers/writing_bindings.Rmd" class="external-link"><code>vignettes/developers/writing_bindings.Rmd</code></a></small> |
| <div class="d-none name"><code>writing_bindings.Rmd</code></div> |
| </div> |
| |
| |
| |
| <p>When writing bindings between C++ compute functions and R functions, the aim is to expose the C++ functionality via the same interface as existing R functions. The syntax and functionality should match that of the existing R functions (though there are some exceptions) so that users are able to use existing tidyverse or base R syntax, whilst taking advantage of the speed and functionality of the underlying arrow package.</p> |
| <p>One of the main ways in which users interact with arrow is via <a href="https://dplyr.tidyverse.org/" class="external-link">dplyr</a> syntax called on Arrow objects. For example, when a user calls <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">dplyr::mutate()</a></code> on an Arrow Tabular, Dataset, or arrow data query object, the Arrow implementation of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate()</a></code> is used and, under the hood, translates the dplyr code into Arrow C++ code.</p> |
| <p>When using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">dplyr::mutate()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">dplyr::filter()</a></code>, you may want to use functions from other packages. The example below uses <code><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">stringr::str_detect()</a></code>.</p> |
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span><span class="op">)</span></span> |
| <span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://stringr.tidyverse.org" class="external-link">stringr</a></span><span class="op">)</span></span> |
| <span><span class="va">starwars</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 2 x 14</span></span></span> |
| <span><span class="co">## name height mass hair_color skin_color eye_color birth_year sex gender</span></span> |
| <span><span class="co">## <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><int></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> </span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">1</span> Darth Va~ 202 136 none white yellow 41.9 male mascu~</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">2</span> Darth Ma~ 175 80 none red yellow 54 male mascu~</span></span> |
| <span><span class="co">## <span style="color: #949494;"># i 5 more variables: homeworld <chr>, species <chr>, films <list>,</span></span></span> |
| <span><span class="co">## <span style="color: #949494;"># vehicles <list>, starships <list></span></span></span></code></pre> |
| <p>This functionality has also been implemented in Arrow, e.g.:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/" class="external-link">arrow</a></span><span class="op">)</span></span> |
| <span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 2 x 14</span></span></span> |
| <span><span class="co">## name height mass hair_color skin_color eye_color birth_year sex gender</span></span> |
| <span><span class="co">## <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><int></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><chr></span> </span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">1</span> Darth Va~ 202 136 none white yellow 41.9 male mascu~</span></span> |
| <span><span class="co">## <span style="color: #BCBCBC;">2</span> Darth Ma~ 175 80 none red yellow 54 male mascu~</span></span> |
| <span><span class="co">## <span style="color: #949494;"># i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,</span></span></span> |
| <span><span class="co">## <span style="color: #949494;"># vehicles <list<character>>, starships <list<character>></span></span></span></code></pre> |
| <p>This is possible as a <strong>binding</strong> has been created between the call to the stringr function <code><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect()</a></code> and the Arrow C++ code, here as a direct mapping to <code>match_substring_regex</code>. You can see this for yourself by inspecting the arrow data query object without retrieving the results via <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>.</p> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table (query)</span></span> |
| <span><span class="co">## name: string</span></span> |
| <span><span class="co">## height: int32</span></span> |
| <span><span class="co">## mass: double</span></span> |
| <span><span class="co">## hair_color: string</span></span> |
| <span><span class="co">## skin_color: string</span></span> |
| <span><span class="co">## eye_color: string</span></span> |
| <span><span class="co">## birth_year: double</span></span> |
| <span><span class="co">## sex: string</span></span> |
| <span><span class="co">## gender: string</span></span> |
| <span><span class="co">## homeworld: string</span></span> |
| <span><span class="co">## species: string</span></span> |
| <span><span class="co">## films: list<item: string></span></span> |
| <span><span class="co">## vehicles: list<item: string></span></span> |
| <span><span class="co">## starships: list<item: string></span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## * Filter: match_substring_regex(name, {pattern="Darth", ignore_case=false})</span></span> |
| <span><span class="co">## See $.data for the source Arrow object</span></span></code></pre> |
| <p>In the following sections, we’ll walk through how to create a binding between an R function and an Arrow C++ function.</p> |
| <div class="section level2"> |
| <h2 id="walkthrough">Walkthrough<a class="anchor" aria-label="anchor" href="#walkthrough"></a> |
| </h2> |
| <p>Imagine you are writing the bindings for the C++ function <a href="https://arrow.apache.org/docs/cpp/compute.html#containment-tests" class="external-link"><code>starts_with()</code></a> and want to bind it to the (base) R function <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p> |
| <p>First, take a look at the docs for both of those functions.</p> |
| <div class="section level3"> |
| <h3 id="examining-the-r-function">Examining the R function<a class="anchor" aria-label="anchor" href="#examining-the-r-function"></a> |
| </h3> |
| <p>Here are the docs for R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> (also available at <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html" class="external-link uri">https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html</a>):</p> |
| <p><img src="startswithdocs.png" width="50%"></p> |
| <p>It takes two parameters: <code>x</code>, the input, and <code>prefix</code>, the characters that <code>x</code> is checked against to see whether it starts with them.</p> |
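| <p>To make the behaviour being mirrored concrete, here is a minimal base R sketch of <code>startsWith()</code> on a small character vector. The input values are made up for illustration.</p> |

```r
# Base R only: startsWith() is vectorised over x and returns a logical vector.
x <- c("Foo", "bar", "baz", "qux")
starts_b <- startsWith(x, "b")
print(starts_b)

# The comparison is case-sensitive, so "Foo" does not start with "f".
starts_f <- startsWith(x, "f")
print(starts_f)
```

| <p>Any binding written for Arrow should reproduce these results, including the case-sensitive matching.</p> |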
| </div> |
| <div class="section level3"> |
| <h3 id="examining-the-c-function">Examining the C++ function<a class="anchor" aria-label="anchor" href="#examining-the-c-function"></a> |
| </h3> |
| <p>Now, go to <a href="https://arrow.apache.org/docs/cpp/compute.html#containment-tests" class="external-link">the compute function documentation</a> and look for the Arrow C++ library’s <code>starts_with()</code> function:</p> |
| <p><img src="starts_with_docs.png" width="100%"></p> |
| <p>The docs show that <code>starts_with()</code> is a unary function, which means that it takes a single data input. The data input must be a string-like class, and the returned value is boolean, both of which match up to R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p> |
| <p>There is an options class associated with <code>starts_with()</code>, called <a href="https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE" class="external-link"><code>MatchSubstringOptions</code></a>, so let’s take a look at that.</p> |
| <p><img src="matchsubstringoptions.png" width="100%"></p> |
| <p>Options classes allow the user to control the behaviour of the function. In this case, there are two possible options which can be supplied - <code>pattern</code> and <code>ignore_case</code>, which are described in the docs shown above.</p> |
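| <p>As a rough illustration of how data and options are kept separate, the sketch below calls the compute function directly through <code>call_function()</code> from the arrow package, passing the options as a named list. It assumes an installed arrow build in which <code>starts_with</code> is already linked to its options class; the input values are made up.</p> |

```r
library(arrow)

# The single data input: an Arrow string array.
arr <- Array$create(c("Foo", "bar", "baz", "qux"))

# Options are supplied separately from the data, as a named list.
# Only `pattern` is given here; `ignore_case` is left unset.
res <- call_function("starts_with", arr, options = list(pattern = "b"))

# Convert the Arrow result back to an R logical vector for inspection.
res_r <- as.vector(res)
print(res_r)
```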
| </div> |
| <div class="section level3"> |
| <h3 id="comparing-the-r-and-c-functions">Comparing the R and C++ functions<a class="anchor" aria-label="anchor" href="#comparing-the-r-and-c-functions"></a> |
| </h3> |
| <p>What conclusions can be drawn from what you’ve seen so far?</p> |
| <p>Base R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> and Arrow’s <code>starts_with()</code> operate on equivalent data types, return equivalent data types, and as there are no options implemented in R that Arrow doesn’t have, this should be fairly simple to map without a great deal of extra work.</p> |
| <p>As <code>starts_with()</code> has an options class associated with it, we’ll need to make sure that it’s linked up with this in the R code.</p> |
| <p>In case you’re wondering about the difference between arguments in R and options in Arrow: in R, arguments to functions can include the actual data to be analysed as well as options governing how the function works. In the C++ compute functions, the arguments are the data to be analysed, and the options specify exactly how the function works.</p> |
| <p>So let’s get started.</p> |
| </div> |
| <div class="section level3"> |
| <h3 id="step-1---add-unit-tests">Step 1 - add unit tests<a class="anchor" aria-label="anchor" href="#step-1---add-unit-tests"></a> |
| </h3> |
| <p>We recommend a test-driven-development approach - write failing tests first, then check that they fail, and then write the code needed to make them pass. Thinking up-front about the behavior which needs testing can make it easier to reason about the code which needs writing later.</p> |
| <p>Look up the R function that you want to bind the compute kernel to, and write a set of unit tests that use a dplyr pipeline and <code>compare_dplyr_binding()</code> (and perhaps even <code>compare_dplyr_error()</code> if necessary). These functions compare the output of the original function with the dplyr bindings and make sure they match.<br> |
| We recommend looking at the <a href="https://github.com/apache/arrow/blob/main/r/tests/testthat/helper-expectation.R" class="external-link">documentation next to the source code for these functions</a> to get a better understanding of how they work.</p> |
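| <p>The idea behind this kind of comparison can be sketched by hand. The snippet below is a simplified illustration and not the helper’s actual implementation: it runs the same pipeline on a tibble and on an Arrow Table and checks that the results agree. It assumes arrow, dplyr, and tibble are installed and that a binding for <code>startsWith()</code> already exists in the installed arrow version.</p> |

```r
library(arrow)
library(dplyr)

df <- tibble::tibble(x = c("Foo", "bar", "baz", "qux"))

# Reference result: plain dplyr on an R data frame.
expected <- df %>%
  filter(startsWith(x, "b"))

# Result under test: the same pipeline evaluated by Arrow, then collected.
actual <- arrow_table(df) %>%
  filter(startsWith(x, "b")) %>%
  collect()

# The binding is only correct if both routes return the same values.
stopifnot(identical(actual$x, expected$x))
```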
| <p>You should make sure you’re testing all parameters of the R function in your tests.</p> |
| <p>Below is a possible example test for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p> |
| <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu">test_that</span><span class="op">(</span><span class="st">"startsWith behaves identically in dplyr and Arrow"</span>, <span class="op">{</span></span> |
| <span> <span class="va">df</span> <span class="op"><-</span> <span class="fu"><a href="https://tibble.tidyverse.org/reference/tibble.html" class="external-link">tibble</a></span><span class="op">(</span>x <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"Foo"</span>, <span class="st">"bar"</span>, <span class="st">"baz"</span>, <span class="st">"qux"</span><span class="op">)</span><span class="op">)</span></span> |
| <span> <span class="fu">compare_dplyr_binding</span><span class="op">(</span></span> |
| <span> <span class="va">.input</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith</a></span><span class="op">(</span><span class="va">x</span>, <span class="st">"b"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span>,</span> |
| <span> <span class="va">df</span></span> |
| <span> <span class="op">)</span></span> |
| <span><span class="op">}</span><span class="op">)</span></span></code></pre></div> |
| </div> |
| <div class="section level3"> |
| <h3 id="step-2---hook-up-the-compute-function-with-options-class-if-necessary">Step 2 - Hook up the compute function with options class if necessary<a class="anchor" aria-label="anchor" href="#step-2---hook-up-the-compute-function-with-options-class-if-necessary"></a> |
| </h3> |
| <p>If the C++ compute function can have options specified, make sure that the function is linked with its options class in <code>make_compute_options()</code> in the file <code>arrow/r/src/compute.cpp</code>. You can find out if a compute function requires options by looking in the docs here: <a href="https://arrow.apache.org/docs/cpp/compute.html" class="external-link uri">https://arrow.apache.org/docs/cpp/compute.html</a></p> |
| <p>In the case of <code>starts_with()</code>, it looks something like this:</p> |
| <div class="sourceCode" id="cb8"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a> <span class="cf">if</span> (func_name == <span class="st">"starts_with"</span>) {</span> |
| <span id="cb8-2"><a href="#cb8-2" aria-hidden="true"></a> <span class="kw">using</span> Options = arrow::compute::MatchSubstringOptions;</span> |
| <span id="cb8-3"><a href="#cb8-3" aria-hidden="true"></a> <span class="dt">bool</span> ignore_case = <span class="kw">false</span>;</span> |
| <span id="cb8-4"><a href="#cb8-4" aria-hidden="true"></a> <span class="cf">if</span> (!Rf_isNull(options[<span class="st">"ignore_case"</span>])) {</span> |
| <span id="cb8-5"><a href="#cb8-5" aria-hidden="true"></a> ignore_case = cpp11::as_cpp<<span class="dt">bool</span>>(options[<span class="st">"ignore_case"</span>]);</span> |
| <span id="cb8-6"><a href="#cb8-6" aria-hidden="true"></a> }</span> |
| <span id="cb8-7"><a href="#cb8-7" aria-hidden="true"></a> <span class="cf">return</span> <span class="bu">std::</span>make_shared<Options>(cpp11::as_cpp<<span class="bu">std::</span>string>(options[<span class="st">"pattern"</span>]),</span> |
| <span id="cb8-8"><a href="#cb8-8" aria-hidden="true"></a> ignore_case);</span> |
| <span id="cb8-9"><a href="#cb8-9" aria-hidden="true"></a> }</span></code></pre></div> |
| <p>You can usually copy and paste from a similar existing example. In this case, as the option <code>ignore_case</code> doesn’t map to any parameters of <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>, we give it a default value of <code>false</code>, but use the supplied value if one has been set. As the <code>pattern</code> argument maps directly to <code>prefix</code> in <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>, we can pass it straight through.</p> |
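| <p>The effect of that default can be checked from R. The sketch below assumes an installed arrow build containing the options handling shown above, and uses made-up inputs: leaving <code>ignore_case</code> out of the options list should behave the same as passing <code>ignore_case = FALSE</code>.</p> |

```r
library(arrow)

arr <- Array$create(c("Foo", "foo"))

# Only `pattern` is supplied, so the C++ side falls back to ignore_case = false.
default_case <- as.vector(
  call_function("starts_with", arr, options = list(pattern = "f"))
)

# Supplying ignore_case = FALSE explicitly should give the same answer.
explicit_case <- as.vector(
  call_function(
    "starts_with", arr,
    options = list(pattern = "f", ignore_case = FALSE)
  )
)

print(default_case)
print(explicit_case)
```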
| </div> |
| <div class="section level3"> |
| <h3 id="step-3---map-the-r-function-to-the-c-kernel">Step 3 - Map the R function to the C++ kernel<a class="anchor" aria-label="anchor" href="#step-3---map-the-r-function-to-the-c-kernel"></a> |
| </h3> |
| <p>The next task is writing the code which binds the R function to the C++ kernel.</p> |
| <div class="section level4"> |
| <h4 id="step-3a---see-if-direct-mapping-is-appropriate">Step 3a - See if direct mapping is appropriate<a class="anchor" aria-label="anchor" href="#step-3a---see-if-direct-mapping-is-appropriate"></a> |
| </h4> |
| <p>Compare the C++ function and R function. If they are simple functions with no options, it might be possible to map directly between the C++ and R functions: use <code>unary_function_map</code> for compute functions that operate on a single column of data, or <code>binary_function_map</code> for those that operate on two columns of data.</p> |
| <p>As <code>starts_with()</code> requires an options class to carry the <code>prefix</code> supplied to <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>, direct mapping is not appropriate.</p> |
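| <p>Conceptually, a direct mapping is just a lookup from an R function name to a compute function name. The sketch below illustrates that idea in plain R; the map entries and the helper <code>lookup_direct_mapping()</code> are hypothetical and are not copied from the package source.</p> |

```r
# Hypothetical, illustrative entries: R function name -> compute function name.
example_unary_map <- c(
  "is.nan" = "is_nan",
  "floor" = "floor"
)

# Hypothetical helper: return the mapped compute function name,
# or NULL when no direct mapping exists for the R function.
lookup_direct_mapping <- function(r_name, map = example_unary_map) {
  if (r_name %in% names(map)) map[[r_name]] else NULL
}

lookup_direct_mapping("is.nan")      # a one-to-one mapping is found
lookup_direct_mapping("startsWith")  # NULL: needs options, so no direct mapping
```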
| </div> |
| <div class="section level4"> |
| <h4 id="step-3b---if-direct-mapping-not-possible-try-a-modified-implementation">Step 3b - If direct mapping not possible, try a modified implementation<a class="anchor" aria-label="anchor" href="#step-3b---if-direct-mapping-not-possible-try-a-modified-implementation"></a> |
| </h4> |
| <p>If the function cannot be mapped directly, some extra work may be needed to ensure that calling the arrow version of the function produces the same result as calling the R version. In this case, the function will need to be added to the <code>.cache$functions</code> function registry. Here is how this might look for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>:</p> |
| <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../../reference/register_binding.html">register_binding</a></span><span class="op">(</span><span class="st">"base::startsWith"</span>, <span class="kw">function</span><span class="op">(</span><span class="va">x</span>, <span class="va">prefix</span><span class="op">)</span> <span class="op">{</span></span> |
| <span> <span class="va">Expression</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span> |
| <span> <span class="st">"starts_with"</span>,</span> |
| <span> <span class="va">x</span>,</span> |
| <span> options <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span>pattern <span class="op">=</span> <span class="va">prefix</span><span class="op">)</span></span> |
| <span> <span class="op">)</span></span> |
| <span><span class="op">}</span><span class="op">)</span></span></code></pre></div> |
| <p>In the source files, all the <code><a href="../../reference/register_binding.html">register_binding()</a></code> calls are wrapped in functions that are called on package load. These are separated into files based on subject matter (e.g., <code>R/dplyr-funcs-math.R</code>, <code>R/dplyr-funcs-string.R</code>): find the closest analog to the function whose binding is being defined and define the new binding in a similar location. For example, the binding for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> is registered in <code>dplyr-funcs-string.R</code> next to the binding for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">endsWith()</a></code>.</p> |
| <p>Note: we use the namespace-qualified name (i.e. <code>"base::startsWith"</code>) for a binding. This registers the same binding both as <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> and as <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">base::startsWith()</a></code>, so the function can be called with or without the <code>pkg::</code> prefix. The same holds for bindings to functions from other packages; for example, <code>str_detect()</code> can be called as <code>stringr::str_detect()</code>:</p> |
| <div class="sourceCode" id="cb10"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span> |
| <span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu">stringr</span><span class="fu">::</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Table (query)</span></span> |
| <span><span class="co">## name: string</span></span> |
| <span><span class="co">## height: int32</span></span> |
| <span><span class="co">## mass: double</span></span> |
| <span><span class="co">## hair_color: string</span></span> |
| <span><span class="co">## skin_color: string</span></span> |
| <span><span class="co">## eye_color: string</span></span> |
| <span><span class="co">## birth_year: double</span></span> |
| <span><span class="co">## sex: string</span></span> |
| <span><span class="co">## gender: string</span></span> |
| <span><span class="co">## homeworld: string</span></span> |
| <span><span class="co">## species: string</span></span> |
| <span><span class="co">## films: list<item: string></span></span> |
| <span><span class="co">## vehicles: list<item: string></span></span> |
| <span><span class="co">## starships: list<item: string></span></span> |
| <span><span class="co">## </span></span> |
| <span><span class="co">## * Filter: match_substring_regex(name, {pattern="Darth", ignore_case=false})</span></span> |
| <span><span class="co">## See $.data for the source Arrow object</span></span></code></pre> |
| <p>Hint: you can use <code><a href="../../reference/call_function.html">call_function()</a></code> to call a compute function directly from R. This might be useful if you want to experiment with a compute function while you’re writing bindings for it, for example:</p> |
| <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../../reference/call_function.html">call_function</a></span><span class="op">(</span></span> |
| <span> <span class="st">"starts_with"</span>,</span> |
| <span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"Apache"</span>, <span class="st">"Arrow"</span>, <span class="st">"R"</span>, <span class="st">"package"</span><span class="op">)</span><span class="op">)</span>,</span> |
| <span> options <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span>pattern <span class="op">=</span> <span class="st">"A"</span><span class="op">)</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <pre><code><span><span class="co">## Array</span></span> |
| <span><span class="co">## <bool></span></span> |
| <span><span class="co">## [</span></span> |
| <span><span class="co">## true,</span></span> |
| <span><span class="co">## true,</span></span> |
| <span><span class="co">## false,</span></span> |
| <span><span class="co">## false</span></span> |
| <span><span class="co">## ]</span></span></code></pre> |
| </div> |
| </div> |
| <div class="section level3"> |
| <h3 id="step-4---run-and-potentially-add-to-your-tests-">Step 4 - Run (and potentially add to) your tests.<a class="anchor" aria-label="anchor" href="#step-4---run-and-potentially-add-to-your-tests-"></a> |
| </h3> |
| <p>In the process of implementing the function, you will need at least one test to make sure that your binding works and that future changes to the Arrow R package don’t break it! Bindings are tested in files that correspond to the file in which they were defined; for example, <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> is tested in <code>tests/testthat/test-dplyr-funcs-string.R</code>, next to the tests for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">endsWith()</a></code>.</p> |
| <p>You may end up implementing more tests, for example if you discover unusual edge cases. This is fine - add them to the ones you wrote originally, and run them all. If they pass, you’re done and you can submit a PR. If you’ve modified the C++ code in the R package (for example, when hooking up a binding to its options class), you should make sure to run <code>arrow/r/lint.sh</code> to lint the code.</p> |
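| <p>As a rough illustration of what such a test checks, the sketch below compares the result of a dplyr pipeline run on an Arrow Table with the same pipeline run on a plain data frame. It uses only exported functions from arrow, dplyr and testthat; the package's own test suite may rely on dedicated helpers instead, and the data and test name here are hypothetical.</p> |

```r
library(arrow)
library(dplyr)
library(testthat)

# Hypothetical test data, including a missing value as an edge case
df <- tibble(x = c("Foo", "bar", "Foobar", NA))

# Reference result computed by dplyr on the R data frame
expected <- df %>%
  filter(startsWith(x, "Foo"))

# Result computed by Arrow through the binding, then pulled back into R
actual <- arrow_table(df) %>%
  filter(startsWith(x, "Foo")) %>%
  collect()

test_that("startsWith() binding matches base R", {
  expect_equal(actual, expected)
})
```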
| </div> |
| </div> |
| </main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2> |
| </nav></aside> |
| </div> |
| |
| |
| |
| <footer><div class="pkgdown-footer-left"> |
| <p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p> |
| </div> |
| |
| <div class="pkgdown-footer-right"> |
| <p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.9.</p> |
| </div> |
| |
| </footer> |
| </div> |
| |
| |
| |
| |
| |
| </body> |
| </html> |