<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="Learn how to write bindings that allow arrow to mirror the behavior of native R functions within dplyr pipelines
">
<title>Writing dplyr bindings • Arrow R Package</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../../favicon-16x16.png">
<link rel="icon" type="image/png" sizes="32x32" href="../../favicon-32x32.png">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../../apple-touch-icon.png">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../../apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../../apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../../apple-touch-icon-60x60.png">
<script src="../../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="../../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet">
<script src="../../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js" integrity="sha512-7O5pXpc0oCRrxk8RUfDYFgn0nO1t+jLuIOQdOMRp4APB7uZ4vSjspzp5y6YDtDs4VzUSTbWzBFZ/LKJhnyFOKw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../../pkgdown.js"></script><link href="../../extra.css" rel="stylesheet">
<meta property="og:title" content="Writing dplyr bindings">
<meta property="og:description" content="Learn how to write bindings that allow arrow to mirror the behavior of native R functions within dplyr pipelines
">
<meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png">
<meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:creator" content="@apachearrow">
<meta name="twitter:site" content="@apachearrow">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--><!-- Matomo --><script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script><!-- End Matomo Code -->
</head>
<body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">16.0.0.9000</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto">
<li class="nav-item">
<a class="nav-link" href="../../articles/arrow.html">Get started</a>
</li>
<li class="nav-item">
<a class="nav-link" href="../../reference/index.html">Reference</a>
</li>
<li class="active nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
<div class="dropdown-menu" aria-labelledby="dropdown-articles">
<h6 class="dropdown-header" data-toc-skip>Using the package</h6>
<a class="dropdown-item" href="../../articles/read_write.html">Reading and writing data files</a>
<a class="dropdown-item" href="../../articles/data_wrangling.html">Data analysis with dplyr syntax</a>
<a class="dropdown-item" href="../../articles/dataset.html">Working with multi-file data sets</a>
<a class="dropdown-item" href="../../articles/python.html">Integrating Arrow, Python, and R</a>
<a class="dropdown-item" href="../../articles/fs.html">Using cloud storage (S3, GCS)</a>
<a class="dropdown-item" href="../../articles/flight.html">Connecting to a Flight server</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6>
<a class="dropdown-item" href="../../articles/data_objects.html">Data objects</a>
<a class="dropdown-item" href="../../articles/data_types.html">Data types</a>
<a class="dropdown-item" href="../../articles/metadata.html">Metadata</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Installation</h6>
<a class="dropdown-item" href="../../articles/install.html">Installing on Linux</a>
<a class="dropdown-item" href="../../articles/install_nightly.html">Installing development versions</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="../../articles/index.html">More articles...</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="../../news/index.html">Changelog</a>
</li>
</ul>
<form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../../search.json" id="search-input" placeholder="Search for" autocomplete="off">
</form>
<ul class="navbar-nav">
<li class="nav-item">
<a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
</div>
</nav><div class="container template-article">
<script src="writing_bindings_files/accessible-code-block-0.0.1/empty-anchor.js"></script><div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<img src="" class="logo" alt=""><h1>Writing dplyr bindings</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/developers/writing_bindings.Rmd" class="external-link"><code>vignettes/developers/writing_bindings.Rmd</code></a></small>
<div class="d-none name"><code>writing_bindings.Rmd</code></div>
</div>
<p>When writing bindings between C++ compute functions and R functions, the aim is to expose the C++ functionality via the same interface as existing R functions. The syntax and functionality should match that of the existing R functions (though there are some exceptions) so that users are able to use existing tidyverse or base R syntax, whilst taking advantage of the speed and functionality of the underlying arrow package.</p>
<p>One of the main ways in which users interact with arrow is via <a href="https://dplyr.tidyverse.org/" class="external-link">dplyr</a> syntax called on Arrow objects. For example, when a user calls <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">dplyr::mutate()</a></code> on an Arrow Tabular, Dataset, or arrow data query object, the Arrow implementation of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate()</a></code> is used, which under the hood translates the dplyr code into Arrow C++ code.</p>
<p>When using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">dplyr::mutate()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">dplyr::filter()</a></code>, you may want to use functions from other packages. The example below uses <code><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">stringr::str_detect()</a></code>.</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span><span class="op">)</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://stringr.tidyverse.org" class="external-link">stringr</a></span><span class="op">)</span></span>
<span><span class="va">starwars</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 2 x 14</span></span></span>
<span><span class="co">## name height mass hair_color skin_color eye_color birth_year sex gender</span></span>
<span><span class="co">## <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;int&gt;</span> <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> </span></span>
<span><span class="co">## <span style="color: #BCBCBC;">1</span> Darth Va~ 202 136 none white yellow 41.9 male mascu~</span></span>
<span><span class="co">## <span style="color: #BCBCBC;">2</span> Darth Ma~ 175 80 none red yellow 54 male mascu~</span></span>
<span><span class="co">## <span style="color: #949494;"># i 5 more variables: homeworld &lt;chr&gt;, species &lt;chr&gt;, films &lt;list&gt;,</span></span></span>
<span><span class="co">## <span style="color: #949494;"># vehicles &lt;list&gt;, starships &lt;list&gt;</span></span></span></code></pre>
<p>This functionality has also been implemented in Arrow, e.g.:</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/apache/arrow/" class="external-link">arrow</a></span><span class="op">)</span></span>
<span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 2 x 14</span></span></span>
<span><span class="co">## name height mass hair_color skin_color eye_color birth_year sex gender</span></span>
<span><span class="co">## <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;int&gt;</span> <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> </span></span>
<span><span class="co">## <span style="color: #BCBCBC;">1</span> Darth Va~ 202 136 none white yellow 41.9 male mascu~</span></span>
<span><span class="co">## <span style="color: #BCBCBC;">2</span> Darth Ma~ 175 80 none red yellow 54 male mascu~</span></span>
<span><span class="co">## <span style="color: #949494;"># i 5 more variables: homeworld &lt;chr&gt;, species &lt;chr&gt;, films &lt;list&lt;character&gt;&gt;,</span></span></span>
<span><span class="co">## <span style="color: #949494;"># vehicles &lt;list&lt;character&gt;&gt;, starships &lt;list&lt;character&gt;&gt;</span></span></span></code></pre>
<p>This is possible as a <strong>binding</strong> has been created between the call to the stringr function <code><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect()</a></code> and the Arrow C++ code, here as a direct mapping to <code>match_substring_regex</code>. You can see this for yourself by inspecting the arrow data query object without retrieving the results via <code><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect()</a></code>.</p>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## Table (query)</span></span>
<span><span class="co">## name: string</span></span>
<span><span class="co">## height: int32</span></span>
<span><span class="co">## mass: double</span></span>
<span><span class="co">## hair_color: string</span></span>
<span><span class="co">## skin_color: string</span></span>
<span><span class="co">## eye_color: string</span></span>
<span><span class="co">## birth_year: double</span></span>
<span><span class="co">## sex: string</span></span>
<span><span class="co">## gender: string</span></span>
<span><span class="co">## homeworld: string</span></span>
<span><span class="co">## species: string</span></span>
<span><span class="co">## films: list&lt;item: string&gt;</span></span>
<span><span class="co">## vehicles: list&lt;item: string&gt;</span></span>
<span><span class="co">## starships: list&lt;item: string&gt;</span></span>
<span><span class="co">## </span></span>
<span><span class="co">## * Filter: match_substring_regex(name, {pattern="Darth", ignore_case=false})</span></span>
<span><span class="co">## See $.data for the source Arrow object</span></span></code></pre>
<p>In the following sections, we’ll walk through how to create a binding between an R function and an Arrow C++ function.</p>
<div class="section level2">
<h2 id="walkthrough">Walkthrough<a class="anchor" aria-label="anchor" href="#walkthrough"></a>
</h2>
<p>Imagine you are writing the bindings for the C++ function <a href="https://arrow.apache.org/docs/cpp/compute.html#containment-tests" class="external-link"><code>starts_with()</code></a> and want to bind it to the (base) R function <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p>
<p>First, take a look at the docs for both of those functions.</p>
<div class="section level3">
<h3 id="examining-the-r-function">Examining the R function<a class="anchor" aria-label="anchor" href="#examining-the-r-function"></a>
</h3>
<p>Here are the docs for R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> (also available at <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html" class="external-link uri">https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html</a>)</p>
<p><img src="startswithdocs.png" width="50%"></p>
<p>It takes two parameters: <code>x</code>, the input, and <code>prefix</code>, the characters that <code>x</code> is checked against to see whether it starts with them.</p>
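<p>Before writing any binding code, it can help to pin down the R behaviour that the binding will need to reproduce. The short base R sketch below (the input values are made up for illustration) shows that matching is case-sensitive and that a missing input gives a missing result.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># Illustrative input: one capitalised value, two matches, one missing value
x &lt;- c("Foo", "bar", "baz", NA)

# "Foo" does not match the lower-case prefix, and NA stays NA
result &lt;- startsWith(x, "b")
result
## [1] FALSE  TRUE  TRUE    NA</code></pre></div>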
</div>
<div class="section level3">
<h3 id="examining-the-c-function">Examining the C++ function<a class="anchor" aria-label="anchor" href="#examining-the-c-function"></a>
</h3>
<p>Now, go to <a href="https://arrow.apache.org/docs/cpp/compute.html#containment-tests" class="external-link">the compute function documentation</a> and look for the Arrow C++ library’s <code>starts_with()</code> function:</p>
<p><img src="starts_with_docs.png" width="100%"></p>
<p>The docs show that <code>starts_with()</code> is a unary function, which means that it takes a single data input. The data input must be a string-like class, and the returned value is boolean, both of which match up to R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p>
<p>There is an options class associated with <code>starts_with()</code> - called <a href="https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE" class="external-link"><code>MatchSubstringOptions</code></a> - so let’s take a look at that.</p>
<p><img src="matchsubstringoptions.png" width="100%"></p>
<p>Options classes allow the user to control the behaviour of the function. In this case, there are two possible options which can be supplied - <code>pattern</code> and <code>ignore_case</code>, which are described in the docs shown above.</p>
</div>
<div class="section level3">
<h3 id="comparing-the-r-and-c-functions">Comparing the R and C++ functions<a class="anchor" aria-label="anchor" href="#comparing-the-r-and-c-functions"></a>
</h3>
<p>What conclusions can be drawn from what you’ve seen so far?</p>
<p>Base R’s <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> and Arrow’s <code>starts_with()</code> operate on equivalent data types, return equivalent data types, and as there are no options implemented in R that Arrow doesn’t have, this should be fairly simple to map without a great deal of extra work.</p>
<p>As <code>starts_with()</code> has an options class associated with it, we’ll need to make sure that it’s linked up with this in the R code.</p>
<p>In case you’re wondering about the difference between arguments in R and options in Arrow: in R, arguments to functions can include both the actual data to be analysed and options governing how the function works, whereas in the C++ compute functions, the arguments are the data to be analysed and the options specify exactly how the function works.</p>
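<p>To make the distinction concrete, here is a minimal sketch that calls the compute function directly from R with <code>call_function()</code>. It assumes an installed arrow release in which <code>starts_with</code> is already connected to its options class; the input values are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)

# The data to be analysed is passed as an argument...
strings &lt;- Array$create(c("Foo", "bar", "baz", "qux"))

# ...while the behaviour of the function is controlled through `options`
res &lt;- call_function("starts_with", strings, options = list(pattern = "b"))

# Convert the resulting Arrow Array back to an R logical vector
as.vector(res)</code></pre></div>
<p>In base R the same information would be supplied as two ordinary arguments, <code>startsWith(x, prefix)</code>.</p>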
<p>So let’s get started.</p>
</div>
<div class="section level3">
<h3 id="step-1---add-unit-tests">Step 1 - add unit tests<a class="anchor" aria-label="anchor" href="#step-1---add-unit-tests"></a>
</h3>
<p>We recommend a test-driven-development approach - write failing tests first, then check that they fail, and then write the code needed to make them pass. Thinking up-front about the behavior which needs testing can make it easier to reason about the code which needs writing later.</p>
<p>Look up the R function that you want to bind the compute kernel to, and write a set of unit tests that use a dplyr pipeline and <code>compare_dplyr_binding()</code> (and perhaps even <code>compare_dplyr_error()</code> if necessary). These functions compare the output of the original function with the dplyr bindings and make sure they match.
We recommend looking at the <a href="https://github.com/apache/arrow/blob/main/r/tests/testthat/helper-expectation.R" class="external-link">documentation next to the source code for these functions</a> to get a better understanding of how they work.</p>
<p>You should make sure you’re testing all parameters of the R function in your tests.</p>
<p>Below is a possible example test for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p>
<div class="sourceCode" id="cb7"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu">test_that</span><span class="op">(</span><span class="st">"startsWith behaves identically in dplyr and Arrow"</span>, <span class="op">{</span></span>
<span> <span class="va">df</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://tibble.tidyverse.org/reference/tibble.html" class="external-link">tibble</a></span><span class="op">(</span>x <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"Foo"</span>, <span class="st">"bar"</span>, <span class="st">"baz"</span>, <span class="st">"qux"</span><span class="op">)</span><span class="op">)</span></span>
<span> <span class="fu">compare_dplyr_binding</span><span class="op">(</span></span>
<span> <span class="va">.input</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith</a></span><span class="op">(</span><span class="va">x</span>, <span class="st">"b"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/compute.html" class="external-link">collect</a></span><span class="op">(</span><span class="op">)</span>,</span>
<span> <span class="va">df</span></span>
<span> <span class="op">)</span></span>
<span><span class="op">}</span><span class="op">)</span></span></code></pre></div>
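<p>The helper <code>compare_dplyr_binding()</code> is only available inside the package’s test suite. To see the idea outside of it, the rough sketch below runs the same pipeline on a tibble and on an Arrow Table and compares the two results. It is not the helper’s implementation; <code>run_pipeline()</code> is a hypothetical function used only here, and the sketch assumes an arrow release in which the <code>startsWith()</code> binding already exists.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)
library(dplyr)

# Include a missing value so that NA handling is compared too
df &lt;- tibble(x = c("Foo", "bar", "baz", NA))

# Hypothetical helper for this illustration: the pipeline under test
run_pipeline &lt;- function(.input) {
  .input %&gt;%
    filter(startsWith(x, "b")) %&gt;%
    collect()
}

expected &lt;- run_pipeline(df)              # plain dplyr on a tibble
actual &lt;- run_pipeline(arrow_table(df))   # same code on an Arrow Table

# Both routes should give the same rows
all.equal(as.data.frame(actual), as.data.frame(expected))</code></pre></div>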
</div>
<div class="section level3">
<h3 id="step-2---hook-up-the-compute-function-with-options-class-if-necessary">Step 2 - Hook up the compute function with options class if necessary<a class="anchor" aria-label="anchor" href="#step-2---hook-up-the-compute-function-with-options-class-if-necessary"></a>
</h3>
<p>If the C++ compute function can have options specified, make sure that the function is linked with its options class in <code>make_compute_options()</code> in the file <code>arrow/r/src/compute.cpp</code>. You can find out if a compute function requires options by looking in the docs here: <a href="https://arrow.apache.org/docs/cpp/compute.html" class="external-link uri">https://arrow.apache.org/docs/cpp/compute.html</a></p>
<p>In the case of <code>starts_with()</code>, it looks something like this:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a> <span class="cf">if</span> (func_name == <span class="st">"starts_with"</span>) {</span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true"></a> <span class="kw">using</span> Options = arrow::compute::MatchSubstringOptions;</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true"></a> <span class="dt">bool</span> ignore_case = <span class="kw">false</span>;</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true"></a> <span class="cf">if</span> (!Rf_isNull(options[<span class="st">"ignore_case"</span>])) {</span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true"></a> ignore_case = cpp11::as_cpp&lt;<span class="dt">bool</span>&gt;(options[<span class="st">"ignore_case"</span>]);</span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true"></a> }</span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true"></a> <span class="cf">return</span> <span class="bu">std::</span>make_shared&lt;Options&gt;(cpp11::as_cpp&lt;<span class="bu">std::</span>string&gt;(options[<span class="st">"pattern"</span>]),</span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true"></a> ignore_case);</span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true"></a> }</span></code></pre></div>
<p>You can usually copy and paste from a similar existing example. In this case, as the option <code>ignore_case</code> doesn’t map to any parameters of <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>, we give it a default value of <code>false</code> but if it’s been set, use the set value instead. As the <code>pattern</code> argument maps directly to <code>prefix</code> in <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> we can pass it straight through.</p>
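<p>The effect of that default can be checked from R. The sketch below assumes an installed arrow release that contains the C++ snippet above (or an equivalent) and a build in which case-insensitive matching is available; the input values are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)

strings &lt;- Array$create(c("Bar", "bar"))

# `ignore_case` is not supplied, so the C++ default of `false` is used
default_case &lt;- call_function(
  "starts_with", strings,
  options = list(pattern = "b")
)

# `ignore_case` is supplied, so the set value is used instead
ignoring_case &lt;- call_function(
  "starts_with", strings,
  options = list(pattern = "b", ignore_case = TRUE)
)

as.vector(default_case)
as.vector(ignoring_case)</code></pre></div>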
</div>
<div class="section level3">
<h3 id="step-3---map-the-r-function-to-the-c-kernel">Step 3 - Map the R function to the C++ kernel<a class="anchor" aria-label="anchor" href="#step-3---map-the-r-function-to-the-c-kernel"></a>
</h3>
<p>The next task is writing the code which binds the R function to the C++ kernel.</p>
<div class="section level4">
<h4 id="step-3a---see-if-direct-mapping-is-appropriate">Step 3a - See if direct mapping is appropriate<a class="anchor" aria-label="anchor" href="#step-3a---see-if-direct-mapping-is-appropriate"></a>
</h4>
<p>Compare the C++ function and R function. If they are simple functions with no options, it might be possible to directly map between the C++ and R in <code>unary_function_map</code>, in the case of compute functions that operate on single columns of data, or <code>binary_function_map</code> for those which operate on 2 columns of data.</p>
<p>As the C++ function <code>starts_with()</code> requires options (the <code>pattern</code> must be supplied), direct mapping is not appropriate for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>.</p>
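<p>Conceptually, a direct mapping is a lookup from an R function name to a compute function name. The toy sketch below uses only base R. The map entries and the <code>lookup_compute_function()</code> helper are illustrative assumptions, not copied from the package source, so check the real maps before relying on any particular entry.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># Illustrative entries only: R function name = compute function name
example_unary_map &lt;- c(
  "is.na" = "is_null",
  "abs" = "abs_checked"
)

# Hypothetical helper: return the mapped name, or NULL if there is none
lookup_compute_function &lt;- function(r_name, map) {
  if (r_name %in% names(map)) map[[r_name]] else NULL
}

mapped &lt;- lookup_compute_function("abs", example_unary_map)
unmapped &lt;- lookup_compute_function("startsWith", example_unary_map)

mapped
## [1] "abs_checked"
unmapped
## NULL</code></pre></div>
<p>A function with no entry in such a map, like <code>startsWith()</code> here, needs a hand-written binding instead.</p>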
</div>
<div class="section level4">
<h4 id="step-3b---if-direct-mapping-not-possible-try-a-modified-implementation">Step 3b - If direct mapping not possible, try a modified implementation<a class="anchor" aria-label="anchor" href="#step-3b---if-direct-mapping-not-possible-try-a-modified-implementation"></a>
</h4>
<p>If the function cannot be mapped directly, some extra work may be needed to ensure that calling the arrow version of the function gives the same result as calling the R version. In this case, the function will need to be added to the <code>.cache$functions</code> function registry. Here is how this might look for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code>:</p>
<div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../../reference/register_binding.html">register_binding</a></span><span class="op">(</span><span class="st">"base::startsWith"</span>, <span class="kw">function</span><span class="op">(</span><span class="va">x</span>, <span class="va">prefix</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="va">Expression</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span>
<span> <span class="st">"starts_with"</span>,</span>
<span> <span class="va">x</span>,</span>
<span> options <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span>pattern <span class="op">=</span> <span class="va">prefix</span><span class="op">)</span></span>
<span> <span class="op">)</span></span>
<span><span class="op">}</span><span class="op">)</span></span></code></pre></div>
<p>In the source files, all the <code><a href="../../reference/register_binding.html">register_binding()</a></code> calls are wrapped in functions that are called on package load. These are separated into files based on subject matter (e.g., <code>R/dplyr-funcs-math.R</code>, <code>R/dplyr-funcs-string.R</code>): find the closest analog to the function whose binding is being defined and define the new binding in a similar location. For example, the binding for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> is registered in <code>dplyr-funcs-string.R</code> next to the binding for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">endsWith()</a></code>.</p>
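<p>Note: we use the namespace-qualified name (i.e. <code>"base::startsWith"</code>) for a binding. This registers the same binding both as <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> and as <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">base::startsWith()</a></code>, so the function can be called with or without the <code>pkg::</code> prefix. The same applies to bindings for functions from other packages, as in this call to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">stringr::str_detect()</a></code>:</p>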
<div class="sourceCode" id="cb10"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../../reference/table.html">arrow_table</a></span><span class="op">(</span><span class="va">starwars</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%&gt;%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="fu">stringr</span><span class="fu">::</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_detect.html" class="external-link">str_detect</a></span><span class="op">(</span><span class="va">name</span>, <span class="st">"Darth"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## Table (query)</span></span>
<span><span class="co">## name: string</span></span>
<span><span class="co">## height: int32</span></span>
<span><span class="co">## mass: double</span></span>
<span><span class="co">## hair_color: string</span></span>
<span><span class="co">## skin_color: string</span></span>
<span><span class="co">## eye_color: string</span></span>
<span><span class="co">## birth_year: double</span></span>
<span><span class="co">## sex: string</span></span>
<span><span class="co">## gender: string</span></span>
<span><span class="co">## homeworld: string</span></span>
<span><span class="co">## species: string</span></span>
<span><span class="co">## films: list&lt;item: string&gt;</span></span>
<span><span class="co">## vehicles: list&lt;item: string&gt;</span></span>
<span><span class="co">## starships: list&lt;item: string&gt;</span></span>
<span><span class="co">## </span></span>
<span><span class="co">## * Filter: match_substring_regex(name, {pattern="Darth", ignore_case=false})</span></span>
<span><span class="co">## See $.data for the source Arrow object</span></span></code></pre>
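<p>As a further illustration, the sketch below checks that a binding registered under a namespace-qualified name can be called both with and without the prefix. It assumes that arrow and dplyr are installed and that your arrow version includes the <code>startsWith()</code> binding; the small table and the object names are made up for this example.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">library(arrow)
library(dplyr)

# Small illustrative table (not part of the starwars example above)
tbl &lt;- arrow_table(
  data.frame(name = c("Darth Vader", "Luke Skywalker", "Darth Maul"))
)

# Call the binding without and with the pkg:: prefix
unqualified &lt;- tbl %&gt;% filter(startsWith(name, "Darth")) %&gt;% collect()
qualified &lt;- tbl %&gt;% filter(base::startsWith(name, "Darth")) %&gt;% collect()

# Both forms should select the same rows
all.equal(unqualified, qualified)</code></pre></div>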
<p>Hint: you can use <code><a href="../../reference/call_function.html">call_function()</a></code> to call a compute function directly from R. This can be useful if you want to experiment with a compute function while you’re writing bindings for it. For example:</p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../../reference/call_function.html">call_function</a></span><span class="op">(</span></span>
<span> <span class="st">"starts_with"</span>,</span>
<span> <span class="va">Array</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"Apache"</span>, <span class="st">"Arrow"</span>, <span class="st">"R"</span>, <span class="st">"package"</span><span class="op">)</span><span class="op">)</span>,</span>
<span> options <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span>pattern <span class="op">=</span> <span class="st">"A"</span><span class="op">)</span></span>
<span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## Array</span></span>
<span><span class="co">## &lt;bool&gt;</span></span>
<span><span class="co">## [</span></span>
<span><span class="co">## true,</span></span>
<span><span class="co">## true,</span></span>
<span><span class="co">## false,</span></span>
<span><span class="co">## false</span></span>
<span><span class="co">## ]</span></span></code></pre>
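<p>While experimenting, it can also help to compare the output of the compute function with that of the R function you are binding. A minimal sketch, assuming arrow is installed (the variable names are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">library(arrow)

x &lt;- c("Apache", "Arrow", "R", "package")

# Result from the Arrow compute function
arrow_result &lt;- call_function(
  "starts_with",
  Array$create(x),
  options = list(pattern = "A")
)

# Convert the Arrow array back to an R vector and compare with base R
identical(as.vector(arrow_result), startsWith(x, "A"))</code></pre></div>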
</div>
</div>
<div class="section level3">
<h3 id="step-4---run-and-potentially-add-to-your-tests-">Step 4 - Run (and potentially add to) your tests.<a class="anchor" aria-label="anchor" href="#step-4---run-and-potentially-add-to-your-tests-"></a>
</h3>
<p>In the process of implementing the function, you will need at least one test to make sure that your binding works and that future changes to the Arrow R package don’t break it. Bindings are tested in files that correspond to the file in which they were defined. For example, <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">startsWith()</a></code> is tested in <code>tests/testthat/test-dplyr-funcs-string.R</code>, next to the tests for <code><a href="https://rdrr.io/r/base/startsWith.html" class="external-link">endsWith()</a></code>.</p>
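<p>A test along these lines might look like the following sketch. It uses plain testthat so that it stands alone; the tests in the arrow source tree rely on the package’s own test helpers, so follow the neighbouring tests when adding yours. The data and the test name are illustrative, and the sketch assumes arrow, dplyr and testthat are installed.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">library(testthat)
library(arrow)
library(dplyr)

test_that("startsWith() binding agrees with base R", {
  # Illustrative data, including a missing value as an edge case
  df &lt;- data.frame(x = c("Apache", "Arrow", "R", "package", NA))

  # Run the same filter through arrow and through plain dplyr
  arrow_result &lt;- arrow_table(df) %&gt;%
    filter(startsWith(x, "A")) %&gt;%
    collect()

  r_result &lt;- df %&gt;%
    filter(startsWith(x, "A"))

  # The rows kept by the arrow binding should match those kept by R
  expect_equal(arrow_result$x, r_result$x)
})</code></pre></div>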
<p>You may end up implementing more tests, for example if you discover unusual edge cases. This is fine: add them to the ones you wrote originally, and run them all. If they pass, you’re done and you can submit a PR. If you’ve modified the C++ code in the R package (for example, when hooking up a binding to its options class), make sure to run <code>arrow/r/lint.sh</code> to lint the code.</p>
</div>
</div>
</main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
</nav></aside>
</div>
<footer><div class="pkgdown-footer-left">
<p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p>
</div>
<div class="pkgdown-footer-right">
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.9.</p>
</div>
</footer>
</div>
</body>
</html>