<!DOCTYPE html>
<html lang="en" data-content_root="./">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Data Manipulation &#8212; Apache Arrow Python Cookbook documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=49eeb2a1" />
<script src="_static/documentation_options.js?v=5929fcd5"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<link rel="icon" href="_static/favicon.ico"/>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Arrow Flight" href="flight.html" />
<link rel="prev" title="Working with Schema" href="schema.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<!-- Matomo -->
<script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
</head><body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="data-manipulation">
<h1><a class="toc-backref" href="#id1" role="doc-backlink">Data Manipulation</a><a class="headerlink" href="#data-manipulation" title="Link to this heading"></a></h1>
<p>Recipes related to filtering or transforming data in
arrays and tables.</p>
<nav class="contents" id="contents">
<p class="topic-title">Contents</p>
<ul class="simple">
<li><p><a class="reference internal" href="#data-manipulation" id="id1">Data Manipulation</a></p>
<ul>
<li><p><a class="reference internal" href="#computing-mean-min-max-values-of-an-array" id="id2">Computing Mean/Min/Max values of an array</a></p></li>
<li><p><a class="reference internal" href="#counting-occurrences-of-elements" id="id3">Counting Occurrences of Elements</a></p></li>
<li><p><a class="reference internal" href="#applying-arithmetic-functions-to-arrays" id="id4">Applying arithmetic functions to arrays</a></p></li>
<li><p><a class="reference internal" href="#appending-tables-to-an-existing-table" id="id5">Appending tables to an existing table</a></p></li>
<li><p><a class="reference internal" href="#adding-a-column-to-an-existing-table" id="id6">Adding a column to an existing Table</a></p></li>
<li><p><a class="reference internal" href="#replacing-a-column-in-an-existing-table" id="id7">Replacing a column in an existing Table</a></p></li>
<li><p><a class="reference internal" href="#group-a-table" id="id8">Group a Table</a></p></li>
<li><p><a class="reference internal" href="#sort-a-table" id="id9">Sort a Table</a></p></li>
<li><p><a class="reference internal" href="#searching-for-values-matching-a-predicate-in-arrays" id="id10">Searching for values matching a predicate in Arrays</a></p></li>
<li><p><a class="reference internal" href="#filtering-arrays-using-a-mask" id="id11">Filtering Arrays using a mask</a></p></li>
</ul>
</li>
</ul>
</nav>
<p>See <a class="reference external" href="https://arrow.apache.org/docs/python/compute.html#compute" title="(in Apache Arrow v15.0.1)"><span>Compute Functions</span></a> for a complete list of the available compute functions.</p>
<section id="computing-mean-min-max-values-of-an-array">
<h2><a class="toc-backref" href="#id2" role="doc-backlink">Computing Mean/Min/Max values of an array</a><a class="headerlink" href="#computing-mean-min-max-values-of-an-array" title="Link to this heading"></a></h2>
<p>Arrow provides compute functions that can be applied to arrays.
Those compute functions are exposed through the <code class="xref py py-mod docutils literal notranslate"><span class="pre">pyarrow.compute</span></code>
module.</p>
<p>Given an array with 100 numbers, from 0 to 99:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
<p>We can compute the <code class="docutils literal notranslate"><span class="pre">mean</span></code> using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.mean.html#pyarrow.compute.mean" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.mean()</span></code></a>
function:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">mean</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>49.5
</pre></div>
</div>
<p>We can compute the <code class="docutils literal notranslate"><span class="pre">min</span></code> and <code class="docutils literal notranslate"><span class="pre">max</span></code> in a single call using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.min_max.html#pyarrow.compute.min_max" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.min_max()</span></code></a>
function:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">min_max</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">min_max</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">min_max</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[(&#39;min&#39;, 0), (&#39;max&#39;, 99)]
</pre></div>
</div>
</section>
<section id="counting-occurrences-of-elements">
<h2><a class="toc-backref" href="#id3" role="doc-backlink">Counting Occurrences of Elements</a><a class="headerlink" href="#counting-occurrences-of-elements" title="Link to this heading"></a></h2>
<p>Arrow provides compute functions that can be applied to arrays.
Those compute functions are exposed through the <code class="xref py py-mod docutils literal notranslate"><span class="pre">pyarrow.compute</span></code>
module.</p>
<p>Given an array with all numbers from 0 to 9 repeated 10 times:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;LEN: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">nums_arr</span><span class="p">)</span><span class="si">}</span><span class="s2">, MIN/MAX: </span><span class="si">{</span><span class="n">nums_arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">nums_arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>LEN: 100, MIN/MAX: 0 .. 9
</pre></div>
</div>
<p>We can count occurrences of all entries in the array using the
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html#pyarrow.compute.value_counts" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.value_counts()</span></code></a> function:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">counts</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">nums_arr</span><span class="p">)</span>
<span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">counts</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">pair</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[(&#39;values&#39;, 0), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 1), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 2), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 3), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 4), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 5), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 6), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 7), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 8), (&#39;counts&#39;, 10)]
[(&#39;values&#39;, 9), (&#39;counts&#39;, 10)]
</pre></div>
</div>
</section>
<section id="applying-arithmetic-functions-to-arrays">
<h2><a class="toc-backref" href="#id4" role="doc-backlink">Applying arithmetic functions to arrays</a><a class="headerlink" href="#applying-arithmetic-functions-to-arrays" title="Link to this heading"></a></h2>
<p>The compute functions in <code class="xref py py-mod docutils literal notranslate"><span class="pre">pyarrow.compute</span></code> also include
common transformations such as arithmetic functions.</p>
<p>Given an array with 100 numbers, from 0 to 99:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
<p>We can multiply all values by 2 using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.multiply.html#pyarrow.compute.multiply" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.multiply()</span></code></a>
function:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">doubles</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">doubles</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">doubles</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 198
</pre></div>
</div>
</section>
<section id="appending-tables-to-an-existing-table">
<h2><a class="toc-backref" href="#id5" role="doc-backlink">Appending tables to an existing table</a><a class="headerlink" href="#appending-tables-to-an-existing-table" title="Link to this heading"></a></h2>
<p>If you have data split across two different tables, it is possible
to concatenate their rows into a single table.</p>
<p>If we have the list of Oscar nominations divided between two different tables:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">oscar_nominations_1</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="p">[</span><span class="s2">&quot;Meryl Streep&quot;</span><span class="p">,</span> <span class="s2">&quot;Katharine Hepburn&quot;</span><span class="p">],</span>
<span class="p">[</span><span class="mi">21</span><span class="p">,</span> <span class="mi">12</span><span class="p">]</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;actor&quot;</span><span class="p">,</span> <span class="s2">&quot;nominations&quot;</span><span class="p">])</span>
<span class="n">oscar_nominations_2</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="p">[</span><span class="s2">&quot;Jack Nicholson&quot;</span><span class="p">,</span> <span class="s2">&quot;Bette Davis&quot;</span><span class="p">],</span>
<span class="p">[</span><span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">]</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;actor&quot;</span><span class="p">,</span> <span class="s2">&quot;nominations&quot;</span><span class="p">])</span>
</pre></div>
</div>
<p>We can combine them into a single table using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html#pyarrow.concat_tables" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.concat_tables()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">oscar_nominations</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">concat_tables</span><span class="p">([</span><span class="n">oscar_nominations_1</span><span class="p">,</span>
<span class="n">oscar_nominations_2</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">oscar_nominations</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
actor: string
nominations: int64
----
actor: [[&quot;Meryl Streep&quot;,&quot;Katharine Hepburn&quot;],[&quot;Jack Nicholson&quot;,&quot;Bette Davis&quot;]]
nominations: [[21,12],[12,10]]
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>By default, appending two tables is a zero-copy operation: no data is
copied or rewritten. Because tables are made of <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray" title="(in Apache Arrow v15.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.ChunkedArray</span></code></a>,
the result is a table with multiple chunks, each pointing to the original
data that was appended. Under some conditions, Arrow might have to
cast data from one type to another (if <cite>promote=True</cite>). In such cases the data
must be copied, which incurs an extra cost.</p>
</div>
</section>
<section id="adding-a-column-to-an-existing-table">
<h2><a class="toc-backref" href="#id6" role="doc-backlink">Adding a column to an existing Table</a><a class="headerlink" href="#adding-a-column-to-an-existing-table" title="Link to this heading"></a></h2>
<p>If you have a table, you can extend it with a new column using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.append_column" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.append_column()</span></code></a>.</p>
<p>Suppose we have a table with Oscar nominations for each actress:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">oscar_nominations</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="p">[</span><span class="s2">&quot;Meryl Streep&quot;</span><span class="p">,</span> <span class="s2">&quot;Katharine Hepburn&quot;</span><span class="p">],</span>
<span class="p">[</span><span class="mi">21</span><span class="p">,</span> <span class="mi">12</span><span class="p">]</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;actor&quot;</span><span class="p">,</span> <span class="s2">&quot;nominations&quot;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">oscar_nominations</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
actor: string
nominations: int64
----
actor: [[&quot;Meryl Streep&quot;,&quot;Katharine Hepburn&quot;]]
nominations: [[21,12]]
</pre></div>
</div>
<p>It’s possible to append an additional column that tracks the years in which
each actress won, using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.append_column" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.append_column()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">oscar_nominations</span> <span class="o">=</span> <span class="n">oscar_nominations</span><span class="o">.</span><span class="n">append_column</span><span class="p">(</span>
<span class="s2">&quot;wonyears&quot;</span><span class="p">,</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">[</span><span class="mi">1980</span><span class="p">,</span> <span class="mi">1983</span><span class="p">,</span> <span class="mi">2012</span><span class="p">],</span>
<span class="p">[</span><span class="mi">1934</span><span class="p">,</span> <span class="mi">1968</span><span class="p">,</span> <span class="mi">1969</span><span class="p">,</span> <span class="mi">1982</span><span class="p">]</span>
<span class="p">])</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">oscar_nominations</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
actor: string
nominations: int64
wonyears: list&lt;item: int64&gt;
child 0, item: int64
----
actor: [[&quot;Meryl Streep&quot;,&quot;Katharine Hepburn&quot;]]
nominations: [[21,12]]
wonyears: [[[1980,1983,2012],[1934,1968,1969,1982]]]
</pre></div>
</div>
</section>
<section id="replacing-a-column-in-an-existing-table">
<h2><a class="toc-backref" href="#id7" role="doc-backlink">Replacing a column in an existing Table</a><a class="headerlink" href="#replacing-a-column-in-an-existing-table" title="Link to this heading"></a></h2>
<p>If you have a table, you can replace an existing column using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.set_column" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.set_column()</span></code></a>.</p>
<p>Suppose we have a table with information about items sold at a supermarket
on a particular day:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">sales_data</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="p">[</span><span class="s2">&quot;Potato&quot;</span><span class="p">,</span> <span class="s2">&quot;Bean&quot;</span><span class="p">,</span> <span class="s2">&quot;Cucumber&quot;</span><span class="p">,</span> <span class="s2">&quot;Eggs&quot;</span><span class="p">],</span>
<span class="p">[</span><span class="mi">21</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">30</span><span class="p">]</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;item&quot;</span><span class="p">,</span> <span class="s2">&quot;amount&quot;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">sales_data</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
item: string
amount: int64
----
item: [[&quot;Potato&quot;,&quot;Bean&quot;,&quot;Cucumber&quot;,&quot;Eggs&quot;]]
amount: [[21,12,10,30]]
</pre></div>
</div>
<p>It’s possible to replace the existing column <cite>amount</cite>
at index <cite>1</cite> with updated sales figures
using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.set_column" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.set_column()</span></code></a>, which takes the column index, the name of the replacement column, and the new data:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">new_sales_data</span> <span class="o">=</span> <span class="n">sales_data</span><span class="o">.</span><span class="n">set_column</span><span class="p">(</span>
<span class="mi">1</span><span class="p">,</span>
<span class="s2">&quot;new_amount&quot;</span><span class="p">,</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">30</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">40</span><span class="p">])</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">new_sales_data</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
item: string
new_amount: int64
----
item: [[&quot;Potato&quot;,&quot;Bean&quot;,&quot;Cucumber&quot;,&quot;Eggs&quot;]]
new_amount: [[30,20,15,40]]
</pre></div>
</div>
</section>
<section id="group-a-table">
<h2><a class="toc-backref" href="#id8" role="doc-backlink">Group a Table</a><a class="headerlink" href="#group-a-table" title="Link to this heading"></a></h2>
<p>If you have a table which needs to be grouped by a particular key,
you can use <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.group_by" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.group_by()</span></code></a> followed by an aggregation
operation <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.TableGroupBy.html#pyarrow.TableGroupBy.aggregate" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.TableGroupBy.aggregate()</span></code></a>. Learn more about
groupby operations <a class="reference external" href="https://arrow.apache.org/docs/python/compute.html#grouped-aggregations">here</a>.</p>
<p>For example, let’s say we have some data with a particular set of keys
and values associated with each key, and we want to group the data by
those keys and apply an aggregate function like sum to compute
the total of the values for each unique key.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">,</span> <span class="s2">&quot;e&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">]),</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">11</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">]),</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;keys&quot;</span><span class="p">,</span> <span class="s2">&quot;values&quot;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values: int64
----
keys: [[&quot;a&quot;,&quot;a&quot;,&quot;b&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;e&quot;,&quot;c&quot;]]
values: [[11,20,3,4,5,1,4,10]]
</pre></div>
</div>
<p>Now let’s apply a groupby operation. The table will be grouped
by the field <code class="docutils literal notranslate"><span class="pre">keys</span></code> and an aggregation operation, <code class="docutils literal notranslate"><span class="pre">sum</span></code>, is applied
to the column <code class="docutils literal notranslate"><span class="pre">values</span></code>. Note that each aggregation operation is paired with a column name.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">aggregated_table</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&quot;keys&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">aggregate</span><span class="p">([(</span><span class="s2">&quot;values&quot;</span><span class="p">,</span> <span class="s2">&quot;sum&quot;</span><span class="p">)])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">aggregated_table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values_sum: int64
----
keys: [[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;e&quot;]]
values_sum: [[31,7,15,1,4]]
</pre></div>
</div>
<p>The new table returns the aggregated column
as <code class="docutils literal notranslate"><span class="pre">values_sum</span></code>, a name formed from the column name and the aggregation operation name.</p>
<p>Aggregation operations can be applied with options. Let’s take a case where
we have null values included in our dataset, but we want to count
the values in each group while excluding the nulls.</p>
<p>A sample dataset can be formed as follows.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">,</span> <span class="s2">&quot;e&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">]),</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="kc">None</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="kc">None</span><span class="p">]),</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;keys&quot;</span><span class="p">,</span> <span class="s2">&quot;values&quot;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values: int64
----
keys: [[&quot;a&quot;,&quot;a&quot;,&quot;b&quot;,&quot;b&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;d&quot;,&quot;e&quot;,&quot;c&quot;]]
values: [[null,20,3,4,5,6,10,1,4,null]]
</pre></div>
</div>
<p>Let’s apply the <code class="docutils literal notranslate"><span class="pre">count</span></code> aggregation with the option to exclude
null values.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">grouped_table</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&quot;keys&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span>
<span class="p">[(</span><span class="s2">&quot;values&quot;</span><span class="p">,</span>
<span class="s2">&quot;count&quot;</span><span class="p">,</span>
<span class="n">pc</span><span class="o">.</span><span class="n">CountOptions</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s2">&quot;only_valid&quot;</span><span class="p">))]</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">grouped_table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values_count: int64
----
keys: [[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;e&quot;]]
values_count: [[1,3,1,2,1]]
</pre></div>
</div>
</section>
<section id="sort-a-table">
<h2><a class="toc-backref" href="#id9" role="doc-backlink">Sort a Table</a><a class="headerlink" href="#sort-a-table" title="Link to this heading"></a></h2>
<p>Let’s discuss how to sort a table. We can sort a table
based on the values of a given column. Data can be sorted in either <code class="docutils literal notranslate"><span class="pre">ascending</span></code>
or <code class="docutils literal notranslate"><span class="pre">descending</span></code> order.</p>
<p>Prepare the data:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">,</span> <span class="s2">&quot;e&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">]),</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">15</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">123</span><span class="p">]),</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;keys&quot;</span><span class="p">,</span> <span class="s2">&quot;values&quot;</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values: int64
----
keys: [[&quot;a&quot;,&quot;a&quot;,&quot;b&quot;,&quot;b&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;d&quot;,&quot;e&quot;,&quot;c&quot;]]
values: [[15,20,3,4,5,6,10,1,14,123]]
</pre></div>
</div>
<p>Then apply the sort with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.sort_by" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Table.sort_by()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">sorted_table</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">sort_by</span><span class="p">([(</span><span class="s2">&quot;values&quot;</span><span class="p">,</span> <span class="s2">&quot;ascending&quot;</span><span class="p">)])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">sorted_table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
keys: string
values: int64
----
keys: [[&quot;d&quot;,&quot;b&quot;,&quot;b&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;e&quot;,&quot;a&quot;,&quot;a&quot;,&quot;c&quot;]]
values: [[1,3,4,5,6,10,14,15,20,123]]
</pre></div>
</div>
</section>
<section id="searching-for-values-matching-a-predicate-in-arrays">
<h2><a class="toc-backref" href="#id10" role="doc-backlink">Searching for values matching a predicate in Arrays</a><a class="headerlink" href="#searching-for-values-matching-a-predicate-in-arrays" title="Link to this heading"></a></h2>
<p>If you have to look for values matching a predicate in Arrow arrays,
the <code class="xref py py-mod docutils literal notranslate"><span class="pre">pyarrow.compute</span></code> module provides several functions that
can be used to find the values you are looking for.</p>
<p>For example, given an array with numbers from 0 to 9, if we
want to look only for those greater than 5 we could use the
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.greater.html#pyarrow.compute.greater" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.greater()</span></code></a> function, which returns a boolean
mask marking the elements that fit our predicate:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="n">gtfive</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">greater</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">gtfive</span><span class="o">.</span><span class="n">to_string</span><span class="p">())</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[
false,
false,
false,
false,
false,
false,
true,
true,
true,
true
]
</pre></div>
</div>
<p>Furthermore, we can filter the array to get only the entries
that match our predicate with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html#pyarrow.compute.filter" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.filter()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">filtered_array</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">gtfive</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">filtered_array</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[
6,
7,
8,
9
]
</pre></div>
</div>
</section>
<section id="filtering-arrays-using-a-mask">
<h2><a class="toc-backref" href="#id11" role="doc-backlink">Filtering Arrays using a mask</a><a class="headerlink" href="#filtering-arrays-using-a-mask" title="Link to this heading"></a></h2>
<p>In many cases, when you are searching for something in an array
you will end up with a mask that tells you the positions at which
your search matched the values.</p>
<p>For example, in an array of four items, we might have a mask that
matches the first and the last items only:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">array</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">True</span><span class="p">])</span>
</pre></div>
</div>
<p>We can then filter the array according to the mask using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.filter" title="(in Apache Arrow v15.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.Array.filter()</span></code></a> to get back a new array with
only the values matching the mask:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">filtered_array</span> <span class="o">=</span> <span class="n">array</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">filtered_array</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[
1,
4
]
</pre></div>
</div>
<p>Most search functions in <code class="xref py py-mod docutils literal notranslate"><span class="pre">pyarrow.compute</span></code> will produce
a mask as the output, so you can use them to filter your arrays
for the values that have been found by the function.</p>
<p>For example, we might filter our array for the values equal to <code class="docutils literal notranslate"><span class="pre">2</span></code>
using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.equal.html#pyarrow.compute.equal" title="(in Apache Arrow v15.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.compute.equal()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.compute</span> <span class="k">as</span> <span class="nn">pc</span>
<span class="n">filtered_array</span> <span class="o">=</span> <span class="n">array</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pc</span><span class="o">.</span><span class="n">equal</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">filtered_array</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[
2
]
</pre></div>
</div>
</section>
</section>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
&#169;2022, Apache Software Foundation.
|
Powered by <a href="https://www.sphinx-doc.org/">Sphinx 7.2.6</a>
&amp; <a href="https://alabaster.readthedocs.io">Alabaster 0.7.16</a>
|
<a href="_sources/data.rst.txt"
rel="nofollow">Page source</a>
</div>
</body>
</html>