<!DOCTYPE html>
<html lang="en" data-content_root="./">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Reading and Writing Data &#8212; Apache Arrow Python Cookbook documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=4f649999" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=39aeeac0" />
<script src="_static/documentation_options.js?v=5929fcd5"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<link rel="icon" href="_static/favicon.ico"/>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Creating Arrow Objects" href="create.html" />
<link rel="prev" title="Apache Arrow Python Cookbook" href="index.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
<!-- Matomo -->
<script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
</head><body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="reading-and-writing-data">
<h1><a class="toc-backref" href="#id1" role="doc-backlink">Reading and Writing Data</a><a class="headerlink" href="#reading-and-writing-data" title="Link to this heading"></a></h1>
<p>Recipes related to reading and writing data from disk using
Apache Arrow.</p>
<nav class="contents" id="contents">
<p class="topic-title">Contents</p>
<ul class="simple">
<li><p><a class="reference internal" href="#reading-and-writing-data" id="id1">Reading and Writing Data</a></p>
<ul>
<li><p><a class="reference internal" href="#write-a-parquet-file" id="id2">Write a Parquet file</a></p></li>
<li><p><a class="reference internal" href="#reading-a-parquet-file" id="id3">Reading a Parquet file</a></p></li>
<li><p><a class="reference internal" href="#reading-a-subset-of-parquet-data" id="id4">Reading a subset of Parquet data</a></p></li>
<li><p><a class="reference internal" href="#saving-arrow-arrays-to-disk" id="id5">Saving Arrow Arrays to disk</a></p></li>
<li><p><a class="reference internal" href="#memory-mapping-arrow-arrays-from-disk" id="id6">Memory Mapping Arrow Arrays from disk</a></p></li>
<li><p><a class="reference internal" href="#writing-csv-files" id="id7">Writing CSV files</a></p></li>
<li><p><a class="reference internal" href="#writing-csv-files-incrementally" id="id8">Writing CSV files incrementally</a></p></li>
<li><p><a class="reference internal" href="#reading-csv-files" id="id9">Reading CSV files</a></p></li>
<li><p><a class="reference internal" href="#writing-partitioned-datasets" id="id10">Writing Partitioned Datasets</a></p></li>
<li><p><a class="reference internal" href="#reading-partitioned-data" id="id11">Reading Partitioned data</a></p></li>
<li><p><a class="reference internal" href="#reading-partitioned-data-from-s3" id="id12">Reading Partitioned Data from S3</a></p></li>
<li><p><a class="reference internal" href="#write-a-feather-file" id="id13">Write a Feather file</a></p></li>
<li><p><a class="reference internal" href="#reading-a-feather-file" id="id14">Reading a Feather file</a></p></li>
<li><p><a class="reference internal" href="#reading-line-delimited-json" id="id15">Reading Line Delimited JSON</a></p></li>
<li><p><a class="reference internal" href="#writing-compressed-data" id="id16">Writing Compressed Data</a></p></li>
<li><p><a class="reference internal" href="#reading-compressed-data" id="id17">Reading Compressed Data</a></p></li>
</ul>
</li>
</ul>
</nav>
<section id="write-a-parquet-file">
<h2><a class="toc-backref" href="#id2" role="doc-backlink">Write a Parquet file</a><a class="headerlink" href="#write-a-parquet-file" title="Link to this heading"></a></h2>
<p>Given an array with 100 numbers, from 0 to 99</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
<p>To write it to a Parquet file we must first create a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>
out of it, since Parquet is a format that stores multiple named columns;
this gives us a single-column table that can then be
written to a Parquet file.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;col1&quot;</span><span class="p">])</span>
</pre></div>
</div>
<p>Once we have a table, it can be written to a Parquet file
using the functions provided by the <code class="docutils literal notranslate"><span class="pre">pyarrow.parquet</span></code> module</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span>
<span class="n">pq</span><span class="o">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">&quot;example.parquet&quot;</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="reading-a-parquet-file">
<h2><a class="toc-backref" href="#id3" role="doc-backlink">Reading a Parquet file</a><a class="headerlink" href="#reading-a-parquet-file" title="Link to this heading"></a></h2>
<p>Given a Parquet file, it can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>
by using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a> function</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pq</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;example.parquet&quot;</span><span class="p">)</span>
</pre></div>
</div>
<p>The resulting table will contain the same columns that existed in
the Parquet file, as <code class="xref py py-class docutils literal notranslate"><span class="pre">ChunkedArray</span></code> columns</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
</pre></div>
</div>
</section>
<section id="reading-a-subset-of-parquet-data">
<h2><a class="toc-backref" href="#id4" role="doc-backlink">Reading a subset of Parquet data</a><a class="headerlink" href="#reading-a-subset-of-parquet-data" title="Link to this heading"></a></h2>
<p>When reading a Parquet file with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a>
it is possible to restrict which columns and rows will be read
into memory by using the <code class="docutils literal notranslate"><span class="pre">filters</span></code> and <code class="docutils literal notranslate"><span class="pre">columns</span></code> arguments</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pq</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;example.parquet&quot;</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;col1&quot;</span><span class="p">],</span>
<span class="n">filters</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">&quot;col1&quot;</span><span class="p">,</span> <span class="s2">&quot;&gt;&quot;</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="p">(</span><span class="s2">&quot;col1&quot;</span><span class="p">,</span> <span class="s2">&quot;&lt;&quot;</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>
<span class="p">])</span>
</pre></div>
</div>
<p>The resulting table will contain only the projected columns
and filtered rows. Refer to the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a>
documentation for details about the filter syntax.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
col1: int64
----
col1: [[6,7,8,9]]
</pre></div>
</div>
</section>
<section id="saving-arrow-arrays-to-disk">
<h2><a class="toc-backref" href="#id5" role="doc-backlink">Saving Arrow Arrays to disk</a><a class="headerlink" href="#saving-arrow-arrays-to-disk" title="Link to this heading"></a></h2>
<p>Apart from using Arrow to read and save common file formats like Parquet,
it is possible to dump data in the raw Arrow format, which allows
direct memory mapping of data from disk. This format is called
the Arrow IPC format.</p>
<p>Given an array with 100 numbers, from 0 to 99</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
<p>We can save the array by making a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a> out
of it and writing the record batch to disk.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([</span>
<span class="n">pa</span><span class="o">.</span><span class="n">field</span><span class="p">(</span><span class="s1">&#39;nums&#39;</span><span class="p">,</span> <span class="n">arr</span><span class="o">.</span><span class="n">type</span><span class="p">)</span>
<span class="p">])</span>
<span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">OSFile</span><span class="p">(</span><span class="s1">&#39;arraydata.arrow&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">sink</span><span class="p">:</span>
<span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">ipc</span><span class="o">.</span><span class="n">new_file</span><span class="p">(</span><span class="n">sink</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="n">batch</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">record_batch</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span>
<span class="n">writer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
</pre></div>
</div>
<p>If we were to save multiple arrays into the same file,
we would just have to adapt the <code class="docutils literal notranslate"><span class="pre">schema</span></code> accordingly and pass
all of the arrays to the <code class="docutils literal notranslate"><span class="pre">record_batch</span></code> call, as sketched below.</p>
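<p>For instance, a minimal sketch of saving two arrays into the same file; the second array, its <code class="docutils literal notranslate"><span class="pre">strs</span></code> field name and the <code class="docutils literal notranslate"><span class="pre">arraydata2.arrow</span></code> file name are hypothetical:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pyarrow as pa

arr = pa.array(range(100))
# A second, hypothetical array saved alongside the numbers.
strs = pa.array([str(x) for x in range(100)])

# One schema field per array, in the same order as the arrays.
schema = pa.schema([
    pa.field('nums', arr.type),
    pa.field('strs', strs.type)
])
with pa.OSFile('arraydata2.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, schema=schema) as writer:
        batch = pa.record_batch([arr, strs], schema=schema)
        writer.write(batch)
</pre></div>
</div>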
</section>
<section id="memory-mapping-arrow-arrays-from-disk">
<h2><a class="toc-backref" href="#id6" role="doc-backlink">Memory Mapping Arrow Arrays from disk</a><a class="headerlink" href="#memory-mapping-arrow-arrays-from-disk" title="Link to this heading"></a></h2>
<p>Arrow arrays that have been written to disk in the Arrow IPC
format can be memory mapped back directly from the disk.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">memory_map</span><span class="p">(</span><span class="s1">&#39;arraydata.arrow&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span>
<span class="n">loaded_arrays</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">ipc</span><span class="o">.</span><span class="n">open_file</span><span class="p">(</span><span class="n">source</span><span class="p">)</span><span class="o">.</span><span class="n">read_all</span><span class="p">()</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">arr</span> <span class="o">=</span> <span class="n">loaded_arrays</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
</section>
<section id="writing-csv-files">
<h2><a class="toc-backref" href="#id7" role="doc-backlink">Writing CSV files</a><a class="headerlink" href="#writing-csv-files" title="Link to this heading"></a></h2>
<p>It is possible to write an Arrow <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> to
a CSV file using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html#pyarrow.csv.write_csv" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.csv.write_csv()</span></code></a> function</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;col1&quot;</span><span class="p">])</span>
<span class="kn">import</span> <span class="nn">pyarrow.csv</span>
<span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">write_csv</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">&quot;table.csv&quot;</span><span class="p">,</span>
<span class="n">write_options</span><span class="o">=</span><span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">WriteOptions</span><span class="p">(</span><span class="n">include_header</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
</pre></div>
</div>
</section>
<section id="writing-csv-files-incrementally">
<h2><a class="toc-backref" href="#id8" role="doc-backlink">Writing CSV files incrementally</a><a class="headerlink" href="#writing-csv-files-incrementally" title="Link to this heading"></a></h2>
<p>If you need to write data to a CSV file incrementally
as you generate or retrieve it, and you don’t want to keep
the whole table in memory to write it all at once, it’s possible to use
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVWriter.html#pyarrow.csv.CSVWriter" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.csv.CSVWriter</span></code></a> to write data incrementally</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([(</span><span class="s2">&quot;col1&quot;</span><span class="p">,</span> <span class="n">pa</span><span class="o">.</span><span class="n">int32</span><span class="p">())])</span>
<span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">CSVWriter</span><span class="p">(</span><span class="s2">&quot;table.csv&quot;</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">datachunk</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">chunk</span><span class="o">*</span><span class="mi">10</span><span class="p">,</span> <span class="p">(</span><span class="n">chunk</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">10</span><span class="p">)</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">datachunk</span><span class="p">)],</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span>
<span class="n">writer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<p>It’s equally possible to write <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a> objects,
passing them just as you would tables.</p>
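<p>A minimal sketch of the same loop, assuming the schema above, writing record batches instead of tables:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pyarrow as pa
import pyarrow.csv

schema = pa.schema([("col1", pa.int32())])
with pa.csv.CSVWriter("table.csv", schema=schema) as writer:
    for chunk in range(10):
        # A record batch is accepted by write() just like a table.
        batch = pa.record_batch(
            [pa.array(range(chunk * 10, (chunk + 1) * 10), type=pa.int32())],
            schema=schema)
        writer.write(batch)
</pre></div>
</div>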
</section>
<section id="reading-csv-files">
<h2><a class="toc-backref" href="#id9" role="doc-backlink">Reading CSV files</a><a class="headerlink" href="#reading-csv-files" title="Link to this heading"></a></h2>
<p>Arrow can read <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> entities from CSV using an
optimized codepath that can leverage multiple threads.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.csv</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">&quot;table.csv&quot;</span><span class="p">)</span>
</pre></div>
</div>
<p>Arrow will do its best to infer data types. The inference can be
driven by passing a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.csv.ConvertOptions</span></code></a> instance to
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html#pyarrow.csv.read_csv" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.csv.read_csv()</span></code></a>, as sketched after the output below.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
</pre></div>
</div>
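<p>As a sketch, assuming the <code class="docutils literal notranslate"><span class="pre">table.csv</span></code> file written earlier, the inference could be overridden to read <code class="docutils literal notranslate"><span class="pre">col1</span></code> as <code class="docutils literal notranslate"><span class="pre">int32</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pyarrow as pa
import pyarrow.csv

# Hypothetical override: read col1 as int32 instead of the inferred int64.
opts = pa.csv.ConvertOptions(column_types={"col1": pa.int32()})
table = pa.csv.read_csv("table.csv", convert_options=opts)
print(table.schema)
</pre></div>
</div>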
</section>
<section id="writing-partitioned-datasets">
<h2><a class="toc-backref" href="#id10" role="doc-backlink">Writing Partitioned Datasets</a><a class="headerlink" href="#writing-partitioned-datasets" title="Link to this heading"></a></h2>
<p>When your dataset is big it usually makes sense to split it into
multiple separate files. You can do this manually or use
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.write_dataset()</span></code></a> to let Arrow do the effort
of splitting the data in chunks for you.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">partitioning</span></code> argument allows to tell <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.write_dataset()</span></code></a>
for which columns the data should be split.</p>
<p>For example, given 100 birthdays between 2000 and 2009</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy.random</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">({</span><span class="s2">&quot;day&quot;</span><span class="p">:</span> <span class="n">numpy</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">31</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="s2">&quot;month&quot;</span><span class="p">:</span> <span class="n">numpy</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="s2">&quot;year&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">2000</span> <span class="o">+</span> <span class="n">x</span> <span class="o">//</span> <span class="mi">10</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)]})</span>
</pre></div>
</div>
<p>Then we could partition the data by the year column so that it
gets saved in 10 different files:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="nn">ds</span>
<span class="n">ds</span><span class="o">.</span><span class="n">write_dataset</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s2">&quot;./partitioned&quot;</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s2">&quot;parquet&quot;</span><span class="p">,</span>
<span class="n">partitioning</span><span class="o">=</span><span class="n">ds</span><span class="o">.</span><span class="n">partitioning</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([(</span><span class="s2">&quot;year&quot;</span><span class="p">,</span> <span class="n">pa</span><span class="o">.</span><span class="n">int16</span><span class="p">())])))</span>
</pre></div>
</div>
<p>Arrow will partition datasets in subdirectories by default, which will
result in 10 different directories, each named after a value of the partitioning
column and containing a file with the subset of the data for that partition:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyarrow</span> <span class="kn">import</span> <span class="n">fs</span>
<span class="n">localfs</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">LocalFileSystem</span><span class="p">()</span>
<span class="n">partitioned_dir_content</span> <span class="o">=</span> <span class="n">localfs</span><span class="o">.</span><span class="n">get_file_info</span><span class="p">(</span><span class="n">fs</span><span class="o">.</span><span class="n">FileSelector</span><span class="p">(</span><span class="s2">&quot;./partitioned&quot;</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="n">files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">((</span><span class="n">f</span><span class="o">.</span><span class="n">path</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">partitioned_dir_content</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">type</span> <span class="o">==</span> <span class="n">fs</span><span class="o">.</span><span class="n">FileType</span><span class="o">.</span><span class="n">File</span><span class="p">))</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>./partitioned/2000/part-0.parquet
./partitioned/2001/part-0.parquet
./partitioned/2002/part-0.parquet
./partitioned/2003/part-0.parquet
./partitioned/2004/part-0.parquet
./partitioned/2005/part-0.parquet
./partitioned/2006/part-0.parquet
./partitioned/2007/part-0.parquet
./partitioned/2008/part-0.parquet
./partitioned/2009/part-0.parquet
</pre></div>
</div>
</section>
<section id="reading-partitioned-data">
<h2><a class="toc-backref" href="#id11" role="doc-backlink">Reading Partitioned data</a><a class="headerlink" href="#reading-partitioned-data" title="Link to this heading"></a></h2>
<p>In some cases, your dataset might be composed of multiple separate
files, each containing a piece of the data.</p>
<p>In this case the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.dataset()</span></code></a> function provides
an interface to discover and read all those files as a single big dataset.</p>
<p>For example, if we have a structure like:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>examples/
├── dataset1.parquet
├── dataset2.parquet
└── dataset3.parquet
</pre></div>
</div>
<p>Then, pointing the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.dataset()</span></code></a> function at the <code class="docutils literal notranslate"><span class="pre">examples</span></code> directory
will discover those Parquet files and expose them all as a single
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="nn">ds</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">dataset</span><span class="p">(</span><span class="s2">&quot;./examples&quot;</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s2">&quot;parquet&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">files</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[&#39;./examples/dataset1.parquet&#39;, &#39;./examples/dataset2.parquet&#39;, &#39;./examples/dataset3.parquet&#39;]
</pre></div>
</div>
<p>The whole dataset can be viewed as a single big table using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_table()</span></code></a>. While each parquet file
contains only 10 rows, converting the dataset to a table will
expose them as a single Table.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">to_table</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],[20,21,22,23,24,25,26,27,28,29]]
</pre></div>
</div>
<p>Notice that converting to a table will force all data to be loaded
into memory, which for big datasets is usually not what you want.</p>
<p>For this reason, it might be better to rely on the
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_batches()</span></code></a> method, which will
iteratively load the dataset one chunk of data at the time returning a
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a> for each one of them.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">record_batch</span> <span class="ow">in</span> <span class="n">dataset</span><span class="o">.</span><span class="n">to_batches</span><span class="p">():</span>
<span class="n">col1</span> <span class="o">=</span> <span class="n">record_batch</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s2">&quot;col1&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">col1</span><span class="o">.</span><span class="n">_name</span><span class="si">}</span><span class="s2"> = </span><span class="si">{</span><span class="n">col1</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">col1</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>col1 = 0 .. 9
col1 = 10 .. 19
col1 = 20 .. 29
</pre></div>
</div>
</section>
<section id="reading-partitioned-data-from-s3">
<h2><a class="toc-backref" href="#id12" role="doc-backlink">Reading Partitioned Data from S3</a><a class="headerlink" href="#reading-partitioned-data-from-s3" title="Link to this heading"></a></h2>
<p>The <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset</span></code></a> is also able to abstract
partitioned data coming from remote sources like S3 or HDFS.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyarrow</span> <span class="kn">import</span> <span class="n">fs</span>
<span class="c1"># List content of s3://ursa-labs-taxi-data/2011</span>
<span class="n">s3</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">SubTreeFileSystem</span><span class="p">(</span>
<span class="s2">&quot;ursa-labs-taxi-data&quot;</span><span class="p">,</span>
<span class="n">fs</span><span class="o">.</span><span class="n">S3FileSystem</span><span class="p">(</span><span class="n">region</span><span class="o">=</span><span class="s2">&quot;us-east-2&quot;</span><span class="p">,</span> <span class="n">anonymous</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">s3</span><span class="o">.</span><span class="n">get_file_info</span><span class="p">(</span><span class="n">fs</span><span class="o">.</span><span class="n">FileSelector</span><span class="p">(</span><span class="s2">&quot;2011&quot;</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="kc">True</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">entry</span><span class="o">.</span><span class="n">type</span> <span class="o">==</span> <span class="n">fs</span><span class="o">.</span><span class="n">FileType</span><span class="o">.</span><span class="n">File</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">entry</span><span class="o">.</span><span class="n">path</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>2011/01/data.parquet
2011/02/data.parquet
2011/03/data.parquet
2011/04/data.parquet
2011/05/data.parquet
2011/06/data.parquet
2011/07/data.parquet
2011/08/data.parquet
2011/09/data.parquet
2011/10/data.parquet
2011/11/data.parquet
2011/12/data.parquet
</pre></div>
</div>
<p>The data in the bucket can be loaded as a single big dataset partitioned
by <code class="docutils literal notranslate"><span class="pre">month</span></code> using</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">dataset</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">dataset</span><span class="p">(</span><span class="s2">&quot;s3://ursa-labs-taxi-data/2011&quot;</span><span class="p">,</span>
<span class="n">partitioning</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;month&quot;</span><span class="p">])</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">dataset</span><span class="o">.</span><span class="n">files</span><span class="p">[:</span><span class="mi">10</span><span class="p">]:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;...&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>ursa-labs-taxi-data/2011/01/data.parquet
ursa-labs-taxi-data/2011/02/data.parquet
ursa-labs-taxi-data/2011/03/data.parquet
ursa-labs-taxi-data/2011/04/data.parquet
ursa-labs-taxi-data/2011/05/data.parquet
ursa-labs-taxi-data/2011/06/data.parquet
ursa-labs-taxi-data/2011/07/data.parquet
ursa-labs-taxi-data/2011/08/data.parquet
ursa-labs-taxi-data/2011/09/data.parquet
ursa-labs-taxi-data/2011/10/data.parquet
...
</pre></div>
</div>
<p>The dataset can then be used with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_table()</span></code></a>
or <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_batches()</span></code></a> like you would for a local one.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>It is also possible to load partitioned data in the Arrow IPC
format or in the Feather format, as sketched below.</p>
</div>
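<p>As a minimal sketch, reusing the <code class="docutils literal notranslate"><span class="pre">data</span></code> table from the Writing Partitioned Datasets recipe and a hypothetical <code class="docutils literal notranslate"><span class="pre">./partitioned_feather</span></code> directory:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pyarrow as pa
import pyarrow.dataset as ds

part = ds.partitioning(pa.schema([("year", pa.int16())]))
# Write and rediscover the same partitioned layout, but in Feather format;
# "data" is the table from the Writing Partitioned Datasets recipe.
ds.write_dataset(data, "./partitioned_feather", format="feather",
                 partitioning=part)
dataset = ds.dataset("./partitioned_feather", format="feather",
                     partitioning=part)
</pre></div>
</div>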
<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>If the above code throws an error, the most likely reason is that your
AWS credentials are not set. Follow these instructions to get an
<code class="docutils literal notranslate"><span class="pre">AWS</span> <span class="pre">Access</span> <span class="pre">Key</span> <span class="pre">Id</span></code> and an <code class="docutils literal notranslate"><span class="pre">AWS</span> <span class="pre">Secret</span> <span class="pre">Access</span> <span class="pre">Key</span></code>:
<a class="reference external" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">AWS Credentials</a>.</p>
<p>The credentials are normally stored in the <code class="docutils literal notranslate"><span class="pre">~/.aws/credentials</span></code> file (on Mac or Linux)
or in <code class="docutils literal notranslate"><span class="pre">C:\Users\&lt;USERNAME&gt;\.aws\credentials</span></code> (on Windows).
You will need to either create or update this file in the appropriate location.</p>
<p>The contents of the file should look like this:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>default<span class="o">]</span>
<span class="nv">aws_access_key_id</span><span class="o">=</span>&lt;YOUR_AWS_ACCESS_KEY_ID&gt;
<span class="nv">aws_secret_access_key</span><span class="o">=</span>&lt;YOUR_AWS_SECRET_ACCESS_KEY&gt;
</pre></div>
</div>
</div>
</section>
<section id="write-a-feather-file">
<h2><a class="toc-backref" href="#id13" role="doc-backlink">Write a Feather file</a><a class="headerlink" href="#write-a-feather-file" title="Link to this heading"></a></h2>
<p>Given an array with 100 numbers, from 0 to 99</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99
</pre></div>
</div>
<p>To write it to a Feather file we must first create a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>
out of it, since Feather stores multiple named columns;
this gives us a single-column table that can then be
written to a Feather file.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;col1&quot;</span><span class="p">])</span>
</pre></div>
</div>
<p>Once we have a table, it can be written to a Feather file
using the functions provided by the <code class="docutils literal notranslate"><span class="pre">pyarrow.feather</span></code> module</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.feather</span> <span class="k">as</span> <span class="nn">ft</span>
<span class="n">ft</span><span class="o">.</span><span class="n">write_feather</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s1">&#39;example.feather&#39;</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="reading-a-feather-file">
<h2><a class="toc-backref" href="#id14" role="doc-backlink">Reading a Feather file</a><a class="headerlink" href="#reading-a-feather-file" title="Link to this heading"></a></h2>
<p>Given a Feather file, it can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>
by using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_table.html#pyarrow.feather.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.feather.read_table()</span></code></a> function</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.feather</span> <span class="k">as</span> <span class="nn">ft</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">ft</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;example.feather&quot;</span><span class="p">)</span>
</pre></div>
</div>
<p>The resulting table will contain the same columns that existed in
the Feather file, as <code class="xref py py-class docutils literal notranslate"><span class="pre">ChunkedArray</span></code> columns</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
</pre></div>
</div>
</section>
<section id="reading-line-delimited-json">
<h2><a class="toc-backref" href="#id15" role="doc-backlink">Reading Line Delimited JSON</a><a class="headerlink" href="#reading-line-delimited-json" title="Link to this heading"></a></h2>
<p>Arrow has built-in support for line-delimited JSON.
Each line represents a row of data as a JSON object.</p>
<p>Given some data in a file where each line is a JSON object
containing a row of data:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s2">&quot;w+&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">&#39;{&quot;a&quot;: 1, &quot;b&quot;: 2.0, &quot;c&quot;: 1}</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">&#39;{&quot;a&quot;: 3, &quot;b&quot;: 3.0, &quot;c&quot;: 2}</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">&#39;{&quot;a&quot;: 5, &quot;b&quot;: 4.0, &quot;c&quot;: 3}</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">&#39;{&quot;a&quot;: 7, &quot;b&quot;: 5.0, &quot;c&quot;: 4}</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>The content of the file can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.json.read_json()</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.json</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">json</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">to_pydict</span><span class="p">())</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;a&#39;: [1, 3, 5, 7], &#39;b&#39;: [2.0, 3.0, 4.0, 5.0], &#39;c&#39;: [1, 2, 3, 4]}
</pre></div>
</div>
</section>
<section id="writing-compressed-data">
<h2><a class="toc-backref" href="#id16" role="doc-backlink">Writing Compressed Data</a><a class="headerlink" href="#writing-compressed-data" title="Link to this heading"></a></h2>
<p>Arrow provides support for writing files in compressed formats,
both for formats that provide compression natively like Parquet or Feather,
and for formats that don’t support compression out of the box like CSV.</p>
<p>Given a table:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span>
<span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;numbers&quot;</span><span class="p">])</span>
</pre></div>
</div>
<p>Writing compressed Parquet or Feather data is driven by the
<code class="docutils literal notranslate"><span class="pre">compression</span></code> argument to the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.feather.write_feather()</span></code></a> and
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.write_table()</span></code></a> functions:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pa</span><span class="o">.</span><span class="n">feather</span><span class="o">.</span><span class="n">write_feather</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">&quot;compressed.feather&quot;</span><span class="p">,</span>
<span class="n">compression</span><span class="o">=</span><span class="s2">&quot;lz4&quot;</span><span class="p">)</span>
<span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">&quot;compressed.parquet&quot;</span><span class="p">,</span>
<span class="n">compression</span><span class="o">=</span><span class="s2">&quot;lz4&quot;</span><span class="p">)</span>
</pre></div>
</div>
<p>You can refer to each of those functions’ documentation for a complete
list of supported compression formats.</p>
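<p>As a sketch (assuming an Arrow build with <code class="docutils literal notranslate"><span class="pre">zstd</span></code> enabled),
other codecs are selected the same way, and
<code class="docutils literal notranslate"><span class="pre">pyarrow.parquet.write_table()</span></code> also accepts a
per-column mapping:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># A minimal sketch: zstd for the whole file, assuming the codec is
# available in this Arrow build.
pa.parquet.write_table(table, "compressed_zstd.parquet",
                       compression="zstd")

# Codecs can also be chosen per column via a dict keyed by column name.
pa.parquet.write_table(table, "compressed_per_column.parquet",
                       compression={"numbers": "zstd"})
</pre></div>
</div>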
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Arrow actually uses compression by default when writing
Parquet or Feather files. Feather is compressed using <code class="docutils literal notranslate"><span class="pre">lz4</span></code>
by default and Parquet uses <code class="docutils literal notranslate"><span class="pre">snappy</span></code> by default.</p>
</div>
<p>For formats that don’t support compression natively, like CSV,
it’s possible to save compressed data using
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedOutputStream.html#pyarrow.CompressedOutputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedOutputStream</span></code></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">CompressedOutputStream</span><span class="p">(</span><span class="s2">&quot;compressed.csv.gz&quot;</span><span class="p">,</span> <span class="s2">&quot;gzip&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
<span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">write_csv</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span>
</pre></div>
</div>
<p>This requires decompressing the file when reading it back,
which can be done using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedInputStream.html#pyarrow.CompressedInputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedInputStream</span></code></a>
as explained in the next recipe.</p>
</section>
<section id="reading-compressed-data">
<h2><a class="toc-backref" href="#id17" role="doc-backlink">Reading Compressed Data</a><a class="headerlink" href="#reading-compressed-data" title="Link to this heading"></a></h2>
<p>Arrow provides support for reading compressed files,
both for formats that provide it natively like Parquet or Feather,
and for files in formats that don’t support compression natively,
like CSV, but have been compressed by an application.</p>
<p>Reading compressed formats that have native support for compression
doesn’t require any special handling. We can for example read back
the Parquet and Feather files we wrote in the previous recipe
by simply invoking <code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.feather.read_table()</span></code> and
<code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_feather</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">feather</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;compressed.feather&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table_feather</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_parquet</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;compressed.parquet&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table_parquet</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
</pre></div>
</div>
<p>Reading data from formats that don’t have native support for
compression instead involves decompressing them before decoding them.
This can be done using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedInputStream.html#pyarrow.CompressedInputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedInputStream</span></code></a> class
which wraps files with a decompress operation before the result is
provided to the actual read function.</p>
<p>For example to read a compressed CSV file:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">CompressedInputStream</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">OSFile</span><span class="p">(</span><span class="s2">&quot;compressed.csv.gz&quot;</span><span class="p">),</span> <span class="s2">&quot;gzip&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="nb">input</span><span class="p">:</span>
<span class="n">table_csv</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table_csv</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>In the case of CSV, arrow is actually smart enough to try detecting
compressed files using the file extension. So if your file is named
<code class="docutils literal notranslate"><span class="pre">*.gz</span></code> or <code class="docutils literal notranslate"><span class="pre">*.bz2</span></code> the <code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.csv.read_csv()</span></code> function will
try to decompress it accordingly</p>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_csv2</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">&quot;compressed.csv.gz&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">table_csv2</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
</pre></div>
</div>
</section>
</section>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<p class="logo">
<a href="index.html">
<img class="logo" src="_static/arrow-logo_vertical_black-txt_transparent-bg.svg" alt="Logo"/>
</a>
</p>
<p>
<iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=arrow-cookbook&type=none&count=true&size=large&v=2"
allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe>
</p>
<h3>Navigation</h3>
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Reading and Writing Data</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#write-a-parquet-file">Write a Parquet file</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-a-parquet-file">Reading a Parquet file</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-a-subset-of-parquet-data">Reading a subset of Parquet data</a></li>
<li class="toctree-l2"><a class="reference internal" href="#saving-arrow-arrays-to-disk">Saving Arrow Arrays to disk</a></li>
<li class="toctree-l2"><a class="reference internal" href="#memory-mapping-arrow-arrays-from-disk">Memory Mapping Arrow Arrays from disk</a></li>
<li class="toctree-l2"><a class="reference internal" href="#writing-csv-files">Writing CSV files</a></li>
<li class="toctree-l2"><a class="reference internal" href="#writing-csv-files-incrementally">Writing CSV files incrementally</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-csv-files">Reading CSV files</a></li>
<li class="toctree-l2"><a class="reference internal" href="#writing-partitioned-datasets">Writing Partitioned Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-partitioned-data">Reading Partitioned data</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-partitioned-data-from-s3">Reading Partitioned Data from S3</a></li>
<li class="toctree-l2"><a class="reference internal" href="#write-a-feather-file">Write a Feather file</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-a-feather-file">Reading a Feather file</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-line-delimited-json">Reading Line Delimited JSON</a></li>
<li class="toctree-l2"><a class="reference internal" href="#writing-compressed-data">Writing Compressed Data</a></li>
<li class="toctree-l2"><a class="reference internal" href="#reading-compressed-data">Reading Compressed Data</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="create.html">Creating Arrow Objects</a></li>
<li class="toctree-l1"><a class="reference internal" href="schema.html">Working with Schema</a></li>
<li class="toctree-l1"><a class="reference internal" href="data.html">Data Manipulation</a></li>
<li class="toctree-l1"><a class="reference internal" href="flight.html">Arrow Flight</a></li>
</ul>
<hr />
<ul>
<li class="toctree-l1"><a href="https://arrow.apache.org/docs/python/index.html">User Guide</a></li>
<li class="toctree-l1"><a href="https://arrow.apache.org/docs/python/api.html">API Reference</a></li>
</ul>
<div class="relations">
<h3>Related Topics</h3>
<ul>
<li><a href="index.html">Documentation overview</a><ul>
<li>Previous: <a href="index.html" title="previous chapter">Apache Arrow Python Cookbook</a></li>
<li>Next: <a href="create.html" title="next chapter">Creating Arrow Objects</a></li>
</ul></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
<input type="submit" value="Go" />
</form>
</div>
</div>
<script>document.getElementById('searchbox').style.display = "block"</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
&copy;2022, Apache Software Foundation.
|
Powered by <a href="http://sphinx-doc.org/">Sphinx 7.2.6</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.13</a>
|
<a href="_sources/io.rst.txt"
rel="nofollow">Page source</a>
</div>
</body>
</html>