| <!DOCTYPE html> |
| |
| <html lang="en" data-content_root="./"> |
| <head> |
| <meta charset="utf-8" /> |
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
| |
| <title>Reading and Writing Data — Apache Arrow Python Cookbook documentation</title> |
| <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=4f649999" /> |
| <link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=39aeeac0" /> |
| <script src="_static/documentation_options.js?v=5929fcd5"></script> |
| <script src="_static/doctools.js?v=888ff710"></script> |
| <script src="_static/sphinx_highlight.js?v=dc90522c"></script> |
| <link rel="icon" href="_static/favicon.ico"/> |
| <link rel="index" title="Index" href="genindex.html" /> |
| <link rel="search" title="Search" href="search.html" /> |
| <link rel="next" title="Creating Arrow Objects" href="create.html" /> |
| <link rel="prev" title="Apache Arrow Python Cookbook" href="index.html" /> |
| |
| |
| <link rel="stylesheet" href="_static/custom.css" type="text/css" /> |
| |
| |
| <!-- Matomo --> |
| <script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script> |
| <!-- End Matomo Code --> |
| |
| </head><body> |
| |
| |
| <div class="document"> |
| <div class="documentwrapper"> |
| <div class="bodywrapper"> |
| |
| |
| <div class="body" role="main"> |
| |
| <section id="reading-and-writing-data"> |
| <h1><a class="toc-backref" href="#id1" role="doc-backlink">Reading and Writing Data</a><a class="headerlink" href="#reading-and-writing-data" title="Link to this heading">¶</a></h1> |
| <p>Recipes related to reading and writing data from disk using |
| Apache Arrow.</p> |
| <nav class="contents" id="contents"> |
| <p class="topic-title">Contents</p> |
| <ul class="simple"> |
| <li><p><a class="reference internal" href="#reading-and-writing-data" id="id1">Reading and Writing Data</a></p> |
| <ul> |
| <li><p><a class="reference internal" href="#write-a-parquet-file" id="id2">Write a Parquet file</a></p></li> |
| <li><p><a class="reference internal" href="#reading-a-parquet-file" id="id3">Reading a Parquet file</a></p></li> |
| <li><p><a class="reference internal" href="#reading-a-subset-of-parquet-data" id="id4">Reading a subset of Parquet data</a></p></li> |
| <li><p><a class="reference internal" href="#saving-arrow-arrays-to-disk" id="id5">Saving Arrow Arrays to disk</a></p></li> |
| <li><p><a class="reference internal" href="#memory-mapping-arrow-arrays-from-disk" id="id6">Memory Mapping Arrow Arrays from disk</a></p></li> |
| <li><p><a class="reference internal" href="#writing-csv-files" id="id7">Writing CSV files</a></p></li> |
| <li><p><a class="reference internal" href="#writing-csv-files-incrementally" id="id8">Writing CSV files incrementally</a></p></li> |
| <li><p><a class="reference internal" href="#reading-csv-files" id="id9">Reading CSV files</a></p></li> |
| <li><p><a class="reference internal" href="#writing-partitioned-datasets" id="id10">Writing Partitioned Datasets</a></p></li> |
| <li><p><a class="reference internal" href="#reading-partitioned-data" id="id11">Reading Partitioned data</a></p></li> |
| <li><p><a class="reference internal" href="#reading-partitioned-data-from-s3" id="id12">Reading Partitioned Data from S3</a></p></li> |
| <li><p><a class="reference internal" href="#write-a-feather-file" id="id13">Write a Feather file</a></p></li> |
| <li><p><a class="reference internal" href="#reading-a-feather-file" id="id14">Reading a Feather file</a></p></li> |
| <li><p><a class="reference internal" href="#reading-line-delimited-json" id="id15">Reading Line Delimited JSON</a></p></li> |
| <li><p><a class="reference internal" href="#writing-compressed-data" id="id16">Writing Compressed Data</a></p></li> |
| <li><p><a class="reference internal" href="#reading-compressed-data" id="id17">Reading Compressed Data</a></p></li> |
| </ul> |
| </li> |
| </ul> |
| </nav> |
| <section id="write-a-parquet-file"> |
| <h2><a class="toc-backref" href="#id2" role="doc-backlink">Write a Parquet file</a><a class="headerlink" href="#write-a-parquet-file" title="Link to this heading">¶</a></h2> |
| <p>Given an array with 100 numbers, from 0 to 99</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> |
| <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span> |
| |
| <span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> |
| |
| <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99 |
| </pre></div> |
| </div> |
<p>To write it to a Parquet file we must first create a
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>
out of it, because Parquet is a format with multiple named columns;
this gives us a single-column table that can then be
written to a Parquet file.</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"col1"</span><span class="p">])</span> |
| </pre></div> |
| </div> |
<p>Once we have a table, it can be written to a Parquet file
using the functions provided by the <code class="docutils literal notranslate"><span class="pre">pyarrow.parquet</span></code> module</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span> |
| |
| <span class="n">pq</span><span class="o">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">"example.parquet"</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-a-parquet-file"> |
| <h2><a class="toc-backref" href="#id3" role="doc-backlink">Reading a Parquet file</a><a class="headerlink" href="#reading-a-parquet-file" title="Link to this heading">¶</a></h2> |
| <p>Given a Parquet file, it can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> |
by using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a> function</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span> |
| |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pq</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"example.parquet"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
<p>The resulting table will contain the same columns that existed in
the Parquet file, each as a <code class="xref py py-class docutils literal notranslate"><span class="pre">ChunkedArray</span></code></p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| col1: int64 |
| ---- |
| col1: [[0,1,2,3,4,...,95,96,97,98,99]] |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-a-subset-of-parquet-data"> |
| <h2><a class="toc-backref" href="#id4" role="doc-backlink">Reading a subset of Parquet data</a><a class="headerlink" href="#reading-a-subset-of-parquet-data" title="Link to this heading">¶</a></h2> |
| <p>When reading a Parquet file with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a> |
it is possible to restrict which columns and rows will be read
into memory by using the <code class="docutils literal notranslate"><span class="pre">columns</span></code> and <code class="docutils literal notranslate"><span class="pre">filters</span></code> arguments</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span> |
| |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pq</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"example.parquet"</span><span class="p">,</span> |
| <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">"col1"</span><span class="p">],</span> |
| <span class="n">filters</span><span class="o">=</span><span class="p">[</span> |
| <span class="p">(</span><span class="s2">"col1"</span><span class="p">,</span> <span class="s2">">"</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> |
| <span class="p">(</span><span class="s2">"col1"</span><span class="p">,</span> <span class="s2">"<"</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> |
| <span class="p">])</span> |
| </pre></div> |
| </div> |
| <p>The resulting table will contain only the projected columns |
and filtered rows. Refer to the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code></a>
documentation for details about the filter syntax.</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| col1: int64 |
| ---- |
| col1: [[6,7,8,9]] |
| </pre></div> |
| </div> |
| </section> |
| <section id="saving-arrow-arrays-to-disk"> |
| <h2><a class="toc-backref" href="#id5" role="doc-backlink">Saving Arrow Arrays to disk</a><a class="headerlink" href="#saving-arrow-arrays-to-disk" title="Link to this heading">¶</a></h2> |
<p>Apart from using Arrow to read and save common file formats like Parquet,
it is possible to dump data in the raw Arrow format, which allows
direct memory mapping of the data from disk. This format is called
the Arrow IPC format.</p>
| <p>Given an array with 100 numbers, from 0 to 99</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> |
| <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span> |
| |
| <span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> |
| |
| <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99 |
| </pre></div> |
| </div> |
| <p>We can save the array by making a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a> out |
| of it and writing the record batch to disk.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([</span> |
| <span class="n">pa</span><span class="o">.</span><span class="n">field</span><span class="p">(</span><span class="s1">'nums'</span><span class="p">,</span> <span class="n">arr</span><span class="o">.</span><span class="n">type</span><span class="p">)</span> |
| <span class="p">])</span> |
| |
| <span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">OSFile</span><span class="p">(</span><span class="s1">'arraydata.arrow'</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">sink</span><span class="p">:</span> |
| <span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">ipc</span><span class="o">.</span><span class="n">new_file</span><span class="p">(</span><span class="n">sink</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span> |
| <span class="n">batch</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">record_batch</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> |
| <span class="n">writer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>If we were to save multiple arrays into the same file, |
| we would just have to adapt the <code class="docutils literal notranslate"><span class="pre">schema</span></code> accordingly and add |
| them all to the <code class="docutils literal notranslate"><span class="pre">record_batch</span></code> call.</p> |
| </section> |
| <section id="memory-mapping-arrow-arrays-from-disk"> |
| <h2><a class="toc-backref" href="#id6" role="doc-backlink">Memory Mapping Arrow Arrays from disk</a><a class="headerlink" href="#memory-mapping-arrow-arrays-from-disk" title="Link to this heading">¶</a></h2> |
<p>Arrow arrays that have been written to disk in the Arrow IPC
format can be memory-mapped back directly from disk.</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">memory_map</span><span class="p">(</span><span class="s1">'arraydata.arrow'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span> |
| <span class="n">loaded_arrays</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">ipc</span><span class="o">.</span><span class="n">open_file</span><span class="p">(</span><span class="n">source</span><span class="p">)</span><span class="o">.</span><span class="n">read_all</span><span class="p">()</span> |
| </pre></div> |
| </div> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">arr</span> <span class="o">=</span> <span class="n">loaded_arrays</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> |
| <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99 |
| </pre></div> |
| </div> |
| </section> |
| <section id="writing-csv-files"> |
| <h2><a class="toc-backref" href="#id7" role="doc-backlink">Writing CSV files</a><a class="headerlink" href="#writing-csv-files" title="Link to this heading">¶</a></h2> |
| <p>It is possible to write an Arrow <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> to |
| a CSV file using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html#pyarrow.csv.write_csv" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.csv.write_csv()</span></code></a> function</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"col1"</span><span class="p">])</span> |
| |
| <span class="kn">import</span> <span class="nn">pyarrow.csv</span> |
| <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">write_csv</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">"table.csv"</span><span class="p">,</span> |
| <span class="n">write_options</span><span class="o">=</span><span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">WriteOptions</span><span class="p">(</span><span class="n">include_header</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span> |
| </pre></div> |
| </div> |
| </section> |
| <section id="writing-csv-files-incrementally"> |
| <h2><a class="toc-backref" href="#id8" role="doc-backlink">Writing CSV files incrementally</a><a class="headerlink" href="#writing-csv-files-incrementally" title="Link to this heading">¶</a></h2> |
<p>If you need to write data to a CSV file incrementally
as you generate or retrieve it, and you don’t want to keep
the whole table in memory to write it all at once, you can use
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVWriter.html#pyarrow.csv.CSVWriter" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.csv.CSVWriter</span></code></a> to write the data incrementally</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">schema</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([(</span><span class="s2">"col1"</span><span class="p">,</span> <span class="n">pa</span><span class="o">.</span><span class="n">int32</span><span class="p">())])</span> |
| <span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">CSVWriter</span><span class="p">(</span><span class="s2">"table.csv"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span> |
| <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span> |
| <span class="n">datachunk</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">chunk</span><span class="o">*</span><span class="mi">10</span><span class="p">,</span> <span class="p">(</span><span class="n">chunk</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">10</span><span class="p">)</span> |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">datachunk</span><span class="p">)],</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span> |
| <span class="n">writer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
<p>It’s equally possible to write <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a>
objects, passing them just as you would tables.</p>
| </section> |
| <section id="reading-csv-files"> |
| <h2><a class="toc-backref" href="#id9" role="doc-backlink">Reading CSV files</a><a class="headerlink" href="#reading-csv-files" title="Link to this heading">¶</a></h2> |
| <p>Arrow can read <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> entities from CSV using an |
| optimized codepath that can leverage multiple threads.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.csv</span> |
| |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"table.csv"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
<p>Arrow will do its best to infer data types. Further options can be
provided to <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html#pyarrow.csv.read_csv" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.csv.read_csv()</span></code></a> through a
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.csv.ConvertOptions</span></code></a> instance to control type conversion.</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| col1: int64 |
| ---- |
| col1: [[0,1,2,3,4,...,95,96,97,98,99]] |
| </pre></div> |
| </div> |
| </section> |
| <section id="writing-partitioned-datasets"> |
| <h2><a class="toc-backref" href="#id10" role="doc-backlink">Writing Partitioned Datasets</a><a class="headerlink" href="#writing-partitioned-datasets" title="Link to this heading">¶</a></h2> |
| <p>When your dataset is big it usually makes sense to split it into |
| multiple separate files. You can do this manually or use |
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.write_dataset()</span></code></a> to let Arrow do the work
of splitting the data into chunks for you.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">partitioning</span></code> argument lets you tell <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.write_dataset()</span></code></a>
by which columns the data should be split.</p>
<p>For example, given 100 birthdays between 2000 and 2009</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy.random</span> |
<span class="n">data</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">({</span><span class="s2">"day"</span><span class="p">:</span> <span class="n">numpy</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="s2">"month"</span><span class="p">:</span> <span class="n">numpy</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
| <span class="s2">"year"</span><span class="p">:</span> <span class="p">[</span><span class="mi">2000</span> <span class="o">+</span> <span class="n">x</span> <span class="o">//</span> <span class="mi">10</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)]})</span> |
| </pre></div> |
| </div> |
| <p>Then we could partition the data by the year column so that it |
| gets saved in 10 different files:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span> |
| <span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="nn">ds</span> |
| |
| <span class="n">ds</span><span class="o">.</span><span class="n">write_dataset</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s2">"./partitioned"</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s2">"parquet"</span><span class="p">,</span> |
| <span class="n">partitioning</span><span class="o">=</span><span class="n">ds</span><span class="o">.</span><span class="n">partitioning</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([(</span><span class="s2">"year"</span><span class="p">,</span> <span class="n">pa</span><span class="o">.</span><span class="n">int16</span><span class="p">())])))</span> |
| </pre></div> |
| </div> |
<p>Arrow will partition datasets in subdirectories by default, which will
result in 10 different directories, each named after a value of the partitioning
column and containing a file with the subset of the data for that partition:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyarrow</span> <span class="kn">import</span> <span class="n">fs</span> |
| |
| <span class="n">localfs</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">LocalFileSystem</span><span class="p">()</span> |
| <span class="n">partitioned_dir_content</span> <span class="o">=</span> <span class="n">localfs</span><span class="o">.</span><span class="n">get_file_info</span><span class="p">(</span><span class="n">fs</span><span class="o">.</span><span class="n">FileSelector</span><span class="p">(</span><span class="s2">"./partitioned"</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span> |
| <span class="n">files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">((</span><span class="n">f</span><span class="o">.</span><span class="n">path</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">partitioned_dir_content</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">type</span> <span class="o">==</span> <span class="n">fs</span><span class="o">.</span><span class="n">FileType</span><span class="o">.</span><span class="n">File</span><span class="p">))</span> |
| |
| <span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">file</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>./partitioned/2000/part-0.parquet |
| ./partitioned/2001/part-0.parquet |
| ./partitioned/2002/part-0.parquet |
| ./partitioned/2003/part-0.parquet |
| ./partitioned/2004/part-0.parquet |
| ./partitioned/2005/part-0.parquet |
| ./partitioned/2006/part-0.parquet |
| ./partitioned/2007/part-0.parquet |
| ./partitioned/2008/part-0.parquet |
| ./partitioned/2009/part-0.parquet |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-partitioned-data"> |
<h2><a class="toc-backref" href="#id11" role="doc-backlink">Reading Partitioned Data</a><a class="headerlink" href="#reading-partitioned-data" title="Link to this heading">¶</a></h2>
<p>In some cases, your dataset might be composed of multiple separate
files, each containing a piece of the data.</p>
<p>In this case, the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.dataset()</span></code></a> function provides
| an interface to discover and read all those files as a single big dataset.</p> |
| <p>For example if we have a structure like:</p> |
| <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>examples/ |
| ├── dataset1.parquet |
| ├── dataset2.parquet |
| └── dataset3.parquet |
| </pre></div> |
| </div> |
| <p>Then, pointing the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.dataset.dataset()</span></code></a> function to the <code class="docutils literal notranslate"><span class="pre">examples</span></code> directory |
will discover those Parquet files and expose them all as a single
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset</span></code></a>:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="nn">ds</span> |
| |
| <span class="n">dataset</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">dataset</span><span class="p">(</span><span class="s2">"./examples"</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s2">"parquet"</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">files</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>['./examples/dataset1.parquet', './examples/dataset2.parquet', './examples/dataset3.parquet'] |
| </pre></div> |
| </div> |
| <p>The whole dataset can be viewed as a single big table using |
<a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_table()</span></code></a>. While each Parquet file
| contains only 10 rows, converting the dataset to a table will |
| expose them as a single Table.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">to_table</span><span class="p">()</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| col1: int64 |
| ---- |
| col1: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],[20,21,22,23,24,25,26,27,28,29]] |
| </pre></div> |
| </div> |
<p>Notice that converting to a table will force all data to be loaded
into memory. For big datasets, this is usually not what you want.</p>
| <p>For this reason, it might be better to rely on the |
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_batches()</span></code></a> method, which will |
iteratively load the dataset one chunk of data at a time, returning a
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.RecordBatch</span></code></a> for each one of them.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">record_batch</span> <span class="ow">in</span> <span class="n">dataset</span><span class="o">.</span><span class="n">to_batches</span><span class="p">():</span> |
| <span class="n">col1</span> <span class="o">=</span> <span class="n">record_batch</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s2">"col1"</span><span class="p">)</span> |
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"col1 = </span><span class="si">{</span><span class="n">col1</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">col1</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>col1 = 0 .. 9 |
| col1 = 10 .. 19 |
| col1 = 20 .. 29 |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-partitioned-data-from-s3"> |
| <h2><a class="toc-backref" href="#id12" role="doc-backlink">Reading Partitioned Data from S3</a><a class="headerlink" href="#reading-partitioned-data-from-s3" title="Link to this heading">¶</a></h2> |
| <p>The <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset</span></code></a> is also able to abstract |
| partitioned data coming from remote sources like S3 or HDFS.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyarrow</span> <span class="kn">import</span> <span class="n">fs</span> |
| |
| <span class="c1"># List content of s3://ursa-labs-taxi-data/2011</span> |
| <span class="n">s3</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">SubTreeFileSystem</span><span class="p">(</span> |
| <span class="s2">"ursa-labs-taxi-data"</span><span class="p">,</span> |
| <span class="n">fs</span><span class="o">.</span><span class="n">S3FileSystem</span><span class="p">(</span><span class="n">region</span><span class="o">=</span><span class="s2">"us-east-2"</span><span class="p">,</span> <span class="n">anonymous</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">s3</span><span class="o">.</span><span class="n">get_file_info</span><span class="p">(</span><span class="n">fs</span><span class="o">.</span><span class="n">FileSelector</span><span class="p">(</span><span class="s2">"2011"</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="kc">True</span><span class="p">)):</span> |
| <span class="k">if</span> <span class="n">entry</span><span class="o">.</span><span class="n">type</span> <span class="o">==</span> <span class="n">fs</span><span class="o">.</span><span class="n">FileType</span><span class="o">.</span><span class="n">File</span><span class="p">:</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">entry</span><span class="o">.</span><span class="n">path</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>2011/01/data.parquet |
| 2011/02/data.parquet |
| 2011/03/data.parquet |
| 2011/04/data.parquet |
| 2011/05/data.parquet |
| 2011/06/data.parquet |
| 2011/07/data.parquet |
| 2011/08/data.parquet |
| 2011/09/data.parquet |
| 2011/10/data.parquet |
| 2011/11/data.parquet |
| 2011/12/data.parquet |
| </pre></div> |
| </div> |
| <p>The data in the bucket can be loaded as a single big dataset partitioned |
by <code class="docutils literal notranslate"><span class="pre">month</span></code> using:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">dataset</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">dataset</span><span class="p">(</span><span class="s2">"s3://ursa-labs-taxi-data/2011"</span><span class="p">,</span> |
| <span class="n">partitioning</span><span class="o">=</span><span class="p">[</span><span class="s2">"month"</span><span class="p">])</span> |
| <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">dataset</span><span class="o">.</span><span class="n">files</span><span class="p">[:</span><span class="mi">10</span><span class="p">]:</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="s2">"..."</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>ursa-labs-taxi-data/2011/01/data.parquet |
| ursa-labs-taxi-data/2011/02/data.parquet |
| ursa-labs-taxi-data/2011/03/data.parquet |
| ursa-labs-taxi-data/2011/04/data.parquet |
| ursa-labs-taxi-data/2011/05/data.parquet |
| ursa-labs-taxi-data/2011/06/data.parquet |
| ursa-labs-taxi-data/2011/07/data.parquet |
| ursa-labs-taxi-data/2011/08/data.parquet |
| ursa-labs-taxi-data/2011/09/data.parquet |
| ursa-labs-taxi-data/2011/10/data.parquet |
| ... |
| </pre></div> |
| </div> |
| <p>The dataset can then be used with <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_table()</span></code></a> |
| or <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches" title="(in Apache Arrow v14.0.1)"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.dataset.Dataset.to_batches()</span></code></a> like you would for a local one.</p> |
| <div class="admonition note"> |
| <p class="admonition-title">Note</p> |
<p>It is also possible to load partitioned data in the Arrow IPC
format or in the Feather format.</p>
| </div> |
| <div class="admonition warning"> |
| <p class="admonition-title">Warning</p> |
<p>If the above code throws an error, the most likely reason is that your
AWS credentials are not set. Follow these instructions to get
| <code class="docutils literal notranslate"><span class="pre">AWS</span> <span class="pre">Access</span> <span class="pre">Key</span> <span class="pre">Id</span></code> and <code class="docutils literal notranslate"><span class="pre">AWS</span> <span class="pre">Secret</span> <span class="pre">Access</span> <span class="pre">Key</span></code>: |
| <a class="reference external" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">AWS Credentials</a>.</p> |
<p>The credentials are normally stored in the <code class="docutils literal notranslate"><span class="pre">~/.aws/credentials</span></code> file (on Mac or Linux)
or in <code class="docutils literal notranslate"><span class="pre">C:\Users\&lt;USERNAME&gt;\.aws\credentials</span></code> (on Windows).
You will need to create or update this file in the appropriate location.</p>
| <p>The contents of the file should look like this:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>default<span class="o">]</span> |
<span class="nv">aws_access_key_id</span><span class="o">=</span>&lt;YOUR_AWS_ACCESS_KEY_ID&gt;
<span class="nv">aws_secret_access_key</span><span class="o">=</span>&lt;YOUR_AWS_SECRET_ACCESS_KEY&gt;
| </pre></div> |
| </div> |
| </div> |
| </section> |
| <section id="write-a-feather-file"> |
| <h2><a class="toc-backref" href="#id13" role="doc-backlink">Write a Feather file</a><a class="headerlink" href="#write-a-feather-file" title="Link to this heading">¶</a></h2> |
<p>Given an array with 100 numbers, from 0 to 99:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> |
| <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span> |
| |
| <span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> |
| |
| <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> .. </span><span class="si">{</span><span class="n">arr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>0 .. 99 |
| </pre></div> |
| </div> |
<p>Since Feather stores multiple columns, to write the array to a
Feather file we must first wrap it in a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a>,
giving us a single-column table that can then be
written out.</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">([</span><span class="n">arr</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"col1"</span><span class="p">])</span> |
| </pre></div> |
| </div> |
<p>Once we have a table, it can be written to a Feather file
using the functions provided by the <code class="docutils literal notranslate"><span class="pre">pyarrow.feather</span></code> module:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.feather</span> <span class="k">as</span> <span class="nn">ft</span> |
| |
| <span class="n">ft</span><span class="o">.</span><span class="n">write_feather</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s1">'example.feather'</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-a-feather-file"> |
| <h2><a class="toc-backref" href="#id14" role="doc-backlink">Reading a Feather file</a><a class="headerlink" href="#reading-a-feather-file" title="Link to this heading">¶</a></h2> |
| <p>Given a Feather file, it can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> |
by using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_table.html#pyarrow.feather.read_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.feather.read_table()</span></code></a> function:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow.feather</span> <span class="k">as</span> <span class="nn">ft</span> |
| |
| <span class="n">table</span> <span class="o">=</span> <span class="n">ft</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"example.feather"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
<p>The resulting table will contain the same columns that existed in
the Feather file, as <code class="xref py py-class docutils literal notranslate"><span class="pre">ChunkedArray</span></code> columns:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| col1: int64 |
| ---- |
| col1: [[0,1,2,3,4,...,95,96,97,98,99]] |
| </pre></div> |
| </div> |
| </section> |
| <section id="reading-line-delimited-json"> |
| <h2><a class="toc-backref" href="#id15" role="doc-backlink">Reading Line Delimited JSON</a><a class="headerlink" href="#reading-line-delimited-json" title="Link to this heading">¶</a></h2> |
<p>Arrow has built-in support for line-delimited JSON.
| Each line represents a row of data as a JSON object.</p> |
| <p>Given some data in a file where each line is a JSON object |
| containing a row of data:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">tempfile</span> |
| |
| <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s2">"w+"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> |
| <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'{"a": 1, "b": 2.0, "c": 1}</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span> |
| <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'{"a": 3, "b": 3.0, "c": 2}</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span> |
| <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'{"a": 5, "b": 4.0, "c": 3}</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span> |
| <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'{"a": 7, "b": 5.0, "c": 4}</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>The content of the file can be read back to a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.Table</span></code></a> using |
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.json.read_json()</span></code></a>:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span> |
| <span class="kn">import</span> <span class="nn">pyarrow.json</span> |
| |
| <span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">json</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">to_pydict</span><span class="p">())</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]} |
| </pre></div> |
| </div> |
| </section> |
| <section id="writing-compressed-data"> |
| <h2><a class="toc-backref" href="#id16" role="doc-backlink">Writing Compressed Data</a><a class="headerlink" href="#writing-compressed-data" title="Link to this heading">¶</a></h2> |
| <p>Arrow provides support for writing files in compressed formats, |
| both for formats that provide compression natively like Parquet or Feather, |
| and for formats that don’t support compression out of the box like CSV.</p> |
| <p>Given a table:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">([</span> |
| <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span> |
| <span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"numbers"</span><span class="p">])</span> |
| </pre></div> |
| </div> |
| <p>Writing compressed Parquet or Feather data is driven by the |
| <code class="docutils literal notranslate"><span class="pre">compression</span></code> argument to the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.feather.write_feather()</span></code></a> and |
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table" title="(in Apache Arrow v14.0.1)"><code class="xref py py-func docutils literal notranslate"><span class="pre">pyarrow.parquet.write_table()</span></code></a> functions:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pa</span><span class="o">.</span><span class="n">feather</span><span class="o">.</span><span class="n">write_feather</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">"compressed.feather"</span><span class="p">,</span> |
| <span class="n">compression</span><span class="o">=</span><span class="s2">"lz4"</span><span class="p">)</span> |
| <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="s2">"compressed.parquet"</span><span class="p">,</span> |
| <span class="n">compression</span><span class="o">=</span><span class="s2">"lz4"</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>You can refer to each of those functions’ documentation for a complete |
| list of supported compression formats.</p> |
| <div class="admonition note"> |
| <p class="admonition-title">Note</p> |
<p>Arrow actually compresses data by default when writing
Parquet or Feather files: Feather uses <code class="docutils literal notranslate"><span class="pre">lz4</span></code>
and Parquet uses <code class="docutils literal notranslate"><span class="pre">snappy</span></code>.</p>
| </div> |
| <p>For formats that don’t support compression natively, like CSV, |
| it’s possible to save compressed data using |
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedOutputStream.html#pyarrow.CompressedOutputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedOutputStream</span></code></a>:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">CompressedOutputStream</span><span class="p">(</span><span class="s2">"compressed.csv.gz"</span><span class="p">,</span> <span class="s2">"gzip"</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span> |
| <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">write_csv</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>This requires decompressing the file when reading it back, |
| which can be done using <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedInputStream.html#pyarrow.CompressedInputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedInputStream</span></code></a> |
| as explained in the next recipe.</p> |
| </section> |
| <section id="reading-compressed-data"> |
| <h2><a class="toc-backref" href="#id17" role="doc-backlink">Reading Compressed Data</a><a class="headerlink" href="#reading-compressed-data" title="Link to this heading">¶</a></h2> |
| <p>Arrow provides support for reading compressed files, |
| both for formats that provide it natively like Parquet or Feather, |
| and for files in formats that don’t support compression natively, |
| like CSV, but have been compressed by an application.</p> |
| <p>Reading compressed formats that have native support for compression |
| doesn’t require any special handling. We can for example read back |
| the Parquet and Feather files we wrote in the previous recipe |
| by simply invoking <code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.feather.read_table()</span></code> and |
| <code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.parquet.read_table()</span></code>:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_feather</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">feather</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"compressed.feather"</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">table_feather</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| numbers: int64 |
| ---- |
| numbers: [[1,2,3,4,5]] |
| </pre></div> |
| </div> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_parquet</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"compressed.parquet"</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">table_parquet</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| numbers: int64 |
| ---- |
| numbers: [[1,2,3,4,5]] |
| </pre></div> |
| </div> |
<p>Reading data from formats that don’t have native support for
compression instead requires decompressing the file before decoding it.
This can be done using the <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.CompressedInputStream.html#pyarrow.CompressedInputStream" title="(in Apache Arrow v14.0.1)"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyarrow.CompressedInputStream</span></code></a> class,
which wraps files with a decompression step before the result is
provided to the actual read function.</p>
<p>For example, to read a compressed CSV file:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pa</span><span class="o">.</span><span class="n">CompressedInputStream</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">OSFile</span><span class="p">(</span><span class="s2">"compressed.csv.gz"</span><span class="p">),</span> <span class="s2">"gzip"</span><span class="p">)</span> <span class="k">as</span> <span class="nb">input</span><span class="p">:</span> |
| <span class="n">table_csv</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">table_csv</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| numbers: int64 |
| ---- |
| numbers: [[1,2,3,4,5]] |
| </pre></div> |
| </div> |
| <div class="admonition note"> |
| <p class="admonition-title">Note</p> |
| <p>In the case of CSV, Arrow can detect compressed files from the |
| file extension: if your file is named |
| <code class="docutils literal notranslate"><span class="pre">*.gz</span></code> or <code class="docutils literal notranslate"><span class="pre">*.bz2</span></code>, the <code class="xref py py-meth docutils literal notranslate"><span class="pre">pyarrow.csv.read_csv()</span></code> function will |
| decompress it automatically.</p> |
| </div> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">table_csv2</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">csv</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"compressed.csv.gz"</span><span class="p">)</span> |
| <span class="nb">print</span><span class="p">(</span><span class="n">table_csv2</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>pyarrow.Table |
| numbers: int64 |
| ---- |
| numbers: [[1,2,3,4,5]] |
| </pre></div> |
| </div> |
| </section> |
| </section> |
| |
| |
| </div> |
| |
| </div> |
| </div> |
| <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> |
| <div class="sphinxsidebarwrapper"> |
| <p class="logo"> |
| <a href="index.html"> |
| <img class="logo" src="_static/arrow-logo_vertical_black-txt_transparent-bg.svg" alt="Logo"/> |
| |
| </a> |
| </p> |
| |
| |
| |
| |
| |
| |
| <p> |
| <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=arrow-cookbook&type=none&count=true&size=large&v=2" |
| allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe> |
| </p> |
| |
| |
| |
| |
| |
| <h3>Navigation</h3> |
| <p class="caption" role="heading"><span class="caption-text">Contents:</span></p> |
| <ul class="current"> |
| <li class="toctree-l1 current"><a class="current reference internal" href="#">Reading and Writing Data</a><ul> |
| <li class="toctree-l2"><a class="reference internal" href="#write-a-parquet-file">Write a Parquet file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-a-parquet-file">Reading a Parquet file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-a-subset-of-parquet-data">Reading a subset of Parquet data</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#saving-arrow-arrays-to-disk">Saving Arrow Arrays to disk</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#memory-mapping-arrow-arrays-from-disk">Memory Mapping Arrow Arrays from disk</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#writing-csv-files">Writing CSV files</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#writing-csv-files-incrementally">Writing CSV files incrementally</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-csv-files">Reading CSV files</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#writing-partitioned-datasets">Writing Partitioned Datasets</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-partitioned-data">Reading Partitioned data</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-partitioned-data-from-s3">Reading Partitioned Data from S3</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#write-a-feather-file">Write a Feather file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-a-feather-file">Reading a Feather file</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-line-delimited-json">Reading Line Delimited JSON</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#writing-compressed-data">Writing Compressed Data</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#reading-compressed-data">Reading Compressed Data</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="create.html">Creating Arrow Objects</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="schema.html">Working with Schema</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="data.html">Data Manipulation</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="flight.html">Arrow Flight</a></li> |
| </ul> |
| |
| |
| <hr /> |
| <ul> |
| |
| <li class="toctree-l1"><a href="https://arrow.apache.org/docs/python/index.html">User Guide</a></li> |
| |
| <li class="toctree-l1"><a href="https://arrow.apache.org/docs/python/api.html">API Reference</a></li> |
| |
| </ul> |
| <div class="relations"> |
| <h3>Related Topics</h3> |
| <ul> |
| <li><a href="index.html">Documentation overview</a><ul> |
| <li>Previous: <a href="index.html" title="previous chapter">Apache Arrow Python Cookbook</a></li> |
| <li>Next: <a href="create.html" title="next chapter">Creating Arrow Objects</a></li> |
| </ul></li> |
| </ul> |
| </div> |
| <div id="searchbox" style="display: none" role="search"> |
| <h3 id="searchlabel">Quick search</h3> |
| <div class="searchformwrapper"> |
| <form class="search" action="search.html" method="get"> |
| <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> |
| <input type="submit" value="Go" /> |
| </form> |
| </div> |
| </div> |
| <script>document.getElementById('searchbox').style.display = "block"</script> |
| |
| |
| |
| |
| |
| |
| |
| |
| </div> |
| </div> |
| <div class="clearer"></div> |
| </div> |
| <div class="footer"> |
| ©2022, Apache Software Foundation. |
| |
| | |
| Powered by <a href="http://sphinx-doc.org/">Sphinx 7.2.6</a> |
| & <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.13</a> |
| |
| | |
| <a href="_sources/io.rst.txt" |
| rel="nofollow">Page source</a> |
| </div> |
| |
| |
| |
| |
| </body> |
| </html> |