| <!DOCTYPE html> |
| |
| <html lang="en" data-content_root="./"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" /> |
| |
| <title>Substrait — Apache Arrow Java Cookbook documentation</title> |
| <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" /> |
| <link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=49eeb2a1" /> |
| <script src="_static/documentation_options.js?v=5929fcd5"></script> |
| <script src="_static/doctools.js?v=888ff710"></script> |
| <script src="_static/sphinx_highlight.js?v=dc90522c"></script> |
| <link rel="icon" href="_static/favicon.ico"/> |
| <link rel="index" title="Index" href="genindex.html" /> |
| <link rel="search" title="Search" href="search.html" /> |
| <link rel="next" title="Data manipulation" href="data.html" /> |
| <link rel="prev" title="Dataset" href="dataset.html" /> |
| |
| |
| <link rel="stylesheet" href="_static/custom.css" type="text/css" /> |
| |
| |
| |
| |
| <!-- Matomo --> |
| <script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script> |
| <!-- End Matomo Code --> |
| |
| </head><body> |
| |
| |
| <div class="document"> |
| <div class="documentwrapper"> |
| <div class="bodywrapper"> |
| |
| |
| <div class="body" role="main"> |
| |
| <section id="substrait"> |
| <span id="arrow-substrait"></span><h1><a class="toc-backref" href="#id2" role="doc-backlink">Substrait</a><a class="headerlink" href="#substrait" title="Link to this heading">¶</a></h1> |
| <p>Arrow can use <a class="reference external" href="https://substrait.io/">Substrait</a> to integrate with other languages.</p> |
| <nav class="contents" id="contents"> |
| <p class="topic-title">Contents</p> |
| <ul class="simple"> |
| <li><p><a class="reference internal" href="#substrait" id="id2">Substrait</a></p> |
| <ul> |
| <li><p><a class="reference internal" href="#querying-datasets" id="id3">Querying Datasets</a></p></li> |
| <li><p><a class="reference internal" href="#filtering-and-projecting-datasets" id="id4">Filtering and Projecting Datasets</a></p> |
| <ul> |
| <li><p><a class="reference internal" href="#filtering-a-dataset" id="id5">Filtering a Dataset</a></p></li> |
| <li><p><a class="reference internal" href="#projecting-a-dataset" id="id6">Projecting a Dataset</a></p></li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| </nav> |
| <p>The Substrait support in Arrow combines <a class="reference internal" href="dataset.html"><span class="doc">Dataset</span></a> and |
| <a class="reference external" href="https://github.com/substrait-io/substrait-java">substrait-java</a> to query datasets using <a class="reference external" href="https://arrow.apache.org/docs/cpp/streaming_execution.html">Acero</a> as a backend.</p> |
| <p>Acero currently supports:</p> |
| <ul class="simple"> |
| <li><p>Reading Arrow, CSV, ORC, and Parquet files</p></li> |
| <li><p>Filters</p></li> |
| <li><p>Projections</p></li> |
| <li><p>Joins</p></li> |
| <li><p>Aggregates</p></li> |
| </ul> |
| <section id="querying-datasets"> |
| <h2><a class="toc-backref" href="#id3" role="doc-backlink">Querying Datasets</a><a class="headerlink" href="#querying-datasets" title="Link to this heading">¶</a></h2> |
| <p>Here is an example of a Java program that queries a Parquet file:</p> |
| <div class="highlight-java notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.isthmus.SqlToSubstrait</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.proto.Plan</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileFormat</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileSystemDatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.jni.NativeMemoryPool</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.ScanOptions</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.Scanner</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.Dataset</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.DatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.substrait.AceroSubstraitConsumer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.BufferAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.RootAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.vector.ipc.ArrowReader</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.calcite.sql.parser.SqlParseException</span><span class="p">;</span> |
| |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.nio.ByteBuffer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.HashMap</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.Map</span><span class="p">;</span> |
| |
| <span class="n">Plan</span><span class="w"> </span><span class="nf">queryTableNation</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"SELECT * FROM NATION WHERE N_NATIONKEY = 17"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">nation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">SqlToSubstrait</span><span class="w"> </span><span class="n">sqlToSubstrait</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SqlToSubstrait</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Plan</span><span class="w"> </span><span class="n">plan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sqlToSubstrait</span><span class="p">.</span><span class="na">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span><span class="w"> </span><span class="n">Collections</span><span class="p">.</span><span class="na">singletonList</span><span class="p">(</span><span class="n">nation</span><span class="p">));</span> |
| <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">plan</span><span class="p">;</span> |
| <span class="p">}</span> |
| |
| <span class="kt">void</span><span class="w"> </span><span class="nf">queryDatasetThruSubstraitPlanDefinition</span><span class="p">()</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">uri</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"file:"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">getProperty</span><span class="p">(</span><span class="s">"user.dir"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/thirdpartydeps/tpch/nation.parquet"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">ScanOptions</span><span class="w"> </span><span class="n">options</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ScanOptions</span><span class="p">(</span><span class="cm">/*batchSize*/</span><span class="w"> </span><span class="mi">32768</span><span class="p">);</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span> |
| <span class="w"> </span><span class="n">BufferAllocator</span><span class="w"> </span><span class="n">allocator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RootAllocator</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">DatasetFactory</span><span class="w"> </span><span class="n">datasetFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">FileSystemDatasetFactory</span><span class="p">(</span><span class="n">allocator</span><span class="p">,</span><span class="w"> </span><span class="n">NativeMemoryPool</span><span class="p">.</span><span class="na">getDefault</span><span class="p">(),</span> |
| <span class="w"> </span><span class="n">FileFormat</span><span class="p">.</span><span class="na">PARQUET</span><span class="p">,</span><span class="w"> </span><span class="n">uri</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">Dataset</span><span class="w"> </span><span class="n">dataset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetFactory</span><span class="p">.</span><span class="na">finish</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Scanner</span><span class="w"> </span><span class="n">scanner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataset</span><span class="p">.</span><span class="na">newScan</span><span class="p">(</span><span class="n">options</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">reader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scanner</span><span class="p">.</span><span class="na">scanBatches</span><span class="p">()</span> |
| <span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">Map</span><span class="o"><</span><span class="n">String</span><span class="p">,</span><span class="w"> </span><span class="n">ArrowReader</span><span class="o">></span><span class="w"> </span><span class="n">mapTableToArrowReader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">HashMap</span><span class="o"><></span><span class="p">();</span> |
| <span class="w"> </span><span class="n">mapTableToArrowReader</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="s">"NATION"</span><span class="p">,</span><span class="w"> </span><span class="n">reader</span><span class="p">);</span> |
| <span class="w"> </span><span class="c1">// get binary plan</span> |
| <span class="w"> </span><span class="n">Plan</span><span class="w"> </span><span class="n">plan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">queryTableNation</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ByteBuffer</span><span class="w"> </span><span class="n">substraitPlan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ByteBuffer</span><span class="p">.</span><span class="na">allocateDirect</span><span class="p">(</span><span class="n">plan</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">().</span><span class="na">length</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">substraitPlan</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="n">plan</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">());</span> |
| <span class="w"> </span><span class="c1">// run query</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">arrowReader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">AceroSubstraitConsumer</span><span class="p">(</span><span class="n">allocator</span><span class="p">).</span><span class="na">runQuery</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">substraitPlan</span><span class="p">,</span> |
| <span class="w"> </span><span class="n">mapTableToArrowReader</span> |
| <span class="w"> </span><span class="p">))</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">arrowReader</span><span class="p">.</span><span class="na">loadNextBatch</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">print</span><span class="p">(</span><span class="n">arrowReader</span><span class="p">.</span><span class="na">getVectorSchemaRoot</span><span class="p">().</span><span class="na">contentToTSVString</span><span class="p">());</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">Exception</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="p">}</span> |
| |
| <span class="n">queryDatasetThruSubstraitPlanDefinition</span><span class="p">();</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>N_NATIONKEY N_NAME N_REGIONKEY N_COMMENT |
| 17 PERU 1 platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun |
| </pre></div> |
| </div> |
| <p>It is also possible to query multiple datasets and join them based on some criteria. |
| For example, we can join the nation and customer tables from the TPC-H benchmark:</p> |
| <div class="highlight-java notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.isthmus.SqlToSubstrait</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.proto.Plan</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileFormat</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileSystemDatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.jni.NativeMemoryPool</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.ScanOptions</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.Scanner</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.Dataset</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.DatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.substrait.AceroSubstraitConsumer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.BufferAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.RootAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.vector.ipc.ArrowReader</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.calcite.sql.parser.SqlParseException</span><span class="p">;</span> |
| |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.nio.ByteBuffer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.Arrays</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.HashMap</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.Map</span><span class="p">;</span> |
| |
| <span class="n">Plan</span><span class="w"> </span><span class="nf">queryTableNationJoinCustomer</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"SELECT n.n_name, COUNT(*) AS NUMBER_CUSTOMER FROM NATION n JOIN CUSTOMER c "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"ON n.n_nationkey = c.c_nationkey WHERE n.n_nationkey = 17 "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"GROUP BY n.n_name"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">nation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"N_NAME CHAR(25), N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">customer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"CREATE TABLE CUSTOMER (C_CUSTKEY BIGINT NOT NULL, "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"C_NAME VARCHAR(25), C_ADDRESS VARCHAR(40), C_NATIONKEY BIGINT NOT NULL, "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"C_PHONE CHAR(15), C_ACCTBAL DECIMAL, C_MKTSEGMENT CHAR(10), "</span><span class="w"> </span><span class="o">+</span> |
| <span class="w"> </span><span class="s">"C_COMMENT VARCHAR(117) )"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">SqlToSubstrait</span><span class="w"> </span><span class="n">sqlToSubstrait</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SqlToSubstrait</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Plan</span><span class="w"> </span><span class="n">plan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sqlToSubstrait</span><span class="p">.</span><span class="na">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span> |
| <span class="w"> </span><span class="n">Arrays</span><span class="p">.</span><span class="na">asList</span><span class="p">(</span><span class="n">nation</span><span class="p">,</span><span class="w"> </span><span class="n">customer</span><span class="p">));</span> |
| <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">plan</span><span class="p">;</span> |
| <span class="p">}</span> |
| |
| <span class="kt">void</span><span class="w"> </span><span class="nf">queryTwoDatasetsThruSubstraitPlanDefinition</span><span class="p">()</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">uriNation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"file:"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">getProperty</span><span class="p">(</span><span class="s">"user.dir"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/thirdpartydeps/tpch/nation.parquet"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">uriCustomer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"file:"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">getProperty</span><span class="p">(</span><span class="s">"user.dir"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/thirdpartydeps/tpch/customer.parquet"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">ScanOptions</span><span class="w"> </span><span class="n">options</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ScanOptions</span><span class="p">(</span><span class="cm">/*batchSize*/</span><span class="w"> </span><span class="mi">32768</span><span class="p">);</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span> |
| <span class="w"> </span><span class="n">BufferAllocator</span><span class="w"> </span><span class="n">allocator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RootAllocator</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">DatasetFactory</span><span class="w"> </span><span class="n">datasetFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">FileSystemDatasetFactory</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">allocator</span><span class="p">,</span><span class="w"> </span><span class="n">NativeMemoryPool</span><span class="p">.</span><span class="na">getDefault</span><span class="p">(),</span> |
| <span class="w"> </span><span class="n">FileFormat</span><span class="p">.</span><span class="na">PARQUET</span><span class="p">,</span><span class="w"> </span><span class="n">uriNation</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">Dataset</span><span class="w"> </span><span class="n">dataset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetFactory</span><span class="p">.</span><span class="na">finish</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Scanner</span><span class="w"> </span><span class="n">scanner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataset</span><span class="p">.</span><span class="na">newScan</span><span class="p">(</span><span class="n">options</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">readerNation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scanner</span><span class="p">.</span><span class="na">scanBatches</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">DatasetFactory</span><span class="w"> </span><span class="n">datasetFactoryCustomer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">FileSystemDatasetFactory</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">allocator</span><span class="p">,</span><span class="w"> </span><span class="n">NativeMemoryPool</span><span class="p">.</span><span class="na">getDefault</span><span class="p">(),</span> |
| <span class="w"> </span><span class="n">FileFormat</span><span class="p">.</span><span class="na">PARQUET</span><span class="p">,</span><span class="w"> </span><span class="n">uriCustomer</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">Dataset</span><span class="w"> </span><span class="n">datasetCustomer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetFactoryCustomer</span><span class="p">.</span><span class="na">finish</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Scanner</span><span class="w"> </span><span class="n">scannerCustomer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetCustomer</span><span class="p">.</span><span class="na">newScan</span><span class="p">(</span><span class="n">options</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">readerCustomer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scannerCustomer</span><span class="p">.</span><span class="na">scanBatches</span><span class="p">()</span> |
| <span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="c1">// map table to reader</span> |
| <span class="w"> </span><span class="n">Map</span><span class="o"><</span><span class="n">String</span><span class="p">,</span><span class="w"> </span><span class="n">ArrowReader</span><span class="o">></span><span class="w"> </span><span class="n">mapTableToArrowReader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">HashMap</span><span class="o"><></span><span class="p">();</span> |
| <span class="w"> </span><span class="n">mapTableToArrowReader</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="s">"NATION"</span><span class="p">,</span><span class="w"> </span><span class="n">readerNation</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">mapTableToArrowReader</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="s">"CUSTOMER"</span><span class="p">,</span><span class="w"> </span><span class="n">readerCustomer</span><span class="p">);</span> |
| <span class="w"> </span><span class="c1">// get binary plan</span> |
| <span class="w"> </span><span class="n">Plan</span><span class="w"> </span><span class="n">plan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">queryTableNationJoinCustomer</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ByteBuffer</span><span class="w"> </span><span class="n">substraitPlan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ByteBuffer</span><span class="p">.</span><span class="na">allocateDirect</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">plan</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">().</span><span class="na">length</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">substraitPlan</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="n">plan</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">());</span> |
| <span class="w"> </span><span class="c1">// run query</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">arrowReader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">AceroSubstraitConsumer</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">allocator</span><span class="p">).</span><span class="na">runQuery</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">substraitPlan</span><span class="p">,</span> |
| <span class="w"> </span><span class="n">mapTableToArrowReader</span> |
| <span class="w"> </span><span class="p">))</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">arrowReader</span><span class="p">.</span><span class="na">loadNextBatch</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">print</span><span class="p">(</span><span class="n">arrowReader</span><span class="p">.</span><span class="na">getVectorSchemaRoot</span><span class="p">().</span><span class="na">contentToTSVString</span><span class="p">());</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">Exception</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="p">}</span> |
| |
| <span class="n">queryTwoDatasetsThruSubstraitPlanDefinition</span><span class="p">();</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>N_NAME NUMBER_CUSTOMER |
| PERU 573 |
| </pre></div> |
| </div> |
| </section> |
| <section id="filtering-and-projecting-datasets"> |
| <h2><a class="toc-backref" href="#id4" role="doc-backlink">Filtering and Projecting Datasets</a><a class="headerlink" href="#filtering-and-projecting-datasets" title="Link to this heading">¶</a></h2> |
| <p>Arrow Dataset supports filters and projections with Substrait’s |
| <a class="reference external" href="https://github.com/substrait-io/substrait/blob/main/site/docs/expressions/extended_expression.md">Extended Expression</a>. The substrait-java library is required to construct |
| these expressions.</p> |
| <section id="filtering-a-dataset"> |
| <h3><a class="toc-backref" href="#id5" role="doc-backlink">Filtering a Dataset</a><a class="headerlink" href="#filtering-a-dataset" title="Link to this heading">¶</a></h3> |
| <p>Here is an example of a Java program that filters a Parquet file:</p> |
| <ul class="simple"> |
| <li><p>Loads a Parquet file containing the “nation” table from the TPC-H benchmark.</p></li> |
| <li><dl class="simple"> |
| <dt>Applies a filter:</dt><dd><ul> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre">></span> <span class="pre">10,</span> <span class="pre">AND</span></code></p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre"><</span> <span class="pre">15`</span></code></p></li> |
| </ul> |
| </dd> |
| </dl> |
| </li> |
| </ul> |
| <div class="highlight-java notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.isthmus.SqlExpressionToSubstrait</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.proto.ExtendedExpression</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.nio.ByteBuffer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.Optional</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileFormat</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileSystemDatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.jni.NativeMemoryPool</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.ScanOptions</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.Scanner</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.Dataset</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.DatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.BufferAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.RootAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.vector.ipc.ArrowReader</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.calcite.sql.parser.SqlParseException</span><span class="p">;</span> |
| |
| <span class="n">ByteBuffer</span><span class="w"> </span><span class="nf">getFilterExpression</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">sqlExpression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"N_NATIONKEY > 10 AND N_NATIONKEY < 15"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">nation</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="s">"CREATE TABLE NATION (N_NATIONKEY INT NOT NULL, N_NAME CHAR(25), "</span> |
| <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"N_REGIONKEY INT NOT NULL, N_COMMENT VARCHAR)"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="w"> </span><span class="n">expressionToSubstrait</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ExtendedExpression</span><span class="w"> </span><span class="n">expression</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="n">expressionToSubstrait</span><span class="p">.</span><span class="na">convert</span><span class="p">(</span><span class="n">sqlExpression</span><span class="p">,</span><span class="w"> </span><span class="n">Collections</span><span class="p">.</span><span class="na">singletonList</span><span class="p">(</span><span class="n">nation</span><span class="p">));</span> |
| <span class="w"> </span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="n">expressionToByte</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">expression</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ByteBuffer</span><span class="w"> </span><span class="n">byteBuffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ByteBuffer</span><span class="p">.</span><span class="na">allocateDirect</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">.</span><span class="na">length</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">byteBuffer</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">);</span> |
| <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">byteBuffer</span><span class="p">;</span> |
| <span class="p">}</span> |
| |
| <span class="kt">void</span><span class="w"> </span><span class="nf">filterDataset</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">uri</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"file:"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">getProperty</span><span class="p">(</span><span class="s">"user.dir"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/thirdpartydeps/tpch/nation.parquet"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">ScanOptions</span><span class="w"> </span><span class="n">options</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ScanOptions</span><span class="p">.</span><span class="na">Builder</span><span class="p">(</span><span class="cm">/*batchSize*/</span><span class="w"> </span><span class="mi">32768</span><span class="p">)</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">columns</span><span class="p">(</span><span class="n">Optional</span><span class="p">.</span><span class="na">empty</span><span class="p">())</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">substraitFilter</span><span class="p">(</span><span class="n">getFilterExpression</span><span class="p">())</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">build</span><span class="p">();</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span><span class="n">BufferAllocator</span><span class="w"> </span><span class="n">allocator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RootAllocator</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">DatasetFactory</span><span class="w"> </span><span class="n">datasetFactory</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">FileSystemDatasetFactory</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">allocator</span><span class="p">,</span><span class="w"> </span><span class="n">NativeMemoryPool</span><span class="p">.</span><span class="na">getDefault</span><span class="p">(),</span><span class="w"> </span><span class="n">FileFormat</span><span class="p">.</span><span class="na">PARQUET</span><span class="p">,</span><span class="w"> </span><span class="n">uri</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">Dataset</span><span class="w"> </span><span class="n">dataset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetFactory</span><span class="p">.</span><span class="na">finish</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Scanner</span><span class="w"> </span><span class="n">scanner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataset</span><span class="p">.</span><span class="na">newScan</span><span class="p">(</span><span class="n">options</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">reader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scanner</span><span class="p">.</span><span class="na">scanBatches</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">reader</span><span class="p">.</span><span class="na">loadNextBatch</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">print</span><span class="p">(</span><span class="n">reader</span><span class="p">.</span><span class="na">getVectorSchemaRoot</span><span class="p">().</span><span class="na">contentToTSVString</span><span class="p">());</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">Exception</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="p">}</span> |
| |
| <span class="n">filterDataset</span><span class="p">();</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>n_nationkey n_name n_regionkey n_comment |
| 11 IRAQ 4 nic deposits boost atop the quickly final requests? quickly regula |
| 12 JAPAN 2 ously. final, express gifts cajole a |
| 13 JORDAN 4 ic deposits are blithely about the carefully regular pa |
| 14 KENYA 0 pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t |
| </pre></div> |
| </div> |
| </section> |
| <section id="projecting-a-dataset"> |
| <h3><a class="toc-backref" href="#id6" role="doc-backlink">Projecting a Dataset</a><a class="headerlink" href="#projecting-a-dataset" title="Link to this heading">¶</a></h3> |
| <p>The following Java program projects new columns after applying a filter to |
| a Parquet file:</p> |
| <ul class="simple"> |
| <li><p>Loads a Parquet file containing the “nation” table from the TPC-H benchmark.</p></li> |
| <li><p>Applies a filter:</p></li> |
| </ul> |
| <blockquote> |
| <div><ul class="simple"> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre">></span> <span class="pre">10,</span> <span class="pre">AND</span></code></p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre"><</span> <span class="pre">15</span></code></p></li> |
| </ul> |
| </div></blockquote> |
| <ul class="simple"> |
| <li><p>Projects three new columns:</p></li> |
| </ul> |
| <blockquote> |
| <div><ul class="simple"> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NAME</span></code></p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre">></span> <span class="pre">12</span></code></p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">N_NATIONKEY</span> <span class="pre">+</span> <span class="pre">31</span></code></p></li> |
| </ul> |
| </div></blockquote> |
| <div class="highlight-java notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.isthmus.SqlExpressionToSubstrait</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">io.substrait.proto.ExtendedExpression</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.nio.ByteBuffer</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">java.util.Optional</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileFormat</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.file.FileSystemDatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.jni.NativeMemoryPool</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.ScanOptions</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.scanner.Scanner</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.Dataset</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.dataset.source.DatasetFactory</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.BufferAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.memory.RootAllocator</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.arrow.vector.ipc.ArrowReader</span><span class="p">;</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.calcite.sql.parser.SqlParseException</span><span class="p">;</span> |
| |
| <span class="n">ByteBuffer</span><span class="w"> </span><span class="nf">getProjectExpression</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="o">[]</span><span class="w"> </span><span class="n">sqlExpression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">String</span><span class="o">[]</span><span class="p">{</span><span class="s">"N_NAME"</span><span class="p">,</span><span class="w"> </span><span class="s">"N_NATIONKEY > 12"</span><span class="p">,</span><span class="w"> </span><span class="s">"N_NATIONKEY + 31"</span><span class="p">};</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">nation</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="s">"CREATE TABLE NATION (N_NATIONKEY INT NOT NULL, N_NAME CHAR(25), "</span> |
| <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"N_REGIONKEY INT NOT NULL, N_COMMENT VARCHAR)"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="w"> </span><span class="n">expressionToSubstrait</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ExtendedExpression</span><span class="w"> </span><span class="n">expression</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="n">expressionToSubstrait</span><span class="p">.</span><span class="na">convert</span><span class="p">(</span><span class="n">sqlExpression</span><span class="p">,</span><span class="w"> </span><span class="n">Collections</span><span class="p">.</span><span class="na">singletonList</span><span class="p">(</span><span class="n">nation</span><span class="p">));</span> |
| <span class="w"> </span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="n">expressionToByte</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">expression</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ByteBuffer</span><span class="w"> </span><span class="n">byteBuffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ByteBuffer</span><span class="p">.</span><span class="na">allocateDirect</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">.</span><span class="na">length</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">byteBuffer</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">);</span> |
| <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">byteBuffer</span><span class="p">;</span> |
| <span class="p">}</span> |
| |
| <span class="n">ByteBuffer</span><span class="w"> </span><span class="nf">getFilterExpression</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">sqlExpression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"N_NATIONKEY > 10 AND N_NATIONKEY < 15"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">nation</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="s">"CREATE TABLE NATION (N_NATIONKEY INT NOT NULL, N_NAME CHAR(25), "</span> |
| <span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"N_REGIONKEY INT NOT NULL, N_COMMENT VARCHAR)"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="w"> </span><span class="n">expressionToSubstrait</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SqlExpressionToSubstrait</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ExtendedExpression</span><span class="w"> </span><span class="n">expression</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="n">expressionToSubstrait</span><span class="p">.</span><span class="na">convert</span><span class="p">(</span><span class="n">sqlExpression</span><span class="p">,</span><span class="w"> </span><span class="n">Collections</span><span class="p">.</span><span class="na">singletonList</span><span class="p">(</span><span class="n">nation</span><span class="p">));</span> |
| <span class="w"> </span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="n">expressionToByte</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">expression</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">ByteBuffer</span><span class="w"> </span><span class="n">byteBuffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ByteBuffer</span><span class="p">.</span><span class="na">allocateDirect</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">.</span><span class="na">length</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">byteBuffer</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="n">expressionToByte</span><span class="p">);</span> |
| <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">byteBuffer</span><span class="p">;</span> |
| <span class="p">}</span> |
| |
| <span class="kt">void</span><span class="w"> </span><span class="nf">filterAndProjectDataset</span><span class="p">()</span><span class="w"> </span><span class="kd">throws</span><span class="w"> </span><span class="n">SqlParseException</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">uri</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"file:"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">getProperty</span><span class="p">(</span><span class="s">"user.dir"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/thirdpartydeps/tpch/nation.parquet"</span><span class="p">;</span> |
| <span class="w"> </span><span class="n">ScanOptions</span><span class="w"> </span><span class="n">options</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ScanOptions</span><span class="p">.</span><span class="na">Builder</span><span class="p">(</span><span class="cm">/*batchSize*/</span><span class="w"> </span><span class="mi">32768</span><span class="p">)</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">columns</span><span class="p">(</span><span class="n">Optional</span><span class="p">.</span><span class="na">empty</span><span class="p">())</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">substraitFilter</span><span class="p">(</span><span class="n">getFilterExpression</span><span class="p">())</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">substraitProjection</span><span class="p">(</span><span class="n">getProjectExpression</span><span class="p">())</span> |
| <span class="w"> </span><span class="p">.</span><span class="na">build</span><span class="p">();</span> |
| <span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">(</span><span class="n">BufferAllocator</span><span class="w"> </span><span class="n">allocator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RootAllocator</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">DatasetFactory</span><span class="w"> </span><span class="n">datasetFactory</span><span class="w"> </span><span class="o">=</span> |
| <span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">FileSystemDatasetFactory</span><span class="p">(</span> |
| <span class="w"> </span><span class="n">allocator</span><span class="p">,</span><span class="w"> </span><span class="n">NativeMemoryPool</span><span class="p">.</span><span class="na">getDefault</span><span class="p">(),</span><span class="w"> </span><span class="n">FileFormat</span><span class="p">.</span><span class="na">PARQUET</span><span class="p">,</span><span class="w"> </span><span class="n">uri</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">Dataset</span><span class="w"> </span><span class="n">dataset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasetFactory</span><span class="p">.</span><span class="na">finish</span><span class="p">();</span> |
| <span class="w"> </span><span class="n">Scanner</span><span class="w"> </span><span class="n">scanner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataset</span><span class="p">.</span><span class="na">newScan</span><span class="p">(</span><span class="n">options</span><span class="p">);</span> |
| <span class="w"> </span><span class="n">ArrowReader</span><span class="w"> </span><span class="n">reader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scanner</span><span class="p">.</span><span class="na">scanBatches</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">reader</span><span class="p">.</span><span class="na">loadNextBatch</span><span class="p">())</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">print</span><span class="p">(</span><span class="n">reader</span><span class="p">.</span><span class="na">getVectorSchemaRoot</span><span class="p">().</span><span class="na">contentToTSVString</span><span class="p">());</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">Exception</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span> |
| <span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> |
| <span class="w"> </span><span class="p">}</span> |
| <span class="p">}</span> |
| |
| <span class="n">filterAndProjectDataset</span><span class="p">();</span> |
| </pre></div> |
| </div> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>column-1 column-2 column-3 |
| IRAQ false 42 |
| JAPAN false 43 |
| JORDAN true 44 |
| KENYA true 45 |
| </pre></div> |
| </div> |
| </section> |
| </section> |
| </section> |
| |
| |
| </div> |
| |
| </div> |
| </div> |
| <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> |
| <div class="sphinxsidebarwrapper"> |
| <p class="logo"> |
| <a href="index.html"> |
| <img class="logo" src="_static/arrow-logo_vertical_black-txt_transparent-bg.svg" alt="Logo" /> |
| |
| </a> |
| </p> |
| |
| |
| |
| |
| |
| |
| <p> |
| <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=arrow-cookbook&type=none&count=true&size=large&v=2" |
| allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe> |
| </p> |
| |
| |
| |
| |
| |
| <h3>Navigation</h3> |
| <p class="caption" role="heading"><span class="caption-text">Contents:</span></p> |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="create.html">Creating Arrow Objects</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="schema.html">Working with Schema</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="io.html">Reading and writing data</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="flight.html">Arrow Flight</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="dataset.html">Dataset</a></li> |
| <li class="toctree-l1 current"><a class="current reference internal" href="#">Substrait</a><ul> |
| <li class="toctree-l2"><a class="reference internal" href="#querying-datasets">Querying Datasets</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="#filtering-and-projecting-datasets">Filtering and Projecting Datasets</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="data.html">Data manipulation</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="avro.html">Avro</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="jdbc.html">Arrow JDBC Adapter</a></li> |
| </ul> |
| |
| |
| <hr /> |
| <ul> |
| |
| <li class="toctree-l1"><a href="https://arrow.apache.org/docs/java/index.html">User Guide</a></li> |
| |
| <li class="toctree-l1"><a href="https://arrow.apache.org/docs/java/reference/index.html">API Reference</a></li> |
| |
| </ul> |
| <div class="relations"> |
| <h3>Related Topics</h3> |
| <ul> |
| <li><a href="index.html">Documentation overview</a><ul> |
| <li>Previous: <a href="dataset.html" title="previous chapter">Dataset</a></li> |
| <li>Next: <a href="data.html" title="next chapter">Data manipulation</a></li> |
| </ul></li> |
| </ul> |
| </div> |
| <div id="searchbox" style="display: none" role="search"> |
| <h3 id="searchlabel">Quick search</h3> |
| <div class="searchformwrapper"> |
| <form class="search" action="search.html" method="get"> |
| <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> |
| <input type="submit" value="Go" /> |
| </form> |
| </div> |
| </div> |
| <script>document.getElementById('searchbox').style.display = "block"</script> |
| |
| |
| |
| |
| |
| |
| |
| |
| </div> |
| </div> |
| <div class="clearer"></div> |
| </div> |
| <div class="footer"> |
| ©2022, Apache Software Foundation. |
| |
| | |
| Powered by <a href="https://www.sphinx-doc.org/">Sphinx 7.2.6</a> |
| & <a href="https://alabaster.readthedocs.io">Alabaster 0.7.16</a> |
| |
| | |
| <a href="_sources/substrait.rst.txt" |
| rel="nofollow">Page source</a> |
| </div> |
| |
| |
| |
| |
| </body> |
| </html> |