| <!DOCTYPE html> |
| |
| <html lang="en" data-content_root="../"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
| |
| <title>Data Sources — Apache Arrow DataFusion documentation</title> |
| |
| <link href="../_static/styles/theme.css?digest=1999514e3f237ded88cf" rel="stylesheet"> |
| <link href="../_static/styles/pydata-sphinx-theme.css?digest=1999514e3f237ded88cf" rel="stylesheet"> |
| |
| |
| <link rel="stylesheet" |
| href="../_static/vendor/fontawesome/5.13.0/css/all.min.css"> |
| <link rel="preload" as="font" type="font/woff2" crossorigin |
| href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2"> |
| <link rel="preload" as="font" type="font/woff2" crossorigin |
| href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2"> |
| |
| |
| |
| |
| |
| <link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=8f2a1f02" /> |
| <link rel="stylesheet" type="text/css" href="../_static/styles/pydata-sphinx-theme.css?v=1140d252" /> |
| <link rel="stylesheet" type="text/css" href="../_static/graphviz.css?v=4ae1632d" /> |
| <link rel="stylesheet" type="text/css" href="../_static/theme_overrides.css?v=dca7052a" /> |
| |
| <link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf"> |
| |
| <script src="../_static/documentation_options.js?v=8a448e45"></script> |
| <script src="../_static/doctools.js?v=9bcbadda"></script> |
| <script src="../_static/sphinx_highlight.js?v=dc90522c"></script> |
| <link rel="index" title="Index" href="../genindex.html" /> |
| <link rel="search" title="Search" href="../search.html" /> |
| <link rel="next" title="DataFrames" href="dataframe/index.html" /> |
| <link rel="prev" title="Concepts" href="basics.html" /> |
| <meta name="docsearch:language" content="en"> |
| |
| |
| |
| </head> |
| <body data-spy="scroll" data-target="#bd-toc-nav" data-offset="80"> |
| |
| <div class="container-fluid" id="banner"></div> |
| |
| |
| |
| |
| <div class="container-xl"> |
| <div class="row"> |
| |
| |
| <!-- Only show if we have sidebars configured, else just a small margin --> |
| <div class="col-12 col-md-3 bd-sidebar"> |
| <div class="sidebar-start-items"> |
| <a class="navbar-brand" href="../index.html"> |
| <img src="../_static/images/2x_bgwhite_original.png" class="logo" alt="logo"> |
| </a> |
| |
| <form class="bd-search d-flex align-items-center" action="../search.html" method="get"> |
| <i class="icon fas fa-search"></i> |
| <input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" > |
| </form> |
| |
| <nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation"> |
| <div class="bd-toc-item active"> |
| |
| <p aria-level="2" class="caption" role="heading"> |
| <span class="caption-text"> |
| LINKS |
| </span> |
| </p> |
| <ul class="nav bd-sidenav"> |
| <li class="toctree-l1"> |
| <a class="reference external" href="https://github.com/apache/datafusion-python"> |
| GitHub and Issue Tracker |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference external" href="https://docs.rs/datafusion/latest/datafusion/"> |
| Rust's API Docs |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference external" href="https://github.com/apache/datafusion/blob/main/CODE_OF_CONDUCT.md"> |
| Code of conduct |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference external" href="https://github.com/apache/datafusion-python/tree/main/examples"> |
| Examples |
| </a> |
| </li> |
| </ul> |
| <p aria-level="2" class="caption" role="heading"> |
| <span class="caption-text"> |
| USER GUIDE |
| </span> |
| </p> |
| <ul class="current nav bd-sidenav"> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="introduction.html"> |
| Introduction |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="basics.html"> |
| Concepts |
| </a> |
| </li> |
| <li class="toctree-l1 current active"> |
| <a class="current reference internal" href="#"> |
| Data Sources |
| </a> |
| </li> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="dataframe/index.html"> |
| DataFrames |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/> |
| <label for="toctree-checkbox-1"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="dataframe/rendering.html"> |
| HTML Rendering in Jupyter |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="common-operations/index.html"> |
| Common Operations |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-2" name="toctree-checkbox-2" type="checkbox"/> |
| <label for="toctree-checkbox-2"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/views.html"> |
| Registering Views |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/basic-info.html"> |
| Basic Operations |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/select-and-filter.html"> |
| Column Selections |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/expressions.html"> |
| Expressions |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/joins.html"> |
| Joins |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/functions.html"> |
| Functions |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/aggregations.html"> |
| Aggregation |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/windows.html"> |
| Window Functions |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="common-operations/udf-and-udfa.html"> |
| User-Defined Functions |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="io/index.html"> |
| IO |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-3" name="toctree-checkbox-3" type="checkbox"/> |
| <label for="toctree-checkbox-3"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/arrow.html"> |
| Arrow |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/avro.html"> |
| Avro |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/csv.html"> |
| CSV |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/json.html"> |
| JSON |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/parquet.html"> |
| Parquet |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="io/table_provider.html"> |
| Custom Table Provider |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="configuration.html"> |
| Configuration |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="sql.html"> |
| SQL |
| </a> |
| </li> |
| </ul> |
| <p aria-level="2" class="caption" role="heading"> |
| <span class="caption-text"> |
| CONTRIBUTOR GUIDE |
| </span> |
| </p> |
| <ul class="nav bd-sidenav"> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="../contributor-guide/introduction.html"> |
| Introduction |
| </a> |
| </li> |
| <li class="toctree-l1"> |
| <a class="reference internal" href="../contributor-guide/ffi.html"> |
| Python Extensions |
| </a> |
| </li> |
| </ul> |
| <p aria-level="2" class="caption" role="heading"> |
| <span class="caption-text"> |
| API |
| </span> |
| </p> |
| <ul class="nav bd-sidenav"> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="../autoapi/index.html"> |
| API Reference |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/> |
| <label for="toctree-checkbox-4"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2 has-children"> |
| <a class="reference internal" href="../autoapi/datafusion/index.html"> |
| datafusion |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-5" name="toctree-checkbox-5" type="checkbox"/> |
| <label for="toctree-checkbox-5"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/catalog/index.html"> |
| datafusion.catalog |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html"> |
| datafusion.context |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/dataframe/index.html"> |
| datafusion.dataframe |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/dataframe_formatter/index.html"> |
| datafusion.dataframe_formatter |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/expr/index.html"> |
| datafusion.expr |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/functions/index.html"> |
| datafusion.functions |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/html_formatter/index.html"> |
| datafusion.html_formatter |
| </a> |
| </li> |
| <li class="toctree-l3 has-children"> |
| <a class="reference internal" href="../autoapi/datafusion/input/index.html"> |
| datafusion.input |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-6" name="toctree-checkbox-6" type="checkbox"/> |
| <label for="toctree-checkbox-6"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l4"> |
| <a class="reference internal" href="../autoapi/datafusion/input/base/index.html"> |
| datafusion.input.base |
| </a> |
| </li> |
| <li class="toctree-l4"> |
| <a class="reference internal" href="../autoapi/datafusion/input/location/index.html"> |
| datafusion.input.location |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/io/index.html"> |
| datafusion.io |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/object_store/index.html"> |
| datafusion.object_store |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/plan/index.html"> |
| datafusion.plan |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/record_batch/index.html"> |
| datafusion.record_batch |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/substrait/index.html"> |
| datafusion.substrait |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/unparser/index.html"> |
| datafusion.unparser |
| </a> |
| </li> |
| <li class="toctree-l3"> |
| <a class="reference internal" href="../autoapi/datafusion/user_defined/index.html"> |
| datafusion.user_defined |
| </a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| |
| </div> |
| </nav> |
| </div> |
| <div class="sidebar-end-items"> |
| </div> |
| </div> |
| |
| |
| |
| |
| <div class="d-none d-xl-block col-xl-2 bd-toc"> |
| |
| |
| <div class="toc-item"> |
| |
| <div class="tocsection onthispage pt-5 pb-3"> |
| <i class="fas fa-list"></i> On this page |
| </div> |
| |
| <nav id="bd-toc-nav"> |
| <ul class="visible nav section-nav flex-column"> |
| <li class="toc-h1 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#"> |
| Data Sources |
| </a> |
| <ul class="visible nav section-nav flex-column"> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#local-file"> |
| Local file |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#create-in-memory"> |
| Create in-memory |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#object-store"> |
| Object Store |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#other-dataframe-libraries"> |
| Other DataFrame Libraries |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#delta-lake"> |
| Delta Lake |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#apache-iceberg"> |
| Apache Iceberg |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#custom-table-provider"> |
| Custom Table Provider |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toc-h1 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#catalog"> |
| Catalog |
| </a> |
| <ul class="visible nav section-nav flex-column"> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#user-defined-catalog-and-schema"> |
| User Defined Catalog and Schema |
| </a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| </nav> |
| </div> |
| |
| <div class="toc-item"> |
| |
| </div> |
| |
| |
| </div> |
| |
| |
| |
| |
| |
| |
| <main class="col-12 col-md-9 col-xl-7 py-md-5 pl-md-5 pr-md-4 bd-content" role="main"> |
| |
| <div> |
| |
| <section id="data-sources"> |
| <span id="user-guide-data-sources"></span><h1>Data Sources<a class="headerlink" href="#data-sources" title="Link to this heading">¶</a></h1> |
| <p>DataFusion provides several ways to load data into a DataFrame so that you can operate on it.</p> |
| <section id="local-file"> |
| <h2>Local file<a class="headerlink" href="#local-file" title="Link to this heading">¶</a></h2> |
| <p>DataFusion can read a variety of popular file formats, such as <a class="reference internal" href="io/parquet.html#io-parquet"><span class="std std-ref">Parquet</span></a>, |
| <a class="reference internal" href="io/csv.html#io-csv"><span class="std std-ref">CSV</span></a>, <a class="reference internal" href="io/json.html#io-json"><span class="std std-ref">JSON</span></a>, and <a class="reference internal" href="io/avro.html#io-avro"><span class="std std-ref">Avro</span></a>.</p> |
| <div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span><span class="w"> </span><span class="nn">datafusion</span><span class="w"> </span><span class="kn">import</span> <span class="n">SessionContext</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">ctx</span> <span class="o">=</span> <span class="n">SessionContext</span><span class="p">()</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">df</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"pokemon.csv"</span><span class="p">)</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| <span class="n">DataFrame</span><span class="p">()</span> |
| <span class="o">+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+</span> |
| <span class="o">|</span> <span class="c1"># | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |</span> |
| <span class="o">+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+</span> |
| <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">Bulbasaur</span> <span class="o">|</span> <span class="n">Grass</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">318</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">49</span> <span class="o">|</span> <span class="mi">49</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="n">Ivysaur</span> <span class="o">|</span> <span class="n">Grass</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">405</span> <span class="o">|</span> <span class="mi">60</span> <span class="o">|</span> <span class="mi">62</span> <span class="o">|</span> <span class="mi">63</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">60</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span> <span class="n">Venusaur</span> <span class="o">|</span> <span class="n">Grass</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">525</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">82</span> <span class="o">|</span> <span class="mi">83</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span> <span class="n">VenusaurMega</span> <span class="n">Venusaur</span> <span class="o">|</span> <span class="n">Grass</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">625</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">123</span> <span class="o">|</span> <span class="mi">122</span> <span class="o">|</span> <span class="mi">120</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">4</span> <span class="o">|</span> <span class="n">Charmander</span> <span class="o">|</span> <span class="n">Fire</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">309</span> <span class="o">|</span> <span class="mi">39</span> <span class="o">|</span> <span class="mi">52</span> <span class="o">|</span> <span class="mi">43</span> <span class="o">|</span> <span class="mi">60</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">5</span> <span class="o">|</span> <span class="n">Charmeleon</span> <span class="o">|</span> <span class="n">Fire</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">405</span> <span class="o">|</span> <span class="mi">58</span> <span class="o">|</span> <span class="mi">64</span> <span class="o">|</span> <span class="mi">58</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">6</span> <span class="o">|</span> <span class="n">Charizard</span> <span class="o">|</span> <span class="n">Fire</span> <span class="o">|</span> <span class="n">Flying</span> <span class="o">|</span> <span class="mi">534</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">84</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">109</span> <span class="o">|</span> <span class="mi">85</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">6</span> <span class="o">|</span> <span class="n">CharizardMega</span> <span class="n">Charizard</span> <span class="n">X</span> <span class="o">|</span> <span class="n">Fire</span> <span class="o">|</span> <span class="n">Dragon</span> <span class="o">|</span> <span class="mi">634</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">130</span> <span class="o">|</span> <span class="mi">111</span> <span class="o">|</span> <span class="mi">130</span> <span class="o">|</span> <span class="mi">85</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">6</span> <span class="o">|</span> <span class="n">CharizardMega</span> <span class="n">Charizard</span> <span class="n">Y</span> <span class="o">|</span> <span class="n">Fire</span> <span class="o">|</span> <span class="n">Flying</span> <span class="o">|</span> <span class="mi">634</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">104</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">159</span> <span class="o">|</span> <span class="mi">115</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">7</span> <span class="o">|</span> <span class="n">Squirtle</span> <span class="o">|</span> <span class="n">Water</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">314</span> <span class="o">|</span> <span class="mi">44</span> <span class="o">|</span> <span class="mi">48</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">64</span> <span class="o">|</span> <span class="mi">43</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">8</span> <span class="o">|</span> <span class="n">Wartortle</span> <span class="o">|</span> <span class="n">Water</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">405</span> <span class="o">|</span> <span class="mi">59</span> <span class="o">|</span> <span class="mi">63</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">58</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">9</span> <span class="o">|</span> <span class="n">Blastoise</span> <span class="o">|</span> <span class="n">Water</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">530</span> <span class="o">|</span> <span class="mi">79</span> <span class="o">|</span> <span class="mi">83</span> <span class="o">|</span> <span class="mi">100</span> <span class="o">|</span> <span class="mi">85</span> <span class="o">|</span> <span class="mi">105</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">9</span> <span class="o">|</span> <span class="n">BlastoiseMega</span> <span class="n">Blastoise</span> <span class="o">|</span> <span class="n">Water</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">630</span> <span class="o">|</span> <span class="mi">79</span> <span class="o">|</span> <span class="mi">103</span> <span class="o">|</span> <span class="mi">120</span> <span class="o">|</span> <span class="mi">135</span> <span class="o">|</span> <span class="mi">115</span> <span class="o">|</span> <span class="mi">78</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">10</span> <span class="o">|</span> <span class="n">Caterpie</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">195</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">30</span> <span class="o">|</span> <span class="mi">35</span> <span class="o">|</span> <span class="mi">20</span> <span class="o">|</span> <span class="mi">20</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">11</span> <span class="o">|</span> <span class="n">Metapod</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="o">|</span> <span class="mi">205</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">20</span> <span class="o">|</span> <span class="mi">55</span> <span class="o">|</span> <span class="mi">25</span> <span class="o">|</span> <span class="mi">25</span> <span class="o">|</span> <span class="mi">30</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">12</span> <span class="o">|</span> <span class="n">Butterfree</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="n">Flying</span> <span class="o">|</span> <span class="mi">395</span> <span class="o">|</span> <span class="mi">60</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">90</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">70</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">13</span> <span class="o">|</span> <span class="n">Weedle</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">195</span> <span class="o">|</span> <span class="mi">40</span> <span class="o">|</span> <span class="mi">35</span> <span class="o">|</span> <span class="mi">30</span> <span class="o">|</span> <span class="mi">20</span> <span class="o">|</span> <span class="mi">20</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">14</span> <span class="o">|</span> <span class="n">Kakuna</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">205</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">25</span> <span class="o">|</span> <span class="mi">50</span> <span class="o">|</span> <span class="mi">25</span> <span class="o">|</span> <span class="mi">25</span> <span class="o">|</span> <span class="mi">35</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">15</span> <span class="o">|</span> <span class="n">Beedrill</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">395</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">90</span> <span class="o">|</span> <span class="mi">40</span> <span class="o">|</span> <span class="mi">45</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">75</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">15</span> <span class="o">|</span> <span class="n">BeedrillMega</span> <span class="n">Beedrill</span> <span class="o">|</span> <span class="n">Bug</span> <span class="o">|</span> <span class="n">Poison</span> <span class="o">|</span> <span class="mi">495</span> <span class="o">|</span> <span class="mi">65</span> <span class="o">|</span> <span class="mi">150</span> <span class="o">|</span> <span class="mi">40</span> <span class="o">|</span> <span class="mi">15</span> <span class="o">|</span> <span class="mi">80</span> <span class="o">|</span> <span class="mi">145</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="n">false</span> <span class="o">|</span> |
| <span class="o">+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+</span> |
| </pre></div> |
| </div> |
| </section> |
| <section id="create-in-memory"> |
| <h2>Create in-memory<a class="headerlink" href="#create-in-memory" title="Link to this heading">¶</a></h2> |
| <p>Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object. |
| To do this in DataFusion, you can use one of the three functions |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext.from_pydict" title="datafusion.context.SessionContext.from_pydict"><code class="xref py py-func docutils literal notranslate"><span class="pre">from_pydict()</span></code></a>, |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext.from_pylist" title="datafusion.context.SessionContext.from_pylist"><code class="xref py py-func docutils literal notranslate"><span class="pre">from_pylist()</span></code></a>, or |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext.create_dataframe" title="datafusion.context.SessionContext.create_dataframe"><code class="xref py py-func docutils literal notranslate"><span class="pre">create_dataframe()</span></code></a>.</p> |
| <p>As their names suggest, <code class="docutils literal notranslate"><span class="pre">from_pydict</span></code> and <code class="docutils literal notranslate"><span class="pre">from_pylist</span></code> will create DataFrames from Python |
| dictionary and list objects, respectively. <code class="docutils literal notranslate"><span class="pre">create_dataframe</span></code> assumes you will pass in a list |
| of lists of <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html">PyArrow Record Batches</a>.</p> |
| <p>The following three examples will all create identical DataFrames:</p> |
| <div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">5</span><span class="p">]:</span> <span class="kn">import</span><span class="w"> </span><span class="nn">pyarrow</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">pa</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">6</span><span class="p">]:</span> <span class="n">ctx</span><span class="o">.</span><span class="n">from_pylist</span><span class="p">([</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"a"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">"b"</span><span class="p">:</span> <span class="mf">10.0</span><span class="p">,</span> <span class="s2">"c"</span><span class="p">:</span> <span class="s2">"alpha"</span> <span class="p">},</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"a"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">"b"</span><span class="p">:</span> <span class="mf">20.0</span><span class="p">,</span> <span class="s2">"c"</span><span class="p">:</span> <span class="s2">"beta"</span> <span class="p">},</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"a"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">"b"</span><span class="p">:</span> <span class="mf">30.0</span><span class="p">,</span> <span class="s2">"c"</span><span class="p">:</span> <span class="s2">"gamma"</span> <span class="p">},</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| <span class="o">...</span><span class="p">:</span> |
| <span class="n">DataFrame</span><span class="p">()</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="n">a</span> <span class="o">|</span> <span class="n">b</span> <span class="o">|</span> <span class="n">c</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mf">10.0</span> <span class="o">|</span> <span class="n">alpha</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mf">20.0</span> <span class="o">|</span> <span class="n">beta</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span> <span class="mf">30.0</span> <span class="o">|</span> <span class="n">gamma</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">7</span><span class="p">]:</span> <span class="n">ctx</span><span class="o">.</span><span class="n">from_pydict</span><span class="p">({</span> |
| <span class="o">...</span><span class="p">:</span> <span class="s2">"a"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> |
| <span class="o">...</span><span class="p">:</span> <span class="s2">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mf">10.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">],</span> |
| <span class="o">...</span><span class="p">:</span> <span class="s2">"c"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"alpha"</span><span class="p">,</span> <span class="s2">"beta"</span><span class="p">,</span> <span class="s2">"gamma"</span><span class="p">],</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| <span class="o">...</span><span class="p">:</span> |
| <span class="n">DataFrame</span><span class="p">()</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="n">a</span> <span class="o">|</span> <span class="n">b</span> <span class="o">|</span> <span class="n">c</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mf">10.0</span> <span class="o">|</span> <span class="n">alpha</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mf">20.0</span> <span class="o">|</span> <span class="n">beta</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span> <span class="mf">30.0</span> <span class="o">|</span> <span class="n">gamma</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">8</span><span class="p">]:</span> <span class="n">batch</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">RecordBatch</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">(</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">[</span> |
| <span class="o">...</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]),</span> |
| <span class="o">...</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">10.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">]),</span> |
| <span class="o">...</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">"alpha"</span><span class="p">,</span> <span class="s2">"beta"</span><span class="p">,</span> <span class="s2">"gamma"</span><span class="p">]),</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">],</span> |
| <span class="o">...</span><span class="p">:</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"a"</span><span class="p">,</span> <span class="s2">"b"</span><span class="p">,</span> <span class="s2">"c"</span><span class="p">],</span> |
| <span class="o">...</span><span class="p">:</span> <span class="p">)</span> |
| <span class="o">...</span><span class="p">:</span> |
| |
| <span class="n">In</span> <span class="p">[</span><span class="mi">9</span><span class="p">]:</span> <span class="n">ctx</span><span class="o">.</span><span class="n">create_dataframe</span><span class="p">([[</span><span class="n">batch</span><span class="p">]])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| <span class="n">DataFrame</span><span class="p">()</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="n">a</span> <span class="o">|</span> <span class="n">b</span> <span class="o">|</span> <span class="n">c</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mf">10.0</span> <span class="o">|</span> <span class="n">alpha</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mf">20.0</span> <span class="o">|</span> <span class="n">beta</span> <span class="o">|</span> |
| <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span> <span class="mf">30.0</span> <span class="o">|</span> <span class="n">gamma</span> <span class="o">|</span> |
| <span class="o">+---+------+-------+</span> |
| </pre></div> |
| </div> |
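| <p>The list and dictionary inputs above describe the same data in two layouts: <code class="docutils literal notranslate"><span class="pre">from_pylist</span></code> takes one dictionary per row, while <code class="docutils literal notranslate"><span class="pre">from_pydict</span></code> takes one list per column. The following is a minimal pure-Python sketch of converting between the two. The helper names <code class="docutils literal notranslate"><span class="pre">rows_to_columns</span></code> and <code class="docutils literal notranslate"><span class="pre">columns_to_rows</span></code> are hypothetical and not part of DataFusion, and the sketch assumes every row has the same keys.</p> |

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Hypothetical helper; assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dict of equal-length column lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Row-oriented layout, as passed to from_pylist in the example above.
rows = [
    {"a": 1, "b": 10.0, "c": "alpha"},
    {"a": 2, "b": 20.0, "c": "beta"},
    {"a": 3, "b": 30.0, "c": "gamma"},
]

# Column-oriented layout, as passed to from_pydict in the example above.
columns = rows_to_columns(rows)
```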
| </section> |
| <section id="object-store"> |
| <h2>Object Store<a class="headerlink" href="#object-store" title="Link to this heading">¶</a></h2> |
| <p>DataFusion has support for multiple storage options in addition to local files. |
| The example below requires an appropriate S3 account with access credentials.</p> |
| <p>Supported Object Stores are</p> |
| <ul class="simple"> |
| <li><p><a class="reference internal" href="../autoapi/datafusion/object_store/index.html#datafusion.object_store.AmazonS3" title="datafusion.object_store.AmazonS3"><code class="xref py py-class docutils literal notranslate"><span class="pre">AmazonS3</span></code></a></p></li> |
| <li><p><a class="reference internal" href="../autoapi/datafusion/object_store/index.html#datafusion.object_store.GoogleCloud" title="datafusion.object_store.GoogleCloud"><code class="xref py py-class docutils literal notranslate"><span class="pre">GoogleCloud</span></code></a></p></li> |
| <li><p><a class="reference internal" href="../autoapi/datafusion/object_store/index.html#datafusion.object_store.Http" title="datafusion.object_store.Http"><code class="xref py py-class docutils literal notranslate"><span class="pre">Http</span></code></a></p></li> |
| <li><p><a class="reference internal" href="../autoapi/datafusion/object_store/index.html#datafusion.object_store.LocalFileSystem" title="datafusion.object_store.LocalFileSystem"><code class="xref py py-class docutils literal notranslate"><span class="pre">LocalFileSystem</span></code></a></p></li> |
| <li><p><a class="reference internal" href="../autoapi/datafusion/object_store/index.html#datafusion.object_store.MicrosoftAzure" title="datafusion.object_store.MicrosoftAzure"><code class="xref py py-class docutils literal notranslate"><span class="pre">MicrosoftAzure</span></code></a></p></li> |
| </ul> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">os</span> |
| |
| <span class="kn">from</span><span class="w"> </span><span class="nn">datafusion.object_store</span><span class="w"> </span><span class="kn">import</span> <span class="n">AmazonS3</span> |
| |
| <span class="n">region</span> <span class="o">=</span> <span class="s2">"us-east-1"</span> |
| <span class="n">bucket_name</span> <span class="o">=</span> <span class="s2">"yellow-trips"</span> |
| |
| <span class="n">s3</span> <span class="o">=</span> <span class="n">AmazonS3</span><span class="p">(</span> |
| <span class="n">bucket_name</span><span class="o">=</span><span class="n">bucket_name</span><span class="p">,</span> |
| <span class="n">region</span><span class="o">=</span><span class="n">region</span><span class="p">,</span> |
| <span class="n">access_key_id</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"AWS_ACCESS_KEY_ID"</span><span class="p">),</span> |
| <span class="n">secret_access_key</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"AWS_SECRET_ACCESS_KEY"</span><span class="p">),</span> |
| <span class="p">)</span> |
| |
| <span class="n">path</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"s3://</span><span class="si">{</span><span class="n">bucket_name</span><span class="si">}</span><span class="s2">/"</span> |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_object_store</span><span class="p">(</span><span class="s2">"s3://"</span><span class="p">,</span> <span class="n">s3</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span> |
| |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_parquet</span><span class="p">(</span><span class="s2">"trips"</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span> |
| |
| <span class="n">ctx</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">"trips"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
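| <p>Because the example reads credentials with <code class="docutils literal notranslate"><span class="pre">os.getenv</span></code>, a missing variable is silently passed along as <code class="docutils literal notranslate"><span class="pre">None</span></code>. One possible way to fail early with a clearer message is sketched below. The helpers <code class="docutils literal notranslate"><span class="pre">require_env</span></code> and <code class="docutils literal notranslate"><span class="pre">s3_settings</span></code> are hypothetical and not part of DataFusion; they only collect the keyword arguments used in the example above.</p> |

```python
import os


def require_env(name, environ=os.environ):
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def s3_settings(bucket_name, region, environ=os.environ):
    """Collect the keyword arguments used by the AmazonS3 example above.

    Hypothetical helper: it only builds a dict and never contacts S3.
    """
    return {
        "bucket_name": bucket_name,
        "region": region,
        "access_key_id": require_env("AWS_ACCESS_KEY_ID", environ),
        "secret_access_key": require_env("AWS_SECRET_ACCESS_KEY", environ),
    }
```

| <p>With this in place, the store could be built as <code class="docutils literal notranslate"><span class="pre">AmazonS3(**s3_settings("yellow-trips",</span> <span class="pre">"us-east-1"))</span></code>.</p> |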
| </section> |
| <section id="other-dataframe-libraries"> |
| <h2>Other DataFrame Libraries<a class="headerlink" href="#other-dataframe-libraries" title="Link to this heading">¶</a></h2> |
| <p>DataFusion can import DataFrames directly from other libraries, such as |
| <a class="reference external" href="https://pola.rs/">Polars</a> and <a class="reference external" href="https://pandas.pydata.org/">Pandas</a>. |
| Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule |
| interface can be imported to DataFusion using the |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext.from_arrow" title="datafusion.context.SessionContext.from_arrow"><code class="xref py py-func docutils literal notranslate"><span class="pre">from_arrow()</span></code></a> function. Older versions of Polars may |
| not support the Arrow interface. In those cases, you can still import via the |
| <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext.from_polars" title="datafusion.context.SessionContext.from_polars"><code class="xref py py-func docutils literal notranslate"><span class="pre">from_polars()</span></code></a> function.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">pandas</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">pd</span> |
| |
| <span class="n">data</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">"a"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="s2">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mf">10.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">],</span> <span class="s2">"c"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"alpha"</span><span class="p">,</span> <span class="s2">"beta"</span><span class="p">,</span> <span class="s2">"gamma"</span><span class="p">]</span> <span class="p">}</span> |
| <span class="n">pandas_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> |
| |
| <span class="n">datafusion_df</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">from_arrow</span><span class="p">(</span><span class="n">pandas_df</span><span class="p">)</span> |
| <span class="n">datafusion_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">polars</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">pl</span> |
| <span class="n">polars_df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> |
| |
| <span class="n">datafusion_df</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">from_arrow</span><span class="p">(</span><span class="n">polars_df</span><span class="p">)</span> |
| <span class="n">datafusion_df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
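| <p>The Arrow PyCapsule interface is exposed through the dunder methods <code class="docutils literal notranslate"><span class="pre">__arrow_c_stream__</span></code> and <code class="docutils literal notranslate"><span class="pre">__arrow_c_array__</span></code>, so one way to choose between the two import paths is to check for them. The sketch below is an illustration under that assumption; <code class="docutils literal notranslate"><span class="pre">supports_arrow_pycapsule</span></code> and <code class="docutils literal notranslate"><span class="pre">import_polars_dataframe</span></code> are hypothetical helpers, not DataFusion functions.</p> |

```python
def supports_arrow_pycapsule(obj):
    """Return True if obj exposes an Arrow PyCapsule export method."""
    return hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")


def import_polars_dataframe(ctx, polars_df):
    """Import a Polars DataFrame, preferring the Arrow PyCapsule path.

    Hypothetical helper: ctx is expected to be a SessionContext, and the
    fallback is the from_polars() function mentioned above.
    """
    if supports_arrow_pycapsule(polars_df):
        return ctx.from_arrow(polars_df)
    return ctx.from_polars(polars_df)
```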
| </section> |
| <section id="delta-lake"> |
| <h2>Delta Lake<a class="headerlink" href="#delta-lake" title="Link to this heading">¶</a></h2> |
| <p>DataFusion 43.0.0 and later support the ability to register table providers from sources such |
| as Delta Lake. This will require a recent version of |
| <a class="reference external" href="https://delta-io.github.io/delta-rs/">deltalake</a> to provide the required interfaces.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">deltalake</span><span class="w"> </span><span class="kn">import</span> <span class="n">DeltaTable</span> |
| |
| <span class="n">delta_table</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">(</span><span class="s2">"path_to_table"</span><span class="p">)</span> |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_table</span><span class="p">(</span><span class="s2">"my_delta_table"</span><span class="p">,</span> <span class="n">delta_table</span><span class="p">)</span> |
| <span class="n">df</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">"my_delta_table"</span><span class="p">)</span> |
| <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
| <p>On older versions of <code class="docutils literal notranslate"><span class="pre">deltalake</span></code> (prior to 0.22), you can use the |
| <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html">Arrow Dataset</a> |
| interface to import into DataFusion. This path does not support features such as filter pushdown, |
| which can lead to a significant performance difference.</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">deltalake</span><span class="w"> </span><span class="kn">import</span> <span class="n">DeltaTable</span> |
| |
| <span class="n">delta_table</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">(</span><span class="s2">"path_to_table"</span><span class="p">)</span> |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_dataset</span><span class="p">(</span><span class="s2">"my_delta_table"</span><span class="p">,</span> <span class="n">delta_table</span><span class="o">.</span><span class="n">to_pyarrow_dataset</span><span class="p">())</span> |
| <span class="n">df</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">"my_delta_table"</span><span class="p">)</span> |
| <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
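| <p>If your code has to run against both old and new <code class="docutils literal notranslate"><span class="pre">deltalake</span></code> releases, the choice between the two registration calls can be made from the installed version. The following is a rough sketch, assuming the 0.22 boundary stated above; <code class="docutils literal notranslate"><span class="pre">version_tuple</span></code> and <code class="docutils literal notranslate"><span class="pre">registration_method</span></code> are hypothetical helpers, and pre-release suffixes such as <code class="docutils literal notranslate"><span class="pre">rc1</span></code> are simply ignored.</p> |

```python
import re


def version_tuple(version):
    """Extract the leading numeric components, e.g. '0.22.3' becomes (0, 22, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def registration_method(deltalake_version):
    """Name the SessionContext method to use for a given deltalake version.

    Hypothetical helper: 0.22 and later use the table provider path,
    earlier versions fall back to the Arrow Dataset path.
    """
    if version_tuple(deltalake_version) >= (0, 22):
        return "register_table"
    return "register_dataset"
```

| <p>The installed version string can be read with <code class="docutils literal notranslate"><span class="pre">importlib.metadata.version("deltalake")</span></code> from the standard library.</p> |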
| </section> |
| <section id="apache-iceberg"> |
| <h2>Apache Iceberg<a class="headerlink" href="#apache-iceberg" title="Link to this heading">¶</a></h2> |
| <p>DataFusion 45.0.0 and later support the ability to register Apache Iceberg tables as table providers through the Custom Table Provider interface.</p> |
| <p>This requires either the <a class="reference external" href="https://pypi.org/project/pyiceberg/">pyiceberg</a> library (>=0.10.0) or the <a class="reference external" href="https://pypi.org/project/pyiceberg-core/">pyiceberg-core</a> library (>=0.5.0).</p> |
| <ul class="simple"> |
| <li><p>The <code class="docutils literal notranslate"><span class="pre">pyiceberg-core</span></code> library exposes Iceberg Rust’s implementation of the Custom Table Provider interface as Python bindings.</p></li> |
| <li><p>The <code class="docutils literal notranslate"><span class="pre">pyiceberg</span></code> library utilizes the <code class="docutils literal notranslate"><span class="pre">pyiceberg-core</span></code> Python bindings under the hood and provides a native way for Python users to interact with DataFusion.</p></li> |
| </ul> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">datafusion</span><span class="w"> </span><span class="kn">import</span> <span class="n">SessionContext</span> |
| <span class="kn">from</span><span class="w"> </span><span class="nn">pyiceberg.catalog</span><span class="w"> </span><span class="kn">import</span> <span class="n">load_catalog</span> |
| <span class="kn">import</span><span class="w"> </span><span class="nn">pyarrow</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">pa</span> |
| |
| <span class="c1"># Load catalog and create/load a table</span> |
| <span class="n">catalog</span> <span class="o">=</span> <span class="n">load_catalog</span><span class="p">(</span><span class="s2">"catalog"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"in-memory"</span><span class="p">)</span> |
| <span class="n">catalog</span><span class="o">.</span><span class="n">create_namespace_if_not_exists</span><span class="p">(</span><span class="s2">"default"</span><span class="p">)</span> |
| |
| <span class="c1"># Create some sample data</span> |
| <span class="n">data</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">table</span><span class="p">({</span><span class="s2">"x"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="s2">"y"</span><span class="p">:</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]})</span> |
| <span class="n">iceberg_table</span> <span class="o">=</span> <span class="n">catalog</span><span class="o">.</span><span class="n">create_table</span><span class="p">(</span><span class="s2">"default.test"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">data</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span> |
| <span class="n">iceberg_table</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> |
| |
| <span class="c1"># Register the table with DataFusion</span> |
| <span class="n">ctx</span> <span class="o">=</span> <span class="n">SessionContext</span><span class="p">()</span> |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_table_provider</span><span class="p">(</span><span class="s2">"test"</span><span class="p">,</span> <span class="n">iceberg_table</span><span class="p">)</span> |
| |
| <span class="c1"># Query the table using DataFusion</span> |
| <span class="n">ctx</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">"test"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> |
| </pre></div> |
| </div> |
| <p>Note that the DataFusion integration relies on features from the <a class="reference external" href="https://github.com/apache/iceberg-rust/">Iceberg Rust</a> implementation instead of the <a class="reference external" href="https://github.com/apache/iceberg-python/">PyIceberg</a> implementation. |
| Features that are available in PyIceberg but not yet in Iceberg Rust will not be available when using DataFusion.</p> |
| </section> |
| <section id="custom-table-provider"> |
| <h2>Custom Table Provider<a class="headerlink" href="#custom-table-provider" title="Link to this heading">¶</a></h2> |
| <p>You can implement a custom table provider in Rust and expose it to DataFusion through |
| the interface described in the <a class="reference internal" href="io/table_provider.html#io-custom-table-provider"><span class="std std-ref">Custom Table Provider</span></a> |
| section. This is an advanced topic, but a |
| <a class="reference external" href="https://github.com/apache/datafusion-python/tree/main/examples/datafusion-ffi-example">user example</a> |
| is provided in the datafusion-python repository.</p> |
| </section> |
| </section> |
| <section id="catalog"> |
| <h1>Catalog<a class="headerlink" href="#catalog" title="Link to this heading">¶</a></h1> |
| <p>A common technique for organizing tables is using a three-level hierarchical approach. DataFusion |
| supports this form of organizing using the <a class="reference internal" href="../autoapi/datafusion/catalog/index.html#datafusion.catalog.Catalog" title="datafusion.catalog.Catalog"><code class="xref py py-class docutils literal notranslate"><span class="pre">Catalog</span></code></a>, |
| <a class="reference internal" href="../autoapi/datafusion/catalog/index.html#datafusion.catalog.Schema" title="datafusion.catalog.Schema"><code class="xref py py-class docutils literal notranslate"><span class="pre">Schema</span></code></a>, and <a class="reference internal" href="../autoapi/datafusion/catalog/index.html#datafusion.catalog.Table" title="datafusion.catalog.Table"><code class="xref py py-class docutils literal notranslate"><span class="pre">Table</span></code></a>. By default, |
| a <a class="reference internal" href="../autoapi/datafusion/context/index.html#datafusion.context.SessionContext" title="datafusion.context.SessionContext"><code class="xref py py-class docutils literal notranslate"><span class="pre">SessionContext</span></code></a> comes with a single Catalog and a single Schema |
| with the names <code class="docutils literal notranslate"><span class="pre">datafusion</span></code> and <code class="docutils literal notranslate"><span class="pre">default</span></code>, respectively.</p> |
| <p>The default implementation uses an in-memory approach to the catalog and schema. You can also |
| add further in-memory catalogs and schemas, as in the following |
| example:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">datafusion.catalog</span><span class="w"> </span><span class="kn">import</span> <span class="n">Catalog</span><span class="p">,</span> <span class="n">Schema</span> |
| |
| <span class="n">my_catalog</span> <span class="o">=</span> <span class="n">Catalog</span><span class="o">.</span><span class="n">memory_catalog</span><span class="p">()</span> |
| <span class="n">my_schema</span> <span class="o">=</span> <span class="n">Schema</span><span class="o">.</span><span class="n">memory_schema</span><span class="p">()</span> |
| |
| <span class="n">my_catalog</span><span class="o">.</span><span class="n">register_schema</span><span class="p">(</span><span class="s2">"my_schema_name"</span><span class="p">,</span> <span class="n">my_schema</span><span class="p">)</span> |
| |
| <span class="n">ctx</span><span class="o">.</span><span class="n">register_catalog</span><span class="p">(</span><span class="s2">"my_catalog_name"</span><span class="p">,</span> <span class="n">my_catalog</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>You could then register tables in <code class="docutils literal notranslate"><span class="pre">my_schema</span></code> and access them either through the DataFrame |
| API or via SQL commands such as <code class="docutils literal notranslate"><span class="pre">"SELECT</span> <span class="pre">*</span> <span class="pre">FROM</span> <span class="pre">my_catalog_name.my_schema_name.my_table"</span></code>.</p> |
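| <p>To make the three-level naming concrete, here is a simplified sketch of how a dotted table name maps onto a catalog, schema, and table, filling in the default names <code class="docutils literal notranslate"><span class="pre">datafusion</span></code> and <code class="docutils literal notranslate"><span class="pre">default</span></code> mentioned above. The function <code class="docutils literal notranslate"><span class="pre">resolve_table_name</span></code> is hypothetical and is not how DataFusion itself parses names; in particular, it ignores quoted identifiers.</p> |

```python
DEFAULT_CATALOG = "datafusion"
DEFAULT_SCHEMA = "default"


def resolve_table_name(name, catalog=DEFAULT_CATALOG, schema=DEFAULT_SCHEMA):
    """Split a dotted name into a (catalog, schema, table) triple.

    Hypothetical helper: missing leading parts are filled in with the
    defaults. Quoted identifiers containing dots are not handled.
    """
    parts = name.split(".")
    if len(parts) == 1:
        return (catalog, schema, parts[0])
    if len(parts) == 2:
        return (catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"expected at most three name parts, got {len(parts)}")
```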
| <section id="user-defined-catalog-and-schema"> |
| <h2>User Defined Catalog and Schema<a class="headerlink" href="#user-defined-catalog-and-schema" title="Link to this heading">¶</a></h2> |
| <p>If the in-memory catalogs are insufficient for your needs, there are two approaches you can take |
| to implementing a custom catalog and/or schema. The discussion below describes how to |
| implement these for a Catalog, but the approach to implementing a Schema is nearly |
| identical.</p> |
| <p>DataFusion supports Catalogs written in either Rust or Python. If you write a Catalog in Rust, |
| you will need to export it as a Python library via PyO3. There is a complete example of a |
| catalog implemented this way in the |
| <a class="reference external" href="https://github.com/apache/datafusion-python/tree/main/examples/">examples folder</a> |
| of our repository. Writing catalog providers in Rust can typically lead to significant |
| performance improvements over the Python-based approach.</p> |
| <p>To implement a Catalog in Python, you will need to inherit from the abstract base class |
| <a class="reference internal" href="../autoapi/datafusion/catalog/index.html#datafusion.catalog.CatalogProvider" title="datafusion.catalog.CatalogProvider"><code class="xref py py-class docutils literal notranslate"><span class="pre">CatalogProvider</span></code></a>. There are examples in the |
| <a class="reference external" href="https://github.com/apache/datafusion-python/tree/main/python/tests">unit tests</a> of |
| implementing a basic Catalog in Python where we simply keep a dictionary of the |
| registered Schemas.</p> |
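| <p>A dictionary-backed catalog of that kind might look roughly like the sketch below. The method names <code class="docutils literal notranslate"><span class="pre">schema_names</span></code>, <code class="docutils literal notranslate"><span class="pre">schema</span></code>, <code class="docutils literal notranslate"><span class="pre">register_schema</span></code>, and <code class="docutils literal notranslate"><span class="pre">deregister_schema</span></code> are assumptions about the <code class="docutils literal notranslate"><span class="pre">CatalogProvider</span></code> interface and should be checked against the API reference for your version. The class name <code class="docutils literal notranslate"><span class="pre">DictCatalogProvider</span></code> is hypothetical, and the fallback base class exists only so the sketch runs without <code class="docutils literal notranslate"><span class="pre">datafusion</span></code> installed.</p> |

```python
try:
    from datafusion.catalog import CatalogProvider
except ImportError:
    # Stand-in base class so the sketch runs without datafusion installed.
    class CatalogProvider:
        pass


class DictCatalogProvider(CatalogProvider):
    """Hypothetical catalog that keeps registered schemas in a dictionary."""

    def __init__(self):
        self._schemas = {}

    def schema_names(self):
        # Names of all schemas currently registered in this catalog.
        return set(self._schemas)

    def schema(self, name):
        # Return the schema registered under name, or None if absent.
        return self._schemas.get(name)

    def register_schema(self, name, schema):
        # Add a schema, replacing any existing one with the same name.
        self._schemas[name] = schema

    def deregister_schema(self, name, cascade=True):
        # Remove a schema if present; cascade is accepted but unused here.
        self._schemas.pop(name, None)
```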
| <p>One important note for developers is that when we have a Catalog defined in Python, we have |
| two different ways of accessing this Catalog. First, we register the catalog with a Rust |
| wrapper. This allows any Rust-based code to call the Python functions as necessary. |
| Second, if the user accesses the Catalog via the Python API, we identify this and return |
| the original Python object that implements the Catalog. This is an important distinction |
| for developers because we do <em>not</em> return a Python wrapper around the Rust wrapper of the |
| original Python object.</p> |
| </section> |
| </section> |
| |
| |
| </div> |
| |
| |
| |
| </main> |
| |
| |
| </div> |
| </div> |
| |
| <script src="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf"></script> |
| |
| <!-- Based on pydata_sphinx_theme/footer.html --> |
| <footer class="footer mt-5 mt-md-0"> |
| <div class="container"> |
| |
| <div class="footer-item"> |
| <p class="copyright"> |
| © Copyright 2019-2024, Apache Software Foundation.<br> |
| </p> |
| </div> |
| |
| <div class="footer-item"> |
| <p class="sphinx-version"> |
| Created using <a href="http://sphinx-doc.org/">Sphinx</a> 8.1.3.<br> |
| </p> |
| </div> |
| |
| <div class="footer-item"> |
| <p>Apache Arrow DataFusion, Arrow DataFusion, Apache, the Apache feather logo, and the Apache Arrow DataFusion project logo</p> |
| <p>are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> |
| </div> |
| </div> |
| </footer> |
| |
| |
| </body> |
| </html> |