| <main class="col-12 col-md-9 col-xl-7 py-md-5 pl-md-5 pr-md-4 bd-content" role="main"> |
| |
| <div> |
| |
| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
Introduction
| <p>DataFusion is a very fast, extensible query engine for building |
| high-quality data-centric systems in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, |
| using the <a class="reference external" href="https://arrow.apache.org">Apache Arrow</a> in-memory format. |
| DataFusion originated as part of the <a class="reference external" href="https://arrow.apache.org/">Apache Arrow</a> |
| project.</p> |
| <p>DataFusion offers SQL and DataFrame APIs, excellent <a class="reference external" href="https://benchmark.clickhouse.com/">performance</a>, built-in support for CSV, Parquet, JSON, and Avro, <a class="reference external" href="https://github.com/apache/datafusion-python">Python bindings</a>, extensive customization, a great community, and more.</p> |
| <section id="project-goals"> |
| <h2>Project Goals<a class="headerlink" href="#project-goals" title="Link to this heading">¶</a></h2> |
| <p>DataFusion aims to be the query engine of choice for new, fast |
| data-centric systems such as databases, dataframe libraries, machine |
| learning, and streaming applications by leveraging the unique features |
| of <a class="reference external" href="https://www.rust-lang.org/">Rust</a> and <a class="reference external" href="https://arrow.apache.org/">Apache |
| Arrow</a>.</p> |
| </section> |
| <section id="features"> |
| <h2>Features<a class="headerlink" href="#features" title="Link to this heading">¶</a></h2> |
| <ul class="simple"> |
| <li><p>Feature-rich <a class="reference external" href="https://datafusion.apache.org/user-guide/sql/index.html">SQL support</a> and <a class="reference external" href="https://datafusion.apache.org/user-guide/dataframe.html">DataFrame API</a></p></li> |
| <li><p>Blazingly fast, vectorized, multi-threaded, streaming execution engine.</p></li> |
| <li><p>Native support for Parquet, CSV, JSON, and Avro file formats. Support |
| for custom file formats and non-file data sources via the <code class="docutils literal notranslate"><span class="pre">TableProvider</span></code> trait.</p></li> |
| <li><p>Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL, |
| other query languages, custom plan and execution nodes, optimizer passes, and more.</p></li> |
| <li><p>Streaming, asynchronous IO directly from popular object stores, including AWS S3, |
| Azure Blob Storage, and Google Cloud Storage (other storage systems are supported via the |
| <code class="docutils literal notranslate"><span class="pre">ObjectStore</span></code> trait).</p></li> |
| <li><p><a class="reference external" href="https://docs.rs/datafusion/latest">Excellent Documentation</a> and a |
| <a class="reference external" href="https://datafusion.apache.org/contributor-guide/communication.html">welcoming community</a>.</p></li> |
| <li><p>A state-of-the-art query optimizer with expression coercion and |
| simplification, projection and filter pushdown, sort- and distribution-aware |
| optimizations, automatic join reordering, and more.</p></li> |
| <li><p>Permissive Apache 2.0 License, predictable and well-understood |
| <a class="reference external" href="https://www.apache.org/">Apache Software Foundation</a> governance.</p></li> |
| <li><p>Implementation in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, a modern |
| systems language with development productivity similar to Java or |
| Golang, the performance of C++, and <a class="reference external" href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted">loved by programmers |
| everywhere</a>.</p></li> |
| <li><p>Support for <a class="reference external" href="https://substrait.io/">Substrait</a> query plans, to |
| easily pass plans across language and system boundaries.</p></li> |
| </ul> |
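<p>As a rough illustration of what “vectorized” and “streaming” mean in the feature list above, the sketch below uses plain Rust with no DataFusion dependency. The <code>Batch</code> struct and the <code>batch_revenue</code> and <code>total_revenue</code> functions are hypothetical names invented for this example. DataFusion itself operates on Apache Arrow record batches and is considerably more sophisticated, so treat this as a toy model of the idea rather than a description of the engine's internals.</p>

```rust
/// A toy columnar batch: one Vec per column, all of equal length.
/// (Hypothetical type for illustration; not a DataFusion or Arrow type.)
struct Batch {
    price: Vec<f64>,
    quantity: Vec<u32>,
}

/// Sum `price * quantity` over the rows of one batch where `quantity >= min_qty`.
/// The loop walks whole columns at once instead of dispatching per row,
/// which is the basic idea behind "vectorized" execution.
fn batch_revenue(batch: &Batch, min_qty: u32) -> f64 {
    let mut total = 0.0;
    for (&p, &q) in batch.price.iter().zip(&batch.quantity) {
        if q >= min_qty {
            total += p * f64::from(q);
        }
    }
    total
}

/// Consume batches one at a time ("streaming"), so the full input
/// never has to be held in memory at once.
fn total_revenue<I: Iterator<Item = Batch>>(batches: I, min_qty: u32) -> f64 {
    batches.map(|b| batch_revenue(&b, min_qty)).sum()
}

fn main() {
    let batches = vec![
        Batch { price: vec![1.0, 2.0, 3.0], quantity: vec![1, 5, 10] },
        Batch { price: vec![4.0], quantity: vec![2] },
    ];
    // Rows with quantity >= 2 contribute 2.0*5 + 3.0*10 + 4.0*2.
    let total = total_revenue(batches.into_iter(), 2);
    println!("total revenue: {total}");
}
```

<p>Because <code>total_revenue</code> accepts any iterator of batches, the same code would work whether the batches come from memory, a file reader, or a network source, which is the property the “streaming” bullet refers to.</p>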
| </section> |
| <section id="use-cases"> |
| <h2>Use Cases<a class="headerlink" href="#use-cases" title="Link to this heading">¶</a></h2> |
| <p>DataFusion can be used without modification as an embedded SQL |
| engine or can be customized and used as a foundation for |
| building new systems.</p> |
| <p>While most current use cases are “analytic” (throughput-oriented), some |
| components of DataFusion, such as the plan representations, are |
| suitable for “streaming” and “transaction” style systems (low |
| latency).</p> |
| <p>Here are some example systems built using DataFusion:</p> |
| <ul class="simple"> |
| <li><p>Specialized analytical database systems such as <a class="reference external" href="https://github.com/apache/incubator-horaedb">HoraeDB</a> and more general Apache Spark-like systems such as <a class="reference external" href="https://github.com/apache/datafusion-ballista">Ballista</a>.</p></li> |
| <li><p>New query language engines such as <a class="reference external" href="https://github.com/prql/prql-query">prql-query</a> and accelerators such as <a class="reference external" href="https://vegafusion.io/" title="if you know of another project, please submit a PR to add a link!">VegaFusion</a></p></li> |
| <li><p>Research platform for new Database Systems, such as <a class="reference external" href="https://github.com/flock-lab/flock">Flock</a></p></li> |
| <li><p>SQL support for another library, such as <a class="reference external" href="https://github.com/dask-contrib/dask-sql">dask sql</a></p></li> |
| <li><p>Streaming data platforms such as <a class="reference external" href="https://synnada.ai/">Synnada</a></p></li> |
| <li><p>Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as <a class="reference external" href="https://github.com/timvw/qv">qv</a></p></li> |
| <li><p>Native Spark runtime replacement such as <a class="reference external" href="https://github.com/blaze-init/blaze">Blaze</a></p></li> |
| </ul> |
| <p>By using DataFusion, projects are freed to focus on their specific |
| features, and avoid reimplementing general (but still necessary) |
| features such as an expression representation, standard optimizations, |
| parallelized streaming execution plans, file format support, etc.</p> |
| </section> |
| <section id="known-users"> |
| <h2>Known Users<a class="headerlink" href="#known-users" title="Link to this heading">¶</a></h2> |
| <p>Here are some active projects using DataFusion:</p> |
| <!-- "Active" means github repositories that had at least one commit in the last 6 months --> |
| <ul class="simple"> |
| <li><p><a class="reference external" href="https://github.com/ArroyoSystems/arroyo">Arroyo</a> Distributed stream processing engine in Rust</p></li> |
| <li><p><a class="reference external" href="https://github.com/apache/datafusion-ballista">Ballista</a> Distributed SQL Query Engine</p></li> |
| <li><p><a class="reference external" href="https://github.com/kwai/blaze">Blaze</a> The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing</p></li> |
| <li><p><a class="reference external" href="https://github.com/cnosdb/cnosdb">CnosDB</a> Open Source Distributed Time Series Database</p></li> |
| <li><p><a class="reference external" href="https://github.com/apache/datafusion-comet">Comet</a> Apache Spark native query execution plugin</p></li> |
| <li><p><a class="reference external" href="https://github.com/cube-js/cube.js/tree/master/rust">Cube Store</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/dask-contrib/dask-sql">Dask SQL</a> Distributed SQL query engine in Python</p></li> |
| <li><p><a class="reference external" href="https://github.com/delta-io/delta-rs">delta-rs</a> Native Rust implementation of Delta Lake</p></li> |
| <li><p><a class="reference external" href="https://github.com/wheretrue/exon">Exon</a> Analysis toolkit for life-science applications</p></li> |
| <li><p><a class="reference external" href="https://funnel.io/">Funnel</a> Data Platform powering Marketing Intelligence applications.</p></li> |
| <li><p><a class="reference external" href="https://github.com/GlareDB/glaredb">GlareDB</a> Fast SQL database for querying and analyzing distributed data.</p></li> |
| <li><p><a class="reference external" href="https://github.com/GreptimeTeam/greptimedb">GreptimeDB</a> Open Source & Cloud Native Distributed Time Series Database</p></li> |
| <li><p><a class="reference external" href="https://github.com/apache/incubator-horaedb">HoraeDB</a> Distributed Time-Series Database</p></li> |
| <li><p><a class="reference external" href="https://github.com/influxdata/influxdb">InfluxDB</a> Time Series Database</p></li> |
| <li><p><a class="reference external" href="https://github.com/kamu-data/kamu-cli/">Kamu</a> Planet-scale streaming data pipeline</p></li> |
| <li><p><a class="reference external" href="https://github.com/lakesoul-io/LakeSoul">LakeSoul</a> Open source LakeHouse framework with native IO in Rust.</p></li> |
| <li><p><a class="reference external" href="https://github.com/lancedb/lance">Lance</a> Modern columnar data format for ML</p></li> |
| <li><p><a class="reference external" href="https://github.com/openobserve/openobserve">OpenObserve</a> Distributed cloud native observability platform</p></li> |
| <li><p><a class="reference external" href="https://github.com/paradedb/paradedb">ParadeDB</a> PostgreSQL for Search & Analytics</p></li> |
| <li><p><a class="reference external" href="https://github.com/parseablehq/parseable">Parseable</a> Log storage and observability platform</p></li> |
| <li><p><a class="reference external" href="https://github.com/timvw/qv">qv</a> Quickly view your data</p></li> |
| <li><p><a class="reference external" href="https://github.com/restatedev">Restate</a> Easily build resilient applications using distributed durable async/await</p></li> |
| <li><p><a class="reference external" href="https://github.com/roapi/roapi">ROAPI</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/lakehq/sail">Sail</a> Unifying stream, batch, and AI workloads with Apache Spark compatibility</p></li> |
| <li><p><a class="reference external" href="https://github.com/splitgraph/seafowl">Seafowl</a> CDN-friendly analytical database</p></li> |
| <li><p><a class="reference external" href="https://github.com/spiceai/spiceai">Spice.ai</a> Unified SQL query interface & materialization engine</p></li> |
| <li><p><a class="reference external" href="https://synnada.ai/">Synnada</a> Streaming-first framework for data products</p></li> |
| <li><p><a class="reference external" href="https://vegafusion.io/">VegaFusion</a> Server-side acceleration for the <a class="reference external" href="https://vega.github.io/">Vega</a> visualization grammar</p></li> |
| <li><p><a class="reference external" href="https://telemetry.sh/">Telemetry</a> Structured logging made easy</p></li> |
| </ul> |
| <p>Here are some less active projects that used DataFusion:</p> |
| <ul class="simple"> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/bdt">bdt</a> Boring Data Tool</p></li> |
| <li><p><a class="reference external" href="https://github.com/cloudfuse-io/buzz-rust">Cloudfuse Buzz</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-tui">datafusion-tui</a> Text UI for DataFusion</p></li> |
| <li><p><a class="reference external" href="https://github.com/flock-lab/flock">Flock</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/tensorbase/tensorbase">Tensorbase</a></p></li> |
| </ul> |
| </section> |
| <section id="integrations-and-extensions"> |
| <h2>Integrations and Extensions<a class="headerlink" href="#integrations-and-extensions" title="Link to this heading">¶</a></h2> |
| <p>There are a number of community projects that extend DataFusion or |
| provide integrations with other systems, some of which are described below:</p> |
| <section id="language-bindings"> |
| <h3>Language Bindings<a class="headerlink" href="#language-bindings" title="Link to this heading">¶</a></h3> |
| <ul class="simple"> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-c">datafusion-c</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/apache/datafusion-python">datafusion-python</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-ruby">datafusion-ruby</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-java">datafusion-java</a></p></li> |
| </ul> |
| </section> |
| <section id="integrations"> |
| <h3>Integrations<a class="headerlink" href="#integrations" title="Link to this heading">¶</a></h3> |
| <ul class="simple"> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-bigtable">datafusion-bigtable</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-catalogprovider-glue">datafusion-catalogprovider-glue</a></p></li> |
| <li><p><a class="reference external" href="https://github.com/datafusion-contrib/datafusion-federation">datafusion-federation</a></p></li> |
| </ul> |
| </section> |
| </section> |
| <section id="why-datafusion"> |
| <h2>Why DataFusion?<a class="headerlink" href="#why-datafusion" title="Link to this heading">¶</a></h2> |
| <ul class="simple"> |
| <li><p><em>High Performance</em>: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.</p></li> |
| <li><p><em>Easy to Connect</em>: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.</p></li> |
| <li><p><em>Easy to Embed</em>: Allowing extension at almost any point in its design, and published regularly as a crate on <a class="reference external" href="http://crates.io">crates.io</a>, DataFusion can be integrated and tailored for your specific use case.</p></li> |
| <li><p><em>High Quality</em>: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.</p></li> |
| </ul> |
| </section> |
| </section> |
