<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Frequently Asked Questions
## What is the relationship between Apache Arrow, DataFusion, and Ballista?
Apache Arrow is a library which provides a standardized memory representation for columnar data. It also provides
"kernels" for performing common operations on this data.
DataFusion is a library for executing queries in-process using the Apache Arrow memory
model and computational kernels. It is designed to run within a single process, using threads
for parallel query execution.
[Ballista](https://github.com/apache/datafusion-ballista) is a distributed compute platform built on DataFusion.
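
The single-process model described above can be illustrated with a minimal sketch that uses only the Rust standard library. It does not use the Arrow or DataFusion APIs: `Int64Column`, `sum`, and `parallel_sum` are hypothetical names introduced here, and splitting the data into per-thread partitions is an assumption made for illustration. The column stores its values contiguously with a separate validity mask (Arrow packs validity into a bitmap; a `Vec<bool>` keeps the sketch short), `sum` plays the role of a "kernel", and `parallel_sum` runs that kernel on each partition in its own thread within one process.

```rust
use std::thread;

/// Hypothetical nullable 64-bit integer column (not an Arrow type).
/// Values are stored contiguously; `validity[i]` is false when slot `i` is null.
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

impl Int64Column {
    /// Builds a column from optional values, storing 0 in null slots.
    fn from_options(items: &[Option<i64>]) -> Self {
        Int64Column {
            values: items.iter().map(|v| v.unwrap_or(0)).collect(),
            validity: items.iter().map(|v| v.is_some()).collect(),
        }
    }
}

/// Toy "kernel": sums the non-null values of one column.
/// Returns None when the column has no non-null values.
fn sum(column: &Int64Column) -> Option<i64> {
    let mut total = 0i64;
    let mut seen = false;
    for (value, valid) in column.values.iter().zip(&column.validity) {
        if *valid {
            total += *value;
            seen = true;
        }
    }
    if seen {
        Some(total)
    } else {
        None
    }
}

/// Runs the `sum` kernel on each partition in its own scoped thread,
/// then combines the partial results in the calling thread.
fn parallel_sum(partitions: &[Int64Column]) -> Option<i64> {
    let partials = thread::scope(|scope| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|partition| scope.spawn(move || sum(partition)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<_>>()
    });
    // `flatten` skips partitions that contained only nulls.
    partials.into_iter().flatten().reduce(|a, b| a + b)
}

fn main() {
    let partitions = vec![
        Int64Column::from_options(&[Some(1), None, Some(3)]),
        Int64Column::from_options(&[Some(10), Some(20)]),
    ];
    println!("{:?}", parallel_sum(&partitions)); // prints Some(34)
}
```

A real engine adds query planning, many more data types, and a far more sophisticated scheduler; the sketch only shows how columnar data, kernels, and threads fit together inside a single process.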
## How does DataFusion compare with `XYZ`?
When compared to similar systems, DataFusion is typically:
1. Targeted at developers, rather than end users or data scientists.
2. Designed to be embedded, rather than a complete file-based SQL system.
3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than by a single company or individual.
4. Implemented in `Rust`, rather than `C/C++`.
Here is a comparison with similar projects that may help you understand
when DataFusion might be suitable or unsuitable for your needs:
- [DuckDB](https://www.duckdb.org) is an open source, in-process analytic database.
Like DataFusion, it supports very fast execution, both from its custom file format
and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and is
primarily used directly by users as a serverless database and query system, rather
than as a library for building such database systems.
- [Polars](http://pola.rs) is one of the fastest DataFrame
libraries at the time of writing. Like DataFusion, it is
written in Rust and uses the Apache Arrow memory model. Unlike
DataFusion, it is not designed with as many extension points.
- [Facebook Velox](https://github.com/facebookincubator/velox)
is an execution engine. Like DataFusion, Velox aims to
provide a reusable foundation for building database-like systems. Unlike DataFusion,
it is written in C/C++ and does not include a SQL frontend or a planning and optimization
framework.
- [Databend](https://github.com/datafuselabs/databend) is a complete
database system. Like DataFusion, it is written in Rust and
uses the Apache Arrow memory model. Unlike DataFusion, it
targets end users rather than developers of other database systems.