 <!---
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 # Features and Use Cases

 DataFusion is a very fast, extensible query engine for building
 high-quality data-centric systems in [Rust](https://www.rust-lang.org/),
 using the [Apache Arrow](https://arrow.apache.org) in-memory format.

 DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

 ## Features

 - Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
 - Blazingly fast, vectorized, multi-threaded, streaming execution engine.
 - Native support for Parquet, CSV, JSON, and Avro file formats. Support
   for custom file formats and non-file data sources via the `TableProvider` trait.
 - Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
   Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
   `ObjectStore` trait.
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
 - A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations,
   automatic join reordering, expression coercion, and more.
 - Permissive Apache 2.0 License and Apache Software Foundation governance.
 - Written in [Rust](https://www.rust-lang.org/), a modern systems language with development
   productivity similar to Java or Golang, the performance of C++, and
   [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
 - Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
   with other projects, and to pass plans across language boundaries.
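
To make "projection and filter pushdown" from the list above more concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion code: `ToyTable`, `GreaterThan`, `scan`, and `sample_table` are hypothetical names invented for illustration and do not mirror the real `TableProvider` API. The idea shown is only that the scan receives the needed columns and the filter, so it can skip materializing anything else.

```rust
/// A tiny columnar table: each column is a named vector of i64 values.
/// Hypothetical type for illustration only; not part of DataFusion.
struct ToyTable {
    columns: Vec<(String, Vec<i64>)>,
}

/// A simple filter of the form `column > value` (hypothetical).
struct GreaterThan {
    column: String,
    value: i64,
}

impl ToyTable {
    /// Look up a column's values by name.
    fn column(&self, name: &str) -> Option<&[i64]> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, v)| v.as_slice())
    }

    /// Scan with the projection and filter "pushed down" into the source:
    /// only the requested columns are materialized, and only for rows that
    /// pass the filter. Returns `None` if any referenced column is unknown.
    fn scan(&self, projection: &[&str], filter: &GreaterThan) -> Option<Vec<Vec<i64>>> {
        let filter_col = self.column(&filter.column)?;
        // Evaluate the predicate over the whole filter column at once and
        // remember the indices of the matching rows.
        let selected: Vec<usize> = filter_col
            .iter()
            .enumerate()
            .filter(|(_, v)| **v > filter.value)
            .map(|(i, _)| i)
            .collect();
        // Materialize only the projected columns, in the requested order.
        let mut out = Vec::with_capacity(projection.len());
        for name in projection {
            let col = self.column(name)?;
            let values: Vec<i64> = selected.iter().map(|&i| col[i]).collect();
            out.push(values);
        }
        Some(out)
    }
}

/// Build a small three-column table used by the example.
fn sample_table() -> ToyTable {
    ToyTable {
        columns: vec![
            ("id".to_string(), vec![1, 2, 3, 4]),
            ("price".to_string(), vec![10, 25, 5, 40]),
            ("qty".to_string(), vec![7, 8, 9, 10]),
        ],
    }
}

fn main() {
    let table = sample_table();
    // Roughly: SELECT id FROM table WHERE price > 9
    let filter = GreaterThan { column: "price".to_string(), value: 9 };
    let result = table.scan(&["id"], &filter).expect("unknown column");
    println!("{:?}", result); // prints [[1, 2, 4]]
}
```

In this sketch the `qty` column is never touched for the query, and the `price` column is read only to evaluate the filter. A real engine applies the same principle at the file and storage level, for example by skipping unneeded Parquet columns.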

 ## Use Cases

 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
 building new systems. Here are some examples of systems built using DataFusion:

 - Specialized analytical database systems such as [CeresDB] and more general Apache Spark-like systems such as [Ballista]
 - New query language engines such as [prql-query] and accelerators such as [VegaFusion]
 - Research platforms for new database systems, such as [Flock]
 - SQL support for another library, such as [dask sql]
 - Streaming data platforms such as [Synnada]
 - Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv]
 - Faster Spark runtime replacements such as [Blaze]

 By using DataFusion, these projects are free to focus on their specific
 features and avoid reimplementing general (but still necessary)
 features such as expression representation, standard optimizations,
 execution plans, and file format support.

 ## Why DataFusion?

 - _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case.
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.