DataFusion's code structure and organization are described in the crates.io documentation, to keep the description as close to the source as possible. You can find the most up-to-date version in the source code.
DataFusion is a fast-moving project, which results in frequent internal changes. This benefits DataFusion by allowing it to evolve and respond quickly to requests, but it also means that maintaining a fork with major modifications sometimes requires nontrivial work.
The public API (what is accessible if you use the DataFusion releases from crates.io) is typically much more stable (though it does change from release to release as well).
Thus, rather than maintaining a fork, we recommend using one of the many extension APIs (such as TableProvider, OptimizerRule, or ExecutionPlan) to customize DataFusion. If you cannot do what you want with the existing APIs, we would welcome you working with us to add new APIs to enable your use case, as described in the next section.
Please see the Extensions section to find out more about existing DataFusion extensions and how to contribute your extension to the community.
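To illustrate why extension APIs are easier to maintain than a fork, here is a minimal sketch of the general pattern: the engine owns a trait, and downstream code registers its own implementation without touching engine internals. Every name here (`TableSource`, `Engine`, `register_table`, `InMemoryRange`) is a hypothetical, simplified stand-in invented for this sketch; these are not DataFusion's actual traits or signatures, which are documented on crates.io.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical extension trait owned by the engine (NOT a real DataFusion trait).
pub trait TableSource {
    /// Names of the columns this source produces.
    fn column_names(&self) -> Vec<String>;
    /// Produce all rows; each row is a vector of integer values.
    fn scan(&self) -> Vec<Vec<i64>>;
}

/// Hypothetical engine that only knows about the trait, not concrete sources.
#[derive(Default)]
pub struct Engine {
    tables: HashMap<String, Arc<dyn TableSource>>,
}

impl Engine {
    /// Register a downstream-defined table under a name.
    pub fn register_table(&mut self, name: &str, source: Arc<dyn TableSource>) {
        self.tables.insert(name.to_string(), source);
    }

    /// Count the rows of a registered table; `None` if the name is unknown.
    pub fn count_rows(&self, name: &str) -> Option<usize> {
        self.tables.get(name).map(|table| table.scan().len())
    }
}

/// Downstream extension: lives outside the engine and needs no fork.
pub struct InMemoryRange {
    pub n: i64,
}

impl TableSource for InMemoryRange {
    fn column_names(&self) -> Vec<String> {
        vec!["value".to_string()]
    }

    fn scan(&self) -> Vec<Vec<i64>> {
        // One single-column row per value in 0..n.
        (0..self.n).map(|v| vec![v]).collect()
    }
}
```

In this sketch the engine can change its internals freely as long as the trait stays stable, which mirrors the relationship between internal changes and the public API described above.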
DataFusion aims to be a general-purpose query engine, and thus the core crates contain features that are useful for a wide range of use cases. Use-case-specific functionality (such as very specific time series or stream processing features) is typically implemented using the extension APIs.
If you have a use case that is not covered by the existing APIs, we would love to work with you to design a new general-purpose API. There are often others who are interested in similar extensions, and the act of defining the API often improves the code overall for everyone.
Extension APIs that provide “safe” default behaviors are more likely to be suitable for inclusion in DataFusion, while APIs that require major changes to built-in operators are less likely. For example, it might make less sense to add an API to support a stream processing feature if that would result in slower performance for built-in operators. It may still make sense to add extension APIs for such features, but leave implementation of such operators in downstream projects.
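One way to picture a "safe" default is a trait method whose default implementation declines to act, so the built-in path runs unchanged unless an extension opts in. The sketch below is illustrative only: `Expr`, `ExprPlanner`, `NoOpPlanner`, `StreamPlanner`, and `plan_expr` are hypothetical names made up for this example and do not correspond to DataFusion's real types.

```rust
/// Hypothetical expression type standing in for an engine's own representation.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Custom(String),
}

/// Hypothetical extension point (NOT a real DataFusion trait).
pub trait ExprPlanner {
    /// Safe default: decline, so built-in planning is unaffected.
    fn plan(&self, _expr: &Expr) -> Option<String> {
        None
    }
}

/// An extension that relies entirely on the default behavior.
pub struct NoOpPlanner;
impl ExprPlanner for NoOpPlanner {}

/// A downstream extension that handles only its own custom expressions.
pub struct StreamPlanner;
impl ExprPlanner for StreamPlanner {
    fn plan(&self, expr: &Expr) -> Option<String> {
        match expr {
            Expr::Custom(name) => Some(format!("StreamOp({})", name)),
            _ => None,
        }
    }
}

/// Built-in planning path: consult extensions first, then fall back.
pub fn plan_expr(expr: &Expr, planners: &[Box<dyn ExprPlanner>]) -> String {
    for planner in planners {
        if let Some(plan) = planner.plan(expr) {
            return plan;
        }
    }
    // No extension claimed the expression: use the built-in behavior.
    match expr {
        Expr::Column(name) => format!("Scan({})", name),
        Expr::Custom(name) => format!("Unsupported({})", name),
    }
}
```

Under these assumptions the stream-specific operator lives entirely in the downstream `StreamPlanner`, while built-in expressions only pay for a loop over the registered planners.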
The process to create a new extension API is typically:
- Ask contributors (via @ mentions) for feedback (you can find such people by looking at the most recently changed PRs and issues).
- Add an example (in datafusion-examples or by refactoring existing code) to show how it would work.

Some benefits of using an example-based approach are
An example of this process was creating a SQL Extension Planning API.