<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
-->
This document lists recommendations and requirements for contributing to Parquet. We strive to follow these as closely as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules, please bring it up for discussion on the developer mailing list (dev@parquet.apache.org).
### Key branches
- `master` has the latest stable changes
### Pull requests
- Submit pull requests against the `master` branch
- Try not to pollute your pull request with unintended changes--keep them simple and small
### License
By contributing your code, you agree to license your contribution under the terms of the Apache License, Version 2.0:
https://github.com/apache/parquet-format/blob/master/LICENSE
### Additions/Changes to the Format
Note: This section applies to actual functional changes to the specification.
Fixing typos, grammar, and clarifying concepts that would not change the
semantics of the specification can be done as long as a committer feels comfortable
merging them. When in doubt, starting a discussion on the dev mailing list is
encouraged.
The general steps for adding features to the format are as follows:
1. Design/scoping: The goal of this phase is to identify design goals of a
feature and provide some demonstration that the feature meets those goals.
This phase starts with a discussion of changes on the developer mailing list
(dev@parquet.apache.org). Depending on the scope and goals of the feature,
it can be useful to provide additional artifacts as part of a discussion. The
artifacts can include a design document, a draft pull request to make the
discussion concrete, and/or a prototype implementation to demonstrate the
viability of implementation. This step is complete when there is lazy
consensus. Part of the consensus is whether it is sufficient to provide two
working implementations as outlined in step 2, or if demonstration of the
feature with a downstream query engine is necessary to justify the feature
(e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset
library, the Apache DataFusion query engine, or any other open source
engine).
2. Completeness: The goal of this phase is to ensure the feature is viable and
that there is no ambiguity in its specification, by demonstrating compatibility
between implementations. Once a change has lazy consensus, two
implementations of the feature demonstrating interoperability must also be
provided. One implementation MUST be
[`parquet-java`](http://github.com/apache/parquet-java). It is preferred
that the second implementation be
[`parquet-cpp`](https://github.com/apache/arrow) or
[`parquet-rs`](https://github.com/apache/arrow-rs); however, at the discretion
of the PMC, any open source Parquet implementation may be acceptable.
Implementations whose contributors actively participate in the community
(e.g. keep their feature matrix up-to-date on the Parquet website) are more
likely to be considered. If discussed as a requirement in step 1 above,
demonstration of integration with a query engine is also required for this
step. The implementations must be made available publicly, and they should be
fit for inclusion (for example, they were submitted as a pull request against
the target repository and committers gave positive reviews). Reports on the
benefits from closed source implementations are welcome and can help lend
weight to a feature's desirability but are not sufficient for acceptance of a
new feature.
Unless otherwise discussed, it is expected the implementations will be developed
from their respective main branches (i.e. backporting is not required), to
demonstrate that the feature is mergeable into each implementation.
3. Ratification: After the first two steps are complete, a formal vote is held on
dev@parquet.apache.org to officially ratify the feature. After the vote
passes, the format change is merged into the `parquet-format` repository and
it is expected the changes from step 2 will also be merged soon after
(implementations should not be merged until the addition has been merged to
`parquet-format`).
#### General guidelines/preferences on additions
1. To the greatest extent possible, changes should have an option for forward
compatibility (old readers can still read files). The [compatibility and
feature enablement](#compatibility-and-feature-enablement) section below
provides more details on expectations for changes that break compatibility.
2. New encodings should be fully specified in this repository and not
rely on external dependencies for implementation (i.e. `parquet-format` is
the source of truth for the encoding). If an encoding does require an
external dependency, then the external dependency must have its
own specification separate from implementation.
3. New compression mechanisms should have a pure Java implementation that can be
used as a dependency in `parquet-java`. Exceptions may be
discussed on the mailing list to see if an implementation that is not pure
Java is acceptable.
### Releases
The Parquet PMC aims to do releases of the format package only as needed when
new features are introduced. If multiple new features are being proposed
simultaneously, some features might be consolidated into the same release.
Guidance is provided below on when implementations should enable features added
to the specification. Due to confusion in the past over Parquet versioning, it
is not expected that there will be a 3.x release of the specification in the
foreseeable future.
### Compatibility and Feature Enablement
For the purposes of this discussion we classify features into the following buckets:
1. Backward compatible. A file written under an older version of the format
should be readable under a newer version of the format.
2. Forward compatible. A file written under a newer version of the format with
the feature enabled can be read under an older version of the format, but
some metadata might be missing or performance might be suboptimal. Simply
phrased, forward compatible means all data can be read back in an older
version of the format. New logical types are considered forward
compatible despite the loss of semantic meaning.
3. Forward incompatible. A file written under a newer version of the format with
the feature enabled cannot be read under an older version of the format (e.g.
adding and using a new compression algorithm). It is expected any feature in
this category will provide a signal to older readers, so they can
unambiguously determine that they cannot properly read the file (e.g. via
adding a new value to an existing enum).
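As a rough illustration of the signal described for forward incompatible
features, the following Python sketch shows how an older reader might refuse a
file that uses an enum value it does not recognize. The `KnownCodec` enum, its
values, the `UnsupportedFeatureError` exception, and the `resolve_codec` helper
are all hypothetical and are not taken from any Parquet implementation.

```python
from enum import IntEnum


class KnownCodec(IntEnum):
    """Hypothetical set of compression codecs this (older) reader understands."""

    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2


class UnsupportedFeatureError(Exception):
    """Raised when a file uses a feature this reader cannot decode."""


def resolve_codec(raw_value: int) -> KnownCodec:
    """Map the integer stored in file metadata to a codec this reader knows.

    An unknown value suggests the file was written with a newer feature, so the
    reader fails with a clear error instead of attempting to decode the data.
    """
    try:
        return KnownCodec(raw_value)
    except ValueError:
        raise UnsupportedFeatureError(
            f"compression codec id {raw_value} is not supported by this reader"
        ) from None
```

Failing on the first unknown value is one way for a reader to determine
unambiguously that it cannot read the file, rather than returning corrupt data.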
New features are intended to be widely beneficial to users of Parquet, and
therefore it is hoped third-party implementations will adopt them quickly after
they are introduced. It is assumed that writing new parts of the format, and
especially forward incompatible features, will be configured with a feature flag
defaulted to "off", and that at some future point the feature will be turned on by default
(reading of the new feature will typically be enabled without configuration or
defaulted to on). Some amount of lead time is desirable to ensure a critical
mass of Parquet implementations support a feature to avoid compatibility issues
across the ecosystem. Therefore, the Parquet PMC gives the following
recommendations for managing features:
1. Backward compatibility is the concern of implementations, but given the
ubiquity of Parquet and the length of time it has been used, libraries should
support reading older versions of the format to the greatest extent possible.
2. Forward compatible features/changes may be enabled and used by default in
implementations once the parquet-format containing those changes has been
formally released. For features that may pose a significant performance
regression to older format readers, libraries should consider delaying default
enablement until 1 year after the release of the parquet-java implementation
that contains the feature implementation.
3. Forward incompatible features/changes should not be turned on by default
until 2 years after the parquet-java implementation containing the feature is
released. It is recommended that changing the default value for a forward
incompatible feature flag should be clearly advertised to consumers (e.g. via
a major version release if using Semantic Versioning, or highlighted in
release notes).
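The waiting periods in the recommendations above can be sketched as a small date
calculation. This is only an illustration: the `Compatibility` enum, the
`add_years` and `earliest_default_on` helpers, and the dates in the usage
example are made up, and the handling of a 29 February release date is an
assumption rather than anything this document specifies.

```python
from datetime import date
from enum import Enum


class Compatibility(Enum):
    """Feature classes from this document; the performance-risk case is split out."""

    FORWARD_COMPATIBLE = "forward compatible"
    FORWARD_COMPATIBLE_PERF_RISK = "forward compatible, possible performance regression"
    FORWARD_INCOMPATIBLE = "forward incompatible"


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later.

    Assumption: a 29 February start rolls forward to 1 March in a non-leap
    year, so the waiting period is never shortened.
    """
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return date(start.year + years, 3, 1)


def earliest_default_on(
    kind: Compatibility, format_release: date, java_release: date
) -> date:
    """Earliest date a writer could enable the feature by default.

    format_release: release date of the parquet-format version with the change.
    java_release: release date of the parquet-java version implementing it.
    """
    if kind is Compatibility.FORWARD_COMPATIBLE:
        return format_release
    if kind is Compatibility.FORWARD_COMPATIBLE_PERF_RISK:
        return add_years(java_release, 1)
    return add_years(java_release, 2)


# Made-up dates, for illustration only.
print(
    earliest_default_on(
        Compatibility.FORWARD_INCOMPATIBLE, date(2030, 1, 10), date(2030, 6, 1)
    )
)  # prints 2032-06-01
```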
For forward compatible changes which have a high chance of performance
regression for older readers and forward incompatible changes, implementations
should clearly document the compatibility issues. Additionally, while it is up
to maintainers of individual open-source implementations to make the best
decision to serve their ecosystem, they are encouraged to start enabling
features by default along the same timelines as `parquet-java`. `parquet-java`
will wait to enable features by default until the most conservative timelines
outlined above have been exceeded. This timeline is an attempt to balance
ensuring new features make their way into the ecosystem against breaking
compatibility for readers that are slower to adopt new standards. We
encourage earlier adoption of new features when an organization using Parquet
can guarantee that all readers of the Parquet files they produce can read a new
feature.
After turning a feature on by default, implementations
are encouraged to keep a configuration to turn off the feature.
A recommendation for full deprecation will be made in a future
iteration of this document.
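The feature-flag approach described in this section can be sketched as a writer
option that defaults to "off" and remains available after the default changes.
The `WriterOptions` class, the `use_new_encoding` field, and the `NEW_ENCODING`
name are hypothetical and do not correspond to an actual `parquet-java` setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriterOptions:
    """Hypothetical writer settings; the field name is illustrative only."""

    # Forward incompatible feature: off by default until the waiting period
    # ends. Once the default flips to True, users can still pass False.
    use_new_encoding: bool = False


def choose_encoding(options: WriterOptions) -> str:
    """Pick the encoding name a writer would use for a column."""
    return "NEW_ENCODING" if options.use_new_encoding else "PLAIN"
```

Keeping the option after the default changes lets users who still have older
readers in their ecosystem opt out explicitly.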
For features released prior to October 2024, target dates for each of these
categories will be updated as part of the `parquet-java 2.0` release process
based on a collected feature compatibility matrix.
For each release of `parquet-java` or `parquet-format` that influences this
guidance, it is expected that exact dates will be added to `parquet-format` to provide
clarity to implementors (e.g. when `parquet-java` 2.X.X is released, any new
format features it uses will be updated with concrete dates). As part of
`parquet-format` releases the compatibility matrix will be updated to contain
the release date in the format. Implementations are also encouraged to provide
implementation date/release version information when updating the feature
matrix.
End users of software are generally encouraged to consult the feature matrix
and vendor documentation before enabling features that are not yet widely
adopted.