| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| ======================= |
| Reviewing contributions |
| ======================= |
| |
| Principles |
| ========== |
| |
| Arrow is a foundational project that will need to evolve over many years |
| or even decades, while serving potentially millions of users. We believe |
| that being meticulous when reviewing brings greater rewards to the project |
| than being lenient and aiming for quick merges. |
| |
| Code reviews like this lead to better quality code, more people who are |
| engaged with and understand the code being changed, and a generally |
| healthier project with more room to grow and accommodate emerging needs. |
| |
| |
| Guidelines |
| ========== |
| |
| Meta |
| ---- |
| |
| **Use your own judgement**. These guidelines are not hard rules. |
| Committers are expected to have sufficient expertise on their work |
| areas to be able to adjust their approach based on any concerns they have. |
| |
| These guidelines are not listed in a particular order and are not intended |
| to be used as a checklist. |
| |
| Finally, these guidelines are not exhaustive. |
| |
| Scope and completeness |
| ---------------------- |
| |
| * Our general policy is to not introduce regressions or merge PRs that require |
| follow-ons to function correctly (though exceptions to this can be made). |
| Making necessary changes after a merge is more costly both for the |
| contributor and the reviewer, but also for other developers who may be |
| confused if they hit problems introduced by a merged PR. |
| |
| * What changes are in-scope for a PR and what changes might/could/should be |
| pushed out of scope and have a follow-up issue created should be determined |
| in collaboration between the authors and the reviewers. |
| |
| * When a large piece of functionality is being contributed and it seems |
| desirable to integrate it piecewise, favour functional cohesion when |
| deciding how to divide changes (for example, if a filesystem implementation |
| is being contributed, a first PR may contribute directory metadata |
| operations, a second PR file reading facilities and a third PR file writing |
| facilities). |
| |
| Public API design |
| ----------------- |
| |
| * Public APIs should nudge users towards the most desirable constructs. |
| In other words, if there is a "best" way to do something, it should |
| ideally also be the most easily discoverable and the most concise to type. |
| For example, safe APIs should be featured more prominently than |
| unsafe APIs that may crash or silently produce erroneous results on |
| invalid input. |
| |
| * Public APIs should ideally tend to produce readable code. One example |
| is when multiple options are expected to be added over time: it is better |
| to try to organize options logically rather than juxtapose them all in |
| a function's signature (see for example the CSV reading APIs in |
| :ref:`C++ <cpp-api-csv>` and :ref:`Python <py-api-csv>`). |
| |
| * Naming is important. Try to ask yourself if code calling the new API |
| would be understandable without having to read the API docs. |
| Vague naming should be avoided; inaccurate naming is even worse as it |
| can mislead the reader and lead to buggy user code. |
| |
| * Be mindful of terminology. Every project has (explicitly or tacitly) set |
| conventions about how to name important concepts; steering away from those |
| conventions increases the cognitive workload both for contributors and |
| users of the project. Conversely, reusing a well-known term for a different |
| purpose than usual can also increase the cognitive workload and make |
| developers' lives more difficult. |
| |
| * If you are unsure whether an API is the right one for the task at hand, |
| it is advisable to mark it experimental, such that users know that it |
| may be changed over time, while contributors are less wary of bringing |
| code-breaking improvements. However, experimental APIs should not be |
| used as an excuse for eschewing basic API design principles. |
| |
| Robustness |
| ---------- |
| |
| * Arrow is a set of open source libraries that will be used in a very wide |
| array of contexts (including fiddling with deliberately artificial data |
| at a Jupyter interpreter prompt). If you are writing a public API, make |
| sure that it won't crash or produce undefined behaviour on unusual (but |
| valid) inputs. |
| |
| * When a non-trivial algorithm is implemented, defensive coding can |
| be useful to catch potential problems (such as debug-only assertions, if |
| the language allows them). |
| |
| * APIs ingesting potentially untrusted data, such as on-disk file formats, |
| should try to avoid crashing or produce silent bugs when invalid or |
| corrupt data is fed to them. This can require a lot of care that is |
| out of the scope of regular code reviews (such as setting up |
| :ref:`fuzz testing <cpp-fuzzing>`), but basic checks can still be |
| suggested at the code review stage. |
| |
| * When calling foreign APIs, especially system functions or APIs dealing with |
| input / output, do check for errors and propagate them (if the language |
| does not propagate errors automatically, such as C++). |
| |
| Performance |
| ----------- |
| |
| * Think about performance, but do not obsess over it. Algorithmic complexity |
| is important if input size may be "large" (the meaning of large depends |
| on the context: use your own expertise to decide!). Micro-optimizations |
| improving performance by 20% or more on performance-sensitive functionality |
| may be useful as well; lesser micro-optimizations may not be worth the |
| time spent on them, especially if they lead to more complicated code. |
| |
| * If performance is important, measure it. Do not satisfy yourself with |
| guesses and intuitions (which may be founded on incorrect assumptions |
| about the compiler or the hardware). |
| |
| .. seealso:: :ref:`Benchmarking Arrow <benchmarks>` |
| |
| * Try to avoid trying to trick the compiler/interpreter/runtime by writing |
| the code in a certain way, unless it's really important. These tricks |
| are generally brittle and dependent on platform details that may become |
| obsolete, and they can make code harder to maintain (a common question |
| that can block contributors is "what should I do about this weird hack |
| that my changes would like to remove"?). |
| |
| * Avoiding rough edges or degenerate behaviour (such as memory blowups when |
| a size estimate is inaccurately large) may be more important than trying to |
| improve the common case by a small amount. |
| |
| Documentation |
| ------------- |
| |
| These guidelines should ideally apply to both prose documentation and |
| in-code docstrings. |
| |
| * Look for ambiguous / poorly informative wording. For example, *"it is an |
| error if ..."* is less informative than either *"An error is raised if ... "* |
| or *"Behaviour is undefined if ..."* (the first phrasing doesn't tell the |
| reader what actually *happens* on such an error). |
| |
| * When reviewing documentation changes (or prose snippets, in general), |
| be mindful about spelling, grammar, expression, and concision. Clear |
| communication is essential for effective collaboration with people |
| from a wide range of backgrounds, and contributes to better documentation. |
| |
| * Some contributors do not have English as a native language (and perhaps |
| neither do you). It is advised to help them and/or ask for external help |
| if needed. |
| |
| * Cross-linking increases the global value of documentation. Sphinx especially |
| has great cross-linking capabilities (including topic references, glossary |
| terms, API references), so be sure to make use of them! |
| |
| Testing |
| ------- |
| |
| * When adding an API, all nominal cases should have test cases. Does a function |
| allow null values? Then null values should be tested (alongside non-null |
| values, of course). Does a function allow different input types? etc. |
| |
| * If some aspect of a functionality is delicate (either by definition or |
| as an implementation detail), it should be tested. |
| |
| * Corner cases should be exercised, especially in low-level implementation |
| languages such as C++. Examples: empty arrays, zero-chunk arrays, arrays |
| with only nulls, etc. |
| |
| * Stress tests can be useful, for example to uncover synchronizations bugs |
| if non-trivial parallelization is being added, or to validate a computational |
| argument against a slow and straightforward reference implementation. |
| |
| * A mitigating concern, however, is the overall cost of running the test |
| suite. Continuous Integration (CI) runtimes can be painfully long and |
| we should be wary of increasing them too much. Sometimes it is |
| worthwhile to fine-tune testing parameters to balance the usefulness |
| of tests against the cost of running them (especially where stress tests |
| are involved, since they tend to imply execution over large datasets). |
| |
| |
| Social aspects |
| ============== |
| |
| * Reviewing is a communication between the contributor and the reviewer. |
| Avoid letting questions or comments remain unanswered for too long |
| ("too long" is of course very subjective, but two weeks can be a reasonable |
| heuristic). If you cannot allocate time soon, do say it explicitly. |
| If you don't have the answer to a question, do say it explicitly. |
| Saying *"I don't have time immediately but I will come back later, |
| feel free to ping if I seem to have forgotten"* or *"Sorry, I am out of |
| my depth here"* is always better than saying nothing and leaving the |
| other person wondering. |
| |
| * If you know someone who has the competence to help on a blocking issue |
| and past experience suggests they may be willing to do so, feel free to |
| add them to the discussion (for example by gently pinging their GitHub |
| handle). |
| |
| * If the contributor has stopped giving feedback or updating their PR, |
| perhaps they're not interested anymore, but perhaps also they're stuck |
| on some issue and feel unable to push their contribution any further. |
| Don't hesitate to ask (*"I see this PR hasn't seen any updates recently, |
| are you stuck on something? Do you need any help?"*). |
| |
| * If the contribution is genuinely desirable and the contributor is not making |
| any progress, it is also possible to take it up. Out of politeness, |
| it is however better to ask the contributor first. |
| |
| * Some contributors are looking for a quick fix to a specific problem and |
| don't want to spend too much time on it. Others on the contrary are eager |
| to learn and improve their contribution to make it conform to the |
| project's standards. The latter kind of contributors are especially |
| valuable as they may become long-term contributors or even committers |
| to the project. |
| |
| * Some contributors may respond *"I will fix it later, can we merge anyway?"* |
| when a problem is pointed out to them. Unfortunately, whether the fix will |
| really be contributed soon in a later PR is difficult to predict or enforce. |
| If the contributor has previously demonstrated that they are reliable, |
| it may be acceptable to do as suggested. Otherwise, it is better to |
| decline the suggestion. |
| |
| * If a PR is generally ready for merge apart from trivial or uncontroversial |
| concerns, the reviewer may decide to push changes themselves to the |
| PR instead of asking the contributor to make the changes. |
| |
| * Ideally, contributing code should be a rewarding process. Of course, |
| it will not always be, but we should strive to reduce contributor |
| frustration while keeping the above issues in mind. |
| |
| * Like any communication, code reviews are governed by the Apache |
| `Code of Conduct <https://www.apache.org/foundation/policies/conduct.html>`_. |
| This applies to both reviewers and contributors. |
| |
| |
| Labelling |
| ========= |
| |
| While reviewing PRs, we should try to identify whether the corresponding issue |
| needs to be marked with one or both of the following issue labels: |
| |
| * **Critical Fix**: The change fixes either: (a) a security vulnerability; |
| (b) a bug that causes incorrect or invalid data to be produced; |
| or (c) a bug that causes a crash (while the API contract is upheld). |
| This is intended to mark fixes to issues that may affect users without their |
| knowledge. For this reason, fixing bugs that cause errors don't count, since |
| those bugs are usually obvious. Bugs that cause crashes are considered critical |
| because they are a possible vector of Denial-of-Service attacks. |
| * **Breaking Change**: The change breaks backwards compatibility in a public API. |
| For changes in C++, this does not include changes that simply break ABI |
| compatibility, except for the few places where we do guarantee ABI |
| compatibility (such as C Data Interface). Experimental APIs are *not* |
| exempt from this; they are just more likely to be associated with this tag. |
| |
| Breaking changes and critical fixes are separate: breaking changes alter the |
| API contract, while critical fixes make the implementation align with the |
| existing API contract. For example, fixing a bug that caused a Parquet reader |
| to skip rows containing the number 42 is a critical fix but not a breaking change, |
| since the row skipping wasn't behavior a reasonable user would rely on. |
| |
| These labels are used in the release to highlight changes that users ought to be |
| aware of when they consider upgrading library versions. Breaking changes help |
| identify reasons when users may wish to wait to upgrade until they have time to |
| adapt their code, while critical fixes highlight the risk in *not* upgrading. |
| |
| In addition, we use the following labels to indicate priority: |
| |
| * **Priority: Blocker**: Indicates the changes **must** be merged before the next |
| release can happen. This includes fixes to test or packaging failures that |
| would prevent the release from succeeding final packaging or verification. |
| * **Priority: Critical**: Indicates issues that are high priority. This is a |
| superset of issues marked "Critical Fix", as it also contains certain fixes |
| to issues causing errors and crashes. |
| |
| Collaborators |
| ============= |
| |
| The collaborator role allows users to be given triage role to be able to help triaging |
| issues. The role allows users to label and assign issues. |
| |
| A user can ask to become a collaborator or can be proposed by another community member |
| when they have been collaborating in the project for a period of time. Collaborations |
| can be but are not limited to: creating PRs, answering questions, creating |
| issues, reviewing PRs, etcetera. |
| |
| In order to propose someone as a collaborator you must create a PR adding the user to |
| the collaborators list on `.asf.yaml <https://github.com/apache/arrow/blob/main/.asf.yaml>`_. |
| Committers can review the past collaborations for the user and approve. |
| |
| Collaborators that have been inactive for a period of time can be removed from the |
| list of collaborators. |