Contribution Guidelines and Standards for SystemDS

Thank you for contributing to SystemDS. :smile:

The following are mostly guidelines, not rules. Use your best judgement, and feel free to propose changes to this doc.

Before contributing a pull request for review, let's make sure the changes are consistent with the guidelines and coding style.

General Guidelines and Philosophy for contribution

  • Inclusion of unit tests when contributing new features, will help
    1. prove that the code works correctly, and
    2. guard against future breaking changes.
  • Formatting changes can be handled in a separate PR. Example bf4ba16b
  • New features (e.g., a new cutting edge machine learning algorithm) typically will live in scripts/staging or its equivalent folder for specific feature to get some airtime and sufficient testing before a decision is made regarding whether they are to migrated to the top-level.
  • When a new contribution is made to SystemDS, the maintenance burden is (by default) transferred to the SystemDS team. The benefit of the contribution is to be compared against the cost of maintaining the feature.

Code Style

We suggest applying a code formatter to the written code. Generally, this is done automatically.

We have provided at profile for java located in Codestyle File ./dev/CodeStyle.eclipse.xml. This can be loaded in most editors e.g.:

Commit Style

SystemDS project has linear history throughout. Rebase cleanly and verify that the commit-SHA's of the Apache Repo are not altered. A general guideline is never to change a commit inside the systemds official repo, which is the standard practice. If you have accidentally committed a functionality or tag, let others know. And create a new commit to revert the functionality of the previous commit but do not force push.

Tags

The tags can be used in combination to one another. These are the only tags available.

  • [MINOR]: Small changesets with additional functionality

    Examples:

    This commit makes small software updates with refactoring

    030fdab3 - [MINOR] Added package to R dependency; updated Docker test image

    This commit cleans up the redundant code for simplification

    f4fa5650 - [MINOR][SYSTEMDS-43] Cleanup scale builtin function (readability)

  • [SYSTEMDS-#]: A changeset which will a specific purpose such as a bug, Improvement, New feature, etc. The tag value is found from SystemDS jira issue tracker.

  • [DOC] also [DOCS]: Changes to the documentation

  • [HOTFIX]: Introduces changes into the already released versions.

    Example:

    This commit fixes the corrupted language path

    87bc3584 - [HOTFIX] Fix validation of scalar-scalar binary min/max operations

Protip: Addressing multiple jira issues in a single commit, [SYSTEMDS-123,SYSTEMDS-124] or [SYSTEMDS-123][SYSTEMDS-124]

Commit description

A commit or PR description is a public record of what change is being made and why.

Structure of the description

First Line
  1. A summary of what the changeset.
  2. A complete sentence, crafted as though it was an order.
    • an imperative sentence
    • Writing the rest of the description as an imperative is optional.
  3. Follow by an empty line.
Body

It consists of the following.

  1. A brief description of the problem solved.
  2. Why this is the best approach?.
  3. Shortcomings to the approach, if any (important!).

Additional info

  1. background information
    • bug/issue/jira numbers
    • benchmark/test results
    • links to design documents
    • code attributions
  2. Include enough context for
    • reviewers
    • future readers to understand the Changes.
  3. Add PR number, like Closes #1000.

The following is a commit description with all the points mentioned.

1abe9cb

Commit message:

[SYSTEMDS-418] Performance improvements lineage reuse probing/spilling

This patch makes some minor performance improvements to the lineage
reuse probing and cache put operations. Specifically, we now avoid
unnecessary lineage hashing and comparisons by using lists instead of
hash maps, move the time computations into the reuse path (to not affect
the code path without lineage reuse), avoid unnecessary branching, and
materialize the score of cache entries to avoid repeated computation
for the log N comparisons per add/remove/constaints operation.
For 100K iterations and ~40 ops per iteration, lineage tracing w/ reuse
improved from 41.9s to 38.8s (pure lineage tracing: 27.9s).

Good commit description

The following are some of the types of code changes with examples.

Functionality change

1101533

Commit message:

[SYSTEMDS-2603] New hybrid approach for lineage deduplication

This patch makes a major refactoring of the lineage deduplication
framework. This changes the design of tracing all the
distinct paths in a loop-body before the first iteration, to trace
during execution. The number of distinct paths grows exponentially
with the number of control flow statements. Tracing all the paths
in advance can be a huge waste and overhead.
We now trace an iteration during execution. We count the number of
distinct paths before the iterations start, and we stop tracing
once all the paths are traced. Tracing during execution fits
very well with our multi-level reuse infrastructure.
Refer to JIRA for detailed discussions.
Refactoring

Refactoring is a series of behaviour preserving changes with restructuring to existing code body. Refactoring does not alter the external behaviour or the output of a function, but the internal changes to the function to keep the code more organized, readable or to accomodate a more complex functionality.

e581b5a

Commit message:

[SYSTEMDS-2575] Fix eval function calls (incorrect pinning of inputs)

This patch fixes an issue of indirect eval function calls where wrong
input variable names led to missing pinning of inputs and thus too eager
cleanup of these variables (which causes crashes if the inputs are used
in other operations of the eval call).
The fix is simple. We avoid such inconsistent construction and
invocation of fcall instructions by using a narrower interface and
constructing the materialized names internally in the fcall.
Small Changeset still needs some context.

7af2ae0

Commit message:

[MINOR] Update docker images organization

Changes the docker images to use the docker organization systemds
add install dependency for R dbScan
Change the tests to use the new organizations docker images

Closes #1008

Protip: to reference other commits use first 7 letters of the commit SHA-1. eg. 1b81d8c for referencing 1b81d8cb19d8da6d865b7fca5a095dd5fec8d209

License

Including a license at the top of new files helps validate the consistency of license.

Examples:


Thanks again for taking your time to help improve SystemDS! :+1: