// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
= Contributing to Apache Kudu
:author: Kudu Team
:imagesdir: ./images
:icons: font
:toclevels: 3
:doctype: book
:backend: html5
== Contributing Patches Using Gerrit
The Kudu team uses Gerrit for code review, rather than GitHub pull requests. Typically,
you pull from GitHub but push to Gerrit, and Gerrit is used to review code and merge
it into GitHub.
See the link:[Gerrit Tutorial]
for an overview of using Gerrit for code review.
=== Initial Setup for Gerrit
. Sign in to link:[Gerrit] using your GitHub username.
. Go to link:[Settings]. Update your name
and email address on the *Contact Information* page, and upload an SSH public
key under *SSH Public Keys* if you would like to use SSH to connect to Gerrit.
Generate an HTTP password under *HTTP Password* if you would like to use HTTP
or HTTPS to connect to Gerrit. (Most Kudu developers use the SSH option.)
NOTE: If you do not update your name, it will appear as "Anonymous Coward" in
Gerrit reviews.
. If you have not done so, clone the main Kudu repository. By default, the main remote
is called `origin`. When you fetch or pull, you will do so from `origin`.
git clone
. Change to the new `kudu` directory.
. Add a `gerrit` remote.
If using SSH to connect to Gerrit, use the following command to add the Gerrit
remote (replace <username> with your GitHub username):
git remote add gerrit ssh://<username>
If using HTTP or HTTPS to connect to Gerrit, use the following command to add
the Gerrit remote (http:// also works):
git remote add gerrit
If you are using Gerrit's HTTP or HTTPS endpoints and prefer not to type a
username and password each time you submit a patch, you can put your login and
password into a `.netrc` file located at `$HOME/.netrc` and Git will use it.
The password is stored as plaintext and the file format is as follows:
machine <hostname>
login <username>
password <password>
. Run the following command to install the Gerrit `commit-msg` hook:
cd kudu
gitdir=$(git rev-parse --git-dir)
curl -LSsf -o ${gitdir}/hooks/commit-msg
chmod +x ${gitdir}/hooks/commit-msg
. Be sure you have set the Kudu repository to use `pull --rebase` by default. You
can use the following two commands, assuming you have only ever checked out `master`
so far:
git config branch.autosetuprebase always
git config branch.master.rebase true
If you have already checked out branches other than `master`, repeat the second
command for each of them, replacing `master` with the branch name.
=== Submitting Patches
To submit a patch, first commit your change (using a descriptive multi-line
commit message if possible), then push the request to the `gerrit` remote. For instance, to push a change
to the `master` branch:
git push gerrit HEAD:refs/for/master --no-thin
or to push a change to the `gh-pages` branch (to update the website):
git push gerrit HEAD:refs/for/gh-pages --no-thin
TIP: While preparing a patch for review, it's a good idea to follow
link:[generic git commit guidelines and good practices].
NOTE: The `--no-thin` argument is a workaround to prevent an error in Gerrit. See
TIP: Consider creating Git aliases for the above commands. Gerrit also includes
a command-line tool called
which you may find helpful.
TIP: You can add reviewers automatically for a patch by appending the `r` option,
followed by their GitHub username or associated email address, to the remote
branch name:
git push gerrit HEAD:refs/for/master%r=githubuser,
TIP: To find possible reviewer candidates for your commit, use `git blame` or
`git log` to find out who is involved with the area you're touching. It's also a
good idea to add as a reviewer whoever is involved with the JIRA you're working
on.
Gerrit will add a change ID to your commit message and will create a Gerrit review,
whose URL will be emitted as part of the push reply. If desired, you can send a message
to the `kudu-dev` mailing list, explaining your patch and requesting review.
After getting feedback, you can update or amend your commit (for instance, using
a command like `git commit --amend`) while leaving the Change
ID intact. Push your change to Gerrit again, and this will create a new patch set
in Gerrit and notify all reviewers about the change.
When your code has been reviewed and is ready to be merged into the Kudu code base,
a Kudu committer will merge it using Gerrit. You can discard your local branch.
=== Abandoning a Review
If your patch is not accepted or you decide to pull it from consideration, you can
use the Gerrit UI to *Abandon* the patch. It will still show in Gerrit's history,
but will not be listed as a pending review.
=== Reviewing Patches In Gerrit
You can view a unified or side-by-side diff of changes in Gerrit using the web UI.
To leave a comment, click the relevant line number or highlight the relevant part
of the line, and type 'c' to bring up a comment box. To submit your comments and/or
your review status, go up to the top level of the review and click *Reply*. You can
add additional top-level comments here, and submit them.
To check out code from a Gerrit review, click *Download* and paste the relevant Git
commands into your Git client. You can then update the commit and push to Gerrit to
submit a patch to the review, even if you were not the original author.
Gerrit allows you to vote on a review. A vote of `+2` from at least one committer
(besides the submitter) is required before the patch can be merged.
== Code Style
=== {cpp} Code Style
Get familiar with these guidelines so that your contributions can be reviewed and
integrated quickly and easily.
In general, Kudu follows the
link:[Google {cpp} Style Guide].
A `clang-format` file is provided in `src/kudu/.clang-format` which allows
automatic formatting of source code whitespace, indentation, etc. Not all
existing code conforms to this automatic formatting, so prefer using
`clang-format-diff` to format only the lines changed by your patch. For example,
after making a commit, run the following from the root of your checked-out
source tree:
git show -U0 | build-support/ -i -p1
git commit -a --amend
=== Exceptions from Google Style Guide
Kudu's code base makes the following notable exceptions from the Google Style Guide
referenced above:
==== Notes on {cpp} 11
Kudu uses {cpp} 11. Check out this handy guide to {cpp} 11 move semantics and rvalue
references.
We aim to follow most of the same guidelines, such as, where possible, migrating
away from `foo.Pass()` in favor of `std::move(foo)`.
==== Limitations on `boost` Use
`boost` classes from header-only libraries can be used in cases where a suitable
replacement does not exist in the Kudu code base. However:
* Do not introduce dependencies on `boost` classes where equivalent functionality
exists in the standard {cpp} library or in `src/kudu/gutil/`. For example, prefer
`strings::Split()` from `gutil` rather than `boost::split`.
* Prefer using functionality from `boost` rather than re-implementing the same
functionality, _unless_ using the `boost` functionality requires excessive use of
{cpp} features which are disallowed by our style guidelines. For example,
`boost::spirit` is heavily based on template metaprogramming and should not be used.
* Do not use `boost` in any public headers for the Kudu {cpp} client, because
`boost` commonly breaks backward compatibility, and passing data between two
`boost` versions (one by the user, one by Kudu) causes serious issues.
When in doubt about introducing a new dependency on any `boost` functionality,
it is best to email `` to start a discussion.
==== Line length
The Kudu team allows lines of up to 100 characters, rather than Google's standard of 80.
Try to keep lines under 80 characters where possible, but you can spill over to 100 if
necessary.
==== Pointers
.Smart Pointers and Singly-Owned Pointers
Generally, most objects should have clear "single-owner" semantics.
Most of the time, singly-owned objects can be wrapped in a `unique_ptr<>`
which ensures deletion on scope exit and prevents accidental copying.
If an object is singly owned, but referenced from multiple places, such as when
the pointed-to object is known to be valid at least as long as the pointer itself,
associate a comment with the constructor which takes and stores the raw pointer,
as in the following example.
// 'blah' must remain valid for the lifetime of this class
MyClass(const Blah* blah) :
  blah_(blah) {
}
WARNING: Using `std::auto_ptr` is strictly disallowed because of its difficult and
bug-prone semantics. In addition, `std::auto_ptr` has been deprecated
since {cpp}11.
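The single-owner guidance above can be sketched roughly as follows. This is only an illustration: `Blah`, `Owner`, and `Observer` are hypothetical names, not Kudu classes. One class owns the object through a `unique_ptr`, while another stores a raw pointer and documents the lifetime requirement on its constructor.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical payload type, used only for illustration.
struct Blah {
  explicit Blah(int v) : value(v) {}
  int value;
};

// Does not own the Blah; it only observes it through a raw pointer.
class Observer {
 public:
  // 'blah' must remain valid for the lifetime of this class.
  explicit Observer(const Blah* blah) : blah_(blah) {}
  int Value() const { return blah_->value; }

 private:
  const Blah* blah_;
};

// Sole owner of a Blah: the unique_ptr deletes it when Owner is destroyed.
class Owner {
 public:
  explicit Owner(std::unique_ptr<Blah> blah) : blah_(std::move(blah)) {}
  const Blah* get() const { return blah_.get(); }

 private:
  std::unique_ptr<Blah> blah_;
};

// Reads the value through a non-owning pointer while the owner is alive.
int ObserveThroughRawPointer() {
  std::unique_ptr<Blah> blah(new Blah(42));  // C++11 has no std::make_unique
  Owner owner(std::move(blah));
  assert(owner.get() != nullptr);
  Observer observer(owner.get());  // 'owner' outlives 'observer' in this scope
  return observer.Value();
}
```

Because `Owner` is declared before `Observer`, it is destroyed after it, so the raw pointer never dangles in this sketch.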
.Smart Pointers for Multiply-Owned Pointers:
Although single ownership is ideal, sometimes it is not possible, particularly
when multiple threads are in play and the lifetimes of the pointers are not
clearly defined. In these cases, you can use either `std::shared_ptr` or
Kudu's own `scoped_refptr` from _gutil/ref_counted.h_. Each of these mechanisms
relies on reference counting to automatically delete the referent once no more
pointers remain. The key difference between these two types of pointers is that
`scoped_refptr` requires that the object extend a `RefCounted` base class, and
stores its reference count inside the object storage itself, while `shared_ptr`
maintains a separate reference count on the heap.
The pros and cons of each are:
.`shared_ptr`
* icon:plus-circle[role="green",alt="pro"] can be used with any type of object, without the
object deriving from a special base class
* icon:plus-circle[role="green",alt="pro"] part of the standard library and familiar to most
{cpp} developers
* icon:plus-circle[role="green",alt="pro"] supports the `weak_ptr` use cases:
** temporary ownership when an object needs to be accessed only if it exists
** breaking circular references between `shared_ptr` instances, if any exist due to aggregation
* icon:plus-circle[role="green",alt="pro"] you can convert from the
`shared_ptr` into the `weak_ptr` and back
* icon:plus-circle[role="green",alt="pro"] if creating an instance with
`std::make_shared<>()` only one allocation is made (since {cpp}11;
a non-binding requirement in the Standard, though)
* icon:minus-circle[role="red",alt="con"] creating a new object with
`shared_ptr<T> p(new T)` requires two allocations (one to create the ref count,
and one to create the object)
* icon:minus-circle[role="red",alt="con"] the ref count may not be near the object on the heap,
so extra cache misses may be incurred on access
* icon:minus-circle[role="red",alt="con"] the `shared_ptr` instance itself requires 16 bytes
(pointer to the ref count and pointer to the object)
.`scoped_refptr`
* icon:plus-circle[pro, role="green"] only requires a single allocation, and ref count
is on the same cache line as the object
* icon:plus-circle[pro, role="green"] the pointer only requires 8 bytes (since
the ref count is within the object)
* icon:plus-circle[pro, role="green"] you can manually increase or decrease
reference counts when more control is required
* icon:plus-circle[pro, role="green"] you can convert from a raw pointer back
to a `scoped_refptr` safely without worrying about double freeing
* icon:plus-circle[pro, role="green"] since we control the implementation, we
can implement features, such as debug builds that capture the stack trace of every
referent to help debug leaks.
* icon:minus-circle[con, role="red"] the referred-to object must inherit
from `RefCounted`
* icon:minus-circle[con, role="red"] does not support the `weak_ptr` use cases
Since `scoped_refptr` is generally faster and smaller, try to use it
rather than `shared_ptr` in new code. Existing code uses `shared_ptr`
in many places. When interfacing with that code, you can continue to use `shared_ptr`.
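To make the trade-offs above more concrete, here is a minimal sketch that contrasts the two approaches. `IntrusiveRefCounted` and `IntrusivePtr` are hypothetical, simplified stand-ins written for this example; they are not Kudu's `RefCounted` or `scoped_refptr`, whose real interfaces may differ. The sketch shows that an intrusive pointer is one raw pointer wide and can safely re-wrap a raw pointer, and that `weak_ptr` can observe a `shared_ptr`-managed object without keeping it alive.

```cpp
#include <atomic>
#include <memory>
#include <utility>

// Hypothetical intrusive base class: the reference count lives in the object.
class IntrusiveRefCounted {
 public:
  void AddRef() const { count_.fetch_add(1, std::memory_order_relaxed); }
  // Returns true when the last reference has just been dropped.
  bool Release() const {
    return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

 protected:
  IntrusiveRefCounted() : count_(0) {}
  ~IntrusiveRefCounted() = default;

 private:
  mutable std::atomic<int> count_;
};

// Hypothetical minimal intrusive smart pointer holding a single raw pointer.
template <typename T>
class IntrusivePtr {
 public:
  explicit IntrusivePtr(T* p = nullptr) : p_(p) { if (p_) p_->AddRef(); }
  IntrusivePtr(const IntrusivePtr& o) : p_(o.p_) { if (p_) p_->AddRef(); }
  IntrusivePtr& operator=(IntrusivePtr o) { std::swap(p_, o.p_); return *this; }
  ~IntrusivePtr() { if (p_ && p_->Release()) delete p_; }
  T* get() const { return p_; }
  T* operator->() const { return p_; }

 private:
  T* p_;
};

struct Node : public IntrusiveRefCounted {
  int value = 7;
};

// Wrapping the same raw pointer twice is safe here because both handles share
// the in-object count. Doing this with two shared_ptrs would double free.
bool RewrapRawPointerIsSafe() {
  IntrusivePtr<Node> a(new Node);
  IntrusivePtr<Node> b(a.get());
  return a.get() == b.get() && b->value == 7;
}

// weak_ptr observes a shared_ptr-managed object without extending its life.
bool WeakPtrExpiresWithLastOwner() {
  std::weak_ptr<int> weak;
  {
    std::shared_ptr<int> strong = std::make_shared<int>(5);
    weak = strong;
    if (weak.lock() == nullptr) return false;  // still alive inside the scope
  }
  return weak.expired();
}
```

On typical 64-bit platforms `sizeof(IntrusivePtr<Node>)` equals the size of one raw pointer, whereas a `shared_ptr` is usually two pointers wide; the exact `shared_ptr` size depends on the standard library implementation.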
==== Function Binding and Callbacks
All code should use {cpp}11 lambdas to capture and manage functors. Functions that
take a lambda as an argument should use `std::function` as the argument's
type. Do not use `boost::bind` or `std::bind` to create functors. Lambdas offer
the compiler greater opportunity to inline, and `std::bind` in particular is
link:[error-prone] and has a proclivity towards heap
allocation for storing bound parameters.
Until Kudu is upgraded to {cpp}14, lambda support will be
link:[somewhat incomplete]. For example, it
is not possible in {cpp}11 to capture an argument by move. Nor is it possible
to define new variables in the context of a lambda capture. Workarounds for
these deficiencies exist, and they must be used in the interim.
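The guidance above could look roughly like this in practice. The function names are hypothetical, and the move-capture workaround shown (handing the move-only object to a `shared_ptr` and capturing that by copy) is just one possible approach under {cpp}11; the text above does not say which workarounds Kudu itself uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Accepts any callable with this signature via std::function.
int ApplyTwice(const std::function<int(int)>& fn, int x) {
  assert(fn != nullptr);  // an empty std::function must not be called
  return fn(fn(x));
}

// A lambda that captures a local by value takes the place of a bound functor.
int AddTenTwice(int x) {
  const int delta = 10;
  return ApplyTwice([delta](int v) { return v + delta; }, x);
}

// C++11 lambdas cannot capture by move. One possible workaround (an
// assumption for this sketch): transfer the move-only object into a
// shared_ptr and capture the shared_ptr by copy.
std::function<std::size_t()> MakeLengthGetter(std::unique_ptr<std::string> s) {
  std::shared_ptr<std::string> shared(std::move(s));
  return [shared]() { return shared->size(); };
}
```

The `shared_ptr` is needed because `std::function` requires a copyable callable, so a lambda holding a `unique_ptr` could not be stored in it even with {cpp}14 move capture.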
==== GFlags
Kudu uses gflags for both command-line and file-based configuration. Use these guidelines
to add a new gflag. All new gflags must conform to these
guidelines. Existing non-conformant ones will be made conformant in time.
.Name
The gflag's name conveys a lot of information, so choose a good name. The name
will propagate into other systems, such as the
link:configuration_reference.html[Configuration Reference].
* The different parts of a multi-word name should be separated by underscores.
For example, `fs_data_dirs`.
* The name should be prefixed with the context that it affects. For example,
`webserver_num_worker_threads` and `cfile_default_block_size`. Context can be
difficult to define, so bear in mind that this prefix will be
used to group similar gflags together. If the gflag affects the entire
process, it should not be prefixed.
* If the gflag is for a quantity, the name should be suffixed with the units.
For example, `tablet_copy_idle_timeout_ms`.
* Where possible, use short names. This will save time for those entering
command line options by hand.
* The name is part of Kudu's compatibility contract, and should not change
without very good reason.
.Default value
Choosing a default value is generally simple, but like the name, it propagates
into other systems.
* The default value is part of Kudu's compatibility contract, and should not
change without very good reason.
.Description
The gflag's description should supplement the name and provide additional
context and information. Like the name, the description propagates into other
systems.
* The description may include multiple sentences. Each should begin with a
capital letter, end with a period, and begin one space after the previous.
* The description should NOT include the gflag's type or default value; they are
provided out-of-band.
* The description should be in the third person. Do not use words like `you`.
* A gflag description can be changed freely; it is not expected to remain the
same across Kudu releases.
.Tags
Kudu's gflag tagging mechanism adds machine-readable context to each gflag, for
use in consuming systems such as documentation or management tools. See the large block
comment in _flag_tags.h_ for guidelines.
.Miscellaneous
* Avoid creating multiple gflags for the same logical parameter. For
example, many Kudu binaries need to configure a WAL directory. Rather than
creating `foo_wal_dir` and `bar_wal_dir` gflags, it is better to have a single
`kudu_wal_dir` gflag for use universally.
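Some of the naming and description rules above are mechanical enough to check in code. The following is a sketch of a hypothetical lint helper; it is not part of Kudu or gflags. It assumes names are lowercase, which is inferred from the examples above rather than stated as a rule, and it covers only the description rules that are easy to automate.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: accepts names made of lowercase letters and digits,
// with single underscores separating the words (e.g. "fs_data_dirs").
bool IsConformantFlagName(const std::string& name) {
  if (name.empty() || name.front() == '_' || name.back() == '_') return false;
  char prev = '\0';
  for (char c : name) {
    const unsigned char uc = static_cast<unsigned char>(c);
    const bool allowed = std::islower(uc) || std::isdigit(uc) || c == '_';
    if (!allowed) return false;
    if (c == '_' && prev == '_') return false;  // no empty words
    prev = c;
  }
  return true;
}

// Hypothetical helper: checks that a description begins with a capital
// letter and ends with a period. It does not check tone or content.
bool IsConformantDescription(const std::string& desc) {
  if (desc.empty()) return false;
  return std::isupper(static_cast<unsigned char>(desc.front())) != 0 &&
         desc.back() == '.';
}
```

Rules such as choosing a meaningful context prefix or a unit suffix still need human judgment during review.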
=== Java Code Style
==== Preconditions vs assert in the Kudu Java client
Use `assert` for verification of the static (i.e. non-runtime) internal
invariants. Internal means the pre- and post-conditions which are
completely under control of the code of a class or a function itself and cannot
be influenced by input parameters and other runtime/dynamic conditions.
Use `Preconditions` for verification of the input parameters and the other
conditions which are outside of the control of the local code, or conditions
which are dependent on the state of other objects/components in runtime.
Object pop() {
  // Use Preconditions here because the external user of the class should not
  // call pop() on an empty stack, but the stack itself is internally consistent.
  Preconditions.checkState(curSize > 0, "stack must not be empty");
  Object toReturn = data[--curSize];
  // Use an assert here because if we ended up with a negative size counter,
  // that's an indication of a broken implementation of the stack; i.e. it's
  // an invariant, not a state check.
  assert curSize >= 0;
  return toReturn;
}
However, keep in mind that `assert` checks are enabled only when the JVM is
run with the `-ea` option. So, if some dynamic condition is crucial for the
overall consistency (e.g. data loss can occur if the condition is not
satisfied and the code continues its execution), consider throwing an
`AssertionError` explicitly:
if (!isCriticalConditionSatisfied) {
  throw new AssertionError("cannot continue: data loss is possible otherwise");
}
===== Checking code style with Gradle checkStyle
Before posting a Java patch to Gerrit for review, make sure to check Java code
style with the Gradle `checkstyle` plugin. See
link:[Gradle Checkstyle Plugin documentation]
for more information.
./gradlew checkstyle
===== References
* link:[Programming With Assertions]
* link:[Guava Preconditions Explained]
=== `CMake` Style Guide
`CMake` allows commands in lower, upper, or mixed case. To keep
the CMake files consistent, please use the following guidelines:
* *built-in commands* in lowercase
* *built-in arguments* in uppercase
message(STATUS "message goes here")
* *custom commands or macros* in uppercase
== Third-party dependencies
Like many complex applications, Kudu depends on a number of third-party
dependencies. Some (such as OpenSSL) are expected to be found on the build
system itself. However, the vast majority are "vendored" in the `thirdparty/`
tree. These dependencies are all versioned and pinned. They are also
source-based; the dependencies are built before the rest of Kudu is built using
the `` script.
Third-party dependencies and their versions are defined in ``. The source
code for each dependency is located in a tarball, typically named
`<dependency>-<version>.tar.gz`. The tarballs are stored in an Amazon S3 bucket
operated by Cloudera. The bucket is cached in Amazon CloudFront to maximize
download performance and reliability.
If as part of your contribution you need to add a new third-party dependency,
here's what you need to do:
. Begin by preparing a source tarball for the new dependency. Ideally it should
be a vanilla tarball obtained directly from an upstream project, but sometimes
either its name or the contents need to be massaged to meet Kudu's expectations.
. Add the new dependency to the third-party build. You'll need to modify
``, ``, ``, and
. On your local machine, extract the source tarball into `thirdparty/src`.
. Test the dependency's build by running ` <dependency>`.
This should build and install the dependency into `thirdparty/installed`, making
it available for the Kudu build.
. Test the Kudu build using the new dependency. You will need to pass
`NO_REBUILD_THIRDPARTY=1` in the environment to prevent the Kudu build from
rebuilding the `thirdparty/` tree (whereupon it'll fail to download the new
dependency).
. When everything checks out, contact a Kudu committer who is also a Cloudera
employee and ask them to upload your source tarball to S3.
. After the tarball has been uploaded, test the entire third-party build
end-to-end by running ``.
. Publish your patch to Gerrit. With the tarball uploaded, the precommit builds
should download and build the new dependency successfully.
== Testing
All new code should have tests.::
Add new tests either in existing files, or create new test files as necessary.
All bug fixes should have tests.::
It's OK to fix a bug without adding a
new test if it's triggered by an existing test case. For example, if a
race shows up when running a multi-threaded system test after 20
minutes or so, it's worth trying to make a more targeted test case to
trigger the bug. But if that's hard to do, the existing system test
should be enough.
Tests should run quickly (< 1s).::
If you want to write a time-intensive
test, make the runtime dependent on `KuduTest#AllowSlowTests`, which is
enabled via the `KUDU_ALLOW_SLOW_TESTS` environment variable and is
used by Jenkins test execution.
Tests which run a number of iterations of some task should use a `gflags` command-line argument for the number of iterations.::
This is handy for writing quick stress tests or performance tests.
Commits which may affect performance should include before/after `perf-stat(1)` output.::
This will show performance improvement or non-regression.
Performance-sensitive code should include some test case which can be used as a
targeted benchmark.
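As a rough illustration of making a test's runtime depend on a slow-test switch, consider the sketch below. Both functions are hypothetical stand-ins: Kudu's real helper is `KuduTest#AllowSlowTests`, and the way it parses `KUDU_ALLOW_SLOW_TESTS` may differ from the simple comparison assumed here. In a real test the iteration counts would more likely come from a gflag, as suggested above.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a slow-test switch. Assumes the variable is set
// to "1" to enable slow tests; the real helper may accept other values.
bool SlowTestsAllowed() {
  const char* v = std::getenv("KUDU_ALLOW_SLOW_TESTS");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

// Picks a small iteration count for quick runs and a larger one when slow
// tests are allowed.
int ChooseIterations(bool allow_slow, int fast_iters, int slow_iters) {
  return allow_slow ? slow_iters : fast_iters;
}
```

A test body could then loop `ChooseIterations(SlowTestsAllowed(), 10, 10000)` times, keeping the default run well under a second.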
== Documentation
See the
link:[Documentation Style Guide]
for guidelines about contributing to the official Kudu documentation.
== Blog posts
=== Writing a post on the Kudu blog
If you are using or integrating with Kudu, consider doing a write-up about your
use case and your integration with Kudu and submitting it to be posted as an
article on the Kudu blog. People in the community love to read about how Kudu
is being used around the world.
Consider checking with the project developers on the Kudu Slack instance or on[] if you have any questions about
the content or the topic of a potential Kudu blog post.
=== Submitting a blog post in Google Doc format
If you don't have the time to learn Markdown or to submit a Gerrit change
request, but you would still like to submit a post for the Kudu blog, feel free
to write your post in Google Docs format and share the draft with us publicly
on[] -- we'll be happy to review
it and post it to the blog for you once it's ready to go.
If you would like to submit the post directly to Gerrit for review in Markdown
format (the developers will appreciate it if you do), please read below.
=== How to format a Kudu blog post
Blog posts live in the `gh-pages` branch under the `_posts` directory in
Markdown format. They're automatically rendered by Jekyll, so for those familiar
with Markdown or Jekyll, submitting a blog post should be fairly
straightforward.
Each post is a separate file named in the following format:
`YYYY-MM-DD-title-of-the-post.md`
The `YYYY-MM-DD` part is the date, which will be included in the link as
`/YYYY/MM/DD`; `title-of-the-post` is then used verbatim. The words should be
separated by dashes and should contain only lowercase letters of the English
alphabet and numbers. Finally, the `.md` extension will be replaced with
`.html`.
The header contains the layout information (which is always "post"), the
title and the author's name.
---
layout: post
title: Example Post
author: John Doe
---
The actual text of the blog post goes below this header, beginning with the
"lead" which is a short excerpt that shows up in the index. This is separated
by the `<!--more-\->` string from the rest of the post.
=== How to check the rendering of a blog post
Once you've finished the post, you can check that it looks good by running
`site_tool`, a script in the root of the `gh-pages` branch that can
start up Jekyll and serve the rendered site locally. To run it, you need Ruby
and Python installed on your machine. Start it with the following command:
$ ./site_tool jekyll serve
When starting, it will print the URL where you can reach the site, but it should
be http://localhost:4000, or to reach the blog directly,
You should be able to see the title and lead of your post along with your name
at the top of this page, and after clicking on the title or the "Read full
post...", the whole post.
=== How to submit a blog post
To submit the post, you'll need to commit your change and push it to
<<_contributing_patches_using_gerrit,Gerrit>> for review. If the post is deemed
useful for the community and all comments are addressed, a committer can merge
and publish your post.
If you have a GitHub account, you can fork Kudu from and push the change to your fork too. GitHub will
automatically render it on https://<yourname> and you can link it
directly on Gerrit.
This way the reviewers can see that the post renders well without having to
download it, which can speed up the review process.