blob: 9585b83892ea091cfb33d22f7499d945c4ce8b41 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<document xmlns="http://maven.apache.org/XDOC/2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
<properties>
<title>Welcome To Apache Incubator Giraph</title>
</properties>
<body>
<section name="Welcome To Apache Incubator Giraph">
<p>Web and online social graphs have been rapidly growing in size and scale during the past decade. In 2008, Google estimated that the number of web pages reached over a trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to grow much more in the future. Processing these graphs plays a big role in relevant and personalized information for users, such as results from a search engine or news in an online social networking site.</p>
<p>Graph processing platforms to run large-scale algorithms (such as page rank, shared connections, personalization-based popularity, etc.) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the map-reduce computing model has been well adopted and the most deployed map-reduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon's EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.</p>
<p>Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.</p>
</section>
<section name="Incubator disclaimer">
<p>Apache Giraph is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the name of Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</p>
</section>
<section name="Presentations">
We're working hard to build a community of users and developers around Giraph. As part of that outreach, we're giving presentations and talks to help bring people up to speed.
<ul>
<li>Avery Ching introduced Giraph at Hadoop Summit 2011. <a href="http://www.youtube.com/watch?v=l4nQjAG6fac">Watch the video</a>.</li>
<li>An updated slidedeck of the Hadoop Summit talk was presented at HortonWorks. <a href="http://www.slideshare.net/averyching/20111014hortonworks">Read the slides</a>.</li>
</ul>
</section>
<section name="Supported versions of Apache Hadoop">
<p> Hadoop versions for use with Giraph:
<ul>
<li>Secure Hadoop versions: Apache Hadoop 0.20.203, 0.20.204, other secure versions may work as well</li>
<li>Unsecure Hadoop versions: Apache Hadoop 0.20.1, 0.20.2, 0.20.3. While we provide support for unsecure Hadoop with the maven profile 'hadoop_non_secure', we have been primarily focusing on secure Hadoop releases at this time.</li>
<li>Other distributions that included Apache Hadoop reported to work include: Cloudera CDH3u0, CDH3u1</li>
</ul>
</p>
</section>
<section name="Getting involved">
<p>Giraph is a new project and we're looking to quickly build a community of users and contributors. All types of help is appreciated: contributing patches, writing documentation, posing and answering questions on the mailing list, even <a href="https://issues.apache.org/jira/browse/GIRAPH-4">graphic design</a>. Here's how to get involved with Giraph (or any Apache project):</p>
<ul>
<li>Subscribe to the <a href="mail-lists.html">mailing lists</a>, particularly the user and dev list, and follow their activity for a while to get a feel for the state of the project and what the community is working on.</li>
<li>Browse through <a href="issue-tracking.html">Giraph's JIRA</a>, our issue tracking system, to find issues you may be interested in working on. To help new contributors pitch in quickly, we maintain a <a href="http://bit.ly/newbie_apache_giraph_issues">set of JIRAs</a> that focus on getting new contributors started with the mechanics of generating a patch &#151; downloading the source, changing a couple lines, creating a patch, verifying its correctness, uploading it to JIRA and working with the community &#151; rather that deep technical issues within Giraph itself. These are good issues with which to join the community. See <a href="#Generatingpatches">below</a> for detailed instructions on creating patches.</li>
<li>Try out the examples and play with Giraph on your cluster. Be sure to ask questions on the mailing list or open new JIRAs if you run into issues with your particular configuration.</li>
</ul>
</section>
<section name="Building and testing">
<p>You will need the following:</p>
<ul>
<li>Java 1.6</li>
<li>Maven 3 or higher. Giraph uses the <a href="http://sonatype.github.com/munge-maven-plugin/">munge plugin</a>, which requires Maven 3, to support multiple versions of Hadoop. Also, the web site plugin requires Maven 3.</li>
</ul>
<p>Use the maven commands with secure Hadoop to:
<ul>
<li>compile (i.e <tt>mvn compile</tt>)</li>
<li>package (i.e. <tt>mvn package</tt>)</li>
<li>test (i.e. <tt>mvn test</tt>) For testing, one can submit the test to a running Hadoop instance (i.e. <tt>mvn test -Dprop.mapred.job.tracker=localhost:50300</tt>)</li>
</ul>
For the non-secure versions of Hadoop, run the maven commands with the
additional argument <tt>-Dhadoop=non_secure</tt> to enable the maven profile
<tt>hadoop_non_secure</tt>. An example compilation command is
<tt>mvn -Dhadoop=non_secure compile</tt>.</p>
</section>
<section name="Notes">
Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of counters one can use, which is set to 120 by default. This limit restricts the number of iterations/supersteps possible in Giraph. This limit can be increased by setting a parameter <tt>mapreduce.job.counters.limit</tt> in job tracker's config file mapred-site.xml.
</section>
<section name="Generating patches" id="Generatingpatches">
<p>Follow these steps to generate a patch that can be attached to a JIRA issue for review.</p>
<ul>
<li>Check out the Giraph source, either from the <a href="source-repository.html">subversion repository</a> or from a <a href="http://git.apache.org/">git mirror</a>. Note that the git mirrors may lag slightly behind the subversion repos.</li>
<li>Make the changes necessary for your particular issue. Try to avoid unnecessary changes, such as extra whitespace or formatting changes. Include a unit test, or be ready to justify in the JIRA why one isn't necessary.</li>
<li>Verify the new and existing tests continue to pass via <tt>mvn test</tt>. Verify the change works as expected on a real cluster, if possible. If one's not available for testing, mention it on the JIRA so another contributor can verify.</li>
<li>Verify that RAT is ok with the changes that you've made via <tt>mvn rat:check</tt>. Also check that the patch follows Giraph's style guidelines (found in the source root in <tt>CODE_CONVENTIONS</tt>).</li>
<li>Generate a patch either by <tt>svn diff > GIRAPH-{ISSUE-NUMBER}.patch</tt> or <tt>git diff --no-prefix trunk > GIRAPH-{ISSUE_NUMBER}.patch</tt> (the <tt>--no-prefix</tt> option is necessary to make the patch compatible with Apache's subversion repository). For subsequent patches, if necessary, number each version to make it easier for reviewers to track their progress.</li>
<li>Attach the patch to the JIRA issue (click <em>More Actions</em> and then <em>Attach File</em> from the top menu) using the comment to briefly explain what changes it contains and what testing was done. Mark the JIRA as <em>Patch Available</em> to let reviewers know it's ripe for evaluation.</li>
<li>Optionally, you can open <a href="https://reviews.apache.org/">reviewboard</a> request for the patch, although not all reviewers use this tool.</li>
</ul>
<p>A committer should review the patch shortly and either provide feedback for a new version, or commit it to the Giraph source.</p>
</section>
</body>
</document>