| Giraph : Large-scale graph processing on Hadoop |
| |
| Web and online social graphs have been rapidly growing in size and |
| scale during the past decade. In 2008, Google estimated that the |
| number of web pages reached over a trillion. Online social networking |
| and email sites, including Yahoo!, Google, Microsoft, Facebook, |
| LinkedIn, and Twitter, have hundreds of millions of users and are |
| expected to grow much more in the future. Processing these graphs |
| plays a big role in relevant and personalized information for users, |
| such as results from a search engine or news in an online social |
| networking site. |
| |
| Graph processing platforms to run large-scale algorithms (such as page |
| rank, shared connections, personalization-based popularity, etc.) have |
| become quite popular. Some recent examples include Pregel and HaLoop. |
| For general-purpose big data computation, the map-reduce computing |
| model has been well adopted and the most deployed map-reduce |
| infrastructure is Apache Hadoop. We have implemented a |
| graph-processing framework that is launched as a typical Hadoop job to |
| leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph |
| builds upon the graph-oriented nature of Pregel but additionally adds |
| fault-tolerance to the coordinator process with the use of ZooKeeper |
| as its centralized coordination service. |
| |
| Giraph follows the bulk-synchronous parallel model relative to graphs |
| where vertices can send messages to other vertices during a given |
| superstep. Checkpoints are initiated by the Giraph infrastructure at |
| user-defined intervals and are used for automatic application restarts |
| when any worker in the application fails. Any worker in the |
| application can act as the application coordinator and one will |
| automatically take over if the current application coordinator fails. |
| |
| ------------------------------- |
| |
| Hadoop versions for use with Giraph: |
| |
| Secure Hadoop versions: |
| |
| - Apache Hadoop 1 (latest version: 1.2.1) |
| |
| This is the default version used by Giraph: if you do not specify a |
| profile with the -P flag, maven will use this version. You may also |
| explicitly specify it with "mvn -Phadoop_1 <goals>". |
| |
| - Apache Hadoop 2 (latest version: 2.5.1) |
| |
| This is the latest version of Hadoop 2 (supporting YARN in addition |
| to MapReduce) Giraph could use. You may tell maven to use this version |
| with "mvn -Phadoop_2 <goals>". |
| |
| - Apache Hadoop 0.23.1 |
| |
| You may tell maven to use this version with "mvn -Phadoop_0.23 <goals>". |
| |
| - Apache Hadoop 0.20.203.0 |
| |
| You may tell maven to use this version with "mvn -Phadoop_0.20.203 <goals>". |
| |
| - Apache Hadoop Yarn with 2.2.0 |
| |
| You may tell maven to use this version with "mvn -Phadoop_yarn -Dhadoop.version=2.2.0 <goals>". |
| |
| - Apache Hadoop 3.0.0-SNAPSHOT |
| |
| You may tell maven to use this version with "mvn -Phadoop_trunk <goals>". |
| |
| Unsecure Hadoop versions: |
| |
| - Apache Hadoop 0.20.1, 0.20.2, 0.20.3 |
| |
| You may tell maven to use 0.20.2 with "mvn -Phadoop_non_secure <goals>". |
| |
| - Facebook Hadoop releases: https://github.com/facebook/hadoop-20, Master branch |
| |
| You may tell maven to use this version with "mvn -Phadoop_facebook <goals>" |
| |
| -- Other versions reported working include: |
| --- Cloudera CDH3u0, CDH3u1 |
| |
| While we provide support for unsecure and Facebook versions of Hadoop |
| with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook', |
| respectively, we have been primarily focusing on secure Hadoop releases |
| at this time. |
| |
| ------------------------------- |
| |
| Building and testing: |
| |
| You will need the following: |
| - Java 1.6 |
| - Maven 3 or higher. Giraph uses the munge plugin |
| (http://sonatype.github.com/munge-maven-plugin/), |
| which requires Maven 3, to support multiple versions of Hadoop. Also, the |
| web site plugin requires Maven 3. |
| |
| Use the maven commands with secure Hadoop to: |
| - compile (i.e mvn compile) |
| - package (i.e. mvn package) |
| - test (i.e. mvn test) |
| |
| For the non-secure versions of Hadoop, run the maven commands with the |
| additional argument '-Phadoop_non_secure'. |
| Example compilation commands is 'mvn -Phadoop_non_secure compile'. |
| |
| For the Facebook Hadoop release, run the maven commands with the |
| additional arguments '-Phadoop_facebook'. |
| Example compilation commands is 'mvn -Phadoop_facebook compile'. |
| |
| ------------------------------- |
| |
| Developing: |
| |
| Giraph is a multi-module maven project. The top level generates a POM that |
| carries information common to all the modules. Each module creates a jar with |
| the code contained in it. |
| |
| The giraph/ module contains the main giraph code. If you just want to work on |
| the main code only you can do all your work inside this subdirectory. |
| Specifically you would do something like: |
| |
| giraph-root/giraph/ $ mvn verify # build from current state |
| giraph-root/giraph/ $ mvn clean # wipe out build files |
| giraph-root/giraph/ $ mvn clean verify # build from fresh state |
| giraph-root/giraph/ $ mvn install # install jar to local repository |
| |
| The giraph-formats/ module contains hooks to read/write from various |
| formats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This |
| means if you make local changes to the giraph codebase you will first need to |
| install the giraph/ jar locally so that giraph-formats/ will pick it up. |
| In other words something like this: |
| |
| giraph-root/giraph/ $ mvn install |
| giraph-root/giraph-formats $ mvn verify |
| |
| To build everything at once you can issue the maven commands at the top level. |
| Note that we use the "install" target so that if you have any local changes to |
| giraph/ which formats needs it will get picked up because it will install |
| locally first. |
| |
| giraph-root/ $ mvn clean install |
| |
| ------------------------------- |
| |
| Scripting: |
| |
| Giraph has support for writing user logic in languages other than Java. A Giraph |
| job involves at the very least a Computation and Input/Output Formats. There are |
| other optional pieces as well like Aggregators and Combiners. |
| |
| As of this writing we support writing the Computation logic in Jython. The |
| Computation class is at the core of the algorithm so it was a natural starting |
| point. Eventually it is our goal to allow users to write any / all components of |
| their algorithms in any language they desire. |
| |
| To use Jython with our job launcher, GiraphRunner, pass the path to the script |
| as the Computation class argument. Additionally, you should set the -jythonClass |
| option to let Giraph know the name of your Jython Computation class. Lastly, you |
| will need to set -typesHolder to a class that extends Giraph's TypesHolder so |
| that Giraph can infer the types you use. Look at page-rank.py as an example. |
| |
| ------------------------------- |
| |
| How to run the unittests on a local pseudo-distributed Hadoop instance: |
| |
| As mentioned earlier, Giraph supports several versions of Hadoop. In |
| this section, we describe how to run the Giraph unittests against a single |
| node instance of Apache Hadoop 0.20.203. |
| |
| Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz) |
| from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/ |
| and unpack it into a local directory |
| |
| Follow the guide at |
| http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed |
| to setup a pseudo-distributed single node Hadoop cluster. |
| |
| Giraph’s code assumes that you can run at least 4 mappers at once, |
| unfortunately the default configuration allows only 2. Therefore you need |
| to update conf/mapred-site.xml: |
| |
| <property> |
| <name>mapred.tasktracker.map.tasks.maximum</name> |
| <value>4</value> |
| </property> |
| |
| <property> |
| <name>mapred.map.tasks</name> |
| <value>4</value> |
| </property> |
| |
| After preparing the local filesystem with: |
| |
| rm -rf /tmp/hadoop-<username> |
| /path/to/hadoop/bin/hadoop namenode -format |
| |
| you can start the local hadoop instance: |
| |
| /path/to/hadoop/bin/start-all.sh |
| |
| and finally run Giraph’s unittests: |
| |
| mvn clean test -Dprop.mapred.job.tracker=localhost:9001 |
| |
| Now you can open a browser, point it to http://localhost:50030 and watch the |
| Giraph jobs from the unittests running on your local Hadoop instance! |
| |
| |
| Notes: |
| Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of |
| counters one can use, which is set to 120 by default. This limit restricts the |
| number of iterations/supersteps possible in Giraph. This limit can be increased |
| by setting a parameter "mapreduce.job.counters.limit" in job tracker's config |
| file mapred-site.xml. |
| |