README - giraph - Git at Google

 Giraph : Large-scale graph processing on Hadoop

 Web and online social graphs have been rapidly growing in size and
 scale during the past decade.  In 2008, Google estimated that the
 number of web pages reached over a trillion.  Online social networking
 and email sites, including Yahoo!, Google, Microsoft, Facebook,
 LinkedIn, and Twitter, have hundreds of millions of users and are
 expected to grow much more in the future.  Processing these graphs
 plays a big role in relevant and personalized information for users,
 such as results from a search engine or news in an online social
 networking site.

 Graph processing platforms to run large-scale algorithms (such as page
 rank, shared connections, personalization-based popularity, etc.) have
 become quite popular.  Some recent examples include Pregel and HaLoop.
 For general-purpose big data computation, the map-reduce computing
 model has been well adopted and the most deployed map-reduce
 infrastructure is Apache Hadoop.  We have implemented a
 graph-processing framework that is launched as a typical Hadoop job to
 leverage existing Hadoop infrastructure, such as Amazon’s EC2.  Giraph
 builds upon the graph-oriented nature of Pregel but additionally adds
 fault-tolerance to the coordinator process with the use of ZooKeeper
 as its centralized coordination service.

 Giraph follows the bulk-synchronous parallel model relative to graphs
 where vertices can send messages to other vertices during a given
 superstep.  Checkpoints are initiated by the Giraph infrastructure at
 user-defined intervals and are used for automatic application restarts
 when any worker in the application fails.  Any worker in the
 application can act as the application coordinator and one will
 automatically take over if the current application coordinator fails.

 -------------------------------

 Hadoop versions for use with Giraph:

 Secure Hadoop versions:

 - Apache Hadoop 1 (latest version: 1.2.1)

   This is the default version used by Giraph: if you do not specify a
   profile with the -P flag, maven will use this version. You may also
   explicitly specify it with "mvn -Phadoop_1 <goals>".

 - Apache Hadoop 2 (latest version: 2.5.1)

   This is the latest version of Hadoop 2 (supporting YARN in addition
   to MapReduce) Giraph could use. You may tell maven to use this version
   with "mvn -Phadoop_2 <goals>".

 - Apache Hadoop 0.23.1

   You may tell maven to use this version with "mvn -Phadoop_0.23 <goals>".

 - Apache Hadoop 0.20.203.0

   You may tell maven to use this version with "mvn -Phadoop_0.20.203 <goals>".

 - Apache Hadoop Yarn with 2.2.0

   You may tell maven to use this version with "mvn -Phadoop_yarn -Dhadoop.version=2.2.0 <goals>".

 - Apache Hadoop 3.0.0-SNAPSHOT

   You may tell maven to use this version with "mvn -Phadoop_trunk <goals>".

 Unsecure Hadoop versions:

 - Apache Hadoop 0.20.1, 0.20.2, 0.20.3

   You may tell maven to use 0.20.2 with "mvn -Phadoop_non_secure <goals>".

 - Facebook Hadoop releases: https://github.com/facebook/hadoop-20, Master branch

   You may tell maven to use this version with "mvn -Phadoop_facebook <goals>"

 -- Other versions reported working include:
 ---  Cloudera CDH3u0, CDH3u1

 While we provide support for unsecure and Facebook versions of Hadoop
 with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
 respectively, we have been primarily focusing on secure Hadoop releases
 at this time.

 -------------------------------

 Building and testing:

 You will need the following:
 - Java 1.6
 - Maven 3 or higher. Giraph uses the munge plugin
   (http://sonatype.github.com/munge-maven-plugin/),
   which requires Maven 3, to support multiple versions of Hadoop. Also, the
   web site plugin requires Maven 3.

 Use the maven commands with secure Hadoop to:
 - compile (i.e mvn compile)
 - package (i.e. mvn package)
 - test (i.e. mvn test)

 For the non-secure versions of Hadoop, run the maven commands with the
 additional argument '-Phadoop_non_secure'.
 Example compilation commands is 'mvn -Phadoop_non_secure compile'.

 For the Facebook Hadoop release, run the maven commands with the
 additional arguments '-Phadoop_facebook'.
 Example compilation commands is 'mvn -Phadoop_facebook compile'.

 -------------------------------

 Developing:

 Giraph is a multi-module maven project. The top level generates a POM that
 carries information common to all the modules. Each module creates a jar with
 the code contained in it.

 The giraph/ module contains the main giraph code. If you just want to work on
 the main code only you can do all your work inside this subdirectory.
 Specifically you would do something like:

   giraph-root/giraph/ $ mvn verify            # build from current state
   giraph-root/giraph/ $ mvn clean             # wipe out build files
   giraph-root/giraph/ $ mvn clean verify      # build from fresh state
   giraph-root/giraph/ $ mvn install           # install jar to local repository

 The giraph-formats/ module contains hooks to read/write from various
 formats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This
 means if you make local changes to the giraph codebase you will first need to
 install the giraph/ jar locally so that giraph-formats/ will pick it up.
 In other words something like this:

   giraph-root/giraph/ $ mvn install
   giraph-root/giraph-formats $ mvn verify

 To build everything at once you can issue the maven commands at the top level.
 Note that we use the "install" target so that if you have any local changes to
 giraph/ which formats needs it will get picked up because it will install
 locally first.

   giraph-root/ $ mvn clean install

 -------------------------------

 Scripting:

 Giraph has support for writing user logic in languages other than Java. A Giraph
 job involves at the very least a Computation and Input/Output Formats. There are
 other optional pieces as well like Aggregators and Combiners.

 As of this writing we support writing the Computation logic in Jython. The
 Computation class is at the core of the algorithm so it was a natural starting
 point. Eventually it is our goal to allow users to write any / all components of
 their algorithms in any language they desire.

 To use Jython with our job launcher, GiraphRunner, pass the path to the script
 as the Computation class argument. Additionally, you should set the -jythonClass
 option to let Giraph know the name of your Jython Computation class. Lastly, you
 will need to set -typesHolder to a class that extends Giraph's TypesHolder so
 that Giraph can infer the types you use. Look at page-rank.py as an example.

 -------------------------------

 How to run the unittests on a local pseudo-distributed Hadoop instance:

 As mentioned earlier, Giraph supports several versions of Hadoop.  In
 this section, we describe how to run the Giraph unittests against a single
 node instance of Apache Hadoop 0.20.203.

 Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz)
 from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/
 and unpack it into a local directory

 Follow the guide at
 http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
 to setup a pseudo-distributed single node Hadoop cluster.

 Giraph’s code assumes that you can run at least 4 mappers at once,
 unfortunately the default configuration allows only 2. Therefore you need
 to update conf/mapred-site.xml:

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>4</value>
 </property>

 <property>
   <name>mapred.map.tasks</name>
   <value>4</value>
 </property>

 After preparing the local filesystem with:

 rm -rf /tmp/hadoop-<username>
 /path/to/hadoop/bin/hadoop namenode -format

 you can start the local hadoop instance:

 /path/to/hadoop/bin/start-all.sh

 and finally run Giraph’s unittests:

 mvn clean test -Dprop.mapred.job.tracker=localhost:9001

 Now you can open a browser, point it to http://localhost:50030 and watch the
 Giraph jobs from the unittests running on your local Hadoop instance!


 Notes:
 Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of
 counters one can use, which is set to 120 by default. This limit restricts the
 number of iterations/supersteps possible in Giraph. This limit can be increased
 by setting a parameter "mapreduce.job.counters.limit" in job tracker's config
 file mapred-site.xml.
	Giraph : Large-scale graph processing on Hadoop

	Web and online social graphs have been rapidly growing in size and
	scale during the past decade. In 2008, Google estimated that the
	number of web pages reached over a trillion. Online social networking
	and email sites, including Yahoo!, Google, Microsoft, Facebook,
	LinkedIn, and Twitter, have hundreds of millions of users and are
	expected to grow much more in the future. Processing these graphs
	plays a big role in relevant and personalized information for users,
	such as results from a search engine or news in an online social
	networking site.

	Graph processing platforms to run large-scale algorithms (such as page
	rank, shared connections, personalization-based popularity, etc.) have
	become quite popular. Some recent examples include Pregel and HaLoop.
	For general-purpose big data computation, the map-reduce computing
	model has been well adopted and the most deployed map-reduce
	infrastructure is Apache Hadoop. We have implemented a
	graph-processing framework that is launched as a typical Hadoop job to
	leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph
	builds upon the graph-oriented nature of Pregel but additionally adds
	fault-tolerance to the coordinator process with the use of ZooKeeper
	as its centralized coordination service.

	Giraph follows the bulk-synchronous parallel model relative to graphs
	where vertices can send messages to other vertices during a given
	superstep. Checkpoints are initiated by the Giraph infrastructure at
	user-defined intervals and are used for automatic application restarts
	when any worker in the application fails. Any worker in the
	application can act as the application coordinator and one will
	automatically take over if the current application coordinator fails.

	-------------------------------

	Hadoop versions for use with Giraph:

	Secure Hadoop versions:

	- Apache Hadoop 1 (latest version: 1.2.1)

	This is the default version used by Giraph: if you do not specify a
	profile with the -P flag, maven will use this version. You may also
	explicitly specify it with "mvn -Phadoop_1 <goals>".

	- Apache Hadoop 2 (latest version: 2.5.1)

	This is the latest version of Hadoop 2 (supporting YARN in addition
	to MapReduce) Giraph could use. You may tell maven to use this version
	with "mvn -Phadoop_2 <goals>".

	- Apache Hadoop 0.23.1

	You may tell maven to use this version with "mvn -Phadoop_0.23 <goals>".

	- Apache Hadoop 0.20.203.0

	You may tell maven to use this version with "mvn -Phadoop_0.20.203 <goals>".

	- Apache Hadoop Yarn with 2.2.0

	You may tell maven to use this version with "mvn -Phadoop_yarn -Dhadoop.version=2.2.0 <goals>".

	- Apache Hadoop 3.0.0-SNAPSHOT

	You may tell maven to use this version with "mvn -Phadoop_trunk <goals>".

	Unsecure Hadoop versions:

	- Apache Hadoop 0.20.1, 0.20.2, 0.20.3

	You may tell maven to use 0.20.2 with "mvn -Phadoop_non_secure <goals>".

	- Facebook Hadoop releases: https://github.com/facebook/hadoop-20, Master branch

	You may tell maven to use this version with "mvn -Phadoop_facebook <goals>"

	-- Other versions reported working include:
	--- Cloudera CDH3u0, CDH3u1

	While we provide support for unsecure and Facebook versions of Hadoop
	with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
	respectively, we have been primarily focusing on secure Hadoop releases
	at this time.

	-------------------------------

	Building and testing:

	You will need the following:
	- Java 1.6
	- Maven 3 or higher. Giraph uses the munge plugin
	(http://sonatype.github.com/munge-maven-plugin/),
	which requires Maven 3, to support multiple versions of Hadoop. Also, the
	web site plugin requires Maven 3.

	Use the maven commands with secure Hadoop to:
	- compile (i.e mvn compile)
	- package (i.e. mvn package)
	- test (i.e. mvn test)

	For the non-secure versions of Hadoop, run the maven commands with the
	additional argument '-Phadoop_non_secure'.
	Example compilation commands is 'mvn -Phadoop_non_secure compile'.

	For the Facebook Hadoop release, run the maven commands with the
	additional arguments '-Phadoop_facebook'.
	Example compilation commands is 'mvn -Phadoop_facebook compile'.

	-------------------------------

	Developing:

	Giraph is a multi-module maven project. The top level generates a POM that
	carries information common to all the modules. Each module creates a jar with
	the code contained in it.

	The giraph/ module contains the main giraph code. If you just want to work on
	the main code only you can do all your work inside this subdirectory.
	Specifically you would do something like:

	giraph-root/giraph/ $ mvn verify # build from current state
	giraph-root/giraph/ $ mvn clean # wipe out build files
	giraph-root/giraph/ $ mvn clean verify # build from fresh state
	giraph-root/giraph/ $ mvn install # install jar to local repository

	The giraph-formats/ module contains hooks to read/write from various
	formats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This
	means if you make local changes to the giraph codebase you will first need to
	install the giraph/ jar locally so that giraph-formats/ will pick it up.
	In other words something like this:

	giraph-root/giraph/ $ mvn install
	giraph-root/giraph-formats $ mvn verify

	To build everything at once you can issue the maven commands at the top level.
	Note that we use the "install" target so that if you have any local changes to
	giraph/ which formats needs it will get picked up because it will install
	locally first.

	giraph-root/ $ mvn clean install

	-------------------------------

	Scripting:

	Giraph has support for writing user logic in languages other than Java. A Giraph
	job involves at the very least a Computation and Input/Output Formats. There are
	other optional pieces as well like Aggregators and Combiners.

	As of this writing we support writing the Computation logic in Jython. The
	Computation class is at the core of the algorithm so it was a natural starting
	point. Eventually it is our goal to allow users to write any / all components of
	their algorithms in any language they desire.

	To use Jython with our job launcher, GiraphRunner, pass the path to the script
	as the Computation class argument. Additionally, you should set the -jythonClass
	option to let Giraph know the name of your Jython Computation class. Lastly, you
	will need to set -typesHolder to a class that extends Giraph's TypesHolder so
	that Giraph can infer the types you use. Look at page-rank.py as an example.

	-------------------------------

	How to run the unittests on a local pseudo-distributed Hadoop instance:

	As mentioned earlier, Giraph supports several versions of Hadoop. In
	this section, we describe how to run the Giraph unittests against a single
	node instance of Apache Hadoop 0.20.203.

	Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz)
	from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/
	and unpack it into a local directory

	Follow the guide at
	http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
	to setup a pseudo-distributed single node Hadoop cluster.

	Giraph’s code assumes that you can run at least 4 mappers at once,
	unfortunately the default configuration allows only 2. Therefore you need
	to update conf/mapred-site.xml:

	<property>
	<name>mapred.tasktracker.map.tasks.maximum</name>
	<value>4</value>
	</property>

	<property>
	<name>mapred.map.tasks</name>
	<value>4</value>
	</property>

	After preparing the local filesystem with:

	rm -rf /tmp/hadoop-<username>
	/path/to/hadoop/bin/hadoop namenode -format

	you can start the local hadoop instance:

	/path/to/hadoop/bin/start-all.sh

	and finally run Giraph’s unittests:

	mvn clean test -Dprop.mapred.job.tracker=localhost:9001

	Now you can open a browser, point it to http://localhost:50030 and watch the
	Giraph jobs from the unittests running on your local Hadoop instance!


	Notes:
	Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of
	counters one can use, which is set to 120 by default. This limit restricts the
	number of iterations/supersteps possible in Giraph. This limit can be increased
	by setting a parameter "mapreduce.job.counters.limit" in job tracker's config
	file mapred-site.xml.