README - giraph - Git at Google

 Giraph : Large-scale graph processing on Hadoop

 Web and online social graphs have been rapidly growing in size and
 scale during the past decade.  In 2008, Google estimated that the
 number of web pages reached over a trillion.  Online social networking
 and email sites, including Yahoo!, Google, Microsoft, Facebook,
 LinkedIn, and Twitter, have hundreds of millions of users and are
 expected to grow much more in the future.  Processing these graphs
 plays a big role in relevant and personalized information for users,
 such as results from a search engine or news in an online social
 networking site.

 Graph processing platforms to run large-scale algorithms (such as page
 rank, shared connections, personalization-based popularity, etc.) have
 become quite popular.  Some recent examples include Pregel and HaLoop.
 For general-purpose big data computation, the map-reduce computing
 model has been well adopted and the most deployed map-reduce
 infrastructure is Apache Hadoop.  We have implemented a
 graph-processing framework that is launched as a typical Hadoop job to
 leverage existing Hadoop infrastructure, such as Amazon’s EC2.  Giraph
 builds upon the graph-oriented nature of Pregel but additionally adds
 fault-tolerance to the coordinator process with the use of ZooKeeper
 as its centralized coordination service.

 Giraph follows the bulk-synchronous parallel model relative to graphs
 where vertices can send messages to other vertices during a given
 superstep.  Checkpoints are initiated by the Giraph infrastructure at
 user-defined intervals and are used for automatic application restarts
 when any worker in the application fails.  Any worker in the
 application can act as the application coordinator and one will
 automatically take over if the current application coordinator fails.

 -------------------------------

 Hadoop versions for use with Giraph:

 Secure Hadoop versions:
 - Apache Hadoop 0.20.203, 0.20.204, other secure versions may work as well
 -- Other versions reported working include:
 ---  Cloudera CDH3u0, CDH3u1

 Unsecure Hadoop versions:
 - Apache Hadoop 0.20.1, 0.20.2, 0.20.3

 Facebook Hadoop release (https://github.com/facebook/hadoop-20-warehouse):
 - GitHub master

 While we provide support for the unsecure and Facebook versions of Hadoop
 with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
 respectively, we have been primarily focusing on secure Hadoop releases
 at this time.

 -------------------------------

 Building and testing:

 You will need the following:
 - Java 1.6
 - Maven 3 or higher. Giraph uses the munge plugin
   (http://sonatype.github.com/munge-maven-plugin/),
   which requires Maven 3, to support multiple versions of Hadoop. Also, the
   web site plugin requires Maven 3.

 Use the maven commands with secure Hadoop to:
 - compile (i.e mvn compile)
 - package (i.e. mvn package)
 - test (i.e. mvn test)

 For the non-secure versions of Hadoop, run the maven commands with the
 additional argument '-Dhadoop=non_secure' to enable the maven profile
 'hadoop_non_secure'.  An example compilation command is
 'mvn -Dhadoop=non_secure compile'.

 For the Facebook Hadoop release, run the maven commands with the
 additional arguments '-Dhadoop=facebook' to enable the maven profile
 'hadoop_facebook' as well as a location for the hadoop core jar file.  An
 example compilation command is 'mvn -Dhadoop=facebook
 -Dhadoop.jar.path=/tmp/hadoop-0.20.1-core.jar compile'.


 How to run the unittests on a local pseudo-distributed Hadoop instance:

 As mentioned earlier, Giraph supports several versions of Hadoop.  In
 this section, we describe how to run the Giraph unittests against a single
 node instance of Apache Hadoop 0.20.203.

 Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz)
 from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/
 and unpack it into a local directory

 Follow the guide at
 http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
 to setup a pseudo-distributed single node Hadoop cluster.

 Giraph’s code assumes that you can run at least 4 mappers at once,
 unfortunately the default configuration allows only 2. Therefore you need
 to update conf/mapred-site.xml:

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>4</value>
 </property>

 <property>
   <name>mapred.map.tasks</name>
   <value>4</value>
 </property>

 After preparing the local filesystem with:

 rm -rf /tmp/hadoop-<username>
 /path/to/hadoop/bin/hadoop namenode -format

 you can start the local hadoop instance:

 /path/to/hadoop/bin/start-all.sh

 and finally run Giraph’s unittests:

 mvn clean test -Dprop.mapred.job.tracker=localhost:9001

 Now you can open a browser, point it to http://localhost:50030 and watch the
 Giraph jobs from the unittests running on your local Hadoop instance!


 Notes:
 Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of
 counters one can use, which is set to 120 by default. This limit restricts the
 number of iterations/supersteps possible in Giraph. This limit can be increased
 by setting a parameter "mapreduce.job.counters.limit" in job tracker's config
 file mapred-site.xml.
	Giraph : Large-scale graph processing on Hadoop

	Web and online social graphs have been rapidly growing in size and
	scale during the past decade. In 2008, Google estimated that the
	number of web pages reached over a trillion. Online social networking
	and email sites, including Yahoo!, Google, Microsoft, Facebook,
	LinkedIn, and Twitter, have hundreds of millions of users and are
	expected to grow much more in the future. Processing these graphs
	plays a big role in relevant and personalized information for users,
	such as results from a search engine or news in an online social
	networking site.

	Graph processing platforms to run large-scale algorithms (such as page
	rank, shared connections, personalization-based popularity, etc.) have
	become quite popular. Some recent examples include Pregel and HaLoop.
	For general-purpose big data computation, the map-reduce computing
	model has been well adopted and the most deployed map-reduce
	infrastructure is Apache Hadoop. We have implemented a
	graph-processing framework that is launched as a typical Hadoop job to
	leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph
	builds upon the graph-oriented nature of Pregel but additionally adds
	fault-tolerance to the coordinator process with the use of ZooKeeper
	as its centralized coordination service.

	Giraph follows the bulk-synchronous parallel model relative to graphs
	where vertices can send messages to other vertices during a given
	superstep. Checkpoints are initiated by the Giraph infrastructure at
	user-defined intervals and are used for automatic application restarts
	when any worker in the application fails. Any worker in the
	application can act as the application coordinator and one will
	automatically take over if the current application coordinator fails.

	-------------------------------

	Hadoop versions for use with Giraph:

	Secure Hadoop versions:
	- Apache Hadoop 0.20.203, 0.20.204, other secure versions may work as well
	-- Other versions reported working include:
	--- Cloudera CDH3u0, CDH3u1

	Unsecure Hadoop versions:
	- Apache Hadoop 0.20.1, 0.20.2, 0.20.3

	Facebook Hadoop release (https://github.com/facebook/hadoop-20-warehouse):
	- GitHub master

	While we provide support for the unsecure and Facebook versions of Hadoop
	with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
	respectively, we have been primarily focusing on secure Hadoop releases
	at this time.

	-------------------------------

	Building and testing:

	You will need the following:
	- Java 1.6
	- Maven 3 or higher. Giraph uses the munge plugin
	(http://sonatype.github.com/munge-maven-plugin/),
	which requires Maven 3, to support multiple versions of Hadoop. Also, the
	web site plugin requires Maven 3.

	Use the maven commands with secure Hadoop to:
	- compile (i.e mvn compile)
	- package (i.e. mvn package)
	- test (i.e. mvn test)

	For the non-secure versions of Hadoop, run the maven commands with the
	additional argument '-Dhadoop=non_secure' to enable the maven profile
	'hadoop_non_secure'. An example compilation command is
	'mvn -Dhadoop=non_secure compile'.

	For the Facebook Hadoop release, run the maven commands with the
	additional arguments '-Dhadoop=facebook' to enable the maven profile
	'hadoop_facebook' as well as a location for the hadoop core jar file. An
	example compilation command is 'mvn -Dhadoop=facebook
	-Dhadoop.jar.path=/tmp/hadoop-0.20.1-core.jar compile'.


	How to run the unittests on a local pseudo-distributed Hadoop instance:

	As mentioned earlier, Giraph supports several versions of Hadoop. In
	this section, we describe how to run the Giraph unittests against a single
	node instance of Apache Hadoop 0.20.203.

	Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz)
	from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/
	and unpack it into a local directory

	Follow the guide at
	http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
	to setup a pseudo-distributed single node Hadoop cluster.

	Giraph’s code assumes that you can run at least 4 mappers at once,
	unfortunately the default configuration allows only 2. Therefore you need
	to update conf/mapred-site.xml:

	<property>
	<name>mapred.tasktracker.map.tasks.maximum</name>
	<value>4</value>
	</property>

	<property>
	<name>mapred.map.tasks</name>
	<value>4</value>
	</property>

	After preparing the local filesystem with:

	rm -rf /tmp/hadoop-<username>
	/path/to/hadoop/bin/hadoop namenode -format

	you can start the local hadoop instance:

	/path/to/hadoop/bin/start-all.sh

	and finally run Giraph’s unittests:

	mvn clean test -Dprop.mapred.job.tracker=localhost:9001

	Now you can open a browser, point it to http://localhost:50030 and watch the
	Giraph jobs from the unittests running on your local Hadoop instance!


	Notes:
	Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the number of
	counters one can use, which is set to 120 by default. This limit restricts the
	number of iterations/supersteps possible in Giraph. This limit can be increased
	by setting a parameter "mapreduce.job.counters.limit" in job tracker's config
	file mapred-site.xml.