<?xml version="1.0" encoding="UTF-8"?>
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ See the License for the specific language governing permissions and
~ limitations under the License.
<document xmlns=""
<title>Giraph Input/Output with Gora</title>
<section name="Overview">
The <a class="externalLink" href="">Apache
Gora</a> project is an open source framework which provides an in-memory
data model and persistence for big data. Gora supports persisting to column
stores, key value stores, document stores and RDBMSs, and
analyzing the data with extensive Apache Hadoop MapReduce support.
<br />
The main motivation for integrating these two Apache projects is to turn
the NoSQL data stores supported by Gora into graphs that Giraph can process,
and to let Giraph store its results in different data stores, so that users
can focus on the processing itself.
<br />
Gora works by defining the data model, i.e. how our data is going to be
stored, with a JSON-like schema inspired by
<a class="externalLink" href="">Apache Avro</a>, and by
describing the physical mapping to the data store in an XML file.
The former is used to generate the data beans that are read from or written
to the different data stores, and the latter defines which data
bean should go where.
In this way, Giraph is able to read and write data using three files:
<li>The generated data beans representing our data model.</li>
<li>The XML mapping file representing our physical mapping.</li>
<li>A file called <code></code> containing
configurations related to which data store Gora will use.</li>
The image below shows how this integration works:
<img src="images/Gora-Giraph.svg" alt="Giraph Gora integration"/>
<section name="Generating DataBeans">
The first thing we have to do is define our data model using a JSON-like schema. The example
here is a schema resembling graphs stored inside Apache HBase through Gora. The following shows
a schema for a vertex:
<div class="source"><pre class="prettyprint">
{"type": "record",
 "name": "Vertex",
 "namespace": "org.apache.giraph.gora.generated",
 "fields" : [
   {"name": "vertexId", "type": "long"},
   {"name": "value", "type": "float"},
   {"name": "edges",
    "type": {
      "type":"array", "items": {
        "name": "Edge",
        "type": "record",
        "namespace": "org.apache.giraph.gora.generated",
        "fields": [
          {"name": "vertexId", "type": "long"},
          {"name": "edgeValue", "type": "float"}
        ]
      }
    }
   }
 ]
}
</pre></div>
The following schema shows what a schema for an edge looks like:
<div class="source"><pre class="prettyprint">
{"type": "record",
 "name": "GEdge",
 "namespace": "org.apache.giraph.gora.generated",
 "fields" : [
   {"name": "edgeId", "type": "string"},
   {"name": "edgeWeight", "type": "float"},
   {"name": "vertexInId", "type": "string"},
   {"name": "vertexOutId", "type": "string"},
   {"name": "label", "type": "string"}
 ]
}
</pre></div>
Now we are ready to generate our data beans. To do this, we use the gora-core.jar that
comes with Giraph. The Gora compiler takes three parameters:
<div class="source"><pre class="prettyprint">
&lt;schema file&gt; - REQUIRED -individual avsc file to be compiled or a directory path containing avsc files
&lt;output dir&gt; - REQUIRED -output directory for generated Java files
&lt;-license id&gt; - the preferred license header to add to the
Executing the Gora compiler with the following commands creates the generated data beans
in the given output directory.
<div class="source"><pre class="prettyprint">
java -jar gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
java -jar gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler.class edge.avsc gora-app/src/main/java/
<br />
This results in a Java class that looks similar to this:
<div class="source"><pre class="prettyprint">
/**
 * Class for defining a Giraph-Vertex.
 */
public class GVertex extends PersistentBase {
  /**
   * Schema used for the class.
   */
  public static final Schema OBJ_SCHEMA = Schema.parse(
      "{\"type\":\"record\",\"name\":\"Vertex\"," +
      "\"namespace\":\"org.apache.giraph.gora.generated\"," +
      "\"fields\":[{\"name\":\"vertexId\",\"type\":\"string\"}," +
      "{\"name\":\"value\",\"type\":\"float\"},{\"name\":\"edges\"," +
  /**
   * Vertex Id
   */
  private Utf8 vertexId;
  /**
   * Gets vertexId
   * @return Utf8 vertexId
   */
  public Utf8 getVertexId() {
    return (Utf8) get(0);
  }
  /**
   * Sets vertexId
   * @param value vertexId
   */
  public void setVertexId(Utf8 value) {
    put(0, value);
  }
  . . .
}
</pre></div>
Once this logical data modeling is done, the physical mapping between the generated
classes and the actual data repositories has to be defined. Gora does this by using an
XML "mapping file".
<br />
The file below is a <code>gora-hbase-mapping.xml</code>, i.e. the information
needed to map our data model onto HBase tables. The <code>table</code> tag
defines the necessary column families. The <code>class</code> tag maps the
generated Java bean to those column families: inside it, each field is mapped
to its column family and to the HBase qualifier used for storing that field.
<br />
This mapping file can contain one mapping for each generated data bean our application
uses, i.e. we can define more <code>table</code> tags, each with its own <code>class</code>
and <code>field</code> tags.
<div class="source"><pre class="prettyprint">
&lt;table name="graphGiraph"&gt;
  &lt;family name="vertices"/&gt;
&lt;/table&gt;
&lt;class name="" keyClass="java.lang.String" table="graphGiraph"&gt;
  &lt;field name="vertexId" family="vertices" qualifier="vertexId"/&gt;
  &lt;field name="value" family="vertices" qualifier="value"/&gt;
  &lt;field name="edges" family="vertices" qualifier="edges"/&gt;
&lt;/class&gt;
</pre></div>
A more complex file can be found inside the <code>giraph-gora/conf</code> folder.
<section name="Preparation">
Once the data beans have been generated, the <code></code> file
has to be created. This file specifies which data store is going to be used with
Gora and also contains extra information about that data store. An example of
such a file can be found inside the <code>giraph-gora/conf</code> folder. Following
our example, in which Apache HBase is used, <code></code>
should contain the corresponding configuration, as shown below:<br />
To be able to use the Gora API, the user then needs to prepare the Gora environment.
This amounts to setting up one of the data stores Gora supports, generating
the data beans, and setting up the <code></code> file. A more
detailed yet simple tutorial can be found
<a href="">here</a>.
<br />
The data definition files should be available in the classpath when the
Giraph job is run. All configuration files needed by each specific data
store should also be made available across the cluster. For example, if we were
to use HBase with Giraph and Gora, then the hbase-site.xml file should be passed
along as well. There are several ways to make these files available; one common
way is the <code>-files</code> option, which looks
similar to this: <br />
<div class="source"><pre class="prettyprint">
-files ../conf/,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml
</pre></div><br />
Gora also needs to be told which serialization types it will use. These serialization
types can be made available across the cluster, but if that is not desired, they can be
passed using the <code>-D</code> option of Hadoop, which looks
similar to this:<br />
<div class="source"><pre class="prettyprint">,
</pre></div><br />
<section name="Configuration Options">
Now that the data beans have been generated and the Gora environment is ready,
the user needs to know the configuration options for this API in order to
specify them. The options are as follows: <br />
<table border='0'>
<td>Gora DataStore class to read data from - required.</td>
<td>Gora Key class used to query the datastore - required.</td>
<td>Gora Persistent class used to read objects from Gora - required.</td>
<td>Gora start key used to query the datastore.</td>
<td>Gora end key used to query the datastore.</td>
<td>Keys factory used to convert strings into the desired keys - required.</td>
<td>Gora DataStore class to write data to - required.</td>
<td>Gora Key class used to write to the datastore - required.</td>
<td>Gora Persistent class used to write to Gora - required.</td>
<section name="Input/Output Example">
To use the Giraph input API available for Gora, extend the class
<code>GoraVertexInputFormat</code> or <code>GoraEdgeInputFormat</code>.
In the first class, the only method that has to be implemented is
<code>transformVertex</code>, which transforms a <code>Gora Object</code> into a
Giraph <code>Vertex</code> object. Likewise, in the second class the methods
that have to be implemented are <code>transformEdge</code>, which converts a
<code>Gora Edge Object</code> into a Giraph <code>Edge</code> object, and
<code>getCurrentSourceId</code>. Two example implementations are provided:
<code>GoraGVertexVertexInputFormat</code> and
<code>GoraGEdgeEdgeInputFormat</code>. One other class that has to be implemented
is the <code>KeyFactory</code>, which transforms the keys
passed as strings through the options into the actual Gora key objects used to query
the data store. The default one assumes that the key type is <code>String</code>.<br />
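The role of a key factory described above can be sketched in a few lines. This is a minimal, standalone illustration that assumes the data store uses <code>Long</code> keys. The class name <code>LongKeyFactorySketch</code> and the method <code>buildKey</code> are hypothetical stand-ins; the sketch does not extend or call any giraph-gora class, so the real <code>KeyFactory</code> contract should be checked in the giraph-gora sources.

```java
import java.util.Objects;

/**
 * Hypothetical, standalone sketch of a key factory for Long keys.
 * It does not extend or use any giraph-gora class.
 */
public class LongKeyFactorySketch {

    /**
     * Converts the string form of a key, as it would be passed through
     * the start/end key options, into a Long key object.
     *
     * @throws NumberFormatException if the string is not a valid long
     */
    public static Long buildKey(String keyString) {
        Objects.requireNonNull(keyString, "keyString");
        // Surrounding whitespace is tolerated; anything else must parse as a long.
        return Long.valueOf(keyString.trim());
    }

    public static void main(String[] args) {
        Long startKey = buildKey("1");
        Long endKey = buildKey(" 100 ");
        System.out.println("start=" + startKey + ", end=" + endKey);
    }
}
```

A factory of this kind would only be needed when the key type is not <code>String</code>, since the default one already covers string keys.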
To use the Giraph output API available for Gora,
extend the class <code>GoraVertexOutputFormat</code> or the corresponding Edge output class.
In the first class, the methods that have to be implemented are
<code>getGoraVertex</code>, which transforms a Giraph Vertex object into a
Gora object, and <code>getGoraKey</code>, which determines the key that represents
that vertex. Likewise, in the Edge output class the methods
that have to be implemented are <code>getGoraEdge</code>, which converts a Giraph
Edge object into a Gora Edge object, and <code>getGoraKey</code>, which determines the
key that represents that edge. Two example implementations are provided,
one of which is <code>GoraGVertexVertexOutputFormat</code>.
<br />
The example command below shows how to put all these classes and configurations
together. It runs the shortest path algorithm on the
graph database shown previously.
<br />
export GIRAPH_CORE_JAR=$GIRAPH_CORE_TARGET_DIR/giraph-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar<br />
export GIRAPH_EXAMPLES_JAR=$GIRAPH_EXAMPLES_TARGET_DIR/giraph-examples-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar<br />
export GIRAPH_GORA_JAR=$GIRAPH_GORA_TARGET_DIR/giraph-gora-$GIRAPH_VERSION-SNAPSHOT-jar-with-dependencies.jar<br />
export GORA_HBASE_JAR=$GORA_HBASE_TARGET_DIR/gora-hbase-$GORA_VERSION.jar<br />
export HBASE_JAR=$GORA_DIR/gora-hbase/lib/hbase-0.90.4.jar
</code><br />
<div class="source"><pre class="prettyprint">
hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner
-files ../conf/,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml,
-w 1
</pre></div><br />