docs/src/reference/implementations-hadoop-end.asciidoc - tinkerpop - Git at Google

 ////
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ////
 === Input/Output Formats

 image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop
 `InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list]
 representation of the graph where each "row" represents a single vertex, its properties, and its incoming and
 outgoing edges.

 {empty} +

 [[gryo-io-format]]
 ==== Gryo I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`

 <<gryo,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo]
 to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time
 savings over text-based representations.

 NOTE: The `GryoInputFormat` is splittable.

 [[graphson-io-format]]
 ==== GraphSON I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat`

 <<graphson,GraphSON>> is a JSON based graph format. GraphSON is a space-expensive graph format in that
 it is a text-based markup language. However, it is convenient for many developers to work with as its structure is
 simple (easy to create and parse).

 The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format.

 [source,json]
 ----
 {"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}
 {"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}}
 {"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}}
 {"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}}
 {"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}}
 {"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}}
 ----

 [[script-io-format]]
 ==== Script I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat`

 `ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write
 `Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that
 Hadoop-Gremlin uses the user provided script for all reading/writing.

 ===== ScriptInputFormat

 The data below represents an adjacency list representation of the classic TinkerGraph toy graph. First line reads,
 "vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is
 labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.4`, and so on."

 [source]
 1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
 2:person:vadas:27
 3:project:lop:java
 4:person:josh:32 created:3:0.4,created:5:1.0
 5:project:ripple:java
 6:person:peter:35 created:3:0.2

 There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it).
 As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each
 mapper in the Hadoop job. The script must have the following method defined:

 [source,groovy]
 def parse(String line) { ... }

 In order to create vertices and edges, the `parse()` method gets access to a global variable named `graph`, which holds
 the local `StarGraph` for the current line/vertex.

 An appropriate `parse()` for the above adjacency list file is:

 [source,groovy]
 def parse(line) {
     def parts = line.split(/ /)
     def (id, label, name, x) = parts[0].split(/:/).toList()
     def v1 = graph.addVertex(T.id, id, T.label, label)
     if (name != null) v1.property('name', name) // first value is always the name
     if (x != null) {
         // second value depends on the vertex label; it's either
         // the age of a person or the language of a project
         if (label.equals('project')) v1.property('lang', x)
         else v1.property('age', Integer.valueOf(x))
     }
     if (parts.length == 2) {
         parts[1].split(/,/).grep { !it.isEmpty() }.each {
             def (eLabel, refId, weight) = it.split(/:/).toList()
             def v2 = graph.addVertex(T.id, refId)
             v1.addOutEdge(eLabel, v2, 'weight', Double.valueOf(weight))
         }
     }
     return v1
 }

 The resultant `Vertex` denotes whether the line parsed yielded a valid Vertex. As such, if the line is not valid
 (e.g. a comment line, a skip line, etc.), then simply return `null`.

 ===== ScriptOutputFormat Support

 The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately
 streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the
 provided script maintains a method with the following signature:

 [source,groovy]
 def stringify(Vertex vertex) { ... }

 An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is:

 [source,groovy]
 def stringify(vertex) {
     def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':')
     def outE = vertex.outE().map {
         def e = it.get()
         e.values('weight').inject(e.label(), e.inV().next().id()).join(':')
     }.join(',')
     return [v, outE].join('\t')
 }


 === Storage Systems

 Hadoop-Gremlin provides two implementations of the `Storage` API:

 * `FileSystemStorage`: Access HDFS and local file system data.
 * `SparkContextStorage`: Access Spark persisted RDD data.

 [[interacting-with-hdfs]]
 ==== Interacting with HDFS

 The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS].
 The results of any OLAP operation are stored in HDFS accessible via `hdfs`. For local file system access, there is `fs`.

 [gremlin-groovy]
 ----
 graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
 graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
 hdfs.ls()
 hdfs.ls('output')
 hdfs.head('output', GryoInputFormat)
 hdfs.head('output', 'clusterCount', SequenceFileInputFormat)
 hdfs.rm('output')
 hdfs.ls()
 ----

 [[interacting-with-spark]]
 ==== Interacting with Spark

 If a Spark context is persisted, then Spark RDDs will remain the Spark cache and accessible over subsequent jobs.
 RDDs are retrieved and saved to the `SparkContext` via `PersistedInputRDD` and `PersistedOutputRDD` respectively.
 Persisted RDDs can be accessed using `spark`.

 [gremlin-groovy]
 ----
 Spark.create('local[4]')
 graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
 graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
 graph.configuration().setProperty('gremlin.spark.persistContext',true)
 graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
 spark.ls()
 spark.ls('output')
 spark.head('output', PersistedInputRDD)
 spark.head('output', 'clusterCount', PersistedInputRDD)
 spark.rm('output')
 spark.ls()
 ----
	////
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	////
	=== Input/Output Formats

	image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop
	`InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list]
	representation of the graph where each "row" represents a single vertex, its properties, and its incoming and
	outgoing edges.

	{empty} +

	[[gryo-io-format]]
	==== Gryo I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`

	<<gryo,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo]
	to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time
	savings over text-based representations.

	NOTE: The `GryoInputFormat` is splittable.

	[[graphson-io-format]]
	==== GraphSON I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat`

	<<graphson,GraphSON>> is a JSON based graph format. GraphSON is a space-expensive graph format in that
	it is a text-based markup language. However, it is convenient for many developers to work with as its structure is
	simple (easy to create and parse).

	The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format.

	[source,json]
	----
	{"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}
	{"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}}
	{"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}}
	{"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}}
	{"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}}
	{"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}}
	----

	[[script-io-format]]
	==== Script I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat`

	`ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write
	`Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that
	Hadoop-Gremlin uses the user provided script for all reading/writing.

	===== ScriptInputFormat

	The data below represents an adjacency list representation of the classic TinkerGraph toy graph. First line reads,
	"vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is
	labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.4`, and so on."

	[source]
	1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
	2:person:vadas:27
	3:project:lop:java
	4:person:josh:32 created:3:0.4,created:5:1.0
	5:project:ripple:java
	6:person:peter:35 created:3:0.2

	There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it).
	As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each
	mapper in the Hadoop job. The script must have the following method defined:

	[source,groovy]
	def parse(String line) { ... }

	In order to create vertices and edges, the `parse()` method gets access to a global variable named `graph`, which holds
	the local `StarGraph` for the current line/vertex.

	An appropriate `parse()` for the above adjacency list file is:

	[source,groovy]
	def parse(line) {
	def parts = line.split(/ /)
	def (id, label, name, x) = parts[0].split(/:/).toList()
	def v1 = graph.addVertex(T.id, id, T.label, label)
	if (name != null) v1.property('name', name) // first value is always the name
	if (x != null) {
	// second value depends on the vertex label; it's either
	// the age of a person or the language of a project
	if (label.equals('project')) v1.property('lang', x)
	else v1.property('age', Integer.valueOf(x))
	}
	if (parts.length == 2) {
	parts[1].split(/,/).grep { !it.isEmpty() }.each {
	def (eLabel, refId, weight) = it.split(/:/).toList()
	def v2 = graph.addVertex(T.id, refId)
	v1.addOutEdge(eLabel, v2, 'weight', Double.valueOf(weight))
	}
	}
	return v1
	}

	The resultant `Vertex` denotes whether the line parsed yielded a valid Vertex. As such, if the line is not valid
	(e.g. a comment line, a skip line, etc.), then simply return `null`.

	===== ScriptOutputFormat Support

	The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately
	streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the
	provided script maintains a method with the following signature:

	[source,groovy]
	def stringify(Vertex vertex) { ... }

	An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is:

	[source,groovy]
	def stringify(vertex) {
	def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':')
	def outE = vertex.outE().map {
	def e = it.get()
	e.values('weight').inject(e.label(), e.inV().next().id()).join(':')
	}.join(',')
	return [v, outE].join('\t')
	}



	=== Storage Systems

	Hadoop-Gremlin provides two implementations of the `Storage` API:

	* `FileSystemStorage`: Access HDFS and local file system data.
	* `SparkContextStorage`: Access Spark persisted RDD data.

	[[interacting-with-hdfs]]
	==== Interacting with HDFS

	The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS].
	The results of any OLAP operation are stored in HDFS accessible via `hdfs`. For local file system access, there is `fs`.

	[gremlin-groovy]
	----
	graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
	graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
	hdfs.ls()
	hdfs.ls('output')
	hdfs.head('output', GryoInputFormat)
	hdfs.head('output', 'clusterCount', SequenceFileInputFormat)
	hdfs.rm('output')
	hdfs.ls()
	----

	[[interacting-with-spark]]
	==== Interacting with Spark

	If a Spark context is persisted, then Spark RDDs will remain the Spark cache and accessible over subsequent jobs.
	RDDs are retrieved and saved to the `SparkContext` via `PersistedInputRDD` and `PersistedOutputRDD` respectively.
	Persisted RDDs can be accessed using `spark`.

	[gremlin-groovy]
	----
	Spark.create('local[4]')
	graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
	graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
	graph.configuration().setProperty('gremlin.spark.persistContext',true)
	graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
	spark.ls()
	spark.ls('output')
	spark.head('output', PersistedInputRDD)
	spark.head('output', 'clusterCount', PersistedInputRDD)
	spark.rm('output')
	spark.ls()
	----