docs/src/reference/implementations-hadoop-end.asciidoc - tinkerpop - Git at Google

 ////
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ////
 === Input/Output Formats

 image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop
 `InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list]
 representation of the graph where each "row" represents a single vertex, its properties, and its incoming and
 outgoing edges.

 {empty} +

 [[gryo-io-format]]
 ==== Gryo I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`

 <<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo]
 to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time
 savings over text-based representations.

 NOTE: The `GryoInputFormat` is splittable.

 [[graphson-io-format]]
 ==== GraphSON I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat`

 <<graphson-reader-writer,GraphSON>> is a JSON based graph format. GraphSON is a space-expensive graph format in that
 it is a text-based markup language. However, it is convenient for many developers to work with as its structure is
 simple (easy to create and parse).

 The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format.

 [source,json]
 ----
 {"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}
 {"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}}
 {"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}}
 {"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}}
 {"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}}
 {"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}}
 ----

 [[script-io-format]]
 ==== Script I/O Format

 * **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat`
 * **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat`

 `ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write
 `Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that
 Hadoop-Gremlin uses the user provided script for all reading/writing.

 ===== ScriptInputFormat

 The data below represents an adjacency list representation of the classic TinkerGraph toy graph. First line reads,
 "vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is
 labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.4`, and so on."

 [source]
 1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
 2:person:vadas:27
 3:project:lop:java
 4:person:josh:32 created:3:0.4,created:5:1.0
 5:project:ripple:java
 6:person:peter:35 created:3:0.2

 There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it).
 As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each
 mapper in the Hadoop job. The script must have the following method defined:

 [source,groovy]
 def parse(String line) { ... }

 In order to create vertices and edges, the `parse()` method gets access to a global variable named `graph`, which holds
 the local `StarGraph` for the current line/vertex.

 An appropriate `parse()` for the above adjacency list file is:

 [source,groovy]
 def parse(line) {
     def parts = line.split(/ /)
     def (id, label, name, x) = parts[0].split(/:/).toList()
     def v1 = graph.addVertex(T.id, id, T.label, label)
     if (name != null) v1.property('name', name) // first value is always the name
     if (x != null) {
         // second value depends on the vertex label; it's either
         // the age of a person or the language of a project
         if (label.equals('project')) v1.property('lang', x)
         else v1.property('age', Integer.valueOf(x))
     }
     if (parts.length == 2) {
         parts[1].split(/,/).grep { !it.isEmpty() }.each {
             def (eLabel, refId, weight) = it.split(/:/).toList()
             def v2 = graph.addVertex(T.id, refId)
             v1.addOutEdge(eLabel, v2, 'weight', Double.valueOf(weight))
         }
     }
     return v1
 }

 The resultant `Vertex` denotes whether the line parsed yielded a valid Vertex. As such, if the line is not valid
 (e.g. a comment line, a skip line, etc.), then simply return `null`.

 ===== ScriptOutputFormat Support

 The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately
 streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the
 provided script maintains a method with the following signature:

 [source,groovy]
 def stringify(Vertex vertex) { ... }

 An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is:

 [source,groovy]
 def stringify(vertex) {
     def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':')
     def outE = vertex.outE().map {
         def e = it.get()
         e.values('weight').inject(e.label(), e.inV().next().id()).join(':')
     }.join(',')
     return [v, outE].join('\t')
 }


 === Storage Systems

 Hadoop-Gremlin provides two implementations of the `Storage` API:

 * `FileSystemStorage`: Access HDFS and local file system data.
 * `SparkContextStorage`: Access Spark persisted RDD data.

 [[interacting-with-hdfs]]
 ==== Interacting with HDFS

 The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS].
 The results of any OLAP operation are stored in HDFS accessible via `hdfs`. For local file system access, there is `fs`.

 [gremlin-groovy]
 ----
 graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
 graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
 hdfs.ls()
 hdfs.ls('output')
 hdfs.head('output', GryoInputFormat)
 hdfs.head('output', 'clusterCount', SequenceFileInputFormat)
 hdfs.rm('output')
 hdfs.ls()
 ----

 [[interacting-with-spark]]
 ==== Interacting with Spark

 If a Spark context is persisted, then Spark RDDs will remain the Spark cache and accessible over subsequent jobs.
 RDDs are retrieved and saved to the `SparkContext` via `PersistedInputRDD` and `PersistedOutputRDD` respectivly.
 Persisted RDDs can be accessed using `spark`.

 [gremlin-groovy]
 ----
 Spark.create('local[4]')
 graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
 graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
 graph.configuration().setProperty('gremlin.spark.persistContext',true)
 graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
 spark.ls()
 spark.ls('output')
 spark.head('output', PersistedInputRDD)
 spark.head('output', 'clusterCount', PersistedInputRDD)
 spark.rm('output')
 spark.ls()
 ----

 === A Command Line Example

 image::pagerank-logo.png[width=300]

 The classic link:http://en.wikipedia.org/wiki/PageRank[PageRank] centrality algorithm can be executed over the
 TinkerPop graph from the command line using `GiraphGraphComputer`.

 WARNING: Be sure that the `HADOOP_GREMLIN_LIBS` references the location `lib` directory of the respective
 `GraphComputer` engine being used or else the requisite dependencies will not be uploaded to the Hadoop cluster.

 [source,text]
 ----
 $ hdfs dfs -copyFromLocal data/tinkerpop-modern.json tinkerpop-modern.json
 $ hdfs dfs -ls
 Found 2 items
 -rw-r--r--   1 marko supergroup       2356 2014-07-28 13:00 /user/marko/tinkerpop-modern.json
 $ hadoop jar target/giraph-gremlin-x.y.z-job.jar org.apache.tinkerpop.gremlin.giraph.process.computer.GiraphGraphComputer ../hadoop-gremlin/conf/hadoop-graphson.properties
 15/09/11 08:02:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 15/09/11 08:02:11 INFO computer.GiraphGraphComputer: HadoopGremlin(Giraph): PageRankVertexProgram[alpha=0.85,iterations=30]
 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: number of splits:3
 15/09/11 08:02:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441915907347_0028
 15/09/11 08:02:12 INFO impl.YarnClientImpl: Submitted application application_1441915907347_0028
 15/09/11 08:02:12 INFO job.GiraphJob: Tracking URL: http://markos-macbook:8088/proxy/application_1441915907347_0028/
 15/09/11 08:02:12 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 3 mappers
 15/09/11 08:03:54 INFO mapreduce.Job: Running job: job_1441915907347_0028
 15/09/11 08:03:55 INFO mapreduce.Job: Job job_1441915907347_0028 running in uber mode : false
 15/09/11 08:03:55 INFO mapreduce.Job:  map 33% reduce 0%
 15/09/11 08:03:57 INFO mapreduce.Job:  map 67% reduce 0%
 15/09/11 08:04:01 INFO mapreduce.Job:  map 100% reduce 0%
 15/09/11 08:06:17 INFO mapreduce.Job: Job job_1441915907347_0028 completed successfully
 15/09/11 08:06:17 INFO mapreduce.Job: Counters: 80
     File System Counters
         FILE: Number of bytes read=0
         FILE: Number of bytes written=483918
         FILE: Number of read operations=0
         FILE: Number of large read operations=0
         FILE: Number of write operations=0
         HDFS: Number of bytes read=1465
         HDFS: Number of bytes written=1760
         HDFS: Number of read operations=39
         HDFS: Number of large read operations=0
         HDFS: Number of write operations=20
     Job Counters
         Launched map tasks=3
         Other local map tasks=3
         Total time spent by all maps in occupied slots (ms)=458105
         Total time spent by all reduces in occupied slots (ms)=0
         Total time spent by all map tasks (ms)=458105
         Total vcore-seconds taken by all map tasks=458105
         Total megabyte-seconds taken by all map tasks=469099520
     Map-Reduce Framework
         Map input records=3
         Map output records=0
         Input split bytes=132
         Spilled Records=0
         Failed Shuffles=0
         Merged Map outputs=0
         GC time elapsed (ms)=1594
         CPU time spent (ms)=0
         Physical memory (bytes) snapshot=0
         Virtual memory (bytes) snapshot=0
         Total committed heap usage (bytes)=527958016
     Giraph Stats
         Aggregate edges=0
         Aggregate finished vertices=0
         Aggregate sent message message bytes=13535
         Aggregate sent messages=186
         Aggregate vertices=6
         Current master task partition=0
         Current workers=2
         Last checkpointed superstep=0
         Sent message bytes=438
         Sent messages=6
         Superstep=31
     Giraph Timers
         Initialize (ms)=2996
         Input superstep (ms)=5209
         Setup (ms)=59
         Shutdown (ms)=9324
         Superstep 0 GiraphComputation (ms)=3861
         Superstep 1 GiraphComputation (ms)=4027
         Superstep 10 GiraphComputation (ms)=4000
         Superstep 11 GiraphComputation (ms)=4004
         Superstep 12 GiraphComputation (ms)=3999
         Superstep 13 GiraphComputation (ms)=4000
         Superstep 14 GiraphComputation (ms)=4005
         Superstep 15 GiraphComputation (ms)=4003
         Superstep 16 GiraphComputation (ms)=4001
         Superstep 17 GiraphComputation (ms)=4007
         Superstep 18 GiraphComputation (ms)=3998
         Superstep 19 GiraphComputation (ms)=4006
         Superstep 2 GiraphComputation (ms)=4007
         Superstep 20 GiraphComputation (ms)=3996
         Superstep 21 GiraphComputation (ms)=4006
         Superstep 22 GiraphComputation (ms)=4002
         Superstep 23 GiraphComputation (ms)=3998
         Superstep 24 GiraphComputation (ms)=4003
         Superstep 25 GiraphComputation (ms)=4001
         Superstep 26 GiraphComputation (ms)=4003
         Superstep 27 GiraphComputation (ms)=4005
         Superstep 28 GiraphComputation (ms)=4002
         Superstep 29 GiraphComputation (ms)=4001
         Superstep 3 GiraphComputation (ms)=3988
         Superstep 30 GiraphComputation (ms)=4248
         Superstep 4 GiraphComputation (ms)=4010
         Superstep 5 GiraphComputation (ms)=3998
         Superstep 6 GiraphComputation (ms)=3996
         Superstep 7 GiraphComputation (ms)=4005
         Superstep 8 GiraphComputation (ms)=4009
         Superstep 9 GiraphComputation (ms)=3994
         Total (ms)=138788
     File Input Format Counters
         Bytes Read=0
     File Output Format Counters
         Bytes Written=0
 $ hdfs dfs -cat output/~g/*
 {"id":1,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.15000000000000002}],"name":[{"id":0,"value":"marko"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":3.0}],"age":[{"id":1,"value":29}]}}
 {"id":5,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.23181250000000003}],"name":[{"id":8,"value":"ripple"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"lang":[{"id":9,"value":"java"}]}}
 {"id":3,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.4018125}],"name":[{"id":4,"value":"lop"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":0.0}],"lang":[{"id":5,"value":"java"}]}}
 {"id":4,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],"name":[{"id":6,"value":"josh"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],"age":[{"id":7,"value":32}]}}
 {"id":2,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.19250000000000003}],"name":[{"id":2,"value":"vadas"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"age":[{"id":3,"value":27}]}}
 {"id":6,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.15000000000000002}],"name":[{"id":10,"value":"peter"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":1.0}],"age":[{"id":11,"value":35}]}}
 ----

 Vertex 4 ("josh") is isolated below:

 [source,js]
 ----
 {
   "id":4,
   "label":"person",
   "properties": {
     "gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],
     "name":[{"id":6,"value":"josh"}],
     "gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],
     "age":[{"id":7,"value":32}]}
   }
 }
 ----
	////
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	////
	=== Input/Output Formats

	image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various I/O formats -- i.e. Hadoop
	`InputFormat` and `OutputFormat`. All of the formats make use of an link:http://en.wikipedia.org/wiki/Adjacency_list[adjacency list]
	representation of the graph where each "row" represents a single vertex, its properties, and its incoming and
	outgoing edges.

	{empty} +

	[[gryo-io-format]]
	==== Gryo I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`

	<<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo]
	to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time
	savings over text-based representations.

	NOTE: The `GryoInputFormat` is splittable.

	[[graphson-io-format]]
	==== GraphSON I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat`

	<<graphson-reader-writer,GraphSON>> is a JSON based graph format. GraphSON is a space-expensive graph format in that
	it is a text-based markup language. However, it is convenient for many developers to work with as its structure is
	simple (easy to create and parse).

	The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format.

	[source,json]
	----
	{"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}
	{"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}}
	{"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}}
	{"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}}
	{"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}}
	{"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}}
	----

	[[script-io-format]]
	==== Script I/O Format

	* InputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat`
	* OutputFormat: `org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat`

	`ScriptInputFormat` and `ScriptOutputFormat` take an arbitrary script and use that script to either read or write
	`Vertex` objects, respectively. This can be considered the most general `InputFormat`/`OutputFormat` possible in that
	Hadoop-Gremlin uses the user provided script for all reading/writing.

	===== ScriptInputFormat

	The data below represents an adjacency list representation of the classic TinkerGraph toy graph. First line reads,
	"vertex `1`, labeled `person` having 2 property values (`marko` and `29`) has 3 outgoing edges; the first edge is
	labeled `knows`, connects the current vertex `1` with vertex `2` and has a property value `0.4`, and so on."

	[source]
	1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
	2:person:vadas:27
	3:project:lop:java
	4:person:josh:32 created:3:0.4,created:5:1.0
	5:project:ripple:java
	6:person:peter:35 created:3:0.2

	There is no corresponding `InputFormat` that can parse this particular file (or some adjacency list variant of it).
	As such, `ScriptInputFormat` can be used. With `ScriptInputFormat` a script is stored in HDFS and leveraged by each
	mapper in the Hadoop job. The script must have the following method defined:

	[source,groovy]
	def parse(String line) { ... }

	In order to create vertices and edges, the `parse()` method gets access to a global variable named `graph`, which holds
	the local `StarGraph` for the current line/vertex.

	An appropriate `parse()` for the above adjacency list file is:

	[source,groovy]
	def parse(line) {
	def parts = line.split(/ /)
	def (id, label, name, x) = parts[0].split(/:/).toList()
	def v1 = graph.addVertex(T.id, id, T.label, label)
	if (name != null) v1.property('name', name) // first value is always the name
	if (x != null) {
	// second value depends on the vertex label; it's either
	// the age of a person or the language of a project
	if (label.equals('project')) v1.property('lang', x)
	else v1.property('age', Integer.valueOf(x))
	}
	if (parts.length == 2) {
	parts[1].split(/,/).grep { !it.isEmpty() }.each {
	def (eLabel, refId, weight) = it.split(/:/).toList()
	def v2 = graph.addVertex(T.id, refId)
	v1.addOutEdge(eLabel, v2, 'weight', Double.valueOf(weight))
	}
	}
	return v1
	}

	The resultant `Vertex` denotes whether the line parsed yielded a valid Vertex. As such, if the line is not valid
	(e.g. a comment line, a skip line, etc.), then simply return `null`.

	===== ScriptOutputFormat Support

	The principle above can also be used to convert a vertex to an arbitrary `String` representation that is ultimately
	streamed back to a file in HDFS. This is the role of `ScriptOutputFormat`. `ScriptOutputFormat` requires that the
	provided script maintains a method with the following signature:

	[source,groovy]
	def stringify(Vertex vertex) { ... }

	An appropriate `stringify()` to produce output in the same format that was shown in the `ScriptInputFormat` sample is:

	[source,groovy]
	def stringify(vertex) {
	def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':')
	def outE = vertex.outE().map {
	def e = it.get()
	e.values('weight').inject(e.label(), e.inV().next().id()).join(':')
	}.join(',')
	return [v, outE].join('\t')
	}



	=== Storage Systems

	Hadoop-Gremlin provides two implementations of the `Storage` API:

	* `FileSystemStorage`: Access HDFS and local file system data.
	* `SparkContextStorage`: Access Spark persisted RDD data.

	[[interacting-with-hdfs]]
	==== Interacting with HDFS

	The distributed file system of Hadoop is called link:http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system[HDFS].
	The results of any OLAP operation are stored in HDFS accessible via `hdfs`. For local file system access, there is `fs`.

	[gremlin-groovy]
	----
	graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
	graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
	hdfs.ls()
	hdfs.ls('output')
	hdfs.head('output', GryoInputFormat)
	hdfs.head('output', 'clusterCount', SequenceFileInputFormat)
	hdfs.rm('output')
	hdfs.ls()
	----

	[[interacting-with-spark]]
	==== Interacting with Spark

	If a Spark context is persisted, then Spark RDDs will remain the Spark cache and accessible over subsequent jobs.
	RDDs are retrieved and saved to the `SparkContext` via `PersistedInputRDD` and `PersistedOutputRDD` respectivly.
	Persisted RDDs can be accessed using `spark`.

	[gremlin-groovy]
	----
	Spark.create('local[4]')
	graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
	graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
	graph.configuration().setProperty('gremlin.spark.persistContext',true)
	graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get();
	spark.ls()
	spark.ls('output')
	spark.head('output', PersistedInputRDD)
	spark.head('output', 'clusterCount', PersistedInputRDD)
	spark.rm('output')
	spark.ls()
	----

	=== A Command Line Example

	image::pagerank-logo.png[width=300]

	The classic link:http://en.wikipedia.org/wiki/PageRank[PageRank] centrality algorithm can be executed over the
	TinkerPop graph from the command line using `GiraphGraphComputer`.

	WARNING: Be sure that the `HADOOP_GREMLIN_LIBS` references the location `lib` directory of the respective
	`GraphComputer` engine being used or else the requisite dependencies will not be uploaded to the Hadoop cluster.

	[source,text]
	----
	$ hdfs dfs -copyFromLocal data/tinkerpop-modern.json tinkerpop-modern.json
	$ hdfs dfs -ls
	Found 2 items
	-rw-r--r-- 1 marko supergroup 2356 2014-07-28 13:00 /user/marko/tinkerpop-modern.json
	$ hadoop jar target/giraph-gremlin-x.y.z-job.jar org.apache.tinkerpop.gremlin.giraph.process.computer.GiraphGraphComputer ../hadoop-gremlin/conf/hadoop-graphson.properties
	15/09/11 08:02:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	15/09/11 08:02:11 INFO computer.GiraphGraphComputer: HadoopGremlin(Giraph): PageRankVertexProgram[alpha=0.85,iterations=30]
	15/09/11 08:02:12 INFO mapreduce.JobSubmitter: number of splits:3
	15/09/11 08:02:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441915907347_0028
	15/09/11 08:02:12 INFO impl.YarnClientImpl: Submitted application application_1441915907347_0028
	15/09/11 08:02:12 INFO job.GiraphJob: Tracking URL: http://markos-macbook:8088/proxy/application_1441915907347_0028/
	15/09/11 08:02:12 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 3 mappers
	15/09/11 08:03:54 INFO mapreduce.Job: Running job: job_1441915907347_0028
	15/09/11 08:03:55 INFO mapreduce.Job: Job job_1441915907347_0028 running in uber mode : false
	15/09/11 08:03:55 INFO mapreduce.Job: map 33% reduce 0%
	15/09/11 08:03:57 INFO mapreduce.Job: map 67% reduce 0%
	15/09/11 08:04:01 INFO mapreduce.Job: map 100% reduce 0%
	15/09/11 08:06:17 INFO mapreduce.Job: Job job_1441915907347_0028 completed successfully
	15/09/11 08:06:17 INFO mapreduce.Job: Counters: 80
	File System Counters
	FILE: Number of bytes read=0
	FILE: Number of bytes written=483918
	FILE: Number of read operations=0
	FILE: Number of large read operations=0
	FILE: Number of write operations=0
	HDFS: Number of bytes read=1465
	HDFS: Number of bytes written=1760
	HDFS: Number of read operations=39
	HDFS: Number of large read operations=0
	HDFS: Number of write operations=20
	Job Counters
	Launched map tasks=3
	Other local map tasks=3
	Total time spent by all maps in occupied slots (ms)=458105
	Total time spent by all reduces in occupied slots (ms)=0
	Total time spent by all map tasks (ms)=458105
	Total vcore-seconds taken by all map tasks=458105
	Total megabyte-seconds taken by all map tasks=469099520
	Map-Reduce Framework
	Map input records=3
	Map output records=0
	Input split bytes=132
	Spilled Records=0
	Failed Shuffles=0
	Merged Map outputs=0
	GC time elapsed (ms)=1594
	CPU time spent (ms)=0
	Physical memory (bytes) snapshot=0
	Virtual memory (bytes) snapshot=0
	Total committed heap usage (bytes)=527958016
	Giraph Stats
	Aggregate edges=0
	Aggregate finished vertices=0
	Aggregate sent message message bytes=13535
	Aggregate sent messages=186
	Aggregate vertices=6
	Current master task partition=0
	Current workers=2
	Last checkpointed superstep=0
	Sent message bytes=438
	Sent messages=6
	Superstep=31
	Giraph Timers
	Initialize (ms)=2996
	Input superstep (ms)=5209
	Setup (ms)=59
	Shutdown (ms)=9324
	Superstep 0 GiraphComputation (ms)=3861
	Superstep 1 GiraphComputation (ms)=4027
	Superstep 10 GiraphComputation (ms)=4000
	Superstep 11 GiraphComputation (ms)=4004
	Superstep 12 GiraphComputation (ms)=3999
	Superstep 13 GiraphComputation (ms)=4000
	Superstep 14 GiraphComputation (ms)=4005
	Superstep 15 GiraphComputation (ms)=4003
	Superstep 16 GiraphComputation (ms)=4001
	Superstep 17 GiraphComputation (ms)=4007
	Superstep 18 GiraphComputation (ms)=3998
	Superstep 19 GiraphComputation (ms)=4006
	Superstep 2 GiraphComputation (ms)=4007
	Superstep 20 GiraphComputation (ms)=3996
	Superstep 21 GiraphComputation (ms)=4006
	Superstep 22 GiraphComputation (ms)=4002
	Superstep 23 GiraphComputation (ms)=3998
	Superstep 24 GiraphComputation (ms)=4003
	Superstep 25 GiraphComputation (ms)=4001
	Superstep 26 GiraphComputation (ms)=4003
	Superstep 27 GiraphComputation (ms)=4005
	Superstep 28 GiraphComputation (ms)=4002
	Superstep 29 GiraphComputation (ms)=4001
	Superstep 3 GiraphComputation (ms)=3988
	Superstep 30 GiraphComputation (ms)=4248
	Superstep 4 GiraphComputation (ms)=4010
	Superstep 5 GiraphComputation (ms)=3998
	Superstep 6 GiraphComputation (ms)=3996
	Superstep 7 GiraphComputation (ms)=4005
	Superstep 8 GiraphComputation (ms)=4009
	Superstep 9 GiraphComputation (ms)=3994
	Total (ms)=138788
	File Input Format Counters
	Bytes Read=0
	File Output Format Counters
	Bytes Written=0
	$ hdfs dfs -cat output/~g/*
	{"id":1,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.15000000000000002}],"name":[{"id":0,"value":"marko"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":3.0}],"age":[{"id":1,"value":29}]}}
	{"id":5,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.23181250000000003}],"name":[{"id":8,"value":"ripple"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"lang":[{"id":9,"value":"java"}]}}
	{"id":3,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.4018125}],"name":[{"id":4,"value":"lop"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":0.0}],"lang":[{"id":5,"value":"java"}]}}
	{"id":4,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],"name":[{"id":6,"value":"josh"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],"age":[{"id":7,"value":32}]}}
	{"id":2,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.19250000000000003}],"name":[{"id":2,"value":"vadas"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"age":[{"id":3,"value":27}]}}
	{"id":6,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.15000000000000002}],"name":[{"id":10,"value":"peter"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":1.0}],"age":[{"id":11,"value":35}]}}
	----

	Vertex 4 ("josh") is isolated below:

	[source,js]
	----
	{
	"id":4,
	"label":"person",
	"properties": {
	"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],
	"name":[{"id":6,"value":"josh"}],
	"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],
	"age":[{"id":7,"value":32}]}
	}
	}
	----