////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[neo4j-gremlin]]
== Neo4j-Gremlin

[source,xml]
----
<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>neo4j-gremlin</artifactId>
   <version>x.y.z</version>
</dependency>
<!-- neo4j-tinkerpop-api-impl is NOT Apache 2 licensed - more information below -->
<dependency>
  <groupId>org.neo4j</groupId>
  <artifactId>neo4j-tinkerpop-api-impl</artifactId>
  <version>0.7-3.2.3</version>
</dependency>
----

link:http://neo4j.com[Neo4j, Inc.] are the developers of the OLTP-based link:http://neo4j.com[Neo4j graph database].

WARNING: Unless under a commercial agreement with Neo4j, Inc., Neo4j is licensed
link:http://en.wikipedia.org/wiki/Affero_General_Public_License[AGPL]. The `neo4j-gremlin` module is licensed Apache2
because it only references the Apache2-licensed Neo4j API (not its implementation). Note that neither the
<<gremlin-console,Gremlin Console>> nor <<gremlin-server,Gremlin Server>> distribute with the Neo4j implementation
binaries. To access the binaries, use the `:install` command to download binaries from
link:http://search.maven.org/[Maven Central Repository].

TIP: For configuring Grape, the dependency resolver of Groovy, please refer to the <<gremlin-applications,Gremlin Applications>> section.

[source,groovy]
----
gremlin> :install org.apache.tinkerpop neo4j-gremlin x.y.z
==>Loaded: [org.apache.tinkerpop, neo4j-gremlin, x.y.z] - restart the console to use [tinkerpop.neo4j]
gremlin> :q
...
gremlin> :plugin use tinkerpop.neo4j
==>tinkerpop.neo4j activated
gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
----

TIP: To host Neo4j in <<gremlin-server,Gremlin Server>>, the dependencies must first be "installed" or otherwise
copied to the Gremlin Server path. The automated method for doing this would be to execute
`bin/gremlin-server.sh install org.apache.tinkerpop neo4j-gremlin x.y.z`. Once installed, the Gremlin Server
configuration file must be edited to include the `Neo4jGremlinPlugin` as shown in `conf/gremlin-server.neo4j`.

=== Indices

Neo4j 2.x indices leverage vertex labels to partition the index space. TinkerPop does not provide method interfaces
for defining schemas/indices for the underlying graph system. Thus, in order to create indices, it is important to
call the Neo4j API directly.

NOTE: `Neo4jGraphStep` will attempt to discern which indices to use when executing a traversal of the form `g.V().has()`.

The Gremlin-Console session below demonstrates Neo4j indices. For more information, please refer to the Neo4j documentation:

* Manipulating indices with link:http://neo4j.com/docs/developer-manual/current/#query-schema-index[Cypher].
* Manipulating indices with the Neo4j link:http://neo4j.com/docs/stable/tutorials-java-embedded-new-index.html[Java API].

[gremlin-groovy]
----
graph = Neo4jGraph.open('/tmp/neo4j')
g = traversal().withEmbedded(graph)
graph.cypher("CREATE INDEX ON :person(name)")
graph.tx().commit()  <1>
g.addV('person').property('name','marko')
g.addV('dog').property('name','puppy')
g.V().hasLabel('person').has('name','marko').values('name')
graph.close()
----

<1> Schema mutations must happen in a different transaction than graph mutations

Below demonstrates the runtime benefits of indices and demonstrates how if there is no defined index (only vertex
labels), a linear scan of the vertex-label partition is still faster than a linear scan of all vertices.

[gremlin-groovy]
----
graph = Neo4jGraph.open('/tmp/neo4j')
g = traversal().withEmbedded(graph)
g.io('data/grateful-dead.xml').read().iterate()
g.tx().commit()
clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()}  <1>
graph.cypher("CREATE INDEX ON :artist(name)") <2>
g.tx().commit()
Thread.sleep(5000) <3>
clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} <4>
clock(1000) {g.V().has('name','Garcia').iterate()} <5>
graph.cypher("DROP INDEX ON :artist(name)") <6>
g.tx().commit()
graph.close()
----

<1> Find all artists whose name is Garcia which does a linear scan of the artist vertex-label partition.
<2> Create an index for all artist vertices on their name property.
<3> Neo4j indices are eventually consistent so this stalls to give the index time to populate itself.
<4> Find all artists whose name is Garcia which uses the pre-defined schema index.
<5> Find all vertices whose name is Garcia which requires a linear scan of all the data in the graph.
<6> Drop the created index.

=== Cypher

image::gremlin-loves-cypher.png[width=400]

NeoTechnology are the creators of the graph pattern-match query language link:https://neo4j.com/developer/cypher-query-language/[Cypher].
It is possible to leverage Cypher from within Gremlin by using the `Neo4jGraph.cypher()` graph traversal method.

[gremlin-groovy]
----
graph = Neo4jGraph.open('/tmp/neo4j')
g = traversal().withEmbedded(graph)
g.io('data/tinkerpop-modern.kryo').read().iterate()
graph.cypher('MATCH (a {name:"marko"}) RETURN a')
graph.cypher('MATCH (a {name:"marko"}) RETURN a').select('a').out('knows').values('name')
graph.close()
----

Thus, like <<match-step,`match()`>>-step in Gremlin, it is possible to do a declarative pattern match and then move
back into imperative Gremlin.

TIP: For those developers using <<gremlin-server,Gremlin Server>> against Neo4j, it is possible to do Cypher queries
by simply placing the Cypher string in `graph.cypher(...)` before submission to the server.

=== Multi-Label

TinkerPop requires every `Element` to have a single, immutable string label (i.e. a `Vertex`, `Edge`, and
`VertexProperty`). In Neo4j, a `Node` (vertex) can have an
link:http://neo4j.com/docs/developer-manual/current/#graphdb-neo4j-labels[arbitrary number of labels] while a `Relationship`
(edge) can have one and only one. Furthermore, in Neo4j, `Node` labels are mutable while `Relationship` labels are
not. In order to handle this mismatch, three `Neo4jVertex` specific methods exist in Neo4j-Gremlin.

[source,java]
public Set<String> labels() // get all the labels of the vertex
public void addLabel(String label) // add a label to the vertex
public void removeLabel(String label) // remove a label from the vertex

An example use case is presented below.

[gremlin-groovy]
----
graph = Neo4jGraph.open('/tmp/neo4j')
g = traversal().withEmbedded(graph)
vertex = (Neo4jVertex) g.addV('human::animal').next() <1>
vertex.label() <2>
vertex.labels() <3>
vertex.addLabel('organism') <4>
vertex.label()
vertex.removeLabel('human') <5>
vertex.labels()
vertex.addLabel('organism') <6>
vertex.labels()
vertex.removeLabel('human') <7>
vertex.label()
g.V().has(label,'organism') <8>
g.V().has(label,of('organism')) <9>
g.V().has(label,of('organism')).has(label,of('animal'))
g.V().has(label,of('organism').and(of('animal')))
graph.close()
----

<1> Typecasting to a `Neo4jVertex` is only required in Java.
<2> The standard `Vertex.label()` method returns all the labels in alphabetical order concatenated using `::`.
<3> `Neo4jVertex.labels()` method returns the individual labels as a set.
<4> `Neo4jVertex.addLabel()` method adds a single label.
<5> `Neo4jVertex.removeLabel()` method removes a single label.
<6> Labels are unique and thus duplicate labels don't exist.
<7> If a label that does not exist is removed, nothing happens.
<8> `P.eq()` does a full string match and should only be used if multi-labels are not leveraged.
<9> `LabelP.of()` is specific to `Neo4jGraph` and used for multi-label matching.

IMPORTANT: `LabelP.of()` is only required if multi-labels are leveraged. `LabelP.of()` is used when
filtering/looking-up vertices by their label(s) as the standard `P.eq()` does a direct match on the `::`-representation
of `vertex.label()`

=== Configuration

The previous examples showed how to create a `Neo4jGraph` with the default configuration, but Neo4j has many other
options to initialize it that are native to Neo4j. In order to expose those, `Neo4jGraph` has an `open(Configuration)`
method which takes a standard Apache Configuration object. The same can be said of the standard method for creating
`Graph` instances with `GraphFactory`. Each configuration key that Neo4j has must simply be prefixed with
`gremlin.neo4j.conf.` and the suffix configuration key will be passed through to Neo4j.

NOTE: Gremlin Server uses `GraphFactory` to instantiate the `Graph` instances it manages, so the example below is also
relevant for that purpose as well.

For example, a standard configuration file called `neo4j.properties` that sets the Neo4j
`dbms.index_sampling.background_enabled` setting might look like:

[source,properties]
----
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j
gremlin.neo4j.conf.dbms.index_sampling.background_enabled=true
----

which can then be used as follows:

[source,text]
----
gremlin> graph = GraphFactory.open('neo4j.properties')
==>neo4jgraph[community single [/tmp/neo4j]]
gremlin> g = traversal().withEmbedded(graph)
==>graphtraversalsource[neo4jgraph[community single [/tmp/neo4j]], standard]
----

Having this ability to set standard Neo4j configurations makes it possible to better control the initialization of
Neo4j itself and provides the ability to enable certain features that would not otherwise be accessible.

=== Bolt Configuration

While `Neo4jGraph` enables Gremlin based queries, users may find it helpful to also be able to connect to that graph
with native Neo4j drivers and other tools from that space. It is possible to enable the
link:https://boltprotocol.org/[Bolt Protocol] as a way to do this:

[source,properties]
----
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j
gremlin.neo4j.conf.dbms.connector.0.type=BOLT
gremlin.neo4j.conf.dbms.connector.0.enabled=true
gremlin.neo4j.conf.dbms.connector.0.address=localhost:7687
----

This configuration is especially relevant to Gremlin Server where one might want to connect to the same graph instance
with both Gremlin and Cypher.

[source,text]
----
gremlin> :install org.neo4j.driver neo4j-java-driver 1.7.2
==>Loaded: [org.neo4j.driver, neo4j-java-driver, 1.7.2]
... // restart Gremlin Console
gremlin> import org.neo4j.driver.v1.*
==>org.apache.tinkerpop.gremlin.structure.*, org.apache.tinkerpop.gremlin.structure.util.*, ... org.neo4j.driver.v1.*
gremlin> driver = GraphDatabase.driver( "bolt://localhost:7687", AuthTokens.basic("neo4j", "neo4j"))
Oct 28, 2019 3:28:20 PM org.neo4j.driver.internal.logging.JULogger info
INFO: Direct driver instance 1385140107 created for server address localhost:7687
==>org.neo4j.driver.internal.InternalDriver@528f8f8b
gremlin> session = driver.session()
==>org.neo4j.driver.internal.NetworkSession@f3fcd59
gremlin> session.run( "CREATE (a:person {name: {name}, age: {age}})",
......1>                 Values.parameters("name", "stephen", "age", 29))
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> g.V().elementMap()
==>{id=0, label=person, name=stephen, age=29}
----

=== High Availability Configuration

image:neo4j-ha.png[width=400,float=right] TinkerPop supports running Neo4j with its fault tolerant master-slave
replication configuration, referred to as its
link:http://neo4j.com/docs/operations-manual/current/#_neo4j_cluster_install[High Availability (HA) cluster]. From the
TinkerPop perspective, configuring for HA is not that different than configuring for embedded mode as shown above. The
main difference is the usage of HA configuration options that enable the cluster. Once connected to a cluster, usage
from the TinkerPop perspective is largely the same.

In configuring for HA the most important thing to realize is that all Neo4j HA settings are simply passed through the
TinkerPop configuration settings given to the `GraphFactory.open()` or `Neo4j.open()` methods. For example, to
provide the all-important `ha.server_id` configuration option through TinkerPop, simply prefix that key with the
TinkerPop Neo4j key of `gremlin.neo4j.conf`.

The following properties demonstrates one of the three configuration files required to setup a simple three node HA
cluster on the same machine instance:

[source,properties]
----
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j.server1
gremlin.neo4j.conf.ha.server_id=1
gremlin.neo4j.conf.ha.initial_hosts=localhost:5001\,localhost:5002\,localhost:5003
gremlin.neo4j.conf.ha.host.coordination=localhost:5001
gremlin.neo4j.conf.ha.host.data=localhost:6001
----

Assuming the intent is to configure this cluster completely within TinkerPop (perhaps within three separate Gremlin
Server instances), the other two configuration files will be quite similar. The second will be:

[source,properties]
----
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j.server2
gremlin.neo4j.conf.ha.server_id=2
gremlin.neo4j.conf.ha.initial_hosts=localhost:5001\,localhost:5002\,localhost:5003
gremlin.neo4j.conf.ha.host.coordination=localhost:5002
gremlin.neo4j.conf.ha.host.data=localhost:6002
----

and the third will be:

[source,properties]
----
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j.server3
gremlin.neo4j.conf.ha.server_id=3
gremlin.neo4j.conf.ha.initial_hosts=localhost:5001\,localhost:5002\,localhost:5003
gremlin.neo4j.conf.ha.host.coordination=localhost:5003
gremlin.neo4j.conf.ha.host.data=localhost:6003
----

IMPORTANT: The backslashes in the values provided to `gremlin.neo4j.conf.ha.initial_hosts` prevent that configuration
setting as being interpreted as a `List`.

Create three separate Gremlin Server configuration files and point each at one of these Neo4j files. Since these Gremlin
Server instances will be running on the same machine, ensure that each Gremlin Server instance has a unique `port`
setting in that Gremlin Server configuration file. Start each Gremlin Server instance to bring the HA cluster online.

NOTE: `Neo4jGraph` instances will block until all nodes join the cluster.

Neither Gremlin Server nor Neo4j will share transactions across the cluster. Be sure to either use Gremlin Server
managed transactions or, if using a session without that option, ensure that all requests are being routed to the
same server.

This example discussed use of Gremlin Server to demonstrate the HA configuration, but it is also easy to setup with
three Gremlin Console instances. Simply start three Gremlin Console instances and use `GraphFactory` to read those
configuration files to form the cluster. Furthermore, keep in mind that it is possible to have a Gremlin Console join
a cluster handled by two Gremlin Servers or Neo4j Enterprise. The only limits as to how the configuration can be
utilized are prescribed by Neo4j itself. Please refer to their
link:http://neo4j.com/docs/operations-manual/current/#ha-setup-tutorial[documentation] for more information on how
this feature works.
