= Using Graph Databases with Groovy
Paul King
:revdate: 2024-09-02T22:18:00+00:00
:updated: 2024-12-11T10:19:00+00:00
:keywords: tugraph, apache tinkerpop, gremlin, neo4j, apache age, graph databases, apache hugegraph, arcadedb, orientdb, groovy
:description: This post illustrates using graph databases with Groovy.
In this blog post, we look at using property graph databases with Groovy.
We'll look at:
* Some advantages of property graph database technologies
* Some features of Groovy which make using such databases a little nicer
* Code examples for a common case study across 7 interesting graph databases
== Case Study
The Olympics is over for another 4 years. For sports fans, there were many exciting moments.
Let's look at just one event where the Olympic record was broken several times over the
last three years. We'll look at the women's 100m backstroke and model the results using
graph databases.
Why the women's 100m backstroke? Well, that was a particularly exciting event
in terms of broken records. In Heat 4 of the Tokyo 2021 Olympics, Kylie Masse broke the record previously
held by Emily Seebohm from the London 2012 Olympics. A few minutes later in Heat 5, Regan Smith
broke the record again. Then in another few minutes in Heat 6, Kaylee McKeown broke the record again.
On the following day in Semifinal 1, Regan took back the record. Then, on the following
day in the final, Kaylee reclaimed the record. At the Paris 2024 Olympics,
Kaylee bettered her own record in the final. Then a few days later,
Regan led off the 4 x 100m medley relay and broke the backstroke record swimming the first leg.
That makes 7 times the record was broken across the last 2 games!
image:img/BackstrokeRecord.png[Result of Semifinal1,70%]
We'll have vertices in our graph database corresponding to the swimmers and the swims.
We'll use the labels `Swimmer` and `Swim` for these vertices. We'll have relationships
such as `swam` and `supersedes` between vertices.
We'll explore modelling and querying the event
information using several graph database technologies.
The examples in this post can be found on
https://github.com/paulk-asert/groovy-graphdb/[GitHub].
== Why graph databases?
Relational databases are many times more popular than graph databases, but there is a
range of scenarios where graph databases are often used.
Which scenarios? Usually, it boils down to relationships.
If there are important relationships between data in your system,
graph databases might make sense.
Typical usage scenarios include fraud detection, knowledge graphs, recommendation engines,
social networks, and supply chain management.
This blog post doesn't aim to convert everyone to use graph databases all the time,
but we'll show you some examples of when it might make sense and let you make up your own mind.
Graph databases certainly represent a very useful tool to have in your toolbox should the need arise.
Graph databases are known for queries that are more succinct and,
in some scenarios, vastly more efficient.
As a first example, do you prefer this Cypher query (it's from the TuGraph code we'll see later,
but other technologies are similar):
[source,sql]
----
MATCH (sr:Swimmer)-[:swam]->(sm:Swim {at: 'Paris 2024'})
RETURN DISTINCT sr.country AS country
----
Or the equivalent SQL query assuming we were storing
the information in relational tables:
[source,sql]
----
SELECT DISTINCT country FROM Swimmer
LEFT JOIN Swimmer_Swim
ON Swimmer.swimmerId = Swimmer_Swim.fkSwimmer
LEFT JOIN Swim
ON Swim.swimId = Swimmer_Swim.fkSwim
WHERE Swim.at = 'Paris 2024'
----
This SQL query is typical of what is required when we have a many-to-many relationship
between our entities, in this case _swimmers_ and _swims_. Many-to-many is required to
correctly model relay swims like the last record swim (though for brevity, we haven't
included the other relay swimmers in our dataset). The multiple joins in that query
can also be notoriously slow for large datasets.
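To make the join-table idea concrete, here is a minimal plain-Groovy sketch (no database involved) that mimics the three relational tables with lists of maps. The table and column names follow the SQL above, but the ids and the small subset of rows are illustrative assumptions, not data from the real example project.

[source,groovy]
----
// In-memory stand-ins for the Swimmer, Swim and Swimmer_Swim tables (illustrative subset)
var swimmers = [
    [swimmerId: 1, name: 'Regan Smith',    country: '🇺🇸'],
    [swimmerId: 2, name: 'Kaylee McKeown', country: '🇦🇺'],
    [swimmerId: 3, name: 'Kylie Masse',    country: '🇨🇦']
]
var swims = [
    [swimId: 10, at: 'Paris 2024', event: 'Final',      time: 57.66],
    [swimId: 11, at: 'Paris 2024', event: 'Relay leg1', time: 57.28],
    [swimId: 12, at: 'Paris 2024', event: 'Final',      time: 57.33],
    [swimId: 13, at: 'Tokyo 2021', event: 'Heat 4',     time: 58.17]
]
// The join table holds one row per (swimmer, swim) pair
var swimmerSwim = [
    [fkSwimmer: 1, fkSwim: 10],
    [fkSwimmer: 1, fkSwim: 11],
    [fkSwimmer: 2, fkSwim: 12],
    [fkSwimmer: 3, fkSwim: 13]
]

// Each step below corresponds to one hop through the join table
var parisSwimIds = swims.findAll { it.at == 'Paris 2024' }*.swimId
var swimmerIds = swimmerSwim.findAll { it.fkSwim in parisSwimIds }*.fkSwimmer
var countries = swimmers.findAll { it.swimmerId in swimmerIds }*.country.toSet()

assert countries == ['🇺🇸', '🇦🇺'] as Set
----

The two intermediate id lists are exactly the bookkeeping that the graph query avoids by following the `swam` relationship directly.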
We'll see other examples later too, one being a query involving traversal of relationships.
Here is the Cypher (again from TuGraph):
[source,sql]
----
MATCH (s1:Swim)-[:supersedes*1..10]->(s2:Swim {at: 'London 2012'})
RETURN s1.at as at, s1.event as event
----
And the equivalent SQL:
[source,sql]
----
WITH RECURSIVE traversed(swimId) AS (
SELECT fkNew FROM Supersedes
WHERE fkOld IN (
SELECT swimId FROM Swim
WHERE event = 'Heat 4' AND at = 'London 2012'
)
UNION ALL
SELECT Supersedes.fkNew as swimId
FROM traversed as t
JOIN Supersedes
ON t.swimId = Supersedes.fkOld
WHERE t.swimId = swimId
)
SELECT at, event FROM Swim
WHERE swimId IN (SELECT * FROM traversed)
----
Here we have a `Supersedes` table and a recursive SQL function, `traversed`.
The details aren't important, but it shows the kind of complexity typically
required for the kind of relationship traversal we are looking at.
There are certainly far more complex SQL examples for different kinds of
traversals like shortest path.
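For readers who want to see what such a traversal computes without any database, here is a minimal plain-Groovy sketch. The `supersedes` map and the `recordsAfter` helper are hypothetical names introduced purely for illustration; the map mirrors the chain of records described in the case study and assumes the chain has no cycles.

[source,groovy]
----
// Each entry maps a record swim to the swim it superseded (newer -> older),
// mirroring the direction of the supersedes relationship in the case study
var supersedes = [
    'Tokyo 2021 Heat 4'     : 'London 2012 Heat 4',
    'Tokyo 2021 Heat 5'     : 'Tokyo 2021 Heat 4',
    'Tokyo 2021 Heat 6'     : 'Tokyo 2021 Heat 5',
    'Tokyo 2021 Semifinal 1': 'Tokyo 2021 Heat 6',
    'Tokyo 2021 Final'      : 'Tokyo 2021 Semifinal 1',
    'Paris 2024 Final'      : 'Tokyo 2021 Final',
    'Paris 2024 Relay leg1' : 'Paris 2024 Final'
]

// Hypothetical helper: starting from an old record, repeatedly collect the swims
// that superseded the current frontier (assumes the relationship has no cycles)
List<String> recordsAfter(Map<String, String> supersedes, String start) {
    var found = []
    var frontier = [start]
    while (frontier) {
        // one hop: all swims whose superseded swim is in the current frontier
        frontier = supersedes.findAll { newer, older -> older in frontier }.keySet().toList()
        found.addAll(frontier)
    }
    found
}

var records = recordsAfter(supersedes, 'London 2012 Heat 4')
assert records.size() == 7
assert records.first() == 'Tokyo 2021 Heat 4'
assert records.last() == 'Paris 2024 Relay leg1'
----

The loop plays the role of the recursive part of the SQL, while a graph query language expresses the same repeated hop in a single pattern.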
This example used TuGraph's Cypher variant as the query language. Not all the
databases we'll look at support Cypher, but they all have some kind of query
language or API that makes such queries shorter.
Several of the other databases do support a variant of https://www.iso.org/standard/76120.html[Cypher].
Others support different SQL-like query languages.
We'll also see several JVM-based databases which support TinkerPop/Gremlin.
It's a Groovy-based technology and will be our first technology to explore.
Recently, ISO published an international standard, https://www.iso.org/standard/76120.html[GQL],
for property graph databases. We expect to see databases supporting that standard
in the not too distant future.
Now, it's time to explore the case study using our different database technologies.
We tried to pick technologies that seem reasonably well maintained, had reasonable
JVM support, and had any features that seemed worth showing off. Several we
selected because they have TinkerPop/Gremlin support.
== Apache TinkerPop
Our first technology to examine is https://tinkerpop.apache.org/[Apache TinkerPop™].
image:https://tinkerpop.apache.org/img/tinkerpop-splash.png[tinkerpop logo,70%]
TinkerPop is an open source computing framework for graph databases. It provides
a common abstraction layer, and a graph query language, called Gremlin.
This allows you to work with numerous graph database implementations in a consistent way.
TinkerPop also provides its own graph engine implementation, called TinkerGraph,
which is what we'll use initially. TinkerPop/Gremlin will be a technology we revisit
for other databases later.
We'll look at the swims for the medalists and record breakers at the Tokyo 2021 and Paris 2024 Olympics
in the women's 100m backstroke. For reference purposes, we'll also include the previous swim that
set an Olympic record.
We'll start by creating a new in-memory graph database and
create a helper object for traversing the graph:
[source,groovy]
----
var graph = TinkerGraph.open()
var g = traversal().withEmbedded(graph)
----
Next, let's create the information relevant for the previous Olympic record which was set
at the London 2012 Olympics. Emily Seebohm set that record in Heat 4:
[source,groovy]
----
var es = g.addV('Swimmer').property(name: 'Emily Seebohm', country: '🇦🇺').next()
swim1 = g.addV('Swim').property(at: 'London 2012', event: 'Heat 4', time: 58.23, result: 'First').next()
es.addEdge('swam', swim1)
----
We can print out some information from our newly created nodes (vertices)
by querying the properties of two nodes respectively:
[source,groovy]
----
var (name, country) = ['name', 'country'].collect { es.value(it) }
var (at, event, time) = ['at', 'event', 'time'].collect { swim1.value(it) }
println "$name from $country swam a time of $time in $event at the $at Olympics"
----
Which has this output:
----
Emily Seebohm from 🇦🇺 swam a time of 58.23 in Heat 4 at the London 2012 Olympics
----
So far, we've just been using the Java API from TinkerPop.
It also provides some additional syntactic sugar for Groovy.
We can enable the syntactic sugar with:
[source,groovy]
----
SugarLoader.load()
----
Which then lets us write (instead of the three earlier lines) the slightly shorter:
[source,groovy]
----
println "$es.name from $es.country swam a time of $swim1.time in $swim1.event at the $swim1.at Olympics"
----
This uses Groovy's normal property access syntax and has the same output when executed.
Let's create some helper methods to simplify creation of the remaining information.
[source,groovy]
----
def insertSwimmer(TraversalSource g, name, country) {
g.addV('Swimmer').property(name: name, country: country).next()
}
def insertSwim(TraversalSource g, at, event, time, result, swimmer) {
var swim = g.addV('Swim').property(at: at, event: event, time: time, result: result).next()
swimmer.addEdge('swam', swim)
swim
}
----
Now we can create the remaining swim information:
[source,groovy]
----
var km = insertSwimmer(g, 'Kylie Masse', '🇨🇦')
var swim2 = insertSwim(g, 'Tokyo 2021', 'Heat 4', 58.17, 'First', km)
swim2.addEdge('supersedes', swim1)
var swim3 = insertSwim(g, 'Tokyo 2021', 'Final', 57.72, '🥈', km)
var rs = insertSwimmer(g, 'Regan Smith', '🇺🇸')
var swim4 = insertSwim(g, 'Tokyo 2021', 'Heat 5', 57.96, 'First', rs)
swim4.addEdge('supersedes', swim2)
var swim5 = insertSwim(g, 'Tokyo 2021', 'Semifinal 1', 57.86, '', rs)
var swim6 = insertSwim(g, 'Tokyo 2021', 'Final', 58.05, '🥉', rs)
var swim7 = insertSwim(g, 'Paris 2024', 'Final', 57.66, '🥈', rs)
var swim8 = insertSwim(g, 'Paris 2024', 'Relay leg1', 57.28, 'First', rs)
var kmk = insertSwimmer(g, 'Kaylee McKeown', '🇦🇺')
var swim9 = insertSwim(g, 'Tokyo 2021', 'Heat 6', 57.88, 'First', kmk)
swim9.addEdge('supersedes', swim4)
swim5.addEdge('supersedes', swim9)
var swim10 = insertSwim(g, 'Tokyo 2021', 'Final', 57.47, '🥇', kmk)
swim10.addEdge('supersedes', swim5)
var swim11 = insertSwim(g, 'Paris 2024', 'Final', 57.33, '🥇', kmk)
swim11.addEdge('supersedes', swim10)
swim8.addEdge('supersedes', swim11)
var kb = insertSwimmer(g, 'Katharine Berkoff', '🇺🇸')
var swim12 = insertSwim(g, 'Paris 2024', 'Final', 57.98, '🥉', kb)
----
Note that we just entered the swims where medals were won or
where Olympic records were broken. We could easily have added
more swimmers, other strokes and distances, relay events,
and even other sports if we wanted to.
Let's have a look at what our graph now looks like:
image:https://raw.githubusercontent.com/paulk-asert/groovy-graphdb/main/docs/images/BackstrokeRecords.png[network of swim and swimmer vertices and relationship edges]
We now might want to query the graph in numerous ways.
For instance, what countries had success at the Paris 2024 Olympics,
where success is defined, for the purposes of this query, as
winning a medal or breaking a record. Of course, just having
a swimmer make the Olympic team is a great success - but let's
keep our example simple for now.
[source,groovy]
----
var successInParis = g.V().out('swam').has('at', 'Paris 2024').in()
.values('country').toSet()
assert successInParis == ['🇺🇸', '🇦🇺'] as Set
----
By way of explanation, we find all nodes with an outgoing `swam` edge
pointing to a swim that was at the Paris 2024 Olympics, i.e.
all the swimmers from Paris 2024. We then find the set of countries
represented. We use sets here to remove duplicates and, since
we aren't imposing an ordering on the returned results, we compare
sets on both sides.
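As a small aside, the following plain-Groovy sketch shows why comparing sets is the safer choice. The two lists are made-up stand-ins for results that might come back in different orders or with duplicates; they are not output from TinkerPop.

[source,groovy]
----
// Two hypothetical result lists: same countries, different order, one duplicate
var fromOneRun = ['🇺🇸', '🇦🇺', '🇺🇸']
var fromAnotherRun = ['🇦🇺', '🇺🇸']

// As lists they differ, because order and duplicates matter ...
assert fromOneRun != fromAnotherRun
// ... but as sets they are equal
assert fromOneRun.toSet() == fromAnotherRun.toSet()
assert fromOneRun.toSet() == ['🇺🇸', '🇦🇺'] as Set
----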
Similarly, we can find the Olympics at which records were set during heat swims:
[source,groovy]
----
var recordSetInHeat = g.V().has('Swim','event', startingWith('Heat')).values('at').toSet()
assert recordSetInHeat == ['London 2012', 'Tokyo 2021'] as Set
----
Or, we can find the times of the records set during finals:
[source,groovy]
----
var recordTimesInFinals = g.V().has('event', 'Final').as('ev').out('supersedes')
.select('ev').values('time').toSet()
assert recordTimesInFinals == [57.47, 57.33] as Set
----
Making use of the Groovy syntactic sugar gives simpler versions:
[source,groovy]
----
var successInParis = g.V.out('swam').has('at', 'Paris 2024').in.country.toSet
assert successInParis == ['🇺🇸', '🇦🇺'] as Set
var recordSetInHeat = g.V.has('Swim','event', startingWith('Heat')).at.toSet
assert recordSetInHeat == ['London 2012', 'Tokyo 2021'] as Set
var recordTimesInFinals = g.V.has('event', 'Final').as('ev').out('supersedes').select('ev').time.toSet
assert recordTimesInFinals == [57.47, 57.33] as Set
----
Groovy happens to be very good at allowing you to add syntactic sugar
for your own programs or existing classes. TinkerPop's special Groovy support
is just one example of this. Your vendor could certainly supply such a feature
for your favorite graph database (why not ask them?) but we'll look shortly at
how you could write such syntactic sugar yourself when we explore Neo4j.
Our examples so far are all interesting,
but graph databases really excel when performing queries
involving multiple edge traversals. Let's look
at all the Olympic records set in 2021 and 2024,
i.e. all records set after London 2012 (`swim1` from earlier):
[source,groovy]
----
println "Olympic records after ${g.V(swim1).values('at', 'event').toList().join(' ')}: "
println g.V(swim1).repeat(in('supersedes')).as('sw').emit()
.values('at').concat(' ')
.concat(select('sw').values('event')).toList().join('\n')
----
Or after using the Groovy syntactic sugar, the query becomes:
[source,groovy]
----
println g.V(swim1).repeat(in('supersedes')).as('sw').emit
.at.concat(' ').concat(select('sw').event).toList.join('\n')
----
Both have this output:
----
Olympic records after London 2012 Heat 4:
Tokyo 2021 Heat 4
Tokyo 2021 Heat 5
Tokyo 2021 Heat 6
Tokyo 2021 Semifinal 1
Tokyo 2021 Final
Paris 2024 Final
Paris 2024 Relay leg1
----
NOTE: While not important for our examples, TinkerPop has a `GraphMLWriter` class which can write out our
graph in _GraphML_, which is how the earlier image of Graphs and Nodes was initially generated.
== Neo4j
Our next technology to examine is
https://neo4j.com/product/neo4j-graph-database/[Neo4j]. Neo4j is a graph
database storing nodes and edges. Nodes and edges may have a label and properties (or attributes).
image:https://dist.neo4j.com/wp-content/uploads/20230926084108/Logo_FullColor_RGB_TransBG.svg[neo4j logo,50%]
Neo4j models edge relationships using enums. Let's create an enum for our example:
[source,groovy]
----
enum SwimmingRelationships implements RelationshipType {
swam, supersedes, runnerup
}
----
We'll use Neo4j in embedded mode and perform all of our operations
as part of a transaction:
[source,groovy]
----
// ... set up managementService ...
var graphDb = managementService.database(DEFAULT_DATABASE_NAME)
try (Transaction tx = graphDb.beginTx()) {
// ... other Neo4j code below here ...
}
----
Let's create our nodes and edges using Neo4j. First the existing Olympic record:
[source,groovy]
----
es = tx.createNode(label('Swimmer'))
es.setProperty('name', 'Emily Seebohm')
es.setProperty('country', '🇦🇺')
swim1 = tx.createNode(label('Swim'))
swim1.setProperty('event', 'Heat 4')
swim1.setProperty('at', 'London 2012')
swim1.setProperty('result', 'First')
swim1.setProperty('time', 58.23d)
es.createRelationshipTo(swim1, swam)
var name = es.getProperty('name')
var country = es.getProperty('country')
var at = swim1.getProperty('at')
var event = swim1.getProperty('event')
var time = swim1.getProperty('time')
println "$name from $country swam a time of $time in $event at the $at Olympics"
----
While there is nothing wrong with this code, Groovy has many features for making code more succinct.
Let's use some dynamic metaprogramming to achieve just that.
[source,groovy]
----
Node.metaClass {
propertyMissing { String name, val -> delegate.setProperty(name, val) }
propertyMissing { String name -> delegate.getProperty(name) }
methodMissing { String name, args ->
delegate.createRelationshipTo(args[0], SwimmingRelationships."$name")
}
}
----
What does this do? The `propertyMissing` lines catch attempts to use Groovy's
normal property access and funnel them through the appropriate `setProperty` and `getProperty` methods.
The methodMissing line means any attempted method calls that we don't recognize
are intended to be relationship creation, so we funnel them through the appropriate
`createRelationshipTo` method call.
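If you'd like to experiment with these hooks without a Neo4j dependency, here is a minimal stand-alone sketch. `FakeNode` is a hypothetical class invented for this illustration, and the hooks are declared directly in the class rather than added through `metaClass` as in the Neo4j code, so treat it as an analogy rather than the same mechanism.

[source,groovy]
----
// Hypothetical stand-in for a graph node: properties live in a map,
// relationships in a list; this class is not part of Neo4j
class FakeNode {
    Map<String, Object> props = [:]
    List<List> relationships = []

    // node.foo = value  ->  stored as a node property
    def propertyMissing(String name, value) { props[name] = value }

    // node.foo          ->  looked up as a node property
    def propertyMissing(String name) { props[name] }

    // node.foo(other)   ->  recorded as a relationship named 'foo'
    def methodMissing(String name, args) {
        relationships << [name, args[0]]
        this
    }
}

var swimmer = new FakeNode()
swimmer.name = 'Kylie Masse'
var swim = new FakeNode()
swim.at = 'Tokyo 2021'
swimmer.swam(swim)

assert swimmer.name == 'Kylie Masse'
assert swimmer.relationships[0][0] == 'swam'
assert swimmer.relationships[0][1].is(swim)
----

Because the property and relationship names are resolved at runtime, a typo such as `swimmer.swma(swim)` would quietly create a differently named relationship in this sketch, which is one reason static techniques can be attractive.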
Now we can use normal Groovy property access for setting the node properties.
It looks much cleaner.
We define an edge relationship simply by calling a method having the relationship name.
[source,groovy]
----
km = tx.createNode(label('Swimmer'))
km.name = 'Kylie Masse'
km.country = '🇨🇦'
----
The code is already a little cleaner, but we can tweak the metaprogramming a little
more to get rid of the noise associated with the `label` method:
[source,groovy]
----
Transaction.metaClass {
createNode { String labelName -> delegate.createNode(label(labelName)) }
}
----
This adds an overload for `createNode` that takes a `String`, and
node creation is improved again, as we can see here:
[source,groovy]
----
swim2 = tx.createNode('Swim')
swim2.time = 58.17d
swim2.result = 'First'
swim2.event = 'Heat 4'
swim2.at = 'Tokyo 2021'
km.swam(swim2)
swim2.supersedes(swim1)
swim3 = tx.createNode('Swim')
swim3.time = 57.72d
swim3.result = '🥈'
swim3.event = 'Final'
swim3.at = 'Tokyo 2021'
km.swam(swim3)
----
The code for relationships is certainly a lot cleaner too,
and it was quite a minimal amount of work to define the necessary metaprogramming.
With a little bit more work, we could use static metaprogramming techniques.
This would give us better IDE completion.
We'll have more to say about improved type checking at the end of this post.
For now though, let's continue with defining the rest of our graph.
We can redefine our `insertSwimmer` and `insertSwim` methods using Neo4j implementation
calls, and then our earlier code could be used to create our graph. Now let's
investigate what the queries look like. We'll start with querying via
the API, and later look at using Cypher.
First, the successful countries in Paris 2024:
[source,groovy]
----
var swimmers = [es, km, rs, kmk, kb]
var successInParis = swimmers.findAll { swimmer ->
swimmer.getRelationships(swam).any { run ->
run.getOtherNode(swimmer).at == 'Paris 2024'
}
}
assert successInParis*.country.unique() == ['🇺🇸', '🇦🇺']
----
Then, at which Olympics were records broken in heats:
[source,groovy]
----
var swims = [swim1, swim2, swim3, swim4, swim5, swim6, swim7, swim8, swim9, swim10, swim11, swim12]
var recordSetInHeat = swims.findAll { swim ->
swim.event.startsWith('Heat')
}*.at
assert recordSetInHeat.unique() == ['London 2012', 'Tokyo 2021']
----
Now, what were the times for records broken in finals:
[source,groovy]
----
var recordTimesInFinals = swims.findAll { swim ->
swim.event == 'Final' && swim.hasRelationship(supersedes)
}*.time
assert recordTimesInFinals == [57.47d, 57.33d]
----
To see traversal in action, Neo4j has a special API for doing such queries:
[source,groovy]
----
var info = { s -> "$s.at $s.event" }
println "Olympic records following ${info(swim1)}:"
for (Path p in tx.traversalDescription()
.breadthFirst()
.relationships(supersedes)
.evaluator(Evaluators.fromDepth(1))
.uniqueness(Uniqueness.NONE)
.traverse(swim1)) {
println p.endNode().with(info)
}
----
Earlier versions of Neo4j also supported Gremlin, so we could have written our queries in
the same way as we did for TinkerPop. That technology is deprecated in recent Neo4j versions, and instead
they now offer a Cypher query language. We can use that language for all of our previous queries
as shown here:
[source,groovy]
----
assert tx.execute('''
MATCH (s:Swim WHERE s.event STARTS WITH 'Heat')
WITH s.at as at
WITH DISTINCT at
RETURN at
''')*.at == ['London 2012', 'Tokyo 2021']
assert tx.execute('''
MATCH (s1:Swim {event: 'Final'})-[:supersedes]->(s2:Swim)
RETURN s1.time AS time
''')*.time == [57.47d, 57.33d]
tx.execute('''
MATCH (s1:Swim)-[:supersedes]->{1,}(s2:Swim { at: $at })
RETURN s1
''', [at: swim1.at])*.s1.each { s ->
println "$s.at $s.event"
}
----
.An aside on graph design
****
This blog post is definitely not meant to be an advanced course on graph database
design, but it is worth noting a few points.
Deciding which information should be stored as node properties and which as relationships
still requires developer judgement. For example, we could have added a Boolean `olympicRecord`
property to our `Swim` nodes. Certain queries might now become simpler, or at least more familiar
to traditional RDBMS SQL developers, but other queries might become much harder to write
and potentially much less efficient.
This is the kind of thing which needs to be thought through and sometimes experimented with.
Suppose, in the case where a record is broken, we wanted to see which other swimmers
(in our case medallists in the final) also broke the previous record.
We could write a query to find this as follows:
[source,groovy]
----
assert tx.execute('''
MATCH (sr1:Swimmer)-[:swam]->(sm1:Swim {event: 'Final'}), (sm2:Swim {event: 'Final'})-[:supersedes]->(sm3:Swim)
WHERE sm1.at = sm2.at AND sm1 <> sm2 AND sm1.time < sm3.time
RETURN sr1.name as name
''')*.name == ['Kylie Masse']
----
It's not too bad, but if we had a much larger graph of data, it could be quite slow.
We could instead opt to use an additional relationship, called `runnerup` in our graph.
[source,groovy]
----
swim6.runnerup(swim3)
swim3.runnerup(swim10)
swim12.runnerup(swim7)
swim7.runnerup(swim11)
----
The visualization is something like this:
image:img/BackstrokeRecordsRunnerup.png[Additional runnerup relationship,60%]
It essentially makes it easier to find the other medalists if we know any one of them.
The resulting query becomes this:
[source,groovy]
----
assert tx.execute('''
MATCH (sr1:Swimmer)-[:swam]->(sm1:Swim {event: 'Final'})-[:runnerup]->{1,2}(sm2:Swim {event: 'Final'})-[:supersedes]->(sm3:Swim)
WHERE sm1.time < sm3.time
RETURN sr1.name as name
''')*.name == ['Kylie Masse']
----
The _MATCH_ clause is similar in complexity, but the _WHERE_ clause is much simpler.
The query is probably faster too, but it is a tradeoff that should be weighed up.
****
== Apache AGE
The next technology we'll look at is the https://age.apache.org/[Apache AGE™] graph database.
Apache AGE leverages https://www.postgresql.org[PostgreSQL] for storage.
image:https://age.apache.org/age-manual/master/_static/logo.png[Apache AGE logo, 40%]
image:https://age.apache.org/img/logo-large-postgresql.jpg[PostgreSQL logo]
We installed Apache AGE via a Docker Image as outlined in the Apache AGE
https://age.apache.org/age-manual/master/intro/setup.html#installing-via-docker-image[manual].
Since Apache AGE offers a SQL-inspired graph database experience, we use Groovy's
SQL facilities to interact with the database:
[source,groovy]
----
Sql.withInstance(DB_URL, USER, PASS, 'org.postgresql.jdbc.PgConnection') { sql ->
// enable Apache AGE extension, then use Sql connection ...
}
----
For creating our nodes and subsequent querying, we use SQL statements
with embedded _cypher_ clauses. Here is the statement for creating
our nodes and edges:
[source,groovy]
----
sql.execute'''
SELECT * FROM cypher('swimming_graph', $$ CREATE
(es:Swimmer {name: 'Emily Seebohm', country: '🇦🇺'}),
(swim1:Swim {event: 'Heat 4', result: 'First', time: 58.23, at: 'London 2012'}),
(es)-[:swam]->(swim1),
(km:Swimmer {name: 'Kylie Masse', country: '🇨🇦'}),
(swim2:Swim {event: 'Heat 4', result: 'First', time: 58.17, at: 'Tokyo 2021'}),
(km)-[:swam]->(swim2),
(swim2)-[:supersedes]->(swim1),
(swim3:Swim {event: 'Final', result: '🥈', time: 57.72, at: 'Tokyo 2021'}),
(km)-[:swam]->(swim3),
(rs:Swimmer {name: 'Regan Smith', country: '🇺🇸'}),
(swim4:Swim {event: 'Heat 5', result: 'First', time: 57.96, at: 'Tokyo 2021'}),
(rs)-[:swam]->(swim4),
(swim4)-[:supersedes]->(swim2),
(swim5:Swim {event: 'Semifinal 1', result: 'First', time: 57.86, at: 'Tokyo 2021'}),
(rs)-[:swam]->(swim5),
(swim6:Swim {event: 'Final', result: '🥉', time: 58.05, at: 'Tokyo 2021'}),
(rs)-[:swam]->(swim6),
(swim7:Swim {event: 'Final', result: '🥈', time: 57.66, at: 'Paris 2024'}),
(rs)-[:swam]->(swim7),
(swim8:Swim {event: 'Relay leg1', result: 'First', time: 57.28, at: 'Paris 2024'}),
(rs)-[:swam]->(swim8),
(kmk:Swimmer {name: 'Kaylee McKeown', country: '🇦🇺'}),
(swim9:Swim {event: 'Heat 6', result: 'First', time: 57.88, at: 'Tokyo 2021'}),
(kmk)-[:swam]->(swim9),
(swim9)-[:supersedes]->(swim4),
(swim5)-[:supersedes]->(swim9),
(swim10:Swim {event: 'Final', result: '🥇', time: 57.47, at: 'Tokyo 2021'}),
(kmk)-[:swam]->(swim10),
(swim10)-[:supersedes]->(swim5),
(swim11:Swim {event: 'Final', result: '🥇', time: 57.33, at: 'Paris 2024'}),
(kmk)-[:swam]->(swim11),
(swim11)-[:supersedes]->(swim10),
(swim8)-[:supersedes]->(swim11),
(kb:Swimmer {name: 'Katharine Berkoff', country: '🇺🇸'}),
(swim12:Swim {event: 'Final', result: '🥉', time: 57.98, at: 'Paris 2024'}),
(kb)-[:swam]->(swim12)
$$) AS (a agtype)
'''
----
To find the Olympics at which records were set in heats, we
can use the following _cypher_ query:
[source,groovy]
----
assert sql.rows('''
SELECT * from cypher('swimming_graph', $$
MATCH (s:Swim)
WHERE left(s.event, 4) = 'Heat'
RETURN s
$$) AS (a agtype)
''').a*.map*.get('properties')*.at.toUnique() == ['London 2012', 'Tokyo 2021']
----
The results come back in a special JSON-like data type called `agtype`.
From that, we can query the properties and return the `at` property.
We select the unique ones to remove duplicates.
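To get a feel for the shape of such a value, here is a rough plain-Groovy sketch. It assumes the textual form of a vertex is JSON followed by a `::vertex` annotation, and the sample string (including its `id`) is made up for illustration. The query code above relies on the driver's `map` accessor instead, so this is only an approximation of what happens under the covers.

[source,groovy]
----
import groovy.json.JsonSlurper

// Assumed textual form of one agtype vertex (values are made up for illustration)
var raw = '{"id": 1, "label": "Swim", "properties": {"at": "London 2012", "event": "Heat 4", "time": 58.23}}::vertex'

// Drop the trailing type annotation, then treat the rest as ordinary JSON
var json = raw - '::vertex'
var vertex = new JsonSlurper().parseText(json)

// Subscript access avoids any confusion with Groovy's own 'properties' accessor
var props = vertex['properties']
assert vertex['label'] == 'Swim'
assert props['at'] == 'London 2012'
assert props['time'] == 58.23
----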
Similarly, we can find the times of Olympic records set in finals
as follows:
[source,groovy]
----
assert sql.rows('''
SELECT * from cypher('swimming_graph', $$
MATCH (s1:Swim {event: 'Final'})-[:supersedes]->(s2:Swim)
RETURN s1
$$) AS (a agtype)
''').a*.map*.get('properties')*.time == [57.47, 57.33]
----
To print all the Olympic records set across Tokyo 2021 and Paris 2024,
we can use `eachRow` and the following query:
[source,groovy]
----
sql.eachRow('''
SELECT * from cypher('swimming_graph', $$
MATCH (s1:Swim)-[:supersedes]->(swim1)
RETURN s1
$$) AS (a agtype)
''') {
println it.a*.map*.get('properties')[0].with{ "$it.at $it.event" }
}
----
The output looks like this:
----
Tokyo 2021 Heat 4
Tokyo 2021 Heat 5
Tokyo 2021 Heat 6
Tokyo 2021 Final
Tokyo 2021 Semifinal 1
Paris 2024 Final
Paris 2024 Relay leg1
----
The Apache AGE project also maintains a viewer tool offering a web-based
user interface for visualization of graph data stored in our database.
Instructions for installation are available on the
https://github.com/apache/age-viewer[GitHub site].
The tool allows visualization of the results from any query.
For our database, a query returning all nodes and edges creates
a visualization like below (we chose to manually re-arrange the nodes):
image:img/age-viewer.png[]
== OrientDB
image:https://www.orientdb.com/images/orientdb_logo_mid.png[orientdb logo,50%]
The next graph database we'll look at is https://orientdb.org/[OrientDB].
We used the open source Community edition. We used it in embedded mode but there are
https://orientdb.org/docs/3.0.x/gettingstarted/Tutorial-Installation.html[instructions]
for running a docker image as well.
The main claim to fame for OrientDB (and the closely related ArcadeDB we'll cover next)
is that they are multi-model databases, supporting graphs and documents
in the one database.
Creating our database and setting up our vertex and edge classes (think mini-schema)
is done as follows:
[source,groovy]
----
try (var db = context.open("swimming", "admin", "adminpwd")) {
db.createVertexClass('Swimmer')
db.createVertexClass('Swim')
db.createEdgeClass('swam')
db.createEdgeClass('supersedes')
// other code here
}
----
See the https://github.com/paulk-asert/groovy-graphdb/tree/main/orientdb[GitHub repo] for further details.
With initialization out of the way, we can start defining our nodes and edges:
[source,groovy]
----
var es = db.newVertex('Swimmer')
es.setProperty('name', 'Emily Seebohm')
es.setProperty('country', '🇦🇺')
var swim1 = db.newVertex('Swim')
swim1.setProperty('at', 'London 2012')
swim1.setProperty('result', 'First')
swim1.setProperty('event', 'Heat 4')
swim1.setProperty('time', 58.23)
es.addEdge(swim1, 'swam')
----
We can print out the details as before:
[source,groovy]
----
var (name, country) = ['name', 'country'].collect { es.getProperty(it) }
var (at, event, time) = ['at', 'event', 'time'].collect { swim1.getProperty(it) }
println "$name from $country swam a time of $time in $event at the $at Olympics"
----
At this point, we could apply some Groovy metaprogramming to make the code more succinct,
but we'll just flesh out our `insertSwimmer` and `insertSwim` helper methods like before.
We can use these to enter the remaining swim information.
Queries are performed through the Multi-Model API using an SQL-like language.
Three of the queries we've seen earlier look like this:
[source,groovy]
----
var results = db.query("SELECT expand(out('supersedes').in('supersedes')) FROM Swim WHERE event = 'Final'")
assert results*.getProperty('time').toSet() == [57.47, 57.33] as Set
results = db.query("SELECT expand(out('supersedes')) FROM Swim WHERE event.left(4) = 'Heat'")
assert results*.getProperty('at').toSet() == ['Tokyo 2021', 'London 2012'] as Set
results = db.query("SELECT country FROM ( SELECT expand(in('swam')) FROM Swim WHERE at = 'Paris 2024' )")
assert results*.getProperty('country').toSet() == ['🇺🇸', '🇦🇺'] as Set
----
Traversal looks like this:
[source,groovy]
----
results = db.query("TRAVERSE in('supersedes') FROM :swim", swim1)
results.each {
if (it.toElement() != swim1) {
println "${it.getProperty('at')} ${it.getProperty('event')}"
}
}
----
OrientDB also supports Gremlin and a studio Web-UI.
Both of these features are very similar to the ArcadeDB counterparts.
We'll examine them next when we look at ArcadeDB.
== ArcadeDB
Now, we'll examine https://arcadedb.com/#getting-started[ArcadeDB].
image:https://arcadedb.com/assets/images/arcadedb-logo-mini.png[arcadedb logo]
ArcadeDB is a rewrite/partial fork of OrientDB and carries over its Multi-Model nature.
We used it in embedded mode but there are
https://arcadedb.com/#getting-started[instructions] for running a docker image if you prefer.
Not surprisingly, some usage of ArcadeDB is very similar to OrientDB. Initialization
changes slightly:
[source,groovy]
----
var factory = new DatabaseFactory("swimming")
try (var db = factory.create()) {
db.transaction { ->
db.schema.with {
createVertexType('Swimmer')
createVertexType('Swim')
createEdgeType('swam')
createEdgeType('supersedes')
}
// ... other code goes here ...
}
}
----
Defining the existing record information is done as follows:
[source,groovy]
----
var es = db.newVertex('Swimmer')
es.set(name: 'Emily Seebohm', country: '🇦🇺').save()
var swim1 = db.newVertex('Swim')
swim1.set(at: 'London 2012', result: 'First', event: 'Heat 4', time: 58.23).save()
swim1.newEdge('swam', es, false).save()
----
Accessing the information can be done like this:
[source,groovy]
----
var (name, country) = ['name', 'country'].collect { es.get(it) }
var (at, event, time) = ['at', 'event', 'time'].collect { swim1.get(it) }
println "$name from $country swam a time of $time in $event at the $at Olympics"
----
ArcadeDB supports multiple query languages. The SQL-like language mirrors the OrientDB offering.
Here are our three now familiar queries:
[source,groovy]
----
var results = db.query('SQL', '''
SELECT expand(outV()) FROM (SELECT expand(outE('supersedes')) FROM Swim WHERE event = 'Final')
''')
assert results*.toMap().time.toSet() == [57.47, 57.33] as Set
results = db.query('SQL', "SELECT expand(outV()) FROM (SELECT expand(outE('supersedes')) FROM Swim WHERE event.left(4) = 'Heat')")
assert results*.toMap().at.toSet() == ['Tokyo 2021', 'London 2012'] as Set
results = db.query('SQL', "SELECT country FROM ( SELECT expand(out('swam')) FROM Swim WHERE at = 'Paris 2024' )")
assert results*.toMap().country.toSet() == ['🇺🇸', '🇦🇺'] as Set
----
Here is our traversal example:
[source,groovy]
----
results = db.query('SQL', "TRAVERSE out('supersedes') FROM :swim", swim1)
results.each {
    if (it.toElement() != swim1) {
        var props = it.toMap()
        println "$props.at $props.event"
    }
}
----
ArcadeDB also supports Cypher queries (like Neo4j). The query for record times in finals
looks like this in the Cypher dialect:
[source,groovy]
----
results = db.query('cypher', '''
MATCH (s1:Swim {event: 'Final'})-[:supersedes]->(s2:Swim)
RETURN s1.time AS time
''')
assert results*.toMap().time.toSet() == [57.47, 57.33] as Set
----
ArcadeDB also supports Gremlin queries. The query for record times in finals
looks like this in the Gremlin dialect:
[source,groovy]
----
results = db.query('gremlin', '''
g.V().has('event', 'Final').as('ev').out('supersedes').select('ev').values('time')
''')
assert results*.toMap().result.toSet() == [57.47, 57.33] as Set
----
Rather than just passing a Gremlin query as a String, we can get full access to the TinkerPop environment,
as this example shows:
[source,groovy]
----
try (final ArcadeGraph graph = ArcadeGraph.open("swimming")) {
    var recordTimesInFinals = graph.traversal().V().has('event', 'Final').as('ev').out('supersedes')
        .select('ev').values('time').toSet()
    assert recordTimesInFinals == [57.47, 57.33] as Set
}
----
ArcadeDB also supports a Studio Web-UI. Here is an example of using Studio
with a query that looks at all nodes and edges associated with the Tokyo 2021 Olympics:
image:img/ArcadeStudio.png[ArcadeStudio]
== TuGraph
Next, we'll look at
https://tugraph.tech/[TuGraph].
image:https://mdn.alipayobjects.com/huamei_qcdryc/afts/img/A*AbamQ5lxv0IAAAAAAAAAAAAADgOBAQ/original[tugraph logo,width=40%]
We ran the Community Edition from a Docker image, as outlined in the
https://tugraph-db.readthedocs.io/en/latest/5.installation%26running/3.docker-deployment.html[documentation] and
https://blog.csdn.net/qq_35721299/article/details/128076604[here].
TuGraph's claim to fame is high performance. Certainly, that isn't really
needed for this example, but let's have a play anyway.
There are a few ways to talk to TuGraph. We'll use the recommended Neo4j
https://tugraph-db.readthedocs.io/en/latest/7.client-tools/5.bolt-client.html[Bolt client]
which uses the Bolt protocol to talk to the TuGraph server.
We'll create a session using that client plus a helper `run` method to invoke our queries.
[source,groovy]
----
var authToken = AuthTokens.basic("admin", "73@TuGraph")
var driver = GraphDatabase.driver("bolt://localhost:7687", authToken)
var session = driver.session(SessionConfig.forDatabase("default"))
var run = { String s -> session.run(s) }
----
Next, we set up our database, including a schema for our nodes, edges and properties.
One difference from the earlier examples is that TuGraph needs a primary key for each vertex label.
`Swimmer` uses `name` as its key; for `Swim` we added an `id` property.
[source,groovy]
----
'''
CALL db.dropDB()
CALL db.createVertexLabel('Swimmer', 'name', 'name', 'STRING', false, 'country', 'STRING', false)
CALL db.createVertexLabel('Swim', 'id', 'id', 'INT32', false, 'event', 'STRING', false, 'result', 'STRING', false, 'at', 'STRING', false, 'time', 'FLOAT', false)
CALL db.createEdgeLabel('swam','[["Swimmer","Swim"]]')
CALL db.createEdgeLabel('supersedes','[["Swim","Swim"]]')
'''.trim().readLines().each{ run(it) }
----
With these defined, we can create our swim information:
[source,groovy]
----
run '''create
(es:Swimmer {name: 'Emily Seebohm', country: '🇦🇺'}),
(swim1:Swim {event: 'Heat 4', result: 'First', time: 58.23, at: 'London 2012', id:1}),
(es)-[:swam]->(swim1),
(km:Swimmer {name: 'Kylie Masse', country: '🇨🇦'}),
(swim2:Swim {event: 'Heat 4', result: 'First', time: 58.17, at: 'Tokyo 2021', id:2}),
(km)-[:swam]->(swim2),
(swim3:Swim {event: 'Final', result: '🥈', time: 57.72, at: 'Tokyo 2021', id:3}),
(km)-[:swam]->(swim3),
(swim2)-[:supersedes]->(swim1),
(rs:Swimmer {name: 'Regan Smith', country: '🇺🇸'}),
(swim4:Swim {event: 'Heat 5', result: 'First', time: 57.96, at: 'Tokyo 2021', id:4}),
(rs)-[:swam]->(swim4),
(swim5:Swim {event: 'Semifinal 1', result: 'First', time: 57.86, at: 'Tokyo 2021', id:5}),
(rs)-[:swam]->(swim5),
(swim6:Swim {event: 'Final', result: '🥉', time: 58.05, at: 'Tokyo 2021', id:6}),
(rs)-[:swam]->(swim6),
(swim7:Swim {event: 'Final', result: '🥈', time: 57.66, at: 'Paris 2024', id:7}),
(rs)-[:swam]->(swim7),
(swim8:Swim {event: 'Relay leg1', result: 'First', time: 57.28, at: 'Paris 2024', id:8}),
(rs)-[:swam]->(swim8),
(swim4)-[:supersedes]->(swim2),
(kmk:Swimmer {name: 'Kaylee McKeown', country: '🇦🇺'}),
(swim9:Swim {event: 'Heat 6', result: 'First', time: 57.88, at: 'Tokyo 2021', id:9}),
(kmk)-[:swam]->(swim9),
(swim9)-[:supersedes]->(swim4),
(swim5)-[:supersedes]->(swim9),
(swim10:Swim {event: 'Final', result: '🥇', time: 57.47, at: 'Tokyo 2021', id:10}),
(kmk)-[:swam]->(swim10),
(swim10)-[:supersedes]->(swim5),
(swim11:Swim {event: 'Final', result: '🥇', time: 57.33, at: 'Paris 2024', id:11}),
(kmk)-[:swam]->(swim11),
(swim11)-[:supersedes]->(swim10),
(swim8)-[:supersedes]->(swim11),
(kb:Swimmer {name: 'Katharine Berkoff', country: '🇺🇸'}),
(swim12:Swim {event: 'Final', result: '🥉', time: 57.98, at: 'Paris 2024', id:12}),
(kb)-[:swam]->(swim12)
'''
----
TuGraph uses Cypher style queries. Here are our three standard queries:
[source,groovy]
----
assert run('''
MATCH (sr:Swimmer)-[:swam]->(sm:Swim {at: 'Paris 2024'})
RETURN DISTINCT sr.country AS country
''')*.get('country')*.asString().toSet() == ['🇺🇸', '🇦🇺'] as Set
assert run('''
MATCH (s:Swim)
WHERE s.event STARTS WITH 'Heat'
RETURN DISTINCT s.at AS at
''')*.get('at')*.asString().toSet() == ["London 2012", "Tokyo 2021"] as Set
assert run('''
MATCH (s1:Swim {event: 'Final'})-[:supersedes]->(s2:Swim)
RETURN s1.time as time
''')*.get('time')*.asDouble().toSet() == [57.47d, 57.33d] as Set
----
Here is our traversal query:
[source,groovy]
----
run('''
MATCH (s1:Swim)-[:supersedes*1..10]->(s2:Swim {at: 'London 2012'})
RETURN s1.at as at, s1.event as event
''')*.asMap().each{ println "$it.at $it.event" }
----
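To see what the variable-length pattern matches, here is a plain-Groovy sketch (not TuGraph code) that walks the same `supersedes` edges in memory. The ids and details are copied from the `create` statement above; the map names are made up for illustration, and the bound of 10 hops simply mirrors the `*1..10` in the query:
[source,groovy]
----
// swim id -> [at, event], copied from the create statement above
def swims = [
    1 : ['London 2012', 'Heat 4'],
    2 : ['Tokyo 2021', 'Heat 4'],
    4 : ['Tokyo 2021', 'Heat 5'],
    5 : ['Tokyo 2021', 'Semifinal 1'],
    8 : ['Paris 2024', 'Relay leg1'],
    9 : ['Tokyo 2021', 'Heat 6'],
    10: ['Tokyo 2021', 'Final'],
    11: ['Paris 2024', 'Final']
]
// record-breaking swim id -> id of the swim it supersedes
def supersedes = [2: 1, 4: 2, 9: 4, 5: 9, 10: 5, 11: 10, 8: 11]

// Invert the edges so we can walk forward in time from the London 2012 swim
def supersededBy = supersedes.collectEntries { from, to -> [to, from] }

def chain = []
def current = supersededBy[1]
while (current != null && chain.size() < 10) {   // at most 10 hops
    chain << current
    current = supersededBy[current]
}
chain.each { println swims[it].join(' ') }
----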
== Apache HugeGraph
Our final technology is Apache
https://hugegraph.apache.org/[HugeGraph].
It is a project undergoing incubation at the ASF.
image:https://www.apache.org/logos/res/hugegraph/hugegraph.png[hugegraph logo,50%]
HugeGraph's claim to fame is the ability to support very large graph databases.
Again, not really needed for this example, but it should be fun to play with.
We used a docker image as described in the
https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev[documentation].
Setup involved creating a client for talking to the server (running on the docker image):
[source,groovy]
----
var client = HugeClient.builder("http://localhost:8080", "hugegraph").build()
----
Next, we defined the schema for our graph database:
[source,groovy]
----
var schema = client.schema()
schema.propertyKey("num").asInt().ifNotExist().create()
schema.propertyKey("name").asText().ifNotExist().create()
schema.propertyKey("country").asText().ifNotExist().create()
schema.propertyKey("at").asText().ifNotExist().create()
schema.propertyKey("event").asText().ifNotExist().create()
schema.propertyKey("result").asText().ifNotExist().create()
schema.propertyKey("time").asDouble().ifNotExist().create()
schema.vertexLabel('Swimmer')
.properties('name', 'country')
.primaryKeys('name')
.ifNotExist()
.create()
schema.vertexLabel('Swim')
.properties('num', 'at', 'event', 'result', 'time')
.primaryKeys('num')
.ifNotExist()
.create()
schema.edgeLabel("swam")
.sourceLabel("Swimmer")
.targetLabel("Swim")
.ifNotExist()
.create()
schema.edgeLabel("supersedes")
.sourceLabel("Swim")
.targetLabel("Swim")
.ifNotExist()
.create()
schema.indexLabel("SwimByEvent")
.onV("Swim")
.by("event")
.secondary()
.ifNotExist()
.create()
schema.indexLabel("SwimByAt")
.onV("Swim")
.by("at")
.secondary()
.ifNotExist()
.create()
----
While, technically, HugeGraph supports composite keys,
it seemed to work better when the `Swim` vertex had a single primary key.
We used the `num` field for this, simply giving a number to each swim.
We use the graph API for creating nodes and edges:
[source,groovy]
----
var g = client.graph()
var es = g.addVertex(T.LABEL, 'Swimmer', 'name', 'Emily Seebohm', 'country', '🇦🇺')
var swim1 = g.addVertex(T.LABEL, 'Swim', 'at', 'London 2012', 'event', 'Heat 4', 'time', 58.23, 'result', 'First', 'num', NUM++)
es.addEdge('swam', swim1)
----
Here is how to print out some node information:
[source,groovy]
----
var (name, country) = ['name', 'country'].collect { es.property(it) }
var (at, event, time) = ['at', 'event', 'time'].collect { swim1.property(it) }
println "$name from $country swam a time of $time in $event at the $at Olympics"
----
We now create the other swimmer and swim nodes and edges.
Gremlin queries are invoked through a gremlin helper object.
Our three standard queries look like this:
[source,groovy]
----
var gremlin = client.gremlin()
var successInParis = gremlin.gremlin('''
g.V().out('swam').has('Swim', 'at', 'Paris 2024').in().values('country').dedup().order()
''').execute()
assert successInParis.data() == ['🇦🇺', '🇺🇸']
var recordSetInHeat = gremlin.gremlin('''
g.V().hasLabel('Swim')
.filter { it.get().property('event').value().startsWith('Heat') }
.values('at').dedup().order()
''').execute()
assert recordSetInHeat.data() == ['London 2012', 'Tokyo 2021']
var recordTimesInFinals = gremlin.gremlin('''
g.V().has('Swim', 'event', 'Final').as('ev').out('supersedes').select('ev').values('time').order()
''').execute()
assert recordTimesInFinals.data() == [57.33, 57.47]
----
Here is our traversal example:
[source,groovy]
----
println "Olympic records after ${swim1.properties().subMap(['at', 'event']).values().join(' ')}: "
gremlin.gremlin('''
g.V().has('at', 'London 2012').repeat(__.in('supersedes')).emit().values('at', 'event')
''').execute().data().collate(2).each { a, e ->
    println "$a $e"
}
----
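The `collate(2)` call does the pairing here: the code above treats the query result as one flat list of alternating `at` and `event` values. A minimal plain-Groovy sketch of just that step, using a hand-written sample list rather than real query output:
[source,groovy]
----
// Hand-written sample of a flat result list: at, event, at, event, ...
def flat = ['Tokyo 2021', 'Heat 4', 'Tokyo 2021', 'Heat 5', 'Paris 2024', 'Final']

// collate(2) groups consecutive elements into sublists of size two
def pairs = flat.collate(2)

// A two-parameter closure receives each two-element sublist already unpacked
def lines = pairs.collect { a, e -> "$a $e".toString() }
lines.each { println it }
----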
== Static typing
Another interesting topic is improving type checking for graph database code.
Groovy supports very dynamic styles of code through to "stronger-than-Java" type checking.
Some graph database technologies offer only a schema-free experience
to allow your data models to _"adapt and change easily with your business"_.
Others allow a schema to be defined with varying degrees of information.
Groovy's dynamic capabilities make it particularly suited for writing code
that will work easily even if you change your data model on the fly.
However, if you prefer to add further type checking into your code, Groovy has
options for that too.
Let's recap on what schema-like capabilities our examples made use of:
* Apache TinkerPop: used dynamic vertex labels and edges
* Neo4j: used dynamic vertex labels but required edges to be defined by an enum
* Apache AGE: although not shown in this post, defined vertex labels, edges were dynamic
* OrientDB: defined vertex and edge classes
* ArcadeDB: defined vertex and edge types
* TuGraph: defined vertex and edge labels, vertex labels had typed properties, edge labels typed with from/to vertex labels
* Apache HugeGraph: defined vertex and edge labels, vertex labels had typed properties, edge labels typed with from/to vertex labels
The good news is that, where we chose very dynamic options, we could easily add new
vertices and edges, e.g.:
[source,groovy]
----
var mb = g.addV('Coach').property(name: 'Michael Bohl').next()
mb.coaches(kmk)
----
For the examples which used schema-like capabilities, we'd need to declare the additional
vertex type `Coach` and edge `coaches` before we could define the new node and edge.
Let's explore just a few options where Groovy capabilities could make it easier to deal
with typing.
We previously used `insertSwimmer` and `insertSwim` helper methods. We could supply types
for those parameters even where our underlying database technology wasn't using them.
That would at least capture typing errors when inserting information into our graph.
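As a rough sketch of that idea, the helpers below take typed parameters. The signatures are assumptions made for illustration (they are not the helper definitions used earlier), and the class just collects Cypher-style node patterns instead of talking to a database:
[source,groovy]
----
import groovy.transform.TypeChecked

// Hypothetical typed variants of the helpers; the signatures are assumptions
@TypeChecked
class SwimInserter {
    // Generated node patterns are collected here instead of being sent to a database
    final List<String> created = []

    void insertSwimmer(String name, String country) {
        created << "(:Swimmer {name: '$name', country: '$country'})".toString()
    }

    void insertSwim(String at, String event, BigDecimal time, String result) {
        created << "(:Swim {at: '$at', event: '$event', time: $time, result: '$result'})".toString()
    }
}

var inserter = new SwimInserter()
inserter.insertSwimmer('Emily Seebohm', '🇦🇺')
inserter.insertSwim('London 2012', 'Heat 4', 58.23, 'First')
// A call with the arguments mixed up, e.g. insertSwim(58.23, 'Heat 4', 'London 2012', 'First'),
// is rejected at compile time if the caller is type checked too,
// and fails with a MissingMethodException at runtime otherwise.
inserter.created.each { println it }
----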
We could use a richly-typed domain using Groovy classes or records. We could generate
the necessary method calls to create the schema/labels and then populate the database.
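One way this might look, assuming Groovy 4 or later for record support: the `Swimmer` and `Swim` records and the `toCypherNode` closure below are hypothetical names introduced for this sketch, which only generates a Cypher `create` fragment as a string and does no escaping of quotes in values:
[source,groovy]
----
// Hypothetical domain records; field names follow the TuGraph example
record Swimmer(String name, String country) {}
record Swim(int id, String at, String event, String result, BigDecimal time) {}

// Render a record as a Cypher node pattern, using the record's class name
// as the label and the record's toMap() method for the properties
def toCypherNode = { String alias, rec ->
    def props = rec.toMap().collect { k, v ->
        def rendered = v instanceof CharSequence ? "'$v'" : "$v"
        "$k: $rendered"
    }.join(', ')
    "($alias:${rec.getClass().simpleName} {$props})".toString()
}

var es = new Swimmer('Emily Seebohm', '🇦🇺')
var swim1 = new Swim(1, 'London 2012', 'Heat 4', 'First', 58.23)
var statement = "create ${toCypherNode('es', es)}, ${toCypherNode('swim1', swim1)}, (es)-[:swam]->(swim1)"
println statement
----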
Alternatively, we can leave the code in its dynamic form and make use of Groovy's
extensible type checking system. We could write an extension which
fails compilation if any invalid edge or vertex definitions were detected.
For our `coaches` example above, the previous line would pass compilation,
but if we had incorrect vertices for that edge relationship, compilation would fail,
e.g. for the statement `swim1.coaches(mb)`, we'd get the following error:
----
[Static type checking] - Invalid edge - expected: <Coach>.coaches(<Swimmer>)
but found: <Swim>.coaches(<Coach>)
@ line 20, column 5.
swim1.coaches(mb)
^
1 error
----
We won't show the code for this; it's in the GitHub repo. It is hard-coded to
know about the `coaches` relationship. Ideally, we'd combine extensible type checking
with the previously mentioned richly-typed model, and we could populate both the
information that our type checker needs and any label/schema information our
graph database would need.
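The checker in the repo works at compile time. Purely to illustrate the kind of rule table such a checker might consult, here is a runtime sketch instead; the `edgeRules` map and `checkEdge` closure are hypothetical names, and the message format just imitates the error shown earlier:
[source,groovy]
----
// Hypothetical rule table: edge name -> [expected source label, expected target label]
var edgeRules = [
    swam      : ['Swimmer', 'Swim'],
    supersedes: ['Swim', 'Swim'],
    coaches   : ['Coach', 'Swimmer']
]

// Returns null when the edge is valid, otherwise a descriptive message
var checkEdge = { String from, String edge, String to ->
    def expected = edgeRules[edge]
    if (expected == null) return "Unknown edge - $edge".toString()
    if (expected == [from, to]) return null
    "Invalid edge - expected: <${expected[0]}>.$edge(<${expected[1]}>) but found: <$from>.$edge(<$to>)".toString()
}

println checkEdge('Coach', 'coaches', 'Swimmer')   // valid edge, so no message
println checkEdge('Swim', 'coaches', 'Coach')      // wrong vertex labels
----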
Anyway, these are just a few options Groovy gives you. Why not have fun trying out some
ideas yourself!
.Update history
****
*02/Sep/2024*: Initial version. +
*18/Sep/2024*: Updated for: latest Groovy 5 version, TuGraph 4.5.0 with thanks to Florian (GitHub: fanzhidongyzby) and Richard Bian (x: @RichSFO), TinkerPop tweaks with thanks to Stephen Mallette (ASF: spmallette). +
*11/Dec/2024*: Updated for: latest Groovy 5 version, TuGraph 4.5.1, HugeGraph 1.5.0, ArcadeDB 24.11.1, Gremlin 3.7.3, Neo4j 5.26.0, OrientDB 3.2.36. +
****