blob: c46eb3534960fe2f9282944934be128808a55ae4 [file] [log] [blame]
= Graph Traversal
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Graph traversal with streaming expressions uses the `nodes` function to perform a breadth-first graph traversal.
The `nodes` function can be combined with the `scoreNodes` function to provide recommendations. `nodes` can also be combined with the wider streaming expression library to perform complex operations on gathered node sets.
`nodes` traversals are distributed within a SolrCloud collection and can span collections.
`nodes` is designed for use cases that involve zooming into a neighborhood in the graph and performing precise traversals to gather node sets and aggregations. In these types of use cases `nodes` will often provide sub-second performance. Some sample use cases are provided later in the document.
[IMPORTANT]
====
This document assumes a basic understanding of graph terminology and streaming expressions. You can begin exploring graph traversal concepts with this https://en.wikipedia.org/wiki/Graph_traversal[Wikipedia article]. More details about streaming expressions are available in this Guide, in the section <<streaming-expressions.adoc#,Streaming Expressions>>.
====
== Basic Syntax
We'll start with the most basic syntax and slowly build up more complexity. The most basic syntax for `nodes` is:
[source,plain]
----
nodes(emails,
walk="johndoe@apache.org->from",
gather="to")
----
Let's break down this simple expression.
The first parameter, `emails`, is the collection being traversed. The second parameter, `walk`, maps a hard-coded node ID ("\johndoe@apache.org") to a field in the index (`from`). This will return all the *edges* in the index that have `johndoe@apache.org` in the `from` field.
The `gather` parameter tells the function to gather the values in the `to `field. The values that are gathered are the node IDs emitted by the function.
In the example above the nodes emitted will be all of the people that "johndoe@apache.org" has emailed.
The walk parameter also accepts a list of root node IDs:
[source,plain]
----
nodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to")
----
The `nodes` function above finds all the edges with "johndoe@apache.org" or "janesmith@apache.org" in the `from` field and gathers the `to` field.
Like all <<streaming-expressions.adoc#,Streaming Expressions>>, you can execute a `nodes` expression by sending it to the `/stream` handler. For example:
[source,bash]
----
curl --data-urlencode 'expr=nodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to")' http://localhost:8983/solr/emails/stream
----
The output of this expression would look like this:
[source,json]
----
{
"result-set": {
"docs": [
{
"node": "slist@campbell.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "catherine.pernot@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "airam.arteaga@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 44
}
]
}
}
----
All of the tuples returned have the `node` field. The `node` field contains the node IDs gathered by the function. The `collection`, `field`, and `level` of the traversal are also included in the output.
Notice that the level is "1" for each tuple in the example. The root nodes are level 0 (in the example above, the root nodes are "johndoe@apache.org, janesmith@apache.org") By default the `nodes` function emits only the _*leaf nodes*_ of the traversal, which is the outer-most node set. To emit the root nodes you can specify the `scatter` parameter:
[source,plain]
----
nodes(emails,
walk="johndoe@apache.org->from",
gather="to",
scatter="branches, leaves")
----
The `scatter` parameter controls whether to emit the _branches_ with the _leaves_. The root nodes are considered "branches" because they are not the outer-most level of the traversal.
When scattering both branches and leaves the output would like this:
[source,json]
----
{
"result-set": {
"docs": [
{
"node": "johndoe@apache.org",
"collection": "emails",
"field": "node",
"level": 0
},
{
"node": "slist@campbell.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "catherine.pernot@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "airam.arteaga@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 44
}
]
}
}
----
Now the level 0 root node is included in the output.
== Aggregations
`nodes` also supports aggregations. For example:
[source,plain]
----
nodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to",
count(*))
----
The expression above finds the edges with "\johndoe@apache.org" or "\janesmith@apache.org" in the `from` field and gathers the values from the `to` field. It also aggregates the count for each node ID gathered.
A gathered node could have a count of 2 if both "\johndoe@apache.org" and "\janesmith@apache.org" have emailed the same person. Node sets contain a unique set of nodes, so the same person won't appear twice in the node set, but the count will reflect that it appeared twice during the traversal.
Edges are uniqued as part of the traversal so the count will *not* reflect the number of times "\johndoe@apache.org" emailed the same person. For example, personA might have emailed personB 100 times. These edges would get uniqued and only be counted once. But if person personC also emailed personB this would increment the count for personB.
The aggregation functions supported are `count(*)`, `sum(field)`, `min(field)`, `max(field)`, and `avg(field)`. The fields being aggregated should be present in the edges collected during the traversal. Later examples (below) will show aggregations can be a powerful tool for providing recommendations and limiting the scope of traversals.
== Nesting nodes Functions
The `nodes` function can be nested to traverse deeper into the graph. For example:
[source,plain]
----
nodes(emails,
nodes(emails,
walk="johndoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to")
----
In the example above the outer `nodes` function operates on the node set collected from the inner `nodes` function.
Notice that the inner `nodes` function behaves exactly as the examples already discussed. But the `walk` parameter of the outer `nodes` function behaves differently.
In the outer `nodes` function the `walk` parameter works with tuples coming from an internal streaming expression. In this scenario the `walk` parameter maps the `node` field to the `from` field. Remember that the node IDs collected from the inner `nodes` expression are placed in the `node` field.
Put more simply, the inner expression gathers all the people that "\johndoe@apache.org" has emailed. We can call this group the "friends of \johndoe@apache.org". The outer expression gathers all the people that the "friends of \johndoe@apache.org" have emailed. This is a basic friends-of-friends traversal.
This construct of nesting `nodes` functions is the basic technique for doing a controlled traversal through the graph.
== Cycle Detection
The `nodes` function performs cycle detection across the entire traversal. This ensures that nodes that have already been visited are not traversed again. Cycle detection is important for both limiting the size of traversals and gathering accurate aggregations. Without cycle detection the size of the traversal could grow exponentially with each hop in the traversal. With cycle detection only new nodes encountered are traversed.
Cycle detection *does not* cross collection boundaries. This is because internally the collection name is part of the node ID. For example the node ID "\johndoe@apache.org", is really `emails/johndoe@apache.org`. When traversing to another collection "\johndoe@apache.org" will be traversed.
== Filtering the Traversal
Each level in the traversal can be filtered with a filter query. For example:
[source,plain]
----
nodes(emails,
walk="johndoe@apache.org->from",
fq="body:(solr rocks)",
gather="to")
----
In the example above only emails that match the filter query will be included in the traversal. Any Solr query can be included here. So you can do fun things like <<spatial-search.adoc#,geospatial queries>>, apply any of the available <<query-syntax-and-parsing.adoc#,query parsers>>, or even write custom query parsers to limit the traversal.
== Root Streams
Any streaming expression can be used to provide the root nodes for a traversal. For example:
[source,plain]
----
nodes(emails,
search(emails, q="body:(solr rocks)", fl="to", sort="score desc", rows="20")
walk="to->from",
gather="to")
----
The example above provides the root nodes through a search expression. You can also provide arbitrarily complex, nested streaming expressions with joins, etc., to specify the root nodes.
Notice that the `walk` parameter maps a field from the tuples generated by the inner stream. In this case it maps the `to` field from the inner stream to the `from` field.
== Skipping High Frequency Nodes
It's often desirable to skip traversing high frequency nodes in the graph. This is similar in nature to a search term stop list. The best way to describe this is through an example use case.
Let's say that you want to recommend content for a user based on a collaborative filter. Below is one approach for a simple collaborative filter:
. Find all content userA has read.
. Find users whose reading list is closest to userA. These are users with similar tastes as userA.
. Recommend content based on what the users in step 2 have read, that userA has not yet read.
Look closely at step 2. In large graphs, step 2 can lead to a very large traversal. This is because userA may have viewed content that has been viewed by millions of other people. We may want to skip these high frequency nodes for two reasons:
. A large traversal that visit millions of unique nodes is slow and takes a lot of memory because cycle detection is tracked in memory.
. High frequency nodes are also not useful in determining users with similar tastes. The content that fewer people have viewed provides a more precise recommendation.
The `nodes` function has the `maxDocFreq` parameter to allow for filtering out high frequency nodes. The sample code below shows steps 1 and 2 of the recommendation:
[source,plain]
----
nodes(logs,
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:view", qt="/export"),
walk="articleID->articleID",
gather="userID",
fq="action:view",
maxDocFreq="10000",
count(*)))
----
In the example above, the inner search expression searches the `logs` collection and returning all the articles viewed by "user1". The outer `nodes` expression takes all the articles emitted from the inner search expression and finds all the records in the logs collection for those articles. It then gathers and aggregates the users that have read the articles. The `maxDocFreq` parameter limits the articles returned to those that appear in no more then 10,000 log records (per shard). This guards against returning articles that have been viewed by millions of users.
== Tracking the Traversal
By default the `nodes` function only tracks enough information to do cycle detection. This provides enough information to output the nodes and aggregations in the graph.
For some use cases, such as graph visualization, we also need to output the edges. Setting `trackTraversal="true"` tells `nodes` to track the connections between nodes, so the edges can be constructed. When `trackTraversal` is enabled a new `ancestors` property will appear with each node. The `ancestors` property contains a list of node IDs that pointed to the node.
Below is a sample `nodes` expression with `trackTraversal` set to true:
[source,plain]
----
nodes(emails,
nodes(emails,
walk="johndoe@apache.org->from",
gather="to",
trackTraversal="true"),
walk="node->from",
trackTraversal="true",
gather="to")
----
== Cross-Collection Traversals
Nested `nodes` functions can operate on different SolrCloud collections. This allow traversals to "walk" from one collection to another to gather nodes. Cycle detection does not cross collection boundaries, so nodes collected in one collection will be traversed in a different collection. This was done deliberately to support cross-collection traversals. Note that the output from a cross-collection traversal will likely contain duplicate nodes with different collection attributes.
Below is a sample `nodes` expression that traverses from the "emails" collection to the "logs" collection:
[source,plain]
----
nodes(logs,
nodes(emails,
search(emails, q="body:(solr rocks)", fl="from", sort="score desc", rows="20")
walk="from->from",
gather="to",
scatter="leaves, branches"),
walk="node->user",
fq="action:edit",
gather="contentID")
----
The example above finds all people who sent emails with a body that contains "solr rocks". It then finds all the people these people have emailed. Then it traverses to the logs collection and gathers all the content IDs that these people have edited.
== Combining nodes With Other Streaming Expressions
The `nodes` function can act as both a stream source and a stream decorator. The connection with the wider stream expression library provides tremendous power and flexibility when performing graph traversals. Here is an example of using the streaming expression library to intersect two friend networks:
[source,plain]
----
intersect(on="node",
sort(by="node asc",
nodes(emails,
nodes(emails,
walk="johndoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to",
scatter="branches,leaves")),
sort(by="node asc",
nodes(emails,
nodes(emails,
walk="janedoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to",
scatter="branches,leaves")))
----
The example above gathers two separate friend networks, one rooted with "\johndoe@apache.org" and another rooted with "\janedoe@apache.org". The friend networks are then sorted by the `node` field, and intersected. The resulting node set will be the intersection of the two friend networks.
== Sample Use Cases for Graph Traversal
=== Calculate Market Basket Co-occurrence
It is often useful to know which products are most frequently purchased with a particular product. This example uses a simple market basket table (indexed in Solr) to store past shopping baskets. The schema for the table is very simple with each row containing a `basketID` and a `productID`. This can be seen as a graph with each row in the table representing an edge. And it can be traversed very quickly to calculate basket co-occurrence, even when the graph contains billions of edges.
Here is the sample syntax:
[source,plain]
----
top(n="5",
sort="count(*) desc",
nodes(baskets,
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
walk="basketID->basketID",
fq="-productID:ABC",
gather="productID",
count(*)))
----
Let's break down exactly what this traversal is doing.
. The first expression evaluated is the inner `random` expression, which returns 500 random basketIDs, from the `baskets` collection, that have the `productID` "ABC". The `random` expression is very useful for recommendations because it limits the traversal to a fixed set of baskets, and because it adds the element of surprise into the recommendation. Using the `random` function you can provide fast sample sets from very large graphs.
. The outer `nodes` expression finds all the records in the `baskets` collection for the basketIDs generated in step 1. It also filters out `productID` "ABC" so it doesn't show up in the results. It then gathers and counts the productID's across these baskets.
. The outer `top` expression ranks the productIDs emitted in step 2 by the count and selects the top 5.
In a nutshell this expression finds the products that most frequently co-occur with product "ABC" in past shopping baskets.
=== Using the scoreNodes Function to Make a Recommendation
This use case builds on the market basket example <<Calculate Market Basket Co-occurrence,above>> that calculates which products co-occur most frequently with productID:ABC. The ranked co-occurrence counts provide candidates for a recommendation. The `scoreNodes` function can be used to score the candidates to find the best recommendation.
Before diving into the syntax of the `scoreNodes` function it's useful to understand why the raw co-occurrence counts may not produce the best recommendation. The reason is that raw co-occurrence counts favor items that occur frequently across all baskets. A better recommendation would find the product that has the most significant relationship with productID ABC. The `scoreNodes` function uses a term frequency-inverse document frequency (TF-IDF) algorithm to find the most significant relationship.
==== How scoreNodes Works
The `scoreNodes` function assigns a score to each node emitted by the nodes expression. By default the `scoreNodes` function uses the `count(*)` aggregation, which is the co-occurrence count, as the TF value. The IDF value for each node is fetched from the collection where the node was gathered. Each node is then scored using the TF*IDF formula, which provides a boost to nodes with a lower frequency across all market baskets.
Combining the co-occurrence count with the IDF provides a score that shows how important the relationship is between productID ABC and the recommendation candidates.
The `scoreNodes` function adds the score to each node in the `nodeScore` field.
==== Example scoreNodes Syntax
[source,plain]
----
top(n="1",
sort="nodeScore desc",
scoreNodes(top(n="50",
sort="count(*) desc",
nodes(baskets,
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
walk="basketID->basketID",
fq="-productID:ABC",
gather="productID",
count(*)))))
----
This example builds on the earlier example "Calculate market basket co-occurrence".
. Notice that the inner-most `top` function is taking the top 50 products that co-occur most frequently with productID ABC. This provides 50 candidate recommendations.
. The `scoreNodes` function then assigns a score to the candidates based on the TF*IDF of each node.
. The outer `top` expression selects the highest scoring node. This is the recommendation.
=== Recommend Content Based on Collaborative Filter
In this example we'll recommend content for a user based on a collaborative filter. This recommendation is made using log records that contain the `userID` and `articleID` and the action performed. In this scenario each log record can be viewed as an edge in a graph. The userID and articleID are the nodes and the action is an edge property used to filter the traversal.
Here is the sample syntax:
[source,plain]
----
top(n="5",
sort="count(*) desc",
nodes(logs,
top(n="30",
sort="count(*) desc",
nodes(logs,
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:read", qt="/export"),
walk="articleID->articleID",
gather="userID",
fq="action:read",
maxDocFreq="10000",
count(*))),
walk="node->userID",
gather="articleID",
fq="action:read",
count(*)))
----
Let's break down the expression above step-by-step.
. The first expression evaluated is the inner `search` expression. This expression searches the `logs` collection for all records matching "user1". This is the user we are making the recommendation for.
+
There is a filter applied to pull back only records where the "action:read". It returns the `articleID` for each record found. In other words, this expression returns all the articles "user1" has read.
. The inner `nodes` expression operates over the articleIDs returned from step 1. It takes each `articleID` found and searches them against the `articleID` field.
+
Note that it skips high frequency nodes using the `maxDocFreq` parameter to filter out articles that appear over 10,000 times in the logs. It gathers userIDs and aggregates the counts for each user. This step finds the users that have read the same articles that "user1" has read and counts how many of the same articles they have read.
. The inner `top` expression ranks the users emitted from step 2. It will emit the top 30 users who have the most overlap with user1's reading list.
. The outer `nodes` expression gathers the reading list for the users emitted from step 3. It counts the articleIDs that are gathered.
+
Any article selected in step 1 (user1 reading list), will not appear in this step due to cycle detection. So this step returns the articles read by the users with the most similar readings habits to "user1" that "user1" has not read yet. It also counts the number of times each article has been read across this user group.
. The outer `top` expression takes the top articles emitted from step 4. This is the recommendation.
=== Protein Pathway Traversal
In recent years, scientists have become increasingly able to rationally design drugs that target the mutated proteins, called oncogenes, responsible for some cancers. Proteins typically act through long chains of chemical interactions between multiple proteins, called pathways, and, while the oncogene in the pathway may not have a corresponding drug, another protein in the pathway may. Graph traversal on a protein collection that records protein interactions and drugs may yield possible candidates. (Thanks to Lewis Geer of the NCBI, for providing this example).
The example below illustrates a protein pathway traversal:
[source,plain]
----
nodes(proteins,
nodes(proteins,
walk="NRAS->name",
gather="interacts"),
walk="node->name",
gather="drug")
----
Let's break down exactly what this traversal is doing.
. The inner `nodes` expression traverses in the `proteins` collection. It finds all the edges in the graph where the name of the protein is "NRAS". Then it gathers the proteins in the `interacts` field. This gathers all the proteins that "NRAS" interactions with.
. The outer `nodes` expression also works with the `proteins` collection. It gathers all the drugs that correspond to proteins emitted from step 1.
. Using this stepwise approach you can gather the drugs along the pathway of interactions any number of steps away from the root protein.
== Exporting GraphML to Support Graph Visualization
In the examples above, the `nodes` expression was sent to Solr's `/stream` handler like any other streaming expression. This approach outputs the nodes in the same JSON tuple format as other streaming expressions so that it can be treated like any other streaming expression. You can use the `/stream` handler when you need to operate directly on the tuples, such as in the recommendation use cases above.
There are other graph traversal use cases that involve graph visualization. Solr supports these use cases with the introduction of the `/graph` request handler, which takes a `nodes` expression and outputs the results in GraphML.
http://graphml.graphdrawing.org/[GraphML] is an XML format supported by graph visualization tools such as https://gephi.org/[Gephi], which is a sophisticated open source tool for statistically analyzing and visualizing graphs. Using a `nodes` expression, parts of a larger graph can be exported in GraphML and then imported into tools like Gephi.
There are a few things to keep mind when exporting a graph in GraphML:
. The `/graph` handler can export both the nodes and edges in the graph. By default, it only exports the nodes. To export the edges you must set `trackTraversal="true"` in the `nodes` expression.
. The `/graph` handler currently accepts an arbitrarily complex streaming expression which includes a `nodes` expression. If the streaming expression doesn't include a `nodes` expression, the `/graph` handler will not properly output GraphML.
. The `/graph` handler currently accepts a single arbitrarily complex, nested `nodes` expression per request. This means you cannot send in a streaming expression that joins or intersects the node sets from multiple `nodes` expressions. The `/graph` handler does support any level of nesting within a single `nodes` expression. The `/stream` handler does support joining and intersecting node sets, but the `/graph` handler currently does not.
=== Sample GraphML Request
[source,bash]
----
curl --data-urlencode 'expr=nodes(enron_emails,
nodes(enron_emails,
walk="kayne.coulter@enron.com->from",
trackTraversal="true",
gather="to"),
walk="node->from",
scatter="leaves,branches",
trackTraversal="true",
gather="to")' http://localhost:8983/solr/enron_emails/graph
----
=== Sample GraphML Output
[source,xml]
----
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
<node id="kayne.coulter@enron.com">
<data key="field">node</data>
<data key="level">0</data>
<data key="count(*)">0.0</data>
</node>
<node id="don.baughman@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="1" source="kayne.coulter@enron.com" target="don.baughman@enron.com"/>
<node id="john.kinser@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="2" source="kayne.coulter@enron.com" target="john.kinser@enron.com"/>
<node id="jay.wills@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="3" source="kayne.coulter@enron.com" target="jay.wills@enron.com"/>
</graph></graphml>
----