Graph traversal with streaming expressions uses the gatherNodes
function to perform a breadth-first graph traversal.
The gatherNodes
function can be combined with the scoreNodes
function to provide recommendations. gatherNodes
can also be combined with the wider streaming expression library to perform complex operations on gathered node sets.
gatherNodes
traversals are distributed within a SolrCloud collection and can span collections.
gatherNodes
is designed for use cases that involve zooming into a neighborhood in the graph and performing precise traversals to gather node sets and aggregations. In these types of use cases gatherNodes
will often provide sub-second performance. Some sample use cases are provided later in the document.
This document assumes a basic understanding of graph terminology and streaming expressions. You can begin exploring graph traversal concepts with this Wikipedia article. More details about streaming expressions are available in this Guide, in the section Streaming Expressions. |
Basic Syntax
We’ll start with the most basic syntax and slowly build up more complexity. The most basic syntax for gatherNodes
is:
gatherNodes(emails,
walk="johndoe@apache.org->from",
gather="to")
Let’s break down this simple expression.
The first parameter, emails
, is the collection being traversed. The second parameter, walk
, maps a hard-coded node ID ("johndoe@apache.org") to a field in the index (from
). This will return all the edges in the index that have johndoe@apache.org
in the from
field.
The gather
parameter tells the function to gather the values in the `to `field. The values that are gathered are the node IDs emitted by the function.
In the example above the nodes emitted will be all of the people that "johndoe@apache.org" has emailed.
The walk parameter also accepts a list of root node IDs:
gatherNodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to")
The gatherNodes
function above finds all the edges with "johndoe@apache.org" or "janesmith@apache.org" in the from
field and gathers the to
field.
Like all Streaming Expressions, you can execute a gatherNodes
expression by sending it to the /stream
handler. For example:
curl --data-urlencode 'expr=gatherNodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to")' http://localhost:8983/solr/emails/stream
The output of this expression would look like this:
{
"result-set": {
"docs": [
{
"node": "slist@campbell.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "catherine.pernot@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "airam.arteaga@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 44
}
]
}
}
All of the tuples returned have the node
field. The node
field contains the node IDs gathered by the function. The collection
, field
, and level
of the traversal are also included in the output.
Notice that the level is "1" for each tuple in the example. The root nodes are level 0 (in the example above, the root nodes are "johndoe@apache.org, janesmith@apache.org") By default the gatherNodes
function emits only the *leaf nodes* of the traversal, which is the outer-most node set. To emit the root nodes you can specify the scatter
parameter:
gatherNodes(emails,
walk="johndoe@apache.org->from",
gather="to",
scatter="branches, leaves")
The scatter
parameter controls whether to emit the branches with the leaves. The root nodes are considered "branches" because they are not the outer-most level of the traversal.
When scattering both branches and leaves the output would like this:
{
"result-set": {
"docs": [
{
"node": "johndoe@apache.org",
"collection": "emails",
"field": "node",
"level": 0
},
{
"node": "slist@campbell.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "catherine.pernot@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"node": "airam.arteaga@enron.com",
"collection": "emails",
"field": "to",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 44
}
]
}
}
Now the level 0 root node is included in the output.
Aggregations
gatherNodes
also supports aggregations. For example:
gatherNodes(emails,
walk="johndoe@apache.org, janesmith@apache.org->from",
gather="to",
count(*))
The expression above finds the edges with "johndoe@apache.org" or "janesmith@apache.org" in the from
field and gathers the values from the to
field. It also aggregates the count for each node ID gathered.
A gathered node could have a count of 2 if both "johndoe@apache.org" and "janesmith@apache.org" have emailed the same person. Node sets contain a unique set of nodes, so the same person won’t appear twice in the node set, but the count will reflect that it appeared twice during the traversal.
Edges are uniqued as part of the traversal so the count will not reflect the number of times "johndoe@apache.org" emailed the same person. For example, personA might have emailed personB 100 times. These edges would get uniqued and only be counted once. But if person personC also emailed personB this would increment the count for personB.
The aggregation functions supported are count(*)
, sum(field)
, min(field)
, max(field)
, and avg(field)
. The fields being aggregated should be present in the edges collected during the traversal. Later examples (below) will show aggregations can be a powerful tool for providing recommendations and limiting the scope of traversals.
Nesting gatherNodes functions
The gatherNodes
function can be nested to traverse deeper into the graph. For example:
gatherNodes(emails,
gatherNodes(emails,
walk="johndoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to")
In the example above the outer gatherNodes
function operates on the node set collected from the inner gatherNodes
function.
Notice that the inner gatherNodes
function behaves exactly as the examples already discussed. But the walk
parameter of the outer gatherNodes
function behaves differently.
In the outer gatherNodes
function the walk
parameter works with tuples coming from an internal streaming expression. In this scenario the walk
parameter maps the node
field to the from
field. Remember that the node IDs collected from the inner gatherNodes
expression are placed in the node
field.
Put more simply, the inner expression gathers all the people that "johndoe@apache.org" has emailed. We can call this group the "friends of johndoe@apache.org". The outer expression gathers all the people that the "friends of johndoe@apache.org" have emailed. This is a basic friends-of-friends traversal.
This construct of nesting gatherNodes
functions is the basic technique for doing a controlled traversal through the graph.
Cycle Detection
The gatherNodes
function performs cycle detection across the entire traversal. This ensures that nodes that have already been visited are not traversed again. Cycle detection is important for both limiting the size of traversals and gathering accurate aggregations. Without cycle detection the size of the traversal could grow exponentially with each hop in the traversal. With cycle detection only new nodes encountered are traversed.
Cycle detection does not cross collection boundaries. This is because internally the collection name is part of the node ID. For example the node ID "johndoe@apache.org", is really emails/johndoe@apache.org
. When traversing to another collection "johndoe@apache.org" will be traversed.
Filtering the Traversal
Each level in the traversal can be filtered with a filter query. For example:
gatherNodes(emails,
walk="johndoe@apache.org->from",
fq="body:(solr rocks)",
gather="to")
In the example above only emails that match the filter query will be included in the traversal. Any Solr query can be included here. So you can do fun things like geospatial queries, apply any of the available query parsers, or even write custom query parsers to limit the traversal.
Root Streams
Any streaming expression can be used to provide the root nodes for a traversal. For example:
gatherNodes(emails,
search(emails, q="body:(solr rocks)", fl="to", sort="score desc", rows="20")
walk="to->from",
gather="to")
The example above provides the root nodes through a search expression. You can also provide arbitrarily complex, nested streaming expressions with joins, etc., to specify the root nodes.
Notice that the walk
parameter maps a field from the tuples generated by the inner stream. In this case it maps the to
field from the inner stream to the from
field.
Skipping High Frequency Nodes
It’s often desirable to skip traversing high frequency nodes in the graph. This is similar in nature to a search term stop list. The best way to describe this is through an example use case.
Let’s say that you want to recommend content for a user based on a collaborative filter. Below is one approach for a simple collaborative filter:
-
Find all content userA has read.
-
Find users whose reading list is closest to userA. These are users with similar tastes as userA.
-
Recommend content based on what the users in step 2 have read, that userA has not yet read.
Look closely at step 2. In large graphs, step 2 can lead to a very large traversal. This is because userA may have viewed content that has been viewed by millions of other people. We may want to skip these high frequency nodes for two reasons:
-
A large traversal that visit millions of unique nodes is slow and takes a lot of memory because cycle detection is tracked in memory.
-
High frequency nodes are also not useful in determining users with similar tastes. The content that fewer people have viewed provides a more precise recommendation.
The gatherNodes
function has the maxDocFreq
param to allow for filtering out high frequency nodes. The sample code below shows steps 1 and 2 of the recommendation:
gatherNodes(logs,
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:view", qt="/export"),
walk="articleID->articleID",
gather="userID",
fq="action:view",
maxDocFreq="10000",
count(*)))
In the example above, the inner search expression searches the logs
collection and returning all the articles viewed by "user1". The outer gatherNodes
expression takes all the articles emitted from the inner search expression and finds all the records in the logs collection for those articles. It then gathers and aggregates the users that have read the articles. The maxDocFreq
parameter limits the articles returned to those that appear in no more then 10,000 log records (per shard). This guards against returning articles that have been viewed by millions of users.
Tracking the Traversal
By default the gatherNodes
function only tracks enough information to do cycle detection. This provides enough information to output the nodes and aggregations in the graph.
For some use cases, such as graph visualization, we also need to output the edges. Setting trackTraversal="true"
tells gatherNodes
to track the connections between nodes, so the edges can be constructed. When trackTraversal
is enabled a new ancestors
property will appear with each node. The ancestors
property contains a list of node IDs that pointed to the node.
Below is a sample gatherNodes
expression with trackTraversal
set to true:
gatherNodes(emails,
gatherNodes(emails,
walk="johndoe@apache.org->from",
gather="to",
trackTraversal="true"),
walk="node->from",
trackTraversal="true",
gather="to")
Cross-Collection Traversals
Nested gatherNodes
functions can operate on different SolrCloud collections. This allow traversals to "walk" from one collection to another to gather nodes. Cycle detection does not cross collection boundaries, so nodes collected in one collection will be traversed in a different collection. This was done deliberately to support cross-collection traversals. Note that the output from a cross-collection traversal will likely contain duplicate nodes with different collection attributes.
Below is a sample gatherNodes
expression that traverses from the "emails" collection to the "logs" collection:
gatherNodes(logs,
gatherNodes(emails,
search(emails, q="body:(solr rocks)", fl="from", sort="score desc", rows="20")
walk="from->from",
gather="to",
scatter="leaves, branches"),
walk="node->user",
fq="action:edit",
gather="contentID")
The example above finds all people who sent emails with a body that contains "solr rocks". It then finds all the people these people have emailed. Then it traverses to the logs collection and gathers all the content IDs that these people have edited.
Combining gatherNodes With Other Streaming Expressions
The gatherNodes
function can act as both a stream source and a stream decorator. The connection with the wider stream expression library provides tremendous power and flexibility when performing graph traversals. Here is an example of using the streaming expression library to intersect two friend networks:
intersect(on="node",
sort(by="node asc",
gatherNodes(emails,
gatherNodes(emails,
walk="johndoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to",
scatter="branches,leaves")),
sort(by="node asc",
gatherNodes(emails,
gatherNodes(emails,
walk="janedoe@apache.org->from",
gather="to"),
walk="node->from",
gather="to",
scatter="branches,leaves")))
The example above gathers two separate friend networks, one rooted with "johndoe@apache.org" and another rooted with "janedoe@apache.org". The friend networks are then sorted by the node
field, and intersected. The resulting node set will be the intersection of the two friend networks.
Sample Use Cases
Calculate Market Basket Co-occurrence
It is often useful to know which products are most frequently purchased with a particular product. This example uses a simple market basket table (indexed in Solr) to store past shopping baskets. The schema for the table is very simple with each row containing a basketID
and a productID
. This can be seen as a graph with each row in the table representing an edge. And it can be traversed very quickly to calculate basket co-occurrence, even when the graph contains billions of edges.
Here is the sample syntax:
top(n="5",
sort="count(*) desc",
gatherNodes(baskets,
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
walk="basketID->basketID",
fq="-productID:ABC",
gather="productID",
count(*)))
Let’s break down exactly what this traversal is doing.
-
The first expression evaluated is the inner
random
expression, which returns 500 random basketIDs, from thebaskets
collection, that have theproductID
"ABC". Therandom
expression is very useful for recommendations because it limits the traversal to a fixed set of baskets, and because it adds the element of surprise into the recommendation. Using therandom
function you can provide fast sample sets from very large graphs. -
The outer
gatherNodes
expression finds all the records in thebaskets
collection for the basketIDs generated in step 1. It also filters outproductID
"ABC" so it doesn’t show up in the results. It then gathers and counts the productID’s across these baskets. -
The outer
top
expression ranks the productIDs emitted in step 2 by the count and selects the top 5.
In a nutshell this expression finds the products that most frequently co-occur with product "ABC" in past shopping baskets.
Using the scoreNodes Function to Make a Recommendation
This use case builds on the market basket example above that calculates which products co-occur most frequently with productID:ABC. The ranked co-occurrence counts provide candidates for a recommendation. The scoreNodes
function can be used to score the candidates to find the best recommendation.
Before diving into the syntax of the scoreNodes
function it’s useful to understand why the raw co-occurrence counts may not produce the best recommendation. The reason is that raw co-occurrence counts favor items that occur frequently across all baskets. A better recommendation would find the product that has the most significant relationship with productID ABC. The scoreNodes
function uses a term frequency-inverse document frequency (TF-IDF) algorithm to find the most significant relationship.
How It Works
The scoreNodes
function assigns a score to each node emitted by the gatherNodes expression. By default the scoreNodes
function uses the count(*)
aggregation, which is the co-occurrence count, as the TF value. The IDF value for each node is fetched from the collection where the node was gathered. Each node is then scored using the TF*IDF formula, which provides a boost to nodes with a lower frequency across all market baskets.
Combining the co-occurrence count with the IDF provides a score that shows how important the relationship is between productID ABC and the recommendation candidates.
The scoreNodes
function adds the score to each node in the nodeScore
field.
Example Syntax
top(n="1",
sort="nodeScore desc",
scoreNodes(top(n="50",
sort="count(*) desc",
gatherNodes(baskets,
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
walk="basketID->basketID",
fq="-productID:ABC",
gather="productID",
count(*)))))
This example builds on the earlier example "Calculate market basket co-occurrence".
-
Notice that the inner-most
top
function is taking the top 50 products that co-occur most frequently with productID ABC. This provides 50 candidate recommendations. -
The
scoreNodes
function then assigns a score to the candidates based on the TF*IDF of each node. -
The outer
top
expression selects the highest scoring node. This is the recommendation.
Recommend Content Based on Collaborative Filter
In this example we’ll recommend content for a user based on a collaborative filter. This recommendation is made using log records that contain the userID
and articleID
and the action performed. In this scenario each log record can be viewed as an edge in a graph. The userID and articleID are the nodes and the action is an edge property used to filter the traversal.
Here is the sample syntax:
top(n="5",
sort="count(*) desc",
gatherNodes(logs,
top(n="30",
sort="count(*) desc",
gatherNodes(logs,
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:read", qt="/export"),
walk="articleID->articleID",
gather="userID",
fq="action:read",
maxDocFreq="10000",
count(*))),
walk="node->userID",
gather="articleID",
fq="action:read",
count(*)))
Let’s break down the expression above step-by-step.
-
The first expression evaluated is the inner
search
expression. This expression searches thelogs
collection for all records matching "user1". This is the user we are making the recommendation for.There is a filter applied to pull back only records where the "action:read". It returns the
articleID
for each record found. In other words, this expression returns all the articles "user1" has read. -
The inner
gatherNodes
expression operates over the articleIDs returned from step 1. It takes eacharticleID
found and searches them against thearticleID
field.Note that it skips high frequency nodes using the
maxDocFreq
param to filter out articles that appear over 10,000 times in the logs. It gathers userIDs and aggregates the counts for each user. This step finds the users that have read the same articles that "user1" has read and counts how many of the same articles they have read. -
The inner
top
expression ranks the users emitted from step 2. It will emit the top 30 users who have the most overlap with user1’s reading list. -
The outer
gatherNodes
expression gathers the reading list for the users emitted from step 3. It counts the articleIDs that are gathered.Any article selected in step 1 (user1 reading list), will not appear in this step due to cycle detection. So this step returns the articles read by the users with the most similar readings habits to "user1" that "user1" has not read yet. It also counts the number of times each article has been read across this user group.
-
The outer
top
expression takes the top articles emitted from step 4. This is the recommendation.
Protein Pathway Traversal
In recent years, scientists have become increasingly able to rationally design drugs that target the mutated proteins, called oncogenes, responsible for some cancers. Proteins typically act through long chains of chemical interactions between multiple proteins, called pathways, and, while the oncogene in the pathway may not have a corresponding drug, another protein in the pathway may. Graph traversal on a protein collection that records protein interactions and drugs may yield possible candidates. (Thanks to Lewis Geer of the NCBI, for providing this example).
The example below illustrates a protein pathway traversal:
gatherNodes(proteins,
gatherNodes(proteins,
walk="NRAS->name",
gather="interacts"),
walk="node->name",
gather="drug")
Let’s break down exactly what this traversal is doing.
-
The inner
gatherNodes
expression traverses in theproteins
collection. It finds all the edges in the graph where the name of the protein is "NRAS". Then it gathers the proteins in theinteracts
field. This gathers all the proteins that "NRAS" interactions with. -
The outer
gatherNodes
expression also works with theproteins
collection. It gathers all the drugs that correspond to proteins emitted from step 1. -
Using this stepwise approach you can gather the drugs along the pathway of interactions any number of steps away from the root protein.
Exporting GraphML to Support Graph Visualization
In the examples above, the gatherNodes
expression was sent to Solr’s /stream
handler like any other streaming expression. This approach outputs the nodes in the same JSON tuple format as other streaming expressions so that it can be treated like any other streaming expression. You can use the /stream
handler when you need to operate directly on the tuples, such as in the recommendation use cases above.
There are other graph traversal use cases that involve graph visualization. Solr supports these use cases with the introduction of the /graph
request handler, which takes a gatherNodes
expression and outputs the results in GraphML.
GraphML is an XML format supported by graph visualization tools such as Gephi, which is a sophisticated open source tool for statistically analyzing and visualizing graphs. Using a gatherNodes
expression, parts of a larger graph can be exported in GraphML and then imported into tools like Gephi.
There are a few things to keep mind when exporting a graph in GraphML:
-
The
/graph
handler can export both the nodes and edges in the graph. By default, it only exports the nodes. To export the edges you must settrackTraversal="true"
in thegatherNodes
expression. -
The
/graph
handler currently accepts an arbitrarily complex streaming expression which includes agatherNodes
expression. If the streaming expression doesn’t include agatherNodes
expression, the/graph
handler will not properly output GraphML. -
The
/graph
handler currently accepts a single arbitrarily complex, nestedgatherNodes
expression per request. This means you cannot send in a streaming expression that joins or intersects the node sets from multiplegatherNodes
expressions. The/graph
handler does support any level of nesting within a singlegatherNodes
expression. The/stream
handler does support joining and intersecting node sets, but the/graph
handler currently does not.
Sample Request
curl --data-urlencode 'expr=gatherNodes(enron_emails,
gatherNodes(enron_emails,
walk="kayne.coulter@enron.com->from",
trackTraversal="true",
gather="to"),
walk="node->from",
scatter="leaves,branches",
trackTraversal="true",
gather="to")' http://localhost:8983/solr/enron_emails/graph
Sample GraphML Output
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
<node id="kayne.coulter@enron.com">
<data key="field">node</data>
<data key="level">0</data>
<data key="count(*)">0.0</data>
</node>
<node id="don.baughman@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="1" source="kayne.coulter@enron.com" target="don.baughman@enron.com"/>
<node id="john.kinser@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="2" source="kayne.coulter@enron.com" target="john.kinser@enron.com"/>
<node id="jay.wills@enron.com">
<data key="field">to</data>
<data key="level">1</data>
<data key="count(*)">1.0</data>
</node>
<edge id="3" source="kayne.coulter@enron.com" target="jay.wills@enron.com"/>
</graph></graphml>