Join Query Parser
The Join query parser allows users to run queries that normalize relationships between documents.
Solr runs a subquery of the user’s choosing (the v
param), identifies all the values that matching documents have in a field of interest (the from
param), and then returns documents where those values are contained in a second field of interest (the to
param).
In practice, these semantics are much like "inner queries" in a SQL engine. As an example, consider the Solr query below:
/solr/techproducts/select?q={!join from=manu_id_s to=id}title:ipod
This query, which returns a document for each manufacturer that makes a product with "ipod" in the title, is semantically identical to the SQL query below:
SELECT *
FROM techproducts
WHERE id IN (
SELECT manu_id_s
FROM techproducts
WHERE title='ipod'
)
The join operation is done on a term basis, so the from
and to
fields must use compatible field types.
For example: joining between a StrField
and a IntPointField
will not work.
Likewise joining between a StrField
and a TextField
that uses LowerCaseFilterFactory
will only work for values that are already lower cased in the string field.
Parameters
This query parser takes the following parameters:
from
-
Required
Default: none
The name of a field which contains values to look for in the
to
field. Can be single or multi-valued, but must have a field type compatible with the field represented in theto
field. to
-
Required
Default: none
The name of a field whose value(s) will be checked against those found in the
from
field. Can be single or multi-valued, but must have a field type compatible with thefrom
field. fromIndex
-
Optional
Default: see description
The name of the index to run the "from" query (
v
parameter) on and where "from" values are gathered. Must be located on the same node as the core processing the request. If this parameter is not defined, it defaults to the value of the processing core. See Joining Across Single Shard Collections or Cross Collection Join below for more information. score
-
Optional
Default: see description
Instructs Solr to return information about the "from" query scores. The value of this parameter controls what type of aggregation information is returned. Options include
avg
(average),max
(maximum),min
(minimum),total
(total), ornone
.If
method
is not specified butscore
is, then thedvWithScore
method is used. Ifmethod
is specified and is notdvWithScore
, then thescore
value is ignored. See themethod
parameter documentation below for more details. method
-
Optional
Default: see description
Determines which of several query implementations should be used by Solr. Options are restricted to:
index
,dvWithScore
, andtopLevelDV
.If unspecified the default value is
index
, unless thescore
parameter is present which overrides it todvWithScore
. Each implementation has its own performance characteristics, and users are encouraged to experiment to determine which implementation is most performant for their use-case. Details and performance heuristics are given below.index
-
The default
method
unless thescore
parameter is specified. It uses the terms index structures to process the request. Performance scales with the cardinality and number of postings (term occurrences) in the "from" field. Consider this method when the "from" field has low cardinality, when the "to" side returns a large number of documents, or when sporadic post-commit slowdowns cannot be tolerated (this is a disadvantage of other methods thatindex
avoids). dvWithScore
-
Returns an optional "score" statistic alongside result documents. It uses docValues structures if available, but falls back to the field cache when necessary. The first access to the field cache slows down the initial requests following a commit and takes up additional space on the JVM heap, so docValues are recommended in most situations. Performance scales linearly with the number of values matched in the "from" field. This method must be used if score information is required, and should also be considered when the "from" query matches few documents, regardless of the number of "to" side documents returned.
dvWithScore and single value numericsThe
dvWithScore
method doesn’t support single value numeric fields. Users migrating from versions prior to 7.0 are encouraged to change field types to string and rebuild indexes during migration. topLevelDV
-
Can only be used when
to
andfrom
fields have docValues data, and does not currently support numeric fields. It uses top-level docValues data structures to find results. These data structures outperform other methods as the number of values matched in thefrom
field grows high. But they are also expensive to build and need to be lazily populated after each commit, causing a sometimes-noticeable slowdown on the first query to use them after each commit. If you commit frequently and your use-case can tolerate a static warming query, consider adding one tosolrconfig.xml
so that this work is done as a part of the commit itself and not attached directly to user requests. Consider this method when the "from" query matches a large number of documents and the "to" result set is small to moderate in size, but only if sporadic post-commit slowness is tolerable.
Joining Across Single Shard Collections
You can also specify a fromIndex
parameter to join with a field from another core or a single shard collection.
If running in SolrCloud mode, then the collection specified in the fromIndex
parameter must have a single shard and a replica on all Solr nodes where the collection you’re joining to has a replica.
Let’s consider an example where you want to use a Solr join query to filter movies by directors that have won an Oscar. Specifically, imagine we have two collections with the following fields:
movies: id, title, director_id, …
movie_directors: id, name, has_oscar, …
To filter movies by directors that have won an Oscar using a Solr join on the movie_directors collection, you can send the following filter query to the movies collection:
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
Notice that the query criteria of the filter (has_oscar:true
) is based on a field in the collection specified using fromIndex
.
Keep in mind that you cannot return fields from the fromIndex
collection using join queries, you can only use the fields for filtering results in the "to" collection (movies).
Next, let’s understand how these collections need to be deployed in your cluster. Imagine the movies collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the movies collection has replicas on the following four nodes:
node 1: movies_shard1_replica1
node 2: movies_shard1_replica2
node 3: movies_shard2_replica1
node 4: movies_shard2_replica2
To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four:
node 1: movie_directors_shard1_replica1
node 2: movie_directors_shard1_replica2
node 3: movie_directors_shard1_replica3
node 4: movie_directors_shard1_replica4
At query time, the JoinQParser
will access the local replica of the movie_directors collection to perform the join.
If a local replica is not available or active, then the query will fail.
At this point, it should be clear that since you’re limited to a single shard and the data must be replicated across all nodes where it is needed, this approach works better with smaller data sets where there is a one-to-many relationship between the from collection and the to collection.
Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection.
For more information, Erick Erickson has written a blog post about join performance titled Solr and Joins.
Cross Collection Join
The Cross Collection Join Filter is a method for the join parser that will execute a query against a remote Solr collection to get back a set of join keys that will be used to as a filter query against the local Solr collection.
The cross collection join query will create an CrossCollectionQuery
object.
The CrossCollectionQuery
will first query a remote Solr collection and get back a streaming expression result of the join keys.
As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up.
This avoids keeping the full set of join keys in memory at any given time.
This bitset is then inserted into the filter cache upon successful execution as with the normal behavior of the Solr filter cache.
If the local index is sharded according to the join key field, the cross collection join can leverage a secondary query parser called the Hash Range Query Parser.
The hash range query parser is responsible for returning only the documents that hash to a given range of values.
This allows the CrossCollectionQuery
to query the remote Solr collection and return only the join keys that would match a specific shard in the local Solr collection.
This has the benefit of making sure that network traffic doesn’t increase as the number of shards increases and allows for much greater scalability.
The cross collection join query works with both string and point types of fields. The fields that are being used for the join key must be single-valued and have docValues enabled.
It’s advised to shard the local collection by the join key as this allows for the optimization mentioned above to be utilized.
Cross collection join queries should not generally be used as part of the q
parameter.
It is designed to be used as a filter query (fq
parameter) to ensure proper caching.
The remote Solr collection that is being queried should have a single-valued field for the join key with docValues enabled.
The remote Solr collection does not have any specific sharding requirements.
Join Query Parser Definition in solrconfig.xml
The cross collection join has some configuration options that can be specified in solrconfig.xml
.
routerField
-
Optional
Default: none
If the documents are routed to shards using the CompositeID router by the join field, then that field name should be specified in the configuration here. This will allow the parser to optimize the resulting HashRange query.
allowSolrUrls
-
Optional
Default: none
If specified, this array of strings specifies allow-listed Solr URLs that can be passed to the
solrUrl
query parameter. Without this configuration thesolrUrl
parameter cannot be used. This restriction is necessary to prevent an attacker from using Solr to explore the network.
<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
<str name="routerField">product_id_s</str>
<arr name="allowSolrUrls">
<str>http://othersolr.example.com:8983/solr</str>
</arr>
</queryParser>
Cross Collection Join Query Parameters
fromIndex
-
Required
Default: none
The name of the external Solr collection to be queried to retrieve the set of join key values.
zkHost
-
Optional
Default: none
The connection string to be used to connect to ZooKeeper.
zkHost
andsolrUrl
are both optional parameters, and at most one of them should be specified. If neitherzkHost
norsolrUrl
are specified, the local ZooKeeper cluster will be used. solrUrl
-
Optional
Default: none
The URL of the external Solr node to be queried. Must be a character-for-character exact match of a allow-listed URL that is listed in the
allowSolrUrls
parameter insolrconfig.xml
. If the URL does not match, this parameter will be effectively disabled. from
-
Required
Default: none
The join key field name in the external collection.
to
-
Optional
Default: none
The join key field name in the local collection.
v
-
Optional
Default: none
The query substituted in as a local param. This is the query string that will match documents in the remote collection.
routed
-
Optional
Default:
false
If
true
, the cross collection join query will use each shard’s hash range to determine the set of join keys to retrieve for that shard. This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by theto
field. If this parameter is not specified, the cross collection join query will try to determine the correct value automatically. ttl
-
Optional
Default:
3600
secondsThe length of time that a cross collection join query in the cache will be considered valid, in seconds. The cross collection join query will not be aware of changes to the remote collection, so if the remote collection is updated, cached cross collection queries may give inaccurate results. After the
ttl
period has expired, the cross collection join query will re-execute the join against the remote collection. - Other Parameters
-
Any normal Solr query parameter can also be specified/passed through as a local param.