JSON Facet API

JSON Faceting & Analytics

JSON Faceting exposes similar functionality to Solr’s traditional faceting but with a stronger emphasis on usability. It has several benefits over traditional faceting:

  • easier programmatic construction of complex or nested facets

  • the nesting and structure offered by JSON makes facets easier to read and understand than the flat namespace of the traditional faceting API.

  • first class support for metrics and analytics

  • more standardized response format makes responses easier for clients to parse and use

Faceted search is about aggregating data and calculating metrics about that data.

There are two main types of facets:

  • Facets that partition or categorize data (the domain) into multiple buckets

  • Facets that calculate data for a given bucket (normally a metric, statistic or analytic function)

Bucketing Facet Example

Here’s an example of a bucketing facet, that partitions documents into bucket based on the cat field (short for category), and returns the top 3 buckets:

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "categories" : {
      "type": "terms",
      "field": "cat",
      "limit": 3
    }
  }
}'

SolrJ

final TermsFacetMap categoryFacet = new TermsFacetMap("cat").setLimit(3);
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("categories", categoryFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

The response below shows us that 32 documents match the default root domain. Twelve documents have cat:electronics, 4 documents have cat:currency, etc.

[...]
  "facets":{
    "count":32,
    "categories":{
      "buckets":[{
          "val":"electronics",
          "count":12},
        {
          "val":"currency",
          "count":4},
        {
          "val":"memory",
          "count":3},
      ]
    }
  }

Stat Facet Example

Stat (also called aggregation or analytic) facets are useful for displaying information derived from query results, in addition to those results themselves. For example, stat facets can be used to provide context to users on an e-commerce site looking for memory. The example below computes the average price (and other statistics) and would allow a user to gauge whether the memory stick in their cart is a good price.

curl

curl http://localhost:8983/solr/techproducts/query -d '
q=memory&
fq=inStock:true&
json.facet={
  "avg_price" : "avg(price)",
  "num_suppliers" : "unique(manu_exact)",
  "median_weight" : "percentile(weight,50)"
}'

SolrJ

final JsonQueryRequest request =
    new JsonQueryRequest()
        .setQuery("memory")
        .withFilter("inStock:true")
        .withStatFacet("avg_price", "avg(price)")
        .withStatFacet("min_manufacturedate_dt", "min(manufacturedate_dt)")
        .withStatFacet("num_suppliers", "unique(manu_exact)")
        .withStatFacet("median_weight", "percentile(weight,50)");
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

The response to the facet request above will start with documents matching the root domain (docs containing "memory" with inStock:true) followed by the requested statistics in a facets block:

 "facets" : {
    "count" : 4,
    "avg_price" : 109.9950008392334,
    "num_suppliers" : 3,
    "median_weight" : 352.0
  }

Types of Facets

There are 4 different types of bucketing facets, which behave in two different ways:

  • "terms" and "range" facets produce multiple buckets and assign each document in the domain into one (or more) of these buckets

  • "query" and "heatmap" facets always produce a single bucket which all documents in the domain belong to

Each of these facet-types are covered in detail below.

Terms Facet

A terms facet buckets the domain based on the unique values in a field.

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    categories:{
      "type": "terms",
      "field" : "cat",
      "limit" : 5
    }
  }
}'

SolrJ

final TermsFacetMap categoryFacet = new TermsFacetMap("cat").setLimit(5);
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("categories", categoryFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);
Parameter Description

field

The field name to facet over.

offset

Used for paging, this skips the first N buckets. Defaults to 0.

limit

Limits the number of buckets returned. Defaults to 10.

sort

Specifies how to sort the buckets produced.

count specifies document count, index sorts by the index (natural) order of the bucket value. One can also sort by any facet function / statistic that occurs in the bucket. The default is count desc. This parameter may also be specified in JSON like sort:{count:desc}. The sort order may either be “asc” or “desc”

overrequest

Number of buckets beyond the limit to internally request from shards during a distributed search.

Larger values can increase the accuracy of the final "Top Terms" returned when the individual shards have very diff top terms.

The default of -1 causes a heuristic to be applied based on the other options specified.

refine

If true, turns on distributed facet refining. This uses a second phase to retrieve any buckets needed for the final result from shards that did not include those buckets in their initial internal results, so that every shard contributes to every returned bucket in this facet and any sub-facets. This makes counts & stats for returned buckets exact.

overrefine

Number of buckets beyond the limit to consider internally during a distributed search when determining which buckets to refine.

Larger values can increase the accuracy of the final "Top Terms" returned when the individual shards have very diff top terms, and the current sort option can result in refinement pushing terms lower down the sorted list (ex: sort:"count asc")

The default of -1 causes a heuristic to be applied based on other options specified.

mincount

Only return buckets with a count of at least this number. Defaults to 1.

missing

A boolean that specifies if a special “missing” bucket should be returned that is defined by documents without a value in the field. Defaults to false.

numBuckets

A boolean. If true, adds “numBuckets” to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.

allBuckets

A boolean. If true, adds an “allBuckets” bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different from a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.

prefix

Only produce buckets for terms starting with the specified prefix.

facet

Aggregations, metrics or nested facets that will be calculated for every returned bucket

method

This parameter indicates the facet algorithm to use:

  • dv DocValues, collect into ordinal array

  • uif UnInvertedField, collect into ordinal array

  • dvhash DocValues, collect into hash - improves efficiency over high cardinality fields

  • enum TermsEnum then intersect DocSet (stream-able)

  • stream Presently equivalent to enum. Used for indexed, non-point fields with sort index asc and allBuckets, numBuckets, and missing disabled.

  • smart Pick the best method for the field type (this is the default)

prelim_sort

An optional parameter for specifying an approximation of the final sort to use during initial collection of top buckets when the sort parameter is very costly.

Query Facet

The query facet produces a single bucket of documents that match the domain as well as the specified query.

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "high_popularity": {
      "type": "query",
      "q": "popularity:[8 TO 10]"
    }
  }
}'

SolrJ

QueryFacetMap queryFacet = new QueryFacetMap("popularity:[8 TO 10]");
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("high_popularity", queryFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

Users may also specify sub-facets (either "bucketing" facets or metrics):

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "high_popularity": {
      "type": "query",
      "q": "popularity:[8 TO 10]",
      "facet" : {
        "average_price" : "avg(price)"
      }
    }
  }
}'

SolrJ

QueryFacetMap queryFacet =
    new QueryFacetMap("popularity:[8 TO 10]").withStatSubFacet("average_price", "avg(price)");
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("high_popularity", queryFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

Example response:

"high_popularity" : {
  "count" : 36,
  "average_price" : 36.75
}

Range Facet

The range facet produces multiple buckets over a date or numeric field.

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "prices": {
      "type": "range",
      "field": "price",
      "start": 0,
      "end": 100,
      "gap": 20
    }
  }
}'

SolrJ

RangeFacetMap rangeFacet = new RangeFacetMap("price", 0.0, 100.0, 20.0);
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("prices", rangeFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

The output from the range facet above would look a bit like:

"prices":{
  "buckets":[
    {
      "val":0.0,  // the bucket value represents the start of each range.  This bucket covers 0-20
      "count":5},
    {
      "val":20.0,
      "count":0},
    {
      "val":40.0,
      "count":0},
    {
      "val":60.0,
      "count":1},
    {
      "val":80.0,
      "count":1}
  ]
}

Range Facet Parameters

Range facet parameter names and semantics largely mirror facet.range query-parameter style faceting. For example "start" here corresponds to "facet.range.start" in a facet.range command.

Parameter Description

field

The numeric field or date field to produce range buckets from.

start

Lower bound of the ranges.

end

Upper bound of the ranges.

gap

Size of each range bucket produced.

hardend

A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false, the last bucket will be “gap” wide, which may extend past “end”.

other

This parameter indicates that in addition to the counts for each range constraint between start and end, counts should also be computed for…

  • "before" all records with field values lower than lower bound of the first range

  • "after" all records with field values greater than the upper bound of the last range

  • "between" all records with field values between the start and end bounds of all ranges

  • "none" compute none of this information

  • "all" shortcut for before, between, and after

include

By default, the ranges used to compute range faceting between start and end are inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. The include parameter may be any combination of the following options:

  • "lower" all gap based ranges include their lower bound

  • "upper" all gap based ranges include their upper bound

  • "edge" the first and last gap ranges include their edge bounds (i.e., lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified

  • "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.

  • "all" shorthand for lower, upper, edge, outer

facet

Aggregations, metrics, or nested facets that will be calculated for every returned bucket

ranges

List of arbitrary range when specified calculates facet on given ranges rather than start, gap and end. With start, end and gap the width of the range or bucket is always fixed. If range faceting needs to computed on varying range width then, ranges should be specified.

  • Specifying start, end or gap along with ranges is disallowed and request would fail.

  • When ranges are specified in the range facet, hardend, include and other parameters are ignored.

Arbitrary Range

An arbitrary range consists of from and to values over which range bucket is computed. This range can be specified in two syntax.

Parameter Description

from

The lower bound of the range. When not specified defaults to *.

to

The upper bound of the range. When not specified defaults to *.

inclusive_from

A boolean, which if true means that include the lower bound from. This defaults to true.

inclusive_to

A boolean, which if true means that include the upper bound to. This default to false.

range

The range is specified as string. This is semantically similar to facet.interval

  • When range is specified then, all the above parameters from, to and etc in the range are ignored

  • range always start with ( or [ and ends with ) or ]

    • ( - exclude lower bound

    • [ - include lower bound

    • ) - exclude upper bound

    • ] - include upper bound

For example, For range (5,10] 5 is excluded and 10 is included

other with ranges

other parameter is ignored when ranges is specified but there are ways to achieve same behavior with ranges.

  • before - This is equivalent to [*,some_val) or just specifying to value

  • after - This is equivalent to (som_val, *] or just specifying from value

  • between - This is equivalent to specifying start, end as from and to respectively

include with ranges

include parameter is ignored when ranges is specified but there are ways to achieve same behavior with ranges. lower, upper, outer, edge all can be achieved using combination of inclusive_to and inclusive_from.

Range facet with ranges

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "prices": {
      "type": "range",
      "field": "price",
      "ranges": [
        {
          "from": 0,
          "to": 20,
          "inclusive_from": true,
          "inclusive_to": false
        },
        {
          "range": "[40,100)"
        }
      ]
    }
  }
}'

The output from the range facet above would look a bit like:

{
  "prices": {
    "buckets": [
      {
        "val": "[0,20)",
        "count": 5
      },
      {
        "val": "[40,100)",
        "count": 2
      }
    ]
  }
}
When range is specified, its value in the request is used as key in the response. In the other case, key is generated using from, to, inclusive_to and inclusive_from. Currently, custom key is not supported.

Heatmap Facet

The heatmap facet generates a 2D grid of facet counts for documents having spatial data in each grid cell.

This feature is primarily documented in the spatial section of the reference guide. The key parameters are type to specify heatmap and field to indicate a spatial RPT field. The rest of the parameter names use the same names and semantics mirroring facet.heatmap query-parameter style faceting, albeit without the "facet.heatmap." prefix. For example geom here corresponds to facet.heatmap.geom in a facet.heatmap command.

Unlike other facets that partition the domain into buckets, heatmap facets do not currently support Nested Facets.

curl

curl http://localhost:8983/solr/spatialdata/query -d '
{
  "query": "*:*",
  "facet": {
    "locations": {
      "type": "heatmap",
      "field": "location_srpt",
      "geom": "[\"50 20\" TO \"180 90\"]",
      "gridLevel": 4
    }
  }
}'

SolrJ

final JsonQueryRequest request =
    new JsonQueryRequest()
        .setQuery("*:*")
        .setLimit(0)
        .withFacet(
            "locations",
            new HeatmapFacetMap("location_srpt")
                .setHeatmapFormat(HeatmapFacetMap.HeatmapFormat.INTS2D)
                .setRegionQuery("[\"50 20\" TO \"180 90\"]")
                .setGridLevel(4));

And the facet response will look like:

{
  "facets": {
    "locations":{
      "gridLevel":1,
      "columns":6,
      "rows":4,
      "minX":-180.0,
      "maxX":90.0,
      "minY":-90.0,
      "maxY":90.0,
      "counts_ints2D":[[68,1270,459,5359,39456,1713],[123,10472,13620,7777,18376,6239],[88,6,3898,989,1314,255],[0,0,30,1,0,1]]
    }
  }
}

Stat Facet Functions

Unlike all the facets discussed so far, Aggregation functions (also called facet functions, analytic functions, or metrics) do not partition data into buckets. Instead, they calculate something over all the documents in the domain.

Aggregation Example Description

sum

sum(sales)

summation of numeric values

avg

avg(popularity)

average of numeric values

min

min(salary)

minimum value

max

max(mul(price,popularity))

maximum value

missing

missing(author)

number of documents which do not have value for given field or function

countvals

countvals(author)

number of values for a given field or function

unique

unique(author)

number of unique values of the given field. Beyond 100 values it yields not exact estimate

uniqueBlock

uniqueBlock(_root_) or uniqueBlock($fldref) where fldref=_root_

same as above with smaller footprint strictly for counting the number of Block Join blocks. The given field must be unique across blocks, and only singlevalued string fields are supported, docValues are recommended.

uniqueBlock({!v=type:parent}) or uniqueBlock({!v=$qryref}) where qryref=type:parent

same as above, but using bitset of the given query to aggregate hits.

hll

hll(author)

distributed cardinality estimate via hyper-log-log algorithm

percentile

percentile(salary,50,75,99,99.9)

Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.

sumsq

sumsq(rent)

sum of squares of field or function

variance

variance(rent)

variance of numeric field or function

stddev

stddev(rent)

standard deviation of field or function

relatedness

relatedness('popularity:[100 TO *]','inStock:true')

A function for computing a relatedness score of the documents in the domain to a Foreground set, relative to a Background set (both defined as queries). This is primarily for use when building Semantic Knowledge Graphs.

Numeric aggregation functions such as avg can be on any numeric field, or on a nested function of multiple numeric fields such as avg(div(popularity,price)).

The most common way of requesting an aggregation function is as a simple String containing the expression you wish to compute:

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "filter": [
    "price:[1.0 TO *]",
    "popularity:[0 TO 10]"
  ],
  "facet": {
    "avg_value": "avg(div(popularity,price))"
  }
}'

SolrJ

final JsonQueryRequest request =
    new JsonQueryRequest()
        .setQuery("*:*")
        .withFilter("price:[1.0 TO *]")
        .withFilter("popularity:[0 TO 10]")
        .withStatFacet("min_manu_id_s", "min(manu_id_s)")
        .withStatFacet("avg_value", "avg(div(popularity,price))");
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

An expanded form allows for Local Params to be specified. These may be used explicitly by some specialized aggregations such as relatedness(), but can also be used as parameter references to make aggregation expressions more readable, without needing to use (global) request parameters:

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "filter": [
    "price:[1.0 TO *]",
    "popularity:[0 TO 10]"
  ],
  "facet": {
    "avg_value" : {
      "type": "func",
      "func": "avg(div($numer,$denom))",
      "numer": "mul(popularity,3.0)",
      "denom": "price"
    }
  }
}'

SolrJ

final Map<String, Object> expandedStatFacet = new HashMap<>();
expandedStatFacet.put("type", "func");
expandedStatFacet.put("func", "avg(div($numer,$denom))");
expandedStatFacet.put("numer", "mul(popularity,3.0)");
expandedStatFacet.put("denom", "price");
final JsonQueryRequest request =
    new JsonQueryRequest()
        .setQuery("*:*")
        .withFilter("price:[1.0 TO *]")
        .withFilter("popularity:[0 TO 10]")
        .withFacet("avg_value", expandedStatFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

Nested Facets

Nested facets, or sub-facets, allow one to nest facet commands under any facet command that partitions the domain into buckets (i.e., terms, range, query). These sub-facets are then evaluated against the domains defined by the set of all documents in each bucket of their parent.

The syntax is identical to top-level facets - just add a facet command to the facet command block of the parent facet. Technically, every facet command is actually a sub-facet since we start off with a single facet bucket with a domain defined by the main query and filters.

Nested Facet Example

Let’s start off with a simple non-nested terms facet on the category field cat:

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "cat",
      "limit": 3
    }
  }
}'

SolrJ

final TermsFacetMap categoryFacet = new TermsFacetMap("cat").setLimit(3);
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("categories", categoryFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

The response for the facet above will show the top category and the number of documents that falls into each category bucket. Nested facets can be used to gather additional information about each bucket of documents. For example, using the nested facet below, we can find the top categories as well as who the leading manufacturer is in each category:

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "cat",
      "limit": 3,
      "facet": {
        "top_manufacturer": {
          "type": "terms",
          "field": "manu_id_s",
          "limit": 1
        }
      }
    }
  }
}'

SolrJ

final TermsFacetMap topCategoriesFacet = new TermsFacetMap("cat").setLimit(3);
final TermsFacetMap topManufacturerFacet = new TermsFacetMap("manu_id_s").setLimit(1);
topCategoriesFacet.withSubFacet("top_manufacturers", topManufacturerFacet);
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("categories", topCategoriesFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

And the response will look something like:

"facets":{
    "count":32,
    "categories":{
      "buckets":[{
          "val":"electronics",
          "count":12,
          "top_manufacturer":{
            "buckets":[{
                "val":"corsair",
                "count":3}]}},
        {
          "val":"currency",
          "count":4,
          "top_manufacturer":{
            "buckets":[{
                "val":"boa",
                "count":1}]}}]}}

Sorting Facets By Nested Functions

The default sort for a field or terms facet is by bucket count descending. We can optionally sort ascending or descending by any facet function that appears in each bucket.

curl

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query": "*:*",
  "facet": {
    "categories":{
      "type" : "terms",     // terms facet creates a bucket for each indexed term in the field
      "field" : "cat",
      "limit": 3,
      "sort" : "avg_price desc",
      "facet" : {
        "avg_price" : "avg(price)",
      }
    }
  }
}'

SolrJ

final TermsFacetMap topCategoriesFacet =
    new TermsFacetMap("cat")
        .setLimit(3)
        .withStatSubFacet("avg_price", "avg(price)")
        .setSort("avg_price desc");
final JsonQueryRequest request =
    new JsonQueryRequest().setQuery("*:*").withFacet("categories", topCategoriesFacet);
QueryResponse queryResponse = request.process(solrClient, COLLECTION_NAME);

In some situations the desired sort may be an aggregation function that is very costly to compute for every bucket. A prelim_sort option can be used to specify an approximation of the sort, for initially ranking the buckets to determine the top candidates (based on the limit and overrequest). Only after the top candidate buckets have been refined, will the actual sort be used.

{
  categories:{
    type : terms,
    field : cat,
    refine: true,
    limit: 10,
    overrequest: 100,
    prelim_sort: "sales_rank desc",
    sort : "prod_quality desc",
    facet : {
      prod_quality : "avg(div(prod(rating,sales_rank),prod(num_returns,price)))"
      sales_rank : "sum(sales_rank)"
    }
  }
}

Changing the Domain

As discussed above, facets compute buckets or statistics based on their "domain" of documents.

  • By default, top-level facets use the set of all documents matching the main query as their domain.

  • Nested "sub-facets" are computed for every bucket of their parent facet, using a domain containing all documents in that bucket.

In addition to this default behavior, domains can be also be widened, narrowed, or changed entirely. The JSON Faceting API supports modifying domains through its domain property. This is discussed in more detail in JSON Faceting Domain Changes.

Special Stat Facet Functions

Most stat facet functions (avg, sumsq, etc.) allow users to perform math computations on groups of documents. A few functions are more involved though, and deserve an explanation of their own. These are described in more detail in the sections below.

uniqueBlock() and Block Join Counts

When a collection contains nested documents, the blockChildren and blockParent domain changes can be useful when searching for parent documents and you want to compute stats against all of the affected children documents (or vice versa). But if you only need to know the count of all the blocks that exist in the current domain, a more efficient option is the uniqueBlock() aggregate function.

Suppose we have products with multiple SKUs, and we want to count products for each color.

{
  "id": "1", "type": "product", "name": "Solr T-Shirt",
  "_childDocuments_": [
    { "id": "11", "type": "SKU", "color": "Red",  "size": "L" },
    { "id": "12", "type": "SKU", "color": "Blue", "size": "L" },
    { "id": "13", "type": "SKU", "color": "Red",  "size": "M" }
  ]
},
{
  "id": "2", "type": "product", "name": "Solr T-Shirt",
  "_childDocuments_": [
    { "id": "21", "type": "SKU", "color": "Blue", "size": "S" }
  ]
}

When searching against a set of SKU documents, we can ask for a facet on color, with a nested statistic counting all the "blocks" — aka: products:

color: {
  type: terms,
  field: color,
  limit: -1,
  facet: {
    productsCount: "uniqueBlock(_root_)"
      // or "uniqueBlock({!v=type:product})"
  }
}

and get:

color:{
   buckets:[
      { val:Blue, count:2, productsCount:2 },
      { val:Red, count:2, productsCount:1 }
   ]
}

Please notice that _root_ is an internal field added by Lucene to each child document to reference on parent one. Aggregation uniqueBlock(_root_) is functionally equivalent to unique(_root_), but is optimized for nested documents block structure. It’s recommended to define limit: -1 for uniqueBlock calculation, like in above example, since default value of limit parameter is 10, while uniqueBlock is supposed to be much faster with -1.

relatedness() and Semantic Knowledge Graphs

The relatedness(…​) stat function allows for sets of documents to be scored relative to Foreground and Background sets of documents, for the purposes of finding ad-hoc relationships that make up a "Semantic Knowledge Graph":

At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes.

— Grainger et al.
The Semantic Knowledge Graph

The relatedness(…​) function is used to "score" these relationships, relative to "Foreground" and "Background" sets of documents, specified in the function params as queries.

Unlike most aggregation functions, the relatedness(…​) function is aware of whether and how it’s used in Nested Facets. It evaluates the query defining the current bucket independently of its parent/ancestor buckets, and intersects those documents with a "Foreground Set" defined by the foreground query combined with the ancestor buckets. The result is then compared to a similar intersection done against the "Background Set" (defined exclusively by background query) to see if there is a positive, or negative, correlation between the current bucket and the Foreground Set, relative to the Background Set.

The semantics of relatedness(…​) in an allBuckets context is currently undefined. Accordingly, although the relatedness(…​) stat may be specified for a facet request that also specifies allBuckets:true, the allBuckets bucket itself will not include a relatedness calculation.
While it’s very common to define the Background Set as *:*, or some other super-set of the Foreground Query, it is not strictly required. The relatedness(…​) function can be used to compare the statistical relatedness of sets of documents to orthogonal foreground/background queries.

relatedness() Options

When using the extended type:func syntax for specifying a relatedness() aggregation, an optional min_popularity (float) option can be used to specify a lower bound on the foreground_popularity and background_popularity values, that must be met in order for the relatedness score to be valid — If this min_popularity is not met, then the relatedness score will be -Infinity.

The default implementation for calculating relatedness() domain correlation depends on the type of facet being calculated. Generic domain correlation is calculated per-term, by selectively retrieving a DocSet for each bucket-associated query (consulting the filterCache) and calculating DocSet intersections with "foreground" and "background" sets. For term facets (especially over high-cardinality fields) this approach can lead to filterCache thrashing; accordingly, relatedness() over term facets defaults where possible to an approach that collects facet counts directly over all multiple domains in a single sweep (never touching the filterCache). It is possible to explicitly control this "single sweep" collection by setting the extended type:func syntax sweep_collection option to true (the default) or false (to disable sweep collection).

Disabling sweep collection for relatedness() stats over low-cardinality fields may yield a performance benefit, provided the filterCache is sufficiently large to accommodate an entry for each value in the associated field without inducing thrashing for anticipated use patterns. A reasonable heuristic is that fields of cardinality less than 1,000 may benefit from disabling sweep. This heuristic is not used to determine default behavior, particularly because non-sweep collection can so easily induce filterCache thrashing, with system-wide detrimental effects.
{ "type": "func",
  "func": "relatedness($fore,$back)",
  "min_popularity": 0.001,
}

This can be particularly useful when using a descending sorting on relatedness() with foreground and background queries that are disjoint, to ensure the "top buckets" are all relevant to both sets.

When sorting on relatedness(…​) requests can be processed much more quickly by adding a prelim_sort: "count desc" option. Increasing the overrequest can help improve the accuracy of the top buckets.

Semantic Knowledge Graph Example

Sample Documents
curl -sS -X POST 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '[
{"id":"01",age:15,"state":"AZ","hobbies":["soccer","painting","cycling"]},
{"id":"02",age:22,"state":"AZ","hobbies":["swimming","darts","cycling"]},
{"id":"03",age:27,"state":"AZ","hobbies":["swimming","frisbee","painting"]},
{"id":"04",age:33,"state":"AZ","hobbies":["darts"]},
{"id":"05",age:42,"state":"AZ","hobbies":["swimming","golf","painting"]},
{"id":"06",age:54,"state":"AZ","hobbies":["swimming","golf"]},
{"id":"07",age:67,"state":"AZ","hobbies":["golf","painting"]},
{"id":"08",age:71,"state":"AZ","hobbies":["painting"]},
{"id":"09",age:14,"state":"CO","hobbies":["soccer","frisbee","skiing","swimming","skating"]},
{"id":"10",age:23,"state":"CO","hobbies":["skiing","darts","cycling","swimming"]},
{"id":"11",age:26,"state":"CO","hobbies":["skiing","golf"]},
{"id":"12",age:35,"state":"CO","hobbies":["golf","frisbee","painting","skiing"]},
{"id":"13",age:47,"state":"CO","hobbies":["skiing","darts","painting","skating"]},
{"id":"14",age:51,"state":"CO","hobbies":["skiing","golf"]},
{"id":"15",age:64,"state":"CO","hobbies":["skating","cycling"]},
{"id":"16",age:73,"state":"CO","hobbies":["painting"]},
]'
Example Query
curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:*
&back=*:*                                  (1)
&fore=age:[35 TO *]                        (2)
&json.facet={
  hobby : {
    type : terms,
    field : hobbies,
    limit : 5,
    sort : { r1: desc },                   (3)
    facet : {
      r1 : "relatedness($fore,$back)",     (4)
      location : {
        type : terms,
        field : state,
        limit : 2,
        sort : { r2: desc },               (3)
        facet : {
          r2 : "relatedness($fore,$back)"  (4)
        }
      }
    }
  }
}'
1 Use the entire collection as our "Background Set"
2 Use a query for "age >= 35" to define our (initial) "Foreground Set"
3 For both the top level hobbies facet & the sub-facet on state we will be sorting on the relatedness(…​) values
4 In both calls to the relatedness(…​) function, we use parameter variables to refer to the previously defined fore and back queries.
The Facet Response
"facets":{
  "count":16,
  "hobby":{
    "buckets":[{
        "val":"golf",
        "count":6,                                (1)
        "r1":{
          "relatedness":0.01225,
          "foreground_popularity":0.3125,         (2)
          "background_popularity":0.375},         (3)
        "location":{
          "buckets":[{
              "val":"az",
              "count":3,
              "r2":{
                "relatedness":0.00496,            (4)
                "foreground_popularity":0.1875,   (6)
                "background_popularity":0.5}},    (7)
            {
              "val":"co",
              "count":3,
              "r2":{
                "relatedness":-0.00496,           (5)
                "foreground_popularity":0.125,
                "background_popularity":0.5}}]}},
      {
        "val":"painting",
        "count":8,                                (1)
        "r1":{
          "relatedness":0.01097,
          "foreground_popularity":0.375,
          "background_popularity":0.5},
        "location":{
          "buckets":[{
            ...
1 Even though hobbies:golf has a lower total facet count then hobbies:painting, it has a higher relatedness score, indicating that relative to the Background Set (the entire collection) Golf has a stronger correlation to our Foreground Set (people age 35+) then Painting.
2 The number of documents matching age:[35 TO *] and hobbies:golf is 31.25% of the total number of documents in the Background Set
3 37.5% of the documents in the Background Set match hobbies:golf
4 The state of Arizona (AZ) has a positive relatedness correlation with the nested Foreground Set (people ages 35+ who play Golf) compared to the Background Set — i.e., "People in Arizona are statistically more likely to be '35+ year old Golfers' than the country as a whole."
5 The state of Colorado (CO) has a negative correlation with the nested Foreground Set — i.e., "People in Colorado are statistically less likely to be '35+ year old Golfers' then the country as a whole."
6 The number documents matching age:[35 TO *] and hobbies:golf and state:AZ is 18.75% of the total number of documents in the Background Set
7 50% of the documents in the Background Set match state:AZ