JSON Facet API

The new Facet & Analytics Module exposed via the JSON Facet API is a rewrite of Solr’s previous faceting capabilities, with the following goals:

First class native JSON API to control faceting and analytics
- The structured nature of nested sub-facets are more naturally expressed in JSON rather than the flat namespace provided by normal query parameters.
First class integrated analytics support
Nest any facet type under any other facet type (such as range facet, field facet, query facet)
Ability to sort facet buckets by any calculated metric
Easier programmatic construction of complex nested facet commands
Support a more canonical response format that is easier for clients to parse
Support a cleaner way to implement distributed faceting
Support better integration with other search features
Full integration with the JSON Request API

Faceted Search

Faceted search is about aggregating data and calculating metrics about that data.

There are two main types of facets:

Facets that partition or categorize data (the domain) into multiple buckets
Facets that calculate data for a given bucket (normally a metric, statistic or analytic function)

Metrics Example

By default, the domain for facets starts with all documents that match the base query and any filters. Here’s an example that requests various metrics about the root domain:

curl http://localhost:8983/solr/techproducts/query -d '
q=memory&
fq=inStock:true&
json.facet={
  "avg_price" : "avg(price)",
  "num_suppliers" : "unique(manu_exact)",
  "median_weight" : "percentile(weight,50)"
}'

The response to the facet request above will start with documents matching the root domain (docs containing "memory" with inStock:true) then calculate and return the requested metrics:

 "facets" : {
    "count" : 4,
    "avg_price" : 109.9950008392334,
    "num_suppliers" : 3,
    "median_weight" : 352.0
  }

Here’s an example of a bucketing facet, that partitions documents into bucket based on the cat field (short for category), and returns the top 3 buckets:

curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
json.facet={
  categories : {
    type : terms,
    field : cat,    // bucket documents based on the "cat" field
    limit : 3       // retrieve the top 3 buckets ranked by the number of docs in each bucket
  }
}'

The response below shows us that 32 documents match the default root domain. and 12 documents have cat:electronics, 4 documents have cat:currency, etc.

[...]
  "facets":{
    "count":32,
    "categories":{
      "buckets":[{
          "val":"electronics",
          "count":12},
        {
          "val":"currency",
          "count":4},
        {
          "val":"memory",
          "count":3},
      ]
    }
  }

In this guide, we will often just present the facet command block:

{
  x : "average(mul(price,popularity))"
}

To execute a facet command block such as this, you’ll need to use the json.facet parameter, and provide at least a base query such as q=*:*

curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&json.facet=
{
  x : "avg(mul(price,popularity))"
}
'

Another option is to use the JSON Request API to provide the entire request in JSON:

curl http://localhost:8983/solr/techproducts/query -d '
{
  query : "*:*",                        // this is the base query
  filter : [ "inStock:true" ],          // a list of filters
  facet : {
    x : "avg(mul(price,popularity))"    // and our funky metric of average of price * popularity
 }
}
'

JSON Extensions

The Noggit JSON parser that is used by Solr accepts a number of JSON extensions such as

bare words can be left unquoted
single line comments using either // or #
Multi-line comments using C style /* comments in here */
Single quoted strings
Allow backslash escaping of any character
Allow trailing commas and extra commas. Example: [9,4,3,]
Handle nbsp (non-break space, \u00a0) as whitespace.

The terms facet (or field facet) buckets the domain based on the unique terms / values of a field.

curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
json.facet={
  categories:{
    terms: {
      field : cat,    // bucket documents based on the "cat" field
      limit : 5       // retrieve the top 5 buckets ranked by the number of docs in each bucket
    }
  }
}'

Parameter Description

Parameter	Description
field	The field name to facet over.
offset	Used for paging, this skips the first N buckets. Defaults to 0.
limit	Limits the number of buckets returned. Defaults to 10.
refine	If true, turns on distributed facet refining. This uses a second phase to retrieve selected stats from shards so that every shard contributes to every returned bucket in this facet and any sub-facets. This makes stats for returned buckets exact.
overrequest	Number of buckets beyond the limit to request internally during distributed search. -1 means default.
mincount	Only return buckets with a count of at least this number. Defaults to 1.
sort	Specifies how to sort the buckets produced. “count” specifies document count, “index” sorts by the index (natural) order of the bucket value. One can also sort by any facet function / statistic that occurs in the bucket. The default is “count desc”. This parameter may also be specified in JSON like `sort:{count:desc}`. The sort order may either be “asc” or “desc”
missing	A boolean that specifies if a special “missing” bucket should be returned that is defined by documents without a value in the field. Defaults to false.
numBuckets	A boolean. If true, adds “numBuckets” to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.
allBuckets	A boolean. If true, adds an “allBuckets” bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different than a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.
prefix	Only produce buckets for terms starting with the specified prefix.
facet	Aggregations, metrics or nested facets that will be calculated for every returned bucket
method	This parameter indicates the facet algorithm to use: "dv" DocValues, collect into ordinal array "uif" UnInvertedField, collect into ordinal array "dvhash" DocValues, collect into hash - improves efficiency over high cardinality fields "enum" TermsEnum then intersect DocSet (stream-able) "stream" Presently equivalent to "enum" "smart" Pick the best method for the field type (this is the default)

field

The field name to facet over.

offset

Used for paging, this skips the first N buckets. Defaults to 0.

limit

Limits the number of buckets returned. Defaults to 10.

refine

If true, turns on distributed facet refining. This uses a second phase to retrieve selected stats from shards so that every shard contributes to every returned bucket in this facet and any sub-facets. This makes stats for returned buckets exact.

overrequest

Number of buckets beyond the limit to request internally during distributed search. -1 means default.

mincount

Only return buckets with a count of at least this number. Defaults to 1.

sort

Specifies how to sort the buckets produced. “count” specifies document count, “index” sorts by the index (natural) order of the bucket value. One can also sort by any facet function / statistic that occurs in the bucket. The default is “count desc”. This parameter may also be specified in JSON like sort:{count:desc}. The sort order may either be “asc” or “desc”

missing

A boolean that specifies if a special “missing” bucket should be returned that is defined by documents without a value in the field. Defaults to false.

numBuckets

A boolean. If true, adds “numBuckets” to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.

allBuckets

A boolean. If true, adds an “allBuckets” bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different than a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.

prefix

Only produce buckets for terms starting with the specified prefix.

facet

Aggregations, metrics or nested facets that will be calculated for every returned bucket

method

This parameter indicates the facet algorithm to use:

"dv" DocValues, collect into ordinal array
"uif" UnInvertedField, collect into ordinal array
"dvhash" DocValues, collect into hash - improves efficiency over high cardinality fields
"enum" TermsEnum then intersect DocSet (stream-able)
"stream" Presently equivalent to "enum"
"smart" Pick the best method for the field type (this is the default)

The query facet produces a single bucket of documents that match the domain as well as the specified query.

An example of the simplest form of the query facet is "query":"query string".

{
  high_popularity : { query : "popularity:[8 TO 10]" }
}

An expanded form allows for more parameters and a facet command block to specify sub-facets (either nested facets or metrics):

{
  high_popularity : {
    type: query,
    q : "popularity:[8 TO 10]",
    facet : { average_price : "avg(price)" }
  }
}

Example response:

"high_popularity" : {
  "count" : 36,
  "average_price" : 36.75
}

The range facet produces multiple buckets over a date field or numeric field.

Example:

{
  prices : {
    type: range,
    field : price,
    start : 0,
    end : 100,
    gap : 20
  }
}

"prices":{
  "buckets":[
    {
      "val":0.0,  // the bucket value represents the start of each range.  This bucket covers 0-20
      "count":5},
    {
      "val":20.0,
      "count":3},
    {
      "val":40.0,
      "count":2},
    {
      "val":60.0,
      "count":1},
    {
      "val":80.0,
      "count":1}
  ]
}

To ease migration, the range facet parameter names and semantics largely mirror facet.range query-parameter style faceting. For example "start" here corresponds to "facet.range.start" in a facet.range command.

Parameter Description

Parameter	Description
field	The numeric field or date field to produce range buckets from.
start	Lower bound of the ranges.
end	Upper bound of the ranges.
gap	Size of each range bucket produced.
hardend	A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false, the last bucket will be “gap” wide, which may extend past “end”.
other	This parameter indicates that in addition to the counts for each range constraint between `start` and `end`, counts should also be computed for… "before" all records with field values lower then lower bound of the first range "after" all records with field values greater then the upper bound of the last range "between" all records with field values between the start and end bounds of all ranges "none" compute none of this information "all" shortcut for before, between, and after
include	By default, the ranges used to compute range faceting between `start` and `end` are inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. The `include` parameter may be any combination of the following options: "lower" all gap based ranges include their lower bound "upper" all gap based ranges include their upper bound "edge" the first and last gap ranges include their edge bounds (i.e., lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries. "all" shorthand for lower, upper, edge, outer
facet	Aggregations, metrics, or nested facets that will be calculated for every returned bucket

field

The numeric field or date field to produce range buckets from.

start

Lower bound of the ranges.

end

Upper bound of the ranges.

gap

Size of each range bucket produced.

hardend

A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false, the last bucket will be “gap” wide, which may extend past “end”.

other

This parameter indicates that in addition to the counts for each range constraint between start and end, counts should also be computed for…

"before" all records with field values lower then lower bound of the first range
"after" all records with field values greater then the upper bound of the last range
"between" all records with field values between the start and end bounds of all ranges
"none" compute none of this information
"all" shortcut for before, between, and after

include

By default, the ranges used to compute range faceting between start and end are inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. The include parameter may be any combination of the following options:

"lower" all gap based ranges include their lower bound
"upper" all gap based ranges include their upper bound
"edge" the first and last gap ranges include their edge bounds (i.e., lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified
"outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
"all" shorthand for lower, upper, edge, outer

facet

Aggregations, metrics, or nested facets that will be calculated for every returned bucket

One can filter the domain before faceting via the filter keyword in the domain block of the facet.

Example:

{
  categories : {
     type : terms,
     field : cat,
     domain : { filter : "popularity:[5 TO 10]" }
   }
}

The value of filter can be a single query to treat as a filter, or a list of filter queries. Each one can be:

a string containing a query in Solr query syntax
a reference to a request parameter containing Solr query syntax, of the form: {param : <request_param_name>}

Filter Exclusions

One can exclude top-level filters and query before faceting via the excludeTags keywords in the domain block of the facet.

Example:

&q={!tag=top}"running shorts"
&fq={!tag=COLOR}color:Blue
&json={
   filter:"{!tag=BRAND}brand:Bosco"
   facet:{
      sizes:{type:terms, field:size},
      colors:{type:terms, field:color, domain:{excludeTags:COLOR} },
      brands:{type:terms, field:brand, domain:{excludeTags:"BRAND,top"} }
   }
}

The value of excludeTags can be a single string tag, array of string tags or comma-separated tags in the single string. See also the section on multi-select faceting.

Aggregation Functions

Aggregation functions, also called facet functions, analytic functions, or metrics, calculate something interesting over a domain (each facet bucket).

Aggregation Example Description

Aggregation	Example	Description
sum	`sum(sales)`	summation of numeric values
avg	`avg(popularity)`	average of numeric values
min	`min(salary)`	minimum value
max	`max(mul(price,popularity))`	maximum value
unique	`unique(author)`	number of unique values of the given field. Beyond 100 values it yields not exact estimate
uniqueBlock	`uniqueBlock(_root_)`	same as above with smaller footprint strictly requires block index. The given field is expected to be unique across blocks, now only singlevalued string fields are supported, docValues are recommended.
hll	`hll(author)`	distributed cardinality estimate via hyper-log-log algorithm
percentile	`percentile(salary,50,75,99,99.9)`	Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.
sumsq	`sumsq(rent)`	sum of squares of field or function
variance	`variance(rent)`	variance of numeric field or function
stddev	`stddev(rent)`	standard deviation of field or function
relatedness	`relatedness('popularity:[100 TO \*]','inStock:true')`	A function for computing a relatedness score of the documents in the domain to a Foreground set, relative to a Background set (both defined as queries). This is primarily for use when building Semantic Knowledge Graphs.

sum

sum(sales)

summation of numeric values

avg

avg(popularity)

average of numeric values

min

min(salary)

minimum value

max

max(mul(price,popularity))

maximum value

unique

unique(author)

number of unique values of the given field. Beyond 100 values it yields not exact estimate

uniqueBlock

uniqueBlock(_root_)

same as above with smaller footprint strictly requires block index. The given field is expected to be unique across blocks, now only singlevalued string fields are supported, docValues are recommended.

hll

hll(author)

distributed cardinality estimate via hyper-log-log algorithm

percentile

percentile(salary,50,75,99,99.9)

Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.

sumsq

sumsq(rent)

sum of squares of field or function

variance

variance(rent)

variance of numeric field or function

stddev

stddev(rent)

standard deviation of field or function

relatedness

relatedness('popularity:[100 TO \*]','inStock:true')

A function for computing a relatedness score of the documents in the domain to a Foreground set, relative to a Background set (both defined as queries). This is primarily for use when building Semantic Knowledge Graphs.

Numeric aggregation functions such as avg can be on any numeric field, or on another function of multiple numeric fields such as avg(mul(price,popularity)).

The default sort for a field or terms facet is by bucket count descending. We can optionally sort ascending or descending by any facet function that appears in each bucket.

{
  categories:{
    type : terms      // terms facet creates a bucket for each indexed term in the field
    field : cat,
    sort : "x desc",  // can also use sort:{x:desc}
    facet : {
      x : "avg(price)",     // x = average price for each facet bucket
      y : "max(popularity)" // y = max popularity value in each facet bucket
    }
  }
}

Nested facets, or sub-facets, allow one to nest bucketing facet commands like terms, range, or query facets under other facet commands. The syntax is identical to top-level facets - just add the facet command to the facet command block of the parent facet. Technically, every facet command is actually a sub-facet since we start off with a single facet bucket with a domain defined by the main query and filters.

Let’s start off with a simple non-nested terms facet on the genre field:

 top_genres:{
    type: terms
    field: genre,
    limit: 5
  }

Now if we wanted to add a nested facet to find the top 2 authors for each genre bucket:

  top_genres:{
    type: terms,
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        type: terms, // nested terms facet on author will be calculated for each parent bucket (genre)
        field: author,
        limit: 2
      }
    }
  }

And the response will look something like:

  "facets":{
    "top_genres":{
      "buckets":[
        {
          "val":"Fantasy",
          "count":5432,
          "top_authors":{  // these are the top authors in the "Fantasy" genre
            "buckets":[{
                "val":"Mercedes Lackey",
                "count":121},
              {
                "val":"Piers Anthony",
                "count":98}
            ]
          }
        },
        {
          "val":"Mystery",
          "count":4322,
          "top_authors":{  // these are the top authors in the "Mystery" genre
            "buckets":[{
                "val":"James Patterson",
                "count":146},
              {
                "val":"Patricia Cornwell",
                "count":132}
            ]
          }
        }

By default "top authors" is defined by simple document count descending, but we could use our aggregation functions to sort by more interesting metrics.

Block Join Facets facets allow bucketing child documents as attributes of their parents.

Suppose we have products with multiple SKUs, and we want to count products for each color.

{
    "id": "1", "type": "product", "name": "Solr T-Shirt",
    "_childDocuments_": [
      { "id": "11", "type": "SKU", "color": "Red",  "size": "L" },
      { "id": "12", "type": "SKU", "color": "Blue", "size": "L" },
      { "id": "13", "type": "SKU", "color": "Red",  "size": "M" },
      { "id": "14", "type": "SKU", "color": "Blue", "size": "S" }
    ]
  }

For SKU domain we can request

  color: {
    type: terms,
    field: color,
    limit: -1,
    facet: {
      productsCount: "uniqueBlock(_root_)"
    }
  }

and get


  [...]
  color:{
     buckets:[
        {
          val:Red, count:2, productsCount:1
        },
        {
          val:Blue, count:2, productsCount:1
        }
     ]
  }

Please notice that _root_ is an internal field added by Lucene to each child document to reference on parent one. Aggregation uniqueBlock(_root_) is functionally equivalent to unique(_root_), but is optimized for nested documents block structure. It’s recommended to define limit: -1 for uniqueBlock calculation, like in above example, since default value of limit parameter is 10, while uniqueBlock is supposed to be much faster with -1.

Semantic Knowledge Graphs

The relatedness(…) aggregation function allows for sets of documents to be scored relative to Foreground and Background sets of documents, for the purposes of finding ad-hoc relationships that make up a "Semantic Knowledge Graph":

At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes.

— Grainger et al.
The Semantic Knowledge Graph

The relatedness(…) function is used to "score" these relationships, relative to "Foreground" and "Background" sets of documents, specified in the function params as queries.

Unlike most aggregation functions, the relatedness(…) function is aware of whether and how it’s used in Nested Facets. It evaluates the query defining the current bucket independently from it’s parent/ancestor buckets, and intersects those documents with a "Foreground Set" defined by the foreground query combined with the ancestor buckets. The result is then compared to a similar intersection done against the "Background Set" (defined exclusively by background query) to see if there is a positive, or negative, correlation between the current bucket and the Foreground Set, relative to the Background Set.

Semantic Knowledge Graph Example

Sample Documents

curl -sS -X POST 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '[
{"id":"01",age:15,"state":"AZ","hobbies":["soccer","painting","cycling"]},
{"id":"02",age:22,"state":"AZ","hobbies":["swimming","darts","cycling"]},
{"id":"03",age:27,"state":"AZ","hobbies":["swimming","frisbee","painting"]},
{"id":"04",age:33,"state":"AZ","hobbies":["darts"]},
{"id":"05",age:42,"state":"AZ","hobbies":["swimming","golf","painting"]},
{"id":"06",age:54,"state":"AZ","hobbies":["swimming","golf"]},
{"id":"07",age:67,"state":"AZ","hobbies":["golf","painting"]},
{"id":"08",age:71,"state":"AZ","hobbies":["painting"]},
{"id":"09",age:14,"state":"CO","hobbies":["soccer","frisbee","skiing","swimming","skating"]},
{"id":"10",age:23,"state":"CO","hobbies":["skiing","darts","cycling","swimming"]},
{"id":"11",age:26,"state":"CO","hobbies":["skiing","golf"]},
{"id":"12",age:35,"state":"CO","hobbies":["golf","frisbee","painting","skiing"]},
{"id":"13",age:47,"state":"CO","hobbies":["skiing","darts","painting","skating"]},
{"id":"14",age:51,"state":"CO","hobbies":["skiing","golf"]},
{"id":"15",age:64,"state":"CO","hobbies":["skating","cycling"]},
{"id":"16",age:73,"state":"CO","hobbies":["painting"]},
]'

Example Query

curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:*
&back=*:*                                  (1)
&fore=age:[35 TO *]                        (2)
&json.facet={
  hobby : {
    type : terms,
    field : hobbies,
    limit : 5,
    sort : { r1: desc },                   (3)
    facet : {
      r1 : "relatedness($fore,$back)",     (4)
      location : {
        type : terms,
        field : state,
        limit : 2,
        sort : { r2: desc },               (3)
        facet : {
          r2 : "relatedness($fore,$back)"  (4)
        }
      }
    }
  }
}'

1	Use the entire collection as our "Background Set"
2	Use a query for "age >= 35" to define our (initial) "Foreground Set"
3	For both the top level `hobbies` facet & the sub-facet on `state` we will be sorting on the `relatedness(…)` values
4	In both calls to the `relatedness(…)` function, we use Parameter Variables to refer to the previously defined `fore` and `back` queries.

The Facet Response

"facets":{
  "count":16,
  "hobby":{
    "buckets":[{
        "val":"golf",
        "count":6,                                (1)
        "r1":{
          "relatedness":0.01225,
          "foreground_popularity":0.3125,         (2)
          "background_popularity":0.375},         (3)
        "location":{
          "buckets":[{
              "val":"az",
              "count":3,
              "r2":{
                "relatedness":0.00496,            (4)
                "foreground_popularity":0.1875,   (6)
                "background_popularity":0.5}},    (7)
            {
              "val":"co",
              "count":3,
              "r2":{
                "relatedness":-0.00496,           (5)
                "foreground_popularity":0.125,
                "background_popularity":0.5}}]}},
      {
        "val":"painting",
        "count":8,                                (1)
        "r1":{
          "relatedness":0.01097,
          "foreground_popularity":0.375,
          "background_popularity":0.5},
        "location":{
          "buckets":[{
            ...

1	Even though `hobbies:golf` has a lower total facet `count` then `hobbies:painting`, it has a higher `relatedness` score, indicating that relative to the Background Set (the entire collection) Golf has a stronger correlation to our Foreground Set (people age 35+) then Painting.
2	The number of documents matching `age:[35 TO ]` and* `hobbies:golf` is 31.25% of the total number of documents in the Background Set
3	37.5% of the documents in the Background Set match `hobbies:golf`
4	The state of Arizona (AZ) has a positive relatedness correlation with the nested Foreground Set (people ages 35+ who play Golf) compared to the Background Set — i.e., "People in Arizona are statistically more likely to be '35+ year old Golfers' then the country as a whole."
5	The state of Colorado (CO) has a negative correlation with the nested Foreground Set — i.e., "People in Colorado are statistically less likely to be '35+ year old Golfers' then the country as a whole."
6	The number documents matching `age:[35 TO ]` and* `hobbies:golf` and `state:AZ` is 18.75% of the total number of documents in the Background Set
7	50% of the documents in the Background Set match `state:AZ`

While it’s very common to define the Background Set as *:*, or some other super-set of the Foreground Query, it is not strictly required. The relatedness(…) function can be used to compare the statistical relatedness of sets of documents to orthogonal foreground/background queries.

References

This documentation was originally adapted largely from the following blog pages:

http://yonik.com/json-facet-api/

http://yonik.com/solr-facet-functions/

http://yonik.com/solr-subfacets/

http://yonik.com/percentiles-for-solr-faceting/

JSON Query DSL Faceting

Facet & Analytics Module

Faceted Search

Metrics Example

Bucketing Facet Example

Making a Facet Request

JSON Extensions

Terms Facet

Query Facet

Range Facet

Range Facet Parameters

Filtering Facets

Filter Exclusions

Aggregation Functions

Sorting Facets

Nested Facets

Nested Facet Example

Block Join Facets

Block Join Facet Example

Semantic Knowledge Graphs

Semantic Knowledge Graph Example

References