Analytics Component
The Analytics Component allows users to calculate complex statistical aggregations over result sets.
The component enables interacting with data in a variety of ways, through both a diverse set of analytics functions and powerful faceting functionality. The standard facets are supported within the analytics component with additions that leverage its analytical capabilities.
Analytics Configuration
The Analytics Component is in a contrib module, so it must be enabled in the solrconfig.xml of each collection where you would like to use it.
Since the Analytics framework is a search component, it must be declared as such and added to the search handler.
For distributed analytics requests over cloud collections, the component uses the AnalyticsHandler strictly for inter-shard communication.
The Analytics Handler should not be used by users to submit analytics requests.
To configure Solr to use the Analytics Component, the first step is to add a <lib/> directive so Solr loads the Analytics Component classes (for more about the <lib/> directive, see Lib Directives in SolrConfig). In the section of solrconfig.xml where the default <lib/> directives are, add a line:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analytics-\d.*\.jar" />
Next you need to enable the request handler and search component. Add the following lines to solrconfig.xml, near the definitions for other request handlers:
For these changes to take effect, restart Solr or reload the core or collection.
Request Syntax
An Analytics request is passed to Solr with the parameter analytics in a request sent to the Search Handler.
Since the analytics request is sent inside of a search handler request, it will compute results based on the result set determined by the search handler.
For example, this curl command encodes and POSTs a simple analytics request to the search handler:
curl --data-urlencode 'analytics={
   "expressions" : {
      "revenue" : "sum(mult(price,quantity))"
   }
}' \
   'http://localhost:8983/solr/sales/select?q=*:*&wt=json&rows=0'
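The same request can be prepared from a script. The following is a minimal sketch using only the Python standard library; the helper name build_analytics_request is hypothetical, and the URL is the one from the curl example above. It only builds the form-encoded POST and does not contact Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_analytics_request(select_url, analytics):
    """Return a POST request whose form body carries the analytics JSON."""
    # urlencode percent-encodes the JSON, like curl's --data-urlencode.
    body = urlencode({"analytics": json.dumps(analytics)}).encode("utf-8")
    return Request(
        select_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

analytics = {"expressions": {"revenue": "sum(mult(price,quantity))"}}
request = build_analytics_request(
    "http://localhost:8983/solr/sales/select?q=*:*&wt=json&rows=0", analytics
)
# Passing `request` to urllib.request.urlopen would send it to a running Solr instance.
```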
There are 3 main parts of any analytics request:
- Expressions: A list of calculations to perform over the entire result set. Expressions aggregate the search results into a single value to return. This list is entirely independent of the expressions defined in each of the groupings. Find out more about them in the section Expressions.
- Functions: One or more Variable Functions to be used throughout the rest of the request. These are essentially lambda functions and can be combined in a number of ways. These functions can be used in the expressions defined in expressions as well as groupings.
- Groupings: The list of Groupings to calculate in addition to the expressions. Groupings hold a set of facets and a list of expressions to compute over those facets. The expressions defined in a grouping are only calculated over the facets defined in that grouping.
Optional Parameters
Either the expressions or the groupings parameter must be present in the request, or else there will be no analytics to compute.
The functions parameter is always optional.
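The rule above can be checked on the client before a request is sent. This is a small sketch, not part of Solr; the function name validate_analytics_request is hypothetical, and it only checks the one rule stated here.

```python
def validate_analytics_request(request):
    """Raise ValueError if neither expressions nor groupings is present."""
    # An empty object is treated like a missing one: nothing to compute.
    if not request.get("expressions") and not request.get("groupings"):
        raise ValueError("request needs 'expressions' or 'groupings'")
    return request

# 'functions' may be omitted entirely.
validate_analytics_request({"expressions": {"revenue": "sum(mult(price,quantity))"}})
```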
Expressions
Expressions are the way to request pieces of information from the analytics component. These are the statistical expressions that you want computed and returned in your response.
Constructing an Expression
Expression Components
An expression is built using fields, constants, mapping functions and reduction functions. The ways that these can be defined are described below.
- Sources
  - Constants: The values defined in the expression. The supported constant types are described in the Analytics Expression Source Reference.
  - Fields: Solr fields that are read from the index. The supported fields are listed in the Analytics Expression Source Reference.
- Mapping Functions: Mapping functions map values for each Solr Document or Reduction. The provided mapping functions are detailed in the Analytics Mapping Function Reference.
  - Unreduced Mapping: Mapping a Field with another Field or Constant returns a value for every Solr Document. Unreduced mapping functions can take fields, constants as well as other unreduced mapping functions as input.
  - Reduced Mapping: Mapping a Reduction Function with another Reduction Function or Constant returns a single value.
- Reduction Functions: Functions that reduce the values of sources and/or unreduced mapping functions for every Solr Document to a single value. The provided reduction functions are detailed in the Analytics Reduction Function Reference.
Component Ordering
The expression components must be used in the following order to create valid expressions.
- Reduced Mapping Function
  - Constants
  - Reduction Function
    - Sources
    - Unreduced Mapping Function
      - Sources
      - Unreduced Mapping Function
  - Reduced Mapping Function
- Reduction Function
This ordering is based on the following rules:
- No reduction function can be an argument of another reduction function. Since all reduction is done together in one step, one reduction function cannot rely on the result of another.
- No fields can be left unreduced, since the analytics component cannot return a list of values for an expression (one for every document). Every expression must be reduced to a single value.
- Mapping functions are not required in an expression; however, as many nested mappings as needed can be used.
- Nested mapping functions must be the same type, so either both must be unreduced or both must be reduced. A reduced mapping function cannot take an unreduced mapping function as a parameter and vice versa.
Example Construction
With the above definitions and ordering, an example expression can be broken up into its components:
div(sum(a,fill_missing(b,0)),add(10.5,count(mult(a,c))))
As a whole, this is a reduced mapping function. The div function is a reduced mapping function since it is a provided mapping function and has reduced arguments.
If we break down the expression further:
- sum(a,fill_missing(b,0)): Reduction Function. sum is a provided reduction function.
  - a: Field
  - fill_missing(b,0): Unreduced Mapping Function. fill_missing is an unreduced mapping function since it is a provided mapping function and has a field argument.
    - b: Field
    - 0: Constant
- add(10.5,count(mult(a,c))): Reduced Mapping Function. add is a reduced mapping function since it is a provided mapping function and has a reduction function argument.
  - 10.5: Constant
  - count(mult(a,c)): Reduction Function. count is a provided reduction function.
    - mult(a,c): Unreduced Mapping Function. mult is an unreduced mapping function since it is a provided mapping function and has two field arguments.
      - a: Field
      - c: Field
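The ordering rules can be illustrated with a toy classifier. This is a sketch for illustration only, not Solr's parser: the REDUCTIONS set is an assumed subset of the provided reduction functions, every other function name is treated as a mapping function, and string constants containing commas or parentheses are not handled.

```python
REDUCTIONS = {"sum", "count", "max", "min", "median", "mean"}  # assumed subset

def split_args(text):
    """Split an argument string at top-level commas only."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts

def classify(expr):
    """Return constant, field, unreduced mapping, reduction or reduced mapping."""
    expr = expr.strip()
    if "(" not in expr:
        if expr.startswith(("'", '"')):
            return "constant"
        try:
            float(expr)
            return "constant"
        except ValueError:
            return "field"
    name, inner = expr[:-1].split("(", 1)  # drop the closing parenthesis
    kinds = {classify(arg) for arg in split_args(inner)}
    reduced = kinds & {"reduction", "reduced mapping"}
    unreduced = kinds & {"field", "unreduced mapping"}
    if name in REDUCTIONS:
        if reduced:
            raise ValueError(f"{name}: a reduction cannot take reduced input")
        return "reduction"
    if reduced and unreduced:
        raise ValueError(f"{name}: cannot mix reduced and unreduced arguments")
    return "reduced mapping" if reduced else "unreduced mapping"

kind = classify("div(sum(a,fill_missing(b,0)),add(10.5,count(mult(a,c))))")
```

In this sketch an expression is valid at the top level only when it classifies as a reduction or a reduced mapping, which mirrors the rule that no field can be left unreduced.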
Expression Cardinality (Multi-Valued and Single-Valued)
The root of every multi-valued expression is a multi-valued field. Single-valued expressions can start from constants or single-valued fields. All single-valued expressions can be treated as multi-valued expressions that contain one value.
Single-valued and multi-valued expressions can be used together in many mapping functions. Multi-valued expressions can also be used alone, and several single-valued expressions can be used together. For example:
- add(<single-valued double>, <single-valued double>, …): Returns a single-valued double expression where the values of the expressions are added together.
- add(<single-valued double>, <multi-valued double>): Returns a multi-valued double expression where each value of the second expression is added to the single value of the first expression.
- add(<multi-valued double>, <single-valued double>): Acts the same as the above function.
- add(<multi-valued double>): Returns a single-valued double expression which is the sum of the multiple values of the parameter expression.
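The four cases can be modelled for a single document with plain Python values. This is an illustrative sketch, not Solr code: a float stands for a single-valued expression, a list of floats stands for a multi-valued one, and any other combination is rejected here because the text does not describe it.

```python
def add(*args):
    """Model add() over single values (float) and multi-valued expressions (list)."""
    multi = [a for a in args if isinstance(a, list)]
    if not multi:
        return float(sum(args))            # all single-valued: one sum
    if len(args) == 1:
        return float(sum(multi[0]))        # add(<multi>): sum of its values
    if len(args) == 2 and len(multi) == 1:
        single = next(a for a in args if not isinstance(a, list))
        return [single + v for v in multi[0]]  # add the single value to each value
    raise ValueError("combination not covered by this sketch")

total = add(1.0, 2.0, 3.5)
shifted = add(10.0, [1.0, 2.0])
```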
Types and Implicit Casting
The new analytics component currently supports the types listed in the table below. These types have one-way implicit casting enabled for the following relationships:
Type | Implicitly Casts To |
---|---|
Boolean | String |
Date | Long, String |
Integer | Long, Float, Double, String |
Long | Double, String |
Float | Double, String |
Double | String |
String | none |
An implicit cast means that if a function requires a certain type of value as a parameter, arguments will be automatically converted to that type if it is possible.
For example, concat() only accepts string parameters and since all types can be implicitly cast to strings, any type is accepted as an argument.
This also goes for dynamically typed functions. fill_missing() requires two arguments of the same type. However, two types that implicitly cast to the same type can also be used.
For example, fill_missing(<long>,<float>) will be cast to fill_missing(<double>,<double>) since long cannot be cast to float and float cannot be cast to long implicitly.
There is an ordering to implicit casts, where the more specialized type is ordered ahead of the more general type. Therefore even though both long and float can be implicitly cast to double and string, they will be cast to double. This is because double is a more specialized type than string, which every type can be cast to.
The ordering is the same as their order in the above table.
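The casting table and its ordering can be expressed as a small lookup. The sketch below is a reading of the rules described above, not Solr's implementation; the function name common_type is hypothetical. It returns the first type, in table order, that both arguments already are or can be implicitly cast to.

```python
# Types in the order of the table above (more specialized first).
ORDER = ["boolean", "date", "integer", "long", "float", "double", "string"]

# One-way implicit casts, copied from the table.
CASTS = {
    "boolean": ["string"],
    "date": ["long", "string"],
    "integer": ["long", "float", "double", "string"],
    "long": ["double", "string"],
    "float": ["double", "string"],
    "double": ["string"],
    "string": [],
}

def common_type(left, right):
    """Return the most specialized type both arguments can be used as."""
    reachable_left = {left, *CASTS[left]}
    reachable_right = {right, *CASTS[right]}
    for candidate in ORDER:
        if candidate in reachable_left and candidate in reachable_right:
            return candidate
    raise TypeError(f"no common type for {left} and {right}")

result = common_type("long", "float")
```

With these tables, long and float resolve to double rather than string because double comes first in the ordering.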
Cardinality can also be implicitly cast. Single-valued expressions can always be implicitly cast to multi-valued expressions, since all single-valued expressions are multi-valued expressions with one value.
Implicit casting will only occur when an expression will not "compile" without it.
If an expression follows all typing rules initially, no implicit casting will occur.
Certain functions such as string(), date(), round(), floor(), and ceil() act as explicit casts, declaring the type that is desired.
However round(), floor() and ceil() can return either int or long, depending on the argument type.
Variable Functions
Variable functions are a way to shorten your expressions and make writing analytics queries easier. They are essentially lambda functions defined in a request.
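A request that defines and reuses a variable function might look like the following sketch, written as a Python dictionary so it can be serialized into the analytics parameter. The expression names max_sale and med_sale are illustrative assumptions; sale(), price and quantity follow the surrounding text.

```python
import json

analytics_request = {
    "functions": {
        # Defined once, reused in every expression below.
        "sale()": "mult(price,quantity)",
    },
    "expressions": {
        "max_sale": "max(sale())",
        "med_sale": "median(sale())",
    },
}

payload = json.dumps(analytics_request, indent=2)
```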
In the above request, instead of writing mult(price,quantity) twice, a function sale() was defined to abstract this idea. Then that function was used in multiple expressions.
Suppose that we want to look at the sales of specific categories:
{
    "functions" : {
        "clothing_sale()" : "filter(mult(price,quantity),equal(category,'Clothing'))",
        "kitchen_sale()" : "filter(mult(price,quantity),equal(category,\"Kitchen\"))"
    },
    "expressions" : {
        "max_clothing_sale" : "max(clothing_sale())",
        "med_clothing_sale" : "median(clothing_sale())",
        "max_kitchen_sale" : "max(kitchen_sale())",
        "med_kitchen_sale" : "median(kitchen_sale())"
    }
}
Arguments
Instead of making a function for each category, it would be much easier to use category as an input to the sale() function.
An example of this functionality is shown below:
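A sketch of such a request, built in Python: the parameter name cat and the generated expression names are assumptions chosen to match the category example above.

```python
import json

request = {
    "functions": {
        # cat is used inside the body as if it were a field or constant.
        "sale(cat)": "filter(mult(price,quantity),equal(category,cat))",
    },
    "expressions": {},
}

for category in ["Clothing", "Kitchen"]:
    key = category.lower()
    request["expressions"][f"max_{key}_sale"] = f'max(sale("{category}"))'
    request["expressions"][f"med_{key}_sale"] = f'median(sale("{category}"))'

payload = json.dumps(request, indent=2)
```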
Variable Functions can take any number of arguments and use them in the function expression as if they were a field or constant.
Variable Length Arguments
There are analytics functions that take a variable number of parameters. Therefore there are use cases where variable functions would need to take a variable number of parameters.
For example, the price of a product may be made up of a varying, not yet determined, number of components.
Functions can take a variable number of parameters if the last parameter is followed by .. (two dots).
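A request using a variable length parameter might look like the sketch below. The function body, including the use of sub() to subtract the summed costs from the sale value, is an assumption for illustration; the field names material, tariff, tax and construction follow the expansion described in the text.

```python
import json

request = {
    "functions": {
        # costs.. collects every argument after cat.
        "sale(cat, costs..)":
            "filter(sub(mult(price,quantity),add(costs)),equal(category,cat))",
    },
    "expressions": {
        "max_clothing_sale": "max(sale('Clothing', material, tariff, tax))",
        "med_clothing_sale": "median(sale('Clothing', material, tariff, tax))",
        "max_kitchen_sale": "max(sale('Kitchen', material, construction))",
        "med_kitchen_sale": "median(sale('Kitchen', material, construction))",
    },
}

payload = json.dumps(request, indent=2)
```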
In the above example a variable length argument is used to encapsulate all of the costs to use for a product.
There is no definite number of arguments requested for the variable length parameter, therefore the clothing expressions can use 3 and the kitchen expressions can use 2.
When the sale() function is called, costs is expanded to the arguments given.
Therefore in the above request, inside of the sale function, add(costs) is expanded to both of the following:
- add(material, tariff, tax)
- add(material, construction)
For-Each Functions
Advanced Functionality
The following function details are for advanced requests.
Although the above functionality allows for an undefined number of arguments to be passed to a function, it does not allow for interacting with those arguments.
Many times we might want to wrap each argument in additional functions.
For example, maybe we want to be able to look at multiple categories at the same time.
So we want to see if category EQUALS x OR category EQUALS y and so on.
In order to do this we need to use for-each lambda functions, which transform each value of the variable length parameter.
The for-each is started with the : character after the variable length parameter.
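A request using a for-each function might look like the following sketch. The function definition is reconstructed from the expansion described in this section; the expression names are illustrative assumptions.

```python
import json

request = {
    "functions": {
        # cats: applies equal(category,_) to every argument; _ is the current one.
        "sale(cats..)": "filter(mult(price,quantity),or(cats:equal(category,_)))",
    },
    "expressions": {
        "max_sale": 'max(sale("Clothing","Kitchen"))',
        "med_sale": 'median(sale("Clothing","Kitchen"))',
    },
}

payload = json.dumps(request, indent=2)
```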
In this example, cats: is the syntax that starts a for-each lambda function over every parameter cats, and the _ character is used to refer to the value of cats in each iteration in the for-each.
When sale("Clothing", "Kitchen") is called, the lambda function equal(category,_) is applied to both Clothing and Kitchen inside of the or() function.
Using all of these rules, the expression:
`sale("Clothing","Kitchen")`
is expanded to:
`filter(mult(price,quantity),or(equal(category,"Clothing"),equal(category,"Kitchen")))`
by the expression parser.
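The substitution rules for arguments, variable length parameters and for-each functions can be imitated with a toy expander. This is a sketch for illustration, not the Solr expression parser: all function names here are hypothetical, for-each templates must be a single non-nested call, string constants may not contain commas or parentheses, and arguments are expanded in call order.

```python
import re

def split_args(text):
    """Split an argument string at top-level commas only."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    last = text[start:].strip()
    return parts + [last] if last else parts

def parse_functions(definitions):
    """Turn {"sale(cat, costs..)": body} into {"sale": (["cat", "costs.."], body)}."""
    functions = {}
    for signature, body in definitions.items():
        name, inner = signature.strip()[:-1].split("(", 1)
        functions[name.strip()] = (split_args(inner), body)
    return functions

def substitute(body, bindings, var_name, values):
    """Replace parameter names in a function body with the call's arguments."""
    if var_name:
        def for_each(match):
            template = match.group(1)
            return ",".join(re.sub(r"\b_\b", lambda m: v, template) for v in values)
        body = re.sub(rf"\b{re.escape(var_name)}:(\w+\([^()]*\))", for_each, body)
    def replace_name(match):
        word = match.group(0)
        if word == var_name:
            return ",".join(values)  # plain use: expand to all arguments
        return bindings.get(word, word)
    return re.sub(r"\b\w+\b", replace_name, body)

def expand(expr, functions):
    """Expand every call to a variable function found in expr."""
    expr = expr.strip()
    if "(" not in expr or not expr.endswith(")"):
        return expr  # field or constant
    name, inner = expr[:-1].split("(", 1)
    args = [expand(arg, functions) for arg in split_args(inner)]
    if name not in functions:
        return f"{name}({','.join(args)})"
    params, body = functions[name]
    var_name, values = None, []
    if params and params[-1].endswith(".."):
        var_name, values = params[-1][:-2], args[len(params) - 1:]
        params = params[:-1]
    bindings = dict(zip(params, args))
    return expand(substitute(body, bindings, var_name, values), functions)

functions = parse_functions(
    {"sale(cats..)": "filter(mult(price,quantity),or(cats:equal(category,_)))"}
)
expanded = expand('sale("Clothing","Kitchen")', functions)
```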
Groupings And Facets
Facets, much like in other parts of Solr, allow analytics results to be broken up and grouped by attributes of the data that the expressions are being calculated over.
The currently available facets for use in the analytics component are Value Facets, Pivot Facets, Range Facets and Query Facets. Each facet is required to have a unique name within the grouping it is defined in, and no facet can be defined outside of a grouping.
Groupings allow users to calculate the same set of expressions over a set of facets.
Groupings must have both expressions and facets given.
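The shape of a grouping can be sketched and checked as follows. The grouping name sales_numbers, the facet name category and the facet body, including its "type" key, are assumptions for illustration; the check itself only enforces the two rules stated above.

```python
def validate_groupings(request):
    """Check that every grouping provides both expressions and facets."""
    for name, grouping in request.get("groupings", {}).items():
        for required in ("expressions", "facets"):
            if not grouping.get(required):
                raise ValueError(f"grouping '{name}' is missing '{required}'")
    return request

request = {
    "functions": {"sale()": "mult(price,quantity)"},
    "groupings": {
        "sales_numbers": {
            "expressions": {
                "max_sale": "max(sale())",
                "med_sale": "median(sale())",
            },
            "facets": {
                # Facet names must be unique within their grouping.
                "category": {"type": "value", "expression": "category"},
            },
        },
    },
}

validate_groupings(request)
```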
Facet Sorting
Some Analytics facets allow for complex sorting of their results. The two current sortable facets are Analytic Value Facets and Analytic Pivot Facets.
Parameters
- criteria: The list of criteria to sort the facet by. It takes the following parameters:
  - type: The type of sort. There are two possible values:
    - expression: Sort by the value of an expression defined in the same grouping.
    - facetvalue: Sort by the string-representation of the facet value.
  - direction: (Optional) The direction to sort.
    - ascending (Default)
    - descending
  - expression: When type = expression, the name of an expression defined in the same grouping.
- limit: (Optional) Limit the number of returned facet values to the top N.
- offset: (Optional) When a limit is set, skip the top N facet values.
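The intent of these parameters can be modelled locally. The sketch below is an assumption about the semantics, not Solr's implementation: earlier criteria take precedence over later ones, and offset is applied before limit and only when a limit is set. The bucket layout (a value plus a results dictionary) is hypothetical.

```python
def sort_facet(buckets, criteria, limit=None, offset=0):
    """Order facet buckets by the criteria, then apply offset and limit."""
    ordered = list(buckets)
    # Sort by the last criterion first; the stable sort keeps earlier criteria dominant.
    for criterion in reversed(criteria):
        if criterion["type"] == "expression":
            key = lambda b, name=criterion["expression"]: b["results"][name]
        else:  # "facetvalue": string representation of the facet value
            key = lambda b: str(b["value"])
        descending = criterion.get("direction", "ascending") == "descending"
        ordered.sort(key=key, reverse=descending)
    if limit is None:
        return ordered
    return ordered[offset:offset + limit]

buckets = [
    {"value": "Kitchen", "results": {"max_sale": 50}},
    {"value": "Clothing", "results": {"max_sale": 50}},
    {"value": "Toys", "results": {"max_sale": 20}},
]
criteria = [
    {"type": "expression", "expression": "max_sale", "direction": "descending"},
    {"type": "facetvalue"},
]
top = sort_facet(buckets, criteria)
```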
Value Facets
Value Facets are used to group documents by the value of a mapping expression applied to each document. Mapping expressions are expressions that do not include a reduction function.
For more information, refer to the Expressions section.
- mult(quantity, add(price, tax)): break up documents by the revenue generated
- fill_missing(state, "N/A"): break up documents by state, where N/A is used when the document doesn’t contain a state
Value Facets can be sorted.
Parameters
- expression: The expression to choose a facet bucket for each document.
- sort: A sort for the results of the facet.
Optional Parameters
The sort parameter is optional.
Field Facets
This is a replacement for Field Facets in the original Analytics Component.
Field Facet functionality is maintained in Value Facets by using the name of a field as the expression.
Analytic Pivot Facets
Pivot Facets are used to group documents by the value of multiple mapping expressions applied to each document.
Pivot Facets work much like layers of Analytic Value Facets. A list of pivots is required, and the order of the list directly impacts the results returned. The first pivot given will be treated like a normal value facet. The second pivot given will be treated like one value facet for each value of the first pivot. Each of these second-level value facets will be limited to the documents in their first-level facet bucket. This continues for however many pivots are provided.
Sorting is enabled on a per-pivot basis. This means that if your top pivot has a sort with limit:1, then only that first value of the facet will be drilled down into. Sorting in each pivot is independent of the other pivots.
Parameters
- pivots: The list of pivots to calculate a drill-down facet for. The list is ordered by top-most to bottom-most level.
  - name: The name of the pivot.
  - expression: The expression to choose a facet bucket for each document.
  - sort: A sort for the results of the pivot.
Optional Parameters
The sort parameter within the pivot object is optional, and can be given in any, none or all of the provided pivots.
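The layered behaviour of pivots can be illustrated with a local grouping function. This is a sketch of the drill-down idea only, not Solr code: each pivot is reduced to a plain field name (the simplest possible expression), and the sample documents and the name pivot_buckets are made up for the example.

```python
from collections import defaultdict

def pivot_buckets(docs, pivots):
    """Group docs by the first pivot, then drill into each bucket with the rest."""
    if not pivots:
        return list(docs)
    field, rest = pivots[0], pivots[1:]
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    # Each lower-level facet only sees the documents of its parent bucket.
    return {value: pivot_buckets(group, rest) for value, group in buckets.items()}

docs = [
    {"country": "US", "state": "CA"},
    {"country": "US", "state": "NY"},
    {"country": "CA", "state": "ON"},
    {"country": "US", "state": "CA"},
]
by_country = pivot_buckets(docs, ["country", "state"])
by_state = pivot_buckets(docs, ["state", "country"])
```

Swapping the order of the list changes the shape of the result, which is why the order of pivots matters.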
Analytics Range Facets
Range Facets are used to group documents by the value of a field into a given set of ranges. The inputs for analytics range facets are identical to those used for Solr range facets. Refer to the Range Facet documentation for additional questions regarding use.
Parameters
- field: Field to be faceted over
- start: The bottom end of the range
- end: The top end of the range
- gap: A list of range gaps to generate facet buckets. If the buckets do not add up to fit the start to end range, then the last gap value will be repeated as many times as needed to fill any unused range.
- hardend: Whether to cut off the last facet bucket range at the end value if it spills over. Defaults to false.
- include: The boundaries to include in the facet buckets. Defaults to lower.
  - lower: All gap-based ranges include their lower bound.
  - upper: All gap-based ranges include their upper bound.
  - edge: The first and last gap ranges include their edge bounds (lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified.
  - outer: The before and after ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
  - all: Includes all options: lower, upper, edge, and outer
- others: Additional ranges to include in the facet. Defaults to none.
  - before: All records with field values lower than the lower bound of the first range.
  - after: All records with field values greater than the upper bound of the last range.
  - between: All records with field values between the lower bound of the first range and the upper bound of the last range.
  - none: Include facet buckets for none of the above.
  - all: Include facet buckets for before, after and between.
Optional Parameters
The hardend, include and others parameters are all optional.
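The interaction of start, end, gap and hardend can be sketched numerically. This is an illustration of the rules described above, not Solr's implementation; the function name range_buckets is hypothetical, and it assumes bucket generation stops once the end value is reached.

```python
def range_buckets(start, end, gaps, hardend=False):
    """Return (lower, upper) bounds of the gap-based buckets from start to end."""
    if not gaps or any(gap <= 0 for gap in gaps):
        raise ValueError("gaps must be a non-empty list of positive numbers")
    buckets, lower, index = [], start, 0
    while lower < end:
        # Once the list is used up, keep repeating the last gap.
        gap = gaps[min(index, len(gaps) - 1)]
        upper = lower + gap
        if hardend and upper > end:
            upper = end  # cut the last bucket off at the end value
        buckets.append((lower, upper))
        lower, index = upper, index + 1
    return buckets

buckets = range_buckets(0, 100, [5, 10, 10, 25])
```

With an end of 90 instead of 100, the last bucket spills over to 100 unless hardend is set, in which case it is cut off at 90.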
Query Facets
Query Facets are used to group documents by a given set of queries.
Parameters
queries
- The list of queries to facet by.