Statistical Programming | Apache Solr Reference Guide 7.3

The Streaming Expression language includes a powerful statistical programing syntax with many of the features of a functional programming language.

The syntax includes variables, data structures and a growing set of mathematical functions.

Using the statistical programing syntax, Solr’s powerful data retrieval capabilities can be combined with in-depth statistical analysis.

The data retrieval methods include:

SQL
time series aggregation
random sampling
faceted aggregation
K-Nearest Neighbor (KNN) searches
topic message queues
MapReduce (parallel relational algebra)
JDBC calls to outside databases
Graph Expressions

Once the data is retrieved, the statistical programming syntax can be used to create arrays from the data so it can be manipulated, transformed and analyzed.

The statistical function library includes functions that perform:

Correlation
Cross-correlation
Covariance
Moving averages
Percentiles
Simple regression and prediction
Analysis of covariance (ANOVA)
Histograms
Convolution
Euclidean distance
Descriptive statistics
Rank transformation
Normalization transformation
Sequences
Array manipulation functions (creation, copying, length, scaling, reverse, etc.)

The statistical function library is backed by Apache Commons Math library. A full discussion of many of the math functions available to streaming expressions is available in the section Stream Evaluator Reference.

This document provides an overview of the how to apply the variables, data structures and mathematical functions.

Like all streaming expressions, the statistical functions are run by Solr’s /stream handler. For an overview of this handler, see the section Streaming Expressions.

Math Functions

Streaming expressions contain a suite of mathematical functions which can be called on their own or as part of a larger expression.

Solr’s /stream handler evaluates the mathematical expression and returns a result.

For example, if you send the following expression to the /stream handler:

add(1, 1)

You get the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": 2
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 2
      }
    ]
  }
}

You can nest math functions within each other. For example:

pow(10, add(1,1))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": 100
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

You can also perform math on a stream of Tuples. For example:

select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3"),
       price_f,
       mult(price_f, 10) as newPrice)

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "price_f": 0.99999994,
        "newPrice": 9.9999994
      },
      {
        "price_f": 0.99999994,
        "newPrice": 9.9999994
      },
      {
        "price_f": 0.9999992,
        "newPrice": 9.999992
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 3
      }
    ]
  }
}

Data Structures

Several types of data can be manipulated with the statistical programming syntax. The following sections explore arrays, tuples, and lists.

Arrays

The first data structure we’ll explore is the array.

We can create an array with the array function:

For example:

array(1, 2, 3)

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": [
          1,
          2,
          3
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

We can nest arrays within a matrix function to return matrix:

matrix(array(1, 2, 3),
       array(4, 5, 6))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": [
          [
            1,
            2,
            3
          ],
          [
            4,
            5,
            6
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

We can manipulate arrays with functions. For example, we can reverse an array with the rev function:

rev(array(1, 2, 3))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": [
          3,
          2,
          1
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

Arrays can also be built and returned by functions. For example, the sequence function:

sequence(5,0,1)

This returns an array of size 5 starting from 0 with a stride of 1.

{
  "result-set": {
    "docs": [
      {
        "return-value": [
          0,
          1,
          2,
          3,
          4
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 4
      }
    ]
  }
}

We can perform math on an array. For example, we can scale an array with the scale function:

scale(10, sequence(5,0,1))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": [
          0,
          10,
          20,
          30,
          40
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

We can perform statistical analysis on arrays For example, we can correlate two sequences with the corr function:

corr(sequence(5,1,1), sequence(5,10,10))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "return-value": 1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 1
      }
    ]
  }
}

Tuples

The tuple is the next data structure we’ll explore.

The tuple function returns a map of name/value pairs. A tuple is a very flexible data structure that can hold values that are strings, numerics, arrays and lists of tuples.

A tuple can be used to return a complex result from a statistical expression.

Here is an example:

tuple(title="hello world",
      array1=array(1,2,3,4),
      array2=array(4,5,6,7))

Returns the following response:

{
  "result-set": {
    "docs": [
      {
        "title": "hello world",
        "array1": [
          1,
          2,
          3,
          4
        ],
        "array2": [
          4,
          5,
          6,
          7
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

Lists

Next we have the list data structure.

The list function is a data structure that wraps streaming expressions and emits all the tuples from the wrapped expressions as a single concatenated stream.

Below is an example of a list of tuples:

list(tuple(id=1, data=array(1, 2, 3)),
     tuple(id=2, data=array(10, 12, 14)))

Returns the following response:


{
  "result-set": {
    "docs": [
      {
        "id": "1",
        "data": [
          1,
          2,
          3
        ]
      },
      {
        "id": "2",
        "data": [
          10,
          12,
          14
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

Setting Variables with let

The let function sets variables and returns the last variable. The output of any statistical function can be set to a variable.

Below is a simple example setting three variables a, b and correlation.

let(a=array(1,2,3),
    b=array(10, 20, 30),
    correlation=corr(a, b))

Here is the output:

{
  "result-set": {
    "docs": [
      {
        "correlation": 1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

All variables can be output by setting the echo variable to true.

let(echo=true,
    a=array(1,2,3),
    b=array(10, 20, 30),
    correlation=corr(a, b))

Here is the output:

{
  "result-set": {
    "docs": [
      {
        "a": [
          1,
          2,
          3
        ],
        "b": [
          10,
          20,
          30
        ],
        "correlation": 1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

Streaming expressions can also be used inside of a let expression in the following ways:

A variable can be set to the output of any streaming expression.
A streaming expression can be executed after all variables have been set. The variables can then be referenced by the streaming expression that is executed. The let expression will stream the tuples that are emitted by the final streaming expression.

Here is a very simple example:

let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
    b=random(collection2, q="*:*", rows="3", fl="price_f"),
    tuple(sample1=a, sample2=b))

The let expression above is setting variables a and b to random samples taken from collection2.

The let function then executes the tuple streaming expression which references the two variables.

Here is the output:

{
  "result-set": {
    "docs": [
      {
        "sample1": [
          {
            "price_f": 0.39729273
          },
          {
            "price_f": 0.063344836
          },
          {
            "price_f": 0.42020327
          }
        ],
        "sample2": [
          {
            "price_f": 0.659244
          },
          {
            "price_f": 0.58797807
          },
          {
            "price_f": 0.57520163
          }
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 20
      }
    ]
  }
}

Creating Arrays with col Function

The col function is used to move a column of numbers from a list of tuples into an array.

This is an important function because streaming expressions such as sql, random and timeseries return tuples, but the statistical functions operate on arrays.

Below is an example of the col function:

let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
    b=random(collection2, q="*:*", rows="3", fl="price_f"),
    c=col(a, price_f),
    d=col(b, price_f),
    tuple(sample1=c, sample2=d))

The example above is using the col function to create arrays from the tuples stored in variables a and b.

Variable c contains an array of values from the price_f field, taken from the tuples stored in variable a.

Variable d contains an array of values from the price_f field, taken from the tuples stored in variable b.

Also notice inn that the response tuple executed by let is pointing to the arrays in variables c and d.

The response shows the arrays:


{
  "result-set": {
    "docs": [
      {
        "sample1": [
          0.06490427,
          0.6751543,
          0.07063508
        ],
        "sample2": [
          0.8884564,
          0.8878821,
          0.3504665
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 17
      }
    ]
  }
}

Statistical Programming Example

We’ve covered how the data structures, variables and a few statistical functions work. Let’s dive into an example that puts these tools to use.

Use Case

We have an existing hotel in cityA that is very profitable. We are contemplating opening up a new hotel in a different city. We’re considering 4 different cities: cityB, cityC, cityD, cityE. We’d like to open a hotel in a city that has similar room rates to cityA.

How do we determine which of the 4 cities we’re considering has room rates which are most similar to cityA?

The Data

We have a data set of un-aggregated hotel bookings. Each booking record has a rate and city.

Can We Simply Aggregate?

One approach would be to aggregate the data from each city and compare the mean room rates. This approach will give us some useful information, but the mean is a summary statistic which loses a significant amount of information about the data. For example, we don’t have an understanding of how the distribution of room rates is impacting the mean.

The median room rate provides another interesting data point but it’s still not the entire picture. It’s sill just one point of reference.

Is there a way that we can compare the markets without losing valuable information in the data?

K-Nearest Neighbor

The use case we’re reasoning about can often be approached using a K-Nearest Neighbor (knn) algorithm.

With knn we use a distance measure to compare vectors of data to find the k nearest neighbors to a specific vector.

Euclidean Distance

The streaming expression statistical function library has a function called distance. The distance function computes the Euclidean distance between two vectors. This looks promising for comparing vectors of room rates.

Vectors

But how to create the vectors from a our data set? Remember we have un-aggregated room rates from each of the cities. How can we vectorize the data so it can be compared using the distance function.

We have a streaming expression that can retrieve a random sample from each of the cities. The name of this expression is random. So we could take a random sample of 1000 room rates from each of the five cities.

But random vectors of room rates are not comparable because the distance algorithm compares values at each index in the vector. How can make these vectors comparable?

We can make them comparable by sorting them. Then as the distance algorithm moves along the vectors it will be comparing room rates from lowest to highest in both cities.

The Code

let(cityA=sort(random(bookings, q="city:cityA", rows="1000", fl="rate_d"), by="rate_d asc"),
    cityB=sort(random(bookings, q="city:cityB", rows="1000", fl="rate_d"), by="rate_d asc"),
    cityC=sort(random(bookings, q="city:cityC", rows="1000", fl="rate_d"), by="rate_d asc"),
    cityD=sort(random(bookings, q="city:cityD", rows="1000", fl="rate_d"), by="rate_d asc"),
    cityE=sort(random(bookings, q="city:cityE", rows="1000", fl="rate_d"), by="rate_d asc"),
    ratesA=col(cityA, rate_d),
    ratesB=col(cityB, rate_d),
    ratesC=col(cityC, rate_d),
    ratesD=col(cityD, rate_d),
    ratesE=col(cityE, rate_d),
    top(n=1,
        sort="distance asc",
        list(tuple(city=B, distance=distance(ratesA, ratesB)),
             tuple(city=C, distance=distance(ratesA, ratesC)),
             tuple(city=D, distance=distance(ratesA, ratesD)),
             tuple(city=E, distance=distance(ratesA, ratesE)))))

The Code Explained

The let expression sets variables first.

The first 5 variables (cityA, cityB, cityC, cityD, cityE), contain the random samples from the bookings collection. The random function is pulling 1000 random samples from each city and including the rate_d field in the tuples that are returned.

The random function is wrapped by a sort function which is sorting the tuples in ascending order based on the rate_d field.

The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each city. The col function is used to move the rate_d field from the random sample tuples into an array for each city.

Now we have five sorted vectors of room rates that we can compare with our distance function.

After the variables are set the let expression runs the top expression.

The top expression is wrapping a list of tuples. Inside each tuple the distance function is used to compare the rateA vector with one of the other cities. The output of the distance function is stored in the distance field in the tuple.

The list function emits each tuple and the top function returns only the tuple with the lowest distance.

Stream Evaluator Reference Graph Traversal