Statistics | Apache Solr Reference Guide 7.5

This section of the user guide covers the core statistical functions available in math expressions.

Descriptive Statistics

The describe function can be used to return descriptive statistics about a numeric array. The describe function returns a single tuple with name/value pairs containing descriptive statistics.

Below is a simple example that selects a random sample of documents, vectorizes the price_f field in the result set and uses the describe function to return descriptive statistics about the vector:

let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    b=col(a, price_f),
    c=describe(b))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "c": {
          "sumsq": 4999.041975263254,
          "max": 0.99995726,
          "var": 0.08344429493940454,
          "geometricMean": 0.36696588922559575,
          "sum": 7497.460565552007,
          "kurtosis": -1.2000739963006035,
          "N": 15000,
          "min": 0.00012338161,
          "mean": 0.49983070437013266,
          "popVar": 0.08343873198640858,
          "skewness": -0.001735537500095477,
          "stdev": 0.28886726179926403
        }
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 305
      }
    ]
  }
}

Histograms and Frequency Tables

Histograms and frequency tables are are tools for understanding the distribution of a random variable.

The hist function creates a histogram designed for usage with continuous data. The freqTable function creates a frequency table for use with discrete data.

histograms

Below is an example that selects a random sample, creates a vector from the result set and uses the hist function to return a histogram with 5 bins. The hist function returns a list of tuples with summary statistics for each bin.

let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    b=col(a, price_f),
    c=hist(b, 5))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "c": [
          {
            "prob": 0.2057939717603699,
            "min": 0.000010371208,
            "max": 0.19996578,
            "mean": 0.10010319358402578,
            "var": 0.003366805016271609,
            "cumProb": 0.10293732468049072,
            "sum": 309.0185585938884,
            "stdev": 0.058024176136086666,
            "N": 3087
          },
          {
            "prob": 0.19381868629885585,
            "min": 0.20007741,
            "max": 0.3999073,
            "mean": 0.2993590803885827,
            "var": 0.003401644034068929,
            "cumProb": 0.3025295802728267,
            "sum": 870.5362057700005,
            "stdev": 0.0583236147205309,
            "N": 2908
          },
          {
            "prob": 0.20565789836690007,
            "min": 0.39995712,
            "max": 0.5999038,
            "mean": 0.4993620963792545,
            "var": 0.0033158364923609046,
            "cumProb": 0.5023006239697967,
            "sum": 1540.5320673300018,
            "stdev": 0.05758330046429177,
            "N": 3085
          },
          {
            "prob": 0.19437108496008693,
            "min": 0.6000449,
            "max": 0.79973197,
            "mean": 0.7001752711861512,
            "var": 0.0033895105082360185,
            "cumProb": 0.7026537198687285,
            "sum": 2042.4112660500066,
            "stdev": 0.058219502816805456,
            "N": 2917
          },
          {
            "prob": 0.20019582213899467,
            "min": 0.7999126,
            "max": 0.99987316,
            "mean": 0.8985428275824184,
            "var": 0.003312360017780078,
            "cumProb": 0.899450457219298,
            "sum": 2698.3241112299997,
            "stdev": 0.05755310606544253,
            "N": 3003
          }
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 322
      }
    ]
  }
}

The col function can be used to vectorize a column of data from the list of tuples returned by the hist function.

In the example below, the N field, which is the number of observations in the each bin, is returned as a vector.

let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
     b=col(a, price_f),
     c=hist(b, 11),
     d=col(c, N))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "d": [
          1387,
          1396,
          1391,
          1357,
          1384,
          1360,
          1367,
          1375,
          1307,
          1310,
          1366
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 307
      }
    ]
  }
}

Frequency Tables

The freqTable function returns a frequency distribution for a discrete data set. The freqTable function doesn’t create bins like the histogram. Instead it counts the occurrence of each discrete data value and returns a list of tuples with the frequency statistics for each value. Fields from a frequency table can be vectorized using using the col function in the same manner as a histogram.

Below is a simple example of a frequency table built from a random sample of a discrete variable.

let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
     b=col(a, day_i),
     c=freqTable(b))

When this expression is sent to the /stream handler it responds with:

  "result-set": {
    "docs": [
      {
        "c": [
          {
            "pct": 0.0318,
            "count": 477,
            "cumFreq": 477,
            "cumPct": 0.0318,
            "value": 0
          },
          {
            "pct": 0.033133333333333334,
            "count": 497,
            "cumFreq": 974,
            "cumPct": 0.06493333333333333,
            "value": 1
          },
          {
            "pct": 0.03426666666666667,
            "count": 514,
            "cumFreq": 1488,
            "cumPct": 0.0992,
            "value": 2
          },
          {
            "pct": 0.0346,
            "count": 519,
            "cumFreq": 2007,
            "cumPct": 0.1338,
            "value": 3
          },
          {
            "pct": 0.03133333333333333,
            "count": 470,
            "cumFreq": 2477,
            "cumPct": 0.16513333333333333,
            "value": 4
          },
          {
            "pct": 0.03333333333333333,
            "count": 500,
            "cumFreq": 2977,
            "cumPct": 0.19846666666666668,
            "value": 5
          }
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 281
      }
    ]
  }
}

Percentiles

The percentile function returns the estimated value for a specific percentile in a sample set. The example below returns the estimation for the 95th percentile of the price_f field.

let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
     b=col(a, price_f),
     c=percentile(b, 95))

When this expression is sent to the /stream handler it responds with:

 {
   "result-set": {
     "docs": [
       {
         "c": 312.94
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 286
       }
     ]
   }
 }

Covariance and Correlation

Covariance and Correlation measure how random variables move together.

Covariance and Covariance Matrices

The cov function calculates the covariance of two sample sets of data.

In the example below covariance is calculated for two numeric arrays.

The example below uses arrays created by the array function. Its important to note that vectorized data from Solr Cloud collections can be used with any function that operates on arrays.

let(a=array(1, 2, 3, 4, 5),
    b=array(100, 200, 300, 400, 500),
    c=cov(a, b))

When this expression is sent to the /stream handler it responds with:

 {
   "result-set": {
     "docs": [
       {
         "c": 0.9484775349999998
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 286
       }
     ]
   }
 }

If a matrix is passed to the cov function it will automatically compute a covariance matrix for the columns of the matrix.

Notice in the example three numeric arrays are added as rows in a matrix. The matrix is then transposed to turn the rows into columns, and the covariance matrix is computed for the columns of the matrix.

let(a=array(1, 2, 3, 4, 5),
     b=array(100, 200, 300, 400, 500),
     c=array(30, 40, 80, 90, 110),
     d=transpose(matrix(a, b, c)),
     e=cov(d))

When this expression is sent to the /stream handler it responds with:

 {
   "result-set": {
     "docs": [
       {
         "e": [
           [
             2.5,
             250,
             52.5
           ],
           [
             250,
             25000,
             5250
           ],
           [
             52.5,
             5250,
             1150
           ]
         ]
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 2
       }
     ]
   }
 }

Correlation and Correlation Matrices

Correlation is measure of covariance that has been scaled between -1 and 1.

Three correlation types are supported:

pearsons (default)
kendalls
spearmans

The type of correlation is specified by adding the type named parameter in the function call. The example below demonstrates the use of the type named parameter.

let(a=array(1, 2, 3, 4, 5),
    b=array(100, 200, 300, 400, 5000),
    c=corr(a, b, type=spearmans))

When this expression is sent to the /stream handler it responds with:

 {
   "result-set": {
     "docs": [
       {
         "c": 0.7432941462471664
       },
       {
         "EOF": true,
         "RESPONSE_TIME": 0
       }
     ]
   }
 }

Like the cov function, the corr function automatically builds a correlation matrix if a matrix is passed as a parameter. The correlation matrix is built by correlating the columns of the matrix passed in.

Statistical Inference Tests

Statistical inference tests test a hypothesis on random samples and return p-values which can be used to infer the reliability of the test for the entire population.

The following statistical inference tests are available:

anova: One-Way-Anova tests if there is a statistically significant difference in the means of two or more random samples.
ttest: The T-test tests if there is a statistically significant difference in the means of two random samples.
pairedTtest: The paired t-test tests if there is a statistically significant difference in the means of two random samples with paired data.
gTestDataSet: The G-test tests if two samples of binned discrete data were drawn from the same population.
chiSquareDataset: The Chi-Squared test tests if two samples of binned discrete data were drawn from the same population.
mannWhitney: The Mann-Whitney test is a non-parametric test that tests if two samples of continuous were pulled from the same population. The Mann-Whitney test is often used instead of the T-test when the underlying assumptions of the T-test are not met.
ks: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from the same distribution.

Below is a simple example of a T-test performed on two random samples. The returned p-value of .93 means we can accept the null hypothesis that the two samples do not have statistically significantly differences in the means.

let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
    c=col(a, price_f),
    d=col(b, price_f),
    e=ttest(c, d))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "e": {
          "p-value": 0.9350135639249795,
          "t-statistic": 0.081545541074817
        }
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 48
      }
    ]
  }
}

Transformations

In statistical analysis its often useful to transform data sets before performing statistical calculations. The statistical function library includes the following commonly used transformations:

rank: Returns a numeric array with the rank-transformed value of each element of the original array.
log: Returns a numeric array with the natural log of each element of the original array.
sqrt: Returns a numeric array with the square root of each element of the original array.
cbrt: Returns a numeric array with the cube root of each element of the original array.

Below is an example of a ttest performed on log transformed data sets:

let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
    c=log(col(a, price_f)),
    d=log(col(b, price_f)),
    e=ttest(c, d))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "e": {
          "p-value": 0.9655110070265056,
          "t-statistic": -0.04324265449471238
        }
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 58
      }
    ]
  }
}

Z-scores

The zscores function converts a numeric array to an array of z-scores. The z-score is the number of standard deviations a number is from the mean.

The example below computes the z-scores for the values in an array.

let(a=array(1,2,3),
    b=zscores(a))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "b": [
          -1,
          0,
          1
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 27
      }
    ]
  }
}

Text Analysis and Term Vectors Probability Distributions