The math expressions library supports simple and multivariate linear regression.
Simple Linear Regression
The regress function is used to build a linear regression model between two random variables. Sample observations are provided with two numeric arrays. The first numeric array is the independent variable and the second array is the dependent variable.
In the example below the random function selects 5000 random samples, each containing the fields filesize_d and response_d. The two fields are vectorized and stored in variables b and c. Then the regress function performs a regression analysis on the two numeric arrays.
The regress function returns a single tuple with the results of the regression analysis.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
b=col(a, filesize_d),
c=col(a, response_d),
d=regress(b, c))
Note that in this regression analysis the value of RSquared is approximately .75. This means that changes in filesize_d explain about 75% of the variability of the response_d variable:
{
"result-set": {
"docs": [
{
"d": {
"significance": 0,
"totalSumSquares": 10564812.895147054,
"R": 0.8674822407146515,
"RSquared": 0.7525254379553127,
"meanSquareError": 523.1137343558588,
"intercept": -49.528134913099095,
"slopeConfidenceInterval": 0.0003171801710329995,
"regressionSumSquares": 7950290.450836472,
"slope": 0.019945557923159506,
"interceptStdErr": 6.489732340389941,
"N": 5000
}
},
{
"EOF": true,
"RESPONSE_TIME": 98
}
]
}
}
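To make the fields in the response concrete, here is a minimal Python sketch of an ordinary least squares fit for one independent variable. It is not Solr's implementation; the function name simple_regress and the sample data are made up for illustration, and only a subset of the response fields is computed. It shows that RSquared is the ratio regressionSumSquares / totalSumSquares, which agrees with the response above (7950290.45 / 10564812.90 is roughly 0.7525).

```python
from statistics import fmean


def simple_regress(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration; key names mirror the response above.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = fmean(x), fmean(y)
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Portion of the total variation explained by the fitted line.
    regression_ss = slope * sxy
    return {
        "slope": slope,
        "intercept": intercept,
        "totalSumSquares": total_ss,
        "regressionSumSquares": regression_ss,
        "RSquared": regression_ss / total_ss,
        "N": len(x),
    }


# Made-up sample data (not taken from collection2).
filesize = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
response = [12.0, 29.0, 52.0, 68.0, 91.0]
model = simple_regress(filesize, response)
print(model["slope"], model["intercept"], model["RSquared"])
```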
Prediction
The predict function uses the regression model to make predictions. Using the example above, the regression model can be used to predict the value of response_d given a value for filesize_d.
In the example below the predict function uses the regression analysis to predict the value of response_d for the filesize_d value of 40000.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
b=col(a, filesize_d),
c=col(a, response_d),
d=regress(b, c),
e=predict(d, 40000))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"e": 748.079241022975
},
{
"EOF": true,
"RESPONSE_TIME": 95
}
]
}
}
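For a simple linear model a prediction is just intercept + slope * x. The Python sketch below applies that formula to the slope and intercept copied from the regress response shown earlier; the predict helper here is a stand-in written for illustration, not Solr's function. The result is close to, but not identical to, the value returned above. A plausible reason (an assumption, not something the responses state) is that each expression draws a fresh random sample, so the model behind each response differs slightly.

```python
# Values copied from the regress response shown earlier on this page.
intercept = -49.528134913099095
slope = 0.019945557923159506


def predict(x):
    """Evaluate the fitted line at x (illustrative helper)."""
    return intercept + slope * x


prediction = predict(40000)
print(round(prediction, 2))
```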
The predict function can also make predictions for an array of values. In this case it returns an array of predictions.
In the example below the predict function uses the regression analysis to predict values for each of the 5000 samples of filesize_d used to generate the model. In this case 5000 predictions are returned.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
b=col(a, filesize_d),
c=col(a, response_d),
d=regress(b, c),
e=predict(d, b))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"e": [
742.2525322514165,
709.6972488729955,
687.8382568904871,
820.2511324266264,
720.4006432289061,
761.1578181053039,
759.1304101159126,
699.5597256337142,
742.4738911248204,
769.0342605881644,
746.6740473150268,
...
]
},
{
"EOF": true,
"RESPONSE_TIME": 113
}
]
}
}
Residuals
The difference between the observed value and the predicted value is known as the residual. There isn't a specific function to calculate the residuals, but vector math can be used to perform the calculation.
In the example below the predictions are stored in variable e. The ebeSubtract function is then used to subtract the predictions from the actual response_d values stored in variable c. Variable f contains the array of residuals.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
b=col(a, filesize_d),
c=col(a, response_d),
d=regress(b, c),
e=predict(d, b),
f=ebeSubtract(c, e))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"f": [
31.30678554491226,
-30.292830927953446,
-30.49508862647258,
-30.499884780783532,
-9.696458959319784,
-30.521563961535094,
-30.28380938033081,
-9.890289849359306,
30.819723560583157,
-30.213178859683012,
-30.609943619066826,
10.527700442607625,
10.68046928406568,
...
]
},
{
"EOF": true,
"RESPONSE_TIME": 113
}
]
}
}
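The residual calculation is an element-by-element subtraction of two equal-length arrays. A minimal Python sketch of that operation follows; the helper name ebe_subtract and the numbers are made up for illustration and are not taken from the response above.

```python
def ebe_subtract(a, b):
    """Subtract b from a element by element (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [x - y for x, y in zip(a, b)]


# Made-up observed and predicted values.
observed = [750.0, 700.0, 690.5]
predicted = [742.25, 709.5, 687.75]

# A positive residual means the model under-predicted the observation.
residuals = ebe_subtract(observed, predicted)
print(residuals)
```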
Multivariate Linear Regression
The olsRegress function performs a multivariate linear regression analysis. Multivariate linear regression models the linear relationship between two or more independent variables and a dependent variable.
The example below extends the simple linear regression example by introducing a new independent variable called service_d. The service_d variable is the service level of the request and it can range from 1 to 4 in the data-set. The higher the service level, the higher the bandwidth available for the request.
Notice that the two independent variables filesize_d and service_d are vectorized and stored in the variables b and c. The variables b and c are then added as rows to a matrix. The matrix is then transposed so that each row in the matrix represents one observation with filesize_d and service_d.
The olsRegress function then performs the multivariate regression analysis using the observation matrix as the independent variables and the response_d values, stored in variable d, as the dependent variable.
let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, response_d"),
b=col(a, filesize_d),
c=col(a, service_d),
d=col(a, response_d),
e=transpose(matrix(b, c)),
f=olsRegress(e, d))
Notice in the response that the RSquared of the regression analysis is 1. This means that the linear relationship between filesize_d and service_d describes 100% of the variability of the response_d variable:
{
"result-set": {
"docs": [
{
"f": {
"regressionParametersStandardErrors": [
2.0660690430026933e-13,
5.1212982077663434e-18,
9.10920932555875e-15
],
"RSquared": 1,
"regressionParameters": [
6.553210695971329e-12,
0.019999999999999858,
-20.49999999999968
],
"regressandVariance": 2124.130825172683,
"regressionParametersVariance": [
[
0.013660174897582315,
-3.361258014840509e-7,
-0.00006893737578369605
],
[
-3.361258014840509e-7,
8.393183709503206e-12,
6.430253229589981e-11
],
[
-0.00006893737578369605,
6.430253229589981e-11,
0.000026553878455570856
]
],
"adjustedRSquared": 1,
"residualSumSquares": 9.373703759269822e-20
}
},
{
"EOF": true,
"RESPONSE_TIME": 690
}
]
}
}
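As a rough illustration of what an ordinary least squares fit over an observation matrix involves, the Python sketch below solves the normal equations with Gauss-Jordan elimination. This is a teaching sketch under stated assumptions, not Solr's algorithm: the function name ols_fit is hypothetical, the data is made up to follow response = 0.02 * filesize - 20.5 * service exactly, and numerical libraries generally prefer more robust decompositions than the normal equations used here.

```python
def ols_fit(observations, outcomes):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error.

    Hypothetical helper: solves (X^T X) beta = X^T y directly.
    """
    # Prepend a constant 1.0 to every observation for the intercept term.
    rows = [[1.0, *obs] for obs in observations]
    k = len(rows[0])
    # Build the augmented normal-equation system [X^T X | X^T y].
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, outcomes))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(system[r][col]))
        if system[pivot][col] == 0.0:  # crude singularity check
            raise ValueError("independent variables are linearly dependent")
        system[col], system[pivot] = system[pivot], system[col]
        for r in range(k):
            if r != col:
                factor = system[r][col] / system[col][col]
                system[r] = [a - factor * b
                             for a, b in zip(system[r], system[col])]
    return [system[i][k] / system[i][i] for i in range(k)]


# Made-up observations: one row per request, columns (filesize, service).
observations = [
    [10000.0, 1.0], [20000.0, 2.0], [30000.0, 1.0],
    [40000.0, 4.0], [25000.0, 3.0], [15000.0, 4.0],
]
# Outcomes generated from an exact linear rule, so the fit should recover it.
outcomes = [0.02 * f - 20.5 * s for f, s in observations]

params = ols_fit(observations, outcomes)
print(params)
```

Because the made-up outcomes are an exact linear function of the two inputs, the recovered parameters should be very close to an intercept of 0 and coefficients of 0.02 and -20.5, which is the same situation as the RSquared of 1 reported above.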
Prediction
The predict function can also be used to make predictions for multivariate linear regression.
Below is an example of a single prediction using the multivariate linear regression model and a single observation. The observation is an array that matches the structure of the observation matrix used to build the model. In this case the first value represents a filesize_d of 40000 and the second value represents a service_d of 4.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
b=col(a, filesize_d),
c=col(a, service_d),
d=col(a, response_d),
e=transpose(matrix(b, c)),
f=olsRegress(e, d),
g=predict(f, array(40000, 4)))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"g": 718.0000000000005
},
{
"EOF": true,
"RESPONSE_TIME": 117
}
]
}
}
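This prediction can be reproduced by hand from the regressionParameters array in the olsRegress response shown earlier. The Python sketch below assumes the array is ordered as intercept first, followed by one coefficient per observation column (filesize_d, then service_d). That ordering is an assumption inferred from the values, not something the response states, and the predict helper is written only for illustration.

```python
# Values copied from the olsRegress response shown earlier on this page.
# Assumed order: [intercept, filesize_d coefficient, service_d coefficient].
regression_parameters = [
    6.553210695971329e-12,
    0.019999999999999858,
    -20.49999999999968,
]


def predict(params, observation):
    """Evaluate intercept + sum(coefficient * value) for one observation."""
    intercept, *coefficients = params
    if len(coefficients) != len(observation):
        raise ValueError("observation does not match the model's variables")
    return intercept + sum(c * v for c, v in zip(coefficients, observation))


# filesize_d = 40000, service_d = 4
prediction = predict(regression_parameters, [40000, 4])
print(prediction)
```

Under that assumed ordering the arithmetic is approximately 0.02 * 40000 - 20.5 * 4 = 718, which agrees with the value returned above.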
The predict function can also make predictions for more than one multivariate observation. In this scenario an observation matrix is used.
In the example below the observation matrix used to build the multivariate regression model is passed to the predict function and it returns an array of predictions.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
b=col(a, filesize_d),
c=col(a, service_d),
d=col(a, response_d),
e=transpose(matrix(b, c)),
f=olsRegress(e, d),
g=predict(f, e))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"g": [
685.498283591961,
801.2175699959365,
776.7638245911025,
610.3559852681935,
751.0925865965207,
787.2914663381897,
744.3632053810668,
688.3729301599697,
765.367783417171,
724.9309687628346,
834.4350712384264,
...
]
},
{
"EOF": true,
"RESPONSE_TIME": 113
}
]
}
}
Residuals
Once the predictions are generated, the residuals can be calculated using the same approach used with simple linear regression.
Below is an example of the residuals calculation following a multivariate linear regression. In the example the predictions stored in variable g are subtracted from the observed values stored in variable d.
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
b=col(a, filesize_d),
c=col(a, service_d),
d=col(a, response_d),
e=transpose(matrix(b, c)),
f=olsRegress(e, d),
g=predict(f, e),
h=ebeSubtract(d, g))
When this expression is sent to the /stream handler it responds with:
{
"result-set": {
"docs": [
{
"h": [
1.1368683772161603e-13,
1.1368683772161603e-13,
0,
1.1368683772161603e-13,
0,
1.1368683772161603e-13,
0,
2.2737367544323206e-13,
1.1368683772161603e-13,
2.2737367544323206e-13,
1.1368683772161603e-13,
...
]
},
{
"EOF": true,
"RESPONSE_TIME": 113
}
]
}
}