Interpolation, Derivatives and Integrals | Apache Solr Reference Guide 7.5

Interpolation, derivatives and integrals are three interrelated topics which are part of the field of mathematics called numerical analysis. This section explores the math expressions available for numerical analysis.

Interpolation

Interpolation is used to construct new data points between a set of known control points. The ability to predict new data points allows for sampling along the curve defined by the control points.

The interpolation functions described below all return an interpolation model that can be passed to other functions which make use of the sampling capability.

If returned directly, the interpolation model returns an array containing predictions for each of the control points. This is useful in the case of loess interpolation, which first smooths the control points and then interpolates the smoothed points. All other interpolation functions simply return the original control points, because the curve they predict passes through those points.

There are different algorithms for interpolation that will result in different predictions along the curve. The math expressions library currently supports the following interpolation functions:

lerp: Linear interpolation predicts points that pass through each control point and form straight lines between control points.
spline: Spline interpolation predicts points that pass through each control point and form a smooth curve between control points.
akima: Akima spline interpolation is similar to spline interpolation but is more stable in the presence of outliers.
loess: Loess interpolation first performs a non-linear local regression to smooth the original control points. Then a spline is used to interpolate the smoothed control points.

Upsampling

Interpolation can be used to increase the sampling rate along a curve. One example of this would be to take a time series with samples every minute and create a data set with samples every second. In order to do this the data points between the minutes must be created.

The predict function can be used to predict values anywhere within the bounds of the interpolation range. The example below shows a very simple case of upsampling.

let(x=array(0, 2,  4,  6,  8,   10, 12,  14, 16, 18, 20),  (1)
    y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5),  (2)
    l=lerp(x, y),  (3)
    u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),  (4)
    p=predict(l, u))  (5)

1	In the example linear interpolation is performed on the arrays in variables `x` and `y`. The `x` variable, which is the x-axis, is a sequence from 0 to 20 with a stride of 2.
2	The `y` variable defines the curve along the x-axis.
3	The `lerp` function performs the interpolation and returns the interpolation model.
4	The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x axis. The `predict` function then uses the interpolation function in variable `l` to predict values for every point in the array assigned to variable `u`.
5	The variable `p` is the array of predictions, which is the upsampled set of `y` values.

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "p": [
          5,
          7.5,
          10,
          35,
          60,
          125,
          190,
          145,
          100,
          115,
          130,
          115,
          100,
          60,
          20,
          25,
          30,
          20,
          10,
          7.5,
          5
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
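The arithmetic behind these predictions can be reproduced outside Solr. The following is a minimal Python sketch of linear interpolation, not Solr code; the helper name `lerp_predict` is hypothetical and exists only for this illustration. It uses the same `x`, `y` and `u` arrays as the expression above.

```python
from bisect import bisect_right

def lerp_predict(xs, ys, u):
    """Linearly interpolate the control points (xs, ys) at each value in u.

    Hypothetical helper for illustration only; xs must be sorted ascending.
    """
    out = []
    for v in u:
        if not xs[0] <= v <= xs[-1]:
            raise ValueError(f"{v} is outside the interpolation range")
        # Index of the segment [xs[i], xs[i + 1]] that contains v.
        i = min(bisect_right(xs, v) - 1, len(xs) - 2)
        # Fractional position of v inside that segment (0.0 to 1.0).
        t = (v - xs[i]) / (xs[i + 1] - xs[i])
        out.append(ys[i] + t * (ys[i + 1] - ys[i]))
    return out

x = list(range(0, 21, 2))                      # 0 to 20, stride 2
y = [5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5]
u = list(range(0, 21))                         # 0 to 20, stride 1
p = lerp_predict(x, y, u)
print(p[:5])  # [5.0, 7.5, 10.0, 35.0, 60.0]
```

Every odd value of `u` falls halfway between two control points, so each new prediction is the midpoint of its two neighboring `y` values.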

Smoothing Interpolation

The loess function is a smoothing interpolator which means it doesn’t derive a function that passes through the original control points. Instead the loess function returns a function that smooths the original control points.

A technique known as local regression is used to compute the smoothed curve. The size of the neighborhood of the local regression can be adjusted to control how closely the new curve conforms to the original control points.
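To illustrate the idea of local regression, here is a simplified Python sketch, not Solr's implementation. It fits a weighted straight line to the nearest neighbors of each point, using tricube weights. The function name `local_linear_smooth` and its `bandwidth` argument are hypothetical names chosen for this sketch, and it omits refinements that production loess implementations usually include, so its output should not be expected to match Solr's.

```python
def local_linear_smooth(xs, ys, bandwidth):
    """Smooth ys with a tricube-weighted local linear fit at each x.

    bandwidth is the fraction of the points used for each local fit
    (an assumption of this sketch).
    """
    n = len(xs)
    k = max(2, min(n, int(round(bandwidth * n))))  # neighborhood size
    smoothed = []
    for x0 in xs:
        # The k points nearest to x0 form the neighborhood.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        width = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: nearby points count more than distant ones.
        weights = [(1 - (abs(xs[i] - x0) / width) ** 3) ** 3 for i in nearest]
        sw = sum(weights)
        mx = sum(w * xs[i] for w, i in zip(weights, nearest)) / sw
        my = sum(w * ys[i] for w, i in zip(weights, nearest)) / sw
        sxx = sum(w * (xs[i] - mx) ** 2 for w, i in zip(weights, nearest))
        sxy = sum(w * (xs[i] - mx) * (ys[i] - my)
                  for w, i in zip(weights, nearest))
        # Fall back to the weighted mean when the fit is degenerate.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed

xs = list(range(10))
ys = [2 * x + 1 for x in xs]          # points that already lie on a line
smooth = local_linear_smooth(xs, ys, bandwidth=0.5)
```

Data that already lies on a straight line is returned essentially unchanged. For noisy data, a smaller `bandwidth` shrinks each neighborhood and lets the smoothed curve follow the original points more closely.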

The loess function is passed x- and y-axes and fits a smooth curve to the data. If only a single array is provided it is treated as the y-axis and a sequence is generated for the x-axis.

The example below uses the loess function to fit a curve to a set of y values in an array. The bandwidth parameter defines the fraction of the data to use for each local regression, so .3 means 30 percent. The lower the bandwidth, the smaller the neighborhood used for the local regression and the closer the curve will be to the original data.

let(echo="residuals, sumSqError",
    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
    curve=loess(y, bandwidth=.3),
    residuals=ebeSubtract(y, curve),
    sumSqError=sumSq(residuals))

In the example the fitted curve is subtracted from the original curve using the ebeSubtract function. The output shows the error between the fitted curve and the original curve, known as the residuals. The output also includes the sum-of-squares of the residuals which provides a measure of how large the error is:

{
  "result-set": {
    "docs": [
      {
        "residuals": [
          0,
          0,
          0,
          -0.040524802275866634,
          -0.10531988096456502,
          0.5906115002526198,
          0.004215074334896762,
          0.4201374330912433,
          0.09618315578013803,
          0.012107948556718817,
          -0.9892939034492398,
          0.012014364143757561,
          0.1093830927709325,
          0.523166271893805,
          0.09658362075164639,
          -0.011433819306139625,
          0.9899403519886416,
          -0.011707983372932773,
          -0.004223284004140737,
          -0.00021462867928434548,
          0.0018723112875456138
        ],
        "sumSqError": 2.8016013870800616
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
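The residual calculation itself is simple element-by-element arithmetic. The Python sketch below, which is not Solr code, mirrors what `ebeSubtract` and `sumSq` compute. The helper names and the short `fitted` array are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def ebe_subtract(a, b):
    """Element-by-element a[i] - b[i]; both arrays must have equal length."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [ai - bi for ai, bi in zip(a, b)]

def sum_sq(values):
    """Sum of the squares of the values."""
    return sum(v * v for v in values)

original = [1.0, 2.0, 4.0]
fitted = [1.0, 2.5, 3.5]              # hypothetical smoothed values
residuals = ebe_subtract(original, fitted)
print(residuals)                      # [0.0, -0.5, 0.5]
print(sum_sq(residuals))              # 0.5
```

Squaring the residuals before summing keeps positive and negative errors from cancelling each other out.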

In the next example the curve is fit using a bandwidth of .25:

let(echo="residuals, sumSqError",
    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
    curve=loess(y, bandwidth=.25),
    residuals=ebeSubtract(y, curve),
    sumSqError=sumSq(residuals))

Notice that the curve is a closer fit, shown by the smaller residuals and lower value for the sum-of-squares of the residuals:

{
  "result-set": {
    "docs": [
      {
        "residuals": [
          0,
          0,
          0,
          0,
          -0.19117650587715396,
          0.442863451538809,
          -0.18553845993358564,
          0.29990769020356645,
          0,
          0.23761890236245709,
          -0.7344358765888117,
          0.2376189023624491,
          0,
          0.30373119215254984,
          -3.552713678800501e-15,
          -0.23761890236245264,
          0.7344358765888046,
          -0.2376189023625095,
          0,
          2.842170943040401e-14,
          -2.4868995751603507e-14
        ],
        "sumSqError": 1.7539413576337557
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

Derivatives

The derivative of a function measures the rate of change of the y value with respect to the x value.

The derivative function can compute the derivative of any interpolation function. It can also compute the derivative of a derivative.

The example below computes the derivative for a loess interpolation function.

let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
    curve=loess(x, y, bandwidth=.3),
    derivative=derivative(curve))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "derivative": [
          1.0022002675659012,
          0.9955994648681976,
          1.0154018729613081,
          1.0643674501141696,
          1.0430879694757085,
          0.9698717643975381,
          0.7488201070357539,
          0.44627000894357516,
          0.19019561285422165,
          0.01703599324311178,
          -0.001908408138535126,
          -0.009121607450087499,
          -0.2576361507216319,
          -0.49378951291352746,
          -0.7288073815664,
          -0.9871806872210384,
          -1.0025400632604322,
          -1.001836567536853,
          -1.0076227586138085,
          -1.0021524620888589,
          -1.0020541789058157
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
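To build intuition for these numbers, the rate of change can be approximated directly from the data points with finite differences. The Python sketch below is not Solr code and uses a different technique from the derivative function, which differentiates the interpolation function. The helper name `finite_difference` is hypothetical, and the sketch works on the raw, unsmoothed `y` values.

```python
def finite_difference(xs, ys):
    """Approximate dy/dx at each x.

    Uses central differences for interior points and one-sided
    differences at the two ends.
    """
    n = len(xs)
    d = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        d.append((ys[hi] - ys[lo]) / (xs[hi] - xs[lo]))
    return d

x = list(range(21))
y = [0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0]
slopes = finite_difference(x, y)
print(slopes[0], slopes[-1])          # 1.0 -1.0

# Applying the same step again approximates the derivative of the derivative.
second = finite_difference(x, slopes)
```

The approximate slopes start near 1 where the curve rises, fall toward 0 across the plateau, and end near -1 where the curve descends, which is the same overall shape as the loess derivative above.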

Integrals

An integral is a measure of the area underneath a curve. The integrate function computes an integral for a specific range of an interpolated curve.

In the example below the integrate function computes an integral for the entire range of the curve, 0 through 20.

let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
    curve=loess(x, y, bandwidth=.3),
    integral=integrate(curve, 0, 20))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "integral": 90.17446104846645
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

In the next example an integral is computed for the range of 0 through 10.

let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
    curve=loess(x, y, bandwidth=.3),
    integral=integrate(curve, 0, 10))

When this expression is sent to the /stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "integral": 45.300912584519914
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
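As a rough cross-check of these integrals, the area under the raw data points can be approximated with the trapezoidal rule. The Python sketch below is not Solr code and does not claim to reproduce how the integrate function works; the helper name `trapezoid` is hypothetical. It integrates the original `y` values rather than the smoothed loess curve, so the results are close to, but not equal to, the values shown above.

```python
def trapezoid(xs, ys, start, end):
    """Trapezoidal-rule area under the points (xs, ys) for start <= x <= end.

    Assumes start and end are themselves values present in xs.
    """
    i, j = xs.index(start), xs.index(end)
    area = 0.0
    for k in range(i, j):
        # Area of one trapezoid: width times the average of its two heights.
        area += (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2
    return area

x = list(range(21))
y = [0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0]
print(round(trapezoid(x, y, 0, 20), 2))   # 91.7
print(round(trapezoid(x, y, 0, 10), 2))   # 45.7
```

The trapezoidal estimates of about 91.7 and 45.7 are slightly higher than the loess-based results of about 90.17 and 45.30, which is consistent with smoothing having flattened the peaks of the curve.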

Bicubic Spline

The bicubicSpline function can be used to interpolate and predict values anywhere within a grid of data.

A simple example will make this more clear:

let(years=array(1998, 2000, 2002, 2004, 2006),
    floors=array(1, 5, 9, 13, 17, 19),
    prices = matrix(array(300000, 320000, 330000, 350000, 360000, 370000),
                    array(320000, 330000, 340000, 350000, 365000, 380000),
                    array(400000, 410000, 415000, 425000, 430000, 440000),
                    array(410000, 420000, 425000, 435000, 445000, 450000),
                    array(420000, 430000, 435000, 445000, 450000, 470000)),
    bspline=bicubicSpline(years, floors, prices),
    prediction=predict(bspline, 2003, 8))

In this example a bicubic spline is used to interpolate a matrix of real estate data. Each row of the matrix represents a specific year. Each column of the matrix represents a floor of the building. The grid of numbers is the average selling price of an apartment for each year and floor. For example, in 2002 the average selling price for the 9th floor was 415000 (row 3, column 3).

The bicubicSpline function is then used to interpolate the grid, and the predict function is used to predict a value for year 2003, floor 8. Notice that the matrix does not include a data point for year 2003, floor 8. The bicubicSpline function creates that data point based on the surrounding data in the matrix:

{
  "result-set": {
    "docs": [
      {
        "prediction": 418279.5009328358
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
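A simpler relative of this technique, bilinear interpolation, shows how a value inside the grid is derived from the four surrounding data points. The Python sketch below is not Solr code and is not a bicubic spline; the helper name `bilinear` is hypothetical. It produces a straight-line estimate, whereas the bicubic spline fits a smooth surface through the grid, so the two results differ slightly.

```python
from bisect import bisect_right

def bilinear(rows, cols, grid, r, c):
    """Bilinear interpolation of grid at (r, c).

    rows and cols are the sorted axis values for the grid's rows and columns.
    """
    def bracket(axis, v):
        # Find the cell that contains v and v's fractional position in it.
        if not axis[0] <= v <= axis[-1]:
            raise ValueError(f"{v} is outside the grid")
        i = min(bisect_right(axis, v) - 1, len(axis) - 2)
        return i, (v - axis[i]) / (axis[i + 1] - axis[i])

    i, tr = bracket(rows, r)
    j, tc = bracket(cols, c)
    # Interpolate along the columns on the two bracketing rows...
    top = grid[i][j] + tc * (grid[i][j + 1] - grid[i][j])
    bottom = grid[i + 1][j] + tc * (grid[i + 1][j + 1] - grid[i + 1][j])
    # ...then interpolate between those two rows.
    return top + tr * (bottom - top)

years = [1998, 2000, 2002, 2004, 2006]
floors = [1, 5, 9, 13, 17, 19]
prices = [[300000, 320000, 330000, 350000, 360000, 370000],
          [320000, 330000, 340000, 350000, 365000, 380000],
          [400000, 410000, 415000, 425000, 430000, 440000],
          [410000, 420000, 425000, 435000, 445000, 450000],
          [420000, 430000, 435000, 445000, 450000, 470000]]
print(bilinear(years, floors, prices, 2003, 8))   # 418750.0
```

The bilinear estimate of 418750 uses only the prices for floors 5 and 9 in 2002 and 2004, and it lands close to the bicubic prediction shown above.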
