| = Linear Regression |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| The math expressions library supports simple and multivariate linear regression. |
| |
| == Simple Linear Regression |
| |
| The `regress` function is used to build a linear regression model |
| between two random variables. Sample observations are provided with two |
| numeric arrays. The first numeric array is the independent variable and |
| the second array is the dependent variable. |
| |
| In the example below the `random` function selects 5000 random samples each containing |
| the fields `filesize_d` and `response_d`. The two fields are vectorized |
| and stored in variables *`b`* and *`c`*. Then the `regress` function performs a regression |
| analysis on the two numeric arrays. |
| |
| The `regress` function returns a single tuple with the results of the regression |
| analysis. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, response_d), |
| d=regress(b, c)) |
| ---- |
| |
| Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in |
| `filesize_d` explain 75% of the variability of the `response_d` variable: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": { |
| "significance": 0, |
| "totalSumSquares": 10564812.895147054, |
| "R": 0.8674822407146515, |
| "RSquared": 0.7525254379553127, |
| "meanSquareError": 523.1137343558588, |
| "intercept": -49.528134913099095, |
| "slopeConfidenceInterval": 0.0003171801710329995, |
| "regressionSumSquares": 7950290.450836472, |
| "slope": 0.019945557923159506, |
| "interceptStdErr": 6.489732340389941, |
| "N": 5000 |
| } |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 98 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Prediction |
| |
| The `predict` function uses the regression model to make predictions. |
| Using the example above the regression model can be used to predict the value |
| of `response_d` given a value for `filesize_d`. |
| |
| In the example below the `predict` function uses the regression analysis to predict |
| the value of `response_d` for the `filesize_d` value of `40000`. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, response_d), |
| d=regress(b, c), |
| e=predict(d, 40000)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": 748.079241022975 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 95 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| The `predict` function can also make predictions for an array of values. In this |
| case it returns an array of predictions. |
| |
| In the example below the `predict` function uses the regression analysis to |
| predict values for each of the 5000 samples of `filesize_d` used to generate the model. |
| In this case 5000 predictions are returned. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, response_d), |
| d=regress(b, c), |
| e=predict(d, b)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| 742.2525322514165, |
| 709.6972488729955, |
| 687.8382568904871, |
| 820.2511324266264, |
| 720.4006432289061, |
| 761.1578181053039, |
| 759.1304101159126, |
| 699.5597256337142, |
| 742.4738911248204, |
| 769.0342605881644, |
| 746.6740473150268, |
| ... |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 113 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Residuals |
| |
| The difference between the observed value and the predicted value is known as the |
| residual. There isn't a specific function to calculate the residuals but vector |
| math can used to perform the calculation. |
| |
| In the example below the predictions are stored in variable *`e`*. The `ebeSubtract` |
| function is then used to subtract the predictions |
| from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains |
| the array of residuals. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, response_d), |
| d=regress(b, c), |
| e=predict(d, b), |
| f=ebeSubtract(c, e)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| 31.30678554491226, |
| -30.292830927953446, |
| -30.49508862647258, |
| -30.499884780783532, |
| -9.696458959319784, |
| -30.521563961535094, |
| -30.28380938033081, |
| -9.890289849359306, |
| 30.819723560583157, |
| -30.213178859683012, |
| -30.609943619066826, |
| 10.527700442607625, |
| 10.68046928406568, |
| ... |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 113 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Multivariate Linear Regression |
| |
| The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear |
| regression models the linear relationship between two or more independent variables and a dependent variable. |
| |
| The example below extends the simple linear regression example by introducing a new independent variable |
| called `service_d`. The `service_d` variable is the service level of the request and it can range from 1 to 4 |
| in the data-set. The higher the service level, the higher the bandwidth available for the request. |
| |
| Notice that the two independent variables `filesize_d` and `service_d` are vectorized and stored |
| in the variables *`b`* and *`c`*. The variables *`b`* and *`c`* are then added as rows to a `matrix`. The matrix is |
| then transposed so that each row in the matrix represents one observation with `filesize_d` and `service_d`. |
| The `olsRegress` function then performs the multivariate regression analysis using the observation matrix as the |
| independent variables and the `response_d` values, stored in variable *`d`*, as the dependent variable. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, service_d), |
| d=col(a, response_d), |
| e=transpose(matrix(b, c)), |
| f=olsRegress(e, d)) |
| ---- |
| |
| Notice in the response that the RSquared of the regression analysis is 1. This means that linear relationship between |
| `filesize_d` and `service_d` describe 100% of the variability of the `response_d` variable: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "f": { |
| "regressionParametersStandardErrors": [ |
| 2.0660690430026933e-13, |
| 5.1212982077663434e-18, |
| 9.10920932555875e-15 |
| ], |
| "RSquared": 1, |
| "regressionParameters": [ |
| 6.553210695971329e-12, |
| 0.019999999999999858, |
| -20.49999999999968 |
| ], |
| "regressandVariance": 2124.130825172683, |
| "regressionParametersVariance": [ |
| [ |
| 0.013660174897582315, |
| -3.361258014840509e-7, |
| -0.00006893737578369605 |
| ], |
| [ |
| -3.361258014840509e-7, |
| 8.393183709503206e-12, |
| 6.430253229589981e-11 |
| ], |
| [ |
| -0.00006893737578369605, |
| 6.430253229589981e-11, |
| 0.000026553878455570856 |
| ] |
| ], |
| "adjustedRSquared": 1, |
| "residualSumSquares": 9.373703759269822e-20 |
| } |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 690 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Prediction |
| |
| The `predict` function can also be used to make predictions for multivariate linear regression. |
| |
| Below is an example of a single prediction using the multivariate linear regression model and a single observation. |
| The observation is an array that matches the structure of the observation matrix used to build the model. In this case |
| the first value represents a `filesize_d` of `40000` and the second value represents a `service_d` of `4`. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, service_d), |
| d=col(a, response_d), |
| e=transpose(matrix(b, c)), |
| f=olsRegress(e, d), |
| g=predict(f, array(40000, 4))) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "g": 718.0000000000005 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 117 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| The `predict` function can also make predictions for more than one multivariate observation. In this scenario |
| an observation matrix used. |
| |
| In the example below the observation matrix used to build the multivariate regression model |
| is passed to the `predict` function and it returns an array of predictions. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, service_d), |
| d=col(a, response_d), |
| e=transpose(matrix(b, c)), |
| f=olsRegress(e, d), |
| g=predict(f, e)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| 685.498283591961, |
| 801.2175699959365, |
| 776.7638245911025, |
| 610.3559852681935, |
| 751.0925865965207, |
| 787.2914663381897, |
| 744.3632053810668, |
| 688.3729301599697, |
| 765.367783417171, |
| 724.9309687628346, |
| 834.4350712384264, |
| ... |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 113 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Residuals |
| |
| Once the predictions are generated the residuals can be calculated using the same approach used with |
| simple linear regression. |
| |
| Below is an example of the residuals calculation following a multivariate linear regression. In the example |
| the predictions stored variable *`g`* are subtracted from observed values stored in variable *`d`*. |
| |
| [source,text] |
| ---- |
| let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"), |
| b=col(a, filesize_d), |
| c=col(a, service_d), |
| d=col(a, response_d), |
| e=transpose(matrix(b, c)), |
| f=olsRegress(e, d), |
| g=predict(f, e), |
| h=ebeSubtract(d, g)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| 1.1368683772161603e-13, |
| 1.1368683772161603e-13, |
| 0, |
| 1.1368683772161603e-13, |
| 0, |
| 1.1368683772161603e-13, |
| 0, |
| 2.2737367544323206e-13, |
| 1.1368683772161603e-13, |
| 2.2737367544323206e-13, |
| 1.1368683772161603e-13, |
| ... |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 113 |
| } |
| ] |
| } |
| } |
| ---- |