| = Machine Learning |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| |
| This section of the math expressions user guide covers machine learning |
| functions. |
| |
| == Feature Scaling |
| |
Before performing machine learning operations it's often necessary to
scale the feature vectors so they can be compared on the same scale.
| |
All the scaling functions operate on vectors and matrices.
When operating on a matrix, the rows of the matrix are scaled.
| |
| === Min/Max Scaling |
| |
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
By default it scales between 0 and 1 if min/max values are not provided.
| |
Below is a simple example of min/max scaling between 0 and 1.
Notice that once brought into the same scale, the vectors are identical.
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=minMaxScale(c)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| 0, |
| 0.3333333333333333, |
| 0.6666666666666666, |
| 1 |
| ], |
| [ |
| 0, |
| 0.3333333333333333, |
| 0.6666666666666666, |
| 1 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 0 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Standardization |
| |
| The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1. |
| Standardization can be used with machine learning algorithms, such as |
| https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better |
| when the data has a normal distribution. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=standardize(c)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| -1.161895003862225, |
| -0.3872983346207417, |
| 0.3872983346207417, |
| 1.161895003862225 |
| ], |
| [ |
| -1.1618950038622249, |
| -0.38729833462074165, |
| 0.38729833462074165, |
| 1.1618950038622249 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 17 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Unit Vectors |
| |
| The `unitize` function scales vectors to a magnitude of 1. A vector with a |
| magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals |
| with vector direction rather than magnitude. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=unitize(c)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| 0.2721655269759087, |
| 0.40824829046386296, |
| 0.5443310539518174, |
| 0.6804138174397716 |
| ], |
| [ |
| 0.2721655269759087, |
| 0.4082482904638631, |
| 0.5443310539518174, |
| 0.6804138174397717 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 6 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Distance and Distance Measures |
| |
The `distance` function computes the distance between two numeric arrays or a distance matrix for the columns of a matrix.
| |
There are five distance measure functions, each of which returns a function that performs the actual distance calculation:
| |
| * `euclidean` (default) |
| * `manhattan` |
| * `canberra` |
| * `earthMovers` |
| * `haversineMeters` (Geospatial distance measure) |
| |
| The distance measure functions can be used with all machine learning functions |
| that support distance measures. |
| |
Below is an example of computing the Euclidean distance between two numeric arrays:
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(21, 29, 41, 49), |
| c=distance(a, b)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": 2 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 0 |
| } |
| ] |
| } |
| } |
| ---- |
| |
Below, the distance is calculated using *Manhattan* distance (the sum of the absolute differences).
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(21, 29, 41, 49), |
| c=distance(a, b, manhattan())) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": 4 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 1 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| |
Below is an example of computing a distance matrix for the columns
of a matrix:
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40), |
| b=array(21, 29, 41), |
| c=array(31, 40, 50), |
| d=matrix(a, b, c), |
    e=distance(d))
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| [ |
| 0, |
| 15.652475842498529, |
| 34.07345007480164 |
| ], |
| [ |
| 15.652475842498529, |
| 0, |
| 18.547236990991408 |
| ], |
| [ |
| 34.07345007480164, |
| 18.547236990991408, |
| 0 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 24 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == K-Means Clustering |
| |
The `kmeans` function performs k-means clustering of the rows of a matrix.
Once the clustering has been completed, there are a number of useful functions available
for examining the clusters and centroids.
| |
| The examples below cluster _term vectors_. |
| The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> offers |
| a full explanation of these features. |
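
Before moving to term vectors, the sketch below (using only functions shown elsewhere in this section) clusters the rows of a small numeric matrix into 2 clusters and returns the centroids:

[source,text]
----
let(a=array(1, 1),
    b=array(1.2, 0.9),
    c=array(10, 10),
    d=array(10.5, 9.8),
    e=matrix(a, b, c, d),
    f=kmeans(e, 2),
    g=getCentroids(f))
----

The first two rows and the last two rows should land in separate clusters, with centroids near (1.1, 0.95) and (10.25, 9.9), since each centroid is the mean of the rows assigned to its cluster.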
| |
| === Centroid Features |
| |
In the example below, the `kmeans` function is used to cluster a result set from the Enron email dataset
and then the top features are extracted from the cluster centroids.
| |
| [source,text] |
| ---- |
| let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1> |
| id, |
| analyze(body, body_bigram) as terms), |
    b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"), <2>
| c=kmeans(b, 5), <3> |
| d=getCentroids(c), <4> |
| e=topFeatures(d, 5)) <5> |
| ---- |
| |
| Let's look at what data is assigned to each variable: |
| |
<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron"
collection that match the query "body:oil". The `select` function selects the `id` field and
annotates each tuple with the analyzed bigram terms from the `body` field.
| <2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the |
| tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix |
| are the bigram terms that were attached to each tuple. |
| <3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure. |
| <4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid |
from one of the 5 clusters. The columns of the matrix are the same bigram terms as the term vector matrix.
| <5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix. |
| This returns the top 5 bigram terms for each centroid. |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| [ |
| "enron enronxgate", |
| "north american", |
| "energy services", |
| "conference call", |
| "power generation" |
| ], |
| [ |
| "financial times", |
| "chief financial", |
| "financial officer", |
| "exchange commission", |
| "houston chronicle" |
| ], |
| [ |
| "southern california", |
| "california edison", |
| "public utilities", |
| "utilities commission", |
| "rate increases" |
| ], |
| [ |
| "rolling blackouts", |
| "public utilities", |
| "electricity prices", |
| "federal energy", |
| "price controls" |
| ], |
| [ |
| "california edison", |
| "regulatory commission", |
| "southern california", |
| "federal energy", |
| "power generators" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 982 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Cluster Features |
| |
The example below examines the top features of a specific cluster. This example uses the same techniques
as the centroids example, but the top features are extracted from a cluster rather than from the centroids.
| |
| [source,text] |
| ---- |
| let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), |
| id, |
| analyze(body, body_bigram) as terms), |
| b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), |
| c=kmeans(b, 25), |
| d=getCluster(c, 0), <1> |
    e=topFeatures(d, 5)) <2>
| ---- |
| |
| <1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors |
| that have been clustered together based on their features. |
<2> The `topFeatures` function is used to extract the top 5 features from each term vector
in the cluster.
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| [ |
| "electricity board", |
| "maharashtra state", |
| "power purchase", |
| "state electricity", |
| "reserved enron" |
| ], |
| [ |
| "electricity board", |
| "maharashtra state", |
| "state electricity", |
| "purchase agreement", |
| "independent power" |
| ], |
| [ |
| "maharashtra state", |
| "reserved enron", |
| "federal government", |
| "state government", |
| "dabhol project" |
| ], |
| [ |
| "purchase agreement", |
| "power purchase", |
| "electricity board", |
| "maharashtra state", |
| "state government" |
| ], |
| [ |
| "investment grade", |
| "portland general", |
| "general electric", |
| "holding company", |
| "transmission lines" |
| ], |
| [ |
| "state government", |
| "state electricity", |
| "purchase agreement", |
| "electricity board", |
| "maharashtra state" |
| ], |
| [ |
| "electricity board", |
| "state electricity", |
| "energy management", |
| "maharashtra state", |
| "energy markets" |
| ], |
| [ |
| "electricity board", |
| "maharashtra state", |
| "state electricity", |
| "state government", |
| "second quarter" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 978 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Multi K-Means Clustering |
| |
| K-means clustering will produce different results depending on |
| the initial placement of the centroids. K-means is fast enough |
| that multiple trials can be performed and the best outcome selected. |
| |
| The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the |
| best result based on which trial produces the lowest intra-cluster variance. |
| |
The example below mirrors the centroids example, except that it uses `multiKmeans` with 100 trials
rather than a single trial of the `kmeans` function.
| |
| [source,text] |
| ---- |
| let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), |
| id, |
| analyze(body, body_bigram) as terms), |
| b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), |
| c=multiKmeans(b, 5, 100), |
| d=getCentroids(c), |
| e=topFeatures(d, 5)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "e": [ |
| [ |
| "enron enronxgate", |
| "energy trading", |
| "energy markets", |
| "energy services", |
| "unleaded gasoline" |
| ], |
| [ |
| "maharashtra state", |
| "electricity board", |
| "state electricity", |
| "energy trading", |
| "chief financial" |
| ], |
| [ |
| "price controls", |
| "electricity prices", |
| "francisco chronicle", |
| "wholesale electricity", |
| "power generators" |
| ], |
| [ |
| "southern california", |
| "california edison", |
| "public utilities", |
| "francisco chronicle", |
| "utilities commission" |
| ], |
| [ |
| "california edison", |
| "power purchases", |
| "system operator", |
| "term contracts", |
| "independent system" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 1182 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Fuzzy K-Means Clustering |
| |
The `fuzzyKmeans` function is a soft clustering algorithm which
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
is a value between 1 and 2 that determines how fuzzy to make the cluster assignments.
| |
After the clustering has been performed, the `getMembershipMatrix` function can be called
on the clustering result to return a matrix describing which clusters each vector belongs to.
There is a row in the matrix for each vector that was clustered and a column for each cluster.
The values in the columns are the probabilities that the vector belongs to each cluster,
and the probabilities in each row sum to 1.
| |
A simple example will make this clearer. In the example below, 300 documents are analyzed and
then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
term vectors into 12 clusters with a fuzziness factor of 1.25.
| |
| [source,text] |
| ---- |
| let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"), |
| id, |
| analyze(body, body_bigram) as terms), |
| b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), |
| c=fuzzyKmeans(b, 12, fuzziness=1.25), |
| d=getMembershipMatrix(c), <1> |
| e=rowAt(d, 0), <2> |
| f=precision(e, 5)) <3> |
| ---- |
| |
<1> The `getMembershipMatrix` function returns the membership matrix.
<2> The first row of the membership matrix is retrieved with the `rowAt` function.
| <3> The `precision` function is then applied to the first row |
| of the matrix to make it easier to read. |
| |
| This expression returns a single vector representing the cluster membership probabilities for the first |
| term vector. Notice that the term vector has the highest association with the 12^th^ cluster, |
| but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "f": [ |
| 0, |
| 0, |
| 0.178, |
| 0, |
| 0.17707, |
| 0.17775, |
| 0.16214, |
| 0, |
| 0, |
| 0, |
| 0, |
| 0.30504 |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 2157 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == K-Nearest Neighbor (KNN) |
| |
The `knn` function searches the rows of a matrix for the
k-nearest neighbors of a search vector and returns a matrix
of the k-nearest neighbors.
| |
The `knn` function supports changing the distance measure by providing one of these
distance measure functions as the fourth parameter (a minimal sketch follows the list below):
| |
| * `euclidean` (Default) |
| * `manhattan` |
| * `canberra` |
| * `earthMovers` |
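
As a minimal sketch of the fourth parameter, the expression below finds the single nearest row of a small matrix using Manhattan distance:

[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(100, 200, 300, 400),
    c=matrix(a, b),
    d=array(21, 29, 41, 49),
    e=knn(c, d, 1, manhattan()))
----

This returns a matrix containing the single row (20, 30, 40, 50), whose Manhattan distance from the search vector is 4.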
| |
| The example below builds on the clustering examples to demonstrate the `knn` function. |
| |
| [source,text] |
| ---- |
| let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), |
| id, |
| analyze(body, body_bigram) as terms), |
| b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), |
| c=multiKmeans(b, 5, 100), |
| d=getCentroids(c), <1> |
| e=rowAt(d, 0), <2> |
| g=knn(b, e, 3), <3> |
| h=topFeatures(g, 4)) <4> |
| ---- |
| |
| <1> In the example, the centroids matrix is set to variable *`d`*. |
| <2> The first centroid vector is selected from the matrix with the `rowAt` function. |
| <3> Then the `knn` function is used to find the 3 nearest neighbors |
| to the centroid vector in the term vector matrix (variable *`b`*). |
<4> The `topFeatures` function is used to request the top 4 features of the term vectors in the knn matrix.
| |
The `knn` function returns a matrix with the 3 nearest neighbors based on the
default distance measure, which is Euclidean. Finally, the top 4 features
of the term vectors in the nearest neighbor matrix are returned:
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "h": [ |
| [ |
| "california power", |
| "electricity supply", |
| "concerned about", |
| "companies like" |
| ], |
| [ |
| "maharashtra state", |
| "california power", |
| "electricity board", |
| "alternative energy" |
| ], |
| [ |
| "electricity board", |
| "maharashtra state", |
| "state electricity", |
| "houston chronicle" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 1243 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == K-Nearest Neighbor Regression |
| |
K-nearest neighbor regression is a non-linear, multi-variate regression method. KNN regression is a lazy learning
technique, which means it does not fit a model to the training set in advance. Instead, the
entire training set of observations and outcomes is held in memory, and predictions are made
by averaging the outcomes of the k-nearest neighbors.
| |
| The `knnRegress` function prepares the training set for use with the `predict` function. |
| |
| Below is an example of the `knnRegress` function. In this example 10,000 random samples |
| are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of |
| `filesize_d` and `service_d` will be used to predict the value of `response_d`. |
| |
| [source,text] |
| ---- |
| let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), |
| filesizes=col(samples, filesize_d), |
| serviceLevels=col(samples, service_d), |
| outcomes=col(samples, response_d), |
| observations=transpose(matrix(filesizes, serviceLevels)), |
    lazyModel=knnRegress(observations, outcomes, 5))
| ---- |
| |
This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs: the number of features and observations, the distance measure, and the `k`, `scale`, and `robust` settings:
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "lazyModel": { |
| "features": 2, |
| "robust": false, |
| "distance": "EuclideanDistance", |
| "observations": 10000, |
| "scale": false, |
| "k": 5 |
| } |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 170 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Prediction and Residuals |
| |
| The output of `knnRegress` can be used with the `predict` function like other regression models. |
| |
| In the example below the `predict` function is used to predict results for the original training |
| data. The sumSq of the residuals is then calculated. |
| |
| [source,text] |
| ---- |
| let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), |
| filesizes=col(samples, filesize_d), |
| serviceLevels=col(samples, service_d), |
| outcomes=col(samples, response_d), |
| observations=transpose(matrix(filesizes, serviceLevels)), |
    lazyModel=knnRegress(observations, outcomes, 5),
| predictions=predict(lazyModel, observations), |
| residuals=ebeSubtract(outcomes, predictions), |
| sumSqErr=sumSq(residuals)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "sumSqErr": 1920290.1204126712 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 3796 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| === Setting Feature Scaling |
| |
If the features in the observation matrix are not in the same scale, then the larger features
will carry more weight in the distance calculation than the smaller features. This can greatly
impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
can be set to `true` to automatically scale the features to the same range.
| |
The example below shows `knnRegress` with feature scaling turned on.

Notice that when feature scaling is turned on, the `sumSqErr` in the output is much lower.
This shows how much more accurate the predictions are when feature scaling is turned on in
this particular example. This is because the `filesize_d` feature is significantly larger than
the `service_d` feature.
| |
| [source,text] |
| ---- |
| let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), |
| filesizes=col(samples, filesize_d), |
| serviceLevels=col(samples, service_d), |
| outcomes=col(samples, response_d), |
| observations=transpose(matrix(filesizes, serviceLevels)), |
| lazyModel=knnRegress(observations, outcomes , 5, scale=true), |
| predictions=predict(lazyModel, observations), |
| residuals=ebeSubtract(outcomes, predictions), |
| sumSqErr=sumSq(residuals)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "sumSqErr": 4076.794951120683 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 3790 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| |
| === Setting Robust Regression |
| |
The default prediction approach is to take the mean of the outcomes of the k-nearest
neighbors. If the outcomes contain outliers, the mean value can be skewed. Setting
the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors instead.
This provides a regression prediction that is robust to outliers.
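
Below is a sketch of the earlier feature-scaled example with robust prediction turned on, assuming the `robust` named parameter takes the same form as `scale`:

[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5, scale=true, robust=true),
    predictions=predict(lazyModel, observations))
----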
| |
| === Setting the Distance Measure |
| |
| The distance measure can be changed for the k-nearest neighbor search by adding a distance measure |
| function to the `knnRegress` parameters. Below is an example using `manhattan` distance. |
| |
| [source,text] |
| ---- |
| let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), |
| filesizes=col(samples, filesize_d), |
| serviceLevels=col(samples, service_d), |
| outcomes=col(samples, response_d), |
| observations=transpose(matrix(filesizes, serviceLevels)), |
| lazyModel=knnRegress(observations, outcomes, 5, manhattan(), scale=true), |
| predictions=predict(lazyModel, observations), |
| residuals=ebeSubtract(outcomes, predictions), |
| sumSqErr=sumSq(residuals)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "sumSqErr": 4761.221942288098 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 3571 |
| } |
| ] |
| } |
| } |
| ---- |