= Machine Learning
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section of the math expressions user guide covers machine learning
functions.
== Feature Scaling
Before performing machine learning operations it's often necessary to
scale the feature vectors so they can be compared at the same scale.
All the scaling functions operate on vectors and matrices.
When operating on a matrix, the rows of the matrix are scaled.
=== Min/Max Scaling
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
By default it will scale between 0 and 1 if min/max values are not provided.
Below is a simple example of min/max scaling between 0 and 1.
Notice that, once brought into the same scale, the vectors are identical.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=minMaxScale(c))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"d": [
[
0,
0.3333333333333333,
0.6666666666666666,
1
],
[
0,
0.3333333333333333,
0.6666666666666666,
1
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
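The minimum and maximum can also be supplied explicitly. Below is a minimal sketch, assuming the target minimum and maximum are passed as the second and third parameters, that scales the same matrix between 10 and 20:
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=minMaxScale(c, 10, 20))
----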
=== Standardization
The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1.
Standardization can be used with machine learning algorithms, such as
https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better
when the data has a normal distribution.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=standardize(c))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"d": [
[
-1.161895003862225,
-0.3872983346207417,
0.3872983346207417,
1.161895003862225
],
[
-1.1618950038622249,
-0.38729833462074165,
0.38729833462074165,
1.1618950038622249
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 17
}
]
}
}
----
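Because the scaling functions operate on vectors as well as matrices, `standardize` can also be applied directly to a single vector. A minimal sketch:
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=standardize(a))
----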
=== Unit Vectors
The `unitize` function scales vectors to a magnitude of 1. A vector with a
magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals
with vector direction rather than magnitude.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=unitize(c))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"d": [
[
0.2721655269759087,
0.40824829046386296,
0.5443310539518174,
0.6804138174397716
],
[
0.2721655269759087,
0.4082482904638631,
0.5443310539518174,
0.6804138174397717
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 6
}
]
}
}
----
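Because unit vectors carry only directional information, they are a natural input for direction-based comparisons. As a sketch, assuming the `dotProduct` function covered elsewhere in this guide, the dot product of the two unitized vectors below is their cosine similarity:
[source,text]
----
let(a=unitize(array(20, 30, 40, 50)),
    b=unitize(array(200, 300, 400, 500)),
    c=dotProduct(a, b))
----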
== Distance and Distance Measures
The `distance` function computes the distance between two numeric arrays or a distance matrix for the columns of a matrix.
There are five distance measure functions, each of which returns a function that performs the actual distance calculation:
* `euclidean` (default)
* `manhattan`
* `canberra`
* `earthMovers`
* `haversineMeters` (Geospatial distance measure)
The distance measure functions can be used with all machine learning functions
that support distance measures.
Below is an example for computing Euclidean distance for two numeric arrays:
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(21, 29, 41, 49),
    c=distance(a, b))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 2
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
Below, the distance is calculated using *Manhattan* distance.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(21, 29, 41, 49),
    c=distance(a, b, manhattan()))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 4
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
----
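The geospatial `haversineMeters` measure can be plugged in the same way. Below is a sketch, assuming each array holds a latitude/longitude pair (the coordinates here are illustrative), that returns the distance between the two points in meters:
[source,text]
----
let(a=array(40.7128, -74.0059),
    b=array(45.7597, 4.8422),
    c=distance(a, b, haversineMeters()))
----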
Below is an example of computing a distance matrix for the columns
of a matrix:
[source,text]
----
let(a=array(20, 30, 40),
    b=array(21, 29, 41),
    c=array(31, 40, 50),
    d=matrix(a, b, c),
    e=distance(d))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
[
0,
15.652475842498529,
34.07345007480164
],
[
15.652475842498529,
0,
18.547236990991408
],
[
34.07345007480164,
18.547236990991408,
0
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 24
}
]
}
}
----
== K-Means Clustering
The `kmeans` function performs k-means clustering of the rows of a matrix.
Once the clustering has been completed there are a number of useful functions available
for examining the clusters and centroids.
The examples below cluster _term vectors_.
The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> offers
a full explanation of these features.
=== Centroid Features
In the example below the `kmeans` function is used to cluster a result set from the Enron email data-set
and then the top features are extracted from the cluster centroids.
[source,text]
----
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1>
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"), <2>
    c=kmeans(b, 5), <3>
    d=getCentroids(c), <4>
    e=topFeatures(d, 5)) <5>
----
Let's look at what data is assigned to each variable:
<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron"
collection that match the query "body:oil". The `select` function selects the `id` field
and annotates each tuple with the analyzed bigram terms from the `body` field.
<2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the
tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix
are the bigram terms that were attached to each tuple.
<3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure.
<4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
from one of the 5 clusters. The columns of the matrix are the same bigram terms as the term vector matrix.
<5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
This returns the top 5 bigram terms for each centroid.
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
[
"enron enronxgate",
"north american",
"energy services",
"conference call",
"power generation"
],
[
"financial times",
"chief financial",
"financial officer",
"exchange commission",
"houston chronicle"
],
[
"southern california",
"california edison",
"public utilities",
"utilities commission",
"rate increases"
],
[
"rolling blackouts",
"public utilities",
"electricity prices",
"federal energy",
"price controls"
],
[
"california edison",
"regulatory commission",
"southern california",
"federal energy",
"power generators"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 982
}
]
}
}
----
=== Cluster Features
The example below examines the top features of a specific cluster. This example uses the same techniques
as the centroids example but the top features are extracted from a cluster rather than the centroids.
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=kmeans(b, 25),
    d=getCluster(c, 0), <1>
    e=topFeatures(d, 4)) <2>
----
<1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
that have been clustered together based on their features.
<2> The `topFeatures` function is used to extract the top 4 features from each term vector
in the cluster.
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
[
"electricity board",
"maharashtra state",
"power purchase",
"state electricity",
"reserved enron"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"purchase agreement",
"independent power"
],
[
"maharashtra state",
"reserved enron",
"federal government",
"state government",
"dabhol project"
],
[
"purchase agreement",
"power purchase",
"electricity board",
"maharashtra state",
"state government"
],
[
"investment grade",
"portland general",
"general electric",
"holding company",
"transmission lines"
],
[
"state government",
"state electricity",
"purchase agreement",
"electricity board",
"maharashtra state"
],
[
"electricity board",
"state electricity",
"energy management",
"maharashtra state",
"energy markets"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"state government",
"second quarter"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 978
}
]
}
}
----
== Multi K-Means Clustering
K-means clustering will produce different results depending on
the initial placement of the centroids. K-means is fast enough
that multiple trials can be performed and the best outcome selected.
The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
best result based on which trial produces the lowest intra-cluster variance.
The example below is identical to the centroids example except that it uses `multiKmeans` with 100 trials,
rather than a single trial of the `kmeans` function.
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=multiKmeans(b, 5, 100),
    d=getCentroids(c),
    e=topFeatures(d, 5))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
[
"enron enronxgate",
"energy trading",
"energy markets",
"energy services",
"unleaded gasoline"
],
[
"maharashtra state",
"electricity board",
"state electricity",
"energy trading",
"chief financial"
],
[
"price controls",
"electricity prices",
"francisco chronicle",
"wholesale electricity",
"power generators"
],
[
"southern california",
"california edison",
"public utilities",
"francisco chronicle",
"utilities commission"
],
[
"california edison",
"power purchases",
"system operator",
"term contracts",
"independent system"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 1182
}
]
}
}
----
== Fuzzy K-Means Clustering
The `fuzzyKmeans` function is a soft clustering algorithm which
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
is a value between 1 and 2 that determines how fuzzy to make the cluster assignment.
After the clustering has been performed, the `getMembershipMatrix` function can be called
on the clustering result to return a matrix describing which clusters each vector belongs to.
There is a row in the matrix for each vector that was clustered and a column
for each cluster. The values in the columns are the probability that the vector belongs to that specific
cluster.
A simple example will make this more clear. In the example below 300 documents are analyzed and
then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
term vectors into 12 clusters with a fuzziness factor of 1.25.
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=fuzzyKmeans(b, 12, fuzziness=1.25),
    d=getMembershipMatrix(c), <1>
    e=rowAt(d, 0), <2>
    f=precision(e, 5)) <3>
----
<1> The `getMembershipMatrix` function is used to return the membership matrix,
<2> and the first row of the membership matrix is retrieved with the `rowAt` function.
<3> The `precision` function is then applied to the first row
of the matrix to make it easier to read.
This expression returns a single vector representing the cluster membership probabilities for the first
term vector. Notice that the term vector has the highest association with the 12^th^ cluster,
but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters:
[source,json]
----
{
"result-set": {
"docs": [
{
"f": [
0,
0,
0.178,
0,
0.17707,
0.17775,
0.16214,
0,
0,
0,
0,
0.30504
]
},
{
"EOF": true,
"RESPONSE_TIME": 2157
}
]
}
}
----
== K-Nearest Neighbor (KNN)
The `knn` function searches the rows of a matrix for the
k-nearest neighbors of a search vector and returns a matrix
of the k-nearest neighbors.
The `knn` function supports changing the distance measure by providing one of these
distance measure functions as the fourth parameter:
* `euclidean` (Default)
* `manhattan`
* `canberra`
* `earthMovers`
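As a minimal, self-contained sketch of the fourth parameter, the expression below searches a small numeric matrix for the single nearest neighbor of a search vector using Manhattan distance (the values are illustrative):
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=array(21, 30, 40, 50),
    e=knn(c, d, 1, manhattan()))
----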
The example below builds on the clustering examples to demonstrate the `knn` function.
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=multiKmeans(b, 5, 100),
    d=getCentroids(c), <1>
    e=rowAt(d, 0), <2>
    g=knn(b, e, 3), <3>
    h=topFeatures(g, 4)) <4>
----
<1> In the example, the centroids matrix is set to variable *`d`*.
<2> The first centroid vector is selected from the matrix with the `rowAt` function.
<3> Then the `knn` function is used to find the 3 nearest neighbors
to the centroid vector in the term vector matrix (variable *`b`*).
<4> The `topFeatures` function is used to request the top 4 features of the term vectors in the knn matrix.
The `knn` function returns a matrix with the 3 nearest neighbors based on the
default distance measure, which is Euclidean. Finally, the top 4 features
of the term vectors in the nearest neighbor matrix are returned:
[source,json]
----
{
"result-set": {
"docs": [
{
"h": [
[
"california power",
"electricity supply",
"concerned about",
"companies like"
],
[
"maharashtra state",
"california power",
"electricity board",
"alternative energy"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"houston chronicle"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 1243
}
]
}
}
----
== K-Nearest Neighbor Regression
K-nearest neighbor regression is a non-linear, multivariate regression method. KNN regression is a lazy learning
technique, which means it does not fit a model to the training set in advance. Instead, the
entire training set of observations and outcomes is held in memory, and predictions are made
by averaging the outcomes of the k-nearest neighbors.
The `knnRegress` function prepares the training set for use with the `predict` function.
Below is an example of the `knnRegress` function. In this example 10,000 random samples
are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of
`filesize_d` and `service_d` will be used to predict the value of `response_d`.
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5))
----
This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs:
[source,json]
----
{
"result-set": {
"docs": [
{
"lazyModel": {
"features": 2,
"robust": false,
"distance": "EuclideanDistance",
"observations": 10000,
"scale": false,
"k": 5
}
},
{
"EOF": true,
"RESPONSE_TIME": 170
}
]
}
}
----
=== Prediction and Residuals
The output of `knnRegress` can be used with the `predict` function like other regression models.
In the example below the `predict` function is used to predict results for the original training
data. The sumSq of the residuals is then calculated.
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5),
    predictions=predict(lazyModel, observations),
    residuals=ebeSubtract(outcomes, predictions),
    sumSqErr=sumSq(residuals))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"sumSqErr": 1920290.1204126712
},
{
"EOF": true,
"RESPONSE_TIME": 3796
}
]
}
}
----
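The lazy model can also predict outcomes for observations that were not part of the training set. Below is a sketch, assuming `predict` accepts a single observation as a numeric array; the new file size and service level values are hypothetical:
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5),
    prediction=predict(lazyModel, array(40000, 4)))
----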
=== Setting Feature Scaling
If the features in the observation matrix are not in the same scale, then the larger features
will carry more weight in the distance calculation than the smaller features. This can greatly
impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
can be set to `true` to automatically scale the features into the same range.
The example below shows `knnRegress` with feature scaling turned on.
Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower.
This shows how much more accurate the predictions are when feature scaling is turned on in
this particular example. This is because the `filesize_d` feature is significantly larger than
the `service_d` feature.
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5, scale=true),
    predictions=predict(lazyModel, observations),
    residuals=ebeSubtract(outcomes, predictions),
    sumSqErr=sumSq(residuals))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"sumSqErr": 4076.794951120683
},
{
"EOF": true,
"RESPONSE_TIME": 3790
}
]
}
}
----
=== Setting Robust Regression
The default prediction approach is to take the mean of the outcomes of the k-nearest
neighbors. If the outcomes contain outliers, the mean value can be skewed. Setting
the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors.
This provides a regression prediction that is robust to outliers.
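Below is a minimal sketch of the `robust` parameter, following the same pipeline as the earlier examples and assuming it is set the same way as the `scale` parameter:
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5, robust=true, scale=true))
----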
=== Setting the Distance Measure
The distance measure can be changed for the k-nearest neighbor search by adding a distance measure
function to the `knnRegress` parameters. Below is an example using `manhattan` distance.
[source,text]
----
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
    filesizes=col(samples, filesize_d),
    serviceLevels=col(samples, service_d),
    outcomes=col(samples, response_d),
    observations=transpose(matrix(filesizes, serviceLevels)),
    lazyModel=knnRegress(observations, outcomes, 5, manhattan(), scale=true),
    predictions=predict(lazyModel, observations),
    residuals=ebeSubtract(outcomes, predictions),
    sumSqErr=sumSq(residuals))
----
This expression returns the following response:
[source,json]
----
{
"result-set": {
"docs": [
{
"sumSqErr": 4761.221942288098
},
{
"EOF": true,
"RESPONSE_TIME": 3571
}
]
}
}
----