| = Machine Learning |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| |
| This section of the math expressions user guide covers machine learning |
| functions. |
| |
| == Distance and Distance Matrices |
| |
| The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix. |
| |
| There are six distance measure functions that return a function that performs the actual distance calculation: |
| |
| * `euclidean` (default) |
| * `manhattan` |
| * `canberra` |
| * `earthMovers` |
| * `cosine` |
| * `haversineMeters` (Geospatial distance measure) |
| |
| The distance measure functions can be used with all machine learning functions |
| that support distance measures. |
| |
| Below is an example for computing Euclidean distance for two numeric arrays: |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(21, 29, 41, 49), |
| c=distance(a, b)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": 2 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 0 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| Below the distance is calculated using Manhattan distance. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(21, 29, 41, 49), |
| c=distance(a, b, manhattan())) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": 4 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 1 |
| } |
| ] |
| } |
| } |
| ---- |
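
The two results above can be checked by hand. The following is a minimal Python sketch of the underlying arithmetic, not Solr's implementation; the function names `euclidean` and `manhattan` are reused here only for readability.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = [20, 30, 40, 50]
b = [21, 29, 41, 49]

print(euclidean(a, b))  # 2.0
print(manhattan(a, b))  # 4
```

Each element differs by exactly 1, so the squared differences sum to 4 (square root 2) and the absolute differences sum to 4.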
| |
| === Distance Matrices |
| |
| Distance matrices are powerful tools for visualizing the distance |
| between two or more |
| vectors. |
| |
| The `distance` function builds a distance matrix |
| if a matrix is passed as the parameter. The distance matrix is computed for the *columns* |
| of the matrix. |
| |
| The example below demonstrates the power of distance matrices combined with two-dimensional faceting. |
| |
| In this example the `facet2D` function is used to generate a two-dimensional facet aggregation |
| over the fields `complaint_type_s` and `zip_s` from the `nyc311` complaints database. |
| The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated. |
| The result is a stream of tuples each containing the fields `complaint_type_s`, `zip_s` and the count for the pair. |
| |
| The `pivot` function is then used to pivot the fields into a *matrix* with the `zip_s` |
| field as the *rows* and the `complaint_type_s` field as the *columns*. The `count(*)` field populates |
| the values in the cells of the matrix. |
| |
| The `distance` function is then used to compute the distance matrix for the columns |
| of the matrix using `cosine` distance. This produces a distance matrix |
| that shows distance between complaint types based on the zip codes they appear in. |
| |
| Finally the `zplot` function is used to plot the distance matrix as a heat map. Notice that the |
| heat map has been configured so that the intensity of color increases as the distance between vectors |
| decreases. |
| |
| |
| image::images/math-expressions/distance.png[] |
| |
| The heat map is interactive, so mousing over one of the cells pops up the values |
| for the cell. |
| |
| image::images/math-expressions/distanceview.png[] |
| |
| Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a cosine distance of 0.1 (rounded to the nearest |
| tenth). |
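
To make the column-wise calculation concrete, here is a small Python sketch of a cosine distance matrix computed over the columns of a pivoted count matrix. It assumes cosine distance is defined as 1 minus cosine similarity, and the `counts` matrix is hypothetical data, not taken from the `nyc311` collection.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def column_distance_matrix(matrix, measure):
    """Compare every column of the matrix with every other column."""
    columns = list(zip(*matrix))
    return [[measure(ci, cj) for cj in columns] for ci in columns]

# Hypothetical counts: rows are zip codes, columns are complaint types.
counts = [
    [10, 20, 0],
    [5, 10, 3],
    [0, 0, 9],
]

dist = column_distance_matrix(counts, cosine_distance)
for row in dist:
    print([round(d, 3) for d in row])
```

In this toy data the first two columns point in nearly the same direction, so their distance is close to 0, while the third column is concentrated in a different zip code and sits far from both.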
| |
| |
| == K-Nearest Neighbor (KNN) |
| |
| The `knn` function searches the rows of a matrix with a search vector and |
| returns a matrix of the k-nearest neighbors. This allows for secondary vector |
| searches over result sets. |
| |
| The `knn` function supports changing of the distance measure by providing one of the following |
| distance measure functions: |
| |
| * `euclidean` (Default) |
| * `manhattan` |
| * `canberra` |
| * `earthMovers` |
| * `cosine` |
| * `haversineMeters` (Geospatial distance measure) |
| |
| The example below shows how to perform a secondary search over an aggregation |
| result set. The goal of the example is to find zip codes in the nyc311 complaint |
| database that have similar complaint types to the zip code 10280. |
| |
| The first step in the example is to use the `facet2D` function to perform a two |
| dimensional aggregation over the `zip_s` and `complaint_type_s` fields. In the example |
| the top 119 zip codes and top 5 complaint types for each zip code are calculated |
| for the borough of Manhattan. The result is a list of tuples each containing |
| the `zip_s`, `complaint_type_s` and the `count(*)` for the combination. |
| |
| The list of tuples is then *pivoted* into a matrix with the `pivot` function. |
| The `pivot` function in this example returns a matrix with rows of zip codes |
| and columns of complaint types. |
| The `count(*)` field from the tuples populates the cells of the matrix. |
| This matrix will be used as the secondary search matrix. |
| |
| The next step is to locate the vector for the 10280 zip code. |
| This is done in three steps in the example. |
| The first step is to retrieve the row labels from the matrix with the `getRowLabels` function. |
| The row labels in this case are zip codes which were populated by the `pivot` function. |
| Then the `indexOf` function is used to find the *index* of the "10280" zip code in the list of row labels. |
| The `rowAt` function is then used to return the vector at that *index* from the matrix. |
| This vector is the *search vector*. |
| |
| Now that we have a matrix and search vector we can use the `knn` function to perform the search. |
| In the example the `knn` function searches the matrix with the search vector with a K of 5, using |
| *cosine* distance. Cosine distance is useful for comparing sparse vectors, which is the case in this |
| example. The `knn` function returns a matrix with the top 5 nearest neighbors to the search vector. |
| |
| The `knn` function populates the row and column labels of the return matrix and |
| also adds a vector of *distances* for each row as an attribute to the matrix. |
| |
| In the example the `zplot` function extracts the row labels and |
| the distance vector with the `getRowLabels` and `getAttribute` functions. |
| The `topFeatures` function is used to extract |
| the top 5 column labels for each zip code vector, based on the counts for each |
| column. Then `zplot` outputs the data in a format that can be visualized in |
| a table with Zeppelin-Solr. |
| |
| image::images/math-expressions/knn.png[] |
| |
| The table above shows each zip code returned by the `knn` function along |
| with the list of complaints and distances. These are the zip codes that are most similar |
| to the 10280 zip code based on their top 5 complaint types. |
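
The search steps described above (look up the row for a label, then rank all rows by distance to it) can be sketched in Python as follows. This is an illustration under stated assumptions rather than Solr's code: the labels and counts are hypothetical, and cosine distance is taken to be 1 minus cosine similarity.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn(rows, labels, query, k, measure):
    """Return the k (label, distance) pairs closest to the query vector."""
    scored = sorted((measure(row, query), label) for row, label in zip(rows, labels))
    return [(label, dist) for dist, label in scored[:k]]

# Hypothetical pivoted matrix: one row per zip code, one column per complaint type.
labels = ["10280", "10282", "10004", "10036"]
rows = [
    [50, 30, 0, 5],
    [48, 33, 0, 4],
    [40, 20, 5, 10],
    [0, 2, 60, 40],
]

# Equivalent of getRowLabels + indexOf + rowAt: locate the search vector.
query = rows[labels.index("10280")]

neighbors = knn(rows, labels, query, 3, cosine_distance)
for label, dist in neighbors:
    print(label, round(dist, 4))
```

The search vector is itself a row of the matrix, so it comes back as its own nearest neighbor with a distance of approximately 0.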
| |
| == K-Nearest Neighbor Regression |
| |
| K-nearest neighbor regression is a non-linear, bivariate and multivariate regression method. |
| KNN regression is a lazy learning |
| technique, which means it does not fit a model to the training set in advance. Instead, the |
| entire training set of observations and outcomes is held in memory and predictions are made |
| by averaging the outcomes of the k-nearest neighbors. |
| |
| The `knnRegress` function is used to perform nearest neighbor regression. |
| |
| |
| === 2D Non-Linear Regression |
| |
| The example below shows the *regression plot* for KNN regression applied to a 2D scatter plot. |
| |
| In this example the `random` function is used to draw 500 random samples from the `logs` collection |
| containing two fields `filesize_d` and `eresponse_d`. The sample is then vectorized with the |
| `filesize_d` field stored in a vector assigned to variable `x` and the `eresponse_d` field stored in |
| variable `y`. The `knnRegress` function is then applied with `20` as the nearest neighbor parameter, |
| which returns a KNN function which can be used to predict values. |
| The `predict` function is then called on the KNN function to predict values for the original `x` vector. |
| Finally `zplot` is used to plot the original `x` and `y` vectors along with the predictions. |
| |
| image::images/math-expressions/knnRegress.png[] |
| |
| Notice that the regression plot shows a non-linear relationship between the `filesize_d` |
| field and the `eresponse_d` field. Also note that KNN regression |
| plots a non-linear curve through the scatter plot. The larger the size |
| of K (nearest neighbors), the smoother the line. |
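
The averaging step behind a bivariate KNN regression is simple enough to sketch in a few lines of Python. The helper name `knn_regress_predict` and the sample data are hypothetical; this shows the general technique, not the internals of `knnRegress`.

```python
def knn_regress_predict(xs, ys, query, k):
    """Average the outcomes of the k observations whose x value is closest to query."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical non-linear data: y = x squared.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 4, 9, 16, 25, 36, 49, 64]

# Predict over the original x vector, as the example above does.
small_k = [knn_regress_predict(xs, ys, x, 3) for x in xs]
large_k = [knn_regress_predict(xs, ys, x, 8) for x in xs]

print([round(p, 2) for p in small_k])
print(large_k)
```

With K equal to the size of the training set every prediction collapses to the overall mean, which is the extreme case of the smoothing effect that a larger K produces.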
| |
| === Multivariate Non-Linear Regression |
| |
| The `knnRegress` function is also a powerful and flexible tool for |
| multi-variate non-linear regression. |
| |
| In the example below a multi-variate regression is performed using |
| a database designed for analyzing and predicting wine quality. The |
| database contains nearly 1600 records with 9 predictors of wine quality: |
| pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide, |
| volatile_acidity, citric_acid, residual_sugar. There is also a field |
| called quality assigned to each wine ranging |
| from 3 to 8. |
| |
| KNN regression can be used to predict wine quality for vectors containing |
| the predictor values. |
| |
| In the example a search is performed on the `redwine` collection to |
| return all the rows in the database of observations. Then the quality field and |
| predictor fields are read into vectors and set to variables. |
| |
| The predictor variables are added as rows to a matrix which is |
| transposed so each row in the matrix contains one observation with the 9 |
| predictor values. |
| This is our observation matrix which is assigned to the variable `obs`. |
| |
| Then the `knnRegress` function regresses the observations with quality outcomes. |
| The value for K is set to 5 in the example, so the average quality of the 5 |
| nearest neighbors will be used to calculate the quality. |
| |
| The `predict` function is then used to generate a vector of predictions |
| for the entire observation set. These predictions will be used to determine |
| how well the KNN regression performed over the observation data. |
| |
| The error, or *residuals*, for the regression are then calculated by |
| subtracting the *predicted* quality from the *observed* quality. |
| The `ebeSubtract` function is used to perform the element-by-element |
| subtraction between the two vectors. |
| |
| Finally the `zplot` function formats the predictions and errors |
| for the visualization of the *residual plot*. |
| |
| image::images/math-expressions/redwine1.png[] |
| |
| The residual plot plots the *predicted* values on the x-axis and the *error* for the |
| prediction on the y-axis. The scatter plot shows how the errors |
| are distributed across the full range of predictions. |
| |
| The residual plot can be interpreted to understand how the KNN regression performed on the |
| training data. |
| |
| * The plot shows the prediction error appears to be fairly evenly distributed |
| above and below zero. The density of the errors increases as it approaches zero. The |
| bubble size reflects the density of errors at the specific point in the plot. |
| This provides an intuitive feel for the distribution of the model's error. |
| |
| * The plot also visualizes the variance of the error across the range of |
| predictions. This provides an intuitive understanding of whether the KNN predictions |
| will have similar error variance across the full range of predictions. |
| |
| The residuals can also be visualized using a histogram to better understand |
| the shape of the residuals distribution. The example below shows the same KNN |
| regression as above with a plot of the distribution of the errors. |
| |
| In the example the `zplot` function is used to plot the `empiricalDistribution` |
| function of the residuals, with an 11 bin histogram. |
| |
| image::images/math-expressions/redwine2.png[] |
| |
| Notice that the errors follow a bell curve centered close to 0. From this plot |
| we can see the probability of getting prediction errors between -1 and 1 is quite high. |
| |
| *Additional KNN Regression Parameters* |
| |
| The `knnRegress` function has three additional parameters that make it suitable for many different regression scenarios. |
| |
| . Any of the distance measures can be used for the regression simply by adding the function to the call. |
| This allows for regression analysis over sparse vectors (`cosine`), dense vectors and geo-spatial lat/lon vectors (`haversineMeters`). |
| + |
| Sample syntax: |
| + |
| [source,text] |
| ---- |
| r=knnRegress(obs, quality, 5, cosine()), |
| ---- |
| |
| . The `robust` named parameter can be used to perform a regression analysis that is robust to outliers in the outcomes. |
| When the `robust` parameter is used the median outcome of the k-nearest neighbors is used rather than the average. |
| + |
| Sample syntax: |
| + |
| [source,text] |
| ---- |
| r=knnRegress(obs, quality, 5, robust="true"), |
| ---- |
| |
| . The `scale` named parameter can be used to scale the columns of the observations and search vectors |
| at prediction time. This can improve the performance of the KNN regression when the feature columns |
| are at different scales, causing the distance calculations to place too much weight on the larger columns. |
| + |
| Sample syntax: |
| + |
| [source,text] |
| ---- |
| r=knnRegress(obs, quality, 5, scale="true"), |
| ---- |
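
The effect of the `robust` and `scale` options can be illustrated with a short Python sketch. It assumes Euclidean distance and min/max scaling of each column; the scaling method Solr applies internally is not stated in this guide, so treat that choice, the helper names, and the data as assumptions.

```python
import math
from statistics import mean, median

def column_bounds(rows):
    """Per-column (min, max) pairs for a list of observation rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def rescale(row, bounds):
    """Min/max scale one row into the 0..1 range of each column."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]

def knn_regress(obs, outcomes, query, k, robust=False, scale=False):
    if scale:
        bounds = column_bounds(obs)
        obs = [rescale(row, bounds) for row in obs]
        query = rescale(query, bounds)
    order = sorted(range(len(obs)), key=lambda i: math.dist(obs[i], query))
    nearest = [outcomes[i] for i in order[:k]]
    # robust=True swaps the average for the median of the neighbors' outcomes
    return median(nearest) if robust else mean(nearest)

# Hypothetical observations: the second column is on a much larger scale.
obs = [[1.0, 1000.0], [1.1, 5000.0], [5.0, 1100.0], [5.2, 4900.0]]
outcomes = [3, 4, 8, 40]   # 40 plays the role of an outlier
query = [1.0, 4800.0]

print(knn_regress(obs, outcomes, query, 1))                           # unscaled
print(knn_regress(obs, outcomes, query, 1, scale=True))               # scaled
print(knn_regress(obs, outcomes, query, 3, scale=True))               # mean of 3
print(knn_regress(obs, outcomes, query, 3, scale=True, robust=True))  # median of 3
```

Without scaling, the large second column dominates the distance and the outlier row becomes the nearest neighbor. With scaling, both columns contribute and a different row is nearest. With `robust=True` the outlier among the three neighbors no longer pulls the prediction upward.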
| |
| == knnSearch |
| |
| The `knnSearch` function returns the k-nearest neighbors |
| for a document based on text similarity. |
| Under the covers the `knnSearch` function uses Solr's <<other-parsers.adoc#more-like-this-query-parser,More Like This>> query parser plugin. |
| This capability uses the search engine's query, term statistics, scoring, and ranking capability to perform a fast, nearest neighbor search for similar documents over large distributed indexes. |
| |
| The results of this search can be used directly or provide *candidates* for machine learning operations such as a secondary KNN vector search. |
| |
| The example below shows the `knnSearch` function on a movie reviews data set. The search returns the 50 documents most similar to a specific document ID (`83e9b5b0...`) based on the similarity of the `review_t` field. |
| The `mindf` and `maxdf` specify the minimum and maximum document frequency of the terms used to perform the search. |
| These parameters can make the query faster by eliminating high-frequency terms and can also improve accuracy by removing noise terms from the search. |
| |
| image::images/math-expressions/knnSearch.png[] |
| |
| NOTE: In this example the `select` |
| function is used to truncate the review in the output to 220 characters to make it easier |
| to read in a table. |
| |
| == DBSCAN |
| |
| DBSCAN clustering is a powerful density-based clustering algorithm which is particularly well suited for geospatial clustering. |
| DBSCAN uses two parameters to filter result sets to clusters of specific density: |
| |
| * `eps` (Epsilon): The maximum distance between two points for them to be considered neighbors. |
| |
| * `min` points: The minimum number of points needed in a cluster for it to be returned. |
| |
| |
| === 2D Cluster Visualization |
| |
| The `zplot` function has direct support for plotting 2D clusters by using the `clusters` named parameter. |
| |
| The example below uses DBSCAN clustering and cluster visualization to find |
| the *hot spots* on a map for rat sightings in the NYC 311 complaints database. |
| |
| In this example the `random` function draws a sample of records from the `nyc311` collection where |
| the complaint description matches "rat sighting" and latitude is populated in the record. |
| The latitude and longitude fields are then vectorized and added as rows to a matrix. |
| The matrix is transposed so each row contains a single latitude, longitude |
| point. |
| The `dbscan` function is then used to cluster the latitude and longitude points. |
| Notice that the `dbscan` function in the example has four parameters. |
| |
| * `obs` : The observation matrix of lat/lon points |
| |
| * `eps` : The distance between points to be considered a cluster. 100 meters in the example. |
| |
| * `min points`: The minimum points in a cluster for the cluster to be returned by the function. `5` in the example. |
| |
| * `distance measure`: An optional distance measure used to determine the |
| distance between points. The default is Euclidean distance. |
| The example uses `haversineMeters` which returns the distance in meters which is much more meaningful for geospatial use cases. |
| |
| Finally, the `zplot` function is used to visualize the clusters on a map with Zeppelin-Solr. |
| The map below has been zoomed to a specific area of Brooklyn with a high density of rat sightings. |
| |
| image::images/math-expressions/dbscan1.png[] |
| |
| Notice in the visualization that only 1019 points were returned from the 5000 samples. |
| This is the power of the DBSCAN algorithm to filter records that don't match the criteria |
| of a cluster. The points that are plotted all belong to clearly defined clusters. |
| |
| The map visualization can be zoomed further to explore the locations of specific clusters. |
| The example below shows a zoom into an area of dense clusters. |
| |
| image::images/math-expressions/dbscan2.png[] |
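
For readers who want to see how `eps`, `min points` and a haversine distance measure interact, below is a compact Python sketch of textbook DBSCAN. It is not the implementation behind the `dbscan` function. The Earth radius constant, the convention that a point counts as its own neighbor, and the sample coordinates are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (assumption of this sketch)

def haversine_meters(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def dbscan(points, eps, min_points, measure):
    """Return clusters as sorted lists of point indexes; noise points are dropped."""
    def neighbors(i):
        return [j for j in range(len(points)) if measure(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None = unvisited, -1 = noise, >= 0 = cluster id
    clusters = []
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1
            continue
        cluster_id = len(clusters)
        labels[i] = cluster_id
        members = [i]
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # former noise becomes a border point
                labels[j] = cluster_id
                members.append(j)
                continue
            if labels[j] is not None:    # already assigned to a cluster
                continue
            labels[j] = cluster_id
            members.append(j)
            more = neighbors(j)
            if len(more) >= min_points:  # j is a core point, keep expanding
                queue.extend(more)
        clusters.append(sorted(members))
    return clusters

# Hypothetical lat/lon points: five within a few meters, two isolated.
points = [
    (40.67820, -73.94420), (40.67830, -73.94420), (40.67820, -73.94430),
    (40.67825, -73.94425), (40.67815, -73.94415),
    (40.70000, -73.90000), (40.75000, -73.98000),
]

clusters = dbscan(points, 100, 5, haversine_meters)
print(clusters)  # [[0, 1, 2, 3, 4]]
```

The two isolated points have no neighbors within 100 meters, so they are treated as noise and never returned, which is the filtering behavior described above.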
| |
| |
| == K-Means Clustering |
| |
| The `kmeans` function performs k-means clustering of the rows of a matrix. |
| Once the clustering has been completed there are a number of useful functions available |
| for examining and visualizing the clusters and centroids. |
| |
| |
| === Clustered Scatter Plot |
| |
| In this example we'll again be clustering 2D lat/lon points of rat sightings. But unlike the DBSCAN example, k-means clustering |
| does not on its own |
| perform any noise reduction. So in order to reduce the noise a smaller random sample is selected from the data than was used |
| for the DBSCAN example. |
| |
| We'll see that sampling itself is a powerful noise reduction tool which helps visualize the cluster density. |
| This is because there is a higher probability that samples will be drawn from higher density clusters and a lower |
| probability that samples will be drawn from lower density clusters. |
| |
| In this example the `random` function draws a sample of 1500 records from the `nyc311` (complaints database) collection where |
| the complaint description matches "rat sighting" and latitude is populated in the record. The latitude and longitude fields |
| are then vectorized and added as rows to a matrix. The matrix is transposed so each row contains a single latitude, longitude |
| point. The `kmeans` function is then used to cluster the latitude and longitude points into 21 clusters. |
| Finally, the `zplot` function is used to visualize the clusters as a scatter plot. |
| |
| image::images/math-expressions/2DCluster1.png[] |
| |
| The scatter plot above shows each lat/lon point plotted on a Euclidean plane with longitude on the |
| x-axis and |
| latitude on the y-axis. The plot is dense enough so the outlines of the different boroughs are visible |
| if you know the boroughs of New York City. |
| |
| |
| Each cluster is shown in a different color. This plot provides interesting |
| insight into the densities of rat sightings throughout the five boroughs of New York City. For |
| example it highlights a cluster of dense sightings in Brooklyn at `cluster1` |
| surrounded by less dense but still high activity clusters. |
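
The clustering step itself can be sketched with Lloyd's algorithm in plain Python. This is a simplified illustration: seeding the centroids with the first `k` rows is an assumption made so the result is repeatable, and the `kmeans` function may initialize differently. The lat/lon points are hypothetical.

```python
import math

def kmeans(rows, k, iterations=100):
    """Lloyd's algorithm; returns (assignment per row, list of centroids)."""
    centroids = [list(row) for row in rows[:k]]  # simplistic, deterministic seeding
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(row, centroids[c])) for row in rows
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical lat/lon points forming two loose groups.
rows = [
    (40.70, -73.95), (40.80, -73.90), (40.71, -73.94),
    (40.81, -73.91), (40.69, -73.96), (40.79, -73.89),
]

assignment, centroids = kmeans(rows, 2)
print(assignment)
print(centroids)
```

Each centroid is the mean latitude and longitude of its members, which is the kind of matrix that `getCentroids` is described as returning later in this section.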
| |
| === Plotting the Centroids |
| |
| The centroids of each cluster can then be plotted on a map to visualize the center of the |
| clusters. In the example below the centroids are extracted from the clusters using the `getCentroids` |
| function, which returns a matrix of the centroids. |
| |
| The centroids matrix contains 2D lat/lon points. The `colAt` function can then be used |
| to extract the latitude and longitude columns by index from the matrix so they can be |
| plotted with `zplot`. A map visualization is used below to display the centroids. |
| |
| |
| image::images/math-expressions/centroidplot.png[] |
| |
| |
| The map can then be zoomed to get a closer look at the centroids in the high density areas shown |
| in the cluster scatter plot. |
| |
| image::images/math-expressions/centroidzoom.png[] |
| |
| |
| === Phrase Extraction |
| |
| K-means clustering produces centroids or *prototype* vectors which can be used to represent |
| each cluster. In this example the key features of the centroids are extracted |
| to represent the key phrases for clusters of TF-IDF term vectors. |
| |
| NOTE: The example below works with TF-IDF _term vectors_. |
| The section <<term-vectors.adoc#,Text Analysis and Term Vectors>> offers |
| a full explanation of these features. |
| |
| In the example the `search` function returns documents where the `review_t` field matches the phrase "star wars". |
| The `select` function is run over the result set and applies the `analyze` function |
| which uses the Lucene/Solr analyzer attached to the schema field `text_bigrams` to re-analyze the `review_t` |
| field. This analyzer returns bigrams which are then annotated to documents in a field called `terms`. |
| |
| The `termVectors` function then creates TF-IDF term vectors from the bigrams stored in the `terms` field. |
| The `kmeans` function is then used to cluster the bigram term vectors into 5 clusters. |
| Finally the top 5 features are extracted from the centroids and returned. |
| Notice that the features are all bigram phrases with semantic significance. |
| |
| [source,text] |
| ---- |
| let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"), |
| id, |
| analyze(review_t, text_bigrams) as terms), |
| vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"), |
| clusters=kmeans(vectors, 5), |
| centroids=getCentroids(clusters), |
| phrases=topFeatures(centroids, 5)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "phrases": [ |
| [ |
| "empire strikes", |
| "rebel alliance", |
| "princess leia", |
| "luke skywalker", |
| "phantom menace" |
| ], |
| [ |
| "original star", |
| "main characters", |
| "production values", |
| "anakin skywalker", |
| "luke skywalker" |
| ], |
| [ |
| "carrie fisher", |
| "original films", |
| "harrison ford", |
| "luke skywalker", |
| "ian mcdiarmid" |
| ], |
| [ |
| "phantom menace", |
| "original trilogy", |
| "harrison ford", |
| "john williams", |
| "empire strikes" |
| ], |
| [ |
| "science fiction", |
| "fiction films", |
| "forbidden planet", |
| "character development", |
| "worth watching" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 46 |
| } |
| ] |
| } |
| } |
| ---- |
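
Conceptually, extracting the top features of a centroid means ranking its columns by weight and reading off the matching column labels. The Python sketch below shows that idea with hypothetical feature names and weights; it is not the code behind `topFeatures`.

```python
def top_features(centroids, feature_names, n):
    """For each centroid, return the names of its n highest-weighted features."""
    result = []
    for centroid in centroids:
        ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:n]])
    return result

# Hypothetical bigram features and centroid weights.
feature_names = ["luke skywalker", "harrison ford", "science fiction", "john williams"]
centroids = [
    [0.9, 0.1, 0.0, 0.3],
    [0.0, 0.2, 0.8, 0.1],
]

phrases = top_features(centroids, feature_names, 2)
print(phrases)
```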
| |
| == Multi K-Means Clustering |
| |
| K-means clustering will produce different outcomes depending on |
| the initial placement of the centroids. K-means is fast enough |
| that multiple trials can be performed so that the best outcome can be selected. |
| |
| The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the |
| best result based on which trial produces the lowest intra-cluster variance. |
| |
| The example below is identical to the phrase extraction example except that it uses `multiKmeans` with 15 trials, |
| rather than a single trial of the `kmeans` function. |
| |
| [source,text] |
| ---- |
| let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"), |
| id, |
| analyze(review_t, text_bigrams) as terms), |
| vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"), |
| clusters=multiKmeans(vectors, 5, 15), |
| centroids=getCentroids(clusters), |
| phrases=topFeatures(centroids, 5)) |
| ---- |
| |
| This expression returns the following response: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "phrases": [ |
| [ |
| "science fiction", |
| "original star", |
| "production values", |
| "fiction films", |
| "forbidden planet" |
| ], |
| [ |
| "empire strikes", |
| "princess leia", |
| "luke skywalker", |
| "phantom menace" |
| ], |
| [ |
| "carrie fisher", |
| "harrison ford", |
| "luke skywalker", |
| "empire strikes", |
| "original films" |
| ], |
| [ |
| "phantom menace", |
| "original trilogy", |
| "harrison ford", |
| "character development", |
| "john williams" |
| ], |
| [ |
| "rebel alliance", |
| "empire strikes", |
| "princess leia", |
| "original trilogy", |
| "luke skywalker" |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 84 |
| } |
| ] |
| } |
| } |
| ---- |
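
The trial-and-select strategy can be sketched in Python as shown here. The sketch assumes random seeding from the input rows and uses the within-cluster sum of squared distances as the measure of intra-cluster variance; both are assumptions, as are the helper names and the data.

```python
import math
import random

def kmeans_once(rows, k, rng, iterations=50):
    """One k-means trial; returns (within-cluster sum of squares, centroids)."""
    centroids = [list(row) for row in rng.sample(rows, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda c: math.dist(row, centroids[c]))
            groups[nearest].append(row)
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    cost = sum(
        math.dist(row, centroids[c]) ** 2 for c, g in enumerate(groups) for row in g
    )
    return cost, centroids

def multi_kmeans(rows, k, trials, seed=0):
    """Run several trials and keep the one with the lowest cost."""
    rng = random.Random(seed)
    return min((kmeans_once(rows, k, rng) for _ in range(trials)), key=lambda r: r[0])

# Hypothetical 2D points in three well-separated groups.
rows = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

best_cost, best_centroids = multi_kmeans(rows, 3, 15)
print(round(best_cost, 3))
print(best_centroids)
```

Because the best trial is chosen by cost, the result can never be worse than the first trial alone, and it is usually better when the first seeding happens to be unlucky.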
| |
| == Fuzzy K-Means Clustering |
| |
| The `fuzzyKmeans` function is a soft clustering algorithm which |
| allows vectors to be assigned to more than one cluster. The `fuzziness` parameter |
| is a value between `1` and `2` that determines how fuzzy to make the cluster assignment. |
| |
| After the clustering has been performed the `getMembershipMatrix` function can be called |
| on the clustering result to return a matrix describing the probabilities |
| of cluster membership for each vector. |
| This matrix can be used to understand relationships between clusters. |
| |
| In the example below `fuzzyKmeans` is used to cluster the movie reviews matching the phrase "star wars". |
| But instead of looking at the clusters or centroids, the `getMembershipMatrix` function is used to return the |
| membership probabilities for each document. The membership matrix consists of a row for each |
| vector that was clustered. There is a column in the matrix for each cluster. |
| The values in the matrix contain the probability that a specific vector belongs to a specific cluster. |
| |
| In the example the `distance` function is then used to create a *distance matrix* from the columns of the |
| membership matrix. The distance matrix is then visualized with the `zplot` function as a heat map. |
| |
| In the example `cluster1` and `cluster5` have the shortest distance between the clusters. |
| Further analysis of the features in both clusters can be performed to understand |
| the relationship between `cluster1` and `cluster5`. |
| |
| image::images/math-expressions/fuzzyk.png[] |
| |
| NOTE: The heat map has been configured to increase in color intensity as the distance shortens. |
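
To give a feel for what a membership matrix contains, the Python sketch below applies the standard fuzzy c-means membership formula to a fixed set of centroids. Whether `fuzzyKmeans` uses exactly this formula is an assumption, and for brevity the centroids are supplied by hand rather than learned iteratively.

```python
import math

def membership_matrix(rows, centroids, fuzziness):
    """One row per vector, one column per cluster; each row sums to 1.

    Uses u = 1 / sum_j (d_c / d_j) ** (2 / (fuzziness - 1)), so fuzziness must be > 1.
    """
    exponent = 2.0 / (fuzziness - 1.0)
    matrix = []
    for row in rows:
        dists = [math.dist(row, c) for c in centroids]
        if 0.0 in dists:  # the vector sits exactly on a centroid
            matrix.append([1.0 if d == 0.0 else 0.0 for d in dists])
            continue
        matrix.append(
            [1.0 / sum((d_c / d_j) ** exponent for d_j in dists) for d_c in dists]
        )
    return matrix

# Hypothetical 2D vectors and two hand-picked centroids.
rows = [(0, 0), (1, 0), (5, 0), (9, 0), (10, 0)]
centroids = [(0, 0), (10, 0)]

memberships = membership_matrix(rows, centroids, 2.0)
for m in memberships:
    print([round(p, 3) for p in m])
```

The vector halfway between the two centroids belongs equally to both clusters, while vectors near a centroid belong almost entirely to that cluster. Two columns of this matrix with similar values would indicate clusters that share many members.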
| |
| == Feature Scaling |
| |
| Before performing machine learning operations it's often necessary to |
| scale the feature vectors so they can be compared at the same scale. |
| |
| All the scaling functions below operate on vectors and matrices. |
| When operating on a matrix the rows of the matrix are scaled. |
| |
| === Min/Max Scaling |
| |
| The `minMaxScale` function scales a vector or matrix between a minimum and maximum value. |
| By default it will scale between `0` and `1` if min/max values are not provided. |
| |
| Below is a plot of a sine wave, with an amplitude of 1, before and |
| after it has been scaled between -5 and 5. |
| |
| image::images/math-expressions/minmaxscale.png[] |
| |
| |
| Below is a simple example of min/max scaling of a matrix between 0 and 1. |
| Notice that once brought into the same scale the vectors are the same. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=minMaxScale(c)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| 0, |
| 0.3333333333333333, |
| 0.6666666666666666, |
| 1 |
| ], |
| [ |
| 0, |
| 0.3333333333333333, |
| 0.6666666666666666, |
| 1 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 0 |
| } |
| ] |
| } |
| } |
| ---- |
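
The arithmetic behind this result is a linear rescaling of each row. A minimal Python sketch follows; the helper name is hypothetical and the sketch assumes the vector is not constant.

```python
def min_max_scale(vector, lo=0.0, hi=1.0):
    """Linearly map the vector's minimum to lo and its maximum to hi."""
    v_min, v_max = min(vector), max(vector)
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in vector]

matrix = [
    [20, 30, 40, 50],
    [200, 300, 400, 500],
]

scaled = [min_max_scale(row) for row in matrix]
print(scaled[0])
print(scaled[1])
print(min_max_scale([20, 30, 40, 50], -5, 5))
```

Both rows scale to the same values because each element sits at the same relative position between its row's minimum and maximum.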
| |
| === Standardization |
| |
| The `standardize` function scales a vector so that it has a |
| mean of 0 and a standard deviation of 1. |
| |
| Below is a plot of a sine wave, with an amplitude of 1, before and |
| after it has been standardized. |
| |
| image::images/math-expressions/standardize.png[] |
| |
| Below is a simple example of a standardized matrix. |
| Notice that once brought into the same scale the vectors are the same. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=standardize(c)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| -1.161895003862225, |
| -0.3872983346207417, |
| 0.3872983346207417, |
| 1.161895003862225 |
| ], |
| [ |
| -1.1618950038622249, |
| -0.38729833462074165, |
| 0.38729833462074165, |
| 1.1618950038622249 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 17 |
| } |
| ] |
| } |
| } |
| ---- |
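
The values in this response are consistent with subtracting the row mean and dividing by the sample standard deviation (the `n - 1` form). The Python sketch below reproduces them on that assumption; the helper name is hypothetical.

```python
from statistics import mean, stdev

def standardize(vector):
    """Shift to mean 0 and scale to a sample standard deviation of 1."""
    mu, sigma = mean(vector), stdev(vector)
    return [(v - mu) / sigma for v in vector]

matrix = [
    [20, 30, 40, 50],
    [200, 300, 400, 500],
]

standardized = [standardize(row) for row in matrix]
print(standardized[0])
print(standardized[1])
```

The two rows agree to within floating-point rounding, matching the small differences in the last digits of the response above.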
| |
| === Unit Vectors |
| |
| The `unitize` function scales vectors to a magnitude of 1. A vector with a |
| magnitude of 1 is known as a unit vector. Unit vectors are preferred |
| when the vector math deals with vector direction rather than magnitude. |
| |
| Below is a plot of a sine wave, with an amplitude of 1, before and |
| after it has been unitized. |
| |
| image::images/math-expressions/unitize.png[] |
| |
| Below is a simple example of a unitized matrix. |
| Notice that once brought into the same scale the vectors are the same. |
| |
| [source,text] |
| ---- |
| let(a=array(20, 30, 40, 50), |
| b=array(200, 300, 400, 500), |
| c=matrix(a, b), |
| d=unitize(c)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "d": [ |
| [ |
| 0.2721655269759087, |
| 0.40824829046386296, |
| 0.5443310539518174, |
| 0.6804138174397716 |
| ], |
| [ |
| 0.2721655269759087, |
| 0.4082482904638631, |
| 0.5443310539518174, |
| 0.6804138174397717 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 6 |
| } |
| ] |
| } |
| } |
| ---- |
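
Unitizing divides each element of a row by the row's Euclidean length. The following Python sketch, with a hypothetical helper name, reproduces the values above and assumes the vector is not all zeros.

```python
import math

def unitize(vector):
    """Divide each element by the vector's Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

matrix = [
    [20, 30, 40, 50],
    [200, 300, 400, 500],
]

unit_rows = [unitize(row) for row in matrix]
print(unit_rows[0])
print(unit_rows[1])
```

Because the second row is the first row multiplied by 10, both point in the same direction and produce the same unit vector, apart from floating-point rounding.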