= Machine Learning
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section of the math expressions user guide covers machine learning
functions.
== Distance and Distance Matrices
The `distance` function computes the distance between two numeric arrays or a distance matrix for the columns of a matrix.
There are six distance measure functions that return a function that performs the actual distance calculation:
* `euclidean` (default)
* `manhattan`
* `canberra`
* `earthMovers`
* `cosine`
* `haversineMeters` (Geospatial distance measure)
The distance measure functions can be used with all machine learning functions
that support distance measures.
Below is an example for computing Euclidean distance for two numeric arrays:
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(21, 29, 41, 49),
    c=distance(a, b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "c": 2
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
----
The example below calculates the same distance using the Manhattan distance measure.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(21, 29, 41, 49),
    c=distance(a, b, manhattan()))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "c": 4
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 1
      }
    ]
  }
}
----
=== Distance Matrices
Distance matrices are powerful tools for visualizing the distance between two or more vectors.
The `distance` function builds a distance matrix if a matrix is passed as the parameter. The distance matrix is computed for the *columns* of the matrix.
The example below demonstrates the power of distance matrices combined with two-dimensional faceting.
In this example the `facet2D` function is used to generate a two-dimensional facet aggregation
over the fields `complaint_type_s` and `zip_s` from the `nyc311` complaints database.
The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated.
The result is a stream of tuples each containing the fields `complaint_type_s`, `zip_s` and the count for the pair.
The `pivot` function is then used to pivot the fields into a *matrix* with the `zip_s`
field as the *rows* and the `complaint_type_s` field as the *columns*. The `count(*)` field populates
the values in the cells of the matrix.
The `distance` function is then used to compute the distance matrix for the columns
of the matrix using `cosine` distance. This produces a distance matrix
that shows distance between complaint types based on the zip codes they appear in.
Finally the `zplot` function is used to plot the distance matrix as a heat map. Notice that the
heat map has been configured so that the intensity of color increases as the distance between vectors
decreases.
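A sketch of an expression along these lines, using the collection and field names described above, is shown below. The exact facet dimensions and `zplot` parameters used for the screenshot may differ; treat this as an illustrative outline rather than the expression behind the image:
[source,text]
----
let(a=facet2D(nyc311, q="*:*", x="complaint_type_s", y="zip_s", dimensions="20,25", count(*)),
    b=pivot(a, zip_s, complaint_type_s, count(*)),
    c=distance(b, cosine()),
    d=zplot(heat=c))
----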
image::images/math-expressions/distance.png[]
The heat map is interactive, so mousing over one of the cells pops up the values
for the cell.
image::images/math-expressions/distanceview.png[]
Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a cosine distance of .1 (rounded to the nearest
tenth).
== K-Nearest Neighbor (KNN)
The `knn` function searches the rows of a matrix with a search vector and
returns a matrix of the k-nearest neighbors. This allows for secondary vector
searches over result sets.
The `knn` function supports changing of the distance measure by providing one of the following
distance measure functions:
* `euclidean` (default)
* `manhattan`
* `canberra`
* `earthMovers`
* `cosine`
* `haversineMeters` (Geospatial distance measure)
The example below shows how to perform a secondary search over an aggregation
result set. The goal of the example is to find zip codes in the nyc311 complaint
database that have similar complaint types to the zip code 10280.
The first step in the example is to use the `facet2D` function to perform a two
dimensional aggregation over the `zip_s` and `complaint_type_s` fields. In the example
the top 119 zip codes and top 5 complaint types for each zip code are calculated
for the borough of Manhattan. The result is a list of tuples each containing
the `zip_s`, `complaint_type_s` and the `count(*)` for the combination.
The list of tuples is then *pivoted* into a matrix with the `pivot` function.
The `pivot` function in this example returns a matrix with rows of zip codes
and columns of complaint types.
The `count(*)` field from the tuples populates the cells of the matrix.
This matrix will be used as the secondary search matrix.
The next step is to locate the vector for the 10280 zip code.
This is done in three steps in the example.
The first step is to retrieve the row labels from the matrix with the `getRowLabels` function.
The row labels in this case are zip codes which were populated by the `pivot` function.
Then the `indexOf` function is used to find the *index* of the "10280" zip code in the list of row labels.
The `rowAt` function is then used to return the vector at that *index* from the matrix.
This vector is the *search vector*.
Now that we have a matrix and search vector we can use the `knn` function to perform the search.
In the example the `knn` function searches the matrix with the search vector, with a K of 5, using
*cosine* distance. Cosine distance is useful for comparing sparse vectors, as is the case in this
example. The `knn` function returns a matrix with the top 5 nearest neighbors to the search vector.
The `knn` function populates the row and column labels of the return matrix and
also adds a vector of *distances* for each row as an attribute to the matrix.
In the example the `zplot` function extracts the row labels and
the distance vector with the `getRowLabels` and `getAttribute` functions.
The `topFeatures` function is used to extract
the top 5 column labels for each zip code vector, based on the counts for each
column. Then `zplot` outputs the data in a format that can be visualized in
a table with Zeppelin-Solr.
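A sketch of the expression, following the steps described above, is shown below. The Manhattan filter query field, the `distances` attribute key, and the `zplot` output parameter names are illustrative assumptions and may differ from the expression used for the screenshot:
[source,text]
----
let(a=facet2D(nyc311, q="borough_s:MANHATTAN", x="zip_s", y="complaint_type_s", dimensions="119,5", count(*)),
    b=pivot(a, zip_s, complaint_type_s, count(*)),
    labels=getRowLabels(b),
    i=indexOf(labels, "10280"),
    vec=rowAt(b, i),
    neighbors=knn(b, vec, 5, cosine()),
    zips=getRowLabels(neighbors),
    dists=getAttribute(neighbors, "distances"),
    complaints=topFeatures(neighbors, 5),
    table=zplot(zipCodes=zips, complaints=complaints, distances=dists))
----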
image::images/math-expressions/knn.png[]
The table above shows each zip code returned by the `knn` function along
with the list of complaints and distances. These are the zip codes that are most similar
to the 10280 zip code based on their top 5 complaint types.
== K-Nearest Neighbor Regression
K-nearest neighbor regression is a non-linear, bivariate and multivariate regression method.
KNN regression is a lazy learning technique, which means it does not fit a model to the training set in advance. Instead the
entire training set of observations and outcomes is held in memory and predictions are made
by averaging the outcomes of the k-nearest neighbors.
The `knnRegress` function is used to perform nearest neighbor regression.
=== 2D Non-Linear Regression
The example below shows the *regression plot* for KNN regression applied to a 2D scatter plot.
In this example the `random` function is used to draw 500 random samples from the `logs` collection
containing two fields, `filesize_d` and `eresponse_d`. The sample is then vectorized, with the
`filesize_d` values stored in a vector assigned to the variable `x` and the `eresponse_d` values stored in
the variable `y`. The `knnRegress` function is then applied with `20` as the nearest neighbor parameter,
which returns a KNN regression function that can be used to predict values.
The `predict` function is then called on the KNN function to predict values for the original `x` vector.
Finally `zplot` is used to plot the original `x` and `y` vectors along with the predictions.
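A sketch of the expression described above, assuming the `logs` collection and field names from the description (the `zplot` parameter names for the plot are illustrative), might look like this:
[source,text]
----
let(a=random(logs, q="*:*", rows="500", fl="filesize_d, eresponse_d"),
    x=col(a, filesize_d),
    y=col(a, eresponse_d),
    lazyModel=knnRegress(x, y, 20),
    pred=predict(lazyModel, x),
    plot=zplot(x=x, y=y, pred=pred))
----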
image::images/math-expressions/knnRegress.png[]
Notice that the regression plot shows a non-linear relationship between the `filesize_d`
field and the `eresponse_d` field. Also note that KNN regression
plots a non-linear curve through the scatter plot. The larger the size
of K (nearest neighbors), the smoother the line.
=== Multivariate Non-Linear Regression
The `knnRegress` function is also a powerful and flexible tool for
multi-variate non-linear regression.
In the example below a multi-variate regression is performed using
a database designed for analyzing and predicting wine quality. The
database contains nearly 1600 records with 9 predictors of wine quality:
pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide,
volatile_acidity, citric_acid, residual_sugar. There is also a field
called `quality`, a rating assigned to each wine, ranging from 3 to 8.
KNN regression can be used to predict wine quality for vectors containing
the predictor values.
In the example a search is performed on the `redwine` collection to
return all the rows in the database of observations. Then the quality field and
predictor fields are read into vectors and set to variables.
The predictor variables are added as rows to a matrix which is
transposed so each row in the matrix contains one observation with the 9
predictor values.
This is our observation matrix which is assigned to the variable `obs`.
Then the `knnRegress` function regresses the observations with quality outcomes.
The value for K is set to 5 in the example, so the average quality of the 5
nearest neighbors will be used to calculate the quality.
The `predict` function is then used to generate a vector of predictions
for the entire observation set. These predictions will be used to determine
how well the KNN regression performed over the observation data.
The errors, or *residuals*, for the regression are then calculated by
subtracting the *predicted* quality from the *observed* quality.
The `ebeSubtract` function is used to perform the element-by-element
subtraction between the two vectors.
Finally the `zplot` function formats the predictions and errors for
the visualization of the *residual plot*.
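A sketch of the full expression, assuming the `redwine` collection and the predictor and `quality` field names described above (the `rows` value and the final `zplot` parameter names are illustrative), might look like this:
[source,text]
----
let(a=search(redwine, q="*:*", rows="1600",
             fl="pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide, volatile_acidity, citric_acid, residual_sugar, quality"),
    quality=col(a, quality),
    pH=col(a, pH),
    alcohol=col(a, alcohol),
    fixed_acidity=col(a, fixed_acidity),
    sulphates=col(a, sulphates),
    density=col(a, density),
    free_sulfur_dioxide=col(a, free_sulfur_dioxide),
    volatile_acidity=col(a, volatile_acidity),
    citric_acid=col(a, citric_acid),
    residual_sugar=col(a, residual_sugar),
    obs=transpose(matrix(pH, alcohol, fixed_acidity, sulphates, density,
                         free_sulfur_dioxide, volatile_acidity, citric_acid, residual_sugar)),
    model=knnRegress(obs, quality, 5),
    pred=predict(model, obs),
    err=ebeSubtract(quality, pred),
    plot=zplot(pred=pred, err=err))
----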
image::images/math-expressions/redwine1.png[]
The residual plot plots the *predicted* values on the x-axis and the *error* for the
prediction on the y-axis. The scatter plot shows how the errors
are distributed across the full range of predictions.
The residual plot can be interpreted to understand how the KNN regression performed on the
training data.
* The plot shows the prediction error appears to be fairly evenly distributed
above and below zero. The density of the errors increases as the errors approach zero. The
bubble size reflects the density of errors at the specific point in the plot.
This provides an intuitive feel for the distribution of the model's error.
* The plot also visualizes the variance of the error across the range of
predictions. This provides an intuitive understanding of whether the KNN predictions
will have similar error variance across the full range of predictions.
The residuals can also be visualized using a histogram to better understand
the shape of the residuals distribution. The example below shows the same KNN
regression as above with a plot of the distribution of the errors.
In the example the `zplot` function is used to plot the `empiricalDistribution`
function of the residuals, with an 11-bin histogram.
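Reusing the `quality` and `pred` variables from the sketch above (the `...` stands for the earlier assignments, and the `dist` parameter of `zplot` is an assumption), only the final steps change:
[source,text]
----
let(...,
    err=ebeSubtract(quality, pred),
    plot=zplot(dist=empiricalDistribution(err, 11)))
----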
image::images/math-expressions/redwine2.png[]
Notice that the errors follow a bell curve centered close to 0. From this plot
we can see the probability of getting prediction errors between -1 and 1 is quite high.
*Additional KNN Regression Parameters*
The `knnRegress` function has three additional parameters that make it suitable for many different regression scenarios.
. Any of the distance measures can be used for the regression simply by adding the function to the call.
This allows for regression analysis over sparse vectors (`cosine`), dense vectors and geo-spatial lat/lon vectors (`haversineMeters`).
+
Sample syntax:
+
[source,text]
----
r=knnRegress(obs, quality, 5, cosine()),
----
. The `robust` named parameter can be used to perform a regression analysis that is robust to outliers in the outcomes.
When the `robust` parameter is used the median outcome of the k-nearest neighbors is used rather than the average.
+
Sample syntax:
+
[source,text]
----
r=knnRegress(obs, quality, 5, robust="true"),
----
. The `scale` named parameter can be used to scale the columns of the observations and search vectors
at prediction time. This can improve the performance of the KNN regression when the feature columns
are at different scales, causing the distance calculations to place too much weight on the larger columns.
+
Sample syntax:
+
[source,text]
----
r=knnRegress(obs, quality, 5, scale="true"),
----
== knnSearch
The `knnSearch` function returns the k-nearest neighbors
for a document based on text similarity.
Under the covers the `knnSearch` function uses Solr's <<other-parsers.adoc#more-like-this-query-parser,More Like This>> query parser plugin.
This capability uses the search engine's query, term statistics, scoring, and ranking features to perform a fast nearest neighbor search for similar documents over large, distributed indexes.
The results of this search can be used directly or can provide *candidates* for machine learning operations such as a secondary KNN vector search.
The example below shows the `knnSearch` function on a movie reviews data set. The search returns the 50 documents most similar to a specific document ID (`83e9b5b0...`) based on the similarity of the `review_t` field.
The `mindf` and `maxdf` parameters specify the minimum and maximum document frequency of the terms used to perform the search.
These parameters can make the query faster by eliminating high frequency terms and can also improve accuracy by removing noise terms from the search.
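A sketch of the `knnSearch` call is shown below. Only `mindf` and `maxdf` are taken from the description above; the other parameter names and all of the values are illustrative assumptions, and the document ID is shown truncated as in the description:
[source,text]
----
knnSearch(reviews,
          id="83e9b5b0...",
          qf="review_t",
          k="50",
          fl="id, review_t",
          mindf="3",
          maxdf=".30")
----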
image::images/math-expressions/knnSearch.png[]
NOTE: In this example the `select`
function is used to truncate the review in the output to 220 characters to make it easier
to read in a table.
== DBSCAN
DBSCAN clustering is a powerful density-based clustering algorithm which is particularly well suited for geospatial clustering.
DBSCAN uses two parameters to filter result sets to clusters of specific density:
* `eps` (Epsilon): The maximum distance between two points for them to be considered neighbors.
* `min` points: The minimum number of points needed in a cluster for it to be returned.
=== 2D Cluster Visualization
The `zplot` function has direct support for plotting 2D clusters by using the `clusters` named parameter.
The example below uses DBSCAN clustering and cluster visualization to find
the *hot spots* on a map for rat sightings in the NYC 311 complaints database.
In this example the `random` function draws a sample of records from the `nyc311` collection where
the complaint description matches "rat sighting" and latitude is populated in the record.
The latitude and longitude fields are then vectorized and added as rows to a matrix.
The matrix is transposed so each row contains a single latitude, longitude
point.
The `dbscan` function is then used to cluster the latitude and longitude points.
Notice that the `dbscan` function in the example has four parameters.
* `obs`: The observation matrix of lat/lon points.
* `eps`: The maximum distance between points for them to be considered part of the same cluster; 100 meters in the example.
* `min points`: The minimum number of points in a cluster for the cluster to be returned by the function; `5` in the example.
* `distance measure`: An optional distance measure used to determine the distance between points. The default is Euclidean distance. The example uses `haversineMeters`, which returns the distance in meters, which is much more meaningful for geospatial use cases.
Finally, the `zplot` function is used to visualize the clusters on a map with Zeppelin-Solr.
The map below has been zoomed to a specific area of Brooklyn with a high density of rat sightings.
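A sketch of the expression, assuming illustrative field names for the complaint description, latitude, and longitude (`descriptor_s`, `lat_d`, `lon_d`), might look like this:
[source,text]
----
let(a=random(nyc311, q="descriptor_s:\"rat sighting\" AND lat_d:*", rows="5000", fl="lat_d, lon_d"),
    lat=col(a, lat_d),
    lon=col(a, lon_d),
    obs=transpose(matrix(lat, lon)),
    clusters=dbscan(obs, 100, 5, haversineMeters()),
    plot=zplot(clusters=clusters))
----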
image::images/math-expressions/dbscan1.png[]
Notice in the visualization that only 1019 points were returned from the 5000 samples.
This is the power of the DBSCAN algorithm to filter records that don't match the criteria
of a cluster. The points that are plotted all belong to clearly defined clusters.
The map visualization can be zoomed further to explore the locations of specific clusters.
The example below shows a zoom into an area of dense clusters.
image::images/math-expressions/dbscan2.png[]
== K-Means Clustering
The `kmeans` function performs k-means clustering of the rows of a matrix.
Once the clustering has been completed there are a number of useful functions available
for examining and visualizing the clusters and centroids.
=== Clustered Scatter Plot
In this example we'll again be clustering 2D lat/lon points of rat sightings. But unlike the DBSCAN example, k-means clustering
does not on its own perform any noise reduction. So in order to reduce the noise, a smaller random sample is selected from the data than was used
for the DBSCAN example.
We'll see that sampling itself is a powerful noise reduction tool which helps visualize the cluster density.
This is because there is a higher probability that samples will be drawn from higher density clusters and a lower
probability that samples will be drawn from lower density clusters.
In this example the `random` function draws a sample of 1500 records from the `nyc311` (complaints database) collection where
the complaint description matches "rat sighting" and latitude is populated in the record. The latitude and longitude fields
are then vectorized and added as rows to a matrix. The matrix is transposed so each row contains a single latitude, longitude
point. The `kmeans` function is then used to cluster the latitude and longitude points into 21 clusters.
Finally, the `zplot` function is used to visualize the clusters as a scatter plot.
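A sketch of the expression, using the same illustrative field names as the DBSCAN sketch above, might look like this:
[source,text]
----
let(a=random(nyc311, q="descriptor_s:\"rat sighting\" AND lat_d:*", rows="1500", fl="lat_d, lon_d"),
    lat=col(a, lat_d),
    lon=col(a, lon_d),
    obs=transpose(matrix(lat, lon)),
    clusters=kmeans(obs, 21),
    plot=zplot(clusters=clusters))
----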
image::images/math-expressions/2DCluster1.png[]
The scatter plot above shows each lat/lon point plotted on a Euclidean plane, with longitude on the
x-axis and latitude on the y-axis. The plot is dense enough that the outlines of the different boroughs are visible
if you know the boroughs of New York City.
Each cluster is shown in a different color. This plot provides interesting
insight into the densities of rat sightings throughout the five boroughs of New York City. For
example, it highlights a dense cluster of sightings in Brooklyn at cluster1,
surrounded by less dense but still high-activity clusters.
=== Plotting the Centroids
The centroids of each cluster can then be plotted on a map to visualize the center of the
clusters. In the example below the centroids are extracted from the clusters using the `getCentroids`
function, which returns a matrix of the centroids.
The centroids matrix contains 2D lat/lon points. The `colAt` function can then be used
to extract the latitude and longitude columns by index from the matrix so they can be
plotted with `zplot`. A map visualization is used below to display the centroids.
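A sketch of the expression, extending the clustering sketch above (the `zplot` parameter names for the map visualization are assumptions), might look like this:
[source,text]
----
let(a=random(nyc311, q="descriptor_s:\"rat sighting\" AND lat_d:*", rows="1500", fl="lat_d, lon_d"),
    obs=transpose(matrix(col(a, lat_d), col(a, lon_d))),
    clusters=kmeans(obs, 21),
    centroids=getCentroids(clusters),
    lat=colAt(centroids, 0),
    lon=colAt(centroids, 1),
    plot=zplot(lat=lat, lon=lon))
----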
image::images/math-expressions/centroidplot.png[]
The map can then be zoomed to get a closer look at the centroids in the high density areas shown
in the cluster scatter plot.
image::images/math-expressions/centroidzoom.png[]
=== Phrase Extraction
K-means clustering produces centroids or *prototype* vectors which can be used to represent
each cluster. In this example the key features of the centroids are extracted
to represent the key phrases for clusters of TF-IDF term vectors.
NOTE: The example below works with TF-IDF _term vectors_.
The section <<term-vectors.adoc#,Text Analysis and Term Vectors>> offers
a full explanation of these features.
In the example the `search` function returns documents where the `review_t` field matches the phrase "star wars".
The `select` function is run over the result set and applies the `analyze` function
which uses the Lucene/Solr analyzer attached to the schema field `text_bigrams` to re-analyze the `review_t`
field. This analyzer returns bigrams which are then annotated to documents in a field called `terms`.
The `termVectors` function then creates TF-IDF term vectors from the bigrams stored in the `terms` field.
The `kmeans` function is then used to cluster the bigram term vectors into 5 clusters.
Finally the top 5 features are extracted from the centroids and returned.
Notice that the features are all bigram phrases with semantic significance.
[source,text]
----
let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
             id,
             analyze(review_t, text_bigrams) as terms),
    vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
    clusters=kmeans(vectors, 5),
    centroids=getCentroids(clusters),
    phrases=topFeatures(centroids, 5))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "phrases": [
          [
            "empire strikes",
            "rebel alliance",
            "princess leia",
            "luke skywalker",
            "phantom menace"
          ],
          [
            "original star",
            "main characters",
            "production values",
            "anakin skywalker",
            "luke skywalker"
          ],
          [
            "carrie fisher",
            "original films",
            "harrison ford",
            "luke skywalker",
            "ian mcdiarmid"
          ],
          [
            "phantom menace",
            "original trilogy",
            "harrison ford",
            "john williams",
            "empire strikes"
          ],
          [
            "science fiction",
            "fiction films",
            "forbidden planet",
            "character development",
            "worth watching"
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 46
      }
    ]
  }
}
----
== Multi K-Means Clustering
K-means clustering will produce different outcomes depending on
the initial placement of the centroids. K-means is fast enough
that multiple trials can be performed so that the best outcome can be selected.
The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
best result based on which trial produces the lowest intra-cluster variance.
The example below is identical to the phrase extraction example except that it uses `multiKmeans` with 15 trials,
rather than a single trial of the `kmeans` function.
[source,text]
----
let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
             id,
             analyze(review_t, text_bigrams) as terms),
    vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
    clusters=multiKmeans(vectors, 5, 15),
    centroids=getCentroids(clusters),
    phrases=topFeatures(centroids, 5))
----
This expression returns the following response:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "phrases": [
          [
            "science fiction",
            "original star",
            "production values",
            "fiction films",
            "forbidden planet"
          ],
          [
            "empire strikes",
            "princess leia",
            "luke skywalker",
            "phantom menace"
          ],
          [
            "carrie fisher",
            "harrison ford",
            "luke skywalker",
            "empire strikes",
            "original films"
          ],
          [
            "phantom menace",
            "original trilogy",
            "harrison ford",
            "character development",
            "john williams"
          ],
          [
            "rebel alliance",
            "empire strikes",
            "princess leia",
            "original trilogy",
            "luke skywalker"
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 84
      }
    ]
  }
}
----
== Fuzzy K-Means Clustering
The `fuzzyKmeans` function is a soft clustering algorithm which
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
is a value between `1` and `2` that determines how fuzzy to make the cluster assignment.
After the clustering has been performed the `getMembershipMatrix` function can be called
on the clustering result to return a matrix describing the probabilities
of cluster membership for each vector.
This matrix can be used to understand relationships between clusters.
In the example below `fuzzyKmeans` is used to cluster the movie reviews matching the phrase "star wars".
But instead of looking at the clusters or centroids, the `getMembershipMatrix` is used to return the
membership probabilities for each document. The membership matrix contains a row for each
vector that was clustered and a column for each cluster.
The values in the matrix contain the probability that a specific vector belongs to a specific cluster.
In the example the `distance` function is then used to create a *distance matrix* from the columns of the
membership matrix. The distance matrix is then visualized with the `zplot` function as a heat map.
In the example `cluster1` and `cluster5` have the shortest distance between the clusters.
Further analysis of the features in both clusters can be performed to understand
the relationship between `cluster1` and `cluster5`.
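A sketch of the expression, reusing the term vectors from the phrase extraction example (the `fuzziness` value and the `heat` parameter for `zplot` are illustrative), might look like this:
[source,text]
----
let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
             id,
             analyze(review_t, text_bigrams) as terms),
    vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
    clusters=fuzzyKmeans(vectors, 5, fuzziness=1.25),
    membership=getMembershipMatrix(clusters),
    d=distance(membership),
    heatmap=zplot(heat=d))
----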
image::images/math-expressions/fuzzyk.png[]
NOTE: The heat map has been configured to increase in color intensity as the distance shortens.
== Feature Scaling
Before performing machine learning operations it's often necessary to
scale the feature vectors so they can be compared at the same scale.
All the scaling functions below operate on vectors and matrices.
When operating on a matrix the rows of the matrix are scaled.
=== Min/Max Scaling
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
By default it will scale between `0` and `1` if min/max values are not provided.
Below is a plot of a sine wave, with an amplitude of 1, before and
after it has been scaled between -5 and 5.
image::images/math-expressions/minmaxscale.png[]
Below is a simple example of min/max scaling of a matrix between 0 and 1.
Notice that once brought into the same scale the vectors are the same.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=minMaxScale(c))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "d": [
          [
            0,
            0.3333333333333333,
            0.6666666666666666,
            1
          ],
          [
            0,
            0.3333333333333333,
            0.6666666666666666,
            1
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
----
=== Standardization
The `standardize` function scales a vector so that it has a
mean of 0 and a standard deviation of 1.
Below is a plot of a sine wave, with an amplitude of 1, before and
after it has been standardized.
image::images/math-expressions/standardize.png[]
Below is a simple example of a standardized matrix.
Notice that once brought into the same scale the vectors are the same.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=standardize(c))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "d": [
          [
            -1.161895003862225,
            -0.3872983346207417,
            0.3872983346207417,
            1.161895003862225
          ],
          [
            -1.1618950038622249,
            -0.38729833462074165,
            0.38729833462074165,
            1.1618950038622249
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 17
      }
    ]
  }
}
----
=== Unit Vectors
The `unitize` function scales vectors to a magnitude of 1. A vector with a
magnitude of 1 is known as a unit vector. Unit vectors are preferred
when the vector math deals with vector direction rather than magnitude.
Below is a plot of a sine wave, with an amplitude of 1, before and
after it has been unitized.
image::images/math-expressions/unitize.png[]
Below is a simple example of a unitized matrix.
Notice that once brought into the same scale the vectors are the same.
[source,text]
----
let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=unitize(c))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "d": [
          [
            0.2721655269759087,
            0.40824829046386296,
            0.5443310539518174,
            0.6804138174397716
          ],
          [
            0.2721655269759087,
            0.4082482904638631,
            0.5443310539518174,
            0.6804138174397717
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 6
      }
    ]
  }
}
----