| = Streams and Vectorization |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| This section of the user guide explores techniques |
| for retrieving streams of data from Solr and vectorizing the |
| numeric fields. |
| |
| See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to |
| vectorize text fields. |
| |
| == Streams |
| |
| Streaming Expressions has a wide range of stream sources that can be used to |
| retrieve data from Solr Cloud collections. Math expressions can be used |
| to vectorize and analyze the results sets. |
| |
| Below are some of the key stream sources: |
| |
| * *`facet`*: Multi-dimensional aggregations are a powerful tool for generating |
| co-occurrence counts for categorical data. The `facet` function uses the JSON facet API |
| under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions |
| the aggregated results can be pivoted into a co-occurance matrix which can be mined for |
| correlations and hidden similarities within the data. |
| |
| * *`random`*: Random sampling is widely used in statistics, probability and machine learning. |
| The `random` function returns a random sample of search results that match a |
| query. The random samples can be vectorized and operated on by math expressions and the results |
| can be used to describe and make inferences about the entire population. |
| |
| * *`timeseries`*: The `timeseries` |
| expression provides fast distributed time series aggregations, which can be |
| vectorized and analyzed with math expressions. |
| |
| * *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch` |
| function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in |
| a distributed index. Once the nearest neighbors are retrieved they can be vectorized |
| and operated on by machine learning and text mining algorithms. |
| |
| * *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports |
| data retrieval using a subset of SQL which includes both full text search and |
| fast distributed aggregations. The result sets can then be vectorized and operated |
| on by math expressions. |
| |
| * *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with |
| streams originating from Solr. Result sets from outside data sources can be vectorized and operated |
| on by math expressions in the same manner as result sets originating from Solr. |
| |
| * *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic` |
| function provides publish/subscribe messaging capabilities by treating |
| Solr Cloud as a distributed message queue. Topics are extremely powerful |
| because they allow subscription by query. Topics can be use to support a broad set of |
| use cases including bulk text mining operations and AI alerting. |
| |
| * *`nodes`*: Graph queries are frequently used by recommendation engines and are an important |
| machine learning tool. The `nodes` function provides fast, distributed, breadth |
| first graph traversal over documents in a Solr Cloud collection. The node sets collected |
| by the `nodes` function can be operated on by statistical and machine learning expressions to |
| gain more insight into the graph. |
| |
| * *`search`*: Ranked search results are a powerful tool for finding the most relevant |
| documents from a large document corpus. The `search` expression |
| returns the top N ranked search results that match any |
| Solr query, including geo-spatial queries. The smaller set of relevant |
| documents can then be explored with statistical, machine learning and |
| text mining expressions to gather insights about the data set. |
| |
| == Assigning Streams to Variables |
| |
| The output of any streaming expression can be set to a variable. |
| Below is a very simple example using the `random` function to fetch |
| three random samples from collection1. The random samples are returned |
| as tuples which contain name/value pairs. |
| |
| |
| [source,text] |
| ---- |
| let(a=random(collection1, q="*:*", rows="3", fl="price_f")) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "a": [ |
| { |
| "price_f": 0.7927976 |
| }, |
| { |
| "price_f": 0.060795486 |
| }, |
| { |
| "price_f": 0.55128294 |
| } |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 11 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Creating a Vector with the col Function |
| |
| The `col` function iterates over a list of tuples and copies the values |
| from a specific column into an array. |
| |
| The output of the `col` function is an numeric array that can be set to a |
| variable and operated on by math expressions. |
| |
| Below is an example of the `col` function: |
| |
| [source,text] |
| ---- |
| let(a=random(collection1, q="*:*", rows="3", fl="price_f"), |
| b=col(a, price_f)) |
| ---- |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "b": [ |
| 0.42105234, |
| 0.85237443, |
| 0.7566981 |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 9 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Applying Math Expressions to the Vector |
| |
| Once a vector has been created any math expression that operates on vectors |
| can be applied. In the example below the `mean` function is applied to |
| the vector assigned to variable *`b`*. |
| |
| [source,text] |
| ---- |
| let(a=random(collection1, q="*:*", rows="15000", fl="price_f"), |
| b=col(a, price_f), |
| c=mean(b)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "c": 0.5016035594638814 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 306 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Creating Matrices |
| |
| Matrices can be created by vectorizing multiple numeric fields |
| and adding them to a matrix. The matrices can then be operated on by |
| any math expression that operates on matrices. |
| |
| [TIP] |
| ==== |
| Note that this section deals with the creation of matrices |
| from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields. |
| ==== |
| |
| Below is a simple example where four random samples are taken |
| from different sub-populations in the data. The `price_f` field of |
| each random sample is |
| vectorized and the vectors are added as rows to a matrix. |
| Then the `sumRows` |
| function is applied to the matrix to return a vector containing |
| the sum of each row. |
| |
| [source,text] |
| ---- |
| let(a=random(collection1, q="market:A", rows="5000", fl="price_f"), |
| b=random(collection1, q="market:B", rows="5000", fl="price_f"), |
| c=random(collection1, q="market:C", rows="5000", fl="price_f"), |
| d=random(collection1, q="market:D", rows="5000", fl="price_f"), |
| e=col(a, price_f), |
| f=col(b, price_f), |
| g=col(c, price_f), |
| h=col(d, price_f), |
| i=matrix(e, f, g, h), |
| j=sumRows(i)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "j": [ |
| 154390.1293375, |
| 167434.89453, |
| 159293.258493, |
| 149773.42769, |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 9 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| == Facet Co-occurrence Matrices |
| |
| The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from |
| records stored in a Solr Cloud collection. These multi-dimension aggregations can represent co-occurrence |
| counts for the values in the dimensions. The `pivot` function can be used to move two dimensional |
| aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for |
| correlations to learn about the hidden connections within the data. |
| |
| In the example below the `facet` expression is used to generate a two dimensional faceted aggregation. |
| The first dimension is the US State that a car was purchased in and the second dimension is the car model. |
| This two dimensional facet generates the co-occurrence counts for the number of times a particular car model |
| was purchased in a particular state. |
| |
| |
| [source,text] |
| ---- |
| facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*)) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "state": "NY", |
| "model": "camry", |
| "count(*)": 13342 |
| }, |
| { |
| "state": "NJ", |
| "model": "accord", |
| "count(*)": 13002 |
| }, |
| { |
| "state": "NY", |
| "model": "civic", |
| "count(*)": 12901 |
| }, |
| { |
| "state": "CA", |
| "model": "focus", |
| "count(*)": 12892 |
| }, |
| { |
| "state": "TX", |
| "model": "f150", |
| "count(*)": 12871 |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 171 |
| } |
| ] |
| } |
| } |
| ---- |
| |
| The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below |
| The `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the |
| columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*)) |
| from the facet results. Once the co-occurrence matrix has been created the US States can be clustered |
| by car model, or the matrix can be transposed and car models can be clustered by the US States |
| where they were bought. |
| |
| [source,text] |
| ---- |
| let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)), |
| b=pivot(a, state, model, count(*)), |
| c=kmeans(b, 7)) |
| ---- |
| |
| == Latitude / Longitude Vectors |
| |
| The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into |
| a matrix of lat/long vectors. Each row in the matrix is a vector that contains the lat/long |
| pair for the corresponding tuple in the list. The row labels for the matrix are |
| automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated |
| on by distance-based machine learning functions using the `haversineMeters` distance measure. |
| |
| The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called |
| `field`, which tells the `latlonVectors` function which field to parse the lat/lon |
| vectors from. |
| |
| Below is an example of the `latlonVectors`. |
| |
| [source,text] |
| ---- |
| let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"), |
| b=latlonVectors(a, field="loc_p")) |
| ---- |
| |
| When this expression is sent to the `/stream` handler it responds with: |
| |
| [source,json] |
| ---- |
| { |
| "result-set": { |
| "docs": [ |
| { |
| "b": [ |
| [ |
| 42.87183530723629, |
| 76.74102353397778 |
| ], |
| [ |
| 42.91372904094898, |
| 76.72874889228416 |
| ], |
| [ |
| 42.911528804897564, |
| 76.70537292977619 |
| ], |
| [ |
| 42.91143870500213, |
| 76.74749913047408 |
| ], |
| [ |
| 42.904666267479705, |
| 76.73933236046092 |
| ] |
| ] |
| }, |
| { |
| "EOF": true, |
| "RESPONSE_TIME": 21 |
| } |
| ] |
| } |
| } |
| ---- |