= Streams and Vectorization
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section of the user guide explores techniques
for retrieving streams of data from Solr and vectorizing the
numeric fields.

See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>>, which describes how to
vectorize text fields.
== Streams
Streaming Expressions has a wide range of stream sources that can be used to
retrieve data from Solr Cloud collections. Math expressions can be used
to vectorize and analyze the result sets.

Below are some of the key stream sources; minimal sketches of several of them follow the list:
* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
the aggregated results can be pivoted into a co-occurrence matrix which can be mined for
correlations and hidden similarities within the data.
* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
The `random` function returns a random sample of search results that match a
query. The random samples can be vectorized and operated on by math expressions and the results
can be used to describe and make inferences about the entire population.
* *`timeseries`*: The `timeseries`
expression provides fast distributed time series aggregations, which can be
vectorized and analyzed with math expressions.
* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
a distributed index. Once the nearest neighbors are retrieved they can be vectorized
and operated on by machine learning and text mining algorithms.
* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
data retrieval using a subset of SQL which includes both full text search and
fast distributed aggregations. The result sets can then be vectorized and operated
on by math expressions.
* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
streams originating from Solr. Result sets from outside data sources can be vectorized and operated
on by math expressions in the same manner as result sets originating from Solr.
* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
function provides publish/subscribe messaging capabilities by treating
Solr Cloud as a distributed message queue. Topics are extremely powerful
because they allow subscription by query. Topics can be used to support a broad set of
use cases including bulk text mining operations and AI alerting.
* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
machine learning tool. The `nodes` function provides fast, distributed, breadth-first
graph traversal over documents in a Solr Cloud collection. The node sets collected
by the `nodes` function can be operated on by statistical and machine learning expressions to
gain more insight into the graph.
* *`search`*: Ranked search results are a powerful tool for finding the most relevant
documents from a large document corpus. The `search` expression
returns the top N ranked search results that match any
Solr query, including geo-spatial queries. The smaller set of relevant
documents can then be explored with statistical, machine learning and
text mining expressions to gather insights about the data set.
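
Below are minimal sketches of three of these sources. The sketches are illustrative only: the collection name `collection1` and the fields `rec_dt`, `body`, `title`, and `price_f` are placeholders, so substitute the collection and fields from your own index.

[source,text]
----
timeseries(collection1, q="*:*", field="rec_dt",
           start="2019-01-01T00:00:00Z", end="2020-01-01T00:00:00Z",
           gap="+1MONTH", count(*))

knnSearch(collection1, id="doc1", qf="body", rows="10", fl="id, title")

search(collection1, q="*:*", fl="id, price_f", sort="price_f desc", rows="10")
----

Each expression can be sent to the `/stream` handler on its own and returns tuples in the same form as the examples in the sections that follow.
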
== Assigning Streams to Variables
The output of any streaming expression can be set to a variable.
Below is a very simple example using the `random` function to fetch
three random samples from `collection1`. The random samples are returned
as tuples which contain name/value pairs.
[source,text]
----
let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "a": [
          {
            "price_f": 0.7927976
          },
          {
            "price_f": 0.060795486
          },
          {
            "price_f": 0.55128294
          }
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 11
      }
    ]
  }
}
----
== Creating a Vector with the col Function
The `col` function iterates over a list of tuples and copies the values
from a specific column into an array.
The output of the `col` function is a numeric array that can be set to a
variable and operated on by math expressions.
Below is an example of the `col` function:
[source,text]
----
let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
    b=col(a, price_f))
----
When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "b": [
          0.42105234,
          0.85237443,
          0.7566981
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 9
      }
    ]
  }
}
----
== Applying Math Expressions to the Vector
Once a vector has been created, any math expression that operates on vectors
can be applied. In the example below the `mean` function is applied to
the vector assigned to variable *`b`*.
[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    b=col(a, price_f),
    c=mean(b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "c": 0.5016035594638814
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 306
      }
    ]
  }
}
----
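
The same pattern extends to any of the vector math functions. As a small variation on the example above, the sketch below also applies the `describe` function, which returns a set of descriptive statistics (min, max, mean, standard deviation, and so on) for the vector:

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    b=col(a, price_f),
    c=mean(b),
    d=describe(b))
----
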
== Creating Matrices
Matrices can be created by vectorizing multiple numeric fields
and adding the vectors as rows of a matrix. The matrices can then be operated on by
any math expression that operates on matrices.
[TIP]
====
Note that this section deals with the creation of matrices
from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
====
Below is a simple example where four random samples are taken
from different sub-populations in the data. The `price_f` field of
each random sample is
vectorized and the vectors are added as rows to a matrix.
Then the `sumRows`
function is applied to the matrix to return a vector containing
the sum of each row.
[source,text]
----
let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
b=random(collection1, q="market:B", rows="5000", fl="price_f"),
c=random(collection1, q="market:C", rows="5000", fl="price_f"),
d=random(collection1, q="market:D", rows="5000", fl="price_f"),
e=col(a, price_f),
f=col(b, price_f),
g=col(c, price_f),
h=col(d, price_f),
i=matrix(e, f, g, h),
j=sumRows(i))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "j": [
          154390.1293375,
          167434.89453,
          159293.258493,
          149773.42769
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 9
      }
    ]
  }
}
----
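
Other matrix functions can be applied in the same way. For example, the sketch below, which reuses the assumed `market` queries from the example above, transposes the matrix with the `transpose` function and sums every value in it with the `grandSum` function:

[source,text]
----
let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
    b=random(collection1, q="market:B", rows="5000", fl="price_f"),
    c=col(a, price_f),
    d=col(b, price_f),
    e=matrix(c, d),
    f=transpose(e),
    g=grandSum(e))
----
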
== Facet Co-occurrence Matrices
The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
records stored in a Solr Cloud collection. These multi-dimension aggregations can represent co-occurrence
counts for the values in the dimensions. The `pivot` function can be used to move two-dimensional
aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
correlations to learn about the hidden connections within the data.

In the example below the `facet` expression is used to generate a two-dimensional faceted aggregation.
The first dimension is the US State that a car was purchased in and the second dimension is the car model.
This two-dimensional facet generates the co-occurrence counts for the number of times a particular car model
was purchased in a particular state.
[source,text]
----
facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "state": "NY",
        "model": "camry",
        "count(*)": 13342
      },
      {
        "state": "NJ",
        "model": "accord",
        "count(*)": 13002
      },
      {
        "state": "NY",
        "model": "civic",
        "count(*)": 12901
      },
      {
        "state": "CA",
        "model": "focus",
        "count(*)": 12892
      },
      {
        "state": "TX",
        "model": "f150",
        "count(*)": 12871
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 171
      }
    ]
  }
}
----
The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below,
the `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
by car model, or the matrix can be transposed and car models can be clustered by the US States
where they were bought.
[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
    b=pivot(a, state, model, count(*)),
    c=kmeans(b, 7))
----
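
Once the `kmeans` function has run, the clusters can be inspected. The sketch below is a minimal extension of the example above: it retrieves the cluster centroids with the `getCentroids` function and the rows assigned to the first cluster with the `getCluster` function.

[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
    b=pivot(a, state, model, count(*)),
    c=kmeans(b, 7),
    d=getCentroids(c),
    e=getCluster(c, 0))
----
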
== Latitude / Longitude Vectors
The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon
pair for the corresponding tuple in the list. The row labels for the matrix are
automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
on by distance-based machine learning functions using the `haversineMeters` distance measure.
The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
`field`, which tells the `latlonVectors` function which field to parse the lat/lon
vectors from.
Below is an example of the `latlonVectors` function:
[source,text]
----
let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
b=latlonVectors(a, field="loc_p"))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
  "result-set": {
    "docs": [
      {
        "b": [
          [
            42.87183530723629,
            76.74102353397778
          ],
          [
            42.91372904094898,
            76.72874889228416
          ],
          [
            42.911528804897564,
            76.70537292977619
          ],
          [
            42.91143870500213,
            76.74749913047408
          ],
          [
            42.904666267479705,
            76.73933236046092
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 21
      }
    ]
  }
}
----
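
Because the rows of the matrix are lat/lon pairs labeled by document id, the matrix can be passed to distance-based clustering functions. Below is a rough sketch that extends the example above with the `dbscan` function, using the `haversineMeters` distance measure; the epsilon of 5000 meters and the minimum of 2 points per cluster are placeholder values.

[source,text]
----
let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5000"),
    b=latlonVectors(a, field="loc_p"),
    c=dbscan(b, 5000, 2, haversineMeters()))
----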