= Searching, Sampling and Aggregation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Data is the indispensable raw material of statistical analysis.
This section provides an overview of the key functions for retrieving data for
visualization and statistical analysis: searching, sampling,
and aggregation.
== Searching
=== Exploring
The `search` function can be used to search a SolrCloud collection and return a
result set.
Below is an example of the most basic `search` function called from the Zeppelin-Solr interpreter.
Zeppelin-Solr sends the `search(logs)` call to the `/stream` handler and displays the results
in *table* format.
In the example the `search` function is passed only the name of the collection being searched.
This returns a result set of 10 records with all fields.
This simple function is useful for exploring the fields in the data and understanding how to start refining the search criteria.
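The expression itself is just the function name and the collection, as sent by Zeppelin-Solr in this example:

[source,text]
----
search(logs)
----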
image::images/math-expressions/search1.png[]
=== Searching and Sorting
Once the format of the records is known, parameters can be added to the `search` function to begin analyzing the data.
In the example below a search query, field list, row limit, and sort have been added to the `search` function.
Now the search is limited to records within a specific time range and returns
a maximum result set of 750 records sorted by `tdate_dt` ascending.
We have also limited the result set to three specific fields.
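The expression behind this example might look like the following sketch (the date range is illustrative; the fields, sort, and row limit come from the description above):

[source,text]
----
search(logs,
       q="tdate_dt:[2019-04-01T00:00:00Z TO 2019-04-02T00:00:00Z]",
       fl="tdate_dt, filesize_d, response_d",
       sort="tdate_dt asc",
       rows="750")
----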
image::images/math-expressions/search-sort.png[]
Once the data is loaded into the table we can switch to a scatter plot and plot the `filesize_d` column
on the *x-axis* and the `response_d` column on the *y-axis*.
image::images/math-expressions/search-sort-plot.png[]
This allows us to quickly visualize the relationship between two variables
selected from a very specific slice of the index.
=== Scoring
The `search` function will score and rank documents when a query is performed on
a text field. The example below shows the scoring and ranking of results.
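Such a query might look like the following sketch (the text field `error_t` and the query terms are hypothetical; including `score` in the field list and sorting on it returns the highest ranked documents first):

[source,text]
----
search(logs,
       q="error_t:(timeout OR refused)",
       fl="id, error_t, score",
       sort="score desc",
       rows="10")
----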
image::images/math-expressions/scoring.png[]
== Sampling
The `random` function returns a random sample from a distributed search result set.
This allows for fast visualization, statistical analysis, and modeling of
samples that can be used to infer information about the larger result set.
The visualization examples below use small random samples, but Solr's random sampling provides sub-second response times on sample sizes of over 200,000.
These larger samples can be used to build reliable statistical models that describe large data sets (billions of documents) with sub-second performance.
The examples below demonstrate univariate and bivariate scatter
plots of random samples.
Statistical modeling with random samples
is covered in the <<statistics.adoc#,Statistics>>, <<probability-distributions.adoc#,Probability>>, <<regression.adoc#,Linear Regression>>, <<curve-fitting.adoc#,Curve Fitting>>,
and <<machine-learning.adoc#,Machine Learning>> sections.
=== Univariate Scatter Plots
In the example below the `random` function is called in its simplest form with just a collection name as the parameter.
When called with no other parameters the `random` function returns a random sample of 500 records with all fields from the collection.
When called without the field list parameter (`fl`) the `random` function also generates a sequence, 0-499 in this case, which can be used for plotting the x-axis.
This sequence is returned in a field called `x`.
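In its simplest form the call is just the collection name (assuming a collection named `logs`):

[source,text]
----
random(logs)
----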
The visualization below shows a scatter plot with the `filesize_d` field
plotted on the y-axis and the `x` sequence plotted on the x-axis.
The effect of this is to spread the `filesize_d` samples across the length
of the plot so they can be more easily studied.
By studying the scatter plot we can learn a number of things about the
distribution of the `filesize_d` variable:
* The sample set ranges from 34,875 to 45,902.
* The highest density appears to be at about 40,000.
* The sample seems to have a balanced number of observations above and below
40,000. Based on this the *mean* and *mode* would appear to be around 40,000.
* The number of observations tapers off to a small number of outliers on
the low and high end of the sample.
This sample can be re-run multiple times to see if the samples
produce similar plots.
image::images/math-expressions/univariate.png[]
=== Bivariate Scatter Plots
In the next example parameters have been added to the `random` function.
The field list (`fl`) now specifies two fields to be
returned with each sample: `filesize_d` and `response_d`.
The `q` and `rows` parameters are the same as the defaults but are included as an example of how to set these parameters.
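The expression might look like this sketch, with `q` and `rows` shown at their default values:

[source,text]
----
random(logs,
       q="*:*",
       fl="filesize_d, response_d",
       rows="500")
----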
By plotting `filesize_d` on the x-axis and `response_d` on the y-axis we can begin to study the relationship between the two variables.
By studying the scatter plot we can learn the following:
* As `filesize_d` rises, `response_d` tends to rise.
* This relationship appears to be linear, as a straight line drawn through the data could be used to model the relationship.
* The points appear to cluster more densely along a straight line through the middle and become less dense as they move away from the line.
* The variance of the data at each `filesize_d` point seems fairly consistent. This means a predictive model would have consistent error across the range of predictions.
image::images/math-expressions/bivariate.png[]
== Aggregation
Aggregations are a powerful statistical tool for summarizing large data sets and
surfacing patterns, trends, and correlations within the data.
Aggregations are also a powerful tool for visualization and provide data sets for further statistical analysis.
=== stats
The simplest aggregation is the `stats` function.
The `stats` function calculates aggregations for an entire result set that matches a query.
The `stats` function supports the following aggregation functions: `count(*)`, `sum`, `min`, `max`, and `avg`.
Any number and combination of statistics can be calculated in a single function call.
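For example, a count and an average can be computed in a single call (a sketch, assuming a collection named `logs` with a `response_d` field):

[source,text]
----
stats(logs,
      q="*:*",
      count(*),
      avg(response_d))
----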
The `stats` function can be visualized in Zeppelin-Solr as a table.
In the example below two statistics are calculated over a result set and are displayed in a table:
image::images/math-expressions/stats-table.png[]
The `stats` function can also be visualized using the *number* visualization, which is used to highlight important numbers.
The example below shows the `count(*)` aggregation displayed in the number visualization:
image::images/math-expressions/stats.png[]
=== facet
The `facet` function performs single and multi-dimension
aggregations that behave in a similar manner to SQL group by aggregations.
Under the covers the `facet` function pushes down the aggregations to Solr's
<<json-facet-api.adoc#,JSON Facet API>> for fast distributed execution.
The example below performs a single dimension aggregation from the
nyc311 (NYC complaints) dataset.
The aggregation returns the top five *complaint types* by *count* for records with a status of *Pending*.
The results are displayed with Zeppelin-Solr in a table.
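The expression for this aggregation might look like the following sketch (the `status_s` field name is an assumption; `complaint_type_s` appears in the multi-dimension example that follows):

[source,text]
----
facet(nyc311,
      q="status_s:Pending",
      buckets="complaint_type_s",
      bucketSorts="count(*) desc",
      rows="5",
      count(*))
----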
image::images/math-expressions/facettab1.png[]
The example below shows the table visualized using a pie chart.
image::images/math-expressions/facetviz1.png[]
The next example demonstrates a multi-dimension aggregation.
Notice that the `buckets` parameter now contains two dimensions: `borough_s` and `complaint_type_s`.
This returns the top 20 combinations of borough and complaint type by count.
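The multi-dimension expression might look like this sketch:

[source,text]
----
facet(nyc311,
      q="*:*",
      buckets="borough_s, complaint_type_s",
      bucketSorts="count(*) desc",
      rows="20",
      count(*))
----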
image::images/math-expressions/facettab2.png[]
The example below shows the multi-dimension aggregation visualized as a grouped bar chart.
image::images/math-expressions/facetviz2.png[]
The `facet` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
=== facet2D
The `facet2D` function performs two dimensional aggregations that can be
visualized as heat maps or pivoted into matrices and operated on by machine learning functions.
`facet2D` has different syntax and behavior than a two-dimensional `facet` function, which
does not control the number of unique facets of each dimension. The `facet2D` function
has a `dimensions` parameter that controls the number of unique facets
for the *x* and *y* dimensions.
The example below visualizes the output of the `facet2D` function. In the example `facet2D`
returns the top 5 boroughs and the top 5 complaint types for each borough. The output is
then visualized as a heatmap.
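The expression might look like the following sketch, where `dimensions="5,5"` requests the top 5 facets on each dimension:

[source,text]
----
facet2D(nyc311,
        q="*:*",
        x="borough_s",
        y="complaint_type_s",
        dimensions="5,5",
        count(*))
----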
image::images/math-expressions/facet2D.png[]
The `facet2D` function supports one of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
=== timeseries
The `timeseries` function performs fast, distributed time
series aggregation leveraging Solr's built-in faceting and date math capabilities.
The example below performs a monthly time series aggregation over a collection of
daily stock price data. In this example the average monthly closing price is
calculated for the stock ticker *amzn* over a specific date range.
The output of the `timeseries` function is then visualized with a line chart.
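The expression might look like this sketch (the collection name `stocks`, the `date_dt` and `close_d` field names, and the date range are illustrative):

[source,text]
----
timeseries(stocks,
           q="ticker_s:amzn",
           field="date_dt",
           start="2010-01-01T00:00:00Z",
           end="2014-01-01T00:00:00Z",
           gap="+1MONTH",
           avg(close_d))
----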
image::images/math-expressions/timeseries1.png[]
The `timeseries` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
=== significantTerms
The `significantTerms` function queries a collection, but instead of returning documents, it returns significant terms found in documents in the result set.
This function scores terms based on how frequently they appear in the result set and how rarely they appear in the entire corpus.
The `significantTerms` function emits a tuple for each term containing the term, the score, the foreground count, and the background count.
The foreground count is the number of documents in the result set that contain the term.
The background count is the number of documents in the entire corpus that contain the term.
The foreground and background counts are global for the collection.
The `significantTerms` function can often provide insights that cannot be gleaned from other types of aggregations.
The example below illustrates the difference between the `facet` function and the `significantTerms` function.
In the first example the `facet` function aggregates the top 5 complaint types
in Brooklyn.
This returns the five most common complaint types in Brooklyn, but
it's not clear whether these terms appear more frequently in Brooklyn than
in the other boroughs.
image::images/math-expressions/significantTermsCompare.png[]
In the next example the `significantTerms` function returns the top 5 significant terms in the `complaint_type_s` field for the borough of Brooklyn.
The highest scoring term, Elder Abuse, has a foreground count of 285 and background count of 298.
This means that there were 298 Elder Abuse complaints in the entire data set, and 285 of them were in Brooklyn.
This shows that Elder Abuse complaints have a much higher occurrence rate in Brooklyn than the other boroughs.
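The expression might look like this sketch:

[source,text]
----
significantTerms(nyc311,
                 q="borough_s:Brooklyn",
                 field="complaint_type_s",
                 limit="5")
----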
image::images/math-expressions/significantTerms2.png[]
The final example shows a visualization of the `significantTerms` from a
text field containing movie reviews. The result shows the
significant terms that appear in movie reviews containing the phrase "sci-fi".
The results are visualized using a bubble chart with the *foreground* count
plotted on the x-axis and the *background* count on the y-axis. Each term is
shown as a bubble sized by the *score*.
image::images/math-expressions/sterms.png[]
=== nodes
The `nodes` function performs aggregations of nodes during a breadth first search of a graph.
This function is covered in detail in the section <<graph-traversal.adoc#,Graph Traversal>>.
In this example the focus will be on finding correlated nodes in a time series
graph using the `nodes` expression.
The example below finds stock tickers whose daily movements tend to be correlated with the ticker *jpm* (JP Morgan).
The inner `search` expression finds records within a specific date range
where the ticker symbol is *jpm* and the `change_d` field (daily change in stock price) is greater than .25.
This search returns all fields in the index, including `yearMonthDay_s`, which is the string representation of the year, month, and day of the matching records.
The `nodes` function wraps the `search` function and operates over its results. The `walk` parameter maps a field from the search results to a field in the index.
In this case the `yearMonthDay_s` field is mapped back to the `yearMonthDay_s` field in the same index.
This finds records that have the same `yearMonthDay_s` field value returned
by the initial search, and returns records for all tickers on those days.
A filter query is applied to this search to restrict it to rows with a `change_d`
greater than .25.
This finds all records on the matching days that have a daily change greater than .25.
The `gather` parameter tells the nodes expression to gather the `ticker_s` symbols during the breadth first search.
The `count(*)` parameter counts the occurrences of the tickers.
This will count the number of times each ticker appears in the breadth first search.
Finally the `top` function selects the top 5 tickers by count and returns them.
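Putting these pieces together, the full expression might look like the following sketch (the collection name `stocks`, the date range, and the inner row limit are illustrative; the other fields and parameters follow the description above):

[source,text]
----
top(n="5",
    sort="count(*) desc",
    nodes(stocks,
          search(stocks,
                 q="ticker_s:jpm AND change_d:[.25 TO *] AND yearMonthDay_s:[20120101 TO 20130101]",
                 fl="yearMonthDay_s",
                 sort="yearMonthDay_s asc",
                 rows="1000"),
          walk="yearMonthDay_s->yearMonthDay_s",
          fq="change_d:[.25 TO *]",
          gather="ticker_s",
          count(*)))
----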
The result below shows the ticker symbols in the `nodes` field and the counts for each node.
Notice *jpm* is first, which shows how many days *jpm* had a change greater than .25 in this time
period.
The next set of ticker symbols (*mtb*, *slvb*, *gs*, and *pnc*) are the symbols with the highest number of days with a change greater than .25 on the same days that *jpm* had a change greater than .25.
image::images/math-expressions/nodestab.png[]
The `nodes` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.