
 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 <html>
 <body>

 <p>This module enables search result grouping with Lucene, where hits
 with the same value in the specified single-valued group field are
 grouped together.  For example, if you group by the <code>author</code>
 field, then all documents with the same value in the <code>author</code>
 field fall into a single group.</p>

 <p>Grouping requires a number of inputs:</p>

   <ul>
     <li> <code>groupField</code>: this is the field used for grouping.
       For example, if you use the <code>author</code> field then each
       group has all books by the same author.  Documents that don't
       have this field are grouped under a single group with
       a <code>null</code> group value.

     <li> <code>groupSort</code>: how the groups are sorted.  For sorting
       purposes, each group is "represented" by the highest-sorted
       document according to the <code>groupSort</code> within it.  For
       example, if you specify "price" (ascending) then the first group
       is the one with the lowest price book within it.  Or if you
       specify relevance group sort, then the first group is the one
       containing the highest scoring book.

     <li> <code>topNGroups</code>: how many top groups to keep.  For
       example, 10 means the top 10 groups are computed.

     <li> <code>groupOffset</code>: which "slice" of top groups you want to
       retrieve.  For example, 3 means you'll get 7 groups back
       (assuming <code>topNGroups</code> is 10).  This is useful for
       paging, where you might show 5 groups per page.

     <li> <code>withinGroupSort</code>: how the documents within each group
       are sorted.  This can be different from the group sort.

     <li> <code>maxDocsPerGroup</code>: how many top documents within each
       group to keep.

     <li> <code>withinGroupOffset</code>: which "slice" of top
       documents you want to retrieve from each group.

   </ul>
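
 <p>The offset arithmetic above can be sketched in a few lines of plain Java.
 The class and method names are hypothetical helpers, not part of this module.
 The sketch assumes, as in the example above, that <code>topNGroups</code>
 counts groups from the very top, so a page needs
 <code>topNGroups = groupOffset + groupsPerPage</code>.</p>

```java
// Hypothetical paging helper; nothing here is Lucene API.
class GroupPagingSketch {

  /** groupOffset for a zero-based page number. */
  static int groupOffset(int page, int groupsPerPage) {
    return page * groupsPerPage;
  }

  /** topNGroups needed so that the requested page is fully covered. */
  static int topNGroups(int page, int groupsPerPage) {
    return (page + 1) * groupsPerPage;
  }

  /**
   * Number of groups that come back: the kept groups (at most
   * totalGroups exist) minus the groupOffset that is skipped.
   */
  static int groupsReturned(int topNGroups, int groupOffset, int totalGroups) {
    return Math.max(0, Math.min(topNGroups, totalGroups) - groupOffset);
  }
}
```

 <p>With <code>topNGroups</code> of 10 and <code>groupOffset</code> of 3,
 <code>groupsReturned</code> yields 7 when at least 10 groups match.</p>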

 <p>The implementation is two-pass: the first pass ({@link
   org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector})
   gathers the top groups, and the second pass ({@link
   org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector})
   gathers documents within those groups.  If the search is costly to
   run you may want to use the {@link
   org.apache.lucene.search.CachingCollector} class, which
   caches hits and can (quickly) replay them for the second pass.  This
   way you only run the query once, but you pay a RAM cost to (briefly)
   hold all hits.  Results are returned as a {@link
   org.apache.lucene.search.grouping.TopGroups} instance.</p>
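
 <p>The two-pass idea can be illustrated without any Lucene classes. The
 sketch below is a simplified model, not the module's implementation: the
 <code>Hit</code> type and <code>search</code> method are hypothetical, groups
 are ranked by relevance only, and the in-memory hit list stands in for hits
 that a caching collector would replay.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of two-pass grouping; not the Lucene collectors.
class TwoPassSketch {

  static final class Hit {
    final int doc;
    final String group;
    final float score;

    Hit(int doc, String group, float score) {
      this.doc = doc;
      this.group = group;
      this.score = score;
    }
  }

  static Map<String, List<Hit>> search(List<Hit> cachedHits, int topNGroups, int maxDocsPerGroup) {
    // First pass: each group is represented by its best score.
    Map<String, Float> best = new HashMap<>();
    for (Hit h : cachedHits) {
      best.merge(h.group, h.score, Math::max);
    }
    List<String> topGroups = new ArrayList<>(best.keySet());
    topGroups.sort((a, b) -> Float.compare(best.get(b), best.get(a)));
    if (topGroups.size() > topNGroups) {
      topGroups = topGroups.subList(0, topNGroups);
    }

    // Second pass: replay the same hits, keeping only docs of the surviving groups.
    Map<String, List<Hit>> result = new LinkedHashMap<>();
    for (String g : topGroups) {
      result.put(g, new ArrayList<>());
    }
    for (Hit h : cachedHits) {
      List<Hit> docs = result.get(h.group);
      if (docs != null) {
        docs.add(h);
      }
    }
    // Sort within each group by descending score and truncate.
    for (List<Hit> docs : result.values()) {
      docs.sort((a, b) -> Float.compare(b.score, a.score));
      if (docs.size() > maxDocsPerGroup) {
        docs.subList(maxDocsPerGroup, docs.size()).clear();
      }
    }
    return result;
  }
}
```

 <p>Because the second pass iterates over the stored hit list rather than
 re-running the query, the sketch also shows the trade-off of caching: one
 query execution, at the cost of holding every hit in memory.</p>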

 <p>
   This module abstracts away what defines a group and how it is collected. All grouping collectors
   are abstract and currently have term-based implementations. One can implement
   collectors that, for example, group on multiple fields.
 </p>

 <p>Known limitations:</p>
 <ul>
   <li> For the two-pass grouping search, the group field must be
     indexed as a {@link org.apache.lucene.document.SortedDocValuesField}.
   <li> Although Solr supports grouping by function and this module has an abstraction of what a group is, there are currently only
     implementations for grouping based on terms.
   <li> Sharding is not directly supported, though it is not too
     difficult if you can merge the top groups and top documents per
     group yourself.
 </ul>
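
 <p>As a rough illustration of the first half of such a merge, the sketch
 below combines per-shard top groups into a global top-N list. It is a
 simplified, hypothetical model rather than an API of this module: each shard
 is assumed to report its top groups as a map from group value to the best
 score in that group, groups are ranked by score only, and merging the top
 documents per group is left out.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shard merge for top groups; not part of Lucene.
class ShardMergeSketch {

  /**
   * Merges per-shard maps (group value -> best score in that shard) and
   * returns the topNGroups group values with the highest merged score.
   */
  static List<String> mergeTopGroups(List<Map<String, Float>> shardTopGroups, int topNGroups) {
    // A group that appears on several shards keeps its best score.
    Map<String, Float> merged = new HashMap<>();
    for (Map<String, Float> shard : shardTopGroups) {
      for (Map.Entry<String, Float> e : shard.entrySet()) {
        merged.merge(e.getKey(), e.getValue(), Math::max);
      }
    }
    // Rank by descending score; break ties by group value for determinism.
    List<String> groups = new ArrayList<>(merged.keySet());
    groups.sort((a, b) -> {
      int c = Float.compare(merged.get(b), merged.get(a));
      return c != 0 ? c : a.compareTo(b);
    });
    return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
  }
}
```

 <p>For the merged list to be exact, each shard has to report at least
 <code>topNGroups</code> groups; otherwise a group that ranks low on every
 shard but high overall can be missed.</p>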

 <p>Typical usage for the generic two-pass grouping search looks like this using the grouping convenience utility
   (optionally using caching for the second pass search):</p>

 <pre class="prettyprint">
   GroupingSearch groupingSearch = new GroupingSearch("author");
   groupingSearch.setGroupSort(groupSort);
   groupingSearch.setFillSortFields(fillFields);

   if (useCache) {
     // Sets cache in MB
     groupingSearch.setCachingInMB(4.0, true);
   }

   if (requiredTotalGroupCount) {
     groupingSearch.setAllGroups(true);
   }

   TermQuery query = new TermQuery(new Term("content", searchTerm));
   TopGroups&lt;BytesRef&gt; result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

   // Render result...
   if (requiredTotalGroupCount) {
     int totalGroupCount = result.totalGroupCount;
   }
 </pre>

 <p>To use the single-pass <code>BlockGroupingCollector</code>,
    first, at indexing time, you must ensure all docs in each group
    are added as a block, and you have some way to find the last
    document of each group.  One simple way to do this is to add a
    marker binary field:</p>

 <pre class="prettyprint">
   // Create Documents from your source:
   List&lt;Document&gt; oneGroup = ...;

   Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
   groupEndField.setIndexOptions(IndexOptions.DOCS_ONLY);
   groupEndField.setOmitNorms(true);
   oneGroup.get(oneGroup.size()-1).add(groupEndField);

   // You can also use writer.updateDocuments(); just be sure you
   // replace an entire previous doc block with this new one.  For
   // example, each group could have a "groupID" field, with the same
   // value for all docs in this group:
   writer.addDocuments(oneGroup);
 </pre>

 <p>Then, at search time, do this up front:</p>

 <pre class="prettyprint">
   // Set this once in your app and keep it for reuse across all queries:
   Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
 </pre>

 <p>Finally, do this per search:</p>

 <pre class="prettyprint">
   // Per search:
   BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
   s.search(new TermQuery(new Term("content", searchTerm)), c);
   TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);

   // Render groupsResult...
 </pre>

 <p>Alternatively, use the <code>GroupingSearch</code> convenience utility:</p>

 <pre class="prettyprint">
   // Per search:
   GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs);
   groupingSearch.setGroupSort(groupSort);
   groupingSearch.setIncludeScores(needsScores);
   TermQuery query = new TermQuery(new Term("content", searchTerm));
   TopGroups groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

   // Render groupsResult...
 </pre>

 <p>Note that the <code>groupValue</code> of each <code>GroupDocs</code>
 will be <code>null</code>, so if you need to present this value you'll
 have to separately retrieve it (for example using stored
 fields, <code>FieldCache</code>, etc.).</p>

 <p>Another collector is the <code>TermAllGroupHeadsCollector</code>, which retrieves the most relevant
    document of each group, also known as the group heads. This is useful when you want to compute group-based
    facets or statistics over the complete query result. The collector can be executed during the first or second
    phase. It can also be used with the <code>GroupingSearch</code> convenience utility, but if you only
    want to compute the most relevant document per group, it is better to use the collector directly, as shown below.</p>

 <pre class="prettyprint">
   AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
   s.search(new TermQuery(new Term("content", searchTerm)), c);
   // Return all group heads as an int array.
   int[] groupHeadsArray = c.retrieveGroupHeads();
   // Return all group heads as a FixedBitSet.
   int maxDoc = s.maxDoc();
   FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
 </pre>
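
 <p>To make the notion of a group head concrete, here is a minimal plain-Java
 model. It is an illustration under simplifying assumptions, not how the
 collector is implemented: the names are hypothetical, documents are
 identified by their array index, and relevance is the only sort criterion.</p>

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of group heads; not the Lucene collector.
class GroupHeadsSketch {

  /**
   * groups[doc] and scores[doc] describe document number doc. Returns one
   * doc ID per distinct group: the document with the highest score.
   */
  static int[] groupHeads(String[] groups, float[] scores) {
    Map<String, Integer> heads = new LinkedHashMap<>();
    for (int doc = 0; doc < groups.length; doc++) {
      Integer current = heads.get(groups[doc]);
      if (current == null || scores[doc] > scores[current]) {
        heads.put(groups[doc], doc);
      }
    }
    int[] result = new int[heads.size()];
    int i = 0;
    for (int doc : heads.values()) {
      result[i++] = doc;
    }
    return result;
  }

  /** Same group heads as a bit set with one bit per document below maxDoc. */
  static BitSet asBitSet(int[] heads, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    for (int doc : heads) {
      bits.set(doc);
    }
    return bits;
  }
}
```

 <p>The bit set form is convenient when the group heads feed a later
 per-document computation such as faceting, since membership can be tested
 by doc ID.</p>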

 <p>For each of the above collector types there is also a variant that works with <code>ValueSource</code> instead
    of fields. Concretely, this means that these variants can work with functions. These variants are slower than
    their term-based counterparts. These implementations are located in the
    <code>org.apache.lucene.search.grouping.function</code> package, but can also be used with the
    <code>GroupingSearch</code> convenience utility.
 </p>

 </body>
 </html>