| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <body> |
| |
| <p>This module enables search result grouping with Lucene, where hits |
| with the same value in the specified single-valued group field are |
| grouped together. For example, if you group by the <code>author</code> |
| field, then all documents with the same value in the <code>author</code> |
| field fall into a single group.</p> |
| |
| <p>Grouping requires a number of inputs:</p> |
| |
| <ul> |
| <li> <code>groupField</code>: this is the field used for grouping. |
| For example, if you use the <code>author</code> field then each |
| group has all books by the same author. Documents that don't |
| have this field are grouped under a single group with |
| a <code>null</code> group value. |
| |
| <li> <code>groupSort</code>: how the groups are sorted. For sorting |
| purposes, each group is "represented" by the highest-sorted |
| document according to the <code>groupSort</code> within it. For |
| example, if you specify "price" (ascending) then the first group |
| is the one with the lowest price book within it. Or if you |
| specify relevance group sort, then the first group is the one |
| containing the highest scoring book. |
| |
| <li> <code>topNGroups</code>: how many top groups to keep. For |
| example, 10 means the top 10 groups are computed. |
| |
| <li> <code>groupOffset</code>: which "slice" of top groups you want to |
| retrieve. For example, 3 means you'll get 7 groups back |
| (assuming <code>topNGroups</code> is 10). This is useful for |
| paging, where you might show 5 groups per page. |
| |
| <li> <code>withinGroupSort</code>: how the documents within each group |
| are sorted. This can be different from the group sort. |
| |
| <li> <code>maxDocsPerGroup</code>: how many top documents within each |
| group to keep. |
| |
| <li> <code>withinGroupOffset</code>: which "slice" of top |
| documents you want to retrieve from each group. |
| |
| </ul> |
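<p>The interplay of the offset and top-N inputs above can be sketched in plain Java. This is only an illustration, not Lucene code: <code>GroupPagingSketch</code> and <code>slice</code> are hypothetical names, and the ranked list stands in for groups that have already been sorted by <code>groupSort</code>.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene) showing how an offset and a top-N size select a slice. */
final class GroupPagingSketch {

  /**
   * Keeps the first topN entries of an already ranked list and then
   * skips the first offset of them, returning what remains.
   */
  static <T> List<T> slice(List<T> ranked, int offset, int topN) {
    int end = Math.min(topN, ranked.size());
    int start = Math.min(offset, end);
    return new ArrayList<>(ranked.subList(start, end));
  }

  public static void main(String[] args) {
    List<String> rankedGroups = new ArrayList<>();
    for (int i = 1; i <= 25; i++) {
      rankedGroups.add("author" + i);
    }
    // topNGroups = 10, groupOffset = 3: the 4th through 10th groups remain, 7 in total.
    System.out.println(slice(rankedGroups, 3, 10));
    // Second page with 5 groups per page: groupOffset = 5, topNGroups = 10.
    System.out.println(slice(rankedGroups, 5, 10));
  }
}
```

<p>The same arithmetic appears to apply to <code>withinGroupOffset</code> and <code>maxDocsPerGroup</code> for the documents inside each group.</p>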
| |
| <p>The implementation is two-pass: the first pass ({@link |
| org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector}) |
| gathers the top groups, and the second pass ({@link |
| org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector}) |
| gathers documents within those groups. If the search is costly to |
| run you may want to use the {@link |
| org.apache.lucene.search.CachingCollector} class, which |
| caches hits and can (quickly) replay them for the second pass. This |
| way you only run the query once, but you pay a RAM cost to (briefly) |
| hold all hits. Results are returned as a {@link |
| org.apache.lucene.search.grouping.TopGroups} instance.</p> |
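<p>The two-pass idea can be pictured with a small plain-Java sketch. It does not use the Lucene collectors: <code>TwoPassGroupingSketch</code>, <code>Hit</code>, <code>firstPass</code> and <code>secondPass</code> are hypothetical names, both sorts are assumed to be by descending score, and the in-memory list of hits plays the role that cached hits play when the query is replayed rather than run twice.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of two-pass grouping over hits that are already held in memory. */
final class TwoPassGroupingSketch {

  /** Stand-in for one matching document; every hit is assumed to have a group value. */
  static final class Hit {
    final int docId;
    final String group;
    final float score;

    Hit(int docId, String group, float score) {
      this.docId = docId;
      this.group = group;
      this.score = score;
    }
  }

  /** First pass: rank groups by their best-scoring hit and keep the top N group values. */
  static List<String> firstPass(List<Hit> cachedHits, int topNGroups) {
    Map<String, Float> bestScore = new LinkedHashMap<>();
    for (Hit hit : cachedHits) {
      bestScore.merge(hit.group, hit.score, Float::max);
    }
    List<Map.Entry<String, Float>> ranked = new ArrayList<>(bestScore.entrySet());
    ranked.sort(Map.Entry.<String, Float>comparingByValue().reversed());
    List<String> topGroups = new ArrayList<>();
    for (int i = 0; i < ranked.size() && i < topNGroups; i++) {
      topGroups.add(ranked.get(i).getKey());
    }
    return topGroups;
  }

  /** Second pass: replay the same hits and keep the best documents of each top group. */
  static Map<String, List<Hit>> secondPass(
      List<Hit> cachedHits, List<String> topGroups, int maxDocsPerGroup) {
    Map<String, List<Hit>> result = new LinkedHashMap<>();
    for (String group : topGroups) {
      result.put(group, new ArrayList<>());
    }
    for (Hit hit : cachedHits) {
      List<Hit> docs = result.get(hit.group);
      if (docs != null) { // hits outside the top groups are ignored
        docs.add(hit);
      }
    }
    for (List<Hit> docs : result.values()) {
      docs.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
      if (docs.size() > maxDocsPerGroup) {
        docs.subList(maxDocsPerGroup, docs.size()).clear();
      }
    }
    return result;
  }
}
```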
| |
| <p> |
| This module abstracts away what defines a group and how it is collected. All grouping collectors |
| are abstract and currently have term-based implementations. You can implement your own |
| collectors that, for example, group on multiple fields. |
| </p> |
| |
| <p>Known limitations:</p> |
| <ul> |
| <li> For the two-pass grouping search, the group field must be |
| indexed as a {@link org.apache.lucene.document.SortedDocValuesField}. |
| <li> Although Solr supports grouping by function and this module abstracts what a group is, there are currently only |
| implementations for grouping based on terms. |
| <li> Sharding is not directly supported, though it is not too |
| difficult to implement if you can merge the top groups and top documents per |
| group yourself. |
| </ul> |
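<p>As a rough illustration of the merging step mentioned in the sharding item above, the following plain-Java sketch combines per-shard top groups into a global top-N list. It is not Lucene code: <code>ShardGroupMergeSketch</code> and <code>mergeTopGroups</code> are hypothetical names, and each shard is assumed to report its top groups as a map from group value to the score of the group's best document.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging top groups that were computed independently on each shard. */
final class ShardGroupMergeSketch {

  /** Merges per-shard maps (group value to best score) into a global top-N list of group values. */
  static List<String> mergeTopGroups(List<Map<String, Float>> shardTopGroups, int topNGroups) {
    Map<String, Float> bestScore = new HashMap<>();
    for (Map<String, Float> shard : shardTopGroups) {
      for (Map.Entry<String, Float> entry : shard.entrySet()) {
        // The same group may appear on several shards; keep its best score.
        bestScore.merge(entry.getKey(), entry.getValue(), Float::max);
      }
    }
    List<Map.Entry<String, Float>> ranked = new ArrayList<>(bestScore.entrySet());
    // Highest score first; ties are broken by group value so the order is deterministic.
    ranked.sort((x, y) -> {
      int byScore = Float.compare(y.getValue(), x.getValue());
      return byScore != 0 ? byScore : x.getKey().compareTo(y.getKey());
    });
    List<String> merged = new ArrayList<>();
    for (int i = 0; i < ranked.size() && i < topNGroups; i++) {
      merged.add(ranked.get(i).getKey());
    }
    return merged;
  }
}
```

<p>In this sketch, each shard would then gather the top documents for the merged groups, and those per-group document lists would be merged in a similar way.</p>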
| |
| <p>Typical usage of the generic two-pass grouping search with the <code>GroupingSearch</code> convenience utility |
| (optionally caching hits for the second-pass search) looks like this:</p> |
| |
| <pre class="prettyprint"> |
| GroupingSearch groupingSearch = new GroupingSearch("author"); |
| groupingSearch.setGroupSort(groupSort); |
| groupingSearch.setFillSortFields(fillFields); |
| |
| if (useCache) { |
| // Sets cache in MB |
| groupingSearch.setCachingInMB(4.0, true); |
| } |
| |
| if (requiredTotalGroupCount) { |
| groupingSearch.setAllGroups(true); |
| } |
| |
| TermQuery query = new TermQuery(new Term("content", searchTerm)); |
| TopGroups&lt;BytesRef&gt; result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit); |
| |
| // Render result... |
| if (requiredTotalGroupCount) { |
| int totalGroupCount = result.totalGroupCount; |
| } |
| </pre> |
| |
| <p>To use the single-pass <code>BlockGroupingCollector</code>, |
| first, at indexing time, you must ensure all docs in each group |
| are added as a block, and you have some way to find the last |
| document of each group. One simple way to do this is to add a |
| marker binary field:</p> |
| |
| <pre class="prettyprint"> |
| // Create Documents from your source: |
| List&lt;Document&gt; oneGroup = ...; |
| |
| Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED); |
| groupEndField.setIndexOptions(IndexOptions.DOCS_ONLY); |
| groupEndField.setOmitNorms(true); |
| oneGroup.get(oneGroup.size()-1).add(groupEndField); |
| |
| // You can also use writer.updateDocuments(); just be sure you |
| // replace an entire previous doc block with this new one. For |
| // example, each group could have a "groupID" field, with the same |
| // value for all docs in this group, to use as the delete term. |
| writer.addDocuments(oneGroup); |
| </pre> |
| |
| <p>Then, at search time, do this up front:</p> |
| |
| <pre class="prettyprint"> |
| // Set this once in your app & save away for reusing across all queries: |
| Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x")))); |
| </pre> |
| |
| <p>Finally, do this per search:</p> |
| |
| <pre class="prettyprint"> |
| // Per search: |
| BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs); |
| s.search(new TermQuery(new Term("content", searchTerm)), c); |
| TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields); |
| |
| // Render groupsResult... |
| </pre> |
| |
| <p>Alternatively, use the <code>GroupingSearch</code> convenience utility:</p> |
| |
| <pre class="prettyprint"> |
| // Per search: |
| GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs); |
| groupingSearch.setGroupSort(groupSort); |
| groupingSearch.setIncludeScores(needsScores); |
| TermQuery query = new TermQuery(new Term("content", searchTerm)); |
| TopGroups groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit); |
| |
| // Render groupsResult... |
| </pre> |
| |
| <p>Note that the <code>groupValue</code> of each <code>GroupDocs</code> |
| will be <code>null</code>, so if you need to present this value you'll |
| have to separately retrieve it (for example using stored |
| fields, <code>FieldCache</code>, etc.).</p> |
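<p>To make the single-pass block approach more concrete, here is a plain-Java sketch of how group boundaries can be recovered from the group-end markers alone. It is not the <code>BlockGroupingCollector</code> implementation: <code>BlockGroupingSketch</code> and <code>groupByBlocks</code> are hypothetical names, a <code>java.util.BitSet</code> stands in for the cached group-end filter, and hits are assumed to arrive in increasing document ID order.</p>

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of single-pass grouping over documents that were indexed as blocks. */
final class BlockGroupingSketch {

  /**
   * Splits matching doc IDs (ascending) into groups, using a bit set whose
   * set bits mark the last document of each indexed block.
   */
  static List<List<Integer>> groupByBlocks(int[] matchingDocs, BitSet groupEndDocs) {
    List<List<Integer>> groups = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int currentBlockEnd = -1;
    for (int doc : matchingDocs) {
      if (doc > currentBlockEnd) {
        // This hit lies in a new block: flush the previous group, if any.
        if (!current.isEmpty()) {
          groups.add(current);
          current = new ArrayList<>();
        }
        currentBlockEnd = groupEndDocs.nextSetBit(doc);
        if (currentBlockEnd < 0) {
          throw new IllegalArgumentException("doc " + doc + " is not inside a marked block");
        }
      }
      current.add(doc);
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

<p>The sketch only ever sees document IDs and block ends, never a group value, which mirrors why the group value has to be looked up separately.</p>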
| |
| <p>Another collector is the <code>TermAllGroupHeadsCollector</code>, which retrieves the most relevant |
| document of every group, also known as the group heads. This is useful when you want to compute group-based |
| facets or statistics over the complete query result. The collector can be executed during the first or second |
| phase. It can also be used with the <code>GroupingSearch</code> convenience utility, but if you only |
| want to compute the most relevant document per group, it is better to use the collector directly, as shown below.</p> |
| |
| <pre class="prettyprint"> |
| AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup); |
| s.search(new TermQuery(new Term("content", searchTerm)), c); |
| // Return all group heads as int array |
| int[] groupHeadsArray = c.retrieveGroupHeads(); |
| // Return all group heads as FixedBitSet. |
| int maxDoc = s.getIndexReader().maxDoc(); |
| FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc); |
| </pre> |
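<p>What a group head is can be shown with a small plain-Java sketch. It does not use the Lucene collector: <code>GroupHeadsSketch</code>, <code>groupHeads</code> and <code>asBitSet</code> are hypothetical names, the sort within a group is assumed to be by descending score, and <code>java.util.BitSet</code> stands in for <code>FixedBitSet</code>.</p>

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch that picks the best-scoring document of every group. */
final class GroupHeadsSketch {

  /** Returns one doc ID per group: the hit with the highest score in that group. */
  static int[] groupHeads(int[] docIds, String[] groups, float[] scores) {
    // Maps each group value to the index of its current best hit.
    Map<String, Integer> headIndex = new LinkedHashMap<>();
    for (int i = 0; i < docIds.length; i++) {
      Integer current = headIndex.get(groups[i]);
      if (current == null || scores[i] > scores[current]) {
        headIndex.put(groups[i], i);
      }
    }
    int[] heads = new int[headIndex.size()];
    int n = 0;
    for (int index : headIndex.values()) {
      heads[n++] = docIds[index];
    }
    return heads;
  }

  /** Marks the group heads in a bit set sized for maxDoc documents. */
  static BitSet asBitSet(int[] heads, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    for (int doc : heads) {
      bits.set(doc);
    }
    return bits;
  }
}
```

<p>The bit set form is convenient when the heads are later intersected with other document sets, for example while counting facets.</p>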
| |
| <p>For each of the above collector types there is also a variant that works with <code>ValueSource</code> instead |
| of fields. Concretely, this means that these variants can group by functions. These variants are slower than |
| their term-based counterparts. These implementations are located in the |
| <code>org.apache.lucene.search.grouping.function</code> package, but can also be used with the |
| <code>GroupingSearch</code> convenience utility. |
| </p> |
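<p>The difference between grouping by a field value and grouping by a function can be illustrated in plain Java. This sketch does not use <code>ValueSource</code>: <code>FunctionGroupingSketch</code> and <code>groupBy</code> are hypothetical names, and the array index is assumed to be the document ID.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleFunction;

/** Hypothetical sketch of grouping by a computed value instead of an indexed term. */
final class FunctionGroupingSketch {

  /** Groups doc IDs (array indexes) by the key that groupFunction derives from each value. */
  static <K> Map<K, List<Integer>> groupBy(double[] fieldValues, DoubleFunction<K> groupFunction) {
    Map<K, List<Integer>> groups = new LinkedHashMap<>();
    for (int doc = 0; doc < fieldValues.length; doc++) {
      // The function is evaluated for every document, which is one reason
      // such variants can be slower than a plain term lookup.
      K key = groupFunction.apply(fieldValues[doc]);
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
    }
    return groups;
  }

  public static void main(String[] args) {
    double[] prices = {4.5, 12.0, 19.9, 7.0};
    // Group books into price buckets of width 10.
    System.out.println(groupBy(prices, price -> (int) Math.floor(price / 10.0)));
  }
}
```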
| |
| </body> |
| </html> |