commit | cd195980ecbcb45e19ce9465854fed0b6be79ea7 | [log] [tgz] |
---|---|---|
author | Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> | Tue Dec 12 00:48:36 2023 +0530 |
committer | GitHub <noreply@github.com> | Mon Dec 11 14:18:36 2023 -0500 |
tree | b8ea3d017ee2d74f88ff4661125225d1e0e80fe6 | |
parent | 1630ed4bd8431c225b90b7f581c4757fc92d8d3b [diff] |
Add support for similarity-based vector searches (#12679) ### Description Background in #12579 Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system ### Considerations I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to: 1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius 2. Collect all nodes that are within an inner "result" radius ### Advantages 1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`) 2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later) 3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK` On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes? Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?) ### Caveats I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental` For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
Apache Lucene is a high-performance, full-featured text search engine library written in Java.
This README file only contains basic setup instructions. For more comprehensive documentation, visit:
gradlew
).We‘ll assume that you know how to get and set up the JDK - if you don’t, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README.
See Contributing Guide for details.
Bug fixes, improvements and new features are always welcome! Please review the Contributing to Lucene Guide for information on contributing.
#lucene
and #lucene-dev
on freenode.net