commit	cd195980ecbcb45e19ce9465854fed0b6be79ea7	[log] [tgz]
author	Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>	Tue Dec 12 00:48:36 2023 +0530
committer	GitHub <noreply@github.com>	Mon Dec 11 14:18:36 2023 -0500
tree	b8ea3d017ee2d74f88ff4661125225d1e0e80fe6
parent	1630ed4bd8431c225b90b7f581c4757fc92d8d3b [diff]

Add support for similarity-based vector searches (#12679)

### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)

9 files changed

tree: b8ea3d017ee2d74f88ff4661125225d1e0e80fe6

README.md

Apache Lucene

Lucene Logo

Apache Lucene is a high-performance, full-featured text search engine library written in Java.

Online Documentation

This README file only contains basic setup instructions. For more comprehensive documentation, visit:

Latest Releases: https://lucene.apache.org/core/documentation.html
Nightly: https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/javadoc/
Build System Documentation: help/
Developer Documentation: dev-docs/
Migration Guide: lucene/MIGRATE.md

Building

Basic steps:

Install OpenJDK 17 or 18.
Clone Lucene's git repository (or download the source distribution).
Run gradle launcher script (gradlew).

We‘ll assume that you know how to get and set up the JDK - if you don’t, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README.

See Contributing Guide for details.

Contributing

Bug fixes, improvements and new features are always welcome! Please review the Contributing to Lucene Guide for information on contributing.

Discussion and Support

Users Mailing List
Developers Mailing List
IRC: #lucene and #lucene-dev on freenode.net