commit | 3a1f208659b9b9eef930607f2c7b556c25ddfd44 | |
---|---|---|
author | Michael Bien <mbien42@gmail.com> | Mon May 08 13:17:29 2023 +0200 |
committer | GitHub <noreply@github.com> | Mon May 08 13:17:29 2023 +0200 |
tree | 956acdc2823a3b0082c627693d56ac9a3fd9763a | |
parent | 41e88f874132a6bcae3dd034547b735b6a8a4c12 | |

[MINDEXER-185] Filter in index reader and use update request for configuration (#302)

This PR does two things:

- uses `IndexUpdateRequest` as configuration for `IndexDataReader` (tmp folder, factory etc)
- moves the filtering from post extraction to the read phase
  - makes filtering really fast (and multi threaded too), it is no longer an extra step
  - has actually an effect on the on-disk index size (since I've learned lucene doesn't really remove things since all files are immutable)

I do realize that this is not quite the same behavior as before. To retain the exact same behavior, we could add this as additional filter, one during read (new), one after extraction (old). (edit: done, see second commit)

example filter:

```java
final Instant cutoff = ZonedDateTime.now().minusYears(2).toInstant();
iur.setExtractionFilter((doc) -> {
    IndexableField field = doc.getField("m"); // usually never null
    return field != null && Instant.ofEpochMilli(Long.parseLong(field.stringValue())).isAfter(cutoff);
});
```

results (single threaded, since MT has an index size penalty due to merge overhead):

```
full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB
```

---

https://issues.apache.org/jira/browse/MINDEXER-185
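
The cutoff logic of the example filter can be tried without Lucene or the indexer on the classpath. The following is only a sketch under stated assumptions: a plain `Map<String, String>` stands in for the Lucene document, the `m` field is treated as an epoch-millisecond timestamp (as the example filter does), and the class name, the `id` field, and the `read` helper are hypothetical and not part of the project. It illustrates why filtering during the read phase keeps rejected records from ever being collected, rather than removing them afterwards.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in: a record is a map of field name -> string value.
// The real filter receives a Lucene document instead.
class ReadPhaseFilterSketch {

    // Keeps records whose "m" field (assumed epoch millis) lies after the cutoff.
    // Records without an "m" field are rejected, mirroring the null check above.
    static Predicate<Map<String, String>> modifiedAfter(Instant cutoff) {
        return record -> {
            String m = record.get("m");
            return m != null && Instant.ofEpochMilli(Long.parseLong(m)).isAfter(cutoff);
        };
    }

    // Applies the filter while "reading", so rejected records are never collected.
    static List<Map<String, String>> read(
            List<Map<String, String>> source, Predicate<Map<String, String>> filter) {
        return source.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant cutoff = Instant.ofEpochMilli(1_000L);
        List<Map<String, String>> source = List.of(
                Map.of("id", "a", "m", "500"),   // older than the cutoff
                Map.of("id", "b", "m", "2000"),  // newer than the cutoff
                Map.of("id", "c"));              // no timestamp field
        System.out.println(read(source, modifiedAfter(cutoff)).size()); // prints 1
    }
}
```

Because the predicate is stateless, it can be evaluated from several reader threads at once, which is consistent with the multi-threaded filtering mentioned in the commit message.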
You have found a bug or you have an idea for a cool new feature? Contributing code is a great way to give something back to the open source community. Before you dig right into the code, there are a few guidelines that we need contributors to follow so that we can have a chance of keeping on top of things.
We accept Pull Requests via GitHub. The developer mailing list is the main channel of communication for contributors.
There are some guidelines which will make applying PRs easier for us:
- Run `git diff --check` before committing.
- Format commit messages as `[MINDEXER-XXX] - Subject of the JIRA Ticket`, followed by an optional supplemental description.
- Run `mvn -Prun-its verify` to assure nothing else was accidentally broken.

If you plan to contribute on a regular basis, please consider filing a contributor license agreement.
For changes of a trivial nature to comments and documentation, it is not always necessary to create a new ticket in JIRA. In this case, it is appropriate to start the first line of a commit with ‘(doc)’ instead of a ticket number.
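
The two accepted first-line forms, a `[MINDEXER-XXX]` ticket key or a `(doc)` prefix, could be checked locally before pushing. The sketch below is a hypothetical helper, not a tool shipped with the project; the class name, method name, and regular expression are assumptions. It accepts the subject with or without the ` - ` separator, since the commit shown at the top of this page omits it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: validates the first line of a commit message.
class CommitSubjectCheck {

    // Assumed form: "[MINDEXER-<digits>] ", an optional "- ", then a non-empty subject.
    private static final Pattern TICKET_SUBJECT =
            Pattern.compile("\\[MINDEXER-\\d+\\] (- )?\\S.*");

    static boolean isValidSubject(String firstLine) {
        if (firstLine == null) {
            return false;
        }
        // Trivial comment and documentation changes may start with "(doc)" instead of a ticket key.
        return firstLine.startsWith("(doc)") || TICKET_SUBJECT.matcher(firstLine).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidSubject("[MINDEXER-185] Filter in index reader")); // prints true
        System.out.println(isValidSubject("(doc) fix typo in README"));              // prints true
        System.out.println(isValidSubject("Fix stuff"));                             // prints false
    }
}
```

A check like this could be wired into a local `commit-msg` hook; the project itself may enforce the format differently or not at all.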