{
"Lucene.Net.Index.Memory.html": {
"href": "Lucene.Net.Index.Memory.html",
"title": "Namespace Lucene.Net.Index.Memory | Apache Lucene.NET 4.8.0-beta00010 Documentation",
"keywords": "Namespace Lucene.Net.Index.Memory High-performance single-document main memory Apache Lucene fulltext search index. Classes MemoryIndex High-performance single-document main memory Apache Lucene fulltext search index. Overview This class is a replacement/substitute for a large subset of RAMDirectory functionality. It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as Nux XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). For example as in float score = Search(string text, Query query) Each instance can hold at most one Lucene \"document\", with a document containing zero or more \"fields\", each field having a name and a fulltext value. 
The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on AddField() , according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as \"he\", \"in\", \"and\" (stop words), reduce the terms to their natural linguistic root form such as \"fishing\" being reduced to \"fish\" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro . Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules . Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization. For some interesting background information on search technology, see Bob Wyman's Prospective Search , Jim Gray's A Call to Arms - Custom subscriptions , and Tim Bray's On Search, the Series . 
Example Usage Analyzer analyzer = new SimpleAnalyzer(version); MemoryIndex index = new MemoryIndex(); index.AddField(\"content\", \"Readings about Salmons and other select Alaska fishing Manuals\", analyzer); index.AddField(\"author\", \"Tales of James\", analyzer); QueryParser parser = new QueryParser(version, \"content\", analyzer); float score = index.Search(parser.Parse(\"+author:james +salmon~ +fish* manual~\")); if (score > 0.0f) { Console.WriteLine(\"it's a match\"); } else { Console.WriteLine(\"no match found\"); } Console.WriteLine(\"indexData=\" + index.ToString()); Example XQuery Usage (: An XQuery that finds all books authored by James that have something to do with \"salmon fishing manuals\", sorted by relevance :) declare namespace lucene = \"java:nux.xom.pool.FullTextUtil\"; declare variable $query := \"+salmon~ +fish* manual~\"; (: any arbitrary Lucene query can go here :) for $book in /books/book[author=\"James\" and lucene:match(abstract, $query) > 0.0] let $score := lucene:match($book/abstract, $query) order by $score descending return $book No thread safety guarantees An instance can be queried multiple times with the same or different queries, but an instance is not thread-safe. If desired, use idioms such as: MemoryIndex index = ... lock (index) { // read and/or write index (i.e. add fields and/or query) } Performance Notes Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework. This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory . Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N logN) in the worst case. 
Memory consumption is probably larger than for RAMDirectory . Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary. If you're curious about the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing )."
},
"Lucene.Net.Index.Memory.MemoryIndex.html": {
"href": "Lucene.Net.Index.Memory.MemoryIndex.html",
"title": "Class MemoryIndex | Apache Lucene.NET 4.8.0-beta00010 Documentation",
"keywords": "Class MemoryIndex High-performance single-document main memory Apache Lucene fulltext search index. Overview This class is a replacement/substitute for a large subset of RAMDirectory functionality. It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as Nux XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). For example as in float score = Search(string text, Query query) Each instance can hold at most one Lucene \"document\", with a document containing zero or more \"fields\", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on AddField() , according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as \"he\", \"in\", \"and\" (stop words), reduce the terms to their natural linguistic root form such as \"fishing\" being reduced to \"fish\" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro . Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules . Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization. 
For some interesting background information on search technology, see Bob Wyman's Prospective Search , Jim Gray's A Call to Arms - Custom subscriptions , and Tim Bray's On Search, the Series . Example Usage Analyzer analyzer = new SimpleAnalyzer(version); MemoryIndex index = new MemoryIndex(); index.AddField(\"content\", \"Readings about Salmons and other select Alaska fishing Manuals\", analyzer); index.AddField(\"author\", \"Tales of James\", analyzer); QueryParser parser = new QueryParser(version, \"content\", analyzer); float score = index.Search(parser.Parse(\"+author:james +salmon~ +fish* manual~\")); if (score > 0.0f) { Console.WriteLine(\"it's a match\"); } else { Console.WriteLine(\"no match found\"); } Console.WriteLine(\"indexData=\" + index.ToString()); Example XQuery Usage (: An XQuery that finds all books authored by James that have something to do with \"salmon fishing manuals\", sorted by relevance :) declare namespace lucene = \"java:nux.xom.pool.FullTextUtil\"; declare variable $query := \"+salmon~ +fish* manual~\"; (: any arbitrary Lucene query can go here :) for $book in /books/book[author=\"James\" and lucene:match(abstract, $query) > 0.0] let $score := lucene:match($book/abstract, $query) order by $score descending return $book No thread safety guarantees An instance can be queried multiple times with the same or different queries, but an instance is not thread-safe. If desired, use idioms such as: MemoryIndex index = ... lock (index) { // read and/or write index (i.e. add fields and/or query) } Performance Notes Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework. This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory . 
Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N logN) in the worst case. Memory consumption is probably larger than for RAMDirectory . Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary. If you're curious about the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing ). Inheritance System.Object MemoryIndex Inherited Members System.Object.Equals(System.Object) System.Object.Equals(System.Object, System.Object) System.Object.GetHashCode() System.Object.GetType() System.Object.MemberwiseClone() System.Object.ReferenceEquals(System.Object, System.Object) Namespace : Lucene.Net.Index.Memory Assembly : Lucene.Net.Memory.dll Syntax [Serializable] public class MemoryIndex Constructors MemoryIndex() Constructs an empty instance. Declaration public MemoryIndex() MemoryIndex(Boolean) Constructs an empty instance that can optionally store the start and end character offset of each token term in the text. This can be useful for highlighting of hit locations with the Lucene highlighter package. Protected until the highlighter package matures, so that this can actually be meaningfully integrated. Declaration public MemoryIndex(bool storeOffsets) Parameters Type Name Description System.Boolean storeOffsets whether or not to store the start and end character offset of each token term in the text Methods AddField(String, TokenStream) Equivalent to AddField(fieldName, stream, 1.0f) . 
Declaration public virtual void AddField(string fieldName, TokenStream stream) Parameters Type Name Description System.String fieldName a name to be associated with the text Lucene.Net.Analysis.TokenStream stream the token stream to retrieve tokens from AddField(String, TokenStream, Single) Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field . Finally closes the token stream. Note that untokenized keywords can be added with this method via KeywordTokenStream<T>(ICollection<T>), the Lucene KeywordTokenizer or similar utilities. Declaration public virtual void AddField(string fieldName, TokenStream stream, float boost) Parameters Type Name Description System.String fieldName a name to be associated with the text Lucene.Net.Analysis.TokenStream stream the token stream to retrieve tokens from. System.Single boost the boost factor for hits for this field See Also Boost AddField(String, TokenStream, Single, Int32) Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field . Finally closes the token stream. Note that untokenized keywords can be added with this method via KeywordTokenStream<T>(ICollection<T>), the Lucene KeywordTokenizer or similar utilities. Declaration public virtual void AddField(string fieldName, TokenStream stream, float boost, int positionIncrementGap) Parameters Type Name Description System.String fieldName a name to be associated with the text Lucene.Net.Analysis.TokenStream stream the token stream to retrieve tokens from. 
System.Single boost the boost factor for hits for this field System.Int32 positionIncrementGap the position increment gap if fields with the same name are added more than once See Also Boost AddField(String, TokenStream, Single, Int32, Int32) Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field . Finally closes the token stream. Note that untokenized keywords can be added with this method via KeywordTokenStream<T>(ICollection<T>), the Lucene KeywordTokenizer or similar utilities. Declaration public virtual void AddField(string fieldName, TokenStream stream, float boost, int positionIncrementGap, int offsetGap) Parameters Type Name Description System.String fieldName a name to be associated with the text Lucene.Net.Analysis.TokenStream stream the token stream to retrieve tokens from. System.Single boost the boost factor for hits for this field System.Int32 positionIncrementGap the position increment gap if fields with the same name are added more than once System.Int32 offsetGap the offset gap if fields with the same name are added more than once See Also Boost AddField(String, String, Analyzer) Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding an indexed non-keyword Lucene Field that is tokenized, not stored, termVectorStored with positions (or termVectorStored with positions and offsets). Declaration public virtual void AddField(string fieldName, string text, Analyzer analyzer) Parameters Type Name Description System.String fieldName a name to be associated with the text System.String text the text to tokenize and index. 
Lucene.Net.Analysis.Analyzer analyzer the analyzer to use for tokenization CreateSearcher() Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits. Declaration public virtual IndexSearcher CreateSearcher() Returns Type Description Lucene.Net.Search.IndexSearcher a searcher GetMemorySize() Returns a reasonable approximation of the main memory [bytes] consumed by this instance. Useful for smart memory sensitive caches/pools. Declaration public virtual long GetMemorySize() Returns Type Description System.Int64 the main memory consumption KeywordTokenStream<T>(ICollection<T>) Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, \"as is\", without any transforming text analysis. The resulting token stream can be fed into AddField(String, TokenStream) , perhaps wrapped into another TokenFilter , as desired. Declaration public virtual TokenStream KeywordTokenStream<T>(ICollection<T> keywords) Parameters Type Name Description System.Collections.Generic.ICollection <T> keywords the keywords to generate tokens for Returns Type Description Lucene.Net.Analysis.TokenStream the corresponding token stream Type Parameters Name Description T Reset() Resets the MemoryIndex to its initial state and recycles all internal buffers. Declaration public virtual void Reset() Search(Query) Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression. Declaration public virtual float Search(Query query) Parameters Type Name Description Lucene.Net.Search.Query query an arbitrary Lucene query to run against this index Returns Type Description System.Single the relevance score of the matchmaking; A number in the range [0.0 .. 
1.0], with 0.0 indicating no match. The higher the number the better the match. ToString() Returns a String representation of the index data for debugging purposes. Declaration public override string ToString() Returns Type Description System.String the string representation Overrides System.Object.ToString()"
}
}