| eZ Components - Search |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| .. contents:: Table of Contents |
| |
| |
| Introduction |
| ============ |
| |
| The search component allows you to index and search documents. A document |
| consists of an object. A definition maps the object's properties to fields of a |
| document, just like the PersistentObject_ component. The indexing process takes |
| the document and indexes the fields depending on the data type. The searching |
| part allows you to search for documents in the index with a rich query |
| language. |
| |
| .. _PersistentObject: introduction_PersistentObject.html |
| |
| |
| Class overview |
| ============== |
| |
| This section gives you an overview of the main classes in the Search |
| component. |
| |
| ezcSearchSession |
| This class provides access to the search index for both indexing and |
| searching. All operations on the index go through this class, which |
| is configured with a search handler and a definition manager. |
| |
| ezcSearchSolrHandler |
| The handler that uses Solr_ for indexing and searching. |
| |
| ezcSearchQueryBuilder |
| A class that allows you to build a search query from a search string. |
| |
| ezcSearchQuery |
| An interface for building complex search queries in case |
| ezcSearchQueryBuilder is not up to the task. |
| |
| .. _Solr: http://lucene.apache.org/solr/ |
| |
| Search Handlers |
| =============== |
| |
| Search handlers provide the link between the abstract query and document |
| interfaces and the mechanism that actually stores the index and allows |
| querying for documents. Not all handlers support every query word or data |
| type in the index, but effort is put into making as much use of each |
| handler's functionality as possible. |
| |
| Solr |
| ---- |
| |
| This handler uses Apache's Solr as backend. It accesses Solr over TCP/IP as a |
| web service. Solr is a very capable search provider, with many features__. |
| |
| __ http://lucene.apache.org/solr/features.html |
| |
| Using the handler is straightforward. Instantiate the handler class and pass |
| it to the ezcSearchSession constructor: |
| |
| .. include:: tutorial/solr-handler.php |
| :literal: |
| |
| Solr requires a schema to work. This schema defines how data types work |
| and allows for many more customizations. The default schema that comes with |
| Solr requires a few minor changes to make it work with the Search component. |
| `This schema`__ should be used as a basis for the Search component. |
| |
| __ https://svn.apache.org/repos/asf/incubator/zetacomponents/trunk/Search/docs/schema.xml |
| |
| Zend_Search_Lucene |
| ------------------ |
| |
| The component also provides a backend based on Zend's Lucene implementation. |
| This backend has *many* limitations compared to Solr_, such as missing |
| multi-valued field support, no data-type support and much lower performance. |
| To use this backend, you need a specific autoload function, and the Zend |
| Framework must be installed and available in the PHP include path. An |
| example of how to use it follows: |
| |
| .. include:: tutorial/zend-lucene.php |
| :literal: |
| |
| |
| Definition Managers |
| =================== |
| |
| A definition manager maps properties of an object to fields in a search |
| document. As most search handlers support fields with arbitrary names, you |
| don't actually provide the name of the fields in the search index. Instead, |
| the mapping configures several things for an object's property. |
| |
| First of all, every document type needs an ID field. This ID will uniquely |
| define a document in the search index. There can only be one ID field, and |
| there has to be one. For each field, you have to define the data type, and |
| optionally you can configure: |
| |
| - The importance of that field (boost factor) |
| - Whether the field should be a part of the resulting document |
| - Whether the field supports multiple values |
| - Whether highlighting should be performed for this field on result documents |
| |
| Definitions can be supplied in two ways. The `embedded manager`_ retrieves |
| the definitions from the document classes directly, whereas the `xml manager`_ |
| reads the definitions from external XML files. |
| |
| |
| Embedded Manager |
| ---------------- |
| |
| The ezcSearchEmbeddedManager retrieves the definition from a class that |
| implements the ezcSearchDefinitionProvider interface. This interface specifies |
| the getDefinition() method, which each class implements to return the |
| document's definition mappings. An example implementation of the |
| getDefinition() method is:: |
| |
|     static public function getDefinition() |
|     { |
|         // Field arguments: name, type, boost, in result, multi-valued, highlight |
|         $n = new ezcSearchDocumentDefinition( __CLASS__ ); |
|         $n->idProperty = 'id'; |
|         $n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT ); |
|         $n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true ); |
|         $n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true ); |
|         $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE ); |
|         $n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING ); |
|         $n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false ); |
| |
|         return $n; |
|     } |
| |
| This method constructs an ezcSearchDocumentDefinition object containing all |
| the field definitions. An ID property is required. Using the TEXT data type |
| for it is recommended but not required. See the section `Data Types`_ for |
| the differences between data types. |
| |
| Each field is then added to the fields property as an |
| ezcSearchDefinitionDocumentField object. The array key should be the same as |
| the first argument to the constructor of this class. By default the type will |
| be ezcSearchDocumentDefinition::TEXT. Subsequent arguments control the |
| importance (boost) of a field, whether it should be part of the result, whether |
| multiple values for this field are accepted and whether it should be selected |
| for highlighting. |
| |
| XML Manager |
| ----------- |
| |
| The ezcSearchXmlManager obtains document definitions from XML files. The |
| directory that contains the XML definition files is passed to the |
| constructor:: |
| |
|     <?php |
|     $xm = new ezcSearchXmlManager( 'search-defs/' ); |
|     ?> |
| |
| Each definition file must be named ``name-of-class-in-lower-case.xml``. For |
| the class Article, for example, the file ``article.xml`` is read. The file |
| itself is a simple XML file. The file below contains the same definition as |
| the example in `Embedded Manager`_: |
| |
| .. include:: tutorial/example.xml |
| :literal: |
| |
| The RelaxNG-Compressed schema is: |
| |
| .. include:: tutorial/definition-schema.rnc |
| :literal: |
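| |
| The file naming rule can be sketched in a few lines of plain PHP. The helper |
| below is hypothetical and not part of the Search component; it assumes that |
| the rule is nothing more than lower-casing the class name and appending |
| ``.xml`` inside the configured directory: |

```php
<?php
/**
 * Hypothetical helper, not part of the Search component: returns the
 * definition file path that the naming rule implies for a class name.
 */
function definitionFileFor( $directory, $className )
{
    // Normalise the directory so exactly one separator precedes the file.
    return rtrim( $directory, '/' ) . '/' . strtolower( $className ) . '.xml';
}

// Prints "search-defs/article.xml".
echo definitionFileFor( 'search-defs/', 'Article' ), "\n";
```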
| |
| |
| Search Session |
| ============== |
| |
| The search session is responsible for indexing documents and for searching |
| them. The session object requires both a search handler and a definition |
| manager. The handler stores the index, while the definition manager finds |
| the definition that maps an object's properties to search index fields. |
| Creating a session is simple, as the following example demonstrates: |
| |
| .. include:: tutorial/create-session.php |
| :literal: |
| |
| |
| Indexing |
| ======== |
| |
| With the session created, it is time to index documents. Before we can index |
| anything we need to create a class and its definition. For this tutorial |
| we'll reuse the definition from the `Embedded Manager`_ section and build a |
| class around it. Each class that you want to index through the Search |
| component needs to implement the ezcBasePersistable interface. This interface |
| defines two methods, getState() and setState(), and requires that the |
| constructor can be called without any arguments. The two methods are used |
| for fetching and re-creating the state of the object, similar to what |
| PersistentObject_ requires. |
| |
| To see everything in perspective, the full class follows here, including the |
| definition method: |
| |
| .. include:: tutorial/article.php |
| :literal: |
| |
| The ezcBasePersistable interface is also compatible with PersistentObject_, |
| although there the interface is not enforced. |
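| |
| The state contract can also be tried without the component. The following is |
| a minimal sketch, not the tutorial's Article class: the class name |
| SketchArticle is made up, and it assumes that getState() returns an array |
| keyed by property name and that setState() accepts an array of the same |
| shape. The ``implements ezcBasePersistable`` clause is left out so that the |
| sketch runs on its own: |

```php
<?php
// Hypothetical minimal document class. With the components loaded it would
// also declare "implements ezcBasePersistable".
class SketchArticle
{
    public $id;
    public $title;
    public $body;

    // All arguments are optional, so "new SketchArticle()" works.
    public function __construct( $id = null, $title = null, $body = null )
    {
        $this->id = $id;
        $this->title = $title;
        $this->body = $body;
    }

    // Returns the properties as an array keyed by property name.
    public function getState()
    {
        return array(
            'id' => $this->id,
            'title' => $this->title,
            'body' => $this->body,
        );
    }

    // Restores the properties from an array of the same shape.
    public function setState( array $state )
    {
        foreach ( $state as $key => $value )
        {
            $this->$key = $value;
        }
    }
}

// Round trip: export the state, then rebuild an equal object from it.
$original = new SketchArticle( 'a1', 'Hello', 'First article' );
$copy = new SketchArticle();
$copy->setState( $original->getState() );
var_dump( $copy == $original ); // bool(true)
```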
| |
| With the class and definition in place, indexing an object is relatively |
| simple: instantiate the document and pass it to the index() method of the |
| session, as the next example shows: |
| |
| |
| .. include:: tutorial/index.php |
| :literal: |
| |
| If you are indexing a large number of documents, it's wise to wrap the calls |
| in an indexing transaction. Handlers that support transactions can use this |
| to optimize the indexing process. See the |
| ezcSearchSession->beginTransaction() documentation. |
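| |
| A batch wrapped in a transaction might look like the sketch below. The |
| helper name indexAll() is hypothetical, and the sketch assumes that commit() |
| is the call that ends a transaction started with beginTransaction(). The |
| session argument carries no type hint so that the helper can be tried with a |
| stand-in object; in real use it would be an ezcSearchSession: |

```php
<?php
/**
 * Hypothetical helper: indexes many documents inside one transaction.
 * $session is expected to be an ezcSearchSession (assumption: it offers
 * beginTransaction(), index() and commit()).
 */
function indexAll( $session, array $documents )
{
    $session->beginTransaction();
    foreach ( $documents as $document )
    {
        $session->index( $document );
    }
    // End the transaction that was started above.
    $session->commit();

    // Report how many documents were handed to the session.
    return count( $documents );
}
```

| Handlers without transaction support may not gain anything from this |
| wrapping, so it is worth measuring before relying on it. |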
| |
| |
| Data Types |
| ---------- |
| |
| The Search component understands many data types, but they might not always |
| be representable by every handler. The table below explains the different |
| data types that are available: |
| |
| ======== ============================================================== |
| Constant Description |
| ======== ============================================================== |
| BOOLEAN  Stores a true or false boolean value. |
| -------- -------------------------------------------------------------- |
| STRING   Untokenized text, useful for keywords or facets. |
| -------- -------------------------------------------------------------- |
| TEXT     Tokenized text, useful for summaries and large pieces of text. |
| -------- -------------------------------------------------------------- |
| HTML     Tokenized HTML documents, strips out all tags and attributes. |
| -------- -------------------------------------------------------------- |
| DATE     Stores Unix timestamps and DateTime_ objects. |
| -------- -------------------------------------------------------------- |
| INT      Stores integer numbers, which can be used in range searches. |
| -------- -------------------------------------------------------------- |
| FLOAT    Stores floating point numbers, which can be used in range |
|          searches. |
| ======== ============================================================== |
| |
| .. _DateTime: http://php.net/manual/en/function.date-create.php |
| |
| Searching |
| ========= |
| |
| After documents are indexed, they are searchable. A search query can be built |
| in two ways. The `Query Language`_ approach is the more powerful one, but is |
| also more complex. Alternatively, the `Query Builder`_ approach builds the |
| query from a search string that you supply. |
| |
| Query Language |
| -------------- |
| |
| The ezcSearchQuery interface defines the methods that every handler has to |
| implement to provide the query language. These include methods such as |
| where(), lOr() and between(), very similar to what the ezcQuerySelect and |
| ezcQueryExpression classes provide. The following example shows how to use |
| the query language: |
| |
| .. include:: tutorial/search-query-language.php |
| :literal: |
| |
| The result of the query is returned in the form of an ezcSearchResult object. |
| This contains the documents, but also information about facets and pagination. |
| See the documentation of the ezcSearchResult class for more information. |
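| |
| How a query and its result fit together can be sketched as follows. The |
| function name findArticleTitles() is hypothetical. Only where(), lOr() and |
| between() are named in this document; createFindQuery(), find(), eq(), |
| limit() and the ``documents`` and ``document`` properties are assumptions |
| about the component's API and should be checked against the class |
| documentation. The session argument is left without a type hint so that the |
| sketch can be tried with a stand-in object: |

```php
<?php
/**
 * Hypothetical helper: returns the titles of articles whose title matches
 * $word or whose publication date lies between $from and $to.
 */
function findArticleTitles( $session, $word, $from, $to )
{
    // Assumption: the session creates a query object for a document type.
    $q = $session->createFindQuery( 'Article' );

    // Combine two conditions with a logical OR.
    $q->where(
        $q->lOr(
            $q->eq( 'title', $word ),
            $q->between( 'published', $from, $to )
        )
    );
    $q->limit( 10 );

    // Assumption: each entry in "documents" wraps the indexed object
    // in its "document" property.
    $result = $session->find( $q );
    $titles = array();
    foreach ( $result->documents as $hit )
    {
        $titles[] = $hit->document->title;
    }

    return $titles;
}
```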
| |
| |
| Query Builder |
| ------------- |
| |
| The query builder approach allows you to use more powerful query strings |
| instead of having to use the API to create queries. With this you can allow |
| query strings such as ``foo -bar``, while still searching in multiple fields. |
| Be aware, however, that whether the expected results are actually returned |
| depends on the handler. The query builder will most likely work best if |
| you're searching in only one field. At the moment the query builder |
| understands ``+``, ``-``, grouping (with ``(`` and ``)``), the AND and OR |
| modifiers, as well as phrases (enclosed in double quotes). |
| |
| An example that searches in two fields (body and title) follows: |
| |
| .. include:: tutorial/search-query-builder.php |
| :literal: |
| |
| |
| .. |
| Local Variables: |
| mode: rst |
| fill-column: 79 |
| End: |
| vim: et syn=rst tw=79 |