blob: 547fb3baeb54d45bf772aef6c325d6ee0c069e9d [file] [log] [blame]
eZ Components - Search
~~~~~~~~~~~~~~~~~~~~~~
.. contents:: Table of Contents
Introduction
============
The search component allows you to index and search documents. A document
consists of an object. A definition maps the object's properties to fields of a
document, just like the PersistentObject_ component. The indexing process takes
the document and indexes the fields depending on the data type. The searching
part allows you to search for documents in the index with a rich query
language.
.. _PersistentObject: introduction_PersistentObject.html
Class overview
==============
This section gives you an overview of the main classes in the Search
component.
ezcSearchSession
This class provide access to the search index for both indexing and
searching. All operations towards the index go through this class which
is configured with a search handler and a definition manager.
ezcSearchSolrHandler
The handler that uses Solr_ for indexing and searching.
ezcSearchQueryBuilder
A class that allows you to build a search query from a search string.
ezcSearchQuery
An interface to building complex search queries in case
ezcSearchQueryBuilder is not up to the task.
.. _Solr: http://lucene.apache.org/solr/
Search Handlers
===============
Search handlers provide the link between the abstract query and document
interfaces to the mechanism that actually stores the index, and allows
querying for documents. Not all handlers handle all the different query
words or datatypes in the index, but effort is to put in to make as much
use of the handler's functionality as possible.
Solr
----
This handler uses Apache's Solr as backend. It accesses Solr over TCP/IP as a
web service. Solr is a very capable search provider, with many features__.
__ http://lucene.apache.org/solr/features.html
Using the handler is relatively easy. You basically only have to instantiate
the handler class which then can be passed to the ezcSearchSession
constructor:
.. include:: tutorial/solr-handler.php
:literal:
Solr requires a schema to work. This scheme defines how data types work
and allows for many more customizations. The default schema that comes with
Solr requires a few minor changes to make it work with the Search component.
`This schema`__ should be used as a basis for the Search component.
__ https://svn.apache.org/repos/asf/incubator/zetacomponents/trunk/Search/docs/schema.xml
Zend_Search_Lucene
------------------
The component also provides a backend based on Zend's Lucene implementation.
This backend has *many* limitations compared to Solr_, such as missing
multi-valued field support, no data-type support and a much lower performance.
In order to use this backend, you need to have a specific autoload function as
well as the Zend Framework installed and included in the PHP included path. An
example on how to use it follows:
.. include:: tutorial/zend-lucene.php
:literal:
Definition Managers
===================
A definition manager maps properties of an object to fields in a search
document. As most search handlers support fields with arbitrary names, you
don't actually provide the name of the fields in the search index. Instead,
the mapping configures several things for an object's property.
First of all, every document type needs an ID field. This ID will uniquely
define a document in the search index. There can only be one ID field, and
there has to be one. For each field, you have to define the data type, and
optionally you can configure:
- The importance of that field (boost factor)
- Whether the field should be a part of the resulting document
- Whether the field supports multiple values
- Whether highlighting should be performed for this field on result documents
Definitions can be supplied in two ways. The `embedded manager`_ retrieves
the definitions from the document classes directly, whereas the `xml manager`_
uses external XML file to read definitions from.
Embedded Manager
----------------
The ezcSearchEmbeddedManager retrieves the definition from a class that
implements the ezcSearchDefinitionProvider interface. This interfaces specifies
the getDefinition() method that should be implemented by the classes to return
the document's definition mappings. An example of the implementation of the
getDefinition() method is::
static public function getDefinition()
{
$n = new ezcSearchDocumentDefinition( __CLASS__ );
$n->idProperty = 'id';
$n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
$n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
$n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
$n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
$n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
$n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
return $n;
}
Basically what this method does is construct an ezcSearchDocumentDefinition
object containing all the field definitions. It's required to have an ID
property. It's recommended to use a TEXT data type for this, although it is not
required. See the section `Data Types`_ on the differences between data types.
Each field is then added to the fields property as a
ezcSearchDefinitionDocumentField object. The field index should be the same as
the first argument to the constructor of this class. By default the type will
be ezcSearchDefinitionDocumentField::TEXT. Subsequent arguments control the
importance (boost) of a field, whether it should be part of the result, whether
multiple values for this field are accepted and whether it should be selected
for highlighting.
XML Manager
-----------
The ezcSearchXmlManager uses XML files to obtain a document definition from.
The manager is configured with the directory where the XML definition files can
be found in the constructor::
<?php
$xm = new ezcSearchXmlManager( 'search-defs/' );
?>
The names of the definition files are required to be
``name-of-class-in-lower-case.xml``. This means that for the class Article the
file ``article.xml`` is being read. The file itself is a simple XML file. The
file below demonstrates the same definition as the one in the example in
`Embedded Manager`_:
.. include:: tutorial/example.xml
:literal:
The RelaxNG-Compressed schema is:
.. include:: tutorial/definition-schema.rnc
:literal:
Search Session
==============
The search session is responsible for indexing documents, and searching
for documents. The session object requires both a search handler and a
definition manager. The handler is used for storing the index, while the
definition manager is used to find the definition that maps object's properties
to search index fields. Creating a session is simple, as is demonstrated in the
following example:
.. include:: tutorial/create-session.php
:literal:
Indexing
========
With the session created, it is time to index documents. Before we can index
anything we need to create an object and create the definition. For this
tutorial we'll reuse the definition from the `Embedded Manager`_ section and
create a class out of this. Each class that you want to index through the
Search component needs to implement the ezcBasePersistable interface.
This interfaces defines two methods: getState() and setState() as well as the
requirement that the constructor should be able to be called without any
arguments. Those methods are used for fetching and re-creating the state
of this object, similarly to what PersistentObject_ requires.
To see everything in perspective, the full class follows here, including the
definition method:
.. include:: tutorial/article.php
:literal:
The ezcBasePersistable interface is also compatible with PersistentObject_,
although there the interface is not enforced.
After we've created the class and definition, indexing an object is relatively
simple. After instantiation, indexing the document is done by calling the
index() method of the session as you can see in the next example:
.. include:: tutorial/index.php
:literal:
If you are indexing a large amount of documents, it's wise to wrap this into an
indexing transaction. For the handlers that support this, this will optimize
the indexing process. See the ezcSearchSession->beginTransaction()
documentation.
Data Types
----------
The Search component understands many data types, but they might not always
be representable by every handler. The table below explains the different
data types that are available:
======== ==============================================================
Constant Description
======== ==============================================================
BOOLEAN Stores a true or false boolean value
-------- --------------------------------------------------------------
STRING Untokenized text, useful for keywords or facets.
-------- --------------------------------------------------------------
TEXT Tokenized text, useful for summaries and large pieces of text.
-------- --------------------------------------------------------------
HTML Tokenized HTML documents, strips out all tags and attributes.
-------- --------------------------------------------------------------
DATE Stores Unix timestamps and DateTime_ objects.
-------- --------------------------------------------------------------
INT Stores integer numbers, which can be used in range searches.
-------- --------------------------------------------------------------
FLOAT Stores floating point numbers, which can be used in range
searches.
======== ==============================================================
.. _DateTime: http://php.net/manual/en/function.date-create.php
Searching
=========
After documents are indexed, they are searchable. Building a search query can
be done in two ways. The `Query Language`_ approach is the most powerful one,
but is more complex. Alternatively you can use the `Query Builder`_ approach
which lets you feed it a string and it will build the query from that string.
Query Language
--------------
The ezcSearchQuery interface defines all the methods that handlers should
implement to realize the query language for every handler. This interface
defines methods such as where(), lOr() and between() - very similar to what the
ezcQuerySelect and ezcQueryExpression classes provide. The following example
shows how to use the query language:
.. include:: tutorial/search-query-language.php
:literal:
The result of the query is returned in the form of an ezcSearchResult object.
This contains the documents, but also information about facets and pagination.
See the documentation of the ezcSearchResult class for more information.
Query Builder
-------------
The query builder approach allows you to use more powerful query strings
instead of having to use the API to create queries. With this you can allow
query strings such as ``foo -bar``, while still searching in multiple fields.
Be aware however, that it depends on the handlers whether it will actually
return the expected results. The query builder interface will most likely work
best if you're only searching in one field only. At the moment the query
builder understands ``+``, ``-``, grouping ( with '(' and ')' ), AND and OR
modifiers, as well as phrases (enclosed in ").
An example that searches in two fields (body and title) follows:
.. include:: tutorial/search-query-builder.php
:literal:
..
Local Variables:
mode: rst
fill-column: 79
End:
vim: et syn=rst tw=79