The search component allows you to index and search documents. A document
consists of an object. A definition maps the object's properties to fields of a
document, just like the PersistentObject component. The indexing process takes
the document and indexes the fields depending on the data type. The searching
part allows you to search for documents in the index with a rich query
language.
Search handlers provide the link between the abstract query and document
interfaces and the mechanism that actually stores the index and allows
querying for documents. Not all handlers support every query word or data
type in the index, but effort is put into making as much use of each
handler's functionality as possible.
Currently the only implemented handler is one that talks to Apache Solr,
which is accessed over TCP/IP as a web service. Solr is a very capable
search provider with many features.
Using the handler is relatively easy: instantiate the handler class and
pass the resulting object to the ezcSearchSession constructor:

<?php
require_once 'tutorial_autoload.php';

// on localhost with the default port
$handler = new ezcSearchSolrHandler;

// on another host with a different port
$handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
?>

Solr requires a schema to work. This schema defines how data types work
and allows for many more customizations. The default schema that comes with
Solr requires a few minor changes to make it work with the Search component.
This schema should be used as a basis for the Search component.
A definition manager maps properties of an object to fields in a search
document. As most search handlers support fields with arbitrary names, you
don't actually provide the name of the fields in the search index. Instead,
the mapping configures several things for an object's property.
First of all, every document type needs an ID field. This ID uniquely
identifies a document in the search index. There must be exactly one ID
field. For each field, you have to define the data type, and
optionally you can configure:
- The importance of that field (boost factor)
- Whether the field should be a part of the resulting document
- Whether the field supports multiple values
- Whether highlighting should be performed for this field on result documents
Definitions can be supplied in two ways. The embedded manager retrieves
the definitions from the document classes directly, whereas the XML manager
reads definitions from external XML files.
The ezcSearchEmbeddedManager retrieves the definition from a class that
implements the ezcSearchDefinitionProvider interface. This interface specifies
the getDefinition() method, which classes implement to return the document's
definition mappings. An example implementation of the getDefinition() method
is:

static public function getDefinition()
{
    $n = new ezcSearchDocumentDefinition( __CLASS__ );
    $n->idProperty = 'id';
    $n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
    $n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
    $n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
    $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
    $n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
    $n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
    return $n;
}

This method constructs an ezcSearchDocumentDefinition object containing all
the field definitions. An ID property is required. Using the TEXT data type
for it is recommended, although not required. See the section Data Types on
the differences between data types.
Each field is then added to the fields property as an
ezcSearchDefinitionDocumentField object. The array key should be the same as
the first argument to the constructor of this class. By default the type is
ezcSearchDocumentDefinition::TEXT. Subsequent arguments control, in order, the
importance (boost) of a field, whether it should be part of the result, whether
multiple values for this field are accepted and whether it should be selected
for highlighting.
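To make the argument order concrete, here is a minimal sketch of a single field definition with every argument spelled out. The 'tags' property and the buildTagsField() helper are hypothetical and not part of the tutorial's Article class; the sketch assumes the Search component's classes are loaded before the function is called.

```php
<?php
// Sketch only: assumes the Search component's autoloader is set up before
// this function is called. The 'tags' property is hypothetical.
function buildTagsField()
{
    return new ezcSearchDefinitionDocumentField(
        'tags',                              // property name, also used as the key in $n->fields
        ezcSearchDocumentDefinition::STRING, // data type: untokenized text
        1,                                   // boost: importance of the field
        true,                                // part of the result document
        true,                                // multiple values accepted
        false                                // not selected for highlighting
    );
}
```

Inside getDefinition() the return value would be stored as $n->fields['tags'], matching the first constructor argument.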
The ezcSearchXmlManager obtains document definitions from XML files. The
directory containing the XML definition files is passed to the manager's
constructor:

<?php
$xm = new ezcSearchXmlManager( 'search-defs/' );
?>

The names of the definition files are required to be
name-of-class-in-lower-case.xml. This means that for the class Article the
file article.xml is read. The file itself is a simple XML file. The
file below shows a definition similar to the one in the example in
Embedded Manager:

<?xml version="1.0"?>
<document>
    <field type="id">id</field>
    <field type="text" highLight="true" boost="2">title</field>
    <field inResult="false" type="html">body</field>
    <field type="date">published</field>
    <field type="string">url</field>
    <field highLight="true" type="string">type</field>
</document>

The RELAX NG Compact schema is:

default namespace = "http://components.ez.no/Search"

start =
    element document {
        field+
    }

field =
    element field {
        attribute type { xsd:string },
        attribute highLight { 'true' | 'false' }?,
        attribute inResult { 'true' | 'false' }?,
        attribute multi { 'true' | 'false' }?,
        attribute boost { xsd:float }?,
        string
    }

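To get a feel for what a definition file encodes, the following sketch reads one with PHP's SimpleXML extension. It does not use the Search component at all, and both helper functions are hypothetical names introduced for illustration. It assumes a definition document without a namespace declaration, like the example file above, and reports optional attributes that are absent as null rather than guessing at the manager's defaults.

```php
<?php
// Hypothetical helper: the file name expected for a class, following the
// name-of-class-in-lower-case.xml convention.
function definitionFileName( $className )
{
    return strtolower( $className ) . '.xml';
}

// Hypothetical helper: read the fields declared in a definition document.
// Optional attributes that are absent are reported as null.
function listDefinitionFields( $xml )
{
    $fields = array();
    foreach ( simplexml_load_string( $xml )->field as $field )
    {
        $fields[(string) $field] = array(
            'type'      => (string) $field['type'],
            'boost'     => isset( $field['boost'] ) ? (float) $field['boost'] : null,
            'inResult'  => isset( $field['inResult'] ) ? (string) $field['inResult'] === 'true' : null,
            'highLight' => isset( $field['highLight'] ) ? (string) $field['highLight'] === 'true' : null,
        );
    }
    return $fields;
}

// A shortened copy of the example definition
$xml = <<<'XML'
<document>
  <field type="id">id</field>
  <field type="text" highLight="true" boost="2">title</field>
  <field inResult="false" type="html">body</field>
</document>
XML;

// Print the expected file name, then each field with its type
echo definitionFileName( 'Article' ), "\n";
foreach ( listDefinitionFields( $xml ) as $name => $options )
{
    echo $name, ': ', $options['type'], "\n";
}
```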
The search session is responsible for indexing documents and searching
for documents. The session object requires both a search handler and a
definition manager. The handler is used for storing the index, while the
definition manager is used to find the definition that maps an object's
properties to search index fields. Creating a session is simple, as the
following example demonstrates:

<?php
require_once 'tutorial_autoload.php';

$handler = new ezcSearchSolrHandler;
$manager = new ezcSearchEmbeddedManager;

$session = new ezcSearchSession( $handler, $manager );
?>

With the session created, it is time to index documents. Before we can index
anything we need to create a class and its definition. For this
tutorial we'll reuse the definition from the Embedded Manager section and
build a class around it. Each class that you want to index through the
Search component needs to implement the ezcBasePersistable interface.
This interface defines two methods, getState() and setState(), and also
requires that the constructor can be called without any
arguments. Those methods are used for fetching and re-creating the state
of the object, similar to what PersistentObject requires.
To see everything in perspective, the full class follows here, including the
definition method:

<?php
class Article implements ezcBasePersistable, ezcSearchDefinitionProvider
{
    // public, because the indexing example assigns these properties directly
    public $id;
    public $title;
    public $body;
    public $published;
    public $url;
    public $type;

    function __construct( $id = null, $title = null, $body = null, $published = null, $url = null, $type = null )
    {
        $this->id = $id;
        $this->title = $title;
        $this->body = $body;
        $this->published = $published;
        $this->url = $url;
        $this->type = $type;
    }

    // returns the object's state as an array of property => value pairs
    function getState()
    {
        $state = array(
            'id' => $this->id,
            'title' => $this->title,
            'body' => $this->body,
            'published' => $this->published,
            'url' => $this->url,
            'type' => $this->type,
        );
        return $state;
    }

    // restores the object's state from an array of property => value pairs
    function setState( $state )
    {
        foreach ( $state as $key => $value )
        {
            $this->$key = $value;
        }
    }

    static public function getDefinition()
    {
        $n = new ezcSearchDocumentDefinition( __CLASS__ );
        $n->idProperty = 'id';
        $n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
        $n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
        $n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
        $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
        $n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
        $n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );

        return $n;
    }
}
?>

The ezcBasePersistable interface is also compatible with PersistentObject,
although there the interface is not enforced.
After we've created the class and definition, indexing an object is relatively
simple. Instantiate the object, fill in its properties and pass it to the
index() method of the session, as the next example shows:

<?php
require_once 'tutorial_autoload.php';

// setup
$handler = new ezcSearchSolrHandler;
$manager = new ezcSearchEmbeddedManager;
$session = new ezcSearchSession( $handler, $manager );

// instantiate article
$article = new Article();
$article->title = "A test article to show indexing.";
$article->body = <<<ENDBODY
This is the body of the text, nothing interesting now
as this is just an example.
ENDBODY;
$article->published = time();
$article->url = "/article/1";
$article->type = "article";

// index
$session->index( $article );
?>

If you are indexing a large number of documents, it's wise to wrap the calls in
an indexing transaction. For the handlers that support this, it will optimize
the indexing process. See the ezcSearchSession->beginTransaction()
documentation.
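A transaction around a batch could look like the sketch below. The indexAll() helper is hypothetical. It expects an ezcSearchSession as its first argument and assumes that commit() is the counterpart of the beginTransaction() method mentioned above; check the class documentation before relying on it.

```php
<?php
/**
 * Hypothetical helper: index many documents inside one transaction.
 *
 * $session is expected to be an ezcSearchSession. commit() is assumed to
 * close the transaction opened by beginTransaction().
 */
function indexAll( $session, array $documents )
{
    $session->beginTransaction();
    foreach ( $documents as $document )
    {
        $session->index( $document );
    }
    $session->commit();

    // number of documents handed to the session
    return count( $documents );
}
```

With the session and Article class from the earlier examples, a call would read indexAll( $session, array( $article1, $article2 ) ).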
The Search component understands many data types, but they might not always
be representable by every handler. The table below explains the different
data types that are available:
| Constant | Description                                                           |
|----------|-----------------------------------------------------------------------|
| BOOLEAN  | Stores a true or false boolean value.                                 |
| STRING   | Untokenized text, useful for keywords or facets.                      |
| TEXT     | Tokenized text, useful for summaries and large pieces of text.        |
| HTML     | Tokenized HTML documents, strips out all tags and attributes.         |
| DATE     | Stores Unix timestamps and DateTime objects.                          |
| INT      | Stores integer numbers, which can be used in range searches.          |
| FLOAT    | Stores floating point numbers, which can be used in range searches.   |
After documents are indexed, they are searchable. Building a search query can
be done in two ways. The Query Language approach is the most powerful one,
but is more complex. Alternatively, the Query Builder approach builds the
query from a search string that you supply.
The ezcSearchQuery interface defines all the methods that handlers should
implement to realize the query language for every handler. This interface
defines methods such as where(), lOr() and between() - very similar to what the
ezcQuerySelect and ezcQueryExpression classes provide. The following example
shows how to use the query language:

<?php
require_once 'tutorial_autoload.php';

// setup
$handler = new ezcSearchSolrHandler;
$manager = new ezcSearchEmbeddedManager;
$session = new ezcSearchSession( $handler, $manager );

// initialize a pre-configured query
$q = $session->createFindQuery( 'Article' );

$searchWord = 'test';

// where either body or title contains the $searchWord
$q->where(
    $q->lOr(
        $q->eq( 'body', $searchWord ),
        $q->eq( 'title', $searchWord )
    )
);

// limit the query and order
$q->limit( 10 );
$q->orderBy( 'title' );

// add a facet on url (not very useful)
$q->facet( 'url' );

// run the query and show titles for found documents
$r = $session->find( $q );

foreach( $r->documents as $res )
{
    echo $res->document->title, "\n";
}
?>

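The between() method mentioned above combines with the other expressions in the same way. The sketch below restricts a query to a word in one field and a numeric range in another. The addWordAndRange() helper is hypothetical, and lAnd() is assumed to exist as the counterpart of lOr(), as it does in ezcQueryExpression. Range searches are meant for INT or FLOAT fields, so the range field has to be defined with one of those types.

```php
<?php
/**
 * Hypothetical helper: require a word in one field and a value range in
 * another.
 *
 * $q is expected to be a find query created with createFindQuery().
 * lAnd() is an assumption, mirroring lOr() from the example above.
 */
function addWordAndRange( $q, $wordField, $word, $rangeField, $min, $max )
{
    $q->where(
        $q->lAnd(
            $q->eq( $wordField, $word ),
            $q->between( $rangeField, $min, $max )
        )
    );
    return $q;
}
```

For a document type with an INT field named rating (hypothetical), the call would be addWordAndRange( $q, 'title', 'test', 'rating', 3, 5 ).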
The result of the query is returned in the form of an ezcSearchResult object.
This contains the documents, but also information about facets and pagination.
See the documentation of the ezcSearchResult class for more information.
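As a rough illustration of reading such a result, the sketch below turns it into a list of text lines. The summarizeResult() helper is hypothetical. The documents property and the nested document property appear in the example above; resultCount and facets, and the assumption that facets maps each field name to an array of value => count pairs, should be checked against the ezcSearchResult documentation.

```php
<?php
/**
 * Hypothetical helper: summarise a search result as text lines.
 *
 * Assumes $result has resultCount, documents and facets properties, with
 * facets shaped as array( field => array( value => count ) ).
 */
function summarizeResult( $result )
{
    $lines = array( "Total hits: {$result->resultCount}" );

    // one line per returned document
    foreach ( $result->documents as $hit )
    {
        $lines[] = 'Title: ' . $hit->document->title;
    }

    // one line per facet value
    foreach ( $result->facets as $field => $counts )
    {
        foreach ( $counts as $value => $count )
        {
            $lines[] = "Facet $field: $value ($count)";
        }
    }
    return $lines;
}
```

With the query from the previous example, echo implode( "\n", summarizeResult( $r ) ) would print the summary.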
The query builder approach allows you to use more powerful query strings
instead of having to use the API to create queries. With this you can allow
query strings such as foo -bar, while still searching in multiple fields.
Be aware, however, that whether it actually returns the expected results
depends on the handler. The query builder will most likely work
best if you're searching in one field only. At the moment the query
builder understands the + and - modifiers, grouping with ( and ), the AND and
OR operators, and phrases (enclosed in ").
An example that searches in two fields (body and title) follows:

<?php
require_once 'tutorial_autoload.php';

// setup
$handler = new ezcSearchSolrHandler;
$manager = new ezcSearchEmbeddedManager;
$session = new ezcSearchSession( $handler, $manager );

// initialize a pre-configured query
$q = $session->createFindQuery( 'Article' );

// where either body or title contains test but not article
$searchWord = 'test -article';

// run the query builder to search for the $searchWord in body and title
$qb = new ezcSearchQueryBuilder();
$qb->parseSearchQuery( $q, $searchWord, array( 'body', 'title' ) );

// run the query and show the score and title of each found document
$r = $session->find( $q );

foreach( $r->documents as $res )
{
    // the score belongs to the result entry, not to the Article object
    echo $res->score, ", ", $res->document->title, "\n";
}
?>