Chapter 17. Indexing and searching XML with swish-e
Indexing is one half of the information retrieval coin; the other half is databases. This section outlines how to use swish-e to index, search, and retrieve XML files.
Swish-e is an uncomplicated indexer/search engine that works on Unix/Linux computers as well as Windows. Once built, you feed the swish-e binary a configuration file and/or a set of command line switches to index content. This content can be individual files on a file system, files retrieved by crawling a website, or a stream of content from another application such as a database.
The indexing half of swish-e is able to index plain text files (including XML data) or data structured as HTML-like streams. Using supplied plug-ins swish-e can index some binary files such as Microsoft Word and Excel documents and PDF documents. The plug-ins work by converting the documents into plain text files and streaming these conversions to the indexer. Searches against these kinds of indexes return pointers to the originals and thus facilitate retrieval. The indexes created by swish-e are portable from file system to file system.
The same binary that creates the indexes can be used to search them. Swish-e supports relevance ranking, Boolean operations, right-hand truncation, field searching, and nested queries. Later versions of swish-e come with C, Perl, and PHP APIs allowing developers to create stand-alone applications or CGI interfaces to their indexes.
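To make the "same binary, two roles" idea concrete, here is a minimal sketch in Python that only assembles the command lines for indexing and for searching; it does not run them. It assumes a swish-e binary on the PATH and uses the -c (configuration file), -f (index file), -w (query words), and -m (maximum results) switches. The file names swish.conf and index.swish-e, the sample query, and the function names are hypothetical, chosen for illustration.

```python
import shlex

SWISH_BINARY = "swish-e"  # assumption: the binary is installed and on PATH


def index_command(config_file, index_file):
    """Return the argument list that builds an index from a configuration file."""
    return [SWISH_BINARY, "-c", config_file, "-f", index_file]


def search_command(index_file, query, max_results=None):
    """Return the argument list that searches an existing index."""
    argv = [SWISH_BINARY, "-f", index_file, "-w", query]
    if max_results is not None:
        # -m limits the number of results returned
        argv += ["-m", str(max_results)]
    return argv


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of executing them.
    print(shlex.join(index_command("swish.conf", "index.swish-e")))
    # A query mixing a field search, Boolean operators, nesting,
    # and right-hand truncation (librar*).
    query = "title=(time and magazine) or librar*"
    print(shlex.join(search_command("index.swish-e", query, max_results=10)))
```

Passing either list to subprocess.run would execute the command. Keeping the arguments as a list rather than a single string avoids shell quoting problems with the parentheses and asterisk in the query.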
Swish-e is not perfect. For example, searches against indexes return only properties of located items (such as title, author, abstract, etc.) and pointers to original documents. It is very difficult to return only parts of a particular document, say paragraph number three of a TEI file. Without a lot of intervention, swish-e does not support exact field matching, so queries like "title=time" will return Time, Time Magazine, as well as Time and Time Again. Lastly, and probably most importantly, the swish-e indexer does not support Unicode very well. The multi-byte nature of Unicode characters confuses the indexer.
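One form the "intervention" for exact field matching can take is a post-filter in the calling application. The following is a minimal sketch in Python, assuming the search results have already been parsed into dictionaries carrying a title property. The function name, the sample records, and their paths are hypothetical and not part of swish-e.

```python
def exact_title_matches(results, wanted):
    """Keep only results whose title equals `wanted`,
    ignoring case and surrounding whitespace."""
    target = wanted.strip().casefold()
    return [
        record for record in results
        if record.get("title", "").strip().casefold() == target
    ]


# Hypothetical records standing in for what a "title=time" search returns.
hits = [
    {"title": "Time", "path": "/docs/time.xml"},
    {"title": "Time Magazine", "path": "/docs/time-magazine.xml"},
    {"title": "Time and Time Again", "path": "/docs/time-again.xml"},
]

if __name__ == "__main__":
    for record in exact_title_matches(hits, "time"):
        print(record["title"], record["path"])
```

The filter runs after the search, so the index still does the heavy lifting of finding candidates; the application only discards titles that merely contain the query term.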
Swish-e is an unsung hero. Its inherently open nature allows for the creation of some very smart search engines supporting things like spelling correction, thesaurus intervention, and "best bet" implementations. Of all the different types of information services librarians provide, access to indexes is one of the biggest. With swish-e, librarians could create their own indexes and rely on commercial bibliographic indexers less and less.
Installing swish-e is almost a trivial matter. On Unix/Linux systems you simply:
download the archive
change into the distribution's directory
configure ( ./configure )
test ( make check )