PLEX - the PLucene EXperience
2005-01-27 20:00
After another day of documentation-reading, source-understanding, trial-and-error-ing and specification-writing, I finally think that I have a grasp on the Plucene engine. I still think that it is usable; you'll just have to handle it with care.
There were a few new noteworthy things about all this:
- Plucene has a specific keyword field, which prevents the tokenizer from splitting up its contents. Unfortunately, you cannot query it, as the query parser's tokenizer splits up the query string no matter what you do. This restricts keyword fields to two main applications: unique identifiers (for deletion) and date fields (for date range queries). See also this post.
- Contrary to the documentation (surprise!), Plucene does support the asterisk (*) wildcard at the end of a term. It is slow and throws Perl warnings about undefined variables here and there, but it works... a bit. Perhaps not well enough to recommend its usage.
- With big result sets, which are easy to produce with the * operator, Plucene sometimes crashes with a Too-Many-Files-Open error.
- Also undocumented is the fact that the similarity operator ~ is not supported. That is a pity, as it would be great for things like "pages similar to this one"-style operations.
- In general I have the feeling that Plucene is a bit buggy here and there. While experimenting with some phrases in the query string, I got some more uninitialized variable warnings.
- The scoring of the result documents is not limited to the range [ 0.0 - 1.0 ], like it is (or perhaps should be) in Lucene. This calls for some creative post-processing of the scores before they get back to MidCOM.
- Performance does not look that bad in other cases, with queries taking less than a second on my 256 MB Celeron-1000 devel server here. Tested with an index of 5000 small documents. Still not a real-life example, but getting closer.
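The score post-processing mentioned in the list above could be as simple as scaling everything by the top score. Here is a minimal sketch in Python (used purely for illustration; the real code would live in the Perl or PHP layer). The function name and the shape of the hit list are my own assumptions, not anything Plucene or MidCOM provides.

```python
def normalize_scores(hits):
    """Scale raw scores into [0.0, 1.0] so that the best hit gets 1.0.

    `hits` is a list of (doc_id, raw_score) pairs. This shape is an
    assumption for illustration, not Plucene's actual result format.
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        # Nothing sensible to scale by; report every hit as 0.0.
        return [(doc_id, 0.0) for doc_id, _ in hits]
    # Clamp at 0.0 in case a raw score is negative.
    return [(doc_id, max(0.0, score / top)) for doc_id, score in hits]
```

Scaling by the top hit keeps the relative ordering intact, but it also means that scores from two different queries are not comparable with each other.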
Summing up, it's not that bad, but it is not that good either. Well, I guess I'll just have to live with it. I'm not that sure that a pure Lucene solution would be better. While Lucene is proven, getting Java to run smoothly usually holds its own surprises.
Anyway.
I have finished the specification of the XML communication protocol to access remote indexers like Plucene, at least so far. While writing it up, a few new things came to my mind that have to be taken care of when writing the PHP layer.
The distinction between (P)Lucene's storage types is rather crucial for an effective index. A short summary:
- date is a date-wrapped field suitable for use with the Date Filter.
- keyword is stored and indexed, but not tokenized.
- unindexed is stored but neither indexed nor tokenized.
- unstored is not stored, but indexed and tokenized.
- text is stored, indexed and tokenized.
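The same summary can be written down as a small lookup table, which is roughly what the PHP layer will have to know about each type. This is only a sketch in Python; the names `FieldFlags` and `FIELD_TYPES` are made up for illustration. The date type is left out on purpose, as I have only described it as a date-wrapped field and not in terms of these three flags.

```python
from collections import namedtuple

# Hypothetical helper type: one flag per property from the summary above.
FieldFlags = namedtuple("FieldFlags", ["stored", "indexed", "tokenized"])

FIELD_TYPES = {
    "keyword":   FieldFlags(stored=True,  indexed=True,  tokenized=False),
    "unindexed": FieldFlags(stored=True,  indexed=False, tokenized=False),
    "unstored":  FieldFlags(stored=False, indexed=True,  tokenized=True),
    "text":      FieldFlags(stored=True,  indexed=True,  tokenized=True),
}


def types_where(flag):
    """Return the sorted names of all field types that have `flag` set."""
    return sorted(name for name, flags in FIELD_TYPES.items()
                  if getattr(flags, flag))
```

For example, `types_where("stored")` lists the types whose content can be read back from a search result, and `types_where("indexed")` lists the types that take part in searching at all.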
As I already wrote in the XML spec, neither date nor keyword can be reliably queried. While date cannot be identified in a result set (it is just some gibberish that a mundane guy won't identify as a date), a keyword cannot even be queried safely. Nevertheless we need them, as date is the only way to restrict searches to a given date range, and keyword is the only way to reliably store a unique identifier that is used for deletion.
This leaves us with three fields, all of which are useful in theory (so all will be implemented), with the focus on unstored and text.
Why am I telling you this? Good question, simple answer: The current MidCOM-Indexer API draft works on the basis of simple key/value pairs for a document. This is no longer sufficient. Instead, we will need Document and Field classes to distinguish between the five field types available in total. The Document class will be represented by a class hierarchy to support the various indexing targets more easily.
Only if we have this distinction during indexing will we be able to create an efficient index. For example, it does not make much sense to store a complete copy of a page's content in the index, unless you want to do some cached file stuff like Google does. Frankly, I don't even want to know how much that would impact performance. So the recommendation will be to index as much data as possible as unstored, and to use text fields only for abstracts, metadata and stuff like that.
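To make the Document and Field idea a bit more concrete, here is a rough sketch of how such classes might look. It is written in Python only to keep it short; the real thing will be PHP. All class names, field names and the `ArticleDocument` subclass are hypothetical and not part of the current MidCOM-Indexer draft.

```python
# The five field types discussed above.
FIELD_TYPES = ("date", "keyword", "unindexed", "unstored", "text")


class Field:
    """A single named value together with its storage type."""

    def __init__(self, name, value, field_type):
        if field_type not in FIELD_TYPES:
            raise ValueError("unknown field type: %s" % field_type)
        self.name = name
        self.value = value
        self.field_type = field_type


class Document:
    """Base class: every document carries a unique id for later deletion."""

    def __init__(self, identifier):
        self.fields = {}
        # keyword: stored and indexed, but never split up by the tokenizer.
        self.add(Field("id", identifier, "keyword"))

    def add(self, field):
        self.fields[field.name] = field


class ArticleDocument(Document):
    """Hypothetical subclass for one indexing target."""

    def __init__(self, identifier, title, abstract, content):
        super().__init__(identifier)
        # Short, displayable data goes into stored text fields ...
        self.add(Field("title", title, "text"))
        self.add(Field("abstract", abstract, "text"))
        # ... while the full body is searchable but not copied into the index.
        self.add(Field("content", content, "unstored"))
```

A subclass per indexing target keeps the decision about which field gets which type in one place, instead of leaving it to every caller.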
So, where do we go from here?
Tomorrow I'll try to come up with an initial Plucene front end that can talk XML the way I have specified it. I just hope that my Perl is good enough for that task. In the end, I'll finally learn more Perl, so it might not be that bad...