20 March 2008

Tags: engines knowledge lingway lucene management search semantics

Introduction

At Lingway, we’ve been using Lucene for a few years now. For those who are new to Lucene, here’s its bottomline : Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java..

Before criticizing, I must admit that Lucene is a very good high-performance full text search engine. For years, Lucene has been considered as a first class citizen when looking for an embeddable search engine written in Java. Its reputation has grown fast, and it is still now the best open source Java search engine available. There’s nothing to say about that : Doug Cutting has done a great job. However, it’s development has been going very slow those late months, and I think Lucene will most likely not keep in touch with today’s document processing needs. Don’t mess up : I am no search engine developer, I am a developer which leverages search engines in order to provide high relevance information retrieval technologies.

This post is about why Lucene may not be the best choice for future developments if nothing is done, and why the situation may not be close to change. In our situation, we push Lucene to its limits, although we make it work quite good. It’s a reason why we made some suggestions and submitted a patch to Lucene (which does not cover everything listed here) : Lingway uses semantics to generate complex queries where proximity matters. For example, if you are looking for documents on conflicts in middle east, you’ll probably also want to find documents talking about war in Iraq. In that case, war and Iraq are called expansions of conflict and middle east respectively. We provide a technology which analyzes your query in order to deduce the most relevant expansions, and generate queries for them. Yet, in order to get relevant results, this is insufficient : Google-like ranking or term frequency scoring like implemented in Lucene do not suit semantic scoring needs. For example, a document which contains both middle and east terms but separated by more than 1 word are most likely not what you want to find. Moreover, we should attribute lower scores on expansions than on the regular words. For example, we’ll give a better score to conflict in middle east phrase than in war in Iraq.

At Lingway, we think this kind of document retrieval technology is the future of search engines. Google is good at finding documents, but what we want is to find the most relevant ones. However, most (if not all) of current search engines have not been thought to perform such complex queries… Lucene is used by wikipedia, and you’ll notice that if you try to find more than a single word, most results are irrevelant…

Here’s a capture of the upcoming Lingway KM 3.7 interface, which demonstrates the requirements. Here, we write a query in french, which is used to find documents in english talking about the same subject. Note that this is more than plain translation, we call it cross language mode :

image

Note the matches in green : chanteur becomes singer, but we also find matches about singing. Same for pop which expands to blues… Now for the technical part:

6 ways of improving Lucene

\6. No built-in support for clustering. If you want to create clusters, either write your own implementation of a Directory, or use Solr, or Nutch+https://hadoop.apache.org/[Hadoop]. Both Solr and Nutch leverage Lucene, but are not straight replacements. Lucene is embeddable, while you must leverage on Solr or Nutch. Well, I think it is not very surprising that Hadoop idea emerged from the Lucene team : Lucene doesn’t scale out. It’s internals makes it (very) fast for most common situations, but for large document sets, you have to scale out, and as Lucene does not implement clustering at the core level, you must switch from Lucene to another search engine layer, which is not straightforward. The problem with switching to Solr or Nutch is that you carry things you probably won’t need : integrated crawling for Nutch and indexing server for Solr.

\5. Span queries are slow. This may be interpreted as a problem specific to Lingway where we make intensive use of span queries (NEAR operator : ``red NEAR car''), but the Lucene index structure has been updated to add this feature, which was not thought at first. The underlying implementation leads to complex algorithms that are very slow, especially when some term is repeated many times in a single large document. That’s why I tend to say that Lucene is a high-performance text search engine only if you use basic boolean queries.

\4. Scoring is not really pluggable. Lucene has its own implementation of a scoring algorithm, where terms may be boosted, and makes use of a Similarity class, but soon shows limitations when you want to perform complex scoring, for example based on actual matches and meta data about the query. If you want to do so, you’ll have to extend the Lucene query classes. The facts is that Lucene has been thought in terms of tf/idf like scoring algorithms, while in our situation, for linguistic based scoring, the structure of Lucene scoring facilities don’t fit. We’ve been obliged to override every Lucene query class in order to add support for our custom scoring. And that was a problem :

\3. Lucene is not well designed. As a software architect, I would tend to make this reason 1. Lucene has a very poor OO design. Basically, you have packages, classes, but almost no design pattern usage. I always makes me think about an application written by C(++) developers who discover Java and carry bad practices among them. This is a serious point : whenever you have to customize Lucene to suit your needs (and you will have to do so), you’ll have to face the problem. Here are some examples :

  • Almost no use of interfaces. Query classes (for example BooleanQuery, SpanQuery, TermQuery…) are all subclasses of an abstract class. If you want to add a feature to those classes, you’ll basically want to write an interface which describes the contract for your extension, but as the abstract Query class does not implement an interface, you’ll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls. Tons of examples like this (HitCollector, …). This is also a problem when you wan’t to use AOP and auto-proxying.

  • Unnatural iterator implementations. No hasNext() method, next() returns a boolean and refreshes the object context. This is a pain when you want to keep track of iterated elements. I assume this have been done intentionally to reduce memory footprint but it makes once again algorithms both unclear and complex.

\2. A closed API which makes extending Lucene a pain. In Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature. This leads to an API where most classes are package protected, which means you won’t ever be able to extend it (unless you create your class in the same package, which is quite dirty for custom code) or you’ll have to copy and rewrite the code. Moreover, and it is quite related to the previous point, there’s a serious lack of OO design here too : some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behaviour. The reason invoked for closing the API is that the code has to be cleaned up and made stable before releasing it publicly. While the idea is honourable, once again it is a pain because if you have some code that does not fit in the mainstream Lucene idea, you’ll have to constantly backport Lucene improvements to your version until your patch is accepted. However, as the developers want to limit API changes as long as possible, there’s little chance that your patch will ever be commited. Add some final modifiers on either classes or methods and you’ll face the problem. I don’t think the Spring framework would have come so popular if the code had been so locked…

\1. Lucene search algorithms are not adapted to grid computing. Lucene has been written when hardware did not have memory that much and multicore processors didn’t exist. Therefore, the index structure has been thought and implemented in order to perform fast linear searches with a very little memory footprint. I’ve personally spent lots of hours trying to rewrite span query algorithms so that I could make use of a multithreaded context (for dual/quad core cpus), but the iterator-based directory reading algorithms make it hardly impossible to do so. In some rare cases you’ll be able to optimize things and iterate the index in parallel, but in most situations it will be impossible. In our case, when we have a very complex query with 50+ embedded span queries, the CPU is almost not used while the system is constantly calling I/Os, even using a RAMDirectory.

Any alternative ?

I think the last point is the more problematic : Lucene reaches its limits when it goes to searching large datasets (with many operators or not) on modern hardware. That’s why I’ve been looking for an alternative to Lucene. After reading blog entries and a discussion about Wikia, I found that there were not so many alternatives. However, I finally came to a very promising solution : MG4J. It has a very good object design, excellent performance on search (indexing is slower than Lucene), a small memory footprint, is up to 10x faster than Lucene on my span query benchmarks, and is nativelly designed for clustering. It also has built-in support for payloads, while in Lucene it is a very recent addition which is still experimental. However, MG4J still misses some features such as easy incremental indexation (indices ARE clusters, but there’s no idea on performance issues), document removal and an easier to use indexing process. What made me happy is that I was able to reproduce the customizations I made on Lucene in a few hours where it took me days on Lucene.

I think there’s room for a new open source search engine which is not thought in terms of a single computer indexing a collection of documents with limited memory, but in terms of transparent distributed indexation and searching in order to provide fast answers on large datasets (think of Terracotta or GridGain as repartition frameworks) : you should not have to leverage an application to gain access to clustering features. Lucene has an excellent implementation of the first category of search engines, but I think this is not adapted to what we require now : beeing able to find the best answer to a question in a reasonable amount of time. tf/idf based search algorithms and Google rank are not the future of search engines. Finding the most relevant answers implies complex queries involving metadata on documents and semantics, which is basically what Lingway does (with Lucene and other backing search engines), but it requires more power and an underlying technology which supports modern hardware.

A good reason to choose Lucene

Whatever the reproaches I have to make about Lucene, it is still the best java open source solution available for what we are doing, and it performs quite good ;-)