Full text search with MongoDB and Lucene analyzers

This post describes how, with a few lines of code, you can take full advantage of Lucene's powerful analysis functionality for text normalization (crucial for free text searching) when both storing and querying, while using standard MongoDB features for data storage, indexing and retrieval.

Background
At Öredev last week, I had the opportunity to talk briefly to Mathias Stearn from 10gen, since I was curious about the search performance in MongoDB. The result of this little chat was (in short) that keyword search is very fast, but MongoDB lacks good support for free text search, so you have to do text normalization on your own. After our chat I started thinking that Lucene already does a good job of this, so you should be able to use Lucene's analyzers for the normalization job and MongoDB for storage and indexing.

More info on basic MongoDB full text search.

I have created a sample project for this on GitHub (https://github.com/jrask/mongo-text-search) that you can try if you like.

Summary
This is really simple and cool(!), but it is important to understand that for a full-fledged text search engine, Lucene or Solr is still your choice, since they have many other powerful features. This example only includes simple text searching and not, for example, phrase searching or other types of text searches, nor does it include ranking of hits. But not all sites need very advanced searching or have an enormous number of concurrent visitors, and under those circumstances this approach should be fine.

You should be aware that it will have a negative impact on write performance, depending on the size of the data you are indexing, but this is something you will have to test yourself. I have not done any search performance tests for this, but I will try to make some comparison to Lucene and publish the results here.

Lucene analyzers
Lucene uses analyzers to transform text into something that can be searched. Lucene has many different analyzers, and this article will not cover all of them; instead we will only use StandardAnalyzer, which is the most advanced analyzer for text analysis in Lucene. It uses stopwords (in English), lowercases words, recognizes URLs and email addresses etc. For example, the text “I would like to use MongoDB for full text search” would be tokenized into the following words (the stopwords “to” and “for” are dropped): [i] [would] [like] [use] [mongodb] [full] [text] [search].
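To make the normalization step concrete, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer: the class name ToyAnalyzer, its tokenize() method and the abbreviated stop word list are all assumptions made for this illustration, and the real analyzer does considerably more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a Lucene analyzer (NOT StandardAnalyzer):
// lowercases the text, splits on non-alphanumeric characters and
// drops a few English stop words.
class ToyAnalyzer {
    // Assumed, abbreviated stop word list; Lucene ships its own, longer one.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "for", "in", "of", "the", "to", "with");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [i, would, like, use, mongodb, full, text, search]
        System.out.println(tokenize("I would like to use MongoDB for full text search"));
    }
}
```

The point is only that both the stored text and the query text end up as the same kind of lowercase token list; with Lucene you would get the tokens from the analyzer instead of this toy method.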

Index fields
To store the normalized text, we create a subclass of BasicDBObject called AnalyzedDBObject and add the method appendAndAnalyzeFullText(). This method uses the analyzer to tokenize the text and adds the tokens to the object as a JSON array.
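The idea can be sketched without the MongoDB driver or Lucene on the classpath. The following is not the AnalyzedDBObject from the sample project: AnalyzedDocument, the toy tokenizer and the field names "text" and "fulltext" are assumptions for this example, and a plain LinkedHashMap stands in for BasicDBObject.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for AnalyzedDBObject, built on a plain map
// instead of BasicDBObject so that it runs with the JDK alone.
class AnalyzedDocument extends LinkedHashMap<String, Object> {
    // Assumed, abbreviated stop word list (a real analyzer brings its own).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "for", "in", "of", "the", "to", "with");

    // Stores the normalized tokens of `text` under `key` as a list,
    // which would become an array field once saved as a document.
    AnalyzedDocument appendAndAnalyzeFullText(String key, String text) {
        put(key, tokenize(text));
        return this;
    }

    // Toy normalization: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String original = "Full text search with MongoDB";
        AnalyzedDocument doc = new AnalyzedDocument();
        doc.put("text", original);                         // original text, kept for display
        doc.appendAndAnalyzeFullText("fulltext", original); // normalized tokens, used for search
        System.out.println(doc);
    }
}
```

Because the tokens are stored as an array field, an ordinary MongoDB index on that field can be used to look up individual terms.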

To use AnalyzedDBObject, simply use it as a normal DBObject instance. We have one field for the normalized text and one field for the original text, so we store more data by doing it like this.

Full text search
OK, so we have indexed the text nicely, but how do we perform a simple search for more than a single term, and how do we make sure that the query text is normalized? If you are used to Lucene, you know that you have to use the same analyzer for querying as you used during indexing, so we will do the same thing here. This is very important: if we did not do this, searching with capital letters or including a stopword that was removed during indexing would not give us any result.

We simply add a method, createQuery(), that creates a DBObject for searching for multiple terms. The method puts the query through the same process as the indexed text, so some words are removed, all words are lowercased, etc. We also add a condition: do we want all words to exist, or is one word enough? We use MongoDB's built-in operators $all and $in for this.
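A rough sketch of how such a query could be put together is shown here. It is an assumption-laden illustration, not the createQuery() from the sample project: MongoQuerySketch, the toy tokenizer and the "fulltext" field name are hypothetical, and nested maps stand in for the DBObject you would build with the Java driver.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: builds the shape of a MongoDB query for several
// terms, using plain maps instead of the driver's DBObject.
class MongoQuerySketch {
    // Assumed, abbreviated stop word list (a real analyzer brings its own).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "for", "in", "of", "the", "to", "with");

    // matchAll = true  -> { key: { $all: [tokens] } }  every term must be present
    // matchAll = false -> { key: { $in:  [tokens] } }  one matching term is enough
    static Map<String, Object> createQuery(String key, String terms, boolean matchAll) {
        Map<String, Object> condition = new LinkedHashMap<>();
        condition.put(matchAll ? "$all" : "$in", tokenize(terms));
        Map<String, Object> query = new LinkedHashMap<>();
        query.put(key, condition);
        return query;
    }

    // The query text goes through the same toy normalization as the stored text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(createQuery("fulltext", "The MongoDB Search", true));
        System.out.println(createQuery("fulltext", "The MongoDB Search", false));
    }
}
```

Searching for "The MongoDB Search" with all words required therefore produces a query of the form { "fulltext": { "$all": ["mongodb", "search"] } }: the stopword is gone and the capital letters are normalized, just like in the stored tokens.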

That's it!

6 Comments

  1. Skall Paul

    MongoLantern, a MongoDB full-text search plugin, also has its own query parser, which is good enough to parse most queries. Moreover, it’s highly customizable.

  2. We also use both MongoDB and Elastic Search, we found Elastic Search easy to setup but a nightmare to develop against with the C# NEST client.

    10gen recently released MongoDB 2.4 which has a new Free Text Search feature.

    MongoDB 2.4 Release Notes:
    http://bit.ly/ZnSScz

    I have a blog post on how to use MongoDB Free Text Search:
    http://bit.ly/ZmfytH

    It has full instructions, working code examples and links to all the documentation; it might be helpful to developers getting started with MongoDB Text Search.

  3. Mongo 2.4 still has a long way to go in terms of text searching; it has too many limitations. This is still useful.

  4. Lucene indexes stored in MongoDB – http://lumongo.org
