Full-Text Search in CouchDB Using... CouchDB

12/08/2008

I have a project I am working on where I want to be able to use full-text search on CouchDB, and spent way too much time today looking into the available options. Along the more practical route there are the CouchDB-Lucene and CouchDB-Solr projects, but I was pretty determined to get full-text search working without resorting to any external projects.

It sort of happened.

Approach one: one-word limitation.

So I'd probably be publicly stoned for calling this "full-text search", but it does let you retrieve all documents that contain a given word. (Although it does take a long time to build the initial index if you have a large database, it took something like 40 minutes for my sad Macbook to build the index for 60k documents which contain a total of 2.3 million words. After the index is created, though, the retrievals are as quick as they are useless.)

For example, lets say you wanted a view named word, that looks up words in either the document's title or desc attributes. This view function does the trick:

function(doc) {
  var txt = doc.title + doc.desc;
  var words = txt.replace(/[!.,;]+/g,"").toLowerCase().split(" ");
  for (var word in words) {
    emit(words[word], doc._id);
  }
}

Then you can retrieve all documents with the word hello using this url http://localhost:5984/mydb/_view/search/word?key="hello". Taking this further, you could send a POST request to http://localhost:5984/mydb/_view/search/word which contains multiple keys, but that performs an or operation, rather than an and operation, so this doesn't provide a sufficient tool for matching documents that contain a set of words.

(If you were serious about this, you'd want to do a better job of sanitizing words, and to also convert the list of words into a set of words.)

Approach two: horrifying, but works. For some values of works.

The key for a CouchDB view doesn't have to just be a string, it can be any valid JSON expression. So, for example, you might think of your search along these lines (not URI encoded for readability):

http://localhost:5984/mydb/_view/_search/lookup?key="couchdb view"

But if you structured your keys differently, you could also think of it along these lines:

http://localhost:5984/mydb/_view/_search/lookup?key=["couchdb,"view"]

That is, you could specify the key as a JavaScript array instead of as a JavaScript string. So, if we could just create an index that contains an array for each possible combinations of words for each document, then we can perform full-text searches.

Wait, why are you closing the browser. Stop. Damnit. It'll be quick. Really fast lookup times. And who cares if the big O notation for both space and speed is horrifying? It's just a one-time cost. A tremendously large one-time cost--yes--but hey, it's kind of novel nonetheless. And if you're only indexing a very small amount of text (just titles, or titles and tags for example), then this may actually work for you.

Here is what the view function looks like:

function(doc) {
  // permutation func by Jonas Raoni Soares Silva
  var permute = function( v, m ){
    for( var j, l = v.length, i = ( 1 << l ) - 1, r = new Array( i ); i; )
      for( r[--i] = [], j = l; j; i + 1 & 1 << --j && ( r[i].push( m ? j : v[j] ) ) );
    return r;
  };
  var txt = doc.title;
  txt.replace(/[!.,;]+/g, "");
  var raw_words = txt.split(" ");
  var words = {};
  for (var i in raw_words) {
    var word = raw_words[i];
    if (word == "") continue;
    if (!words[word]) { words[word] = 1; }
    else { words[word]++; }
  }
  var word_set = [];
  for (var word in words) {
    word_set.push(word);
  }
  var permutations = permute(word_set,0);
  for (var i in permutations) {
    emit(permutations[i],doc._id);
  }
}

Let's just start out by saying, yes, this actually works. I tested it. It does work. And let's follow that with a caveat: if the value of txt is more than 4 of 5 words, then it will sometimes trip the 5 second limitation on map functions. (I tried recompiling the code with the delay moved from 5 to 50 seconds, but the change didn't seem to stick for whatever reason. Also, it just shouldn't take five seconds to perform the above code. It is inherently inefficient, but it shouldn't be that slow. I think using the Python or Common Lisp view server might alleviate the issues as well. I may try that a bit later.)

If you wanted all documents with a permutation that contained only NFL, then you would go to this uri:

http://127.0.0.1:5984/bossv/_view/_search/word?key=["NFL"]

If you wanted all documents with a permutation that contained NFL, then you could do this instead:

http://127.0.0.1:5984/bossv/_view/_search/word
    ?startkey=["NFL"]&endkey={}

Finally, I'll briefly mention that you could use an adaptation of this technique to make it possible to retrieve all blog entries that have an arbitrary combination of tags, which is a complex query that a relational database can't easily build an index for, but--using the above technique--CouchDB easily can. So, although this example of created a permutated index is pretty ridiculous, I think the technique could be genuinely useful in some situations.

All Rights Reserved, Will Larson 2007 - 2014.