I have a project I am working on where I want to be able to use full-text search on CouchDB, and spent way too much time today looking into the available options. Along the more practical route there are the CouchDB-Lucene and CouchDB-Solr projects, but I was pretty determined to get full-text search working without resorting to any external projects.
It sort of happened.
Approach one: one-word limitation.
So I'd probably be publicly stoned for calling this "full-text search", but it does let you retrieve all documents that contain a given word. (Although it does take a long time to build the initial index if you have a large database, it took something like 40 minutes for my sad Macbook to build the index for 60k documents which contain a total of 2.3 million words. After the index is created, though, the retrievals are as quick as they are useless.)
For example, lets say you wanted a view named word,
that looks up words in either the document's title or desc attributes. This view function
does the trick:
function(doc) {
var txt = doc.title + doc.desc;
var words = txt.replace(/[!.,;]+/g,"").toLowerCase().split(" ");
for (var word in words) {
emit(words[word], doc._id);
}
}
Then you can retrieve all documents with the word hello
using this url http://localhost:5984/mydb/_view/search/word?key="hello".
Taking this further, you could send a POST request to
http://localhost:5984/mydb/_view/search/word which contains
multiple keys, but that performs an or operation, rather than an and operation,
so this doesn't provide a sufficient tool for matching documents that
contain a set of words.
(If you were serious about this, you'd want to do a better job of sanitizing words, and to also convert the list of words into a set of words.)
Approach two: horrifying, but works. For some values of works.
The key for a CouchDB view doesn't have to just be a string, it can be any valid JSON expression. So, for example, you might think of your search along these lines (not URI encoded for readability):
http://localhost:5984/mydb/_view/_search/lookup?key="couchdb view"
But if you structured your keys differently, you could also think of it along these lines:
http://localhost:5984/mydb/_view/_search/lookup?key=["couchdb,"view"]
That is, you could specify the key as a JavaScript array instead of as a JavaScript string. So, if we could just create an index that contains an array for each possible combinations of words for each document, then we can perform full-text searches.
Wait, why are you closing the browser. Stop. Damnit. It'll be quick. Really fast lookup times. And who cares if the big O notation for both space and speed is horrifying? It's just a one-time cost. A tremendously large one-time cost--yes--but hey, it's kind of novel nonetheless. And if you're only indexing a very small amount of text (just titles, or titles and tags for example), then this may actually work for you.
Here is what the view function looks like:
function(doc) {
// permutation func by Jonas Raoni Soares Silva
var permute = function( v, m ){
for( var j, l = v.length, i = ( 1 << l ) - 1, r = new Array( i ); i; )
for( r[--i] = [], j = l; j; i + 1 & 1 << --j && ( r[i].push( m ? j : v[j] ) ) );
return r;
};
var txt = doc.title;
txt.replace(/[!.,;]+/g, "");
var raw_words = txt.split(" ");
var words = {};
for (var i in raw_words) {
var word = raw_words[i];
if (word == "") continue;
if (!words[word]) { words[word] = 1; }
else { words[word]++; }
}
var word_set = [];
for (var word in words) {
word_set.push(word);
}
var permutations = permute(word_set,0);
for (var i in permutations) {
emit(permutations[i],doc._id);
}
}
Let's just start out by saying, yes, this actually works.
I tested it. It does work. And let's follow that with a caveat:
if the value of txt is more than 4 of 5 words, then it will
sometimes trip the 5 second limitation on map functions.
(I tried recompiling the code with the delay moved from 5 to 50 seconds,
but the change didn't seem to stick for whatever reason.
Also, it just shouldn't take five seconds to perform the above code.
It is inherently inefficient,
but it shouldn't be that slow. I think using the Python or
Common Lisp view server might alleviate the issues as well.
I may try that a bit later.)
If you wanted all documents with a permutation that contained only NFL, then you would go to this uri:
http://127.0.0.1:5984/bossv/_view/_search/word?key=["NFL"]
If you wanted all documents with a permutation that contained NFL, then you could do this instead:
http://127.0.0.1:5984/bossv/_view/_search/word
?startkey=["NFL"]&endkey={}
Finally, I'll briefly mention that you could use an adaptation of this technique to make it possible to retrieve all blog entries that have an arbitrary combination of tags, which is a complex query that a relational database can't easily build an index for, but--using the above technique--CouchDB easily can. So, although this example of created a permutated index is pretty ridiculous, I think the technique could be genuinely useful in some situations.
That permutation definitely makes me laugh. I had pretty much just discarded it out of hand, but you make a good point that it's a one time cost.
If you're interested, a couple of us on IRC have been discussing how to do FTI in pure javascript+map/reduce as well as maybe implementing an Erlang plugin.
Also, the _external branch should be hitting trunk sometime soon and with jchris' action servers this will allow for alot of interesting experiments.
Thanks for your ideas, i'll give it a try!
I think this wil help me a lot. I will give it a try.
Amazing - thanks a lot, sounds great i will use this.
hi it is good but please can u let us know hw to perform the search. the URL you provided is not getting executed. "mydb" does it refer to the database name?? please reply
this article looks good, but is it compatible with all sql versions ? For me its not working, but i am just a beginner and dont know my database version.
sounds great i will use this.
amazing article thanks a lot. This sounds great and i will use this next time.
how do you support binary files ? like PDF ? I thought this is why we need CouchDB-Lucene or CouchDB-Solr
looks really good, i'll try that.
yeap, i've tried and its good and working. thanks for posting.
Nice article, well written and explained... and nice sense of humour :) I'm thinking if I could use couchdb for a project on 12m+ documents that now is being built on hadoop+solr. Do you think it would suffer from performance? Is couchdb a solution (and performing better) for the task? Thanks for the article.
nice sourcecode I saved that for testing it by my self, thanks a lot
Thank you for this nice article. I'am going to try it.
Frohes neues Jahr 2010 wünscht Euch Maike
Very interesting site.
Wow this works fine. THanks a lot!
This is great, exactly what I was looking for. I will make a great search.
Hey great article. Keep up the good work.
I would suggest to sort the permutations before emitting them:
Otherwise you need to know the order of the words in the title to find a document.
Now you can sort your key on client side and fetch all documents titled "hello world my friend" like this:
Sorting permutations doesn't make any sense, so I think I have to clarify my comment.
The permute() function shown in your post doesn't compute permutations, instead it yields all subgroups:
So, if you want to find a document with "one" and "two" in the title, the key ["one", "two"] wouldn't find it, but the permutation ["two", "one"] does.
A solution is to sort all subgroups and the keys, or emit all subgroups and their permutations.
hallo, beim Surfen in der Google, habe ich dieser sehr Hilfreiche Seite entdeckt, sehr gelungen, mit Freundliche Grüße aus Duisburg
Ich gratuliere zur informativen Seite. Schön, dass es noch Leute mit Ideen gibt. Liebe Grüsse, Mikail
Ich bin auch total begeistert! Echt spannender Artikel!
Hey thank you for this good article. well done!
Nice to have people out there, who still have ideas. Very nice article. Keep it up!
Haven't tried this yet. I even haven't used couchdb until today. will try it now. Thanks for the tipps.