Irrational Exuberance!

Reranking Results in django-springsteen

February 26, 2009. Filed under django, boss, springsteen

Continuing on yesterday's post, here is a brief introduction to reranking results in django-springsteen as well as coverage of how it is able to paginate consistently through results collected from multiple sources.

(You can grab the springsteen source code on GitHub.)

Custom Ranking of Results

If you've ever tried to tune search relevance (particularly on a large or diverse corpus of data), you're probably painfully aware that reranking results is indeed the crux of the problem.

That's why springsteen exposes a simple hook for reordering results, and also why it defaults to a simple stacking approach rather than trying to reorder results for you. You know your data, so you get to tweak its ranking yourself.

Here is a simple--although flawed--example of how to reorder search results.

from springsteen.views import search
from springsteen.services import Web, Images
import random

def scatter_results(query, results):
    random.shuffle(results)
    return results

def scattered_search(request):
    services = (Images, Web)
    return search(request, services=services,
                  reranking_func=scatter_results)

There you have it, the simplest reranking function imaginable. But it has a bit of a problem: running the ranking function twice on the same data will return different rankings, which makes predictable pagination impossible.

This brings us to the one requirement for a good reranking algorithm: given the same data, it must return the same rankings. That is to say, it must be consistent. It is not important that the results be consistent from day to day or even hour to hour, but the reranking algorithm must be consistent to the extent that you want predictable pagination.

The less you object to noise in pagination (one result occurring on multiple pages), the more malleable your reranking algorithm can be.
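For example, the shuffle above can be made consistent by seeding the random generator from the query, so the same data always comes back in the same order. (This is a sketch to illustrate the idea, not part of springsteen itself.)

```python
import random

def stable_scatter(query, results):
    # Seed a private RNG with the query so the shuffle is deterministic:
    # the same query and the same results always produce the same order,
    # which keeps pagination predictable.
    rng = random.Random(query)
    rng.shuffle(results)
    return results
```

It is still a useless ranking, but now it satisfies the consistency requirement.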

To understand how consistent ranking across pagination is achieved, we need to take a side trip into the details of pagination in springsteen.

Consistent Pagination across Sources

The simplest case for pagination is fetching the 1st through Nth results. To accomplish this, springsteen queries each service for N results, merges them together, calls any reranking function on the results, and then returns the first N results.

Let's call N results (or as close to N as the service has available) from all services a batch. So if we had 5 services, and got N results from each of them, then our first batch has 5N results.

This means we have the raw material for five pages of results to paginate through from that first batch. Because the user only requested the first N results, we rerank them as we please and then return the first N of the reranked results.

Now, if the user asks for results N through 2N, we once again query all our sources for N results, once again rerank them, and then return results N through 2N. This may feel a bit inefficient, but springsteen tries to cache results, so paging through those five pages will only require hitting each of those services once (assuming all five pages are displayed within the ~30 minutes that the results are cached).
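The first-batch flow can be sketched roughly like this (the callable service interface and the function name are stand-ins for illustration, not springsteen's actual API):

```python
def first_page(services, query, count, reranking_func):
    # Ask each service for `count` results, pool them into one batch,
    # rerank the pooled batch, and return the first `count` results.
    batch = []
    for service in services:
        batch.extend(service(query, count))
    return reranking_func(query, batch)[:count]
```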

Things get a bit more complex if the user asks for results which fall outside the first batch. To retrieve those we need to first grab the first batch of results, set them aside, and then continue fetching batches (and setting them aside) until the sum of results in all batches is greater than the final requested result.
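Sketched in code, that fetch loop might look like the following (the offset-taking service interface is an assumption made for illustration, not springsteen's actual API):

```python
def fetch_batches(services, query, last_requested, batch_size):
    # Pull one batch at a time (batch_size results per service) until
    # the pooled batches cover the last requested result, or until the
    # services stop returning anything.
    batches, total, offset = [], 0, 0
    while total < last_requested:
        batch = []
        for service in services:
            batch.extend(service(query, batch_size, offset))
        if not batch:
            break
        batches.append(batch)
        total += len(batch)
        offset += batch_size
    return batches
```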

Then we perform the reranking algorithm separately on each batch whose results fall within the first and last result requested by the user. Then we merge the batches together (maintaining their order), and slice out the requested results. It is crucial to sort the batches separately to achieve consistent ordering of results (which permits predictable pagination).
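In other words, something like this (the function name and signature are hypothetical):

```python
def page_from_batches(query, batches, start, end, reranking_func):
    # Rerank each batch on its own, keep the batches in fetch order,
    # then slice out the requested window. Because earlier batches are
    # never re-sorted against later ones, earlier pages never change.
    ranked = []
    for batch in batches:
        ranked.extend(reranking_func(query, batch))
    return ranked[start:end]
```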

With an example this is fairly simple. Consider this code:

>>> a = [1,5,2,7,8,9]
>>> b = [5,1,3,9,6]
>>> sorted(a) + sorted(b)
[1, 2, 5, 7, 8, 9, 1, 3, 5, 6, 9]
>>> sorted(a+b)
[1, 1, 2, 3, 5, 5, 6, 7, 8, 9, 9]

By reranking batches independently we get the first result, whereas by reranking them in unison we receive the second.

Suppose each page shows six results. While only the first batch has been fetched, both algorithms show the same first page: sorted(a), which is 1, 2, 5, 7, 8, 9. When the user requests the second page, batch b is fetched as well. With the first algorithm the second page is simply sorted(b), which is 1, 3, 5, 6, 9, and the first page is unchanged. With the second algorithm the combined ranking becomes 1, 1, 2, 3, 5, 5, 6, 7, 8, 9, 9, so the second page shows 6, 7, 8, 9, 9. That means the results 7, 8, and 9, which the user already saw on the first page, appear a second time, and the first page itself has silently changed underneath them.

Thus ends our brief interlude on pagination.

A Consistent Ranking Algorithm

Here is a simple example of a consistent reranking algorithm. It is not a useful example in terms of increasing search relevance, but is perhaps useful purely as a technical example.

This reranks results by the length of their title.

from springsteen.views import search
from springsteen.services import Web, Images

def title_length(query, results):
    # Sorting by a fixed key is deterministic, so this reranking
    # is consistent.
    results.sort(key=lambda r: len(r['title']))
    return results

def title_length_search(request):
    services = (Images, Web)
    return search(request, services=services,
                  reranking_func=title_length)

Keep in mind that reranking is consistent within batches, and that batches are consistent with regard to one another, but in any given set of results you may see the end of one batch and the beginning of the next, which (for a very visual example like this one) is a bit jarring.

In a more realistic situation, you might create a function which assigns each result a grade, and then rerank results based on that grade. You might start out looking only at the source, then factor in the published date (where it exists), and keep tweaking for your corpus.
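A minimal sketch of that approach, assuming the result dicts carry 'source' and 'published' fields (springsteen's actual result schema may differ):

```python
def grade(result):
    # Hypothetical grading: prefer some sources over others, and give
    # a small bump to results with a known publication date.
    score = {'web': 2.0, 'news': 1.0}.get(result.get('source'), 0.0)
    if result.get('published'):
        score += 0.5
    return score

def graded_rerank(query, results):
    # Python's sort is stable, so equal grades keep their original
    # order, which keeps the ranking consistent within a batch.
    results.sort(key=grade, reverse=True)
    return results
```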

As you start experimenting a bit, I'm curious to hear your ideas for ranking algorithms.