Reranking Results in django-springsteen
Continuing on yesterday's post, here is a brief introduction to reranking results in django-springsteen, as well as coverage of how it paginates consistently through results collected from multiple sources. (You can grab the springsteen source code on GitHub.)
Custom Ranking of Results
If you've ever tried to tune search relevance (particularly on a large or diverse corpus of data), you're probably painfully aware that reranking results is indeed the crux of the problem. That's why springsteen exposes a simple hook for reordering results, and also why it defaults to a simple stacking approach rather than trying to reorder results for you. You know your data, so you get to tweak its ranking yourself.
Here is a simple, although flawed, example of how to reorder search results.
from springsteen.views import search
from springsteen.services import Web, Images
import random

def scatter_results(query, results):
    # randomly reorder the merged results
    random.shuffle(results)
    return results

def scattered_search(request):
    services = (Images, Web)
    return search(request, services=services,
                  reranking_func=scatter_results)
There you have it, the simplest reranking function imaginable. But it has a bit of a problem: running the ranking function twice on the same data will return different rankings, which makes predictable pagination impossible.
This brings us to the one requirement for a good reranking algorithm: given the same data, it must return the same rankings. That is to say, it must be consistent. It is not important that the results be consistent from day to day or even hour to hour, but the reranking algorithm must be consistent to the extent that you want predictable pagination.
The less you object to noise in pagination (one result occurring on multiple pages), the more malleable your reranking algorithm can be.
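For example, the scatter_results function above could be made consistent by seeding the shuffle from the query itself, so the same query always produces the same ordering. This is just an illustrative sketch, not something springsteen ships with:
import random

def consistent_scatter(query, results):
    # seeding the random generator with the query makes the shuffle
    # deterministic: the same query and results always yield the same order
    rng = random.Random(query)
    rng.shuffle(results)
    return results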
To understand how consistent ranking across pagination is achieved, we need to take a sidetrip into the details of pagination in springsteen.
Consistent Pagination across Sources
The simplest case for pagination is fetching the 1st through Nth results. To accomplish this, springsteen queries each service for N results, merges them together, calls any reranking function on the results, and then returns the first N results.
Let's call N results (or as close to N as the service has available) from all services a batch. So if we had 5 services, and got N results from each of them, then our first batch would have 5N results. This means we have the raw material for five pages of results to paginate through from that first batch. Because the user only requested the first N results, we rerank them as we please and then return the first N of the reranked results.
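Sketched as code (and only as a sketch: the per-service callable and its signature below are assumptions for illustration, not springsteen's actual internals), the flow for that first page looks roughly like this:
def first_page(query, services, n, reranking_func=None):
    # each `service` here is assumed to be a callable taking (query, count)
    # and returning a list of result dicts
    batch = []
    for service in services:
        batch.extend(service(query, n))
    # apply the optional reranking hook, then keep only the first page
    if reranking_func is not None:
        batch = reranking_func(query, batch)
    return batch[:n]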
Now, if the user asks for results N through 2N, we once again query all our sources for N results, once again rerank them, and then return results N through 2N. This may feel a bit inefficient, but springsteen tries to cache results, so paging through those five pages will only require hitting each of those services once (assuming all five pages are displayed within the ~30 minutes that the results are cached).
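The caching itself might look something like this sketch, using Django's cache framework (the key scheme and the 30-minute timeout here are assumptions for illustration):
from django.core.cache import cache

def cached_fetch(service, query, n):
    # cache each service's batch so paging through the first batch only
    # hits each remote service once per ~30 minutes
    key = 'springsteen:%s:%s:%d' % (service.__name__, query, n)
    results = cache.get(key)
    if results is None:
        results = service(query, n)  # hypothetical callable, as above
        cache.set(key, results, 30 * 60)
    return results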
Things get a bit more complex if the user asks for results which fall outside the first batch. To retrieve those, we need to grab the first batch of results, set it aside, and then continue fetching batches (and setting them aside) until the sum of results in all batches is greater than the last requested result.
Then we perform the reranking algorithm separately on each batch whose results fall within the first and last result requested by the user. Then we merge the batches together (maintaining their order) and slice out the requested results. It is crucial to sort the batches separately to achieve a consistent ordering of results (which permits predictable pagination).
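As code, and again only as a rough sketch with a hypothetical per-service callable (this time assumed to accept an offset) rather than springsteen's real internals:
def paginate(query, services, n, start, end, reranking_func=None):
    # fetch batches until we have at least `end` results (or the services run dry)
    batches = []
    total = 0
    offset = 0
    while total < end:
        batch = []
        for service in services:
            batch.extend(service(query, n, offset))  # hypothetical signature
        if not batch:
            break  # every service is exhausted
        batches.append(batch)
        total += len(batch)
        offset += n
    # rerank each batch on its own so that earlier pages never reshuffle
    if reranking_func is not None:
        batches = [reranking_func(query, batch) for batch in batches]
    # merge the batches in order and slice out the requested window
    merged = []
    for batch in batches:
        merged.extend(batch)
    return merged[start:end]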
With an example this is fairly simple. Consider this code:
>>> a = [1,5,2,7,8,9]
>>> b = [5,1,3,9,6]
>>> sorted(a) + sorted(b)
[1, 2, 5, 7, 8, 9, 1, 3, 5, 6, 9]
>>> sorted(a+b)
[1, 1, 2, 3, 5, 5, 6, 7, 8, 9, 9]
By reranking batches independently we get the first ordering, whereas by reranking them in unison we receive the second.
With both algorithms, the first page of results (one batch's worth) would be 1, 2, 5, 7, 8, 9: at that point only the first batch has been fetched, so both approaches are simply sorting a.
On the second page, with the first algorithm you would see 1, 3, 5, 6, 9.
With the second algorithm you would instead see 6, 7, 8, 9, 9.
Meaning the second algorithm shows you worse results on the second page, and also repeats results (7, 8 and 9) that already appeared on the first page.
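You can double-check those pages in the shell (slicing at six, the size of the first batch):
>>> sorted(a)                      # first page: only batch a has been fetched
[1, 2, 5, 7, 8, 9]
>>> (sorted(a) + sorted(b))[6:12]  # second page, batches reranked independently
[1, 3, 5, 6, 9]
>>> sorted(a + b)[6:12]            # second page, batches reranked in unison
[6, 7, 8, 9, 9]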
Thus ends our brief interlude on pagination.
A Consistent Ranking Algorithm
Here is a simple example of a consistent reranking algorithm. It is not a useful example in terms of increasing search relevance, but it may be useful purely as a technical example.
It reranks results by the length of their title.
from springsteen.views import search
from springsteen.services import Web, Images

def title_length(query, results):
    # order results by the length of their title, shortest first
    results.sort(key=lambda r: len(r['title']))
    return results

def title_length_search(request):
    services = (Images, Web)
    return search(request, services=services,
                  reranking_func=title_length)
Keep in mind that reranking is consistent within batches, and that batches are consistent with regard to one another, but in any given set of results you may see the end of one batch and the beginning of another, which (for a very visual example like this one) is a bit jarring.
In a more realistic situation, you might create a function which assigns each result a grade, and then rerank results based on that grade. You might start out only looking at a result's source, then factor in its published date (where it exists), and keep tweaking for your corpus.
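As a very rough sketch of that idea (the 'source' and 'published' fields, and the weights, are purely hypothetical here):
from datetime import datetime, timedelta

# hypothetical per-source weights; tune these for your own corpus
SOURCE_WEIGHTS = {'Web': 1.0, 'Images': 0.8}

def grade(result):
    score = SOURCE_WEIGHTS.get(result.get('source'), 0.5)
    published = result.get('published')
    if published and published > datetime.now() - timedelta(days=7):
        score += 0.5  # favor results published within the last week
    return score

def graded_rerank(query, results):
    # highest grade first; the sort is deterministic, so pagination stays predictable
    results.sort(key=grade, reverse=True)
    return results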
As you start experimenting a bit, I'm curious to hear your ideas for ranking algorithms.