Continuing from yesterday's post, here is a brief introduction
to reranking results in django-springsteen, as well as a look at how
it paginates consistently through results collected from multiple sources.
If you've ever tried to tune search relevance (particularly on
a large or diverse corpus of data), you're probably painfully
aware that reranking results is indeed the crux of the problem.
That's why springsteen exposes a simple hook for reordering
results, and also why it defaults to a simple stacking approach
rather than trying to reorder results for you. You know your data,
you get to tweak its ranking yourself.
Here is a simple--although flawed--example of how to reorder results:
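(A minimal sketch of such a flawed reranker, assuming the reranking hook simply receives and returns a list of results; this is illustrative, not springsteen's actual code.)

```python
import random

def rerank(results):
    # reorder the results at random -- trivially easy to write,
    # but every call produces a different ordering
    random.shuffle(results)
    return results
```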
There you have it, the simplest reranking function imaginable.
But it has a bit of a problem: running the ranking function twice
on the same data will return different rankings, which makes
predictable pagination impossible.
This brings us to the one requirement for a good reranking algorithm:
given the same data, it must return the same rankings.
That is to say, they must be consistent.
It is not important that the results be consistent from
day to day or even hour to hour, but the reranking algorithm
must be consistent to the extent that you want predictable pagination.
The less you object to noise in pagination (one result occurring
on multiple pages), the more malleable your reranking algorithm can be.
To understand how consistent ranking across pagination is achieved, we
need to take a sidetrip into the details of pagination in springsteen.
Consistent Pagination across Sources
The simplest case for pagination is fetching
the 1st through Nth results. To accomplish this, springsteen
queries each service for N results, merges
them together, calls any reranking function on
the results, and then returns the first N results.
Let's call N results (or as close to N as the
service has available) from all services a batch.
So if we had 5 services, and got N results from
each of them, then our first batch has 5N results.
This means we have the raw material for five pages of
results to paginate through from that first batch.
Because the user only requested the first N results,
we rerank them as we please and then return the first
N of the reranked results.
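That simplest case can be sketched roughly like this (the service objects and their fetch method are invented for illustration; springsteen's real API differs):

```python
def first_page(services, n, rerank=lambda results: results):
    # ask each service for n results and merge them into one batch
    batch = []
    for service in services:
        batch.extend(service.fetch(n))
    # apply the reranking hook, then return only the first n
    return rerank(batch)[:n]
```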
Now, if the user asks for N through 2N results,
we once again query all our sources for N results,
once again rerank them, and then return results
N to 2N. This may feel a bit inefficient,
but springsteen tries to cache results, so
paging through those five pages will only require hitting
each of those services once (assuming all five pages are
displayed within the ~30 minutes that the results are
cached).
Things get a bit more complex if the user asks for
results which fall outside the first batch of results.
To retrieve those we need to first grab the first
batch of results, set them aside, and then continue
fetching batches (and setting them aside) until the sum
of results in all batches is greater than the final requested result.
Then we perform the reranking algorithm
separately on each batch whose results fall within the first and last result requested by the user.
Then we merge the batches together (maintaining their order), and slice out the requested results.
It is crucial to sort the batches separately to achieve
consistent ordering of results (which permits predictable pagination).
With an example this is fairly simple. Consider this code:
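(A sketch of the comparison: the scores and five-per-page size are invented, and "reranking" here is just a sort with lower scores ranking higher.)

```python
PER_PAGE = 5

def rerank(results):
    # stand-in reranking function: sort by score, best first
    return sorted(results)

def page(results, number, per_page=PER_PAGE):
    start = (number - 1) * per_page
    return results[start:start + per_page]

# two batches of five results each, identified by their scores
batch_one = [7, 1, 9, 2, 5]
batch_two = [3, 9, 1, 6, 5]

# page one only needs the first batch, so both schemes agree
print(page(rerank(batch_one), 1))    # [1, 2, 5, 7, 9]

# reranking each batch independently, then merging in order
independent = rerank(batch_one) + rerank(batch_two)
print(page(independent, 2))          # [1, 3, 5, 6, 9]

# reranking both batches in unison shifts results between pages
unison = rerank(batch_one + batch_two)
print(page(unison, 2))               # [5, 6, 7, 9, 9]
```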
By reranking batches independently we get the
first result, whereas by reranking them in unison
we receive the second.
With both algorithms on the first page of results you would see results 1,2,5,7,9.
On the second page, with the first algorithm you would see 1,3,5,6,9.
With the second algorithm you would instead see 5,6,7,9,9.
This means the second algorithm shows you worse results on the second page, and
also shows repeated results.
Thus ends our brief interlude on pagination.
A Consistent Ranking Algorithm
Here is a simple example of a consistent reranking algorithm.
It is not a useful example in terms of increasing search
relevance, but is perhaps useful purely as a technical example.
This reranks results by the length of their title.
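(A sketch of such a function, assuming each result is a dict with a 'title' key and that the reranking hook simply receives and returns a list of results; check springsteen's docs for the exact hook signature.)

```python
def rerank(results):
    # sorted() is stable and the key depends only on the data,
    # so the same input always yields the same ordering
    return sorted(results, key=lambda result: len(result['title']))
```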
Keep in mind that reranking is consistent within batches,
and that batches are consistent with regard to one another,
but that in some given set of results you may see the end
of one batch and the beginning of another, which (for a
very visual example like this one) is a bit jarring.
In a more realistic situation, you might create a function which
assigns each result a grade, and then rerank them based on that
grade. You might start out only looking at source, but then factor
in its published date (where it exists) and keep tweaking for your data.
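As a sketch of that approach (the sources, weights, and fields here are all invented for illustration):

```python
def grade(result):
    # weight results by how much you trust their source
    weights = {'blog': 2.0, 'news': 1.5}
    score = weights.get(result.get('source'), 1.0)
    # small boost for results that carry a published date
    if result.get('published'):
        score += 0.5
    return score

def rerank(results):
    # highest grade first; ties broken by title so the ordering
    # stays deterministic and pagination stays predictable
    return sorted(results, key=lambda r: (-grade(r), r.get('title', '')))
```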
As you start experimenting a bit, I'm curious to hear your ideas for reranking algorithms.