django-springsteen and Distributed Search

Published on February 25, 2009.

For quite some time I've been wanting to put together a pluggable Django application for querying Yahoo! BOSS. In itself doing that is pretty trivial though, so the app needed to include some kind of special sauce to sweeten the deal. I hope you'll find the taste agreeable.

This is django-springsteen. (Credit for the name goes entirely to Justin Lilly.)

springsteen provides a trivial wrapper for Yahoo! BOSS, but goes further and provides a simple framework for building distributed search networks. If you dream of a world where every blog network is searchable, and each niche has its own vertical search, then springsteen is for you.

Let's start with some examples.

Querying BOSS for Web Results

springsteen has prebuilt views for searching Yahoo! BOSS for web, image and news results, making this the simplest use case.

from django.conf.urls.defaults import *
from springsteen.views import web, images, news

urlpatterns = patterns('',
    (r'^search/web/$', web),
    (r'^search/images/$', images),
    (r'^search/news/$', news),
)

Then navigate to http://yourproject.com/search/web/ (or /search/images/ or /search/news/) and you'll immediately have a search page waiting for you.

The search results for Web--as well as all implemented services--are cached using the caching backend specified in your settings.py file. The speed benefits of caching Yahoo! BOSS may be fairly minimal, but for more exotic services (and frequent searches) the caching may become more of a feature.

To clean up the appearance, override either the springsteen/base.html or springsteen/results.html template. (You can also override the *_result.html templates to customize different result types.)

BOSS Results with Site Restrict

If you only want web results from a single site (the poor man's site search), you can subclass the springsteen.services.Web class. (You can likewise restrict news or images by subclassing the springsteen.services.News and springsteen.services.Images classes respectively.)

First the subclassing.

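from springsteen.services import Web

class DjangoProjectSearch(Web):
    def __init__(self, query, params={}):
        # Pass this class (not Web) to super() so Web's own __init__ runs.
        super(DjangoProjectSearch, self).__init__(query, params)
        # Restrict results to a single site.
        self.params['sites'] = 'djangoproject.com'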

You can add other parameters as well, which are defined in the Yahoo! BOSS documentation.

After writing your custom search, write a view that uses it.

from springsteen.views import search
from my_searches import DjangoProjectSearch

def dp_search(request, timeout=2000, max_count=5,
              services=(DjangoProjectSearch,), extra_params={}):
    return search(request, timeout, max_count, services, extra_params)

Querying Multiple Services in Parallel

One of the frequent mistakes I've made as a web developer is to make HTTP requests sequentially when they could have been made concurrently. springsteen aims to aggregate numerous search services, so it needs to be able to request and process them in parallel.

To perform concurrent requests simply specify multiple services. (Note that the below values defined in settings are not standard, but you can put them in your settings.py if that's how you like to organize globals.)

from django.conf import settings
from springsteen.services import Web, Images
from springsteen.views import search

def my_search(request):
    timeout = settings.SPRINGSTEEN_TIMEOUT
    max_count = settings.SPRINGSTEEN_MAX_COUNT
    services = (Web, Images)
    return search(request, timeout, max_count, services)

By default results from services are stacked on one another. For example, the above my_search would show all results from Images and then begin showing results from Web.
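
If you're curious what "query in parallel, then stack" looks like in miniature, here is a sketch using Python 3's concurrent.futures. This is not how springsteen does it internally; the fake_* services, the parallel_search function, and the choice to stack in the order the services are listed are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time


def fake_web(query):
    """Hypothetical service returning web results after a short delay."""
    time.sleep(0.05)
    return [{"title": "web: " + query, "source": "web"}]


def fake_images(query):
    """Hypothetical service returning image results immediately."""
    return [{"title": "image: " + query, "source": "images"}]


def fake_slow(query):
    """Hypothetical service that is too slow to make the timeout."""
    time.sleep(1.0)
    return [{"title": "slow: " + query, "source": "slow"}]


def parallel_search(query, services, timeout_ms):
    """Run every service concurrently and stack whatever finished in time."""
    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = [pool.submit(service, query) for service in services]
    done, _not_done = wait(futures, timeout=timeout_ms / 1000.0)
    pool.shutdown(wait=False)  # do not block on stragglers
    stacked = []
    for future in futures:  # keep the order the services were listed in
        if future in done and future.exception() is None:
            stacked.extend(future.result())
    return stacked


results = parallel_search("django", (fake_images, fake_web, fake_slow), 500)
```

The total wait is bounded by the timeout rather than by the sum of the individual request times, and a service that misses the deadline simply contributes nothing.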

Ranking results is the hardest part of search, and springsteen won't solve that. Instead it'll give you the levers to do it yourself. For most small scale situations it should be possible to write fairly concise ranking logic that is specific to the services you're querying that will outperform any generic genius that springsteen might try to provide.
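
As an example of how small that logic can be, here is a sketch of a round-robin merge that alternates between services instead of stacking them. It is plain Python 3 and is not wired into springsteen in any way; interleave and the sample result lists are invented for illustration.

```python
from itertools import zip_longest


def interleave(*result_lists):
    """Round-robin merge: first hit of each service, then the second, etc."""
    missing = object()  # sentinel so shorter lists are skipped, not padded
    merged = []
    for group in zip_longest(*result_lists, fillvalue=missing):
        merged.extend(item for item in group if item is not missing)
    return merged


web = [{"title": "w1"}, {"title": "w2"}, {"title": "w3"}]
images = [{"title": "i1"}]
ranked = interleave(web, images)
```

Here ranked holds w1, i1, w2, w3 in that order, so a service with only a few hits still gets a slot near the top.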

Exposing Results via a `Springsteen Service`

Because springsteen is all about aggregating search services, it will be gradually extended to understand new formats. However, sometimes you just want to expose new data to springsteen, and haven't already decided on a format.

For those situations, you can use a Springsteen Service. Cool name aside, they are about as simple as it gets. Let's imagine that you can somehow get search results in CSV format (no it doesn't make sense, it's an example).

Perhaps your data looks like this:

title, url, text
abc, http://yadayad/abc/, some text here
efg, http://yadayad/efg/, some text here as well

and you have a function csv_search which returns relevant rows. You could expose that via a Springsteen Service as follows.

from csv import DictReader
from fake_web_service import csv_search
from django.utils import simplejson
from springsteen.views import service

def retrieve_csv_results(query, start, count):
    csv_results, total_results = csv_search(query, start, count)
    results = []
    # skipinitialspace drops the blank after each comma in the sample data,
    # so the keys are 'url' and 'text' rather than ' url' and ' text'.
    for line in DictReader(csv_results, skipinitialspace=True):
        result = {
            'title': line['title'],
            'url': line['url'],
            'text': line['text'],
        }
        results.append(result)
    data = {
        'total_results': total_results,
        'results': results,
    }
    return simplejson.dumps(data)

def my_service(request):
    return service(request, retrieve_func=retrieve_csv_results)

At this point in time the only three acknowledged fields for a Springsteen Service are the above title, url and text. As the need arises, the standard may be fleshed out to accommodate additional metadata.
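
To see the resulting payload without Django or a real search backend, here is a standalone sketch in Python 3 that runs the sample rows through the standard csv and json modules. The csv_rows list stands in for whatever csv_search would return, and using len(results) for total_results is a shortcut for the example; a real service would report the total number of matches, not just the size of the current page.

```python
import csv
import json

# The sample rows from above, as an iterable of lines.
csv_rows = [
    "title, url, text",
    "abc, http://yadayad/abc/, some text here",
    "efg, http://yadayad/efg/, some text here as well",
]

# skipinitialspace drops the blank that follows each comma in the sample.
reader = csv.DictReader(csv_rows, skipinitialspace=True)
results = [
    {"title": row["title"], "url": row["url"], "text": row["text"]}
    for row in reader
]

# Shortcut for the sketch: a real service reports the total match count.
payload = json.dumps({"total_results": len(results), "results": results})
decoded = json.loads(payload)
```

Each entry in decoded["results"] is a dictionary with exactly the title, url and text keys, alongside a top-level total_results count.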

Rather than the hypothetical csv_search, you can use this approach to wrap Solango or Django-Sphinx results, as well as to route to non-Django APIs or services in your ecosystem.

A Search API Repeater & Transformer

Let's say you want to implement a site search API for your blog, but don't have the "engineering resources" to integrate a solution like Solango or Django-Sphinx.

First we need to subclass Web to get our site's results.

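from springsteen.services import Web

class MySiteService(Web):
    def __init__(self, query, params={}):
        # Pass this class (not Web) to super() so Web's own __init__ runs.
        super(MySiteService, self).__init__(query, params)
        # Restrict results to a single site.
        self.params['sites'] = 'lethain.com'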

Then we need to expose the results.

from django.http import HttpResponse
from springsteen.views import service
from django.utils import simplejson
from somewhere import MySiteService

def retrieve_func(query, start, count):
    params = {'start': start, 'count': count}
    mss = MySiteService(query, params)
    mss.run()
    results = mss.results()
    json = simplejson.dumps(results)
    return HttpResponse(json, mimetype="application/json")

Now the Yahoo! BOSS results are transformed into the Springsteen Service format, and can be predictably queried by external springsteen searches.

Retrieving Results from `Springsteen Service`

Retrieving results from a Springsteen Service is simple, akin to retrieving Yahoo! BOSS results.

from springsteen.services import SpringsteenService, Web
from springsteen.views import search

class MyService(SpringsteenService):
    _uri = "http://example.com/search/cvs/"

def my_search(request, timeout=2500, max_count=20):
    services = (MyService, Web)
    return search(request, timeout, max_count, services)

springsteen already knows how to display results from a Springsteen Service, so integration is rather concise.
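
For a feel of what the consuming side has to do, here is a sketch in plain Python 3 that builds a request URL and decodes a Springsteen Service style response. It makes no network call and is not taken from springsteen's source: the query, start and count parameter names are guessed from the retrieve function's signature, and the 'springsteen' source value and both function names are assumptions.

```python
import json
from urllib.parse import urlencode


def build_service_url(base_uri, query, start=0, count=10):
    """Append the search parameters to the service URI (names are assumed)."""
    params = {"query": query, "start": start, "count": count}
    return base_uri + "?" + urlencode(params)


def parse_service_response(body):
    """Decode a Springsteen Service style payload and tag each result."""
    data = json.loads(body)
    results = data["results"]
    for result in results:
        result["source"] = "springsteen"  # assumed template fragment name
    return results, data["total_results"]


url = build_service_url("http://example.com/search/", "django templates", count=5)

# A canned response body, so the sketch runs without a live service.
canned = json.dumps({
    "total_results": 1,
    "results": [
        {"title": "abc", "url": "http://yadayad/abc/", "text": "some text here"},
    ],
})
results, total = parse_service_response(canned)
```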

Accessing a Custom Service

It's always easiest when you can get partners to expose a service in the format you want (in this case, a SpringsteenService), but sometimes you have to get in there and parse the data yourself.

In springsteen.services both the SpringsteenService and BossSearch classes provide examples of interfacing with different data formats.

The key point is to write a run method that retrieves results and converts them into a Python list of dictionaries. If you want to render them with one of the existing template fragments (web, news, image or springsteen results) then you should add the corresponding value to the source key for each result's dictionary.

def run(self):
    stuff, total_results = get_results(self.query, self.params)
    results = simplejson.loads(stuff)
    for result in results:
        result['source'] = 'web'
    self._results = results
    self.total_results = total_results
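
Here is a fuller, self-contained sketch of the same idea for a partner that hands you XML instead of JSON, written for Python 3 with only the standard library. XmlService deliberately does not inherit from anything in springsteen, and the XML element names, the fetch method and the sample document are all invented for the example; only the shape of run (fill a list of dictionaries, set a source key, record the total) follows the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical partner response; the element names are invented.
SAMPLE_XML = """
<hits total="2">
  <hit><name>abc</name><link>http://yadayad/abc/</link><summary>some text here</summary></hit>
  <hit><name>efg</name><link>http://yadayad/efg/</link><summary>some text here as well</summary></hit>
</hits>
"""


class XmlService(object):
    """Stand-in for a custom service; not a springsteen subclass."""

    def __init__(self, query, params=None):
        self.query = query
        self.params = dict(params or {})
        self._results = []
        self.total_results = 0

    def fetch(self):
        """A real service would make an HTTP request here."""
        return SAMPLE_XML.strip()

    def run(self):
        """Parse the partner format into a list of result dictionaries."""
        root = ET.fromstring(self.fetch())
        self.total_results = int(root.get("total"))
        self._results = [
            {
                "title": hit.findtext("name"),
                "url": hit.findtext("link"),
                "text": hit.findtext("summary"),
                "source": "web",  # render with the web result fragment
            }
            for hit in root.findall("hit")
        ]

    def results(self):
        return self._results


service = XmlService("abc")
service.run()
```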

Let me know if it proves challenging to follow the existing results, and I'll gladly provide a complete walkthrough of subclassing Service and CachableService.

The Future of `springsteen`

At the moment the core of springsteen is nearly complete; I just need to refactor slightly to facilitate inserting custom ranking logic. Beyond that, there are an infinite number of services that springsteen would like to know how to query and display.

There is a working example of both exposing data via a Springsteen Service and querying and aggregating results, and hopefully it'll be polished enough to reveal by this upcoming Monday.

I hope that springsteen and its vision of distributed search by small-time players is something that you find exciting. I know I'm excited about the prospect of creating targeted and relevant search boxes powered not by thousands of commodity servers in datacenters but instead by my VPS, and yours.

Download

django-springsteen is available on GitHub.

git clone git@github.com:lethain/django-springsteen.git

We're all busy people, but you're more than welcome to join in the development!