django-springsteen and Distributed Search
For quite some time I've been wanting to put together a pluggable Django application for querying Yahoo! BOSS. In itself doing that is pretty trivial though, so the app needed to include some kind of special sauce to sweeten the deal. I hope you'll find the taste agreeable.
This is django-springsteen. (Credit for the name goes entirely to Justin Lilly.)
springsteen provides a trivial wrapper for Yahoo! BOSS, but goes further and provides a simple framework for building distributed search networks. If you dream of a world where every blog network is searchable, and each niche has its own vertical search, then springsteen is for you.
Let's start with some examples.
Querying BOSS for Web Results
springsteen has prebuilt views for searching Yahoo! BOSS for web, images and news results, making this the simplest use case.
from django.conf.urls.defaults import *
from springsteen.views import web, images, news

urlpatterns = patterns('',
    (r'^search/web/$', web),
    (r'^search/images/$', images),
    (r'^search/news/$', news),
)
Then navigate to http://yourproject.com/search/web/ (or /search/images/ or /search/news/) and you'll immediately have a search page waiting for you.
The search results for Web, as well as for all other implemented services, are cached using the caching backend specified in your settings.py file. The speed benefits of caching Yahoo! BOSS may be fairly minimal, but for more exotic services (and frequent searches) the caching may become more of a feature.
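To make the caching idea concrete, here is a minimal sketch that runs without Django. It is not springsteen's actual implementation: a plain dict stands in for the configured cache backend, and the names cache_key, cached_search and fake_fetch are hypothetical.

```python
import hashlib
import json

# A plain dict stands in for Django's configured cache backend (assumption).
_cache = {}

def cache_key(service_name, query, params):
    """Build a stable key from the service name, query and parameters."""
    raw = json.dumps([service_name, query, params], sort_keys=True)
    return "springsteen:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

def cached_search(service_name, query, params, fetch):
    """Return cached results if present; otherwise call fetch and store them."""
    key = cache_key(service_name, query, params)
    if key not in _cache:
        _cache[key] = fetch(query, params)
    return _cache[key]

calls = []

def fake_fetch(query, params):
    # Hypothetical stand-in for a slow remote search call.
    calls.append(query)
    return [{"title": "Result for %s" % query, "url": "http://example.com/"}]

first = cached_search("web", "django", {"count": 5}, fake_fetch)
second = cached_search("web", "django", {"count": 5}, fake_fetch)
```

The second call never reaches fake_fetch, which is the whole benefit: a slow or rate-limited service is only hit once per distinct query and parameter set.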
To clean up the appearance, override either the springsteen/base.html or springsteen/results.html templates. (You can also override the *_result.html templates to customize different result types.)
BOSS Results with Site Restrict
If you only want web results from a single site (the poor man's site search), you can subclass the springsteen.services.Web class. (You could restrict news or images by subclassing the springsteen.services.News and springsteen.services.Images classes respectively.)
First the subclassing.
from springsteen.services import Web

class DjangoProjectSearch(Web):
    def __init__(self, query, params={}):
        # Name this class in super() so that Web's own __init__ still runs.
        super(DjangoProjectSearch, self).__init__(query, params)
        self.params['sites'] = 'djangoproject.com'
You can add other parameters as well, which are defined in the Yahoo! BOSS documentation.
After writing your custom search, write a view that uses it.
from springsteen.views import search
from my_searches import DjangoProjectSearch

def dp_search(request, timeout=2000, max_count=5,
              services=(DjangoProjectSearch,), extra_params={}):
    return search(request, timeout, max_count, services, extra_params)
Querying Multiple Services in Parallel
One of the frequent mistakes I've made as a web developer is to make HTTP requests sequentially when they could have been done concurrently. springsteen aims to aggregate numerous search services, so it needs to be able to request and process them in parallel.
To perform concurrent requests, simply specify multiple services. (Note that the values read from settings below are not standard, but you can put them in your settings.py if that's how you like to organize globals.)
from django.conf import settings
from springsteen.services import Web, Images
from springsteen.views import search

def my_search(request):
    timeout = settings.SPRINGSTEEN_TIMEOUT
    max_count = settings.SPRINGSTEEN_MAX_COUNT
    services = (Web, Images)
    return search(request, timeout, max_count, services)
By default, results from services are stacked on one another. For example, the above my_search would return all results from Images and then begin showing results from Web.
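For a feel of what "in parallel, then stacked" means, here is a standalone sketch using only the standard library. It is an illustration under assumptions rather than springsteen's code: fake_web, fake_images and search_all are hypothetical names, the services are simulated with a short sleep, and stacking follows the order in which the services were listed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fake_web(query):
    # Hypothetical stand-in for a remote web search.
    time.sleep(0.05)
    return [{"source": "web", "title": "%s page" % query}]

def fake_images(query):
    # Hypothetical stand-in for a remote image search.
    time.sleep(0.05)
    return [{"source": "images", "title": "%s photo" % query}]

def search_all(query, services, timeout_ms):
    """Run every service in its own thread and stack whatever finished in time."""
    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = [pool.submit(service, query) for service in services]
    done, _not_done = wait(futures, timeout=timeout_ms / 1000.0)
    pool.shutdown(wait=False)
    stacked = []
    for future in futures:  # keep the order in which services were listed
        if future in done and future.exception() is None:
            stacked.extend(future.result())
    return stacked

results = search_all("springsteen", (fake_web, fake_images), timeout_ms=2000)
```

A service that is too slow or raises an error is simply left out, so one flaky partner cannot hold up the whole results page.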
Ranking results is the hardest part of search, and springsteen won't solve that. Instead it'll give you the levers to do it yourself. For most small scale situations it should be possible to write fairly concise ranking logic that is specific to the services you're querying that will outperform any generic genius that springsteen might try to provide.
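As an example of how concise service-specific ranking can be, here is a sketch of two tiny strategies over plain result dictionaries. Both functions (interleave and boost_by_source) are hypothetical and are not part of springsteen; they only assume that each result carries a source key.

```python
from itertools import zip_longest

def interleave(*result_lists):
    """Alternate between services so no single source dominates the top."""
    merged = []
    for group in zip_longest(*result_lists):
        merged.extend(r for r in group if r is not None)
    return merged

def boost_by_source(results, weights):
    """Stable sort: higher-weighted sources first, original order otherwise."""
    return sorted(results, key=lambda r: -weights.get(r["source"], 0))

web = [{"source": "web", "title": "w1"}, {"source": "web", "title": "w2"}]
images = [{"source": "images", "title": "i1"}]

mixed = interleave(web, images)
boosted = boost_by_source(mixed, {"images": 2, "web": 1})
```

Interleaving suits services of roughly equal quality, while a per-source weight is the simplest way to say "my own site's results always come first".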
Exposing Results via a Springsteen Service
Because springsteen is all about aggregating search services, it will be gradually extended to understand new formats. However, sometimes you just want to expose new data to springsteen, and haven't already decided on a format.
For those situations, you can use a Springsteen Service. Cool name aside, they are about as simple as it gets. Let's imagine that you can somehow get search results in CSV format (no, it doesn't make sense; it's an example).
Perhaps your data looks like this:
title, url, text
abc, http://yadayad/abc/, some text here
efg, http://yadayad/efg/, some text here as well
and you have a function csv_search which returns relevant rows. You could expose that via a Springsteen Service as follows.
from csv import DictReader
from fake_web_service import csv_search
from django.utils import simplejson
from springsteen.views import service

def retrieve_csv_results(query, start, count):
    csv_results, total_results = csv_search(query, start, count)
    results = []
    for line in DictReader(csv_results):
        result = {
            'title': line['title'],
            'url': line['url'],
            'text': line['text'],
        }
        results.append(result)
    data = {
        'total_results': total_results,
        'results': results,
    }
    return simplejson.dumps(data)

def my_service(request):
    return service(request, retrieve_func=retrieve_csv_results)
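One detail worth checking if your CSV really looks like the sample above: there is a space after each comma, so a default DictReader would produce keys such as ' url' instead of 'url'. The standalone sketch below (modern Python, standard library only) parses that exact sample with skipinitialspace=True, a standard option of the csv module; the variable names are just for illustration.

```python
import csv
import io

sample = (
    "title, url, text\n"
    "abc, http://yadayad/abc/, some text here\n"
    "efg, http://yadayad/efg/, some text here as well\n"
)

# skipinitialspace=True drops the space after each comma, so the header
# yields the keys 'title', 'url' and 'text' rather than ' url' and ' text'.
rows = list(csv.DictReader(io.StringIO(sample), skipinitialspace=True))
results = [
    {"title": row["title"], "url": row["url"], "text": row["text"]}
    for row in rows
]
```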
At this point in time the only three acknowledged fields for a Springsteen Service are the above title, url and text. As the need arises the standard may be fleshed out to accommodate additional metadata.
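Since the format is this small, it is easy to check a payload before sending it. The sketch below builds the JSON shape used in the example above (a total_results count plus a results list) and rejects rows that lack one of the three fields. The helper build_payload and the strict validation are my own assumptions, not something springsteen enforces.

```python
import json

REQUIRED_FIELDS = ("title", "url", "text")

def build_payload(results, total_results):
    """Serialize results in the Springsteen Service shape, rejecting incomplete rows."""
    for result in results:
        missing = [f for f in REQUIRED_FIELDS if f not in result]
        if missing:
            raise ValueError("result is missing fields: %s" % ", ".join(missing))
    return json.dumps({"total_results": total_results, "results": results})

payload = build_payload(
    [{"title": "abc", "url": "http://yadayad/abc/", "text": "some text here"}],
    total_results=1,
)
decoded = json.loads(payload)
```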
Rather than the hypothetical csv_search, you can use this approach to wrap Solango or Django-Sphinx results, as well as to route to non-Django APIs or services in your ecosystem.
A Search API Repeater & Transformer
Let's say you want to implement a site search API for your blog, but don't have the "engineering resources" to integrate a solution like Solango or Django-Sphinx.
First we need to subclass Web to get our site's results.
from springsteen.services import Web

class MySiteService(Web):
    def __init__(self, query, params={}):
        # Name this class in super() so that Web's own __init__ still runs.
        super(MySiteService, self).__init__(query, params)
        self.params['sites'] = 'lethain.com'
Then we need to expose the results.
from django.http import HttpResponse
from springsteen.views import service
from django.utils import simplejson
from somewhere import MySiteService

def retrieve_func(query, start, count):
    params = {'start': start, 'count': count}
    mss = MySiteService(query, params)
    mss.run()
    results = mss.results()
    json = simplejson.dumps(results)
    return HttpResponse(json, mimetype="application/json")
Now the Yahoo! BOSS results are transformed into the Springsteen Service format, and can be predictably queried by external springsteen searches.
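The transformation itself amounts to renaming fields. Here is a standalone sketch of that mapping. It assumes a BOSS web result is a dictionary carrying title, url and abstract keys, so check the BOSS documentation before relying on those names. The function to_springsteen is hypothetical.

```python
import json

def to_springsteen(boss_results, total_results):
    """Map BOSS-style result dicts onto the title/url/text fields."""
    results = [
        {
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            # 'abstract' as the summary field is an assumption about BOSS output.
            "text": item.get("abstract", ""),
        }
        for item in boss_results
    ]
    return json.dumps({"total_results": total_results, "results": results})

boss = [
    {
        "title": "Django",
        "url": "http://djangoproject.com/",
        "abstract": "The web framework",
        "size": "1234",  # extra fields are simply dropped
    }
]
decoded = json.loads(to_springsteen(boss, total_results=1))
```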
Retrieving Results from Springsteen Service
Retrieving results from a Springsteen Service is simple, akin to retrieving Yahoo! BOSS results.
from springsteen.services import SpringsteenService, Web
from springsteen.views import search

class MyService(SpringsteenService):
    _uri = "http://example.com/search/cvs/"

def my_search(request, timeout=2500, max_count=20):
    services = (MyService, Web)
    return search(request, timeout, max_count, services)
springsteen already knows how to display results from a Springsteen Service, so integration is rather concise.
Accessing a Custom Service
It's always easiest when you can get partners to expose a service in the format you want (in this case, a SpringsteenService), but sometimes you have to get in there and parse the data yourself.
In springsteen.services both the SpringsteenService and BossSearch classes provide examples of interfacing with different data formats.
The key point is to write a run method that retrieves results and converts them into a Python list of dictionaries. If you want to render them with one of the existing template fragments (web, news, image or springsteen results) then you should add the corresponding value to the source key for each result's dictionary.
def run(self):
    stuff, total_results = get_results(self.query, self.params)
    results = simplejson.loads(stuff)
    for result in results:
        result['source'] = 'web'
    self._results = results
    self.total_results = total_results
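To show the same pattern end to end with a non-JSON format, here is a self-contained sketch that parses a small XML feed. Everything in it is hypothetical: the XmlFeedService class, its fetch method, and the feed layout are made up for illustration, and the class does not subclass springsteen's real Service so that it can run on its own.

```python
import xml.etree.ElementTree as ET

FEED = """<results total="2">
  <item><name>First</name><link>http://example.com/1</link><summary>One</summary></item>
  <item><name>Second</name><link>http://example.com/2</link><summary>Two</summary></item>
</results>"""

class XmlFeedService(object):
    """Hypothetical service; a real one would subclass springsteen's Service."""

    def __init__(self, query, params=None):
        self.query = query
        self.params = params or {}
        self._results = []
        self.total_results = 0

    def fetch(self):
        # Stand-in for an HTTP request to the partner's endpoint.
        return FEED

    def run(self):
        root = ET.fromstring(self.fetch())
        self.total_results = int(root.get("total"))
        self._results = [
            {
                "title": item.findtext("name"),
                "url": item.findtext("link"),
                "text": item.findtext("summary"),
                "source": "web",  # picks the template fragment used to render
            }
            for item in root.findall("item")
        ]

    def results(self):
        return self._results

feed_service = XmlFeedService("anything")
feed_service.run()
```

Whatever the upstream format, the contract stays the same: after run() there is a list of dictionaries with a source key and a total_results count.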
Let me know if it proves challenging to follow the existing results, and I'll gladly provide a complete walkthrough of subclassing Service and CachableService.
The Future of springsteen
At the moment the core of springsteen is nearly complete; I just need to refactor slightly to facilitate inserting custom ranking logic. Beyond that, there are an infinite number of services that springsteen would like to know how to query and display.
There is a working example of both exposing data via a Springsteen Service and querying and aggregating results, and hopefully it will be polished enough to reveal by this upcoming Monday.
I hope that springsteen and its vision of distributed search by small-time players is something that you find exciting. I know I'm excited about the prospect of creating targeted and relevant search boxes powered not by thousands of commodity servers in datacenters but instead by my VPS, and yours.
Download
django-springsteen is available on GitHub.
git clone git@github.com:lethain/django-springsteen.git
We're all busy people, but you're more than welcome to join in the development!