March 5, 2009.
(View a live django-springsteen example here.)
(Also, I've greatly simplified the process of deploying django-springsteen on Google App Engine, as explained here, but you will still want to read this article to understand how to customize Springsteen.)
Vik Singh premiered running Yahoo! BOSS on Google App Engine quite some months ago, but django-springsteen is a somewhat different magician than the BOSS Mashup Framework, so hopefully you'll forgive a bit of repetition.
In this article I'll walk through the brief steps necessary to deploy a slightly interesting search engine by taking advantage of Springsteen's built-in support for Yahoo! BOSS, Twitter, and Amazon. It shouldn't take more than half an hour to get up and running.
First register a new Google App Engine application.
I'm using djangosearch, because apparently I registered it the first time I did a lame Yahoo! BOSS on GAE tutorial.
It's good to know I'm not stuck in a rut or anything.
Check out the django-springsteen source from GitHub.
git clone git://github.com/lethain/django-springsteen.git django-springsteen
Now we're going to salvage some relevant pieces of django-springsteen's repository, adapt them to our new purposes, and throw away the rest.
mv django-springsteen/example_project/ ./djangosearch
mv django-springsteen/springsteen/ djangosearch/
rm -rf django-springsteen
Next we want to grab a recent Django tarball from djangoproject.com/download/.
tar -xvf Django-1.0.2-final.tar
mv Django-1.0.2-final/django/ ./
rm -rf Django-1.0.2-*
rm -rf django/bin django/contrib/admin django/contrib/auth
rm -rf django/contrib/databrowse django/test
rm -rf django/contrib/admindocs django/contrib/gis
(We need to remove these files to get under the 1000 file limit for Google App Engine. You can also do some file zipping magic to get around it, but this approach is a bit simpler.)
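If you want to check how close you are to that limit after pruning, a few lines of Python will count the files for you. This is only a sketch: `count_files` is a hypothetical helper name, not part of Springsteen or the App Engine SDK, and it simply counts every regular file under a directory.

```python
import os


def count_files(root):
    """Return the number of files found under root, recursively."""
    total = 0
    # os.walk yields (dirpath, dirnames, filenames) for every directory.
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

Calling `count_files("djangosearch")` from the directory above the project gives the number to compare against the limit.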
Actually, django-springsteen pretty much works with Django 0.96 except for using the safe template filter in some of the templates. If you're willing to strip it out, then you could skip installing a more recent version of Django.
And next we need to scavenge several pieces from the django_example for Google App Engine.
First create the djangosearch/main.py file with these contents.
import logging, os, sys

# Google App Engine imports.
from google.appengine.ext.webapp import util

# Remove the standard version of Django.
for k in [k for k in sys.modules if k.startswith('django')]:
    del sys.modules[k]

# Force sys.path to have our own directory first, in case we want to import
# from it.
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))

# Must set this env var *before* importing any part of Django
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

import django.core.handlers.wsgi

def main():
    # Create a Django application for WSGI.
    application = django.core.handlers.wsgi.WSGIHandler()
    # Run the WSGI CGI handler with that application.
    util.run_wsgi_app(application)

if __name__ == '__main__':
    main()
Next we need to create the djangosearch/app.yaml file. (Be sure to replace djangosearch with the name of the application you registered.)
application: djangosearch
version: 1
runtime: python
api_version: 1
handlers:
- url: /static
  static_dir: static
- url: /.*
  script: main.py
And finally djangosearch/index.yaml.
indexes:
# AUTOGENERATED
# This index.yaml is automatically updated whenever the dev_appserver
# detects that a new type of query is run. If you want to manage the
# index.yaml file manually, remove the above marker line (the line
# saying "# AUTOGENERATED"). If you want to manage some indexes
# manually, move them above the marker line. The index.yaml file is
# automatically uploaded to the admin console when you next deploy
# your application using appcfg.py.
Next open up djangosearch/local_settings.py and add these at the bottom.
ROOT_URLCONF = 'urls'
MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.middleware.doc.XViewMiddleware',
)
INSTALLED_APPS = ('springsteen',)
DATABASE_ENGINE = None
DATABASE_NAME = None
CACHE_BACKEND = "dummy:///"
Create the djangosearch/boss_settings.py file, which contains only the BOSS_APP_ID parameter, and AMAZON_ACCESS_KEY if you have one. (You'll need to sign up here and here to get an AWS Affiliate ID if you want Amazon search results, which is the uninspired man's choice for monetizing a Springsteen service out of the box.)
BOSS_APP_ID = "abcdefghijlknop"
Tweak the djangosearch/urls.py file to remove all references to example_project, as well as removing the extra url patterns.
from django.conf.urls.defaults import *
urlpatterns = patterns('',
    (r'^$', 'views.search'),
)
Time out. Let's pick a topic for our new search engine. Hmm... hmm.... Okay, let's make it a search engine that is specialized on Apple products. What could go wrong?
Next let's configure our search results.
Go ahead and open up djangosearch/views.py, and start by removing everything.
Start rebuilding by adding these imports:
from springsteen.views import search as default_search
from springsteen.services import Web, TwitterLinkSearchService, AmazonProductService
from django.conf import settings
Next let's create our Amazon product service (if you have an Amazon Affiliates AWS key).
class ComputerAmazonSearch(AmazonProductService):
    _access_key = settings.AMAZON_ACCESS_KEY
    _topic = 'apple'
Followed by creating an Apple flavored Twitter service.
class AppleTwitterService(TwitterLinkSearchService):
    _qty = 3
    _topic = 'apple'
Finally we just need to mix in web results from Yahoo! BOSS and then expose our new search engine.
def search(request, timeout=2500, max_count=10):
    services = (ComputerAmazonSearch, AppleTwitterService, Web)
    return default_search(request, timeout, max_count, services)
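The Amazon service is optional, but the views above read settings.AMAZON_ACCESS_KEY unconditionally, which would likely fail if you never defined that setting. Here is a minimal sketch of building the services tuple only from what is configured. It assumes nothing about Springsteen's internals: the three service classes below are empty stand-ins so the example runs on its own, and `pick_services` and `FakeSettings` are hypothetical names introduced just for illustration.

```python
class Web(object):
    """Stand-in for springsteen.services.Web."""


class AppleTwitterService(object):
    """Stand-in for the Twitter service defined above."""


class ComputerAmazonSearch(object):
    """Stand-in for the Amazon service defined above."""


def pick_services(settings, amazon=ComputerAmazonSearch):
    """Return the services tuple, including Amazon only when a key is set."""
    services = (AppleTwitterService, Web)
    # getattr with a default avoids an AttributeError when the setting is absent.
    if getattr(settings, "AMAZON_ACCESS_KEY", None):
        services = (amazon,) + services
    return services


class FakeSettings(object):
    """Hypothetical settings object, used only to exercise pick_services."""
    BOSS_APP_ID = "abcdefghijlknop"
```

In a real views.py you would pass django.conf.settings instead of `FakeSettings`, and the Amazon class body would also need to read the key with `getattr(settings, 'AMAZON_ACCESS_KEY', None)` so that importing the module does not fail.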
A Short Warning
Please note that Yahoo! BOSS search results won't be retrieved successfully when you are testing your springsteen application locally. However, they will be correctly retrieved once you push your app to production. I'll look into some kind of patch for this, but no need to panic.
At this point we have everything working correctly, but results are just stacked on top of each other. Sure, you might love having those Amazon affiliate links clustered at the top, but your users might not. Now's a nice time to dip our toes into relevancy.
You want the most relevant results to bubble to the top (feel free to make a bubble sort pun), but naively stacking results from different services doesn't permit that unless all results from source A are more relevant than those from source B, all results from B are more relevant than those from source C, and so on. Let's take a stab at a very simple relevancy algorithm to address these problems.
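You can think of two kinds of relevancy approaches: those that score each result on its own merits, and those that score a result relative to the other results in the set.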
We're going to do a little bit of both here. First we're going to boost results which contain the query term in their title, and second we're going to punish the 2nd-Nth results from an already encountered domain.
Place this code in views.py above the search function.
def ranking(query, results):
    query = query.lower()

    def rank(result):
        # Boost results whose title contains the query term.
        score = 0.0
        title = result['title'].lower()
        if query in title:
            score += 1.0
        return score

    scored = [(rank(x), x) for x in results]
    scored2 = []
    domains = {}
    for score, result in scored:
        # Punish the 2nd-Nth results from an already encountered domain.
        domain = result['url'].replace('http://','').split('/')[0]
        times_viewed = domains.get(domain, 0)
        new_score = score + times_viewed * -0.1
        scored2.append((new_score, result))
        domains[domain] = times_viewed + 1
    # Sort on the score alone, highest first, so the best results lead.
    scored2.sort(key=lambda pair: pair[0], reverse=True)
    return [ x[1] for x in scored2 ]
And then update the search function to use this ranking function.
def search(request, timeout=2500, max_count=10):
    services = (ComputerAmazonSearch, AppleTwitterService, Web)
    return default_search(request, timeout, max_count,
                          services, {}, ranking)
Now our results are ranked using the above ranking function. This is a pretty basic approach to relevancy, but hopefully shows the basic concepts.
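One weak spot in the ranking function is how it finds the domain: it only strips an 'http://' prefix, so every https URL ends up keyed under the same 'https:' string. Below is a sketch of a sturdier approach built on the standard library's urlparse. `domain_of` and `domain_penalties` are hypothetical helpers, not part of Springsteen, and dropping a leading 'www.' is my own assumption about what should count as the same domain.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def domain_of(url):
    """Return the lowercased host of url, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def domain_penalties(urls, step=0.1):
    """Return one penalty per URL, in order.

    The first result from a domain gets no penalty, the second gets
    -step, the third -2 * step, and so on.
    """
    seen = {}
    penalties = []
    for url in urls:
        domain = domain_of(url)
        count = seen.get(domain, 0)
        penalties.append(-step * count)
        seen[domain] = count + 1
    return penalties
```

Inside the ranking function you could swap the replace-and-split line for a call to `domain_of(result['url'])` and leave the rest untouched.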
Now we have our search engine up and running, and now is a good time to customize your site's templates.
First make a templates directory within djangosearch, as well as a templates/springsteen directory and a few empty files.
cd djangosearch
mkdir templates templates/springsteen
touch templates/base.html
touch templates/springsteen/base.html
Then let's edit the templates/base.html file.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html> <head>
<title>{% block title %}FruitySearch{% endblock %}</title>
<link rel="stylesheet" type="text/css" href="/static/css/reset.css">
<link rel="stylesheet" type="text/css" href="/static/css/search.css">
</head> <body>
<div id="body">
<div id="hd"><h1><a href="/">FruitySearch</a></h1></div>
{% block body %}{% endblock %}
<div id="ft"><p>A <a href="">Your-Name-Here</a> production, 2009.</p></div>
</div> </body> </html>
Next edit templates/springsteen/base.html (this one is pretty brief).
{% extends "base.html" %}
{% block body %}{% endblock %}
As you continue customizing the appearance of your results, you'll probably want to override templates/springsteen/results.html, but for the time being it should be a reasonable default.
Create some CSS to style the site.
cd djangosearch
mkdir static static/css
The current base.html assumes you'll have reset.css and search.css files. Recently I tend to use YUI's reset.css, and just mashed together some custom stylings for search.css.
Test everything out.
cd djangosearch
dev_appserver.py ./
Assuming that worked, go ahead and push it to Google App Engine, and you're done.
You'll notice by default you won't be able to paginate past 3-4 pages. This is because of a safety mechanism in Springsteen that makes sense when you're dealing with a large number of sources, but is a bit annoying when you're really only dealing with one source. You can override this setting in local_settings.py by adding this line:
# allow up to 10 pages
SPRINGSTEEN_MAX_MATCHES = 10
Also, at the moment this setup isn't using any caching, and as a result it simply will not behave efficiently at higher pages (displaying results 100-110, for example).
It is possible to use memcache on GAE, and I'll write a patch which enables Springsteen to take advantage of that functionality in the next day or two.
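To give a feel for what such caching could look like, here is a rough sketch of caching fetched results by service, query, and offset. It is not the patch mentioned above and says nothing about how Springsteen will actually do it. `DictCache` is an in-memory stand-in with a memcache-like get/set interface so the example runs anywhere, and `cached_fetch` plus the fetch callback's signature are assumptions made for illustration.

```python
import time


class DictCache(object):
    """In-memory stand-in for a memcache-style client (get/set with expiry)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            # The entry has expired, so drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key, value, seconds=0):
        # seconds=0 means the entry never expires.
        expires_at = time.time() + seconds if seconds else None
        self._data[key] = (value, expires_at)


def cached_fetch(cache, fetch, service_name, query, start, count, seconds=300):
    """Return fetch(query, start, count), reusing a cached copy when present."""
    # The key covers everything that changes the answer. A real memcache
    # client may impose limits on key length and characters.
    key = "%s:%s:%d:%d" % (service_name, query.lower(), start, count)
    results = cache.get(key)
    if results is None:
        results = fetch(query, start, count)
        cache.set(key, results, seconds)
    return results
```

With something like this in place, a second request for results 100-110 of the same query would be served from the cache instead of going back to the remote service.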
Let me know if you have any problems or questions about deploying Springsteen on Google App Engine! I think it's pretty amazing how far the landscape has evolved to be able to roll out a product like this at no cost and essentially no effort.
I guess it's our responsibility to take advantage of it.
I'll post a reusable package in a day or two once I update the caching mechanism to be smart enough to play nicely with Google App Engine. (If I post it now people will use it and then complain that it behaves as advertised...)