Deploying django-springsteen on Google App Engine
(View a live django-springsteen example here.)
(Also, I've greatly simplified the process of deploying django-springsteen on Google App Engine, as explained here, but you will still want to read this article to understand how to customize Springsteen.)
Vik Singh premiered running Yahoo! BOSS on Google App Engine quite some months ago, but django-springsteen is a somewhat different magician than the BOSS Mashup Framework, so hopefully you'll forgive a bit of repetition.
In this article I'll walk through the brief steps necessary to deploy a slightly interesting search engine by taking advantage of Springsteen's built-in support for Yahoo! BOSS, Twitter, and Amazon. It shouldn't take more than half an hour to get up and running.
First register a new Google App Engine application. I'm using djangosearch, because apparently I registered it the first time I did a lame Yahoo! BOSS on GAE tutorial. It's good to know I'm not stuck in a rut or anything.
Check out the django-springsteen source from GitHub.
git clone git://github.com/lethain/django-springsteen.git django-springsteen
Now we're going to salvage some relevant pieces of django-springsteen's repository, adapt them to our new purposes, and throw away the rest.
mv django-springsteen/example_project/ ./djangosearch
mv django-springsteen/springsteen/ djangosearch/
rm -rf django-springsteen
Next we want to grab a recent Django tarball from djangoproject.com/download/.
tar -xvf Django-1.0.2-final.tar
mv Django-1.0.2-final/django/ ./
rm -rf Django-1.0.2-*
rm -rf django/bin django/contrib/admin django/contrib/auth
rm -rf django/contrib/databrowse django/test
rm -rf django/contrib/admindocs django/contrib/gis
(We need to remove these files to get under the 1000 file limit for Google App Engine. You can also do some file zipping magic to get around it, but this approach is a bit simpler.)
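If you want to check how close you are to that limit before uploading, a few lines of Python will count the files for you. This is only a rough sketch: the helper names are my own invention, and the 1000 figure is simply the limit quoted above.

```python
import os

GAE_FILE_LIMIT = 1000  # the limit quoted above; adjust if yours differs


def count_files(root):
    """Return the number of files found under root, recursively."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def fits_file_limit(root, limit=GAE_FILE_LIMIT):
    """True if the tree under root holds at most `limit` files."""
    return count_files(root) <= limit
```

Calling count_files("djangosearch") from the parent directory should tell you whether another round of rm -rf is in order.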
Actually, django-springsteen pretty much works with Django 0.96 except for using the safe template filter in some of the templates. If you're willing to strip it out, then you could skip installing a more recent version of Django.
And next we need to scavenge several pieces from the django_example for Google App Engine. First create the djangosearch/main.py file with these contents.
import logging, os, sys

# Google App Engine imports.
from google.appengine.ext.webapp import util

# Remove the standard version of Django.
for k in [k for k in sys.modules if k.startswith('django')]:
    del sys.modules[k]

# Force sys.path to have our own directory first, in case we want to import
# from it.
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))

# Must set this env var before importing any part of Django
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
import django.core.handlers.wsgi

def main():
    # Create a Django application for WSGI.
    application = django.core.handlers.wsgi.WSGIHandler()

    # Run the WSGI CGI handler with that application.
    util.run_wsgi_app(application)

if __name__ == '__main__':
    main()
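The loop near the top of main.py works because Python keeps every imported module in sys.modules. Delete an entry and the next import has to load the module again from sys.path. Here is a small sketch of that idea on a stand-in dictionary, so you can try it without disturbing a real interpreter. The purge_modules helper is hypothetical, not something GAE or Django provides.

```python
def purge_modules(modules, prefix):
    """Delete every entry in `modules` whose name starts with `prefix`.

    `modules` is any dict keyed by module name; passing sys.modules
    would mimic the loop in main.py. Returns the removed names, sorted.
    """
    doomed = [name for name in modules if name.startswith(prefix)]
    for name in doomed:
        del modules[name]
    return sorted(doomed)


# Demonstrated on a stand-in dict rather than the real sys.modules.
fake_modules = {"django": 1, "django.core": 2, "os": 3}
removed = purge_modules(fake_modules, "django")
```

Note that a plain prefix test also catches any module whose name merely begins with "django", which is worth remembering if you name your own modules that way.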
Next we need to create the djangosearch/app.yaml file. (Be sure to replace djangosearch with the name of the application you registered.)
application: djangosearch
version: 1
runtime: python
api_version: 1

handlers:
- url: /static
  static_dir: static

- url: /.*
  script: main.py
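The order of those two handlers matters, since the first handler that matches a request wins. The sketch below imitates that behavior with plain regular expressions. It is a simplification I wrote for illustration, not how App Engine actually implements routing, and the route helper is a made-up name.

```python
import re

# (pattern, target) pairs in the same order as the handlers in app.yaml.
HANDLERS = [
    (r"/static", "static files"),
    (r"/.*", "main.py"),
]


def route(path, handlers=HANDLERS):
    """Return the target of the first handler whose pattern matches path."""
    for pattern, target in handlers:
        if re.match(pattern, path):
            return target
    return None
```

If the catch-all /.* pattern came first, requests for your stylesheets would be handed to Django instead of being served as static files.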
And finally djangosearch/index.yaml.
indexes:

# AUTOGENERATED

# This index.yaml is automatically updated whenever the dev_appserver
# detects that a new type of query is run. If you want to manage the
# index.yaml file manually, remove the above marker line (the line
# saying "# AUTOGENERATED"). If you want to manage some indexes
# manually, move them above the marker line. The index.yaml file is
# automatically uploaded to the admin console when you next deploy
# your application using appcfg.py.
Next open up djangosearch/local_settings.py and add these at the bottom.
ROOT_URLCONF = 'urls'
MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.middleware.doc.XViewMiddleware',
)
INSTALLED_APPS = ('springsteen',)
DATABASE_ENGINE = None
DATABASE_NAME = None
CACHE_BACKEND = "dummy:///"
Create the djangosearch/boss_settings.py file, which contains only the BOSS_APP_ID parameter, and AMAZON_ACCESS_KEY if you have one. (You'll need to sign up here and here to get an AWS Affiliate ID if you want Amazon search results, which is the uninspired man's choice for monetizing a Springsteen service out of the box.)
BOSS_APP_ID = "abcdefghijlknop"
Tweak the djangosearch/urls.py file to remove all references to example_project, as well as removing the extra url patterns.
from django.conf.urls.defaults import *

urlpatterns = patterns('',
    (r'^$', 'views.search'),
)
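That single remaining pattern only matches the site root. Django drops the leading slash from the request path before testing it against your patterns, so '^$' means "nothing left after the slash". A quick way to convince yourself, using only the re module (the matches_root helper is just for illustration and is not part of Django):

```python
import re

ROOT_PATTERN = r'^$'  # the only pattern left in urls.py


def matches_root(request_path):
    """Test a request path against ROOT_PATTERN, minus its leading slash."""
    if request_path.startswith('/'):
        request_path = request_path[1:]
    return re.search(ROOT_PATTERN, request_path) is not None
```

So every search is served from /, and any other path gets a 404 from Django unless app.yaml routed it somewhere else first.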
Time out. Let's pick a topic for our new search engine. Hmm... hmm.... Okay, let's make it a search engine that specializes in Apple products. What could go wrong?
Next let's configure our search results. Go ahead and open up djangosearch/views.py, and start by removing everything.
Start rebuilding by adding these imports:
from springsteen.views import search as default_search
from springsteen.services import Web, TwitterLinkSearchService, AmazonProductService
from django.conf import settings
Next let's create our Amazon product service (if you have an Amazon Affiliates AWS key).
class ComputerAmazonSearch(AmazonProductService):
    _access_key = settings.AMAZON_ACCESS_KEY
    _topic = 'apple'
Followed by creating an Apple flavored Twitter service.
class AppleTwitterService(TwitterLinkSearchService):
    _qty = 3
    _topic = 'apple'
Finally we just need to mix in web results from Yahoo! BOSS and then expose our new search engine.
def search(request, timeout=2500, max_count=10):
    services = (ComputerAmazonSearch, AppleTwitterService, Web)
    return default_search(request, timeout, max_count, services)
A Short Warning
Please note that Yahoo! BOSS search results won't be retrieved successfully when you are testing your springsteen application locally. However, they will be correctly retrieved once you push your app to production. I'll look into some kind of patch for this, but no need to panic.
At this point we have everything working correctly, but results are just stacked on top of each other. Sure, you might love having those Amazon affiliate links clustered at the top, but your users might not. Now's a nice time to dip our toes into relevancy.
You want the most relevant results to bubble to the top (feel free to make a bubble sort pun), but naively stacking results from different services doesn't permit that unless all results from source A are more relevant than those from source B, all results from B are more relevant than those from source C and so on. Let's take a stab at a very simple relevancy algorithm to address these problems.
You can think of two kinds of relevancy approaches:
- Scoring results on their individual merits. We might call this intrinsic relevance.
- Scoring results in regard to each other. We might call this contextual relevance.
We're going to do a little bit of both here. First we're going to boost results which contain the query term in their title, and second we're going to punish the 2nd-Nth results from an already encountered domain.
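To make those two signals concrete, here is the arithmetic on a few made-up results. Everything in this sketch is illustrative: the weights, the helper names, and the sample data are assumptions of mine rather than anything Springsteen defines.

```python
TITLE_BOOST = 1.0      # illustrative weight for a title match
DOMAIN_PENALTY = 0.1   # illustrative cost per earlier result from the same domain


def intrinsic_score(query, title):
    """Intrinsic relevance: does the title contain the query term?"""
    if query.lower() in title.lower():
        return TITLE_BOOST
    return 0.0


def contextual_penalty(times_domain_seen):
    """Contextual relevance: repeated domains lose a little each time."""
    return DOMAIN_PENALTY * times_domain_seen


# Made-up (title, domain) pairs, in the order the services returned them.
sample = [
    ("Apple iPod nano", "store.example.com"),
    ("iPod accessories", "store.example.com"),
    ("Fruit pies", "recipes.example.org"),
]

seen = {}
scores = []
for title, domain in sample:
    times_seen = seen.get(domain, 0)
    scores.append(intrinsic_score("ipod", title) - contextual_penalty(times_seen))
    seen[domain] = times_seen + 1
```

For the query "ipod", the first two results both earn the title boost, but the second one pays a small penalty for repeating the domain, and the third earns nothing at all.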
Place this code in views.py above the search function.
def ranking(query, results):
    query = query.lower()

    def rank(result):
        score = 0.0
        title = result['title'].lower()
        # Boost results whose title contains the query term.
        if query in title:
            score += 1.0
        return score

    scored = [(rank(x), x) for x in results]
    scored2 = []
    domains = {}
    for score, result in scored:
        domain = result['url'].replace('http://','').split('/')[0]
        times_viewed = domains.get(domain, 0)
        # Punish the 2nd-Nth results from an already encountered domain.
        new_score = score + times_viewed * -0.1
        scored2.append((new_score, result))
        domains[domain] = times_viewed + 1

    # Highest score first, so the most relevant results come out on top.
    scored2.sort(key=lambda pair: pair[0], reverse=True)
    return [ x[1] for x in scored2 ]
And then update the search function to use this ranking function.
def search(request, timeout=2500, max_count=10):
    services = (ComputerAmazonSearch, AppleTwitterService, Web)
    return default_search(request, timeout, max_count, services, {}, ranking)
Now our results are ranked using the above ranking function. This is a pretty basic approach to relevancy, but hopefully shows the basic concepts.
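One easy improvement, if you feel like tinkering: the ranking function finds the domain by stripping 'http://' by hand, so every https URL ends up filed under the "domain" https: instead of its real host. The standard library can parse the URL for you. A possible replacement is sketched below; domain_of is a name I made up, and the import dance is only there so the snippet runs on both old and new Pythons.

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2


def domain_of(url):
    """Return the lowercased host part of a URL, whatever its scheme."""
    return urlparse(url).netloc.lower()
```

You could then write domain = domain_of(result['url']) inside the loop and leave the rest of the function alone.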
Now that we have our search engine up and running, it's a good time to customize the site's templates.
First make a templates directory within djangosearch, as well as a templates/springsteen directory and a few empty files.
cd djangosearch
mkdir templates templates/springsteen
touch templates/base.html
touch templates/springsteen/base.html
Then let's edit the templates/base.html file.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>{% block title %}FruitySearch{% endblock %}</title>
    <link rel="stylesheet" type="text/css" href="/static/css/reset.css">
    <link rel="stylesheet" type="text/css" href="/static/css/search.css">
  </head>
  <body>
    <div id="body">
      <div id="hd"><h1><a href="/">FruitySearch</a></h1></div>
      {% block body %}{% endblock %}
      <div id="ft"><p>A <a href="">Your-Name-Here</a> production, 2009.</p></div>
    </div>
  </body>
</html>
Next edit templates/springsteen/base.html (this one is pretty brief).
{% extends "base.html" %}
{% block body %}{% endblock %}
As you continue customizing the appearance of your results, you'll probably want to override templates/springsteen/results.html, but for the time being it should be a reasonable default.
Create some CSS to style the site.
cd djangosearch
mkdir static static/css
The current base.html assumes you'll have reset.css and search.css files. Recently I tend to use YUI's reset.css, and just mashed together some custom stylings for search.css.
Test everything out.
cd djangosearch
dev_appserver.py ./
Assuming that worked, go ahead and push it to Google App Engine, and you're done.
Some Cautions
You'll notice that by default you won't be able to paginate past 3-4 pages. This is because of a safety mechanism in Springsteen that makes sense when you're dealing with a large number of sources, but is a bit annoying when you're really only dealing with one source. You can override this setting in local_settings.py by adding this line:
# allow up to 10 pages
SPRINGSTEEN_MAX_MATCHES = 10
Also, at the moment this setup isn't using any caching, and as a result it simply will not behave efficiently at higher pages (displaying results 100-110, for example).
It is possible to use memcache on GAE, and I'll write a patch which enables Springsteen to take advantage of that functionality in the next day or two.
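In the meantime, the general shape of such a cache is simple enough to sketch. This is not the patch, and it is not how Springsteen caches anything today. It is a generic look-aside cache written against an object with get and set methods, which is the same pair of calls GAE's memcache module offers. DictCache and cached_fetch are names I invented for the example.

```python
class DictCache(object):
    """In-memory stand-in exposing the get/set calls used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss, like memcache does.
        return self._data.get(key)

    def set(self, key, value, time=0):
        # `time` (expiry in seconds) is accepted but ignored in this stand-in.
        self._data[key] = value


def cached_fetch(cache, key, fetch, ttl=300):
    """Return the cached value for key, calling fetch() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl)
    return value
```

On App Engine you would presumably pass google.appengine.api.memcache itself as the cache argument, with a key built from the query and the page number, so that a request for results 100-110 doesn't have to hit every service again.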
Let me know if you have any problems or questions about deploying Springsteen on Google App Engine! I think it's pretty amazing how far the landscape has evolved to be able to roll out a product like this at no cost and essentially no effort.
I guess it's our responsibility to take advantage of it.
I'll post a reusable package in a day or two once I update the caching mechanism to be smart enough to play nicely with Google App Engine. (If I post it now people will use it and then complain that it behaves as advertised...)