Deploying django-springsteen on Google App Engine

(View a live django-springsteen example here.)

(Also, I've greatly simplified the process of deploying django-springsteen on Google App Engine, as explained here, but you will still want to read this article to understand how to customize Springsteen.)

Vik Singh premiered running Yahoo! BOSS on Google App Engine quite some months ago, but django-springsteen is a somewhat different magician than the BOSS Mashup Framework, so hopefully you'll forgive a bit of repetition.

In this article I'll walk through the brief steps necessary to deploy a slightly interesting search engine by taking advantage of Springsteen's built-in support for Yahoo! BOSS, Twitter, and Amazon. It shouldn't take more than half an hour to get up and running.

  1. First register a new Google App Engine application. I'm using djangosearch, because apparently I registered it the first time I did a lame Yahoo! BOSS on GAE tutorial. It's good to know I'm not stuck in a rut or anything.

  2. Check out the django-springsteen source from GitHub.

    git clone git://github.com/lethain/django-springsteen.git django-springsteen
    
  3. Now we're going to salvage some relevant pieces of django-springsteen's repository, adapt them to our new purposes, and throw away the rest.

    mv django-springsteen/example_project/ ./djangosearch
    mv django-springsteen/springsteen/ djangosearch/
    rm -rf django-springsteen
    
  4. Next we want to grab a recent Django tarball from djangoproject.com/download/.

    tar -xvf Django-1.0.2-final.tar 
    mv Django-1.0.2-final/django/ ./
    rm -rf Django-1.0.2-*
    rm -rf django/bin django/contrib/admin django/contrib/auth
    rm -rf django/contrib/databrowse django/test
    rm -rf django/contrib/admindocs django/contrib/gis
    

    (We need to remove these files to get under the 1000 file limit for Google App Engine. You can also do some file zipping magic to get around it, but this approach is a bit simpler.)

    Actually, django-springsteen pretty much works with Django 0.96 except for using the safe template filter in some of the templates. If you're willing to strip it out, then you could skip installing a more recent version of Django.
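
Before uploading, it can be reassuring to check how close you are to the 1000 file limit mentioned above. The snippet below is a minimal sketch for doing that from a Python shell; count_files and FILE_LIMIT are names invented here for illustration, and are not part of Django, Springsteen, or the App Engine SDK.

```python
import os

# The limit quoted in this article; treat it as an assumption and
# check the current App Engine quotas before relying on it.
FILE_LIMIT = 1000


def count_files(root):
    """Count every regular file below root, including nested directories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def under_limit(root, limit=FILE_LIMIT):
    """Return True when the tree below root holds fewer files than limit."""
    return count_files(root) < limit
```

Calling count_files('djangosearch') after step 4 gives the number of files that would be uploaded; if it is still too high, trimming more of django/contrib is the next thing to try.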

  5. And next we need to scavenge several pieces from the django_example for Google App Engine. First create the djangosearch/main.py file with these contents.

    import logging, os, sys
    # Google App Engine imports.
    from google.appengine.ext.webapp import util
    
    # Remove the standard version of Django.
    for k in [k for k in sys.modules if k.startswith('django')]:
        del sys.modules[k]
    
    # Force sys.path to have our own directory first, in case we want to import
    # from it.
    sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
    
    # Must set this env var *before* importing any part of Django
    os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
    import django.core.handlers.wsgi
    
    def main():
        # Create a Django application for WSGI.
        application = django.core.handlers.wsgi.WSGIHandler()
    
        # Run the WSGI CGI handler with that application.
        util.run_wsgi_app(application)
    
    if __name__ == '__main__':
        main()
    

    Next we need to create the djangosearch/app.yaml file. (Be sure to replace djangosearch with the name of the application you registered.)

    application: djangosearch
    version: 1
    runtime: python
    api_version: 1
    
    handlers:
    - url: /static
      static_dir: static
    
    - url: /.*
      script: main.py
    

    And finally djangosearch/index.yaml.

    indexes:
    
    # AUTOGENERATED
    
    # This index.yaml is automatically updated whenever the dev_appserver
    # detects that a new type of query is run.  If you want to manage the
    # index.yaml file manually, remove the above marker line (the line
    # saying "# AUTOGENERATED").  If you want to manage some indexes
    # manually, move them above the marker line.  The index.yaml file is
    # automatically uploaded to the admin console when you next deploy
    # your application using appcfg.py.
    
  6. Next open up djangosearch/local_settings.py and add these at the bottom.

    ROOT_URLCONF = 'urls'
    MIDDLEWARE_CLASSES = (
        'django.middleware.common.CommonMiddleware',
        'django.middleware.doc.XViewMiddleware',
    )
    INSTALLED_APPS = ('springsteen',)
    DATABASE_ENGINE = None
    DATABASE_NAME = None
    CACHE_BACKEND = "dummy:///"
    
  7. Create the djangosearch/boss_settings.py file, which contains only the BOSS_APP_ID parameter, and AMAZON_ACCESS_KEY if you have one. (You'll need to sign up here and here to get an AWS Affiliate ID if you want Amazon search results, which is the uninspired man's choice for monetizing a Springsteen service out of the box.)

    BOSS_APP_ID = "abcdefghijlknop"
    
  8. Tweak the djangosearch/urls.py file to remove all references to example_project, as well as removing the extra url patterns.

    from django.conf.urls.defaults import *
    
    urlpatterns = patterns('',
        (r'^$', 'views.search'),
    )
    
  9. Time out. Let's pick a topic for our new search engine. Hmm... hmm.... Okay, let's make it a search engine that is specialized on Apple products. What could go wrong?

  10. Next let's configure our search results. Go ahead and open up djangosearch/views.py, and start by removing everything.

    Start rebuilding by adding these imports:

    from springsteen.views import search as default_search
    from springsteen.services import Web, TwitterLinkSearchService, AmazonProductService
    from django.conf import settings
    

    Next let's create our Amazon product service (if you have an Amazon Affiliates AWS key).

    class ComputerAmazonSearch(AmazonProductService):
        _access_key = settings.AMAZON_ACCESS_KEY
        _topic = 'apple'
    

    Followed by creating an Apple flavored Twitter service.

    class AppleTwitterService(TwitterLinkSearchService):
        _qty = 3
        _topic = 'apple'
    

    Finally we just need to mix in web results from Yahoo! BOSS and then expose our new search engine.

    def search(request, timeout=2500, max_count=10):
        services = (ComputerAmazonSearch, AppleTwitterService, Web)
        return default_search(request, timeout, max_count, services)
    

    A Short Warning

    Please note that Yahoo! BOSS search results won't be retrieved successfully when you are testing your springsteen application locally. However, they will be correctly retrieved once you push your app to production. I'll look into some kind of patch for this, but no need to panic.
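
If you skipped the Amazon key, the Amazon class in this step will most likely raise an error as soon as views.py is imported, because it reads settings.AMAZON_ACCESS_KEY directly. One way around that is to build the services tuple conditionally. The sketch below uses stand-in objects instead of the real Django settings and Springsteen classes; FakeSettings, pick_services, and the placeholder service names are all hypothetical.

```python
class FakeSettings(object):
    """Hypothetical stand-in for django.conf.settings with no Amazon key."""
    BOSS_APP_ID = "abcdefghijlknop"


def pick_services(settings, amazon, twitter, web):
    """Return the services tuple, including Amazon only when a key is set."""
    services = []
    # getattr with a default avoids an AttributeError when the key is missing.
    if getattr(settings, 'AMAZON_ACCESS_KEY', None):
        services.append(amazon)
    services.extend([twitter, web])
    return tuple(services)


# Strings stand in for the service classes defined in views.py.
services = pick_services(FakeSettings, 'amazon', 'twitter', 'web')
```

In a real views.py you would pass the actual service classes, and the Amazon class would need the same getattr guard around its _access_key attribute.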

  11. At this point we have everything working correctly, but results are just stacked on top of each other. Sure, you might love having those Amazon affiliate links clustered at the top, but your users might not. Now's a nice time to dip our toes into relevancy.

    You want the most relevant results to bubble to the top (feel free to make a bubble sort pun), but naively stacking results from different services doesn't permit that unless all results from source A are more relevant than those from source B, all results from B are more relevant than those from source C and so on. Let's take a stab at a very simple relevancy algorithm to address these problems.

    You can think of two kinds of relevancy approaches:

    1. Scoring results on their individual merits. We might call this intrinsic relevance.
    2. Scoring results in regard to each other. We might call this contextual relevance.

    We're going to do a little bit of both here. First we're going to boost results which contain the query term in their title, and second we're going to punish the 2nd-Nth results from an already encountered domain.

    Place this code in views.py above the search function.

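    def ranking(query, results):
        query = query.lower()
        def rank(result):
            # Intrinsic relevance: boost results whose title contains the query.
            score = 0.0
            title = result['title'].lower()
            if query in title:
                score += 1.0
            return score
    
        scored = [(rank(x), x) for x in results]
        scored2 = []
        domains = {}
        for score, result in scored:
            # Contextual relevance: penalize repeat results from one domain.
            domain = result['url'].split('://', 1)[-1].split('/')[0]
            times_viewed = domains.get(domain, 0)
            new_score = score + times_viewed * -0.1
            scored2.append((new_score, result))
            domains[domain] = times_viewed + 1
    
        # Highest score first; sorting on the score alone keeps ties in their
        # original order and avoids comparing the result dicts.
        scored2.sort(key=lambda pair: pair[0], reverse=True)
        return [ x[1] for x in scored2 ]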
    

    And then update the search function to use this ranking function.

    def search(request, timeout=2500, max_count=10):
        services = (ComputerAmazonSearch, AppleTwitterService, Web)
        return default_search(request, timeout, max_count,
                              services, {}, ranking)
    

    Now our results are ranked using the above ranking function. This is a pretty basic approach to relevancy, but hopefully shows the basic concepts.
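
To see the two ideas (title boost and repeated-domain penalty) at work without deploying anything, here is a standalone sketch you can run in a plain Python 3 shell. It is a variant written for illustration, not Springsteen's own code: rank_results, its title_boost and repeat_penalty parameters, and the sample data are all made up here, and it assumes each result is a dict with 'title' and 'url' keys.

```python
from urllib.parse import urlparse


def rank_results(query, results, title_boost=1.0, repeat_penalty=0.1):
    """Boost title matches, penalize repeated domains, return best first."""
    query = query.lower()
    seen = {}
    scored = []
    for position, result in enumerate(results):
        # Intrinsic relevance: does the title contain the query?
        score = title_boost if query in result['title'].lower() else 0.0
        # Contextual relevance: each earlier hit from this domain costs a bit.
        domain = urlparse(result['url']).netloc
        times_seen = seen.get(domain, 0)
        score -= repeat_penalty * times_seen
        seen[domain] = times_seen + 1
        scored.append((score, position, result))
    # Sort by score descending; position breaks ties in original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [result for _score, _position, result in scored]


sample = [
    {'title': 'Fruit pie recipes', 'url': 'http://example.com/pie'},
    {'title': 'Apple MacBook review', 'url': 'http://example.org/macbook'},
    {'title': 'Apple iPod review', 'url': 'http://example.org/ipod'},
    {'title': 'Orchard news', 'url': 'https://example.net/news'},
]
ranked = rank_results('apple', sample)
```

With this sample, the two results with "Apple" in the title move to the top, and the second one from example.org lands just below the first because of the domain penalty.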

  12. Now we have our search engine up and running, and now is a good time to customize your site's templates.

    First make a templates directory within djangosearch, as well as a templates/springsteen directory and a few empty files.

    cd djangosearch
    mkdir templates templates/springsteen
    touch templates/base.html
    touch templates/springsteen/base.html
    

    Then let's edit the templates/base.html file.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html> <head>
    <title>{% block title %}FruitySearch{% endblock %}</title>
    <link rel="stylesheet" type="text/css" href="/static/css/reset.css">
    <link rel="stylesheet" type="text/css" href="/static/css/search.css">
    </head> <body>
    <div id="body">
    <div id="hd"><h1><a href="/">FruitySearch</a></h1></div>
    {% block body %}{% endblock %}
    <div id="ft"><p>A <a href="">Your-Name-Here</a> production, 2009.</p></div>
    </div> </body> </html>
    

    Next edit templates/springsteen/base.html (this one is pretty brief).

    {% extends "base.html" %}
    {% block body %}{% endblock %}
    

    As you continue customizing the appearance of your results, you'll probably want to override templates/springsteen/results.html, but for the time being it should be a reasonable default.

  13. Create some CSS to style the site.

    cd djangosearch
    mkdir static static/css
    

    The current base.html assumes you'll have reset.css and search.css files. Recently I tend to use YUI's reset.css, and just mashed together some custom stylings for search.css.

  14. Test everything out.

    cd djangosearch
    dev_appserver.py ./
    

    Assuming that worked, go ahead and push it to Google App Engine, and you're done.

Some Cautions

  1. You'll notice by default you won't be able to paginate past 3-4 pages. This is because of a safety mechanism in Springsteen that makes sense when you're dealing with a large number of sources, but is a bit annoying when you're really only dealing with one source. You can override this setting in local_settings.py by adding this line:

    # allow up to 10 pages
    SPRINGSTEEN_MAX_MATCHES = 10
    
  2. Also, at the moment this setup isn't using any caching, and as a result it simply will not behave efficiently at higher pages (displaying results 100-110, for example).

    It is possible to use memcache on GAE, and I'll write a patch which enables Springsteen to take advantage of that functionality in the next day or two.
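
To make the idea concrete, here is a rough sketch of what caching a page of results could look like. It is an illustration under assumptions, not the patch itself: DictCache is a hypothetical in-memory stand-in that mimics the get/set shape of App Engine's memcache module, and cached_fetch and its fetch argument are names invented here, not part of Springsteen.

```python
import hashlib


class DictCache(object):
    """Hypothetical in-memory stand-in with memcache-style get and set."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, time=0):
        # A real cache would expire the entry after `time` seconds.
        self._data[key] = value


def cached_fetch(cache, query, start, count, fetch):
    """Return results for one page, calling fetch only on a cache miss."""
    # Hash the query so the key stays short and free of odd characters.
    digest = hashlib.md5(query.encode('utf-8')).hexdigest()
    key = 'results:%s:%d:%d' % (digest, start, count)
    results = cache.get(key)
    if results is None:
        results = fetch(query, start, count)
        cache.set(key, results, time=300)
    return results
```

The point of keying on query, start, and count is that a request for results 100-110 is answered from the cache the second time, instead of walking through every earlier page again.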

Let me know if you have any problems or questions about deploying Springsteen on Google App Engine! I think it's pretty amazing how far the landscape has evolved to be able to roll out a product like this at no cost and essentially no effort.

I guess it's our responsibility to take advantage of it.


I'll post a reusable package in a day or two once I update the caching mechanism to be smart enough to play nicely with Google App Engine. (If I post it now people will use it and then complain that it behaves as advertised...)

All Rights Reserved, Will Larson 2007 - 2014.