Irrational Exuberance!

IE's New Infrastructure and Writing Workflow

March 28, 2011. Filed under sisyphus, redis, git

Following up on the discussion of design changes to Irrational Exuberance, this time I've put together a look at the implementation of Sisyphus. I know you really, really don't care, so I promise to stop talking about my blog soon and start blogging instead.

[TOC]

Implementation & Infrastructure

Django was the first web framework I worked with--and remains my favorite although I've never used it professionally (most of my work has ended up more infrastructure focused)--so I decided to stick with it for this project as well. (I also thought about using Flask, which seems like an interesting project, but I want to keep using this blog for a couple of years and would have undoubtedly butchered my first app using it.)

At Digg we use a lot of Redis, and I've grown to have a healthy appreciation for it. In particular, having easy access to sorted sets allows for a lot of interesting experiments. As such, rather than sticking with a standard PostgreSQL and Memcached Django stack, I replaced both with Redis. This was two parts whimsy and three parts specific ideas I wanted to play with (more on that later).

I stayed with the standard Django, mod_wsgi, Apache2 deploy, with Nginx serving static media and acting as a reverse proxy for Apache. It's served me well thus far, and it let me crib off of my old Django and Ubuntu deployment post.

Whoosh powers search, which has proven a much simpler approach (for very limited requirements) compared to SOLR, which was a bit of a memory hog considering it was serving a couple of queries a day.
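For a sense of how limited those requirements are, the entirety of what I need from Whoosh looks roughly like the following sketch (the schema and field names are my illustration, not Sisyphus's actual code):

:::python
import os

from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

# hypothetical schema; the real fields may differ
schema = Schema(slug=ID(stored=True, unique=True), body=TEXT)
if not os.path.exists("search_index"):
    os.mkdir("search_index")
ix = index.create_in("search_index", schema)

# index a page
writer = ix.writer()
writer.add_document(slug=u"some-page", body=u"Redis, Django, sorted sets and so on.")
writer.commit()

# query it back
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse(u"redis")
    for hit in searcher.search(query, limit=10):
        print(hit['slug'])

Everything lives in a directory of flat files, which for a couple of queries a day is exactly the right amount of infrastructure.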

Popular Stories Module

The goal of the Popular module is to show the best content based on the number of pageviews. It does this using a couple of Redis sorted sets.

:::python
# one zset of page slugs per tag, plus a global zset across all pages
TAG_PAGES_ZSET_BY_TREND = "tag_pages_by_trend.%s"
PAGE_ZSET_BY_TREND = "pages_by_trend"

When a page is first created it receives an initial score equal to the current timestamp, and each time a page is viewed, its score in those sorted sets is incremented.
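The creation half isn't shown here, but it amounts to a zadd with the current timestamp as the initial score; a sketch (note that redis-py's zadd argument order has shifted between versions):

:::python
import time

def create_page(page, cli):
    "Seed a new page's trend scores with the current timestamp."
    slug = page['slug']
    score = time.time()
    # redis-py 2.x argument order; 3.x wants a {member: score} mapping
    cli.zadd(PAGE_ZSET_BY_TREND, slug, score)
    for tag_slug in page['tags']:
        cli.zadd(TAG_PAGES_ZSET_BY_TREND % tag_slug, slug, score)

Pageviews are then tracked with: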

:::python
import redis

def track(request, page, cli=None):
    "Log pageview into analytics."
    if cli is None:
        cli = redis.Redis()  # fall back to a default connection
    slug = page['slug']
    # PAGEVIEW_BONUS is the flat score added per view, defined elsewhere
    cli.zincrby(PAGE_ZSET_BY_TREND, slug, PAGEVIEW_BONUS)
    for tag_slug in page['tags']:
        cli.zincrby(TAG_PAGES_ZSET_BY_TREND % tag_slug, slug, PAGEVIEW_BONUS)

zincrby is an O(log(N)) operation, but in practice both the number of articles in a tag and the total number of pages are going to be extremely low (I have fewer than four hundred). In practice^2, when operating on an in-memory store, even relatively inefficient operations usually work out.

Finding the most popular stories to populate the module is as simple as:

:::python
# the four most popular pages across the whole site
slugs = cli.zrevrange(PAGE_ZSET_BY_TREND, 0, 3)
# or the four most popular pages within a single tag
slugs = cli.zrevrange(TAG_PAGES_ZSET_BY_TREND % "django", 0, 3)

Where zrevrange is O(log(N)+M), with N being the number of pages and M the number of pages retrieved. (Asking people to implement a sorted set can be a fascinating interview question.)
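If that sounds odd, a deliberately naive toy version makes the interface concrete; Redis's real implementation pairs a hash with a skiplist so both operations stay fast, but the contract is just this:

:::python
class ToySortedSet(object):
    "Naive stand-in for a Redis sorted set: cheap writes, O(N log N) reads."

    def __init__(self):
        self.scores = {}

    def zincrby(self, member, amount=1.0):
        self.scores[member] = self.scores.get(member, 0.0) + amount

    def zrevrange(self, start, stop):
        # like Redis, the stop index is inclusive
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

The interesting interview follow-up is making both operations fast at the same time, which is exactly where skiplists (or balanced trees) come in.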

Similar Stories Module

Similarly to the Popular module, the Similar Stories module relies on Redis sorted sets for its implementation, but this time it relies a bit more on the set part. The goal of this module is to show pages closely related to the page you're currently looking at.

It does this by taking the union of all pages in the same tags as the current page, which hands bonus points to pages sharing multiple tags with it (their scores are summed across each tag's sorted set) and to pages which are popular on the site. More concisely:

:::python
# assumed key template, following the naming of the other zsets
SIMILAR_PAGES_BY_TREND = "similar_pages_by_trend.%s"

sim_key = SIMILAR_PAGES_BY_TREND % slug
tag_keys = [TAG_PAGES_ZSET_BY_TREND % x for x in page['tags']]
# zunionstore sums scores by default, so pages appearing in several
# of this page's tags bubble toward the top
cli.zunionstore(sim_key, tag_keys)
# a page is always similar to itself, but rather boringly so
cli.zrem(sim_key, slug)
similar_slugs = cli.zrevrange(sim_key, 0, 3)

For pages with a couple of tags, I've been fairly impressed by how well this extremely simple approach works. For pages with only one tag it still works rather well. It does completely break down for content without any tags, although I suppose I could munge something together using search.

Writing Workflow

One of the weirdest things about Lifeflow was its publishing workflow. I had my heart in the right place, but it was a bizarre Ajax-y UI that looked like Halloween had erupted into a CSS file. Writing in a <textarea> isn't a whole lot of fun, so I mostly wrote in Emacs and pasted the finished version in afterwards.

At the end I was left with a Markdown file somewhere on my machine, and an updated version in Postgres. It was a pain to sync data into a local development instance, which in turn meant I spent a lot of time editing post-publish (for some reason it's just easier to read for editing purposes on the blog itself than in Emacs).

My new approach is to store the pages in Git and load them via three Django management commands:

python manage.py update_page ../some_file.html
python manage.py update_markdown_page ../some_file.markdown
python manage.py sync_sisyphus ../some_folder

The first two simply load pages in HTML and Markdown format respectively (I'm using Python-Markdown to render the latter), but the third takes a folder in this format:

some_folder
- draft
- edit
- publish
- static

- draft: files here are ignored entirely
- edit: pages are added to the site in editing mode, where they aren't listed in any storylists or RSS but can still be accessed directly via their slugs
- publish: stories are added to storylists, RSS, etc.
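I won't walk through the real command, but its shape is roughly the sketch below; store_page is a hypothetical stand-in for Sisyphus's actual page-loading logic:

:::python
import os

import markdown
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    "Sketch of a folder-syncing management command."
    args = "<folder>"

    def handle(self, folder, **options):
        # draft/ is skipped entirely; edit/ loads unlisted, publish/ loads listed
        for state, listed in (("edit", False), ("publish", True)):
            path = os.path.join(folder, state)
            for filename in sorted(os.listdir(path)):
                text = open(os.path.join(path, filename)).read()
                if filename.endswith(".markdown"):
                    text = markdown.markdown(text)
                # store_page is hypothetical: it would write the rendered
                # page into Redis and flag storylist/RSS visibility
                store_page(filename, text, listed=listed)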

If you store your pages in a Git repository with these folders, this makes it quite simple to write and publish at your leisure to both local development instances and "production" deployments.

Static media is handled by adding it into the static folder in the story repository and then symlinking that folder into the directory Nginx serves (or django.contrib.staticfiles's STATIC_ROOT). It's a bit janky, but it does make keeping static media in sync with blog entries a breeze.

There are a couple more things in the pipeline before the first version of Sisyphus is complete; in particular, I'm looking to experiment with a handrolled analytics system which uses referrers and pageviews of each page to help give readers some context (where was this blog entry popular? how popular is this entry relative to other stuff on this blog? how popular is this entry in absolute terms, am I the only guy who has read this damn thing?).
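The referrer half would presumably look a lot like the pageview tracking above; something along these lines, with a made-up key name (and Python 2's urlparse, to match the rest of the code):

:::python
import urlparse

REFERRER_ZSET = "referrers_by_page.%s"  # made-up key name

def track_referrer(request, slug, cli):
    "Bump the referring domain's score for this page."
    referrer = request.META.get('HTTP_REFERER', '')
    domain = urlparse.urlparse(referrer).netloc
    if domain:
        cli.zincrby(REFERRER_ZSET % slug, domain, 1)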

Thanks for reading. Moving on to less egomaniacal content for the next posts!