March 29, 2011.
Following up on the discussion on design changes to Irrational Exuberance, this time I've put together a look at the implementation of Sisyphus. I know you really really don't care, so I promise to stop talking about my blog soon and start blogging instead.
Django was the first web framework I worked with--and remains my favorite although I've never used it professionally (most of my work has ended up more infrastructure focused)--so I decided to stick with it for this project as well. (I also thought about using Flask, which seems like an interesting project, but I want to keep using this blog for a couple of years and would have undoubtedly butchered my first app using it.)
At Digg we use a lot of Redis, and I've grown to have a healthy appreciation for it. In particular, having easy access to sorted sets allows for a lot of interesting experiments. As such, rather than sticking with a standard PostgreSQL and Memcached Django stack, I replaced both with Redis. This was two parts whimsy and three parts specific ideas I wanted to play with (more on that later).
I stayed with the standard Django, mod_wsgi, Apache2 deploy, with Nginx serving static media and acting as a reverse proxy for Apache. It's served me well thus far, and it let me crib off of my old Django and Ubuntu deployment post.
The goal of the Popular module is to show the best content based on number of pageviews. It does this using a couple of Redis sorted sets.
    TAG_PAGES_ZSET_BY_TREND = "tag_pages_by_trend.%s"
    PAGE_ZSET_BY_TREND = "pages_by_trend"
When a page is first created it receives an initial score equal to the current timestamp, and as every page is viewed, its score in those sorted sets is incremented.
    def track(request, page, cli=None):
        "Log pageview into analytics."
        slug = page['slug']
        cli.zincrby(PAGE_ZSET_BY_TREND, slug, PAGEVIEW_BONUS)
        for tag_slug in page['tags']:
            cli.zincrby(TAG_PAGES_ZSET_BY_TREND % tag_slug, slug, PAGEVIEW_BONUS)
zincrby is an O(log(N)) operation, but in practice both the number of articles in a tag and the total number of pages are going to be extremely low (I have fewer than four hundred). In practice^2, when operating on an in-memory store, even relatively inefficient operations usually work out.
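To get a feel for how the timestamp seed and the per-view bonus interact, here is a minimal sketch of the arithmetic. The value of PAGEVIEW_BONUS below is a made-up assumption for illustration (this post doesn't state the real one), and views_to_catch_up is a hypothetical helper, not part of Sisyphus.

```python
import math

# Hypothetical value: treat one pageview as worth an hour of recency.
PAGEVIEW_BONUS = 60 * 60


def initial_score(created_at):
    "A new page starts out with its creation timestamp as its score."
    return created_at


def views_to_catch_up(old_created_at, new_created_at, bonus=PAGEVIEW_BONUS):
    "Views an older page needs to score at least as high as an unviewed newer page."
    gap = initial_score(new_created_at) - initial_score(old_created_at)
    if gap <= 0:
        return 0
    return math.ceil(gap / bonus)


# A page published one day (86400 seconds) before a new one needs 24 views
# at this bonus to draw level with the newcomer.
print(views_to_catch_up(0, 86400))
```

Under that assumption the bonus is effectively an exchange rate between pageviews and freshness: a larger bonus favors well-read older pages, a smaller one favors recent ones.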
Finding the most popular stories to populate the module is as simple as:
    tag = "django"
    key = PAGE_ZSET_BY_TREND  # all pages key
    key = TAG_PAGES_ZSET_BY_TREND % tag  # pages in a tag
    slugs = cli.zrevrange(key, 0, 3)
zrevrange is O(log(N)+M), with N being the number of elements in the sorted set and M the number of pages retrieved. (Asking people to implement a sorted set can be a fascinating interview question.)
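For the curious, here is one naive way to answer that interview question in pure Python. This is only a sketch for illustration: the class and its internals are my own invention, and it is not how Redis does it (Redis pairs a skip list with a hash table). Updates here cost O(N) because of the list insertion, and only non-negative indexes are handled.

```python
import bisect


class NaiveSortedSet:
    "Toy sorted set: a dict of scores plus a sorted list of (score, member)."

    def __init__(self):
        self._scores = {}   # member -> current score
        self._ordered = []  # (score, member) tuples, kept in ascending order

    def zincrby(self, member, amount):
        "Add amount to member's score, creating the member if needed."
        old = self._scores.get(member)
        if old is not None:
            # Find and drop the stale entry before re-inserting.
            idx = bisect.bisect_left(self._ordered, (old, member))
            del self._ordered[idx]
        new = (old or 0) + amount
        self._scores[member] = new
        bisect.insort(self._ordered, (new, member))
        return new

    def zrevrange(self, start, stop):
        "Members from highest to lowest score; stop is inclusive, as in Redis."
        highest_first = self._ordered[::-1]
        return [member for _, member in highest_first[start:stop + 1]]


pages = NaiveSortedSet()
pages.zincrby("a", 1)
pages.zincrby("b", 5)
pages.zincrby("c", 3)
pages.zincrby("a", 10)  # "a" now has a score of 11
print(pages.zrevrange(0, 1))
```

Note the inclusive stop index: asking for the range 0 through 3, as in the snippet above, returns up to four slugs rather than three.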
Similarly to the Popular module, the Similar Stories module relies on Redis sorted sets for its implementation, but this time it relies a bit more on the set part. The goal of this module is to show pages closely related to the page you're currently looking at.
It does this by taking the union of all pages in the same tags as the current page, with bonus points to pages sharing multiple tags and pages which are popular on the site. More concisely, it:
    sim_key = SIMILAR_PAGES_BY_TREND % slug
    tag_keys = [ TAG_PAGES_ZSET_BY_TREND % x for x in page['tags'] ]
    cli.zunionstore(sim_key, tag_keys)
    # a page is always similar to itself, but rather boringly so
    cli.zrem(sim_key, slug)
    similar_slugs = cli.zrevrange(sim_key, 0, 3)
For pages with a couple of tags, I've been fairly impressed by how well this extremely simple approach works. For pages with only one tag it still works rather well. It does completely break down for content without tags, although I suppose I could munge something together using search.
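The "bonus points" fall out of how the union is aggregated: by default ZUNIONSTORE sums a member's scores across the input sets, so a page that appears in two of the current page's tags gets roughly double the score. Here is a rough in-memory sketch of that behavior using plain dicts instead of Redis; the slugs, scores and helper names are all invented for illustration.

```python
from collections import defaultdict


def zunion_sum(*sorted_sets):
    "Union of member -> score dicts, summing scores for shared members."
    combined = defaultdict(float)
    for zset in sorted_sets:
        for member, score in zset.items():
            combined[member] += score
    return dict(combined)


def similar_pages(current_slug, tag_sets, limit=4):
    "Rank pages sharing tags with current_slug, highest combined score first."
    combined = zunion_sum(*tag_sets)
    # A page is always similar to itself, so drop it from the results.
    combined.pop(current_slug, None)
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:limit]


# Hypothetical per-tag sorted sets (slug -> trend score).
django_pages = {
    "sisyphus-impl": 1301400000,
    "django-deploy": 1250000000,
    "old-django-tip": 1200000000,
}
redis_pages = {
    "sisyphus-impl": 1301400000,
    "redis-tricks": 1300000000,
    "django-deploy": 1250000000,
}

print(similar_pages("sisyphus-impl", [django_pages, redis_pages]))
```

Because the scores are seeded with timestamps, sharing a second tag adds far more than any realistic number of pageviews would, so in this sketch the page sharing both tags ranks first even though it is older than the others.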
One of the weirdest things about Lifeflow was its publishing workflow. I had my heart in the right place, but it was a bizarre Ajax-y UI that looked like Halloween had erupted into a CSS file. Writing in a <textarea> isn't a whole lot of fun, so I mostly wrote in Emacs and pasted the version in afterwards.
At the end I was left with a Markdown file somewhere on my machine, and an updated version in Postgres. It was a pain to sync data into a local development instance, which in turn meant I spent a lot of time editing post-publish (for some reason it's just easier to read for editing purposes on the blog itself than in Emacs).
My new approach is to store the pages in Git and load them via three Django management commands:
    python manage.py update_page ../some_file.html
    python manage.py update_markdown_page ../some_file.markdown
    python manage.py sync_sisyphus ../some_folder
The first two simply load pages in HTML and Markdown (which I'm using Python-Markdown to render) format respectively, but the third takes a folder in this format:
    some_folder
      - draft
      - edit
      - publish
      - static
All files in draft are ignored, all pages in edit are added to the site but in editing mode, where they are not listed in any storylists or RSS but can be accessed directly via their slugs, and all stories in publish are added to storylists, RSS, etc.
If you store your pages in a Git repository with these folders, this makes it quite simple to write and publish at your leisure to both local development instances and "production" deployments.
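The folder convention boils down to a small classification step. Below is a hedged sketch of what that step might look like, not the actual sync_sisyphus command: plan_sync, the FORMATS mapping and the tuple it returns are hypothetical names of mine, and I'm assuming pages are told apart by the .html and .markdown extensions used in the commands above.

```python
from pathlib import Path

# Hypothetical mapping from file extension to page format.
FORMATS = {".html": "html", ".markdown": "markdown"}


def plan_sync(root):
    "Return (path, format, published) tuples for pages under edit/ and publish/."
    root = Path(root)
    plan = []
    # draft/ is never visited, so its files are ignored; static/ holds media,
    # not pages, and is left to the web server.
    for folder, published in (("edit", False), ("publish", True)):
        directory = root / folder
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            fmt = FORMATS.get(path.suffix)
            if path.is_file() and fmt:
                plan.append((path, fmt, published))
    return plan
```

A real command would then hand each entry to the matching loader and flag the pages from edit as unlisted; this sketch stops at deciding what to load.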
Static media is handled by adding it into the
static folder in the
story repository and then symlinking that folder into the Nginx and/or
It's a bit jank, but it does make keeping static media in-sync with blog
entries a breeze.
There are a couple more things in the pipeline before the first version of Sisyphus is complete. In particular, I'm looking to experiment with a hand-rolled analytics system which uses referrers and pageviews of each page to help give readers some context (where was this blog entry popular? how popular is this entry relative to other stuff on this blog? how popular is this entry in absolute terms, am I the only guy who has read this damn thing?).
Thanks for reading. Moving on to less egomaniacal content for the next posts!