Following up on the discussion on design changes to Irrational Exuberance,
this time I’ve put together a look at the implementation of Sisyphus. I know you
really really don’t care, so I promise to stop talking about my blog soon and
start blogging instead.
Implementation & Infrastructure
Django was the first web framework I worked
with–and remains my favorite although I’ve never used it professionally
(most of my work has ended up more infrastructure focused)–so I decided
to stick with it for this project as well. (I also thought about
using Flask, which seems like an interesting
project, but I want to keep using this blog for a couple of years and
would have undoubtedly butchered my first app using it.)
At Digg we use a lot of Redis,
and I’ve grown to have a healthy appreciation for it. In particular, having
easy access to sorted sets allows
for a lot of interesting experiments.
As such, rather than sticking with a standard PostgreSQL
and Memcached Django stack,
I replaced both with Redis. This was two parts whimsy and three parts specific
ideas I wanted to play with (more on that later).
I stayed with the standard Django, mod_wsgi,
Apache2 deploy, with Nginx
serving static media and acting as a reverse proxy
for Apache. It’s served me well thus far, and it let me crib
off of my old Django and Ubuntu deployment post.
Whoosh powers search,
which has proven a much simpler approach (for very limited requirements)
compared to SOLR, which was a bit of
a memory hog considering it was serving a couple of queries a day.
Popular Stories Module
The goal of the Popular module is to show the best content
based on the number of pageviews. It does this using a couple
of Redis sorted sets.
TAG_PAGES_ZSET_BY_TREND = "tag_pages_by_trend.%s"
PAGE_ZSET_BY_TREND = "pages_by_trend"
When a page is first created it receives an initial
score equal to the current timestamp, and as every
page is viewed, its score in those sorted sets is incremented.
def track(request, page, cli=None):
    "Log pageview into analytics."
    slug = page['slug']
    cli.zincrby(PAGE_ZSET_BY_TREND, slug, PAGEVIEW_BONUS)
    for tag_slug in page['tags']:
        cli.zincrby(TAG_PAGES_ZSET_BY_TREND % tag_slug, slug, PAGEVIEW_BONUS)
zincrby is an O(log(N)) operation, where N is the number of entries in the
sorted set, but in practice both the number of articles in a tag and the total
number of pages is going to be extremely low (I have less than four hundred). In
practice^2, when operating on an in-memory store, even relatively inefficient
operations usually work out.
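To make the scoring concrete, here is a hypothetical, Redis-free sketch of the scheme described above, where a page's initial score is its creation timestamp and each view adds a flat bonus (the real value of PAGEVIEW_BONUS isn't shown in the post, so the number here is made up):

```python
import time

PAGEVIEW_BONUS = 10  # illustrative value; the real constant isn't shown above

scores = {}  # stand-in for the "pages_by_trend" sorted set

def create_page(slug, now=None):
    # a new page starts at the current timestamp, so recency wins by default
    scores[slug] = now if now is not None else time.time()

def view_page(slug):
    # every pageview nudges the page past slightly newer, less-read pages
    scores[slug] += PAGEVIEW_BONUS

create_page("old-post", now=1000)
create_page("new-post", now=1050)
for _ in range(6):
    view_page("old-post")
# old-post: 1000 + 6 * 10 = 1060, which now outranks new-post's 1050
```

The nice property is that there is no decay job to run: newer pages start ahead by construction, and older pages only stay on top while they keep earning views.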
Finding the most popular stories to populate the module is as simple as:
tag = "django"
key = PAGE_ZSET_BY_TREND # all pages key
key = TAG_PAGES_ZSET_BY_TREND % tag # pages in a tag
slugs = cli.zrevrange(key, 0, 3)
zrevrange is an O(log(N)+M) operation, N being the number of
pages in the set and M being the number of pages retrieved. (Asking people to
implement a sorted set can be a fascinating interview question.)
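As a hint at why that interview question is interesting, here is a deliberately naive, stdlib-only sorted set supporting the two operations used above. Everything here is illustrative; Redis actually pairs a hash table with a skiplist so that both operations stay cheap:

```python
class TinySortedSet:
    """Naive sorted set: a dict of member -> score.

    zincrby is O(1) here, but zrevrange pays a full O(N log N) sort on
    every call; Redis's skiplist keeps both near O(log N) instead.
    """

    def __init__(self):
        self.scores = {}

    def zincrby(self, member, amount=1):
        self.scores[member] = self.scores.get(member, 0) + amount
        return self.scores[member]

    def zrevrange(self, start, stop):
        # highest-scored members first, inclusive indices like Redis
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

zset = TinySortedSet()
for slug, views in [("redis-post", 5), ("django-post", 9), ("emacs-post", 2)]:
    for _ in range(views):
        zset.zincrby(slug)
print(zset.zrevrange(0, 1))  # ['django-post', 'redis-post']
```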
Similar Stories Module
Similarly to the Popular module, the Similar Stories module relies on
Redis sorted sets for its implementation, but this time it relies a bit
more on the set part. The goal of this module is to show pages closely
related to the page you’re currently looking at.
It does this by taking the union of all pages in the same tags
as the current page, with bonus points for pages sharing multiple tags
with it and for pages which are popular on the site. More concisely:
sim_key = SIMILAR_PAGES_BY_TREND % slug
tag_keys = [ TAG_PAGES_ZSET_BY_TREND % x for x in page['tags'] ]
# union the tag sets; scores sum, so pages sharing several tags float up
cli.zunionstore(sim_key, tag_keys)
# a page is always similar to itself, but rather boringly so
cli.zrem(sim_key, slug)
similar_slugs = cli.zrevrange(sim_key, 0, 3)
For pages with a couple of tags, I’ve been fairly impressed by
how well this extremely simple approach works. For pages with
only one tag it still works rather well. It does completely
break down for content with no tags, although I suppose I could
munge something together using search.
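A hypothetical, stdlib-only sketch of the same idea, using a Counter in place of Redis's summed union (the function name, the popularity weight, and the data shapes are all illustrative, not Sisyphus's actual code):

```python
from collections import Counter

def similar_pages(current_slug, pages_by_tag, popularity, limit=3):
    """Union the tag buckets: +1 per shared tag, plus a small
    popularity term so well-read pages win ties."""
    scores = Counter()
    for tag, slugs in pages_by_tag.items():
        if current_slug in slugs:
            scores.update(slugs)  # +1 for every page sharing this tag
    for slug in scores:
        scores[slug] += 0.001 * popularity.get(slug, 0)
    scores.pop(current_slug, None)  # always similar to itself, boringly so
    return [slug for slug, _ in scores.most_common(limit)]

pages_by_tag = {
    "redis": {"a", "b", "c"},
    "django": {"a", "b"},
    "emacs": {"d"},
}
print(similar_pages("a", pages_by_tag, {"b": 10, "c": 50}))  # ['b', 'c']
```

Here "b" shares two tags with "a" and so outranks "c" despite "c" being five times more popular, which matches the intent: tag overlap dominates, popularity breaks ties.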
Publishing Workflow
One of the weirdest things about Lifeflow
was its publishing workflow. I had my heart in the right place,
but it was a bizarre Ajax-y UI that looked like Halloween had
erupted into a CSS file. Writing in a
<textarea> isn’t a whole
lot of fun, so I mostly wrote in Emacs
and pasted the final version in afterwards.
At the end I was left with a Markdown file somewhere on my machine,
and an updated version in Postgres. It was a pain to sync data into
a local development instance, which in turn meant I spent a lot of time
editing post-publish (for some reason it’s just easier to read for editing purposes
on the blog itself than in Emacs).
My new approach is to store the pages in Git and load them via three
Django management commands:
python manage.py update_page ../some_file.html
python manage.py update_markdown_page ../some_file.markdown
python manage.py sync_sisyphus ../some_folder
The first two simply load pages in HTML and Markdown (which I’m
rendering with Python-Markdown),
but the third takes a folder split into subfolders:
All files in
draft are ignored; all pages in
edit are added to the site,
but in editing mode, where they are not listed in any storylists or RSS but
can be accessed directly via their slugs; and all remaining stories are fully
published: added to storylists, RSS, etc.
If you store your pages in a Git repository with these folders, this makes
it quite simple to write and publish at your leisure to both local development
instances as well as “production” deployments.
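A stripped-down sketch of what the folder dispatch in sync_sisyphus might look like. The publish_page callback and the treatment of every non-draft, non-edit folder as fully published are assumptions of this sketch, not the actual implementation:

```python
from pathlib import Path

def sync_folder(root, publish_page):
    """Walk a story repository and publish each page according to its folder.

    Only draft/ and edit/ are named in the post; everything else is
    assumed here to be fully published.
    """
    root = Path(root)
    for path in sorted(root.rglob("*.markdown")):
        folder = path.parent.name
        if folder == "draft":
            continue  # drafts never leave the repository
        # edit-mode pages are reachable by slug but kept out of storylists/RSS
        publish_page(path, editing=(folder == "edit"))
```

A real command would also handle HTML pages and removals; this only shows the core dispatch, which is what makes the Git-backed workflow repeatable across local and production instances.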
Static media is handled by adding it into the
static folder in the
story repository and then symlinking that folder into the Nginx and/or
Apache media directories.
It’s a bit janky, but it does make keeping static media in sync with blog
entries a breeze.
There are a couple more things in the pipeline before the first version of
Sisyphus is complete. In particular, I’m looking to experiment with a handrolled
analytics system which uses referrers and pageviews of each page to help
give readers some context (where was this blog entry popular?
how popular is this entry relative to other stuff on this blog?
how popular is this entry in absolute terms; am I the only guy
who has read this damn thing?)
Thanks for reading. Moving on to less egomaniacal content for the
foreseeable future.