Catching Lessons From Spam

Published on January 22, 2008. lifeflow (20), spam (1)

As far as I can tell, LifeFlow has yet to have any comment spam successfully breach its prevention mechanisms¹. The core of the prevention is a reverse CAPTCHA, a technique that has received much higher quality coverage elsewhere than I can give it here, so instead let's just look at some of the unintended bits of wisdom and failure gleaned from caught spam.

Spammers Use Unicode?

This is from an old error report, from before I updated the Markdown implementation that LifeFlow uses to be Unicode compatible.

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/django/core/handlers/base.py", line 81, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.5/site-packages/lifeflow/views.py", line 56, in comments
    form.is_valid()
  File "/usr/lib/python2.5/site-packages/django/newforms/forms.py", line 95, in is_valid
    return self.is_bound and not bool(self.errors)
  File "/usr/lib/python2.5/site-packages/django/newforms/forms.py", line 86, in _get_errors
    self.full_clean()
  File "/usr/lib/python2.5/site-packages/django/newforms/forms.py", line 188, in full_clean
    value = getattr(self, 'clean_%s' % name)()
  File "/usr/lib/python2.5/site-packages/lifeflow/forms.py", line 45, in clean_body
    self.cleaned_data['rendered'] = unicode(markdownpp.markdown(escaped))
  File "/usr/lib/python2.5/site-packages/lifeflow/markdownpp.py", line 1472, in markdown
    return str(Markdown(text))
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 1517: ordinal not in range(128)

The message that triggered it was just a huge list of links that had no need for Unicode, but apparently the scripter hadn't thought that any comment implementation would be so mediocre as to fail on a few inserted Unicode characters.
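The failing frame is the `str(...)` call in markdownpp.py: in Python 2, converting text that contains non-ASCII characters to `str` implicitly encodes it with the ASCII codec. A minimal sketch in modern Python, where the encoding has to be requested explicitly, reproduces the same error. The `comment` string and the `encode_ascii` helper are invented for illustration and are not part of LifeFlow.

```python
# Hypothetical spam body: plain ASCII links plus one stray U+FFFD character.
comment = "http://example.com/a http://example.com/b \ufffd"


def encode_ascii(text):
    """Encode text as ASCII, which Python 2's str() did implicitly for unicode input."""
    return text.encode("ascii")


try:
    encode_ascii(comment)
    error = None
except UnicodeEncodeError as exc:
    # Same exception type and reason as in the error report above.
    error = exc

print(type(error).__name__)
print(error.reason)
```

Keeping the rendered text as Unicode end to end, rather than forcing it through an ASCII conversion, avoids the failure.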

Poor Implementations

Here is another great error message triggered by spam in the comment system:

  File "/usr/lib/python2.5/site-packages/lifeflow/markdownpp.py", line 477, in handleMatch
    return doc.createTextNode(place_holder)
  File "/usr/lib/python2.5/site-packages/lifeflow/markdownpp.py", line 117, in createTextNode
    return TextNode(text)
  File "/usr/lib/python2.5/site-packages/lifeflow/markdownpp.py", line 182, in __init__
    self.attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}
  File "re.py", line 180, in compile
    return _compile(pattern, flags)
RuntimeError: maximum recursion depth exceeded

The LifeFlow comment system expects Markdown-formatted text, and given extremely poor input, like a list of hundreds of spam links, the rendering process can go to hell. It's amazing that the spam comment is so atrocious that it literally kills the Markdown library by pushing it to the maximum recursion depth. It's an interesting case: a legitimate user has no reason to push the system this far, and so its limitations turn into an unexpected boon.
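The traceback doesn't show which part of markdownpp recurses, so the following is only a generic sketch of the failure mode, not the library's code: a toy renderer (the hypothetical `render_links`) that handles one link per call, so its recursion depth grows with the size of the input.

```python
import sys


def render_links(links):
    """Toy recursive renderer: handles one link, then recurses on the rest.

    This is NOT the markdownpp code, only an illustration of recursion
    whose depth grows with the number of links in the input.
    """
    if not links:
        return ""
    head, rest = links[0], links[1:]
    return '<a href="%s">%s</a>' % (head, head) + render_links(rest)


def try_render(links):
    """Return the rendered HTML, or None if the recursion limit is hit."""
    try:
        return render_links(links)
    except RecursionError:  # a subclass of RuntimeError in modern Python
        return None


# A short, human-sized comment renders fine.
small = ["http://example.com/%d" % i for i in range(3)]
# A spam-sized comment with more links than the interpreter's recursion limit.
huge = ["http://example.com/%d" % i for i in range(sys.getrecursionlimit() + 100)]
```

With these definitions, `try_render(small)` returns HTML while `try_render(huge)` returns `None`: the human-sized input never gets near the limit, and the spam-sized one cannot get past it.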

Being Inhuman

Here is what happens when you try to directly navigate to the comment creation page without naturally getting there by following a real link included in a real entry:

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/django/core/handlers/base.py", line 81, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.5/site-packages/lifeflow/views.py", line 43, in comments
    id = int(request.POST['entry_id'])
  File "/usr/lib/python2.5/site-packages/django/utils/datastructures.py", line 189, in __getitem__
    raise MultiValueDictKeyError, "Key %r not found in %r" % (key, self)
MultiValueDictKeyError: "Key 'entry_id' not found in <QueryDict: {}>"

This means that a submission has to follow the same process that a human would: read an entry, click on the reply link, and fill out the form in a timely manner. Otherwise the view won't have the information it needs to process the comment.

Again, you could make this process smarter and try to predict what entry the user was looking at, or ask them to select the entry they want to comment upon, but that isn't a situation that can occur for a human user of the system. This is another "worse is better" situation, although not in the sense it is usually intended.
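If the error reports themselves ever became a nuisance, the same rejection could be made explicit without making the view any smarter about what the bot wanted. This is a minimal sketch, assuming a hypothetical helper named `parse_entry_id` that is not part of LifeFlow; it takes any dict-like form data, such as Django's `request.POST`.

```python
def parse_entry_id(post):
    """Return the entry id from submitted form data, or None if absent or malformed.

    `post` is any dict-like mapping of field names to strings; in a Django
    view this would be request.POST.
    """
    raw = post.get("entry_id")
    if raw is None:
        # No entry_id at all: the request did not come from a real reply form.
        return None
    try:
        return int(raw)
    except ValueError:
        # entry_id present but not a number: treat it the same way.
        return None
```

A view could call this first and return an error response whenever it gets `None`, still without trying to guess which entry was intended.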

Writing Real Stories

In the end I think the valuable lesson here is that we need to focus our time writing real stories about what real users want to do. Sometimes we spend a lot of time working on hypothetical corner cases that add a lot of complexity to our programs--and consume a lot of time--but don't actually improve the quality of the system for actual users.

The spam here is a bit of an unusual example, as it's unlikely that the negative consequences of a better implementation would be as clear as they were in this case², but I think it illustrates an interesting point about unintended consequences from implementing stories that don't matter to legitimate users.

¹ On the other hand, it has also thwarted a handful of well-intentioned and real commenters using Internet Explorer 6. It's definitely working with Camino, Safari, Firefox and Internet Explorer 7 now, and it may be working with IE6 as well, but I haven't been able to test that.
² Admittedly they aren't even that clear here, since the quantity of spam hasn't increased since the quality of the comment implementation was improved. What was really occurring is that there was an informal layer of spam disruption (aka the bad implementation) operating redundantly on top of the correctly functioning spam disruption layer (the reverse CAPTCHAs).