Taming AuditTrail Proliferation

October 16, 2008. Filed under django.

I've spent a bit of time with AuditTrail over the past day, since I first discovered it, and I've been quite pleased with it. However, my app makes a large number of changes, and I was beginning to experience a bit of database bloat because of the growing number of audits.

After a day of usage, one of my models had about 180 revisions, and while each revision itself is small, it was pretty clear that I wasn't going to be able to ignore the situation without causing myself some serious headaches in the relatively near future (of course, being able to only record diffs is a nice advantage for something like django-rcsfield, which would be able to get by with much less space).

Fortunately, depending on how you're using revisions, there is a fairly simple solution to this dilemma: throw the excess revisions away. I didn't want to perform extra database lookups every time a new revision was created, so I decided that adding an extension to manage.py, which I could run periodically from a cron job, would be an adequate solution.

So I set up the skeleton for a management command:

cd my_app
mkdir management
cd management
touch __init__.py
mkdir commands
cd commands
touch __init__.py
emacs clean_audit_trails.py

At first I intended to go with a very specific set of rules for picking the revisions to keep:

  1. All revisions in the past hour,
  2. The first revision older than one hour,
  3. The first revision older than one day,
  4. The first revision older than one week,
  5. The first revision older than one month,
  6. and so on...

But then I started actually writing that code, and my enthusiasm for that approach swiftly dwindled. Instead I decided I could accomplish roughly what I wanted much more concisely by using a simple backoff to determine the cutoffs for dates.

Depending on the type of backoff you use, you can control the spacing of the revisions you save.

>>> def mult_backoff(x):
...     return x * 10
... 
>>> [ mult_backoff(x) for x in xrange(0,10) ]
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> def square_backoff(x):
...     return x * x
... 
>>> [ square_backoff(x) for x in xrange(0,10) ]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

You could also do an additive backoff, etc. For my needs the multiplicative backoff worked well. Starting from 60 seconds and multiplying the previous cutoff by ten each time, it follows this pattern: 1 minute, 10 minutes, about 1 hour 40 minutes, about 17 hours, about 7 days, about 10 weeks, and so on.
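To make the spacing concrete, here is a minimal sketch that generates the cutoffs produced by starting at 60 seconds and multiplying by ten. The function name `backoff_cutoffs` and its parameters are my own illustration, not part of AuditTrail or Django.

```python
import datetime


def backoff_cutoffs(start=60, factor=10, steps=6):
    """Return `steps` timedeltas, each `factor` times longer than the last.

    Hypothetical helper for illustration only.
    """
    cutoffs = []
    seconds = start
    for _ in range(steps):
        cutoffs.append(datetime.timedelta(seconds=seconds))
        seconds *= factor
    return cutoffs


for cutoff in backoff_cutoffs():
    print(cutoff)
# 0:01:00
# 0:10:00
# 1:40:00
# 16:40:00
# 6 days, 22:40:00
# 69 days, 10:40:00
```

Swapping the `seconds *= factor` line for an addition would give the additive variant.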

Here is the implementation of the clean_audit_trails management command:

from django.core.management.base import NoArgsCommand
from my_app.models import MyModel
import datetime

class Command(NoArgsCommand):
    # no trailing comma here, or help becomes a tuple instead of a string
    help = 'Removes excess audit trail history for MyModel.'
    args = ''

    def handle_noargs(self, **options):
        print "Removing unwanted audit trails..."
        # if you let the backoff grow too large, 
        # it'll turn into a long int and datetime.timedelta
        # cannot be instantiated with a long int
        max_age = 60000
        objects = MyModel.objects.select_related().all()
        remove = 0
        now = datetime.datetime.now()
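        for obj in objects:
            # each object's history starts with a one-minute cutoff
            backoff = 60
            cutoff = datetime.timedelta(seconds=backoff)
            # assumes obj.history returns the newest revisions first
            for trail in obj.history.all():
                diff = now - trail._audit_timestamp
                # delete anything newer than the current cutoff, and
                # everything left once the backoff has passed max_age
                if backoff > max_age or diff < cutoff:
                    trail.delete()
                    remove += 1
                else:
                    # keep this revision and widen the cutoff
                    backoff = backoff * 10
                    cutoff = datetime.timedelta(seconds=backoff)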
        print "Removed %d audit trails." % remove

Note that the code is assuming a model that looks like this:

from django.db import models
import audit

class MyModel(models.Model):
    title = models.CharField(max_length=200)
    text = models.TextField()
    history = audit.AuditTrail()

Using it is the same as any other management command:

python manage.py clean_audit_trails

With a little meta-magic you could probably put together a versatile tool based on this that isn't hardcoded to clean a specific model, and that uses a backoff method specified in the project's settings.py.
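As a starting point for that kind of tool, here is a rough sketch of the same keep-or-delete loop pulled out into a plain function that takes the backoff as a callable and works on bare timestamps rather than on a model. The names `split_history` and `times_ten` are hypothetical, and the sketch assumes the timestamps arrive newest first, as the command above does; how the callable would be looked up from settings.py is left open.

```python
import datetime


def times_ten(seconds):
    """Multiplicative backoff: each cutoff is ten times the previous one."""
    return seconds * 10


def split_history(timestamps, now, backoff=times_ten, start=60, max_age=60000):
    """Split timestamps (newest first) into (keep, delete) lists.

    Hypothetical helper: it follows the same rule as the management
    command, but touches no database, so it is easy to experiment with.
    """
    keep, delete = [], []
    seconds = start
    cutoff = datetime.timedelta(seconds=seconds)
    for stamp in timestamps:
        if seconds > max_age or now - stamp < cutoff:
            # newer than the current cutoff, or past the maximum age
            delete.append(stamp)
        else:
            # first revision older than the cutoff: keep it, then back off
            keep.append(stamp)
            seconds = backoff(seconds)
            cutoff = datetime.timedelta(seconds=seconds)
    return keep, delete


now = datetime.datetime(2008, 10, 16, 12, 0, 0)
ages = [30, 90, 300, 900, 7200, 90000, 900000]  # seconds old, newest first
stamps = [now - datetime.timedelta(seconds=age) for age in ages]
keep, delete = split_history(stamps, now)
print([int((now - stamp).total_seconds()) for stamp in keep])
# [90, 900, 7200, 90000]
```

Note that with these defaults a revision younger than a minute is thrown away, and nothing survives once the backoff grows past `max_age`, so at most four revisions per object are kept.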