Configuration Driven Behavior

Building technical leverage goes beyond implementing systems and tools; it's also an opportunity to explore how system design leads to reusable, flexible and useful tools. One of the most interesting system design concepts that we've hit upon is configuration driven behavior.

It's the idea of writing your programs to process configuration that describes the behavior, rather than writing programs that directly describe behavior. This pattern pops up frequently:

  1. in the Django middleware,
  2. in the extraction library I cobbled together recently,
  3. and when configuring the map and reduce steps in a Hadoop pipeline.

Let's look more closely at an example of this pattern, discuss its benefits, and end by considering its applicability.

A Config Driven Classification System

One example of this pattern is the submitted article classifier we built at Digg for identifying spam and low quality submissions. The system consisted of:

  1. classifiers which were passed data and metrics to evaluate against,
  2. a workflow which managed passing data and metrics to the classifiers, and
  3. a configuration system which selected which classifiers ran, the order they ran in, and the values passed into their initializer.

The high-level implementation looked something like this:

import importlib
import config_store

class Classifier(object):
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def classify(self, data):
        return True

class Workflow(object):
    def __init__(self, classifiers):
        self.classifiers = classifiers

    def run(self, data):
        for classifier in self.classifiers:
            if classifier.classify(data):
                return classifier.__class__.__name__

def init_classifier(class_path):
    # class_path is a dotted path such as "package.module.ClassName"
    params = config_store.get("classifier.%s" % class_path, {})
    module_path, class_name = class_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    class_ref = getattr(module, class_name)
    return class_ref(**params)

def setup():
    classifier_class_names = config_store.get("classifiers", [])
    classifiers = [init_classifier(x) for x in classifier_class_names]
    wf = Workflow(classifiers)

    # assuming we have some kind of blocking generator which will
    # return data when available
    for data in data_to_classify:
        classification = wf.run(data)
        print("Classified as: %s" % classification)

if __name__=="__main__":
    setup()

If a classifier started behaving badly, we'd remove it from the configuration; if the research team had an experimental new classifier, into the configuration it went.

In either case, we could tweak the initialization parameters to tune behavior with minimal effort. The system also exposed its classifications to the community management team, who closed the feedback loop by suggesting improvements.
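To make the add, remove and tune cycle concrete, here is a minimal sketch of what such a configuration might look like. Everything in it is hypothetical: the CONFIG dict stands in for config_store, the two classifiers and their parameters are invented for illustration, and a registry dict replaces the dotted-path import to keep the example self-contained.

```python
# Hypothetical stand-in for config_store: a plain dict keyed the same way
# the snippet above reads it ("classifiers" and "classifier.<name>").
CONFIG = {
    "classifiers": ["LinkDensity", "BannedDomain"],
    "classifier.LinkDensity": {"max_links": 3},
    "classifier.BannedDomain": {"domains": ["spam.example"]},
}


class LinkDensity(object):
    """Flags submissions containing more than max_links links."""

    def __init__(self, max_links=5):
        self.max_links = max_links

    def classify(self, data):
        return data.get("link_count", 0) > self.max_links


class BannedDomain(object):
    """Flags submissions whose domain is on a configured list."""

    def __init__(self, domains=()):
        self.domains = set(domains)

    def classify(self, data):
        return data.get("domain") in self.domains


# Registry of available classifiers; the config decides which ones run.
REGISTRY = {"LinkDensity": LinkDensity, "BannedDomain": BannedDomain}


def build_classifiers(config):
    """Instantiate the configured classifiers, in the configured order."""
    return [
        REGISTRY[name](**config.get("classifier.%s" % name, {}))
        for name in config.get("classifiers", [])
    ]


def run(classifiers, data):
    """Return the name of the first classifier that matches, else None."""
    for classifier in classifiers:
        if classifier.classify(data):
            return type(classifier).__name__
    return None
```

In this sketch, disabling a misbehaving classifier means deleting its name from the "classifiers" list, and tuning one means changing a value such as max_links; neither touches the code.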

Ease of Experimentation

The biggest benefit of the configuration driven approach is making experimentation lightweight.

The systems we build aren't inherently valuable; they're valuable when they're used to perform useful work. When we build in a reasonable degree of generality and make experimentation easy, we create the possibility of unexpected wins from others and from our future selves.

When a system's user learns a well-designed configuration mechanism, they can then take advantage of the underlying system without having to understand the system itself. If the configuration language is simplified enough--maybe a format described in YAML--then all of a sudden the group of users goes from developers to a much larger group.
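As a rough sketch of a configuration a non-developer could edit, the example below parses a small text config and turns it into rules. It uses JSON only because a parser ships with Python; a YAML file would be handled the same way through a YAML parser. The rule names, thresholds and helper functions are all hypothetical.

```python
import json

# Hypothetical config text a non-developer might edit by hand.
CONFIG_TEXT = """
{
  "classifiers": [
    {"name": "min_length", "threshold": 20},
    {"name": "max_caps_ratio", "threshold": 0.5}
  ]
}
"""

# Hypothetical rules: each takes the submission text and a threshold,
# and returns True when the text should be flagged.
RULES = {
    "min_length": lambda text, t: len(text) < t,
    "max_caps_ratio": lambda text, t: (
        sum(c.isupper() for c in text) > t * len(text)
    ),
}


def load_rules(config_text):
    """Parse the config and reject unknown rule names early."""
    config = json.loads(config_text)
    rules = []
    for entry in config["classifiers"]:
        name = entry["name"]
        if name not in RULES:
            raise ValueError("unknown classifier: %s" % name)
        rules.append((name, RULES[name], entry["threshold"]))
    return rules


def flagged_by(rules, text):
    """Return the names of every rule that flags the text."""
    return [name for name, rule, threshold in rules if rule(text, threshold)]
```

Rejecting unknown names at load time matters more as the audience widens: someone editing the file should get a clear error rather than a silently ignored rule.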

In my experience, easing experimentation on your system is probably the most effective way for it to teach you something new and unexpected.

Improved Maintainability

Where the ease of experimentation makes it easier to get the full value out of your system, it's the ease of maintenance that allows you to spend more time building new systems and less time as a slave to the systems you've already built.

With behavior described in the configuration, most long-term maintenance is pushed into maintaining the core which processes the configuration, and which is hopefully quite small and relatively simple. Done properly, that core is also extremely testable, as its interface is strictly defined by the configuration language.
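One way to picture that testability: because the core only ever sees whatever classifiers the configuration hands it, it can be exercised with trivial stubs. The sketch below repeats the Workflow class from the earlier snippet so it runs on its own; the Always and Never stubs and the test names are made up for illustration.

```python
import unittest


class Workflow(object):
    """Same shape as the Workflow shown earlier in this post."""

    def __init__(self, classifiers):
        self.classifiers = classifiers

    def run(self, data):
        for classifier in self.classifiers:
            if classifier.classify(data):
                return classifier.__class__.__name__


class Always(object):
    """Stub classifier that matches everything."""

    def classify(self, data):
        return True


class Never(object):
    """Stub classifier that matches nothing."""

    def classify(self, data):
        return False


class WorkflowTest(unittest.TestCase):
    def test_first_match_wins(self):
        self.assertEqual(Workflow([Never(), Always()]).run({}), "Always")

    def test_no_match_returns_none(self):
        self.assertIsNone(Workflow([Never()]).run({}))

    def test_empty_configuration_is_valid(self):
        self.assertIsNone(Workflow([]).run({}))
```

None of these tests depend on any real classifier, so the core can stay covered while the set of configured classifiers changes underneath it.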

Applicability

The challenge--and the danger--of ideas like Configuration Driven Behavior is in picking the correct level of abstraction for a given piece of code as it exists today and as we anticipate it existing in the future. If we write something too coupled to our current needs, then we'll end up rewriting it in the near future without accruing much benefit. If we're too abstract, it takes too long or ends up having to be rewritten anyway because it solves the wrong problem.

As abstractions go, I think this one scales well upwards and downwards. At the small end we can use a dictionary or list to direct behavior instead of hardcoding the values, and at the high end we could be parsing a domain specific configuration language.
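At the small end, the difference can be as modest as the sketch below: the same labeling behavior written once as hardcoded branches and once driven by a list. The score thresholds and labels are hypothetical values chosen only to show the shape.

```python
# Ordered (minimum score, label) pairs; hypothetical values for illustration.
SCORE_LABELS = [
    (0.9, "spam"),
    (0.6, "low quality"),
    (0.0, "ok"),
]


def label_hardcoded(score):
    """Behavior written directly into the code."""
    if score >= 0.9:
        return "spam"
    elif score >= 0.6:
        return "low quality"
    return "ok"


def label_from_table(score, table=SCORE_LABELS):
    """Same behavior for scores of 0.0 and up, driven by the table passed in."""
    for minimum, label in table:
        if score >= minimum:
            return label
    return None
```

Changing a threshold or adding a label is now an edit to SCORE_LABELS, and a caller can pass a different table without touching the function.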

My rule of thumb is to spend a minute trying to find a way to describe what I'm accomplishing in a datastructure which drives the code. If I can find one, I use it. If not, I move on.

I'll end with two similar examples from extraction, one of which fit with the Configuration Driven Behavior approach, and another which didn't. In the first snippet, the pattern for extracting data from Facebook Opengraph tags is very straightforward, so I used a map to drive the behavior:

property_map = {
    'og:title': 'titles',
    'og:url': 'urls',
    'og:image': 'images',
    'og:description': 'descriptions',
    }

def extract(self, html):
    "Extract data from Facebook Opengraph tags."
    extracted = {}
    soup = BeautifulSoup(html)
    for meta_tag in soup.find_all('meta'):
        if 'property' in meta_tag.attrs and 'content' in meta_tag.attrs:
            property = meta_tag['property']
            if property in self.property_map:
                property_dest = self.property_map[property]
                if property_dest not in extracted:
                    extracted[property_dest] = []
                extracted[property_dest].append(meta_tag.attrs['content'])

    return extracted

In a different snippet, the process for handling link tags was significantly more complex, and I couldn't find a reasonable way to describe my intent. So I skipped describing the work in a datastructure and described it literally:

def extract(self, html):
    "Extract data from meta, link and title tags within the head tag."
    extracted = {}
    soup = BeautifulSoup(html)
    for link_tag in soup.find_all('link'):
        if 'rel' in link_tag.attrs:
            if ('alternate' in link_tag['rel'] or \
                link_tag['rel'] == 'alternate') and \
               'type' in link_tag.attrs and \
               link_tag['type'] == "application/rss+xml" \
               and 'href' in link_tag.attrs:
                if 'feeds' not in extracted:
                    extracted['feeds'] = []
                extracted['feeds'].append(link_tag['href'])
            elif ('canonical' in link_tag['rel'] or \
                  link_tag['rel'] == 'canonical') \
                 and 'href' in link_tag.attrs:
                if 'urls' not in extracted:
                    extracted['urls'] = []
                extracted['urls'].append(link_tag['href'])

    return extracted

Altogether, it's a nice pattern that fits into quite a few of the projects I've been working on lately. When you're designing your next project, give it a thought.

All Rights Reserved, Will Larson 2007 - 2014.