SuperFamily Relationships with Lazyboy

May 24, 2010. Filed under python, cassandra, lazyboy

I recently started playing with Cassandra and one of its Python clients, lazyboy. The documentation is pretty good, but it took a bit of toying around and source reading before I really got it. (Well, assuming I did really get it; dear reader, please let me know if I've only reached a plateau of comprehension.)

I tend to think that doing is the best form of learning, so let's build something.

Modeling Our Application

For our project, I've decided to build a simple task management system for a development team.

Our system will have many people.

p1 = {"first": "Jack", "last":"Bauer", "role":"developer"}
p2 = {"first": "Lindsay", "last":"Lohan", "role":"product-manager"}

Our system will also have many tasks.

t1 = {"name":"Prevent Nuclear Bomb Plot",
      "estimated-hours":"24",
      "completed-hours":"23"}
t2 = {"name":"Resurrect Career",
      "estimated-hours":"4000",
      "completed-hours":"0"}

Each task will have zero or more people working on it.

# Sometimes I find myself typing snippets to maintain consistency even
# though they aren't actually coherent or helpful.
# I'm having trouble finding the delete key right now.
t1["people"] = []
t1["people"].append(p1)
t1["people"].append(p2)

Based on these relations, we'll probably want to be able to perform these operations:

  • get all people,
  • get all tasks,
  • retrieve a specific person,
  • retrieve a specific task,
  • get people working on a specific task.

So, how can we model this in Cassandra? Quite easily, actually, by following this basic rule:

Have one ColumnFamily per type of document, and also one ColumnFamily for each relationship between documents.

(A ColumnFamily is a collection of similar documents. Documents within a given ColumnFamily don't have to have the same or similar key/values, but usually it does simplify your application, especially if you're going to be iterating over your data.)

Applying that rule to our application we have two types of documents:

  • tasks and
  • members,

and we also have one relationship between the two:

  • a task is worked on by zero or more members.

In other words, we'll require three ColumnFamilies.
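To make the rule concrete before touching Cassandra, here is a minimal in-memory sketch that uses plain dictionaries in place of the three ColumnFamilies. The layout (one dictionary per ColumnFamily, keyed by row key) and the row keys such as "jack" and "bomb-plot" are assumptions made for illustration; nothing here is lazyboy or Cassandra API.

```python
# One dict per ColumnFamily; each maps a row key to that row's columns.
# Plain dicts stand in for Cassandra here, and the row keys are made up.
members = {
    "jack": {"first": "Jack", "last": "Bauer", "role": "developer"},
    "lindsay": {"first": "Lindsay", "last": "Lohan", "role": "product-manager"},
}
tasks = {
    "bomb-plot": {"name": "Prevent Nuclear Bomb Plot",
                  "estimated-hours": "24", "completed-hours": "23"},
    "career": {"name": "Resurrect Career",
               "estimated-hours": "4000", "completed-hours": "0"},
}
# The relationship row shares the task's key; its column names are member keys.
task_members = {
    "bomb-plot": {"jack": "1", "lindsay": "1"},
    "career": {},
}


def people_on_task(task_key):
    """Return the member rows for everyone assigned to task_key."""
    return [members[member_key] for member_key in sorted(task_members[task_key])]


all_people = list(members.values())        # get all people
all_tasks = list(tasks.values())           # get all tasks
one_person = members["jack"]               # retrieve a specific person
one_task = tasks["career"]                 # retrieve a specific task
assigned = people_on_task("bomb-plot")     # get people working on a task
```

Each of the five operations listed earlier maps to a single lookup or a lookup plus one pass over the relationship row, which is the property we want to keep once the dictionaries are swapped for real ColumnFamilies.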

Now we'll move on to installing lazyboy and Cassandra before getting on to the implementation.

Installing Prerequisites

  1. Download Cassandra 0.6.1.

  2. Install Cassandra (instructions here for OS X, but one can imagine alternate directory structures for other OSes):

    cd ~/Downloads
    tar -xvf apache-cassandra-*.gz
    mv apache-cassandra-0.6.1 ~/Library/cassandra
    
  3. Create the necessary directories for Cassandra to run:

    sudo mkdir -p /var/log/cassandra
    sudo chown -R `whoami` /var/log/cassandra
    sudo mkdir -p /var/lib/cassandra
    sudo chown -R `whoami` /var/lib/cassandra
    
  4. Install Git if you don't have it already.

  5. Download and install lazyboy, which we'll use as our Cassandra client for Python (note that you'll need Python 2.5 or newer, which you will have if you're on OS X 10.5 or newer):

    mkdir -p ~/git && cd ~/git
    git clone http://github.com/digg/lazyboy.git
    cd lazyboy
    sudo python setup.py install
    

Now that the prerequisites are satisfied, we can get started with... configuration.

Configuring & Starting Cassandra

Next we'll need to set up Cassandra with the proper settings for our project. Although Cassandra doesn't require compliance with a schema, it does require defining each ColumnFamily before the Cassandra process is started.

Well, rather than a definition, it's more like a statement of existence. Go ahead and open ~/Library/cassandra/conf/storage-conf.xml and replace the existing <Keyspaces> element with this:

<Keyspaces>
 <Keyspace Name="TaskManager">
  <ColumnFamily CompareWith="UTF8Type" Name="Task"/>
  <ColumnFamily CompareWith="UTF8Type" Name="Member"/>
  <ColumnFamily CompareWith="UTF8Type" Name="TaskMember"/>
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
 </Keyspace>
</Keyspaces>
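Since a malformed storage-conf.xml is an easy way to lose time, a quick sanity check can help before starting the server. The sketch below parses a well-formed copy of the <Keyspaces> fragment with Python's standard library and lists the ColumnFamilies each keyspace declares. The helper name column_families is made up for this example, and the check only covers XML well-formedness and names, not whether Cassandra accepts the settings.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the <Keyspaces> fragment, inlined so the check
# is self-contained.
CONF = """
<Keyspaces>
 <Keyspace Name="TaskManager">
  <ColumnFamily CompareWith="UTF8Type" Name="Task"/>
  <ColumnFamily CompareWith="UTF8Type" Name="Member"/>
  <ColumnFamily CompareWith="UTF8Type" Name="TaskMember"/>
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
 </Keyspace>
</Keyspaces>
"""


def column_families(xml_text):
    """Map each keyspace name to the list of ColumnFamily names it declares."""
    root = ET.fromstring(xml_text.strip())
    return dict(
        (keyspace.get("Name"),
         [cf.get("Name") for cf in keyspace.findall("ColumnFamily")])
        for keyspace in root.findall("Keyspace")
    )


declared = column_families(CONF)
```

XML tag names are case-sensitive, so a closing tag that doesn't match its opening tag exactly makes ET.fromstring raise ET.ParseError instead of returning a result.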

Now, go ahead and start Cassandra.

~/Library/cassandra/bin/cassandra

Using lazyboy

Now that we have the various pieces installed and configured, we need to write our code to take advantage of lazyboy. (Go ahead and take a look at the lazyboy readme and lazyboy examples if you find the following section a bit hard to follow.)

For each of our column families (Task, Member, TaskMember) we need to subclass three classes:

  • a subclass of lazyboy.key.Key which makes it possible to write

    a = TaskKey("my-task-a")
    

    instead of

    a = Key("TaskManager", "Task", "my-task-a")
    

    So, admittedly not essential, but a nice little pattern to take advantage of.

  • a subclass of lazyboy.record.Record which is used for accessing a given ColumnFamily's elements, which in combination with the TaskKey class makes it possible to write:

    t1 = Task()
    t2 = Task().load(TaskKey("my-task-a"))
    

    instead of

    t1 = Record()
    t1.key = Key("TaskManager", "Task", "my-task-a")
    t2 = Record().load(Key("TaskManager", "Task", "my-task-a"))
    

    The Record subclasses also make it possible to require certain fields before a record is saved (which we'll take a deeper look at in a moment).

  • A subclass of lazyboy.View which is used to iterate through items in a given ColumnFamily.

The three subclasses for Task will look like this:

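from lazyboy import connection, record
from lazyboy.key import Key
from lazyboy.view import View

# Register a connection pool for our keyspace. This assumes a single
# local node listening on Cassandra's default Thrift port (9160).
connection.add_pool("TaskManager", ["localhost:9160"])

class TaskKey(Key):
    def __init__(self, key=None):
        Key.__init__(self, "TaskManager", "Task", key)

class Task(record.Record):
    # note that we're requiring these two fields for each record
    # in the Task ColumnFamily
    _required = ('name','desc')
    def __init__(self, *args, **kwargs):
        record.Record.__init__(self, *args, **kwargs)
        self.key = TaskKey()

class TaskView(View):
    def __init__(self):
        View.__init__(self)
        self.key = TaskKey(key="row_a")
        self.record_class = Task
        self.record_key = TaskKey()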

Analogous classes will need to be created for Member and TaskMember; they can be viewed in the source code.
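If the pattern still feels abstract, here is a stripped-down sketch of the two ideas at work: a Key subclass that pins the keyspace and ColumnFamily, and a record that refuses to save until its required fields are present. The Key and Record classes below are simplified stand-ins written for this example, not lazyboy's own classes. The choice of ("first", "last") as Member's required fields, the ValueError, and the random hex fallback for a missing key are all assumptions.

```python
import uuid


class Key(object):
    """Stand-in for lazyboy.key.Key: keyspace, column family and row key."""

    def __init__(self, keyspace, column_family, key=None):
        self.keyspace = keyspace
        self.column_family = column_family
        # Assumption: fall back to a random hex string when no key is given.
        self.key = key or uuid.uuid4().hex


class MemberKey(Key):
    """Pins the keyspace and column family so callers pass only the row key."""

    def __init__(self, key=None):
        Key.__init__(self, "TaskManager", "Member", key)


class Record(dict):
    """Stand-in record: a dict that refuses to save without required fields."""

    _required = ()

    def missing(self):
        return [name for name in self._required if name not in self]

    def save(self):
        if self.missing():
            raise ValueError("missing required fields: %s"
                             % ", ".join(self.missing()))
        return self  # a real client would write to Cassandra here


class Member(Record):
    _required = ("first", "last")  # assumed for this sketch

    def __init__(self, *args, **kwargs):
        Record.__init__(self, *args, **kwargs)
        self.key = MemberKey()


m = Member()
m.key = MemberKey("will")
m["first"] = "will"
try:
    m.save()
    saved_early = True
except ValueError:
    saved_early = False   # "last" is still missing, so the save is refused
m["last"] = "larson"
saved = m.save()
```

The subclasses add no behaviour of their own; they just move the repeated "TaskManager" and "Member" strings and the list of required fields into one place.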

Creating & Manipulating One Column

Now that we've created all our classes, we can start using them to interface with Cassandra. Let's say that we want to create a new Task with the key my-project-a.

>>> from tasks import *
>>> t = Task()
>>> t.key = TaskKey("my-project-a")
>>> t.update({"name":"My Project A", "desc":"Build the critical A project"})
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}
>>> t.save()
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}

Note that we explicitly specify our key here, but that isn't required. If you don't specify one, a random key will be generated for you.

>>> t = Task()
>>> t.key
{'column_family': 'Task', 
 'keyspace': 'TaskManager',
 'super_column': None, 
 'key': 'cff715a0f7cf4f5bb78e70cab6bcd867'}

Now that we've saved our Task we can retrieve it at will.

>>> t2 = Task().load(TaskKey("my-project-a"))
>>> t2
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}

It behaves like a dictionary for retrieving specific values, and can also be updated with dictionary syntax. Note that save is required for the data to be updated in Cassandra.

>>> t2['name']
'My Project A'
>>> t2['name'] = "A newer project"
>>> t2
Task: {'name': 'A newer project', 
       'desc': 'Build the critical A project'}
>>> t2.save()
Task: {'name': 'A newer project', 
       'desc': 'Build the critical A project'}

Finally, we can also delete our task.

>>> t2.remove()
Task: {}
>>> Task().load(TaskKey("my-project-a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.6/site-packages/Lazyboy-0.7.5-py2.6.egg/lazyboy/record.py", line 184, in load
    columns = iterators.slice_iterator(key, consistency)
  File "/Library/Python/2.6/site-packages/Lazyboy-0.7.5-py2.6.egg/lazyboy/iterators.py", line 50, in slice_iterator
    raise exc.ErrorNoSuchRecord("No record matching key %s" % key)
lazyboy.exceptions.ErrorNoSuchRecord: No record matching key {'column_family': 'Task', 'keyspace': 'TaskManager', 'super_column': None, 'key': 'my-project-a'}

Using these examples, you have enough material to use Cassandra as a simple key-value store, but being strictly limited to a KV store isn't enough to develop some classes of applications, so we're fortunate that Cassandra gives us quite a bit more.
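The semantics of the session above (edits stay local until save, and loading a removed key raises an error) can be mimicked with a small in-memory stand-in, which is handy for reasoning about the behaviour without a running server. Everything below is hypothetical scaffolding: STORE plays the role of the Task ColumnFamily, and NoSuchRecord stands in for lazyboy's ErrorNoSuchRecord.

```python
class NoSuchRecord(Exception):
    """Hypothetical stand-in for lazyboy.exceptions.ErrorNoSuchRecord."""


STORE = {}  # row key -> columns; plays the role of the Task ColumnFamily


class Task(dict):
    """Dict-like record whose changes reach STORE only when save() is called."""

    def __init__(self, key=None):
        dict.__init__(self)
        self.key = key

    def load(self, key):
        if key not in STORE:
            raise NoSuchRecord("No record matching key %s" % key)
        self.key = key
        self.clear()
        self.update(STORE[key])
        return self

    def save(self):
        STORE[self.key] = dict(self)  # copy, so later local edits stay local
        return self

    def remove(self):
        STORE.pop(self.key, None)
        self.clear()
        return self


t = Task("my-project-a")
t.update({"name": "My Project A", "desc": "Build the critical A project"})
t.save()

t2 = Task().load("my-project-a")
t2["name"] = "A newer project"                        # local change only
name_before_save = Task().load("my-project-a")["name"]
t2.save()                                             # now it is persisted
name_after_save = Task().load("my-project-a")["name"]

t2.remove()
try:
    Task().load("my-project-a")
    still_there = True
except NoSuchRecord:
    still_there = False
```

In application code, the try/except around load is the part worth copying: a missing key is reported as an exception rather than as an empty record.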

Retrieving Many Columns

In addition to retrieving specific keys, we can also iterate through all the keys in a ColumnFamily. To simplify the example, let's pretend that we have this function defined:

def add_member(username, first, last):
    m = Member()
    m.key = MemberKey(username)
    m['first'] = first
    m['last'] = last
    m.save()
    # saving alone isn't enough: the view needs its own entry
    mv = MemberView()
    mv.append(m)

In particular, notice the last two lines, where we explicitly add the new Member to the MemberView. This is critical for making the member retrievable via the operations we look at below.
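A rough way to picture why the append matters: the view appears to be a row of its own whose columns point at record keys (the Column(name='c', value='c') output later in this post suggests as much), so a record that was saved but never appended has no pointer for the view to follow. The sketch below models that with two dictionaries. RECORDS, INDEX_ROW and the three helper functions are made-up names for this illustration, not lazyboy internals.

```python
RECORDS = {}      # stands in for the Member ColumnFamily: row key -> columns
INDEX_ROW = {}    # stands in for the view's row: column name -> record key


def save_member(username, first, last):
    """Write the record only; the view does not learn about it."""
    RECORDS[username] = {"first": first, "last": last}


def view_append(username):
    """Add a pointer to the record in the view's row."""
    INDEX_ROW[username] = username


def view_iter():
    """Walk the view's columns in name order and load each record."""
    for column_name in sorted(INDEX_ROW):
        yield RECORDS[INDEX_ROW[column_name]]


save_member("a", "a", "last_f")
view_append("a")
save_member("b", "b", "last_c")   # saved, but never appended to the view

visible = list(view_iter())
```

Here "b" can still be loaded directly by key, but iterating the view only ever yields "a".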

Now let's try iterating through our Member columns.

>>> members = (('a','a','last_f'),
...            ('b','b','last_c'),
...            ('c','c','last_d'),
...            ('d','d','last_e'),
...            ('e','e','last_a'),
...            ('f','f','last_b'))
>>> 
>>> for member in members:
...     add_member(member[0], member[1], member[2])
... 
>>> len(MemberView())
6
>>> for m in MemberView(): print m
... 
Member: {'last': 'last_f', 'first': 'a'}
Member: {'last': 'last_c', 'first': 'b'}
Member: {'last': 'last_d', 'first': 'c'}
Member: {'last': 'last_e', 'first': 'd'}
Member: {'last': 'last_a', 'first': 'e'}
Member: {'last': 'last_b', 'first': 'f'}

More than that, we can also iterate through a subset of columns by defining which key we want to start iterating at:

>>> mv = MemberView()
>>> mv.start_col = "c"
>>> for m in mv: print m
... 
Member: {'last': 'last_d', 'first': 'c'}
Member: {'last': 'last_e', 'first': 'd'}
Member: {'last': 'last_a', 'first': 'e'}
Member: {'last': 'last_b', 'first': 'f'}

You'd think there would be a way to specify the end of the range for this iteration, and there is, although it involves a bit of trickery at this point. (It's likely a minor oversight in the API design; I'll put together a quick patch after writing this up.)

>>> mv = MemberView()
>>> for a in mv._cols("c", "e"): print a
... 
Column(timestamp=1274654073, name='c', value='c')
Column(timestamp=1274654073, name='d', value='d')
Column(timestamp=1274654073, name='e', value='e')
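Both outputs above are consistent with a slice over column names kept in sorted order, with the start and end names both included. Assuming that is what is going on, the same behaviour can be sketched in plain Python with bisect; slice_columns is a hypothetical helper for this example, and an empty string is used here to mean "no bound on this side".

```python
from bisect import bisect_left, bisect_right


def slice_columns(sorted_names, start="", finish=""):
    """Return names between start and finish, both inclusive.

    An empty string leaves that end of the range open.
    """
    lo = bisect_left(sorted_names, start) if start else 0
    hi = bisect_right(sorted_names, finish) if finish else len(sorted_names)
    return sorted_names[lo:hi]


names = sorted(["a", "b", "c", "d", "e", "f"])
from_c = slice_columns(names, start="c")   # like setting start_col = "c"
c_to_e = slice_columns(names, "c", "e")    # like _cols("c", "e")
```

Because the names are sorted, both bounds are found by binary search rather than by scanning, which is why range reads like this stay cheap as a row grows.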

Now we've moved from only using Cassandra as a key-value store to also being able to iterate through all or portions of the keys in a given ColumnFamily.

Managing Relations

The final step in implementing our master plan is to store the list of Members assigned to a particular Task. At the simplest, you might try storing a comma-separated list of keys in the Task itself, but that approach will become unreliable as write volume increases because there is no value merging on conflict; rather, the latest write wins. (I realize this is a bit hand-wavy, but I've never succeeded in explaining this succinctly, so I'll need to devote another entry to it rather than trying to be ambitious in this one.)

Rather, a more successful approach is to use a new ColumnFamily (for us, TaskMember) to store the relations. For each Task we'll also create a new TaskMember with the same key, and will use it to store relations to Members.
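A toy simulation may make the difference less hand-wavy. It assumes only the rule stated above (for a single column, the write with the latest timestamp wins) and compares two clients adding members at the same time under each layout. The function name, timestamps and values are invented for the illustration.

```python
def last_write_wins(writes):
    """Given (timestamp, value) writes to one column, keep the newest value."""
    return max(writes, key=lambda write: write[0])[1]


# Approach 1: one column holding a comma-separated list.
# Two clients both read "" and each append a different member.
stale = ""
write_a = (1, ",".join(filter(None, [stale, "will"])))   # client A adds will
write_b = (2, ",".join(filter(None, [stale, "jill"])))   # client B adds jill
csv_result = last_write_wins([write_a, write_b])

# Approach 2: one column per member in a separate row.
# The same two writes land in different columns, so neither overwrites the other.
row = {}
for timestamp, member in [(1, "will"), (2, "jill")]:
    current = row.get(member)
    if current is None or timestamp > current[0]:
        row[member] = (timestamp, "1")
per_column_result = sorted(row)
```

With the comma-separated list, client A's addition is silently lost because both clients wrote the same column. With one column per member, there is no shared value to fight over, so both assignments survive.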

First let's create some members and a Task, and also create a TaskMember which has the same key as the Task.

>>> from tasks import *
>>> add_member("will", "will", "larson")
>>> add_member("bill", "bill", "fakename")
>>> add_member("jill", "jill", "lastnamehere")
>>> t = Task().update({"name":"Web App 1",
...          "desc":"We will build a fantastic web application"}).save()
>>> t.key
{'column_family': 'Task', 'keyspace': 'TaskManager', 
 'super_column': None, 'key': '52a6c23e1afd480991e2232f5e7d9ba8'}
>>> t.key.key
'52a6c23e1afd480991e2232f5e7d9ba8'
>>> tm = TaskMember()
>>> tm.key = TaskMemberKey(t.key.key)
>>> tm.save()
TaskMember: {}
>>> tv = TaskView()
>>> tv.append(t)

Now let's add Will and Jill to the task, while leaving Bill all by his lonesome.

>>> tm = TaskMember().load(TaskMemberKey(t.key.key))
>>> will = Member().load(MemberKey("will"))
>>> will.key.key
'will'
>>> tm[will.key.key] = 1
>>> jill = Member().load(MemberKey("jill"))
>>> tm[jill.key.key] = 1
>>> tm.save()

Yep, all we're doing is adding a new key/value for each of the users we want in the project. A less contrived version of what we just did looks like:

>>> tm.update({"will":1, "jill":1}).save()
TaskMember: {'will': '1', 'jill': '1'}

Then you can simply retrieve it and iterate through it to determine the current members assigned to the task:

>>> tm = TaskMember().load(TaskMemberKey(t.key.key))
>>> for a in tm: print a
... 
will
jill

And that's all there really is to it. Now we can do relations, key-values and iteration across ranges. There isn't too much out there that you can't build with these simple tools.

The End

This code is available on my fork of lazyboy on GitHub, and I'll see about sending a pull request to the official repository after I've done a bit of clean up.

Let me know if there are any questions or comments!