SuperFamily Relationships with Lazyboy
Recently I started playing with Cassandra and one of its Python clients, lazyboy. The documentation is pretty good, but it took me a bit of toying around and source reading before I really got it. (Well, assuming I did really get it; dear reader, please do let me know if I've only reached a plateau of comprehension.)
I tend to think that doing is the best form of learning, so let's build something.
Modeling Our Application
For our project, I've decided to build a simple task management system for a development team.
Our system will have many people.
p1 = {"first": "Jack", "last":"Bauer", "role":"developer"}
p2 = {"first": "Lindsay", "last":"Lohan", "role":"product-manager"}
Our system will also have many tasks.
t1 = {"name": "Prevent Nuclear Bomb Plot",
      "estimated-hours": "24",
      "completed-hours": "23"}
t2 = {"name": "Resurrect Career",
      "estimated-hours": "4000",
      "completed-hours": "0"}
Each task will have zero or more people working on it.
# Sometimes I find myself typing snippets to maintain consistency even
# though they aren't actually coherent or helpful.
# I'm having trouble finding the delete key right now.
t1.people.append(p1)
t1.people.append(p2)
Based on these relations, we'll probably want to be able to perform these operations:
- get all people,
- get all tasks,
- retrieve a specific person,
- retrieve a specific task,
- get people working on a specific task.
So, how can we model this in Cassandra? Actually, very easily following this basic rule:
Have one ColumnFamily per type of document, and also one ColumnFamily for each relationship between documents.
(A ColumnFamily is a collection of similar documents. Documents within a given ColumnFamily don't have to have the same or similar key/values, but usually it does simplify your application, especially if you're going to be iterating over your data.)
Applying that rule to our application we have two types of documents:
- tasks and
- members,
and we also have one relationship between the two:
- a task is worked on by zero or more members.
In other words, we'll require three ColumnFamilies.
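As a rough mental model, here is the same layout with plain Python dicts standing in for Cassandra. This is not lazyboy code, and the row keys ("jack", "prevent-bomb" and so on) are invented purely for illustration.

```python
# Plain dicts standing in for ColumnFamilies: row key -> {column name: value}.
# The row keys ("jack", "prevent-bomb", ...) are invented for this sketch.
member = {
    "jack": {"first": "Jack", "last": "Bauer", "role": "developer"},
    "lindsay": {"first": "Lindsay", "last": "Lohan", "role": "product-manager"},
}
task = {
    "prevent-bomb": {"name": "Prevent Nuclear Bomb Plot",
                     "estimated-hours": "24", "completed-hours": "23"},
    "resurrect-career": {"name": "Resurrect Career",
                         "estimated-hours": "4000", "completed-hours": "0"},
}
# The relationship gets its own ColumnFamily, keyed by task; each column
# name is the key of a member working on that task.
task_member = {
    "prevent-bomb": {"jack": "1", "lindsay": "1"},
    "resurrect-career": {},
}


def people_on_task(task_key):
    """Return the member documents for everyone assigned to a task."""
    return [member[m] for m in sorted(task_member.get(task_key, {}))]


all_people = list(member.values())       # get all people
all_tasks = list(task.values())          # get all tasks
jack = member["jack"]                    # retrieve a specific person
bomb = task["prevent-bomb"]              # retrieve a specific task
crew = people_on_task("prevent-bomb")    # get people working on a task
```

Each of the five operations listed earlier maps onto a lookup in exactly one of the three dicts, plus one extra hop through member for the relationship.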
Now we'll move on to installing lazyboy and Cassandra before getting on to the implementation.
Installing Prerequisites
Install Cassandra (instructions here for OS X, but one can imagine alternate directory structures for other OSes):
cd ~/Downloads
tar -xvf apache-cassandra-*.gz
mv apache-cassandra-0.6.1 ~/Library/cassandra
Create the necessary directories for Cassandra to run:
tar -zxvf cassandra-$VERSION.tgz
cd cassandra-$VERSION
sudo mkdir -p /var/log/cassandra
sudo chown -R `whoami` /var/log/cassandra
sudo mkdir -p /var/lib/cassandra
sudo chown -R `whoami` /var/lib/cassandra
Install Git if you don't have it already.
Download Lazyboy which we'll use as our Cassandra client for Python (note that you'll need to use Python 2.5 or newer, which you will have if you have OS X 10.5 or newer):
mkdir ~/git && cd ~/git
git clone http://github.com/digg/lazyboy.git
cd lazyboy
sudo python setup.py install
Now the prerequisites are satisfied, we can get started with... configuration.
Configuring & Starting Cassandra
Next we'll need to set up Cassandra with the proper settings for our project. Although Cassandra doesn't require compliance with a schema, it does require defining each ColumnFamily before the Cassandra process is started. Well, rather than a definition, it's more like a statement of existence.
Go ahead and open ~/Library/cassandra/conf/storage-conf.xml and replace the existing <Keyspaces> section with this:
<Keyspaces>
  <Keyspace Name="TaskManager">
    <ColumnFamily CompareWith="UTF8Type" Name="Task"/>
    <ColumnFamily CompareWith="UTF8Type" Name="Member"/>
    <ColumnFamily CompareWith="UTF8Type" Name="TaskMember"/>
    <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
    <ReplicationFactor>1</ReplicationFactor>
    <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
  </Keyspace>
</Keyspaces>
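Cassandra only reads this file at startup, so a typo tends to show up as a process that refuses to start. One optional sanity check, sketched here with Python's standard library against a trimmed copy of the fragment, is to parse it and list the declared ColumnFamilies. The helper name column_families is made up for this example.

```python
import xml.etree.ElementTree as ET

# A copy of the <Keyspaces> section, trimmed to the ColumnFamily declarations.
FRAGMENT = """
<Keyspaces>
  <Keyspace Name="TaskManager">
    <ColumnFamily CompareWith="UTF8Type" Name="Task"/>
    <ColumnFamily CompareWith="UTF8Type" Name="Member"/>
    <ColumnFamily CompareWith="UTF8Type" Name="TaskMember"/>
  </Keyspace>
</Keyspaces>
"""


def column_families(xml_text):
    """Map each keyspace name to the ColumnFamily names declared in it."""
    root = ET.fromstring(xml_text)  # raises a parse error on malformed XML
    return dict(
        (ks.get("Name"), [cf.get("Name") for cf in ks.findall("ColumnFamily")])
        for ks in root.findall("Keyspace")
    )


declared = column_families(FRAGMENT)
```

XML tag names are case-sensitive, so a mismatched opening and closing tag fails in the fromstring call rather than quietly passing.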
Now, go ahead and start Cassandra.
~/Library/cassandra/bin/cassandra
Using lazyboy
Now that we have the various pieces installed and configured, we need to write our code to take advantage of lazyboy. (Go ahead and take a look at the lazyboy readme and lazyboy examples if you find the following section a bit hard to follow.)
For each of our column families (Task, Member, TaskMember) we need to subclass three classes:
- a subclass of lazyboy.key.Key, which makes it possible to write
    a = TaskKey("my-task-a")
  instead of
    a = Key("TaskManager", "Task", "my-task-a")
  So, admittedly not essential, but a nice little pattern to take advantage of.
- a subclass of lazyboy.record.Record, which is used for accessing a given ColumnFamily's elements, and which in combination with the TaskKey class makes it possible to write:
    t1 = Task()
    t2 = Task().load(TaskKey("my-task-a"))
  instead of
    t1 = Record()
    t1.key = Key("TaskManager", "Task", "my-task-a")
    t2 = Record().load(Key("TaskManager", "Task", "my-task-a"))
  The Record subclasses also make it possible to require certain fields before a column is saved (which we'll take a deeper look at in a moment).
- a subclass of lazyboy.View, which is used to iterate through items in a given ColumnFamily.
The three subclasses for Task will look like this:
from lazyboy import *
from lazyboy.key import Key

# Tell lazyboy where the cluster lives; adjust the host and port for your
# setup (9160 is Cassandra's default Thrift port).
connection.add_pool("TaskManager", ["localhost:9160"])

class TaskKey(Key):
    def __init__(self, key=None):
        Key.__init__(self, "TaskManager", "Task", key)

class Task(record.Record):
    # note that we're requiring these two keys for each column
    # in the Task ColumnFamily
    _required = ('name', 'desc')

    def __init__(self, *args, **kwargs):
        record.Record.__init__(self, *args, **kwargs)
        self.key = TaskKey()

class TaskView(View):
    def __init__(self):
        View.__init__(self)
        self.key = TaskKey(key="row_a")
        self.record_class = Task
        self.record_key = TaskKey()
Analogous classes will need to be created for Member and TaskMember, which can be viewed in the source code.
Creating & Manipulating One Column
Now that we've created all our classes, we can start using them to interface with Cassandra. Let's say that we want to create a new Task with the key my-project-a.
>>> from tasks import *
>>> t = Task()
>>> t.key = TaskKey("my-project-a")
>>> t.update({"name":"My Project A", "desc":"Build the critical A project"})
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}
>>> t.save()
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}
Note that we explicitly specify our key here, but that isn't required: if you don't specify one, a random key will be generated for you.
>>> t = Task()
>>> t.key
{'column_family': 'Task',
'keyspace': 'TaskManager',
'super_column': None,
'key': 'cff715a0f7cf4f5bb78e70cab6bcd867'}
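The generated key above is 32 hexadecimal characters, which is the shape of a random UUID rendered as hex. Whether lazyboy builds it exactly this way is an assumption on my part rather than something shown here, but if you ever want to mint keys of the same shape yourself, the standard library will do it. The function name random_row_key is hypothetical.

```python
import uuid


def random_row_key():
    """Return a random 32-character hex string usable as a row key."""
    # Assumption: this mirrors the shape of lazyboy's generated keys,
    # not necessarily the exact way lazyboy produces them.
    return uuid.uuid4().hex


key = random_row_key()
```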
Now that we've saved our Task we can retrieve it at will.
>>> t2 = Task().load(TaskKey("my-project-a"))
>>> t2
Task: {'name': 'My Project A', 'desc': 'Build the critical A project'}
It behaves like a dictionary for retrieving specific values, and can also be updated with dictionary syntax. Note that save is required for the data to be updated in Cassandra.
>>> t2['name']
'My Project A'
>>> t2['name'] = "A newer project"
>>> t2
Task: {'name': 'A newer project',
'desc': 'Build the critical A project'}
>>> t2.save()
Task: {'name': 'A newer project',
'desc': 'Build the critical A project'}
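To spell out the "save is required" point, here is a toy model of the same behaviour: a dict subclass whose edits stay local until save() copies them to a backing store. ToyRecord and store are invented for this sketch and have nothing to do with lazyboy's actual implementation.

```python
class ToyRecord(dict):
    """Dict that only copies its contents to a backing store on save()."""

    def __init__(self, store, key):
        dict.__init__(self)
        self.store, self.key = store, key

    def save(self):
        # Persist a snapshot; later edits stay local until the next save().
        self.store[self.key] = dict(self)
        return self


store = {}  # stands in for Cassandra
t = ToyRecord(store, "my-project-a")
t["name"] = "My Project A"
t.save()
t["name"] = "A newer project"                 # changed locally only
before_save = store["my-project-a"]["name"]   # still the old value
t.save()
after_save = store["my-project-a"]["name"]    # now the new value
```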
Finally, we can also delete our task.
>>> t2.remove()
Task: {}
>>> Task().load(TaskKey("my-project-a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.6/site-packages/Lazyboy-0.7.5-py2.6.egg/lazyboy/record.py", line 184, in load
    columns = iterators.slice_iterator(key, consistency)
  File "/Library/Python/2.6/site-packages/Lazyboy-0.7.5-py2.6.egg/lazyboy/iterators.py", line 50, in slice_iterator
    raise exc.ErrorNoSuchRecord("No record matching key %s" % key)
lazyboy.exceptions.ErrorNoSuchRecord: No record matching key {'column_family': 'Task', 'keyspace': 'TaskManager', 'super_column': None, 'key': 'my-project-a'}
Using these examples, you have enough material to use Cassandra as a simple key-value store, but being strictly limited to a KV store isn't enough to develop some classes of applications, so we're fortunate that Cassandra gives us quite a bit more.
Retrieving Many Columns
In addition to retrieving specific keys, we can also iterate through all the keys in a ColumnFamily. To simplify the example, let's pretend that we have this function defined:
def add_member(username, first, last):
    m = Member()
    m.key = MemberKey(username)
    m['first'] = first
    m['last'] = last
    m.save()
    # register the new record with the view so iteration can find it
    mv = MemberView()
    mv.append(m)
In particular, notice the last two lines, where we explicitly add the new Member to the MemberView. This is critical for making the member retrievable via the operations we look at below.
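Why is the append needed? My understanding, which you should treat as an assumption rather than documented behaviour, is that a View is itself stored as a row whose columns point at record keys, so saving a record on its own never makes it show up in a View. The sketch below models that idea with plain dicts; every name in it is invented for illustration and none of it is lazyboy's API.

```python
# Toy model: one dict holds the records, another holds the "view" row.
# The names (records, view_row, save, append_to_view) are hypothetical.
records = {}    # row key -> record columns, like the Member ColumnFamily
view_row = {}   # column name -> row key, like the row backing a View


def save(key, columns):
    """Store a record; this alone does not touch the view."""
    records[key] = dict(columns)


def append_to_view(key):
    """Register the record's key in the view so iteration can find it."""
    view_row[key] = key


def iterate_view():
    """Yield records in column-name order, following the stored pointers."""
    for name in sorted(view_row):
        yield records[view_row[name]]


save("a", {"first": "a", "last": "last_f"})
save("b", {"first": "b", "last": "last_c"})
append_to_view("a")   # "b" is saved but never appended

visible = list(iterate_view())  # only the record for "a"
```

In this model "b" is still retrievable by key, but it is invisible to iteration, which matches what happens if you forget the append.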
Now let's try iterating through our Member columns.
>>> members = (('a','a','last_f'),
... ('b','b','last_c'),
... ('c','c','last_d'),
... ('d','d','last_e'),
... ('e','e','last_a'),
... ('f','f','last_b'))
>>>
>>> for member in members:
... add_member(member[0], member[1], member[2])
...
>>> len(MemberView())
6
>>> for m in MemberView(): print m
...
Member: {'last': 'last_f', 'first': 'a'}
Member: {'last': 'last_c', 'first': 'b'}
Member: {'last': 'last_d', 'first': 'c'}
Member: {'last': 'last_e', 'first': 'd'}
Member: {'last': 'last_a', 'first': 'e'}
Member: {'last': 'last_b', 'first': 'f'}
More than that, we can also iterate through a subset of columns by defining which key we want to start iterating at:
>>> mv = MemberView()
>>> mv.start_col = "c"
>>> for m in mv: print m
...
Member: {'last': 'last_d', 'first': 'c'}
Member: {'last': 'last_e', 'first': 'd'}
Member: {'last': 'last_a', 'first': 'e'}
Member: {'last': 'last_b', 'first': 'f'}
You'd think there would be a way to specify the end of the range for this iteration, and there is, although it involves a bit of trickery at this point (likely a minor oversight in the API design; I'll put together a quick patch after writing this up).
>>> mv = MemberView()
>>> for a in mv._cols("c", "e"): print a
...
Column(timestamp=1274654073, name='c', value='c')
Column(timestamp=1274654073, name='d', value='d')
Column(timestamp=1274654073, name='e', value='e')
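To make the range semantics concrete: the output above suggests that columns come back in sorted order and that both ends of the range are inclusive. A minimal sketch of the same contract over an in-memory list of column names could look like this. It is not how lazyboy or Cassandra implement slicing, and slice_columns is a made-up helper.

```python
from bisect import bisect_left, bisect_right


def slice_columns(names, start="", finish=""):
    """Return sorted column names in [start, finish]; "" means unbounded."""
    ordered = sorted(names)
    lo = bisect_left(ordered, start) if start else 0
    hi = bisect_right(ordered, finish) if finish else len(ordered)
    return ordered[lo:hi]


names = ["e", "a", "c", "f", "b", "d"]
from_c = slice_columns(names, start="c")               # like start_col = "c"
c_to_e = slice_columns(names, start="c", finish="e")   # like _cols("c", "e")
```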
Now we've moved from only using Cassandra as a key-value store to also being able to iterate through all or portions of the keys in a given ColumnFamily.
Managing Relations
The final step in implementing our master plan is to store the list of Members assigned to a particular Task. At the simplest you might try storing a comma-separated list of keys in the Task itself, but that approach will become unreliable as write volume increases because there is no value merging on conflict; rather, the latest write wins. (I realize this is a bit hand-wavy, but I've never succeeded in explaining this succinctly, so I'll need to devote another entry to it rather than trying to be ambitious in this one.)
Rather, a more successful approach is to use a new ColumnFamily (for us, TaskMember) to store the relations. For each Task we'll also create a new TaskMember with the same key, and will use it to store relations to Members.
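A small simulation may help with the hand-waving. Below, two writers each read a task's member list, add one member, and write the result back, with a plain dict standing in for a row and last-write-wins applied per column. Everything here is a toy model with invented names, not Cassandra or lazyboy code.

```python
def write(row, column, value, timestamp, clock):
    """Last write wins per column: keep the value with the newest timestamp."""
    if timestamp >= clock.get(column, -1):
        row[column] = value
        clock[column] = timestamp


# Approach 1: one "members" column holding a comma-separated list.
row, clock = {"members": ""}, {}
seen_by_a = row["members"]      # both writers read the same empty list
seen_by_b = row["members"]
write(row, "members", ",".join(filter(None, [seen_by_a, "will"])), 1, clock)
write(row, "members", ",".join(filter(None, [seen_by_b, "jill"])), 2, clock)
csv_result = row["members"]     # will's addition has been overwritten

# Approach 2: one column per member, as in the TaskMember ColumnFamily.
row, clock = {}, {}
write(row, "will", "1", 1, clock)
write(row, "jill", "1", 2, clock)
per_column_result = sorted(row)  # both additions survive
```

With one column per member, the two writers touch different columns, so there is nothing for last-write-wins to clobber.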
First let's create some members and a Task, and also create a TaskMember which has the same key as the Task.
>>> from tasks import *
>>> add_member("will", "will", "larson")
>>> add_member("bill", "bill", "fakename")
>>> add_member("jill", "jill", "lastnamehere")
>>> t = Task().update({"name":"Web App 1",
...     "desc":"We will build a fantastic web application"}).save()
>>> t.key
{'column_family': 'Task', 'keyspace': 'TaskManager',
'super_column': None, 'key': '52a6c23e1afd480991e2232f5e7d9ba8'}
>>> t.key.key
'52a6c23e1afd480991e2232f5e7d9ba8'
>>> tm = TaskMember()
>>> tm.key = TaskMemberKey(t.key.key)
>>> tm.save()
TaskMember: {}
>>> tv = TaskView()
>>> tv.append(t)
Now let's add Will and Jill to the task, while leaving Bill all by his lonesome.
>>> tm = TaskMember().load(TaskMemberKey(t.key.key))
>>> will = Member().load(MemberKey("will"))
>>> will.key.key
'will'
>>> tm[will.key.key] = 1
>>> jill = Member().load(MemberKey("jill"))
>>> tm[jill.key.key] = 1
>>> tm.save()
Yep, all we're doing is adding a new key/value for each of the users we want in the project. A less contrived version of what we just did looks like:
>>> tm.update({"will":1, "jill":1}).save()
TaskMember: {'will': '1', 'jill': '1'}
Then you can simply retrieve it and iterate through it to determine the current members assigned to the task:
>>> tm = TaskMember().load(TaskMemberKey(t.key.key))
>>> for a in tm: print a
...
will
jill
And that's all there really is to it. Now we can do relations, key-values and iteration across ranges. There isn't too much out there that you can't build with these simple tools.
The End
This code is available on my fork of lazyboy on GitHub, and I'll see about sending a pull request to the official repository after I've done a bit of clean up.
Let me know if there are any questions or comments!