Reflection on RethinkDB
Since first reading about RethinkDB I’ve been thinking about giving it a go, and this weekend I finally did, throwing together a very simple example which uses RethinkDB to store crawled pages and extracted metadata.
This post contains a few initial, uninformed thoughts.
Each Database In Its Rightful Niche
Quite a few open source NoSQL databases have achieved prominence over the past five years or so: Cassandra, CouchDB, MongoDB, Voldemort and so on. To my mind, RethinkDB has captured some of the best ideas from Cassandra and CouchDB.
From Cassandra come the goals of:
- easy system maintenance,
- simple sharding, and
- reaching beyond a key-value store to a key-value-plus store.
From CouchDB come the ideas of:
- storing data in JSON documents,
- the ability to create indexes against those JSON documents,
- JavaScript powered map-reduce against your data, and
- feature-rich UI for interacting with the database.
RethinkDB also avoided some of the expensive or conceptually powerful features from those projects that ended up having less impact, including: HTTP API As A Feature, Java As A Feature, Erlang As A Feature, Synchronization Over Consistency As A Feature, and Must Fit Everything On One Server As A Feature.
Using the Python Client
The RethinkDB API is impressive and well-documented; I got to know a small piece of it and its Python driver while building a short example. A few issues came up that might be worth changing.
My biggest frustration was that the recommended way to use the client invisibly manages connections, and you have to read through the source code a bit to figure out how to do something safer (hardly the end of the world).
The examples recommend:
    :::python
    from rethinkdb import r
    r.connect('localhost', 28015)
    r.db_create('crawl').run()
    r.db('crawl').table_create('pages').run()
In that case, though, you’re sharing a single connection by default across different modules, meaning that your connection might get closed by someone else without you really having any awareness of it. I believe you could also end up with the default database being altered if you’re not careful with initialization races, which would introduce some confusing phenomena to debug.
To accomplish the same thing without sharing your connection, the syntax is definitely a bit less polished:
    :::python
    from rethinkdb import r
    from rethinkdb.net import Connection
    conn = Connection("localhost", 28015, "crawl")
    r.db_create('crawl').run(conn=conn)
    r.db('crawl').table_create('pages').run(conn=conn)
One caveat: if some other piece of code uses the `from rethinkdb import r` approach, it will know about your connection through a module variable and will piggyback on it. From my quick reading, if any code uses the default approach, you are in moderate peril of confusion. If possible, I’d rewrite the Python client to remove the shortcut entirely, or at least change it to update `rethinkdb.net._last_connection` only when you use the shortcuts, not when you create the connection manually.
The reason I’m somewhat passionate about exposing connection pooling is that you’ll probably need a different approach depending on whether you are using threads, processes or gevent, and picking the wrong one happens quite a bit, with the system’s performance or the developer’s mental model as the casualty.
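As one illustration of what I mean, a threaded program might want one connection per thread. The sketch below assumes a hypothetical `make_connection` factory (here it just returns a placeholder object; real code would build a driver connection there) and uses the standard library’s `threading.local`. A multiprocessing or gevent program would need a different strategy, which is exactly why I’d like the choice to be explicit.

```python
import threading


class ThreadLocalConnections(object):
    """Hands each thread its own lazily created connection."""

    def __init__(self, factory):
        self._factory = factory          # zero-argument callable
        self._local = threading.local()  # per-thread attribute storage

    def get(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._factory()
            self._local.conn = conn
        return conn


def make_connection():
    # Hypothetical stand-in: real code would open a driver connection here.
    return object()


pool = ThreadLocalConnections(make_connection)
seen = {}


def worker(name):
    # Two lookups in the same thread should return the same connection.
    seen[name] = (pool.get(), pool.get())


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread reuses its own connection and never sees another thread’s, so nobody can close a connection out from under someone else.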
The other issue I ran into with the client is that the exceptions aren’t quite as granular as I’d like them to be, so it’s hard for calling code to react based on the exception that is thrown unless you already know which error might be thrown and why.
    :::python
    from rethinkdb import r
    from rethinkdb.net import ExecutionError
    # if a primary key you retrieve doesn't exist, ExecutionError
    try:
        found = self.client.db(self.db).table(self.html_table).get(url).run()
    except ExecutionError as ee:
        print(ee)
    # if a db already exists, ExecutionError
    try:
        self.client.db_create(self.db).run()
    except ExecutionError as ee:
        print(ee)
    # if a table already exists, ExecutionError
    try:
        self.client.db(self.db).table_create(table).run()
    except ExecutionError as ee:
        print(ee)
Having distinct exceptions (inheriting from a shared parent exception) helps a great deal, especially when some of the failures can legitimately be recovered from and others cannot.
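Here is a minimal sketch of the kind of hierarchy I have in mind. Every class and function name below is invented for illustration; none of them exist in the RethinkDB driver.

```python
class QueryError(Exception):
    """Hypothetical shared parent for every query failure."""


class DatabaseExistsError(QueryError):
    """Raised when creating a database that is already there."""


class TableExistsError(QueryError):
    """Raised when creating a table that is already there."""


class DocumentNotFoundError(QueryError):
    """Raised when fetching a primary key that does not exist."""


def ensure_database(create_db):
    """Call create_db(); treat 'already exists' as success.

    Any other QueryError is not recoverable here, so it propagates.
    """
    try:
        create_db()
        return "created"
    except DatabaseExistsError:
        return "already existed"


def create_ok():
    pass


def create_existing():
    raise DatabaseExistsError("crawl")


def create_broken():
    raise QueryError("server unreachable")
```

With this shape, setup code can shrug off `DatabaseExistsError` while still letting unexpected failures surface, and callers who don’t care about the distinction can catch `QueryError`.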
On the same theme of sane error handling, while `db_create`, `table_create` and `get` are raising `ExecutionError` on failure, the inserts use return status codes instead:
    :::python
    success = self.client.db(self.db).table(
        self.page_table).insert({'id': 'hi'}).run()
    if 'first_error' in success:
        raise Exception("num errors %s, first error %s" %
                        (success['errors'], success['first_error']))
I certainly see the reasoning behind return status codes making more sense for the inserts (which might insert a number of documents, some of which succeed and some of which fail), but I’m not particularly a fan of the inconsistency. Perhaps it would be possible to update the exception-throwing operations to return status codes similar to insert.
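Until the two styles converge, one workaround is a small wrapper that turns the insert status dictionary into an exception, so calling code only has to deal with one style. This is a sketch: `InsertError` and `check_insert` are names I made up, and the only result keys it relies on are the `errors` and `first_error` keys used in the snippet above.

```python
class InsertError(Exception):
    """Hypothetical exception carrying the insert status details."""

    def __init__(self, errors, first_error):
        Exception.__init__(
            self, "num errors %s, first error %s" % (errors, first_error))
        self.errors = errors
        self.first_error = first_error


def check_insert(result):
    """Return the result unchanged, or raise if it reports any error."""
    if result.get("errors", 0) or "first_error" in result:
        raise InsertError(result.get("errors", 0), result.get("first_error"))
    return result


# Example status dictionaries, shaped like the ones in the snippet above.
ok_result = {"errors": 0}
bad_result = {"errors": 1, "first_error": "duplicate primary key"}
```

The downside is that a partially successful batch becomes an all-or-nothing exception, which is presumably why insert returns a status in the first place.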
What I Really Want
After my first usage, it’s safe to say that I like RethinkDB quite a bit. It hits most of my typical needs:
- it’s a scalable key-value store,
- it can index its key-value data,
- you can do some lightweight analysis across your dataset.
There are two pieces of functionality that RethinkDB doesn’t have (and hey, you have to draw some lines or your system will do everything, poorly) which would make it perfect for the larger scale applications I’ve worked with.
First, integration with a more full-fledged data warehouse. I’d love to see a way to tie the data within RethinkDB into HDFS so that analysts/data scientists could use their existing toolkits to work against the stored data.
Second (and this is a feature on the eventual roadmap) is secondary indexes. You can already simulate them by creating more tables (this is how developing with Cassandra used to work), but built-in support would definitely be a huge savings in time (this is one of the key pieces of magic which, in my opinion anyway, makes MongoDB so popular).
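For anyone who hasn’t had the pleasure, simulating an index with an extra table looks roughly like the sketch below. Plain dictionaries stand in for the two tables, and the `domain` field and function names are invented for the example; the point is that every write has to keep both structures in step by hand.

```python
pages = {}            # primary "table": id (url) -> document
pages_by_domain = {}  # hand-maintained "index table": domain -> set of ids


def insert_page(doc):
    """Write the document and keep the lookup table in sync."""
    old = pages.get(doc["id"])
    if old is not None:
        # The document may have moved to another domain: drop the stale entry.
        pages_by_domain[old["domain"]].discard(old["id"])
    pages[doc["id"]] = doc
    pages_by_domain.setdefault(doc["domain"], set()).add(doc["id"])


def find_by_domain(domain):
    """Look up documents through the index table instead of scanning."""
    return [pages[key] for key in sorted(pages_by_domain.get(domain, ()))]


insert_page({"id": "http://a.example/1", "domain": "a.example"})
insert_page({"id": "http://a.example/2", "domain": "a.example"})
insert_page({"id": "http://b.example/1", "domain": "b.example"})
# Re-inserting with a changed domain must also move the index entry.
insert_page({"id": "http://a.example/2", "domain": "b.example"})
```

In a real database the two writes are not atomic either, so the bookkeeping gets worse than this toy version suggests.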
Overall, as long as the scalability and performance stories turn out to work at larger data quantities, I could absolutely envision moving over to RethinkDB for new projects, even without the additional features I mentioned above. I’m certainly hopeful that the performance and stability turn out as promised; it would be a fantastic option.