August 18, 2009.
After reading Lampson's Hints for Computer System Design, the topic which has most grabbed my fancy was fault tolerence and fault recovery. Inherently fascinating, it has taken on additional intrigue as we work at bringing sanity-in-failure to the rather involved system we are building. As a result, the next paper I picked up was Jim Gray's A Transaction Model, 1980.
From a great height, A Transaction Model discusses how systems fail, concurrency, degrees of consistency, determining legality of concurrent transactions in systems with consistency requirements, dangers of concurrent transactions, solutions for concurrent transactions, distributed databases and possible aspects of a transaction-centric programming language.
As a quick prelude to the remainder of my summary, we should like include Gray's definition of transations, in which he posits two requirements:
Will be executed exactly once (reliability).
Will be isolated from temporary violations of the consistency constraint introduction by actions of concurrently executing transactions (consistency).
For many developers, of which I have been one, transactions are generally only discussed with regard to choosing between MySQL backends, or debating the relative merits of MySQL and PostgreSQL. However, it has been an emerging thought of mine that transactions apply at much higher levels as well, and that most systems could be described as transactions, with the due caveat that they are desired neither to be particularly reliable nor to be particularly consistent.
Consider for example a Twitter client which is attempting to post a tweet. The pre-commit stage involves writing the tweet, and then the commit stage is the actual sending to Twitter as well as displaying it in the client's GUI. However, it is possible (perhaps likely, depending on your daily diet of snark) that Twitter is not available or you've been temporarily prohibitted from further API usage for the next hour, in that case the client needs to undo the pre-commit actions (remove your tweet from its GUI) and either reattempt the pre-commit and commit stages or simply abandon the transaction entirely.
I'm not sure that using transactions to model this Twitter client logic is particularly useful, but with greater ambition I imagine there are more interesting examples one devise where viewing the entire program as series of transactions would be a powerful model for simplification.
The paper goes through a thorough process of outlining the various kinds of errors--which I won't attempt to faithfully recreate here--but at the broadest discusses two kinds of failures: transaction restarts and system restarts. The distinction between the two is exactly what one would imagine from reading the terms.
More interesting is Gray's classification of resources into four buckets:
The net advantage of thinking in these terms is that it makes it clearer to determine where a transaction should reset to, and it also helps determine if your system is unrecoverable (dependent on a volatile value to recover post-system restart).
The last concept I'll mention--similar to the event log discussed in Lampson's Hint's for Computer System Design--is the concept of undo logs and redo logs which enable a system to rollback a transaction's pre-commit events.
In essence, before taking the pre-commit steps the actions are written to a redo-log. Then as each step is taken, it is popped off the redo-log and pushed onto the undo-log, such that any time the full list of events in the pre-commit stage can be performed and reversed as required.
With stable undo and redo logs, it becomes possible to sanely backout transactions which are underway when a system restart occurs. This is a fairly amazing property, which many systems would benefit from having, but rather few have.
I won't belabor this point as it isn't examined in great depth, but the broad theory here is that when concurrent transactions are written to a log in such a way that they are equivalent to some set of serial transactions, several positive properties arise.
When concurrent transactions cannot be record in such a way that they are equivalent to some serial transaction history, then the sanity of the system becomes suspect.
Before examining potential implementations for creating consistent transactions, Gray defines three degrees of consistency:
A Degree 3 consistency was defined before as a protocol which acquires locks to cover all actions and holds all locks to transaction commit. It was shown to prevent concurrency anomalies.
In order to support transaction restart, all systems acquire X-mode locks to cover writes and hold them to transaction commit. This is called the degree 1 consistency lock protocol.
If the system additionally acquires S-mode locks to cover reads but releases the locks prior to commit then it provides degree 2 consistency.
Regarding the relative performance of using the less thorough degree 1 and degree 2 consistencies, Gray wrote:
In the experiments we have done, degree 3 consistency has a cost (throughput and processor overhead) indistinguishable from the lower degrees of consistency.
I found his brief remarks regarding why others continue using the (supposedly) comparably efficient and less consistent degree 1 and degree 2 consistencies to be rather amusing, or at least painful.
Many times I have seen others--and most certainly been guilty myself--of desperately seeking an explanation for why we have already done something in a backwards fashion, and afterwards trying to explain why it really was a rational decision instead of just an ill-considered choice. Gray's two example excuses were complexity of implementation and imagined but unproven performance benefits. Both of which strike me as rather familiar utterances.
The issue of restarting a transaction when a node--throughout its lifespan--exists on many different nodes is discussed, and is a problem that maps closely to the trickiest problem we are working on at work.
Gray's suggestion is that each node must record the origin node it receives a given transaction from, record the redo-undo history for steps occuring on itself, and record the node the transaction next proceeds onto. In such a way a full redo-undo history may be maintained across nodes.
There is a great deal of interesting thoughtwork to be done here in regard to in-memory file-backed databases--which are themselves a distributed database--and then distributing those in-memory and file-backend aspects across multiple nodes.
In the final portion of the paper, Gray very briefly describes how his results on transactions might be incorporated into a future programming language. Part of the described language is quite similar to Java or Objective-C's method level locks (in Java one can add the synchronized directive to a method which transparently generates a lock).
The other portion has found less tranction (although it seems to me that McCarthy's (SP?) language Elephant--the sum of my knowledge about which is that its motto is an elephant never forgets--may be related), but seems quite intriguing:
adding language level support for maintaining the redo and undo logs required for transactions,
adding language level support for marking attributes as real, stable, recoverable-volatile and volatile, and having the language implementation take responsibility for maintaining the necessary record-keeping for implementing stable and recoverable-volatile attributes.
This is likely a situation where my youth fails me, but I'm not aware of any general purpose languages which have incorporated these, but I do think it would be fairly straight-forward to add a Python (or language-of-your-choice) library that would support programming using these transaction-centric and persistence-centric approaches.
A Transaction Model was a very approachable paper which contained a number of interesting ideas, and served as an excellent introduction to transactions. It is always surreal to read papers which are twenty-six years old but still entirely relevant, this one certainly is.
The most interesting aspect for me was--as mentioned before--fault tolerance and recovery from failed transactions, and I'll be raiding the cited works for additional reading material.