Experiment with distributed finite state machines.

June 2, 2018. Filed under infrastructure, go

Most modern task systems such as RabbitMQ or Celery focus on queuing tasks and having a distributed fleet of consumers pull them down. This works remarkably well, but it also offloads expressing task workflows onto the applications themselves. If you want sophisticated workflows, you end up in a parallel universe of tools like Airflow, which are excellent but heavy, and designed to coordinate batch workflows rather than task workflows. There is also interesting prior art in Erlang's gen_fsm, but it's constrained to a single language.

I wanted to explore this idea, so I prototyped an interface named dfsmr, or Distributed Finite State Machine Runner. The experiment here is entirely in the interface; the implementation is the bare minimum needed to exercise that interface (i.e., not good). dfsmr's README explains the design, so this post will stick to a few higher-level comments.
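To make the core idea concrete, here is a minimal sketch of what "workflow as a state machine" means: named states with an explicit set of legal transitions, which a runner validates before applying. This is illustrative only, not dfsmr's actual API; the type and method names here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical workflow definition: for each state,
// the set of states a task may legally move to next.
type Machine struct {
	Initial     string
	Transitions map[string][]string
}

// Next validates a proposed transition, returning the new state,
// or an error if the machine doesn't define that move.
func (m Machine) Next(current, proposed string) (string, error) {
	for _, s := range m.Transitions[current] {
		if s == proposed {
			return proposed, nil
		}
	}
	return current, errors.New("illegal transition: " + current + " -> " + proposed)
}

func main() {
	// A toy email-sending workflow: failures loop back for retry.
	emailer := Machine{
		Initial: "received",
		Transitions: map[string][]string{
			"received": {"sending"},
			"sending":  {"sent", "failed"},
			"failed":   {"sending"},
		},
	}
	state := emailer.Initial
	state, _ = emailer.Next(state, "sending")
	state, _ = emailer.Next(state, "sent")
	fmt.Println(state) // sent

	// "sent" is terminal, so further transitions are rejected.
	if _, err := emailer.Next("sent", "sending"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The point of centralizing this table is that the runner, not each application, becomes responsible for enforcing workflow correctness.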

I found it surprisingly rewarding to focus exclusively on the interface design, decoupled entirely from the implementation details. Not too long ago I would have treated the interface as the last component to focus on. The early NoSQL movement was notorious for the opposite approach, forcing folks to learn the constraints of the system before they could use it for anything: their interfaces only supported highly scalable patterns, even when that required every user to master a wide range of concepts. Modern infrastructure design is biased towards incurring immense internal complexity in order to provide simple, intuitive interfaces to users (e.g. Cloud Spanner), and my sense is that absorbing complexity on behalf of our users is indeed the future.

gRPC was a particularly helpful tool for focusing on the interface design, in particular because it didn't let me "cheat" by shoving ever more key-value pairs into under-defined JSON blobs. The streaming methods especially make it easy to implement a streaming changelog akin to Redis' MONITOR command, which is an essential comprehension and debugging aid.
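The shape of that changelog is easy to sketch even outside of gRPC. Assuming, hypothetically, that every state transition is published as a change event, a server-streaming method reduces to fanning each event out to every connected subscriber; the names below are illustrative rather than dfsmr's:

```go
package main

import (
	"fmt"
	"sync"
)

// Change is a hypothetical changelog entry: which machine moved,
// and from/to which states.
type Change struct {
	MachineID string
	From, To  string
}

// Changelog fans every published change out to all subscribers,
// roughly what a gRPC server-streaming method would wrap.
type Changelog struct {
	mu   sync.Mutex
	subs []chan Change
}

// Subscribe returns a buffered channel that receives all
// subsequent changes.
func (c *Changelog) Subscribe() <-chan Change {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch := make(chan Change, 16)
	c.subs = append(c.subs, ch)
	return ch
}

// Publish delivers a change to every subscriber, dropping it for
// subscribers whose buffers are full rather than stalling the runner.
func (c *Changelog) Publish(ch Change) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, sub := range c.subs {
		select {
		case sub <- ch:
		default:
		}
	}
}

func main() {
	log := &Changelog{}
	sub := log.Subscribe()
	log.Publish(Change{MachineID: "task-1", From: "received", To: "sending"})
	ev := <-sub
	fmt.Printf("%s: %s -> %s\n", ev.MachineID, ev.From, ev.To)
}
```

Dropping events for slow subscribers is a deliberate choice here: a debugging stream like MONITOR should never be able to back-pressure the system it's observing.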

There is definitely space in the open source ecosystem for a tool along these lines, preferably one that efficiently supports delayed retries (loosely, priority queue based scheduling, as opposed to FIFO-based), is distributed across multiple servers, and has low operational overhead. Ya know, the easy stuff.