Options for orchestrating periodic tasks.
Reliably running tasks periodically is hard. Getting to normal-case behavior of “exactly once” running requires either a singleton instance (introducing a single point of failure) or a leader election mechanism to determine which instance should be running (which is why having elections as a primitive via something like Chubby can be so powerful).
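To make the leader election idea concrete, here is a minimal sketch of lease-based election, assuming a single consistent store that every instance can reach. `InMemoryLeaseStore` and `run_if_leader` are hypothetical names for illustration only; a real deployment would back the lease with something like Chubby or ZooKeeper and would need the acquire step to be atomic across processes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryLeaseStore:
    """Hypothetical stand-in for a consistent lock service.

    A real store must make try_acquire an atomic compare-and-set;
    this in-process version only illustrates the logic.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._lease: Optional[Lease] = None

    def try_acquire(self, candidate: str, ttl: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        now = self._clock()
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == candidate:
            self._lease = Lease(candidate, now + ttl)
            return True
        return False


def run_if_leader(store: InMemoryLeaseStore, instance_id: str,
                  task: Callable[[], None], ttl: float = 30.0) -> bool:
    """Run the task only if this instance currently holds the lease."""
    if store.try_acquire(instance_id, ttl):
        task()
        return True
    return False
```

Every instance calls `run_if_leader` on each tick, but only the lease holder runs the task. If the holder dies, its lease expires after `ttl` seconds and another instance takes over, which is the recovery behavior a singleton cannot give you on its own.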
Before going too deep, a few definitions:
- Scheduling is deciding when and whether a task should run.
- Orchestration is deciding where and how a task should run.
Even once you have the ability to correctly schedule tasks, you still need a second mechanism to orchestrate them somewhere, and doing this effectively requires a fairly significant amount of coupling between the scheduler and orchestrator. For example, determining if the task completed successfully is information in the orchestrator, but determining the conditions under which a task should be restarted is potentially behavior you’d want determined by the scheduler, especially around cases for tasks which are running long (e.g. do you want front-of-line blocking behavior or not).
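The coupling described above can be sketched roughly as follows. This is an illustration under assumptions, not any particular tool's design: `FakeOrchestrator`, `Scheduler`, and `OverlapPolicy` are hypothetical names, and the orchestrator is reduced to an in-memory set of running tasks.

```python
from enum import Enum


class OverlapPolicy(Enum):
    SKIP = "skip"    # drop this tick if the previous run is still going
    QUEUE = "queue"  # wait behind the previous run (front-of-line blocking)


class FakeOrchestrator:
    """Hypothetical orchestrator: knows where tasks run and whether they finished."""

    def __init__(self):
        self.running = set()
        self.launched = []

    def launch(self, task):
        self.running.add(task)
        self.launched.append(task)

    def finish(self, task):
        self.running.discard(task)

    def is_running(self, task):
        return task in self.running


class Scheduler:
    """Decides when and whether to run, but must ask the orchestrator for state."""

    def __init__(self, orchestrator, policy):
        self.orchestrator = orchestrator
        self.policy = policy
        self.pending = []

    def tick(self, task):
        """Called each time the task's schedule fires."""
        if not self.orchestrator.is_running(task):
            self.orchestrator.launch(task)
            return "launched"
        if self.policy is OverlapPolicy.QUEUE:
            self.pending.append(task)
            return "queued"
        return "skipped"

    def on_finished(self, task):
        """Completion callback: start one queued run of this task, if any."""
        if task in self.pending:
            self.pending.remove(task)
            self.orchestrator.launch(task)
```

Even in this toy version the scheduler cannot make its decision without querying the orchestrator (`is_running`) and being told about completions (`on_finished`), which is the coupling that makes it hard to treat the two components as fully independent.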
We’ve been chatting more about this problem space at work, and it’s been a while since I’ve explored the options in this space, so I decided to look around a bit.
First, a few thoughts about the features we want:
- Language agnostic - we’d like to have one framework which can run periodic tasks for all programming languages we use, rather than having to deploy one for each language.
- Familiar deployment paradigm - getting deployment right (with code review, linting, rollbacks, etc) is hard, and we’d prefer to use a single deployment paradigm and mechanism for periodic and long-running processes if possible. This is important both from a leverage perspective (we can improve everyone’s experience in one place), and also from a training and adoption perspective.
- Reliable - these are business critical tasks, and the scheduling and orchestration components both need to be reliable and predictable.
- Reusable - ideally we could use the same orchestrator for both our periodic tasks and long-running ones. This will reduce our maintenance overhead, allow us to gain operational expertise more quickly, and also leave the door open to bin-packing based fleet efficiency optimization further down the line.
- No vendor lock-in - ideally we’d find a solution that doesn’t require vendor lock-in, e.g. proprietary cloud solutions from AWS, GCP or Azure.
With those features in mind, I spent some time digging around for common solutions:
- AWS Lambda with Scheduled Events is a cloud solution that should solve most straightforward use cases, both from a scheduling and an orchestration perspective. It is not particularly flexible in either regard, but it does give you the same primitives as cron, and if you happen to be using their supported languages (Node.js, Java, C#, Python), then it might be sufficient. (You can also use a hybrid Amazon EC2 Container Service and AWS Lambda approach if you need more flexibility in your orchestration layer.)
- Google’s Cloud Functions can be paired with App Engine Cron Service, coordinating over Google Cloud Pub/Sub, to get more or less the same scheduling behavior as AWS Lambda with Scheduled Events, albeit with more pieces to futz with. Google Cloud Functions are still a bit limiting in terms of only supporting the Node.js runtime today, but one imagines they’ll add more support over time. (You can of course get creative and have jobs call running services, allowing you to break out of Cloud Functions’ language restrictions.)
- Chronos is a scheduler running on top of Mesos, which handles both the scheduling and orchestration aspects for you, and gives a good degree of flexibility in both. Running Mesos is a bit heavy, but this certainly makes sense if you already have operational expertise with running Mesos.
- Kubernetes’ Cron Jobs give you a solution similar to Chronos, except running on Kubernetes instead of Mesos, for organizations which already have it deployed.
- cron is still used pretty frequently as a scheduler, and if you run it in a prebaked AMI in an AutoScaling Group with a size of one instance, then you can rely on the ASG for “election” of a single instance. You do have a single point of failure, but it’ll recover relatively quickly. What you don’t have is any orchestration primitives, so you would still need to integrate this with a second system that handles the orchestration aspects (e.g. spinning up a container on Amazon EC2 Container Service or calling into an AWS Lambda). On the plus side, you can use your existing server imaging and deployment strategies.
- Python Celery is used by many Python shops for this kind of functionality, although it suffers from most of the same scheduling challenges as cron, and its orchestration is both fairly naive and restricted to Python (although I just found a Go implementation of Celery workers, which is a terrifying find). There are a bunch of other similar solutions in this category, both in the Python space and in other languages.
- Dkron is purely a scheduler, which aims to provide fault-tolerant scheduling, even if some nodes fail. (E.g. solving the leader-election and leader handoff problems for you, as opposed to building your own on top of Zookeeper, etc.) It also provides a nice UI, although depending on your security and compliance needs, it’s possible that UI is a mixed blessing.
- Bistro is broadly in this space, but feels much more targeted at running jobs once for every resource in a fleet, as opposed to running a job once somewhere on some resource in a fleet.
Of those options, it feels like larger companies will likely end up with one of three: a cloud-based solution (AWS Scheduled Lambdas or App Engine Cron), Chronos if you’re already happy running Mesos, or Kubernetes’ Cron Jobs if you’re already happy running Kubernetes.