Reliably running tasks periodically is hard.
Getting to normal-case behavior of "exactly once" running requires
either a singleton instance (introducing a single point of failure)
or a leader election mechanism to determine which instance should be
running (which is why having elections as a primitive via something like
Chubby can be so powerful).
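To make the leader election idea concrete, here is a minimal sketch of lease-based election in Python. Everything in it is hypothetical: `InMemoryLeaseStore` stands in for a real consistent lock service (something like Chubby or Zookeeper), and `run_if_leader` is an illustrative helper, not any library's API. It only works within a single process, so treat it as a picture of the logic rather than something to deploy.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryLeaseStore:
    """Hypothetical stand-in for a consistent lock service (single process only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lease: Optional[Lease] = None

    def try_acquire(self, candidate: str, ttl: float) -> bool:
        """Take or renew the lease if it is free, expired, or already ours."""
        now = self._clock()
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == candidate:
            self._lease = Lease(candidate, now + ttl)
            return True
        return False


def run_if_leader(
    store: InMemoryLeaseStore,
    instance_id: str,
    ttl: float,
    task: Callable[[], None],
) -> bool:
    """Run the task only if this instance currently holds the lease."""
    if store.try_acquire(instance_id, ttl):
        task()
        return True
    return False
```

Each instance would call `run_if_leader` on every tick; only the lease holder runs the task, and another instance takes over once the lease expires. Note that if a task outlives its lease, two instances can still end up running at once, which is part of why "exactly once" is only the normal-case behavior.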
Before going too deep, a few definitions:
- Scheduling is deciding when and whether a task should run.
- Orchestration is deciding where and how a task should run.
Even once you have the ability to correctly schedule tasks, you still
need a second mechanism to orchestrate them somewhere, and doing this
effectively requires a fairly significant amount of coupling between
the scheduler and orchestrator. For example, whether a task
completed successfully is information held by the orchestrator, but
the conditions under which a task should be restarted are potentially behavior you'd
want determined by the scheduler, especially for tasks
which are running long (e.g. do you want front-of-line blocking behavior
or not).
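A small sketch of that coupling, in Python. The names here (`FakeOrchestrator`, `TaskState`, `on_tick`, `allow_overlap`) are all made up for illustration and don't correspond to any particular scheduler or orchestrator. The point is just that the scheduler's "should this run now?" decision has to ask the orchestrator what happened to the previous run.

```python
from enum import Enum
from typing import Dict, List


class TaskState(Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeOrchestrator:
    """Hypothetical orchestrator: launches tasks and tracks how they ended."""

    def __init__(self) -> None:
        self.states: Dict[str, TaskState] = {}
        self.launches: List[str] = []

    def status(self, task: str) -> TaskState:
        return self.states.get(task, TaskState.NOT_STARTED)

    def launch(self, task: str) -> None:
        self.launches.append(task)
        self.states[task] = TaskState.RUNNING


def on_tick(
    orchestrator: FakeOrchestrator,
    task: str,
    allow_overlap: bool = False,
) -> bool:
    """Scheduler decision for one scheduled tick. Returns True if launched."""
    still_running = orchestrator.status(task) is TaskState.RUNNING
    if still_running and not allow_overlap:
        # The previous run blocks this one: a long-running task delays
        # every run queued up behind it.
        return False
    orchestrator.launch(task)
    return True
```

With `allow_overlap=False`, a slow run causes later ticks to be skipped; with `allow_overlap=True`, runs can pile up on top of each other. Either way the policy lives in the scheduler, while the state it depends on lives in the orchestrator.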
We've been chatting more about this problem space at work, and it's been
a while since I've explored the options in this space, so I decided to look
around a bit.
First, a few thoughts about the features we want:
- Language agnostic - we'd like to have one framework which can run
periodic tasks for all programming languages we use, not have to deploy
one for each language.
- Familiar deployment paradigm - getting deployment right (with code review,
linting, rollbacks, etc) is hard, and we'd prefer to use a single deployment
paradigm and mechanism for periodic and long-running processes if possible.
This is important both from a leverage perspective (we can improve everyone's
experience in one place), and also from a training and adoption perspective.
- Reliable - these are business critical tasks, and the scheduling and
orchestration components both need to be reliable and predictable.
- Reusable - ideally we could use the same orchestrator for both our
periodic tasks and long-running ones. This will reduce our maintenance overhead,
allow us to gain operational expertise more quickly, and also leave the door
open to bin-packing based fleet efficiency optimization further down the line.
- No vendor lock-in - ideally we'd find a solution that doesn't require vendor
lock-in, e.g. proprietary cloud solutions from AWS, GCP or Azure.
With those features in mind,
I spent some time digging around for common solutions:
- AWS Lambda with Scheduled Events
is a cloud solution that should solve most straightforward use cases,
both from a scheduling and an orchestration perspective. It is not
particularly flexible in either regard, but it does give you
the same primitives as cron, and if you happen to be using their
supported languages (Node.js, Java, C#, Python), then it might be
sufficient.
(You can also use a hybrid Amazon EC2 Container Service and AWS Lambda
approach if you need more flexibility in your orchestration layer.)
- Google's Cloud Functions can be paired
with App Engine Cron Service,
coordinating over Google Cloud Pub/Sub,
to get more or less the same scheduling behavior as AWS Lambda with Scheduled Events,
albeit with more pieces to futz with.
Google Cloud Functions are still a bit limiting in terms of only supporting the Node.js runtime today,
but one imagines they'll add more support over time.
(You can of course get creative and have jobs call running services, allowing you to
break out of Cloud Functions' language restrictions.)
- Chronos is
a scheduler running on top of Mesos, which handles both the scheduling and orchestration
aspects for you, and gives a good degree of flexibility in both.
Running Mesos is a bit heavy, but this certainly makes sense if you already have
operational expertise with running Mesos.
- Kubernetes' Cron Jobs
give you a solution similar to Chronos, except running on Kubernetes instead of Mesos,
for organizations which already have it deployed.
- cron
is still used pretty frequently as a scheduler, and if you run it
in a prebaked AMI in an AutoScaling Group
with a size of one instance, then you can rely on the ASG for "election"
of a single instance. You do have a single point of failure, but it'll recover
relatively quickly. What you don't have is any orchestration primitives,
so you would still need to integrate this with a second system that handles
the orchestration aspects (e.g. spinning up a container on
Amazon EC2 Container Service or calling into
an AWS Lambda). On the plus side, you can
use your existing server imaging and deployment strategies.
- Python Celery is used by many Python shops for
this kind of functionality, although it suffers from most of the same scheduling
challenges as cron, and orchestration is both fairly naive and restricted to
Python (although I just found a Go implementation of Celery workers,
which is a terrifying find).
There are a bunch of other similar solutions in this category, both in the Python
space and in other languages.
- Dkron is purely a scheduler, which aims to provide fault-tolerant
scheduling, even if some nodes fail. (E.g. solving the leader-election and
leader handoff problems for you, as opposed to building your own on top of
Zookeeper, etc.) It also provides a nice UI, although depending on your
security and compliance needs, it's possible that UI is a mixed blessing.
- Bistro is broadly in this space, but
feels much more targeted at running jobs once for every resource in a fleet,
as opposed to running a job once somewhere on some resource in a fleet.
Of those options, it feels like for larger companies, you'll likely end up with
either a cloud-based solution (AWS Scheduled Lambdas or App Engine Cron),
Chronos if you're already happy running Mesos, or Kubernetes' Cron Jobs if
you're already happy running Kubernetes.