Stripe is starting to build out a load generation team in Seattle (that’s a posting for San Francisco,
but also works for Seattle), and consequently I’ve
been thinking more about load generation lately. In particular, I’ve been thinking that I know
a lot less about the topic than I’d like to, so here is a collection of sources and reading notes.
Hopefully, I’ll synthesize these into something readable soon!
The Interesting Questions
Perhaps because many companies never develop a mature solution for load generation,
and because none of the open source solutions command broad awareness (except maybe JMeter?),
it tends to be a place with far more opinions than average, and consequently there
are quite a few interesting questions to think through.
Let’s start by exploring a few of those.
Should you be load testing? Surprisingly few companies invest much into load testing,
so it’s never entirely clear if you should be investing at a given point in time.
My anecdotal impression is that companies which “believe in QA” tend to invest into load
testing early, because they have dedicated people who can build the tooling and integration,
and that most other companies tend to ignore it until they’re doing a significant amount of
unplanned scalability investment. Said differently, for most companies
load testing is a mechanism to convert unplanned scalability work into planned scalability work.
Should you be load testing, redux? Beyond whether you should invest into building load testing tooling,
my colleague Davin suggested an interesting perspective: most of the metrics
generated by load testing can also be obtained through thoughtful
instrumentation and analysis of your existing traffic.
What layer of your infrastructure should you load test against?
Depending on the application you’re running, it may be easy to generate load against your external interfaces (website, API, etc)
but as you go deeper into your infrastructure you may want to run load
against a specific service or your stateful systems (Kafka, databases, etc).
What environment should you run your tests against? Perhaps the most common argument when
rolling out load testing is whether you should run it against an existing QA environment,
against a dedicated performance environment, or against your production environment.
This depends a great deal on the layer you’re testing at, and on whether you’re doing load testing
(how does the system react to this traffic?) or stress testing (at what load does the system fail?).
How should you model your traffic? Starting with the dead simple Siege,
there are quite a few different ways to think about generating your load.
Should you send a few request patterns at a high concurrency? Should you model your traffic using
a state machine (codified in a simple script, or perhaps in a DSL), or should you just replay
sanitized production traffic?
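One way to make the state-machine option concrete: model a user session as a small Markov chain over request types and walk it to produce request sequences. A minimal sketch, where the endpoints and transition probabilities are entirely made up:

```python
import random

# Hypothetical user journey modeled as a tiny Markov chain: each state is a
# request type, and transition probabilities capture how users move between them.
TRANSITIONS = {
    "home":     [("search", 0.6), ("product", 0.3), ("exit", 0.1)],
    "search":   [("product", 0.7), ("search", 0.2), ("exit", 0.1)],
    "product":  [("checkout", 0.3), ("search", 0.4), ("exit", 0.3)],
    "checkout": [("exit", 1.0)],
}

def simulate_session(rng: random.Random, max_steps: int = 20) -> list[str]:
    """Walk the state machine from `home`, returning the request sequence."""
    state, path = "home", ["home"]
    for _ in range(max_steps):
        r, acc = rng.random(), 0.0
        choices = TRANSITIONS[state]
        for nxt, p in choices:
            acc += p
            if r <= acc:
                state = nxt
                break
        else:
            state = choices[-1][0]  # guard against float rounding
        if state == "exit":
            break
        path.append(state)
    return path

rng = random.Random(42)
sessions = [simulate_session(rng) for _ in range(5)]
```

A load generator would then issue one request per state visited, at whatever concurrency you’re targeting.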
Is the distinction between integration testing and load testing just a matter of scale?
As you start thinking about it, integration testing is running requests against a server and
ensuring the results are reasonable, and that’s pretty much the exact same thing as
load testing. It’s pretty tempting to try to solve both problems with one tool.
How should you configure load tests? Where should that configuration live?
Wading into the whole “is configuration code?” debate, where should the configuration
for your load tests live? Ideally it would be in your service’s repository, right?
Or perhaps it should be stored in a dynamic configuration database somewhere to
allow more organic exploration and testing? Hmm.
Alright then, with those questions in mind, time to read some papers.
As part of my research, I ran into an excellent list of stress and load testing papers
from the Software Engineering Research Group at University of Texas Arlington. Many of these
papers are pulled from there, and it’s a great starting point for your reading as well!
For [ultra large scale systems], it is impossible for an analyst to skim
through the huge volume of performance counters to find the
required information. Instead, analyst employ few key
performance counters known to [her] from past practice,
performance gurus and domain trends as ‘rules of thumb’. In a
ULSS, there is no single person with complete knowledge of end
to end geographically distributed system activities.
That rings true to my experience.
With increasingly complex systems,
it is remarkably hard to actually find the performance choke points,
and load testing aspires to offer the distributed
systems equivalent of a profiler. (This is also why QA environments
are fraught with failures: as systems become increasingly complex, creating a useful
facsimile becomes rather hard.)
A few other interesting ideas:
Fairly naive software can batch together highly correlated metrics, to greatly reduce
the search space for humans to understand where things are degrading.
Most test designs mean they can only run occasionally (as opposed to continuously), and interfering workloads
(e.g. a weekly batch job occurring during your load test)
can easily invalidate the data of those infrequent runs.
Many tests end up with artificial constraints when run against non-production environments
like QA, for example running against underpowered or misconfigured databases.
They use “Principal Component Analysis” to find a minimal set of principal components,
which are uncorrelated with each other by construction, so you have less redundant data to explore.
After they’ve found the key Principal Components, they then convert those back into the underlying
counters, which are human comprehensible, for further analysis.
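To make the PCA idea a bit more concrete, here is a rough sketch of that pipeline using numpy on synthetic counter data: reduce correlated counters to a few principal components, then map each component back to the raw counter that loads most heavily on it. The counters, correlations, and the 90% variance cutoff are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic performance counters: 200 samples of 6 counters, where two groups
# are strongly correlated (think CPU, run-queue length, context switches).
base = rng.normal(size=(200, 2))
counters = np.column_stack([
    base[:, 0],
    base[:, 0] * 0.9 + rng.normal(scale=0.1, size=200),
    base[:, 0] * 1.1 + rng.normal(scale=0.1, size=200),
    base[:, 1],
    base[:, 1] * 0.8 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),  # uncorrelated noise counter
])

# PCA via SVD on the standardized data.
X = (counters - counters.mean(axis=0)) / counters.std(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep the fewest components covering ~90% of the variance...
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

# ...then map each retained component back to the raw counter that loads
# most heavily on it, so humans can inspect familiar counters, not PCs.
key_counters = [int(np.argmax(np.abs(Vt[i]))) for i in range(k)]
```

The last step mirrors the paper’s trick of converting principal components back into human-comprehensible counters for analysis.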
Ideally, you should be able to use historical data to build load signatures, such that you
could quickly determine if the system’s fundamental constraint has shifted (e.g. you’ve
finally fixed your core bottleneck and got a new one, or something else has degraded such
that it is now your core bottleneck).
In particular, my takeaway is that the right load generation tool will probably
start with a fairly simple approach with manually identified key metrics, and then
move increasingly to using machine learning to avoid our implicit biases around where our
systems ought to be slow.
to design an accurate artificial load generator which is responsible to act in a flexible
manner under different situations we need not only a load model but also a formal method to specify
the realistic load.
I’m not sure I entirely agree that we need a formal model to get value from our load testing;
we are, after all, trying to convert unplanned scalability work into planned scalability work.
Still, I have such an industry focus that it’s a fascinating idea to me that you would even try to
create a formal model here.
It also introduces a more specific vocabulary for discussing load generation:
The workload or load L=L (E, S, IF, T) denotes the total sequence of requests which
is offered by an environment E to a service system S via a well-defined interface IF during the time interval T.
Perhaps more interesting is the emphasis on how quality of service
forces us to redefine the goal of load testing:
The need for telecommunication networks capable of providing communication services such as
data, voice and video motivated to deliver an acceptable QoS level to the users, and their success
depends on the development of effective congestion control schemes.
I find this a fascinating point. As your systems start to include more systematic load shedding mechanisms,
it becomes increasingly challenging to push them out of equilibrium because they (typically) preserve
harvest by degrading yield.
It’s consequently not enough to say that your systems should or should not fail at a given level of load,
you also have to start to measure if it degrades appropriately based on load levels.
In section 3.5, it explains (the seemingly well known, albeit not to me) UniLoG, also known as the Unified Load Generator.
(Which is perhaps based on this paper,
which is sadly hidden behind a paywall.) UniLoG has an interesting architecture with intimidatingly
exciting component names like PLM, ELM, GAR, LT, ADAPT and EEM. As best I can tell it is an extremely
generic architecture for running and evaluating load experiments. It feels slightly overdesigned
from my first reading, but perhaps as one spends more time in the caverns of load generation it
starts to make more sense.
In section 4.4, it discusses centralized versus distributed load generation, which feels like
one of the core design decisions you need to make for building such a system. My sense is that
you likely want a distributed approach at some point, if only to avoid getting completely throttled
by QoS mechanisms operating on per-IP rate limits.
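As a rough sketch of one piece of the distributed flavor, even before you get to the networking: a coordinator has to split a target request rate across workers (say, one per host or IP) so no single source trips a per-IP limit. The names and numbers here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

TARGET_RPS = 1200
WORKERS = 8  # e.g. one worker per host/IP to stay under per-IP rate limits

def worker_share(worker_id: int, total_rps: int, workers: int) -> int:
    """Split the target rate evenly, distributing any remainder."""
    share, rem = divmod(total_rps, workers)
    return share + (1 if worker_id < rem else 0)

def run_worker(worker_id: int) -> tuple[int, int]:
    rps = worker_share(worker_id, TARGET_RPS, WORKERS)
    # Real code would open connections and pace requests at `rps`; here we
    # just report what each worker would send in one second.
    return worker_id, rps

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = dict(pool.map(run_worker, range(WORKERS)))
```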
The rest of the paper focuses on some case studies and such. Overall, it was a surprisingly thorough
introduction to the related research.
It summarizes the challenges of analyzing load test results as: outdated documentation,
process-level profiling is cost prohibitive, load testing typically occurs late in the development
cycle with short timelines, and the output from load tests can be overwhelmingly large.
I think the most interesting takeaway for me is the idea of very explicitly decoupling the gathering
of performance data from its analysis. For example, you could start logging performance data early on
(and likely your metrics tool, e.g. Graphite, already is capturing that data), and invest into more
sophisticated analysis much later on. There is particular focus on comparing results across multiple
load test runs, which can at a minimum narrow in on where performance “lives” within your metrics.
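A toy version of that decoupled analysis step might look like this: given latency samples gathered from two runs, compute a percentile per endpoint and flag regressions. The endpoints, samples, and 20% threshold are all invented for illustration:

```python
# Hypothetical latency samples (ms) from two load test runs, keyed by
# endpoint; in practice these would come from your metrics store.
baseline = {"GET /users": [40, 42, 41, 45, 43], "POST /charges": [80, 85, 82, 84, 90]}
candidate = {"GET /users": [41, 43, 42, 44, 42], "POST /charges": [120, 130, 125, 128, 131]}

def p95(samples):
    """Crude 95th percentile: index into the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

# Flag endpoints whose p95 got more than 20% worse between runs.
regressions = {
    ep: (p95(baseline[ep]), p95(candidate[ep]))
    for ep in baseline
    if p95(candidate[ep]) > 1.2 * p95(baseline[ep])
}
```

The point is that the comparison logic is entirely separate from how the samples were gathered, so you can improve it long after the runs happened.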
Even more papers…
Some additional papers with short summaries:
Converting Users to Testers - This paper discusses recording user traffic as an input to your load testing, with the goal of reducing time spent writing load generation scripts.
Automatic Feedback, Control-Based, Stress and Load Testing - This paper
explores the idea of systems that try to drive and maintain load on a system at a targeted threshold. This is an interesting idea
because it would allow you to consistently run load against your production environment. The only caveat is that you have
to first identify the inputs you want to use to influence that load, so you still need to model the incoming traffic in
order to use it as an input (or record and sanitize real traffic). At least once you have modeled it, you can be more
abstract in how you use that model: if your target is to create load, you don’t necessarily need to simulate realistic
traffic, and you could use something like an n-armed bandit approach to “optimize” your load toward the correct amount
of load against the system. (Similarly, this paper tries to do that using genetic algorithms.)
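The feedback-control idea is simple enough to sketch in a few lines: a proportional controller nudges the offered request rate until a measured signal settles at a target. The gain, target, and the linear "system" here are all made up; a real implementation would measure live utilization or latency instead:

```python
# Toy closed loop: a proportional controller adjusts the offered request
# rate so a simulated system's utilization settles at a target.
TARGET_UTIL = 0.7
GAIN = 200.0           # rate adjustment per unit of utilization error
CAPACITY_RPS = 1000.0  # hidden to the controller: where the fake system saturates

def observe_util(rate: float) -> float:
    """Stand-in for a real measurement; here utilization is linear in rate."""
    return min(rate / CAPACITY_RPS, 1.0)

rate = 100.0
for _ in range(100):
    error = TARGET_UTIL - observe_util(rate)
    rate = max(0.0, rate + GAIN * error)
```

With this setup the rate converges to 700 RPS, the point where observed utilization matches the target; the appeal is that the controller never needed to know the system’s capacity up front.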
I took a brief look at Gatling, a lightweight load testing DSL written in Scala
that can be easily run by Jenkins or the like. This seems like an interesting potential starting point
for building a load generation tool. In particular the concept of treating your load tests as something
you would check into your repository feels right, allowing you to iterate on your load tests like you
would anything else.
Reading through a few
writeups on Gatling gave me a stronger sense that this might be a useful component of an overall load testing system
(that allowed, e.g. many instances to be run against different endpoints or such).
The basic principle underlying the design and elaboration of
UniLoG has been to start with a formal description of an abstract load model and thereafter
to use an interface-dependent adapter to map the abstract requests to the concrete requests
as they are “understood” by the service providing component at the real interface in question.
It also does a nice job of exploring ways to generate requests, although again coming back to
either using logs of existing traffic or generating a model which defines your workload. There
is an interesting hybrid here which would be using the distribution of actual usage as an input
for the generated load (as opposed to replaying traffic on a one-to-one basis).
That said, unfortunately, I didn’t really get much out of it.
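That hybrid idea, using the observed distribution of production traffic to shape generated load rather than replaying it verbatim, is easy to sketch: sample request types weighted by their production frequencies. The request types and counts here are hypothetical:

```python
import random
from collections import Counter

# Hypothetical counts of request types pulled from production logs; the
# generated load follows the same distribution without replaying raw traffic.
observed = Counter({"GET /feed": 700, "GET /profile": 200, "POST /comment": 100})

population = list(observed)
weights = [observed[k] for k in population]

rng = random.Random(7)
generated = [rng.choices(population, weights=weights)[0] for _ in range(10_000)]
```

Roughly 70% of the generated requests hit the feed endpoint, matching production, while the payloads themselves can be synthetic.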