Braindump on Load Generation

Published on December 18, 2016.

Stripe is starting to build out a load generation team in Seattle (that’s a posting for San Francisco, but also works for Seattle), and consequently I’ve been thinking more about load generation lately. In particular, I’ve been thinking that I know a lot less about the topic than I’d like to, so here is a collection of sources and reading notes.

Hopefully, I’ll synthesize these into something readable soon!

The Interesting Questions

Perhaps because many companies never develop a mature solution for load generation, and because none of the open source solutions command broad awareness (except maybe JMeter?), it tends to be a place with far more opinions than average, and consequently there are quite a few interesting questions to think through.

Let’s start by exploring a few of those.

Should you be load testing? Surprisingly few companies invest much into load testing, so it’s never entirely clear if you should be investing at a given point in time. My anecdotal impression is that companies which “believe in QA” tend to invest into load testing early, because they have dedicated people who can build the tooling and integration, and that most other companies tend to ignore it until they’re doing a significant amount of unplanned scalability investment. Said differently, for most companies load testing is a mechanism to convert unplanned scalability work into planned scalability work.
Should you be load testing, redux? Beyond whether you should invest into building load testing tooling, my colleague Davin suggested an interesting perspective that most of the metrics generated by load testing can also be obtained through thoughtful instrumentation and analysis of your existing traffic.
What layer of your infrastructure should you load test against? Depending on the application you’re running, it may be easy to generate load against your external interfaces (website, API, etc) but as you go deeper into your infrastructure you may want to run load against a specific service or your stateful systems (Kafka, databases, etc).
What environment should you run your tests against? Perhaps the most common argument when rolling out load testing is whether you should run it against an existing QA environment, against a dedicated performance environment, or against your production environment. This depends a great deal on the layer you’re testing at, and on whether you’re doing load testing (how does the system react to this traffic?) or stress testing (at what load does the system fail?).
How should you model your traffic? Starting with the dead simple Siege, there are quite a few different ways to think about generating your load. Should you send a few request patterns at a high concurrency? Should you model your traffic using a state machine (codified in a simple script, or perhaps in a DSL), or should you just replay sanitized production traffic?
Is the distinction between integration testing and load testing just a matter of scale? As you start thinking about it, integration testing is running requests against a server and ensuring the results are reasonable, and that’s pretty much the exact same thing as load testing. It’s pretty tempting to try to solve both problems with one tool.
How should you configure load tests? Where should that configuration live? Merging into the whole “is configuration code?” debate, where should the configuration for your load tests live? Ideally it would be in your service’s repository, right? Or perhaps it should be stored in a dynamic configuration database somewhere to allow more organic exploration and testing? Hmm.
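
To make the state machine option from the traffic modeling question a bit more concrete, here is a minimal sketch. The states and transition probabilities are entirely made up for illustration; a real model would be derived from your own traffic, and each state would issue an actual request rather than just being recorded.

```python
import random

# Hypothetical session model: each state lists (next_state, probability) pairs.
TRANSITIONS = {
    "home": [("search", 0.6), ("product", 0.3), ("exit", 0.1)],
    "search": [("product", 0.7), ("search", 0.2), ("exit", 0.1)],
    "product": [("checkout", 0.2), ("search", 0.5), ("exit", 0.3)],
    "checkout": [("exit", 1.0)],
}


def simulate_session(rng, start="home", max_steps=50):
    """Walk the state machine until it reaches "exit" or hits max_steps."""
    path = [start]
    state = start
    while state != "exit" and len(path) < max_steps:
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


# A seeded generator keeps the simulated session reproducible between runs.
session = simulate_session(random.Random(42))
print(" -> ".join(session))
```

Each simulated session is one user’s path through the site; running many of them concurrently gives you load whose shape follows the model rather than a handful of fixed request patterns.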

Alright then, with those questions in mind, time to read some papers.

Stress and Load Testing Research

As part of my research, I ran into an excellent list of stress and load testing papers from the Software Engineering Research Group at University of Texas Arlington. Many of these papers are pulled from there, and it’s a great starting point for your reading as well!

A Methodology to Support Load Test Analytics

A Methodology to Support Load Test Analytics (2010) starts with an excellent thought on why load testing is becoming increasingly important and complex:

For [ultra large scale systems], it is impossible for an analyst to skim through the huge volume of performance counters to find the required information. Instead, analyst employ few key performance counters known to [her] from past practice, performance gurus and domain trends as ‘rules of thumb’. In a ULSS, there is no single person with complete knowledge of end to end geographically distributed system activities.

That rings true to my experience.

With increasingly complex systems, it is remarkably hard to actually find the performance choke points, and load testing aspires to offer the distributed systems equivalent of a profiler. (This is also why QA environments are fraught with failures: as systems become increasingly complex, creating a useful facsimile becomes rather hard.)

A few other interesting ideas:

Fairly naive software can batch together highly correlated metrics, to greatly reduce the search space for humans to understand where things are degrading.
Most test designs mean they can only run occasionally (as opposed to continuously), and interfering workloads (e.g. a weekly batch job occurring during your load test) can easily invalidate the data of those infrequent runs.
Many tests end up with artificial constraints when run against non-production environments like QA, for example running against underpowered or mis-configured databases.
They use “Principal Component Analysis” to find the minimal principal components which are not correlated with each other, so you have less redundant data to explore.
After they’ve found the key Principal Components, they then convert those back into the underlying counters, which are human comprehensible, for further analysis.
Ideally, you should be able to use historical data to build load signatures, such that you could quickly determine if the system’s fundamental constraint has shifted (e.g. you’ve finally fixed your core bottleneck and got a new one, or something else has degraded such that it is now your core bottleneck).
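
The first idea in the list, batching together highly correlated metrics, can be sketched with nothing more than pairwise Pearson correlation. To be clear, this is not the paper’s Principal Component Analysis, just the naive version of the same goal; the counter names, sample values, and the 0.9 threshold are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_correlated(counters, threshold=0.9):
    """Greedily place each counter in the first group whose representative it tracks."""
    groups = []  # the first name in each group acts as its representative
    for name, series in counters.items():
        for group in groups:
            if abs(pearson(series, counters[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Made-up samples from a hypothetical load test, one value per interval.
counters = {
    "cpu_user": [10, 20, 30, 40, 50],
    "requests_per_sec": [100, 210, 290, 405, 500],
    "disk_queue": [5, 1, 4, 2, 5],
    "cpu_total": [12, 23, 33, 44, 56],
}
print(group_correlated(counters))
```

With these samples the three counters that rise together land in one group and the disk queue stays on its own, so a human only has to look at two signals instead of four.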

In particular, my takeaway is that the right load generation tool probably starts with a fairly simple approach built on manually identified key metrics, and then moves increasingly toward machine learning to avoid our implicit biases around where our systems ought to be slow.

A Unified Load Generator For Geographically…

A Unified Load Generator for Geographically Distributed Generation of Network Traffic (2006) is a master’s thesis that happens to be a pretty excellent survey of academic ideas and topics around load generation.

One of the interesting ideas here is:

to design an accurate artificial load generator which is responsible to act in a flexible manner under different situations we need not only a load model but also a formal method to specify the realistic load.

I’m not sure I entirely agree that we need a formal model to get value from our load testing (we are, after all, trying to convert unplanned scalability work into planned scalability work), but I have such an industry focus that it’s a fascinating idea to me that you would even try to create a formal model here.

It also introduces a more specific vocabulary for discussing load generation:

The workload or load L=L (E, S, IF, T) denotes the total sequence of requests which is offered by an environment E to a service system S via a well-defined interface IF during the timeinterval T.

Perhaps more interesting is the emphasis on how quality of service forces us to redefine the goal of load testing:

The need for telecommunication networks capable of providing communication services such as data, voice and video motivated to deliver an acceptable QoS level to the users, and their success depends on the development of effective congestion control schemes.

I find this a fascinating point. As your systems start to include more systematic load shedding mechanisms, it becomes increasingly challenging to push them out of equilibrium because they (typically) preserve harvest by degrading yield. It’s consequently not enough to say that your systems should or should not fail at a given level of load; you also have to start measuring whether they degrade appropriately as load increases.

In section 3.5, it explains UniLoG, also known as the Unified Load Generator (seemingly well known, albeit not to me). (It is perhaps based on this paper, which is sadly hidden behind a paywall.) UniLoG has an interesting architecture with intimidatingly exciting component names like PLM, ELM, GAR, LT, ADAPT and EEM. As best I can tell it is an extremely generic architecture for running and evaluating load experiments. It feels slightly overdesigned from my first reading, but perhaps as one spends more time in the caverns of load generation it starts to make more sense.

In section 4.4, it discusses centralized versus distributed load generation, which feels like one of the core design decisions you need to make when building such a system. My sense is that you likely want a distributed approach at some point, if only to avoid getting completely throttled by QoS operating on a per-IP rate limit.
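
The per-IP rate limit argument comes down to simple arithmetic, sketched below. The function name and the numbers are hypothetical, and it assumes every worker sends from its own IP address and that rates are whole requests per second.

```python
import math


def plan_workers(target_rps, per_ip_limit_rps):
    """Split a target request rate across the fewest workers that each stay under the limit."""
    workers = math.ceil(target_rps / per_ip_limit_rps)
    base, extra = divmod(target_rps, workers)
    # Spread any remainder one request per second at a time across the first workers.
    return [base + 1 if i < extra else base for i in range(workers)]


plan = plan_workers(target_rps=1050, per_ip_limit_rps=200)
print(len(plan), "workers:", plan)
```

A single generator could never exceed 200 requests per second against this hypothetical limit, so anything beyond that forces the distributed design.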

The rest of the paper focuses on some case studies and such. Overall, it was a surprisingly thorough introduction to the related research.

Automated Analysis of Load Testing Results

Automated Analysis of Load Testing Results takes a look at using automation to understand load test results (using both the execution logs of the load test and overall system metrics during the load test).

It summarizes the challenges of analyzing load test results as: documentation is outdated, process-level profiling is cost-prohibitive, load testing typically occurs late in the development cycle on short timelines, and the output from load tests can be overwhelmingly large.

I think the most interesting takeaway for me is the idea of very explicitly decoupling the gathering of performance data from its analysis. For example, you could start logging performance data early on (and likely your metrics tool, e.g. Graphite, is already capturing that data), and invest into more sophisticated analysis much later on. There is particular focus on comparing results across multiple load test runs, which can at a minimum narrow in on where performance “lives” within your metrics.
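
As a rough sketch of what the later analysis step could look like once the data is already being captured, here is a comparison of two runs that flags metrics whose median moved noticeably. The metric names, samples, and the 10% tolerance are made up, and this is far simpler than what the paper does.

```python
import statistics


def regressions(baseline, candidate, tolerance=0.10):
    """Return metrics whose median grew by more than `tolerance` between two runs."""
    flagged = {}
    for metric, old_samples in baseline.items():
        if metric not in candidate:
            continue  # only compare metrics recorded in both runs
        old = statistics.median(old_samples)
        new = statistics.median(candidate[metric])
        if old > 0 and (new - old) / old > tolerance:
            flagged[metric] = round((new - old) / old, 3)
    return flagged


# Hypothetical samples already captured by a metrics tool during two load test runs.
baseline = {"api_latency_ms": [100, 110, 105, 120], "db_latency_ms": [20, 22, 21, 19]}
candidate = {"api_latency_ms": [104, 112, 108, 118], "db_latency_ms": [30, 33, 29, 31]}

flagged = regressions(baseline, candidate)
print(flagged)
```

Nothing here depends on how the samples were gathered, which is the point: the collection can start early and stay dumb while the comparison logic gets more sophisticated over time.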

Even more papers…

Some additional papers with short summaries:

Converting Users to Testers - This paper discusses recording user traffic as an input to your load testing, with the goal of reducing time spent writing load generation scripts.

Automatic Feedback, Control-Based, Stress and Load Testing - This paper explores the idea of systems that try to drive and maintain load on a system at a targeted threshold. This is an interesting idea because it would allow you to consistently run load against your production environment. The caveat is that you first have to identify the inputs you want to use to influence that load, so you still need to model the incoming traffic (or record and sanitize real traffic). Once you have modeled it, though, you can be more abstract in how you use that model: if your target is to create load, you don’t necessarily need to simulate realistic traffic, and you could use something like an n-armed bandit approach to “optimize” your requests to generate the correct amount of load against the system. (Similarly, this paper tries to do that using genetic algorithms.)
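
The feedback idea can be illustrated with a plain proportional controller, which is my own stand-in rather than the method from the paper. The `measured_utilization` function is a hypothetical substitute for a real measurement of the system under test, and the gain and saturation point are arbitrary.

```python
def measured_utilization(rate):
    """Stand-in for a real measurement: a hypothetical system that saturates at 1000 req/s."""
    return min(rate / 1000.0, 1.0)


def drive_to_target(target, rate=100.0, gain=500.0, steps=50):
    """Proportional controller: nudge the request rate by gain * error on every step."""
    history = []
    for _ in range(steps):
        error = target - measured_utilization(rate)
        rate = max(0.0, rate + gain * error)
        history.append(rate)
    return history


history = drive_to_target(target=0.7)
print(f"request rate settles near {history[-1]:.1f} req/s")
```

In a real setup the controller would read utilization from production metrics and would need a gain tuned against measurement lag, but the loop itself stays this simple: measure, compare to the threshold, adjust the generated load.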

Existing Tools

There are surprisingly few load testing tools, although Wikipedia has a short list. Of that list, I’ve actually used JMeter some years ago, and I enjoyed this short rant about HP’s LoadRunner tooling.

I took a brief look at Gatling, which is a lightweight DSL written in Scala that can easily be run by Jenkins or similar. This seems like an interesting potential starting point for building a load generation tool. In particular, the concept of treating your load tests as something you would check into your repository feels right, allowing you to iterate on your load tests like you would anything else. Reading through a few other blog posts on Gatling gave me a stronger sense that this might be a useful component of an overall load testing system (one that allowed, e.g., many instances to be run against different endpoints).

Are there others that I’m missing out on?

Web Workload Generation According to…

As the name suggests, Web Workload Generation According to UniLoG Approach looks at adapting the UniLoG approach to the web. It nicely summarizes UniLoG’s approach as well:

The basic principle underlying the design and elaboration of UniLoG has been to start with a formal description of an abstract load model and thereafter to use an interface-dependent adapter to map the abstract requests to the concrete requests as they are “understood” by the service providing component at the real interface in question.

It also does a nice job of exploring ways to generate requests, although again coming back to either using logs of existing traffic or generating a model which defines your workload. There is an interesting hybrid here which would be using the distribution of actual usage as an input for the generated load (as opposed to using load on a one to one basis).
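
The hybrid approach, using the distribution of actual usage rather than replaying requests one to one, can be sketched as follows. The logged requests are made up; in practice they would come from sanitized production logs.

```python
import random
from collections import Counter

# Made-up request log standing in for sanitized production traffic.
logged_requests = (
    ["GET /search"] * 60 + ["GET /product"] * 30 + ["POST /checkout"] * 10
)


def usage_distribution(requests):
    """Turn observed requests into a mapping of request kind to relative frequency."""
    counts = Counter(requests)
    total = sum(counts.values())
    return {request: count / total for request, count in counts.items()}


def generate_load(distribution, n, rng):
    """Draw n synthetic requests whose mix follows the observed distribution."""
    kinds = list(distribution)
    weights = [distribution[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=n)


distribution = usage_distribution(logged_requests)
synthetic = generate_load(distribution, n=1000, rng=random.Random(7))
print(Counter(synthetic).most_common())
```

Because the generated volume is decoupled from the logged volume, the same distribution can drive a test at ten times production traffic without needing ten times the logs.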

That said, unfortunately, I didn’t really get much out of the paper.