QoS, Cost & Quotas

September 11, 2016. Filed under architecture, scaling.

One of the most exciting inversions that come with scaling a company is the day when someone repeats the beloved refrain that engineers are more expensive than servers, followed by a pregnant pause where you collectively realize that the lines have finally crossed: servers have become more expensive than engineers.

But what do you do with this newfound knowledge? Why, you kick off a Cost Accounting or Efficiency project to start optimizing your spend! These projects tend to take a two-pronged approach:

  1. identify some low-hanging fruit to show immediate wins, and
  2. begin to instrument and tag your costs to create transparency into how your spends actually work.

The latter gets interesting quickly.

First, let's start with the fundamental unit of infrastructure cost: a server. For each server you probably have a role (your mysql01-west is probably in a mysql role), and for each role you can probably assign an owner to it (perhaps your database team in this case). Now you write a quick script which queries your server metadata store every hour and emits the current servers and owners into Graphite or some such.
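That quick script might look something like the sketch below. It assumes your metadata store can hand you a list of hosts with roles (the `SERVERS` list and `ROLE_OWNERS` mapping here are hypothetical stand-ins), and it renders counts in Graphite's plaintext format of `path value timestamp`. The metric naming scheme is made up for illustration.

```python
import time
from collections import Counter

# Hypothetical stand-in for a query against your server metadata store.
SERVERS = [
    {"host": "mysql01-west", "role": "mysql"},
    {"host": "mysql02-west", "role": "mysql"},
    {"host": "kafka01-west", "role": "kafka"},
]

# Hypothetical role-to-owner mapping.
ROLE_OWNERS = {"mysql": "database", "kafka": "streaming"}


def graphite_lines(servers, role_owners, now=None):
    """Render per-owner, per-role server counts as Graphite plaintext lines."""
    now = int(time.time()) if now is None else int(now)
    counts = Counter(
        (role_owners.get(s["role"], "unowned"), s["role"]) for s in servers
    )
    return [
        f"costs.servers.{owner}.{role}.count {count} {now}"
        for (owner, role), count in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Run hourly (from cron, say) and ship the lines to your Graphite listener.
    for line in graphite_lines(SERVERS, ROLE_OWNERS):
        print(line)
```

Roles without a known owner land in an `unowned` bucket, which tends to be the first thing worth chasing down.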

A job well done! Your work here is over... until later that week when you chat with the team who just finished migrating all your processes from dedicated hosts to Mesos or Kubernetes, who present you with an interesting question: "Sure, we glorious engineers on the Scheduling Team run the Kubernetes cluster, but we don't write any of the apps. Why are we responsible for their costs?"

So you head back to your desk and write down the resources provided by each machine:

  1. cpu,
  2. memory,
  3. network,
  4. disk space,
  5. disk IOPs.

Thinking about those resources and your existing per-host costs, you're able to establish a more granular pricing model for each of those resources. Then you take that model and add it as a layer on top of your per-host costs, where the Scheduling team is able to attribute their server costs downstream to the processes which run on their servers, as long as they're able to start capturing high-fidelity per-process utilization metrics and maintaining a process-to-team mapping (that was on their roadmap anyway).
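One way to sketch that pricing layer: split a host's hourly cost across the five resources by some weighting, divide by capacity to get a unit price, and charge each process for what it uses. Every number below (the host cost, the capacities, and especially the weights) is a made-up assumption; real weights would come from your own cost data.

```python
# Hypothetical hourly cost and capacity of a single host.
HOST_HOURLY_COST = 1.00
HOST_CAPACITY = {
    "cpu_cores": 16,
    "memory_gb": 64,
    "network_gbps": 10,
    "disk_gb": 1000,
    "disk_iops": 5000,
}

# Assumed share of the host's cost carried by each resource (sums to 1.0).
COST_WEIGHTS = {
    "cpu_cores": 0.4,
    "memory_gb": 0.3,
    "network_gbps": 0.1,
    "disk_gb": 0.1,
    "disk_iops": 0.1,
}


def unit_prices(host_cost, capacity, weights):
    """Hourly price of one unit of each resource on this host type."""
    return {r: host_cost * weights[r] / capacity[r] for r in capacity}


def process_cost(usage, prices):
    """Hourly cost of a process, given its average usage per resource."""
    return sum(prices[r] * amount for r, amount in usage.items())


prices = unit_prices(HOST_HOURLY_COST, HOST_CAPACITY, COST_WEIGHTS)
# A process averaging 4 cores and 16 GB of memory:
print(process_cost({"cpu_cores": 4, "memory_gb": 16}, prices))
```

Under this scheme a process that uses the whole host pays the whole host cost, and any idle capacity stays on the Scheduling team's bill, which is arguably where the incentive to pack machines tightly belongs.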

Doodling on a paper pad, you realize that things have gotten a bit more complex. Now you have:

  1. server to team mappings,
  2. server allocation metrics,
  3. team server costs,
  4. process to server mappings,
  5. process utilization metrics,
  6. team process costs.

Figuring out the total cost per team is pretty easy though: you just take the team's server costs, plus the process costs attributed to them by other teams, minus the process costs they attribute to other teams.
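As a minimal sketch of that arithmetic, assume server costs arrive as a per-team dictionary and attributions as `(from_team, to_team, cost)` tuples. Both shapes, and the team names and figures, are hypothetical.

```python
from collections import defaultdict


def total_cost_per_team(server_costs, attributions):
    """Server costs, plus costs attributed in, minus costs attributed out.

    server_costs: {team: cost}
    attributions: iterable of (from_team, to_team, cost)
    """
    totals = defaultdict(float, server_costs)
    for from_team, to_team, cost in attributions:
        totals[from_team] -= cost
        totals[to_team] += cost
    return dict(totals)


server_costs = {"scheduling": 1000.0, "database": 400.0}
attributions = [
    ("scheduling", "web", 600.0),
    ("scheduling", "search", 300.0),
    ("database", "web", 250.0),
]
print(total_cost_per_team(server_costs, attributions))
```

Because every attribution is subtracted from one team and added to another, the per-team totals still sum to the original server bill, which is a handy sanity check on the whole pipeline.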

Over the following weeks, you're surprised as every infrastructure team pings you to chat. The team running Kafka wants to track traffic per topic and to attribute cost back to publishers and consumers by utilization; that's fine with you, it fits into your existing attribution model. The Database team wants to do the same with their MySQL database, which is a little bit more confusing because they want to build a model which attributes disk space, IOPs and CPU, but you're eventually able to figure out some heuristics that are passable enough to create visibility. (What are query comments meant for if not injecting increasingly complex structured data into every query?)

The new SRE manager started scheduling weekly incident review meetings, and you listen absentmindedly while the Kafka team talks about an outage caused by a publisher which started generating far more load than usual. It's a bummer that Kafka keeps going down, but at least their spend is going down, so it has nothing to do with you. Later, you awake in a panic when someone suggests that we just massively overprovision the Kafka cluster to avoid these problems. You sputter out an incoherent squawk of rage at this suggestion (we've made too much progress on reducing costs to regress now!) and leave the meeting shaken.

Next week, the MySQL team is in the incident review meeting because they ran out of disk space and had a catastrophic outage. A sense of indigestion starts to creep into your gut as you see the same person as last week gear up to speak, and then she says it, she says it again: "Shouldn't we spend our way to reliability here?"

Demoralized on the way to get your fifth LaCroix for the day, you see the CFO walking your way. He's been one of your biggest supporters on the cost initiative, and you perk up anticipating a compliment. Unfortunately, he starts to drill into why infrastructure costs have returned to the same growth curve and almost the same levels they were at when you started your project.

Maybe, you ponder to yourself on the commute home, no one is even looking at their cost data. Could a couple of thoughtful nudges fix this?

You program up an automated monthly cost report for each team, showing how much they are spending, and also where their costs fit into the overall team spend distribution. Teams with low spends start asking you if they can trade their savings in for a fancy offsite, and teams with high spends start trying to reduce their spend again. (You even roll out a daily report that only fires if it detects anomalously high spend after a new EMR job spends a couple million dollars analyzing Nginx logs.)
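Both nudges can be sketched in a few lines. The version below places a team in the spend distribution with a simple rank, and flags a day as anomalous when it lands more than three standard deviations above the trailing mean. The three-sigma threshold and the function names are assumptions for illustration, not a recommendation.

```python
import statistics


def spend_rank(team, monthly_spend):
    """Fraction of teams spending no more than `team` (1.0 = top spender)."""
    mine = monthly_spend[team]
    at_or_below = sum(1 for v in monthly_spend.values() if v <= mine)
    return at_or_below / len(monthly_spend)


def is_anomalous(history, today, threshold=3.0):
    """True if today's spend exceeds the trailing mean by `threshold` stdevs.

    `history` needs at least two daily totals for a standard deviation.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return today > mean + threshold * stdev


monthly = {"web": 850, "search": 300, "database": 150, "scheduling": 100}
print(spend_rank("web", monthly))

trailing_days = [100, 110, 90, 105, 95]
print(is_anomalous(trailing_days, 120), is_anomalous(trailing_days, 500))
```

A perfectly flat history has a standard deviation of zero, so any increase at all would trip the alert; a real report would likely want a minimum absolute delta as well.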

This helps, but the incident meetings keep ending with the suggestion to spend more for reliability, and you've started to casually avoid the CFO. So, you know, it's going great... but...

While you're typing up notes from your latest technical phone screen, you hear an argument through the thin, hastily constructed conference room walls. The product manager is saying that the best user experience is to track and retain all incoming user messages in their mailbox forever, and the engineer is yelling that it isn't possible: only bounded queues have predictable behavior, an unbounded queue is just a queue with undefined failure conditions!

Suddenly, it all comes together: your costs and your infrastructure are both missing back pressure! As a bonus, the metrics you've put together for cost accounting are exactly the right inputs to start rolling out back pressure across your infrastructure. As the old adage goes, the first back pressure is the hardest, and as you brainstorm with teams you come up with a variety of techniques.

For the Kafka team, constrained on throughput, you decide on strict per-application per-minute rate limits. The MySQL team, where bad queries are saturating IOPs, starts circuit breaking applications generating poor queries. To work around Splunk's strict enforcement of daily indexing quotas, you roll out a simple quality of service strategy: applications specify log priority in the structured data segments of their syslogs, and you shed as many lower priority logs as necessary to avoid overages. For users pegging your external API, you start injecting latency which causes them to behave like kinder clients with backoffs, even though they haven't changed a thing. (Assuming you can handle a large number of concurrent, idle connections. Otherwise you're just DoSing yourself, you realize later to great chagrin.)
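Two of these techniques are small enough to sketch. Below is a fixed-window limiter that gives each application a per-minute allowance, and a shedding function that keeps the most important logs within a budget. The class and function names are hypothetical, the log priority is assumed to be already parsed out of the syslog structured data, and lower numbers are assumed to mean more important.

```python
import time


class PerMinuteRateLimiter:
    """Fixed-window limiter: each application gets `limit` requests per minute."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock
        self.windows = {}  # app -> (minute, requests seen in that minute)

    def allow(self, app):
        minute = int(self.clock() // 60)
        seen_minute, count = self.windows.get(app, (minute, 0))
        if seen_minute != minute:
            count = 0  # a new minute started, so reset the allowance
        if count >= self.limit:
            self.windows[app] = (minute, count)
            return False  # over the limit: reject so the caller backs off
        self.windows[app] = (minute, count + 1)
        return True


def shed_logs(logs, budget):
    """Keep at most `budget` logs, dropping the least important first.

    `logs` is a list of (priority, message) pairs in arrival order.
    """
    # Stable sort of indices by priority; ties keep arrival order.
    by_priority = sorted(range(len(logs)), key=lambda i: logs[i][0])
    keep = sorted(by_priority[:budget])  # restore arrival order
    return [logs[i] for i in keep]


limiter = PerMinuteRateLimiter(limit=2)
print([limiter.allow("publisher-a") for _ in range(3)])

logs = [(5, "debug a"), (1, "error b"), (3, "info c"), (1, "error d")]
print(shed_logs(logs, budget=2))
```

The limiter only remembers one window per application, so its own memory is bounded by the number of applications, which seems like the least it could do given the subject matter.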

All of a sudden you're spending a lot of time discussing something you've never discussed before: how should our systems fail?

Degraded service isn't a sign of failure, it's normal and expected behavior. Some of your key systems are regularly degraded during peak hours, gracefully shedding load of their least important traffic. Best of all, the incident review meetings have stopped.

During a chat with your CFO, while lamenting Splunk's pricing and quota model, you realize that you can apply back pressure to spend by assigning a cost quota for each team. Six months of angsty conversations later, each team has a quarterly cost quota derived from their historical usage and guidance from Finance, your site never goes down due to insufficient resources, and based on your excellent cost accounting efforts you are promoted infinity times.

Thanks to the many, many people I've chatted with about this topic over the last year or so. In particular, thanks to Cory who is probably looking for a way to stop talking about this with me every week.