Head in the clouds.

Published on July 7, 2019. Tags: cloud, infrastructure

When I wrote about the public cloud evolving the role of infrastructure engineering, I sort of imagined that the precursor question–should we run our infrastructure on the public cloud?–was already quite settled. Unexpectedly, it’s a discussion that I find myself having more rather than less frequently each year, so I’ve taken some time to structure and document my thinking.

In short: run on the public cloud unless (1) it prevents you from executing on your core competency or (2) your workloads are expensive and require a specialized hardware profile that doesn’t align with the general purpose computation favored by the public cloud’s economics.

The structure here is to (a) start with a review of the advantages of both the datacenter and public cloud approaches, (b) follow with a review of how some modern companies are approaching this problem, and (c) end with how companies should make this decision for themselves.

Datacenters

Running your own datacenters is good for:

Investing into your core competencies in ways that don’t align with cloud vendors’ generalized workloads. For example, if your core competency was supporting cryptocurrency mining rigs, you could optimize for low power costs at the expense of reduced availability or increased latency, in ways that cloud vendors cannot.
Supplying general purpose computing resources for 30% less cost than the public cloud. “General purpose” here meaning roughly equivalent resources to the public cloud in terms of the ratio of CPU to RAM to IOPS and so on.
Note that this is the best case scenario, requiring excellent execution, but it is a number that I’ve heard multiple companies calculate after performing exhaustive “all in” analysis (e.g. including salaries for folks to operate the infrastructure and so on).
Supporting special purpose workloads, especially workloads benefitting from very large disks with fewer IOPS (clouds are largely moving to SSDs and away from spinning disks), or the sorts of vertical scaling where you need 2U or 4U servers.
These sorts of specialized workloads might be the only way to scale your architecture (which is generally speaking a bad sign, but you do you), and can also be vastly cheaper than alternative approaches due to the quirks of your software.
Meeting data locality or regulatory requirements in a specific market, typically markets where first-tier cloud providers haven’t entered for whatever reasons (too small, too regulated, etc).
Controlling the predictability of your costs. The rigor required to manage your supply chain translates into predictability in your costs, and more structured planning. Folks talk about the elasticity of the cloud as an advantage, but in this specific regard it is a disadvantage, and large companies (especially large, public companies) place a great deal of value in predictable costs.

Public cloud

Running in the public cloud is good for:

Elasticity in the small things. At a certain scale you’re doing capacity planning with AWS and you don’t have large scale elasticity, but you do have immense elasticity in the small things, for new prototypes and such, which allow you to innovate without blockers.
This is the predominant reason I see companies that “grew up in datacenters” start moving to the cloud, and I believe it’s a huge boost to a company’s long-term ability to innovate.
Benefitting from their vast economies of scale, where most of the savings are being passed along to users as long as cloud vendor competition remains fierce.
Benefitting from the broad, continuous investment that cloud vendors make to support general purpose workloads that likely resemble your workloads: improvements to security, availability, productivity, etc.
Offloading support overhead to new cloud services for more and more of your infrastructure, allowing you to concentrate more and more of your team on your core competencies (for businesses where foundational infrastructure isn’t part of your core competencies).
Supporting the shifting tides of data locality at an international scale, since few companies have the ability to manage the compliance and legal overhead of dozens of countries’ regulatory regimes simultaneously (and those are dynamic, living things, not something that you do once).
Avoiding supply chain management. Someone once told me that 90% of SSDs are being sold to cloud vendors, which–if true–suggests that long-term only cloud vendors will be able to get good SSD pricing. I imagine this sort of logic applies to other server components beyond SSDs.
Even if you can get costs to be equivalent with clouds, they are always going to be a more important customer than you to the component vendors, which means they’ll get priority on components when supply dips, meaning the predictability of their supply will be higher than yours.
(If you haven’t dealt with the server supply chain, it’s easy to imagine that there is this sort of rationally optimal economy producing exactly the number of required components, but in practice it’s pretty common to have component scarcity due to global supply chain issues.)

Real-life examples

It’s easy to get overly abstract when talking about “the right way” to do something, so before jumping into decision criteria, it’s useful to look at what companies are actually doing:

Airbnb runs mostly on AWS.
Dropbox has moved most of its data out of AWS into their own datacenters, after previously running their full business on AWS.
Fastly operates 60 points of presence, and also relies on AWS, Google, Softlayer and “other cloud providers” for some aspects of their platform.
Lyft relies entirely on AWS.
Pinterest relies entirely on AWS.
Twitter runs their production and development workloads on their own datacenters, but is experimenting with running ad hoc Hadoop workloads on Google Cloud Platform.
Uber runs primarily on its own datacenters, scaling to AWS, and running some small workloads on multiple cloud vendors.
Zoom runs in 13 different datacenters, with some aspects of their business on AWS and Azure.

My recommended takeaway here is that a lot of companies are doing different things, and there is no single dominant strategy that you should take in every scenario.

(Also, read S-1s! They have so much good data.)

How to decide

Okay, so we’ve identified that there are very successful companies operating exclusively on the cloud, almost exclusively off the cloud, and running hybrid approaches. Every approach is valid.

What should you do?

Economies and diseconomies of scale

Your infrastructure costs are going to scale with your business in one of three ways: (a) economies of scale, (b) diseconomies of scale, or (c) proportionally.

[Figure: three different infrastructure cost growth rates: proportional growth, diseconomies of scale, and economies of scale.]

Generally, you should optimize for growth as long as you’re benefitting from economies of scale or are scaling costs proportionally. If you’re getting less efficient with scale, then your growth is strangling your business, and you should prioritize costs.
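One rough way to tell which of the three cases you are in is to fit the elasticity of cost with respect to business scale: the slope of log(cost) against log(scale). The sketch below is my own illustration rather than anything from the analysis above; the function names, the choice of a log-log least-squares fit, and the 5% tolerance band around proportional growth are all assumptions.

```python
import math


def cost_elasticity(scale, cost):
    """Least-squares slope of log(cost) against log(scale).

    `scale` is any positive measure of business size (requests, revenue,
    users) and `cost` is the infrastructure spend observed at that size.
    """
    if len(scale) != len(cost) or len(scale) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(s) for s in scale]
    ys = [math.log(c) for c in cost]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("scale must vary between observations")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_scaling(scale, cost, tolerance=0.05):
    """Label the growth pattern; `tolerance` is an arbitrary, assumed band."""
    elasticity = cost_elasticity(scale, cost)
    if elasticity < 1 - tolerance:
        return "economies of scale"      # cost grows slower than the business
    if elasticity > 1 + tolerance:
        return "diseconomies of scale"   # cost grows faster than the business
    return "proportional"
```

For example, `classify_scaling([1, 2, 4, 8], [10, 20, 40, 80])` reports proportional growth, since cost doubles each time scale doubles. Real cost data is noisier than this, so treat the label as a prompt for investigation rather than a verdict.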

Similarly, if your business is not growing, then reducing your absolute costs is more important than their relationship with business scale, and you should consider prioritizing costs, especially if there isn’t alternative work you can do to initiate new growth.

If one of those is true (you have diseconomies of scale or your business’ growth has slowed), then you likely want to be focusing on costs, and moving away from the public cloud to your own datacenters is a viable strategy to reduce those costs.

That said, you should still follow the first law of optimization: optimize where there’s the most room for improvement. If there are other areas where you’re spending more (or spending less intentionally), focus there first.

Growth versus efficiency

Every strategy to reduce cost shifts energy away from growth.

Many cost strategies can be contained within a small number of teams, allowing you to make a fixed investment into reducing costs (some examples here are improved efficiency in your orchestration tier or improvements to your storage implementation). It’s easy to make a cost versus investment calculation for these sorts of efforts. (Roughly, if you’ll save more money than you invest, then grow the team to support the efforts.)

Other cost strategies require tradeoffs against developer productivity or product development, and these are much harder to make. The principled way to make this sort of tradeoff is to consider the discounted cash flow between the different scenarios.

I believe moving off the public cloud is in the second category.

The optimal solution will depend on how you model the cloud’s productivity benefits, but generally I think discounted cash flow analysis will argue for staying on the cloud unless (a) you have specialized workloads that support significantly greater than 30% savings over the public cloud or (b) you have dismal expectations for future growth.
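To make the discounted cash flow comparison concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the simplification that infrastructure cost is a fixed share of revenue, the choice to model the cloud's productivity benefit as a higher growth rate, and every number in the example scenarios.

```python
def npv(cash_flows, discount_rate):
    """Net present value of yearly cash flows; cash_flows[0] is year 0."""
    return sum(cf / (1 + discount_rate) ** t for t, cf in enumerate(cash_flows))


def yearly_cash_flows(revenue, growth, infra_cost_ratio, years, upfront_cost=0.0):
    """Revenue minus infrastructure cost for each year.

    Assumes infrastructure cost is a fixed share of revenue and that
    revenue compounds at `growth` per year. `upfront_cost` (for example
    a one-time migration) is charged in year 0.
    """
    flows = []
    for year in range(years):
        year_revenue = revenue * (1 + growth) ** year
        flows.append(year_revenue * (1 - infra_cost_ratio))
    flows[0] -= upfront_cost
    return flows


# Hypothetical scenarios, chosen only to show the mechanics.
# Cloud: infrastructure is 20% of revenue, revenue grows 30% a year.
cloud = yearly_cash_flows(revenue=100.0, growth=0.30, infra_cost_ratio=0.20, years=5)
# Datacenter: 30% cheaper infrastructure (14% of revenue), a one-time
# migration cost, and an assumed growth penalty from lost productivity.
datacenter = yearly_cash_flows(
    revenue=100.0, growth=0.25, infra_cost_ratio=0.14, years=5, upfront_cost=10.0
)

print(npv(cloud, 0.10), npv(datacenter, 0.10))
```

With these made-up inputs the cloud scenario has the higher net present value, because five points of lost growth outweigh the 30% infrastructure savings. Setting the datacenter growth rate equal to the cloud's reverses the result, which is why the outcome hinges on how you model the productivity benefit.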

Core competencies

If your ability to execute on your core competency is constrained by the public cloud, then you may want to run your own datacenters (e.g. Fastly, Dropbox).

You can outperform the cloud in the short term, but long-term outperformance requires ongoing investment, which most companies can’t make.

However, if you’re small or are spreading yourself across many directions, then you’re unlikely to outperform the public cloud in the long run. If you’re running a general purpose infrastructure (e.g. Pinterest, Lyft), then you’ll likely benefit from running on the public cloud.

Supporting data locality and regional regulatory regimes is a particularly interesting case of this tradeoff. You can easily invest more into meeting regulations for any given market than AWS can, but you almost certainly cannot invest more into meeting regulations for every market.

More nuanced maturity model

So far we’ve thought of this mostly as a decision you make at the company level, but as you get large enough, that doesn’t have to be the case. You could start new business lines in the cloud to optimize for growth, and move your mature business lines (where growth has slowed down) into datacenters to optimize for efficiency.

Once a company reaches a certain size and age, I suspect this is the mathematically optimal approach.

Trapdoor decision

Focusing on either datacenters or the cloud is a somewhat trapdoor decision. You can evolve your strategy over time, but each time you shift direction you’ll lose some of your expertise, and if you shift direction frequently you’ll lack the expertise to excel in any approach.

And yes, you need that expertise, because the whole tradeoff between public cloud and datacenters is irrelevant if you’re not doing them well. An excellent cloud implementation is far better than a bad datacenter strategy, just as an excellent datacenter implementation is far better than a bad cloud strategy.

Ending thoughts

Overall, my belief is that few companies benefit from starting outside of the public cloud, and few larger companies can rationally prioritize migrating their infrastructure off the public cloud. If your core competency can’t be expressed within the public cloud, then moving a portion of your infrastructure into your own datacenters makes sense.

For those few businesses, investing into their core competency makes them more valuable and defensible, but for most folks, it’s just a tar pit.