Expanding on S[a-z]{3,} Reliability Engineer roles.

November 18, 2019. Filed under infrastructure, sre.

One of my foundational learning experiences occurred in 2014, when I designed and rolled out Uber’s original Site Reliability Engineering role and organization. While I’d make many decisions a bit differently if I could rewind and try again, for the most part I’m proud when reviewing the reel of rewound memories.

Folks will occasionally ask my advice on introducing SRE in their company, and I give them an answer they don’t expect: don’t. The one-word version comes across as rather more controversial than I intend. I love the approaches that define good SRE organizations, and love the SREs I’ve worked with, but I believe that industry preconceptions of the role are sufficiently muddled that the term is actively unhelpful.

To grow a team using best SRE practices, skip the label “SRE.”

Label stuffing

When looking for an egregious example of label stuffing, it’s hard to find better than Long Island Iced Tea Corp renaming to Long Blockchain Corp. It was an iced tea company before the name change, and it was an iced tea company after the name change, just an iced tea company that really wanted to be valued as a blockchain company. Thus far, only the SEC has valued that distinction.

Some engineering organizations have committed a similar maneuver, renaming their system administration groups as SREs. While the majority of those renames are done in good faith and following the scriptures of the good book, many of them don’t nail that transition. Perhaps a small subset of these rebrands are truly cynical, hoping to improve their hiring fortunes without changing practices, but my sense is that the vast majority are well-meaning folks struggling to land a difficult cultural transition.

Does the distinction matter?

This label stuffing is important because these two styles of work are incompatible, and create a dysfunctional or ineffective organization when applied in tandem. The two critical distinctions between systems administration and SRE organizations are how they (1) handle ownership and (2) create leverage.

Administration groups split ownership such that routine workflows cross organizational boundaries between them and their peer development groups: you write the code, I’ll deploy and operate the code. SRE groups split ownership in ways that ensure common workflows do not cross organizational boundaries: I’ll build the deployment platform, and you’ll use it to deploy.

These approaches to dividing ownership also impact how they create leverage for the organization they support. In most system administration organizations, the fundamental unit of progress is working hours. This is distinct from the software engineering and SRE teams I’ve worked with, where the fundamental unit of progress is working software.

When I joined Uber, services were provisioned manually through a sequence of highly coupled Puppet commits and deploys. When I left, services were provisioned by typing in the service name, clicking a button and waiting thirty seconds. No amount of additional human effort could have met the ramping business need for service provisioning, although certainly we could have kept trying harder and burned ourselves out.
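
To make “working software as the unit of progress” a bit more concrete, here’s a minimal sketch of what a self-service provisioning entry point might look like. This is a toy illustration, not a description of Uber’s actual system: the `Platform` class, the naming rule, the port scheme and the `git.example.com` repository path are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical naming rule, made up for this sketch.
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]{2,30}$")


@dataclass
class Platform:
    """Toy stand-in for a provisioning platform owned by a reliability team."""

    base_port: int = 9000
    services: dict = field(default_factory=dict)

    def provision(self, name: str) -> dict:
        """Run every provisioning step from a single request, with no handoffs."""
        if not SERVICE_NAME.match(name):
            raise ValueError(f"invalid service name: {name!r}")
        if name in self.services:
            raise ValueError(f"service already exists: {name!r}")
        config = {
            "name": name,
            # Assumed scheme: ports are handed out sequentially.
            "port": self.base_port + len(self.services),
            # Placeholder host, not a real repository location.
            "repo": f"git.example.com/services/{name}",
        }
        self.services[name] = config
        return config


if __name__ == "__main__":
    platform = Platform()
    print(platform.provision("trip-pricing"))
```

The point isn’t the details, it’s the ownership boundary: the platform team owns `provision`, and the service team calls it without filing a ticket or waiting on another group’s working hours.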

“Just filter in the interview!”

Some folks will agree with the premise that the SRE label has become overly broad, but suggest that it’s easy to refine your hiring process to filter for the approach your organization prefers.

In general, this seems like the obvious approach, but I’ve found that the label stuffing exerts an ongoing, exhausting pressure on the interview loop itself. Folks get frustrated that the loop is filtering out great SREs that they’ve worked with before who were very successful in the SRE organization at another company, and this becomes evidence that your SRE loop is flawed.

With a very clear vision of how you want SRE to operate at your company, you can prevent erosion of evaluation, but why spend your life doing that when you can skip out on the label’s confused preconceptions entirely?

Descriptive teams

My approach for the last few years, as well as what I recommend to others, is to drop the SRE label entirely and hire software engineers. Do you abandon all hope of hiring folks with SRE expertise?

No! Move the specialization into the team’s mission instead of the role’s label. For example, create a team or group responsible for reliability and call that team the Resiliency team or the Reliability team. Advertise for software engineers to join that team to work on reliability, and hire those who you think will make your team successful, including those who have previously worked in successful SRE organizations.

But the inbound funnel…

The last concern I’ll hear from folks is that if they don’t use a frequent search term like SRE, they’ll miss out on folks who would only apply to an SRE role. My experience is that by posting a specific team job description, you’re freeing up time spent filtering to invest into sourcing and growing your organization’s public SRE brand.

I can easily imagine an organization that finds that “just turning on SRE” for their hiring increases inbound significantly, but my experience has been that it’s more of a wash in terms of your long-term goal of hiring folks who’ll be successful within your organization and align with your team’s approach.

Summing it all together, I’m not against anyone using the label SRE; I just think it’s more effective to avoid it at this point in the spin cycle. If it’s working really well for you, then by all means keep using it. As for me, I’ll be over here writing more specific team descriptions.