Engineering management stuff I learned in 2017.

December 29, 2017. Filed under management, infrastructure.

I've been working with Stripe's infrastructure engineering team throughout 2017 (SF & remote, SEA), getting to work on an increasingly interesting set of problems, at an increasingly large scale, and with an increasingly talented group of folks.

These are some of the things I got to learn over the past year.


  • There is a surprisingly broad category of work where a single dedicated person saves person-years of rushed effort later: ensuring systems are scaling (and not regressing), cost accounting, build tooling, code hosting, etc. Prefer two people; one is still the loneliest number.
  • One of the most valuable skills in management is bridging between your management’s and your team’s expectations in a way that is authentic to the values of both. If you never say no to your team, you’re not actually managing them. If you never change your management’s mind, you’re not being a manager.
  • A less obvious advantage of having a broad set of skills is that you can succeed in a wider variety of situations: when your manager changes, when you switch companies, etc.
  • I hope we can do more to emphasize that line management is a different job than managing managers. The skills are different, and line management is an extremely rich, deep field that we should celebrate more.
  • Stereotype threat, the idea that simply being aware of stereotypes about you meaningfully reduces your related performance, was a new and eye-opening learning for me (see Whistling Vivaldi by Claude Steele).
  • I, and almost every manager I worked with, were certain that cold sourcing did not work and was a waste of time. We wrote up a process, followed it, and damningly it consistently worked for all of us. Humbling example of strong conviction in a reasonable but entirely false belief.
  • Most important inclusivity effort? Structured process for folks to privately apply to important roles and opportunities. Folks learn quickly, though: efforts picked through process have to succeed and be recognized or folks don’t apply.
  • Every time you put together a panel, a working group, or a team, build a habit around reflecting on the membership of that group, and how you selected them. (This is how/why we developed the structured process for applying to roles/opportunities.)
  • Composing diverse teams and groups easily becomes second-shift work for URMs in your company. When we ask folks to participate, we become ethically obligated to ensure this work is shifted into the first shift and recognized in performance reviews, promotions, etc.
  • Alignment and consensus are very different things. Consensus is a slow mechanism for reaching alignment; alignment is about communication and hearing perspectives, and can be very quick if you have the pre-existing relationships and communication mechanisms.
  • In a scaling organization, the ability to be consistently aligned within and across teams is a marker of excellence. “Time to alignment” is your reorg success metric.


  • AWS and GCP are eating internal infrastructure; we need to provide differentiated value beyond the cloud. Current opportunities? Security, compliance, language-specialized tools, maintained base images, CI usability/speed, multi-region, architectural patterns (e.g. kappa architecture).
  • Most current opportunities will be gone in 2-3 years: CI usability/speed (active investment area for all clouds), multi-region (GCP attempting “multi-region first”), architectural patterns (new apps get them by default). Security, compliance and language-specific tools are less susceptible to a single solution, and remain differentiated for now.
  • The now standard practice of pitting GCP and AWS against each other appears to be one of GCP’s key marketing strategies: they want to win on quality of commodity features.
  • GCP’s streaming offering seems best-of-breed, but it’s still irresponsible for us as internal infrastructure providers to adopt differentiated cloud offerings, meaning its adoption is now throttled by whether AWS offers a competitive alternative.
  • The gRPC/Protobuf ecosystem has captured most mindshare from Thrift/Finagle, Avro, etc. gRPC, and consequently HTTP/2, are the library and protocol of the future.
  • HTTP/2 is still one of the coolest, least utilized new infrastructure primitives out there. We have bi-directional streams now! Let’s use these more.
  • CloudFlare, Akamai, etc. seem deeply in danger of becoming undifferentiated from cloud offerings. Fastly’s offering of the full power of VCL remains an interesting, crafty, differentiated moat for now.
  • Very interesting, and I’m excited, to see DDoS mitigation becoming part of the default toolkit for cloud offerings. Less obvious to me how well this offering attaches to existing offerings (e.g. can you actually sell DDoS mitigation?), but if clouds already have to provide it anyway, reselling the existing capacity may be nearly free.
  • The Kubernetes ecosystem is winning, and amazing, but it seems like orchestration is becoming commoditized by clouds. Container lifecycle is still the hard part. I remain sold on k8s as a facilitator of cloud-agnosticism. (GCP aims to win on value, so reducing migration cost is key to their strategy.) (Julia Evans wrote an amazing post on how we rolled out Kubernetes at Stripe.)

  • Terraform is an excellent tool, but isn’t opinionated or constrained enough to fulfill the dreams of cross-cloud portability it used to inspire within me. I think there is a ripe gap for a tool, potentially even a tool written on top of TF, to do this. (Relatedly, how has AWS or GCP not acquired HashiCorp yet?)

  • Envoy is the unexpected new technology of 2017 for me, one of the last remaining infrastructure components for Xooglers and ex-Twitter folks looking to relive their tooling dreams. Developer productivity components are still locked behind the walls, which feels like one of the largest opportunities for high-adoption open source.
  • Stream computation seems certain to become the unified paradigm for data. To the extent that immutable events can gain traction, streaming will become the unified paradigm for scalable computation. (Alt: streaming is the new NoSQL.)
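
To make the immutable-events point a little more concrete, here is a minimal sketch of the idea as I understand it: state is just a fold over an append-only log, so replaying the log in batch and consuming it one event at a time arrive at the same view. The `Event` type, the `apply_event` and `replay` helpers, and the account-balance domain are all hypothetical names invented for illustration; this is not any particular streaming system's API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Iterable


@dataclass(frozen=True)  # frozen: events cannot be mutated once created
class Event:
    account: str  # hypothetical key for the illustration
    delta: int    # hypothetical change in balance


def apply_event(state: Dict[str, int], event: Event) -> Dict[str, int]:
    """Return a new state with one event applied; the input state is untouched."""
    new_state = dict(state)
    new_state[event.account] = new_state.get(event.account, 0) + event.delta
    return new_state


def replay(events: Iterable[Event]) -> Dict[str, int]:
    """Batch view: fold the whole log from an empty state."""
    return reduce(apply_event, events, {})


log = [Event("a", 10), Event("b", 5), Event("a", -3)]

# Batch: recompute everything from the log.
batch_view = replay(log)

# Streaming: apply events one at a time as they "arrive".
stream_view: Dict[str, int] = {}
for event in log:
    stream_view = apply_event(stream_view, event)

print(batch_view == stream_view)
```

Because the events never change and the update function is pure, the batch and streaming paths are the same computation run at different granularities, which is roughly why immutable events make the "unified paradigm" plausible.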

What did you learn this past year?