Forecasting synthetic metrics.

November 5, 2019. Filed under infrastructure 34, metrics 4, reliability 3

Imagine you woke up one day and found yourself responsible for a Site Reliability Engineering team. By 10 AM, you’ve downloaded a free copy of the SRE book, and are starting to get the hang of things. Then an incident strikes: oh no! Folks rally to mitigate user impact and later diagnose and remediate the underlying cause, but a bunch of your users have a very bad day. Your shoulders are a bit heavier than just a few hours ago. You sit down with your team and declare your bold leader-y goal: next quarter we’ll have _zero incidents_.

Sending weekly 5-15 updates.

November 3, 2019. Filed under management 99

About a year ago I started sending weekly updates to a relevant mailing list that is public within the company. I've found the practice useful enough to write a few words on the how and why. This practice is sometimes called a 5-15 report, reflecting the goal of spending fifteen minutes a week writing a report that can be read in five minutes.

"Investing in technical infrastructure"

October 31, 2019. Filed under infrastructure 34, speaking 5, talks 3

A few weeks ago I got the chance to speak at SRECon EMEA 2019, and the videos are up! This is the video of my talk, Investing in technical infrastructure.

Healthchecks at scale.

October 27, 2019. Filed under infrastructure 34, architecture 26

A couple days ago at Stripe's weekly incident review, we started a discussion on a topic that is always surprisingly controversial: healthchecks. I've been thinking about them since and have written up some related thoughts.

An Elegant Puzzle by the numbers, five months later.

October 23, 2019. Filed under elegant-puzzle 8

An Elegant Puzzle was released on May 20th, 2019. In June I summarized what I learned writing the book, which covers what I have to say about creating it. Instead of retreading that material, I wanted to recap An Elegant Puzzle by the numbers.