I’ve had this war-story draft sitting around for a few weeks, but it felt incomplete. Then an intern candidate asked me: “What is the worst incident you were involved in, and what was the outcome?”

For context, I’ve been in the SRE space since 2014. I’ve had accidents and broken production a few times, but one incident stands out to me.


Setting the stage

An engineer had joined my team as an internal transfer from another part of the company, and they were getting exposed to our systems by cleaning up the monitoring configs for our services.

We had started a migration from an implicit configuration (a machine agent would do a reverse lookup in service discovery to figure out which tiers the machine was in, then collect the metrics defined on those tiers) to an explicit one (a container agent collects the metrics defined on the container spec), but hadn’t finished the migration.
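As a rough illustration of the two models (all names and structures here are invented; this is not the real agent code):

```python
# Hypothetical sketch of the two collection models; every name is made up.

def collect_implicit(machine, service_discovery, tier_configs):
    """Old model: the machine agent works backwards from the machine."""
    # Reverse lookup: which service discovery tiers is this machine in?
    tiers = service_discovery.tiers_for_machine(machine)
    metrics = []
    for tier in tiers:
        # Collect whatever metrics are defined on each tier's monitoring config.
        metrics.extend(tier_configs[tier].metrics)
    return metrics


def collect_explicit(container_spec):
    """New model: the container agent reads the config shipped with the container."""
    # The metrics to collect are declared directly on the container spec.
    return container_spec.get("monitoring", {}).get("metrics", [])
```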

The new team member already knew a lot about the monitoring system, so we were pretty sure this was a low-risk operation.

There were five separate production services to migrate, and the first four had been migrated successfully - now we were working on the fifth.

Minutes before Disaster

One of the things I was worried about was metrics being dropped because the collection agent changed. We did hit a problem with deployments: they were in-place updates, which would start the new version, wait until it was healthy, then stop the old version. For the duration of the swap, the metrics from the last inspected container won, meaning the new collection model was reporting metrics that flipped randomly between the two containers.
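Roughly what was going on, as a hypothetical sketch (the real agents were more involved than this):

```python
# Hypothetical sketch of the "last inspected container wins" behaviour.

def scrape(containers):
    """Collect one value per (service, metric) pair from the running containers."""
    reported = {}
    for container in containers:  # during an in-place deploy this holds old AND new
        for metric, value in container.metrics().items():
            # Keyed only by service + metric name, so the container inspected
            # last silently overwrites the earlier one's values.
            reported[(container.service, metric)] = value
    return reported

# During the version swap the iteration order over containers isn't stable,
# so consecutive scrapes flip between the old and new container's values.
```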

But we fixed that issue and hadn’t had any further problems. When the engineer tested the latest change, the number of metrics reported was identical, so I was fairly sure it was low-risk because nothing obvious was changing. Besides, we were just adding the monitoring config to the container spec, which is usually a safer operation than deleting something.

That said, this was the final change in a fairly significant migration that had been put off because no one wanted to tackle the high-risk/low-reward work of cleaning up this tech debt.

I was surprised no one had complained that anything had broken yet, and my initial approval message was along the lines of “this is touching a dark and dusty corner, something will probably break in a downstream system, and we’ll find out when the team with the broken thing comes looking for us” - but I deleted that in favour of a simple internal meme:

See you in incident review

The Incident

I had typed that message expecting another team to come to us in a few days asking why some metric had changed, at which point we’d create an incident to track any impact and fix whatever the underlying issue was.

Instead, it took a few minutes for this to become one of the most serious incidents I was ever involved in. Afterward, we figured out what happened – and why:

  1. To track the general health of the service, we were combining metrics from individual machines into an aggregated metric at the logical service level. We were running this service in multiple datacenters and didn’t want to create a config for each datacenter, so we used a dynamic placeholder element: <TIER>.
  2. The machine-local collection agent replaced <TIER> with the name of the service discovery tier that the monitoring config was set on.
  3. But the container collection agent replaced <TIER> with the name of the job (see the sketch after this list).
  4. The first four services had job names that matched their service discovery tiers. This fifth job didn’t.
  5. One of the service-level metrics was the number of requests sent to that datacenter, and it was used by the load director when deciding how to spread traffic across all datacenters.

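To make the mismatch concrete, here’s a minimal sketch of the placeholder expansion; the metric, tier, and job names are made up for illustration:

```python
# Hypothetical sketch of how the two agents expanded the <TIER> placeholder.

AGGREGATION_METRIC = "<TIER>.requests_per_datacenter"

def machine_agent_expand(template, sd_tier):
    # Old model: <TIER> becomes the service discovery tier the config was set on.
    return template.replace("<TIER>", sd_tier)

def container_agent_expand(template, job_name):
    # New model: <TIER> becomes the job name from the container spec.
    return template.replace("<TIER>", job_name)

# Services 1-4: job name matched the service discovery tier, so both agents
# produced the same aggregated metric name.
machine_agent_expand(AGGREGATION_METRIC, "frontend")        # frontend.requests_per_datacenter
container_agent_expand(AGGREGATION_METRIC, "frontend")      # frontend.requests_per_datacenter

# Service 5: the names differed, so the aggregate moved to a new, unexpected name
# while the load director kept polling the old one (which now reported ~0 traffic).
machine_agent_expand(AGGREGATION_METRIC, "frontend-serving")  # frontend-serving.requests_per_datacenter
container_agent_expand(AGGREGATION_METRIC, "frontend_job")    # frontend_job.requests_per_datacenter
```
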
The configuration change caused the service metrics to be aggregated under a different name. When the load director polled the monitoring system, the metrics it got back reported zero or very low traffic, so the load director fell back to old default values when routing traffic – and promptly caused a global service degradation as some datacenters got overwhelmed with traffic while others sat idle.

At the peak, we were dropping about 15% of traffic. Alarms fired, the problematic change was identified and reverted, and the systems were back to normal within 30 minutes.

On a personal note, I also felt remarkably supported in the immediate aftermath even though I was one of the two people who had just caused a highly visible incident. My old manager (who had moved to a different company the year before) somehow found out about it and sent me a message, and my fellow SREs managed to get a bottle of my preferred drink delivered to me.

The Blameless Postmortem

One thing The Company does well is blameless postmortems: theirs (similar to Google’s) are truly blameless.

Paraphrasing one of my coworkers (from memory):

An individual event shouldn’t be able to cause an incident; an incident is generally the result of a series of bugs or past decisions. It’s a technical problem that is fixable now that we know about it. Blaming an individual is counterproductive; we want people to report problems, not hide them for fear of retribution.

The postmortem didn’t focus on the change itself; instead, it focused on the gaps in the system. Why didn’t the load distribution system consider 0 to be invalid and safely fail (trigger an alarm & refuse to shift traffic)? What other remnants of incomplete migrations were still lurking?

Because the postmortem focused on identifying and fixing the technical contributing causes, when different parts of the load director later failed with a similar result of invalid metrics, there was no observable impact – the load director simply froze all changes.
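In spirit, the resulting guard looked something like the sketch below (all names are invented; this is not the actual load director code):

```python
# Hypothetical sketch of a fail-safe guard on the traffic metrics.

def plan_traffic_shift(per_datacenter_requests, previous_plan, alert):
    """Only compute a new traffic split if the inputs look sane."""
    total = sum(per_datacenter_requests.values())
    if not per_datacenter_requests or total == 0:
        # A service that was serving real traffic a minute ago doesn't drop to
        # zero everywhere at once - treat it as bad telemetry, not reality.
        alert("load-director: refusing to shift traffic on zero/missing metrics")
        return previous_plan  # freeze: keep routing with the last known-good plan

    # Otherwise, split traffic proportionally to observed per-datacenter load.
    return {dc: requests / total for dc, requests in per_datacenter_requests.items()}
```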

My approval of the change wasn’t malicious, so there were no direct repercussions during the aftermath and the incident review.

Why I consider it my worst mistake

Simply put, it was a combination of how visible the incident was and how straightforward it would have been to prevent.

It was sobering. I knew binary 0-to-100% changes are dangerous. I had generally been the person saying “hey, this is going to cause unknown/new behaviour, we should go slow with this”, to the point that I even have a standard nine-phase rollout: 0% -> 1% -> 2% -> 5% -> 10% -> 20% -> 40% -> 60% -> 80% -> 100%.
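For illustration, that habit looks roughly like the sketch below (the rollout, health-check, and rollback hooks are invented stand-ins, not real internal tooling):

```python
# Hypothetical sketch of the phased rollout habit described above.

PHASES = [1, 2, 5, 10, 20, 40, 60, 80, 100]  # percent of traffic on the new change, starting from 0%

def phased_rollout(apply_to_percentage, healthy, rollback, soak_minutes=30):
    for percent in PHASES:
        apply_to_percentage(percent)
        # Let the change soak, then check health before widening the blast radius.
        if not healthy(soak_minutes):
            rollback()
            return False  # stopped well before 100%, with a small blast radius
    return True
```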

To recognize that a proposed change was dangerous, and then approve it without any mitigation or further testing, was atypical for me. The fact that a change I was confident in indirectly caused a global service degradation was a confidence shaker. I had been overconfident and got bitten.

Admittedly, my normal phased rollout would not have prevented the issue entirely, but the impact wouldn’t have been as visible, and I could have said that I had done my best to make the change safe.

What did I learn?

One of my philosophies is you should only screw up in a specific way once.

One lesson was “When working in areas I don’t know well, be paranoid and go slower”. There’s always a tradeoff between speed and safety, and I was used to biasing for action because any issue could generally be handled with a revert or undo. Going slower doesn’t change that, but it gives me a chance to undo something before it hits 100%.

Another lesson was “Don’t be flippant”. While I did honestly believe the change had been tested enough to catch major issues, leaving the “See you in incident review” meme made it seem like I was either rubber-stamping the change, or that I knew there was going to be a problem and didn’t stop it.

My manager told me later that I was between two levels on the next performance review, and I got put in the lower one because of this incident. It wasn’t my decision itself that had made the difference, but that I had a pseudo-mentor role and I didn’t take the opportunity to encourage safe practices.

If I had instead written out why I believed the change to be safe, would that drop have happened? Would some people have a better view of me? I hope so, but I can’t change the past; I can only be more cautious in the future.

The final lesson is that I contributed to breaking production in a highly visible way, and I came out the other side mostly unscathed. My “Breaking Production” Rite of Passage: Complete

So what?

There’s no pithy ending here. I screwed up, stuff broke, I helped fix it, and I’m planning not to make the same mistake again.