
In future postmortems (of which I hope there will be very few or even none) you may want to spell out your 'lessons learned' to show why particular items will never recur.


It always amuses me how people want reassurance that the next crisis will be a fresh, new problem, and not a repeat of one that has already been demonstrably solved.

A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.


That's an overly cynical take. Post-mortems are not for anyone's reassurance; they are a learning opportunity.

The airline industry is as safe as it is because every accident gets thoroughly investigated, with detailed reports ("post-mortems") including what to do differently going forward. These are taken as gospel among all players in the industry, and as a result you very rarely see two different accidents caused by the same thing anymore.


That was not at all what I was getting at, and it's a cheap shot well beneath you, especially because I suspect you know it wasn't what I was getting at.


My comment wasn't intended personally; your words about "will never recur" just reminded me of this peculiarity of software systems, where it's often error handling/monitoring/backups/etc. that cause cascading failures in the systems they're intended to safeguard.

I'm sorry if I misconstrued your meaning, but I am flattered that you think there are things beneath me!


Fair enough. I see the whole function of a postmortem in a very simple way: to avoid recurrence of the same fault. Yes, there will be plenty of new ones to make. But if you don't change your processes as the result of a failure you are almost certainly going to see a repeat because the bulk of the conditions are still the same. All it takes then is a minor regression and you're back to where you were before. This I've seen many times in practice and I suspect that Colin isn't immune to it. And yes, I look up to you, your writing is usually sharp and on point and has both amused me and educated me. So you have an image to live up to ;)


You'd love my team's recent postmortem, featuring the comment "action items have been copied from the previous postmortem".


Could be as simple as "test-restore a new server every 1-2 years".


You should consider this possible lesson:

"Our simple model that fails gracefully did so and was simple to recover"

Redundancies and failsafes are not free - they add complexity.

99.9% availability fails in boring ways.

99.999% availability fails in fascinating ways.
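
For a rough sense of scale behind those two figures (a back-of-the-envelope sketch, assuming a 365.25-day year):

    # Approximate downtime budget per year for each availability target.
    HOURS_PER_YEAR = 365.25 * 24

    for label, availability in [("99.9%", 0.999), ("99.999%", 0.99999)]:
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{label}: ~{downtime_hours:.2f} hours of downtime per year "
              f"(~{downtime_hours * 60:.0f} minutes)")

Three nines leaves room for roughly nine hours of downtime a year; five nines leaves only about five minutes.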


Yeah, I was going to do that but it was getting late, I wanted to get some sleep, and the post-mortem had already been waiting far too long to be sent out.

The main lesson learned was "rehearse this process at least once a year".


Agreed, that's the big one. But also: when sleep-deprived, take a nap!


The infrastructure page* says,

> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred

I read the postmortem as saying that a hardware failure did cause the service to become unavailable and that the code could not simply be restarted: a new server had to be built.

If that is correct, then as well as writing up the lessons learned (as Jacques mentions), this page could be updated with outage information -- or even with info on changes made to reduce the risk of repetition.

For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's roughly 99.98% uptime, i.e. a bit short of four nines.

* http://www.tarsnap.com/infrastructure.html
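
As a quick check of that figure (a sketch assuming a full-day outage and exactly fifteen years of operation -- the real outage length may differ):

    # Approximate availability for one ~1-day outage over 15 years.
    total_days = 15 * 365.25
    downtime_days = 1
    availability = 1 - downtime_days / total_days
    print(f"{availability:.4%}")  # ~99.9817%

So "four nines" is close, but a fraction short.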



