
In future postmortems (of which I hope there will be very few or even none) you may want to spell out your 'lessons learned' to show why particular items will never recur.


It always amuses me how people want reassurance that the next crisis will be a fresh, new problem, and not a repeat of one that has already been demonstrably solved.

A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.


That's an overly cynical take. Post-mortems are not for anyone's reassurance; they are a learning opportunity.

The airline industry is as safe as it is because every accident gets thoroughly investigated, with detailed reports ("post-mortems") including what to do differently going forward. These are taken as gospel among all players in the industry, and as a result you very rarely see two different accidents caused by the same thing anymore.


That was not at all what I was getting at, and it's a cheap shot well beneath you, especially because I suspect you know it wasn't what I was getting at.


My comment wasn't intended personally; your words about "will never recur" just reminded me of this peculiarity of software systems, where it's often error handling/monitoring/backups/etc. that cause cascading failures in the systems they're intended to safeguard.

I'm sorry if I misconstrued your meaning, but I am flattered that you think there are things beneath me!


Fair enough. I see the whole function of a postmortem in a very simple way: to avoid recurrence of the same fault. Yes, there will be plenty of new ones to make. But if you don't change your processes as the result of a failure you are almost certainly going to see a repeat because the bulk of the conditions are still the same. All it takes then is a minor regression and you're back to where you were before. This I've seen many times in practice and I suspect that Colin isn't immune to it. And yes, I look up to you, your writing is usually sharp and on point and has both amused me and educated me. So you have an image to live up to ;)


You'd love my team's recent postmortem, featuring the comment "action items have been copied from the previous postmortem".


Could be as simple as "test-restore a new server every 1-2 years".


You should consider this possible lesson:

"Our simple model that fails gracefully did so and was simple to recover"

Redundancies and failsafes are not free - they add complexity.

99.9% availability fails in boring ways.

99.999% availability fails in fascinating ways.
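
For a rough sense of scale behind those two figures (a back-of-the-envelope sketch, assuming a 365.25-day year):

    # Approximate downtime budget per year for each availability target.
    HOURS_PER_YEAR = 365.25 * 24

    for label, availability in [("99.9%", 0.999), ("99.999%", 0.99999)]:
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{label}: ~{downtime_hours:.2f} hours of downtime per year "
              f"(~{downtime_hours * 60:.0f} minutes)")

Three nines leaves room for roughly nine hours of downtime a year; five nines leaves only about five minutes.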


Yeah, I was going to do that but it was getting late, I wanted to get some sleep, and the post-mortem had already been waiting far too long to be sent out.

The main lesson learned was "rehearse this process at least once a year".


Agreed, that's the big one. But also: when sleep-deprived, take a nap!


The infrastructure page* says,

> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred

I read the postmortem as saying that a hardware failure did cause the service to become unavailable and that the code could not simply be restarted: a new server had to be built.

If that is correct, then as well as writing up the lessons learned (as Jacques mentions), this page could be updated with outage information -- or even with info on changes made to reduce the risk of repetition.

For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's roughly 99.98% uptime, i.e. a bit short of four nines.

* http://www.tarsnap.com/infrastructure.html
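
As a quick check of that figure (a sketch assuming a full-day outage and exactly fifteen years of operation -- the real outage length may differ):

    # Approximate availability for one ~1-day outage over 15 years.
    total_days = 15 * 365.25
    downtime_days = 1
    availability = 1 - downtime_days / total_days
    print(f"{availability:.4%}")  # ~99.9817%

So "four nines" is close, but a fraction short.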



