
I think there's value in rolling the different failure modes up into a single concept for discussion and recall. And the whole thing comes in at 5 pages! "The thing that caused this problem isn't necessarily what's sustaining it" is something I'll commonly call out to people managing an incident, often before we know what the failure modes are.
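A toy sketch of that "trigger vs. sustaining cause" distinction, in case it helps. Everything in it is an assumption made for illustration rather than something taken from the paper or the links below: the `simulate` function, the numbers, the rule that every failed request is retried on the next tick, and especially the rule that an overloaded server's goodput falls as offered load rises.

```python
def simulate(base_load, capacity, spike, spike_ticks, ticks, retry=True):
    """Toy model of a server; returns goodput (useful work) per tick.

    Assumptions (hypothetical, for illustration only):
      * every failed request is retried exactly once on the next tick
        (and retried again if that attempt fails too);
      * when offered load exceeds capacity, goodput degrades to
        capacity * capacity / offered, standing in for work wasted on
        requests that have already timed out.
    """
    retries = 0.0
    goodput_history = []
    for t in range(ticks):
        # Fresh traffic, plus the temporary spike (the trigger).
        fresh = base_load + (spike if t in spike_ticks else 0.0)
        offered = fresh + retries
        if offered <= capacity:
            goodput = offered
        else:
            goodput = capacity * capacity / offered
        failed = offered - goodput
        # Retries are the sustaining effect: they outlive the spike.
        retries = failed if retry else 0.0
        goodput_history.append(goodput)
    return goodput_history


# Same three-tick spike, with and without client retries.
with_retries = simulate(80.0, 100.0, 100.0, range(5, 8), 40, retry=True)
without_retries = simulate(80.0, 100.0, 100.0, range(5, 8), 40, retry=False)
print("final goodput with retries:   ", round(with_retries[-1], 2))
print("final goodput without retries:", round(without_retries[-1], 2))
```

In this model the spike ends at tick 8 and fresh load drops back below capacity, but with retries enabled the goodput never recovers, because the retry traffic alone keeps the server overloaded. With retries disabled, or with a spike small enough that the backlog drains, it returns to normal as soon as the trigger goes away.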

I looked through the AWS Builders' Library for articles that seem to cover metastable failures. (Disclosure: I work there. Disclaimer: I'm writing here for myself, not my employer.) A system designer with more time might be interested in:

    https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/
    https://aws.amazon.com/builders-library/reliability-and-constant-work/
    https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
    https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
    https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/

If the Google SRE book is your preferred school of thought, you might like:

    https://sre.google/sre-book/load-balancing-datacenter/
    https://sre.google/sre-book/handling-overload/
    https://sre.google/sre-book/addressing-cascading-failures/

