Metastable Failures in Distributed Systems [pdf]

mjb · on Oct 4, 2021

I think this paper is super important, and anybody who designs or runs big systems should read it and take the core point to heart. As system designers, we're very used to thinking about systems as 'stable' and 'unstable', where stability is good, and instability is bad. What this paper points out is that many kinds of distributed systems have multiple 'stable' modes, some of which are modes where the system is stable (in a control theory sense), but not doing any useful work from the client's perspective. This is dangerous, because the system won't kick itself out of this "stable but down" mode without something changing: human input, a control plane taking action, etc.

I don't think this paper covers anything particularly new, but writing it down in this form, with the evidence they present, is very valuable. Hopefully this paper will deepen the conversation about applying control theory to distributed systems design and control problems, and allow a more theoretical approach to be taken to the design of these systems to avoid common causes of instability and bistability.

One of the authors has a great summary of the paper on his blog: http://charap.co/metastable-failures-in-distributed-systems/

I wrote a summary and discussion too: https://brooker.co.za/blog/2021/05/24/metastable.html

hinkley · on Oct 5, 2021

What bothers me most about The Cloud is that in every other era of networked computing, the people who forgot the https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu... would get smacked in the face with it in a matter of months or a few years at most. But now, people seem to be getting away with it, and so nobody is learning.

I suspect 'the problem' is in several parts.

One, there is one administrator, after a sort, and on top of that the amount of real-time monitoring is way up from the Sun era, and so humans can and do intervene more often. Less interactive systems fail for longer before anyone can rectify the situation. The network still isn't reliable, but the intervals are short and it's difficult to build a narrative about how much of a 'failure' it is for a service to be down for 10 minutes versus 8 hours.

Two, operating systems are still struggling to saturate high end networking hardware from a single process, let alone a single thread in that process. If you send data as fast as you can, you can still be doing the same from a couple other processes, so it takes longer to notice the bottleneck. Noisy neighbors can't completely shout you down.

The ones I'm still scratching my head over are homogenous networks and static topologies. Those really are affecting us, but probably get obscured in the statistics to a degree where they can be ignored by most of us, or at least can't be converted into an action item concretely enough, which is effectively the same outcome. They hide in response variability, among other places.

leghifla · on Oct 5, 2021

In your post: "There's no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again"

This is generally true, but as all rules, there are exceptions, and I encountered one a few month ago:

To be short, the system (an embedded soft real-time control) ran fine for a long time, and the user added more and more processes. After some glitch and "to be sure the restart fresh", he initiated a reboot... And then nothing worked anymore!

The problem: each process consumed a lot of RAM for a short period at their start. When the user added processes manually, everything ran smoothly. But as soon as a few processes needed to start roughly in sync, it took too much RAM, the OOM killer killed the entire app, and back to square one.

In a way, this is also an example of metastability: the application is restarting in a loop and cannot exit that loop on its own.

bitminer · on Oct 5, 2021

It is a useful paper, yes. Good examples, good diagnosis of issues in systems theory, good definition of a way forward.

However it suffers from (a) weak definitions and (b) implausible or strange descriptions.

For (a), what is a "strong feedback loop"? Does it have high gain (low error) or high bandwidth (fast)? Is it hidden (the accidental link imbalance example)? Is it obvious (the cold cache example)? What makes it "strong"?

Or, conversely, what is a "weak" feedback loop?

A number of acronyms are undefined (SRE, LIFO). I think I know what they mean, and most HN readers will too. What about the other readers?

And using Wikipedia to define metastability? There must be a more persistent or academically defendable reference. Wikipedia is OK for informal definitions. In a paper calling for more academic studies this is ironic.

(b) Section 2.1 "When replicas are sharded differently..." Huh?

Section 4 "upper bound" used as a verb. Should be "limit or place bounds on".

Section 4 "The strength of the loop depends on a host of constant factors from the environment..." Odd, the term is not defined but this is the second dependency listed. Very strange.

In short it needs/needed a better reviewer.

That all said, it has summarized a lot of good ideas on controlling stability in distributed systems.

Other references may be found in Adrian Colyer's "the morning paper". No longer updated but has many years of good references. See blog.acolyer.org.

nuerow · on Oct 5, 2021

> I think this paper is super important, and anybody who designs or runs big systems should read it and take the core point to heart.

Please forgive my naiveness but isn't this article just bundling well known and basic failure modes, which are already covered in distributed system 101 courses, into a new keyword?

The article covers retry storms, cache invalidation and resets (which even AWS documented well[1]), and brownouts. These are basic distributed system failure modes. What is there to gain by coining a redundant keyword to refer to basic failure modes in distributed systems which are already covered in intro courses?

[1] https://aws.amazon.com/builders-library/caching-challenges-a...

andrewf · on Oct 6, 2021

I think there's value in rolling the different failure modes up into a single concept for discussion and recall. And the whole thing comes in at 5 pages! "The thing that caused this problem isn't necessarily what's sustaining it" is something I'll commonly call out to people managing an incident, often before we know what the failure modes are.

I looked through the AWS Builders' Library for articles which seemed to cover metastable failures. (Disclosure: Work there. Disclaimer: Writing here for myself, not my employer). A system designer with more time might be interested in:

    https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/
    https://aws.amazon.com/builders-library/reliability-and-constant-work/
    https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
    https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
    https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/

If the Google SRE book is your preferred school of thought, you might like:

   https://sre.google/sre-book/load-balancing-datacenter/
   https://sre.google/sre-book/handling-overload/
   https://sre.google/sre-book/addressing-cascading-failures/

devnulll · on Oct 5, 2021

I actually thought this was a link to your blog at first. Took me a moment to realize it wasn't you.

Your May posting on Metastable was excellent, although your (now ancient) article on volatile variables and atomic processor operations has always been my favorite.

dang · on Oct 4, 2021

Discussed 4 months ago:

Metastable Failures in Distributed Systems - https://news.ycombinator.com/item?id=27506167 - June 2021 (11 comments)

...but on a day like today we dare not mark it as a dupe.

cs702 · on Oct 5, 2021

As I was reading the paper, it dawned on me that persistent large-scale systemwide problems in many other domains can be understood as metastable failures -- including, for example, persistent supply chain problems, persistent waves of disinformation on social networks, persistent government support/rescue of financial firms, persistent bear markets, and deep economic recessions. Quoting from the paper:

"A system starts in a stable state. Once the load rises above a certain threshold -- implicit and invisible -- the system enters a vulnerable state. The vulnerable system is healthy, but may fall into an unrecoverable metastable state due to a trigger. The vulnerable state is not an overloaded state; a system can run for months or years in the vulnerable state and then get stuck in a metastable state without any increase in load. In fact, many ... systems choose to run in the vulnerable state all the time because it has much higher efficiency than the stable state.

When one of many potential triggers causes the system to enter the metastable state, a feedback loop sustains the failure, causing the system to remain in the failure state until a big enough corrective action is applied. In the most severe outages, the feedback loop is contagious, causing portions of the system that weren’t exposed to the trigger to enter the failure state as well. It is common for an outage that involves a metastable failure to be initially blamed on the trigger, but the true root cause is the sustaining effect.

Metastable failures have a disproportionate impact on hyperscale distributed systems. ... The strength of many feedback loops is proportional to the scale, so they can slip past even a robust testing and deployment regime. The difference between the trigger and the sustaining effect makes it hard to discover the correct response, increasing the time to recovery. Shedding load as a corrective action can be a further source of disruption for users."

sitkack · on Oct 5, 2021

No one has mentioned "gray failure", https://www.cs.jhu.edu/~huang/paper/grayfailure-hotos17.pdf

> Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observabil- ity: that the system’s failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with them, we should focus on bridging the gap between different components’ perceptions of what constitutes failure.

ctlachance · on Oct 4, 2021

This paper introduced me to a new concept in system architecture. Thanks for posting it!

fmakunbound · on Oct 5, 2021

Does this lead into control system theory?