Also, designing something "crash-first", even if you don't call it that, leads to many different approaches and possible improvements.
For example, let's imagine some embedded device in an ISP network: it is not very accessible, so high reliability is required. You can overengineer it into a super-reliable Voyager-class computer, but a) that will cost much more money than it should, and b) you will still fail to hit the target.
Or you can take the crash-first approach. Many things can be simplified then: for example, no need for stateful config, and no need to write any config management code that saves it, checks it, etc. You just rely on receiving a fresh config on every boot and writing it correctly to the controllers. Less complexity.
Then, since you are crash-first, you expect to reboot more often. So you optimize boot time, which would otherwise be a much lower-priority task. And suddenly your per-device downtime is several times lower.
You can also save effort on some hard stuff - e.g. any and all third-party controllers with third-party code blobs and weird APIs. Instead of writing health checks for each and every failure mode of things you can't really influence, you write a bare minimum, add a watchdog to reboot the whole unit, and hope it recovers. And this works very well in practice.
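That bare-minimum-plus-watchdog idea can be sketched in a few lines. This is a hypothetical sketch assuming embedded Linux with the kernel watchdog exposed at /dev/watchdog; `controller_alive` is a placeholder for whatever cheap health check you do end up writing.

```python
import time

def controller_alive():
    # Hypothetical bare-minimum health check: replace with whatever
    # cheap signal the third-party controller actually exposes.
    return True

def feed_watchdog(device="/dev/watchdog", interval_s=5):
    # On embedded Linux, opening the watchdog device arms it; writing
    # any byte resets its countdown. Stop feeding (health check fails,
    # or this process dies for any reason) and the hardware reboots
    # the whole unit - no per-failure-mode recovery code required.
    with open(device, "wb", buffering=0) as wd:
        while controller_alive():
            wd.write(b".")
            time.sleep(interval_s)
        # Falling out of the loop without feeding means the watchdog
        # fires and the unit reboots into a known-good state.
```

The point of the design is what is absent: there is no enumeration of failure modes, just one liveness signal and a hardware-enforced fallback.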
The list goes on. Instead of a very complicated all-in-one device, you have lightweight code with a good, predictable recovery mechanism. It is cheaper, and eventually even more reliable than the overengineered alternative. Another example: network failure. The overengineered device will do a lot of smart things trying to recover the network - re-initialize stuff, retry with different waits (and there will be a lot of retries) - and may eventually get stuck without access. The lightweight device does a short, simple wait with a simple retry, or a few of them, and then reboots. Statistically this is better than running some super-complicated recovery code, if the device is engineered from the start to reboot.
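The retry-then-reboot logic above can be sketched like this; the probe function and the retry/wait parameters are hypothetical stand-ins, not anything from a real device:

```python
import time

def wait_for_network(probe, retries=3, wait_s=10):
    # Short, simple retry loop: no escalating re-initialization,
    # no clever recovery state machine that might itself get stuck.
    for _ in range(retries):
        if probe():
            return True
        time.sleep(wait_s)
    # Give up. The caller triggers a clean reboot (e.g. by exiting
    # and letting the watchdog fire) - the cheap path on a device
    # that is engineered to reboot from the start.
    return False
```

In practice `probe` would be something like pinging the upstream gateway or checking the link state; the key design choice is that the failure branch is one line, not a recovery subsystem.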
I remember a visitor center large-scale multitouch game we made for Sea World. It was being installed for long-term heavy usage. We built it in Flash (I know) — it was lovely but we just couldn’t stop the memory leaks.
We made slight adjustments (outros and intros) so that it seemed natural to have a 10-second break for the program to restart. And we built in a longer cycle of full computer resets. It was an unreasonably stable system for years!
Great story. I like the Erlang/distributed-systems view of the world: who needs costly resilience or recovery when you can simply die and be reborn again? And if you can't do that, well... make it so you can. Erlang and distributed systems in general have no choice, because the kind of computing they do is so wickedly complex that there is no other way to fail - but your and GP's comments illustrate that even when you do have other options, this way of failing is simply easier and more effective.
Can you elaborate on
>We built it in Flash [...] we just couldn’t stop the memory leaks
I thought Flash games were written in a high-level JS-like language? Did it grant you enough access to raw memory that you could leak it? Or did you mean a high-level equivalent of memory leaks?
We never figured it out. But when the program ran for a long time, the RAM would fill up and the game would slow to a snail's pace. We turned to this restart solution in desperation.
The second one is interesting for using hexadecimal in its new syntax format, even though octal is a natural fit for a 15-bit word and was used in the display. The reverse was common back in the 70s and 80s, so I guess it's always been about which is understood rather than which is correct.
Erlang seems to follow this kind of philosophy too, although at a more granular level. The point is the separation of "worker" code from "supervisor" code: the "worker" is a well-behaved function without any checks for unexpected errors, and the "supervisor" is error-handling code that catches and resolves any errors that happen in the worker, expected or not.
Joe Armstrong's "Making reliable distributed systems in the presence of software errors" contains more information on the topic.
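In Erlang the supervisor restarts isolated processes; the same worker/supervisor separation can be caricatured in a few lines of Python. This is a single-process sketch of the idea, not Erlang's actual semantics (no process isolation, no restart strategies):

```python
import time

def supervise(worker, max_restarts=5):
    # Supervisor side: the only place that knows about failure.
    for _ in range(max_restarts):
        try:
            return worker()   # worker side: no defensive checks inside
        except Exception:
            time.sleep(0.01)  # brief pause, then let it be reborn
    # Crashing too often is itself a failure mode: escalate it to
    # this supervisor's own supervisor instead of looping forever.
    raise RuntimeError("worker restarted too many times")
```

The worker stays clean because every error, anticipated or not, takes the same path: die, get restarted, and either recover or escalate.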