Also, designing something "crash-first", even if you don't call it that, leads to many different approaches and possible improvements.
For example, let's imagine some embedded device in an ISP network: it is not very accessible, so high reliability is required. You can overengineer it into a super-reliable Voyager-class computer, but a) that will cost much more money than it should, and b) you will still fail to hit the target.
Or you can take the crash-first approach. Many things can be simplified then: for example, no need for stateful config, and no need to write any config management code that saves it, checks it, etc. You just rely on receiving a fresh config on every boot and writing it correctly to the controllers. Less complexity.
Then, since you are crash-first, you expect to reboot more often. So you optimize boot time, which would otherwise be a much lower-priority task. And suddenly your per-device downtime is several times lower.
You can also save effort on some hard stuff - e.g. any and all third-party controllers with third-party code blobs and weird APIs. Instead of writing health checks for each and every failure mode of things you can't really influence, you write a bare minimum, add a watchdog to reboot the whole unit, and hope it recovers. And this works very well in practice.
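That bare-minimum-plus-watchdog idea can be sketched in a few lines. This is a hypothetical sketch assuming embedded Linux with the kernel watchdog exposed at /dev/watchdog; `controller_alive` is a placeholder for whatever cheap health check you do end up writing.

```python
import time

def controller_alive():
    # Hypothetical bare-minimum health check: replace with whatever
    # cheap signal the third-party controller actually exposes.
    return True

def feed_watchdog(device="/dev/watchdog", interval_s=5):
    # On embedded Linux, opening the watchdog device arms it; writing
    # any byte resets its countdown. Stop feeding (health check fails,
    # or this process dies for any reason) and the hardware reboots
    # the whole unit - no per-failure-mode recovery code required.
    with open(device, "wb", buffering=0) as wd:
        while controller_alive():
            wd.write(b".")
            time.sleep(interval_s)
        # Falling out of the loop without feeding means the watchdog
        # fires and the unit reboots into a known-good state.
```

The point of the design is what is absent: there is no enumeration of failure modes, just one liveness signal and a hardware-enforced fallback.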
The list goes on. Instead of a very complicated all-in-one device, you have lightweight code with a good, predictable recovery mechanism. It is cheaper, and eventually even more reliable than the overengineered alternative. Another example: network failure. The overengineered device will do a lot of smart things trying to recover the network - re-initialize stuff, retry with different waits (and there will be a lot of retries) - and may eventually get stuck without access. The lightweight device does a short, simple wait with a simple retry, or a few of them, and then reboots. Statistically this is better than running some super-complicated recovery code, if the device is engineered from the start to reboot.
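The retry-then-reboot logic above can be sketched like this; the probe function and the retry/wait parameters are hypothetical stand-ins, not anything from a real device:

```python
import time

def wait_for_network(probe, retries=3, wait_s=10):
    # Short, simple retry loop: no escalating re-initialization,
    # no clever recovery state machine that might itself get stuck.
    for _ in range(retries):
        if probe():
            return True
        time.sleep(wait_s)
    # Give up. The caller triggers a clean reboot (e.g. by exiting
    # and letting the watchdog fire) - the cheap path on a device
    # that is engineered to reboot from the start.
    return False
```

In practice `probe` would be something like pinging the upstream gateway or checking the link state; the key design choice is that the failure branch is one line, not a recovery subsystem.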
I remember a visitor center large-scale multitouch game we made for Sea World. It was being installed for long-term heavy usage. We built it in Flash (I know) — it was lovely but we just couldn’t stop the memory leaks.
We made slight adjustments (outros and intros) so that it seemed natural to have a 10-second break for the program to restart. And we built in a longer cycle of full computer resets. It was an unreasonably stable system for years!
Great story. I like the Erlang/distributed-systems view of the world: who needs costly resilience or recovery when you can simply die and be reborn again? And if you can't do that, well... make it so you can. Erlang and distributed systems in general have no choice, because the kind of computing they do is so wickedly complex that there is no other way to fail - but your and GP's comments illustrate that even when you do have other options, this way of failing is simply easier and more effective.
Can you elaborate on
>We built it in Flash [...] we just couldn’t stop the memory leaks
I thought Flash games were written in a high-level JS-like language? Did it grant you enough access to raw memory that you could leak it? Or did you mean a high-level equivalent of memory leaks?
We never figured it out. But when the program ran for a long time, the RAM would fill up and the game would slow to a snail's pace. We turned to this restart solution in desperation.
The second one is interesting for using hexadecimal in its new syntax format, even though octal is a natural fit for a 15-bit word and was used in the display. The reverse was common back in the 70s and 80s, so I guess it's always been about which is understood rather than which is correct.
Erlang seems to follow this kind of philosophy too, although at a more granular level. The point is the separation of "worker" code from "supervisor" code: the "worker" is a well-behaved function without any checks for unexpected errors, and the "supervisor" is error-handling code that catches and resolves any errors that happen in the worker, expected or not.
Joe Armstrong's "Making reliable distributed systems in the presence of software errors" contains more information on the topic.
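In Erlang the supervisor restarts isolated processes; the same worker/supervisor separation can be caricatured in a few lines of Python. This is a single-process sketch of the idea, not Erlang's actual semantics (no process isolation, no restart strategies):

```python
import time

def supervise(worker, max_restarts=5):
    # Supervisor side: the only place that knows about failure.
    for _ in range(max_restarts):
        try:
            return worker()   # worker side: no defensive checks inside
        except Exception:
            time.sleep(0.01)  # brief pause, then let it be reborn
    # Crashing too often is itself a failure mode: escalate it to
    # this supervisor's own supervisor instead of looping forever.
    raise RuntimeError("worker restarted too many times")
```

The worker stays clean because every error, anticipated or not, takes the same path: die, get restarted, and either recover or escalate.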