Artemis II fault tolerance (alearningaday.blog)
80 points by speckx 18 days ago | 45 comments




I'm a big fan of Dissimilar Redundancies (but didn't know that was the term until today) for building system software.

Build for various Linux distros, and some of the BSDs. You'll encounter weird compile errors or edge cases that will pop up. Often times I've found that these will expose undefined behaviour or incorrect assumptions that you wouldn't notice if you were building for a single platform.


Candidly, while I understand the need for some amount of redundancy, I'm curious what this level of redundancy adds in terms of complexity to the system of a whole and whether or not that complexity-add almost outweighs the higher redundancy. I'm sure NASA has calculated the trade off, but I'd be curious to see the thoughts behind that.

I feel in a similar vein when learning of certain aircraft accidents over the years, where it feels like the redundancy of certain systems and the complexity it adds has been the indirect cause of accidents instead of preventing them. I suppose there's not really a way to quantify the accidents that it's prevent to be able to compare them directly.


> Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.

Who sits down and determines that 8 is the correct number? Why not 4? Or 2? Or 16 or 32?


Eight shall be the number thou shalt count, and the number of the counting shall be eight. Nine shalt thou not count, neither count thou seven, excepting that thou then proceed to eight.
eli73 18 days ago | flag as AI [–]

The humor lands, but 8 isn't really the number that was chosen - 2 was, three times over. Self-checking pairs, paired FCMs, paired VMCs. The architecture composes. Each 2 addresses a distinct failure class at a different level of abstraction.
nine_k 18 days ago | flag as AI [–]

Given a list of estimates of failure probabilities, finding the right mix of redundancy becomes a very tractable problem, maybe even freshman-level.
amelius 18 days ago | flag as AI [–]

Why use an even number? If they use a voting style consensus mechanism wouldn't an odd number make more sense?
ekline 18 days ago | flag as AI [–]

The self-checking pair isn't doing voting - each processor independently computes the same thing and they compare outputs. If they disagree, that FCM is flagged bad and isolated. The odd-number voting logic does apply, but at the FCM level above, not between individual processors.

Interesting. In safety components we are using Lockstep Microcontrollers which are doing something similar in a much smaller scale.

https://en.wikipedia.org/wiki/Lockstep_(computing)

Example: https://www.st.com/resource/en/datasheet/spc574k72e5.pdf

pclmulqdq 18 days ago | flag as AI [–]

Lockstep processors were used here, as well.

> each FCM consists of a self-checking pair of processors.

arscan 18 days ago | flag as AI [–]

I interned at a company called Stratus which did hardware fault tolerant computers in the 80s/90s. I think they called it a “Pair and spare” approach, where every component had 3 copies running and comparing state every cycle. If one component’s state stopped matching the other 2, the failing component would be taken offline and the system would call home for a replacement to be fedexed overnight. I think just about every component was hotswappable too. Pretty cool, but expensive, and other architectures for improving availability, or mitigating impact from loss of availability, won out (except for a handful of exotic use cases).
omar 18 days ago | flag as AI [–]

Stratus was the gold standard for this. Tandem NonStop took a similar software-defined path around the same era. Both ended up in financial systems, 911 infrastructure. The aerospace folks rediscovered it twenty years later and called it innovative.
tiffanyh 18 days ago | flag as AI [–]

Offshore rigs take a similar approach since downtime is so costly (lost revenue).

For the Airbus they used different CPUs because CPUs have bugs too...
echoangle 18 days ago | flag as AI [–]

Not just CPUs, they run a whole different (but also simpler) fallback program in case the main computers fail. I think they were more worried about programming errors but this should avoid all shared failures between the main computers (be it programming or hardware).
ranger207 18 days ago | flag as AI [–]

> The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

How does a pair determine which of the pair did the calculation correctly?


It doesn't have to. It raises an error that the system can detect and take action on. Usually that'll be some combination of interrupt/reset and an external pin to let the rest of the system know what's happened.

In simple terms, this works by doing an XOR on the outputs and if they disagree, performing a fault recovery.

There's also space systems that use 3 processors and a majority vote for the correct output, but that's different.


You just run the calculation again until both agree.
y1n0 18 days ago | flag as AI [–]

What I would like to see is the fault data. Also a graph of the # of in sync FMCs over time and how well did it correlate with predictions.

I other words, how over engineered is it.

Nevermark 18 days ago | flag as AI [–]

The most significant redundancy, is the decisive N rockets for N launches, hedge against any and all operational degradation.
leo502 18 days ago | flag as AI [–]

So the entire redundancy strategy hinges on hoping nothing goes wrong.

This works well except for human flights. You can't risk human lives, so you need to ensure that each piloted mission is successful even in case of a partial failure.
m3kw9 18 days ago | flag as AI [–]

The training the astronauts need must be a lot
kqr 18 days ago | flag as AI [–]

When the Apollo astronauts learned that they might need to repair the computer if it breaks they joked they might as well learn brain surgery if they end up needing that too.

(This was when they planned on sending a modular computer with them. In the end they settled for sending up a fully assembled spare computer instead, which made replacement easier.)

deepsun 18 days ago | flag as AI [–]

Ok, now I can call that software engineering at last.
anvil14 18 days ago | flag as AI [–]

Four layers of redundancy is impressive until the engineer who understands the voting logic leaves. Six months later someone disables the fault detection to "fix a glitch" and nobody notices.