DEFINITION:

The built-in capability of a system to provide continued correct execution in the presence of a limited number of hardware or software faults.

The goal of fault tolerance is to build safety features into the software design or source code so that the system responds correctly to input data errors and prevents output and control errors.

The need for fault tolerance in a system is determined by the system requirements and the system safety assessment process.

(Source: ACARE Domain 607)

SUBDOMAINS:

  1. Fault-tolerant mechanisms: Redundancy, Backup (hot, cold, ...), Voting mechanisms, Fault detection
  2. Parallel processing / Synchronisation mechanisms
  3. Fault propagation, Isolation of fault effects
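
As an illustration of how redundancy, voting, and fault isolation from the list above can combine, the following is a minimal sketch of a majority voter over redundant replicas. It is not taken from the ACARE source: the names run_with_voting, NoMajorityError, healthy, wrong_value, and crashing are hypothetical, replica results are assumed to be hashable, and a strict majority of all replicas (including failed ones) is assumed to be required.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoMajorityError(Exception):
    """Raised when fewer than a strict majority of replicas agree."""


def run_with_voting(replicas: Sequence[Callable[[int], Hashable]], x: int) -> Hashable:
    """Run every replica on x and return the strict-majority result."""
    outputs = []
    for replica in replicas:
        try:
            outputs.append(replica(x))
        except Exception:
            # Fault detection and isolation: a crashing replica casts no vote
            # and its exception does not propagate to the caller.
            continue

    # Assumption: the majority is counted against ALL replicas, so a single
    # surviving replica can never win the vote on its own.
    needed = len(replicas) // 2 + 1
    if outputs:
        value, count = Counter(outputs).most_common(1)[0]
        if count >= needed:
            return value
    raise NoMajorityError(f"fewer than {needed} of {len(replicas)} replicas agree")


# Hypothetical replicas used only to exercise the voter.
def healthy(x: int) -> int:
    return x * x


def wrong_value(x: int) -> int:
    return x * x + 1  # simulated value fault


def crashing(x: int) -> int:
    raise RuntimeError("simulated replica failure")
```

With three replicas, the sketch masks one faulty replica, whether it returns a wrong value or crashes. Two simultaneous faults leave no majority, so the voter raises NoMajorityError instead of returning a possibly incorrect result, which matches the "limited number of faults" wording of the definition.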