ResiliNets Strategy
The ResiliNets Architecture is based on a six-step, two-phase strategy, D2R2+DR: defend, detect, remediate, recover, diagnose, and refine. These steps support the four ResiliNets Axioms, IUER: inevitable, understand, expect, and respond. The strategy is supported by the ResiliNets Principles, which are implemented by the ResiliNets Mechanisms.
Phase 1: Real-Time Control Loop – D2R2
The first phase consists of a cycle of four steps that are performed in real time and are directly involved in network operation and service provision. Many of these cycles operate simultaneously, triggered whenever an adverse event or condition is detected.
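One way to picture a single D2R2 cycle is as a small state machine. The sketch below is illustrative only: the names `Step`, `next_step`, and `trace` are hypothetical, and the transition rules (for example, returning to defend when a detection is not confirmed) are assumptions rather than part of the ResiliNets strategy itself.

```python
from enum import Enum
from typing import Iterable, List


class Step(Enum):
    DEFEND = "defend"
    DETECT = "detect"
    REMEDIATE = "remediate"
    RECOVER = "recover"


def next_step(step: Step, adverse: bool) -> Step:
    """Return the next step, given whether an adverse event or condition is observed.

    The transition rules are assumptions made for this sketch.
    """
    if step is Step.DEFEND:
        # Defences are always in place; an observed challenge triggers detection.
        return Step.DETECT if adverse else Step.DEFEND
    if step is Step.DETECT:
        # A confirmed challenge leads to remediation, otherwise back to defending.
        return Step.REMEDIATE if adverse else Step.DEFEND
    if step is Step.REMEDIATE:
        # Keep remediating while the challenge persists, then recover.
        return Step.REMEDIATE if adverse else Step.RECOVER
    # RECOVER: return to normal operation unless a new challenge appears.
    return Step.DETECT if adverse else Step.DEFEND


def trace(observations: Iterable[bool]) -> List[Step]:
    """Run one cycle instance over a sequence of observations, starting from DEFEND."""
    step = Step.DEFEND
    steps = []
    for adverse in observations:
        step = next_step(step, adverse)
        steps.append(step)
    return steps


if __name__ == "__main__":
    print([s.value for s in trace([False, True, True, True, False, False])])
```

Because many cycles operate simultaneously, a real system would presumably keep one such instance per detected event or condition rather than a single global state.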
S1. Defend
The first step in the resilience strategy is to defend against challenges and threats to normal operation. The goals are to
- reduce the probability of a fault leading to a failure
- reduce the impact of an adverse event or condition
A threat analysis is necessary to mount a defence.
Examples of defences:
- erasure coding over spatially redundant diverse paths, which permits data transfer to continue even when one of the paths is disrupted
- secure signalling protocols with necessary authentication and encryption to resist traffic analysis and prevent the injection of bogus signalling messages
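The first example can be illustrated with the simplest possible erasure code: two data shares plus one XOR parity share, each sent over a different path. This is a minimal sketch, not the coding scheme used by ResiliNets. The functions `encode` and `decode` are hypothetical, the three-path layout is an assumption, and the sketch only handles even-length input.

```python
from typing import List, Optional


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))


def encode(data: bytes) -> List[bytes]:
    """Split data into two halves and add an XOR parity share (one share per path)."""
    if len(data) % 2:
        raise ValueError("this sketch assumes even-length data")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, _xor(a, b)]


def decode(shares: List[Optional[bytes]]) -> bytes:
    """Rebuild the data from three shares, where at most one share is None (lost)."""
    a, b, parity = shares
    if sum(s is None for s in shares) > 1:
        raise ValueError("more than one path was disrupted; cannot recover")
    if a is None:
        a = _xor(b, parity)
    elif b is None:
        b = _xor(a, parity)
    return a + b


if __name__ == "__main__":
    message = b"resilient!"
    shares = encode(message)
    for lost_path in range(3):
        received: List[Optional[bytes]] = list(shares)
        received[lost_path] = None  # simulate disruption of one path
        assert decode(received) == message
```

The transfer survives the loss of any single path at the cost of 50% extra traffic. Tolerating more simultaneous path failures would need a stronger code and more diverse paths.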
S2. Detect
The second step is to detect when an adverse event or condition has occurred. Detection is used to determine when defences
- need to be strengthened
- have failed and remediation needs to occur
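The two outcomes of detection can be sketched as a simple classifier over an observed metric. Everything specific here is an assumption made for illustration: the choice of packet loss rate as the metric, the threshold values, and the names `Action` and `classify`.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    STRENGTHEN = "strengthen defences"
    REMEDIATE = "remediate"


def classify(loss_rate: float, warn: float = 0.01, fail: float = 0.10) -> Action:
    """Map an observed loss rate to a response.

    warn and fail are hypothetical thresholds: at or above warn the defences are
    under pressure; at or above fail they are treated as having failed.
    """
    if loss_rate >= fail:
        return Action.REMEDIATE
    if loss_rate >= warn:
        return Action.STRENGTHEN
    return Action.NONE


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.5):
        print(rate, classify(rate).value)
```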
S3. Remediate
The third step is to remediate the effects of the adverse event or condition to minimise the impact. The goal is to do the best possible at all levels after an adverse event and during an adverse condition. Corrective action must be taken at all levels to minimise the impact of service failure, including correct operation with graceful degradation of performance.
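Graceful degradation can be sketched as choosing the best service level that the remaining capacity still supports. The service levels, their capacity requirements, and the function `degrade` are all hypothetical, chosen only to make the idea concrete.

```python
from typing import Optional

# Hypothetical service levels, best first: (name, required capacity in Mb/s).
LEVELS = [("hd-video", 8.0), ("sd-video", 2.0), ("audio-only", 0.1)]


def degrade(available_mbps: float) -> Optional[str]:
    """Return the best service level that fits the available capacity.

    Returns None when even the lowest level cannot be sustained,
    which corresponds to service failure.
    """
    for name, required in LEVELS:
        if available_mbps >= required:
            return name
    return None


if __name__ == "__main__":
    for capacity in (10.0, 3.0, 0.5, 0.05):
        print(capacity, degrade(capacity))
```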
S4. Recover
The fourth step is to recover to original and normal operations, including control and management of the network.
Once an adverse event has ended or an adverse condition is removed, the network should recover from its remediation state to allow any degraded services to return to normal performance and operation.
Examples:
- deployment of replacement infrastructure after a natural disaster
- restoration of normal routes after termination of a DDoS attack or the end of a flash crowd
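Deciding when an adverse condition has really ended is itself a judgement call. The sketch below assumes a hold-down rule: recovery starts only after the condition has been absent for several consecutive observation intervals. The hold-down rule, the default of three intervals, and the function `should_recover` are assumptions added for illustration, not part of the strategy as described.

```python
from typing import Sequence


def should_recover(history: Sequence[bool], hold_down: int = 3) -> bool:
    """Decide whether to leave the remediation state.

    history holds one flag per observation interval, most recent last;
    True means the adverse condition was observed in that interval.
    Recovery is allowed only after hold_down consecutive clear intervals.
    """
    if len(history) < hold_down:
        return False
    return not any(history[-hold_down:])


if __name__ == "__main__":
    print(should_recover([True, False, False]))         # one of the last three is adverse
    print(should_recover([True, False, False, False]))  # last three are all clear
```

Waiting for several clear intervals avoids switching back to normal routes in the middle of an attack that only pauses briefly.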
Phase 2: Background Diagnosis and Refinement – DR
The second phase consists of two background operations that observe and modify the behaviour of the D2R2 cycle: diagnosis of faults and refinement of future behaviour.
S5. Diagnose
While it is not possible to directly detect faults, a system may be able to detect resultant errors within itself, or failures may be detected outside the system. From these, it may be possible to diagnose the fault that was the root cause. Diagnosis may result in an improved system design, and may affect recovery so that the network returns to a better state.
S6. Refine
The final aspect of the strategy is to refine behaviour for the future based on past D2R2 cycles. The goal is to learn and reflect on how the system has defended, detected, remediated, and recovered so that all of these can be improved to continuously increase the resilience of the network.
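As one small example of refinement, records of past D2R2 cycles could be used to tune a detection threshold. This is a sketch under stated assumptions: the `CycleRecord` fields, the multiplicative adjustment rule, and the function `refine_threshold` are hypothetical and stand in for whatever learning method a real system would use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CycleRecord:
    false_alarm: bool     # remediation was triggered without a real challenge
    late_detection: bool  # a real challenge was detected only after service failed


def refine_threshold(threshold: float, past: Sequence[CycleRecord],
                     step: float = 0.1) -> float:
    """Adjust a detection threshold based on past cycles.

    If late detections outnumber false alarms, lower the threshold so that
    detection becomes more sensitive; in the opposite case raise it;
    otherwise leave it unchanged.
    """
    false_alarms = sum(r.false_alarm for r in past)
    late = sum(r.late_detection for r in past)
    if late > false_alarms:
        return threshold * (1 - step)
    if false_alarms > late:
        return threshold * (1 + step)
    return threshold


if __name__ == "__main__":
    history = [CycleRecord(False, True), CycleRecord(False, True), CycleRecord(True, False)]
    print(refine_threshold(100.0, history, step=0.5))
```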
Representation
Related Work
Several other research efforts have proposed strategies for various aspects of survivability, dependability, and fault tolerance.
ANSA
- fault confinement (defence)
- fault detection (error/failure detection)
- fault diagnosis (diagnosis)
- reconfiguration (remediation)
- recovery (remediation)
- restart (remediation)
- repair (recovery)
- reintegration (recovery)
DENISE
CMU SEI
- resistance (defend)
- recognition (detect)
- recovery (remediate, recover)
- adaptation and evolution (refine)
[ Ellison-Fisher-Linger-Lipson-Longstaff-Mead-1999 ]
© 2006–2007 James P.G. Sterbenz and David Hutchison