ResiliNets Principles

From ResiliNetsWiki

Jump to: navigation, search

The ResiliNets Architecture is guided by a set of architectural principles, which support the ResiliNets Strategy D²R²+DR: defend, detect, remediate, recover, diagnose, refine, which in turn support the four ResiliNets Axioms IUER: inevitability, understand, expect, and respond. The principles are implemented by ResiliNets Mechanisms.

The target number of principles is O(10): large enough to provide specific guidance in the architecture and design of resilient networks, but small enough to be manageable without being overwhelming. Each principle is then applied to specific aspects of network architecture and design. For example, the principles of redundancy, diversity, and resource tradeoffs are refined and instantiated to the specific contexts of network architecture as

fault tolerant network components (redundancy)
spatially diverse redundant paths (redundancy and diversity)
without replicating everything (resource tradeoffs).

Prerequisites

P1. Service Requirements

Service requirements determine the need for network resilience

The user and application service requirements determine the necessary level and properties of resilience. In this sense, resilience and its subset survivability is an additional QoS property along with conventional properties such as performance. It is important to note that various applications demand different levels of resilience, and that some applications may not need networks that are highly resilient.

Application service resilience:

ability to access information
continuity of communication association
service of networked storage and distributed processing

Relationship with other principles:

specifies the requirements for P4. Metrics
determines the resistance to P3. Threat and Challenge Models

P2. Normal Behaviour

Normal behaviour must be specified, verified, and refined through monitoring to understand normal operations

In order to understand normal operations and thus be able to detect adverse events or conditions, it is essential to understand the normal behaviour of the system. This involves several phases:

Rigorous specification of system protocol and design, along with constraints in behaviour
Functional verification of design
Monitoring and learning normal behaviour of the system in situ
Refinement of behaviour specification and operational constraints based on observation and learning of in situ operation

The refinement step is essential since existing systems have neither complete nor correct specifications, and the inherent complexity of networks challenges the ability to correctly specify such systems a priori

Relationship with other principles:

P1. Service Requirements
P3. Threat and Challenge Models
P4. Metrics

P3. Threat and Challenge Models

Threat and Challenge Models are essential to understanding and detecting potential adverse events and conditions

It is not possible to understand, define, and implement mechanisms for resilience that defend against, detect, and remediate challenges without a model of the threat.

Relationship with other principles:

P1. Service Requirements
P4. Metrics
P9. Heterogeneity in Mechanism, Trust and Policy
P13. Security and Self-Protection

P4. Metrics

Metrics are needed to measure and engineer network resilience

Metrics are needed to understand, analyse, evaluate, and engineer network resilience (and survivability). Furthermore, metrics are needed to understand the impact of an adverse event or condition on the network service provided.

Relationship with other principles:

determines the satisfaction of P1. Service Requirements
measures the impact of P3. Threat and Challenge Models

P5. Heterogeneity in Mechanism, Trust, and Policy

Heterogeneity in mechanism, trust, and policy among different network realms is a reality of emerging multi-provider networks; resilient mechanisms must admit this heterogeneity

It is increasingly unrealistic to consider the set of global networks (including the global Internet, PSTN, and domain-specific stovepipes) as a homogeneous internet. Rather the network becomes a collection of realms or compartments, which each have their own mechanism (addressing, forwarding, routing, signalling, traffic and resource management), and which define trust and policy relationships with one another. Social, political, economic, and regulatory concerns are manifest at these boundaries as tussle.

Network resilience is impacted in two ways:

The resilience architecture and mechanism must admit this heterogeneity. Some of these issues result from the legacy PSTN (IXC/ILEC/CLEC) and Internet (AS/peering/tiers) architecture.
Resilience depends on well-defined relationships among entities that need trust and policy boundaries. Thus resilience can exploit these relationships.

Examples:

traditional wired Internet
optical circuit-switched networks
mobile infrastructure (cellular) networks
mobile ad hoc networks
sensor networks

Relationship with other principles:

P3. Threat and Challenge Models
P13. Security and Self-Protection

Tradeoffs

P6. Resource Tradeoffs

Resource tradeoffs determine the deployment of resilience mechanisms

Networks are collections of resources that are not infinitely abundant.
P – processing
M – memory
B – bandwidth (rate)
E – energy and power
L – latency
£$€F¥ – cost

The relative composition and placement of these resources must be balanced to optimise resilience and cost. The maximum availability of a particular resource serves as a constraint in these optimisations.

Relationship with other principles:

determines the relative application of other principles that have some resource cost. For example P9. Redundancy and P10. Diversity both require additional resource. In the limit, an infinite degree of both redundancy and diversity could maximise resilience, but this is clearly not practical. Determines the distribution among P7. Multilevel Resilience.

P7. Complexity

Complexity of the network in general, and resilience in particular, must be reduced to maximise overall resilience

Networks are inherently complex systems of systems, with interactions at multiple levels of hardware and software [Sterbenz-Touch-2001]. While many of the resilience principles and mechanisms increase this complexity, complexity itself makes systems difficult to understand and threatens resilience. The degree of complexity must be carefully balanced in terms of cost vs. benefit, and unnecessary complexity should eliminated.

Relationship with other principles:

P1. Service Requirements
P3. Threat and Challenge Models
P4. Metrics
P5. Resource Tradeoffs
P12. Self-Organising and Autonomic

P8. State Management

State management is an essential aspect of networks in general, and resilience mechanisms in particular; the alternatives of how to distribute and manage this state are critical to resilience

State management is an essential part of any large complex system and is related to resilience in two ways:

The resilience of the network is impacted by the way state is managed and the choices that are made:
- stateless vs. soft state vs. hard state
- distributed vs. mirrored vs. centralised
- tolerance of inconsistent vs. requirement for consistent

Resilience tends to favour the set of choices in bold, but these are choices that must be carefully made in the larger context of the overall system architecture.

Resilience mechanisms themselves require state and it is important that they achieve their goal in increasing overall resilience by the way in which they manage state.

This principle is used to make a choice in mechanism.

Relationship with other principles:

P5. Resource Tradeoffs
P6. Complexity
P10. Redundancy
P12. Self-Organising and Autonomic
P16. Context-Awareness

Enablers

P9. Security and Self-Protection

Security and self-protection are essential properties of entities to defend against challenges in a resilient network

The resilience of the network is dependent on the ability for entities to protect and defend themselves against challenges. Self-protection is implemented by a number of mechanisms, including but not limited to mutual suspicion, the AAA mechanisms of

as well as the additional conventional security mechanisms of

Relationship with other principles:

P3. Threat and Challenge Models
P9. Heterogeneity in Mechanism, Trust, and Policy
P12. Self-Organising and Autonomic
P16. Context-Awareness

P10. Connectivity and Association

Connectivity and association among communicating entities should be maintained when possible, but information flow should still take place even when a stable end-to-end path does not exit

Network connectivity should be maintained when practical, so that conventional communication mechanisms can be used based on eventual stability, which rely on information flow along a stable end-to-end path to maximise performance (minimise delay and memory).

When a stable end-to-end path is not possible, information can still flow as far as possible, whenever possible, based on the eventual connectivity model. Nodes store-and-forward when necessary, and may exploit mobility when nodes store-and-haul information or move into communication range to restore connectivity.

Relationship with other principles:

P5. Resource Tradeoffs determine the feasibility of maintaining connectivity (e.g. transmission energy required to overcome noise) vs. alternate mechanisms.

P11. Redundancy

Redundancy in space and time increases resilience against faults and some challenges

Redundancy refers to the replication of entities in the network, generally to provide fault-tolerance. In the case that a fault disables part of a system, the redundant parts are able to operate and prevent a service failure.

Redundancy can be further categorised in two ways:

degree (k-redundant)
type (hot spare, active load balance, on-demand)

Spatial Redundancy

Replication of entities in space.

Examples:

triple modular redundant hardware
parallel links and network paths

Temporal Redundancy

Replication of entities in time.

Examples:

erasure coding consisting of repeated transmission of packets
periodic state synchronisation
periodic information transfer (e.g. digital fountain)

Information Redundancy

The transmission or storage of redundant information.

Example:

FEC (forward error correction) consisting of redundant bits
erasure coding

Relationship with other principles:

P3. Threat and Challenge Models
P5. Resource Tradeoffs
P6. Complexity
P11. Diversity

P12. Diversity

Diversity in space, time, medium, and mechanism increases resilience against challenges to particular choices.

Diversity consists of providing different alternatives so that even when challenges impact particular alternatives, other alternatives prevent degradation from normal operations. The degree of diversity is the number of different alternatives. Diverse alternative can either be simultaneously operational, in which case they defend against challenges, or they may be available for use as needed to remediate.

Spatial Diversity

Diversity of an entity spread across space. Spatial diversity requires redundancy of a degree at least equal to the degree of diversity.

Topological Diversity is spatial diversity across the (logical) topology of the network. Note that geographically diverse links and nodes may be topologically entwined.
Geographic Diversity is spatial diversity across the physical topology of the network. Note that topologically diverse links or nodes may be physically co-located.

Spatial diversity examples:

spatially diverse links
spatially diverse nodes

Temporal Diversity

Diversity in the temporal behaviour of a component or protocol.

Temporal diversity example:

variation in timing of protocol state transitions to resist traffic analysis

Operational Diversity

Operational diversity refers to alternatives in the architecture and implementation of network components and protocols.

Implementation Diversity prohibits systems to exhibit the same error caused by an implementation fault.
Medium Diversity provides choices among alternative physical mediums through which information can flow.
Mechanism Diversity consists of providing alternative mechanisms.

Operational diversity examples:

software from multiple vendors, such as operating systems on end systems and control of routers
switches and routers from different hardware vendors
fibre, free-space optical, RF – 802.11, 802.16
open vs. closed loop – ARQ vs. FEC

Relationship with other principles:

P1. Service Requirements
P3. Threat and Challenge Models
P6. Complexity
P7. Multilevel Resilience
P5. Resource Tradeoffs
P10. Redundancy
P14. State Management
P15. Connectivity and Association
P16. Context-Awareness
P17. Adaptability

P13. Multilevel Resilience

Multilevel resilience is needed with respect to protocol layer, protocol plane, and hierarchical network organisation

An overall resilient system requires resilience at the various internal levels of its implementation. In the case of the global network, multilevel resilience is needed along three orthogonal dimensions:

protocol layers: links, network paths, end-to-end transport, session control, applications
- we approach layers bottom up, that is each layer provides the best service practical given the cost of constrained resources, which provides a foundation for the next layer
planes: data, control, and management
network architecture: inside-out from fault tolerant components, through survivable subnetwork and network topologies, to the global Internetwork including attached end systems

Relationship with other principles:

P1. Service Requirements
P3. Threat and Challenge Models
P4. Metrics
P5. Resource Tradeoffs
P6. Complexity

P14. Context Awareness

Context awareness is necessary for network components to operate autonomously to detect challenges

Context awareness is needed for resilient nodes to monitor the network environment (channel conditions, link state, operational state of network components, etc.) and detect adverse events or conditions.

Relationship with other principles:

context awareness depends on P4. Metrics to measure and covey context
context awareness is an essential requirement for P12. Self Organising and Autonomic
context awareness is one of the inputs used for P17. Adaptability

P15. Translucency

Translucency is needed to control the degree of abstraction vs. the visibility between levels

Complex systems are structured into multiple levels to abstract complexity and separate concerns. In the case of networks this consists of three multilevel dimensions: layer, plane, and system organisation. While this abstraction is important, an opaque level boundary can hide too much and result in suboptimal and improper behaviour based on incorrect implicit assumptions about the adjacent level. Thus it is important that level boundaries be translucent in which cross-layer control loops allow selected state to be explicitly visible across levels.

dials expose state and behaviour from below
knobs influence behaviour from above

Relationship with other principles:

P1. Service Requirements
P4. Metrics
P5. Resource Tradeoffs
P6. Complexity
P7. Multilevel Resilience
P14. State Management
P16. Context-Awareness
P17. Adaptability

Behaviour

P16. Self-Organising and Autonomic

Self-organising and autonomic behaviour is necessary for network resilience that is highly reactive with minimal human intervention

A resilient network must initialise and operate itself with minimal human configuration, management, and intervention. Ideally human intervention should be limited to that desired based on high-level operational policy. The phases of autonomic networking consist of:

initialisation – auto-configuration of network components and their self-organisation into a network
steady-state normal operation – self-managing with minimal human interaction dictated by policy and self-optimising to dynamic network conditions
steady-state expecting faults and challenges – self-diagnosing and self-repair

Relationship with other principles:

P3. Threat and Challenge Models
P5. Resource Tradeoffs
P6. Complexity
P7. Multilevel Resilience
P9. Heterogeneity in Mechanism, Trust, and Policy
P13. Security and Self-Protection
P14. State Management
P16. Context-Awareness
P17. Adaptability

P17. Adaptability

Adaptability to the network environment is essential for a node in a resilient network to detect, remediate, and recover from challenges

Resilient network components need to adapt their behaviour based on dynamic network conditions, in particular to remediate from adverse events or conditions, as well as to recover to normal operations.

Relationship with other principles:

adaptability is an essential requirement for P12. Self Organising and Autonomic
adaptability is a short term reaction to P16. Context Awareness
long term changes in the network operation and structure are covered by P18. Evolvability

P18. Evolvability

Evolvability is needed to refine future behaviour to improve the response to challenges, as well as for the network architecture and protocols to respond to emerging threats and application demands

Resilient network components should evolve their behaviour for two primary reasons:

refinement of future behaviour based on reflection on the defence against, detection, and remediation of adverse events or conditions and recovery to normal operations
evolution and extension of the network architecture and protocols over time, in response to long term changes in
- user and application service requirements, including new and emerging applications
- technology trends and resource tradeoffs
- changes in the communication environment
- attack strategies and threat models

Relationship with other principles:

in cases that the network evolves itself, P12. Self Organising and Autonomic behaviour is required
evolvability has a longer time constant and relatively larger scope than P17. Adaptability

ResiliNets Principles

Contents

Prerequisites

P1. Service Requirements

P2. Normal Behaviour

P3. Threat and Challenge Models

P4. Metrics

P5. Heterogeneity in Mechanism, Trust, and Policy

Tradeoffs

P6. Resource Tradeoffs

P7. Complexity

P8. State Management

Enablers

P9. Security and Self-Protection

P10. Connectivity and Association

P11. Redundancy

Spatial Redundancy

Temporal Redundancy

Information Redundancy

P12. Diversity

Spatial Diversity

Temporal Diversity

Operational Diversity

P13. Multilevel Resilience

P14. Context Awareness

P15. Translucency

Behaviour

P16. Self-Organising and Autonomic

P17. Adaptability

P18. Evolvability

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Toolbox