ResiliNets Principles

From ResiliNetsWiki
Jump to: navigation, search

The ResiliNets Architecture is guided by a set of architectural principles, which support the ResiliNets Strategy D2R2+DR: defend, detect, remediate, recover, diagnose, refine, which in turn support the four ResiliNets Axioms IUER: inevitability, understand, expect, and respond. The principles are implemented by ResiliNets Mechanisms.

The target number of principles is O(10): large enough to provide specific guidance in the architecture and design of resilient networks, but small enough to be manageable without being overwhelming. Each principle is then applied to specific aspects of network architecture and design. For example, the principles of redundancy, diversity, and resource tradeoffs are refined and instantiated to the specific contexts of network architecture as

  • fault tolerant network components (redundancy)
  • spatially diverse redundant paths (redundancy and diversity)
  • without replicating everything (resource tradeoffs).
Resilinets-principles.png

Contents

Prerequisites

P1. Service Requirements

Service requirements determine the need for network resilience

The user and application service requirements determine the necessary level and properties of resilience. In this sense, resilience and its subset survivability is an additional QoS property along with conventional properties such as performance. It is important to note that various applications demand different levels of resilience, and that some applications may not need networks that are highly resilient.

Application service resilience:

  • ability to access information
  • continuity of communication association
  • service of networked storage and distributed processing

Relationship with other principles:

P2. Normal Behaviour

Normal behaviour must be specified, verified, and refined through monitoring to understand normal operations

In order to understand normal operations and thus be able to detect adverse events or conditions, it is essential to understand the normal behaviour of the system. This involves several phases:

  1. Rigorous specification of system protocol and design, along with constraints in behaviour
  2. Functional verification of design
  3. Monitoring and learning normal behaviour of the system in situ
  4. Refinement of behaviour specification and operational constraints based on observation and learning of in situ operation

The refinement step is essential since existing systems have neither complete nor correct specifications, and the inherent complexity of networks challenges the ability to correctly specify such systems a priori

Relationship with other principles:

P3. Threat and Challenge Models

Threat and Challenge Models are essential to understanding and detecting potential adverse events and conditions

It is not possible to understand, define, and implement mechanisms for resilience that defend against, detect, and remediate challenges without a model of the threat.

Relationship with other principles:

P4. Metrics

Metrics are needed to measure and engineer network resilience

Metrics are needed to understand, analyse, evaluate, and engineer network resilience (and survivability). Furthermore, metrics are needed to understand the impact of an adverse event or condition on the network service provided.

Relationship with other principles:

P5. Heterogeneity in Mechanism, Trust, and Policy

Heterogeneity in mechanism, trust, and policy among different network realms is a reality of emerging multi-provider networks; resilient mechanisms must admit this heterogeneity

It is increasingly unrealistic to consider the set of global networks (including the global Internet, PSTN, and domain-specific stovepipes) as a homogeneous internet. Rather the network becomes a collection of realms or compartments, which each have their own mechanism (addressing, forwarding, routing, signalling, traffic and resource management), and which define trust and policy relationships with one another. Social, political, economic, and regulatory concerns are manifest at these boundaries as tussle.

Network resilience is impacted in two ways:

  1. The resilience architecture and mechanism must admit this heterogeneity. Some of these issues result from the legacy PSTN (IXC/ILEC/CLEC) and Internet (AS/peering/tiers) architecture.
  2. Resilience depends on well-defined relationships among entities that need trust and policy boundaries. Thus resilience can exploit these relationships.

Examples:

  • traditional wired Internet
  • optical circuit-switched networks
  • mobile infrastructure (cellular) networks
  • mobile ad hoc networks
  • sensor networks

Relationship with other principles:

Tradeoffs

P6. Resource Tradeoffs

Resource tradeoffs determine the deployment of resilience mechanisms

Networks are collections of resources that are not infinitely abundant.
P – processing
M – memory
B – bandwidth (rate)
E – energy and power
L – latency
£$€F¥ – cost

The relative composition and placement of these resources must be balanced to optimise resilience and cost. The maximum availability of a particular resource serves as a constraint in these optimisations.

Relationship with other principles:

  • determines the relative application of other principles that have some resource cost. For example P9. Redundancy and P10. Diversity both require additional resource. In the limit, an infinite degree of both redundancy and diversity could maximise resilience, but this is clearly not practical. Determines the distribution among P7. Multilevel Resilience.

P7. Complexity

Complexity of the network in general, and resilience in particular, must be reduced to maximise overall resilience

Networks are inherently complex systems of systems, with interactions at multiple levels of hardware and software [Sterbenz-Touch-2001]. While many of the resilience principles and mechanisms increase this complexity, complexity itself makes systems difficult to understand and threatens resilience. The degree of complexity must be carefully balanced in terms of cost vs. benefit, and unnecessary complexity should eliminated.

Relationship with other principles:

P8. State Management

State management is an essential aspect of networks in general, and resilience mechanisms in particular; the alternatives of how to distribute and manage this state are critical to resilience

State management is an essential part of any large complex system and is related to resilience in two ways:

  1. The resilience of the network is impacted by the way state is managed and the choices that are made:
    • stateless vs. soft state vs. hard state
    • distributed vs. mirrored vs. centralised
    • tolerance of inconsistent vs. requirement for consistent

Resilience tends to favour the set of choices in bold, but these are choices that must be carefully made in the larger context of the overall system architecture.

  1. Resilience mechanisms themselves require state and it is important that they achieve their goal in increasing overall resilience by the way in which they manage state.

This principle is used to make a choice in mechanism.

Relationship with other principles:

Enablers

P9. Security and Self-Protection

Security and self-protection are essential properties of entities to defend against challenges in a resilient network

The resilience of the network is dependent on the ability for entities to protect and defend themselves against challenges. Self-protection is implemented by a number of mechanisms, including but not limited to mutual suspicion, the AAA mechanisms of

as well as the additional conventional security mechanisms of

Relationship with other principles:

P10. Connectivity and Association

Connectivity and association among communicating entities should be maintained when possible, but information flow should still take place even when a stable end-to-end path does not exit

Network connectivity should be maintained when practical, so that conventional communication mechanisms can be used based on eventual stability, which rely on information flow along a stable end-to-end path to maximise performance (minimise delay and memory).

When a stable end-to-end path is not possible, information can still flow as far as possible, whenever possible, based on the eventual connectivity model. Nodes store-and-forward when necessary, and may exploit mobility when nodes store-and-haul information or move into communication range to restore connectivity.

Relationship with other principles:

  • P5. Resource Tradeoffs determine the feasibility of maintaining connectivity (e.g. transmission energy required to overcome noise) vs. alternate mechanisms.

P11. Redundancy

Redundancy in space and time increases resilience against faults and some challenges

Redundancy refers to the replication of entities in the network, generally to provide fault-tolerance. In the case that a fault disables part of a system, the redundant parts are able to operate and prevent a service failure.

Redundancy can be further categorised in two ways:

  1. degree (k-redundant)
  2. type (hot spare, active load balance, on-demand)

Spatial Redundancy

Replication of entities in space.

Examples:

  • triple modular redundant hardware
  • parallel links and network paths

Temporal Redundancy

Replication of entities in time.

Examples:

  • erasure coding consisting of repeated transmission of packets
  • periodic state synchronisation
  • periodic information transfer (e.g. digital fountain)

Information Redundancy

The transmission or storage of redundant information.

Example:

  • FEC (forward error correction) consisting of redundant bits
  • erasure coding

Relationship with other principles:

P12. Diversity

Diversity in space, time, medium, and mechanism increases resilience against challenges to particular choices.

Diversity consists of providing different alternatives so that even when challenges impact particular alternatives, other alternatives prevent degradation from normal operations. The degree of diversity is the number of different alternatives. Diverse alternative can either be simultaneously operational, in which case they defend against challenges, or they may be available for use as needed to remediate.

Spatial Diversity

Diversity of an entity spread across space. Spatial diversity requires redundancy of a degree at least equal to the degree of diversity.

  1. Topological Diversity is spatial diversity across the (logical) topology of the network. Note that geographically diverse links and nodes may be topologically entwined.
  2. Geographic Diversity is spatial diversity across the physical topology of the network. Note that topologically diverse links or nodes may be physically co-located.

Spatial diversity examples:

  • spatially diverse links
  • spatially diverse nodes

Temporal Diversity

Diversity in the temporal behaviour of a component or protocol.

Temporal diversity example:

Operational Diversity

Operational diversity refers to alternatives in the architecture and implementation of network components and protocols.

  1. Implementation Diversity prohibits systems to exhibit the same error caused by an implementation fault.
  2. Medium Diversity provides choices among alternative physical mediums through which information can flow.
  3. Mechanism Diversity consists of providing alternative mechanisms.

Operational diversity examples:

  • software from multiple vendors, such as operating systems on end systems and control of routers
  • switches and routers from different hardware vendors
  • fibre, free-space optical, RF – 802.11, 802.16
  • open vs. closed loop – ARQ vs. FEC

Relationship with other principles:

P13. Multilevel Resilience

Multilevel resilience is needed with respect to protocol layer, protocol plane, and hierarchical network organisation

An overall resilient system requires resilience at the various internal levels of its implementation. In the case of the global network, multilevel resilience is needed along three orthogonal dimensions:

  1. protocol layers: links, network paths, end-to-end transport, session control, applications
    • we approach layers bottom up, that is each layer provides the best service practical given the cost of constrained resources, which provides a foundation for the next layer
  2. planes: data, control, and management
  3. network architecture: inside-out from fault tolerant components, through survivable subnetwork and network topologies, to the global Internetwork including attached end systems

Relationship with other principles:

P14. Context Awareness

Context awareness is necessary for network components to operate autonomously to detect challenges

Context awareness is needed for resilient nodes to monitor the network environment (channel conditions, link state, operational state of network components, etc.) and detect adverse events or conditions.

Relationship with other principles:

P15. Translucency

Translucency is needed to control the degree of abstraction vs. the visibility between levels

Complex systems are structured into multiple levels to abstract complexity and separate concerns. In the case of networks this consists of three multilevel dimensions: layer, plane, and system organisation. While this abstraction is important, an opaque level boundary can hide too much and result in suboptimal and improper behaviour based on incorrect implicit assumptions about the adjacent level. Thus it is important that level boundaries be translucent in which cross-layer control loops allow selected state to be explicitly visible across levels.

  • dials expose state and behaviour from below
  • knobs influence behaviour from above

Relationship with other principles:

Behaviour

P16. Self-Organising and Autonomic

Self-organising and autonomic behaviour is necessary for network resilience that is highly reactive with minimal human intervention

A resilient network must initialise and operate itself with minimal human configuration, management, and intervention. Ideally human intervention should be limited to that desired based on high-level operational policy. The phases of autonomic networking consist of:

Relationship with other principles:

P17. Adaptability

Adaptability to the network environment is essential for a node in a resilient network to detect, remediate, and recover from challenges

Resilient network components need to adapt their behaviour based on dynamic network conditions, in particular to remediate from adverse events or conditions, as well as to recover to normal operations.

Relationship with other principles:

P18. Evolvability

Evolvability is needed to refine future behaviour to improve the response to challenges, as well as for the network architecture and protocols to respond to emerging threats and application demands

Resilient network components should evolve their behaviour for two primary reasons:

  1. refinement of future behaviour based on reflection on the defence against, detection, and remediation of adverse events or conditions and recovery to normal operations
  2. evolution and extension of the network architecture and protocols over time, in response to long term changes in

Relationship with other principles:



© 2006–2010 James P.G. Sterbenz and David Hutchison

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox