Survivability
Survivability is the capability of a system to fulfil its mission, in a timely manner, in the presence of threats such as targeted attacks or large-scale natural disasters resulting in many failures, in addition to the few random failures covered by fault tolerance. Survivability is thus a superset of fault tolerance but a subset of resilience. [ResiliNets with strong influence from Ellison-Fisher-Linger-Lipson-Longstaff-Mead-1999]
[Frank-Frisch-1970 (doi) .]
H. Frank and I.T. Frisch,
“Analysis and Design of Survivable Networks”,
IEEE Transactions on Communication Technology, vol.18, #5, October 1970, pp. 501–519
ResiliNets Keywords: survivability
Keywords:
Abstract: “The problem of designing networks that can survive an enemy attack or natural disaster has received considerable attention in recent years. Work in this area has focused on the formulation of survivability criteria, the development of analysis methods to rank networks with respect to these criteria, and the generation of networks which are optimal with respect to these criteria. Many partial results for a variety of network models are available. The purpose of this paper is to summarize the most significant of these results, to link various models and approaches, and to indicate areas of study in which additional research seems both desirable and feasible.”
Notes:
[Ellison-Fisher-Linger-Lipson-Longstaff-Mead-1999 .]
Robert J. Ellison, David A. Fisher, Richard C. Linger, Howard F. Lipson, Thomas A. Longstaff, and Nancy R. Mead,
Survivable Network Systems: An Emerging Discipline,
Carnegie Mellon Software Engineering Institute Technical Report CMU/SEI-97-TR-013, 1997, revised 1999,
available from http://www.sei.cmu.edu/pub/documents/97.reports/pdf/97tr013.pdf
ResiliNets Keywords: Survivability
Keywords: Survivability, security, unbounded networks, networks, Internet
Abstract: “Society is growing increasingly dependent upon large-scale, highly distributed systems that operate in unbounded network environments. Unbounded networks, such as the Internet, have no central administrative control and no unified security policy. Furthermore, the number and nature of the nodes connected to such networks cannot be fully known. Despite the best efforts of security practitioners, no amount of system hardening can assure that a system that is connected to an unbounded network will be invulnerable to attack. The discipline of survivability can help ensure that such systems can deliver essential services and maintain essential properties such as integrity, confidentiality, and performance, despite the presence of intrusions. Unlike the traditional security measures that require central control or administration, survivability is intended to address unbounded network environments. This report describes the survivability approach to helping assure that a system that must operate in an unbounded network is robust in the presence of attack and will survive attacks that result in successful intrusions. Included are discussions of survivability as an integrated engineering framework, the current state of survivability practice, the specification of survivability requirements, strategies for achieving survivability, and techniques and processes for analyzing survivability.”
Notes: This report contains the de facto standard definition of network survivability, from which we have derived the ResiliNets definition by adding large-scale natural disasters. This report also defines a "taxonomy of strategies related to survivability": "resistance, recognition, recovery, adaptation and evolution".
[Medhi-Tipper-2000 (doi) .]
D. Medhi and D. Tipper,
“Multi-Layered Network Survivability – Models, Analysis, Architecture, Framework and Implementation: An Overview”,
Proceedings of DARPA Information Survivability Conference DISCEX 2000,
Hilton Head, SC, Jan., 2000, pp. 173–186.
Abstract: A major attack can significantly reduce the capability to deliver services in large-scale networked information systems. In this project, we have addressed the survivability of large scale heterogeneous information systems which consist of various services provided over multiple interconnected networks with different technologies. The communications network portions of such systems are referred to as multi-networks. We specifically address the issue of survivability due to physical attacks that destroy links and nodes in multi-networks. The end goal is to support critical services in the face of a major attack by making optimum use of network resources while minimizing network congestion. This is an area which is little studied, especially for large-scale heterogeneous systems. In this paper, we present an overview of our contributions in this area.
[Sterbenz-Krishnan-Hain-Jackson-Levin-Ramanathan-Zao-2002 (doi) .]
James P.G. Sterbenz, Rajesh Krishnan, Regina Rosales Hain, Alden W. Jackson, David Levin, Ram Ramanathan, and John Zao,
“Survivable Mobile Wireless Networks: Issues, Challenges, and Research Directions”,
Proceedings of the ACM Wireless Security Workshop (WiSE) 2002 at MobiCom,
Atlanta GA, September 2002, pp. 31–40
ResiliNets Keywords: Survivability
Keywords: Survivability, mobile wireless network, weak and episodic connectivity, disconnected, asymmetric channel, eventual stability, eventual connectivity, store and haul forwarding, low probability of detection (LPD), satellite, ad hoc routing, topology, security, fault tolerance.
Abstract: "In this paper we survey issues and challenges in enhancing the survivability of mobile wireless networks, with particular emphasis on military requirements. Research focus on three key aspects can significantly enhance network survivability: (i) establishing and maintaining survivable topologies that strive to keep the network connected even under attack, (ii) design for end-to-end communication in challenging environments in which the path from source to destination is not wholly available at any given instant of time, (iii) the use of technology to enhance survivability such as adaptive networks and satellites."
Notes: This paper summarizes the work done in the SUMOWIN DARPA project (Douglas Maughan, Program Manager).
[Knight-Strunk-Sullivan-2003 .]
John C. Knight, Elisabeth A. Strunk, and Kevin J. Sullivan,
“Towards a Rigorous Definition of Information System Survivability”,
Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX III),
Washington DC, Apr. 2003, pp. 78–89.
ResiliNets Keywords: Survivability metrics
Abstract: The computer systems that provide the information underpinnings for critical infrastructure applications, both military and civilian, are essential to the operation of those applications. Failure of the information systems can cause a major loss of service, and so their dependability is a major concern. Current facets of dependability, such as reliability and availability, do not address the needs of critical information systems adequately because they do not include the notion of degraded service as an explicit requirement. What is needed is a precise notion of what forms of degraded service are acceptable to users, under what circumstances each form is most useful, and the fraction of time such degraded service levels are acceptable. This concept is termed survivability. In this paper, we present the basis for a rigorous definition of survivability and an example of its use.
[Grover-Stamatelakis-1998 .]
W.D. Grover and D. Stamatelakis,
"Cycle-oriented distributed pre-configuration: ring-like speed with mesh-like capacity for self-planning network restoration",
in Proc. IEEE International Conf. Commun. (ICC '98), pp. 537-543, Atlanta, June 8-11, 1998
ResiliNets Keywords: Survivability, link failure
Abstract: "Cycle-oriented preconfiguration of spare capacity is a new idea for the design and operation of mesh-restorable networks. It offers a sought-after goal: to retain the capacity-efficiency of a mesh-restorable network, while approaching the speed of line-switched self-healing rings. We show that through a strategy of pre-failure cross-connection between the spare links of a mesh network, it is possible to achieve 100% restoration with little, if any, additional spare capacity than in a mesh network. In addition, we find that this strategy requires the operation of only two cross-connections per restoration path. Although spares are connected into cycles, the method is different than self-healing rings because each preconfigured cycle contributes to the restoration of more failure scenarios than can a ring. Additionally, two restoration paths may be obtained from each pre-formed cycle, whereas a ring only yields one restoration path for each failure it addresses. We give an optimal design formulation and results for preconfiguration of spare capacity and describe a distributed self-organizing protocol through which a network can continually approximate the optimal preconfiguration state."
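Illustration (not from the paper): the abstract's remark that a preconfigured cycle can yield two restoration paths, where a ring yields one, can be sketched as follows. The sketch assumes a simple graph (at most one span per node pair) and a cycle given as an ordered node list; the function name restoration_paths and the node labels are hypothetical. A failed span that lies on the cycle leaves one surviving arc, while a span whose end nodes are both on the cycle but which is not itself part of it (a chord) can use both arcs.

```python
def restoration_paths(cycle, span):
    """Count the restoration paths one preconfigured cycle offers a failed span.

    cycle: ordered list of at least three distinct nodes; the last node
           connects back to the first.
    span:  pair (u, v) naming the end nodes of the failed link.
    Assumes a simple graph, so adjacent cycle nodes are joined by the
    on-cycle span itself.
    """
    u, v = span
    if u == v or u not in cycle or v not in cycle:
        return 0  # the cycle does not touch both end nodes
    n = len(cycle)
    i, j = cycle.index(u), cycle.index(v)
    on_cycle = (i - j) % n in (1, n - 1)
    # On-cycle failure: the cycle is broken, only the remaining arc survives.
    # Chord failure: the cycle is intact, so both arcs between u and v are usable.
    return 1 if on_cycle else 2


example_cycle = ["A", "B", "C", "D"]
print(restoration_paths(example_cycle, ("A", "B")))  # on-cycle span
print(restoration_paths(example_cycle, ("A", "C")))  # chord of the cycle
```

Counting paths per span in this way is only the bookkeeping step; the optimal design formulation and the distributed protocol described in the abstract are not reproduced here.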
[Clouqueur-Grover-2002 .]
M. Clouqueur and W. Grover,
"Availability Analysis of Span-Restorable Mesh Networks",
IEEE Journal on Selected Areas in Communications (JSAC), vol. 20, no. 4, pp. 810-821, May 2002
ResiliNets Keywords: Survivability, dual link failure
Abstract: The most common aim in designing a survivable network is to achieve restorability against all single span failures, with a minimal investment in spare capacity. This leaves dual-failure situations as the main factor to consider in quantifying how the availability of services benefit from the investment in restorability. We approach the question in part with a theoretical framework and in part with a series of computational routing trials. The computational part of the analysis includes all details of graph topology, capacity distribution, and the details of the restoration process, effects that were generally subject to significant approximations in prior work. The main finding is that a span-restorable mesh network can be extremely robust under dual-failure events against which they are not specifically designed. In a modular-capacity environment, an adaptive restoration process was found to restore as much as 95% of failed capacity on average over all dual-failure scenarios, even though the network was designed with minimal spare capacity to assure only single-failure restorability.
[Grover-2004]
W. Grover,
"Mesh-based Survivable Transport Networks: Options and Strategies for Optical, MPLS, SONET and ATM Networking",
Prentice Hall PTR (August 26, 2003)
[Westmark-2004 (doi) .]
Vickie R. Westmark
“A Definition for Information System Survivability”,
Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track-9,
Hawaii, 2004, pp. 90303.1
ResiliNets Keywords: survivability
Keywords:
Abstract: “Society has become dependent on information systems. As networks develop into large-scale systems, often critical to personal and business operations, survivability of these systems is imperative. While these systems continue to emerge and grow, answers to questions like: "What does survivability mean?", "How is survivability being measured?", and "How is survivability computed?" become very important. This paper summarizes the standard or lack of standard methods for defining and computing survivability while providing an easy to reference baseline of the current state. It also provides a template for defining survivability to facilitate subsequent research into computational quality attributes by using standard definitions. Where there are gaps or inconsistencies in current research and practice, assessments can be made to continue research and development in the areas most needed to develop taxonomy of survivability.”
Notes: A good survey paper on computational survivability.
[Molisz-2004 (doi) .]
Wojciech Molisz
“Survivability function-a measure of disaster-based routing performance”,
IEEE Journal on Selected Areas in Communications, vol.22, #9, November 2004, pp. 1876–1883
ResiliNets Keywords: Survivability, natural disasters
Keywords: Disaster-based routing performance, survivability attributes, survivability functions
Abstract: “The explosive growth of data traffic imposes critical requirements on core network survivability. Developments in wavelength-division multiplexing have strengthened this need. Survivability becomes increasingly crucial, since large traffic volumes are multiplexed onto a single fiber. A single cable cut can affect incredibly large groups of users, leading to catastrophic socioeconomic effects. This paper defines the network survivability function - the probability function of the percentage of total data flow delivered after failure and survivability attributes - the expected percentage of total data flow delivered after failure, the respective p-percentile values, the worst case survivability. Models for finding these survivability measures are described. The main goal in this paper is to investigate the survivability function for typical routing protocols used in the IP networks. Examples of survivability assessment of a typical wide area network employed in Poland illustrate the proposed approach.”
Notes: This paper derives a closed-form survivability function considering the effects of natural disasters on routing protocols.
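Illustration (not from the paper): the survivability attributes named in the abstract (expected percentage of flow delivered, p-percentile, worst case) can be sketched for a discrete set of failure scenarios. The scenario probabilities and delivered percentages below are made-up values, the function name survivability_attributes is hypothetical, and the percentile is read as the smallest delivered percentage x with P(delivered <= x) >= p, which may differ from the paper's exact definition.

```python
def survivability_attributes(scenarios, p=0.05):
    """Summarise a discrete distribution of delivered flow after failure.

    scenarios: list of (probability, percent_delivered) pairs whose
               probabilities sum to 1 (hypothetical input format).
    p:         percentile level in (0, 1].
    """
    total = sum(prob for prob, _ in scenarios)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")

    # Expected percentage of total data flow delivered after failure.
    expected = sum(prob * pct for prob, pct in scenarios)
    # Worst case over scenarios that can actually occur.
    worst = min(pct for prob, pct in scenarios if prob > 0)

    # p-percentile: walk scenarios from worst to best, accumulating probability.
    cumulative = 0.0
    percentile = None
    for prob, pct in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= p - 1e-12:
            percentile = pct
            break

    return {"expected": expected, "percentile": percentile, "worst_case": worst}


# Made-up scenarios: (probability, percent of total flow delivered).
example = [(0.5, 100.0), (0.25, 80.0), (0.25, 40.0)]
print(survivability_attributes(example, p=0.5))
```

In practice the scenario list would come from a failure model and a routing simulation, as in the paper's wide area network example; the sketch only shows how the three attributes relate to one distribution.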
[Chen-Garg-Trivedi-2002 (doi) .]
Dongyan Chen, Sachin Garg, and Kishor S. Trivedi
“Network survivability performance evaluation: A Quantitative Approach with Applications in Wireless Ad-hoc Networks”,
Proceedings of the 5th ACM international workshop on Modeling analysis and simulation of wireless and mobile systems,
Atlanta, Georgia, USA, September 2002, pp. 61–68
ResiliNets Keywords: survivability
Keywords: Survivability, wireless ad-hoc networks, Markov models, transient analysis, availability
Abstract: “Network survivability reflects the ability of a network to continue to function during and after failures. Our purpose in this paper is to propose a quantitative approach to evaluate network survivability. We perceive the network survivability as a composite measure consisting of both network failure duration and failure impact on the network. A wireless ad-hoc network is analyzed as an example, and the excess packet loss due to failures (ELF) is taken as the survivability performance measure. To obtain ELF, we adopt a two phase approach consisting of the steady-state availability analysis and transient performance analysis. Assuming Markovian property for the system, this measure is obtained by solving a set of Markov models. By utilizing other analysis paradigms, our approach in this paper may also be applied to study the survivability performance of more complex systems.”
Notes: Computational survivability analysis for wireless networks
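Illustration (not from the paper): the abstract combines steady-state availability analysis with transient analysis of Markov models. The simplest instance of that pairing is a two-state (up/down) continuous-time Markov chain with constant failure and repair rates, sketched below. This is a textbook model, not the paper's ad-hoc network model, and it does not compute the ELF measure; the function names and rate values are assumptions.

```python
import math


def steady_state_availability(failure_rate, repair_rate):
    """Long-run fraction of time the two-state chain spends in the up state."""
    return repair_rate / (failure_rate + repair_rate)


def transient_availability(failure_rate, repair_rate, t):
    """Probability of being up at time t, given the system is up at time 0."""
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * math.exp(-total * t)


# Assumed rates (per hour): one failure per 1000 hours, repair in 10 hours on average.
lam, mu = 0.001, 0.1
print(steady_state_availability(lam, mu))
for t in (0.0, 10.0, 100.0):
    print(t, transient_availability(lam, mu, t))
```

The transient curve starts at 1 and decays toward the steady-state value, which is the kind of time-dependent behaviour a survivability measure captures and a steady-state availability figure alone does not.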
[Zolfaghari-Kaudel-1994 (doi) .]
Ali Zolfaghari and Fred J. Kaudel
“Framework for Network Survivability Performance”,
IEEE Journal on Selected Areas in Communications, vol.12, #1, January 1994, pp. 46–51
ResiliNets Keywords: Survivability
Keywords:
Abstract: “The article is based on the results of ANSI Technical Subcommittee T1A1 activities in the area of a general framework for telecommunication network survivability performance. The issues of users' expectations and requirements, outage categorization, and a framework for analysis of survivability techniques are discussed. Based on the survivability framework, models for network survivability assessment and analysis are considered, and performance measures are described. Examples illustrate the application of this framework in network design and planning”
Notes:
[Liu-Trivedi-2004 (doi) .]
Yun Liu and Kishor S. Trivedi
“A General Framework for Network Survivability Quantification”,
Proceedings of the 12th GI/ITG Conference on Measuring, Modelling and Evaluation of Computer and Communication Systems,
Dresden, Germany, September 2004, pp. 369–378
ResiliNets Keywords: survivability
Keywords: communication network, survivability, quantification, Markov model
Abstract: “In this paper, we propose a general survivability quantification framework which is applicable to a wide range of system architectures, applications, failure/recovery behaviors, and desired metrics. We show how this framework can be used to derive survivability measures based on different definitions and extend it to other measures not covered by current definitions which can provide helpful information for better understanding of system steady state and transient behaviors under failures/attacks. An illustrative example of a telecommunications switching system is given for the ease of discussion. Markov models are developed and solved to depict various aspects of system survivability.”
Notes:
[Tipper-Dahlberg-Shin-Charnsripinyo-2002 (doi) .]
David W. Tipper, Teresa A. Dahlberg, Hyundoo Shin, and Chalermpol Charnsripinyo
“Providing Fault Tolerance in Wireless Access Networks”,
IEEE Communications Magazine, vol.40, #1, January 2002, pp. 58–64
ResiliNets Keywords: Survivability
Keywords: Performance analysis, Survivability analysis, Mobile cellular networks, Wireless access networks, Multilayer survivability framework
Abstract: “Research and development on network survivability has largely focused on public switched telecommunications networks and high-speed data networks with little attention on the survivability of wireless access networks supporting cellular and PCS communications. This article discusses the effects of failures and survivability issues in PCS networks with emphasis on the unique difficulties presented by user mobility and the wireless channel environment. A simulation model to study a variety of failure scenarios on a PCS network is described, and the results show that user mobility significantly worsens network performance after failures, as disconnected users move among adjacent cells and attempt to reconnect to the network. Thus, survivability strategies must be designed to contend with spatial as well as temporal network behavior. A multilayer framework for the study of PCS network survivability is presented. Metrics for quantifying network survivability are identified at each layer. Possible survivability strategies and restoration techniques for each layer in the framework are also discussed”
Notes: Even though the paper has “fault tolerance” in the title, this paper is really about the survivability of wireless networks.
[Bassiri-Heydari-2009 (doi) .]
B. Bassiri and S.S. Heydari,
“Network Survivability in Large-Scale Regional Failure Scenarios”,
Proceedings of the 2nd Canadian Conference on Computer Science and Software Engineering,
Montreal, Quebec, Canada, 2009, pp. 83–87
ResiliNets Keywords: survivability
Keywords: failure recovery, large-scale failures, mesh survivable networks, network survivability, traffic restoration
Abstract: “In this short paper we present a preliminary study of the impact of large-scale failures on communication networks. Models for study of large-scale failures are studied and unique characteristics of these scenarios as well as their differences with independent multiple-failure scenarios are presented. In particular, we focus on regional large-scale failure scenarios in which node/link failures are location-dependent. Methods for restoration of traffic and the issue of failure notification time are examined. The regional failure scenario is examined on a sample backbone European network and the simulation results including network capacity requirements and failure notification times for various cases are analyzed and discussed.”
Notes:
[Heegaard-Trivedi-2009 (doi) .]
P.E. Heegaard and K.S. Trivedi,
“Network survivability modeling”,
Computer Networks, vol.53, #8, June 2009, pp. 1215–1234
ResiliNets Keywords: survivability
Keywords: Survivability; End-to-end performance; Analytical models; Simulation
Abstract: “Critical services in a telecommunication network should be continuously provided even when undesirable events like sabotage, natural disasters, or network failures happen. It is essential to provide virtual connections between peering nodes with certain performance guarantees such as minimum throughput, maximum delay or loss. The design, construction and management of virtual connections, network infrastructures and service platforms aim at meeting such requirements.
In this paper we consider the network’s ability to survive major and minor failures in network infrastructure and service platforms that are caused by undesired events that might be external or internal. Survive means that the services provided comply with the requirement also in presence of failures. The network survivability is quantified as defined by the ANSI T1A1.2 committee which is the transient performance from the instant an undesirable event occurs until steady state with an acceptable performance level is attained.
The assessment of the survivability of a network with virtual connections exposed to link or node failures is addressed in this paper. We have developed both simulation and analytic models to cross-validate our assumptions. In order to avoid state space explosion while addressing large networks we decompose our models first in space by studying the nodes independently and then in time by decoupling our analytic performance and recovery models which gives us a closed form solution. The modeling approaches are applied to both small and real-sized network examples. Three different scenarios have been defined, including single link failure, hurricane disaster, and instabilities in a large block of the system (transient common failure).
The results show very good correspondence between the transient loss and delay performance in our simulations and in the analytic approximations.”
Notes:
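Illustration (not from the paper): the abstract quantifies survivability as the transient performance from the instant an undesirable event occurs until a steady state with acceptable performance is attained. One simple quantity on such a transient is the time at which performance returns to, and then stays at, an acceptable level. The sketch below computes it from sampled data; the function name recovery_time, the sample values, and the threshold are all hypothetical, and the paper's analytic and simulation models are not reproduced.

```python
def recovery_time(samples, acceptable):
    """Return the first time from which performance stays acceptable.

    samples:    list of (time, performance) pairs in time order, starting
                at the instant of the undesirable event.
    acceptable: minimum performance level regarded as acceptable.
    Returns None if the last sample is still below the acceptable level.
    """
    recovered_at = None
    for t, perf in samples:
        if perf >= acceptable:
            if recovered_at is None:
                recovered_at = t  # candidate start of the acceptable period
        else:
            recovered_at = None  # performance dipped again; discard the candidate
    return recovered_at


# Made-up transient: fraction of offered traffic delivered after a failure at t = 0.
trace = [(0, 0.20), (1, 0.60), (2, 0.95), (3, 0.90), (4, 0.97), (5, 0.99)]
print(recovery_time(trace, acceptable=0.95))
```

The temporary dip at t = 3 is why the sketch tracks the start of the final acceptable period and does not stop at the first acceptable sample.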
[Dean-Mihail-Mostrel-Shallcross-1996 (doi) .]
N. Dean, M. Mihail, M. Mostrel, and D. Shallcross,
“A Commercial Application of Survivable Network Design: ITP/INPLANS CCS Network Topology Analyzer”,
Proceedings of the Seventh Annual Symposium on Discrete Algorithms,
Atlanta, Georgia, United States, 1996, pp. 279–287
ResiliNets Keywords: survivability
Keywords: survivability, topology analyzer, path-connectivity
Abstract: “The ITP/INPLANS CCS Network Topology Analyzer is a Bellcore product which performs automated design of cost effective survivable CCS (Common Channel Signaling) networks, with survivability meaning that certain path-connectivity is preserved under limited failures of network elements. The algorithmic core of this product consists of suitable extensions of primal-dual approximation schemes for Steiner network problems. Even though most of the survivability problems arising in CCS networks are not strictly of the form for which the approximation algorithms with proven performance guarantees apply, we implemented modifications of these algorithms with success: In addition to duality-based performance guarantees that indicate, mathematically, discrepancy of no more than 20% from optimality for generic Steiner problems and no more than 40% for survivable CCS networks, our software passed all commercial benchmark tests, and our code was deployed with the August ‘94 release of the product. CCS networks fall in the general category of low bit-rate backbone networks. The main characteristic of survivability problems for these networks is that each edge, once present, can be assumed to carry arbitrarily many paths. For high bit-rate backbone networks, such as the widely used ATM and SONET, this is no longer the case. We discuss versions of network survivability with capacitated edges that appear to model survivability considerations in such networks.”
Notes: