Errors and failures of a service can be derived from the given service descriptions. Such a listing can never be complete and should therefore be read as a set of examples. This is part of the ResiliNets architecture.
Definitions
The definitions for service, error, fault and failure can be found in the Definition section.
SF1. Service Failures
A service instance fails if it does not provide any service to legitimate clients or returns erroneous results. The cause for not providing any service can be a service instance crash, a deadlock of the internal logic, or a blocking service access point to the communication subsystem. DoS attacks often cause such behaviour: they exploit programming mistakes to exhaust resources or to force the service into an erroneous state. Erroneous results are often caused by implementation mistakes.
SF1.1. QoS Errors
A service fails if it cannot provide its service within the QoS parameters guaranteed to the client.
SF1.1.1. Performance Errors
- Data arrives too late
- Jitter is too high
- Throughput is too small
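The three performance errors above can be read as checks of measured values against the guaranteed QoS parameters. The following is a minimal sketch of such a check. The names `QosGuarantee`, `Measurement` and `performance_errors`, the units, and the sample values are illustrative assumptions and not part of the ResiliNets architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosGuarantee:
    """QoS parameters guaranteed to the client (hypothetical units)."""
    max_delay_ms: float
    max_jitter_ms: float
    min_throughput_kbps: float


@dataclass(frozen=True)
class Measurement:
    """Values observed during one measurement interval."""
    delay_ms: float
    jitter_ms: float
    throughput_kbps: float


def performance_errors(g: QosGuarantee, m: Measurement) -> list[str]:
    """Return the performance errors (SF1.1.1) visible in the measurement."""
    errors = []
    if m.delay_ms > g.max_delay_ms:
        errors.append("data arrives too late")
    if m.jitter_ms > g.max_jitter_ms:
        errors.append("jitter is too high")
    if m.throughput_kbps < g.min_throughput_kbps:
        errors.append("throughput is too small")
    return errors


# Assumed example values: delay and throughput violate the guarantee.
guarantee = QosGuarantee(max_delay_ms=150, max_jitter_ms=30, min_throughput_kbps=64)
sample = Measurement(delay_ms=180, jitter_ms=12, throughput_kbps=48)
print(performance_errors(guarantee, sample))
```

An empty list means that the service stayed within its guarantee for that interval. Any non-empty list is a QoS error in the sense of SF1.1.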
SF1.1.2. Resilience Errors
We have seen multiple resilience failures in the past. The list is far from complete and will be extended as the work proceeds:
- No backup system provided although resilience required
- Redundant systems are not location disjoint: Natural disaster brings both systems down
- Backup path is not node/link disjoint from primary path: Failure of one node/link can bring both paths down
- Error propagation failure: BGP route flapping caused by join/leave messages sent to all BGP speakers
- Bad failover strategy: SCTP retransmission to secondary IP address (backup path) degrades performance
- Overlap of data segments: TCP overwrites previously received data with newly received data, e.g. from retransmissions
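The disjointness requirement for primary and backup paths in the list above can be checked mechanically. The sketch below assumes that a path is given as an ordered list of node names. The function names `shared_elements` and `is_disjoint` and the sample topology are hypothetical.

```python
def shared_elements(primary: list[str], backup: list[str]) -> dict[str, set]:
    """Return the intermediate nodes and the links used by both paths."""
    # Source and destination are shared by definition, so only
    # intermediate nodes are compared.
    nodes = set(primary[1:-1]) & set(backup[1:-1])

    def links(path: list[str]) -> set:
        # A link is modelled as an unordered pair of adjacent nodes.
        return {frozenset(pair) for pair in zip(path, path[1:])}

    return {"nodes": nodes, "links": links(primary) & links(backup)}


def is_disjoint(primary: list[str], backup: list[str]) -> bool:
    """True if no single node or link failure can bring both paths down."""
    shared = shared_elements(primary, backup)
    return not shared["nodes"] and not shared["links"]


primary = ["A", "B", "C", "D"]
backup = ["A", "E", "C", "D"]  # shares node C and link C-D with the primary path
print(shared_elements(primary, backup))
print(is_disjoint(primary, ["A", "E", "F", "D"]))
```

The same idea applies to location disjointness of redundant systems if sites are compared instead of nodes.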
SF1.2. Addressing Errors
- Addressing failure: No such network, host, protocol, SAP
- Unknown sender due to spoofed source address
SF2. Basic Building Block Failures
All hardware errors will result in a failure since we do not intend to build resilience mechanisms for these. This is caused by our level of abstraction.
SF2.1 Physical Link Failures
| Cause | Duration | Countermeasure | Example (optional) |
| Wired networks: cable cut | permanent | Redeploy cable | |
| Wireless networks: peer moves out of range | temporary | find multi-hop path | |
| Noisy channel destroys signal | temporary | add error control | electromagnetic interference |
SF2.2 Node Hardware Failures
| Cause | Duration | Countermeasure | Example (optional) |
| Node destruction | permanent | Redeploy hardware | Natural disaster |
| Defective hardware | permanent | Redeploy hardware | Aging |
| DoS attack | permanent | Reboot / change hardware | malformed packets block interface |
| DoS attack | temporary | reboot | power outage |
SF2.3 Operating System Failure
| Cause | Duration | Countermeasure | Example (optional) |
| Implementation mistake | temporary | software update | exception handling, resource management, ... |
SF3. Communication specific Building Blocks
A communication system will inherit one or more of the following services as building blocks. For each service, the challenges, the response mechanisms, the result of each response, and examples are listed.
SF3.1. Link Transport Service Errors
| Challenge | Response | Result | Example (optional) |
| Physical link failure | | service failure | |
| Logical link failure | | service failure | no link to peer |
| Logical link failure | use redundant path | normal operation | forwarding service uses different multi-hop path |
| Node failure | | service failure | peer is down |
| Transport association error | | service failure | Connection reset attack |
| Transport association error | re-establishment of association | normal operation | Connection reset attack |
SF3.1.1. Secure Link Transport Service Errors
| Challenge | Response | Result | Example (optional) |
| Anti-replay counter overrun | disable association | service failure | |
| Anti-replay counter overrun | re-keying | normal operation | |
| Data duplication | drop data | normal operation | |
| Association timeout | disable association | service failure | |
| Association timeout | re-keying | normal operation | |
| Authentication failure | disable association | service failure | |
| Truncation attack | | service failure | refine implementation |
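The two responses to an anti-replay counter overrun listed above lead to different results. The following sketch contrasts them for the sending side of an association. It is not modelled on a specific protocol, and the class `SecureAssociation`, its fields and the tiny counter limit are assumptions made for illustration only.

```python
class SecureAssociation:
    """Sender side of a secure association with an anti-replay counter."""

    def __init__(self, max_seq: int, rekey_on_overrun: bool):
        self.max_seq = max_seq                  # largest usable sequence number
        self.rekey_on_overrun = rekey_on_overrun
        self.next_seq = 0
        self.key_epoch = 0                      # incremented on every re-keying
        self.enabled = True

    def next_sequence_number(self) -> tuple[int, int]:
        """Return (key epoch, sequence number) for the next outgoing packet."""
        if not self.enabled:
            raise RuntimeError("association disabled: service failure")
        if self.next_seq > self.max_seq:
            if not self.rekey_on_overrun:
                # Response "disable association": the result is a service failure.
                self.enabled = False
                raise RuntimeError("anti-replay counter overrun")
            # Response "re-keying": a fresh key makes old packets invalid,
            # so the counter can restart and operation continues normally.
            self.key_epoch += 1
            self.next_seq = 0
        seq = self.next_seq
        self.next_seq += 1
        return self.key_epoch, seq


rekeying = SecureAssociation(max_seq=2, rekey_on_overrun=True)
print([rekeying.next_sequence_number() for _ in range(4)])
```

With `rekey_on_overrun=False` the fourth call raises an exception and every later call fails, which corresponds to the "disable association" row.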
SF3.1.2. E2E Transport Errors
Since this is only a specialised link transport service, the same failures as for any other link transport service can occur.
SF3.1.3. Reliability Errors
| Challenge | Response | Result | Example (optional) |
| Packet loss | ARQ mechanisms | degraded service | Stop-and-Wait, Go-back-N, Selective Repeat |
| Packet re-ordering | reverse re-ordering | normal operation | |
| Packet duplication | drop packet | normal operation | |
| Data alteration | drop packet and ARQ | degraded service | |
| Data alteration | enable FEC codes | degraded service | correct data after reception |
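The responses to re-ordering, duplication and loss in the table above can be combined in one receiver. The sketch below assumes that every packet carries a sequence number starting at 0. The class `ReorderingReceiver` and its methods are hypothetical, and the retransmission request itself is left out.

```python
class ReorderingReceiver:
    """Delivers payloads in sequence order and drops duplicates."""

    def __init__(self):
        self.expected = 0    # next sequence number to deliver
        self.buffer = {}     # out-of-order packets waiting for a gap to close
        self.delivered = []  # payloads handed to the upper layer, in order

    def receive(self, seq: int, payload: str) -> None:
        # Packet duplication: already delivered or already buffered, so drop.
        if seq < self.expected or seq in self.buffer:
            return
        self.buffer[seq] = payload
        # Packet re-ordering: deliver as long as the next packet is available.
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

    def missing(self) -> list[int]:
        """Sequence numbers a selective-repeat style ARQ would ask for again."""
        if not self.buffer:
            return []
        return [s for s in range(self.expected, max(self.buffer))
                if s not in self.buffer]


receiver = ReorderingReceiver()
for seq, payload in [(1, "b"), (0, "a"), (1, "b"), (2, "c")]:
    receiver.receive(seq, payload)
print(receiver.delivered)
```

Re-ordering and duplication are fully masked here, which matches the "normal operation" result. Loss only becomes visible through `missing()`, and recovering from it costs a retransmission, hence the "degraded service" result.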
SF3.1.4. Types of communication Errors
| Challenge | Response | Result | Example (optional) |
| Anycast: processing of multiple hosts | none | normal operation | |
| Anycast: changing receiver | none | service failure | |
| Multicast: incomplete data at one host | ARQ | degraded service | |
| Reliable Multicast: ACK storms | concast | normal operation | |
SF3.2. Forwarding Errors
| Challenge | Response | Result | Example (optional) |
| Link failure | none | service failure | |
| Link failure | redundant path or route | normal operation | |
| Node hard/software failure | none | service failure | next hop or end system is down |
| Node hard/software failure | redundant node | normal operation | next hop or end system is down |
| Unknown destination | none | service failure | non self-learning routing |
| Unknown destination | learn route | normal operation | self-learning routing |
| Degraded node service | enable congestion avoidance service; use different path | degraded service | congestion, random packet drops |
| Attacks(?) | | | blackhole router, wormhole router |
| Firewalling | | service failure | Wrong firewall configuration |
SF3.3. Node Configuration Errors
| Challenge | Response | Result | Example (optional) |
| Node failure | | service failure | Configuration server down |
| Link failure | | service failure | no link, no external configuration |
| Link transport failure | | service failure | |
SF3.4. Security Association Negotiation Errors
| Challenge | Response | Result | Example (optional) |
| Downgrade attack | abort negotiation | service failure | |
| Authentication failure | abort negotiation | service failure | |
| Incompatible algorithms | abort negotiation | service failure | |
SF3.5. Access Control Errors
| Challenge | Response | Result | Example (optional) |
| Forwarding failure | none | service failure | |
| E2E transport failure | none | service failure | |
| Secure E2E transport failure | suspend certain schemes | degraded service | schemes which do not rely on a secure E2E transport can still be used |
| No common scheme | none | service failure | |
| Replay attack | drop message | normal service operation | replay detection necessary |
SF3.5.1. Network Access Control Errors
| Challenge | Response | Result | Example (optional) |
| Link failure | | service failure | |
| Node failure | | service failure | Network access server down |
| Incompatible schemes | abort negotiation | normal service | Incompatible authentication schemes, e.g. no WPA-compatible hardware |
SF3.6. Certificate Online Verification Error
| Challenge | Response | Result | Example (optional) |
| Node error | use trusted backup server | normal operation | primary trusted server is down |
| Link transport error | none | service failure | |
SF3.7. Name resolution service Errors
| Challenge | Response | Result | Example (optional) |
| (Secure) E2E transport failure | none | degraded service | accepting unsecured responses can lead to vulnerability to cache poisoning or other attacks |
| Non-existent name | send error report | normal operation | |
| Server error | use backup server | normal operation | find backup either by configuration or anycast |
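The "use backup server" response and the handling of a non-existent name from the table above can be sketched as follows. The backup is found by configuration here, i.e. by an ordered list of servers. The function `resolve`, the stub servers and the name and address used are assumptions for illustration only.

```python
def resolve(name, servers):
    """Query the configured servers in order.

    Returns (address, index of the server that answered).
    """
    for index, query in enumerate(servers):
        try:
            return query(name), index
        except ConnectionError:
            # Server error: fall back to the next configured server.
            continue
    raise ConnectionError("all name servers failed: service failure")


def broken_primary(name):
    """Stub for a primary server that is down."""
    raise ConnectionError("primary server down")


def backup(name):
    """Stub backup server with a single record (documentation address)."""
    # A missing name raises KeyError, which resolve() does not catch:
    # a non-existent name is reported as an error, not treated as a
    # server failure, so no failover takes place.
    return {"host.example": "192.0.2.10"}[name]


print(resolve("host.example", [broken_primary, backup]))
```

The returned index shows that the backup server answered, so a server error on the primary is masked and the result is normal operation.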
SF3.8 Feedback Services Errors
| Challenge | Response | Result | Example (optional) |
| Feedback from untrusted source | | | |
| Feedback from hostile source | drop information | normal service | |
SF3.8 Monitoring Service Errors
| Challenge | Response | Result | Example (optional) |
SF3.9 Congestion Avoidance Service Errors
| Challenge | Response | Result | Example (optional) |
SF3.10. Routing Errors
| Challenge | Response | Result | Example (optional) |
| Unnoticed topology change | re-run algorithm | service failure | topology change due to new link, node movement, etc. |
| Unrecognised additional node | re-run algorithm | service failure | |
SF3.10.1. Path Establishment Errors
| Challenge | Response | Result | Example (optional) |
SF3.11. Transaction Service Error
- Partial update of compartment policy due to connection failures, system resets, etc., leading to different policies on nodes within a compartment
SF3.12. Anonymity Service Errors
| Challenge | Response | Result | Example (optional) |
Failure Semantics
We must identify the connection between failures and faults. A lower-layer failure can be a fault for an upper layer, but it does not have to result in a failure of the upper layer as well.
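The relation between a lower-layer failure and the upper-layer result can be illustrated with the two "Link failure" rows of the forwarding table (SF3.2). The sketch below is only one possible reading. The names `Result`, `FORWARDING` and `forwarding_result` are hypothetical, and treating every unlisted combination as a service failure is an assumption.

```python
from enum import Enum
from typing import Optional


class Result(Enum):
    NORMAL = "normal operation"
    DEGRADED = "degraded service"
    FAILURE = "service failure"


# Rows taken from the forwarding table (SF3.2): (challenge, response) -> result.
FORWARDING = {
    ("link failure", None): Result.FAILURE,
    ("link failure", "redundant path or route"): Result.NORMAL,
}


def forwarding_result(lower_layer_failure: str, response: Optional[str]) -> Result:
    """Result seen by users of the forwarding service.

    The lower-layer failure enters the forwarding service as a fault.
    Whether it becomes a failure depends on the available response.
    """
    # Assumption: combinations that are not listed end in a service failure.
    return FORWARDING.get((lower_layer_failure, response), Result.FAILURE)


print(forwarding_result("link failure", None).value)
print(forwarding_result("link failure", "redundant path or route").value)
```

The same link failure is a fault in both calls, but only the first one propagates upwards as a failure.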