SCS Taxonomy of Failsafe Levels: Examples of Failure Cases and their impact on IaaS and KaaS resources

Examples of the impact from certain failure scenarios on Cloud Resources

Failure cases in cloud deployments can be hardware-related, environmental, or caused by software errors or human interference. The following table summarizes different failure scenarios that can occur:

| Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
|---|---|---|---|
| Disk Failure | High | Permanent data loss on this disk. Impact depends on the type of lost data (database, user data) | L1 |
| Host Failure (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | L1 |
| Host Failure | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | L1 |
| Rack Outage | Medium | Outage of all nodes in rack | L2 |
| Network router/switch outage | Medium | Temporary loss of service, loss of connectivity, network partitioning | L2 |
| Loss of network uplink | Medium | Temporary loss of service, loss of connectivity | L3 |
| Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks | L3 |
| Fire | Medium | Permanent disk and host loss in the affected zone | L3 |
| Flood | Low | Permanent disk and host loss in the affected region | L4 |
| Earthquake | Very Low | Permanent disk and host loss in the affected region | L4 |
| Storm/Tornado | Low | Permanent disk and host loss in the affected region | L4 |
| Software bug (major) | Low | Permanent loss or compromise of data, ranging from the data that triggers the bug up to all data on the whole physical machine | L3 |
| Software bug (minor) | High | Temporary or partial loss or compromise of data | L1 |
| Minor operating error | High | Temporary outage | L1 |
| Major operating error | Low | Permanent loss of data | L3 |
| Cyber attack (minor) | High | Permanent loss or compromise of data on affected disk and host | L1 |
| Cyber attack (major) | Medium | Permanent loss or compromise of data on affected disk and host | L3 |
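
To make the table easier to query, it can be expressed as plain data. The following is a minimal Python sketch, not part of the standard: the names `Scenario`, `SCENARIOS`, and `group_by_level` are hypothetical and chosen for illustration. It only groups the scenarios by the level given in the "Failsafe Level Coverage" column and makes no assumption about how levels relate to each other.

```python
from collections import defaultdict
from typing import NamedTuple


class Scenario(NamedTuple):
    name: str
    probability: str
    level: int  # number from the "Failsafe Level Coverage" column (L1..L4)


# Rows transcribed from the failure scenario table above.
SCENARIOS = [
    Scenario("Disk Failure", "High", 1),
    Scenario("Host Failure (without disks)", "Medium to High", 1),
    Scenario("Host Failure", "Medium to High", 1),
    Scenario("Rack Outage", "Medium", 2),
    Scenario("Network router/switch outage", "Medium", 2),
    Scenario("Loss of network uplink", "Medium", 3),
    Scenario("Power Outage (Data Center supply)", "Medium", 3),
    Scenario("Fire", "Medium", 3),
    Scenario("Flood", "Low", 4),
    Scenario("Earthquake", "Very Low", 4),
    Scenario("Storm/Tornado", "Low", 4),
    Scenario("Software bug (major)", "Low", 3),
    Scenario("Software bug (minor)", "High", 1),
    Scenario("Minor operating error", "High", 1),
    Scenario("Major operating error", "Low", 3),
    Scenario("Cyber attack (minor)", "High", 1),
    Scenario("Cyber attack (major)", "Medium", 3),
]


def group_by_level(scenarios):
    """Return a dict mapping failsafe level -> list of scenario names."""
    grouped = defaultdict(list)
    for scenario in scenarios:
        grouped[scenario.level].append(scenario.name)
    return dict(grouped)


if __name__ == "__main__":
    for level, names in sorted(group_by_level(SCENARIOS).items()):
        print(f"L{level}: {', '.join(names)}")
```

Grouping this way shows at a glance that the high-probability scenarios all fall under L1, while the regional natural disasters are the only ones assigned to L4.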

These failure scenarios can result in either temporary (T) or permanent (P) loss of IaaS / KaaS resources or data. Additionally, many resources in IaaS alone are affected to varying degrees by these failure scenarios. The following tables show the impact when no redundancy or failure safety measure is in place, i.e., when not even failsafe level 1 is fulfilled.

Impact on IaaS Resources (IaaS Layer)

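| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophe | Cyber Threat | Software Bug |
|---|---|---|---|---|---|---|---|
| Image | P¹ | T² | T/P | T | P (T³) | T/P | P |
| Volume | P¹ | T² | T/P | T | P (T³) | T/P | P |
| User Data on RAM/CPU | | P | P | P | P | T/P | P |
| volume-based VM | P¹ | T² | T/P | T | P (T³) | T/P | P |
| ephemeral-based VM | P¹ | P | P | T | P (T³) | T/P | P |
| Ironic-based VM | P⁴ | P | P | T | P (T³) | T/P | P |
| Secret | P¹ | T² | T/P | T | P (T³) | T/P | P |
| network configuration (DB objects) | P¹ | T² | T/P | T | P (T³) | T/P | P |
| network connectivity (materialization) | | T² | T/P | T | P (T³) | T/P | T |
| floating IP | P¹ | T² | T/P | T | P (T³) | T/P | T |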

In some cases, this results only in temporary unavailability, and cloud infrastructures usually have mechanisms in place to avoid data loss, such as redundancy in storage backends and databases. Some of these outages are therefore easier to mitigate than others.
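
The IaaS impact table can likewise be sketched as a small lookup structure, for example to list which resources would be permanently lost in a given scenario when no safety measure is in place. This is an illustrative Python sketch only: `COLUMNS`, `IMPACT`, and `resources_with_permanent_loss` are hypothetical names, the footnote qualifiers are omitted (so "P (T)" under natural catastrophe is recorded as "P"), and rows with one value fewer than the header are assumed to have an empty "Disk Loss" cell.

```python
COLUMNS = [
    "Disk Loss", "Node Loss", "Rack Loss", "Power Loss",
    "Natural Catastrophe", "Cyber Threat", "Software Bug",
]

# One entry per table row; "" means the table lists no impact.
# Assumption: footnote qualifiers are dropped, "P (T)" is stored as "P".
IMPACT = {
    "Image": ["P", "T", "T/P", "T", "P", "T/P", "P"],
    "Volume": ["P", "T", "T/P", "T", "P", "T/P", "P"],
    "User Data on RAM/CPU": ["", "P", "P", "P", "P", "T/P", "P"],
    "volume-based VM": ["P", "T", "T/P", "T", "P", "T/P", "P"],
    "ephemeral-based VM": ["P", "P", "P", "T", "P", "T/P", "P"],
    "Ironic-based VM": ["P", "P", "P", "T", "P", "T/P", "P"],
    "Secret": ["P", "T", "T/P", "T", "P", "T/P", "P"],
    "network configuration (DB objects)": ["P", "T", "T/P", "T", "P", "T/P", "P"],
    "network connectivity (materialization)": ["", "T", "T/P", "T", "P", "T/P", "T"],
    "floating IP": ["P", "T", "T/P", "T", "P", "T/P", "T"],
}


def resources_with_permanent_loss(scenario):
    """List resources whose impact for `scenario` includes permanent loss (P)."""
    column = COLUMNS.index(scenario)  # raises ValueError for unknown scenarios
    return [name for name, row in IMPACT.items() if "P" in row[column]]


if __name__ == "__main__":
    for scenario in COLUMNS:
        affected = resources_with_permanent_loss(scenario)
        print(f"{scenario}: {len(affected)} resource(s) with possible permanent loss")
```

Entries marked "T/P" are counted as possibly permanent here. A stricter query could match only cells that are exactly "P".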

Impact on Kubernetes Resources (KaaS layer)

Note: If the KaaS layer runs on top of the IaaS layer, the impacts described in the table above apply to the KaaS layer as well.

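| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophe | Cyber Threat | Software Bug |
|---|---|---|---|---|---|---|---|
| Node | P | | | | | T/P | |
| Kubelet | T | | | | | T/P | |
| Pod | T | | | | | T/P | |
| PVC | P | | | | | P | |
| API Server | T | | | | | T/P | |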

Footnotes

  1. If the resource is located on that specific disk.

  2. If the resource is located on that specific node.

  3. If disks, nodes, or racks are not destroyed, some data could be saved, e.g., when a fire only destroys the power line.

  4. Everything located on that specific disk. If more than one disk is used, some data could be recovered.