Taxonomy of Failsafe Levels

Abstract

When talking about redundancy and backups in the context of cloud infrastructures, the circumstances under which these concepts apply to the various resources are neither homogeneous nor intuitive. Very detailed lists of risks and the consequences each of them entails do exist, but this decision record should give a high-level view of the topic, so that for each standard that references redundancy, it can easily be seen how far this redundancy goes in the given circumstance. Readers of such standards should be able to tell at a glance whether the achieved failure safety is at a basic level or a higher one, and whether additional actions are needed to protect their data.

This is why this decision record aims to define different levels of failure safety. These levels can then be used in standards to clearly set the scope of protection that certain procedures, e.g. in OpenStack, offer.

Glossary

| Term | Explanation |
|---|---|
| Availability Zone | (also: AZ) Internal representation of a physical grouping of service hosts, which also leads to an internal grouping of resources. |
| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik). |
| CSP | Cloud Service Provider, the provider managing the OpenStack infrastructure. |
| Compute | A generic name for the IaaS service that manages virtual machines (e.g. Nova in OpenStack). |
| Network | A generic name for the IaaS service that manages network resources (e.g. Neutron in OpenStack). |
| Storage | A generic name for the IaaS service that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). |
| RTO | Recovery Time Objective, the acceptable time needed to restore a resource. |
| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. |
| Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. |
| Cyber attack/threat | Attacks on the infrastructure by means of electronic access. |

Context

Some standards provided by the SCS project will talk about or require procedures to back up resources or to have redundancy for resources. This decision record discusses which failure threats exist within IaaS and KaaS deployments and classifies them into several levels according to their impact and possible handling mechanisms. These levels should subsequently be used in standards concerning redundancy or failure safety.

Based on our research, no similar standardized classification scheme seems to exist currently. Something close, but also very detailed, is the BSI-Standard 200-3 (German) published by the German Federal Office for Information Security. As we want to focus on IaaS and K8s resources and provide an easily understandable structure that can be applied in standards covering replication, redundancy and backups, that document is too detailed for our purpose.

Goal of this Decision Record

The SCS wants to classify levels of failure cases according to their impact and the respective measures CSPs can implement to prepare for each level. Standards that deal with redundancy, backups or recovery SHOULD refer to the levels defined in this decision record. Thus every reader knows up to which failsafe level the implementation of a standard works. Readers should then be able to derive which additional measures they have to apply to reach the failsafe level they aim for.

caution

This document is not a replacement for a risk analysis. Every CSP and every customer (user of IaaS or KaaS resources) needs to do a risk analysis of their own. Also, the differentiation of failure cases into classes may not be an ideal basis for Business Continuity Planning, although it may be used to get general hints and directions.

Differentiation between failsafe levels and high availability, disaster recovery, redundancy and backups

The levels of failsafeness defined in this decision record classify the probabilities and impacts of failure cases (such as data loss) and the possible countermeasures. High availability, disaster recovery, redundancy and backups are all measures that can and should be applied to IaaS and KaaS deployments, by both CSPs and users, to reduce the probability and impact of data loss. With this document, every reader can see up to which failsafe level their measures protect user data.

To also differentiate between the named measures, the following table can be used:

| Term | Explanation |
|---|---|
| High Availability | Refers to the availability of resources over an extended period of time, unaffected by smaller hardware issues. Achievable e.g. through having several instances of resources. |
| Disaster Recovery | Measures taken after an incident to recover data, IaaS resources and maybe even physical resources. |
| Redundancy | Having more than one (or two) instances of each resource, to be able to switch to the second resource (which could also be a data mirror) in case of a failure. |
| Backup | A specific copy of user data that represents all data points at a given time. Usually managed by users themselves, read-only, and never stored in the same place as the original data (see the sketch below). |
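
To make the backup measure more concrete, the following is a minimal sketch of creating a volume backup with the OpenStack SDK. The cloud name `my-cloud`, the volume name `app-data` and the container name `offsite-backups` are hypothetical placeholders; whether the backup really ends up on separate hardware or at another site depends entirely on how the CSP configured the backup service.

```python
import openstack

# Connect using a clouds.yaml entry; "my-cloud" is a placeholder.
conn = openstack.connect(cloud="my-cloud")

# Look up the volume to protect; "app-data" is a hypothetical name.
volume = conn.block_storage.find_volume("app-data")

# Create a backup. Unlike a snapshot, a backup is written to a separate
# backup store (e.g. an object store), so it can survive the loss of
# the volume backend itself. force=True also allows backing up a
# volume that is currently attached.
backup = conn.block_storage.create_backup(
    volume_id=volume.id,
    name="app-data-backup",
    container="offsite-backups",
    force=True,
)
print(f"Backup {backup.id} created with status {backup.status}")
```

Whether such a backup protects against level 3 or 4 failures depends on the physical location of the backup store, which is exactly the kind of information the failsafe levels are meant to communicate.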

Failsafe Levels and RTO

As this document classifies failure cases with very broad impacts and is written mostly with IaaS and KaaS in mind, no single RTO can be set. The RTOs will differ for each resource and also between the IaaS and KaaS layers. It should be taken into consideration that meeting the RTOs for IaaS and KaaS means making user data available again through measures within the infrastructure. This will not be effective when there is neither a backup of the user data nor redundancy for it already in place. So the different failsafe levels, measures and impacts are needed to define realistic RTOs.

For example, a failing storage disk will not result in a volume being unavailable and needing a defined RTO when the storage backend uses internal replication and still has two replicas of the user data. In the worst case of a natural disaster, on the other hand, most likely a severe fire, the whole deployment will be lost, and if users made no off-site backups, any defined RTO can never be met, because the data cannot be recovered anymore.

Decision

Failsafe Levels

This decision record defines four failsafe levels, each of which describes what kind of failures have to be tolerated by a provided service.

caution

This table only contains examples of failure cases. This should not be used as a replacement for a risk analysis.

In general, the lowest level 1 describes isolated/local failures which can occur very frequently, whereas the highest level 4 describes relatively unlikely failures that impact a whole data center or even multiple data centers:

| Level | Probability | Impact | Examples |
|---|---|---|---|
| 1 | Very High | small hardware issue | Disk failure, RAM failure, small software bug |
| 2 | High | rack-wide | Rack outage, power outage, small fire |
| 3 | Medium | site-wide (temporary) | Regional power outage, huge fire, orchestrated cyber attack |
| 4 | Low | site destruction | Natural disaster |

For example, a provided service with failsafe level 2 tolerates a rack outage (because there is some kind of redundancy in place).

There are some general consequences that can be addressed by CSPs and users in the following ways:

| Level | Consequences for CSPs | Consequences for Users |
|---|---|---|
| Level 1 | CSPs MUST operate replicas for important components (e.g. replicated volume backend, replicated database, ...). | Users SHOULD back up their data themselves and place it on another host. |
| Level 2 | CSPs MUST have redundancy for important components (e.g. HA for API services, redundant power supply, ...). | Users MUST back up their data themselves and place it on another host. |
| Level 3 | CSPs SHOULD operate hardware in dedicated Availability Zones. | Users SHOULD back up their data in different AZs or even other deployments (see the sketch below). |
| Level 4 | CSPs may not be able to save user data from such catastrophes. | Users MUST have a backup of their data in a different geographic location. |
caution

The columns "Consequences for CSPs / Users" only show examples of actions that may provide this class of failure safety for a certain resource. Customers should always check what they can do to protect their data and should not rely solely on the CSP.
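
As a sketch of the level 3 consequences above, the following shows how a user might place instances in different Availability Zones with the OpenStack SDK. The AZ name `az-2` as well as the image, flavor and network names are hypothetical; how many AZs exist and how strongly they are isolated from each other depends on the CSP.

```python
import openstack

conn = openstack.connect(cloud="my-cloud")  # placeholder cloud name

# List the Availability Zones the CSP exposes for compute.
for az in conn.compute.availability_zones():
    print(az.name, az.state)

# Place a second instance of a service in a different AZ than the
# first, so that an AZ-local failure does not take out both.
image = conn.compute.find_image("ubuntu-24.04")  # hypothetical image
flavor = conn.compute.find_flavor("SCS-2V-4")    # hypothetical flavor
network = conn.network.find_network("my-net")    # hypothetical network

server = conn.compute.create_server(
    name="app-replica-2",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    availability_zone="az-2",
)
conn.compute.wait_for_server(server)
```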

More specific guidance on what these levels mean on the IaaS and KaaS layers will be provided in the sections further down. But beforehand, we will describe the considered failure scenarios and the resources that may be affected.

Failure Scenarios

The following failure scenarios have been considered for the proposed failsafe levels. For each failure scenario, we estimate the probability of occurrence and the (worst case) damage caused by the scenario. Furthermore, the corresponding minimum failsafe level covering that failure scenario is given. The following table gives a coarse overview of the probabilities that are used to describe the occurrence of failure cases:

| Probability | Meaning |
|---|---|
| Very Low | Occurs at most once a decade OR needs extremely unlikely circumstances. |
| Low | Occurs at most once a year OR needs very unlikely circumstances. |
| Medium | Occurs more than once a year, up to once a month. |
| High | Occurs more than once a month, up to daily. |
| Very High | Occurs within minutes. |

Hardware Related

| Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
|---|---|---|---|
| Disk Failure | High | Permanent data loss on this disk. Impact depends on the type of lost data (database, user data). | L1 |
| Host Failure (without disks) | Medium to High | Permanent loss of functionality and connectivity of the host (impact depends on the type of host). | L1 |
| Host Failure | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of the host (impact depends on the type of host). | L1 |
| Rack Outage | Medium | Outage of all nodes in the rack. | L2 |
| Network router/switch outage | Medium | Temporary loss of service, loss of connectivity, network partitioning. | L2 |
| Loss of network uplink | Medium | Temporary loss of service, loss of connectivity. | L3 |
| Power Outage (data center supply) | Medium | Temporary outage of all nodes in all racks. | L3 |

Environmental

Note that the probability of these scenarios depends heavily on the location.

| Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
|---|---|---|---|
| Fire | Low | Permanent disk and host loss in the affected zone. | L3 |
| Flood | Very Low | Permanent disk and host loss in the affected region. | L4 |
| Earthquake | Very Low | Permanent disk and host loss in the affected region. | L4 |
| Storm/Tornado | Low | Permanent disk and host loss in the affected region. | L4 |

As we consider mainly deployments in central Europe, the probability of earthquakes is low, and in the rare case of such an event, the severity is also low compared to other regions of the world (e.g. the Pacific Ring of Fire). A flood will most likely be caused by overflowing rivers rather than by storm floods from the sea. In central Europe, the probability and severity of a flooding event can be reduced by simply choosing a different location for a deployment.

Software Related

| Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
|---|---|---|---|
| Software bug (major) | Low to Medium | Permanent loss or compromise of the data that triggers the bug, up to data on the whole deployment. | L3 |
| Software bug (minor) | Medium to High | Temporary or partial loss or compromise of data. | L1 |

Many software components consist of a vast number of lines of code and cannot be proven correct in their whole functionality. Instead, they are tested with, at best, enough test cases to check every interaction. Still, bugs can and will occur in software. Most of them are rather small issues that might even seem like a feature to some; one example would be whether a floating IP in OpenStack can be assigned to a VM even if it is already bound to another VM. Bugs like this do not affect a whole deployment when they are triggered, but only specific data or resources. Nevertheless, such bugs can be a daily struggle. For this reason, the probability of such minor bugs may be pretty high, but the consequences are either only temporary or result in small losses or compromises.

On the other hand, major bugs, which might be used to compromise data that is not directly connected to the triggered bug, occur only a few times a year. This can be seen, e.g., in the OpenStack Security Advisories, where only 3 major bugs were found in 2023. While these bugs appear only rarely, their consequences are immense: they might be the reason for a whole deployment being compromised or shut down. CSPs should be in contact with the people triaging and patching such bugs, to be informed early and to be able to update their deployments before a bug is publicly announced.

Human Interference

| Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
|---|---|---|---|
| Minor operating error | High | Temporary outage. | L1 |
| Major operating error | Low | Permanent loss of data. | L3 |
| Cyber attack (minor) | Very High | Permanent loss or compromise of data on the affected disk and host. | L1 |
| Cyber attack (major) | Medium | Permanent loss or compromise of data on the affected disk and host. | L3 |

Mistakes in maintaining a data center will always happen. Measures to reduce the probability of such mistakes are needed, but reducing human error is more an issue of sociology and psychology than of computer science. Attacks on an infrastructure, on the other hand, cannot be avoided this way. Instead, every deployment needs to be prepared for an attack at all times, e.g. through timely security updates. The severity of cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up to coordinated attacks to steal or destroy data, which could affect a whole deployment. The easier an attack is, the more frequently it will be attempted by various persons and organizations, up to the point of being daily business. Major attacks are often orchestrated and require specific knowledge, e.g. of zero-day vulnerabilities or of the attacked infrastructure. Due to that, their occurrence is less likely, but the damage done can be far more severe.

Consequences

Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability.

Affected Resources

IaaS Layer (OpenStack Resources)

| Resource | Explanation | Affected by Level |
|---|---|---|
| Ephemeral VM | Equals the server resource in Nova, booting from ephemeral storage. | L1, L2, L3, L4 |
| Volume-based VM | Equals the server resource in Nova, booting from a volume. | L2, L3, L4 |
| Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | L1, L2, L3, L4 |
| Ironic Machine | A physical host managed by Ironic or as a server resource in Nova. | L1, L2, L3, L4 |
| (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | (L1), L2, L3, L4 |
| (Cinder) Volume | IaaS resource representing a block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | (L1, L2), L3, L4 |
| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshot of a volume. Stored in the same Cinder storage backend as volumes (see the sketch below). | (L1, L2), L3, L4 |
| Volume Type | Attribute of volumes determining storage details of a volume, such as the backend location or whether the volume will be encrypted. | L3, L4 |
| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | L3, L4 |
| Key Encryption Key | IaaS resource used to encrypt other keys so they can be stored encrypted in a database. | L3, L4 |
| Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | L3, L4 |
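
The snapshot row above deserves emphasis: because a snapshot lives in the same Cinder backend as its volume, it only helps against the lower failsafe levels, while a backup (see the earlier sketch) can be stored elsewhere. A minimal sketch, again using the OpenStack SDK with a hypothetical volume name:

```python
import openstack

conn = openstack.connect(cloud="my-cloud")  # placeholder cloud name
volume = conn.block_storage.find_volume("app-data")  # hypothetical name

# Snapshots are thin, copy-on-write copies inside the SAME Cinder
# backend as the volume: quick protection against user errors, but
# they vanish together with the backend in a level 3/4 event.
snapshot = conn.block_storage.create_snapshot(
    volume_id=volume.id,
    name="app-data-snap",
    force=True,  # also snapshot volumes that are currently attached
)
conn.block_storage.wait_for_status(snapshot, status="available")

# A new volume can later be created from the snapshot:
restored = conn.block_storage.create_volume(
    size=volume.size,
    snapshot_id=snapshot.id,
    name="app-data-restored",
)
```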

KaaS Layer (Kubernetes Resources)

A detailed list of consequences for certain failures can be found in the Kubernetes docs. The following table gives an overview of certain resources on the KaaS layer and the failsafe levels by which they are affected; a sketch of spreading workloads across failure domains follows the table:

| Resource(s) | Explanation | Affected by Level |
|---|---|---|
| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | L3, L4 |
| Container | A lightweight and portable executable image that contains software and all of its dependencies. | L3, L4 |
| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | L3, L4 |
| Job | Application workload that runs once. | L3, L4 |
| CronJob | Application workload that runs once, but repeatedly at specific intervals. | L3, L4 |
| ConfigMap, Secret | Objects holding static application configuration data. | L3, L4 |
| Service | Makes a Pod's network service accessible inside a cluster. | (L2), L3, L4 |
| Ingress | Makes a Service externally accessible. | L2, L3, L4 |
| PersistentVolume (PV) | Persistent storage that can be bound and mounted to a pod. | L1, L2, L3, L4 |
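
To reduce the impact of level 2 and 3 failures on the workloads listed above, a Deployment can spread its Pods over failure domains. The following is a minimal sketch using the official Kubernetes Python client; the names, namespace and replica count are hypothetical, and the `topology.kubernetes.io/zone` key only has an effect if the cluster's nodes are actually labeled with their Availability Zone.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster

# A Deployment whose Pods are spread over zones and hosts, so that a
# single host failure (level 1/2) or a single AZ failure (level 3)
# does not take down all replicas at once. All names are hypothetical.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "app", "labels": {"app": "app"}},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "app"}},
        "template": {
            "metadata": {"labels": {"app": "app"}},
            "spec": {
                "topologySpreadConstraints": [
                    {   # spread over Availability Zones, tolerate imbalance
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": {"app": "app"}},
                    },
                    {   # never co-locate two replicas on one host
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": "app"}},
                    },
                ],
                "containers": [{"name": "app", "image": "nginx:1.27"}],
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Note that this only protects the stateless parts of a workload; PersistentVolumes still inherit the failsafe level of the underlying storage backend.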

Also see Kubernetes Glossary.