SCS Taxonomy of Failsafe Levels
Abstract
When talking about redundancy and backups in the context of cloud infrastructures, it is neither homogeneous nor intuitive under which circumstances these concepts apply to the various resources. Very detailed lists of risks and their consequences do exist, but this Decision Record aims to give a high-level view on the topic, so that every standard that references redundancy can state clearly how far that redundancy goes under which circumstances. Readers of such standards should be able to see at a glance whether the achieved failure safety is on a basic or a higher level and whether additional actions are needed to protect their data.
To this end, this decision record defines different levels of failure safety. These levels can then be used in standards to clearly state the scope that certain procedures, e.g. in OpenStack, cover.
Glossary
Term | Explanation |
---|---|
Availability Zone | (also: AZ) Internal representation of a physical grouping of service hosts, which also leads to an internal grouping of resources. |
BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik). |
CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. |
Compute | A generic name for the IaaS service that manages virtual machines (e.g. Nova in OpenStack). |
Network | A generic name for the IaaS service that manages network resources (e.g. Neutron in OpenStack). |
Storage | A generic name for the IaaS service that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). |
RTO | Recovery Time Objective, the acceptable time needed to restore a resource. |
Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. |
Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. |
Cyber attack/threat | Attacks on the infrastructure by means of electronic access. |
Context
Some standards provided by the SCS project will talk about or require procedures to back up resources or to have redundancy for resources. This decision record discusses which failure threats exist within IaaS and KaaS deployments and classifies them into several levels according to their impact and possible handling mechanisms. These levels should then be used in standards concerning redundancy or failure safety.
Based on our research, no similar standardized classification scheme currently seems to exist. The closest, but far more detailed, is the BSI-Standard 200-3 (German) published by the German Federal Office for Information Security. Since we want to focus on IaaS and K8s resources and provide an easily understandable structure that can be applied in standards covering replication, redundancy and backups, that document is too detailed for our purpose.
Goal of this Decision Record
The SCS wants to classify failure cases into levels according to their impact and the respective measures CSPs can implement to prepare for each level. Standards that deal with redundancy, backups or recovery SHOULD refer to the levels of this decision record, so that every reader knows up to which failsafe level the implementation of the standard protects them. Readers should then be able to derive which additional measures they have to apply to reach the failsafe level they aim for.
This document is not a replacement for a risk analysis. Every CSP and every customer (user of IaaS or KaaS resources) needs to do a risk analysis of their own. Likewise, the differentiation of failure cases into classes may not be an ideal basis for Business Continuity Planning, although it may be used to get general hints and directions.
Differentiation between failsafe levels and high availability, disaster recovery, redundancy and backups
The failsafe levels defined in this decision record classify the probability and impact of failure cases (such as data loss) and the possible countermeasures. High availability, disaster recovery, redundancy and backups are all measures that can and should be applied to IaaS and KaaS deployments by both CSPs and users to reduce the probability and impact of data loss. With this document, every reader can see up to which failsafe level their measures protect user data.
To differentiate also between the named measures the following table can be used:
Term | Explanation |
---|---|
High Availability | Refers to the availability of resources over an extended period of time, unaffected by smaller hardware issues. Achievable e.g. through having several instances of a resource. |
Disaster Recovery | Measures taken after an incident to recover data, IaaS resources and maybe even physical resources. |
Redundancy | Having two or more instances of each resource, so that it is possible to switch to a second instance (which could also be a data mirror) in case of a failure. |
Backup | A specific copy of user data that represents its state at a given point in time. Usually managed by the users themselves, read-only, and never stored in the same place as the original data. |
Failsafe Levels and RTO
As this document classifies failure cases with very broad impacts and is written mostly with IaaS and KaaS in mind, there cannot be a single RTO. The RTOs will differ for each resource and also between the IaaS and KaaS level. It should be taken into consideration that achieving the RTOs on the IaaS and KaaS level means making user data available again through measures within the infrastructure. This will not be effective when there is no backup of the user data or no redundancy of it already in place. So the different failsafe levels, measures and impacts are needed to define realistic RTOs. For example, a failing storage disk will not result in a volume being unavailable and needing a defined RTO when the storage backend uses internal replication and still has two replicas of the user data. In the worst case of a natural disaster, most likely a severe fire, the whole deployment will be lost, and if users did not create off-site backups, no defined RTO can ever be met, because the data cannot be recovered anymore.
Decision
Failsafe Levels
This Decision Record defines four failsafe levels, each of which describes what kind of failures have to be tolerated by a provided service.
This table only contains examples of failure cases. This should not be used as a replacement for a risk analysis.
In general, the lowest, level 1, describes isolated/local failures which can occur very frequently, whereas the highest, level 4, describes relatively unlikely failures that impact a whole or even multiple datacenter(s):
Level | Probability | Impact | Examples |
---|---|---|---|
1 | Very High | Small hardware issue | Disk failure, RAM failure, small software bug |
2 | High | Rack-wide | Rack outage, power outage, small fire |
3 | Medium | Site-wide (temporary) | Regional power outage, huge fire, orchestrated cyber attack |
4 | Low | Site destruction | Natural disaster |
For example, a provided service with failsafe level 2 tolerates a rack outage (because there is some kind of redundancy in place).
There are some general consequences that can be addressed by CSPs and users in the following ways:
Level | Consequences for CSPs | Consequences for Users |
---|---|---|
Level 1 | CSPs MUST operate replicas of important components (e.g. replicated volume backend, replicated database, ...). | Users SHOULD back up their data themselves and place it on another host. |
Level 2 | CSPs MUST have redundancy for important components (e.g. HA for API services, redundant power supply, ...). | Users MUST back up their data themselves and place it on another host. |
Level 3 | CSPs SHOULD operate hardware in dedicated Availability Zones. | Users SHOULD back up their data in different AZs or even in other deployments. |
Level 4 | CSPs may not be able to save user data from such catastrophes. | Users MUST have a backup of their data in a different geographic location. |
The columns "Consequences for CSPs" and "Consequences for Users" only show examples of actions that may provide this class of failure safety for a certain resource. Customers should always check what they can do to protect their data and not rely solely on the CSP.
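As an illustration of the user-side measures for the lower levels, the following minimal sketch uses the openstacksdk Python library to create a Cinder backup of a volume. The cloud name `mycloud` and the volume name `app-data` are assumptions for illustration only, not prescribed names.

```python
import openstack

# A minimal sketch, assuming a cloud entry "mycloud" in clouds.yaml
# and an existing, detached ("available") volume named "app-data".
conn = openstack.connect(cloud="mycloud")

volume = conn.block_storage.find_volume("app-data")

# Create a Cinder backup of the volume and wait until it is usable.
backup = conn.block_storage.create_backup(
    volume_id=volume.id,
    name="app-data-backup",
)
conn.block_storage.wait_for_status(backup, status="available")
print(f"Backup {backup.id} is available")
```

Note that such a backup usually lands in the backup store of the same deployment, so for level 4 it would still have to be transferred to a different geographic location.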
More specific guidance on what these levels mean on the IaaS and KaaS layers will be provided in the sections further down. But beforehand, we will describe the considered failure scenarios and the resources that may be affected.
Failure Scenarios
The following failure scenarios have been considered for the proposed failsafe levels. For each failure scenario, we estimate the probability of occurrence and the (worst case) damage caused by the scenario. Furthermore, the corresponding minimum failsafe level covering that failure scenario is given. The following table gives a coarse overview of the probabilities that are used to describe the occurrence of failure cases:
Probability | Meaning |
---|---|
Very Low | Occurs at most once a decade OR needs extremely unlikely circumstances. |
Low | Occurs at most once a year OR needs very unlikely circumstances. |
Medium | Occurs more than once a year, up to once a month. |
High | Occurs more than once a month, up to daily. |
Very High | Occurs within minutes. |
Hardware Related
Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
---|---|---|---|
Disk Failure | High | Permanent data loss on this disk. Impact depends on type of lost data (database, user data) | L1 |
Host Failure (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | L1 |
Host Failure | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | L1 |
Rack Outage | Medium | Outage of all nodes in rack | L2 |
Network router/switch outage | Medium | Temporary loss of service, loss of connectivity, network partitioning | L2 |
Loss of network uplink | Medium | Temporary loss of service, loss of connectivity | L3 |
Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks | L3 |
Environmental
Note that the probability of these scenarios depends on the location.
Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
---|---|---|---|
Fire | Low | Permanent Disk and Host loss in the affected zone | L3 |
Flood | Very Low | Permanent Disk and Host loss in the affected region | L4 |
Earthquake | Very Low | Permanent Disk and Host loss in the affected region | L4 |
Storm/Tornado | Low | Permanent Disk and Host loss in the affected region | L4 |
As we mainly consider deployments in central Europe, the probability of earthquakes is low, and in the rare case of such an event the severity is also low compared to other regions of the world (e.g. the Pacific Ring of Fire). A flood will most likely be caused by overflowing rivers rather than by storm floods from the sea. In central Europe, the probability and severity of a flooding event can be reduced simply by choosing a different location for a deployment.
Software Related
Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
---|---|---|---|
Software bug (major) | Low to Medium | Permanent loss or compromise of data that triggers the bug, up to data on the whole deployment | L3 |
Software bug (minor) | Medium to High | Temporary or partial loss or compromise of data | L1 |
Many software components consist of a large number of lines of code and cannot be proven correct in their whole functionality. Instead they are tested, at best with enough test cases to check every interaction. Still, bugs can and will occur in software. Most of them are rather small issues that might even seem like a feature to some, for example whether a floating IP in OpenStack can be assigned to a VM even though it is already bound to another VM. Bugs like this do not affect a whole deployment when they are triggered, but only specific data or resources. Nevertheless, such bugs can be a daily struggle. This is why the probability of such minor bugs may be quite high, while the consequences are either only temporary or result only in small losses or compromises of data.
Major bugs, on the other hand, which might be used to compromise data that is not directly connected to the triggered bug, occur only a few times a year. This can be seen e.g. in the OpenStack Security Advisories, where only 3 major bugs were found in 2023. While these bugs appear only rarely, their consequences are immense: they might be the reason for a whole deployment being compromised or shut down. CSPs should be in contact with the people triaging and patching such bugs, in order to be informed early and to be able to update their deployments before the bug is publicly announced.
Human Interference
Failure Scenario | Probability | Consequences | Failsafe Level Coverage |
---|---|---|---|
Minor operating error | High | Temporary outage | L1 |
Major operating error | Low | Permanent loss of data | L3 |
Cyber attack (minor) | Very High | Permanent loss or compromise of data on the affected Disk and Host | L1 |
Cyber attack (major) | Medium | Permanent loss or compromise of data on the affected Disk and Host | L3 |
Mistakes in maintaining a data center will always happen. Reducing the probability of such mistakes requires measures against human error, which is more an issue of sociology and psychology than of computer science. Attacks on an infrastructure, on the other hand, cannot be avoided this way. Instead, every deployment needs to be prepared for an attack at all times, e.g. through security updates. The severity of cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up to coordinated attacks to steal or destroy data, which could affect a whole deployment. The easier an attack is, the more frequently it will be used by various persons and organizations, up to being just daily business. Major attacks are often orchestrated and require specific knowledge, e.g. of zero-day bugs or of the attacked infrastructure. Due to that, their occurrence is less likely, but the damage done can be far more severe.
Consequences
Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability.
Affected Resources
IaaS Layer (OpenStack Resources)
Resource | Explanation | Affected by Level |
---|---|---|
Ephemeral VM | Equals the server resource in Nova, booting from ephemeral storage. | L1, L2, L3, L4 |
Volume-based VM | Equals the server resource in Nova, booting from a volume. | L2, L3, L4 |
Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | L1, L2, L3, L4 |
Ironic Machine | A physical host managed by Ironic or as a server resource in Nova. | L1, L2, L3, L4 |
(Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | (L1), L2, L3, L4 |
(Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | (L1, L2), L3, L4 |
(Volume) Snapshot | A thinly-provisioned copy-on-write snapshot of a volume, stored in the same Cinder storage backend as volumes. | (L1, L2), L3, L4 |
Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. | L3, L4 |
(Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | L3, L4 |
Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | L3, L4 |
Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | L3, L4 |
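To illustrate how an IaaS user can make use of Availability Zones for level 3 protection, the following sketch (using the openstacksdk Python library; the cloud name `mycloud` and the AZ name `az2` are assumptions) places a volume in an explicitly chosen AZ:

```python
import openstack

# A minimal sketch, assuming a cloud entry "mycloud" in clouds.yaml.
conn = openstack.connect(cloud="mycloud")

# List the availability zones the deployment exposes for compute.
# Note that compute and storage AZs may be configured differently;
# check with your CSP which AZs apply to volumes.
for az in conn.compute.availability_zones():
    print(az.name)

# Place a volume in a specific AZ (the name "az2" is only an example).
# Keeping a second copy of the data in another AZ protects against the
# outage of a single AZ (failsafe level 3), as long as the AZs are
# physically separated.
volume = conn.block_storage.create_volume(
    size=10,
    name="app-data-az2",
    availability_zone="az2",
)
```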
KaaS Layer (Kubernetes Resources)
A detailed list of consequences for certain failures can be found in the Kubernetes docs. The following table gives an overview of certain resources on the KaaS layer and by which failsafe levels they are affected:
Resource(s) | Explanation | Affected by Level |
---|---|---|
Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | L3, L4 |
Container | A lightweight and portable executable image that contains software and all of its dependencies. | L3, L4 |
Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | L3, L4 |
Job | Application workload that runs once. | L3, L4 |
CronJob | Application workload that runs once, but repeatedly at specific intervals. | L3, L4 |
ConfigMap, Secret | Objects holding static application configuration data. | L3, L4 |
Service | Makes a Pod's network service accessible inside a cluster. | (L2), L3, L4 |
Ingress | Makes a Service externally accessible. | L2, L3, L4 |
PersistentVolume (PV) | Persistent storage that can be bound and mounted to a pod. | L1, L2, L3, L4 |
Also see Kubernetes Glossary.
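On the KaaS layer, spreading replicas across zones is one way a user can reduce the impact of level 1 to 3 failures on Deployments. The following sketch uses the official Kubernetes Python client; the namespace, labels and the image are assumptions for illustration only.

```python
from kubernetes import client, config

# A minimal sketch, assuming a reachable cluster configured in the
# local kubeconfig; names, labels and the image are placeholders.
config.load_kube_config()

# Spread Pods across zones so that the outage of a single zone
# (failsafe level 3) does not take down every replica at once.
spread = client.V1TopologySpreadConstraint(
    max_skew=1,
    topology_key="topology.kubernetes.io/zone",
    when_unsatisfiable="DoNotSchedule",
    label_selector=client.V1LabelSelector(match_labels={"app": "web"}),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="nginx:1.27")],
                topology_spread_constraints=[spread],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

For level 4, an additional copy of the application state in a different geographic location (e.g. backups of PersistentVolume contents to another region) is still required.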