November 12, 2023

Resilient Digital Infrastructure

How can we ensure the continuous availability of critical digital infrastructure?

Last week Optus in Australia suffered a massive network outage impacting more than 10 million customers for more than 8 hours, including disruptions to emergency services, trains and digital payments. In April 2022, Atlassian experienced a full product outage that impacted 775 customer organisations for up to 2 weeks. In December 2022, Southwest Airlines experienced a scheduling crisis due in large part to technical debt and a lack of investment.

These crises inevitably result in significant financial losses and multi-year investments to restore customer trust and brand reputation.

But resiliency is hard.

What do we mean by resilient digital infrastructure?

Resilience is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.

We can consider digital infrastructure as the underlying hardware, networks, and software that support the functioning of digital services. That could include a broad range of telecom & network infrastructure, cloud computing, private data centres and applications that underpin operations and essential business functions.

Causes of disruption

The causes of digital infrastructure service disruptions are typically:

  • Hardware failures
  • Power failures
  • Software issues
  • Natural and man-made disasters
  • Cyber security breaches
  • Misconfiguration
  • Human error

Human error underpins the majority of technology failures in one form or another. Uptime Institute research indicates that human error plays a role in more than two-thirds of all data centre outages. This is clearly where the biggest improvements in resiliency can be achieved.

A framework for resilience

Digital infrastructure resilience is a complex topic, but there are essentially 5 key steps.


1. Define service performance targets

What is an acceptable level of service? Without a performance target, it’s hard to know if what is being designed will meet expectations.

As a basis for defining service performance, Site Reliability Engineering provides a good framework to differentiate between:

  • Things you measure and track that indicate service performance => Service Level Indicators
  • Things you measure and track AND set internal targets for => Service Level Objectives
  • Targets that are contractual and have consequences if not met => Service Level Agreements
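To make the distinction concrete, here is a minimal sketch that treats availability as the indicator and compares it against an internal objective and a contractual target. The class name, field names and the numeric targets are illustrative assumptions, not values from any particular SLA.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Availability over one reporting window (all values are hypothetical)."""
    good_requests: int
    total_requests: int
    slo_target: float  # internal objective, e.g. 0.9995
    sla_target: float  # contractual commitment, e.g. 0.999

    @property
    def sli(self) -> float:
        # The indicator: the fraction of requests served successfully.
        return self.good_requests / self.total_requests

    def meets_slo(self) -> bool:
        # Missing the SLO is an internal warning sign.
        return self.sli >= self.slo_target

    def meets_sla(self) -> bool:
        # Missing the SLA has contractual consequences.
        return self.sli >= self.sla_target


window = ServiceLevel(good_requests=999_400, total_requests=1_000_000,
                      slo_target=0.9995, sla_target=0.999)
print(window.sli, window.meets_slo(), window.meets_sla())
```

In this example the service misses its internal objective while still meeting the contractual target, which is the usual reason for setting the SLO tighter than the SLA: the team gets a warning before a contractual breach occurs.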

A common Service Level Agreement (SLA) for all in-scope digital services would include:

  • Service criticality definitions. What is mission-critical and what is not?
  • Service availability targets based on service criticality. For example, 99.99% availability could be a target for mission-critical services (about 52 minutes of allowable downtime per year), 99.95% availability for the next tier of service and so on.
  • Incident & service request resolution targets. How long will it take to detect and resolve a service disruption?
  • Service performance targets dependent on the service type (e.g. response time, transaction time, error rates, packet loss, jitter, throughput)
  • Service recovery time (RTO) and data recovery point (RPO) in case of disaster
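The availability targets above translate directly into a downtime allowance. The sketch below shows the arithmetic, assuming a 365-day year and a measurement window that counts every minute (real SLAs often exclude scheduled maintenance); the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * period_minutes


for target in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year ({minutes / 60:.2f} hours)")
```

For 99.99% this gives roughly 52.6 minutes per year, and for 99.95% roughly 4.4 hours per year.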

The SLA may include other elements such as hours of support coverage, scheduled maintenance, critical and non-critical operations periods and backup schedules.

2. Design the digital infrastructure

Now that the service performance targets are clear, the design can proceed to meet expectations with a focus on:

  • Removing Single Points of Failure (SPoFs) through redundancy and geographic diversity of critical facilities and paths
  • Adequate capacity planning
  • Performance engineering for load, scalability, stability, fault tolerance and monitoring
  • System configuration and data backups
  • Security vulnerability mitigation and automated penetration testing

The language of service availability and performance differs depending on the type of infrastructure and often vendors have their own spin.

For data centre design, the Uptime Institute's Tier Classification System (Tier I to IV) and ANSI/TIA-942 (Rating 1 to 4) are the most widely recognised and accepted standards.

For public cloud computing, AWS's published uptime SLAs typically range from 99.9% to 99.99% depending on the service. Interestingly, the Route 53 DNS service has a published SLA of 100%, as does Cloudflare. Similar uptime targets exist for Azure and Google Cloud.

With telecom & network design, high levels of design resilience are achieved through:

  • Redundancy of connections and critical facilities
  • Diversity of paths with physical separation
  • Redundancy of equipment (e.g. switches, firewalls) and fault tolerance
  • Out-of-band management capability
  • Diversity of suppliers for critical connections or services (e.g. Internet or firewalls)
  • Network segmentation
  • Availability of spares

For software infrastructure design, resiliency is an extensive field, and Site Reliability Engineering provides a good basis for design and operations practices.

To show the power of redundancy, assume that you have an internet connection from a supplier that commits to 99% uptime monthly (potentially more than 7 hours of downtime per month). Adding a second internet connection from a completely independent provider with the same commitment translates to an uptime of 99.99% (about 4 minutes of downtime per month!) for internet connectivity, because connectivity is lost only when both connections fail at the same time.
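The redundancy arithmetic can be sketched as follows. It assumes that failures of the two connections are statistically independent (no shared ducts, upstream carriers or power) and uses a 30-day month; the function name is illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours, assuming a 30-day month


def combined_availability(*availabilities: float) -> float:
    """Availability of independent redundant components operating in parallel.

    The combined service is down only when every component is down at once,
    so the individual unavailabilities are multiplied together.
    """
    unavailability = 1.0
    for availability in availabilities:
        unavailability *= 1 - availability
    return 1 - unavailability


single = 0.99
dual = combined_availability(0.99, 0.99)

print(f"single link: {(1 - single) * HOURS_PER_MONTH:.1f} hours down per month")
print(f"dual links:  {(1 - dual) * HOURS_PER_MONTH * 60:.1f} minutes down per month")
```

If the two connections share any common element, the independence assumption no longer holds and the real-world figure will be worse than this calculation suggests, which is why physical path diversity matters as much as supplier diversity.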

3. Resilient operations design

Given that most service disruptions are due to human error, operations design focuses primarily on the prevention, detection and resolution of incidents.

Service management processes provide a framework to guide teams on “how” things should be done such as the onboarding of new services, and the management of incidents, changes, requests and problems. Team procedures or checklists describe the support steps in detail.

Designing the support model provides clarity on who is responsible for what so that the right skills are in place to monitor services and provide support. This may include a physical operations centre, on-call staff and remote support teams operating as part of an integrated support model.

A competent, well-trained, and tightly integrated support team is a critical success factor for resilient operations.

Communications design involves the channels and methods for communication between support teams, management and end users. For critical outages, a tested crisis communications process will help to minimise reputation damage. There is nothing worse for customer trust and brand reputation than no communication at all.

In addition to monitoring and alerting, operations tools are needed for team collaboration, communication, service management and knowledge management.

Training is needed on the operations processes, procedures, support model, communications and tools.

4. Test the resiliency of the design

Assuming that functional tests have been performed, the resilience testing will focus on the following for each service, although the language may differ by type of digital infrastructure and vendor:

  • Load testing – does it perform under normal load? Does it perform with additional load? Does it scale / auto-scale as designed? Is monitoring configured properly to observe performance?
  • Stress testing – at what load does it fail and how does it fail? Is it fault-tolerant by design? Are those failures being detected by monitoring and alerting?
  • Stability testing – does it operate continuously for long periods? Is there any noticeable deterioration observed?
  • Fault tolerance testing – does it fail over automatically as designed? Does it fail back?
  • Security testing – penetration testing across the network, infrastructure and application layers, scanning code and requests to identify common vulnerabilities. DDoS simulation testing.
  • Backup & recovery testing – recovering system configurations and data from backups to ensure the backup solution works and the team possesses the expertise to execute the backup recovery process.
  • Disaster recovery testing – this tests the complete loss of a critical facility (e.g. a data centre, cloud region or availability zone, or operations centre). While this may involve technical solutions that can automate DR, it must also be tested as an operational process.

Testing mission-critical systems requires test environments that are as close as possible to the production environment. This is difficult to achieve but essential if the goal is to prevent failures in live operations.

5. Test the resiliency of operations

The operations design can be tested in several ways:

Tabletop scenarios - where knowledge of operating processes and procedures is tested in a group setting. By reviewing normal and service disruption scenarios jointly the team can learn, challenge, question and finally come to a shared understanding of how to deal with technology issues.

Readiness rehearsals - to test that the people and processes are in place to handle operational situations such as incidents, changes and disaster recovery. A rehearsal simulates the real operations with team members taking their assigned operational roles and dealing with both normal and abnormal situations that may occur. Readiness rehearsals achieve the following aims:

  • An opportunity for the team to learn their role and their team interactions.
  • Test people's knowledge of processes, procedures and tools that will be used to deliver technology support.
  • Test communication channels between operations centres and field teams, including crisis management processes and escalations.
  • Identify areas of improvement to ensure that the teams can cope with any situation that might arise in future.

Chaos Engineering, aka failure injection testing, simulates stress by systematically disrupting different elements of the infrastructure, including:

  • Randomly halting specific virtual machines.
  • Arbitrarily obstructing network connectivity along select paths.

Chaos engineering can verify that the design is robust and can gracefully handle faults. It also aids in the early detection and resolution of issues before they impact users, ultimately improving performance against Service Level Agreements (SLAs).
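As a toy illustration of failure injection, the sketch below halts randomly chosen instances in a simulated pool and checks whether enough healthy instances remain to keep the service up. Everything here is a simulation with hypothetical names (`vm-0`, `min_healthy` and so on); real chaos experiments act on actual infrastructure through dedicated tooling and are run with safeguards in place.

```python
import random


def inject_failures(instances: dict[str, bool], count: int,
                    rng: random.Random) -> list[str]:
    """Halt `count` randomly chosen running instances and return their names."""
    running = [name for name, up in instances.items() if up]
    victims = rng.sample(running, k=min(count, len(running)))
    for name in victims:
        instances[name] = False
    return victims


def service_available(instances: dict[str, bool], min_healthy: int) -> bool:
    """The service counts as up while at least `min_healthy` instances run."""
    return sum(instances.values()) >= min_healthy


def run_experiment(pool_size: int, failures: int, min_healthy: int,
                   seed: int = 0) -> bool:
    """Build a simulated pool, inject failures, and report if it survived."""
    rng = random.Random(seed)  # seeded so that an experiment is repeatable
    instances = {f"vm-{i}": True for i in range(pool_size)}
    halted = inject_failures(instances, failures, rng)
    survived = service_available(instances, min_healthy)
    print(f"halted {halted}, service available: {survived}")
    return survived


run_experiment(pool_size=4, failures=1, min_healthy=2)
run_experiment(pool_size=4, failures=3, min_healthy=2)
```

The first experiment leaves three healthy instances and the service survives; the second leaves only one and the service is lost, which is the kind of weakness a chaos experiment is meant to surface before users do.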

Achieving service performance targets

With clear performance targets and a well-designed and tested infrastructure and operations, high levels of reliability and resilience can be achieved, even when unexpected disruptions occur.

To maintain resilience, organisations should be paranoid that the next critical outage may be just around the corner. Consider again that most outages are caused by human error. In an increasingly software-defined world, this becomes visible through code and configuration changes. So operations teams need to become competent at deploying changes successfully at speed, and if needed, quickly detecting and resolving incidents.