HLD008 - Multi-Region Disaster Recovery

Introduction

Purpose

The purpose of this High-Level Architecture Design document is to describe the strategy for deploying AWS resources across multiple regions, with a key focus on Disaster Recovery (DR). The document provides an overview of the proposed architecture and shows how services, data, and network configurations are designed to create a robust, scalable, and resilient infrastructure that can recover from region-specific failures.

Changelog

Revision    Date          Description
1.0         23.07.2024    Initial document

Background

This architecture deploys AWS resources in multiple regions, using cross-region replication services and Transit Gateways. Using multiple regions enhances data redundancy and improves disaster recovery by allowing rapid failover. The design aims for robust performance and high availability of services, which are critical for business continuity.

Architecture diagram

HLD008-MRD-01.png

Explanation

The architecture replicates vital data, secrets, container registry contents, and backups across multiple AWS regions, using services such as S3, RDS, and Secrets Manager together with suitable backup solutions.

Implementation Details

The proposed system uses an active-passive model, with the primary AWS region serving as the active environment and a separate AWS region as the passive standby environment. The system periodically backs up crucial data and stores it safely, ready for restoration in the event of a disaster. The network connection between the regions is always active.

Inter-region connection

HLD008-MRD-02.png

Transit Gateway peering is a secure and highly efficient mechanism to interconnect resources in different regions. It provides secure communication since traffic remains on the global AWS backbone and never traverses the public internet. Peering connections also provide high bandwidth and low-latency links between regions, which are vital for applications that require real-time or near real-time data synchronization.

This architecture allows resources in a VPC of one region to communicate with resources in a VPC of another region just as if they were within the same region. This inter-region communication is critical for data replication and disaster recovery.
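Two practical prerequisites for this kind of inter-region routing are that the VPC CIDR blocks in the two regions do not overlap, and that routes across a Transit Gateway peering attachment are added statically, since peering attachments do not propagate routes. The following is a minimal sketch of both checks in plain Python. The region names, CIDR blocks, and attachment ID are hypothetical placeholders and are not taken from this design. The dictionary keys mirror parameter names of the EC2 `CreateTransitGatewayRoute` API, which also requires a route table ID that is omitted here.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


def static_routes(local_region, vpcs_by_region, peering_attachment_id):
    """List the static routes the local Transit Gateway needs to reach remote VPCs."""
    return [
        {
            "DestinationCidrBlock": cidr,
            "TransitGatewayAttachmentId": peering_attachment_id,
        }
        for region, cidrs in sorted(vpcs_by_region.items())
        if region != local_region  # only remote regions need a route via the peering
        for cidr in cidrs
    ]


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    vpcs = {"eu-central-1": ["10.10.0.0/16"], "eu-west-1": ["10.20.0.0/16"]}
    print(find_overlaps({"primary": "10.10.0.0/16", "standby": "10.20.0.0/16"}))
    print(static_routes("eu-central-1", vpcs, "tgw-attach-example"))
```

Running the overlap check before provisioning is cheap, and overlapping ranges are difficult to fix once workloads are deployed.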

Compute infrastructure

Compute resources such as EC2 instances, EKS clusters, and RDS databases are not continuously replicated across regions. Instead, these resources are provisioned in the failover region through CI/CD pipelines when a disaster recovery process is initiated. This approach keeps standby costs low and ensures that the latest versions of applications and configurations are deployed during a DR event.

Data Replication

  • Data Replication: Application data is stored in Amazon S3 and replicated to the failover region with S3 Cross-Region Replication. For databases, RDS cross-region read replicas ensure that a live copy of the database is always available in the failover region.

  • Secrets Replication: AWS Secrets Manager is used for managing and storing application secrets. Secrets are replicated across regions, ensuring their availability for applications in the failover region.

  • Backup Replication: Depending on the nature of the backups, appropriate AWS solutions will be used to replicate backups across regions. This ensures the availability of the latest backups in the failover region.
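As an illustration of the replication settings described above, the sketch below builds the request payloads in plain Python. The role ARN, bucket ARN, rule ID, and region names are hypothetical placeholders. The first dictionary follows the `ReplicationConfiguration` structure of the S3 `PutBucketReplication` API (which requires versioning on both the source and destination buckets), and the second follows the `AddReplicaRegions` parameter of the Secrets Manager `ReplicateSecretToRegions` API. It shows the shape of the configuration rather than the project's actual setup.

```python
def replication_configuration(role_arn, destination_bucket_arn, rule_id="dr-replication"):
    """Build an S3 ReplicationConfiguration that replicates every object in a bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: rule applies to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


def replica_regions(regions):
    """Build the AddReplicaRegions list for replicating a secret to other regions."""
    return [{"Region": region} for region in regions]


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(replication_configuration(
        "arn:aws:iam::111122223333:role/example-replication-role",
        "arn:aws:s3:::example-dr-bucket",
    ))
    print(replica_regions(["eu-west-1"]))
```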

Disaster Recovery Process

When a disaster affects the active region, the recovery process starts in the passive region using the stored backups. Resources are restored to their latest state: Amazon EC2 instances are recreated from AMIs and EBS snapshots, and Amazon RDS data is restored from snapshots. The restoration process can be automated using CloudFormation or Terraform in combination with CI/CD pipelines.
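One step an automated restore typically needs is choosing which snapshot to restore from. The sketch below shows one possible selection rule: take the newest snapshot that is in the `available` state. It assumes the input is a list shaped like the `DBSnapshots` entries returned by the RDS `DescribeDBSnapshots` API. The instance and snapshot identifiers are hypothetical, and the actual pipeline may apply a different policy.

```python
from datetime import datetime, timezone


def latest_available_snapshot(snapshots, db_instance_id):
    """Return the newest 'available' snapshot for one DB instance, or None."""
    candidates = [
        s
        for s in snapshots
        if s["DBInstanceIdentifier"] == db_instance_id and s["Status"] == "available"
    ]
    # max() with default=None avoids an exception when nothing is restorable.
    return max(candidates, key=lambda s: s["SnapshotCreateTime"], default=None)


if __name__ == "__main__":
    # Hypothetical snapshot metadata, for illustration only.
    sample = [
        {"DBSnapshotIdentifier": "app-db-1", "DBInstanceIdentifier": "app-db",
         "Status": "available", "SnapshotCreateTime": datetime(2024, 7, 22, tzinfo=timezone.utc)},
        {"DBSnapshotIdentifier": "app-db-2", "DBInstanceIdentifier": "app-db",
         "Status": "creating", "SnapshotCreateTime": datetime(2024, 7, 23, tzinfo=timezone.utc)},
    ]
    print(latest_available_snapshot(sample, "app-db")["DBSnapshotIdentifier"])
```

Skipping snapshots that are still being created matters here: the most recent entry is not always one that can be restored.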

Amazon Route 53 is used to handle DNS failover and health checks. In the event of a disaster, Route 53 health checks detect the outage in the primary region and DNS queries are automatically redirected to the secondary (passive) region, minimizing disruption to end users.
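The DNS failover described above is commonly expressed as a pair of Route 53 failover records: a PRIMARY record tied to a health check and a SECONDARY record that is served when the primary is unhealthy. The sketch below builds such a `ChangeBatch` in the shape expected by the Route 53 `ChangeResourceRecordSets` API. The record name, IP addresses, health check ID, and TTL are hypothetical, and the design may equally use alias records pointing at load balancers.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def record(role, ip, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",  # must be unique per record
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,  # a short TTL lets clients pick up the failover quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Active-passive DNS failover (illustrative sketch)",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    batch = failover_change_batch(
        "app.example.com.", "192.0.2.10", "198.51.100.10", "example-health-check-id"
    )
    for change in batch["Changes"]:
        print(change["ResourceRecordSet"]["Failover"], change["ResourceRecordSet"]["ResourceRecords"])
```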

Failback Strategy

After the disaster is resolved in the primary region, a failback process can be initiated to return operations to the primary region. This process involves creating backups of the current system state in the secondary region and restoring them in the primary region.

Expected Outcomes

  1. Enhanced disaster recovery capabilities

  2. Improved data redundancy

  3. Increased network resilience

  4. System scalability

Last modified: 17 February 2025