Disaster Recovery

Revision	Date	Description
`1.0`	26.08.2024	Initial document

Introduction

Disaster recovery (DR) strategies for Amazon EC2 instances generally follow one of a few common patterns, chosen according to the organization's Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the maximum acceptable length of time that your application can be offline. RPO is the maximum amount of data loss your application can tolerate, usually expressed as the time between the last recoverable copy of the data and the disaster.
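As a rough illustration of how an RPO target constrains a backup schedule, the following sketch computes an upper bound on data loss for periodic backups and compares it with the target. The function names and the simple "interval plus backup duration" model are assumptions made for this example, not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data loss (as time) for periodic backups.

    Assumed model: a disaster strikes just before the next backup
    completes, so everything written since the previous backup started
    is lost.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

Under this model, a daily backup cannot satisfy a four-hour RPO, while an hourly backup can.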

Solutions

Backup and Restore: In this pattern, your key data is backed up regularly using AWS-native backup mechanisms. You can take snapshots of your Amazon Elastic Block Store (EBS) volumes, which provide point-in-time copies of your data, or create an Amazon Machine Image (AMI), which is a template containing a software configuration (such as an operating system, an application server, and applications).
When a disaster strikes, you can launch new EC2 instances from the AMIs, or create new volumes from the EBS snapshots and attach them to replacement instances. While this approach is simple, it has a high RTO and RPO: you have to wait for the backup to be restored, and any data written after the most recent backup is lost. This method is best suited for applications that can tolerate a longer recovery period.
This is the lowest-cost method for disaster recovery. You pay only for backup storage: EBS snapshots, including the snapshots that back AMIs, are stored in Amazon S3. The cost depends on the size of your backups and how often you take them. Operational costs are minimal until a restore is needed.
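A minimal sketch of the backup step, assuming boto3 EC2 clients are passed in by the caller. The function name `snapshot_and_copy`, the default description, and the choice to copy the snapshot to a second Region are assumptions for illustration. The boto3 calls used are `create_snapshot`, the `snapshot_completed` waiter, and `copy_snapshot`.

```python
def snapshot_and_copy(source_ec2, dest_ec2, volume_id, source_region,
                      description="DR backup"):
    """Snapshot an EBS volume and copy the snapshot to another Region.

    source_ec2 / dest_ec2 are EC2 clients for the source and destination
    Regions, for example (assumed Region names):
        source_ec2 = boto3.client("ec2", region_name="eu-central-1")
        dest_ec2 = boto3.client("ec2", region_name="eu-west-1")
    """
    # Take a point-in-time snapshot of the volume in the source Region.
    snapshot = source_ec2.create_snapshot(
        VolumeId=volume_id,
        Description=description,
    )
    snapshot_id = snapshot["SnapshotId"]

    # Block until the snapshot has completed before copying it.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # copy_snapshot is called on the client of the destination Region.
    copy = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=description,
    )
    return snapshot_id, copy["SnapshotId"]
```

Running a function like this on a schedule determines the effective RPO of the pattern: the longer the interval between runs, the more data can be lost.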
Pilot Light: In this pattern, a minimal version of the environment (the "pilot light") is always running and can be rapidly scaled out to full capacity in case of a disaster.
The key systems are set up and data replication is in place, so capacity can be increased quickly when a disaster happens. Using AWS services such as Auto Scaling and Elastic Load Balancing, you can scale up your resources to handle the production load. This method has a lower RTO and RPO than the Backup and Restore approach.
The costs are higher than for Backup and Restore because some critical systems run continuously: you pay for the minimal version of the application and for data replication. However, the cost is still lower than running a full-scale application, as you scale up resources only when a disaster occurs.
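The scale-up step could look roughly like the sketch below, assuming the recovery environment's instances are managed by an EC2 Auto Scaling group and that a boto3 `autoscaling` client is passed in. The function name and the group name in the usage comment are hypothetical. `update_auto_scaling_group` is the boto3 call used.

```python
def scale_up_pilot_light(autoscaling, group_name, min_size,
                         desired_capacity, max_size):
    """Raise an Auto Scaling group from pilot-light size to production size.

    autoscaling is an Auto Scaling client, for example:
        autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
        scale_up_pilot_light(autoscaling, "dr-app-asg", 2, 4, 8)
    ("dr-app-asg" is a hypothetical group name.)
    """
    # Reject inconsistent sizes locally instead of waiting for an API error.
    if not (0 <= min_size <= desired_capacity <= max_size):
        raise ValueError("expected 0 <= min_size <= desired_capacity <= max_size")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=min_size,
        MaxSize=max_size,
        DesiredCapacity=desired_capacity,
    )
```

Keeping this step in a script rather than in a runbook of manual console actions makes it easier to rehearse during DR tests.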
Warm Standby: A warm standby DR strategy keeps a scaled-down but fully functional version of the environment always running in the cloud. It extends the Pilot Light elements and preparation, and further decreases the recovery time because the services are already running.
In the event of a disaster, you can quickly scale up this environment to handle the production load. This reduces the RTO compared to the previous two methods and, because data is continuously replicated, keeps the RPO low.
This is more expensive than the Pilot Light method because a scaled-down version of the full environment is always running. You pay for running instances, data replication, and possibly load balancing. In return, recovery is faster: the applications are already running and only need to be scaled up.
Multi-Site / Hot Standby: This approach involves running a full-scale duplicate of your production environment in a different Availability Zone or AWS Region. This environment is always on and can take over almost immediately if something happens to your production environment.
With Amazon Route 53, the AWS DNS service, you can use routing policies to determine where your traffic goes. For example, you could split your traffic evenly between the two environments (weighted routing), or you could route all traffic to the standby environment only when your primary environment is unavailable (failover routing). This approach provides the lowest RTO and RPO.
This is the most expensive solution, as it involves running a full-scale duplicate of your live environment in another Availability Zone or AWS Region. It includes the cost of running duplicate instances, data replication, and traffic management. It offers the fastest recovery time, but at a significantly higher cost.
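The failover routing described above can be sketched as a Route 53 change batch. The helper below only builds the request payload; the function name, the `SetIdentifier` values, the plain `A` records with IP addresses, and the 60-second TTL are assumptions for illustration (a real setup would often use alias records pointing at load balancers). The payload shape follows the `ChangeBatch` argument of the boto3 `change_resource_record_sets` call.

```python
def failover_change_batch(record_name, primary_ip, standby_ip,
                          health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover records."""

    def change(role, ip, extra):
        record_set = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": role.lower(),  # must be unique per record set
            "Failover": role,               # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        record_set.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing between primary and standby environments",
        "Changes": [
            # The primary record is served only while its health check passes.
            change("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            change("SECONDARY", standby_ip, {}),
        ],
    }


# Applying the batch (hosted zone ID is a placeholder):
#   route53 = boto3.client("route53")
#   route53.change_resource_record_sets(
#       HostedZoneId="<hosted-zone-id>",
#       ChangeBatch=failover_change_batch(...),
#   )
```

For an even split between two active environments, the same record sets would carry a `Weight` field instead of `Failover`.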

In all patterns, make sure that backups or replicas of your data are stored in more than one Availability Zone or AWS Region to avoid a single point of failure. It is equally important to test your disaster recovery strategy regularly to confirm that it works as expected.

These are general guidelines, and the best approach for your organization will depend on your specific needs and constraints. When planning for disaster recovery, the goal is to find the right balance between downtime, data loss, and cost.
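One way to make this trade-off explicit is to estimate RTO, RPO, and cost for each pattern in your own environment and pick the cheapest pattern that meets both targets. The sketch below assumes such estimates are supplied by the caller. The `PatternEstimate` type and `cheapest_pattern` function are hypothetical names, and no figures are implied for any pattern.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class PatternEstimate:
    """Caller-supplied estimate for one DR pattern in a given environment."""
    name: str
    rto: timedelta       # estimated time to recover
    rpo: timedelta       # estimated worst-case data loss, as time
    monthly_cost: float  # estimated running cost, in any consistent currency


def cheapest_pattern(estimates: Iterable[PatternEstimate],
                     target_rto: timedelta,
                     target_rpo: timedelta) -> Optional[PatternEstimate]:
    """Return the lowest-cost pattern that meets both targets, or None."""
    candidates = [
        e for e in estimates
        if e.rto <= target_rto and e.rpo <= target_rpo
    ]
    return min(candidates, key=lambda e: e.monthly_cost, default=None)
```

If the function returns `None`, no evaluated pattern meets the targets, and either the targets or the budget need to be revisited.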

Last modified: 17 February 2025