AWS Elastic Disaster Recovery
A DR service that continuously replicates on-premises and other-cloud servers to AWS, enabling failover in minutes
Overview
AWS Elastic Disaster Recovery (DRS) is a disaster recovery service that continuously replicates on-premises servers and virtual machines running in other clouds to AWS, so that you can fail over within minutes when a disaster occurs. Block-level replication keeps the RPO (Recovery Point Objective) in the range of seconds and the RTO (Recovery Time Objective) in the range of minutes. Replication starts once the agent is installed on the source server and requires no changes to the OS or applications. Both Windows and Linux are supported.
Continuous Data Replication Architecture
In DRS, the AWS Replication Agent installed on the source server detects block-level changes and transfers them to replication servers in the staging area. The initial sync performs a full disk copy, after which only changed blocks are transferred continuously. The staging area consists of lightweight EC2 instances (typically around t3.small) and EBS volumes, so it holds the source server's disk data at low cost. Replication traffic is encrypted with TLS, and bandwidth throttling lets you limit the impact on business networks. The point-in-time recovery feature lets you recover from any of the point-in-time snapshots that are still retained, not only from the latest state. If data is encrypted by a ransomware attack, you can therefore roll back to a snapshot taken before the infection. Replication lag can be monitored in the console and is also available as CloudWatch metrics.
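The ransomware rollback described above comes down to picking the newest recovery point that predates the infection. The following is a minimal sketch of that selection, assuming each snapshot is a dictionary with `snapshotID` and an ISO 8601 `timestamp`, shaped loosely like the entries a snapshot listing call might return. The sample IDs, times, and the helper names `parse_ts` and `latest_snapshot_before` are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_snapshot_before(snapshots, cutoff: datetime):
    """Return the newest snapshot taken strictly before `cutoff`, or None."""
    candidates = [s for s in snapshots if parse_ts(s["timestamp"]) < cutoff]
    return max(candidates, key=lambda s: parse_ts(s["timestamp"]), default=None)


# Hypothetical sample data; IDs and times are made up for illustration.
snapshots = [
    {"snapshotID": "pit-0001", "timestamp": "2024-05-01T09:00:00+00:00"},
    {"snapshotID": "pit-0002", "timestamp": "2024-05-01T10:00:00Z"},
    {"snapshotID": "pit-0003", "timestamp": "2024-05-01T11:00:00+00:00"},
]

# Assumed time at which the infection is believed to have started.
infection_time = datetime(2024, 5, 1, 10, 30, tzinfo=timezone.utc)

chosen = latest_snapshot_before(snapshots, infection_time)
print(chosen["snapshotID"])  # pit-0002
```

In practice the estimated infection time is uncertain, so choosing a cutoff somewhat earlier than the first observed symptom is a safer starting point.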
Recovery Drills and Failback Operations
The most important part of operating DRS is conducting regular recovery drills (DR exercises). Drills run without interrupting replication from the source server and have no impact on the production environment. When you initiate a drill, an EC2 instance is launched from the latest replicated data so that you can verify that the recovered server works correctly. If you terminate the drill instance after verification, the additional cost is limited to the charges for the instance and its volumes while they were running. It is recommended to build monthly or quarterly drills into operational procedures and to record the RTO actually measured in each drill. In an actual disaster, you run a recovery and continue business operations on the AWS instances in place of the source servers. Once the disaster is over, DRS also supports failback (returning to the original environment) by replicating in the reverse direction, from the AWS instances back to on-premises. When failback is complete and replication to AWS resumes, you are back in the normal DR posture.
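The drill routine above can be scripted so that each run also records a measured RTO. The following is a minimal sketch, assuming `drs_client` behaves like `boto3.client("drs")` and that its `start_recovery` call accepts `isDrill` and `sourceServers`. The `FakeDrsClient` class, the server ID, and the job ID are hypothetical stand-ins that let the example run without AWS access.

```python
from datetime import datetime, timezone


def start_drill(drs_client, source_server_ids):
    """Ask DRS to launch drill instances for the given source servers.

    `drs_client` is assumed to behave like boto3.client("drs").
    No snapshot ID is passed, on the assumption that the service then
    launches from the most recent replicated data.
    """
    response = drs_client.start_recovery(
        isDrill=True,  # a drill, not an actual recovery
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )
    return response["job"]["jobID"]


def measured_rto_minutes(started_at: datetime, verified_at: datetime) -> float:
    """Minutes from starting the drill to confirming the server works."""
    return (verified_at - started_at).total_seconds() / 60


class FakeDrsClient:
    """Hypothetical stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.calls = []

    def start_recovery(self, **kwargs):
        self.calls.append(kwargs)
        return {"job": {"jobID": "drsjob-example"}}


client = FakeDrsClient()
job_id = start_drill(client, ["s-1111111111example"])

# Timestamps an operator would note during the drill (made-up values).
rto = measured_rto_minutes(
    datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 9, 18, tzinfo=timezone.utc),
)
print(job_id, rto)  # drsjob-example 18.0
```

Passing the client in as a parameter keeps the drill logic testable offline. Swapping in a real client is left to the reader and should be checked against the current SDK documentation.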
Launch Templates and Network Design
The configuration of the EC2 instances launched during recovery is defined in advance in the launch settings. You specify the instance type, subnet, security group, IAM role, EBS volume type, and so on, so that recovered instances have network connectivity and security settings equivalent to the production environment. If the source server has multiple network interfaces, you can configure the same number of ENIs on the recovered instance. For network design, it is recommended to separate the staging area subnet from the recovery target subnet. The staging area subnet should be dedicated to replication traffic, while the recovery target subnet is designed to meet the network requirements of the production workload. For connectivity from on-premises, Direct Connect or Site-to-Site VPN provides a private communication path from the replication agent to the staging area. Combining Route 53 health checks with failover routing automates the DNS cutover to the recovered instances.
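The DNS cutover mentioned above can be illustrated by the pair of failover records it relies on. The following is a minimal sketch that only builds the change batch as a dictionary, following the shape Route 53 uses for failover routing (`SetIdentifier`, `Failover`, `HealthCheckId`). The domain name, IP addresses, health check ID, and the helper `failover_change_batch` are hypothetical, and the secondary address is assumed to be one reserved in advance for the recovered instance (for example an Elastic IP).

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a change batch with PRIMARY and SECONDARY failover A records."""

    def record(role, ip, extra=None):
        record_set = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL speeds up the cutover
            "ResourceRecords": [{"Value": ip}],
        }
        record_set.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "DNS cutover for DR (illustrative)",
        "Changes": [
            # The health check on the primary decides when traffic fails over.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip),
        ],
    }


# Hypothetical values: documentation-range IPs and a placeholder health check ID.
batch = failover_change_batch(
    "app.example.com.", "203.0.113.10", "198.51.100.20", "hc-example-id"
)
for change in batch["Changes"]:
    rs = change["ResourceRecordSet"]
    print(rs["Failover"], rs["ResourceRecords"][0]["Value"])
```

A dictionary like this would typically be passed as the `ChangeBatch` argument of the Route 53 `change_resource_record_sets` call together with a hosted zone ID. Note that DNS failover only redirects traffic; the recovery instances themselves still have to be launched through DRS.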