AWS DR Strategy Options - Designing Disaster Recovery Step by Step from Pilot Light to Multi-Site

We explain the breadth and flexibility of disaster recovery options AWS provides, focusing on Pilot Light, Warm Standby, Multi-Site Active/Active strategies, and Elastic Disaster Recovery.

About a 7-minute read. Last updated: 2025-10-23

DR Strategy Is a Trade-off Between Cost and Recovery Speed

Disaster Recovery (DR) design revolves around two metrics: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is the maximum acceptable time to restore service after a failure, and RPO is the maximum acceptable window of data loss, measured back in time from the moment of failure. The shorter you make RTO and RPO, however, the higher the cost of maintaining the DR environment. AWS defines four tiers of DR strategy - Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active - allowing you to choose the balance of cost and recovery speed that fits your business requirements. This tiered approach is built into the Reliability Pillar of the AWS Well-Architected Framework, which provides systematic DR design guidance.
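To make the two metrics concrete, the following Python sketch computes the RTO and RPO actually achieved in a single incident and compares them with targets. The timestamps, the target values, and the function name `achieved_rto_rpo` are illustrative assumptions, not figures from AWS.

```python
from datetime import datetime, timedelta


def achieved_rto_rpo(last_recovery_point, failure_time, service_restored):
    """Return (rto, rpo) achieved in one incident, as timedeltas.

    rto: how long the service was down (failure -> restored).
    rpo: how much data was lost (last recoverable point -> failure).
    """
    rto = service_restored - failure_time
    rpo = failure_time - last_recovery_point
    return rto, rpo


# Hypothetical incident: nightly backup at 02:00, failure at 14:30,
# service restored at 18:00 on the same day.
last_backup = datetime(2025, 1, 15, 2, 0)
failure = datetime(2025, 1, 15, 14, 30)
restored = datetime(2025, 1, 15, 18, 0)

rto, rpo = achieved_rto_rpo(last_backup, failure, restored)

# Hypothetical targets agreed with the business.
target_rto = timedelta(hours=4)
target_rpo = timedelta(hours=1)

print(f"Achieved RTO: {rto}, meets target: {rto <= target_rto}")
print(f"Achieved RPO: {rpo}, meets target: {rpo <= target_rpo}")
```

In this example the 3.5-hour outage meets the RTO target, but a nightly backup leaves 12.5 hours of data unrecoverable, so meeting the RPO target would require more frequent backups or continuous replication.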

The Four Tiers of DR Strategy

Backup & Restore is the lowest-cost strategy. Data backups are stored in S3 in another region, and resources are rebuilt from backups when a failure occurs. RTO ranges from hours to tens of hours, and RPO depends on backup frequency. Costs are limited to backup storage, but recovery takes time.

Pilot Light keeps only minimal core components, such as database replicas, running in the DR region at all times. When a failure occurs, stopped application servers and load balancers are launched. RTO ranges from tens of minutes to hours, and RPO depends on replication lag.

Warm Standby keeps a scaled-down version of the production environment running in the DR region at all times. When a failure occurs, it scales up to handle production traffic. RTO is minutes to tens of minutes.

Multi-Site Active/Active processes traffic in multiple regions simultaneously. RTO and RPO approach zero, but costs are the highest.
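The selection logic implied by the four tiers can be sketched in Python: walk the tiers from cheapest to most expensive and pick the first one whose typical recovery figures fit the targets. The numeric RTO and RPO values per tier below are illustrative assumptions chosen to match the rough ranges described above, not AWS-published guarantees, and `DrTier` and `choose_dr_tier` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    typical_rto_minutes: float  # illustrative assumption, not an AWS figure
    typical_rpo_minutes: float  # illustrative assumption, not an AWS figure


# Ordered from lowest to highest running cost.
TIERS = [
    DrTier("Backup & Restore", 240, 1440),  # hours to restore, daily backups
    DrTier("Pilot Light", 30, 5),
    DrTier("Warm Standby", 5, 1),
    DrTier("Multi-Site Active/Active", 0, 0),
]


def choose_dr_tier(target_rto_minutes, target_rpo_minutes):
    """Return the cheapest tier whose typical RTO and RPO meet both targets."""
    for tier in TIERS:
        if (tier.typical_rto_minutes <= target_rto_minutes
                and tier.typical_rpo_minutes <= target_rpo_minutes):
            return tier
    # Fall back to the most capable tier if no tier matched.
    return TIERS[-1]


# Hypothetical workloads with (RTO, RPO) targets in minutes.
workloads = {
    "internal wiki": (480, 1440),
    "order history": (60, 15),
    "checkout": (10, 5),
    "payments": (1, 0),
}
for workload, (rto, rpo) in workloads.items():
    print(f"{workload}: {choose_dr_tier(rto, rpo).name}")
```

In practice the per-tier figures should come from your own DR drill measurements rather than fixed constants, since achievable RTO depends heavily on the workload.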

Simplifying DR with Elastic Disaster Recovery

AWS Elastic Disaster Recovery (DRS) is a managed service that replicates on-premises or other cloud servers to AWS and enables rapid failover during disasters. DRS installs an agent on source servers and performs continuous block-level replication. Data is compressed, encrypted, and transferred to AWS, stored in a low-cost staging area. When a disaster occurs, EC2 instances are launched from the staging area data within minutes, and production traffic is switched over. A key feature of DRS is the very low cost during replication. The staging area uses low-cost EBS volumes, and production-spec EC2 instances are only launched during failover. This achieves an RTO close to Warm Standby at a cost close to Pilot Light. Regular DR drills (recovery exercises) can be easily executed from the console without impacting the production environment, allowing you to verify recovery procedures.

AWS Cross-Region Replication Capabilities

AWS's major services natively provide cross-region data replication capabilities. S3 Cross-Region Replication (CRR) automatically replicates objects to an S3 bucket in another region. RDS Cross-Region Read Replicas maintain a read-only copy of the database in another region that can be promoted to primary during a failure. Aurora Global Database replicates data to up to 5 secondary regions, typically with sub-second replication lag and failover in under a minute. DynamoDB Global Tables provides active-active table replication across multiple regions, enabling reads and writes in each region. EBS snapshot cross-region copy, EFS cross-region replication, and Secrets Manager multi-region secrets round out the set, so the DR capabilities needed for the data layer are built into each service. By combining these native features, you can build a DR architecture without depending on third-party tools.

DR Traffic Control with Route 53 and Global Accelerator

The effectiveness of a DR strategy heavily depends on the speed of failure detection and traffic switching. Route 53's health check feature automatically detects endpoint failures and executes DNS failover. By configuring a failover routing policy, traffic is automatically switched to the DR region when the primary region fails. However, DNS-based failover is affected by TTL, so switching may have a lag of tens of seconds to minutes. Global Accelerator provides static IP-based traffic routing leveraging AWS's global network. Since it doesn't depend on DNS TTL, traffic can be switched to another region within tens of seconds after detecting an endpoint failure. In a Multi-Site Active/Active strategy, Global Accelerator's traffic dial feature allows dynamic control of traffic distribution to each region. During a failure, you can set the traffic dial for the affected region to 0% and immediately concentrate traffic on healthy regions.
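As a rough sketch of the DNS failover setup described above, the Python below builds a pair of Route 53 failover records as plain dictionaries, in the shape accepted as the `ChangeBatch` argument of the boto3 Route 53 client's `change_resource_record_sets` call, and estimates the switching lag. The domain name, IP addresses, health check ID, and both helper function names are hypothetical. The lag formula (health check interval times failure threshold, plus TTL) is a simplifying assumption that ignores resolvers that cache beyond the TTL.

```python
def failover_record_change(name, ip_address, role, ttl, health_check_id=None):
    """Build one entry for the "Changes" list of a Route 53 ChangeBatch."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def estimate_dns_failover_seconds(request_interval, failure_threshold, ttl):
    """Rough estimate: time to mark the endpoint unhealthy, plus the time
    for cached DNS answers to expire."""
    return request_interval * failure_threshold + ttl


# Hypothetical values; the IPs come from documentation-only address ranges.
TTL_SECONDS = 60
change_batch = {
    "Changes": [
        failover_record_change(
            "app.example.com.", "192.0.2.10", "PRIMARY", TTL_SECONDS,
            health_check_id="hypothetical-health-check-id",
        ),
        failover_record_change(
            "app.example.com.", "198.51.100.10", "SECONDARY", TTL_SECONDS,
        ),
    ]
}

# Standard (30 s) versus fast (10 s) health check interval, threshold of 3.
standard = estimate_dns_failover_seconds(30, 3, TTL_SECONDS)
fast = estimate_dns_failover_seconds(10, 3, TTL_SECONDS)
print(f"Estimated failover lag: standard={standard}s, fast={fast}s")
```

The estimate shows why a lower TTL and a shorter health check interval both matter: with these assumed values, the lag drops from 150 seconds to 90 seconds by changing the interval alone.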

Practical Considerations for DR Design

DR strategy selection should be based on a Business Impact Analysis (BIA). Define the RTO and RPO for each workload, and reach agreement with management on the appropriate DR strategy and its cost. Not every workload needs Multi-Site Active/Active, so it is realistic to use different strategies based on workload criticality. Beyond maintaining the DR environment, regular DR drills are essential. Even well-documented recovery procedures are unlikely to work during an actual disaster if they have never been practiced. AWS Resilience Hub is a service that evaluates workload resilience and visualizes compliance with RTO and RPO targets.

Summary

AWS systematically defines four tiers of DR strategy - Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active - allowing you to choose the optimal balance of cost and recovery speed based on business requirements. Elastic Disaster Recovery is a managed service that achieves an RTO close to Warm Standby at low cost, significantly lowering the barrier to DR adoption. With rich native cross-region replication capabilities across major services like S3 CRR, Aurora Global Database, and DynamoDB Global Tables, you can build DR architectures without depending on third-party tools. The flexibility of traffic control through Route 53 and Global Accelerator is also a key element in enhancing DR effectiveness. In the breadth of DR strategy options and the maturity of services supporting each strategy, AWS leads other cloud providers.
