Resilience Assessment - Quantifying Application Fault Tolerance with AWS Resilience Hub

Learn how to assess application fault tolerance with AWS Resilience Hub, including defining RTO/RPO targets, configuring resilience policies, running automated assessments, and leveraging improvement recommendations.

Why Resilience Assessment Matters

Application fault tolerance (resilience) is quantified by two objectives: how quickly you must recover from a failure (RTO: Recovery Time Objective) and how much recent data, measured as a window of time, you can afford to lose (RPO: Recovery Point Objective). However, many organizations have vague RTO/RPO targets or haven't verified whether their current architecture can meet those targets. AWS Resilience Hub is a service that quantitatively assesses application fault tolerance and provides improvement recommendations. It automatically discovers resource configurations from CloudFormation stacks or Terraform state files and calculates estimated RTO/RPO for AZ failure, Region failure, and application failure scenarios. It then compares these estimates against your defined RTO/RPO targets to determine whether each target can be met.
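The comparison step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Resilience Hub computes its results: the `Objective` class, the `meets_target` and `evaluate` helpers, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """An RTO/RPO pair, both expressed in seconds."""
    rto_secs: int
    rpo_secs: int


def meets_target(estimate: Objective, target: Objective) -> bool:
    # A target is met only when BOTH the estimated recovery time and the
    # estimated data-loss window are at or below the target values.
    return (estimate.rto_secs <= target.rto_secs
            and estimate.rpo_secs <= target.rpo_secs)


def evaluate(estimates: dict, targets: dict) -> dict:
    """Return, per failure scenario, whether the estimate meets the target."""
    return {name: meets_target(estimates[name], targets[name]) for name in targets}


# Illustrative targets and estimates (made-up values for this sketch).
targets = {
    "AZ": Objective(rto_secs=3600, rpo_secs=300),
    "Region": Objective(rto_secs=4 * 3600, rpo_secs=3600),
    "Application": Objective(rto_secs=1800, rpo_secs=300),
}
estimates = {
    "AZ": Objective(rto_secs=600, rpo_secs=0),
    "Region": Objective(rto_secs=86400, rpo_secs=86400),  # e.g. no cross-Region copy
    "Application": Objective(rto_secs=1200, rpo_secs=300),
}

result = evaluate(estimates, targets)
for scenario, ok in result.items():
    print(f"{scenario}: {'target met' if ok else 'target NOT met'}")
```

In this made-up data set only the Region scenario misses its target, which is the kind of gap an assessment is meant to surface.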

Resilience Policies and Running Assessments

Using Resilience Hub starts with defining a resilience policy. In the policy, you set RTO and RPO targets for each failure scenario. For example: "AZ failure: RTO 1 hour, RPO 5 minutes," "Region failure: RTO 4 hours, RPO 1 hour," "Application failure: RTO 30 minutes, RPO 5 minutes." Next, you register your application. When you specify a CloudFormation stack, Resilience Hub automatically discovers the resources within it (EC2, RDS, DynamoDB, Lambda, S3, etc.) and maps their dependencies. When you run an assessment, it analyzes each resource's current configuration (Multi-AZ setup, backup settings, replication settings, etc.) and calculates estimated RTO/RPO for each failure scenario. If any resource can't meet its target, the assessment provides specific improvement recommendations for it.
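As a rough sketch, the example targets above could be turned into a request for the boto3 `resiliencehub` client's `create_resiliency_policy` call. Several things here are assumptions to verify against the current API reference: the policy name `prod-web-policy`, the `MissionCritical` tier, the mapping of "application failure" to the `Software` key, and whether additional disruption types (such as `Hardware`) are required. The helper names are hypothetical.

```python
def minutes(n: int) -> int:
    return n * 60


def hours(n: int) -> int:
    return n * 3600


def build_policy_request(name: str, tier: str, targets: dict) -> dict:
    """Build keyword arguments for create_resiliency_policy.

    `targets` maps a disruption type to an (rto_secs, rpo_secs) tuple.
    """
    return {
        "policyName": name,
        "tier": tier,
        "policy": {
            disruption: {"rtoInSecs": rto, "rpoInSecs": rpo}
            for disruption, (rto, rpo) in targets.items()
        },
    }


# Targets taken from the example in the text; key names are assumed.
request = build_policy_request(
    name="prod-web-policy",   # hypothetical policy name
    tier="MissionCritical",   # assumed tier value
    targets={
        "AZ": (hours(1), minutes(5)),
        "Region": (hours(4), hours(1)),
        "Software": (minutes(30), minutes(5)),  # "application failure" (assumed mapping)
    },
)


def create_policy(req: dict) -> dict:
    """Send the request to AWS. Not called here; needs boto3 and credentials."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    client = boto3.client("resiliencehub")
    return client.create_resiliency_policy(**req)


print(request["policy"])
```

Keeping the payload construction separate from the network call makes it easy to review or unit-test the targets before anything is created in an account.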

Improvement Recommendations and FIS Integration

Assessment results present improvement recommendations as specific actions for each resource. For example, a single-AZ RDS instance gets a recommendation to "switch to Multi-AZ deployment," a DynamoDB table without backups gets "enable point-in-time recovery (PITR)," and EC2 instances without Auto Scaling get "create an Auto Scaling group." Each recommendation includes the estimated RTO/RPO improvement if implemented, helping you prioritize. Integration with AWS FIS (Fault Injection Service, formerly Fault Injection Simulator) lets Resilience Hub auto-generate FIS experiment templates for recommended test scenarios (AZ failure simulation, RDS failover, etc.), allowing you to inject actual faults and verify fault tolerance. By cycling through assessment, improvement, testing, and re-assessment, you can continuously improve your application's resilience.
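One way to use the estimated improvements for prioritization is to rank recommendations by how much recovery time they save, with the RPO gain as a tiebreaker. The sketch below assumes you have already copied the figures out of an assessment report. The `Recommendation` class, the resource names, and all numbers are hypothetical and do not reflect the service's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    resource: str
    action: str
    rto_before_secs: int
    rto_after_secs: int
    rpo_before_secs: int
    rpo_after_secs: int

    @property
    def rto_gain_secs(self) -> int:
        return self.rto_before_secs - self.rto_after_secs

    @property
    def rpo_gain_secs(self) -> int:
        return self.rpo_before_secs - self.rpo_after_secs


def prioritize(recs: list) -> list:
    # Largest RTO reduction first; RPO reduction breaks ties.
    return sorted(recs, key=lambda r: (r.rto_gain_secs, r.rpo_gain_secs), reverse=True)


# Made-up figures for the three recommendation types named in the text.
recommendations = [
    Recommendation("sessions-table", "enable point-in-time recovery (PITR)",
                   3600, 3600, 86400, 300),
    Recommendation("orders-db", "switch to Multi-AZ deployment",
                   7200, 120, 300, 0),
    Recommendation("web-fleet", "create an Auto Scaling group",
                   3600, 600, 0, 0),
]

ranked = prioritize(recommendations)
for rec in ranked:
    print(f"{rec.resource}: {rec.action} "
          f"(RTO -{rec.rto_gain_secs}s, RPO -{rec.rpo_gain_secs}s)")
```

Ranking by RTO first is only one possible choice; a team whose main gap is data loss would sort by RPO gain instead.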

Operations and Continuous Assessment

Resilience Hub supports ongoing resilience management, not just one-time assessments. When your application's resource configuration changes (CloudFormation stack updates), drift detection identifies the changes and prompts re-assessment. Assessments can be run manually or integrated into CI/CD pipelines for automatic execution at deployment time. Organizations integration enables centralized management of applications across multiple accounts. Pricing is a flat rate of $15 per application per month with no limit on assessment frequency. Positioned as a tool for automating the reliability pillar review of the Well-Architected Framework, it also integrates with the Well-Architected Tool.
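A deployment-time check might look roughly like the following: start an assessment, poll until it finishes, and fail the pipeline stage if the policy is breached. This is a minimal sketch that assumes the boto3 `resiliencehub` client's `start_app_assessment` and `describe_app_assessment` calls and their `assessmentStatus` / `complianceStatus` fields. The function name, the `release` app version, the assessment name, and the polling intervals are all assumptions, and the app is assumed to have been published beforehand.

```python
import time


def run_assessment_gate(client, app_arn, app_version="release",
                        name="ci-assessment", poll_secs=15.0,
                        timeout_secs=1800.0, sleep=time.sleep):
    """Return True if the assessment finishes and the policy is met.

    `client` is expected to behave like boto3.client("resiliencehub");
    it is passed in so the function can be exercised with a stub.
    """
    started = client.start_app_assessment(
        appArn=app_arn, appVersion=app_version, assessmentName=name
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    waited = 0.0
    while True:
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            # The assessment ran; now check whether targets were met.
            return assessment.get("complianceStatus") == "PolicyMet"
        if status == "Failed":
            raise RuntimeError(f"Assessment {assessment_arn} failed")
        if waited >= timeout_secs:
            raise TimeoutError(f"Assessment {assessment_arn} still {status}")
        sleep(poll_secs)
        waited += poll_secs
```

In a pipeline step you would create the real client with `boto3.client("resiliencehub")`, call `run_assessment_gate(client, app_arn)`, and exit non-zero when it returns `False`.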

Resilience Hub Pricing

Resilience Hub is priced per application rather than per assessment: as noted above, the rate is $15 per application per month, and running assessments more often does not increase the bill. Defining resilience policies (RTO/RPO targets) incurs no additional charge. A recommended operational pattern is to schedule periodic assessments (monthly or quarterly) and run re-assessments after architecture changes. Implementation costs for recommended improvements (Multi-AZ deployment, backup configuration, etc.) are separate.

Summary - Guidelines for Using Resilience Hub

AWS Resilience Hub is a service that quantitatively assesses application fault tolerance using RTO/RPO metrics and provides improvement recommendations. Its key strengths are automatic resource discovery from CloudFormation, assessment across three failure scenarios, and integration with FIS for test execution. We recommend starting by defining RTO/RPO targets for mission-critical production applications and assessing the current state with Resilience Hub. At $15/month per application, the ability to understand potential failure impact in advance delivers significant value.