AWS Resilience Hub
A resilience management service that quantitatively assesses application fault tolerance, visualizes compliance against RTO/RPO targets, and provides improvement recommendations
Overview
AWS Resilience Hub is a service for systematically assessing and managing application resilience. It automatically discovers application architecture from CloudFormation stacks, EKS clusters, Terraform state files, and other sources, then evaluates compliance against defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets. Based on assessment results, it presents specific improvement recommendations and generates fault injection test plans through integration with AWS Fault Injection Service. Continuous assessments enable early detection of resilience degradation after infrastructure changes.
Resilience Policies and the Assessment Process
Assessment in Resilience Hub begins with defining a resilience policy. A policy sets RTO and RPO targets for four disruption types: infrastructure failures, Availability Zone failures, Region failures, and application failures. For example, you might define "RTO of 30 minutes and RPO of 1 hour for AZ failures." When registering an application, you specify sources such as CloudFormation stacks, resource groups, EKS clusters, or AppRegistry applications. Resilience Hub automatically maps resource dependencies from these sources and generates an application architecture diagram. Running an assessment analyzes whether each component's current configuration meets the policy targets. For instance, a single-AZ RDS instance would be flagged as unable to meet AZ failure RTO/RPO targets. Results are displayed in three tiers - "Policy Met," "Policy Breached," and "Improvement Possible" - with specific recommended actions for each breach, such as enabling Multi-AZ, adjusting backup frequency, or adding read replicas.
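The policy example above ("RTO of 30 minutes and RPO of 1 hour for AZ failures") can be sketched as a request payload. This is a minimal illustration, assuming the boto3 `resiliencehub` client, whose `create_resiliency_policy` call takes a `policy` map keyed by `Software`, `Hardware`, `AZ`, and `Region` with `rtoInSecs` and `rpoInSecs` values. The names `Target`, `build_policy`, and `meets_target` are hypothetical helpers, and the comparison in `meets_target` is a deliberate simplification of what an assessment does, not the service's actual logic.

```python
from dataclasses import dataclass

# Disruption-type keys as expected by the boto3 `resiliencehub` client
# (application, infrastructure, Availability Zone, Region).
DISRUPTION_TYPES = ("Software", "Hardware", "AZ", "Region")


@dataclass(frozen=True)
class Target:
    """RTO/RPO target for one disruption type, in seconds (hypothetical helper)."""
    rto_secs: int
    rpo_secs: int


def build_policy(targets):
    """Convert {disruption_type: Target} into the `policy` request parameter."""
    unknown = sorted(set(targets) - set(DISRUPTION_TYPES))
    if unknown:
        raise ValueError(f"unknown disruption types: {unknown}")
    return {
        name: {"rtoInSecs": t.rto_secs, "rpoInSecs": t.rpo_secs}
        for name, t in targets.items()
    }


def meets_target(target, estimated_rto_secs, estimated_rpo_secs):
    """Simplified compliance check: both estimates must be within the target."""
    return (estimated_rto_secs <= target.rto_secs
            and estimated_rpo_secs <= target.rpo_secs)


# The AZ example from the text: RTO 30 minutes, RPO 1 hour.
az_target = Target(rto_secs=30 * 60, rpo_secs=60 * 60)
policy = build_policy({"AZ": az_target})
```

With real credentials, the payload could then be passed along the lines of `boto3.client("resiliencehub").create_resiliency_policy(policyName="example-policy", tier="Critical", policy=policy)`. The policy name and tier here are placeholders, and the API may require targets for more disruption types than the single one shown.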
Fault Injection Testing Integration
Resilience Hub integrates with AWS Fault Injection Service (FIS) to automatically generate fault injection test plans based on assessment results. It proposes FIS experiment templates for the risk scenarios identified during assessment, such as AZ failures, instance termination, and network latency, so you can reproduce those scenarios and verify application behavior. For example, you can test whether Multi-AZ RDS failover works correctly, or whether an Auto Scaling group detects and replaces failed instances. Test results are recorded in Resilience Hub and reflected in the assessment score. A passing test is evidence of resilience against that specific scenario; a failing test triggers a cycle of improving the configuration per the recommendations and retesting. While fault injection tests can run in production, it is safer to validate thoroughly in a staging environment first. FIS stop conditions, which act as guardrails, should be configured to halt an experiment automatically if it causes unexpected impact.
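As a rough sketch of the Auto Scaling scenario and the stop-condition guardrail described above, the following builds the request for an FIS experiment template that terminates one tagged EC2 instance and halts if a CloudWatch alarm fires. It assumes the boto3 `fis` client's `create_experiment_template` parameters. The function name, the tag, and both ARNs are hypothetical placeholders, and this is hand-written for illustration rather than a template generated by Resilience Hub.

```python
import uuid


def build_terminate_instance_template(role_arn, alarm_arn, tag_key, tag_value):
    """Build keyword arguments for fis.create_experiment_template (sketch)."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Terminate one tagged instance to check Auto Scaling replacement",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the experiment
        # Guardrail: FIS stops the experiment if this alarm goes into ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


# Placeholder ARNs and tag values, for illustration only.
template = build_terminate_instance_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-error-rate",
    tag_key="app",
    tag_value="example-web",
)
```

With credentials and real resources in place, the template could be registered with `boto3.client("fis").create_experiment_template(**template)`. Keeping the builder separate from the API call makes the stop condition easy to review before anything runs.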
Operational Recommendations and Continuous Assessment
The true value of Resilience Hub lies not in one-time assessments but in continuous resilience management. Through EventBridge integration, CloudFormation stack updates and resource changes can trigger automatic reassessments, so a new deployment or infrastructure change that degrades resilience can be flagged shortly after it lands rather than discovered during an incident. Operational Recommendations cover alarm configuration, SOP (Standard Operating Procedure) creation, and periodic fault injection testing. CloudWatch alarm recommendations specify the metrics and thresholds to monitor for each component. SOP recommendations generate recovery procedures as Systems Manager Automation documents, turning manual runbooks into automated workflows. Embedding Resilience Hub assessments in CI/CD pipelines creates a quality gate that automatically blocks deployments violating resilience policies. The assessment API can be invoked from a CodePipeline stage (for example, through a Lambda or CodeBuild action), halting the pipeline on any policy breach. Pricing is based on the number of applications assessed each month; consult the AWS pricing page for current rates.
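The CI/CD quality gate described above might look roughly like the following. This is a sketch under stated assumptions: it relies on the boto3 `resiliencehub` client's `start_app_assessment` and `describe_app_assessment` calls, assumes the application has a published version named `release`, and treats any compliance status other than `PolicyMet` as a failure. `run_resilience_gate` and `ResilienceGateError` are hypothetical names, and the client is passed in so the logic can be exercised without AWS access.

```python
import time


class ResilienceGateError(Exception):
    """Raised when the assessment fails, times out, or breaches the policy."""


def run_resilience_gate(client, app_arn, app_version="release",
                        poll_secs=15, timeout_secs=900, sleep=time.sleep):
    """Start an assessment, wait for it to finish, and fail on a policy breach."""
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion=app_version,
        assessmentName=f"ci-gate-{int(time.time())}",
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    waited = 0
    while True:
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            break
        if status == "Failed":
            raise ResilienceGateError("assessment run failed")
        if waited >= timeout_secs:
            raise ResilienceGateError("timed out waiting for assessment")
        sleep(poll_secs)
        waited += poll_secs

    compliance = assessment.get("complianceStatus")
    if compliance != "PolicyMet":
        raise ResilienceGateError(f"resilience policy not met: {compliance}")
    return assessment
```

In a pipeline step, this could be called as `run_resilience_gate(boto3.client("resiliencehub"), app_arn)`, with the uncaught exception producing a non-zero exit code that stops the stage. Polling with a timeout keeps a stuck assessment from blocking the pipeline indefinitely.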