AWS Fault Injection Simulator
A service that safely runs chaos engineering experiments to verify system resilience
Overview
AWS Fault Injection Simulator (FIS) is a fully managed service for safely conducting chaos engineering experiments in AWS environments. It intentionally introduces failures that could occur in production, such as stopping EC2 instances, injecting network latency, or simulating AZ outages, to verify system resilience. You define actions, targets, and stop conditions in experiment templates to reproduce failures in a controlled way. Stop conditions linked to CloudWatch alarms act as a safety mechanism, automatically halting an experiment that causes unexpected impact.
Experiment Template and Action Design
An FIS experiment template consolidates the type of fault to inject (actions), the target resources (targets), and safety mechanisms (stop conditions) into a single definition. Actions include stopping or rebooting EC2 instances, stopping ECS tasks, triggering RDS failovers, injecting network latency and packet loss, and simulating API throttling. Multiple actions can run in parallel or sequentially, enabling compound failure scenarios such as "first stop 30% of EC2 instances, then inject network latency after 5 minutes." Action durations are specified as ISO 8601 duration strings (e.g., PT5M), and for most actions the affected resources return to their normal state when the experiment ends. Experiment templates are defined in JSON/YAML and can be managed as code through CloudFormation or Terraform. Versioning templates keeps experiments reproducible, allowing re-execution under the same conditions as past runs. Running an experiment requires an IAM role granted least-privilege permissions for each action (ec2:StopInstances, ecs:StopTask, etc.).
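As a concrete illustration, here is a minimal sketch of registering that compound scenario with the boto3 FIS client. The account ID, role ARN, alarm ARN, and tag values are placeholder assumptions, and the latency injection targets a separate instance group because stopped instances cannot execute the SSM command that injects latency.

```python
import json

import boto3

fis = boto3.client("fis")

response = fis.create_experiment_template(
    description="Stop 30% of EC2 instances, then inject network latency",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    targets={
        # 30% of instances tagged env=staging, service=payment
        "stop-targets": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "staging", "service": "payment"},
            "selectionMode": "PERCENT(30)",
        },
        # Separate group for latency injection, capped at 3 instances
        "latency-targets": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "staging", "service": "payment"},
            "selectionMode": "COUNT(3)",
        },
    },
    actions={
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "stop-targets"},
            # Restart the stopped instances automatically after 10 minutes
            "parameters": {"startInstancesAfterDuration": "PT10M"},
        },
        "wait-5-minutes": {
            "actionId": "aws:fis:wait",
            "parameters": {"duration": "PT5M"},
            "startAfter": ["stop-instances"],  # sequential, not parallel
        },
        "inject-latency": {
            # Latency is injected via an SSM document on the target hosts;
            # the document parameters below are illustrative assumptions
            "actionId": "aws:ssm:send-command",
            "targets": {"Instances": "latency-targets"},
            "startAfter": ["wait-5-minutes"],
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
                "documentParameters": json.dumps(
                    {"DurationSeconds": "300", "DelayMilliseconds": "200",
                     "Interface": "eth0", "InstallDependencies": "True"}
                ),
                "duration": "PT5M",
            },
        },
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:error-rate-high",
        }
    ],
)
print(response["experimentTemplate"]["id"])  # e.g. an EXT... template ID
```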
Target Selection and Safety Controls via Stop Conditions
Target selection is the most critical design consideration for FIS safety. Tag-based filtering lets you restrict fault injection to specific environments (env:staging), teams (team:platform), or services (service:payment). You can further cap the affected portion of the selected resources as a percentage (e.g., 30% of matches) or as a count (e.g., at most 3 instances), controlling the blast radius. Stop conditions are linked to CloudWatch alarms and immediately halt the experiment when error rates exceed thresholds or latency drifts outside acceptable bounds. Multiple stop conditions can be configured, and the entire experiment stops if any single one triggers. For production experiments, a gradual scope expansion approach is recommended: start with a single EC2 instance and, if no issues arise, progressively widen the target to 10%, 30%, and 50%. Experiment execution logs are recorded in CloudTrail, providing an audit trail of who ran which experiment and when.
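A stop-condition alarm is an ordinary CloudWatch alarm. The sketch below creates one with boto3; the namespace, metric name, and threshold are illustrative assumptions for a hypothetical application error-rate metric.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fis-stop-error-rate-high",
    Namespace="PaymentService",         # hypothetical custom metric namespace
    MetricName="ErrorRate",
    Statistic="Average",
    Period=60,                          # evaluate every minute
    EvaluationPeriods=2,                # two consecutive breaches fire the alarm
    Threshold=5.0,                      # halt if error rate exceeds 5%
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",       # lost telemetry during a fault counts as bad
)
```

The resulting alarm's ARN is what you reference in the template's stopConditions; treating missing data as breaching errs on the side of halting the experiment if telemetry goes dark mid-fault.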
Analyzing Results and the Resilience Improvement Cycle
FIS experiment results are used to validate hypotheses and derive improvement actions. Before an experiment, formulate a hypothesis such as "the Auto Scaling group will recover from a 30% EC2 instance shutdown within 5 minutes," then verify it post-experiment using CloudWatch metrics (CPU utilization, request success rate, latency). If the hypothesis is rejected (e.g., recovery took 15 minutes), translate the findings into specific improvement actions such as reviewing Auto Scaling cooldown settings or health check intervals. Integrating FIS experiments into CI/CD pipelines enables automatic resilience verification with every deployment: run the experiment as a CodePipeline stage and proceed to production deployment only if it succeeds. Pricing is usage-based at $0.10 per action-minute; a 10-minute experiment running 3 actions in parallel costs 30 action-minutes × $0.10 = $3.00. Many teams make such experiments a regular practice, running them monthly as Game Days (scheduled failure drills).
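A pipeline gate can be as simple as starting the experiment and polling its state. The sketch below assumes a pre-registered template ID (placeholder shown) and omits the timeout and retry handling a real pipeline step would need.

```python
import sys
import time

import boto3

fis = boto3.client("fis")

# Start an experiment from an existing template (placeholder ID)
experiment = fis.start_experiment(experimentTemplateId="EXTxxxxxxxxxxxxxx")
experiment_id = experiment["experiment"]["id"]

# Poll until the experiment reaches a terminal state
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(30)

# "completed" means the experiment ran to the end without a stop condition
# firing; "stopped" or "failed" blocks the production deployment stage.
sys.exit(0 if status == "completed" else 1)
```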