Chaos Engineering in Practice - Verifying Fault Tolerance with AWS Fault Injection Simulator
Learn about practicing chaos engineering with AWS Fault Injection Simulator (FIS). Covers designing fault injection scenarios, injecting faults into EC2, ECS, and RDS, and conducting safe experiments.
The Need for Chaos Engineering and the Role of FIS
Failures in production environments occur at unpredictable times. Sudden EC2 instance termination, AZ failures, increased network latency, database failovers - chaos engineering, an approach that originated with Netflix's Chaos Monkey, is the practice of verifying in advance that your system behaves correctly in the face of such failures by intentionally injecting faults. AWS Fault Injection Simulator (FIS), released in 2021, is a managed chaos engineering service that executes fault injection against AWS resources in a safe and controlled manner. It eliminates the need to build and operate your own chaos engineering tooling, and through native integration with AWS services it can inject faults into a wide range of resources, including EC2, ECS, EKS, RDS, and ElastiCache, as well as in-guest faults delivered through Systems Manager.
Designing Experiment Templates
FIS experiments are defined using experiment templates. A template consists of three elements: actions (what to do), targets (which resources to affect), and stop conditions (when to abort). Examples of actions include aws:ec2:stop-instances (stopping EC2 instances), aws:ec2:send-spot-instance-interruptions (simulating Spot interruptions), aws:ssm:send-command (CPU/memory stress via SSM), aws:ecs:stop-task (stopping ECS tasks), aws:rds:failover-db-cluster (Aurora failover), and aws:fis:inject-api-internal-error (AWS API error injection).

```json
{
  "description": "EC2 instance stop experiment",
  "targets": {
    "ec2Instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Environment": "dev"},
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {"startInstancesAfterDuration": "PT5M"},
      "targets": {"Instances": "ec2Instances"}
    }
  },
  "stopConditions": [
    {"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:ap-northeast-1:123:alarm:high-error-rate"}
  ],
  "roleArn": "arn:aws:iam::123:role/FISExperimentRole"
}
```

The target's selectionMode controls how target resources are selected. COUNT(1) targets just one resource, while PERCENT(50) targets 50% of resources. Tag-based filtering lets you limit the scope of experiments.
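The same template can be registered programmatically. Here is a minimal sketch using boto3's FIS client; the ARNs are placeholders that must exist in your account, and the API calls are left commented out so the snippet runs without AWS credentials:

```python
# Sketch: registering the experiment template above with boto3.
# The role and alarm ARNs are placeholders for resources in your account.

def build_template_params():
    """Assemble the parameters for fis.create_experiment_template()."""
    return {
        "description": "EC2 instance stop experiment",
        "targets": {
            "ec2Instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Environment": "dev"},
                "selectionMode": "COUNT(1)",  # affect exactly one instance
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # restart the stopped instance automatically after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "ec2Instances"},
            }
        },
        "stopConditions": [
            {
                "source": "aws:cloudwatch:alarm",
                "value": "arn:aws:cloudwatch:ap-northeast-1:123:alarm:high-error-rate",
            }
        ],
        "roleArn": "arn:aws:iam::123:role/FISExperimentRole",
    }

# With credentials configured, the actual calls would look like:
# import boto3
# fis = boto3.client("fis")
# template = fis.create_experiment_template(**build_template_params())
# fis.start_experiment(
#     experimentTemplateId=template["experimentTemplate"]["id"])
```

Keeping the template in code (or CloudFormation/Terraform) makes experiments reviewable and repeatable rather than one-off console actions.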
Conducting Safe Experiments
The most important aspect of chaos engineering is ensuring safety. FIS provides multiple safety mechanisms. Stop conditions monitor CloudWatch alarms and automatically stop experiments when error rates or latency exceed thresholds. This prevents experiments from having excessive impact on production services. IAM roles restrict experiment permissions to the minimum necessary, preventing impact on resources outside the experiment scope. Gradual expansion of experiments is also an important practice. Start with small-scale experiments in a dev environment (stopping one instance), verify the results, and then expand the scope. Next, run experiments in a staging environment under conditions close to production, and finally conduct limited experiments in the production environment. Forming a hypothesis before the experiment (e.g., "Even if one EC2 instance stops, the ALB should automatically redistribute traffic and the error rate should stay below 1%") is also important - when a hypothesis is disproven, it clearly identifies areas for system improvement. For practical knowledge on fault tolerance design, related books on Amazon can also be a useful reference.
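A hypothesis like the one above can be turned into an executable steady-state check. The sketch below uses hard-coded samples standing in for error-rate metrics you would fetch from CloudWatch during the experiment window; `hypothesis_holds` is a hypothetical helper name:

```python
# Executable form of the hypothesis "error rate stays below 1% while one
# instance is stopped". In practice the samples would come from CloudWatch
# (e.g. via get_metric_data); here they are plain numbers.

def hypothesis_holds(error_rates, threshold=0.01):
    """Return True if every observed error-rate sample stays under the threshold."""
    return all(rate < threshold for rate in error_rates)

# Samples collected during the experiment window (fractions, not percent).
during_experiment = [0.002, 0.004, 0.003, 0.006]
print(hypothesis_holds(during_experiment))      # prints True  - hypothesis confirmed
print(hypothesis_holds([0.002, 0.015, 0.003])) # prints False - spike to 1.5% disproves it
```

A disproven hypothesis is a successful experiment: it pinpoints exactly where the system needs hardening.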
Practical Experiment Scenarios
Here are some representative experiment scenarios. For AZ failure simulation, simultaneously stop the EC2 instances in a specific AZ and verify that multi-AZ Auto Scaling recovers capacity correctly. For network latency injection, execute tc (traffic control) commands via SSM to add latency on specific instances; this lets you verify that communication between microservices is handled correctly through timeouts and retries. For RDS failover, trigger an Aurora cluster failover and verify that the application automatically reconnects to the new writer endpoint. For Spot instance interruption simulation, send the 2-minute interruption notice and verify that the application's graceful shutdown works correctly.
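The RDS failover scenario hinges on the application's reconnect behavior. Below is a minimal, database-agnostic sketch of retry-with-exponential-backoff; `fake_connect` simulates a failover and stands in for a real driver call such as psycopg2 or PyMySQL connect:

```python
import time

# Reconnect logic an application needs to ride out an Aurora failover:
# retry with exponential backoff until the cluster writer endpoint points
# at the newly promoted instance. `connect` is any callable that raises
# ConnectionError while the failover is still in progress.

def connect_with_retry(connect, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # failover took longer than our retry budget
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Simulated failover: the first two attempts fail, the third succeeds.
state = {"calls": 0}

def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("writer not yet promoted")
    return "connection"

conn = connect_with_retry(fake_connect, base_delay=0.01)
print(conn, "after", state["calls"], "attempts")  # prints: connection after 3 attempts
```

Running the FIS failover action against a staging cluster is what tells you whether your real driver and connection pool behave like this sketch.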
FIS Pricing
Fault Injection Simulator is billed by action minutes. Each action minute costs approximately $0.10, so running 10 actions for 30 minutes costs approximately $30. Target resource (EC2, ECS, RDS) charges apply as usual. There are no additional charges for creating and storing experiment templates. For production environment experiments, design them with limited scope and short duration to manage both cost and risk.
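The arithmetic in this section fits in a one-line helper. This is a sketch; the $0.10 per action-minute rate is the figure quoted above and should be checked against current AWS pricing:

```python
# Back-of-the-envelope FIS cost model: billing is per action-minute.
RATE_PER_ACTION_MINUTE = 0.10  # USD, the rate quoted in this section

def experiment_cost(num_actions, minutes, rate=RATE_PER_ACTION_MINUTE):
    """Cost in USD of running `num_actions` actions for `minutes` each."""
    return num_actions * minutes * rate

print(f"${experiment_cost(1, 5):.2f}")    # prints $0.50  - a single 5-minute action
print(f"${experiment_cost(10, 30):.2f}")  # prints $30.00 - 10 actions for 30 minutes
```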
Summary - FIS Usage Guidelines
AWS Fault Injection Simulator is a tool that provides chaos engineering as a managed service, enabling safe verification of system fault tolerance. With automatic stopping via stop conditions, permission restrictions via IAM, and gradual experiment expansion, you can practice chaos engineering safely. We recommend starting with EC2 stops and network latency injection in a dev environment, then iterating on a cycle of discovering and improving system weaknesses. By operating under the assumption that failures will inevitably occur and verifying in advance, you can minimize the impact when production failures happen.