Chaos Engineering in Practice - Verifying Fault Tolerance with AWS Fault Injection Simulator
Learn about practicing chaos engineering with AWS Fault Injection Simulator (FIS). Covers designing fault injection scenarios, injecting faults into EC2, ECS, and RDS, and conducting safe experiments.
The Need for Chaos Engineering and the Role of FIS
Failures in production environments occur at unpredictable times. Sudden EC2 instance termination, AZ failures, increased network latency, database failovers - chaos engineering, an approach that originated with Netflix's Chaos Monkey, is the practice of verifying in advance that your system behaves correctly in the face of such failures by intentionally injecting faults. AWS Fault Injection Simulator (FIS), released in 2021, is a managed chaos engineering service that executes fault injection against AWS resources in a safe and controlled manner. It eliminates the need to build and operate your own chaos engineering tooling, and through native integration with AWS services it can inject faults into a wide range of resources, including EC2, ECS, EKS, RDS, and ElastiCache, as well as in-guest faults delivered through Systems Manager.
Designing Experiment Templates
FIS experiments are defined using experiment templates. A template consists of three elements: actions (what to do), targets (which resources to affect), and stop conditions (when to abort). Examples of actions include aws:ec2:stop-instances (stopping EC2 instances), aws:ec2:send-spot-instance-interruptions (simulating Spot interruptions), aws:ssm:send-command (CPU/memory stress via SSM), aws:ecs:stop-task (stopping ECS tasks), aws:rds:failover-db-cluster (Aurora failover), and aws:fis:inject-api-internal-error (AWS API error injection).

```json
{
  "description": "EC2 instance stop experiment",
  "targets": {
    "ec2Instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Environment": "dev"},
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {"startInstancesAfterDuration": "PT5M"},
      "targets": {"Instances": "ec2Instances"}
    }
  },
  "stopConditions": [
    {"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:ap-northeast-1:123:alarm:high-error-rate"}
  ],
  "roleArn": "arn:aws:iam::123:role/FISExperimentRole"
}
```

The target's selectionMode controls how target resources are selected. COUNT(1) targets just one resource, while PERCENT(50) targets 50% of resources. Tag-based filtering lets you limit the scope of experiments.
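The same template can be registered programmatically. Here is a minimal sketch using boto3's FIS client; the ARNs are placeholders that must exist in your account, and the API calls are left commented out so the snippet runs without AWS credentials:

```python
# Sketch: registering the experiment template above with boto3.
# The role and alarm ARNs are placeholders for resources in your account.

def build_template_params():
    """Assemble the parameters for fis.create_experiment_template()."""
    return {
        "description": "EC2 instance stop experiment",
        "targets": {
            "ec2Instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Environment": "dev"},
                "selectionMode": "COUNT(1)",  # affect exactly one instance
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # restart the stopped instance automatically after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "ec2Instances"},
            }
        },
        "stopConditions": [
            {
                "source": "aws:cloudwatch:alarm",
                "value": "arn:aws:cloudwatch:ap-northeast-1:123:alarm:high-error-rate",
            }
        ],
        "roleArn": "arn:aws:iam::123:role/FISExperimentRole",
    }

# With credentials configured, the actual calls would look like:
# import boto3
# fis = boto3.client("fis")
# template = fis.create_experiment_template(**build_template_params())
# fis.start_experiment(
#     experimentTemplateId=template["experimentTemplate"]["id"])
```

Keeping the template in code (or CloudFormation/Terraform) makes experiments reviewable and repeatable rather than one-off console actions.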
Conducting Safe Experiments
The most important aspect of chaos engineering is ensuring safety. FIS provides multiple safety mechanisms. Stop conditions monitor CloudWatch alarms and automatically stop experiments when error rates or latency exceed thresholds. This prevents experiments from having excessive impact on production services. IAM roles restrict experiment permissions to the minimum necessary, preventing impact on resources outside the experiment scope. Gradual expansion of experiments is also an important practice. Start with small-scale experiments in a dev environment (stopping one instance), verify the results, and then expand the scope. Next, run experiments in a staging environment under conditions close to production, and finally conduct limited experiments in the production environment. Forming a hypothesis before the experiment (e.g., "Even if one EC2 instance stops, the ALB should automatically redistribute traffic and the error rate should stay below 1%") is also important - when a hypothesis is disproven, it clearly identifies areas for system improvement. For practical knowledge on fault tolerance design, related books on Amazon can also be a useful reference.
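A hypothesis like the one above can be turned into an executable steady-state check. The sketch below uses hard-coded samples standing in for error-rate metrics you would fetch from CloudWatch during the experiment window; `hypothesis_holds` is a hypothetical helper name:

```python
# Executable form of the hypothesis "error rate stays below 1% while one
# instance is stopped". In practice the samples would come from CloudWatch
# (e.g. via get_metric_data); here they are plain numbers.

def hypothesis_holds(error_rates, threshold=0.01):
    """Return True if every observed error-rate sample stays under the threshold."""
    return all(rate < threshold for rate in error_rates)

# Samples collected during the experiment window (fractions, not percent).
during_experiment = [0.002, 0.004, 0.003, 0.006]
print(hypothesis_holds(during_experiment))      # prints True  - hypothesis confirmed
print(hypothesis_holds([0.002, 0.015, 0.003])) # prints False - spike to 1.5% disproves it
```

A disproven hypothesis is a successful experiment: it pinpoints exactly where the system needs hardening.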
Practical Experiment Scenarios
Here are some representative experiment scenarios. For AZ failure simulation, simultaneously stop the EC2 instances in a specific AZ and verify that multi-AZ Auto Scaling recovers capacity correctly. For network latency injection, execute tc (traffic control) commands via SSM to add latency on specific instances; this lets you verify that communication between microservices is handled correctly through timeouts and retries. For RDS failover, trigger an Aurora cluster failover and verify that the application automatically reconnects to the new writer endpoint. For Spot instance interruption simulation, send the 2-minute interruption notice and verify that the application's graceful shutdown works correctly.
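The RDS failover scenario hinges on the application's reconnect behavior. Below is a minimal, database-agnostic sketch of retry-with-exponential-backoff; `fake_connect` simulates a failover and stands in for a real driver call such as psycopg2 or PyMySQL connect:

```python
import time

# Reconnect logic an application needs to ride out an Aurora failover:
# retry with exponential backoff until the cluster writer endpoint points
# at the newly promoted instance. `connect` is any callable that raises
# ConnectionError while the failover is still in progress.

def connect_with_retry(connect, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # failover took longer than our retry budget
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Simulated failover: the first two attempts fail, the third succeeds.
state = {"calls": 0}

def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("writer not yet promoted")
    return "connection"

conn = connect_with_retry(fake_connect, base_delay=0.01)
print(conn, "after", state["calls"], "attempts")  # prints: connection after 3 attempts
```

Running the FIS failover action against a staging cluster is what tells you whether your real driver and connection pool behave like this sketch.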
FIS Pricing
Fault Injection Simulator is billed by action minutes. Each action minute costs approximately $0.10, so running 10 actions for 30 minutes costs approximately $30. Target resource (EC2, ECS, RDS) charges apply as usual. There are no additional charges for creating and storing experiment templates. For production environment experiments, design them with limited scope and short duration to manage both cost and risk.
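The arithmetic in this section fits in a one-line helper. This is a sketch; the $0.10 per action-minute rate is the figure quoted above and should be checked against current AWS pricing:

```python
# Back-of-the-envelope FIS cost model: billing is per action-minute.
RATE_PER_ACTION_MINUTE = 0.10  # USD, the rate quoted in this section

def experiment_cost(num_actions, minutes, rate=RATE_PER_ACTION_MINUTE):
    """Cost in USD of running `num_actions` actions for `minutes` each."""
    return num_actions * minutes * rate

print(f"${experiment_cost(1, 5):.2f}")    # prints $0.50  - a single 5-minute action
print(f"${experiment_cost(10, 30):.2f}")  # prints $30.00 - 10 actions for 30 minutes
```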
Summary - FIS Usage Guidelines
AWS Fault Injection Simulator is a tool that provides chaos engineering as a managed service, enabling safe verification of system fault tolerance. With automatic stopping via stop conditions, permission restrictions via IAM, and gradual experiment expansion, you can practice chaos engineering safely. We recommend starting with EC2 stops and network latency injection in a dev environment, then iterating on a cycle of discovering and improving system weaknesses. By operating under the assumption that failures will inevitably occur and verifying in advance, you can minimize the impact when production failures happen.