Lessons from AWS Incident Reports (COE) - How Past Major Outages Shaped Design Principles
This article analyzes the root causes of past major incidents, including the S3 outage, the us-east-1 network outage, and the Kinesis outage, drawing on AWS's published Correction of Errors (COE) documents and incident reports, and explains how each of them changed AWS's design principles.
The Cultural Background Behind AWS Publishing Incident Reports
When major outages occur, AWS publishes post-event summaries. The internal process behind them is called Correction of Errors (COE), and the resulting documents provide detailed accounts of the incident timeline, root cause, impact scope, and preventive measures. While many companies limit incident disclosure to the bare minimum, AWS publishes detailed incident reports in line with one of Amazon's Leadership Principles: "Customer Obsession." Concealing the causes of failures may protect a company's image in the short term, but it erodes customer trust in the long run. AWS believes that transparency about failures is essential for building trust, and this stance has influenced incident reporting culture across the industry: the postmortems published by Google's SRE teams and Cloudflare's detailed incident reports were also inspired in part by this kind of transparency.
The 2017 S3 Outage - The Day a Typo Took Down Half the Internet
On February 28, 2017, a major outage lasting approximately four hours hit S3 in the us-east-1 region. The cause was an operator who, while debugging S3's billing system, executed a command that took more servers offline than intended. The operator meant to stop a small number of servers, but a typo in the command shut down a large portion of S3's index subsystem (which manages object metadata) and placement subsystem (which manages physical data placement). S3 is one of the most heavily used services in us-east-1, and the countless services and websites that depend on it were affected in a cascading fashion. Ironically, AWS's Service Health Dashboard itself depended on S3, so even the display of outage status failed to work properly. AWS drew multiple lessons from this incident: first, rate-limit large-scale change operations (cap the number of servers that can be taken offline at once, as sketched below); second, redesign the Service Health Dashboard so that it does not depend on S3; third, optimize the index subsystem's restart procedure to shorten recovery time. These improvements were incorporated into subsequent S3 operations.
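The guard below is a minimal sketch of what such a rate limit on capacity-removal commands might look like. It is not AWS's actual tooling; the thresholds, function name, and fleet sizes are assumptions made for illustration.

```python
# Hypothetical capacity-removal guard in the spirit of the safeguards AWS
# described after the 2017 S3 outage. Thresholds and names are illustrative.

MAX_REMOVAL_FRACTION = 0.05   # never take more than 5% of a fleet offline at once
MIN_REMAINING_SERVERS = 100   # never shrink the fleet below a safe floor


def validate_removal(fleet_size: int, requested: int) -> int:
    """Return the number of servers that may be removed, or raise if unsafe."""
    if requested > fleet_size * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {requested} of {fleet_size} servers: "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} per-operation limit."
        )
    if fleet_size - requested < MIN_REMAINING_SERVERS:
        raise ValueError("Refusing removal: fleet would fall below its minimum size.")
    return requested


if __name__ == "__main__":
    print(validate_removal(fleet_size=10_000, requested=5))    # intended small change: OK
    try:
        validate_removal(fleet_size=10_000, requested=900)     # oversized request: rejected
    except ValueError as err:
        print(err)
```

A guard like this forces an oversized request, whether intentional or mistyped, to be split into smaller, slower steps instead of taking out a whole subsystem in one command.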
The 2020 Kinesis Outage - How Adding Frontend Servers Triggered a Cascading Failure
On November 25, 2020, a Kinesis Data Streams outage occurred in us-east-1 and cascaded to numerous services, including CloudWatch, Lambda, Cognito, and API Gateway. The trigger was the addition of Kinesis frontend servers. Each frontend server was designed to maintain a thread for every other frontend server, so the number of threads per server grows linearly with the fleet size, and the total across the fleet grows quadratically. On this day, when frontend servers were added during routine capacity expansion, the per-server thread count exceeded the operating system's thread limit, and the frontend servers could no longer process incoming requests correctly. The outage spread widely because CloudWatch internally used Kinesis: when CloudWatch stopped functioning, metrics collection and alarms for other services ceased, causing secondary damage by delaying failure detection and response. The lessons from this incident were the importance of managing inter-service dependencies and of designing so that control plane failures do not cascade to the data plane. AWS subsequently changed Kinesis's frontend architecture from a thread-per-peer model to an event-driven model; the arithmetic sketched below shows why the original design could not keep scaling.
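The following back-of-the-envelope sketch illustrates that scaling behavior. The operating system thread limit and fleet sizes are assumptions chosen for the example, not values from AWS's report.

```python
# Thread growth in a full-mesh frontend design: one thread per peer server.
# The per-process limit and fleet sizes below are illustrative assumptions.

OS_THREAD_LIMIT_PER_PROCESS = 4096


def mesh_threads(fleet_size: int) -> tuple[int, int]:
    """Threads per server (linear in fleet size) and fleet-wide total (quadratic)."""
    per_server = fleet_size - 1
    total = fleet_size * per_server
    return per_server, total


for n in (1_000, 2_000, 4_000, 4_500):
    per_server, total = mesh_threads(n)
    status = "OVER LIMIT" if per_server > OS_THREAD_LIMIT_PER_PROCESS else "ok"
    print(f"fleet={n:5d}  threads/server={per_server:5d} ({status})  total={total:,}")
```

Because every server in the fleet crosses the per-server limit at the same fleet size, the failure appears all at once across the fleet rather than as a gradual degradation.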
The 2021 us-east-1 Network Outage - What Internal Network Congestion Taught Us
On December 7, 2021, a network outage lasting approximately five hours occurred in us-east-1. A latent bug in AWS's internal network device auto-scaling system caused network devices to become overloaded when more traffic than usual flowed into the internal network. A distinctive feature of this outage was that the AWS Console itself became difficult to access. Since the AWS Management Console is hosted in us-east-1, resource management and outage status checking via the console became impossible. CLI and SDK API calls also experienced delays and timeouts due to internal network congestion. This incident reaffirmed the importance of the design principle of "separating control plane and data plane." The data plane (actual data reads and writes) requires higher availability than the control plane (resource creation, modification, and deletion). After this outage, AWS strengthened internal network isolation to prevent control plane failures from cascading to the data plane. Lessons for users include the importance of multi-region design and the need for operational procedures that don't depend on the AWS Console (pre-prepared CLI scripts).
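As one example of such a procedure, the sketch below uses boto3 to check and scale a standby fleet through a regional API endpoint outside us-east-1, with no dependency on the Console. The region, Auto Scaling group name, and capacity are hypothetical.

```python
# Minimal console-independent runbook sketch: operate a standby fleet in another
# region via its regional API endpoint. Resource names and numbers are hypothetical.
import boto3

STANDBY_REGION = "us-west-2"
STANDBY_ASG = "app-standby"  # hypothetical Auto Scaling group prepared in advance


def standby_health() -> int:
    """Count instances reporting status in the standby region, bypassing the Console."""
    ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
    resp = ec2.describe_instance_status(IncludeAllInstances=False)
    return len(resp["InstanceStatuses"])


def scale_out_standby(desired: int) -> None:
    """Raise the standby Auto Scaling group's desired capacity via the regional API."""
    asg = boto3.client("autoscaling", region_name=STANDBY_REGION)
    asg.set_desired_capacity(
        AutoScalingGroupName=STANDBY_ASG,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )


if __name__ == "__main__":
    print(f"standby instances reporting status: {standby_health()}")
    scale_out_standby(desired=10)
```

Keeping a script like this rehearsed and stored outside us-east-1 means the recovery path does not depend on the very region that is failing.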
Design Principles Derived from Outages
The design principles AWS has systematized from past outages are reflected in the Reliability Pillar of the AWS Well-Architected Framework. First is "Blast Radius Minimization." To limit the scope of a failure's impact, services are divided into independent units called cells, designed so that a failure in one cell does not cascade to others; DynamoDB, for example, spreads each table across many independent partitions, so a failure confined to one partition does not take down the entire table. Second is "Static Stability." This means designing systems to continue their existing operation even when a dependency fails; for example, even if Auto Scaling fails, already-running instances keep serving traffic. Third is "Shuffle Sharding." By assigning each customer a randomly chosen combination of shards, the probability that two customers share the exact same shard set, and therefore that one customer's abnormal traffic affects other customers, is kept very small (a toy sketch follows below). These principles were distilled from AWS's own outage experience and represent universal insights applicable when users design their own systems. To learn distributed systems design principles systematically, specialized books can be helpful.
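The toy sketch below shows the core idea of shuffle sharding; the shard counts are made up for the example. With 8 shards taken 2 at a time there are 28 distinct combinations, so two customers rarely end up with the same full shard set.

```python
# Toy shuffle-sharding illustration; shard counts are made up for the example.
import random
from math import comb

TOTAL_SHARDS = 8
SHARDS_PER_CUSTOMER = 2


def assign_shards(customer_id: str) -> frozenset[int]:
    """Deterministically pick this customer's random combination of shards."""
    rng = random.Random(customer_id)  # seed by customer ID so the mapping is stable
    return frozenset(rng.sample(range(TOTAL_SHARDS), SHARDS_PER_CUSTOMER))


# C(8, 2) = 28 distinct combinations, so the chance that two customers collide
# on their entire shard set is 1/28; with larger fleets it becomes vanishingly small.
print("distinct combinations:", comb(TOTAL_SHARDS, SHARDS_PER_CUSTOMER))
print("customer A shards:", sorted(assign_shards("customer-a")))
print("customer B shards:", sorted(assign_shards("customer-b")))
```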