Lessons from AWS Incident Reports (COE) - How Past Major Outages Shaped Design Principles
This article analyzes the root causes of past major incidents, including the S3 outage, the us-east-1 network outage, and the Kinesis outage, drawing on AWS's published Correction of Errors (COE) documents and incident reports, and explains how each of them changed AWS's design principles.
The Cultural Background Behind AWS Publishing Incident Reports
When major outages occur, AWS publishes post-event summaries. The internal process behind them is called Correction of Errors (COE), and the resulting documents provide detailed accounts of the incident timeline, root cause, impact scope, and preventive measures. While many companies limit incident disclosure to the bare minimum, AWS publishes detailed incident reports in line with one of Amazon's Leadership Principles: "Customer Obsession." Concealing the causes of failures may protect a company's image in the short term, but it erodes customer trust in the long run. AWS believes that transparency about failures is essential for building trust, and this stance has influenced incident reporting culture across the industry: the postmortems published by Google's SRE teams and Cloudflare's detailed incident reports were also inspired in part by this kind of transparency.
The 2017 S3 Outage - The Day a Typo Took Down Half the Internet
On February 28, 2017, a major outage lasting approximately four hours hit S3 in the us-east-1 region. The cause was an operator who, while debugging S3's billing system, executed a command that took more servers offline than intended. The operator meant to stop a small number of servers, but a typo in the command shut down a large portion of S3's index subsystem (which manages object metadata) and placement subsystem (which manages physical data placement). S3 is one of the most heavily used services in us-east-1, and the countless services and websites that depend on it were affected in a cascading fashion. Ironically, AWS's Service Health Dashboard itself depended on S3, so even the display of outage status failed to work properly. AWS drew multiple lessons from this incident: first, rate-limit large-scale change operations (cap the number of servers that can be taken offline at once, as sketched below); second, redesign the Service Health Dashboard so that it does not depend on S3; third, optimize the index subsystem's restart procedure to shorten recovery time. These improvements were incorporated into subsequent S3 operations.
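The guard below is a minimal sketch of what such a rate limit on capacity-removal commands might look like. It is not AWS's actual tooling; the thresholds, function name, and fleet sizes are assumptions made for illustration.

```python
# Hypothetical capacity-removal guard in the spirit of the safeguards AWS
# described after the 2017 S3 outage. Thresholds and names are illustrative.

MAX_REMOVAL_FRACTION = 0.05   # never take more than 5% of a fleet offline at once
MIN_REMAINING_SERVERS = 100   # never shrink the fleet below a safe floor


def validate_removal(fleet_size: int, requested: int) -> int:
    """Return the number of servers that may be removed, or raise if unsafe."""
    if requested > fleet_size * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {requested} of {fleet_size} servers: "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} per-operation limit."
        )
    if fleet_size - requested < MIN_REMAINING_SERVERS:
        raise ValueError("Refusing removal: fleet would fall below its minimum size.")
    return requested


if __name__ == "__main__":
    print(validate_removal(fleet_size=10_000, requested=5))    # intended small change: OK
    try:
        validate_removal(fleet_size=10_000, requested=900)     # oversized request: rejected
    except ValueError as err:
        print(err)
```

A guard like this forces an oversized request, whether intentional or mistyped, to be split into smaller, slower steps instead of taking out a whole subsystem in one command.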
The 2020 Kinesis Outage - How Adding Frontend Servers Triggered a Cascading Failure
On November 25, 2020, a Kinesis Data Streams outage occurred in us-east-1 and cascaded to numerous services, including CloudWatch, Lambda, Cognito, and API Gateway. The trigger was the addition of Kinesis frontend servers. Each frontend server was designed to maintain a thread for every other frontend server, so the number of threads per server grows linearly with the fleet size, and the total across the fleet grows quadratically. On this day, when frontend servers were added during routine capacity expansion, the per-server thread count exceeded the operating system's thread limit, and the frontend servers could no longer process incoming requests correctly. The outage spread widely because CloudWatch internally used Kinesis: when CloudWatch stopped functioning, metrics collection and alarms for other services ceased, causing secondary damage by delaying failure detection and response. The lessons from this incident were the importance of managing inter-service dependencies and of designing so that control plane failures do not cascade to the data plane. AWS subsequently changed Kinesis's frontend architecture from a thread-per-peer model to an event-driven model; the arithmetic sketched below shows why the original design could not keep scaling.
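The following back-of-the-envelope sketch illustrates that scaling behavior. The operating system thread limit and fleet sizes are assumptions chosen for the example, not values from AWS's report.

```python
# Thread growth in a full-mesh frontend design: one thread per peer server.
# The per-process limit and fleet sizes below are illustrative assumptions.

OS_THREAD_LIMIT_PER_PROCESS = 4096


def mesh_threads(fleet_size: int) -> tuple[int, int]:
    """Threads per server (linear in fleet size) and fleet-wide total (quadratic)."""
    per_server = fleet_size - 1
    total = fleet_size * per_server
    return per_server, total


for n in (1_000, 2_000, 4_000, 4_500):
    per_server, total = mesh_threads(n)
    status = "OVER LIMIT" if per_server > OS_THREAD_LIMIT_PER_PROCESS else "ok"
    print(f"fleet={n:5d}  threads/server={per_server:5d} ({status})  total={total:,}")
```

Because every server in the fleet crosses the per-server limit at the same fleet size, the failure appears all at once across the fleet rather than as a gradual degradation.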
The 2021 us-east-1 Network Outage - What Internal Network Congestion Taught Us
On December 7, 2021, a network outage lasting approximately five hours occurred in us-east-1. A latent bug in AWS's internal network device auto-scaling system caused network devices to become overloaded when more traffic than usual flowed into the internal network. A distinctive feature of this outage was that the AWS Console itself became difficult to access. Since the AWS Management Console is hosted in us-east-1, resource management and outage status checking via the console became impossible. CLI and SDK API calls also experienced delays and timeouts due to internal network congestion. This incident reaffirmed the importance of the design principle of "separating control plane and data plane." The data plane (actual data reads and writes) requires higher availability than the control plane (resource creation, modification, and deletion). After this outage, AWS strengthened internal network isolation to prevent control plane failures from cascading to the data plane. Lessons for users include the importance of multi-region design and the need for operational procedures that don't depend on the AWS Console (pre-prepared CLI scripts).
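As one example of such a procedure, the sketch below uses boto3 to check and scale a standby fleet through a regional API endpoint outside us-east-1, with no dependency on the Console. The region, Auto Scaling group name, and capacity are hypothetical.

```python
# Minimal console-independent runbook sketch: operate a standby fleet in another
# region via its regional API endpoint. Resource names and numbers are hypothetical.
import boto3

STANDBY_REGION = "us-west-2"
STANDBY_ASG = "app-standby"  # hypothetical Auto Scaling group prepared in advance


def standby_health() -> int:
    """Count instances reporting status in the standby region, bypassing the Console."""
    ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
    resp = ec2.describe_instance_status(IncludeAllInstances=False)
    return len(resp["InstanceStatuses"])


def scale_out_standby(desired: int) -> None:
    """Raise the standby Auto Scaling group's desired capacity via the regional API."""
    asg = boto3.client("autoscaling", region_name=STANDBY_REGION)
    asg.set_desired_capacity(
        AutoScalingGroupName=STANDBY_ASG,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )


if __name__ == "__main__":
    print(f"standby instances reporting status: {standby_health()}")
    scale_out_standby(desired=10)
```

Keeping a script like this rehearsed and stored outside us-east-1 means the recovery path does not depend on the very region that is failing.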
Design Principles Derived from Outages
The design principles AWS has systematized from past outages are reflected in the Reliability Pillar of the AWS Well-Architected Framework. First is "Blast Radius Minimization." To limit the scope of a failure's impact, services are divided into independent units called cells, designed so that a failure in one cell does not cascade to others; DynamoDB, for example, spreads each table across many independent partitions, so a failure confined to one partition does not take down the entire table. Second is "Static Stability." This means designing systems to continue their existing operation even when a dependency fails; for example, even if Auto Scaling fails, already-running instances keep serving traffic. Third is "Shuffle Sharding." By assigning each customer a randomly chosen combination of shards, the probability that two customers share the exact same shard set, and therefore that one customer's abnormal traffic affects other customers, is kept very small (a toy sketch follows below). These principles were distilled from AWS's own outage experience and represent universal insights applicable when users design their own systems. To learn distributed systems design principles systematically, specialized books can be helpful.
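The toy sketch below shows the core idea of shuffle sharding; the shard counts are made up for the example. With 8 shards taken 2 at a time there are 28 distinct combinations, so two customers rarely end up with the same full shard set.

```python
# Toy shuffle-sharding illustration; shard counts are made up for the example.
import random
from math import comb

TOTAL_SHARDS = 8
SHARDS_PER_CUSTOMER = 2


def assign_shards(customer_id: str) -> frozenset[int]:
    """Deterministically pick this customer's random combination of shards."""
    rng = random.Random(customer_id)  # seed by customer ID so the mapping is stable
    return frozenset(rng.sample(range(TOTAL_SHARDS), SHARDS_PER_CUSTOMER))


# C(8, 2) = 28 distinct combinations, so the chance that two customers collide
# on their entire shard set is 1/28; with larger fleets it becomes vanishingly small.
print("distinct combinations:", comb(TOTAL_SHARDS, SHARDS_PER_CUSTOMER))
print("customer A shards:", sorted(assign_shards("customer-a")))
print("customer B shards:", sorted(assign_shards("customer-b")))
```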