Distributed Systems Principles Learned from AWS Outages - How Past Major Incidents Reshaped Architecture

Using AWS's published incident reports as case studies - including the S3 outage (2017), Kinesis outage (2020), and the unique nature of us-east-1 - this article explains design principles such as Shuffle Sharding, Static Stability, and Cell-based Architecture.

Incident Reports as Textbooks

When major outages occur, AWS publishes detailed incident reports called Post-Event Summaries. These are not mere apologies but valuable technical documents describing the root cause, the chain of impact, the recovery process, and preventive measures. Distributed systems design principles tend to be discussed in the abstract, but linking them to actual incident cases makes it concretely clear why each principle matters. Here we examine representative incidents published by AWS and explain the design principles derived from them.

The S3 Outage (February 2017) - Cascading Failures and Recovery Difficulty

On February 28, 2017, S3 went down for approximately four hours in the us-east-1 region. According to AWS's official report, a command executed while debugging a billing system issue removed more servers than intended, including servers supporting S3's index subsystem and placement subsystem. The problem cascaded from there: these subsystems were essential for S3 read and write operations, and the remaining servers lacked sufficient processing capacity. Even more critically, these subsystems had not undergone a complete restart in years. Restarting required metadata integrity checks, and the time they took had grown far beyond expectations as data volumes grew.

This outage left two lessons. First, the danger of dependency chains (cascading failures): numerous services and websites that depended on S3 went down in a chain reaction, and even AWS's own status dashboard depended on S3, creating the ironic situation where AWS could not communicate the outage status. Second, the restart cost of large-scale systems: a system designed never to stop carries the risk of an unexpectedly long restart when it finally does.

The Kinesis Outage (November 2020) - When Capacity Expansion Caused the Outage

On November 25, 2020, Kinesis experienced an outage lasting approximately 21 hours in us-east-1. The root cause revealed in AWS's official report was a capacity expansion itself. Kinesis's front-end server fleet used a mesh structure in which each server maintained one OS thread per peer for inter-server communication. When new servers were added to the front-end fleet, the thread count on every front-end server exceeded an OS configuration limit. The irony of this outage is that a capacity addition intended to improve the service triggered the failure. Furthermore, the Kinesis outage cascaded to numerous dependent services, including CloudWatch, Cognito, and Lambda. In response, AWS moved the front end to larger instances with more CPU and memory, reducing the server count to fundamentally resolve the thread limit issue, and redesigned how the front-end servers cache backend information so that changes elsewhere in the system have less impact on the front end.
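The failure mode reduces to simple arithmetic. The sketch below uses hypothetical numbers (not AWS's actual fleet sizes or OS limits) to show how, in a full mesh, per-server thread usage grows with the size of the fleet itself, so a capacity addition can push every server past the limit at the same time:

```python
# Illustrative only: the fleet sizes and limits below are hypothetical,
# not AWS's actual values. In a full mesh, each server keeps one
# communication thread per peer, so per-server thread usage grows
# linearly with the size of the fleet.

OS_THREAD_LIMIT = 10_000   # hypothetical per-process OS thread ceiling
BASE_THREADS = 200         # hypothetical threads for request handling, timers, etc.

def threads_per_server(fleet_size: int) -> int:
    """One thread per peer (fleet_size - 1), plus baseline worker threads."""
    return BASE_THREADS + (fleet_size - 1)

for fleet_size in (5_000, 9_000, 9_800, 10_000):
    used = threads_per_server(fleet_size)
    status = "ok" if used <= OS_THREAD_LIMIT else "EXCEEDS LIMIT"
    print(f"fleet={fleet_size:>6}  threads/server={used:>6}  {status}")
```

Note that the limit is crossed by every server simultaneously: the whole fleet degrades at once, which is exactly what makes this failure mode so severe.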

The Unique Nature of us-east-1 - Why This Region Has More Outages

If you follow AWS incident reports, you will notice that us-east-1 (Northern Virginia) appears repeatedly. This is no coincidence. us-east-1 is AWS's oldest region, and the control planes for global services such as IAM, Route 53, and CloudFront are concentrated there. Global services are those with worldwide common endpoints not tied to a specific region. Because their control planes (the management components handling configuration changes and authentication) reside in us-east-1, outages in this region can propagate to other regions. In the December 2021 outage, a network device issue within us-east-1 affected the control planes of multiple services in the region. Although the data planes (the components performing actual data processing) continued operating normally, creating new resources or making configuration changes was impossible for an extended period. This case demonstrates the importance of separating control planes from data planes so that data planes continue operating independently even when control planes fail.
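As a minimal sketch of that separation (all names here are hypothetical, not an AWS API), a data plane can keep serving requests from the last configuration it successfully fetched, so a control plane outage blocks changes but not traffic:

```python
# Hypothetical sketch of control plane / data plane separation.
# The data plane caches the last-known-good configuration locally and
# keeps serving with it when the control plane becomes unreachable.

class ControlPlane:
    """Handles configuration changes; may be unavailable during an outage."""
    def __init__(self):
        self.available = True
        self._config = {"routing_version": 1}

    def fetch_config(self) -> dict:
        if not self.available:
            raise ConnectionError("control plane unreachable")
        return dict(self._config)

class DataPlane:
    """Serves requests using the last config it successfully fetched."""
    def __init__(self, control_plane: ControlPlane):
        self.control_plane = control_plane
        self.cached_config = control_plane.fetch_config()  # last known good

    def refresh_config(self) -> None:
        try:
            self.cached_config = self.control_plane.fetch_config()
        except ConnectionError:
            pass  # refresh failed: keep serving with the cached config

    def handle_request(self, payload: str) -> str:
        version = self.cached_config["routing_version"]
        return f"processed {payload!r} with config v{version}"

control = ControlPlane()
data = DataPlane(control)
control.available = False        # simulate a control plane outage
data.refresh_config()            # the refresh fails silently...
print(data.handle_request("x"))  # ...but requests still succeed
```

The design choice to note: the data plane never blocks on the control plane in its request path, which is precisely the property the December 2021 outage showed to be valuable.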

Design Principles Born from Outages

AWS has systematized design principles from these outage experiences through the Builders' Library and its whitepapers.

Shuffle Sharding assigns each customer to a randomly selected combination of servers (a shard). With traditional fixed sharding, a single shard failure affects every customer on that shard; with Shuffle Sharding, each customer has a different shard combination, so the fraction of customers who lose all of their servers at once drops dramatically in probabilistic terms. Route 53's nameserver assignment uses this technique.

Static Stability is a design in which the system keeps operating in its current state even when a dependency fails. For example, if an Auto Scaling group pre-places sufficient instances in each Availability Zone, the existing instances can continue processing even when scaling decisions (control plane operations) cannot be made during an AZ failure.

Cell-based Architecture divides a system into multiple independent cells (replicas), each operating autonomously, so that a failure in one cell does not affect the others. This is a design pattern AWS promoted after the S3 outage: it physically limits the blast radius of failures.
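To make the blast-radius math concrete, here is a minimal shuffle sharding sketch; the fleet size, shard size, and the seeded-RNG assignment are all hypothetical stand-ins for whatever placement scheme a real service would use:

```python
# Hypothetical shuffle sharding sketch (not Route 53's actual scheme).
# Each customer is assigned a pseudo-random combination of SHARD_SIZE
# servers out of the fleet; a bad workload only takes out its own
# combination, so few other customers lose all of their servers.
import random

FLEET = list(range(16))   # 16 servers (hypothetical)
SHARD_SIZE = 4            # each customer is served by 4 of them

def shard_for(customer_id: str) -> frozenset:
    rng = random.Random(customer_id)  # deterministic per customer
    return frozenset(rng.sample(FLEET, SHARD_SIZE))

customers = [f"customer-{i}" for i in range(1000)]
shards = {c: shard_for(c) for c in customers}

# customer-0's poisonous workload takes down its entire shard
failed = shards["customer-0"]
fully_down = [c for c, s in shards.items() if s <= failed]
degraded = [c for c, s in shards.items() if s & failed and not s <= failed]

print(f"servers failed: {sorted(failed)}")
print(f"customers with no healthy server: {len(fully_down)} / {len(customers)}")
print(f"customers degraded but still served: {len(degraded)}")
```

With 16 servers taken 4 at a time there are C(16,4) = 1820 possible combinations, so roughly 1 in 1820 other customers shares the failed combination exactly; customers who overlap on only some of the failed servers can still be served by the healthy ones, assuming their clients retry.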

Applying These Principles to Your Own Systems

These principles apply not only to AWS-scale systems but also to everyday application design. First, map your dependencies explicitly: identify in advance which services your system depends on and what happens when each of them goes down. Next, be conscious of separating control planes from data planes, aiming for designs where existing data processing continues even when configuration changes and management operations are unavailable. Introducing the Circuit Breaker pattern, which detects a failing dependency and immediately switches to fallback processing, is also effective; a sketch follows below. Multi-AZ configuration should be adopted as the minimum level of fault tolerance, with multi-region configurations considered based on business requirements. However, multi-region involves trade-offs in data consistency and latency, so it should not be applied to every workload. Designing on the assumption that failures will happen, and minimizing the blast radius when they do, is the essence of distributed systems design.
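As an illustration of that pattern, here is a minimal circuit breaker sketch; the thresholds and names are hypothetical, and a production system would usually reach for an existing resilience library rather than hand-rolling one:

```python
# Hypothetical circuit breaker sketch: after repeated failures the breaker
# "opens" and routes calls straight to a fallback, retrying the dependency
# only once the reset timeout has elapsed (the "half-open" state).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # open: skip the dependency entirely
            self.opened_at = None          # half-open: allow one trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0                  # success resets the failure count
        return result

breaker = CircuitBreaker()

def flaky_dependency():
    raise TimeoutError("dependency down")

def cached_fallback():
    return "served from local cache"

for _ in range(5):
    print(breaker.call(flaky_dependency, cached_fallback))  # falls back every time
```

After the third failure the breaker opens, so the remaining calls never touch the dependency at all; this is what keeps a failing dependency from consuming threads, connections, and latency budget in the caller.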

Summary

AWS's major outages are the best learning material for distributed systems design principles. The S3 outage taught us about the dangers of cascading failures and restart costs, the Kinesis outage about unexpected side effects of capacity changes, and the recurring us-east-1 outages about the concentration risk of global services. Design principles born from these experiences - Shuffle Sharding, Static Stability, and Cell-based Architecture - are universal insights applicable to any distributed system design, not just within AWS.