AWS Fault Domain Design - How the Three-Layer Structure of AZs, Regions, and Partitions Protects Availability

Learn why AWS infrastructure is designed with three layers of fault domains - AZs (fault isolation), Regions (geographic separation), and Partitions (political separation) - and how far failures propagate at each layer, with real-world examples.

What Is a Fault Domain?

A fault domain is the scope of impact caused by a single failure. How many servers go down if one power cable is cut? How many servers lose connectivity if one network switch fails? How many servers are affected if one data center loses power? These "blast radii" define fault domains.

AWS infrastructure is designed as a hierarchy of three fault domains. The smallest is the AZ (Availability Zone), which isolates failures within a single cluster of data centers. The middle layer is the Region, which isolates failures across geographically separated locations. The largest is the Partition (aws, aws-cn, aws-us-gov), which separates politically and legally independent infrastructure.

This three-layer design ensures that a failure in one AZ does not cascade to the entire Region, a failure in one Region does not cascade to other Regions, and a failure in one Partition does not cascade to other Partitions.
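The containment rule can be sketched as a small model (hypothetical types and names, not an AWS API): a failure scoped to a partition, Region, or AZ affects only resources inside that scope, and everything outside it survives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultDomain:
    """A scope in the partition > region > AZ hierarchy (illustrative model)."""
    partition: str
    region: Optional[str] = None   # None = the whole partition is in scope
    az_id: Optional[str] = None    # None = the whole region is in scope


def blast_radius_contained(failure: FaultDomain, resource: FaultDomain) -> bool:
    """Return True if `resource` lies outside the failure's blast radius."""
    if failure.partition != resource.partition:
        return True   # partitions never share a fault domain
    if failure.region is not None and failure.region != resource.region:
        return True   # a regional failure does not cross Region boundaries
    if failure.az_id is not None and failure.az_id != resource.az_id:
        return True   # an AZ failure does not cross AZ boundaries
    return False      # resource is inside the failing scope


# An AZ-scoped failure takes down use1-az1 but not its sibling AZ.
az_failure = FaultDomain("aws", "us-east-1", "use1-az1")
print(blast_radius_contained(az_failure, FaultDomain("aws", "us-east-1", "use1-az2")))
```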

AZ Fault Isolation - Independent Power, Cooling, and Networking

An AZ is the smallest unit of fault isolation in AWS. Each AZ consists of one or more data centers with fully independent power systems, cooling systems, and network connections. AZs within the same Region are connected by dedicated high-bandwidth, low-latency networks, but are physically separated by a meaningful distance (typically kilometers to tens of kilometers), so that a localized event such as a fire or flood cannot take out more than one AZ.

There is a real-world example of AZ fault isolation working as designed. In 2019, a power failure in one AZ of us-east-1 affected EC2 instances and EBS volumes within that AZ, while the other AZs in the same Region continued operating normally. Services running in a multi-AZ configuration saw no end-user impact: instances in the unaffected AZs kept handling traffic even after the affected AZ's instances went down.

An important detail to note is that AZ names (such as us-east-1a) are mapped differently per account: Account A's us-east-1a and Account B's us-east-1a may refer to different physical AZs. AZ IDs (such as use1-az1) are consistent across accounts, so use AZ IDs when specifying AZs in cross-account scenarios.
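As a sketch of the name-vs-ID distinction, the following builds a ZoneName-to-ZoneId map from data shaped like the EC2 DescribeAvailabilityZones response. The sample values below are illustrative; in a real script you would call `boto3.client("ec2").describe_availability_zones()` and the mapping would differ per account.

```python
# Illustrative response in the shape returned by DescribeAvailabilityZones.
# The name-to-ID mapping shown here is made up; each account sees its own.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4", "RegionName": "us-east-1"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6", "RegionName": "us-east-1"},
    ]
}


def zone_name_to_id(response: dict) -> dict:
    """Map account-specific AZ names to account-independent AZ IDs."""
    return {az["ZoneName"]: az["ZoneId"] for az in response["AvailabilityZones"]}


# When coordinating across accounts, compare ZoneIds, never ZoneNames.
print(zone_name_to_id(sample_response))
```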

Regional Geographic Separation - Preparing for Natural Disasters and Large-Scale Outages

A Region is an independent infrastructure deployment in a geographically separate location. Each Region has its own control plane (the system that manages resource creation, modification, and deletion) and operates independently of other Regions' control planes. This design ensures that a control plane failure in one Region does not affect resource management in other Regions.

However, some services have global control planes. IAM, Route 53, and CloudFront are global services with control planes concentrated in us-east-1. During the 2021 us-east-1 network disruption, the IAM control plane was affected, preventing new IAM role creation and policy changes in other Regions as well. However, the IAM data plane (authentication and authorization processing) is cached in each Region, so access using existing credentials continued to work. This design principle - where the control plane may stop but the data plane keeps running - is known as Static Stability.
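The static stability pattern can be sketched as a cache-backed fallback (a hypothetical class, not an actual AWS implementation): answer from the control plane while it is reachable, and from recently cached results when it is not.

```python
import time


class CachedCredentialVerifier:
    """Sketch of static stability: the data plane keeps answering from a
    local cache even when the control plane is unreachable.
    (Hypothetical names and logic, not an AWS API.)"""

    def __init__(self, control_plane_lookup, ttl_seconds=3600):
        self._lookup = control_plane_lookup   # may raise during an outage
        self._cache = {}                      # access_key -> (decision, fetched_at)
        self._ttl = ttl_seconds

    def authorize(self, access_key):
        try:
            decision = self._lookup(access_key)
            self._cache[access_key] = (decision, time.time())
            return decision
        except ConnectionError:
            cached = self._cache.get(access_key)
            if cached and time.time() - cached[1] < self._ttl:
                return cached[0]   # serve the stale-but-valid cached answer
            raise                  # no usable cache entry: fail


# Simulate a control plane that works once, then goes down.
calls = {"n": 0}

def flaky_lookup(access_key):
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("control plane unreachable")
    return "allow"


verifier = CachedCredentialVerifier(flaky_lookup)
print(verifier.authorize("AKIAEXAMPLE"))  # control plane reachable
print(verifier.authorize("AKIAEXAMPLE"))  # served from cache during the outage
```

Existing credentials keep working during the outage; only operations that require the control plane (creating roles, changing policies) fail, mirroring the IAM behavior described above.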

Partition-Level Political Separation

AWS Partitions are completely separated infrastructure environments, created for political and legal reasons. Three partitions are publicly documented: the commercial partition (aws), the China partition (aws-cn), and the GovCloud partition (aws-us-gov). Each partition has independent IAM, independent billing systems, and independent support structures. Resource sharing and data transfer between partitions is not possible by design.

The China partition exists because Chinese law requires cloud services to be provided by locally licensed operators. China Regions are operated by Chinese partner companies and are completely independent from AWS's global infrastructure. AWS accounts for China Regions must be created separately from commercial partition accounts.

The GovCloud partition is designed to handle sensitive, regulated U.S. government data such as Controlled Unclassified Information. It meets strict compliance requirements including FedRAMP High, ITAR (International Traffic in Arms Regulations), and CJIS (Criminal Justice Information Services), and access is restricted to vetted U.S. persons.
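The partition is visible in every resource identifier: the documented ARN format is arn:partition:service:region:account-id:resource. A small parser makes the separation concrete:

```python
def arn_partition(arn: str) -> str:
    """Extract the partition segment from an ARN.

    ARN format: arn:partition:service:region:account-id:resource
    (region and account-id may be empty for some services, e.g. S3).
    """
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[1]


# The same kind of resource lives under a different partition prefix
# in each environment (example account ID and names are illustrative).
print(arn_partition("arn:aws:iam::123456789012:role/Example"))   # commercial
print(arn_partition("arn:aws-cn:s3:::my-bucket"))                # China
print(arn_partition("arn:aws-us-gov:s3:::my-bucket"))            # GovCloud
```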

Architecture Design with Fault Domains in Mind

With an understanding of the three-layer fault domain structure, you can choose the right design based on your workload's availability requirements.

A single-AZ configuration is suitable for development and test environments or batch processing that tolerates downtime. Cost is minimal, but the service goes down during an AZ failure.

A multi-AZ configuration is the standard for production environments. Deploying an ALB with Auto Scaling Groups across multiple AZs and using multi-AZ RDS provides resilience against single-AZ failures. This configuration is sufficient for most workloads.

A multi-Region configuration is adopted when you need to withstand a Region-wide failure. Route 53 failover routing automatically switches to a secondary Region when the primary Region experiences an outage, and DynamoDB Global Tables and Aurora Global Database automate cross-Region data replication. However, multi-Region configurations significantly increase cost and operational complexity, so make sure to define clear business requirements (RPO/RTO) before adopting this approach.

To systematically learn availability design patterns, specialized books on Amazon are a helpful reference.
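The tiering decision above can be summarized as a rule of thumb (the function name and inputs are illustrative, not AWS guidance; real decisions also weigh RPO/RTO targets and budget):

```python
def suggest_topology(tolerates_downtime: bool,
                     must_survive_region_outage: bool) -> str:
    """Illustrative mapping from availability requirements to a deployment tier."""
    if tolerates_downtime:
        # Dev/test or interruptible batch work: cheapest option.
        return "single-AZ"
    if must_survive_region_outage:
        # Accept the extra cost and complexity only when required.
        return "multi-Region with Route 53 failover"
    # The production default: survives a single-AZ failure.
    return "multi-AZ behind an ALB"


print(suggest_topology(tolerates_downtime=False, must_survive_region_outage=False))
```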