AWS Fault Domain Design - How the Three-Layer Structure of AZs, Regions, and Partitions Protects Availability

Learn why AWS infrastructure is designed with three layers of fault domains - AZs (fault isolation), Regions (geographic separation), and Partitions (political separation) - and how far failures propagate at each layer, with real-world examples.

What Is a Fault Domain?

A fault domain is the scope of impact caused by a single failure. How many servers go down if one power cable is cut? How many servers lose connectivity if one network switch fails? How many servers are affected if one data center loses power? These "blast radii" define fault domains.

AWS infrastructure is designed as a hierarchy of three fault domains. The smallest is the AZ (Availability Zone), which isolates failures within a single cluster of data centers. The middle layer is the Region, which isolates failures across geographically separated locations. The largest is the Partition (aws, aws-cn, aws-us-gov), which separates politically and legally independent infrastructure.

This three-layer design ensures that a failure in one AZ does not cascade to the entire Region, a failure in one Region does not cascade to other Regions, and a failure in one Partition does not cascade to other Partitions.
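The containment rule can be sketched as a small model (hypothetical types and names, not an AWS API): a failure scoped to a partition, Region, or AZ affects only resources inside that scope, and everything outside it survives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultDomain:
    """A scope in the partition > region > AZ hierarchy (illustrative model)."""
    partition: str
    region: Optional[str] = None   # None = the whole partition is in scope
    az_id: Optional[str] = None    # None = the whole region is in scope


def blast_radius_contained(failure: FaultDomain, resource: FaultDomain) -> bool:
    """Return True if `resource` lies outside the failure's blast radius."""
    if failure.partition != resource.partition:
        return True   # partitions never share a fault domain
    if failure.region is not None and failure.region != resource.region:
        return True   # a regional failure does not cross Region boundaries
    if failure.az_id is not None and failure.az_id != resource.az_id:
        return True   # an AZ failure does not cross AZ boundaries
    return False      # resource is inside the failing scope


# An AZ-scoped failure takes down use1-az1 but not its sibling AZ.
az_failure = FaultDomain("aws", "us-east-1", "use1-az1")
print(blast_radius_contained(az_failure, FaultDomain("aws", "us-east-1", "use1-az2")))
```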

AZ Fault Isolation - Independent Power, Cooling, and Networking

An AZ is the smallest unit of fault isolation in AWS. Each AZ consists of one or more data centers with fully independent power systems, cooling systems, and network connections. AZs within the same Region are connected by dedicated high-bandwidth, low-latency networks, but are physically separated by a meaningful distance (typically kilometers to tens of kilometers), so that a localized event such as a fire or flood cannot take out more than one AZ.

There is a real-world example of AZ fault isolation working as designed. In 2019, a power failure in one AZ of us-east-1 affected EC2 instances and EBS volumes within that AZ, while the other AZs in the same Region continued operating normally. Services running in a multi-AZ configuration saw no end-user impact: instances in the unaffected AZs kept handling traffic even after the affected AZ's instances went down.

An important detail to note is that AZ names (such as us-east-1a) are mapped differently per account: Account A's us-east-1a and Account B's us-east-1a may refer to different physical AZs. AZ IDs (such as use1-az1) are consistent across accounts, so use AZ IDs when specifying AZs in cross-account scenarios.
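As a sketch of the name-vs-ID distinction, the following builds a ZoneName-to-ZoneId map from data shaped like the EC2 DescribeAvailabilityZones response. The sample values below are illustrative; in a real script you would call `boto3.client("ec2").describe_availability_zones()` and the mapping would differ per account.

```python
# Illustrative response in the shape returned by DescribeAvailabilityZones.
# The name-to-ID mapping shown here is made up; each account sees its own.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4", "RegionName": "us-east-1"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6", "RegionName": "us-east-1"},
    ]
}


def zone_name_to_id(response: dict) -> dict:
    """Map account-specific AZ names to account-independent AZ IDs."""
    return {az["ZoneName"]: az["ZoneId"] for az in response["AvailabilityZones"]}


# When coordinating across accounts, compare ZoneIds, never ZoneNames.
print(zone_name_to_id(sample_response))
```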

Regional Geographic Separation - Preparing for Natural Disasters and Large-Scale Outages

A Region is an independent infrastructure deployment in a geographically separate location. Each Region has its own control plane (the system that manages resource creation, modification, and deletion) and operates independently of other Regions' control planes. This design ensures that a control plane failure in one Region does not affect resource management in other Regions.

However, some services have global control planes. IAM, Route 53, and CloudFront are global services with control planes concentrated in us-east-1. During the 2021 us-east-1 network disruption, the IAM control plane was affected, preventing new IAM role creation and policy changes in other Regions as well. However, the IAM data plane (authentication and authorization processing) is cached in each Region, so access using existing credentials continued to work. This design principle - where the control plane may stop but the data plane keeps running - is known as Static Stability.
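The static stability pattern can be sketched as a cache-backed fallback (a hypothetical class, not an actual AWS implementation): answer from the control plane while it is reachable, and from recently cached results when it is not.

```python
import time


class CachedCredentialVerifier:
    """Sketch of static stability: the data plane keeps answering from a
    local cache even when the control plane is unreachable.
    (Hypothetical names and logic, not an AWS API.)"""

    def __init__(self, control_plane_lookup, ttl_seconds=3600):
        self._lookup = control_plane_lookup   # may raise during an outage
        self._cache = {}                      # access_key -> (decision, fetched_at)
        self._ttl = ttl_seconds

    def authorize(self, access_key):
        try:
            decision = self._lookup(access_key)
            self._cache[access_key] = (decision, time.time())
            return decision
        except ConnectionError:
            cached = self._cache.get(access_key)
            if cached and time.time() - cached[1] < self._ttl:
                return cached[0]   # serve the stale-but-valid cached answer
            raise                  # no usable cache entry: fail


# Simulate a control plane that works once, then goes down.
calls = {"n": 0}

def flaky_lookup(access_key):
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("control plane unreachable")
    return "allow"


verifier = CachedCredentialVerifier(flaky_lookup)
print(verifier.authorize("AKIAEXAMPLE"))  # control plane reachable
print(verifier.authorize("AKIAEXAMPLE"))  # served from cache during the outage
```

Existing credentials keep working during the outage; only operations that require the control plane (creating roles, changing policies) fail, mirroring the IAM behavior described above.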

Partition-Level Political Separation

AWS Partitions are completely separated infrastructure environments, created for political and legal reasons. Three partitions are publicly documented: the commercial partition (aws), the China partition (aws-cn), and the GovCloud partition (aws-us-gov). Each partition has independent IAM, independent billing systems, and independent support structures. Resource sharing and data transfer between partitions is not possible by design.

The China partition exists because Chinese law requires cloud services to be provided by locally licensed operators. China Regions are operated by Chinese partner companies and are completely independent from AWS's global infrastructure. AWS accounts for China Regions must be created separately from commercial partition accounts.

The GovCloud partition is designed to handle sensitive, regulated U.S. government data such as Controlled Unclassified Information. It meets strict compliance requirements including FedRAMP High, ITAR (International Traffic in Arms Regulations), and CJIS (Criminal Justice Information Services), and access is restricted to vetted U.S. persons.
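The partition is visible in every resource identifier: the documented ARN format is arn:partition:service:region:account-id:resource. A small parser makes the separation concrete:

```python
def arn_partition(arn: str) -> str:
    """Extract the partition segment from an ARN.

    ARN format: arn:partition:service:region:account-id:resource
    (region and account-id may be empty for some services, e.g. S3).
    """
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[1]


# The same kind of resource lives under a different partition prefix
# in each environment (example account ID and names are illustrative).
print(arn_partition("arn:aws:iam::123456789012:role/Example"))   # commercial
print(arn_partition("arn:aws-cn:s3:::my-bucket"))                # China
print(arn_partition("arn:aws-us-gov:s3:::my-bucket"))            # GovCloud
```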

Architecture Design with Fault Domains in Mind

With an understanding of the three-layer fault domain structure, you can choose the right design based on your workload's availability requirements.

A single-AZ configuration is suitable for development and test environments or batch processing that tolerates downtime. Cost is minimal, but the service goes down during an AZ failure.

A multi-AZ configuration is the standard for production environments. Deploying an ALB with Auto Scaling Groups across multiple AZs and using multi-AZ RDS provides resilience against single-AZ failures. This configuration is sufficient for most workloads.

A multi-Region configuration is adopted when you need to withstand a Region-wide failure. Route 53 failover routing automatically switches to a secondary Region when the primary Region experiences an outage, and DynamoDB Global Tables and Aurora Global Database automate cross-Region data replication. However, multi-Region configurations significantly increase cost and operational complexity, so make sure to define clear business requirements (RPO/RTO) before adopting this approach.

To systematically learn availability design patterns, specialized books on Amazon are a helpful reference.
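The tiering decision above can be summarized as a rule of thumb (the function name and inputs are illustrative, not AWS guidance; real decisions also weigh RPO/RTO targets and budget):

```python
def suggest_topology(tolerates_downtime: bool,
                     must_survive_region_outage: bool) -> str:
    """Illustrative mapping from availability requirements to a deployment tier."""
    if tolerates_downtime:
        # Dev/test or interruptible batch work: cheapest option.
        return "single-AZ"
    if must_survive_region_outage:
        # Accept the extra cost and complexity only when required.
        return "multi-Region with Route 53 failover"
    # The production default: survives a single-AZ failure.
    return "multi-AZ behind an ALB"


print(suggest_topology(tolerates_downtime=False, must_survive_region_outage=False))
```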