Most Common Well-Architected Review Findings - 5 Design Mistakes Engineers Overlook

Covers the most frequently flagged design issues in AWS Well-Architected Reviews, focusing on five areas: single-AZ deployment, missing backups, underutilized logging, neglected cost optimization, and overly permissive security groups.

About 6 min read. Last updated: 2025-09-27

What Is a Well-Architected Review?

The AWS Well-Architected Review is a program where AWS Solutions Architects or certified partners review customer workloads against six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. You can also conduct self-service reviews using the Well-Architected Tool. A review works through a set of questions for each pillar, and the answers surface "high risk" and "medium risk" issues. Across the thousands of reviews conducted by AWS Solutions Architects, the recurring findings follow clear patterns. Below are the five most common findings. These are not technically advanced issues - they are fundamental design mistakes that are easily avoided once you know about them.

Finding 1: Single-AZ Deployment - The Most Common Reliability Issue

Production workloads deployed in a single AZ are the most frequently flagged issue in Well-Architected Reviews. Common patterns include EC2 instances existing in only one AZ, RDS in single-AZ deployment, and ELBs associated with subnets in only one AZ. With single-AZ deployment, a failure in that AZ causes a complete service outage. AZ-level impairments are not hypothetical: AWS has publicly reported several, so they should be treated as an expected failure mode rather than a rare event. The fix is relatively straightforward: specify subnets from multiple AZs in your Auto Scaling Group, switch RDS to multi-AZ deployment, and associate your ELB with subnets in multiple AZs. Multi-AZ roughly doubles the cost of RDS, but this is a reasonable investment compared to the losses from a service outage. Development and test environments can remain single-AZ, but production environments should always be multi-AZ.
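The single-AZ patterns described above can be checked mechanically. The following is a minimal sketch, not an official review tool. It assumes the inputs are lists of dicts shaped like the boto3 responses from describe_auto_scaling_groups (AutoScalingGroups), describe_db_instances (DBInstances), and the elbv2 describe_load_balancers call (LoadBalancers). The function name find_single_az_resources and the wording of the findings are illustrative.

```python
from typing import List


def find_single_az_resources(
    auto_scaling_groups: List[dict],
    db_instances: List[dict],
    load_balancers: List[dict],
) -> List[str]:
    """Return one finding per resource that is confined to a single AZ.

    The dict shapes are assumed to match the corresponding boto3
    describe_* responses; nothing here calls AWS.
    """
    findings = []

    # Auto Scaling Groups list their AZs as plain strings.
    for asg in auto_scaling_groups:
        zones = set(asg.get("AvailabilityZones", []))
        if len(zones) < 2:
            findings.append(f"ASG {asg['AutoScalingGroupName']}: {len(zones)} AZ(s)")

    # RDS instances expose a MultiAZ boolean.
    for db in db_instances:
        if not db.get("MultiAZ", False):
            findings.append(f"RDS {db['DBInstanceIdentifier']}: Multi-AZ disabled")

    # ELBv2 load balancers list AZs as dicts with a ZoneName key.
    for lb in load_balancers:
        zones = {az["ZoneName"] for az in lb.get("AvailabilityZones", [])}
        if len(zones) < 2:
            findings.append(f"ELB {lb['LoadBalancerName']}: {len(zones)} AZ(s)")

    return findings
```

Because the function takes plain dicts, it can be run against saved API responses or fixtures before being wired to live API calls.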

Finding 2: Missing Backups - Risk of Data Loss

Missing backups are the second most common finding - EBS volume snapshots not being taken, RDS automatic backups disabled, and DynamoDB Point-in-Time Recovery (PITR) disabled. Some teams rely on S3's 11 nines of durability and neglect backups. However, S3's durability protects against physical data loss, not accidental or malicious deletion. Without S3 versioning enabled, deleted objects cannot be recovered. AWS Backup lets you centrally manage backups for EC2, EBS, RDS, DynamoDB, EFS, and S3. Define backup policies to automatically take backups on schedule and auto-delete old backups based on retention periods. Beyond just taking backups, regular restore testing is critical. Even if backups exist, you cannot recover during an incident if restore procedures are not established.
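The three backup gaps named above each correspond to a single field in an API response, so they are easy to audit. The sketch below assumes you have already collected that data: RDS instances as dicts from describe_db_instances (where a BackupRetentionPeriod of 0 means automated backups are off), a mapping of DynamoDB table name to its PointInTimeRecoveryStatus string, and a mapping of S3 bucket name to the Status value from get_bucket_versioning (None when versioning was never enabled). The function name and the flattened mappings are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_backup_gaps(
    db_instances: List[dict],
    pitr_status_by_table: Dict[str, str],
    versioning_by_bucket: Dict[str, Optional[str]],
) -> List[str]:
    """Return one finding per resource with a missing backup safeguard."""
    findings = []

    # RDS: a retention period of 0 days disables automated backups.
    for db in db_instances:
        if db.get("BackupRetentionPeriod", 0) == 0:
            findings.append(
                f"RDS {db['DBInstanceIdentifier']}: automated backups disabled"
            )

    # DynamoDB: PITR status is reported as "ENABLED" or "DISABLED".
    for table, status in pitr_status_by_table.items():
        if status != "ENABLED":
            findings.append(f"DynamoDB {table}: PITR not enabled")

    # S3: versioning status is "Enabled", "Suspended", or absent (None here).
    for bucket, status in versioning_by_bucket.items():
        if status != "Enabled":
            findings.append(f"S3 {bucket}: versioning not enabled")

    return findings
```

A check like this only proves that backups are configured. It says nothing about whether a restore works, which still needs a periodic restore test.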

Finding 3: Security Groups Allowing 0.0.0.0/0

Security group inbound rules allowing access from 0.0.0.0/0 (all IP addresses) are the most common finding under the Security pillar. Particularly high-risk cases involve SSH (port 22) or RDP (port 3389) open to 0.0.0.0/0. Internet bots constantly scan for open ports 22 and 3389 and automatically execute brute-force attacks. The solution is to restrict SSH/RDP source IPs to specific addresses, or use Systems Manager Session Manager to eliminate SSH/RDP entirely. Session Manager provides IAM-authenticated instance access without opening any ports in security groups. For web servers, opening HTTP (80) and HTTPS (443) to 0.0.0.0/0 is legitimate, but the recommended design is to place them behind an ALB and configure EC2 security groups to allow access only from the ALB's security group.
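To illustrate the high-risk case above, here is a minimal sketch that scans security group rules for SSH or RDP open to the whole internet. It assumes the input is the SecurityGroups list from the boto3 describe_security_groups response. The function name, the constants, and the decision to check only TCP and "all traffic" rules are illustrative choices, not a complete audit.

```python
from typing import List, Tuple

RISKY_PORTS = {22: "SSH", 3389: "RDP"}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}  # IPv4 and IPv6 "anywhere"


def find_open_admin_ports(security_groups: List[dict]) -> List[Tuple[str, int, str]]:
    """Return (group_id, port, service) for SSH/RDP rules open to any address."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
            if not OPEN_CIDRS.intersection(cidrs):
                continue  # rule is restricted to specific sources

            protocol = perm.get("IpProtocol")
            if protocol == "-1":
                # "-1" means all protocols and all ports.
                exposed = sorted(RISKY_PORTS)
            elif protocol == "tcp":
                low, high = perm.get("FromPort"), perm.get("ToPort")
                if low is None or high is None:
                    continue
                exposed = [p for p in sorted(RISKY_PORTS) if low <= p <= high]
            else:
                continue  # UDP, ICMP, etc. are out of scope for this sketch

            for port in exposed:
                findings.append((sg["GroupId"], port, RISKY_PORTS[port]))
    return findings
```

Rules that reference another security group (for example, the ALB's group) carry no CIDR range and therefore pass the check, which matches the recommended design.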

Finding 4: Neglected Cost Optimization - Abandoned Resources

Resources launched for development or testing and then left running are the most common finding under the Cost Optimization pillar. Typical patterns include forgotten EC2 instances, unattached EBS volumes, unused Elastic IP addresses ($0.005 per hour each; since February 2024 AWS bills this rate for every public IPv4 address, attached or not), and idle NAT Gateways ($0.045 per hour in us-east-1, plus data processing charges). AWS Trusted Advisor's cost optimization checks detect many of these wasted resources automatically. Cost Explorer's rightsizing recommendations are also useful for spotting underutilized instances. Another common finding is not using Savings Plans or Reserved Instances. Running stable production workloads at on-demand pricing means forgoing discounts that typically range from 30% to 60%, depending on term and payment option. Use Cost Explorer's Savings Plans recommendations to determine the optimal commitment amount.
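The hourly rates quoted above look small, so it helps to convert them to a monthly figure. The sketch below does only that arithmetic. It uses the two rates from this article and assumes 730 hours per month, a common billing approximation. The function name and the resource keys are illustrative, and data processing charges and regional price differences are ignored.

```python
from typing import Dict

HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year / 12

# Hourly rates quoted in the article (USD); adjust for your region.
HOURLY_RATES_USD = {
    "idle_eip": 0.005,
    "idle_nat_gateway": 0.045,
}


def monthly_idle_cost(counts: Dict[str, int]) -> float:
    """Estimate the monthly cost in USD of idle resources, by kind and count."""
    total = sum(
        HOURLY_RATES_USD[kind] * count * HOURS_PER_MONTH
        for kind, count in counts.items()
    )
    return round(total, 2)


if __name__ == "__main__":
    # Example: four unused Elastic IPs and two forgotten NAT Gateways.
    print(monthly_idle_cost({"idle_eip": 4, "idle_nat_gateway": 2}))
```

With these rates, a single idle NAT Gateway costs about $33 per month before data processing. That is why a few forgotten test environments add up quickly.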

Finding 5: Logging and Monitoring Gaps

Gaps in logging and monitoring (CloudTrail trails not created, CloudWatch alarms not configured, application logs not structured) are the most common finding under the Operational Excellence pillar. CloudTrail records management events by default in Event history, but that history only goes back 90 days; unless you create a trail that delivers to S3, older events are lost. For production environments, always create a trail and store logs in S3 for long-term retention. Without CloudWatch alarms, incident detection is delayed. At minimum, set alarms for CPU utilization, memory utilization (which requires custom metrics), disk utilization, ELB 5xx error rate, and RDS connection count. Structure application logs in JSON format and send them to CloudWatch Logs, which enables SQL-like queries on individual fields with CloudWatch Logs Insights. Unstructured text logs have to be searched with grep-style pattern matching, which becomes impractical at scale.
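As an illustration of structured logging, the following sketch emits one JSON object per log line using only Python's standard logging module. The class name JsonFormatter, the helper build_logger, and the convention of passing extra fields under a "context" key are assumptions made for this example, not part of any AWS SDK. Shipping the output to CloudWatch Logs (for example via the CloudWatch agent or a Lambda runtime) is outside the scope of the sketch.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}.
        # Keys are merged at the top level so they become queryable fields.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "payment failed",
        extra={"context": {"order_id": "o-123", "status_code": 502}},
    )
```

Once lines like these are in CloudWatch Logs, a Logs Insights query such as `fields @timestamp, message, order_id | filter level = "ERROR" | sort @timestamp desc | limit 20` can filter on the JSON fields directly. The field names in that query simply mirror the keys chosen in this sketch.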
