Most Common Well-Architected Review Findings - 5 Design Mistakes Engineers Overlook

This article covers the five design issues most frequently flagged in AWS Well-Architected Reviews: single-AZ deployment, missing backups, overly permissive security groups, neglected cost optimization, and gaps in logging and monitoring.

What Is a Well-Architected Review?

The AWS Well-Architected Review is a program in which AWS Solutions Architects or certified partners review customer workloads against six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. You can also run a self-service review with the AWS Well-Architected Tool. A review works through a set of questions for each pillar, and the answers surface "high risk" and "medium risk" issues. Based on experience from thousands of reviews conducted by AWS Solutions Architects, recurring findings follow clear patterns. The five most common are described below. None of them is technically advanced; they are fundamental design mistakes that are easy to avoid once you know about them.

Finding 1: Single-AZ Deployment - The Most Common Reliability Issue

Production workloads deployed in a single Availability Zone (AZ) are the most frequently flagged issue in Well-Architected Reviews. Common patterns include EC2 instances running in only one AZ, RDS in a single-AZ deployment, and ELBs associated with subnets in only one AZ. With a single-AZ deployment, a failure in that AZ takes the entire service down. AZ-level disruptions do happen, and a production design should treat them as an expected failure mode rather than a rare event. The fix is relatively straightforward: specify subnets in multiple AZs for your Auto Scaling group, switch RDS to a Multi-AZ deployment, and associate your ELB with subnets in multiple AZs. Multi-AZ roughly doubles the cost of RDS because a standby instance runs in a second AZ, but this is a reasonable investment compared to the losses from a service outage. Development and test environments can remain single-AZ, but production environments should always be multi-AZ.
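As a minimal sketch of how these patterns could be detected, the Python function below flags RDS instances and Auto Scaling groups confined to one AZ. It assumes its inputs are lists of dictionaries shaped like the "DBInstances" and "AutoScalingGroups" entries returned by the boto3 calls describe_db_instances and describe_auto_scaling_groups. The function name find_single_az_resources and the sample data are hypothetical and exist only for illustration.

```python
def find_single_az_resources(db_instances, auto_scaling_groups):
    """Return findings for resources that are confined to a single AZ.

    Inputs are assumed to look like the "DBInstances" and
    "AutoScalingGroups" lists from boto3 responses; only the
    fields read below are required.
    """
    findings = []
    for db in db_instances:
        # RDS reports Multi-AZ status as a boolean on each instance.
        if not db.get("MultiAZ", False):
            findings.append(f"RDS {db['DBInstanceIdentifier']}: Multi-AZ disabled")
    for asg in auto_scaling_groups:
        # An Auto Scaling group can only launch instances in the AZs it lists.
        zones = set(asg.get("AvailabilityZones", []))
        if len(zones) < 2:
            findings.append(
                f"ASG {asg['AutoScalingGroupName']}: {len(zones)} AZ(s) configured"
            )
    return findings


# Hypothetical sample data for illustration only.
sample_dbs = [
    {"DBInstanceIdentifier": "orders-db", "MultiAZ": False},
    {"DBInstanceIdentifier": "users-db", "MultiAZ": True},
]
sample_asgs = [
    {"AutoScalingGroupName": "web-asg", "AvailabilityZones": ["us-east-1a"]},
    {"AutoScalingGroupName": "api-asg",
     "AvailabilityZones": ["us-east-1a", "us-east-1b"]},
]

for finding in find_single_az_resources(sample_dbs, sample_asgs):
    print(finding)
```

Keeping the check as a pure function over response-shaped dictionaries means it can be exercised without AWS credentials; in practice the inputs would come from paginated API calls across every Region in use.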

Finding 2: Missing Backups - Risk of Data Loss

Missing backups are the second most common finding. Typical cases are EBS volumes with no snapshots, RDS automated backups disabled, and DynamoDB point-in-time recovery (PITR) disabled. Some teams rely on S3's 11 nines (99.999999999%) of durability and neglect backups. However, that durability protects against physical data loss, not against accidental or malicious deletion: without S3 versioning enabled, deleted objects cannot be recovered. AWS Backup lets you centrally manage backups for EC2, EBS, RDS, DynamoDB, EFS, and S3. Define backup plans that take backups on a schedule and delete old recovery points once their retention period expires. Taking backups is not enough on its own; regular restore testing is critical. A backup that exists is of little use during an incident if nobody has established and rehearsed the restore procedure.
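The checks described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete audit: RDS instances are assumed to be dictionaries carrying the BackupRetentionPeriod field from describe_db_instances, PITR status is assumed to be the PointInTimeRecoveryStatus value from describe_continuous_backups, and versioning status is assumed to be the Status value from get_bucket_versioning (absent when versioning was never enabled). The function name find_backup_gaps, the name-to-status mappings, and the sample data are hypothetical.

```python
def find_backup_gaps(rds_instances, dynamodb_pitr_status, s3_versioning_status):
    """Return findings for data stores that lack basic backup protection.

    dynamodb_pitr_status maps table name -> "ENABLED" or "DISABLED".
    s3_versioning_status maps bucket name -> "Enabled", "Suspended", or None.
    """
    findings = []
    for db in rds_instances:
        # A retention period of 0 days means automated backups are disabled.
        if db.get("BackupRetentionPeriod", 0) == 0:
            findings.append(
                f"RDS {db['DBInstanceIdentifier']}: automated backups disabled"
            )
    for table, status in sorted(dynamodb_pitr_status.items()):
        if status != "ENABLED":
            findings.append(f"DynamoDB {table}: PITR not enabled")
    for bucket, status in sorted(s3_versioning_status.items()):
        # Both "Suspended" and a missing status leave new deletes unrecoverable.
        if status != "Enabled":
            findings.append(f"S3 {bucket}: versioning not enabled")
    return findings


# Hypothetical sample data for illustration only.
sample_rds = [
    {"DBInstanceIdentifier": "orders-db", "BackupRetentionPeriod": 0},
    {"DBInstanceIdentifier": "users-db", "BackupRetentionPeriod": 7},
]
sample_pitr = {"sessions": "DISABLED", "carts": "ENABLED"}
sample_versioning = {"app-assets": None, "audit-logs": "Enabled", "exports": "Suspended"}

for finding in find_backup_gaps(sample_rds, sample_pitr, sample_versioning):
    print(finding)
```

A check like this only confirms that backups are configured. It says nothing about whether a restore actually works, which still has to be tested separately.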

Finding 3: Security Groups Allowing 0.0.0.0/0

Security group inbound rules allowing access from 0.0.0.0/0 (all IP addresses) are the most common finding under the Security pillar. Particularly high-risk cases involve SSH (port 22) or RDP (port 3389) open to 0.0.0.0/0. Internet bots constantly scan for open ports 22 and 3389 and automatically execute brute-force attacks. The solution is to restrict SSH/RDP source IPs to specific addresses, or use Systems Manager Session Manager to eliminate SSH/RDP entirely. Session Manager provides IAM-authenticated instance access without opening any ports in security groups. For web servers, opening HTTP (80) and HTTPS (443) to 0.0.0.0/0 is legitimate, but the recommended design is to place them behind an ALB and configure EC2 security groups to allow access only from the ALB's security group.
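A minimal sketch of detecting the high-risk case is shown below. It assumes the input is a list of dictionaries shaped like the "SecurityGroups" entries returned by the boto3 call describe_security_groups, reading the IpPermissions, IpProtocol, FromPort, ToPort, IpRanges, and Ipv6Ranges fields. The function name find_open_admin_ports and the sample groups are hypothetical. As a simplification, it only inspects rules whose protocol is reported as "tcp" or "-1" (all traffic), and it also treats the IPv6 equivalent ::/0 as open to the world.

```python
ADMIN_PORTS = {22: "SSH", 3389: "RDP"}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def find_open_admin_ports(security_groups):
    """Return (group_id, service, port) for admin ports open to the internet."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            protocol = rule.get("IpProtocol")
            # Simplification: only TCP rules and "all traffic" rules are checked.
            if protocol not in ("tcp", "-1"):
                continue
            cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
            if not OPEN_CIDRS.intersection(cidrs):
                continue
            if protocol == "-1":
                # "All traffic" rules carry no port range and cover every port.
                low, high = 0, 65535
            else:
                low, high = rule.get("FromPort", 0), rule.get("ToPort", 65535)
            for port, service in ADMIN_PORTS.items():
                if low <= port <= high:
                    findings.append((group["GroupId"], service, port))
    return findings


# Hypothetical sample data for illustration only.
sample_groups = [
    {"GroupId": "sg-0aaa", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0bbb", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0ccc", "IpPermissions": [
        {"IpProtocol": "-1", "Ipv6Ranges": [{"CidrIpv6": "::/0"}]}]},
    {"GroupId": "sg-0ddd", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}]},
]

for group_id, service, port in find_open_admin_ports(sample_groups):
    print(f"{group_id}: {service} (port {port}) open to the internet")
```

In the sample, the group that opens HTTPS to 0.0.0.0/0 and the group that restricts SSH to a specific address range are not flagged, which matches the distinction drawn above between legitimate web access and exposed administrative ports.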

Finding 4: Neglected Cost Optimization - Abandoned Resources

Resources launched for development or testing and then left running are the most common finding under the Cost Optimization pillar. Typical patterns include forgotten EC2 instances, unattached EBS volumes, unused Elastic IP addresses (an unattached EIP costs $0.005 per hour), and idle NAT Gateways ($0.045 per hour plus data processing charges); exact rates vary by Region. AWS Trusted Advisor's cost optimization checks detect many of these wasted resources automatically, and Cost Explorer's rightsizing recommendations help identify underutilized EC2 instances. Another common finding is not using Savings Plans or Reserved Instances. Running stable production workloads at on-demand pricing means forgoing discounts that commonly fall in the 30-60% range, depending on the plan type, term, and payment option. Use Cost Explorer's Savings Plans recommendations to determine an appropriate commitment amount.
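To make the hourly rates above concrete, the following sketch converts them to approximate monthly costs and picks out idle resources. It assumes a 730-hour month (8,760 hours divided by 12), uses the two hourly rates quoted in the text, and expects inputs shaped like the "Volumes" and "Addresses" entries returned by the boto3 calls describe_volumes and describe_addresses. The function names, constants, and sample data are hypothetical.

```python
from decimal import Decimal

HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year / 12 months

# Hourly rates quoted in the text; actual prices vary by Region.
EIP_HOURLY = Decimal("0.005")
NAT_GATEWAY_HOURLY = Decimal("0.045")


def monthly_cost(hourly_rate, count=1, hours=HOURS_PER_MONTH):
    """Approximate monthly cost of `count` resources billed per hour."""
    return hourly_rate * count * hours


def find_idle_resources(volumes, addresses):
    """Return (unattached volume IDs, unassociated Elastic IPs)."""
    # An EBS volume in the "available" state is not attached to any instance.
    idle_volumes = [v["VolumeId"] for v in volumes if v.get("State") == "available"]
    # An Elastic IP with no AssociationId is not associated with any resource.
    idle_eips = [a["PublicIp"] for a in addresses if "AssociationId" not in a]
    return idle_volumes, idle_eips


# Hypothetical sample data for illustration only.
sample_volumes = [
    {"VolumeId": "vol-0aaa", "State": "available"},
    {"VolumeId": "vol-0bbb", "State": "in-use"},
]
sample_addresses = [
    {"PublicIp": "203.0.113.10", "AllocationId": "eipalloc-0aaa"},
    {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-0bbb",
     "AssociationId": "eipassoc-0bbb"},
]

idle_volumes, idle_eips = find_idle_resources(sample_volumes, sample_addresses)
print("Unattached volumes:", idle_volumes)
print("Unassociated EIPs:", idle_eips)
print("Idle EIP cost per month: $", monthly_cost(EIP_HOURLY, count=len(idle_eips)))
print("Idle NAT Gateway cost per month: $", monthly_cost(NAT_GATEWAY_HOURLY))
```

At these rates an idle EIP works out to about $3.65 per month and an idle NAT Gateway to about $32.85 per month before data processing charges. The amounts are small individually but add up across accounts and Regions.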

Finding 5: Logging and Monitoring Gaps

Gaps in logging and monitoring are the most common finding under the Operational Excellence pillar: CloudTrail trails not created, CloudWatch alarms not configured, and application logs left unstructured. CloudTrail records management events by default, but its event history only goes back 90 days; unless you create a trail that delivers events to S3, anything older is lost. For production environments, always create a trail and store logs in S3 for long-term retention. Without CloudWatch alarms, incident detection is delayed. At minimum, set alarms for CPU utilization, memory and disk utilization (both require custom metrics, typically published by the CloudWatch agent), ELB 5xx error rate, and RDS connection count. Structure application logs as JSON and send them to CloudWatch Logs so that CloudWatch Logs Insights can query individual fields with its SQL-like query language. Unstructured text logs are largely limited to grep-style text matching, which is impractical at scale.
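As one way to produce structured application logs, the sketch below uses Python's standard logging module with a custom formatter that emits one JSON object per line. The class name JsonFormatter, the helper build_logger, the convention of passing fields through extra={"context": {...}}, and the sample field names are assumptions made for this example. Shipping the output to CloudWatch Logs (for example through the CloudWatch agent or a Lambda runtime) is outside the scope of the sketch.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become top-level keys.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger's handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Hypothetical usage: write to an in-memory buffer instead of stdout.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order placed", extra={"context": {"order_id": "A-1001", "latency_ms": 182}})
print(buffer.getvalue().strip())
```

Because every key in the JSON object is discovered as a field, a Logs Insights query such as `fields @timestamp, order_id, latency_ms | filter level = "ERROR" | sort @timestamp desc | limit 20` can filter and sort on those fields directly instead of pattern-matching raw text.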