ML-Based Operational Anomaly Detection - Catching Issues Early with Amazon DevOps Guru

Learn how Amazon DevOps Guru uses ML to detect operational anomalies, including automatic CloudWatch metrics analysis, proactive anomaly detection, recommended actions, and CloudFormation stack-level monitoring.

About 6 min read. Last updated: 2026-01-10

Operational Monitoring Challenges and the Role of DevOps Guru

In cloud environments, operational monitoring typically involves setting up CloudWatch metrics, alarms, and dashboards to detect anomalies. However, in large-scale environments with hundreds to thousands of metrics, setting and maintaining appropriate thresholds for every metric is impractical. Additionally, some complex anomalies (such as CPU appearing normal while latency gradually increases and error rates slightly rise) cannot be caught by individual metric thresholds. Amazon DevOps Guru is a service that uses ML models to automatically analyze CloudWatch metrics, CloudTrail events, and Config configuration changes, detecting operational anomalies at the early warning stage. No individual metric thresholds are needed; DevOps Guru learns from historical patterns to dynamically build baselines. Its multi-metric correlation analysis catches complex anomalies that single-metric monitoring would miss.
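The difference between a fixed threshold and a learned baseline can be illustrated with a toy example. The sketch below is not DevOps Guru's actual model (which is not described here); it swaps in a simple rolling mean and standard deviation as a stand-in for a baseline learned from historical patterns. The latency values, window size, and z-score limit are all made-up assumptions.

```python
from statistics import mean, stdev


def static_threshold_alerts(values, threshold):
    """Return indexes where a value exceeds a fixed, hand-picked threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def baseline_alerts(values, window=10, z_limit=3.0):
    """Return indexes where a value deviates strongly from a rolling baseline.

    The baseline is the mean and standard deviation of the previous
    `window` points. This is an illustrative stand-in, not the service's model.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history) or 1e-9  # avoid division by zero on flat history
        if abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Hypothetical latency samples (ms): stable around 100, then a shift to ~140.
latency_ms = [100, 102, 99, 101, 100, 98, 101, 100, 102, 99,
              100, 101, 140, 145, 150]

# A conservative fixed threshold of 500 ms never fires on this series,
# while the rolling baseline flags the shift as soon as it begins.
fixed = static_threshold_alerts(latency_ms, threshold=500)
learned = baseline_alerts(latency_ms)
```

The fixed threshold stays silent because 150 ms is far below 500 ms, even though the series has clearly left its normal range. A baseline derived from history reacts to the change itself rather than to an absolute number someone had to choose in advance.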

Configuring Monitoring Targets and Anomaly Detection

DevOps Guru monitoring targets can be specified in two main ways. CloudFormation stack-based targeting monitors only the resources within specific stacks (applications), which aligns monitoring with application boundaries and makes it easier to identify the scope of an anomaly. Account-wide monitoring covers all supported resources in the AWS account. Resources can also be grouped by tag when an application is not defined by a stack. DevOps Guru automatically collects and analyzes CloudWatch metrics for the target resources. For Lambda functions, it analyzes metrics such as invocation counts, error rates, and execution duration. For DynamoDB, it tracks read/write capacity utilization and throttling events. For RDS, it analyzes CPU utilization, connection counts, and disk I/O. When an anomaly is detected, it is reported as an Insight, which includes the related metric anomalies, an estimated root cause, and recommended actions.
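Stack-based targeting can be configured through the `UpdateResourceCollection` API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the stack names are hypothetical, and the helper function names are introduced here for illustration only.

```python
def stack_collection_request(stack_names, action="ADD"):
    """Build keyword arguments for DevOps Guru's update_resource_collection call.

    `stack_names` is a list of CloudFormation stack names to add to (or
    remove from) the set of stacks that DevOps Guru analyzes.
    """
    if action not in ("ADD", "REMOVE"):
        raise ValueError("action must be 'ADD' or 'REMOVE'")
    if not stack_names:
        raise ValueError("at least one stack name is required")
    return {
        "Action": action,
        "ResourceCollection": {
            # De-duplicate and sort so the request is deterministic.
            "CloudFormation": {"StackNames": sorted(set(stack_names))}
        },
    }


def apply_collection(request, client=None):
    """Send the request. boto3 is imported lazily so the builder works offline."""
    if client is None:
        import boto3  # assumption: boto3 is available in the environment
        client = boto3.client("devops-guru")
    return client.update_resource_collection(**request)


# Hypothetical stack names for a production application:
# apply_collection(stack_collection_request(["orders-api-prod", "billing-prod"]))
```

Keeping the request builder separate from the API call makes the scope easy to review or unit-test before anything is changed in the account.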

Root Cause Analysis and Recommended Actions

DevOps Guru Insights include estimated root causes and specific recommended actions. For example, if a Lambda function's error rate increases, DevOps Guru correlates the increase with recent CloudFormation deployment events or Config configuration changes and may estimate that "a recent deployment is the likely cause." If DynamoDB throttling is detected, it recommends "increasing provisioned capacity" or "switching to on-demand mode." Insights are classified into two types: Reactive (anomalies already causing impact) and Proactive (anomalies showing early warning signs but not yet causing impact). Proactive Insights let you take action before an outage occurs, which can prevent downtime altogether. SNS notifications provide immediate alerts when Insights are generated, and EventBridge integration lets you route them into existing incident management tools such as PagerDuty or Slack.
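The EventBridge integration can be sketched as an event pattern plus a small routing function. The `source` and `detail-type` values below follow the events DevOps Guru publishes when an Insight opens. The `detail` field names (`insightType`, `insightSeverity`) and the destination names are assumptions made for illustration and should be checked against a real event before use.

```python
import json

# Event pattern for an EventBridge rule matching newly opened Insights.
INSIGHT_OPENED_PATTERN = {
    "source": ["aws.devops-guru"],
    "detail-type": ["DevOps Guru New Insight Open"],
}


def route_insight(event):
    """Pick a destination for an Insight event.

    Assumed detail fields: `insightType` ("REACTIVE" or "PROACTIVE") and
    `insightSeverity` ("high", "medium", "low"). The destination names
    returned here are hypothetical labels, not AWS resources.
    """
    detail = event.get("detail", {})
    insight_type = str(detail.get("insightType", "")).upper()
    severity = str(detail.get("insightSeverity", "")).lower()

    if insight_type == "REACTIVE" and severity == "high":
        return "page-on-call"      # impact is already happening
    if insight_type == "REACTIVE":
        return "team-channel"      # impact, but lower urgency
    if insight_type == "PROACTIVE":
        return "backlog-ticket"    # early warning, fix before it bites
    return "triage"                # unknown shape: let a human look


# The pattern serializes to the JSON string an EventBridge rule expects.
pattern_json = json.dumps(INSIGHT_OPENED_PATTERN)
```

Routing Proactive Insights to a ticket queue rather than a pager reflects the distinction drawn above: they describe something that has not caused impact yet.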

Comparison with CloudWatch Anomaly Detection

Both DevOps Guru and CloudWatch Anomaly Detection use ML-based anomaly detection, but they differ in scope. CloudWatch Anomaly Detection builds an anomaly detection band (expected value range) for an individual metric and triggers an alarm when the metric leaves the band. It requires per-metric configuration and does not perform multi-metric correlation analysis. DevOps Guru targets the entire application, automatically performing multi-metric correlation analysis and root cause estimation. Configuration only requires specifying monitoring targets, with no individual metric threshold setup needed. The two can be used together: set explicit CloudWatch Anomaly Detection alarms for critical individual metrics (API latency, error rates, etc.) and use DevOps Guru for comprehensive application-wide monitoring. Pricing is $0.0028 per resource analysis hour, which works out to approximately $200 per month for monitoring 100 resources around the clock (100 resources × about 730 hours × $0.0028).
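For the critical individual metrics, a CloudWatch Anomaly Detection alarm is defined with the `ANOMALY_DETECTION_BAND` metric math function. The sketch below builds the keyword arguments for CloudWatch's `put_metric_alarm`; the alarm name, API name, band width, and evaluation settings are hypothetical values chosen for illustration.

```python
def anomaly_band_alarm(alarm_name, namespace, metric_name, dimensions,
                       stat="Average", period=300, band_width=2,
                       evaluation_periods=3):
    """Build put_metric_alarm kwargs for an anomaly detection band alarm.

    `dimensions` is a plain dict such as {"ApiName": "orders-api"}.
    `band_width` is the number of standard deviations used for the band.
    """
    return {
        "AlarmName": alarm_name,
        # Alarm only when the metric rises above the top of the band.
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "ThresholdMetricId": "band",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": k, "Value": v}
                            for k, v in sorted(dimensions.items())
                        ],
                    },
                    "Period": period,
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected range)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical usage (requires boto3 and credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **anomaly_band_alarm("orders-api-latency-anomaly", "AWS/ApiGateway",
#                          "Latency", {"ApiName": "orders-api"}, stat="p90"))
```

Each alarm of this kind covers exactly one metric, which is the per-metric configuration effort the comparison above refers to.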

DevOps Guru Pricing

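DevOps Guru pricing is based on the number of AWS resources analyzed and the hours for which they are analyzed. At the time of writing, the rate depends on the resource type rather than on how the coverage boundary is defined: approximately $0.0028 per resource-hour for lower-priced resource types such as Lambda functions and S3 buckets, and approximately $0.0042 per resource-hour for other types such as EC2 instances, RDS instances, and DynamoDB tables. The same rates apply whether resources are selected by CloudFormation stack or by tag. For an environment with 100 resources analyzed around the clock (about 730 hours per month), the monthly cost is approximately $200-$310, so scoping has a direct effect on the bill. Both Proactive Insights (early warning detection) and Reactive Insights (incident analysis) are included. Limit analysis targets to production resources and exclude development/test environments to optimize costs.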
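The monthly estimate is a product of three numbers, so it is easy to check. The sketch below treats the quoted rates as hourly (matching the "per resource analysis hour" figure given earlier) and assumes about 730 hours in a month; actual bills depend on current rates and on how long each resource is active.

```python
HOURS_PER_MONTH = 730  # assumption: average month, analyzed around the clock


def monthly_cost(resource_count, hourly_rate, hours=HOURS_PER_MONTH):
    """Estimated monthly cost in USD for resources billed per resource-hour."""
    return round(resource_count * hourly_rate * hours, 2)


# 100 resources at each of the two quoted hourly rates.
low_estimate = monthly_cost(100, 0.0028)
high_estimate = monthly_cost(100, 0.0042)

# Effect of scoping: analyzing only 40 production resources instead of 100.
scoped_estimate = monthly_cost(40, 0.0028)
```

Because the cost scales linearly with the resource count, excluding development and test resources reduces the estimate in direct proportion.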

Summary - DevOps Guru Usage Guidelines

Amazon DevOps Guru is a service that uses ML-based automatic analysis to detect operational anomalies at the early warning stage. Its key strengths are automatic CloudWatch metrics analysis, multi-metric correlation analysis, root cause estimation, and recommended actions. CloudFormation stack-level monitoring enables anomaly detection aligned with application boundaries, and the lack of individual metric threshold configuration makes it easy to adopt. It is especially effective for monitoring production serverless applications and microservices.
