Amazon DevOps Guru

An AIOps service that uses machine learning to automatically detect operational anomalies, estimate root causes, and provide remediation recommendations

Overview

Amazon DevOps Guru is an AIOps service that uses machine learning to automatically detect operational anomalies in applications, estimate root causes, and provide specific remediation recommendations. It continuously analyzes CloudWatch metrics, CloudTrail logs, and Config change history, reporting deviations from normal operational patterns as insights. It detects operational issues in serverless and container environments - such as increased Lambda cold starts, DynamoDB throttling, and abnormal ECS task termination patterns - without relying on manual monitoring.

Insight Types and Anomaly Detection Mechanics

Insights generated by DevOps Guru fall into two categories: reactive and proactive. Reactive insights are generated when an anomaly is detected while it is happening. For example, when API Gateway latency spikes, DevOps Guru correlates the spike with related increases in Lambda function error rates and DynamoDB throttling to present root cause candidates. Proactive insights flag signs that have not yet caused failures but could become problems if left unaddressed, such as DynamoDB table capacity consumption trending upward or Lambda concurrent executions approaching the limit.

Anomaly detection is built on machine learning models trained on the large operational datasets AWS has accumulated. Users don't need to train their own models: simply enabling the service starts baseline learning automatically. The learning period is typically 1-2 weeks. Insights are still generated during that period, though accuracy improves once the baseline is established.
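To make the two categories concrete, the following is a minimal sketch of pulling ongoing insights of one type through the `ListInsights` API. It assumes `client` is a `boto3.client("devops-guru")` object (or anything with a compatible `list_insights` method); the helper name `ongoing_insights` is hypothetical and not part of any SDK.

```python
from typing import Any, Dict, Iterator

# ListInsights returns reactive and proactive results under separate keys.
_RESULT_KEY = {
    "REACTIVE": "ReactiveInsights",
    "PROACTIVE": "ProactiveInsights",
}


def ongoing_insights(client: Any, insight_type: str) -> Iterator[Dict[str, Any]]:
    """Yield ongoing insights of one type, following NextToken pagination.

    `client` is assumed to be boto3.client("devops-guru") or a compatible stub.
    """
    if insight_type not in _RESULT_KEY:
        raise ValueError(f"unknown insight type: {insight_type!r}")

    kwargs: Dict[str, Any] = {
        "StatusFilter": {"Ongoing": {"Type": insight_type}},
    }
    while True:
        page = client.list_insights(**kwargs)
        yield from page.get(_RESULT_KEY[insight_type], [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token
```

Calling `ongoing_insights(client, "PROACTIVE")` in a scheduled job is one way to review early-warning items before they turn into reactive insights. Passing the client in, rather than creating it inside the function, keeps the sketch runnable against a stub without AWS credentials.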

Coverage Settings and Resource Grouping

DevOps Guru's monitoring scope can be set to the entire AWS account, to specific CloudFormation stacks, or to specific tags. A common practice is to monitor only production stacks, filtering out noise from development environments. When specifying by CloudFormation stack, all supported resources within the stack (Lambda, DynamoDB, API Gateway, SQS, etc.) are automatically added to the monitoring scope. Tag-based specification lets you target only resources that carry a specific tag. The tag key must begin with the `Devops-Guru-` prefix, for example `Devops-Guru-enabled=true`.

Resource grouping directly affects anomaly detection accuracy. Grouping related resources together lets DevOps Guru understand inter-resource dependencies and trace failure cascades. In microservices architectures, an effective design is to isolate CloudFormation stacks per service and treat each stack as an independent monitoring unit.

Pricing is based on the number of monitored AWS resources and is metered by the hour, starting at approximately $0.0028 per resource per hour, with the rate depending on resource type.
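As an illustration of the stack-based and tag-based scopes, here is a minimal sketch that builds the request for the `UpdateResourceCollection` API. The function name `build_coverage_request`, the stack names, and the tag key are hypothetical. The sketch treats stacks and tags as mutually exclusive for simplicity, and it checks the `Devops-Guru-` key prefix that the API documents for tag-based coverage; both behaviors are worth verifying against the current API reference.

```python
from typing import Any, Dict, Iterable, Optional

_TAG_PREFIX = "devops-guru-"  # documented prefix for coverage tag keys


def build_coverage_request(
    stack_names: Iterable[str] = (),
    tag_key: Optional[str] = None,
    tag_values: Iterable[str] = (),
) -> Dict[str, Any]:
    """Build keyword arguments for update_resource_collection(Action="ADD")."""
    stacks = list(stack_names)
    values = list(tag_values)

    # Simplifying assumption of this sketch: exactly one kind of scope per call.
    if bool(stacks) == bool(tag_key):
        raise ValueError("specify either stack names or a tag key")

    if stacks:
        collection: Dict[str, Any] = {"CloudFormation": {"StackNames": stacks}}
    else:
        assert tag_key is not None
        if not tag_key.lower().startswith(_TAG_PREFIX):
            raise ValueError("tag key must begin with 'Devops-Guru-'")
        if not values:
            raise ValueError("at least one tag value is required")
        collection = {"Tags": [{"AppBoundaryKey": tag_key, "TagValues": values}]}

    return {"Action": "ADD", "ResourceCollection": collection}


if __name__ == "__main__":
    # Hypothetical production stacks, one per microservice.
    request = build_coverage_request(stack_names=["orders-prod", "billing-prod"])
    print(request)
    # With credentials configured, the request could be sent like this:
    #   import boto3
    #   boto3.client("devops-guru").update_resource_collection(**request)
```

Building the payload separately from the API call makes it easy to review which stacks are in scope before anything changes in the account.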

Notification Channels and Operational Workflows

DevOps Guru insights are delivered to operations teams through Amazon SNS topics. When an SNS topic is configured, messages are published when a new insight is generated, when an insight's severity changes, and when an insight is closed. Forwarding SNS notifications to a Slack channel via AWS Chatbot lets the team review insights within their everyday communication tool. EventBridge integration is also available, so different actions can be triggered based on insight type and severity. For example, you can page the on-call responder via PagerDuty for high-severity reactive insights while limiting medium and lower severities to Slack notifications.

The insight detail view displays graphs of the metrics where anomalies were detected, a list of related resources, and recommended remediation actions. Recommendations include links to AWS documentation with specific configuration change procedures, which makes it easier for less experienced operators to determine the appropriate response.
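The severity-based routing described above can be sketched as an EventBridge event pattern plus a small routing function. The `source` and `detail-type` values and the `insightSeverity` and `insightType` detail fields reflect the DevOps Guru event format as best understood here and should be checked against the EventBridge event reference. The channel names `pagerduty` and `slack` and both function names are hypothetical, and sending high-severity proactive insights to Slack is an assumption of this sketch.

```python
import json
from typing import Any, Dict, Iterable


def new_insight_pattern(severities: Iterable[str]) -> Dict[str, Any]:
    """Event pattern matching newly opened insights with the given severities."""
    return {
        "source": ["aws.devops-guru"],
        "detail-type": ["DevOps Guru New Insight Open"],
        # Assumed field name; verify against a sample event.
        "detail": {"insightSeverity": [s.lower() for s in severities]},
    }


def route_insight(detail: Dict[str, Any]) -> str:
    """Pick a notification channel from the event's detail section."""
    severity = str(detail.get("insightSeverity", "")).lower()
    insight_type = str(detail.get("insightType", "")).upper()
    # Page only for high-severity reactive insights; everything else,
    # including events with missing fields, goes to chat.
    if severity == "high" and insight_type == "REACTIVE":
        return "pagerduty"
    return "slack"


if __name__ == "__main__":
    print(json.dumps(new_insight_pattern(["high"]), indent=2))
    print(route_insight({"insightSeverity": "high", "insightType": "REACTIVE"}))
    print(route_insight({"insightSeverity": "medium", "insightType": "REACTIVE"}))
```

In practice the same split could be expressed as two EventBridge rules with different targets. A routing function like this is mainly useful when one Lambda function handles all insight events.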
