What Is CloudWatch?
CloudWatch is AWS's native observability service. It collects metrics (CPU, request counts, and, with the CloudWatch agent installed, memory), stores logs, triggers alarms, and provides dashboards. Most AWS services publish metrics to CloudWatch by default.
It's not the most elegant observability platform (Datadog, Grafana Cloud, and New Relic have better UX), but it's already there, already collecting data, and already integrated with every service. For most teams, CloudWatch is sufficient with the right configuration.
Metrics
Default metrics (free)
AWS publishes these automatically:
- EC2: CPUUtilization, NetworkIn/Out, DiskReadOps (5-minute resolution)
- Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExecutions
- DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
- ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count
- SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage
Custom metrics
Publish your own application metrics:
import { CloudWatch } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatch({});

await cw.putMetricData({
  Namespace: 'MyApp',
  MetricData: [{
    MetricName: 'OrdersProcessed',
    Value: 1,
    Unit: 'Count',
    Dimensions: [{ Name: 'Environment', Value: 'prod' }],
  }],
});
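Custom metrics cost $0.30/metric/month for the first 10,000 metrics, with lower per-metric rates at higher volumes. Every unique combination of metric name and dimension values counts as a separate metric. Embedded Metric Format (see below) does not lower the per-metric price; it avoids PutMetricData API charges and gives you a log entry alongside each metric.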
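Because each dimension combination is billed separately, it helps to estimate the bill before adding a dimension. The sketch below is a rough upper-bound estimate, assuming the $0.30 rate quoted above applies to every metric; the `CustomMetricPlan` type and the cardinalities are hypothetical.

```typescript
// Illustrative rate taken from the text; real pricing varies by region and volume tier.
const PRICE_PER_METRIC_MONTH_USD = 0.30;

interface CustomMetricPlan {
  metricName: string;
  // How many distinct values each dimension can take.
  dimensions: { name: string; distinctValues: number }[];
}

// Upper bound: every combination of dimension values that reports data
// is treated as its own metric.
function countBilledMetrics(plan: CustomMetricPlan): number {
  return plan.dimensions.reduce((product, d) => product * d.distinctValues, 1);
}

function estimateMonthlyCostUsd(plans: CustomMetricPlan[]): number {
  const totalMetrics = plans.reduce((sum, p) => sum + countBilledMetrics(p), 0);
  return totalMetrics * PRICE_PER_METRIC_MONTH_USD;
}

const metricPlans: CustomMetricPlan[] = [
  { metricName: 'OrdersProcessed', dimensions: [{ name: 'Environment', distinctValues: 3 }] },
  {
    metricName: 'ProcessingTime',
    dimensions: [
      { name: 'Environment', distinctValues: 3 },
      { name: 'Service', distinctValues: 12 },
    ],
  },
];

console.log(`Estimated custom metric cost: $${estimateMonthlyCostUsd(metricPlans).toFixed(2)}/month`);
```

High-cardinality dimensions such as user or request IDs multiply this number quickly, so they usually belong in logs rather than in metric dimensions.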
Embedded Metric Format (EMF)
The cost-effective way to publish custom metrics. Write structured JSON to CloudWatch Logs, and CloudWatch automatically extracts metrics from it. No PutMetricData API calls needed:
{
  "_aws": {
    "Timestamp": 1700000000000,
    "CloudWatchMetrics": [{
      "Namespace": "MyApp",
      "Dimensions": [["Environment", "Service"]],
      "Metrics": [
        { "Name": "ProcessingTime", "Unit": "Milliseconds" },
        { "Name": "ItemsProcessed", "Unit": "Count" }
      ]
    }]
  },
  "Environment": "prod",
  "Service": "order-processor",
  "ProcessingTime": 245,
  "ItemsProcessed": 12
}
Note that Timestamp is in milliseconds since the Unix epoch, not seconds. Benefits: you get both the metric (for alarms and dashboards) and the log entry (for debugging) from one write.
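Writing that JSON by hand is error-prone, so a small helper is useful. The following is a minimal sketch, not an official library: `buildEmfRecord` and its types are hypothetical names, and it assumes the runtime ships stdout to CloudWatch Logs (as Lambda does).

```typescript
type EmfUnit = 'Milliseconds' | 'Count';

interface EmfMetric {
  name: string;
  unit: EmfUnit;
  value: number;
}

// Builds one EMF log record: metadata under "_aws", then dimension values
// and metric values as top-level keys.
function buildEmfRecord(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: EmfMetric[],
  timestampMs: number = Date.now(), // EMF expects epoch milliseconds
): Record<string, unknown> {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [{
        Namespace: namespace,
        Dimensions: [Object.keys(dimensions)],
        Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
      }],
    },
  };
  for (const key of Object.keys(dimensions)) {
    record[key] = dimensions[key];
  }
  for (const m of metrics) {
    record[m.name] = m.value;
  }
  return record;
}

// One log line per record; CloudWatch extracts the metrics from it.
console.log(JSON.stringify(buildEmfRecord(
  'MyApp',
  { Environment: 'prod', Service: 'order-processor' },
  [
    { name: 'ProcessingTime', unit: 'Milliseconds', value: 245 },
    { name: 'ItemsProcessed', unit: 'Count', value: 12 },
  ],
)));
```

For production use, AWS also publishes EMF client libraries that handle batching and validation; the helper above only illustrates the record layout.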
Alarms
Key alarms every production system needs
| Service | Alarm on | Threshold |
|---|---|---|
| Lambda | Errors | > 0 for 1 minute |
| Lambda | Throttles | > 0 for 1 minute |
| Lambda | Duration | > 80% of timeout |
| DynamoDB | ThrottledRequests | > 0 for 5 minutes |
| SQS | ApproximateAgeOfOldestMessage | > your SLA |
| SQS DLQ | ApproximateNumberOfMessagesVisible | > 0 |
| ALB | HTTP 5xx | > 1% of requests |
| ALB | TargetResponseTime | p99 > your normal baseline |
| RDS | CPUUtilization | > 80% for 5 minutes |
| RDS | FreeableMemory | < 10% of instance memory for 5 minutes |
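A percentage threshold such as "5xx > 1% of requests" is not a single metric, so it needs a metric math expression. The sketch below builds a plain object shaped like a PutMetricAlarm request; it is typed locally to stay dependency-free, and the alarm name, load balancer value, period, and evaluation count are assumptions to adjust.

```typescript
// Local, simplified mirror of the MetricDataQuery structure used by PutMetricAlarm.
interface MetricDataQuery {
  Id: string;
  ReturnData: boolean;
  Expression?: string;
  Label?: string;
  MetricStat?: {
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions: { Name: string; Value: string }[];
    };
    Period: number;
    Stat: string;
  };
}

function albMetric(id: string, metricName: string, loadBalancer: string): MetricDataQuery {
  return {
    Id: id,
    ReturnData: false, // inputs only; the expression below is what the alarm evaluates
    MetricStat: {
      Metric: {
        Namespace: 'AWS/ApplicationELB',
        MetricName: metricName,
        Dimensions: [{ Name: 'LoadBalancer', Value: loadBalancer }],
      },
      Period: 60,
      Stat: 'Sum',
    },
  };
}

function build5xxRateAlarm(alarmName: string, loadBalancer: string, thresholdPercent: number) {
  const metrics: MetricDataQuery[] = [
    albMetric('m5xx', 'HTTPCode_Target_5XX_Count', loadBalancer),
    albMetric('mReq', 'RequestCount', loadBalancer),
    { Id: 'errorRate', Expression: '100 * m5xx / mReq', Label: '5xx error rate (%)', ReturnData: true },
  ];
  return {
    AlarmName: alarmName,
    ComparisonOperator: 'GreaterThanThreshold',
    Threshold: thresholdPercent,
    EvaluationPeriods: 5,
    TreatMissingData: 'notBreaching', // no traffic means no data, which should not page anyone
    Metrics: metrics,
  };
}

// Hypothetical load balancer dimension value.
const alb5xxAlarm = build5xxRateAlarm('alb-5xx-rate', 'app/my-alb/0123456789abcdef', 1);
console.log(JSON.stringify(alb5xxAlarm, null, 2));
```

The resulting object could be passed to the SDK's `putMetricAlarm` call along with alarm actions; on very low traffic a single failed request can exceed 1%, so a minimum request count is worth considering.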
Composite alarms
Combine multiple alarms to reduce noise:
ALARM("high-error-rate") AND ALARM("high-latency") → page the on-call
ALARM("high-error-rate") AND NOT ALARM("deployment-in-progress") → page the on-call
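The rule expressions above are plain strings, so they are easy to generate and test. This is a small sketch with hypothetical helper names (`inAlarm`, `allOf`, `not`); it only builds the rule text and assumes the referenced alarms already exist.

```typescript
// Helpers that compose a composite alarm rule expression.
const inAlarm = (alarmName: string): string => `ALARM("${alarmName}")`;
const not = (rule: string): string => `NOT ${rule}`;
const allOf = (...rules: string[]): string => `(${rules.join(' AND ')})`;

// Page only when errors are high and no deployment is in progress.
const pageOnCallRule = allOf(
  inAlarm('high-error-rate'),
  not(inAlarm('deployment-in-progress')),
);

console.log(pageOnCallRule);
```

The string can then be supplied as the `AlarmRule` of a composite alarm, with the paging SNS topic attached as the alarm action.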
Alarm actions
- SNS notification (email, PagerDuty, Slack via Lambda)
- Auto-scaling actions
- EC2 instance actions (reboot, stop, terminate)
- Systems Manager automation
Logs Insights
CloudWatch Logs Insights lets you search and analyze logs with a purpose-built query language. It is much faster than scrolling through raw log streams.
Common queries
# Find errors in the last hour (set the time range in the query window)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Lambda duration and cold starts (@initDuration is only present on cold starts)
filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
        max(@duration) as maxDuration,
        count(@initDuration) as coldStarts,
        count(*) as invocations
  by bin(5m)

# Top 10 slowest API requests
filter @message like /Request completed/
| parse @message "duration=* " as duration
| sort duration desc
| limit 10

# Share of log lines containing ERROR, by log group
stats sum(strcontains(@message, "ERROR")) / count(*) * 100 as errorPercent
  by @log
| sort errorPercent desc
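To run queries like these from code rather than the console, the StartQuery API takes the query string plus a time range in epoch seconds. The sketch below only builds that request; `buildStartQueryInput` and the log group name are hypothetical, and the interface is a local, simplified mirror of the SDK input.

```typescript
// Simplified local mirror of the StartQuery request fields used here.
interface StartQueryInput {
  logGroupNames: string[];
  queryString: string;
  startTime: number; // epoch seconds
  endTime: number;   // epoch seconds
  limit?: number;
}

function buildStartQueryInput(
  logGroupNames: string[],
  queryString: string,
  lookbackMinutes: number,
  now: Date = new Date(),
): StartQueryInput {
  // StartQuery expects seconds, while Date works in milliseconds.
  const endTime = Math.floor(now.getTime() / 1000);
  return {
    logGroupNames,
    queryString,
    startTime: endTime - lookbackMinutes * 60,
    endTime,
  };
}

const recentErrorsQuery = buildStartQueryInput(
  ['/aws/lambda/order-processor'], // hypothetical log group
  'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50',
  60,
);

console.log(recentErrorsQuery);
```

With `@aws-sdk/client-cloudwatch-logs`, this object would be passed to `StartQueryCommand`, and the returned query ID polled with `GetQueryResultsCommand` until the query completes.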
Cost
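Logs Insights charges $0.005 per GB of log data scanned, which adds up fast with verbose logging. Narrow the time range and the set of log groups each query touches. Keep log retention short (7-14 days for debug logs, 30-90 days for access logs) and use S3 export for long-term storage.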
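A quick back-of-the-envelope calculation shows how scan charges grow with query volume. The figures below are made-up workload assumptions; only the $0.005 per GB rate comes from the text.

```typescript
const USD_PER_GB_SCANNED = 0.005;

// Monthly scan cost for a query (or dashboard widget) that runs repeatedly.
function monthlyQueryCostUsd(gbScannedPerQuery: number, queriesPerDay: number, days: number = 30): number {
  return gbScannedPerQuery * queriesPerDay * days * USD_PER_GB_SCANNED;
}

// Hypothetical workload: each query scans 200 GB and runs 50 times a day.
const scanCost = monthlyQueryCostUsd(200, 50);
console.log(`Estimated Logs Insights scan cost: $${scanCost.toFixed(2)}/month`);
```

Halving the time range of a frequently run query roughly halves the data scanned, and therefore this figure.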
Log Retention
By default, CloudWatch log groups never expire, so storage costs accumulate indefinitely. Set a retention policy on every log group:
| Log type | Recommended retention |
|---|---|
| Lambda execution logs | 7-14 days |
| Application debug logs | 7 days |
| Access/audit logs | 90 days (export to S3 for long-term) |
| Error logs | 30 days |
| VPC Flow Logs | 14 days (export to S3) |
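One way to apply the table consistently is to map log group name prefixes to retention periods and generate the policy requests from that map. This is a sketch under assumptions: apart from `/aws/lambda/`, the prefixes are hypothetical naming conventions, and `retentionFor` is an illustrative helper, not an AWS API.

```typescript
interface RetentionRule {
  prefix: string;
  retentionInDays: number;
}

// First matching prefix wins. Only "/aws/lambda/" is a standard AWS prefix;
// the rest are placeholder conventions for this example.
const retentionRules: RetentionRule[] = [
  { prefix: '/aws/lambda/', retentionInDays: 14 },
  { prefix: '/app/debug/', retentionInDays: 7 },
  { prefix: '/app/access/', retentionInDays: 90 },
  { prefix: '/app/errors/', retentionInDays: 30 },
  { prefix: '/vpc/flow-logs/', retentionInDays: 14 },
];

// Returns an object shaped like a PutRetentionPolicy request.
function retentionFor(
  logGroupName: string,
  fallbackDays: number = 30,
): { logGroupName: string; retentionInDays: number } {
  for (const rule of retentionRules) {
    if (logGroupName.indexOf(rule.prefix) === 0) {
      return { logGroupName, retentionInDays: rule.retentionInDays };
    }
  }
  return { logGroupName, retentionInDays: fallbackDays };
}

console.log(retentionFor('/aws/lambda/order-processor'));
console.log(retentionFor('/some/unclassified/group'));
```

Running this over the output of a log group listing, then calling `putRetentionPolicy` for each result, catches log groups that were created without a policy.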
What to Monitor (Pragmatic Approach)
Don't try to monitor everything. Focus on:
- Golden signals: Latency, traffic, errors, saturation
- SLA-impacting metrics: What would a customer notice?
- Cost signals: Are any resources scaling unexpectedly?
Skip: per-instance CPU on auto-scaled services (who cares if one instance is hot?), detailed network metrics for most apps, per-Lambda memory utilization.
CDK Example
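import { Alarm, ComparisonOperator } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// Assumes orderFunction (Lambda), deadLetterQueue (SQS), and alertTopic (SNS)
// are defined elsewhere in the same stack.

// Alarm as soon as the function reports at least one error in a period
const errorAlarm = new Alarm(this, 'LambdaErrors', {
  metric: orderFunction.metricErrors(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'Order function is throwing errors',
});
errorAlarm.addAlarmAction(new SnsAction(alertTopic));

// DLQ depth alarm: any visible message means processing failed
const dlqAlarm = new Alarm(this, 'DLQNotEmpty', {
  metric: deadLetterQueue.metricApproximateNumberOfMessagesVisible(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
});
dlqAlarm.addAlarmAction(new SnsAction(alertTopic));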