Amazon CloudWatch

What Is CloudWatch?

CloudWatch is AWS's native observability service. It collects metrics (CPU, request counts, error rates), stores logs, triggers alarms, and provides dashboards. Most AWS services publish metrics to CloudWatch by default; a few, such as EC2 memory and disk usage, require the CloudWatch agent.

It's not the most elegant observability platform (Datadog, Grafana Cloud, and New Relic have better UX), but it's already there, already collecting data, and already integrated with every service. For most teams, CloudWatch is sufficient with the right configuration.

Metrics

Default metrics (free)

AWS publishes these automatically:

EC2: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps (5-minute resolution with basic monitoring; DiskReadOps covers instance store volumes only)
Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExecutions
DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count, HTTPCode_ELB_5XX_Count
SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage

Custom metrics

Publish your own application metrics:

import { CloudWatch } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatch({});
await cw.putMetricData({
  Namespace: 'MyApp',
  MetricData: [{
    MetricName: 'OrdersProcessed',
    Value: 1,
    Unit: 'Count',
    Dimensions: [{ Name: 'Environment', Value: 'prod' }],
  }],
});

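Custom metrics cost about $0.30 per metric per month at the first pricing tier (us-east-1 list price; check the current pricing page). Each unique combination of namespace, metric name, and dimension values counts as a separate metric. Metrics extracted from Embedded Metric Format logs (see below) are billed at the same per-metric rate; the saving comes from skipping PutMetricData API request charges, though you pay for log ingestion instead.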
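Because billing is per unique metric, dimension cardinality tends to drive the bill more than the number of metric names. The sketch below is a rough estimator under a flat $0.30 rate; the function name and the flat-rate simplification are illustrative assumptions, not part of any AWS SDK.

```typescript
// Rough monthly cost estimate for custom metrics.
// Assumption: a flat per-metric price; real pricing is tiered and varies by region.
const PRICE_PER_METRIC_USD = 0.30;

/**
 * @param metricNames number of distinct metric names you publish
 * @param dimensionCardinalities number of distinct values for each dimension,
 *        e.g. [3, 20] for 3 environments x 20 services
 */
function estimateCustomMetricCost(
  metricNames: number,
  dimensionCardinalities: number[],
): { uniqueMetrics: number; monthlyUsd: number } {
  // Worst case: every combination of dimension values is actually emitted.
  const combinations = dimensionCardinalities.reduce((acc, n) => acc * n, 1);
  const uniqueMetrics = metricNames * combinations;
  return { uniqueMetrics, monthlyUsd: uniqueMetrics * PRICE_PER_METRIC_USD };
}

// 5 metric names across 3 environments and 20 services.
const estimate = estimateCustomMetricCost(5, [3, 20]);
console.log(estimate.uniqueMetrics, estimate.monthlyUsd.toFixed(2));
```

Adding a per-customer or per-request dimension multiplies the count again, which is why high-cardinality fields are usually better kept in logs than in metric dimensions.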

Embedded Metric Format (EMF)

A cheaper way to publish custom metrics at high volume. Write structured JSON to CloudWatch Logs, and CloudWatch extracts the metrics from it automatically, with no PutMetricData API calls. Timestamp is in milliseconds since the Unix epoch:

{
  "_aws": {
    "Timestamp": 1700000000000,
    "CloudWatchMetrics": [{
      "Namespace": "MyApp",
      "Dimensions": [["Environment", "Service"]],
      "Metrics": [
        { "Name": "ProcessingTime", "Unit": "Milliseconds" },
        { "Name": "ItemsProcessed", "Unit": "Count" }
      ]
    }]
  },
  "Environment": "prod",
  "Service": "order-processor",
  "ProcessingTime": 245,
  "ItemsProcessed": 12
}

Benefits: you get both the metric (for alarms/dashboards) and the log entry (for debugging), from one write.
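To make the format concrete, here is a minimal sketch of a helper that builds one EMF record and writes it to stdout. It assumes stdout is shipped to CloudWatch Logs (true for Lambda; elsewhere that typically needs the CloudWatch agent). The `buildEmfRecord` name and its types are hypothetical; for production use, AWS publishes the `aws-embedded-metrics` library, which handles more cases.

```typescript
// Minimal EMF builder: returns one JSON line to write to stdout.
// Assumption: a single dimension set containing every dimension key.

type EmfUnit = 'Milliseconds' | 'Count' | 'Bytes' | 'None';

interface EmfMetric {
  name: string;
  unit: EmfUnit;
  value: number;
}

function buildEmfRecord(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: EmfMetric[],
  timestampMs: number = Date.now(),
): string {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs, // epoch milliseconds
      CloudWatchMetrics: [{
        Namespace: namespace,
        Dimensions: [Object.keys(dimensions)],
        Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
      }],
    },
    // Dimension values live at the top level, keyed by dimension name.
    ...dimensions,
  };
  // Metric values also live at the top level, keyed by metric name.
  for (const m of metrics) {
    record[m.name] = m.value;
  }
  return JSON.stringify(record);
}

const line = buildEmfRecord(
  'MyApp',
  { Environment: 'prod', Service: 'order-processor' },
  [
    { name: 'ProcessingTime', unit: 'Milliseconds', value: 245 },
    { name: 'ItemsProcessed', unit: 'Count', value: 12 },
  ],
);
console.log(line);
```

Any extra top-level fields you add (a request ID, for example) stay in the log entry for debugging without becoming metric dimensions.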

Alarms

Key alarms every production system needs

Service	Alarm on	Threshold
Lambda	Errors	> 0 for 1 minute
Lambda	Throttles	> 0 for 1 minute
Lambda	Duration	> 80% of timeout
DynamoDB	ThrottledRequests	> 0 for 5 minutes
SQS	ApproximateAgeOfOldestMessage	> your SLA
SQS DLQ	ApproximateNumberOfMessagesVisible	> 0
ALB	HTTPCode_Target_5XX_Count / RequestCount (metric math)	> 1% of requests
ALB	TargetResponseTime (p99)	> your p99 baseline
RDS	CPUUtilization	> 80% for 5 minutes
RDS	FreeableMemory	< 10% of instance memory for 5 minutes (the metric is reported in bytes)
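A percentage threshold such as "5xx above 1% of requests" is not a single metric; it needs a metric math expression over two metrics. The sketch below builds an alarm definition shaped like the PutMetricAlarm input, with types declared locally so it runs without the AWS SDK. The alarm name, the load balancer value, and the evaluation settings are placeholder assumptions.

```typescript
// Shapes mirror the "Metrics" parameter of PutMetricAlarm (declared locally).
interface MetricStatQuery {
  Id: string;
  MetricStat: {
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions: { Name: string; Value: string }[];
    };
    Period: number;
    Stat: string;
  };
  ReturnData: boolean;
}

interface ExpressionQuery {
  Id: string;
  Expression: string;
  Label: string;
  ReturnData: boolean;
}

function albErrorRateAlarm(loadBalancer: string, thresholdPercent: number) {
  const stat = (id: string, metricName: string): MetricStatQuery => ({
    Id: id,
    MetricStat: {
      Metric: {
        Namespace: 'AWS/ApplicationELB',
        MetricName: metricName,
        Dimensions: [{ Name: 'LoadBalancer', Value: loadBalancer }],
      },
      Period: 60,
      Stat: 'Sum',
    },
    ReturnData: false, // inputs only; the alarm watches the expression
  });

  const rate: ExpressionQuery = {
    Id: 'errorRate',
    // FILL treats periods with no 5xx datapoints as zero errors.
    Expression: '100 * FILL(errors, 0) / requests',
    Label: '5xx error rate (%)',
    ReturnData: true,
  };

  return {
    AlarmName: 'alb-5xx-rate', // hypothetical name
    Metrics: [
      stat('errors', 'HTTPCode_Target_5XX_Count'),
      stat('requests', 'RequestCount'),
      rate,
    ],
    ComparisonOperator: 'GreaterThanThreshold',
    Threshold: thresholdPercent,
    EvaluationPeriods: 5,
    TreatMissingData: 'notBreaching', // no traffic should not page anyone
  };
}

// 'app/my-alb/0123456789abcdef' is a placeholder LoadBalancer dimension value.
const alarmInput = albErrorRateAlarm('app/my-alb/0123456789abcdef', 1);
console.log(JSON.stringify(alarmInput, null, 2));
```

The resulting object could be passed to the SDK's `putMetricAlarm` call after adding alarm actions; verify the field names against the current API reference before relying on it.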

Composite alarms

Combine multiple alarms to reduce noise:

ALARM("high-error-rate") AND ALARM("high-latency") → page the on-call
ALARM("high-error-rate") AND NOT ALARM("deployment-in-progress") → page the on-call
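Composite alarm rules are plain strings, so they are easy to get wrong when assembled by hand. As an illustration, the sketch below models rules as a small tree and renders the rule expression; the `Rule` type and helper names are hypothetical, not an AWS API.

```typescript
// A tiny rule tree that renders to a composite alarm rule expression.
type Rule =
  | { kind: 'alarm'; name: string }
  | { kind: 'not'; rule: Rule }
  | { kind: 'and' | 'or'; rules: Rule[] };

const alarm = (name: string): Rule => ({ kind: 'alarm', name });
const not = (rule: Rule): Rule => ({ kind: 'not', rule });
const and = (...rules: Rule[]): Rule => ({ kind: 'and', rules });
const or = (...rules: Rule[]): Rule => ({ kind: 'or', rules });

function render(rule: Rule): string {
  switch (rule.kind) {
    case 'alarm':
      return 'ALARM("' + rule.name + '")';
    case 'not':
      return 'NOT ' + render(rule.rule);
    case 'and':
    case 'or': {
      // Parenthesize every group so nesting never changes precedence.
      const separator = ' ' + rule.kind.toUpperCase() + ' ';
      return '(' + rule.rules.map(render).join(separator) + ')';
    }
  }
}

// Page only when errors are high and no deployment is running.
const pageRule = render(
  and(alarm('high-error-rate'), not(alarm('deployment-in-progress'))),
);
console.log(pageRule);
// prints: (ALARM("high-error-rate") AND NOT ALARM("deployment-in-progress"))
```

The rendered string is what you would supply as the alarm rule when creating the composite alarm; the action on that alarm (for example an SNS topic) does the paging.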

Alarm actions

SNS notification (email, PagerDuty, Slack via Lambda)
Auto Scaling actions
EC2 instance actions (reboot, stop, terminate, recover)
Systems Manager actions (create an OpsItem or an Incident Manager incident)

Logs Insights

CloudWatch Logs Insights lets you search and analyze logs with a purpose-built query language, which is much faster than scrolling through raw log streams.

Common queries

# Find errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Lambda duration and cold starts (@initDuration is only present on cold starts)
filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
        max(@duration) as maxDuration,
        count(@initDuration) as coldStarts,
        count(*) as invocations
  by bin(5m)

# Top 10 slowest API requests
filter @message like /Request completed/
| parse @message "duration=* " as duration
| sort duration desc
| limit 10

# Share of log lines containing "ERROR", by log group
# (REPORT lines never contain "ERROR", so do not filter on @type here)
stats sum(strcontains(@message, "ERROR")) / count(*) * 100 as errorRate
  by @log
| sort errorRate desc

Cost

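$0.005 per GB of log data scanned (us-east-1 list price). This adds up fast with verbose logging, so scope each query to the narrowest time range and fewest log groups you can. Keep log retention short (7-14 days for debug logs, 30-90 days for access logs) and export to S3 for long-term storage.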
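The arithmetic is simple but worth running before putting a query on a dashboard that refreshes all day. A minimal sketch, assuming the $0.005 per GB rate above; the function name and the example volumes are illustrative.

```typescript
// Assumption: flat price per GB scanned; check the current rate for your region.
const PRICE_PER_GB_SCANNED_USD = 0.005;

function monthlyQueryCostUsd(
  gbPerQuery: number,
  queriesPerDay: number,
  days: number = 30,
): number {
  return gbPerQuery * queriesPerDay * days * PRICE_PER_GB_SCANNED_USD;
}

// A widget that scans 50 GB per run, refreshed hourly.
const broadCost = monthlyQueryCostUsd(50, 24);
// The same widget scoped down to 2 GB per run.
const scopedCost = monthlyQueryCostUsd(2, 24);
console.log(broadCost.toFixed(2), scopedCost.toFixed(2));
```

Under these assumptions the broad query costs roughly $180 a month and the scoped one about $7, so narrowing the time range and log groups matters more than any other tuning.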

Log Retention

Log groups in CloudWatch Logs never expire by default, so storage costs accumulate indefinitely. Set a retention policy on every log group:

Log type	Recommended retention
Lambda execution logs	7-14 days
Application debug logs	7 days
Access/audit logs	90 days (export to S3 for long-term)
Error logs	30 days
VPC Flow Logs	14 days (export to S3)
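Retention policies only accept specific day counts, so an arbitrary value such as 10 or 45 is rejected. The sketch below rounds a desired retention up to the next accepted value and applies it to the table above. The list of accepted values reflects my understanding at the time of writing and should be checked against the current PutRetentionPolicy documentation; the log-type keys are hypothetical labels.

```typescript
// Assumption: retention values (days) accepted by CloudWatch Logs.
// Verify against the current API reference before relying on this list.
const ALLOWED_RETENTION_DAYS: number[] = [
  1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
  1096, 1827, 2192, 2557, 2922, 3288, 3653,
];

// Round a desired retention up to the nearest accepted value.
function roundUpRetention(days: number): number {
  const match = ALLOWED_RETENTION_DAYS.find((allowed) => allowed >= days);
  if (match === undefined) {
    throw new RangeError('No accepted retention period is >= ' + days + ' days');
  }
  return match;
}

// The recommendations from the table, keyed by a hypothetical log-type label.
const retentionPolicy: Record<string, number> = {
  'lambda-execution': roundUpRetention(14),
  'app-debug': roundUpRetention(7),
  'access-audit': roundUpRetention(90),
  'error': roundUpRetention(30),
  'vpc-flow': roundUpRetention(14),
};
console.log(retentionPolicy);
```

Rounding up rather than down keeps logs at least as long as intended; each value would then be passed as the retention in days when setting the policy on the matching log group.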

What to Monitor (Pragmatic Approach)

Don't try to monitor everything. Focus on:

Golden signals: Latency, traffic, errors, saturation
SLA-impacting metrics: What would a customer notice?
Cost signals: Are any resources scaling unexpectedly?

Skip: per-instance CPU on auto-scaled services (who cares if one instance is hot?), detailed network metrics for most apps, per-Lambda memory utilization.

CDK Example

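import { Alarm, ComparisonOperator } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// Assumes orderFunction (Lambda function), alertTopic (SNS topic), and
// deadLetterQueue (SQS queue) are already defined in the enclosing construct.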

const errorAlarm = new Alarm(this, 'LambdaErrors', {
  metric: orderFunction.metricErrors(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'Order function is throwing errors',
});

errorAlarm.addAlarmAction(new SnsAction(alertTopic));

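// DLQ depth alarm: any message in the dead-letter queue needs attention
const dlqAlarm = new Alarm(this, 'DLQNotEmpty', {
  metric: deadLetterQueue.metricApproximateNumberOfMessagesVisible(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'Messages are landing in the dead-letter queue',
});

dlqAlarm.addAlarmAction(new SnsAction(alertTopic));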