Amazon CloudWatch

Observability on AWS: metrics, alarms, Logs Insights, Embedded Metric Format, and what you actually need to monitor.

What Is CloudWatch?

CloudWatch is AWS's native observability service. It collects metrics (CPU, request counts, and, with the CloudWatch agent installed, memory), stores logs, triggers alarms, and provides dashboards. Most AWS services publish metrics to CloudWatch by default.

It's not the most elegant observability platform (Datadog, Grafana Cloud, and New Relic have better UX), but it's already there, already collecting data, and already integrated with every service. For most teams, CloudWatch is sufficient with the right configuration.

Metrics

Default metrics (free)

AWS publishes these automatically:

  • EC2: CPUUtilization, NetworkIn/Out, DiskReadOps (5-minute resolution with basic monitoring; memory and disk-space usage require the CloudWatch agent)
  • Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExecutions
  • DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
  • ALB: RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count, HTTPCode_Target_5XX_Count
  • SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage

Custom metrics

Publish your own application metrics:

import { CloudWatch } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatch({});
await cw.putMetricData({
  Namespace: 'MyApp',
  MetricData: [{
    MetricName: 'OrdersProcessed',
    Value: 1,
    Unit: 'Count',
    Dimensions: [{ Name: 'Environment', Value: 'prod' }],
  }],
});

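Custom metrics cost roughly $0.30/metric/month (standard resolution) for the first 10,000 metrics, with lower per-metric rates at higher volumes; check current pricing for your region. Each unique combination of metric name and dimension values is billed as a separate metric. Embedded Metric Format (see below) does not lower the per-metric price: it saves the PutMetricData API calls, and you pay log ingestion for the JSON instead.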
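Because billing is per unique metric, dimensions multiply the bill. The sketch below is a rough upper-bound estimate, assuming the $0.30 figure quoted above and assuming every dimension combination actually occurs; the function names and the example cardinalities are hypothetical.

```typescript
// Rough custom-metric cost estimate. Every unique combination of metric name
// and dimension values counts as a separate billed metric.
const PRICE_PER_METRIC_USD = 0.30; // figure quoted in the text; verify for your region

function countBilledMetrics(metricNames: string[], dimensionCardinalities: number[]): number {
  // Number of distinct dimension-value combinations (1 when there are no dimensions).
  const combinations = dimensionCardinalities.reduce((product, n) => product * n, 1);
  return metricNames.length * combinations;
}

function estimateMonthlyCostUsd(metricCount: number, pricePerMetric: number = PRICE_PER_METRIC_USD): number {
  return metricCount * pricePerMetric;
}

// Hypothetical app: 2 metrics, dimensioned by Environment (3 values) and Service (10 values).
const billed = countBilledMetrics(['OrdersProcessed', 'ProcessingTime'], [3, 10]);
console.log(`${billed} billed metrics, about $${estimateMonthlyCostUsd(billed).toFixed(2)} per month`);
```

Adding a high-cardinality dimension such as a customer ID is the usual way this number gets out of hand.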

Embedded Metric Format (EMF)

A cost-effective way to publish custom metrics at volume. Write structured JSON to CloudWatch Logs, and CloudWatch automatically extracts metrics from it. No PutMetricData API calls are needed. Note that Timestamp is in milliseconds since the Unix epoch:

{
  "_aws": {
    "Timestamp": 1700000000000,
    "CloudWatchMetrics": [{
      "Namespace": "MyApp",
      "Dimensions": [["Environment", "Service"]],
      "Metrics": [
        { "Name": "ProcessingTime", "Unit": "Milliseconds" },
        { "Name": "ItemsProcessed", "Unit": "Count" }
      ]
    }]
  },
  "Environment": "prod",
  "Service": "order-processor",
  "ProcessingTime": 245,
  "ItemsProcessed": 12
}

Benefits: you get both the metric (for alarms/dashboards) and the log entry (for debugging), from one write.
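A minimal sketch of producing such a record from application code, assuming the process's stdout is shipped to CloudWatch Logs (as it is in Lambda). The helper name buildEmfLine and its parameter shapes are hypothetical; only the JSON layout comes from the example above.

```typescript
type EmfUnit = 'Milliseconds' | 'Count';

interface EmfMetric {
  name: string;
  unit: EmfUnit;
  value: number;
}

// Builds one EMF log line. Dimension names are listed in the _aws metadata;
// dimension values and metric values are top-level members of the same object.
function buildEmfLine(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: EmfMetric[],
  timestampMs: number = Date.now(), // milliseconds since the Unix epoch
): string {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [{
        Namespace: namespace,
        Dimensions: [Object.keys(dimensions)],
        Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
      }],
    },
    ...dimensions,
  };
  for (const m of metrics) {
    record[m.name] = m.value;
  }
  return JSON.stringify(record);
}

// One write produces both the log entry and the metrics.
const line = buildEmfLine(
  'MyApp',
  { Environment: 'prod', Service: 'order-processor' },
  [
    { name: 'ProcessingTime', unit: 'Milliseconds', value: 245 },
    { name: 'ItemsProcessed', unit: 'Count', value: 12 },
  ],
);
console.log(line);
```

Extra top-level fields (a request ID, for example) can be added to the record for debugging without becoming metrics, as long as they are not listed under Metrics.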

Alarms

Key alarms every production system needs

Service    Alarm on                             Threshold
Lambda     Errors                               > 0 for 1 minute
Lambda     Throttles                            > 0 for 1 minute
Lambda     Duration                             > 80% of timeout
DynamoDB   ThrottledRequests                    > 0 for 5 minutes
SQS        ApproximateAgeOfOldestMessage        > your SLA
SQS DLQ    ApproximateNumberOfMessagesVisible   > 0
ALB        HTTP 5xx                             > 1% of requests
ALB        TargetResponseTime                   > P99 baseline
RDS        CPUUtilization                       > 80% for 5 minutes
RDS        FreeableMemory                       < 10% for 5 minutes
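Several thresholds in this table are relative ("80% of timeout", "10%", "1% of requests"), while a CloudWatch alarm compares a metric against an absolute number. A small sketch of the conversions, relying on the fact that Lambda Duration is reported in milliseconds and RDS FreeableMemory in bytes; the helper names are hypothetical.

```typescript
// Lambda Duration is reported in milliseconds; timeouts are configured in seconds.
function lambdaDurationThresholdMs(timeoutSeconds: number, fraction: number = 0.8): number {
  return timeoutSeconds * 1000 * fraction;
}

// RDS FreeableMemory is reported in bytes, so "10%" must be converted
// using the instance's total memory.
function freeableMemoryThresholdBytes(instanceMemoryGiB: number, fraction: number = 0.1): number {
  return Math.floor(instanceMemoryGiB * 1024 ** 3 * fraction);
}

// ALB 5xx rate as a percentage of requests; 0 when there was no traffic,
// so quiet periods do not register as errors.
function errorRatePercent(count5xx: number, requestCount: number): number {
  return requestCount === 0 ? 0 : (count5xx / requestCount) * 100;
}

console.log(lambdaDurationThresholdMs(30));    // threshold for a 30-second timeout
console.log(freeableMemoryThresholdBytes(16)); // threshold for a 16 GiB instance
console.log(errorRatePercent(5, 1000));        // 5 errors out of 1000 requests
```

In CloudWatch itself the percentage is usually computed with a metric math expression over the 5xx count and RequestCount, rather than in application code.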

Composite alarms

Combine multiple alarms to reduce noise:

ALARM("high-error-rate") AND ALARM("high-latency") → page the on-call
ALARM("high-error-rate") AND NOT ALARM("deployment-in-progress") → page the on-call
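The part before the arrow is the composite alarm's rule expression. A minimal sketch of building those rule strings in code, so alarm names are not hand-typed into long expressions; the helper names (alarm, not, allOf, anyOf) are hypothetical, and the resulting string is what you would supply as the alarm rule when creating the composite alarm.

```typescript
// Tiny helpers for composing composite alarm rule expressions.
const alarm = (name: string): string => `ALARM("${name}")`;
const not = (expr: string): string => `NOT ${expr}`;
const allOf = (...exprs: string[]): string => exprs.join(' AND ');
// OR groups are parenthesized so they can be nested inside AND safely.
const anyOf = (...exprs: string[]): string => `(${exprs.join(' OR ')})`;

// Page only when errors AND latency are both bad.
const errorsAndLatency = allOf(alarm('high-error-rate'), alarm('high-latency'));

// Page on errors, but stay quiet while a deployment is in progress.
const errorsOutsideDeploy = allOf(
  alarm('high-error-rate'),
  not(alarm('deployment-in-progress')),
);

console.log(errorsAndLatency);
console.log(errorsOutsideDeploy);
```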

Alarm actions

  • SNS notification (email, PagerDuty, Slack via Lambda)
  • Auto-scaling actions
  • EC2 instance actions (reboot, stop, terminate)
  • Systems Manager automation

Logs Insights

CloudWatch Logs Insights is a query language for searching and analyzing logs. It is much faster than scrolling through raw logs.

Common queries

# Find errors (set the time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Lambda duration and cold starts
# (@initDuration is only present on REPORT lines for cold starts)
filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
        max(@duration) as maxDuration,
        count(@initDuration) as coldStarts,
        count(*) as invocations
  by bin(5m)

# Top 10 slowest API requests
filter @message like /Request completed/
| parse @message "duration=* " as duration
| sort duration desc
| limit 10

# Percentage of log lines containing "ERROR", by log group
# (REPORT lines never contain "ERROR", so do not filter on @type here)
stats sum(strcontains(@message, "ERROR")) / count(*) * 100 as errorLinePct
  by @log
| sort errorLinePct desc
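To run queries like these from code rather than the console, the query string is submitted together with a time range. A sketch of building that request, assuming the StartQuery operation's input shape (log group names, query string, and start/end times in epoch seconds); the helper lastMinutesQuery and the log group name are hypothetical, and the actual SDK call is left out so the block stays self-contained.

```typescript
// Mirrors the fields a Logs Insights StartQuery request needs.
interface InsightsQueryInput {
  logGroupNames: string[];
  queryString: string;
  startTime: number; // epoch seconds
  endTime: number;   // epoch seconds
}

// Builds a request covering the last `minutes` minutes before `now`.
function lastMinutesQuery(
  logGroupNames: string[],
  queryString: string,
  minutes: number,
  now: Date = new Date(),
): InsightsQueryInput {
  const endTime = Math.floor(now.getTime() / 1000);
  return { logGroupNames, queryString, startTime: endTime - minutes * 60, endTime };
}

const recentErrors = lastMinutesQuery(
  ['/aws/lambda/order-processor'], // hypothetical log group
  'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50',
  60,
);
console.log(recentErrors);
```

Keeping the time range and the list of log groups as narrow as possible also limits how much data each query scans.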

Cost

$0.005 per GB scanned. This adds up fast with verbose logging. Keep log retention short (7-14 days for debug logs, 30-90 days for access logs) and use S3 export for long-term storage.
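To see how quickly scanning adds up, here is a back-of-the-envelope estimate using the $0.005 per GB figure above. The workload numbers are made up for illustration, and the function name is hypothetical.

```typescript
const PRICE_PER_GB_SCANNED_USD = 0.005; // figure quoted in the text; verify for your region

// Monthly cost of a query that is re-run on a schedule (for example, by a dashboard).
function monthlyQueryCostUsd(
  gbScannedPerQuery: number,
  queriesPerDay: number,
  daysPerMonth: number = 30,
): number {
  return gbScannedPerQuery * queriesPerDay * daysPerMonth * PRICE_PER_GB_SCANNED_USD;
}

// Hypothetical: a dashboard widget scanning 50 GB, refreshed 100 times a day.
console.log(`about $${monthlyQueryCostUsd(50, 100).toFixed(2)} per month`);
```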

Log Retention

CloudWatch Logs never expire by default. You accumulate storage cost forever. Set retention policies:

Log type                 Recommended retention
Lambda execution logs    7-14 days
Application debug logs   7 days
Access/audit logs        90 days (export to S3 for long-term)
Error logs               30 days
VPC Flow Logs            14 days (export to S3)
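One practical wrinkle: a log group's retention setting only accepts certain values, so an arbitrary number such as 10 days has to be rounded. A sketch that maps the table above to valid settings, assuming the allowed-values list below matches what the retention policy API currently accepts (verify against the API reference); the names are hypothetical.

```typescript
// Retention values (in days) assumed to be accepted by CloudWatch Logs.
const ALLOWED_RETENTION_DAYS: number[] = [
  1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
  1096, 1827, 2192, 2557, 2922, 3288, 3653,
];

// Rounds up to the nearest allowed value, so logs are never kept for
// less time than requested.
function nearestAllowedRetention(desiredDays: number): number {
  const match = ALLOWED_RETENTION_DAYS.find((days) => days >= desiredDays);
  if (match === undefined) {
    throw new RangeError(`No allowed retention of at least ${desiredDays} days`);
  }
  return match;
}

// Recommendations from the table, using the upper end of each range.
const retentionByLogType: Record<string, number> = {
  'lambda-execution': nearestAllowedRetention(14),
  'application-debug': nearestAllowedRetention(7),
  'access-audit': nearestAllowedRetention(90),
  'error': nearestAllowedRetention(30),
  'vpc-flow': nearestAllowedRetention(14),
};
console.log(retentionByLogType);
```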

What to Monitor (Pragmatic Approach)

Don't try to monitor everything. Focus on:

  1. Golden signals: Latency, traffic, errors, saturation
  2. SLA-impacting metrics: What would a customer notice?
  3. Cost signals: Are any resources scaling unexpectedly?

Skip: per-instance CPU on auto-scaled services (who cares if one instance is hot?), detailed network metrics for most apps, per-Lambda memory utilization.

CDK Example

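import { Alarm, ComparisonOperator } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// Inside a Stack or Construct. orderFunction (Lambda function), deadLetterQueue
// (SQS queue), and alertTopic (SNS topic) are assumed to be defined elsewhere.

// Alarm as soon as the function reports at least one error in a period
const errorAlarm = new Alarm(this, 'LambdaErrors', {
  metric: orderFunction.metricErrors(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'Order function is throwing errors',
});

errorAlarm.addAlarmAction(new SnsAction(alertTopic));

// DLQ depth alarm: any visible message in the DLQ means processing failed
const dlqAlarm = new Alarm(this, 'DLQNotEmpty', {
  metric: deadLetterQueue.metricApproximateNumberOfMessagesVisible(),
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'Messages are landing in the dead-letter queue',
});

// Without an action the alarm changes state but notifies no one
dlqAlarm.addAlarmAction(new SnsAction(alertTopic));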
