
Lambda Durable Functions

Build fault-tolerant multi-step workflows directly in Lambda: checkpointing, automatic retry, long-running waits, and how it compares to Step Functions.

What Are Lambda Durable Functions?

Lambda durable functions extend the Lambda programming model with primitives for multi-step, fault-tolerant workflows. You write sequential code in your handler, and Lambda automatically checkpoints progress, retries failed steps, and suspends execution (for up to one year) while waiting on external events, without charging for idle compute.

Think of it as Step Functions semantics (checkpointing, retry, wait states) embedded directly in your Lambda code rather than defined in a separate state machine.

Announced at re:Invent 2025. Available in expanding regions.

Core Concepts

Steps

A step is a unit of work that gets checkpointed. Once completed, it won't re-execute during replay:

# If the function crashes after step 1, replay skips step 1 and resumes at step 2
result_1 = context.step(validate_order(order_id))
result_2 = context.step(charge_payment(order_id, result_1["amount"]))
result_3 = context.step(send_confirmation(order_id))

Waits

Suspend execution without paying for compute. The function terminates, and Lambda resumes it after the specified duration:

# Function terminates here: no compute charges during the wait
context.wait(duration=Duration.from_hours(24))
# Execution resumes here 24 hours later
context.step(send_followup_email(order_id))
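
To make the suspend-and-resume behavior concrete, here is a minimal simulation in plain Python. It does not use the Lambda SDK: ToyContext, Suspended, and the hour-based clock are hypothetical stand-ins, and the real service persists the wake-up time for you.

```python
class Suspended(Exception):
    """Raised by the toy runtime to end the invocation until a wake-up time."""

    def __init__(self, resume_at):
        super().__init__(f"suspended until t={resume_at}")
        self.resume_at = resume_at


class ToyContext:
    def __init__(self, state, now):
        self.state = state  # persisted between invocations
        self.now = now      # current simulated time, in hours

    def wait(self, name, hours):
        # Record the wake-up time once; replays reuse the recorded value.
        resume_at = self.state.setdefault(name, self.now + hours)
        if self.now < resume_at:
            raise Suspended(resume_at)  # invocation ends, nothing keeps running


def handler(ctx):
    ctx.wait("cool-off", hours=24)
    return "follow-up sent"


state = {}
try:
    handler(ToyContext(state, now=0))
    outcome_first = "completed"
except Suspended as suspended:
    outcome_first = suspended.resume_at

# A later invocation, 24 simulated hours on, replays the handler and gets past the wait.
outcome_second = handler(ToyContext(state, now=24))
```

The first invocation ends at the wait and records a wake-up time of 24; the second one runs the same handler from the top and completes.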

Callbacks

Pause execution until an external system signals completion (human approval, webhook, external API):

callback = context.create_callback(name="approval", config=CallbackConfig(timeout=Duration.from_days(7)))
context.step(request_approval(callback.callback_id, order_id))

# Function terminates and resumes when the external system calls SendDurableExecutionCallbackSuccess
approval_result = callback.result()
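
The callback flow can be modeled the same way. The sketch below is a simulation, not the SDK: ToyCallback, Suspended, and the two in-memory collections are hypothetical, with a dict standing in for the result that the external system would report.

```python
class Suspended(Exception):
    """Ends the toy invocation while a callback is still pending."""


class ToyCallback:
    def __init__(self, callback_id, results):
        self.callback_id = callback_id
        self._results = results  # results reported by external systems

    def result(self):
        if self.callback_id not in self._results:
            raise Suspended(self.callback_id)
        return self._results[self.callback_id]


def handler(results, sent):
    callback = ToyCallback("approval-1", results)
    if callback.callback_id not in sent:   # stands in for a checkpointed step
        sent.append(callback.callback_id)  # "notify the approver" exactly once
    return callback.result()


results, sent = {}, []
try:
    handler(results, sent)
    first = "completed"
except Suspended:
    first = "suspended"

results["approval-1"] = {"approved": True}  # the external system reports success
second = handler(results, sent)             # replay now returns the result
```

The approver is notified only once even though the handler runs twice, which is the property the checkpointed step provides in the real service.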

Checkpoint and Replay

The mechanism that makes it all work. When a durable function fails or resumes from a wait:

  1. Lambda replays the handler from the beginning
  2. Completed steps are skipped (their results are returned from checkpoint storage)
  3. Execution continues from where it left off
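
The three replay stages above can be sketched with a toy engine. This is an illustration of the idea only: ToyContext, its name-keyed dict, and the step(name, fn) signature are assumptions made for the example, not the Lambda SDK.

```python
class ToyContext:
    def __init__(self, checkpoints):
        self.checkpoints = checkpoints  # survives across replays
        self.executed = []              # steps that actually ran in this invocation

    def step(self, name, fn):
        if name in self.checkpoints:     # completed earlier: return the stored result
            return self.checkpoints[name]
        result = fn()                    # first time: run the work
        self.checkpoints[name] = result  # checkpoint the result
        self.executed.append(name)
        return result


def handler(ctx, fail_on_charge):
    order = ctx.step("validate", lambda: {"amount": 42})

    def charge():
        if fail_on_charge:
            raise RuntimeError("simulated crash")
        return f"charged {order['amount']}"

    return ctx.step("charge", charge)


store = {}
first = ToyContext(store)
try:
    handler(first, fail_on_charge=True)  # crashes inside the second step
except RuntimeError:
    pass

second = ToyContext(store)               # replay from the top of the handler
result = handler(second, fail_on_charge=False)
```

On the replay, only "charge" executes; "validate" is served from the checkpoint store.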

This means your handler code must be deterministic. Don't use random() or datetime.now() outside of steps, because replay will produce different values.
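
A small simulation shows why. The fake clock and the step helper below are hypothetical and exist only for this example; a counter replaces the real clock so the behavior is reproducible.

```python
import itertools

_clock = itertools.count(100)  # fake clock: a new value on every call


def now():
    return next(_clock)


def step(checkpoints, name, fn):
    # Run fn once, then serve the recorded result on every replay.
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def handler(checkpoints):
    outside = now()                              # re-evaluated on each replay
    inside = step(checkpoints, "timestamp", now)  # recorded once
    return outside, inside


store = {}
run1 = handler(store)
run2 = handler(store)  # replay
```

The value read outside a step changes between the original run and the replay, while the value captured inside the step stays the same, so any branch that depends on the outside value could take a different path on replay.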

When to Use Durable Functions vs Step Functions

| | Durable Functions | Step Functions |
|---|---|---|
| Definition | Code in your Lambda handler | JSON/ASL state machine |
| Visualization | None (code is the workflow) | Visual workflow editor |
| Learning curve | Low (just code) | Medium (state machine concepts) |
| Direct SDK integrations | No (must call from code) | Yes (200+ AWS API actions without Lambda) |
| Language support | Python, Node.js (at launch) | Language-agnostic (invokes any Lambda) |
| Parallel execution | parallel() and map() primitives | Parallel and Map states |
| Max duration | Up to 1 year | Up to 1 year (Standard) |
| Pricing | Lambda invocation + execution time | Per state transition ($0.025 per 1,000) |
| Best for | Developer-owned workflows, code-first teams | Complex orchestration, visual design, ops teams |
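
As a rough worked example of the per-transition pricing, the snippet below applies the $0.025 per 1,000 transitions figure from the comparison to a made-up workload. The execution count and transitions per execution are assumptions for illustration; the Lambda side is left out because it depends on memory size and duration, which the comparison does not specify.

```python
from decimal import Decimal

PRICE_PER_1000_TRANSITIONS = Decimal("0.025")  # figure from the comparison above


def step_functions_transition_cost(executions, transitions_per_execution):
    """Return the state-transition charge in USD for a Standard workflow."""
    transitions = executions * transitions_per_execution
    return PRICE_PER_1000_TRANSITIONS * transitions / 1000


# Hypothetical workload: one million executions with five transitions each.
cost = step_functions_transition_cost(
    executions=1_000_000, transitions_per_execution=5
)
```

Under those assumptions the transition charge comes to 125 USD, which gives a baseline to compare against your own Lambda invocation and duration estimate.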

Use Durable Functions when:

  • Your team prefers code over configuration
  • The workflow is simple enough that a visual editor isn't needed
  • You want the workflow logic colocated with the business logic
  • You're already writing Lambda functions and want to add resilience

Use Step Functions when:

  • You need direct SDK integrations (call DynamoDB, SQS, etc. without Lambda)
  • Non-developers need to understand or modify the workflow
  • You want the visual execution history and debugging tools
  • The workflow involves complex branching, parallel execution, or choice states

Architecture Patterns

Multi-step data pipeline

Ingest → Validate → Transform → Load → Notify

Each step is checkpointed. If Transform fails, the retry starts at Transform, not at Ingest.

Human-in-the-loop approval

Submit → Create callback → Notify approver → Wait for callback → Process

The function suspends during approval (no compute charges) and resumes when the approver acts.

AI agent orchestration

Receive query → Call LLM → Parse tool calls → Execute tools → Call LLM again → Return

Each LLM call and tool execution is a step. If one fails, it is retried automatically without re-running the completed steps.

Saga pattern

Step 1 (with compensation) → Step 2 (with compensation) → Step 3
If Step 3 fails → Run compensations for Step 2, Step 1

Wrap the steps in try/catch and put the compensation logic in the catch block, undoing the completed steps in reverse order.
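
A minimal sketch of that structure in plain Python follows. The run_saga helper and the reserve/charge actions are hypothetical; in a durable function each action and each compensation would presumably be wrapped in its own step so that it is checkpointed.

```python
def run_saga(actions):
    """Run (do, undo) pairs in order; on failure, undo completed ones in reverse."""
    compensations = []
    try:
        for do, undo in actions:
            do()
            if undo is not None:
                compensations.append(undo)  # only registered after do() succeeds
    except Exception:
        for undo in reversed(compensations):
            undo()
        raise  # surface the original failure to the caller


log = []


def fail():
    raise RuntimeError("step 3 failed")


try:
    run_saga([
        (lambda: log.append("reserve"), lambda: log.append("release")),
        (lambda: log.append("charge"), lambda: log.append("refund")),
        (fail, None),
    ])
except RuntimeError:
    log.append("saga aborted")
```

Registering a compensation only after its action succeeds keeps the rollback from undoing work that never happened.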

Idempotency

Durable functions provide built-in idempotency via execution names. Invoking a function twice with the same execution name returns the existing execution's result instead of creating a duplicate, which is essential for preventing double-processing in distributed systems.
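
The effect can be illustrated with a toy registry keyed by execution name. ToyExecutionRegistry and its invoke method are hypothetical names for this sketch; the real service keeps this bookkeeping for you.

```python
class ToyExecutionRegistry:
    def __init__(self):
        self._results = {}  # execution name -> stored result
        self.runs = 0       # how many times the work actually ran

    def invoke(self, execution_name, fn, payload):
        if execution_name in self._results:  # duplicate: return the existing result
            return self._results[execution_name]
        self.runs += 1
        result = fn(payload)
        self._results[execution_name] = result
        return result


def process(order_id):
    return f"processed {order_id}"


registry = ToyExecutionRegistry()
a = registry.invoke("order-123", process, "order-123")
b = registry.invoke("order-123", process, "order-123")  # retried delivery
```

Deriving the execution name from a business key such as the order ID is one way to make a retried event map onto the same execution.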

Monitoring

  • Lambda console: Durable Executions tab shows each execution's step status and timing
  • EventBridge integration: Lambda sends Durable Execution Status Change events to the default bus
  • CloudWatch Logs: Standard Lambda logging (use context.logger to suppress duplicate logs during replay)
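
To see why a replay-aware logger matters, consider this toy model, which is an assumption-laden sketch rather than how context.logger is implemented. The context counts as replaying until it reaches a step with no checkpoint, and it drops log lines while in that state.

```python
class ToyContext:
    def __init__(self, checkpoints, sink):
        self.checkpoints = checkpoints
        self.sink = sink                    # stands in for CloudWatch Logs
        self.replaying = bool(checkpoints)  # earlier progress exists

    def log(self, message):
        if not self.replaying:              # drop lines already emitted earlier
            self.sink.append(message)

    def step(self, name, fn):
        if name in self.checkpoints:
            return self.checkpoints[name]
        self.replaying = False              # reached new work
        self.checkpoints[name] = fn()
        return self.checkpoints[name]


def handler(ctx, crash):
    ctx.log("starting")
    ctx.step("validate", lambda: "ok")
    ctx.log("validated")
    if crash:
        raise RuntimeError("simulated crash")
    ctx.step("charge", lambda: "paid")
    ctx.log("charged")


checkpoints, sink = {}, []
try:
    handler(ToyContext(checkpoints, sink), crash=True)
except RuntimeError:
    pass
handler(ToyContext(checkpoints, sink), crash=False)  # replay
```

With a plain print or standard logger, "starting" and "validated" would appear twice; here each line is recorded once across both invocations.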

Limitations

  • Language support: Node.js and Python at launch. .NET and Java support anticipated but not yet available.
  • Determinism requirement: Handler code must be deterministic. Side effects (HTTP calls, random values, timestamps) must be inside steps.
  • Cannot be enabled on existing functions: Durable execution must be set at function creation time.
  • Replay overhead: Long workflows with many completed steps add replay time, although completed steps return their checkpointed results instead of re-running.
  • Region availability: Started in US East (Ohio), expanding to additional regions.
