What Are Lambda Durable Functions?
Lambda durable functions extend the Lambda programming model with primitives for multi-step, fault-tolerant workflows. You write sequential code in your handler, and Lambda automatically checkpoints progress, retries failed steps, and suspends execution (for up to one year) when waiting on external events, all without charging for idle compute.
Think of it as Step Functions semantics (checkpointing, retry, wait states) embedded directly in your Lambda code rather than defined in a separate state machine.
Durable functions were announced at re:Invent 2025 and are available in a growing set of regions.
Core Concepts
Steps
A step is a unit of work that gets checkpointed. Once completed, it won't re-execute during replay:
# If the function crashes after step 1, replay skips step 1 and resumes at step 2
result_1 = context.step(validate_order(order_id))
result_2 = context.step(charge_payment(order_id, result_1["amount"]))
result_3 = context.step(send_confirmation(order_id))
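To see why a completed step is not re-executed, here is a toy in-memory model of the skip-on-replay behavior. It is a sketch under stated assumptions, not the Lambda SDK: `ToyContext`, its `step(name, fn)` signature, and the `store` dictionary are hypothetical stand-ins for the real context object and checkpoint storage.

```python
# Toy model of checkpoint-and-replay; NOT the Lambda durable functions SDK.
# ToyContext and its step(name, fn) signature are hypothetical.

class ToyContext:
    def __init__(self, checkpoints):
        self.checkpoints = checkpoints  # survives across invocations

    def step(self, name, fn):
        if name in self.checkpoints:     # completed earlier: skip the work
            return self.checkpoints[name]
        result = fn()                    # first time: run the work
        self.checkpoints[name] = result  # save the result before moving on
        return result


calls = []          # records which step bodies actually executed
fail_once = [True]  # makes the payment step crash on its first attempt


def validate():
    calls.append("validate")
    return {"amount": 42}


def charge(amount):
    calls.append("charge")
    if fail_once[0]:
        fail_once[0] = False
        raise RuntimeError("simulated crash")
    return f"charged {amount}"


def handler(context):
    order = context.step("validate", validate)
    return context.step("charge", lambda: charge(order["amount"]))


store = {}  # stands in for checkpoint storage
try:
    handler(ToyContext(store))       # first invocation crashes in "charge"
except RuntimeError:
    pass
result = handler(ToyContext(store))  # replay starts at the top of the handler
```

On the second invocation the handler runs from its first line again, but `validate` is served from `store`, so only `charge` does real work.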
Waits
Suspend execution without paying for compute. The function terminates, and Lambda resumes it after the specified duration:
# Function terminates here; no compute charges accrue during the wait
context.wait(duration=Duration.from_hours(24))
# Execution resumes here 24 hours later
context.step(send_followup_email(order_id))
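One way to picture a wait is as a checkpointed resume time: the first invocation records when to continue and then ends, and a later invocation passes straight through. The sketch below is a toy model only. `ToyContext`, `Suspended`, and the injected `now` value are hypothetical names for illustration, and the real service schedules the resume itself.

```python
# Toy model of a durable wait; NOT the Lambda durable functions SDK.

class Suspended(Exception):
    """Raised to end the current invocation until the wait elapses."""


class ToyContext:
    def __init__(self, state, now):
        self.state = state  # persisted between invocations
        self.now = now      # current time in seconds, injected for the demo

    def wait(self, name, seconds):
        # Record the resume time once; replays reuse the stored value
        resume_at = self.state.setdefault(name, self.now + seconds)
        if self.now < resume_at:
            raise Suspended(resume_at)  # invocation ends; nothing runs meanwhile


def handler(context):
    context.wait("cool_off", seconds=24 * 3600)
    return "follow-up sent"


state = {}
try:
    handler(ToyContext(state, now=0))
    outcome_at_start = "completed"
except Suspended as suspended:
    outcome_at_start = "suspended"
    resume_at = suspended.args[0]

# A later invocation, at or after the recorded time, runs past the wait
outcome_after_wait = handler(ToyContext(state, now=resume_at))
```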
Callbacks
Pause execution until an external system signals completion (human approval, webhook, external API):
callback = context.create_callback(name="approval", config=CallbackConfig(timeout=Duration.from_days(7)))
context.step(request_approval(callback.callback_id, order_id))
# Function terminates here and resumes when the external system calls SendDurableExecutionCallbackSuccess
approval_result = callback.result()
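The callback flow has two sides: the handler that creates the callback and suspends, and the external system that later supplies the result. The following toy model sketches both. All names here (`ToyCallbackStore`, `Pending`, `outbox`) are hypothetical, and it does not call the real callback API.

```python
# Toy model of create-callback / suspend / resume; NOT the real SDK or API.
import uuid


class Pending(Exception):
    """Ends the invocation until the callback has a result."""


class ToyCallbackStore:
    def __init__(self):
        self.ids = {}      # callback name -> callback id (stable across replays)
        self.results = {}  # callback id -> payload from the external system

    def create(self, name):
        return self.ids.setdefault(name, uuid.uuid4().hex)

    def complete(self, callback_id, payload):
        # Called by the external system (approver, webhook, ...)
        self.results[callback_id] = payload

    def result(self, callback_id):
        if callback_id not in self.results:
            raise Pending(callback_id)
        return self.results[callback_id]


def handler(store, outbox):
    callback_id = store.create("approval")
    if callback_id not in outbox:   # stand-in for a checkpointed notify step
        outbox.append(callback_id)  # "send" the id to the approver once
    return store.result(callback_id)


store, outbox = ToyCallbackStore(), []
try:
    handler(store, outbox)  # suspends: no approval yet
except Pending:
    pass
store.complete(outbox[0], {"approved": True})  # the external system responds
approval = handler(store, outbox)              # replay now returns the result
```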
Checkpoint and Replay
The mechanism that makes it all work. When a durable function fails or resumes from a wait:
- Lambda replays the handler from the beginning
- Completed steps are skipped (their results are returned from checkpoint storage)
- Execution continues from where it left off
This means your handler code must be deterministic. Don't use random() or datetime.now() outside of steps, because replay will produce different values.
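A small illustration of the determinism rule, assuming a hypothetical `step` helper that saves results in a dictionary (this is not the SDK's API): the value produced outside a step changes on replay, while the value produced inside a step is replayed from the checkpoint.

```python
# Toy illustration of why non-deterministic values belong inside steps.
import random


def step(checkpoints, name, fn):
    """Hypothetical helper: return the saved result for name, or run fn and save it."""
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def handler(checkpoints):
    unstable = random.random()                            # re-evaluated on every replay
    stable = step(checkpoints, "pick_id", random.random)  # evaluated once, then replayed
    return unstable, stable


checkpoints = {}
first = handler(checkpoints)
replay = handler(checkpoints)
```

Any branching that depends on `unstable` could take a different path on replay, which is the failure mode the rule guards against.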
When to Use Durable Functions vs Step Functions
| | Durable Functions | Step Functions |
|---|---|---|
| Definition | Code in your Lambda handler | JSON/ASL state machine |
| Visualization | None (code is the workflow) | Visual workflow editor |
| Learning curve | Low (just code) | Medium (state machine concepts) |
| Direct SDK integrations | No (must call from code) | Yes (200+ AWS API actions without Lambda) |
| Language support | Python, Node.js (at launch) | Language-agnostic (invokes any Lambda) |
| Parallel execution | parallel() and map() primitives | Parallel and Map states |
| Max duration | Up to 1 year | Up to 1 year (Standard) |
| Pricing | Lambda invocation + execution time | Per state transition ($0.025/1000) |
| Best for | Developer-owned workflows, code-first teams | Complex orchestration, visual design, ops teams |
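To make the pricing row concrete, here is a rough sketch of the Step Functions side only, using the $0.025 per 1,000 state transitions figure from the table. The workload numbers are hypothetical, and the calculation ignores any free tier as well as the Lambda charges for tasks the state machine invokes.

```python
# Back-of-the-envelope Step Functions transition cost; workload numbers are hypothetical.
from decimal import Decimal

PRICE_PER_1000_TRANSITIONS = Decimal("0.025")  # figure from the table above


def step_functions_cost(executions, transitions_per_execution):
    """Return the state-transition charge in USD for the given workload."""
    transitions = executions * transitions_per_execution
    return transitions * PRICE_PER_1000_TRANSITIONS / 1000


# Assumed workload: one million executions, six transitions each
cost = step_functions_cost(1_000_000, 6)
```

The durable functions side depends on invocation count and execution time, so a fair comparison needs measurements from your own workload.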
Use Durable Functions when:
- Your team prefers code over configuration
- The workflow is simple enough that a visual editor isn't needed
- You want the workflow logic colocated with the business logic
- You're already writing Lambda functions and want to add resilience
Use Step Functions when:
- You need direct SDK integrations (call DynamoDB, SQS, etc. without Lambda)
- Non-developers need to understand or modify the workflow
- You want the visual execution history and debugging tools
- The workflow involves complex branching, parallel execution, or choice states
Architecture Patterns
Multi-step data pipeline
Ingest → Validate → Transform → Load → Notify
Each step is checkpointed. If Transform fails, retry from Transform (not from Ingest).
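The "retry from Transform, not from Ingest" behavior can be sketched with a bounded retry loop around each checkpointed stage. This is a toy model: `run_step`, `make_stage`, and the `max_attempts` parameter are hypothetical names, and the real retry policy is configured through the service rather than hand-written like this.

```python
# Toy pipeline with per-step retry; NOT the Lambda durable functions SDK.
from collections import Counter

runs = Counter()  # how many times each stage body actually executed


def run_step(checkpoints, name, fn, max_attempts=3):
    if name in checkpoints:  # already completed: never re-run
        return checkpoints[name]
    for attempt in range(1, max_attempts + 1):
        try:
            checkpoints[name] = fn()
            return checkpoints[name]
        except RuntimeError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure


def make_stage(name, failures=0):
    remaining = [failures]  # number of simulated failures left

    def stage():
        runs[name] += 1
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError(f"{name} failed")
        return f"{name} ok"

    return stage


stages = {
    "ingest": make_stage("ingest"),
    "validate": make_stage("validate"),
    "transform": make_stage("transform", failures=2),  # fails twice, then succeeds
    "load": make_stage("load"),
    "notify": make_stage("notify"),
}

checkpoints = {}
for name, stage in stages.items():
    run_step(checkpoints, name, stage)
```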
Human-in-the-loop approval
Submit → Create callback → Notify approver → Wait for callback → Process
Function suspends during approval (no compute charges). Resumes when human acts.
AI agent orchestration
Receive query → Call LLM → Parse tool calls → Execute tools → Call LLM again → Return
Each LLM call and tool execution is a step. If one fails, it is retried automatically without re-running completed steps.
Saga pattern
Step 1 (with compensation) → Step 2 (with compensation) → Step 3
If Step 3 fails → Run compensations for Step 2, Step 1
Try/catch around steps, with compensation logic in the catch block.
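A minimal sketch of that try/catch structure in plain Python, leaving out the durable context to keep the compensation order visible. `run_saga` and the action names are hypothetical. In a durable function, each action and each compensation would presumably be its own step.

```python
# Minimal saga sketch: run actions in order, undo completed ones in reverse on failure.

def run_saga(actions):
    """actions is a list of (do, undo) pairs; undo may be None."""
    compensations = []
    try:
        for do, undo in actions:
            do()
            if undo is not None:
                compensations.append(undo)  # registered only after do() succeeds
    except Exception:
        for undo in reversed(compensations):  # most recent action is undone first
            undo()
        raise


log = []


def ship():
    log.append("ship")
    raise RuntimeError("carrier unavailable")


actions = [
    (lambda: log.append("reserve"), lambda: log.append("release")),
    (lambda: log.append("charge"), lambda: log.append("refund")),
    (ship, None),  # the final step has no compensation
]

try:
    run_saga(actions)
    saga_failed = False
except RuntimeError:
    saga_failed = True
```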
Idempotency
Durable functions provide built-in idempotency via execution names. Invoking a function twice with the same execution name returns the existing execution result instead of creating a duplicate. This is essential for preventing double-processing in distributed systems.
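The deduplication behavior can be pictured as a lookup keyed by execution name. The sketch below is a toy model with hypothetical names (`ToyExecutions`, `invoke`, the `order-123` name) and says nothing about how the service implements it.

```python
# Toy model of execution-name idempotency; NOT the real invocation API.

class ToyExecutions:
    def __init__(self):
        self.by_name = {}  # execution name -> stored result

    def invoke(self, execution_name, fn, payload):
        if execution_name not in self.by_name:  # first time: run the workflow
            self.by_name[execution_name] = fn(payload)
        return self.by_name[execution_name]     # duplicates get the same result


charges = []  # side effects we want to happen exactly once


def process_order(payload):
    charges.append(payload["order_id"])
    return {"status": "done", "order_id": payload["order_id"]}


executions = ToyExecutions()
first = executions.invoke("order-123", process_order, {"order_id": 123})
duplicate = executions.invoke("order-123", process_order, {"order_id": 123})
```

A natural choice for the execution name is a business key that is already unique, such as an order ID.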
Monitoring
- Lambda console: Durable Executions tab shows each execution's step status and timing
- EventBridge integration: Lambda sends Durable Execution Status Change events to the default bus
- CloudWatch Logs: Standard Lambda logging (use context.logger to suppress duplicate logs during replay)
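To react to those EventBridge events, a rule needs an event pattern. The sketch below builds one that matches on the detail-type named above and checks it against sample events with a deliberately simplified matcher. Matching on detail-type alone is an assumption (a real rule would usually also filter on the event source and detail fields), and `matches` covers only exact top-level values, not the full EventBridge pattern syntax.

```python
# Event pattern sketch; the matcher is a toy, not EventBridge's matching engine.
import json

# detail-type string taken from the text above
PATTERN = {"detail-type": ["Durable Execution Status Change"]}


def matches(pattern, event):
    """Toy matcher: every pattern key must list the event's value for that key."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


events = [
    {"detail-type": "Durable Execution Status Change"},
    {"detail-type": "EC2 Instance State-change Notification"},
]
matched = [matches(PATTERN, event) for event in events]

# JSON form, as an event pattern would be supplied when creating a rule
pattern_json = json.dumps(PATTERN)
```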
Limitations
- Language support: Node.js and Python at launch. .NET and Java support anticipated but not yet available.
- Determinism requirement: Handler code must be deterministic. Side effects (HTTP calls, random values, timestamps) must be inside steps.
- Cannot be enabled on existing functions: Durable execution must be set at function creation time.
- Replay overhead: Long workflows with many completed steps add replay time (though checkpoint data is returned instantly).
- Region availability: Started in US East (Ohio), expanding to additional regions.