What Are Step Functions?
Step Functions is AWS's workflow orchestration service. You define a state machine (a series of steps with transitions, branches, retries, and error handling) and AWS executes it. Each step can invoke a Lambda, call an AWS API directly, wait for a callback, or branch based on data.
It replaces the pattern of chaining Lambda functions through SQS or writing custom orchestration code. The state machine is the orchestration logic; individual steps do the work.
When to Use Them
Good fit:
- Multi-step business processes (order fulfillment, onboarding, approvals)
- Long-running workflows (hours or days, with wait states)
- Workflows that need human approval steps
- ETL pipelines with branching logic
- Saga patterns for distributed transactions
- Anything where you need built-in retry, timeout, and error handling
Poor fit:
- Simple event reactions (use Lambda + EventBridge)
- High-throughput, low-latency pipelines (use Kinesis or SQS)
- Streaming data processing
Standard vs Express Workflows
| | Standard | Express |
|---|---|---|
| Duration | Up to 1 year | Up to 5 minutes |
| Execution semantics | Exactly-once | At-least-once (asynchronous) or at-most-once (synchronous) |
| Pricing | Per state transition | Per request + duration |
| History | Full execution history | CloudWatch Logs only |
Use Standard for business workflows, long-running processes, and anything that needs an audit trail. Use Express for high-volume, short-duration work such as request processing, data transformation, and IoT event handling.
Key Concepts
States
- Task: Do work (invoke Lambda, call AWS API, run ECS task)
- Choice: Branch based on input data
- Wait: Pause for a duration or until a timestamp
- Parallel: Run branches concurrently
- Map: Iterate over an array, processing each item
- Pass: Transform data without calling anything
- Succeed/Fail: Terminal states
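To make the state types concrete, here is a minimal sketch of a state machine definition in Amazon States Language, written as a plain TypeScript object so it can be serialized with `JSON.stringify`. The state names, the `$.total` field, and the Lambda ARN are hypothetical placeholders, not part of any real system.

```typescript
// Hypothetical definition: state names, $.total, and the ARN are placeholders.
const definition = {
  StartAt: "CheckTotal",
  States: {
    // Choice: branch based on input data.
    CheckTotal: {
      Type: "Choice",
      Choices: [
        { Variable: "$.total", NumericGreaterThan: 1000, Next: "CoolingOff" },
      ],
      Default: "MarkStandard",
    },
    // Wait: pause for a fixed duration before continuing.
    CoolingOff: { Type: "Wait", Seconds: 60, Next: "ChargeCard" },
    // Pass: add data to the state without calling anything.
    MarkStandard: {
      Type: "Pass",
      Result: { tier: "standard" },
      ResultPath: "$.review",
      Next: "ChargeCard",
    },
    // Task: do the actual work (here, a Lambda function).
    ChargeCard: {
      Type: "Task",
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "Failed" }],
      Next: "Done",
    },
    // Succeed / Fail: terminal states.
    Done: { Type: "Succeed" },
    Failed: { Type: "Fail", Error: "ChargeFailed", Cause: "Card could not be charged" },
  },
};

// Print the JSON you would hand to Step Functions.
console.log(JSON.stringify(definition, null, 2));
```

Parallel and Map follow the same shape, except that they contain nested state machines of their own.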
SDK Integrations
Step Functions can call APIs across 200+ AWS services directly, without a Lambda in between. Need to put an item in DynamoDB? Call PutItem straight from a Task state. Need to send an SNS notification? Same thing. This reduces Lambda invocations and simplifies your architecture.
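As a rough sketch of what a direct integration looks like in the definition itself, the two Task states below write to DynamoDB and publish to SNS with no Lambda involved. The table name, topic ARN, and input fields (`$.orderId`) are assumptions made up for illustration.

```typescript
// Hypothetical names: the "Payments" table, the topic ARN, and $.orderId.
const recordPayment = {
  Type: "Task",
  // Service integration ARN: Step Functions calls DynamoDB PutItem itself.
  Resource: "arn:aws:states:::dynamodb:putItem",
  Parameters: {
    TableName: "Payments",
    Item: {
      // Keys ending in ".$" are resolved from the state input at runtime.
      PK: { "S.$": "$.orderId" },
      SK: { S: "PAYMENT" },
    },
  },
  // Discard the DynamoDB response so the original input flows to the next state.
  ResultPath: null,
  Next: "NotifyCustomer",
};

const notifyCustomer = {
  Type: "Task",
  Resource: "arn:aws:states:::sns:publish",
  Parameters: {
    TopicArn: "arn:aws:sns:us-east-1:123456789012:order-events",
    "Message.$": "$.orderId",
  },
  End: true,
};

const definition = {
  StartAt: "RecordPayment",
  States: { RecordPayment: recordPayment, NotifyCustomer: notifyCustomer },
};

console.log(JSON.stringify(definition, null, 2));
```

The state machine's execution role still needs permission for each call (here `dynamodb:PutItem` and `sns:Publish`).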
Error Handling
Built-in retry with exponential backoff and catch blocks for fallback logic:
{
  "Retry": [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  }],
  "Catch": [{
    "ErrorEquals": ["States.ALL"],
    "Next": "HandleError"
  }]
}
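To see what that retry policy means in practice, the sketch below computes the wait before each retry: the first retry waits `IntervalSeconds`, and each later one multiplies the previous wait by `BackoffRate`. The `retryDelays` helper is hypothetical and ignores optional settings such as a maximum delay or jitter.

```typescript
interface RetryPolicy {
  IntervalSeconds: number;
  MaxAttempts: number; // maximum number of retries
  BackoffRate: number;
}

// Wait before retry n (1-based) is IntervalSeconds * BackoffRate^(n - 1).
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= policy.MaxAttempts; attempt++) {
    delays.push(policy.IntervalSeconds * Math.pow(policy.BackoffRate, attempt - 1));
  }
  return delays;
}

const delays = retryDelays({ IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 });
console.log(delays); // [ 2, 4, 8 ]
```

With the policy above, a task that keeps failing is retried after 2, 4, and 8 seconds. Only after the third retry fails does the Catch block route to HandleError.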
CDK Example
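import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// Runs inside a Stack or Construct. validateFn, notifyFn, errorHandlerFn (Lambda
// functions) and paymentsTable (DynamoDB table) are assumed to be defined elsewhere.

// Step 1: Validate the order
const validateOrder = new tasks.LambdaInvoke(this, 'ValidateOrder', {
  lambdaFunction: validateFn,
  outputPath: '$.Payload', // pass only the Lambda's return value downstream
});

// Step 2: Process payment (direct DynamoDB integration)
const recordPayment = new tasks.DynamoPutItem(this, 'RecordPayment', {
  table: paymentsTable,
  item: {
    PK: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.orderId')),
    SK: tasks.DynamoAttributeValue.fromString('PAYMENT'),
    // numberFromString expects $.amount to arrive as a string
    amount: tasks.DynamoAttributeValue.numberFromString(sfn.JsonPath.stringAt('$.amount')),
  },
  // Discard the PutItem response so the order data reaches the next step
  resultPath: sfn.JsonPath.DISCARD,
});

// Step 3: Send confirmation
const sendConfirmation = new tasks.LambdaInvoke(this, 'SendConfirmation', {
  lambdaFunction: notifyFn,
});

// Error handler
const handleError = new tasks.LambdaInvoke(this, 'HandleError', {
  lambdaFunction: errorHandlerFn,
});

// Attach the catch to each task. addCatch is a method on task states,
// not on the Chain returned by .next(), so it cannot be called mid-chain.
validateOrder.addCatch(handleError);
recordPayment.addCatch(handleError);

// Wire it up
const definition = validateOrder
  .next(recordPayment)
  .next(sendConfirmation);

new sfn.StateMachine(this, 'OrderWorkflow', {
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
  timeout: Duration.minutes(5), // Duration lives in aws-cdk-lib, not aws-stepfunctions
  tracingEnabled: true,
});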
Architecture Patterns
Saga pattern
For distributed transactions across microservices. Each step has a compensating action. If step 3 fails, run the compensations for steps 2 and 1, in that order. Step Functions makes this explicit: each step's Catch block routes to the compensation logic for the steps that already succeeded.
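One way to picture the wiring is a small generator that turns a list of steps into a definition where every Catch points at the previous step's compensation. This is only a sketch: the `buildSaga` helper, the `Undo...` naming scheme, the step names, and the ARNs are all hypothetical.

```typescript
interface SagaStep {
  name: string;
  action: string;       // ARN of the forward action (placeholder)
  compensation: string; // ARN of the compensating action (placeholder)
}

function buildSaga(steps: SagaStep[]) {
  if (steps.length === 0) throw new Error("a saga needs at least one step");
  const states: Record<string, any> = {};
  steps.forEach((step, i) => {
    // Where to go when this step fails, or after this step has been undone:
    // the previous step's compensation, or the terminal Fail state.
    const previousUndo = i === 0 ? "SagaFailed" : "Undo" + steps[i - 1].name;
    const isLast = i === steps.length - 1;

    const task: Record<string, any> = {
      Type: "Task",
      Resource: step.action,
      // Keep the original input and attach the error under $.error.
      Catch: [{ ErrorEquals: ["States.ALL"], ResultPath: "$.error", Next: previousUndo }],
    };
    if (isLast) task.End = true;
    else task.Next = steps[i + 1].name;
    states[step.name] = task;

    // The last step never needs undoing: if it fails, it did not complete.
    if (!isLast) {
      states["Undo" + step.name] = {
        Type: "Task",
        Resource: step.compensation,
        ResultPath: null,
        Next: previousUndo,
      };
    }
  });
  states["SagaFailed"] = { Type: "Fail", Error: "SagaRolledBack" };
  return { StartAt: steps[0].name, States: states };
}

const arn = (fn: string) => "arn:aws:lambda:us-east-1:123456789012:function:" + fn;
const saga = buildSaga([
  { name: "ReserveInventory", action: arn("reserve"), compensation: arn("release") },
  { name: "ChargePayment", action: arn("charge"), compensation: arn("refund") },
  { name: "ShipOrder", action: arn("ship"), compensation: arn("cancel-shipment") },
]);
console.log(JSON.stringify(saga, null, 2));
```

If ShipOrder fails here, the path is UndoChargePayment, then UndoReserveInventory, then SagaFailed. A production saga would also need retries on the compensation steps themselves, since a failed compensation leaves the system inconsistent.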
Map state for batch processing
Process an array of items in parallel with concurrency control. Upload 1000 images? A Map state with MaxConcurrency set to 40 processes up to 40 at a time, and a Retry block on the inner task retries failed items.
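A sketch of such a Map state is shown below as a plain TypeScript object. The `$.images` input path, the state names, and the Lambda ARN are assumptions for illustration; the retry settings are borrowed from the error handling example earlier in the post.

```typescript
// Hypothetical: $.images, the state names, and the ARN are placeholders.
const resizeImages = {
  Type: "Map",
  ItemsPath: "$.images", // the array to iterate over
  MaxConcurrency: 40,    // at most 40 iterations in flight at once
  ItemProcessor: {
    ProcessorConfig: { Mode: "INLINE" },
    StartAt: "ResizeImage",
    States: {
      ResizeImage: {
        Type: "Task",
        Resource: "arn:aws:lambda:us-east-1:123456789012:function:resize-image",
        // Retries apply per item, so one bad image does not restart the batch.
        Retry: [{
          ErrorEquals: ["States.TaskFailed"],
          IntervalSeconds: 2,
          MaxAttempts: 3,
          BackoffRate: 2.0,
        }],
        End: true,
      },
    },
  },
  End: true,
};

console.log(JSON.stringify(resizeImages, null, 2));
```

The Map state does not run in fixed waves of 40. It starts a new iteration whenever a slot frees up, so slow items do not hold back the rest.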
Human-in-the-loop
Use a Task state with a callback token. The workflow pauses, you send the token to a human (via email, Slack, etc.), and when they approve, their action calls back with the token to resume the workflow.
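Here is a rough sketch of the waiting side, assuming the approval request is delivered through an SQS queue. The `.waitForTaskToken` suffix tells Step Functions to pause until the token is returned, and `$$.Task.Token` reads the token from the context object. The queue URL, state names, and 24-hour timeout are assumptions.

```typescript
// Hypothetical: the queue URL, state names, and timeout are placeholders.
const definition = {
  StartAt: "WaitForApproval",
  States: {
    WaitForApproval: {
      Type: "Task",
      // Send the message, then pause until someone returns the task token.
      Resource: "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      Parameters: {
        QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/approval-requests",
        MessageBody: {
          "orderId.$": "$.orderId",
          // "$$" reads from the context object, which holds the token.
          "taskToken.$": "$$.Task.Token",
        },
      },
      // Without a timeout, the execution could wait until the workflow limit.
      TimeoutSeconds: 86400,
      Catch: [{ ErrorEquals: ["States.Timeout"], Next: "ApprovalTimedOut" }],
      Next: "Approved",
    },
    Approved: { Type: "Succeed" },
    ApprovalTimedOut: { Type: "Fail", Error: "ApprovalTimedOut" },
  },
};

console.log(JSON.stringify(definition, null, 2));
```

Whatever handles the approve or reject click then calls the Step Functions `SendTaskSuccess` or `SendTaskFailure` API with that token to resume or fail the workflow.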
Cost Considerations
- Standard: $0.025 per 1000 state transitions
- Express: $0.000001 per request ($1.00 per million) + $0.00001667 per GB-second of duration
A 5-step Standard workflow processing 100K executions/month is roughly 500K state transitions, or ~$12.50. Express workflows are usually much cheaper for high-volume, short-duration work.
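The Standard estimate is simple enough to script for your own numbers. The `standardMonthlyCost` helper below is hypothetical: it uses the per-1000-transition price quoted above as a default, treats each step as one transition, and ignores any free tier.

```typescript
// Estimated monthly cost in USD for a Standard workflow.
function standardMonthlyCost(
  transitionsPerExecution: number,
  executionsPerMonth: number,
  pricePerThousandTransitions: number = 0.025,
): number {
  const totalTransitions = transitionsPerExecution * executionsPerMonth;
  return (totalTransitions / 1000) * pricePerThousandTransitions;
}

const monthly = standardMonthlyCost(5, 100000);
console.log("$" + monthly.toFixed(2)); // $12.50
```

Retries and Choice branches add transitions, so real bills tend to land somewhat above the back-of-the-envelope figure.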
The biggest cost mistake: using Standard workflows for high-volume event processing that should be Express, or worse, not using direct SDK integrations and paying for unnecessary Lambda invocations at each step.