AWS Lambda functions can only run for a
maximum of five minutes.
This must be distinctly understood, or
nothing wonderful can come of the story
you are about to hear.
FORREST BRAZEAL
Cloud Services team lead at Infor
AWS automation tooling
Blog: forrestbrazeal.com
Twitter: @forrestbrazeal
SERVERLESS WORKFLOWS IN AWS
A JOURNEY FROM SWF TO STEP
FUNCTIONS
FORREST BRAZEAL
THE PROBLEM (MAY 2016)
Enterprise apps with many moving parts
Deployment orchestration becomes tricky
Need a workflow system
WHAT DOES A
WORKFLOW SYSTEM DO?
Coordinates worker processes
Manages state
Responds to interrupts
Is highly available, elastic, and all that good stuff
WHY SERVERLESS
FOR THIS USE CASE?
Lots of downtime
Need to scale wide
Internal tooling, need low cost/maintenance
It’s cool! (Famous last words)
How do you manage a
long-running process
using
short-lived, stateless functions?
THE BIG QUESTION
IDEA 1: PURE LAMBDA
GHASTLY!
BUT WAIT A MINUTE
AWS has had a workflow service, Amazon SWF, for several
years.
Could we use it?
WE NEED TO TALK ABOUT SWF
(not so) “Simple Workflow Service”
You write a decider program that orchestrates your workflow
You write activity worker programs to handle tasks
SWF is the glue
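To make that division of labor concrete: a decider can be thought of as a function from the workflow's event history to a list of decisions. The following is a minimal sketch, not the code from the talk. The step names in `STEPS` are hypothetical, the event and decision type strings follow SWF's naming, and real decisions carry more attributes (task lists, timeouts, failure handling) than shown here.

```python
# Hypothetical deployment steps; each would map to an activity worker.
STEPS = ["provision", "deploy", "verify"]


def decide(events):
    """Return the next SWF-style decisions for a workflow history.

    `events` is a list of dicts with an "eventType" key, a simplified
    stand-in for the history SWF hands to a decider.
    """
    scheduled = sum(1 for e in events if e["eventType"] == "ActivityTaskScheduled")
    completed = sum(1 for e in events if e["eventType"] == "ActivityTaskCompleted")

    if scheduled > completed:
        return []  # a task is still in flight; nothing new to decide
    if completed >= len(STEPS):
        return [{"decisionType": "CompleteWorkflowExecution"}]

    next_step = STEPS[completed]
    return [{
        "decisionType": "ScheduleActivityTask",
        "scheduleActivityTaskDecisionAttributes": {
            "activityType": {"name": next_step, "version": "1"},
            "activityId": f"{next_step}-{completed}",
        },
    }]
```

The activity workers do the actual work; the decider only reads history and says what happens next.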
TRADITIONAL SWF
ARCHITECTURE
IDEA 2: SWF + LAMBDA
Put the “decider” program on Lambda
Find some way to invoke it repeatedly over the life of the
workflow
Advantages
• SWF manages the workflow
• More maintainable than Idea 1
• Still technically “serverless”!
Disadvantages
• Let’s see …
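One way a Lambda-hosted decider could work is to poll SWF once per invocation and answer whatever decision task comes back. The sketch below assumes a client that behaves like `boto3.client("swf")`; the domain name, task list name, and the placeholder `complete_immediately` decision logic are hypothetical. History pagination (`nextPageToken`) is left out for brevity.

```python
def run_decider_once(swf, domain, task_list, decide):
    """Poll SWF once for a decision task and answer it.

    `swf` is expected to behave like boto3.client("swf"); `decide` is any
    function mapping a list of history events to a list of decisions.
    Returns True if a task was handled, False if the poll came back empty.
    """
    task = swf.poll_for_decision_task(
        domain=domain,
        taskList={"name": task_list},
    )
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended with nothing to decide

    decisions = decide(task.get("events", []))
    swf.respond_decision_task_completed(taskToken=token, decisions=decisions)
    return True


def complete_immediately(events):
    """Placeholder decision logic (hypothetical): end the workflow at once."""
    return [{"decisionType": "CompleteWorkflowExecution"}]


def handler(event, context):
    """Lambda entry point, meant to be invoked repeatedly on a schedule."""
    import boto3  # imported here so the helpers above can be tested offline

    swf = boto3.client("swf")
    handled = run_decider_once(swf, "deployments", "decider-tasks", complete_immediately)
    return {"handled": handled}
```

Because each invocation handles at most one decision task, the workflow only moves forward as often as the function is invoked.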
THIS (WAS) MY
ARCHITECTURE
AWS even made a video about it!
SERVERLESS SWF PAIN POINTS
☹ Latency
☹ Cost
☹ Retries
☹ Debugging
☹ State
“Serverless for serverless’
sake”
AWS STEP FUNCTIONS (SFN)
Announced at re:Invent 2016
A “serverless-native” workflow solution
State machine as a service
THE FIRST RULE OF
STEP FUNCTIONS CLUB
Your Lambda functions don’t talk about Step Functions Club!
They just accept input and return output – SFN does the
orchestration
“Amazon States Language” template defines the state
machine in JSON
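A minimal sketch of what this looks like in practice: the Lambda handlers are plain functions that take input and return output, and a separate Amazon States Language document wires them together. The state names, handler names, and placeholder ARNs below are hypothetical; the definition is built as a Python dict and serialized to JSON only to keep the example in one language.

```python
import json


def provision(event, context=None):
    """Plain Lambda handler: no knowledge of Step Functions at all."""
    return {**event, "provisioned": True}


def deploy(event, context=None):
    """Receives the previous state's output as its input."""
    return {**event, "deployed": event.get("provisioned", False)}


# Amazon States Language definition (hypothetical two-step workflow).
STATE_MACHINE = {
    "Comment": "Hypothetical two-step deployment workflow",
    "StartAt": "Provision",
    "States": {
        "Provision": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:provision",
            "Next": "Deploy",
        },
        "Deploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:deploy",
            "End": True,
        },
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)
```

The ordering lives entirely in the definition; swapping or inserting a step means editing the JSON, not the functions.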
SFN IN ACTION:
DYNAMIC BACKOFF
You pay here
Not here!
DYNAMIC BACKOFF
In Lambda function for “BackoffTaskState”
From this …
…to this!
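The original before/after code did not survive in this transcript. As a stand-in, here is a minimal sketch of how dynamic backoff can be split between a Lambda function and a Wait state: the function only computes how long to wait, and a Wait state with `SecondsPath` does the waiting without any function running. Only the name "BackoffTaskState" comes from the slides; the constants, the `attempt` and `wait_seconds` fields, the other state names, and the placeholder ARN are assumptions.

```python
BASE_SECONDS = 5     # hypothetical first delay
MAX_SECONDS = 300    # hypothetical cap on any single delay


def backoff_handler(event, context=None):
    """Lambda for "BackoffTaskState": compute the next delay and return it."""
    attempt = event.get("attempt", 0) + 1
    wait_seconds = min(BASE_SECONDS * 2 ** (attempt - 1), MAX_SECONDS)
    return {**event, "attempt": attempt, "wait_seconds": wait_seconds}


# Fragment of a state machine definition (Amazon States Language as a dict).
BACKOFF_STATES = {
    "BackoffTaskState": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:backoff",
        "Next": "WaitState",
    },
    "WaitState": {
        "Type": "Wait",
        "SecondsPath": "$.wait_seconds",  # read the delay from the task output
        "Next": "CheckStatus",            # hypothetical state that retries the work
    },
}
```

The function returns in milliseconds regardless of how long the delay is, so no Lambda invocation sits idle while the workflow waits.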
STEP FUNCTIONS
WISHLIST
CloudFormation support ✔
> 32 KB state size
Updates
Signals
Child state machines
Dynamic parallel states
STATE MACHINE
VERSIONING
SWF WORKFLOW RESULTS
SFN WORKFLOW RESULTS
IN CONCLUSION
Recognize when your serverless reach exceeds your grasp
Your use cases make the tooling better
Go build an awesome workflow with AWS Step Functions!
RESOURCES
More details about our SFN use case and results:
https://forrestbrazeal.com/2016/12/29/serverless-workflows-on-aws-my-journey-from-swf-to-step-functions/
AWS “This Is My Architecture” video about the SWF/Lambda solution:
https://youtu.be/rKeS3RpMEOw
Infor’s re:Invent session about deployment automation:
https://www.youtube.com/watch?v=Epx_32c3c6s
FORREST BRAZEAL
@forrestbrazeal

Serverless Workflows on AWS - A Journey from SWF to Step Functions

Editor's Notes

  • #21 Latency. CloudWatch rules cannot run more frequently than once a minute, and SWF scheduling delays made it advisable to invoke the decider even less frequently than that. This situation created significant latency between workflow actions, leading to inflated workflow times that were especially noticeable for workflows involving lots of short tasks.

    Cost. Having to run the decider on a two-minute loop throughout the life of the workflow somewhat negated the cost advantages we hoped to get from using Lambda instead of EC2 in the first place, especially as our number of workflows scaled up.

    Runtime State. Every time the Lambda decider function ran, it had to figure out where it was in the workflow process. SWF is supposed to make stateless execution easy, and it provides the complete deployment workflow history as a JSON blob when handing tasks to the decider, but the blob quickly becomes unmanageably large and filled with superfluous data. To keep track of what was going on in the workflows, we resorted to maintaining ephemeral state in a DynamoDB table, adding more latency and cost.

    Retries/Error Handling. SWF, I regret to say, has bugs. Sometimes SWF completely fails to schedule a task, or does it so late that the workflow’s task timeout expires. Finding and catching these errors required even more state maintained outside of SWF.

    Debugging. The SWF console’s workflow event views are difficult to read, oddly paginated and don’t provide much information, leading to a rabbit trail of log searches anytime something went wrong.

    Code Maintainability. The combination of multiple state sources and ramifying failure scenarios, not to mention the central “hack” of running the SWF decider in a loop, led to a mess of one-off fixes and hacky workarounds in our codebase.