A
If you've been listening to AWS Bites for a while, you've probably noticed a pattern. We keep coming back to Lambda. And that's not a coincidence. We're big fans. It's one of those services we like because it's very convenient. You can write tiny little functions in the programming language you like. They run on demand when specific events happen. They scale like crazy when you need them to, and scale to zero when nothing happens. Even better, you only pay for what you use. Of course, Lambda is not always the best solution for everything, as lots of listeners like to remind us, which is completely fair. The moment you try to do anything that looks like a workflow, for example, Lambda can start to feel like it's fighting against you. You've got 15 minutes max execution time. It's stateless by default, and if you need some orchestration like retries, backoff, all that kind of stuff, you end up bolting on something like Step Functions, queues, schedules, and a bunch of extra stuff you didn't really want in the beginning. And it's not always easy to get that stuff working reliably. Now, what if you could keep the Lambda model we all know and love, but add a few extra superpowers that might help us overcome some of those challenges? Well, last December, we got a few new superpowers. At re:Invent 2025, AWS Lambda durable functions were announced. And to be honest, we're pretty excited about this one. It's still Lambda. It still has the same runtime, the same scaling, but with a framework that can now checkpoint progress, suspend execution when you need to wait, and resume later from a safe point, skipping the work you already completed. And this is what we're going to talk about in detail today. We're going to break down what durable actually means in practice and how this whole resume mechanism works under the hood.
We'll talk about when this approach is a huge win compared to the usual patterns, and of course, the gotchas that can surprise you, especially around determinism, idempotency and debugging resumed executions. Finally, we'll also talk about one of our own open source applications that we rebuilt from scratch to have an excuse to use durable Lambda functions and see what it feels like to use them in a real project. My name is Eoin, I'm joined by Luciano, and this is AWS Bites. Okay. Hi Luciano. Would you like to start off by telling us what a durable function is? What are the basic ideas around it?
B
Of course, yeah. So, as you said in the intro, it's still Lambda, right? It's the same service, the same model, the same scaling. That doesn't change. So that also means that you still write a Lambda handler in the usual way, with the same runtimes that you know and love, the same type of resources, the same scaling mechanics. The difference is that there is now a new flag that you can turn on to turn a regular Lambda function into a durable Lambda function. That opts the function into what is called the durable execution engine, which you use through a dedicated SDK that you need to install. So what is the core difference? There is, I guess, a mental shift that we need to embrace when we switch a regular Lambda function into a durable function, which is that you have to stop thinking about one Lambda invocation and start thinking about a workflow made of atomic steps. This dedicated SDK allows you to write your business logic inside your handler as a sequence of explicitly named steps. And you can think of every step as an atomic unit of work. So you can think: okay, do this, then do that. And of course, each step has clear boundaries and outcomes. This step model is what makes the idea of checkpointing possible. After a step is completed, the framework can create a checkpoint, which is a way of saying that the state produced by that step is persisted in this workflow. So it's like, okay, we are making progress. We completed this step. The result of this step was an object, for example, and that object is persisted inside the execution state. Other than that, we also have to talk about suspension, because we mentioned it.
So what does it mean that this Lambda function execution can stop? It can stop for a few different reasons. For instance, there can be unexpected stops if there is an error or some kind of crash, a timeout or something like that. In that case, a retry mechanism kicks in and starts to re-execute the function. Or there can be planned stops as well. And this is also a new concept, because sometimes in your business logic you just have to wait for something external to happen. We'll talk about some examples in a second. The idea is that when you want to wait, you don't need to keep the Lambda running, because that consumes resources and it's going to cost you money: you have CPU and memory that are occupied doing effectively nothing. So with durable functions, the Lambda can now be suspended, which means that the instance is literally stopped. Nothing is running. You're not going to be paying for that until something happens that wakes up the execution and starts a new, fresh invocation, but with all the state that we discussed before, thanks to the checkpointing mechanism, being preserved and restored. Now, you might be wondering what some good examples of wait steps are. They might be timer based. For example, you can just say, okay, I know that the external action is going to take three seconds, so I'm just going to drop in a wait of maybe four seconds, if you want to play it safe. If you can predict more or less how much time you're going to need, you can just sleep for that amount of time. Another thing could be waiting for another compute step. For example, you might be invoking another Lambda. We know that this is generally an anti-pattern, but in this case it might be starting to become acceptable. If you are calling another Lambda, you can wait for that other Lambda to finish.
And while you are waiting, your execution gets suspended and then resumed only when the other Lambda completes and returns some kind of response. Or maybe you can wait until a generic condition is satisfied, which is a little bit of a wrapper around the timer-based wait model we described before. The idea is that you can say, okay, I'm going to wake up this function every few seconds and check on a condition, and if that condition is satisfied, I'm going to stop sleeping and progress to the next step. Otherwise I'm going to go to sleep again and wait for the next timer interval to resume and check the condition again. And then another one that you might be familiar with if you use Step Functions is waiting for an external callback. You create a callback, which is almost like a unique ID that another service can then use to programmatically wake up that Lambda function execution. This is generally useful, for instance, when you have a human in the loop. You might have some kind of UI that gets triggered with that callback ID. Then the user will see some kind of interface, maybe do some action, and then decide whether that execution should progress or be interrupted. In that case, the UI you implemented is going to trigger the callback mechanism to resume the Lambda execution. Now, there are some other interesting implications. One of the main ones is that a durable execution can last up to one year. This is again similar to Step Functions. And by the way, this shouldn't be confused with the individual Lambda invocation, which is still limited to 15 minutes. It means that, as you suspend the execution and then resume it, the overall execution period, from the first time that Lambda invocation started to when it ends, can last up to one year. But of course each individual invocation cannot last longer than 15 minutes.
And let's actually say that this is convenient. For instance, if you're waiting for human approval, that gives you time. Maybe it could take days, or even months, for a human to be available and do the approval. And that's still a good programming model for Lambda durable functions. Finally, I think it's worth mentioning that when it stops, either because of a failure in the execution or because you're waiting for something, the workflow can later resume from the last checkpoint. We'll talk more about the details because I think there are some important nuances. The conceptual idea is that it doesn't redo the work from the beginning: whatever was completed stays completed, and it continues from the next step. This is a simplification. We'll talk more about how exactly that model works, but this is how you can build a mental model for what happens behind the scenes. So I think that should cover more or less the main ideas. What do you think?
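The "wait until a condition is satisfied" idea described above can be pictured as a loop of timer waits with a check in between. The sketch below is only a toy model in plain Python: it does not use the durable functions SDK, the name `wait_for_condition` and its parameters are made up for illustration, and the "suspension" is simulated by adding up seconds instead of actually pausing anything.

```python
from typing import Callable, Tuple


def wait_for_condition(
    check: Callable[[], bool],
    interval_s: int,
    max_checks: int,
) -> Tuple[bool, int]:
    """Toy model of a condition wait built on top of timer waits.

    Returns (satisfied, simulated_seconds_suspended).
    """
    suspended_s = 0
    for attempt in range(max_checks):
        if check():
            return True, suspended_s
        if attempt < max_checks - 1:
            # In a durable function, this is where the execution would be
            # suspended for `interval_s` seconds (nothing running, nothing
            # billed). Here we only add up the simulated waiting time.
            suspended_s += interval_s
    return False, suspended_s


# Hypothetical example: an external job that is ready on the third check.
answers = iter([False, False, True])
result = wait_for_condition(lambda: next(answers), interval_s=5, max_checks=10)
print(result)
```

In this model the function "wakes up" three times and spends two intervals suspended in between, which is the shape of the wake, check, sleep cycle described in the episode.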
A
Yeah, well, maybe we should just talk about what durable functions can do that a standard Lambda function cannot. One thing is you can run multi-step workflows just in code, without having to roll your own "Lambda, queue, state, next Lambda" orchestration. It's probably a lot easier for people to reason about. It can be very difficult sometimes when you've got your context split across multiple different AWS services. Then, you can suspend and resume cleanly, for things like timers, callbacks, as you mentioned, and human approvals, of course, without burning compute and paying for it while you're waiting. You can also keep reliable progress automatically by using the checkpointing feature, so completed steps aren't redone when the execution resumes. And you can apply resilience controls at the step level, so you can add retries, backoff and jitter without turning your business logic into a whole load of retry state plumbing. So it's a little bit of the benefit that you get from Step Functions, but in the language that you prefer to use. Then there is what I suppose you could call workflow hygiene, which can often be painful to build yourself: things like deduplication, cancellation and compensation. We're thinking about saga-style rollbacks, maybe to get that distributed transaction kind of effect. That's something you can do with durable functions. And overall the development and operations experience is improved, I would say, with a better testability story and clearer observability around a single durable execution and its steps. Okay, so that's how it compares to regular Lambda, as we see it. It might be useful then to share what kind of good use cases we have, where you might think, okay, this is a good fit for durable functions. Let's give it a try.
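To make "retries, backoff and jitter at the step level" a bit more concrete, here is a minimal sketch of the general technique (exponential backoff with full jitter) in plain Python. It is not the durable functions SDK's retry configuration, whose options are not covered in this episode; `backoff_delay`, `run_step_with_retries` and all their parameters are hypothetical names chosen for this example.

```python
import random
import time


def backoff_delay(attempt, base_s=1.0, factor=2.0, cap_s=30.0, rng=random):
    """Exponential backoff with full jitter: a random delay in [0, ceiling]."""
    ceiling = min(cap_s, base_s * factor ** attempt)
    return rng.uniform(0.0, ceiling)


def run_step_with_retries(fn, max_attempts=3, sleep=time.sleep, rng=random):
    """Run `fn`, retrying on any exception until `max_attempts` is reached."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"n": 0}


def flaky_charge():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "charged"


slept = []  # record delays instead of really sleeping
outcome = run_step_with_retries(
    flaky_charge, max_attempts=5, sleep=slept.append, rng=random.Random(7)
)
print(outcome, calls["n"], len(slept))
```

The point of keeping this logic in one helper is the same one made in the episode: the business code (`flaky_charge` here) stays free of retry bookkeeping.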
B
Yeah. I actually read recently in Yan Cui's newsletter one example that I found very, very good in terms of explaining the capabilities. So I'm just going to steal that. Sorry, Yan, if you're listening. The idea is that you can build an order processing workflow for a food delivery service. So there is some kind of trigger and a new order comes in. You might imagine there is a website or a mobile application where a user can place an order, and that's the starting point. There is an event once the order is placed, and that event triggers a durable Lambda function which implements the following workflow in steps. The first step is to save the order details in a database. Then of course we are going to broadcast this "order placed" event into EventBridge. And this is where the human in the loop might come in. We might want to implement some kind of restaurant confirmation. So that EventBridge event somehow triggers a notification to the restaurant, maybe through another web application or mobile application that is available to the restaurant, where they will see, okay, there is this new order coming in. Do you want to accept it or reject it? And you can imagine that maybe the restaurant is about to close, maybe they ran out of ingredients, maybe they are overbooked. So there might be several reasons why the restaurant might not be in a position to accept that order. The human in the loop in this case is an important element of this business flow. And of course you might also want to apply a timeout, because it makes sense for a customer not to wait forever if, for whatever reason, the restaurant cannot even receive that notification, or maybe they receive it and nobody is available to respond to it in a timely manner. So imagine this mechanism: your durable Lambda function is suspended.
This event is going to get to the restaurant somehow, and the restaurant is going to have some kind of application that can use the callback to resume or reject that execution, which is effectively confirming or rejecting that particular order. I guess this is the example that Yan provided, but you could extend it even further if you want to think about a slightly more complex workflow. You can imagine, okay, once the order is accepted, you can also start to track the progress of that order. Maybe it's waiting in a queue, then it's getting prepared, then picked up by the delivery driver, then it's delivered, and then maybe you can even have a final step which is waiting for customer feedback. Each of these can be implemented as a step inside the durable function code. And if you want to think about other examples, just to provide more use cases, more food for thought, one other good use case that I've seen is tenant onboarding. Imagine you have a multi-tenant system. Generally the onboarding of a new tenant has lots of steps. You might want to provision infrastructure, you might want to configure identity providers, you might want to think about billing and setting up payments. You might model all of that with a durable Lambda function. At each step, you might even have a human in the loop. You might have review steps. And if something fails, you know that all the previous steps can be easily reverted, or maybe you can just resume and then try to complete the missing ones. Payment retries are another good use case, because if you have a system that expects to have recurring payments and you are charging, for example, a credit card, it might happen that that credit card doesn't have enough credit at the time of charge.
But then if you retry maybe two days later, it's going to work. So you could model that kind of behavior in a durable Lambda function. And another one, which is a little bit of a spoiler because that's actually what we implemented for our use case, is media processing. Media processing generally involves lots of steps like conversion, creating thumbnails, transcriptions, and all kinds of things. So you can imagine that that complicated workflow can be modeled as a durable function. If something fails at any point, you can resume from the last completed step and you don't have to redo a bunch of steps that might actually be very expensive from a computational perspective. So I think that gives you a good few ideas on where you can use this pattern and this new capability of Lambda functions. So now I would like to talk about what the experience of writing an AWS durable function looks like.
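The tenant onboarding case, where a failure means "all the previous steps can be easily reverted", is the saga-style compensation idea mentioned earlier. The sketch below shows that pattern in plain Python under stated assumptions: `run_saga` and the step names are hypothetical, nothing here is taken from the durable functions SDK, and the actions are stand-ins that only record what happened.

```python
def run_saga(steps):
    """Run (name, action, compensation) triples in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, undo_or_noop(compensate)))
    return True, log


def undo_or_noop(compensate):
    """Allow steps that have nothing to undo to pass None."""
    return compensate if compensate is not None else (lambda: None)


def fail_billing():
    raise RuntimeError("payment provider unavailable")


# Hypothetical onboarding flow where the third step fails.
onboarding = [
    ("provision-infra", lambda: None, lambda: None),
    ("configure-idp", lambda: None, lambda: None),
    ("setup-billing", fail_billing, None),
]
ok, saga_log = run_saga(onboarding)
print(ok, saga_log)
```

In a durable function each action and each compensation would naturally be its own named step, so that a crash halfway through a rollback does not repeat the compensations that already ran.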
A
Well, good news, I guess. As we said, it's a regular Lambda function with a few extra capabilities, powerful capabilities. And the way those capabilities are provided is through a special SDK called the Durable Functions SDK. You'll need to install it for the programming language you want to use. Right now what's supported is JavaScript or TypeScript and Python. We believe that Java is in the works, and we even saw a discussion in the Rust SDK repository, so that might arrive pretty soon. Now, we were using the JavaScript/TypeScript Durable Functions SDK for our work, so that's the one we're going to talk about. Other languages might use a different syntax, but the capabilities and concepts should be the same. The first thing you'll notice is that the handler isn't just a normal Lambda handler. You wrap it with a helper called withDurableExecution that effectively turns on durable mode and injects a durable context into your function as a parameter. Then inside the handler, you don't just write one big blob of code like you normally do. You define named atomic steps, right? So instead of doing work directly, the function runs work inside explicit named steps like step one, step two, step three. And those step boundaries are really meaningful. They're the points where the platform can track progress and treat each unit as done when it comes to things like resuming. Now, the code outside steps is sometimes referred to as the orchestrator path. What this looks like in terms of the SDK is that you define a step by calling something like context.step, then you give it a step name and a callback. And that context is the special durable context we mentioned a minute ago. In JavaScript terms, that step call returns a promise. So you typically do const result = await context.step(...), and that await reads just like normal async code, but it's also the boundary where the durable engine can track completion and then persist progress when it's done.
So step results, the actual result of each of these steps that you define, are treated like durable state. The function captures whatever you return from each step in a way that can be reused when the workflow resumes, rather than recomputing everything. And then you have the concept of waits. A wait is a first-class operation. It's not just a hack. There's an explicit "wait for n minutes" construct that you can use to create a step that simply suspends your function for a while. If you're waiting for something else to settle, in a regular Lambda waiting usually means sleeping and burning up to 15 minutes while you're paying for it, or building an external timer mechanism. But here the wait actually suspends the execution and the workflow resumes later, so the function is no longer running and you don't pay while you wait. So with durable execution mode, the thing you'll get used to is the fact that a function execution spans multiple invocations, but it still feels like one flow. Even though it reads like a single sequential thing, under the hood it'll be starting, stopping and resuming across separate invocations, continuing from the next step. And the code is then workflow code. It's not just request/response code with a little bit of business logic. The return value is the final outcome of the durable execution across multiple steps, not merely the result of one atomic invocation. So this sounds pretty good. Should we dive a little bit deeper? How do they actually work? What's the magic behind them?
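The idea that one execution spans several invocations while the code still reads as one sequential flow can be imitated with a very small toy model. The sketch below is plain Python and deliberately does not import the real SDK: `ToyContext`, `Suspend`, `handler` and `run_to_completion` are all invented for illustration, the checkpoint store is just a dictionary, and the "resume" happens immediately instead of after a timer or callback.

```python
class Suspend(Exception):
    """Raised by the toy context to end the current invocation at a wait."""


class ToyContext:
    def __init__(self, history):
        # step name -> checkpointed result; survives across invocations
        self.history = history

    def step(self, name, fn):
        if name not in self.history:
            self.history[name] = fn()  # first time: run, then checkpoint
        return self.history[name]      # afterwards: reuse stored result

    def wait(self, name):
        if name not in self.history:
            # Toy simplification: mark the wait as done, then stop this
            # invocation. A real engine would resume on a timer or callback.
            self.history[name] = "waited"
            raise Suspend(name)


def handler(ctx, calls):
    order = ctx.step("save-order", lambda: calls.append("save") or {"id": 42})
    ctx.wait("wait-for-restaurant")
    return ctx.step(
        "confirm-order",
        lambda: calls.append("confirm") or f"order {order['id']} confirmed",
    )


def run_to_completion(handler_fn, calls):
    history = {}
    invocations = 0
    while True:
        invocations += 1
        try:
            return handler_fn(ToyContext(history), calls), invocations
        except Suspend:
            continue  # start a fresh invocation with the same history


side_effects = []
final_result, invocation_count = run_to_completion(handler, side_effects)
print(final_result, invocation_count, side_effects)
```

Two invocations are needed to produce one result, and the "save" side effect happens only once even though the handler body ran twice.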
B
Yeah, I think this is probably one of the most interesting and perhaps also confusing bits that you need to understand about Lambda durable functions if you want to use them correctly. So let's try to deep dive and describe what really happens when, for example, a step is completed and the state is persisted, so a checkpoint, so to speak, is created. And then what happens when there is a resume: how things are actually restored and the execution continues from the next logical step. As we said, the core idea is that you have this concept of execution history. Whenever an execution starts, you can imagine that the Lambda service is capable of storing state somewhere. And then for each step, which is treated as an atomic unit, as we said, when that step completes, the callback of that step can return data. A return from that callback means, well, I was able to calculate something that I want to retain for the next resume. Or, if you want to think of it as the state of this step: this is it, this is what I am returning, so make sure it's persisted. So what the SDK does is, every time it executes a step, once that step is successfully completed, if there is a return value, that return value is sent to the Lambda service so that it can be persisted. And this is how the checkpoint mechanism works. But we also said that there might be cases when the execution gets suspended or interrupted for errors, timeouts or other reasons, and that in those cases the execution can be resumed later. So what happens on a resume? This is, I think, the key thing that we want you to understand. If there is one thing you should take away from this episode, hopefully it's this one. The idea that might be confusing is that when the Lambda resumes an execution, it always starts to execute your code from the beginning.
So imagine you have your handler code and there are, I don't know, 100 lines of code. Even if you executed five steps and you reached line 50, the next time you resume, you're still going to restart executing code from the first line. So how is that checkpoint mechanism possible? The idea is that every time a step is encountered again in the execution, starting from the first line, the Lambda SDK is going to check: okay, did I already complete this step before? And if it did, then it's not going to re-execute the callback of that step, but it's just going to take the value from the persisted state. So effectively you can imagine the execution flow to be like, okay, I'm going to start from the beginning and then quickly check: did I do step one? Yes. Did I do step two? Yes. Did I do step three? And so on, until it gets to a point where, okay, this is a new step which I haven't executed yet. So this is exactly the point where I am, in a way, resuming the execution. But practically speaking, everything gets executed from the first line every time there is a resume. And this is really important because I think it could be a common misconception to think of suspension like, okay, you are pausing the CPU at a specific line in the code, like when you are pausing a thread or something like that, and then you just resume from that line of code. That concept doesn't exist in durable functions. You just restart from scratch. But then there is this mechanism that allows the execution to know, I already completed this step, so I'm just going to read the result and continue from the point where something still needs to be computed. In a way, you can think about this checkpointing mechanism like a cache: if you already have that result computed for this execution, there is no point in executing it again. You can just read it from the persisted state.
And the reason why you need to understand this is because sometimes it might be tempting, or it might seem to make sense depending on what you're trying to implement, to use non-deterministic code outside steps, in what we called the orchestrator path before. You can have a sequence of steps, but of course nothing is stopping you from having business logic outside steps, and that's not getting checkpointed. So suppose that in this orchestrator path code you use stuff that is non-deterministic. For instance, you might use Math.random or a UUID, or you might be using time-based logic. In JavaScript, you might have a Date.now(), for example, and then an if statement that checks, I don't know, are we after 5pm? Then I'm going to do something, otherwise I'm going to do something else. You need to understand this is not going to give you a predictable execution. Effectively you are making your execution non-deterministic, because the next time you resume you might get different values and therefore your code is going to take a different path, and you end up with subtle bugs or behaviors that you didn't expect. So this is why it's really important to understand how the model is built and how checkpointing works, because then you can avoid this kind of issue. So hopefully that clarifies what I think is one of the main misconceptions about durable Lambda functions. But you might be wondering, because this is a very new feature, what is the current state of the ecosystem? Should I wait before using the new feature? Or maybe it's already in a good state where I can start leveraging it for my applications.
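The non-determinism pitfall is easier to see when the same handler is replayed twice against the same history. The sketch below is a toy model in plain Python, not the real SDK: `ToyContext`, the two handlers and `branches_across_replays` are hypothetical names, and the "random" values are fed in from a fixed list so the outcome is reproducible.

```python
class ToyContext:
    """Minimal stand-in for a durable context: steps are cached by name."""

    def __init__(self, history):
        self.history = history

    def step(self, name, fn):
        if name not in self.history:
            self.history[name] = fn()
        return self.history[name]


def unsafe_handler(ctx, rand):
    # Orchestrator path: this line runs again on every replay, so the
    # value (and therefore the branch) can change between invocations.
    coin = rand()
    return "fast-path" if coin < 0.5 else "slow-path"


def safe_handler(ctx, rand):
    # Inside a step: the value is computed once, checkpointed, and read
    # back from history on every later replay.
    coin = ctx.step("flip-coin", rand)
    return "fast-path" if coin < 0.5 else "slow-path"


def branches_across_replays(handler, samples):
    """Replay `handler` once per sample against one shared history."""
    history = {}
    source = iter(samples)
    rand = lambda: next(source)
    return [handler(ToyContext(history), rand) for _ in samples]


unsafe_branches = branches_across_replays(unsafe_handler, [0.2, 0.9])
safe_branches = branches_across_replays(safe_handler, [0.2, 0.9])
print(unsafe_branches)
print(safe_branches)
```

The unsafe version takes a different branch on the second replay, while the version that wraps the value in a step keeps taking the same one. The same reasoning applies to UUIDs and to clock reads such as the "are we after 5pm" check.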
A
Okay, let's talk about the whole ecosystem then and what it's like as a developer, what the developer experience is, et cetera. So the SDK for TypeScript, I think we found, is pretty good, right? It even supports testing, with mocking and local execution, which is really good for DX. There are some good articles by Eric Johnson if you want to see some concrete examples. We'll have the links in the description. Middy 7, which seems to be really keeping pace with all the new developments, already supports durable functions. So again, if you haven't tried Middy, there'll be a link to that in the description too. The Lambda Powertools team has worked very closely with the Lambda team to make sure everything works as expected if you're using Powertools. Still, durable functions are very new and there's definitely some room for improvement in the whole area of DX. We found some missing features or inconsistencies in the SDK and some small glitches in the web console as well. But it's pretty minor stuff and we're sure it's going to be fixed soon. And I guess we look forward to seeing more languages supported. I'm sure Java, .NET and Go fans will all like to see it. The interesting thing on this front is that the runtimes don't actually have to change. It seems to be just an SDK that's required, so it should just be a matter of time. And we can guess that the reason why broader support doesn't exist yet is because AWS is trying to build these SDKs in a way that feels idiomatic to the specific language. The JavaScript one, as we mentioned, relies heavily on promises, while the Python one uses decorators. Now all of the theory and the deep diving is done. Shall we talk about the fun part? What did we build?
B
Yes, you might remember our podcast transcription service that we described back in episode 63. Or maybe not, because that was three years ago. Pretty much exactly three years. I think it was sometime in January or February three years ago. So yeah, if you don't remember, don't worry, you are officially excused. But you can always go and check out that old episode if you're curious. I'll try to give you a quick refresher on what this project is. It's called PodWhisperer and it's our own solution, fully open source, that allows us to create transcriptions for this very podcast. It was originally based on OpenAI Whisper and Amazon Transcribe. And you might be wondering, why are you using two different transcription services and not just one? You can listen to the entire episode to know the entire story. But in short, we use OpenAI Whisper because it's really, really good in terms of quality of transcriptions. It recognizes most of the words without mistakes. But one problem is that it doesn't recognize speakers. What Transcribe does is kind of the opposite. It isn't always very accurate, or at least it didn't used to be three years ago. I don't know if it has improved now, to be honest. But three years ago it did a very good job at recognizing different speakers, so giving you speaker labels, speaker one, speaker two, trying to figure out how many people are actually engaging in the conversation. So what we did is, okay, we tried to get the best of both worlds by doing the transcription twice, once with one service and once with the other. And then we have a slightly convoluted workflow that tries to join the two results and extract the information that we need from both: the actual words from Whisper and the speaker labels from Transcribe. PodWhisperer was born as a way to orchestrate this entire workflow.
What we recently discovered is that there is actually a new project based on Whisper that is called WhisperX. We'll have the link in the show notes. And it's actually pretty cool because it's still using Whisper under the hood, but adds a few extra steps using additional models. One of those steps is adding word-level timestamp synchronization, which can be really useful for a bunch of different use cases that we'll mention in a second. The other step is what is generally called diarization, which is effectively recognizing the different speakers. So you can imagine that internally, when you use WhisperX, there are three different AI models that get executed in a pipeline. The first one just gets the raw words, the transcription, in segments, with where each segment starts and finishes. The next step is word-level timestamps. The second model takes the output of the previous model and the audio file again, and tries to figure out where each single word starts and finishes. And then the third step in the pipeline is trying to figure out, okay, for each sentence or segment and word, who is the speaker that is talking now? Of course, who is the speaker in the sense of a speaker label. It doesn't try to get the name. It just figures out, okay, this is a different person talking now, or maybe it's the same person as before, and calls them Speaker 1, 2, 3 and so on. And the cool thing is that it also runs on GPU, and we noticed that it is much faster at transcribing when you have a GPU available. We noticed, for example, that on a g5.xlarge it takes about five minutes or less to transcribe 30 minutes of audio. So, because we have been meaning to switch to a model like this for a while, we thought, okay, this is a really good option that we should try. And maybe this can replace our complex workflow where we run two different transcriptions in parallel and then join the results.
Maybe just using WhisperX is going to be good enough for us. And at the same time, there were a few other features that we had wanted to implement for a while, so we took the opportunity to say, okay, now that we are rewriting this transcription workflow, maybe we can also add the extra features. One of these features is the following. In the last few episodes, every time we got the transcription file, we started to manually feed it to an LLM, giving it enough context to understand, okay, we are talking about something related to AWS, can you make sure that everything makes sense? Most likely there might be things that are misspelled or slightly out of context, or names of services that are not properly written, or casing that is not respected. The names of the people talking are not always correct. For instance, Eoin is always spelled as O, W, E, N, which we know is not the correct one for you, Eoin. All these kinds of things LLMs are actually really, really good at fixing. We used to fix them manually, but that's a lot of work, and now you can just drop all of that text with a little bit of context into an LLM and you get a pretty good result. So this is kind of a refinement step. And we started to realize that we could actually do a few more refinement steps. Another one is that we generally get Speaker 1 and Speaker 2 in our transcriptions and then we have to manually check, okay, who is the first one talking? Okay, this is Speaker 1, and we change the label manually. With the second one, we change the label manually too. LLMs are also really good at detecting that, because generally we say something like, my name is Luciano and I'm joined by Eoin. And that's a good signal to the LLM that the person speaking now is Luciano and the other one is Eoin. So we also included this in the refinement step: we tell the LLM, can you try to detect the names of the speakers and replace the labels?
And the next step is possible because we have word-level timestamps that are now provided by WhisperX. One of the problems that you might have noticed, if you're used to watching these episodes on YouTube, is that sometimes Whisper gives you pretty big segments, multiple lines of text. Sometimes, if you use captions, you might see an overlay in our videos that is like three lines of text, which is pretty unreadable, to be honest. So this is something that has annoyed me a lot. And once you have word-level timestamps, you can start to split the segments in whatever arbitrary way you want, because of course you can decide, okay, I always want to have no more than one line of text and no more than, I don't know, 40 characters or maybe 10 words. And of course we included this logic in our workflow. So we try to break down the segments into something that is going to be more readable. So basically, out of all of these ideas and features we wanted to implement, this is what we did for PodWhisperer v2. And it's all open source. You can check out the repo, which will be in the show notes. And just to recap what's happening here, we are also using a few other things that are pretty cool in my opinion. So let me just tell you very quickly what happens in each step, end to end, when you use PodWhisperer. So the first thing that happens is that we drop a file into S3, so an audio file, and that creates an EventBridge event which effectively is going to start the durable function execution. The first thing that the durable function execution does is send an event into SQS saying this file is available for transcription. And what happens behind the scenes is that the SQS queue is being monitored by an ECS Managed Instances cluster. And if you don't know what that is, we recently spoke about it at length. It's episode 150, check it out in the show notes.
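Going back to the caption-splitting idea mentioned above (no more than about 40 characters or 10 words per caption), here is a minimal sketch of one way to do it: greedily pack words until the next one would break a limit. The `Word` and `Caption` shapes and the `splitIntoCaptions` name are assumptions for illustration; the actual PodWhisperer logic may differ.

```typescript
// Illustrative word-level timestamp shape (seconds); not WhisperX's exact schema.
interface Word {
  text: string;
  start: number;
  end: number;
}

interface Caption {
  text: string;
  start: number;
  end: number;
}

// Greedy packing: start a new caption whenever adding the next word would
// exceed the character limit or the caption already holds maxWords words.
function splitIntoCaptions(words: Word[], maxChars = 40, maxWords = 10): Caption[] {
  const captions: Caption[] = [];
  let texts: string[] = [];
  let start = 0;
  let end = 0;

  for (const word of words) {
    const candidate = [...texts, word.text].join(" ");
    const overLimit = candidate.length > maxChars || texts.length >= maxWords;
    if (texts.length > 0 && overLimit) {
      captions.push({ text: texts.join(" "), start, end });
      texts = [];
    }
    if (texts.length === 0) start = word.start;
    texts.push(word.text);
    end = word.end;
  }
  if (texts.length > 0) captions.push({ text: texts.join(" "), start, end });
  return captions;
}

// Fake timestamps: each word starts half a second after the previous one.
const words: Word[] = "so this is something that has annoyed me a lot and now it is fixed"
  .split(" ")
  .map((text, i) => ({ text, start: i * 0.5, end: i * 0.5 + 0.4 }));

const captions = splitIntoCaptions(words);
console.log(captions.map((c) => `[${c.start}s] ${c.text}`).join("\n"));
```

A single word longer than the limit still gets its own caption here; a production version might also prefer to break at punctuation or long pauses.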
But the main idea is that we want to have an easy way to bootstrap a machine that has a GPU only when there is work to do and then shut it down when there is no work left. So that mechanism allows us to do that. We drop a message into SQS. ECS Managed Instances is configured to monitor that queue and spin up an instance when there is work to do in the queue. And at that point the instance, sorry, the cluster will start a service we configured, which is basically an image with WhisperX already preconfigured and all the necessary models preloaded into it, and it's going to do all the transcription and then it's going to call a callback. So this is another detail that maybe I didn't explain very well. After we drop something into the queue, the execution pauses waiting for a callback. So effectively the message that we send into the queue says: this is the file that needs transcribing and this is the callback ID that you need to call when you're done. And of course there is also a timeout that should be reasonable depending on the length of your episodes. If you want to use this tool you can configure the length. In our case I think it's about 60 minutes, which should be more than reasonable. Then we have all the other steps. I'm just going to go through them very quickly because they're probably less interesting. The second step is basically what we call replacement rules. We have a bunch of strict matches, like, I don't know, very often we see that, as I said, Eoin gets misspelled. So we have all the common misspellings listed out and we have replacement rules. We can also do that using regexes. For example, another use case is AWS Bites, which is often spelled with a Y rather than an I. So we have regexes that capture that and can fix it on the fly. Then we have that LLM refinement step. So it's effectively using Bedrock and creating a prompt for Bedrock to say: can you check if there are potential common issues there and fix them? And can you also try to identify speakers?
And then give us back a structured JSON that we can use to reconcile your proposed changes with our existing transcript. Then we have the segment normalization step. So effectively we break down each segment into smaller chunks so that they are more readable. And finally we have another step that generates captions in the common formats, for instance SRT or VTT, which is what we use on YouTube. And we also have our own custom JSON format. That's what we use to build our website. If you look at our website, you can go to the transcript tab and you will see the entire text, and you can even click around and that will move the video to that specific point. So this is the JSON we use to build that feature in the UI. And finally, when everything is done, we trigger an event on EventBridge saying PodWhisperer has finished a transcription. And this is something you can use for any arbitrary extension mechanism if you want to use PodWhisperer. In our case we have another tool, also open source, called Episoder, which you'll find linked in the show notes, and which does another step, which is basically trying to update the website for us. So it creates a PR to our website repo with the new episode description, trying to figure out what the chapters are, suggesting a description, suggesting tags for YouTube, all that kind of stuff. So again, everything is open source on GitHub. If you're curious, check it out, and if you have ideas on how to improve it, again, it's open source, feel free to submit issues or PRs. Now, probably before we move into final topics like comparison with other tools and pricing, does it make sense to quickly recap some of the best practices or things that can bite you?
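Two of the simpler steps in the walk-through above, replacement rules and caption generation, can be sketched as follows. The rule format, function names and sample text are assumptions for illustration, not PodWhisperer's actual configuration. Only SRT is shown; WebVTT is similar but starts with a `WEBVTT` header and uses a dot instead of a comma before the milliseconds.

```typescript
// Hypothetical rule format: a string is matched literally, a RegExp is used as-is.
interface ReplacementRule {
  match: string | RegExp;
  replacement: string;
}

const rules: ReplacementRule[] = [
  { match: "Owen", replacement: "Eoin" },                 // strict match
  { match: /\bAWS Bytes\b/gi, replacement: "AWS Bites" }, // regex match
];

// Escape regex metacharacters so a literal string can be used in a RegExp.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Apply every rule in order to the whole text.
function applyRules(text: string, ruleSet: ReplacementRule[]): string {
  return ruleSet.reduce((result, rule) => {
    const pattern =
      typeof rule.match === "string" ? new RegExp(escapeRegExp(rule.match), "g") : rule.match;
    return result.replace(pattern, rule.replacement);
  }, text);
}

interface Caption {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

function pad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// SRT timestamps have the form HH:MM:SS,mmm.
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const totalSeconds = Math.floor(totalMs / 1000);
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor(totalSeconds / 60) % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(totalSeconds % 60, 2)},${pad(totalMs % 1000, 3)}`;
}

// Each SRT cue is: a 1-based index, a time range, the text, then a blank line.
function toSrt(captions: Caption[]): string {
  const cues = captions.map(
    (c, i) => `${i + 1}\n${srtTime(c.start)} --> ${srtTime(c.end)}\n${c.text}`
  );
  return cues.join("\n\n") + "\n";
}

const fixedText = applyRules("Owen and Luciano host AWS Bytes", rules);
console.log(toSrt([{ text: fixedText, start: 0, end: 2.5 }]));
```

Running the deterministic rules before the LLM refinement keeps the cheap, predictable fixes out of the prompt.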
A
Yeah, because it's kind of a new programming model, that's a good thing to talk about. There is an AWS document with some best practices that we'll link. But our summary, I guess, is based on what we discussed so far. Any kind of non-deterministic code or side effects should be wrapped in steps, things like random UUIDs. Design for idempotency as well. That's always a good practice with things that are invoked at least once. Adopt the replay-aware logger from the context. We also noticed that LLMs don't understand these rules, so be careful with LLMs in general, but specifically with new things like Durable Functions. One example there is we had a case where we wanted to keep track of the total execution time of a durable function. And of course the LLM generated code outside a step, initializing a new Date object called startTime. And for reasons already stated, we know that's not going to work. So this is basically breaking one of our rules. A resume would generate an entirely new Date object and we wouldn't be able to track the total execution time across invocations. The solution there was just to calculate this date within a step, so it's properly persisted as part of the durable state and then you can just reload it on resume. You have to tell LLMs these rules very explicitly and of course always review the generated code. Important topic, Luciano. How much does it cost?
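The start-time gotcha is easier to see with a toy model of replay. The sketch below does not use the real SDK: `ToyContext`, its synchronous `step` method and the fake clock are all hypothetical stand-ins, and a `Map` plays the role of the persisted durable state. It only illustrates the rule that a replay runs the handler from the top and reuses checkpointed step results.

```typescript
// Stand-in for persisted durable state: step name -> checkpointed result.
type Checkpoints = Map<string, unknown>;

class ToyContext {
  private readonly checkpoints: Checkpoints;

  constructor(checkpoints: Checkpoints) {
    this.checkpoints = checkpoints;
  }

  // Run `fn` once and store its result; on replay, return the stored result.
  step<T>(name: string, fn: () => T): T {
    if (this.checkpoints.has(name)) {
      return this.checkpoints.get(name) as T;
    }
    const result = fn();
    this.checkpoints.set(name, result);
    return result;
  }
}

// Fake clock (milliseconds) so the example is deterministic.
let now = 1000;
const clock = (): number => now;

// Broken: startTime is computed outside a step, so every replay resets it.
function brokenHandler(ctx: ToyContext): number {
  const startTime = clock();
  ctx.step("transcribe", () => "transcript");
  return clock() - startTime;
}

// Fixed: startTime is a step result, so a replay reads the checkpointed value.
function fixedHandler(ctx: ToyContext): number {
  const startTime = ctx.step("record-start", () => clock());
  ctx.step("transcribe", () => "transcript");
  return clock() - startTime;
}

// First invocation at t=1000, then a resumed invocation at t=5000 that
// replays the handler from the top against the same persisted state.
const brokenState: Checkpoints = new Map();
const fixedState: Checkpoints = new Map();
brokenHandler(new ToyContext(brokenState));
fixedHandler(new ToyContext(fixedState));

now = 5000;
const brokenElapsed = brokenHandler(new ToyContext(brokenState));
const fixedElapsed = fixedHandler(new ToyContext(fixedState));
console.log({ brokenElapsed, fixedElapsed }); // { brokenElapsed: 0, fixedElapsed: 4000 }
```

The same reasoning applies to random UUIDs and any other value that differs between invocations: generate it inside a step so that the replay sees the value from the first run.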
B
Yes, so basically, very quickly, it doesn't change too much. Meaning that it's the same Lambda pricing as a base. But of course, because the Lambda function is now doing more stuff, you are expected to pay for those extra features. So there is an additional cost for durable operations. So these are checkpoints and related things: steps, waits, callbacks, etc. Basically you pay $8 per million operations, and then also, because there is data being persisted, you have to pay for that, and it's $0.25 per gigabyte, and then data retention is $0.15 per gigabyte per month. And this is something you can configure. I think sometimes it might make sense. For instance, I don't know, when it comes to payments, maybe you want to have a longer retention for whatever reason. But if it's something that you don't really care about once it's completed, you can have a much shorter retention and not have to pay too much for that. Now, one thing that I think is really interesting, and that I was spending a little bit of time on, is comparing Durable Functions with other industry tools that are somewhat similar. I've seen lots of people in the past talking about DBOS, so I don't know if it's something new to the listeners here, but I always find it very interesting. And other options are Temporal, I'm not sure what the right pronunciation is, and Trigger.dev. And basically you can think of the same story that we just described for Lambda Durable Functions as a more generic service that is not necessarily tied to Lambda. For instance, DBOS is effectively, if you want to do durable execution, totally open source, and you can just pick whatever machines or containers to run your code on. That's basically it. It's implemented using Postgres as a mechanism to persist the state, and then it gives you an SDK that allows you to write your durable code in TypeScript, Python, Go and Java. And of course they also have their own cloud service if you don't want to self-host it.
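Returning briefly to the pricing figures quoted above, a small estimator makes the arithmetic concrete. The prices are the ones mentioned in the episode and may change, so treat them as assumptions and check the current AWS pricing page. The `Usage` shape and the sample workload are made up for illustration, and the result is only the durable surcharge on top of regular Lambda charges.

```typescript
// Prices as quoted in the episode (USD); assumptions, not an authoritative price list.
const PRICE_PER_MILLION_OPERATIONS = 8;
const PRICE_PER_GB_PERSISTED = 0.25;
const PRICE_PER_GB_MONTH_RETAINED = 0.15;

// Hypothetical monthly usage figures.
interface Usage {
  operations: number;      // durable operations: steps, waits, callbacks, ...
  gbPersisted: number;     // data written as part of checkpoints
  gbRetained: number;      // data kept after executions complete
  retentionMonths: number; // how long that data is kept
}

// Durable surcharge only; invocations and duration are billed as usual on top.
function durableSurcharge(usage: Usage): number {
  const operations = (usage.operations / 1e6) * PRICE_PER_MILLION_OPERATIONS;
  const persisted = usage.gbPersisted * PRICE_PER_GB_PERSISTED;
  const retained = usage.gbRetained * usage.retentionMonths * PRICE_PER_GB_MONTH_RETAINED;
  return operations + persisted + retained;
}

// Example: 1,000 executions with 10 operations each, 2 GB persisted,
// and the same 2 GB retained for one month.
const total = durableSurcharge({
  operations: 1000 * 10,
  gbPersisted: 2,
  gbRetained: 2,
  retentionMonths: 1,
});
console.log(`Estimated durable surcharge: $${total.toFixed(2)}`); // about $0.88
```

With these numbers the operations themselves cost $0.08, so for small workloads the persisted and retained data tends to dominate, which is why a shorter retention can matter.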
But I think this is a great option if you want to self-host this concept of durable execution. I think I've seen somewhere, I don't have the link right now, if I find it I'll put it in the show notes, somebody trying to run DBOS on Lambda, which I think was a pretty cool idea before Durable Functions was effectively created by the Lambda team. The problem, of course, is that you wouldn't be able to easily replicate that stop-and-resume model. So you will probably be limited to 15 minutes of execution, or you'll need to do some kind of crazy orchestration to recreate all the checkpointing and resuming yourself. So I'll try to find that video and if I find it, I'll link it. I haven't fully watched it myself, but that seems to be really where the benefit of having a specific service built into Lambda comes in. Because if you have to do it yourself, it's not easy to do, or it would be much more limited than what you can actually achieve with a native service. Then Temporal and Trigger.dev are basically pretty much the same idea. I'm not sure if they both have an open source version, but they are sold more as hosted services. I've seen Trigger.dev briefly, and it seems like it has a pretty cool UI and it seems very easy to use. So that's probably another alternative to look at in case you are not really tied to Lambda. Well, I guess what I want to say with this section of the episode is that this idea is not new. It's just that the Lambda team figured out, okay, this is a capability that many people are actually using, and it is nice to have it in Lambda. But if for whatever reason you cannot use Lambda somewhere else and you enjoyed using Durable Functions, you can achieve something very similar using one of these tools. So that's why we wanted to give a mention to these other tools. I think there is another common question that I've heard a lot. Even in the presentation talk at re:Invent this question came up, and it's basically: this seems very similar to Step Functions.
So when should you use Durable Functions compared to Step Functions?
A
Well, you might think that Durable Functions are just Step Functions without ASL, the Amazon States Language, but I don't think that's necessarily true. It's obviously going to be down to preference. If you're very proficient with Step Functions and the nature of your workflows is not too complex, that might be fine. But I've definitely been in the situation where you end up with lots of Step Functions with lots of Lambda functions interspersed where you're doing business logic, and the switching back and forth can be a little frustrating. Durable functions, they're just Lambda functions, right? So one of the advantages is that you can use any event source that works with Lambda to trigger them. With Step Functions, I think, you can still trigger them from events, but you don't have the same set of supported integrations there. Now, testing and running locally: Durable Functions, as we mentioned already, seems to be pretty well designed. I've had a good few attempts at testing and running Step Functions locally, and while it's better than it used to be, it's still pretty hard and not a like-for-like experience. Whereas if you're testing in the language that you're very familiar with, it's a much more pleasant experience. When you're comparing the two as well, I'd say beware of massive parallelism. I've been able to do lots of highly scalable distributed map Step Functions, and that allows you to run tens of thousands of jobs in parallel. Now, while Durable Functions does support map steps, it doesn't seem to be designed for that kind of scale. We haven't actually tried them yet, but it does seem like it's more geared towards smaller volumes of data.
B
Yeah, I have a few more. For instance, I think there is a light we have to shine on the workflow builder of Step Functions, if you like that visual way of building different steps. Or even, for example, if you have cases where you're trying to integrate a bunch of different AWS services: in Lambda, I think that will be much more complicated because you have to write code and make sure you install the correct SDKs. With Step Functions, you can just drag and drop and connect different things. And that also gives you a pretty nice visual story of, okay, this is what happens and that's what happens. Especially when you start to have lots of branches, it might be much more complex to represent the same logic within a durable function. And for certain, you don't get a visualization built in. So that's something that you're going to be lacking. I found myself that sometimes, when I'm working on complex Step Functions, just screenshotting the flow visualizer is already pretty good documentation. While, for example, in Durable Functions that's something you need to do yourself. You need to create some kind of diagram that represents all the different states and then keep it up to date. So if you find yourself preferring this more visual model, I guess that might be a reason to pick Step Functions over Durable Functions. And another thing that is probably worth mentioning is that when you're doing distributed transactions, stateful application logic or even AI workflows, and I'm hearing lots of people building AI agent workflows, Durable Functions seems like kind of a better candidate, because it's probably mostly code that you're writing. So it's probably easier to just drop in that business logic and split it into steps within your Lambda handler. So that's maybe a case where I would prefer to pick Durable Functions over Step Functions. So that's probably everything we had to share.
I know this was a long episode, so maybe, Eoin, I don't know if you want to try to give a quick recap and then we'll wrap it up.
A
Sure thing. So what do we have to say about Durable Functions? You know, it's still Lambda, same scaling, but you can write multi-step workflows in code now with checkpoints, waits and a nice resume model. The big mental shift here is that you're thinking in atomic steps and the system persists progress, so you don't have to hand-roll orchestration glue. And we did talk about the resume model because that's where the power is, and it's also where a lot of the surprises live. So what makes us excited about this? Well, I think it feels like a really interesting middle ground between raw Lambda plus lots of glue and full-blown orchestration. I think we're really excited about the long waits and human approvals, the fact that you can do this now without paying for them, and the fact that a durable execution can hang around for up to a year. That's pretty impressive. So has anybody out there tried Durable Functions yet? We'd love to hear from any listeners or viewers. What did you build? What tripped you up first? What were your successes? And if you haven't tried them yet, where do you think they'd fit better than Step Functions in your world? So let us know in the comments or reach out on socials. We really want to hear real-world experiences, good and bad. And if you've got a weird edge case or gotcha story, even better, send it our way and we might cover it in a follow-up episode. Lastly, thanks again to fourTheorem for backing us and powering this episode. If you want help designing and implementing an AWS architecture that's simple, scalable, and not too hard on cost, head to fourtheorem.com. Thanks so much for joining us again and we'll catch you on the next episode.
Episode Date: February 6, 2026
Hosts: Eoin Shanaghy & Luciano Mammino
This episode of AWS Bites dives deep into AWS Lambda Durable Functions, a major new feature announced at re:Invent 2025. The hosts discuss what "durable" functions actually mean for Lambda, how the new patterns work, key use cases, and their firsthand experience rebuilding an open source app to take advantage of the new capabilities. They also address sticky points—like the deterministic execution model, debugging gotchas, comparison with Step Functions, cost considerations, and alternatives outside AWS.
[00:00–09:10]
Memorable Quote:
“It’s still lambda. It still has the same runtime, the same scaling, but with a framework that can now checkpoint progress, suspend execution... and resume later from a safe point, skipping the work you already completed.” — Eoin, [00:57]
[09:10–15:47]
Notable Example:
“...you can build an order processing workflow for a food delivery service… durable lambda function which implements the following workflow in steps: save the order, broadcast to EventBridge, wait for a restaurant confirmation, potentially use a callback for human approval, handle timeouts, then track order progress, feedback, etc.” — Luciano, [11:30]
[15:47–19:26]
withDurableExecution).
context.step('step1', ...)) and let the platform handle checkpointing.
Key Point:
“… a function execution spans multiple invocations, but it still feels like one flow, even though it seems like a single sequential thing under the hood.” — Eoin, [17:20]
[19:26–24:58]
Memorable Moment:
“If there’s one thing you should take away...: when the lambda resumes... it always starts to execute your code from the beginning... every time a step is encountered, the SDK checks if it’s completed... if it did, just takes the value from state.” — Luciano, [21:00]
[24:58–26:42]
middy and Lambda Powertools) are aligning quickly to support durable features.
[26:42–37:44]
Timestamped Narrative:
“We drop a file into a stream, that creates an EventBridge event... starts the durable execution. First thing: send an event into SQS... which triggers ECS cluster for transcription. After that, execution pauses waiting for a callback…” — Luciano, [31:35]
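The "send a message, pause, resume on callback" flow in this quote can be modelled with a toy example. Nothing here is the real SDK or the PodWhisperer code: `ExecutionState`, `handler`, the in-memory `queue` and the callback ID format are hypothetical, and the real service takes care of persistence, timeouts and re-invoking the function.

```typescript
// Toy persisted state for one durable execution.
interface ExecutionState {
  steps: Map<string, unknown>;    // checkpointed step results
  callbacks: Map<string, string>; // callbackId -> result delivered by the worker
}

type Outcome =
  | { status: "suspended"; waitingFor: string }
  | { status: "done"; result: string };

// In-memory stand-in for the SQS queue read by the GPU worker.
const queue: { file: string; callbackId: string }[] = [];

function handler(state: ExecutionState, file: string): Outcome {
  const callbackId = "transcribe:" + file;

  // Step: enqueue the job once; a replay skips it because it is checkpointed.
  if (!state.steps.has("enqueue")) {
    queue.push({ file, callbackId });
    state.steps.set("enqueue", true);
  }

  // Wait for callback: with no result yet, this invocation ends here.
  const transcript = state.callbacks.get(callbackId);
  if (transcript === undefined) {
    return { status: "suspended", waitingFor: callbackId };
  }

  // Later steps (replacement rules, refinement, captions) would follow here.
  return { status: "done", result: "captions for: " + transcript };
}

const state: ExecutionState = { steps: new Map(), callbacks: new Map() };

// First invocation: the job is enqueued and the execution suspends.
const first = handler(state, "episode.mp3");

// The worker picks up the message, transcribes, and calls back with the ID
// it was given in the message.
const job = queue.shift();
if (job) state.callbacks.set(job.callbackId, "hello world");

// Resumed invocation: replays from the top, skips the enqueue, then finishes.
const second = handler(state, "episode.mp3");
console.log(first.status, second.status, queue.length); // suspended done 0
```

The empty queue after the second run is the point of the exercise: the enqueue side effect happened once even though the handler body ran twice.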
[37:44–39:17]
[39:17–43:20]
[43:20–47:00]
Quote:
“Durable functions are just Lambda functions… One of the advantages is you can use any event source… [with] step functions you don’t have the same set of supported integrations.” — Eoin, [43:37]
Durable Functions are a compelling new middle ground between classic Lambda and full-blown orchestration, with big benefits for complex, code-driven AWS workflows. They're code-first, event-driven, and support long-lived executions (up to a year), with built-in checkpointing, suspension, and resume—all while only paying for what you use.
Let the hosts know: What’s your experience with Lambda Durable Functions? What worked, what didn't, and where do they fit best over Step Functions?
Full episode resources, code examples, and more at the AWS Bites website.