Transcript
A (0:00)
If you've been listening to AWS Bytes for a while, you've probably noticed a pattern. We keep coming back to Lambda. And that's not a coincidence. We're big fans. It's one of those services we like because it's very convenient. You can write tiny little functions in the programming language you like. They run on demand when specific events happen. They scale like crazy when you need them to, and, even better, scale to zero when nothing happens. And you only pay for what you use. Of course, Lambda is not always the best solution for everything, as lots of listeners like to remind us, which is completely fair. The moment you try to do anything that looks like a workflow, for example, Lambda can start to feel like it's fighting against you. You've got 15 minutes max execution time. It's stateless by default, and if you need some orchestration like retries, backoff, all that kind of stuff, you end up bolting on something like Step Functions, queues, schedules, and a bunch of extra stuff you didn't really want in the beginning. And it's not always easy to get that stuff working reliably. Now, what if you could keep the Lambda model we all know and love, but add a few extra superpowers that might help us overcome some of those challenges? Well, last December, we got a few new superpowers. At re:Invent 2025, AWS Lambda durable functions were announced. And to be honest, we're pretty excited about this one. Now, it's still Lambda. It still has the same runtime, the same scaling, but with a framework that can now checkpoint progress, suspend execution when you need to wait, and resume later from a safe point, skipping the work you already completed. And this is what we're going to talk about in detail today. We're going to break down what durable actually means in practice and how this whole resume mechanism works under the hood.
We'll talk about when this approach is a huge win compared to the usual patterns, and of course, the gotchas that can surprise you, especially around determinism, idempotency, and debugging resumed executions. Finally, we'll also talk about one of our own open source applications that we rebuilt from scratch to have an excuse to use durable Lambda functions and see what it feels like to use them in a real project. My name is Owen, I'm joined by Luciano, and this is AWS Bytes. Okay. Hi, Luciano. Would you like to start off by telling us what a durable function is? What are the basic ideas around it?
B (2:28)
Of course, yeah. So, as you said in the intro, it's still Lambda, right? It's the same service, the same model, the same scaling. That doesn't change. So that also means that you still write a Lambda handler in the usual way, with the same runtimes that you know and love, the same type of resources, the same scaling mechanics. The difference is that there is now a new flag that you can turn on to turn a regular Lambda function into a durable Lambda function, and that opts the function into what is called the durable execution engine, which you can use through a dedicated SDK. So there are these new capabilities that we described, and they become available by using a special SDK that you need to install. What is the core difference? There is, I guess, a mental shift that we need to embrace when we switch a regular Lambda function into a durable function, which is that you have to stop thinking about one Lambda invocation and start thinking about a workflow made of atomic steps. This dedicated SDK allows you to write your business logic inside your handler as a sequence of explicitly named steps. And you can think of every step as an atomic unit of work. So you can think about it as: okay, do this, then do that. And of course, each step has clear boundaries and outcomes. This step model is what makes the idea of checkpointing possible. And this idea means that after a step is completed, the framework can create a checkpoint, which is basically a way of saying that all the state that was in the Lambda function at that point is persisted in this workflow. So it's like, okay, we are making progress. We completed this step. The result of this step was, for example, an object, and that object is persisted inside the runtime execution state. Other than that, we also have to talk about suspension, because we mentioned it.
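The step-and-checkpoint idea described here can be sketched as a toy model in plain Python. This is not the durable execution SDK, and it makes no claim about the real API: `ToyContext`, `step`, the step names, and the in-memory `store` dictionary are all hypothetical stand-ins, chosen only to show how named steps let a re-run skip work that already completed.

```python
# Toy model of step checkpointing and replay. NOT the real durable execution
# SDK: ToyContext, step() and the dict-based store are hypothetical stand-ins.
from typing import Any, Callable, Dict, List


class ToyContext:
    def __init__(self, store: Dict[str, Any]) -> None:
        self.store = store              # persisted checkpoints: step name -> result
        self.executed: List[str] = []   # steps that actually ran in this invocation

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # A step that already has a checkpoint is skipped and its result reused.
        if name in self.store:
            return self.store[name]
        result = fn()
        self.store[name] = result       # checkpoint only after the step succeeded
        self.executed.append(name)
        return result


def handler(ctx: ToyContext, fail_at_charge: bool) -> str:
    order = ctx.step("load-order", lambda: {"id": 42, "total": 10})

    def charge() -> Dict[str, int]:
        if fail_at_charge:
            raise RuntimeError("payment provider timeout")
        return {"charged": order["total"]}

    receipt = ctx.step("charge-card", charge)
    return ctx.step(
        "send-email",
        lambda: f"order {order['id']} charged {receipt['charged']}",
    )


store: Dict[str, Any] = {}

# First invocation crashes in the second step; the first step is checkpointed.
first = ToyContext(store)
try:
    handler(first, fail_at_charge=True)
except RuntimeError:
    pass

# The retry runs the handler from the top, but only the missing steps execute.
second = ToyContext(store)
result = handler(second, fail_at_charge=False)
```

In this sketch the handler always runs from the top; what changes on the retry is that `load-order` returns its stored result instead of running again.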
So what does it mean that a Lambda function execution can stop? It can stop for a few different reasons. For instance, there can be unexpected stops if there is an error or some kind of crash, a timeout or something like that. In that case, there will be a retry mechanism that kicks in and will start to re-execute the function. Or there may be planned stops as well. And this is also a new concept, because sometimes in your business logic you just have to wait for something external to happen. We'll talk about some examples in a second. The idea is that when you want to wait, you don't need to keep the Lambda running, because that consumes resources and it's going to cost you money: you have CPU and memory that are occupied doing effectively nothing. So what happens is that with durable functions, the Lambda can now be suspended, which means that the instance is literally stopped. Nothing is running. You're not going to be paying for that until something happens that wakes up the execution and starts a new fresh invocation, but with all the state that we discussed before, with the checkpointing mechanism, being preserved and restored. Now, you might be wondering what are some good examples of wait steps. They might be timer based. For example, you can just say, okay, I know that the external action is going to take three seconds, so I'm just going to drop in a wait of maybe four seconds, if you just want to play it safe. You can predict more or less how much time you're going to need and you can just sleep for that amount of time. Another thing could be waiting for another compute step. For example, you might be invoking another Lambda. We know that this is generally an anti-pattern, but in this case it might be starting to become acceptable. If you are calling another Lambda, you can wait for that other Lambda to finish.
And while you are waiting, your execution gets suspended and then resumed only when the other Lambda completes and returns some kind of response. Or maybe you can wait until a generic condition is satisfied, which is a little bit of a wrapper around the wait model we described before, the timer-based one. The idea is that you can say, okay, I'm going to wake up this function every few seconds and then I'm going to check on a condition, and if that condition is satisfied, I'm going to stop sleeping and progress to the next step. Otherwise I'm going to go to sleep again and wait for the next timer interval to resume and check the condition again. And then another one that you might be familiar with if you use Step Functions is waiting for an external callback. So you can create this concept of a callback, which is almost like a unique ID that another service can then use to programmatically wake up that Lambda function execution. This is generally useful, for instance, when you have a human in the loop. You might have some kind of UI that gets triggered with that callback ID. Then the user will see some kind of interface, maybe do some action, and then decide whether that execution should progress or be interrupted. In that case, the UI you implemented is going to trigger the callback mechanism to resume the Lambda execution. Now, there are some other interesting implications. For instance, one of the main ones is that a durable execution can last up to one year. And this is again similar to Step Functions. By the way, this shouldn't be confused with the individual Lambda invocation, which is still limited to 15 minutes. This means that, counting every time you suspend the execution and then resume it, the overall execution period, from the first time that Lambda invocation started to when it ends, can last one year. But of course each individual invocation cannot last longer than 15 minutes.
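The "wait until a condition is satisfied" pattern, described here as a wrapper around timer-based waits, can be sketched as a toy model. Everything in it is an assumption made for illustration, not the real SDK: `Suspend`, `ToyContext`, `wait_for_condition`, and `run_to_completion` are hypothetical names, and the "engine" is just a loop that re-invokes the handler after each simulated timer.

```python
# Toy model of "wait for condition" built on timer waits. NOT the real durable
# execution SDK: every class and function name here is a hypothetical stand-in.
from typing import Any, Callable, Dict, Tuple


class Suspend(Exception):
    """Ends the current invocation until a timer of `seconds` has fired."""

    def __init__(self, seconds: int) -> None:
        super().__init__(f"suspend for {seconds}s")
        self.seconds = seconds


class ToyContext:
    def __init__(self, store: Dict[str, Any]) -> None:
        self.store = store
        self._counter = 0  # position-based IDs keep replays deterministic

    def _next_id(self, kind: str) -> str:
        self._counter += 1
        return f"{kind}-{self._counter}"

    def step(self, fn: Callable[[], Any]) -> Any:
        key = self._next_id("step")
        if key not in self.store:
            self.store[key] = fn()
        return self.store[key]

    def wait(self, seconds: int) -> None:
        key = self._next_id("wait")
        if key not in self.store:
            self.store[key] = "scheduled"  # remember the timer, then stop running
            raise Suspend(seconds)
        # On replay the timer has already fired, so the wait is skipped.

    def wait_for_condition(self, check: Callable[[], bool],
                           interval_seconds: int, max_attempts: int) -> bool:
        for _ in range(max_attempts):
            if self.step(check):           # each check result is checkpointed
                return True
            self.wait(interval_seconds)    # suspend instead of sleeping
        return False


def run_to_completion(handler: Callable[[ToyContext], Any],
                      store: Dict[str, Any]) -> Tuple[Any, int, int]:
    """Stand-in for the engine: re-invoke the handler after every suspension."""
    invocations, suspended_seconds = 0, 0
    while True:
        invocations += 1
        try:
            return handler(ToyContext(store)), invocations, suspended_seconds
        except Suspend as s:
            suspended_seconds += s.seconds  # no compute runs during this time


polls = {"count": 0}


def job_is_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3  # the external job is ready on the third poll


def handler(ctx: ToyContext) -> str:
    ready = ctx.wait_for_condition(job_is_ready, interval_seconds=5, max_attempts=10)
    return "proceed" if ready else "gave up"


result, invocations, suspended_seconds = run_to_completion(handler, {})
```

Each poll in this sketch is a separate short invocation, and the earlier check results are replayed from the store rather than re-evaluated, so the external job is polled exactly three times.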
And let's actually say that this is convenient, for instance, if you're waiting for human approval. That gives you time: maybe it could take days, or even months, for a human to be available and do the approval. And that's still a good programming model for Lambda durable functions. Finally, I think it's worth mentioning that when the execution stops, either because of a failure or because you're waiting for something, the workflow can later resume from the last checkpoint. We'll talk more about the details because I think there are some important nuances. The conceptual idea is that it doesn't start from the beginning, but picks up from whatever was completed and continues with the next step. This is a simplification. We'll talk more about how exactly that model works, but this is how you can build a mental model for what happens behind the scenes. So I think that should cover more or less the main ideas. What do you think?
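The human-approval flow mentioned above (suspend, hand a callback ID to a UI, resume when someone decides) can be sketched the same way. This is a toy model under stated assumptions, not the real SDK: `ToyExecution`, `wait_for_callback`, `complete_callback`, and the `cb-` ID format are hypothetical, and the "UI" is a single function call.

```python
# Toy model of a callback wait with a human in the loop. NOT the real durable
# execution SDK: all names and the callback ID format are hypothetical.
from typing import Any, Callable, Dict, Optional


class Suspended(Exception):
    """Ends the current invocation; carries the callback ID to give to the UI."""

    def __init__(self, callback_id: str) -> None:
        super().__init__(callback_id)
        self.callback_id = callback_id


class ToyExecution:
    def __init__(self) -> None:
        self.checkpoints: Dict[str, Any] = {}
        self.callbacks: Dict[str, Optional[Dict[str, Any]]] = {}
        self.step_runs = 0  # how many steps really executed, across invocations

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.checkpoints:
            self.checkpoints[name] = fn()
            self.step_runs += 1
        return self.checkpoints[name]

    def wait_for_callback(self, name: str) -> Dict[str, Any]:
        callback_id = f"cb-{name}"
        payload = self.callbacks.setdefault(callback_id, None)
        if payload is None:
            raise Suspended(callback_id)  # nothing runs until someone answers
        return payload

    def complete_callback(self, callback_id: str, payload: Dict[str, Any]) -> None:
        if callback_id not in self.callbacks:
            raise KeyError(f"unknown callback: {callback_id}")
        self.callbacks[callback_id] = payload


def handler(execution: ToyExecution) -> str:
    draft = execution.step("prepare-report", lambda: "Q3 report")
    decision = execution.wait_for_callback("manager-approval")
    if decision["approved"]:
        return execution.step("publish", lambda: f"published {draft}")
    return "rejected"


execution = ToyExecution()

# First invocation: the report is prepared, then the execution suspends.
pending_id = None
try:
    handler(execution)
except Suspended as s:
    pending_id = s.callback_id

# Days later (in this toy, immediately) the approver's UI completes the callback.
execution.complete_callback(pending_id, {"approved": True})

# The resumed invocation skips "prepare-report" and continues with "publish".
outcome = handler(execution)
```

The time between the two `handler` calls is where the long wait would sit; no invocation is running during it, which is why the 15-minute limit per invocation and the much longer limit on the overall execution can coexist.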
