
A major challenge with creating distributed applications is achieving resilience, reliability, and fault tolerance. It can take considerable engineering time to address non-functional concerns like retries, state synchronization,
Loading summary
Shawn Falconer
A major challenge with creating distributed applications is achieving resilience, reliability and fault tolerance. It can take considerable engineering time to address non functional concerns like retries, state synchronization and distributed coordination. Event driven models aim to simplify these issues but often introduce new difficulties in debugging and operations. Stefan Yuen is the founder at Restate, which aims to simplify modern distributed applications. He is also the co creator of Apache Flink, which is an open source framework for unified stream processing and batch processing. Stephan joins the show with Sean Falconer to talk about distributed applications and his work with Restate. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Sean Falconer
Stefan, welcome to the show.
Stefan Yuen
Thanks for having me. Hi Sran.
Sean Falconer
Yeah, absolutely, thanks for doing this. I'm excited to get into it. So I wanted to start off with a bit of your background, you know, what was sort of your journey and experience from working on Flink to now being the CEO and founder of Restate.
Stefan Yuen
Yeah, most of my professional life was Apache Flink so far as part of the team that started it in 2014. And in a way, probably I'm responsible for a lot of the early architecture of Apache Flink around the way the data plane, the coordination, the snapshots and all of that worked. The journey actually started even earlier in a way. It started when I was still at university in grad school and we were working in this sort of intersection between Hadoop and databases and some of the very, very early steps of stream processing had just come up. You know, like Storm was a new thing back then. And after that we actually took the project that we worked on at university and turned it into an open source project with sort of a mix of a pipeline batch processing system and maybe some early steps of a streaming system. And like as part of the open source journey we found our sort of sweet spot with users and stream processing, turned it into a stream processor and then, you know, kept riding that wave of Kafka Flink, the sort of advent of real time stream processing, stateful stream processing, unified virtual stream processing and so on. I left that space roughly 2021 to focus on something new and sort of a project that or a set of problems that caught my eye back then was kind of similar to the problems that we're trying to address with Flink. Flink being sort of an analytical system for robust analytical real time pipelines. And we're more and more being asked for how do you build more like transactional event driven Application, not the applications that aggregate events and join events and so on. And yeah, feed dashboards, feed recommenders and so on, but the type of PIP pipelines that in the end actually process payments, invoicing, orchestrate orders and shipments and so on. These types of applications that folks were sort of stitching together manually with databases and queues and lots of custom logic. And it felt like folks were looking for a solution. They even turned to systems like Flink to kind of implement that it's not a great match. You don't use an analytical system for transactional processing as a general rule of thumb. But I guess this question came up more and more. We thought, okay, we should probably start looking into that space and building something. And that is when we started started working on Restate. I've been doing this for around about two years now. And there we are.
Sean Falconer
So how do you sort of describe Restate? Do you term it or do you sort of bucket it into this class of like durable execution frameworks?
Stefan Yuen
Yeah, it definitely puts durable execution as one of the main ingredients on its list. So you're right, there's this big bucket of durable execution engines. It almost seems like there's a Cambrian explosion of those right now. Like there's a new one every few months. And reset is definitely a good candidate if you're looking for a durable execution engine. It's a little more than that, though. It's really, I would say, a more holistic platform for building distributed resilient application. It doesn't just include durable execution as in being able to sort of journal different steps in your process and being able to reliably recover them, which is this notion of, you know, like workflow style logic, but implemented in, in general code and general purpose code, not in dsl. So Restate goes quite a bit beyond that. Restate sort of tackles the more holistic problem of like, what if we try to apply this idea of durable execution not just to a single workflow, but what if we sort of incorporate concepts like distributed communication state that outlives an individual, like workflow or an individual durable execution? How would all those things interact? How do you build sort of a more general platform that applies that level of durability and resilience to distributed services in general and not just like an.
Sean Falconer
Individual workflow, going back to this sort of rise of durable execution frameworks and the idea that there's sort of like, seems like there's a new one every six months or whatever. Why do you think that is the case? Like, this is something that people are investing time into, and it seems like there's growing interest in it.
Stefan Yuen
Yeah, I think this is because the state of the art is sort of is not feasible. I think it's more and more developers and companies reaching that conclusion that the challenges that you're facing today when you're implementing distributed apps, it's just not something that many software development teams can handle. And the ones that can handle that, they're not really using their time well because they're spending most of their time really on problems that have nothing to do with the business logic. They're spending their time on problems like, you know, figuring out race conditions and how to avoid split brains and how to avoid lost updates if a zombie process appears and all those things. They should be focusing on adding features to the application and not creating workarounds around distributed systems problems. I think this has gotten particularly bad with the rise of microservices, and I think it's a big part of why there's a little bit of a backlash even against microservices right now, like for all the benefit they give you. I think many people realize how challenging distributed infrastructures with lots of microservices are. And some are just saying, okay, let's go back to the monolith. It was just a bad idea in the first place. And then there's a whole group of people that say, like, no, no, we actually like a lot of the benefits that microservices give us. We just want, like a stronger foundation to build them on. We want something that sort of frees us from dealing with many of these problems. And I think this is where the whole wave of durable execution systems get started, kind of in that movement. And I would say it gets actually more and more necessary to have those systems because applications get increasingly distributed. Right. It's not only the services you built yourself, but more and more functionality that you access is sort of is hidden behind APIs provided by SaaS vendors and so on. These are all, like, services that you interact with. They become part of your microservice architecture. Even you don't really own them and so on, but they add to the complexity of the problem. And that's a trend that's only increasing. I think that's not going back.
Sean Falconer
Yeah. Even though he said, I hate microservices and down with microservices, I'm going to go back to the monolith. Even if I build that monolith and I deploy it and I'm able to manage that, I doubt that application exists in isolation is going to have interdependencies to third party services, which then are going to reintroduce essentially all these distributed system problems. So if I'm connecting up microservices or even I'm calling like a third party API, then there's all kinds of things in a distributed system that can go wrong. There can be certain outages. So without using some sort of framework to help me solve those problems, like what are teams typically doing to do that? Are they just sort of making those requests and then, you know, at best they're doing some sort of retry scheme with exponential back off to see if they can, you know, push that through and they're okay with sometimes that not happening. Or what is it that, you know, companies are doing to try to solve these problems now?
Stefan Yuen
Yeah, I think there's a lot of different approaches to that problem. There's, I think the first observation I would throw in is like, there's a lot of companies that actually don't really get it right. You know, just the fact that there are still so many websites that tell you like, don't hit F5 while you're undergoing a booking on order process. Just like one indication that they can't really handle these like, things well, like concurrent requests. You still see lots of like a lot of artifacts that if you are a developer you can, you can understand. Okay, something has gone wrong in their backend right now. I would say just like first of all, a lot of times it's not actually getting solved correctly. And actually heard a quote in another podcast from somebody who works at a food delivery startup who said like, for, for many, many years, their solution was just like, ignore it and send a voucher until it like at some point just became really expensive. Ignore the problems and send vouchers. I would say if you want to actually solve the problem, one of the ways to do it typically is to stitch together different systems. The typical ingredients will be use a queue, a database, build your own retry loops with Bacoff. But it's not just as simple as implementing a few retries with Bacoff. Right? Like you have to always kind of worry about, okay, what actually, like, is the retry actually happening? If you're triggering this as an RPC call, you know, you might be retrying it. The whole process might actually go away. Then there's a next step. You probably put a queue in front to actually say, even if the process gets away, the event gets redelivered somewhere else. Now you have to actually worry about the fact you might actually have two processes that work on the same event twice. Are they overriding each other with their retries? Do you want to throw in a lock? Do you want to introduce versioning and conditional updates and so on? Then you might be interacting with APIs. You might call them, get a result, crash afterwards, recall them, might get a different result the second time. You call that. So the next retry actually follows a different control flow than the first one and things go completely haywire. It doesn't stop with a retry. I would say it's very often you start with a queue and a retry and then you incrementally just add like bits to guard against that bucket that you discovered at that. And then incrementally just grows really complex and it makes a hidden assumption on this is exactly how that queue behaves and that's exactly how that API behaves. And then somebody changes that and everything breaks again. And then you're back to fixing this. I don't know. I don't think they're really good solutions. If you go to the extreme end of saying, okay, here's an extremely sensitive, high value process, then sometimes folks throw in workflow engines as one solution. Right. Let's say here's an order process that we really don't want to go wrong because that can actually cost us a lot of money. You might pull in a heavyweight Oracle Orchestrator, but it's really not something you typically pull in for small microservice logic. First of all, because it's really a complicated component to have in the stack. And second of all, it's really foreign citizen. It doesn't interact well with a lot of the other logic.
Sean Falconer
Yeah, and it's probably a little heavyweight for the majority of the types of calls that you might be making in between microservices or even to external services.
Stefan Yuen
Yeah, and this is actually one of the interesting things that durable execution brings in, specifically in like an implementation like we're looking at at Restate. If you can actually make durable execution cheap. Cheap as in low latency. Having a durable step introduces a very moderate latency overhead. Then what you can actually do is you can start assuming this workflow style guarantees for a lot of code in your application. It's no longer sort of prohibitive to do that from two sides. Like it's no longer so slow and expensive that you say, oh, I really don't want this here. This is in the synchronous path of the user interactions. It's going to make everything really sluggish. But it's going to feel still fast doing that. And the second thing is you're still writing code, it still is sort of still fits in with all your tools and with all your deployment and pipelines and all your versioning, all your schema registries, you can still keep using that. So it feels like you can keep doing mostly what you are. You're just adding this fine grained reliability to your functions and get a lot of the problems out of the door. That's actually the ultimate goal of systems Lottery state.
Sean Falconer
Okay, going back to 2021, when you started working on this, where did that project start? How did you even begin to try to tackle this problem?
Stefan Yuen
Yeah, so we started initially actually trying to solve this from the side of Apache Flink. As I mentioned, we were working with a bunch of users on analytical pipelines and then this question came up. Users building sort of transactional event driven app pipelines on Flink. And we really didn't find that a good match and you know, very moderate success. There were a few that could make it work with like specific approaches, but not generally good experience. And then we started a sub project in Flink. It's kind of still around. It's called Stateful Functions. The idea was just like, let's do the thing that folks really like, which is just reliable communication, transactional state and sort of encapsulate that into an individual piece, an individual function. Think of it as like a lambda function, but when it's invoked it has contextual state. It's invoked in the concept of a key. It's sort of attached to a key and when it's invoked, it's sort of hydrated with the state of that key, can interact with that, modify that. It can basically produce a set of RPC or messages that go out to other functions. And then this is all like transactionally committed. Like the messages are sent to other functions. The state of is committed. It's almost like a stateful disaggregated serverless actor system. I think of it like that. So that in principle raised a lot of interest. Like there were a lot of folks that did like that as an abstraction could see there's like this is great for building anything from digital twins to well, yeah, transactional state machines which represent orders, even invoicing payments and so on. Fox did actually build payment processors on that. We learned that only years later. Fox was quite crazy. There's just like one linchpin that this had and this is it was built on Flink as an analytical system. And Flink's really throughput, optimized. It's like low latency for an analytical system. But that is mostly if you sort of use the at least one's semantics in the sinks, meaning you sort of like you push events right when they come. If you want actually transactional results, you're introducing a huge latency. Namely Flink is checkpoint based and you have to wait until the next checkpoint happens. That's sort of the latency you introduce if you want to say that they want to take a second step before the first step is really durable. And that is in the seconds. Right. And imagine using this as a foundation for workflow style logic. It means like every workflow step is sort of a, let's say 10 second latency was like completely impossible to do that. And it is also something that the Flink architecture could never fully remedy. So we thought if we really want to make that happen, if we want to make that vision happen of durable execution being something that's so low latency that you can use it without worrying about introducing latency overhead, even in sort of latency critical paths that are like synchronous interaction paths or. So then we'd really have to build a new stack. We'd have to start from the bottom, building on a low latency log on a low latency architecture that emphasizes fast durability and not analytical throughput. And that's how we then got started with Restate.
Sean Falconer
So what are the core building blocks of Restate? Both from a user's perspective, like what am I sort of stitching together from the developer experience? And then what is sort of the architecture behind the scenes that's helping me essentially support that in a way that is going to shield me against these kind of like outages or other issues that you might run into in distributed systems.
Stefan Yuen
So yeah, there are very different levels from which to look at. Let's look at it first from the sort of infrastructure side. Where does reset actually like sit in your infrastructure? You can think of it, it takes a similar place as a message queue or message broker. It's kind of a, or a workflow orchestrator. It's kind of a marriage of let's say the Kafkaesque event driven application and sort of the temporal esque durable execution workflow world. So it sits where a broker would sit. You're writing your logic as service handlers. The abstraction is, we try to keep microservices really as the abstraction. So you're writing services like almost as if it would be like a spring boot Application or express JS application or so. So like handlers grouped into services and then restate is the queue through which those services sort of get triggered. Right. So if you want to actually trigger a handler, you put an event in that queue that's supposed to invoke the handler and then restate and invokes that service. So in that sense, sort of classical queue in front of the service. The abstraction that we expose though is not really that of an event, but it's more that restate looks at the services and their handlers re exports them and sort of becomes a reverse proxy. So we're really trying to get away from people thinking in terms of like queues and events and trying to keep thinking in terms of like synchronous, asynchronous rpc. So that's really how you build it. Think of it. It sits in the infrastructure like a Kafka services and then. But it exposes itself as a reverse proxy that sits in front of the services.
Sean Falconer
So is the main advantage in sort of comparing this to doing something like event driven architecture with Kafka is the main advantage just having sort of a level of abstraction from having to think about events and queues and so forth as the person doing the implementation. If I'm doing the implementation, I can kind of just focus on the work that I need to do and that stuff's sort of abstracted away by restate.
Stefan Yuen
I guess that's one way to think about. I think if we go in the details of what the programming model has, maybe we'll see this. But in general you can think of it as it's like it's a level up from sort of Kafka style event driven application. You're not thinking in terms of queues and events, you're thinking in terms of durable, stateful, resilient invocations or functions. Yeah, and that sounds like maybe an academic detail, but it actually is a word of a difference because that actually means that restate takes on a lot more responsibilities. It doesn't just take on the responsibility of say, okay, I'll deliver the event and make sure the function is triggered and it's, you know, it's redelivered, it's re triggered on a failure. It also understands, okay, how do I fence retries against, you know, earlier executions, how do I lock contextual state, attach contextual state, how do I track progress? If I have multiple steps that happen as part of a function invocation and I want to actually understand that I record the result of the previous step before I start the next one, just to give ourselves an easier life. When it comes to implementing a complex control flow that would be thrown up if it's on different results during different retries, all those things, if you implement them manually, you're typically not just looking at a queue at Kafka. You're typically looking at combining a queue with a locking service, with the database with the scheduler and so on. It's a restate wraps that all together and says we're going from queue and event to doable stateful resilient function executions. And then as I mentioned before, the core sort of programming model is services that are meant to mimic RPC style service frameworks. And the simplest building block you have is really service handlers that get durable execution and then resets layers a few things on top of that. Like one concept is virtual objects that are stateful handlers that remember state across individual invocations, applications, shard around keys and then more like high level workflow constructs where you can actually add signal handlers and query handlers and so on. But all of that is sort of built on top of the general service abstraction.
Shawn Falconer
This episode of Software Engineering Daily is brought to you by Capital One. How does Capital One stack? It starts with applied research and leveraging data to build AI models. Their engineering teams use the power of the cloud and platform standardization and automation to embed AI solutions throughout the the business. Real time data at scale enables these proprietary AI solutions to help Capital One improve the financial lives of its customers. That's technology at Capital One. Learn more about how Capital One's modern tech stack data ecosystem and application of AI ML are central to the business by visiting capitalone.comtech so if I'm implementing.
Sean Falconer
One of these handlers and it gets executed, what is sort of the life of that process behind the scenes?
Stefan Yuen
Let's assume you're implementing a payment processing handler or something like this and that gets invoked. And let's say the logic that you have in there is I first have to check the status. Like was that already processed? Was it maybe canceled? Was it blocked before? So let's say the payments identified by an id I might want to call a fraud detector, I might want to update the database, send them automesage and so on. The lifecycle of executing this would be the following. Some external trigger, some external client says I want to execute that function that enters Restate, the restate server, the broker component as an event and restate will understand, okay, where does that service live? You can think of it that the service has to be registered at Restate, the service endpoint. You know, where is that deployed? Is that like here, a URL on lambda, Is that an HTTP 2 server endpoint? This Kubernetes deployment and so on, you have to register that at the server and then the server connects and pushes the invocation. If you've worked with, for example, like Amazon EventBridge or things like that, it's kind of a very similar model. So reset will then look up, okay, that handler is on that endpoint and I'm connecting to that. Let's assume it's an endpoint on Kubernetes or. So in this case, it would open a streaming connection, HTTP 2, put the invocation and then hold on to the connection and that sort of the lifeline to that single invocation or execution attempt, which allows the service to stream back things like journal progress, state update, outgoing messages. The function, let's say our payment handler would also get when it's invoked, you know, if it's a stateful handler, a virtual object, Restate would attach all the contextual status it knows for that individual handler to the invocation so that the handler could directly look up things like, okay, what's the status that. The previous status that was committed like, okay, it's still new. So let's start and execute that payment. And then let's say we're calling things like the. Let's say we call an external fraud detector API. We get the result and we say, okay, this is actually a durable step. Then the handler would put the result of that step into that stream that goes back to Restate. Restate internally has a consensus log that persists all the things it receives. And it has a bunch of sort of logic around this to understand, okay, is that information that still comes from, you know, like a valid execution attempt, or does that come from an attempt that has been fenced off in the past? You know, it has like a sort of. Yeah, it has sort of an elaborate consensus lock that supports a conditional append of that operation to the journal and it links that that operation or that that entry. That's the result of the. Of calling the fraud detector API to the original event. And if we'd say, okay, there's a failure after that point, now that failure could be, you know, like just the connection is ruptured, the process goes away, or there's a timeout, then the reset server would understand, okay, that event hasn't been completed, the execution. I didn't actually get an acknowledgement back for that yet. And it would send the event to another process, it would retry sending it to that endpoint and it would attach everything it has to that event. And that's the contextual state, like last time, but now also it would attach things like the journal entries that it already collected. Like, here's the result from the previous step, wraps that all up and sends it there. Then lets that service basically say, as I'm going through the code again, I can skip over steps that have already completed. This is what the SDK library basically does for you. Like understands, okay, that has been. That's already found in the journal. We can ignore this. This is a new step. We actually add an action or an event for that in the journal and then, you know, then it goes on. That applies to pretty much any operation recording the result of an API call, updating state, sending out a message. All these things basically become events that are streamed to the reset server and the reset server understands how to process these. These events, they all get sort of attached to the original invocation, but sometimes they also represent more like they present an outgoing event that is then routed to another service, or they represent a state update which is applied to an internal state index and so on. So it's generally an extensible event driven, event driven architecture on the server side that synchronizes over streaming protocol with the service.
Sean Falconer
I install, you know, this SDK, I set up this client, I'm wrapping up my call essentially around some of the SDK semantics or whatever, and that's going to call the Restate server. Restate is going to do its magic to make sure that that call is able to essentially be facilitated in a way that it's reliable, durable and so on. How do you make sure that the call from essentially the client to the server is done in such a way that it's reliable?
Stefan Yuen
So if you want to just the initial event, the initial call that triggers our durable handler, you have a bunch of ways to do this. You can do this through HTTP, through a client library, or you can actually just connect Kafka and it will just like pull these events from Kafka that represent these invocations. There's a few ingredients in there that help you that make this reliable. Like number one, like the reset server will not acknowledge anything before it has persisted that in its internal consensus logs. So even the original event has to go through the consensus log first before even an asynchronous submit or so is acknowledged. So we already have that durable. And the second thing is you can attach out Impotency keys to the invocation. And then the sort of event processor inside Restate server can use that to deduplicate invocation events. So that you know all the goodness of saying we deduplicate steps inside a durable handler doesn't really help you much if you can't deduplicate the invocations. So like the idempotency key support is there to do that. And then if you integrate this with Kafka, it automatically does the sort of Kafka offset mapping to our impotency mechanisms and basically gives us end to end in exactly once integration.
Sean Falconer
So if I want to start using something like Restate and I have an existing project, do I have to kind of think about re architecting everything to start with, or can I do it sort of bit by bit based on where maybe my most critical workflows are, like a payment system, for example?
Stefan Yuen
Yeah, so we've really built it to avoid having to re architect everything. And that kind of shows in many of the core abstractions. We do have to adjust the code to use the SDK to have access to some of the durable execution mechanisms like execute this code block as a durable step or access the built in transactional state or let Restate deliver that message to another service. You have to use the SDK, the library to do that. So there's some adjustment in the code. But the way you're deploying this, the way you're generally packaging this is meant to be very much in line with what you're doing anyways. Hence this sort of idea to abstract it or to give it the shape of microservice service handlers, the way you deploy it from the outside, you can very often just say, okay, this was a non Restate service, I'm importing the reset SDK, I'm starting using these Restate actions inside my code, connecting this to the Restate server, which becomes the reverse proxy. And now these services that initially used to call the service directly, then call the Restate server, which becomes the reverse proxy for the service, it's really meant to sort of allow you to plug it in incrementally, to sort of look at it one service at a time. There are a few things that really become very powerful only once you start attaching a few more services, like between services that are attached to the same reset server, you get kind of end to end exactly once RPC messaging, which is pretty nice, but even in the absence of that, you're still getting a lot of goodies. So yeah, it's totally meant for incremental.
Sean Falconer
Adoption for someone like adopting this or for teams that are adopting this approach. Does it take some work for them in terms of their thought process and the way they've traditionally developed to kind of come around to this mode of like operating and calling services?
Stefan Yuen
Yeah, I think it does a bit. And I would say mostly it almost requires unlearning a few things that they have learned in the past. So if you're coming from a traditional workflow system, we often have folks asking, okay, I'm writing this, but like, where is, you know, how do I make something a persistent activity now? Or yeah, just looking for the concept of like a workflow and activity. And then the interesting thing is like in Restate, every durable step is like an activity. Or if you want to separate it out, then make it a separate service that you call. And you don't really need workflows as a special construct anymore. It's sort of. You get similar guarantees just from your regular sort of service abstraction. Even the same observability and telemetry, you just get out of that. So you have to kind of maybe take back a step from looking for exactly the concepts you might know and just understand that a lot of the reasons why you were using those concepts, the guarantees you were really looking for, they're sort of like everywhere now in almost all the code you write with Restate. So you don't have to go to like these special constructs anymore. The second thing is understanding that many of the operations you do are now durable across failures and crashes. For example, if I'm doing something like an RPC call, a sequential RPC called request response to another system and the caller actually fails and gets recovered into a different node. That's not something that you usually assume still works, right? Because, you know, like the network call might be lost or like even if something gets sent back, like the code that actually issued the call is now waiting for the response was recovered in a completely different process, but that it still works in case of reset because all the building blocks are actually durable, persistent distributed building blocks. The RPC is basically connected to a persistent future that gets recovered in a different process and completed there. So the entire code that made the call gets recovered, restored to the point where it made the call and then completed with the result of that call and just works even if it moves around. So this is something that a lot of people don't expect to work. And that's why they're trying to code ways around that. And then they come into the discord and say like, okay, I'm not really connecting the dots here. And you basically tell them not to just like delete that. It just works. That's an interesting experience.
Sean Falconer
Are there certain kinds of projects that this makes more sense for than others? Like at what stage it makes sense to go with an approach like this versus, you know, something?
Stefan Yuen
Alternatively, I think there's a few cases where it does not make sense. And then there are a few cases where probably lots of durable execution systems could make sense. And then there are some cases where I would say that's a really good reset use case in particular. I mean, in general durable execution makes sense for workloads that sort of orchestrate many steps that update stuff. Like if you have mostly, you know, read heavy workloads or read only workloads, just like, doesn't really make sense to plug in a system like this, you know, then there, then there are workflows that, where something like durable execution is a nice convenient piece because it helps you encapsulate retries. You don't have to do them yourself. It helps you to implement asynchronous primitives a bit easier. But if anything goes wrong and some state gets lost and everything gets retried and recomputed, there's really no big deal. Like a lot of let's take a Rack pipeline or so retrieval, augmented generation, right? Like you know, you lose something, you recompute it. Worst cases, you call your LLM a few more times and it adds like half a cent to your bill. Maybe that's not a big deal. And then there's cases that, where that actually really matters, where you, where you absolutely care about transactional correctness. Where you, where you say okay, like no matter what kind of funky failure happened, I can never go back beyond before previous step. Or cases where you do explicitly need transactional state that outlives individual workflows that you can rely on, that other services can integrate with. This is a very good Restate use case because we've kind of architected it with that level of resilience in mind. Restate is implementing really its own stack. It isn't built on a database. It implements its own consensus log, it's event process. On top of that it's a complete self contained single binary. You just deploy it and run. And it has internally an extremely well thought through consensus architecture that allows you to make very strong assumptions on your semantics. And I think payment processing is a good example. Like if you want that there's a good use case for Restate, I would say specifically also, when you want something that works both in the cloud, but has also, I guess, a credible story for self hosting, then the converged single binary architecture is actually feasible to self host. It's not just like theoretically you can. It's open source and you can host it, but it's actually fun to operate.
Sean Falconer
You mentioned rag pipeline. There maybe not being the ideal use case because if it fails you can run it again or something like that.
Stefan Yuen
It's a use case, but it's a good use case even. It's very convenient to do that on top of Restate. It's just not a use case where you would rely on strong transactional correctness.
Sean Falconer
But what about a user facing application that leverages a foundation model of some sort there? Especially if I'm doing something where I'm making multiple inference calls and some sort of agentic workflow, I would think it would make a ton of sense there because you could be calling the tools to various data systems, multiple models and so forth. Is that a use case that you're seeing?
Stefan Yuen
Yeah, I think that's actually a very interesting one. As soon as you come more into the AI agent space, I think it becomes a lot more interesting for a couple of reasons. Like number one, I think agents are a good kind of match for durable execution in general because they are a bit like dynamic workflows. Workflows with a control flow is not known upfront. It's kind of determined by the responses of the LLM. And durable execution has this flexibility that you don't need to define the sequence of steps in the control flow up once you can kind of create dynamic control flow, just like record it and replay it after a failure. So I think durable execution in general matches agents very well. And then the second thing is agents, they're usually contextually stateful, so they map really well to these virtual object concepts that we have in Restate, where you have this exclusively scoped state that you have access to that you can use to remember basically not just previous steps, but also previous context. But it's still not something that is just like it's hidden in the workflow, but it's still an open state that you can probe from other services. Yeah, that you can even interact and put additional context in from other services. If that comes up, the whole abstraction just matches really nicely.
Sean Falconer
What are some of the unexpected use cases that you've seen of people applying Restate?
Stefan Yuen
Yeah, so there's some very expected use cases like, you know, classical workflow sagas Distributed state machines, the unexpected ones. It seems there are lots of folks that have fairly complicated sort of distributed queuing setups where they're starting with something like Kafka and then they're also pulling in rabbitmq and they have some, I don't know, some routers and some actors in between. I think often this is kind of a workaround to build something like you have maybe a common log and then you fan this out into like more fine grained entities that you interact with. And we have a bunch of users that basically could replace a whole zoo of sort of distributed queue orchestration just like with a single restate service. That's something we hadn't quite expected to happen so often. The second one that I found fascinating is that we've seen folks do in fact build a lot of custom workflow engines and custom rule engines. Apparently that's the thing many companies build for internal processes, internal tools and so on. So that's a quite common use case that we've seen. My favorite one is actually folks building a custom workflow and rule engine that they ship into factories to evaluate sensor data and trigger actions that controls machines. That was not on my list for one of the early use cases. So that was quite fun to see.
Sean Falconer
What would you say is the biggest challenges that you face when designing and implementing Restate?
Stefan Yuen
I mean there's technical challenges, right? The mission is extremely ambitious to say we're building a full stack that starts on the bott with a consensus log that has low latency. But also you can deploy it in extremely complicated setups, cross availability zones across regions. It tries to make good use of modern cloud architecture object stores, but at the same time bridge the gap to low latency. That's a technical challenge that we still, we've worked quite some time actually up to over two years by now on making this happen. I would say beyond that, the biggest challenge really is I'd say education of the space. Durable execution is becoming more and more known, but it still is not necessarily a mainstream concept. A lot of folks still associate it also primarily with workflows. So if you're doing durable execution for workflows, maybe you get more and more folks that like, not okay. Yeah, I know that, but if you're trying to say okay, no, we're actually sort of talking about durable execution in a more general way. It also includes state communication. Like think of it as a microservice paradigm, not a workflow paradigm. That's like, okay, I need to, I need to think about that A Bit like I think this education is something that is a big challenge, but also I would look at it positively. It's also something that is making progress. Most folks after they've gone through the initial, okay, I hadn't expected that. Let me think through it a bit. Like once they actually crack it, they usually get quite excited about it. So they still help spreading the word. So that's good.
Sean Falconer
Yeah, I mean I think that part and partial with any sort of new category creation like this is not the way that people are used to doing things. Then it's hard for people to even know that they have a problem and there's maybe a better way of doing something so they're not necessarily actively searching for it until you sort of cross the barrier of this educational awareness essentially.
Stefan Yuen
Definitely. I'm not sure if I would go as far as to say like this is a brand new category that we're creating. Like durable execution as a category exists before we started. I think the new we're bringing a bit of a new twist into it definitely. Like, you know, treating it as more than a workflow paradigm is probably something new. And then adding this like low latency capabilities that actually allow you to use it in places where you might previously not have thought it being applicable is maybe something new as well that people need to wrap their heads around. But yeah, we're also sort of working with other folks that have worked on, on creating this durable execution category are basically leveraging their work for sure.
Sean Falconer
What do you think overall the impact will be to how we design distributed systems in the future if more and more people adopt this approach of durable execution?
Stefan Yuen
I would venture a guess and say this type of solutions are going to be very, very widely adopted in a couple of years. I think they're going to replace a lot of workflow, queuing and other sort of distributed orchestration systems that are out there just because they're a nicer, more approachable way of solving these problems and they just interact better with the rest of your application stack and they can actually do things like they can actually support use cases that you might not have been thinking of before and vice versa, not using these systems, as we said before, it's just like it's getting harder and harder. This is one of the drivers I would actually throw in a second element. Why I think this is going to be extremely widely adopted in the future. And that is if you look at the whole AI trends and AI code generation, you can actually see that these systems are getting increasingly good at doing things like even Complicated business logic where like assuming you have all the domain context, you really need a bunch of steps, a bunch of non trivial steps to happen, but those systems are not the ones that solve distributed race conditions for you or like understand. Okay, hey, here is a case where, you know, if that process stalls just here and then a retry happens and forks off a copy here and then those are going to interfere with in a weird manner. I don't see that happening. Even if you think they can conceptually do that, that's probably a waste of compute power. I think if you just use a foundation like durable execution and say like, it's just an incredibly good target for our foundation for AI generated code because it's solid semantics. A lot of the problems that you really don't want anything unexplainable, semi unpredictable to be reasoning about and then put the much simpler generated business logic on top of that. It's a nice package.
Sean Falconer
Yeah, that would be great. So what's next for Restate?
Stefan Yuen
So at the moment we're working very hard on releasing the next version, which is our first distributed release. I guess by the time this comes out it's probably going to be released already. So we're targeting like two to three weeks from now the moment. If you use Restate, you can think of it as it deploys like a single node database, like a postgres. You give it a persistent volume and good. The next version gives you the complete distributed deployment power, distributed replication, scale out and everything. So that's actually a big thing that we've seen a lot of excitement building up for and we're pretty anxious to get it out there. That's the biggest immediate step. And then after that we're at the moment at the phase where we're really, we're really just excited to be working with as many users as possible. Learn from them what they're using it for, what they see as good use cases, how they think about the problem, how would they explain it to others? How would they explain this category? How would they explain explain the abstraction and the mental model? You'd have to have and really, you know, share this with the world and work with whoever is excited to work with us.
Sean Falconer
Awesome. Well Stefan, thanks so much for being here.
Stefan Yuen
Thank you for having me. Cheers.
Podcast Summary: Software Engineering Daily – "Modern Distributed Applications with Stefan Yuen"
Release Date: June 5, 2025
In this episode of Software Engineering Daily, host Sean Falconer engages in an in-depth discussion with Stefan Yuen, the founder and CEO of ReState and co-creator of Apache Flink. The conversation delves into the complexities of building modern distributed applications, the challenges of achieving resilience and fault tolerance, and how ReState aims to simplify these processes.
Timestamp: [01:03]
Stefan Yuen shares his extensive experience with Apache Flink, an open-source framework for unified stream processing and batch processing. He highlights his role in shaping Flink's early architecture, focusing on data plane coordination, snapshots, and state management.
Stefan Yuen [01:20]:
"Most of my professional life was Apache Flink so far as part of the team that started it in 2014... We kept riding that wave of Kafka Flink, the advent of real-time stream processing."
His transition from Flink to founding ReState was motivated by the need to address transactional event-driven applications, such as payment processing and order orchestration, which traditional analytical systems like Flink were not optimally designed to handle.
Timestamp: [03:34]
ReState is positioned as a durable execution framework aimed at simplifying the development of distributed, resilient applications. Stefan describes it as more than just a durable execution engine; it offers a holistic platform that integrates durable execution with distributed communication and state management.
Stefan Yuen [03:42]:
"ReState goes quite a bit beyond that. It tackles the more holistic problem of applying durable execution to distributed services in general, not just individual workflows."
Timestamp: [04:55]
Stefan attributes the surge in durable execution frameworks to the growing complexity of distributed systems, especially with the rise of microservices. Developers and companies are increasingly realizing that managing distributed system challenges like race conditions and state synchronization is resource-intensive and detracts from focusing on core business logic.
Stefan Yuen [05:13]:
"Many software development teams can't handle the challenges of distributed apps efficiently. They're spending time on race conditions, split brains, and lost updates instead of adding features."
He notes a backlash against microservices, with some advocating a return to monoliths, while others seek stronger foundational frameworks like ReState to maintain the benefits of microservices without the accompanying complexities.
Timestamp: [07:57]
Sean Falconer probes into existing practices for managing distributed system failures, such as implementing retry schemes with exponential backoff, using queues, and building custom logic to handle idempotency and state synchronization.
Stefan Yuen [07:57]:
"Often, teams start with a queue and a retry mechanism, then incrementally add complexities like locks or versioning to handle duplicate processing. This leads to a tangled web of assumptions and fragile integrations."
Stefan critiques these ad-hoc solutions for their ineffectiveness and complexity, emphasizing that they often fail to fully resolve the underlying issues.
Timestamp: [10:31]
ReState distinguishes itself by offering cheap durable execution with minimal latency overhead. This allows developers to implement durable, stateful workflows without significant performance penalties, integrating seamlessly with existing tools and deployment pipelines.
Stefan Yuen [10:40]:
"With ReState, durable execution becomes affordable and low latency, allowing you to assume workflow-style guarantees without making your application sluggish."
Timestamp: [19:16]
Stefan provides a detailed walkthrough of ReState's architecture and operation:
Stefan Yuen [23:23]:
"ReState's server persists all events in a consensus log before acknowledging any operations, ensuring durability and consistency across the system."
Timestamp: [24:57]
Sean inquires about integrating ReState into existing projects. Stefan explains that ReState is designed for incremental adoption, allowing teams to integrate it gradually without overhauling their entire architecture.
Stefan Yuen [25:15]:
"You don't need to re-architect everything. You can start by importing the ReState SDK into your existing services and incrementally adopt durable execution for your most critical workflows."
This approach minimizes disruption and allows teams to leverage ReState's benefits where they matter most, such as in payment processing or order management systems.
Timestamp: [29:27]
Stefan outlines ideal and non-ideal scenarios for using ReState:
Ideal Use Cases:
Non-Ideal Use Cases:
Stefan Yuen [29:27]:
"Durable execution makes sense for orchestrating many steps that update state. It doesn't add much value for read-heavy workloads or simple retry scenarios where eventual consistency is acceptable."
Timestamp: [33:38]
Stefan shares surprising ways users have leveraged ReState:
Stefan Yuen [33:38]:
"One of the most fascinating uses has been in factories where ReState powers custom workflow engines to evaluate sensor data and control machinery."
These diverse applications highlight ReState's versatility beyond traditional transactional systems.
Timestamp: [35:01]
Stefan discusses the primary challenges faced while developing ReState:
Stefan Yuen [35:01]:
"Our mission is ambitious. We're building a full stack that prioritizes low latency and resilience, which requires overcoming significant technical hurdles."
He emphasizes that while durable execution is gaining traction, shifting perceptions and illustrating its general-purpose utility remains a critical hurdle.
Timestamp: [37:44]
Stefan envisions durable execution frameworks like ReState becoming ubiquitous in distributed system architectures. He anticipates that they will:
Stefan Yuen [37:44]:
"These solutions are going to replace many workflow and queuing systems because they're more approachable and integrate better with the rest of your stack."
He also points out that as AI continues to evolve, systems like ReState will be essential in ensuring the reliability and predictability of AI-driven workflows.
Timestamp: [39:42]
Looking ahead, Stefan outlines ReState's immediate plans:
Stefan Yuen [39:42]:
"We're excited to release our first distributed version in the next few weeks and to work closely with our users to understand and expand our use cases."
This phase focuses on enhancing ReState's capabilities and fostering a community around its adoption and evolution.
The conversation between Sean Falconer and Stefan Yuen offers a comprehensive exploration of the challenges in building resilient distributed applications and how ReState aims to address them through durable execution. Stefan's insights highlight the technical innovations behind ReState, its practical applications, and its potential to reshape the landscape of distributed system design. As durable execution frameworks gain prominence, platforms like ReState are poised to become foundational elements in the development of modern, resilient applications.
Notable Quotes:
Stefan Yuen [05:13]:
"Many software development teams can't handle the challenges of distributed apps efficiently. They're spending time on race conditions, split brains, and lost updates instead of adding features."
Stefan Yuen [10:40]:
"With ReState, durable execution becomes affordable and low latency, allowing you to assume workflow-style guarantees without making your application sluggish."
Stefan Yuen [23:38]:
"ReState wraps distributed queuing complexities into a single service, simplifying orchestration across multiple systems."
Stefan Yuen [33:38]:
"One of the most fascinating uses has been in factories where ReState powers custom workflow engines to evaluate sensor data and control machinery."
Stefan Yuen [37:44]:
"These solutions are going to replace many workflow and queuing systems because they're more approachable and integrate better with the rest of your stack."
This summary captures the essence of the podcast episode, providing a structured overview of the discussions, insights, and conclusions shared by Stefan Yuen regarding modern distributed applications and the role of durable execution frameworks like ReState.