Modern Distributed Applications with Stephan Ewen

Summary

Podcast Summary: Software Engineering Daily – "Modern Distributed Applications with Stephan Ewen"

Release Date: June 5, 2025

Introduction

In this episode of Software Engineering Daily, host Sean Falconer talks with Stephan Ewen, founder and CEO of Restate and co-creator of Apache Flink. The conversation covers the complexities of building modern distributed applications, the challenges of achieving resilience and fault tolerance, and how Restate aims to simplify both.

Guest Background and Journey

Timestamp: [01:03]

Stephan Ewen describes his long experience with Apache Flink, an open-source framework for unified stream and batch processing. He highlights his role in shaping Flink's early architecture, focusing on data plane coordination, snapshots, and state management.

Stephan Ewen [01:20]:
"Most of my professional life was Apache Flink so far as part of the team that started it in 2014... We kept riding that wave of Kafka Flink, the advent of real-time stream processing."

His move from Flink to founding Restate was motivated by the need to address transactional event-driven applications, such as payment processing and order orchestration, which analytical systems like Flink were not designed to handle well.

Understanding Restate

Timestamp: [03:34]

Restate is positioned as a durable execution framework aimed at simplifying the development of distributed, resilient applications. Stephan describes it as more than a durable execution engine: a platform that integrates durable execution with distributed communication and state management.

Stephan Ewen [03:42]:
"Restate goes quite a bit beyond that. It tackles the more holistic problem of applying durable execution to distributed services in general, not just individual workflows."

The Rise of Durable Execution Frameworks

Timestamp: [04:55]

Stephan attributes the surge in durable execution frameworks to the growing complexity of distributed systems, especially with the rise of microservices. Teams increasingly find that handling problems like race conditions and state synchronization consumes engineering time that would otherwise go to core business logic.

Stephan Ewen [05:13]:
"Many software development teams can't handle the challenges of distributed apps efficiently. They're spending time on race conditions, split brains, and lost updates instead of adding features."

He notes a backlash against microservices, with some advocating a return to monoliths, while others seek stronger foundational frameworks like Restate to keep the benefits of microservices without the accompanying complexity.

Current Solutions and Their Limitations

Timestamp: [07:57]

Sean Falconer asks about existing practices for managing distributed system failures, such as implementing retry schemes with exponential backoff, using queues, and building custom logic to handle idempotency and state synchronization.

Stephan Ewen [07:57]:
"Often, teams start with a queue and a retry mechanism, then incrementally add complexities like locks or versioning to handle duplicate processing. This leads to a tangled web of assumptions and fragile integrations."

Stephan critiques these ad-hoc solutions as complex and fragile, emphasizing that they often fail to fully resolve the underlying issues.
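
To make the pattern under discussion concrete, here is a minimal sketch of the hand-rolled approach: a retry loop with exponential backoff, plus a downstream service that deduplicates by idempotency key. Everything in it (`retry_with_backoff`, `PaymentService`, the `order-42` key) is hypothetical and written for illustration; it is not taken from the episode or from any particular library.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the delay cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)


class PaymentService:
    """Hypothetical downstream service that deduplicates requests by idempotency key."""

    def __init__(self):
        self.processed = {}        # idempotency key -> recorded result
        self.charges = 0           # number of times the side effect actually ran
        self.fail_next_ack = True  # simulate one lost acknowledgement

    def charge(self, key, amount):
        if key in self.processed:
            # Duplicate request: return the recorded result, do not charge again.
            return self.processed[key]
        self.charges += 1
        self.processed[key] = f"charged {amount}"
        if self.fail_next_ack:
            # The side effect happened, but the caller never hears about it.
            self.fail_next_ack = False
            raise ConnectionError("ack lost")
        return self.processed[key]


service = PaymentService()
result = retry_with_backoff(
    lambda: service.charge("order-42", 100),
    sleep=lambda _seconds: None,  # skip real sleeping in this demo
)
```

Without the idempotency key, the retry after the lost acknowledgement would charge twice. Each such safeguard has to be designed and kept consistent by hand, which is the incremental complexity the quote describes.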

Restate’s Approach to Durable Execution

Timestamp: [10:31]

Restate distinguishes itself by offering cheap durable execution with minimal latency overhead. This lets developers implement durable, stateful workflows without significant performance penalties while continuing to use their existing tools and deployment pipelines.

Stephan Ewen [10:40]:
"With Restate, durable execution becomes affordable and low latency, allowing you to assume workflow-style guarantees without making your application sluggish."

Technical Deep Dive: How Restate Works

Timestamp: [19:16]

Stephan walks through Restate's architecture and operation:

Service Registration: Services are registered with Restate, specifying their endpoints (e.g., URLs, Kubernetes deployments).
Event Invocation: When a service handler is invoked, Restate uses a streaming connection (such as HTTP/2) to manage the invocation lifecycle.
Durable Steps: Each step in the handler's logic (e.g., calling an external API, updating a database) is treated as a durable step. Results are streamed back to Restate and persisted in a consensus log.
Failure Handling: In case of failures (e.g., connection loss, process crash), Restate retries the invocation and consults the consensus log so that steps already recorded are not executed again, preserving idempotency and consistency.
State Management: Restate maintains contextual state across invocations, allowing handlers to resume operations after failures.
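
The durable-step and failure-handling ideas above can be sketched in a few lines. This is a toy model under stated assumptions, not Restate's SDK or wire protocol: `Journal`, `Context`, `invoke`, and `order_handler` are hypothetical names, the journal lives in memory rather than in a replicated log, and a `RuntimeError` stands in for a process crash.

```python
class Journal:
    """In-memory stand-in for a durable log (a real system would persist and replicate it)."""

    def __init__(self):
        self.entries = {}  # (invocation_id, step_index) -> recorded result


class Context:
    """Handed to a handler; records each step's result so a retry can replay it."""

    def __init__(self, journal, invocation_id):
        self.journal = journal
        self.invocation_id = invocation_id
        self.step_index = 0

    def run(self, action):
        key = (self.invocation_id, self.step_index)
        self.step_index += 1
        if key in self.journal.entries:
            # The step completed in an earlier attempt: replay its result, skip the side effect.
            return self.journal.entries[key]
        result = action()
        self.journal.entries[key] = result
        return result


def invoke(handler, journal, invocation_id, max_attempts=3):
    """Re-run the handler from the top after a failure; journaled steps are replayed."""
    for _ in range(max_attempts):
        try:
            return handler(Context(journal, invocation_id))
        except RuntimeError:
            continue  # simulated crash: retry with the same journal
    raise RuntimeError("invocation failed after retries")


calls = []                 # side effects that actually executed
crashed = {"done": False}  # lets the handler crash exactly once


def order_handler(ctx):
    payment = ctx.run(lambda: calls.append("charge") or "payment-ok")
    if not crashed["done"]:
        crashed["done"] = True
        raise RuntimeError("simulated process crash")
    shipment = ctx.run(lambda: calls.append("ship") or "shipment-ok")
    return payment, shipment


outcome = invoke(order_handler, Journal(), "inv-1")
```

The handler runs twice, but `calls` ends up as `["charge", "ship"]`: the second attempt replays the payment result from the journal instead of charging again. The sketch assumes the handler issues its steps in the same order on every attempt, since journal entries are matched by position.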

Stephan Ewen [23:23]:
"Restate's server persists all events in a consensus log before acknowledging any operations, ensuring durability and consistency across the system."
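
The "persist before acknowledging" rule can be illustrated with a single-node append-only file. This is a deliberate simplification and an assumption of this sketch: a consensus log would replicate each entry to a quorum of nodes before acknowledging, whereas the hypothetical `DurableLog` below only forces the write to local disk.

```python
import json
import os
import tempfile


class DurableLog:
    """Single-node append-only log; a stand-in for a replicated consensus log."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before acknowledging
        return True  # the acknowledgement is only produced after the write is durable

    def read_all(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
log = DurableLog(log_path)
acked = log.append({"step": 0, "result": "payment-ok"})

# A fresh instance, as after a restart, recovers everything that was acknowledged.
recovered = DurableLog(log_path).read_all()
```

Because the acknowledgement follows the `fsync`, any event a caller has seen acknowledged is recoverable after a restart, which is the property the quote relies on.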

Integration and Adoption

Timestamp: [24:57]

Sean asks about integrating Restate into existing projects. Stephan explains that Restate is designed for incremental adoption, so teams can introduce it gradually without overhauling their entire architecture.

Stephan Ewen [25:15]:
"You don't need to re-architect everything. You can start by importing the Restate SDK into your existing services and incrementally adopt durable execution for your most critical workflows."

This approach minimizes disruption and lets teams apply Restate where it matters most, such as in payment processing or order management systems.

Use Cases and Applications

Timestamp: [29:27]

Stephan outlines ideal and non-ideal scenarios for using Restate:

Ideal Use Cases:
- Transactional Systems: Payment processing, order orchestration where transactional correctness is paramount.
- Stateful Applications: Systems requiring persistent and contextually scoped state across invocations.
- AI Agent Workflows: Dynamic, stateful workflows driven by AI agents that require resilience and state management.

Non-Ideal Use Cases:
- Read-Heavy or Read-Only Workloads: Applications that predominantly perform read operations may not benefit significantly from Restate's features.
- Simple Retry Mechanisms: Workflows where occasional retries are acceptable without requiring strict transactional guarantees.

Stephan Ewen [29:27]:
"Durable execution makes sense for orchestrating many steps that update state. It doesn't add much value for read-heavy workloads or simple retry scenarios where eventual consistency is acceptable."

Unexpected Applications

Timestamp: [33:38]

Stephan shares surprising ways users have put Restate to work:

Replacing Distributed Queue Setups: Simplifying complex queuing systems by using Restate as a central orchestrator.
Custom Workflow and Rule Engines: Building internal tools for processes and evaluations within companies.
Industrial Automation: Implementing custom workflow and rule engines in manufacturing settings to control machines based on sensor data.

Stephan Ewen [33:38]:
"One of the most fascinating uses has been in factories where Restate powers custom workflow engines to evaluate sensor data and control machinery."

These applications show Restate's versatility beyond traditional transactional systems.

Challenges in Designing Restate

Timestamp: [35:01]

Stephan discusses the primary challenges faced while developing Restate:

Technical Complexity: Building a low-latency consensus log and ensuring scalability across distributed cloud environments.
Education and Adoption: Educating the community about durable execution beyond workflow paradigms and demonstrating its broader applicability.

Stephan Ewen [35:01]:
"Our mission is ambitious. We're building a full stack that prioritizes low latency and resilience, which requires overcoming significant technical hurdles."

He emphasizes that while durable execution is gaining traction, changing the perception that it is only for workflows and demonstrating its general-purpose utility remains a critical hurdle.

Future Impact on Distributed Systems Design

Timestamp: [37:44]

Stephan envisions durable execution frameworks like Restate becoming ubiquitous in distributed system architectures. He anticipates that they will:

Replace Traditional Orchestration Systems: Offering a more integrated and resilient approach compared to existing workflow and queuing solutions.
Facilitate AI-Driven Development: Serving as a robust foundation for AI-generated code by providing solid transactional semantics and reducing the complexity of distributed operations.

Stephan Ewen [37:44]:
"These solutions are going to replace many workflow and queuing systems because they're more approachable and integrate better with the rest of your stack."

He also points out that as AI continues to evolve, systems like Restate will be essential for keeping AI-driven workflows reliable and predictable.

Upcoming Developments and Roadmap

Timestamp: [39:42]

Looking ahead, Stephan outlines Restate's immediate plans:

Distributed Release: Launching a distributed version of Restate, enabling replication and scalability across multiple nodes.
User Collaboration: Engaging with users to gather feedback, understand diverse use cases, and refine the platform's abstractions and mental models.

Stephan Ewen [39:42]:
"We're excited to release our first distributed version in the next few weeks and to work closely with our users to understand and expand our use cases."

This phase focuses on extending Restate's capabilities and building a community around its adoption and evolution.

Conclusion

The conversation between Sean Falconer and Stephan Ewen explores the challenges of building resilient distributed applications and how Restate aims to address them through durable execution. Stephan's insights cover the technical design behind Restate, its practical applications, and its potential to change how distributed systems are designed. As durable execution frameworks gain prominence, he expects platforms like Restate to become foundational elements of modern, resilient applications.

Notable Quotes:

Stephan Ewen [05:13]:
"Many software development teams can't handle the challenges of distributed apps efficiently. They're spending time on race conditions, split brains, and lost updates instead of adding features."
Stephan Ewen [10:40]:
"With Restate, durable execution becomes affordable and low latency, allowing you to assume workflow-style guarantees without making your application sluggish."
Stephan Ewen [23:38]:
"Restate wraps distributed queuing complexities into a single service, simplifying orchestration across multiple systems."
Stephan Ewen [33:38]:
"One of the most fascinating uses has been in factories where Restate powers custom workflow engines to evaluate sensor data and control machinery."
Stephan Ewen [37:44]:
"These solutions are going to replace many workflow and queuing systems because they're more approachable and integrate better with the rest of your stack."
