Podcast Summary: The Pragmatic Engineer — Designing Data-Intensive Applications with Martin Kleppmann
Host: Gergely Orosz | Guest: Martin Kleppmann
Date: April 22, 2026
Main Theme & Purpose
This episode is a deep dive with Martin Kleppmann, renowned author of Designing Data-Intensive Applications, to discuss the evolution of the book’s second edition, changes in data system architecture, the enduring fundamentals of designing reliable, scalable, and maintainable systems, and Martin’s transition between industry and academia. They explore not just technical details, but the philosophy and ethics of distributed systems, the practical reality of building at scale, and the emerging frontiers in both industry and academic research, including formal verification and local-first software.
Key Discussion Points & Insights
Martin Kleppmann's Journey into Tech
- Startup Beginnings:
- Studied computer science, then launched a startup (GoTestIt) focused on cross-browser automated testing ([02:03–03:57]).
- Lesson: Even technically sound products can struggle with adoption and commercial viability.
- Second Startup – Rapportive:
- A browser extension adding social context to Gmail; gained traction and was acquired by LinkedIn ([04:58–07:17]).
- Rapid growth via Y Combinator; challenges with visas and pressure to sell.
- Transition to LinkedIn:
- Joined LinkedIn after the acquisition; continued to work on related products and new initiatives ([09:09–10:21]).
- Moved into data infrastructure, getting involved in stream processing and working with Kafka and Samza ([10:21–13:18]).
Impact of LinkedIn & Kafka on Book’s Genesis
- Kafka’s Motivation:
- Built for integrating numerous data sources and enabling scalable, append-only log-based data streaming ([11:08–12:15]).
- Learning at Scale:
- Direct exposure to real-world distributed systems problems at LinkedIn shaped Martin’s understanding and eventually influenced the structure and content of his book ([12:29–13:18]).
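The append-only log Kafka is built around can be sketched in a few lines. This is an illustrative toy with hypothetical names, not Kafka's actual API; real Kafka adds partitioning, durability, and replication:

```python
# Minimal sketch of an append-only log, the core abstraction behind Kafka.
# Producers only ever append; consumers track their own read offsets.
class AppendOnlyLog:
    def __init__(self):
        self._entries = []

    def append(self, record) -> int:
        """Append a record and return its offset (position in the log)."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read_from(self, offset: int):
        """A consumer reads sequentially starting from its own offset."""
        return self._entries[offset:]

log = AppendOnlyLog()
log.append({"user": "alice", "event": "login"})
log.append({"user": "bob", "event": "click"})
# Independent consumers can replay the same log from different offsets:
assert len(log.read_from(0)) == 2
assert len(log.read_from(1)) == 1
```

Because the log is immutable and ordered, many downstream systems (search indexes, caches, warehouses) can consume the same stream independently, which is the "data integration" role Martin describes.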
Writing Designing Data-Intensive Applications
- Motivation:
- To provide a broad, practitioner-focused conceptual overview, covering trade-offs across multiple systems – the book Martin wished he’d had as a startup CTO ([15:22–17:01]).
- Research Approach:
- Combined learning from deep discussions with experts, extensive reading (papers, blog posts), and industry experience ([17:40–18:55]).
- Notable Quote:
- "A lot of it was just kind of being curious and talking to people actually, and just asking them lots of questions." — Martin ([17:40])
- Book Structure:
- Chapters focused on core distributed systems ideas: transactions, replication, sharding, consistency, consensus ([19:07–20:25]).
- Writing Realities:
- Significantly underestimated the effort; first edition took ~4 years (often not full-time), with publisher deadlines overshot by years ([20:58–22:09]).
Principles: Reliable, Scalable, Maintainable
- Definitions:
- Reliability: Fault tolerance (e.g., replication to handle failures)
- Scalability: Mechanisms for dealing with load changes, especially horizontal scalability ([22:22–24:31])
- Maintainability: System’s ability to evolve and remain understandable ([22:22–24:31])
- Notable Quote:
- "Reliability means fault tolerance primarily... Scalability is one of those terms that gets thrown around a lot... For this book, I tried to take a bit more dispassionate kind of approach and said scalability is just like what mechanisms we have for dealing with changes in load." — Martin ([22:22])
Second Edition: What’s New & Why the Update?
- Triggers & Motivation:
- Cloud-native systems had shifted core assumptions since the first edition; book needed to be relevant to new architectures ([25:31–26:49])
- Collaboration:
- Partnered with Chris Riccomini, who brought up-to-date industry expertise and writing skills ([26:54–28:01]).
- Major Additions:
- Focus on cloud-native systems, object stores as primitives, and managed services ([28:09–29:59])
- De-Emphasized Topics:
- Reduced coverage of MapReduce, now largely obsolete and superseded by newer standard tools ([46:19]).
- Expanded Topics:
- Added dataframes, vector indexes, and modern AI data concerns ([46:19]).
Managed Services, Abstraction, and Trade-Offs
- Shift in Engineering Responsibility:
- "It’s specialization: some people can work on higher abstraction layers, others build the lower-level reliable primitives" ([32:46–33:10]).
- Value of Knowing Internals:
- While details can be abstracted, understanding system internals is still a superpower for diagnosing and optimizing performance ([33:18–35:00]).
- Notable Quote:
- "Knowing a bit about the internals is actually like a superpower." — Martin ([34:32])
- Cloud Trade-Offs:
- Balancing availability, performance, cost, and resilience in a multi-cloud or regional setup; importance of considering business and even geopolitical risk ([35:00–37:41]).
Handling Scale & Sharding Today
- Cloud Impact:
- Scaling down to cost-effective, lightweight services is easier than ever thanks to serverless; scaling up and sharding across machines remain technically demanding ([38:10–40:02]).
- Sharding:
- Still relevant at extreme scale but less urgent due to improved single-machine capacity ([40:58–41:57]).
Troubles of Distributed Systems
- Unreliable Networks & Timing:
- The need to design for uncertainty—delayed or lost messages, clock drift, unexpected failures ([42:13–44:56]).
- Real-world examples: from data center fires, undersea cables bitten by sharks, to cows stepping on land cables.
- Engineering at Scale:
- Teams operating systems at S3's scale treat such failures as daily operational concerns, while small companies might perceive them as rare anomalies ([44:56–45:37]).
Doing the Right Thing: Ethics in Systems Design
- Raising Ethical Questions:
- Engineers must consider the consequences and societal impact of the systems they build; the book includes an explicit "Doing the Right Thing" chapter ([48:06–51:03]).
- Notable Quote:
- "If you want to change the world, then thinking about the impacts that your technologies have on the world is part of your job." — Martin ([48:30])
- Engineers as Decision-Makers:
- Engineers bear responsibility for surfacing trade-offs and articulating risks—technical and societal ([51:03]).
Formal Methods and the AI Revolution
- Formal Verification:
- Proving system properties through mathematics (vs. just testing); critical for high-stakes algorithms ([51:56–54:51]).
- Getting started: Prefer model checking (e.g., TLA+) over proof assistants for practical learning ([57:04–57:47]).
- Notable Quote:
- "For those domains where really we want to ensure there’s a complete absence of bugs... formal verification can really shine." ([57:55])
- AI’s Role:
- AI (LLMs) may make writing proofs easier—automation of proof generation could make formal verification mainstream, especially as AI-generated code increases review workload ([57:55–59:20]).
Academia vs Industry – Synergies and Contrasts
- Academic Freedom:
- Allows longer-term, idealistic research unbound by commercial imperatives (e.g., "local-first" software) ([59:45–62:43]).
- Research Challenges:
- Consistency and access control in decentralized systems are harder than the equivalent problems solved with centralized cloud servers ([63:00–66:29]).
- Notable Quote:
- "This is an example... where, because it's research, we can afford to take this idealistic, principled stance and say... we're going to solve this harder engineering problem because we think decentralization is a valuable feature." — Martin ([68:04])
- Teaching:
- Courses in distributed systems, cryptographic protocol engineering, and security; focus on both theory and practical implementation ([69:00–71:49]).
- Impact of AI on Computer Science Education:
- Major challenges for assessment and ensuring genuine learning; adjustments underway but answers still evolving ([72:07–74:35]).
Bridging Academia and Industry
- Mutual Benefit:
- Real-world problems can inform academic research; deep research can offer industry transformative ideas ([76:40–77:42]).
- Career Perspective:
- Advocates for not viewing academia and industry as mutually exclusive—cross-pollination gives valuable breadth and rigor ([81:47–83:34]).
- Notable Quote:
- "It’s really good actually if people can weave in and out of industry and academia a bit and not regard it as like two totally mutually exclusive career paths, but actually have a bit of switching between the two." — Martin ([83:26])
Notable Quotes & Memorable Moments
- "Kafka was really about data integration... an abstraction for integrating various data sources to downstream data sinks." — Martin Kleppmann ([12:15])
- "A lot of it was just kind of being curious and talking to people actually, and just asking them lots of questions." ([17:40])
- "Scalability is just like what mechanisms we have for dealing with changes in load… not just scaling up, but scaling down as well." ([23:42–24:31])
- "Knowing a bit about the internals is actually like a superpower." ([34:32])
- "If you want to change the world, then thinking about the impacts that your technologies have on the world is part of your job." ([48:30])
- "For those domains where really we want to ensure there’s a complete absence of bugs... formal verification can really shine." ([57:55])
- "It’s really good actually if people can weave in and out of industry and academia a bit and not regard it as like two totally mutually exclusive career paths." ([83:26])
Timestamps for Important Segments
- Martin’s Tech Journey & Startups: [01:44–13:18]
- Kafka and Large-Scale Systems at LinkedIn: [10:21–13:18]
- Book Genesis & Structure: [13:18–20:25]
- Reliability, Scalability, Maintainability: [22:09–24:31]
- Second Edition, Cloud-Native Focus: [25:31–29:59]
- Managed Services, Trade-Offs, and Abstraction: [30:35–37:41]
- Scaling Down & Serverless: [38:10–40:14]
- Distributed System Pitfalls: [42:13–46:02]
- Shifting Topics (MapReduce out, AI/Vector Indexes in): [46:19–48:06]
- Ethics in Systems Design: [48:06–51:56]
- Formal Verification & Model Checking: [51:56–59:20]
- Local-First Software & Decentralization Challenges: [59:45–68:55]
- Teaching & Academia-Industry Relationships: [69:00–77:42]
- Bridging Academia & Industry, Closing Reflections: [81:32–83:34]
Tone and Takeaways
The conversation is thoughtful, richly detailed, and candid—balancing technical rigor with big-picture philosophical thinking. Martin is open about the challenges (technical and personal) faced in both his startups and academic work. He advocates for the combination of curiosity, practical exposure, careful reasoning, and ethical reflection as core to the craft of building data systems. The episode is invaluable for engineers, leaders, and anyone interested in where software infrastructure is headed—and what foundational knowledge endures as tools, hardware, and paradigms evolve.
