Summary

The Pragmatic Engineer – CI/CD with Robert Erez

Host: Gergely Orosz
Guest: Robert Erez (CI/CD expert, early engineer at Octopus Deploy, former Skype for Web team)
Date: June 17, 2026

Episode Overview

This episode provides a deep dive into Continuous Integration (CI), Continuous Delivery (CD), and Continuous Deployment, exploring both technical and practical aspects of software delivery at scale. Robert Erez shares insights from over a decade in engineering, touching on progressive delivery, the realities of platform teams, deployment at scale (including on Kubernetes and on-premise environments), the evolving role of AI in CI/CD processes, and the shifting landscape for enterprises and smaller teams.

A must-listen for software engineers, SREs, tech leads, and engineering managers concerned with how to build delivery pipelines, scale infrastructure, and adopt best practices around deployments.

Key Discussion Points and Insights

1. The Evolution and Stages of CI/CD

[05:52] CI, CD, and CD… but with nuance
- “People talk about CI and CD as interchangeable, or just say CICD as one word, but there are different levels.” (Robert, 05:43)
[06:15] Maturity Model:
- YOLO: Direct, mostly manual deploys to production.
- Continuous Integration (CI): Frequent merges to main branch, automated tests.
- Continuous Delivery (CD): Every commit is tested and ready for deployment.
- Continuous Deployment: Automated pushing to production without manual steps.
[07:28] Continuous delivery vs. deployment:
- Continuous delivery keeps every change deployable but may gate the production release on a manual step; continuous deployment ships to production with no human intervention.

Memorable Quote

“YOLO is the first stage. The second stage is continuous integration... Continuous delivery is kind of the next stage where you also test your deployment process itself.”
– Robert Erez [06:21]

2. Real-world Deployments – Skype and the “New Zealand Canary”

[01:46] Skype for Web:
- Weekly releases gated by a Change Advisory Board.
- Team engineered their own CD flow to go faster: build, test, stage, production.
- Canary deployments: New Zealand as test cohort—first to get new builds due to time zones, English language, and manageable user count.
[03:12]
- “New Zealand was always our canary. The first country that’s significant in size, they speak English, and if there are bugs, they’re easier to address.”

3. Octopus Deploy and the Focus on Deployments

[04:15] Early Days at Octopus Deploy:
- Start-up founded and built by engineers—including the CEO.
- Everyone did support, marketing, engineering.
[05:43] Octopus focused on deployment (CD, not strictly CI):
- “Deployment isn’t the same as continuous delivery, but you can’t really have CD without deployment automation.”

4. Kubernetes: Winning the Platform Wars

[10:02] Kubernetes' Rise:
- “Kubernetes is the platform of the moment… came out of Google’s Borg system.”
- Helped level the playing field between AWS, Google Cloud, Azure.
- Standardized container orchestration.
[14:04 / 15:20] Kubernetes Adoption:
- Widely used not just in the cloud but on-prem by enterprises for control.
- On-prem uses include datacenters, co-lo, even store point-of-sale systems, and ships/research vessels with intermittent connectivity!

Memorable Story

“Some customers have Kubernetes clusters running on research vessels—boats at sea. When they come to port, that’s when deployments happen.”
– Robert Erez [16:14]

5. GitOps: Principles and Misunderstandings

[18:13] What is GitOps?
- Pulls “desired state” (as defined in git) and ensures infra matches via declarative configs and reconciliation.
- Term coined by Weaveworks (around 2017); the four pillars were formalized later.
[22:23] Pillars of GitOps:
- Declarative configuration
- Versioned and immutable storage (typically git, though git itself is not a requirement)
- Pull—not push—deployment model
- Continuous reconciliation
[23:33] Caveats:
- “The naming makes teams think everything must be in git—including secrets. But some things shouldn’t be (‘like secrets—don’t put them in git’).”
- GitOps isn’t for everyone; practical blends of GitOps and imperative steps are routine.
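The reconciliation idea behind these pillars can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular GitOps tool works: the `reconcile` and `apply` functions, the dict-based state, and the action tuples are hypothetical names invented for this example. `desired` stands in for what would be read from a versioned store, and `actual` for what the cluster reports.

```python
from typing import Dict, List, Tuple

# Hypothetical model: service name -> replica count.
State = Dict[str, int]
Action = Tuple[str, str, int]  # (verb, service, replicas)


def reconcile(desired: State, actual: State) -> List[Action]:
    """Return the actions needed to make `actual` match `desired`."""
    actions: List[Action] = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, replicas))
        elif actual[name] != replicas:
            actions.append(("scale", name, replicas))
    # Anything running that is not declared gets removed.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))
    return actions


def apply(actual: State, actions: List[Action]) -> State:
    """Apply actions to a copy of the state (stand-in for real API calls)."""
    result = dict(actual)
    for verb, name, replicas in actions:
        if verb == "delete":
            result.pop(name, None)
        else:
            result[name] = replicas
    return result


desired = {"web": 3, "worker": 2}
actual = {"web": 1, "legacy": 1}
plan = reconcile(desired, actual)
converged = apply(actual, plan)
```

Running `reconcile` again on the converged state yields an empty plan, which is the property a continuous reconciliation loop relies on: it can run repeatedly and only acts when drift appears.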

6. Platform Teams: The Modern Approach to Scaling Deployment

[30:22] Rise of Platform Teams:
- Came from DevOps—now, teams maintain standards, templates, and self-service portals (IDPs) for other engineers.
- Provides best practices and common tooling; application teams remain responsible for running their own services in production.

Memorable Quote

“It’s about giving you self-service. Focus on writing code; the platform team lets you not have a ‘Jenkins lottery’ for who manages the pipeline.”
– Robert Erez [33:08]

7. AI’s Impact on CI/CD

[35:00] Increasing Velocity, New Demands:
- More code, generated faster (AI-generated code and reviews).
- “Fast feedback loops are key when humans are in the review loop. For AI, the bottleneck shifts from waiting for builds to managing rollout risk.”
[38:59] Progressive Delivery is more important—AI generation increases the need for safe rollouts and fast reversibility.

Memorable Quote

“If your code is being written by AI, does it matter if the pipeline takes 20 vs. 30 minutes? AI can babysit the process—focus shifts from speed to reducing risk.”
– Robert Erez [36:32]

8. Progressive Delivery: Canary, Blue-Green, but Feature Flags Rule

[39:11] Techniques:
- Canary Deployments: Route a fraction of traffic to the new version (e.g., “New Zealand” at Skype).
- Blue-Green Deployments: Maintain two environments, switch traffic upon validation.
- Feature Flags/Toggles: Preferable—granular, instant revert, decouples deployment from release.
[41:04] Feature Flags:
- “With feature flags, you control release timing and targeting. Rollback doesn’t mean redeploy—it’s a toggle.”
[45:23] Challenges:
- Database/schema changes are hardest for progressive delivery; maturity needed to manage expand/contract migrations.
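To make the feature-flag technique concrete, here is a minimal sketch of a flag with a kill switch, a targeted cohort, and a percentage rollout. Everything in it is an assumption made for illustration: the `Flag` class, the `bucket` and `is_enabled` functions, and the region-based targeting (loosely modeled on the New Zealand cohort from the Skype story) are hypothetical and do not describe any specific flag product.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Flag:
    name: str
    enabled: bool = True       # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0   # share of users (0..100) outside the targeted regions
    allowed_regions: Set[str] = field(default_factory=set)


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so decisions do not flip between requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str, region: str) -> bool:
    if not flag.enabled:
        return False
    if region in flag.allowed_regions:
        return True
    return bucket(flag.name, user_id) < flag.rollout_percent


# The code is deployed everywhere, but released only to one cohort.
flag = Flag("new-checkout", rollout_percent=0, allowed_regions={"NZ"})
nz_sees_it = is_enabled(flag, "user-1", "NZ")
us_sees_it = is_enabled(flag, "user-1", "US")
```

Raising `rollout_percent` widens the release without a deployment, and setting `enabled = False` reverts it for everyone, which is the "rollback is a toggle" property described above.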

9. Rollbacks: Prefer Roll Forwards

[47:29]
- “Many customers want a ‘rollback button.’ In stateless systems, reversion is easy. But with state (databases), it’s dangerous. Roll forward instead.”
- Hotfixes and rapid patching preferable over trying to revert stateful changes.

Memorable Quote

“In an ideal world, rolling back is via a feature flag. But if you have a schema change, rollback is a myth—roll forward instead.”
– Robert Erez [49:49]
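One way to keep schema changes compatible with rolling forward is the expand/contract pattern mentioned in section 8. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `users` table, its columns, and the `create_user` helper are hypothetical and exist only to illustrate the sequence of steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so the currently deployed code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate: backfill existing rows into the new column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def create_user(db: sqlite3.Connection, name: str) -> None:
    """Transitional write path: populate both old and new columns."""
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user(conn, "Grace Hopper")
rows = conn.execute(
    "SELECT fullname, display_name FROM users ORDER BY id"
).fetchall()

# Contract: only once no deployed version reads `fullname` would a later,
# separate release drop the old column. That step is deliberately omitted here.
```

Because each step is backward compatible with the previous application version, a bad release can be fixed by shipping the next one rather than by reverting the database.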

10. Feature Toggle Hygiene

[51:23]
- Feature toggles can litter codebases; essential to track owners, set expiries, and notify teams to remove stale toggles.
- Use tools/frameworks (like OpenFeature + internal wrappers) to manage lifecycle.
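Tracking owners and expiries can be as simple as a small registry that is checked on a schedule. The following is a sketch under assumptions: `ToggleRecord`, `stale_toggles`, and `reminders` are hypothetical names for an internal wrapper of the kind mentioned above, and nothing here uses the OpenFeature API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ToggleRecord:
    name: str
    owner: str
    expires: date  # date after which the toggle should have been removed


def stale_toggles(toggles: List[ToggleRecord], today: date) -> List[ToggleRecord]:
    """Return toggles past their expiry, oldest first."""
    return sorted((t for t in toggles if t.expires < today), key=lambda t: t.expires)


def reminders(toggles: List[ToggleRecord], today: date) -> List[str]:
    """Build one reminder message per stale toggle, addressed to its owner."""
    return [
        f"{t.owner}: remove toggle '{t.name}' (expired {t.expires.isoformat()})"
        for t in stale_toggles(toggles, today)
    ]


registry = [
    ToggleRecord("new-checkout", "team-payments", date(2024, 3, 1)),
    ToggleRecord("dark-mode", "team-web", date(2024, 1, 15)),
    ToggleRecord("beta-search", "team-search", date(2024, 12, 31)),
]
messages = reminders(registry, today=date(2024, 6, 1))
```

A check like this could run in CI or as a scheduled job; whether a stale toggle fails the build or only notifies the owner is a team decision.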

11. Environments & Ephemeral Integration

[54:04]
- Common environments: dev/test/prod
- Trend: Move to ephemeral environments per feature/branch. “Spin up temporary infra to get feedback, then tear down once merged.”
[58:20]
- Even with AI, ephemeral envs help validate functionality before shipping.

12. Running SaaS and On-Prem in Parallel

[59:13] Octopus SaaS Migration:
- Octopus started as on-prem; SaaS began as a VM-per-customer experiment, migrated to Kubernetes with cell-based “reef” architecture.
- Resiliency: Zero downtime upgrades are hard; every reduction in downtime is meaningful.
[63:17] Dual Offerings:
- On-prem is in heavy demand (finance, government). Slow upgrade cycles—some customers on 5+ year old versions.
- “On-prem takes, on average, 200 days for 50% of customers to upgrade to a release.”

Business Insight

“Supporting both SaaS and on-prem can reduce competition since not everyone wants to do the hard work of supporting self-hosted infra.”
– Gergely Orosz [65:57]

13. AI, Model Versioning, and Customer Attitudes

Parallel drawn with LLM/AI adoption: Some users want to pin working models rather than always upgrade, suggesting support for stability will remain a business need.

14. Advice for Engineers: How to Get Started With Progressive Delivery

[68:50]
- “Just start with a feature toggle. It’s scary, but once you do it, you’ll want to use toggles for everything. There’s nothing like flicking one off at 2 am to fix a bug instantly.”

15. Book Recommendations

The Phoenix Project (Gene Kim, et al.) — foundational DevOps/ops novel.
Radical Candor (Kim Scott) — on communication and empathy in teams.
Anything by Greg Egan (hard sci-fi, e.g., Diaspora) — for techies who love mind-bending science.

Notable Quotes & Moments (with Timestamps)

The “New Zealand Canary” approach:

“New Zealand was always our canary. The first country of significant size, English speaking… but if there’s a bug, it’s small enough. Sorry to all the New Zealanders.” — Robert Erez [03:12]

On on-prem demand:

“The majority of our customers are still on Prem… banks, financial institutions, governments, they want full control. So this isn’t going away.” — Robert Erez [64:43]

On dogma and pragmatism:

“Most teams don’t care whether you call it GitOps or anything else. They just want to ship software and know that it works.” — Robert Erez [27:59]

Timestamps for Important Segments

00:00 — Why is CI/CD hard? Episode intro
03:12 — Canary (New Zealand) and real-world deployment at Skype
10:02 — Kubernetes: origins and why it won
18:13 — GitOps explained; what is and isn’t “in git”
30:22 — The rise of platform teams and their role
35:00 — AI’s impact on how (and how fast) we ship software
39:11 — Progressive delivery: canary, blue-green, feature flags (with pros/cons)
47:29 — Rollbacks vs. roll forward in the real world
51:23 — Feature flag hygiene and operationalizing removal
54:04 — Environment patterns and the adoption of ephemeral environments
59:13 — Running SaaS and on-prem together: Octopus Deploy’s journey
68:50 — Advice: How to adopt feature toggles/progressive delivery
70:27 — Book recommendations (Phoenix Project, Radical Candor, Greg Egan novels)

Key Takeaways

Progressive delivery is now critical, especially as AI increases code velocity; feature toggles are usually safer and more granular than canary or blue/green deploys.
GitOps is about principles, not dogma—don’t put secrets in Git, and don’t force everything into repositories just for the sake of it.
Platform teams and self-service infra are the new norm at scale but aren’t needed everywhere. Simplicity and context fit matter.
Dual SaaS/on-prem products are operationally hard but valuable; many large customers demand real control and slow upgrades.
Rollbacks are overrated when state is involved—build fast hotfix/roll forward flows instead.
Feature flags/toggles must be maintained—a culture of periodic cleanup is essential.
Embrace ephemeral environments and pragmatic automation for better feedback loops.
Be practical, not dogmatic: Choose the right mix of tools, processes, and architectures for your team’s context.
Read, stay grounded, and communicate candidly as you scale your engineering practice.

Recommended for:
Engineers, SREs, DevOps, platform engineers, and tech leads navigating decisions around CI/CD, feature rollouts, Kubernetes, and building software delivery platforms at scale.

Notable Quotes for Sharing:

“In an ideal world, rolling back is via a feature flag. But if you have a schema change, rollback is a myth—roll forward instead.” [49:49]
“Most teams don’t care if you call it GitOps. They just want to ship software and know it works.” [27:59]
“If you want to move beyond continuous delivery to progressive delivery, just start with one feature toggle. It’s addictive.” [68:50]

Books Mentioned:

The Phoenix Project (Gene Kim et al.)
Radical Candor (Kim Scott)
Diaspora / Zendegi / Permutation City (Greg Egan)


Transcript

[00:01]
A
CI/CD remains one of the hardest things to get right in software engineering. But why? Rob Erez is a CI/CD expert, having worked in this field for more than a decade. In the early 2010s we were teammates on the Skype for Web team, and then Rob joined Octopus Deploy as one of the first engineers ten years ago. In today's episode we cover progressive delivery in practice: canary deployments, blue-green, and why feature toggles are often still better; what GitOps is, why it's not about git, and where the everything-in-git mindset breaks down; why you should prioritize rollbacks less and focus on roll forwards; and much more. If you want hard-earned lessons about CI/CD, progressive delivery, and what's coming as AI changes how much code we ship to production, then this episode is for you. This episode is presented by Antithesis. Verify your system's correctness without human review or traditional integration tests and avoid bugs or outages. Today's episode will be about CI/CD. CI/CD at scale is one of the hardest infrastructure problems to get right, and the teams who nail it know that the details very much matter. This is where I need to mention our season sponsor, WorkOS. WorkOS brings the same rigor that many of us use with CI/CD at scale to enterprise auth: SSO, SCIM, RBAC. Production-ready, battle-tested, and built to handle real load and real compliance requirements. To add enterprise auth without the infrastructure project, visit workos.com. Rob, it's awesome to
[01:23]
B
have you here on the podcast.
[01:24]
C
Hello Gergely. It's good to be here. Yeah, I'm loving Amsterdam.
[01:28]
B
Yeah, it's been like what, 11, 12 years since we worked together.
[01:31]
C
Yeah, yeah, I think 2015, 15, 2014, 2015. I think I left UK a while
[01:38]
B
and Skype when there was still Skype. Our team Somehow inherited the Outlook.com plugin which had like 400 million users per month or something.
[01:46]
C
Yeah, it was crazy, the amount of massive scale. So this was an interesting job. Deployments were very much a case of, you know, you ship once a week and you have to go to a CAB, you know, a change advisory board. You have to get sign-off and approval. And I always found that really weird. Right. Like we're building this piece of software, it runs on the web, we can ship it whenever we want. It was running on Azure at the time and so, you know, we've got full access to push whenever we want and we make these changes through the week, but we'd kind of have to hold them back. I guess to Abdela, our manager, both of our managers at the time, we kind of, I guess, worked around the system. When the code was ready, we'd build it and ship it through the week. And I was really sort of impressed and proud at this process that the whole team had kind of put together. Right. Where we'd commit the code. The tests would run, several kind of layers of testing. It would go to staging, et cetera, and then it would get shipped to production. So we're kind of, I guess, executing a form of, I guess, you know, continuous delivery at the time. And we would then ship ourselves, you know, once a week. I kind of always like to tell this story that at the time, you know, when we'd have a build ready to go, you know, we'd do a form of canary deployments. And so this is where you kind of roll out to a small percentage of your customer base. And we always found that the customer base that would be our test subjects was New Zealand. So New Zealand was always our canary.
[03:11]
B
Yep.
[03:12]
C
A bunch of reasons for that. You know, they're in the. You know, the first country to kind of reach this new date. So they're always the first ones to kind of roll out into a time
[03:21]
B
when it comes, like, you know, like midnight passes. It's like first countries. New Zealand.
[03:28]
C
Bang. Exactly. So the first country that's of significant size, they speak English. So if there's any bugs or issues or reports, it's kind of easy to understand. But to be honest, New Zealand is small enough that no one really cared if we shipped a bug and had to fix it quickly. So sorry to all the New Zealanders listening. Yeah, I think that's kind of this good example of using a continuous delivery technique to ship the code faster than what we otherwise could have if we had these kind of big bang releases. And this whole process, I guess, opened my eyes to, you know, what. What progressive delivery, what good CICD could be. And, yeah, I guess from there I spent a few years there at Skype, and eventually wife and I decided it was sort of time to come home to Australia.
[04:11]
B
And then back in Australia, you went to start to work at Octopus Deploy.
[04:16]
C
Yeah, eventually I came back and actually worked at a place with a friend of mine just for a little while, just to kind of get back on the feet. And I remember they were using Octopus Deploy there. And so Octopus Deploy, for those who don't know, is a deployment tool that was built and sort of developed originally in Brisbane. So there was a strong kind of Brisbane attachment to it. Yeah, that's right. So when I found out that they were hiring, I thought, okay, why not? I'll give it a go. I like CI/CD. I like this space. I think there's, you know, a lot of interesting problems in this space. So applied and joined. And at the time I was employee, I think employee number eight or nine or something like that. So it was very much still a bit of a startup culture. Definitely not startup in the sense of, you know, Silicon Valley wild parties and, you know, ridiculous spending, but startup in the sense that everyone who I worked with was an engineer. Now even Paul Stovell, the CEO, he's an engineer. This is kind of where it started from. And so we'd all be working on code together. If someone had an idea, you'd have a bit of a chat about it and ship it. So we were the marketing, we were support. We were kind of bit of everything. And yeah, obviously the company has grown a lot since then.
[05:23]
B
The company, Octopus Deploy, from the start they were focused on deployments. Right. Can we talk a little bit on that? Whenever I think about deployment, I always say CI/CD, continuous integration, continuous delivery. Why was there a focus on deployments? And is that the same as continuous delivery?
[05:44]
C
Yeah. Interesting. So you're right, like, quite often people talk about CI and CD as this kind of interchangeable. They're either interchangeable or the word is CICD is like the name.
[05:52]
B
It's just attached to itself. It's hard for me to imagine a CD without a CI. Continuous integration.
[05:58]
C
That's right. And I guess the way to look at it is, you know, you've got sort of multiple stages of maturity of software teams as they kind of move on their way from, you know, initially, CI, which continuous integration. This is the idea that.
[06:14]
B
Well, initially it's yolo.
[06:16]
C
Initially it's. Yeah, that's right. Initially you just deploy to PROD or SSH machine.
[06:20]
B
And we used to.
[06:21]
C
We've all. We've all worked in places where we've done that, and that's the starting point. So you're right. YOLO is the first stage. The second stage is, you know, continuous integration. And so this is this idea where you want to keep integrating, merging your code changes into a single branch, and you want to be continually running tests against it. Now, continuous delivery is kind of the next stage where, you know, we talk about testing our code and there's, you know, unit tests and integration tests, et cetera. But what you also really need to test is your deployment process itself. Right. So continuous delivery Is this idea, okay, you want to make sure that at any point in time, when I click the button to deploy, I want it to go to production. Once we kind of get to this place, the next stage beyond that, which not all companies necessarily reach, is continuous deployment. Right. So this is the idea that not only are your changes being merged and merged together at the same time and ready to go, but they're also being shipped to production, essentially.
[07:18]
B
So the stages we have is first yolo, then continuous integration, then continuous delivery. Delivery and continuous deployment.
[07:28]
C
That's right.
[07:29]
B
What is the difference between continuous delivery and continuous deployment?
[07:33]
C
The big difference, I guess, is the question of do your changes go out to production automated? Does it kind of flow through without any intervention? I guess.
[07:41]
B
And then for continuous delivery, they go out, but not necessarily to production, right?
[07:46]
C
That's right. And so that's why you'll have environments like, you know, dev environment or testing or staging or whatever. Now, it's possible that, you know, some parts of that process may also still be manual. Maybe you only update the test environment once a week so the testers can play around with it again. But the key principle is that you could, you can kind of push it through sort of automatically the whole way through if you want.
[08:06]
B
And what teams would not want to do continuous deployment, right? Because it seems to me, continuous delivery, you kind of want to get to. Because then you just get more and more feedback. Right. But then it is a kind of good question, like, should it go out immediately?
[08:21]
C
This is the question, you know, everyone always sort of asks, like, it's almost ready to go out. Why can't we just push a production push as engineers? You want as soon as possible it's ready. Right. The reality is it doesn't really suit every, every, every company. Right. So, you know, it may be the case that, you know, some, some companies really do still have, you know, review boards where you need to validate. Is this good to go out? Particularly if you're in an industry that has a lot of regulation and compliance problems, problems, compliance requirements. And they need to make sure that when it does go out to production, it's, it's sort of done at the right time with the right people available, et cetera, et cetera. It's not necessarily true to say that everyone should be going to continuous deployment because that's, you know, sometimes just not, not viable for various reasons. But if you at least got to that point where you're sort of continually seeing your changes, go through all the testing, you know, you're promoting it through the different environments, which is, you know, you're therefore testing the process itself. If you can only click that button to go to production once a week or whatever, okay, that's fine. You know, you've done a lot of that hard work. You've mitigated risk, which is what a lot of this process is about. Right. Is feel the pain as soon as possible and de risk anything that could go wrong right up until that last point.
[09:39]
B
So I know you're deep into CI CD or continuous integration, continuous delivery, continuous deployment. You've been doing this for like, what, 10 plus years now. But I was pretty surprised to see that when I checked Octopus deploy, it said deployment. It says continuous deployment, continuous delivery, but it also says Kubernetes. How has Kubernetes kind of arrived in the topic of CI CD and in general infrastructure? What happened there?
[10:02]
C
Yeah, yeah, Kubernetes is, is the platform of, of the moment. If we take a bit of a step back, Kubernetes came out of the, you know, Google, I guess they originally had Borg. You know, they were using it to, to host and run their infrastructure. They ended up releasing Kubernetes. Partly, I'm not going to, you know, pretend I can read their minds and know exactly why, but partly as a way of helping to level the playing field between them and some of the other cloud vendors.
[10:27]
B
Yeah, so, like before Kubernetes, AWS was a clear leader. And I talked with Kat Cosgrove, who came to the podcast, who works on the Kubernetes team, and again, she speculated that by releasing Kubernetes, it was a lot easier to move workloads between AWS and Google Cloud. So it kind of leveled the playing field. And now there was a reason to. That's right, like choosing Google Cloud was no longer as big of a risk, or choosing Azure was not as big a risk, and so on.
[10:57]
C
Yeah, that's right. It kind of, it made it simple to move between vendors. And so as a customer of one of these platforms, if you wanted to move to AWS and you're using containers, no problem. You're just sort of putting it in a new place. And so Kubernetes came along at the time when there was a bunch of players in the field for container orchestration. So, you know, you had, you know, Nomad, even Docker Swarm, HashiCorp, a bunch of other options are out there.
[11:24]
B
Because at CoreOS, Kelsey Hightower, just on the podcast, they built Fleet, which was another container orchestrator. And this was all around like 2012, 2013, 2014, and then Kubernetes came out and somehow it started to win market share.
[11:40]
C
Yeah, yeah. I mean I think some of the mechanics that it provided kind of really appealed to engineers and I guess DevOps teams out there. And eventually I think particularly because it was so easy to use cross platform and because some of the cloud vendors then did end up picking it up, it kind of has ended up now being essentially the winner in this space. I know even back then, even non container orchestration tools like Azure Service Fabric was another kind of attempt to handle the fact that well, everyone's building microservices and they want to host them in a single platform and how do you orchestrate that and deal with dependencies, et cetera. But Kubernetes has become the clear winner.
[12:22]
B
And when you say winner, I understand that for example, when you have a bunch of backend servers on a service like you have a website, there's a large backend, okay, I'll use kubernetes for that. But you're talking about infrastructure or you're talking about even things like build servers.
[12:38]
C
That's right. So it's kind of funny, you know, we talk about Kubernetes as being cloud native. This is what, this is the term you always hear. It's cloud native.
[12:46]
B
That's what they say.
[12:46]
C
That's what they say. And you know, you look at the vendors that picked it up, it's Azure and AWS, and they kind of made it available on their platforms. The reality is a lot of customers actually use Kubernetes for running on premise. So you know, a not insignificant number of our customers who are doing Kubernetes are running on potentially their own VMs on their own server farms, or maybe they're running VMs in AWS or Azure, but they're maintaining Kubernetes itself. The idea being that they have a lot more control then over exactly what's running. It's particularly common you'll find in things like financial industry and things like that where again, wanting to fully sort of control the process and manage the whole sort of piece of infrastructure from end to end is kind of one of their goals. But they want to leverage the capabilities that Kubernetes provides by, you know, allowing the application team and the ops teams to just build and define kind of in that declarative fashion that Kubernetes provides exactly what runs and how does it run, etc.
[13:45]
B
So they, they chose Kubernetes because this is the best tool they can Manage their on prem infrastructure and say like, okay, I have like these physical machines and I want this many virtual machines and I want to run a database on this many nodes and a internal web server or like whatever. So it just won this area as well.
[14:05]
C
Yeah, yeah. I mean, so around the same time that Kubernetes came out, actually before that, there was a lot of these other kind of declarative type tools. Right. So you have, you know, Terraform, which is a really popular one, you can define kind of exactly what infrastructure you want. And what you're doing is you're essentially defining the desired state and then the tool kind of applies it, and you know, you've got Puppet, etc. And so Kubernetes has this similar concept, right, where you define what you want your sort of infrastructure to look like. And the internal Kubernetes controllers and operators will basically ensure that whatever you've asked for always applies. So if you say you want, you know, three replicas of something, it will ensure that there's three replicas of something. And so if one of those pods dies, for example, it will spin another one up. And so it simplifies this process of being able to define declaratively kind of exactly what you as a sort of application team need to run your system.
[14:59]
B
It's fascinating because I always assumed that Kubernetes has won the cloud native space and hyperscalers. Can you tell me a bit more about how it's being used on prem?
[15:11]
C
On prem?
[15:12]
B
Some interesting stories. You must have seen some because you said that you're working with companies who are managing large on premises Kubernetes or like interesting situations.
[15:21]
C
Yeah. This is one of the nice things about working at a company like Octopus, right. We talk to and deal with so many different customers and you know, everyone's doing things a little bit different, all with slightly different needs and requirements, and you kind of get exposed to a lot of different, you know, problems and patterns, and it's easy sometimes to get lost in, you know, what people are talking about in conferences and everyone's saying it's all about cloud and this is the, you know, best practice and you should be doing this. And the reality is, you know, everyone's kind of got their own little problems and they just want to solve them the way they kind of need to solve them. And so some of our customers, in fact a lot of our customers will run Kubernetes kind of, you know, quote unquote on premise. So for example, I was actually talking with one.
[15:58]
B
When you say on premise, can you just be a bit more clear? Is this a data center where they're like renting and co-locating? Is this actually like I have my own data center, or is this like I actually have my own machines in my closet?
[16:10]
C
Yes and yes, I guess.
[16:12]
B
What, even in the closet.
[16:13]
A
I mean, I was trying to joke there.
[16:15]
C
I'm sure there are still teams out there that are running, you know, the core accounting tools and et cetera under Steve's desk. But even when we talk about, you know, small computers, some of our customers have Kubernetes clusters basically in their point of sale systems. So they have hundreds and hundreds of stores and they have little Kubernetes clusters that essentially run in them and each one's independent and they, you know, run into their own problems with that because particularly at scale, when you've got, you know, thousands and thousands of clusters and you know, these customers are, you know, following various GitOps practices, etc., where they're pulling the actual state from a git repository. So the git repository itself becomes the bottleneck or they start getting throttled and so they have to sort of resort to other mechanics to try to sort of mitigate and work around that. I was talking to another one of our customers actually just the other day at KubeCon there who they are deploying. They've got Kubernetes clusters running on research vessels. And those research vessels, research as in like boats? As in ships on the ocean. That's right. I'm not going to pretend to know exactly what they're doing on those ships. We didn't quite get into that detail. But they've got Kubernetes clusters out in the open sea, right. Which is apt given Kubernetes' name. The problems they run into though are a little bit different. Right. So for them, you know, those boats might be out at sea for, I don't know, weeks, months at a time or whatever that might be. So when you want to do a deployment, that ship's not available, so when that ship comes back into port, it needs to get the update. Right. So we were talking through how you'd achieve this. Right. And how that process would work.
[17:50]
B
This is super interesting and I love how you kind of get a peek into so many different types of teams through the fact that, you know, like you're talking with them, with how they do the deployments, but you're, you probably see some other things that they're doing or things they're struggling with. What are some trends you're seeing across the industry in terms of the. This wide range of companies you work from startups to like, finance companies, to like these research vessels?
[18:13]
C
Yeah, I guess one of the big trends these days is a lot of focus on GitOps. So GitOps is.
[18:19]
B
What is GitOps?
[18:19]
C
What is GitOps? That's a good question, Gergely. Let's take a step back for a minute. So, you know, we talked about Kubernetes earlier. We talked about the fact that it's got this internal continuous reconciliation process where you say to the cluster, please spin up, you know, five pods, and it takes that desired state and ensures it's always true in the world. And there were a lot of products around that were doing a similar thing. You know, Terraform does that for infrastructure, etc. And a bunch of people started wondering, why can't we take that process and pull it back further, so that not only is Kubernetes dealing with desired state, but we can pull it directly out of git? And so I can, as an engineer, make changes to that git definition, that desired state, and I'd have some process that essentially pushes that to the cluster and ensures that it remains in line with what I'm expecting. And so the term GitOps was coined by Weaveworks in, I think it was 2017 or so. And as a general practice, it started picking up steam, particularly in tandem with Kubernetes, because at its core, Kubernetes is very declarative. Right. Later on, sort of in the early 2020s, it was formalized a bit more, and there were four key pillars of GitOps, the first being that you want your state to be declarative. So this is the idea that you want to define what you want the state of your infrastructure to look like. This is basically to make it a lot simpler, I guess, to understand what the state of the world is going to be when a deployment takes place. So if you think about deployments that are a bit more imperative, that have sort of a process, the end result is the result of multiple steps. But when you just want to update some infrastructure, that desired state works really well, particularly in the Kubernetes space.
[20:12]
B
And then in GitOps, the desired state will just be describing, like, how many nodes I want, or how many, I don't know, replicas I want on a database, or how many web servers or load balancers, or how they should be connected? That kind of stuff.
[20:26]
C
That's right, yeah. So it's basically a way of being able to say I want my infrastructure to have whatever state it is. And then the GitOps agents, the GitOps products, basically ensure that remains the case, so they'll keep applying it to Kubernetes. So you've kind of got this situation where Kubernetes keeps its internal state in sync with reality, and now you've got these GitOps tools that keep the declarative configuration in sync with what Kubernetes has.
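The reconciliation idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular GitOps agent is implemented: the desired state is modelled as a plain dictionary of workload name to replica count (standing in for what is stored in git), the cluster is another dictionary, and the names `Change`, `plan`, and `reconcile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str  # "create", "update", or "delete"
    name: str


def plan(desired: dict, actual: dict) -> list:
    """Compare desired replica counts with what is running and list the drift."""
    changes = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            changes.append(Change("create", name))
        elif actual[name] != replicas:
            changes.append(Change("update", name))
    for name in sorted(actual):
        if name not in desired:
            changes.append(Change("delete", name))
    return changes


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the plan to an in-memory stand-in for the cluster."""
    for change in plan(desired, actual):
        if change.action == "delete":
            del actual[change.name]
        else:
            actual[change.name] = desired[change.name]
    return actual


desired = {"web": 5, "worker": 2}
cluster = {"web": 5}                 # someone deleted the worker deployment
print(plan(desired, cluster))        # the drift the agent would repair
reconcile(desired, cluster)
print(plan(desired, cluster))        # empty once the cluster matches again
```

A real agent would run this comparison on a loop or in response to change events, which is what makes the repair continuous rather than a one-off deployment.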
[20:54]
B
So they will take whatever I put in git, in whatever format I use, and they kind of translate it into something that makes sense for Kubernetes. And now Kubernetes can apply it.
[21:02]
C
Yeah, I mean ideally you want it as close as possible to what sort of, I guess kubernetes is expecting because.
[21:09]
B
Or it allows.
[21:10]
C
Yeah, that's right. And so what you're describing there, I guess, is the continuous reconciliation. And so this is the idea that these GitOps apps will essentially, as we said, take that state and apply it. And if there's any drift on the Kubernetes side, so for example, someone runs kubectl delete pod, or deletes a deployment, or whatever the case might be, then because your desired state is now stored in git, in this case, that will kind of self-repair. The second pillar of GitOps is that the desired state you've defined should be stored somewhere that's immutable and versioned. And so this is the idea that once I say that I want to have this state, I want to have something I can point to, a pointer, and that might be a tag or a commit SHA or whatever. And I want to basically use that to define what that actual state should be. And I don't want that to be able to change. Right. Because otherwise that kind of defeats half the point. Having it versioned and immutable also makes things like auditing a lot simpler. Right. You can see the transition of that desired state over time. What's interesting, though, is a lot of people will point to that and go, yes, versioned and immutable. I know what that is. That's git.
[22:24]
B
I was about to say that because git gives you, it definitely gives you versioning or it gives you commit history. I'm not sure if it gives you versioning and immutable in the sense that, I mean, the past cannot be changed.
[22:34]
C
That's. That's right.
[22:35]
B
Actually, can it? Because you can rewrite.
[22:38]
C
Yeah, rewrite history. You're right. So, depending on how you configure your GitOps agent, you certainly can rewrite history. If you have it pointing at a tag, for example, you can change tags. And so that's why there are best practices around that that, I guess, kind of wag the finger a bit if you're using tags to manage that sort of state. But what's interesting, though, is really nothing in these pillars. And very quickly, the third one being pull versus push. And so this is the idea that your GitOps agent will pull the state from GitHub, or git, I should say, and put it into the cluster. And the fourth being continuous reconciliation. But nothing in any of these pillars actually talks about git. And I think the naming of GitOps kind of gets people to already have this expectation that everything has to be in git.
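One way to picture the tag problem mentioned here: a tag or branch name is a movable pointer, while a full commit hash names one specific commit. A minimal Python sketch of a guard that only accepts full hashes follows. It assumes 40-character SHA-1 or 64-character SHA-256 object names in lowercase hex, and the function name `is_pinned` is hypothetical rather than part of any GitOps tool.

```python
import re

# Full object names only: 40 hex chars (SHA-1) or 64 hex chars (SHA-256).
FULL_HASH = re.compile(r"[0-9a-f]{40}|[0-9a-f]{64}")


def is_pinned(ref: str) -> bool:
    """Return True only when ref is a full commit hash, not a tag or branch."""
    return FULL_HASH.fullmatch(ref) is not None


for ref in ["main", "v1.4.0", "3f2a9c1", "3f2a9c1" + "0" * 33]:
    print(ref, "pinned" if is_pinned(ref) else "movable or ambiguous")
```

Abbreviated hashes are rejected on purpose in this sketch, since a short prefix can become ambiguous as a repository grows.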
[23:30]
B
I mean, why would you not have that expectation? That's what I assumed.
[23:34]
C
That's right. I think the problem, though, is not everything should be in git. Right. So you've got this constant conversation within that community about where do you put secrets, for example. So, not in git. That's right. We know that.
[23:48]
A
Right.
[23:48]
B
Do not put it in git.
[23:50]
C
And so that's the thing, you know, there's been all these solutions to try to put it in git. So there's things like sealed secrets where you encrypt and put it in git.
[23:57]
B
Sounds like a terrible idea.
[23:59]
C
But I guess what this is really highlighting is the reality that some things don't need to be in git. Right. As long as you can have this sort of control over the versioning or immutability of it, then that's completely fine.
[24:11]
B
And then the trend around GitOps is, is what you're seeing that a lot more infra teams are moving from, okay, a few years ago they might have just like made definitions for kubernetes and now they're moving over to GitOps so saying, okay, we'd like to control infra in a tool in a way that's described, that's in version control. Is that the trend or what is the trend around GitOps?
[24:32]
C
I guess it's more just the trend of the growth in general of GitOps in enterprises. Right. So not every company out there is using Kubernetes today. And as they approach Kubernetes and they're looking at, well, how do I perform the deployments, how do I manage that process, GitOps becomes the sort of de facto process. And to some extent it is giving rise to this idea of using it to manage other things outside of Kubernetes. And there are a few examples of projects and experiments that will use things like Terraform, where there's a continuous reconciliation service that keeps your actual state outside Kubernetes in sync. At the moment, the focus is on Kubernetes as the core place where it lives. And I guess it's more that the growth of Kubernetes itself means that GitOps is coming along for the ride.
[25:22]
B
And you mentioned enterprises, which means like these large companies with oftentimes thousands of people or in regulated environments. That's what I think of enterprises. Are you also seeing smaller teams pick up things like GitOps? Is it like everywhere or is it more? There's certain types of teams that seem to be just more interested in this.
[25:39]
C
That's a good question. So I guess sometimes what we see is a lot of people go to conferences or they read blog posts and they hear that GitOps is what you should do. So I guess what I want to point out here is GitOps is potentially not necessary for all locations, all environments, all teams. Right. There's certainly a bunch of benefits to it, but the reality is there are some things you need to do outside of just GitOps. You might use GitOps principles in parts of your process, but some of this absolutism that I think sometimes exists may not be necessary. So there's often a bunch of other processes you do around your actual, you know, quote unquote deployment. So things like maybe you run smoke tests, or maybe you want to send a notification when it's complete, or maybe you want to do a database update or something like that. These kinds of steps don't really lend themselves very well to this declarative, everything-is-in-git kind of process. Right. And so that's why you get things like Argo Workflows and Rollouts coming out to try to kind of GitOps-ify this process. And that works for some people. But the reality, I guess, is that I think some people get really hung up on this idea that everything is Git. So therefore they've found the tool, you know, and so therefore everything is a nail. Yeah. And this is the thing, like talking with customers when we go through this process of, you know, you can use GitOps in Octopus, and we've got a bunch of support for various mechanics that integrate well with Kubernetes and Argo, but there's a bunch of other operations you do around that process that don't fit it. And when you talk to them about it, they realize that what they're trying to do is ultimately just ship software. So again, that difference between what you hear when you talk at conferences and things where, you know, everything is. 
Everything is Git and everything must be, you know, in this particular format or whatever the case might be, the reality is for most customers, they're just trying to ship software, right? And they don't care what name you give it. If it's GitOps and it works end to end and solves everything, good. If they want to use GitOps as part of the process, but then have other mechanics that are more imperative, then good. It's just the reality that there are tens and tens of thousands of companies out there in the world that are doing software delivery, and not all of them are at conferences and not all of them are at the forefront.
[28:00]
A
I guess, as Rob says, most teams don't care whether you call it GitOps or anything else. They just want to ship software and know that it works. Our presenting sponsor, Antithesis, does exactly that. It lets you ship knowing that it works. Antithesis goes beyond code review. It runs your whole system inside a hostile simulation. By doing so, it finds every bug before your users do. And because the simulation is fully deterministic, Antithesis doesn't only find bugs, it gives you a perfect reproduction of every issue. I know this sounds like science fiction, but it's actually hardcore engineering under the hood. Jane Street, Fly.io, and the etcd community ship agent-written code with full confidence because they know it's been verified by Antithesis. To see more case studies and details, head to antithesis.com/pragmatic. That's antithesis.com/pragmatic. I also want to mention our season sponsor, Turbopuffer. Turbopuffer is exactly the thing that just works. A vector and full-text search engine built on object storage. Fast, cheap, and extremely scalable. No exotic architecture required. Here's something I find interesting. The teams building the smartest AI products out there, Cursor, Notion, Cognition, and Anthropic, they all run on Turbopuffer. But why? Let's think about it. An LLM without context can feel pretty dumb. I can still remember shortly after ChatGPT launched, in early 2023, how it felt both incredibly smart but also frustratingly stupid. If you asked it a question that was outside its training data, it just made things up. Fast forward to today, and the models and their tools for retrieving context are much better. But hallucinations still happen frequently. Here's a typical way to integrate Turbopuffer with an LLM. Plug it in behind your search tools. Get fast responses and relevant context back. The neat thing is how it gives you the perfect blend of low-cost storage, fast retrieval, and a bunch of different search tools like vector, full-text, and filtering. 
You can afford to index billions of documents, and then you can query it with different tools to get the most relevant handful of documents in milliseconds. This is how your LLM feels smart. It gets the right context really fast without blowing through a bunch of tokens. Of course, you can use Turbopuffer to search anything, not just code. If you're building AI products, check out Turbopuffer at turbopuffer.com/pragmatic. With this, let's get back to Rob and talk about progressive delivery.
[30:15]
B
Yeah. Another trend that we talked about just before is the rise of platform teams. Can you talk about what you're seeing?
[30:22]
C
So platform teams, I guess in the past several years they've become this sort of new standard organizational structure to help teams manage their deployment workflows and a bunch of the infrastructure around it. And it's kind of come out of this evolution of DevOps. Right. So, you know, we mentioned before, in the old days you'd write a bit of code and you'd throw it over the wall to the ops team. So it was dev teams and ops teams.
[30:50]
B
This was like in the 2010s, 2000s?
[30:52]
C
Back in the long, long ago. And then DevOps became, you know, the practice where everyone sort of realized that actually we want to have the engineering teams be involved in, and have ownership of, part of that operational process, the idea being, you know, you get faster feedback loops. If you feel the pain, you fix it. You know, it's that saying, you fix it, you ship it. We've all kind of heard that. And so a lot of teams took that to heart. That's great, good practice. But as things start to scale up, what you'd find is that they would end up being like a DevOps team again, and sometimes separate to another ops team. And so there'd be this separation of development and DevOps. And it kind of goes against some of the principles of what DevOps was, you know, trying to destroy. But not only that, these teams then end up having lots of different ways of doing their deployments. So you've got a whole bunch of application teams, and they've all got slightly different requirements, and they were building it from scratch. And so whether you had the DevOps teams or it was still within the application teams, you'd end up with this large number of different ways of doing things. Right. And that becomes difficult at scale. So, you know, you can't really move between teams.
[32:13]
B
And by scale you mean typically when there's a lot of teams. Right, that's the easiest.
[32:17]
C
Yeah, that's right. If you've got lots and lots of teams and each one is kind of owning that process end to end, you sort of get this bifurcation of processes. And not only that, the application teams themselves start getting this context overload. Right. They now need to think about what the best practices are for the different cloud tools. Yeah.
[32:36]
B
And a lot of devs rarely want to configure the deployment scripts and test them, and testing is hard because it's not unit-testable. So it's now a different job. I remember when I was on earlier teams, typically on a mobile team, you'd have a mobile team of five people, and one of us had to kind of specialize in Jenkins configurations, because Jenkins is oftentimes, or used to be, the mobile CI/CD. And it's kind of like half a person dedicated to that. And it was more like we had to draw straws on who's going to do it, because we want to build stuff.
[33:09]
C
You want to write code, right? You just want to focus on writing code. And so if you're spending a bunch of your time managing infrastructure and pipelines and things, that's no fun for anyone. And so platform teams have come about as a new way of solving that problem, where it's different to this idea of a DevOps team or ops team that owns the whole process. They more define best practices, and they provide, ideally, a self-service mechanism where application teams can use what's often called an IDP, an internal developer portal, and they'll be able to essentially self-service. You know, maybe they want to spin up a new project and they're able to use a template that the platform team has generated. And so the platform team is able to create these standards throughout the company, and they can be responsible for, I guess, the definitions of those processes and the best practices and how to achieve that. But the ownership of the actual running operational element is still within the teams. Right? So they still get those benefits of DevOps, being close to the real code and feeling the pain if there's a problem, et cetera, but they don't need to spend all that time becoming experts in all the different ways that you can deploy the software they've got. And so this has become really common now, where particularly as you get to a larger size, platform teams are a great way of solving that problem. Now that's not to say that every company everywhere should have a platform team. I mean, if you're a smaller company, sometimes you've just got the apps team and they're doing quote unquote DevOps. 
But this is certainly something that as you sort of start seeing larger organizations with multiple teams and multiple projects, these platform teams are a way of basically bringing some sanity and control and focus, I guess, to the whole space.
[35:00]
B
One trend across the industry, of course, is AI. It's hard to see any teams where devs are not using AI agents specifically to code. You know, product managers will be using these things, and of course we have a lot more code produced as a result. When it comes to CI/CD systems, what are you seeing changing there because of AI?
[35:24]
C
This is the elephant in the room, right? How is AI affecting sort of this space? The reality is, I think, to be honest, it's still very early. I think what will happen is the impacts on CI/CD are really tightly coupled to how development teams end up using AI. So there's going to be, I guess, a lagging process there, but we're finding a lot of teams are starting to use AI in their development process. And so we're starting this process of going out and talking to customers and learning how they're handling AI in their application teams, and then how we can best leverage the CI side to support that. But then in addition to that, use AI within the pipeline itself, again, in the right place. So one of the things we've been, I think, pretty keen on at Octopus is this idea that at KubeCon, we were probably one of the few companies there that didn't have AI plastered all over it. We tried to be very... That's what gets the sales, right? Yeah.
[36:31]
B
You stand out now these days.
[36:32]
C
That's right. By not having AI. I mean, we've got AI in Octopus, but what we've been trying to do is think about, well, how do we actually use it in a way that's actually useful for our customers. Right. For engineers, et cetera. And so we've been slowly adding capabilities within Octopus to provide AI support, whether it's an MCP server, whether it's a recovery agent that can review logs and tasks and all that sort of thing. But that's within the product itself. Some of the bigger changes will depend on, like I said, how actual application teams use AI. What I think we'll find is there's going to be a lot more velocity. I think that's one of the big changes, right, there's just going to be a lot more code coming through. I think one of the questions is, okay, what does that mean for your pipeline? One of the things you often talk about when there's a human element to the pipeline is speeding up the cycle to get that feedback quicker. You know, if you've got engineers sitting there waiting for their code to run tests, they can get back to it and fix it. The shorter you can make that feedback loop, the better it becomes, because they don't need to context switch, et cetera. I think in a world where the majority of your code is being developed by AI, that becomes perhaps less important. You know, if you can kick off your build and test process and it takes 30 minutes versus 20 minutes, does it really matter if the engineers are already long gone, moved on to the next problem, and the actual AI agent itself can kind of babysit the process, review the problem that came up, and issue a new fix? I guess there'll be a de-emphasis, I think, on some of the speed of the pipeline itself and more emphasis on decreasing risk. Right. The risk that comes from having AI agents generate code. And so exactly what that process looks like, I guess, remains to be seen. 
I think what we'll see a lot more use of is things like progressive delivery and I Think particularly feature toggles are going to be a really common tool in the tool belt of application teams, partly because it allows you to ship that code as fast as you can or as fast as you want, but manage the rollout of the actual feature set or changes sort of independent of the deployment. So it decouples your deployment from your release. And so in a world where, you know, we've got a lot more AI agents generating code and being involved in perhaps part of the build process, those agents themselves being able to use toggles to react to it quickly, I think then become a lot more important than perhaps what we see today.
[38:59]
B
Can we talk about progressive delivery? What it is and what are the most common ways to de risk getting your code or your software out there?
[39:11]
C
Progressive delivery is the next evolution beyond continuous delivery. So, you know, with continuous delivery, it's this idea that I've made a change to the system and I want to ship it to dev or stage, and typically when it gets to production it goes out sort of in one hit, right. With progressive delivery, what you're trying to do is release those changes in a little bit more of a controlled way, typically through things like a canary deployment. So this is where you might deploy to some subset of your instances that are out there.
[39:43]
B
So what is a canary?
[39:45]
C
What is a canary? A canary deployment is... this is New Zealand, basically. New Zealand's our canary. So this is, as we said before, this idea where you select some subset of your customer base, or whatever that might be, and you would typically route traffic to a new instance. So you've got version one running and you want to release version two. You essentially ship version two side by side, and the most common approach would be some sort of network traffic manager to route some percentage of your traffic to that new instance, and you gradually roll that up. Typically, to do this process properly, you should have fairly mature observability mechanisms in place, so you can see whether to roll up or roll down.
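The "route some percentage of traffic and gradually roll it up" idea can be sketched in Python as below. Everything here is an assumption made for illustration: the hash-based bucketing, the 1% error threshold, the 10-point step, and the names `bucket`, `route`, and `next_percent` are hypothetical and not taken from any traffic manager or from Octopus.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below the canary percentage to v2."""
    return "v2" if bucket(user_id) < canary_percent else "v1"


def next_percent(current: int, error_rate: float,
                 threshold: float = 0.01, step: int = 10) -> int:
    """Roll up while the canary looks healthy, roll down to zero otherwise."""
    if error_rate > threshold:
        return 0
    return min(100, current + step)


percent = 10
users = [f"user-{i}" for i in range(1000)]
on_canary = sum(route(u, percent) == "v2" for u in users)
print(f"{on_canary} of {len(users)} users on v2 at {percent}%")
print("next step if healthy:", next_percent(percent, error_rate=0.002))
print("next step if failing:", next_percent(percent, error_rate=0.05))
```

Because the bucket is derived from the user identifier, a user who lands on the new version at 10% stays on it as the percentage rises, which keeps their experience consistent during the ramp.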
[40:30]
B
I guess this whole thing comes from a canary in a coal mine, right?
[40:33]
C
That's right, yeah. So the idea being that, you know, in the old days when you'd be in a coal mine digging away, it would release all sorts of toxic fumes and things like that. Canaries were a lot more sensitive to it. So they'd have a little canary in a cage, and if that canary sort of died, I guess, got knocked down,
[40:51]
B
I think the canaries as I understand they were like, chirping and then.
[40:55]
C
Okay, well, that sounds better.
[40:57]
B
But when it stopped chirping, well, it also died.
[40:58]
C
Oh, okay. So, all right. Same ending. Same ending, but just, you know, nicer. A nicer way to go out.
[41:04]
B
They need to get out.
[41:06]
C
Yeah. So it's this idea that you get that advance warning, I guess, so that rather than you getting knocked out by the toxic gases, you can get out of there sooner. So it's that same principle, I guess, brought to software. There are various other mechanisms, like blue-green deployments. So you've got your first version, that's still receiving traffic, and your second version is up and running, and you can now do some tests against it, validate it. Maybe you've got the IP details to access it directly. You can basically validate that it's working. Sometimes it may be a way of avoiding cold starts and things, because that process may need to initialize a bunch of stuff. But then when you've done that validation and you're ready, you can essentially swap traffic around, so all the new traffic goes to the other one. In some ways, it's like doing a canary, but straight to 100%, but you're doing a bunch of validation on the side before it actually reaches customers. In my view, probably the more useful progressive delivery strategy is feature toggles. So this is the idea that you've got some sort of... Feature flags as well? Feature flags, feature toggles, yeah. Often used interchangeably. So this is the idea that you've got some sort of variable in your system, and it's typically linked to some sort of external service, and through the state of that particular variable being true or false, on or off, you can essentially have different code paths take effect. And there's a bunch of benefits that feature toggles have over, say, canary releases, particularly for application delivery, where your unit of change with a feature toggle is very granular. It can be, you know, single lines of code, and so everything else remains the same. 
And all you're doing is tweaking that single line of code. With a canary, or any sort of versioned deployment mechanism, your unit of change is the entire app. So if you've had 20 commits since the last release went out, then you're essentially testing all 20 things in that one hit. Your ability to target the actual customers is also a lot more precise when using feature toggles. So you can use all sorts of complex rules and say that, I don't know, everyone from Germany who has this particular product in the basket has this kind of experience. And that's really hard to do via network traffic rules. Right. The other is your ability to actually roll back. So to roll back from a canary, hopefully you're still in the process where you're going through that canary process and you can roll it back. That could take minutes. Maybe you have to redeploy the whole old version. That could be minutes or more. With a feature toggle, you can do that in seconds. That's pressing a button and it happens immediately. Not only that, but you've got more control, I guess, over when you do that. So with a deployment that you're doing via a standard versioned release, you're tied to when that deployment takes place, because when it takes place, that's when essentially your new feature is available. And as an application team, that means you need to know exactly when it's taking place and make sure you're watching the logs at that point. And maybe you and 10 other teams who are shipping things at the same time are all doing the same thing. Whereas with feature flags, you've basically got control over when that takes place. 
So you might ship the actual, you know, the assemblies and that sort of thing on the Monday, but you release your feature on Tuesday when you come in and you've got the logs ready and you've reviewed what the next steps are. So it really makes it a lot easier to decouple releasing a feature from deploying software. Versioned deployments through canary, et cetera, they're really useful, particularly if you're doing infrastructure-type changes where there is no application toggle that's relevant, but you want to validate some changes to your infrastructure or your process, or potentially things that will involve schema changes. And schema changes are the big... Database schema changes, right? Database schema changes. This is the big problem, to be fair, in any progressive delivery. And this is why the question always is, are you ready for progressive delivery to do schema changes? I guess this is the point where application teams kind of need to be really mature. And I don't mean mature in terms of not telling silly jokes, but mature in terms of understanding all the problems that are in play with this, and knowing how to release these sorts of changes in a gradual, controlled fashion and do it over multiple stages. You know, ironically, it's actually quite hard for us at Octopus, because our software is both SaaS hosted, so we have a SaaS offering that customers can use, and we have an on-premise version. And because we have both sides, we kind of have the best and worst of both worlds. In the cloud system, if you've got a SaaS product, you have complete control over what versions go where. So if you want to do an expand and contract, you can stage the whole process. You know that it's all been updated before you move to the next stage. 
On the other hand, for a self-hosted application, where they go in and install it on their own infrastructure somewhere, you don't know what version they're running and what they're coming from. So they might upgrade from version one straight to version six. And so you're not really forcing them to go through that expand and contract phase. On the other hand, they've got a lot more control over when they upgrade. And so you can be a little bit more deliberate about making sure that they do backups before they change, and maybe they can manage that migration and accept a little bit more downtime during migrations and updates and things like that than would actually be acceptable in a SaaS product.
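The feature toggle targeting described in this answer (for example, everyone from Germany with a particular product in the basket) can be sketched in Python like this. It is a minimal in-memory illustration only: real toggles are usually backed by an external service, as noted in the conversation, and the names `Context`, `Toggle`, `is_on`, `checkout_experience`, and the product identifier `sku-123` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    country: str
    basket: set = field(default_factory=set)


@dataclass
class Toggle:
    enabled: bool = False                      # global kill switch
    countries: set = field(default_factory=set)
    required_product: Optional[str] = None

    def is_on(self, ctx: Context) -> bool:
        if not self.enabled:
            return False                       # off for everyone, immediately
        if self.countries and ctx.country not in self.countries:
            return False
        if self.required_product and self.required_product not in ctx.basket:
            return False
        return True


def checkout_experience(toggle: Toggle, ctx: Context) -> str:
    """Both code paths are deployed; the toggle decides which one runs."""
    return "new" if toggle.is_on(ctx) else "old"


toggle = Toggle(enabled=True, countries={"DE"}, required_product="sku-123")
print(checkout_experience(toggle, Context("DE", {"sku-123"})))
print(checkout_experience(toggle, Context("NZ", {"sku-123"})))

toggle.enabled = False                         # the release is pulled, no redeploy
print(checkout_experience(toggle, Context("DE", {"sku-123"})))
```

The last three lines show the decoupling discussed above: the code for the new path stays deployed, and flipping `enabled` changes who sees it without another deployment.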
[46:42]
B
So one thing about, you know, we talked about progressive delivery, and you're kind of doing this to avoid surprises. You know, if a regression goes out, a new bug, or something doesn't work, you kind of want to catch it early. Hopefully only a few customers have experienced it. And even if it's not 100%, you have a way to go back. If it's a feature flag, you hide it. If it's a canary deployment, you go back to the other one. But there's also this thing where, when things do go wrong, at some point you want to do a rollback. Can we talk about how you've seen rollbacks done well, and what it takes to actually have a real rollback strategy? A bunch of people talk about CI/CD, some people talk about feature flags. I don't hear too much chatter about rollbacks.
[47:30]
C
Yeah, rollbacks. This is always a spicy one. We get a lot of customers who say, why don't you have a rollback button? I want to roll things back. Why can't we roll things back as
[47:38]
B
in the deployment software, like Octopus or anything else. They're like, okay, if it can deploy, I want to do track point and just do a rollback.
[47:46]
C
That's right. How hard could it be? Just do what you.
[47:48]
B
How hard could it be? Tell me.
[47:49]
C
Well, this is the problem, right? In a completely stateless system, that's pretty straightforward. And this is something that GitOps is really good at, where you'll have that definition somewhere in a repo. If it's completely stateless, you can do a git revert and push it and it'll go back. The reality is that most systems out there have probably got some state, state being databases, or any sort of information that you can't necessarily just undo, because if you roll it back, now you've got your code talking with a database schema that's not in sync. If you've got a schema migration, let's say, in a normal deployment you can provide alongside it a secondary sort of anti-migration that undoes the change. But again, that's not always possible. You need to deal with what you are going to do with that data. We've gotten pretty far in basically trying to advise customers that you want to avoid ever talking about rollback; it's always roll forward. So if there's a bug, okay, roll forward, get your change in as soon as possible. This is where fast feedback loops are important, right? This is what the hotfix processes are for. We all know that in a standard process you want to go dev, staging, prod, and maybe you've got approval processes and it slows down, et cetera. But if you've got a significant bug that you need to, quote unquote, roll back, sometimes the safest thing to do is actually make a hotfix to that version and push it out as quick as possible. And your bottleneck might be the build pipeline or whatever, but depending on your appetite for risk there, you can resolve that a lot quicker.
Now obviously, if the failure itself is just from some mechanism in the deployment process itself or somewhere further down that chain, then your time to recover is going to be a lot quicker. But it's this idea that if I've got a failure in version two, my rollback isn't to go to version one, it's to go to version three and make sure I've got that fix in version three. It's the sort of thing that, you know, when we talk to customers, some of them go, yeah, we roll back. We roll back all the time if there's a problem. And then when you ask them what they do if they've got a schema change, they kind of stop and realize that it's just sheer luck that they've never run into that. Right.
[50:08]
B
Is it fair to say that you want to roll forward if it involves business logic or something that is not stateless? Because if it is stateless, or if it's application logic, your code that says if this, else that, and you realize there's a bug there, you can just revert it, as long as it doesn't touch the schema or the data.
[50:27]
C
Yeah. I mean, in an ideal world your reverting is through a feature flag, right? You click and you're essentially reverting by changing the code path. And this is why I always say feature flags are a nice tool to use for doing this progressive delivery. Because just as easy as it is to roll out that feature, you can typically roll it back. Now, you're still going to have some of those problems with schema issues, et cetera. If you're making a change and you've got parts of your code path that expect one and not the other, you're going to need to account for that.
[50:55]
B
But you can even account for that inside the feature flag.
[50:58]
C
That's right, yeah. So that's the way you sort of ideally sort of manage that. So that within, regardless of which path you get in the feature flag, it's kind of self consistent with whatever version of the actual sort of database schema that's out there.
[51:10]
B
So I guess the more feature flags you use, the fewer surprises you might have. But it's a bit of extra work both to build and also to remove. You get stuck with stale feature flags all across your code base once you start to use them a lot. I saw this at Uber.
[51:23]
C
Yes, yes, a hundred times yes. So when you're adding a feature toggle to your app itself... We at Octopus obviously use feature toggles in our code quite a lot, and we use OpenFeature as the framework, the SDK, to interact with it. But we essentially have built a wrapper around it where, for the toggle itself within the code, we provide some details about which team owns it, and that team sets an expiry on it. Now, when that expiry time passes, nothing bad will happen. But through parts of the CI process, if that time has passed, we can send a notification to that team and say, hey, it looks like this toggle's no longer used. So the specific mechanics don't matter as much. It's more a matter of making sure that you keep track of the feature toggles you add, because it's really easy to forget about them: you start rolling one out and you kind of forget about it. And you want to keep it in there just in case for a while, in case you need to roll it back. Having the ability to understand how long a toggle has been there is a key part of helping to maintain that hygiene. Now, the reality is, even at Octopus, we've got a bunch in. I know I've got a bunch in there, and I'm sure if I was to log in I'd probably get a bunch of notifications to remove them. You know when we use that gardening metaphor in code, right? This is one of those sort of operations. This is weeding. You need to just keep on top of it. There are some mechanisms around, even in lieu of the AI side. Ideally, if you're using feature toggles, you've probably got a bunch of observability and metrics and logging around it. And there are some tools out there that will allow you to keep track of the last time a toggle was evaluated. And that gives you that signal.
Similarly, you might remove it from the code, because typically when you want to remove a feature toggle, you want to remove it from the code first, before you touch your actual toggle system. And once you remove it from the code, it might take two weeks before it makes it all the way out into production, so you don't want to delete it before then. By that time you've kind of forgotten about the fact you removed it.
[53:21]
A
Oh yeah.
[53:22]
C
And so having mechanisms that will keep track of that change going through the system, and that, when it reaches the environment where it's actually being used, production, can show, okay, that code's gone out, it's fine and safe to actually remove the toggle configuration. Because you've got that feature toggle information in two places, right? You've got it in the code and you've got it in your platform.
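The owner-plus-expiry idea described above can be sketched in a few lines. This is an illustrative assumption, not the Octopus wrapper and not the OpenFeature SDK API: the `Toggle` class, its field names, and the `expired_toggles` and `reminders` helpers are all hypothetical. As in the conversation, an expired toggle does not break anything; a CI step would simply turn the result into a nudge for the owning team.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Toggle:
    # Hypothetical metadata declared next to the toggle in code.
    key: str
    owner_team: str
    expires_on: date


def expired_toggles(toggles, today):
    # A toggle counts as expired once its expiry date is in the past.
    return [t for t in toggles if t.expires_on < today]


def reminders(toggles, today):
    # Build one human-readable reminder per expired toggle. How the
    # reminder is delivered (chat, email, CI annotation) is left open.
    return [
        f"{t.owner_team}: toggle '{t.key}' passed its expiry on "
        f"{t.expires_on.isoformat()}; is it still needed?"
        for t in expired_toggles(toggles, today)
    ]
```

A CI job could call `reminders(all_toggles, date.today())` on each run and post whatever comes back, leaving the decision to remove the toggle with the team that owns it.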
[53:44]
B
Can we talk about how development environments evolve? We talked about CI/CD, but I'm interested more in, you know, you start with one environment, and later you might have staging or something. What evolution have you seen across all the teams that you work with, all these hundreds or thousands of teams?
[54:04]
C
Yeah, I'm not sure if there is one particular pattern there. I mean, I think the most common is dev, test, prod.
[54:14]
B
So these three different environments.
[54:16]
C
Yeah, and I mean even that I think is probably a gross simplification of all the different kind of.
[54:22]
B
By the way, and dev, meaning my local machine?
[54:23]
C
Dev, in the case of CD, is often like the first point of integration. Test, often customers will keep reasonably in sync with, let's say, production or some sort of sanitized data source, so that whether it's the QA testers or the product team or whatever, they can go and review the code. Dev is almost like the first point of integration. That is, is the deployment process, just at its core, actually working, or is anything fundamentally broken at all? I think more and more now we're finding that dev is less useful in that respect, and what we're seeing is more the growth of things like ephemeral environments. This is the idea that I, as an engineer, am running some sort of feature on a feature branch and I want to evaluate that it's actually doing what we're expecting it to do. But not only that, I want the rest of my team to be able to see it working. And if I've got it running on my machine, it's not exactly easy to give other people access, and I may want to completely context switch and move on to something completely different. So ephemeral environments is this idea that from my branch, pre-merge, I want to spin up a whole environment essentially from scratch, ideally with whatever dependencies are required to run this particular component that I've been building. And then I want to deploy my app into that as if it was a normal, full-fledged environment. Once that's available, I want to have access to it. If it's a web app, maybe it gives me the URL and I can poke around it and hand it around and other people can evaluate it. And then the moment I merge that PR, tear it down again. You know, it's quite common to have multiple test environments because I've got a lot of stuff going through my pipeline and I've got three testers, so let's have three environments, so they can all have one at once.
Or often you'll see a single test environment and a bunch of testers, and they all need to collaborate to see who's got access to the system at the moment, et cetera. Whereas with ephemeral environments (it doesn't roll off the tongue), you can essentially have a full-fledged deployment per feature. And so again, that's about speeding up that feedback process, right? All of these processes are about speeding up that feedback process to catch those failures or issues or bugs or whatever sooner.
[56:51]
B
There was a time a few years ago when cloud development environments were really talked about a lot. The idea is that, as a developer, you have an environment spin up in the cloud, let's say your Visual Studio Code connects to it, or maybe you just log in online, and it spins up all the dependencies, oftentimes done with containers, which this reminds me of as well. And there are also preview environments. But somehow it feels that both that discussion and this one kind of died down. Maybe it's AI, maybe it's something else, but I mean, the technology is there, right? We have containers, you can package things together. I'm sure it depends, but it's all doable.
[57:28]
C
Yeah, it does get tricky. This is again one of those sort of things that's really easy to talk about for simple cases. It can get tricky when. What if I've got more than just a single app in my kind of quote unquote environment and how do I make sure it's got all the data I need to validate? So it can get tricky.
[57:45]
B
Or if you have a bunch of services that have state.
[57:48]
C
That's right, exactly. So there are sort of complications that it does bring. But I guess the benefits that you get as an application team, particularly application team, where you've still got engineers writing code, is sort of speeding up that feedback process, I guess.
[58:05]
B
Well, now with AI agents everywhere, that's even better, in the sense that, you know, we have code reviews, and an AI agent generates the code and you look at it. But is it not better to just confirm that this thing works, especially when it has a UI?
[58:21]
C
That's right. I think even in that world where you've got AI agents building the code and validating the code, any sort of scenario where you want the AI agent to validate what it's done, you're essentially talking about ephemeral environments. Even if it's not exposed to people, because it's doing its own testing and poking around in whatever shape or form, that still is, I guess, one of these kinds of environments. Right. So it's ephemeral, it spins up, you've got some sort of provisioning process, and then ideally, once the job's done, you tear it down.
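The spin up, use, tear down lifecycle described above maps naturally onto a context manager. The sketch below is only an illustration under stated assumptions: `ephemeral_environment` is a hypothetical name, and the `provision` and `teardown` callables stand in for whatever tooling actually creates and destroys the environment. It does not reflect any particular product's API.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_environment(branch, provision, teardown):
    # provision(branch) is assumed to create the environment for a branch
    # and return some handle describing it (for example, a URL).
    env = provision(branch)
    try:
        # Hand the environment to the caller: a reviewer, a test suite,
        # or an AI agent validating its own change.
        yield env
    finally:
        # Always tear down, even if validation raised an error, so
        # abandoned environments do not pile up.
        teardown(env)
```

The `finally` clause is the design choice that matters here: the environment goes away when the job is done, whether or not the validation succeeded.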
[58:52]
B
I'm interested in learning more about the reality of operating a large infrastructure platform. And one big one you're working on is actually Octopus Deploy's SaaS offering. What does that look like, and what are the challenges of running something where you're running all of these deploy processes, all this CD? You probably have a bunch of different things. What is it like?
[59:14]
C
So at the moment I'm not on the team that builds that, but I can give some of the context, I guess, from history. Originally, when we first decided to provide an Octopus SaaS offering, I think it was 2020 or something like that, it was all VMs. So every customer would basically get a VM, a virtual machine, spun up. And Octopus, the self-installed version, would basically get installed onto that VM, and they'd get a whole VM for running workloads on, et cetera. And that was very much not cost effective. It was costing us something like 100 bucks per customer per month, and they were paying, I don't know, $20 a month or whatever it was. But this whole process was more an experiment to see whether there was demand. And to his credit, Paul was happy to pass out the credit card to go through this process to see: is this actually the direction we want to go? Is this something that's going to turn into a viable direction for the company? Because it's a big step, right, going from building software that you hand out and people can download and manage themselves to.
[60:21]
B
It was like pretty much self hosted or like run on your own infrastructure.
[60:24]
C
Exactly. Yeah, that's right. And so the demand was there. Not long after that first experiment, we basically started from scratch again, and I worked with a couple of the other engineers back then to start building it on Kubernetes. And so in that space we have what we call a reef. What you'll find is that everything in Octopus has always got octopus or nautical kind of names around it. So a reef is basically this cell-based architecture where a reef contains all the resources that are needed for a particular customer's instance. Well, some of it's shared, but it's broken down into individual cells. And so a reef will contain the cluster and Azure database, et cetera, and each customer instance is now running in a pod in that cluster. As part of that project, I think I was working on converting Octopus so it could run on Linux and inside containers, and someone else was building the dynamic worker infrastructure. So there were a couple of us that just got in and really just got it up and running, so we could start moving forward and, I guess, stop losing money. Fast forward to today. Now there's an entire team that backs that, and we've got several thousand customers on it, and we run many, many thousands of deployments every month. And so there's a project at the moment to make the Octopus deployment process itself more resilient. What that means is, at the moment, when a deployment kicks off, a bunch of the process, which is an imperative set of steps, is stored in memory, which means that whenever we want to do an upgrade, we need to essentially stop running tasks for some period of time so it can kill their instance and spin another one back up.
Octopus itself at the moment doesn't have zero downtime between upgrades, so there's a bit of downtime there. We want to reduce that and get it as close to zero as possible, with the realization that going from downtime of five minutes to one, that's just work, right? You can move things around, you can maybe change the architecture. Going from 10 seconds to zero is a much bigger shift. I'm not sure if and when we'll get there, but there's definitely this big effort at the moment to make the whole process a lot more resilient and reduce the amount of downtime that takes place, so we can perform upgrades quicker, et cetera.
[62:57]
B
One interesting thing you do is you have a SaaS but you also have an on Prem offering. What are interesting engineering challenges that come from that? A lot of companies have decided to just like honestly just move, to move to SaaS because now they control everything centrally. I think Jira did this or maybe they're doing it, which is a well known one, but clearly it's just a lot more work and a lot more headache to have both.
[63:18]
C
Yeah, and we touched on one of the big problems here a little earlier: when we want to push out any updates to cloud, because we control the whole process, we can push it out. And so we have a gradual rollout process. Because each customer's on their own instance, we can deploy each one individually, and that may take, I don't know, a few days to roll out a change. On-prem, though, is another matter. I was actually digging into some of the stats around this a little while ago. Let's say you ship a new change today: it takes about 200 days, on average, for 50% of our on-prem customers to get it. That's half a year. But then there's almost an exponential decay there, where it takes 400 and something days for 75% to get it. So there's this curve where, I mean, we've got customers that are still running versions of Octopus from five, six, seven years ago. And so whenever we ship a new change, we need to basically make sure Octopus will work from version, you know, 2023.1 to 2026.4. And so there's a bunch more baggage, I guess, that we have in terms of schema upgrades and making sure that that whole process is actually achievable.
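As a back-of-the-envelope check on the "almost an exponential decay" remark, suppose (purely as a modelling assumption, not Octopus's actual data) that the share of on-prem customers still on an older version halves every 200 days. The fraction $f(t)$ that has picked up a change $t$ days after it ships would then be:

```latex
\[
  f(t) = 1 - 2^{-t/T_{1/2}}, \qquad T_{1/2} \approx 200 \text{ days}
\]
\[
  f(200) = 1 - 2^{-1} = 0.5, \qquad
  f(400) = 1 - 2^{-2} = 0.75, \qquad
  f(600) = 1 - 2^{-3} = 0.875
\]
```

The first two values are roughly in line with the figures quoted (about 200 days for 50%, 400 and something days for 75%). The third is only what this simplified model would predict; the long tail of customers on five-to-seven-year-old versions suggests the real curve flattens out more slowly.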
[64:32]
B
But why do you do it? A lot of startups will be like, screw it, let's not support old versions. This even happens on mobile. What's the benefit? And this, it feels like you're kind of swimming against the crowd with this one.
[64:44]
C
The majority of our customers are still on-prem. You're talking about banks, financial institutions, governments, things like that, where they want full control over the system. They want to run it on their own hardware. Now, they may use their own cloud or whatever to run it, but they want to manage the whole process and be in control of, let's say, upgrades or downtime or things like that. So it's certainly not uncommon, and I don't think that's going away anytime soon. As for the upgrade support, in the past couple of years we've been getting a lot more, I guess, confident with deprecating features and things like that and just cutting loose old capabilities. And part of that has come from fully embracing feature toggles as part of that process. I think we're getting a little bit braver in terms of removing capabilities that perhaps older customers may miss. But I don't think that in the long term self-hosted will go away. This is one of the sort of things where I think it's really common to hear, you know, everything's in the cloud, we're all in the cloud. Again, the reality is there's a lot of companies out there where for them it just doesn't make sense, or it's not viable, or it doesn't meet compliance requirements, or whatever the case may be.
[65:57]
B
Also, it's kind of a reminder, I think, that you actually might have a lot less competition if you build infrastructure software that also runs on prem because it sounds like there's a demand where companies are like, we want to give you money in order for us to run on prem. And I'm sure some of them would do SaaS if there's no other alternative. But for SaaS it's easier to build anyway so there'll be more competition. So if you're an entrepreneur or if you're a software engineer thinking to do a business or start a business, it might give you an edge.
[66:26]
C
Yeah, that's right.
[66:27]
B
It sounds like a lot of your customers, you know, the ones who have not upgraded your software for, let's say, five years: on one hand, you probably say, oh my gosh, what are they doing? But they might just be happy with it. And if they keep paying you as a business, those are some of your most loyal customers. You see what I mean?
[66:42]
C
That's right. And this is the thing. I mean, I remember when I worked in like when I worked in the previous job that used Octopus or any of us who have any other sort of, you know, software that you've, you've got running, potentially you've got running locally, if it just works, why, why, why touch it? I guess. And so it's kind of the bane of, of our existence because it annoys us if we want to ship the features and give them all these great new things. But on the flip side, you know, particularly for something as critical as, you know, their deployment system, a lot of customers, once they've got it running, they kind of step away and go, okay, let's, let's just let it be.
[67:16]
B
And it keeps happening with AI as well, in the sense that, for example, I just read that Cursor's latest coding model is updated, I think, every five hours, which is amazing. It keeps getting better. However, there are customers who, once you have an LLM and it works for you, you've kind of tuned it, you have the instructions, great. But oftentimes what happens is a new version of a model or a major version comes out and it stops working. And I assume that there will be more teams, companies, businesses who are like, look, it would be worth money to me to pin this thing or to run it on my own infra and just have it stay as is. And then I will decide when I want to change it. If it ain't broken, don't fix it.
[67:57]
C
That's right. And I think, to Octopus's credit, we have a really good history of helping customers even when they're on those older versions, sometimes to the extent of me wanting to say to the support team, they're on an old instance, just tell them to upgrade to get the fix. But the support team are second to none in terms of their willingness to help. And as you said, if they're willing to pay us, who am I to say no?
[68:24]
B
Yeah, I mean, it's a business strategy, but I think it's just a nice reminder that there's not just one size. And even though I think SaaS is eating the world, and we're hearing it and we're seeing it, it's nice to see that it's not just that. As closing: if I'm a software engineer and I would like to move beyond continuous delivery and continuous deployment and go into progressive delivery, what pointers can you give me?
[68:50]
C
Yeah, I guess just start with something, right? So start with adding one feature toggle. It may be scary at first to go, oh, it's in production. If I toggle this, I'm going to break something in production. It's nice and comfortable to know that you're well to the left of the running systems, and if you ship code, everything will be caught by the tests. But if I toggle it, what will happen? It's kind of like a drug, right? Once you start doing it, you don't want to stop. And that's why we've got this hygiene problem for things like feature toggles. It's really easy to add them and actually end up with the opposite problem: how do you control yourself? How do you stop? So I'd say just start doing it. Add one, and keep an eye on it as you roll it out and look at the results from it. And the reality is, I've shipped features on feature toggles where I've shipped a bug, right? It's one thing to ship something and turn on a feature and go, okay, cool, customers have it. It's a very different thing when you ship something and there's a problem, and you can reach immediately for the toggle and switch it back off. You know, the amount of times in the past you had this panic of, oh no, I've shipped something, I don't know what's going wrong. And particularly when you're in that state, maybe you've been called up at 2am because you're on call, and you don't know what the next step is, and you've got a panicked mind: should I build a new thing, or do I somehow force a redeployment? So having the capability of being able to flick that switch just allows you to calm right down and go, okay, I've stemmed the bleeding, now come back and reanalyze it and understand what's wrong.
And so having that capability, once you experience that and realize the value of not just rolling things out but, I guess, rolling that individual feature back off, yeah, you'll want to use it for everything.
[70:28]
B
What's one or two books you would recommend and why?
[70:30]
C
I'll give two kind of, I guess, technical ones and one more fun one. The Phoenix Project is still, for me, a good one. This is the one by Gene Kim. Yeah, yeah. And I can see you remember that we got that in Skype. This was one that Abdela kind of
[70:46]
B
gave to everyone and our manager gave it to everyone.
[70:49]
C
Yeah. And, you know, parts of it may be a little bit outdated, and some of the practices have changed a little bit. But at its core, this idea of, as an engineer, being involved in that whole operations side of what you're shipping, and the value that gives to not just the company but to you, is amazing. So I think that book is one of those core foundational ones that sets the context for everything we talked about today. The other one, from a more, I guess, organizational and communication side of things, is Radical Candor by Kim Scott, which allows you to communicate more efficiently and with more compassion with your peers and other people around you. It's really common. I'm an engineer, so I know sometimes you look back on what you said and you feel like, okay, maybe I can be a little bit blunt. Whereas Radical Candor teaches us that you want to have those communications that both show that you're caring and empathetic and are also direct. It covers the benefits of that, and the inverse of that, where perhaps, like I said, you're very blunt: you're being honest about it, but you're missing that empathy. So I found that book really useful and interesting, not even just as an engineer, but as a person working with other people. From the more fun side, basically anything by Greg Egan. He's an Australian sci-fi author. He writes some pretty crazy, mind-bending hard sci-fi. So if you're really into that, I'd say read something like Diaspora or Schild's Ladder. They're the sort of books that actually took a second read to get through. And he's a mathematician as well, so he's got a whole bunch of background in mathematics on why a certain part of his story goes the way it does. He wrote an entire story on the premise of, what if the speed of light wasn't absolute, or something like that. This one premise, and it kind of breaks out into.
And then this is what happens to energy and therefore molecules work like this and da da da da. And as a, you know, I'm a tech nerd, that sort of science stuff really appeals.
[72:54]
B
Same when sci fi, there's some science involved. That's actually. I find it way more fun. Rob, thanks very much.
[73:00]
C
Thank you, Gerg. It's been great.
[73:01]
B
This is great.
[73:02]
A
What an interesting conversation. I hope you enjoyed hearing from someone like Rob, who has been building and thinking about CI/CD at scale for a decade. It was such a fun blast-from-the-past story as he talked about how at Skype our team basically did continuous delivery years before most of the industry caught up, and how we did it by quietly shipping new builds to New Zealand every week, using this as our canary country. It's a reminder that a lot of modern software practices were already being run in the wild by devs who just wanted to ship software faster than our Change Advisory Board would allow us to. One other thing I took note of is Rob's take on rollbacks. Lots of engineering teams talk about rollbacks as if they're a safety net, but the moment you have a database schema change in the mix, what safety net? Rob's advice is to roll forward, not back, and use feature toggles as a way to turn features off or on. This is also a reminder that investing in feature flags is usually really helpful. But if you have feature flags, be sure to clean up after them after you've rolled them out. Otherwise they become a big mess. Finally, a part where I learned something new was on GitOps. I've always assumed that GitOps was about, well, Git, but as Rob pointed out, none of the four actual pillars of GitOps require Git at all. The four pillars: number one, declarative; number two, versioned and immutable; number three, pulled, not pushed; number four, continuously reconciled. The name GitOps has caused the whole industry to get a bit dogmatic about putting everything into a Git repo, even things like secrets, which absolutely should not be there. Rob's take is that most teams just want to ship software. If GitOps helps with that part, great. But if a more practical process works better, just use that. Do check out the show notes below for related The Pragmatic Engineer deep dives on backend technologies and other related topics. If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating on the show. Thanks and see you in the next one.