
Podcast Summary: "Cloud War Games: Building Disaster Muscle Memory and Collaborative Resilience in DevOps Teams"

Podcast: To The Point Cybersecurity
Episode: REPLAY: Cloud War Games: Building Disaster Muscle Memory and Collaborative Resilience in DevOps Teams with Matt Lea
Date: March 31, 2026
Host(s): Rachael Lyon, Jonathan Knepher
Guest: Matt Lea, creator of Cloud War Games

Episode Overview

This episode dives deep into the critical topic of building disaster resilience within DevOps and cloud engineering teams, emphasizing the value of realistic crisis simulations (“Cloud War Games”) to develop muscle memory, collaborative problem solving, and adaptability in the face of growing cyber threats and operational outages. Matt Lea shares practical insights from his experiences training and advising cloud engineering teams facing high-stakes service disruptions, discusses the crucial role of team dynamics and culture, and explores emergent challenges such as bot and AI-driven threats.

Key Discussion Points & Takeaways

1. The Origin and Power of "Cloud War Games"

Matt’s Motivation: He noticed that junior DevOps staff, under pressure during real outages, “freeze up” from stage fright or lack of experience in high-stakes troubleshooting (01:54).
Simulating Real Problems: Matt began journaling recurring 3 a.m. “big headaches,” turning these into realistic simulations for teams to practice on—moving from whiteboards to actual cloud infrastructure, culminating in the Cloud War Games platform (02:30–03:40).

“If you get lemons, you might as well make some lemonade. So then I started designing these simulations…” (03:25, Matt Lea)
Collaborative Training: Emphasis on designing scenarios to foster communication and shared problem-solving, not just “hero culture” (04:13).

“You want to have that one super engineer ... but when we design these scenarios, most of the time I try and make [it] more collaborative.” (04:15, Matt Lea)

2. Building Team Resilience and Breaking Down Knowledge Silos

Revealing Single Points of Failure: Simulations often expose over-reliance on key individuals—teams only realize their vulnerability when forced to act without them (05:38).

“Take [the lead engineer’s] keyboard away and break something and then see how long it takes to come back in dev or staging or something of that nature.” (06:37, Matt Lea)
Cultural Shifts: Organizations must move away from fear-of-failure mindsets, encouraging experimentation and rapid iteration (07:19).

“People just freeze up on, ‘Can I scale this down to zero, run another deploy?’ ... So that’s one of the things I guess I’m good at ... teaching: know which switches to flip.” (08:25, Matt Lea)
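The idea of knowing which switches are safe to flip can be written down as a small runbook check. The sketch below is only an illustration: the action names, the three risk levels, and the `may_run` helper are all hypothetical, and the classifications simply mirror the examples mentioned in the episode (restarting ECS tasks is recoverable, deleting a production database is not).

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    REVERSIBLE = "reversible"      # safe to try quickly during an incident
    IRREVERSIBLE = "irreversible"  # needs explicit sign-off first
    FORBIDDEN = "forbidden"        # never run, even with sign-off


@dataclass(frozen=True)
class Action:
    name: str
    risk: Risk


# Hypothetical runbook entries; a real team would maintain its own list.
RUNBOOK = {
    "restart_ecs_tasks": Action("restart_ecs_tasks", Risk.REVERSIBLE),
    "rerun_deploy": Action("rerun_deploy", Risk.REVERSIBLE),
    "delete_production_database": Action("delete_production_database", Risk.IRREVERSIBLE),
    "scale_production_to_zero": Action("scale_production_to_zero", Risk.FORBIDDEN),
}


def may_run(action_name: str, approved: bool = False) -> bool:
    """Return True if the named action may be run right now."""
    action = RUNBOOK.get(action_name)
    if action is None:
        return False  # unknown switches are treated as unsafe
    if action.risk is Risk.REVERSIBLE:
        return True
    if action.risk is Risk.IRREVERSIBLE:
        return approved
    return False  # FORBIDDEN
```

A drill can then ask juniors to classify each action before touching anything, and compare their answers with the documented list.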

3. Realism in Incident Response Training

From Tabletop to Real-Time: Immersive, surprise drills create lasting learning compared to scheduled, annual tabletop exercises (05:04–09:20).
Organizational Buy-In: Often, real investment in simulation training only comes after a painful outage or cyber attack (09:25).

“If I just approach a company that hasn’t seen any cybersecurity issues ... it’s not a high priority. But the day after that outage, it becomes a priority.” (09:30, Matt Lea)

4. Rapid Diagnosis: Is It a Bug or an Attack?

Layered Troubleshooting: Lea builds dashboards that run from the external layer (Route 53, API Gateway) to the internal layers (ALBs, ECS tasks, RDS) and looks for the layer where metrics first depart from baseline (10:53–12:28).

“I just look for the discrepancies ... starting on the outside, work your way back.” (11:45, Matt Lea)
Preparedness Counts: Don’t build metrics or log pipelines during incidents—prepare ahead (12:35).

“You absolutely want to have those dashboards and your CloudWatch Insights queries set to go ... But we don’t always get handed a perfect hand of cards.” (12:40, Matt Lea)
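The outside-in approach can be sketched as a walk over per-layer metrics that stops at the first layer deviating from its baseline. This is a minimal illustration, not Lea's tooling: the layer keys, the single-number-per-layer metric, and the 50% tolerance are assumptions, and in practice the numbers would come from dashboards or queries prepared ahead of time.

```python
# Layers ordered from the outside in, following the order described in the episode.
LAYERS = ["route53", "api_gateway", "alb", "ecs", "rds"]


def first_discrepancy(baseline, current, tolerance=0.5):
    """Return the outermost layer whose metric deviates from its baseline by
    more than `tolerance` (a fraction of the baseline), or None if none do."""
    for layer in LAYERS:
        base = baseline[layer]
        observed = current[layer]
        if base == 0:
            deviates = observed != 0
        else:
            deviates = abs(observed - base) / base > tolerance
        if deviates:
            return layer
    return None
```

If external request counts look normal but the database layer is far above baseline, the function points at `rds`, which matches the batch-job example from the conversation.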

5. Credential Leaks & Containment Strategies

Contain, Don’t Overreact: Disable rather than outright delete leaked credentials to avoid accidental outages for critical third parties (13:34–14:51).

“Engineers think in ones and zeros, but the C suite—and the language of business—is dollars and cents. So you always got to be doing that math.” (14:37, Matt Lea)
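A minimal sketch of "disable, don't delete" is shown below. It assumes a boto3 IAM client is passed in (for example `boto3.client("iam")`) and uses its `update_access_key` call to mark the key `Inactive`, which can later be reversed with `Status="Active"`. The function names, the user name, and the cost helper are illustrative assumptions; the $100,000 per hour figure is the one Lea cites for his e-commerce client.

```python
def disable_access_key(iam_client, user_name, access_key_id):
    """Mark an IAM access key Inactive instead of deleting it, so it can be
    switched back on if a dependent third-party integration breaks."""
    iam_client.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )


def outage_cost(hourly_loss_usd, hours_down):
    """Rough dollars-and-cents estimate for a broken integration."""
    return hourly_loss_usd * hours_down


# Example of the quick math: two hours of blocked shipping at $100,000/hour.
# outage_cost(100_000, 2) evaluates to 200_000.
```

Passing the client in as a parameter keeps the sketch easy to exercise with a stub during a drill, without touching a real account.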

6. Security Guardrails & Insider Risks

Least Privilege Best Practices: Don’t grant rookies admin rights. Use password vaults and secrets managers (15:03–16:11).
Real-World Story: A client who demanded hand-delivery of firmware found their sensitive code—and keys—publicly on GitHub weeks later (16:12–16:54).

“We went through all this hassle and you—there it is, publicly, on someone’s GitHub with the keys.” (16:52, Matt Lea)
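One cheap guardrail against keys ending up in a public repository is to scan text for strings shaped like AWS access key IDs before it is committed. The sketch below is a crude heuristic, not a replacement for a dedicated secret scanner: it only looks for the `AKIA` prefix used by long-term IAM access key IDs, and the function name is hypothetical.

```python
import re

# Long-term IAM access key IDs start with "AKIA" followed by 16 uppercase
# letters or digits. This does not detect secret keys or other credential types.
ACCESS_KEY_PATTERN = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text):
    """Return every substring of `text` that looks like an AWS access key ID."""
    return ACCESS_KEY_PATTERN.findall(text)
```

A pre-commit hook could call this on each staged file and refuse the commit when the returned list is not empty.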

7. Layered Network Security

Beyond Passwords: Use granular IAM roles, careful security groups, and subnet isolation—multiple fail-safes in case other controls are bypassed (17:28–18:53).
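The security group layer can be checked mechanically. The sketch below takes one security group dictionary in the shape returned by the EC2 `describe_security_groups` API and reports ingress rules that open a database port to the whole internet. The port set (MySQL and PostgreSQL defaults) and the function name are assumptions chosen for illustration.

```python
DATABASE_PORTS = {3306, 5432}  # MySQL and PostgreSQL defaults; adjust as needed


def public_database_rules(security_group):
    """Return ingress rules that expose a database port to 0.0.0.0/0 or ::/0."""
    exposed = []
    for rule in security_group.get("IpPermissions", []):
        from_port = rule.get("FromPort")
        to_port = rule.get("ToPort")
        if from_port is None or to_port is None:
            # All-traffic rules (IpProtocol "-1") carry no port range,
            # so they are treated as covering every port.
            covers_db = rule.get("IpProtocol") == "-1"
        else:
            covers_db = any(from_port <= port <= to_port for port in DATABASE_PORTS)
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0" for r in rule.get("Ipv6Ranges", []))
        if covers_db and (open_v4 or open_v6):
            exposed.append(rule)
    return exposed
```

This only covers one of the three layers Lea lists; keeping the database in a private subnet still matters even when the check comes back empty.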

8. Evolving Threats: Bots & Agentic AI

Rise of Sophisticated Bot Traffic: The boundary between “good” shopping bots and malicious automation is blurring; volume and sophistication make detection and policy decisions harder (19:18–20:46).

“We’re at this very interesting spot where we’re spotting what we can tell is bot traffic ... but we’re letting them through ... That water is murky right now.” (19:40, Matt Lea)
AI Security Implications: Many companies are hurriedly bolting on AI—often giving LLMs dangerous levels of decision-making (e.g., refund issuing) without sufficient oversight (21:01–22:53).

“The bot that can issue a refund is a dangerous bot ... The bot shouldn’t make decisions, it should make recommendations.” (21:15, Matt Lea)
Adaptive Attacks: AI-powered bots can mutate input and evade detection—making signature-based security less reliable (22:53–23:13).
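For the pacing problem with otherwise welcome bots, one common approach is a per-client sliding-window limit that answers with a "slow down" status instead of blocking outright. In the episode Lea mentions an informal status code 420; the sketch below swaps in the standard HTTP 429 (Too Many Requests) with a retry hint. The class name, the limits, and the client identifier are assumptions, not a description of his clients' setup.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def check(self, client_id, now):
        """Return (status_code, retry_after_seconds) for a request at time `now`."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Tell the client how long until the oldest request expires.
            return 429, self.window_seconds - (now - hits[0])
        hits.append(now)
        return 200, 0
```

Rejected requests are not counted against the window, so a well-behaved bot that honors the retry hint gets through again as soon as capacity frees up.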

9. Human in the Loop: Avoiding AI Pitfalls

LLMs as Unreliable Interns: Generative AI can hallucinate, fabricate progress, and outright lie—always double-check its output (24:18–25:42).

“With those LLMs ... they’re basically like having an intern that lies to you.” (24:22, Matt Lea)
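Keeping a human in the loop can be made structural: the only refund tool exposed to the bot records a recommendation, and a separate review step, available only to staff, changes its status. The sketch below is a hypothetical in-memory version; the class, field names, and message text are illustrative assumptions, and a real system would persist the queue and audit log.

```python
from dataclasses import dataclass, field


@dataclass
class RefundQueue:
    pending: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def recommend_refund(self, order_id, amount_usd, reason):
        """Tool the bot may call: records a recommendation, never moves money."""
        self.pending.append({
            "order_id": order_id,
            "amount_usd": amount_usd,
            "reason": reason,
            "status": "pending_review",
        })
        self.audit_log.append(("bot_recommended", order_id, amount_usd))
        return "Your refund request was sent to our support team for review."

    def review(self, order_id, reviewer, approve):
        """Human-only step: approve or reject a pending recommendation."""
        for request in self.pending:
            if request["order_id"] == order_id and request["status"] == "pending_review":
                request["status"] = "approved" if approve else "rejected"
                self.audit_log.append((request["status"], order_id, reviewer))
                return request["status"]
        raise KeyError(f"no pending refund for order {order_id}")
```

Because every call appends to `audit_log`, a fabricated "I already issued your refund" message from the model can be checked against what was actually recorded.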

10. Model Choices for Startups: Build, Fine-Tune, or Buy?

Matt’s Advice: Off-the-shelf models and managed services are best for most; fine-tuning is the feasible middle ground for non-experts (26:26–27:51).

11. Multi-Cloud, Multi-Region, and Startup Growth

When to Go Multi-Cloud: Don’t over-engineer before product/market fit; multi-region/cloud becomes cost-effective only at scale (28:14–29:46).

“The smaller you are, the less I’m worried about vendor lock-in ... just boot it up as cheap as you can.” (28:30, Matt Lea)
Technical Debt as Strategy: Sometimes, taking on technical debt (with eyes open) is fine if business growth far outpaces the cost (30:06–31:22).

“As long as your income is growing 10x the debt you’re taking on ... it doesn’t hurt them.” (31:01, Matt Lea)

12. Matt Lea’s Cybersecurity Journey

Started programming by hacking video game files, later fell into cybersecurity and DevOps roles out of necessity while working for early-stage startups (31:56–33:32).

“I started programming ... hacking ... INI files and breaking them ... and then I find myself at various startups where you don’t have big budgets for someone else to do security or DevOps ... so I end up jumping in.” (32:00, Matt Lea)

Notable Quotes & Timestamps

“If you get lemons, you might as well make some lemonade. So then I started designing these simulations…”
— Matt Lea (03:25)
“You want to have that one super engineer ... but when we design these scenarios, most of the time I try and make [it] more collaborative.”
— Matt Lea (04:15)
“Take [the lead engineer’s] keyboard away and break something and then see how long it takes to come back in dev or staging or something of that nature.”
— Matt Lea (06:37)
“People just freeze up on, ‘Can I scale this down to zero, run another deploy?’”
— Matt Lea (08:25)
“If I just approach a company that hasn’t seen any cybersecurity issues ... it’s not a high priority. But the day after that outage, it becomes a priority.”
— Matt Lea (09:30)
“Engineers think in ones and zeros, but the C suite ... the language of business is dollars and cents.”
— Matt Lea (14:37)
“We went through all this hassle and you—there it is, publicly, on someone’s GitHub with the keys.”
— Matt Lea (16:52)
“The bot that can issue a refund is a dangerous bot ... The bot shouldn’t make decisions, it should make recommendations.”
— Matt Lea (21:15)
“With those LLMs ... they’re basically like having an intern that lies to you.”
— Matt Lea (24:22)
“As long as your income is growing 10x the debt you’re taking on ... it doesn’t hurt them.”
— Matt Lea (31:01)
“I started programming ... hacking ... INI files and breaking them ... and then I find myself at various startups where you don’t have big budgets for someone else to do security or DevOps ... so I end up jumping in.”
— Matt Lea (32:00)

Important Timestamps

01:26 – Guest introduction: Matt Lea
03:25 – How Cloud War Games began
04:13 – Collaborative simulation designs
05:38 – Exposing knowledge silos and single points of failure
07:19 – Team culture and fear of failure
09:25 – Organizational buy-in post-outage
10:53 – Differential diagnosis: attack vs. misconfiguration
13:34 – Handling leaked credentials
16:11 – Real-world insider threat story
17:28 – Internal security guardrails
19:18 – Bot and agentic AI trends
21:01 – Security risks of careless AI integration
22:53 – Adaptive, mutating AI-powered attacks
24:22 – LLMs as unreliable interns
26:26 – When to build, buy, or fine-tune AI models
28:14 – Multi-cloud considerations for startups
31:56 – Matt Lea’s journey into cybersecurity

Resources & Further Information

Matt Lea’s YouTube & Consulting: schematical.com, YouTube.com/schematical
Comics & Infosec Humor: schematical.com/comics

The episode provides a lively, practical, and candid window into the realities of cloud security, incident response, and operational readiness—blending frontline war stories and actionable advice for any cloud engineer, DevOps professional, or security leader.


Transcript

[00:01]
A
Welcome to the To The Point Cybersecurity Podcast. Each week, join Jonathan Knepher and Rachael Lyon to explore the latest in global cybersecurity news, trending topics and cyber industry initiatives impacting businesses, governments and our way of life. Now, let's get to the point. Hello, everyone. Welcome to this week's episode of the To The Point Podcast. I'm Rachael Lyon, here with my co-host, Jon Knepher. Jon, I've missed you. It's been a couple of weeks in your worldly travels.
[00:35]
B
Exactly. I'm glad to be back home, though, back from visiting the kids out in Granada where everywhere you go out you get tapas with all of your drinks.
[00:46]
A
Isn't that wonderful? You spend like €5, right, on a beer and then you get like this delicious buffet meal.
[00:52]
B
Yeah, exactly.
[00:53]
B
It was fantastic.
[00:54]
A
I love that. I am all about that. Well, we have another awesome guest this week, you guys. I am so excited to welcome Matt Lea. He's the creator of Cloud War Games, an online training platform designed to help cloud engineers and DevOps professionals develop their problem solving skills by fixing realistic issues in simulated cloud environments. And my favorite thing, please go to his LinkedIn profile where he talks about helping CTOs running on AWS sleep better at night. Which is something we all want, right? We all absolutely want. So welcome. Welcome to the podcast, Matt.
[01:27]
C
Thanks for having me. Yeah, I'm glad you guys took an interest.
[01:31]
B
Yeah, Matt, thanks for joining us. Let's kick it right off though, here. Outages and service delivery problems are not only costly, but they get a lot of attention. Customers get upset. What do you do and how do you help responding teams stay calm and make the right decisions when things go badly?
[01:55]
C
So it's an excellent question and I'll start off, I guess, where the inspiration for Cloud War Games came from. I was training up these more junior cloud professionals, DevOps type people. And when you get a client, my biggest client, they're a big e-commerce company. If they go down, they lose about $100,000 an hour. So you can imagine how many guys in ties or C-suite type people are breathing down your neck right there. And so I saw these junior guys, I try and give them a chance to handle some of the outage, but after a certain point, you have to step in and move. And they were having trouble, just getting stage fright of sorts. You know, I don't know what switches to push or anything like that. So I thought, wouldn't it be great if there was a way we could simulate this disaster, you know, in our staging environment or testing environment, or we could rerun the same disaster we just hit. So over the years, I'm also a journaler. I write down tons of things. And so I actually have a stack of all these 3 a.m. problems. The big headaches. Hard-coded read replicas and stuff like that. Or they hard-coded the write replica and it couldn't switch over. Oh, just the things that spent hours of your time and you hate it. But, you know, if you get lemons, you might as well make some lemonade. So then I started designing these simulations, and the first iteration of it, I just used a big whiteboard and actually just drew out a network diagram and kind of Dungeons and Dragons, the whole thing. And then after a while I'm like, why not use real infrastructure? I can design something that could run for a couple pennies and simulate the whole disaster. And so that kind of expanded. And then more people wanted to be the inframaster, we say, you know, like a dungeon master. And so it kind of kept expanding from there, and it became a fun little thing to do. And eventually I was talking to a couple of colleagues and they're like, why don't you make this into a business? And there we go. Cloud War Games was born.
[03:51]
B
That's awesome. And you know, something that I noticed in some of these roles too, was like, allowing, you know, those. The first. The first people like, working on things, how to be more enabled and empowered to take action. Like, how. How does this work together in enabling and in the fast response?
[04:14]
C
Well, that's a really interesting part there. So you might think it pays to be competitive. I got, you know, you want to have that one super engineer, and he's like, I can fix anything, but no one else knows what he's doing. And so when we design these scenarios, most of the time I try and make [it] more collaborative. Okay. You know, I want to see someone say, okay, you go check the DNS records, make sure that's all going. You go check database metrics, and they should be shouting out and communicating. That way you've got multiple people attacking the problem from multiple different angles. You start with one guy from the back, another guy from the front end, another person checking somewhere in the middle where the application layer is, and you get more efficiency there. And it's not really intuitive off the bat, but if you can coordinate and rehearse this type of stuff, it becomes much more intuitive.
[05:04]
A
I like, I love this idea, right? Because like, you know, incident response planning. I mean, if you're lucky, you're doing kind of a tabletop exercise once a year, but something like this, that's, you know, kind of much more immersive, I believe. And, and also scenario planning. Right, because scenarios change, particularly when we look at AI and all of these other things. It seems like this would be a great way for folks to start building like muscle memory versus oh snap, where did I put that plan? And are these people even still here? I mean, I'd love to hear a little bit more on your perspective, kind of this eye opening moment perhaps that your clients are seeing.
[05:38]
C
Yeah, it really opens up their eyes to data silos, I'd say, which you don't really know when you're just doing it on paper. But sometimes it pays to take whoever your lead engineer is and you're like, oh, this guy, if he gets hit by a bus or if he takes a week off, it's not gonna be a big deal. Have him sit on his hands and then have the people reporting to him then try and solve it. First without him on the call or them on the call. And second, assuming the juniors don't figure it out, then you bring in the person but still have them sit on their hands. They can talk through it. Everybody else is making notes. And so you really find out where the gaps in your knowledge base are. So many customers or clients have single points of failure around a key person and they don't even realize how bad it is. And so that's one of the biggest revelations I see. And if you have that person, I strongly suggest you take their keyboard away and break something and then see how long it takes to come back in dev or staging or something of that nature.
[06:43]
B
Yeah. We had an executive here many years ago and his view was if you ever had a person who was a single point of failure, you had to move them, move them into a different team or something to remove that. And that was kind of the extreme example of that. But kind of bring us to the next step on the team dynamics. Are there cultural factors or team dynamics that you see that maybe they need changing or they need other training and practice and, and how does that lead to hopefully making something like this sustainable?
[07:19]
C
Yeah, culturally, a lot of times I see a fear of failure, which is interesting. I deal more with early stage startups, but a lot of this gets trained in academia. Like, okay, you submit the test and that's it. If you got the answers right, you're right. You got the answers wrong, you're wrong, you know, but we don't live in that environment. We live in an environment where you can iterate extremely fast. And the big thing, you know, it's good to be able to know which switches to flip quickly, you know, and you have to know which ones are irreversible. So you know, deleting a production database, that's irreversible, don't, you know, don't do that. But you know, knocking over, you know, a handful of ECS tasks, Docker, you know, tasks running that'll automatically reboot up the same image, you know, assuming we haven't deleted the image or something, that's a finite thing. Even if you delete the image, could you run the build pipeline, get it back? You know, so you have to know which switches to flip and be willing to flip them fast, you know, so that's a culture thing, you see. And that's one of the biggest things I see people just freeze up on: can I scale this down to zero, run another deploy? You know, possibly, go for it. So that's one of the things I guess I'm good at, and I try and teach, is, yeah, know which switches to flip. And that's something you could document or drill, you know, under no circumstances should you, you know, scale production to zero, you know, something of that nature.
[08:45]
A
But yeah, so I imagine, I mean, something like this, obviously, I think to learn the most and be the most effective, it's one thing to say, okay, hey, next Friday we're gonna do this thing versus we just put it out in the wild, make it happen. And you gotta respond. Obviously that requires a certain level of buy in at the executive level or things like that. But I mean, how are you seeing, particularly I guess with startups, are they open to, you know, we got to be disruptive. If we're going to learn anything and have lasting learning, we got to be disruptive and we got to do it in the moment versus we're not going to know when something's happening. I mean, how are you seeing that being taken by the companies you're working with?
[09:26]
C
I'll say this, most of the time, if I just approach a company that hasn't seen any cybersecurity issues, hasn't had an attack or a massive outage, it's not a high priority. But the day after that outage, it becomes a priority. And that's usually when I get the phone call. Like, hey, that AWS outage just a few weeks ago. That's when I got the phone calls. Before that, you know, it was a little quiet. I deal with early stage startups. They don't have a lot of cyber attacks on them and they've got just the generic ones, but not the very focused ones. But you get that first focused one, or you get the first time a junior pushes credentials that can send email out to a public repository and you send out 15 million emails in about eight minutes. True story. Then you start thinking about these things. So I wish I could approach them sooner, but at least once they see that, then they're generally like, okay, we don't want that to happen again. How can we drill for that? How can we set policies that say.
[10:32]
B
So you brought up cyber attack versus other issues. What types of things do you train folks to look at in order to quickly differentiate? Right, like, are we down? Did something break? Is this a misconfiguration? Or are we under attack and something like outside of our control is triggering some issue?
[10:54]
C
Yeah, WAF is a big one for DDoS, AWS Web Application Firewall, those logs in there, access logs, of course, metrics. You're seeing fluctuations in traffic. If you're not seeing fluctuations in request counts or external traffic, then chances are it's not a DDoS. But that doesn't mean it couldn't be an SQL injection attack or some other type of attack that way. So starting with, are we getting a massive increase in traffic? Basically I create dashboards where, starting at the external layer, RDS, or sorry, not RDS, Route 53. We got all our metrics there. We're looking for spikes anywhere there. Hits API Gateway, looking for spikes in there. Hits the ALBs, looking for spikes. Hits the ECS tasks running, looking for CPU usage spikes. And so we can see kind of just the layers as you go through. And I just look for the discrepancies in there. And that gives me a pretty good idea where to start, you know. Okay, the discrepancy starts for some reason at RDS, oh, we've got some batch job that's pounding it right now. We're in trouble. So I try and get people to stack it visually. I'm huge on diagramming too, and drawing. I don't know if you see my YouTube videos with my pixel art animated diagramming network software, but I like being able to make it tangible. It's such an intangible thing, the cloud, but every little part of it talks to everything else. So if you could see the map and then equate that to where the discrepancy from the baseline of the metrics or the logs is, then you can start tracking it. So it's a long way of saying you start on the outside, work your way back.
[12:28]
B
I guess it sounds like too, the key is being prepared ahead of time with the metrics and the log collections.
[12:36]
C
Yeah, absolutely. You don't want to do that at game time. I have had to build out metrics at game time, because sometimes they don't call me until there is a problem and then you're just, okay, let's get out, we're off to the races. But if at all possible, you absolutely want to have those dashboards and your CloudWatch Insights queries set to go, your AWS WAF logs and your WAF rules all in place. But, you know, especially dealing with early stage, we don't always get handed a perfect hand of cards. So play with what you got.
[13:12]
A
Never a dull moment, right? That's why we love staying in the cyber industry. You just never know what the day may bring. But I'd love to come back on, you know, unfortunate, you know, kind of credential leaking, you know, in that kind of scenario, I mean, what kind of steps would a company take to try to contain something of that magnitude? I mean, you can't really pull it back. So how would you manage through that?
[13:34]
C
See, with that case, we disabled the credentials, not deleted them. Because what could happen is that could kill off some other major service for third parties. If it was inside AWS's bubble, you'd use roles, not IAM creds. And so right there we disabled the creds. And so at that point, you gotta do some math. Did that break some big third party? So say it's our inventory service, and for every order, for some reason they're pulling our orders using those credentials because it's a third party, and now all of a sudden we're losing $100,000 an hour because we can't ship orders? Well, then we got to turn that back on and then disable the email role in that case, or the email IAM permission. So you've got to be ready to do some quick math. If you just delete it right away, then you got to go to your, you know, in this case, an inventory producer. And it wasn't, but let's just say it was the people that handle all the shipping for us. You have to get them to do those creds, and for how long is that down and broken? So it's always a delicate balance of math. And I think that's something a lot of engineers don't think of right away. Just very ones and zeros. I say this all the time. Engineers think in ones and zeros, but the C-suite, and the language of business, is dollars and cents. So you always got to be doing that math if you want to sell. As I tell people, this is kind of career advice for people in cloud. If you can think in dollars and cents as well as ones and zeros, you'll go far.
[14:52]
B
Can you talk some more about some of the specific controls that you would suggest for, for keeping guardrails around things that could go. Go badly.
[15:03]
C
Principle of least privilege. Okay, your junior guy that shows up day one, don't let him have admin. There's a lot of things, you know, there's absolutely.
[15:15]
B
Don't give root to everybody.
[15:16]
C
Yeah, yeah. I mean even with credentials, they shouldn't have access. Use Secrets Manager or something like that to make sure that they only have access to the sandbox credentials for all these services, not, you know, the important stuff. Never commit those big important banking credentials, you know, into the code base. That should be in Secrets Manager or AWS Parameter Store. And if it's not clear, obviously I'm a bit of an AWS fanboy here. I'm sure there's equivalent password vaults everywhere, but definitely restricting the juniors to what they need to get the job done there and not giving them access to, you know, everything, day one. A lot of times it's not their fault. It was even just a couple weeks ago I had a junior push up a code base to their own personal GitHub publicly. And it happens more often than you'd think. I had one time, I can't say the specifics on the client, but do you have time for a story I can tell?
[16:12]
A
Always. Always.
[16:13]
C
Okay. So we were working with this hardware company, a smart home device company. And they were like, we got to be super secure with the code base for what lives on the device. So much so that I had to fly one of my right-hand guys there because they wouldn't put it across the Internet. I had to fly him across the country with a hard drive to try and get the code base. They wouldn't give it to him still. He ended up coming back. A month later, we found it publicly available on the Internet with keys intact. And it was just like, oh my gosh. We went through all this hassle and you, there it is, publicly, on someone's GitHub with the keys.
[16:54]
B
Well, it sounds like their paranoia was well founded, just not well executed.
[16:59]
C
Yeah, I guess. What's the hard outside shell but the candy inside? Yeah. Oh my gosh. So it's not even. You're always being attacked, but sometimes it's the intern that makes a mistake. You know, they got to learn from the mistakes, but don't. Don't let them make the big ones.
[17:17]
B
No, totally. Okay, so you brought up the whole hardened outside, soft inside. What types of things should folks be doing to help protect the inside?
[17:29]
C
Sure. One of the most basic things I see: you talked about IAM roles, you know, granular IAM roles, very important. You should never have an API key or, you know, AWS access key you're using for API access to have administrative rights. That's insane. Don't do that. The security groups, of course, I'm pretty nitpicky on security groups. You can get away most of the time with just being picky with your ingress rules. One of the biggest things I see there is cross-environment contamination. If you don't have good security groups and you keep them in the same AWS account, all of a sudden some staging jobs write into production and you've got duplicate data and you're like, why? So definitely lock that down. You don't want to have the public ever able to access your databases there. I usually do multiple layers of security. Of course you've got your passwords, you know, but people can brute force that. Secondary, you got your security groups, which is basically firewall rules. And tertiary would be making it so the subnets are actually not able to be hit, like Internet traffic can't hit the subnet. It would have to go through a bastion in the public subnet and that would have to poke a hole to the private. So even if somehow you screwed up the password and you screwed up the firewall rules, you're still in a private subnet that's virtually untouchable by the Internet. So that's some of the basic ones. And most of my basic scenarios actually revolve around those kind of pieces of the puzzle.
[18:53]
A
Since you're on the front line for attacks I'd be interested in. And obviously we're not going to name names or be super specific. But I'd be curious of any trends that you've been seeing kind of bubbling up in the last year, kind of anomalies that maybe are becoming more common as we look to like 20, you know, how getting ahead, I don't know we're ever going to get there, but how do we at least start thinking about these things?
[19:18]
C
Right now, the big trend that we're wrestling with is bot traffic, agentic traffic. It used to be fairly easy to figure out if it was a human or bot to some extent, you know, and it's getting easier to mask yourself, as a bot. But even if it's a bot, do we want to block them? Because it might be an agentic shopping bot.
[19:40]
A
Exactly.
[19:41]
C
My client's got e-commerce stuff. They want to move product. It really doesn't matter if it's a bot putting the credit card number in or a human. But you know, as long as everybody's happy at the end, you know, the client gets their product. So we're at this very interesting spot where we're spotting what we can tell is bot traffic. But, you know, in order to try and make sure that they still buy a product, we're letting them through. And so that water is murky right now, is what I'm saying. And these bots, a lot of times they're not the best at pacing themselves. So, you know, we're putting in more thresholds or more tools to tell them to, you know, relax. I can't remember the exact status code, but, we're not returning the status code, but we've proposed this, there's a fake status code 420, which is like chill out or something like that. So, you know, we have to make sure we're feeding these bots very efficiently the information they need, but at the same time, you know, make sure they're not bombarding us.
[20:46]
B
What types of things, though, are you seeing? Kind of from the agentic AI stuff kind of separate from this. Right. Like, are there, are you seeing security implications as well as volumetric things?
[21:02]
C
It depends. I, I, yes, I do see some security implications. I see people rushing to put AI in their product. And I'm doing air quotes for those of you on the audio version. And 90% of what that means or 90% of the time what they end up doing is slapping a chatbot in there. And if they're lucky, they figure out how to give it tool call Access, you know, but what I see them doing is giving it, this bot, a bit more decision power than it probably should. Oh, the bot. The bot that can issue a refund is a dangerous bot. Okay. I don't. I don't think I'd ever put a refund in there that would, you know, would. Without a human in the loop, I'd have something on the screen like, this bot is not authorized to issue you refunds, even if it says so, you know, and so keeping the human in the loop on any thing that's remotely important, refunds, changing shipping addresses, things like that, perhaps, you know, so that's something I'm seeing that scares me a bit. And I've got to talk people off the ledge saying it's not magic. They do hallucinate, you know, I bet you I could jailbreak the thing if I. If you gave me an hour. So, you know, putting. Making sure every tool call has some type of logging. Auditing restrictions are important tool calls, decisions. The bot shouldn't make decisions, it should make recommendations which are then reviewed by somebody on your internal customer service team. That's my thoughts on it. So that's something where I see it injecting us, injecting our own, shooting ourselves in the foot there with. If you meant like attacking from the outside, that's interesting as well, because we have seen attacks that mutate. The bot is smart enough to change the input and change up what it's, you know, it's not just running through a loop and just blasting you like the old school ddos. It's changing the inputs up enough where the signature is rather difficult to discern. And if, you know, it's becoming more and more difficult to track. 
We do find patterns occasionally. I found a very interesting one just yesterday. Actually, I should say my colleague found one; I don't want to take credit from him. But it's possible that was just a scheduled AI assistant too. So do we block it?
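[Editor's sketch] The guardrails Matt describes (log every tool call, and turn sensitive actions into recommendations that a human reviews) could look roughly like the following. This is a minimal illustration, not anything from Matt's own tooling: the `ToolGate` class, the tool names, and the choice of which actions count as sensitive are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, field

# Hypothetical tool names; which actions count as "sensitive" is an assumption.
SENSITIVE_TOOLS = {"issue_refund", "change_shipping_address"}


@dataclass
class ToolGate:
    audit_log: list = field(default_factory=list)     # every tool call is recorded here
    review_queue: list = field(default_factory=list)  # recommendations awaiting a human

    def request(self, tool, args):
        """Record the bot's tool call; sensitive calls become recommendations."""
        needs_human = tool in SENSITIVE_TOOLS
        entry = {
            "ts": time.time(),
            "tool": tool,
            "args": args,
            "status": "pending_review" if needs_human else "executed",
        }
        self.audit_log.append(entry)
        if needs_human:
            self.review_queue.append(entry)
        return entry["status"]

    def approve(self, index, reviewer):
        """A human reviewer signs off on a queued recommendation."""
        entry = self.review_queue.pop(index)
        entry["status"] = "approved"
        entry["reviewer"] = reviewer
        return entry


gate = ToolGate()
print(gate.request("lookup_order", {"order_id": "A1"}))
print(gate.request("issue_refund", {"order_id": "A1", "amount": 25}))
print(json.dumps(gate.approve(0, "agent_jane")))
```

The point of the design is that the bot never holds the authority to complete a refund on its own: the most it can do is add an entry to the review queue, and the audit log keeps a record of every call either way.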
[23:16]
A
You know, just tangentially, because John knows I like to do these things. But I was reading a Wired article. We talk about AI and agents and workforce stuff, keeping a human in the loop. This Wired author wanted to stand up a new company, and he had an AI workforce and he was the only human in the loop. And as he went through whatever the product was, one of the agents reached out to him and gave him an update. But it was all lies, absolutely lied about all the progress that they were doing. Oh, yes, we completed that cycle. We did this, we sent email. And the fellow's like, you're literally lying to me. And the bot's like, yeah, well, my bad. Won't do that again. But that's truly frightening, you know, when you talk about them kind of evolving, learning and making decisions without a human in the loop. That just kind of, you know, stopped me in my tracks a little bit, you know, as we think about these things. Fascinating article and very humorously written, but I'd be interested in your thoughts there.
[24:18]
C
Yeah, well, so earlier this year, I tore my bicep and so I had to get bicep surgery. So I said I'm going to embrace vibe coding for a little bit. And it's funny, with those LLMs, after working with them for quite a while, I came to the conclusion that they're basically like having an intern that lies to you. So, you know, as much as I do use it for pixel pushing occasionally, I'm a horrible graphic designer. But at the same time, no, you absolutely need to double check everything with those. You know, I can see different models. You could train classifiers or utilize tools like vector DBs and encoding to kind of work as a classifier, and those can be extremely effective. But just the general knowledge, just to assume you can type in some text and it'll just be able to do a classification task really well, I think that, right now, is a bit optimistic. But I don't want to downplay every model either. You know, if you have the right tool for the job. I've seen incredible stuff done with classification models. I mean, honestly, GANs for images and stuff like that, that's pretty impressive too. But of course they succeed if there's no right answer. You know, say, give me a cool picture that's completely made up. It'll make up a story all day long. That's not, you know, where we need accuracy.
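[Editor's sketch] The idea of using stored vectors as a classifier can be illustrated with a tiny nearest-neighbor lookup. Everything here is an assumption made for illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory `INDEX` list stands in for a vector database, and the labeled examples are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical labeled examples; a vector DB would store these embeddings.
EXAMPLES = [
    ("i want my money back", "refund"),
    ("please refund my order", "refund"),
    ("where is my package", "shipping"),
    ("change my delivery address", "shipping"),
]
INDEX = [(embed(text), label) for text, label in EXAMPLES]


def classify(text):
    """Return the label of the most similar stored example (1-nearest-neighbor)."""
    query = embed(text)
    _, label = max(INDEX, key=lambda item: cosine(query, item[0]))
    return label


print(classify("can you refund my last order"))
print(classify("my package has the wrong address"))
```

Unlike asking a general-purpose chat model to guess a category, this approach can only return a label that already exists in the stored examples, which makes its behavior easier to check.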
[25:42]
A
So like the, the images where the person has like seven fingers on one hand, that, that kind of.
[25:48]
C
Yeah, they've gotten so much better at hands. That was. I have to explain to my mom, you know, mom, this is AI. Look at the hands. This is something that's generated and now they're getting so good you can barely tell.
[25:59]
A
Oh, it's wonderful. Finally.
[26:03]
B
Yeah, they're definitely scary how good they are now. It's brutal. Okay, so you mentioned some of the stuff on AI. When do you think it makes sense for customers to be doing their own thing, training their own models versus using some of the existing models that are out there? There seem to be so many good options today.
[26:26]
C
Good question. Again, I deal with early stage companies and I also have friends that have PhDs that are experts in this stuff. And basically, if you don't have a budget for someone with a very strong background in training models, probably somebody with a PhD or something of that nature, you probably don't want to train one from scratch. You know, you're a small startup. Unless you're doing something that's incredibly intricate and you've got all the training data in the world for it and no one else does, and even then I'd probably fine-tune. Fine-tuning is kind of that middle ground. You can go off the shelf, and if you're a one-person shop and you're just trying to get some code done, then use a managed service, you know, use a serverless Bedrock Converse if you wanted an LLM, which is the equivalent of ChatGPT's API anyway. But, you know, if you want to take it a step further and you want to fine-tune it for your specific usage, then fine-tuning a model, I'm okay with it. Still at a pretty early stage, you wouldn't need a PhD. I don't have a PhD and I've fine-tuned a lot of different models to fit my purposes. But again, training from scratch, that is a pretty tall order. You better have a lot of money and time on your hands because you're going to spend a lot on compute as well.
[27:51]
A
I'm kind of interested. When we look at the cloud landscape, there's a simple path and there are complex paths, multiple clouds. How do you find startups are navigating that path forward as they grow and accelerate their business? I mean, are you helping to advise them on how to do this in a very thoughtful, strategic way?
[28:14]
C
Absolutely, yeah. I mean, multi cloud comes up, especially after the recent outage with Amazon, you know, but it's not cheap. And the smaller you are, the less I'm worried about vendor lock-in, because if you're pre-validation, if you haven't got anybody to give you dollars, boot it up as cheap as you can and just get by until you've got people repeatedly giving you money. But once you're making a million dollars a day, you know, and if the site goes down, you're losing that in opportunity cost, then it makes sense to explore multi region and multi cloud. So what I usually do is we do a cost estimate and say, hey, you know, if you go down, it's going to cost you $10,000. If we go multi region, it's going to cost you $10,000 a month. You know, then it's like, how often do you go down, or does AWS go down in that case? Well, almost never. So is it really worth it to do that? Now, let's just say you go down, you lose a million dollars or $10 million. Now, that investment of an extra $10,000 a month is not the end of the world. And, you know, you've got to also evaluate if you've got technologies that need to run differently on GCP than they do on AWS or Azure. So that's a little extra that you've got to think about: the engineering hours. One of the most overlooked things I always have to deal with is the engineers that forget to count that their hours cost the company money. Oh, it's exciting. I really want to learn how to run this on GCP. It's going to only cost me about three months of my life. In payroll, it's like, we're paying you for that, right?
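[Editor's sketch] The cost estimate Matt walks through can be written as a small comparison of expected yearly outage loss against the yearly cost of redundancy. The dollar figures come from the conversation; the function name, the assumed outage frequency of one per year, and the $45,000 payroll figure are hypothetical values chosen only to make the arithmetic concrete.

```python
def multi_region_worth_it(outage_cost, outages_per_year, extra_monthly_cost,
                          engineering_cost=0.0):
    """Compare expected yearly outage loss with the yearly cost of redundancy.

    Returns (worth_it, expected_loss, redundancy_cost).
    """
    expected_loss = outage_cost * outages_per_year
    redundancy_cost = extra_monthly_cost * 12 + engineering_cost
    return expected_loss > redundancy_cost, expected_loss, redundancy_cost


# A $10,000 outage versus $10,000 a month for multi region:
# $10,000 expected loss against $120,000 a year, so not worth it.
print(multi_region_worth_it(10_000, 1, 10_000))

# A $1,000,000 outage against the same $120,000 a year: now it pays off.
print(multi_region_worth_it(1_000_000, 1, 10_000))

# Engineering hours count too: add a hypothetical $45,000 of payroll
# for three months of an engineer's time.
print(multi_region_worth_it(1_000_000, 1, 10_000, engineering_cost=45_000))
```

Lowering `outages_per_year` below 1 models the "almost never" case, which pushes the break-even point toward much larger per-outage losses.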
[29:45]
B
That's real money.
[29:46]
C
Yeah, you got to factor that in.
[29:49]
B
But then how do you account, though, for the technical debt side of it, right? Like the idea of running multi region or multi cloud, like there's early decisions you make in architecture, and undoing those things is sometimes very difficult.
[30:06]
C
Completely agree that it's moving faster. Docker is a technology from above. Thank goodness for Docker, because that allows us to swap out much quicker. But technical debt is an interesting thing. I work with so many startups, and part of me, the purist engineer, says do it right the first time, even if we've got to take a little bit longer. But I've seen, time and time again, some brilliant entrepreneurs, novice programmers, leverage technical debt like you're taking on debt for real estate, and they leverage it and make money off it. It's like, oh, yeah, we took on a little bit, and it does grow and sometimes it becomes a pain. They still haven't paid off some of the debt that they've had for six years or so, but as long as your income is growing 10x the debt you're taking on, it doesn't hurt them. They're growing so fast, it doesn't matter. So I look at it like a real estate investment, you know, I'm taking out a loan, and if the interest on the loan is growing faster than the value of the real estate, then we've got a big problem. But if it's the other way around, they can keep throwing money at it for a while, and they do. You know, it's not optimal, but who am I to judge?
[31:26]
A
Definitely. So I would love to. John knows where I like to go here. As we kind of look at our time, I'm always interested in the path to cyber. Everyone has such a unique path of how they got to where they are today. And I'd love to kind of, if you wouldn't mind sharing with our listeners kind of your journey, how you found yourself today. Is this something you've always wanted to do or just kind of happenstance? Right. Life just kind of brought you on this twisted, turning path to where you are.
[31:56]
C
So it's funny, I started programming in, I think, the sixth grade, and the way I learned to program was by taking apart video games, basically hacking into them, and by hacking I mean text files, INI files, and breaking them and all that stuff. Then you work your way up to, you know, web technologies, and of course you get Hacker News and all that different stuff and you're like, oh, somebody else broke it this way, or, ah, if I use the counter I can just extract the autosuggest, your password, and now I've got your password in plain text. So little script kiddie things like that. You work your way up, and then I find myself at various startups where you don't have big budgets for someone else to do security or DevOps or all that stuff. So I end up jumping into those roles, and of course the bigger the startup, the more surface area for attack. So we've gotten it, and I've probably made the mistake before, but you start to learn, and that's where I kind of got into it. And if you're kind of advising at the level I am, up at the C suite, I spent a good chunk of the morning doing a FinOps write-up, but at the same time I spent a good chunk of yesterday doing the cybersecurity, so you've got to have good cross-training on it. Am I the best cybersecurity ninja in the world? No, I don't think that. But just out of necessity it's become something I've had to deal with. I've had the pleasure of working with some of the best cybersecurity people out there, I'm sure. So I don't want to take anything away from them. I'm kind of, you know, a liaison to the C suite a lot of times. But yeah, that's my journey, I guess. Started by taking apart video games.
[33:32]
A
I love it. I love it. That's a great way to start too, right? I mean, as you figure things out and then how that translates, surprisingly, at higher and higher levels where those kind of tactics still work. So I know we're at time, but I did want to revisit your YouTube channel because I would love to give a shout-out to folks. What's the name of your channel so folks can go there and visit? We'll make sure to include it in the show notes as well.
[34:03]
C
Sure. The name of my consulting company, and the name you'll find me under all across the Internet, should be Schematical. It's the word schematic with "al" at the end. So YouTube.com/schematical, schematical.com. If you want to see my comic strips, they're at schematical.com/comics. So I do comics on all this stuff that you might find amusing. I have a whole series on the lone wolf programmer, which is the knowledge silo you don't want.
[34:29]
A
I love it. Okay, we'll be sure to include that in the show notes, everyone. Well, Matt, thank you. This has been a fun conversation. I learned so much. I was saying we've never had a cloud demolition expert on before, so thank you. Thank you for sharing your insights with our listeners today.
[34:48]
C
Pleasure. And if you guys ever want to compete in a cloud war game, you know where to find me.
[34:51]
B
John, sounds like fun.
[34:54]
A
That's right. I think you got a signer here. Well, so to all of our listeners, as always, thank you for joining us this week for another amazing guest. And as always, don't forget to drumroll please.
[35:06]
B
Jonathan, smash that subscribe button.
[35:09]
A
That's right. And you get a fresh episode every single Tuesday right to your ear inbox. So everyone, until next time, stay secure. Thanks for joining us on the To The Point Cybersecurity podcast, brought to you by Forcepoint. For more information and show notes from today's episode, please visit forcepoint.com/podcast and don't forget to subscribe and leave a review on Apple Podcasts or your favorite listening platform.