
Sean Falconer
Serverless computing is a cloud-native model where developers build and run applications without managing server infrastructure. It has largely become the standard approach to achieve scalability, often with reduced operational overhead. However, in banking and financial services, adopting a serverless model can present unique challenges. Brian McNamara is a distinguished engineer at Capital One, where he works in serverless integration and development. Brian joins the show with Sean Falconer to talk about why Capital One shifted to a serverless approach, how to think about cloud costs, establishing governance controls, tools to stay well managed, and much more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
Brian McNamara
Brian, welcome to the show.
Hey Sean, thanks so much for having me.
Yeah. So to start our conversation, can you explain a little bit about your role at Capital One?
Sure. So I'm a distinguished engineer at Capital One. Essentially that's a senior IC role. And my role really revolves around considering how we do serverless at scale, both at a tactical level and a strategic level. So at a tactical level, I have the opportunity to engage with different teams who are looking to adopt serverless compute, help them run through any outstanding questions that they might have around the technologies, and help them decide which compute may be better for their given workloads. But then we also have the opportunity on my team to actually work with others across the enterprise. So in terms of where I sit at Capital One, I'm in the retail bank, but I have the opportunity to really work with partners in other lines of business like our card business, enterprise, cyber, ML. So it's really nice to operate both at a very high level and at a very low level too.
When you reach sort of the level of distinguished engineer within an organization, is that kind of your primary thing that you're doing day to day? Instead of necessarily hands on keyboard coding, you're a little bit more like a thought leader within the company and working with some of these other teams to help them invest in certain technologies where you have deep expertise in.
I'll say yes. And so one of the nice things about the distinguished engineer role at Capital One specifically is you do have the ability to influence really how a large engineering organization works or considers problems. So in that sense, we do get to work on the big boulders. In my role though, it's actually a really good blend in that I do get to get my hands on the keyboard. So while I may not support the feature teams and actually write production code, I am working on ways, whether it's in sample applications or templates, to help streamline the overall adoption of serverless at Capital One.
Okay. And what would you say is kind of like the primary difference between operating at this level as an engineer versus say being at maybe like a senior or staff engineer level?
Yeah, so I think, and I mean, obviously this is my experience, the ability to influence strategy is one of the main differentiators. So while I do get to work with those feature teams who are looking to adopt serverless compute, it's working with our partners across the enterprise, not just within a single line of business. And that really, I think, allows us to take a more holistic view in how our engineers approach solving business problems with serverless technology.
I see. And why did Capital One decide to make this transition to serverless technology?
Great question. So if you look back at Capital One's history, we really started our cloud journey in 2014. And if we look at how we've adopted cloud technologies over the years, I think it followed more or less, I mean, I hate to say a traditional pattern, but you could look at initial efforts revolving around literally lift and shift, so moving existing stacks from on prem to the cloud. But really, as the years progressed, our leadership was looking for ways to improve the overall developer experience and, quite honestly, reduce developer burden, letting developers focus on what they do well. So really in 2021, we made a declaration that we were going to be a serverless-first company. And what that's meant is for newer projects, newer applications, not that you have to go serverless first, but serverless technologies like Lambda and like AWS Fargate should be considered first.
And if we look at the reasons why, one of the canonical examples that people provide for why any organization goes serverless is lower cost. I think it makes sense to unpack that a little bit more. If we look at what's driving that cost, there are really three components. There's the engineering cost, there's the cloud infrastructure cost, and then there's the maintenance cost. If we look at the cloud compute cost, you may actually find that serverless, when compared with an equivalent or an analog for instance-based compute, may actually be more expensive. So people often hold up cloud costs as a reason to not consider serverless. I think our leadership has looked beyond that and said, well, really, if we look at what's really driving overall cost, or the total cost of ownership, it's not necessarily only cloud cost. We have to consider maintenance cost and the ability to innovate.
And I think really that's what serverless provides. If we look at the dimensions of what constitutes serverless compute, really it's all about minimizing management cost. So yes, there are servers in serverless, but you don't have to manage them. You don't have to patch them. You can have compute that scales to really, really high levels if needed, but can also scale down to zero if needed too. You also have compute that really runs when it's needed, so in response to business events, and you have built-in resiliency. So looking at, let's say, a service like AWS Lambda in particular, you get high availability out of the box. Whereas if you look at instance-based compute, you may need to consider, okay, how resilient do I want this application to be to different types of failures? Well, with a managed service like AWS Lambda, we as customers don't have to worry about that. We can let developers focus on the interesting problems.
Yeah, I mean, I think the point you raised there about speed of innovation and being able to potentially plug and play different parts of the stack as the public cloud introduces new technologies, it's a little bit easier to be adaptive versus if you are running this stuff yourself. Then, in the same way as with a monolithic application, you end up with this tight coupling of different services, and it makes it hard to be as adaptive because every team kind of needs to be in sync, and it's going to really slow down your pace. If you're running this yourself, you're almost tightly coupling your software to the on-prem system that you're running, versus having a little bit more of a decoupling between the service that you're creating and essentially taking advantage of something like public cloud. Does that make sense?
Yeah, exactly. Like working with wonderful cloud partners like Amazon. You know, this is a great time of year where we get to see what Amazon's been working on for the past year. You know, it's always exciting knowing that we have a partner that's innovating on our behalf.
Yeah. And then some of the problems I've seen companies run into when it comes to cloud costs being expensive come from not really going through the process of rethinking or rearchitecting the way that they are building and running their software to be best suited for the cloud. Right. Just lifting and shifting something into the cloud is not going to help you save money. You need to go through a process of rethinking how those services are going to take advantage of the fact that not everything has to run in the cloud 100% of the time. And that's how you can actually help reduce your cost, if you're optimizing for the things that the cloud's actually good at.
Yeah, I would agree with that. And with, let's say, Lambda applications in particular, even when we see teams that are interested in moving from other compute to Lambda, there's this initial thought to build essentially a monolithic Lambda, or Lambda-lith.
Yeah.
And if they don't initially go through the process of decomposing their application, breaking down that monolith, they may find that Lambda may not be the right choice for any number of reasons. But I think for teams that take the time to consider what their applications do, how they're invoked, and really break things apart, they find that Lambda can really be a suitable landing place.
So I believe you said that this project kicked off in 2021. So what is the status of things today in terms of Capital One's journey to serverless?
Yeah, so we have a large number of applications that are in fact running on serverless compute. I can't share the exact numbers, but it's a non-trivial number. And what we've seen for the teams that have made the transition is that they are able to spend more time on adding business value, on delighting customers, and much less time on patching infrastructure or managing availability. As a part of our normal process of launching applications, we ensure that applications are resilient to different types of failures. Well, with serverless compute, that honestly becomes a lot easier.
You know, we've talked about a couple of the advantages of serverless in terms of, you know, the team being able to work faster, be a little bit more adaptive to changes in technology. Is there also an advantage in terms of, you know, bringing engineers into the organization where they might have experience with a lot of these services already and it's kind of like the way that they're used to working?
Yeah, I mean, I would say for us, as a large enterprise, we have about 14,000 engineers. So when we do bring new engineers on, it's often easy. Yes, as with any new hire, there is a ramp-up period. But we find that people are happy to embrace serverless technologies without necessarily being beholden to legacy architectures, figuring out what the right server is to jump to, and all those activities that don't necessarily add value. So we find that the more that we lean into managed services, the more productive our developers can be.
And then, you know, you don't have to share specifics, but can you help maybe for the audience, like shape the amount of like data or traffic that you're dealing with and actually running through aws, like how big is this essentially?
So we are a very, very large AWS customer. Without going into specifics, I can say you can think of Capital One as a very, very large user of AWS and a strategic partner with AWS, akin to a Netflix. So yeah, we are all in on the AWS public cloud. In fact we shut our last data center back in 2020. So really, whether it's serverless compute, instance-based compute, or storage, we are very, very large users.
And what are some of the like high use services that you use within aws? Like what were some of the like original workloads that you wanted to move over?
So we are large users of serverless compute, as you might imagine. But we do also have a large footprint in more traditional compute, so whether it's instance-based like EC2 or ECS on EC2. If we look at other services that are more serverless in nature, S3, we are very, very large users of S3, which powers a lot of our AI and data analysis workloads. Then there are the connective services like Amazon Simple Notification Service and Amazon Simple Queue Service. So really we don't use all AWS services, but we do use a lot, and what we do use we're pretty heavy users of. As you might imagine, yes, we are a very, very innovative tech company, but we are a financial services company first. So we do have to be mindful in how services are approved and in how we govern usage within the developer community.
So yeah, I'd love to get into details on that. But one thing before we jump there is like with many industries like banking's gone through a lot of transformations I think over the last certainly 10 to 15 years. And I think like one area that has changed banking significantly is around the fact that now if I'm a customer of Capital One, a lot of my access to my accounts is going to be coming from like a mobile device. Whereas before, you know, maybe I actually had to go into a bank or reach an ATM machine or something like that to make something happen. And I would think that that would have a pretty massive effect on sort of the traffic that you end up seeing. Because like I can essentially just be hitting that banking app multiple times a day, doing transfers, doing refreshes, like all kinds of stuff that you never would have seen before.
Yeah. So for us, we have recognized the industry changing from being exclusively a physical experience to a virtual experience as well. And I think Capital One places a really high value on delighting customers wherever they are. So whether it's in a retail bank itself, whether you're a customer of our retail bank who uses our mobile app or our website, or similarly a credit card holder, we want to meet you where you are.
Yeah. And I guess some of the sort of elasticity of the cloud really helps with potentially spiky traffic.
Yes. Yeah. And you bring up a great point there. I think when people were initially migrating to the cloud, when you saw the public cloud become more of a thing in, we'll say, the mid-2010s, where you saw larger businesses adopting cloud technologies, I think the initial value that people held up was that you'll save money. And I don't know if that's necessarily true. I think what you're really buying is elasticity, the ability to be wrong. If you think about what it takes, the process of actually racking and stacking a physical server, there's a whole lot that goes into that and you better be right. Otherwise you're living with that decision for a non-trivial amount of time.
I think what the cloud offers you is, one, the ability to be wrong. So if you're looking at instance-based compute, you don't have to get it just right out of the gate. It's just an API call to destroy that instance. It's an API call to create a new instance. But I think the point that you bring up about adjusting to customer demand and having that elasticity built in, even if you get that instance sizing right or the Lambda function configured with just the right amount of RAM, well, you may need to scale horizontally. And I think that's really the power of the cloud. We don't have to have capacity just waiting.
And with each type of compute that is offered by cloud providers like Amazon, it's interesting to look at what the unit of scale becomes and how fast that unit of scale can be applied. With services like Lambda, the unit of scale is really concurrency. You get a large number in your account and in your region out of the box. But depending on your needs, you can work with Amazon to increase those limits as needed. And we've certainly seen use cases where we have had to go well beyond those account default limits. But beyond that, you're able to get more and more capacity and scale to really high levels pretty fast too. Whereas if you compare Lambda scaling with, let's say, instance-based auto scaling, you're talking about potentially milliseconds or hundreds of milliseconds versus minutes. So it lets you more closely associate scale with the business value. You don't have to over-provision or worry that you're under-provisioned. You get what you need as you need it.
Does working in these elastic serverless environments change the way that you have to think about monitoring and observability?
I'll say yes and no. It doesn't change the need, right? There is a need to understand what's happening in your application. That need doesn't go away just because AWS is handling, we'll say, the underlying infrastructure. It doesn't absolve you of making sure your applications are monitored appropriately. It does change how you do it, or the mechanism that you use. So in many regards, I think working in serverless environments forces you to be more disciplined in how you approach observability. There's no instance to SSH into where you can run htop and see which processes are consuming CPU, or run iostat or vmstat. You don't have an instance, right? So I think you do need to be more disciplined. The good news is AWS does provide a number of metrics out of the box. Whether it's to deal with concurrency or performance, there are a number of metrics that are there, but you also have the ability to add your own instrumentation as well.
So if you want to write custom metrics that are associated with, let's say, business value, user sign-up, user abandonment or deposits made, you have that ability; you can build that logic in yourself. And the nice thing is there's a really good, vibrant community that has, I think, shown how to do this well, whether it's using utilities provided by AWS or others. So I'll give a shout-out to a capability that AWS offers called Powertools for AWS Lambda. Initially Powertools was built to help with observability. So if you look at what the module provides, there's Powertools for Python, Node and Java. The nice thing is logging, metrics and traces are all first-class citizens in those modules. So you don't have to do a whole lot to get a lot of visibility, which is really nice.
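To make the custom business metrics idea concrete, here is a minimal sketch in Python. It does not use Powertools; it assumes the CloudWatch Embedded Metric Format (EMF), where a Lambda function prints a structured JSON log line and CloudWatch extracts a metric from it. The namespace ExampleBank/Deposits, the metric name DepositsMade, the service dimension and the handler itself are all hypothetical, chosen only for illustration.

```python
import json
import time


def emf_metric(namespace, name, value, unit="Count", dimensions=None, now_ms=None):
    """Build one Embedded Metric Format (EMF) log line as a JSON string."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            # EMF expects the timestamp in milliseconds since the epoch.
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": name, "Unit": unit}],
                }
            ],
        },
        # The metric value sits at the top level, keyed by the metric name.
        name: value,
    }
    # Dimension values also sit at the top level, keyed by dimension name.
    record.update(dimensions)
    return json.dumps(record)


def handler(event, context):
    # Hypothetical handler: emit one business metric per deposit processed.
    print(emf_metric("ExampleBank/Deposits", "DepositsMade", 1,
                     dimensions={"service": "deposit-api"}))
    return {"statusCode": 200}
```

The point of the sketch is that a business event such as a deposit becomes a metric simply by logging it in a known shape; a library like Powertools wraps this kind of plumbing so teams do not hand-roll it.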
But if you wanted to embrace industry standards like OpenTelemetry, you can do that too. And there is a good story for OTel with Lambda in particular. So if you want to instrument code you can; otherwise you can lean into the SDKs to auto-instrument code. Obviously there are trade-offs there. If you are using a non-compiled language with auto-instrumentation, you'll see a penalty at cold start, but you can do it. You don't have to be an OTel expert to get that visibility, which is great.
So I wanted to talk a little bit about essentially governance and security compliance.
Like fun stuff.
Banking's a very regulated, sensitive industry. You have to take care with anything that you're building. Obviously you're dealing in customer trust. How does that change the way that you have to think about building products when you work at a bank?
Yeah. So I mean at Capital One, security is job one. As you point out, we are working in a regulated industry. We need to make sure that we're doing things the right way from a security standpoint, from a governance standpoint. What that means is that as new services are introduced, we may not be early adopters. Right, we need to do evaluations to determine whether or not services have the necessary controls that we feel they need. So there's that part of it, making sure services are imbued with the necessary controls that we need internally. But beyond that, we need to make sure that our process of building and deploying code also is very rigorous and stands up to compliance. So making sure that all artifacts are versioned. As for some of the practices that we adhere to, we use tools like Open Policy Agent, or OPA, to make sure that anything that we deploy conforms with our policies, that we're not deploying services that we're not supposed to deploy, and making sure that even for the resources that we can deploy, they are compliant: that we're not using certain properties, or if we are using certain properties, that they're configured a particular way. So you can get really, really granular. And the nice thing is you can offload burden from your developers in figuring it out. Right, so we use OPA so that developers can deploy compliant applications. It's an important part of how we deploy code.
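The two kinds of checks described here (is this resource type approved, and are its properties configured a particular way) can be sketched in a few lines. OPA policies are normally written in Rego; the Python below is only a stand-in to show the shape of the idea, not how Capital One or OPA implements it. The allow-list and the required properties are hypothetical rules invented for the example.

```python
# Hypothetical allow-list of approved CloudFormation resource types.
APPROVED_TYPES = {"AWS::Lambda::Function", "AWS::SQS::Queue", "AWS::S3::Bucket"}

# Hypothetical rule set: properties that must be present for a given type.
REQUIRED_PROPERTIES = {
    "AWS::S3::Bucket": ["BucketEncryption"],
    "AWS::SQS::Queue": ["KmsMasterKeyId"],
}


def check_template(template):
    """Return a list of human-readable violations for a parsed template dict."""
    violations = []
    for name, resource in template.get("Resources", {}).items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})
        if rtype not in APPROVED_TYPES:
            violations.append(f"{name}: resource type {rtype} is not approved")
            continue
        for required in REQUIRED_PROPERTIES.get(rtype, []):
            if required not in props:
                violations.append(f"{name}: missing required property {required}")
    return violations


example = {
    "Resources": {
        "Reports": {"Type": "AWS::S3::Bucket", "Properties": {}},
        "Worker": {"Type": "AWS::EC2::Instance", "Properties": {}},
        "Api": {"Type": "AWS::Lambda::Function", "Properties": {"Runtime": "python3.12"}},
    }
}

for violation in check_template(example):
    print(violation)
```

A check like this can run on a developer's machine before a commit and again in the pipeline, which is what lets the same policy serve both as early feedback and as the final gate.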
And in terms of the same care that you have to take when potentially adopting new services within AWS, what is that process like when you look at potentially bringing in a new library within the actual source code, one that wasn't developed at Capital One? I'm sure you must have some checks in place to make sure that there's not some sort of malicious supply chain attack hidden within the library or a referenced library or something like that.
So as a part of our security process we do vet new libraries that come in, and we do have internal processes that continually check for vulnerabilities and notify teams when they're using modules that are no longer compliant because they may have a critical CVE. So yeah, we really try to secure the entire supply chain, from build to deploy to your running applications as well.
How do you establish governance controls that meet these standards without compromising some of the speed and efficiency that you get from this serverless development environment?
Yeah, it is certainly an interesting question. There is this natural tension, I think, between developers who want to iterate fast and want to deliver and want to execute on the latest, greatest thing. But I think many of our engineers also recognize the importance of the work that we do, and they're willing to accept a certain trade-off there, because we're dealing with people's money and their finances. We need to partner closely, as engineering organizations or more developer-focused organizations. We have to partner with our developer experience teams, teams that are actually managing the CI and CD processes. We have to partner with our cyber teams, we have to partner with our open source management teams. There is a lot of coordination, and I think if we consider all that's involved there, really there is this effort to try to shift as much of that as possible away from our developers.
So in many respects we try to both shift left and shift right. When we talk about centralized controls, we want to make sure that our CI/CD pipeline is the choke point, the ultimate decider to determine whether or not something can be deployed: is it compliant, is it secure? But we also want to minimize friction for our developers. So their day to day should not be spent wondering, am I doing this right? Am I doing this securely? And we do things internally, like some of the nice things we've built. We certainly lean into open source tooling, like I mentioned, Open Policy Agent. We have the ability to let our developers determine, am I doing things in a secure and compliant manner, or do I have to wait for something to go to a pipeline and see it fail? Ideally we want to shift that left, so we use some internal tooling.
But we also do lean into tooling like AWS's cfn-lint, so you can determine, for, let's say, CloudFormation-oriented deployments, are the templates you're deploying syntactically correct? Is it valid YAML, is it valid JSON, all the way down to, for the resources that you have defined, are the properties that you're specifying valid? You can also lean in and write your own rules to determine whether or not a template is compliant and add that to your cfn-lint run. So really it's a delicate balancing act, making sure that we do provide our developers with the means to be agile and iterate quickly while balancing the need to be secure, compliant and well governed.
What role does a strong notification messaging system play in this?
For messaging systems, we think it's important that developers understand when things change, making sure that they have visibility into what has changed and why it's changed. Ideally we would allow those developers to see notifications when we shift things right. When we have that central CI/CD process, if things aren't going according to plan, if, let's say, builds fail, if deploys fail, making sure our developers understand where the failures occur, and similarly, actually having that visibility tied back to shift-left efforts as well. So if a resource that you're trying to deploy is not compliant, making sure they understand why, and making sure that we have supporting documentation to help them understand how to get that non-compliant resource compliant.
Do those notifications ever get too noisy?
Not going to lie, it can be noisy for our developers. But the great thing is we have a really, really strong developer productivity group internally, and they're constantly looking for ways to improve that experience, because it's one thing to see that a build failed and you see this huge dump and you wonder, how the heck am I going to troubleshoot this? They've been spending a lot of time in narrowing down where failures occur, making sure that messages are only as verbose as they need to be, providing that supporting documentation, and also meeting developers where they are. So whether you're looking at a build log, a Slack notification or an email notification, just making sure that developers understand at that necessary point why things didn't go according to plan.
What role does the Serverless Center of Excellence play in all this?
Yeah, so for an organization the size of Capital One, I think it'd be arrogant to say I know what our developer community needs and I'm the only one who can speak authoritatively, because I am not, I am not at all. Different lines of business have different priorities and they have different needs. So really what the Center of Excellence allows us to do is group people, like people who have an interest in improving the serverless developer experience, letting them come together and share what's working, what's not, what are the pains, what can be done, and how do we appropriately leverage the knowledge and experience that we may have as a group of domain experts to improve the lives of all developers, whether or not you have that domain expertise. And also really, I think, work with other teams outside of, let's say, the Serverless Center of Excellence: work with cyber partners, work with enterprise partners, work across multiple lines of business. So really it's a good platform to receive feedback from the developer community, but also influence how other teams consider the work that we do as developers.
When you talk about the developer community, you're talking about the internal developers at Capital One, is that right?
Yeah.
Yep.
How big is that roughly?
There are about 14,000 developers at Capital One.
Okay, so a good size community.
Yeah, yeah.
Going back to some of the things that we were talking about, you know, in the beginning there, in terms of some of the, you know, reasons for moving to serverless for Capital One, like what were some of the challenges with actually putting that migration in place?
Yeah, I think there are a few. One I would say is plain old FUD: fear, uncertainty and doubt. I think people for a long time have associated Lambda with, I mean, I hate to say toy applications, but we'll say operations-oriented things that may not be business critical. Lambda doesn't scale like I need it to scale. There's no way I can run my application in Lambda initially. Five minutes isn't enough. Fifteen minutes isn't enough. I need more resources. I think developers will be surprised at the workloads that can be handled by serverless compute like Lambda.
The other important thing to note too is that we're really big on saying we're serverless first, but we're not serverless only. There are going to be certain workloads that are not suited for Lambda. If they're not well suited for Lambda, we would ask teams to look at serverless container services like Fargate. But even beyond that, if a service is not right, we have the means to support other compute and other services as well.
But I think the biggest issue was helping overcome a lot of FUD, even helping teams understand what good looks like. So what I mean by that is, earlier you mentioned how does observability change? Well, we want to make sure that teams are empowered to know how their serverless applications are running. When it comes to things like splitting apart the monolithic application, what's the right way to do it? When is the right time to do it? How does your Lambda function scale? How does your application traffic change over time? Are serverless services like Lambda able to keep up with what you need? So really overcoming FUD, but then ultimately just doing the hard analysis to say, is Lambda right for you? If it is, great. If it's not, that's okay too.
For those teams that did decide to take the plunge, I think it was initially a struggle to help them understand what the different metrics really meant. So for non-serverless compute, a really common metric, let's say for APIs in particular, is requests per second or transactions per second. And in Lambda, if you look at the unit of scale, it's concurrency, which is really a product of the number of requests that come in, so that RPS, that TPS, but also duration: how long does that Lambda function run for? And the goal is to minimize the amount of concurrency that your function's consuming. With that, it's a matter of helping teams understand what's the right way to impact performance. And in Lambda really there's only one knob to turn, and that's memory.
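The relationship between requests per second, duration and concurrency can be written as a one-line calculation. The sketch below assumes steady traffic and an average duration, which is a simplification; the function name and the traffic figures are hypothetical.

```python
import math


def required_concurrency(requests_per_second, avg_duration_ms):
    """Estimate concurrent executions: request rate multiplied by average duration in seconds."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


# Same traffic, two different average durations (hypothetical numbers).
slow = required_concurrency(500, 200)
fast = required_concurrency(500, 100)
print(slow, fast)
```

Halving the duration halves the concurrency needed for the same request rate, which is why tuning a function to run faster is also a way to stay within account concurrency limits.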
So with Lambda functions you'll find that both memory and CPU scale together linearly. So like a 256 meg Lambda function has twice the compute and RAM as a 128 meg Lambda function. So for some teams, we would see them come in and allocate 10 gigs of RAM, like, I need all the RAM in the world. When you look at what they're actually consuming, it's like a drop in the bucket. Maybe we ratchet that down. But we also saw a lot of teams under-allocate the amount of RAM. So knowing what the calculation is for Lambda functions: you have invocations. There's a number of invocations. That's a component of the price. But then there's the amount of RAM consumed and the duration that that RAM is consumed for. And we would see teams allocate like 128 megs. That's the minimum RAM that we can allocate, so I'll do that and I'll save all this money. Well, what we would see is teams would actually starve their Lambda functions, so functions would actually run longer because they didn't have the necessary resources.
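The pricing components mentioned above (invocations, plus RAM multiplied by duration) can be sketched as a small calculation. The prices, invocation counts, and durations below are assumptions chosen for illustration only; they ignore any free tier, and real prices vary by region and architecture.

```python
# Assumed prices for illustration only; check current AWS pricing before relying on them.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_lambda_cost(invocations: int, memory_mb: int, avg_duration_ms: float) -> float:
    """Request charge plus compute charge, where compute is billed in GB-seconds."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical scenario: a starved 128 MB function runs five times longer
# than the same code given 512 MB.
starved = monthly_lambda_cost(10_000_000, memory_mb=128, avg_duration_ms=2000)
tuned = monthly_lambda_cost(10_000_000, memory_mb=512, avg_duration_ms=400)

print(f"128 MB at 2000 ms: ${starved:.2f} per month")
print(f"512 MB at 400 ms:  ${tuned:.2f} per month")
```

Under these assumed numbers the larger allocation is both faster and cheaper, because the shorter duration more than offsets the higher memory setting. Whether that holds for a real function depends on how its duration actually responds to memory, which is what a tuning tool measures.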
So the great thing is there are open source tools. AWS Lambda Power Tuner is a great open source tool. If you're not using it and you're a Lambda shop, please use it. Alex Casalboni, who was a solutions architect at AWS at the time, wrote it. It is awesome, and it really helps you determine, like, what's that right number based on either cost or performance needs.
So you actually can dial that in a lot more. You're not wasting resources or money. When it comes down to teams having to make decisions about, is serverless the right thing to use versus, you know, using something like Fargate, what's sort of the framework for making that decision?
Yeah, like I would say, first consider the AWS constraints. Like, do you have a workload that needs to run consistently for more than 15 minutes? If so, you know, Lambda's not right for you. Do you have the need to consume more than 10 gigs of RAM? If so, Lambda's not right for you. Right. Like, there are some easy ones to look at. But it's interesting when you consider then, too, beyond those obvious constraints, looking at things like, well, what is your Lambda function being triggered by? And what I mean there is, like, Lambda is event-driven compute. Your code is triggered in response to an event. And AWS has over 100 event sources that can trigger Lambda functions. But let's say you want to write a Lambda-backed API and you're using, let's say, ALB, Amazon's Application Load Balancer service. Well, that's a synchronous invocation of a Lambda function. Now, Lambda can handle a 6 meg payload size at this time, but ALB can only pass in 1 meg: 1 meg in, 1 meg out. So if you're writing an API and you need to handle more than 1 meg in or 1 meg out, Lambda may not be the right choice. If you're replacing ALB with a service like Amazon's API Gateway REST service, that number jumps up to the full 6 meg. So API Gateway can handle 10 meg, but you're going to be constrained by the 6 meg Lambda limit.
So like, look at the obvious things. Like, hey, you know, what are the physical constraints? Consider what you're looking to integrate with. Beyond that, though, it really gets interesting.
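The "obvious constraints" check described above can be sketched as a simple pre-screening function. The limits are the ones quoted in the conversation (15 minutes, 10 gigs of RAM, 1 meg through ALB, 6 meg through API Gateway REST) and may change over time; the `Workload` type, the trigger names, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Limits as quoted in the conversation; verify against current AWS documentation.
MAX_DURATION_SECONDS = 15 * 60
MAX_MEMORY_MB = 10 * 1024
# Effective synchronous payload limit per trigger (hypothetical trigger names).
SYNC_PAYLOAD_LIMIT_MB = {"alb": 1, "api_gateway_rest": 6}


@dataclass
class Workload:
    max_duration_seconds: float
    memory_mb: int
    trigger: str
    max_payload_mb: float


def lambda_blockers(workload: Workload) -> list[str]:
    """Return the hard constraints that rule Lambda out; an empty list means none were found."""
    blockers = []
    if workload.max_duration_seconds > MAX_DURATION_SECONDS:
        blockers.append("runs longer than 15 minutes")
    if workload.memory_mb > MAX_MEMORY_MB:
        blockers.append("needs more than 10 GB of RAM")
    # Triggers not listed in the table are not payload-checked in this sketch.
    limit = SYNC_PAYLOAD_LIMIT_MB.get(workload.trigger)
    if limit is not None and workload.max_payload_mb > limit:
        blockers.append(f"payload exceeds {limit} MB limit for {workload.trigger}")
    return blockers


# Hypothetical API that handles 3 MB payloads: blocked behind ALB, fine behind API Gateway REST.
print(lambda_blockers(Workload(30, 512, "alb", 3)))
print(lambda_blockers(Workload(30, 512, "api_gateway_rest", 3)))
```

An empty result only means none of these hard limits apply; the softer considerations discussed next, such as cold starts, still need their own analysis.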
One common thing that comes up that would drive someone away is, you know, I have a Java application. You know, Java will never run well in Lambda. Well, I would ask you to revisit your assumptions. There are ways to mitigate things like cold start pains. Consider how often those cold starts happen. Is it something that's happening a lot or not? If it's a synchronous invocation where you have a user on the other end of that request, that cold start may really matter a lot more than an asynchronous invocation. Like, someone uploads an object to S3. Well, you may not have somebody waiting on the other end of that request. So if you have a cold start that takes a little while, so what? It may not matter as much.
The other thing too is, I mean, the AWS Lambda team has worked really, really hard over the years to improve that developer experience for, let's say, Java in particular. So you can use services or capabilities like provisioned concurrency, where you essentially have an AWS-managed capability that will keep a certain number of Lambda execution environments warm, so you won't have those cold starts for the number of provisioned concurrency units that you've set. Last year Amazon also introduced a capability called SnapStart for Java. And I'll be honest, I need to revisit the exact Java version. I want to say Java 17 and higher; I need to double check. But this year they actually introduced that for Python as well, as a pre-re:Invent announcement. So AWS is looking for ways to improve that developer experience and help developers, not force them into making a tough choice, like, is Lambda right?
re:Invent is going on this week, and there are tons of announcements. Where do you think serverless is going in the next couple of years?
Interesting places. So one of the more interesting announcements I heard this week was on Aurora DSQL. That, I think, is going to be massive in ways that we can't yet comprehend. That's one. I think durable workflows will be another area. So, like, if you consider how important it is for certain workloads to run to completion, you know, when you have ephemeral compute services like Lambda, state becomes really important. So, like, how do we do that for really important work? That's going to be another one. I think continuing to manage the software supply chain is going to be really important too, and providing visibility into that. And I think the last, I would say, is improving the operator experience. One thing that makes me shudder is when people say serverless is NoOps. It absolutely is ops. It doesn't absolve you from your operational responsibilities. Good news is, your cloud provider, whether it's AWS, Google, Azure, or anybody else, they're assuming more of the responsibility. But it doesn't mean that you don't have any responsibility. So I would say I would look for ways to improve the understanding of what's happening in applications. So the ability to observe what's happening is going to become really important, even.
Yeah. Well, I know we're coming up on time here. Brian, I want to thank you so much for coming on the show. I really enjoyed this.
Yeah, Sean, thank you so much for the invitation.
I really enjoyed the conversation.
Cheers.
Podcast Summary: "Going Serverless in Financial Services with Brian McNamara"
Introduction
In the January 7, 2025 episode of Software Engineering Daily titled "Going Serverless in Financial Services with Brian McNamara," host Sean Falconer engages in an in-depth conversation with Brian McNamara, a distinguished engineer at Capital One. The discussion explores Capital One's transition to serverless architecture, the benefits and challenges of adopting serverless in the highly regulated financial sector, governance and security considerations, and future trends in serverless computing.
Brian McNamara’s Role at Capital One
Brian McNamara introduces himself as a distinguished engineer at Capital One, emphasizing his role as a senior individual contributor (IC). His responsibilities revolve around serverless integration and development at scale, both tactically and strategically.
"I'm a distinguished engineer at Capital One. Essentially that's a senior IC role." [01:05]
He collaborates with various teams across the enterprise, including the retail bank, card business, enterprise cybersecurity, and machine learning teams. This cross-functional collaboration allows him to influence how Capital One's large engineering organization approaches business problems with serverless technology.
"I have the opportunity to engage with different teams who are looking to adopt serverless compute, help them run through any outstanding questions." [01:24]
Capital One’s Transition to Serverless Technology
Capital One began its cloud journey in 2014 with a traditional "lift and shift" approach, migrating existing on-premises systems to the cloud. By 2021, Capital One's leadership declared a "serverless first" strategy, prioritizing serverless technologies like AWS Lambda and AWS Fargate for new projects to enhance the developer experience and reduce operational burdens.
"In 2021, we made a declaration that we were going to be a serverless first company." [04:38]
This strategic shift aimed to allow developers to focus more on delivering business value and less on managing infrastructure.
Benefits of Serverless at Capital One
Brian outlines several key advantages of adopting serverless architecture:
Reduced Operational Overhead: Serverless eliminates the need to manage server infrastructure, allowing developers to concentrate on writing code and delivering features.
"Serverless is all about minimizing management cost." [06:02]
Scalability and Elasticity: Serverless automatically scales to handle varying levels of traffic, ensuring high availability without manual intervention.
"You can have compute that scales to really high levels if needed, but can also scale down to zero." [06:08]
Maintenance Cost Reduction: With managed services like AWS Lambda, tasks such as patching and ensuring high availability are handled by the cloud provider, reducing maintenance costs.
Enhanced Innovation: By offloading infrastructure management, developers can focus on innovative solutions and speed up the development cycle.
"Letting developers focus on what they do well." [04:31]
Challenges in Migration to Serverless
While the benefits are substantial, migrating to serverless presents unique challenges:
Fear, Uncertainty, and Doubt (FUD): Initial skepticism about serverless capabilities, scalability, and suitability for critical applications persists.
"Plain old fud, you know, fear, uncertainty and doubt." [28:34]
Application Decomposition: Migrating requires breaking down monolithic applications into smaller, event-driven functions, which can be complex and time-consuming.
Cost Optimization: While serverless can lower total cost of ownership (TCO), improper architecture can lead to higher cloud compute costs. Optimizing both engineering and cloud infrastructure costs is essential.
"Leadership has looked beyond [...] total cost of ownership, it's not necessarily only cloud cost." [05:35]
Observability and Monitoring: Transitioning to serverless demands a disciplined approach to observability since traditional server-based monitoring tools are not applicable.
"Working in serverless environments forces you to be more disciplined in how you approach observability." [17:11]
Governance and Security in Serverless Environments
Operating within the financial sector, Capital One places a high emphasis on security and compliance. Brian discusses the company’s rigorous processes to ensure that serverless applications meet security standards and comply with regulations. Key strategies include:
"We use OPA so that developers can deploy compliant applications." [21:09]
Serverless Center of Excellence
Capital One has established a Serverless Center of Excellence to foster collaboration among developers interested in serverless technologies. This center serves as a platform for sharing best practices, addressing common challenges, and collaborating across different lines of business and support teams, including cybersecurity and developer experience.
"The center of Excellence allows us to group people who have an interest in improving the serverless developer experience." [27:07]
Monitoring and Observability in Serverless
While serverless abstracts away server management, robust monitoring and observability remain crucial. Capital One leverages AWS-provided metrics and custom instrumentation to gain insights into application performance. Tools like AWS Lambda Power Tuner and OpenTelemetry are utilized to optimize performance and ensure visibility into serverless applications.
"Working in serverless environments forces you to be more disciplined in how you approach observability." [17:11]
Cost Management in Serverless
Brian highlights that while serverless can potentially incur higher cloud compute costs, the overall TCO often decreases due to reduced maintenance and operational expenses. He advises rearchitecting applications to fully leverage the strengths of the cloud rather than merely lifting and shifting existing infrastructure to optimize costs.
"Serverless, when compared with an equivalent or an instance based compute, may actually be more expensive. So people often hold up cloud costs as a reason to not consider serverless." [05:24]
Future of Serverless
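Looking ahead, Brian anticipates advancements in serverless technology, including Aurora DSQL, durable workflows for work that must run to completion, continued management of the software supply chain with better visibility into it, and an improved operator experience.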
He emphasizes that serverless does not eliminate operational responsibilities but rather shifts them, making observability and understanding application behavior increasingly important.
"Serverless absolutely is ops. It doesn't absolve you from your operational responsibilities." [37:05]
Conclusion
In this insightful episode, Brian McNamara shares Capital One’s comprehensive journey into serverless computing, emphasizing strategic adoption, cost management, robust governance, and security within a highly regulated industry. The conversation highlights the balance between innovation and compliance, illustrating how serverless architecture can drive efficiency and agility in financial services while maintaining stringent security standards.
Notable Quotes