
Shawn Falconer
Compute optimization in a cloud environment is a common challenge because of the need to balance performance, cost, and resource availability. The growing use of GPUs for workloads, including AI, is also increasing the complexity and importance of optimization, given the relatively high cost of GPU cloud computation. Jerzy Grzwinski is a Senior Director of Software Engineering and leads FinOps at Capital One. Brent Segner is a distinguished engineer at Capital One and is focused on performance engineering and cloud cost optimization. Jerzy and Brent join the show with Shawn Falconer to talk about methods to measure compute efficiency, horizontal versus vertical scaling, how to think about adopting new instance types, the effect of different languages on compute efficiency, and much more. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Shawn Falconer
Brent, Jerzy, welcome to the show.
Jerzy Grzwinski
Thanks for having us.
Brent Segner
Thank you.
Shawn Falconer
Yeah, awesome. Thanks for being here. I'm really excited to get into this topic. I think there's going to be a lot of practical, actionable advice for the listeners with regard to cloud optimization. Since there are two of you, just to help the audience out a little bit with whose voice is whose, can we have you introduce yourselves? Who are you? What do you do? Brent, let's start with you.
Brent Segner
Good afternoon. My name is Brent Segner. I'm a distinguished engineer in the Cloud Evolution team over at Capital One. I've been with Capital One now for just over two years, focused on all aspects of cloud cost optimization and performance engineering, but I've probably had about two decades in this field.
Shawn Falconer
Awesome. Jerzy, same question to you.
Jerzy Grzwinski
Yeah, I'm Jerzy. I'm a bit of a longer-timer at Capital One. Going on 16 years, I've done many different engineering roles, from backend to front-end development. Currently I lead FinOps for Capital One, where Capital One's been all in on the cloud for several years now. And as our footprint has grown in the cloud, we've really wanted to focus in on architecture, standards, engagements, and then tooling for how we help our developers be more efficient and effective in the cloud in order to maximize the value. So I spend a lot of time with Brent and other members of my team to build tools and go on that journey for the company.
Shawn Falconer
Awesome. What started your interest, I guess, in FinOps, and what's kept you interested enough to stay at the same company for 16 years?
Jerzy Grzwinski
Yeah. So for me, what I really like about my journey at Capital One is I've had, I don't know, 10 different jobs without leaving the four walls of Capital One, which has allowed me to explore different areas of the tech stack and different areas of interest, either from a technology perspective or a product problem statement perspective. So that's really what kept me inside these walls. And the culture of the company and the fast-moving evolution of how we use the technology have been a fun ride. And really, as for getting to FinOps, I was just telling someone the story that I still remember sitting at the dinner table with my father back in my high school days, talking about what I would go to college for. I always knew I probably would be an engineer, but I really wanted to go into business because of my interest in finance. So I kind of just chuckled and said, I know I can be an engineer and do finance later. I don't think I can go into finance and be an engineer. And many years later, sure enough, technology brought me to something that has a finance problem statement, but also a technology problem statement. So that's really what keeps me energized in the current role.
Shawn Falconer
Awesome. Brent, did you have anything to add about your journey into FinOps?
Brent Segner
Well, I was going to say, ironically, I actually took the exact opposite path that Jerzy did. Just over two decades ago, I started out in finance, quickly discovered that I had a knack or an interest for technology, and that brought me on the journey to where I am today. As for my interest in Capital One specifically, Capital One was intriguing to me, given that a financial institution of its size was actually going all in on the cloud. You don't see that happen very often. And then since starting here, what's really stood out is the fact that they're continually at the tip of the spear as far as technology, trying to push the existing boundaries.
Jerzy Grzwinski
Yeah.
Shawn Falconer
I think even as an outsider, when you look at companies, especially in the financial services space, that are early adopters of technologies, Capital One is one that stands out to me that I've seen over and over again in my career. And you mentioned the cloud migration. I think your cloud migration wrapped up around 2020. What was the original motivation to move to the cloud, and what were perhaps some of the key learnings from that?
Jerzy Grzwinski
Yeah, no, absolutely. So I'll take this one, as Brent was not quite here with us during that period. Capital One really started on a transformation journey, a tech transformation journey, not too long after I joined, around 2010, '11, '12. We really wanted to pivot from off-the-shelf tools and capabilities to becoming a technology company that builds the capabilities, is able to innovate, and really drives impact for our customers through innovative solutions. So as the company started going through that journey, it became evident that in order to start building the best tools, you need to have the best talent. In order to have the best talent, you need to have an environment that attracts the top talent. And the cloud was clearly eating up the world for where the talent wanted to go. Startups were mostly leveraging the cloud. So Capital One decided that's the direction we want to go. We want to be nimble and fast, and we want to attract the best talent. And why does that talent want to work in the cloud? Because it gives access to being able to work on technical problem statements versus the paper-pushing process that sometimes occurs when you have boundaries between the folks that can develop and the teams that can provide infrastructure and support that infrastructure. So as Capital One went through that arc of tech transformation, we went from waterfall to agile, we went from on-prem to the cloud, all in the name of building great capabilities for our customers with great talent and the nimbleness, faster time to market, flexibility, et cetera, that the cloud really offers.
Shawn Falconer
In terms of faster time to market, what kind of impact did this have on the speed of execution that the engineering team was capable of versus, you know, maybe what was there before?
Jerzy Grzwinski
Yeah, so before we started on the journey, I was working in an infrastructure team that was really facing off with the application teams to understand what they needed from the hardware perspective. I helped architect, you know, what servers, databases, network, et cetera were needed. And then I also had a team that helped execute on that. And the time to market, from someone saying "I need a server" to them actually being able to deploy code on that server, let alone that server being able to talk to anything else, was measured in weeks or months, and definitely not hours or minutes. So I went from supporting that role prior to the cloud migration to shifting over to full stack development and being the application team member asking for infrastructure. You know, the cloud obviously unlocks servers in seconds. Network is based on, again, API calls and configurations and code. So there was real speed from taking away a lot of the process of just making something happen. Now, of course, through that journey there were a lot of additions to that. And FinOps, I think, is one such space: most developers didn't think about how much they were spending on servers during the data center days. Our goal is to get developers to think about what their spend is, because now they do have access to provision and build things in a way that can be efficient or can be very clunky. And really, if you're designing things in a way that doesn't treat effectiveness and efficiency as a measure of success, the cloud can become very expensive. And it can also bring a lot of challenges that a developer may not be happy to be spending time on. So the cloud certainly unlocks so much value, but that freedom also brings a lot of responsibility back to the developer. What we hear from our developers today is, oh, I have to worry about not only feature development, but I also have to worry about the other things.
So our goal is, how do we make sure they do worry about it, but it only takes up so much time of their day? How much can we automate on their behalf? How much can we build into the enterprise tools and standards, you know, defaults and things like that? So developers can be engineering in an efficient way and think about it a little bit, because there's no way to just completely take that away, but also still have that time to market which brought us to the cloud in the first place. Right?
Shawn Falconer
Yeah. And I'm curious. I think a lot of people, when they first start thinking about moving to the cloud, think about it purely as, oh, this is going to save us money. But it doesn't necessarily save you money, especially if you're simply lifting and shifting something that maybe wasn't designed to operate in the cloud, and then suddenly you're putting it on a gigantic single server and paying a lot of money to run that thing. Or you might run into cases where you think, oh, well, this is not performing at the level that we need, so we'll add more and more memory. But maybe that won't actually solve the problem, because it turns out that the thing that you're running is single-threaded, and it's not going to take advantage of the fact that you have more memory available anyway. So when you're in this place where you're trying to help people or engineering teams within an organization optimize both from a cost perspective and from a performance perspective, how do you develop trust with those teams so that they don't see you as the office of no? Like, you know, why won't Jerzy let me scale this XL3 EC2 instance or something like that? He keeps saying that this isn't the right thing to do.
Jerzy Grzwinski
Yeah. So I'll give my perspective, and I know, Brent, you have a few good examples here. For me, what's been helpful is that I've been in the shoes of the developer building tools. Actually, we have teams in my space that build tools and are in the function of owning an application and delivering value. So I think, most importantly, we need to work backwards from how we make the developer successful, knowing they have full control. At the end of the day, my goal is, how do I bring value to the developer, either through tools, automation, et cetera, or through engagement and information, in order for that developer to be successful? And I feel like if we bring the relevant information to the user and provide the context for why we think we're right, and are also humble when we are wrong and take those learnings back to our tools, we build that trust. I think it's also very important to call out when we are wrong or when we know we have some shortcomings. At the end of the day, we're building tools that provide recommendations to 2,000-plus apps. Each of those apps is architected slightly differently, with different requirements, et cetera. So there are certainly edge cases. So we try to be humble in our recommendations to build that trust. We do try to celebrate the wins, of course, and we try to limit the stick approach of how much we penalize. But we want to be direct and open about where we think the opportunities are. We don't want to slow down just because the person might feel like we're delivering a hard message that they're overspending or inefficient. Really, we're trying to be direct about the opportunities, but then bring solutions to the table. And I think Brent has some really great examples where he has contributed some big wins. Maybe, Brent, you can chime in.
Brent Segner
Yeah, no, I was going to say, just expanding on some of the points you made. For us, a lot of it is, can we take a look through the eyes of the developer to make sure that whatever we're recommending from a cost and performance optimization perspective meets their needs? More often than not, developers care about their user experience. They want to have an application that's resilient and responsive. So we have to make sure that when we're making a recommendation, we can measure whether it had a positive impact, on both cost and performance, on the developer's experience toward the end objective. We look at different measurements, which we could term utilization, saturation, and errors, to give us the ability to triangulate with data on whether the change that we recommended had the desired impact and, if not, whether we can help them make a change that's going to improve their cost performance.
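As an illustrative sketch of the triangulation Brent describes, the snippet below compares utilization, saturation, and error readings before and after a recommended change. The field names, sample values, and thresholds are all hypothetical and are not taken from Capital One's tooling.

```python
from dataclasses import dataclass


@dataclass
class UseSnapshot:
    """Hypothetical utilization/saturation/errors reading for one application."""
    cpu_utilization: float    # fraction of CPU time busy, 0.0 to 1.0
    run_queue_per_cpu: float  # saturation proxy: runnable tasks per vCPU
    error_rate: float         # failed requests divided by total requests
    hourly_cost: float        # dollars per hour for the fleet


def change_helped(before: UseSnapshot, after: UseSnapshot,
                  max_saturation: float = 1.0,
                  error_tolerance: float = 0.001) -> bool:
    """Return True if cost dropped without saturating CPUs or raising errors.

    The thresholds are assumptions chosen for illustration only.
    """
    cheaper = after.hourly_cost < before.hourly_cost
    not_saturated = after.run_queue_per_cpu <= max_saturation
    errors_ok = after.error_rate <= before.error_rate + error_tolerance
    return cheaper and not_saturated and errors_ok


# Made-up readings: an over-provisioned fleet, then the same app after downsizing.
before = UseSnapshot(cpu_utilization=0.15, run_queue_per_cpu=0.2,
                     error_rate=0.002, hourly_cost=4.00)
after = UseSnapshot(cpu_utilization=0.45, run_queue_per_cpu=0.6,
                    error_rate=0.002, hourly_cost=2.00)
print(change_helped(before, after))
```

Utilization rises after the downsize, which is expected; the check only fails the change if saturation or errors cross the assumed limits.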
Shawn Falconer
When it comes to these types of optimizations and people making choices about what they're going to run in the cloud, do people tend to over-provision, or do they sometimes under-provision?
Brent Segner
So what we've seen historically is that, more often than not, the developer once again cares mainly about their user experience. So they tend to err on the side of over-provisioning the number of resources that they would actually need to meet a very specific use case. We've tried to help them by giving them better data to understand what the actual capabilities are of the instances that they're looking at, and which one is going to be most appropriate for the use case that they're bringing to us.
Jerzy Grzwinski
Yeah, I think there's a common theme that we hear from users: "My easiest way to be resilient from issues is to overscale." And we, with data, try to provide an understanding that sometimes that's a false sense of security. As in the example you gave, sometimes you over-provision servers, but really you're single-threaded through a single CPU, so you actually don't have the security that you think. So, in the mission to try to be more efficient, we also very much try to uncover where your performance, whether capacity or even how fast your application is running, might not be quite there. So it could be a win-win-win solution if you become a little bit more efficient by downsizing.
Shawn Falconer
Yeah, you've developed some interesting approaches to measuring compute efficiency. Can you explain how CoreMark benchmarking works?
Brent Segner
Yeah, so this is one of my passion projects. So I guess we'll go two steps back before we go one step forward, just to understand where CoreMark came from. There have been a number of attempts over the years to come up with a single unifying number that can encompass the capability of a CPU or an instance. Some of the steps that came before, like Dhrystone and Whetstone, and yes, I get the irony of the two names, did a fantastic job stepping towards a unifying number, but ultimately fell short because they didn't encompass the realities of a workload as it would operate on an instance. I think it was 2009 when the Embedded Microprocessor Benchmark Consortium, and that is a mouthful, so I'll say EEMBC, first published CoreMark as a consolidated group of synthetic tests that would measure nine distinct types of operations on a CPU. These operations range from, we'll say, linear algebraic functions all the way through image rendering. But ultimately what it does is take the composite of those nine different types of functions that can be performed on a CPU and give you a single unifying number that we refer to as a multi-core score. That multi-core score allows us to compare apples to apples across instance sizes, instance families, clouds, and even down to bare metal hardware residing in a physical data center.
Shawn Falconer
So what is this multi-core score? Is that a number, 1 out of 100? What does it actually look like?
Brent Segner
It would be a score probably in the range of one out of hundreds of thousands. Each of the individual tests, so a linear algebraic test or an image rendering test or a compression test, is assigned a score based on the efficiency and time it took to execute X number of operations. At the end of running the synthetic tests, every one of those nine operations receives its own score. Those are then combined across the nine tests to come up with that single unifying score. What's super interesting about this is you're able to start to tell how a score would scale as you increase instance sizes or even change instance families to get different allocations of CPU, memory, I/O, et cetera.
Shawn Falconer
Can I factor in the type of application, and consequently the type of operations that are going to be running, into my capacity planning optimization? Because if I'm looking at the aggregate metric, maybe things don't look that good for the thing that I'm planning. But I really only need to, I don't know, perform linear algebra operations or something like that. And that's a different type of optimization than I'd do for a general computer program.
Brent Segner
I love where you're going with this question. You're hitting the nail on the head. What we started to see as we executed the tests is that, based on physical attributes or different libraries that are native within the CPU itself, there are certain instance types that are better at performing different types of operations. So your overall multi-core score may be a little bit lower for a particular instance type. But if, as in your example, you say "I perform mainly linear algebraic operations," this instance type may actually be ideal for what you're trying to do. So given the combination of those nine individual tests along with the multi-core score, we use that to help our development teams make the right decision when trying to pick an instance type and instance size that meets their requirements.
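One way to picture that trade-off is to weight the individual test scores by how much a workload relies on each kind of operation. The sketch below is a hedged illustration only: the instance names and scores are made up, only three of the nine tests are shown, and the weighted arithmetic mean is an assumption for the example, not the way CoreMark or Capital One's tooling combines scores.

```python
# Hypothetical per-test scores for two instance types (higher is better).
SUB_SCORES = {
    "instance_a": {"linear_algebra": 120.0, "compression": 90.0, "image_render": 95.0},
    "instance_b": {"linear_algebra": 100.0, "compression": 110.0, "image_render": 110.0},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of the test scores, using the workload's operation mix."""
    total_weight = sum(weights.values())
    return sum(scores[test] * w for test, w in weights.items()) / total_weight


def best_instance(sub_scores: dict, weights: dict) -> str:
    """Pick the instance type with the highest workload-weighted score."""
    return max(sub_scores, key=lambda name: weighted_score(sub_scores[name], weights))


# A general-purpose mix versus a workload dominated by linear algebra.
general_mix = {"linear_algebra": 1, "compression": 1, "image_render": 1}
linalg_heavy = {"linear_algebra": 8, "compression": 1, "image_render": 1}

print(best_instance(SUB_SCORES, general_mix))   # instance_b
print(best_instance(SUB_SCORES, linalg_heavy))  # instance_a
```

With equal weights the instance with the better overall numbers wins, but once linear algebra dominates the mix, the instance with the lower aggregate score becomes the better fit.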
Jerzy Grzwinski
So in addition to that, you know, we do some engagement types of recommendations, especially for our largest platforms. But the way we use CoreMark scores at scale is we leverage them in our tooling. So the user experience for a developer at Capital One is you log into our internal tool and you type in your application, and then we provide different recommendations: you're running these instance types, and you could be running these different instance types. Under the covers, the algorithm and the way we get to that recommendation is based on the CoreMark scores. And the ecosystem experience that we're trying to create is not only that we pull a bunch of utilization metrics that enrich our recommendation, but we're also entering a phase of allowing users to add additional information. So we create a feedback loop to improve the recommendations we provide, again, to build that trust that we're not just creating a general recommendation. We're making a recommendation just for you, just for your application, just for your specific architecture.
Shawn Falconer
And then if I have something that's running in production today and I find out that there's a better recommendation in terms of what I'm allocating in resources, is there a way I can test out that plan to see if it's actually going to continue to meet my needs from a user perspective while also, you know, maybe reducing costs and so on?
Jerzy Grzwinski
Yeah, so we're definitely trying to get to a place where we automate away a lot of our recommendations, but we also recognize we want to provide accountability where it's owned. So what we recommend in our tooling is step changes in your infrastructure. We never want to take you from a 24XL down to a small in one hop. We very much encourage, you know, performance testing, et cetera, the standard ways to get things into production, and not just reacting and over-correcting. So we do recommend step changes that might take several months, just to test these things out. Operational stability we still find very important, because if we fail once, people will start over-provisioning overnight and we lose that trust factor.
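The step-change idea can be sketched as walking a size ladder one rung at a time instead of jumping straight to the target. This is a minimal illustration under assumptions: the ladder below uses AWS-style size names, real instance families offer different subsets of sizes, and `step_down_plan` is a hypothetical helper, not part of any tool mentioned in the conversation.

```python
# Illustrative size ladder, smallest to largest; real families differ.
SIZE_LADDER = ["small", "medium", "large", "xlarge", "2xlarge", "4xlarge",
               "8xlarge", "12xlarge", "16xlarge", "24xlarge"]


def step_down_plan(current: str, target: str, max_steps_per_change: int = 1) -> list:
    """Return the intermediate sizes to move through, ending at the target.

    Each entry is one change that would be performance tested and observed
    in production before the next change is attempted.
    """
    start = SIZE_LADDER.index(current)
    end = SIZE_LADDER.index(target)
    if end > start:
        raise ValueError("target must not be larger than the current size")
    plan = []
    position = start
    while position > end:
        position = max(end, position - max_steps_per_change)
        plan.append(SIZE_LADDER[position])
    return plan


print(step_down_plan("24xlarge", "8xlarge"))
# ['16xlarge', '12xlarge', '8xlarge']
```

Raising `max_steps_per_change` shortens the plan at the cost of larger hops, which is the risk the step-change approach is meant to limit.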
Shawn Falconer
Right. What were some of the surprising findings that you had with regard to instance size, scaling, and performance?
Brent Segner
So there were, for us, a number of really interesting takeaways when we started to run the benchmarking tests against a broad cross section of instance types and sizes. The one that probably stands out to me the most is how much the performance per CPU was impacted as we started to move up the ladder of instance sizes. I think, for example, it was on maybe the M7i instance family that as we continued to test and grew past 16 vCPUs, we saw a 12% performance hit per CPU. When we kept going up in size, eventually we saw another plateau where we hit another 13% performance hit per CPU. What this really came down to is that, underneath the covers, even though we're working with purely virtual instances that are masked by a hypervisor, we're still bound by a lot of the physical laws of the actual hardware underneath. As we continue to grow the virtual instance sizes, we actually pass physical boundaries on the hardware, like the NUMA boundary, and each time we pass those boundaries, small performance penalties are applied. Individually, these performance penalties are probably small, but eventually you pass so many physical boundaries on the underlying hardware that they start to accumulate, and ultimately you end up paying for a lot of performance that you're not really able to realize. I know that depending on the type of workload you're running, there are different realities that apply. But for us, in general, as a rule of thumb, it's a really good indicator of why we try to encourage our workloads to remain smaller and scale horizontally rather than always leaning on vertical scaling.
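To make the accumulation concrete, here is a rough sketch of how per-vCPU penalties could compound as an instance grows. The 12% hit past 16 vCPUs comes from the conversation; the size at which the second 13% hit applies is not stated, so the 64-vCPU threshold below is purely hypothetical, as is the assumption that the penalties multiply.

```python
# (threshold in vCPUs, fractional per-vCPU hit once the size exceeds it).
# The 64-vCPU threshold is an assumption; only the 16-vCPU one was quoted.
PENALTIES = [(16, 0.12), (64, 0.13)]


def effective_vcpus(vcpus: int, penalties=PENALTIES) -> float:
    """Estimate realized capacity in 'unpenalized vCPU' units."""
    per_cpu = 1.0
    for threshold, hit in penalties:
        if vcpus > threshold:
            per_cpu *= 1.0 - hit  # assume penalties compound multiplicatively
    return vcpus * per_cpu


one_big = effective_vcpus(96)         # one vertically scaled 96-vCPU instance
six_small = 6 * effective_vcpus(16)   # six horizontally scaled 16-vCPU instances

print(f"1 x 96 vCPU : {one_big:.1f} effective vCPUs")
print(f"6 x 16 vCPU : {six_small:.1f} effective vCPUs")
```

Under these assumptions the single large instance delivers roughly three quarters of the capacity being paid for, while the six smaller instances deliver all of it, ignoring any coordination overhead that horizontal scaling adds.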
Shawn Falconer
Yeah. I recently did a podcast on supercomputing where they talked about some of this stuff, where the number of FLOPS that most of these supercomputers can operate at far exceeds the capacity for moving data around in the physical hardware. Because if you're moving data from CPU to cache to memory and so forth, there's a higher cost to replicating that data to those different places than there is to doing the floating-point operations. So most programs aren't actually able to fully utilize the ridiculous gigaflops or whatever that are available now.
Brent Segner
Exactly what we saw.
Shawn Falconer
How do you factor in the performance improvements that we see in cloud instances over time? Obviously, the hardware and the virtualization of this hardware continue to get better from where they were five, ten years ago. It's always getting better. How does that get factored into your strategy when you're thinking about continual performance improvements and optimization?
Jerzy Grzwinski
Yeah, that's something we're actually really doubling down on, as we've noticed that folks aren't shifting to newer generations as quickly as we'd like. There is a little bit of a balance with the newer generation. The cloud providers need to provision enough capacity and, you know, at GA they certainly have X amount of capacity for an infinite amount of customers. At our scale, when an application is deploying a thousand servers at a time or more, we need to make sure that application teams are not running into insufficient capacity errors when they're provisioning instances, which we tend to see more on the newest generation of instances. And ultimately all it takes is one occurrence for a team to back off and use an older generation instance. So it is a balancing act that we are trying to counter a little bit now, because we feel that folks are being a little too conservative and passing up on the benefits of newer instance types. So that is an angle that we are tackling right now, and more forcefully with our tooling, to try to get folks to the newer generation of instances. In Brent's data, the CoreMark scores, we definitely see the same thing that the cloud providers are advertising. The newer instances are faster and better and, if you also right-size them, cheaper. There is another problem statement we are tackling too, which is that folks like to stay consistent. So maybe on a 5th generation instance they're running a 4XL, and they want to run the equal size of a newer generation instance, which could in some cases be more expensive but also provides far more performance. So we really are trying to get folks not to get tied down to the name of the instance or the size of the instance, but to the underlying performance that it provides. And that ties back to the CoreMark scoring, et cetera, that we were mentioning.
Just because you were on a large instance from 2015 doesn't mean you have to be on a large instance from a newer generation. So that is definitely the second challenge that we're tackling. Folks don't forget, exactly, but they don't take the time to understand the performance improvements of newer generations. And we're trying to bring tools and data to the conversation so folks can understand what they're getting and that they can use smaller instances in the newer generations.
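Matching on performance rather than on the size name could look like the sketch below: take the benchmark score of the instance in use today and pick the smallest newer-generation size that meets it. All scores and size names are invented for illustration, and `smallest_matching_size` is a hypothetical helper rather than anything from the tooling discussed.

```python
from typing import Optional


def smallest_matching_size(required_score: float, candidates: dict) -> Optional[str]:
    """Return the lowest-scoring candidate size that still meets the required score.

    `candidates` maps a size name to its benchmark score. Returns None when
    no candidate is fast enough.
    """
    adequate = [(score, size) for size, score in candidates.items()
                if score >= required_score]
    if not adequate:
        return None
    return min(adequate)[1]


# Hypothetical multi-core scores: an older-generation 4xlarge versus the
# sizes available in a newer generation.
old_4xlarge_score = 140_000
newer_generation = {"xlarge": 75_000, "2xlarge": 150_000, "4xlarge": 300_000}

print(smallest_matching_size(old_4xlarge_score, newer_generation))  # 2xlarge
```

In this made-up case the like-for-like 4xlarge would deliver about twice the performance the application had before, while the 2xlarge already covers it at a smaller size.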
Shawn Falconer
Is there any risk, I guess, associated with being too much on the cutting edge of whatever the latest, flashiest instances are?
Jerzy Grzwinski
Yes, to a degree, and I think it's about how you manage that risk. For us, the primary risk is potentially around those capacity errors. If, on the day of GA, you take the latest instance and you want to scale, that is something to consider. But the way we're trying to counter that risk is we really want to automate away instance selection as much as possible. So we would love everyone to use one of the latest generation instance types. But if that's not available, we automatically take you down to the next instance type, in the right size if the size needs to be changed, and hopefully you're not going to N minus 2. Ultimately, if the capacity of the newest instance does have an issue for a short period of time, we can then mitigate that risk while also promoting the newer generation of instances. So we do want to do some of the proving out of the newer instances centrally. In general, the cloud providers do a nice job in making sure what comes to GA is market ready. It's just more around that capacity piece. So, you know, we feel like we should have our whole footprint on N minus 1 or newer generations of instances at our scale.
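The automated fallback Jerzy describes can be sketched as trying a preference list in order and moving to the next entry only when capacity is unavailable. Everything named here is hypothetical: the `InsufficientCapacity` exception, the `provision` callback, and the instance type names stand in for whatever a real provisioning layer exposes.

```python
class InsufficientCapacity(Exception):
    """Hypothetical error raised when a provider cannot supply an instance type."""


def provision_with_fallback(preferences, provision):
    """Try each instance type in order, newest generation first.

    `provision` is a caller-supplied function that launches the given type
    and raises InsufficientCapacity when the provider has none available.
    """
    for instance_type in preferences:
        try:
            return provision(instance_type)
        except InsufficientCapacity:
            continue  # fall back to the next preferred type
    raise InsufficientCapacity("no capacity for any preferred instance type")


# Stand-in provisioner for the example: the newest generation is out of capacity.
def fake_provision(instance_type):
    if instance_type.startswith("gen-n."):
        raise InsufficientCapacity(instance_type)
    return instance_type


# Sizes are chosen per generation so each option delivers similar performance.
preferences = ["gen-n.xlarge", "gen-n-1.xlarge", "gen-n-2.2xlarge"]
print(provision_with_fallback(preferences, fake_provision))  # gen-n-1.xlarge
```

Keeping the preference list in one central place is what lets application teams default to the newest generation without owning the capacity risk themselves.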
Brent Segner
So just to expand on what Jerzy said, I really love this point. As we've watched over the last five years or so, and companies are definitely now moving all their workloads into the cloud, we've started to see that all the cloud providers, as well as the silicon manufacturers, are really starting to understand what types of workloads are running in the cloud and are engineering around how to have those operate most efficiently, both for themselves and for the users. Take a look at the architectural changes we've seen in the latest generations that are just groundbreaking, like the size of the L1 through L3 caches. We're also starting to see on a lot of the latest generations that the cloud providers are now disabling SMT, or simultaneous multithreading. Both of these changes have given a huge performance boost while at the same time making the workloads much more predictable for the users. So we took a look a little while back to benchmark across generations with ARM architectures. I think when we compared N minus 2 to N minus 1 generations, there was a 30% boost just going to N minus 1, and that unto itself, if it stopped there, would have been phenomenal. But then you compare N minus 1 to the current generation, and it jumped again by 20%. So at the end of the day, when you compare current generation performance to two generations back, there is almost a 50% jump in performance, and at the same time there's only been a nominal increase in cost. So for our application teams, if we can encourage them to get to the current generation, it's a 50% performance boost at nominal cost, which means that they're actually getting significantly better cost for performance on the current generation than they would if they stayed on an N minus 2 generation. That's a huge incentive to try to get them to the current technologies.
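The arithmetic behind those generation-over-generation figures can be sketched as follows. The 30% and 20% gains are the conversational numbers quoted above; treating them as multiplicative, and using a 5% price increase as a stand-in for "nominal", are assumptions made only for this example.

```python
def compound_gain(*gains: float) -> float:
    """Combine successive fractional gains, assuming they multiply."""
    total = 1.0
    for gain in gains:
        total *= 1.0 + gain
    return total - 1.0


def relative_cost_per_performance(perf_gain: float, cost_increase: float) -> float:
    """Cost per unit of performance relative to the older generation (1.0 = same)."""
    return (1.0 + cost_increase) / (1.0 + perf_gain)


two_generation_gain = compound_gain(0.30, 0.20)
print(f"Compounded gain over two generations: {two_generation_gain:.0%}")

# Using the roughly 50% figure from the conversation and an assumed 5% price increase.
ratio = relative_cost_per_performance(perf_gain=0.50, cost_increase=0.05)
print(f"Cost per unit of performance vs. N minus 2: {ratio:.2f}x")
```

If the two gains do multiply, the combined improvement is in the same ballpark as the roughly 50% quoted, and even with a small price increase each unit of performance costs noticeably less than it did two generations back.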
Shawn Falconer
Yeah, absolutely. How much does the language that you're programming in impact optimization and your ability to control cost?
Brent Segner
You watch the reinvent cost. So Jerzy and I have had this conversation a lot. I made a mistake a little while back. I started to talk very specifically about languages, and I very quickly learned that if you speak poorly of a language in one regard, it's like calling somebody's baby ugly. So I try to be very careful when I'm addressing this topic. I know each organization, even each developer, has very definitive thoughts on which languages are preferred. So all I'm trying to say is that language selection and library selection play a very foundational role in influencing resource utilization, performance, and scalability. So for instance, for me, a language like Python is very easy to use if you're working with different data science tools, NumPy, pandas, et cetera. However, there are trade-offs when it comes down to performance. So if you're looking for more performance than simplicity, then you may want to take a look and say, okay, do I have the opportunity to move to a Go or a Rust based language, or use libraries like Polars or Hugging Face, which are actually optimized to run in those types of environments? So for me personally, my message, when we take a look at whether there are potential optimizations or bottlenecks in the environment itself, is that we just try to help the developers understand that selection of languages and selection of libraries is a constant balancing act to get the best trade-off between performance, scalability, and cost efficiency.
Shawn Falconer
And there's also, I guess, a trade-off external to the cloud cost that you might have to factor in. Because you could make the argument, perhaps, that your amount of engineering time while developing in Python is going to be less than the engineering time if you're developing in a lower-level language like C or something like that. But it's going to be hard for you to match, in a Python program, the sheer performance that you can get from a C program. And there's a reason why operating system kernels are written in C. Yes, that's right.
Jerzy Grzwinski
And for us it's just providing that context to help folks make the right decision locally. Because there are certain situations where, for example, Brent did an engagement where we had a very large platform that was very data intensive and was using the wrong library. Just shifting them from one library to another, which was a relatively low lift, increased the performance by 100%, and it was a Lambda infrastructure, so the faster it went, the lower the cost was. So, you know, they doubled their speed, reduced their cost by half, and it was a win-win. And in that case it made a lot of sense. But to your point, our goal is not to walk around and get everyone to re-architect and rewrite their applications, but to make it a mindful decision. If you're building an application that is very performance sensitive and you're in that development lifecycle of building a new microservice or new capability, you might benefit from that language, not just for cost reasons, but really for meeting the performance requirements of your application. So yes, this is where we find ourselves spreading insights on performance as much as on financial efficiency. We see those going hand in hand. And the added benefit of talking about it from both the performance and the cost perspective is that it helps build that trust. If we're helping you create a better tool to meet your requirements, we can also make you efficient. It really helps generate that win.
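The arithmetic behind that result can be sketched briefly. AWS Lambda bills largely on execution duration multiplied by configured memory, so halving the run time roughly halves the duration charge. The sketch below is illustrative only: the rate, memory size, durations, and invocation count are hypothetical placeholders, not actual pricing or Capital One figures, and the flat per-request fee is left out.

```python
def duration_cost(invocations, avg_duration_s, memory_gb, rate_per_gb_second):
    """Duration-based charge: invocations x seconds x GB x rate."""
    return invocations * avg_duration_s * memory_gb * rate_per_gb_second


# Hypothetical placeholder values, chosen only to make the arithmetic visible.
RATE_PER_GB_SECOND = 0.00002   # assumed rate, not a published price
INVOCATIONS = 10_000_000       # assumed monthly invocation count
MEMORY_GB = 1.0                # assumed configured memory

# A 100% performance increase means the same work finishes in half the time.
before = duration_cost(INVOCATIONS, 2.0, MEMORY_GB, RATE_PER_GB_SECOND)
after = duration_cost(INVOCATIONS, 1.0, MEMORY_GB, RATE_PER_GB_SECOND)
savings_fraction = 1 - after / before

print(f"before: {before:.2f}  after: {after:.2f}  saved: {savings_fraction:.0%}")
```

Because the charge is linear in duration, any library or language change that shortens execution time shows up directly in the bill, which is why performance and cost move together on this kind of infrastructure.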
Brent Segner
Yeah. What about in terms of GPU optimization, which I think is probably something that a lot of businesses have been thinking about over the last couple of years? How does that perhaps differ from traditional compute optimization?
So the biggest change we start to see when taking a look at GPU optimization versus CPU optimization is just understanding that the architectures have fundamentally changed when you look at a GPU-based environment. In a CPU-based environment, the CPU's the workhorse. All of the operations are executed sequentially through the CPU. So it's just how fast can you push sequential operations? When we switch over to a GPU-based environment, the CPU now plays a completely different role. So you can't take a look at how busy the CPU is. The CPU's only function in this environment is how quickly it can dispatch instructions to a GPU to execute on. And then you've got to take it even one step further. Each GPU is comprised of multiple streaming multiprocessors, multiple CUDA cores. So instead of taking a look at a single unifying function, like how heavily a CPU is utilized, you've now got to get down into the weeds and take that next step to say, hey, how much are we engaging our GPUs? How thoroughly are we starting to saturate our streaming multiprocessors, CUDA cores, et cetera? So, you know, are there opportunities for us to go through and actually tune the way that we're presenting the work to the GPU instances to take better advantage of their capabilities?
You know, I understand like capacity planning for GPUs around like training cycles where you have some sense for how much data you're going to have to process, how long those things are going to take. But you know, how do you think about optimization for capacity planning around like inference cycles?
So a lot of it for us, specific to inference, comes down to testing the capabilities of the GPU within the context of how we're going to use it. So how closely can we actually replicate or simulate the traffic patterns that are going to be presented as inference to the GPU, to figure out what is the appropriate instance type, size, and architecture to handle the workloads? Above and beyond that, even outside the technical range, it's going back to old-school algorithms to determine capacity planning. How many instances theoretically would we need to be able to grow into, given the different dimensionality of the users we may start to see?
Jerzy Grzwinski
To build on that, right now, in the cycle of where the industry is, we are operating a little bit in an on-premises kind of mindset from a capacity management perspective, just because of scarce resources and availability of the instances. So right now we are far more comfortable over-provisioning, just for time to market and making sure that we're meeting our mark operationally. What we're trying to do is leverage this time period. And as the availability of GPUs will only improve with time, we really want to land on how we measure that efficiency: how do we get to be as good at defining what efficient on a GPU instance is as we are with the insights we have on a traditional CPU? That way we can then have some of the conversations that you're referring to, Shawn, influencing teams to say, look, you're not quite where you need to be, and here's what we recommend. So we certainly have our opinions, and we're only making those opinions stronger as we learn through this process as well, to get to a world where we are scaling in and out with GPUs more in real time, as we do with CPUs, hopefully in the near future.
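One simple way to picture the capacity-planning arithmetic for inference is sketched below. It is an assumption about what such a calculation could look like, not the method the guests use: the function name `instances_needed`, the 70% target utilization, and all traffic numbers are hypothetical, and the per-instance throughput is presumed to come from a load test that replays realistic inference traffic.

```python
import math


def instances_needed(peak_rps, measured_rps_per_instance, target_utilization=0.7):
    """Instances required to serve peak traffic while leaving headroom.

    measured_rps_per_instance is assumed to come from a load test against the
    candidate instance type; target_utilization is the fraction of that
    measured throughput we are willing to plan against.
    """
    if measured_rps_per_instance <= 0:
        raise ValueError("measured throughput must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = measured_rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Hypothetical numbers: today's peak and two growth scenarios.
current_peak_rps = 1200
measured_rps = 85
for growth in (1.0, 1.5, 2.0):
    count = instances_needed(current_peak_rps * growth, measured_rps)
    print(f"growth x{growth}: {count} instances")
```

Running the same calculation for several growth scenarios gives a rough view of how many instances a service might need to grow into, which matters more when scarce GPU capacity has to be reserved ahead of time.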
Brent Segner
Yeah, I mean, I think that's a challenge right now for everyone: we're all kind of still in the learning cycle of what this means. We don't have the decades and decades of experience that we have with conventional CPUs and memory structures. Outside of GPU optimization for AI/ML workloads, are there other optimizations that you have to think about that are maybe different from traditional workloads?
So for us, this is kind of a multifaceted answer. The first one is figuring out where our opportunities are. In a conventional CPU world, we used to be able to take a look at CPU utilization to inform how well we are using the resource. In a GPU context, utilization itself is not necessarily sufficient, because utilization points more to whether we are using any of the resources on the GPU rather than how thoroughly. So we start to shift our narrative over to how heavily we are actually starting to saturate the GPU. In other words, are we taking advantage of all of its parallel processing capabilities? We do that today largely by taking a look at the wattage and the thermals on the GPU itself. In other words, understanding what the upper thresholds are for wattage on a GPU and the upper thresholds for thermals, and for what period of time we are at what percentage of those maximum capabilities. So I can now start triangulating. If I start to see, for example, GPUs with high utilization but low saturation, meaning basically my utilization's 80, 90, 100%, I am using some resource on it, but I don't ever see the wattage increase. That now tells me that I've got to go back and work with some of our teams that are presenting models to it to tune the model so we're better able to take advantage of the resources on that GPU. So instead of just, once again, a single unifying metric, now we've got to start to triangulate metrics and figure out what problem exactly we are solving with respect to optimization.
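The utilization-versus-saturation triangulation described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Capital One's tooling: the `GpuSample` type, the `classify` function, and the 80% utilization and 60% power thresholds are all hypothetical, and thermals are omitted for brevity. The sample values are invented; in practice, readings of this kind (utilization, power draw, power limit) are the sort of thing `nvidia-smi` reports.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    utilization_pct: float  # reported GPU utilization, 0-100
    power_draw_w: float     # instantaneous power draw in watts


def classify(samples, power_limit_w, util_threshold=80.0, power_threshold=0.6):
    """Compare average utilization with average power draw as a share of the limit."""
    avg_util = mean(s.utilization_pct for s in samples)
    avg_power_fraction = mean(s.power_draw_w for s in samples) / power_limit_w
    if avg_util >= util_threshold and avg_power_fraction < power_threshold:
        # Busy according to utilization, but drawing little power:
        # the parallel hardware is probably not being saturated.
        return "high utilization, low saturation"
    if avg_util >= util_threshold:
        return "high utilization, high saturation"
    return "low utilization"


# Invented readings for two GPUs with an assumed 400 W power limit.
under_saturated = [GpuSample(95, 120), GpuSample(90, 130)]
well_saturated = [GpuSample(95, 350), GpuSample(92, 360)]

print(classify(under_saturated, power_limit_w=400))
print(classify(well_saturated, power_limit_w=400))
```

A GPU that lands in the first bucket is a candidate for the kind of model or batching tuning mentioned above, since it is reserved and busy but doing comparatively little parallel work.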
Right.
Jerzy Grzwinski
And then, maybe taking quite a bit of a jump into a different type of optimization: at our scale, we really believe in platforms, in creating enterprise platforms that are multi-tenant, where different users, either developers or analysts or business folks, are able to use those tools. As those users log into these platforms, without even knowing it they might be provisioning highly oversized capacity for what their needs are. So there is a question of product design, and how we influence our enterprise platforms on what the interaction is between the user of the platform and that internal managed service, where we provide transparency into the decision making, or maybe automate away some of the decision making, to make sure that the user gets the experience that they need for whatever the tool provides. But the platform team is also accountable for making sure that that user is being super efficient. And it's a little bit of that shared responsibility model, where the platform team needs to think not only about the architecture of the platform but also about the user interaction, because that user might not be enticed, or might not have enough insight, to know which dropdowns to choose for the thing they need to use. So how do we make sure that the platform team builds an experience, but also that the user has the right information to make the right choices? So there are a lot of aspects of product design that we also spend a good amount of time debating, for how to best go after that problem.
Brent Segner
In terms of other types of optimization, GPU or otherwise, how much does, I guess, sustainability and the overall carbon footprint, the energy footprint that you're putting into your resourcing, play a factor?
Jerzy Grzwinski
Really, going into 2025, we have elevated sustainability to be one of our top KPIs that we measure. Our FinOps journey has now been five, six plus years, and we have found that, look, we've optimized some of the big things and the obvious low-hanging fruit. We have tooling and messaging around all the different things developers can do to save financially. So we're encountering a phase of: how do we get more folks to really care? How do we get more folks to really take the actions that we're preaching? How do we drive that culture of efficient engineering is good engineering? And for some folks, money doesn't resonate if it's not from their own back pocket. So sustainability can really play a big role, where we hear from a lot of our developers that they are very motivated by taking the right actions to have a meaningful positive impact on the environment. Externally, there's a big push that we're adding into our tooling to provide context and information on what kind of impact you have environmentally, or how much wattage you're using on different solutions. And this only became clear this year, as we've been looking at different metrics and saw that our cost trend went at a certain velocity, but our wattage trend went at a drastically exponential velocity, of course, as GPUs are quite power hungry and that becomes a bigger and bigger footprint. So we want to tackle it both from getting our associates really energized and engaged on the things we've been trying to drive for some time, and from being mindful of the power draw that is required to support everything that everyone wants to do with GPUs and AI. And we just want to be responsible stewards for how we leverage that technology to drive the changes we are looking to drive from the customer experience perspective, but in a sustainable way.
Brent Segner
I think over the last few years, and maybe this is just a product of people having been invested in the cloud now, for a lot of people, for over 10 years, there's been some pushback on the idea of the cloud. There's a lot more people, I think, upset about the cost, and there have even been movements of people declouding, moving off the cloud and going back to on-prem systems. In terms of cloud cost optimization, what is your vision of the future for that?
Jerzy Grzwinski
So for me the promise of the cloud is real and true. But I feel that in all of that goodness the reality is sometimes a little bit forgotten. That's where FinOps really grew so exponentially: look, the cloud allows you to do anything you want, and anything you want can be very, very expensive. So all of a sudden you are now tasking engineers with thinking about costs as part of engineering. And I think that's a good thing. I think we just all need to grow that muscle to consider cost as a variable in the equation of what a good architecture looks like. And if you're thinking that way as you design and architect the solution and then operate the application that you have built, the cloud can be a beautiful place and deliver all those things that have been advertised, like time to market and great costs. But that doesn't come without a little bit of effort and mindfulness. And I think as larger and larger organizations lean into the cloud, that problem becomes a very large dollar amount. When you have a startup mentality and you only have so many funds to deliver something in a very short time period, the cloud is amazing. And those engineers are very incentivized to be frugal and mindful of those engineering decisions, and that drives amazing outcomes. We've got to figure out how to drive that culture of the startup mentality into a large enterprise so you can also get those time-to-market benefits. You have to definitely include that frugalness and mindfulness and good engineering from an efficiency standpoint. And that's really what I think FinOps is looking to do. Our goal is to drive that culture. So I do still feel the cloud is the place to enable all the goodness, but it definitely has to be supported through FinOps and other practices that are not quite needed in on prem.
Brent Segner
I love that and I agree 100%. So, like Jerzy, I don't ever see a wholesale shift back to on-premises data centers. I do think that we are going to start to see an evolution towards more of a poly cloud environment where users start to run workloads in the environments that make the most sense to run them in. And I think what we're starting to see is some of the different foundations, like the FinOps Foundation, leaning into a future that looks that way with their FOCUS project, which allows different companies to run workloads in different clouds but keep the same look from a billing perspective, kind of a ubiquitous look as far as cost and usage go, with the hope of a future where they can run workloads where it's most cost and performance advantageous for what they're driving towards.
Yeah, and I think a lot of the cloud providers are obviously aware of these issues and the pushback as well, and they've done more over the last couple of years to make investments within the product to give you more visibility into what spend looks like, how you can optimize, and so forth. So it seems like a trend that is going to continue: the hyperscaler cloud companies are going to make these investments, and companies like Capital One and beyond are also going to continue to be more conscious of this and of how they can make these optimizations for some of these cloud providers.
Jerzy Grzwinski
Some of the products and features they roll out, it's a learning opportunity too. We've had several occasions where, working with AWS, we would provide feedback on certain services that they offer that, at our scale, just financially made zero sense. We really wanted to use the services, and we worked with AWS to explain how we got to our thinking and our strategy, and props to Amazon: they took that feedback and they worked through it. And there are cases where different services got completely re-architected to be able to scale to larger customers. We take our relationship with our cloud providers very seriously, and we feel that it's a two-way street. We can provide feedback and requests for what kind of features we would like to see, and the cloud providers are there to listen and then, in some cases, completely re-architect services to make them more economical. Where it just would not have made sense for a large corporation, after they made the change it is now something that makes sense to actually leverage in the cloud, versus shifting to another provider or building something in house.
Brent Segner
Awesome. Well, Brent, Jerzy, thanks so much for being here. This was great.
Jerzy Grzwinski
Awesome. Thanks for having us. It was fun.
Brent Segner
Thank you.
Cheers.
Podcast Summary: Maximizing Cloud Efficiency with Jerzy Grzywinski and Brent Segner
In this episode of Software Engineering Daily, host Shawn Falconer welcomes two seasoned experts from Capital One: Jerzy Grzywinski, Senior Director of Software Engineering and leader of FinOps, and Brent Segner, Distinguished Engineer focused on performance engineering and cloud cost optimization. The trio delves into the intricacies of cloud optimization, balancing performance with cost, and the evolving landscape of cloud technologies.
Brent Segner introduces himself as a distinguished engineer with about two decades of experience in the field. He mentions, “[...] I've been with Capital One now for just over two years, focused on all aspects of cloud cost optimization, performance engineering” [01:29].
Conversely, Jerzy Grzywinski shares his extensive tenure at Capital One, spanning nearly 16 years, during which he has undertaken various engineering roles. Currently leading FinOps, Jerzy emphasizes his passion for merging technology with finance, stating, “[...] technology brought me to something that has a finance problem statement, but also a technology problem statement” [02:37].
The conversation shifts to Capital One's strategic move to the cloud. Jerzy explains, “Capital One decided that's the direction we want to go. We want to be nimble and fast, we want to attract the best talent” [04:21]. This transition was driven by the need to innovate rapidly, attract top-tier talent, and replace cumbersome on-premises infrastructure with scalable cloud solutions.
Jerzy contrasts the pre-cloud and post-cloud environments, highlighting the drastic reduction in deployment times. “[...] the time to market for someone to say I need a server to they are actually able to deploy code on that server, let alone that server, be able to talk to anything else was measured in weeks, months and definitely not hours, minutes” [06:35]. The cloud empowered developers with unprecedented speed and flexibility, fostering a culture of rapid iteration and deployment.
FinOps emerged as a critical focus area, aiming to instill cost-consciousness among developers. Jerzy notes, “[...] our goal is to get developers to think about what their spend is, because now they do have access to provision and build things in a way that can be efficient or can be very clunky” [08:57]. The team strives to balance the newfound freedom of the cloud with responsible spending, ensuring that scalability doesn't translate into runaway costs.
A significant challenge in FinOps is fostering trust with engineering teams. Jerzy emphasizes the importance of empathy and collaboration: “My goal is like, how do I bring value to the developer either through tools, automation, et cetera, or through engagement and information in order for that developer to be successful” [10:00]. By providing actionable insights and celebrating successes, the FinOps team positions itself as a partner rather than a gatekeeper.
A pivotal topic discussed is the CoreMark-based benchmarking that Brent describes. He explains, “[...] CoreMark score allows us to be able to compare apples to apples across instance sizes, instance families, clouds, and even down to bare metal hardware residing in a physical data center” [15:44]. This unified metric aggregates performance across nine distinct CPU operations, enabling precise capacity planning and instance selection tailored to specific workloads.
The guests delve into the nuances of instance sizing. Brent shares insights from their benchmarking efforts: “As we continue to test and we grew past 16 vCPUs, we saw a 12% performance hit per CPU” [20:07]. This phenomenon, attributed to surpassing physical hardware boundaries like NUMA, underscores the importance of horizontal scaling over vertical scaling to maintain performance efficiency.
The discussion transitions to how programming languages influence cloud costs and performance. Brent articulates, “[...] language selection, library selection just plays a very foundational role influencing like resource utilization, performance, scalability” [28:16]. For instance, while Python offers rapid development and ease of use, switching to languages like Go or Rust can yield significant performance gains, thereby reducing cloud expenditure.
GPU optimization presents a distinct set of challenges compared to CPU optimization. Brent highlights the architectural differences: “You can't take a look at how busy the CPU is. The CPU's only function in this environment is how quickly can it dispatch instructions to a GPU” [31:56]. Effective GPU optimization involves maximizing the utilization of streaming multiprocessors and CUDA cores to fully leverage parallel processing capabilities.
Sustainability has become a paramount KPI for Capital One. Jerzy states, “[...] sustainability to be one of our top KPIs that we measure” [39:18]. The team integrates environmental impact metrics into their optimization strategies, recognizing that GPU-intensive workloads significantly affect power consumption and carbon footprint. This dual focus on financial efficiency and environmental responsibility shapes their approach to cloud optimization.
Looking ahead, Brent envisions a poly cloud future, where workloads run in whichever environment makes the most sense for performance and cost. Jerzy maintains that the promise of the cloud is real, but that it “definitely has to be supported through FinOps and other practices that are not quite needed in on prem” [44:30], and he adds that the cloud providers are there to listen to customer feedback. This outlook aligns with industry efforts toward a consistent view of cost and usage across clouds, such as the FinOps Foundation's FOCUS project.
In conclusion, Jerzy Grzywinski and Brent Segner shed light on the multifaceted challenges and strategies involved in maximizing cloud efficiency. From fostering a culture of cost-conscious engineering to leveraging advanced benchmarking tools and embracing sustainability, Capital One's approach serves as a comprehensive model for enterprises navigating the complexities of cloud optimization. Their insights underscore the importance of balancing innovation with responsibility, ensuring that the cloud remains a catalyst for growth without compromising financial or environmental goals.
Notable Quotes:
Jerzy Grzywinski [02:37]: “technology brought me to something that has a finance problem statement, but also a technology problem statement.”
Jerzy Grzywinski [06:35]: “[...] the time to market for someone to say I need a server to they are actually able to deploy code on that server, let alone that server, be able to talk to anything else was measured in weeks, months and definitely not hours, minutes.”
Jerzy Grzywinski [08:57]: “Our goal is to get developers to think about what their spend is, because now they do have access to provision and build things in a way that can be efficient or can be very clunky.”
Brent Segner [15:44]: “CoreMark score allows us to be able to compare apples to apples across instance sizes, instance families, clouds, and even down to bare metal hardware residing in a physical data center.”
Jerzy Grzywinski [39:18]: “Going into 2025, we have elevated sustainability to be one of our top KPIs that we measure.”
This summary encapsulates the comprehensive discussion between Jerzy Grzywinski and Brent Segner on cloud optimization strategies, challenges, and future directions, offering valuable insights for software engineers and cloud professionals aiming to enhance their cloud efficiency practices.