
In this episode of the AWS Podcast, host Jillian Forde discusses the migration journey of Booking.co
A
This is episode 741 of the AWS podcast released on October 13, 2025.
B
Welcome, everyone, to the AWS Podcast. I am your host, Jillian Forde, and we've got a really exciting story for those who really love geeking out on infrastructure at the edge and serverless. This one's going to be really cool for you. And we've also got a really interesting tidbit as part of this. We're going to be talking with Booking.com and a serverless expert at AWS about how they not only migrated to the cloud with this infrastructure at the edge, but also made their website faster and more secure, and saved $500,000 with a single line of code. That is absolutely crazy. So you definitely want to stick around. There are two amazing people that I'm going to be talking to today. The first is Ali. He's a leader from the networking and traffic management organization at Booking.com, and he led a really big migration at Booking.com. This allowed them to make the networking and content delivery more global to users on AWS and their customers all over the world. So we're going to learn about that. Really excited. And the next person I want to introduce, who's also going to be on this interview today, is Sarah. She is a Principal Solutions Architect at AWS. She supports Booking.com and she also supports customers on their serverless journey. She is a former lead of the maintainers team of the open source library Powertools for AWS Lambda. So you can definitely bet that I'm going to ask her some serverless questions as well, to really help you take the learnings that Booking.com has gone through and applied in their business and make sure that you can take things away for yourself. All right, so excited to have both of you here. So let's get started. Ali, for the folks who aren't familiar with Booking.com, maybe you can start off by just telling them what it is.
A
Thank you, Jillian, for the introduction. Yeah, Booking.com is probably the world's biggest hospitality platform. People know it as a platform to book hotels, but actually Booking is much, much bigger than that. It's a full end-to-end booking platform with the mission of making it easier for everyone to experience the world. You can use it to book anything from hotels to homes to flights, cars, attractions, even airport taxis. Anything related to travel you will find on Booking.com.
B
Yes, super cool. And maybe you can describe for us the architecture, what it was like in 2022 when there were challenges that Booking.com was facing.
A
Yeah, Booking.com is a survivor of the dot-com bubble. It's quite an old company. We've been in business for more than 27 years now, if not even more. It started as a small Dutch startup, then it got acquired by Booking Holdings from the US, and then they gave it the .com domain, moving from booking.nl to booking.com. That's something that not many people know. Our headquarters is still in the Netherlands, but we're quite global now; we've become a global phenomenon. So with around three decades of engineering under our wing, it's quite a big and diverse infrastructure. You name it, we have it all, going from bare metal and monolithic architecture to private cloud to containerized, moving into the cloud ecosystem, with serverless and all the modern and new things. We have quite big, diverse teams everywhere, and Booking is continuously going through multiple movements of migration and modernization. And our story today, I think, comes only in the last decade, let's say, and specifically the last five years, and it is about how we utilized AWS to help us accelerate that modernization into the modern era.
B
Maybe you can tell us more about what were some of the challenges that you've started to face as you've kind of reached a point in your architecture where you needed to be able to modernize and be able to serve more customers globally.
A
Sure. As mentioned, Booking started as a Dutch startup, so we've always been Europe-centric. Our data centers, our main content delivery network, basically everything we owned used to be very much focused on Europe. But across the last, let's say, 15 years, Booking was growing exponentially and globally, with a far bigger reach than just the European Union. So we started to think about how to expand the Booking infrastructure to be more global, to be customer-centric, and to really reflect our user base. So we built a bunch of different things, starting in the past with what was very typical in the market: just buying load balancer appliances, installing them in our own data centers, and trying to cache more and enhance the content delivery. Trying to utilize the newer technologies and build a global presence to enhance the connectivity, especially for remote users in Asia, the U.S., South America, and Africa, we started building what we called mini PoPs: small data centers of our own with a few servers, just to serve DNS and to serve as a reverse proxy to our content. We were just trying to optimize the connection establishment time, building dedicated IPsec lines between those small mini PoPs and our main data centers in Europe. And that worked out very well for us for a while. But as Booking continued to grow very quickly and exponentially across the globe, we quickly realized that continuing to build more data centers, expanding our toil and operational overhead, was not very maintainable. The return on investment very quickly stopped adding up. And this is where we started looking into more globally distributed PoP solutions. We considered multiple options, including renting more data center space or going with multiple providers of CDN and proxies and networking around the globe. And yeah, this is where we thought about CloudFront as one of the candidates to solve that.
B
So yeah, you mentioned CloudFront as something you were thinking about. Maybe you can tell us more about how you thought about the right architecture. How did you think about CloudFront? And I also understand that you were looking at, and are utilizing, Lambda@Edge.
A
Yeah, sure. So we set our requirements as follows. We needed something that has a global presence everywhere, to reflect where our customers are. CloudFront met that very quickly. When we started, CloudFront had more than 400 PoPs around the globe, in almost every continent and every region. I think last time I checked there were more than 700 now. In only the last four years CloudFront has doubled the number of PoPs, at least from what I see; probably there are even more. We needed something that would help us accelerate our cloud adoption, because it came at a time when we were already adopting AWS much more. CloudFront played into that because it comes with a lot of native integrations with a lot of AWS tooling, which really fit like a piece of a puzzle into the remaining part of our infrastructure. We needed something that has a measurable performance improvement that we can actually see. At Booking, we pride ourselves on being religiously data driven. We have a very complex and holistic experimentation system where every change we make, whether it's changing the color of a button or changing a major ISP or network provider, has to be experimented on. And we have every experiment running for two to three weeks to collect statistically significant data, so we can prove that this solution or this change is actually enhancing our business. We needed something that is easy to manage, that will reduce our operational toil. And we definitely needed something that will strengthen our security. So when we looked at multiple solutions, including building our own global load balancers, other CDNs, and other solutions in the market, CloudFront stood out, mainly because it is a big part of the AWS ecosystem, whose adoption we were accelerating, and it met all our other requirements. 
With the 700 PoPs around the globe, with the addition of WAF, Shield, and Bot Control on top of CloudFront to strengthen our security, and with the ability of CloudFront to manage connections and maintain long-lasting connections to our origins in Europe, it all played very well into what we needed.
B
Yeah, I really like how, when you were thinking about the requirements, you thought very in depth and holistically about what they were. I work with startups, and I know at least at that stage it's really like, okay, we need to solve this specific problem at the lowest cost possible, and that's a great place to start. I think that's a very useful way overall to think about your requirements. But I love that, as you were just explaining, you were really showing that there was a lot more depth you had to think about, like: how does this impact our security? So I just hope that customers, as they were listening to how you thought about it in depth, were inspired to think more in depth about the right long-term architecture approach for whatever problem it is that they're trying to solve.
A
Yeah, indeed. Security was a major requirement of what we needed, and it turned out it really made our security teams very happy. We try to implement security in depth where possible, and we have a zero trust environment, an end-to-end implementation of security across the board. But security teams still have to put a lot of focus on securing the perimeter. In our old setup we had many, many entry points, and the surface of attack was quite big. They had to worry about F5 security and our reverse proxy open source software. They had to worry about the global infrastructure in many countries. Sometimes they had to worry about different rules and different government policies across the globe. So it was not very easy for our security teams to keep up. Moving to CloudFront and using the standard WAF and Shield solutions allowed our security teams to really focus their perimeter security work on the edge. Now that we have finished the migration and moved everything to CloudFront, the teams that focus on DDoS protection and the teams that focus on firewalling and protecting from specific kinds of external attacks are having a much, much easier time, focusing all their work in one place. We put some extra effort into making sure that the connection lines between CloudFront and our data centers are strengthened and hardened, extra secured, and not susceptible to DDoS. And now our security team can put all their focus on the edge.
B
So Sarah, I know Ali has just mentioned CloudFront and Lambda@Edge. For someone who is new to CloudFront and Lambda@Edge, how would you describe them?
C
Sure. So Amazon CloudFront is essentially Amazon's CDN, a content delivery network, which means that this service delivers and caches content through a worldwide network of data centers, which in AWS we call edge locations. Ali already touched on that notion before. Alongside this, on the edge, we also have AWS Lambda@Edge. Lambda@Edge is a fully managed serverless compute service which, again, runs at the edge and allows you to customize your content before it hits the origin stack. So this is, I think, the key power of running services at the edge. With Lambda@Edge you can essentially set up trigger events for CloudFront requests, or for responses from your origin stack. And by doing that, you as a customer can customize requests and responses. That opens up a variety of use cases as well.
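Editor's sketch: to make the idea of customizing a request at the edge concrete, here is a minimal Python handler in the shape of a Lambda@Edge viewer-request trigger. It is an illustration only, not Booking.com's code. The header name `X-Example-Variant` and the rule of bucketing by a hash of the client IP are assumptions made up for this example; the event layout (`Records[0].cf.request`, lowercase header keys mapping to lists of key/value pairs) follows the documented CloudFront event structure.

```python
import hashlib


def pick_variant(client_ip: str) -> str:
    """Deterministically bucket a client into variant 'a' or 'b' (hypothetical rule)."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return "a" if digest[0] % 2 == 0 else "b"


def handler(event, context):
    """Viewer-request trigger: tag the request, then let CloudFront continue."""
    request = event["Records"][0]["cf"]["request"]
    variant = pick_variant(request.get("clientIp", ""))
    # CloudFront header maps use the lowercase header name as the dictionary key
    # and a list of {"key", "value"} pairs as the value.
    request["headers"]["x-example-variant"] = [
        {"key": "X-Example-Variant", "value": variant}
    ]
    # Returning the request object tells CloudFront to keep processing it.
    return request
```

The origin could then read the hypothetical `X-Example-Variant` header to decide which version of a page to serve, which is one way the A/B testing use case could be wired up.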
B
Wow, that's super cool. Can you give me an example of when someone would want to customize content at the Edge before it hits the stack?
C
Absolutely. So I've seen customers, for instance, in the media and entertainment industry using Lambda@Edge for manipulation of HTTP Live Streaming manifests. So it can be very specific. Customers have used Lambda@Edge for A/B testing. I've seen customers using Lambda@Edge, for instance, for creating a single page application or for manipulation of HTTP headers. So the sky is the limit in that sense.
B
So I want to tie in what Sarah was just saying about CloudFront and Lambda@Edge. Ali, you said earlier that you had migrated a hundred percent of the traffic to CloudFront. So do you mean those were static assets, or was it dynamic content?
A
Oh, great question. Thank you. I was personally surprised in the beginning when my team was suggesting CloudFront as a possible solution, because in my head, as probably for many of our listeners right now, CloudFront is just a CDN. And when we think about a CDN, we think about our JavaScript files, HTML files, CSS, maybe fancy images and videos, and then we cache them on the edge and that's it. And we are very proud of our cache hit ratio and that's it. But in our case, that's not exactly what we were looking for. We had been caching static assets on the edge for quite a while by that time. What we were looking for was to enhance the connectivity to all of our stack, including the API calls, the stuff that we actually never want to cache. And by doing that we enhance the connectivity of the users, in remote regions to be specific. What I mean by that, I will go a bit technical if that's okay with you. Right now, as you know, if a customer wants to connect to one of our servers, they start by establishing a TCP connection, which sends a SYN, a SYN-ACK, and an ACK. That's one, two, three trips already. And then to establish a secure TLS connection you need another one, two, three, sometimes four or five, depending on what protocol you're using. We're talking about four to seven back-and-forth trips just to establish the connection before we can actually start communicating data. Imagine you are sitting in a very remote region of the world, let's say at the edge of South America, where Booking does not have any presence. If you want to connect to our European servers, going through the very chaotic internet, every round trip could be up to 150 to 250 milliseconds. So the round trips to establish a connection alone will cost something like a second and a half, which is crazy for us. We did a study a while ago and we discovered that every hundred milliseconds of latency is already costing us conversions. And this is where CloudFront is a great solution for us. 
People don't think a lot about it being used as a reverse proxy, but it turned out to be a great solution for that, in comparison to a user connecting directly to our data centers. For example, if you are in that remote location at the edge of the world, there is probably a CloudFront PoP location very close by. As we said, there are 700, 800 of them around the globe, so they are everywhere. So the round trips to establish a secure connection with the edge will be super fast. It is usually within double digits, barely 100 milliseconds, to connect to an edge location. Then the edge location establishes this very expensive round trip with our origin data centers. But here is the trick. We keep those connections alive, sometimes for up to five minutes, and we call this the highway. Now the connection is established, and then we send tens of thousands of requests through this pre-established connection. Last time we checked, after we moved to CloudFront, 99.7% of our traffic comes through a pre-established connection, which massively improved the latency for our remote regions. In some areas of the world we saw an improvement of 30% in request time, page load time, and time to first byte. And that's why we loved using CloudFront, even though for most of our traffic, the dynamic traffic, we set caching to false. We don't cache anything, we just use it as a reverse proxy. And this is what I meant by 100% of all Booking traffic. I meant it. It's everything: the static assets, the images, the JavaScript files, the CSS files, and also the API calls, the jQuery calls, the GraphQL, everything.
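Editor's sketch: the arithmetic behind the connection setup cost can be checked with a few lines of Python. This is a back-of-the-envelope model, not a measurement. It uses only the figures quoted in the conversation (four to seven trips, 150 to 250 milliseconds per trip, 99.7% of traffic on a pre-established connection) and assumes, for simplicity, that every trip costs one full round-trip time. The function names are made up for this example.

```python
def handshake_ms(trips: int, rtt_ms: float) -> float:
    """Total setup time if every handshake trip costs one full round trip."""
    return trips * rtt_ms


def expected_origin_setup_ms(trips: int, rtt_ms: float, reuse_ratio: float) -> float:
    """Average origin setup cost per request when a share of requests reuse a connection."""
    return (1.0 - reuse_ratio) * handshake_ms(trips, rtt_ms)


# Best and worst case from the figures quoted in the conversation.
best = handshake_ms(trips=4, rtt_ms=150)    # 600 ms
worst = handshake_ms(trips=7, rtt_ms=250)   # 1750 ms

# With 99.7% of requests riding an existing edge-to-origin connection,
# the worst-case setup cost averages out to a few milliseconds per request.
amortized = expected_origin_setup_ms(trips=7, rtt_ms=250, reuse_ratio=0.997)
print(best, worst, round(amortized, 2))
```

Under these assumptions the worst case is in the same range as the "second and a half" mentioned above, and connection reuse is what turns that one-off cost into a negligible per-request average.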
B
There's really a lot to unpack there. And I think one thing that really stands out, and it's very evident from what you just explained, is the really elaborate due diligence process that you utilized at Booking.com to test the entire architecture and to test CloudFront. How else were you able to come up with these statistics and know your numbers? It'd be really helpful if you could elaborate on that more, so that our customers can learn how you do your own thorough testing and apply it to their business.
A
Yeah, excellent question. And it's always the first step for me. As a rule, you only enhance what you measure. So if you're thinking about enhancing any part of your infrastructure, the first step is to build excellent observability around it. Before we even sent any traffic to CloudFront, before we even thought about the migration, we built a very comprehensive end-to-end observability solution. We used almost every available tool that CloudFront or AWS allows us to use, starting from just the standard logs that CloudFront offers. In the beginning, we just put them in an S3 bucket and used Athena to query them. But as the traffic was building up very quickly, Athena started to be very, very, very slow and did not handle the big amounts of data easily. So we started building our own pipelines to take all this data, especially the errors and the warnings part, and put it all inside an Elasticsearch/OpenSearch cluster. And that allowed us to easily debug and look at what's happening on the edge. The CloudFront standard logs have a lot of useful information, from which PoP location was used to what the timings and latencies were, which origin was being hit, how long the DNS took, what kind of errors we were having, how much of that traffic is on reused connections and not reused, et cetera. Also, CloudFront allows us a more comprehensive look at what's happening on CloudFront, which is the real-time logs. CloudFront now has a feature to easily hook that up to a Kinesis stream, to put it directly into a Lambda, into events, and also into Elasticsearch. But trying to be very cost effective, we were logging 100% of the errors and only 1% of our normal access logs, because it was enough for us. Moving on to security, WAF and Shield have a very comprehensive observability solution, which was excellent. 
We also built a lot of observability into our reverse proxies in the data centers. We use Envoy and we use HAProxy, which are both heavily instrumented, and you get a lot of useful metrics. And this is where we were monitoring our number of established connections and the ratio of requests to connections, observing the number going from eight requests per connection to 60 or 80 once we started adopting CloudFront. And yeah, we will talk more about Lambda@Edge in a bit and I can talk about the observability around it. I would finally add that we added Server-Timing, which is something I would recommend everyone do. Based on the RFC, there is a standard header you can set up, coming from the backend, sending you the server timing, telling you what the DNS time, the connection time, and the origin request time were. This is something you can enable in CloudFront. So CloudFront will start setting that header and sending it to the client. From the browser you can see that header and see exactly each part of your connection going from the user's browser to CloudFront to your origin and back, and how long each one of those took, including the connect time, which we also used to monitor the number of reused connections. So in collaboration with our front-end teams, we started collecting those Server-Timing headers from the browsers and putting them in our traceability tooling to see end to end what's happening on the request journey.
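Editor's sketch: as an illustration of collecting server timing data, the Python below parses a `Server-Timing` header value into a dictionary. It is a simplified parser, not Booking.com's tooling. The sample metric names (`cdn-upstream-dns`, `cdn-upstream-connect`, `cdn-upstream-fbl`, `cdn-pop`) follow the names CloudFront documents for this header, but the sample values are invented, and treating a zero connect duration as "connection reused" is an assumption that should be checked against the CloudFront documentation. The parser also assumes that descriptions contain no commas or semicolons.

```python
def parse_server_timing(header: str) -> dict:
    """Parse 'name;dur=1.2;desc="x", other;dur=3' into {name: {param: value}}."""
    metrics = {}
    for entry in header.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue  # skip empty entries such as a trailing comma
        params = {}
        for part in parts[1:]:
            key, _, value = part.partition("=")
            params[key.strip().lower()] = value.strip().strip('"')
        if "dur" in params:
            try:
                params["dur"] = float(params["dur"])
            except ValueError:
                del params["dur"]  # ignore durations that are not numbers
        metrics[name] = params
    return metrics


def reused_origin_connection(metrics: dict) -> bool:
    """Assumption: a connect duration of exactly 0 means an existing connection was reused."""
    return metrics.get("cdn-upstream-connect", {}).get("dur") == 0.0


sample = 'cdn-upstream-dns;dur=0,cdn-upstream-connect;dur=0,cdn-upstream-fbl;dur=177,cdn-pop;desc="AMS1-P1"'
parsed = parse_server_timing(sample)
print(parsed["cdn-upstream-fbl"]["dur"], reused_origin_connection(parsed))
```

A front end could read the same header from the browser's Resource Timing entries and forward the parsed values to a tracing backend, which is the kind of end-to-end view described above.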
B
It's really amazing, the entire pipeline that you've built to understand your metrics. And I'm also curious, from that experience, whether you have any advice specifically for a business that really wants to build better observability, since I think it's clear that a lot of other businesses can learn from your engineering culture.
A
Yeah. So in addition to everything I just said, that comprehensive observability is invaluable and everyone should have it, and that it's an essential first step to improving anything, because it's impossible to improve something that you don't collect an SLI around, I would mention: keep a keen eye on the observability cost. This is something I would admit that I personally and my team overlooked a little bit. We were so focused on the functionality and on building something comprehensive, and then it hit us that observability had become a very significant chunk of our total cost, both for CloudFront and for Lambda@Edge. And we had to do a lot of work to then fine-tune that observability cost to get it right. So if I were building end-to-end observability, of course everything I said still stands. But I would set aside time very early on to look at observability costs, especially since some of the AWS services, for example CloudWatch, charge you per gigabyte and per API call that you send to them. This cost can add up super quickly, especially when you are a big customer like us, sending trillions of requests every minute or every hour. Yeah, this hit us a little bit in the beginning, and we had great collaboration from AWS to help us optimize and get it right in the end.
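Editor's sketch: one cost lever mentioned earlier in the conversation is logging 100% of errors but only 1% of normal access logs. A minimal Python version of that sampling rule could look like the following. The function name, the choice of treating any status code of 400 or above as an error, and the injectable random source are assumptions made for this example; the source does not say how the sampling is implemented.

```python
import random


def should_log(status_code: int, sample_rate: float = 0.01, rng=random.random) -> bool:
    """Keep every error record, and a random sample of successful requests."""
    if status_code >= 400:  # assumption: 4xx and 5xx responses count as errors
        return True
    return rng() < sample_rate


# Rough check of the volume reduction on synthetic traffic with 2% errors.
records = [500 if i % 50 == 0 else 200 for i in range(100_000)]
kept = sum(should_log(code) for code in records)
print(f"kept {kept} of {len(records)} records")
```

Because the decision is made per record before anything is shipped, the volume sent to a paid logging backend shrinks roughly in proportion to the sample rate while error visibility stays complete.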
B
So Ali, maybe you could explain how the entire architecture works using CloudFront and Lambda@Edge.
A
CloudFront and Lambda@Edge. Okay, sure. Let's do this. Let's follow the journey of a request from the browser to our backend and back. I think this will give a very good picture of how we are set up. So let's say you are a customer who wants to go to booking.com, or you are using the mobile app, which calls our mobile APIs at booking.com or something like that. The first step, of course, is DNS. DNS will resolve our booking.com domain into a CloudFront distribution domain. The CloudFront distribution domain by design is set up with a global load balancer based on DNS with a geolocation policy. So if you are sitting, let's say, in Europe, the CloudFront distribution domain will give you the IP address of the location nearest to you. If you're sitting in the US, if you are in New York, if you are in California, it will always find the nearest edge location. And that all happens with the magic of the dynamic DNS stuff. So Route 53 takes care of that for CloudFront, and then the user gets the IP address of the nearest PoP location. As I mentioned, because the PoP locations are now very well distributed, usually there is one very, very close to the customer. Those four, five, six, seven back-and-forth trips get done super quickly and establish, usually, an HTTP/2 or HTTP/3 connection to the edge location, and then the browser will establish up to six connections and start sending requests to the edge. On the edge, as we mentioned, even if this IP is new to the edge, in every PoP location there is a pool of connections established to our own origins. Those are very expensive. That's why we keep them alive for as long as possible. 
If I'm not mistaken, I think we keep our connections alive for five full minutes, and we send tens of thousands of requests across each. And yeah, of course TLS termination happens on CloudFront, and then we establish another secure connection to our data centers. We inject a secret header, and we have ACLs to only accept connections from CloudFront on our public-facing endpoints. We also use various methods to try to force the traffic to stay on the AWS backbone, which is more reliable and faster than going through the public internet. I'm not going to dig more into that. But then the traffic will be received by our reverse proxies or load balancers in our data center, and it goes from there to the backend. One of the interesting parts of this journey is that we built failover and redundancy into each of those layers. CloudFront takes care of the redundancy on the first layer. If an edge location is down, they immediately fail it over to the next one. That's something we delegate to AWS, or trust AWS with, completely. The second layer is to round-robin the traffic between our multiple data centers and to make sure we are very highly available. We have multiple data center locations, and as the origin behind CloudFront we use another Route 53 DNS name that splits the traffic equally between our data centers. We also have health checks on top of each one of those data center domains. If one of our data centers is down, within 30 seconds we route all the traffic to the other two or three or four, whatever we have up at that time. There is a very complex traffic management and traffic routing solution there that I'm not going to go into, but this is the high-level journey.
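Editor's sketch: the secret header check described above can be illustrated with a small Python function that an origin-side proxy or application might run. This is a sketch under assumptions, not Booking.com's implementation. The header name `x-origin-verify` and the way the secret is supplied are hypothetical, and a real deployment would pair this with the network ACLs mentioned in the conversation and with regular rotation of the secret.

```python
import hmac


def is_from_cdn(headers: dict, expected_secret: str,
                header_name: str = "x-origin-verify") -> bool:
    """Return True only if the request carries the shared secret header."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    presented = normalized.get(header_name.lower(), "")
    # compare_digest avoids leaking, through timing, how many characters matched.
    return hmac.compare_digest(presented.encode("utf-8"),
                               expected_secret.encode("utf-8"))


secret = "example-secret-value"  # hypothetical; would come from a secret store
print(is_from_cdn({"X-Origin-Verify": secret}, secret))   # request via the CDN
print(is_from_cdn({"Host": "example.com"}, secret))       # direct request
```

Requests that fail the check would be rejected at the origin, so that traffic which bypasses the CDN never reaches the backend.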
B
Wow. And how long did it really take for you to get to this point, with this layer of redundancy throughout the architecture?
A
Oh, this was designed in. It was one of our requirements since day one. It's a general resiliency requirement at Booking to build failover and redundancy at every layer of the traffic, all the way from DNS to CDN and down to our HF, to our reverse proxies, and even our backends. We are very fault tolerant. We even have some chaos engineering that just randomly, sometimes, deletes one of the network routes or deletes one of our whole data center connections, and we're very resilient to that. There is a very, very small effect on the user experience even when that happens.
B
So that is awesome that you are using chaos engineering. I hear from customers who want to use it, but they're a bit scared of performing chaos engineering. And maybe I should first ask if you can explain to the listeners, for anyone who's new to the term, what is chaos engineering?
A
Yeah, sure. I always like to start with the why. The only way for you to be able to absorb failure reliably is to fail all the time. And that's exactly what chaos engineering does. It teaches everyone, every service owner, every platform owner, every network owner as well, that we should build our systems to be very fault tolerant. If any part of our systems is down, we should have ways to automatically heal and recover. Usually we want that to happen very quickly. And this is what chaos engineering does. We look at all the layers and we double check that every layer is redundant by making part of it fail intentionally. So when it's failing all the time, we are learning all the time. We are enhancing the reliability all the time. And that's what chaos engineering is, in very short terms.
B
Love that explanation. So how often do you just like test overall and then implement Chaos Engineering?
A
We have various policies depending on the service. The minimum requirement is to fail over once a quarter. For some services we do it almost weekly, and even at random times. But in general, the minimum requirement we have at Booking is that every single component, especially the high criticality ones, has to be failed over at least once a quarter.
B
Wow. So you said failover once a quarter. When you say failover, do you mean that you're testing once a quarter what would happen if you had to fail over?
A
Yeah, it could be anything. We don't even announce what it is. We do it kind of abruptly. So my team, for example, owns the DNS infrastructure. Once a quarter we go into the internal DNS in one of our data centers and we just shut off all the internal DNS servers. We stop broadcasting the IPs, and all the servers now have to reroute their DNS to a second data center. Another example is my load balancers, the ones facing the internet that we talked about, which receive the traffic from CloudFront. Also randomly, without notifying anyone, once a quarter we go and shut off all of them in one of the data centers. And everyone in the company, all the architecture around it, next to it, on top of it, has to adapt. We have a combination of DNS, health checks, and anycast in all the layers that allows us to recover almost instantly, without any disruption to the business or with a very minimal one.
B
There are a lot of companies that don't do this. They want to. They're already at a stage where they have business critical applications. So what advice do you have? Because this is really a cultural change as well. What advice do you have for a business to be able to start, let's say, testing more frequently? Maybe they do it once a month, or maybe they just pray that nothing bad happens, which I don't recommend.
A
We didn't get to this overnight. It actually took us years of different methods until we got to this point. If you are a business and you want to start implementing some sort of chaos engineering, I would start with planned drills, where we announce to everyone, months ahead of time, that on a given date, say the end of October, we are going to shut down this specific region. And to every single service owner, every network engineer, every platform owner: please do your drills, do your checks, and be ready for that. In the beginning, of course, you will always have little failures. You will pay the cost of failures as well. But this has a great return on investment in the longer term. You start with planned drills, then you make those drills more frequent. You start by doing it once a year, then twice a year, and then you start doing it quarterly. You start making this part of your business as usual, where your reliability organization is very comfortable with doing this. As you do it more, you build more confidence, and then you can start to apply policies across the company: every medium criticality service has to do it once a year, every high criticality service has to do it once a quarter. And slowly you get to a point where those failovers become business as usual as well. At that point maybe you can start thinking about introducing chaos engineering, where you can let teams know: I'm hooking up your service to the chaos engineering API, and at some random time it will be failed, and it will be recovered after this amount of time, and you should be ready for that. But this is not something you can do overnight. It takes a lot of time to build the culture, to build the automation and tooling, and eventually to get to that point.
B
And then this is so fascinating. And I'll have one, one last question on this because I know a lot of people are probably wanting me to ask these questions on this topic. So in the chaos engineering, when it happens, do you let like the other engineers like know after the fact or they've gone to the point where they kind of know if something were to happen and you're able to be resilient that it was just maybe like a chaos engineering like test?
A
I'm not sure I want to keep talking about this topic. It's not really my area, but I can area.
B
Okay. All right, so I'll just like we.
A
Are, we are ready for it and. Yeah, so I'm not even sure how to answer this question.
B
That's okay. All right.
A
Never.
B
Yeah, it was just something I thought of randomly. But okay, I'll put a marker to get rid of that. That's fine. So I'll move on to the next question. Okay, Sarah, I know there are probably some listeners who are hearing about the architecture that Booking.com has with CloudFront and Lambda@Edge, and they're probably thinking: wow, that's a really advanced architecture. I get that Booking.com can do that, but we're a small or medium-sized business, we're still a startup. Is this the type of architecture that could also be feasible for my business?
C
So something to keep in mind is that CloudFront and Lambda@Edge are services that can be adopted from day one, regardless of whether you are a startup, a scale-up, or an enterprise. What customers at any scale like about these services is that they can offer you, as Ali mentioned, this global reach. That unlocks not only a lot of opportunities and advancement on the engineering side of your organization, but also reach into new markets for your business. And these services allow you to do that without the burden of managing complex on-premises infrastructure. We do need to acknowledge that Booking operates at a gigantic scale, like the numbers that Ali provided, and that use case obviously does not apply to a startup or scale-up. Ali talked about migrating a complex on-premises edge network infrastructure, like mini PoPs and data centers. Startups and scale-ups wouldn't have that on-premises edge network to manage or migrate in the first place. They typically begin with their infrastructure deployed directly in the cloud, and they would generally adopt CloudFront and Lambda@Edge as their primary edge solution from the beginning. It's also important to keep in mind that it's not only about management, complexity, and legacy on-prem infrastructure, but also about the order of magnitude of the scale. The traffic that Booking has, as well as its usage of the services, is bigger. Startups and scale-ups could start with much smaller traffic. But this should offer reassurance that as your startup or scale-up grows, those services will be able to handle the growing traffic and the growing number of users with high availability and high scalability. And that is important to keep in mind.
So these services are accessible and fit different use cases regardless of your company size. That being said, I also want to give a shout-out to Ali because, to be fair, I do feel that even though Booking is a large enterprise, Ali's team really moved with the agility of a startup. So shout-out to him and his team, and I say this in the best way possible. Everything is a journey, but I think his journey was quite effective and fast. So congratulations to him and the team.
A
Thank you.
B
All right, so maybe you can tell me more about this architecture now that you've adopted CloudFront. Let's talk more about the Lambda@Edge part and how you're using Lambda@Edge.
A
Yeah, sure. So very early on, when we were thinking about CloudFront, this Lambda@Edge thing kept popping up. It seemed to be a great opportunity. Historically, we used multiple types of reverse proxies in front of our traffic, and we had very limited capability for executing business logic there. Usually it was something very simple. We used some Lua scripts and some native configuration in the reverse proxies just to manipulate a header, remove a header, add something, some security stuff, and that's it. Then once we saw Lambda@Edge, a big opportunity presented itself. Now we can execute very sophisticated and complicated logic on the edge, written in Node.js or Python. And as mentioned, we have a very complex ecosystem. We have all sorts of infrastructure, going from private cloud to public cloud to monolith, et cetera. We never had one unified layer of business logic on top of everything. So that very quickly became an opportunity where we had teams inside Booking fighting over who should own that Lambda@Edge. We are very limited: we can only execute a few Lambdas per request. We can execute a Lambda on the viewer request to the edge, we can execute a Lambda on the origin request from the edge to the origin, and then we can execute Lambdas on the way back, on the responses from the origin and to the viewer. We had at least 10 use cases from teams who told my team that they should be the ones taking that Lambda@Edge: it's great for us, it will have great benefits for the business. All the way from authentication to traceability to security headers, to bot control, to locale and language and currency. All sorts of amazing things that could contribute quite a lot to the business. So my team and I took a step back and thought: okay, it seems like this Lambda@Edge thing would be great for multiple people in the team.
Let's actually build a platform on top of Lambda@Edge to allow all those modules that make sense on the edge to exist side by side. What I mean by that is we set the requirements. We need a way to allow all those modules to run, ideally in parallel. Lambda@Edge is super critical: if there is a failure there, it will fail all our paths. So we needed a fail-safe there as well, and if any module has an issue, we don't want it to take our whole website down. We needed a way to make it configurable and to build some firefighting tools where we can enable or disable a module with the click of a button. We took all those requirements and many more, including latency, where in the beginning we set 50 milliseconds as the maximum we wanted to add to every request. And then we designed our own framework. We built it with TypeScript, so anyone at Booking who wants to implement something on Lambda@Edge can just import our interface and implement their handler function, and we have our own tooling that takes all those modules (right now I think we have around 12 of them) and bundles them into one zip file that we deploy on Lambda@Edge everywhere. Our function runner will, in a pseudo way, try to execute those modules in parallel and in isolated, kind of mimicked, sandboxes. If a module has an issue, it will fail just that module. If a module tries to run beyond the set time, we will just time it out and let the request continue. So that's where we stand right now. We did a lot of optimization: trying to find the right Lambda size, adding traceability, connecting it to OpenTelemetry, and optimizing the performance of the different modules. Because it's now global, we also needed to find a way to get data to the edge, whether that is DynamoDB global tables or using CloudFront and S3 next to Lambda@Edge. It's a long journey with a lot of details.
I don't think we have time to dig into all of it. But in the end we ended up with this amazing Lambda@Edge platform that allowed us to do a lot of interesting stuff. Just to give two examples: as I mentioned in the beginning, at Booking it is a religion for us to be data centric and to run experiments on top of everything. One of the amazing things we added during a recent EBA, working very closely with AWS to fast-track part of our infrastructure into AWS, was to move the experimentation system to the edge, where for the first time in Booking history we can run experiments trying out some services between our legacy infrastructure and our modernized infrastructure in AWS. And all of that is happening on Lambda@Edge: the coin toss, the blob object, and even the addition of cookies, et cetera. Authentication was another one. For the first time in Booking history, we have a way to cover all sorts of infrastructure with one authentication cookie that covers everything. And one of the side effects of that is that it freed our customers, our service owners, from being tightly coupled to some legacy infrastructure. For example, by moving authentication, bot control, and the experimentation system to the edge, a module owner inside our legacy monolith application is not stuck there anymore. They can move to a modernized, containerized Java application running on AWS. And because all that good stuff is running on the edge, they get it out of the box without needing to stay stuck where they are.
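The runner described in this answer (modules executed side by side, each with a time budget, each able to fail without failing the request, each switchable on or off) could be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not Booking.com's framework: `EdgeModule`, `EdgeRequest`, `runModules`, and the `enabled` set are hypothetical names, and the timeout only stops waiting for a slow module rather than cancelling its work.

```typescript
// Hypothetical request shape and module contract (illustration only).
interface EdgeRequest {
  uri: string;
  headers: Record<string, string>;
}

interface EdgeModule {
  name: string;
  handle(request: EdgeRequest): Promise<void>;
}

type ModuleStatus = "ok" | "failed" | "timed_out";

interface ModuleResult {
  name: string;
  status: ModuleStatus;
}

// Sentinel value used to tell a timeout apart from a normal completion.
const TIMEOUT = Symbol("timeout");

// Run one module. Never throws, so one bad module cannot fail the request.
async function runOne(
  mod: EdgeModule,
  request: EdgeRequest,
  budgetMs: number
): Promise<ModuleResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<typeof TIMEOUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMEOUT), budgetMs);
  });
  try {
    // Whichever settles first wins: the module or its deadline.
    const outcome = await Promise.race([mod.handle(request), deadline]);
    return { name: mod.name, status: outcome === TIMEOUT ? "timed_out" : "ok" };
  } catch {
    return { name: mod.name, status: "failed" };
  } finally {
    clearTimeout(timer);
  }
}

// Start every enabled module at once and collect one result per module.
async function runModules(
  modules: EdgeModule[],
  request: EdgeRequest,
  enabled: ReadonlySet<string>,
  budgetMs: number
): Promise<ModuleResult[]> {
  const active = modules.filter((m) => enabled.has(m.name));
  return Promise.all(active.map((m) => runOne(m, request, budgetMs)));
}
```

In this sketch the `enabled` set stands in for the "click of a button" switch: removing a name from it skips that module on the next request, and the returned statuses are what a dashboard or trace would record.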
B
I'd love to understand what you saw as the business impact of moving a lot of this logic to the edge.
A
Well, the very short answer is accelerating modernization. As I mentioned, we had the goal for a long time to modernize part of our infrastructure, and that was usually very hard, mainly because of those common middlewares: very old, heavy, complex logic that lives in the form of middleware inside our legacy systems. By moving this to the edge, with the edge covering all the other kinds of infrastructure, we really freed our service owners to move easily between them, from bare metal to private cloud, to our Kubernetes cluster, even to native AWS solutions, basically using an API Gateway with a Lambda behind it.
B
So Sarah, I know earlier you were talking about Lambda at the Edge, but it would be really helpful to hear from your perspective about how customers can use Lambda to really get more granular with their own business logic at the edge.
C
Yeah, so what I like about this service personally is that it's an intersection between serverless and the edge network. It's not about running any code in a serverless environment, but about running code specifically at the network edge, which for you or your business means closer to your end user. Ali already gave a great overview of how Booking uses Lambda@Edge capabilities. Customers can use Lambda@Edge to intercept HTTP requests before they hit your regional stack on AWS, and also HTTP responses, of course, after they leave that origin, that regional stack, and before they reach your viewer, so your clients, your browser or mobile apps, and so on. I've seen many customers use Lambda@Edge to route requests to different regions, for instance based on user agent metadata, or information about the location or device type. Booking already leverages a lot of these capabilities, like A/B testing and experimentation. The capability really unlocks experimentation, which again is tied not only to technical metrics but also to business metrics. So you're able to quantify the effectiveness of something that you have changed in your system or platform, closer to the edge. And you can also add totally custom business logic. Executing code at the edge also allows you to customize content based on other parameters. But not only that: you can also implement redirects, for instance. Just as a reminder, the request and response manipulation that the customer sets up can be done on a viewer request, an origin request, an origin response, and a viewer response. And outside of A/B testing, I've seen customers doing traffic splitting as well. You can personalize your content, so that unlocks a lot of capability when it comes to providing an enhanced experience to your own end users. Ali also mentioned a little bit about security and authentication.
I've seen customers using it for validating tokens or credentials at the edge, reducing the burden and the load on the regional stack. Ali also mentioned bot protection. That's another very common use case, and you can add security headers. We mentioned enhancing business metrics, A/B testing, experimentation, and so on. You can also use it for performance optimization, such as dynamic compression and image optimization. I did mention at the beginning that some media companies use it to manipulate manifest files. That's another cool use case from my perspective. So you can do real-time content transformation, and I think all of this can really unlock a lot of opportunities for businesses. At the same time, it can really accelerate the speed of innovation, which will benefit your own customers as well.
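To make the A/B testing and traffic-splitting use case concrete, here is a hedged TypeScript sketch of a viewer-request style handler. The local types only mirror the general shape of a CloudFront request event (`Records[0].cf.request`, lower-cased header names mapping to arrays of key/value pairs). The cookie name `visitor_id`, the header `x-experiment-variant`, the experiment name, and the 10 percent split are all assumptions made for illustration. A deterministic hash is used in place of a random coin toss so that the same visitor always lands in the same variant.

```typescript
// Minimal local types mirroring the shape of a CloudFront request event.
interface CfHeader { key?: string; value: string }
interface CfRequest { uri: string; headers: Record<string, CfHeader[]> }
interface CfEvent { Records: Array<{ cf: { request: CfRequest } }> }

// Hypothetical experiment settings.
const EXPERIMENT = "new-stack-rollout";
const TREATMENT_PERCENT = 10;

// FNV-1a style 32-bit hash: small, dependency-free, deterministic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a visitor to a bucket in [0, 100) and compare against the split.
function assignVariant(
  visitorId: string,
  experiment: string,
  treatmentPercent: number
): "control" | "treatment" {
  const bucket = fnv1a(`${experiment}:${visitorId}`) % 100;
  return bucket < treatmentPercent ? "treatment" : "control";
}

// Tag the request with its variant so later routing logic can act on it.
async function handler(event: CfEvent): Promise<CfRequest> {
  const request = event.Records[0].cf.request;
  const cookies = (request.headers["cookie"] || []).map((h) => h.value).join("; ");
  const match = /(?:^|;\s*)visitor_id=([^;]+)/.exec(cookies);
  const visitorId = match ? match[1] : "anonymous";
  const variant = assignVariant(visitorId, EXPERIMENT, TREATMENT_PERCENT);
  request.headers["x-experiment-variant"] = [
    { key: "X-Experiment-Variant", value: variant },
  ];
  return request;
}
```

An origin-request function, or the origin itself, could then read the header to choose between a legacy and a modernized backend. How the variant is actually consumed is outside this sketch.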
B
Wow. It's really amazing just how many use cases there are for Lambda@Edge. And it really sounds like a lot of customers that aren't using Lambda@Edge already have an architecture where they're doing this somewhere else. So I think it requires a different level of thinking about how to architect your application. Sarah, I'm curious about your perspective: any advice on how customers should think about architecting so they can shift some of this logic that they're already doing, like device type detection and A/B testing, over to the edge?
C
Yeah, great question. I would apply a lot of the best practices that we recommend for serverless computing and Lambda in general. A lot of these were covered by Ali and delivered by Booking as part of their journey. Ali mentioned troubleshooting and observability: having monitoring dashboards and alerts in place so you can troubleshoot as the scale grows. When your business is successful, you want to minimize business impact, so this is something I would definitely leverage. Another aspect that Ali also mentioned is cost management. Lambda@Edge is charged along two dimensions: the number of requests and the function duration. When you're a startup, those two dimensions may not have much impact on your cost. But at scale, small inefficiencies can contribute to unnecessary cost. So, and this is something I typically advise customers to do, be really mindful about the trigger you use to invoke your Lambda@Edge function. Do you want your function to be executed only when there is a cache miss? Then perhaps you should use origin triggers. Do you want your function to be executed for all requests? Then maybe you should use a viewer trigger. Think about which events are best suited, so you avoid inefficiencies when it comes to cost. CloudWatch costs should also be taken into account, so logging output needs to be really fine-tuned. Adopt best practices like log sampling and selecting the right log level. Make sure you use structured logging so that you're able to understand and troubleshoot when something happens. All of this, selecting the right log level, sampling logs for a specific percentage of requests, and in production perhaps logging only errors instead of the whole request invocation outside of the sample, is a good way to keep cost under control.
But in general, another best practice when it comes to Lambda@Edge is performance optimization, which means keeping your Lambda code as lean and small as possible. Minimize dependencies. If you're using Node.js, you can use tree shaking, for instance. That's a typical way to remove the dependencies that you don't need. So keep your code as lean as possible. Your Lambda is not only going to be more performant: you're going to reduce your cold starts, and you're also going to reduce your duration, which has a positive impact on the overall cost. So these are the pillars that I would keep in mind, especially as your company grows: performance optimization, good monitoring and observability in place, cost management, and fine-tuning. Be mindful about this. You can transfer a lot of the knowledge that you may already have from Lambda in the regional stack. But yeah, this is what I would typically advise customers.
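The logging advice above (structured output, a minimum log level, and sampling a percentage of requests for full detail) can be sketched as a small TypeScript helper. This is an illustration only: `createLogger` and its options are hypothetical names, not the API of any AWS library, and the sampling decision is made once per logger so that a sampled invocation logs everything while the rest log only at or above the minimum level.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

const ORDER: Record<Level, number> = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

interface LoggerOptions {
  minLevel: Level;               // level used for non-sampled invocations
  sampleRate: number;            // fraction of invocations logged in full, 0..1
  random?: () => number;         // injectable for deterministic tests
  sink?: (line: string) => void; // where the JSON lines go
}

// Create one logger per invocation; returns a function that reports
// whether the line was actually written.
function createLogger(opts: LoggerOptions) {
  const random = opts.random || Math.random;
  const sink = opts.sink || ((line: string) => console.log(line));
  // Decide once whether this invocation belongs to the detailed sample.
  const sampled = random() < opts.sampleRate;
  const threshold = sampled ? ORDER.DEBUG : ORDER[opts.minLevel];

  return function log(
    level: Level,
    message: string,
    fields: Record<string, unknown> = {}
  ): boolean {
    if (ORDER[level] < threshold) return false;
    // Structured (JSON) output keeps the line queryable when troubleshooting.
    sink(JSON.stringify({ ...fields, level, message, sampled }));
    return true;
  };
}
```

With `minLevel: "ERROR"` and `sampleRate: 0.01`, for example, roughly one invocation in a hundred would emit its full log stream and the others would emit errors only. The right values depend on traffic volume and how much detail troubleshooting needs.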
A
Yeah, great advice, Sarah. I would just add to that: performance optimization was a very important one for us. We spent a lot of time trying to lower the size of the total Lambda zip file. With tree shaking and many other techniques, we managed to get it to the right size. And by fine-tuning the Node.js side of our code and trying different sizes of Lambda@Edge, we managed to lower our average latency per request from the 50 we started with to 40, down to 12, and now it's only around 5 milliseconds on average per request. That was mainly due to finding the right size of Lambda. Too big was too expensive and was actually adding latency, and too small was also not great because it added a lot to our cold starts. Just by fine-tuning and playing with it, we found the right size in the end. And that's still, I need to remind you, running 10 modules in parallel at that 4 millisecond average. Wow.
B
I absolutely love all of those recommendations, and Ali, how you've been able to apply them at Booking, from performance optimization to cost optimization. I want to get into that as well, because I love that Sarah was getting into cost optimization. This is a topic that is always top of mind for customers. So Ali, I'd love to hear about your experience with cost optimization across this entire architecture.
A
Yeah, so there is a lot to be said about cost optimization. In addition to what I said about observability cost, about finding the right size of Lambda, and about really iterating multiple times over the efficiency of your code, I would maybe add a story. I think you already mentioned it: how we saved half a million with one line of code. When you build a normal Lambda, there is a feature that comes with it. When you enable logging, Lambda by default emits three log lines: a start line for when the request started, an end line for when it ended, and a report. This report tells you how many resources were consumed by the request and how much total time it took. With the Lambda advanced logging controls, a feature that was available only on normal Lambda but not on Lambda@Edge, you can easily, with the click of a button, disable these three log lines. They are sent as an API call to CloudWatch, and the average cost, it's different per region, but I think it's around half a dollar per gigabyte. That's the standard cost across AWS. When you think about how many billions and trillions of requests we handle, those three lines alone, which are on average 300 kilobytes, if I'm not mistaken, add up to tens of thousands of gigabytes, which really added up to a huge sum that we saw in CloudFront. So for a while we tried to find a way to delete those lines or disable logging, but we couldn't do that. So we had to engage very closely with AWS. With the help of Sarah and other solutions architects, we escalated that and said: we need the same advanced log controls on Lambda@Edge that we have on normal Lambda, because this is costing us a lot of money without any real business value. So eventually AWS engaged with us. They took the requirements and they took action by enabling the advanced logging feature on Lambda@Edge as well.
And yeah, it literally took us one little line in our Terraform module to disable those standard logs, things that we don't use. And it really saved us tens of thousands a month, which adds up to more than half a million a year.
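The figures quoted in this answer can be cross-checked with simple arithmetic. The TypeScript sketch below is illustrative only: the function names are hypothetical, a gigabyte is taken as 10^9 bytes, and the price is treated as a flat rate per ingested gigabyte with no tiers or regional differences.

```typescript
const BYTES_PER_GB = 1e9; // assumption: decimal gigabytes

// Annual cost of shipping a fixed number of log bytes per request.
function estimateAnnualLogCostUsd(
  requestsPerMonth: number,
  logBytesPerRequest: number,
  usdPerGb: number
): number {
  const gbPerMonth = (requestsPerMonth * logBytesPerRequest) / BYTES_PER_GB;
  return gbPerMonth * usdPerGb * 12;
}

// Inverse view: monthly log volume implied by a given annual bill.
function gbPerMonthForAnnualCost(annualUsd: number, usdPerGb: number): number {
  return annualUsd / 12 / usdPerGb;
}

// Half a million dollars a year at half a dollar per gigabyte.
const impliedGbPerMonth = gbPerMonthForAnnualCost(500_000, 0.5);
console.log(Math.round(impliedGbPerMonth)); // about 83,333 GB per month
```

At the quoted rate of roughly half a dollar per gigabyte, half a million dollars a year corresponds to about 83,000 GB of log data a month, which is in line with the "tens of thousands of gigabytes" mentioned in the conversation. The per-request inputs to `estimateAnnualLogCostUsd` are left to the reader, since the episode does not give exact request volumes.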
B
This is absolutely amazing. I know I've learned a lot, and I know the listeners have learned a lot. So one last question for each of you. Ali, is there any other piece of advice that you would have for customers who are thinking about moving to more of an edge-based architecture?
A
A piece of advice for people using the edge? I would say don't be afraid to engage with AWS and ask the hard questions. Multiple times I noticed my engineers treating AWS like a black box, not really wanting to engage deeply with AWS and ask the hard questions to uncover what happens under the hood. But what I discovered, working very closely with solutions architects, product managers, and support engineers, is that they are very receptive to feedback and often very engaged, with a customer-centric view. When we bring a problem or an issue we're trying to solve, we usually get more feedback than we expect, especially when we are really trying to optimize the small fine-tuning things. I have a lot of great solutions architects in mind who were super helpful to us, from people who are network focused to CDN focused, and also the product managers, who would engage with us and take our requirements, under standard NDA, way ahead of the market release of some features, so that they would get exactly the features that we need. A great example, one that Sarah was engaged on, is the recent SaaS feature that CloudFront published. We had a requirement to onboard thousands of domains to CloudFront without needing to handle SSL certificate issuing and create an independent CloudFront distribution for each one of those. So we've been engaged with AWS for almost a year now. They were collecting requirements from us as an early adopter of this new feature, which will allow us to disconnect the front end from the distribution of the backend, letting us use CloudFront more like a reverse proxy. And yeah, this is my advice: just don't be afraid to nag AWS and ask for exactly what you need. They might say no, they might say yes, but if they say yes, you get exactly what you want.
B
Really good. Sarah, what about you?
C
So first of all, thank you for the shout-out, Ali. That really means a lot. For me, the advice is twofold. First, I want to echo what Ali just said. Don't underestimate the value and the influence that you can have as a customer over the roadmap of a service, in this case the edge services. So please engage with service teams, and have your solutions architect, working with your account team, create these feature requests on your behalf. Make them your advocate within AWS so we can improve our platform, tooling, and capabilities. That is point number one. Point number two is that, in the end, this technology, these features, these capabilities are a means to a goal. So as an engineer, as part of your engineering organization, knowing the flexibility and the broad spectrum of capabilities, and having listened to how Ali and Booking are using Lambda@Edge and CloudFront, think about which markets you can unlock that you haven't yet unlocked as a business. Partner with the business stakeholders within your company to really leverage those capabilities and unlock new business opportunities. It's not only about speed of innovation, but also about opportunities that can enhance your own customer experience and help you evolve your own product. So think about that as well.
B
Love it. Really good piece of advice. This has been such a fascinating conversation. Ali, Sarah, thank you so much for being here on the AWS podcast.
C
Thank you for having us.
A
Yeah, thank you, Gillian. And thank you, Sarah, for inviting me as well.
Release Date: October 13, 2025
Guests: Ali (Networking & Traffic Management, Booking.com), Sarah (Principal Solutions Architect, AWS)
Host: Gillian Ford
This episode offers a deep dive into how Booking.com modernized its global infrastructure by migrating edge networking to AWS CloudFront and Lambda@Edge. The conversation demystifies their architectural considerations, observability strategies, resilience via chaos engineering, granular business logic at the edge, and transformative cost optimizations—including saving $500,000 with a single line of code. It’s a case study in large-scale, data-driven transformation, but one with clear lessons for organizations of any size.
Introduction to Booking.com
Legacy Infrastructure and Challenges in 2022
Core Requirements:
CloudFront as the Chosen Solution:
What is CloudFront?
What is Lambda@Edge?
Common Lambda@Edge Scenarios:
Dynamic Content and Edge Reverse Proxy:
Technical Details:
Advanced Experimentation:
Observability Practices:
Advice:
Request Journey:
Chaos Engineering:
Advice for Adoption:
CloudFront and Lambda@Edge are for Everyone:
Shoutout:
General Guidance:
Cost Optimization Wisdom:
Engage Deeply with AWS:
Edge Technology as an Enabler:
This summary captures the technical depth, hands-on lessons, and strategic guidance shared by Booking.com and AWS during this episode. Whether for architects, engineers, or business stakeholders, the discussion covers actionable strategies for adopting and optimizing global edge infrastructure.