Streamlining Cloud Infrastructure Deployments with Jake Cooper - Software Engineering Daily

Software Engineering Daily Podcast Summary

Title: Streamlining Cloud Infrastructure Deployments with Jake Cooper
Host: Shawn Falconer
Guest: Jake Cooper, Founder and CEO of Railway
Release Date: July 22, 2025

1. Introduction and Background

The episode kicks off with host Shawn Falconer introducing Jake Cooper, the founder and CEO of Railway, a platform designed to simplify the deployment and management of cloud applications. Jake shares his journey from graduating from the University of Victoria in British Columbia to establishing Railway in the Bay Area.

Notable Quote:

"I think there's almost like this brain drain kind of pull... you pay the same amount of tax dollars and you get, you know, twice X the ambition plus twice X the sun."
— Jake Cooper [00:59]

2. The Challenge of Cloud Infrastructure Deployment

Jake describes the complexities developers face when moving applications from local environments to the cloud. He highlights the often fragmented deployment lifecycle, encompassing infrastructure provisioning, scaling, and managing dependencies, which can impede developer productivity.

Notable Quote:

"It's how do you make changes, how do you, like, go and get them reviewed, how do you go and add a database in another environment... This whole kind of split universe phenomenon of staging, et cetera."
— Jake Cooper [02:11]

3. Railway's Unique Solution

Railway addresses these challenges by offering an intuitive, developer-friendly platform that automates and streamlines the deployment process.

a. Intuitive UI and Layered Canvas

Jake explains Railway's user interface, which lets developers add services such as databases or deployments from GitHub by simply asking for them on a canvas, abstracting away the underlying complexity.

Notable Quote:

"We've built a really intuitive UI that is kind of, like, layered. It's a canvas. You essentially just go to it and you kind of just like, spew out, hey, give me Postgres... Redis."
— Jake Cooper [06:07]

b. Automated Docker Image Generation

Railway automates the creation of Docker images by statically analyzing user code, eliminating the need for manual Dockerfile configurations.

Notable Quote:

"We will go and statically analyze it, we will go and figure it out and stuff like that. So you don't have to actually, like, write anything to get started."
— Jake Cooper [06:07]
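
The detection step described in the episode (use a Dockerfile if one exists, otherwise look for files such as package.json or requirements.txt) can be pictured with a small sketch. This is not Nixpacks code: `BuildPlan`, `detect_build_plan`, and the provider labels are hypothetical names chosen for illustration, and the real engine presumably inspects far more than three filenames.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BuildPlan:
    provider: str          # which kind of project was detected (hypothetical labels)
    uses_dockerfile: bool  # True when the repository's own Dockerfile is used as-is


def detect_build_plan(filenames: Iterable[str]) -> Optional[BuildPlan]:
    """Walk a small decision tree over the files at the repository root."""
    names = set(filenames)
    # An explicit Dockerfile always wins; nothing needs to be inferred.
    if "Dockerfile" in names:
        return BuildPlan(provider="dockerfile", uses_dockerfile=True)
    # A package.json strongly suggests a Node application.
    if "package.json" in names:
        return BuildPlan(provider="node", uses_dockerfile=False)
    # A requirements.txt suggests a Python application.
    if "requirements.txt" in names:
        return BuildPlan(provider="python", uses_dockerfile=False)
    # Nothing recognised: the caller has to ask the user for more information.
    return None
```

Checking for a Dockerfile first mirrors the order given in the transcript: inference only applies when the repository does not already say how it should be built.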

c. Storage System and Orchestration Engine

By developing their own storage solutions and orchestration engine, Railway ensures efficient resource allocation and cost-effective scaling, distinguishing itself from traditional cloud providers.

Notable Quote:

"You only pay for what you're using because we've built our own orchestration engine, so we will go and place workloads."
— Jake Cooper [07:25]
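
One way to picture workload placement is as a bin-packing problem. The sketch below uses first-fit decreasing, a textbook heuristic assumed here purely for illustration; the episode does not say which algorithm Railway's orchestration engine uses, and `pack_workloads` and all memory figures are made up.

```python
from typing import Dict, List


def pack_workloads(memory_mb: Dict[str, int], host_capacity_mb: int) -> List[List[str]]:
    """Place workloads on hosts with a first-fit-decreasing pass."""
    hosts: List[List[str]] = []  # workload names placed on each host
    free: List[int] = []         # remaining memory on each host, same order as hosts
    # Largest workloads first; ties are broken by name so the result is deterministic.
    for name, need in sorted(memory_mb.items(), key=lambda kv: (-kv[1], kv[0])):
        if need > host_capacity_mb:
            raise ValueError(f"{name} needs more memory than one host offers")
        for i, room in enumerate(free):
            if need <= room:
                hosts[i].append(name)
                free[i] -= need
                break
        else:
            # No existing host has room, so open a new one.
            hosts.append([name])
            free.append(host_capacity_mb - need)
    return hosts
```

Packing several small services onto one machine is what makes per-usage pricing plausible: the provider is not holding a mostly idle box for each customer.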

4. Infrastructure as Legos

Jake introduces the concept of "Infrastructure as Legos," where Railway treats infrastructure components as modular, reusable blocks that developers can assemble to build complex systems.

Notable Quote:

"If you consider that as kind of like a LEGO block, then actually you can basically say, hey, I want to go in and import that thing and I want to use it as part of my project."
— Jake Cooper [11:13]
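
In the transcript, a block is described as a bundle of services with environment variables it needs and environment variables it offers. A minimal sketch of that idea, assuming a hypothetical `Block` type and a `missing_inputs` check (neither is a Railway API), could look like this:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Block:
    name: str
    requires: FrozenSet[str]  # environment variables the block must be given
    provides: FrozenSet[str]  # environment variables the block exposes to others


def missing_inputs(blocks: List[Block]) -> Dict[str, List[str]]:
    """Return, per block, the required variables that no other block provides."""
    problems: Dict[str, List[str]] = {}
    for block in blocks:
        available = set()
        for other in blocks:
            if other is not block:
                available |= other.provides
        missing = sorted(block.requires - available)
        if missing:
            problems[block.name] = missing
    return problems
```

With a Postgres block that provides `DATABASE_URL` and an API block that requires `DATABASE_URL` and `REDIS_URL`, the check reports `REDIS_URL` as missing until a Redis block is dropped onto the canvas as well.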

5. Security by Default: Zero Trust Model

Railway adopts a zero-trust security model, ensuring that services communicate over secure, private networks without exposing endpoints to the public internet. This approach minimizes security risks and simplifies authentication and authorization.

Notable Quote:

"We've built this IPv6 WireGuard mesh on top of all of our services... The best level of security that you could possibly have is you just can't get to it, right, without SSO."
— Jake Cooper [13:08]
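
The name-based service discovery mentioned in the transcript (a service named analytics being reachable at a `.railway.internal` hostname, and only from inside its own environment) can be sketched as a lookup that is scoped by environment. The registry layout, the function names, and the address below are assumptions for illustration, not Railway's implementation.

```python
from typing import Dict, Optional, Tuple

INTERNAL_SUFFIX = ".railway.internal"  # suffix taken from the example in the episode


def internal_hostname(service_name: str) -> str:
    """Build the private hostname for a service from its name."""
    return f"{service_name.lower()}{INTERNAL_SUFFIX}"


def resolve(hostname: str, caller_env: str,
            registry: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Resolve a private hostname only inside the caller's own environment.

    `registry` maps (environment, service name) to a private address.
    """
    if not hostname.endswith(INTERNAL_SUFFIX):
        return None
    service = hostname[: -len(INTERNAL_SUFFIX)]
    # A caller in another environment has no entry, so the name does not resolve.
    return registry.get((caller_env, service))
```

Because the lookup key includes the caller's environment, a service in staging cannot even obtain an address for a production service, which is the sense in which authentication between internal services becomes unnecessary.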

6. Deployment Process and Behind-the-Scenes

Jake provides a walkthrough of deploying an application on Railway. Whether deploying a database or a GitHub repository, Railway automates the setup and defaults to the server region that is probably closest to the user to minimize latency; the region can be changed later.

Notable Quote:

"You can go and select that region and say, like, oh, I actually want to run it on these specific class of instances... It could be running anywhere and ultimately, you shouldn't really care."
— Jake Cooper [17:23]
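
The "pick the closest region now, let the user move it later" default can be sketched with a great-circle distance. The region names and coordinates below are rough assumptions for illustration; the episode only mentions US West and Amsterdam as examples and does not describe how the choice is actually computed.

```python
import math
from typing import Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees

# Approximate, illustrative locations; not an official region list.
REGIONS: Dict[str, Coordinate] = {
    "us-west": (45.6, -121.2),
    "amsterdam": (52.37, 4.90),
}


def great_circle_km(a: Coordinate, b: Coordinate) -> float:
    """Haversine distance between two points on the Earth's surface."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def choose_region(user_location: Coordinate, override: Optional[str] = None) -> str:
    """Default to the nearest region unless the user has picked one explicitly."""
    if override is not None:
        if override not in REGIONS:
            raise ValueError(f"unknown region: {override}")
        return override
    return min(REGIONS, key=lambda name: great_circle_km(user_location, REGIONS[name]))
```

The `override` parameter stands in for the later, optional configuration step: nothing has to be chosen up front, but an explicit choice always wins.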

7. Scaling and Cost Efficiency

Railway’s orchestration engine not only optimizes resource usage but also offers scaling solutions that grow with user demands. This ensures that applications remain performant without incurring unnecessary costs.

Notable Quote:

"We're letting you only pay for what you're using... avoid those like random errant, you know, $1,500 bills."
— Jake Cooper [21:09]
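
The difference between paying for a reserved box and paying for what is used can be made concrete with a little arithmetic. The rate and the usage profile below are invented for illustration and are not Railway's or any cloud provider's pricing.

```python
from typing import Sequence

RATE_PER_GB_HOUR = 0.01  # hypothetical price per GB of memory per hour


def fixed_box_cost(box_gb: float, hours: int, rate: float = RATE_PER_GB_HOUR) -> float:
    """Cost of reserving a box of a fixed size, whether or not it is used."""
    return box_gb * hours * rate


def usage_cost(hourly_gb: Sequence[float], rate: float = RATE_PER_GB_HOUR) -> float:
    """Cost when each hour is billed on the memory actually in use."""
    return sum(hourly_gb) * rate


# A service that idles at 0.5 GB for 20 hours and peaks at 3 GB for 4 hours.
day = [0.5] * 20 + [3.0] * 4
reserved = fixed_box_cost(4, len(day))  # a 4 GB box held for the whole day
metered = usage_cost(day)               # the same day billed on actual usage
```

Under these assumed numbers the reserved box is billed for 96 GB-hours while the metered service is billed for 22, which is the gap the "4 gig box" example in the transcript points at.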

8. Monitoring, Logging, and Observability

Railway integrates built-in observability tools, providing distributed tracing, alerting, and seamless integration with popular monitoring services like Datadog and Grafana.

Notable Quote:

"We've also built from the ground up an observability system inside of Railway... Some point in the future we'll do an app, right?"
— Jake Cooper [23:41]

9. Overcoming Constraints and Building Trust

While discussing potential constraints, Jake emphasizes that Railway strives to eliminate traditional platform limitations by offering flexible, scalable solutions. Building trust is paramount, especially when competing against established cloud giants.

Notable Quote:

"The main thing that's kind of the limiting reagent right now is not like what the platform can do, but it's almost like how much you can trust it."
— Jake Cooper [25:02]

10. Open Source Commitment

Railway’s commitment to open source fosters transparency and community collaboration. By open-sourcing significant portions of their stack, they enhance trust and invite contributions that drive platform improvements.

Notable Quote:

"If we can give you this kind of like ability to introspect the service... it's super cool... it's their GitHub repository."
— Jake Cooper [29:32]

11. Remote Company Operations

Jake discusses the deliberate choice to run Railway as a fully remote company. This model leverages global talent, promotes autonomy, and enhances productivity through asynchronous collaboration across multiple time zones.

Notable Quote:

"It's a terrible idea for probably about 90% of people... We have to hire people who are really, really excited about the problem space, who are going to like self-manage."
— Jake Cooper [32:59]

12. Hiring and Managing Remote Teams

Railway’s hiring strategy focuses on identifying highly motivated, self-managing individuals passionate about solving complex problems. Their extensive onboarding process ensures new hires quickly become productive and aligned with the company's mission.

Notable Quotes:

"We aim for trying to find those people because we think that... that focus and that passion is like, it's almost like a necessary precondition."
— Jake Cooper [37:18]

"We have six weeks of onboarding... the goal is almost pushing up the funnel on, on what you can do."
— Jake Cooper [37:28]

13. Conclusion and Future Outlook

Jake concludes by reflecting on Railway's growth and scalability, fortified by their open-source initiatives and robust infrastructure. He underscores the continuous effort to build trust and reliability, positioning Railway as a formidable contender in the cloud infrastructure space.

Notable Quote:

"We've scaled the cloud version of what we built is like scaled to like 2 million... so we can say like, hey, if we were to go and like do a self-hosted version of this for you, we can do that like level of scale."
— Jake Cooper [30:34]

This episode explores Railway's approach to simplifying cloud infrastructure deployments, covering the platform's features, its security model, and how the company operates. Jake Cooper's insights are useful for developers and organizations seeking efficient, scalable, and trustworthy deployment solutions.

Transcript

[00:01]
Shawn Falconer
Railway is a software company that provides a popular platform for deploying and managing applications in the cloud. It automates tasks such as infrastructure provisioning, scaling, and deployment, and is particularly known for having a developer-friendly interface. Jake Cooper is the founder and CEO at Railway. He joins the show to talk about the company and its platform. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
[00:41]
Shawn Falconer
Jake, welcome to the show.
[00:42]
Jake Cooper
Great to be here. You know, I'm super excited to chat about a bunch of stuff. I know we've got a couple of things on the docket, so.
[00:47]
Shawn Falconer
Yeah, yeah, absolutely. We were chatting before we hit record here, but you and I both graduated from the University of Victoria in British Columbia in Canada. So what ended up sort of pulling you south to the Bay Area?
[00:59]
Jake Cooper
Yeah, I think there's almost like this brain drain kind of pull that I think we're both talking about it right before and I think like half of maybe my graduating class was just kind of like, yeah, I want to generally kind of move down there in general. So I think I kind of always knew that I wanted to be in the U.S. I think I always knew that I wanted to start a company at some point. So it was more of a matter of like when, not if, if that makes sense. And so I ended up moving to a few different places before. I ended up working a bit out of like Amsterdam and Italy in between grad and then moved on to New York and, and then moved to San Francisco. But yeah, I think it's a pretty no brainer in my mind because you pay the same amount of tax dollars and you get, you know, twice X the ambition plus twice X the sun, you know, so there's a pretty strong payoff on doing that, you know.
[01:38]
Shawn Falconer
Yeah. As much as I love Canada, I think if you're in tech, there is such a strong pull to the Bay Area, especially when I moved here, you know, now like 15 years ago. I think we'll get into it today, but you're running a fully remote company so I think there's somewhat less constraints on companies and the opportunities people have in tech today all over the world. But that wasn't always the case. But you know, back to, you know, what you're doing now around Railway, like what was sort of the driving factor behind the creation of it as a platform focused on streamlining deployment and management of infrastructure and dependencies.
[02:11]
Jake Cooper
Yeah, so I mean I grew up like hacking on random stuff. I started actually writing like My first computer science stuff was writing kind of aimbots and cheats for, like, video games. And so it was always like a very kind of exploratory, creative kind of thing like that. And so I ended up writing, like, small programs or anything else like that. And then every single time I kind of like moved to, like deploy something, it was just like, you're switching from this world of, oh, cool, this joy, you know, it's this like beautiful, happy kind of like thing where you're hacking around, it's nice, and you're like, well, how do I like, move this thing? Right? And obviously there were like, tools like Heroku at the time. And so, like, finding those tools was like, awesome and magical. But there's a whole class of problems that exist actually outside of like, actually getting something deployed that we call like deployment lifecycle, right? And so it's how do you make changes, how do you, like, go and get them reviewed, how do you go and add a database in another environment and then how do you make sure that you're going to actually have that database when you, like, go in and merge, right? This whole kind of split universe phenomenon of staging, et cetera, right? And so when you get into things like that, you end up having to wrangle a lot of stuff, right? And so it goes kind of back to this bit of trough of sorrow having to go and figure out all of these things, right? And in reality, a lot of the workflows that people end up having are very, very similar, right? And so if you end up building a lot of those workflows, then most people, they want to split into a, you know, a parallel environment. They want to test their stuff, they want to merge it and they want it to kind of like automatically roll out, right? So they just simply don't want to handle that. 
So it ends up being that, like, this class of problems ends up being both interesting from like a systems perspective. So, you know, we work on some really, really cool, like networking, storage, et cetera, all of those other things. We've got our own bare metal servers now, but also just like, very, very applicable to the wide swath of people, right? And I also think that, like, the compute market is one of those things that will just continue to grow, right? It's just like we will need more computers, right? And so anything that we can do to kind of streamline people's, like, productivity in there, it's like, it's one of the highest leverage things that we have in general, right? And we talk a lot about, like, Leverage. Like, how do you build leverage? How do you build efficiency? Right? How do you make it so that a user's action has an outsized return on what they are putting in? Right? Because that's, for us, that's the definition of magic, right? It's like you do a little. You get a lot.
[04:09]
Shawn Falconer
Do you think the public cloud has increased that pain in terms of, you know, we have this joy of, like, building something, and you get sort of that aha moment of, you know, building something on maybe on your local machine and running it, and then you got to get to a place where, like, now I have to deploy it. Does all the sort of, like, boxes that you have available in the cloud make that even more of a challenge?
[04:30]
Jake Cooper
I think you're trading pain, if that makes sense, right? And so, like, it's obviously very, very painful to, like, go and procure servers, get them up, making sure that they don't fall out of the wall and that, you know, somebody doesn't bump the power cable and all that. Like, classic problems of solving those, right? And then you kind of trade those for, like, okay, how do I, like, manage these machines? Right? And then you kind of have to play with the abstractions that the cloud providers give you, right? And there's benefits and curses to that, right? In the prior kind of like, bare metal world, it's like, you want to spin up a server, you want to spin up something. Well, it's like you're measuring your time on the order of weeks or maybe even months, right? To, like, get this thing, like, up and running, right? Versus you go to the cloud, and it's like, boop, hit a button. And it's kind of just like, up and running, right? But you do have to kind of like, manage the primitives that the cloud providers give you and kind of like work within those walls in general, right? And those primitives can be, I think, made faster in and of themselves, right? So from making deployment instant, right? Moving your builds, like, immediately beside your compute, moving your storage, keeping all of these things together, right? That's kind of the whole end goal of, like, a lot of the stuff that we're building is like, all of these things have almost been verticalized, right? If you look at the AWS kind of dashboard, it's like every service, you know, famously, Jeff Bezos is kind of like, everybody's going to interact with these things over an API. So everything's kind of like vertically sliced, right? So there's no mechanism to kind of like, share these things together unless you really want to go and start composing, and then you have to. Again, you start bumping into the abstractions there in general. Right. So I think it's like you're trading kind of, like, pain over time.
[05:49]
Shawn Falconer
Okay. And, you know, there's been a number of companies that have tried to simplify this process. You mentioned Heroku. There's other companies in the space, like, you know, Render or Netlify, these various PaaS platforms. Like, what is Railway's sort of unique approach to this that distinguishes it from some of the other players?
[06:07]
Jake Cooper
Yeah. So I'd say there's a couple of things. We talk about wrangling complexity a lot in general. So we've built a really intuitive UI that is kind of, like, layered. It's a canvas. You essentially just go to it and you kind of just like, spew out, hey, give me Postgres. Hey, give me, you know, Redis. Give me a deployment of GitHub. Right. And I think we've gone above and beyond on a lot of different things. We've built a system for kind of, like, automatically generating Docker images. Right. So essentially you give us your code. We will go and statically analyze it, we will go and figure it out and stuff like that. So you don't have to actually, like, write anything to get started in general. Additionally, we've built our own kind of, like, storage system. So you can host literally anything on Railway, right? So you can build a ClickHouse database beside your Python instance, beside your whatever. Right. Like, for us, it doesn't really matter, and we've, like, built those primitives in such a way where they feel very, very quick, and they're really, really easy for you to kind of compose together. Right. And so I would say that what separates Railway versus, like, something like Fly or Render or anything else like that: on the surface they may kind of, like, look very, very similar, but over time, as you kind of, like, compose these things together, we've tried to almost like, linearize the complexity of each of these things versus allowing that kind of complexity to spawn out until, oh, you have so many of these microservices and no tools to manage them and stuff like that.
[07:16]
Shawn Falconer
So you mentioned Docker there and being able to abstract away the challenge of putting together a Docker image. What is that challenge that people typically run into?
[07:25]
Jake Cooper
Yeah, so I would say that there's a couple challenges. Oh, and sorry, another thing to mention is, like, Railway is only... you only pay for what you're using because we've built our own orchestration engine, so we will go and place workloads. So normally, if you're on a cloud provider, you know, you pay for a 4 gig box, and if you don't use the 4 gig box, you're billed for that at the end of the month. What we do is we basically allow you to run the code and we pack all of these instances together, and as instances are scaling, we'll go and move them around in general, right? So that allows us to get an edge there in general in terms of like, both pricing as well as, like, due to us doing our own bare metal instances, which is both pricing and performance. But in terms of what makes the Docker process a bit more complicated, it's just another abstraction you have to wrangle in general, right? You have to, like, go and figure out where to go and place these binaries, what are the permissions, like, which ordering, right? All of these other things, right? And a lot of people don't know. It's like, you know, Docker is layered, right? So if you invalidate the cache at any stage upward, it'll invalidate the cache all the way down, right? And so even constructing the ordering of your, like, Docker commands, right, has an effect on the build times, the image output, all of these other things, right? So there's kind of like, there's a science to it, obviously, which is, I think, on the surface, but there's almost like an art to it where it's like, oh, you have to almost, again, understand the underlying abstraction. And our hope is that we can basically just say, like, yeah, you really just. You want to make sure that you have Node in here, and then you want to make sure that you also have Python in here. 
And you basically, like, you should select almost the packages that you want if you ever use, like, something like Ninite or something, and then you have access to them, right? There's no messing with permissions, there's no messing with anything else like that. And that's what we've built the Nixpacks automated kind of construction engine on.
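
The point about layer ordering can be illustrated with a toy model of a layered build cache, in which each layer's cache key depends on its parent's key, so a change at one step invalidates every step after it. This is a simplified sketch, not Docker's actual cache implementation; `layer_keys`, `rebuilt_layers`, and the fingerprint strings are hypothetical.

```python
import hashlib
from typing import List, Tuple

Step = Tuple[str, str]  # (instruction text, fingerprint of the files it reads)


def layer_keys(steps: List[Step]) -> List[str]:
    """Chain each layer's cache key to its parent's key."""
    keys: List[str] = []
    parent = ""
    for instruction, inputs in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(before: List[Step], after: List[Step]) -> List[str]:
    """List the instructions in `after` whose cached layer can no longer be reused."""
    old, new = layer_keys(before), layer_keys(after)
    return [
        after[i][0]
        for i in range(len(after))
        if i >= len(old) or old[i] != new[i]
    ]


# Same source edit (src-v1 -> src-v2), two different instruction orderings.
copy_first_v1 = [("FROM node", ""), ("COPY . .", "src-v1"), ("RUN npm install", "")]
copy_first_v2 = [("FROM node", ""), ("COPY . .", "src-v2"), ("RUN npm install", "")]
manifest_first_v1 = [("FROM node", ""), ("COPY package.json .", "manifest-v1"),
                     ("RUN npm install", ""), ("COPY . .", "src-v1")]
manifest_first_v2 = [("FROM node", ""), ("COPY package.json .", "manifest-v1"),
                     ("RUN npm install", ""), ("COPY . .", "src-v2")]
```

In this model, copying all source files before installing dependencies forces the install step to rerun on every source edit, while copying only the manifest first leaves the install layer cached. Handling that ordering automatically is the kind of detail the speaker says an automated builder should take off the user's plate.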
[08:53]
Shawn Falconer
Some of the challenges I think organizations run into sometimes when they invest in, like, a PaaS platform is that if they are successful, they reach this, like, graduation problem where they hit sort of the scale limits of that platform, and then they need to essentially migrate off of it, go to AWS, Google Cloud or whoever directly and sort of, you know, stand up a bunch of the infrastructure and run it themselves. How do you sort of avoid that with Railway?
[09:15]
Jake Cooper
So Heroku famously had this graduation problem. It's like one of the main things that we chat with investors about. And so the interesting thing about Heroku is they built like the thing that was really, really great for them is also in my theory the thing that killed them. Not going to kill them: they're doing $1 billion in revenue and they said it's a successful year and all of these other things, right? But like I think as Heroku goes like, we all know that there is a massive, massive, massive business to be made in there in terms of impact, right? And they were only just like scratching the surface, right? So anyways, back to the original point about the graduation problem. I think the main thing that happens is you end up having kind of almost this like outsourced state problem. So that marketplace where it's like oh, I need Postgres and I need Redis and I need to be able to deploy my, you know, call it like Ruby on Rails, like API server as well as my workers, right? Heroku is really, really the servers, right? You just spin up the stateless things, right? And they'll like scale up or down. Excellent. And then you end up going in and integrating with something like an external Postgres provider or they provided one at one point or anything else like that, right? But it was a very, very bespoke offering, right? So they haven't like solved the generalizable storage problem, right? And I think there's a few more primitives that have come out, namely like eBPF, io_uring, a bunch of those other things that allows us to kind of solve these things at a more generalizable level on the storage stack, which means you can spin up anything, right? And so instead of bumping into those edges where it's like oh, I want this specific thing but Heroku doesn't offer it, right? So I couldn't do like self hosted ClickHouse on Heroku, right? Because there's no way to like do in the Kubernetes world of things like a persistent volume claim, right? 
There's no elastic block storage, there's none of those things, right? And so you end up bumping into these limits of that platform and we've kind of invested right out of the gate. One of the first things: like, Railway didn't even host code at the start. We were just a database provider. We were the underlying database where, like, you can click and you get... there was one database at the start and then it was four databases, right? So that's where we've kind of invested in making sure that people can literally just do anything on the platform, right? And then we're making it really, really trivial for them to go and actually go and do that. Anything.
[11:09]
Shawn Falconer
I mean, can you explain this idea of like infrastructure as Legos?
[11:13]
Jake Cooper
Yeah. So it's interesting. Like, I think if you squint, it kind of already exists in terms of like infrastructure as code, right? And so if you look at like a Docker Compose or a Helm chart or something like that, right? Those are essentially like infrastructure as Legos. They have like a variety of environment variables that you have to provide. They have a variety of like inputs and outputs in terms of endpoints that exist. And then they have a variety of like, services in there. And they also have versioning that exists over time, right? And so when you drop, like if you just assume that this thing kind of exists as this like bucket and this Docker Compose file has maybe like four services, the aforementioned, like Ruby on Rails service, the worker, the Postgres, et cetera, that's now kind of like a Lego that you can use and you can piece together, right? Because that API endpoint actually has an input, right? And there are environment variables that you can pull from, right? And there are environment variables that you can provide to, right? So if you consider that as kind of like a LEGO block, then actually you can basically say, hey, I want to go in and import that thing and I want to use it as part of my project, right? And so like we allow people to one click deploy things like Strapi or Aki or any of like the analytics toolkits that are open source, right? Like we're big proponents in open source. We have an open source kickback program where if you build the template, people run the template, you get paid for what people are actually using. So that's kind of the infrastructure as Legos piece of it, where you basically you take that Lego and you drop it inside of your canvas, right? And then you can kind of like consume or interact with it, right? And this ends up solving a like, very, very interesting class of problems. It ends up solving like authentication, authorization, it ends up solving sharding, it ends up solving security. 
Because you're not managing this like massive multi tenant thing. It solves like API versioning, which is super interesting, right? So if you go and push changes, you can actually go and roll out those changes and we have health checks for your services. If any of those changes were to actually cause those health checks to fail, the rollout would fail in general, right? And so you can actually almost like split these things up over your canvas and consume. And so that's the Lego aspect of it in general.
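
The rollout behaviour described at the end of this answer (a change only goes live if the service's health checks pass) can be sketched as a small gate. The `roll_out` function and its retry count are assumptions made for illustration; the episode does not describe how Railway implements its health checks.

```python
from typing import Callable


def roll_out(current: str, candidate: str,
             health_check: Callable[[str], bool], attempts: int = 3) -> str:
    """Return the version that should serve traffic after a rollout attempt."""
    for _ in range(attempts):
        # Probe the new version; the first passing check promotes it.
        if health_check(candidate):
            return candidate
    # Every check failed: the rollout fails and the current version keeps serving.
    return current
```

A caller would pass in whatever probe makes sense for the service, for example a function that requests a health endpoint and returns True on a successful response.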
[13:04]
Shawn Falconer
Can you explain that in a little bit more detail? Like, how does this help solve something like auth?
[13:08]
Jake Cooper
So then you're kind of like not talking with the public Internet, right? And so we've built this IPv6 WireGuard mesh on top of all of our services, right. And so essentially you're not kind of either exposing your instance publicly, so you can just talk with it externally. And you're also not at risk where somebody says, oh, potentially you've leaked your keys and now there's a publicly accessible endpoint. Right. The best level of security that you could possibly have is you just can't get to it, right, without SSO. That's the default level of security that we're trying to provide here. And I think it's inspired from the kind of zero trust mantra almost of saying, hey, let's give people the best experience and the best practices right out of the box. And we'll make sure that your database is within like single digit, ideally even like, you know, hundreds of, you know, microseconds from your instance. We're going to make sure that it's not accessible. We're going to make sure that you have really solid primitives to like go and access these things. So we do automated service discovery based on your name. So if you have like your analytics service that you've deployed internally, right, it's just analytics.railway.internal. You just make requests to it and then you're the only one that can actually access that within that environment. Right. So that's kind of how it solves like authentication, authorization, because you don't end up needing it, right. So you don't end up having to put like an NGINX server with basic auth or anything else like that in front of all your things, shoveling it into one pass and then saying, hey, everybody on the company, you know, network, go and do these things, right? And then invariably at some point that ends up getting breached and then you have security posture on those things, right? Yeah, yeah.
[14:33]
Shawn Falconer
So it's kind of like security by default approach. This is like zero trust model, essentially. Like, let's take the best practices, bake it in. So it's like the guardrails are essentially in place and that's how people will develop against it.
[14:44]
Jake Cooper
Yeah, exactly.
[14:45]
Shawn Falconer
So can you walk me through, like, if I'm going to use Railway, like what is that process like? And then can you explain sort of like what's happening behind the scenes?
[14:53]
Jake Cooper
Yeah, so it's funny, it like rhymes with the "what happens when you type something into the browser" question. That's like a very common technical interview question. Yeah, yeah, yeah, yeah. So when you go to Railway, so if you go to like dev new, we will drop you on a page that allows you to basically say, give me a Postgres instance or deploy my GitHub, right? And so if you hit a Postgres instance, what we do is we like go and we have a fleet of servers that exist all across the world. We will go and make a claim for that volume, create it, and then go and bind that instance there, right? And then we'll go return to you like that running instance which you can access over the either private network or you can click generate public URL and it will generate a public URL for you, right? So that's the kind of like stateful storage one. And if you go and do the GitHub one, what we do is we basically will parse your repository, figure out what applications you might have in there. Maybe you have a Dockerfile, maybe you don't have a Dockerfile. If you have a Dockerfile, well, obviously just use it. If you don't have a Dockerfile, we kind of go down this like tree of almost decision making where we say, do you have a package.json? Right? Because if you have a package.json, it's very, very likely that you have a Node application in here. So let's pull in some of that information. Oh, do you have like, you know, a requirements.txt? Okay, cool. It's obviously Python and stuff like that. So we have this kind of tool that we've built. Again, this is the Nixpacks engine. It's all open source. It's on our GitHub repository if you want to have a look at it. It's super cool. It's like this Rust engine that we built. But it will go and essentially figure out what is your build command, what is your start command, what are all of these other things. And you can modify them after, right? 
But the whole goal here is to almost like compress all that knowledge so that the user, when they go to that platform, they just say, here's my GitHub repository. And we say, excellent, it's already deployed, right? And then you say, wait, what do you mean right? Because that's not like the default experience of going to a cloud provider. You have to fill out reams of forms and, you know, select a region, right? In the AWS dashboard, you go, oh, the name, that's pretty good, right? And then you're like, oh, which flavor of Linux do I want? And you're like, oh, okay, I guess I have to pick that now, right? And then you go through this, like reams of stuff, right? And so our whole goal is to take all of that config and push it post haste. Anything that you can do, do later, you should want to go and do later, right? So even something like setting like a region for like a database, right? We built a system that allows us to like move these volumes around, right? And so if you spin up something in the region that's probably closest to you, which is a good default, right? Like if I'm spinning up from San Francisco, I'm probably in U.S. west. If somebody's spinning up from London, they're probably in the Amsterdam servers, right? So we pick that and then if you really want to go and move it, you just move it later, right? And so we take all the config and we push it later.
[17:13]
Jake Cooper
And then where's this all running? Is this running in my account or are you sort of like running this within behind the scenes, like railways access to a public cloud?
[17:23]
Jay Cooper
Yeah, so we're running it on a few different servers at this point in time. So we have some straddling between Google Cloud and AWS, as well as our own bare metal service that we've spun up over the last year, basically. So it runs in a variety of the, you know, servers that we have. And ultimately, the only time you should care is if you're trying to potentially pair it with something externally. So let's say you have, I don't know, a Supabase instance somewhere and you want it as close as possible because, ultimately, the only thing that matters is that your compute is beside your storage, just from a latency perspective, because the database calls are so quick. So then you can potentially go in and select that region and say, oh, I actually want to run it on this specific class of instances, right? And you know, barring any sort of failover, we will go and do that because we built the orchestration engine to go and manage and drop the instances right beside it, right? So the short answer is, it could be running anywhere and ultimately you shouldn't really care, right? Except for those latency reasons, at which point you can add that constraint kind of later.
[18:16]
Shawn Falconer
This episode of Software Engineering Daily is brought to you by Capital One. How does Capital One stack? It starts with applied research and leveraging data to build AI models. Their engineering teams use the power of the cloud and platform standardization and automation to embed AI solutions throughout the business. Real-time data at scale enables these proprietary AI solutions to help Capital One improve the financial lives of its customers. That's technology at Capital One. Learn more about how Capital One's modern tech stack, data ecosystem, and application of AI/ML are central to the business by visiting capitalone.com/tech.
[19:00]
Shawn Falconer
And how does Railway handle the kind of distributed system dependencies that you can run into? You know, if you have a bunch of services, this can get pretty complex, errors can occur. How do you manage that aspect?
[19:11]
Jay Cooper
Yeah, so I think in traditional systems, I guess, it really depends on the class of error, right? Because there's a variety of different errors that can occur. There's "I pushed bad code and I flunked the instance and it got past all of the health checks," right? And so we give you a one-click automated rollback. We'll keep the container around for a little bit. You click that and you say, hey, listen, my health checks didn't catch that. Something is down. Now a user's reporting something. You click that and immediately you're back, right? We have automated health checks for going and managing these instances. So if you define a health check and you just say, you know, /health, and then what it goes and returns is, oh, okay, you know, I updated my Redis library and it no longer is able to communicate with Redis for some reason. Right? Okay, cool, now that fails. And so we'll actually flunk that deploy and then notify you, whether you have emails turned on or, you know, through the in-app inbox, right? Some point in the future we'll do an app, right? So basically get really, really close to notifying you as quickly as possible. And then there's also things that we've done on top of it in terms of solving classes of problems that are kind of interesting and only happen at a certain scale of complexity. So assuming you have a ton of different microservices, right? Let's say that you've modified your gRPC or something like that, right? I think gRPC is a bad example because it's backwards compatible. But let's say you've modified something and created a breaking change. Like you've modified a field on a GraphQL endpoint, right? You know, you go and roll it out, things start flunking, right? So what we'll do is we'll actually do a dependent rollout.
So if you are saying, hey, my front end communicates with my back end, and then my back end starts failing when it rolls out, we're going to flunk the whole kind of class of that deployment. Right. Because you've made a change to your front end and your back end. Right. So we've done a bunch of things in the application that basically say, just consume the criteria or config from various different services, and we'll almost construct for you a dependency graph to go and automate any of these things. You don't have to do the Bazel thing where you're defining your dependencies as this and then you forget and then something happens. Or Turbo does this, I think, as well in terms of build stuff. You don't have to do any of that. You just consume the properties of the services and then we will go and automatically figure that out. Including cycle detection, which is super cool.
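The dependency graph, cycle detection, and dependent rollout described above can be sketched as below. This is a rough illustration under stated assumptions, not Railway's orchestration engine: the graph is a plain dictionary mapping each service to the services it calls, and the names `has_cycle` and `services_to_flunk` are hypothetical.

```python
from typing import Dict, List, Set

# service -> services it depends on (hypothetical example data)
Graph = Dict[str, List[str]]


def has_cycle(deps: Graph) -> bool:
    """Depth-first search with three colours; a back edge to a grey node is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[str, int] = {svc: WHITE for svc in deps}

    def visit(svc: str) -> bool:
        colour[svc] = GREY
        for dep in deps.get(svc, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                return True
            if state == WHITE and visit(dep):
                return True
        colour[svc] = BLACK
        return False

    return any(colour[svc] == WHITE and visit(svc) for svc in deps)


def services_to_flunk(deps: Graph, failed: str) -> Set[str]:
    """The failed service plus everything that depends on it, directly or transitively."""
    dependents: Dict[str, Set[str]] = {}
    for svc, targets in deps.items():
        for target in targets:
            dependents.setdefault(target, set()).add(svc)

    affected: Set[str] = set()
    stack = [failed]
    while stack:
        svc = stack.pop()
        if svc in affected:
            continue
        affected.add(svc)
        stack.extend(dependents.get(svc, ()))
    return affected


if __name__ == "__main__":
    deps = {"frontend": ["backend"], "backend": ["postgres"], "postgres": [], "worker": ["postgres"]}
    print(has_cycle(deps))
    print(sorted(services_to_flunk(deps, "backend")))
```

In this sketch a failing back end flunks the front end with it, while the unrelated worker is left alone, which matches the front end and back end example in the answer above.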
[21:09]
Shawn Falconer
What about like a canary rollout?
[21:11]
Jay Cooper
Yeah, so we can do canary rollout. So you just, you can just do it on the command line. If you just do railway up. So as also kind of like a piece of a fill in. We have obviously the Canvas dashboard or anything else like that, but we also have a command line because people like to interact with services in various different ways. Right, right. So the command line's really useful for a bunch of different things, including that.
[21:28]
Shawn Falconer
Okay, and then, I think one of the challenges companies typically have with whatever sort of cloud resources they're using is they'll sometimes essentially provision too much or potentially too little, which will lead to problems. They're spending too much, or maybe they provision too little and then they run into challenges with throughput or latency or something like that. How do you solve for that?
[21:51]
Jay Cooper
I think that's a really important thing to solve for. I think it's also a thing that is a core differentiator for us versus other platforms, in the sense that the aforementioned orchestration engine that we've built allows us to basically only bill from the "what you're using" perspective. Right. And so in a production workload that's pretty cool, because assume that you have 2x standard deviation or, you know, 10x because you're on the front page of Hacker News or anything else like that. That's fine, we can go and handle that. We can go and scale up the instances. We can scale them down. We can go and do that for you. Right. The part where it becomes actually, I think, even a little bit more interesting is when you end up with pull request environments that can be served, like, serverless. Right. I use the word serverless in quotes. I know we're not on video or whatever, but in a way that basically allows you to send a request to it. The request will spin up the container, the request will be filled, and then the container will be finished. Right. And so we can actually go in and kind of construct that parallel environment of yours with a copy-on-write database volume pretty soon, which is super cool. We'll roll that out in, like, Q1 of 2025, as well as serverless spin-up of those parallel environments. Right. So instead of spinning up something that's like, oh, I need 32 gigs for this service and I need four gigs for all this other service, and you spin them up and you leave them around over the weekend and you just incinerate money for these things that are idle, we will actually only spin them up for the time that you need and only charge you for the kind of usage that you have. Right.
And so ultimately that means that like, you're going to avoid those like random errant, you know, $1,500 bills because somebody forgot to like, you know, actually terraform apply off of Master instead of like their staging environment that they were testing. Right. So, and I think that having that posture in place kind of by default means that companies get a lot more cost control, they get a lot more benefit, they get a lot. Right. But they still get the ability to move extremely quickly by having the ability to create copies of their environment. Right.
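The difference between leaving an environment running over a weekend and paying only for active time comes down to simple arithmetic, sketched below. The 32 GB figure comes from the example above; the price per GB-hour and the 30 minutes of active time are hypothetical numbers chosen only to make the calculation concrete, and they are not Railway's pricing.

```python
def always_on_cost(memory_gb: float, hours: float, rate_per_gb_hour: float) -> float:
    """Cost of a service that stays provisioned for the whole period."""
    return memory_gb * hours * rate_per_gb_hour


def usage_cost(memory_gb: float, active_seconds: float, rate_per_gb_hour: float) -> float:
    """Cost when the service is billed only while it is serving requests."""
    return memory_gb * (active_seconds / 3600.0) * rate_per_gb_hour


if __name__ == "__main__":
    RATE = 0.01  # hypothetical price in dollars per GB-hour
    idle_weekend = always_on_cost(32, 48, RATE)   # 32 GB left up for 48 hours
    on_demand = usage_cost(32, 30 * 60, RATE)     # same service, 30 minutes of real use
    print(f"always on: ${idle_weekend:.2f}")
    print(f"usage only: ${on_demand:.2f}")
```

The absolute numbers are made up; the point is the ratio between hours provisioned and hours actually used.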
[23:37]
Shawn Falconer
How does monitoring, logging, observability, these types of things work?
[23:41]
Jay Cooper
We have templates that people have built that allow you to kind of exfil logs to Datadog. We have a template for spinning up Grafana, or VictoriaMetrics or Prometheus-compatible instances. So you can do all of those things. We've also built from the ground up an observability system inside of Railway. And since we have the edge network, we automatically can kind of add a request ID. So we can kind of give you distributed tracing by default through all of your microservices. We can give you alerting for if things spike or stay high or anything else like that. Right. We can give you information that you kind of wouldn't have in other kinds of environments without doing, for lack of a better word, a ton of plumbing. Right. So that's the thing that we've kind of built from the ground up internally at Railway. And I think that, obviously, Datadog is a massive business and there's tons and tons of stuff in there, so we're kind of, you know, straddling more of the 80/20 of, let's just give people the baseline amount of things that they want, and over time we're going to go in and ask them, hey, what else can we give to you? Right. But we have people who are using Railway entirely in terms of build, deploy, observe, scale, all of those pillars, and are actually extremely happy using just that. Right. And I think it's bare bones in terms of where we want to take it right now, but it's super exciting to say, hey, listen, we have all the building blocks right here and we just need to work with our users to kind of scale to the things that they really, really want.
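The idea of an edge that stamps each request with an ID, which services then pass along so logs can be stitched into a trace, can be sketched like this. It is a minimal illustration and not Railway's implementation: the header name `X-Request-Id` and the function names `at_edge` and `outgoing_headers` are assumptions, and headers are modelled as a plain case-sensitive dictionary.

```python
import uuid
from typing import Dict, Optional

REQUEST_ID_HEADER = "X-Request-Id"  # assumed header name, for illustration only


def at_edge(headers: Dict[str, str]) -> Dict[str, str]:
    """Edge proxy step: keep an existing request ID, otherwise mint a new one."""
    stamped = dict(headers)
    stamped.setdefault(REQUEST_ID_HEADER, uuid.uuid4().hex)
    return stamped


def outgoing_headers(incoming: Dict[str, str],
                     extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Service step: copy the incoming request ID onto any downstream call."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = incoming[REQUEST_ID_HEADER]
    return headers


if __name__ == "__main__":
    inbound = at_edge({"Accept": "application/json"})
    downstream = outgoing_headers(inbound, {"Content-Type": "application/json"})
    # Both hops carry the same ID, so their log lines can be correlated.
    print(inbound[REQUEST_ID_HEADER] == downstream[REQUEST_ID_HEADER])
```

Because every hop logs the same ID, a log search on that ID reconstructs the path of a single request across services without per-service tracing setup.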
[24:59]
Shawn Falconer
What would you say are some of the, like, constraints or limits today?
[25:03]
Jay Cooper
Constraints or limits? That's an interesting one. I would say that, I mean, it's going to sound weird, but there's not really any constraints or limits right now. And that's kind of because we've really tried to solve that Heroku problem of, oh, you know, I'm going to outgrow this thing. Right. And so I would say that we're almost limited by trust, if that makes sense. Right. And so, you know, you have AWS, you have GCP, and you have Azure. Those are the big clouds, right? Like, barring anything, those are the big clouds, right? And you have Cloudflare that's trying to do things. Right? But Cloudflare is like a $30 billion organization. They've been around for a decade plus. They've accumulated trust. You know, they're the meme: whenever Cloudflare has an issue, it's a software engineering snow day, right? They've continued to build that trust over time and they're still working on it. Right. And so for us, I would say the main thing that's kind of the limiting reagent right now is not what the platform can do, but it's almost how much you can trust it. Right? Because when it comes to software infrastructure, it's your livelihood, right? There's not much more that you can kind of entrust to people, right? It's your data. It's the fact that if that thing goes down, especially with your data, you're SOL until these people get back to you, right? And so you're kind of hanging in limbo, et cetera, right? So it kind of pulls more towards the quote of, you know, nobody got fired for buying AWS, nobody got fired for any of these other things, right?
And so I think that's the main thing that we are kind of like consistently working with companies on and saying like, yes, you're going to get all of these benefits and we're going to give you like a higher order level of reliability and we're going to give you better service SLA turnarounds than the kind of like larger clouds. And that ends up being kind of like a very, very difficult battle to have with people because they just say that sounds like bs, right? And you have to just show them over time. It's like, no, like we're going to continue to kind of like work on that and like we will be available for you should anything occur. And we've also designed the system such that there are less and less fault points as you kind of go.
[26:52]
Shawn Falconer
I mean, as a, like if you're a founder of a business and you're building out a new product, then it could be a lot to sort of bet your product life on another essentially startup going back to sort of the trust challenge that you're talking about.
[27:06]
Jay Cooper
I think actually the startups really like us. What we call it in our growth master plan right now is that we're stretching market, right? And so with startups, our current ICP, the normal distribution of that go-to-market is actually 15 to 50 person teams. Those people seem to love us, right? They seem to want to move a ton of different stuff over. Maybe it's not literally all of their infrastructure footprint, but it's a large swath of things that are no longer kind of legacy, right? And they're basically saying, listen, we want to move really, really quickly on these things. It ends up being those larger organizations that move a little bit slower and want that higher-order trust bit really, really flipped. And so when you start going to those organizations, most of the sales motion at that point ends up being this kind of, how do we get past any sort of trust or compliance or whatever objections, which we've gotten really, really good at, and show you the value to kind of tie it to maybe one of your top eight velocity initiatives, where we just want engineers to be able to ship faster and get more done, you know.
[28:05]
Shawn Falconer
So if you are a larger organization, let's say that you, you're able to establish that trust with them. Like what is the starting point for them? Like, you know, obviously I'm not going to just, if I'm you know, hybrid cloud or I'm on cloud today, I'm not going to go and like rip and replace everything. Like how do I get started essentially in a way that doesn't require me to boil the entire ocean?
[28:22]
Jay Cooper
Yeah. So I mean, the nice thing about Railway is you can incrementally adopt it, right? And so if you have services that you want to go in and spin up internally and you want to pair them with services that already exist, you can do that using something like Tailscale, or we have a WireGuard binary that allows you to mesh it into your instances over there. We can also potentially do dedicated instances if you want, after chatting with you. So there's a variety of different ways you can get started, but the main point is that people just kind of incrementally adopt it. They'll basically start with, usually, it's an EM messing around on the weekend to basically say, all right, how do I go and explore a couple of these things that I know, you know, are being pitched as these faster alternatives to X? But I need to be able to know that, first of all, it's going to satisfy that "faster alternative to X" criterion that I'm looking for, and two, that, you know, it'll be solid for us to go in and make a case in the future. Right. And so what we do essentially is we go and we pull telemetry from people as they're signing up, and when you start getting to points where you want to be activated, we basically just say, hey, if you want to go and chat with us, you can chat with us over Slack, you can chat with us over email, right? We want to be really, really available without being kind of all up in their face, you know.
[29:27]
Shawn Falconer
And then you've open sourced a significant portion of the Railway production stack. What was the motivation behind that?
[29:33]
Jay Cooper
I think the motivation for us is that we want that trust. Right. And so if we can give you this kind of ability to introspect the service, maybe even self-host it in the future or anything else like that, then realistically there's not much more you need in order to trust us if you can see the guts and the internals and anything else like that. Right. So that's kind of the main motivator, I would say. In general it's also obviously excellent to have members of the community be able to go and contribute. We have people who are submitting Nixpacks PRs all the time to go and add new versions or new providers or new languages or new, you know, dependencies or anything else like that. And so getting kind of that tailwind of open source, not Tailwind in the CSS sense, but the benefit of the open source community being able to go in and help us build this together. You know, that's I think a big key. And also, Kubernetes ends up being open source. Right. And so you kind of have to go and meet people where they are, and if they're kind of self-hosting some of these things, at some point you need to be able to allow them to self-host at least the data plane. Right.
[30:30]
Shawn Falconer
You know, does that help also overcome some of the trust issues?
[30:34]
Jay Cooper
I believe so, yeah. Because you can see everything that people are doing, right? You can see what they're doing on the instances, you can see, you know, what code they're running. You can see all of the other things. Right. So I think ultimately that really helps with the trust issues. Maybe not issues, but the problem there in general. Right. But at the end of the day, I think the main thing that you can do to make sure that people trust you is to, once the stuff is up, keep it up and keep it running. Right. And just make sure that it's a bulletproof experience. Right. So we've been working, especially over the last six months, to make sure we're starting to get, you know, multiple nines on the board of, okay, cool, this is a level of reliability, especially with our new bare metal instances. We haven't had any issues so far. So obviously knock on wood there, but building something that you have a higher order of control over so you can get that level of reliability, so people can say, actually, I've used it for X. I don't really have a ton of problems with it. Right. And then we've also scaled the cloud version of it to, you know, we're doing tens of billions of requests per month on the edge proxy. We have, you know, the orchestration engine in the cloud managing two plus million microservices on, I think, four clusters. Right. So that's kind of been built out so that whenever we go to have a conversation with larger companies like Uber or anything else like that: 5,000 microservices. That's a lot of microservices, evidently. Right. But the cloud version of what we built is scaled to 2 million. Right. So it's a few orders of magnitude off. So we can say, hey, if we were to go and do a self-hosted version of this for you, we can do that level of scale.
[32:00]
Shawn Falconer
You previously worked at Uber. Did any of that experience sort of inform or motivate you to start Railway?
[32:06]
Jay Cooper
Yeah, definitely. So there was you know, kind of this like walled garden of like platform teams where you know, some people were responsible for like getting code deployed and stuff like that. And I would go and interact with these things at work and it would be like fine, you know, and then I would go and interact with them at, at home and there'd be like varying levels of less than fine, you know, and so there was just, it really seemed that, you know, Uber was a high growth company at the time, still is, you know, it's massive and you know, they had the ability to retain the best engineers to go in and build this thing. And I still could potentially see ways that they could be like significantly better just in terms of like unlocking developer productivity. So I would say that that is definitely like a driving function for like making this system.
[32:47]
Shawn Falconer
So you've been working on Railway for what, almost five years?
[32:50]
Jay Cooper
Yeah, yeah, about five years, I think four and a half now.
[32:53]
Shawn Falconer
And as like a first time, like founder, you're running 100% remote. Like was that a conscious decision?
[33:00]
Jay Cooper
Yes, it was a conscious decision. So I've worked remotely since 2015 maybe, and I've always really, really strongly enjoyed working remotely. It's definitely not for everybody. I think it requires people be almost extrinsically motivated about the problem. Right. So people have to really, really like the thing that they're working on and they have to kind of see the vision. They have to see where it goes and everything else like that. So you can't kind of get a lot of the benefits out of, you know, being in an office and kind of being able to breathe down people's spine about saying, oh, we've got to do all of these things, right? So I was chatting with one of my friends the other day and I was like, I think maybe one of my most controversial opinions is, despite running a remote company, I think it's a terrible idea for probably about 90% of people, right? Because you have to be extremely deliberate and you have to hire people who are really, really excited about the problem space, who are going to self-manage, who are going to go and do all of these things, right? And so that's what we've done so far. We've got like 25 people. We're remote, spanning all the way. You know, we have some people in the western hemisphere of Canada. And I'm from Vancouver Island, as we previously mentioned, right. I'm in San Francisco now. And then we span all the way. You know, we got Thailand, we got Dubai, we've got Japan. So there's a lot of different time zones that we're spanning in general. And so we've had to hire people who are autonomous and who can push these things. So that's been excellent from a leverage standpoint. But it does also mean that there's a specific class of individual who does really, really well at remote companies. And there's a specific class of individual who does well at in-person companies.
It's kind of more of a chocolate or vanilla thing. But yeah, it was definitely a conscious decision because I do enjoy the benefits that remote has. Not in a, you know, you can sit around and kind of twiddle your thumbs way, but you can meet with people at various different times in the day and, you know, their morning can be your evening. And so you can almost have this time compression handoff of saying, oh, we've really got to go and do this. And then you go to bed and then you wake up and it's done and you're like, damn, I love working with excellent coworkers, you know. And so it means that you almost get twice as much time in the week if you can get these handoffs.
[35:00]
Shawn Falconer
Like 100% in terms of like hiring people that are, you know, extremely motivated by the problem and also are able to self manage. Like I feel like that's kind of a requirement for any relatively small stage startup because you just can't afford to have to micromanage people in order to do their job. You have to hire people who are like, motivated to be there, believe in the mission, and also can essentially just like get things done and understand what needs to be done.
[35:24]
Jay Cooper
I totally agree, 100%. The corollary with that is that most people don't become good managers at the start, especially first time founders. You'll hear this from, you know, Emmett from Twitch has talked about how he was not a great founder originally. I think Brian or Joe have also talked about it from Airbnb. Right. And so it almost forces you to develop those skills while you're also trying to assemble the airplane. Right. And so you don't get a lot of that kind of benefit where you can smear this over time and learn those as you go through all of these stages. You really have to kind of learn and compress a lot of those learnings and you have to do it in there. So that's the only kind of thing that I generally see: yes, you'll have to do all things eventually. But we also talk a lot, you know, internally about focus, right? How do we stay focused on just doing the things that we can be top, number one in the world at. Right? Because that's where we're going to have all of the compounded returns. Right. So those are kind of some interesting benefits and drawbacks to remote that I think maybe people don't necessarily consider. You know, they consider the, yeah, sure, I can go for a bike ride in the middle of the day and work slightly later and stuff like that. And that's kind of copacetic with my, you know, operating cadence. Or, you know, we have people who have families. They spend the afternoon with their families and then they'll come back and, you know, polish up the stuff once the kids are in bed. Right. And I personally am a big fan of treating people, it's going to sound a little bit condescending, but treating people like adults, because everybody is an adult, right?
It's like they're going to manage all of their time and they're, they're going to like go and do all of these things, right? And you shouldn't kind of be around and basically say like, oh, I'm going to Approve all your PTO requests. I'm going to like go and do this right? Because again, those kinds of people that you mentioned, like, they're not going to be successful at startups in general. So we've almost kind of erred on the side of like, like, let's just pretty rapidly remove any of the guardrails in terms of onboarding and say like, hey, this is how we kind of operate. If you really, really like it, like here we are. And if we don't, then we can hopefully find a really, really awesome spot for you to like go and land in the future.
[37:14]
Shawn Falconer
How do you test for that? Like how do you find these people that are going to be a fit?
[37:18]
Jay Cooper
Yeah, so there's a couple of problems that I like to ask people. There's a couple of open-ended technical problems in terms of design. I don't think you test for this using LeetCode.
[37:28]
Shawn Falconer
You know, there's a lot of problems with using things like LeetCode.
[37:31]
Jay Cooper
Oh yeah. And so we prefer to go for kind of, maybe, the Montessori school of management, of open-ended problems, of saying, hey listen, how would you solve this class of real-world problems? Right? So I think that's one way to go and do it. I think also sitting down and chatting with people about what drives them, you know, are they passionate about X, stuff like that. And it doesn't need to necessarily be devtools, right? It could literally just be, I am so, so passionate about networking, right? It's just networking. I got Arista switches in my basement and all these other things, right? And I think if people have that thing that they're really, really passionate about, you can almost see it. It's gonna sound super holistic, but you can almost tell when people really, really care about these things, because the moment you poke them, it kind of expands and you're like, oh my God, that's so much stuff, you know, where did that all come from? Right? And where it came from is this kind of deep passion, this life experience, all of these other things, right? And so that's one way to go and test it in the interview. And then as part of onboarding, what we do is we rapidly kind of remove the guardrails as you go, right? So we have six weeks of onboarding, which may seem long, but the goal is almost pushing up the funnel on what you can do, right? Because by the end of the six weeks, we consider onboarding to be, how do we get you from good to doing great work and then being able to be fully autonomous, right? So that's the whole goal. We do two weeks of tasks. So it's like five tickets. You just have them in Linear, they'll be really straightforward. It's like, hey, this thing is actually broken.
And it's like, you know, you go in, you make a few lines of code changes and then, you know, that's your, your first two weeks and then you move on to the problems which is, hey, our cron experience sucks, right? Very different class of like problems that we've just given you, right? It's like, well, what sucks about it? Like what are the problems? All those things, right? And then people have to like go in and you know, maybe they'll ask their coworkers, maybe they'll go and ask the users, maybe they'll go and put together a document that says, I think these are the things. And then people will be like, no, I like what about like we could probably drop this requirement, right? And then they'll go and solve those problems. And then the third one is like opportunities, which is, you've been here a month, what do you think the company needs? That's the only prompt, right? And so you get to that open ended kind of like line of thinking by the end of it and you know, that's, that kind of pushes people more towards like, oh, I can actually do pretty much anything here. Right. And so it's just a matter of what.
[39:39]
Shawn Falconer
Yeah, yeah, I think there's a certain, you know, back to what you're talking about, where you're trying to look for what the person is really passionate about or interested in. There's a certain obsessive behavior that I think people need to be successful in startups, and it doesn't even need to be, historically, an obsession that relates necessarily to what the company is doing. But you have to kind of be manically obsessed about solving something in your life, because a lot of it is really about sort of knocking down doors and solving problems. And the level of distraction also hits you much faster, earlier in your career, I think, at a startup, because there's just fewer people around. To your point, by week six, you're asking questions about what the company could be doing better. If you're at a really large organization, and you've worked for some, I've worked for some, you could really be in that first stage of, here's a really small problem that you need to solve, and that could be the first two years of your life in existence there.
[40:35]
Jay Cooper
Yeah. Right. So we aim for trying to find those people because we think that, I think even when it comes down to solving problems excellently, right, that last kind of 5% of the problem is kind of where most of the progress is, right. And if you're not focused and, you know, you're consistently oscillating and bumping between all of these different things, right, you're probably not going to actually get to the meat and potatoes of what that problem actually is. Right. I'm a very, very big believer of sit there and run a ton of different revisions, right? And figure out why these things were either better or worse than your previous revisions, right. And make sure you have a clear goal that you're kind of iterating towards. Right. And I think that doing that and doing that well, you know, that focus and that passion is almost a necessary precondition. It unlocks pretty much all of those other things. Right. And I don't see how people can potentially do it without it. I've seen it once in a blue moon, but it's also very, very rare. So, you know, we aim for kind of saying, okay, well, how do we generalize this class of problem of finding these extrinsically motivated people, right? And saying, this is kind of in general the archetype of individuals that we've seen be quite successful here.
[41:43]
Shawn Falconer
Yeah. I mean, there's a quote about, I don't remember the exact quote, but it's like, you know, we've completed 90%, so now we get to start on the other 90% of the project essentially. Because.
[41:52]
Jay Cooper
Exactly.
[41:52]
Shawn Falconer
It's really that last 10% where all the hard work, the meat and potatoes, has to exist in order to actually do that, whether it's a product or whatever it is that you're building. That's where the polish happens. That's where, you know, you run into your scale problems and so forth.
[42:05]
Jay Cooper
It's also important like not just from like an individual perspective, but like from working with other individuals perspective. Because that last 10%, it's the hardest, like that last extra mile of like going and doing the thing, it's like, yeah, well, like, this is, like, good enough, right? Like, it'll work, right? And if you work with people, like, who have, you know, outsized talent or outsized standards or anything else like that, they'll basically say, like, no, we can do better, right? Like, we can push this thing a little bit farther, right? And then they'll have some, like, tools in their toolkit. They'll have, you know, some skill or some emotional acumen or some way to kind of, like, judo your thinking and basically saying, like, what if we thought of it like this? Like, I think this is really, really close, and this is where the reason why it's really, really good. Like, can we try this? Right? Versus, you know, I think the conventional thinking at larger companies is, like, okay, cool, it's done. Like, let's just. We'll move on to the other thing.
[42:47]
Shawn Falconer
Well, awesome. Jake, thanks so much for being here.
[42:49]
Jay Cooper
Cool. Awesome. Yeah, thanks so much for having me. This was great.
[42:52]
Shawn Falconer
All right. Cheers.
[42:53]
Jay Cooper
Cheers.