AI + a16z: Tigris Data CEO on Building Your Own Datacenters

Podcast Host: a16z (Martin Casado)
Guest: Ovais Tariq, Co-founder & CEO of Tigris Data
Date: November 7, 2025

Episode Overview

This episode features Martin Casado (General Partner at a16z) in conversation with Ovais Tariq, CEO and co-founder of Tigris Data—one of the few startups building independent, core infrastructure for the AI era. They discuss the unique challenges of storage startups, why Tigris operates its own datacenters, the changing needs of AI workloads, technical and operational hurdles, developer experience, and how AI is reshaping software development and infrastructure companies.

Key Discussion Points & Insights

1. Why Independent Storage is Uncommon and Hard (03:25–06:51)

Independent companies exist for most foundational cloud services (databases, data warehouses) because they can build on top of cloud providers, including AWS S3. Independent storage is rare because it has to compete with that foundational layer itself, which is where Tigris operates.
- “What we are effectively doing is building a product that competes with the foundational services that cloud provides. And that is something that is really hard to make cost efficient on top of the cloud itself.” – Ovais Tariq, 04:24
Building storage involves not just software but also hardware and operational expertise.

2. Technical Architecture: Global, Active-Active Object Store (05:22–07:30)

Tigris offers global storage with no concept of regions—data dynamically moves to wherever compute happens; ideal for AI workloads running in many locations.
- “With Tigris, there’s no concept of a single region. The storage is distributed and it gets dynamically placed wherever compute is running… The goal is to provide local access to storage everywhere, regardless of any region.” – Ovais Tariq, 05:26
Managing global, S3-compatible storage requires deep knowledge of both distributed systems (for metadata) and hardware (for data durability, reliability).

3. Why AI Needs Specialized Storage (08:34–09:49, 10:12–12:25)

AI and ML workloads are inherently distributed—requiring storage close to compute to minimize latency and support new cloud providers focused on training and inference.
Traditional object storage (S3) is poor for AI use cases:
- Struggles with massive volumes of small files (ML datasets)
- Can’t provide low-latency access needed for real-time AI (e.g., audio gen AI)
Tigris focuses on the remote storage tier (the “data lake”) rather than the local storage on GPU clusters.

4. Developer Experience & Immutability (14:25–16:29)

Strong focus on making high-performance infrastructure simple for developers:
- “Oh, I'm super focused on developer experience. I use the product myself as well.” – Ovais Tariq, 14:45
Tigris is append-only and immutable, enabling features like instant, zero-copy snapshots and the ability to instantly fork even petabyte-size buckets.
- “You can instantly create a fork. Zero copy fork.” – Ovais Tariq, 16:22
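
To make the "snapshot is just a pointer in time" idea concrete, here is a minimal sketch of an append-only store in Python. It is an illustration under stated assumptions, not Tigris's implementation: the `Bucket` class and its `put`, `get`, `snapshot`, and `fork` methods are hypothetical names, and a real system would index the log rather than scan it.

```python
class Bucket:
    """Toy append-only bucket (hypothetical): writes append, nothing is overwritten."""

    def __init__(self, parent=None, parent_version=0):
        self._parent = parent                  # shared with the fork, never copied
        self._parent_version = parent_version  # how much of the parent's log is visible
        self._log = []                         # this bucket's own (key, value) records

    def put(self, key, value):
        # Mutation never happens in place; a new record is appended instead.
        self._log.append((key, value))

    def snapshot(self):
        # A snapshot is only the current log length: a pointer in time.
        return len(self._log)

    def get(self, key, version=None):
        end = len(self._log) if version is None else min(version, len(self._log))
        # The newest record wins, so scan this bucket's log backwards.
        for i in range(end - 1, -1, -1):
            k, v = self._log[i]
            if k == key:
                return v
        # Fall back to the part of the parent that existed at fork time.
        if self._parent is not None:
            return self._parent.get(key, self._parent_version)
        return None

    def fork(self, version=None):
        # Zero-copy: the fork stores a reference and a version, not the data.
        at = len(self._log) if version is None else version
        return Bucket(parent=self, parent_version=at)


bucket = Bucket()
bucket.put("weights.bin", "v1")
snap = bucket.snapshot()
bucket.put("weights.bin", "v2")
fork = bucket.fork(snap)
fork.put("notes.txt", "draft")
# prints: v2 v1 v1
print(bucket.get("weights.bin"), bucket.get("weights.bin", snap), fork.get("weights.bin"))
```

Because old records are never modified, creating a snapshot or fork costs the same whether the bucket holds a kilobyte or a petabyte, and reading at an older version acts as a rollback. The trade-off is that the log keeps growing, which is why the next section turns to capacity planning.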

5. Capacity Planning & Running Your Own Datacenters (16:30–22:34)

Immutability means storage requirements continually grow. Capacity planning remains crucial—not just for space but also IOPS (input/output operations per second).
Uber ran 100 petabytes of operational storage entirely on SSDs; Tigris likewise relies on modern, high-density drives.
- “The per drive capacity has been increasing drastically. For example, in our case... we have 25 terabyte per drive.” – Ovais Tariq, 18:41
Setting up datacenters is simpler than a decade ago due to increased reliability and efficiency of modern hardware.
Tigris pre-populates racks, shipping them to datacenters to minimize on-site work.
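
A rough worked example shows why drive density matters for capacity planning. The sketch below combines two figures from the episode purely for illustration (Uber's 100 petabytes and Tigris's 25 terabytes per drive). The `drives_needed` helper, the 150% redundancy overhead, and the 20% headroom are assumptions that do not come from the episode, and decimal units (1 PB = 1,000 TB) are assumed.

```python
def drives_needed(data_tb, drive_tb, overhead_pct=100, headroom_pct=0):
    """Return the number of drives needed to hold data_tb terabytes.

    overhead_pct: raw bytes stored per logical byte, in percent
                  (100 = no replication or erasure-coding overhead).
    headroom_pct: extra free-space buffer kept on top, in percent.
    Both parameters are illustrative assumptions, not Tigris figures.
    """
    numerator = data_tb * overhead_pct * (100 + headroom_pct)
    denominator = drive_tb * 100 * 100
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-numerator // denominator)


DATA_TB = 100 * 1000  # 100 PB expressed in TB (decimal units)

print(drives_needed(DATA_TB, 25))  # raw capacity only, 25 TB drives
print(drives_needed(DATA_TB, 1))   # the same data on 1 TB drives
print(drives_needed(DATA_TB, 25, overhead_pct=150, headroom_pct=20))
```

With no overhead, 100 PB fits on 4,000 drives of 25 TB, versus 100,000 drives of 1 TB. Under the assumed 150% overhead and 20% headroom the count rises to 7,200, which shows how quickly the buffer size drives cost.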

6. Team Structure, Skills, and Data Reliability (22:34–26:24)

Tigris divides engineering into infrastructure (hardware, automation) and distributed systems (software) teams.
For metadata consistency and reliability, they use FoundationDB, the same database used by Apple’s iCloud and Snowflake.
- “FoundationDB is actually one of the only databases I know of that has implemented simulation testing. They are really maniac about how they do testing.” – Ovais Tariq, 24:26
The company takes a “batteries-included” philosophy—users get built-in caching and robust routing without needing to configure these themselves.

7. AI Coding Tools and Engineering Productivity (28:30–31:44)

Senior developers at Tigris use AI coding tools (notably Claude and Cursor) for significant productivity gains.
- “For a long period of time, the only advantage new college grads brought is speed—[but] none of that matters anymore. What matters right now is systems thinking—whether you can design a system, you architect a system, you know how systems work. Because if you know that, then you can use AI assisting.” – Ovais Tariq, 29:02
Team mandates using AI tools for code reviews, debugging, and log analysis.
80% of Ovais’s own code is AI-written.

8. Cost Structure & Eliminating Cloud Egress Fees (32:09–34:33)

Tigris abolishes egress fees (the cost penalty for transferring data out of a cloud), a major pain point of incumbent cloud providers.
- “We have done away with the cloud tax, which is the egress bill. So there’s no egress.” – Ovais Tariq, 32:32
- “80% of [a customer’s] storage bill was egress.” – Ovais Tariq, 34:13
Lower infrastructure and operational complexity allow Tigris to pass on cost savings and to optimize for AI-specific workload patterns.
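
The 80% figure implies a large multiplier, which a few lines of arithmetic make explicit. This is a sketch assuming every non-egress charge stays the same after egress is removed; the `bill_without_egress` helper and the 10,000 monthly bill are hypothetical values chosen only for illustration.

```python
from fractions import Fraction


def bill_without_egress(total_bill, egress_pct):
    """Return (remaining bill, reduction factor) once egress charges are removed.

    egress_pct is the share of the bill spent on egress, in whole percent.
    Assumes all other charges are unchanged (an illustrative simplification).
    """
    if not 0 <= egress_pct < 100:
        raise ValueError("egress_pct must be in the range [0, 100)")
    remaining = Fraction(total_bill * (100 - egress_pct), 100)
    factor = Fraction(100, 100 - egress_pct)
    return remaining, factor


remaining, factor = bill_without_egress(10_000, 80)  # hypothetical bill
print(remaining, factor)
```

When egress is 80% of the bill, removing it leaves one fifth of the original cost, so the customer in the quote would pay roughly five times less for the same stored data.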

9. Future of Infrastructure & Cloud (36:11–37:48)

Tariq sees an ongoing trend toward specialized infrastructure providers (compute, storage, higher-level services), catalyzed by AI workloads.
He prefers inference workloads over training for storage efficiency: access typically follows a Pareto pattern (80% of requests hit 20% of the data), which keeps caching and IOPS manageable.
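
The Pareto argument can be checked with a small model. The sketch below assumes an ideal cache that always holds the hottest objects and uniform access within the hot 20% and the cold 80% of the data. The function names and the 10,000 requests per second are hypothetical, chosen only to illustrate the reasoning.

```python
from fractions import Fraction

HOT_DATA = Fraction(1, 5)      # 20% of the data...
HOT_REQUESTS = Fraction(4, 5)  # ...receives 80% of the requests


def cache_hit_rate(cache_fraction):
    """Hit rate of an ideal cache holding the hottest cache_fraction of the data.

    Pass a Fraction, an int, or a decimal string such as "0.2" for exact results.
    """
    f = Fraction(cache_fraction)
    if not 0 <= f <= 1:
        raise ValueError("cache_fraction must be between 0 and 1")
    if f <= HOT_DATA:
        # The cache covers only part of the hot set.
        return HOT_REQUESTS * f / HOT_DATA
    # The hot set is fully cached; the rest of the cache holds cold data.
    return HOT_REQUESTS + (1 - HOT_REQUESTS) * (f - HOT_DATA) / (1 - HOT_DATA)


def backend_requests_per_second(request_rate, cache_fraction):
    """Requests per second that miss the cache and reach the drives."""
    return request_rate * (1 - cache_hit_rate(cache_fraction))


print(cache_hit_rate("0.2"))                       # cache sized to the hot set
print(backend_requests_per_second(10_000, "0.2"))  # load left for the drives
```

Under these assumptions, caching a fifth of the data absorbs four fifths of the traffic, so only 2,000 of 10,000 requests per second reach the drives. Training scans touch data far more evenly, which is why the same cache helps them much less.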

Notable Quotes & Memorable Moments

On the challenge of independent storage:
“What we are effectively doing is building a product that competes with the foundational services that cloud provides. And that is something that is really hard to make cost efficient on top of the cloud itself.”
— Ovais Tariq, 04:24
On product philosophy:
“We are not trying to build something very general. We’re trying to... focus just on storage, GPU providers focusing just on the GPUs. And that’s what I feel like is going to be, that’s how the world is going to be.”
— Ovais Tariq, 11:23
On capacity planning:
“So the immutability of the system allows... the storage keeps on growing... capacity planning in general is important... not just the amount of space you have, but also the IOPS.”
— Ovais Tariq, 16:54
On developer experience:
“The infrastructure needs to be as simple to use as possible.”
— Ovais Tariq, 14:51
On AI for engineering:
“What matters right now is systems thinking... if you know that, you can use AI assistants, things like Claude Code or Cursor, and build a system. So that’s the thing that’s most important. That’s why the senior folks are succeeding.”
— Ovais Tariq, 29:02
On egress fees:
“We have done away with the cloud tax, which is the egress bill. So there’s no egress.”
— Ovais Tariq, 32:32
“80% of their storage bill was egress.”
— Ovais Tariq, 34:13

Timestamps for Key Segments

[03:25] – Why storage startups are rare; difference from other cloud services
[05:22] – Tigris’s global storage model; non-regional, S3 API-compatible
[08:34] – AI workload demands and Tigris’s fit
[14:25] – Developer experience; interface design, append-only system
[16:30] – Immutability, snapshots, and capacity planning
[19:39] – Building physical datacenters: operational lessons
[22:34] – Team structure, FoundationDB, reliability guarantees
[28:30] – Using AI-assisted code tools throughout the stack
[32:09] – Cost and pricing: No egress fees, user economics
[36:11] – The future: Rise of specialty infrastructure providers

Conclusion

The episode provides an insider’s look at the technical and operational challenges of building independent, global storage infrastructure for AI, underscored by an emphasis on developer experience, operational excellence, cost savings, and the transformative role of AI in productivity.
Ovais Tariq’s hands-on perspective—spanning deep technical details, operational realities, and shifting industry economics—offers a window into the next wave of infrastructure innovation driven by AI workloads.
For developers and companies building with or for AI, Tigris exemplifies the new breed of specialty providers: cost-efficient, globally distributed, simple to use, and optimized for the future’s most demanding workloads.

Transcript

[00:00]
A
Before cloud it was the same thing. You have SANs and you need to have storage administrators. Everyone needed to do capacity planning. Now it's just the service providers that have to do the capacity planning.
Guess what's the simplest thing to do with immutable? Taking a snapshot, just like, you know, a pointer in time. So that's what we are, that's what we're working on. Yeah, that's what we're releasing. A feature that allows you to fork instantly. Even if it's a petabyte of bucket, you can instantly create a fork. Zero copy fork.
The global part or the routing part, that is where distributed system knowledge is important. Right. So for example, in our case we are an active-active system. So you can write anywhere, read anywhere, multi-master writes. So that brings consistency problems. How do you deal with that? How do you provide read-after-write consistency? That's a distributed system problem. But when you think about the bytes getting stored on disk, that's where it becomes a hyper localized storage problem. That how do you make sure that the bits are always there the same way that they have always existed.
In this episode, a16z general partner Martin Casado sits down with Ovais Tariq, co-founder and CEO of Tigris Data to.
[01:04]
B
Discuss why independent storage is so hard.
[01:06]
A
What operating your own data centers is like and what's in store for the future of cloud.
[01:11]
B
Let's get into it. Ovais, thanks very much for joining us. So for everybody listening, we're having a chat with Ovais Tariq, founder and CEO of Tigris, which is one of the few actual super core infrastructure companies right now. So I'm very excited for this and so let me just provide a bit of background. So Tigris does storage think of like S3. It is the leader in doing storage for AI companies which are incredibly storage heavy. Right. A lot of these gen AI companies think image and video, et cetera. And when I say infrastructure I mean like all the way down to data centers and servers and the whole thing. So you know, the reason I was very excited for this is there aren't a lot of companies that are, you know, like that level of ambition. You've got a huge background to do this. So we're going to talk through, you know, how does AI change the needs for infrastructure? And then we're going to talk through kind of like your experiences doing this. So welcome.
[02:00]
A
Thank you so much. Great to be here.
[02:01]
B
So let's go ahead and start this way. So maybe can you give a little bit of a background that kind of brought you to this case. I mean, like you, I mean, didn't you run storage for Uber?
[02:11]
A
Yes, so. So I founded Tigris with.
[02:15]
B
Is it Tigris or Tigris?
[02:16]
A
I mean the Americanized way of saying it is Tigris, but I guess the real way is the way that you were using before. It is not Tigris. Yeah, but I've been used to saying Tigris because that's what resonates more with people. So that's how.
[02:29]
B
When in Silicon Valley. Fair enough.
[02:30]
A
Yeah. So going back to. So how all of this has started or I got how I got into this. So Tigris started about four years ago with. I have two other co-founders. They are, they worked with me at Uber. I, as you mentioned, I ran storage at Uber and Uber had a global storage footprint. And Uber was one of the companies that was very different from other companies at that time. Uber wasn't building on the public cloud. Uber was running its own data centers. And Uber had a global infrastructure problem. Right. Like Uber has to be available in all of these cities, all around the world. And uptime is really important because there are a lot of critical services that are also dependent on moving people around. So being part of that kind of building out the storage part of the global infrastructure, having to deal with data centers, even designing our own hardware and building at that scale gave me some unique insights into how a similar kind of infrastructure could be built outside. And then I started working on Tigris.
[03:25]
B
Awesome. You know, I've always wondered this. If you take most of the popular cloud services, say whatever, Redshift, the data warehouse, you've got an independent company. And like the one area that it's taken a long time for somebody to build an independent company has just been storage. And do you have any guesses to why that is or do you have any theories on why that is?
[03:44]
A
You are right. In terms of if you look at other foundational services or for example, you mentioned Redshift and then you know, Snowflake, Databricks. Right. They are much more successful as compared to...
[03:54]
B
You've got Confluent. I mean, it just feels like for everything you've got like an independent company who's successful, but we just haven't seen them in storage. And I've always wondered, is it just like the knowledge isn't there? It's too hard? I mean, what do you think?
[04:08]
A
Storage is definitely quite hard. So one of the things with doing storage is that it's not something you can do on the cloud. So when you're talking about building a database or you're talking about building a data warehouse product, that's, you know, if you talk about Snowflake or Databricks or PlanetScale or other companies.
[04:23]
B
Oh, I see what you're saying.
[04:24]
A
What they're doing is that they're building on top of the cloud. So they're not actually competing with service that the cloud is providing, they're building on top of S3. Right. And what we are effectively doing is building a product that competes with the foundational services that cloud provides. And that is something that is really hard to make cost efficient on top of the cloud itself. So you have to go all the way down, not just build a software layer, but build the hardware layer for it as well.
[04:50]
B
Okay, awesome. So I work with a number of companies, mostly gen AI companies that use you as a back end. I mean, let me just kind of describe the way that like, I don't know, I view it and then you can tell me where I'm wrong, which is it's kind of like a global object store. Yeah. But it's actually really global, so you can just kind of put objects anywhere. So like I use Tigris. Tigris. Tigris myself. Like anytime I have a big object, even just for sharing files, like I literally just go to the dashboard, I put it there. So it's like truly global in the sense that you don't actually pick regions. It's just kind of up there, you put it there and then you can read and write from anywhere. So is that a fair way?
[05:23]
A
That's a really great way to describe it.
[05:24]
B
Like super simplistic, but that's kind of how I think about it.
[05:27]
A
It's super simplistic from a developer perspective. From an end user perspective, you don't have to think about where the data is getting stored. So the way I talk about Tigris is exactly as you were talking about it as well, that it's a globally distributed storage service that's S3 API compatible, but it's fundamentally different from S3. So when you think about cloud, you think about kind of centralized workloads. The data is stored in one region, your compute runs in one region. But with Tigris, there's no concept of a single region. The storage is distributed and it gets dynamically placed wherever compute is running, it gets moved around where compute is running. The goal is to provide local access to storage everywhere, regardless of any region. So that's the differentiating factor. And that's why some of the AI companies, the well known AI companies are using us.
[06:15]
B
And if I could just get maybe like a traditional infrastructure-y wonky for a second. So the only other team I know that has actually built like an S3 compatible global storage thing is the Magic Pocket team at Dropbox. And I actually talked to the team and here's what, what struck me was like you both had to understand distributed systems fundamentally for the metadata. That's a really hard problem. But like also they were talking about having the right hard drives. So I just can't think of another system where you have to be like super high level distributed systems but also actually have to worry about the hardware. Is that the case with what you're doing or is that the wrong way to think about it?
[06:51]
A
No, that's exactly the way to think about it.
[06:53]
B
Do you worry about the actual hard drives?
[06:54]
A
Yeah, we worry about bit rot. We worry about what's going to happen over time to the data that's getting stored on the disk. So yeah, so the global part or the routing part, that is where distributed system knowledge is important. Right. So for example, in our case we are an active-active system, so you can write anywhere, read anywhere, and you have multi-master writes. Yeah, multi-master writes. So that brings consistency problems. How do you deal with that? Right. How do you provide read-after-write consistency? That's a distributed system problem. But when you think about the bytes getting stored on disk, that's where it becomes a hyper localized storage problem. That how do you make sure that the bits are always there the same way that they have always existed?
[07:30]
B
That's so crazy. The amount of data that you store, like totally global, like you've built this entire system. Is a lot of it because you have the earned knowledge of having built it before in Uber?
[07:39]
A
I would. Yeah, a lot of it is definitely due to having worked with another global infrastructure. As I was talking about before, like Uber, we had to build a global infrastructure. So a lot of knowledge of course comes from having spent time out there and built the storage infrastructure over there.
[07:55]
B
So I'll tell you why I use it. I use it just because it's the easiest way for me to store large objects. So for example, I work with a generative 3D company that creates these ply files which are Gaussian splats. They're huge. They're like every one is like 300 megs or a gig or whatever. And then you make these video fly throughs of them to share around. Those are also hundreds of megs. And so for me it's just the easiest way to do that, you know. But I've also seen like a lot of our portfolio are using this as like actually their primary storage. And so maybe can you talk through like why AI requires something like Tigris and then you know how it's different than like a classical cloud storage.
[08:34]
A
Yeah. So when I. When you think about compute in AI sense or for AI workloads, ML workloads, when I think about it, I think about it as distributed compute. So there are specialized cloud providers that are running in multiple different regions and these AI workloads are running in a distributed fashion.
[08:51]
B
Is actually, is that, Sorry, I don't mean to interrupt. Is that part of the argument for having a third party storage is we're actually seeing the rise of a new compute cloud and these new compute clouds basically do inference and like they're not the traditional clouds and so like they need a storage layer. Is there, is that an argument or no?
[09:08]
A
That's definitely one of the arguments. The new cloud providers, they are, you know, hyper focused on specialized workloads, right? On inference workload or training workload. And they do not have hundreds of services. They're not building general platform, they're not building AWS-like cloud, hundreds of services, right. And they need storage. Right. And there is no storage service that exists that has the same reliability, same durability that S3 provides, while being usable with these new clouds, and that's what we are building. So there's definitely a strong case for these companies and that's why we see the customers that are using these new clouds, they come to us for their storage needs.
[09:50]
B
If I were just to look at, I don't know, if I didn't know, let's say whatever I'm running S3 and I didn't know what the app was, would I be able to tell the difference between an app that's like more traditional non gen AI, like traditional kind of web based thing versus a gen AI. Do they look that different from the storage layer?
[10:13]
A
From a storage perspective, definitely. The workload patterns are different. So if it's a training workload, it requires scanning through billions and billions of small files. So that workload pattern is effectively very different. And traditional cloud object storage services like S3, that doesn't really work well.
[10:33]
B
It's famously bad at small objects, right?
[10:35]
A
Yes, exactly. So that's one workload pattern that is completely different. The other is think about audio gen AI applications, right? They require low latency close to the user, right? Because imagine talking to a support specialist that's actually an AI. You don't want latency, you don't want your request to go to some central location. Right. You care very much about real time latency. So again, and for that you want to have compute close by. But what about storage? You also need to have storage close by. Think about the private compute work that Apple is trying to do where some of the compute runs on the phone and then some compute spills over to a private cloud nearby. That's all examples of distributed compute and, you know, need for distributed storage.
[11:18]
B
Do you think over time you'll have to wrap other services like compute in this or do you think it stays pretty standalone?
[11:23]
A
Our focus is to keep it really standalone and focus on storage. There's so much work that needs to be done on the storage side. So we are not really thinking of venturing into the compute space. We would rather partner with other, you know. Yeah, totally specialist. And that's what I feel like makes the new clouds and us different from the traditional cloud providers that we are hyper focused on a speciality. Right. We are not trying to build something very general. We're trying to, you know, if there's a database company that's getting popular outside, it's getting popular because it's focusing just on the database. We are focusing just on storage, GPU providers focusing just on the GPUs. And that's what I feel like is going to be, that's how the world is going to be.
[12:00]
B
So how, how confident are you that you can. Maybe it's the wrong way to ask a question. Here's my impression of these large training runs for these state of the art models is that they use lots and lots and lots of data. And so is there a point in the design space that you are focused on as far as like where you think maybe it's not the right, you know, there's too much data or do you feel like, you know, Tigris is ready to do the largest storage possible?
[12:25]
A
It has to be the largest storage possible. But there is a hierarchy of storage needs. So there is storage that is needed local on the GPU clusters because you kind of need that local performance. And then the next hierarchy is remote storage, which is what we are focusing on. So we're not focusing on the localized storage right now. That's another thing that we would do in the future. But right now we're focused on that. I don't want to use the word data lake, but something like that where all the data gets stored.
[12:55]
B
Do you make any assumptions on what software is running on top of the storage layer? You know, is there like common frameworks that you see showing up for these things or is it all basically custom code written for the, by the AI folks, the gen AI folks that are building these?
[13:14]
A
In most cases it tends to be custom software that's running on top. To us it all seems like the same sort of API call, you know, the same set of, you know, large scans or you know, a lot of point lookups. It all looks the same.
[13:26]
B
Really? So, I don't know, like these things. Like, I hear a bunch of framework names in the AI space, but those really. It's kind of immaterial. You're kind of agnostic to them.
[13:35]
A
Yeah, we are agnostic to that. We don't really specialize for any of those frameworks. But things like, you know, Lance format for example, we do start seeing that as getting popular. You know, people using PyTorch, people using Ray? Sure, we see that, but those are like common tools that, you know, people have as part of their tool belt.
[13:52]
B
What about the actual files themselves? Do you see more, you know, image, video, music stuff or language stuff or again is it kind.
[14:02]
A
So we are actually most popular with media oriented workloads. So we see a lot of images, we see a lot of audio files, we see a lot of videos. That seems to be the most popular workload utilizing us. And that's where the size of the data set also comes into play. If it's a textual data set, it's not going to be as big as video oriented data set. And that's also where latency is more important.
[14:25]
B
Yeah. One thing I got to say that I've really appreciated of using Tigris is like the actual devex is really nice. I know this is kind of funny. It's funny to me. Talk to this hardcore infrastructure guy who's whatever, racking and stacking servers and doing a big distributed system. But that actually really matters to me. Like it's just really simple to use. So like how do you think about the actual developer experience?
[14:45]
A
Oh, I'm super focused on developer experience. I use the product myself as well. I use Tigris myself as well.
[14:51]
B
By the way, I want to let you know I was using it before the file system showed up. So like when the only option was like, you know, the S3 API.
[14:59]
A
Yeah, no, we are, we have to be compatible with the S3 API. It's the S3 SDK. It's not the best SDK, but it's the one that's most commonly used. We're also working on our own version of the SDK and devex is super important. The infrastructure needs to be as simple to use as possible. And some of the features actually improve devex, for example, the global nature of the product, the read and write anywhere aspects of the product. And there are some other new things that we're working on that will improve the devex further.
[15:30]
B
Is there any way we can get a peek?
[15:32]
A
I can talk about that briefly. So one of the cool things about the design of Tigris is immutability. So it's an append-only, log-based design. Right.
[15:43]
B
I remember the first paper was Mendel Rosenblum's log-structured file system. The log-structured file system was basically append-only.
[15:49]
A
Yeah. So it's an append-only system. There's no mutation that happens in place and this is a feature that was not really exposed to the users in any way. But as we've been talking to agentic workflows, like people who are building workflows and frameworks, they would really like to be able to snapshot storage. And guess what's the simplest thing to do with immutable? Taking a snapshot, just a pointer in time. So that's what we are, that's what we're working on. Yeah, that's what we are currently releasing. A feature that allows you to fork instantly. Even if it's a petabyte of bucket. You can instantly create a fork. Zero copy fork.
[16:25]
B
Wow.
[16:25]
A
You can instantly snapshot it.
[16:27]
B
Does that mean you can roll back? Yeah, you can roll back.
[16:29]
A
Yeah.
[16:30]
B
I always worry for these types of systems that like, I don't know, you run out of hard drives. Like how do you even think about like capacity planning for this? Like, is it just a silly thing? There's just, you know, like so much data that you don't have to worry about it. Or is it something you have to worry about?
[16:43]
A
No. Capacity planning.
[16:44]
B
I mean this is probably a stupid question, but every time I hear about oh, you can kind of snapshot a petabyte, I'm like, oh, there's you know, like a closet full of hard drives they're going to use for every snapshot. And yeah.
[16:54]
A
So the immutability of the system allows. It essentially means that the storage keeps on growing until. Yes. So that's like capacity planning in general is important when you have to think about. You don't have to just think about the amount of space you have, but also have to think about the IOPS.
[17:14]
B
Right? Yeah.
[17:14]
A
How much IO capacity I have. We sometimes see cases where people are running training workloads and they'll be sending tens of thousands of requests instantly. Right. There's no notification as such. So of course you have to plan around things like that, make sure the system is always available.
[17:30]
B
Can I just dig into the actual capacity planning for the number of bits? Let's just take the case of Uber and this is just a naive, the Martin naive question. But like, do you just keep buying hard drives? Like how is it you can actually have an infinite storage? Like how do you think about that?
[17:48]
A
So.
[17:48]
B
Or I remember like way back when you like move stuff onto tape, but that was like 30 years ago. What do you do now?
[17:55]
A
So you know, Uber operational storage, 100 petabyte of data.
[17:59]
B
Yeah.
[18:00]
A
All on SSDs. What? None of that on hard drives. None of that on hard drive. Yeah, all on SSDs.
[18:08]
B
No kidding.
[18:09]
A
We spent a great amount of time, you know, working with the vendors, looking at newest technologies like TLCs and QLCs, on how to make that cheaper and cheaper. Yeah, capacity, Yeah. I mean there used to be a constant capacity planning exercise. You have to think about, you know, how much buffer you need to keep in that, you know, the bigger the buffer, the more the cost you need. So there's always a, you know, so.
[18:34]
B
The idea is you just keep buying more hard drives. Infinitely. Like at some point in time you'd think you'd be aging this stuff out or putting it in like longer term storage or something. Right.
[18:41]
A
I mean you. So you have to continuously keep buying the hard drives, no matter. So hard drives are going to fail after five years anyway. So you need to replace them. Yeah, you need to keep buying them. But the good thing is that the per drive capacity has been increasing drastically. For example, in our case, you know, we have 25 terabyte per drive and imagine being, you know, one terabyte like you know, a few years ago. Right. So you have to buy fewer drives, but you still have to keep buying them, you have to keep recycling them and you have to keep looking at new technologies that are coming up that are making the drives more efficient, more dense.
[19:17]
B
Yeah. And this is why we have people like you deal with this because there's just a lot of complexity on all of it.
[19:21]
A
Yes, it's definitely a lot of complexity. It's one of the reasons why cloud was born, right? Before cloud, it was the same thing. You need to buy, you have SANs and you need to have storage administrators. Everyone needed to do capacity planning. Now it's just the service providers that have to do the capacity planning.
[19:39]
B
So the last time we spoke, you were just finishing up a data center. It's done now, right?
[19:47]
A
Yeah, we actually have two sites and the third site is going to come up, come live soon, very soon, you know, this week, by the end of the week it's going to be live.
[19:54]
B
So I'm just curious before you get into that, it just doesn't seem like a lot of people are doing this right now. And so was it kind of a little bit more complicated than you would have expected to stand up a data center or is that all still pretty straightforward?
[20:09]
A
It has definitely become much easier compared to before. Running your own hardware has become less complicated. If you set aside the operational part, the capacity planning part, and just think about the hardware in general, the failure rate is lower. These machines scale vertically much better. They are more energy efficient. So there is definitely complexity that has been reduced, but there are still very few people doing that. In fact, most of the people that are doing it are in the AI space. So when we were actually talking to the data center vendors and trying to see where to get the space, we were also talking about density. But somehow they thought that we were GPU people, and that's why we cared about power density.
[20:51]
B
Oh, I see. And so what do you think about? I mean, I know for GPU planning, power is actually the important thing. For storage, what is it? Is it space, like physical space, or...
[21:02]
A
Something is similarly important.
[21:03]
B
Heat, maybe?
[21:04]
A
Yeah, I mean, cooling and power are similarly important. We are not as power dense as a GPU rack would be, but I would say we would be about one third the density in a rack as compared to GPUs. That's because hard drives take up a lot of power. SSDs are very power efficient, and they are costly, though they're getting cheaper, while hard drives are not that power efficient. So you have to worry about the power.
[21:28]
B
So do you have actual employees next to the servers? Is that a stupid question, or is it something you can kind of manage remotely? What are the implications? Clearly there's a financial implication, which we can get into. But what are, as a founder, the implications of having a data center from the operational side?
[21:44]
A
From the operational side, it places a huge burden for sure, because now you are responsible for dealing with hardware failures. It's not cloud, right? And again, going back to capacity planning, you have to actually do the capacity planning.
[21:58]
B
So it's like someone walking around replacing hard drives?
[22:01]
A
So for the initial data center setup, we actually went there in person and did that. But the way we have, you know, set it up, it's possible to use remote hands, people who are working there. We give them the instructions and they're able to make the changes that we need. One thing that we did to make our life easier is we designed the racks here, so the rack is fully populated and the entire rack gets shipped. Right. So the amount of work that's needed in the data center is not much. It's mainly just plugging in the network, plugging in the power, and the whole rack is online.
[22:35]
B
How about networking? Is that you or is that somebody else?
[22:38]
A
So network.
[22:39]
B
Because you kind of have to like make these things look like the same thing. What do you do? Like some anycast sort of thing or is it the. Yeah, how does networking work? Yeah, it's complicated.
[22:48]
A
Yeah, networking is complicated, and it's more complicated in this case because we also provide storage services in a way where it's co-located with other customers like Fly, or wherever the customers have to run. It's complicated. Yeah, we are relying on a combination of anycast and GeoDNS to manage that.
[23:06]
B
That's really cool. Yeah. So far I don't even know how you built your team. Right. Because I was imagining the interview question. You're like, oh, whatever. Do you use. What is the power density of hard drives all the way up to. How do you do global routing to how do you not ever lose anything? You need to have strong consistency at the metadata layer. Right. This is pretty serious stuff. You can't lose data, right?
[23:29]
A
No, we cannot lose data. So yeah, strong consistency is important, but there are two different disciplines of software engineering, I would say, that are in play here. That's why we have, you know, two different skill sets. We have the infrastructure team and then we have the distributed systems team. The distributed systems team is the product team, which is separate and focused completely on the product. The infrastructure team is focused on racking, stacking, and automation.
[23:49]
B
How do you get, how do you get your, your, your users comfortable that you'll never lose data? Did you build your own transaction layer? Like how does that work? That's such a hard problem.
[24:01]
A
It is a hard problem. So for metadata we are using FoundationDB. FoundationDB.
[24:06]
B
That's what Apple uses?
[24:07]
A
Yeah, Apple uses it. Snowflake is a big user, and Apple actually uses it for their iCloud, which is this global storage system as well.
[24:15]
B
Well, by the way, for those listening, Apple has a phenomenal paper on using FoundationDB to build a global storage. It's actually one of my favorite systems papers of the last, whatever, 10 years. Yeah.
[24:26]
A
FoundationDB is actually one of the only databases I know of that has implemented simulation testing. They are really maniacal about how they do testing and, you know, it works really well for us. It has its limitations or restrictions, but once you work within those limitations it scales really nicely. It doesn't put an operational burden on us. It is working really well for us.
[24:48]
B
A kind of philosophical question for you. I mean, again, storage is like the foundation, one of the three foundations that you build everything on. Right. And so you could decide that you're totally agnostic to the workload. Totally. And then you don't expose anything to the developer, or the app developer or the product developer, at all. And then you just try and figure it out. Or there's another view, which is that you actually could expose more things to, I don't know, help with caching or help with locality or whatever, that would actually, you know, give hints to the system. Do you have a philosophical view on how much you hide everything from the user versus actually expose a few things? And I guess I just feel like with global storage you actually have the speed of light. In a lot of other systems you can kind of fudge virtualization, but you can't really fudge the speed of light. You know, you'd probably have a better system if you knew the region, for example.
[25:39]
A
So my philosophy is to hide away as much of the common use case like bundle as much of the common use cases or common usage inside storage.
[25:49]
B
Right.
[25:49]
A
So for example, caching is a good example. We do caching out of the box. It's not something that the user has to configure. We understand the access pattern.
[25:57]
B
It's like read caching. Right?
[25:58]
A
Yeah. And routing is another example. So we actually cheat somewhat on the consistency side as well. Metadata is consistent, but data cannot instantly replicate, because it depends on the size of the data. Right. So we cheat on it in the sense that the routing is intelligent, and it knows when it cannot serve the data locally because the data isn't replicated there yet. So it can fetch it remotely. But the user gets this, you know, consistent view of the data.
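Editor's note: the routing behavior described here (consistent metadata, lazily replicated data, and a remote fetch when the local copy is missing or stale) might be sketched as below. This is a toy in-memory model under assumed names (`ToyGlobalStore`, `put`, `replicate`, `get`, and the region codes are all hypothetical); it is not Tigris's implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGlobalStore:
    # key -> (current version, regions known to hold that version).
    # Stands in for the strongly consistent metadata layer.
    metadata: dict = field(default_factory=dict)
    # (region, key) -> (version, payload). Stands in for per-region data.
    blobs: dict = field(default_factory=dict)

    def put(self, region: str, key: str, data: bytes) -> None:
        version = self.metadata.get(key, (0, set()))[0] + 1
        self.blobs[(region, key)] = (version, data)
        # Metadata updates immediately; only the writing region has the data.
        self.metadata[key] = (version, {region})

    def replicate(self, key: str, to_region: str) -> None:
        """Simulates background replication finishing for one region."""
        version, regions = self.metadata[key]
        source = sorted(regions)[0]
        self.blobs[(to_region, key)] = self.blobs[(source, key)]
        regions.add(to_region)

    def get(self, region: str, key: str) -> tuple:
        """Returns (data, region that served it)."""
        version, regions = self.metadata[key]
        # Serve locally only if metadata says the local copy is current;
        # otherwise fetch from a region that has the current version.
        source = region if region in regions else sorted(regions)[0]
        stored_version, data = self.blobs[(source, key)]
        assert stored_version == version
        return data, source
```

In this sketch a reader in a region that still holds an old copy is routed to a region holding the current version, so callers see one consistent view even though replication lags.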
[26:25]
B
Oh, interesting. Right. So another thing. So do you still code?
[26:28]
A
Yeah, I code.
[26:31]
B
And it's so crazy how many of the top CEOs I work with are still actually coding. Like Ankur at Braintrust codes half of the day, right? You're still coding. I mean, how do you manage to, you know, code and then run the team and then run complex operations? Do you find much tension there or...
[26:48]
A
There's definitely tension in the sense that you have to think about priorities, what's important. So it's not that I code every day, it's more about.
[26:56]
B
Do you look at PRs maybe? Is that what it is?
[26:58]
A
No, I do code, but it's not something I do every day. For example, I'm working on a newer implementation of the caching layer for the service. Right. So, yeah, I do actually code, and I actually believe in being hands-on with product development, knowing exactly what's happening, especially because what we are doing is hard tech. It's not one of those things where I can just go and focus on sales and everything will work smoothly. Right. I have to actually be involved, understand where we are going and how we're going to solve the problem that we are solving.
[27:28]
B
I also presume, like, in situations where you've got like a large customer that wants to like, move a lot to it, like.
[27:35]
A
Yeah.
[27:37]
B
You know, so much of their decision is a technical decision, and it's very hard for you not to be pretty close to the details. Yeah. And that was the case with my company too, which is I was literally in the code. And therefore, once you're dealing with large customers, even as the CEO or the leader or whatever, you can kind of help guide them on why this makes financial sense, largely because of, you know, technical guarantees or technical efficiencies or whatever.
[28:02]
A
Yeah. And a lot of times they actually want to talk to the founders before signing a big deal. Right. Because, you know, they want to understand whether the people who are building it actually know what they're doing or not. But the management overhead I have kept to a minimum. It's mostly flat, and I intend to keep it that way.
[28:18]
B
Nice.
[28:18]
A
And I feel a lot of things, AI is taking care of a lot of things. Right. So it makes life so much easier. Gives me more time to actually focus on doing things that are important.
[28:30]
B
Yeah. So I had this very specific question, which is, I think you're probably the most senior infrastructure team I work with, definitely one of the two most senior infrastructure teams. You've built and run global systems for a very long time. Your developers are incredibly senior. You've shipped tons of code. You've supported, you know, everything at scale. Do you get a lot of advantage from AI coding?
[28:51]
A
Oh yeah, a lot of advantages. Most of us use Cursor, and it actually improves productivity if it's used the right way.
[28:59]
B
This is my experience. It's always the most senior people that know how to use these tools well.
[29:03]
A
And it's funny you're asking that, because I was talking to someone else about this as well, about, you know, new college grads. For a long period of time, the only advantage new college grads brought was speed, that they can write code very fast. None of that matters anymore. What matters right now is systems thinking: whether you can design a system, architect a system, whether you know how systems work. Because if you know that, then you can use AI assistants, right, things like Claude Code or Cursor, and build a system. So that's the thing that's most important. That's why the senior folks are succeeding, because they understand how to design.
[29:39]
B
Literally my experience. Many of the most hardcore senior systems devs like yourself, like James Cowling, like Pengfang, these guys are hardcore, been doing this for 20 years. They're the most active users of AI development and they use it very effectively. And many of them have said to me, listen, you know, I don't need to work so much with the junior person right now in order to get things done. Now, that said, I mean, clearly as the industry catches up, there's just a lot of stuff to do. And so I don't think in any way this replaces junior devs writ large, but it does enable, it doesn't replace them.
[30:12]
A
It's just that they need a different set of skills, and they can come in at the same level as an experienced person.
[30:16]
B
Right?
[30:17]
A
Yeah, yeah, that's what they.
[30:18]
B
Have you like changed your development practices around AI or is it kind of shaped organically? Do you have like best practices or.
[30:25]
A
I would say that it's more organic. There are some best practices, some things that work for us. For example, we don't just ask AI to implement a feature. You know, we first work on planning out what we actually need and then go and use the AI tool. We use it heavily for code reviews. We actually use two tools for code reviews because we don't trust one, so we use two of them. And all of that has in general improved productivity. I myself use Claude Code very heavily. I would say that probably, you know, 80% of the code that I write is AI written, so I use that very heavily as well.
[30:58]
B
That's crazy. And then do you like institute specific tooling or you just view it from what gets produced? Is there any sort of mandates to use any specific tools or people can use whatever they want.
[31:08]
A
So the mandate was to use AI.
[31:10]
B
Okay.
[31:10]
A
So I wanted everyone to use AI, and then we needed to figure out which tool works best for us. You know, we started out by using VS Code, and then, because Cursor was so much better, we ended up using that. We're still using it. And then the PR reviewer I mandated, not mandated as in full stop, but by doing it myself and then presenting it at the last offsite. My presentation was all about AI and how we should adopt it. And we have adopted it not just for coding, but for things like production debugging as well, like log analysis. AI is great at analysis for sure.
[31:45]
B
So another thing that I would guess that you know, you're built on FoundationDB. Do you find like actually these AI models are better because this has been an open source code base that's been around for a long time. Do you think that plays into it at all?
[31:56]
A
For sure. We don't so much have AI write the foundational code that deals with FoundationDB, but more of the higher level code.
[32:04]
B
That runs on top of it. So maybe it's a little bit orthogonal. Right. So the last bit, I want to talk a little bit about the cost.
[32:09]
A
Okay.
[32:10]
B
So it seems one area where you can really win is simply cost. Sure, performance, sure, DevEx, we've talked about all those things. But cost: maybe can you talk about how you think about pricing from the perspective of the user? So I'll give you an example. The thing that has always killed me is egress fees. To what extent are you able to provide a different experience or a different cost structure for the user?
[32:33]
A
So the cost structure at a high level is similar to other consumption-based services. It's usage based. But we have done away with the cloud tax, which is the egress bill. So there are no egress fees.
[32:45]
B
Okay, this is naive Martin speaking. Naive Martin. The thing that's always driven me nuts about cloud storage is these egress fees, because I know that networking isn't that expensive. And so is this basically just a disincentive to migrate off? Is that what it is?
[33:00]
A
That's exactly what it is. That's a way to lock people in, so that you cannot use all of those other specialized services that are better than what the cloud is providing, by making it uneconomical.
[33:11]
B
Okay, so you're right. So it's not migrating off, it's just that, you know, you want to prefer their internal services because then you're not paying egress fees. And so it's a way to disincentivize using third party services.
[33:23]
A
Yeah, exactly. And then instead of it being an architectural choice, they make it an economic choice. They make it hard for people to use other services.
[33:31]
B
I see. And so in the case of Tigris, there are no egress fees, right?
[33:34]
A
No egress fees, yeah.
[33:36]
B
That's great. And so, I mean, what does that mean? Let's assume that I as a user am going to be using a third party service. What's the cost differential there, just roughly? I'm curious, is it like a factor of two, is it a factor of ten, is it like 20? Versus a big cloud. Okay, not that, sort of just egress fees. Let's say I'm heavily using some third party service, and so I'm thinking Tigris versus, say, S3. I mean, is it like a factor of 2 or a factor of 10 or 20?
[34:05]
A
So just on the, just, just on the egress side, one example that comes to mind is actually file and 80% of the bill was egress.
[34:13]
B
Really?
[34:14]
A
Yeah, 80% of their storage bill was egress.
[34:16]
B
Oh my God, that's unbelievable. So that's a factor of five just right there.
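Editor's note: the "factor of five" follows directly from the 80% figure, as this small sketch shows. The function names and the per-GB prices are hypothetical, chosen only so that the egress share comes out at 80%; they are not any provider's actual rates.

```python
def egress_share_percent(storage_cents: int, egress_cents: int) -> float:
    """Egress as a percentage of the total storage bill."""
    return 100 * egress_cents / (storage_cents + egress_cents)


def bill_multiplier(egress_percent: float) -> float:
    """How many times larger the bill is than it would be with no egress fees."""
    if not 0 <= egress_percent < 100:
        raise ValueError("egress_percent must be in [0, 100)")
    return 100 / (100 - egress_percent)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 GB stored and 1,000 GB read out per month,
    # at assumed prices of 2 cents/GB stored and 8 cents/GB of egress.
    storage_cents = 1000 * 2
    egress_cents = 1000 * 8
    share = egress_share_percent(storage_cents, egress_cents)
    print(f"egress share: {share:.0f}%, bill multiplier: {bill_multiplier(share):.1f}x")
```

If egress is 80% of the bill, the remaining 20% is everything else, so removing egress fees alone divides the bill by five.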
[34:20]
A
Yeah. Depending on the workload, it can be really high. The egress cost per GB is actually much higher than the cost to store data. How is that even possible?
[34:29]
B
Right?
[34:33]
A
Read your data.
[34:34]
B
Yeah, yeah. Okay, so the egress thing is enormous, right? So basically, for people listening: if you're using storage and the service is not on the cloud, you're crazy, because your egress fees are killing you. Right. And so for that, definitely use something like Tigris. But you also have other opportunities to cost optimize because you're running the data center. Right. So how do you think about that? Is it just that you can take lower margin, or you can optimize it in a certain way, or you have lower operating costs because, you know, you're not bloated like a big company?
[35:02]
A
Definitely, definitely. We can optimize more on the margin side because we don't have to make some of the decisions that they made. S3, for example, has been around for such a long period of time.
[35:11]
B
Right.
[35:11]
A
There are some architectural choices and decisions that have been made that cannot be undone.
[35:14]
B
Yeah, sure.
[35:15]
A
Yes. We are starting from scratch so we have a much cleaner, simpler design.
[35:19]
B
And can you do that for the AI workloads in particular? Can you just be like, listen, we know small objects really matter here? Can the workload actually guide some of that, or do you think that, independent of workload, you can make it better just because it's a newer system?
[35:30]
A
For small-file workloads, for sure, because you don't have to deal with the same architectural decisions. But in general it's going to be much better. Right. And the thing that I mentioned about immutability, no storage service provides any of that. That's awesome. It's something that's unheard of.
[35:46]
B
Yeah, that's great. Cool. Well, maybe as we wrap this up, do you want to kind of bring us a little bit into the future, about how you think of, you know, the evolution of the cloud in general and of the storage layer and all this? Are we going to see somebody do an EC2, or is that basically what the CoreWeaves are? Are we going to see someone, you know, build out more pieces of it, or do you think that ultimately, with the current players in place, we're pretty close and we should be thinking up the stack?
[36:12]
A
I think there's still a lot to be done. So one thing I think about is how much of IT is running in the cloud. And if 80% of IT were running in the cloud, would it only be run by three companies? That would probably not be true. Right. So I feel like AI has pushed this notion of distributed computing, where there are specialized providers, forward, and we'll just continue seeing more of that. We'll continue seeing more of CoreWeave succeeding on the compute side, we'll continue seeing that on the storage side, and we'll continue seeing that on other higher level services like, you know, data warehouse or data lake products or database products. So I foresee more movement towards specialized service providers to provide the best service possible at the cheapest cost possible.
[36:57]
B
And so do you see a significant difference between training and inference workloads? And do you actually, do you like underlie both training and inference today?
[37:04]
A
Yeah, we underlie both training and inference.
[37:05]
B
Do you have a bias towards one or the other? Or is it kind of all the same?
[37:08]
A
So inference is better from a cost perspective for us. Inference is better because inference doesn't involve reading all the data, so caching helps.
[37:17]
B
It's probably also like this kind of, I don't know, Pareto thing or whatever, where 80% of the requests go to the same, you know, 20%. So it's like cache effect. The cache is very effective.
[37:29]
A
The main thing is that hard drives have very limited I/O, limited IOPS. Yeah, right.
[37:33]
B
Yeah.
[37:34]
A
So how do we preserve that? It's best if 80% of the storage is written but not accessed and only 20% gets accessed. That's the best kind of workload. And inference is like that. Training is basically reading the entire data set again and again.
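Editor's note: the contrast drawn here between skewed (inference-like) access and repeated full scans (training-like) can be illustrated with a small LRU cache simulation. This is a sketch under assumed parameters: the 80/20 split, the cache size, the request counts, and all function names are hypothetical, and it does not model Tigris's actual cache.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(accesses, cache_size: int) -> float:
    """Fraction of accesses served from an LRU cache of cache_size keys."""
    cache = OrderedDict()
    hits = total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


def skewed_accesses(n_keys: int, n_requests: int, rng: random.Random):
    """Inference-like: 80% of requests go to the hottest 20% of keys."""
    hot = n_keys // 5
    for _ in range(n_requests):
        if rng.random() < 0.8:
            yield rng.randrange(hot)
        else:
            yield rng.randrange(hot, n_keys)


def scan_accesses(n_keys: int, epochs: int):
    """Training-like: read the whole data set, epoch after epoch."""
    for _ in range(epochs):
        yield from range(n_keys)


if __name__ == "__main__":
    rng = random.Random(0)
    print("skewed:", lru_hit_ratio(skewed_accesses(1000, 20000, rng), 200))
    print("scan:  ", lru_hit_ratio(scan_accesses(1000, 5), 200))
```

With a cache smaller than the data set, the repeated scan never hits the cache, so every read lands on the drives, while the skewed workload is served largely from cache. Every miss consumes scarce hard-drive IOPS, which is why the skewed pattern is the cheaper one to serve.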
[37:48]
B
Right, right, right. Is that the case on these training workloads, that a few of the images get the majority of the views just.
[37:54]
A
Yeah, yeah. It's already skewed.
[37:56]
B
Interesting. Cool. All right, well, that was an awesome discussion, so thanks so much for coming in.
[38:00]
A
Thank you for having me here. I really had fun having this conversation here.

Thanks for listening. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z. We've got more great conversations coming your way. See you next time.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.