Loading summary
A
Hello and welcome to another episode of AWS Bytes. My name is Luciano and I'm joined by Owen and we wanted to talk about DuckDB for a while and today we finally get to do just that. So a few years ago we started to hear lots of excitement from our data scientists and analyst friends and all colleagues and everyone was talking about DuckDB and they were super excited and we couldn't figure out exactly why. But finally we have used it in a few different projects and now we really understand why everyone is excited about that. So today we want to share with you how DuckDB is becoming a must have tool for anyone who has some data on a strip bucket somewhere and will need to do some kind of analytical job with that data. So we'll share some use cases and how to get started. And by the way, there is a little bit of a trick at the end if you are missing the recently retired S3 select feature on AWS, stay tuned until the end. We might have a surprise for you, hopefully a better option that you can use today. So let's get into it. AWS Bytes is brought to you by For Theorem, our sponsor. And if you're looking for a partner to architect, develop and modernize on aws, give For Theorem a call. You can check out all the details@forthereum.com so oh, and maybe we can start by giving a little bit of an explanation of what DuckDB is.
B
Well, it's a database, but specifically it's an in memory analytical database. So when you think in memory, you might think of things like SQLite, sometimes known as SQLite, but this is for analytical or OLAP workloads, so you can therefore run it locally because it's in memory, but you can also run it in the cloud basically anywhere. And it's SQL first, so anyone with knowledge of SQL can pick it up and use it really very quickly. It works with a lot of different storage formats. It has its own format, but you can also use it with parquet, Iceberg, CSV, Data, JSON, Data, aro, and one of the really big draws of it is that it's very simple to use and it has managed to stay very simple to use. It doesn't have any dependencies, works on pretty much any modern OS or cpu, and it has language bindings for almost everything you might need. It's also pretty modular, so new features are automatically enabled by installing extensions, and you can even write your own custom extensions as well. And it's open source. I heard an interview with one of the founders of Duckdb recently and they were talking about the different model, the governance model, but they basically set up a separate foundation holding all of its ip, keeping it clear of all the commercial entities to ensure that the community doesn't get any nasty license switching surprises in the future. So that's the intro. Luciano, what is DuckDB good for?
A
Yeah, you already mentioned that you can run it anywhere. That's probably one of the main thing one worth to mention. So you can run it on your laptop, on an EC2 instance, in a container, running again on your machine or somewhere in the cloud. You can even run it in lambda, which is probably the coolest use case if you ask me. So one of the biggest benefits is how fast it is on modest hardware. So you don't need to run it in high end machines or buy expensive VMs. You can run it pretty much in almost any reasonable hardware. And it has very efficient multicore columnar execution engine which is able to parallelize fetching and execution of the queries that you provide to analyze your data. So it can also work with more data than available in memory, which is another interesting detail for something that is defined as an in memory database. And effectively we are used to massive clusters or data warehouses required for querying big data. DuckDB kind of changes that a little bit because it solves a lot of the problem that we typically will solve with these massive clusters, but with a much simpler approach and a much more cost effective approach. And I think that's probably one of the main reasons why everyone is so excited about DuckDB. So they say that one of the main reasons why DuckDB can be so efficient is that the engine is particularly where engineers it's a columnar vectorized type of engine and it can process large numbers of rows for a column at once. So this is why it's very well suited for analytical type of workloads. So because DuckDB can support more data than it can fit in memory, you often hear stories of people processing terabytes of data even in their laptops. And I'm seeing lots of small examples and use cases of people talking online. For instance, one common use case would be if you want to implement your own version of Google Analytics, people are trying to do that with DuckDB and seems to be extremely easy to build and extremely efficient and cost effective. So that's just to give you an idea of the kind of things you could be building with something like DuckDB. So where do we get started?
B
Yeah, the best way to take it for a test drive is just to install the CLI. So you can do that for any Linux, macOS, Windows. Once you do that, you can run the DuckDB command and you have an interactive SQL rebel. So you can start writing SQL and you can run in memory, which it would do by default, or use a DuckDB file. I think 95% of the use cases I've seen you're just using the in memory case. But you know, there's a lot of advantages if you want to persist the data as well. So if you just want to query data on S3 in memory is fine. You could start with something like select star from and then just give the uri for your S3 object. And that could be a single file, or it could be a group of files. And then you can get into more complex stuff like partitioning. So if you're in the big data space, you'll be familiar with hive partitioning on stores like S3, where you have certain parts of the path which represent basically partitions of Data like date equals 20, 25, 02, 18. And when you have those date dates in your where expression, it means it only has to read in that folder. You can do that with DuckDB as well. You can also create in memory tables for intermediate results or indeed persistent tables if you're not using in memory mode. And then you can do joins. And it has, you know, very rich SQL support, but it also has custom functions for doing things like reading parquet data with specific parameters and scanning lots of different partitions on a store like S3. So on AWS, then if you want to start using it in your application, there's lots of options for integrating it lambda, you mentioned already, Luciano. It lets you do simple and cheap data lake querying or even ETL pipelines if you're doing transformations. That's one of the places where I've found it very useful because you can do etl, a couple of lines of SQL just running in a lambda function, and that's your job done. You don't have to stand up any kind of complicated or just heavier ETL infrastructure. And you can obviously run it in containers, for example on ecs, Fargate, eks, something like that. You can use it in AWS glue, if you like, in your Python shell jobs. And you can also integrate it into step functions. And that's one of the things that I've found it really good for actually, is putting it into step functions to do one step in your step function, which will do massive transformations on a lot of data. And you gave a preview at you Said that we might have an alternative for S3 Select. Unfortunately, S3 select has been retired since late last year for new AWS accounts. If you have existing accounts, you might still be able to use it, but it was a very handy feature. It was a console tool and an API that you could use from the SDKs that would allow you to query data on S3 directly. It was quite a limited subset of SQL supported in S3 Select. You could basically just select with a where clause on a single file, but people really miss it since it's gone. And the Good news with DuckDB is that you can replace it. Essentially you can use the CLI as a drop in replacement for what you would have done in the AWS console, but you can also use DuckDB as a programmatic replacement for the AWS SDK usage of S3 Select. And the good news is that you'll have a lot more features than you would have had with S3 select. So we'll talk a little bit about that more because one of the recent use cases I came across was I wanted to use S3 select in the step function just to do some filtering on data. It was in a new account, so that wasn't available to me and I was able to replace it with DuckDB and get really good results. So I'll come back to that in a bit. But maybe we should talk about before that how S3, sorry, maybe we should talk about how DuckDB compares to other things like Athena, SQLite, pandas, polars.
A
Yeah, let me try to give a quick overview on that. So comparing it to Athena, DuckDB is much faster and simpler in terms of setting it up and the code that you need to write. I think it's fair to say that Athena will probably support much more data, like if you are at massive scale, but you can get very far with DuckDB. And the reason why Athena or even Redshift, if we want to bring that in the conversation that they will support more data is because they are large scale multi node systems, while DuckDB is a single node system. So unless you want to write your own crazy demultiplexer multiplexer type of approach where you can shard multiple nodes and manage all that querying in between. I think DuckDB is more suitable for things that will fit a single node, but I think with that you can still solve lots of problems even at relatively big scale. So that just keeps the comparison between Athena and even redshift, I guess, versus DuckDB, another one that it's very often compared to is SQLite. And I think it's fair to say that these days there is an alternative called lib SQL. So if you are not aware, lib SQL is a fork of SQLite. So in a way it's very similar, tries to solve the same problems, has a slightly different governance than SQLite, but apparently I'm seeing it more and more, so possibly it's going to eventually become a big contender in this space. So both SQLite and Lib SQL, if you compare them with DuckDB, I think the approach is very similar, so I think it makes sense to compare them, but at the same time they are trying to solve very different problems. So where SQLite or Lib SQL are giving you effectively an easy way to have a local embeddable transactional database, DuckDB is much more focused on large scale analytics. So effectively the way that the data is being stored and retrieved is very, very different. And therefore I think you need to pick the right tool for the right job. So if you're doing analytics, stick with DuckDB. Otherwise probably SQLite will give you better results. Then if we bring into the pictures Pandas and the more recent Polars, which is pretty much so Pandas is the original one written in Python and Polars is kind of a more modern rewrite written in Rust. So more focus on bringing, I guess additional performance that the Rust programming language can bring. So these are two very popular data processing libraries that are used for analytics. I think the use case is very similar, but the approach is probably very different. So while Pandas and Bollards effectively are programming languages libraries, so they offer you like a programmatic interface, you import the library in a piece of code you're writing and then you have this concept of an object that represents a data frame and you can use this data frame to do all sorts of kind of things using different programming languages like Python or Rust, you write different instruction effectively, you write your own script and that way you can create even fairly complex analytical pipelines. I don't think those will be completely replaced by DuckDB. I think on one side maybe Pandas and Polars can give you an extra bit of flexibility. Maybe we will just prefer the more programmatic approach so that there is definitely a difference in the things you can do and the way you approach the problem. And but with that being said, I think you can be able to replace some of the work that you might be doing with Pandas and Polars with just a query or a single executely even in DuckDB. So I think it's worth Figuring out where is the sweet spot where maybe DuckDB will give you a much simpler solution and you don't have to write a mundane code. You probably just focus on writing a SQL statement.
B
Yeah, I think relatum is like when you run into multiple lines of SQL, then eventually it gets more complicated. You feel like you're back in stored procedure land in the early 2000s, and at that point you might be better off using something a bit more imperative, like Python.
A
So you mentioned step functions before. I'm really curious to see how you use DuckDB with step functions.
B
Yeah, I mean, there's no way you can create like a custom user defined task or state in step functions, apart from from using something like lambda or a container task. So I mentioned that we had this issue where we couldn't use S3 select and we trialed using DuckDB instead. It worked really well and we actually decided to make this a general purpose pattern so that anyone can really easily integrate things like DuckDB into step functions and other workflows. And the way we did that was by creating a lambda runtime for DuckDB. And, and this might sound a little bit strange, but hear us out. The layer allows you to basically deploy a lambda function that doesn't have any code. So it just has a DuckDB engine. And we created this runtime layer on top of it that basically accepts a query and allows you to then do something with the results. So it's really lightweight. It's written in C, because DuckDB's core itself is written in C. So it was fairly straightforward to create a thin wrapper around that written in C as well. So it's very tiny to deploy, quite lightweight in terms of startup. And then you really just have to incur the analytical time for whatever query you're running. So it means you don't have to have a Python runtime or a node JS runtime. You don't have to worry about dependencies and upgrades. It's just a single binary. And you can then use this as a response to an EventBridge event or integrate it into step functions. And all of a sudden then you have the ability to do DuckDB in your step function definition and you can just write your SQL in the input parameters for the lambda task and interpolate any valuable variables you might need from your step function state. And as we've been able to use this then for just doing simple things like taking a CSV file and converting it into JSON and then doing something else with the response. But you can also do different filtering, grouping, aggregation before you return your JSON. So it's really powerful. It is. I think it's worthy of a runtime because DuckDB is so powerful and you can do so much with it. And I think a lot of us who use Lambda heavily are always looking for ways to reduce the number of functions you need to maintain, because it just seems like you've got a lot of cattle to herd then and this is a good way to do it. So we'll have a link to the repo in the show Notes. It's essentially a SAR app, so serverless application repository app that you can use to install the runtime in your own account. And then you just create a simple function that you can deploy. You might need to give it some permissions. So normally, rather than returning analytical data in the response of your function invocation, you might Write it to S3 just based because of the size of the data or the format of the data. So you might want to give that function, you deploy some access to S3, like read access to some buckets and maybe write access to like a staging area where your intermediate results are stored. And once you have that, you can do some pretty mad things with DuckDB and step functions. Like you can even call APIs using DuckDB because it has like a HTTP extension. And you can even do like authenticated requests to APIs, transform the response in SQL. You can convert tabular data on S3 to JSON like I mentioned, that you can then use in your step function state. You can do like fairly significant ETL or ELT type workloads and then just anything else in terms of analytics, aggregation. You know, if you've got a step function that's doing a map or even a distributed map to process lots and lots of units of data, you could then use DuckDB to collate all that data, aggregate it all together and get results. So yeah, I'm curious to hear how people get on with it and if they can try it out and find some really good use cases. Because there really is kind of a Swiss army knife type tool and it's just limited by your imagination.
A
That's awesome. I think this is everything we wanted to share today. We believe as a final conclusion that DuckDB is really good, really promising, probably will be seeing more and more of it and especially in the cloud, in Lambda, in aws. So hopefully you got some value from this episode today. If you did, please remember to share it with your friends, like subscribe everything else. And also please give us a star to our new repository if you like this project. So thank you very much and we'll see you in the next episode.
Release Date: February 21, 2025
Hosts: Luciano Mammino (“A”) & Eoin Shanaghy (“B”)
This episode introduces DuckDB, an emerging open-source, in-memory analytical database, and dives into why it’s generating so much enthusiasm in the cloud analytics space. The hosts detail how DuckDB’s simplicity, performance, and cloud-native capabilities make it a compelling choice for data analysis on AWS. They discuss use cases, practical integration tips, and even present a solution to the recently retired S3 Select service. Finally, they unveil a new open-source runtime for using DuckDB efficiently with AWS Lambda and Step Functions.
[00:00–02:46]
“It’s very simple to use and it has managed to stay very simple to use. … You can run it locally because it’s in memory, but you can also run it in the cloud basically anywhere.”
— Eoin, [01:24]
[02:46–04:58]
“You can even run it in lambda, which is probably the coolest use case if you ask me.”
— Luciano, [02:49]
“Because DuckDB can support more data than it can fit in memory, you often hear stories of people processing terabytes of data even in their laptops.”
— Luciano, [04:08]
[04:58–08:55]
duckdb command.“You could start with something like
SELECT * FROMand then just give the URI for your S3 object.”
— Eoin, [05:29]
[08:55–10:36]
“You can use DuckDB as a programmatic replacement for the AWS SDK usage of S3 Select. And…you’ll have a lot more features.”
— Eoin, [07:56]
[08:55–12:37]
“I think you can be able to replace some of the work that you might be doing with Pandas and Polars with just a query…”
— Luciano, [12:24]
[12:55–16:52]
“It is. I think it’s worthy of a runtime because DuckDB is so powerful and you can do so much with it. …there really is kind of a Swiss Army knife type tool and it’s just limited by your imagination.”
— Eoin, [15:51]
On DuckDB’s Philosophy:
“They basically set up a separate foundation holding all of its IP, keeping it clear of all the commercial entities to ensure that the community doesn’t get any nasty license switching surprises in the future.”
— Eoin, [02:16]
On Convenience:
“You can run it pretty much in almost any reasonable hardware.”
— Luciano, [02:55]
On Lambda Integration:
“You don’t have to have a Python runtime or a Node JS runtime. …It’s just a single binary. And you can then use this as a response to an EventBridge event or integrate it into step functions.”
— Eoin, [13:27]
On DuckDB’s Power:
“We believe as a final conclusion that DuckDB is really good, really promising, probably will be seeing more and more of it and especially in the cloud, in Lambda, in aws.”
— Luciano, [16:52]
DuckDB is emerging as a versatile, high-performance analytical tool—ideal for AWS and beyond. Its simple deployment, strong feature set, and innovative cloud integrations (like the new Lambda runtime) position it as a strong successor to services like S3 Select, and a powerful alternative for a host of data analysis workloads. The episode is a must-listen (or now, a must-read) for AWS practitioners, data engineers, and anyone looking for modern, cost-effective analytics solutions.
Call to Action: Try the new DuckDB Lambda runtime, share your use cases, and check the show notes for the GitHub repo!
This summary skips non-content (ads, intros, outros) as requested and preserves the episode's technical depth and friendly, practical tone.