AWS Bites Podcast Episode 140: DuckDB Meets AWS – A Match Made in Cloud
Release Date: February 21, 2025
Hosts: Luciano Mammino & Eoin Shanaghy
Overview
This episode introduces DuckDB, an emerging open-source, in-process analytical database, and dives into why it’s generating so much enthusiasm in the cloud analytics space. The hosts detail how DuckDB’s simplicity, performance, and cloud-native capabilities make it a compelling choice for data analysis on AWS. They discuss use cases, practical integration tips, and a replacement for the recently retired S3 Select feature. Finally, they unveil a new open-source runtime for using DuckDB efficiently with AWS Lambda and Step Functions.
Key Discussion Points & Insights
1. What is DuckDB and Why the Hype?
[00:00–02:46]
- DuckDB is a fast, efficient, in-process analytical database designed for OLAP workloads; it can run entirely in memory or persist data to a single file.
- It’s SQL-first, works across major operating systems, requires no dependencies, and has broad language support.
- Supports various storage formats (Parquet, Iceberg, CSV, JSON, Avro).
- Open-source with a strong governance model ensuring the community’s interests (all IP held by a separate foundation; see [01:24–02:22]).
- Quote:
“It’s very simple to use and it has managed to stay very simple to use. … You can run it locally because it’s in memory, but you can also run it in the cloud basically anywhere.”
— Eoin, [01:24]
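To make the “SQL-first, zero-dependency” point concrete, here is a minimal sketch using DuckDB’s Python binding (any of its other language bindings look similar); the file and column names are placeholders:

```python
import duckdb  # pip install duckdb – the package embeds the whole engine, no server to run

# A throwaway in-memory database; pass a path (e.g. duckdb.connect("analytics.duckdb"))
# to persist results to a single file instead.
con = duckdb.connect()

# DuckDB infers the format and schema directly from the file (CSV, Parquet, JSON, ...).
con.sql("""
    SELECT page, count(*) AS hits
    FROM read_csv_auto('events.csv')
    GROUP BY page
    ORDER BY hits DESC
""").show()
```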
2. Where and How Can You Run DuckDB on AWS?
[02:46–04:58]
- Can be run locally (laptop), on EC2, containers, even AWS Lambda.
- Major benefit: blazing fast analytics even on modest hardware—no need for massive clusters.
- Efficient multicore columnar execution engine; processes terabytes even with limited RAM.
- Example use case: DIY analytics systems (e.g., custom Google Analytics replacements).
- Quote:
“You can even run it in lambda, which is probably the coolest use case if you ask me.”
— Luciano, [02:49]
- Quote:
“Because DuckDB can support more data than it can fit in memory, you often hear stories of people processing terabytes of data even in their laptops.”
— Luciano, [04:08]
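The “more data than fits in memory” point works because DuckDB can spill intermediate results to disk. A minimal sketch of the relevant settings, assuming the DuckDB Python client; the 1 GB limit, spill path, and dataset glob are illustrative values only:

```python
import duckdb

con = duckdb.connect()

# Cap how much RAM DuckDB may use and tell it where to spill larger-than-memory
# intermediates (both values are illustrative, not recommendations).
con.sql("SET memory_limit = '1GB'")
con.sql("SET temp_directory = '/tmp/duckdb_spill'")

# Aggregating a Parquet dataset far bigger than 1 GB still works: the columnar,
# multicore engine streams the scan and spills as needed.
con.sql("""
    SELECT country, count(*) AS sessions
    FROM read_parquet('data/events/*.parquet')
    GROUP BY country
""").show()
```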
3. Getting Started with DuckDB
[04:58–08:55]
- Install the CLI on any OS; open a SQL REPL via the duckdb command.
- Common pattern: run queries in memory, or persist using DuckDB database files.
- DuckDB can query S3 directly (see the sketch after this list):
“You could start with something like SELECT * FROM and then just give the URI for your S3 object.”
— Eoin, [05:29]
- Supports S3 partitioning schemes familiar to the big data community (Hive-style partitioned paths).
- Enables rapid ETL, analytics, and transformations in simple environments:
- Lambda (“simple and cheap data lake querying”)
- AWS Glue (Python Shell jobs)
- Step Functions (for complex workflows)
- Notable: DuckDB can now replace the retired S3 Select for querying data in S3, acting as a near drop-in replacement with vastly more features.
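As a concrete starting point for the “query S3 directly” workflow described above, here is a hedged sketch using the DuckDB Python client with the httpfs and aws extensions; the bucket, prefix, and partition layout are placeholders, and credentials are assumed to come from the standard AWS credential chain:

```python
import duckdb

con = duckdb.connect()

# httpfs adds s3:// support; the aws extension lets DuckDB pick up credentials
# from the usual AWS credential chain (profile, environment, instance role, ...).
con.sql("INSTALL httpfs; INSTALL aws; LOAD httpfs; LOAD aws;")
con.sql("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

# Query a Hive-partitioned Parquet dataset (e.g. .../year=2025/month=02/...) in place;
# bucket and path are placeholders.
con.sql("""
    SELECT year, month, count(*) AS orders
    FROM read_parquet('s3://my-data-lake/orders/*/*/*.parquet', hive_partitioning = true)
    GROUP BY year, month
    ORDER BY year, month
""").show()
```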
4. Replacing S3 Select with DuckDB
[08:55–10:36]
- S3 Select, a popular limited-SQL query feature for S3, has been retired for new AWS accounts.
- DuckDB steps in as a flexible, more powerful CLI or programmatic alternative:
“You can use DuckDB as a programmatic replacement for the AWS SDK usage of S3 Select. And…you’ll have a lot more features.”
— Eoin, [07:56]
- Practical experience: Eoin replaced an S3 Select step in a Step Function with DuckDB and “got really good results.”
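A hedged sketch of the programmatic swap: the commented-out boto3 select_object_content call is the retiring S3 Select API, and the DuckDB query below expresses the same filter as ordinary SQL (bucket, key, and column names are placeholders; S3 access is configured as in the earlier sketch):

```python
import duckdb

# Previously, S3 Select streamed filtered records from a single object, e.g.:
#   s3.select_object_content(
#       Bucket="my-bucket", Key="logs/2025-02-21.csv",
#       ExpressionType="SQL",
#       Expression="SELECT * FROM s3object s WHERE s.status = 'ERROR'",
#       InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#       OutputSerialization={"JSON": {}},
#   )
# The same filter with DuckDB, plus the full SQL surface (joins, aggregates, ...):
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # plus credentials, as shown earlier

rows = con.sql("""
    SELECT *
    FROM read_csv_auto('s3://my-bucket/logs/2025-02-21.csv')
    WHERE status = 'ERROR'
""").fetchall()
```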
5. How DuckDB Compares: Athena, SQLite/LibSQL, Pandas & Polars
[08:55–12:37]
- Athena / Redshift: Larger scale (multi-node) vs. DuckDB (single node, much simpler, faster for moderate loads).
- “If you are at massive scale, Athena will probably support much more data…” — Luciano, [09:09]
- SQLite / LibSQL: Comparable as embedded databases, but SQLite is optimized for transactional (OLTP) workloads while DuckDB excels at analytical (OLAP) workloads.
- Pandas/Polars:
- Dataframe libraries for analytics (Pandas in Python, Polars written in Rust with Python bindings); they offer an object-oriented dataframe API.
- DuckDB’s power is in SQL-based simplicity—sometimes a single query can replace complex scripts.
- For imperative logic or highly custom workflows, Pandas/Polars remain beneficial.
- Quote:
“I think you can be able to replace some of the work that you might be doing with Pandas and Polars with just a query…”
— Luciano, [12:24]
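To illustrate the “one query instead of a script” point, a small sketch: both snippets compute the same top-10 aggregation (file and column names are placeholders), and DuckDB can even query the in-memory Pandas dataframe by name:

```python
import duckdb
import pandas as pd

# A typical Pandas pipeline: load, filter, group, sort, take the top rows.
df = pd.read_parquet("orders.parquet")
top_pandas = (
    df[df["status"] == "shipped"]
    .groupby("country")["amount"].sum()
    .sort_values(ascending=False)
    .head(10)
)

# The same result as one DuckDB query; the FROM clause scans the local
# dataframe `df` directly (a DuckDB replacement scan).
top_duckdb = duckdb.sql("""
    SELECT country, sum(amount) AS total
    FROM df
    WHERE status = 'shipped'
    GROUP BY country
    ORDER BY total DESC
    LIMIT 10
""").df()
```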
6. DuckDB in Step Functions and Lambda: The New Runtime
[12:55–16:52]
- Launch of a new, open-source AWS Lambda runtime with DuckDB built-in.
- “The layer allows you to basically deploy a Lambda function that doesn’t have any code. So it just has a DuckDB engine.”
— Eoin, [13:14]
- Lets you submit queries as input; runs efficiently (no Python/Node/other heavy runtimes—just a single, tiny binary).
- Use cases:
- ETL/ELT jobs (CSV→JSON, filtering, aggregation)
- Query S3, transform data, call APIs using the HTTP extension
- Compose with Step Functions: “You have the ability to do DuckDB in your Step Function definition and you can just write your SQL in the input parameters…” ([13:54])
- Security tip: Attach relevant S3 read/write permissions.
- Community invitation: Repo is in the show notes; hosts encourage listeners to try, fork, and star it.
- Quote:
“It is. I think it’s worthy of a runtime because DuckDB is so powerful and you can do so much with it. …there really is kind of a Swiss Army knife type tool and it’s just limited by your imagination.”
— Eoin, [15:51]
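As a rough idea of how such a query-only function could be driven, here is a hedged boto3 sketch; the function name and the payload shape (an event carrying the SQL text) are assumptions for illustration only, so check the runtime’s repository in the show notes for the real input contract:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical invocation of a DuckDB-powered function: "duckdb-runner" and the
# {"query": ...} payload shape are assumptions, not the runtime's documented API.
response = lambda_client.invoke(
    FunctionName="duckdb-runner",
    Payload=json.dumps({
        "query": """
            SELECT level, count(*) AS n
            FROM read_json_auto('s3://my-bucket/app-logs/2025/02/*.json')
            GROUP BY level
        """
    }),
)
print(json.loads(response["Payload"].read()))
```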
Notable Quotes & Memorable Moments
- On DuckDB’s Philosophy:
“They basically set up a separate foundation holding all of its IP, keeping it clear of all the commercial entities to ensure that the community doesn’t get any nasty license switching surprises in the future.”
— Eoin, [02:16]
- On Convenience:
“You can run it pretty much in almost any reasonable hardware.”
— Luciano, [02:55]
- On Lambda Integration:
“You don’t have to have a Python runtime or a Node JS runtime. …It’s just a single binary. And you can then use this as a response to an EventBridge event or integrate it into step functions.”
— Eoin, [13:27]
- On DuckDB’s Power:
“We believe as a final conclusion that DuckDB is really good, really promising, probably will be seeing more and more of it and especially in the cloud, in Lambda, in AWS.”
— Luciano, [16:52]
Timestamps for Important Segments
- Introduction & Hype around DuckDB: [00:00–02:46]
- Technical Overview & Use Cases: [02:46–04:58]
- How to Get Started on AWS: [04:58–08:55]
- Replacing S3 Select: [07:56–10:36]
- DuckDB vs. Athena, SQLite, Pandas, Polars: [08:55–12:37]
- Custom DuckDB Lambda Runtime & Workflow Integration: [12:55–16:52]
- Final Thoughts: [16:52]
Conclusion
DuckDB is emerging as a versatile, high-performance analytical tool—ideal for AWS and beyond. Its simple deployment, strong feature set, and innovative cloud integrations (like the new Lambda runtime) position it as a strong successor to services like S3 Select, and a powerful alternative for a host of data analysis workloads. The episode is a must-listen (or now, a must-read) for AWS practitioners, data engineers, and anyone looking for modern, cost-effective analytics solutions.
Call to Action: Try the new DuckDB Lambda runtime, share your use cases, and check the show notes for the GitHub repo!
