AWS Bites Podcast Episode 140: DuckDB Meets AWS – A Match Made in Cloud

Release Date: February 21, 2025
Hosts: Luciano Mammino & Eoin Shanaghy

Overview

This episode introduces DuckDB, an open-source, in-memory analytical database, and explains why it is generating so much enthusiasm in the cloud analytics space. The hosts detail how DuckDB’s simplicity, performance, and cloud-native capabilities make it a compelling choice for data analysis on AWS. They discuss use cases and practical integration tips, and present DuckDB as a replacement for the recently retired S3 Select feature. Finally, they unveil a new open-source runtime for running DuckDB efficiently with AWS Lambda and Step Functions.

Key Discussion Points & Insights

1. What is DuckDB and Why the Hype?

[00:00–02:46]

DuckDB is a fast, efficient, in-memory analytical database designed for OLAP workloads.
It’s SQL-first, works across major operating systems, requires no dependencies, and has broad language support.
Supports various storage formats (Parquet, Iceberg, CSV, JSON, Avro).
Open-source with a strong governance model ensuring the community’s interests (all IP held by a separate foundation; see [01:24–02:22]).
Quote:

“It’s very simple to use and it has managed to stay very simple to use. … You can run it locally because it’s in memory, but you can also run it in the cloud basically anywhere.”
— Eoin, [01:24]

2. Where and How Can You Run DuckDB on AWS?

[02:46–04:58]

Can be run locally (laptop), on EC2, containers, even AWS Lambda.
Major benefit: blazing fast analytics even on modest hardware—no need for massive clusters.
Efficient multicore columnar execution engine; processes terabytes even with limited RAM.
Example use case: DIY analytics systems (e.g., custom Google Analytics replacements).
Quote:

“You can even run it in lambda, which is probably the coolest use case if you ask me.”
— Luciano, [02:49]
Quote:

“Because DuckDB can support more data than it can fit in memory, you often hear stories of people processing terabytes of data even in their laptops.”
— Luciano, [04:08]
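
As a rough illustration of the "more data than fits in memory" point, the sketch below generates the DuckDB settings one might apply on a memory-constrained host such as a Lambda function. `memory_limit` and `temp_directory` are DuckDB configuration options; the 80% fraction, the spill directory name, and the helper itself are assumptions made for this example, not recommendations from the episode.

```python
def duckdb_settings(memory_mb: int,
                    spill_dir: str = "/tmp/duckdb_spill",
                    fraction: float = 0.8) -> list[str]:
    """Return SET statements that cap DuckDB's memory and enable spilling.

    memory_mb: memory available to the process (e.g. the Lambda memory size).
    spill_dir: hypothetical directory for temporary files; /tmp is the
               writable location on Lambda.
    fraction:  assumed share of memory handed to DuckDB.
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    limit_mb = max(1, int(memory_mb * fraction))
    return [
        f"SET memory_limit = '{limit_mb}MB'",
        f"SET temp_directory = '{spill_dir}'",
    ]


for statement in duckdb_settings(1024):
    print(statement)
```

Each statement would be executed on a DuckDB connection before the main query, so that operators which exceed the limit can spill intermediate data to disk instead of failing.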

3. Getting Started with DuckDB

[04:58–08:55]

Install the CLI on any OS; open SQL REPL via the duckdb command.
Common pattern: run queries in-memory, or persist using DuckDB files.
DuckDB can query S3 directly:

“You could start with something like SELECT * FROM and then just give the URI for your S3 object.”
— Eoin, [05:29]
Supports S3 partitioning schemes familiar to the big data community.
Enables rapid ETL, analytics, and transformations in simple environments:
- Lambda (“simple and cheap data lake querying”)
- AWS Glue (Python Shell jobs)
- Step Functions (for complex workflows)
Notable: Can now replace retired S3 Select for querying S3 data—a “drop-in replacement” but with vastly greater features.
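
The S3 querying pattern described in this section can be sketched as follows. The bucket, prefix, and `year`/`month` partition layout are hypothetical. `read_parquet` with `hive_partitioning` is DuckDB's reader for Hive-style partitioned files, and reading `s3://` paths relies on the `httpfs` extension. The optional `run_query` helper assumes the `duckdb` Python package is installed and that S3 credentials are already configured, which is not shown here.

```python
def quote_literal(value: str) -> str:
    """Quote a value as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def build_partitioned_query(bucket: str, prefix: str, year: int) -> str:
    """Build a query over Hive-partitioned Parquet files on S3.

    Assumes a hypothetical layout such as
    s3://<bucket>/<prefix>/year=2025/month=02/part-0.parquet
    """
    uri = f"s3://{bucket}/{prefix.strip('/')}/*/*/*.parquet"
    return (
        "SELECT month, COUNT(*) AS row_count\n"
        f"FROM read_parquet({quote_literal(uri)}, hive_partitioning = true)\n"
        f"WHERE year = {int(year)}\n"
        "GROUP BY month\n"
        "ORDER BY month"
    )


def run_query(sql: str):
    """Run the query with DuckDB (requires `pip install duckdb`)."""
    import duckdb  # imported lazily so the builder works without it

    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Credential setup (for example a DuckDB secret) is omitted here.
    return con.execute(sql).fetchall()


print(build_partitioned_query("my-bucket", "events", 2025))
```

Filtering on a partition column such as `year` lets DuckDB skip files under non-matching prefixes, which is what keeps this style of query cheap on large buckets.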

4. Replacing S3 Select with DuckDB

[07:56–10:36]

S3 Select, a popular feature for running a limited subset of SQL against S3 objects, is no longer available to new AWS customers.
DuckDB steps in as a flexible, more powerful CLI or programmatic alternative:

“You can use DuckDB as a programmatic replacement for the AWS SDK usage of S3 Select. And…you’ll have a lot more features.”
— Eoin, [07:56]
Practical experience: Eoin replaced an S3 Select step in a Step Functions workflow with DuckDB and “got really good results.”
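
To make the migration idea concrete, here is a minimal sketch that rewrites a simple S3 Select expression into a DuckDB query over the same object. It is an illustration only, not the approach used in the episode: the helper name is hypothetical, it assumes CSV input, and it handles only the plain `FROM S3Object [alias]` form. The two SQL dialects differ in other ways (for example, S3 Select often needs explicit casts), so real expressions may need manual adjustment.

```python
import re


def s3_select_to_duckdb(expression: str, bucket: str, key: str) -> str:
    """Rewrite the FROM target of a simple S3 Select expression for DuckDB.

    Only the plain `FROM S3Object [alias]` form over CSV is handled;
    JSON path forms such as `S3Object[*]` are rejected.
    """
    uri = f"s3://{bucket}/{key}".replace("'", "''")  # escape single quotes
    source = f"read_csv('{uri}')"
    rewritten, count = re.subn(
        r"\bFROM\s+S3Object\b(?!\[)",
        lambda _match: f"FROM {source}",
        expression,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("expression does not select FROM S3Object")
    return rewritten


print(s3_select_to_duckdb(
    "SELECT s.name FROM S3Object s WHERE s.city = 'Dublin'",
    "my-bucket",
    "people.csv",
))
```

The resulting string can be run from the DuckDB CLI or any DuckDB client library, with the bucket and key taken from what was previously passed to the S3 Select API call.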

5. How DuckDB Compares: Athena, SQLite/LibSQL, Pandas & Polars

[08:55–12:37]

Athena / Redshift: Larger scale (multi-node) vs. DuckDB (single node, much simpler, faster for moderate loads).
- “If you are at massive scale, Athena will probably support much more data…” — Luciano, [09:09]
SQLite / LibSQL: Comparable as embedded databases, but SQLite targets transactional (OLTP) workloads while DuckDB excels at analytics/OLAP workloads.
Pandas/Polars:
- Dataframe libraries used from a host language (Python/Rust); they expose data through an object-oriented dataframe API.
- DuckDB’s power is in SQL-based simplicity—sometimes a single query can replace complex scripts.
- For imperative logic or highly custom workflows, Pandas/Polars remain beneficial.
Quote:

“I think you can be able to replace some of the work that you might be doing with Pandas and Polars with just a query…”
— Luciano, [12:24]
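
The "single query instead of a script" point can be illustrated with a small, hedged comparison. The data and table name are made up. Python's built-in `sqlite3` module is used purely as a stand-in SQL engine so the sketch runs without extra installs; it is not DuckDB, but the same `GROUP BY` query is standard SQL of the kind DuckDB accepts.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view rows: (page, duration_ms)
ROWS = [("/home", 120), ("/docs", 340), ("/home", 80),
        ("/docs", 60), ("/blog", 200)]


def imperative_totals(rows):
    """Script style: loop over rows, accumulate per page, then sort."""
    totals = defaultdict(lambda: [0, 0])
    for page, duration_ms in rows:
        totals[page][0] += 1
        totals[page][1] += duration_ms
    return sorted((page, n, ms) for page, (n, ms) in totals.items())


QUERY = """
SELECT page, COUNT(*) AS views, SUM(duration_ms) AS total_ms
FROM page_views
GROUP BY page
ORDER BY page
"""


def sql_totals(rows):
    """Declarative style: the whole aggregation is one query."""
    con = sqlite3.connect(":memory:")  # stand-in engine, not DuckDB
    con.execute("CREATE TABLE page_views (page TEXT, duration_ms INTEGER)")
    con.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    result = [tuple(row) for row in con.execute(QUERY).fetchall()]
    con.close()
    return result


print(imperative_totals(ROWS) == sql_totals(ROWS))
```

Both functions compute the same per-page totals; the SQL version states the result wanted and leaves the looping and accumulation to the engine.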

6. DuckDB in Step Functions and Lambda: The New Runtime

[12:55–16:52]

Launch of a new, open-source AWS Lambda runtime with DuckDB built-in.
- “The layer allows you to basically deploy a Lambda function that doesn’t have any code. So it just has a DuckDB engine.”
  — Eoin, [13:14]
Lets you submit queries as input; runs efficiently (no Python/Node/other heavy runtimes—just a single, tiny binary).
Use cases:
- ETL/ELT jobs (CSV→JSON, filtering, aggregation)
- Query S3, transform data, call APIs using HTTP extension
- Compose with Step Functions: “You have the ability to do DuckDB in your Step Function definition and you can just write your SQL in the input parameters…” ([13:54])
Security tip: Attach relevant S3 read/write permissions.
Community invitation: Repo is in the show notes; hosts encourage listeners to try, fork, and star it.
Quote:

“It is. I think it’s worthy of a runtime because DuckDB is so powerful and you can do so much with it. …there really is kind of a Swiss Army knife type tool and it’s just limited by your imagination.”
— Eoin, [15:51]
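
A hedged sketch of how the pieces in this section might fit together: a Step Functions Task state that passes SQL to a DuckDB-backed Lambda function, plus an IAM policy granting the S3 access mentioned in the security tip. The function name, bucket, and especially the `{"query": ...}` payload shape are assumptions for illustration; the runtime's repository documents the event format it actually expects.

```python
import json

# Hypothetical names: replace with your own function and bucket.
FUNCTION_NAME = "duckdb-query-runner"
BUCKET = "my-data-bucket"

# ETL-style statement: read CSV from S3 and write JSON back to S3.
SQL = (
    f"COPY (SELECT * FROM read_csv('s3://{BUCKET}/input/data.csv')) "
    f"TO 's3://{BUCKET}/output/data.json' (FORMAT json)"
)


def duckdb_task_state(function_name: str, sql: str) -> dict:
    """Build a Step Functions Task state that invokes a Lambda function.

    The {"query": ...} payload is an assumed shape, not the runtime's
    documented contract.
    """
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": function_name,
            "Payload": {"query": sql},
        },
        "End": True,
    }


def s3_policy(bucket: str) -> dict:
    """IAM policy granting list, read, and write access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"],
             "Resource": f"arn:aws:s3:::{bucket}"},
            {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
             "Resource": f"arn:aws:s3:::{bucket}/*"},
        ],
    }


state_machine = {
    "StartAt": "RunDuckDB",
    "States": {"RunDuckDB": duckdb_task_state(FUNCTION_NAME, SQL)},
}
print(json.dumps(state_machine, indent=2))
print(json.dumps(s3_policy(BUCKET), indent=2))
```

The policy would be attached to the Lambda function's execution role; narrowing the object resource to specific prefixes is a reasonable next step when the input and output locations are fixed.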

Notable Quotes & Memorable Moments

On DuckDB’s Philosophy:

“They basically set up a separate foundation holding all of its IP, keeping it clear of all the commercial entities to ensure that the community doesn’t get any nasty license switching surprises in the future.”
— Eoin, [02:16]
On Convenience:

“You can run it pretty much in almost any reasonable hardware.”
— Luciano, [02:55]
On Lambda Integration:

“You don’t have to have a Python runtime or a Node JS runtime. …It’s just a single binary. And you can then use this as a response to an EventBridge event or integrate it into step functions.”
— Eoin, [13:27]
On DuckDB’s Power:

“We believe as a final conclusion that DuckDB is really good, really promising, probably will be seeing more and more of it and especially in the cloud, in Lambda, in AWS.”
— Luciano, [16:52]

Timestamps for Important Segments

Introduction & Hype around DuckDB: [00:00–02:46]
Technical Overview & Use Cases: [02:46–04:58]
How to Get Started on AWS: [04:58–08:55]
Replacing S3 Select: [07:56–10:36]
DuckDB vs. Athena, SQLite, Pandas, Polars: [08:55–12:37]
Custom DuckDB Lambda Runtime & Workflow Integration: [12:55–16:52]
Final Thoughts: [16:52]

Conclusion

DuckDB is emerging as a versatile, high-performance analytical tool—ideal for AWS and beyond. Its simple deployment, strong feature set, and innovative cloud integrations (like the new Lambda runtime) position it as a strong successor to services like S3 Select, and a powerful alternative for a host of data analysis workloads. The episode is a must-listen (or now, a must-read) for AWS practitioners, data engineers, and anyone looking for modern, cost-effective analytics solutions.

Call to Action: Try the new DuckDB Lambda runtime, share your use cases, and check the show notes for the GitHub repo!

