Python

Ducks to the rescue - ETL using Python and DuckDB

UA2.220 (Guillissen)
Marc-André Lemburg
<p><em>Summary:</em></p> <p>ETL stands for "extract, transform, load" and is a synonym for moving data around. Traditionally, this has often required managing complex systems in the cloud or in large data centers. The talk will demonstrate how all this can be greatly simplified by applying modern tools for the task: Python and DuckDB, both open source and readily available to run on most systems - even your notebook.</p> <p><em>Description:</em></p> <p><strong>ETL</strong> stands for "extract, transform, load" and is a synonym for moving data from one system to another. </p> <p>Traditionally, ETL was done in exactly that order: first you extract the data you want to process, then you transform it, and then you load it into the target system. More modern approaches based on data lakes swap the T and L (ELT), since transformation is done more efficiently in a database system, especially when it comes to large volumes of data.</p> <p>In order to make all this work, the usual approach is to have a workflow system that manages all the intermediate steps, a large data lake database and distributed storage systems. This results in a lot of complexity and a need for system/cluster administration and maintenance.</p> <p>Now, with today's computers, most data sizes used in ETL no longer need all this complexity. Even notebooks or single VMs can handle the load when used with external object storage, so all you really need is the right software stack to manage your ETL - without all the overhead:</p> <ul> <li> <p><strong>Python</strong> has grown to be the number one programming language on the planet and is especially well suited for integration work due to its many readily available connectors to plenty of backend systems. 
It often comes preinstalled on Linux machines and is easy to install on most other systems.</p> </li> <li> <p><strong>DuckDB</strong> has emerged as one of the most capable embedded OLAP database systems and supports data lakes with the DuckLake extension, right out of the box. Installation is just a <code>uv add duckdb</code> away.</p> </li> </ul> <p>Both can be run on the same machine and are very resource friendly.</p> <p>The talk will give an overview of the typical steps involved in ETL processes, give a short intro to DuckDB and showcase how DuckDB can be put to good use when implementing ETL processes. If time permits, I can also cover a few advanced topics addressing optimization strategies.</p> <p><em>Resources:</em></p> <ul> <li> <p><a href="https://www.python.org/">Python.org</a></p> </li> <li> <p><a href="https://duckdb.org/">DuckDB – An in-process SQL OLAP database management system</a></p> </li> </ul>

Additional information

Live Stream https://live.fosdem.org/watch/ua2220
Type devroom
Language English

More sessions

1/31/26
Python
Jacob Coffee
UA2.220 (Guillissen)
<p>Discover how PEP 810's explicit lazy imports can dramatically improve Python application startup times. Using a real CLI tool as a case study, that we totally use in our real business, this talk demonstrates practical techniques to optimize import performance while maintaining code clarity and safety.</p>
1/31/26
Python
Ruben Hias
UA2.220 (Guillissen)
<p>Python’s Global Interpreter Lock has shaped the way developers build concurrent applications for nearly three decades. While the GIL simplified the CPython ecosystem, it also imposed well-known limits on CPU-bound work and multithreaded scalability. With the introduction of free-threaded Python (3.14t), that is about to change.</p> <p>This talk explores the history and purpose of the GIL, why it existed for so long, and the innovations that finally made its removal viable. We’ll look at ...
1/31/26
Python
Jarek Potiuk
UA2.220 (Guillissen)
<p>Apache Airflow is the most popular Data Workflow Orchestrator - developed under the Apache Software Foundation umbrella. We have 120+ Python distributions in our repo, and we often release ~100 of them every two weeks. </p> <p>All those distributions are built from a single monorepo. </p> <p><code>[jarekpotiuk:~/code/airflow] find . -name 'pyproject.toml' | wc 120 120 4248</code></p> <p>This had always posed a lot of challenges and we had a lot of tooling to make it possible, however with the ...
1/31/26
Python
Loïc Tosser "wowi42"
UA2.220 (Guillissen)
<p>Remember when we said "Infrastructure as Code"? Somehow, the industry heard "Infrastructure as YAML" and ran with it. Now we're drowning in a sea of indentation-sensitive, template-riddled, Jinja2-abused configuration files that make even the most battle-hardened sysadmins weep into their mechanical keyboards.</p> <p>Enter <strong>PyInfra</strong>—where your infrastructure is <em>actually</em> code. Real Python. With loops that don’t require learning a DSL. With functions that are... wait ...
1/31/26
Python
Emma Delescolle
UA2.220 (Guillissen)
<p><a href="https://www.djangoproject.com">Django</a>'s built-in admin is powerful, but it's essentially a separate framework within Django and it's 20 years old.</p> <p>Wouldn't it be nice to be able to work with an admin interface that works like the rest of Django, built on generic CBVs, plugins, and view factories? <a href="https://github.com/jazzband/django-admin2">Django-Admin2</a> was an attempt at doing just that and it was a fairly successful project.</p> <p>10 years later, after ...
1/31/26
Python
Manuel Raynaud
UA2.220 (Guillissen)
<p>The French digital agency (DINUM) has undertaken to develop an open-source collaborative digital workplace to make the work of public servants simpler and more effective.</p> <p>This collaborative digital workplace is distributed under an open-source license to allow anyone who wishes to take its applications and integrate them into their preferred tools.</p> <p>By participating in existing open-source communities, the digital workplace enables the emergence of digital commons that facilitate ...
1/31/26
Python
Marc-André Lemburg
UA2.220 (Guillissen)
<p>After the success of last year's impromptu lightning talks session, we will have an official one in the Python Devroom for 2026.</p> <p>Please submit your talks using this form:</p> <ul> <li> <p><a href="https://docs.google.com/forms/d/e/1FAIpQLSfh1zpbP6KgMmexQeT6jlcX1_o8W26zovBUVVudxopBMfjGsg/viewform?usp=header">Lightning Talk Submission Form</a></p> </li> <li> <p>The form will be opened for submissions at around 14:00 CET on Saturday, Jan 31, 2026.</p> </li> </ul> <p>Lightning Talks are at ...