Local Data Pipeline With DuckDB: A Step-by-Step Guide
Hey guys! Today, we're diving into setting up a local data pipeline using DuckDB. If you're like me, you're always looking for ways to streamline your data workflows without getting bogged down in complex server setups. DuckDB is a fantastic solution for this, offering a powerful in-process SQL OLAP database management system that's perfect for local analytics. We'll walk through initializing a DuckDB database, writing SQL scripts for data ingestion and cleaning, and merging various datasets into a unified table. Plus, we'll save our intermediate results as Parquet files for efficient storage. Let's get started!
Why DuckDB for Local Data Pipelines?
Before we jump into the how-to, let's quickly cover why DuckDB is such a great choice for local data pipelines.
- No Server Required: This is huge! Unlike traditional databases like PostgreSQL or MySQL, DuckDB doesn't require a separate server process. It runs directly within your application, simplifying deployment and eliminating the overhead of managing a database server.
- In-Process: DuckDB operates within the same process as your application, which means data access is incredibly fast. This is especially beneficial for local data analysis where you want quick turnaround times.
- SQL Support: If you're already familiar with SQL, you'll feel right at home with DuckDB. It supports a wide range of SQL features, allowing you to perform complex queries, joins, and aggregations.
- Parquet Support: DuckDB has excellent support for Parquet files, a columnar storage format that's optimized for analytical queries. This allows you to efficiently store and query large datasets.
- Easy to Set Up: Setting up DuckDB is a breeze. You just need to install the DuckDB library for your programming language of choice, and you're good to go.
 
DuckDB shines in scenarios where you need to perform analytical queries on local data without the complexity of managing a full-fledged database server. It's perfect for data exploration, prototyping data pipelines, and building local data applications.
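To make the "in-process" and "Parquet support" points concrete, here's about the smallest possible example in Python. The example.parquet path is just a placeholder for illustration:

import duckdb

# One import, one connect call: the database runs inside this Python process.
conn = duckdb.connect()  # in-memory database; pass a file path to persist it

# DuckDB can query a Parquet file directly with plain SQL.
# 'example.parquet' is a placeholder path, not a real file from this guide.
print(conn.execute("SELECT COUNT(*) FROM 'example.parquet'").fetchone())

conn.close()

No server was started, nothing was installed beyond the library itself, and the Parquet file was queried in place.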
Step 1: Initialization and Setup
First things first, we need to initialize our DuckDB database. We'll create a database file named /data/toronto_housing.duckdb. This file will store all our data and metadata.
- Install DuckDB:
Make sure you have DuckDB installed. If you're using Python, you can install it using pip:

pip install duckdb

For other languages, refer to the official DuckDB documentation for installation instructions.
- Create the Database File:
You can create the database file using a simple Python script:

import duckdb
import os

# Define the database file path
db_file = '/data/toronto_housing.duckdb'

# Create the /data directory if it doesn't exist
data_dir = os.path.dirname(db_file)
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# Connect to the database (this will create the file if it doesn't exist)
conn = duckdb.connect(db_file)

# Close the connection
conn.close()

print(f'DuckDB database created at {db_file}')

This script creates the /data directory if it doesn't exist and then creates the toronto_housing.duckdb file. When you connect to a DuckDB database, it automatically creates the file if it doesn't already exist.
Step 2: Writing SQL Scripts for Data Ingestion
Now that we have our DuckDB database set up, let's write some SQL scripts to ingest our data. We'll start with a script named 01_load_raw.sql to load the raw data into our database. Make sure to replace the file paths with the actual paths to your data files.
Example: 01_load_raw.sql
-- 01_load_raw.sql
-- Create tables for CMHC, Census, and Toronto Open Data
CREATE TABLE cmhc_data AS
SELECT * FROM read_csv_auto('/path/to/cmhc_data.csv');
CREATE TABLE census_data AS
SELECT * FROM read_csv_auto('/path/to/census_data.csv');
CREATE TABLE toronto_open_data AS
SELECT * FROM read_csv_auto('/path/to/toronto_open_data.csv');
-- Optional: Print the first few rows of each table to verify the data
SELECT * FROM cmhc_data LIMIT 5;
SELECT * FROM census_data LIMIT 5;
SELECT * FROM toronto_open_data LIMIT 5;
This script uses the read_csv_auto function, which samples each CSV file to infer its delimiter, header, and column types, and loads the result into a DuckDB table. The CREATE TABLE AS SELECT statement creates each new table and populates it with the rows returned by the query. Double-check that the file paths point to your actual data before running the script; and if auto-detection guesses a type wrong, you can spell things out explicitly, as sketched below.
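Here's a hedged sketch of what that looks like from Python, using read_csv with explicit options instead of read_csv_auto. The column names and types (zone, year, avg_rent) are placeholders, not the real CMHC schema; adjust them to whatever your file actually contains:

import duckdb

conn = duckdb.connect('/data/toronto_housing.duckdb')

# read_csv lets you pin down the delimiter, header flag, and column types
# when auto-detection guesses wrong. The column names and types here are
# placeholders; swap in the ones from your actual file.
conn.execute("""
    CREATE OR REPLACE TABLE cmhc_data AS
    SELECT * FROM read_csv(
        '/path/to/cmhc_data.csv',
        delim = ',',
        header = true,
        columns = {'zone': 'VARCHAR', 'year': 'INTEGER', 'avg_rent': 'DOUBLE'}
    );
""")

conn.close()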
Step 3: Writing SQL Scripts for Data Cleaning and Joining
Next, we'll create a script named 02_clean_join.sql to clean and join our data. This script will perform the necessary transformations and merges to create a unified dataset.
Example: 02_clean_join.sql
-- 02_clean_join.sql
-- Clean and transform the data
-- Example: Rename columns, cast data types, and filter out irrelevant rows
ALTER TABLE cmhc_data RENAME COLUMN some_column TO cmhc_some_column;
ALTER TABLE census_data RENAME COLUMN another_column TO census_another_column;
-- Join the tables
-- USING keeps a single join_key column instead of one copy per table
CREATE TABLE unified_data AS
SELECT *
FROM cmhc_data c
JOIN census_data ce USING (join_key)
JOIN toronto_open_data t USING (join_key);
-- Save the unified data to a Parquet file
COPY unified_data TO '/data/processed/unified_data.parquet' (FORMAT 'parquet');
-- Optional: Print the first few rows of the unified data to verify the join
SELECT * FROM unified_data LIMIT 5;
This script performs several important steps. First, it cleans and transforms the data by renaming columns (this is also where you'd cast data types and filter out irrelevant rows). Next, it joins the CMHC, Census, and Toronto Open Data tables on their shared join key, using USING so the key only shows up once in the result. Finally, it saves the unified data to a Parquet file in the /data/processed/ directory, so make sure that directory exists before you run the script.
Important: Adjust the column names, join keys, and data transformations based on the actual structure of your datasets. If you're wondering what a cast-and-filter step might look like, there's a sketch right below.
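Just to make that concrete, here's a minimal Python sketch of a cast-and-filter cleaning step. The table name cmhc_clean and the columns year and avg_rent are hypothetical placeholders, not columns from the actual CMHC data:

import duckdb

conn = duckdb.connect('/data/toronto_housing.duckdb')

# Hypothetical cleaning step: cast a text column to INTEGER and drop rows
# with missing values. 'year' and 'avg_rent' are placeholder column names.
conn.execute("""
    CREATE OR REPLACE TABLE cmhc_clean AS
    SELECT * REPLACE (CAST(year AS INTEGER) AS year)
    FROM cmhc_data
    WHERE avg_rent IS NOT NULL;
""")

conn.close()

The REPLACE clause in SELECT * REPLACE (...) is a handy DuckDB convenience: it swaps out one column's expression while keeping the rest of the wildcard intact.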
Step 4: Running the SQL Scripts
Now that we have our SQL scripts, let's run them using the DuckDB command-line interface.
- Load Raw Data:

duckdb /data/toronto_housing.duckdb < 01_load_raw.sql

This command executes the 01_load_raw.sql script, which loads the raw data into the DuckDB database.
- Clean and Join Data:

duckdb /data/toronto_housing.duckdb < 02_clean_join.sql

This command executes the 02_clean_join.sql script, which cleans and joins the data, and saves the unified data to a Parquet file.
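Prefer to stay in Python? Here's a minimal sketch that runs the same scripts without the CLI. It assumes the .sql files sit in the current directory and contain no semicolons inside string literals (the naive split below would break on those):

import duckdb

def run_sql_script(conn, path):
    """Execute a SQL script statement by statement.

    Naive approach: split on ';'. Fine for simple scripts like ours, but it
    would break if a statement contained ';' inside a string literal.
    """
    with open(path) as f:
        script = f.read()
    for statement in script.split(';'):
        if statement.strip():
            conn.execute(statement)

conn = duckdb.connect('/data/toronto_housing.duckdb')
run_sql_script(conn, '01_load_raw.sql')
run_sql_script(conn, '02_clean_join.sql')
conn.close()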
Step 5: Testing the Pipeline
To ensure our pipeline is working correctly, let's run some test queries on the unified data.
- Verify Data Loading:
Connect to the DuckDB database using the command-line interface:

duckdb /data/toronto_housing.duckdb

Then, run the following SQL query to count the number of rows in the unified data table:

SELECT COUNT(*) FROM unified_data;

If the query returns the expected number of rows, it means the data has been loaded and joined correctly.
- Verify Parquet Output:
You can also query the Parquet file directly using DuckDB:

SELECT COUNT(*) FROM '/data/processed/unified_data.parquet';

This query should return the same number of rows as the previous query, confirming that the Parquet file has been written correctly.
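If you'd rather run these checks automatically instead of by hand, here's a small Python sketch that compares the two counts and fails loudly on a mismatch. It's a sanity check, not a full test suite:

import duckdb

conn = duckdb.connect('/data/toronto_housing.duckdb', read_only=True)

# Row count of the joined table inside the database
table_count = conn.execute("SELECT COUNT(*) FROM unified_data").fetchone()[0]

# Row count of the exported Parquet file
parquet_count = conn.execute(
    "SELECT COUNT(*) FROM '/data/processed/unified_data.parquet'"
).fetchone()[0]

conn.close()

assert table_count == parquet_count, (
    f"Row count mismatch: table={table_count}, parquet={parquet_count}"
)
print(f"OK: {table_count} rows in both the table and the Parquet file")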
 
Step 6: Saving Intermediate Outputs
Saving intermediate outputs is a good practice for several reasons:
- Debugging: If something goes wrong in your pipeline, you can easily inspect the intermediate outputs to identify the source of the problem.
- Reproducibility: Saving intermediate outputs ensures that your pipeline is reproducible. You can rerun the pipeline from any intermediate step without having to start from scratch.
- Performance: Saving intermediate outputs can improve the performance of your pipeline by allowing you to reuse pre-computed results.
 
In our example, we're saving the unified data to a Parquet file. You can also save other intermediate results to Parquet files as needed.
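For example, if you build a cleaned intermediate table (like the hypothetical cmhc_clean from the earlier sketch), you can dump it to its own Parquet file with a COPY statement along these lines. The /data/intermediate path is just a suggested location:

import duckdb
import os

conn = duckdb.connect('/data/toronto_housing.duckdb')

# Create the output directory up front so the COPY has somewhere to write.
os.makedirs('/data/intermediate', exist_ok=True)

# Export an intermediate table to Parquet.
# 'cmhc_clean' is the hypothetical cleaned table from the earlier sketch.
conn.execute("""
    COPY cmhc_clean TO '/data/intermediate/cmhc_clean.parquet' (FORMAT 'parquet');
""")

conn.close()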
Acceptance Criteria
To ensure that our DuckDB-based local pipeline meets our requirements, we need to verify the following acceptance criteria:
- All datasets load successfully into DuckDB:
We should be able to load the CMHC, Census, and Toronto Open Data datasets into DuckDB tables without any errors.
- Can run duckdb < 02_clean_join.sql to output a joined dataset (<10GB):
We should be able to execute the 02_clean_join.sql script and generate a joined dataset that is less than 10GB in size.
- No manual DB setup or server required:
The pipeline should not require any manual database setup or a separate server process. DuckDB should handle all the database management internally.
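If you want a quick, scriptable way to check these criteria, here's a hedged Python sketch that verifies the three raw tables exist and that the exported Parquet file stays under 10GB. The table names and paths match the examples above; adjust them if yours differ:

import duckdb
import os

DB_FILE = '/data/toronto_housing.duckdb'
PARQUET_FILE = '/data/processed/unified_data.parquet'
SIZE_LIMIT_BYTES = 10 * 1024**3  # 10GB

conn = duckdb.connect(DB_FILE, read_only=True)

# Criterion 1: all raw datasets loaded into DuckDB tables
for table in ('cmhc_data', 'census_data', 'toronto_open_data'):
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {rows} rows")

conn.close()

# Criterion 2: joined output exists and is under 10GB on disk
size = os.path.getsize(PARQUET_FILE)
assert size < SIZE_LIMIT_BYTES, f"Joined output is too large: {size} bytes"
print(f"unified_data.parquet: {size} bytes (under the 10GB limit)")

# Criterion 3 (no server required) is satisfied by construction:
# everything above ran in-process against a single database file.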
 
Conclusion
And there you have it! You've successfully set up a local data pipeline using DuckDB. This setup allows you to ingest, clean, and merge data from various sources without the complexity of managing a traditional database server. By saving intermediate outputs as Parquet files, you can improve the performance and reproducibility of your pipeline. DuckDB is an incredibly useful tool for local data analysis, and I hope this guide has helped you get started. Happy data crunching, folks!
By following these steps, you can leverage the power of DuckDB to create efficient and scalable local data pipelines for your projects. Remember to adapt the SQL scripts and data transformations to your specific datasets and requirements. With DuckDB, you can focus on analyzing your data without getting bogged down in database management complexities. If you have any questions, feel free to ask!