I’ve been obsessed with the Quantified Self movement for over five years. Between my Garmin Fenix for trail running and my Oura Ring for sleep tracking, I have accumulated a mountain of health data. But here’s the problem: Wearable data is a hot mess. 💩
Garmin gives you granular FIT files or messy CSVs, and Oura provides JSON exports with completely different timestamp formats and missing values. If you want to answer a simple question like "How does my training load affect my deep sleep over time?", you're stuck in Excel hell.
In this tutorial, we're going to build a high-performance personal health OLAP engine using DuckDB, Python, and Evidence.dev. We'll do the data engineering right at the edge, turning five years of "dirty" data into a clean, queryable goldmine.
Pro Tip: If you're looking for more production-ready data patterns or advanced analytics architectures, definitely check out the deep dives over at the WellAlly Blog.
🏗 The Architecture
We need a system that is fast, local, and requires zero infrastructure overhead. DuckDB is the perfect "in-process" OLAP database for this, and Evidence.dev allows us to build beautiful dashboards using just SQL and Markdown.
```mermaid
graph TD
    A[Oura Ring JSON] -->|Python Script| B(DuckDB Raw Layer)
    C[Garmin CSVs] -->|DuckDB read_csv_auto| B
    B --> D{Data Transformation}
    D -->|SQL Cleaning| E[Cleaned Parquet Files]
    E --> F[Evidence.dev Dashboard]
    F --> G[Health Insights 📈]
```
🛠 Prerequisites
Before we dive in, make sure you have the following installed:
- Python 3.10+
- DuckDB: `pip install duckdb`
- Evidence.dev: a static-site generator for data (Node.js based)
Step 1: Ingesting the "Dirty" Data
DuckDB's superpower is its ability to query JSON and CSV files directly without a formal "Import" step. Here is how we handle the Oura JSON and Garmin CSVs simultaneously.
```python
import duckdb
# Initialize our DuckDB local instance
con = duckdb.connect('health_data.db')
# 1. Ingest Oura JSON (Handling nested structures)
con.execute("""
CREATE OR REPLACE TABLE raw_oura AS
SELECT * FROM read_json_auto('data/oura_export/*.json');
""")
# 2. Ingest Garmin CSV (Handling inconsistent headers)
con.execute("""
CREATE OR REPLACE TABLE raw_garmin AS
SELECT * FROM read_csv_auto('data/garmin_export/*.csv', header=True);
""")
print("✅ Data successfully ingested into the staging area!")
```
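Before moving on to cleaning, it's worth checking what DuckDB actually inferred from those files. Here's a quick sanity-check sketch, assuming the two staging tables above were created without errors:

```python
import duckdb

con = duckdb.connect('health_data.db')

# Inspect the column names and types DuckDB inferred for each staging table
for table in ("raw_oura", "raw_garmin"):
    print(f"\n--- {table} ---")
    con.sql(f"DESCRIBE {table}").show()

# Row counts give a rough sense of how much history actually landed
print(con.sql("SELECT count(*) AS oura_rows FROM raw_oura").fetchone())
print(con.sql("SELECT count(*) AS garmin_rows FROM raw_garmin").fetchone())
```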
Step 2: Cleaning & Normalizing (The SQL Magic)
The Oura data uses ISO date strings, while Garmin might use localized timestamps. We need to normalize both sources into unified daily tables; let's start with heart rate. We'll use DuckDB's UNION ALL here, and SELECT * EXCLUDE is handy when you want everything except a few noisy columns (see the sketch after the query below).
```sql
-- Create a unified view of health metrics
CREATE OR REPLACE TABLE analytics_heart_rate AS
SELECT
strptime(summary_date, '%Y-%m-%d')::DATE as report_date,
'oura' as source,
average_hr as avg_bpm,
rmssd as hrv_ms
FROM raw_oura
WHERE average_hr IS NOT NULL
UNION ALL
SELECT
"Timestamp"::DATE as report_date,
'garmin' as source,
"Average Heart Rate" as avg_bpm,
NULL as hrv_ms -- Garmin export might not include HRV in this specific CSV
FROM raw_garmin;
```
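Two quick follow-ups, sketched under assumptions: the SELECT * EXCLUDE syntax mentioned above (the column names `score` and `period_id` are placeholders; substitute whatever noisy fields your own Oura export contains), and the COPY ... TO step that produces the "Cleaned Parquet Files" from the architecture diagram.

```python
import duckdb

con = duckdb.connect('health_data.db')

# SELECT * EXCLUDE: keep every column except the ones we don't want to carry
# forward. 'score' and 'period_id' are placeholder names for this sketch.
con.execute("""
    CREATE OR REPLACE TABLE clean_oura AS
    SELECT * EXCLUDE (score, period_id)
    FROM raw_oura;
""")

# Materialize the unified heart-rate table as Parquet for downstream tools
con.execute("""
    COPY analytics_heart_rate
    TO 'data/analytics_heart_rate.parquet' (FORMAT PARQUET);
""")
```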
Step 3: Visualizing with Evidence.dev
Now that we have a clean health_data.db, we can use Evidence.dev to build a dashboard. Evidence is a game-changer because it treats SQL as a first-class citizen in Markdown.
In your index.md for Evidence:
````markdown
# 🏃‍♂️ Personal Health Dashboard

Using DuckDB to process 5 years of biometric trends.

```sql heart_rate_trends
select
    report_date,
    avg(daily_avg_bpm) over (
        order by report_date
        rows between 6 preceding and current row
    ) as rolling_avg_bpm
from (
    select report_date, avg(avg_bpm) as daily_avg_bpm
    from analytics_heart_rate
    group by 1
) daily
order by report_date
```

<LineChart
    data={heart_rate_trends}
    x=report_date
    y=rolling_avg_bpm
    title="Resting Heart Rate Over Time"
/>
````
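That's the dashboard, but remember the question from the intro: training load vs. deep sleep. DuckDB's corr() aggregate makes that a one-liner. The sketch below assumes you've built two more daily tables the same way we built analytics_heart_rate; `analytics_training` and `analytics_sleep` (and their columns) are hypothetical names, not tables created in this tutorial.

```python
import duckdb

con = duckdb.connect('health_data.db')

# Hypothetical tables for this sketch:
#   analytics_training(report_date, training_load)
#   analytics_sleep(report_date, deep_sleep_minutes)
correlation = con.sql("""
    SELECT corr(t.training_load, s.deep_sleep_minutes) AS load_vs_deep_sleep
    FROM analytics_training t
    JOIN analytics_sleep s USING (report_date)
""").fetchone()[0]

print(f"Correlation between training load and deep sleep: {correlation:.2f}")
```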
🥑 Why this Stack?
- DuckDB is insanely fast: it can process millions of rows of wearable data in milliseconds on your laptop (quick timing sketch after this list).
- Zero-Ops: no Docker, no Postgres, no cloud bills. Just a single `.db` file.
- SQL-centric: you don't need to learn complex BI tools. If you know SQL, you can build a full-stack data app.
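If you want to check that speed claim on your own data, here's a rough timing sketch against the table we built in Step 2:

```python
import time
import duckdb

con = duckdb.connect('health_data.db')

# Time a simple aggregation over the unified heart-rate table
start = time.perf_counter()
rows = con.sql("""
    SELECT source, avg(avg_bpm) AS mean_bpm
    FROM analytics_heart_rate
    GROUP BY source
""").fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Aggregated {len(rows)} groups in {elapsed_ms:.1f} ms")
```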
For those of you looking to scale these patterns to enterprise-level data platforms—perhaps moving from local DuckDB to a cloud-based Snowflake or BigQuery environment—the team at WellAlly.tech/blog has some incredible resources on Data Engineering best practices and scalable ETL pipelines.
Conclusion
Quantifying yourself shouldn't be a chore. By using DuckDB for the heavy lifting and Evidence.dev for the presentation, you can turn years of messy health data into actionable insights in a single afternoon.
What are you tracking? Whether it's sleep, steps, or even your coffee intake, try throwing it into a DuckDB instance and see what patterns emerge! 🚀
Feel free to drop a comment below if you have questions about the SQL transformations or need help setting up Evidence.dev! 👇