Beck_Moulton

Posted on Feb 13

From Messy Health Data to Actionable Insights: Building a Personal Knowledge Graph with Neo4j and Apache Hop

#ai #python #discuss #opensource

We are living in the golden age of "Quantified Self." Between your Apple Watch, Oura Ring, and Whoop strap, you probably have more data on your heart rate than a 1980s ICU. But here’s the problem: Your data is trapped in silos.

Apple Health speaks one language, Oura speaks another, and none of them talk to each other. If you want to know how a specific workout affected your REM sleep three days later, a simple dashboard won't cut it. You need a Personal Health Knowledge Graph.

In this tutorial, we’ll build an end-to-end Data Engineering pipeline to integrate multi-source health data into Neo4j, using Apache Hop for ETL and Pandas for the heavy lifting.

The Architecture: Why a Knowledge Graph?

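Relational databases can answer these questions, but every extra hop (workout to day, day to the next day, next day to sleep session) becomes another JOIN or self-join, and the SQL gets harder to write and maintain as the questions grow. A Knowledge Graph treats relationships as first-class citizens, so the same question becomes a short path traversal.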

Data Flow Overview

graph TD
    A[Apple Health XML] -->|Pandas Parsing| B(Cleaned CSVs)
    C[Oura Ring API/Export] -->|Pandas Parsing| B
    B --> D{Apache Hop ETL}
    D -->|Cypher Ingestion| E[(Neo4j Graph DB)]
    E --> F[Grafana Dashboard]
    E --> G[Discovery Queries]

By connecting entities like Date, Activity, SleepSession, and BiometricMetric, we can traverse the graph to find hidden patterns.

Prerequisites 🛠️

To follow along, you’ll need:

Neo4j (Community or Desktop version)
Apache Hop (The modern successor to Kettle/PDI)
Pandas (For initial XML/JSON flattening)
Python 3.10+

Step 1: Taming the "Dirty" Data with Pandas 🐼

Apple Health exports data as a massive, nested XML file. Oura gives you JSON or CSV. Our first job is to flatten these into a unified schema.

import pandas as pd
import xml.etree.ElementTree as ET

def parse_apple_health(file_path):
    # Loads the whole export into memory; fine for a first pass, slow on very large files
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Each top-level <Record> keeps its fields (type, value, unit, startDate, ...) as XML attributes
    records = [record.attrib for record in root.findall('Record')]

    df = pd.DataFrame(records)
    # Keep the local calendar day (first 10 characters, YYYY-MM-DD) as the key for Day nodes
    df['date'] = df['startDate'].str[:10]
    # UTC offsets can differ within one export (DST, travel), so normalize timestamps to UTC
    df['startDate'] = pd.to_datetime(df['startDate'], utc=True)
    return df

# Example: Filtering for Resting Heart Rate
apple_df = parse_apple_health('export.xml')
rhr_df = apple_df[apple_df['type'] == 'HKQuantityTypeIdentifierRestingHeartRate']
rhr_df.to_csv('cleaned_rhr.csv', index=False)
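The Oura side can be flattened into the same long format. Here is a minimal sketch, assuming a JSON export shaped like `{"sleep": [{"day": ..., "efficiency": ..., "total_sleep_duration": ...}]}`. The field names, the unit mapping, and the `flatten_oura_sleep` helper are assumptions for illustration, so check them against your own export before relying on them.

```python
import pandas as pd

def flatten_oura_sleep(payload: dict) -> pd.DataFrame:
    """Flatten an Oura-style sleep payload into long (date, type, value, unit) rows."""
    # Assumed shape: {"sleep": [{"day": "2024-02-10", "efficiency": 91, ...}, ...]}
    wide = pd.json_normalize(payload["sleep"])
    # Hypothetical mapping from export field name to unit
    units = {"efficiency": "%", "total_sleep_duration": "s"}
    # One row per (day, metric) pair instead of one column per metric
    long = wide.melt(id_vars="day", value_vars=list(units),
                     var_name="type", value_name="value")
    long["unit"] = long["type"].map(units)
    long["date"] = pd.to_datetime(long["day"]).dt.strftime("%Y-%m-%d")
    return long[["date", "type", "value", "unit"]]

# Made-up sample values, only to show the resulting shape
sample = {"sleep": [
    {"day": "2024-02-10", "efficiency": 91, "total_sleep_duration": 27000},
    {"day": "2024-02-11", "efficiency": 84, "total_sleep_duration": 24300},
]}
oura_df = flatten_oura_sleep(sample)
```

With every source reduced to the same four columns, the downstream ETL only has to understand one CSV layout.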

Step 2: Designing the Graph Schema 🕸️

In Neo4j, we want to model our health journey as a timeline.

Nodes: (:Person), (:Day), (:Activity), (:Sleep), (:Metric)
Relationships: (:Person)-[:TRACKED_ON]->(:Day), (:Day)-[:NEXT_DAY]->(:Day), (:Metric)-[:LOGGED_ON]->(:Day), (:Sleep)-[:LOGGED_ON]->(:Day), (:Activity)-[:PERFORMED_ON]->(:Day)
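Before loading anything, it usually helps to declare which property identifies each node, so that upserts match existing nodes instead of duplicating them. Here is a small sketch that generates uniqueness constraints, assuming `Day.date` and `Metric.type` are the keys and that you are on a Neo4j version that supports the `FOR ... REQUIRE` constraint syntax (4.4 or later). The `uniqueness_constraints` helper and the constraint names are made up for illustration.

```python
def uniqueness_constraints(keys: dict[str, str]) -> list[str]:
    """Build one CREATE CONSTRAINT statement per (label, key property) pair."""
    statements = []
    for label, prop in keys.items():
        name = f"{label.lower()}_{prop}_unique"  # hypothetical naming scheme
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return statements

# Assumed identifying properties for two of the node labels above
SCHEMA_KEYS = {"Day": "date", "Metric": "type"}
statements = uniqueness_constraints(SCHEMA_KEYS)
for stmt in statements:
    print(stmt)
```

The printed statements can be pasted into Neo4j Browser once, before the first load.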

Step 3: Orchestrating the ETL with Apache Hop 🦘

Apache Hop is a game-changer for Data Engineering workflows. It allows us to build visual pipelines that handle the upserts (MERGE) into Neo4j without writing hundreds of lines of boilerplate code.

File Input: Point to your cleaned CSVs.
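Neo4j Cypher: Use MERGE for both nodes and relationships so the pipeline stays idempotent when it is re-run.

// This Cypher runs inside the Apache Hop Neo4j Cypher transform
MERGE (d:Day {date: $formatted_date})
MERGE (m:Metric {type: $metric_type})
// MERGE (not CREATE) on the relationship: a re-run updates the reading instead of duplicating it.
// This keeps one reading per metric type per day.
MERGE (m)-[r:LOGGED_ON]->(d)
SET r.value = $value, r.unit = $unit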
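To make the parameter names concrete, here is a sketch of how one row of the cleaned CSV could map onto `$formatted_date`, `$metric_type`, `$value`, and `$unit`. In Hop this mapping is configured in the transform itself, so the Python below only illustrates the shape of the data. The `row_to_params` helper is hypothetical, and it assumes the Apple Health column names `type`, `value`, `unit`, and `startDate`.

```python
from datetime import date

def row_to_params(row: dict) -> dict:
    """Map one cleaned CSV row onto the Cypher parameters used in the upsert."""
    # Use a precomputed date column if the CSV has one; otherwise fall back to
    # the first 10 characters of the timestamp, assumed to be YYYY-MM-DD
    raw_day = row.get("date") or row["startDate"][:10]
    day = date.fromisoformat(raw_day)  # raises ValueError on malformed dates
    return {
        "formatted_date": day.isoformat(),
        "metric_type": row["type"],
        "value": float(row["value"]),  # CSV fields arrive as strings
        "unit": row["unit"],
    }

# Made-up example row
params = row_to_params({
    "type": "HKQuantityTypeIdentifierRestingHeartRate",
    "value": "52",
    "unit": "count/min",
    "startDate": "2024-02-10 07:31:00 -0500",
})
```

Validating the date here means a malformed row fails before it reaches the graph, rather than creating a stray Day node.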

Pro-Tip: For more production-ready examples and advanced patterns on handling high-velocity biometric data, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover excellent architectural strategies for health-tech scaling.

Step 4: Querying for Insights 🔍

Once your data is in Neo4j, you can run queries that are awkward or impossible while the data sits in separate apps. Want to see if high-intensity workouts (Apple Health) lead to better Sleep Efficiency (Oura) the following night?

MATCH (a:Activity)-[:PERFORMED_ON]->(d1:Day)-[:NEXT_DAY]->(d2:Day)
MATCH (s:Sleep)-[:LOGGED_ON]->(d2)
WHERE a.intensity = 'High'
RETURN d1.date, a.type, s.efficiency
ORDER BY d1.date DESC
LIMIT 10
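This query relies on a NEXT_DAY relationship between consecutive Day nodes, which the ingestion step does not create by itself. Here is one way to build those links, as a sketch that assumes Day nodes store `date` as a `YYYY-MM-DD` string. The `next_day_pairs` helper, the `prev`/`next` keys, and the `LINK_DAYS` statement are illustrative names, not part of the pipeline above.

```python
from datetime import date, timedelta

def next_day_pairs(days: list[str]) -> list[dict]:
    """Return {prev, next} pairs for dates that are exactly one day apart."""
    parsed = sorted({date.fromisoformat(d) for d in days})  # dedupe and order
    pairs = []
    for current, following in zip(parsed, parsed[1:]):
        # Skip gaps: days with no data should not be linked as neighbours
        if following - current == timedelta(days=1):
            pairs.append({"prev": current.isoformat(), "next": following.isoformat()})
    return pairs

# Cypher to run once with the pairs passed in as the $pairs parameter
LINK_DAYS = """
UNWIND $pairs AS pair
MATCH (d1:Day {date: pair.prev}), (d2:Day {date: pair.next})
MERGE (d1)-[:NEXT_DAY]->(d2)
"""

pairs = next_day_pairs(["2024-02-12", "2024-02-10", "2024-02-11", "2024-02-14"])
```

Here 2024-02-12 and 2024-02-14 are deliberately left unlinked, so a missing day never makes a workout look like it preceded the wrong night's sleep.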

Step 5: Visualizing with Grafana 📊

Neo4j is great for discovery, but for daily tracking, we need a dashboard. Using the Neo4j DataSource for Grafana, you can turn Cypher queries into beautiful time-series charts.

Install the Neo4j plugin in Grafana.
Add your Bolt URL and credentials.
Write your query to return a time and value column.

Conclusion: Data-Driven Wellness 🥑

Building a Personal Health Knowledge Graph isn't just a fun weekend project; it's about taking ownership of your biological data. By combining the flexibility of Neo4j with the robust ETL capabilities of Apache Hop, you transform "dirty" multi-source data into a strategic asset.

What’s next?

Add a Weather API node to see how humidity affects your running pace.
Integrate nutrition data from MyFitnessPal to see the impact of late-night snacks on HRV.

If you enjoyed this build, don't forget to subscribe and let me know in the comments: What's the weirdest correlation you've found in your health data? 📈

For more advanced guides on health-tech and data engineering, visit wellally.tech/blog.
