DEV Community

Beck_Moulton

From Messy Health Data to Actionable Insights: Building a Personal Knowledge Graph with Neo4j and Apache Hop

We are living in the golden age of "Quantified Self." Between your Apple Watch, Oura Ring, and Whoop strap, you probably have more data on your heart rate than a 1980s ICU. But here's the problem: Your data is trapped in silos.

Apple Health speaks one language, Oura speaks another, and none of them talk to each other. If you want to know how a specific workout affected your REM sleep three days later, a simple dashboard won't cut it. You need a Personal Health Knowledge Graph.

In this tutorial, we'll build an end-to-end Data Engineering pipeline to integrate multi-source health data into Neo4j, using Apache Hop for ETL and Pandas for the heavy lifting.


The Architecture: Why a Knowledge Graph?

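Relational databases can answer questions about correlations over time, but every extra hop (a workout, the next day, that night's sleep) becomes another join, and ORM-driven code often degrades into the "n+1" query pattern. A Knowledge Graph treats relationships as first-class citizens, so multi-step traversals stay short and readable.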

Data Flow Overview

graph TD
    A[Apple Health XML] -->|Pandas Parsing| B(Cleaned CSVs)
    C[Oura Ring API/Export] -->|Pandas Parsing| B
    B --> D{Apache Hop ETL}
    D -->|Cypher Ingestion| E[(Neo4j Graph DB)]
    E --> F[Grafana Dashboard]
    E --> G[Discovery Queries]

By connecting entities like Date, Activity, SleepSession, and BiometricMetric, we can traverse the graph to find hidden patterns.


Prerequisites 🛠️

To follow along, you'll need:

  • Neo4j (Community or Desktop version)
  • Apache Hop (The modern successor to Kettle/PDI)
  • Pandas (For initial XML/JSON flattening)
  • Python 3.10+

Step 1: Taming the "Dirty" Data with Pandas 🐼

Apple Health exports data as a massive, nested XML file. Oura gives you JSON or CSV. Our first job is to flatten these into a unified schema.

import pandas as pd
import xml.etree.ElementTree as ET

def parse_apple_health(file_path):
    # Load the whole export into memory; very large exports may need a
    # streaming parser instead.
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Each <Record> element stores one measurement in its XML attributes
    records = []
    for record in root.findall('Record'):
        records.append(record.attrib)

    df = pd.DataFrame(records)
    # Parse timestamps; utc=True avoids mixed-offset results when the
    # export spans daylight-saving changes
    df['startDate'] = pd.to_datetime(df['startDate'], utc=True)
    return df

# Example: Filtering for Resting Heart Rate
apple_df = parse_apple_health('export.xml')
rhr_df = apple_df[apple_df['type'] == 'HKQuantityTypeIdentifierRestingHeartRate']
rhr_df.to_csv('cleaned_rhr.csv', index=False)
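The snippet above covers only the Apple Health side. A minimal sketch of the Oura side, and of mapping both sources onto one set of columns, might look like the following. The column names in `UNIFIED_COLUMNS`, the function names, and the Oura keys (`data`, `day`, `efficiency`) are assumptions for illustration; check them against your own export before relying on them.

```python
import json

# Hypothetical target columns shared by every source in this sketch.
UNIFIED_COLUMNS = ["source", "metric_type", "date", "value", "unit"]


def flatten_oura_sleep(payload: str) -> list[dict]:
    """Turn an Oura-style JSON export into flat rows.

    Assumes a top-level "data" list whose items carry "day" and
    "efficiency" keys; adjust the names to match your actual export.
    """
    rows = []
    for item in json.loads(payload).get("data", []):
        if item.get("efficiency") is None:
            continue  # skip nights without an efficiency score
        rows.append({
            "source": "oura",
            "metric_type": "sleep_efficiency",
            "date": item["day"],  # assumed to be YYYY-MM-DD already
            "value": float(item["efficiency"]),
            "unit": "%",
        })
    return rows


def flatten_apple_record(attrib: dict) -> dict:
    """Map one Apple Health <Record> attribute dict onto the same columns.

    Only suitable for quantity records whose "value" is numeric.
    """
    return {
        "source": "apple_health",
        "metric_type": attrib["type"],
        "date": attrib["startDate"][:10],  # date portion of the timestamp string
        "value": float(attrib["value"]),
        "unit": attrib.get("unit", ""),
    }
```

Both functions return plain dictionaries, so the combined rows can be passed straight to `pd.DataFrame(rows)` and written out with `to_csv`, in the same way as the resting heart rate example.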

Step 2: Designing the Graph Schema 🕸️

In Neo4j, we want to model our health journey as a timeline.

  • Nodes: (:Person), (:Day), (:Activity), (:Sleep), (:Metric)
  • Relationships: (:Person)-[:TRACKED_ON]->(:Day), (:Day)-[:NEXT_DAY]->(:Day), (:Metric)-[:LOGGED_ON]->(:Day), (:Sleep)-[:LOGGED_ON]->(:Day), (:Activity)-[:PERFORMED_ON]->(:Day)
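The timeline idea depends on consecutive (:Day) nodes being linked, which the NEXT_DAY hop in the Step 4 query relies on. One way to prepare those links is to generate the day list and the day-to-next-day pairs up front and load them like any other CSV. This is a sketch only; `day_chain` is a hypothetical helper, not part of Neo4j or Apache Hop.

```python
from datetime import date, timedelta


def day_chain(first: date, last: date) -> tuple[list[str], list[tuple[str, str]]]:
    """Return ISO dates for every :Day node and (day, next_day) pairs."""
    n_days = (last - first).days + 1
    days = [(first + timedelta(days=i)).isoformat() for i in range(n_days)]
    # Pair each day with its successor; the last day has no NEXT_DAY edge.
    next_day_pairs = list(zip(days, days[1:]))
    return days, next_day_pairs


days, pairs = day_chain(date(2024, 2, 28), date(2024, 3, 1))
for current, following in pairs:
    print(f"{current} -> {following}")
```

Generating every calendar day, rather than only days that have data, keeps the chain unbroken so that "the following night" always means exactly one hop.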

Step 3: Orchestrating the ETL with Apache Hop 🦘

Apache Hop lets us build the load visually: a pipeline reads the cleaned CSVs and issues the upserts (MERGE) into Neo4j, so we avoid hand-writing hundreds of lines of loader boilerplate.

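  1. File Input: Point a file input transform (for example, CSV File Input) at your cleaned CSVs.
  2. Neo4j Cypher: Use MERGE for nodes and relationships so that re-running the pipeline does not create duplicates.
// This Cypher runs inside the Apache Hop Neo4j Cypher transform
MERGE (d:Day {date: $formatted_date})
MERGE (m:Metric {type: $metric_type})
// MERGE keeps one LOGGED_ON relationship per metric type and day; SET refreshes its reading
MERGE (m)-[r:LOGGED_ON]->(d)
SET r.value = $value, r.unit = $unit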
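In Hop, the mapping from CSV fields to the `$formatted_date`, `$metric_type`, `$value`, and `$unit` parameters is configured in the transform itself. To make that mapping concrete, here is the same idea as a small Python sketch. It assumes the CSV keeps the Apple Health column names (`type`, `unit`, `startDate`, `value`); `rows_to_params` is a hypothetical helper, not a Hop or Neo4j API.

```python
import csv
import io


def rows_to_params(csv_text: str) -> list[dict]:
    """Map cleaned CSV rows onto the $-parameters used by the Cypher statement."""
    params = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        params.append({
            # Date portion as written in the CSV; used as the :Day key
            "formatted_date": row["startDate"][:10],
            "metric_type": row["type"],
            "value": float(row["value"]),
            "unit": row["unit"],
        })
    return params


sample = (
    "type,unit,startDate,value\n"
    "HKQuantityTypeIdentifierRestingHeartRate,count/min,2024-03-01 07:30:00-05:00,52\n"
)
print(rows_to_params(sample))
```

Converting `value` to a number before it reaches the graph matters: a value stored as text cannot be averaged or compared numerically in later queries.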

Pro-Tip: For more production-ready examples and advanced patterns on handling high-velocity biometric data, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover excellent architectural strategies for health-tech scaling.


Step 4: Querying for Insights 🔍

Once your data is in Neo4j, you can ask questions that are awkward to answer from separate app exports. Want to see whether high-intensity workouts (Apple Health) are followed by better Sleep Efficiency (Oura) the following night?

MATCH (a:Activity)-[:PERFORMED_ON]->(d1:Day)-[:NEXT_DAY]->(d2:Day)
MATCH (s:Sleep)-[:LOGGED_ON]->(d2)
WHERE a.intensity = 'High'
RETURN d1.date, a.type, s.efficiency
ORDER BY d1.date DESC
LIMIT 10
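Before trusting the graph result, it can help to reproduce the same next-day join on a small sample in plain Python and compare the rows. The sketch below mirrors the query's logic under simplifying assumptions: one sleep record per day, dates as ISO strings, and a hypothetical function name.

```python
from datetime import date, timedelta


def next_night_efficiency(activities, sleeps):
    """Pair each high-intensity activity with the following day's sleep efficiency.

    activities: iterable of (iso_date, activity_type, intensity)
    sleeps: dict mapping iso_date -> efficiency
    """
    results = []
    for iso_day, activity_type, intensity in activities:
        if intensity != "High":
            continue
        next_day = (date.fromisoformat(iso_day) + timedelta(days=1)).isoformat()
        if next_day in sleeps:  # like MATCH: no sleep record, no result row
            results.append((iso_day, activity_type, sleeps[next_day]))
    # Newest first, capped at ten rows, like ORDER BY ... DESC LIMIT 10
    return sorted(results, key=lambda r: r[0], reverse=True)[:10]


activities = [
    ("2024-03-01", "Run", "High"),
    ("2024-03-02", "Walk", "Low"),
    ("2024-03-03", "HIIT", "High"),
]
sleeps = {"2024-03-02": 88, "2024-03-04": 93}
print(next_night_efficiency(activities, sleeps))
```

If the two outputs disagree, the usual suspects are missing NEXT_DAY links or dates that were stored in different time zones by the two sources.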

Step 5: Visualizing with Grafana 📊

Neo4j is great for discovery, but for daily tracking, we need a dashboard. Using the Neo4j DataSource for Grafana, you can turn Cypher queries into beautiful time-series charts.

  1. Install the Neo4j plugin in Grafana.
  2. Add your Bolt URL and credentials.
  3. Write your query to return a time and value column.

Conclusion: Data-Driven Wellness 🥑

Building a Personal Health Knowledge Graph isn't just a fun weekend project; it's about taking ownership of your biological data. By combining the flexibility of Neo4j with the robust ETL capabilities of Apache Hop, you transform "dirty" multi-source data into a strategic asset.

What's next?

  • Add a Weather API node to see how humidity affects your running pace.
  • Integrate nutrition data from MyFitnessPal to see the impact of late-night snacks on HRV.

If you enjoyed this build, don't forget to subscribe and let me know in the comments: What's the weirdest correlation you've found in your health data? 📈


For more advanced guides on health-tech and data engineering, visit wellally.tech/blog.
