We are living in the golden age of "Quantified Self." Between your Apple Watch, Oura Ring, and Whoop strap, you probably have more data on your heart rate than a 1980s ICU. But here's the problem: Your data is trapped in silos.
Apple Health speaks one language, Oura speaks another, and none of them talk to each other. If you want to know how a specific workout affected your REM sleep three days later, a simple dashboard won't cut it. You need a Personal Health Knowledge Graph.
In this tutorial, we'll build an end-to-end Data Engineering pipeline to integrate multi-source health data into Neo4j, using Apache Hop for ETL and Pandas for the heavy lifting.
The Architecture: Why a Knowledge Graph?
Relational databases force questions about correlations over time into ever-longer chains of JOINs that get slower and harder to write as the schema grows. A Knowledge Graph treats relationships as first-class citizens, so those same questions become simple traversals.
Data Flow Overview
```mermaid
graph TD
    A[Apple Health XML] -->|Pandas Parsing| B(Cleaned CSVs)
    C[Oura Ring API/Export] -->|Pandas Parsing| B
    B --> D{Apache Hop ETL}
    D -->|Cypher Ingestion| E[(Neo4j Graph DB)]
    E --> F[Grafana Dashboard]
    E --> G[Discovery Queries]
```
By connecting entities like Date, Activity, SleepSession, and BiometricMetric, we can traverse the graph to find hidden patterns.
Prerequisites 🛠️
To follow along, youโll need:
- Neo4j (Community or Desktop version)
- Apache Hop (The modern successor to Kettle/PDI)
- Pandas (For initial XML/JSON flattening)
- Python 3.10+
Step 1: Taming the "Dirty" Data with Pandas 🐼
Apple Health exports data as a massive, nested XML file. Oura gives you JSON or CSV. Our first job is to flatten these into a unified schema.
```python
import pandas as pd
import xml.etree.ElementTree as ET

def parse_apple_health(file_path):
    # Apple Health wraps every sample in a <Record> element,
    # with the data encoded as XML attributes
    tree = ET.parse(file_path)
    root = tree.getroot()
    records = [record.attrib for record in root.findall('Record')]
    df = pd.DataFrame(records)
    # Parse timestamps so they can be emitted as ISO-8601 for Neo4j
    df['startDate'] = pd.to_datetime(df['startDate'])
    return df

# Example: Filtering for Resting Heart Rate
apple_df = parse_apple_health('export.xml')
rhr_df = apple_df[apple_df['type'] == 'HKQuantityTypeIdentifierRestingHeartRate']
rhr_df.to_csv('cleaned_rhr.csv', index=False)
```
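The Oura side can be handled the same way. Here is a minimal sketch, assuming a JSON export with nested daily-sleep documents; the field names below are illustrative, so check them against your actual export:

```python
import pandas as pd

# Hypothetical Oura daily-sleep export; real field names may differ
oura_json = [
    {"day": "2024-05-01", "contributors": {"efficiency": 92, "rem_sleep": 78}},
    {"day": "2024-05-02", "contributors": {"efficiency": 88, "rem_sleep": 81}},
]

# json_normalize flattens the nested "contributors" dict into columns
oura_df = pd.json_normalize(oura_json, sep="_")
oura_df["day"] = pd.to_datetime(oura_df["day"])
oura_df.to_csv("cleaned_oura_sleep.csv", index=False)
```

With `sep="_"`, the nested keys come out as flat columns like `contributors_efficiency`, which keeps the unified CSV schema consistent with the Apple Health side.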
Step 2: Designing the Graph Schema 🕸️
In Neo4j, we want to model our health journey as a timeline.
- Nodes: `(:Person)`, `(:Day)`, `(:Activity)`, `(:Sleep)`, `(:Metric)`
- Relationships: `(:Person)-[:TRACKED_ON]->(:Day)`, `(:Day)-[:HAS_METRIC]->(:Metric)`, `(:Activity)-[:PERFORMED_ON]->(:Day)`
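To feed this schema, each cleaned CSV row has to become a parameter map for a Cypher `MERGE`. A small helper, sketched here with illustrative names, shows the shape of what the ETL layer will pass to Neo4j:

```python
import pandas as pd

def to_cypher_params(df: pd.DataFrame) -> list[dict]:
    """Turn a cleaned metrics DataFrame into one Cypher
    parameter map per row, keyed for the MERGE statements."""
    params = []
    for row in df.itertuples(index=False):
        params.append({
            "formatted_date": row.startDate.date().isoformat(),
            "metric_type": row.type,
            "value": float(row.value),  # CSVs hand us strings
            "unit": row.unit,
        })
    return params

df = pd.DataFrame({
    "startDate": pd.to_datetime(["2024-05-01 07:00", "2024-05-02 07:05"]),
    "type": ["RestingHeartRate"] * 2,
    "value": ["52", "49"],
    "unit": ["count/min"] * 2,
})
print(to_cypher_params(df)[0])
```

Casting `value` to float matters: everything read from a CSV arrives as a string, and you do not want string-typed metrics in the graph.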
Step 3: Orchestrating the ETL with Apache Hop 📦
Apache Hop is a game-changer for Data Engineering workflows. It allows us to build visual pipelines that handle the upserts (MERGE) into Neo4j without writing hundreds of lines of boilerplate code.
- File Input: Point to your cleaned CSVs.
- Neo4j Cypher Output: Use the `MERGE` command to ensure idempotency.
```cypher
// This Cypher runs inside the Apache Hop Neo4j Output transform
MERGE (d:Day {date: $formatted_date})
MERGE (m:Metric {type: $metric_type})
// MERGE (not CREATE) keeps re-runs from duplicating relationships
MERGE (m)-[r:LOGGED_ON]->(d)
SET r.value = $value, r.unit = $unit
```
Pro-Tip: For more production-ready examples and advanced patterns on handling high-velocity biometric data, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover excellent architectural strategies for health-tech scaling.
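One set of relationships the ingestion has to create explicitly is the `[:NEXT_DAY]` chain between consecutive `Day` nodes, since the discovery queries in Step 4 traverse it. A sketch of generating the day pairs in Python; each pair then becomes parameters for a `MERGE` you run through a Hop transform or the Neo4j driver, depending on your setup:

```python
from datetime import date, timedelta

def day_pairs(start: date, end: date):
    """Yield (day, next_day) ISO-date tuples for linking
    Day nodes with a [:NEXT_DAY] relationship."""
    d = start
    while d < end:
        yield d.isoformat(), (d + timedelta(days=1)).isoformat()
        d += timedelta(days=1)

# Each pair feeds a statement like:
# MATCH (d1:Day {date: $d1}), (d2:Day {date: $d2})
# MERGE (d1)-[:NEXT_DAY]->(d2)
pairs = list(day_pairs(date(2024, 5, 1), date(2024, 5, 4)))
print(pairs)
```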
Step 4: Querying for Insights 🔍
Once your data is in Neo4j, you can run queries that were previously impossible. Want to see if high-intensity workouts (Apple Health) lead to better Sleep Efficiency (Oura) the following night?
```cypher
MATCH (a:Activity)-[:PERFORMED_ON]->(d1:Day)-[:NEXT_DAY]->(d2:Day)
MATCH (s:Sleep)-[:LOGGED_ON]->(d2)
WHERE a.intensity = 'High'
RETURN d1.date, a.type, s.efficiency
ORDER BY d1.date DESC
LIMIT 10
```
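Before trusting a new graph, it is worth sanity-checking the same question directly against the cleaned CSVs in pandas. A rough sketch with toy data (column names are illustrative): shift the sleep dates back one day so each workout lines up with the following night, then compare averages by intensity.

```python
import pandas as pd

workouts = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03"]),
    "intensity": ["High", "Low", "High"],
})
sleep = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-02", "2024-05-03", "2024-05-04"]),
    "efficiency": [91, 84, 93],
})

# Shift sleep back a day so each workout joins to the next night's sleep
sleep_next = sleep.assign(date=sleep["date"] - pd.Timedelta(days=1))
joined = workouts.merge(sleep_next, on="date")
print(joined.groupby("intensity")["efficiency"].mean())
```

If the pandas answer and the Cypher answer disagree, the graph ingestion (not the question) is where to look first.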
Step 5: Visualizing with Grafana 📊
Neo4j is great for discovery, but for daily tracking, we need a dashboard. Using the Neo4j DataSource for Grafana, you can turn Cypher queries into beautiful time-series charts.
- Install the Neo4j plugin in Grafana.
- Add your Bolt URL and credentials.
- Write your query to return a `time` and a `value` column.
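Grafana generally expects time-series frames as epoch-millisecond timestamps plus a numeric value. The sketch below converts day-keyed records into that shape; the exact column requirements vary by datasource plugin, so confirm against the plugin's documentation:

```python
import pandas as pd

rows = [
    {"date": "2024-05-01", "resting_hr": 52},
    {"date": "2024-05-02", "resting_hr": 49},
]
df = pd.DataFrame(rows)

# Grafana-friendly frame: epoch-ms "time" plus a numeric "value"
out = pd.DataFrame({
    "time": pd.to_datetime(df["date"]).astype("int64") // 1_000_000,
    "value": df["resting_hr"].astype(float),
})
print(out)
```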
Conclusion: Data-Driven Wellness 💥
Building a Personal Health Knowledge Graph isn't just a fun weekend project; it's about taking ownership of your biological data. By combining the flexibility of Neo4j with the robust ETL capabilities of Apache Hop, you transform "dirty" multi-source data into a strategic asset.
What's next?
- Add a Weather API node to see how humidity affects your running pace.
- Integrate nutrition data from MyFitnessPal to see the impact of late-night snacks on HRV.
If you enjoyed this build, don't forget to subscribe and let me know in the comments: What's the weirdest correlation you've found in your health data? 👇
For more advanced guides on health-tech and data engineering, visit wellally.tech/blog.