Have you ever tried to export your health data from Google Health Connect? If you have, you know it’s a wild ride. You expect a clean spreadsheet, but what you get is a "JSON explosion"—nested objects, inconsistent timestamps, and more metadata than actual data.
To turn this digital noise into actionable health trends, you need a robust ETL pipeline, professional Data Modeling, and a reliable BigQuery Data Warehouse. In this tutorial, we are going to build a modern data stack that transforms raw wearable telemetry into a clean, queryable format using DBT (Data Build Tool) and Google Cloud Functions.
The Architecture: From Raw Bytes to Dashboard
Before we dive into the code, let's visualize how the data flows from your wrist to your screen. We’ll use a "Medallion Architecture" (Bronze, Silver, Gold) to ensure data integrity.
```mermaid
graph TD
    A[Google Health Connect JSON] -->|Trigger| B(Google Cloud Functions)
    B -->|Streaming Ingest| C[(BigQuery: Bronze - Raw)]
    C -->|dbt run| D[(BigQuery: Silver - Staging)]
    D -->|dbt test/build| E[(BigQuery: Gold - Metrics)]
    E -->|Visualize| F[Looker Studio]
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
```
Prerequisites
To follow along, you'll need:
- Google Cloud Platform (GCP) account with BigQuery enabled.
- DBT Core or DBT Cloud set up.
- The tech stack: Python (for Cloud Functions), SQL (for DBT), and Looker Studio for the final polish.
Step 1: Ingesting the "Mess" with Cloud Functions
Google Health Connect exports are often dumped into a Cloud Storage bucket. We need a serverless trigger to catch these files and pipe them into BigQuery as raw JSON strings.
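Before deploying the function, the Bronze table it writes to needs to exist. Here's a minimal DDL sketch, assuming the dataset and column names used by the function below:

```sql
-- Bronze landing table for raw Health Connect exports (sketch)
CREATE TABLE IF NOT EXISTS health_data_raw.health_connect_ingest (
    raw_content STRING,        -- the untouched JSON payload from the export file
    ingested_at TIMESTAMP      -- when the Cloud Function streamed the row in
)
PARTITION BY DATE(ingested_at);
```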
```python
from datetime import datetime, timezone

from google.cloud import bigquery, storage

# Initialize the clients once so they are reused across warm invocations
bq_client = bigquery.Client()
storage_client = storage.Client()

dataset_id = "health_data_raw"
table_id = "health_connect_ingest"


def ingest_health_data(event, context):
    """Triggered by a change to a Cloud Storage bucket."""
    bucket_name = event["bucket"]
    file_name = event["name"]
    print(f"Processing file: {file_name}")

    # Read the exported JSON from GCS. We store the entire record as a single
    # JSON string column initially to avoid schema evolution headaches.
    blob = storage_client.bucket(bucket_name).blob(file_name)
    raw_content = blob.download_as_text()

    table = bq_client.get_table(f"{dataset_id}.{table_id}")

    # Simplified example of streaming a row into the Bronze table
    rows_to_insert = [
        {
            "raw_content": raw_content,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
    ]
    errors = bq_client.insert_rows_json(table, rows_to_insert)

    if not errors:
        print("New rows have been added.")
    else:
        print(f"Encountered errors while inserting rows: {errors}")
```
Step 2: Normalization with DBT (The Magic Sauce)
Now that we have a mountain of JSON in our Bronze layer, we use DBT to flatten it. This is where we turn raw_content into actual columns like heart_rate, step_count, and sleep_duration.
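Before any staging model can reference the raw table with `source()`, it has to be declared in a dbt sources file. A minimal sketch, assuming it lives at `models/staging/sources.yml`:

```yaml
# models/staging/sources.yml (sketch)
version: 2

sources:
  - name: health_raw
    database: your-gcp-project-id   # hypothetical; replace with your project
    schema: health_data_raw         # the BigQuery dataset the Cloud Function writes to
    tables:
      - name: health_connect_ingest
        loaded_at_field: ingested_at
```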
The Staging Model (stg_heart_rate.sql)
BigQuery's JSON functions (JSON_EXTRACT_SCALAR and JSON_QUERY_ARRAY) make this flattening straightforward, all in plain SQL inside the warehouse.
```sql
-- models/staging/stg_heart_rate.sql

WITH raw_data AS (
    SELECT
        JSON_EXTRACT_SCALAR(raw_content, '$.device_id') AS device_id,
        JSON_EXTRACT(raw_content, '$.metrics.heart_rate') AS hr_array,
        CAST(JSON_EXTRACT_SCALAR(raw_content, '$.timestamp') AS TIMESTAMP) AS event_timestamp
    FROM {{ source('health_raw', 'health_connect_ingest') }}
)

SELECT
    device_id,
    event_timestamp,
    -- Extracting nested JSON values from each heart-rate reading
    CAST(JSON_EXTRACT_SCALAR(hr_item, '$.bpm') AS FLOAT64) AS bpm
FROM raw_data,
    UNNEST(JSON_QUERY_ARRAY(hr_array)) AS hr_item
-- Column aliases can't be referenced in WHERE, so repeat the extraction here
WHERE JSON_EXTRACT_SCALAR(hr_item, '$.bpm') IS NOT NULL
```
Step 3: Production-Grade Patterns
While the above works for a hobby project, production environments require data quality tests, documentation, and snapshotting for slowly changing dimensions.
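For example, basic data quality tests on the staging model can live in a schema file next to it. A small sketch, assuming the dbt_utils package is installed (the accepted bpm range is an assumption; tune it to your own data):

```yaml
# models/staging/schema.yml (sketch)
version: 2

models:
  - name: stg_heart_rate
    description: "One row per heart-rate reading, flattened from the raw Health Connect export."
    columns:
      - name: device_id
        tests:
          - not_null
      - name: bpm
        tests:
          - not_null
          - dbt_utils.accepted_range:   # requires the dbt_utils package
              min_value: 25
              max_value: 250
```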
If you're looking for more production-ready examples and advanced engineering patterns—such as handling late-arriving data or multi-tenant warehouse schemas—I highly recommend checking out the technical deep dives at WellAlly Tech Blog. They cover some fantastic "Day 2" operations that take your data engineering from "it works" to "it scales."
Step 4: Visualizing Trends in Looker Studio
With your Gold models ready (e.g., `fct_daily_health_summary`), connect BigQuery to Looker Studio:
- Select Project: your GCP project.
- Table: `dbt_metrics.fct_daily_health_summary`.
- Metrics: create a Time Series chart with `Avg(bpm)` and `Sum(steps)`.
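For reference, here is a minimal sketch of what `fct_daily_health_summary` could look like as a Gold model. Only the heart-rate columns are derived from the staging model shown earlier; the steps metric assumes a parallel `stg_steps` model exists, and the file path is illustrative:

```sql
-- models/marts/fct_daily_health_summary.sql (sketch)
WITH daily_heart_rate AS (
    SELECT
        device_id,
        DATE(event_timestamp) AS activity_date,
        AVG(bpm) AS avg_bpm,
        MIN(bpm) AS min_bpm
    FROM {{ ref('stg_heart_rate') }}
    GROUP BY device_id, activity_date
)

SELECT
    device_id,
    activity_date,
    avg_bpm,
    min_bpm
    -- steps would be joined in from a stg_steps model (not shown)
FROM daily_heart_rate
```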
Suddenly, that messy JSON is a beautiful line graph showing how your resting heart rate drops after you actually start going to the gym!
Conclusion: Learning in Public
Building ETL pipelines isn't just about moving data; it's about making data useful. By using DBT with BigQuery, we’ve created a system that is:
- Version Controlled: All logic lives in SQL files.
- Scalable: BigQuery handles the heavy lifting.
- Extensible: Want to add Oura Ring or Apple Watch data? Just add a new staging model.
What are you building with your health data? Drop a comment below or share your DBT tips! Don't forget to subscribe for more data engineering adventures.