Building an Observability Mesh with Grafana, Loki, and Prometheus
When multiple backend services start running in isolation, debugging becomes guesswork. My recent sprint was about turning that guesswork into clarity — by wiring up full observability across Django, Nextcloud, Grafana, Loki, and Prometheus.
Goal
Unify logs and metrics across services in a distributed setup — all communicating over Caddy TLS and my Tailnet domain.
I wanted one dashboard that could tell me everything about my system’s health without SSH-ing into individual servers.
Architecture
Here’s the high-level design:
Stack Overview
Prometheus → scrapes metrics from Django and Nextcloud API endpoints
Loki → ingests logs from both services
Grafana → visualizes metrics and logs together
Caddy → reverse proxy with trusted TLS for all endpoints
Tailnet (Tailscale) → private network with identity-based access
Everything talks over encrypted connections inside the tailnet: no publicly exposed ports, no unencrypted traffic.
Challenges
1. Grafana showed logs but no metrics
Root cause: Prometheus targets weren’t reachable after moving from localhost to tailnet hostnames.
2. TLS verification issues in Prometheus
Solved by updating Caddy’s certificates and confirming Prometheus scrape configs pointed to HTTPS endpoints.
3. Cross-service routing
Caddy needed to route paths such as /metrics, /api/schema, and /api/* to the correct backend, Django or Nextcloud.
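For the first challenge, one way to see which targets Prometheus cannot reach is its HTTP API: `GET /api/v1/targets` lists every active target with its `health` and `lastError`. The sketch below filters that JSON for targets that are not `up`. The helper name `unhealthy_targets` and the sample payload are hypothetical, made up for illustration; only the field names come from the Prometheus API.

```python
import json
from typing import Any


def unhealthy_targets(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Return job, scrape URL, and last error for every target not reporting 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": t.get("labels", {}).get("job", ""),
            "scrapeUrl": t.get("scrapeUrl", ""),
            "lastError": t.get("lastError", ""),
        }
        for t in targets
        if t.get("health") != "up"
    ]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "django"},
       "scrapeUrl": "https://X.tail.ts.net:8000/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "nextcloud"},
       "scrapeUrl": "https://X.tail.ts.net:8080/metrics",
       "health": "down", "lastError": "context deadline exceeded"}
    ]
  }
}
""")

if __name__ == "__main__":
    for target in unhealthy_targets(SAMPLE):
        print(f"{target['job']}: {target['scrapeUrl']} ({target['lastError']})")
```

In a real setup the payload would come from the Prometheus server's own `/api/v1/targets` endpoint rather than from a hard-coded sample.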
Config Highlights
Here’s a simplified Prometheus scrape config example:
scrape_configs:
  - job_name: "django"
    metrics_path: /metrics
    static_configs:
      - targets: ["X.tail.ts.net:8000"]
  - job_name: "nextcloud"
    metrics_path: /metrics
    static_configs:
      - targets: ["X.tail.ts.net:8080"]
Both routes sit behind Caddy, which handles TLS termination using trusted Tailnet certificates.
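To confirm from the Prometheus host that an endpoint really is served over HTTPS with a certificate the system trusts, a small probe can help. This is a minimal sketch using only the Python standard library, not the setup I actually ran; `metrics_url` and `probe` are hypothetical helper names, and the host is the same `X.tail.ts.net` placeholder as in the scrape config.

```python
import ssl
import urllib.request


def metrics_url(host: str, port: int, path: str = "/metrics") -> str:
    """Build the HTTPS URL that would be scraped for a given target."""
    return f"https://{host}:{port}/{path.lstrip('/')}"


def probe(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL with certificate verification on and return the HTTP status.

    Raises urllib.error.URLError if the host is unreachable or the
    certificate fails verification, and urllib.error.HTTPError (a subclass
    of URLError) for HTTP error responses such as 404 or 500.
    """
    context = ssl.create_default_context()  # verifies chain and hostname
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status


# Example (placeholder host; replace before running):
# print(probe(metrics_url("X.tail.ts.net", 8000)))
```

A certificate problem surfaces as a `URLError` whose `reason` is the underlying SSL error, which makes it easy to tell a TLS failure apart from a plain connection timeout.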
Results
Once Prometheus started scraping successfully, Grafana dashboards came alive.
Now I can:
Correlate logs and metrics per request
Track uptime and performance trends
Visualize distributed system behavior across all nodes
It feels like operating my own mini control plane — distributed, secure, and explainable.
Next Steps
Add distributed tracing (OpenTelemetry)
Define Prometheus alert rules for critical endpoints
Automate observability config rollout via CI/CD
Key Takeaway
Observability isn’t an add-on — it’s the nervous system of your infrastructure.
When your servers start talking, you start listening differently.

