Monitoring AI agents in production

#ai #monitoring #devtools #agents

Traditional monitoring asks one question: is the server up? If the endpoint returns 200, everything is fine. AI agents break that assumption. The server can be perfectly healthy while the agent silently produces wrong outputs, skips steps, runs over budget, or stops working entirely — all without triggering a single alert.

Monitoring autonomous agents requires a different mental model. Here's what actually breaks and how to catch it.

What traditional monitoring misses

Uptime monitoring tells you the endpoint responded. It says nothing about what the agent did inside that response. An agent endpoint that returns {"status": "ok"} in 50ms might have skipped the entire task due to a context length limit, a rate limit on the model API, or a malformed tool call that silently failed.

The failure modes specific to AI agents in production:

Silent tool failures. A tool call returns an error that the model handles by continuing without it. The task "completes" but with missing data.
Context window exhaustion. Long-running agents hit token limits mid-task and truncate their work. The HTTP response is still 200.
Model API degradation. The underlying model API is slow or returning degraded outputs. Your endpoint is up; the work is wrong.
Drift over time. An agent that worked last week starts producing subtly different outputs as the model is updated. No alert fires — outputs just quietly change.
Scheduled run skips. The agent was supposed to run at 06:00. It didn't. Nothing in your existing monitoring catches this because the server never went down.
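
Most of these only become visible if the agent reports what it did rather than just that it finished. As a rough sketch, suppose the agent writes one line per tool call to a log file in the form tool=<name> status=ok|error. That log format and both function names are assumptions made for illustration, not something a particular framework emits. A post-run check can then refuse to treat the run as successful if any tool call failed:

```shell
# Hypothetical log format, one line per tool call:
#   tool=search status=ok
#   tool=fetch_page status=error

# count_tool_errors LOG_FILE
# Prints the number of failed tool calls recorded in the log.
count_tool_errors() {
  # grep -c prints the number of matching lines. It exits non-zero when
  # that number is 0, so the exit status is deliberately ignored.
  grep -c 'status=error$' "$1" || true
}

# run_is_clean LOG_FILE
# Succeeds only if the log exists and contains no failed tool calls.
run_is_clean() {
  [ -f "$1" ] || return 1
  [ "$(count_tool_errors "$1")" -eq 0 ]
}
```

Calling run_is_clean on the log before reporting success turns a silent tool failure into an explicit one.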

The three layers of agent monitoring

Layer 1: Uptime monitoring

Still necessary, just not sufficient. Your agent's HTTP endpoint should be monitored for availability and response time. A degraded model API often shows up as increased latency before it causes outright failures.

Set up an uptime monitor on the endpoint your agent exposes. With a 30-second check interval, an outage is detected within about half a minute, usually before users report it. Configure timeout alerts as well: if your agent normally responds in under 10 seconds and starts taking 90, something is wrong even if it's still returning 200.

curl -X POST https://api.tickstem.dev/v1/monitors \
  -H "Authorization: Bearer $TICKSTEM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "summary-agent-endpoint", "url": "https://your-app.com/agents/summary/health", "interval_secs": 30, "timeout_secs": 15}'

Layer 2: Heartbeat monitoring

Uptime tells you the server is alive. Heartbeat tells you the agent actually did the work.

A heartbeat monitor works as a dead man's switch: your agent sends a ping after each successful completion. If no ping arrives within the expected window, you get an alert. Whether the server is up is irrelevant: if the work stopped happening, the heartbeat catches it.
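
The check behind a dead man's switch is simple arithmetic. Here is a minimal sketch of that logic, assuming the deadline is the last ping plus the interval plus a grace period. The function name and arguments are illustrative, and how Tickstem evaluates the window internally is not specified here.

```shell
# heartbeat_overdue LAST_PING NOW INTERVAL GRACE
# LAST_PING and NOW are Unix timestamps; INTERVAL and GRACE are in seconds.
# Succeeds (exit 0) when more than INTERVAL + GRACE seconds have passed
# since the last ping, i.e. when an alert should fire.
heartbeat_overdue() {
  [ $(( $2 - $1 )) -gt $(( $3 + $4 )) ]
}
```

With a daily interval (86400 seconds) and a one-hour grace period (3600 seconds), a ping counts as overdue once more than 25 hours have passed since the last one.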

# Create a heartbeat and save the returned token
curl -X POST https://api.tickstem.dev/v1/heartbeats \
  -H "Authorization: Bearer $TICKSTEM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"daily-summary-agent","interval_secs":86400,"grace_secs":3600}'

# At the end of every successful agent run
# (-f: treat HTTP errors as failures, -m 10: give up after 10 seconds)
curl -fsS -m 10 -X POST "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping"

The ping only fires on success — after the agent has verified its own output. Silence means failure, regardless of what the HTTP response said.
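
One way to enforce that ordering is a small wrapper that runs the agent, checks its output, and pings last. This is a sketch under assumptions: verify_output and its TRUNCATED marker are hypothetical stand-ins for whatever checks fit your agent, and HEARTBEAT_TOKEN is the token saved when the heartbeat was created.

```shell
# send_ping: reports a successful run to the heartbeat endpoint.
send_ping() {
  curl -fsS -m 10 -X POST \
    "https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping" > /dev/null
}

# verify_output FILE
# Hypothetical check: the output must be non-empty and must not contain
# a marker showing that the work was cut short.
verify_output() {
  [ -s "$1" ] || return 1
  ! grep -q 'TRUNCATED' "$1"
}

# run_with_heartbeat OUTPUT_FILE COMMAND [ARGS...]
# Runs the agent command, verifies what it wrote, and pings only if both
# steps succeed. Any failure returns early, so no ping is sent.
run_with_heartbeat() {
  out_file=$1
  shift
  "$@" > "$out_file" || return 1
  verify_output "$out_file" || return 1
  send_ping
}
```

For example, run_with_heartbeat /tmp/summary.txt ./run_agent.sh, where run_agent.sh is a hypothetical script that prints the agent's result. If the script fails or the output does not pass verification, the heartbeat goes silent and the alert fires.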

Layer 3: Execution history

The most underused layer. Every scheduled agent run should produce a logged record: when it ran, how long it took, whether it succeeded, and what it returned.

Without this, debugging a failure means reconstructing what happened from scattered logs. With it, you open the execution history and see immediately: the run at 06:03 took 4 minutes instead of the usual 45 seconds, returned a 500, and the response body contains a rate limit error from the model API.

If you trigger your agent through an HTTP-based scheduler, execution history usually comes for free: the scheduler logs every run with the full request and response.
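
Without such a scheduler, a run record can be as simple as one appended line per run. The sketch below is an assumption about what a minimal record might contain, not a description of any particular tool. It stores the start time, duration, exit code, and the length of the captured output, rather than the output itself, to avoid JSON escaping in shell. The function name and field names are made up for illustration.

```shell
# record_run LOG_FILE COMMAND [ARGS...]
# Runs the command and appends one JSON line to LOG_FILE containing the
# start time (Unix seconds), duration in seconds, exit code, and the length
# in characters of the captured output (stdout and stderr combined).
# Returns the command's exit code so callers can still react to failure.
record_run() {
  log_file=$1
  shift
  started=$(date +%s)
  if output=$("$@" 2>&1); then code=0; else code=$?; fi
  finished=$(date +%s)
  printf '{"started_at":%s,"duration_secs":%s,"exit_code":%s,"output_len":%s}\n' \
    "$started" "$((finished - started))" "$code" "${#output}" >> "$log_file"
  return "$code"
}
```

A run that usually takes 45 seconds and suddenly takes 4 minutes then stands out in the log as a jump in duration_secs, even if exit_code is still 0.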

A practical rule: any agent task that runs on a schedule and produces output that other systems depend on needs all three layers. Uptime alone is not monitoring — it's a pulse check.

Wiring it up via MCP

If you're building with Claude Code or a similar MCP-compatible agent, you can set up the full monitoring stack from within your editor. The Tickstem MCP server exposes create_monitor, create_heartbeat, and list_executions as native tools.

What good agent monitoring looks like

The goal is to answer three questions at any point in time, without digging through logs:

Is the agent endpoint reachable and responding normally? (uptime)
Did the agent complete its last scheduled task? (heartbeat)
What happened on the last N runs? (execution history)

When all three are in place, debugging shifts from "something might be wrong, let me check everything" to "here's exactly what happened and when."

Tickstem provides uptime monitoring, heartbeat checks, cron scheduling, and email verification under one API key. Free tier at app.tickstem.dev — no credit card required.

Top comments (1)

Harjot Singh • May 31

Monitoring agents is genuinely different from monitoring a normal service, and that's the part people underestimate. With a CRUD app, green means healthy. With an agent, every call can return 200, latency can look fine, and the agent can still be quietly doing the wrong thing - looping, drifting from the goal, burning tokens, producing confidently-wrong output. So the metrics that matter are weird: per-run cost (the silent killer), task success rate not uptime, step/loop counts, tool-call error rates, and output-quality signals, not just CPU and 5xx. You're monitoring behavior and economics, not just liveness.

The one I'd put first is cost-per-run with an alert, because a wedged or looping agent doesn't crash, it just silently spends - that's the failure mode that shows up on the bill, not the dashboard. That's exactly why cost is a first-class monitored number in Moonshift, the thing I build - a multi-agent pipeline that takes a prompt to a deployed SaaS, with idle-watchdogs for wedged agents, per-run cost caps, and a verify layer so a drifting agent gets caught instead of running up a tab (a full build lands ~$3 flat, first run free no card). Solid topic. What are you tracking as your top signal - cost-per-run, task success, or loop/step anomalies? And do you have a kill-switch when a run blows past a cost/step budget? That circuit breaker is the thing that saves you at 3am.