Nazar Boyko

Posted on Jun 12 • Originally published at nazarboyko.com

AI Observability: Logs, Prompts, Tool Calls, And Cost

#opentelemetry #observability #llm #ai

Here's a five-line function. It calls an LLM, logs the answer, returns it.

async function ask(question: string) {
  const res = await openai.responses.create({ model: "o4-mini", input: question });
  console.log("answer:", res.output_text);
  return res.output_text;
}

This compiles. It passes tests. It ships. And it will quietly cost you four figures a month before anyone notices, because nothing in that log tells you the model burned 8,000 hidden reasoning tokens to produce a 40-token reply.

That's the gap this article is about. AI calls are not regular HTTP calls. The interesting state isn't the response body - it's the messages you sent, the tools the model picked, the tokens it consumed (visible and otherwise), and the dollars that drained out of the budget. If your observability story is "we log the answer," you're flying a plane with one gauge and that gauge is the altimeter.

Let's talk about what to actually capture.

The four signals that matter

Every AI system has the same four dimensions worth instrumenting, and most teams only track one or two of them:

Logs - the request/response pair, the error, the latency. The boring stuff that traditional APM already covers.
Prompts - the actual text that went in and the actual text that came out. Including system prompts, tool definitions, and history.
Tool calls - which tool the model picked, with what arguments, what came back, in what order, with what retries.
Cost - input tokens, output tokens, cached tokens, reasoning tokens, model, and the per-million-token price for each. Multiplied per user, per feature, per request.

Lose any one of these and you're working blind on a different axis of the problem. Lose the cost signal and you wake up to a Slack message from finance. Lose the tool-call signal and you can't tell why your agent kept booking the wrong flight. Lose the prompt signal and a prod regression becomes a guessing game. Lose plain logs and you don't even know the call happened.

The good news: in 2026 there's finally a standard for capturing all four. The bad news: most teams are still rolling their own and missing half the fields.

Logs: what to capture, and why "200 OK" is a lie

Start with the boring layer. Every LLM call deserves a structured log line with at minimum:

Timestamp, request ID, parent trace ID.
Provider (openai, anthropic, bedrock, your own gateway), model name, model version if you have it.
Endpoint or operation (chat.completions, responses, messages).
Latency - both wall-clock and time-to-first-token if you stream.
HTTP status, error class, error body.
Finish reason (stop, length, tool_calls, content_filter).

That last one is the trap. A 200 from the API does not mean "the model answered the question." A finish_reason of length means the response was truncated mid-sentence. content_filter means the safety system blocked the output. tool_calls means the model is asking you to do work and the conversation isn't done. If your monitoring counts all 200s as success, you're counting truncations and refusals as wins.
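Here's a minimal sketch of that idea: classify each call by HTTP status and finish reason together, so truncations and refusals stop counting as wins. The outcome labels (answered, truncated, and so on) are made up for illustration; use whatever vocabulary your dashboards already have.

```python
from typing import Optional

# Finish reasons that mean "the user did not get a full answer".
DEGRADED_REASONS = {"length": "truncated", "content_filter": "blocked"}


def classify_llm_outcome(http_status: int, finish_reason: Optional[str]) -> str:
    """Map an HTTP status plus finish reason to a monitoring outcome label."""
    if http_status >= 400:
        return "http_error"
    if finish_reason == "stop":
        return "answered"
    if finish_reason == "tool_calls":
        # Not a failure, but the conversation is not finished yet.
        return "needs_tool_execution"
    if finish_reason in DEGRADED_REASONS:
        return DEGRADED_REASONS[finish_reason]
    # Missing or unrecognized finish reason: surface it instead of counting it as success.
    return "unknown"
```

Only the answered bucket should feed a success-rate metric; everything else deserves its own series.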

The streaming case is its own thing. A streamed response can return an HTTP 200, emit half a sentence, and then die with a connection drop. The "did this call succeed" check has to happen at the end of the stream, not at the headers. Capture the byte count and the chunk count as well - a partial response that arrived in three chunks instead of forty tells you the model died early, and the latency-to-first-token will look great even though the user got nothing useful.

Time-to-first-token is the latency number that actually correlates with user-perceived speed. Total duration matters for billing and capacity planning, but a user who sees the first token in 600ms and the last token in 8s feels a fast app. A user who waits 4 seconds before anything appears does not, even if total duration is shorter.
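A rough sketch of what capturing those streaming numbers can look like. It assumes the stream has already been reduced to an iterator of text chunks and that a dropped connection surfaces as ConnectionError; real SDKs yield richer chunk objects and raise their own exception types, so treat the names here as placeholders.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamStats:
    text: str
    chunk_count: int
    byte_count: int
    time_to_first_token_s: Optional[float]
    total_duration_s: float
    completed: bool  # False if the stream died before finishing


def consume_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> StreamStats:
    """Drain a stream of text chunks, recording timing and size along the way."""
    start = clock()
    first_token_at: Optional[float] = None
    parts: list[str] = []
    completed = True
    try:
        for chunk in chunks:
            if first_token_at is None:
                first_token_at = clock()
            parts.append(chunk)
    except ConnectionError:
        # The headers said 200, but the body died mid-flight.
        completed = False
    end = clock()
    text = "".join(parts)
    return StreamStats(
        text=text,
        chunk_count=len(parts),
        byte_count=len(text.encode("utf-8")),
        time_to_first_token_s=None if first_token_at is None else first_token_at - start,
        total_duration_s=end - start,
        completed=completed,
    )
```

The success decision is made from completed after the loop ends, not from the status code that arrived with the headers.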

Prompts: capture the whole conversation, then redact

Here's a rule that takes one prod incident to learn: when a prompt-related bug shows up - wrong answer, weird tone, refusal that shouldn't have happened - you cannot debug it from a summary. You need the exact text the model saw. System prompt, every message in history, every tool definition, every retrieval result you stuffed in. The whole payload.

This is where most homegrown logging falls down. Teams log prompt.length === 4720 because storing the actual text feels excessive. Then a user complains the assistant gave them an answer about basketball when they asked about tennis, and you have nothing - just a length and a model name. The bug was a stale memory chunk from another user's session bleeding into the system prompt, and you can't see it because you didn't store it.

Store the full payload. Disk is cheap, your time is not. But two caveats:

Redact PII before it leaves your network. Prompts are unstructured user input. They contain names, emails, addresses, credit card numbers, internal account IDs, and worse. If you ship that to a third-party observability vendor, you've just turned a debugging tool into a GDPR liability. The OpenTelemetry GenAI working group has put real attention into this - there's a concept of an in-pipeline PII-redaction processor that strips sensitive tokens before the span leaves your collector. Datadog's LLM Observability ships default scanning rules for emails and IPs out of the box using their Sensitive Data Scanner. Either build your own redaction step or pick a vendor that's already done it. Don't ship raw prompts blindly.

Version your system prompts. If you change the system prompt, you've changed the program. Treat it like a git-tracked artifact, assign it a version, and stamp every request with the version that produced it. When you A/B a new prompt and one variant degrades, you want to slice your metrics by prompt.version the same way you'd slice by deploy.sha.
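To make the redaction caveat concrete, here's a deliberately naive sketch that masks email addresses and card-like digit runs before a prompt is logged. The two regexes are illustrative assumptions and will miss plenty of real PII; a production pipeline would lean on a dedicated scanner rather than a pair of patterns.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with labeled placeholders."""
    text = EMAIL_RE.sub("[redacted: email]", text)
    text = CARD_RE.sub("[redacted: card]", text)
    return text
```

Labeled placeholders keep the log readable: you can still see that the user pasted an email, just not which one.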

A reasonable shape for a captured prompt looks like this:

{
  "request_id": "req_01HXY...",
  "trace_id": "abc123",
  "model": "claude-sonnet-4-6",
  "prompt_version": "support-agent-v37",
  "system": "[redacted system prompt — stored at hash sha256:9f3a...]",
  "messages": [
    { "role": "user", "content": "[redacted: email]" },
    { "role": "assistant", "content": "Sure, I can help with that..." },
    { "role": "user", "content": "What was the total of order [redacted: order_id]?" }
  ],
  "tools": ["lookup_order", "issue_refund", "escalate_to_human"]
}

Store the system prompt by hash and look it up from a versioned registry. That way you can replay any historical request against any historical prompt - and you don't store the same 2,000-token system message ten thousand times a day.
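A small sketch of the hash-and-registry idea, using an in-memory dictionary as a stand-in for whatever versioned store you actually run. The PromptRegistry class and its method names are hypothetical.

```python
import hashlib


class PromptRegistry:
    """In-memory stand-in for a versioned prompt store keyed by content hash."""

    def __init__(self) -> None:
        self._text_by_hash: dict[str, str] = {}
        self._hash_by_version: dict[str, str] = {}

    def register(self, version: str, text: str) -> str:
        """Store a prompt under its content hash and remember which version maps to it."""
        digest = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._text_by_hash[digest] = text
        self._hash_by_version[version] = digest
        return digest

    def hash_for(self, version: str) -> str:
        """Look up the hash to stamp on a request made with this prompt version."""
        return self._hash_by_version[version]

    def resolve(self, digest: str) -> str:
        """Recover the exact prompt text when replaying a historical request."""
        return self._text_by_hash[digest]
```

Each request log then carries only the short hash, and identical prompt text always maps to the same entry no matter how many times it is registered.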

Tool calls: where most agents quietly go wrong

This is the signal teams underinvest in the most, and it's the one that matters most for anything agent-shaped.

A modern LLM call doesn't just return text - it returns a decision. That decision might be plain text. It might be a request to call search_inventory({"sku": "WIDGET-7"}). It might be three tool calls in parallel. It might be a tool call with arguments that look reasonable but reference a SKU that doesn't exist in your catalog. The failure modes here are weird and varied, and they all look like the same opaque "agent didn't do the right thing" symptom from the outside.

The known failure modes are basically:

Wrong tool picked. Model called refund_order when it should have called cancel_order.
Malformed arguments. Model returned JSON that doesn't parse, or parses but violates the schema.
Hallucinated arguments. Model invented a parameter that isn't in the tool definition. Or filled a real parameter with a value it made up ("order_id": "ORD-12345" when no such order exists).
Wrong order. Model called ship_order before confirm_payment.
Missing call. Model answered the question without using the tool that would have grounded the answer.
Infinite retry. Tool returns an error, model retries with the same arguments, error returns, repeat until the loop limit kicks in or the bill does.

Every one of those has a different fix and a different blast radius. You cannot tell them apart from response text alone. You need to capture each tool call as its own structured event.
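Once tool calls are structured events, some of these failure modes become cheap to detect. As one example, here's a sketch that flags the infinite-retry pattern by fingerprinting each call (name plus canonicalized arguments) and counting repeats within a turn. The function names and the default threshold of two repeats are assumptions for illustration.

```python
import json
from collections import Counter
from typing import Any


def call_fingerprint(name: str, arguments: dict[str, Any]) -> str:
    """Stable identity for a tool call: name plus canonicalized arguments."""
    return name + ":" + json.dumps(arguments, sort_keys=True, separators=(",", ":"))


def find_repeated_calls(
    calls: list[tuple[str, dict[str, Any]]],
    max_repeats: int = 2,
) -> list[str]:
    """Return fingerprints of calls issued more than max_repeats times in one turn."""
    counts = Counter(call_fingerprint(name, args) for name, args in calls)
    return [fp for fp, n in counts.items() if n > max_repeats]
```

Sorting the keys before hashing matters: the model is free to emit the same arguments in a different key order on each retry.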

The minimum you want per tool call:

Tool name, tool definition version.
Full arguments object.
Parent message ID and the model decision that produced it.
Tool execution result - the literal value you returned to the model.
Execution time, success/failure status, error message if any.
Sequence position within the turn (was this call 1 of 3 in parallel, or call 4 of a serial chain).
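One possible shape for that per-call record, sketched as a dataclass that serializes to a single structured log line. The field names are my own shorthand for the list above, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ToolCallEvent:
    tool_name: str
    tool_version: str
    arguments: dict[str, Any]
    parent_message_id: str
    result: Any                  # the literal value returned to the model
    duration_ms: float
    ok: bool
    error: Optional[str]
    sequence_index: int          # position within the turn, starting at 0
    parallel_group_size: int     # 1 for a serial call, N for one of N parallel calls

    def to_log_line(self) -> str:
        """Serialize to one JSON line; non-JSON values fall back to their str() form."""
        return json.dumps(asdict(self), sort_keys=True, default=str)
```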

In OpenTelemetry's GenAI semantic conventions, this is structured. The model's request to call a tool shows up inside gen_ai.output.messages as a part of the assistant message, shaped like { "type": "tool_call", "id": "call_abc", "name": "search_inventory", "arguments": {...} }. The result you sent back appears in the next turn's gen_ai.input.messages as a message with "role": "tool" whose part has "type": "tool_call_response". The gen_ai.response.finish_reasons attribute will include "tool_calls" when the turn ended with the model requesting tools rather than answering.

Once you have this structured, you can run cheap deterministic checks on every tool call before it even reaches a human reviewer:

validate-tool-call.ts

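import Ajv, { type Schema } from "ajv";

type ToolCall = { name: string; arguments: Record<string, unknown> };

const ajv = new Ajv();

// Hypothetical order-ID format ("ORD-" plus digits); swap in your own scheme.
const isWellFormedOrderId = (id: unknown): boolean =>
  typeof id === "string" && /^ORD-\d+$/.test(id);

function validateToolCall(call: ToolCall, schemas: Record<string, Schema>) {
  const schema = schemas[call.name];
  if (!schema) return { ok: false, reason: "unknown_tool" };

  // ajv.validate returns a boolean; the error details live on ajv.errors.
  const valid = ajv.validate(schema, call.arguments);
  if (!valid) return { ok: false, reason: "schema_violation", errors: ajv.errors };

  // Catch malformed IDs before they hit your DB.
  if (call.arguments.order_id && !isWellFormedOrderId(call.arguments.order_id)) {
    return { ok: false, reason: "malformed_id" };
  }

  return { ok: true };
}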

Most production AI failures are syntax and routing problems, not deep semantic hallucinations. A regex and a JSON-schema validator catch a huge chunk of them before they cost you anything. Treat that validation as the first gate; only failures past the gate become evals for a human or a stronger model to grade.

And about retries - "retry on failure" is one of the most dangerous instructions you can put in a system prompt. An agent that retries a charge_card call because the response timed out is an agent that just charged your customer twice. Idempotency keys on every tool that mutates state are non-negotiable. Log the idempotency key alongside the tool call. When two calls have the same key, you know the retry path got exercised.
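Here's a sketch of the idempotency idea. The key is derived from the conversation, the turn, the tool name, and the arguments, so a retry of the same logical call produces the same key. The executor keeps results in memory, which is an assumption made to keep the example self-contained; in a real system the key should also be passed to the downstream service (payment providers generally accept one), since a local cache can't help if the first attempt timed out after the charge went through.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(conversation_id: str, turn: int, tool: str, arguments: dict[str, Any]) -> str:
    """Derive a key that is identical for a retry of the same logical tool call."""
    payload = json.dumps(
        {"conversation": conversation_id, "turn": turn, "tool": tool, "args": arguments},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


class IdempotentExecutor:
    """Runs a mutating tool at most once per key and replays the stored result on retries."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self.retries_seen = 0  # how often the retry path was exercised

    def run(self, key: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if key in self._results:
            self.retries_seen += 1
            return self._results[key]
        result = fn(*args, **kwargs)
        self._results[key] = result
        return result
```

The retries_seen counter is the observability hook: a nonzero value on a charge_card-style tool is worth an alert on its own.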

Cost: the bill nobody saw coming

This is where the OpenAI snippet at the top of the article hurts you. You logged the answer. You did not log the cost. And modern models have at least four token counters that all matter for the final number:

Input tokens - the prompt you sent. Billed at the model's input rate.
Output tokens - the text that came back. Billed at the much higher output rate.
Cached input tokens - tokens served from a prompt-prefix cache. Billed at a steep discount.
Reasoning tokens - internal "thinking" tokens used by reasoning models like the o-series. They count toward output cost, but they don't appear in the response text. The user never sees them. Your wallet does.

The numbers here are not small. Anthropic's prompt caching, for example, prices cache reads at roughly 10% of the base input token rate. The flip side is that writing to the cache costs more than a normal input token - about 1.25x the base rate for the 5-minute cache, 2x for the 1-hour cache. So caching is a bet: the cache write pays off only if you actually get cache hits later. The savings from cache reads have to exceed the premium you paid on cache writes before the strategy breaks even. If you don't track cache_creation_input_tokens vs cache_read_input_tokens separately, you can spend more on caching than you save and not realize it.
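The break-even point falls straight out of those multipliers. A small sketch, assuming the first request pays the cache-write rate and every later request within the cache lifetime is a hit at the read rate (real traffic will have misses, which push the break-even point further out):

```python
def relative_cost_with_cache(
    uses: int,
    write_multiplier: float,
    read_multiplier: float = 0.10,
) -> float:
    """Cost of sending a prefix `uses` times with caching, relative to never caching.

    The first request pays the cache-write rate and the rest pay the cache-read rate.
    A result below 1.0 means caching saved money.
    """
    if uses < 1:
        raise ValueError("uses must be at least 1")
    cached = write_multiplier + read_multiplier * (uses - 1)
    return cached / uses


def break_even_uses(write_multiplier: float, read_multiplier: float = 0.10) -> int:
    """Smallest number of uses at which caching is cheaper than not caching."""
    if read_multiplier >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as normal input")
    uses = 1
    while relative_cost_with_cache(uses, write_multiplier, read_multiplier) >= 1.0:
        uses += 1
    return uses


print(break_even_uses(1.25))  # 5-minute cache: 2
print(break_even_uses(2.0))   # 1-hour cache: 3
```

So a prefix written to the 5-minute cache needs to be reused at least once to pay for itself, and one written to the 1-hour cache needs two reuses. A prefix that is written and never read again is a pure loss.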

OpenAI's usage object on the Responses API reports the same split slightly differently. You get input_tokens, output_tokens, total_tokens, plus input_tokens_details.cached_tokens and output_tokens_details.reasoning_tokens. Cached tokens at OpenAI are billed at a discount that depends on the model (50% off the regular input price when the feature launched, steeper on newer models), and the discount kicks in automatically - you don't opt into it. Check the price table for the model you actually call. Reasoning tokens, again, count toward output cost.

The "I shipped a thin wrapper around an o-series model and my bill went 8x" surprise is almost always reasoning tokens. A reasoning model on a hard problem can spend tens of thousands of tokens thinking before it writes a 100-token answer. If your dashboards show "output tokens per request" and your number looks reasonable, but your bill doesn't, look at reasoning_tokens separately. Plot them as their own series.
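To see how lopsided this gets, here's a sketch of a cost estimate over OpenAI-style usage numbers, where input_tokens already includes the cached tokens and output_tokens already includes the reasoning tokens. The prices in the usage example are round placeholder values, not any model's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePerMillion:
    input_usd: float
    cached_input_usd: float
    output_usd: float


def estimate_cost_usd(
    input_tokens: int,
    cached_tokens: int,
    output_tokens: int,
    price: PricePerMillion,
) -> float:
    """Estimate cost from OpenAI-style usage, where input_tokens includes cached
    tokens and output_tokens includes reasoning tokens."""
    uncached = input_tokens - cached_tokens
    total = (
        uncached * price.input_usd
        + cached_tokens * price.cached_input_usd
        + output_tokens * price.output_usd
    )
    return total / 1_000_000


def reasoning_share(output_tokens: int, reasoning_tokens: int) -> float:
    """Fraction of billed output tokens that the user never saw."""
    return reasoning_tokens / output_tokens if output_tokens else 0.0


# Placeholder prices in USD per million tokens (hypothetical).
price = PricePerMillion(input_usd=1.0, cached_input_usd=0.5, output_usd=4.0)
# The call from the intro: a 40-token reply behind 8,000 reasoning tokens.
cost = estimate_cost_usd(input_tokens=2000, cached_tokens=1000, output_tokens=8040, price=price)
share = reasoning_share(output_tokens=8040, reasoning_tokens=8000)
```

With these inputs, more than 99% of the billed output is reasoning. That ratio is the series to plot next to cost per request.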

A minimum schema for cost telemetry:

:::tabs
@tab TypeScript
record-llm-cost.ts

type LLMCostRecord = {
  request_id: string;
  user_id: string;
  feature: string;          // "support_chat", "summarize_pr", "search_rerank"
  provider: "openai" | "anthropic" | "bedrock";
  model: string;            // "claude-sonnet-4-6", "o4-mini"
  input_tokens: number;
  output_tokens: number;
  cached_input_tokens: number;     // Anthropic: cache_read_input_tokens
  cache_write_tokens: number;      // Anthropic only; 0 elsewhere
  reasoning_tokens: number;        // o-series, Claude extended thinking
  estimated_cost_usd: number;      // computed from per-model price table
};

@tab Python
record_llm_cost.py

from dataclasses import dataclass

@dataclass
class LLMCostRecord:
    request_id: str
    user_id: str
    feature: str             # "support_chat", "summarize_pr", "search_rerank"
    provider: str            # "openai" | "anthropic" | "bedrock"
    model: str               # "claude-sonnet-4-6", "o4-mini"
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int     # Anthropic: cache_read_input_tokens
    cache_write_tokens: int      # Anthropic only; 0 elsewhere
    reasoning_tokens: int        # o-series, Claude extended thinking
    estimated_cost_usd: float    # computed from per-model price table

@tab Go
record_llm_cost.go

type LLMCostRecord struct {
    RequestID         string  `json:"request_id"`
    UserID            string  `json:"user_id"`
    Feature           string  `json:"feature"`  // "support_chat", "summarize_pr", "search_rerank"
    Provider          string  `json:"provider"` // "openai" | "anthropic" | "bedrock"
    Model             string  `json:"model"`    // "claude-sonnet-4-6", "o4-mini"
    InputTokens       int     `json:"input_tokens"`
    OutputTokens      int     `json:"output_tokens"`
    CachedInputTokens int     `json:"cached_input_tokens"` // Anthropic: cache_read_input_tokens
    CacheWriteTokens  int     `json:"cache_write_tokens"`  // Anthropic only; 0 elsewhere
    ReasoningTokens   int     `json:"reasoning_tokens"`    // o-series, Claude extended thinking
    EstimatedCostUSD  float64 `json:"estimated_cost_usd"`  // computed from per-model price table
}

:::

Notice the user_id and feature fields. Those are the attribution dimensions. The only way to act on a cost number is to know whose cost it is. A dashboard that shows "$4,200 yesterday" doesn't tell you anything you can fix. A dashboard that shows "$3,100 of yesterday's $4,200 came from feature=pr_summarizer and 72% of that came from one customer running it on a 50,000-line diff" is a budget conversation, a rate-limit ticket, and a feature decision in one breath.

Push that attribution down to the API call level. The pattern is dead simple: every request adds metadata like { user_id, team_id, feature, environment }. Your observability layer indexes on it. Your billing layer slices on it. When a single user spikes their cost above some threshold, an alert fires. When a feature regresses to 3x its baseline cost-per-request, you catch it before finance does.
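A sketch of the per-user alert, assuming cost records arrive as dictionaries carrying the user_id, feature, and estimated_cost_usd fields from the schema above. The function name and the fixed daily limit are illustrative; a real alert would more likely compare against a rolling baseline.

```python
from collections import defaultdict
from typing import Iterable


def users_over_budget(
    records: Iterable[dict],
    daily_limit_usd: float,
) -> dict[tuple[str, str], float]:
    """Sum estimated cost per (user_id, feature) and return the pairs over the limit."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for rec in records:
        totals[(rec["user_id"], rec["feature"])] += rec["estimated_cost_usd"]
    return {key: total for key, total in totals.items() if total > daily_limit_usd}
```

Keying on the pair rather than the user alone is what turns "someone is expensive" into "this customer is expensive on this feature".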

Under the hood: the OpenTelemetry GenAI conventions

You don't have to invent the schema. OpenTelemetry's GenAI Semantic Conventions, developed by a CNCF working group, now define a standard for LLM telemetry across providers and platforms. The conventions are still marked experimental as of mid-2026, but they're stable enough that Datadog, AWS, Azure, Google Cloud, and the major open-source platforms have all implemented them. If you instrument once against the spec, your telemetry works on any backend that speaks it.

Two pieces of the spec are worth knowing in detail.

Spans. A GenAI client span carries attributes like:

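gen_ai.provider.name - the provider name (e.g. openai, anthropic). Older versions of the conventions called this gen_ai.system, so you may still see that name in existing instrumentation.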
gen_ai.request.model - the model the caller asked for.
gen_ai.response.model - the actual model that answered (these diverge when providers route, e.g. when a gpt-4o request gets served by a gpt-4o-2024-08-06 snapshot).
gen_ai.usage.input_tokens and gen_ai.usage.output_tokens - the counts.
gen_ai.response.finish_reasons - array, because multi-choice responses can have multiple. Includes "tool_calls" when the model wants to call tools.
gen_ai.input.messages and gen_ai.output.messages - the full message arrays, including the tool-call shape mentioned earlier. These are optional and gated by a content-capture flag, because of the PII concern.

Metrics. Two histogram metrics are the workhorses:

gen_ai.client.operation.duration - call latency in seconds. The spec recommends explicit bucket boundaries of [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92]. Those boundaries are tuned so the histogram resolves both fast retrieval calls and slow generation calls without one swamping the other.
gen_ai.client.token.usage - token counts as a histogram, with boundaries of [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864]. The very large top buckets exist because long-context models routinely chew through hundreds of thousands of tokens per call.
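To get a feel for how those boundaries carve up real latencies, here's a small sketch that maps a measurement to its bucket, treating each upper bound as inclusive. The boundary list is copied from the duration metric above; bucket_label is a hypothetical helper, not part of any SDK.

```python
from bisect import bisect_left

DURATION_BOUNDS_S = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64,
                     1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92]


def bucket_label(value: float, bounds: list[float]) -> str:
    """Return the explicit-bucket range a measurement lands in (upper bound inclusive)."""
    i = bisect_left(bounds, value)
    if i == 0:
        return f"<= {bounds[0]}"
    if i == len(bounds):
        return f"> {bounds[-1]}"
    return f"({bounds[i - 1]}, {bounds[i]}]"


print(bucket_label(0.6, DURATION_BOUNDS_S))  # (0.32, 0.64]
print(bucket_label(8.0, DURATION_BOUNDS_S))  # (5.12, 10.24]
```

Because each boundary doubles the previous one, a 600ms call and an 8s call land several buckets apart instead of being squashed into the same bar.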

The spec also says: when a provider reports both "used" tokens and "billable" tokens (because of caching, batching discounts, etc.), instrumentation MUST report the billable number. Your dashboard should match your invoice.

Auto-instrumentation packages exist for OpenAI, Anthropic, LangChain, and LlamaIndex. If your stack uses any of those, you can light up GenAI tracing with a single import and a config flag. Roll your own only when none of the auto-packages cover your provider.

Where the telemetry lives: proxy vs SDK

Once you've decided what to capture, you have a second question: where in the request path do you capture it? There are basically two architectures.

Proxy-based. You put a gateway in front of every LLM call. Helicone is the canonical example: change your base URL or add one header, and every request flows through their (or your self-hosted) proxy, which logs request, response, latency, and cost. You instrumented zero code. The downside is you only see what the proxy sees - one LLM call at a time. If your agent does retrieval, then an LLM call, then three tool calls, then another LLM call, the proxy sees two disconnected LLM calls and none of the retrieval or tool execution in between, not one logical conversation. You also add a network hop to every call, which matters for latency-sensitive workloads.

SDK-based. You wrap your LLM client (or your framework's wrappers) with tracing code that builds a tree of spans. Langfuse is the canonical example: an SDK that exposes trace, span, generation, and event primitives. You write more integration code, but you get hierarchical traces where the root span is the user's request and the leaf spans are every LLM call, retrieval, tool invocation, and post-processing step in between. For anything agent-shaped, this is what you want.

LangSmith sits in a third category - deep integration with LangChain. If your stack is already LangChain or LangGraph, LangSmith hooks in automatically and understands the framework's internals. Outside LangChain it's less compelling.

The honest tradeoff: if you need to ship observability today and you mostly make single LLM calls, a proxy wins on time-to-value (Helicone's free tier covers 10K requests/month; Langfuse Cloud's covers 50K events/month; LangSmith's covers 5K traces/month). If you're building an agent and you care about understanding why a conversation went sideways across nine model calls and twelve tool invocations, you need SDK-based hierarchical tracing.

You can absolutely run both. A proxy for the raw billable-event firehose, an SDK for the structured agent traces. The OpenTelemetry conventions make this less crazy than it sounds - both layers can emit the same span shape.

Wiring it up: a worked example

Here's what a single LLM call looks like with all four signals captured, using OpenTelemetry's GenAI conventions and the OpenAI auto-instrumentation:

instrumented_llm_call.py

from opentelemetry import trace
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

OpenAIInstrumentor().instrument(capture_content=True)
tracer = trace.get_tracer(__name__)
client = OpenAI()

def summarize_pr(pr_diff: str, user_id: str) -> str:
    with tracer.start_as_current_span("summarize_pr") as span:
        # Cost attribution: the dimensions you'll want to slice by later.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.feature", "pr_summarizer")
        span.set_attribute("app.prompt_version", "pr-summarizer-v12")

        # OpenAI auto-instrumentation will emit a child span with all the
        # gen_ai.* attributes: model, input/output tokens, finish_reasons,
        # plus the messages array if capture_content is on.
        res = client.responses.create(
            model="o4-mini",
            input=f"Summarize this PR diff:\n\n{pr_diff}",
            metadata={"user_id": user_id, "feature": "pr_summarizer"},
        )

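        # The reasoning_tokens field is the one most homegrown logging misses.
        # Promote it to your own attribute so dashboards can slice on it.
        # usage can be missing on a response that did not complete, so guard the lookup.
        details = res.usage.output_tokens_details if res.usage else None
        span.set_attribute("app.reasoning_tokens",
                           (details.reasoning_tokens if details else 0) or 0)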

        # Finish-reason check: a 200 from the API is not success.
        if res.status != "completed":
            span.set_attribute("app.completed", False)
            raise RuntimeError(f"response not completed: {res.status}")

        return res.output_text

The auto-instrumentation handles the GenAI semantic conventions - span name, gen_ai.request.model, gen_ai.response.model, the token usage histograms, the messages capture (gated by capture_content=True, which you'll want off in environments where PII redaction isn't in place). You handle the things the spec can't know: the user, the feature, the prompt version, and the reasoning-token promotion.

Now when this call goes sideways, you can answer all the questions that matter:

Which user? app.user_id.
Which feature regression? app.feature + app.prompt_version.
Did the model truncate? gen_ai.response.finish_reasons.
Why did the cost spike? app.reasoning_tokens vs gen_ai.usage.output_tokens.
How long did the user wait? gen_ai.client.operation.duration.
What did the model actually see? gen_ai.input.messages (if content capture is on).

That's the whole story. Four signals, captured at the right layer, attributed to the right dimensions.

A few things worth getting wrong only once

A handful of lessons that tend to be expensive the first time:

Warning
Don't log raw prompts to a third-party vendor without redaction in front. Prompts routinely contain personal data, and GDPR and CCPA regulate that data wherever it ends up, including your logs. A leaky observability pipeline is a breach.

Tip
Sample aggressively on success, capture everything on failure. Storing every payload from every successful call at scale will eat budget. Storing every payload from every failed call is non-negotiable for debugging.

Note
Set per-user and per-feature cost alerts before you launch a feature, not after. A single user driving 90% of your spend on a brand-new feature is one of the most common shapes of an LLM cost incident, and it almost never trips traditional rate limits because the request rate looks normal.
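The sampling tip above fits in a few lines. This sketch keeps every failed call's payload and a random fraction of the successful ones; the 5% default rate is an arbitrary placeholder, and the rng parameter exists only so the decision can be tested deterministically.

```python
import random
from typing import Callable


def should_store_payload(
    failed: bool,
    success_sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every failed call's full payload; keep only a sample of successful ones."""
    if failed:
        return True
    return rng() < success_sample_rate
```

This only gates the heavy payload. The lightweight fields (token counts, cost, finish reason) are worth recording for every call regardless.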

And the meta-lesson: the model is the cheapest part of the system to change. The expensive part is the feedback loop between "users saw a bad answer" and "the team figured out why." Observability is what shortens that loop. Skipping it because the prototype works is borrowing from a credit card you haven't read the rate on yet.

Log the prompts. Trace the tool calls. Track the cached and reasoning tokens. Attribute the cost. Then ship.
