The incident started with a boring support automation task.
Take a user request, search a private document index, summarize the answer, and hand t...
the "called the right tool with the wrong input, retried against stale context" description is the failure pattern hardest to catch. wrong input reuse is invisible in the final output.
worth adding: hash the tool input on each tool_start. if the same tool fires with an identical input hash on consecutive turns, that is a retry loop signal before the guard triggers. caught this in a document QA agent: same query string across 4 turns, model summarizing the same chunk each time with full confidence.
does the guard check input repetition, or is it purely turn count and spend?
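A minimal sketch of the input-hash check described above. The names `input_hash` and `RepeatDetector`, the `max_repeats` threshold, and the choice of a truncated SHA-256 digest are all assumptions for illustration, not part of the article's guard.

```python
import hashlib
import json


def input_hash(tool_name: str, args: dict) -> str:
    """Short, stable fingerprint of a tool call's input (hypothetical helper)."""
    # sort_keys makes the hash independent of dict key order.
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class RepeatDetector:
    """Tracks how many consecutive tool_start events share one input hash."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats  # assumed threshold, tune per agent
        self.last_hash = None
        self.streak = 0

    def observe(self, tool_name: str, args: dict) -> bool:
        """Return True once the same input has fired more than max_repeats times in a row."""
        h = input_hash(tool_name, args)
        self.streak = self.streak + 1 if h == self.last_hash else 1
        self.last_hash = h
        return self.streak > self.max_repeats
```

Calling `observe` at each tool_start would let the signal fire before a turn-count or spend limit does. It only catches exact repeats, so a paraphrased query would slip through.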
That's a great suggestion. I like the idea of tracking input hashes because, as you said, retry loops are often invisible if you're only looking at the final output.
In the version from the article, the guard is intentionally simple and only watches turn count and cost. It doesn't currently check for repeated inputs or repeated tool patterns. But I can definitely see value in adding that layer, especially for catching "same action, same context, same result" loops before they become expensive.
The document QA example is a perfect illustration of the kind of failure that a basic budget guard won't catch early enough.
yeah, simple and inspectable is the right call for v1. shipping the basic guard first makes sense.
the hash approach is cheap — short md5 of serialized args, stored in agent_state alongside the turn counter. two lines. the real win: makes DuckDB queries interesting. you can group by input_hash, count turns, and surface repeated failure patterns across sessions not just within one run.
have you queried across multiple agent runs yet, or just within a single session so far?
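A rough sketch of the cross-session grouping mentioned above. To keep it runnable with only the standard library, it uses `sqlite3` as a stand-in for DuckDB; the query is plain `GROUP BY` SQL and should carry over, but that is an assumption. The table layout and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical flattened events: (session_id, turn, tool_name, input_hash)
events = [
    ("run-a", 1, "search_docs", "9f2c"),
    ("run-a", 2, "search_docs", "9f2c"),
    ("run-a", 3, "search_docs", "9f2c"),
    ("run-b", 1, "search_docs", "9f2c"),
    ("run-b", 2, "search_docs", "41aa"),
    ("run-c", 1, "summarize", "77d0"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (session_id TEXT, turn INTEGER, tool_name TEXT, input_hash TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)

# Inputs that were used more than once, and how many sessions they appear in.
REPEATED_INPUTS = """
SELECT tool_name,
       input_hash,
       COUNT(*) AS turns,
       COUNT(DISTINCT session_id) AS sessions
FROM events
GROUP BY tool_name, input_hash
HAVING COUNT(*) > 1
ORDER BY turns DESC
"""

rows = conn.execute(REPEATED_INPUTS).fetchall()
for row in rows:
    print(row)
```

An input hash that shows up in several sessions points at a pattern in the agent rather than a one-off bad run.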
This is a strong pattern.
The part I like most is that the trace is not just for debugging exceptions. It lets you
reconstruct the decision path after the fact: what the agent saw, what tool it called,
what input it used, what failed, and where the guard stopped it.
That matters because agent failures are rarely single-point failures. They are usually
chain failures: stale context → wrong tool input → retry → old result summarized
confidently → cost leak.
I’ve been testing a neighboring problem around agent memory: relevant context is not
always authoritative context. So one thing I would want in a black-box trace is not only
“what memory/context was used?” but also “what made that context allowed to govern the
next action?”
For example, I’d add fields like:
- context_source
- context_status (active, stale, superseded, provisional)
- action_type (read, write, execute)
- governing_rule
- verification_required

Then after a crash, DuckDB could answer questions like:
That would connect observability with authority, not just observability with failure.
Really useful article. The “query the run after everything is over” framing is exactly
the right direction.
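One way the proposed authority fields could look as a trace event, as a sketch only. The field names come from the comment above; the `AuthorityEvent` class, the `turn` and `tool_name` fields, the validation, and the `risky_actions` filter are assumptions added for illustration.

```python
import json
from dataclasses import dataclass, asdict

CONTEXT_STATUSES = {"active", "stale", "superseded", "provisional"}
ACTION_TYPES = {"read", "write", "execute"}


@dataclass
class AuthorityEvent:
    turn: int                    # hypothetical: position in the run
    tool_name: str               # hypothetical: tool the action used
    context_source: str
    context_status: str
    action_type: str
    governing_rule: str
    verification_required: bool

    def __post_init__(self):
        # Reject values outside the vocabularies listed in the comment.
        if self.context_status not in CONTEXT_STATUSES:
            raise ValueError(f"unknown context_status: {self.context_status}")
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")

    def to_jsonl(self) -> str:
        """Serialize as one JSON line, ready to append to a trace file."""
        return json.dumps(asdict(self), sort_keys=True)


def risky_actions(events):
    """Non-read actions that ran on context that was not 'active'."""
    return [e for e in events if e.action_type != "read" and e.context_status != "active"]
```

With events shaped like this, a post-run filter can list every write or execute step that leaned on stale, superseded, or provisional context.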
I really like the distinction you're making between observability and authority.
One thing that became clear while building this was that many agent failures don't start where they become visible. By the time you see the bad tool call or the budget overrun, the actual mistake may have happened several steps earlier when the agent accepted a piece of context that it shouldn't have trusted.
Your proposed fields are interesting because they move the trace from "what happened?" toward "why was this allowed to happen?" That's a much harder question, and probably the one that matters most as agents start relying more heavily on memory and long-running state.
The idea of tracking context status and governing rules especially stands out to me. Being able to ask "which actions were influenced by stale or provisional context?" would expose an entire class of failures that basic logging completely misses.
I also like your example queries. They feel very similar to the transition from debugging software failures to auditing decision systems. At that point the trace becomes more than a reliability tool. It becomes a way to inspect authority flow through the run.
Definitely gave me a few ideas for a future version of the black box. Thanks for the thoughtful comment.
Exactly, “where it becomes visible” and “where it became allowed” are two different
points in the run.
That distinction is the part I keep circling back to. A trace that only records the final
bad tool call can tell you what broke, but it may not tell you which memory, rule,
assumption, or stale context gave the agent permission to move in that direction.
That is where observability starts becoming authority inspection.
The useful trace fields are not only:
but also:
That would let you query failures backward from the action into the authority path.
Something like:
“Show me every write action influenced by provisional context.”
or:
“Show me tool calls where stale memory appeared in the decision path.”
That is the kind of black box I think agents need next. Not just a record of execution,
but a record of why execution was permitted.
Your article already has the right foundation for that because JSONL gives the run a
durable spine. Once authority metadata gets attached to those events, the trace becomes
much more than debugging. It becomes a decision audit.
That's a really interesting way to frame it: not just "what happened?" but "what gave the agent permission to do it?"
The more I think about it, the more I agree that authority metadata could reveal an entire class of failures that normal traces miss. A bad action is often just the end of a much longer chain of accepted assumptions.
I especially like the idea of querying authority paths the same way we query execution paths. That starts moving the black box from debugging toward decision auditing, which feels like a natural next step for more capable agents.
Yes, that “accepted assumptions” phrase is exactly the thing.
A lot of agent failures do not begin at the visible action. The bad tool call is just
where the chain finally becomes observable.
Before that, the system may have already accepted:
If none of that authority path is recorded, the trace can tell us what happened but not
why the system believed it was permitted.
That is why I like the idea of treating authority as first-class trace data.
Execution path:
Authority path:
Once that exists, you can ask much better post-run questions:
That is the part that starts turning a black box into a decision audit.
The run trace should not only preserve what the agent did. It should preserve what the
agent thought it was allowed to do.
I think you're getting at something really important here. We spend a lot of time tracing actions, but much less time tracing the assumptions that authorized those actions.
The distinction between execution history and authority history is becoming more interesting to me the more I think about it. If an agent can explain not only what it did, but which memory, policy, or verification path allowed it to do it, post-run analysis becomes much more powerful.
At that point, we're not just debugging behavior. We're auditing decisions.
"That is not debugging. That is guessing with syntax highlighting" is the line that lands. The whole post is the working version of that distinction.
The DuckDB-over-JSONL move is the right shape for the single-process case because it inverts the typical observability tradeoff: most teams pay for a hosted stack to get queryability, but for one agent in one process, append-only JSONL plus a free SQL engine gets you 80% of the forensic value without a vendor. The 71-line constraint is what makes it shippable instead of yet another half-built observability platform.
One extension worth considering: the schema you've got captures WHAT happened (tool_start, tool_end, tool_error, guard_check) but not WHY the agent chose that tool with that input. The model's reasoning chain (which memory was retrieved, which policy was checked, which prior turn the decision was conditioned on) is the layer below your current trace. Most "the agent hallucinated" post-mortems hit a wall at exactly that gap: you can see the call, you can't see the deliberation.
Adding a tool_selection event before tool_start, with the retrieved context hash and the policy snapshot the agent was operating under, gives you a deliberation trace alongside the execution trace. Still 71 lines of recorder code; the schema does the work.
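A sketch of what emitting a tool_selection event ahead of tool_start might look like. The `record`, `digest`, and `call_tool` helpers, and the exact field names `context_hash` and `policy_snapshot`, are hypothetical; the article's recorder may be shaped differently.

```python
import hashlib
import io
import json


def digest(obj) -> str:
    """Short content hash of any JSON-serializable value."""
    raw = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def record(stream, event_type: str, **fields) -> None:
    """Append one event as a JSON line."""
    stream.write(json.dumps({"event": event_type, **fields}, sort_keys=True) + "\n")


def call_tool(stream, turn: int, tool_name: str, args: dict, retrieved_context, policy: dict) -> None:
    # Deliberation trace: what the agent had in hand when it chose the tool.
    record(
        stream,
        "tool_selection",
        turn=turn,
        tool_name=tool_name,
        context_hash=digest(retrieved_context),
        policy_snapshot=policy,
    )
    # Execution trace: the call itself.
    record(stream, "tool_start", turn=turn, tool_name=tool_name, input=args)
```

Passing an open file handle or an `io.StringIO` as `stream` gives two lines per call, so a later query can join the selection to the execution on `turn` and `tool_name`.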
The provenance question gets harder when you cross process boundaries: multi-agent coordination, retries that span sessions, model versions changing under you. That's where the local-file model starts to break and you need either a content-addressed event store or something stronger. Different problem though. For the single-agent case you're describing, the JSONL+DuckDB pattern is correct.
Building toward the cross-process version on the protocol side at NOVAI. Same forensic question, different trust assumptions when there's no single process to own the log file. The local case you're solving is the right starting point.
Good post. The constraint is the contribution.
Thanks. I really like the distinction you're making between execution traces and deliberation traces.
You're right that many investigations eventually hit the "I can see what happened, but not why it was chosen" wall. A tool_selection event with context or policy metadata would be a natural extension of the current approach without fundamentally changing the design.
And I agree about the scope. The article is very much focused on the single-agent, single-process case. Once you move into multi-agent systems and cross-process coordination, provenance becomes a much harder problem. But as you said, the local case feels like the right place to start before tackling the distributed one.
Really strong piece. What stands out is that you are not just logging failures, you are preserving the decision trail that caused them. That is the difference between guessing at a bad output and actually isolating where stale context, a wrong tool input, or a retry loop changed the run. The DuckDB part is especially good because it turns debugging into analysis, not archaeology. This is exactly the kind of pattern more agent systems should adopt early.
Thanks, Jake. "Debugging into analysis, not archaeology" is a great way to describe it.
That was exactly the goal. Once the decision trail is preserved, you're no longer trying to reconstruct the run from memory or assumptions. You can simply query what actually happened.
Really strong pattern here. The part that stands out is not the 71 lines, it is the shift in mental model: once every run becomes an append-only event stream, debugging stops being guesswork and turns into a queryable history. I also like that redaction and guard stops are treated as first-class events, because that is what makes observability feel trustworthy instead of decorative. DuckDB is a sharp choice for this too since it keeps the whole workflow local, cheap, and easy to inspect without adding a heavy stack. This feels like a very practical baseline for anyone shipping tool-using agents, especially before the failures start costing real money.
Thanks, Emma. I really like your point about observability being trustworthy instead of decorative.
That was one of the reasons I treated things like guard stops and redaction as events rather than side notes. If the goal is to understand what actually happened during a run, those decisions should be part of the record too. And yes, keeping everything local with DuckDB was a deliberate choice. I wanted something simple enough to adopt before the failures become expensive.
The "part of the record" idea is what clicked for me too. Once guard stops, redactions, and tool decisions are all queryable events, you can start asking much richer questions about agent behavior instead of reconstructing runs from logs after the fact.
Flattening the critical fields (tool_name, turn_id, parent_event_id, latency_ms, tokens_in/out) into top-level columns at write time saves a lot of json_extract gymnastics in DuckDB later. First cross-day groupby is when you notice.
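A small sketch of flattening at write time. The column names follow the comment above; the `flatten_event` helper and the nested `payload` key for everything else are assumptions.

```python
import json

# Fields promoted to top-level columns so SQL can read them directly.
TOP_LEVEL = ("tool_name", "turn_id", "parent_event_id", "latency_ms", "tokens_in", "tokens_out")


def flatten_event(event_type: str, payload: dict) -> str:
    """Return one JSON line with critical fields on top and the rest nested."""
    row = {"event": event_type}
    rest = dict(payload)  # copy so the caller's dict is not mutated
    for key in TOP_LEVEL:
        # Missing fields become null, which keeps the column set stable.
        row[key] = rest.pop(key, None)
    row["payload"] = rest
    return json.dumps(row, sort_keys=True)
```

Writing every row with the same top-level keys keeps the inferred schema stable across files, which matters once queries span several days of traces.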
Loop detection is where this gets messy. Same tool_name with near-identical args can be either a real retry or actual progress when upstream context changed. A cheap hack that works: hash (tool_name, normalized_args, context_digest) per call, count collisions per turn window. False-positives on legitimate polling drop a lot.
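A sketch of the fingerprint-and-window idea above. The normalization rule (trim and lowercase string values), the window size, and the threshold are assumptions; `WindowedLoopDetector` is a hypothetical name.

```python
import hashlib
import json
from collections import Counter, deque


def normalize_args(args: dict) -> dict:
    """Assumed normalization: trim and lowercase string values."""
    return {k: v.strip().lower() if isinstance(v, str) else v for k, v in args.items()}


def fingerprint(tool_name: str, args: dict, context_digest: str) -> str:
    """Hash of (tool_name, normalized_args, context_digest)."""
    payload = json.dumps([tool_name, normalize_args(args), context_digest],
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class WindowedLoopDetector:
    """Counts fingerprint collisions inside a sliding window of recent calls."""

    def __init__(self, window: int = 5, threshold: int = 3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, tool_name: str, args: dict, context_digest: str) -> bool:
        """Return True when the latest fingerprint appears at least threshold times in the window."""
        fp = fingerprint(tool_name, args, context_digest)
        self.recent.append(fp)
        return Counter(self.recent)[fp] >= self.threshold
```

Because the context digest is part of the fingerprint, a poller whose upstream context changes between calls produces distinct fingerprints and never collides with itself.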
Also, sanitizing tool inputs is the obvious case, but tool outputs are where most agent traces leak secrets. The function-result branch is the one that catches people.
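A minimal sketch of running the same redaction over tool outputs before they reach the trace. The key names, the two regex patterns, and the `tool_end_event` helper are illustrative assumptions; a real deployment would need patterns matched to its own secrets.

```python
import re

# Hypothetical patterns; extend for the credentials your tools can return.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{8,}"),
    re.compile(r"(?i)bearer\s+[a-z0-9._-]+"),
]
SENSITIVE_KEYS = {"password", "api_key", "token", "authorization"}


def sanitize(value):
    """Recursively redact sensitive keys and secret-looking strings."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else sanitize(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [sanitize(v) for v in value]
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    return value


def tool_end_event(tool_name: str, output) -> dict:
    """Build a tool_end event whose output has been sanitized first."""
    return {"event": "tool_end", "tool_name": tool_name, "output": sanitize(output)}
```

The point of routing outputs through the same function as inputs is that there is only one place to fix when a new secret format turns up.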
Those are great points, Abdullah.
I especially agree about tool outputs. Most people think about sanitizing inputs, but outputs are often where sensitive data quietly ends up in traces.
The context_digest idea is interesting too. One thing I ran into was that a simple retry count doesn't tell you whether the agent is stuck or actually making progress. Factoring context into the fingerprint seems like a practical way to separate the two without adding much complexity.
You've definitely given me a few ideas for a future iteration of the black box.
This is such a creative approach — using DuckDB as a debugging query layer is something I haven't seen before. The $200 crash point is painfully relatable. One pattern I've found helpful is logging the full request/response for every LLM call (model, prompt, tokens, latency, error) to a SQLite db. It turns "mysterious crash" into "I can see exactly which model+prompt combo caused it." Nice to see someone pushing the debugging workflow forward!
Thanks, Felix. The $200 crash was definitely the moment that convinced me I needed something more than traditional logs. 😅
I like your SQLite approach too. Being able to trace issues back to a specific model, prompt, and response combination is incredibly valuable. In the end, I think the common theme is making agent behavior inspectable instead of trying to debug from the final output alone.
The way you integrated a compact “black box” into your Python agent and then leveraged DuckDB for querying a large crash dataset is really interesting. I appreciate how you balanced minimal code complexity with practical functionality, especially using only 71 lines to achieve what would usually require a more extensive pipeline. One point I found particularly clever was treating the crash dataset as an analytical layer rather than just raw logs, which opens opportunities for near real-time insights. It would be interesting to see how this approach scales when the dataset grows beyond the 200 records—do you think performance will hold, or would you consider chunking or indexing strategies?
I really like how you described it as an analytical layer rather than just logs. That was exactly the mindset behind using DuckDB.
As for scale, I think DuckDB would comfortably handle far more than what I showed in the article. If traces grew significantly, I'd probably look at partitioning or archiving older events first, while keeping the event structure unchanged. The nice part is that the tracing approach stays simple even as the storage strategy evolves.
The 71-line constraint is clever, but the column I'd add to that trace is cost. Knowing which tool call consumed how many tokens per step turns a debugging tool into a budget tool. The $200 crash gets a root cause and a price tag per decision.
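A sketch of attaching a price to each tool call. The per-token prices are placeholders, not real rates, and the event shape (`tool_name`, `tokens_in`, `tokens_out`) is an assumption about what the trace records.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens; substitute your model's rates.
PRICE_PER_1K = {"in": 0.003, "out": 0.015}


def event_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost of one call given its input and output token counts."""
    return tokens_in / 1000 * PRICE_PER_1K["in"] + tokens_out / 1000 * PRICE_PER_1K["out"]


def cost_by_tool(events) -> dict:
    """Sum cost per tool_name across a run."""
    totals = defaultdict(float)
    for e in events:
        totals[e["tool_name"]] += event_cost(e["tokens_in"], e["tokens_out"])
    return dict(totals)
```

Storing the computed cost on the event itself, rather than deriving it later, also protects the numbers from later price changes.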
A 71-line black box that lets you query the crash with DuckDB afterward is a lovely example of the highest-ROI move in agent reliability: making the run inspectable after the fact. Agents fail in ways logs don't capture well. The interesting question is never just what threw; it's what the state was when it went wrong. Structured, queryable event capture turns a vague "it broke" into "select what happened around the failure." The DuckDB angle is the clever bit, because it means the trace isn't just readable, it's analyzable: you can aggregate across many runs (which tool fails most, where tokens get burned, what precedes the bad outputs) instead of squinting at one log at a time, which is exactly how you go from anecdote to pattern. The thing I like most is the 71 lines. Observability for agents has a reputation for needing a heavyweight platform, but a tiny structured event log you own often beats a vendor dashboard because you can query it however the incident demands. Capture structured events cheaply, then let SQL ask the questions you didn't anticipate. That make-the-run-queryable instinct is core to how I think about agent debugging in Moonshift. Are you logging one event per tool call, or finer-grained, capturing the model's inputs/outputs at each step so you can reconstruct the decision too?
That's a great way to put it. The shift from "what failed?" to "what was happening when it failed?" was exactly what pushed me toward building the black box in the first place.
I also agree with your point about moving from anecdotes to patterns. Looking at a single failed run is useful, but being able to ask questions across many runs is where things get interesting. That's where DuckDB ended up providing far more value than I expected.
For the current version, I'm logging more than just tool calls. Each run captures lifecycle events, tool starts and ends, errors, timing, guard checks, and the associated inputs and outputs after sanitization. The goal is to reconstruct enough of the execution path to understand not only what the agent did, but why it ended up there.
What I'm not fully capturing yet is a richer view of the model's internal decision process between steps. That's probably the next layer I want to explore because, as you mentioned, the really interesting failures often happen before the tool error appears.
I'd be curious to hear how you're approaching this in Moonshift. Are you storing the reasoning trail as structured events too, or focusing primarily on tool and state transitions?
What stood out to me is that the black box is not really an observability layer, it's a change in how we model agent failures. Most teams still treat an agent as a prompt that occasionally calls tools, so when something goes wrong they inspect the final output. Your approach treats the entire run as an execution history that can be queried later.
I also like the decision to use JSONL + DuckDB instead of introducing a heavier telemetry stack. There is a sweet spot between print debugging and full distributed tracing, and many agent projects probably live there. The append-only design means the trace survives the very failures you're trying to investigate.
One thing I'd be curious about: have you considered recording parent/child relationships between events? As agents become multi-agent systems or start spawning parallel tool calls, reconstructing causality becomes harder than identifying individual failures. A simple event graph could make the DuckDB queries even more powerful without adding much complexity.
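A sketch of how parent/child links could be used once recorded. It assumes each event carries an `event_id` and an optional `parent_event_id`; both field names and the `lineage` helper are hypothetical.

```python
def lineage(events, event_id):
    """Return the chain of events from the root down to event_id."""
    by_id = {e["event_id"]: e for e in events}
    chain = []
    seen = set()  # guards against accidental cycles in the data
    current = by_id.get(event_id)
    while current is not None and current["event_id"] not in seen:
        seen.add(current["event_id"])
        chain.append(current)
        # Root events have no parent, so the lookup returns None and the loop ends.
        current = by_id.get(current.get("parent_event_id"))
    return list(reversed(chain))
```

Walking up from a failed event gives the causal path through parallel tool calls, which a flat timestamp ordering cannot show.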
The bigger lesson here is that cost overruns and hallucinations are often symptoms, not root causes. Once you can reconstruct the execution path, the conversation shifts from "the model did something weird" to "this exact decision chain produced this outcome." That is a much more useful place to debug from.
Thanks, Jordan. I think you captured the core idea perfectly. The goal was to stop treating failures as isolated outputs and start treating them as outcomes of a traceable execution path.
I also like your point about parent/child relationships. The current version is intentionally simple, but event lineage becomes much more important once you introduce parallel tools or multiple agents. That's definitely an interesting direction for extending the black box without losing its lightweight nature.
And I completely agree with the last point. In many cases, the bad output is just the final symptom. The real value comes from being able to trace back and find the exact decision that set the run on the wrong path.
The biggest win is not the 71 lines, it is the shift from postmortem guessing to a queryable execution record. Once every tool call, guard check, and failure is encoded as an event, debugging stops being “the model probably drifted” and becomes a concrete investigation into where the chain broke. I also like that you kept it local and lightweight instead of reaching for a heavy observability stack. That makes the pattern much easier to adopt in real projects, especially for agents where the expensive part is usually not the bug itself, but the time spent reconstructing the path that led to it.
Exactly. The goal wasn't really the 71 lines, it was making the run inspectable. Once the execution becomes queryable, debugging shifts from theories to evidence.
I also wanted something lightweight enough that people could adopt without adding another platform to their stack. Thanks for the thoughtful insight.
The interesting angle here is that this feels less like logging and more like observability for agents. Most people inspect the final output when something breaks, but the real issue is often hidden in tool selection, retries, or context flow. Using DuckDB to query failures makes debugging feel much more structured. Curious if you’ve experimented with tracking the “reason” behind tool choices too, because wrong decisions can be harder to catch than obvious crashes.
That's exactly how I see it too. A lot of agent failures don't show up as crashes, they show up as reasonable-looking decisions that were based on the wrong context or tool choice.
I haven't experimented much with recording the reasoning behind tool selection yet, but I think that's a really interesting direction. In many cases, understanding why a tool was chosen could be more valuable than knowing that it failed.