S M Tahosin

Posted on May 31

I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB

#python #ai #debugging #tutorial

Reconstructing the agent decision path

The incident started with a boring support automation task.

Take a user request, search a private document index, summarize the answer, and hand the result to a reviewer. Nothing heroic. The kind of Python agent you build when the demo is over and the real workflow begins.

Then one run got stuck in a retry loop.

It did not burn $200 before I caught it. The actual test run was cheaper. The problem was the projection: same bad loop, same document search, same model calls, left inside the overnight batch. The estimate landed close to $200 for one avoidable failure.

The answer it produced looked polished enough to pass a sleepy review. The trace behind it was not polished at all. The agent had called the right tool with the wrong input, retried against stale context, summarized old results, and kept paying for each turn.

That is when I stopped treating the agent like a chat feature.

I started treating it like a system that needs a black box.

Not a dashboard. Not a full observability stack. Not another hosted service.

Just one local file that can answer:

What did the agent try?
Which tool did it call?
What input did the tool receive?
Did the tool fail?
How long did it take?
Did the run cross a cost or turn limit?
Can I query the run after everything is over?

We will build that black box in plain Python, then use DuckDB to inspect it like a tiny crash database.

Before And After

Before the fix, debugging looked like this:

The final answer is wrong.
The model probably hallucinated.
Maybe the search tool returned bad data.
Maybe the retry loop reused an old message.
Maybe the cost spike came from the model call.

That is not debugging. That is guessing with syntax highlighting.

After the fix, debugging looked like this:

Turn 1 called search_docs with the wrong query.
The tool timed out after 147.82 ms.
The retry used stale context.
The guard stopped the run at $0.0124.
DuckDB shows one tool_error and one guard_stop.

Same bug. Very different day.

The Shape Of The Problem

A normal Python script usually fails in one place.

An agent fails across a chain.

User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer

If you only log the final answer, you have a diary entry.

If you record the chain, you have evidence.

The simplest useful format is JSONL. One event per line.

{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}

JSONL is boring in exactly the right way. It appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
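The crash-survival point is easiest to see from the reader's side. Here is a minimal sketch, assuming a trace file where the process died mid-write and left a truncated final line. The `read_events` helper is hypothetical and not part of the recorder in this article:

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def read_events(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one dict per valid JSONL line, skipping lines that fail to parse."""
    with path.open(encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # A half-written line costs one event, not the whole file.
                continue


# Simulate a crash: two complete events, then a truncated third.
trace = Path(tempfile.mkdtemp()) / "run.jsonl"
trace.write_text(
    '{"type":"tool_start","tool":"search_docs"}\n'
    '{"type":"tool_end","tool":"search_docs","ok":true}\n'
    '{"type":"turn_end","tur',
    encoding="utf-8",
)

events = list(read_events(trace))
print([event["type"] for event in events])
```

With one large JSON document, the same truncation would make the whole file unparseable. Here only the last event is lost.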

A Small Recorder That Does Real Work

Here is the recorder.

It does four things:

gives every run a unique id
writes append-only JSONL events
measures tool duration
sanitizes obvious secrets before writing anything to disk

from __future__ import annotations

import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4


SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            if SECRET_KEYS.search(str(key)):
                cleaned[key] = "[redacted]"
            else:
                cleaned[key] = sanitize(item)
        return cleaned

    if isinstance(value, list):
        return [sanitize(item) for item in value]

    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )

        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)

        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )

The sanitize() function is not perfect. It is a seatbelt, not a vault.

Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.
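One gap worth knowing about: the key-name check only fires when the secret sits under a suspicious key. Below is a hedged sketch of a second pass that also scrubs string values. The `SECRET_VALUES` pattern and the `scrub_values` helper are assumptions for illustration, and the token shapes they match (an `sk-` prefix, a `Bearer` header) are examples rather than a complete list:

```python
import re
from typing import Any

# Hypothetical value patterns: "sk-" style keys and Bearer tokens.
SECRET_VALUES = re.compile(r"(\bsk-[A-Za-z0-9_-]{8,}|Bearer\s+[A-Za-z0-9._-]+)")


def scrub_values(value: Any) -> Any:
    """Recursively replace secret-looking substrings inside string values."""
    if isinstance(value, str):
        return SECRET_VALUES.sub("[redacted]", value)
    if isinstance(value, dict):
        return {key: scrub_values(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub_values(item) for item in value]
    return value


payload = {
    "query": "why does sk-not-a-real-key fail?",
    "headers": ["Accept: */*", "X-Auth: Bearer abc.def.ghi"],
    "retries": 2,
}
print(scrub_values(payload))
```

Neither `query` nor `headers` would match the key-name pattern, so the key-based pass alone would write both secrets to disk. Value patterns will produce false positives and misses of their own, so this is still a seatbelt.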

Wrap One Tool First

Start with one tool. Do not instrument everything on day one.

import random
import time


def search_docs(query: str, api_key: str) -> list[str]:
    time.sleep(random.uniform(0.05, 0.2))

    if "timeout" in query:
        raise TimeoutError("Document search timed out")

    return [
        "JSONL works well for append-only traces.",
        "Context managers are useful around tool calls.",
        "DuckDB can query JSON files without a server.",
    ]

Now record the call:

box = AgentBlackBox("traces/run.jsonl")

query = "python agent trace format"

with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
    docs = search_docs(query=query, api_key="sk-not-a-real-key")

box.record("tool_result", tool="search_docs", result_count=len(docs))

Open traces/run.jsonl and look at the data field of the tool_start event. The key is redacted.

{"tool":"search_docs","input":{"query":"python agent trace format","api_key":"[redacted]"}}

That tiny detail matters. Debugging should not create a second incident.

Add A Cheap Run Guard

Most runaway agent stories start with a loop that looked harmless.

So the black box should not only record what happened. It should record when it refused to continue.

class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )

    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")

    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")

This is not exact billing. The example loop approximates tokens by counting whitespace-separated words, so use your provider's response for real token counts when you have them.

The goal here is a local tripwire. You want the run to leave a clear reason when it stops.
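Turn and budget limits catch a loop late. A complementary tripwire, suggested in the comments on this post, is to fingerprint each tool input and stop when the same fingerprint keeps repeating. The sketch below is one hedged way to do that. `RepeatGuard`, `max_repeats`, and the choice of SHA-256 over canonical JSON are assumptions for illustration, not part of the guard above:

```python
import hashlib
import json
from collections import Counter
from typing import Any


def input_hash(tool: str, tool_input: dict[str, Any]) -> str:
    """Short stable fingerprint: same tool and same args give the same hash."""
    canonical = json.dumps(
        {"tool": tool, "input": tool_input}, sort_keys=True, default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class RepeatGuard:
    """Counts identical tool calls within one run and flags the loop early."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, tool: str, tool_input: dict[str, Any]) -> str:
        digest = input_hash(tool, tool_input)
        self.seen[digest] += 1
        if self.seen[digest] > self.max_repeats:
            raise RuntimeError(
                f"{tool} repeated input {digest} {self.seen[digest]} times"
            )
        return digest


guard = RepeatGuard(max_repeats=2)
guard.check("search_docs", {"query": "rate limits"})
guard.check("search_docs", {"query": "rate limits"})
try:
    guard.check("search_docs", {"query": "rate limits"})
except RuntimeError as exc:
    print("stopped:", exc)
```

The returned digest could also be written with box.record(...) so repeated inputs stay queryable after the run. Identical input is not always a bug, since legitimate polling looks the same, so treat the threshold as something to tune.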

A Tiny Agent Loop

This fake loop keeps the moving parts small.

Replace the pretend model section with your real model call.

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01

    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

    for turn in range(1, max_turns + 1):
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )

        box.record("turn_start", turn=turn, message_count=len(messages))

        # Pretend the model picked this tool input.
        query = question if turn == 1 else "python jsonl duckdb traces"

        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")

        messages.append({"role": "tool", "content": "\n".join(docs)})

        turn_cost = estimate_cost(
            input_tokens=sum(len(message["content"].split()) for message in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost

        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer

Run it once with a normal question.

print(run_agent("How should I debug Python agent tools?"))

Then run it with a bad one.

print(run_agent("timeout during document search"))

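The second run raises TimeoutError, because the recorder re-raises the exception after writing the tool_error event. It still fails, but now it fails with a trail.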

To force a budget stop for testing, temporarily set max_usd = 0.0001. The next guard check will write a guard_stop event instead of letting the loop continue quietly.
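To see why 0.0001 is low enough to trip the guard, it helps to run the estimate by hand. This sketch repeats the estimate_cost formula and the three canned search results from above, and assumes the first example question. Everything else is arithmetic:

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Same illustrative rates as the article's estimate_cost.
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


question = "How should I debug Python agent tools?"
docs = [
    "JSONL works well for append-only traces.",
    "Context managers are useful around tool calls.",
    "DuckDB can query JSON files without a server.",
]

messages = [question]
spent_usd = 0.0
costs = []  # (turn, word_count, turn_cost, running_total)
for turn in range(1, 4):
    messages.append("\n".join(docs))  # each turn appends one tool message
    words = sum(len(message.split()) for message in messages)
    turn_cost = estimate_cost(input_tokens=words, output_tokens=120)
    spent_usd += turn_cost
    costs.append((turn, words, turn_cost, spent_usd))
    print(turn, words, f"{turn_cost:.6f}", f"{spent_usd:.6f}")
```

The first turn comes to roughly $0.000194 (28 words in, 120 tokens out), which already exceeds 0.0001, so the guard check at the start of turn 2 stops the run. With the default max_usd = 0.01, all three turns total about $0.0006 and the budget guard never fires.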

Query The Crash With DuckDB

This is the part that makes JSONL feel less like logging and more like a debugging tool.

Install DuckDB:

pip install duckdb

Then query the trace:

import duckdb


def query_trace(path: str = "traces/run.jsonl") -> None:
    con = duckdb.connect()

    con.sql(
        f"""
        create or replace view events as
        select *
        from read_json_auto('{path}');
        """
    )

    print("Event counts")
    con.sql(
        """
        select type, count(*) as events
        from events
        group by type
        order by events desc;
        """
    ).show()

    print("Tool errors")
    con.sql(
        """
        select
            data.tool as tool,
            data.error_type as error_type,
            data.error as error,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_error';
        """
    ).show()

    print("Slow tools")
    con.sql(
        """
        select
            data.tool as tool,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_end'
        order by data.duration_ms desc
        limit 5;
        """
    ).show()

Now run:

query_trace()

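The payoff should look something like this. The table is abridged and illustrative, and your exact counts depend on which runs have been appended to the file: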

Event counts
+-------------+--------+
| type        | events |
+-------------+--------+
| guard_check |      4 |
| turn_start  |      3 |
| tool_start  |      3 |
| tool_end    |      2 |
| tool_error  |      1 |
| guard_stop  |      1 |
+-------------+--------+

And the crash row is now a query result, not a mystery:

Tool errors
+-------------+--------------+---------------------------+-------------+
| tool        | error_type   | error                     | duration_ms |
+-------------+--------------+---------------------------+-------------+
| search_docs | TimeoutError | Document search timed out |      147.82 |
+-------------+--------------+---------------------------+-------------+

You can answer questions that normal print logs make annoying:

Which tools failed most often?
Which tool was slowest?
Which turn crossed the budget?
Did the same input fail repeatedly?
Did the guard stop the run, or did the tool crash first?

That is the upgrade.

Not "I have logs."

"I can interrogate the run."

What I Would Record In A Real Project

For a demo, the trace above is enough.

For a real project, I would add these fields:

model
provider
prompt_hash
tool_schema_version
input_tokens
output_tokens
finish_reason
retry_count
user_id_hash
environment

I would not record these by default:

raw access tokens
private documents
full customer prompts
full tool responses with sensitive data
cookies or request headers

The boring security rule is simple:

Record enough to debug behavior. Do not record enough to harm someone.
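For the hash fields in the first list, the idea is to record something stable enough to group by without storing the underlying text. A minimal sketch, assuming SHA-256 for prompt_hash and a keyed HMAC for user_id_hash. The helper names and the TRACE_HASH_KEY environment variable are hypothetical:

```python
import hashlib
import hmac
import os


def prompt_hash(prompt: str) -> str:
    """Fingerprint of the prompt text; identical prompts give identical hashes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def user_id_hash(user_id: str, key: bytes) -> str:
    """Keyed hash: without the key, hashing guessed ids does not reproduce it."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


# Hypothetical: the key comes from the environment, never from the trace file.
key = os.environ.get("TRACE_HASH_KEY", "dev-only-key").encode("utf-8")

fields = {
    "prompt_hash": prompt_hash("You are a support assistant."),
    "user_id_hash": user_id_hash("user-1234", key),
}
print(fields)
```

A plain unkeyed hash of a user id is easy to reverse when ids are short or sequential, which is why the sketch keys it. Truncating to 16 hex characters keeps rows readable at the cost of a small collision risk.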

The Pattern In One Sentence

Every agent run should produce a local, append-only event stream that is safe to keep, easy to query, and useful after the process crashes.

That sentence is less exciting than a new prompt trick.

It is also more likely to save your weekend.

Full File

Here is the complete example in one place.

from __future__ import annotations

import json
import random
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4


SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        return {
            key: "[redacted]" if SECRET_KEYS.search(str(key)) else sanitize(item)
            for key, item in value.items()
        }

    if isinstance(value, list):
        return [sanitize(item) for item in value]

    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )

        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)

        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )


class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )

    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")

    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")


def search_docs(query: str, api_key: str) -> list[str]:
    time.sleep(random.uniform(0.05, 0.2))

    if "timeout" in query:
        raise TimeoutError("Document search timed out")

    return [
        "JSONL works well for append-only traces.",
        "Context managers are useful around tool calls.",
        "DuckDB can query JSON files without a server.",
    ]


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01

    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

    for turn in range(1, max_turns + 1):
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )

        box.record("turn_start", turn=turn, message_count=len(messages))

        query = question if turn == 1 else "python jsonl duckdb traces"

        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")

        messages.append({"role": "tool", "content": "\n".join(docs)})

        turn_cost = estimate_cost(
            input_tokens=sum(len(message["content"].split()) for message in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost

        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer


if __name__ == "__main__":
    print(run_agent("How should I debug Python agent tools?"))

There is one line in that full file worth staring at:

box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

That line changes the posture of the program.

The run is no longer a private conversation with a model. It is a recorded execution with a trace you can inspect, query, and improve.

That is the difference between a demo and something you can trust.

What would you add next: prompt hashes, token counts, screenshots, checkpoints, or replayable tool fixtures?

Top comments (33)

Mudassir Khan • Jun 3

the "called the right tool with the wrong input, retried against stale context" description is the failure pattern hardest to catch. wrong input reuse is invisible in the final output.

worth adding: hash the tool input on each tool_start. if the same tool fires with an identical input hash on consecutive turns, that is a retry loop signal before the guard triggers. caught this in a document QA agent — same query string across 4 turns, model summarizing the same chunk each time with full confidence.

does the guard check input repetition, or is it purely turn count and spend?

S M Tahosin • Jun 3

That's a great suggestion. I like the idea of tracking input hashes because, as you said, retry loops are often invisible if you're only looking at the final output.

In the version from the article, the guard is intentionally simple and only watches turn count and cost. It doesn't currently check for repeated inputs or repeated tool patterns. But I can definitely see value in adding that layer, especially for catching "same action, same context, same result" loops before they become expensive.

The document QA example is a perfect illustration of the kind of failure that a basic budget guard won't catch early enough.

Mudassir Khan • Jun 7

yeah, simple and inspectable is the right call for v1. shipping the basic guard first makes sense.

the hash approach is cheap — short md5 of serialized args, stored in agent_state alongside the turn counter. two lines. the real win: makes DuckDB queries interesting. you can group by input_hash, count turns, and surface repeated failure patterns across sessions not just within one run.

have you queried across multiple agent runs yet, or just within a single session so far?

Self-Correcting Systems • May 31

This is a strong pattern.

The part I like most is that the trace is not just for debugging exceptions. It lets you reconstruct the decision path after the fact: what the agent saw, what tool it called, what input it used, what failed, and where the guard stopped it.

That matters because agent failures are rarely single-point failures. They are usually chain failures: stale context → wrong tool input → retry → old result summarized confidently → cost leak.

I’ve been testing a neighboring problem around agent memory: relevant context is not always authoritative context. So one thing I would want in a black-box trace is not only “what memory/context was used?” but also “what made that context allowed to govern the next action?”

For example, I’d add fields like:

context_source
context_status (active, stale, superseded, provisional)
action_type (read, write, execute)
governing_rule
verification_required

Then after a crash, DuckDB could answer questions like:

Did the agent act from stale context?
Did a provisional memory govern an execute action?
Did a verify-first rule get skipped?
Which tool calls happened after the budget or confidence guard should have stopped the run?

That would connect observability with authority, not just observability with failure.

Really useful article. The “query the run after everything is over” framing is exactly the right direction.

S M Tahosin • Jun 1

I really like the distinction you're making between observability and authority.

One thing that became clear while building this was that many agent failures don't start where they become visible. By the time you see the bad tool call or the budget overrun, the actual mistake may have happened several steps earlier when the agent accepted a piece of context that it shouldn't have trusted.

Your proposed fields are interesting because they move the trace from "what happened?" toward "why was this allowed to happen?" That's a much harder question, and probably the one that matters most as agents start relying more heavily on memory and long-running state.

The idea of tracking context status and governing rules especially stands out to me. Being able to ask "which actions were influenced by stale or provisional context?" would expose an entire class of failures that basic logging completely misses.

I also like your example queries. They feel very similar to the transition from debugging software failures to auditing decision systems. At that point the trace becomes more than a reliability tool. It becomes a way to inspect authority flow through the run.

Definitely gave me a few ideas for a future version of the black box. Thanks for the thoughtful comment.

Self-Correcting Systems • Jun 1

Exactly, “where it becomes visible” and “where it became allowed” are two different points in the run.

That distinction is the part I keep circling back to. A trace that only records the final bad tool call can tell you what broke, but it may not tell you which memory, rule, assumption, or stale context gave the agent permission to move in that direction.

That is where observability starts becoming authority inspection.

The useful trace fields are not only:

tool called
input used
duration
error
cost

but also:

which context influenced this action
what status that context had
what rule governed the tool call
whether a higher-authority rule was skipped
whether the action should have verified before executing

That would let you query failures backward from the action into the authority path.

Something like:

“Show me every write action influenced by provisional context.”

or:

“Show me tool calls where stale memory appeared in the decision path.”

That is the kind of black box I think agents need next. Not just a record of execution, but a record of why execution was permitted.

Your article already has the right foundation for that because JSONL gives the run a durable spine. Once authority metadata gets attached to those events, the trace becomes much more than debugging. It becomes a decision audit.

S M Tahosin • Jun 1

That's a really interesting way to frame it: not just "what happened?" but "what gave the agent permission to do it?"

The more I think about it, the more I agree that authority metadata could reveal an entire class of failures that normal traces miss. A bad action is often just the end of a much longer chain of accepted assumptions.

I especially like the idea of querying authority paths the same way we query execution paths. That starts moving the black box from debugging toward decision auditing, which feels like a natural next step for more capable agents.

Self-Correcting Systems • Jun 1

Yes, that “accepted assumptions” phrase is exactly the thing.

A lot of agent failures do not begin at the visible action. The bad tool call is just where the chain finally becomes observable.

Before that, the system may have already accepted:

this memory is current
this note can govern
this policy applies to this scope
this tool is allowed under this context
this action does not need verification

If none of that authority path is recorded, the trace can tell us what happened but not why the system believed it was permitted.

That is why I like the idea of treating authority as first-class trace data.

Execution path:

agent called tool X with input Y

Authority path:

tool X was allowed because memory A was active, policy B governed the action, gate C passed, and no higher-authority rule blocked it

Once that exists, you can ask much better post-run questions:

which actions were governed by stale context?
which writes happened without a live source check?
which tool calls relied on provisional memory?
which policy admitted the action?
which authority layer was skipped?

That is the part that starts turning a black box into a decision audit.

The run trace should not only preserve what the agent did. It should preserve what the agent thought it was allowed to do.

S M Tahosin • Jun 1

I think you're getting at something really important here. We spend a lot of time tracing actions, but much less time tracing the assumptions that authorized those actions.

The distinction between execution history and authority history is becoming more interesting to me the more I think about it. If an agent can explain not only what it did, but which memory, policy, or verification path allowed it to do it, post-run analysis becomes much more powerful.

At that point, we're not just debugging behavior. We're auditing decisions.

Self-Correcting Systems • Jun 1

Yes, execution history and authority history is the right split.

Execution history answers:

what did the agent do?

Authority history answers:

what did the agent believe it was allowed to do, and why?

That second question is where a lot of the hidden failure chain lives.

A bad action may be perfectly traceable at the tool level:

called the tool
passed this input
got this result
spent this much
returned this answer

But the real failure may have happened earlier when the agent accepted the wrong assumption as governing context.

That is why I think future agent traces need to record things like:

which memory influenced the action
whether that memory was active, stale, provisional, or superseded
which policy admitted the action
whether a verification gate was required
whether that gate actually passed
whether a higher-authority rule was skipped

Then the trace stops being only a replay of behavior and becomes a replay of permission.

That is the part that makes post-run analysis much stronger.

The agent should not only be able to say:

I did X.

It should be able to show:

I did X because rule Y was active, memory Z was allowed to govern, and gate A passed.

That is decision auditing.

NOVAInetwork • Jun 2

"That is not debugging. That is guessing with syntax highlighting" is the line that lands. The whole post is the working version of that distinction.

The DuckDB-over-JSONL move is the right shape for the single-process case because it inverts the typical observability tradeoff: most teams pay for a hosted stack to get queryability, but for one agent in one process, append-only JSONL plus a free SQL engine gets you 80% of the forensic value without a vendor. The 71-line constraint is what makes it shippable instead of yet another half-built observability platform.

One extension worth considering: the schema you've got captures WHAT happened (tool_start, tool_end, tool_error, guard_check) but not WHY the agent chose that tool with that input. The model's reasoning chain (which memory was retrieved, which policy was checked, which prior turn the decision was conditioned on) is the layer below your current trace. Most "the agent hallucinated" post-mortems hit a wall at exactly that gap: you can see the call, you can't see the deliberation.

Adding a tool_selection event before tool_start, with the retrieved context hash and the policy snapshot the agent was operating under, gives you a deliberation trace alongside the execution trace. Still 71 lines of recorder code; the schema does the work.

The provenance question gets harder when you cross process boundaries: multi-agent coordination, retries that span sessions, model versions changing under you. That's where the local-file model starts to break and you need either a content-addressed event store or something stronger. Different problem though. For the single-agent case you're describing, the JSONL+DuckDB pattern is correct.

Building toward the cross-process version on the protocol side at NOVAI. Same forensic question, different trust assumptions when there's no single process to own the log file. The local case you're solving is the right starting point.

Good post. The constraint is the contribution.

S M Tahosin • Jun 3

Thanks. I really like the distinction you're making between execution traces and deliberation traces.

You're right that many investigations eventually hit the "I can see what happened, but not why it was chosen" wall. A tool_selection event with context or policy metadata would be a natural extension of the current approach without fundamentally changing the design.

And I agree about the scope. The article is very much focused on the single-agent, single-process case. Once you move into multi-agent systems and cross-process coordination, provenance becomes a much harder problem. But as you said, the local case feels like the right place to start before tackling the distributed one.

Jake Sullivan • Jun 1

Really strong piece. What stands out is that you are not just logging failures, you are preserving the decision trail that caused them. That is the difference between guessing at a bad output and actually isolating where stale context, a wrong tool input, or a retry loop changed the run. The DuckDB part is especially good because it turns debugging into analysis, not archaeology. This is exactly the kind of pattern more agent systems should adopt early.

S M Tahosin • Jun 1

Thanks, Jake. "Debugging into analysis, not archaeology" is a great way to describe it.

That was exactly the goal. Once the decision trail is preserved, you're no longer trying to reconstruct the run from memory or assumptions. You can simply query what actually happened.

Emma Sofia • Jun 1

Really strong pattern here. The part that stands out is not the 71 lines, it is the shift in mental model: once every run becomes an append-only event stream, debugging stops being guesswork and turns into a queryable history. I also like that redaction and guard stops are treated as first-class events, because that is what makes observability feel trustworthy instead of decorative. DuckDB is a sharp choice for this too since it keeps the whole workflow local, cheap, and easy to inspect without adding a heavy stack. This feels like a very practical baseline for anyone shipping tool-using agents, especially before the failures start costing real money.

S M Tahosin • Jun 1

Thanks, Emma. I really like your point about observability being trustworthy instead of decorative.

That was one of the reasons I treated things like guard stops and redaction as events rather than side notes. If the goal is to understand what actually happened during a run, those decisions should be part of the record too. And yes, keeping everything local with DuckDB was a deliberate choice. I wanted something simple enough to adopt before the failures become expensive.

Emma Sofia • Jun 1

The "part of the record" idea is what clicked for me too. Once guard stops, redactions, and tool decisions are all queryable events, you can start asking much richer questions about agent behavior instead of reconstructing runs from logs after the fact.

Abdullah Shahin • Jun 3

Flattening the critical fields (tool_name, turn_id, parent_event_id, latency_ms, tokens_in/out) into top-level columns at write time saves a lot of json_extract gymnastics in DuckDB later. First cross-day groupby is when you notice.
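A minimal sketch of that write-time flattening, assuming each event keeps its details in a nested `payload` dict. That shape is a guess for illustration, not the article's actual schema, and `flatten_event` and `to_jsonl_line` are hypothetical helper names. The promoted columns are the ones listed above.

```python
import json

# Fields promoted to top-level columns, as named in the comment above.
TOP_LEVEL = ("tool_name", "turn_id", "parent_event_id",
             "latency_ms", "tokens_in", "tokens_out")


def flatten_event(event: dict) -> dict:
    """Copy critical fields out of the (assumed) nested payload into top-level keys."""
    payload = event.get("payload", {})
    row = {k: v for k, v in event.items() if k != "payload"}
    for col in TOP_LEVEL:
        # Missing fields become None so every row carries the same columns.
        row[col] = payload.get(col, event.get(col))
    # Everything that was not promoted stays nested.
    row["payload"] = {k: v for k, v in payload.items() if k not in TOP_LEVEL}
    return row


def to_jsonl_line(event: dict) -> str:
    """Serialize one flattened event as a single JSONL line."""
    return json.dumps(flatten_event(event), sort_keys=True)
```

With every row sharing the same top-level keys, a query can select `tool_name` or `latency_ms` directly instead of extracting them from JSON on each read.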

Loop detection is where this gets messy. Same tool_name with near-identical args can be either a real retry or actual progress when upstream context changed. A cheap hack that works: hash (tool_name, normalized_args, context_digest) per call, count collisions per turn window. False-positives on legitimate polling drop a lot.
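One way the fingerprint idea could look, as a rough sketch. `call_fingerprint` and `LoopDetector` are hypothetical names, the normalization is deliberately naive (trim and lowercase string arguments), and the window counts recent calls rather than turns to keep the example short.

```python
import hashlib
import json
from collections import Counter, deque


def call_fingerprint(tool_name: str, args: dict, context_digest: str) -> str:
    """Hash (tool_name, normalized_args, context_digest) into one hex digest."""
    # Naive normalization: trim and lowercase string values; keys are sorted below.
    normalized = {
        k: (v.strip().lower() if isinstance(v, str) else v)
        for k, v in args.items()
    }
    blob = json.dumps([tool_name, normalized, context_digest],
                      sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class LoopDetector:
    """Flags a call whose fingerprint repeats too often within a sliding window."""

    def __init__(self, window: int = 5, max_repeats: int = 3):
        self.recent = deque(maxlen=window)  # fingerprints of the last `window` calls
        self.max_repeats = max_repeats

    def observe(self, tool_name: str, args: dict, context_digest: str) -> bool:
        """Record a call; return True when it looks like a stuck loop."""
        fp = call_fingerprint(tool_name, args, context_digest)
        self.recent.append(fp)
        return Counter(self.recent)[fp] >= self.max_repeats
```

Because the context digest is part of the hash, a polling call whose upstream context changes between attempts produces a new fingerprint each time and is not counted as a repeat.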

Also, sanitizing tool inputs is the obvious case, but tool outputs are where most agent traces leak secrets. The function-result branch is the one that catches people.
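A small sketch of treating outputs the same way as inputs. The key names in `SECRET_KEYS`, the bearer-token pattern, and the `traced_call` wrapper are all illustrative assumptions, not the article's sanitizer.

```python
import re

# Hypothetical list of sensitive key names; extend it for your own tools.
SECRET_KEYS = {"api_key", "token", "password", "authorization", "secret"}
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._\-]+")


def redact(value):
    """Recursively mask sensitive keys and bearer tokens in nested data."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return BEARER.sub("Bearer [REDACTED]", value)
    return value


def traced_call(tool, tool_input, record):
    """Run a tool and record sanitized copies of both its input and its output."""
    output = tool(tool_input)
    record({
        "event": "tool_end",
        "input": redact(tool_input),
        "output": redact(output),  # the function-result branch is sanitized too
    })
    return output  # the caller still receives the unredacted result
```

Only the recorded copy is redacted, so the agent keeps working with the real tool result while the trace stays safe to share.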

S M Tahosin • Jun 3

Those are great points, Abdullah.

I especially agree about tool outputs. Most people think about sanitizing inputs, but outputs are often where sensitive data quietly ends up in traces.

The context_digest idea is interesting too. One thing I ran into was that a simple retry count doesn't tell you whether the agent is stuck or actually making progress. Factoring context into the fingerprint seems like a practical way to separate the two without adding much complexity.

You've definitely given me a few ideas for a future iteration of the black box.

Felix • Jun 1

This is such a creative approach — using DuckDB as a debugging query layer is something I haven't seen before. The $200 crash point is painfully relatable. One pattern I've found helpful is logging the full request/response for every LLM call (model, prompt, tokens, latency, error) to a SQLite db. It turns "mysterious crash" into "I can see exactly which model+prompt combo caused it." Nice to see someone pushing the debugging workflow forward!

S M Tahosin • Jun 1

Thanks, Felix. The $200 crash was definitely the moment that convinced me I needed something more than traditional logs. 😅

I like your SQLite approach too. Being able to trace issues back to a specific model, prompt, and response combination is incredibly valuable. In the end, I think the common theme is making agent behavior inspectable instead of trying to debug from the final output alone.

Elsie Nora • Jun 1

The way you integrated a compact “black box” into your Python agent and then leveraged DuckDB for querying a large crash dataset is really interesting. I appreciate how you balanced minimal code complexity with practical functionality, especially using only 71 lines to achieve what would usually require a more extensive pipeline. One point I found particularly clever was treating the crash dataset as an analytical layer rather than just raw logs, which opens opportunities for near real-time insights. It would be interesting to see how this approach scales when the dataset grows beyond the 200 records—do you think performance will hold, or would you consider chunking or indexing strategies?

S M Tahosin • Jun 1

I really like how you described it as an analytical layer rather than just logs. That was exactly the mindset behind using DuckDB.

As for scale, I think DuckDB would comfortably handle far more than what I showed in the article. If traces grew significantly, I'd probably look at partitioning or archiving older events first, while keeping the event structure unchanged. The nice part is that the tracing approach stays simple even as the storage strategy evolves.

Valentin Monteiro • Jun 4

The 71-line constraint is clever, but the column I'd add to that trace is cost. Knowing which tool call consumed how many tokens per step turns a debugging tool into a budget tool. The $200 crash gets a root cause and a price tag per decision.
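A rough sketch of what that cost column could feed, assuming each event already carries `tool_name`, `tokens_in`, and `tokens_out` for the model call that led to it. The prices below are placeholders, not any provider's real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def event_cost(tokens_in: int, tokens_out: int, prices: dict = PRICE_PER_1K) -> float:
    """Estimated cost of one step from its input and output token counts."""
    return tokens_in / 1000 * prices["input"] + tokens_out / 1000 * prices["output"]


def cost_by_tool(events: list) -> dict:
    """Sum the estimated cost of all events, grouped by tool name."""
    totals = defaultdict(float)
    for e in events:
        totals[e["tool_name"]] += event_cost(e.get("tokens_in", 0),
                                             e.get("tokens_out", 0))
    return dict(totals)
```

Storing the per-event cost at write time would let the same grouping run as a plain aggregate query later.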

Harjot Singh • May 31

A 71-line black box that lets you query the crash with DuckDB afterward is a lovely example of the highest-ROI move in agent reliability: making the run inspectable after the fact. Agents fail in ways logs don't capture well. The interesting question is never just "what threw?" but "what was the state when it went wrong?", and structured, queryable event capture turns a vague "it broke" into "select what happened around the failure." The DuckDB angle is the clever bit, because it means the trace isn't just readable, it's analyzable: you can aggregate across many runs (which tool fails most, where tokens get burned, what precedes the bad outputs) instead of squinting at one log at a time, which is exactly how you go from anecdote to pattern. The thing I like most is the 71 lines. Observability for agents has a reputation for needing a heavyweight platform, but a tiny structured event log you own often beats a vendor dashboard because you can query it however the incident demands. Capture structured events cheaply, then let SQL ask the questions you didn't anticipate. That make-the-run-queryable instinct is core to how I think about agent debugging in Moonshift. Are you logging one event per tool call, or something finer-grained that captures the model's inputs/outputs at each step so you can reconstruct the decision too?

S M Tahosin • Jun 1

That's a great way to put it. The shift from "what failed?" to "what was happening when it failed?" was exactly what pushed me toward building the black box in the first place.

I also agree with your point about moving from anecdotes to patterns. Looking at a single failed run is useful, but being able to ask questions across many runs is where things get interesting. That's where DuckDB ended up providing far more value than I expected.

For the current version, I'm logging more than just tool calls. Each run captures lifecycle events, tool starts and ends, errors, timing, guard checks, and the associated inputs and outputs after sanitization. The goal is to reconstruct enough of the execution path to understand not only what the agent did, but why it ended up there.

What I'm not fully capturing yet is a richer view of the model's internal decision process between steps. That's probably the next layer I want to explore because, as you mentioned, the really interesting failures often happen before the tool error appears.

I'd be curious to hear how you're approaching this in Moonshift. Are you storing the reasoning trail as structured events too, or focusing primarily on tool and state transitions?
