Mukunda Rao Katta

Posted on May 25

agent-event-bus-rs: Sync Pub/Sub for Rust AI Agent Events

#hermeschallenge #ai #rust #agents

The crash that lost the context

The agent had several handlers wired to its event loop. One logged each tool call. One tracked latency. One tallied cost. One checked each event against a budget threshold and raised an alert if the run was getting expensive.

The cost-tracking handler received an event with an unexpected shape. A new tool had been added to the agent and its output structure differed from what the cost-tracking handler expected. The handler panicked while indexing into the event data.

The panic propagated up through the synchronous handler dispatch. The agent process unwound. The context was gone. The partial results from the run were not flushed to storage before the unwind happened. The user got a 500.

The cost-tracking handler should never have been able to bring down the agent. It was a side-channel concern. Logging, metrics, alerting: none of those should be on the critical path of the agent's actual work.

agent-event-bus-rs separates those concerns and isolates the panics.

The shape of the fix

[dependencies]
agent-event-bus-rs = "0.1"
serde_json = "1"

Define and publish events:

use agent_event_bus_rs::{EventBus, Event};
use serde_json::json;

let mut bus = EventBus::new();

// Subscribe to a specific event type:
bus.subscribe("tool_called", |event: &Event| {
    println!("Tool called: {}", event.payload["tool_name"]);
});

// Subscribe with a wildcard:
bus.subscribe("*", |event: &Event| {
    println!("Any event: {} at {}", event.kind, event.timestamp_ms);
});

// Publish from your agent loop:
bus.publish(Event::new("tool_called", json!({
    "tool_name": "search",
    "args": { "query": "rust async runtimes" }
})));

Handlers run synchronously in the order they were registered. The caller blocks until all handlers for the event have run (or panicked and been caught).
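To make the dispatch order concrete, here is a minimal std-only sketch of ordered, synchronous dispatch with kind and wildcard matching. It is not the crate's implementation. MiniBus, its Event struct with a String payload, and the usize return value are hypothetical stand-ins, and there is no panic isolation in this version.

```rust
// Hypothetical stand-in types for illustration; not the crate's definitions.
pub struct Event {
    pub kind: String,
    pub payload: String,
}

type Handler = Box<dyn FnMut(&Event)>;

#[derive(Default)]
pub struct MiniBus {
    // (pattern, handler) pairs, kept in registration order.
    subscribers: Vec<(String, Handler)>,
}

impl MiniBus {
    /// Registers a handler for an exact kind, or for every kind with "*".
    pub fn subscribe<F>(&mut self, pattern: &str, handler: F)
    where
        F: FnMut(&Event) + 'static,
    {
        self.subscribers.push((pattern.to_string(), Box::new(handler)));
    }

    /// Calls every matching handler in registration order.
    /// Returns only after the last one finishes, with the number that ran.
    pub fn publish(&mut self, event: &Event) -> usize {
        let mut ran = 0;
        for (pattern, handler) in self.subscribers.iter_mut() {
            if pattern.as_str() == "*" || pattern.as_str() == event.kind.as_str() {
                handler(event);
                ran += 1;
            }
        }
        ran
    }
}
```

Because the subscribers live in one Vec and publish walks it front to back, a wildcard handler registered after a specific handler runs after it for the same event.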

For the once pattern:

bus.subscribe_once("run_complete", |event: &Event| {
    println!("Run finished. Total cost: ${}", event.payload["usd"]);
});

subscribe_once handlers are removed automatically after their first invocation.
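One way to get that removal behavior is to flag each subscriber and prune during dispatch. The sketch below assumes a simplified bus where the payload is a plain &str. OnceBus, subscriber_count, and the once flag are hypothetical names, not the crate's internals.

```rust
type Handler = Box<dyn FnMut(&str)>;

struct Subscriber {
    kind: String,
    once: bool,
    handler: Handler,
}

#[derive(Default)]
pub struct OnceBus {
    subscribers: Vec<Subscriber>,
}

impl OnceBus {
    pub fn subscribe(&mut self, kind: &str, handler: impl FnMut(&str) + 'static) {
        self.add(kind, false, Box::new(handler));
    }

    pub fn subscribe_once(&mut self, kind: &str, handler: impl FnMut(&str) + 'static) {
        self.add(kind, true, Box::new(handler));
    }

    fn add(&mut self, kind: &str, once: bool, handler: Handler) {
        self.subscribers.push(Subscriber { kind: kind.to_string(), once, handler });
    }

    /// Runs matching handlers in order, then drops the once-subscribers that fired.
    pub fn publish(&mut self, kind: &str, payload: &str) {
        // retain_mut visits elements in order and keeps those returning true.
        self.subscribers.retain_mut(|sub| {
            if sub.kind.as_str() != kind {
                return true; // not a match: keep, do not call
            }
            (sub.handler)(payload);
            !sub.once // keep regular subscribers, drop one-shot ones
        });
    }

    pub fn subscriber_count(&self) -> usize {
        self.subscribers.len()
    }
}
```

Doing the removal in the same pass as the dispatch means a one-shot handler can never be called twice, even if the same kind is published again immediately.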

Panic isolation in practice:

bus.subscribe("tool_called", |event: &Event| {
    // This handler panics on unexpected event shapes:
    let _cost = event.payload["token_count"].as_u64().unwrap(); // panics if key missing
    // ...
});

bus.subscribe("tool_called", |event: &Event| {
    // This handler runs regardless:
    println!("Second handler still runs even if first panicked");
});

bus.publish(Event::new("tool_called", json!({ "tool_name": "fetch" })));
// First handler panics on missing "token_count".
// Panic is caught. Error is logged internally.
// Second handler runs normally.
// publish() returns a result that reports 1 handler panic, so the caller knows something went wrong.

The bus returns a result from publish that includes the count of handler panics. You can inspect it or ignore it depending on how much you care about side-channel failures.
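A rough sketch of what such a dispatch loop can look like, using only the standard library. PublishReport and dispatch are hypothetical names chosen for this example. The crate's actual return type may be shaped differently.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical report type; the crate's real return type may differ.
#[derive(Debug, Default, PartialEq)]
pub struct PublishReport {
    pub handlers_run: usize,
    pub handler_panics: usize,
}

/// Runs each handler in order and catches panics so later handlers still run.
pub fn dispatch(handlers: &mut [Box<dyn FnMut(&str)>], payload: &str) -> PublishReport {
    let mut report = PublishReport::default();
    for handler in handlers.iter_mut() {
        report.handlers_run += 1;
        // AssertUnwindSafe is our promise that we will not rely on state
        // the handler may have left half-updated when it panicked.
        let outcome = catch_unwind(AssertUnwindSafe(|| handler(payload)));
        if outcome.is_err() {
            report.handler_panics += 1;
        }
    }
    report
}
```

A caller that cares can check handler_panics and raise its own alert. A caller that does not can drop the report. Note that the default panic hook still prints the panic message to stderr before catch_unwind returns.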

What it does NOT do

It does not support async handlers. This is a sync bus. If you need async subscribers, start the async work with tokio::spawn from inside the handler, or send events to a channel from the handler and consume them elsewhere.
It does not persist events to disk. Events are dispatched in memory and discarded after all handlers have run.
It does not guarantee handler order under concurrent publish calls. This bus is not thread-safe for concurrent publishing. Each call to publish is a synchronous dispatch.
It does not filter by payload content, only by event kind string. Wildcard matching is on the kind, not on payload fields.
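The channel workaround from the first point can be sketched with the standard library alone. This example uses std::sync::mpsc and a worker thread as a stand-in for a Tokio channel and task, which is an assumption made to keep it dependency-free. OwnedEvent and forwarding_handler are hypothetical names.

```rust
use std::sync::mpsc::channel;
use std::thread::{self, JoinHandle};

/// Hypothetical owned event, so it can cross a thread boundary.
#[derive(Clone, Debug, PartialEq)]
pub struct OwnedEvent {
    pub kind: String,
    pub payload: String,
}

/// Returns a cheap sync handler that forwards a copy of each event,
/// plus the worker that drains the channel until the handler is dropped.
pub fn forwarding_handler() -> (impl FnMut(&OwnedEvent), JoinHandle<Vec<OwnedEvent>>) {
    let (tx, rx) = channel::<OwnedEvent>();
    let worker = thread::spawn(move || {
        // Slow work would happen here, off the publisher's thread.
        // This sketch just collects what it receives.
        rx.iter().collect::<Vec<_>>()
    });
    let handler = move |event: &OwnedEvent| {
        // send only fails if the worker is gone; a side channel can ignore that.
        let _ = tx.send(event.clone());
    };
    (handler, worker)
}
```

The handler itself only clones and sends, so the publisher is blocked for a very short time regardless of how slow the consumer is. Dropping the handler closes the channel and lets the worker finish.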

Inside the lib: catch_unwind instead of threads

The panic isolation is the design decision worth unpacking.

There are two ways to isolate handler panics: run each handler in its own thread, or use catch_unwind. Threads are the obvious choice if you want true concurrent isolation. Each handler gets its own stack, its own panic boundary, no interaction with other handlers.

The problem is cost. Spawning a thread for every handler invocation on every event is expensive. Agents that publish dozens of events per second, each with several handlers, would spend a meaningful fraction of their time in thread spawn overhead. The handler work is often trivial (a log line, a counter increment, a threshold check). The overhead of spawning for that work exceeds the work itself.
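For comparison, this is roughly what the thread-per-handler approach looks like. It is a sketch of the alternative being rejected, not code from the crate, and run_on_thread is a hypothetical name.

```rust
use std::thread;

/// Runs one handler on a fresh thread and reports whether it panicked.
pub fn run_on_thread<F>(handler: F, payload: String) -> Result<(), String>
where
    F: FnOnce(&str) + Send + 'static,
{
    thread::spawn(move || handler(&payload))
        .join() // Err if the handler thread panicked
        .map_err(|_| "handler panicked".to_string())
}
```

Every invocation pays for a spawn and a join, and the signature gets stricter too: the handler must be Send + 'static and the payload must be owned so it can move to the new thread.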

catch_unwind with AssertUnwindSafe is the near-zero-overhead alternative. On targets that use table-based unwinding (the default on mainstream desktop and server platforms), the happy path costs little more than the handler call itself, and there is no per-call setup comparable to spawning a thread. Only when a panic actually occurs does the unwinding machinery activate, and at that point we are already in error territory.

The tradeoff is that catch_unwind is not a complete isolation boundary. Some panics in Rust are not catchable (panics that occur in code compiled with panic = "abort", for example). AssertUnwindSafe also requires the caller to reason about whether the wrapped closure is actually unwind-safe. In practice, for side-channel handlers that observe events without holding critical state, this is fine.
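Here is a small sketch of where AssertUnwindSafe comes in. A closure that captures a &mut reference is not UnwindSafe, so catch_unwind refuses it unless you wrap it. The helper names run_isolated and panic_message are hypothetical and the crate's internals may look different.

```rust
use std::any::Any;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Best-effort conversion of a panic payload into text.
pub fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Runs a handler and turns a panic into an Err carrying its message.
pub fn run_isolated<F: FnOnce()>(handler: F) -> Result<(), String> {
    // Without AssertUnwindSafe, closures capturing `&mut` state would be
    // rejected at compile time. The wrapper moves that judgment to us.
    catch_unwind(AssertUnwindSafe(handler)).map_err(|payload| panic_message(&*payload))
}
```

The wrapper does not make anything safer at runtime. It only silences the compile-time check, which is why the reasoning about what the handler touches still has to happen.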

The rule of thumb: if a handler panic would leave shared mutable state in an inconsistent condition, catching the panic does not repair that state, so catch_unwind alone is not enough and you need a stronger boundary such as thread isolation. For read-only observation handlers (logging, metrics, alerting), catch_unwind is appropriate and the overhead matters.
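To see what "inconsistent condition" means in practice, consider this sketch of a handler that updates two counters under a Mutex and panics between the two writes. Totals, record, and demo_torn_update are hypothetical names made up for the illustration.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Mutex;

/// Two counters that are meant to always be equal.
#[derive(Default)]
pub struct Totals {
    pub events: u64,
    pub costed_events: u64,
}

/// Updates both counters, but panics halfway when `token_count` is missing.
pub fn record(totals: &Mutex<Totals>, token_count: Option<u64>) {
    let mut guard = totals.lock().unwrap();
    guard.events += 1;
    let _tokens = token_count.unwrap(); // panics here with the lock held
    guard.costed_events += 1;
}

/// Returns (handler_panicked, mutex_poisoned, events, costed_events).
pub fn demo_torn_update() -> (bool, bool, u64, u64) {
    let totals = Mutex::new(Totals::default());
    let panicked = catch_unwind(AssertUnwindSafe(|| record(&totals, None))).is_err();
    let poisoned = totals.is_poisoned();
    // Recover the inner value even though the mutex is poisoned.
    let inner = totals.into_inner().unwrap_or_else(|e| e.into_inner());
    (panicked, poisoned, inner.events, inner.costed_events)
}
```

The panic is caught and the process survives, but events is 1 while costed_events is still 0. The standard Mutex at least flags this by poisoning itself. State behind other kinds of sharing gets no such warning.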

When this is useful

Use agent-event-bus-rs when:

Your agent has side-channel concerns (logging, cost tracking, alerting, audit) that should not be on the critical path of the agent's core work.
You want to add or remove subscribers without changing the agent's core loop.
You have multiple listeners for the same event and want each to run independently even if another crashes.
You are building a plugin or hook system for your agent where third-party handlers can subscribe to agent events. Panic isolation is especially important when you do not control the handler code.
You want wildcard subscriptions for cross-cutting concerns that apply to all events.

When NOT to use it

Skip agent-event-bus-rs when:

You need async subscribers. This bus is sync. For async event dispatch use a channel (tokio::sync::broadcast or similar).
Your handlers modify shared state that must stay consistent. In that case you need proper locking, not just panic isolation.
You are building a distributed event system. This bus is in-process. It does not serialize events over the network.
You have one subscriber per event and no cross-cutting concerns. Direct function calls are simpler and faster.

Install

[dependencies]
agent-event-bus-rs = "0.1"
serde_json = "1"

GitHub: MukundaKatta/agent-event-bus-rs

Requires Rust stable. No unsafe code. Dependency: serde_json (for event payloads). 31 tests.

Siblings

Lib	Boundary	Repo
agent-event-bus (Python)	Same pub/sub semantics for Python agents	MukundaKatta/agent-event-bus
agenttrace-rs	Cost and latency aggregation that subscribes to agent run events	MukundaKatta/agenttrace-rs
agentsnap-rs	Snapshot the event sequence for regression testing	MukundaKatta/agentsnap-rs
agent-decision-log	WHY-layer log that publishes decision events for subscribers to observe	MukundaKatta/agent-decision-log

A common composition: the agent publishes events to the bus at each step. agenttrace-rs subscribes to accumulate cost and latency. agentsnap-rs subscribes to record the event sequence. agent-decision-log publishes decision events that other subscribers can audit. Each concern is isolated. Each can fail independently.

What's next

Async bridge. A helper that takes an async fn handler and hands its future to a Tokio runtime (for example with tokio::spawn on a runtime handle), so async handlers can subscribe without the bus needing to become async itself.
Event replay. Store events in a ring buffer and allow new subscribers to replay recent history. Useful for subscribers that join mid-run and need to catch up on what already happened.
Dead-letter queue. Collect panicked events (events where every handler panicked) into a separate queue for inspection. Useful for debugging handler failures in production.
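None of these exist yet. As a rough idea of what the event replay item could look like, here is a bounded ring buffer over VecDeque. ReplayBuffer and its methods are hypothetical and say nothing about the eventual design.

```rust
use std::collections::VecDeque;

/// Keeps the most recent `capacity` events for late subscribers.
pub struct ReplayBuffer<T> {
    capacity: usize,
    events: VecDeque<T>,
}

impl<T> ReplayBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, events: VecDeque::with_capacity(capacity) }
    }

    /// Stores an event, evicting the oldest one when the buffer is full.
    pub fn record(&mut self, event: T) {
        if self.capacity == 0 {
            return; // a zero-capacity buffer keeps no history
        }
        if self.events.len() == self.capacity {
            self.events.pop_front();
        }
        self.events.push_back(event);
    }

    /// Feeds the retained events to a new subscriber, oldest first.
    pub fn replay(&self, mut handler: impl FnMut(&T)) {
        for event in &self.events {
            handler(event);
        }
    }
}
```

A subscriber that joins mid-run would call replay once to catch up and then subscribe normally for live events.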

v0.1.0 shipped 2026-05-24. The sync dispatch and panic isolation are stable. If you find a case where a caught panic corrupts bus state (it should not, but let me know), open an issue with a minimal reproducer.

Part of the Hermes Agent Challenge sprint. The full agent-stack series is at MukundaKatta on GitHub.
