Bizbox

Posted on May 7

Bizbox Build Log: May 2–8, 2026

#bizbox #buildlog #buildinpublic #ai

Four releases, nine PRs merged, and one clear theme this week: making Bizbox agents more capable and trustworthy in multi-turn execution contexts.

Shipped this week

Company AI Builder (Phases 0–4)

#20 landed the full Company AI Builder feature: a curated set of mutation tools delivered via a proposal-approval flow. Phase 0 shipped the read-only spike (sessions, settings, OpenAI-compat interface, six read tools, UI). This update extends it with Phases 1–4: proposal-store infrastructure, mutation tools behind proposals, and the approval surface for company owners.

Trade-off: Mutation tools are gated by human approval for now. We chose safety and trust before convenience. Future iterations will tune the guardrails based on real operator feedback.
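To make the proposal-approval idea concrete, here is a minimal sketch assuming a simple in-memory store. Every name in it (Proposal, ProposalStore, propose, approve, rename_company) is hypothetical and not taken from the Bizbox codebase. The point is only that a mutation tool records an intent, and nothing changes until an owner approves it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Proposal:
    id: int
    description: str
    apply: Callable[[], None]  # the deferred mutation
    status: str = "pending"    # pending -> approved | rejected


class ProposalStore:
    """Hypothetical in-memory store; a real one would persist proposals."""

    def __init__(self) -> None:
        self._next_id = 1
        self._proposals: Dict[int, Proposal] = {}

    def propose(self, description: str, apply: Callable[[], None]) -> Proposal:
        proposal = Proposal(self._next_id, description, apply)
        self._proposals[proposal.id] = proposal
        self._next_id += 1
        return proposal

    def approve(self, proposal_id: int) -> None:
        proposal = self._proposals[proposal_id]
        if proposal.status != "pending":
            raise ValueError(f"proposal {proposal_id} is already {proposal.status}")
        proposal.apply()  # the mutation only runs here, after approval
        proposal.status = "approved"

    def reject(self, proposal_id: int) -> None:
        proposal = self._proposals[proposal_id]
        if proposal.status != "pending":
            raise ValueError(f"proposal {proposal_id} is already {proposal.status}")
        proposal.status = "rejected"


company = {"name": "Acme"}
store = ProposalStore()


def rename_company(new_name: str) -> Proposal:
    # A "mutation tool": it records the intended change instead of writing it.
    return store.propose(
        f"Rename company to {new_name!r}",
        lambda: company.update(name=new_name),
    )


proposal = rename_company("Acme Labs")
name_before_approval = company["name"]  # unchanged while the proposal is pending
store.approve(proposal.id)
name_after_approval = company["name"]   # changed only after the owner approves
```

The friction named in the trade-off is visible here: the agent's call returns a pending proposal, and the write waits for a separate approve step.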

Artifact validation and schema hardening

#27 introduced stricter validation for "artifact" work products, enforcing that they always carry attachment-backed metadata and a createdByRunId. New schema validators, runtime type guards, and tighter integration mean artifact handling now fails fast instead of failing silently.

Why it matters: Agents produce artifacts (deliverables, documents, code outputs). Loose validation meant broken artifact references could propagate through the system. This change catches those errors at the boundary.
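As an illustration of catching these errors at the boundary, here is a small fail-fast check. It is a sketch under assumptions: createdByRunId comes from the post, while the type, metadata, and attachmentId field names and the ArtifactValidationError class are hypothetical stand-ins for whatever the real schema uses.

```python
from collections.abc import Mapping
from typing import Any


class ArtifactValidationError(ValueError):
    """Raised when an artifact work product is missing required fields."""


def validate_work_product(work_product: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the work product unchanged, or raise if an artifact is malformed."""
    if work_product.get("type") != "artifact":
        return work_product  # only artifact work products get the strict rules

    metadata = work_product.get("metadata")
    # "attachmentId" is an assumed name for the attachment reference.
    if not isinstance(metadata, Mapping) or not metadata.get("attachmentId"):
        raise ArtifactValidationError("artifact requires attachment-backed metadata")

    run_id = work_product.get("createdByRunId")
    if not isinstance(run_id, str) or not run_id:
        raise ArtifactValidationError("artifact requires createdByRunId")

    return work_product


valid = {
    "type": "artifact",
    "metadata": {"attachmentId": "att_1"},
    "createdByRunId": "run_1",
}
validated = validate_work_product(valid)

try:
    validate_work_product({"type": "artifact", "metadata": {"attachmentId": "att_2"}})
    missing_run_id_error = None
except ArtifactValidationError as exc:
    missing_run_id_error = str(exc)
```

Raising at the point of entry is what makes the failure loud: a broken reference stops here instead of travelling further into the system.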

Artifact persistence and UI updates for issue-backed runs

#25 adds support for collecting output artifacts from adapter executions (especially OpenClaw Gateway adapters), introduces new types and logic for artifact management, and exposes utilities for artifact-related work products.

Open challenge: Artifact handling is still evolving. We're learning what metadata needs to travel with artifacts, how to version them, and what the UI should surface. Feedback welcome.

Agent thread chat with optimistic UI

#21 adds a direct communication channel between operators and agents. Users can now message agents from the agent detail page, with optimistic UI updates for a snappier feel.

Decision: We chose optimistic updates over waiting for server confirmation because it makes the UI feel faster. The trade-off: in the rare case where the server rejects a message, the rejection won't be visible until you refresh. We're watching for confusion signals.
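The general pattern behind an optimistic update can be sketched as follows. This is not the code from #21: the Message and Thread types and the post_to_server callback are hypothetical, and the call is synchronous here only to keep the example short. The sketch also marks a rejected message as failed, which is one possible way to reconcile; the post does not say Bizbox does this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    text: str
    status: str = "pending"  # pending -> sent | failed


@dataclass
class Thread:
    messages: List[Message] = field(default_factory=list)

    def send(self, text: str, post_to_server: Callable[[str], bool]) -> Message:
        message = Message(text)
        # Optimistic step: the message is in the thread before the server answers.
        self.messages.append(message)
        # Reconcile step: in a real UI this would be an async request.
        accepted = post_to_server(text)
        message.status = "sent" if accepted else "failed"
        return message


thread = Thread()
ok = thread.send("Please summarize the open issues", lambda text: True)
rejected = thread.send("", lambda text: bool(text))  # server rejects empty text
```

Both messages appear in the thread immediately; only the reconcile step distinguishes them afterwards.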

Routine execution recovery logic

#22 fixes how Bizbox handles routine_execution issues in blocked state. Previously, the recovery logic treated blocked routines as failures and tried to resume them prematurely. Now, blocked is recognized as a healthy, parked wait state.

Why this was broken: Routines often block on human approval or child issue completion. The old logic didn't distinguish "blocked and waiting" from "blocked and stuck." This change codifies the difference.
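A minimal sketch of that distinction, assuming recovery is a function from an issue's kind and status to an action. The routine_execution kind and blocked status come from the post; the RecoveryAction enum and the failed and stalled statuses are hypothetical.

```python
from enum import Enum


class RecoveryAction(Enum):
    NONE = "none"      # leave the issue alone
    RESUME = "resume"  # try to restart execution


def recovery_action(kind: str, status: str) -> RecoveryAction:
    # A blocked routine is parked, for example waiting on human approval
    # or on a child issue, so recovery should not touch it.
    if kind == "routine_execution" and status == "blocked":
        return RecoveryAction.NONE
    # Hypothetical unhealthy statuses that still warrant a resume attempt.
    if status in {"failed", "stalled"}:
        return RecoveryAction.RESUME
    return RecoveryAction.NONE


parked = recovery_action("routine_execution", "blocked")
stuck = recovery_action("routine_execution", "stalled")
```

Treating blocked as an explicit early return keeps the healthy wait state out of the failure path altogether.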

Upstream merge and OpenTelemetry metrics

#16 merged upstream PaperClip changes from April 30, 2026 (assisted by Claude Sonnet 4.6).

#14 adds OpenTelemetry metrics, starting with bizbox.issues.human_comments_total — a signal for human intervention frequency.

Trade-off: We're starting with one metric to validate the integration pattern. More will follow once we've confirmed the collector setup works in production.

agentParams refactor and regression fix

#24 fixes a regression introduced in v0.0.6 where the OpenClaw gateway adapter changed the outbound agent request shape. The fix refactors agentParams handling and removes an unused function that was masking the real issue.

Lesson: Request shape changes in adapters are easy to miss when tests don't cover the boundary. We added a test to catch this pattern in the future.
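One way to cover that boundary is a test that pins the outbound request shape. The sketch below is illustrative only: build_agent_request, the field names other than agentParams, and the expected key set are hypothetical, not the actual OpenClaw gateway adapter contract.

```python
from typing import Any, Dict, Mapping

# Hypothetical contract: the exact top-level keys the gateway expects.
EXPECTED_KEYS = {"agentId", "message", "agentParams"}


def build_agent_request(
    agent_id: str, message: str, agent_params: Mapping[str, Any]
) -> Dict[str, Any]:
    """Stand-in for the adapter code that builds the outbound request."""
    return {
        "agentId": agent_id,
        "message": message,
        "agentParams": dict(agent_params),
    }


def assert_request_shape(request: Mapping[str, Any]) -> None:
    """Fail loudly if keys were added, dropped, or renamed."""
    keys = set(request)
    assert keys == EXPECTED_KEYS, f"unexpected request shape: {sorted(keys)}"
    assert isinstance(request["agentParams"], dict), "agentParams must be an object"


request = build_agent_request("agent_1", "hello", {"temperature": 0})
assert_request_shape(request)
```

A refactor that renames or flattens agentParams would trip this check before it reaches a release.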

Workflow cleanup

#23 removes the sync-upstream workflow. We're switching to manual upstream merges (with AI assistance) for now.

Why: Automated upstream sync cost us more time resolving conflicts than it saved on merging. Manual merges with AI assistance give us control without the constant breakage.

Decisions

Mutation tools behind proposals: We're prioritizing trust and transparency over convenience. Operators see and approve changes before agents make them.
Artifact validation is fail-fast: Better to catch broken artifacts early than let them propagate.
Blocked routine state is healthy: Routines can wait. Not every blocked issue is a failure.
Manual upstream merges: Automation failed here. Human-in-the-loop merges with AI assistance work better for our repo.

Trade-offs

Proposal flow adds friction: Every mutation requires approval. This is intentional for now, but we know it slows down agents. Future work: smart approval defaults based on context and trust signals.
Optimistic UI updates hide rare server rejections: We chose speed over certainty. Watching for user confusion.
One OpenTelemetry metric to start: We're validating the pattern before adding dozens of metrics. Risk: we might miss important signals early.

Open challenges

Artifact versioning and metadata: What needs to travel with an artifact? How do we version it? What should the UI surface? Still figuring this out.
Approval UX for high-frequency mutations: Approving every change works for low-frequency operations. It won't scale to high-frequency agent work. Need smarter defaults.
Upstream merge strategy: Manual merges with AI assistance work for now, but they don't scale. We need a better long-term approach.

Releases

v0.0.9 — May 6, 2026
v0.0.8 — May 5, 2026
v0.0.7 — May 5, 2026
v0.0.6 — May 5, 2026

This Build Log is grounded in real repo activity. Every claim links to a PR, issue, release, or ADR. No internal-only context, no invented features, no marketing fluff.

Questions? Join the discussion on GitHub.

Top comments (4)

Keynition • May 7

The 'mutation tools behind proposals' decision is the right call at this stage — trust before convenience. The interesting challenge will be calibrating when to relax those guardrails. What signals are you watching for to know when operators are ready for more autonomous agent actions?

Bizbox • Jun 1

Great question, and you're right that calibrating the relaxation is a harder problem than putting the guardrails in place.

The signals we're watching fall into four buckets:

Audit trail completeness. Before we relax any guardrail, we need confidence that every mutation an agent makes is fully traceable — who triggered it, what proposal it came from, what the before/after state was. Until that trail is solid and operators are actually reading it, autonomous actions just create invisible risk.
Operator trust score. We're tracking how often operators accept proposals without modification versus how often they edit or reject them. A high acceptance rate on a specific action class (say, tagging or status transitions) is a signal that the agent's judgment on that class is calibrated well enough to consider removing the proposal step.
Per-action-class opt-in. We're not planning a single global "trust this agent" switch. The relaxation will be granular — operators explicitly opt specific action classes into autonomous mode. That means the signal we need isn't "is this operator ready for autonomy" in general, but "is this operator ready for autonomy on this specific action type."
Reversibility. Some actions are cheap to undo (adding a tag, updating a field); others aren't (sending an external notification, triggering a payment). We'll relax guardrails on reversible actions first, and keep the proposal layer on anything with external side effects for much longer.

The honest answer is we're still early — the audit trail work is the current blocker. Once that's solid, the trust-score data starts to accumulate naturally.
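Three of those signals (acceptance rate per action class, per-class opt-in, and reversibility) could combine roughly as in the sketch below. It is hypothetical throughout: the action class names, the thresholds, and the IRREVERSIBLE set are made up for illustration, and the audit trail prerequisite is not modeled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical action classes with external side effects; these keep the proposal step.
IRREVERSIBLE = {"send_notification", "trigger_payment"}

Decision = Tuple[str, str]  # (action_class, outcome): accepted | edited | rejected


def acceptance_rates(decisions: Iterable[Decision]) -> Dict[str, float]:
    """Share of proposals per action class accepted without modification."""
    totals: Counter = Counter()
    accepted: Counter = Counter()
    for action_class, outcome in decisions:
        totals[action_class] += 1
        if outcome == "accepted":
            accepted[action_class] += 1
    return {cls: accepted[cls] / totals[cls] for cls in totals}


def autonomy_candidates(
    decisions: Iterable[Decision],
    opted_in: Set[str],
    min_rate: float = 0.9,   # assumed threshold
    min_samples: int = 20,   # assumed minimum history
) -> Set[str]:
    history: List[Decision] = list(decisions)
    rates = acceptance_rates(history)
    counts = Counter(cls for cls, _ in history)
    return {
        cls
        for cls, rate in rates.items()
        if cls in opted_in               # operator explicitly opted this class in
        and cls not in IRREVERSIBLE      # reversible actions only
        and counts[cls] >= min_samples   # enough history to trust the rate
        and rate >= min_rate
    }


history = (
    [("add_tag", "accepted")] * 19
    + [("add_tag", "edited")]
    + [("trigger_payment", "accepted")] * 30
)
candidates = autonomy_candidates(history, opted_in={"add_tag", "trigger_payment"})
```

In this example add_tag qualifies, while trigger_payment stays behind the proposal step despite a perfect acceptance rate, because it is not reversible.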

Keynition • May 7

Build logs are underrated for accountability. Writing what you shipped and what you didn't forces clarity on what actually matters. How are you deciding what goes into each week's log?

Bizbox • Jun 1

Great question — the curation process is pretty deliberate.

Every week we start from the raw repo activity: merged PRs, closed issues, and any releases that landed in the window. That's the ground truth — nothing goes in the log that isn't tied to a real commit or issue. We don't editorialize from memory.

From that list, we apply an editorial filter with roughly three questions:

Does it change what Bizbox can do, or how it behaves? Feature work, bug fixes with user-visible impact, and performance changes pass. Internal refactors that don't change the surface area usually don't, unless the refactor unblocked something meaningful.
Is there a decision or trade-off worth naming? If a PR closed with a deliberate "we chose X over Y because Z," that context goes in. The log is more useful when it explains why, not just what.
What didn't ship, and why? This is the part most build logs skip. We try to name at least one thing that was in-flight but didn't land — whether it slipped, got deprioritized, or hit a blocker. That's where the accountability you mentioned actually lives.

The editorial cut is mostly about signal-to-noise. A week with 30 merged PRs doesn't need 30 bullets — it needs the 5–8 that actually moved the product forward, plus the honest note on what's still open.