Build an Email Support Triage Agent With Its Own Inbox

#ai #email #agents #tutorial

Every shared support inbox eventually becomes a triage problem: 80 unread messages, no agreement on what "urgent" means, and the one person who knows which customer is about to churn is on PTO. Teams keep solving this with labels and heroics. It's a better fit for an LLM — as long as the LLM has somewhere safe to live.

That's the case for giving the triage agent its own mailbox. Nylas Agent Accounts (currently in beta) are hosted mailboxes you create entirely through the API. A support@yourcompany.com Agent Account receives every inbound support email, gets six system folders out of the box (inbox, sent, drafts, trash, junk, archive), and exposes the same grant_id-based endpoints as any connected Gmail or Outlook account.

Creating one is a single request:

curl --request POST \
  --url "https://api.us.nylas.com/v3/connect/custom" \
  --header "Authorization: Bearer $NYLAS_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "provider": "nylas",
    "settings": { "email": "support@yourcompany.com" }
  }'

Save the grant_id from the response — every other call hangs off it.
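The same request can be issued from Python. What follows is a minimal sketch using only the standard library; it mirrors the curl call above field for field. The function names are illustrative and not part of any Nylas SDK, and the assumption that the grant ID sits at data.id in the JSON response should be checked against the body you actually get back.

```python
import json
import urllib.request

API_URL = "https://api.us.nylas.com/v3/connect/custom"


def build_create_request(api_key: str, email: str) -> urllib.request.Request:
    """Build (but do not send) the POST that creates an Agent Account."""
    payload = {"provider": "nylas", "settings": {"email": email}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_grant_id(response_body: bytes) -> str:
    """Pull the grant ID out of the response.

    Assumption: the response is JSON shaped like {"data": {"id": "..."}}.
    """
    return json.loads(response_body)["data"]["id"]


# Sending is one more step, left as a comment so the sketch has no side effects:
#   req = build_create_request(os.environ["NYLAS_API_KEY"], "support@yourcompany.com")
#   with urllib.request.urlopen(req) as resp:
#       grant_id = extract_grant_id(resp.read())
```

Keeping request construction separate from sending makes the payload easy to inspect before anything touches the network.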

Four buckets beat five

The classification scheme from the email triage agent recipe sorts mail into exactly four categories:

Bucket	Meaning	Action
`URGENT`	Production incident, executive ask	Draft a reply within the hour
`ACTION`	Code review, meeting follow-up	Draft a reply same-day
`FYI`	Status update	Leave it alone
`NOISE`	Newsletter, automated alert	Archive

Four is deliberate. Three loses fidelity: everything collapses into "important." With five, the model starts confusing adjacent categories.

The prompt runs with temperature=0 and max_tokens=10, and the model only sees sender + subject + a 200-character snippet, not the full body. That's enough for over 90% accuracy. Here's the prompt verbatim from the recipe:

You triage email into one of four categories:

URGENT  — production incidents, executive requests; reply within 1 hour
ACTION  — code reviews, meeting follow-ups; reply same day
FYI     — informational, no response needed
NOISE   — newsletters, marketing, automated notifications

From:    {sender}
Subject: {subject}
Snippet: {snippet}

Return ONLY the category name. Nothing else.

Validate the output against the four valid strings (LLMs occasionally invent a category) and fall back to FYI on anything unrecognized.
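Here is a minimal sketch of the snippet truncation and output validation described above. The model call is passed in as a plain callable (`complete`, a hypothetical name) so the validation can be exercised without an API key; in practice it would wrap a chat-completion call made with temperature=0 and max_tokens=10. The template argument stands in for the recipe prompt with its three placeholders.

```python
from typing import Callable

VALID_CATEGORIES = {"URGENT", "ACTION", "FYI", "NOISE"}
FALLBACK = "FYI"
SNIPPET_CHARS = 200


def build_prompt(template: str, sender: str, subject: str, body: str) -> str:
    """Fill the prompt with sender, subject, and a 200-character snippet."""
    # Collapse whitespace first so the snippet budget is not spent on newlines.
    snippet = " ".join(body.split())[:SNIPPET_CHARS]
    return template.format(sender=sender, subject=subject, snippet=snippet)


def normalize_category(raw: str) -> str:
    """Map raw model output to one of the four buckets, else fall back to FYI."""
    label = raw.strip().strip(".:`\"'").upper()
    return label if label in VALID_CATEGORIES else FALLBACK


def classify(prompt: str, complete: Callable[[str], str]) -> str:
    """Run the (injected) model call and validate whatever comes back."""
    return normalize_category(complete(prompt))
```

The match is deliberately strict: output such as "URGENT, probably" falls through to FYI rather than being guessed at, which keeps an invented or padded answer from triggering a draft.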

The cost math is almost a rounding error. GPT-4o-mini runs about $0.15 per million input tokens; a 200-character snippet plus the prompt is roughly 150 tokens, so 100 emails is about 15K tokens — call it $0.002. Drafting uses a stronger model (GPT-4o, around $2.50 per million input tokens), but only on the URGENT and ACTION subset, typically under 20% of the inbox. A heavy day at 200 unread emails costs roughly a nickel. And if some of that mail can't leave your infrastructure, point the same OpenAI client at a local Ollama endpoint — Llama 3.1 classifies almost as well as GPT-4o-mini for this task, though drafting quality drops noticeably below a 70B-parameter model.
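The arithmetic can be checked in a few lines. The prices, the 150-token classification size, and the 20% drafting share come from the paragraph above. The 450 input tokens per draft is an assumption, picked only because it lands near the nickel estimate; output tokens are not priced in the text and are left out here.

```python
CLASSIFY_USD_PER_M_INPUT = 0.15   # GPT-4o-mini input price quoted above
DRAFT_USD_PER_M_INPUT = 2.50      # GPT-4o input price quoted above
CLASSIFY_TOKENS_PER_EMAIL = 150   # prompt plus 200-character snippet
DRAFT_SHARE = 0.20                # upper bound on URGENT + ACTION mail
DRAFT_INPUT_TOKENS = 450          # ASSUMPTION: not stated in the article


def classification_cost(emails: int) -> float:
    """Input-token cost in USD of classifying every email once."""
    return emails * CLASSIFY_TOKENS_PER_EMAIL * CLASSIFY_USD_PER_M_INPUT / 1_000_000


def drafting_cost(emails: int, share: float = DRAFT_SHARE) -> float:
    """Input-token cost in USD of drafting replies for the given share of mail."""
    return emails * share * DRAFT_INPUT_TOKENS * DRAFT_USD_PER_M_INPUT / 1_000_000


def daily_cost(emails: int) -> float:
    """Classification plus drafting, input tokens only."""
    return classification_cost(emails) + drafting_cost(emails)
```

With these numbers, classification_cost(100) comes to about $0.00225, and drafting dominates the daily total, so the drafting share is the figure worth measuring on your own inbox.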

Drafting is where you add a second gate

Classifying wrong is cheap. Replying wrong is expensive. The support agent pattern layers two independent checks before any draft exists:

Confidence gating on the knowledge-base match. At a score of 0.85 or higher, draft directly from the matched article. Between 0.60 and 0.85, draft conservatively and cite the article inline so a reviewer can verify. Below 0.60, don't draft at all — flag for manual handling with the best-guess article attached.

Risk tiering, which doesn't care about confidence. Password resets and FAQ-shaped questions get drafted. Refunds and billing changes get drafted with extra scrutiny. Legal threats, regulatory matters, and fraud reports skip the model entirely and escalate to a human with full context. A high-confidence KB match for a refund question still goes through review — the Air Canada chatbot ruling is the canonical reminder of why.
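One way to picture the three tiers is a keyword pass that runs before anything else. This is a hypothetical sketch, not the recipe's classifier: the keyword lists are illustrative assumptions, and a production system would likely need something more robust than regular expressions. It takes plain text, so a caller would pass in the subject and body it extracted from the message.

```python
import re

# ASSUMPTION: illustrative keyword lists, not taken from the recipe.
HIGH_RISK = re.compile(
    r"\b(lawsuit|lawyer|attorney|legal|regulator\w*|fraud\w*|chargeback)\b",
    re.IGNORECASE,
)
MEDIUM_RISK = re.compile(
    r"\b(refund\w*|billing|invoice\w*|charged?|subscription)\b",
    re.IGNORECASE,
)


def classify_risk(text: str) -> str:
    """Return "high", "medium", or "low" for a piece of ticket text."""
    # High is checked first so a refund request that also threatens
    # legal action is escalated rather than drafted.
    if HIGH_RISK.search(text):
        return "high"
    if MEDIUM_RISK.search(text):
        return "medium"
    return "low"
```

Note that the function never looks at the KB confidence score, which is the point of keeping the two gates independent.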

In both recipes, the agent never hits send. Drafts land in the drafts folder; a human approves. On an Agent Account that drafts folder belongs to the agent itself, so the review queue and the audit trail are the same mailbox.

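def handle(msg):
    # Risk tiering runs first and ignores KB confidence: high-risk
    # topics never reach the KB search or the drafting model.
    if classify_risk(msg) == "high":
        escalate_to_human(msg, reason="high-risk topic")
        return

    question = extract_question(msg)
    article, conf = kb.search(question)

    # Below 0.60: no draft; a human gets the best-guess article.
    if conf < 0.60:
        flag_for_review(msg, article)
        return

    # 0.60 to 0.85: cite the article inline; 0.85 and up: draft directly.
    draft = generate_draft(msg, article, cite_inline=(conf < 0.85))
    # Nothing is sent here; the draft waits for human approval.
    queue_for_approval(msg, draft, article)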

Rules filter before the model ever runs

Here's the part a borrowed human inbox can't do. Agent Accounts support inbound rules that run at the mail layer: block known spam domains at the SMTP stage, auto-route invoices to a finance folder with assign_to_folder, mark VIP senders for immediate attention. The junk never reaches your classification prompt, which also means injection-laden garbage never enters the model's context. Rules and allow/block lists are covered in the Agent Accounts overview.

Inbound mail fires the standard message.created webhook — identical in shape to the same event for any other grant — so the trigger side of your pipeline is whatever you already run for connected accounts. If webhooks are more infrastructure than you want on day one, polling works; the support recipe suggests every 5–15 minutes, which is fine latency for support.
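On the receiving end, the webhook handler only needs to reduce each event to the three fields the classifier sees. The sketch below assumes the event body is JSON with the message under data.object and the sender list under from; treat those field paths as assumptions to verify against a real notification, and the function name as illustrative.

```python
from typing import Optional


def to_triage_input(event: dict) -> Optional[dict]:
    """Reduce a message.created event to what the classifier needs.

    Returns None for any other event type so the caller can ignore it.
    ASSUMPTION: message fields live under event["data"]["object"].
    """
    if event.get("type") != "message.created":
        return None
    msg = event.get("data", {}).get("object", {})
    senders = msg.get("from") or [{}]
    return {
        "grant_id": msg.get("grant_id"),
        "message_id": msg.get("id"),
        "sender": senders[0].get("email", ""),
        "subject": msg.get("subject") or "",
        # Same 200-character budget the classification prompt uses.
        "snippet": (msg.get("snippet") or "")[:200],
    }
```

A polling loop can feed the same function by wrapping each fetched message in a dict of the same shape, so the classifier never needs to know which trigger produced the input.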

Rollout advice from the recipes

Start at 5 tickets per cycle while you tune the KB matcher and risk classifier. Bump to 20 once the false-positive rate is acceptable.
Group similar tickets. If the agent sees three "where's my receipt?" tickets in a row, batch them — same KB article, same draft template, one reviewer pass.
Track what the agent can't match. Low-confidence tickets are the strongest signal of where your knowledge base has holes.
Mind the send cap. A free-plan Agent Account sends up to 200 messages per account per day; paid plans have no daily cap by default, and a policy can set a stricter quota.
Cap reply length in the prompt. The drafting prompt in the triage recipe says "three sentences max," and that constraint is load-bearing — without it, drafts read like a politely overcompensating intern.
Log everything — every classification, KB lookup, and approval decision — to your own store. Support is the workload where you'll be asked "why did it say that?" months later.
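Two of these points, batching similar tickets and tracking what the agent can't match, fall out of the same pass over a cycle's tickets. A minimal sketch, assuming "similar" means "matched the same KB article" and that each ticket is a dict with hypothetical ticket_id, article_id, and confidence keys:

```python
from collections import defaultdict

LOW_CONFIDENCE = 0.60  # same floor the confidence gate uses


def plan_review_batches(tickets: list) -> tuple:
    """Split a cycle's tickets into per-article batches and KB gaps.

    Returns (batches, gaps): batches maps article_id to ticket IDs that can
    share one draft template and one reviewer pass; gaps lists ticket IDs
    whose best KB match scored below the floor.
    """
    batches = defaultdict(list)
    gaps = []
    for ticket in tickets:
        if ticket["confidence"] < LOW_CONFIDENCE:
            gaps.append(ticket["ticket_id"])
        else:
            batches[ticket["article_id"]].append(ticket["ticket_id"])
    return dict(batches), gaps
```

Reviewing the gaps list at the end of each week gives a concrete backlog of articles the knowledge base is missing.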

The cheapest way to evaluate this is to run classification-only for a week: create the Agent Account, point a copy of your support flow at it, and log the four-bucket output without drafting anything. Compare against what your humans actually prioritized. If the agreement rate clears 90%, you've earned the right to turn on drafting.
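Scoring that week of logs is a small comparison job. The sketch below assumes two hypothetical dicts keyed by message ID, one holding the agent's bucket and one holding the bucket a human assigned, and reports both the agreement rate and which confusions account for the misses.

```python
from collections import Counter


def agreement_report(agent: dict, human: dict) -> tuple:
    """Compare agent and human buckets over the messages both labeled.

    Returns (rate, misses): rate is the fraction of shared messages where
    the labels match; misses counts (human_label, agent_label) pairs.
    """
    shared = sorted(agent.keys() & human.keys())
    if not shared:
        raise ValueError("no overlapping message IDs to compare")
    misses = Counter(
        (human[m], agent[m]) for m in shared if agent[m] != human[m]
    )
    rate = 1 - sum(misses.values()) / len(shared)
    return rate, misses
```

The misses counter matters as much as the headline rate: a run that clears 90% overall but repeatedly files URGENT mail as FYI is not ready for drafting.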
