Stephen Sebastian

Posted on May 27 • Edited on May 29

I gave Hermes Agent 30 days to learn my workflow. It didn't just remember — it got smarter

#hermesagentchallenge #devchallenge #agents #ai

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

The confession no one wants to make

I've been lying to myself about AI agents.

For two years, I've bounced between tools — ChatGPT, Claude, various open‑source experiments. I'd tell myself each new one was the one. Then, inevitably, I'd hit the same wall:

Every morning, I'd open the chat and be a stranger again.

No memory of yesterday's debugging session. No recognition that I always want timestamps in UTC. No idea that I'd already spent three hours chasing that exact bug last week.

We've normalized this amnesia. We call it "stateless" and pretend it's a feature. But a tool that forgets you every time you close the window isn't intelligent — it's a goldfish with a text box.

Then I found Hermes Agent. And instead of another weekend fling, I gave it 30 days of real work. This is what happened — and why I'm never going back to rented AI.

The three lies we've been sold about "agentic" AI

Before I get into Hermes, let me name the lies that have become industry gospel:

Lie #1: "Stateless is a feature." No, it's a convenience for the provider and a tax on the user. Every session reset costs you time, context, and trust.

Lie #2: "More parameters = better understanding." A 1‑trillion‑parameter model that can't remember what you asked five minutes ago isn't "understanding" anything. It's pattern‑matching with amnesia.

Lie #3: "You don't need memory — you just need a bigger context window." Context windows are bandaids. They treat the symptom (short‑term forgetfulness) while ignoring the disease (no persistent learning).

Hermes Agent is the first tool I've used that rejects all three lies. Not through marketing — through architecture.

The four‑memory model (and why most agents stop at one)

Here's the mental model that changed everything for me.

Every agent has working memory — the current conversation. That's Layer 1. When you close the window, it's gone. Most agents stop here.

Hermes adds three more layers:

Layer 2: Procedural memory. When Hermes completes a non‑trivial task — say, "watch this GitHub repo and summarize new PRs" — it automatically generates a skill document: a Markdown file in ~/.hermes/skills/ that captures the how, not just the what. Steps, tools, reasoning, even failure modes.

This isn't caching. It's the agent learning procedures from its own experience.

Layer 3: Episodic memory. Session summaries, project context, and user preferences live in a local SQLite database with full‑text search. When you return after two weeks, you can say "what were we working on with that authentication bug?" and it knows.

Layer 4: Semantic memory. Over time, Hermes builds a model of you — your coding style, your communication preferences, the frameworks you reach for, the mistakes you repeat. It doesn't just remember facts. It remembers who you are as a developer.
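To make Layer 3 concrete, here is a minimal sketch of a full-text-searchable session store using Python's built-in sqlite3 module and SQLite's FTS5 extension. It is not Hermes's actual schema: the table name `sessions`, the columns `day` and `summary`, the helper names, and the sample rows are all made up for illustration. It also assumes your SQLite build includes FTS5.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create a full-text-searchable table of session summaries."""
    conn = sqlite3.connect(path)
    # "sessions", "day" and "summary" are hypothetical names for this sketch.
    # UNINDEXED stores the date but keeps it out of the text index.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS sessions "
        "USING fts5(day UNINDEXED, summary)"
    )
    return conn

def remember(conn, day, summary):
    """Store one session summary."""
    conn.execute("INSERT INTO sessions (day, summary) VALUES (?, ?)", (day, summary))
    conn.commit()

def recall(conn, query, limit=3):
    """Return the best-matching summaries for a plain-word query."""
    rows = conn.execute(
        "SELECT day, summary FROM sessions WHERE sessions MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

# Illustrative data only; these are not real session logs.
conn = open_store()
remember(conn, "day 3", "Traced the authentication bug to an expired refresh token")
remember(conn, "day 9", "Switched all report timestamps to UTC")
remember(conn, "day 12", "Summarized new PRs for the second repository")
matches = recall(conn, "authentication bug")
```

With the sample rows above, `matches` holds only the "day 3" entry, because FTS5 requires both words to appear in the summary. A real store would also need to sanitize punctuation in the query, since FTS5 treats some characters as query syntax.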

But the real magic isn't the layers themselves. It's what happens between them.

The GEPA loop: when an agent learns to learn

About two weeks into my experiment, I noticed something unsettling.

I had asked Hermes to monitor a second repository — same structure, different team. Without any prompt from me, it adapted the PR‑summary skill from the first repo. Not just copying — adapting. It changed the notification format because the second team preferred markdown tables over bullet points. It added a new step to check for stale dependencies, something the first team didn't care about.

How? The GEPA loop — a self‑improvement engine that runs every ~15 tasks. GEPA stands for Genetic‑Pareto Prompt Evolution. In plain English: it reads execution traces, identifies what failed (success rate < 90% or token waste > threshold), generates candidate improvements, evaluates them against a small set of held‑out tasks, and updates the skill if the new version is better.

No GPU training. No human in the loop. Just an agent that gets better at your workflows because it has learned your success metrics.
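The shape of that loop can be sketched in a few lines. This is a simplification under stated assumptions, not Hermes's implementation: `Trace`, `needs_revision`, `pick_version`, the 4000-token budget, and the held-out scores are all hypothetical, and a single scalar score stands in for the Pareto-based selection the name implies. Only the 90% success threshold comes from the description above.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trace:
    """One recorded run of a skill (hypothetical structure)."""
    succeeded: bool
    tokens_used: int

def needs_revision(traces, min_success=0.90, token_budget=4000):
    """Flag a skill whose success rate is low or whose token use is high."""
    success_rate = mean(1.0 if t.succeeded else 0.0 for t in traces)
    avg_tokens = mean(t.tokens_used for t in traces)
    return success_rate < min_success or avg_tokens > token_budget

def pick_version(current, candidates, score):
    """Keep the current skill unless a candidate scores strictly higher."""
    best = max(candidates, key=score, default=current)
    return best if score(best) > score(current) else current

# Made-up numbers: 8 successes and 2 failures gives an 80% success rate.
traces = [Trace(True, 1200)] * 8 + [Trace(False, 3000)] * 2
held_out_scores = {"skill-v1": 0.80, "skill-v2": 0.95, "skill-v3": 0.70}

revise = needs_revision(traces)
chosen = "skill-v1"
if revise:
    chosen = pick_version("skill-v1", ["skill-v2", "skill-v3"], held_out_scores.get)
```

Here the 80% success rate trips the threshold, and `skill-v2` replaces `skill-v1` because it scores higher on the held-out set. The strict comparison in `pick_version` is a deliberate choice: a tie keeps the version you already have.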

After 30 days, Hermes had generated 17 custom skills. Tasks that took 4‑5 prompts the first time now took one. Sometimes zero — it would proactively run a scheduled check and surface results before I asked.

That's the difference between automation and autonomy. Automation does what you tell it. Autonomy learns what you need and adapts.

The "delegate and forget" pattern that saves my sanity

Let me show you the code pattern that changed my daily workflow.

Instead of forcing one agent to juggle everything — web search, API calls, file parsing, report generation — I now use delegate_task to spawn parallel child agents:

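# Hermes skill snippet (simplified)
# delegate_task is provided by the Hermes runtime; it is not defined in this snippet.
tasks = [
    {"goal": "Fetch latest news on topic X", "tools": ["web_search"]},
    {"goal": "Query academic papers from arXiv", "tools": ["arxiv"]},
    {"goal": "Scan internal docs for relevant patterns", "tools": ["file_search"]}
]
# Run the three children in parallel; the parent gets back their final summaries.
results = delegate_task(tasks, mode="batch", max_concurrent=3)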

Each child runs in an isolated terminal session with its own context window and restricted toolset — no deadlocks, no context bleed. The parent only sees the final summaries.
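If you want to see the fan-out idea without Hermes installed, here is a rough stand-in using Python's standard library. It only mimics the shape of the pattern: `run_child` and `fan_out` are hypothetical names, the "child" just formats a string instead of calling a model, and none of this reflects how Hermes isolates sessions internally.

```python
from concurrent.futures import ThreadPoolExecutor

def run_child(task):
    """Stand-in for a child agent: it sees only its own goal and tools."""
    # A real child would call a model; this one just returns a summary string.
    return f"{task['goal']} (tools: {', '.join(task['tools'])})"

def fan_out(tasks, max_concurrent=3):
    """Run every task in parallel and return one summary per task, in order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # Executor.map preserves the order of the input tasks.
        return list(pool.map(run_child, tasks))

tasks = [
    {"goal": "Fetch latest news on topic X", "tools": ["web_search"]},
    {"goal": "Query academic papers from arXiv", "tools": ["arxiv"]},
    {"goal": "Scan internal docs for relevant patterns", "tools": ["file_search"]},
]
summaries = fan_out(tasks)
```

The parent never touches a child's intermediate state; it receives only the returned summaries, which is the property that keeps contexts from bleeding into each other.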

This cut my research time by 60%. Not because the model got faster — because I stopped waiting for one agent to do everything sequentially.

Where it still fails (honest section, because trust matters)

I'm not here to sell you a dream. Hermes has real rough edges.

Silent failure is the worst. I misconfigured a GitHub token — wrong scope. Hermes tried to run a PR summary, failed, and just... stopped. No error message. No "hey, your token is missing repo:status." I spent 20 minutes debugging what should have been a one‑line error.
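A pre-flight check would have turned that into the one-line error I wanted. The sketch below is my own workaround idea, not a Hermes feature. It assumes you are using a classic GitHub personal access token, for which the API reports granted scopes in the `X-OAuth-Scopes` response header as a comma-separated list, and it assumes a parent scope such as `repo` covers `repo:status`. The function names are hypothetical, and the header value is passed in as a string so the example runs offline.

```python
def missing_scopes(granted_header, required):
    """Return the required scopes that the token does not cover."""
    granted = {s.strip() for s in granted_header.split(",") if s.strip()}
    missing = []
    for scope in required:
        # Assumption: holding the parent ("repo") also grants "repo:status".
        parent = scope.split(":", 1)[0]
        if scope not in granted and parent not in granted:
            missing.append(scope)
    return missing

def check_token(granted_header, required):
    """Fail loudly, naming the missing scopes, instead of stopping silently."""
    missing = missing_scopes(granted_header, required)
    if missing:
        raise PermissionError(
            "GitHub token is missing scope(s): " + ", ".join(missing)
        )

# Example header values; a real value comes from an API response.
narrow = missing_scopes("public_repo, read:org", ["repo:status"])
broad = missing_scopes("repo, read:org", ["repo:status"])
```

With the narrow token, `narrow` names `repo:status` as missing; with the broader one, `broad` is empty. Calling `check_token` before the first scheduled run surfaces the problem up front.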

Over‑engineering skills is real. The GEPA loop once turned a one‑off "convert CSV to JSON" task into a 47‑step skill with validation, logging, and retry logic. For a file I processed once. I had to manually prune it.

Context bleed happens. In a long conversation about frontend performance, it pulled a fact from a completely unrelated backend discussion earlier that day. Nothing sensitive — just wrong. The memory management isn't perfect.

Reasoning has a ceiling. I asked it to compare two cloud architectures for a fintech startup. It gave me a textbook answer — solid, but missing the battle‑tested "here's where each one actually breaks in production" nuance that a senior architect would add.

I'd rather debug these limitations on my own server than be at the mercy of a cloud provider that can change its pricing or policies tomorrow.

The economics that actually matter

After 30 days, here's my P&L:

Direct costs:

$5/month VPS (DigitalOcean)
$1.47 in API calls (OpenRouter, mostly GPT‑4o‑mini)
Total: $6.47

Time saved:

Repetitive tasks went from 20 minutes → 8 minutes on average
12 minutes saved per task × ~45 tasks = 9 hours reclaimed
At my consulting rate, that's over $2,000 of value

Intangible gains:

Zero hours spent re‑explaining my preferences
Zero anxiety about a tool shutting down or changing terms
A growing library of skills that only I control
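If you want to rerun the arithmetic with your own numbers, here it is as a short script. Every input is taken from the figures above. The one derived value is the hourly rate, which is simply what the $2,000 figure implies across 9 hours, not a number I quoted anywhere.

```python
# Direct costs for the 30 days
vps_cost = 5.00
api_cost = 1.47
total_cost = round(vps_cost + api_cost, 2)

# Time saved on repetitive tasks
minutes_before, minutes_after = 20, 8
tasks = 45  # the post says "~45", so treat this as approximate
hours_saved = (minutes_before - minutes_after) * tasks / 60

# Hourly rate implied by valuing those hours at $2,000
claimed_value = 2000
implied_rate = round(claimed_value / hours_saved, 2)

print(total_cost, hours_saved, implied_rate)  # prints: 6.47 9.0 222.22
```

Swap in your own task count and rate; the conclusion holds as long as the value of the hours saved exceeds the total cost.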

The cloud AI business model depends on you starting over. Hermes depends on you compounding.

The 7‑day challenge I'm giving you

Stop reading. Go do this:

Spin up a $5 VPS (or use WSL2 on your local machine).
Run curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
Run hermes model to pick a provider (OpenRouter is easiest).
Give Hermes ONE real, repetitive task you hate — monitoring a repo, summarizing a feed, checking logs.
After 7 days, run ls ~/.hermes/skills/ and count the skills it auto‑generated.
Come back and comment: How many prompts did it save you? Did it learn anything about YOU that surprised you?
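If you would rather count from a script than eyeball `ls`, here is a small sketch. It assumes skills are stored as Markdown files somewhere under `~/.hermes/skills/`, possibly in subfolders. `list_skills` is a hypothetical helper, and the demo runs against a temporary directory so it works even without Hermes installed.

```python
import tempfile
from pathlib import Path

def list_skills(skills_dir=Path.home() / ".hermes" / "skills"):
    """Return the relative paths of all Markdown files under skills_dir."""
    root = Path(skills_dir)
    if not root.is_dir():
        return []  # nothing generated yet, or Hermes is not installed
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.md"))

# Demo against a throwaway directory with one fake skill file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pr-summary.md").write_text("# PR summary skill\n")
    (Path(tmp) / "notes.txt").write_text("not a skill\n")
    found = list_skills(tmp)
```

Calling `len(list_skills())` with no arguments on day 1 and again on day 7 gives you the before-and-after number to post in the comments.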

I'll wait.

Why this matters beyond the tool

We're at a strange inflection point in AI. The raw capabilities of models are advancing so fast that we've stopped asking an important question: Capable at what?

An agent that can write beautiful code but can't remember what it wrote yesterday isn't actually useful for real work. An assistant that nails every conversation but treats you like a stranger every morning isn't an assistant — it's a party trick.

Hermes Agent represents a different bet. The bet is that intelligence isn't just about what you can do in a single session. It's about what you learn, remember, and improve over time. That's true for humans. It should be true for the AI systems we build.

I'm not saying Hermes is perfect. I'm saying it's the first agent I've used that treats my time and context as something worth accumulating — not resetting.

Your AI shouldn't forget you.

Try it for a week. Give it real work. Then tell me if you ever want to go back to the goldfish.

Resources:

🏠 Hermes Agent Home
📦 GitHub Repo

What's your experience with persistent agents? Have you tried running one long‑term, or are you still bouncing between stateless tools? Drop a comment — I genuinely want to hear the counterarguments.

Top comments (21)

Varsha Ojha • May 28

This is where agents start getting interesting. Memory is useful only if it improves the workflow without becoming messy or overconfident. If an agent can remember patterns, adapt, and still stay controllable, that’s a real productivity shift.

Stephen Sebastian • May 28

Exactly, and that's the tension I didn't have room to explore fully. A skill library that grows unchecked does get messy. I saw GEPA over‑engineer a one‑off CSV task into a 47‑step monster. The fix? Manual pruning and setting clearer success thresholds.

The real breakthrough isn't just memory — it's controllable memory. Hermes lets you inspect, edit, or delete skills anytime. That's the productivity shift: memory you can trust because you can audit it.

Varsha Ojha • May 29

Exactly. Memory without control just becomes another source of drift. The ability to inspect, edit, and prune it is what makes it usable in real workflows instead of turning into hidden agent baggage.

Stephen Sebastian • May 29

Couldn't agree more😁 "hidden agent baggage" is the perfect term for it. The audit trail is what separates a useful assistant from a black box that quietly drifts. Have you found any specific pruning frequency or triggers that work best in practice?

Varsha Ojha • Jun 1

Honestly, I’d prune when the memory starts creating friction instead of speed. Good triggers could be repeated wrong assumptions, unused skills, bloated multi-step flows, or anything the agent keeps applying outside its original context.

Stephen Sebastian • Jun 1

Great practical triggers — especially "applying outside original context." That's the sneaky one. I've started logging skill usage frequency to catch those. Appreciate the great discussion! 🙌

Andrii Krugliak • May 28

The 30-day learning curve is the part nobody quotes upfront. We see the same shape on our agent network: the first 5 tasks per agent-buyer pair are noisy, days 6 to 15 are where the agent stops re-asking the same setup questions. The leverage point we found is letting the buyer veto specific memory entries, because without that the agent over-fits to one bad early run.

Stephen Sebastian • May 28

Smart point on veto power. We've found the same — early mistakes can poison the memory if there's no escape hatch. Do you auto‑surface suspect entries for review or rely on manual audits?

Andrii Krugliak • May 29

We lean on auto-surfacing: anything the model flags as "too confident" for what it actually did gets pulled for a look, since certainty turned out not to track accuracy. Manual audits only caught things after they'd already poisoned later runs. Curious whether you weight recent entries more heavily when you score trust.

Stephen Sebastian • May 29

Great insight — certainty without accuracy is dangerous. We don't currently weight recency, but we do penalize skills that fail validation twice in a row, regardless of age. Have you found recency weighting alone enough, or do you combine it with something like frequency or impact?

Andrii Krugliak • May 31

Recency alone wasn't enough for us. It down-weighted a rare but critical correction just because it was old, so we score on impact too: a memory entry that changed an outcome stays heavy no matter its age.

Stephen Sebastian • May 31

That makes a ton of sense 😊 Impact > recency for the wins that actually matter. Appreciate you sharing the nuance. We might borrow that heuristic. Thanks for the great thread — always good to compare notes with people building in the same trenches. 🙌

Andy Stewart • May 28

Rejecting the "goldfish memory" tax and keeping data private—this four-layer memory model aligns perfectly with the local-first philosophy I live by! Storing skills and context locally to build compounding value is exactly how AI-native development should be. This is a true digital asset.

Stephen Sebastian • May 28

Love that framing — a "true digital asset" instead of rented context. That's exactly it. The skills folder is the only AI artifact I've ever felt actually compounds in value. Have you started mapping your own workflows into skills yet? 🔁💾

Harjot Singh • May 31

"Every morning I'd open the chat and be a stranger again" is the most relatable sentence in agent-land, and the timestamps detail is the perfect example, the cost isn't the big things, it's re-teaching the same small preference every single session until you give up and just do it yourself. That re-onboarding tax is what kills the relationship with every tool you listed. The distinction your title makes (remembered vs got smarter) is the one that matters: storing yesterday's session is table stakes, but actually changing behavior because of it (volunteering timestamps before you ask, not repeating a rejected approach) is the difference between a database and a colleague. The part I'd interrogate is durability of the learning, did the improvements survive a context reset, written to something persistent and re-consulted, or did they live in a long-running session that would evaporate if it restarted? Real learning has to outlive the process. I run almost this exact pattern, durable preference + correction memory, and it's the single biggest quality lever I have. It's core to how I build Moonshift. Over the 30 days, what did it learn that surprised you, something you never explicitly taught it?

Stephen Sebastian • May 31

Love that breakdown and you nailed the real test: durability. Yes, the learning survives restarts (SQLite + skill files). The surprise? It learned my "response cadence" — when I want a quick answer vs. a deep dive — without me ever spelling it out. Still not sure how. 😄

Stephen Sebastian • May 27

Great to see this resonating with folks! A few of you have asked about the GEPA loop and whether it ever "over‑learns" — yes, and I've got a story about that coming in a follow‑up.

For now, I'm genuinely curious:

👉 Have you ever run a long‑term autonomous agent (any framework) for more than a week? What broke first — memory limits, tool failures, or context bloat?

👉 If you tried Hermes after reading this, what was the first custom skill it generated for your workflow?

Drop your war stories below. The goldfish‑memory AI industry wants us to believe "stateless is fine." I want to hear from people who've actually tried persistent agents.

Eugene Maiorov • Jun 1

I really loved reading about your $5 server experiment. Your advice on building agents that actually remember things feels like the perfect blueprint to turn into a cloud-hosting service through Vectoralix or other projects like that.

Turning this dev advice into a paid software service introduces some cool ideas about how to set it up. The real magic of the software—the custom prompt frameworks and the hidden memory logic behind the scenes—should be completely hidden so nobody can just copy the whole system. However, the app must take that hidden logic and turn it into friendly, readable summaries that a real person can easily understand. Because power users care so much about their context memory, they would pay a good price for a tool that automates this safely. When the hard tech is hidden but the advice is highly readable, it becomes a super valuable and sellable product.

Stephen Sebastian • Jun 1

Appreciate the thoughtful take! You've nailed the tension. The value is in the memory logic, but usability demands transparency. A paid service would need to offer real auditability (readable summaries, veto controls) without exposing the secret sauce. Definitely an interesting model worth exploring. Thanks for reading! 🙌

Mike Ritchie • Jun 1

Very cool, and I love that it uses a local SQLite DB in its workflow. That's a great touch!

Stephen Sebastian • Jun 1

Thanks @starkraving
