DEV Community

I gave Hermes Agent 30 days to learn my workflow. It didn't just remember — it got smarter

Stephen Sebastian on May 27, 2026

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent The confession no one wants to make I've been lying to my...

Varsha Ojha • May 28

This is where agents start getting interesting. Memory is useful only if it improves the workflow without becoming messy or overconfident. If an agent can remember patterns, adapt, and still stay controllable, that’s a real productivity shift.

Stephen Sebastian • May 28

Exactly, and that's the tension I didn't have room to explore fully. A skill library that grows unchecked does get messy. I saw GEPA over‑engineer a one‑off CSV task into a 47‑step monster. The fix? Manual pruning and setting clearer success thresholds.

The real breakthrough isn't just memory — it's controllable memory. Hermes lets you inspect, edit, or delete skills anytime. That's the productivity shift: memory you can trust because you can audit it.

Varsha Ojha • May 29

Exactly. Memory without control just becomes another source of drift. The ability to inspect, edit, and prune it is what makes it usable in real workflows instead of turning into hidden agent baggage.

Stephen Sebastian • May 29

Couldn't agree more 😁 "Hidden agent baggage" is the perfect term for it. The audit trail is what separates a useful assistant from a black box that quietly drifts. Have you found any specific pruning frequency or triggers that work best in practice?

Varsha Ojha • Jun 1

Honestly, I’d prune when the memory starts creating friction instead of speed. Good triggers could be repeated wrong assumptions, unused skills, bloated multi-step flows, or anything the agent keeps applying outside its original context.
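To make those triggers concrete, here is a minimal sketch of how they could be checked per skill. The `SkillStats` fields and the thresholds are hypothetical, chosen for illustration; this is not a Hermes data structure or API.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    """Hypothetical per-skill counters (illustrative, not a Hermes structure)."""
    name: str
    uses_last_30_days: int
    wrong_assumption_count: int
    step_count: int
    out_of_context_uses: int


def prune_reasons(s, max_steps=20, max_wrong=2):
    """Return the pruning triggers a skill has tripped; an empty list means keep it."""
    reasons = []
    if s.wrong_assumption_count >= max_wrong:
        reasons.append("repeated wrong assumptions")
    if s.uses_last_30_days == 0:
        reasons.append("unused")
    if s.step_count > max_steps:
        reasons.append("bloated flow")
    if s.out_of_context_uses > 0:
        reasons.append("applied outside original context")
    return reasons
```

A skill that returns any reasons would go into a review queue rather than being deleted automatically, so the human stays in control of what gets pruned.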

Stephen Sebastian • Jun 1

Great practical triggers — especially "applying outside original context." That's the sneaky one. I've started logging skill usage frequency to catch those. Appreciate the great discussion! 🙌
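For anyone who wants to try the same thing, a rough sketch of usage logging with Python's built-in `sqlite3` module might look like this. The `skill_usage` table, its columns, and the context labels are assumptions made up for the example, not Hermes' actual schema.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a usage log; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS skill_usage ("
        "skill TEXT NOT NULL, "
        "context TEXT NOT NULL, "
        "used_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def log_use(conn, skill, context):
    """Record one invocation of a skill together with the context it ran in."""
    conn.execute(
        "INSERT INTO skill_usage (skill, context) VALUES (?, ?)",
        (skill, context),
    )
    conn.commit()


def usage_by_context(conn, skill):
    """Count how often a skill was used in each context."""
    rows = conn.execute(
        "SELECT context, COUNT(*) FROM skill_usage "
        "WHERE skill = ? GROUP BY context "
        "ORDER BY COUNT(*) DESC, context",
        (skill,),
    ).fetchall()
    return dict(rows)
```

A skill that starts showing up in contexts it was never written for stands out quickly in the per-context counts.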

Andrii Krugliak • May 28

The 30-day learning curve is the part nobody quotes upfront. We see the same shape on our agent network: the first 5 tasks per agent-buyer pair are noisy, days 6 to 15 are where the agent stops re-asking the same setup questions. The leverage point we found is letting the buyer veto specific memory entries, because without that the agent over-fits to one bad early run.

Stephen Sebastian • May 28

Smart point on veto power. We've found the same — early mistakes can poison the memory if there's no escape hatch. Do you auto‑surface suspect entries for review or rely on manual audits?

Andrii Krugliak • May 29

We lean on auto-surfacing: anything the model flags as "too confident" for what it actually did gets pulled for a look, since certainty turned out not to track accuracy. Manual audits only caught things after they'd already poisoned later runs. Curious if you weight recent entries more heavily when you score trust.
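As a rough illustration of the idea (not our production code), the check can be as simple as comparing stated confidence against the observed success rate. The field names and the `max_gap` threshold below are hypothetical.

```python
def suspect_entries(entries, max_gap=0.3):
    """Flag memory entries whose stated confidence outruns their track record.

    Each entry is a dict with hypothetical keys:
      'id'           - any identifier
      'confidence'   - self-reported confidence, 0.0 to 1.0
      'success_rate' - observed fraction of successful uses, 0.0 to 1.0
    Returns flagged entries, largest confidence gap first.
    """
    def gap(e):
        return e["confidence"] - e["success_rate"]

    flagged = [e for e in entries if gap(e) > max_gap]
    return sorted(flagged, key=gap, reverse=True)
```

Only the flagged entries go to a human, which keeps the review queue short enough that someone actually reads it.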

Stephen Sebastian • May 29

Great insight — certainty without accuracy is dangerous. We don't currently weight recency, but we do penalize skills that fail validation twice in a row, regardless of age. Have you found recency weighting alone enough, or do you combine it with something like frequency or impact?
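In sketch form, the two-strikes rule looks roughly like this. The function name and the `penalty` factor are placeholders for illustration, not the exact values or code we run.

```python
def update_trust(trust, consecutive_failures, passed, penalty=0.5):
    """Apply one validation result to a skill's trust score.

    Returns (new_trust, new_consecutive_failures). A pass resets the
    failure streak. A failure extends it, and once the streak reaches
    two the trust score is multiplied by `penalty` on every further
    failure. The skill's age plays no part in the calculation.
    """
    if passed:
        return trust, 0
    consecutive_failures += 1
    if consecutive_failures >= 2:
        trust *= penalty
    return trust, consecutive_failures
```

A single failure costs nothing, so one flaky run does not sink an otherwise reliable skill.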

Andrii Krugliak • May 31

Recency alone wasn't enough for us. It down-weighted a rare but critical correction just because it was old, so we score on impact too: a memory entry that changed an outcome stays heavy no matter its age.
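A simplified sketch of that scoring, assuming an exponential recency decay with a hypothetical 14-day half-life and a boolean `changed_outcome` flag on each entry (our real scoring has more inputs):

```python
def entry_weight(age_days, changed_outcome, half_life_days=14.0):
    """Weight a memory entry by impact first, recency second.

    Entries that changed an outcome keep full weight regardless of age.
    Everything else decays exponentially, halving every `half_life_days`.
    """
    if changed_outcome:
        return 1.0
    return 0.5 ** (age_days / half_life_days)
```

With these numbers, a year-old correction that changed an outcome still weighs 1.0, while a routine entry from two weeks ago is already down to 0.5.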

Stephen Sebastian • May 31

That makes a ton of sense 😊 Impact > recency for the wins that actually matter. Appreciate you sharing the nuance. We might borrow that heuristic. Thanks for the great thread — always good to compare notes with people building in the same trenches. 🙌

Andy Stewart • May 28

Rejecting the "goldfish memory" tax and keeping data private—this four-layer memory model aligns perfectly with the local-first philosophy I live by! Storing skills and context locally to build compounding value is exactly how AI-native development should be. This is a true digital asset.

Stephen Sebastian • May 28

Love that framing — a "true digital asset" instead of rented context. That's exactly it. The skills folder is the only AI artifact I've ever felt actually compounds in value. Have you started mapping your own workflows into skills yet? 🔁💾

Harjot Singh • May 31

"Every morning I'd open the chat and be a stranger again" is the most relatable sentence in agent-land, and the timestamps detail is the perfect example, the cost isn't the big things, it's re-teaching the same small preference every single session until you give up and just do it yourself. That re-onboarding tax is what kills the relationship with every tool you listed. The distinction your title makes (remembered vs got smarter) is the one that matters: storing yesterday's session is table stakes, but actually changing behavior because of it (volunteering timestamps before you ask, not repeating a rejected approach) is the difference between a database and a colleague. The part I'd interrogate is durability of the learning, did the improvements survive a context reset, written to something persistent and re-consulted, or did they live in a long-running session that would evaporate if it restarted? Real learning has to outlive the process. I run almost this exact pattern, durable preference + correction memory, and it's the single biggest quality lever I have. It's core to how I build Moonshift. Over the 30 days, what did it learn that surprised you, something you never explicitly taught it?

Stephen Sebastian • May 31

Love that breakdown and you nailed the real test: durability. Yes, the learning survives restarts (SQLite + skill files). The surprise? It learned my "response cadence" — when I want a quick answer vs. a deep dive — without me ever spelling it out. Still not sure how. 😄

Stephen Sebastian • May 27

Great to see this resonating with folks! A few of you have asked about the GEPA loop and whether it ever "over‑learns" — yes, and I've got a story about that coming in a follow‑up.

For now, I'm genuinely curious:

👉 Have you ever run a long‑term autonomous agent (any framework) for more than a week? What broke first — memory limits, tool failures, or context bloat?

👉 If you tried Hermes after reading this, what was the first custom skill it generated for your workflow?

Drop your war stories below. The goldfish‑memory AI industry wants us to believe "stateless is fine." I want to hear from people who've actually tried persistent agents.

Eugene Maiorov • Jun 1

I really loved reading about your $5 server experiment. Your advice on building agents that actually remember things feels like the perfect blueprint to turn into a cloud-hosting service through Vectoralix or other projects like that.

Turning this dev advice into a paid software service introduces some cool ideas about how to set it up. The real magic of the software—the custom prompt frameworks and the hidden memory logic behind the scenes—should be completely hidden so nobody can just copy the whole system. However, the app must take that hidden logic and turn it into friendly, readable summaries that a real person can easily understand. Because power users care so much about their context memory, they would pay a good price for a tool that automates this safely. When the hard tech is hidden but the advice is highly readable, it becomes a super valuable and sellable product.

Stephen Sebastian • Jun 1

Appreciate the thoughtful take! You've nailed the tension. The value is in the memory logic, but usability demands transparency. A paid service would need to offer real auditability (readable summaries, veto controls) without exposing the secret sauce. Definitely an interesting model worth exploring. Thanks for reading! 🙌

Mike Ritchie • Jun 1

Very cool, and I love that it uses a local SQLite DB in its workflow. That’s a great touch!

Stephen Sebastian • Jun 1

Thanks @starkraving