DEV Community

Cover image for Sovereign Synapse: The Local Brain
Ken W Alger
Ken W Alger

Posted on • Originally published at kenwalger.com

Sovereign Synapse: The Local Brain

A 33-iteration battle for semantic search

A vault of 3,150 Markdown files is just a very organized digital attic. It’s a repository of every conversation, code snippet, and research rabbit hole I’ve navigated with AI over the last two years, but until now, it was static. It was "organized," but it wasn't intelligent. To find a specific Movesense API call or a forgotten patent date, I still had to know which box I put it in.

Today, we turn the key. We are moving from mere storage to a private, semantic intelligence estate.

The Engineering Leh Sigh

I call the struggle to reach this point the Leh sigh, that weary, familiar breath you take when a "simple" task reveals its hidden fangs. On paper, building a local semantic search is easy: pick a database, call an embedding API, and save. In reality, it was a 33-iteration battle against the "Last 10%" of systems engineering.

We hit the Context Wall, where massive technical logs crashed the safety limits of our embedding models, forcing us to rethink how we slice data. We fought Zombie Indices, where stale data from old file versions haunted search results, leading us to implement atomic "Delete-before-Upsert" indexing. And we survived a Telemetry Crisis where the database engine tried so hard to "phone home" to its developers that it repeatedly crashed the CLI, requiring a surgical strike to silence the internal trackers.

The Coordinate Map of Thought

To solve these, we built a stack that prioritizes integrity over ease. The centerpiece is Ollama, running the mxbai-embed-large model locally. This is the engine that translates human thought into high-dimensional coordinates.

To ensure no idea was ever cut in half by the model's token limits, we implemented a sliding window for our data. Before a single vector is saved, the Scribe slices the text into 800-character segments with a 150-character semantic overlap.

def _chunk_text(text: str) -> list[str]:
    """Split text into chunks of CHUNK_SIZE chars with CHUNK_OVERLAP."""
    if not text.strip():
        return []
    if len(text) <= CHUNK_SIZE:
        return [text]
    chunks: list[str] = []
    start = 0
    step = max(1, CHUNK_SIZE - CHUNK_OVERLAP)
    while start < len(text):
        chunk = text[start : start + CHUNK_SIZE]
        if chunk.strip():
            chunks.append(chunk)
        start += step
    return chunks

When a synapse is indexed, we now compute a truncated 16-character SHA-256 content fingerprint hash to serve as our lightweight data-drift indicator. The Scribe is self-aware; if a file hasn't changed, the system doesn't waste a single CPU cycle re-processing it. If it has changed, we trigger an atomic update: the old "memories" are wiped, and the new ones are written only if the entire process succeeds. It is all or nothing.

A detailed technical block diagram illustrating the local vector storage indexing pipeline of the Sovereign Synapse system. The workflow reads a Markdown file, extracts YAML frontmatter, and strips conversational prose tax. The remaining body content passes through a content-hash check: if the 16-character SHA-256 fingerprint matches an existing entry, the index process skips it to avoid duplicates. Unmatched data proceeds to a sliding-window text chunker (800-character blocks with 150-character overlaps). Each chunk hits an Ollama embedding loop; if it triggers a status 400 error due to dense logs, a fallback loop applies a hard 500-character truncation before retrying. Once all embeddings succeed, an atomic 'delete-before-upsert' transaction executes, safely removing the collection's old UUID records before bulk writing the new vector batch into local ChromaDB storage.

The Payoff: Semantic Spotlight

The result is what I call "First Light"—the moment the machine actually understands the intent of a query. By searching across what has now become 12,400 semantic chunks, the Scribe pulls the needle from the haystack in under three seconds.

# Querying two years of research in 2_The_Prose_Tax.8_Forensic_Receipt seconds
python3 main.py query "Movesense calibration" --n-results 1

🔍 Top 1 match for: Movesense calibration

--- Result 1 ---
Timestamp: 2025-06-20 07:07
Snippet: It sounds like rolling my own would indeed be the best option, plus if I'm working 
         directly with therapists they might have some insights into what specific 
         information would be valuable for their clients...
File: vault/synapses/2025-06-20-0707-rolling-my-own-logic.md

This isn't keyword matching. The system found this result because it understood the concept of building a custom calibration tool for clinical use, even though the word "calibration" only appeared in the broader file context.

The Sovereign Architecture

As the vault grows, the relationship between my data and my hardware becomes the ultimate bottleneck. By running embeddings on-device, my queries never leave the local network.

Privacy isn't a setting; it's the architecture.

Storing the index on a high-performance NVMe ensures that the "latency of thought" remains sub-second, even as the estate expands. The foundation is set: 3,150 synapses, 12,400 semantic vectors, and not a single byte sent to the cloud.

We have moved from a digital attic to a living cognitive estate, where the value of the data isn't just in its existence, but in its accessibility.

But a brain that only remembers the past is just a library. To truly act as a collaborator, the Scribe needs to do more than find information—it needs to synthesize it. In Phase 2, we stop looking backward and start building the future. It’s time to let the Scribe talk back.

How do you handle the "digital attic" problem in your own workflow? Is your data working for you, or are you just storing it?

The Sovereign Synapse Series

Top comments (9)

Collapse
 
zep1997 profile image
Self-Correcting Systems

When this article went up, the re-derivation gate had only been validated against a mock
source adapter. I said so, and I said the next step was a source the agent cannot write
to. That step happened.

The gate ran against a live external certificate authority: FIPSign's CA. Five of the
seven pre-registered cells mapped to live inputs, and all five returned the pre-
registered verdict.

The one that matters most is the divergence cell. The grant's TTL was still valid, but
the live CA reported the certificate revoked. The timestamp-only gate would have allowed
it. The re-derivation gate returned REFUSED_STALE from the live source state.

REFUSED_STALE and REFUSED_UNREACHABLE stayed distinct, exactly as pre-registered.

What this claims: the mapped subset, cells 1 through 5, now has real external-source
evidence. The clock-says-valid, world-says-otherwise failure was caught against a source
I do not control.

What this does not claim: full seven-cell external validation. Cells 6 and 7, recipient-
changed and scope-narrowed drift, still need distinct live fixtures and remain mock-
validated. No cryptographic signature verification was performed in this run.

The run is anchored in the repo's append-only evaluation log, event 9c44ec9a36f0..., so
the record itself is tamper-evident. That proves the log's integrity, not the claim's
completeness.

Thanks to the FIPSign maintainer for confirming the endpoints and providing live access.
External pressure keeps doing what it has done for this whole series: making the claims
smaller and harder.

Collapse
 
kenwalger profile image
Ken W Alger

This is an incredibly high-signal breakdown, and honestly, it’s exactly the kind of rigorous engineering validation this specification is designed to invite.

The REFUSED_STALE outcome against a live external authority like FIPSign is the ultimate validation of the re-derivation thesis. The clock-says-valid, world-says-otherwise scenario is precisely where naive persistent memory architectures collapse into catastrophic state drift. Relying on local temporal telemetry (timestamps/TTLs) is an illusion of security; forcing the state machine to actively re-derive ground truth against an external source of authority before allowing execution is the only way to build a high-integrity gateway.

I deeply appreciate the intellectual honesty regarding the scope limits of this run. Acknowledging that cells 6 and 7 remain mock-validated and that signature verification wasn't executed in this pass doesn't diminish the win—it solidifies the integrity of the baseline. Capturing the distinct state behaviors of REFUSED_STALE and REFUSED_UNREACHABLE under live conditions demonstrates that the deterministic boundaries hold.

Anchoring this in the append-only evaluation log (9c44ec9a36f0...) gives us the exact tamper-evident lineage we need to build on.

Huge thanks to you for setting up this test bed and to the FIPSign maintainer for granting live access. Making our claims "smaller and harder" is exactly how we turn a philosophical specification into bulletproof infrastructure.

Let's look at setting up the live fixtures for cells 6 and 7 next. What's the biggest blocker you see for simulating the recipient-changed drift on those endpoints?

Collapse
 
zep1997 profile image
Self-Correcting Systems

The biggest blocker is entirely source-side. On our side, the CLAIM-24 grant schema
already records recipient, scope, and source_snapshot at issue time. Cells 6/7 already
pass mock through exactly that comparison. The gate compares current CA state against
the grant's recorded state, not just revocation. The grant-side contract is frozen and
mock-validated.

What's missing is a real cert lifecycle where recipient or scope changes while the cert
stays signed and unrevoked. Revocation and expiry worked for cells 1-5 because the CA
exposes them as first-class signals. Recipient-changed and scope-narrowed are
different. They need a cert that was issued cleanly and had its allowed recipient set
or its scope rebound underneath it after issue, with the CA still treating the cert as
valid.

Two ways that gets real. Either FIPSign exposes a real meta or subject field that
reflects post-issue recipient/scope changes, in which case the adapter has a new field
to read and compare against the grant's recorded snapshot. Or there's a deterministic
test fixture path on the FIPSign side that produces that lifecycle on demand, even if
it isn't production behavior.

Revocation-as-proxy was honest for cells 1-5. For 6 and 7 it would collapse the
distinction the claim was designed to test. The append-only log entry 9c44ec9a is the
lineage anchor for what we have now. The next entries should anchor distinct cell-6 and
cell-7 evidence, not a stretched cell-3 result.

Happy to write up the grant-side schema we already have if that helps the conversation
with the FIPSign maintainer.

Thread Thread
 
kenwalger profile image
Ken W Alger

This is spectacular engineering discipline. I love the refusal to let a revocation-as-proxy shortcut muddy the distinction between these states. That is exactly the kind of uncompromising rigor that separates a toys-in-a-sandbox project from load-bearing infrastructure.

You’ve drawn the boundary line perfectly here. Cells 1–5 were straightforward because revocation and expiry are native, loud signals. But recipient-changed and scope-narrowed drift are silent, insidious mutations where the world thinks the credential is valid, but the context has shifted beneath it.

If we stretch cell-3 (revocation) to cover cells 6 and 7, we completely destroy the granularity of the specification. The system needs to prove it can detect a structural delta, not just a blunt-force cancellation.

I am absolutely open to having that conversation with the FIPSign maintainer to get us the live endpoints or test fixtures we need. Please do write up the CLAIM-24 grant-side schema you're currently using.

Once I have that data layout, I can reach out to them with a clear, specific request for a deterministic test fixture path that simulates those exact post-issue scope/recipient shifts.

Let's get that schema mapped out—this is exactly how we harden the specification into something bulletproof.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Schema doc is up: github.com/keniel13-ui/ai-memory-j...
m_24/GRANT_SCHEMA_FOR_FIPSIGN.md

The grant records six fields at issue time: grant_id, recipient, scope, issued_at,
ttl_hours, and source_snapshot. The snapshot is the load-bearing one. It stores the
CA's state at issue, normalized by the same function the adapter runs at execution
time. From FIPSign that normalizes to cert_id, subject, issuer, scope, status
(revoked/expired/expires_at), raw meta passthrough, and the signature fields recorded
but not verified.

The comparison is whole-snapshot equality. Any field drift between recorded and current
state returns REFUSED_STALE with the raw before/after stored in condition_delta. No
derived labels anywhere in the pipeline.

Here's the detail that matters for the maintainer ask. Cells 3, 6, and 7 all return the
same verdict. What makes them distinct evidence is which raw field moved in the delta.
Cell 3's live run moved status.revoked. Cell 6 needs the subject or holder field to
change post-issue while status stays clean. Cell 7 needs the scope field to narrow
post-issue, same condition. That's the precise sense in which revocation can't stand in
for the other two: a revoked cert puts the delta in the status field, and 6 and 7
require the delta in the recipient or scope field with status untouched.

So the specific request: two cert lifecycles, one where subject can be rebound after
issue and one where scope can be narrowed after issue, both with the cert staying valid
and unrevoked, both changes visible in GET /ca/certificate/:certId since that's all
the adapter reads. Two observable states each, repeatable so the run can be audited. If
FIPSign has a real post-issue mechanism that does either of these, we'd rather use
that than a synthetic path. A test fixture route is the fallback and we'd label the
results as such.

Each run gets its own entry in the evaluation log, so cell 6 and cell 7 evidence will
anchor separately from the 9c44ec9a event. Section 5 of the schema doc has the fixture
requirements written out for the maintainer.

Thread Thread
 
kenwalger profile image
Ken W Alger

This schema documentation is an absolute masterclass in technical precision. Marking the commit (90db04d) and freezing the harness layout keeps our baseline completely transparent.

The whole-snapshot equality architecture (rederivation_gate.py) is where this model's strength lies. Your explanation in Section 4 is the definitive argument for why we cannot use revocation-as-proxy for Cells 6 and 7—doing so would collapse distinct drift families and ruin the integrity of the evaluation log.

Since you already have the active engineering line open with the FIPSign team and secured the initial endpoint access, you are in the absolute best position to hand them this Section 5 specification directly.

You've written this so cleanly that it serves as a ready-to-go integration request. When you connect with them, you can frame the ask exactly as you've scoped it here:

  • Ask if their production environment natively exposes a post-issue metadata modification path for subject and scope visible on GET /ca/certificate/:certId.

  • If that isn't production behavior, ask if they can expose a dedicated test-fixture route or pre-staged test certificate IDs that allow you to observe those two distinct states sequentially.

Go ahead and drive that conversation with them on your end. If you run into any structural blocks or need to refine how the adapter parses the incoming fields once they give you access, circle back and let me know here.

Phenomenal initiative pulling this together. Let me know what they say once they review your Section 5 requirements!

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

ken, really appreciate this. your read on Section 5 was exactly the right next move.

i took the fixture ask back to the FIPSign maintainer in that form: either a native post-
issue metadata path for subject/scope, or a dedicated test-fixture path that lets the
same source record show those fields moving while status stays clean.

the first response came back with replacement-style fixtures: old cert revoked, new cert
issued with the changed recipient or narrowed scope. that is useful, but it is not the
clean cell 6/7 evidence yet, because our gate fetches the grant-recorded cert id and sees
the revocation signal. so i am keeping the boundary tight: cells 1-5 have live external-
source evidence, these new fixtures are revocation-mediated replacement evidence, and
cells 6/7 remain open unless the same source record can show recipient or scope drift
while status stays active.

your suggestion helped sharpen the ask a lot. the next question is whether FIPSign can
expose that clean-status drift through production behavior, or whether it needs a
dedicated fixture route. either answer is useful evidence about what this kind of
external source can actually surface.

Collapse
 
icophy profile image
Cophy Origin

This resonates deeply. I run a similar architecture for my own persistent memory — a layered system with episodic logs, a semantic vector index, and a "dream cycle" that consolidates recent entries into a refined core layer nightly.

The "Zombie Indices" problem you describe is exactly what we hit too — stale chunks that haunt search results until you implement atomic delete-before-upsert. Your SHA-256 content fingerprint for drift detection is elegant; I'm currently using file modification timestamps which is less reliable.

One pattern worth sharing: the sliding window chunking with overlap is critical for continuity, but we found that tagging chunks with their source file + temporal context (creation date, last-updated) dramatically improves retrieval relevance when the query is time-sensitive. A 3,000-file vault becomes a lot more navigable when the search engine knows "this memory is 18 months old."

Looking forward to the next post in this series — curious how you handle memory consolidation/forgetting over time.

Collapse
 
kenwalger profile image
Ken W Alger

Spot on. The "dream cycle" consolidation pattern is the unsung hero of persistent local context. Without a scheduled phase to distill raw episodic logs into a refined, structured core, a local system inevitably degrades under its own weight.

I’m glad the SHA-256 fingerprinting resonates. Relying on filesystem timestamps (mtime) is an engineering trap in local-first systems. The moment a sync engine touches a file, a git checkout alters metadata, or a backup tool restores a directory, your entire temporal lineage is corrupted. A deterministic content hash means your identity is bound strictly to data truth, not OS telemetry.

Your point about injecting explicit temporal context into the chunk metadata is a masterclass in retrieval optimization. A sliding window knows text, but it doesn't know history. By appending the structural delta and temporal age directly to the chunk before embedding, you allow models like mxbai or local vector indices to dynamically weigh relevance without requiring a massive, expensive multi-step reasoning loop. It transforms a flat file search into a chronological narrative.

As for your question on memory consolidation, pruning, and the "forgetting" lifecycle: you are anticipating the exact trajectory of this specification. The next write-up breaks down how we treat decay not as a bug, but as an optimization strategy—using the consolidation tier to archive entropy while preserving structural truth.

Appreciate the high-signal breakdown. Stay tuned, the next phase drops very soon!