I build AI tools to solve my own problems. A while back, I built NutriAgent to track my calories because I wanted to own my raw data. But recently, the problem wasn't mine; it was my wife's.
She uses LLMs differently than I do. While I use them for code or quick facts, she uses them as a therapist, a life coach, and a sounding board. Over the last year, she built a massive "Master Prompt" in Notion. It contained her medical history, key life events, emotional triggers, and ongoing projects.
It was 35,000 tokens long.
Every time she started a new chat, she had to manually copy-paste this wall of text just to get the AI up to speed. If she didn't, the advice was generic and useless.
She didn't need a search engine or a simple chat history. She needed a continuous brain.
I realized that the standard way we build AI memory with RAG (Retrieval Augmented Generation) wouldn't be enough. So I built Synapse AI Chat. It's an AI architecture that uses a Knowledge Graph to give an LLM "Deep Memory."
Here is how I built it, why I chose Knowledge Graphs over Vectors (to be fair, I used both), and how I handled the engineering messiness of making it work.
Why Standard RAG Wasn't Enough
Most AI memory systems today use Vector RAG. You chunk text, turn it into numbers (vectors), and find "similar" chunks later.
This works great for finding a specific policy in a PDF, but not that great for modeling human relationships and history.
Vectors find similarity, not structure.
If my wife tells the AI, "I'm feeling overwhelmed today," a Vector search might pull up a journal entry from three months ago where she mentioned "overwhelm."
But a Knowledge Graph understands the story. It knows:
"Project A" -> CAUSED -> "Stress" -> RESULTED_IN -> "Overwhelm"
I needed the AI to understand causality, not just keywords.
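To make that concrete, here is a minimal sketch of how a causal chain like that could be traversed in Neo4j from Python. The relationship types (CAUSED, RESULTED_IN), the name property, and the credentials are illustrative assumptions, not Synapse's actual schema:

from neo4j import GraphDatabase

# Placeholder connection details; adjust for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Walk backwards from a feeling to its root causes, up to 3 hops away.
CAUSAL_CHAIN = """
MATCH path = (root)-[:CAUSED|RESULTED_IN*1..3]->(feeling {name: $feeling})
RETURN [n IN nodes(path) | n.name] AS chain
"""

with driver.session() as session:
    for record in session.run(CAUSAL_CHAIN, feeling="Overwhelm"):
        print(" -> ".join(record["chain"]))  # e.g. Project A -> Stress -> Overwhelm

driver.close()

This is exactly the kind of query a pure vector index cannot express: the result is a path, not a ranked list of similar chunks.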
The Architecture Decision: Full Context Injection
Because I was using Google's Gemini models (which have a massive context window), I didn't need to retrieve just 5 small chunks of text. I could inject the entire compiled profile into the prompt.
My goal was to turn the raw chat logs into a structured graph, then flatten it back into a comprehensive "User Manual" for the AI to read before every interaction.
Graphiti, the framework I used for graph indexing, supports semantic search as a retrieval strategy, but I decided to take advantage of Gemini's big context window instead. The compiled graph output ended up smaller than the source, shrinking from almost 35k tokens to ~14k: it just combines the entities with their descriptions and relations in plain text, avoiding the extra tokens her old Master Prompt spent building a narrative.
Introducing Synapse: The Architecture
I split the project into two parts: the Body (the UI you talk to) and the Brain (the API that processes memory).
- The Frontend (Body): React 19 + Convex. I chose Convex because it handles real-time database syncing effortlessly, which makes the chat feel snappy.
- The Cortex (Brain): Python + FastAPI. This does the heavy data processing.
- The Memory Engine: Graphiti + Neo4j.
- The Models:
- Gemini 3 Flash: For the "heavy lifting" (building the graph).
- Gemini 2.5 Flash: For the actual chat (speed and cost).
Here is the high-level view: the React + Convex frontend (Body) talks to the Python Cortex (Brain), which builds the graph in Neo4j through Graphiti.
How It Works: The "Deep Memory" Pipeline
The system operates in three distinct phases.
Phase A: Conversation (The Chat)
When my wife chats with Synapse, she is talking to Gemini 2.5 Flash. It’s fast and fluid.
The trick is that the System Prompt isn't static. Before she sends her first message, I hydrate the prompt with a text summary of her entire Knowledge Graph. The AI immediately knows who she is, what she's worried about, and who her friends are.
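A minimal sketch of that hydration step, assuming the google-genai Python SDK; load_compiled_graph is a hypothetical helper standing in for the Cortex lookup:

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")  # placeholder

def load_compiled_graph(user_id: str) -> str:
    # Hypothetical helper: fetch this user's pre-compiled graph summary
    # (in Synapse it comes out of the Neo4j/Graphiti pipeline).
    return "#### 1. CONCEPTUAL DEFINITIONS & IDENTITY ####\n..."

def chat(user_id: str, message: str) -> str:
    # Inject the whole compiled Knowledge Graph as the system instruction,
    # so the model knows the user before the first token is generated.
    memory = load_compiled_graph(user_id)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=message,
        config=types.GenerateContentConfig(
            system_instruction=f"You are Synapse.\n\n## DEEP MEMORY ##\n{memory}"
        ),
    )
    return response.text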
Phase B: Ingestion (The "Sleep" Cycle)
This is where the magic happens. When she finishes a conversation (either by going quiet for 3 hours or by manually clicking the Consolidate button), I treat it like the AI taking a nap to consolidate memories.
We send the chat transcript to the Python Cortex. Here, I switch to Gemini 3 Flash.
Why the upgrade? Extracting entities from a messy human conversation is hard.
If she says, "I stopped taking medication X and started Y," a weaker model might just add "Taking Y" to the graph. Gemini 3 is smart enough to apply the full update logic:
- Find node "Medication X".
- Mark the relationship as STOPPED.
- Create node "Medication Y".
- Create relationship STARTED.
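For illustration, that extraction step can be framed as structured output. The operation shape below is my own simplification for this post, not Graphiti's internal schema:

from dataclasses import dataclass

@dataclass
class GraphOp:
    op: str                    # "invalidate_edge" | "create_node" | "create_edge"
    subject: str
    relation: str | None = None
    target: str | None = None

# What a capable extractor should emit for
# "I stopped taking medication X and started Y":
ops = [
    GraphOp("invalidate_edge", "User", "TAKING", "Medication X"),  # mark as STOPPED
    GraphOp("create_node", "Medication Y"),
    GraphOp("create_edge", "User", "TAKING", "Medication Y"),      # mark as STARTED
]

# A weaker model often emits only the last operation, silently leaving
# the stale "TAKING Medication X" edge alive in the graph.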
Phase C: Hydration (The Awakening)
When she returns, the next session is already prepared with the new compiled graph summary. The system doesn't just dump raw graph data into the prompt; it compiles the nodes and edges into a natural-language narrative.
def _format_compilation(definitions: list[str], relationships: list[str]) -> str:
    sections = []
    if definitions:
        sections.append(
            "#### 1. CONCEPTUAL DEFINITIONS & IDENTITY ####\n"
            "# (Understanding what these concepts mean specifically for this user)\n"
            + "\n".join(definitions)
        )
    if relationships:
        sections.append(
            "#### 2. RELATIONAL DYNAMICS & CAUSALITY ####\n"
            "# (How these concepts interact and evolve over time)\n"
            + "\n".join(relationships)
        )
    if not sections:
        return ""
    content = "\n\n".join(sections)
    return content
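Feeding it a couple of compiled facts (invented here for illustration) yields the plain-text block that gets injected:

profile = _format_compilation(
    definitions=['"Project A": the Q3 launch she is leading at work.'],
    relationships=['"Project A" -> CAUSED -> "Stress" (observed since May)'],
)
print(profile)  # both numbered sections, ready to drop into the system prompt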
The "Killer Feature": Memory Explorer
AI memory is usually a "Black Box." Users don't trust what they can't see.
I wanted my wife to be able to audit her own brain. I built a visualizer using react-force-graph. She can see bubbles representing her life: "Work," "Health," "Family."
If she sees a connection that is wrong (e.g., the AI thinks she likes a food she actually hates), she can edit the input and re-process the graph with new information like "I actually hate mushrooms now."
The system then processes that new input and updates the graph, creating new nodes and relations or invalidating the existing ones. This "Human-in-the-loop" approach builds massive trust.
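A sketch of what that re-processing could look like with graphiti-core. I am assuming the add_episode signature from Graphiti's README, and that the LLM credentials are configured via environment variables; check the current docs before copying:

import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti

async def apply_correction(text: str) -> None:
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")  # placeholder creds
    # Re-ingest the correction as a new episode; Graphiti's pipeline
    # deduplicates entities and can invalidate contradicted edges.
    await graphiti.add_episode(
        name="user-correction",
        episode_body=text,
        source_description="Memory Explorer edit",
        reference_time=datetime.now(timezone.utc),
    )
    await graphiti.close()

asyncio.run(apply_correction("I actually hate mushrooms now."))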
Engineering Challenges
Building this wasn't just about prompt engineering. There were real system challenges.
1. Handling Latency (The Job Queue)
Graph ingestion is slow. It takes anywhere from 60 to 200 seconds for Graphiti and Gemini to process a long conversation and update Neo4j.
I couldn't have the UI hang for 3 minutes.
I used Convex as a Job Queue. When the session ends, the UI returns immediately. Convex processes the job in the background, updating the UI state to "Processing..." and then "Memory Updated" when it's done.
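On the Python side, the Cortex can acknowledge the request immediately and do the heavy work out of band. Here is a minimal sketch using FastAPI's BackgroundTasks; the endpoint name and payload shape are my assumptions, and the real project may wire Convex and the Cortex differently:

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class IngestRequest(BaseModel):
    job_id: str
    transcript: str

def build_graph(job_id: str, transcript: str) -> None:
    # Hypothetical worker: run Graphiti + Gemini, then report the
    # job status back to Convex so the UI can flip to "Memory Updated".
    ...

@app.post("/ingest")
async def ingest(req: IngestRequest, background: BackgroundTasks):
    # Respond in milliseconds; the 60-200 second graph build happens
    # after the response is sent.
    background.add_task(build_graph, req.job_id, req.transcript)
    return {"status": "processing", "job_id": req.job_id}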
2. Handling Flakiness (The Retry Logic)
The Gemini API is powerful, but occasionally it throws 503 Service Unavailable errors, especially during heavy graph processing tasks.
I implemented an "Event-Driven Retry" system. If the graph build fails, I don't just crash. I schedule a retry with exponential backoff.
export const RETRY_DELAYS_MS = [
  0,            // Attempt 1: Immediate
  2 * 60_000,   // Attempt 2: +2 minutes (let the API cool down)
  10 * 60_000,  // Attempt 3: +10 minutes
  30 * 60_000,  // Attempt 4: +30 minutes
];

export const processJob = internalAction({
  args: { jobId: v.id("cortex_jobs") },
  handler: async (ctx, args) => {
    const job = await ctx.runQuery(internal.cortexJobs.get, { id: args.jobId });
    try {
      // 1. Do the heavy lifting (Call Gemini 3 Flash)
      // This is where 503 errors usually happen
      await ingestGraphData(ctx, job.payload);

      // 2. Mark complete if successful
      await ctx.runMutation(internal.cortexJobs.complete, { jobId: args.jobId });
    } catch (error) {
      // job.attempts is assumed to be incremented by a mutation elsewhere,
      // so each scheduled retry reads an up-to-date count.
      const nextAttempt = job.attempts + 1;
      if (nextAttempt >= job.maxAttempts) {
        // Stop the loop if we've tried too many times
        await ctx.runMutation(internal.cortexJobs.fail, {
          jobId: args.jobId,
          error: String(error),
        });
      } else {
        // 3. Schedule the retry using Convex's scheduler
        const delay = RETRY_DELAYS_MS[nextAttempt] ?? 30 * 60_000;
        await ctx.scheduler.runAfter(delay, internal.processor.processJob, {
          jobId: args.jobId,
        });
      }
    }
  },
});
3. Snappy UX
Convex's real-time sync was a lifesaver here. I didn't have to write complex WebSocket code. If the Python backend updates the status of a memory job in the database, the React UI updates instantly.
Token streaming also benefits from having Convex in the middle, since the Python backend talks to Convex rather than directly to the browser. If the user's browser closes or the connection drops, token generation continues: the backend keeps writing the answer to Convex, which streams it to the user whenever they reconnect.
The catch is that this can inflate Function usage, since each update counts as a call, so streaming updates are throttled to 100ms intervals to balance responsiveness with database write efficiency.
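A sketch of that throttling on the Python side of the stream; push_to_convex is a hypothetical wrapper around the Convex mutation that appends text to the message document:

import time

FLUSH_INTERVAL = 0.1  # at most one database write every 100 ms

def stream_with_throttle(token_iter, push_to_convex) -> None:
    # Buffer streamed tokens and flush them to Convex in batches,
    # trading a little latency for far fewer function calls.
    buffer: list[str] = []
    last_flush = time.monotonic()
    for token in token_iter:
        buffer.append(token)
        now = time.monotonic()
        if now - last_flush >= FLUSH_INTERVAL and buffer:
            push_to_convex("".join(buffer))
            buffer.clear()
            last_flush = now
    if buffer:  # flush the tail so no tokens are lost
        push_to_convex("".join(buffer))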
The Result
The difference is night and day.
Before: My wife dreaded starting a new thread because of the "context setup" tax. She felt like she was constantly repeating herself, and she carried the constant responsibility of creating break points to update the Master Prompt with new data before starting a new thread.
Now: She just talks. The system has a "Deep Memory" of about 10,000 tokens (compressed from months of chats) that is injected automatically.
She has different threads for different topics, but they all share the same Cortex. If she mentions a health issue in the "Work" thread (e.g., "My back hurts from sitting"), the "Health" thread knows about it the next time she logs in.
Conclusion
This project taught me that we are moving from "Horizontal" AI platforms (like ChatGPT, which knows a little about everything) to "Vertical" AI stacks that know everything about you. I’ve been watching how the ChatGPT and Gemini apps are starting to create user profiles and thread summaries to build this kind of memory. They are chasing the same goal: a truly personalized experience.
The key takeaway for me is that Vectors are great for search, but Knowledge Graphs are essential for understanding.
I keep enjoying building solutions to real problems. Nowadays, we have powerful tools to build awesome software faster than ever, but I've found that having a product vision and the technical understanding to architect a solution is still critical. That is the difference between building a quick prototype and solving a real problem.
This project is being used for real by my wife and me, and honestly, this is my favorite part of building products. The fun doesn't end when the architecture is done; it begins when people actually use it. Watching the product evolve, finding bugs, pivoting features, or even realizing that an initial idea didn't make sense at all, that is the journey. Building software is fun, but seeing it come alive and solve actual problems is magical.
The project is live at synapse-chat.juandago.dev if you want to see it in action.
The code is open source if you want to dig into the implementation:
- Frontend (Body): synapse-chat-ai
- Backend (Cortex): synapse-cortex
I'd love to hear your impressions and thoughts. Let's continue the conversation on X or connect on LinkedIn.



Top comments (20)
The 35k token master prompt thing is so real. I ran into something similar building TellMeMo (voice memo app that remembers context). Initially tried vector search for finding relevant past memos, but you're right, it finds keywords not relationships. Someone says "I'm stressed about the launch" and it pulls up any memo with "launch" in it, not the chain of decisions that led to launch anxiety. Been thinking about adding a lightweight knowledge graph layer for exactly this, the causal relationships are what matter. How are you handling graph updates when she adds conflicting info later? Like if her view on a project changes?
I would love to know more about TellMeMo. The idea sounds really cool.
Regarding your question: handling conflicting info was the main reason I decided to explore graphs, and especially the approach Graphiti takes on this as an open source framework. They analyze each new piece of information and pass it through different layers of processing: entity extraction, de-duplication against existing nodes, edge creation, and a step where an LLM is asked whether the new information invalidates an existing node or edge. So if, for example, in a past ingestion I said "My current project is behind schedule and I am worried," the graph will likely look like Me -> WORRIED_ABOUT -> Project, and the WORRIED_ABOUT edge may carry a fact describing why I am worried. If in a later episode I say I am OK now with the project and actually excited to get it done, the WORRIED_ABOUT edge gets updated with an invalid_at date, so we can start ignoring that edge in any retrieval or graph compilation.
oh thats really interesting about Graphiti, the invalidation layer is exactly the hard part right? like the graph knows I was worried about the project last week but this week everything shipped fine - so does it update the edge or keep both as a timeline?
for TellMeMo its honestly way simpler than what you built. its more of a meeting companion - it listens to your calls and gives you the tldr after plus action items. the memory part is just remembering context across meetings so it can say "hey this was discussed 3 meetings ago and nobody followed up." we went with vector search for that which works ok for finding related content but totally falls apart when relationships matter, like your graph approach handles way better.
the entity extraction + dedup pipeline you described is wild though. how expensive does that get per ingestion? like if someone dumps a 2 hour meeting transcript into it, are we talking seconds or minutes of LLM processing?
This is a really clean use of KGs for memory, especially the sleep cycle + audit UI bit, feels way more trustworthy than raw vector recall.
I have looked at the code and I saw you didn't use a CAG solution to send the information to Gemini. This is the preferred method to let an LLM use sensitive information like health information.
Also CAG lowers token use, so better for your wallet.
Is this because the data hasn't reached the 32K token minimum?
Yes, this is an interesting topic: when to use augmented retrieval vs. just dumping all the context into the window.
Technically speaking, the retrieval model sounds more scalable and token-efficient, but sometimes the tradeoff is not as appealing. I had the chance to work at a startup building a real estate agent that lets you find and explore different real estate offerings through WhatsApp, and we implemented a RAG + Search API tool to find the relevant projects and inject the relevant context so the agent could answer your questions. We started having issues with latency; the agent sometimes took several minutes to answer, and for the business that speed was critical. We tried hard to optimize as much as possible, but the bottleneck was the RAG strategy itself, which involved tool-call round trips and multiple LLM calls per request. Long story short, we experimented with dumping all the projects' JSON into context using Gemini's 1M-window models, and that not only simplified the whole agent, it reduced latency 10x, and the shocking part was that the accuracy stayed pretty much the same.
So I think RAG or GraphRAG is the logical, scalable solution, but if all the information you have fits in the context window, recent models are better at handling large contexts while still performing well. This won't scale indefinitely, since context rot is real, but I decided to start there and move to a mixed approach once we hit the threshold where the compiled graph stops performing well, or becomes too large and expensive to keep entirely in context.
I am also thinking of trying a hybrid approach where I dump only the most relevant nodes and relations (ranked by connections and recency) to keep a kind of fresh working memory alongside the chat log, and when the agent detects there may not be enough info or encounters unknown concepts, it can decide to explore the Graph to retrieve more context.
Fascinating work. I really love your approach — it resonates strongly with what I’m exploring myself, and I suspect I’ll spend some time studying your code and learning from it.
I’m also using knowledge graphs, and I couldn’t agree more: they’re powerful tools that are still relatively underestimated and underused. They definitely deserve more attention.
Thanks for the kind words. I really enjoyed exploring this, and I'll be happy to go through any technical details of what I built that could help, and to see what you are building with these tools.
The knowledge graph approach over pure vector RAG is a great call. We hit the exact same wall with our AI agents — they have persistent personalities and need to reference past context across sessions, not just find "similar" text chunks. The relationship-aware retrieval you describe (knowing that "she mentioned X relates to Y") is what makes conversations feel continuous rather than stateless.
Curious about your chunk sizing strategy for the graph nodes. Did you find that smaller semantic units worked better for relationship extraction, or did you need bigger context windows to capture the connections?
For this first iteration, I did not use augmented retrieval; I am compiling the whole graph and injecting the whole thing into context. For example, our personal usage so far creates a graph of ~200 nodes and 500 edges, and with the descriptions, the final prompt is about 30k tokens, which works fine within Gemini's context window.
But I am aware that it may not scale well. The graph will tend to grow over time, even though Graphiti invalidates nodes that are no longer valid and there is a minimum number of relations required to include a node in the compiled output. At some point the graph becomes too big to live in the context, and then I will face exactly what you mention: how much to extract that is relevant, yet small enough not to overload the context.
That is the next phase I am curious to explore: using the graph for actual augmented retrieval. For now, it is just a storage format and a compaction mechanism that turns full chat logs into concepts and relations, giving the system long-term memory.
I did something similar using my Obsidian as KG/RAG. I do versioning of my obsidian using git to see the growth of my knowledge and also use that setup with cursor to help me modify things
I also use Obsidian + Git for my personal notes, but that setup for KG/RAG with Cursor sounds really cool. If you have more info about it, please share; I would love to know more.
I really like the nuance you define between search and understanding, through a practical application. KGs are a great way to build out complex structure.
Yes, when I was thinking about how to structure this, I initially looked at the ChatGPT implementation, which stores simple text snippets to save user knowledge. It is simple and it may work; in the end, I am also compiling the graph into a plain text format. But I realized that graphs, and especially temporally aware graphs, are better at keeping knowledge updated over time, because all the information and connections associated with a node or concept are centralized.
For example, if I store a memory associated with a specific friend, all my memories about him are connected, so they are easy to track and update. That is harder to do with simple text snippets: if something changes over time, let's say he stops being my business partner, it is easier to invalidate the previous connection than to hunt down every related text snippet and update it.
However, even with just family usage, I started to see the drawbacks. Ingesting new memories or episodes is token hungry, and to avoid provider limits on tokens per minute, I have to work with long-running processes (~15 min), so it takes some extra engineering to keep this as optimal as possible. I am preparing a new post explaining that in a bit more depth.
Hi Julien, thank you for connecting 👋. I've been following your work and would be interested in discussing a potential collaboration when you have time.
Just when I thought I had learned enough from this project, a few days of real-life usage by my wife (she is a power user) proved otherwise: I already hit some limitations with the free tiers of the Convex and Gemini APIs.
I challenged myself to make this side project squeeze as much as possible out of the free tiers of the services I chose, but my wife has other plans for me.
It seems my Convex streaming implementation is not optimized enough, and I have already spent 800 MB of the 1 GB bandwidth budget.
And Graphiti's LLM usage for ingesting big sessions fails because the Google TPM (tokens per minute) limits on the Gemini free tier are too low for my wife's sessions.
There's more fun ahead for me in this project.