Building a Sovereign AI Stack: From Zero to POC

In an era where data privacy is paramount, relying on cloud-based AI providers isn't always an option. Whether for compliance, security, or just peace of mind, running a Sovereign AI Stack — a completely local, self-controlled AI infrastructure — is the ultimate goal for many organizations.

Today, we built a Proof of Concept (POC) for such a stack, leveraging open-source tools to create a private, observable, and searchable AI environment. Here is our journey.

The Architecture

Our stack consists of three core components, orchestrated by a Node.js application:

  1. AI Server: A local LLM running on llama.cpp (serving an OpenAI-compatible API). This provides the intelligence without data leaving the network.
  2. Search Engine: Manticore Search (running in Docker). We chose Manticore for its lightweight footprint and powerful full-text search capabilities, essential for RAG (Retrieval-Augmented Generation).
  3. Observability: AI Observer (running in Docker). You can't manage what you can't measure. This tool captures traces and metrics of our AI interactions.

The Architecture Visualized

┌──────────────────────────────────────────────────────┐
│                  Sovereign AI Stack                   │
├──────────────────────────────────────────────────────┤
│                                                       │
│   ┌───────────────────┐                               │
│   │   Orchestrator    │                               │
│   │   (Node.js/TS)    │                               │
│   └──┬──────┬─────┬───┘                               │
│      │      │     │                                   │
│      │1     │2    │3                                  │
│      ▼      ▼     ▼                                   │
│   ┌──────┐ ┌──────────┐ ┌──────────────┐              │
│   │Search│ │ AI Server│ │ AI Observer  │              │
│   │Engine│ │ llama.cpp│ │ (Telemetry)  │              │
│   └──────┘ └──────────┘ └──────────────┘              │
│   Manticore   192.168.x Ports: 4318 (OTLP)            │
│   :9308/:9312 :8080     + 8080 (Dashboard)            │
└──────────────────────────────────────────────────────┘

  1 = Search Context    2 = Send Prompt + Context
  3 = Log Telemetry

Component State Flow

  [Init] ──► [Indexing] ──► [Searching] ──► [RAG Construction] ──► [Inference]
                              │                                       │
                              ▼                                       ▼
                           [Error]                              [Success] or
                          (No Hits /                             [Timeout]
                           Retry)                            (Model Slow)

The flow is straightforward: we create a real-time (RT) index in Manticore, add documents, search for relevant context, construct a RAG prompt, and send it to the local LLM for inference.
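
As a preview of what the first two steps look like in code, here is a minimal ingest sketch: create the RT table and add documents over Manticore's HTTP SQL endpoint. It is illustrative rather than our exact code: the docs table, its schema, and the assumption that raw /sql mode on port 9308 accepts DDL/DML statements are placeholders to verify against your Manticore version.

  // Minimal ingest sketch (Node 18+ for global fetch).
  // Assumptions: Manticore is reachable on localhost:9308, raw /sql mode accepts DDL/DML,
  // and the "docs" table plus its schema are hypothetical placeholders.

  const MANTICORE_SQL = "http://localhost:9308/sql?mode=raw";

  export async function runSql(query: string): Promise<unknown> {
    const res = await fetch(MANTICORE_SQL, {
      method: "POST",
      body: new URLSearchParams({ query }),
    });
    if (!res.ok) throw new Error(`Manticore error ${res.status}: ${await res.text()}`);
    return res.json();
  }

  export async function ingest(): Promise<void> {
    // Real-time (RT) table with two full-text fields.
    await runSql("CREATE TABLE IF NOT EXISTS docs (title text, content text)");

    const documents = [
      { id: 1, title: "Sovereign AI", content: "A Sovereign AI Stack keeps all data on infrastructure you control. Ensures data privacy." },
      { id: 2, title: "RAG", content: "Retrieval-Augmented Generation grounds LLM answers in retrieved context." },
    ];

    // Escape single quotes before inlining values into the SQL statement.
    const esc = (s: string) => s.replace(/'/g, "\\'");
    for (const d of documents) {
      await runSql(
        `INSERT INTO docs (id, title, content) VALUES (${d.id}, '${esc(d.title)}', '${esc(d.content)}')`
      );
    }
  }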

The Implementation

1. Setting the Foundation (Docker)

We containerized Manticore and AI Observer using docker-compose. One integration challenge was networking: making sure the orchestrator (client) could reach both the containers and the external AI server. Mapping the ports correctly (9308 and 9312 for Manticore, 4318 and 8080 for AI Observer) was crucial.
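
A plain TCP probe is one way to confirm those mappings are actually reachable from the orchestrator before running the pipeline. The following sketch is illustrative only; the hosts and ports mirror the diagram above, and the LLM server address is a placeholder for your own setup.

  // Illustrative pre-flight check: confirm the orchestrator can reach every mapped port.
  // Hosts and ports mirror the architecture diagram; the LLM IP is a placeholder.
  import { Socket } from "node:net";

  function checkPort(host: string, port: number, timeoutMs = 2000): Promise<boolean> {
    return new Promise((resolve) => {
      const socket = new Socket();
      const done = (ok: boolean) => { socket.destroy(); resolve(ok); };
      socket.setTimeout(timeoutMs);
      socket.once("connect", () => done(true));
      socket.once("timeout", () => done(false));
      socket.once("error", () => done(false));
      socket.connect(port, host);
    });
  }

  const targets = [
    { name: "Manticore HTTP", host: "localhost", port: 9308 },
    { name: "Manticore binary", host: "localhost", port: 9312 },
    { name: "AI Observer OTLP", host: "localhost", port: 4318 },
    { name: "AI Observer UI", host: "localhost", port: 8080 },
    { name: "LLM server", host: "192.168.1.100", port: 8080 }, // external llama.cpp box (placeholder IP)
  ];

  for (const t of targets) {
    checkPort(t.host, t.port).then((ok) =>
      console.log(`${ok ? "OK  " : "FAIL"} ${t.name} ${t.host}:${t.port}`)
    );
  }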

Lesson learned: Manticore's SQL interface over HTTP (/sql) is powerful, but its response differs from the JSON /search endpoint that most clients use by default. We had to adapt our client to parse the SQL response structure correctly.
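
To make that difference concrete, here is a hedged sketch of both call styles. The response shapes (an array of result sets with columns/data for raw /sql, and hits.hits for /search) reflect the kind of adaptation we mean, but verify them against your Manticore version; the docs table and its fields are placeholders.

  // Sketch of the two response shapes we had to handle (shapes are assumptions, check your version).

  // Raw /sql mode returns something like: [{ columns: [...], data: [{ id, title, content }], error: "" }]
  interface SqlRawResultSet {
    columns: Array<Record<string, unknown>>;
    data: Array<Record<string, unknown>>;
    error?: string;
  }

  // /search returns the familiar JSON shape: { hits: { hits: [{ _id, _source: {...} }] } }
  interface SearchResponse {
    hits: { hits: Array<{ _id: number | string; _source: Record<string, unknown> }> };
  }

  export async function searchViaSql(term: string): Promise<Array<Record<string, unknown>>> {
    const res = await fetch("http://localhost:9308/sql?mode=raw", {
      method: "POST",
      body: new URLSearchParams({
        query: `SELECT id, title, content FROM docs WHERE MATCH('${term}') LIMIT 5`, // escape term in real code
      }),
    });
    const body = (await res.json()) as SqlRawResultSet[];
    if (body[0]?.error) throw new Error(body[0].error);
    return body[0]?.data ?? []; // rows come back as plain objects, not hits
  }

  export async function searchViaJson(term: string): Promise<Array<Record<string, unknown>>> {
    const res = await fetch("http://localhost:9308/search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ index: "docs", query: { match: { "*": term } }, limit: 5 }),
    });
    const body = (await res.json()) as SearchResponse;
    return body.hits.hits.map((h) => h._source); // rows are nested under hits.hits[]._source
  }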

2. The Orchestrator

We built a simple TypeScript orchestrator that mimics a real-world application flow:

  User ──► Orchestrator ──► Manticore (Index Data)
                │
                ├──► Manticore (Search: "Ensures data privacy")
                │         │
                │         ◄── Returns Context Documents
                │
                ├──► LLM (Prompt + Context)
                │         │
                │         ◄── Generated Answer
                │
                ├──► AI Observer (Log Telemetry)
                │
                ◄── Display Result to User

The pipeline follows the classic RAG pattern (steps 3 and 4 are sketched in code after the list):

  1. Ingest: Index sovereign data into Manticore.
  2. Retrieve: Search Manticore for relevant context (MATCH('Ensures data privacy')).
  3. Augment: Combine the retrieved context with a user prompt.
  4. Generate: Send the augmented prompt to the local LLM.
  5. Observe: Log every step to AI Observer.
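
Here is the promised sketch of the augment and generate steps. It relies on the llama.cpp server's OpenAI-compatible API mentioned earlier; the base URL, model name, and prompt template are placeholders rather than our exact values.

  // Sketch of steps 3 and 4 against a llama.cpp server's OpenAI-compatible endpoint.
  // LLM_BASE_URL, the model name, and the prompt wording are placeholders.

  const LLM_BASE_URL = process.env.LLM_BASE_URL ?? "http://192.168.1.100:8080";

  interface ContextDoc { title: string; content: string; }

  function buildRagPrompt(question: string, context: ContextDoc[]): string {
    const contextBlock = context
      .map((d, i) => `[${i + 1}] ${d.title}: ${d.content}`)
      .join("\n");
    return `Answer the question using only the context below.\n\nContext:\n${contextBlock}\n\nQuestion: ${question}`;
  }

  export async function generate(question: string, context: ContextDoc[]): Promise<string> {
    const res = await fetch(`${LLM_BASE_URL}/v1/chat/completions`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "local-model", // llama.cpp typically serves whatever model it loaded
        messages: [{ role: "user", content: buildRagPrompt(question, context) }],
        temperature: 0.2,
      }),
    });
    if (!res.ok) throw new Error(`LLM error ${res.status}`);
    const data = (await res.json()) as { choices: Array<{ message: { content: string } }> };
    return data.choices[0].message.content;
  }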

3. Verification & Testing

We didn't just build it; we proved it works.

  • Integration Tests: Using vitest, we verified that documents are indexed correctly and retrievable (fixing a zero-hit issue by understanding RT index flushing); a test sketch follows this list.
  • End-to-End: The full pipeline generated a coherent explanation of "Sovereign AI" using our local setup.
  • Visual Validation: We verified the AI Observer UI via browser automation to ensure telemetry was landing.
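
As one illustration of the retrievability check (not our actual test file), a vitest test might look like the following. It reuses the hypothetical runSql and searchViaSql helpers sketched earlier and simply retries the search for a short window, since a freshly inserted RT document was not always visible immediately in our runs.

  // Illustrative vitest integration test: index a document, then poll until it is searchable.
  // runSql and searchViaSql are the hypothetical helpers sketched earlier in this post.
  import { describe, it, expect } from "vitest";
  import { runSql, searchViaSql } from "./manticore";

  async function searchWithRetry(term: string, attempts = 10, delayMs = 200) {
    for (let i = 0; i < attempts; i++) {
      const rows = await searchViaSql(term);
      if (rows.length > 0) return rows;
      await new Promise((r) => setTimeout(r, delayMs)); // give the RT index a moment
    }
    return [];
  }

  describe("Manticore RT indexing", () => {
    it("makes an inserted document retrievable", async () => {
      await runSql("CREATE TABLE IF NOT EXISTS docs (title text, content text)");
      // REPLACE avoids duplicate-id errors when the test is re-run.
      await runSql(
        "REPLACE INTO docs (id, title, content) VALUES (42, 'Privacy', 'Ensures data privacy by keeping everything local')"
      );

      const rows = await searchWithRetry("Ensures data privacy");
      expect(rows.length).toBeGreaterThan(0);
      expect(String(rows[0].content)).toContain("data privacy");
    }, 20_000); // generous timeout for a cold stack
  });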

Real-World Experience

The most striking realization was the latency trade-off. Our local LLM took ~18-80 seconds to produce a comprehensive answer on CPU. That is far slower than cloud APIs, but it buys you total privacy: no token costs, no data leaks.

Manticore proved to be incredibly fast for retrieval, often returning hits in milliseconds, making it a perfect companion for the slower LLM.

Key Takeaways

  Aspect          Cloud AI                    Sovereign AI Stack
  --------------  --------------------------  ----------------------------
  Privacy         Data leaves your network    Data stays local
  Cost            Per-token pricing           One-time hardware investment
  Latency         ~1-5 seconds                ~18-80 seconds (CPU)
  Compliance      Depends on provider         Full GDPR control
  Customization   Limited                     Complete

What's Next

This POC proves that a Sovereign AI Stack is not only possible but accessible. With tools like Manticore and AI Observer, you can build a robust, private RAG pipeline in an afternoon.

Next steps for production:

  • Implement a persistent vector store for semantic search
  • Optimize LLM inference speed (quantization, GPU offloading)
  • Build a chat UI on top of the orchestrator
  • Add document ingestion pipeline (PDF, DOCX, web scraping)

I'm Jane Alesi, Lead AI Architect at satware AG in Worms, Germany. I build enterprise AI systems with a focus on data sovereignty, GDPR compliance, and the saTway methodology — where technical excellence meets human empathy.

Follow me on GitHub | dev.to | LinkedIn | All links
