How I modernized a legacy computer vision project into an LLM-powered, cloud-native research assistant using PostgreSQL, Redis Pub/Sub, React, and AWS Fargate.
Hi, I'm Arpad Kish. I am a Full-spectrum Software Engineer, DevSecOps practitioner, and the CEO of GreenEyes.AI. Over my career, I’ve navigated the rapid evolution of software deployment—transitioning from traditional on-premises servers to containerized, cloud-native microservices.
Back in 2014, for my Bachelor's thesis, I built a Three-Tier client-server application for a content-based image retrieval system. It was a monolithic C++ and Node.js app utilizing traditional computer vision techniques like SURF descriptors and CIELAB color space clustering. Recently, I took the foundational concepts from that academic research and replatformed them into a modern Cloud-Native architecture on Amazon Web Services (AWS).
This effort ultimately evolved into my latest open-source project: the Agentic Research App.
In this article, we are going to take a massive, code-heavy deep dive into the implementation details of this application. We'll explore how to build a multimodal Retrieval-Augmented Generation (RAG) pipeline, implement vector similarity search using raw SQL and pgvector, stream LLM responses in real time using Redis Pub/Sub and Server-Sent Events (SSE), inject dynamic agent personalities using the Strategy pattern, and deploy the entire system on AWS.
Part 1: The Architecture and the Stack
To handle intensive AI processing while maintaining a snappy, resilient user experience, the architecture is strictly divided into specialized layers.
- Frontend & API Gateway: React and Remix, utilizing Vite for bundling. The UI is styled with Tailwind CSS and Shadcn UI.
- Backend Logic: Node.js, built on top of my custom open-source MVC framework, @greeneyesai/api-utils.
- AI Integration: Direct integrations with the @google/generative-ai (Gemini 2.5 Flash, text-embedding-004) and openai (GPT-4o, text-embedding-3-large) SDKs.
- Database: PostgreSQL 16 Alpine running the pgvector extension.
- Cache & Pub/Sub: Redis for rate-limiting, session management, and streaming real-time LLM outputs.
Locally, everything is orchestrated via docker-compose. In production, this maps directly to managed AWS services: Elastic Container Service (ECS) with Fargate, Amazon Aurora PostgreSQL, and Amazon ElastiCache.
Part 2: Vector Memory with pgvector
At the heart of any RAG pipeline is the vector database. Instead of relying on expensive third-party SaaS vector databases (like Pinecone or Qdrant), I chose to integrate vector search directly into Postgres using pgvector. This keeps the architecture consolidated and reduces network latency between relational and vector queries.
Database Initialization
To set this up, I created a custom database Dockerfile that compiles pgvector (v0.6.2) from source during the build process to ensure compatibility with the Alpine Linux environment:
FROM postgres:16-alpine3.19
# Install build tools for pgvector
RUN apk add --no-cache git g++ make musl-dev postgresql-dev
# Install pgvector
RUN git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git \
&& cd pgvector \
&& make \
&& make install \
&& cd .. \
&& rm -rf pgvector
Next, the database schema is initialized. I defined a memory table that stores the user ID, the content of the prompt, the response, and the 768-dimensional embedding. To guarantee blazing-fast similarity searches even as the memory grows, I applied an ivfflat index:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE memory (
  id SERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  response TEXT NOT NULL,
  embedding VECTOR(768),
  hidden BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),
  deleted_at TIMESTAMP DEFAULT NULL
);
CREATE INDEX memory_embedding_idx
ON memory
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
ANALYZE memory;
Sequelize ORM Integration
In the Node.js backend, I use Sequelize. Because Sequelize doesn't natively support pgvector's distance operators for ordering out of the box, I bypassed the standard ORM query builder for the similarity search, opting for a raw SQL query using the <=> (cosine distance) operator, which matches the vector_cosine_ops index defined above.
Here is the exact implementation from src/api/lib/models/memory.ts:
getResultsByEmbedding: async function (
  userId: number,
  embedding: number[] | null,
  limit: number = 5
): Promise<MemoryModelType[]> {
  if (!embedding) return [];
  // Raw query: order by cosine distance so the vector_cosine_ops index is used.
  const sql = `
    SELECT *
    FROM memory
    WHERE user_id = $1
    ORDER BY embedding <=> '[${embedding.toString()}]'
    LIMIT $2;
  `;
  const results = await this.sequelize!.query(sql, {
    bind: [userId, limit],
    model: MemoryModel as MemoryModelTypeStatic,
    mapToModel: true,
  });
  return results ? results : [];
}
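For context, here is roughly how that query slots into the retrieval step of the pipeline. This is a sketch rather than the repo's exact code: the embed helper and the surrounding shapes are assumptions.

// Sketch: the recall step of the RAG flow, under the assumptions stated above.
async function recallMemories(
  assistant: { embed(text: string): Promise<number[]> },
  memory: { getResultsByEmbedding(userId: number, e: number[] | null, limit?: number): Promise<unknown[]> },
  userId: number,
  question: string
): Promise<unknown[]> {
  // 1. Embed the incoming question (768 dimensions, matching the memory table).
  const queryEmbedding = await assistant.embed(question);
  // 2. Fetch the five closest memory rows via the raw pgvector query above.
  return memory.getResultsByEmbedding(userId, queryEmbedding, 5);
}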
Part 3: Multimodal Text Extraction & Summarization
Historically, extracting text from images or complex PDFs required brittle Optical Character Recognition (OCR) pipelines (like Tesseract). In fact, my 2014 prototype heavily relied on a massive C++ CLI tool utilizing Tesseract and OpenCV.
Today, we can entirely bypass traditional OCR. When a user uploads a file, the Node.js backend handles the multipart form data using multer, retaining the file buffer in memory. We then pass this raw buffer directly to a multimodal LLM.
Using the Gemini 2.5 Flash model, we simply convert the file to a base64 string and prompt the model to act as a data extractor. Once extracted, we generate an embedding, store the raw text as a hidden memory node, and subsequently ask the LLM to summarize it.
async addFileToMemoryAndSummarizeIt(userId: number, file: Multer.File, options: GenerateOptions = {}): Promise<MemoryModelType> {
  // 1. Extract raw text multimodally
  const text = await this.extractFileText(file);
  const embedding = await this.embed(text);
  // 2. Store the massive raw text context hidden from the UI
  await this.memory.store(userId, `FILE_UPLOAD: ${file.originalname}`, text, embedding, true);
  // 3. Summarize it for the user
  const summary = await this.llm(`Summarize this:\n\n${text}`);
  const embedding2 = await this.embed(summary);
  // 4. Store the summary as a visible chat node
  return await this.memory.store(userId, `Summary of ${file.originalname}`, summary, embedding2);
}
protected async extractFileText(file: Multer.File): Promise<string> {
  const model = this.client.getGenerativeModel({ model: "gemini-2.5-flash" });
  const base64Data = file.buffer.toString("base64");
  const result = await model.generateContent([
    "Extract all readable text from this file. Output only the raw text.",
    { inlineData: { data: base64Data, mimeType: file.mimetype } },
  ]);
  return result.response.text();
}
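For completeness, here is a minimal sketch of how such an upload route could be wired with multer's in-memory storage. The route path, the assistant stand-in, and the response shape are illustrative assumptions, not the repo's exact controller:

import express, { Request, Response } from "express";
import multer from "multer";

// Assumed stand-in for the BaseAssistant instance used elsewhere in the app.
declare const assistant: {
  addFileToMemoryAndSummarizeIt(userId: number, file: Express.Multer.File): Promise<unknown>;
};

// Keep the file buffer in memory so it can be handed straight to the multimodal model.
const upload = multer({ storage: multer.memoryStorage() });
const router = express.Router();

// Illustrative route; the real path and controller wiring may differ.
router.post("/user/:userId/files", upload.single("file"), async (req: Request, res: Response) => {
  if (!req.file) {
    res.status(400).json({ error: "No file uploaded" });
    return;
  }
  const memoryNode = await assistant.addFileToMemoryAndSummarizeIt(Number(req.params.userId), req.file);
  res.json(memoryNode);
});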
Part 4: The Agentic Pipeline and Strategy Personalities
An agent isn't truly "agentic" unless it can utilize external tools and adapt to context. Before the final prompt is sent to the LLM, the backend constructs a highly contextual payload inside the generate method of the BaseAssistant class.
First, it fetches the vector search results from the database. Next, it iterates through any injected tools. For instance, I built a BingAPISearchTool powered by SerpAPI to allow the LLM to fetch real-time web results if the internal memory is insufficient.
// Inside BaseAssistant.ts
let memoryResults: any[] = [];
if (options.memoryQuery) {
  memoryResults = await this.memory.query(userId, embedding);
}

let toolResults: Record<string, string> = {};
if (options.tools) {
  for (const tool of options.tools) {
    toolResults[tool.name] = await tool.run(message);
  }
}

const contextualPrompt = `Context Memory: ${JSON.stringify(memoryResults)}\nTools: ${JSON.stringify(toolResults)}\nUser: ${message}`;
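The tools iterated above share a small common interface. Here is a sketch of what a SerpAPI-backed search tool could look like; the class shape and result formatting are assumptions based on the description, not the repo's exact implementation:

// Minimal tool contract assumed by the loop above.
interface Tool {
  name: string;
  run(query: string): Promise<string>;
}

// Sketch of a SerpAPI-backed Bing search tool; error handling is simplified.
class BingAPISearchTool implements Tool {
  public name = "bing_search";

  constructor(private apiKey: string) {}

  async run(query: string): Promise<string> {
    const url = new URL("https://serpapi.com/search.json");
    url.searchParams.set("engine", "bing");
    url.searchParams.set("q", query);
    url.searchParams.set("api_key", this.apiKey);

    const response = await fetch(url.toString());
    const data = await response.json();

    // Condense the top organic results into a compact context string for the LLM.
    const results: Array<{ title: string; snippet?: string }> = data.organic_results ?? [];
    return results
      .slice(0, 3)
      .map((r) => `${r.title}: ${r.snippet ?? ""}`)
      .join("\n");
  }
}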
Dynamic Personalities
To make the interaction engaging, the final prompt is wrapped in a "Personality" abstraction. By applying the Strategy pattern, users can select how the AI behaves.
For example, if a user selects the "Stewie Griffin" personality, the system applies the following transformation:
export class StewieGriffin extends BasePersonality {
  constructor() {
    super(
      "Stewie Griffin",
      "Sarcastic, genius-level verbal wit, villain energy with comedic undertones."
    );
  }
}

// The BasePersonality applies the prompt wrapper:
apply(message: string): string {
  return `【${this.name} Style】 ${message}\nInstructions: ${this.styleInstructions}`;
}
The LLM receives a master prompt combining the Memory Context, the Tool Results, the User's query, and strict behavioral instructions.
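Put together, the tail end of the pipeline composes roughly like this. This is a simplified sketch that assumes an options.personality field; the real method also handles streaming and persistence:

// Sketch: final prompt assembly (simplified; options.personality is assumed).
const personality = options.personality ?? new StewieGriffin();
const finalPrompt = personality.apply(contextualPrompt);

// Send the wrapped prompt to the model; the response is then streamed back (Part 5).
const answer = await this.llm(finalPrompt);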
Part 5: Real-Time Streaming with Redis and SSE
LLM generation takes time. If we waited for the entire response to generate before returning an HTTP payload, the frontend would hang, resulting in a poor user experience. To solve this, the backend streams the response back to the Remix frontend chunk-by-chunk using Server-Sent Events (SSE) and Redis Pub/Sub.
As the LLM stream resolves, we use our CacheProvider (Redis) to publish the string chunks to a user-specific Redis channel:
// Inside PublicController.ts
protected processChunk(userId: string, chunk: string): void {
  this.cacheProvider.publish!(`CHANNEL_${userId}`, chunk);
}
Simultaneously, a dedicated Express endpoint (/user/:userId/live-feed) keeps an open HTTP connection to the client. It subscribes to that exact Redis channel and writes each chunk to the stream:
public async eventSourceForUser(req: Request, res: Response): Promise<void> {
  const userId = req.params.userId;
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
  });
  this.cacheProvider!.subscribe!(`CHANNEL_${userId}`, (event: string) => {
    res.write(`data: ${event}\n\n`);
  });
  req.on("close", async () => {
    await this.cacheProvider!.unsubscribe!(`CHANNEL_${userId}`);
    res.end();
  });
}
On the frontend, my APIConnector utilizes the native browser EventSource API to capture these chunks and emit them into the React state, rendering a seamless, typewriter-like effect:
async connectToEventSourceByUserId(userId: number): Promise<void> {
  if (this.eventSource) this.eventSource.close();
  this.eventSource = new EventSource(`/api/v1/user/${userId}/live-feed`);
  this.eventSource.onmessage = (event: MessageEvent<string>) => {
    const payload = JSON.parse(event.data);
    this.emit("event", payload);
  };
}
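On the component side, a small hook can subscribe to those events and append chunks to local state as they arrive. This is a sketch under assumptions: the EventEmitter-style on/off methods and the { chunk } payload shape are illustrative, not the exact contract in the repo.

import { useEffect, useState } from "react";

// Minimal shape assumed for the APIConnector described above.
type LiveFeedConnector = {
  on(event: "event", handler: (payload: { chunk: string }) => void): void;
  off(event: "event", handler: (payload: { chunk: string }) => void): void;
  connectToEventSourceByUserId(userId: number): Promise<void>;
};

// Sketch: append streamed chunks to component state for the typewriter effect.
export function useLiveFeed(connector: LiveFeedConnector, userId: number): string {
  const [text, setText] = useState("");

  useEffect(() => {
    const onEvent = (payload: { chunk: string }) => {
      setText((previous) => previous + payload.chunk);
    };
    connector.on("event", onEvent);
    void connector.connectToEventSourceByUserId(userId);
    return () => connector.off("event", onEvent);
  }, [connector, userId]);

  return text;
}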
Part 6: Middleware, Resilience, and Rate Limiting
To ensure the API is production-ready, I built several custom Express middlewares.
1. Correlation IDs:
To trace requests across the microservices, I implemented a RequestIdExtendedMiddleware that intercepts each request, validates the incoming UUID using yup, and attaches an X-Correlation-Id header to the response. On the frontend, the Axios client automatically injects the same header into outgoing requests.
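A minimal sketch of the idea, with the yup schema and header handling simplified relative to the real middleware:

import { randomUUID } from "crypto";
import { NextFunction, Request, Response } from "express";
import * as yup from "yup";

const uuidSchema = yup.string().uuid();

// Sketch of a correlation-id middleware: reuse a valid incoming id, otherwise mint one.
export function requestIdMiddleware(req: Request, res: Response, next: NextFunction): void {
  const incoming = req.header("X-Correlation-Id");
  const correlationId =
    incoming && uuidSchema.isValidSync(incoming) ? incoming : randomUUID();

  // Make the id available downstream and echo it back to the client.
  res.locals.correlationId = correlationId;
  res.setHeader("X-Correlation-Id", correlationId);
  next();
}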
2. Redis Rate Limiting:
AI models are expensive. To prevent abuse, I implemented rate limiting using express-rate-limit backed by rate-limit-redis. By utilizing our existing Redis connection pool, we limit each IP to 1000 requests per 15-minute window seamlessly across all horizontally scaled containers.
this._middleware = rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window
  max: 1000, // requests allowed per IP per window
  standardHeaders: true,
  // Back the limiter with the shared Redis connection so counts stay
  // consistent across all horizontally scaled containers.
  store: new RedisStore({
    prefix: "RateLimiting:",
    sendCommand: async (...command: any[]) => {
      return this._cacheProvider.sendCommand!(command);
    },
  })
});
Part 7: Cloud-Native Deployment on AWS
Running Docker Compose locally is great, but production requires a fault-tolerant backbone. Drawing from my academic research on replatforming Three-Tier apps to Cloud-Native, I architected the production environment entirely on AWS.
Compute & Orchestration: ECS and AWS Fargate
The API instances run on Amazon Elastic Container Service (ECS) paired with AWS Fargate. Fargate is a serverless compute engine for containers, removing the need to provision underlying EC2 servers.
The instances sit behind an Application Load Balancer (ALB). I configured the ALB for zero-downtime deployments; the load balancer automatically checks the liveness and readiness probes of new containers before gracefully draining traffic from the old ones.
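Those probes are just lightweight HTTP endpoints exposed by the API itself. Here is a sketch of what the target group's health checks could hit, with illustrative paths and assumed dependency-check helpers:

import { Request, Response, Router } from "express";

// Assumed dependency probes; replace with real Sequelize/Redis pings.
declare function checkPostgres(): Promise<void>;
declare function checkRedis(): Promise<void>;

export const healthRouter = Router();

// Liveness: the process is up and the event loop is responsive.
healthRouter.get("/healthz", (_req: Request, res: Response) => {
  res.status(200).json({ status: "ok" });
});

// Readiness: downstream dependencies (Postgres, Redis) are reachable.
healthRouter.get("/readyz", async (_req: Request, res: Response) => {
  try {
    await Promise.all([checkPostgres(), checkRedis()]);
    res.status(200).json({ status: "ready" });
  } catch {
    res.status(503).json({ status: "not ready" });
  }
});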
Autoscaling is handled by ECS Service Auto Scaling on top of Fargate. Based on empirical testing, I found that an allocation of 0.5 vCPU and 1 GB of memory was optimal for the API container. The service maintains a baseline of 2 tasks and automatically scales up to 16 during heavy AI processing workloads.
Data & State: Amazon Aurora and ElastiCache
The vector database runs on Amazon Aurora PostgreSQL. Aurora offers the high availability and performance of commercial databases but at a fraction of the cost, handling our pgvector queries effortlessly. Alongside it, Amazon ElastiCache manages our Redis workloads (rate limiting, session storage, and SSE Pub/Sub).
Financial Implications
A major dilemma in cloud architecture is cost vs. availability. During the beta testing of this replatforming (processing over 15,000 documents), I analyzed the cost in detail.
By utilizing Serverless concepts, the cost of storing an object (which involves invoking an LLM, storing in Aurora, and writing files to S3/EFS) dropped to fractions of a cent:
- Aurora: ~$0.0002 per record
- Compute / Memory: ~$0.0004 per operation
Scaling dynamically rather than provisioning massive always-on EC2 instances allowed me to keep the baseline infrastructure cost incredibly low while still offering enterprise-grade resilience.
Conclusion
Modernizing a legacy application into a cloud-native, LLM-powered RAG system reveals just how far software engineering has come in a decade. We went from writing manual C++ computer vision clustering algorithms to utilizing highly capable multimodal LLMs and vector databases built right into PostgreSQL.
By combining the power of modern Node.js frameworks, React/Remix, and serverless AWS orchestration, we can build context-aware agents that don't just read and write to a database, but actively synthesize information, utilize tools, and adapt to user contexts at scale.
The Agentic Research App is open-source and available on my GitHub. I encourage you to explore the code, experiment with pgvector and Redis Pub/Sub, and try writing a few agent personalities of your own!