
Chapter 1: Kubernetes — Operational Fundamentals

You've probably had this thought: "What if I could run my own AI, without depending on external APIs, without token limits, without privacy concerns?"

You're not alone. We're living in a historic moment where running cutting-edge language models is no longer the exclusive privilege of tech giants. With a reasonable NVIDIA GPU and the right tools, you can have your own AI infrastructure: complete, private, and under your total control.

Why Self-Hosting LLMs Matters

Real Privacy, Not Just Expectations

Your confidential data, strategic conversations, proprietary code—everything stays on your infrastructure. No logs on third-party servers, no training models with your information, no terms of service that change overnight.

Predictable Cost vs. API Uncertainty

Public API:
$20/month base + $0.002/1k tokens = ???
(how much did that runaway project end up costing, again?)

Self-hosted:
GPU cloud: $0.50/hour = $360/month fixed
(or your own hardware: a one-time cost)

Unlimited Experimentation

Want to test 47 prompt variations? Do fine-tuning with proprietary datasets? Run exhaustive benchmarks? With your infrastructure, you're not burning credits or hitting rate limits—you're using resources that are already yours.

Customized and Specialized Models

The real magic happens when you go beyond generic models. Fine-tuning for your specific domain, custom embeddings, specialized agents—all of this requires total control over the stack.

Your journey begins...

You download Ollama, run docker run, pull the llama3:latest model, and boom—you're chatting with an AI running on your hardware. Magical.

Then the questions start:

  • "How do I put a nice web interface on this?"
  • "What if I want multiple models with different contexts?"
  • "How do I backup conversations?"
  • "What if I want to add semantic search?"
  • "How do I make this accessible to my team without exposing it to the world?"
  • "What happens if my machine reboots?"

Suddenly, you're no longer playing with Docker—you're building distributed infrastructure.

The Gap Between "Works on My Machine" and "Production System"

This is the moment where many get stuck with fragile solutions:

# The solution that works... until it doesn't:
$ docker run -d ollama
$ docker run -d mongodb
$ docker run -d librechat
# (Ctrl+C to exit)
# (Machine reboots)
# (Everything dies)
# (You don't remember the exact commands)
# (Configurations are lost)

You need:

  • Orchestration of multiple interdependent services
  • Guaranteed data persistence
  • Automatic service discovery
  • Intelligent traffic routing
  • Secrets and configuration management
  • High availability and self-recovery
  • Scalability when growing

In other words: you need Kubernetes.

From Self-Hosting to Enterprise Infrastructure

This guide was born exactly from this need. It started with the simple desire to run Ollama + LibreChat locally, and evolved into a complete exploration of how to build modern infrastructure that:

  • Survives reboots
  • Can be recreated from scratch in minutes
  • Has versioning and change auditing
  • Scales from a laptop to dozens of servers
  • Follows the same standards that large companies use

What you'll build:

A complete self-hosted AI stack with:

  • Ollama — Your AI models running with GPU acceleration
  • LibreChat — Elegant ChatGPT-style interface
  • MongoDB — Persistence of conversations and contexts
  • MeiliSearch — Semantic search
  • NGINX Ingress — Secure and professional access

What you'll learn:

Much more than just "making it work". You'll understand the complete evolution of modern infrastructure:

  1. Chapter 1: Kubernetes — Operational Fundamentals: Configure everything by hand to understand concepts
  2. Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code: Automate with Infrastructure as Code
  3. Chapter 3: Terraform + Helm — A Better Abstraction: Correct abstractions emerge
  4. Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure: Enterprise patterns that scale

Motivation

Self-hosting isn't just about saving money or having privacy. It's about technological autonomy.

When you master the complete stack—from the AI model to the infrastructure that supports it—you're no longer dependent on:

  • Companies that can suddenly shut down APIs
  • Unilateral price changes
  • Arbitrary usage restrictions
  • Latencies from distant datacenters

You have real control over one of the most transformative technologies of our era.

And the best part? The skills you'll develop managing infrastructure for LLMs are exactly the same ones needed to manage any modern distributed system. You're learning Kubernetes through a real use case.


Part I: Fundamentals — Understanding the Ecosystem

What Are We Building?

Throughout this journey, we'll deploy a complete and real stack:

Ollama — AI model backend with NVIDIA GPU acceleration

A stateful application that needs specialized resources (GPU) and careful configuration.

LibreChat — Modern web interface for interacting with LLMs

Frontend application that communicates with multiple backends.

MongoDB — NoSQL database

Persistent storage with backup and recovery requirements.

MeiliSearch — Search engine

Indexes that need to be maintained and synchronized.

NGINX Ingress — Intelligent traffic routing

Gateway for all HTTP/HTTPS traffic.

Technology Stack:

# Foundation
Kubernetes: 1.28+
  └─ Minikube: Local cluster for development
  └─ Docker: Container runtime
  └─ NVIDIA Container Toolkit: Bridge to GPUs

# Management
kubectl: Official Kubernetes CLI
Helm: Package manager (think npm/apt for K8s)
Terraform: Infrastructure as Code (comes in chapters 2-4)
ArgoCD: GitOps operator (chapter 4)

# Applications
Ollama: ollama/ollama (with GPU)
LibreChat: ghcr.io/danny-avila/librechat
MongoDB: bitnami/mongodb
MeiliSearch: getmeili/meilisearch

Each tool here has a specific purpose, and throughout the journey you'll understand exactly when and why to use each one.


Chapter 1: Kubernetes "By Hand" — Learning Through Direct Experience

It may seem counterproductive to start doing everything manually when the final goal is total automation. But there's method to this apparent madness:

  1. Solid conceptual foundation — You need to understand Pods, Services, Ingress, internal DNS
  2. Effective debugging — When something breaks in production (and it will), you need to know what to look for
  3. Appreciation of abstractions — You only value Helm/Terraform after writing YAML manually
  4. Knowledge of conventions — Each tool has its idiosyncrasies

Preparing the Environment: Your Local Cluster

First, let's create a safe playground—a Kubernetes cluster running locally on your machine.

Minikube Installation

# Arch Linux (adjust for your distro)
pacman -S minikube nvidia-container-toolkit libnvidia-container helm ollama terraform

# Start cluster with GPU Support
minikube start \
  --driver docker \
  --container-runtime docker \
  --gpus all \
  --memory 8192 \
  --cpus 4

What's happening here?

When you execute minikube start, a sequence of events is triggered:

  1. Node creation: Minikube creates a Docker container that acts as your cluster's "server"
  2. Kubernetes installation: Inside this container, K8s core components are installed:
    • kube-apiserver — The brain, receives all requests
    • etcd — Distributed database that stores state
    • kube-scheduler — Decides which node each Pod will run on
    • kube-controller-manager — Ensures current state = desired state
    • kubelet — Agent running on each node, manages containers
  3. Container runtime: Docker is configured as the runtime (who actually runs the containers)
  4. GPU exposure: --gpus all maps the host's GPUs into the cluster

The result? A real Kubernetes cluster, running completely on your machine, isolated and safe for experimentation.
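Before enabling anything else, it's worth a quick sanity check that the cluster really came up as described. These are standard kubectl commands; the node name minikube is the default, and the nvidia.com/gpu resource only appears if the NVIDIA device plugin is active in your setup.

# Is the node registered and Ready?
minikube kubectl -- get nodes

# Are the core components (apiserver, etcd, scheduler, ...) running?
minikube kubectl -- get pods -n kube-system

# Is the GPU advertised as an allocatable resource? (assumes the NVIDIA device plugin)
minikube kubectl -- describe node minikube | grep -i nvidia.com/gpu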

Enabling the Ingress Controller

# Activate NGINX Ingress addon
minikube addons enable ingress

# Verify
minikube kubectl -- get pods -n ingress-nginx

What is Ingress?

Think of Ingress as the "smart front door" of your cluster. Instead of each service having its own external IP, Ingress acts as a reverse proxy that:

  • Routes traffic based on hostnames and paths
  • Handles SSL/TLS termination
  • Provides load balancing
  • Implements rate limiting and authentication

Without Ingress:

User → NodePort :30080 → Pod1
User → NodePort :30081 → Pod2
User → NodePort :30082 → Pod3

You would need to remember which port belongs to each application.

With Ingress:

User → http://app1.company.com → Ingress → Pod1
User → http://app2.company.com → Ingress → Pod2
User → http://app3.company.com → Ingress → Pod3

You have intuitive hostnames.
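Under the hood, that routing is declared with an Ingress resource. A minimal sketch, assuming a hypothetical Service named app1 listening on port 80 — the NGINX Ingress Controller you just enabled consumes exactly this kind of object:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
spec:
  ingressClassName: nginx
  rules:
    - host: app1.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1
                port:
                  number: 80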

Part II: Deploying the Stack

Now we'll deploy our applications. We'll start with Ollama (the AI backend) and then add LibreChat (the interface).

1. Installing Ollama with GPU Support

Creating the namespace:

minikube kubectl -- create ns ollama

Namespaces are logical separations within the cluster. Think of them as "folders" for organizing resources.

cluster/
├── default/          # Default namespace, avoid using it
├── kube-system/      # K8s core components
├── kube-public/      # Public resources
├── ollama/           # Ollama
└── librechat/        # LibreChat

This keeps Ollama isolated from other applications.

Namespaces also provide:

  • Organization: Logically separate applications
  • Permissions: RBAC can be applied per namespace
  • Quotas: Limit resources per application (see the sketch after this list)
  • Network isolation: NetworkPolicies can restrict communication
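As one example of the quota point above, here is a minimal sketch of a ResourceQuota for the ollama namespace. The numbers are purely illustrative, and the requests.nvidia.com/gpu entry assumes the NVIDIA device plugin exposes that resource:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi
    requests.nvidia.com/gpu: "1"
    pods: "10"

Applied with minikube kubectl -- apply -f ollama-quota.yaml, it caps what the namespace as a whole can request.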

Installing via Helm:

Helm is often called "the Kubernetes package manager", but it's more than that: it's also a template engine with release lifecycle management.

# Add repo
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update

# Create configuration file
cat > ollama-values.yaml <<EOF
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    - llama2

ingress:
  enabled: true
  hosts:
    - host: ollama.glukas.space
      paths:
        - path: /
          pathType: Prefix
EOF

# Install
helm install ollama ollama-helm/ollama \
  -f ollama-values.yaml \
  --namespace ollama

What's Helm doing?

When you run helm install, Helm:

  1. Template rendering:
     values.yaml + chart templates → Kubernetes YAMLs

     Helm takes your values and injects them into the chart templates, generating Deployments, Services, ConfigMaps, etc.

  2. Dependency resolution:
     If the chart has dependencies (sub-charts), Helm resolves and installs them as well.

  3. Pre-install hooks:
     Some charts run Jobs before installation (migrations, setup tasks, etc.)

  4. Resource creation:
     The rendered YAMLs are applied to the cluster (the equivalent of kubectl apply)

  5. Release tracking:
     Helm saves release metadata in the cluster, enabling upgrades and rollbacks (see the commands sketched below)
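Release tracking is the part worth seeing with your own eyes. These are standard Helm commands, using the release name ollama from the install above:

helm list -n ollama                    # releases and their revision numbers
helm get values ollama -n ollama       # the values you supplied at install time
helm get manifest ollama -n ollama     # the rendered Kubernetes YAML Helm applied
helm history ollama -n ollama          # revision history
helm rollback ollama 1 -n ollama       # return to revision 1 if an upgrade misbehaves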

Verifying the installation:

By running minikube kubectl -- get all -n ollama:

NAME                               READY             STATUS              RESTARTS      AGE
pod/ollama-5d9c8f7b6-xk2j9         1/1               Running             0             2m

NAME                               TYPE              CLUSTER-IP         EXTERNAL-IP   PORT(S)
service/ollama                     ClusterIP         10.96.234.123      <none>        11434/TCP

NAME                               READY             UP-TO-DATE         AVAILABLE     AGE
deployment.apps/ollama             1/1               1                  1             2m

NAME                               DESIRED           CURRENT            READY         AGE
replicaset.apps/ollama-5d9c8f7b6   1                 1                  1             2m

Each line represents a Kubernetes resource created by Helm.

Local DNS Configuration

The Minikube IP is the address you'll use to access services exposed through the Ingress.

# Cluster IP
minikube ip
# Output example: 192.168.49.2

# Add to /etc/hosts
echo "192.168.49.2 ollama.glukas.space" | sudo tee -a /etc/hosts

How does it work?

  1. Your browser tries to resolve ollama.glukas.space
  2. The OS checks /etc/hosts first
  3. It finds 192.168.49.2 (the Minikube IP)
  4. The request goes to the K8s cluster
  5. The Ingress Controller reads the Host: ollama.glukas.space header (you can simulate this by hand, as shown after this list)
  6. It routes the request to the Ollama Service
  7. The Service load-balances across the Pods
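Because the decision happens on the Host header, you can simulate the whole chain without touching /etc/hosts at all, assuming the Minikube IP from the example:

curl -H "Host: ollama.glukas.space" http://192.168.49.2/api/tags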

Pulling an AI model:

# Execute a command inside the Ollama pod
minikube kubectl -- exec -n ollama deploy/ollama -- \
  ollama pull llama3

# This downloads the Llama 3 model
# The first time takes a few minutes (several GB)

Testing the model:

minikube kubectl -- exec -n ollama deploy/ollama -- \
  ollama run llama3 "Explain Kubernetes in one sentence"

# Response:
# Kubernetes is a container orchestration platform that automates
# deployment, scaling, and management of containerized applications
# across a cluster of machines.

If you got a response, congratulations! You have a local LLM running with GPU acceleration.

Testing access:

curl http://ollama.glukas.space/api/tags

# Expected response: JSON with available models
{
  "models": [
    {
      "name": "llama3:latest",
      "modified_at": "2024-01-15T10:30:00Z",
      ...
    }
  ]
}

If the JSON appeared, Ollama is successfully accessible via Ingress!

2. Installing LibreChat

Now comes the fun part: a proper web interface to interact with your AI.

Understanding LibreChat's Architecture

LibreChat is more than just a frontend. It's a complete application that needs:

  • Database (MongoDB) — To store conversations, users, settings
  • Search (MeiliSearch) — For semantic search in conversations
  • Backend — Node.js API that orchestrates everything
  • Frontend — React interface

The official Helm chart manages all of this.

Creating the namespace:

minikube kubectl -- create ns librechat

Creating Secrets:

# Generate secrets
JWT_SECRET=$(openssl rand -hex 32)
JWT_REFRESH_SECRET=$(openssl rand -hex 32)
CREDS_KEY=$(openssl rand -hex 32)
CREDS_IV=$(openssl rand -hex 16)

# Kubernetes secret creation
minikube kubectl -- create secret generic librechat-credentials-env \
  --from-literal=JWT_SECRET="${JWT_SECRET}" \
  --from-literal=JWT_REFRESH_SECRET="${JWT_REFRESH_SECRET}" \
  --from-literal=CREDS_KEY="${CREDS_KEY}" \
  --from-literal=CREDS_IV="${CREDS_IV}" \
  --from-literal=MONGO_URI="mongodb://librechat-mongodb:27017/LibreChat" \
  --from-literal=MEILI_HOST="http://librechat-meilisearch:7700" \
  --from-literal=OLLAMA_BASE_URL="http://ollama.ollama.svc.cluster.local:11434" \
  --namespace librechat

Secrets vs ConfigMaps: When to Use Each?

              ConfigMap                      Secret
Purpose       Public configuration           Sensitive data
Encoding      Plain text                     Base64 (not encryption!)
Typical Use   Feature flags, public URLs     Passwords, tokens, keys
Auditing      Can be committed to Git        NEVER commit
Example       API_ENDPOINT=https://api.com   API_KEY=sk_live_xxxxx

Important: Base64 is not encryption! It's just encoding (a quick demonstration follows the list below). In production, use solutions such as:

  • Sealed Secrets (Bitnami)
  • External Secrets Operator + HashiCorp Vault
  • SOPS (Mozilla)
  • Cloud provider KMS (AWS Secrets Manager, GCP Secret Manager)
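To make the "not encryption" point concrete, this is how anyone with read access to the Secret created above recovers a value in one line (standard kubectl and base64 usage):

minikube kubectl -- get secret librechat-credentials-env -n librechat \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d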

Configuration file: librechat-values.yaml

cat > librechat-values.yaml <<'EOF'
# Env Config (Level 1)
config:
  APP_TITLE: "LibreChat + Ollama"
  HOST: "0.0.0.0"
  PORT: "3080"
  SEARCH: "true"
  MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"
  MEILI_HOST: "http://librechat-meilisearch:7700"

# Application Config (Level 2)
librechat:
  configEnv:
    ALLOW_REGISTRATION: "true"

  # YAML Injection (Level 3)
  configYamlContent: |
    version: 1.1.5
    cache: true

    endpoints:
      custom:
        - name: "Ollama"
          apiKey: "ollama"
          # Kubernetes internal DNS: <service>.<namespace>.svc.cluster.local
          baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
          models:
            default:
              - "llama2:latest"
            fetch: true
          titleConvo: true
          titleModel: "llama2:latest"
          summarize: false
          summaryModel: "llama2:latest"
          forcePrompt: false
          modelDisplayLabel: "Ollama"
          addParams:
            temperature: 0.7
            max_tokens: 2000

# Reference to Secret previously created
extraEnvVarsSecret: "librechat-credentials-env"

# External access configuration
ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: librechat.glukas.space
      paths:
        - path: /
          pathType: Prefix

# Sub-charts (dependencies)
mongodb:
  enabled: true
  auth:
    enabled: false  # NEVER in production
  image:
    repository: bitnami/mongodb
    tag: latest
  persistence:
    enabled: true
    size: 8Gi

meilisearch:
  enabled: true
  auth:
    enabled: false  # NEVER in production
  environment:
    MEILI_NO_ANALYTICS: "true"
    MEILI_ENV: "development"
  persistence:
    enabled: true
    size: 1Gi

# Main application storage
persistence:
  enabled: true
  size: 5Gi
  storageClass: "standard"
EOF

Breaking the configuration down:

Level 1 — Simple Environment Variables

config:
  APP_TITLE: "LibreChat + Ollama"

Environment variables injected directly into the container.

Level 2 — Application Configuration

librechat:
  configEnv:
    ALLOW_REGISTRATION: "true"

LibreChat-specific configuration (these also end up as environment variables).

Level 3 — Nested YAML

configYamlContent: |
  version: 1.1.5
  endpoints:
    custom: ...

A YAML file that will be created inside the container. The pipe (|) preserves line breaks.

Internal Kubernetes DNS:

  • LibreChat inside the cluster talks to Ollama via internal DNS
  • No need to go through external Ingress

Kubernetes FQDN (Fully Qualified Domain Name) Anatomy:

baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"

http://ollama.ollama.svc.cluster.local:11434/v1
       └─┬──┘ └─┬──┘ └┬┘ └─────┬─────┘ └─┬─┘└┬┘
      Service  NS   Type    Domain     Port Path
  • ollama — Service Name
  • ollama — Namespace where the Service is
  • svc — Resource Type (service)
  • cluster.local — cluster Domain
  • 11434 — Port
  • /v1 — API Path

Shorter forms also work:

http://ollama:11434              # Only from within the same namespace
http://ollama.ollama:11434       # Explicit namespace
http://ollama.ollama.svc:11434   # With resource type

Why use the full FQDN?

LibreChat lives in the librechat namespace and needs to reach Ollama in the ollama namespace. Cross-namespace communication requires at least <service>.<namespace>.
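A quick way to confirm the cross-namespace name resolves is to curl it from a throwaway Pod inside the librechat namespace. A sketch, assuming the curlimages/curl image (any image with curl installed works):

minikube kubectl -- run dns-test -n librechat --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags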

Deploying

# Install LibreChat and its dependencies
helm install librechat oci://ghcr.io/danny-avila/librechat-chart/librechat \
  -f librechat-values.yaml \
  --namespace librechat

# Watch the deploy
minikube kubectl -- get pods -n librechat -w

What to Expect:

NAME                                     READY   STATUS              RESTARTS   AGE
librechat-5d8f4b6c9d-xk7p2               0/1     ContainerCreating   0          10s
librechat-mongodb-0                      0/1     Pending             0          10s
librechat-meilisearch-7f9d8c5b6-j4k8m    0/1     ContainerCreating   0          10s

Wait for STATUS to become Running and READY to show 1/1.

Testing the System

# Adding to /etc/hosts
echo "192.168.49.2 librechat.glukas.space" | sudo tee -a /etc/hosts

# Access in browser
# http://librechat.glukas.space

If everything worked, you'll see the LibreChat interface. Create an account, configure the Ollama model, and chat with the AI running completely locally.
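If the page doesn't come up, these are the usual first checks. The deploy/librechat name assumes the chart's default naming; adjust it to whatever minikube kubectl -- get all -n librechat shows:

# Did the Ingress pick up the host and get an address?
minikube kubectl -- get ingress -n librechat

# Events usually explain Pending or CrashLoopBackOff states
minikube kubectl -- describe pod -n librechat

# Application logs
minikube kubectl -- logs -n librechat deploy/librechat --tail=50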


Let's ask some difficult questions:

1. Reproducibility: If I delete the cluster now, can you recreate everything exactly the same?

Technically yes, but you'll need to:

  • Remember all the commands in the right order
  • Recreate all secrets with the same values
  • Hope the charts haven't changed
  • Hope you didn't forget to document some configuration

2. Versioning: How do you rollback if something breaks?

You don't. Helm has releases, but the configurations are on your machine. If you changed a value three weeks ago, good luck remembering what it was before.

3. Auditing: Who changed what and when?

Nobody knows. There's no record.

4. Consistency: How do you ensure dev, staging, and production are the same?

Copy and paste values between environments. And pray.

5. Drift Detection: Is the cluster as you think it is?

Maybe. Someone might have run kubectl edit directly. You'll never know.

Fundamental Problems of the Manual Approach

1. The Reproducibility Problem

# What you DID:
$ kubectl create namespace ollama
$ helm install ollama ...
$ kubectl create secret ...
$ helm install librechat ...

# What you HAVE documented:
$ # (cricket sounds)

You made it work, but that knowledge lives only in your head and your bash history. If the cluster dies tomorrow, you'll have to rebuild everything from memory or by digging through that history.

2. The Distributed State Problem

Your system now has state in several places:

  • Helm releases — Metadata in the cluster
  • Secrets — Created via kubectl
  • Configurations — YAML files on your machine
  • DNS — Entries in /etc/hosts
  • Commands — Bash history

There's no single source of truth.
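A partial stop-gap, until Chapter 2 fixes this properly, is to at least dump the scattered state into files you keep somewhere. These are standard Helm/kubectl commands using the release names from this chapter:

# Snapshot current configuration (not a substitute for Infrastructure as Code)
helm get values ollama -n ollama > ollama-values.backup.yaml
helm get values librechat -n librechat > librechat-values.backup.yaml
minikube kubectl -- get secret librechat-credentials-env -n librechat -o yaml > librechat-secret.backup.yaml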

3. The Scale Problem

This worked for 2 applications. Now imagine:

  • 10 applications
  • 3 environments (dev, staging, prod)
  • 5 people on the team

Will you create 30 manual configurations? And when someone on the team updates one, how do the others know?

4. The Drift Problem

# On day 1:
$ helm install app v1.0

# On day 30, someone does:
$ kubectl edit deployment app

# Now there's divergence between:
# - What Helm thinks it deployed
# - What's actually running
# - What you think is running

Drift is when the current state diverges from the desired state. In manual infrastructure, drift is invisible until it breaks.
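There's no built-in alarm for drift, but you can spot it by hand: re-emit what Helm believes it installed and diff it against the live cluster. A sketch using the hypothetical app release from the example above (kubectl diff reads manifests from stdin with -f -):

helm get manifest app | kubectl diff -f -
# Any output means the live objects no longer match what Helm deployed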

5. The Collaboration Problem

Developer A: "I changed the Ollama configuration"
Developer B: "Which one? Where?"
Developer A: "In my values.yaml"
Developer B: "Did you commit it?"
Developer A: "Was I supposed to commit it?"

Without centralized versioning, collaboration is chaotic.

Lessons from Chapter 1:

Despite the problems mentioned, this manual implementation was essential:

You now understand Kubernetes from the inside

  • Pods aren't abstractions, they're running processes
  • Services aren't magic, they're iptables rules
  • Ingress isn't complex, it's an HTTP proxy with rules

You gained technical vocabulary

  • Can talk about namespaces, secrets, internal DNS
  • Understand what Helm does vs what Kubernetes does
  • Know the difference between StatefulSet and Deployment

You felt the pain that motivates automation

  • Each manual command is an opportunity for error
  • Each unversioned configuration is technical debt
  • Each manual change is impossible to audit

You built intuition about the ecosystem

  • Know when to use Helm chart vs pure YAML
  • Understand design trade-offs (simplicity vs flexibility)
  • Can debug networking and DNS problems

The Next Step: Infrastructure as Code

Now that you deeply understand what is running and why, you're ready to learn how to automate.

In Chapter 2, we'll take everything we did here and transform it into versioned, reproducible, and auditable code. We'll introduce Terraform and explore the first approach to managing Kubernetes with Infrastructure as Code.


Next Chapters

Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code

How to use Terraform to manage Kubernetes resources directly. We'll explore why this seems like a good idea, where it works well, and where it starts to break. We'll learn about state management, change planning, and the limits of the approach.

Chapter 3: Terraform + Helm — A Better Abstraction

The correct abstraction emerges. We'll use Terraform to manage Helm releases, combining the best of both worlds: Terraform's versioning and auditing with Helm's packaging expertise.

Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure

Robust architecture that completely separates infrastructure management (Terraform) from application management (ArgoCD), implementing real continuous delivery with GitOps, pull-based reconciliation, and multi-tenancy.

Continue to:

Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code →

