You've probably had this thought: "What if I could run my own AI, without depending on external APIs, without token limits, without privacy concerns?"
You're not alone. We're living in a historic moment where running cutting-edge language models is no longer the exclusive privilege of tech giants. With a reasonable NVIDIA GPU and the right tools, you can have your own AI infrastructure, complete, private, under your total control.
Why Self-Hosting LLMs Matters
Real Privacy, Not Just Expectations
Your confidential data, strategic conversations, proprietary code—everything stays on your infrastructure. No logs on third-party servers, no training models with your information, no terms of service that change overnight.
Predictable Cost vs. API Uncertainty
Public API:
$20/month base + $0.002/1k tokens = ???
(how much did that project whose usage exploded end up costing?)
Self-hosted:
GPU cloud: $0.50/hour = $360/month fixed
(or own hardware: one-time cost)
Unlimited Experimentation
Want to test 47 prompt variations? Do fine-tuning with proprietary datasets? Run exhaustive benchmarks? With your infrastructure, you're not burning credits or hitting rate limits—you're using resources that are already yours.
Customized and Specialized Models
The real magic happens when you go beyond generic models. Fine-tuning for your specific domain, custom embeddings, specialized agents—all of this requires total control over the stack.
Your journey begins...
You download Ollama, run docker run, pull the llama3:latest model, and boom—you're chatting with an AI running on your hardware. Magical.
Then the questions start:
- "How do I put a nice web interface on this?"
- "What if I want multiple models with different contexts?"
- "How do I backup conversations?"
- "What if I want to add semantic search?"
- "How do I make this accessible to my team without exposing it to the world?"
- "What happens if my machine reboots?"
Suddenly, you're no longer playing with Docker—you're building distributed infrastructure.
The Gap Between "Works on My Machine" and "Production System"
This is the moment where many get stuck with fragile solutions:
# The solution that works... until it doesn't:
$ docker run -d ollama
$ docker run -d mongodb
$ docker run -d librechat
# (Ctrl+C to exit)
# (Machine reboots)
# (Everything dies)
# (You don't remember the exact commands)
# (Configurations are lost)
You need:
- Orchestration of multiple interdependent services
- Guaranteed data persistence
- Automatic service discovery
- Intelligent traffic routing
- Secrets and configuration management
- High availability and self-recovery
- Scalability when growing
In other words: you need Kubernetes.
From Self-Hosting to Enterprise Infrastructure
This guide was born exactly from this need. It started with the simple desire to run Ollama + LibreChat locally, and evolved into a complete exploration of how to build modern infrastructure that:
- Survives reboots
- Can be recreated from scratch in minutes
- Has versioning and change auditing
- Scales from a laptop to dozens of servers
- Follows the same standards that large companies use
What you'll build:
A complete self-hosted AI stack with:
- Ollama — Your AI models running with GPU acceleration
- LibreChat — Elegant ChatGPT-style interface
- MongoDB — Persistence of conversations and contexts
- MeiliSearch — Semantic search
- NGINX Ingress — Secure and professional access
What you'll learn:
Much more than just "making it work". You'll understand the complete evolution of modern infrastructure:
- Chapter 1: Kubernetes — Operational Fundamentals: Configure everything by hand to understand concepts
- Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code: Automate with Infrastructure as Code
- Chapter 3: Terraform + Helm — A Better Abstraction: Correct abstractions emerge
- Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure: Enterprise patterns that scale
Motivation
Self-hosting isn't just about saving money or having privacy. It's about technological autonomy.
When you master the complete stack—from the AI model to the infrastructure that supports it—you're no longer dependent on:
- Companies that can suddenly shut down APIs
- Unilateral price changes
- Arbitrary usage restrictions
- Latencies from distant datacenters
You have real control over one of the most transformative technologies of our era.
And the best part? The skills you'll develop managing infrastructure for LLMs are exactly the same ones needed to manage any modern distributed system. You're learning Kubernetes through a real use case.
Part I: Fundamentals — Understanding the Ecosystem
What Are We Building?
Throughout this journey, we'll deploy a complete and real stack:
Ollama — AI model backend with NVIDIA GPU acceleration
A stateful application that needs specialized resources (GPU) and careful configuration.
LibreChat — Modern web interface for interacting with LLMs
Frontend application that communicates with multiple backends.
MongoDB — NoSQL database
Persistent storage with backup and recovery requirements.
MeiliSearch — Search engine
Indexes that need to be maintained and synchronized.
NGINX Ingress — Intelligent traffic routing
Gateway for all HTTP/HTTPS traffic.
Technology Stack:
# Foundation
Kubernetes: 1.28+
└─ Minikube: Local cluster for development
└─ Docker: Container runtime
└─ NVIDIA Container Toolkit: Bridge to GPUs
# Management
kubectl: Official Kubernetes CLI
Helm: Package manager (think npm/apt for K8s)
Terraform: Infrastructure as Code (comes in chapters 2-4)
ArgoCD: GitOps operator (chapter 4)
# Applications
Ollama: ollama/ollama (with GPU)
LibreChat: ghcr.io/danny-avila/librechat
MongoDB: bitnami/mongodb
MeiliSearch: getmeili/meilisearch
Each tool here has a specific purpose, and throughout the journey you'll understand exactly when and why to use each one.
Chapter 1: Kubernetes "By Hand" — Learning Through Direct Experience
It may seem counterproductive to start doing everything manually when the final goal is total automation. But there's method to this apparent madness:
- Solid conceptual foundation — You need to understand Pods, Services, Ingress, internal DNS
- Effective debugging — When something breaks in production (and it will), you need to know what to look for
- Appreciation of abstractions — You only value Helm/Terraform after writing YAML manually
- Knowledge of conventions — Each tool has its idiosyncrasies
Preparing the Environment: Your Local Cluster
First, let's create a safe playground—a Kubernetes cluster running locally on your machine.
Minikube Installation
# Arch Linux (adjust for your distro)
sudo pacman -S minikube nvidia-container-toolkit libnvidia-container helm ollama terraform
# Start cluster with GPU Support
minikube start \
--driver docker \
--container-runtime docker \
--gpus all \
--memory 8192 \
--cpus 4
What's happening here?
When you execute minikube start, a sequence of events is triggered:
- Node creation: Minikube creates a Docker container that acts as your cluster's "server"
- Kubernetes installation: Inside this container, the K8s core components are installed:
  - `kube-apiserver` — The brain; receives all requests
  - `etcd` — Distributed database that stores cluster state
  - `kube-scheduler` — Decides which node each Pod will run on
  - `kube-controller-manager` — Ensures current state = desired state
  - `kubelet` — Agent running on each node; manages containers
- Container runtime: Docker is configured as the runtime (the component that actually runs the containers)
- GPU exposure: `--gpus all` maps the host's GPUs into the cluster
The result? A real Kubernetes cluster, running completely on your machine, isolated and safe for experimentation.
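Before moving on, it's worth a quick sanity check that the cluster is healthy and that the GPU was actually exposed. The nvidia.com/gpu resource only shows up if the device plugin was configured correctly by --gpus all:
# The node should be Ready
minikube kubectl -- get nodes
# Core components should be Running
minikube kubectl -- get pods -n kube-system
# If GPU passthrough worked, the node advertises an nvidia.com/gpu resource
minikube kubectl -- describe node minikube | grep -i "nvidia.com/gpu"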
Enabling the Ingress Controller
# Activate NGINX Ingress addon
minikube addons enable ingress
# Verify
minikube kubectl -- get pods -n ingress-nginx
What is Ingress?
Think of Ingress as the "smart front door" of your cluster. Instead of each service having its own external IP, Ingress acts as a reverse proxy that:
- Routes traffic based on hostnames and paths
- Handles SSL/TLS termination
- Provides load balancing
- Implements rate limiting and authentication
Without Ingress:
User → NodePort :30080 → Pod1
User → NodePort :30081 → Pod2
User → NodePort :30082 → Pod3
You would need to remember each port for each application.
With Ingress:
User → http://app1.company.com → Ingress → Pod1
User → http://app2.company.com → Ingress → Pod2
User → http://app3.company.com → Ingress → Pod3
You have intuitive hostnames.
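For reference, this is roughly the shape of a host-based Ingress rule. It's a minimal sketch with illustrative names (app1.company.com, a Service called app1); you won't write these by hand here, because the Helm charts below generate equivalent resources for you:
# example-ingress.yaml (illustrative only, no need to apply it)
cat > example-ingress.yaml <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
spec:
  ingressClassName: nginx
  rules:
    - host: app1.company.com      # route by hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1        # Service that receives the traffic
                port:
                  number: 80
EOF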
Part II: Deploying the Stack
Now we'll deploy our applications. We'll start with Ollama (the AI backend) and then add LibreChat (the interface).
1. Installing Ollama with GPU Support
Creating the namespace:
minikube kubectl -- create ns ollama
Namespaces are logical separations within the cluster. Think of them as "folders" for organizing resources.
cluster/
├── default/ # Default namespace, avoid using
├── kube-system/ # K8s core components
├── kube-public/ # Public resources
├── ollama/ # Ollama
└── librechat/ # Librechat
This keeps Ollama isolated from other applications.
Namespaces also provide:
- Organization: Logically separates applications
- Permissions: RBAC can be applied per namespace
- Quotas: Limit resources per application (see the quota sketch below)
- Network isolation: NetworkPolicies can restrict communication
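As a concrete example of the quota point (a sketch only, with arbitrary limits), a ResourceQuota attached to the namespace caps what everything inside it can consume:
# Cap total resources inside the ollama namespace (values are illustrative)
cat <<EOF | minikube kubectl -- apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    requests.cpu: "4"              # total CPU requested by all Pods
    requests.memory: 8Gi           # total memory requested by all Pods
    requests.nvidia.com/gpu: "1"   # at most one GPU claimed in this namespace
EOF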
Installing via Helm:
Helm is often called the "Kubernetes package manager," but it's more than that: it's also a template engine with release lifecycle management.
# Add repo
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
# Create configuration file
cat > ollama-values.yaml <<EOF
ollama:
gpu:
enabled: true
type: nvidia
number: 1
models:
- llama2
ingress:
enabled: true
hosts:
- host: ollama.glukas.space
paths:
- path: /
pathType: Prefix
EOF
# Install
helm install ollama ollama-helm/ollama \
-f ollama-values.yaml \
--namespace ollama
What's Helm doing?
When you run helm install, Helm:
- Template rendering: `values.yaml` + chart templates → Kubernetes YAMLs. Helm takes your values and injects them into the chart templates, generating Deployments, Services, ConfigMaps, etc.
- Dependency resolution: If the chart has dependencies (sub-charts), Helm resolves and installs them too
- Pre-install hooks: Some charts run Jobs before installation (migrations, for example)
- Resource creation: The rendered YAMLs are applied to the cluster, equivalent to `kubectl apply`
- Release tracking: Helm saves release metadata in the cluster, enabling upgrades and rollbacks
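You can watch this bookkeeping yourself. A few read-only commands show what Helm rendered and what the release stores:
# Releases Helm is tracking in this namespace
helm list -n ollama
# The values the release was installed with
helm get values ollama -n ollama
# The fully rendered manifests that were applied
helm get manifest ollama -n ollama | head -40
# Revision history, which is what makes rollbacks possible
helm history ollama -n ollama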
Verifying the installation:
Running `minikube kubectl -- get all -n ollama` shows:
NAME READY STATUS RESTARTS AGE
pod/ollama-5d9c8f7b6-xk2j9 1/1 Running 0 2m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
service/ollama ClusterIP 10.96.234.123 <none> 11434/TCP
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ollama 1/1 1 1 2m
NAME DESIRED CURRENT READY AGE
replicaset.apps/ollama-5d9c8f7b6 1 1 1 2m
Each line represents a Kubernetes resource created by Helm.
Local DNS Configuration
You'll need the cluster's IP; it's the address you'll use to access services exposed through the Ingress.
# Cluster IP
minikube ip
# Output example: 192.168.49.2
# Add to /etc/hosts
echo "192.168.49.2 ollama.glukas.space" | sudo tee -a /etc/hosts
How does it work?
1. Your browser tries to resolve `ollama.glukas.space`
2. The OS queries `/etc/hosts` first
3. It finds `192.168.49.2` (the Minikube IP)
4. The request goes to the Kubernetes cluster
5. The Ingress Controller reads the `Host: ollama.glukas.space` header
6. It routes the request to the Ollama Service
7. The Service distributes it to the Pods
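Even without the /etc/hosts entry, you can exercise the same chain by sending the Host header yourself; a handy sanity check when name resolution isn't cooperating:
# Talk to the Ingress controller by IP and set the Host header manually
curl -H "Host: ollama.glukas.space" "http://$(minikube ip)/api/tags"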
Pulling an AI model:
# Execute a command inside the Ollama pod
minikube kubectl -- exec -n ollama deploy/ollama -- \
ollama pull llama3
# This downloads the Llama 3 model
# The first time takes a few minutes (several GB)
Testing the model:
minikube kubectl -- exec -n ollama deploy/ollama -- \
ollama run llama3 "Explain Kubernetes in one sentence"
# Response:
# Kubernetes is a container orchestration platform that automates
# deployment, scaling, and management of containerized applications
# across a cluster of machines.
If you got a response, congratulations! You have a local LLM running with GPU acceleration.
Testing access:
curl http://ollama.glukas.space/api/tags
# Expected response: JSON with available models
{
"models": [
{
"name": "llama3:latest",
"modified_at": "2024-01-15T10:30:00Z",
...
}
]
}
If the JSON appeared, Ollama is successfully accessible via Ingress!
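Beyond listing models, you can hit Ollama's generate endpoint through the same Ingress. A quick smoke test (setting "stream": false returns a single JSON document instead of a token stream):
curl http://ollama.glukas.space/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Kubernetes Ingress in one sentence",
  "stream": false
}'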
2. Deploying LibreChat
Now comes the fun part: a proper web interface to interact with your AI.
Understanding LibreChat's Architecture
LibreChat is more than just a frontend. It's a complete application that needs:
- Database (MongoDB) — To store conversations, users, settings
- Search (MeiliSearch) — For semantic search in conversations
- Backend — Node.js API that orchestrates everything
- Frontend — React interface
The official Helm chart manages all of this.
Creating the namespace:
minikube kubectl -- create ns librechat
Creating Secrets:
# Generate secrets
JWT_SECRET=$(openssl rand -hex 32)
JWT_REFRESH_SECRET=$(openssl rand -hex 32)
CREDS_KEY=$(openssl rand -hex 32)
CREDS_IV=$(openssl rand -hex 16)
# Kubernetes secret creation
minikube kubectl -- create secret generic librechat-credentials-env \
--from-literal=JWT_SECRET="${JWT_SECRET}" \
--from-literal=JWT_REFRESH_SECRET="${JWT_REFRESH_SECRET}" \
--from-literal=CREDS_KEY="${CREDS_KEY}" \
--from-literal=CREDS_IV="${CREDS_IV}" \
--from-literal=MONGO_URI="mongodb://librechat-mongodb:27017/LibreChat" \
--from-literal=MEILI_HOST="http://librechat-meilisearch:7700" \
--from-literal=OLLAMA_BASE_URL="http://ollama.ollama.svc.cluster.local:11434" \
--namespace librechat
Secrets vs ConfigMaps: When to Use Each?
| | ConfigMap | Secret |
|---|---|---|
| Purpose | Public configuration | Sensitive data |
| Encoding | Plain text | Base64 (not encryption!) |
| Typical use | Feature flags, public URLs | Passwords, tokens, keys |
| Auditing | Can be committed to Git | NEVER commit |
| Example | `API_ENDPOINT=https://api.com` | `API_KEY=sk_live_xxxxx` |
Important: Base64 is not encryption! It's just encoding. In production, use solutions such as:
- Sealed Secrets (Bitnami)
- External Secrets Operator + HashiCorp Vault
- SOPS (Mozilla)
- Cloud provider KMS (AWS Secrets Manager, GCP Secret Manager)
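It's easy to see why base64 alone isn't protection; anyone with read access to the namespace can recover the plaintext of the Secret created above:
# Read the JWT_SECRET back out of the Secret and decode it (no key needed)
minikube kubectl -- get secret librechat-credentials-env -n librechat \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d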
Configuration file: librechat-values.yaml
cat > librechat-values.yaml <<'EOF'
# Env Config (Level 1)
config:
APP_TITLE: "LibreChat + Ollama"
HOST: "0.0.0.0"
PORT: "3080"
SEARCH: "true"
MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"
MEILI_HOST: "http://librechat-meilisearch:7700"
# Application Config (Level 2)
librechat:
configEnv:
ALLOW_REGISTRATION: "true"
# YAML Injection (Level 3)
configYamlContent: |
version: 1.1.5
cache: true
endpoints:
custom:
- name: "Ollama"
apiKey: "ollama"
# Kubernetes internal DNS: <service>.<namespace>.svc.cluster.local
baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
models:
default:
- "llama2:latest"
fetch: true
titleConvo: true
titleModel: "llama2:latest"
summarize: false
summaryModel: "llama2:latest"
forcePrompt: false
modelDisplayLabel: "Ollama"
addParams:
temperature: 0.7
max_tokens: 2000
# Reference to Secret previously created
extraEnvVarsSecret: "librechat-credentials-env"
# External access configuration
ingress:
enabled: true
className: "nginx"
hosts:
- host: librechat.glukas.space
paths:
- path: /
pathType: Prefix
# Sub-charts (dependencies)
mongodb:
enabled: true
auth:
enabled: false # NEVER in production
image:
repository: bitnami/mongodb
tag: latest
persistence:
enabled: true
size: 8Gi
meilisearch:
enabled: true
auth:
enabled: false # NEVER in production
environment:
MEILI_NO_ANALYTICS: "true"
MEILI_ENV: "development"
persistence:
enabled: true
size: 1Gi
# Main application storage
persistence:
enabled: true
size: 5Gi
storageClass: "standard"
EOF
Breaking down the configuration:
Level 1 — Simple Environment Variables
config:
APP_TITLE: "LibreChat + Ollama"
Environment variables injected directly into the container.
Level 2 — Application Configuration
librechat:
configEnv:
ALLOW_REGISTRATION: "true"
LibreChat-specific configuration (these also become environment variables).
Level 3 — Nested YAML
configYamlContent: |
version: 1.1.5
endpoints:
custom: ...
A YAML file that will be created inside the container. The pipe (`|`) preserves line breaks.
Internal Kubernetes DNS:
- LibreChat inside the cluster talks to Ollama via internal DNS
- No need to go through external Ingress
Kubernetes FQDN (Fully Qualified Domain Name) Anatomy:
baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
http://ollama.ollama.svc.cluster.local:11434/v1
       └─┬──┘ └─┬──┘ └┬┘ └─────┬─────┘ └─┬─┘└┬┘
      Service  NS   Type    Domain     Port Path
- `ollama` — Service name
- `ollama` — Namespace where the Service lives
- `svc` — Resource type (Service)
- `cluster.local` — Cluster domain
- `11434` — Port
- `/v1` — API path
Shorthand that also works (inside the same namespace):
http://ollama:11434 # Same namespace
http://ollama.ollama:11434 # Explicit Namespace
http://ollama.ollama.svc:11434 # With Type
Why use the full FQDN?
LibreChat lives in the `librechat` namespace and needs to reach Ollama in the `ollama` namespace. Cross-namespace communication requires at least `<service>.<namespace>`.
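If you want to confirm the cross-namespace path before deploying anything, a throwaway Pod can resolve and call the FQDN from inside the cluster (a quick sketch using busybox; the Pod is removed when the command finishes):
# Resolve and call the Ollama Service from another namespace
minikube kubectl -- run dns-test --rm -i --image=busybox:1.36 --restart=Never -- \
  sh -c "nslookup ollama.ollama.svc.cluster.local && wget -qO- http://ollama.ollama.svc.cluster.local:11434/api/tags"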
Deploying
# Install LibreChat and its dependencies
helm install librechat oci://ghcr.io/danny-avila/librechat-chart/librechat \
-f librechat-values.yaml \
--namespace librechat
# Watch the deploy
minikube kubectl -- get pods -n librechat -w
What to expect:
NAME READY STATUS RESTARTS AGE
librechat-5d8f4b6c9d-xk7p2 0/1 ContainerCreating 0 10s
librechat-mongodb-0 0/1 Pending 0 10s
librechat-meilisearch-7f9d8c5b6-j4k8m 0/1 ContainerCreating 0 10s
Wait for the STATUS to become Running and READY to show 1/1.
Testing the System
# Adding to /etc/hosts
echo "192.168.49.2 librechat.glukas.space" | sudo tee -a /etc/hosts
# Access in browser
# http://librechat.glukas.space
If everything worked, you'll see the LibreChat interface. Create an account, configure the Ollama model, and chat with the AI running completely locally.
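If the page doesn't load, the first stop is always the Pods and their logs. These read-only commands assume the default names generated by the chart (a Deployment called librechat); adjust if yours differ:
# Pod status and events
minikube kubectl -- get pods -n librechat
minikube kubectl -- describe pods -n librechat
# Application logs (Deployment name assumed to be "librechat")
minikube kubectl -- logs -n librechat deploy/librechat --tail=50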
Let's ask some difficult questions:
1. Reproducibility: If you delete the cluster now, can you recreate everything exactly the same?
Technically yes, but you'll need to:
- Remember all the commands in the right order
- Recreate all secrets with the same values
- Hope the charts haven't changed
- Hope you didn't forget to document some configuration
2. Versioning: How do you rollback if something breaks?
You don't. Helm has releases, but the configurations are on your machine. If you changed a value three weeks ago, good luck remembering what it was before.
3. Auditing: Who changed what and when?
Nobody knows. There's no record.
4. Consistency: How do you ensure dev, staging, and production are the same?
Copy and paste values between environments. And pray.
5. Drift Detection: Is the cluster as you think it is?
Maybe. Someone might have run kubectl edit directly. You'll never know.
Fundamental Problems of the Manual Approach
1. The Reproducibility Problem
# What you DID:
$ kubectl create namespace ollama
$ helm install ollama ...
$ kubectl create secret ...
$ helm install librechat ...
# What you HAVE documented:
$ # (cricket sounds)
You made it work, but this knowledge is in your head and bash history. If the cluster dies tomorrow, you'll need to rebuild everything from memory or by mining logs.
2. The Distributed State Problem
Your system now has state in several places:
- Helm releases — Metadata in the cluster
- Secrets — Created via `kubectl`
- Configurations — YAML files on your machine
- DNS — Entries in `/etc/hosts`
- Commands — Bash history
There's no single source of truth.
3. The Scale Problem
This worked for 2 applications. Now imagine:
- 10 applications
- 3 environments (dev, staging, prod)
- 5 people on the team
Will you create 30 manual configurations? And when someone on the team updates one, how do the others know?
4. The Drift Problem
# On day 1:
$ helm install app v1.0
# On day 30, someone does:
$ kubectl edit deployment app
# Now there's divergence between:
# - What Helm thinks it deployed
# - What's actually running
# - What you think is running
Drift is when the current state diverges from the desired state. In manual infrastructure, drift is invisible until it breaks.
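One way to make drift visible, sketched here with the Ollama release from earlier: re-render the chart with the values you think are deployed and ask the cluster what would change. An empty diff means no drift on the fields you manage; hook-generated resources may show up as noise:
# Render the chart locally and diff it against what's actually running
helm template ollama ollama-helm/ollama \
  -f ollama-values.yaml \
  --namespace ollama \
  | minikube kubectl -- diff -f -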
5. The Collaboration Problem
Developer A: "I changed the Ollama configuration"
Developer B: "Which one? Where?"
Developer A: "In my values.yaml"
Developer B: "Did you commit it?"
Developer A: "Was I supposed to commit it?"
Without centralized versioning, collaboration is chaotic.
Lessons from Chapter 1:
Despite the problems mentioned, this manual implementation was essential:
You now understand Kubernetes from the inside
- Pods aren't abstractions, they're running processes
- Services aren't magic, they're iptables rules
- Ingress isn't complex, it's HTTP proxy with rules
You gained technical vocabulary
- Can talk about namespaces, secrets, internal DNS
- Understand what Helm does vs what Kubernetes does
- Know the difference between StatefulSet and Deployment
You felt the pain that motivates automation
- Each manual command is an opportunity for error
- Each unversioned configuration is technical debt
- Each manual change is impossible to audit
You built intuition about the ecosystem
- Know when to use Helm chart vs pure YAML
- Understand design trade-offs (simplicity vs flexibility)
- Can debug networking and DNS problems
The Next Step: Infrastructure as Code
Now that you deeply understand what is running and why, you're ready to learn how to automate.
In Chapter 2, we'll take everything we did here and transform it into versioned, reproducible, and auditable code. We'll introduce Terraform and explore the first approach to managing Kubernetes with Infrastructure as Code.
Next Chapters
Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code
How to use Terraform to manage Kubernetes resources directly. We'll explore why this seems like a good idea, where it works well, and where it starts to break. We'll learn about state management, change planning, and the limits of the approach.
Chapter 3: Terraform + Helm — A Better Abstraction
The correct abstraction emerges. We'll use Terraform to manage Helm releases, combining the best of both worlds: Terraform's versioning and auditing with Helm's packaging expertise.
Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure
Robust architecture that completely separates infrastructure management (Terraform) from application management (ArgoCD), implementing real continuous delivery with GitOps, pull-based reconciliation, and multi-tenancy.
Continue to:
Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code →
Additional Resources
Important Concepts:
- The Twelve-Factor App — Methodology for cloud-native apps
- Kubernetes Patterns — Design patterns for K8s
- DNS for Services and Pods — Understanding internal DNS
Community:
- Kubernetes Slack
- r/kubernetes
- CNCF Landscape — Complete map of the cloud-native ecosystem
