
Chapter 1: Kubernetes — Operational Fundamentals

You've probably had this thought: "What if I could run my own AI, without depending on external APIs, without token limits, without privacy concerns?"

You're not alone. We're living in a historic moment where running cutting-edge language models is no longer the exclusive privilege of tech giants. With a reasonable NVIDIA GPU and the right tools, you can have your own AI infrastructure: complete, private, and under your total control.

Why Self-Hosting LLMs Matters

Real Privacy, Not Just Expectations

Your confidential data, strategic conversations, proprietary code—everything stays on your infrastructure. No logs on third-party servers, no training models with your information, no terms of service that change overnight.

Predictable Cost vs. API Uncertainty

Public API:
$20/month base + $0.002/1k tokens = ???
(how much did that runaway project end up costing, again?)

Self-hosted:
GPU cloud: $0.50/hour = $360/month fixed
(or your own hardware: a one-time cost)

Unlimited Experimentation

Want to test 47 prompt variations? Do fine-tuning with proprietary datasets? Run exhaustive benchmarks? With your infrastructure, you're not burning credits or hitting rate limits—you're using resources that are already yours.

Customized and Specialized Models

The real magic happens when you go beyond generic models. Fine-tuning for your specific domain, custom embeddings, specialized agents—all of this requires total control over the stack.

Your journey begins...

You download Ollama, run docker run, pull the llama3:latest model, and boom—you're chatting with an AI running on your hardware. Magical.

Then the questions start:

  • "How do I put a nice web interface on this?"
  • "What if I want multiple models with different contexts?"
  • "How do I backup conversations?"
  • "What if I want to add semantic search?"
  • "How do I make this accessible to my team without exposing it to the world?"
  • "What happens if my machine reboots?"

Suddenly, you're no longer playing with Docker—you're building distributed infrastructure.

The Gap Between "Works on My Machine" and "Production System"

This is the moment where many get stuck with fragile solutions:

# The solution that works... until it doesn't:
$ docker run -d ollama
$ docker run -d mongodb
$ docker run -d librechat
# (Ctrl+C to exit)
# (Machine reboots)
# (Everything dies)
# (You don't remember the exact commands)
# (Configurations are lost)

You need:

  • Orchestration of multiple interdependent services
  • Guaranteed data persistence
  • Automatic service discovery
  • Intelligent traffic routing
  • Secrets and configuration management
  • High availability and self-recovery
  • Scalability when growing

In other words: you need Kubernetes.

From Self-Hosting to Enterprise Infrastructure

This guide was born exactly from this need. It started with the simple desire to run Ollama + LibreChat locally, and evolved into a complete exploration of how to build modern infrastructure that:

  • Survives reboots
  • Can be recreated from scratch in minutes
  • Has versioning and change auditing
  • Scales from a laptop to dozens of servers
  • Follows the same standards that large companies use

What you'll build:

A complete self-hosted AI stack with:

  • Ollama — Your AI models running with GPU acceleration
  • LibreChat — Elegant ChatGPT-style interface
  • MongoDB — Persistence of conversations and contexts
  • MeiliSearch — Semantic search
  • NGINX Ingress — Secure and professional access

What you'll learn:

Much more than just "making it work". You'll understand the complete evolution of modern infrastructure:

  1. Chapter 1: Kubernetes — Operational Fundamentals: Configure everything by hand to understand concepts
  2. Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code: Automate with Infrastructure as Code
  3. Chapter 3: Terraform + Helm — A Better Abstraction: Correct abstractions emerge
  4. Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure: Enterprise patterns that scale

Motivation

Self-hosting isn't just about saving money or having privacy. It's about technological autonomy.

When you master the complete stack—from the AI model to the infrastructure that supports it—you're no longer dependent on:

  • Companies that can suddenly shut down APIs
  • Unilateral price changes
  • Arbitrary usage restrictions
  • Latencies from distant datacenters

You have real control over one of the most transformative technologies of our era.

And the best part? The skills you'll develop managing infrastructure for LLMs are exactly the same ones needed to manage any modern distributed system. You're learning Kubernetes through a real use case.


Part I: Fundamentals — Understanding the Ecosystem

What Are We Building?

Throughout this journey, we'll deploy a complete and real stack:

Ollama — AI model backend with NVIDIA GPU acceleration

A stateful application that needs specialized resources (GPU) and careful configuration.

LibreChat — Modern web interface for interacting with LLMs

Frontend application that communicates with multiple backends.

MongoDB — NoSQL database

Persistent storage with backup and recovery requirements.

MeiliSearch — Search engine

Indexes that need to be maintained and synchronized.

NGINX Ingress — Intelligent traffic routing

Gateway for all HTTP/HTTPS traffic.

Technology Stack:

# Foundation
Kubernetes: 1.28+
  └─ Minikube: Local cluster for development
  └─ Docker: Container runtime
  └─ NVIDIA Container Toolkit: Bridge to GPUs

# Management
kubectl: Official Kubernetes CLI
Helm: Package manager (think npm/apt for K8s)
Terraform: Infrastructure as Code (comes in chapters 2-4)
ArgoCD: GitOps operator (chapter 4)

# Applications
Ollama: ollama/ollama (with GPU)
LibreChat: ghcr.io/danny-avila/librechat
MongoDB: bitnami/mongodb
MeiliSearch: getmeili/meilisearch

Each tool here has a specific purpose, and throughout the journey you'll understand exactly when and why to use each one.


Chapter 1: Kubernetes "By Hand" — Learning Through Direct Experience

It may seem counterproductive to start doing everything manually when the final goal is total automation. But there's method to this apparent madness:

  1. Solid conceptual foundation — You need to understand Pods, Services, Ingress, internal DNS
  2. Effective debugging — When something breaks in production (and it will), you need to know what to look for
  3. Appreciation of abstractions — You only value Helm/Terraform after writing YAML manually
  4. Knowledge of conventions — Each tool has its idiosyncrasies

Preparing the Environment: Your Local Cluster

First, let's create a safe playground—a Kubernetes cluster running locally on your machine.

Minikube Installation

# Arch Linux (adjust for your distro)
pacman -S minikube nvidia-container-toolkit libnvidia-container helm ollama terraform

# Start cluster with GPU Support
minikube start \
  --driver docker \
  --container-runtime docker \
  --gpus all \
  --memory 8192 \
  --cpus 4

What's happening here?

When you execute minikube start, a sequence of events is triggered:

  1. Node creation: Minikube creates a Docker container that acts as your cluster's "server"
  2. Kubernetes installation: Inside this container, K8s core components are installed:
    • kube-apiserver — The brain, receives all requests
    • etcd — Distributed database that stores state
    • kube-scheduler — Decides which node each Pod will run on
    • kube-controller-manager — Ensures current state = desired state
    • kubelet — Agent running on each node, manages containers
  3. Container runtime: Docker is configured as the runtime (who actually runs the containers)
  4. GPU exposure: --gpus all maps the host's GPUs into the cluster

The result? A real Kubernetes cluster, running completely on your machine, isolated and safe for experimentation.
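Before enabling anything else, it's worth a quick sanity check that the cluster really came up as described. These are standard kubectl commands; the node name minikube is the default, and the nvidia.com/gpu resource only appears if the NVIDIA device plugin is active in your setup.

# Is the node registered and Ready?
minikube kubectl -- get nodes

# Are the core components (apiserver, etcd, scheduler, ...) running?
minikube kubectl -- get pods -n kube-system

# Is the GPU advertised as an allocatable resource? (assumes the NVIDIA device plugin)
minikube kubectl -- describe node minikube | grep -i nvidia.com/gpu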

Enabling the Ingress Controller

# Activate NGINX Ingress addon
minikube addons enable ingress

# Verify
minikube kubectl -- get pods -n ingress-nginx

What is Ingress?

Think of Ingress as the "smart front door" of your cluster. Instead of each service having its own external IP, Ingress acts as a reverse proxy that:

  • Routes traffic based on hostnames and paths
  • Handles SSL/TLS termination
  • Provides load balancing
  • Implements rate limiting and authentication

Without Ingress:

User → NodePort :30080 → Pod1
User → NodePort :30081 → Pod2
User → NodePort :30082 → Pod3

You would need to remember which port belongs to each application.

With Ingress:

User → http://app1.company.com → Ingress → Pod1
User → http://app2.company.com → Ingress → Pod2
User → http://app3.company.com → Ingress → Pod3

You have intuitive hostnames.
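Under the hood, that routing is declared with an Ingress resource. A minimal sketch, assuming a hypothetical Service named app1 listening on port 80 — the NGINX Ingress Controller you just enabled consumes exactly this kind of object:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
spec:
  ingressClassName: nginx
  rules:
    - host: app1.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1
                port:
                  number: 80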

Part II: Deploying the Stack

Now we'll deploy our applications. We'll start with Ollama (the AI backend) and then add LibreChat (the interface).

1. Installing Ollama with GPU Support

Creating the namespace:

minikube kubectl -- create ns ollama

Namespaces are logical separations within the cluster. Think of them as "folders" for organizing resources.

cluster/
├── default/          # Default namespace, avoid using it
├── kube-system/      # K8s core components
├── kube-public/      # Public resources
├── ollama/           # Ollama
└── librechat/        # LibreChat

This keeps Ollama isolated from other applications.

Namespaces also provide:

  • Organization: Logically separate applications
  • Permissions: RBAC can be applied per namespace
  • Quotas: Limit resources per application (see the sketch after this list)
  • Network isolation: NetworkPolicies can restrict communication
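As one example of the quota point above, here is a minimal sketch of a ResourceQuota for the ollama namespace. The numbers are purely illustrative, and the requests.nvidia.com/gpu entry assumes the NVIDIA device plugin exposes that resource:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi
    requests.nvidia.com/gpu: "1"
    pods: "10"

Applied with minikube kubectl -- apply -f ollama-quota.yaml, it caps what the namespace as a whole can request.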

Installing via Helm:

Helm is often called "the Kubernetes package manager", but it's more than that: it's also a template engine with release lifecycle management.

# Add repo
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update

# Create configuration file
cat > ollama-values.yaml <<EOF
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    - llama2

ingress:
  enabled: true
  hosts:
    - host: ollama.glukas.space
      paths:
        - path: /
          pathType: Prefix
EOF

# Install
helm install ollama ollama-helm/ollama \
  -f ollama-values.yaml \
  --namespace ollama

What's Helm doing?

When you run helm install, Helm:

  1. Template rendering:
     values.yaml + chart templates → Kubernetes YAMLs

     Helm takes your values and injects them into the chart templates, generating Deployments, Services, ConfigMaps, etc.

  2. Dependency resolution:
     If the chart has dependencies (sub-charts), Helm resolves and installs them as well.

  3. Pre-install hooks:
     Some charts run Jobs before installation (migrations, setup tasks, etc.)

  4. Resource creation:
     The rendered YAMLs are applied to the cluster (the equivalent of kubectl apply)

  5. Release tracking:
     Helm saves release metadata in the cluster, enabling upgrades and rollbacks (see the commands sketched below)
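Release tracking is the part worth seeing with your own eyes. These are standard Helm commands, using the release name ollama from the install above:

helm list -n ollama                    # releases and their revision numbers
helm get values ollama -n ollama       # the values you supplied at install time
helm get manifest ollama -n ollama     # the rendered Kubernetes YAML Helm applied
helm history ollama -n ollama          # revision history
helm rollback ollama 1 -n ollama       # return to revision 1 if an upgrade misbehaves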

Verifying the installation:

By running minikube kubectl -- get all -n ollama:

NAME                               READY             STATUS              RESTARTS      AGE
pod/ollama-5d9c8f7b6-xk2j9         1/1               Running             0             2m

NAME                               TYPE              CLUSTER-IP         EXTERNAL-IP   PORT(S)
service/ollama                     ClusterIP         10.96.234.123      <none>        11434/TCP

NAME                               READY             UP-TO-DATE         AVAILABLE     AGE
deployment.apps/ollama             1/1               1                  1             2m

NAME                               DESIRED           CURRENT            READY         AGE
replicaset.apps/ollama-5d9c8f7b6   1                 1                  1             2m

Each line represents a Kubernetes resource created by Helm.

Local DNS Configuration

The Minikube IP is the address you'll use to access services exposed through the Ingress.

# Cluster IP
minikube ip
# Output example: 192.168.49.2

# Add to /etc/hosts
echo "192.168.49.2 ollama.glukas.space" | sudo tee -a /etc/hosts

How does it work?

  1. Your browser tries to resolve ollama.glukas.space
  2. The OS checks /etc/hosts first
  3. It finds 192.168.49.2 (the Minikube IP)
  4. The request goes to the K8s cluster
  5. The Ingress Controller reads the Host: ollama.glukas.space header (you can simulate this by hand, as shown after this list)
  6. It routes the request to the Ollama Service
  7. The Service load-balances across the Pods
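Because the decision happens on the Host header, you can simulate the whole chain without touching /etc/hosts at all, assuming the Minikube IP from the example:

curl -H "Host: ollama.glukas.space" http://192.168.49.2/api/tags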

Pulling an AI model:

# Execute a command inside the Ollama pod
minikube kubectl -- exec -n ollama deploy/ollama -- \
  ollama pull llama3

# This downloads the Llama 3 model
# The first time takes a few minutes (several GB)

Testing the model:

minikube kubectl -- exec -n ollama deploy/ollama -- \
  ollama run llama3 "Explain Kubernetes in one sentence"

# Response:
# Kubernetes is a container orchestration platform that automates
# deployment, scaling, and management of containerized applications
# across a cluster of machines.

If you got a response, congratulations! You have a local LLM running with GPU acceleration.

Testing access:

curl http://ollama.glukas.space/api/tags

# Expected response: JSON with available models
{
  "models": [
    {
      "name": "llama3:latest",
      "modified_at": "2024-01-15T10:30:00Z",
      ...
    }
  ]
}

If the JSON appeared, Ollama is successfully accessible via Ingress!

2. Installing LibreChat

Now comes the fun part: a proper web interface to interact with your AI.

Understanding LibreChat's Architecture

LibreChat is more than just a frontend. It's a complete application that needs:

  • Database (MongoDB) — To store conversations, users, settings
  • Search (MeiliSearch) — For semantic search in conversations
  • Backend — Node.js API that orchestrates everything
  • Frontend — React interface

The official Helm chart manages all of this.

Creating the namespace:

minikube kubectl -- create ns librechat

Creating Secrets:

# Generate secrets
JWT_SECRET=$(openssl rand -hex 32)
JWT_REFRESH_SECRET=$(openssl rand -hex 32)
CREDS_KEY=$(openssl rand -hex 32)
CREDS_IV=$(openssl rand -hex 16)

# Kubernetes secret creation
minikube kubectl -- create secret generic librechat-credentials-env \
  --from-literal=JWT_SECRET="${JWT_SECRET}" \
  --from-literal=JWT_REFRESH_SECRET="${JWT_REFRESH_SECRET}" \
  --from-literal=CREDS_KEY="${CREDS_KEY}" \
  --from-literal=CREDS_IV="${CREDS_IV}" \
  --from-literal=MONGO_URI="mongodb://librechat-mongodb:27017/LibreChat" \
  --from-literal=MEILI_HOST="http://librechat-meilisearch:7700" \
  --from-literal=OLLAMA_BASE_URL="http://ollama.ollama.svc.cluster.local:11434" \
  --namespace librechat

Secrets vs ConfigMaps: When to Use Each?

              ConfigMap                      Secret
Purpose       Public configuration           Sensitive data
Encoding      Plain text                     Base64 (not encryption!)
Typical Use   Feature flags, public URLs     Passwords, tokens, keys
Auditing      Can be committed to Git        NEVER commit
Example       API_ENDPOINT=https://api.com   API_KEY=sk_live_xxxxx

Important: Base64 is not encryption! It's just encoding (a quick demonstration follows the list below). In production, use solutions such as:

  • Sealed Secrets (Bitnami)
  • External Secrets Operator + HashiCorp Vault
  • SOPS (Mozilla)
  • Cloud provider KMS (AWS Secrets Manager, GCP Secret Manager)
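To make the "not encryption" point concrete, this is how anyone with read access to the Secret created above recovers a value in one line (standard kubectl and base64 usage):

minikube kubectl -- get secret librechat-credentials-env -n librechat \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d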

Configuration file: librechat-values.yaml

cat > librechat-values.yaml <<'EOF'
# Env Config (Level 1)
config:
  APP_TITLE: "LibreChat + Ollama"
  HOST: "0.0.0.0"
  PORT: "3080"
  SEARCH: "true"
  MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"
  MEILI_HOST: "http://librechat-meilisearch:7700"

# Application Config (Level 2)
librechat:
  configEnv:
    ALLOW_REGISTRATION: "true"

  # YAML Injection (Level 3)
  configYamlContent: |
    version: 1.1.5
    cache: true

    endpoints:
      custom:
        - name: "Ollama"
          apiKey: "ollama"
          # Kubernetes internal DNS: <service>.<namespace>.svc.cluster.local
          baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
          models:
            default:
              - "llama2:latest"
            fetch: true
          titleConvo: true
          titleModel: "llama2:latest"
          summarize: false
          summaryModel: "llama2:latest"
          forcePrompt: false
          modelDisplayLabel: "Ollama"
          addParams:
            temperature: 0.7
            max_tokens: 2000

# Reference to Secret previously created
extraEnvVarsSecret: "librechat-credentials-env"

# External access configuration
ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: librechat.glukas.space
      paths:
        - path: /
          pathType: Prefix

# Sub-charts (dependencies)
mongodb:
  enabled: true
  auth:
    enabled: false  # NEVER in production
  image:
    repository: bitnami/mongodb
    tag: latest
  persistence:
    enabled: true
    size: 8Gi

meilisearch:
  enabled: true
  auth:
    enabled: false  # NEVER in production
  environment:
    MEILI_NO_ANALYTICS: "true"
    MEILI_ENV: "development"
  persistence:
    enabled: true
    size: 1Gi

# Main application storage
persistence:
  enabled: true
  size: 5Gi
  storageClass: "standard"
EOF

Breaking the configuration down:

Level 1 — Simple Environment Variables

config:
  APP_TITLE: "LibreChat + Ollama"

Environment variables injected directly into the container.

Level 2 — Application Configuration

librechat:
  configEnv:
    ALLOW_REGISTRATION: "true"

LibreChat-specific configuration (these also end up as environment variables).

Level 3 — Nested YAML

configYamlContent: |
  version: 1.1.5
  endpoints:
    custom: ...

A YAML file that will be created inside the container. The pipe (|) preserves line breaks.

Internal Kubernetes DNS:

  • LibreChat inside the cluster talks to Ollama via internal DNS
  • No need to go through external Ingress

Kubernetes FQDN (Fully Qualified Domain Name) Anatomy:

baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"

http://ollama.ollama.svc.cluster.local:11434/v1
       └─┬──┘ └─┬──┘ └┬┘ └─────┬─────┘ └─┬─┘└┬┘
      Service  NS   Type    Domain     Port Path
  • ollama — Service Name
  • ollama — Namespace where the Service is
  • svc — Resource Type (service)
  • cluster.local — cluster Domain
  • 11434 — Port
  • /v1 — API Path

Shorter forms also work:

http://ollama:11434              # Only from within the same namespace
http://ollama.ollama:11434       # Explicit namespace
http://ollama.ollama.svc:11434   # With resource type

Why use the full FQDN?

LibreChat lives in the librechat namespace and needs to reach Ollama in the ollama namespace. Cross-namespace communication requires at least <service>.<namespace>.
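A quick way to confirm the cross-namespace name resolves is to curl it from a throwaway Pod inside the librechat namespace. A sketch, assuming the curlimages/curl image (any image with curl installed works):

minikube kubectl -- run dns-test -n librechat --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags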

Deploying

# Install LibreChat and its dependencies
helm install librechat oci://ghcr.io/danny-avila/librechat-chart/librechat \
  -f librechat-values.yaml \
  --namespace librechat

# Watch the deploy
minikube kubectl -- get pods -n librechat -w

What to Expect:

NAME                                     READY   STATUS              RESTARTS   AGE
librechat-5d8f4b6c9d-xk7p2               0/1     ContainerCreating   0          10s
librechat-mongodb-0                      0/1     Pending             0          10s
librechat-meilisearch-7f9d8c5b6-j4k8m    0/1     ContainerCreating   0          10s

Wait for STATUS to become Running and READY to show 1/1.

Testing the System

# Adding to /etc/hosts
echo "192.168.49.2 librechat.glukas.space" | sudo tee -a /etc/hosts

# Access in browser
# http://librechat.glukas.space

If everything worked, you'll see the LibreChat interface. Create an account, configure the Ollama model, and chat with the AI running completely locally.
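If the page doesn't come up, these are the usual first checks. The deploy/librechat name assumes the chart's default naming; adjust it to whatever minikube kubectl -- get all -n librechat shows:

# Did the Ingress pick up the host and get an address?
minikube kubectl -- get ingress -n librechat

# Events usually explain Pending or CrashLoopBackOff states
minikube kubectl -- describe pod -n librechat

# Application logs
minikube kubectl -- logs -n librechat deploy/librechat --tail=50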


Let's ask some difficult questions:

1. Reproducibility: If I delete the cluster now, can you recreate everything exactly the same?

Technically yes, but you'll need to:

  • Remember all the commands in the right order
  • Recreate all secrets with the same values
  • Hope the charts haven't changed
  • Hope you didn't forget to document some configuration

2. Versioning: How do you rollback if something breaks?

You don't. Helm has releases, but the configurations are on your machine. If you changed a value three weeks ago, good luck remembering what it was before.

3. Auditing: Who changed what and when?

Nobody knows. There's no record.

4. Consistency: How do you ensure dev, staging, and production are the same?

Copy and paste values between environments. And pray.

5. Drift Detection: Is the cluster as you think it is?

Maybe. Someone might have run kubectl edit directly. You'll never know.

Fundamental Problems of the Manual Approach

1. The Reproducibility Problem

# What you DID:
$ kubectl create namespace ollama
$ helm install ollama ...
$ kubectl create secret ...
$ helm install librechat ...

# What you HAVE documented:
$ # (cricket sounds)

You made it work, but that knowledge lives only in your head and your bash history. If the cluster dies tomorrow, you'll have to rebuild everything from memory or by digging through that history.

2. The Distributed State Problem

Your system now has state in several places:

  • Helm releases — Metadata in the cluster
  • Secrets — Created via kubectl
  • Configurations — YAML files on your machine
  • DNS — Entries in /etc/hosts
  • Commands — Bash history

There's no single source of truth.
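A partial stop-gap, until Chapter 2 fixes this properly, is to at least dump the scattered state into files you keep somewhere. These are standard Helm/kubectl commands using the release names from this chapter:

# Snapshot current configuration (not a substitute for Infrastructure as Code)
helm get values ollama -n ollama > ollama-values.backup.yaml
helm get values librechat -n librechat > librechat-values.backup.yaml
minikube kubectl -- get secret librechat-credentials-env -n librechat -o yaml > librechat-secret.backup.yaml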

3. The Scale Problem

This worked for 2 applications. Now imagine:

  • 10 applications
  • 3 environments (dev, staging, prod)
  • 5 people on the team

Will you create 30 manual configurations? And when someone on the team updates one, how do the others know?

4. The Drift Problem

# On day 1:
$ helm install app v1.0

# On day 30, someone does:
$ kubectl edit deployment app

# Now there's divergence between:
# - What Helm thinks it deployed
# - What's actually running
# - What you think is running

Drift is when the current state diverges from the desired state. In manual infrastructure, drift is invisible until it breaks.
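There's no built-in alarm for drift, but you can spot it by hand: re-emit what Helm believes it installed and diff it against the live cluster. A sketch using the hypothetical app release from the example above (kubectl diff reads manifests from stdin with -f -):

helm get manifest app | kubectl diff -f -
# Any output means the live objects no longer match what Helm deployed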

5. The Collaboration Problem

Developer A: "I changed the Ollama configuration"
Developer B: "Which one? Where?"
Developer A: "In my values.yaml"
Developer B: "Did you commit it?"
Developer A: "Was I supposed to commit it?"

Without centralized versioning, collaboration is chaotic.

Lessons from Chapter 1:

Despite the problems mentioned, this manual implementation was essential:

You now understand Kubernetes from the inside

  • Pods aren't abstractions, they're running processes
  • Services aren't magic, they're iptables rules
  • Ingress isn't complex, it's an HTTP proxy with rules

You gained technical vocabulary

  • Can talk about namespaces, secrets, internal DNS
  • Understand what Helm does vs what Kubernetes does
  • Know the difference between StatefulSet and Deployment

You felt the pain that motivates automation

  • Each manual command is an opportunity for error
  • Each unversioned configuration is technical debt
  • Each manual change is impossible to audit

You built intuition about the ecosystem

  • Know when to use Helm chart vs pure YAML
  • Understand design trade-offs (simplicity vs flexibility)
  • Can debug networking and DNS problems

The Next Step: Infrastructure as Code

Now that you deeply understand what is running and why, you're ready to learn how to automate.

In Chapter 2, we'll take everything we did here and transform it into versioned, reproducible, and auditable code. We'll introduce Terraform and explore the first approach to managing Kubernetes with Infrastructure as Code.


Next Chapters

Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code

How to use Terraform to manage Kubernetes resources directly. We'll explore why this seems like a good idea, where it works well, and where it starts to break. We'll learn about state management, change planning, and the limits of the approach.

Chapter 3: Terraform + Helm — A Better Abstraction

The correct abstraction emerges. We'll use Terraform to manage Helm releases, combining the best of both worlds: Terraform's versioning and auditing with Helm's packaging expertise.

Chapter 4: GitOps with Terraform + ArgoCD — A LLM Self-Hosting Product Infrastructure

Robust architecture that completely separates infrastructure management (Terraform) from application management (ArgoCD), implementing real continuous delivery with GitOps, pull-based reconciliation, and multi-tenancy.

Continue to:

Chapter 2: Terraform + Kubernetes Provider — Infrastructure as Code →

