The Complete Guide to Ollama: Run Large Language Models Locally

Imagine having the power of GPT-4 or Claude running entirely on your laptop—no internet required, no API costs, and complete privacy. Until recently, this seemed like a distant dream reserved for tech giants with massive server farms. But the landscape has changed dramatically. Thanks to Ollama, anyone with a modern computer can now run sophisticated AI models locally, whether you're coding on a plane at 35,000 feet, analyzing sensitive documents that can never touch the cloud, or simply experimenting with AI without watching your API bill climb. This comprehensive guide will walk you through everything you need to know about Ollama—from your first installation to building production-ready AI-powered applications that respect your privacy and your wallet.

Part A: Setting Up Ollama

Introduction to Ollama

Ollama is a powerful, open-source tool that enables you to run large language models (LLMs) locally on your own machine. Think of it as Docker for AI models—it packages everything you need to run models like Llama, Mistral, CodeLlama, and dozens of others into easy-to-use containers that work seamlessly on your computer.

Why use Ollama?

  • Privacy: Your data never leaves your machine. Perfect for sensitive code, personal documents, or proprietary information.
  • No API costs: Run unlimited queries without paying per token.
  • Offline capability: Work without internet connectivity once models are downloaded.
  • Speed: No network latency—responses are generated locally.
  • Experimentation: Try different models and configurations freely.
  • Integration: Works with various tools and development environments.

Installation

Ollama supports macOS, Linux, and Windows. The installation process is straightforward for all platforms.

macOS

Download the official installer from ollama.com:

# Or use Homebrew
brew install ollama

Linux

Run the installation script:

curl -fsSL https://ollama.com/install.sh | sh

This script supports most Linux distributions including Ubuntu, Debian, Fedora, and CentOS.

Windows

Download the Windows installer from ollama.com. Ollama on Windows runs natively and supports the same commands as the other platforms.

Docker Installation

For containerized deployment:

docker pull ollama/ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Testing Your Installation

After installation, verify Ollama is working:

# Check version
ollama --version

# Start the API server in the foreground
# (skip this if Ollama is already running as a desktop app or system service;
#  in that case you'll see an "address already in use" error)
ollama serve

You should see output indicating the Ollama server is running on http://localhost:11434.

To test with a simple model:

# Pull and run a small model
ollama run llama3.2

# You should get an interactive prompt
# Try: "Hello, how are you?"

If you see a response from the model, congratulations—Ollama is working correctly!

Updating Ollama

Keeping Ollama updated ensures you have the latest features, bug fixes, and security patches.

Why Update?

  • New model support: Access to the latest model architectures and versions
  • Performance improvements: Faster inference and better memory management
  • Bug fixes: Resolved issues and stability improvements
  • New features: Enhanced CLI commands and API capabilities
  • Security patches: Important security updates

How to Update

macOS (Homebrew):

brew update
brew upgrade ollama

macOS (Installer):
Download and run the latest installer from ollama.com—it will update your existing installation.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

The script automatically detects and updates your existing installation.

Windows:
Download and run the latest installer.

Docker:

docker pull ollama/ollama:latest
docker stop ollama
docker rm ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

After updating, verify the new version:

ollama --version

Understanding Ollama Models

Ollama supports a vast ecosystem of models, each optimized for different tasks. Here's a breakdown of the major categories:

General Purpose Models

Llama 3.2 (1B, 3B) - Meta's latest efficient models

  • Best for: General conversation, reasoning, lightweight tasks
  • Size: 1B parameters (740MB), 3B parameters (2GB)
  • Command: ollama run llama3.2 or ollama run llama3.2:1b

Llama 3.1 (8B, 70B, 405B) - Meta's powerful flagship models

  • Best for: Complex reasoning, long context (128K tokens)
  • Size: 8B (4.7GB), 70B (40GB), 405B (231GB)
  • Command: ollama run llama3.1 or ollama run llama3.1:70b

Mistral (7B) - Efficient and capable model

  • Best for: General purpose tasks with good speed/quality balance
  • Size: 4.1GB
  • Command: ollama run mistral

Mixtral (8x7B, 8x22B) - Mixture of Experts models

  • Best for: High-quality outputs with efficient inference
  • Size: 26GB (8x7B), 80GB (8x22B)
  • Command: ollama run mixtral

Coding Models

CodeLlama (7B, 13B, 34B, 70B) - Specialized for code

  • Best for: Code generation, completion, debugging
  • Size: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
  • Command: ollama run codellama

DeepSeek Coder (6.7B, 33B) - Strong coding performance

  • Best for: Multiple programming languages, code explanation
  • Size: 6.7B (3.8GB), 33B (19GB)
  • Command: ollama run deepseek-coder

Qwen 2.5 Coder (1.5B, 7B, 32B) - Alibaba's coding model

  • Best for: Code generation with multilingual support
  • Command: ollama run qwen2.5-coder

Specialized Models

Llama 3.2 Vision (11B, 90B) - Multimodal understanding

  • Best for: Image analysis, visual question answering
  • Command: ollama run llama3.2-vision

Phi-3 (3.8B, 14B) - Microsoft's small language models

  • Best for: Efficient deployment, edge devices
  • Size: 3.8B (2.3GB), 14B (7.9GB)
  • Command: ollama run phi3

Gemma 2 (2B, 9B, 27B) - Google's open models

  • Best for: Safety-focused applications
  • Command: ollama run gemma2

Neural Chat (7B) - Fine-tuned for conversations

  • Best for: Interactive dialogue
  • Command: ollama run neural-chat

How to Choose a Model

  1. For general use on consumer hardware: llama3.2:3b or mistral
  2. For coding tasks: codellama:7b or deepseek-coder:6.7b
  3. For best quality (with powerful GPU): llama3.1:70b or mixtral:8x22b
  4. For resource-constrained systems: phi3:3.8b or llama3.2:1b
  5. For vision tasks: llama3.2-vision

Starting and Stopping Ollama

Starting Ollama

The Ollama server needs to be running to handle model requests.

On macOS and Linux:

# Start in foreground (see logs)
ollama serve

# Or run as a background service (starts automatically on boot)
# On macOS, Ollama runs as a LaunchAgent
# On Linux with systemd:
sudo systemctl start ollama

On Windows:
Ollama runs as a service automatically after installation. Access it via the system tray icon.

Verify it's running:

curl http://localhost:11434
# Should return: "Ollama is running"

Stopping Ollama

On macOS:

# If running in foreground, press Ctrl+C
# If running as service:
launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist

On Linux:

# If running in foreground, press Ctrl+C
# If running as systemd service:
sudo systemctl stop ollama

On Windows:
Right-click the Ollama icon in the system tray and select "Quit Ollama."

Docker:

docker stop ollama

Serving a Specific Model

Once Ollama is running, you can interact with models in several ways.

Interactive Mode

Run a model and chat directly in the terminal:

ollama run llama3.2

This starts an interactive session. Type your prompts and press Enter. Type /bye to exit.

Pull a Model Without Running

Download a model for later use:

ollama pull codellama:13b

List Downloaded Models

See which models you have locally:

ollama list

Output shows model name, size, and last modified date.
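
If you want the same information programmatically, here is a minimal Python sketch that reads it from the local API (it assumes the requests package is installed and uses the name, size, and modified_at fields returned by the /api/tags endpoint):

import requests

# Ask the local Ollama server for installed models (same data as `ollama list`)
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB (modified {model['modified_at']})")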

Run with Custom Parameters

Control model behavior with parameters:

# Show timing and token statistics for each response
ollama run llama3.2 --verbose

# Sampling parameters are set inside an interactive session, for example:
#   /set parameter temperature 0.8
#   /set parameter top_p 0.9
# (or via the API's "options" field, shown in the next section)

API Mode (Serving to Other Applications)

Ollama exposes an HTTP API on http://localhost:11434 (native endpoints under /api, plus an OpenAI-compatible API under /v1):

# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

This API allows integration with countless tools and applications.
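
As a quick illustration, here is a minimal Python sketch of calling the API from code, first with the official ollama package and then with the OpenAI client pointed at the /v1 endpoint (both packages are assumptions: pip install ollama openai):

import ollama
from openai import OpenAI

# Native API via the ollama package; "options" carries sampling parameters
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    options={"temperature": 0.8},
)
print(response["message"]["content"])

# OpenAI-compatible endpoint: any OpenAI SDK can point at Ollama's /v1 route
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but ignored
completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)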

Remove Models You Don't Need

Free up disk space:

ollama rm codellama:70b

Quick Reference Commands

# Installation & Updates
ollama --version              # Check version
curl -fsSL https://ollama.com/install.sh | sh  # Install/update

# Model Management
ollama pull <model>           # Download a model
ollama list                   # List local models
ollama rm <model>             # Delete a model
ollama show <model>           # Show model information

# Running Models
ollama run <model>            # Interactive mode
ollama run <model> "prompt"   # One-off generation
ollama serve                  # Start API server

# API Testing
curl http://localhost:11434   # Check if running
curl http://localhost:11434/api/tags  # List models via API

Part B: Real-World Use Cases

Now that you have Ollama set up, let's explore powerful ways to use it in your daily workflow.

Use Case 1: Local Coding with Claude Code and AI IDEs

One of the most powerful applications of Ollama is integrating local models with development tools. This gives you AI-powered coding assistance that's private, fast, and free.

Using Ollama with Claude Code

Claude Code is a command-line agentic coding tool. It is built around Anthropic's API, but you can pair it with local Ollama models for certain tasks in the same workflow.

Setup:

  1. Install Claude Code:

npm install -g @anthropic-ai/claude-code

  2. Point the OpenAI-compatible tools in your workflow at Ollama's endpoint (Claude Code itself still talks to Anthropic's API; the variable below is a convention read by some companion tools, not a Claude Code setting):

# Ollama's OpenAI-compatible endpoint
export OLLAMA_API_BASE=http://localhost:11434/v1

Workflow:

  • Use Claude Code with Anthropic's API for complex agentic tasks (planning, multi-file refactoring)
  • Use local Ollama models for quick completions, code explanations, or offline work
  • Combine both: Claude for architecture, Ollama for implementation details

Using Ollama with Cursor

Cursor is a popular AI-powered IDE. Configure it to use Ollama:

  1. Open Cursor Settings
  2. Navigate to Models
  3. Add custom model endpoint: http://localhost:11434/v1
  4. Select your preferred model (e.g., codellama:13b)

Benefits:

  • No API costs for code completions
  • Faster responses (no network latency)
  • Private code never leaves your machine

Using Ollama with Continue.dev

Continue is a VS Code / JetBrains extension that brings AI assistance to your IDE:

  1. Install the Continue extension
  2. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "CodeLlama Local",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
  3. Use Cmd/Ctrl+L to chat with your local model
  4. Highlight code and ask questions, generate tests, or refactor

Using Ollama with Tabby (Self-Hosted Code Completion)

Tabby provides GitHub Copilot-like code completion:

# Run Tabby as a self-hosted completion server
# (this example loads its own CodeLlama build rather than routing through Ollama)
docker run -it --gpus all \
  -p 8080:8080 \
  -v ~/.tabby:/data \
  tabbyml/tabby \
  serve --model TabbyML/CodeLlama-7B --device cuda

Configure your IDE to point to http://localhost:8080.

Best Practices for Local Coding Models

  1. Choose the right model size: 7B models for real-time completions, 13B+ for deeper analysis
  2. Use code-specific models: CodeLlama, DeepSeek Coder, or Qwen Coder perform better than general models
  3. Combine tools: Use local models for most tasks, cloud models for complex reasoning
  4. Fine-tune prompts: Local models benefit from clear, specific instructions
  5. Monitor resource usage: Large models need significant RAM/VRAM

Use Case 2: Document Ingestion and RAG Workflows

Retrieval-Augmented Generation (RAG) is a technique where you feed your own documents to an LLM, enabling it to answer questions based on your specific data. Ollama is perfect for building private RAG systems.

What is RAG?

Traditional LLMs only know what they were trained on. RAG extends their knowledge by:

  1. Ingesting your documents (PDFs, text files, web pages)
  2. Chunking them into smaller pieces
  3. Embedding these chunks into vector representations
  4. Storing embeddings in a vector database
  5. Retrieving relevant chunks when you ask a question
  6. Generating answers using both the retrieved context and the LLM

Simple RAG with Ollama and Python

Here's a basic RAG implementation:

# Install dependencies
# pip install ollama chromadb langchain-community langchain-text-splitters pypdf
# Pull the embedding model once before running: ollama pull nomic-embed-text

import ollama
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 1. Load and chunk documents
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 2. Create embeddings and store in vector DB
client = chromadb.Client()
collection = client.create_collection("my_docs")

for i, chunk in enumerate(chunks):
    # Use Ollama to generate embeddings
    embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=chunk.page_content
    )["embedding"]

    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[chunk.page_content]
    )

# 3. Query the system
def ask_question(question):
    # Get embedding for question
    question_embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=question
    )["embedding"]

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=3
    )

    # Build context from retrieved chunks
    context = "\n\n".join(results["documents"][0])

    # Generate answer using Ollama
    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )

    return response["message"]["content"]

# Use it
answer = ask_question("What are the main points in the document?")
print(answer)
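
One practical tweak: chromadb.Client() above keeps everything in memory, so the index disappears when the script exits. A persistent client is a one-line change (the path below is illustrative):

import chromadb

# Store embeddings on disk so documents don't need to be re-ingested on every run
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")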

Advanced RAG with Ollama and Open Source Tools

Using Ollama with LangChain:

# pip install langchain langchain-community faiss-cpu pypdf
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

# Initialize Ollama
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create vector store (reuses the `chunks` from the previous example)
vectorstore = FAISS.from_documents(chunks, embeddings)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# Ask questions
result = qa_chain.run("What does the document say about X?")

Using Ollama with AnythingLLM:

AnythingLLM is a full-stack RAG application with a beautiful UI:

  1. Install: Download from anythingllm.com
  2. Configure Ollama as LLM provider in settings
  3. Upload documents via the UI
  4. Start chatting with your documents

Using Ollama with Quivr (Your Second Brain):

Quivr is a RAG application for personal knowledge management:

# Clone and setup
git clone https://github.com/QuivrHQ/quivr
cd quivr
docker-compose up

# Configure to use Ollama in .env file
LLM_PROVIDER=ollama
OLLAMA_API_BASE=http://localhost:11434

Ingestion Workflow Best Practices

1. Document Preprocessing:

  • Clean text (remove headers, footers, page numbers)
  • Extract tables and images separately if needed
  • Use OCR for scanned documents (tesseract-ocr)

2. Chunking Strategies:

  • Fixed size: Simple, works for uniform content (500-1000 tokens)
  • Semantic: Split by paragraphs or sections (better context)
  • Recursive: LangChain's RecursiveCharacterTextSplitter (balanced)

3. Embedding Models:

  • nomic-embed-text: Best all-around, 137M parameters
  • mxbai-embed-large: Highest quality, 335M parameters
  • all-minilm: Fastest, lightweight option

4. Retrieval Optimization:

  • Use hybrid search (keyword + semantic)
  • Re-rank results with a cross-encoder
  • Adjust number of retrieved chunks (typically 3-5)

5. Prompt Engineering:

Context: [retrieved chunks]

Based ONLY on the context above, answer the following question.
If the answer is not in the context, say "I don't have enough information."

Question: {user_question}

Answer:
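
As a concrete sketch, here is that template wired into ollama.chat; the grounded_answer function and its retrieved_chunks argument are hypothetical stand-ins for your own retrieval step:

import ollama

PROMPT_TEMPLATE = """Context: {context}

Based ONLY on the context above, answer the following question.
If the answer is not in the context, say "I don't have enough information."

Question: {question}

Answer:"""

def grounded_answer(question, retrieved_chunks):
    # retrieved_chunks: list of text chunks returned by your retrieval step
    context = "\n\n".join(retrieved_chunks)
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            context=context, question=question)}],
    )
    return response["message"]["content"]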

Production RAG Considerations

  • Vector Database: Use Qdrant, Weaviate, or Pinecone for scale
  • Caching: Cache embeddings to avoid recomputation (see the sketch after this list)
  • Monitoring: Track retrieval accuracy and response quality
  • Updates: Implement incremental updates when documents change
  • Multi-modal: Use vision models for PDFs with images/charts
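
For the caching point, a minimal sketch might look like the following (the file path and helper name are illustrative; a production system would use a proper key-value store):

import hashlib
import json
import os

import ollama

CACHE_PATH = "embedding_cache.json"  # illustrative location

_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def cached_embedding(text, model="nomic-embed-text"):
    # Key on the model name plus a hash of the text so edited chunks get re-embedded
    key = f"{model}:{hashlib.sha256(text.encode('utf-8')).hexdigest()}"
    if key not in _cache:
        _cache[key] = ollama.embeddings(model=model, prompt=text)["embedding"]
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return _cache[key]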

Performance Tips for Ollama

  1. GPU Acceleration: Ollama automatically uses GPU if available (NVIDIA, AMD, or Apple Silicon)
  2. Memory Management: Larger models need more RAM/VRAM (check with ollama show <model>; the sketch after this list shows what is currently loaded)
  3. Concurrent Requests: Ollama handles multiple requests, but memory is shared
  4. Model Quantization: Use quantized models (Q4, Q5) for better performance with acceptable quality loss
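
To check what is actually resident in memory at any moment, query the /api/ps endpoint (the same data ollama ps prints); a minimal sketch assuming the requests package:

import json

import requests

# List models currently loaded in memory, equivalent to `ollama ps`
resp = requests.get("http://localhost:11434/api/ps")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))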

Conclusion

Ollama opens up a world of possibilities for running powerful AI models on your own hardware. From coding assistance to document analysis, from privacy-preserving workflows to cost-free experimentation, the applications are endless.

Key Takeaways:

  • Ollama makes running local LLMs accessible and practical
  • Choose models based on your hardware and use case
  • Integration with development tools enhances productivity
  • RAG workflows unlock the power of your private data
  • Local models provide privacy, speed, and zero API costs

Next Steps:

  1. Install Ollama and test a few models
  2. Integrate with your favorite IDE or coding tool
  3. Build a simple RAG system with your documents
  4. Explore the Ollama model library at ollama.com/library
  5. Join the community at github.com/ollama/ollama

Happy local LLM exploration! 🚀
