The Complete Guide to Ollama: Run Large Language Models Locally

Imagine having the power of GPT-4 or Claude running entirely on your laptop—no internet required, no API costs, and complete privacy. Until recently, this seemed like a distant dream reserved for tech giants with massive server farms. But the landscape has changed dramatically. Thanks to Ollama, anyone with a modern computer can now run sophisticated AI models locally, whether you're coding on a plane at 35,000 feet, analyzing sensitive documents that can never touch the cloud, or simply experimenting with AI without watching your API bill climb. This comprehensive guide will walk you through everything you need to know about Ollama—from your first installation to building production-ready AI-powered applications that respect your privacy and your wallet.

Part A: Setting Up Ollama

Introduction to Ollama

Ollama is a powerful, open-source tool that enables you to run large language models (LLMs) locally on your own machine. Think of it as Docker for AI models—it packages everything you need to run models like Llama, Mistral, CodeLlama, and dozens of others into easy-to-use containers that work seamlessly on your computer.

Why use Ollama?

  • Privacy: Your data never leaves your machine. Perfect for sensitive code, personal documents, or proprietary information.
  • No API costs: Run unlimited queries without paying per token.
  • Offline capability: Work without internet connectivity once models are downloaded.
  • Speed: No network latency—responses are generated locally.
  • Experimentation: Try different models and configurations freely.
  • Integration: Works with various tools and development environments.

Installation

Ollama supports macOS, Linux, and Windows. The installation process is straightforward for all platforms.

macOS

Download the official installer from ollama.com:

# Or use Homebrew
brew install ollama

Linux

Run the installation script:

curl -fsSL https://ollama.com/install.sh | sh

This script supports most Linux distributions including Ubuntu, Debian, Fedora, and CentOS.

Windows

Download the Windows installer from ollama.com. Ollama on Windows runs natively and supports the same commands as the other platforms.

Docker Installation

For containerized deployment:

docker pull ollama/ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Testing Your Installation

After installation, verify Ollama is working:

# Check version
ollama --version

# Start the API server in the foreground
# (skip this if Ollama is already running as a desktop app or system service;
#  in that case you'll see an "address already in use" error)
ollama serve

You should see output indicating the Ollama server is running on http://localhost:11434.

To test with a simple model:

# Pull and run a small model
ollama run llama3.2

# You should get an interactive prompt
# Try: "Hello, how are you?"

If you see a response from the model, congratulations—Ollama is working correctly!

Updating Ollama

Keeping Ollama updated ensures you have the latest features, bug fixes, and security patches.

Why Update?

  • New model support: Access to the latest model architectures and versions
  • Performance improvements: Faster inference and better memory management
  • Bug fixes: Resolved issues and stability improvements
  • New features: Enhanced CLI commands and API capabilities
  • Security patches: Important security updates

How to Update

macOS (Homebrew):

brew update
brew upgrade ollama

macOS (Installer):
Download and run the latest installer from ollama.com—it will update your existing installation.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

The script automatically detects and updates your existing installation.

Windows:
Download and run the latest installer.

Docker:

docker pull ollama/ollama:latest
docker stop ollama
docker rm ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

After updating, verify the new version:

ollama --version

Understanding Ollama Models

Ollama supports a vast ecosystem of models, each optimized for different tasks. Here's a breakdown of the major categories:

General Purpose Models

Llama 3.2 (1B, 3B) - Meta's latest efficient models

  • Best for: General conversation, reasoning, lightweight tasks
  • Size: 1B parameters (740MB), 3B parameters (2GB)
  • Command: ollama run llama3.2 or ollama run llama3.2:1b

Llama 3.1 (8B, 70B, 405B) - Meta's powerful flagship models

  • Best for: Complex reasoning, long context (128K tokens)
  • Size: 8B (4.7GB), 70B (40GB), 405B (231GB)
  • Command: ollama run llama3.1 or ollama run llama3.1:70b

Mistral (7B) - Efficient and capable model

  • Best for: General purpose tasks with good speed/quality balance
  • Size: 4.1GB
  • Command: ollama run mistral

Mixtral (8x7B, 8x22B) - Mixture of Experts models

  • Best for: High-quality outputs with efficient inference
  • Size: 26GB (8x7B), 80GB (8x22B)
  • Command: ollama run mixtral

Coding Models

CodeLlama (7B, 13B, 34B, 70B) - Specialized for code

  • Best for: Code generation, completion, debugging
  • Size: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
  • Command: ollama run codellama

DeepSeek Coder (6.7B, 33B) - Strong coding performance

  • Best for: Multiple programming languages, code explanation
  • Size: 6.7B (3.8GB), 33B (19GB)
  • Command: ollama run deepseek-coder

Qwen 2.5 Coder (1.5B, 7B, 32B) - Alibaba's coding model

  • Best for: Code generation with multilingual support
  • Command: ollama run qwen2.5-coder

Specialized Models

Llama 3.2 Vision (11B, 90B) - Multimodal understanding

  • Best for: Image analysis, visual question answering
  • Command: ollama run llama3.2-vision

Phi-3 (3.8B, 14B) - Microsoft's small language models

  • Best for: Efficient deployment, edge devices
  • Size: 3.8B (2.3GB), 14B (7.9GB)
  • Command: ollama run phi3

Gemma 2 (2B, 9B, 27B) - Google's open models

  • Best for: Safety-focused applications
  • Command: ollama run gemma2

Neural Chat (7B) - Fine-tuned for conversations

  • Best for: Interactive dialogue
  • Command: ollama run neural-chat

How to Choose a Model

  1. For general use on consumer hardware: llama3.2:3b or mistral
  2. For coding tasks: codellama:7b or deepseek-coder:6.7b
  3. For best quality (with powerful GPU): llama3.1:70b or mixtral:8x22b
  4. For resource-constrained systems: phi3:3.8b or llama3.2:1b
  5. For vision tasks: llama3.2-vision

Starting and Stopping Ollama

Starting Ollama

The Ollama server needs to be running to handle model requests.

On macOS and Linux:

# Start in foreground (see logs)
ollama serve

# Or run as a background service (starts automatically on boot)
# On macOS, Ollama runs as a LaunchAgent
# On Linux with systemd:
sudo systemctl start ollama

On Windows:
Ollama runs as a service automatically after installation. Access it via the system tray icon.

Verify it's running:

curl http://localhost:11434
# Should return: "Ollama is running"

Stopping Ollama

On macOS:

# If running in foreground, press Ctrl+C
# If running as service:
launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist

On Linux:

# If running in foreground, press Ctrl+C
# If running as systemd service:
sudo systemctl stop ollama

On Windows:
Right-click the Ollama icon in the system tray and select "Quit Ollama."

Docker:

docker stop ollama

Serving a Specific Model

Once Ollama is running, you can interact with models in several ways.

Interactive Mode

Run a model and chat directly in the terminal:

ollama run llama3.2

This starts an interactive session. Type your prompts and press Enter. Type /bye to exit.

Pull a Model Without Running

Download a model for later use:

ollama pull codellama:13b

List Downloaded Models

See which models you have locally:

ollama list

Output shows model name, size, and last modified date.
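
If you want the same information programmatically, here is a minimal Python sketch that reads it from the local API (it assumes the requests package is installed and uses the name, size, and modified_at fields returned by the /api/tags endpoint):

import requests

# Ask the local Ollama server for installed models (same data as `ollama list`)
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB (modified {model['modified_at']})")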

Run with Custom Parameters

Control model behavior with parameters:

# Show timing and token statistics for each response
ollama run llama3.2 --verbose

# Sampling parameters are set inside an interactive session, for example:
#   /set parameter temperature 0.8
#   /set parameter top_p 0.9
# (or via the API's "options" field, shown in the next section)

API Mode (Serving to Other Applications)

Ollama exposes an HTTP API on http://localhost:11434 (native endpoints under /api, plus an OpenAI-compatible API under /v1):

# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

This API allows integration with countless tools and applications.
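
As a quick illustration, here is a minimal Python sketch of calling the API from code, first with the official ollama package and then with the OpenAI client pointed at the /v1 endpoint (both packages are assumptions: pip install ollama openai):

import ollama
from openai import OpenAI

# Native API via the ollama package; "options" carries sampling parameters
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    options={"temperature": 0.8},
)
print(response["message"]["content"])

# OpenAI-compatible endpoint: any OpenAI SDK can point at Ollama's /v1 route
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but ignored
completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)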

Remove Models You Don't Need

Free up disk space:

ollama rm codellama:70b

Quick Reference Commands

# Installation & Updates
ollama --version              # Check version
curl -fsSL https://ollama.com/install.sh | sh  # Install/update

# Model Management
ollama pull <model>           # Download a model
ollama list                   # List local models
ollama rm <model>             # Delete a model
ollama show <model>           # Show model information

# Running Models
ollama run <model>            # Interactive mode
ollama run <model> "prompt"   # One-off generation
ollama serve                  # Start API server

# API Testing
curl http://localhost:11434   # Check if running
curl http://localhost:11434/api/tags  # List models via API

Part B: Real-World Use Cases

Now that you have Ollama set up, let's explore powerful ways to use it in your daily workflow.

Use Case 1: Local Coding with Claude Code and AI IDEs

One of the most powerful applications of Ollama is integrating local models with development tools. This gives you AI-powered coding assistance that's private, fast, and free.

Using Ollama with Claude Code

Claude Code is a command-line agentic coding tool. It is built around Anthropic's API, but you can pair it with local Ollama models for certain tasks in the same workflow.

Setup:

  1. Install Claude Code:

npm install -g @anthropic-ai/claude-code

  2. Point the OpenAI-compatible tools in your workflow at Ollama's endpoint (Claude Code itself still talks to Anthropic's API; the variable below is a convention read by some companion tools, not a Claude Code setting):

# Ollama's OpenAI-compatible endpoint
export OLLAMA_API_BASE=http://localhost:11434/v1

Workflow:

  • Use Claude Code with Anthropic's API for complex agentic tasks (planning, multi-file refactoring)
  • Use local Ollama models for quick completions, code explanations, or offline work
  • Combine both: Claude for architecture, Ollama for implementation details

Using Ollama with Cursor

Cursor is a popular AI-powered IDE. Configure it to use Ollama:

  1. Open Cursor Settings
  2. Navigate to Models
  3. Add custom model endpoint: http://localhost:11434/v1
  4. Select your preferred model (e.g., codellama:13b)

Benefits:

  • No API costs for code completions
  • Faster responses (no network latency)
  • Private code never leaves your machine

Using Ollama with Continue.dev

Continue is a VS Code / JetBrains extension that brings AI assistance to your IDE:

  1. Install the Continue extension
  2. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "CodeLlama Local",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
  3. Use Cmd/Ctrl+L to chat with your local model
  4. Highlight code and ask questions, generate tests, or refactor

Using Ollama with Tabby (Self-Hosted Code Completion)

Tabby provides GitHub Copilot-like code completion:

# Run Tabby as a self-hosted completion server
# (this example loads its own CodeLlama build rather than routing through Ollama)
docker run -it --gpus all \
  -p 8080:8080 \
  -v ~/.tabby:/data \
  tabbyml/tabby \
  serve --model TabbyML/CodeLlama-7B --device cuda

Configure your IDE to point to http://localhost:8080.

Best Practices for Local Coding Models

  1. Choose the right model size: 7B models for real-time completions, 13B+ for deeper analysis
  2. Use code-specific models: CodeLlama, DeepSeek Coder, or Qwen Coder perform better than general models
  3. Combine tools: Use local models for most tasks, cloud models for complex reasoning
  4. Fine-tune prompts: Local models benefit from clear, specific instructions
  5. Monitor resource usage: Large models need significant RAM/VRAM

Use Case 2: Document Ingestion and RAG Workflows

Retrieval-Augmented Generation (RAG) is a technique where you feed your own documents to an LLM, enabling it to answer questions based on your specific data. Ollama is perfect for building private RAG systems.

What is RAG?

Traditional LLMs only know what they were trained on. RAG extends their knowledge by:

  1. Ingesting your documents (PDFs, text files, web pages)
  2. Chunking them into smaller pieces
  3. Embedding these chunks into vector representations
  4. Storing embeddings in a vector database
  5. Retrieving relevant chunks when you ask a question
  6. Generating answers using both the retrieved context and the LLM

Simple RAG with Ollama and Python

Here's a basic RAG implementation:

# Install dependencies
# pip install ollama chromadb langchain-community langchain-text-splitters pypdf
# Pull the embedding model once before running: ollama pull nomic-embed-text

import ollama
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 1. Load and chunk documents
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 2. Create embeddings and store in vector DB
client = chromadb.Client()
collection = client.create_collection("my_docs")

for i, chunk in enumerate(chunks):
    # Use Ollama to generate embeddings
    embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=chunk.page_content
    )["embedding"]

    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[chunk.page_content]
    )

# 3. Query the system
def ask_question(question):
    # Get embedding for question
    question_embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=question
    )["embedding"]

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=3
    )

    # Build context from retrieved chunks
    context = "\n\n".join(results["documents"][0])

    # Generate answer using Ollama
    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )

    return response["message"]["content"]

# Use it
answer = ask_question("What are the main points in the document?")
print(answer)
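
One practical tweak: chromadb.Client() above keeps everything in memory, so the index disappears when the script exits. A persistent client is a one-line change (the path below is illustrative):

import chromadb

# Store embeddings on disk so documents don't need to be re-ingested on every run
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")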

Advanced RAG with Ollama and Open Source Tools

Using Ollama with LangChain:

# pip install langchain langchain-community faiss-cpu pypdf
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

# Initialize Ollama
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create vector store (reuses the `chunks` from the previous example)
vectorstore = FAISS.from_documents(chunks, embeddings)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# Ask questions
result = qa_chain.run("What does the document say about X?")

Using Ollama with AnythingLLM:

AnythingLLM is a full-stack RAG application with a beautiful UI:

  1. Install: Download from anythingllm.com
  2. Configure Ollama as LLM provider in settings
  3. Upload documents via the UI
  4. Start chatting with your documents

Using Ollama with Quivr (Your Second Brain):

Quivr is a RAG application for personal knowledge management:

# Clone and setup
git clone https://github.com/QuivrHQ/quivr
cd quivr
docker-compose up

# Configure to use Ollama in .env file
LLM_PROVIDER=ollama
OLLAMA_API_BASE=http://localhost:11434

Ingestion Workflow Best Practices

1. Document Preprocessing:

  • Clean text (remove headers, footers, page numbers)
  • Extract tables and images separately if needed
  • Use OCR for scanned documents (tesseract-ocr)

2. Chunking Strategies:

  • Fixed size: Simple, works for uniform content (500-1000 tokens)
  • Semantic: Split by paragraphs or sections (better context)
  • Recursive: LangChain's RecursiveCharacterTextSplitter (balanced)

3. Embedding Models:

  • nomic-embed-text: Best all-around, 137M parameters
  • mxbai-embed-large: Highest quality, 335M parameters
  • all-minilm: Fastest, lightweight option

4. Retrieval Optimization:

  • Use hybrid search (keyword + semantic)
  • Re-rank results with a cross-encoder
  • Adjust number of retrieved chunks (typically 3-5)

5. Prompt Engineering:

Context: [retrieved chunks]

Based ONLY on the context above, answer the following question.
If the answer is not in the context, say "I don't have enough information."

Question: {user_question}

Answer:
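
As a concrete sketch, here is that template wired into ollama.chat; the grounded_answer function and its retrieved_chunks argument are hypothetical stand-ins for your own retrieval step:

import ollama

PROMPT_TEMPLATE = """Context: {context}

Based ONLY on the context above, answer the following question.
If the answer is not in the context, say "I don't have enough information."

Question: {question}

Answer:"""

def grounded_answer(question, retrieved_chunks):
    # retrieved_chunks: list of text chunks returned by your retrieval step
    context = "\n\n".join(retrieved_chunks)
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            context=context, question=question)}],
    )
    return response["message"]["content"]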

Production RAG Considerations

  • Vector Database: Use Qdrant, Weaviate, or Pinecone for scale
  • Caching: Cache embeddings to avoid recomputation (see the sketch after this list)
  • Monitoring: Track retrieval accuracy and response quality
  • Updates: Implement incremental updates when documents change
  • Multi-modal: Use vision models for PDFs with images/charts
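
For the caching point, a minimal sketch might look like the following (the file path and helper name are illustrative; a production system would use a proper key-value store):

import hashlib
import json
import os

import ollama

CACHE_PATH = "embedding_cache.json"  # illustrative location

_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def cached_embedding(text, model="nomic-embed-text"):
    # Key on the model name plus a hash of the text so edited chunks get re-embedded
    key = f"{model}:{hashlib.sha256(text.encode('utf-8')).hexdigest()}"
    if key not in _cache:
        _cache[key] = ollama.embeddings(model=model, prompt=text)["embedding"]
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return _cache[key]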

Performance Tips for Ollama

  1. GPU Acceleration: Ollama automatically uses GPU if available (NVIDIA, AMD, or Apple Silicon)
  2. Memory Management: Larger models need more RAM/VRAM (check with ollama show <model>; the sketch after this list shows what is currently loaded)
  3. Concurrent Requests: Ollama handles multiple requests, but memory is shared
  4. Model Quantization: Use quantized models (Q4, Q5) for better performance with acceptable quality loss
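
To check what is actually resident in memory at any moment, query the /api/ps endpoint (the same data ollama ps prints); a minimal sketch assuming the requests package:

import json

import requests

# List models currently loaded in memory, equivalent to `ollama ps`
resp = requests.get("http://localhost:11434/api/ps")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))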

Conclusion

Ollama opens up a world of possibilities for running powerful AI models on your own hardware. From coding assistance to document analysis, from privacy-preserving workflows to cost-free experimentation, the applications are endless.

Key Takeaways:

  • Ollama makes running local LLMs accessible and practical
  • Choose models based on your hardware and use case
  • Integration with development tools enhances productivity
  • RAG workflows unlock the power of your private data
  • Local models provide privacy, speed, and zero API costs

Next Steps:

  1. Install Ollama and test a few models
  2. Integrate with your favorite IDE or coding tool
  3. Build a simple RAG system with your documents
  4. Explore the Ollama model library at ollama.com/library
  5. Join the community at github.com/ollama/ollama

Happy local LLM exploration! 🚀
