Imagine having the power of GPT-4 or Claude running entirely on your laptop—no internet required, no API costs, and complete privacy. Until recently, this seemed like a distant dream reserved for tech giants with massive server farms. But the landscape has changed dramatically. Thanks to Ollama, anyone with a modern computer can now run sophisticated AI models locally, whether you're coding on a plane at 35,000 feet, analyzing sensitive documents that can never touch the cloud, or simply experimenting with AI without watching your API bill climb. This comprehensive guide will walk you through everything you need to know about Ollama—from your first installation to building production-ready AI-powered applications that respect your privacy and your wallet.
Part A: Setting Up Ollama
Introduction to Ollama
Ollama is a powerful, open-source tool that enables you to run large language models (LLMs) locally on your own machine. Think of it as Docker for AI models—it packages everything you need to run models like Llama, Mistral, CodeLlama, and dozens of others into easy-to-use containers that work seamlessly on your computer.
Why use Ollama?
- Privacy: Your data never leaves your machine. Perfect for sensitive code, personal documents, or proprietary information.
- No API costs: Run unlimited queries without paying per token.
- Offline capability: Work without internet connectivity once models are downloaded.
- Speed: No network latency—responses are generated locally.
- Experimentation: Try different models and configurations freely.
- Integration: Works with various tools and development environments.
Installation
Ollama supports macOS, Linux, and Windows. The installation process is straightforward for all platforms.
macOS
Download the official installer from ollama.com:
# Or use Homebrew
brew install ollama
Linux
Run the installation script:
curl -fsSL https://ollama.com/install.sh | sh
This script supports most Linux distributions including Ubuntu, Debian, Fedora, and CentOS.
Windows
Download the Windows installer from ollama.com. Ollama on Windows runs natively and supports the same commands as the other platforms.
Docker Installation
For containerized deployment:
docker pull ollama/ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Testing Your Installation
After installation, verify Ollama is working:
# Check version
ollama --version
# Start the server in the foreground (it will report an error if one is already running)
ollama serve
You should see output indicating the Ollama server is running on http://localhost:11434.
To test with a simple model:
# Pull and run a small model
ollama run llama3.2
# You should get an interactive prompt
# Try: "Hello, how are you?"
If you see a response from the model, congratulations—Ollama is working correctly!
Updating Ollama
Keeping Ollama updated ensures you have the latest features, bug fixes, and security patches.
Why Update?
- New model support: Access to the latest model architectures and versions
- Performance improvements: Faster inference and better memory management
- Bug fixes: Resolved issues and stability improvements
- New features: Enhanced CLI commands and API capabilities
- Security patches: Important security updates
How to Update
macOS (Homebrew):
brew update
brew upgrade ollama
macOS (Installer):
Download and run the latest installer from ollama.com—it will update your existing installation.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
The script automatically detects and updates your existing installation.
Windows:
Download and run the latest installer.
Docker:
docker pull ollama/ollama:latest
docker stop ollama
docker rm ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
After updating, verify the new version:
ollama --version
Understanding Ollama Models
Ollama supports a vast ecosystem of models, each optimized for different tasks. Here's a breakdown of the major categories:
General Purpose Models
Llama 3.2 (1B, 3B) - Meta's latest efficient models
- Best for: General conversation, reasoning, lightweight tasks
- Size: 1B parameters (740MB), 3B parameters (2GB)
- Command:
ollama run llama3.2 or ollama run llama3.2:1b
Llama 3.1 (8B, 70B, 405B) - Meta's powerful flagship models
- Best for: Complex reasoning, long context (128K tokens)
- Size: 8B (4.7GB), 70B (40GB), 405B (231GB)
- Command:
ollama run llama3.1 or ollama run llama3.1:70b
Mistral (7B) - Efficient and capable model
- Best for: General purpose tasks with good speed/quality balance
- Size: 4.1GB
- Command:
ollama run mistral
Mixtral (8x7B, 8x22B) - Mixture of Experts models
- Best for: High-quality outputs with efficient inference
- Size: 26GB (8x7B), 80GB (8x22B)
- Command:
ollama run mixtral
Coding Models
CodeLlama (7B, 13B, 34B, 70B) - Specialized for code
- Best for: Code generation, completion, debugging
- Size: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
- Command:
ollama run codellama
DeepSeek Coder (6.7B, 33B) - Strong coding performance
- Best for: Multiple programming languages, code explanation
- Size: 6.7B (3.8GB), 33B (19GB)
- Command:
ollama run deepseek-coder
Qwen 2.5 Coder (1.5B, 7B, 32B) - Alibaba's coding model
- Best for: Code generation with multilingual support
- Command:
ollama run qwen2.5-coder
Specialized Models
Llama 3.2 Vision (11B, 90B) - Multimodal understanding
- Best for: Image analysis, visual question answering
- Command:
ollama run llama3.2-vision
Phi-3 (3.8B, 14B) - Microsoft's small language models
- Best for: Efficient deployment, edge devices
- Size: 3.8B (2.3GB), 14B (7.9GB)
- Command:
ollama run phi3
Gemma 2 (2B, 9B, 27B) - Google's open models
- Best for: Safety-focused applications
- Command:
ollama run gemma2
Neural Chat (7B) - Fine-tuned for conversations
- Best for: Interactive dialogue
- Command:
ollama run neural-chat
How to Choose a Model
- For general use on consumer hardware: llama3.2:3b or mistral
- For coding tasks: codellama:7b or deepseek-coder:6.7b
- For best quality (with powerful GPU): llama3.1:70b or mixtral:8x22b
- For resource-constrained systems: phi3:3.8b or llama3.2:1b
- For vision tasks: llama3.2-vision
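If you want to script this decision, the snippet below lists what is already on disk with approximate sizes by calling the local REST API (covered in more detail later). It's a minimal sketch that assumes the Ollama server is running on the default port and that the Python requests package is installed:
import requests

# Ask the local Ollama server which models are installed (default port 11434)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size_gb = model.get("size", 0) / 1e9
    print(f"{model['name']}: {size_gb:.1f} GB on disk")
Compare the sizes against your available RAM/VRAM before pulling anything larger.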
Starting and Stopping Ollama
Starting Ollama
The Ollama server needs to be running to handle model requests.
On macOS and Linux:
# Start in foreground (see logs)
ollama serve
# Or run as background service (automatically starts on boot)
# On macOS, Ollama runs as a LaunchAgent
# On Linux with systemd:
systemctl start ollama
On Windows:
Ollama runs as a service automatically after installation. Access it via the system tray icon.
Verify it's running:
curl http://localhost:11434
# Should return: "Ollama is running"
Stopping Ollama
On macOS:
# If running in foreground, press Ctrl+C
# If running as service:
launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist
On Linux:
# If running in foreground, press Ctrl+C
# If running as systemd service:
systemctl stop ollama
On Windows:
Right-click the Ollama icon in the system tray and select "Quit Ollama."
Docker:
docker stop ollama
Serving a Specific Model
Once Ollama is running, you can interact with models in several ways.
Interactive Mode
Run a model and chat directly in the terminal:
ollama run llama3.2
This starts an interactive session. Type your prompts and press Enter. Type /bye to exit.
Pull a Model Without Running
Download a model for later use:
ollama pull codellama:13b
List Downloaded Models
See which models you have locally:
ollama list
Output shows model name, size, and last modified date.
Run with Custom Parameters
Control model behavior at run time:
# Show timing and throughput stats for each response
ollama run llama3.2 --verbose
# Inside an interactive session, adjust sampling with /set, for example:
# /set parameter temperature 0.8
# /set parameter top_p 0.9
Note that ollama run does not accept sampling flags like --temperature; set these with /set, in a Modelfile, or via the API's options field, as in the sketch below.
API Mode (Serving to Other Applications)
Ollama exposes a REST API on http://localhost:11434, plus an OpenAI-compatible endpoint at /v1:
# Generate a completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Chat completion
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
This API allows integration with countless tools and applications.
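Because Ollama also exposes an OpenAI-compatible endpoint at /v1, many existing SDKs work against it with nothing more than a base-URL change. A minimal sketch with the official openai Python package (the API key is a required placeholder; Ollama does not check it):
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)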
Remove Models You Don't Need
Free up disk space:
ollama rm codellama:70b
Quick Reference Commands
# Installation & Updates
ollama --version # Check version
curl -fsSL https://ollama.com/install.sh | sh # Install/update
# Model Management
ollama pull <model> # Download a model
ollama list # List local models
ollama rm <model> # Delete a model
ollama show <model> # Show model information
# Running Models
ollama run <model> # Interactive mode
ollama run <model> "prompt" # One-off generation
ollama serve # Start API server
# API Testing
curl http://localhost:11434 # Check if running
curl http://localhost:11434/api/tags # List models via API
Part B: Real-World Use Cases
Now that you have Ollama set up, let's explore powerful ways to use it in your daily workflow.
Use Case 1: Local Coding with Claude Code and AI IDEs
One of the most powerful applications of Ollama is integrating local models with development tools. This gives you AI-powered coding assistance that's private, fast, and free.
Using Ollama with Claude Code
Claude Code is a command-line agentic coding tool. While it typically uses Anthropic's API, you can configure it to work with local models via Ollama for certain tasks.
Setup:
- Install Claude Code:
npm install -g @anthropic-ai/claude-code
- Configure to use Ollama as a secondary tool for code completion:
# You can use Ollama's OpenAI-compatible endpoint
export OLLAMA_API_BASE=http://localhost:11434/v1
Workflow:
- Use Claude Code with Anthropic's API for complex agentic tasks (planning, multi-file refactoring)
- Use local Ollama models for quick completions, code explanations, or offline work
- Combine both: Claude for architecture, Ollama for implementation details
Using Ollama with Cursor
Cursor is a popular AI-powered IDE. Configure it to use Ollama:
- Open Cursor Settings
- Navigate to Models
- Add a custom model endpoint: http://localhost:11434/v1
- Select your preferred model (e.g., codellama:13b)
Benefits:
- No API costs for code completions
- Faster responses (no network latency)
- Private code never leaves your machine
Using Ollama with Continue.dev
Continue is a VS Code / JetBrains extension that brings AI assistance to your IDE:
- Install the Continue extension
- Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "CodeLlama Local",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
- Use Cmd/Ctrl+L to chat with your local model
- Highlight code and ask questions, generate tests, or refactor
Using Ollama with Tabby (Self-Hosted Code Completion)
Tabby provides GitHub Copilot-like code completion:
# Run Tabby with a local CodeLlama model (Tabby can also be pointed at an Ollama backend via its config file)
docker run -it --gpus all \
-p 8080:8080 \
-v ~/.tabby:/data \
tabbyml/tabby \
serve --model TabbyML/CodeLlama-7B --device cuda
Configure your IDE to point to http://localhost:8080.
Best Practices for Local Coding Models
- Choose the right model size: 7B models for real-time completions, 13B+ for deeper analysis
- Use code-specific models: CodeLlama, DeepSeek Coder, or Qwen Coder perform better than general models
- Combine tools: Use local models for most tasks, cloud models for complex reasoning
- Fine-tune prompts: Local models benefit from clear, specific instructions (see the sketch below)
- Monitor resource usage: Large models need significant RAM/VRAM
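As a concrete example of the "clear, specific instructions" point, here is a small sketch that asks a local coding model for a scoped review. It assumes the ollama Python package is installed and codellama:7b has been pulled:
import ollama

code = """
def average(values):
    return sum(values) / len(values)
"""

# A scoped, specific instruction works better than "improve this code"
response = ollama.chat(
    model="codellama:7b",
    messages=[{
        "role": "user",
        "content": f"Review this Python function for edge cases and suggest a fix:\n{code}",
    }],
)
print(response["message"]["content"])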
Use Case 2: Document Ingestion and RAG Workflows
Retrieval-Augmented Generation (RAG) is a technique where you feed your own documents to an LLM, enabling it to answer questions based on your specific data. Ollama is perfect for building private RAG systems.
What is RAG?
Traditional LLMs only know what they were trained on. RAG extends their knowledge by:
- Ingesting your documents (PDFs, text files, web pages)
- Chunking them into smaller pieces
- Embedding these chunks into vector representations
- Storing embeddings in a vector database
- Retrieving relevant chunks when you ask a question
- Generating answers using both the retrieved context and the LLM
Simple RAG with Ollama and Python
Here's a basic RAG implementation:
# Install dependencies
# pip install ollama chromadb langchain langchain-community pypdf

import ollama
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 1. Load and chunk documents
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 2. Create embeddings and store in a vector DB
client = chromadb.Client()
collection = client.create_collection("my_docs")

for i, chunk in enumerate(chunks):
    # Use Ollama to generate embeddings
    embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=chunk.page_content
    )["embedding"]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[chunk.page_content]
    )

# 3. Query the system
def ask_question(question):
    # Get an embedding for the question
    question_embedding = ollama.embeddings(
        model="nomic-embed-text",
        prompt=question
    )["embedding"]

    # Find the most relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=3
    )

    # Build context from the retrieved chunks
    context = "\n\n".join(results["documents"][0])

    # Generate an answer using Ollama
    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response["message"]["content"]

# Use it
answer = ask_question("What are the main points in the document?")
print(answer)
Advanced RAG with Ollama and Open Source Tools
Using Ollama with LangChain:
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

# Initialize Ollama (requires: pip install langchain-community faiss-cpu)
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create the vector store from the chunks prepared earlier
vectorstore = FAISS.from_documents(chunks, embeddings)

# Create the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# Ask questions
result = qa_chain.run("What does the document say about X?")
Using Ollama with AnythingLLM:
AnythingLLM is a full-stack RAG application with a beautiful UI:
- Install: Download from anythingllm.com
- Configure Ollama as LLM provider in settings
- Upload documents via the UI
- Start chatting with your documents
Using Ollama with Quivr (Your Second Brain):
Quivr is a RAG application for personal knowledge management:
# Clone and setup
git clone https://github.com/QuivrHQ/quivr
cd quivr
docker-compose up
# Configure to use Ollama in .env file
LLM_PROVIDER=ollama
OLLAMA_API_BASE=http://localhost:11434
Ingestion Workflow Best Practices
1. Document Preprocessing:
- Clean text (remove headers, footers, page numbers)
- Extract tables and images separately if needed
- Use OCR for scanned documents (tesseract-ocr)
2. Chunking Strategies:
- Fixed size: Simple, works for uniform content (500-1000 tokens)
- Semantic: Split by paragraphs or sections (better context)
- Recursive: LangChain's RecursiveCharacterTextSplitter (balanced)
3. Embedding Models:
- nomic-embed-text: Best all-around, 137M parameters
- mxbai-embed-large: Highest quality, 335M parameters
- all-minilm: Fastest, lightweight option
4. Retrieval Optimization:
- Use hybrid search (keyword + semantic)
- Re-rank results with a cross-encoder (see the sketch after this list)
- Adjust number of retrieved chunks (typically 3-5)
5. Prompt Engineering:
Context: [retrieved chunks]
Based ONLY on the context above, answer the following question.
If the answer is not in the context, say "I don't have enough information."
Question: {user_question}
Answer:
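The re-ranking step mentioned under Retrieval Optimization is easy to prototype. A minimal sketch using the sentence-transformers CrossEncoder (the model name here is one common choice, not a requirement): score each retrieved chunk against the question and keep only the best ones before building the prompt.
from sentence_transformers import CrossEncoder

def rerank(question, chunks, keep=3):
    # Score (question, chunk) pairs with a cross-encoder and keep the top hits
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]

# Example: rerank the chunks returned by the vector search in the RAG code above
# top_chunks = rerank(question, results["documents"][0])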
Production RAG Considerations
- Vector Database: Use Qdrant, Weaviate, or Pinecone for scale
- Caching: Cache embeddings to avoid recomputation (see the sketch below)
- Monitoring: Track retrieval accuracy and response quality
- Updates: Implement incremental updates when documents change
- Multi-modal: Use vision models for PDFs with images/charts
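Caching is the easiest of these wins. A minimal sketch that keys embeddings by a hash of the chunk text so unchanged chunks are never re-embedded (an in-memory dict here; swap in SQLite or your vector database's metadata for persistence):
import hashlib
import ollama

_embedding_cache = {}

def embed_cached(text, model="nomic-embed-text"):
    # Key the cache by a content hash so identical chunks are only embedded once
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = ollama.embeddings(model=model, prompt=text)["embedding"]
    return _embedding_cache[key]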
Performance Tips for Ollama
- GPU Acceleration: Ollama automatically uses GPU if available (NVIDIA, AMD, or Apple Silicon)
- Memory Management: Larger models need more RAM/VRAM (check with ollama show <model>)
- Concurrent Requests: Ollama handles multiple requests, but memory is shared
- Model Quantization: Use quantized models (Q4, Q5) for better performance with acceptable quality loss
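To see what quantization and parameter count a local model actually uses, you can query the /api/show endpoint. This is a sketch; the exact request and response field names may vary slightly between Ollama versions:
import requests

# Ask the local server for a model's metadata, including its quantization level
resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3.2"},
    timeout=5,
)
details = resp.json().get("details", {})
print(details.get("parameter_size"), details.get("quantization_level"))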
Conclusion
Ollama opens up a world of possibilities for running powerful AI models on your own hardware. From coding assistance to document analysis, from privacy-preserving workflows to cost-free experimentation, the applications are endless.
Key Takeaways:
- Ollama makes running local LLMs accessible and practical
- Choose models based on your hardware and use case
- Integration with development tools enhances productivity
- RAG workflows unlock the power of your private data
- Local models provide privacy, speed, and zero API costs
Next Steps:
- Install Ollama and test a few models
- Integrate with your favorite IDE or coding tool
- Build a simple RAG system with your documents
- Explore the Ollama model library at ollama.com/library
- Join the community at github.com/ollama/ollama
Happy local LLM exploration! 🚀