Local RAG Setup Guide
Build a private AI knowledge base on your Mac — your documents never leave your device.
What is RAG?
Your Documents + AI = Accurate, Private Answers
RAG (Retrieval-Augmented Generation) combines a large language model with your own documents. Instead of relying solely on what the model was trained on, RAG retrieves relevant passages from your files and feeds them to the LLM — producing answers grounded in your actual data.
Why Local RAG Matters
- Complete privacy — your data never leaves your machine, no cloud uploads
- Zero API costs — no per-token charges, no subscriptions
- Full control — choose your models, chunking strategy, and retrieval pipeline
- Compliance-ready — meets data residency requirements for regulated industries
Architecture
How RAG Works
Documents are split into chunks, converted to numerical embeddings, and stored in a vector database. When you ask a question, the system finds the most relevant chunks and sends them alongside your query to the LLM for an accurate, grounded answer.
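The retrieval half of this flow can be sketched in a few lines of Python. This is a toy illustration only: a bag-of-words counter stands in for a real embedding model such as nomic-embed-text, and the generation step is omitted.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a real model
    # like nomic-embed-text; real embeddings are dense vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Find the chunks most similar to the query; these would be
    # prepended to the prompt sent to the LLM.
    scored = [(cosine(embed(query), embed(c)), c) for c in chunks]
    return [c for s, c in sorted(scored, reverse=True)[:k] if s > 0]

chunks = [
    "Ollama runs large language models locally on macOS.",
    "Qdrant is a vector database with hybrid search.",
    "RAG retrieves relevant chunks before generation.",
]
context = retrieve("How does RAG find relevant chunks?", chunks)
```

In a real pipeline the same idea holds, but embeddings come from a model and similarity search runs inside the vector database.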
Core Components
- Ollama — Local LLM runtime (llama3, qwen2.5, mistral)
- Embedding model — Converts text to vectors (nomic-embed-text via Ollama)
- Vector store — Stores and searches embeddings (ChromaDB, Qdrant)
- Ingestion pipeline — Splits, embeds, and loads documents
- Chat UI — Interface for querying your knowledge base
Quick Start
Open WebUI + Ollama
The easiest way to get started with RAG. Open WebUI has built-in document upload and retrieval — no extra services needed. Upload your files through the UI and start asking questions immediately.
Install Ollama & Pull Models
Install Ollama from ollama.com, then pull a chat model and the embedding model.
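For example (the model tags below are suggestions; pick a chat model that fits your RAM):

```shell
# Chat model (llama3.1:8b needs roughly 8 GB of RAM)
ollama pull llama3.1:8b
# Embedding model used for RAG retrieval
ollama pull nomic-embed-text
```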
Launch Open WebUI
Run Open WebUI via Docker with a single command.
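A typical invocation, following the Open WebUI documentation (the `--add-host` flag lets the container reach Ollama on your host Mac):

```shell
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```

The UI is then available at http://localhost:3000.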
Upload Documents
Open the UI, go to Workspace, then Documents. Upload your PDF, DOCX, TXT, or MD files.
Start Chatting
Begin a new chat, reference your documents with #, and ask questions. The AI retrieves relevant passages automatically.
Supported formats: PDF, DOCX, TXT, MD, CSV, XLSX, PPTX
More Control
PrivateGPT
PrivateGPT gives you more control over document ingestion and retrieval. It supports batch ingestion, configurable chunking, and a clean API for integration — all running 100% locally with Ollama.
Clone PrivateGPT
Clone the repository and install dependencies.
Configure Ollama Backend
Edit settings-ollama.yaml to point to your local Ollama instance and choose models.
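An illustrative fragment, using the key names referenced later in this guide; exact keys vary between PrivateGPT versions, so check the `settings-ollama.yaml` shipped with your checkout:

```yaml
llm:
  mode: ollama
  model: llama3.1:8b          # any chat model you have pulled
embedding:
  mode: ollama
  model: nomic-embed-text
ollama:
  api_base: http://localhost:11434
rag:
  similarity_top_k: 3         # default 2; 3-5 recommended
```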
Ingest Documents
Place files in the source_documents folder and run ingestion, or use the web UI to upload.
Query via UI or API
Use the built-in UI or call the REST API for programmatic access to your knowledge base.
Set `llm.model` to your preferred Ollama model (e.g. `llama3.1:8b`) and `embedding.model` to `nomic-embed-text`. Adjust `rag.similarity_top_k` to control how many chunks are retrieved (default: 2; recommended: 3–5).
Full Automation
n8n + Qdrant + Ollama
Build a fully automated RAG pipeline with n8n for workflow orchestration, Qdrant as a production-grade vector database, and Ollama for local inference. Automate document ingestion, enable multi-user access, and create custom retrieval chains.
Launch Qdrant & n8n
Start Qdrant and n8n with Docker Compose. Ollama runs on your host Mac.
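A minimal `docker-compose.yml` sketch; the volume paths and the `OLLAMA_HOST` variable are illustrative choices your n8n workflows would reference, not requirements of either image:

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes: ["./qdrant_data:/qdrant/storage"]
  n8n:
    image: n8nio/n8n
    ports: ["5678:5678"]
    volumes: ["./n8n_data:/home/node/.n8n"]
    environment:
      # Reach Ollama running on the host Mac from inside the container
      - OLLAMA_HOST=http://host.docker.internal:11434
```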
Build Ingestion Workflow
In n8n, create a workflow: file trigger → extract text → chunk → embed via Ollama → store in Qdrant.
Build Query Workflow
Create a second workflow: webhook receives question → embed query → search Qdrant → send context + question to Ollama → return answer.
Activate & Test
Activate both workflows. Drop documents into the watched folder and query via the webhook endpoint or a chat UI.
n8n Workflow Pattern
The ingestion workflow watches a folder for new files, extracts text, splits it into chunks (500–1000 tokens with overlap), generates embeddings via Ollama, and stores them in Qdrant. The query workflow receives questions via a webhook, embeds the query, retrieves the top-k similar chunks from Qdrant, and sends the combined context to Ollama for a grounded answer.
Advanced Retrieval
Hybrid Search + Reranking
Vector search alone misses exact keyword matches, and keyword search alone misses semantic meaning. The current best practice is to combine both, then rerank the results with a cross-encoder model for maximum precision.
How Hybrid Search Works
Run dense vector search (semantic) and sparse keyword search (BM25) in parallel, then fuse results using Reciprocal Rank Fusion (RRF). Each candidate gets a combined score based on its rank in both lists. This catches both semantically similar passages and exact-term matches.
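RRF itself is simple enough to sketch exactly. Each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60:

```python
def rrf(dense_ranked, sparse_ranked, k=60):
    """Reciprocal Rank Fusion: score = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]    # semantic (vector) ranking
sparse = ["d1", "d4", "d3"]   # keyword (BM25) ranking
fused = rrf(dense, sparse)
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, while documents found by only one method are still retained.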
The 3-Stage Retrieval Pipeline
Recall — Cast a Wide Net
Retrieve top 100–200 candidates using hybrid search (dense + BM25). Fast and cheap.
Deduplicate — Remove Noise
Apply Maximal Marginal Relevance (MMR) to remove near-duplicates and improve diversity.
Rerank — Precision Scoring
Score remaining candidates with a cross-encoder reranker for the most accurate final ranking.
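The MMR step in stage 2 can be sketched over precomputed similarities. This greedy version trades off query relevance against redundancy with a λ weight; the similarity matrices here are made-up numbers for illustration:

```python
def mmr(query_sim, doc_sims, lam=0.5, k=2):
    """Greedy Maximal Marginal Relevance.
    query_sim[i]: similarity of doc i to the query.
    doc_sims[i][j]: similarity between docs i and j."""
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize docs too similar to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 is a near-duplicate of doc 0; doc 2 is distinct
query_sim = [0.9, 0.88, 0.5]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
picked = mmr(query_sim, doc_sims, k=2)
```

Even though doc 1 is more relevant than doc 2, MMR skips it as a near-duplicate of doc 0 and picks the distinct doc 2 instead.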
Local Reranker Models (via Ollama)
| Model | Type | Notes |
|---|---|---|
| Qwen3-Reranker-0.5B | Cross-encoder | Small, fast, 100+ languages. Best for most local setups. |
| Qwen3-Reranker-4B | Cross-encoder | Higher accuracy, needs 32 GB+ RAM alongside LLM. |
| ColBERTv2 | Late interaction | Token-level matching. Can be used as retriever or reranker. |
| mxbai-rerank | Cross-encoder | Fast, open-source, good cost-efficiency. |
| bge-reranker-v2 | Cross-encoder | Strong MTEB benchmark performance. |
Chunking Strategies
Smart Chunking for Large Datasets
How you split documents into chunks directly affects retrieval quality. For large datasets, the right chunking strategy can mean the difference between useful answers and irrelevant noise.
Recommended Approach
Start with fixed 512-token chunks with 10% overlap as your baseline. If retrieval quality is insufficient, upgrade to parent-child chunking (best accuracy-to-cost ratio). Use contextual retrieval only for high-value corpora where the indexing cost is justified.
| Strategy | Accuracy | Index Cost | Best For |
|---|---|---|---|
| Fixed-size | Good | Low | Quick start, general use |
| Parent-child | Excellent | Low | Large datasets, structured docs |
| Late chunking | Very good | Medium | Long documents, context-heavy |
| Contextual | Best | High | High-value, critical corpora |
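The fixed-size baseline is easy to implement. A sketch, using word strings as stand-ins for tokens (a real pipeline would count tokens with the embedding model's tokenizer):

```python
def chunk_tokens(tokens, size=512, overlap_frac=0.10):
    """Fixed-size chunks with ~10% overlap between neighbours."""
    step = size - int(size * overlap_frac)  # advance 461 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

The overlap ensures a sentence falling on a chunk boundary is fully contained in at least one chunk.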
Query Optimization
Smarter Queries, Better Answers
The way you query your knowledge base matters as much as how you index it. These techniques transform vague questions into precise retrievals.
HyDE in Practice
User Asks Question
“What are the data residency requirements in Hong Kong?”
LLM Generates Hypothetical Answer
Ollama produces a plausible answer (may contain hallucinations — that is expected).
Embed the Hypothetical Answer
The generated text is closer to actual documents in embedding space than the original question.
Retrieve & Generate Final Answer
Search finds more relevant documents, LLM produces a grounded answer from actual data.
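The four steps above can be condensed into a sketch. A stub function stands in for the Ollama call, and the toy term-count "embedding" stands in for a real model; the PDPO document is an invented example:

```python
import math
from collections import Counter

def embed(text):
    # Toy term-count vector standing in for a real embedding model
    return Counter(text.lower().split())

def cos(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    n = (math.sqrt(sum(v * v for v in a.values()))
         * math.sqrt(sum(v * v for v in b.values())))
    return dot / n if n else 0.0

def fake_llm(question):
    # Stub for Ollama: a plausible (possibly hallucinated) answer
    return ("Personal data collected in Hong Kong must be stored and "
            "processed according to local data residency rules.")

def hyde_search(question, chunks, top_k=1):
    hypo = fake_llm(question)  # step 2: generate hypothetical answer
    q = embed(hypo)            # step 3: embed it, not the question
    return sorted(chunks, key=lambda c: cos(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "The PDPO governs how personal data is stored and processed in Hong Kong.",
    "Apple Silicon Macs use unified memory shared by CPU and GPU.",
]
best = hyde_search("What are the data residency requirements in Hong Kong?",
                   chunks)
```

The hypothetical answer shares far more vocabulary (and, with real embeddings, semantics) with the relevant document than the bare question does.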
LlamaIndex provides `HyDEQueryTransform`, and Haystack has a native HyDE component. Both work with Ollama as the local LLM backend. For query decomposition, use LlamaIndex's `SubQuestionQueryEngine`.
Scaling
Scaling RAG & Graph RAG
As your document collection grows beyond hundreds of thousands of chunks, you need additional strategies to maintain speed and accuracy. Graph RAG adds structured knowledge understanding that vector search alone cannot provide.
Index Tuning (HNSW)
HNSW is the dominant approximate nearest-neighbour algorithm. Its default parameters degrade as your corpus grows — you must re-tune as you scale.
| Parameter | Default | RAG Recommended | Effect |
|---|---|---|---|
| M (connections) | 16 | 32–64 | Higher = better recall, more memory |
| efConstruction | 200 | 256–512 | Higher = better index quality, slower build |
| efSearch | 50 | 128–256 | Higher = better recall at query time |
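In Qdrant, the first two parameters are set when creating a collection. A sketch of the request body (field names follow Qdrant's REST API, where efConstruction is called `ef_construct`; the vector size of 768 assumes nomic-embed-text):

```json
{
  "vectors": { "size": 768, "distance": "Cosine" },
  "hnsw_config": { "m": 32, "ef_construct": 256 }
}
```

Send this as the body of `PUT /collections/<name>`. The query-time parameter (efSearch) is passed per search request as `"params": {"hnsw_ef": 128}`.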
Tiered Retrieval
For millions of documents, use a multi-tier approach: first filter by metadata (date, source, category) to narrow the candidate pool, then run vector search within the partition, then rerank the results. This keeps latency low even at massive scale.
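The tiering idea in miniature (in production the metadata filter and vector scoring both run inside the vector database, e.g. via Qdrant's payload filters):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tiered_search(docs, query_vec, meta_filter, k=2):
    # Tier 1: cheap metadata filter narrows the candidate pool
    pool = [d for d in docs if all(d["meta"].get(key) == val
                                   for key, val in meta_filter.items())]
    # Tier 2: vector scoring only within the filtered partition
    pool.sort(key=lambda d: dot(d["vec"], query_vec), reverse=True)
    return pool[:k]  # Tier 3: a reranker would score this shortlist

docs = [
    {"id": 1, "meta": {"source": "hr"},    "vec": [0.9, 0.1]},
    {"id": 2, "meta": {"source": "legal"}, "vec": [0.8, 0.6]},
    {"id": 3, "meta": {"source": "legal"}, "vec": [0.1, 0.9]},
]
hits = tiered_search(docs, [1.0, 0.0], {"source": "legal"}, k=1)
```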
Graph RAG — Understanding Relationships
Vector search finds similar passages, but cannot understand relationships across documents. Graph RAG extracts entities and relationships to build a knowledge graph, enabling queries like “What are the main themes across all documents?” or “How do these regulations relate to each other?”
Evaluation
Measuring RAG Quality
You cannot improve what you do not measure. Use the RAGAS framework to evaluate both retrieval and generation quality separately — a retrieval problem needs a different fix than a generation problem.
| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | Are retrieved chunks relevant and correctly ranked? | > 0.7 |
| Context Recall | Does retrieval capture all needed information? | > 0.9 |
| Faithfulness | Is the answer grounded in retrieved context (no hallucinations)? | > 0.85 |
| Answer Relevancy | Is the answer actually relevant to the question? | > 0.8 |
| Hit Rate | Does any relevant document appear in top-k results? | > 0.95 |
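RAGAS computes most of these metrics with an LLM judge, but hit rate needs only an annotated golden set. A simplified version, with invented question and document IDs:

```python
def hit_rate(results, golden):
    """Fraction of questions where at least one relevant doc
    appears among the retrieved top-k results."""
    hits = sum(1 for q, retrieved in results.items()
               if set(retrieved) & set(golden[q]))
    return hits / len(results)

golden  = {"q1": ["d1"], "q2": ["d7"], "q3": ["d2", "d9"]}   # annotated
results = {"q1": ["d1", "d3"], "q2": ["d4", "d5"], "q3": ["d9", "d2"]}
rate = hit_rate(results, golden)
```

Here q1 and q3 are hits and q2 is a miss, so the hit rate is 2/3, well below the 0.95 target, signalling a retrieval problem rather than a generation problem.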
Evaluation Best Practices
- Create a golden dataset — 50–100 question-answer pairs with annotated relevant documents
- Measure retrieval and generation separately — pinpoint exactly where quality drops
- Benchmark on YOUR data — MTEB leaderboard scores do not always translate to your domain
- A/B test chunking strategies — compare fixed vs. parent-child vs. contextual on your actual queries
- Monitor in production — track answer quality, latency, and user feedback continuously
Recommended Stack
Production RAG Stack for Mac
The optimal local RAG stack for Apple Silicon, combining all the advanced techniques above into a cohesive system.
| Component | Recommended Tool | Why |
|---|---|---|
| LLM Runtime | Ollama | Native Metal acceleration, simple CLI |
| Embeddings | nomic-embed-text / bge-m3 | Top open-source MTEB scores, fast locally |
| Vector DB | Qdrant | Native hybrid search, BM42, sparse vectors |
| Reranker | Qwen3-Reranker-0.5B | Small, multilingual, runs via Ollama |
| Framework | LlamaIndex / Haystack | Full pipeline orchestration, HyDE built-in |
| Graph RAG | LightRAG | Local knowledge graph with Ollama support |
| Evaluation | RAGAS | Standard RAG eval, reference-free metrics |
Comparison
Choosing the Right Tier
Each tier balances simplicity against flexibility. Start with Tier 1 for quick experiments, move to Tier 2 for team use, or go straight to Tier 3 for enterprise automation.
| Feature | Tier 1: Open WebUI | Tier 2: PrivateGPT | Tier 3: n8n + Qdrant |
|---|---|---|---|
| Setup time | ~10 minutes | ~30 minutes | ~1 hour |
| Ingestion | Manual upload via UI | Folder-based + UI | Automated pipeline |
| Vector store | Built-in (ChromaDB) | Built-in (Qdrant/Chroma) | Qdrant (dedicated) |
| API access | Limited | Full REST API | Webhook endpoints |
| Multi-user | Yes (built-in auth) | Single user | Yes (n8n auth) |
| Best for | Individuals, quick start | Teams, batch processing | Enterprises, automation |
Hardware
Hardware Requirements
RAM is the primary bottleneck for local RAG. The LLM and embedding model run simultaneously, so plan for both. Below are recommended minimums per tier.
| Tier | Minimum RAM | Recommended RAM | Suggested LLM |
|---|---|---|---|
| Tier 1: Open WebUI | 16 GB | 24 GB | llama3.1:8b |
| Tier 2: PrivateGPT | 16 GB | 32 GB | qwen2.5:14b |
| Tier 3: n8n + Qdrant | 32 GB | 64 GB | qwen2.5:32b |