
Local RAG Setup Guide

Build a private AI knowledge base on your Mac — your documents never leave your device.

What is RAG?

Your Documents + AI = Accurate, Private Answers

RAG (Retrieval-Augmented Generation) combines a large language model with your own documents. Instead of relying solely on what the model was trained on, RAG retrieves relevant passages from your files and feeds them to the LLM — producing answers grounded in your actual data.

Why Local RAG Matters

  • Complete privacy — your data never leaves your machine, no cloud uploads
  • Zero API costs — no per-token charges, no subscriptions
  • Full control — choose your models, chunking strategy, and retrieval pipeline
  • Compliance-ready — meets data residency requirements for regulated industries

Use Cases

⚖️
Legal
Contract review, case law research, regulatory compliance. Query thousands of legal documents instantly.
📈
Finance
Compliance documentation, risk reports, internal policies. Keep sensitive financial data on-premises.
🏥
Healthcare
Patient records, clinical guidelines, research papers. HIPAA-compatible local processing.
🏢
Enterprise
Internal wikis, SOPs, onboarding manuals. Turn institutional knowledge into an instant Q&A system.

Architecture

How RAG Works

Documents are split into chunks, converted to numerical embeddings, and stored in a vector database. When you ask a question, the system finds the most relevant chunks and sends them alongside your query to the LLM for an accurate, grounded answer.

Documents (PDF, DOCX, TXT, MD) → Embeddings (nomic-embed-text) → Vector DB (ChromaDB / Qdrant) → Query + LLM (Ollama) → Answer (grounded response)

Core Components

  • Ollama — Local LLM runtime (llama3, qwen2.5, mistral)
  • Embedding model — Converts text to vectors (nomic-embed-text via Ollama)
  • Vector store — Stores and searches embeddings (ChromaDB, Qdrant)
  • Ingestion pipeline — Splits, embeds, and loads documents
  • Chat UI — Interface for querying your knowledge base
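The components above can be wired together in a few dozen lines. This is a self-contained sketch: the bag-of-words embed() is a deliberate stand-in so it runs without any services (a real setup would call an embedding model such as nomic-embed-text through Ollama and use a real vector store such as ChromaDB), but the ingest → retrieve → prompt flow is the same.

```python
# Minimal RAG loop: chunk → embed → store → retrieve → build prompt.
# embed() is a toy bag-of-words stand-in so the sketch runs without a
# model server; a real pipeline would call nomic-embed-text via Ollama.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Ingestion": store (chunk, vector) pairs, playing the role of the vector DB.
chunks = [
    "Ollama runs large language models locally on Apple Silicon.",
    "ChromaDB stores embeddings and supports similarity search.",
    "RAG retrieves relevant chunks and feeds them to the LLM.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question: str, k: int = 2) -> list:
    qv = embed(question)
    ranked = sorted(index, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Query: the most similar chunks become the LLM's context.
context = retrieve("How does RAG use retrieved chunks?")
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The prompt would then be sent to a chat model; the retrieval step is what grounds the answer in your documents.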
Tier 1 — Simple

Quick Start

Open WebUI + Ollama

The easiest way to get started with RAG. Open WebUI has built-in document upload and retrieval — no extra services needed. Upload your files through the UI and start asking questions immediately.

Best For
Individuals, small teams, quick prototyping. Get a working RAG system in under 10 minutes.
1

Install Ollama & Pull Models

Install Ollama from ollama.com, then pull a chat model and the embedding model.

2

Launch Open WebUI

Run Open WebUI via Docker with a single command.

3

Upload Documents

Open the UI, go to Workspace, then Documents. Upload your PDF, DOCX, TXT, or MD files.

4

Start Chatting

Begin a new chat, reference your documents with #, and ask questions. The AI retrieves relevant passages automatically.

Terminal
# Step 1: Pull models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Step 2: Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Step 3: Open in browser
open http://localhost:3000

Supported formats: PDF, DOCX, TXT, MD, CSV, XLSX, PPTX

Tier 2 — Intermediate

More Control

PrivateGPT

PrivateGPT gives you more control over document ingestion and retrieval. It supports batch ingestion, configurable chunking, and a clean API for integration — all running 100% locally with Ollama.

Best For
Teams needing batch ingestion, folder watching, better chunking control, and API access.
1

Clone PrivateGPT

Clone the repository and install dependencies.

2

Configure Ollama Backend

Edit settings-ollama.yaml to point to your local Ollama instance and choose models.

3

Ingest Documents

Place files in the source_documents folder and run ingestion, or use the web UI to upload.

4

Query via UI or API

Use the built-in UI or call the REST API for programmatic access to your knowledge base.

Terminal
# Step 1: Clone and set up
git clone https://github.com/zylon-ai/private-gpt.git
cd private-gpt
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"

# Step 2: Configure for Ollama
cp settings-ollama.yaml settings.yaml
# Edit settings.yaml — set llm_model and embedding_model

# Step 3: Run PrivateGPT
PGPT_PROFILES=ollama make run

# Step 4: Open in browser
open http://localhost:8001
Key Configuration
In settings.yaml, set llm.model to your preferred Ollama model (e.g. llama3.1:8b) and embedding.model to nomic-embed-text. Adjust rag.similarity_top_k to control how many chunks are retrieved (default: 2, recommended: 3–5).
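Based on the settings described above, the relevant part of settings.yaml looks roughly like this. Treat it as a sketch: exact key names and nesting vary between PrivateGPT versions, so check the settings file shipped with your checkout.

```yaml
llm:
  mode: ollama
  model: llama3.1:8b        # chat model served by Ollama
embedding:
  mode: ollama
  model: nomic-embed-text   # embedding model
rag:
  similarity_top_k: 4       # chunks retrieved per query (default 2)
```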
Tier 3 — Advanced

Full Automation

n8n + Qdrant + Ollama

Build a fully automated RAG pipeline with n8n for workflow orchestration, Qdrant as a production-grade vector database, and Ollama for local inference. Automate document ingestion, enable multi-user access, and create custom retrieval chains.

Best For
Enterprises, multi-user environments, automated ingestion pipelines, and custom retrieval workflows.

The Stack

🔄
n8n
Visual workflow builder. Orchestrates ingestion, chunking, embedding, and retrieval as automated workflows.
🧠
Qdrant
Production-grade vector database. Handles millions of embeddings with filtering, payload storage, and high-speed search.
🦙
Ollama
Local LLM runtime. Provides both chat completion and embedding models for the entire pipeline.
1

Launch Qdrant & n8n

Start Qdrant and n8n with Docker Compose. Ollama runs on your host Mac.

2

Build Ingestion Workflow

In n8n, create a workflow: file trigger → extract text → chunk → embed via Ollama → store in Qdrant.

3

Build Query Workflow

Create a second workflow: webhook receives question → embed query → search Qdrant → send context + question to Ollama → return answer.

4

Activate & Test

Activate both workflows. Drop documents into the watched folder and query via the webhook endpoint or a chat UI.

docker-compose.yml
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=your-password
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - n8n_data:/home/node/.n8n
      - ./documents:/documents
volumes:
  qdrant_data:
  n8n_data:
Terminal
# Pull embedding model
ollama pull nomic-embed-text

# Start the stack
docker compose up -d

# Verify services
curl http://localhost:6333/healthz   # Qdrant
open http://localhost:5678           # n8n

n8n Workflow Pattern

The ingestion workflow watches a folder for new files, extracts text, splits it into chunks (500–1000 tokens with overlap), generates embeddings via Ollama, and stores them in Qdrant. The query workflow receives questions via a webhook, embeds the query, retrieves the top-k similar chunks from Qdrant, and sends the combined context to Ollama for a grounded answer.
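The query workflow can be prototyped outside n8n in a few lines of Python before wiring it into nodes. The endpoint URLs and the "docs" collection name are illustrative assumptions; the HTTP helpers mirror the REST calls the n8n nodes would make, and only the prompt-assembly step is exercised here so no services are required.

```python
# Prototype of the n8n query workflow: embed → search → prompt → generate.
# Endpoint URLs and the "docs" collection name are illustrative assumptions.
import json
from urllib import request

OLLAMA = "http://localhost:11434"
QDRANT = "http://localhost:6333"

def _post(url: str, payload: dict) -> dict:
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def embed_query(question: str) -> list:
    # Ollama's embeddings endpoint returns {"embedding": [...]}.
    body = _post(f"{OLLAMA}/api/embeddings",
                 {"model": "nomic-embed-text", "prompt": question})
    return body["embedding"]

def search_chunks(vector: list, k: int = 4) -> list:
    # Qdrant similarity search; chunk text assumed stored in the payload.
    body = _post(f"{QDRANT}/collections/docs/points/search",
                 {"vector": vector, "limit": k, "with_payload": True})
    return [hit["payload"]["text"] for hit in body["result"]]

def build_prompt(question: str, chunks: list) -> str:
    context = "\n---\n".join(chunks)
    return ("Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Offline demonstration of the prompt-assembly step (no services required):
prompt = build_prompt("What is our refund policy?",
                      ["Refunds are issued within 14 days.",
                       "Contact support before returning items."])
```

In n8n, each function maps to one node: an HTTP Request node for the embedding, another for the Qdrant search, a Set or Code node for the prompt, and a final HTTP Request node calling Ollama for the answer.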

Advanced Retrieval

Hybrid Search + Reranking

Vector search alone misses exact keyword matches, and keyword search alone misses semantic meaning. Current best practice is to combine both, then rerank the results with a cross-encoder model for maximum precision.

How Hybrid Search Works

Run dense vector search (semantic) and sparse keyword search (BM25) in parallel, then fuse results using Reciprocal Rank Fusion (RRF). Each candidate gets a combined score based on its rank in both lists. This catches both semantically similar passages and exact-term matches.

Query → Dense Vectors (semantic similarity) + BM25 / SPLADE (keyword matching) → RRF Fusion (merge & rank) → Reranker (cross-encoder)
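Reciprocal Rank Fusion itself is only a few lines. This sketch fuses two illustrative result lists using the standard RRF score, the sum over lists of 1/(k + rank), with the conventional k = 60:

```python
# Reciprocal Rank Fusion: merge ranked lists by summing 1/(k + rank).
def rrf(result_lists: list, k: int = 60) -> list:
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic search ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 keyword ranking
fused = rrf([dense, sparse])
# doc_b (ranks 2 and 1) and doc_a (ranks 1 and 3) rise to the top.
```

Because RRF uses only ranks, not raw scores, the dense and sparse lists never need to be calibrated against each other, which is why it is the default fusion method in most hybrid search setups.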

The 3-Stage Retrieval Pipeline

1

Recall — Cast a Wide Net

Retrieve top 100–200 candidates using hybrid search (dense + BM25). Fast and cheap.

2

Deduplicate — Remove Noise

Apply Maximal Marginal Relevance (MMR) to remove near-duplicates and improve diversity.

3

Rerank — Precision Scoring

Score remaining candidates with a cross-encoder reranker for the most accurate final ranking.
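Stage 2 (MMR) balances relevance against novelty: a candidate scores high if it is similar to the query but dissimilar to chunks already selected. A minimal sketch over pre-computed similarity scores, with λ weighting relevance against diversity:

```python
# Maximal Marginal Relevance over pre-computed similarities.
# sim_query[i]: similarity of candidate i to the query.
# sim_docs[i][j]: similarity between candidates i and j.
def mmr(sim_query: list, sim_docs: list, k: int, lam: float = 0.7) -> list:
    selected = []
    remaining = list(range(len(sim_query)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalise candidates close to anything already chosen.
            redundancy = max((sim_docs[i][j] for j in selected), default=0.0)
            return lam * sim_query[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR picks 0, then skips 1 for 2.
sim_q = [0.9, 0.88, 0.5]
sim_d = [[1.0, 0.99, 0.1],
         [0.99, 1.0, 0.1],
         [0.1, 0.1, 1.0]]
picked = mmr(sim_q, sim_d, k=2)
```

Lowering λ favours diversity more aggressively; λ = 1.0 reduces MMR to plain relevance ranking.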

Local Reranker Models

| Model | Type | Notes |
| --- | --- | --- |
| Qwen3-Reranker-0.5B | Cross-encoder | Small, fast, 100+ languages. Best for most local setups. |
| Qwen3-Reranker-4B | Cross-encoder | Higher accuracy, needs 32 GB+ RAM alongside LLM. |
| ColBERTv2 | Late interaction | Token-level matching. Can be used as retriever or reranker. |
| mxbai-rerank | Cross-encoder | Fast, open-source, good cost-efficiency. |
| bge-reranker-v2 | Cross-encoder | Strong MTEB benchmark performance. |
Key Insight
Even a small 0.5B reranker dramatically improves precision over vector search alone. Qdrant has native hybrid search support (BM42 + dense vectors), making it the best local vector DB for this pattern.

Chunking Strategies

Smart Chunking for Large Datasets

How you split documents into chunks directly affects retrieval quality. For large datasets, the right chunking strategy can mean the difference between useful answers and irrelevant noise.

📏
Fixed-Size (Baseline)
512–1024 tokens with 10–20% overlap. Simple, predictable, and surprisingly competitive. Start here.
🌳
Parent-Child
Search small child chunks (100–500 tokens) for precision, return parent chunks (500–2000 tokens) for context. Best accuracy for large datasets.
🔗
Late Chunking
Process entire document through embedding model first, then split. Each chunk retains awareness of full document context. No extra LLM calls.
📝
Contextual Retrieval
Prepend an LLM-generated summary to each chunk before embedding. 49% fewer retrieval failures. Expensive at index time (1 LLM call per chunk).

Recommended Approach

Start with fixed 512-token chunks with 10% overlap as your baseline. If retrieval quality is insufficient, upgrade to parent-child chunking (best accuracy-to-cost ratio). Use contextual retrieval only for high-value corpora where the indexing cost is justified.
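The baseline strategy is easy to implement. This sketch splits on whitespace as a rough token proxy (a real pipeline would use the embedding model's tokenizer) with a 512-token window and 10% overlap:

```python
# Fixed-size chunking with overlap, using words as a stand-in for tokens.
def chunk(text: str, size: int = 512, overlap_pct: float = 0.10) -> list:
    words = text.split()
    step = max(1, int(size * (1 - overlap_pct)))  # advance 90% per chunk
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(words):
            break  # final window already covers the tail
    return chunks

# A 1000-word document with a 512-word window and 460-word step
# yields windows starting at 0, 460, and 920.
doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk(doc)
```

The overlap ensures a sentence straddling a boundary appears whole in at least one chunk, at the cost of indexing roughly 10% more text.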

| Strategy | Accuracy | Index Cost | Best For |
| --- | --- | --- | --- |
| Fixed-size | Good | Low | Quick start, general use |
| Parent-child | Excellent | Low | Large datasets, structured docs |
| Late chunking | Very good | Medium | Long documents, context-heavy |
| Contextual | Best | High | High-value, critical corpora |

Query Optimization

Smarter Queries, Better Answers

The way you query your knowledge base matters as much as how you index it. These techniques transform vague questions into precise retrievals.

💡
HyDE — Hypothetical Document Embeddings
The LLM generates a hypothetical answer first, then that answer is embedded and used for vector search. Bridges the gap between abstract queries and concrete documents. Hallucination in the hypothetical is expected and fine.
🔍
Query Decomposition
Break complex multi-hop questions into simpler sub-queries. “Compare A vs B” becomes two separate retrievals that are merged for the final answer. Essential for analytical questions.
🚦
Query Routing
Route different query types to different retrieval strategies automatically. Factual lookups use keyword search, conceptual questions use dense vectors, comparative questions use decomposition.
🤖
Agentic RAG
An LLM agent dynamically decides retrieval strategy — whether to search, which index, whether to refine the query, when enough context is gathered. The most flexible and powerful approach.

HyDE in Practice

1

User Asks Question

“What are the data residency requirements in Hong Kong?”

2

LLM Generates Hypothetical Answer

Ollama produces a plausible answer (may contain hallucinations — that is expected).

3

Embed the Hypothetical Answer

The generated text is closer to actual documents in embedding space than the original question.

4

Retrieve & Generate Final Answer

Search finds more relevant documents, LLM produces a grounded answer from actual data.

Implementation
LlamaIndex provides HyDEQueryTransform and Haystack has a native HyDE component. Both work with Ollama as the local LLM backend. For query decomposition, use LlamaIndex’s SubQuestionQueryEngine.
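For a sense of how little machinery HyDE needs, here is a hand-rolled sketch of the four steps above. The generate, embed, and search callables are stand-ins you would replace with your Ollama client and vector store; the stubs below exist only so the example runs.

```python
# Hand-rolled HyDE: embed a hypothetical answer instead of the raw question.
def hyde_search(question: str, generate, embed, search, k: int = 5):
    # 1. Ask the local LLM for a plausible (possibly hallucinated) answer.
    hypothetical = generate(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer, not the question.
    vector = embed(hypothetical)
    # 3. Retrieve real documents near that vector.
    return search(vector, k)

# Stub demonstration: the "LLM" returns a canned passage, the "index" is a
# dict of token sets, and "search" ranks by token overlap.
fake_generate = lambda p: "Data residency rules require local storage."
fake_embed = lambda text: set(text.lower().split())
index = {"doc1": {"data", "residency", "rules", "hong", "kong"},
         "doc2": {"refund", "policy", "terms"}}
fake_search = lambda v, k: sorted(index, key=lambda d: -len(index[d] & v))[:k]

hits = hyde_search("What are the data residency requirements?",
                   fake_generate, fake_embed, fake_search, k=1)
```

The key property is visible even in the stub: the hypothetical passage shares vocabulary with the target document that the bare question does not.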

Scaling

Scaling RAG & Graph RAG

As your document collection grows beyond hundreds of thousands, you need additional strategies to maintain speed and accuracy. Graph RAG adds structured knowledge understanding that vector search alone cannot provide.

Index Tuning (HNSW)

HNSW is the dominant approximate nearest-neighbour algorithm. Its defaults work well for small collections, but recall degrades as your corpus grows, so re-tune as you scale.

| Parameter | Default | RAG Recommended | Effect |
| --- | --- | --- | --- |
| M (connections) | 16 | 32–64 | Higher = better recall, more memory |
| efConstruction | 200 | 256–512 | Higher = better index quality, slower build |
| efSearch | 50 | 128–256 | Higher = better recall at query time |
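With Qdrant, M and efConstruction are set in the collection's hnsw_config at creation time, while efSearch maps to the hnsw_ef parameter on each search request. A sketch of the two request bodies, following Qdrant's REST field names (the docs collection name and the 768-dimension size for nomic-embed-text are illustrative assumptions):

PUT /collections/docs

```json
{
  "vectors": { "size": 768, "distance": "Cosine" },
  "hnsw_config": { "m": 32, "ef_construct": 256 }
}
```

POST /collections/docs/points/search

```json
{ "vector": [0.1, 0.2], "limit": 10, "params": { "hnsw_ef": 128 } }
```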

Tiered Retrieval

For millions of documents, use a multi-tier approach: first filter by metadata (date, source, category) to narrow the candidate pool, then run vector search within the partition, then rerank the results. This keeps latency low even at massive scale.

Graph RAG — Understanding Relationships

Vector search finds similar passages, but cannot understand relationships across documents. Graph RAG extracts entities and relationships to build a knowledge graph, enabling queries like “What are the main themes across all documents?” or “How do these regulations relate to each other?”

🕸️
LightRAG
Lightweight Graph RAG framework. Supports Ollama for entity extraction. Three query modes: naive (vector), local (entity-focused), global (community-focused).
📊
nano-graphrag
Minimal Graph RAG implementation. Selects only top-k communities for efficiency. Much faster and cheaper than Microsoft GraphRAG.
📦
RAGFlow
Production-grade open-source RAG engine. Handles complex documents (PDFs with tables, layouts), supports Graph RAG, agent capabilities. Docker-based.
When to Use Graph RAG
Add Graph RAG when you need cross-document relationship understanding — regulatory compliance (how rules relate), research (how studies connect), or enterprise knowledge bases (how processes depend on each other). For simple document Q&A, vector search with hybrid retrieval is sufficient.

Evaluation

Measuring RAG Quality

You cannot improve what you do not measure. Use the RAGAS framework to evaluate both retrieval and generation quality separately — a retrieval problem needs a different fix than a generation problem.

| Metric | What It Measures | Target |
| --- | --- | --- |
| Context Precision | Are retrieved chunks relevant and correctly ranked? | > 0.7 |
| Context Recall | Does retrieval capture all needed information? | > 0.9 |
| Faithfulness | Is the answer grounded in retrieved context (no hallucinations)? | > 0.85 |
| Answer Relevancy | Is the answer actually relevant to the question? | > 0.8 |
| Hit Rate | Does any relevant document appear in top-k results? | > 0.95 |
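Hit rate is the easiest of these to compute yourself while iterating, before bringing in a full framework. This sketch scores a retriever against a small golden set; the documents and questions are illustrative:

```python
# Hit rate: fraction of questions where a relevant doc appears in top-k.
def hit_rate(golden: list, retrieve, k: int = 5) -> float:
    hits = sum(1 for question, relevant in golden
               if relevant & set(retrieve(question)[:k]))
    return hits / len(golden)

# Illustrative golden set: (question, set of relevant doc IDs).
golden = [
    ("refund policy", {"doc_refunds"}),
    ("data residency", {"doc_residency"}),
]
# Toy retriever keyed on the question text; the second query misses.
toy_results = {
    "refund policy": ["doc_refunds", "doc_terms"],
    "data residency": ["doc_pricing", "doc_terms"],
}
rate = hit_rate(golden, lambda q: toy_results[q], k=5)
# One hit out of two questions.
```

Swap the lambda for your actual retrieval function to benchmark real pipelines; the golden set is the part worth investing time in.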

Evaluation Best Practices

  • Create a golden dataset — 50–100 question-answer pairs with annotated relevant documents
  • Measure retrieval and generation separately — pinpoint exactly where quality drops
  • Benchmark on YOUR data — MTEB leaderboard scores do not always translate to your domain
  • A/B test chunking strategies — compare fixed vs. parent-child vs. contextual on your actual queries
  • Monitor in production — track answer quality, latency, and user feedback continuously
Tools
RAGAS is the standard open-source evaluation framework. It provides reference-free metrics computed using an LLM judge. DeepEval offers 14+ metrics with CI/CD integration. Both work with local Ollama models as the judge.

Recommended Stack

Production RAG Stack for Mac

The optimal local RAG stack for Apple Silicon, combining all the advanced techniques above into a cohesive system.

| Component | Recommended Tool | Why |
| --- | --- | --- |
| LLM Runtime | Ollama | Native Metal acceleration, simple CLI |
| Embeddings | nomic-embed-text / bge-m3 | Top open-source MTEB scores, fast locally |
| Vector DB | Qdrant | Native hybrid search, BM42, sparse vectors |
| Reranker | Qwen3-Reranker-0.5B | Small, multilingual, runs via Ollama |
| Framework | LlamaIndex / Haystack | Full pipeline orchestration, HyDE built-in |
| Graph RAG | LightRAG | Local knowledge graph with Ollama support |
| Evaluation | RAGAS | Standard RAG eval, reference-free metrics |

Comparison

Choosing the Right Tier

Each tier balances simplicity against flexibility. Start with Tier 1 for quick experiments, move to Tier 2 for team use, or go straight to Tier 3 for enterprise automation.

| Feature | Tier 1: Open WebUI | Tier 2: PrivateGPT | Tier 3: n8n + Qdrant |
| --- | --- | --- | --- |
| Setup time | ~10 minutes | ~30 minutes | ~1 hour |
| Ingestion | Manual upload via UI | Folder-based + UI | Automated pipeline |
| Vector store | Built-in (ChromaDB) | Built-in (Qdrant/Chroma) | Qdrant (dedicated) |
| API access | Limited | Full REST API | Webhook endpoints |
| Multi-user | Yes (built-in auth) | Single user | Yes (n8n auth) |
| Best for | Individuals, quick start | Teams, batch processing | Enterprises, automation |

Hardware

Hardware Requirements

RAM is the primary bottleneck for local RAG. The LLM and embedding model run simultaneously, so plan for both. Below are recommended minimums per tier.

| Tier | Minimum RAM | Recommended RAM | Suggested LLM |
| --- | --- | --- | --- |
| Tier 1: Open WebUI | 16 GB | 24 GB | llama3.1:8b |
| Tier 2: PrivateGPT | 16 GB | 32 GB | qwen2.5:14b |
| Tier 3: n8n + Qdrant | 32 GB | 64 GB | qwen2.5:32b |
Storage Note
Vector databases are compact — 100,000 document chunks typically require less than 1 GB of storage. The LLM models themselves are the largest files (4–20 GB each). See the Hardware Guide for detailed RAM calculations.

Ready to build your private knowledge base?

Check which Mac has the RAM for your RAG setup, or book a free assessment with our team.