
Local RAG Setup Guide

Build a private AI knowledge base on your Mac — your documents never leave your device.

What is RAG?

Your Documents + AI = Accurate, Private Answers

RAG (Retrieval-Augmented Generation) combines a large language model with your own documents. Instead of relying solely on what the model was trained on, RAG retrieves relevant passages from your files and feeds them to the LLM — producing answers grounded in your actual data.

Why Local RAG Matters

  • Complete privacy — your data never leaves your machine, no cloud uploads
  • Zero API costs — no per-token charges, no subscriptions
  • Full control — choose your models, chunking strategy, and retrieval pipeline
  • Compliance-ready — meets data residency requirements for regulated industries

Use Cases

⚖️
Legal
Contract review, case law research, regulatory compliance. Query thousands of legal documents instantly.
📈
Finance
Compliance documentation, risk reports, internal policies. Keep sensitive financial data on-premises.
🏥
Healthcare
Patient records, clinical guidelines, research papers. HIPAA-compatible local processing.
🏢
Enterprise
Internal wikis, SOPs, onboarding manuals. Turn institutional knowledge into an instant Q&A system.

Architecture

How RAG Works

Documents are split into chunks, converted to numerical embeddings, and stored in a vector database. When you ask a question, the system finds the most relevant chunks and sends them alongside your query to the LLM for an accurate, grounded answer.

Documents (PDF, DOCX, TXT, MD) → Embeddings (nomic-embed-text) → Vector DB (ChromaDB / Qdrant) → Query + LLM (Ollama) → Answer (grounded response)

Core Components

  • Ollama — Local LLM runtime (llama3, qwen2.5, mistral)
  • Embedding model — Converts text to vectors (nomic-embed-text via Ollama)
  • Vector store — Stores and searches embeddings (ChromaDB, Qdrant)
  • Ingestion pipeline — Splits, embeds, and loads documents
  • Chat UI — Interface for querying your knowledge base
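The components above can be wired together in a few dozen lines. This is a self-contained sketch: the bag-of-words embed() is a deliberate stand-in so it runs without any services (a real setup would call an embedding model such as nomic-embed-text through Ollama and use a real vector store such as ChromaDB), but the ingest → retrieve → prompt flow is the same.

```python
# Minimal RAG loop: chunk → embed → store → retrieve → build prompt.
# embed() is a toy bag-of-words stand-in so the sketch runs without a
# model server; a real pipeline would call nomic-embed-text via Ollama.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Ingestion": store (chunk, vector) pairs, playing the role of the vector DB.
chunks = [
    "Ollama runs large language models locally on Apple Silicon.",
    "ChromaDB stores embeddings and supports similarity search.",
    "RAG retrieves relevant chunks and feeds them to the LLM.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question: str, k: int = 2) -> list:
    qv = embed(question)
    ranked = sorted(index, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Query: the most similar chunks become the LLM's context.
context = retrieve("How does RAG use retrieved chunks?")
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The prompt would then be sent to a chat model; the retrieval step is what grounds the answer in your documents.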
Tier 1 — Simple

Quick Start

Open WebUI + Ollama

The easiest way to get started with RAG. Open WebUI has built-in document upload and retrieval — no extra services needed. Upload your files through the UI and start asking questions immediately.

Best For
Individuals, small teams, quick prototyping. Get a working RAG system in under 10 minutes.
1

Install Ollama & Pull Models

Install Ollama from ollama.com, then pull a chat model and the embedding model.

2

Launch Open WebUI

Run Open WebUI via Docker with a single command.

3

Upload Documents

Open the UI, go to Workspace, then Documents. Upload your PDF, DOCX, TXT, or MD files.

4

Start Chatting

Begin a new chat, reference your documents with #, and ask questions. The AI retrieves relevant passages automatically.

Terminal
# Step 1: Pull models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Step 2: Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Step 3: Open in browser
open http://localhost:3000

Supported formats: PDF, DOCX, TXT, MD, CSV, XLSX, PPTX

Tier 2 — Intermediate

More Control

PrivateGPT

PrivateGPT gives you more control over document ingestion and retrieval. It supports batch ingestion, configurable chunking, and a clean API for integration — all running 100% locally with Ollama.

Best For
Teams needing batch ingestion, folder watching, better chunking control, and API access.
1

Clone PrivateGPT

Clone the repository and install dependencies.

2

Configure Ollama Backend

Edit settings-ollama.yaml to point to your local Ollama instance and choose models.

3

Ingest Documents

Place files in the source_documents folder and run ingestion, or use the web UI to upload.

4

Query via UI or API

Use the built-in UI or call the REST API for programmatic access to your knowledge base.

Terminal
# Step 1: Clone and set up
git clone https://github.com/zylon-ai/private-gpt.git
cd private-gpt
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"

# Step 2: Configure for Ollama
cp settings-ollama.yaml settings.yaml
# Edit settings.yaml — set llm_model and embedding_model

# Step 3: Run PrivateGPT
PGPT_PROFILES=ollama make run

# Step 4: Open in browser
open http://localhost:8001
Key Configuration
In settings.yaml, set llm.model to your preferred Ollama model (e.g. llama3.1:8b) and embedding.model to nomic-embed-text. Adjust rag.similarity_top_k to control how many chunks are retrieved (default: 2, recommended: 3–5).
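Based on the settings described above, the relevant part of settings.yaml looks roughly like this. Treat it as a sketch: exact key names and nesting vary between PrivateGPT versions, so check the settings file shipped with your checkout.

```yaml
llm:
  mode: ollama
  model: llama3.1:8b        # chat model served by Ollama
embedding:
  mode: ollama
  model: nomic-embed-text   # embedding model
rag:
  similarity_top_k: 4       # chunks retrieved per query (default 2)
```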
Tier 3 — Advanced

Full Automation

n8n + Qdrant + Ollama

Build a fully automated RAG pipeline with n8n for workflow orchestration, Qdrant as a production-grade vector database, and Ollama for local inference. Automate document ingestion, enable multi-user access, and create custom retrieval chains.

Best For
Enterprises, multi-user environments, automated ingestion pipelines, and custom retrieval workflows.

The Stack

🔄
n8n
Visual workflow builder. Orchestrates ingestion, chunking, embedding, and retrieval as automated workflows.
🧠
Qdrant
Production-grade vector database. Handles millions of embeddings with filtering, payload storage, and high-speed search.
🦙
Ollama
Local LLM runtime. Provides both chat completion and embedding models for the entire pipeline.
1

Launch Qdrant & n8n

Start Qdrant and n8n with Docker Compose. Ollama runs on your host Mac.

2

Build Ingestion Workflow

In n8n, create a workflow: file trigger → extract text → chunk → embed via Ollama → store in Qdrant.

3

Build Query Workflow

Create a second workflow: webhook receives question → embed query → search Qdrant → send context + question to Ollama → return answer.

4

Activate & Test

Activate both workflows. Drop documents into the watched folder and query via the webhook endpoint or a chat UI.

docker-compose.yml
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=your-password
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - n8n_data:/home/node/.n8n
      - ./documents:/documents
volumes:
  qdrant_data:
  n8n_data:
Terminal
# Pull embedding model
ollama pull nomic-embed-text

# Start the stack
docker compose up -d

# Verify services
curl http://localhost:6333/healthz   # Qdrant
open http://localhost:5678           # n8n

n8n Workflow Pattern

The ingestion workflow watches a folder for new files, extracts text, splits it into chunks (500–1000 tokens with overlap), generates embeddings via Ollama, and stores them in Qdrant. The query workflow receives questions via a webhook, embeds the query, retrieves the top-k similar chunks from Qdrant, and sends the combined context to Ollama for a grounded answer.
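The query workflow can be prototyped outside n8n in a few lines of Python before wiring it into nodes. The endpoint URLs and the "docs" collection name are illustrative assumptions; the HTTP helpers mirror the REST calls the n8n nodes would make, and only the prompt-assembly step is exercised here so no services are required.

```python
# Prototype of the n8n query workflow: embed → search → prompt → generate.
# Endpoint URLs and the "docs" collection name are illustrative assumptions.
import json
from urllib import request

OLLAMA = "http://localhost:11434"
QDRANT = "http://localhost:6333"

def _post(url: str, payload: dict) -> dict:
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def embed_query(question: str) -> list:
    # Ollama's embeddings endpoint returns {"embedding": [...]}.
    body = _post(f"{OLLAMA}/api/embeddings",
                 {"model": "nomic-embed-text", "prompt": question})
    return body["embedding"]

def search_chunks(vector: list, k: int = 4) -> list:
    # Qdrant similarity search; chunk text assumed stored in the payload.
    body = _post(f"{QDRANT}/collections/docs/points/search",
                 {"vector": vector, "limit": k, "with_payload": True})
    return [hit["payload"]["text"] for hit in body["result"]]

def build_prompt(question: str, chunks: list) -> str:
    context = "\n---\n".join(chunks)
    return ("Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Offline demonstration of the prompt-assembly step (no services required):
prompt = build_prompt("What is our refund policy?",
                      ["Refunds are issued within 14 days.",
                       "Contact support before returning items."])
```

In n8n, each function maps to one node: an HTTP Request node for the embedding, another for the Qdrant search, a Set or Code node for the prompt, and a final HTTP Request node calling Ollama for the answer.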

Advanced Retrieval

Hybrid Search + Reranking

Vector search alone misses exact keyword matches, and keyword search alone misses semantic meaning. Current best practice is to combine both, then rerank the results with a cross-encoder model for maximum precision.

How Hybrid Search Works

Run dense vector search (semantic) and sparse keyword search (BM25) in parallel, then fuse results using Reciprocal Rank Fusion (RRF). Each candidate gets a combined score based on its rank in both lists. This catches both semantically similar passages and exact-term matches.

Query → Dense Vectors (semantic similarity) + BM25 / SPLADE (keyword matching) → RRF Fusion (merge & rank) → Reranker (cross-encoder)
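Reciprocal Rank Fusion itself is only a few lines. This sketch fuses two illustrative result lists using the standard RRF score, the sum over lists of 1/(k + rank), with the conventional k = 60:

```python
# Reciprocal Rank Fusion: merge ranked lists by summing 1/(k + rank).
def rrf(result_lists: list, k: int = 60) -> list:
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic search ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 keyword ranking
fused = rrf([dense, sparse])
# doc_b (ranks 2 and 1) and doc_a (ranks 1 and 3) rise to the top.
```

Because RRF uses only ranks, not raw scores, the dense and sparse lists never need to be calibrated against each other, which is why it is the default fusion method in most hybrid search setups.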

The 3-Stage Retrieval Pipeline

1

Recall — Cast a Wide Net

Retrieve top 100–200 candidates using hybrid search (dense + BM25). Fast and cheap.

2

Deduplicate — Remove Noise

Apply Maximal Marginal Relevance (MMR) to remove near-duplicates and improve diversity.

3

Rerank — Precision Scoring

Score remaining candidates with a cross-encoder reranker for the most accurate final ranking.
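Stage 2 (MMR) balances relevance against novelty: a candidate scores high if it is similar to the query but dissimilar to chunks already selected. A minimal sketch over pre-computed similarity scores, with λ weighting relevance against diversity:

```python
# Maximal Marginal Relevance over pre-computed similarities.
# sim_query[i]: similarity of candidate i to the query.
# sim_docs[i][j]: similarity between candidates i and j.
def mmr(sim_query: list, sim_docs: list, k: int, lam: float = 0.7) -> list:
    selected = []
    remaining = list(range(len(sim_query)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalise candidates close to anything already chosen.
            redundancy = max((sim_docs[i][j] for j in selected), default=0.0)
            return lam * sim_query[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR picks 0, then skips 1 for 2.
sim_q = [0.9, 0.88, 0.5]
sim_d = [[1.0, 0.99, 0.1],
         [0.99, 1.0, 0.1],
         [0.1, 0.1, 1.0]]
picked = mmr(sim_q, sim_d, k=2)
```

Lowering λ favours diversity more aggressively; λ = 1.0 reduces MMR to plain relevance ranking.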

Local Reranker Models

| Model | Type | Notes |
| --- | --- | --- |
| Qwen3-Reranker-0.5B | Cross-encoder | Small, fast, 100+ languages. Best for most local setups. |
| Qwen3-Reranker-4B | Cross-encoder | Higher accuracy, needs 32 GB+ RAM alongside LLM. |
| ColBERTv2 | Late interaction | Token-level matching. Can be used as retriever or reranker. |
| mxbai-rerank | Cross-encoder | Fast, open-source, good cost-efficiency. |
| bge-reranker-v2 | Cross-encoder | Strong MTEB benchmark performance. |
Key Insight
Even a small 0.5B reranker dramatically improves precision over vector search alone. Qdrant has native hybrid search support (BM42 + dense vectors), making it the best local vector DB for this pattern.

Chunking Strategies

Smart Chunking for Large Datasets

How you split documents into chunks directly affects retrieval quality. For large datasets, the right chunking strategy can mean the difference between useful answers and irrelevant noise.

📏
Fixed-Size (Baseline)
512–1024 tokens with 10–20% overlap. Simple, predictable, and surprisingly competitive. Start here.
🌳
Parent-Child
Search small child chunks (100–500 tokens) for precision, return parent chunks (500–2000 tokens) for context. Best accuracy for large datasets.
🔗
Late Chunking
Process entire document through embedding model first, then split. Each chunk retains awareness of full document context. No extra LLM calls.
📝
Contextual Retrieval
Prepend an LLM-generated summary to each chunk before embedding. 49% fewer retrieval failures. Expensive at index time (1 LLM call per chunk).

Recommended Approach

Start with fixed 512-token chunks with 10% overlap as your baseline. If retrieval quality is insufficient, upgrade to parent-child chunking (best accuracy-to-cost ratio). Use contextual retrieval only for high-value corpora where the indexing cost is justified.
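The baseline strategy is easy to implement. This sketch splits on whitespace as a rough token proxy (a real pipeline would use the embedding model's tokenizer) with a 512-token window and 10% overlap:

```python
# Fixed-size chunking with overlap, using words as a stand-in for tokens.
def chunk(text: str, size: int = 512, overlap_pct: float = 0.10) -> list:
    words = text.split()
    step = max(1, int(size * (1 - overlap_pct)))  # advance 90% per chunk
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(words):
            break  # final window already covers the tail
    return chunks

# A 1000-word document with a 512-word window and 460-word step
# yields windows starting at 0, 460, and 920.
doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk(doc)
```

The overlap ensures a sentence straddling a boundary appears whole in at least one chunk, at the cost of indexing roughly 10% more text.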

| Strategy | Accuracy | Index Cost | Best For |
| --- | --- | --- | --- |
| Fixed-size | Good | Low | Quick start, general use |
| Parent-child | Excellent | Low | Large datasets, structured docs |
| Late chunking | Very good | Medium | Long documents, context-heavy |
| Contextual | Best | High | High-value, critical corpora |

Query Optimization

Smarter Queries, Better Answers

The way you query your knowledge base matters as much as how you index it. These techniques transform vague questions into precise retrievals.

💡
HyDE — Hypothetical Document Embeddings
The LLM generates a hypothetical answer first, then that answer is embedded and used for vector search. Bridges the gap between abstract queries and concrete documents. Hallucination in the hypothetical is expected and fine.
🔍
Query Decomposition
Break complex multi-hop questions into simpler sub-queries. “Compare A vs B” becomes two separate retrievals that are merged for the final answer. Essential for analytical questions.
🚦
Query Routing
Route different query types to different retrieval strategies automatically. Factual lookups use keyword search, conceptual questions use dense vectors, comparative questions use decomposition.
🤖
Agentic RAG
An LLM agent dynamically decides retrieval strategy — whether to search, which index, whether to refine the query, when enough context is gathered. The most flexible and powerful approach.

HyDE in Practice

1

User Asks Question

“What are the data residency requirements in Hong Kong?”

2

LLM Generates Hypothetical Answer

Ollama produces a plausible answer (may contain hallucinations — that is expected).

3

Embed the Hypothetical Answer

The generated text is closer to actual documents in embedding space than the original question.

4

Retrieve & Generate Final Answer

Search finds more relevant documents, LLM produces a grounded answer from actual data.

Implementation
LlamaIndex provides HyDEQueryTransform and Haystack has a native HyDE component. Both work with Ollama as the local LLM backend. For query decomposition, use LlamaIndex’s SubQuestionQueryEngine.
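For a sense of how little machinery HyDE needs, here is a hand-rolled sketch of the four steps above. The generate, embed, and search callables are stand-ins you would replace with your Ollama client and vector store; the stubs below exist only so the example runs.

```python
# Hand-rolled HyDE: embed a hypothetical answer instead of the raw question.
def hyde_search(question: str, generate, embed, search, k: int = 5):
    # 1. Ask the local LLM for a plausible (possibly hallucinated) answer.
    hypothetical = generate(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer, not the question.
    vector = embed(hypothetical)
    # 3. Retrieve real documents near that vector.
    return search(vector, k)

# Stub demonstration: the "LLM" returns a canned passage, the "index" is a
# dict of token sets, and "search" ranks by token overlap.
fake_generate = lambda p: "Data residency rules require local storage."
fake_embed = lambda text: set(text.lower().split())
index = {"doc1": {"data", "residency", "rules", "hong", "kong"},
         "doc2": {"refund", "policy", "terms"}}
fake_search = lambda v, k: sorted(index, key=lambda d: -len(index[d] & v))[:k]

hits = hyde_search("What are the data residency requirements?",
                   fake_generate, fake_embed, fake_search, k=1)
```

The key property is visible even in the stub: the hypothetical passage shares vocabulary with the target document that the bare question does not.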

Scaling

Scaling RAG & Graph RAG

As your document collection grows beyond hundreds of thousands, you need additional strategies to maintain speed and accuracy. Graph RAG adds structured knowledge understanding that vector search alone cannot provide.

Index Tuning (HNSW)

HNSW is the dominant approximate nearest-neighbour algorithm. Its defaults work well for small collections, but recall degrades as your corpus grows, so re-tune as you scale.

| Parameter | Default | RAG Recommended | Effect |
| --- | --- | --- | --- |
| M (connections) | 16 | 32–64 | Higher = better recall, more memory |
| efConstruction | 200 | 256–512 | Higher = better index quality, slower build |
| efSearch | 50 | 128–256 | Higher = better recall at query time |
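With Qdrant, M and efConstruction are set in the collection's hnsw_config at creation time, while efSearch maps to the hnsw_ef parameter on each search request. A sketch of the two request bodies, following Qdrant's REST field names (the docs collection name and the 768-dimension size for nomic-embed-text are illustrative assumptions):

PUT /collections/docs

```json
{
  "vectors": { "size": 768, "distance": "Cosine" },
  "hnsw_config": { "m": 32, "ef_construct": 256 }
}
```

POST /collections/docs/points/search

```json
{ "vector": [0.1, 0.2], "limit": 10, "params": { "hnsw_ef": 128 } }
```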

Tiered Retrieval

For millions of documents, use a multi-tier approach: first filter by metadata (date, source, category) to narrow the candidate pool, then run vector search within the partition, then rerank the results. This keeps latency low even at massive scale.

Graph RAG — Understanding Relationships

Vector search finds similar passages, but cannot understand relationships across documents. Graph RAG extracts entities and relationships to build a knowledge graph, enabling queries like “What are the main themes across all documents?” or “How do these regulations relate to each other?”

🕸️
LightRAG
Lightweight Graph RAG framework. Supports Ollama for entity extraction. Three query modes: naive (vector), local (entity-focused), global (community-focused).
📊
nano-graphrag
Minimal Graph RAG implementation. Selects only top-k communities for efficiency. Much faster and cheaper than Microsoft GraphRAG.
📦
RAGFlow
Production-grade open-source RAG engine. Handles complex documents (PDFs with tables, layouts), supports Graph RAG, agent capabilities. Docker-based.
When to Use Graph RAG
Add Graph RAG when you need cross-document relationship understanding — regulatory compliance (how rules relate), research (how studies connect), or enterprise knowledge bases (how processes depend on each other). For simple document Q&A, vector search with hybrid retrieval is sufficient.

Evaluation

Measuring RAG Quality

You cannot improve what you do not measure. Use the RAGAS framework to evaluate both retrieval and generation quality separately — a retrieval problem needs a different fix than a generation problem.

| Metric | What It Measures | Target |
| --- | --- | --- |
| Context Precision | Are retrieved chunks relevant and correctly ranked? | > 0.7 |
| Context Recall | Does retrieval capture all needed information? | > 0.9 |
| Faithfulness | Is the answer grounded in retrieved context (no hallucinations)? | > 0.85 |
| Answer Relevancy | Is the answer actually relevant to the question? | > 0.8 |
| Hit Rate | Does any relevant document appear in top-k results? | > 0.95 |
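Hit rate is the easiest of these to compute yourself while iterating, before bringing in a full framework. This sketch scores a retriever against a small golden set; the documents and questions are illustrative:

```python
# Hit rate: fraction of questions where a relevant doc appears in top-k.
def hit_rate(golden: list, retrieve, k: int = 5) -> float:
    hits = sum(1 for question, relevant in golden
               if relevant & set(retrieve(question)[:k]))
    return hits / len(golden)

# Illustrative golden set: (question, set of relevant doc IDs).
golden = [
    ("refund policy", {"doc_refunds"}),
    ("data residency", {"doc_residency"}),
]
# Toy retriever keyed on the question text; the second query misses.
toy_results = {
    "refund policy": ["doc_refunds", "doc_terms"],
    "data residency": ["doc_pricing", "doc_terms"],
}
rate = hit_rate(golden, lambda q: toy_results[q], k=5)
# One hit out of two questions.
```

Swap the lambda for your actual retrieval function to benchmark real pipelines; the golden set is the part worth investing time in.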

Evaluation Best Practices

  • Create a golden dataset — 50–100 question-answer pairs with annotated relevant documents
  • Measure retrieval and generation separately — pinpoint exactly where quality drops
  • Benchmark on YOUR data — MTEB leaderboard scores do not always translate to your domain
  • A/B test chunking strategies — compare fixed vs. parent-child vs. contextual on your actual queries
  • Monitor in production — track answer quality, latency, and user feedback continuously
Tools
RAGAS is the standard open-source evaluation framework. It provides reference-free metrics computed using an LLM judge. DeepEval offers 14+ metrics with CI/CD integration. Both work with local Ollama models as the judge.

Recommended Stack

Production RAG Stack for Mac

The optimal local RAG stack for Apple Silicon, combining all the advanced techniques above into a cohesive system.

| Component | Recommended Tool | Why |
| --- | --- | --- |
| LLM Runtime | Ollama | Native Metal acceleration, simple CLI |
| Embeddings | nomic-embed-text / bge-m3 | Top open-source MTEB scores, fast locally |
| Vector DB | Qdrant | Native hybrid search, BM42, sparse vectors |
| Reranker | Qwen3-Reranker-0.5B | Small, multilingual, runs via Ollama |
| Framework | LlamaIndex / Haystack | Full pipeline orchestration, HyDE built-in |
| Graph RAG | LightRAG | Local knowledge graph with Ollama support |
| Evaluation | RAGAS | Standard RAG eval, reference-free metrics |

Comparison

Choosing the Right Tier

Each tier balances simplicity against flexibility. Start with Tier 1 for quick experiments, move to Tier 2 for team use, or go straight to Tier 3 for enterprise automation.

| Feature | Tier 1: Open WebUI | Tier 2: PrivateGPT | Tier 3: n8n + Qdrant |
| --- | --- | --- | --- |
| Setup time | ~10 minutes | ~30 minutes | ~1 hour |
| Ingestion | Manual upload via UI | Folder-based + UI | Automated pipeline |
| Vector store | Built-in (ChromaDB) | Built-in (Qdrant/Chroma) | Qdrant (dedicated) |
| API access | Limited | Full REST API | Webhook endpoints |
| Multi-user | Yes (built-in auth) | Single user | Yes (n8n auth) |
| Best for | Individuals, quick start | Teams, batch processing | Enterprises, automation |

Hardware

Hardware Requirements

RAM is the primary bottleneck for local RAG. The LLM and embedding model run simultaneously, so plan for both. Below are recommended minimums per tier.

| Tier | Minimum RAM | Recommended RAM | Suggested LLM |
| --- | --- | --- | --- |
| Tier 1: Open WebUI | 16 GB | 24 GB | llama3.1:8b |
| Tier 2: PrivateGPT | 16 GB | 32 GB | qwen2.5:14b |
| Tier 3: n8n + Qdrant | 32 GB | 64 GB | qwen2.5:32b |
Storage Note
Vector databases are compact — 100,000 document chunks typically require less than 1 GB of storage. The LLM models themselves are the largest files (4–20 GB each). See the Hardware Guide for detailed RAM calculations.

Ready to build your private knowledge base?

Check which Mac has the RAM for your RAG setup, or book a free assessment with our team.