RAG: Ask Questions About Your Documents Without Sending Them to the Cloud
How to set up Retrieval Augmented Generation locally with Ollama and AnythingLLM. Keep your PDFs, contracts, and notes private.
What is RAG and Why You Need It
Retrieval Augmented Generation (RAG) lets an AI model answer questions about your documents — PDFs, text files, code, notes — without retraining the model. It works by searching your documents for relevant passages and feeding them to the model along with your question.
This is the use case that brings most professionals to local AI: lawyers searching case files, doctors querying medical literature, developers navigating codebases. Anyone who has thought, "I need AI, but I can't send my data to ChatGPT."
How RAG Works (Simply)
- Index: Your documents are split into chunks and converted into embeddings (mathematical representations)
- Query: When you ask a question, your question is also converted to an embedding
- Retrieve: The system finds document chunks most similar to your question
- Generate: Those chunks are fed to the LLM as context, and it generates an answer
The LLM never "reads" all your documents. It only sees the 3-5 most relevant chunks per question. This keeps responses fast and accurate.
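The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: a bag-of-words count stands in for a proper embedding model (a real setup would call something like nomic-embed-text), and the chunks are pre-split for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and strip punctuation so 'liability?' matches 'liability'."""
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary.
    A real pipeline would call an embedding model here instead."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, vocab, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = embed(question, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c, vocab)),
                    reverse=True)
    return ranked[:k]

# Index: documents split into chunks (pre-split here for brevity)
chunks = [
    "The contract limits liability to direct damages only.",
    "Payment is due within thirty days of invoice.",
    "Either party may terminate with ninety days notice.",
]
vocab = sorted({w for c in chunks for w in tokenize(c)})

# Query + Retrieve: find the chunks most relevant to the question
top = retrieve("What does the contract say about liability?", chunks, vocab)

# Generate: in a real system, `top` would be prepended to the prompt
# sent to the LLM as context.
```

The only part a real RAG stack changes is `embed`: swapping the word counts for vectors from a trained embedding model is what lets retrieval match on meaning rather than exact words.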
Setup Option 1: Open WebUI (Easiest)
Open WebUI has RAG built in. If you followed our Private ChatGPT guide, you already have it.
- Open the chat interface at localhost:3000
- Click the + button next to the message input
- Upload a PDF, text file, or document
- Ask questions about it; the AI will cite specific passages
Open WebUI uses the model running in Ollama for generation and a built-in embedding model for search. It handles chunking, embedding, and retrieval automatically.
Setup Option 2: AnythingLLM (More Control)
AnythingLLM is the r/LocalLLaMA community's favorite for RAG. It gives you more control over document processing:
- Choose your embedding model (Nomic, BGE, mxbai)
- Customize chunk size and overlap
- Create separate workspaces for different document collections
- Connect to Ollama as the LLM backend
- Desktop app — no Docker needed
Recommended embedding model: nomic-embed-text via Ollama. Small (270 MB), fast, and high quality.
ollama pull nomic-embed-text
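Once the model is pulled, a local Ollama instance exposes it over HTTP. The sketch below assumes Ollama's default port (11434) and its `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns an `embedding` vector; the payload is built separately so you can inspect it without a server running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embedding_request(text, model="nomic-embed-text"):
    """Build the JSON payload for Ollama's /api/embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text):
    """POST to a local Ollama instance and return the embedding vector.
    Requires a running Ollama server with nomic-embed-text pulled."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(embedding_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

payload = embedding_request("What does section 4.2 say about liability?")
```

Tools like AnythingLLM make this call for you; the sketch just shows there is no magic behind the "embedding model" dropdown.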
Tips for Good RAG Results
- Chunk size matters: 500-1000 characters per chunk works best. Too small = missing context. Too large = irrelevant noise.
- Use a good embedding model: Nomic, BGE-M3, or Snowflake Arctic Embed. The quality of retrieval depends more on the embedding than the LLM.
- Keep context reasonable: 4-8K context is usually enough for RAG. Stuffing 32K context with retrieved chunks wastes VRAM.
- Ask specific questions: "What does section 4.2 say about liability?" works better than "Summarize the document."
- Separate workspaces: Don't mix your tax documents with your codebase. Keep collections focused.
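The chunk-size tip can be made concrete with a simple character-based splitter. This is a sketch with assumed defaults (800 characters, 100 overlap); production splitters, including the one in AnythingLLM, typically also respect sentence or token boundaries.

```python
def chunk_text(text, size=800, overlap=100):
    """Split text into fixed-size character chunks with overlap,
    so a passage cut at a boundary still appears whole in one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

# A 2000-character document yields three overlapping chunks
doc = "".join(str(i % 10) for i in range(2000))
parts = chunk_text(doc)
```

The overlap is the important knob: without it, a sentence split across two chunks is invisible to retrieval, because neither half embeds to anything close to the full sentence.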