RAG: Ask Questions About Your Documents Without Sending Them to the Cloud
How to set up Retrieval Augmented Generation locally with Ollama and AnythingLLM. Keep your PDFs, contracts, and notes private.
What is RAG and Why You Need It
Retrieval Augmented Generation (RAG) lets an AI model answer questions about your documents — PDFs, text files, code, notes — without retraining the model. It works by searching your documents for relevant passages and feeding them to the model along with your question.
This is the use case that brings most professionals to local AI: lawyers searching case files, doctors querying medical literature, developers navigating codebases. Anyone who has thought, "I need AI, but I can't send my data to ChatGPT."
How RAG Works (Simply)
- Index: Your documents are split into chunks and converted into embeddings (mathematical representations)
- Query: When you ask a question, your question is also converted to an embedding
- Retrieve: The system finds document chunks most similar to your question
- Generate: Those chunks are fed to the LLM as context, and it generates an answer
The LLM never "reads" all your documents. It only sees the 3-5 most relevant chunks per question. This keeps responses fast and accurate.
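The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: a bag-of-words count stands in for a proper embedding model (a real setup would call something like nomic-embed-text), and the chunks are pre-split for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and strip punctuation so 'liability?' matches 'liability'."""
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary.
    A real pipeline would call an embedding model here instead."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, vocab, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = embed(question, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c, vocab)),
                    reverse=True)
    return ranked[:k]

# Index: documents split into chunks (pre-split here for brevity)
chunks = [
    "The contract limits liability to direct damages only.",
    "Payment is due within thirty days of invoice.",
    "Either party may terminate with ninety days notice.",
]
vocab = sorted({w for c in chunks for w in tokenize(c)})

# Query + Retrieve: find the chunks most relevant to the question
top = retrieve("What does the contract say about liability?", chunks, vocab)

# Generate: in a real system, `top` would be prepended to the prompt
# sent to the LLM as context.
```

The only part a real RAG stack changes is `embed`: swapping the word counts for vectors from a trained embedding model is what lets retrieval match on meaning rather than exact words.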
Setup Option 1: Open WebUI (Easiest)
Open WebUI has RAG built in. If you followed our Private ChatGPT guide, you already have it.
- Open the chat interface at localhost:3000
- Click the + button next to the message input
- Upload a PDF, text file, or document
- Ask questions about it; the AI will cite specific passages
Open WebUI uses the model running in Ollama for generation and a built-in embedding model for search. It handles chunking, embedding, and retrieval automatically.
Setup Option 2: AnythingLLM (More Control)
AnythingLLM is the r/LocalLLaMA community's favorite for RAG. It gives you more control over document processing:
- Choose your embedding model (Nomic, BGE, mxbai)
- Customize chunk size and overlap
- Create separate workspaces for different document collections
- Connect to Ollama as the LLM backend
- Desktop app — no Docker needed
Recommended embedding model: nomic-embed-text via Ollama. Small (270 MB), fast, and high quality.
ollama pull nomic-embed-text
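Once the model is pulled, a local Ollama instance exposes it over HTTP. The sketch below assumes Ollama's default port (11434) and its `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns an `embedding` vector; the payload is built separately so you can inspect it without a server running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embedding_request(text, model="nomic-embed-text"):
    """Build the JSON payload for Ollama's /api/embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text):
    """POST to a local Ollama instance and return the embedding vector.
    Requires a running Ollama server with nomic-embed-text pulled."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(embedding_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

payload = embedding_request("What does section 4.2 say about liability?")
```

Tools like AnythingLLM make this call for you; the sketch just shows there is no magic behind the "embedding model" dropdown.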
Tips for Good RAG Results
- Chunk size matters: 500-1000 characters per chunk works best. Too small = missing context. Too large = irrelevant noise.
- Use a good embedding model: Nomic, BGE-M3, or Snowflake Arctic Embed. The quality of retrieval depends more on the embedding than the LLM.
- Keep context reasonable: 4-8K context is usually enough for RAG. Stuffing 32K context with retrieved chunks wastes VRAM.
- Ask specific questions: "What does section 4.2 say about liability?" works better than "Summarize the document."
- Separate workspaces: Don't mix your tax documents with your codebase. Keep collections focused.
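The chunk-size tip can be made concrete with a simple character-based splitter. This is a sketch with assumed defaults (800 characters, 100 overlap); production splitters, including the one in AnythingLLM, typically also respect sentence or token boundaries.

```python
def chunk_text(text, size=800, overlap=100):
    """Split text into fixed-size character chunks with overlap,
    so a passage cut at a boundary still appears whole in one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

# A 2000-character document yields three overlapping chunks
doc = "".join(str(i % 10) for i in range(2000))
parts = chunk_text(doc)
```

The overlap is the important knob: without it, a sentence split across two chunks is invisible to retrieval, because neither half embeds to anything close to the full sentence.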