Knowledge RAG

Local RAG system for Claude Code with hybrid search and cross-encoder reranking.
LLMs don't know your docs. Every conversation starts from zero.
Your notes, writeups, internal procedures, PDFs — none of it is visible to your AI assistant. Cloud RAG solutions leak your private data. Local ones require Docker, Ollama, and 15 minutes of setup before a single query.
Knowledge RAG fixes this. One pip install, zero external servers.
Your documents become instantly searchable inside Claude Code — with reranking precision that actually finds what you need.
clone → pip install → restart Claude Code → done.
12 MCP Tools | Hybrid Search + Cross-Encoder Reranking | Markdown-Aware Chunking | 100% Local, Zero Cloud
What's New | Installation | API Reference | Architecture
Breaking Changes (v2.x → v3.0)
v3.0 is a major release. If you are upgrading from v2.x, read this section first.
| Change | v2.x | v3.0 |
|---|---|---|
| Embedding engine | Ollama (external server) | FastEmbed (ONNX in-process) |
| Embedding model | nomic-embed-text (768D) | BAAI/bge-small-en-v1.5 (384D) |
| Embedding dimensions | 768 | 384 |
| Dependencies | ollama>=0.6.0 | fastembed>=0.4.0, requests, beautifulsoup4 |
| MCP tools | 6 tools | 12 tools |
| Default hybrid_alpha | 0.3 | 0.3 (unchanged) |
Migration Steps
```bash
# 1. Pull the latest code
git pull origin main

# 2. Activate your virtual environment
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

# 3. Install new dependencies
pip install -r requirements.txt

# 4. Restart Claude Code; the server auto-detects the dimension mismatch
#    and triggers a nuclear rebuild on first startup (re-embeds everything)
```
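The "nuclear rebuild" decision can be sketched as a simple metadata check. This is an illustrative sketch, not the actual knowledge_rag internals; the names `needs_rebuild` and `embedding_dim` are hypothetical:

```python
EXPECTED_DIM = 384  # BAAI/bge-small-en-v1.5 output size (v2.x stored 768-dim vectors)

def needs_rebuild(index_meta: dict) -> bool:
    """True when the stored index was embedded at a different dimensionality."""
    return index_meta.get("embedding_dim") != EXPECTED_DIM

# A v2.x index (768-dim nomic-embed-text vectors) triggers a full re-embed:
print(needs_rebuild({"embedding_dim": 768}))  # True
print(needs_rebuild({"embedding_dim": 384}))  # False
```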
The first startup after upgrading will take longer than usual because:
- FastEmbed downloads the BAAI/bge-small-en-v1.5 model (~50MB, cached in ~/.cache/fastembed/)
- All documents are re-embedded with the new 384-dim model
- The cross-encoder reranker model is downloaded on first query (~25MB)
After the initial rebuild, startup and queries are faster than v2.x because there is no Ollama server dependency.
What's New in v3.1.0
Office Document Support (DOCX, XLSX, PPTX, CSV)
Nine formats are now supported: DOCX headings are preserved as markdown structure, Excel sheets are extracted as text tables, PowerPoint slides are extracted per slide, and CSV is parsed natively. All new formats integrate with markdown-aware chunking.
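As a minimal sketch of one extraction path: real DOCX/XLSX/PPTX handling would sit on format-specific libraries (python-docx, openpyxl, python-pptx), but the "CSV as text table" step can be shown with the standard library alone. This is an assumption about the rendering, not the project's exact output format:

```python
import csv
import io

def csv_to_text_table(raw: str) -> str:
    """Render CSV content as a pipe-delimited text table for indexing."""
    rows = csv.reader(io.StringIO(raw))
    return "\n".join(" | ".join(row) for row in rows)

print(csv_to_text_table("host,port\nlocalhost,8080"))
# host | port
# localhost | 8080
```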
File Watcher — Auto-Reindex on Changes
The documents directory is monitored in real time via watchdog. When you add, modify, or delete a file, the system auto-reindexes after a 5-second debounce. No manual reindex_documents call needed.
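The debounce behavior can be sketched as a timer that resets on every event, so a burst of file saves yields a single reindex. A minimal sketch, not the project's actual watcher code; wiring it to watchdog's `FileSystemEventHandler` is indicated in the comment:

```python
import threading

class Debouncer:
    """Coalesce bursts of events into one callback after `delay` quiet seconds."""

    def __init__(self, delay: float, callback):
        self.delay = delay
        self.callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        """Restart the countdown; the callback fires only after a quiet period."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.callback)
            self._timer.start()

# In the real server this would be called from watchdog's
# FileSystemEventHandler.on_any_event with delay=5.0, so one batch of
# file changes produces one reindex rather than one per write.
```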
MMR Result Diversification
Maximal Marginal Relevance is applied after reranking to reduce redundant results, balancing relevance against diversity (lambda=0.7). If your top 5 results would otherwise all come from the same document, MMR promotes varied sources.
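The greedy MMR selection can be sketched as follows. This is an illustrative implementation, not the project's exact code; the default `lam=0.7` mirrors the lambda stated above:

```python
from math import sqrt

def _cos(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Greedily pick k docs maximizing lam*relevance - (1-lam)*similarity-to-picked."""
    relevance = [_cos(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * relevance[i]
            - (1 - lam) * max((_cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda near 1.0 the selection is pure relevance order; lowering lambda penalizes candidates similar to results already picked, which is what pushes near-duplicate chunks from the same document down the list.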
What's New in v3.0.0
Ollama Removed — Zero External Dependencies
FastEmbed replaces Ollama entirely. Embeddings and reranking run in-process via ONNX Runtime. No server to start, no port to check, no process to manage. The embedding model downloads automatically on first run and is cached locally.
Cross-Encoder Reranking
After hybrid RRF fusion produces initial candidates, a cross-encoder (Xenova/ms-marco-MiniLM-L-6-v2) re-scores query-document pairs jointly. This dramatically improves precision for ambiguous queries where bi-encoder similarity alone is insufficient.
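The RRF fusion step that produces the candidate list can be sketched in a few lines. The constant `k=60` is the conventional RRF default, not necessarily the project's setting, and the cross-encoder stage is only indicated in a comment since it needs the ONNX model:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # ranking from vector similarity
bm25 = ["d1", "d4", "d3"]      # ranking from keyword search
candidates = rrf_fuse([semantic, bm25])
print(candidates)  # ['d1', 'd3', 'd4', 'd2']

# The cross-encoder would then jointly score (query, doc_text) pairs for
# these candidates and re-sort by that score.
```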
Markdown-Aware Chunking
.md files are now split by ## and ### header boundaries instead of fixed 1000-character windows. Each section becomes a semantically coherent chunk. Sections larger than chunk_size are sub-chunked with overlap. Non-markdown files still use the standard fixed-size chunker.
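The header-boundary split plus overlap sub-chunking described above can be sketched like this. An illustrative version, not the project's chunker; the defaults mirror the 1000-character window mentioned above, while the overlap value is an assumption:

```python
import re

def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split markdown at ## / ### boundaries; sub-chunk oversized sections with overlap."""
    # Lookahead split keeps each header attached to its own section.
    sections = re.split(r"(?m)^(?=#{2,3} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)          # one semantically coherent chunk
        else:
            step = chunk_size - overlap     # fixed-size windows with overlap
            for start in range(0, len(section), step):
                chunks.append(section[start:start + chunk_size])
    return chunks

doc = "intro\n## Setup\ninstall steps\n## Usage\nrun it"
print(chunk_markdown(doc))
# ['intro', '## Setup\ninstall steps', '## Usage\nrun it']
```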
Query Expansion
54 security-term synonym mappings expand abbreviated queries before BM25 search. Searching for "sqli" automatically includes "sql injection"; "privesc" includes "privilege escalation"; "pth" includes "pass-the-hash".
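The expansion is essentially a dictionary lookup appended to the original query. A minimal sketch with an illustrative subset of the mappings (the full 54-entry table lives in the project):

```python
SYNONYMS = {
    # illustrative subset; the project ships 54 mappings
    "sqli": ["sql injection"],
    "privesc": ["privilege escalation"],
}

def expand_query(query: str) -> str:
    """Append known synonyms so BM25 matches both the abbreviation and the phrase."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

print(expand_query("sqli in login form"))
# sqli in login form sql injection
```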
Tools (2)

- reindex_documents: Manually triggers a re-indexing of the documents directory.
- search_knowledge: Performs a hybrid search (semantic + BM25) across indexed documents with reranking.

Configuration
```json
{
  "mcpServers": {
    "knowledge-rag": {
      "command": "python",
      "args": ["-m", "knowledge_rag"]
    }
  }
}
```