Zotero Chunk RAG MCP Server

Semantic search over a Zotero library using Gemini and ChromaDB.

README.md

DeepZotero

Semantic search over a Zotero library. PDFs are extracted (text, tables, figures), chunked, embedded, and stored in ChromaDB. An MCP server exposes the index to Claude Code (or any MCP client) as 13 tools for semantic search, boolean search, table/figure search, context expansion, citation graph lookup, indexing, and cost tracking.

What it extracts

  • Text — section-aware chunks with overlap, classified by document section (abstract, methods, results, etc.)
  • Tables — vision-based extraction via Claude Haiku 4.5. Each table is rendered to PNG and transcribed to structured markdown (headers, rows, footnotes). Falls back to PyMuPDF heuristics if vision is disabled.
  • Figures — detected with captions, extracted as PNGs, searchable by caption text.

Requirements

  • Python 3.10+
  • A Gemini API key for embeddings (unless using embedding_provider: "local")
  • An Anthropic API key for vision-based table extraction (optional but recommended)
  • A Zotero installation with PDFs in storage/

Install

python -m venv .venv
.venv/Scripts/python.exe -m pip install -e .

For vision table extraction:

.venv/Scripts/python.exe -m pip install -e ".[vision]"

Setup

1. Configuration

mkdir -p ~/.config/deep-zotero
cp config.example.json ~/.config/deep-zotero/config.json

Edit ~/.config/deep-zotero/config.json:

{
    "zotero_data_dir": "~/Zotero",
    "chroma_db_path": "~/.local/share/deep-zotero/chroma",
    "gemini_api_key": "YOUR_GEMINI_KEY",
    "anthropic_api_key": "YOUR_ANTHROPIC_KEY"
}

All other fields have sensible defaults. You can also set GEMINI_API_KEY and ANTHROPIC_API_KEY as environment variables instead.

2. API keys

Gemini (required for default embeddings): Get a key at aistudio.google.com/app/apikey. Set it as gemini_api_key in config or GEMINI_API_KEY env var. If you don't want to use Gemini, set "embedding_provider": "local" to use ChromaDB's built-in all-MiniLM-L6-v2 model (no API key needed, lower quality).

Anthropic (required for vision table extraction): Get a key at console.anthropic.com. Set it as anthropic_api_key in config or ANTHROPIC_API_KEY env var. Without this key, tables are still extracted via PyMuPDF heuristics but accuracy on complex tables is lower. Vision extraction uses the Anthropic Batch API with Claude Haiku 4.5 — cost is roughly $0.016 per table, with prompt caching reducing cost on large batches.

To disable vision extraction entirely:

{
    "vision_enabled": false
}

3. Index your library

deep-zotero-index -v

To test with a subset first:

deep-zotero-index --limit 10 -v

This reads the Zotero SQLite database (read-only, safe while Zotero is open), extracts text/tables/figures from each PDF, chunks the text, embeds via Gemini, and stores everything in ChromaDB.

CLI options:

Flag Description
--force Delete and rebuild index for all matching items
--limit N Only index N items
--item-key KEY Index a single Zotero item
--title PATTERN Regex filter on title (case-insensitive)
--no-vision Skip vision table extraction for this run
--config PATH Use a different config file
-v Debug logging

The indexer is incremental — it only processes items not already in the index. Use --force after changing chunk_size, embedding_dimensions, or ocr_language.

You can also trigger indexing from the MCP client via the index_library tool.

4. Register the MCP server

Add to your Claude Code settings (~/.claude/settings.json):

{
    "mcpServers": {
        "deep-zotero": {
            "command": "/path/to/.venv/bin/python",
            "args": ["-m", "deep_zotero.server"]
        }
    }
}

On Windows:

{
    "mcpServers": {
        "deep-zotero": {
            "command": "C:\\path\\to\\.venv\\Scripts\\python.exe",
            "args": ["-m", "deep_zotero.server"]
        }
    }
}

Restart Claude Code. All 13 tools will be available.


Configuration reference

Zotero

Field Default Description
zotero_data_dir ~/Zotero Path to Zotero's data directory (contains zotero.sqlite and storage/)
chroma_db_path ~/.local/share/deep-zotero/chroma Where the ChromaDB index is stored on disk

Embedding

Field Default Description
embedding_provider "gemini" "gemini" for Gemini API, "local" for ChromaDB's built-in all-MiniLM-L6-v2 (no key needed)
embedding_model "gemini-embedding-001" Gemini model name (only used when provider is "gemini")
embedding_dimensions 768 Output vector dimensions. gemini-embedding-001 supports 64-3072. Changing requires --force re-ind

Tools 1

index_libraryTriggers the indexing of the Zotero library to extract text, tables, and figures.

Environment Variables

GEMINI_API_KEYAPI key for Gemini embeddings
ANTHROPIC_API_KEYAPI key for vision-based table extraction

Try it

Search my Zotero library for recent papers discussing transformer architecture.
Find the table in my saved PDFs that compares model performance metrics.
Retrieve the specific passage from my research notes regarding the methodology of the 2023 study.
Index my Zotero library to ensure all new PDFs are searchable.

Frequently Asked Questions

What are the key features of Zotero Chunk RAG?

Section-aware text chunking with overlap. Vision-based table extraction using Claude Haiku 4.5. Figure detection and caption-based indexing. Incremental indexing of Zotero SQLite database. Support for both Gemini and local embedding providers.

What can I use Zotero Chunk RAG for?

Quickly locating specific data points within a large collection of academic PDFs.. Extracting structured markdown tables from complex research papers.. Performing semantic searches across a personal library of scientific literature.. Automating the indexing of new research materials as they are added to Zotero..

How do I install Zotero Chunk RAG?

Install Zotero Chunk RAG by running: python -m venv .venv && .venv/Scripts/python.exe -m pip install -e .

What MCP clients work with Zotero Chunk RAG?

Zotero Chunk RAG works with any MCP-compatible client including Claude Desktop, Claude Code, Cursor, and other editors with MCP support.

Turn this server into reusable context

Keep Zotero Chunk RAG docs, env vars, and workflow notes in Conare so your agent carries them across sessions.

Open Conare