DeepZotero
Semantic search over a Zotero library. PDFs are extracted (text, tables, figures), chunked, embedded, and stored in ChromaDB. An MCP server exposes the index to Claude Code (or any MCP client) as 13 tools for semantic search, boolean search, table/figure search, context expansion, citation graph lookup, indexing, and cost tracking.
What it extracts
- Text — section-aware chunks with overlap, classified by document section (abstract, methods, results, etc.)
- Tables — vision-based extraction via Claude Haiku 4.5. Each table is rendered to PNG and transcribed to structured markdown (headers, rows, footnotes). Falls back to PyMuPDF heuristics if vision is disabled.
- Figures — detected with captions, extracted as PNGs, searchable by caption text.
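The idea behind overlapping chunks can be illustrated with a minimal sketch. The function name chunk_with_overlap and its word-based chunk_size and overlap parameters are assumptions made for illustration; DeepZotero's actual chunker is also section-aware and may measure size differently.

```python
def chunk_with_overlap(words, chunk_size=5, overlap=2):
    """Split a list of words into windows that share `overlap` words.

    Hypothetical helper for illustration only, not DeepZotero's chunker.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


text = "semantic search over a Zotero library of PDFs".split()
for chunk in chunk_with_overlap(text, chunk_size=5, overlap=2):
    print(" ".join(chunk))
```

Each window repeats the last words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.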
Requirements
- Python 3.10+
- A Gemini API key for embeddings (unless using embedding_provider: "local")
- An Anthropic API key for vision-based table extraction (optional but recommended)
- A Zotero installation with PDFs in storage/
Install
python -m venv .venv
.venv/Scripts/python.exe -m pip install -e .
For vision table extraction:
.venv/Scripts/python.exe -m pip install -e ".[vision]"
The interpreter path above is the Windows layout. On macOS/Linux, use .venv/bin/python in place of .venv/Scripts/python.exe.
Setup
1. Configuration
mkdir -p ~/.config/deep-zotero
cp config.example.json ~/.config/deep-zotero/config.json
Edit ~/.config/deep-zotero/config.json:
{
  "zotero_data_dir": "~/Zotero",
  "chroma_db_path": "~/.local/share/deep-zotero/chroma",
  "gemini_api_key": "YOUR_GEMINI_KEY",
  "anthropic_api_key": "YOUR_ANTHROPIC_KEY"
}
All other fields have sensible defaults. You can also set GEMINI_API_KEY and ANTHROPIC_API_KEY as environment variables instead.
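As a rough sketch of how a config file and environment variables can be combined, the snippet below reads the JSON file and fills in missing keys from the environment. The helper name load_config is hypothetical, and the precedence shown (config file first, environment as fallback) is an assumption; DeepZotero's real loader may resolve the two differently.

```python
import json
import os
from pathlib import Path


def load_config(path, environ=None):
    """Read a JSON config and fall back to env vars for missing API keys.

    Hypothetical helper for illustration; the precedence is an assumption.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(Path(path).expanduser().read_text(encoding="utf-8"))
    fallbacks = (
        ("gemini_api_key", "GEMINI_API_KEY"),
        ("anthropic_api_key", "ANTHROPIC_API_KEY"),
    )
    for field, env_name in fallbacks:
        # Only use the environment when the file leaves the field empty.
        if not config.get(field) and environ.get(env_name):
            config[field] = environ[env_name]
    return config
```

Calling load_config("~/.config/deep-zotero/config.json") would return a dictionary whose key fields are populated from whichever source provides them.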
2. API keys
Gemini (required for default embeddings):
Get a key at aistudio.google.com/app/apikey. Set it as gemini_api_key in config or GEMINI_API_KEY env var. If you don't want to use Gemini, set "embedding_provider": "local" to use ChromaDB's built-in all-MiniLM-L6-v2 model (no API key needed, lower quality).
Anthropic (required for vision table extraction):
Get a key at console.anthropic.com. Set it as anthropic_api_key in config or ANTHROPIC_API_KEY env var. Without this key, tables are still extracted via PyMuPDF heuristics but accuracy on complex tables is lower. Vision extraction uses the Anthropic Batch API with Claude Haiku 4.5 — cost is roughly $0.016 per table, with prompt caching reducing cost on large batches.
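For budgeting, the per-table figure above can be turned into a quick estimate. This is only back-of-the-envelope arithmetic: the function name is hypothetical, the default rate is the approximate $0.016 quoted above, and the result ignores any savings from prompt caching, so it should be read as a rough upper bound.

```python
def estimate_vision_cost(table_count, cost_per_table=0.016):
    """Rough cost in USD for vision extraction of `table_count` tables.

    Hypothetical helper; ignores prompt-caching discounts.
    """
    if table_count < 0:
        raise ValueError("table_count must be non-negative")
    return round(table_count * cost_per_table, 2)


# Example: a library whose PDFs contain 2,500 tables in total.
print(f"${estimate_vision_cost(2500):.2f}")
```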
To disable vision extraction entirely:
{
  "vision_enabled": false
}
3. Index your library
deep-zotero-index -v
To test with a subset first:
deep-zotero-index --limit 10 -v
This reads the Zotero SQLite database (read-only, safe while Zotero is open), extracts text/tables/figures from each PDF, chunks the text, embeds via Gemini, and stores everything in ChromaDB.
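One common way to read an SQLite file without risking writes is to open it through a URI with mode=ro, which Python's sqlite3 module supports. The sketch below shows that technique in isolation; the helper name open_read_only is hypothetical, and whether DeepZotero opens zotero.sqlite in exactly this way is an assumption.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an SQLite file so that any write attempt is rejected.

    Hypothetical helper illustrating SQLite's read-only URI mode.
    """
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, SELECT statements work as usual, while INSERT, UPDATE, or DELETE raise sqlite3.OperationalError, so the library database cannot be modified by accident.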
CLI options:
| Flag | Description |
|---|---|
| --force | Delete and rebuild index for all matching items |
| --limit N | Only index N items |
| --item-key KEY | Index a single Zotero item |
| --title PATTERN | Regex filter on title (case-insensitive) |
| --no-vision | Skip vision table extraction for this run |
| --config PATH | Use a different config file |
| -v | Debug logging |
The indexer is incremental — it only processes items not already in the index. Use --force after changing chunk_size, embedding_dimensions, or ocr_language.
You can also trigger indexing from the MCP client via the index_library tool.
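The interaction between incremental indexing, --force, --title, and --limit can be sketched as a small selection function. Everything here is an illustrative assumption: the name select_items, the (item_key, title) input shape, and the choice to apply the limit after filtering are not taken from DeepZotero's source.

```python
import re


def select_items(items, indexed_keys, title_pattern=None, limit=None, force=False):
    """Return the item keys an indexing run would process.

    items: iterable of (item_key, title) pairs.
    indexed_keys: set of keys already present in the index.
    Hypothetical helper; the real indexer's ordering of filters may differ.
    """
    regex = re.compile(title_pattern, re.IGNORECASE) if title_pattern else None
    selected = []
    for key, title in items:
        if limit is not None and len(selected) >= limit:
            break  # mirrors --limit N
        if regex and not regex.search(title):
            continue  # mirrors --title PATTERN (case-insensitive)
        if not force and key in indexed_keys:
            continue  # incremental: skip items already indexed
        selected.append(key)
    return selected


library = [("A1", "Deep Learning"), ("B2", "Protein Folding"), ("C3", "deep sea ecology")]
print(select_items(library, indexed_keys={"A1"}, title_pattern="deep"))
```

In this sketch, passing force=True brings already-indexed items back into the selection, which corresponds to rebuilding them after a setting such as chunk_size changes.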
4. Register the MCP server
Add to your Claude Code settings (~/.claude/settings.json):
{
  "mcpServers": {
    "deep-zotero": {
      "command": "/path/to/.venv/bin/python",
      "args": ["-m", "deep_zotero.server"]
    }
  }
}
On Windows:
{
  "mcpServers": {
    "deep-zotero": {
      "command": "C:\\path\\to\\.venv\\Scripts\\python.exe",
      "args": ["-m", "deep_zotero.server"]
    }
  }
}
Restart Claude Code. All 13 tools will be available.
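If typing the interpreter path by hand is error-prone (especially the doubled backslashes on Windows), a short script run with the virtual environment's own Python can print a ready-to-paste entry. This is a convenience sketch, not part of DeepZotero; the function name mcp_entry is hypothetical, and it assumes the script is launched with the same interpreter that has deep_zotero installed.

```python
import json
import sys


def mcp_entry(python_path=sys.executable):
    """Build the mcpServers entry for a given interpreter path.

    Hypothetical helper; defaults to the interpreter running this script.
    """
    return {
        "mcpServers": {
            "deep-zotero": {
                "command": python_path,
                "args": ["-m", "deep_zotero.server"],
            }
        }
    }


# json.dumps escapes backslashes, so Windows paths come out valid.
print(json.dumps(mcp_entry(), indent=2))
```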
Configuration reference
Zotero
| Field | Default | Description |
|---|---|---|
| zotero_data_dir | ~/Zotero | Path to Zotero's data directory (contains zotero.sqlite and storage/) |
| chroma_db_path | ~/.local/share/deep-zotero/chroma | Where the ChromaDB index is stored on disk |
Embedding
| Field | Default | Description |
|---|---|---|
| embedding_provider | "gemini" | "gemini" for Gemini API, "local" for ChromaDB's built-in all-MiniLM-L6-v2 (no key needed) |
| embedding_model | "gemini-embedding-001" | Gemini model name (only used when provider is "gemini") |
| embedding_dimensions | 768 | Output vector dimensions. gemini-embedding-001 supports 64-3072. Changing this requires re-indexing with --force |