An MCP server that enables reading PDF file contents, allowing PDF documents to be used as a knowledge base for LLMs.
PDF MCP Server
Features
- High-Quality Extraction: Uses marker-pdf (via a Python backend) to extract text with layout awareness and high-fidelity LaTeX equation recognition.
- Robust Fallback: Automatically switches to a Node.js-based parser (pdf-parse) if the Python environment is unavailable or fails, ensuring extraction always succeeds (albeit with lower formatting quality).
- Smart Filtering: Supports page range extraction to process only relevant sections of large documents.
Installation
Prerequisites
- Node.js (v18+)
- Python (v3.10+) and pip (for high-quality extraction)
Setup
Install Node.js dependencies:
npm install
Install Python dependencies (Recommended): To enable high-quality extraction (especially for scientific papers with math), install the Python dependencies.
# Create or activate a virtual environment if desired
python3 -m pip install -r python/requirements.txt
Note: The first time you run the tool with the Python backend, it will download necessary AI models (OCR, layout analysis, etc.) to a local cache. This download is approximately 3.3 GB. Ensure you have a stable internet connection.
Build the server:
npm run build
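Before relying on the Python backend, it can help to confirm that the interpreter meets the stated prerequisite and that the extraction package is importable. The following is a minimal preflight sketch, not part of the project. The helper names are hypothetical, and it assumes that marker-pdf installs a top-level module named marker.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version listed under Prerequisites


def python_is_new_enough(version=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) version meets the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


def module_is_installed(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python >= 3.10:", python_is_new_enough())
    # Assumption: the marker-pdf package exposes a module named "marker".
    print("marker importable:", module_is_installed("marker"))
```

If either check prints False, the server should still work through the pdf-parse fallback, only with lower formatting quality.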
Usage
Configuration for Claude/MCP Clients
Add this to your MCP settings configuration:
{
  "mcpServers": {
    "pdf-reader": {
      "command": "node",
      "args": ["/absolute/path/to/mcpPdf/dist/index.js"],
      "env": {}
    }
  }
}
Optional: to override where Python is found when it is not in a venv or on your PATH, add "PYTHON_PATH": "/path/to/python" inside the env object.
Tool: `read_pdf`
Reads and extracts text content from a PDF file.
Inputs:
- path (string): Absolute path to the PDF file.
- start_page (number, optional): Starting page number (1-based).
- end_page (number, optional): Ending page number (1-based).
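MCP clients normally build the tool call for you, but seeing the message shape can make the inputs concrete. The sketch below assembles a JSON-RPC 2.0 tools/call message for read_pdf using the three inputs listed above. The function name build_read_pdf_call and its client-side checks are illustrative assumptions; the server's own validation rules are not documented here.

```python
import json
import os


def build_read_pdf_call(path, start_page=None, end_page=None, request_id=1):
    """Build a JSON-RPC 2.0 "tools/call" message for the read_pdf tool."""
    # Assumed client-side checks; the server may validate differently.
    if not os.path.isabs(path):
        raise ValueError("path must be absolute")
    for name, value in (("start_page", start_page), ("end_page", end_page)):
        if value is not None and value < 1:
            raise ValueError(f"{name} is 1-based and must be >= 1")
    if start_page is not None and end_page is not None and start_page > end_page:
        raise ValueError("start_page must not exceed end_page")

    # Only include the optional page arguments when they were given.
    arguments = {"path": path}
    if start_page is not None:
        arguments["start_page"] = start_page
    if end_page is not None:
        arguments["end_page"] = end_page

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "read_pdf", "arguments": arguments},
    }


if __name__ == "__main__":
    # Hypothetical file path, for illustration only.
    print(json.dumps(build_read_pdf_call("/tmp/paper.pdf", 2, 5), indent=2))
```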
How it works:
- Attempt 1 (Python/Marker): The server tries to run the internal convert.py script.
  - If successfully configured, this loads the marker models from the local cache (.cache directory in the project).
  - It accurately converts equations to LaTeX and preserves document structure.
- Attempt 2 (Fallback): If the Python script fails (e.g., missing dependencies, runtime error), the server catches the error and uses pdf-parse (a native Node.js library).
  - This extracts raw text. Equations may appear as linearized text, and layout may be less well preserved.
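The two attempts above follow a common try-then-fall-back pattern. The real server implements this in Node.js. The Python sketch below only illustrates the control flow, and every name in it (extract_with_fallback and the two stand-in extractors) is hypothetical.

```python
import sys


def extract_with_fallback(path, primary, fallback):
    """Try primary(path); if it raises, use fallback(path) instead.

    Returns a (text, backend) tuple so callers can tell which path ran.
    """
    try:
        return primary(path), "primary"
    except Exception as error:
        # Report the failure, then continue with the simpler extractor.
        print(f"primary extractor failed ({error}); using fallback", file=sys.stderr)
        return fallback(path), "fallback"


# Stand-ins for the two backends, for illustration only.
def broken_marker(path):
    raise RuntimeError("marker models unavailable")


def plain_text(path):
    return "raw text from " + path


if __name__ == "__main__":
    print(extract_with_fallback("/tmp/paper.pdf", broken_marker, plain_text))
```

Returning the backend name alongside the text is a design choice of this sketch: it lets a caller warn the user when equations may have been linearized.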
Troubleshooting
- Permission Errors: The project is configured to use a local .cache directory for models to avoid system permission issues. If you encounter errors, ensure the project directory is writable.
- Slow Performance: The high-quality extraction uses deep learning models. It can be slow on large documents without a GPU. Use the start_page and end_page arguments to extract only what you need.
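One way to apply the page-range advice to a long document is to request it in fixed-size chunks. The helper below is a small sketch of that idea. The name page_ranges is hypothetical, and it assumes end_page is inclusive, which this document does not state explicitly.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (start_page, end_page) pairs, 1-based and inclusive."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the last chunk so it never runs past the final page.
        yield start, min(start + chunk_size - 1, total_pages)


if __name__ == "__main__":
    for start_page, end_page in page_ranges(25, 10):
        print(start_page, end_page)
```

Each pair can then be passed as the start_page and end_page arguments of a separate read_pdf call.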