Control Ubuntu desktops through screenshots, mouse clicks, and keyboard interactions
Ubuntu Desktop Control MCP Server
An MCP (Model Context Protocol) server that lets LLMs control your Ubuntu desktop by taking screenshots and sending mouse and keyboard input. This allows AI assistants to interact visually with your desktop applications.
⚡ NEW: Optimized Production Workflow
5x faster, 5x more accurate! Now using the same optimization techniques as Anthropic's Computer Use API:
- 📸 Smart Screenshots: Auto-downsampled to 1280x720 (5x smaller)
- 🎯 Numbered Elements: See what's clickable at a glance with overlaid IDs
- 🤖 AT-SPI Integration: Automatic UI element detection using accessibility API
- 📐 Percentage Coords: Resolution-agnostic positioning (no more pixel hunting!)
- ⚡ Workflow Batching: Execute multiple actions in one MCP call
- 🎪 Element Cache: Direct element interaction - "click element #5"
Example - Old way (8+ calls, ~15s):
take_screenshot() → analyze → grid overlay → zoom quadrant → find pixel → click → miss
Example - New way (1 call, ~3s):
take_screenshot() → "I see Pinta is element #5" → click_screen(element_id=5) → ✓
See README.md for full details.
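To make the percentage-coordinate and downsampling ideas concrete, here is a minimal sketch of how a percentage position could be resolved to a pixel, and how a point located in a downsampled 1280x720 screenshot could be mapped back to the full-resolution screen. The function names and the 0-100 percentage convention are assumptions made for illustration; they are not taken from the server's code.

```python
def percent_to_pixels(x_pct: float, y_pct: float, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Resolve a percentage position (assumed 0-100) to a pixel on the screen."""
    if not (0 <= x_pct <= 100 and 0 <= y_pct <= 100):
        raise ValueError("percentages must be within 0-100")
    # Clamp to the last valid pixel index so that 100% stays on-screen.
    x = min(round(x_pct / 100 * screen_w), screen_w - 1)
    y = min(round(y_pct / 100 * screen_h), screen_h - 1)
    return x, y


def screenshot_to_screen(x: int, y: int, shot_size: tuple[int, int], screen_size: tuple[int, int]) -> tuple[int, int]:
    """Map a pixel in a downsampled screenshot back to screen coordinates."""
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    # Scale each axis independently by the ratio of screen size to screenshot size.
    return round(x * screen_w / shot_w), round(y * screen_h / shot_h)
```

Because a percentage position is independent of the screenshot's resolution, the same value points at the same spot whether the model is looking at the 1280x720 image or the native display.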
Features
- 📸 Screenshot Capture: Annotated screenshots with automatic element detection
- 🔢 Element Detection: AT-SPI + CV fallback for robust UI element identification
- 🖱️ Smart Clicking: Click by element ID or percentage coordinates
- ⌨️ Keyboard Control: Type text and press keys/hotkeys
- 🎯 Mouse Movement: Smooth cursor positioning with animation
- 🚀 Workflow Batching: Execute multi-step tasks in single MCP call
- 📊 Diagnostics: Display scaling detection, warnings, and recommendations
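Clicking by element ID implies some lookup from the numbered overlay to a screen position. The following is a hypothetical sketch of such a cache, not the server's implementation: the `Element` and `ElementCache` names, their fields, and the choice to click the bounding-box center are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected UI element (hypothetical shape) with a pixel bounding box."""
    element_id: int
    label: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Integer center of the bounding box.
        return self.x + self.width // 2, self.y + self.height // 2


class ElementCache:
    """Remembers the elements found in the most recent screenshot."""

    def __init__(self) -> None:
        self._elements: dict[int, Element] = {}

    def update(self, elements: list[Element]) -> None:
        # Replace the whole cache: IDs are only valid for the latest screenshot.
        self._elements = {e.element_id: e for e in elements}

    def resolve(self, element_id: int) -> tuple[int, int]:
        # Return the click target for an element ID, or fail with a hint.
        if element_id not in self._elements:
            raise KeyError(f"element #{element_id} not in cache; take a new screenshot")
        return self._elements[element_id].center()
```

Replacing the cache on every screenshot keeps stale IDs from pointing at elements that have since moved or disappeared.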
Quick Start
1. Prerequisites
- Ubuntu Linux (X11 required, Wayland not fully supported)
- Python 3.9+
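Both prerequisites can be checked before installing. The sketch below is an illustrative helper, not part of the package: it reads the `XDG_SESSION_TYPE` environment variable, which most desktop sessions set to `x11` or `wayland`, and compares the interpreter version against 3.9. The function name is an assumption, and the variable may be unset in some sessions (for example over SSH), in which case the check reports it rather than guessing.

```python
import os
import sys


def check_prerequisites(env=None, version=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    version = sys.version_info if version is None else version
    problems = []

    # The server requires Python 3.9 or newer.
    if tuple(version[:2]) < (3, 9):
        problems.append(f"Python 3.9+ required, found {version[0]}.{version[1]}")

    # X11 is required; Wayland is not fully supported.
    session = env.get("XDG_SESSION_TYPE", "").lower()
    if session != "x11":
        found = session or "unset"
        problems.append(f"X11 session required, XDG_SESSION_TYPE is {found}")

    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current environment; passing a dictionary and a version tuple makes the function easy to test.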
2. Installation
From PyPI (Recommended)
pip install ubuntu-desktop-control
From Source
# Clone repository
git clone https://github.com/charettep/ubuntu-desktop-control-mcp.git
cd ubuntu-desktop-control-mcp
# Install system dependencies (requires sudo)
chmod +x scripts/install.sh
./scripts/install.sh
# Install Python dependencies
pip install -e .
Configuration
Claude Code
Installation Methods
Method 1: CLI (Recommended)
claude mcp add --transport stdio ubuntu-desktop-control -- \
ubuntu-desktop-control
Method 2: Manual Config
Edit ~/.claude/claude_desktop_config.json:
{
"mcpServers": {
"ubuntu-desktop-control": {
"command": "ubuntu-desktop-control",
"args": []
}
}
}
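If the config file already lists other MCP servers, editing it by hand risks overwriting them. As a rough sketch, the entry above could be merged in with a few lines of Python; the `add_server` helper is hypothetical and simply writes the same `mcpServers` entry shown above into whatever path you pass it.

```python
import json
from pathlib import Path

SERVER_NAME = "ubuntu-desktop-control"


def add_server(config_path: Path) -> dict:
    """Merge the server entry into an existing JSON config, keeping other entries."""
    config = {}
    if config_path.exists():
        # Treat an empty file the same as a missing one.
        config = json.loads(config_path.read_text() or "{}")

    # Add or overwrite only this server's entry.
    servers = config.setdefault("mcpServers", {})
    servers[SERVER_NAME] = {"command": "ubuntu-desktop-control", "args": []}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `add_server(Path.home() / ".claude" / "claude_desktop_config.json")` would target the file named above.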
VS Code Insiders
Installation Methods
Method 1: MCP Command
- Open Command Palette (`Ctrl+Shift+P`)
- Run `MCP: Open Workspace Folder Configuration`
- Add the server configuration below.
Method 2: Manual Config
Create .vscode/mcp.json in your workspace:
{
"servers": {
"ubuntu-desktop-control": {
"type": "stdio",
"command": "ubuntu-desktop-control",
"args": []
}
}
}
Codex CLI
Installation Methods
Method 1: CLI
codex mcp add ubuntu-desktop-control -- \
ubuntu-desktop-control
Method 2: Manual Config
Edit ~/.config/codex/config.toml:
[mcp_servers.ubuntu-desktop-control]
type = "stdio"
command = "ubuntu-desktop-control"
args = []
Tools
Core Capabilities
| Tool | Description |
|---|---|
| `take_screenshot` | Capture the desktop (optionally per-monitor) with annotated elements. |
| `click_screen` | Click by element ID or percentage coordinates (supports per-monitor). |
| `move_mouse` | Move the cursor by element ID or percentage coordinates (supports per-monitor). |
| `drag_mouse` | Drag the cursor to coordinates while holding a mouse button. |
| `type_text` | Type text using the keyboard. |
| `press_key` | Press a specific key (e.g., 'enter', 'esc'). |
| `press_hotkey` | Press a combination of keys simultaneously (e.g., Ctrl+Shift+C). |
| `get_screen_info` | Get screen dimensions and display server type (X11/Wayland). |
| `get_display_diagnostics` | Troubleshoot scaling and coordinate mismatches. |
| `map_GUI_elements_location` | Detect and map UI elements (hitboxes) using Computer Vision. |
| `convert_screenshot_coordinates` | Convert pixels from a screenshot to logical click coordinates. |
| `list_prompt_templates` | List available prompt templates (for clients without native prompt support). |
| `execute_workflow` | Execute a batch of actions (screenshot/click/move/type/wait). |
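To illustrate what batching several actions into one `execute_workflow` call might look like, here is a sketch of a workflow expressed as a list of steps, with a small client-side check that each step uses one of the action types named above. The step dictionary layout and the `text` and `seconds` field names are assumptions; only the action names and `element_id` appear in this document, so consult the tool's schema for the real parameters.

```python
# Action types listed for execute_workflow in the table above.
ALLOWED_ACTIONS = {"screenshot", "click", "move", "type", "wait"}


def validate_workflow(steps):
    """Return a list of error strings for steps with a missing or unknown action."""
    errors = []
    for index, step in enumerate(steps):
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            errors.append(f"step {index}: unknown action {action!r}")
    return errors


# Hypothetical payload: field names other than element_id are illustrative only.
workflow = [
    {"action": "screenshot"},
    {"action": "click", "element_id": 5},
    {"action": "type", "text": "hello"},
    {"action": "wait", "seconds": 0.5},
]
```

Validating the batch before sending it avoids spending a round trip on a workflow that would fail at its first unknown step.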
Prompt Rendering Tools
These tools allow clients without native prompt support (like Codex CLI) to render prompt templates as text.
| Tool | Description |
|---|---|
| `render_prompt_baseline_display_check` | Render the baseline display check prompt. |
| `render_prompt_capture_full_desktop` | Render the full desktop capture prompt. |