Screen Agent MCP Server

Windows desktop automation MCP server with UI recognition and learning.

Screen Agent

A Windows desktop automation MCP server that recognizes UI elements through OCR, Windows UIA controls, and multi-point color matching.

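Features

  • Multiple UI recognition methods

    • OCR text recognition (RapidOCR)
    • Windows UIA control recognition
    • Multi-point color feature matching
  • Smart operations

    • Window binding and automatic focus
    • Pop-up detection and handling
    • Operation verification and error recovery
  • Learning and evolution

    • Skill learning system
    • Operation success-rate tracking
    • Experience stored in a vector database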
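To make the success-rate tracking idea concrete, here is a minimal sketch. The `SuccessTracker` class and the operation-key format are hypothetical and used only for illustration. The README does not describe how the project actually stores these statistics.

```python
from collections import defaultdict
from typing import Optional


class SuccessTracker:
    """Hypothetical per-operation success counter (not the project's actual class)."""

    def __init__(self) -> None:
        # Maps an operation key to [successes, attempts].
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, operation: str, succeeded: bool) -> None:
        stats = self._stats[operation]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def rate(self, operation: str) -> Optional[float]:
        """Return the success ratio, or None if the operation was never attempted."""
        if operation not in self._stats:
            return None
        successes, attempts = self._stats[operation]
        return successes / attempts


tracker = SuccessTracker()
tracker.record("click:Settings:ocr", True)
tracker.record("click:Settings:ocr", False)
tracker.record("click:Settings:ui", True)
```

A tracker like this could let an agent prefer the recognition mode that has worked most often for a given element.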

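Installation

Requirements

  • Windows 10/11
  • Python 3.12+
  • Ollama (optional, used for visual recognition)

Installation steps

# Clone the repository
git clone https://github.com/lqszhsp/screen-agent.git
cd screen-agent

# Create a virtual environment
python -m venv venv
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt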

Configuration

  1. Copy the configuration template:
copy config\settings.example.py config\settings.py
  2. Edit config/settings.py and set the API key (only needed if you use a cloud vision API).

Usage

As an MCP server

Add the following to the configuration of Claude Desktop or another MCP client:

{
  "mcpServers": {
    "screen-agent": {
      "command": "python",
      "args": ["C:\\path\\to\\screen_agent\\mcp_server.py"]
    }
  }
}
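
Backslashes in Windows paths must be doubled inside JSON, which is easy to get wrong by hand. The sketch below generates the same entry with Python's standard `json` module. The helper name `build_mcp_config` is an illustration, not part of the project, and the path is the same placeholder used above.

```python
import json
from pathlib import PureWindowsPath


def build_mcp_config(server_script: str, python_cmd: str = "python") -> str:
    """Return a JSON config string with one "screen-agent" server entry."""
    config = {
        "mcpServers": {
            "screen-agent": {
                "command": python_cmd,
                # str() of a PureWindowsPath uses single backslashes;
                # json.dumps doubles them when serializing.
                "args": [str(PureWindowsPath(server_script))],
            }
        }
    }
    return json.dumps(config, indent=2)


# Placeholder path; replace it with the real location of mcp_server.py.
config_text = build_mcp_config(r"C:\path\to\screen_agent\mcp_server.py")
print(config_text)
```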

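Available tools

Tool                      Description
screen_get_layout         Bind to a window and get its layout information
screen_click              Click a screen element
screen_input_text         Type text
screen_scroll             Scroll the screen
screen_hotkey             Press a keyboard shortcut
screen_capture            Take a screenshot and recognize elements
screen_wait               Wait for a specified duration
screen_explore            Automatically explore the interface
screen_detect_ui          Detect UI element positions
screen_scan_ui_elements   Scan and generate icon features
screen_ask_user_locate    Ask the user to help locate an element
screen_learn_success      Record a successful operation
screen_query_knowledge    Query learned knowledge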
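
MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request, so a call to one of these tools might look like the message built below. This is a sketch of the message shape only. The helper `make_tool_call` is hypothetical, and the argument names `target` and `mode` are taken from the `screen_click` examples in this README rather than from a published tool schema.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC request ids


def make_tool_call(name: str, **arguments) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request, ensure_ascii=False)


# Assumed argument names; check the server's tool schema before relying on them.
message = make_tool_call("screen_click", target="Settings", mode="ocr")
print(message)
```

In practice an MCP client library builds and sends this message for you; the sketch only shows what travels over the wire.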

Click modes

# OCR mode (default) - locate by on-screen text
screen_click(target="Settings", mode="ocr")

# UIA mode - locate by UI control
screen_click(target="OK", mode="ui", control_type="Button")

# Multi-point color mode - locate by color features
screen_click(mode="multipoint", features={"0|0": "#07c160", "10|10": "#ffffff"})
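
The README does not spell out how the `features` dictionary is interpreted. One plausible reading, assumed here, is that each key is an `"dx|dy"` pixel offset from an anchor point and each value is the color expected at that offset. The sketch below implements that reading on a plain grid of RGB tuples. The function names and the `tolerance` parameter are illustrative, not the project's API.

```python
def hex_to_rgb(color: str) -> tuple:
    """Convert "#rrggbb" to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def parse_features(features: dict) -> list:
    """Turn {"dx|dy": "#rrggbb"} into [((dx, dy), (r, g, b)), ...]."""
    parsed = []
    for key, color in features.items():
        dx, dy = (int(part) for part in key.split("|"))
        parsed.append(((dx, dy), hex_to_rgb(color)))
    return parsed


def find_multipoint(pixels, features, tolerance=0):
    """Return the first (x, y) anchor where every feature point matches, else None.

    `pixels` is a non-empty row-major grid: pixels[y][x] == (r, g, b).
    """
    points = parse_features(features)
    height, width = len(pixels), len(pixels[0])
    for y in range(height):
        for x in range(width):
            for (dx, dy), expected in points:
                px, py = x + dx, y + dy
                if not (0 <= px < width and 0 <= py < height):
                    break  # feature point falls outside the image
                actual = pixels[py][px]
                if any(abs(a - e) > tolerance for a, e in zip(actual, expected)):
                    break  # color differs by more than the tolerance
            else:
                return (x, y)  # no break: every feature point matched
    return None


# Tiny synthetic "screenshot": black background with two feature pixels.
GREEN, WHITE, BLACK = (0x07, 0xC1, 0x60), (255, 255, 255), (0, 0, 0)
screen = [[BLACK] * 6 for _ in range(5)]
screen[1][2] = GREEN  # anchor at x=2, y=1
screen[3][4] = WHITE  # offset (+2, +2) from the anchor
match = find_multipoint(screen, {"0|0": "#07c160", "2|2": "#ffffff"})
```

Checking several points at fixed offsets is what makes this approach more selective than matching a single pixel color.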

Project structure

screen_agent/
├── mcp_server.py          # MCP server entry point
├── actions/               # Action modules
│   ├── click.py          # Click actions
│   ├── input_text.py     # Text input
│   ├── scroll.py         # Scroll actions
│   └── ...
├── core/                  # Core modules
│   ├── perception.py     # OCR perception
│   ├── window_manager.py # Window management
│   ├── evolution.py      # Evolution mechanism
│   └── ...
├── app_layouts/           # Application layout files
│   ├── _guidelines.md    # Operation guidelines
│   ├── _template.md      # Layout template
│   ├── 微信.md           # WeChat layout
│   └── ...
└── config/               # Configuration files
    └── settings.py

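Layout files

Each application can have its own layout file (app_layouts/{program name}.md) that contains:

  • Window structure and region definitions
  • Positions of frequently used elements
  • Operating rules and restrictions
  • A list of keyboard shortcuts

Use app_layouts/_template.md as the starting point for a new layout file.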
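
The naming convention above can be sketched as a small lookup helper. `find_layout` is a hypothetical function, not one taken from the project. Treating names that start with an underscore as shared files rather than applications is also an assumption, based on `_template.md` and `_guidelines.md` in the project structure.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_layout(program_name: str, layouts_dir: Path = Path("app_layouts")) -> Optional[Path]:
    """Return app_layouts/{program_name}.md if it exists, else None."""
    # Assumption: names starting with "_" are shared files
    # (_template.md, _guidelines.md), never application layouts.
    if program_name.startswith("_"):
        return None
    candidate = layouts_dir / f"{program_name}.md"
    return candidate if candidate.is_file() else None


# Demonstration in a throwaway directory instead of the real app_layouts/.
with tempfile.TemporaryDirectory() as tmp:
    layouts = Path(tmp)
    (layouts / "_template.md").write_text("# template", encoding="utf-8")
    (layouts / "Notepad.md").write_text("# Notepad layout", encoding="utf-8")
    found = find_layout("Notepad", layouts)
    found_name = found.name if found else None
    missing = find_layout("Calculator", layouts)
    reserved = find_layout("_template", layouts)
```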

Technical documentation

License

MIT License

Tools (13)

screen_get_layout: Bind to a window and retrieve its layout information.
screen_click: Click on a specific screen element using OCR, UIA, or color matching.
screen_input_text: Input text into the active or specified element.
screen_scroll: Perform a scroll action on the screen.
screen_hotkey: Execute a keyboard shortcut.
screen_capture: Capture a screenshot and identify UI elements.
screen_wait: Pause execution for a specified duration.
screen_explore: Automatically explore the current interface.
screen_detect_ui: Detect the position of UI elements.
screen_scan_ui_elements: Scan the screen and generate icon features.
screen_ask_user_locate: Request assistance from the user to locate an element.
screen_learn_success: Record a successful operation to the learning system.
screen_query_knowledge: Query the learned knowledge base for previous operations.

Environment Variables

API_KEY: API key for cloud-based visual recognition services (if used).
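
As a rough illustration of treating the key as optional, the sketch below reads `API_KEY` from the environment and reports `None` when it is unset or blank. The function `load_api_key` and the `use_cloud_vision` flag are hypothetical. How the project itself reads the key (for example through config/settings.py) is not shown in the source.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return API_KEY from the environment, or None when unset or blank."""
    value = env.get("API_KEY", "").strip()
    return value or None


api_key = load_api_key()
# Hypothetical flag: without a key, stick to local recognition only.
use_cloud_vision = api_key is not None
```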

Try it

Find the 'Settings' button on the current window and click it using OCR.
Scan the current UI and identify all available buttons.
Input the text 'Hello World' into the active text field.
Query the knowledge base to see if we have learned how to handle this specific application window.
Perform a scroll action to reach the bottom of the current page.

Frequently Asked Questions

What are the key features of Screen Agent?

  • Multi-modal UI recognition including OCR, Windows UIA, and color matching.
  • Intelligent window management with auto-focus and pop-up detection.
  • Learning system that tracks operation success and stores experience in a vector database.
  • Support for custom application layout files to define UI structures and shortcuts.
  • Error recovery and operation validation mechanisms.

What can I use Screen Agent for?

  • Automating repetitive data entry tasks in legacy Windows applications.
  • Creating self-healing automation scripts that adapt to UI changes.
  • Building agents that can navigate complex desktop software interfaces.
  • Standardizing interaction patterns across different desktop applications using layout files.

How do I install Screen Agent?

Install Screen Agent by running: git clone https://github.com/lqszhsp/screen-agent.git && cd screen-agent && python -m venv venv && venv\Scripts\activate && pip install -r requirements.txt

What MCP clients work with Screen Agent?

Screen Agent works with any MCP-compatible client including Claude Desktop, Claude Code, Cursor, and other editors with MCP support.
