---
frameworks:
- Pytorch
license: other
domain:
- nlp
language:
- cn
- en
---
# CogAgent
<p align="center">
<a href="https://github.com/THUDM/CogAgent">🌐 GitHub </a> |
<a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogAgent-Demo">🤗 Hugging Face Space</a> |
<a href="https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en">📄 Technical Report </a> |
<a href="https://arxiv.org/abs/2312.08914">📜 arXiv paper </a>
</p>
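## About the Model
The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b),
a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant gains in GUI
perception, inference prediction accuracy, action space completeness, and task generality and generalization. It accepts screenshots and language interaction in both Chinese and English.
This version of the CogAgent model has been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home).
We hope this release helps academic researchers and developers advance the research and application of GUI agents based on visual language models.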
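## Running the Model
<p>Please visit our <a href="https://github.com/THUDM/CogAgent">GitHub</a> for concrete running examples and for the model prompt concatenation section <strong style="color: red;">(this directly affects whether the model runs correctly)</strong>.</p>
Pay particular attention to the prompt concatenation process.
You can refer to [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115)
to concatenate the user input prompt.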
``` python
current_platform = identify_os() # "Mac" or "WIN" or "Mobile"; note that the value is case-sensitive
platform_str = f"(Platform: {current_platform})\n"
format_str = "(Answer in Action-Operation-Sensitive format.)\n" # You can use other format to replace "Action-Operation-Sensitive"
history_str = "\nHistory steps: "
for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
    history_str += f"\n{index}. {grounded_op_func}\t{action}" # start from 0.
query = f"Task: {task}{history_str}\n{platform_str}{format_str}" # Be careful about the \n
```
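The snippet calls `identify_os()` without showing its body. The following is a minimal sketch of what such a helper might look like, assuming detection through the standard library's `platform.system()`. The `platform_label` function, the mapping, and the fallback to `"Mobile"` are illustrative assumptions, not necessarily how the repository implements it.

``` python
import platform


def platform_label(system_name: str) -> str:
    """Map a platform.system() value to one of the prompt's platform labels."""
    # The labels are case-sensitive: "Mac", "WIN", or "Mobile".
    if system_name == "Darwin":
        return "Mac"
    if system_name == "Windows":
        return "WIN"
    # Assumption: every other system is treated as "Mobile" in this sketch.
    return "Mobile"


def identify_os() -> str:
    """Hypothetical stand-in for the identify_os() helper used in the snippet."""
    return platform_label(platform.system())


print(identify_os())
```

Keeping the mapping in a separate pure function makes it easy to check the labels without depending on the machine the code runs on.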
A minimal concatenated user input, as a Python string, looks like this:
``` python
"Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
```
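Because the full description is lengthy, please refer to [GitHub](https://github.com/THUDM/CogAgent) if you want to understand the meaning and representation of each field in detail.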
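As a self-contained illustration, the concatenation logic from this section can be wrapped in a function. This is only a sketch: `build_query` and its parameter names are hypothetical and not part of the repository, and the shortened task and action text in the usage example are made up for brevity. The layout of the resulting string follows the example string shown earlier.

``` python
from typing import Sequence


def build_query(
    task: str,
    platform: str,
    history_grounded_op_funcs: Sequence[str] = (),
    history_actions: Sequence[str] = (),
    answer_format: str = "Action-Operation-Sensitive",
) -> str:
    """Assemble a prompt string in the layout used in this section (hypothetical helper)."""
    if len(history_grounded_op_funcs) != len(history_actions):
        raise ValueError("each grounded operation needs a matching action description")

    history_str = "\nHistory steps: "
    for index, (grounded_op_func, action) in enumerate(
        zip(history_grounded_op_funcs, history_actions)
    ):
        # Steps are numbered from 0; a tab separates the operation from its description.
        history_str += f"\n{index}. {grounded_op_func}\t{action}"

    platform_str = f"(Platform: {platform})\n"  # "Mac", "WIN", or "Mobile" (case-sensitive)
    format_str = f"(Answer in {answer_format} format.)\n"
    # The newline placement matters: one "\n" after the history, one after each trailer.
    return f"Task: {task}{history_str}\n{platform_str}{format_str}"


# Illustrative usage with a single history step.
query = build_query(
    task="Search for doors",
    platform="WIN",
    history_grounded_op_funcs=["CLICK(box=[[352,102,786,139]], element_info='Search')"],
    history_actions=["Left click on the search box."],
    answer_format="Action-Operation",
)
print(repr(query))
```

The length check is a design choice of this sketch: `zip` would otherwise silently drop unmatched history entries, which could produce a prompt that omits steps.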
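## Previous Work
In November 2023, we released the first generation of the CogAgent model. You can now find the related code and weights in the [CogVLM&CogAgent official repository](https://github.com/THUDM/CogVLM).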
<div align="center">
<img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
</div>
<table>
<tr>
<td>
<h2> CogVLM </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
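<p><b>CogVLM</b> is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, and supports image understanding at 490*490 resolution as well as multi-turn dialogue.</p>
<p><b>CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>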
</td>
<td>
<h2> CogAgent </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
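<p><b>CogAgent</b> is an open-source visual language model improved upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, <b>and supports image understanding at 1120*1120 resolution. On top of CogVLM's capabilities, it further gains the abilities of a GUI image agent.</b></p>
<p> <b>CogAgent-18B achieves state-of-the-art general performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets, including AITW and Mind2Web.</p>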
</td>
</tr>
</table>
## License
Use of the model weights must comply with the [Model License](LICENSE).