---
frameworks:
- Pytorch
license: other
tasks:
- visual-question-answering
---

<h1>A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](http://120.92.209.146:8887/)

## MiniCPM-V 2.6

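**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. It is built on SigLip-400M and Qwen2-7B, with 8B parameters in total. Compared with MiniCPM-Llama3-V 2.5, it delivers a significant performance improvement and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

- 🔥 **Leading Performance.**
  MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of the OpenCompass leaderboard, a comprehensive evaluation over 8 popular multimodal benchmarks. **With only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** in single-image understanding.

- 🖼️ **Multi-Image Understanding and In-Context Learning.**
  MiniCPM-V 2.6 also supports **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows strong in-context learning capability.

- 🎬 **Video Understanding.**
  MiniCPM-V 2.6 can also **accept video inputs**, holding conversations about them and producing detailed video captions that cover temporal and spatial information. On Video-MME, both with and without subtitles, it outperforms **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B**.

- 💪 **Strong OCR Capability and More.**
  MiniCPM-V 2.6 can process images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it exhibits **trustworthy multimodal behavior**, with a significantly lower hallucination rate than GPT-4o and GPT-4V on Object HalBench, and it supports **multiple languages**, including English, Chinese, German, French, Italian, and Korean.

- 🚀 **Superior Efficiency.**
  In addition to a model size that is friendly to individual users, MiniCPM-V 2.6 shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8M-pixel image, 75% fewer than most models**. This improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.

- 💫 **Easy Usage.**
  MiniCPM-V 2.6 can be used in several ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support efficient CPU inference on local devices, (2) quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats, in 16 sizes, (3) [vLLM](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#vllm-%E9%83%A8%E7%BD%B2-) supports high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#%E6%9C%AC%E5%9C%B0-webui-demo-), and (6) an online [demo](http://120.92.209.146:8887/).
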
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/radar_final.png" width=66% />
</div>

##### Single-image results

OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench:

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>

</div>

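* We evaluate these benchmarks using chain-of-thought prompting.

<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., the number of pixels at maximum resolution / the number of visual tokens.

Note: Token Density for proprietary models is estimated from how their APIs charge for images.
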
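The Token Density definition above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this README (a 1344x1344 image and 640 visual tokens); the helper name `token_density` is hypothetical and not part of any MiniCPM-V API.

```python
def token_density(width, height, num_visual_tokens):
    """Pixels encoded per visual token: pixels at max resolution / visual tokens."""
    return (width * height) / num_visual_tokens

# Figures quoted in this README for MiniCPM-V 2.6.
pixels = 1344 * 1344
density = token_density(1344, 1344, 640)
print(pixels)          # 1806336, i.e. about 1.8 million pixels
print(round(density))  # 2822, the Token Density reported in the table

# "75% fewer tokens" implies a baseline budget of 640 / (1 - 0.75) tokens
# for the same image (an inference from the quoted percentage, not a measured value).
baseline_tokens = 640 / (1 - 0.75)
print(baseline_tokens)  # 2560.0
```

The rounded result matches the 2822 listed for MiniCPM-V 2.6 in the single-image table.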
##### Multi-image results

Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB:

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>

</div>

* Results evaluated on the officially released open-source model weights.

##### Video results

Video-MME and Video-ChatGPT:

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>34B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>

##### Few-shot results

TextVQA, VizWiz, VQAv2, OK-VQA:

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>

</div>

* Zero-shot performance evaluated Flamingo-style, with zero image shots and two additional text shots.

<sup>+</sup> We evaluate the pretrained model weights (ckpt) without supervised fine-tuning (SFT).

### Examples <!-- omit in toc -->

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more examples.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multiling-olympic.png" alt="olympic" style="margin-bottom: 10px;">
</div>
</details>

We deployed MiniCPM-V 2.6 on an iPad Pro and recorded the following demo videos.

<div style="display: flex; justify-content: center;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/ai.gif" width="48%" style="margin: 0 10px;"/>
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/beer.gif" width="48%" style="margin: 0 10px;"/>
</div>
<div style="display: flex; justify-content: center; margin-top: 20px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/ticket.gif" width="48%" style="margin: 0 10px;"/>
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/wfh.gif" width="48%" style="margin: 0 10px;"/>
</div>

<div style="text-align: center;">
<video controls autoplay src="https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/resolve/master/assets/case_draw.mp4" width="50%"></video>
<video controls autoplay src="https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/resolve/master/assets/case_mb.mp4" width="50%"></video>
</div>

## Demo

Click here to try out the demo of [MiniCPM-V 2.6](http://120.92.209.146:8887/).

## Usage

Inference with Hugging Face transformers on NVIDIA GPUs. Requirements (Python 3.10):

```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```

```python
# test.py
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('image.png').convert('RGB')
question = 'What is in the image?'
# Images are passed inside the message content, so the `image` argument stays None
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

# Print each chunk as it arrives while accumulating the full response
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```

### Chat with multiple images
<details>
<summary> Click to view a Python example of multi-image understanding with MiniCPM-V 2.6 </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

# Several images can share one user turn; the question comes last in the content list
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

### In-context few-shot learning
<details>
<summary> Click to view a Python example of few-shot inference with MiniCPM-V 2.6 </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

# Each demonstration is a user turn (image + question) followed by an assistant turn (answer);
# the final user turn carries the test image and is left for the model to answer
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

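The few-shot message layout above generalizes to any number of demonstrations. The following is a minimal sketch of a helper that builds such a list; `build_few_shot_msgs` is a hypothetical name introduced here for illustration, not part of the MiniCPM-V API, and plain strings stand in for PIL images so the snippet runs without a model or image files.

```python
def build_few_shot_msgs(examples, query_image, question):
    """Build an alternating user/assistant message list.

    examples: list of (image, answer) demonstration pairs.
    The returned list ends with a user turn holding the query image.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    msgs.append({'role': 'user', 'content': [query_image, question]})
    return msgs

# Placeholder strings stand in for PIL images (assumption for illustration only).
msgs = build_few_shot_msgs(
    examples=[('<example1.jpg>', '2023.08.04'), ('<example2.jpg>', '2007.04.24')],
    query_image='<test.jpg>',
    question='production date',
)
for m in msgs:
    print(m['role'], m['content'])
```

With real `PIL.Image` objects in place of the placeholder strings, the resulting list has the same shape as the `msgs` passed to `model.chat` in the example above.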
### Chat with video
<details>
<summary> Click to view a Python example of video understanding with MiniCPM-V 2.6 </summary>

```python
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # upper bound on the number of frames passed to the model

def encode_video(video_path):
    def uniform_sample(l, n):
        # Take the element at the midpoint of each of n equal-width bins
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # frame stride that yields about one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "/mnt/workspace/2.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if CUDA OOM occurs and the video resolution is greater than 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>

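The frame-selection logic in `encode_video` can be tried without a video file or `decord`. The sketch below reproduces only the index arithmetic; `sample_frame_indices` is a hypothetical wrapper introduced here, and the `max(1, ...)` guard against a zero stride is an addition that is not in the original example.

```python
def uniform_sample(items, n):
    """Pick n items at the midpoints of n equal-width bins over the list."""
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]

def sample_frame_indices(total_frames, avg_fps, max_frames):
    """Take about one frame per second, then thin uniformly if over budget."""
    stride = max(1, round(avg_fps))  # guard added for this sketch only
    frame_idx = list(range(0, total_frames, stride))
    if len(frame_idx) > max_frames:
        frame_idx = uniform_sample(frame_idx, max_frames)
    return frame_idx

# A 10-second clip at 30 fps stays under the 64-frame budget.
short = sample_frame_indices(total_frames=300, avg_fps=30.0, max_frames=64)
print(short)  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]

# A 5-minute clip at 30 fps has 300 candidates and is thinned to 64.
long = sample_frame_indices(total_frames=9000, avg_fps=30.0, max_frames=64)
print(len(long), long[:3], long[-1])  # 64 [60, 210, 330] 8910
```

Sampling at bin midpoints rather than bin starts keeps the selected frames centered in the clip, so neither the very first nor the very last second is over-represented.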
For more usage details, see [GitHub](https://github.com/OpenBMB/MiniCPM-V).

## Inference with llama.cpp <a id="llamacpp"></a>

MiniCPM-V 2.6 supports llama.cpp inference. For usage, see our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv).

## Int4 quantized version

The int4 quantized version uses less GPU memory (7GB): [MiniCPM-V-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4).

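As a rough back-of-the-envelope check on why quantization lowers memory use, the sketch below estimates the storage needed for the weights alone at different precisions. It assumes a round 8B parameter count and a hypothetical helper `weight_memory_gb`; it ignores activations, the KV cache, and any layers kept at higher precision, which is presumably why the quoted 7GB figure is higher than the int4 weight estimate.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumed round figure for an 8B-parameter model

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(bf16_gb)  # 16.0
print(int4_gb)  # 4.0
```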
## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.

#### Statement
* As an LMM, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-V 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

## Other Multimodal Projects from Our Team

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```