---
frameworks:
- Pytorch
license: other
tasks:
- visual-question-answering
---
<h1>A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](http://120.92.209.146:8887/)
## MiniCPM-V 2.6
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B, it has 8B parameters in total. Compared with MiniCPM-Llama3-V 2.5, MiniCPM-V 2.6 delivers a significant performance improvement and introduces new capabilities for multi-image and video understanding. Its key features include:
- 🔥 **Leading performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of the OpenCompass leaderboard (a comprehensive evaluation over 8 popular multimodal benchmarks). **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.**
- 🖼️ **Multi-image understanding and in-context learning.**
MiniCPM-V 2.6 also supports **multi-image conversation and reasoning**. It achieves **state-of-the-art results** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning capability.
- 🎬 **Video understanding.**
MiniCPM-V 2.6 also **accepts video inputs**, supporting conversation and detailed video descriptions that cover temporal and spatial information. On Video-MME, it outperforms **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B** in both the with-subtitles and without-subtitles settings.
- 💪 **Strong OCR capability and more.**
MiniCPM-V 2.6 can process images with any aspect ratio of up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it exhibits **trustworthy multimodal behavior**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multiple languages** including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior efficiency.**
In addition to its user-friendly size, MiniCPM-V 2.6 also shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8-million-pixel image, 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption, enabling efficient **real-time video understanding** on end-side devices such as the iPad.
- 💫 **Easy to use.**
MiniCPM-V 2.6 can be used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#vllm-%E9%83%A8%E7%BD%B2-) for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#%E6%9C%AC%E5%9C%B0-webui-demo-), and (6) an online [demo](http://120.92.209.146:8887/).
### Evaluation <!-- omit in toc -->
<div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/radar_final.png" width="66%" />
</div>
##### Single-image Results
OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench:
<div align="center">
<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th>Size</th>
      <th>Token Density<sup>+</sup></th>
      <th>OpenCompass</th>
      <th>MME</th>
      <th>MMVet</th>
      <th>OCRBench</th>
      <th>MMMU val</th>
      <th>MathVista mini</th>
      <th>MMB1.1 test</th>
      <th>AI2D</th>
      <th>TextVQA val</th>
      <th>DocVQA test</th>
      <th>HallusionBench</th>
      <th>Object HalBench</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="15" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4o</td><td>-</td><td>1088</td><td>69.9</td><td>2328.7</td><td>69.1</td><td>736</td><td>69.2</td><td>61.3</td><td>82.2</td><td>84.6</td><td>-</td><td>92.8</td><td>55.0</td><td>17.6</td></tr>
    <tr><td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td><td>-</td><td>750</td><td>67.9</td><td>1920.0</td><td>66.0</td><td>788</td><td>65.9</td><td>61.6</td><td>78.5</td><td>80.2</td><td>-</td><td>95.2</td><td>49.9</td><td>13.8</td></tr>
    <tr><td nowrap="nowrap" align="left">Gemini 1.5 Pro</td><td>-</td><td>-</td><td>64.4</td><td>2110.6</td><td>64.0</td><td>754</td><td>60.6</td><td>57.7</td><td>73.9</td><td>79.1</td><td>73.5</td><td>86.5</td><td>45.6</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4o mini</td><td>-</td><td>1088</td><td>64.1</td><td>2003.4</td><td>66.9</td><td>785</td><td>60.0</td><td>52.4</td><td>76.0</td><td>77.8</td><td>-</td><td>-</td><td>46.1</td><td>12.4</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>1088</td><td>63.5</td><td>2070.2</td><td>67.5</td><td>656</td><td>61.7</td><td>54.7</td><td>79.8</td><td>78.6</td><td>78.0</td><td>87.2</td><td>43.9</td><td>14.2</td></tr>
    <tr><td nowrap="nowrap" align="left">Step-1V</td><td>-</td><td>-</td><td>59.5</td><td>2206.4</td><td>63.3</td><td>625</td><td>49.9</td><td>44.8</td><td>78.0</td><td>79.2</td><td>71.6</td><td>-</td><td>48.4</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Qwen-VL-Max</td><td>-</td><td>784</td><td>58.3</td><td>2281.7</td><td>61.8</td><td>684</td><td>52.0</td><td>43.4</td><td>74.6</td><td>75.7</td><td>79.5</td><td>93.1</td><td>41.2</td><td>13.4</td></tr>
    <tr><td colspan="15" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td><td>34B</td><td>157</td><td>55.0</td><td>2006.5</td><td>50.7</td><td>574</td><td>48.8</td><td>40.4</td><td>77.8</td><td>78.9</td><td>69.3</td><td>-</td><td>34.8</td><td>12.6</td></tr>
    <tr><td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td><td>34B</td><td>157</td><td>-</td><td>2141</td><td>59.3</td><td>518</td><td>48.0</td><td>43.3</td><td>-</td><td>80.5</td><td>74.1</td><td>78.9</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Cambrian-34B</td><td>34B</td><td>1820</td><td>58.3</td><td>2049.9</td><td>53.2</td><td>591</td><td>50.4</td><td>50.3</td><td>77.8</td><td>79.5</td><td>76.7</td><td>75.5</td><td>41.6</td><td>14.7</td></tr>
    <tr><td nowrap="nowrap" align="left">GLM-4V-9B</td><td>13B</td><td>784</td><td>59.1</td><td>2018.8</td><td>58.0</td><td>776</td><td>46.9</td><td>51.1</td><td>67.9</td><td>71.2</td><td>-</td><td>-</td><td>45.0</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>706</td><td>64.1</td><td>2215.1</td><td>54.3</td><td>794</td><td><strong>51.2</strong></td><td>58.3</td><td><strong>79.4</strong></td><td><strong>83.6</strong></td><td>77.4</td><td><strong>91.6</strong></td><td>45.0</td><td>21.3</td></tr>
    <tr><td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td><td>8B</td><td>1882</td><td>58.8</td><td>2024.6</td><td>52.8</td><td>725</td><td>45.8</td><td>54.3</td><td>72.0</td><td>78.4</td><td>76.6</td><td>84.8</td><td>42.4</td><td>10.3</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>2822</strong></td><td><strong>65.2</strong></td><td><strong>2348.4</strong>*</td><td><strong>60.0</strong></td><td><strong>852</strong>*</td><td>49.8*</td><td><strong>60.6</strong></td><td>78.0</td><td>82.1</td><td><strong>80.1</strong></td><td>90.8</td><td><strong>48.1</strong>*</td><td><strong>8.2</strong></td></tr>
  </tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting.

<sup>+</sup> Token Density: the number of pixels encoded into each visual token at the maximum resolution, i.e., pixels at maximum resolution / number of visual tokens.

Note: for proprietary models, Token Density is estimated from the API pricing scheme.
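As a concrete check of the definition above, the Token Density reported for MiniCPM-V 2.6 in the table follows directly from the numbers in the feature list (a worked example, not part of the evaluation code):

```python
# Token Density = pixels at maximum resolution / number of visual tokens.
# MiniCPM-V 2.6 encodes a 1344x1344 image (~1.8M pixels) into 640 visual tokens.
max_pixels = 1344 * 1344           # 1,806,336 pixels
visual_tokens = 640
print(max_pixels / visual_tokens)  # ~2822.4, the "2822" in the table above
```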
##### Multi-image Results
Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB:
<div align="center">
<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th>Size</th>
      <th>Mantis Eval</th>
      <th>BLINK val</th>
      <th>Mathverse mv</th>
      <th>Sciverse mv</th>
      <th>MIRB</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="7" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>62.7</td><td>54.6</td><td>60.3</td><td>66.9</td><td>53.1</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td><td>14B</td><td>66.4</td><td>52.6</td><td>32.7</td><td>30.2</td><td>-</td></tr>
    <tr><td colspan="7" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Emu2-Chat</td><td>37B</td><td>37.8</td><td>36.2</td><td>-</td><td>27.2</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM</td><td>17B</td><td>45.2</td><td>41.1</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">VPG-C</td><td>7B</td><td>52.4</td><td>43.1</td><td>24.3</td><td>23.1</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">VILA 8B</td><td>8B</td><td>51.2</td><td>39.3</td><td>-</td><td>36.5</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td><td>8B</td><td>53.1*</td><td>48.9</td><td>32.1*</td><td>-</td><td>42.5</td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>59.0*</td><td>50.9</td><td>30.5*</td><td>34.4*</td><td><strong>56.9*</strong></td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>69.1</strong></td><td><strong>53.0</strong></td><td><strong>84.9</strong></td><td><strong>74.9</strong></td><td>53.8</td></tr>
  </tbody>
</table>
</div>
* Evaluation results on the officially released model weights.
##### Video Results
Video-MME and Video-ChatGPT:
<div align="center">
<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th>Size</th>
      <th colspan="2">Video-MME</th>
      <th colspan="5">Video-ChatGPT</th>
    </tr>
    <tr>
      <th align="left"></th>
      <th></th>
      <th>w/o subs</th>
      <th>w subs</th>
      <th>Correctness</th>
      <th>Detail</th>
      <th>Context</th>
      <th>Temporal</th>
      <th>Consistency</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="9" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td><td>-</td><td>60.0</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>59.9</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td colspan="9" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td><td>7B</td><td>-</td><td>-</td><td>3.39</td><td>3.29</td><td>3.92</td><td>2.60</td><td>3.12</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td><td>34B</td><td>-</td><td>-</td><td>3.29</td><td>3.23</td><td>3.83</td><td>2.51</td><td>3.47</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM2-Video</td><td>12B</td><td>-</td><td>-</td><td>3.49</td><td><strong>3.46</strong></td><td>3.23</td><td><strong>2.98</strong></td><td><strong>3.64</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LongVA</td><td>7B</td><td>52.4</td><td>54.3</td><td>3.05</td><td>3.09</td><td>3.77</td><td>2.44</td><td><strong>3.64</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>54.0</td><td>56.9</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td><td>8B</td><td>55.8</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td><td>32B</td><td>60.2</td><td>63.0</td><td>3.48</td><td>3.37</td><td><strong>3.95</strong></td><td>2.64</td><td>3.28</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>60.9</strong></td><td><strong>63.6</strong></td><td><strong>3.59</strong></td><td>3.28</td><td>3.93</td><td>2.73</td><td>3.62</td></tr>
  </tbody>
</table>
</div>
##### Few-shot Results
TextVQA, VizWiz, VQAv2, OK-VQA:
<div align="center">
<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th>Size</th>
      <th>Shot</th>
      <th>TextVQA val</th>
      <th>VizWiz test-dev</th>
      <th>VQAv2 test-dev</th>
      <th>OK-VQA val</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td align="left" nowrap="nowrap" rowspan="3">Flamingo</td><td rowspan="3">80B</td><td>0*</td><td>35.0</td><td>31.6</td><td>56.3</td><td>40.6</td></tr>
    <tr><td>4</td><td>36.5</td><td>39.6</td><td>63.1</td><td><strong>57.4</strong></td></tr>
    <tr><td>8</td><td>37.3</td><td>44.8</td><td>65.6</td><td>57.5</td></tr>
    <tr><td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td><td rowspan="3">80B</td><td>0*</td><td>30.9</td><td>36.0</td><td>60.0</td><td>45.2</td></tr>
    <tr><td>4</td><td>34.3</td><td>40.4</td><td>63.6</td><td>52.4</td></tr>
    <tr><td>8</td><td>35.7</td><td>46.1</td><td>64.8</td><td>55.1</td></tr>
    <tr><td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td><td rowspan="3">7B</td><td>0*</td><td>43.0</td><td>49.8</td><td>63.2</td><td>45.5</td></tr>
    <tr><td>4</td><td>45.4</td><td>51.3</td><td>64.5</td><td>46.5</td></tr>
    <tr><td>8</td><td>45.6</td><td>52.2</td><td>64.7</td><td>46.6</td></tr>
    <tr><td align="left" nowrap="nowrap" rowspan="3">Emu2</td><td rowspan="3">37B</td><td>0</td><td>26.4</td><td>40.4</td><td>33.5</td><td>26.7</td></tr>
    <tr><td>4</td><td>48.2</td><td>54.6</td><td>67.0</td><td>53.2</td></tr>
    <tr><td>8</td><td>49.3</td><td>54.7</td><td>67.8</td><td>54.1</td></tr>
    <tr><td align="left" nowrap="nowrap" rowspan="2">MM1</td><td rowspan="2">30B</td><td>0</td><td>26.2</td><td>40.4</td><td>48.9</td><td>26.7</td></tr>
    <tr><td>8</td><td>49.3</td><td>54.7</td><td><strong>70.9</strong></td><td>54.1</td></tr>
    <tr style="background-color: #e6f2ff;"><td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td><td rowspan="3">8B</td><td>0</td><td>43.9</td><td>33.8</td><td>45.4</td><td>23.9</td></tr>
    <tr style="background-color: #e6f2ff;"><td>4</td><td>63.6</td><td>60.5</td><td>65.5</td><td>50.1</td></tr>
    <tr style="background-color: #e6f2ff;"><td>8</td><td><strong>64.6</strong></td><td><strong>63.4</strong></td><td>68.2</td><td>51.4</td></tr>
  </tbody>
</table>
</div>
* Following the Flamingo protocol, zero-shot performance is evaluated with zero image shots and two additional text shots.

<sup>+</sup> We evaluate the pretrained checkpoint (ckpt) without supervised fine-tuning (SFT).
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more examples.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deployed MiniCPM-V 2.6 on an iPad Pro and recorded the following demo videos.
<div style="display: flex; justify-content: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/ai.gif" width="48%" style="margin: 0 10px;" />
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/beer.gif" width="48%" style="margin: 0 10px;" />
</div>
<div style="display: flex; justify-content: center; margin-top: 20px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/ticket.gif" width="48%" style="margin: 0 10px;" />
  <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/gif_cases/wfh.gif" width="48%" style="margin: 0 10px;" />
</div>
<div style="text-align: center;">
  <video controls autoplay src="https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/resolve/master/assets/case_draw.mp4" width="50%"></video>
  <video controls autoplay src="https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/resolve/master/assets/case_mb.mp4" width="50%"></video>
</div>
## Demo
Click here to try out the Demo of [MiniCPM-V 2.6](http://120.92.209.146:8887/).
## Usage
Run inference with Hugging Face transformers on NVIDIA GPUs. Requirements (tested with Python 3.10):
```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```
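Before running the examples, you can quickly confirm that the installed versions match the pins above; this is just a convenience check, not part of the official setup:

```python
# Print installed versions to compare against the pinned requirements.
import PIL
import torch
import torchvision
import transformers

print('torch:', torch.__version__)                # expect 2.1.2
print('torchvision:', torchvision.__version__)    # expect 0.16.2
print('transformers:', transformers.__version__)  # expect 4.40.0
print('Pillow:', PIL.__version__)                 # expect 10.1.0
print('CUDA available:', torch.cuda.is_available())
```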
```python
# test.py
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)
image = Image.open('image.png').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(res)
## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
stream=True
)
generated_text = ""
for new_text in res:
generated_text += new_text
print(new_text, flush=True, end='')
```
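The same message format extends to multi-turn conversation: append the previous answer as an assistant turn and ask a follow-up. A minimal sketch continuing the script above (the follow-up question is illustrative):

```python
# Multi-turn follow-up: the chat history is just the msgs list extended with
# the previous assistant answer, in the same alternating user/assistant format
# used by the few-shot example below. generated_text holds the streamed answer.
msgs.append({'role': 'assistant', 'content': [generated_text]})
msgs.append({'role': 'user', 'content': ['Describe the colors in more detail.']})
res = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(res)
```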
### Multi-image Understanding
<details>
<summary>Click to view a Python example of multi-image understanding with MiniCPM-V 2.6.</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
### In-context few-shot learning
<details>
<summary>Click to view a Python example of in-context few-shot learning with MiniCPM-V 2.6.</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
{'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
### Video Understanding
<details>
<summary>Click to view a Python example of video understanding with MiniCPM-V 2.6.</summary>
```python
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)
MAX_NUM_FRAMES=64
def encode_video(video_path):
def uniform_sample(l, n):
gap = len(l) / n
idxs = [int(i * gap + gap / 2) for i in range(n)]
return [l[i] for i in idxs]
vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample one frame per second (1 FPS)
frame_idx = [i for i in range(0, len(vr), sample_fps)]
if len(frame_idx) > MAX_NUM_FRAMES:
frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
frames = vr.get_batch(frame_idx).asnumpy()
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
print('num frames:', len(frames))
return frames
video_path="/mnt/workspace/2.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
{'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params={}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448 可设为1
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer,
**params
)
print(answer)
```
</details>
For more usage details, please see [GitHub](https://github.com/OpenBMB/MiniCPM-V).
## Inference with llama.cpp <a id="llamacpp"></a>
MiniCPM-V 2.6 supports llama.cpp inference. For usage, see our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv).
## Int4 Quantized Version
An int4 quantized version with lower GPU memory usage (about 7GB) is available: [MiniCPM-V-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4).
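A minimal loading sketch, assuming the int4 checkpoint follows the same `trust_remote_code` interface as the bfloat16 model above (the quantized weights load as-is, so no `torch_dtype` or `.cuda()` call is shown; see the linked model card for the authoritative snippet):

```python
# Load the int4 quantized checkpoint (assumption: same AutoModel/chat
# interface as the full-precision model; confirm with the linked model card).
from modelscope import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6-int4', trust_remote_code=True)
model.eval()
```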
## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow the [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
#### Statement
* As an LMM, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V 2.6 does not represent the views or positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, or improper dissemination of the model.
## Other Multimodal Projects from Our Team
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
## Citation
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```