---
frameworks:
- Pytorch
license: other
tasks:
- visual-question-answering
---

A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

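[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](http://120.92.209.146:8887/)

## MiniCPM-V 2.6

**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. It is built on SigLip-400M and Qwen2-7B, with 8B parameters in total. Compared with MiniCPM-Llama3-V 2.5, it delivers a significant performance improvement and adds new capabilities for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

- 🔥 **Leading performance.** MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular multimodal benchmarks. **With only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** in single-image understanding.
- 🖼️ **Multi-image understanding and in-context learning.** MiniCPM-V 2.6 also supports **conversation and reasoning over multiple images**. It achieves **state-of-the-art results** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows strong in-context learning capability.
- 🎬 **Video understanding.** MiniCPM-V 2.6 can also **accept video inputs**, hold conversations about them, and produce detailed video descriptions covering temporal and spatial information. On Video-MME, both with and without subtitles, it outperforms **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B**.
- 💪 **Strong OCR capability and more.** MiniCPM-V 2.6 can process images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art results on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it exhibits **trustworthy multimodal behavior**, with a hallucination rate on Object HalBench significantly lower than GPT-4o and GPT-4V, and supports **multiple languages** including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior efficiency.** In addition to its user-friendly size, MiniCPM-V 2.6 shows **state-of-the-art visual token density** (the number of pixels encoded into each visual token). It **needs only 640 tokens to process a 1.8M-pixel image, 75% fewer than most models**. This improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can support efficient **real-time video understanding** on end-side devices such as the iPad.
- 💫 **Easy to use.** MiniCPM-V 2.6 can be used in several ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats, in 16 sizes, (3) [vLLM](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#vllm-%E9%83%A8%E7%BD%B2-) support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick setup of a local WebUI demo with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/README_zh.md#%E6%9C%AC%E5%9C%B0-webui-demo-), and (6) an online [demo](http://120.92.209.146:8887/).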
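The token density quoted in the efficiency bullet can be checked with simple arithmetic. The sketch below assumes density is the number of pixels at maximum resolution divided by the number of visual tokens, as the evaluation footnote defines it. The helper name `token_density` is illustrative and is not part of the MiniCPM-V API.

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Pixels encoded per visual token: (width * height) / num_visual_tokens."""
    if num_visual_tokens <= 0:
        raise ValueError("num_visual_tokens must be positive")
    return (width * height) / num_visual_tokens


# A 1344x1344 image (about 1.8M pixels) encoded into 640 visual tokens.
density = token_density(1344, 1344, 640)
print(f"{density:.1f} pixels per token")  # 2822.4 pixels per token
```

The result rounds to the 2822 reported for MiniCPM-V 2.6 in the Token Density column of the single-image table.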
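### Evaluation

##### Single-image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench:

| Model | Size | Token Density\+ | OpenCompass | MME | MMVet | OCRBench | MMMU val | MathVista mini | MMB1.1 test | AI2D | TextVQA val | DocVQA test | HallusionBench | Object HalBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | |
| GPT-4o | - | 1088 | 69.9 | 2328.7 | 69.1 | 736 | 69.2 | 61.3 | 82.2 | 84.6 | - | 92.8 | 55.0 | 17.6 |
| Claude 3.5 Sonnet | - | 750 | 67.9 | 1920.0 | 66.0 | 788 | 65.9 | 61.6 | 78.5 | 80.2 | - | 95.2 | 49.9 | 13.8 |
| Gemini 1.5 Pro | - | - | 64.4 | 2110.6 | 64.0 | 754 | 60.6 | 57.7 | 73.9 | 79.1 | 73.5 | 86.5 | 45.6 | - |
| GPT-4o mini | - | 1088 | 64.1 | 2003.4 | 66.9 | 785 | 60.0 | 52.4 | 76.0 | 77.8 | - | - | 46.1 | 12.4 |
| GPT-4V | - | 1088 | 63.5 | 2070.2 | 67.5 | 656 | 61.7 | 54.7 | 79.8 | 78.6 | 78.0 | 87.2 | 43.9 | 14.2 |
| Step-1V | - | - | 59.5 | 2206.4 | 63.3 | 625 | 49.9 | 44.8 | 78.0 | 79.2 | 71.6 | - | 48.4 | - |
| Qwen-VL-Max | - | 784 | 58.3 | 2281.7 | 61.8 | 684 | 52.0 | 43.4 | 74.6 | 75.7 | 79.5 | 93.1 | 41.2 | 13.4 |
| **Open-source** | | | | | | | | | | | | | | |
| LLaVA-NeXT-Yi-34B | 34B | 157 | 55.0 | 2006.5 | 50.7 | 574 | 48.8 | 40.4 | 77.8 | 78.9 | 69.3 | - | 34.8 | 12.6 |
| Mini-Gemini-HD-34B | 34B | 157 | - | 2141 | 59.3 | 518 | 48.0 | 43.3 | - | 80.5 | 74.1 | 78.9 | - | - |
| Cambrian-34B | 34B | 1820 | 58.3 | 2049.9 | 53.2 | 591 | 50.4 | 50.3 | 77.8 | 79.5 | 76.7 | 75.5 | 41.6 | 14.7 |
| GLM-4V-9B | 13B | 784 | 59.1 | 2018.8 | 58.0 | 776 | 46.9 | 51.1 | 67.9 | 71.2 | - | - | 45.0 | - |
| InternVL2-8B | 8B | 706 | 64.1 | 2215.1 | 54.3 | 794 | 51.2 | 58.3 | 79.4 | 83.6 | 77.4 | 91.6 | 45.0 | 21.3 |
| MiniCPM-Llama-V 2.5 | 8B | 1882 | 58.8 | 2024.6 | 52.8 | 725 | 45.8 | 54.3 | 72.0 | 78.4 | 76.6 | 84.8 | 42.4 | 10.3 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 2348.4\* | 60.0 | 852\* | 49.8\* | 60.6 | 78.0 | 82.1 | 80.1 | 90.8 | 48.1\* | 8.2 |

\* We evaluate these benchmarks using chain-of-thought prompting.

\+ Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., the number of pixels at maximum resolution divided by the number of visual tokens.

Note: For proprietary models, token density is estimated from the API pricing scheme.

##### Multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB: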
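| Model | Size | Mantis Eval | BLINK val | Mathverse mv | Sciverse mv | MIRB |
|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | |
| GPT-4V | - | 62.7 | 54.6 | 60.3 | 66.9 | 53.1 |
| **Open-source** | | | | | | |
| LLaVA-NeXT-Interleave-14B | 14B | 66.4 | 52.6 | 32.7 | 30.2 | - |
| Emu2-Chat | 37B | 37.8 | 36.2 | - | 27.2 | - |
| CogVLM | 17B | 45.2 | 41.1 | - | - | - |
| VPG-C | 7B | 52.4 | 43.1 | 24.3 | 23.1 | - |
| VILA 8B | 8B | 51.2 | 39.3 | - | 36.5 | - |
| InternLM-XComposer-2.5 | 8B | 53.1\* | 48.9 | 32.1\* | - | 42.5 |
| InternVL2-8B | 8B | 59.0\* | 50.9 | 30.5\* | 34.4\* | 56.9\* |
| MiniCPM-V 2.6 | 8B | 69.1 | 53.0 | 84.9 | 74.9 | 53.8 |

\* Results obtained by evaluating the officially released model weights.

##### Video results on Video-MME and Video-ChatGPT: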
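| Model | Size | Video-MME (w/o subs) | Video-MME (w subs) | Video-ChatGPT Correctness | Video-ChatGPT Detail | Video-ChatGPT Context | Video-ChatGPT Temporal | Video-ChatGPT Consistency |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | |
| Claude 3.5 Sonnet | - | 60.0 | - | - | - | - | - | - |
| GPT-4V | - | 59.9 | - | - | - | - | - | - |
| **Open-source** | | | | | | | | |
| LLaVA-NeXT-7B | 7B | - | - | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| LLaVA-NeXT-34B | 34B | - | - | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| CogVLM2-Video | 12B | - | - | 3.49 | 3.46 | 3.23 | 2.98 | 3.64 |
| LongVA | 7B | 52.4 | 54.3 | 3.05 | 3.09 | 3.77 | 2.44 | 3.64 |
| InternVL2-8B | 8B | 54.0 | 56.9 | - | - | - | - | - |
| InternLM-XComposer-2.5 | 8B | 55.8 | - | - | - | - | - | - |
| LLaVA-NeXT-Video | 32B | 60.2 | 63.0 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| MiniCPM-V 2.6 | 8B | 60.9 | 63.6 | 3.59 | 3.28 | 3.93 | 2.73 | 3.62 |
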
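##### Few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA:

| Model | Size | Shot | TextVQA val | VizWiz test-dev | VQAv2 test-dev | OK-VQA val |
|---|---|---|---|---|---|---|
| Flamingo | 80B | 0\* | 35.0 | 31.6 | 56.3 | 40.6 |
| | | 4 | 36.5 | 39.6 | 63.1 | 57.4 |
| | | 8 | 37.3 | 44.8 | 65.6 | 57.5 |
| IDEFICS | 80B | 0\* | 30.9 | 36.0 | 60.0 | 45.2 |
| | | 4 | 34.3 | 40.4 | 63.6 | 52.4 |
| | | 8 | 35.7 | 46.1 | 64.8 | 55.1 |
| OmniCorpus | 7B | 0\* | 43.0 | 49.8 | 63.2 | 45.5 |
| | | 4 | 45.4 | 51.3 | 64.5 | 46.5 |
| | | 8 | 45.6 | 52.2 | 64.7 | 46.6 |
| Emu2 | 37B | 0 | 26.4 | 40.4 | 33.5 | 26.7 |
| | | 4 | 48.2 | 54.6 | 67.0 | 53.2 |
| | | 8 | 49.3 | 54.7 | 67.8 | 54.1 |
| MM1 | 30B | 0 | 26.2 | 40.4 | 48.9 | 26.7 |
| | | 8 | 49.3 | 54.7 | 70.9 | 54.1 |
| MiniCPM-V 2.6\+ | 8B | 0 | 43.9 | 33.8 | 45.4 | 23.9 |
| | | 4 | 63.6 | 60.5 | 65.5 | 50.1 |
| | | 8 | 64.6 | 63.4 | 68.2 | 51.4 |

\* Zero-shot performance is evaluated in the Flamingo style, with zero image shots and two additional text shots.

\+ We evaluate the pretrained model weights (ckpt) without supervised fine-tuning (SFT).

### Examples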
[Example images: Bike, Menu, Code, Mem, medal, elec, Menu]

We deployed MiniCPM-V 2.6 on an iPad Pro and recorded demo videos.
## Demo

Click here to try out the demo of [MiniCPM-V 2.6](http://120.92.209.146:8887/).

## Usage

Inference with Hugging Face transformers on NVIDIA GPUs. Requirements (Python 3.10):

```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```

```python
# test.py
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer

# Load the model in bfloat16 and move it to the GPU.
model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('image.png').convert('RGB')
question = 'What is in the image?'
# The image is passed inside the message content, so `image` stays None.
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```

### Multi-image understanding
The following Python example runs multi-image understanding with MiniCPM-V 2.6:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

# All images go into a single user turn, followed by the question.
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
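When the number of images varies, the single-turn message shown above can be assembled by a small helper. This is only a sketch of the message layout used in the example (all images first, then the question). The function name `build_multi_image_msgs` is hypothetical and not part of the model's API, and plain strings stand in for PIL images so the snippet runs without any files.

```python
def build_multi_image_msgs(images, question):
    """Return a single-turn msgs list: all images first, then the question."""
    images = list(images)
    if not images:
        raise ValueError("at least one image is required")
    return [{'role': 'user', 'content': images + [question]}]


# Strings stand in for Image.open(...).convert('RGB') results.
placeholders = ['<image1>', '<image2>', '<image3>']
msgs = build_multi_image_msgs(placeholders, 'What differs between these images?')
print(len(msgs[0]['content']))  # 4: three images plus the question
```

The returned list has the same shape as `msgs` in the example above and would be passed to `model.chat` in the same way.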
### In-context few-shot learning
The following Python example runs few-shot inference with MiniCPM-V 2.6:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"

# Two demonstrations (image plus expected answer) and one test image.
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

# Demonstrations are given as alternating user/assistant turns; the last user turn is the query.
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
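The alternating user/assistant layout above generalizes to any number of demonstrations. A minimal sketch follows, assuming the same message format as the example. The helper name `build_few_shot_msgs` is illustrative rather than part of the model's API, and strings stand in for PIL images so the snippet runs without any files.

```python
def build_few_shot_msgs(examples, question, test_image):
    """Build alternating user/assistant turns, ending with the test query.

    examples: iterable of (image, answer) pairs used as demonstrations.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    # The final user turn carries the image the model should answer about.
    msgs.append({'role': 'user', 'content': [test_image, question]})
    return msgs


# Strings stand in for PIL images.
demos = [('<image1>', '2023.08.04'), ('<image2>', '2007.04.24')]
msgs = build_few_shot_msgs(demos, 'production date', '<image_test>')
print([m['role'] for m in msgs])
# ['user', 'assistant', 'user', 'assistant', 'user']
```

With real `PIL.Image` objects in place of the strings, the result matches the hand-written `msgs` list in the example above.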
### Video understanding

The following Python example runs video understanding with MiniCPM-V 2.6:

```python
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick the element at the centre of each of n equal-width bins.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS: step that yields about one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "/mnt/workspace/2.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # can be set to 1 if CUDA runs out of memory and the video resolution is greater than 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
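To see which frames `encode_video` would select, the index arithmetic can be reproduced without decord or a video file. The sketch below mirrors the sampling logic above: roughly one frame per second, then uniform thinning to a maximum count. The function name `sample_frame_indices` is hypothetical, and the `max(1, ...)` guard against a zero step is an addition that the original code does not have.

```python
def sample_frame_indices(num_frames, avg_fps, max_frames=64):
    """Pick roughly one frame per second, then thin uniformly to max_frames."""
    step = max(1, round(avg_fps))  # frames between samples; guard is an added safety check
    idxs = list(range(0, num_frames, step))
    if len(idxs) > max_frames:
        gap = len(idxs) / max_frames
        # Take the element at the centre of each of max_frames equal-width bins.
        idxs = [idxs[int(i * gap + gap / 2)] for i in range(max_frames)]
    return idxs


short_clip = sample_frame_indices(num_frames=300, avg_fps=30)    # 10 s at 30 fps
long_clip = sample_frame_indices(num_frames=9000, avg_fps=30)    # 5 min at 30 fps
print(len(short_clip), len(long_clip))  # 10 64
```

A 10-second clip keeps all 10 one-per-second frames, while a 5-minute clip is thinned from 300 candidates to 64 evenly spread frames.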
See [GitHub](https://github.com/OpenBMB/MiniCPM-V) for more usage details.

## Inference with llama.cpp

MiniCPM-V 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for usage details.

## Int4 quantized version

The int4 quantized version uses less GPU memory (7GB): [MiniCPM-V-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4).

## License

#### Model License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.

#### Statement

* As an LMM, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-V 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

## Other Multimodal Projects from Our Team

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```