---
frameworks:
- Pytorch
license: other
tasks:
- any-to-any
---
<h1>A GPT-4o Level Vision, Speech, and Multimodal Streaming LLM for End-Side Devices</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)
## MiniCPM-o 2.6
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters, and is trained and run in an end-to-end fashion. Compared with MiniCPM-V 2.6, it shows significant performance improvements and adds new capabilities for real-time speech conversation and multimodal streaming interaction. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular multimodal benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.** It also **outperforms GPT-4V and Claude 3.5 Sonnet** on multi-image and video understanding, and shows strong in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.**
  MiniCPM-o 2.6 **supports bilingual real-time speech conversation in Chinese and English with configurable voices**. It **outperforms GPT-4o-realtime on speech understanding tasks** such as ASR and speech-to-text translation, and shows **the best speech generation performance among open-source models** in both semantic and acoustic evaluations of speech conversation. It also supports advanced capabilities such as emotion/speed/style control, voice cloning, and role play.
- 🎬 **Strong Multimodal Streaming Capability.**
  As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams and interact with users through real-time speech**. On StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding, MiniCPM-o 2.6 achieves the highest score among open-source models and **outperforms GPT-4o-realtime and Claude 3.5 Sonnet**.
- 💪 **Strong OCR Capability and Others.**
  Advancing the visual understanding capabilities of MiniCPM-V 2.6, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves the **best performance among models under 25B on OCRBench, surpassing proprietary models such as GPT-4o-202405**. Based on the latest [RLHF-V](https://rlhf-v.github.io/), [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/), and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behaviors**, surpassing GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multiple languages** including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior Efficiency.**
  In addition to its friendly size for individual users, MiniCPM-o 2.6 shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8-million-pixel image, which is 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can support efficient **multimodal streaming interaction** on end-side devices such as the iPad.
- 💫 **Easy Usage.**
  MiniCPM-o 2.6 can be used in several ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) quantized models in 16 sizes, (3) [vLLM](#vllm-部署-) support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#本地-webui-demo-), and (6) an [online demo](https://minicpm-omni-webdemo-us.modelbest.cn/).
**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Encoders/decoders of different modalities are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Streaming Mechanism.** (1) We change the offline encoders/decoders of different modalities into online modules for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing mechanism for omni-modal stream processing** in the LLM backbone, splitting and recombining the parallel streams of different modalities into sequences of periodic time slices (see the sketch after the framework figure below).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt that combines a traditional text system prompt with a **speech system prompt specifying the model's voice**. This allows the voice style to be flexibly controlled at inference time via text or audio samples, and supports advanced capabilities such as voice cloning and voice creation.
<div align="center">
<img src="./assets/minicpm-o-26-framework.png" width="80%">
</div>
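As a rough illustration of the time-division multiplexing idea, the hypothetical sketch below (not the actual implementation) shows how parallel video and audio streams could be split into periodic 1-second time slices and recombined into a single omni-modal sequence; the unit size mirrors the `get_video_chunk_content` helper in the Usage section.

```python
from typing import Any, Iterable, List

def time_division_multiplex(frames: Iterable[Any],
                            audio_chunks: Iterable[Any]) -> List[Any]:
    """Hypothetical sketch: pair one video frame with the audio chunk of the
    same 1-second slice, producing a single interleaved omni-modal sequence."""
    sequence: List[Any] = []
    for frame, audio in zip(frames, audio_chunks):
        # each "<unit>" marks the start of one periodic time slice
        sequence.extend(["<unit>", frame, audio])
    return sequence
```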
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="./assets/radar.jpg" width="70%">
</div>
<details>
<summary>Click to view detailed visual understanding results.</summary>

**Image Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th><th>Size</th><th>Token Density<sup>+</sup></th><th>OpenCompass</th><th>OCRBench</th><th>MathVista mini</th><th>ChartQA</th><th>MMVet</th><th>MMStar</th><th>MME</th><th>MMB1.1 test</th><th>AI2D</th><th>MMMU val</th><th>HallusionBench</th><th>TextVQA val</th><th>DocVQA test</th><th>MathVerse mini</th><th>MathVision</th><th>MMHal Score</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="19" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o-20240513</td><td>-</td><td>1088</td><td><u>69.9</u></td><td>736</td><td>61.3</td><td>85.7</td><td><strong>69.1</strong></td><td>63.9</td><td>2328.7</td><td>82.2</td><td>84.6</td><td><strong>69.2</strong></td><td><strong>55.0</strong></td><td>-</td><td>92.8</td><td><strong>50.2</strong></td><td><strong>30.4</strong></td><td><u>3.6</u></td></tr>
<tr><td nowrap="nowrap" align="left">Claude3.5-Sonnet</td><td>-</td><td>750</td><td>67.9</td><td>788</td><td>61.6</td><td><strong>90.8</strong></td><td>66.0</td><td>62.2</td><td>1920.0</td><td>78.5</td><td>80.2</td><td><u>65.9</u></td><td>49.9</td><td>-</td><td><strong>95.2</strong></td><td>-</td><td>-</td><td>3.4</td></tr>
<tr><td nowrap="nowrap" align="left">Gemini-1.5-Pro</td><td>-</td><td>-</td><td>64.4</td><td>754</td><td>57.7</td><td>81.3</td><td>64.0</td><td>59.1</td><td>2110.6</td><td>73.9</td><td>79.1</td><td>60.6</td><td>45.6</td><td>73.5</td><td>86.5</td><td>-</td><td>19.2</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td><td>-</td><td>1088</td><td>64.1</td><td>785</td><td>52.4</td><td>-</td><td>66.9</td><td>54.8</td><td>2003.4</td><td>76.0</td><td>77.8</td><td>60.0</td><td>46.1</td><td>-</td><td>-</td><td>-</td><td>-</td><td>3.3</td></tr>
<tr><td colspan="19" align="left"><strong>Open Source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Cambrian-34B</td><td>34B</td><td><u>1820</u></td><td>58.3</td><td>591</td><td>50.3</td><td>75.6</td><td>53.2</td><td>54.2</td><td>2049.9</td><td>77.8</td><td>79.5</td><td>50.4</td><td>41.6</td><td>76.7</td><td>75.5</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">GLM-4V-9B</td><td>13B</td><td>784</td><td>59.1</td><td>776</td><td>51.1</td><td>-</td><td>58.0</td><td>54.8</td><td>2018.8</td><td>67.9</td><td>71.2</td><td>46.9</td><td>45.0</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Pixtral-12B</td><td>12B</td><td>256</td><td>61.0</td><td>685</td><td>56.9</td><td>81.8</td><td>58.5</td><td>54.5</td><td>-</td><td>72.7</td><td>79.0</td><td>51.1</td><td>47.0</td><td>75.7</td><td>90.7</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td><td>27B</td><td>672</td><td>66.4</td><td>809</td><td>63.9</td><td>86.0</td><td>60.0</td><td>61.9</td><td>2253.0</td><td>81.2</td><td>83.8</td><td>54.0</td><td>45.3</td><td><u>84.2</u></td><td>93.3</td><td>-</td><td>-</td><td>3.0</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen2-VL-7B</td><td>8B</td><td>784</td><td>67.1</td><td><u>866</u></td><td>58.2</td><td>83.0</td><td>62.0</td><td>60.7</td><td>2326.0</td><td>81.8</td><td>83.0</td><td>54.1</td><td>50.6</td><td><strong>84.3</strong></td><td><u>94.5</u></td><td>31.9</td><td>16.3</td><td>3.2</td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td><td>72B</td><td>182</td><td>68.1</td><td>741</td><td>67.5</td><td>83.7</td><td>60.6</td><td><strong>65.8</strong></td><td>2261.0</td><td><strong>85.0</strong></td><td><u>85.6</u></td><td>56.8</td><td>49.0</td><td>80.5</td><td>91.3</td><td>39.1</td><td>-</td><td>3.5</td></tr>
<tr><td nowrap="nowrap" align="left">InternVL-2.5-8B</td><td>8B</td><td>706</td><td>68.3</td><td>822</td><td><u>64.4</u></td><td>84.8</td><td>62.8</td><td>62.8</td><td>2344.0</td><td><u>83.6</u></td><td>84.5</td><td>56.0</td><td>50.1</td><td>79.1</td><td>93.0</td><td>39.5</td><td>19.7</td><td>3.4</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>2822</strong></td><td>65.2</td><td>852*</td><td>60.6</td><td>79.4</td><td>60.0</td><td>57.5</td><td><u>2348.4*</u></td><td>78.0</td><td>82.1</td><td>49.8*</td><td>48.1*</td><td>80.1</td><td>90.8</td><td>25.7</td><td>18.3</td><td>3.6</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>8B</td><td><strong>2822</strong></td><td><strong>70.2</strong></td><td><strong>897*</strong></td><td><strong>71.9*</strong></td><td><u>86.9*</u></td><td><u>67.5</u></td><td><u>64.0</u></td><td><strong>2372.0*</strong></td><td>80.5</td><td><strong>85.8</strong></td><td>50.4*</td><td><u>51.9</u></td><td>82.0</td><td>93.5</td><td><u>41.4*</u></td><td><u>23.1*</u></td><td><strong>3.8</strong></td></tr>
</tbody>
</table>
</div>
\* We evaluate these benchmarks with chain-of-thought prompting; for MME, chain-of-thought is used only on the Cognition set.

<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., pixels at maximum resolution / number of visual tokens.

Note: for proprietary models, Token Density is estimated from the API pricing scheme.
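As a quick sanity check of the reported Token Density, the snippet below reproduces the value of 2822 for MiniCPM-o 2.6 from the figures quoted above (a 1344x1344 image, i.e., about 1.8 million pixels, encoded into 640 visual tokens):

```python
max_pixels = 1344 * 1344                       # maximum supported resolution (~1.8M pixels)
num_visual_tokens = 640                        # visual tokens produced at that resolution
print(round(max_pixels / num_visual_tokens))   # -> 2822
```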
**Multi-image and Video Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th><th>Size</th><th>BLINK-val</th><th>Mantis-Eval</th><th>MIRB</th><th>Video-MME (wo / w subs)</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="6" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o-20240513</td><td>-</td><td><strong>68</strong></td><td>-</td><td>-</td><td><strong>71.9/77.2</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT4V</td><td>-</td><td>54.6</td><td>62.7</td><td>53.1</td><td>59.9/63.3</td></tr>
<tr><td colspan="6" align="left"><strong>Open-source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td><td>14B</td><td>52.6</td><td>66.4</td><td>30.2</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-One-Vision-72B</td><td>72B</td><td>55.4</td><td><strong>77.6</strong></td><td>-</td><td><u>66.2/69.5</u></td></tr>
<tr><td nowrap="nowrap" align="left">MANTIS 8B</td><td>8B</td><td>49.1</td><td>59.5</td><td>34.8</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen2-VL-7B</td><td>8B</td><td>53.2</td><td>69.6*</td><td><strong>67.6*</strong></td><td>63.3/69.0</td></tr>
<tr><td nowrap="nowrap" align="left">InternVL-2.5-8B</td><td>8B</td><td>54.8</td><td>67.7</td><td>52.5</td><td>64.2/66.9</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td>53</td><td>69.1</td><td>53.8</td><td>60.9/63.6</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>8B</td><td><u>56.7</u></td><td><u>71.9</u></td><td><u>58.6</u></td><td>63.9/67.9</td></tr>
</tbody>
</table>
</div>
\* Evaluation results of the officially released open-source model weights.
</details>

<details>
<summary>Click to view detailed audio understanding and speech generation results.</summary>

**Audio Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th><th>Size</th><th colspan="3">ASR (zh)</th><th colspan="3">ASR (en)</th><th colspan="2">AST</th><th>Emotion</th>
</tr>
<tr>
<th align="left">Metric</th><td></td><th colspan="3">CER↓</th><th colspan="3">WER↓</th><th colspan="2">BLEU↑</th><th>ACC↑</th>
</tr>
<tr>
<th align="left">Dataset</th><td></td><th>AISHELL-1</th><th>Fleurs zh</th><th>WenetSpeech test-net</th><th>LibriSpeech test-clean</th><th>GigaSpeech</th><th>TED-LIUM</th><th>CoVoST en2zh</th><th>CoVoST zh2en</th><th>MELD emotion</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="11" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o-Realtime</td><td>-</td><td>7.3*</td><td><u>5.4*</u></td><td>28.9*</td><td>2.6*</td><td>12.9*</td><td>4.8*</td><td>37.1*</td><td>15.7*</td><td>33.2*</td></tr>
<tr><td nowrap="nowrap" align="left">Gemini-1.5-Pro</td><td>-</td><td>4.5*</td><td>5.9*</td><td>14.3*</td><td>2.9*</td><td>10.6*</td><td><strong>3.0*</strong></td><td><u>47.3*</u></td><td>22.6*</td><td>48.4*</td></tr>
<tr><td colspan="11" align="left"><strong>Open-Source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Qwen2-Audio</td><td>8B</td><td>-</td><td>7.5</td><td>-</td><td><strong>1.6</strong></td><td>-</td><td>-</td><td>45.2</td><td><u>24.4</u></td><td><strong>55.3</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td><td>8B</td><td>2.6*</td><td>6.9*</td><td><u>10.3*</u></td><td>3.1*</td><td><u>9.7*</u></td><td>5.9*</td><td>39.5*</td><td>22.9*</td><td>17.4*</td></tr>
<tr><td nowrap="nowrap" align="left">GLM-4-Voice-Base</td><td>9B</td><td><u>2.5</u></td><td>-</td><td>-</td><td>2.8</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>8B</td><td><strong>1.6</strong></td><td><strong>4.4</strong></td><td><strong>6.9</strong></td><td><u>1.7</u></td><td><strong>8.7</strong></td><td><strong>3.0</strong></td><td><strong>48.2</strong></td><td><strong>27.2</strong></td><td><u>52.4</u></td></tr>
</tbody>
</table>
</div>
\* Evaluation results of the officially released open-source model weights.<br><br>

**Speech Generation**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th><th>Size</th><th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th><th></th><th colspan="3">ACC↑</th><th>G-Eval (10 point)↑</th><th>Semantic ELO score↑</th><th>Acoustic ELO score↑</th><th>Overall ELO score↑</th><th>UTMOS↑</th><th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th><th></th><th>Speech Llama Q.</th><th>Speech Web Q.</th><th>Speech Trivia QA</th><th>Speech AlpacaEval</th><th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="11" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o-Realtime</td><td>-</td><td><strong>71.7</strong></td><td><strong>51.6</strong></td><td><strong>69.7</strong></td><td><strong>7.4</strong></td><td><strong>1157</strong></td><td><strong>1203</strong></td><td><strong>1200</strong></td><td><strong>4.2</strong></td><td><strong>2.3</strong></td></tr>
<tr><td colspan="11" align="left"><strong>Open-Source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GLM-4-Voice</td><td>9B</td><td>50.0</td><td>32.0</td><td>36.4</td><td><u>5.1</u></td><td>999</td><td>1147</td><td>1035</td><td><u>4.1</u></td><td><u>11.7</u></td></tr>
<tr><td nowrap="nowrap" align="left">Llama-Omni</td><td>8B</td><td>45.3</td><td>22.9</td><td>10.7</td><td>3.9</td><td>960</td><td>878</td><td>897</td><td>3.2</td><td>24.3</td></tr>
<tr><td nowrap="nowrap" align="left">Moshi</td><td>7B</td><td>43.7</td><td>23.8</td><td>16.7</td><td>2.4</td><td>871</td><td>808</td><td>875</td><td>2.8</td><td>8.2</td></tr>
<tr><td nowrap="nowrap" align="left">Mini-Omni</td><td>1B</td><td>22.0</td><td>12.8</td><td>6.9</td><td>2.5</td><td>926</td><td>803</td><td>865</td><td>3.4</td><td>10.0</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>8B</td><td><u>61.0</u></td><td><u>40.0</u></td><td><u>40.2</u></td><td><u>5.1</u></td><td><u>1088</u></td><td><u>1163</u></td><td><u>1131</u></td><td><strong>4.2</strong></td><td>9.8</td></tr>
</tbody>
</table>
</div>
All results are based on <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>

**Voice Cloning**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th><th colspan="2">TTS</th>
</tr>
<tr>
<th align="left">Metric</th><th>SIMO↑</th><th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th><th>Seed-TTS test-zh</th><th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr><td nowrap="nowrap" align="left">F5-TTS</td><td><strong>76</strong></td><td><strong>67</strong></td></tr>
<tr><td nowrap="nowrap" align="left">CosyVoice</td><td><u>75</u></td><td><u>64</u></td></tr>
<tr><td nowrap="nowrap" align="left">FireRedTTS</td><td>63</td><td>46</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>57</td><td>47</td></tr>
</tbody>
</table>
</div>
</details>

<details>
<summary>Click to view detailed multimodal streaming interaction results.</summary>

**Multimodal Streaming Interaction**: results on StreamingBench
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th><th>Size</th><th>Real-Time Video Understanding</th><th>Omni-Source Understanding</th><th>Contextual Understanding</th><th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="6" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Gemini 1.5 Pro</td><td>-</td><td><u>77.4</u></td><td><strong>67.8</strong></td><td><strong>51.1</strong></td><td><strong>70.3</strong></td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4o</td><td>-</td><td>74.5</td><td>51.0</td><td><u>48.0</u></td><td>64.1</td></tr>
<tr><td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td><td>-</td><td>74.0</td><td>41.4</td><td>37.8</td><td>59.7</td></tr>
<tr><td colspan="6" align="left"><strong>Open-source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">VILA-1.5</td><td>8B</td><td>61.5</td><td>37.5</td><td>26.7</td><td>49.5</td></tr>
<tr><td nowrap="nowrap" align="left">LongVA</td><td>7B</td><td>63.1</td><td>35.9</td><td>30.2</td><td>50.7</td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td><td>34B</td><td>69.8</td><td>41.7</td><td>34.3</td><td>56.7</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen2-VL-7B</td><td>8B</td><td>71.2</td><td>40.7</td><td>33.1</td><td>57.0</td></tr>
<tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>70.1</td><td>42.7</td><td>34.1</td><td>57.0</td></tr>
<tr><td nowrap="nowrap" align="left">VITA-1.5</td><td>8B</td><td>70.9</td><td>40.8</td><td>35.8</td><td>57.4</td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td><td>8B</td><td>74.3</td><td>40.8</td><td>31.0</td><td>58.4</td></tr>
<tr><td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td><td>8B</td><td>75.4</td><td>46.2</td><td>33.6</td><td>60.8</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td>72.4</td><td>40.2</td><td>33.4</td><td>57.7</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-o 2.6</td><td>8B</td><td><strong>79.9</strong></td><td><u>53.4</u></td><td>38.5</td><td><u>66.0</u></td></tr>
</tbody>
</table>
</details>

### Examples <!-- omit in toc -->

The following examples were recorded with MiniCPM-o 2.6 deployed on an iPad Pro.
<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="./assets/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
  <img src="./assets/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
  <img src="./assets/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>
## Usage
Inference with Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
```
Pillow==10.1.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy
```
### Model initialization
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Load the omni model by default; init_vision/init_audio/init_tts all default to True.
# To load a vision-only model, set init_audio=False and init_tts=False.
# To load an audio-only model, set init_vision=False.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',  # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Except in vision-only mode, the TTS processor and the vocos vocoder also need to be initialized.
model.init_tts()
model.tts.float()
```
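For example, a vision-only setup following the comments above could be initialized as below (a sketch under the stated assumptions; TTS initialization is skipped because no audio is generated):

```python
# Vision-only variant: skip the audio encoder and the TTS decoder.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,
    init_tts=False
)
model = model.eval().cuda()
```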
### Omni mode
We provide two inference modes: chat and streaming.

#### Chat inference
```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk per unit
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents

video_path = "/path/to/video"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# To use a voice-clone prompt, set ref_audio:
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# Set generate_audio=True and output_audio_path to save the TTS result.
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,  # please set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
#### Streaming inference
```python
# A new conversation needs reset_session() first; it resets the KV cache.
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt

    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```
### Audio-Only mode
#### Mimick
```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output.wav',  # save the TTS result to output_audio_path
)
```
#### General Speech Conversation with Configurable Voices
<details>
<summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>
```python
ref_audio, _ = librosa.load('./assets/voice_01.wav', sr=16000, mono=True)  # load the reference audio

# Audio RolePlay: in this mode, the model role-plays the character based on the audio prompt.
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# Audio Assistant: in this mode, the model speaks with the voice in ref_audio as an AI assistant.
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # Try to ask something!
```
```python
msgs = [sys_prompt, user_question]
res = model.chat(
    image=None,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    stream=False,
    stream_input=True,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

# round two
# note: list.append returns None, so extend the history in place
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)

res = model.chat(
    image=None,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    stream=False,
    stream_input=True,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```
</details>

#### Addressing various audio tasks

<details>
<summary>Click to show Python code running MiniCPM-o 2.6 with a specific audio QA task.</summary>
```python
'''
Audio Understanding Task Prompts:
Speech:
    ASR with ZH (same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
    ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
    Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
General Audio:
    Audio Caption: Summarize the main content of the audio.
    Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
'''
task_prompt = "\n"  # replace "\n" with one of the task prompts above
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    image=None,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    stream=False,
    stream_input=True,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
```python
'''
Speech Generation Task Prompts:
Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
Examples:
    # 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。
    # Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.

Voice Cloning or Voice Creation: in this mode, the model acts like a TTS model.
'''
# Human Instruction-to-Speech:
task_prompt = ''  # try writing a Human Instruction-to-Speech prompt like the examples above
msgs = [{'role': 'user', 'content': [task_prompt]}]  # you can also try asking the same audio question

# Voice Cloning mode: in this mode, the model acts like a TTS model.
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
# text_prompt = f"Please read the text below."
# user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}  # read the given text with the voice in sys_prompt (Voice Cloning)
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # read 'xxx.wav' with the voice in sys_prompt (Voice Creation)
# msgs = [sys_prompt, user_question]

res = model.chat(
    image=None,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    stream=False,
    stream_input=True,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
</details>

### Vision-Only mode

`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.

#### Chat with a single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## To use streaming, make sure sampling=True and stream=True;
## model.chat will then return a generator.
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```
#### Chat with multiple images
<details>
<summary>Click to show Python code running MiniCPM-o 2.6 with multiple images as input.</summary>
```python
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### In-context few-shot learning
<details>
<summary>Click to view Python code running MiniCPM-o 2.6 with few-shot input.</summary>
```python
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
{'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Chat with video
<details>
<summary>Click to view Python code running MiniCPM-o 2.6 with video input.</summary>
```python
from decord import VideoReader, cpu  # decord is listed in the requirements above

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>

Please refer to [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more details about usage.
## Inference with llama.cpp <a id="llamacpp"></a>

Coming soon.

## Int4 quantized version

Int4 quantized version with lower GPU memory usage (about 7 GB): [MiniCPM-o-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4).
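A minimal loading sketch, assuming the int4 checkpoint exposes the same `trust_remote_code` loading interface as the full-precision model (the repo id matches the int4 checkpoint linked above):

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical sketch: quantized weights are assumed to load as-is, without an explicit torch_dtype.
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
```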
## License
#### Model License
* The code in this repo is released under the [Apache-2.0 ](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE ) License.
* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md ](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md ).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
#### Statement
* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or abuse of the models.
## Other Multimodal Projects from Our Team
[VisCPM ](https://github.com/OpenBMB/VisCPM/tree/main ) | [RLHF-V ](https://github.com/RLHF-V/RLHF-V ) | [LLaVA-UHD ](https://github.com/thunlp/LLaVA-UHD ) | [RLAIF-V ](https://github.com/RLHF-V/RLAIF-V )
## Citation
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```