first commit

xxl 2025-01-10 15:07:36 +08:00
parent 8df79c3b39
commit 8daf977261
13 changed files with 303353 additions and 2 deletions

README.md (143 changed lines)

@@ -1,3 +1,142 @@
---
license: mit
pipeline_tag: any-to-any
library_name: mini-omni2
---
# Mini-Omni2
<!-- <p align="center">
<img src="./data/figures/title.png" width="100%"/>
</p> -->
<p align="center">
🤗 <a href="https://huggingface.co/gpt-omni/mini-omni2">Hugging Face</a> | 📖 <a href="https://github.com/gpt-omni/mini-omni2">Github</a>
| 📑 <a href="https://arxiv.org/abs/2410.11190">Technical report</a>
</p>
Mini-Omni2 is an **omni-interactive** model. It can **understand image, audio and text inputs and hold end-to-end voice conversations with users**. It features **real-time voice output**, **omni-capable multimodal understanding**, and flexible interaction with an **interruption mechanism while speaking**.
<p align="center">
<img src="./data/figures/framework.jpeg" width="100%"/>
</p>
## Updates
- **2024.10:** Released the model, the technical report, and the inference and chat demo code.
## Features
✅ **Multimodal interaction**: understands images, speech and text, just like GPT-4o.

✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models are required, just like [Mini-Omni](https://github.com/gpt-omni/mini-omni).
<!-- ✅ **Streaming audio output**: with first-chunk latency of audio stream less than 0.3s. -->
<!-- ✅ **Duplex interaction**: hearing while speaking, it can be interrupted by key words like "stop omni". -->
## Demo
NOTE: you need to unmute the video first.
https://github.com/user-attachments/assets/ad97ca7f-f8b4-40c3-a7e8-fa54b4edf155
## ToDo
- [ ] update interruption mechanism
## Install
Create a new conda environment and install the required packages:
```sh
conda create -n omni python=3.10
conda activate omni
git clone https://github.com/gpt-omni/mini-omni2.git
cd mini-omni2
pip install -r requirements.txt
```
## Quick start
**Interactive demo**
- start server
NOTE: you need to start the server before running the Streamlit or Gradio demo, with `API_URL` set to the server address.
```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni2
python3 server.py --ip '0.0.0.0' --port 60808
```
- run streamlit demo
NOTE: you need to run streamlit **locally** with PyAudio installed.
```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```
**Local test**
```sh
conda activate omni
cd mini-omni2
# test: run the preset audio samples and questions
python inference_vision.py
```
## Mini-Omni2 Overview
**1. Multimodal Modeling**:
We use multiple sequences as the input and output of the model. On the input side, we concatenate image, audio and text features to perform a series of comprehensive tasks, as shown in the figures below. On the output side, we use text-guided delayed parallel output to generate real-time speech responses.
<p align="center">
<img src="./data/figures/inputids.png" width="100%"/>
</p>
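For intuition, the delayed parallel output can be pictured as one text stream plus several audio codebook streams, where each audio layer starts one step later than the previous one, so the already-emitted text tokens guide the audio tokens produced at the same position. The sketch below only illustrates that layout; the layer count, padding id and offsets are assumptions, not values taken from this repository.
```python
# Illustrative sketch of a text-guided delay pattern (not this repo's code).
from typing import List, Sequence


def delay_parallel_layout(text_ids: Sequence[int],
                          audio_layers: Sequence[Sequence[int]],
                          pad: int = 0) -> List[List[int]]:
    """Return rows [text, audio_layer_1, ..., audio_layer_N], right-padded to a
    common length, with audio layer k shifted k+1 steps behind the text row."""
    rows: List[List[int]] = [list(text_ids)]
    for k, layer in enumerate(audio_layers):
        rows.append([pad] * (k + 1) + list(layer))
    width = max(len(r) for r in rows)
    return [r + [pad] * (width - len(r)) for r in rows]


print(delay_parallel_layout([11, 12, 13], [[1, 2, 3], [4, 5, 6]]))
# [[11, 12, 13, 0, 0], [0, 1, 2, 3, 0], [0, 0, 4, 5, 6]]
```
At decoding time the model emits one token per row at every step, which is what allows speech to be streamed while the guiding text is still being generated.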
**2. Multi-stage Training**:
We propose an efficient alignment training method and conduct encoder adaptation, modal alignment, and multimodal fine-tuning in the three training stages, respectively.
<p align="center">
<img src="./data/figures/training.jpeg" width="100%"/>
</p>
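For intuition only, the three stages can be sketched as progressively unfreezing parameters; the module names below are placeholders for illustration, not the attribute names used in this repository.
```python
# Illustrative sketch of the three-stage recipe (placeholder module names).
def set_trainable(model, stage: int) -> None:
    for p in model.parameters():
        p.requires_grad = False                      # freeze everything first

    if stage == 1:                                   # encoder adaptation
        modules = [model.whisper_adapter, model.vision_adapter]
    elif stage == 2:                                 # modal alignment
        modules = [model.whisper_adapter, model.vision_adapter, model.llm]
    else:                                            # multimodal fine-tuning
        modules = [model]                            # train end to end

    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
```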
<!-- **3. Cases**:
Here are more cases of Mini-Omni2:
<p align="center">
<img src="./data/figures/samples.png" width="100%"/>
</p> -->
## FAQ
**1. Does the model support other languages?**
No, the model is trained only on English. However, since we use Whisper as the audio encoder, the model can understand other languages that Whisper supports (such as Chinese), but the output is English only.
**2. Error: cannot run streamlit in a local browser with a remote streamlit server**
You need to start streamlit **locally** with PyAudio installed.
## Acknowledgements
- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
- [whisper](https://github.com/openai/whisper/) for audio encoding.
- [clip](https://github.com/openai/CLIP) for image encoding.
- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.
<!-- ## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=gpt-omni/mini-omni2&type=Date)](https://star-history.com/#gpt-omni/mini-omni2&Date) -->

BIN ViT-B-32.pt (Stored with Git LFS): new binary file, not shown

BIN data/figures/framework.jpeg: new image (396 KiB), not shown

BIN data/figures/inputids.png: new image (327 KiB), not shown

BIN data/figures/samples.png (Stored with Git LFS): new binary file, not shown

BIN data/figures/title.png (Stored with Git LFS): new binary file, not shown

BIN data/figures/training.jpeg: new image (344 KiB), not shown

BIN data/omni2-demo.mp4 (Stored with Git LFS): new binary file, not shown

BIN lit_model.pth (Stored with Git LFS): new binary file, not shown

model_config.yaml (43 lines, new file)

@@ -0,0 +1,43 @@
add_qkv_bias: true
asr_adapter: llamamlp
attn_dropout: 0.0
bias: false
block_size: 2048
force_align: false
gelu_approximate: none
head_size: 64
hf_config:
  name: Qwen2-0.5B
  org: Qwen
intermediate_size: 4864
lm_head_bias: false
mlp_class_name: LLaMAMLP
n_embd: 896
n_expert: 0
n_expert_per_token: 0
n_head: 14
n_layer: 24
n_query_groups: 2
name: Qwen2-0.5B
norm_class_name: RMSNorm
norm_eps: 1.0e-06
padded_vocab_size: 181120
padding_multiple: 512
parallel_residual: false
pos_type: rope
post_adapter: false
post_adapter_layers: 6
prompt_vocab_size: null
rope_base: 1000000
rope_condense_ratio: 1
rotary_percentage: 1
scale_embeddings: false
shared_attention_norm: false
tie_word_embeddings: true
use_pretrain_phoneme_emb: false
vocab_size: 50254
text_vocab_size: 152000
cat_audio_vocab_size: 29120
audio_vocab_size: 4160
whisper_adapter_dim: 768
vision_adapter_dim: 512
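The configuration above is plain YAML, so it can be inspected with PyYAML; the snippet below is a generic sketch, not a loader from this repository.
```python
# Generic sketch: inspect model_config.yaml with PyYAML.
import yaml

with open("model_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["name"])                                            # Qwen2-0.5B
print(cfg["n_layer"], cfg["n_head"], cfg["n_embd"])           # 24 14 896
print(cfg["text_vocab_size"], cfg["audio_vocab_size"])        # 152000 4160
print(cfg["whisper_adapter_dim"], cfg["vision_adapter_dim"])  # 768 512
```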

BIN small.pt (Stored with Git LFS): new binary file, not shown

tokenizer.json (303111 lines, new file)

File diff suppressed because it is too large.

tokenizer_config.json (40 lines, new file)

@@ -0,0 +1,40 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
"bos_token": null,
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"errors": "replace",
"model_max_length": 32768,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}
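The `chat_template` above is the standard Qwen2 ChatML template. As a generic illustration (not code from this repository), it can be applied with Hugging Face `transformers` by pointing `AutoTokenizer` at the directory that holds these tokenizer files:
```python
# Generic sketch: apply the chat template defined in tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # directory with tokenizer.json + tokenizer_config.json
messages = [{"role": "user", "content": "Hello!"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant
```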