---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
model-index:
- name: VideoChat-Flash-Qwen2-7B_res448
  results:
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 74.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 74.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Perception Test
      type: percepTest
    metrics:
    - type: accuracy
      value: 76.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 64.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (wo sub)
      type: videomme
    metrics:
    - type: accuracy
      value: 65.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LVBench
      type: lvbench
    metrics:
    - type: accuracy
      value: 48.2
      name: accuracy
      verified: true
---

# 🦜VideoChat-Flash-Qwen2-7B_res448⚡

[\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) [\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-Flash) [\[📜 Tech Report\]](https://www.arxiv.org/abs/2501.00574) [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash)

VideoChat-Flash-7B is built on UMT-L (300M) and Qwen2-7B and uses only **16 tokens per frame**. By leveraging YaRN to extend the context window to 128k (Qwen2's native context window is 32k), the model supports input sequences of up to approximately **10,000 frames**.

> Note: Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension; for optimal performance, we recommend interacting in English.

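If you want to confirm the extended context settings yourself, the snippet below is a minimal sketch that loads only the checkpoint's configuration and prints the context-related fields. The exact attribute names (`max_position_embeddings`, `rope_scaling`) are assumptions about the remote code shipped with this checkpoint; fields that are absent simply print as `None`.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect context-length settings.
cfg = AutoConfig.from_pretrained(
    "OpenGVLab/VideoChat-Flash-Qwen2-7B_res448", trust_remote_code=True
)

# Attribute names are assumptions; missing fields print as None.
print("max_position_embeddings:", getattr(cfg, "max_position_embeddings", None))
print("rope_scaling (YaRN):", getattr(cfg, "rope_scaling", None))
```
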
## 📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) | Max Input Frames |
| --- | --- | --- | --- | --- |
| [VideoChat-Flash-Qwen2_5-2B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) | 70.0 | 58.3 | 57.0 | 10000 |
| [VideoChat-Flash-Qwen2-7B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res224) | 73.2 | 64.2 | 64.0 | 10000 |
| [VideoChat-Flash-Qwen2_5-7B-1M@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224) | 73.4 | **66.5** | 63.5 | 50000 |
| [VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) | **74.3** | 64.5 | 65.1 | 10000 |
| [VideoChat-Flash-Qwen2-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) | 74.0 | 64.7 | **65.3** | 10000 |

## 🚀 How to use the model

First, install [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. A minimal installation example is shown below:

```bash
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
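
Optionally, you can sanity-check the environment before loading the model. The snippet below is our own quick check, not part of the official instructions; it only confirms that the packages above import correctly and that a CUDA device is visible.

```python
import torch

# Confirm that the dependencies installed above can be imported.
for pkg in ("transformers", "av", "imageio", "decord", "cv2", "flash_attn"):
    try:
        __import__(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")

# The usage example below moves the model to a GPU, so CUDA should be available.
print("CUDA available:", torch.cuda.is_available())
```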

Then you can use the model as follows:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # use the global compress or not
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
```
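
`model.chat` samples up to `max_num_frames` frames from the input video. If you want to scale that budget to the length of a specific clip, a small helper like the one below can read the frame count and FPS with `decord` (installed above). The helper name and the one-frame-per-second heuristic are our own illustration, not part of the model's API.

```python
from decord import VideoReader

def suggest_max_num_frames(video_path, frames_per_second=1.0, cap=512):
    """Suggest a frame budget: roughly `frames_per_second` sampled frames per
    second of video, capped at `cap`. Heuristic for illustration only."""
    vr = VideoReader(video_path)
    duration_s = len(vr) / vr.get_avg_fps()
    return max(1, min(cap, int(duration_s * frames_per_second)))

# Example: pass the suggestion to model.chat via max_num_frames.
max_num_frames = suggest_max_num_frames("your_video.mp4")
```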
## ✏️ Citation

```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```