---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
model-index:
- name: VideoChat-Flash-Qwen2-7B_res448
  results:
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 74.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 74.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Perception Test
      type: percepTest
    metrics:
    - type: accuracy
      value: 76.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 64.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (wo sub)
      type: videomme
    metrics:
    - type: accuracy
      value: 65.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LVBench
      type: lvbench
    metrics:
    - type: accuracy
      value: 48.2
      name: accuracy
      verified: true
---

# 🦜VideoChat-Flash-Qwen2-7B_res448

[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

VideoChat-Flash-7B is built on UMT-L (300M) and Qwen2-7B and uses only 16 tokens per frame. By leveraging YaRN to extend the context window to 128k (Qwen2's native context window is 32k), the model supports input sequences of up to approximately 10,000 frames.
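
As a rough back-of-the-envelope illustration (our own arithmetic, assuming 16 visual tokens per frame and ignoring text tokens), the visual-token count grows linearly with the number of sampled frames, so even a 512-frame input (the default in the usage example below) occupies only a small fraction of the extended window:

```python
# Illustrative token-budget arithmetic (assumptions: 16 visual tokens per frame,
# text tokens not counted; not an official sizing guide).
tokens_per_frame = 16
context_window = 128_000  # Qwen2 window extended via YaRN

for num_frames in (512, 4_000, 8_000):
    visual_tokens = num_frames * tokens_per_frame
    print(f"{num_frames} frames -> {visual_tokens} visual tokens "
          f"({visual_tokens / context_window:.0%} of the 128k window)")
```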

Note: Because the training corpus is predominantly English, the model has only basic Chinese comprehension; for optimal performance, we recommend interacting in English.

## 📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) | Max Input Frames |
|---|---|---|---|---|
| VideoChat-Flash-Qwen2_5-2B@448 | 70.0 | 58.3 | 57.0 | 10000 |
| VideoChat-Flash-Qwen2-7B@224 | 73.2 | 64.2 | 64.0 | 10000 |
| VideoChat-Flash-Qwen2_5-7B-1M@224 | 73.4 | 66.5 | 63.5 | 50000 |
| VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224 | 74.3 | 64.5 | 65.1 | 10000 |
| VideoChat-Flash-Qwen2-7B@448 | 74.0 | 64.7 | 65.3 | 10000 |

## 🚀 How to use the model

First, you need to install FlashAttention-2 and a few other dependencies. We provide a simple installation example below:

```
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
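
Optionally, before loading the model, you can run a quick environment sanity check (our own suggestion, not part of the official instructions):

```python
# Quick sanity check that the dependencies above imported correctly.
import torch
import transformers
import flash_attn  # fails here if flash-attn did not build properly
import decord, av, imageio, cv2  # video decoding backends

print("transformers:", transformers.__version__)     # pinned to 4.40.1 above
print("CUDA available:", torch.cuda.is_available())  # flash-attn needs a CUDA GPU
```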

Then you can use our model:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
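# trust_remote_code=True is required because the repo ships custom modeling code
# (modeling_videochat_flash.py, modeling_qwen2_flash.py); bfloat16 weights on a
# CUDA GPU are assumed here.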
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False # use the global compress or not
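# Optional LLM-stage compression of visual tokens (a memory/speed vs. accuracy trade-off);
# set mm_llm_compress = True above to enable it. Our reading of the lists below (not
# documented in this card): image tokens are progressively reduced at the listed layers,
# keeping 100%, then 75%, then 25% of them.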
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)
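# Note: with do_sample=False, decoding is greedy, so temperature and top_p are
# typically ignored; they are kept here to mirror the original example.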

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
```

## ✏️ Citation


```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```