---
library_name: transformers
tags:
  - multi-modal
  - large-language-model
  - video-language-model
license: apache-2.0
datasets:
  - lmms-lab/LLaVA-OneVision-Data
  - allenai/pixmo-docs
  - HuggingFaceM4/Docmatix
  - lmms-lab/LLaVA-Video-178K
  - ShareGPT4Video/ShareGPT4Video
language:
  - en
metrics:
  - accuracy
pipeline_tag: visual-question-answering
base_model:
  - Qwen/Qwen2.5-7B-Instruct
  - DAMO-NLP-SG/VideoLLaMA3-7B-Image
---

VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

If you like our project, please give us a star on GitHub for the latest updates.

📰 News

🌟 Introduction

VideoLLaMA 3 represents a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.

🌎 Model Zoo

| Model | Base Model | HF Link |
| --- | --- | --- |
| VideoLLaMA3-7B (This Checkpoint) | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B) |
| VideoLLaMA3-2B | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B) |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B-Image) |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B-Image) |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for standalone use (a loading sketch follows the table):

| Model | Base Model | HF Link |
| --- | --- | --- |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | [DAMO-NLP-SG/VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) |
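
Below is a minimal sketch of using the encoder on its own, assuming the VL3-SigLIP-NaViT repository exposes its model and image processor through the transformers auto classes with trust_remote_code; the image path, the images= call signature, and the encoder's forward arguments are illustrative, so check the repository's remote code for the exact interface.

import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

encoder_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"

# Assumption: the encoder repo registers its model and image processor for the
# transformers auto classes via trust_remote_code.
image_processor = AutoImageProcessor.from_pretrained(encoder_name, trust_remote_code=True)
vision_encoder = AutoModel.from_pretrained(
    encoder_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
inputs = image_processor(images=[image], return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    visual_features = vision_encoder(**inputs)  # per-patch visual features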

🚀 Main Results

(Main benchmark results figure)
• * denotes the reproduced results.

🤖 Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "DAMO-NLP-SG/VideoLLaMA3-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; use "sdpa" if it is unavailable
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Video conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "data": {"video_path": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
            {"type": "text", "data": "What is the cat doing?"},
        ]
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
# Move tensors to the GPU and cast visual inputs to the model's bfloat16 dtype
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
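
The same model and processor can also be prompted with a single image. The sketch below mirrors the video example above; the {"type": "image", "data": {"image_path": ...}} entry and the local image path are assumptions patterned on the video message format, so verify the exact field names against the processor's remote code.

# Image conversation (sketch; reuses the model and processor loaded above)
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "data": {"image_path": "assets/example_image.jpg"}},  # hypothetical local image path
            {"type": "text", "data": "Describe this image in one sentence."},
        ]
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())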

Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}