| license | pipeline_tag | library_name | base_model | base_model_relation | language | tags |
| --- | --- | --- | --- | --- | --- | --- |
| mit | image-text-to-text | transformers | OpenGVLab/InternVL2-1B, OpenGVLab/InternVL2_5-8B, OpenGVLab/InternVL2_5-4B, OpenGVLab/InternViT-300M-448px-V2_5, internlm/internlm2_5-7b-chat, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2.5-3B-Instruct | merge | | |
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[📂 GitHub]
[📜 Sa2VA paper]
[🚀 Quick Start]
Introduction
Sa2VA is a multimodal large language model (MLLM) capable of question answering, visual prompt understanding, and dense object segmentation on both images and videos. On question-answering benchmarks it performs comparably to the SOTA MLLMs Qwen2-VL and InternVL2.5, and it adds the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA also achieves SOTA performance on image and video grounding and segmentation benchmarks.
Sa2VA Family
We built the Sa2VA series based on Qwen2-VL and InternVL2/2.5. In the following table, we provide some Sa2VA models built on InternVL2.5. Other Sa2VA models will be open-sourced soon.
Sa2VA Performance
| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS | DAVIS | ReVOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sa2VA-1B | 1381/405 | 68.3 | 77.4 | 69.9 | 72.3 | 50.8 | 72.3 | 47.6 |
| Sa2VA-4B | 1536/530 | 77.3 | 78.9 | 71.7 | 74.1 | 52.1 | 73.8 | 53.2 |
| Sa2VA-8B | 1617/511 | 81.6 | 81.6 | 76.2 | 78.7 | 57.0 | 75.2 | 57.6 |
Quick Start
We provide example code to run Sa2VA using transformers.
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os
# load the model and tokenizer
path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks'] # segmentation masks, list(np.array(1, h, w), ...)
# for chat with visual prompt (mask format) input
mask_prompts = np.load('/PATH/TO/pred_masks.npy') # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # assumes file names sort in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
if len(images_paths) > 5:  # uniformly sample 5 frames, always keeping the first and last
    last = len(images_paths) - 1
    images_paths = [images_paths[round(i * last / 4)] for i in range(5)]
text_prompts = "<image>Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # assumes file names sort in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "<image>Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks'] # segmentation masks, list(np.array(n_frames, h, w), ...)
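The segmentation examples above return masks as a list of NumPy arrays, shaped (1, h, w) for an image and (n_frames, h, w) for a video. As a minimal sketch of consuming that output, the hypothetical helper below (summarize_masks is not part of Sa2VA) reports the foreground area and bounding box for each frame. It assumes the list holds one array per segmented object and that nonzero pixels mark the object, and it runs on a synthetic array rather than a real model output.

```python
import numpy as np


def summarize_masks(masks):
    """Return per-object, per-frame pixel area and bounding box.

    Assumption: `masks` is a list with one array per segmented object,
    each shaped (n_frames, h, w); any nonzero pixel counts as foreground.
    Boxes are (x_min, y_min, x_max, y_max) with inclusive pixel indices.
    """
    summary = []
    for obj_mask in masks:
        obj_mask = np.asarray(obj_mask) > 0  # normalize to a boolean mask
        frames = []
        for frame in obj_mask:
            ys, xs = np.nonzero(frame)
            if ys.size == 0:
                # The object has no foreground pixels in this frame.
                frames.append({"area": 0, "box": None})
                continue
            box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
            frames.append({"area": int(frame.sum()), "box": box})
        summary.append(frames)
    return summary


# Synthetic stand-in for return_dict['prediction_masks']: one object, two frames.
demo = np.zeros((2, 6, 8), dtype=bool)
demo[0, 1:4, 2:5] = True  # 3 rows x 3 columns of foreground in the first frame
result = summarize_masks([demo])
print(result[0][0])  # {'area': 9, 'box': (2, 1, 4, 3)}
print(result[0][1])  # {'area': 0, 'box': None}
```

With a real prediction, the same call would be summarize_masks(masks), provided the returned arrays follow the shapes noted in the comments above.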
Citation
If you find this project useful in your research, please consider citing:
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}