---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL2-1B
  - OpenGVLab/InternVL2_5-8B
  - OpenGVLab/InternVL2_5-4B
  - OpenGVLab/InternViT-300M-448px-V2_5
  - internlm/internlm2_5-7b-chat
  - Qwen/Qwen2-0.5B-Instruct
  - Qwen/Qwen2.5-3B-Instruct
base_model_relation: merge
language:
  - multilingual
tags:
  - Sa2VA
  - custom_code
---

# Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[📂 GitHub\]](https://github.com/magic-research/Sa2VA)
[\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
[\[🚀 Quick Start\]](#quick-start) 


## Introduction

Sa2VA is a multimodal large language model (MLLM) capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video level. On question-answering benchmarks, it performs comparably to the SOTA MLLMs Qwen2-VL and InternVL2.5. Unlike those models, Sa2VA also supports visual prompt understanding and dense object segmentation, and it achieves SOTA performance on both image and video grounding and segmentation benchmarks.

## Sa2VA Family

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The following table lists the Sa2VA models built on InternVL2 and InternVL2.5. Other Sa2VA models will be open-sourced soon.

| Model Name |                             Base MLLM                             |                                Language Part                                |                       HF Link                        |
|:----------:|:-----------------------------------------------------------------:|:---------------------------------------------------------------------------:|:----------------------------------------------------:|
|  Sa2VA-1B  | [InternVL2.0-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) |   [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-1B) |
|  Sa2VA-4B  | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) |   [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-4B) |
|  Sa2VA-8B  | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-8B) |

## Sa2VA Performance

| Model Name |   MME    | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS | DAVIS | ReVOS |
|:----------:|:--------:|:-------:|:-------:|:--------:|:--------:|:-----:|:-----:|:-----:|
|  Sa2VA-1B  | 1381/405 |  68.3   |  77.4   |   69.9   |   72.3   | 50.8  | 72.3  | 47.6  |
|  Sa2VA-4B  | 1536/530 |  77.3   |  78.9   |   71.7   |   74.1   | 52.1  | 73.8  | 53.2  |
|  Sa2VA-8B  | 1617/511 |  81.6   |  81.6   |   76.2   |   78.7   | 57.0  | 75.2  | 57.6  |


## Quick Start

We provide example code to run `Sa2VA` using `transformers`.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

# load the model and tokenizer
path = "ByteDance/Sa2VA-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer

# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(1, h, w), ...)
    
# for chat with visual prompt (mask format) input
mask_prompts = np.load('/PATH/TO/pred_masks.npy') # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer

# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
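# os.listdir returns entries in arbitrary order; sorting assumes frame file names sort in temporal order
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]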
if len(images_paths) > 5:  # uniformly sample 5 frames, always keeping the first and last
    frame_ids = np.linspace(0, len(images_paths) - 1, 5).round().astype(int)
    images_paths = [images_paths[i] for i in frame_ids]
text_prompts = "<image>Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer


# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
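# os.listdir returns entries in arbitrary order; sorting assumes frame file names sort in temporal order
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]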
text_prompts = "<image>Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(n_frames, h, w), ...)
```
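
To inspect the masks returned in `prediction_masks`, it can help to overlay them on the input image. The following is a minimal sketch, not part of the Sa2VA API: `overlay_mask` is a hypothetical helper, and it assumes each returned mask is a binary (boolean or 0/1) array of shape `(1, h, w)` or `(h, w)` at the same resolution as the image.

```python
import numpy as np
from PIL import Image


def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a solid color into `image` wherever `mask` is non-zero.

    Hypothetical helper (not provided by Sa2VA).
    image: PIL.Image in any mode (converted to RGB).
    mask:  array of shape (h, w) or (1, h, w); non-zero means "inside the object".
    """
    rgb = np.asarray(image.convert("RGB"), dtype=np.float32)
    mask = np.asarray(mask)
    if mask.ndim == 3:  # (1, h, w) -> (h, w); for video masks, pick the frame first
        mask = mask[0]
    if mask.shape != rgb.shape[:2]:
        raise ValueError(
            f"mask shape {mask.shape} does not match image size {rgb.shape[:2]}"
        )
    inside = mask > 0
    blended = rgb.copy()
    # Alpha-blend the highlight color into the masked pixels only.
    blended[inside] = (1.0 - alpha) * rgb[inside] + alpha * np.asarray(color, dtype=np.float32)
    return Image.fromarray(blended.round().astype(np.uint8))


# Tiny synthetic check: a black 4x4 image with a 2x2 masked square in the middle.
demo_image = Image.new("RGB", (4, 4), (0, 0, 0))
demo_mask = np.zeros((1, 4, 4), dtype=bool)
demo_mask[0, 1:3, 1:3] = True
demo_overlay = overlay_mask(demo_image, demo_mask)
```

With the variables from the image segmentation example above, `overlay_mask(image, masks[0]).save("overlay.png")` would write the first predicted mask as a red highlight, provided the resolution assumption holds.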

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2501.04001},
  year={2025}
}
```